I've released version 0.3 of the Haskell text package to Hackage. It includes the new error handling API that I wrote about the other day, along with proper support for case conversion.

What is "proper support for case conversion"? Correctly converting the case of a single Unicode code point can yield one, two, or three code points. You can't just call toUpper on each character in a string, as long-time monolingual C and Python programmers are likely to expect. Here are a few examples of the several hundred non-obvious rules to follow:

  • If converting to lower case, the Latin capital letter I with dot above (U+0130) maps to the bigram "Latin small letter i" followed by "combining dot above" (U+0069 U+0307).

  • In a conversion to upper case, the German "eszett" ß (U+00DF) maps to the bigram SS. (Try u'ß'.upper() in Python for fun.)

  • There also exists "folded" case, which is used for caseless (i.e. case insensitive) comparison. For instance, the Armenian small ligature men now (U+FB13) is case folded to the bigram men now (U+0574 U+0576), while the micro sign (U+00B5) is case folded to the Greek small letter mu (U+03BC) instead of itself.
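
To make the rules above concrete, here is a small sketch that contrasts a naive per-character conversion with whole-string conversion. It assumes a reasonably recent release of text that exports toUpper, toLower and toCaseFold from Data.Text. The helpers naiveUpper and codePoints are names made up for this illustration and are not part of the package.

```haskell
module Main where

import qualified Data.Char as C
import qualified Data.Text as T
import Numeric (showHex)

-- Hypothetical helper: upper-case one Char at a time. Each Char can only
-- map to a single Char, so one-to-many rules such as ß -> SS are lost.
naiveUpper :: T.Text -> T.Text
naiveUpper = T.map C.toUpper

-- Hypothetical helper: show a Text as a list of code points ("U+00DF").
codePoints :: T.Text -> [String]
codePoints = map fmt . T.unpack
  where
    fmt c = "U+" ++ pad (map C.toUpper (showHex (C.ord c) ""))
    pad s = replicate (4 - length s) '0' ++ s

main :: IO ()
main = do
  let eszett    = T.singleton '\x00DF'  -- German eszett
      dottedI   = T.singleton '\x0130'  -- Latin capital I with dot above
      menNow    = T.singleton '\xFB13'  -- Armenian small ligature men now
      microSign = T.singleton '\x00B5'  -- micro sign
  print (codePoints (naiveUpper eszett))       -- ["U+00DF"], unchanged
  print (codePoints (T.toUpper eszett))        -- ["U+0053","U+0053"], i.e. SS
  print (codePoints (T.toLower dottedI))       -- ["U+0069","U+0307"]
  print (codePoints (T.toCaseFold menNow))     -- ["U+0574","U+0576"]
  print (codePoints (T.toCaseFold microSign))  -- ["U+03BC"]
```

The whole-string functions can change the length of the text, which is exactly what a Char-to-Char mapping cannot express.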

To make matters more fun, case conversion for a few languages of the Eastern Mediterranean requires knowledge of the current locale, and I've chosen to skip that for now. Patches from Turkish and Azeri Haskell hackers are welcome.

Case conversion might seem like an odd thing to focus on. To be considered thorough, I think that a good Unicode implementation needs support for at least the following three "big ticket" items:

  • Normalization

  • Case conversion

  • Collation

Of these three, case conversion is by far the easiest, so I hit it first with a total of a few hours of work.

8 Responses to “Case conversion and text 0.3”

  1. on 07 Jun 2009 at 18:18, bla

    Why do you convert ß to SS not to capital ß?
    http://en.wikipedia.org/wiki/Capital_%C3%9F

  2. on 08 Jun 2009 at 00:48, Bryan O'Sullivan

    Because ẞ is not widely used in German, and to use it as a capital ß would violate the guidelines set in the Unicode SpecialCasing table.

  3. on 08 Jun 2009 at 14:31, Simon Michael

    I really appreciate what you are doing here, more power to your hacking elbow.

  4. on 09 Jun 2009 at 02:21, Porges

    Are you implementing this all yourself in pure Haskell? Why not hook into ICU or something similar, which would provide a proven-correct implementation?

  5. on 10 Jun 2009 at 01:07, Bryan O'Sullivan

    Porges, where I can, I implement the code in pure Haskell. I’ve written a separate text-icu library that provides bindings to ICU for code that is currently just too much trouble.

  6. on 10 Jun 2009 at 07:10, Porges

    Perhaps we should really have some kind of way to generate the code from the CLDR XML data, so that when that is updated we can update the Haskell library without too much hassle.

  7. on 10 Jun 2009 at 22:08, Pseudonym

    As interesting as this is, I wonder whether or not this is something that the Haskell community should be maintaining. Wouldn’t a binding for, say, ICU be more appropriate?

  8. on 11 Jun 2009 at 12:45, Bryan O'Sullivan

    Pseudonym, there’s already a text-icu package that contains ICU bindings. My goal is to write as much Unicode handling code in pure Haskell as possible, and to leave the complex and fugly stuff to the ICU bindings. That way, if you have fairly simple needs, your number of dependencies is kept low. Also, crossing back and forth between Haskell and C++ is very expensive (due to the different representations used for text), so calling into ICU shouldn’t be done often.
