Case conversion and text 0.3
June 7th, 2009 by Bryan O'Sullivan
I've released version 0.3 of the Haskell text package to Hackage. It supports the new error handling API that I wrote about the other day, along with proper support for case conversion.
What is "proper support for case conversion"? Correctly converting the case of a single Unicode code point can yield one, two, or three code points. You can't just call toUpper on each character in a string, as long-time monolingual C and Python programmers are likely to expect. Here are a few examples of the several hundred non-obvious rules to follow:
If converting to lower case, the Latin capital letter I with dot above (U+0130) maps to the bigram "Latin small letter i" followed by "combining dot above" (U+0069 U+0307).
In a conversion to upper case, the German "eszett" ß (U+00DF) maps to the bigram SS. (Try u'ß'.upper() in Python for fun.)
There also exists "folded" case, which is used for caseless (i.e. case insensitive) comparison. For instance, the Armenian small ligature men now (U+FB13) is case folded to the bigram men now (U+0574 U+0576), while the micro sign (U+00B5) is case folded to the Greek small letter mu (U+03BC) instead of itself.
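The examples above can be checked with a short program. This is a minimal sketch, assuming a reasonably recent version of the text package that exports toUpper, toLower and toCaseFold from Data.Text; the helpers naiveUpper and caselessEq are illustrative names, not part of the library.

```haskell
import qualified Data.Char as C
import qualified Data.Text as T

-- Naive approach (illustrative helper): apply Data.Char.toUpper to each
-- character. It can only ever produce one code point per input code point.
naiveUpper :: T.Text -> T.Text
naiveUpper = T.map C.toUpper

-- Caseless equality (illustrative helper): compare the case-folded forms.
caselessEq :: T.Text -> T.Text -> Bool
caselessEq a b = T.toCaseFold a == T.toCaseFold b

main :: IO ()
main = do
  let eszett  = T.singleton '\xdf'    -- ß
      dottedI = T.singleton '\x130'   -- Latin capital letter I with dot above
      menNow  = T.singleton '\xfb13'  -- Armenian small ligature men now
      micro   = T.singleton '\xb5'    -- micro sign
      mu      = T.singleton '\x3bc'   -- Greek small letter mu
  -- The per-character mapping leaves ß unchanged.
  print (naiveUpper eszett == eszett)                    -- True
  -- Full case conversion expands ß to two code points.
  print (T.toUpper eszett)                               -- "SS"
  -- U+0130 lowers to "i" followed by combining dot above.
  print (T.toLower dottedI == T.pack "i\x307")           -- True
  -- The ligature folds to the two letters men and now.
  print (T.toCaseFold menNow == T.pack "\x574\x576")     -- True
  -- The micro sign and Greek mu compare equal once folded.
  print (caselessEq micro mu)                            -- True
```

Comparing folded forms, rather than lower-casing both sides, is the usual way to get a case-insensitive match that also covers characters like the micro sign.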
To make matters more fun, case conversion for a few languages of the Eastern Mediterranean requires knowledge of the current locale, and I've chosen to skip that for now. Patches from Turkish and Azeri Haskell hackers are welcome.
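To give a flavour of what the locale-sensitive part involves: in Turkish and Azeri, the dotless ı (U+0131) and the dotted İ (U+0130) are separate letters, so I lowers to ı and i uppers to İ. The following is only a rough sketch of one way to layer that on top of the locale-independent functions. The names toLowerTr and toUpperTr are hypothetical and not part of the text package, and the sketch ignores the extra rule about a combining dot above following a capital I.

```haskell
import qualified Data.Text as T

-- Hypothetical helper: lower-case text using Turkish/Azeri rules for I.
-- The two special letters are remapped first, then the ordinary
-- locale-independent conversion handles everything else.
toLowerTr :: T.Text -> T.Text
toLowerTr = T.toLower . T.map fixup
  where
    fixup 'I'      = '\x131'  -- capital I lowers to dotless ı
    fixup '\x130'  = 'i'      -- dotted capital İ lowers to plain i
    fixup c        = c

-- Hypothetical helper: upper-case text using Turkish/Azeri rules for i.
toUpperTr :: T.Text -> T.Text
toUpperTr = T.toUpper . T.map fixup
  where
    fixup 'i'      = '\x130'  -- plain i uppers to dotted capital İ
    fixup '\x131'  = 'I'      -- dotless ı uppers to plain capital I
    fixup c        = c

main :: IO ()
main = do
  -- DIYARBAKIR lowers to "diyarbak" followed by dotless ı and "r".
  print (toLowerTr (T.pack "DIYARBAKIR") == T.pack "diyarbak\x131r")
  -- Round-tripping through both helpers restores the original.
  print (toUpperTr (toLowerTr (T.pack "DIYARBAKIR")) == T.pack "DIYARBAKIR")
```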
Case conversion might seem like an odd thing to focus on. To be considered thorough, I think that a good Unicode implementation needs support for at least the following three "big ticket" items:
Normalization
Case conversion
Collation
Of these three, case conversion is by far the easiest, so I hit it first with a total of a few hours of work.

Why do you convert ß to SS rather than to capital ß?
http://en.wikipedia.org/wiki/Capital_%C3%9F
Because ẞ is not widely used in German, and to use it as a capital ß would violate the guidelines set in the Unicode SpecialCasing table.
I really appreciate what you are doing here, more power to your hacking elbow.
Are you implementing this all yourself in pure Haskell? Why not hook into ICU or something similar, which would provide a proven-correct implementation?
Porges, where I can, I implement the code in pure Haskell. I’ve written a separate text-icu library that provides bindings to ICU for code that is currently just too much trouble.
Perhaps we should really have some kind of way to generate the code from the CLDR XML data, so that when that is updated we can update the Haskell library without too much hassle.
As interesting as this is, I wonder whether or not this is something that the Haskell community should be maintaining. Wouldn’t a binding for, say, ICU be more appropriate?
Pseudonym, there’s already a text-icu package that contains ICU bindings. My goal is to write as much Unicode handling code in pure Haskell as possible, and to leave the complex and fugly stuff to the ICU bindings. That way, if you have fairly simple needs, your number of dependencies is kept low. Also, crossing back and forth between Haskell and C++ is very expensive (due to the different representations used for text), so calling into ICU shouldn’t be done often.