Ersin Er wrote a brief blog post about handling the Turkish language in Haskell. Because Turkish uses a character set that mostly looks familiar to Westerners, it is notorious for its ability to trip up the unwary programmer (see examples in PHP and PostgreSQL).
His example is quite nice, but we can write a more compact version of his code using a few handy features of the text and text-icu packages:
- In the text-icu library, we use the LocaleName type to describe the locale in which we want a function to operate. This type is an instance of the IsString class, so if we enable the OverloadedStrings language feature, we can write plain "tr-TR" to specify a Turkish locale.
- The Text type is also an instance of the IsString class, so we can write a literal string like "foo" and the compiler will infer the correct type for it.
- The Data.Text.IO module contains functions for performing locale-sensitive I/O using Text values.
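To make the IsString mechanism concrete, here is a minimal sketch that needs only the text package. LocaleTag and describe are hypothetical names invented for this illustration; LocaleTag stands in for a locale type and is not text-icu's actual LocaleName definition.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.String (IsString (..))
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as T

-- Hypothetical stand-in for a locale type; not text-icu's LocaleName.
newtype LocaleTag = LocaleTag String
  deriving (Show)

-- With this instance and OverloadedStrings enabled, a string literal
-- can appear wherever a LocaleTag is expected.
instance IsString LocaleTag where
  fromString = LocaleTag

-- Hypothetical helper that takes both a locale and some text.
describe :: LocaleTag -> Text -> Text
describe (LocaleTag name) body = T.concat [T.pack name, ": ", body]

main :: IO ()
main =
  -- Both literals are typed by inference: the first as LocaleTag,
  -- the second as Text. No explicit constructor or pack is needed.
  T.putStrLn (describe "tr-TR" "merhaba")
```

The same literal syntax produces two different types here, which is the property the compact program relies on.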
This combination of features lets us write a less cluttered program, following the dictum that simple things should be simple:
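A minimal sketch of a program in this style might look like the following. It assumes the text-icu package is installed and that Data.Text.ICU exports toLower and toUpper, each taking a LocaleName followed by a Text. The variable names are illustrative only.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text.ICU (toLower, toUpper)
import qualified Data.Text.IO as T

main :: IO ()
main = do
  -- No annotations here: the uses below fix "tr-TR" as a LocaleName
  -- and the two sample strings as Text.
  let trLocale = "tr-TR"
      upStr = "ÇIİĞÖŞÜ"
      lowStr = "çıiğöşü"
  -- Locale-aware case conversion, printed without going through String.
  T.putStrLn (toLower trLocale upStr)
  T.putStrLn (toUpper trLocale lowStr)
```

The source file should be saved as UTF-8 so that GHC reads the Turkish literals correctly.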
I've intentionally kept the number of lines the same to preserve clarity, but there are a few advantages to the rewrite:
- Less clutter, more speed: we don't need to explicitly pack or unpack Text values to or from String values.
- Performance: we're not performing I/O on String values. This would be a big deal if we were writing a real application: I/O with Text is much faster than with String.
- Putting inference to work: the compiler correctly infers the type of "tr-TR" to be a LocaleName, and of the strings at the end to be Text, so we don't need to be so explicit.
Oh, and we still give the right answer (look carefully at upper and lower case dotted and dotless "I"):
toLower ÇIİĞÖŞÜ gives çıiğöşü
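For contrast, the toLower in Data.Text takes no locale argument, so it cannot apply the Turkish rule for the capital "I". The short sketch below needs only the text package and checks that case:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T

main :: IO ()
main = do
  -- Without a locale, capital I lowercases to the dotted i ...
  print (T.toLower "I" == "i") -- True
  -- ... and not to the dotless ı that Turkish expects.
  print (T.toLower "I" == "ı") -- False
```

This is why the locale-aware functions in text-icu are needed to get the result shown above.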
The full documentation for the text and text-icu libraries is a little difficult to read on Hackage (in fact, the text-icu API docs are completely missing), so here are links:

I’m glad that work like this is going on. One of the things I use programming for is linguistics, and although I’m having fun with Haskell, the earlier lack of sophisticated Unicode functionality was bugging me. In fact, I upgraded to GHC 6.12 purely to get better handling of Unicode I/O from the standard I/O functions.
Until your recent posts about the text packages (which I found through the Planet Haskell blog aggregator), I’d been hunting for a while for a way to do things like Unicode normalisation. It’s just so important now for any major programming language to have good Unicode support.
I hope this will be part of the next Haskell Platform release in January.