A tiny example of clean Unicode handling in Haskell

Posted on 2010-09-27 by Bryan O'Sullivan

Ersin Er wrote a brief blog post about handling the Turkish language in Haskell. Because Turkish uses a character set that mostly looks familiar to Westerners, it is notorious for its ability to trip up the unwary programmer (see examples in PHP and PostgreSQL).

import Data.Text (pack, unpack)
import Data.Text.ICU (LocaleName(Locale), toLower)

main = do
  let trLocale = Locale "tr-TR"
  let upStr = "ÇIİĞÖŞÜ"
  let lowStr = unpack $ toLower trLocale $ pack upStr
  putStrLn ("toLower " ++ upStr ++ " gives " ++ lowStr)

His example is quite nice, but we can write a more compact version of his code using a few handy features of the text and text-icu packages:

In the text-icu library, we use the LocaleName type to describe the locale in which we want a function to operate. This type is an instance of the IsString class, so if we enable the OverloadedStrings language feature, we can write plain "tr-TR" to specify a Turkish locale.
The Text type is also an instance of the IsString class, so we can write a literal string like "foo" and the compiler will infer the correct type for it.
The Data.Text.IO module contains functions for performing locale-sensitive I/O using Text values.
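
To see how the first two points fit together without installing text-icu, here is a minimal sketch that depends only on base and text. The names LocaleTag and describe are hypothetical, invented for this illustration; LocaleTag merely mimics the way LocaleName accepts string literals and is not part of any library.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.String (IsString (..))
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Hypothetical stand-in for a locale type (NOT text-icu's LocaleName).
newtype LocaleTag = LocaleTag String
  deriving (Eq, Show)

-- With this instance and OverloadedStrings, a literal such as "tr-TR"
-- can be used wherever a LocaleTag is expected.
instance IsString LocaleTag where
  fromString = LocaleTag

-- Hypothetical helper: prefixes a Text value with the locale tag.
-- The literal ": " is inferred to be Text.
describe :: LocaleTag -> T.Text -> T.Text
describe (LocaleTag tag) txt = T.concat [T.pack tag, ": ", txt]

main :: IO ()
main =
  -- The first literal is inferred as LocaleTag, the second as Text.
  TIO.putStrLn (describe "tr-TR" "merhaba")
```

The same mechanism is what allows a bare "tr-TR" to stand in for a LocaleName when calling text-icu functions.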

This combination of features lets us write a less cluttered program, following the dictum that simple things should be simple:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text.IO as T
import qualified Data.Text.ICU as T (toLower)

main = do
  let upper = "ÇIİĞÖŞÜ"
      lower = T.toLower "tr-TR" upper
  mapM_ T.putStr ["toLower ", upper, " gives ", lower, "\n"]

I've intentionally kept the number of lines the same to preserve clarity, but there are a few advantages to the rewrite:

Less clutter, more speed: we don't need to explicitly pack or unpack Text values to or from String values.
Performance: we're not performing I/O on String values. This would be a big deal if we were writing a real application: I/O with Text is much faster than with String.
Putting inference to work: the compiler correctly infers the type of "tr-TR" to be a LocaleName, and of the strings at the end to be Text, so we don't need to be so explicit.

Oh, and we still give the right answer (look carefully at upper and lower case dotted and dotless "I"):

toLower ÇIİĞÖŞÜ gives çıiğöşü
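
For contrast, here is a small sketch of what happens without a locale, using only the toLower from Data.Text in the text package. As far as I know, that function applies the default Unicode case mapping, so a plain "I" becomes a dotted "i", and "İ" (U+0130) becomes "i" followed by a combining dot above (U+0307); neither is what a Turkish reader expects. The helper names codePoints, plainLower and dottedCapitalI are made up for this example, and it prints code points rather than text so the result does not depend on the terminal's encoding.

```haskell
import qualified Data.Text as T

-- U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE.
dottedCapitalI :: T.Text
dottedCapitalI = T.singleton '\x0130'

-- Locale-independent lowercasing from the text package.
plainLower :: T.Text -> T.Text
plainLower = T.toLower

-- Hypothetical helper: the numeric code point of each character.
codePoints :: T.Text -> [Int]
codePoints = map fromEnum . T.unpack

main :: IO ()
main = do
  -- In Turkish, "I" should lower-case to dotless U+0131,
  -- and U+0130 should lower-case to a plain "i".
  print (codePoints (plainLower (T.pack "I")))
  print (codePoints (plainLower dottedCapitalI))
```

Comparing these code points with the Turkish result above shows why the locale argument to the text-icu function matters.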

The full documentation for the text and text-icu libraries is a little difficult to read on Hackage (in fact, the text-icu API docs are completely missing), so here are links:

Posted in haskell, open source

2 comments on “A tiny example of clean Unicode handling in Haskell”

chrisdb says:

2010-09-27 at 22:22

I’m glad that work like this is going on. One of the things I use programming for is linguistics, and despite the fact I’m having fun with Haskell, the previous lack of sophisticated unicode functionality was bugging me. In fact, I upgraded to GHC 6.12 purely to get better handling of unicode I/O from the standard I/O functions.

Until your recent posts about text packages (which I got to from the planet haskell blog aggregator) I’d been hunting for a way to do things like normalisation of unicode for a while. It’s just so important now for any major programming language to have good unicode support.

pl says:

2010-09-28 at 16:13

I hope this will be part of the next Haskell Platform release in January.
