
I've spent a few nights improving the Haskell text library's resilience in the face of bad input. In version 0.2, the library simply calls error if presented with invalid input. This is clearly not adequate, since there's no shortage of data out there that either contains coding errors or is misidentified (e.g. ISO-8859-1 served up as UTF-8).

The new API remains substantially the same as before. If you invoke decodeUtf8 with bad input, it now throws a UnicodeException instead of the less helpful exception raised by error. I've also introduced a new decodeUtf8With function, which takes an error handler as its first parameter. This handler is an ordinary Haskell function. It can do one of several things:

  • Call error or throw an exception.

  • Replace the bogus input with something else, e.g. the Unicode replacement character U+FFFD.

  • Skip over the bad input.

The new Data.Text.Encoding.Error module predefines a few useful handler functions. I've extended the other decoding APIs to behave similarly. In my benchmarks, parameterised error handling imposes no measurable runtime cost.
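To make the three handler behaviours concrete, here is a minimal sketch. It assumes the API as it appears in current releases of the text package (decodeUtf8With, plus strictDecode, lenientDecode and ignore from Data.Text.Encoding.Error, with a handler of type OnDecodeError); the names in the unreleased code described here may differ. The questionMark handler and the sample bytes are made up for illustration.

```haskell
module Main (main) where

import Control.Exception (evaluate, try)
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error
  (OnDecodeError, UnicodeException, ignore, lenientDecode, strictDecode)

-- "cafe!" with the 'e' encoded as the ISO-8859-1 byte 0xE9.
-- Followed by 0x21, that byte is not valid UTF-8.
misidentified :: B.ByteString
misidentified = B.pack [0x63, 0x61, 0x66, 0xE9, 0x21]

-- Hypothetical custom handler: substitute '?' for each bad byte.
-- The arguments are a description of the error and the offending byte, if any.
questionMark :: OnDecodeError
questionMark _description _badByte = Just '?'

main :: IO ()
main = do
  -- Replace bad input with U+FFFD.
  print (decodeUtf8With lenientDecode misidentified)
  -- Skip over bad input.
  print (decodeUtf8With ignore misidentified)
  -- Replace bad input with a character of our choosing.
  print (decodeUtf8With questionMark misidentified)
  -- Throw a UnicodeException, caught here so the program keeps running.
  result <- try (evaluate (decodeUtf8With strictDecode misidentified))
  case result of
    Left err   -> putStrLn ("decode failed: " ++ show (err :: UnicodeException))
    Right text -> putStrLn ("decoded: " ++ T.unpack text)
```

Because a handler is just a function returning Maybe Char, skipping the bad input and replacing it are the same mechanism: return Nothing to drop the bad byte, or Just a character to substitute for it.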

I haven't released the updated code yet; I'll try to publish it within the next few days.

One Response to “Dealing with encoding errors in Data.Text”

  1. [...] released version 0.3 of the Haskell text package to Hackage. It supports the new error handling API that I wrote about the other day, along with proper support for case [...]
