I'm pleased to announce the availability of version 0.5 of text, a library that provides fast Unicode text handling for Haskell.
This version contains numerous changes compared to version 0.4, in three broad categories:
I improved the performance of some common functions, in many cases by more than 10x.
I substantially refined the API.
I fixed many bugs, almost all of them subtle, thanks to the ever-astounding QuickCheck library.
In real-world code, many of the most frequently performed text manipulation operations involve searching for substrings. I've written a fast search function, based on Fredrik Lundh's very elegant adaptation of the Boyer-Moore-Horspool-Sunday algorithm, and rewritten all of the functions that perform searches internally to use it (without changing any of the APIs in question, of course).
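To give a flavour of how this family of algorithms works, here is a much-simplified Horspool-style search over String. This is a sketch for illustration only, not the package's implementation, which works over packed arrays and includes the Sunday refinement; the name indexOf is mine.

```haskell
import qualified Data.Map as M

-- Find the index of the first occurrence of the needle in the haystack.
-- Horspool's idea: on a mismatch, look at the character under the *end*
-- of the current window and shift the window far enough that the
-- rightmost matching needle character (if any) lines up with it.
indexOf :: String -> String -> Maybe Int
indexOf [] _ = Just 0
indexOf needle haystack = go 0
  where
    nlen = length needle
    hlen = length haystack
    -- For each needle character except the last, record how far we may
    -- shift when that character appears under the end of the window.
    -- fromList keeps the last entry per key, i.e. the rightmost
    -- occurrence, which gives the smallest (safest) shift.
    shifts = M.fromList (zip (init needle) [nlen - 1, nlen - 2 .. 1])
    go i
      | i + nlen > hlen  = Nothing
      | window == needle = Just i
      | otherwise        = go (i + shift)
      where
        window = take nlen (drop i haystack)
        lastc  = haystack !! (i + nlen - 1)
        shift  = M.findWithDefault nlen lastc shifts
```

The appeal of the approach is that for needles that share few characters with the haystack, the window usually advances by the full needle length per step.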
Retooling with this function has made a big difference to performance. I benchmarked old-style searches, searches over the popular ByteString type, and new-style searches.
For this data set, the new search code is 10.4 times faster than the old, and 24% faster than the ByteString search code.
If we perform a search for a short snippet of DNA in a large data set, we see a similar pattern:
Over the DNA data set, the new search code is 15.3 times faster than the old, and 85% faster than the ByteString search code.
By the way, it was writing—and wanting to measure the performance of—this new search function that led me to create the criterion benchmarking package, which in turn begat the statistics package. What a month of hacking it's been!
While the API of the text package still somewhat resembles the venerable Data.List API (which is now almost two decades old), I have modified it to be both narrower and more useful for text processing tasks.
All of the integer-index-based search functions are now gone, since using integer offsets into Unicode strings is a bad idea (especially when the underlying representation uses a variable-length encoding).
Instead, the search functions break up the original string. Consider the new
find function, for instance:
find :: Text -> Text -> (Text, [(Text, Text)])
This function returns all non-overlapping instances of
needle in
haystack. The first element of the returned pair is the prefix of
haystack prior to any matches of
needle. The second is a list of pairs.
The first element of each pair in the list is a span from the beginning of a match to the beginning of the next match, while the second is a span from the beginning of the match to the end of the input.
find "::" ""
==> ("", [])
find "/" "a/b/c/d"
==> ("a", [("/b","/b/c/d"), ("/c","/c/d"), ("/d","/d")])
Notice that no matter how much of the result list you consume, you can always reconstruct the entire original
haystack from the elements you've already seen.
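To make that reconstruction property concrete, here is a sketch over String, using the result value from the example above. The helper names are mine, not the package's.

```haskell
-- The result of find for "/" on "a/b/c/d", transcribed from the example.
example :: (String, [(String, String)])
example = ("a", [("/b","/b/c/d"), ("/c","/c/d"), ("/d","/d")])

-- Rebuild the whole haystack from the prefix and the first spans.
rebuild :: (String, [(String, String)]) -> String
rebuild (pre, pairs) = pre ++ concatMap fst pairs

-- Rebuild after consuming only the first k pairs (1 <= k <= length
-- pairs): the second span of the last pair seen covers the rest of
-- the input.
rebuildAfter :: Int -> (String, [(String, String)]) -> String
rebuildAfter k (pre, pairs) =
  pre ++ concatMap fst (take (k - 1) pairs) ++ snd (pairs !! (k - 1))
```

Both functions give back "a/b/c/d" for any legal k, which is exactly the property described above.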
More often, you won't want something nearly this general. Instead, you may want a function that simply splits a string on a separator:
split :: Text -> Text -> [Text]
Here are a few examples of how the
split function works:
split "\r\n" "a\r\nb\r\nd\r\ne"
==> ["a","b","d","e"]
split "aaa" "aaaXaaaXaaaXaaa"
==> ["","X","X","X",""]
split "x" "x"
==> ["",""]
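To pin down these semantics, here is a reference implementation over String. It assumes a non-empty separator, and the names splitOn' and breakOn' are mine; the real function works on Text and uses the fast search described earlier.

```haskell
import Data.List (isPrefixOf)

-- Split a string on every occurrence of a non-empty separator.
splitOn' :: String -> String -> [String]
splitOn' sep = go
  where
    go s = case breakOn' sep s of
      (chunk, Nothing)   -> [chunk]
      (chunk, Just rest) -> chunk : go rest

-- Break at the first occurrence of the pattern, returning the text
-- before it and, if the pattern was found, the text after it.
breakOn' :: String -> String -> (String, Maybe String)
breakOn' pat = walk []
  where
    walk acc s
      | pat `isPrefixOf` s = (reverse acc, Just (drop (length pat) s))
    walk acc []            = (reverse acc, Nothing)
    walk acc (c:cs)        = walk (c : acc) cs
```

Note that adjacent and leading/trailing separators produce empty chunks, which is what makes split invertible via intercalate.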
Some of the functions supplied by the list API are not very useful in the context of a text-oriented API. Consider
break :: (a -> Bool) -> [a] -> ([a],[a])
I use this name in the text package, but give the function a simpler and more useful type signature (its behaviour is otherwise the same):
break :: Text -> Text -> (Text, Text)
If, for some reason, you still want the predicate-based functions, they're usually available, but (in most cases) with a "
By" suffix added, to provide uniform naming:
breakBy :: (Char -> Bool) -> Text -> (Text, Text)
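The two signatures can be modelled over String as follows. These are sketches with names of my own choosing; the real functions operate on Text.

```haskell
import Data.Char (isSpace)
import Data.List (isPrefixOf)

-- Model of the substring-based break: everything before the first
-- occurrence of the needle, then the rest starting at the needle.
breakOnSub :: String -> String -> (String, String)
breakOnSub pat = walk []
  where
    walk acc s
      | pat `isPrefixOf` s = (reverse acc, s)
    walk acc []            = (reverse acc, [])
    walk acc (c:cs)        = walk (c : acc) cs

-- The predicate-based variant has exactly the shape of Data.List.break,
-- specialised to Char; over String the two coincide.
breakBy' :: (Char -> Bool) -> String -> (String, String)
breakBy' = break
```

So breakOnSub "::" "key::value" yields ("key", "::value"), while breakBy' isSpace "hello world" yields ("hello", " world").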
Testing and bug fixes
Even though the text package already had a very high degree of test coverage, I still found perhaps a dozen bugs while working on this release. The bugs in question were almost without exception quite subtle, and it would have been tremendously difficult to reproduce them, or to fix them with confidence, without the assistance of the QuickCheck library.
By the way, even though I've been using it for years, I still find myself feeling very strongly that the QuickCheck library is simply amazing. If unit testing is like riding a tricycle with wobbly wheels, working with QuickCheck more closely resembles flying a sleek starship. Wow.
At the moment, the profile of code covered by QuickCheck tests looks like this:
96% of public top-level definitions
85% of conditional expressions
88% of all expressions
(In case you're wondering, these are higher levels of test coverage than those achieved by the widely used bytestring package, so you should have high confidence that the code in the text package is pretty solid.)