I'm pleased to announce the availability of version 0.5 of text, a library that provides fast Unicode text handling for Haskell.
This version contains numerous changes compared to version 0.4, in three broad categories:
I improved the performance of some common functions, in many cases by more than 10x.
I substantially refined the API.
I fixed many bugs, almost all of them subtle, thanks to the ever-astounding QuickCheck library.
In real-world code, many of the most frequently performed text manipulation operations involve searching for substrings. I've written a fast search function, based on Fredrik Lundh's very elegant adaptation of the Boyer-Moore-Horspool-Sunday algorithm, and rewritten all of the functions that perform searches internally to use it (without changing any of the APIs in question, of course).
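To give a flavour of how this family of algorithms works, here is a much-simplified Horspool-style search over String. This is a sketch for illustration only, not the package's implementation, which works over packed arrays and includes the Sunday refinement; the name indexOf is mine.

```haskell
import qualified Data.Map as M

-- Find the index of the first occurrence of the needle in the haystack.
-- Horspool's idea: on a mismatch, look at the character under the *end*
-- of the current window and shift the window far enough that the
-- rightmost matching needle character (if any) lines up with it.
indexOf :: String -> String -> Maybe Int
indexOf [] _ = Just 0
indexOf needle haystack = go 0
  where
    nlen = length needle
    hlen = length haystack
    -- For each needle character except the last, record how far we may
    -- shift when that character appears under the end of the window.
    -- fromList keeps the last entry per key, i.e. the rightmost
    -- occurrence, which gives the smallest (safest) shift.
    shifts = M.fromList (zip (init needle) [nlen - 1, nlen - 2 .. 1])
    go i
      | i + nlen > hlen  = Nothing
      | window == needle = Just i
      | otherwise        = go (i + shift)
      where
        window = take nlen (drop i haystack)
        lastc  = haystack !! (i + nlen - 1)
        shift  = M.findWithDefault nlen lastc shifts
```

The appeal of the approach is that for needles that share few characters with the haystack, the window usually advances by the full needle length per step.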
Retooling with this function has made a big difference to performance. I benchmarked old-style searches, searches over the popular ByteString type, and new-style searches.
For this data set, the new search code is 10.4 times faster than the old, and 24% faster than the ByteString search code.
If we perform a search for a short snippet of DNA in a large data set, we see a similar pattern:
Over the DNA data set, the new search code is 15.3 times faster than the old, and 85% faster than the ByteString search code.
By the way, it was writing—and wanting to measure the performance of—this new search function that led me to create the criterion benchmarking package, which in turn begat the statistics package. What a month of hacking it's been!
While the API of the text package still somewhat resembles the venerable Data.List API (which is now almost two decades old), I have modified it to be both narrower and more useful for text processing tasks.
All of the integer-index-based search functions are now gone, since using integer offsets into Unicode strings is a bad idea (especially when the underlying representation uses a variable-length encoding).
Instead, the search functions break up the original string. Consider the new
find function, for instance:
find :: Text -> Text -> (Text, [(Text, Text)])
This function returns all non-overlapping instances of
needle in
haystack. The first element of the returned pair is the prefix of
haystack prior to any matches of
needle. The second is a list of pairs.
The first element of each pair in the list is a span from the beginning of a match to the beginning of the next match, while the second is a span from the beginning of the match to the end of the input.
find "::" ""
==> ("", [])
find "/" "a/b/c/d"
==> ("a", [("/b","/b/c/d"), ("/c","/c/d"), ("/d","/d")])
Notice that no matter how much of the result list you consume, you can always reconstruct the entire original
haystack from the elements you've already seen.
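To make that reconstruction property concrete, here is a sketch over String, using the result value from the example above. The helper names are mine, not the package's.

```haskell
-- The result of find for "/" on "a/b/c/d", transcribed from the example.
example :: (String, [(String, String)])
example = ("a", [("/b","/b/c/d"), ("/c","/c/d"), ("/d","/d")])

-- Rebuild the whole haystack from the prefix and the first spans.
rebuild :: (String, [(String, String)]) -> String
rebuild (pre, pairs) = pre ++ concatMap fst pairs

-- Rebuild after consuming only the first k pairs (1 <= k <= length
-- pairs): the second span of the last pair seen covers the rest of
-- the input.
rebuildAfter :: Int -> (String, [(String, String)]) -> String
rebuildAfter k (pre, pairs) =
  pre ++ concatMap fst (take (k - 1) pairs) ++ snd (pairs !! (k - 1))
```

Both functions give back "a/b/c/d" for any legal k, which is exactly the property described above.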
More often, you won't want something nearly this general. Instead, you may want a function that simply splits a string on a separator:
split :: Text -> Text -> [Text]
Here are a few examples of how the
split function works:
split "\r\n" "a\r\nb\r\nd\r\ne"
==> ["a","b","d","e"]
split "aaa" "aaaXaaaXaaaXaaa"
==> ["","X","X","X",""]
split "x" "x"
==> ["",""]
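To pin down these semantics, here is a reference implementation over String. It assumes a non-empty separator, and the names splitOn' and breakOn' are mine; the real function works on Text and uses the fast search described earlier.

```haskell
import Data.List (isPrefixOf)

-- Split a string on every occurrence of a non-empty separator.
splitOn' :: String -> String -> [String]
splitOn' sep = go
  where
    go s = case breakOn' sep s of
      (chunk, Nothing)   -> [chunk]
      (chunk, Just rest) -> chunk : go rest

-- Break at the first occurrence of the pattern, returning the text
-- before it and, if the pattern was found, the text after it.
breakOn' :: String -> String -> (String, Maybe String)
breakOn' pat = walk []
  where
    walk acc s
      | pat `isPrefixOf` s = (reverse acc, Just (drop (length pat) s))
    walk acc []            = (reverse acc, Nothing)
    walk acc (c:cs)        = walk (c : acc) cs
```

Note that adjacent and leading/trailing separators produce empty chunks, which is what makes split invertible via intercalate.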
Some of the functions supplied by the list API are not very useful in the context of a text-oriented API. Consider
break :: (a -> Bool) -> [a] -> ([a],[a])
I use this name in the text package, but give the function a simpler and more useful type signature (its behaviour is otherwise the same):
break :: Text -> Text -> (Text, Text)
If, for some reason, you still want the predicate-based functions, they're usually available, but (in most cases) with a "
By" suffix added, to provide uniform naming:
breakBy :: (Char -> Bool) -> Text -> (Text, Text)
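The two signatures can be modelled over String as follows. These are sketches with names of my own choosing; the real functions operate on Text.

```haskell
import Data.Char (isSpace)
import Data.List (isPrefixOf)

-- Model of the substring-based break: everything before the first
-- occurrence of the needle, then the rest starting at the needle.
breakOnSub :: String -> String -> (String, String)
breakOnSub pat = walk []
  where
    walk acc s
      | pat `isPrefixOf` s = (reverse acc, s)
    walk acc []            = (reverse acc, [])
    walk acc (c:cs)        = walk (c : acc) cs

-- The predicate-based variant has exactly the shape of Data.List.break,
-- specialised to Char; over String the two coincide.
breakBy' :: (Char -> Bool) -> String -> (String, String)
breakBy' = break
```

So breakOnSub "::" "key::value" yields ("key", "::value"), while breakBy' isSpace "hello world" yields ("hello", " world").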
Testing and bug fixes
Even though the text package already had a very high degree of test coverage, I still found perhaps a dozen bugs while working on this release. The bugs in question were almost without exception quite subtle, and it would have been tremendously difficult to reproduce them, or to fix them with confidence, without the assistance of the QuickCheck library.
By the way, even though I've been using it for years, I still find myself feeling very strongly that the QuickCheck library is simply amazing. If unit testing is like riding a tricycle with wobbly wheels, working with QuickCheck more closely resembles flying a sleek starship. Wow.
At the moment, the profile of code covered by QuickCheck tests looks like this:
96% of public top-level definitions
85% of conditional expressions
88% of all expressions
(In case you're wondering, these are higher levels of test coverage than those achieved by the widely used bytestring package, so you should have high confidence that the code in the text package is pretty solid.)