I'm pleased to announce the availability of version 0.5 of text, a library that provides fast Unicode text handling for Haskell.
This version contains numerous changes compared to version 0.4, in three broad categories:
I made improvements to the performance of some common functions by, in many cases, more than 10x.
I have substantially refined the API.
Many bugs fixed, almost all of them subtle, thanks to the ever-astounding QuickCheck library.
API changes
While the API of the text package still somewhat resembles the venerable Data.List API (which is now almost two decades old), I have modified it to be both narrower and more useful for text processing tasks.
All of the integer-index-based search functions are now gone, since using integer offsets into Unicode strings is a bad idea (especially when the underlying representation uses a variable-length encoding).
Instead, the search functions break up the original string. Consider the new find function, for instance:
find :: Text -> Text -> (Text, [(Text, Text)])
This function returns all non-overlapping instances of needle in haystack. The first element of the returned pair is the prefix of haystack prior to any matches of needle. The second is a list of pairs.
The first element of each pair in the list is a span from the beginning of a match to the beginning of the next match, while the second is a span from the beginning of the match to the end of the input.
find "::" ""
==> ("", [])
find "/" "a/b/c/d"
==> ("a", [("/b","/b/c/d"), ("/c","/c/d"), ("/d","/d")])
Notice that no matter how much of the result list you consume, you can always reconstruct the entire original haystack from the elements you've already seen.
More often, you won't want something nearly this general. Instead, you may want a function that simply splits a string on a separator:
split :: Text -> Text -> [Text]
Here are a few examples of how the split function works:
split "\r\n" "a\r\nb\r\nd\r\ne"
==> ["a","b","d","e"]
split "aaa" "aaaXaaaXaaaXaaa"
==> ["","X","X","X",""]
split "x" "x"
==> ["",""]
Some of the functions supplied by the list API are not very useful in the context of a text-oriented API. Consider break:
break :: (a -> Bool) -> [a] -> ([a],[a])
I use this name in the text package, but give the function a simpler and more useful type signature (its behaviour is otherwise the same):
break :: Text -> Text -> (Text, Text)
If, for some reason, you still want the predicate-based functions, they're usually available, but (in most cases) with a "By" suffix added, to provide uniform naming:
breakBy :: (Char -> Bool) -> Text -> (Text, Text)
Testing and bug fixes
Even though the text package already had a very high degree of test coverage, I still found perhaps a dozen bugs while working on this release. The bugs in question were almost without exception quite subtle, and it would have been tremendously difficult to reproduce them, or to fix them with confidence, without the assistance of the QuickCheck library.
By the way, even though I've been using it for years, I still find myself feeling very strongly that the QuickCheck library is simply amazing. If unit testing is like riding a tricycle with wobbly wheels, working with QuickCheck more closely resembles flying a sleek starship. Wow.
At the moment, the profile of code covered by QuickCheck tests looks like this:
(In case you're wondering, these are higher levels of test coverage than those achieved by the widely used bytestring package, so you should have high confidence that the code in the text package is pretty solid.)