What’s in a text API?
June 30th, 2009 by Bryan O'Sullivan
Now that I’ve got the DEFUN 2009 schedule sorted out (you are coming, aren’t you?), I’ve had time to take a breath and think about the Haskell text library again. Its API is currently a clone of the ancient and venerable Haskell list API. If you’ve used the list API to do much text processing, you’ve probably spilled more than a few tears into your whiskey. The bytestring library also mostly clones the list API, albeit with a few improvements. This state of affairs makes me somewhat sad: here we are with a fabulous language, but a 1991-era API for mangling text.
To put this state of affairs into perspective, here is a function-by-function comparison of the string manipulation APIs of Python 2.6 and Haskell. This is intentionally somewhat pessimistic: I focus on aspects of the Python API that are either absent from or not trivially reimplemented in Haskell, but not the reverse. (If the details that follow make your eyes glaze over, skip them and read on after the table below.)
| Python | Haskell |
| --- | --- |
| x + y | x `append` y |
| x in y | x `isInfixOf` y |
| x < y | x < y |
| x <= y | x <= y |
| x == y | x == y |
| x != y | x /= y |
| x > y | x > y |
| x >= y | x >= y |
| x % (...) | |
| x[i] | x `index` i |
| x[i:j] | (j-i) `take` (i `drop` x) |
| hash(x) | |
| len(x) | length x |
| x * y | y `replicate` x |
| x.capitalize() | |
| x.center(y) | |
| x.count() | |
| x.decode() | decode... family |
| x.encode() | encode... family |
| x.endswith(y) | y `isSuffixOf` x |
| x.expandtabs() | |
| x.find(y) | |
| x.format(...) | |
| x.index(y) | |
| x.isalnum() | all isAlphaNum x |
| x.isalpha() | all isAlpha x |
| x.isdigit() | all isDigit x |
| x.islower() | all isLower x |
| x.isspace() | all isSpace x |
| x.istitle() | |
| x.isupper() | all isUpper x |
| x.join(y) | intercalate x y |
| x.ljust(w) | |
| x.lower() | toLower x |
| x.lstrip() | dropWhile isSpace x |
| x.partition(y) | break (==y) x |
| x.replace(y,z) | |
| x.rfind(y) | |
| x.rindex(y) | |
| x.rjust(y) | |
| x.rpartition(y) | |
| x.rsplit(y) | |
| x.rstrip(y) | |
| x.split(y) | |
| x.splitlines() | lines x |
| x.startswith(y) | y `isPrefixOf` x |
| x.strip() | |
| x.swapcase() | |
| x.title() | |
| x.translate(y) | |
| x.upper() | toUpper x |
| x.zfill() | |
For now, I’m intentionally not looking at Python’s unicodedata or string packages, even though each contains a handful of additional useful functions.
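To make a couple of rows from the table concrete, here is a minimal sketch over plain String, so that it needs nothing beyond base. The names slice and strip are hypothetical helpers for illustration, not functions from the text or bytestring libraries. The slice helper assumes non-negative indices with i <= j, so it does not reproduce Python's negative-index behaviour, and strip covers only the no-argument form of x.strip().

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- Hypothetical helper: Python's x[i:j], for non-negative i <= j only.
-- This is the table's "(j-i) `take` (i `drop` x)" written as a function.
slice :: Int -> Int -> String -> String
slice i j = take (j - i) . drop i

-- Hypothetical helper: Python's x.strip() with no argument,
-- removing whitespace from both ends of the string.
strip :: String -> String
strip = dropWhileEnd isSpace . dropWhile isSpace
```

Both are short, but each is one more definition that every text-processing program ends up carrying around.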
How would I broadly categorise what’s missing from the current Haskell APIs?
- Formatting. The format method that’s new in Python 2.6 is well designed and extremely useful. While there are a few formatting libraries on Hackage, each has flaws which I think are substantial enough to make them undesirable for wide use. As examples of those shortcomings, I’m thinking of a lack of static type safety or a poor fit for automated translation tools.
- Searching and splitting text. The Haskell APIs are based on predicates over individual characters, whereas what’s usually needed is predicates over strings. In other words, don’t just find me a character; find me a substring.
- Parsing. I’m not overly concerned about this, since Haskell’s libraries far outshine those of Python in this area. Although they currently lack support for the text library, the Parsec and attoparsec libraries will acquire it, I’m sure, as soon as there’s demand. What would be welcome is a decent Unicode-capable regular expression engine, for those times when you just have to get yourself into trouble in the name of expediency.
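The difference between a predicate over characters and a search for a substring, as described in the second item above, can be sketched as follows. This is an illustration over plain String using only Data.List, not a proposal for the text library's implementation. The names findChar and findSub are hypothetical, and the substring search is the naive quadratic one.

```haskell
import Data.List (findIndex, isPrefixOf, tails)

-- What the list-style API offers: the index of the first character
-- that satisfies a predicate.
findChar :: (Char -> Bool) -> String -> Maybe Int
findChar = findIndex

-- What is usually wanted (roughly Python's x.find(y)): the index of the
-- first occurrence of a substring. Naive O(n*m) search over all suffixes.
findSub :: String -> String -> Maybe Int
findSub needle = findIndex (needle `isPrefixOf`) . tails
```

For example, findSub "lo" "hello" evaluates to Just 3, while findSub "xyz" "hello" evaluates to Nothing, where Python's find would return -1.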
I intend to address each of these areas over the coming months, and I’ll write up the APIs I intend to flesh out here before I actually implement them, to solicit feedback from the community. One step that I think I’ll probably take, for instance, is to move a few of the functions in the Data.Text module that clone the list API into a new module, Data.Text.Legacy, so that I can use the same function names in Data.Text, but with more useful types. As an example of what I have in mind, I’d be inclined to move split :: Char -> Text -> [Text] into the legacy module, and replace it with split :: Text -> Text -> [Text].
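To show what that change of type means in practice, here is a rough sketch of both shapes of split, written over String so that it runs with base alone. The names splitChar and splitSub are hypothetical stand-ins for the legacy and the proposed function, and rejecting an empty delimiter is an assumption made for this sketch, not a decision about the real API.

```haskell
import Data.List (isPrefixOf)

-- Legacy shape (Char -> Text -> [Text]): split on a single character.
splitChar :: Char -> String -> [String]
splitChar c s = case break (== c) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitChar c rest

-- Proposed shape (Text -> Text -> [Text]): split on a substring delimiter.
-- Assumption: an empty delimiter is treated as an error.
splitSub :: String -> String -> [String]
splitSub [] _ = error "splitSub: empty delimiter"
splitSub delim s = go [] s
  where
    n = length delim
    -- acc holds the current chunk in reverse order.
    go acc [] = [reverse acc]
    go acc rest@(c : cs)
      | delim `isPrefixOf` rest = reverse acc : go [] (drop n rest)
      | otherwise               = go (c : acc) cs
```

With these definitions, splitChar ',' "a,b,,c" gives ["a","b","","c"], and splitSub ", " "a, b, c" gives ["a","b","c"]. The second is the behaviour that takes real work to get out of a character-based API.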
There’s something of a tension between the goals of providing a small, focused text library and getting all the API details right in a way that will make it truly useful. I find the proliferation of tiny libraries on Hackage, each providing a few little pieces of missing functionality, to be pretty dispiriting from the point of view of getting dug in and producing useful application code quickly, so I intend for the text and text-icu libraries to be broadly useful from the get-go.
If you have opinions, or better yet patches, to contribute, let’s get things rolling!
