What’s in a text API?

Now that I’ve got the DEFUN 2009 schedule sorted out (you are coming, aren’t you?), I’ve had time to take a breath and think about the Haskell text library again. Its API is currently a clone of the ancient and venerable Haskell list API. If you’ve used the list API to do much text processing, you’ve probably spilled more than a few tears into your whiskey. The bytestring library also mostly clones the list API, albeit with a few improvements. This state of affairs makes me somewhat sad: here we are with a fabulous language, but a 1991-era API for mangling text.

To put this state of affairs into perspective, here is a function-by-function comparison of the string manipulation APIs of Python 2.6 and Haskell. This is intentionally somewhat pessimistic: I focus on aspects of the Python API that are either absent from or not trivially reimplemented in Haskell, but not the reverse. (If the details that follow make your eyes glaze over, skip them and read on after the table below.)

Python              Haskell
x + y               x `append` y
x in y              x `isInfixOf` y
x < y               x < y
x <= y              x <= y
x == y              x == y
x != y              x /= y
x > y               x > y
x >= y              x >= y
x % (...)           (none)
x[i]                x `index` i
x[i:j]              (j-i) `take` (i `drop` x)
len(x)              length x
x * y               y `replicate` x
x.decode()          decode... family
x.encode()          encode... family
x.endswith(y)       y `isSuffixOf` x
x.isalnum()         all isAlphaNum x
x.isalpha()         all isAlpha x
x.isdigit()         all isDigit x
x.islower()         all isLower x
x.isspace()         all isSpace x
x.isupper()         all isUpper x
x.join(y)           intercalate x y
x.lower()           toLower x
x.lstrip()          dropWhile isSpace
x.partition(y)      break (==y) x
x.splitlines()      lines x
x.startswith(y)     y `isPrefixOf` x
x.upper()           toUpper x
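
To make a couple of the rows above concrete, here is a minimal sketch on plain String using only base. The names slice, lstrip, rstrip and strip are hypothetical helpers chosen for illustration, not functions from the text library, and isSpace only roughly matches the whitespace set Python strips.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- Rough analogue of Python's x[i:j] for 0 <= i <= j.
-- Negative indices are not supported in this sketch.
slice :: Int -> Int -> [a] -> [a]
slice i j = take (j - i) . drop i

-- Rough analogues of Python's x.lstrip(), x.rstrip() and x.strip()
-- when called without arguments.
lstrip, rstrip, strip :: String -> String
lstrip = dropWhile isSpace
rstrip = dropWhileEnd isSpace
strip  = rstrip . lstrip
```

For instance, slice 1 3 "hello" evaluates to "el", and strip "  hi \n" evaluates to "hi". Each of these is a one-liner, but every program ends up redefining them.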

For now, I’m intentionally not looking at Python’s unicodedata or string packages, even though each contains a handful of additional useful functions.

How would I broadly categorise what’s missing from the current Haskell APIs?

  1. Formatting. The format method that’s new in Python 2.6 is well designed and extremely useful. While there are a few formatting libraries on Hackage, each has flaws which I think are substantial enough to make them undesirable for wide use. As examples of those shortcomings, I’m thinking of a lack of static type safety or a poor fit for automated translation tools.
  2. Searching and splitting text. The Haskell APIs are based on predicates over individual characters, whereas what’s usually needed is predicates over strings. In other words, don’t just find me a character; find me a substring.
  3. Parsing. I’m not overly concerned about this, since Haskell’s libraries far outshine those of Python in this area. Although they currently lack support for the text library, the Parsec and attoparsec libraries will acquire it, I’m sure, as soon as there’s demand. What would be welcome is a decent Unicode-capable regular expression engine, for those times when you just have to get yourself into trouble in the name of expediency.
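
The gap described in item 2 can be sketched on plain String with base alone. The function partitionOn below is a hypothetical helper written for illustration, loosely modelled on Python's partition; it is not part of any existing library API, and it returns only the text before and after the match rather than Python's three-element result.

```haskell
import Data.List (isPrefixOf)

-- Hypothetical helper: split a string around the first occurrence of a
-- substring. Returns Nothing when the needle does not occur.
partitionOn :: String -> String -> Maybe (String, String)
partitionOn needle = go []
  where
    -- acc holds the characters consumed so far, in reverse order.
    go acc rest
      | needle `isPrefixOf` rest =
          Just (reverse acc, drop (length needle) rest)
    go _   []       = Nothing
    go acc (c:rest) = go (c : acc) rest
```

Compare break (== ',') "a, b, c", which yields ("a", ", b, c") and leaves the delimiter for the caller to clean up, with partitionOn ", " "a, b, c", which yields Just ("a", "b, c"). This naive version rescans the needle at every position, so it is a statement of the desired behaviour rather than an efficient search.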

I intend to address each of these areas over the coming months, and I’ll write up the APIs I intend to flesh out here before I actually implement them, to solicit feedback from the community. One step that I think I’ll probably take, for instance, is to move a few of the functions in the Data.Text module that clone the list API into a new module, Data.Text.Legacy, so that I can use the same function names in Data.Text, but with more useful types. As an example of what I have in mind, I’d be inclined to move split :: Char -> Text -> [Text] into the legacy module, and replace it with split :: Text -> Text -> [Text].
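
As a rough illustration of the two types discussed above, the sketch below assumes a recent release of the text package, in which Data.Text exports split (taking a character predicate) and splitOn (taking a Text delimiter). Those names come from later versions of the library, not from the plan in this post, and the wrappers splitOnChar and splitOnText are hypothetical names used only here.

```haskell
import Data.Text (Text)
import qualified Data.Text as T

-- Character-based splitting, in the style of the list API:
-- the delimiter is a single Char.
splitOnChar :: Char -> Text -> [Text]
splitOnChar c = T.split (== c)

-- Substring-based splitting: the delimiter is itself a Text.
-- Note: T.splitOn raises an error if the delimiter is empty.
splitOnText :: Text -> Text -> [Text]
splitOnText = T.splitOn
```

With the second form, splitting on a multi-character delimiter such as "\r\n" takes one call instead of a hand-written loop over characters.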

There’s something of a tension between the goals of providing a small, focused text library and getting all the API details right in a way that will make it truly useful. I find the proliferation of tiny libraries on Hackage, each providing a few little pieces of missing functionality, to be pretty dispiriting from the point of view of getting dug in and producing useful application code quickly, so I intend for the text and text-icu libraries to be broadly useful from the get-go.

If you have opinions, or better yet patches, to contribute, let’s get things rolling!

8 comments on “What’s in a text API?”
  1. Magnus says:

    I often use ++ instead of append.

  2. Magnus says:

    Ah, yes, I should have mentioned that I use ++ on [Char]. It’d be nice to have a similar operator in Text, append is long!

  3. Eelco says:

    Very nice summary! I’m looking forward to your work. Have you seen http://splonderzoek.blogspot.com/2009/06/rfc-extensible-typed-scanf-and-printf.html BTW? It’s a recent attempt to create a better formatting (and parsing) library.

  4. Nicolas Pouillard says:

    About splitting, switching to Text -> Text -> [Text] won’t suffice. Having a look at Data.List.Split may be a good idea. Moreover, splitting on a regexp is very common in string processing.

  5. Justin says:

    I would love to see string oriented operations like toUpper, toLower, and trim. isInfixOf and intercalate are examples of *worst* *function* *name* *ever*!

    I’m excited to see what you come up with, but will it be easy to use over multiple list-like representations? E.g. strings, bytestring, and text?

  6. I love posting stuff before bed! Lots to respond to.

    Magnus, I wish there was a nice operator for the Monoid class’s mappend function. That would nicely solve the append notation problem for a huge family of types. For now, there’s nothing appealing 🙁

    Eelco, I did see Seán’s posting about typed formatting, and while it’s cute, it has the severe problem of being unsuitable for i18n.

    Nicolas, thanks for reminding me of Data.List.Split. It’s a richer API than I think is necessary (it edges into Parsec territory), but a slimmed down version would be good to use as a basis.

    Justin, the existing Data.Text API provides toUpper and friends over text, which behave correctly according to Unicode case folding rules.

    As for finding a way to get the extended API ported to strings and bytestrings, I don’t have a dog in that fight. I don’t think that either String or ByteString is an appropriate type to use for text manipulation, so I’m not going to make any efforts there.

  7. Christopher Lewis says:

    I share your frustrations regarding lots of little libraries and how this impacts a developer’s ability to quickly locate and use the functionality they need. I have frequently experienced this frustration when using other community library management platforms similar to Hackage. When it comes to Haskell, however, I suffer from some tension on this subject.

    One of the beauties of a strongly-typed functional language like Haskell is the range of possibilities it offers a developer to structure and combine different APIs. So, to a certain extent, I view the proliferation of little, interoperable libraries on Hackage as one of the unique advantages of Haskell.

    Is it possible that there’s a more desirable middle ground between lots of little packages and few or fewer larger packages? For example, do we need more structure in the current set of Hierarchical packages, allowing for the creation of what, for lack of a better term, I will call “super” packages that “collect” logically related functionality from many smaller packages? (Just to be clear, I’m not suggesting any form of extension to Haskell as a language, but additional organization layered over the packages offered on Hackage.)

  8. Sean says:

    I’ve gotten the impression that xformat is unsuitable for i18n from multiple people now. I am very interested in improving it. As I wrote, this is really to get something out there for comments, not as a final implementation. How can it be improved? Or, if you prefer, how can we do type-safe, extensible i18n formatting in Haskell?

    Please contact me if you’d like to start a conversation about this. (I prefer email, but anything’s fine.)
