I've spent a couple of hours over the past few evenings starting to make good on my recent promise to clean up the Haskell text library. This is part progress report, part solicitation of input.

I renamed the split function to splitChar:

splitChar :: Char -> Text -> [Text]

In its place, the name split now refers to a new function with the following signature:

split :: Text -> Text -> [Text]

This function breaks a Text into pieces separated by the first Text argument, consuming the delimiter. Here are a few examples of its operation:

split "\r\n" "a\r\nb\r\nd\r\ne" == ["a","b","d","e"]
split "aaa"  "aaaXaaaXaaaXaaa"  == ["","X","X","X",""]
split "x"    "x"                == ["",""]
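
To make the behaviour concrete, here is a rough sketch of the same idea on plain String values, using only base. It is not the library's implementation: the names splitStr and breakOnFirst are hypothetical, and rejecting an empty delimiter is my own assumption, since nothing above says how that case should behave.

```haskell
import Data.List (stripPrefix)

-- Hypothetical helper: return the text before the first occurrence of the
-- delimiter, plus the text after it (Nothing if the delimiter never occurs).
breakOnFirst :: String -> String -> (String, Maybe String)
breakOnFirst delim s = case stripPrefix delim s of
  Just rest -> ([], Just rest)
  Nothing   -> case s of
    []     -> ([], Nothing)
    (c:cs) -> let (piece, rest) = breakOnFirst delim cs
              in (c : piece, rest)

-- Hypothetical String stand-in for the Text-based split described above.
-- Assumption: an empty delimiter is treated as an error.
splitStr :: String -> String -> [String]
splitStr delim s
  | null delim = error "splitStr: empty delimiter"
  | otherwise  = case breakOnFirst delim s of
      (piece, Nothing)   -> [piece]
      (piece, Just rest) -> piece : splitStr delim rest
```

Note that a delimiter at the very start or end of the input yields an empty piece, which is why splitting "x" on "x" gives two empty strings rather than none.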

I've created a new Data.Text.Compat module that exports splitChar under the name split. The idea behind this module is that you can import it to easily port code from the list and bytestring APIs to the text library's API.

The new splitTimes function is similar to split, but limits the number of splitting operations it performs.

splitTimes :: Int -> Text -> Text -> [Text]

A few examples illustrate its operation:

splitTimes 0   "//"  "a//b//c"   == ["a//b//c"]
splitTimes 2   ":"   "a:b:c:d:e" == ["a","b","c:d:e"]
splitTimes 100 "???" "a????b"    == ["a","?b"]

The last element of the list contains the residual text left over once the given number of splits has been performed. A negative split count is treated as zero.

Both split and splitTimes obey some simple laws:

intercalate s . split s         == id
intercalate s . splitTimes k s  == id
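
The limited split and its round-trip law can be sketched on String in the same spirit. Again, this is an illustration under assumptions rather than the library code: splitTimesStr, breakOnFirst and roundTrips are hypothetical names, and the empty-delimiter error is my own choice.

```haskell
import Data.List (intercalate, stripPrefix)

-- Hypothetical helper: text before the first occurrence of the delimiter,
-- plus the text after it (Nothing if the delimiter never occurs).
breakOnFirst :: String -> String -> (String, Maybe String)
breakOnFirst delim s = case stripPrefix delim s of
  Just rest -> ([], Just rest)
  Nothing   -> case s of
    []     -> ([], Nothing)
    (c:cs) -> let (piece, rest) = breakOnFirst delim cs
              in (c : piece, rest)

-- Hypothetical String stand-in for splitTimes. A count of zero or less
-- performs no splits, so the whole input comes back as the residual text.
-- Assumption: an empty delimiter is treated as an error.
splitTimesStr :: Int -> String -> String -> [String]
splitTimesStr k delim s
  | null delim = error "splitTimesStr: empty delimiter"
  | k <= 0     = [s]
  | otherwise  = case breakOnFirst delim s of
      (piece, Nothing)   -> [piece]
      (piece, Just rest) -> piece : splitTimesStr (k - 1) delim rest

-- The round-trip law from above, checked for one particular input.
roundTrips :: Int -> String -> String -> Bool
roundTrips k delim s = intercalate delim (splitTimesStr k delim s) == s
```

Because the delimiter is consumed exactly once per split and never altered, joining the pieces with that same delimiter has to rebuild the input, whatever the split count.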

Another new function is chunksOf, which splits a Text into equally-sized chunks:

chunksOf :: Int -> Text -> [Text]

The last chunk returned may be smaller than the others:

chunksOf 3 "foobarbaz"   == ["foo","bar","baz"]
chunksOf 4 "haskell.org" == ["hask","ell.","org"]

This is a function I've written by hand innumerable times in many languages. I'm surprised that no library I've encountered provides it by default.
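
For reference, a hand-rolled version on ordinary lists might look something like the sketch below. The name chunksOfList is hypothetical, and returning an empty list for a non-positive chunk size is an assumption on my part, not something specified above.

```haskell
-- Hypothetical list-based analogue of chunksOf.
-- Assumption: a chunk size of zero or less yields no chunks at all.
chunksOfList :: Int -> [a] -> [[a]]
chunksOfList k xs
  | k <= 0    = []
  | null xs   = []
  | otherwise = let (chunk, rest) = splitAt k xs
                in chunk : chunksOfList k rest
```

Since splitAt simply takes what is available, the final chunk comes out shorter whenever the input length is not a multiple of the chunk size.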

To complement the long-established dropWhile function, I wrote dropAfter, which drops characters that satisfy a predicate at the end of a string instead of the beginning. I also wrote a dropAround function that does both, but I'm not thrilled with its name: suggestions welcome.

I have wondered whether it would make sense to go a little further and add a few more utility definitions:

strip      = dropAround isSpace
stripLeft  = dropWhile  isSpace
stripRight = dropAfter  isSpace

Currently, my thinking is that the few characters saved are worth the trouble, since trimming white space is far more common than trimming other kinds of text. I haven't added those functions yet, so I'm still amenable to persuasive arm-twisting to the contrary.
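
Here is how the trimming functions discussed above could be sketched on String. The Str-suffixed names are hypothetical stand-ins, and the double reverse is only the simplest way to express the idea; I would not expect a real Text implementation to work that way.

```haskell
import Data.Char (isSpace)

-- Hypothetical String analogue of dropAfter: drop the trailing run of
-- characters that satisfy the predicate.
dropAfterStr :: (Char -> Bool) -> String -> String
dropAfterStr p = reverse . dropWhile p . reverse

-- Hypothetical String analogue of dropAround: trim both ends.
dropAroundStr :: (Char -> Bool) -> String -> String
dropAroundStr p = dropAfterStr p . dropWhile p

-- The white space helpers, defined exactly as proposed above.
stripStr, stripLeftStr, stripRightStr :: String -> String
stripStr      = dropAroundStr isSpace
stripLeftStr  = dropWhile     isSpace
stripRightStr = dropAfterStr  isSpace
```

Characters in the middle of the string are never touched, even if they satisfy the predicate: only the runs at the two ends are removed.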

17 Responses to “First steps with Haskell text API improvement”

  1. on 06 Jul 2009 at 03:09, Stephen Blackheath

    Bryan – This is great stuff! Thank you for your good works!

    I’m an extreme disliker of helper functions, and yet (contrary to what I said on #haskell) even I am starting to think that ‘strip’ is a good idea. Some observations:

    - I don’t like ‘dropAfter’ because it sounds like it drops everything after the first occurrence of the delimiter (from the left). I thought about ‘dropFinal’ but I’m starting to think ‘dropLast’ might be better because it’s consistent with ‘Prelude.last’.

    - I can’t think of anything better than ‘dropAround’. It seems good in that it is clear and memorable.

    - ‘stripLeft’ and ‘stripRight’ assume that left == start. I realize there is a precedent in Haskell: foldl, etc. However, this is a Unicode library and there are 600 million speakers of Arabic, Farsi and Hebrew world-wide. If Haskell does take over the world in spite of itself it would be nice not to annoy/confuse people.

    Making the names consistent with drop* doesn’t work. ‘head’ and ‘last’ (inspired by Prelude) don’t work either because it sounds like they work on a single character only. stripHeads is just plain clunky. So here’s an idea – how about some new terms Start and End? (They’re nouns – ‘Begin’ is a verb). These are conceptually consistent with ‘head’ and ‘last’:

    heads == start
    lasts == end

    - Instead of ‘strip’ you could consider ‘trim’, which is what Java uses – shorter and possibly clearer.

    Here they all are together:

    dropWhile, trimStart
    dropEnd, trimEnd
    dropAround, trim (or trimAround?)

    Less than perfect, I’m afraid, but hopefully there’s something useful in it. — Steve

  2. on 06 Jul 2009 at 04:35, Nicolas Pouillard

    Great progress, thanks!

    About dropAfter, if I understand correctly, the following equation holds:
    dropAfter p = reverse . dropWhile p . reverse

    If so I propose using the words reverse or backward in the name:
    revDropWhile
    dropWhileBackward

    I’m also in favor of trim{Start,End,}.

  3. on 06 Jul 2009 at 05:06, Dougal Stanton

    This is all looking pretty good. I totally agree on the ubiquity of “chunksOf”. I always end up recreating it by some name, “groupsOf”, “breaklist”, etc etc. I wonder if just “chunk” would be a good name?

    I don’t really follow the logic of Stephen’s argument about Unicode. Surely the left and right in stripLeft and stripRight are referring to the underlying Haskell lists, which are always written x:y:z:[]. Stripping the left means stripping the head elements of the list. If writing Hebrew with Haskell allows one to write []:z:y:x I’d be very surprised!

  4. on 06 Jul 2009 at 05:31, Mark Wotton

    I’d prefer “chomp” to either “strip” or “trim”, if a perlism isn’t considered too filthy…

  5. on 06 Jul 2009 at 06:27, Arthur van Leeuwen

    I know for a fact dropAfter is useful. However, why not name it in accordance with spanEnd and breakEnd in Data.ByteString, i.e. dropWhileEnd?

  6. on 06 Jul 2009 at 08:48, Duncan Coutts

    The ‘split’ function was only in the Data.ByteString[.Lazy].Char8 modules, not in Data.List. So there’s no great history or existing standard practice that needs preserving. I’m not sure I’d bother with the Compat module.

  7. on 06 Jul 2009 at 11:01, Programmer

    Chomp? What a horrid name. My vote is for strip or trim, in that order.

  8. on 06 Jul 2009 at 12:14, brian

    stripLeft and stripRight seem fine to me. If they were named stripStart and stripEnd, I’d agree that they were badly named because of the language issue.

  9. on 06 Jul 2009 at 12:42, Ian Taylor

    ‘trimLeft’, ‘trimRight’, ‘trim’ sound good to me.

    I always liked the sound of ‘join’ when dealing with text rather than ‘intercalate’. It goes well with split.

  10. on 06 Jul 2009 at 18:35, Simon Michael

    I would like to see strip* included. I write these helpers in every Haskell project (as strip, lstrip, rstrip).

    You didn’t mention the split library on Hackage. Did you see it? Are there any more good ideas to be harvested from there?

    Great stuff.

  11. on 06 Jul 2009 at 21:30, solrize

    strip/stripLeft/stripRight are in the spirit of the Python names for those functions, which in turn probably has more in common with Haskell than Perl or Java do. So I’d stay with them.

  12. on 07 Jul 2009 at 01:13, Greg

    chunksOf

    in Ruby this is Enumerable#each_slice.

    # Ruby
    (1..10).each_slice(3) {|a| p a} # [1,2,3] …

    -- Haskell
    eachSlice 4 "haskell.org" == ["hask","ell.","org"]

    I like chunksOf or groupsOf and slicesOf

  13. on 07 Jul 2009 at 01:24, nbloomf

    IIRC the chunksOf function was implemented as groupBy in “On Lisp”.

  14. on 08 Jul 2009 at 20:15, Keith

    I think chunksOf is generally useful enough to be in Data.List. Is it possible (Haskell’?) to add functions like this that turn out to be general enough?

  15. on 12 Jul 2009 at 20:48, Stephen Blackheath

    I second Arthur van Leeuwen’s “dropWhileEnd” suggestion

  16. on 14 Jul 2009 at 04:43, Johan Tibell

    I agree with Duncan that a Compat module is unnecessary. The number of modules listed at http://hackage.haskell.org/package/text is already rather intimidating. I also wouldn’t bother with splitChar unless it has serious performance benefits.

    I also prefer join to intercalate but I guess that boat already sailed. I don’t remember what the original argument was but if intercalate was chosen over join because of name clashes with monads I think that’s a poor reason since we have namespaces.

    A parting comment: Beware of the potential combinatorial explosion that comes from creating a helper function for every common case rather than relying on composition. Haskell’s lack of keyword arguments makes libraries prone to export lots of fooByBar functions for lots of different “Bars”. If lots of possible “configuration” parameters are absolutely needed for a function consider passing them in a record instead of creating separate functions.

  17. on 04 Oct 2009 at 16:50, Andrew

    Good stuff.

    chunksOf :: Int -> [a] -> [[a]]
    please.
