Subscribe to
Posts
Comments

I just published the base64-bytestring library, which fills a surprising gap in Haskell's library coverage by adding support for base64 encoding and decoding of the ByteString type.

The library is written in pure Haskell, and it's fast:

  • 250 MB/sec encoding
  • 200 MB/sec strict decoding (per RFC 4648)
  • 100 MB/sec lenient decoding

The above numbers compares favourably to the base64 code in the Python distribution, which was handy to measure against as it is implemented in C:

  • 250 MB/sec encoding
  • 134 MB/sec decoding

Enjoy!

After several months of work, I'm pleased to announce the release of version 0.8.0.0 of the Haskell text library, which provides fast Unicode support with a pleasing API.

Compared to previous versions, this has the following major changes:

  • Improved performance. Some common functions, including support for Unicode I/O, are up to 3x faster than before.

  • Better test coverage. 95% of top-level functions are tested, as are 89% of all expressions. Almost all testing is performed via QuickCheck, so the quality of the tested code is high.

  • Bug fixes. A few significant bugs and space leaks have been fixed, thanks to a combination of reports from users and better test coverage. Notably, the I/O code had a couple of nasty bugs that I smoked out via QuickCheck.

  • Cleaner APIs. Several functions have more useful type signatures than before, and a few more useful functions have been added.

If you are writing an application or maintain a library that uses the text library, you should update your Cabal dependencies to something like this:

text == 0.8.*

In most cases, you won't need to edit your source code, as very few API entry points have changed since 0.7.

At this point, I think I'm happy calling the text API stable. What can you expect in the near future?

  • Even more code! I am close to releasing a substantial revision to the text-icu library, which provides a rich (and soon greatly expanded) set of functionality around locale-sensitive collation, normalization, text breaking, and other complex Unicode-related functionality that does not belong in the core of the text package. This will enable the development of powerful and subtle text processing applications that have previously not been writable in Haskell.

  • Faster code for free! Simon Peyton Jones rewrote the GHC inliner several months ago. My benchmarks indicate that the new inliner makes a huge difference to the performance of the text library, often doubling or tripling the performance of functions. Since GHC 6.14.1, which includes the new inliner, should be released in a few months, you'll get a huge performance boost for free.

  • Further refinements! I continue to work on other performance improvements as time permits.

Don't forget, you can always improve the text library yourself. I love to receive patches, requests for improvement, and bug reports.

darcs get http://code.haskell.org/text
darcs get http://code.haskell.org/text-icu

With Johan Tibell, I'm organizing an informal discussion about Haskell in the Real World at this year's CUFP. The discussion will take place in the evening on Thursday 30 September. You don't have to sign up to CUFP or ICFP to attend the session, so please feel welcome to come.

Topic: where we as a community should be focusing our efforts over the coming year to continue to improve Haskell as a practical language for day-to-day hacking.

Subjects we expect to cover should include at least the following:

  • post-mortem crash analysis and debugging
  • improvements to the Cabal build system (e.g. parallel builds)
  • performance and documentation improvements for core libraries
  • plugging missing functionality holes in the family of packages on hackage

Here's a useful little tip if you need to use hg convert to generate a stripped-down copy of a Mercurial repository. For instance, maybe we have a tree that someone committed a large file to by accident, or perhaps someone accidentally checked some closed source code into an open source tree. If such a commit makes it into a busy repository, it can be a while before anyone notices. Worse, we can't necessarily expect everyone who's downstream of that repository to immediately drop everything they're doing and find a way to switch over to the freshly-scrubbed tree you want to publish.

There's a fairly painless way around this. Firstly, we'll need to use the filemap option to make hg convert strip out the files we want to lose from our new repository. Here's an idea of how that should work. To start off, we create a demo repository, named hg-100.

$ hg clone -r 100 http://www.selenic.com/repo/hg hg-100
$ ls -F hg-100
comparison.txt	hgweb.py     mercurial/  PKG-INFO  setup.py  tkmerge
hg		MANIFEST.in  notes.txt	 README    tests/

Next, we set up a filemap that eliminates all tests from the tree.

$ echo exclude tests > no-tests.map
$ hg convert --quiet --filemap no-tests.map hg-100 nukeme
$ hg --cwd nukeme update --quiet
$ ls -F nukeme
comparison.txt	hgweb.py     mercurial/  PKG-INFO  setup.py
hg		MANIFEST.in  notes.txt	 README    tkmerge

Now that we have run hg convert, it's changed the changeset ID of every commit that either contains a change to a file in tests or is a child of such a commit. Ouch, right? Haven't we now lost the ability to merge contaminated repositories with the scrubbed one?

Fortunately, no. We could always pull future changes into hg-100, then rerun hg convert with the same --filemap option, and successfully scrub later changes, which we could then share with our collaborators.

The only trouble with this is that it relies on a non-version-controlled file named .hg/shamap in our nukeme tree. That file contains the mapping from source changeset ID to target changeset ID. This is what hg convert uses to tell whether a changeset has been seen before, and if so, what its new ID is. So if we lose the nukeme tree, we can't run hg convert again, and we're stuck, right? Well, once again, not necessarily.

A final trick up our sleeves is to run every instance of hg convert with --config convert.hg.saverev=True. This is documented in the output of hg help, but sadly not on the wiki page. This saves each source changeset ID in a special field in the corresponding changeset in the target repo.

$ rm -rf nukeme
$ hg convert --quiet --filemap no-tests.map --config convert.hg.saverev=True \
    hg-100 nukeme
$ hg --cwd nukeme tip --debug | grep 'extra:'
extra:       branch=default
extra:       convert_revision=526722d24ee5b3b860d4060e008219e083488356

The convert_revision data gives us enough to create a new .hg/shamap file whenever we need to. First, we create a style file to extract the necessary data from Mercurial.

$ cat >> pants.map << EOF
changeset = '{extras}\n'
extra = '{key} {value} {node}\n'
EOF

Then, in any clone of our converted repo, it becomes simple to regenerate the .hg/shamap file:

$ hg log -r0: --style pants.map | awk '/convert_revision/{print $2,$3}' \
    > $(hg root)/.hg/shamap

And finally: remember, kids, if you like this kind of information, not only is my Mercurial book free to read online, but if you buy a paper or ebook copy, all the royalties go to the Software Freedom Convervancy, whose work is so worthy of support you should buy five copies, not just one.

DSC_0243

I'm on the program committee for the 2011 Practical Aspects of Declarative Languages (PADL) conference, so I heartily encourage you to submit a paper on that exciting functional programming project you've been working on.

Topics of interest include:

  • Innovative applications of declarative languages.
  • Declarative domain-specific languages and applications.
  • Practical applications of theoretical results.
  • New language developments and their impact on applications.
  • Declarative languages and Software Engineering.
  • Evaluation of implementation techniques on practical applications.
  • Practical experiences and industrial applications.
  • Novel uses of declarative languages in the classroom.

Abstracts are due by September 1, with full papers by September 8. For more details, see the PADL '11 web site.

Thanks to diligent work by Johan, our new I/O manager is ready to merge into the main GHC tree. All the tests pass, and it’s performing well. I’ll post an update when it gets merged into HEAD.

We also heard back from the peer reviewers for this year’s Haskell Symposium that our paper describing the new I/O manager has been accepted. The reviewers had some very helpful suggestions for how we can clarify the paper and improve its presentation, so I’ll be working on a final draft over the next couple of weeks (I’m currently on vacation in Ireland, so paper revisions aren’t yet at the top of my priority list).

That gives us two very encouraging pieces of news to forge ahead with. Cheers!

On the Haskell libraries mailing list, Cale Gibbard made an interesting suggestion for a couple of useful list-related functions that could be added to the standard list library.

The particular one I'm interested in is his separate function:

separate :: [a] -> [([a],a,[a])]

The idea is that this function should produce a list of all ways of separating a list into an initial segment, a single element, and a final segment. For instance:

separate [1,2,3,4]
  ==> [([],1,[2,3,4]),
       ([1],2,[3,4]),
       ([1,2],3,[4]),
       ([1,2,3],4,[])]

This is very similar to the find function I wrote for the text package, which has the following type:

find :: Text -> Text -> (Text, [(Text, Text)])

This behaves as follows:

find "/" "a/b/c/d"
  ==> ("a", [("/b","/b/c/d"),
             ("/c","/c/d"),
             ("/d","/d")])

The first element of each pair in the list is a span from the beginning of a match to the beginning of the next match, while the second is a span from the beginning of the match to the end of the input.

Even though I wrote this find function (after much helpful discussion with Duncan Coutts), I find its type signature and behaviour a little tricky to explain.

What's interesting to me is that Cale's proposed separate function has a much simpler type and, I believe, easier-to-understand behaviour.

In fact, I am thinking that I should change find to act similarly. In other words, its type would change to the following:

find :: Text -> [(Text,Text,Text)]

And its behaviour would change as follows:

find "/" "foo/bar/quux/"
  ==> [("foo","/","bar/quux/"),
       ("foo/bar","/","quux/"),
       ("foo/bar/quux","/","")]

From observation, it's easy to explain how this function behaves. The nth element in the list (counting from zero, because we're computer scientists, right?) consists of a triple:

  • The entire string, up to the nth match

  • The nth match itself

  • The remainder of the string following the match

At any match, you can concatenate all three elements and get the original string back.

The simplicity of this function greatly appeals to me. I also find it telling that when I changed the function's behaviour, I was able to simplify its implementation.

Another interesting way of looking at this proposed change to find is that it now sequentially encodes all the ways you can advance (without overlap) one step through a zipper over the text, where a step takes you from one match to the next.

I think the change of behaviour gives us a piece of functionality that is much more appealing than the original find function, and so I'm inclined to release a text 0.8.0.0 package in a few days with the rewrite. What do you think?

(Credit where it's due: my recollection is that Duncan tickled the back of my mind with a proposal similar to this a few months ago, but I've no idea where or when that exchange actually took place, or what the specifics of his proposal were.)

After a few months of blissfully doing precious little hacking in our spare time, Johan and I have returned to work on the new event manager for GHC.

I spent the past few days writing a paper about the motivation, design, and internals of the new event manager, which I submitted to this year’s Haskell Symposium at about 2am this morning.

Independently, last week I started the work of integrating the event manager into GHC’s runtime libraries, and Johan picked up the pace this week while I concentrated on writing the paper. We’ve reached the point of having the code compile, but it’s still a few weeks from actually working.

While you wait for the fruits of our labours, please indulge in some vicarious enjoyment of the California summer:

In my first of this pair of articles, I laid out some of the qualities I've been looking for in a parsing library.

Before I dive back into detail, I want to show off some numbers. The new Attoparsec code is fast.

Performance

What did I benchmark? I captured some real HTTP GET requests from a live public web server, averaging 431 bytes per request. I chucked them into a file, and measured the time needed to parse the entire contents of the file with the following libraries:

  • Ryan Dahl's http-parser library, which is 1,672 lines of hand-rolled C craziness. I mean "craziness" in a special sort of awed way, with that hushed voice you reserve for the guy who builds the scale model of Neuschwanstein Castle in matchsticks. The library's heritage is closely based on the Ragel-generated parser used by Mongrel. This library is a fair approximation to about as fast as you can get, since it's been tuned for just one purpose. I wrote a smallish, but reasonably realistic, driver program to wire it up to file-based data, adding another 210 lines of code.

  • An Attoparsec-based HTTP request parser, 54 lines long, with about 30 lines of driver program. (Attoparsec itself is about 900 lines of code.)

  • Several Parsec-3-based parsers, which are almost identical in length to the Attoparsec-based version.

The Parsec 3 parsers come in three varieties:

  • The fastest uses a patch that Antoine Latter wrote to switch Parsec 3's internal machinery over to using continuation passing style (CPS). This parser uses ByteString for input, and reads the entire 18.8MB file in one chunk.

  • Next is the same parser, using lazy ByteString I/O to read the file in 64KB chunks. This costs about 40% in performance, but is almost mandatory to maintain a reasonable memory footprint on large inputs.

  • In last place is the official version of Parsec 3, reading the input in one chunk. (Reading lazily still costs an additional 40%, but I didn't want to further clutter the chart with more big numbers.)

What's interesting to me is that the tiny Attoparsec-based parser, which is more or less a transliteration of the relevant parts of RFC 2616, is so fast.

I went back and remeasured performance of the Attoparsec and C parsers on a larger data set (295,568 URLs), and got these numbers:

  • Attoparsec: 2.889 seconds, or 102,308 requests/second

  • C: 1.614 seconds, or 183,128 requests/second

That clocks the Attoparsec-based parser at about 56% the speed of the C parser. Not bad, given that it's about 3.2% the number of lines of code!

Of course there are tradeoffs involved here.

  • Parsec 3 emits much more friendly error messages, and can handle many different input types. Attoparsec, being aimed at plumbing-oriented network protocols, considers friendly error messages to not be worth the effort, and is specialised to the arrays of bytes you get straight off the network.

  • Parsec 3 requires all of its input to be available when the parser is run (either in one large chunk or via lazy I/O). If Attoparsec has insufficient data to return a complete result, it hands back a continuation that you provide extra data to. This eliminates the need for lazy I/O and any additional buffering, and makes for a beautiful, pure API that doesn't care what its input source is.

The memory footprint of the Attoparsec-based parser is small: it will run in 568KB of heap on my 64-bit laptop. The smallest heap size that the Parsec 3 parser can survive in isn't all that much larger: with lazily read input, it will run in a 750KB heap.

Overall, this is yet another instance where a little careful attention to performance yields very exciting results. Personally, I'd be quite happy to trade a 97% reduction in code size for such a small performance hit, especially given the clarity, ease of use, and flexibility of the resulting code. (The http_parser API is frankly not so much fun to use, even though I completely understand the motivation behind it.)

Next »