Subscribe to
Posts
Comments

In my first of this pair of articles, I laid out some of the qualities I've been looking for in a parsing library.

Before I dive back into detail, I want to show off some numbers. The new Attoparsec code is fast.

Performance

What did I benchmark? I captured some real HTTP GET requests from a live public web server, averaging 431 bytes per request. I chucked them into a file, and measured the time needed to parse the entire contents of the file with the following libraries:

  • Ryan Dahl's http-parser library, which is 1,672 lines of hand-rolled C craziness. I mean "craziness" in a special sort of awed way, with that hushed voice you reserve for the guy who builds the scale model of Neuschwanstein Castle in matchsticks. The library's heritage is closely based on the Ragel-generated parser used by Mongrel. This library is a fair approximation to about as fast as you can get, since it's been tuned for just one purpose. I wrote a smallish, but reasonably realistic, driver program to wire it up to file-based data, adding another 210 lines of code.

  • An Attoparsec-based HTTP request parser, 54 lines long, with about 30 lines of driver program. (Attoparsec itself is about 900 lines of code.)

  • Several Parsec-3-based parsers, which are almost identical in length to the Attoparsec-based version.

The Parsec 3 parsers come in three varieties:

  • The fastest uses a patch that Antoine Latter wrote to switch Parsec 3's internal machinery over to using continuation passing style (CPS). This parser uses ByteString for input, and reads the entire 18.8MB file in one chunk.

  • Next is the same parser, using lazy ByteString I/O to read the file in 64KB chunks. This costs about 40% in performance, but is almost mandatory to maintain a reasonable memory footprint on large inputs.

  • In last place is the official version of Parsec 3, reading the input in one chunk. (Reading lazily still costs an additional 40%, but I didn't want to further clutter the chart with more big numbers.)

What's interesting to me is that the tiny Attoparsec-based parser, which is more or less a transliteration of the relevant parts of RFC 2616, is so fast.

I went back and remeasured performance of the Attoparsec and C parsers on a larger data set (295,568 URLs), and got these numbers:

  • Attoparsec: 2.889 seconds, or 102,308 requests/second

  • C: 1.614 seconds, or 183,128 requests/second

That clocks the Attoparsec-based parser at about 56% the speed of the C parser. Not bad, given that it's about 3.2% the number of lines of code!

Of course there are tradeoffs involved here.

  • Parsec 3 emits much more friendly error messages, and can handle many different input types. Attoparsec, being aimed at plumbing-oriented network protocols, considers friendly error messages to not be worth the effort, and is specialised to the arrays of bytes you get straight off the network.

  • Parsec 3 requires all of its input to be available when the parser is run (either in one large chunk or via lazy I/O). If Attoparsec has insufficient data to return a complete result, it hands back a continuation that you provide extra data to. This eliminates the need for lazy I/O and any additional buffering, and makes for a beautiful, pure API that doesn't care what its input source is.

The memory footprint of the Attoparsec-based parser is small: it will run in 568KB of heap on my 64-bit laptop. The smallest heap size that the Parsec 3 parser can survive in isn't all that much larger: with lazily read input, it will run in a 750KB heap.

Overall, this is yet another instance where a little careful attention to performance yields very exciting results. Personally, I'd be quite happy to trade a 97% reduction in code size for such a small performance hit, especially given the clarity, ease of use, and flexibility of the resulting code. (The http_parser API is frankly not so much fun to use, even though I completely understand the motivation behind it.)

My goal in working on the new GHC I/O manager has been to get the Haskell network stack into a state where it could be used to attack high-performance and scalable networking problems, domains in which it has historically been weak.

While it's encouraging to have an excellent networking stack (Johan and I now have this thoroughly in hand), the next thing I'd look for is libraries to help build networked applications. One of the fundamental things that such apps need to do well is parse data, be it received from the network or read from files.

The Haskell parsing library of first resort has for years been Parsec. While other capable libraries exist (e.g. polyparse and uu-parsinglib), they don't appear to see much use.

As appealing as Parsec's API is, it has a few problems:

  • Parsec 2 is slow, and it has high memory overhead, due to its use of Haskell's String type for tokens. Parsec 3 can use the more efficient ByteString type (which is in any case much more appropriate for networked applications that deal in octets), but it achieves this flexibility at the cost of being even slower than Parsec 2.

  • Parsec's API demands that all of a parser's input be available at once. People usually work around this by feeding a Parsec parser with lazily read data, but lazy I/O is at odds with my goal of writing solid networked code.

What properties should a parsing library for networked applications ideally possess? There are a few obvious desiderata that have been well known for years. For example, it's important to have an appealing API and programming model. Parsec squarely fits this desire.

Performance is also a big consideration. Ideally, a parsing library would be fast enough that you wouldn't feel any real need for either of the following:

  • A few weeks to write an insane hand-bummed parser.

  • Mechanical parser generators or lexers (e.g. happy or alex).

There are some additional important constraints on a realistic library: it must fit well into a highly concurrent networked world full of unreliable, hostile and incompetent clients.

  • High concurrency levels demand a low per-connection memory footprint.

  • The need to cope with poorly behaved clients requires that applications must be able to throttle connections that are too busy, or kill connections that are too slow or attempting to consume too many server resources. A good parsing library will not get in the way of these needs.

A few years ago, I made a few half-hearted attempts to write a specialised version of Parsec, which I eventually named Attoparsec.

I began with a stripped-down Parsec that was specialised to accept ByteString input. I then extended the API to allow a parser to consume small chunks of input at a time.

Because I wasn't using Attoparsec "in anger" at the time, I made sure that my library worked (more or less), but I was not measuring its performance.

In late January of this year, I began to think about using Attoparsec as the parser for a simple HTTP server that I could use to benchmark our new GHC I/O manager code. Clearly, I'd want the parser to perform well, or it would distort my numbers rather badly.

By coincidence, John MacFarlane emailed me around the same time, with disturbing findings: he'd tried Attoparsec, and found its performance to be terrible! In fact, it was 4 to 20 times slower than plain Parsec with his experimental parser and test data. Clearly, I had some hard work to look forward to.

Happily, that work is now almost complete, and I am pleased with the results. In the next post, I'll have some details of what this all entails.

While I’ve been in my corner hacking on low-level Haskell nonsense, apparently someone figured out how to make the internets more better.

To wit, a few judiciously curated sources of visual edification:

I'm from the internet
for great justice

The octopus was full of judgment
unhappy hipsters

Ooooooh
riot right click

Johan and I are hard at work getting the new I/O manager for GHC into shape. He published some numbers earlier today describing the dramatic performance difference between GHC's current timeout support and ours. Here, I'd like to talk about another aspect of performance: sending and receiving data.

We have a very simple benchmark that exercises our event and timeout APIs. It creates a number of pipes, and every time that it is told a pipe is ready for writing, it sends a 1-byte message over that pipe. When it is told the other end of a pipe is ready for reading, it reads the message after a configurable delay, until a specified number of messages have been sent and received.

In the charts below, we send a million 1-byte messages through an increasing number of pipes, starting at 100 and growing by a factor of 12/10 each time, ending at 83,898 pipes. Since a pipe consists of two file descriptors, we exceed the number of file descriptors that regular GHC can handle after just nine iterations. The new I/O manager deals happily with hundreds of thousands of file descriptors and active timeouts in use at one time, as the numbers below indicate. Note that the scales of all charts below are log-log.

I measured the first set of numbers on my Linux laptop running 64-bit Fedora 12 (2.4GHz Core 2 Duo), using GHC 6.10.4. The back end used here is epoll.

The second set of numbers come from my Mac laptop running OS X 10.6.2 (also a 2.4GHz Core 2 Duo), using GHC 6.12.1. The back end used here is kqueue.

And finally, a comparison between Linux and Mac performance numbers.

For small numbers of file descriptors and with no timeouts in use, epoll gives about 3 times better throughput than kqueue. With 160,000+ file descriptors in use, epoll's advantage declines to a factor of 2. We drop from 440,000 messages per second (nice!) with 100 pipes under Linux to 25,000 per second with 83,898 pipes.

When we introduce timeouts, the performance gap narrows, from a 2x advantage for epoll at low file descriptor count to parity at a high count, and throughput takes a hit that grows large as the number of file descriptors (and presumably timeouts) in use grows. I haven't checked to see why this is, but there can easily be over 100,000 timeouts active at one time when the delay between being notified and reading from a file descriptor is long; perhaps the expense of managing them has something to do with it?

In any case, with hundreds of thousands of file descriptors and timeouts active at once (oh, and a respectably small memory footprint, too), I am satisfied that our code is working extremely well!

Over the past couple of weeks, I have been working with Johan Tibbell on an event library to use for replacing GHC’s existing I/O manager. The work has been progressing rather nicely: I now have both the epoll and kqueue back ends working, while Johan has been focusing on a fast priority queue data structure for managing timeouts.

We’ve been working from a pair of git repositories:

(Incidentally, I’m using the excellent hg-git plugin for Mercurial, which has allowed me to continue to avoid using git. It’s been working very well. Many thanks to Scott Chacon and Augie Fackler for their work on it!)

Of course, now that we have the basic plumbing working, I have a few numbers to report. These are all for sending and receiving 1,000,000 messages over Unix pipes.

  • On 64-bit Linux, I can easily create 100,000 pipes (i.e. 200,000 file descriptors), and pass all messages through them in 6.69 seconds.
  • Under 32-bit Snow Leopard, for some reason I can’t create more than 2,048 pipes. Passing all messages takes 7.41 seconds.
  • With the same 2,048 pipes under Linux, the time required is 4.15 seconds.
  • Under Linux, if I try to create a very large number of pipes, I get an error message “VFS: file-max limit 291248 reached”, and as I just discovered, the machine starts to misbehave. Good times!

These are pretty gratifying numbers to have. This has been a very fun project so far, both in the technical parts and the dealings with the other people involved. I’m looking forward to continuing to work on it with Johan and others!

The Glasgow Haskell Compiler supports extraordinarily cheap threads, about as lightweight as those of Erlang. These are implemented using a two-level model, with threads scheduled across a set of OS-level threads. Since the lightweight threads can't afford to block when performing I/O operations, when a Haskell program starts, it runs an I/O manager thread whose job is to notify other threads when they can safely perform I/O.

On Unix systems, the I/O manager has for a long time managed its file descriptors using the select system call. While select performs well for a small number of numerically similar file descriptors, on most platforms it has a small fixed limit on the number of file descriptors it can manage, usually 1,024. Clearly, this makes GHC extremely tricky to use for large-scale server development: if you have more than a few hundred concurrent clients, and a few hundred worker threads opening and closing files, you're quickly going to encounter runtime crashes.

This problem has long been known; Simon Marlow opened a ticket against it 4 years ago. I've got the flu this week, so when I became too bored with sitting still, I took a quick crack at an initial solution to the problem, by writing a patch that replaces the I/O manager's use of select with poll.

The patch works well, but it's only an initial stab at the problem. The internals of the I/O manager are implemented less efficiently than I would like. Efficient is relative, of course: I can still get a dirt-simple Haskell HTTP server to handle 15,000 requests per second when hit by ab -c 75 -n 75000 (i.e. 75 concurrent clients issuing 75,000 requests in total). But I'd bet that number could be somewhat higher, even on a dual-core laptop.

I would not expect even the patched I/O manager to perform especially well if handling a large number of mostly idle clients, but I need to write some benchmarking code to confirm this. My ultimate goal is to be able to get the GHC runtime into a position of being able to handle hundreds of thousands of concurrent network connections. Now that the initial intermediate step is done, that next part, of changing the internals of the I/O manager to use better data structures (perhaps even relying on libev, is likely to be a lot more work.

Earlier this evening, I released a new version of the Haskell text package. This one adds support for I/O, which in previous releases you had to roll by hand. The Data.Text.IO and Data.Text.Lazy.IO modules are the places you'll want to look for details of how to read and write text easily.

The reason for this timing was that GHC 6.12.1 was released today, and its newly reworked I/O code includes thorough support for character set and line ending conversion. If you are using the text package on an older version of GHC, it will use UTF-8 encoding and the platform's native line ending mode. On 6.12 and above, the text package obeys the current settings for a given handle.

I still have some performance tuning to do with the lazy I/O code, but it works quite nicely right now. Enjoy!

I spent some time recently using my criterion benchmark suite to measure the performance of the Haskell text package. I've reproduced the 45 microbenchmarks that I've put together so far below.

The attention deficit summary

  • The performance of the text package is generally good, and there are some obvious areas for improvement.
  • In many cases, simply waiting for the release of GHC HEAD as 6.14 will yield huge (e.g. 100x) speedups "for free."
  • There are a few places where HEAD is noticeably slower than 6.10.

"Which type should I use?"

The text package is generally fast, and in many cases has a more refined and usefully text-oriented API than Haskell lists, so if you need to manipulate text, you should use this package.

If you are accustomed to using the bytestring package to work with text, you should probably switch to using the text package.

Lists are very slow in general, but they do win for a few trivial operations.

What I measured

When benchmarking, I tried two different versions of GHC:

  • The current stable release, 6.10.4.

  • The HEAD (aka the in-progress development code), which will become 6.14 late next year.

For each microbenchmark, I measured the performance of several different data types:

  • Plain old lists, the workhorse of Haskell. You'll see this in the charts below as "list".

  • The popular ByteString type, in both strict ("strict BS") and lazy ("lazy BS") variants.

  • Strict and lazy variants of the Text type—"strict T" and "lazy T", respectively.

I measured the performance of HEAD because it has a completely rewritten inliner. The performance of the text package is extremely sensitive to the quality of the inliner, as the numbers below indicate. In many cases, the inliner gives HEAD more than a factor of 100 speed advantage over 6.10!

When measuring, I evaluted results to head normal form, so operations that amortise as cheap (e.g. consing an element onto the front of a list) appear with their full unamortised costs.

How to read the charts

Here's a handy key for reading colours and labels:

  • Blue is GHC 6.10.4

  • Red is GHC HEAD

  • "lazy BS" is lazy bytestring

  • "strict BS" is strict bytestring

  • "list" is ... list

  • "lazy T" is lazy text

  • "strict T" is strict text

  • "utf8-string" is utf8-string

  • When you see a chart titled "length/foo", this indicates that I computed the length of a result, in an attempt to tickle the text package's stream fusion machinery into doing its thing.

Some of the charts below use linear Y axes, while those with large differences between fastest and slowest code use logarithmic axes. The kind of chart in use should always be clear. In every case, larger bars mean slower code. Unfortunately, using the Chart package, I couldn't figure out how to centre the label for a column beneath the column itself, so each label is slightly offset to the left. I also couldn't figure out how to turn off the tick marks or other chartjunk, so I apologize for the visual clutter.

The charts

I've added notes after whichever charts seem to deserve them. Charts are listed alphabetically.

time for append

time for concatMap

time for concat

time for cons

time for decode

time for drop

time for filter.filter

time for filter

Ouch! Something in HEAD is clobbering the performance of filter.

time for foldl'

time for foldr

time for head

time for init

time for intercalate

time for intersperse

time for isInfixOf

time for last

time for length-cons

time for length-decode

time for length-drop

time for length-filter.filter

time for length-filter

time for length-init

time for length-intercalate

time for length-intersperse

time for length-map.map

time for length-map

time for length-replicate-char

time for length-replicate-string

time for length-tail

time for length-take

time for length-toLower

time for length-toUpper

time for length-words

time for length-zipWith

time for map.map

time for map

time for replicate-char

time for replicate-string

time for reverse

time for tail

time for take

time for toLower

time for toUpper

time for words

time for zipWith

…if the world of blogging about software had by now developed some kind of a tradition of critical analysis?

Over at Inside Higher Ed, Scott McLemee writes a careful and thoughtful review of Cornel West’s new book. It performs the delicate feat of being at once both generous to its subject and devastating in its analysis:

Legend has it that the blues guitarist Robert Johnson acquired his haunting style by selling his soul to the devil at a crossroads. West, as a “bluesman of the life of the mind,” has clearly also been to the crossroads. The devil gave him a team of publicists. I don’t think this was a good bargain on West’s part. It left him unable to recognize that self-respect is often the enemy of self-esteem.

Although his prose style is impeccable, what I like even more about McLemee’s piece is the way in which he expresses hope that West might return to fulfilling his early promise. (I suspect that this expression of hope is mainly a form of rhetorical charity, but it’s stylish nonetheless.) This led me to wondering whether it’s even achievable to foster a similar style among people who write about code.

I suspect that many of the awful writing habits of software bloggers come from the fact that they are sometimes actually trying to do things: I tried to use some software; it did something dumb (or nothing at all); I am frustrated; I am going to get splenetic, possibly on a subject where I have no idea what I’m talking about. You can’t take an issue of Social Text out of the university library and do something fun with it after hours (hell, it takes a strong stomach to have fun with critical theory inside the library), so that particular variety of resentment born of ill experience doesn’t arise.

Maybe I’ll actually write up some thoughts on Go at some point, and see if I can live up to my admittedly forlorn hopes.

This evening, Matthew Podwysocki drew my attention to an amusing article over on Phil Trelford's blog, in which people compare various versions of an algorithm for producing left-truncatable primes.

I can't resist a round of golf, and I was of course tempted by Matt Curran's 17-line Haskell entry, so here's the exact same algorithm in 6 lines of Haskell.

ltps = take 4260 $ ps 1 [0]
  where
    ps m prev = prev ++ ps (m*10) cur
      where cur = concat [filter isPrime (map (+n*m) prev) | n <- [1..9]]
    isPrime 1 = False
    isPrime x = null (filter ((==0) . mod x) [2..floor . sqrt $ fromIntegral x])

To my mind, this is easier to read than the 17-line version, since it more directly expresses what the algorithm actually does.

Phil posts measurements of times to produce the 1000th left-truncatable prime, but this number is so easy to compute that I don't trust the measurements (if you measure something once, and it takes a handful of milliseconds, choose a bigger problem!). Here are the times required on my laptop for Haskell and C++ to compute the 2800th such number, which incidentally requires 64-bit arithmetic to represent:

  • Haskell: 1.703s

  • C++: 1.225s (requires most the ints in the code to turn into longs to give a correct answer)

The Haskell number isn't bad, given the compactness of the code, the 10 minutes I worked on it, and my total disinterest in working on the problem any further :-)

Next »