Subscribe to
Posts
Comments

While I’ve been in my corner hacking on low-level Haskell nonsense, apparently someone figured out how to make the internets more better.

To wit, a few judiciously curated sources of visual edification:

I'm from the internet
for great justice

The octopus was full of judgment
unhappy hipsters

Ooooooh
riot right click

Johan and I are hard at work getting the new I/O manager for GHC into shape. He published some numbers earlier today describing the dramatic performance difference between GHC's current timeout support and ours. Here, I'd like to talk about another aspect of performance: sending and receiving data.

We have a very simple benchmark that exercises our event and timeout APIs. It creates a number of pipes, and every time that it is told a pipe is ready for writing, it sends a 1-byte message over that pipe. When it is told the other end of a pipe is ready for reading, it reads the message after a configurable delay, until a specified number of messages have been sent and received.

In the charts below, we send a million 1-byte messages through an increasing number of pipes, starting at 100 and growing by a factor of 12/10 each time, ending at 83,898 pipes. Since a pipe consists of two file descriptors, we exceed the number of file descriptors that regular GHC can handle after just nine iterations. The new I/O manager deals happily with hundreds of thousands of file descriptors and active timeouts in use at one time, as the numbers below indicate. Note that the scales of all charts below are log-log.

I measured the first set of numbers on my Linux laptop running 64-bit Fedora 12 (2.4GHz Core 2 Duo), using GHC 6.10.4. The back end used here is epoll.

The second set of numbers come from my Mac laptop running OS X 10.6.2 (also a 2.4GHz Core 2 Duo), using GHC 6.12.1. The back end used here is kqueue.

And finally, a comparison between Linux and Mac performance numbers.

For small numbers of file descriptors and with no timeouts in use, epoll gives about 3 times better throughput than kqueue. With 160,000+ file descriptors in use, epoll's advantage declines to a factor of 2. We drop from 440,000 messages per second (nice!) with 100 pipes under Linux to 25,000 per second with 83,898 pipes.

When we introduce timeouts, the performance gap narrows, from a 2x advantage for epoll at low file descriptor count to parity at a high count, and throughput takes a hit that grows large as the number of file descriptors (and presumably timeouts) in use grows. I haven't checked to see why this is, but there can easily be over 100,000 timeouts active at one time when the delay between being notified and reading from a file descriptor is long; perhaps the expense of managing them has something to do with it?

In any case, with hundreds of thousands of file descriptors and timeouts active at once (oh, and a respectably small memory footprint, too), I am satisfied that our code is working extremely well!

Over the past couple of weeks, I have been working with Johan Tibbell on an event library to use for replacing GHC’s existing I/O manager. The work has been progressing rather nicely: I now have both the epoll and kqueue back ends working, while Johan has been focusing on a fast priority queue data structure for managing timeouts.

We’ve been working from a pair of git repositories:

(Incidentally, I’m using the excellent hg-git plugin for Mercurial, which has allowed me to continue to avoid using git. It’s been working very well. Many thanks to Scott Chacon and Augie Fackler for their work on it!)

Of course, now that we have the basic plumbing working, I have a few numbers to report. These are all for sending and receiving 1,000,000 messages over Unix pipes.

  • On 64-bit Linux, I can easily create 100,000 pipes (i.e. 200,000 file descriptors), and pass all messages through them in 6.69 seconds.
  • Under 32-bit Snow Leopard, for some reason I can’t create more than 2,048 pipes. Passing all messages takes 7.41 seconds.
  • With the same 2,048 pipes under Linux, the time required is 4.15 seconds.
  • Under Linux, if I try to create a very large number of pipes, I get an error message “VFS: file-max limit 291248 reached”, and as I just discovered, the machine starts to misbehave. Good times!

These are pretty gratifying numbers to have. This has been a very fun project so far, both in the technical parts and the dealings with the other people involved. I’m looking forward to continuing to work on it with Johan and others!

The Glasgow Haskell Compiler supports extraordinarily cheap threads, about as lightweight as those of Erlang. These are implemented using a two-level model, with threads scheduled across a set of OS-level threads. Since the lightweight threads can't afford to block when performing I/O operations, when a Haskell program starts, it runs an I/O manager thread whose job is to notify other threads when they can safely perform I/O.

On Unix systems, the I/O manager has for a long time managed its file descriptors using the select system call. While select performs well for a small number of numerically similar file descriptors, on most platforms it has a small fixed limit on the number of file descriptors it can manage, usually 1,024. Clearly, this makes GHC extremely tricky to use for large-scale server development: if you have more than a few hundred concurrent clients, and a few hundred worker threads opening and closing files, you're quickly going to encounter runtime crashes.

This problem has long been known; Simon Marlow opened a ticket against it 4 years ago. I've got the flu this week, so when I became too bored with sitting still, I took a quick crack at an initial solution to the problem, by writing a patch that replaces the I/O manager's use of select with poll.

The patch works well, but it's only an initial stab at the problem. The internals of the I/O manager are implemented less efficiently than I would like. Efficient is relative, of course: I can still get a dirt-simple Haskell HTTP server to handle 15,000 requests per second when hit by ab -c 75 -n 75000 (i.e. 75 concurrent clients issuing 75,000 requests in total). But I'd bet that number could be somewhat higher, even on a dual-core laptop.

I would not expect even the patched I/O manager to perform especially well if handling a large number of mostly idle clients, but I need to write some benchmarking code to confirm this. My ultimate goal is to be able to get the GHC runtime into a position of being able to handle hundreds of thousands of concurrent network connections. Now that the initial intermediate step is done, that next part, of changing the internals of the I/O manager to use better data structures (perhaps even relying on libev, is likely to be a lot more work.

Earlier this evening, I released a new version of the Haskell text package. This one adds support for I/O, which in previous releases you had to roll by hand. The Data.Text.IO and Data.Text.Lazy.IO modules are the places you'll want to look for details of how to read and write text easily.

The reason for this timing was that GHC 6.12.1 was released today, and its newly reworked I/O code includes thorough support for character set and line ending conversion. If you are using the text package on an older version of GHC, it will use UTF-8 encoding and the platform's native line ending mode. On 6.12 and above, the text package obeys the current settings for a given handle.

I still have some performance tuning to do with the lazy I/O code, but it works quite nicely right now. Enjoy!

I spent some time recently using my criterion benchmark suite to measure the performance of the Haskell text package. I've reproduced the 45 microbenchmarks that I've put together so far below.

The attention deficit summary

  • The performance of the text package is generally good, and there are some obvious areas for improvement.
  • In many cases, simply waiting for the release of GHC HEAD as 6.14 will yield huge (e.g. 100x) speedups "for free."
  • There are a few places where HEAD is noticeably slower than 6.10.

"Which type should I use?"

The text package is generally fast, and in many cases has a more refined and usefully text-oriented API than Haskell lists, so if you need to manipulate text, you should use this package.

If you are accustomed to using the bytestring package to work with text, you should probably switch to using the text package.

Lists are very slow in general, but they do win for a few trivial operations.

What I measured

When benchmarking, I tried two different versions of GHC:

  • The current stable release, 6.10.4.

  • The HEAD (aka the in-progress development code), which will become 6.14 late next year.

For each microbenchmark, I measured the performance of several different data types:

  • Plain old lists, the workhorse of Haskell. You'll see this in the charts below as "list".

  • The popular ByteString type, in both strict ("strict BS") and lazy ("lazy BS") variants.

  • Strict and lazy variants of the Text type—"strict T" and "lazy T", respectively.

I measured the performance of HEAD because it has a completely rewritten inliner. The performance of the text package is extremely sensitive to the quality of the inliner, as the numbers below indicate. In many cases, the inliner gives HEAD more than a factor of 100 speed advantage over 6.10!

When measuring, I evaluted results to head normal form, so operations that amortise as cheap (e.g. consing an element onto the front of a list) appear with their full unamortised costs.

How to read the charts

Here's a handy key for reading colours and labels:

  • Blue is GHC 6.10.4

  • Red is GHC HEAD

  • "lazy BS" is lazy bytestring

  • "strict BS" is strict bytestring

  • "list" is ... list

  • "lazy T" is lazy text

  • "strict T" is strict text

  • "utf8-string" is utf8-string

  • When you see a chart titled "length/foo", this indicates that I computed the length of a result, in an attempt to tickle the text package's stream fusion machinery into doing its thing.

Some of the charts below use linear Y axes, while those with large differences between fastest and slowest code use logarithmic axes. The kind of chart in use should always be clear. In every case, larger bars mean slower code. Unfortunately, using the Chart package, I couldn't figure out how to centre the label for a column beneath the column itself, so each label is slightly offset to the left. I also couldn't figure out how to turn off the tick marks or other chartjunk, so I apologize for the visual clutter.

The charts

I've added notes after whichever charts seem to deserve them. Charts are listed alphabetically.

time for append

time for concatMap

time for concat

time for cons

time for decode

time for drop

time for filter.filter

time for filter

Ouch! Something in HEAD is clobbering the performance of filter.

time for foldl'

time for foldr

time for head

time for init

time for intercalate

time for intersperse

time for isInfixOf

time for last

time for length-cons

time for length-decode

time for length-drop

time for length-filter.filter

time for length-filter

time for length-init

time for length-intercalate

time for length-intersperse

time for length-map.map

time for length-map

time for length-replicate-char

time for length-replicate-string

time for length-tail

time for length-take

time for length-toLower

time for length-toUpper

time for length-words

time for length-zipWith

time for map.map

time for map

time for replicate-char

time for replicate-string

time for reverse

time for tail

time for take

time for toLower

time for toUpper

time for words

time for zipWith

…if the world of blogging about software had by now developed some kind of a tradition of critical analysis?

Over at Inside Higher Ed, Scott McLemee writes a careful and thoughtful review of Cornel West’s new book. It performs the delicate feat of being at once both generous to its subject and devastating in its analysis:

Legend has it that the blues guitarist Robert Johnson acquired his haunting style by selling his soul to the devil at a crossroads. West, as a “bluesman of the life of the mind,” has clearly also been to the crossroads. The devil gave him a team of publicists. I don’t think this was a good bargain on West’s part. It left him unable to recognize that self-respect is often the enemy of self-esteem.

Although his prose style is impeccable, what I like even more about McLemee’s piece is the way in which he expresses hope that West might return to fulfilling his early promise. (I suspect that this expression of hope is mainly a form of rhetorical charity, but it’s stylish nonetheless.) This led me to wondering whether it’s even achievable to foster a similar style among people who write about code.

I suspect that many of the awful writing habits of software bloggers come from the fact that they are sometimes actually trying to do things: I tried to use some software; it did something dumb (or nothing at all); I am frustrated; I am going to get splenetic, possibly on a subject where I have no idea what I’m talking about. You can’t take an issue of Social Text out of the university library and do something fun with it after hours (hell, it takes a strong stomach to have fun with critical theory inside the library), so that particular variety of resentment born of ill experience doesn’t arise.

Maybe I’ll actually write up some thoughts on Go at some point, and see if I can live up to my admittedly forlorn hopes.

This evening, Matthew Podwysocki drew my attention to an amusing article over on Phil Trelford's blog, in which people compare various versions of an algorithm for producing left-truncatable primes.

I can't resist a round of golf, and I was of course tempted by Matt Curran's 17-line Haskell entry, so here's the exact same algorithm in 6 lines of Haskell.

ltps = take 4260 $ ps 1 [0]
  where
    ps m prev = prev ++ ps (m*10) cur
      where cur = concat [filter isPrime (map (+n*m) prev) | n <- [1..9]]
    isPrime 1 = False
    isPrime x = null (filter ((==0) . mod x) [2..floor . sqrt $ fromIntegral x])

To my mind, this is easier to read than the 17-line version, since it more directly expresses what the algorithm actually does.

Phil posts measurements of times to produce the 1000th left-truncatable prime, but this number is so easy to compute that I don't trust the measurements (if you measure something once, and it takes a handful of milliseconds, choose a bigger problem!). Here are the times required on my laptop for Haskell and C++ to compute the 2800th such number, which incidentally requires 64-bit arithmetic to represent:

  • Haskell: 1.703s

  • C++: 1.225s (requires most the ints in the code to turn into longs to give a correct answer)

The Haskell number isn't bad, given the compactness of the code, the 10 minutes I worked on it, and my total disinterest in working on the problem any further :-)

I’m pleased to announce the availability of version 0.2 of my criterion library for Haskell performance evaluation.

Compared to version 0.1, this version has some significant changes.

  • The benchmarking API has been improved! If you’re benchmarking a pure function, you no longer need to feed it an Int to ensure that it won’t get thunked or let-floated. Instead, Use the B data type (see the examples/ directory) to specify an unsaturated function that you want to benchmark, along with its last argument. Many thanks to Daniel Peebles for suggesting this approach, which makes the use of criterion considerably more pleasant.
  • Neil Brown added a --kde-same-axis command-line option to allow plotting the KDEs for all the benchmarks with the same X axis. I’d like to thank Neil for his helpful work on some other code cleanup issues, too.
  • There are some more, and better quality, examples in the examples/ directory.
  • Tim Docker contributed a patch that improves graph rendering.

As the changes above suggest, several people have been actively contributing to the improvement of criterion, and I’m delighted by this. Please feel welcome to get involved; there are many improvements I’d like to make, but of course I have limited time to work on them. I’ll be happy to talk with you about other potentially useful changes.

I'm pleased to announce the availability of version 0.5 of text, a library that provides fast Unicode text handling for Haskell.

This version contains numerous changes compared to version 0.4, in three broad categories:

  • I made improvements to the performance of some common functions by, in many cases, more than 10x.

  • I have substantially refined the API.

  • Many bugs fixed, almost all of them subtle, thanks to the ever-astounding QuickCheck library.

Performance improvements

In real-world code, many of the most frequently performed text manipulation operations involve searching for substrings. I've written a fast search function, based on Fredrik Lundh's very elegant adaptation of the Boyer-Moore-Horspool-Sunday algorithm, and rewritten all of the functions that perform searches internally to use it (without changing any of the APIs in question, of course).

Retooling with this function has made a big difference to performance. Here are measurements of old-style searches, searches over the popular ByteString type, and new-style searches:

big slow densities

big bytestring densities

big fast densities

For this data set, the new search code is 10.4 times faster than the old, and 24% faster than the ByteString search code.

If we perform a search for a short snippet of DNA in a large data set, we see a similar pattern:

dna slow densities

dna bytestring densities

dna fast densities

Over the DNA data set, the new search code is 15.3 times faster than the old, and 85% faster than the ByteString search code.

By the way, it was writing—and wanting to measure the performance of—this new search function that led me to create the criterion benchmarking package, which in turn begat the statistics package. What a month of hacking it's been!

API changes

While the API of the text package still somewhat resembles the venerable Data.List API (which is now almost two decades old), I have modified it to be both narrower and more useful for text processing tasks.

All of the integer-index-based search functions are now gone, since using integer offsets into Unicode strings is a bad idea (especially when the underlying representation uses a variable-length encoding).

Instead, the search functions break up the original string. Consider the new find function, for instance:

find :: Text -> Text -> (Text, [(Text, Text)])

This function returns all non-overlapping instances of needle in haystack. The first element of the returned pair is the prefix of haystack prior to any matches of needle. The second is a list of pairs.

The first element of each pair in the list is a span from the beginning of a match to the beginning of the next match, while the second is a span from the beginning of the match to the end of the input.

find "::" ""
==> ("", [])

find "/" "a/b/c/d"
==> ("a", [("/b","/b/c/d"), ("/c","/c/d"), ("/d","/d")])

Notice that no matter how much of the result list you consume, you can always reconstruct the entire original haystack from the elements you've already seen.

More often, you won't want something nearly this general. Instead, you may want a function that simply splits a string on a separator:

split :: Text -> Text -> [Text]

Here are a few examples of how the split function works:

split "\r\n" "a\r\nb\r\nd\r\ne"
==> ["a","b","d","e"]

split "aaa" "aaaXaaaXaaaXaaa"
==> ["","X","X","X",""]

split "x" "x"
==> ["",""]

Some of the functions supplied by the list API are not very useful in the context of a text-oriented API. Consider break:

break :: (a -> Bool) -> [a] -> ([a],[a])

I use this name in the text package, but give the function a simpler and more useful type signature (its behaviour is otherwise the same):

break :: Text -> Text -> (Text, Text)

If, for some reason, you still want the predicate-based functions, they're usually available, but (in most cases) with a "By" suffix added, to provide uniform naming:

breakBy :: (Char -> Bool) -> Text -> (Text, Text)

Testing and bug fixes

Even though the text package already had a very high degree of test coverage, I still found perhaps a dozen bugs while working on this release. The bugs in question were almost without exception quite subtle, and it would have been tremendously difficult to reproduce them, or to fix them with confidence, without the assistance of the QuickCheck library.

By the way, even though I've been using it for years, I still find myself feeling very strongly that the QuickCheck library is simply amazing. If unit testing is like riding a tricycle with wobbly wheels, working with QuickCheck more closely resembles flying a sleek starship. Wow.

At the moment, the profile of code covered by QuickCheck tests looks like this:

  • 96% of public top-level definitions

  • 85% of conditional expressions

  • 88% of all expressions

(In case you're wondering, these are higher levels of test coverage than those achieved by the widely used bytestring package, so you should have high confidence that the code in the text package is pretty solid.)

Next »