Book review: Parallel and Concurrent Programming in Haskell

It's time someone finally wrote a proper review of Simon Marlow's amazing book, Parallel and Concurrent Programming in Haskell.

I am really not the right person to tackle this job objectively, because I have known Simon for 20 years and I currently happen to be his boss at Facebook. Nevertheless, I fly my flag of editorial bias proudly, and in any case a moment's glance at Simon's book will convince you that the absurdly purple review I am about to write is entirely justified.

Moreover, this book is sufficiently clear, and introduces so many elegant ideas and beautiful abstractions, that you would do well to learn the minimal amount of Haskell necessary to absorb its lessons, simply so that you can become enriched in the reading.

Simon's book makes an overdue departure from the usual Haskell literature (including my own book, which in case you didn't know is fully titled "Real World Haskell Of Six Years Ago Which We Should Have Edited A Little More Carefully") in assuming that you already have a modest degree of understanding of the language. This alone is slightly pathetically refreshing! I can't tell you how glad I am that functional programming has finally reached the point where we no longer have to start every bloody book by explaining what it is.

Actually, there's a second reason that I might not be an ideal person to review this book: I have only skimmed most of the first half, which concerns itself with parallel programming. Just between you and me, I will confess that parallel programming in Haskell hasn't lit my internal fire of enthusiasm. I used to do a lot of parallel programming in a previous life, largely using MPI, and the experience burned me out. While parallel programming in Haskell is far nicer than grinding away in MPI ever was, I do not love the subject enough that I want to read about it.

So what I'm really reviewing here is the second part of Simon's book, which if issued all by itself at the same price as the current entire tome, would still be a bargain. Let's talk about just how good it is.

The second half of the book concerns itself with concurrent programming, an area where Haskell particularly shines, and which happens to be the bread-and-butter of many a working programmer today. The treatment of concurrency does not depend in any way on the preceding chapters, so if you're so inclined, you can read chapter one and then skip to the second half of the book without missing any necessary information.

Chapter 7 begins by introducing some of the basic components of concurrent Haskell: threads (familiar to all) and a data type called an MVar. An MVar acts a bit like a single-item box: you can put one item into it if it's empty, otherwise you must wait; and you can take an item out if it's full, otherwise you must wait.
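The put/take protocol can be sketched in a few lines (my own minimal example, not code from the book):

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar

main :: IO ()
main = do
  box <- newEmptyMVar                 -- an empty single-item box
  _ <- forkIO (putMVar box "hello")   -- fill it (would block if already full)
  msg <- takeMVar box                 -- empty it (blocks until it is full)
  putStrLn msg
```

Even this tiny example shows the appeal: the blocking behaviour gives you synchronization and communication in one primitive.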

As humble as the MVar is, Simon uses it as a simple communication channel with which he builds a concurrent logging service. He then deftly identifies the performance problem that such a service will have when a single MVar acts as a bottleneck. Not content to leave that bottleneck in place, he illustrates how to construct an efficient unbounded channel using MVar as the building block, and clearly explains how this more complex structure works safely.

This is the heart of Simon's teaching technique: he presents an idea that is simple to grasp, then pokes a hole in it. With this hole as motivation, he presents a slightly more complicated approach that corrects the weaknesses of the prior step, without sacrificing that clarity.

For instance, the mechanism behind unbounded channels is an intricate dance of two MVars, where Simon clearly explains how they ensure that a writer will not block, while a reader will block only if the channel is empty. He then goes on to show how this channel type can be extended to support multicast, such that one writer can send messages to several readers. His initial implementation is subtly incorrect, which he once again explains and uses as a springboard to a final version. By this time, you've accumulated enough lessons from the progression of examples that you can appreciate the good design taste and durability of these unbounded channels.
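For flavour, here is a condensed sketch of the classic two-MVar channel along the lines the book develops (the names and details are my paraphrase of the well-known structure, not Simon's exact code, and this is the single-reader version that precedes the multicast refinement):

```haskell
import Control.Concurrent.MVar

-- The channel is a linked list of items threaded through MVars.
-- The read end and write end each hold a pointer into the list;
-- the write end always points at an empty "hole".
type Stream a = MVar (Item a)
data Item a   = Item a (Stream a)
data Chan a   = Chan (MVar (Stream a)) (MVar (Stream a))

newChan :: IO (Chan a)
newChan = do
  hole <- newEmptyMVar
  Chan <$> newMVar hole <*> newMVar hole

writeChan :: Chan a -> a -> IO ()
writeChan (Chan _ writeEnd) val = do
  newHole <- newEmptyMVar
  oldHole <- takeMVar writeEnd
  putMVar oldHole (Item val newHole)  -- fill the old hole; never blocks
  putMVar writeEnd newHole

readChan :: Chan a -> IO a
readChan (Chan readEnd _) = do
  stream <- takeMVar readEnd
  Item val rest <- takeMVar stream    -- blocks only if the channel is empty
  putMVar readEnd rest
  return val

main :: IO ()
main = do
  c <- newChan
  writeChan c "ping"
  writeChan c "pong"
  readChan c >>= putStrLn
  readChan c >>= putStrLn
```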

Incidentally, this is a good time to talk about the chapter on parallel computing that I made sure not to skip: chapter 4, which covers dataflow parallelism using an abstraction called Par. Many of the types and concerns in this chapter will be familiar to you if you're used to concurrent programming with threads, which makes this the most practical chapter to start with if you want to venture into parallel programming in Haskell, but don't know where to begin. Par is simply wonderfully put together, and is an inspiring example of tasteful, parsimonious API design. So put chapter 4 on your must-read list.

Returning to the concurrent world, chapter 8 introduces exceptions, using asynchronous operations as the motivation. Simon builds a data type called Async, which is similar to "futures" or "promises" from other languages (and to the IVar type from chapter 4), and proceeds to make Async operations progressively more robust in the face of exceptions, then more powerful so that we can wait on the completion of one of several Async operations.
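The core of the Async idea fits in a handful of lines (again my own simplified sketch of the kind of thing the chapter builds, before the progressively more robust versions):

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import Control.Exception (SomeException, throwIO, try)

newtype Async a = Async (MVar (Either SomeException a))

-- Start an action in a new thread, capturing its result or its exception.
async :: IO a -> IO (Async a)
async action = do
  var <- newEmptyMVar
  _ <- forkIO (try action >>= putMVar var)
  return (Async var)

-- Block until the action finishes, rethrowing any exception it raised.
wait :: Async a -> IO a
wait (Async var) = readMVar var >>= either throwIO return

main :: IO ()
main = do
  a <- async (return (2 + 2 :: Int))
  r <- wait a
  print r
```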

Chapter 9 resumes the progress up the robustness curve, by showing how we can safely cancel Async operations that have not yet completed, how to deal with the trouble that exceptions can cause when thrown at an inopportune time (hello, resource leaks!), and how to put an upper bound on the amount of time that an operation can run for.

Software transactional memory gets an extended treatment in chapters 10 and 11. STM has gotten a bad rap in the concurrent programming community, mostly because the implementations of STM that target traditional programming languages have drawbacks so huge that they are deeply unappealing. In the same way that the Java and C++ of 10-15 years ago ruined the reputation of static type systems when there were vastly better alternatives out there, STM in Haskell might be easy to consign to the intellectual dustbin by association, when in fact it's a much more interesting beast than its relatives.

A key problem with traditional STM is that its performance is killed stone dead by the amount of mutable state that needs to be tracked during a transaction. Haskell sidesteps much of this need for book-keeping with its default stance that favours immutable data. Nevertheless, STM in Haskell does have a cost, and Simon shows how to structure code that uses STM to make its overheads acceptable.

Another huge difficulty with traditional STM lies in the messy boundary between transactional code and code that has side effects (and which hence cannot be safely called from a transaction). Haskell's type system eliminates these difficulties, and in fact makes it easier to construct sophisticated combinations of transactional operations. STM may carry some overhead, but it buys real power: Simon revisits the Async API and uses some of the advanced features of Haskell's STM to build a multiple-wait implementation that is more efficient than its MVar-based predecessor.
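The composability at work here is easy to demonstrate: STM's orElse combinator turns two transactions into "whichever succeeds first", which is the essence of waiting on multiple operations. A minimal sketch, assuming TMVar-backed results rather than the book's exact Async type:

```haskell
import Control.Concurrent.STM

-- Wait for whichever of two transactional variables is filled first.
-- If takeTMVar retries on the first, orElse tries the second instead.
waitEither :: TMVar a -> TMVar b -> STM (Either a b)
waitEither a b = (Left <$> takeTMVar a) `orElse` (Right <$> takeTMVar b)

main :: IO ()
main = do
  slow <- newEmptyTMVarIO
  fast <- newTMVarIO "done"
  r <- atomically (waitEither slow fast)
  print (r :: Either String String)
```

Doing the same with MVars requires spawning helper threads; with STM it is a pure composition of two smaller transactions.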

In chapter 14, Simon covers Cloud Haskell, a set of fascinating packages that implement Erlang-style distributed message passing, complete with monitoring and restart of remote nodes. I admire Cloud Haskell for its practical willingness to adopt wholesale the very solid ideas of the Erlang community, as they have a quarter of a century of positive experience with their distinctive approach to constructing robust distributed applications.

If you don't already know Haskell, this book offers two significant gifts. The first is a vigorous and compelling argument for why Haskell is an uncommonly good language for the kind of concurrent programming that is fundamental to much of today's computing. The second is an eye-opening illustration of some beautiful and powerful APIs that transcend any particular language. Concise, elegant design is worth celebrating wherever you see it, and this book is brimful of examples.

On the other hand, if you're already a Haskell programmer, it is very likely that this book will awaken you to bugs you didn't know your concurrent code had, abstractions that you could be building to make your applications cleaner, and practical lessons in how to start simple and then refine your code as you learn more about your needs.

Finally, for me as a writer of books about computing, this book has lessons too. It is understated, letting the quality of its examples and abstractions convince more deeply than bombast could reach. It is minimalist, revisiting the same few initially simple ideas through successive waves of refinement and teaching. And it is clear, with nary a word out of place.

In short, if you care about Haskell, if you are interested in concurrency, if you appreciate good design, if you have an ear for well-crafted teaching, Parallel and Concurrent Programming in Haskell is a book that you simply must read. We simply do not see books of this quality very often, so treasure 'em when you see 'em.

Posted in haskell, reading

New year, new library releases, new levels of speed

I just released new versions of the Haskell text, attoparsec, and aeson libraries on Hackage, and there’s a surprising amount to look forward to in them.

The summary for the impatient: some core operations in text and aeson are now much more efficient. With text, UTF-8 encoding is up to four times faster, while with aeson, encoding and decoding of JSON bytestrings are both up to twice as fast.


Perhaps the least interesting release is attoparsec. It adds a new dependency on Bas van Dijk’s scientific package to allow efficient and more accurate parsing of floating point numbers, a longstanding minor weakness. It also introduces two new functions for single-token lookahead, which are used by the new release of aeson; read on for more details.


The new release of the text library has much better support for encoding to a UTF-8 bytestring via the encodeUtf8 function. The new encoder is up to four times faster than in the previous major release.

Simon Meier contributed a pair of UTF-8 encoding functions that can encode to the new Builder type in the latest version of the bytestring library. These functions are slower than the new encodeUtf8 implementation, but still twice as fast as the old encodeUtf8.

Not only are the new Builder encoders admirably fast, they’re more flexible than encodeUtf8, as Builders can efficiently glue together many small fragments. Once again, read on for more details about how this helped with the new release of aeson. (Note: if you don’t have the latest version of bytestring in your library environment, you won’t get the new Builder encoders.)
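The new encoders compose naturally with the rest of the bytestring Builder API. A small sketch (assuming the function is exposed as encodeUtf8Builder, as in subsequent releases of text):

```haskell
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy.Char8 as L
import Data.Monoid ((<>))
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8Builder)

-- Glue several small fragments into a single lazy bytestring:
-- raw bytes from charUtf8 and UTF-8-encoded Text, concatenated
-- without any intermediate strict bytestrings.
main :: IO ()
main = L.putStrLn . B.toLazyByteString $
  B.charUtf8 '[' <> encodeUtf8Builder (T.pack "héllo") <> B.charUtf8 ']'
```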

The second major change to the text library came about when I finally decided to expose all of the library’s internal modules. The newly exposed modules can be found in the Data.Text.Internal hierarchy. Before you get too excited, please understand that I can’t make guarantees of release-to-release stability for any functions or types that are documented as internal.


Finally, the new release of the aeson library focuses on improved performance and accuracy. We parse floating point numbers more accurately thanks once again to Bas van Dijk’s scientific library. And for performance, both decoding and encoding of JSON bytestrings are up to twice as fast as in the previous release.

On the decoding side, I used the new lookahead primitives from attoparsec to make parsing faster and less memory intensive (by avoiding backtracking, if you’re curious). Meanwhile, Simon Meier contributed a patch that uses his new Builder based UTF-8 encoder from the text library to double encoding performance. (Encoding performance is improved even if you don’t have the necessary new version of bytestring, but only by about 10%.)

On my crummy old Mac laptop, I can decode at 30-40 megabytes per second, and encode at 100-170 megabytes per second. Not bad!


I'd particularly like to thank Bas van Dijk and Simon Meier for their excellent contributions during this most recent development cycle. It's really a pleasure to work with such smart, friendly people.

Simon and Bas deserve some kind of an additional medal for being forgiving of my sometimes embarrassingly long review latencies: some of Simon's patches against the text library are almost two years old! (Please pardon me while I grasp at straws in my slightly shamefaced partial defence here: the necessary version of bytestring wasn't released until three months ago, so I'm not the only person in the Haskell community with long review latencies...)

Posted in haskell, open source

Testing a UTF-8 decoder with vigour

Yesterday, Michael Snoyman reported a surprising regression in version 1.0 of my Haskell text library: for some invalid inputs, the UTF-8 decoder was truncating the invalid data instead of throwing an exception.

Thanks to Michael providing an easy repro, I quickly bisected the origin of the regression to a commit from September that added support for incremental decoding of UTF-8. That work was motivated by applications that need to be able to consume incomplete input (e.g. a network packet containing possibly truncated data) as early as possible.

The low-level UTF-8 decoder is implemented as a state machine in C to squeeze as much performance out as possible. The machine has two visible end states: UTF8_ACCEPT indicates that a buffer was successfully decoded in its entirety, while UTF8_REJECT indicates that the input contained invalid UTF-8 data. When the decoder stops, any other machine state counts as work in progress, i.e. a decode that couldn’t complete because we reached the end of a buffer.

When the old all-or-nothing decoder encountered an incomplete or invalid input, it would back up by a single byte to indicate the location of the error. The incremental decoder is a refactoring of the old decoder, and the new all-or-nothing decoder calls it.

The critical error arose in the refactoring process. Here’s the old code for backing up a byte.

    /* Error recovery - if we're not in a
       valid finishing state, back up. */
    if (state != UTF8_ACCEPT)
        s -= 1;

This is what the refactoring changed it to:

    /* Invalid encoding, back up to the
       errant character. */
    if (state == UTF8_REJECT)
        s -= 1;

To preserve correctness, the refactoring should have added a check to the new all-or-nothing decoder so that it would step back a byte if the final state of the incremental decoder was neither UTF8_ACCEPT nor UTF8_REJECT. Oops! A very simple bug with unhappy consequences.

The text library has quite a large test suite that has revealed many bugs over the years, often before they ever escaped into the wild. Why did this ugly critter make it over the fence?

Well, a glance at the original code for trying to test UTF-8 error handling is telling—in fact, you don’t even need to be able to read a programming language, because the confession is in the comment.

-- This is a poor attempt to ensure that
-- the error handling paths on decode are
-- exercised in some way.  Proper testing
-- would be rather more involved.

“Proper testing” indeed. All that I did in the original test was generate a random byte sequence, and see if it provoked the decoder into throwing an exception. The chances of such a dumb test really offering any value are not great, but I had more or less forgotten about it, and so I had a sense of security without the accompanying security. But hey, at least past-me had left a mea culpa note for present-day-me. Right?

While finding and fixing the bug took just a few minutes, I spent several more hours strengthening the test for the UTF-8 decoder, and this was far more interesting.

As a variable-length self-synchronizing encoding, UTF-8 is very clever and elegant, but its cleverness allows for a number of implementation bugs. For reference, here is a table (lightly edited from Wikipedia) of the allowable bit patterns used in UTF-8.

first code point   last code point   byte 1     byte 2     byte 3     byte 4
U+0000             U+007F            0xxxxxxx
U+0080             U+07FF            110xxxxx   10xxxxxx
U+0800             U+FFFF            1110xxxx   10xxxxxx   10xxxxxx
U+10000            U+1FFFFF          11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

The best known of these bugs involves accepting non-canonical encodings. What a canonical encoding means takes a little explaining. UTF-8 can represent any ASCII character in a single byte, and in fact every ASCII character must be represented as a single byte. However, an illegal two-byte encoding of an ASCII character can be achieved by starting with 0xC0, followed by the ASCII character with the high bit set. For instance, the ASCII forward slash U+002F is represented in UTF-8 as 0x2F, but a decoder with this bug would also accept 0xC0 0xAF (three- and four-byte encodings are of course also possible).

This bug may seem innocent, but it was widely used to remotely exploit IIS 4 and IIS 5 servers over a decade ago. Correct UTF-8 decoders must reject non-canonical encodings. (These are also known as overlong encodings.)
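The text library's strict decoder behaves correctly here; a quick check using decodeUtf8', which returns Left on invalid input instead of throwing:

```haskell
import qualified Data.ByteString as B
import Data.Either (isLeft)
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')

main :: IO ()
main = do
  -- the canonical one-byte encoding of "/" decodes successfully
  print (decodeUtf8' (B.pack [0x2F]) == Right (T.pack "/"))
  -- the overlong two-byte encoding of "/" is rejected
  print (isLeft (decodeUtf8' (B.pack [0xC0, 0xAF])))
```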

In fact, the bytes 0xC0 and 0xC1 will never appear in a valid UTF-8 bytestream, as they can only be used to start two-byte sequences that cannot be canonical.

To test our UTF-8 decoder’s ability to spot bogus input, then, we might want to generate byte sequences that start with 0xC0 or 0xC1. Haskell’s QuickCheck library provides us with just such a generating function, choose, which generates a random value in the given range (inclusive).

choose (0xC0, 0xC1)

Once we have a bad leading byte, we may want to follow it with a continuation byte. The value of a particular continuation byte doesn’t much matter, but we would like it to be valid. A continuation byte always contains the bit pattern 0x80 combined with six bits of data in its least significant bits. Here’s a generator for a random continuation byte.

contByte = (0x80 +) <$> choose (0, 0x3F)

Our bogus leading byte should be rejected immediately, since it can never generate a canonical encoding. For the sake of thoroughness, we should sometimes follow it with a valid continuation byte to ensure that the two-byte sequence is also rejected.

To do this, we write a general combinator, upTo, that will generate a list of up to n random values.

upTo :: Int -> Gen a -> Gen [a]
upTo n gen = do
  k <- choose (0,n)
  vectorOf k gen -- a QuickCheck combinator

And now we have a very simple way of saying “either 0xC0 or 0xC1, optionally followed by a continuation byte”.

-- invalid leading byte of a 2-byte sequence.
(:) <$> choose (0xC0,0xC1) <*> upTo 1 contByte

Notice in the table above that a 4-byte sequence can encode any code point up to U+1FFFFF. The highest legal Unicode code point is U+10FFFF, so by implication there exists a range of leading bytes for 4-byte sequences that can never appear in valid UTF-8.

-- invalid leading byte of a 4-byte sequence.
(:) <$> choose (0xF5,0xFF) <*> upTo 3 contByte

We should never encounter a continuation byte without a leading byte somewhere before it.

-- Continuation bytes without a start byte.
listOf1 contByte
-- The listOf1 combinator generates a list
-- containing at least one element.

Similarly, a bit pattern that introduces a 2-byte sequence must be followed by one continuation byte, so it’s worth generating such a leading byte without its continuation byte.

-- Short 2-byte sequence.
(:[]) <$> choose (0xC2, 0xDF)

We do the same for 3-byte and 4-byte sequences.

-- Short 3-byte sequence.
(:) <$> choose (0xE0, 0xEF) <*> upTo 1 contByte
-- Short 4-byte sequence.
(:) <$> choose (0xF0, 0xF4) <*> upTo 2 contByte

Earlier, we generated 4-byte sequences beginning with a byte in the range 0xF5 to 0xFF. Although 0xF4 is a valid leading byte for a 4-byte sequence, it’s possible for a perverse choice of continuation bytes to yield an illegal code point between U+110000 and U+13FFFF. This code generates just such illegal sequences.

-- 4-byte sequence greater than U+10FFFF.
k <- choose (0x11, 0x13)
let w0 = 0xF0 + (k `Bits.shiftR` 2)
    w1 = 0x80 + ((k .&. 3) `Bits.shiftL` 4)
([w0,w1]++) <$> vectorOf 2 contByte

Finally, we arrive at the general case of non-canonical encodings. We take a one-byte code point and encode it as two, three, or four bytes; and so on for two-byte and three-byte characters.

-- Overlong encoding.
k <- choose (0,0xFFFF)
let c = chr k
case k of
  _ | k < 0x80  -> oneof [
          let (w,x)     = ord2 c in return [w,x]
        , let (w,x,y)   = ord3 c in return [w,x,y]
        , let (w,x,y,z) = ord4 c in return [w,x,y,z] ]
    | k < 0x7FF -> oneof [
          let (w,x,y)   = ord3 c in return [w,x,y]
        , let (w,x,y,z) = ord4 c in return [w,x,y,z] ]
    | otherwise ->
          let (w,x,y,z) = ord4 c in return [w,x,y,z]
-- The oneof combinator chooses a generator at random.
-- Functions ord2, ord3, and ord4 break down a character
-- into its 2, 3, or 4 byte encoding.

Armed with a generator that uses oneof to choose one of the above invalid UTF-8 encodings at random, we embed the invalid bytestream in one of three cases: by itself, at the end of an otherwise valid buffer, and at the beginning of an otherwise valid buffer. This variety gives us some assurance of catching buffer overrun errors.
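Pulling the pieces together, the final generator has roughly this shape (a self-contained sketch with a subset of the cases; genInvalidUTF8 and genBuffer are my illustrative names, and the real test in the text repository covers more):

```haskell
import Data.Word (Word8)
import Test.QuickCheck

contByte :: Gen Word8
contByte = (0x80 +) <$> choose (0, 0x3F)

upTo :: Int -> Gen a -> Gen [a]
upTo n gen = do
  k <- choose (0, n)
  vectorOf k gen

-- one of the invalid-sequence generators above, chosen at random
genInvalidUTF8 :: Gen [Word8]
genInvalidUTF8 = oneof
  [ (:)   <$> choose (0xC0, 0xC1) <*> upTo 1 contByte  -- invalid 2-byte lead
  , (:)   <$> choose (0xF5, 0xFF) <*> upTo 3 contByte  -- invalid 4-byte lead
  , listOf1 contByte                                   -- orphan continuations
  , (:[]) <$> choose (0xC2, 0xDF)                      -- short 2-byte sequence
  ]

-- embed the invalid bytes alone, or at either end of valid ASCII
genBuffer :: Gen [Word8]
genBuffer = do
  bad   <- genInvalidUTF8
  ascii <- listOf (choose (0x20, 0x7E))
  elements [bad, ascii ++ bad, bad ++ ascii]
```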

Sure enough, this vastly more elaborate QuickCheck test immediately demonstrates the bug that Michael found.

The original test is a classic case of basic fuzzing: it simply generates random junk and hopes for the best. The fact that it let the decoder bug through underlines the weakness of fuzzing. If I had cranked the number of randomly generated test inputs up high enough, I’d probably have found the bug, but the approach of pure randomness would have caused the bug to remain difficult to reproduce and understand.

The revised test is much more sophisticated, as it generates only test cases that are known to be invalid, with a rich assortment of precisely generated invalid encodings to choose from. While it has the same probabilistic nature as the fuzzing approach, it excludes a huge universe of uninteresting inputs from being tested, and hence is much more likely to reveal a weakness quickly and efficiently.

The moral of the story: even QuickCheck tests, though vastly more powerful than unit tests and fuzz tests, are only as good as you make them!

Posted in haskell, open source, software

Open question: help me design a new encoding API for aeson

For a while now, I’ve had it in mind to improve the encoding performance of my Haskell JSON package, aeson.

Over the weekend, I went from hazy notion to a proof of concept for what I think could be a reasonable approach.

This post is a case of me “thinking out loud” about the initial design I came up with. I’m very interested in hearing if you have a cleaner idea.

The problem with the encoding method currently used by aeson is that it occurs via a translation to the Value type. While this is simple and uniform, it involves a large amount of intermediate work that is essentially wasted. When encoding a complex value, the Value that we build up is expensive, and it will become garbage immediately.

It should be much more efficient to simply serialize straight to a Builder, the type that is optimized for concatenating many short string fragments. But before marching down that road, I want to make sure that I provide a clean API that is easy to use correctly.

I’ve posted a gist that contains a complete copy of this proof-of-concept code.

{-# LANGUAGE GeneralizedNewtypeDeriving, FlexibleInstances,
    OverloadedStrings #-}

import Data.Monoid (Monoid(..), (<>))
import Data.Text (Text)
import Data.Text.Lazy.Builder (Builder, singleton)
import qualified Data.Text.Lazy.Builder as Builder
import qualified Data.Text.Lazy.Builder.Int as Builder

The core Build type has a phantom type that allows us to say “I am encoding a value of type t”. We’ll see where this type tracking is helpful (and annoying) below.

data Build a = Build {
    _count :: !Int
  , run    :: Builder
  }

The internals of the Build type would be hidden from users; here’s what they mean. The _count field tracks the number of elements we’re encoding of an aggregate JSON value (an array or object); we’ll see why this matters shortly. The run field lets us access the underlying Builder.

We provide three empty types to use as parameters for the Build type.

data Object
data Array
data Mixed

We’ll want to use the Mixed type if we’re cramming a set of disparate Haskell values into a JSON array; read on for more.

When it comes to gluing values together, the Monoid class is exactly what we need.

instance Monoid (Build a) where
    mempty = Build 0 mempty
    mappend (Build i a) (Build j b)
      | i > 0 && j > 0 = Build ij (a <> singleton ',' <> b)
      | otherwise      = Build ij (a <> b)
      where ij = i + j

Here’s where the _count field comes in; we want to separate elements of an array or object using commas, but this is necessary only when the array or object contains more than one value.

To encode a simple value, we provide a few obvious helpers. (These are clearly so simple as to be wrong, but remember: my purpose here is to explore the API design, not to provide a proper implementation.)

build :: Builder -> Build a
build = Build 1

int :: Integral a => a -> Build a
int = build . Builder.decimal

text :: Text -> Build Text
text = build . Builder.fromText

Encoding a JSON array is easy.

array :: Build a -> Build Array
array (Build 0 _)  = build "[]"
array (Build _ vs) = build $ singleton '[' <> vs <> singleton ']'

If we try this out in ghci, it behaves as we might hope.

?> array $ int 1 <> int 2

JSON puts no constraints on the types of the elements of an array. Unfortunately, our phantom type causes us difficulty here.

An expression of this form will not typecheck, as it’s trying to join a Build Int with a Build Text.

?> array $ int 1 <> text "foo"

This is where the Mixed type from earlier comes in. We use it to forget the original phantom type so that we can construct an array with elements of different types.

mixed :: Build a -> Build Mixed
mixed (Build a b) = Build a b

Our new mixed function gets the types to be the same, giving us something that typechecks.

?> array $ mixed (int 1) <> mixed (text "foo")

This seems like a fair compromise to me. A Haskell programmer will normally want the types of values in an array to be the same, so the default behaviour of requiring this makes sense (at least to my current thinking), but we get a back door for when we absolutely have to go nuts with mixing types.

The last complication stems from the need to build JSON objects. Each key in an object must be a string, but the value can be of any type.

-- Encode a key-value pair.
(<:>) :: Build Text -> Build a -> Build Object
k <:> v = Build 1 (run k <> ":" <> run v)

object :: Build Object -> Build Object
object (Build 0 _)   = build "{}"
object (Build _ kvs) = build $ singleton '{' <> kvs <> singleton '}'

If you’ve had your morning coffee, you’ll notice that I am not living up to my high-minded principles from earlier. Perhaps the types involved here should be something closer to this:

data Object a

(<:>) :: Build Text -> Build a -> Build (Object a)

object :: Build (Object a) -> Build (Object a)

(In which case we’d need a mixed-like function to forget the phantom types for when we want to get mucky and unsafe—but I digress.)

How does this work out in practice?

?> object $ "foo" <:> int 1 <> "bar" <:> int 3

Hey look, that’s more or less as we might have hoped!

Open questions, for which I appeal to you for help:

  • Does this design appeal to you at all?

  • If not, what would you change?

  • If yes, to what extent am I wallowing in the “types for thee, but not for me” sin bin by omitting a phantom parameter for Object?

Helpful answers welcome!

Posted in haskell, open source

Big fucking deal

Quoth Wikipedia:

Big data[1][2] is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage,[3] search, sharing, transfer, analysis,[4] and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to “spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”

Now what if we get tired of the current hype cycle?

Big fucking deal[1][2] is a collection of deals so fucking large and complex that it becomes difficult to process using on-hand fuck giving tools or traditional shit giving techniques. The challenges include capture, curation, storage,[3] search, sharing, transfer, analysis,[4] and all kinds of who the fuck knows what else. The trend to larger fucking deals is due to the additional shit derivable from giving a fuck about a single large fucking pile of related shit, as compared to separate smaller piles with the same total amount of bullshit, allowing correlations to be found to “spot business shit, determine quality of whatever, prevent some nasty shit, link legal shit right the fuck together, combat fucking crime no I am not making this up it’s like fucking Batman, and determine real-time traffic shittiness.”
Posted in Uncategorized

What’s in a number? criterion edition

[Edit: a few hours after I wrote this post, I wrote some code to get rid of the inflation phenomenon it describes, and I'll publish a corresponding update to criterion shortly. See below for details, and the bottom for a new chart that shows the effect of the fix.]

A couple of days ago, Alexey Khudyakov did a little digging into the accuracy of criterion’s measurements. I thought his results were interesting enough to be worth some deeper analysis.

First, let’s briefly discuss Alexey’s method and findings. He created 1,000 identical copies of a benchmark, and looked to see if the measurements changed over time. They did, slowly increasing in a linear fashion. (This is a phenomenon known to statisticians as serial correlation, where a measurement influences a future measurement.)

If every benchmark is the same, why do the measurements increase? Criterion does its book-keeping in memory: for every benchmark run, it saves a piece of data. Not until all benchmarks have finished does it write that data out to a Javascript+HTML report or CSV file.

I thought that the slow increase in measurements was probably related to this book-keeping, but how to test this hypothesis?

I created 200 identical copies of the same benchmark (I’m not patient enough to wait for 1,000!) and dumped a heap profile while it ran, then I plotted the numbers measured by criterion against heap size.

[chart: heap size versus measured time]

For this particular benchmark, criterion spends about 4% of its time in the garbage collector. The size of the heap increases by 300% as the program runs. Since the cost of a collection grows with the size of the heap, if we expect garbage collection overhead to affect measurements, then the time we measure should increase by about 4% × 3 = 12% as we repeat the benchmark over and over, slowly accumulating data.

This prediction exactly matches what we actually see: we start off measuring exp at 25.5 nanoseconds, and by the end we see it taking 28.5 nanoseconds.
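As a sanity check on the arithmetic (my own back-of-envelope calculation, not output from criterion):

```haskell
main :: IO ()
main = do
  let gcShare    = 0.04              -- fraction of time in GC at the start
      heapGrowth = 3.0               -- heap size increases by 300%
      inflation  = gcShare * heapGrowth
      start      = 25.5              -- nanoseconds per call at the outset
  putStrLn ("expected inflation:   " ++ show inflation)
  putStrLn ("predicted final time: " ++ show (start * (1 + inflation)) ++ " ns")
```

The predicted final time comes out at roughly 28.6 nanoseconds, right in line with the measured 28.5.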

The obvious next question is: how realistic a concern is this? A normal criterion program consists of a handful of benchmarks, usually all very different in what they do. I have not seen any cases of more than a few dozen benchmarks in a single suite. If only a few benchmarks get run, then there is essentially no opportunity for this inflation effect to become noticeable.

Nevertheless, I could definitely make some improvements (or even better, someone could contribute patches and I could continue to do nothing).

  • It would probably help to write data to a file after running each benchmark, and then to load that data back again before writing out the final report. [Edit: I wrote a patch that does just this; the increase in memory use vanishes, and along with it, the gradual inflation in measured times. Bingo!]

  • There is no benefit to looking for serial correlation across different benchmark runs, because nobody (except Alexey!) makes identical duplicates of a benchmark.

  • For the series of measurements collected for a single benchmark, it would probably be helpful to add an autocorrelation test, if only to have another opportunity to raise a red flag. Criterion is already careful to cry foul if its numbers look too messy, but first-order serial correlation would be likely to slip past the sophisticated tests it uses (like the bootstrap). I’ve long wanted to add a Durbin-Watson test, but I’ve been lazy for even longer.

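For reference, the Durbin-Watson statistic is only a few lines to compute over a series of residuals (the helper below is my own sketch, not anything criterion currently ships):

```haskell
-- Durbin-Watson statistic: d = sum (e_t - e_{t-1})^2 / sum e_t^2,
-- over residuals e from a fitted model (here, just the raw series).
-- d near 2 suggests no first-order serial correlation; d near 0
-- suggests positive correlation, d near 4 negative correlation.
-- Assumes at least two measurements.
durbinWatson :: [Double] -> Double
durbinWatson es =
    sum [ (a - b) ^ (2 :: Int) | (a, b) <- zip (tail es) es ]
    / sum [ e * e | e <- es ]
```

A slowly inflating series of measurements would show up here as a value well below 2.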
If you were to run every benchmark in a large suite one after the other in a single pass, then your final numbers could indeed be inflated by a few percent [edit: at least until I release the patch]. However, there are many other ways to confound your measurements, most of which will be far larger than this book-keeping effect.

  • If you simply change the order in which you run your benchmarks, this can dramatically affect the numbers you’ll see.

  • The size of the heap that the GHC runtime uses makes a big difference, as do the threaded runtime, number of OS threads, and use of the parallel garbage collector. Any of these can change performance by a factor of two or more (!).

  • You should close busy tabs in a web browser (or preferably quit it entirely), kill your mail client and antivirus software, and try to eliminate other sources of system noise. You’ll be surprised by how big a difference these can make; anywhere from a handful to a few hundred percent.

If you want high-quality numbers, it is best to run just one benchmark from a suite at a time, on the quietest system you can manage; to watch for criterion's warnings about outliers affecting results; and to always compare several runs to see whether your measurements are stable.

[Edit: Here is a chart of the measurements with the bug fixed, complete with a linear fit to indicate that the numbers are basically flat. Hooray!]

Posted in haskell

What’s good for C++ is good for … Haskell!?

A few days ago, my Facebook colleague Andrei Alexandrescu posted a note entitled Three Optimization Tips for C++, which reminded me that I had unfinished business with Haskell’s text package.

I took his code, applied it to the text package, and this is the story of what happened.

The text package provides a type named Builder for efficiently constructing Unicode strings from smaller fragments. A couple of years after Johan contributed the initial Builder implementation, I wrote some helper functions to perform recurring tasks, such as rendering numbers.

In my initial number renderer, I put no effort into making my code fast—as this snippet demonstrates. Simplicity first.

positive :: (Integral a) => a -> Builder
positive = go
  where
    go n | n < 10    = digit n
         | otherwise = go (n `quot` 10) <> digit (n `rem` 10)
    digit = singleton . intToDigit . fromIntegral

Having gotten the code working, I promptly forgot about it for a while. When I saw Andrei’s note nine months later, it tickled my memory: didn’t I have code that I could possibly improve? Indeed, when I took a look at my number rendering code, I found that I hadn’t even bothered to write benchmarks to measure its performance.

Before I discuss how I improved it, a brief description of how a Builder works is in order. A Builder provides a safe way to destructively write data into a list of fixed-size buffers, and to convert the result into an immutable Text value. External users of Builder never see the mutable buffers; they can only use safe access methods.

The code above uses the safe API. Each use of singleton is bounds-checked internally: if a write will not fit into the current buffer, Builder “finalizes” that buffer (putting it at the tail of the list), allocates a new one, and starts writing there. The <> operator sequences the writes.

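In everyday use, the safe API looks like this (a minimal usage sketch of Data.Text.Lazy.Builder, not taken from the package's own documentation):

```haskell
import Data.Monoid ((<>))
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import Data.Text.Lazy.Builder (fromText, singleton, toLazyText)

-- Assemble fragments with (<>), then freeze the result into an
-- immutable lazy Text; the mutable buffers never escape.
greeting :: TL.Text
greeting = toLazyText (fromText (T.pack "foo") <> singleton '!')
```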
While this implementation is correct, it is far from fast, partly due to the overhead of performing a bounds check for every character to be written out. In fact, this approach is almost always slower than simply using the show function, then converting the resulting [Char] value to a Builder!

My first observation was that if I knew the number of digits I needed up front, I could perform just one buffer-size check, instead of a separate check prior to rendering each digit.

countDigits :: (Integral a) => a -> Int
countDigits v0 = go 1 (fromIntegral v0 :: Word64)
  where go !k v
           | v < 10    = k
           | v < 100   = k + 1
           | v < 1000  = k + 2
           | v < 10000 = k + 3
           | otherwise = go (k+4) (v `quot` 10000)

This is almost identical to Andrei’s second digits10 function.

Once I was able to count digits, I could use the internal writeN function to destructively update the buffer. writeN ensures that a buffer is available with at least the requested amount of space; it then calls a user-supplied function, giving it the buffer to write to (marr) and the position to start writing at (off).

positive :: (Integral a) => a -> Builder
positive i
    -- we win when we special-case single-digit numbers
    | i < 10    = writeN 1 $ \marr off ->
                  unsafeWrite marr off (i2w i)
    | otherwise = let !n = countDigits i
                  in writeN n $ \marr off ->
                     posDecimal marr off n i

posDecimal :: (Integral a) =>
              forall s. MArray s -> Int -> Int -> a -> ST s ()
posDecimal marr off0 ds v0 = go (off0 + ds - 1) v0
  where go off v
          | v < 10 = unsafeWrite marr off (i2w v)
          | otherwise = do
              unsafeWrite marr off (i2w (v `rem` 10))
              go (off-1) (v `quot` 10)

i2w :: (Integral a) => a -> Word16
i2w v = 48 + fromIntegral v

I then took Andrei’s very clever third digits10 function and translated that to Haskell. Syntax aside, there is a small difference between his function and mine: his is recursive, while mine is tail recursive (i.e. a loop).

countDigits :: (Integral a) => a -> Int
countDigits v0 = go 1 (fromIntegral v0 :: Word64)
  where
    go !k v
      | v < 10    = k
      | v < 100   = k + 1
      | v < 1000  = k + 2
      | v < 1000000000000 =
          k + if v < 100000000
              then if v < 1000000
                   then if v < 10000
                        then 3
                        else 4 + fin v 100000
                   else 6 + fin v 10000000
              else if v < 10000000000
                   then 8 + fin v 1000000000
                   else 10 + fin v 100000000000
      | otherwise = go (k + 12) (v `quot` 1000000000000)
    fin v n = if v >= n then 1 else 0

(To be sure my intuition was correct, I did indeed measure recursive against tail recursive versions of my Haskell translation, and tail recursion wins by a few percent here.)

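A quick sanity check for either version (the property below is my own, not from the package): the clever count must agree with the obvious specification, especially at the powers-of-ten boundaries.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Word (Word64)

-- The simple four-way version from above.
countDigits :: (Integral a) => a -> Int
countDigits v0 = go 1 (fromIntegral v0 :: Word64)
  where go !k v
           | v < 10    = k
           | v < 100   = k + 1
           | v < 1000  = k + 2
           | v < 10000 = k + 3
           | otherwise = go (k+4) (v `quot` 10000)

-- The obvious, slow specification to test against.
naiveCountDigits :: Word64 -> Int
naiveCountDigits = length . show
```

Checking the two against each other at values like 9, 10, 99, 100 and maxBound catches exactly the off-by-one mistakes this style of code invites.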
While this countDigits function helped performance by quite a bit, there was another step remaining in following Andrei’s example: converting two digits at a time.

posDecimal :: (Integral a) =>
              forall s. MArray s -> Int -> Int -> a -> ST s ()
posDecimal marr off0 ds v0 = go (off0 + ds - 1) v0
  where go off v
           | v >= 100 = do
               let (q, r) = v `quotRem` 100
               write2 off r
               go (off - 2) q
           | v < 10    = unsafeWrite marr off (i2w v)
           | otherwise = write2 off v
        -- digits is a 200-byte table of the ASCII pairs "00" through
        -- "99", so the two digits of i live at indexes 2*i and 2*i+1
        write2 off i0 = do
          let i = fromIntegral i0; j = i + i
          unsafeWrite marr off $ get (j + 1)
          unsafeWrite marr (off - 1) $ get j
        get = fromIntegral . B.unsafeIndex digits

A final surprise came when I decided to try an experiment: what if I replaced the separate uses of quot and rem with a single use of quotRem? This improved performance by a further 30% on large 64-bit numbers! Why such a big difference? Because quotRem can often be emitted as a single machine instruction instead of two, and division is expensive enough that in a hot loop, this helps a lot.

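The two-digits-at-a-time trick is easier to see in a pure setting. The sketch below rebuilds the lookup table as the hundred ASCII pairs "00" through "99" (my reconstruction of the digits table the real code indexes) and pays for one division per two digits:

```haskell
import Data.Char (chr)

-- "000102...9899": the two ASCII digits of each i in 0..99,
-- stored at offsets 2*i and 2*i+1.
digitPairs :: String
digitPairs = concatMap pair [0..99]
  where pair n = [chr (48 + n `quot` 10), chr (48 + n `rem` 10)]

-- Render a non-negative number, peeling off two digits per division.
renderPositive :: Int -> String
renderPositive = go ""
  where
    go acc v
      | v >= 100  = let (q, r) = v `quotRem` 100
                    in go (pairAt r ++ acc) q
      | v < 10    = chr (48 + v) : acc
      | otherwise = pairAt v ++ acc
    pairAt i = take 2 (drop (2 * i) digitPairs)
```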
Many modern optimizing compilers can spot this kind of opportunity automatically. Although GHC’s optimizer performs many complex high-level transformations, its machinery for handling low-level optimizations is currently weak. (This is why you’ll see a few cases of v+v instead of v*2 above, where I’m strength-reducing operations by hand instead of trusting the compiler.)

I was not at all surprised that Andrei’s optimization tips should translate perfectly to Haskell, as most of what he says has nothing to do with C++ itself. His advice should apply well to any language that provides low-level access to the machine and ends up running as native code. Since Haskell is often not understood to be a perfectly good imperative language, it’s easy for less experienced programmers to overlook the relevance of low-level concerns to Haskell performance.

The end result of all this tweaking is that the new number renderer in the text package is not just much faster than its predecessor, it is also a lot faster than the venerable show function. We retain the same API, external immutability, and type safety as before, but have a very nice five-fold increase in performance to show for our efforts!

Posted in haskell

A major new release of the Haskell hashable library

I have spent quite some time over the last couple of months improving the Haskell hashable library, and all of my efforts eventually turned into a near-complete rewrite of the library.

The 1.2 release of hashable is not backwards compatible, and for several good reasons. Read on for the details.

Hash flooding

The threat landscape faced by applications that use hash-based data structures has shifted radically this year, with hash flooding emerging as an attack vector after a long period of obscurity (the earliest publication I know of on the subject dates back to 1998, but I doubt that it was new even then).

We know that both networked and local applications that use weak hash algorithms are vulnerable, with new attacks being published all the time.

Recently, even modern hash algorithms such as the MurmurHash family and CityHash have succumbed to surprising attacks via differential cryptanalysis, in which large numbers of collisions can be generated efficiently without knowledge of the salt.


The SipHash algorithm offers a strong defence against differential cryptanalysis. SipHash is now the algorithm used for hashing all string-like types.

We automatically choose optimized implementations for 64-bit and 32-bit platforms, depending on their capabilities (e.g. availability of SSE2 on 32-bit systems).

For inputs of just a few bytes in length, SipHash is about half the speed of the previous string hash algorithm, FNV-1. Its performance breaks even with FNV-1 at about 50 bytes, and it improves to become more than twice as fast for large inputs.

(FNV-1 is also particularly easy to attack, so escaping from it was imperative, even if that cost a little performance.)

Random choice of salt

To provide an additional measure of security against hash flooding attacks, the standard hash function chooses a random salt at program startup time, using the system’s cryptographic pseudo-random number generator (/dev/urandom on Unix, CryptGenRandom on Windows).

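On Unix, fetching such a salt takes only a few lines. Here is a sketch of the idea (not the library's actual initialization code, and Unix-only, whereas the real library also covers Windows):

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Char (ord)
import Data.Word (Word64)
import System.IO (IOMode (ReadMode), hGetChar, withBinaryFile)

-- Read eight bytes from the system's cryptographic PRNG and pack
-- them, big end first, into a Word64 salt.
randomSalt :: IO Word64
randomSalt = withBinaryFile "/dev/urandom" ReadMode $ \h -> do
    bytes <- sequence (replicate 8 (fmap (fromIntegral . ord) (hGetChar h)))
    return (foldl (\acc b -> (acc `shiftL` 8) .|. b) 0 bytes)
```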
Effortless, fast hashing

I have also demoted hash out of the Hashable typeclass, so the API now looks like this.

class Hashable a where
    hashWithSalt :: Int -- salt
                 -> a   -- value to hash
                 -> Int -- result

hash :: Hashable a => a -> Int
hash = hashWithSalt defaultSalt

In other words, if you are writing an instance of Hashable by hand, you had better pay attention to the salt, and if you use the friendly hash function, you get the randomly-chosen default salt every time.

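For instance, a hand-written instance for a made-up record type threads the salt through each field in turn. To keep the snippet self-contained I have inlined a toy version of the class and some toy base instances; the real library's instances mix far more carefully:

```haskell
-- Toy stand-in for Data.Hashable, for illustration only.
class Hashable a where
    hashWithSalt :: Int -> a -> Int

instance Hashable Int where
    hashWithSalt salt x = salt * 16777619 + x  -- toy mixing only

instance Hashable Char where
    hashWithSalt salt c = hashWithSalt salt (fromEnum c)

instance Hashable a => Hashable [a] where
    hashWithSalt = foldl hashWithSalt

-- A made-up record type, hashed by chaining the salt field by field.
data Person = Person String Int

instance Hashable Person where
    hashWithSalt salt (Person name age) =
        salt `hashWithSalt` name `hashWithSalt` age
```

The chaining matters: simply combining field hashes with xor, say, would let distinct values collide trivially.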
Now that writing Hashable instances by hand is slightly more painful, I have also made it almost effortless to hash your custom datatypes.

Using GHC’s no-longer-so-new generics machinery, you can have an efficient Hashable instance generated “for free” with just a few lines of code. Good code is generated for both product and sum types.

{-# LANGUAGE DeriveGeneric #-}

import GHC.Generics

data MyType = MyStr String
            | MyInt Integer
    deriving (Generic)

instance Hashable MyType

If your types are polymorphic, no problem; they’re just as easy to generate hash functions for.

data Poly a = Poly a
    deriving (Generic)

instance Hashable a => Hashable (Poly a)

Improved avalanche for basic types

We do not currently use SipHash to hash basic types such as numbers, as strings are by a huge margin the most common vector for hash flooding attacks.

Nevertheless, we’ve made improvements to our hashing of basic types. Previous versions of hashable did nothing at all, meaning that the hash of a number was simply the identity function. While (obviously) fast and good for locality of reference in a few narrow cases, the identity function makes a terrible hash for many data structures.

For instance, if a hash table uses open addressing with linear probing, the identity hash can cause quadratic performance when inserting values into the table that are identical modulo the table size.

It’s generally considered desirable for a hash function to have strong “avalanche” properties, meaning that a single-bit change in an input should cause every bit of the output to have a 50% probability of changing.

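Avalanche is also easy to measure empirically: flip each input bit in turn and count how many output bits change. A sketch, using the well-known SplitMix64 finalizer as a stand-in mixer (not necessarily what hashable itself ships):

```haskell
import Data.Bits (bit, popCount, shiftR, xor)
import Data.Word (Word64)

-- The SplitMix64 finalizer: a mixer known to have good avalanche.
mix64 :: Word64 -> Word64
mix64 z0 =
    let z1 = (z0 `xor` (z0 `shiftR` 30)) * 0xbf58476d1ce4e5b9
        z2 = (z1 `xor` (z1 `shiftR` 27)) * 0x94d049bb133111eb
    in z2 `xor` (z2 `shiftR` 31)

-- Average fraction of output bits that flip when one input bit
-- flips; a perfect avalanche would score 0.5.
avalanche :: (Word64 -> Word64) -> [Word64] -> Double
avalanche f keys =
    total / fromIntegral (length keys * 64 * 64)
  where
    total = fromIntegral (sum
        [ popCount (f k `xor` f (k `xor` bit i))
        | k <- keys, i <- [0 .. 63] ])
```

The identity function scores a dismal 1/64 on this test, since flipping one input bit flips exactly one output bit.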
For numeric types, we now use a couple of algorithms developed by Thomas Wang that are both fast and have good avalanche properties. These functions are more expensive than doing nothing, but quite a lot faster than SipHash. If it turns out that integers become a popular vector for hash flooding, we may well switch to SipHash for everything at some point.

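For reference, here is one of Wang's 64-bit mixers as commonly published (whether this is the exact variant hashable uses is an assumption on my part):

```haskell
import Data.Bits (complement, shiftL, shiftR, xor)
import Data.Word (Word64)

-- Thomas Wang's 64-bit integer hash. Every step (an odd-multiplier
-- affine map or an xor-shift) is invertible modulo 2^64, so the
-- whole function is a bijection on Word64.
wang64 :: Word64 -> Word64
wang64 k0 =
    let k1 = complement k0 + (k0 `shiftL` 21)
        k2 = k1 `xor` (k1 `shiftR` 24)
        k3 = (k2 + (k2 `shiftL` 3)) + (k2 `shiftL` 8)   -- k2 * 265
        k4 = k3 `xor` (k3 `shiftR` 14)
        k5 = (k4 + (k4 `shiftL` 2)) + (k4 `shiftL` 4)   -- k4 * 21
        k6 = k5 `xor` (k5 `shiftR` 28)
    in k6 + (k6 `shiftL` 31)
```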
In summary

This release of the hashable library is a fairly big deal. While the performance implications aren’t entirely happy, I’ve done my best to write the fastest possible code.

I hope you’ll agree that the improvements in ease of use, security, and applicability (thanks to no longer using identity hashes) are more than worth the cost. Happy hacking!

Posted in haskell

A fast new SipHash implementation in Haskell

I’ve recently been talking with Johan Tibell about submitting his hashable package to become a part of the Haskell Platform.

Once we get that submission accepted, we can fold Johan’s excellent hash-based data structures from his unordered-containers package into the standard containers package. For many applications, the hash array mapped tries in unordered-containers offer a decent performance advantage over the current standard of ordered search trees.

Prior to making the submission, I found myself wanting to spring clean the hashable package. Although it is simple and well put together, it has some robustness problems that are important in practice.

Chief among these weaknesses is that it uses the FNV-1 hash, which is well known to be vulnerable to collisions. Since it currently uses FNV-1 with a fixed key, Internet-facing web applications that use hash trees are at risk.

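For context, FNV-1 is tiny, which goes a long way toward explaining its popularity. Here is a 64-bit sketch using the offset basis and prime from the FNV specification:

```haskell
import Data.Bits (xor)
import Data.Char (ord)
import Data.Word (Word64)

-- 64-bit FNV-1: multiply by the FNV prime, then xor in each byte.
-- With a fixed starting value like this, an attacker can compute
-- colliding inputs offline, which is the heart of the problem.
fnv1 :: String -> Word64
fnv1 = foldl step 14695981039346656037
  where step h c = (h * 1099511628211) `xor` fromIntegral (ord c)
```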
The current state of the art in reasonably fast, secure hash algorithms is Aumasson and Bernstein’s SipHash. There already existed a Haskell implementation in the form of the siphash package, but I thought it was worth implementing a version of my own.

My criteria for developing a new Haskell SipHash implementation were that it had to be fast and easily reusable. I have a version in progress right now, and so far it is turning out well.

The main features of this implementation (specifically when hashing ByteStrings) are as follows:

  • All arithmetic is unboxed. I carefully inspected the generated Core at every step to ensure this would be the case. (It's usually easier to improve a code base that starts out with good performance properties than to retrofit good performance into existing code.)

  • The main loop performs 8-byte reads when possible, and if the final block is less than 8 bytes, unrolls that last bytewise fill to be as efficient as possible.

  • The common cases of SipHash's c and d parameters being fixed at 2 and 4 are unrolled.

  • The implementation is written in continuation passing style. This had the happy consequence of making unrolling particularly easy (read the source to see).

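The 8-byte reads in the first two bullets amount to assembling a little-endian Word64 from consecutive bytes. A bounds-checked sketch of the idea (the helper name is mine; the real code works on raw pointers and skips the checks):

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, (.|.))
import Data.Word (Word64)

-- Read the 8 bytes at the given offset as a little-endian Word64.
-- Assumes at least 8 bytes are available starting at the offset.
word64LE :: B.ByteString -> Int -> Word64
word64LE bs off = foldr step 0 [0 .. 7]
  where
    step i acc =
        (acc `shiftL` 8) .|. fromIntegral (B.index bs (off + i))
```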
It’s worth looking at what “unrolling” means in this context. The SipHash algorithm is organized around a loop that performs a “round” a small number of times, where a round is a step of an ARX (add-rotate-xor) cipher.

sipRound :: (Sip -> r) -> Sip -> r

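Behind that signature, the round body is the standard SipHash ARX step over four 64-bit state words. A sketch, with a hypothetical strict Sip record standing in for the real code's unboxed state:

```haskell
import Data.Bits (rotateL, xor)
import Data.Word (Word64)

data Sip = Sip !Word64 !Word64 !Word64 !Word64
    deriving (Eq, Show)

-- One SipHash round: adds, rotates, and xors, then hands the new
-- state to the continuation k.
sipRound :: (Sip -> r) -> Sip -> r
sipRound k (Sip v0 v1 v2 v3) =
    let v0a = v0 + v1
        v1a = (v1 `rotateL` 13) `xor` v0a
        v0b = v0a `rotateL` 32
        v2a = v2 + v3
        v3a = (v3 `rotateL` 16) `xor` v2a
        v0c = v0b + v3a
        v3b = (v3a `rotateL` 21) `xor` v0c
        v2b = v2a + v1a
        v1b = (v1a `rotateL` 17) `xor` v2b
        v2c = v2b `rotateL` 32
    in k (Sip v0c v1b v2c v3b)
```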
Each 8-byte block of input is put through two rounds. After processing is complete, the state is put through four rounds.

Using a CPS representation, we can easily special-case these values so that each can be executed with just a single conditional branch.

runRounds :: Int -> (Sip -> r) -> Sip -> r
runRounds 2 k = sipRound (sipRound k)
runRounds 4 k = sipRound (sipRound (sipRound (sipRound k)))

GHC 7.6 generates excellent Core for the above code, inlining sipRound and the continuation k everywhere, and avoiding both jumps and uses of boxed values.

I measured the performance of my in-progress SipHash implementation against the FNV-1 hash currently used by the hashable package, and against two versions of Vincent Hanquez’s siphash package. Different input sizes are across the x axis; the y axis represents speedup relative to FNV-1 (less than 1 is slower, greater is faster).

My SipHash implementation starts off at a little over half the speed of FNV-1 on 5-byte inputs; breaks even at around 45 bytes; and reaches twice the speed at 512 bytes. For comparison, I included a C implementation; my Haskell code is just 10% slower. I’m pretty pleased by these numbers.

(After I published an initial benchmark, Vincent Hanquez sped up his siphash package by a large amount in just a couple of hours. The dramatic effect is shown above in the huge jump in performance between versions 1.0.1 and 1.0.2 of his package. This demonstrates the value of even a brief focus on performance!)

Incidentally, SipHash is another case where compiling with -fllvm really helps; compared to GHC’s normal code generator, I saw speed jump by a very pleasing 35%. (Sure enough, the numbers I show above are with -fllvm.)

With some extra work and a little time to focus, this new code should be ready to incorporate into the hashable package, hopefully within a few days.

Posted in haskell

The case of the mysterious explosion in space

A few months ago, reports began to filter in of an unhappy problem with the Haskell text package: it was causing huge object files to be generated when a file contained lots of string literals.

I didn’t notice the initial report (it was posted to a busy mailing list that I don’t try to keep up with), but Michael Snoyman was kind enough to take that message and file a bug.

The culprit was this very simple function definition, which converts a string from the venerable Haskell String type to the more modern Text.

pack :: String -> Text
pack txt = unstream
             (Stream.map safe
               (Stream.streamList txt))

The definition of pack is too innocent to be at fault; the problem lies with the extra directives that text gives to the compiler.

{-# INLINE pack #-}
{-# INLINE Stream.unstream #-}
{-# INLINE safe #-}
{-# INLINE Stream.streamList #-}

By asking the compiler to inline every function, we guaranteed that every string literal would result in a lot of code being generated. Worse, all of this space would be entirely redundant, consisting of repeated copies of exactly the same code.

(You might well wonder why we’d insist on inlining any of these functions, if the cost in space is so high. The answer is that inlining is key to why the text package achieves good performance. That deserves an article of its own, so I’ll return to the subject soon.)

Have you ever wondered how GHC represents string literals? Instead of somehow statically constructing a linked list of characters and emitting that into an object file, it’s smarter.

For strings of pure ASCII, GHC generates a packed zero-terminated byte sequence that looks like this.

.align 3
.align 0
        .byte   102    # f
        .byte   111    # o
        .byte   111    # o
        .byte   0

(For strings that contain Unicode or control characters, GHC still generates a packed sequence of bytes, but this time they’re specially encoded.)

In Haskell, these byte sequences have a type that is simply a fixed address, Addr#. During compilation, GHC takes a string literal and prefixes it with a function to convert from Addr# to String.

-- What we write:
foo :: String
foo = "foo"

-- What GHC generates:
foo :: String
foo = GHC.CString.unpackCString# co0_str
  where co0_str :: Addr#
        co0_str = "foo"

One of the lovelier features of GHC is that it exposes some of its internal machinery to authors. We’re paying a price for our aggressive use of its INLINE directive; is there another GHC feature we can use to save the day?

Enter the rewrite rule, a way of telling GHC how to perform source-to-source transformations.

Here is a naive attempt to specify a rewrite rule that might help us. First, we define a version of pack that we tell the compiler to never inline, then we supply a rewrite rule that tells the compiler to substitute the never-inlined version of pack for the normal version.

packNOINLINE :: String -> Text
packNOINLINE = unstream . safe . Stream.streamList

{-# RULES "TEXT literal" forall s.
    pack s = packNOINLINE s #-}

Although this rule works and generates correct code, it swaps one problem for another: the object files we generate shrink dramatically, but we’ve defeated some of the compiler’s opportunities to improve the code it emits.

Oh, and consider what happens during compilation: after GHC has finished processing a string literal, we start out with an Addr#, then GHC converts it to a String for us, and finally we convert that to a Text. That intermediate String galls me, even though it really has no practical consequences.

Happily for us, GHC’s rewrite rules are applied cleverly: rather than being a simple one-shot affair, GHC keeps trying to apply rewrite rules as it optimises a program.

The critical addition to our rule is to recognise that when we write a string literal, it will be transformed into an application of GHC.CString.unpackCString#, and to target our rule at an expression containing this.

-- Introduce a new function ...
Text.unpackCString# :: Addr# -> Text
Text.unpackCString# addr# 
  = unstream (Stream.streamCString# addr#)
{-# NOINLINE unpackCString# #-}

-- ... and use it!
{-# RULES "TEXT literal" forall a.
    unstream ( safe
        (GHC.String.unpackCString# a)))
      = Text.unpackCString# a #-}

With this rewrite rule, GHC will transform code that we never actually wrote, using a type (Addr#) that we don’t use in our code. The conditions that trigger this rule will arise only when we define a literal Text value. This means that productive uses of stream fusion will not be affected. Even better, this rule eliminates that pesky intermediate String value, since the new unpackCString# performs a direct translation. Not a bad trick!

Posted in haskell