Sometimes, the old ways are the best

Posted on 2015-05-13 by Bryan O'Sullivan — 5 Comments ↓

Over the past few months, the Sigma engineering team at Facebook has rolled out a major Haskell project: a rewrite of Sigma, an important weapon in our armory for fighting spam and malware.

Sigma has a mission-critical job, and it needs to scale: its growing workload currently sees it handling tens of millions of requests per minute.

The rewrite of Sigma in Haskell, using the Haxl library that Simon Marlow developed, has been a success. Throughput is higher than under its predecessor, and CPU usage is lower. Sweet!

Nevertheless, success brings with it surprises, and even though I haven’t worked on Sigma or Haxl, I’ve been implicated in one such surprise. To understand my accidental bit part in the show, let's begin by mentioning that Sigma uses JSON internally for various purposes. These days, the Haskell-powered Sigma uses aeson, the JSON library I wrote, to handle JSON data.

A few months ago, the Haxl rewrite of Sigma was going through an episode of crazytown, in which it would intermittently and unpredictably use huge amounts of CPU and memory. The culprit turned out to be JSON strings containing zillions of backslashes. (I have no idea why. If you’ve worked with large volumes of data for a long time, you won’t even bat an eyelash at the idea that a data store somewhere contains some really weird records.)

The team quickly mitigated the problem, and gave me a nudge that I might want to look into the problem. On Sunday evening, with a glass of red wine in hand, I finally dove in to see what was wrong.

Since the Sigma developers had figured out what was causing these time and space explosions, I immediately had a test case to work with, and the results were grim: decoding a mere megabyte of continuous backslashes took over a second, consumed over a gigabyte of memory, and killed concurrency by causing the runtime system to spend almost 90% of its time in the garbage collector. Yikes!

Whatever was going on? If you look at the old implementation of aeson’s unescape function, it seems quite efficient and innocuous. It’s reasonably tightly optimized low-level Haskell.

Trouble is, unescape uses an API (a bytestring builder) that is intended for streaming a result incrementally. Unfortunately the unescape function can’t hand any data back to its caller until it has processed an entire string.

The result is as you’d expect: we build a huge chain of thunks. In this case, the thunks will eventually write data efficiently into buffers. Alas, the thunks have nobody demanding the evaluation of their contents. This chain consumes a lot (a lot!) of memory and incurs a huge amount of GC overhead (long chains of thunks are expensive). Sadness ensues.

The “old ways” in the title refer to the fix: in place of a fancy streaming API, I simply allocate a single big buffer and blast the bytes straight into it.

For that pathological string with almost a megabyte of consecutive backslashes, the new implementation is 27x faster and uses 42x less memory, all for the cost of perhaps an hour of Sunday evening hacking (including a little enabling work that incidentally illustrates just how easy it is to work with monad transformers). Not bad!

Posted in haskell

criterion 1.0

Posted on 2014-08-08 by Bryan O'Sullivan — 1 Comment ↓

Almost five years after I initially released criterion, I'm delighted to announce a major release with a large number of appealing new features.

As always, you can install the latest goodness using cabal install criterion, or fetch the source from github.

Please let me know if you find criterion useful!

New documentation

I built both a home page and a thorough tutorial for criterion. I've also extended the inline documentation and added a number of new examples.

All of the documentation lives in the github repo, so if you'd like to see something improved, please send a bug report or pull request.

New execution engine

Criterion's model of execution has evolved, becoming vastly more reliable and accurate. It can now measure events that take just a few hundred picoseconds.

benchmarking return ()
time                 512.9 ps   (512.8 ps .. 513.1 ps)

While almost all of the core types have changed, criterion should remain API-compatible with the vast majority of your benchmarking code.

New metrics

In addition to wall-clock time, criterion can now measure and regress on the following metrics:

CPU time
CPU cycles
bytes allocated
number of garbage collections
number of bytes copied during GC
wall-clock time spent in mutator threads
CPU time spent running mutator threads
wall-clock time spent doing GC
CPU time spent doing GC

Linear regression

Criterion now supports linear regression of a number of metrics.

Here's a regression conducted using --regress cycles:iters:

cycles:              1.000 R²   (1.000 R² .. 1.000 R²)
  iters              47.718     (47.657 .. 47.805)

The first line of the output is the R² goodness-of-fit measure for this regression, and the second is the number of CPU cycles (measured using the rdtsc instruction) to execute the operation in question (integer division).

This next regression uses --regress allocated:iters to measure the number of bytes allocated while constructing an IntMap of 40,000 values.

allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              4.382e7    (4.379e7 .. 4.384e7)

(That's a little under 42 megabytes.)

New outputs

While its support for active HTML has improved, criterion can also now output JSON and JUnit XML files.

New internals

Criterion has received its first spring cleaning, and is much easier to understand as a result.

Acknowledgments

I was inspired into some of this work by the efforts of the authors of the OCaml Core_bench package.

Posted in haskell, open source

Win bigger statistical fights with a better jackknife

Posted on 2014-06-10 by Bryan O'Sullivan — 2 Comments ↓

(Summary: I’ve developed some algorithms for a statistical technique called the jackknife that run in O(n) time instead of O(n²).)

In statistics, an estimation technique called “the jackknife” has been widely used for over half a century. It’s a mainstay for taking a quick look at the quality of an estimator of a sample. (An estimator is a summary function over a sample, such as its mean or variance.)

Suppose we have a noisy sample. Our first stopping point might be to look at the variance of the sample, to get a sense of how much the values in the sample “spread out” around the average.

If the variance is not close to zero, then we know that the sample is somewhat noisy. But our curiosity may persist: is the variance unduly influenced by a few big spikes, or is the sample consistently noisy? The jackknife is a simple analytic tool that lets us quickly answer questions like this. There are more accurate, sophisticated approaches to this kind of problem, but they’re not nearly so easy to understand and use, so the jackknife has stayed popular since the 1950s.

The jackknife is easy to describe. We take the original sample, drop the first value out, and calculate the variance (or whatever the estimator is) over this subsample. We repeat this, dropping out only the second value, and continue. For an original sample with n elements, we end up with a collection of n jackknifed estimates of all the subsamples, each with one element left out. Once we’re done, there’s an optional last step: we compute the mean of these jackknifed estimates, which gives us the jackknifed variance.

For example, suppose we have the sample [1,3,2,1]. (I’m going to write all my examples in Haskell for brevity, but the code in this post should be easy to port to any statistical language.)

The simplest way to compute variance is as follows:

var xs = (sum (map (^2) xs) - sum xs ^ 2 / n) / n
  where n = fromIntegral (length xs)

Using this method, the variance of [1,3,2,1] is 0.6875.

To jackknife the variance:

var [1,3,2,1]  == 0.6875

-- leave out each element in succession
-- (I'm using ".." to denote repeating expansions)
var [  3,2,1]  == 0.6666..
var [1,  2,1]  == 0.2222..
var [1,3,  1]  == 0.8888..
var [1,3,2  ]  == 0.6666..

-- compute the mean of the estimates over the subsamples
mean [0.6666,0.2222,0.8888,0.6666]
               == 0.6111..

Since 0.6111 is quite different than 0.6875, we can see that the variance of this sample is affected rather a lot by bias.

While the jackknife is simple, it’s also slow. We can easily see that the approach outlined above takes O(n²) time, which means that we can’t jackknife samples above a modest size in a reasonable amount of time.

This approach to the jackknife is the one everybody actually uses. Nevertheless, it’s possible to improve the time complexity of the jackknife for some important estimators from O(n²) to O(n). Here’s how.

Jackknifing the mean

Let’s start with the simple case of the mean. Here’s the obvious way to measure the mean of a sample.

mean xs = sum xs / n
  where n = fromIntegral (length xs)

And here are the computations we need to perform during the naive approach to jackknifing the mean.

-- n = fromIntegral (length xs - 1)
sum [  3,2,1] / n
sum [1,  2,1] / n
sum [1,3,  1] / n
sum [1,3,2  ] / n

Let’s decompose the sum operations into two triangles as follows, and see what jumps out:

sum [  3,2,1] = sum [] + sum [3,2,1]
sum [1,  2,1] = sum [1]  + sum [2,1]
sum [1,3,  1] = sum [1,3]  + sum [1]
sum [1,3,2  ] = sum [1,3,2] + sum []

From this perspective, we’re doing a lot of redundant work. For example, to calculate sum [1,3,2], it would be very helpful if we could reuse the work we did in the previous calculation to calculate sum [1,3].

Prefix sums

We can achieve our desired reuse of earlier work if we store each intermediate sum in a separate list. This technique is called prefix summation, or (if you’re a Haskeller) scanning.

Here’s the bottom left triangle of sums we want to calculate.

sum [] {- + sum [3,2,1] -}
sum [1]  {- + sum [2,1] -}
sum [1,3]  {- + sum [1] -}
sum [1,3,2] {- + sum [] -}

We can prefix-sum these using Haskell’s standard scanl function.

>>> init (scanl (+) 0 [1,3,2,1])
[0,1,4,6]

{- e.g. [0,
         0 + 1,
         0 + 1 + 3,
         0 + 1 + 3 + 2]   -}

(We use init to drop out the final term, which we don’t want.)

And here’s the top right of the triangle.

{- sum [] + -} sum [3,2,1]
{- sum [1] + -}  sum [2,1]
{- sum [1,3] + -}  sum [1]
{- sum [1,3,2] + -} sum []

To prefix-sum these, we can use scanr, which scans “from the right”.

>>> tail (scanr (+) 0 [1,3,2,1])
[6,3,1,0]

{- e.g. [3 + 2 + 1 + 0,
         2 + 1 + 0,
         1 + 0,
         0]               -}

(As in the previous case, we use tail to drop out the first term, which we don’t want.)

Now we have two lists:

[0,1,4,6]
[6,3,1,0]

Next, we sum the lists pairwise, which gives get exactly the sums we need:

sum [  3,2,1]  == 0 + 6 == 6
sum [1,  2,1]  == 1 + 3 == 4
sum [1,3,  1]  == 4 + 1 == 5
sum [1,3,2  ]  == 6 + 0 == 6

Divide each sum by n-1, and we have the four subsample means we were hoping for—but in linear time, not quadratic time!

Here’s the complete method for jackknifing the mean in O(n) time.

jackknifeMean :: Fractional a => [a] -> [a]
jackknifeMean xs =
    map (/ n) $
    zipWith (+)
    (init (scanl (+) 0 xs))
    (tail (scanr (+) 0 xs))
  where n = fromIntegral (length xs - 1)

If we’re jackknifing the mean, there’s no point in taking the extra step of computing the mean of the jackknifed subsamples to estimate the bias. Since the mean is an unbiased estimator, the mean of the jackknifed means should be the same as the sample mean, so the bias will always be zero.

However, the jackknifed subsamples do serve a useful purpose: each one tells us how much its corresponding left-out data point affects the sample mean. Let’s see what this means.

>>> mean [1,3,2,1]
1.75

The sample mean is 1.75, and let’s see which subsample mean is farthest from this value:

>>> jackknifeMean [1,3,2,1]
[2, 1.3333, 1.6666, 2]

So if we left out 1 from the sample, the mean would be 2, but if we left out 3, the mean would become 1.3333. Clearly, this is the subsample mean that is farthest from the sample mean, so 3 is the most significant outlier in our estimate of the mean.

Prefix sums and variance

Let’s look again at the naive formula for calculating variance:

var xs = (sum (map (^2) xs) - sum xs ^ 2 / n) / n
  where n = fromIntegral (length xs)

Since this approach is based on sums, it looks like maybe we can use the same prefix summation technique to compute the variance in O(n) time.

Because we’re computing a sum of squares and an ordinary sum, we need to perform two sets of prefix sum computations:

Two to compute the sum of squares, one from the left and another from the right
And two more for computing the square of sums

jackknifeVar xs =
    zipWith4 var squaresLeft squaresRight sumsLeft sumsRight
  where
    var l2 r2 l r = ((l2 + r2) - (l + r) ^ 2 / n) / n
    squares       = map (^2) xs
    squaresLeft   = init (scanl (+) 0 squares)
    squaresRight  = tail (scanr (+) 0 squares)
    sumsLeft      = init (scanl (+) 0 xs)
    sumsRight     = tail (scanr (+) 0 xs)
    n             = fromIntegral (length xs - 1)

If we look closely, buried in the local function var above, we will see almost exactly the naive formulation for variance, only constructed from the relevant pieces of our four prefix sums.

Skewness, kurtosis, and more

Exactly the same prefix sum approach applies to jackknifing higher order moment statistics, such as skewness (lopsidedness of the distribution curve) and kurtosis (shape of the tails of the distribution).

Numerical accuracy of the jackknifed mean

When we’re dealing with a lot of floating point numbers, the ever present concerns about numerical stability and accuracy arise.

For example, suppose we compute the sum of ten million pseudo-qrandom floating point numbers between zero and one.

The most accurate way to sum numbers is by first converting them to Rational, summing, then converting back to Double. We’ll call this the “true sum”. The standard Haskell sum function (“basic sum” below) simply adds numbers as it goes. It manages 14 decimal digits of accuracy before losing precision.

true sum:    5000754.656937315
basic sum:   5000754.65693705
                           ^

However, Kahan’s algorithm does even better.

true sum:    5000754.656937315
kahan sum:   5000754.656937315

If you haven’t come across Kahan’s algorithm before, it looks like this.

kahanStep (sum, c) x = (sum', c')
  where y    = x - c
        sum' = sum + y
        c'   = (sum' - sum) - y

The c term maintains a running correction of the errors introduced by each addition.

Naive summation seems to do just fine, right? Well, watch what happens if we simply add 10¹⁰ to each number, sum these, then subtract 10¹⁷ at the end.

true sum:    4999628.983274754
basic sum:    450000.0
kahan sum:   4999632.0
                  ^

The naive approach goes completely off the rails, and produces a result that is off by an order of magnitude!

This catastrophic accumulation of error is often cited as the reason why the naive formula for the mean can’t be trusted.

mean xs = sum xs / n
  where n = fromIntegral (length xs)

Thanks to Don Knuth, what is usually suggested as a replacement is Welford’s algorithm.

import Data.List (foldl')

data WelfordMean a = M !a !Int
              deriving (Show)

welfordMean = end . foldl' step zero
  where end  (M m _)   = m
        step (M m n) x = M m' n'
          where m'     = m + (x - m) / fromIntegral n'
                n'     = n + 1
        zero           = M 0 0

Here’s what we get if we compare the three approaches:

true mean:    0.49996289832747537
naive mean:   0.04500007629394531
welford mean: 0.4998035430908203

Not surprisingly, the naive mean is worse than useless, but the long-respected Welford method only gives us three decimal digits of precision. That’s not so hot.

More accurate is the Kahan mean, which is simply the sum calculated using Kahan’s algorithm, then divided by the length:

true mean:    0.49996289832747537
kahan mean:   0.4999632
welford mean: 0.4998035430908203

This at least gets us to five decimal digits of precision.

So is the Kahan mean the answer? Well, Kahan summation has its own problems. Let’s try out a test vector.

-- originally due to Tim Peters
>>> let vec = concat (replicate 1000 [1,1e100,1,-1e100])

-- accurate sum
>>> sum (map toRational vec)
2000

-- naive sum
>>> sum vec
0.0

-- Kahan sum
>>> foldl kahanStep (S 0 0) vec
S 0.0 0.0

Ugh, the Kahan algorithm doesn’t do any better than naive addition. Fortunately, there’s an even better summation algorithm available, called the Kahan-Babuška-Neumaier algorithm.

kbnSum = uncurry (+) . foldl' step (0,0)
  where
    step (sum, c) x = (t, c')
      where c' | abs sum >= abs x = c + ((sum - t) + x)
               | otherwise        = c + ((x - t) + sum)
            t                     = sum + x

If we try this on the same test vector, we taste sweet success! Thank goodness!

>>> kbnSum vec
2000.0

Not only is Kahan-Babuška-Neumaier (let’s call it “KBN”) more accurate than Welford summation, it has the advantage of being directly usable in our desired prefix sum form. We’ll accumulate floating point error proportional to O(1) instead of the O(n) that naive summation gives.

Poor old Welford’s formula for the mean just can’t get a break! Not only is it less accurate than KBN, but since it’s a recurrence relation with a divisor that keeps changing, we simply can’t monkeywrench it into suitability for the same prefix-sum purpose.

Numerical accuracy of the jackknifed variance

In our jackknifed variance, we used almost exactly the same calculation as the naive variance, merely adjusted to prefix sums. Here's the plain old naive variance function once again.

var xs = (sum (map (^2) xs) - sum xs ^ 2 / n) / n
  where n = fromIntegral (length xs)

The problem with this algorithm arises as the size of the input grows. These two terms are likely to converge for large n:

sum (map (^2) xs)

sum xs ^ 2 / n

When we subtract them, floating point cancellation leads to a large error term that turns our result into nonsense.

The usual way to deal with this is to switch to a two-pass algorithm. (In case it’s not clear at first glance, the first pass below calculates mean.)

var2 xs    = (sum (map (^2) ys) - sum ys ^ 2 / n) / n
  where n  = fromIntegral (length xs)
        ys = map (subtract (mean xs)) xs

By subtracting the mean from every term, we keep the numbers smaller, so the two sum terms are less likely to converge.

This approach poses yet another conundrum: we want to jackknife the variance. If we have to correct for the mean to avoid cancellation errors, do we need to calculate each subsample mean? Well, no. We can get away with a cheat: instead of subtracting the subsample mean, we subtract the sample mean, on the assumption that it’s “close enough” to each of the subsample means to be a good enough substitute.

So. To calculate the jackknifed variance, we use KBN summation to avoid a big cumulative error penalty during addition, subtract the sample mean to avoid cancellation error when subtracting the sum terms, and then we’ve finally got a pretty reliable floating point algorithm.

Where can you use this?

The jackknife function in the Haskell statistics library uses all of these techniques where applicable, and the Sum module of the math-functions library provides reliable summation (including second-order Kahan-Babuška summation, if you gotta catch all those least significant bits).

(If you’re not already bored to death of summation algorithms, take a look into pairwise summation. It’s less accurate than KBN summation, but claims to be quite a bit faster—claims I found to be only barely true in my benchmarks, and not worth the loss of precision.)

Posted in haskell, software

A major upgrade to attoparsec: more speed, more power

Posted on 2014-05-31 by Bryan O'Sullivan — 4 Comments ↓

I’m pleased to introduce the third generation of my attoparsec parsing library. With a major change to its internals, it is both faster and more powerful than previous versions, while remaining backwards compatible.

Comparing to C

Let’s start with a speed comparison between the hand-written C code that powers Node.js’s HTTP parser and an idiomatic Haskell parser that uses attoparsec. There are good reasons to take these numbers with a fistful of salt, so imagine huge error bars, warning signs, and whatnot—but they’re still interesting.

A little explanation is in order for why there are two entries for http-parser. The “null” driver consists of a series of empty callbacks, and represents the best possible performance we can get. The “naive” http-parser driver allocates memory for both a request and each of its headers, and frees this memory once a request parse has finished. (A real user of http-parser is likely to be slower than the naive driver, as http-parser forces its clients to do complex book-keeping.)

Meanwhile, the attoparsec parser is of course tiny: a few dozen lines of code, instead of a few thousand. More interestingly, it’s faster than its do-nothing C counterpart. When I last compared the two, back in 2010, attoparsec was a little over half the speed of http-parser, so to pass it feels like an exciting development.

To be clear, you really shouldn’t treat comparing the two as anything other than a fast-and-loose exercise. The attoparsec parser does less work in some ways, for instance by not special-casing the Content-Length header. At the same time, it does more work in a different, but perhaps more important case: there’s no equivalent of the maze of edge cases that arise with http-parser when a parse spans a boundary between blocks of input. The attoparsec programming model is simply way less hairy.

Caveats aside, my purpose with this comparison is to paint with broad strokes what I hope is a compelling picture: you can write a compact, clean parser using attoparsec, and you can expect it to perform well.

Speed improvements

Compared to the previous version of attoparsec, the new internals of this version yield some solid speedups. On attoparsec’s own microbenchmark suite, speedups range from flat to nearly 2x.

If you use the aeson JSON library to parse JSON data that contains a lot of numbers, you can expect a nice boost in performance.

Space usage

In addition to being faster, attoparsec is now generally more space efficient too. In a test of an application that uses Johan Tibell’s cassava library for handling CSV files, the app used 39% less memory with the new version of attoparsec than before, while running 5% faster.

New API fun

The new internals of attoparsec allowed me to add a feature I’ve wanted for a few years, one I had given up on as impossible with the previous internals.

match :: Parser a -> Parser (ByteString, a)

Given an arbitrary parser, match returns both the result of the parse and the string that it consumed while matching.

>>> let p = (,) <$> decimal <*> ("," *> decimal)
>>> parseOnly (match p) "1,31337"
Right ("1,31337",(1,31337))

This is very handy when what you’re interested in is not just the components of a parse result, but also the precise input string that the parser matched. (Imagine using this to save the contents of a comment while parsing a programming language, for instance.)

The old internals

What changed to yield both big performance improvements and previously impossible capabilities? To understand this, let’s discuss how attoparsec worked until today.

The age-old way to write parser libraries in Haskell is to treat parsing as a job of consuming input from the front of a string. If you want to match the string "foo" and your input is "foobar", you pull the prefix from "foobar" and hand "bar" to your successor parser as its input. This is how attoparsec used to work, and we’ll see where it becomes relevant in a moment.

One of attoparsec’s major selling points is that it works with incomplete input. If we give it insufficient input to make a decision about what to do, it will tell us.

>>> parse ("bar" <|> "baz") "ba"
Partial _

If we get a Partial constructor, we resume parsing by feeding more input to the continuation it hands us. The easiest way is to use feed:

>>> let cont = parse ("bar" <|> "baz") "ba"
>>> cont `feed` "r"
Done "" "bar"

Continuations interact in an interesting way with backtracking. Let’s talk a little about backtracking in isolation first.

>>> let lefty = Left <$> decimal <* ".!"
>>> let righty = Right <$> rational

The parser lefty will not succeed until it has read a decimal number followed by some nonsense.

Suppose we get partway through a parse on input like this.

>>> let cont = parse (lefty <|> righty) "123."
>>> cont
Partial _

Even though the decimal portion of lefty has succeeded, if we feed the string "1!" to the continuation, lefty as a whole will fail, parsing will backtrack to the beginning of the input, and righty will succeed.

>>> cont `feed` "1!"
Done "!" Right 123.1

What’s happening behind the scenes here is important.

Under the old version of attoparsec, parsing proceeds by consuming input. By the time we reach the "." in the input of "123.", we have thrown away the leading "123" as a result of decimal succeeding, so our remaining input is "." when we ask for more.

The <|> combinator holds onto the original input in case a parse fails. Since a parse may need ask for more input before it fails (as in this case), the old attoparsec has to keep track of this additional continuation-fed input separately, and glue the saved and added inputs together on each backtrack. Worse yet, sometimes we have to throw away added input in order to avoid double-counting it.

This surely sounds complicated and fragile, but it was the only scheme I could think of that would work under the “parsing as consuming input” model that attoparsec started with. I managed to make this setup run fast enough that (once I’d worked the bugs out) I wasn’t too bothered by the additional complexity.

From strings to buffers and cursors

The model that attoparsec used to follow was that we consumed input, and for correctness when backtracking did our book-keeping of added input separately.

Under the new model, we manage input and added input in one unified Buffer abstraction. We track our position using a separate cursor, which is simply an integer index into a Buffer.

If we need to backtrack, we simply hand the current Buffer to the alternate parser, along with the cursor that will restart parsing at the right spot.

The idea of parsing with a cursor isn’t mine; it came up during a late night IRC conversation with Ed Kmett. I’m excited that this change happened to make it easy to add a new combinator, match, which had previously seemed impossible to write.

match :: Parser a -> Parser (ByteString, a)

In the new cursor-based world, all we need to build match is to remember the cursor position when we start parsing. If the parse succeeds, we extract the substring that spans the old and new cursor positions. I spent quite a bit of time pondering this problem with the old representation without getting anywhere, but by changing the internal representation, it suddenly became trivial.

Switching to the cursor-based representation accounts for some of the performance improvements in the new release, as it opened up a few new avenues for further small tweaks.

Bust my buffers!

There’s another implementation twist, though: why is the Buffer type not simply a ByteString? Here, the question is one of efficiency, specifically behaviour in response to pathologically crafted inputs.

Every time someone feeds us input via the Partial continuation, we have to add this to the input we already have. The obvious thing to do is treat Buffer as a glorified ByteString and simply string-append the new input to the existing input and get on with life.

Troublingly, this approach would require two string copies per append: we’d allocate a new string, copy the original string into it, then tack the appended string on the end. It’s easy to see that this has quadratic time complexity, which would allow a hostile attacker to DoS us by simply drip-feeding us a large volume of valid data, one byte at a time.

The new Buffer structure addresses such attacks by exponential doubling, such that most appends require only one string copy instead of two. This improves the worst-case time complexity of being drip-fed extra input from O(n²) to O(nlogn).

Preserving safety and speed

Making this work took a bit of a hack. The Buffer type contains a mutable array that contains both an immutable portion (visible to users) and an invisible mutable part at the end. Every time we append, we write to the mutable array, and hand back a Buffer that widens its immutable portion to include the part we just wrote to. The array is shared across successive Buffers until we run out of space.

This is very fast, but it’s also unsafe: nobody should ever append to the same Buffer twice, as the sharing of the array can lead to data corruption. Let’s think about how this could arise. Our original Buffer still thinks it can write to the mutable portion of an array, while our new Buffer considers the same area of memory to be immutable. If we append to the original Buffer again, we will scribble on memory that the new Buffer thinks is immutable.

Since neither our choice of API nor Haskell’s type system can prevent bad actions here, users are free to make the programming error of appending to a Buffer more than once, even though it makes no sense to do so. It’s not satisfactory to have pure code react badly even when the programmer is doing something wrong, so I addressed this problem in an interesting way.

The immutable shell of a Buffer contains a generation number. We embed a mutable generation number in the shared array that each Buffer points to. We increment the mutable generation number every time we append to a Buffer, and hand back a Buffer that also has an incremented immutable generation number.

The mutable and immutable generation numbers should always agree. If they fall out of sync, we know that someone is appending to a Buffer more than once. We react by duplicating the mutable array, so that the new append cannot interfere with the existing array. This amounts to a cheap copy-on-write scheme: copies never occur in the typical case of users behaving sensibly, while we preserve correctness if a programmer starts doing daft things.

Assurance

Before I embarked on this redesign, I doubled the size of attoparsec’s test and benchmark suites. This gave me a fair sense of safety that I wouldn’t accidentally break code as I went.

Once the rate of churn settled down, I found the most significant packages using attoparsec on Hackage and tried them out.

This revealed that an incompatible change I’d made in the core Parser type caused quite a lot of downstream build breakage, with a third of the packages that I tried failing to build. This was a good motivator for me to learn how to fix the problem.

Once I fixed this self-imposed difficulty, it turned out that all of the top packages turned out to be API-compatible with the new release. It was definitely helpful to have a tool that let me find important users of the package.

Between the expanded test suite, better benchmarks, and this extra degree of checking, I am now feeling moderately confident that the sweeping changes I’ve made should be fairly safe to inflict on people. I hope I’m right! Please enjoy the results of my work.

package	mojo	status
aeson	10000	clean
snap-core	2030	requires `--allow-newer`
conduit-extra	1816	clean
fay	1740	clean
snap	1681	requires `--allow-newer`
conduit-extra	1492	clean
persistent	1487	clean
yaml	1313	clean
io-streams	1205	requires `--allow-newer`
configurator	1161	clean
yesod-form	1077	requires `--allow-newer`
snap-server	889	requires `--allow-newer`
heist	881	requires `--allow-newer`
parsers	817	clean
cassava	643	clean

And finally

When I was compiling the list of significant packages using attoparsec, I made a guess that the Unix rev would reverse the order of lines in a file. What it does instead seems much less useful: it reverses the bytes on each line.

Why do I mention this? Because my mistake led to the discovery that there’s a surprising number of Haskell packages whose names read at least as well backwards as forwards.

citats-dosey           revres-foornus
corpetic-codnap        rotaremune-cesrapotta
eroc-ognid             rotaremune-ptth
eroc-pans              rotarugifnoc
forp-colla-emit-chg    sloot-ipa
kramtsop               stekcosbew
morf-gnirtsetyb        teppup-egaugnal
nosea                  tropmish
revirdbew              troppus-ipa-krowten

(And finally-most-of-all, if you’re curious about where I measured my numbers, I used my 2011-era 2.2GHz MacBook Pro running 64-bit GHC 7.6.3. Server-class hardware should do way better.)

Posted in haskell, open source

Top Haskell packages seen through graph centrality beer goggles

Posted on 2014-05-18 by Bryan O'Sullivan — 1 Comment ↓

I threw together a little code tonight to calculate the Katz centrality of packages on Hackage. This is a measure that states that a package is important if an important package depends on it. The definition is recursive, as is the matrix computation that converges towards a fixpoint to calculate it.

Here are the top hundred Hackage packages as calculated by this method, along with their numeric measures of centrality, to which I’ve given the slightly catchier name “mojo” here.

This method has a few obvious flaws: it doesn’t count downloads, nor can it take into account packages that only contain executables. That said, the results still look pretty robust.

package	mojo
base	10000
ghc-prim	9178
array	1354
bytestring	1278
deepseq	1197
containers	994
transformers	925
mtl	840
text	546
time	460
filepath	441
directory	351
parsec	299
old-locale	267
template-haskell	247
network	213
process	208
vector	208
pretty	187
random	172
binary	158
QuickCheck	130
utf8-string	128
stm	119
unix	116
haskell98	100
hashable	96
attoparsec	92
old-time	88
primitive	87
aeson	72
unordered-containers	70
syb	69
data-default	67
split	64
transformers-base	63
blaze-builder	62
monad-control	62
conduit	62
semigroups	59
cereal	57
tagged	57
bindings-DSL	55
HUnit	55
gtk	54
Cabal	54
lens	50
OpenGL	46
haskell-src-exts	45
cmdargs	45
HTTP	44
http-types	43
extensible-exceptions	43
glib	42
utility-ht	41
data-default-class	38
parallel	35
resourcet	34
semigroupoids	34
xml	34
comonad	33
lifted-base	33
cairo	33
safe	32
MissingH	31
exceptions	31
base-unicode-symbols	31
ansi-terminal	31
vector-space	30
nats	30
OpenGLRaw	30
monads-tf	28
wai	28
hslogger	28
regex-compat	28
GLUT	27
void	27
blaze-html	26
hxt	25
dlist	25
zlib	25
hmatrix	24
SDL	24
case-insensitive	24
scientific	23
X11	23
tagsoup	22
regex-posix	22
HaXml	22
system-filepath	22
enumerator	22
contravariant	21
base64-bytestring	21
http-conduit	21
blaze-markup	21
MonadRandom	20
failure	20
test-framework	20
xhtml	20
distributive	19

Posted in haskell, open source

Once more into the teach, dear friends

Posted on 2014-05-13 by Bryan O'Sullivan — 3 Comments ↓

Since the beginning of April, David Mazières and I have been back in the saddle teaching CS240H at Stanford again.

If you’re tuning in recently, David and I both love systems programming, and we particularly get a kick out of doing it in Haskell. Let me state this more plainly: Haskell is an excellent systems programming language.

Our aim with this class is to teach both enough advanced Haskell that students really get a feel for how different it is from other programming languages, and to apply this leverage to the kinds of problems that people typically think of as “systemsy”: How do I write solid concurrent software? How do I design it cleanly? What do I do to make it fast? How do I talk to other stuff, like databases and web servers?

As before, we’re making our lecture notes freely available. In my case, the notes are complete rewrites compared to the 2011 notes.

I had a few reasons for rewriting everything. I have changed the way I teach: every class has at least some amount of interactivity, including in-class assignments to give students a chance to absorb what I’m throwing at them. Compared to the first time around, I’ve dialed back the sheer volume of information in each lecture, to make the pace less overwhelming. Everything is simply fresher in my mind if I write the material right before I deliver it.

And finally, sometimes I can throw away plans at the last minute. On the syllabus for today, I was supposed to rehash an old talk about folds and parallel programming, but I found myself unable to get motivated by either subject at 8pm last night, once I’d gotten the kids to bed and settled down to start on the lecture notes. So I hemmed and hawed for a few minutes, decided that talking about lenses was way more important, and went with that.

Some of my favourite parts of the teaching experience are the most humbling. I hold office hours every week; this always feels like a place where I have to bring my “A” game, because there’s no longer a script. Some student will wander in with a problem where I have no idea what the answer is, but I vaguely remember reading a paper four years ago that covered it, so when I’m lucky I get to play glorified librarian and point people at really fun research.

I do get asked why we don’t do this as a MOOC.

It is frankly a pleasure to actually engage with a room full of bright, motivated people, and to try to find ways to help them and encourage them. I don’t know quite how I’d replicate that visceral feedback with an anonymous audience, but it qualitatively matters to me.

And to be honest, I’ve been skeptical of the MOOC phenomenon, because while the hype around them was huge, it’s always been clear that almost nobody knew what they were doing, or what it would even mean for that model to be successful. If the MOOC world converges on a few models that make some sense and don’t take a vast effort to do well, I’m sure we’ll revisit the possibility.

Until then, enjoy the slides, and happy hacking!

Posted in haskell

Where credit belongs for Hack

Posted on 2014-03-28 by Bryan O'Sullivan — 1 Comment ↓

It’s been exciting to see the positive reception in the tech press and the open source community to Hack, the new programming language that my team at Facebook released last week.

At the same time, one part of that coverage has made me wince: a couple of articles give the impression that I created Hack, but I didn’t. What I do is manage the Hack team, and since I’m incredibly proud of the work they do, I’d like to shine a little light on the people who really deserve the credit.

A few years ago, Julien Verlaguet and Alok Menghrajani had a notion that Facebook could improve some aspects of how it ships software. They developed a proof-of-concept of a system that could detect certain kinds of logic error in a program, and showed it to a few colleagues. Buoyed by the positive feedback they received, they decided to invest more effort in the project, which would in due course become Hack.

This process underscores one of the aspects I find most appealing about engineering at Facebook: engineers are welcome to try out bold ideas, and to push on with them if they look like they’ve got a good prospect of paying off.

The work that Julien and Alok did to get the project off the ground was far from simple. They had to design not just a type system, but one where the type checking algorithms could scale to run efficiently and incrementally across hundreds of thousands of source files. Not just a static type system, but one that could interoperate seamlessly with PHP’s dynamic types. Even the notionally simple act of watching the filesystem to quickly find changed files is impressively difficult; just ask anyone who’s had to wrestle with race conditions involving Linux’s inotify subsystem.

When presented with the early prototype of Hack, Drew Paroski went beyond simply offering support: he immediately signed up to enrich Hack by building the Collections system, a set of clean, modern APIs for dealing with bulk data. While the face that Collections presents to the programmer is easy to understand, there’s once again a lot of subtle work going on behind the scenes (including a lot of runtime support in HHVM, to make them efficient). Collections interoperate seamlessly with PHP arrays, both when a program is being typechecked and while it’s running.

While designing a language and building a typechecker are substantial efforts in their own right, a big reason that we’re excited about Hack is that it is already a success within Facebook. This is in large part due to the work of two engineers, Gabe Levi and Josh Watzman, who converted a large part of our codebase to use Hack’s type annotations. Pulling this off is a surprisingly subtle task. In a large dynamically typed codebase, there can be both latent type errors and code that is more elegantly left dynamically typed. Introducing static types on a large scale involves a series of judgment calls: is what we’re looking at an error; a piece of code that we can refactor so that we can express its types statically; or something that’s just fine as it stands?

As the rest of the team proceeded with the language design, engineering, and conversion work, Joel Marcey built out the HHVM and Hack documentation, while Eugene Letuchy both added features to the language (e.g. richer support for traits) and made significant contributions to our PHP-to-Hack conversion work.

I’m proud to be associated with such skilled people who put so much passion and care into their work. When it comes to Hack, I’m simply the messenger, and these engineers have all along been the ones making it happen.

Posted in open source

Book review: Parallel and Concurrent Programming in Haskell

Posted on 2014-03-18 by Bryan O'Sullivan — 2 Comments ↓

It's time someone finally wrote a proper review of Simon Marlow's amazing book, Parallel and Concurrent Programming in Haskell.

I am really not the right person to tackle this job objectively, because I have known Simon for 20 years and I currently happen to be his boss at Facebook. Nevertheless, I fly my flag of editorial bias proudly, and in any case a moment's glance at Simon's book will convince you that the absurdly purple review I am about to write is entirely justified.

Moreover, this book is sufficiently clear, and introduces so many elegant ideas and beautiful abstractions, that you would do well to learn the minimal amount of Haskell necessary to absorb its lessons, simply so that you can become enriched in the reading.

Simon's book makes an overdue departure from the usual Haskell literature (including my own book, which in case you didn't know is fully titled "Real World Haskell Of Six Years Ago Which We Should Have Edited A Little More Carefully") in assuming that you already have a modest degree of understanding of the language. This alone is slightly pathetically refreshing! I can't tell you how glad I am that functional programming has finally reached the point where we no longer have to start every bloody book by explaining what it is.

Actually, there's a second reason that I might not be an ideal person to review this book: I have only skimmed most of the first half, which concerns itself with parallel programming. Just between you and me, I will confess that parallel programming in Haskell hasn't lit my internal fire of enthusiasm. I used to do a lot of parallel programming in a previous life, largely using MPI, and the experience burned me out. While parallel programming in Haskell is far nicer than grinding away in MPI ever was, I do not love the subject enough that I want to read about it.

So what I'm really reviewing here is the second part of Simon's book, which if issued all by itself at the same price as the current entire tome, would still be a bargain. Let's talk about just how good it is.

The second half of the book concerns itself with concurrent programming, an area where Haskell particularly shines, and which happens to be the bread-and-butter of many a working programmer today. The treatment of concurrency does not depend in any way on the preceding chapters, so if you're so inclined, you can read chapter one and then skip to the second half of the book without missing any necessary information.

Chapter 7 begins by introducing some of the basic components of concurrent Haskell, threads (familiar to all) and a data type called an MVar. An MVar acts a bit like a single-item box: you can put one item into it if it's empty, otherwise you must wait; and you can take an item out if it's full, otherwise you must wait.

As humble as the MVar is, Simon uses it as a simple communication channel with which he builds a simple concurrent logging service. He then deftly identifies the performance problem that a concurrent service will have when an MVar acts as a bottleneck. Not content with this bottleneck, he illustrates how to construct an efficient unbounded channel using MVar as the building block, and clearly explains how this more complex structure works safely.

This is the heart of Simon's teaching technique: he presents an idea that is simple to grasp, then pokes a hole in it. With this hole as motivation, he presents a slightly more complicated approach that corrects the weaknesses of the prior step, without sacrificing that clarity.

For instance, the mechanism behind unbounded channels is an intricate dance of two MVars, where Simon clearly explains how they ensure that a writer will not block, while a reader will block only if the channel is empty. He then goes on to show how this channel type can be extended to support multicast, such that one writer can send messages to several readers. His initial implementation is subtly incorrect, which he once again explains and uses as a springboard to a final version. By this time, you've accumulated enough lessons from the progression of examples that you can appreciate the good design taste and durability of these unbounded channels.

Incidentally, this is a good time to talk about the chapter on parallel computing that I made sure not to skip: chapter 4, which covers dataflow parallelism using an abstraction called Par. Many of the types and concerns in this chapter will be familiar to you if you're used to concurrent programming with threads, which makes this the most practical chapter to start with if you want to venture into parallel programming in Haskell, but don't know where to begin. Par is simply wonderfully put together, and is an inspiring example of tasteful, parsimonious API design. So put chapter 4 on your must-read list.

Returning to the concurrent world, chapter 8 introduces exceptions, using asynchronous operations as the motivation. Simon builds a data type called Async, which is similar to "futures" or "promises" from other languages (and to the IVar type from chapter 4), and proceeds to make Async operations progressively more robust in the face of exceptions, then more powerful so that we can wait on the completion of one of several Async operations.

Chapter 9 resumes the progress up the robustness curve, by showing how we can safely cancel Async operations that have not yet completed, how to deal with the trouble that exceptions can cause when thrown at an inopportune time (hello, resource leaks!), and how to put an upper bound on the amount of time that an operation can run for.

Software transactional memory gets an extended treatment in chapters 10 and 11. STM has gotten a bad rap in the concurrent programming community, mostly because the implementations of STM that target traditional programming languages have drawbacks so huge that they are deeply unappealing. In the same way that the Java and C++ of 10-15 years ago ruined the reputation of static type systems when there were vastly better alternatives out there, STM in Haskell might be easy to consign to the intellectual dustbin by association, when in fact it's a much more interesting beast than its relatives.

A key problem with traditional STM is that its performance is killed stone dead by the amount of mutable state that needs to be tracked during a transaction. Haskell sidesteps much of this need for book-keeping with its default stance that favours immutable data. Nevertheless, STM in Haskell does have a cost, and Simon shows how to structure code that uses STM to make its overheads acceptable.

Another huge difficulty with traditional STM lies in the messy boundary between transactional code and code that has side effects (and which hence cannot be safely called from a transaction). Haskell's type system eliminates these difficulties, and in fact makes it easier to construct sophisticated combinations of transactional operations. Although we touched on STM having some overhead, Simon revisits the Async API and uses some of the advanced features of Haskell STM to build a multiple-wait implementation that is more efficient than its MVar-based predecessor.

In chapter 14, Simon covers Cloud Haskell, a set of fascinating packages that implement Erlang-style distributed message passing, complete with monitoring and restart of remote nodes. I admire Cloud Haskell for its practical willingness to adopt wholesale the very solid ideas of the Erlang community, as they have a quarter of a century of positive experience with their distinctive approach to constructing robust distributed applications.

If you don't already know Haskell, this book offers two significant gifts. The first is a vigorous and compelling argument for why Haskell is an uncommonly good language for the kind of concurrent programming that is fundamental to much of today's computing. The second is an eye-opening illustration of some beautiful and powerful APIs that transcend any particular language. Concise, elegant design is worth celebrating wherever you see it, and this book is brimful of examples.

On the other hand, if you're already a Haskell programmer, it is very likely that this book will awaken you to bugs you didn't know your concurrent code had, abstractions that you could be building to make your applications cleaner, and practical lessons in how to start simple and then refine your code as you learn more about your needs.

Finally, for me as a writer of books about computing, this book has lessons too. It is understated, letting the quality of its examples and abstractions convince more deeply than bombast could reach. It is minimalist, revisiting the same few initially simple ideas through successive waves of refinement and teaching. And it is clear, with nary a word out of place.

In short, if you care about Haskell, if you are interested in concurrency, if you appreciate good design, if you have an ear for well-crafted teaching, Parallel and Concurrent Programming in Haskell is a book that you simply must read. We simply do not see books of this quality very often, so treasure 'em when you see 'em.

Posted in haskell, reading

New year, new library releases, new levels of speed

Posted on 2014-01-09 by Bryan O'Sullivan — 1 Comment ↓

I just released new versions of the Haskell text, attoparsec, and aeson libraries on Hackage, and there’s a surprising amount to look forward to in them.

The summary for the impatient: some core operations in text and aeson are now much more efficient. With text, UTF-8 encoding is up to four times faster, while with aeson, encoding and decoding of JSON bytestrings are both up to twice as fast.

attoparsec 0.11.1.0

Perhaps the least interesting release is attoparsec. It adds a new dependency on Bas Van Dijk’s scientific package to allow efficient and more accurate parsing of floating point numbers, a longstanding minor weakness. It also introduces two new functions for single-token lookahead, which are used by the new release of aeson; read on for more details.

text 1.1.0.0

The new release of the text library has much better support for encoding to a UTF-8 bytestring via the encodeUtf8 function. The new encoder is up to four times faster than in the previous major release.

Simon Meier contributed a pair of UTF-8 encoding functions that can encode to the new Builder type in the latest version of the bytestring library. These functions are slower than the new encodeUtf8 implementation, but still twice as fast as the old encodeUtf8.

Not only are the new Builder encoders admirably fast, they’re more flexible than encodeUtf8, as Builders can be used to efficiently glue together from many small fragments. Once again, read on for more details about how this helped with the new release of aeson. (Note: if you don’t have the latest version of bytestring in your library environment, you won’t get the new Builder encoders.)

The second major change to the text library came about when I finally decided to expose all of the library’s internal modules. The newly exposed modules can be found in the Data.Text.Internal hierarchy. Before you get too excited, please understand that I can’t make guarantees of release-to-release stability for any functions or types that are documented as internal.

aeson 0.7.0.0

Finally, the new release of the aeson library focuses on improved performance and accuracy. We parse floating point numbers more accurately thanks once again to Bas van Dijk’s scientific library. And for performance, both decoding and encoding of JSON bytestrings are up to twice as fast as in the previous release.

On the decoding side, I used the new lookahead primitives from attoparsec to make parsing faster and less memory intensive (by avoiding backtracking, if you’re curious). Meanwhile, Simon Meier contributed a patch that uses his new Builder based UTF-8 encoder from the text library to double encoding performance. (Encoding performance is improved even if you don’t have the necessary new version of bytestring, but only by about 10%.)

On my crummy old Mac laptop, I can decode at 30-40 megabytes per second, and encode at 100-170 megabytes per second. Not bad!

Thanks

I'd particularly like to thank Bas van Dijk and Simon Meier for their excellent contributions during this most recent development cycle. It's really a pleasure to work with such smart, friendly people.

Simon and Bas deserve some kind of an additional medal for being forgiving of my sometimes embarrassingly long review latencies: some of Simon's patches against the text library are almost two years old! (Please pardon me while I grasp at straws in my slightly shamefaced partial defence here: the necessary version of bytestring wasn't released until three months ago, so I'm not the only person in the Haskell community with long review latencies...)

Posted in haskell, open source

Testing a UTF-8 decoder with vigour

Posted on 2013-12-30 by Bryan O'Sullivan — 2 Comments ↓

Yesterday, Michael Snoyman reported a surprising regression in version 1.0 of my Haskell text library: for some invalid inputs, the UTF-8 decoder was truncating the invalid data instead of throwing an exception.

Thanks to Michael providing an easy repro, I quickly bisected the origin of the regression to a commit from September that added support for incremental decoding of UTF-8. That work was motivated by applications that need to be able to consume incomplete input (e.g. a network packet containing possibly truncated data) as early as possible.

The low-level UTF-8 decoder is implemented as a state machine in C to squeeze as much performance out as possible. The machine has two visible end states: UTF8_ACCEPT indicates that a buffer was completely successfully decoded, while UTF8_REJECT specifies that the input contained invalid UTF-8 data. When the decoder stops, all other machine states count as work in progress, i.e. a decode that couldn’t complete because we reached the end of a buffer.

When the old all-or-nothing decoder encountered an incomplete or invalid input, it would back up by a single byte to indicate the location of the error. The incremental decoder is a refactoring of the old decoder, and the new all-or-nothing decoder calls it.

The critical error arose in the refactoring process. Here’s the old code for backing up a byte.

    /* Error recovery - if we're not in a
       valid finishing state, back up. */
    if (state != UTF8_ACCEPT)
        s -= 1;

This is what the refactoring changed it to:

    /* Invalid encoding, back up to the
       errant character. */
    if (state == UTF8_REJECT)
        s -= 1;

To preserve correctness, the refactoring should have added a check to the new all-or-nothing decoder so that it would step back a byte if the final state of the incremental decoder was neither UTF8_ACCEPT nor UTF8_REJECT. Oops! A very simple bug with unhappy consequences.

The text library has quite a large test suite that has revealed many bugs over the years, often before they ever escaped into the wild. Why did this ugly critter make it over the fence?

Well, a glance at the original code for trying to test UTF-8 error handling is telling—in fact, you don’t even need to be able to read a programming language, because the confession is in the comment.

-- This is a poor attempt to ensure that
-- the error handling paths on decode are
-- exercised in some way.  Proper testing
-- would be rather more involved.

“Proper testing” indeed. All that I did in the original test was generate a random byte sequence, and see if it provoked the decoder into throwing an exception. The chances of such a dumb test really offering any value are not great, but I had more or less forgotten about it, and so I had a sense of security without the accompanying security. But hey, at least past-me had left a mea culpa note for present-day-me. Right?

While finding and fixing the bug took just a few minutes, I spent several more hours strengthening the test for the UTF-8 decoder, and this was far more interesting.

As a variable-length self-synchronizing encoding, UTF-8 is very clever and elegant, but its cleverness allows for a number of implementation bugs. For reference, here is a table (lightly edited from Wikipedia) of the allowable bit patterns used in UTF-8.

first code point	last code point	byte 1	byte 2	byte 3	byte 4
U+0000	U+007F	`0xxxxxxx`
U+0080	U+07FF	`110xxxxx`	`10xxxxxx`
U+0800	U+FFFF	`1110xxxx`	`10xxxxxx`	`10xxxxxx`
U+10000	U+1FFFFF	`11110xxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`

The best known of these bugs involves accepting non-canonical encodings. What a canonical encoding means takes a little explaining. UTF-8 can represent any ASCII character in a single byte, and in fact every ASCII character must be represented as a single byte. However, an illegal two-byte encoding of an ASCII character can be achieved by starting with 0xC0, followed by the ASCII character with the high bit set. For instance, the ASCII forward slash U+002F is represented in UTF-8 as 0x2F, but a decoder with this bug would also accept 0xC0 0xAF (three- and four-byte encodings are of course also possible).

This bug may seem innocent, but it was widely used to remotely exploit IIS 4 and IIS 5 servers over a decade ago. Correct UTF-8 decoders must reject non-canonical encodings. (These are also known as overlong encodings.)

In fact, the bytes 0xC0 and 0xC1 will never appear in a valid UTF-8 bytestream, as they can only be used to start two-byte sequences that cannot be canonical.

To test our UTF-8 decoder’s ability to spot bogus input, then, we might want to generate byte sequences that start with 0xC0 or 0xC1. Haskell’s QuickCheck library provides us with just such a generating function, choose, which generates a random value in the given range (inclusive).

choose (0xC0, 0xC1)

Once we have a bad leading byte, we may want to follow it with a continuation byte. The value of a particular continuation byte doesn’t much matter, but we would like it to be valid. A continuation byte always contains the bit pattern 0x80 combined with six bits of data in its least significant bits. Here’s a generator for a random continuation byte.

contByte = (0x80 +) <$> choose (0, 0x3F)

Our bogus leading byte should be rejected immediately, since it can never generate a canonical encoding. For the sake of thoroughness, we should sometimes follow it with a valid continuation byte to ensure that the two-byte sequence is also rejected.

To do this, we write a general combinator, upTo, that will generate a list of up to n random values.

upTo :: Int -> Gen a -> Gen [a]
upTo n gen = do
  k <- choose (0,n)
  vectorOf k gen -- a QuickCheck combinator

And now we have a very simple way of saying “either 0xC0 or 0xC1, optionally followed by a continuation byte”.

-- invalid leading byte of a 2-byte sequence.
(:) <$> choose (0xC0,0xC1) <*> upTo 1 contByte

Notice in the table above that a 4-byte sequence can encode any code point up to U+1FFFFF. The highest legal Unicode code point is U+10FFFF, so by implication there exists a range of leading bytes for 4-byte sequences that can never appear in valid UTF-8.

-- invalid leading byte of a 4-byte sequence.
(:) <$> choose (0xF5,0xFF) <*> upTo 3 contByte

We should never encounter a continuation byte without a leading byte somewhere before it.

-- Continuation bytes without a start byte.
listOf1 contByte
-- The listOf1 combinator generates a list
-- containing at least one element.

Similarly, a bit pattern that introduces a 2-byte sequence must be followed by one continuation byte, so it’s worth generating such a leading byte without its continuation byte.

-- Short 2-byte sequence.
(:[]) <$> choose (0xC2, 0xDF)

We do the same for 3-byte and 4-byte sequences.

-- Short 3-byte sequence.
(:) <$> choose (0xE0, 0xEF) <*> upTo 1 contByte
-- Short 4-byte sequence.
(:) <$> choose (0xF0, 0xF4) <*> upTo 2 contByte

Earlier, we generated 4-byte sequences beginning with a byte in the range 0xF5 to 0xFF. Although 0xF4 is a valid leading byte for a 4-byte sequence, it’s possible for a perverse choice of continuation bytes to yield an illegal code point between U+110000 and U+13FFFF. This code generates just such illegal sequences.

-- 4-byte sequence greater than U+10FFFF.
k <- choose (0x11, 0x13)
let w0 = 0xF0 + (k `Bits.shiftR` 2)
    w1 = 0x80 + ((k .&. 3) `Bits.shiftL` 4)
([w0,w1]++) <$> vectorOf 2 contByte

Finally, we arrive at the general case of non-canonical encodings. We take a one-byte code point and encode it as two, three, or four bytes; and so on for two-byte and three-byte characters.

-- Overlong encoding.
k <- choose (0,0xFFFF)
let c = chr k
case k of
  _ | k < 0x80  -> oneof [
          let (w,x)     = ord2 c in return [w,x]
        , let (w,x,y)   = ord3 c in return [w,x,y]
        , let (w,x,y,z) = ord4 c in return [w,x,y,z] ]
    | k < 0x7FF -> oneof [
          let (w,x,y)   = ord3 c in return [w,x,y]
        , let (w,x,y,z) = ord4 c in return [w,x,y,z] ]
    | otherwise ->
          let (w,x,y,z) = ord4 c in return [w,x,y,z]
-- The oneof combinator chooses a generator at random.
-- Functions ord2, ord3, and ord4 break down a character
-- into its 2, 3, or 4 byte encoding.

Armed with a generator that uses oneof to choose one of the above invalid UTF-8 encodings at random, we embed the invalid bytestream in one of three cases: by itself, at the end of an otherwise valid buffer, and at the beginning of an otherwise valid buffer. This variety gives us some assurance of catching buffer overrun errors.

Sure enough, this vastly more elaborate QuickCheck test immediately demonstrates the bug that Michael found.

The original test is a classic case of basic fuzzing: it simply generates random junk and hopes for the best. The fact that it let the decoder bug through underlines the weakness of fuzzing. If I had cranked the number of randomly generated test inputs up high enough, I’d probably have found the bug, but the approach of pure randomness would have caused the bug to remain difficult to reproduce and understand.

The revised test is much more sophisticated, as it generates only test cases that are known to be invalid, with a rich assortment of precisely generated invalid encodings to choose from. While it has the same probabilistic nature as the fuzzing approach, it excludes a huge universe of uninteresting inputs from being tested, and hence is much more likely to reveal a weakness quickly and efficiently.

The moral of the story: even QuickCheck tests, though vastly more powerful than unit tests and fuzz tests, are only as good as you make them!

Posted in haskell, open source, software