In statistics, an estimation technique called “the jackknife” has been widely used for over half a century. It’s a mainstay for taking a quick look at the quality of an estimator of a sample. (An estimator is a summary function over a sample, such as its mean or variance.)
Suppose we have a noisy sample. Our first stopping point might be to look at the variance of the sample, to get a sense of how much the values in the sample “spread out” around the average.
If the variance is not close to zero, then we know that the sample is somewhat noisy. But our curiosity may persist: is the variance unduly influenced by a few big spikes, or is the sample consistently noisy? The jackknife is a simple analytic tool that lets us quickly answer questions like this. There are more accurate, sophisticated approaches to this kind of problem, but they’re not nearly so easy to understand and use, so the jackknife has stayed popular since the 1950s.
The jackknife is easy to describe. We take the original sample, drop the first value out, and calculate the variance (or whatever the estimator is) over this subsample. We repeat this, dropping out only the second value, and continue. For an original sample with n elements, we end up with a collection of n jackknifed estimates of all the subsamples, each with one element left out. Once we’re done, there’s an optional last step: we compute the mean of these jackknifed estimates, which gives us the jackknifed variance.
For example, suppose we have the sample [1,3,2,1]. (I’m going to write all my examples in Haskell for brevity, but the code in this post should be easy to port to any statistical language.)
The simplest way to compute variance is as follows:
var xs = (sum (map (^2) xs) - sum xs ^ 2 / n) / n
  where n = fromIntegral (length xs)
Using this method, the variance of [1,3,2,1] is 0.6875.
To jackknife the variance:
var [1,3,2,1] == 0.6875
-- leave out each element in succession
-- (I'm using ".." to denote repeating expansions)
var [  3,2,1] == 0.6666..
var [1,  2,1] == 0.2222..
var [1,3,  1] == 0.8888..
var [1,3,2  ] == 0.6666..
-- compute the mean of the estimates over the subsamples
mean [0.6666,0.2222,0.8888,0.6666]
  == 0.6111..
Since 0.6111 is quite different from 0.6875, we can see that the variance of this sample is affected rather a lot by bias.
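The leave-one-out procedure described above can be written down directly. This is only a minimal sketch of the naive approach: the names dropAt, naiveJackknife and jackknifedVar are made up for illustration, and the estimator is passed in as an ordinary function.

```haskell
-- Naive jackknife sketch. dropAt, naiveJackknife and jackknifedVar are
-- illustrative names, not part of any library.

-- Remove the element at index i.
dropAt :: Int -> [a] -> [a]
dropAt i xs = take i xs ++ drop (i + 1) xs

-- Apply an estimator to every leave-one-out subsample.
naiveJackknife :: ([Double] -> Double) -> [Double] -> [Double]
naiveJackknife estimator xs =
  [ estimator (dropAt i xs) | i <- [0 .. length xs - 1] ]

-- The simple variance formula from the text.
var :: [Double] -> Double
var xs = (sum (map (^ 2) xs) - sum xs ^ 2 / n) / n
  where n = fromIntegral (length xs)

-- Mean of the jackknifed estimates.
jackknifedVar :: [Double] -> Double
jackknifedVar xs = sum js / fromIntegral (length js)
  where js = naiveJackknife var xs
```

Each of the n subsamples costs a full pass over the data, so the cost of this version grows quadratically with the sample size.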
While the jackknife is simple, it’s also slow. We can easily see that the approach outlined above takes O(n^{2}) time, which means that we can’t jackknife samples above a modest size in a reasonable amount of time.
This approach to the jackknife is the one everybody actually uses. Nevertheless, it’s possible to improve the time complexity of the jackknife for some important estimators from O(n^{2}) to O(n). Here’s how.
Let’s start with the simple case of the mean. Here’s the obvious way to measure the mean of a sample.
mean xs = sum xs / n
where n = fromIntegral (length xs)
And here are the computations we need to perform during the naive approach to jackknifing the mean.
-- n = fromIntegral (length xs - 1)
sum [ 3,2,1] / n
sum [1, 2,1] / n
sum [1,3, 1] / n
sum [1,3,2 ] / n
Let’s decompose the sum operations into two triangles as follows, and see what jumps out:
sum [ 3,2,1] = sum [] + sum [3,2,1]
sum [1, 2,1] = sum [1] + sum [2,1]
sum [1,3, 1] = sum [1,3] + sum [1]
sum [1,3,2 ] = sum [1,3,2] + sum []
From this perspective, we’re doing a lot of redundant work. For example, to calculate sum [1,3,2], it would be very helpful if we could reuse the work we did in the previous calculation to calculate sum [1,3].
We can achieve our desired reuse of earlier work if we store each intermediate sum in a separate list. This technique is called prefix summation, or (if you’re a Haskeller) scanning.
Here’s the bottom left triangle of sums we want to calculate.
sum []      {- + sum [3,2,1] -}
sum [1]     {- + sum [2,1]   -}
sum [1,3]   {- + sum [1]     -}
sum [1,3,2] {- + sum []      -}
We can prefix-sum these using Haskell’s standard scanl function.
>>> init (scanl (+) 0 [1,3,2,1])
[0,1,4,6]
{- e.g. [0,
         0 + 1,
         0 + 1 + 3,
         0 + 1 + 3 + 2] -}
(We use init to drop out the final term, which we don’t want.)
And here’s the top right of the triangle.
{- sum []      + -} sum [3,2,1]
{- sum [1]     + -} sum [2,1]
{- sum [1,3]   + -} sum [1]
{- sum [1,3,2] + -} sum []
To prefix-sum these, we can use scanr, which scans “from the right”.
>>> tail (scanr (+) 0 [1,3,2,1])
[6,3,1,0]
{- e.g. [3 + 2 + 1 + 0,
         2 + 1 + 0,
         1 + 0,
         0] -}
(As in the previous case, we use tail to drop out the first term, which we don’t want.)
Now we have two lists:
[0,1,4,6]
[6,3,1,0]
Next, we sum the lists pairwise, which gives us exactly the sums we need:
sum [ 3,2,1] == 0 + 6 == 6
sum [1, 2,1] == 1 + 3 == 4
sum [1,3, 1] == 4 + 1 == 5
sum [1,3,2 ] == 6 + 0 == 6
Divide each sum by n-1, and we have the four subsample means we were hoping for, but in linear time, not quadratic time!
Here’s the complete method for jackknifing the mean in O(n) time.
jackknifeMean :: Fractional a => [a] -> [a]
jackknifeMean xs =
    map (/ n) $
    zipWith (+)
    (init (scanl (+) 0 xs))
    (tail (scanr (+) 0 xs))
  where n = fromIntegral (length xs - 1)
If we’re jackknifing the mean, there’s no point in taking the extra step of computing the mean of the jackknifed subsamples to estimate the bias. Since the mean is an unbiased estimator, the mean of the jackknifed means should be the same as the sample mean, so the bias will always be zero.
However, the jackknifed subsamples do serve a useful purpose: each one tells us how much its corresponding left-out data point affects the sample mean. Let’s see what this means.
>>> mean [1,3,2,1]
1.75
The sample mean is 1.75, and let’s see which subsample mean is farthest from this value:
>>> jackknifeMean [1,3,2,1]
[2, 1.3333, 1.6666, 2]
So if we left out 1 from the sample, the mean would be 2, but if we left out 3, the mean would become 1.3333. Clearly, this is the subsample mean that is farthest from the sample mean, so 3 is the most significant outlier in our estimate of the mean.
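Picking out that most influential point can be automated. The following is a small sketch that assumes the jackknifeMean definition from earlier, specialised to Double; mostInfluential is a hypothetical helper name.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Leave-one-out means in O(n), as described above.
jackknifeMean :: [Double] -> [Double]
jackknifeMean xs =
    map (/ n) $
    zipWith (+)
      (init (scanl (+) 0 xs))
      (tail (scanr (+) 0 xs))
  where n = fromIntegral (length xs - 1)

-- Pair each data point with the mean obtained by leaving it out, and
-- return the pair whose subsample mean lies farthest from the sample mean.
-- (mostInfluential is an illustrative name; the input must be non-empty.)
mostInfluential :: [Double] -> (Double, Double)
mostInfluential xs =
    maximumBy (comparing distance) (zip xs (jackknifeMean xs))
  where
    m = mean xs
    distance (_, subMean) = abs (subMean - m)
```

For the sample [1,3,2,1] this selects the data point 3, matching the reasoning above.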
Let’s look again at the naive formula for calculating variance:
var xs = (sum (map (^2) xs) - sum xs ^ 2 / n) / n
  where n = fromIntegral (length xs)
Since this approach is based on sums, it looks like maybe we can use the same prefix summation technique to compute the variance in O(n) time.
Because we’re computing a sum of squares and an ordinary sum, we need to perform two sets of prefix sum computations:
Two to compute the sum of squares, one from the left and another from the right
And two more for computing the square of sums
import Data.List (zipWith4)

jackknifeVar xs =
    zipWith4 var squaresLeft squaresRight sumsLeft sumsRight
  where
    var l2 r2 l r = ((l2 + r2) - (l + r) ^ 2 / n) / n
    squares       = map (^2) xs
    squaresLeft   = init (scanl (+) 0 squares)
    squaresRight  = tail (scanr (+) 0 squares)
    sumsLeft      = init (scanl (+) 0 xs)
    sumsRight     = tail (scanr (+) 0 xs)
    n             = fromIntegral (length xs - 1)
If we look closely, buried in the local function var above, we will see almost exactly the naive formulation for variance, only constructed from the relevant pieces of our four prefix sums.
Exactly the same prefix sum approach applies to jackknifing higher order moment statistics, such as skewness (lopsidedness of the distribution curve) and kurtosis (shape of the tails of the distribution).
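As an illustration of how this extends, here is a sketch that jackknifes the third central moment (the numerator of skewness) from leave-one-out sums of x, x^2 and x^3. The helper names are invented for this example, and the expansion E[(x-mu)^3] = E[x^3] - 3*mu*E[x^2] + 2*mu^3 is a standard identity rather than anything taken from a library. Like the naive variance formula, it makes no attempt at numerical robustness.

```haskell
-- Leave-one-out sums via one left scan and one right scan.
-- (leaveOneOutSums and jackknifeThirdMoment are illustrative names.)
leaveOneOutSums :: [Double] -> [Double]
leaveOneOutSums ys =
  zipWith (+) (init (scanl (+) 0 ys)) (tail (scanr (+) 0 ys))

-- Third central moment of each leave-one-out subsample, using
-- E[(x-mu)^3] = E[x^3] - 3*mu*E[x^2] + 2*mu^3.
jackknifeThirdMoment :: [Double] -> [Double]
jackknifeThirdMoment xs = zipWith3 moment s1 s2 s3
  where
    s1 = leaveOneOutSums xs
    s2 = leaveOneOutSums (map (^ 2) xs)
    s3 = leaveOneOutSums (map (^ 3) xs)
    n  = fromIntegral (length xs - 1)
    moment a b c = c / n - 3 * mu * (b / n) + 2 * mu ^ 3
      where mu = a / n
```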
When we’re dealing with a lot of floating point numbers, the ever present concerns about numerical stability and accuracy arise.
For example, suppose we compute the sum of ten million pseudo-random floating point numbers between zero and one.
The most accurate way to sum numbers is by first converting them to Rational, summing, then converting back to Double. We’ll call this the “true sum”. The standard Haskell sum function (“basic sum” below) simply adds numbers as it goes. It manages 14 decimal digits of accuracy before losing precision.
true sum:  5000754.656937315
basic sum: 5000754.65693705
                         ^
However, Kahan’s algorithm does even better.
true sum: 5000754.656937315
kahan sum: 5000754.656937315
If you haven’t come across Kahan’s algorithm before, it looks like this.
kahanStep (sum, c) x = (sum', c')
  where y    = x - c
        sum' = sum + y
        c'   = (sum' - sum) - y
The c term maintains a running correction of the errors introduced by each addition.
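To turn the step function into a complete summation, we can fold it over the input. This is a minimal sketch specialised to Double; kahanSum is a name chosen here for illustration.

```haskell
import Data.List (foldl')

-- One step of Kahan summation: s is the running sum, c the running
-- correction for low-order bits lost by earlier additions.
kahanStep :: (Double, Double) -> Double -> (Double, Double)
kahanStep (s, c) x = (s', c')
  where y  = x - c
        s' = s + y
        c' = (s' - s) - y

-- Fold the step over the input and keep the running sum.
-- (kahanSum is an illustrative name.)
kahanSum :: [Double] -> Double
kahanSum = fst . foldl' kahanStep (0, 0)
```

Note that foldl' only forces the pair itself, not its components, so a production version would want a strict accumulator type.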
Naive summation seems to do just fine, right? Well, watch what happens if we simply add 10^{10} to each number, sum these, then subtract 10^{17} at the end.
true sum:  4999628.983274754
basic sum: 450000.0
kahan sum: 4999632.0
                ^
The naive approach goes completely off the rails, and produces a result that is off by an order of magnitude!
This catastrophic accumulation of error is often cited as the reason why the naive formula for the mean can’t be trusted.
mean xs = sum xs / n
where n = fromIntegral (length xs)
Thanks to Don Knuth, what is usually suggested as a replacement is Welford’s algorithm.
import Data.List (foldl')

data WelfordMean a = M !a !Int
                   deriving (Show)

welfordMean = end . foldl' step zero
  where end (M m _)    = m
        step (M m n) x = M m' n'
          where m' = m + (x - m) / fromIntegral n'
                n' = n + 1
        zero           = M 0 0
Here’s what we get if we compare the three approaches:
true mean: 0.49996289832747537
naive mean: 0.04500007629394531
welford mean: 0.4998035430908203
Not surprisingly, the naive mean is worse than useless, but the long-respected Welford method only gives us three decimal digits of precision. That’s not so hot.
More accurate is the Kahan mean, which is simply the sum calculated using Kahan’s algorithm, then divided by the length:
true mean: 0.49996289832747537
kahan mean: 0.4999632
welford mean: 0.4998035430908203
This at least gets us to five decimal digits of precision.
So is the Kahan mean the answer? Well, Kahan summation has its own problems. Let’s try out a test vector.
-- originally due to Tim Peters
>>> let vec = concat (replicate 1000 [1,1e100,1,-1e100])
-- accurate sum
>>> sum (map toRational vec)
2000
-- naive sum
>>> sum vec
0.0
-- Kahan sum
>>> foldl kahanStep (S 0 0) vec
S 0.0 0.0
Ugh, the Kahan algorithm doesn’t do any better than naive addition. Fortunately, there’s an even better summation algorithm available, called the Kahan-Babuška-Neumaier algorithm.
kbnSum = uncurry (+) . foldl' step (0,0)
  where
    step (sum, c) x = (t, c')
      where c' | abs sum >= abs x = c + ((sum - t) + x)
               | otherwise        = c + ((x - t) + sum)
            t = sum + x
If we try this on the same test vector, we taste sweet success! Thank goodness!
>>> kbnSum vec
2000.0
Not only is Kahan-Babuška-Neumaier (let’s call it “KBN”) more accurate than Welford summation, it has the advantage of being directly usable in our desired prefix sum form. We’ll accumulate floating point error proportional to O(1) instead of the O(n) that naive summation gives.
Poor old Welford’s formula for the mean just can’t get a break! Not only is it less accurate than KBN, but since it’s a recurrence relation with a divisor that keeps changing, we simply can’t monkeywrench it into suitability for the same prefix-sum purpose.
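To make the “usable in prefix sum form” point concrete, here is a sketch that swaps the fold for a scan, so that every intermediate compensated sum is kept. The names kbnStep and kbnPrefixSums are illustrative only.

```haskell
-- One KBN step: s is the running sum, c the accumulated correction.
kbnStep :: (Double, Double) -> Double -> (Double, Double)
kbnStep (s, c) x = (t, c')
  where
    t = s + x
    c' | abs s >= abs x = c + ((s - t) + x)
       | otherwise      = c + ((x - t) + s)

-- Compensated prefix sums: element i is the KBN sum of the first i inputs,
-- so the result has one more element than the input.
-- (kbnPrefixSums is an illustrative name.)
kbnPrefixSums :: [Double] -> [Double]
kbnPrefixSums = map (uncurry (+)) . scanl kbnStep (0, 0)
```

Because each step depends only on the previous (sum, correction) pair and the next input, the scan needs no lookahead and no changing divisor, which is exactly what the Welford recurrence lacks.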
In our jackknifed variance, we used almost exactly the same calculation as the naive variance, merely adjusted to prefix sums. Here's the plain old naive variance function once again.
var xs = (sum (map (^2) xs) - sum xs ^ 2 / n) / n
  where n = fromIntegral (length xs)
The problem with this algorithm arises as the size of the input grows. These two terms are likely to converge for large n:
sum (map (^2) xs)
sum xs ^ 2 / n
When we subtract them, floating point cancellation leads to a large error term that turns our result into nonsense.
The usual way to deal with this is to switch to a two-pass algorithm. (In case it’s not clear at first glance, the first pass below calculates mean.)
var2 xs = (sum (map (^2) ys) - sum ys ^ 2 / n) / n
  where n  = fromIntegral (length xs)
        ys = map (subtract (mean xs)) xs
By subtracting the mean from every term, we keep the numbers smaller, so the two sum terms are less likely to converge.
This approach poses yet another conundrum: we want to jackknife the variance. If we have to correct for the mean to avoid cancellation errors, do we need to calculate each subsample mean? Well, no. We can get away with a cheat: instead of subtracting the subsample mean, we subtract the sample mean, on the assumption that it’s “close enough” to each of the subsample means to be a good enough substitute.
So. To calculate the jackknifed variance, we use KBN summation to avoid a big cumulative error penalty during addition, subtract the sample mean to avoid cancellation error when subtracting the sum terms, and then we’ve finally got a pretty reliable floating point algorithm.
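Putting those pieces together might look something like the sketch below. It is not the code from the statistics library; the names kbnStep, kbnSum, leaveOneOutSums and jackknifeVarStable are made up here, and combining the left and right scans by adding their sums and corrections is one simple choice among several.

```haskell
-- One KBN step: (running sum, accumulated correction).
kbnStep :: (Double, Double) -> Double -> (Double, Double)
kbnStep (s, c) x = (t, c')
  where
    t = s + x
    c' | abs s >= abs x = c + ((s - t) + x)
       | otherwise      = c + ((x - t) + s)

kbnSum :: [Double] -> Double
kbnSum = uncurry (+) . foldl kbnStep (0, 0)

-- Leave-one-out sums from a compensated left scan and right scan.
leaveOneOutSums :: [Double] -> [Double]
leaveOneOutSums ys = zipWith combine lefts rights
  where
    lefts  = init (scanl kbnStep (0, 0) ys)
    rights = tail (scanr (flip kbnStep) (0, 0) ys)
    combine (sl, cl) (sr, cr) = (sl + sr) + (cl + cr)

-- Jackknifed variance on data centred on the *sample* mean, which stands
-- in for each subsample mean as described in the text.
jackknifeVarStable :: [Double] -> [Double]
jackknifeVarStable xs =
    zipWith var (leaveOneOutSums (map (^ 2) ys)) (leaveOneOutSums ys)
  where
    m  = kbnSum xs / fromIntegral (length xs)  -- first pass: sample mean
    ys = map (subtract m) xs                   -- centre the data
    n  = fromIntegral (length xs - 1)
    var s2 s1 = (s2 - s1 ^ 2 / n) / n
```

Since variance does not change when a constant is subtracted from every value, centring on the sample mean leaves each subsample variance unchanged while keeping the summed terms small.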
The jackknife function in the Haskell statistics library uses all of these techniques where applicable, and the Sum module of the math-functions library provides reliable summation (including second-order Kahan-Babuška summation, if you gotta catch all those least significant bits).
(If you’re not already bored to death of summation algorithms, take a look into pairwise summation. It’s less accurate than KBN summation, but claims to be quite a bit faster—claims I found to be only barely true in my benchmarks, and not worth the loss of precision.)
Let’s start with a speed comparison between the handwritten C code that powers Node.js’s HTTP parser and an idiomatic Haskell parser that uses attoparsec. There are good reasons to take these numbers with a fistful of salt, so imagine huge error bars, warning signs, and whatnot, but they’re still interesting.
A little explanation is in order for why there are two entries for http-parser. The “null” driver consists of a series of empty callbacks, and represents the best possible performance we can get. The “naive” http-parser driver allocates memory for both a request and each of its headers, and frees this memory once a request parse has finished. (A real user of http-parser is likely to be slower than the naive driver, as http-parser forces its clients to do complex bookkeeping.)
Meanwhile, the attoparsec parser is of course tiny: a few dozen lines of code, instead of a few thousand. More interestingly, it’s faster than its do-nothing C counterpart. When I last compared the two, back in 2010, attoparsec was a little over half the speed of http-parser, so to pass it feels like an exciting development.
To be clear, you really shouldn’t treat comparing the two as anything other than a fast-and-loose exercise. The attoparsec parser does less work in some ways, for instance by not special-casing the Content-Length header. At the same time, it does more work in a different, but perhaps more important case: there’s no equivalent of the maze of edge cases that arise with http-parser when a parse spans a boundary between blocks of input. The attoparsec programming model is simply way less hairy.
Caveats aside, my purpose with this comparison is to paint with broad strokes what I hope is a compelling picture: you can write a compact, clean parser using attoparsec, and you can expect it to perform well.
Compared to the previous version of attoparsec, the new internals of this version yield some solid speedups. On attoparsec’s own microbenchmark suite, speedups range from flat to nearly 2x.
If you use the aeson JSON library to parse JSON data that contains a lot of numbers, you can expect a nice boost in performance.
In addition to being faster, attoparsec is now generally more space efficient too. In a test of an application that uses Johan Tibell’s cassava library for handling CSV files, the app used 39% less memory with the new version of attoparsec than before, while running 5% faster.
The new internals of attoparsec allowed me to add a feature I’ve wanted for a few years, one I had given up on as impossible with the previous internals.
match :: Parser a -> Parser (ByteString, a)
Given an arbitrary parser, match returns both the result of the parse and the string that it consumed while matching.
>>> let p = (,) <$> decimal <*> ("," *> decimal)
>>> parseOnly (match p) "1,31337"
Right ("1,31337",(1,31337))
This is very handy when what you’re interested in is not just the components of a parse result, but also the precise input string that the parser matched. (Imagine using this to save the contents of a comment while parsing a programming language, for instance.)
What changed to yield both big performance improvements and previously impossible capabilities? To understand this, let’s discuss how attoparsec worked until today.
The age-old way to write parser libraries in Haskell is to treat parsing as a job of consuming input from the front of a string. If you want to match the string "foo" and your input is "foobar", you pull the prefix from "foobar" and hand "bar" to your successor parser as its input. This is how attoparsec used to work, and we’ll see where it becomes relevant in a moment.
One of attoparsec’s major selling points is that it works with incomplete input. If we give it insufficient input to make a decision about what to do, it will tell us.
>>> parse ("bar" <|> "baz") "ba"
Partial _
If we get a Partial constructor, we resume parsing by feeding more input to the continuation it hands us. The easiest way is to use feed:
>>> let cont = parse ("bar" <|> "baz") "ba"
>>> cont `feed` "r"
Done "" "bar"
Continuations interact in an interesting way with backtracking. Let’s talk a little about backtracking in isolation first.
>>> let lefty = Left <$> decimal <* ".!"
>>> let righty = Right <$> rational
The parser lefty will not succeed until it has read a decimal number followed by some nonsense.
Suppose we get partway through a parse on input like this.
>>> let cont = parse (lefty <|> righty) "123."
>>> cont
Partial _
Even though the decimal portion of lefty has succeeded, if we feed the string "1!" to the continuation, lefty as a whole will fail, parsing will backtrack to the beginning of the input, and righty will succeed.
>>> cont `feed` "1!"
Done "!" Right 123.1
What’s happening behind the scenes here is important.
Under the old version of attoparsec, parsing proceeds by consuming input. By the time we reach the "." in the input of "123.", we have thrown away the leading "123" as a result of decimal succeeding, so our remaining input is "." when we ask for more.
The <|> combinator holds onto the original input in case a parse fails. Since a parse may need to ask for more input before it fails (as in this case), the old attoparsec has to keep track of this additional continuation-fed input separately, and glue the saved and added inputs together on each backtrack. Worse yet, sometimes we have to throw away added input in order to avoid double-counting it.
This surely sounds complicated and fragile, but it was the only scheme I could think of that would work under the “parsing as consuming input” model that attoparsec started with. I managed to make this setup run fast enough that (once I’d worked the bugs out) I wasn’t too bothered by the additional complexity.
The model that attoparsec used to follow was that we consumed input, and for correctness when backtracking did our bookkeeping of added input separately.
Under the new model, we manage input and added input in one unified Buffer abstraction. We track our position using a separate cursor, which is simply an integer index into a Buffer.
If we need to backtrack, we simply hand the current Buffer to the alternate parser, along with the cursor that will restart parsing at the right spot.
The idea of parsing with a cursor isn’t mine; it came up during a late night IRC conversation with Ed Kmett. I’m excited that this change happened to make it easy to add a new combinator, match, which had previously seemed impossible to write.
match :: Parser a -> Parser (ByteString, a)
In the new cursor-based world, all we need to build match is to remember the cursor position when we start parsing. If the parse succeeds, we extract the substring that spans the old and new cursor positions. I spent quite a bit of time pondering this problem with the old representation without getting anywhere, but by changing the internal representation, it suddenly became trivial.
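To see why a cursor makes this easy, here is a deliberately tiny model of the idea. It is not attoparsec’s implementation: the Parser type below is a toy over String with no support for partial input, and every name in it is invented for this sketch.

```haskell
import Data.Char (isAlpha, isDigit)

-- Toy cursor-based parser: the input buffer never changes, only the
-- integer cursor moves. A successful parse returns the new cursor.
newtype Parser a = Parser { runParser :: String -> Int -> Maybe (Int, a) }

-- Consume one or more characters satisfying the predicate.
takeWhile1 :: (Char -> Bool) -> Parser String
takeWhile1 p = Parser $ \buf pos ->
  let tok = takeWhile p (drop pos buf)
  in if null tok then Nothing else Just (pos + length tok, tok)

digits, letters :: Parser String
digits  = takeWhile1 isDigit
letters = takeWhile1 isAlpha

-- Backtracking: if the left parser fails, restart the right parser from
-- the same buffer and the same cursor.
orElse :: Parser a -> Parser a -> Parser a
orElse (Parser l) (Parser r) = Parser $ \buf pos ->
  case l buf pos of
    Nothing -> r buf pos
    ok      -> ok

-- Remember the starting cursor, run the parser, and slice out the span
-- between the old and new cursor positions.
match :: Parser a -> Parser (String, a)
match (Parser p) = Parser $ \buf pos ->
  case p buf pos of
    Nothing        -> Nothing
    Just (pos', a) -> Just (pos', (take (pos' - pos) (drop pos buf), a))
```

Because nothing is ever thrown away from the front of the buffer, both backtracking and match reduce to bookkeeping on a pair of integers.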
Switching to the cursor-based representation accounts for some of the performance improvements in the new release, as it opened up a few new avenues for further small tweaks.
There’s another implementation twist, though: why is the Buffer type not simply a ByteString? Here, the question is one of efficiency, specifically behaviour in response to pathologically crafted inputs.
Every time someone feeds us input via the Partial continuation, we have to add this to the input we already have. The obvious thing to do is treat Buffer as a glorified ByteString and simply string-append the new input to the existing input and get on with life.
Troublingly, this approach would require two string copies per append: we’d allocate a new string, copy the original string into it, then tack the appended string on the end. It’s easy to see that this has quadratic time complexity, which would allow a hostile attacker to DoS us by simply drip-feeding us a large volume of valid data, one byte at a time.
The new Buffer structure addresses such attacks by exponential doubling, such that most appends require only one string copy instead of two. This improves the worst-case time complexity of being drip-fed extra input from O(n^{2}) to O(n log n).
Making this work took a bit of a hack. The Buffer type contains a mutable array that contains both an immutable portion (visible to users) and an invisible mutable part at the end. Every time we append, we write to the mutable array, and hand back a Buffer that widens its immutable portion to include the part we just wrote to. The array is shared across successive Buffers until we run out of space.
This is very fast, but it’s also unsafe: nobody should ever append to the same Buffer twice, as the sharing of the array can lead to data corruption. Let’s think about how this could arise. Our original Buffer still thinks it can write to the mutable portion of an array, while our new Buffer considers the same area of memory to be immutable. If we append to the original Buffer again, we will scribble on memory that the new Buffer thinks is immutable.
Since neither our choice of API nor Haskell’s type system can prevent bad actions here, users are free to make the programming error of appending to a Buffer more than once, even though it makes no sense to do so. It’s not satisfactory to have pure code react badly even when the programmer is doing something wrong, so I addressed this problem in an interesting way.
The immutable shell of a Buffer contains a generation number. We embed a mutable generation number in the shared array that each Buffer points to. We increment the mutable generation number every time we append to a Buffer, and hand back a Buffer that also has an incremented immutable generation number.
The mutable and immutable generation numbers should always agree. If they fall out of sync, we know that someone is appending to a Buffer more than once. We react by duplicating the mutable array, so that the new append cannot interfere with the existing array. This amounts to a cheap copy-on-write scheme: copies never occur in the typical case of users behaving sensibly, while we preserve correctness if a programmer starts doing daft things.
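The generation-number check can be modelled in a few lines. This is a toy sketch, not attoparsec’s Buffer: it uses an IORef holding a String where the real thing uses a mutable byte array, it runs in IO rather than presenting a pure interface, and all of the names are invented for illustration.

```haskell
import Data.IORef

-- The shared, mutable part: a generation counter and the backing storage.
data Shared = Shared { sharedGen :: IORef Int, sharedData :: IORef String }

-- The immutable shell: the storage it points at, the generation it
-- expects to find there, and how much of the storage it may see.
data Buffer = Buffer Shared Int Int

newBuffer :: String -> IO Buffer
newBuffer s = do
  g <- newIORef 0
  d <- newIORef s
  pure (Buffer (Shared g d) 0 (length s))

-- Only the prefix covered by this shell is visible.
contents :: Buffer -> IO String
contents (Buffer sh _ n) = take n <$> readIORef (sharedData sh)

append :: Buffer -> String -> IO Buffer
append buf@(Buffer sh g n) extra = do
  current <- readIORef (sharedGen sh)
  if current == g
    then do
      -- Generations agree: extend the shared storage in place.
      modifyIORef' (sharedData sh) (\old -> take n old ++ extra)
      writeIORef (sharedGen sh) (g + 1)
      pure (Buffer sh (g + 1) (n + length extra))
    else do
      -- Stale shell: someone already appended here, so copy our visible
      -- prefix into fresh storage instead of scribbling on theirs.
      visible <- contents buf
      newBuffer (visible ++ extra)
```

Appending twice to the same original shell takes the copying branch the second time, so the first result keeps its contents intact.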
Before I embarked on this redesign, I doubled the size of attoparsec’s test and benchmark suites. This gave me a fair sense of safety that I wouldn’t accidentally break code as I went.
Once the rate of churn settled down, I found the most significant packages using attoparsec on Hackage and tried them out.
This revealed that an incompatible change I’d made in the core Parser type caused quite a lot of downstream build breakage, with a third of the packages that I tried failing to build. This was a good motivator for me to learn how to fix the problem.
Once I fixed this self-imposed difficulty, all of the top packages turned out to be API-compatible with the new release. It was definitely helpful to have a tool that let me find important users of the package.
Between the expanded test suite, better benchmarks, and this extra degree of checking, I am now feeling moderately confident that the sweeping changes I’ve made should be fairly safe to inflict on people. I hope I’m right! Please enjoy the results of my work.
package        mojo   status
aeson          10000  clean
snap-core      2030   requires --allow-newer
conduit-extra  1816   clean
fay            1740   clean
snap           1681   requires --allow-newer
conduit-extra  1492   clean
persistent     1487   clean
yaml           1313   clean
io-streams     1205   requires --allow-newer
configurator   1161   clean
yesod-form     1077   requires --allow-newer
snap-server    889    requires --allow-newer
heist          881    requires --allow-newer
parsers        817    clean
cassava        643    clean
When I was compiling the list of significant packages using attoparsec, I made a guess that the Unix rev would reverse the order of lines in a file. What it does instead seems much less useful: it reverses the bytes on each line.
Why do I mention this? Because my mistake led to the discovery that there’s a surprising number of Haskell packages whose names read at least as well backwards as forwards.
citatsdosey revresfoornus
corpeticcodnap rotaremunecesrapotta
erocognid rotaremuneptth
erocpans rotarugifnoc
forpcollaemitchg slootipa
kramtsop stekcosbew
morfgnirtsetyb teppupegaugnal
nosea tropmish
revirdbew troppusipakrowten
(And finally-most-of-all, if you’re curious about where I measured my numbers, I used my 2011-era 2.2GHz MacBook Pro running 64-bit GHC 7.6.3. Server-class hardware should do way better.)
Here are the top hundred Hackage packages as calculated by this method, along with their numeric measures of centrality, to which I’ve given the slightly catchier name “mojo” here.
This method has a few obvious flaws: it doesn’t count downloads, nor can it take into account packages that only contain executables. That said, the results still look pretty robust.
package                 mojo
base                    10000
ghc-prim                9178
array                   1354
bytestring              1278
deepseq                 1197
containers              994
transformers            925
mtl                     840
text                    546
time                    460
filepath                441
directory               351
parsec                  299
old-locale              267
template-haskell        247
network                 213
process                 208
vector                  208
pretty                  187
random                  172
binary                  158
QuickCheck              130
utf8-string             128
stm                     119
unix                    116
haskell98               100
hashable                96
attoparsec              92
old-time                88
primitive               87
aeson                   72
unordered-containers    70
syb                     69
data-default            67
split                   64
transformers-base       63
blaze-builder           62
monad-control           62
conduit                 62
semigroups              59
cereal                  57
tagged                  57
bindings-DSL            55
HUnit                   55
gtk                     54
Cabal                   54
lens                    50
OpenGL                  46
haskell-src-exts        45
cmdargs                 45
HTTP                    44
http-types              43
extensible-exceptions   43
glib                    42
utility-ht              41
data-default-class      38
parallel                35
resourcet               34
semigroupoids           34
xml                     34
comonad                 33
lifted-base             33
cairo                   33
safe                    32
MissingH                31
exceptions              31
base-unicode-symbols    31
ansi-terminal           31
vector-space            30
nats                    30
OpenGLRaw               30
monads-tf               28
wai                     28
hslogger                28
regex-compat            28
GLUT                    27
void                    27
blaze-html              26
hxt                     25
dlist                   25
zlib                    25
hmatrix                 24
SDL                     24
case-insensitive        24
scientific              23
X11                     23
tagsoup                 22
regex-posix             22
HaXml                   22
system-filepath         22
enumerator              22
contravariant           21
base64-bytestring       21
http-conduit            21
blaze-markup            21
MonadRandom             20
failure                 20
test-framework          20
xhtml                   20
distributive            19
If you’re tuning in recently, David and I both love systems programming, and we particularly get a kick out of doing it in Haskell. Let me state this more plainly: Haskell is an excellent systems programming language.
Our aim with this class is to teach both enough advanced Haskell that students really get a feel for how different it is from other programming languages, and to apply this leverage to the kinds of problems that people typically think of as “systemsy”: How do I write solid concurrent software? How do I design it cleanly? What do I do to make it fast? How do I talk to other stuff, like databases and web servers?
As before, we’re making our lecture notes freely available. In my case, the notes are complete rewrites compared to the 2011 notes.
I had a few reasons for rewriting everything. I have changed the way I teach: every class has at least some amount of interactivity, including in-class assignments to give students a chance to absorb what I’m throwing at them. Compared to the first time around, I’ve dialed back the sheer volume of information in each lecture, to make the pace less overwhelming. Everything is simply fresher in my mind if I write the material right before I deliver it.
And finally, sometimes I can throw away plans at the last minute. On the syllabus for today, I was supposed to rehash an old talk about folds and parallel programming, but I found myself unable to get motivated by either subject at 8pm last night, once I’d gotten the kids to bed and settled down to start on the lecture notes. So I hemmed and hawed for a few minutes, decided that talking about lenses was way more important, and went with that.
Some of my favourite parts of the teaching experience are the most humbling. I hold office hours every week; this always feels like a place where I have to bring my “A” game, because there’s no longer a script. Some student will wander in with a problem where I have no idea what the answer is, but I vaguely remember reading a paper four years ago that covered it, so when I’m lucky I get to play glorified librarian and point people at really fun research.
I do get asked why we don’t do this as a MOOC.
It is frankly a pleasure to actually engage with a room full of bright, motivated people, and to try to find ways to help them and encourage them. I don’t know quite how I’d replicate that visceral feedback with an anonymous audience, but it qualitatively matters to me.
And to be honest, I’ve been skeptical of the MOOC phenomenon, because while the hype around them was huge, it’s always been clear that almost nobody knew what they were doing, or what it would even mean for that model to be successful. If the MOOC world converges on a few models that make some sense and don’t take a vast effort to do well, I’m sure we’ll revisit the possibility.
Until then, enjoy the slides, and happy hacking!
It's time someone finally wrote a proper review of Simon Marlow's amazing book, Parallel and Concurrent Programming in Haskell.
I am really not the right person to tackle this job objectively, because I have known Simon for 20 years and I currently happen to be his boss at Facebook. Nevertheless, I fly my flag of editorial bias proudly, and in any case a moment's glance at Simon's book will convince you that the absurdly purple review I am about to write is entirely justified.
Moreover, this book is sufficiently clear, and introduces so many elegant ideas and beautiful abstractions, that you would do well to learn the minimal amount of Haskell necessary to absorb its lessons, simply so that you can become enriched in the reading.
Simon's book makes an overdue departure from the usual Haskell literature (including my own book, which in case you didn't know is fully titled "Real World Haskell Of Six Years Ago Which We Should Have Edited A Little More Carefully") in assuming that you already have a modest degree of understanding of the language. This alone is slightly pathetically refreshing! I can't tell you how glad I am that functional programming has finally reached the point where we no longer have to start every bloody book by explaining what it is.
Actually, there's a second reason that I might not be an ideal person to review this book: I have only skimmed most of the first half, which concerns itself with parallel programming. Just between you and me, I will confess that parallel programming in Haskell hasn't lit my internal fire of enthusiasm. I used to do a lot of parallel programming in a previous life, largely using MPI, and the experience burned me out. While parallel programming in Haskell is far nicer than grinding away in MPI ever was, I do not love the subject enough that I want to read about it.
So what I'm really reviewing here is the second part of Simon's book, which if issued all by itself at the same price as the current entire tome, would still be a bargain. Let's talk about just how good it is.
The second half of the book concerns itself with concurrent programming, an area where Haskell particularly shines, and which happens to be the bread-and-butter of many a working programmer today. The treatment of concurrency does not depend in any way on the preceding chapters, so if you're so inclined, you can read chapter one and then skip to the second half of the book without missing any necessary information.
Chapter 7 begins by introducing some of the basic components of concurrent Haskell, threads (familiar to all) and a data type called an MVar. An MVar acts a bit like a single-item box: you can put one item into it if it's empty, otherwise you must wait; and you can take an item out if it's full, otherwise you must wait.
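To make the box analogy concrete, here is a minimal sketch of my own (it is not taken from the book). It uses only the standard Control.Concurrent.MVar functions; the name boxDemo is made up for illustration.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Hand two values from a forked thread to the caller through one MVar.
-- (boxDemo is a hypothetical name, used only for this illustration.)
boxDemo :: IO [Int]
boxDemo = do
  box <- newEmptyMVar
  _ <- forkIO $ do
    putMVar box 1    -- the box is empty, so this succeeds at once
    putMVar box 2    -- waits until the reader has emptied the box
  a <- takeMVar box  -- waits until the box is full
  b <- takeMVar box
  return [a, b]
```

Because the box holds only one item at a time, the writer and the reader are forced to take turns, and the values arrive in the order they were written.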
As humble as the MVar is, Simon uses it as a simple communication channel with which he builds a simple concurrent logging service. He then deftly identifies the performance problem that a concurrent service will have when an MVar acts as a bottleneck. Not content with this bottleneck, he illustrates how to construct an efficient unbounded channel using MVar as the building block, and clearly explains how this more complex structure works safely.
This is the heart of Simon's teaching technique: he presents an idea that is simple to grasp, then pokes a hole in it. With this hole as motivation, he presents a slightly more complicated approach that corrects the weaknesses of the prior step, without sacrificing that clarity.
For instance, the mechanism behind unbounded channels is an intricate dance of two MVars, where Simon clearly explains how they ensure that a writer will not block, while a reader will block only if the channel is empty. He then goes on to show how this channel type can be extended to support multicast, such that one writer can send messages to several readers. His initial implementation is subtly incorrect, which he once again explains and uses as a springboard to a final version. By this time, you've accumulated enough lessons from the progression of examples that you can appreciate the good design taste and durability of these unbounded channels.
Incidentally, this is a good time to talk about the chapter on parallel computing that I made sure not to skip: chapter 4, which covers dataflow parallelism using an abstraction called Par. Many of the types and concerns in this chapter will be familiar to you if you're used to concurrent programming with threads, which makes this the most practical chapter to start with if you want to venture into parallel programming in Haskell, but don't know where to begin. Par is simply wonderfully put together, and is an inspiring example of tasteful, parsimonious API design. So put chapter 4 on your must-read list.
Returning to the concurrent world, chapter 8 introduces exceptions, using asynchronous operations as the motivation. Simon builds a data type called Async, which is similar to "futures" or "promises" from other languages (and to the IVar type from chapter 4), and proceeds to make Async operations progressively more robust in the face of exceptions, then more powerful so that we can wait on the completion of one of several Async operations.
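For readers who have not met futures before, here is a rough sketch of the idea, assuming nothing beyond base. It is my own simplification rather than the book's code: the result slot is an MVar that holds either the exception the action threw or the value it produced.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, readMVar)
import Control.Exception (SomeException, throwIO, try)

-- A handle on a computation running in another thread.
newtype Async a = Async (MVar (Either SomeException a))

-- Start the action in a new thread and record how it ended.
async :: IO a -> IO (Async a)
async action = do
  var <- newEmptyMVar
  _ <- forkIO (try action >>= putMVar var)
  return (Async var)

-- Block until the action has finished; rethrow its exception, if any.
wait :: Async a -> IO a
wait (Async var) = readMVar var >>= either throwIO return
```

Using readMVar rather than takeMVar leaves the result in place, so several threads can wait on the same Async.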
Chapter 9 resumes the progress up the robustness curve, by showing how we can safely cancel Async operations that have not yet completed, how to deal with the trouble that exceptions can cause when thrown at an inopportune time (hello, resource leaks!), and how to put an upper bound on the amount of time that an operation can run for.
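As a small illustration of the last point, base already ships a timeout function in System.Timeout. The sketch below uses that library function rather than anything built in the book; slowDemo and fastDemo are names I made up.

```haskell
import Control.Concurrent (threadDelay)
import System.Timeout (timeout)

-- A one-second sleep given only a 50 ms budget: the result is Nothing.
slowDemo :: IO (Maybe ())
slowDemo = timeout 50000 (threadDelay 1000000)

-- An action that finishes well within its budget: the result is Just 42.
fastDemo :: IO (Maybe Int)
fastDemo = timeout 1000000 (return 42)
```

The time limit is given in microseconds, and the Maybe in the result tells the caller whether the action finished or was cut short.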
Software transactional memory gets an extended treatment in chapters 10 and 11. STM has gotten a bad rap in the concurrent programming community, mostly because the implementations of STM that target traditional programming languages have drawbacks so huge that they are deeply unappealing. In the same way that the Java and C++ of 10-15 years ago ruined the reputation of static type systems when there were vastly better alternatives out there, STM in Haskell might be easy to consign to the intellectual dustbin by association, when in fact it's a much more interesting beast than its relatives.
A key problem with traditional STM is that its performance is killed stone dead by the amount of mutable state that needs to be tracked during a transaction. Haskell sidesteps much of this need for bookkeeping with its default stance that favours immutable data. Nevertheless, STM in Haskell does have a cost, and Simon shows how to structure code that uses STM to make its overheads acceptable.
Another huge difficulty with traditional STM lies in the messy boundary between transactional code and code that has side effects (and which hence cannot be safely called from a transaction). Haskell's type system eliminates these difficulties, and in fact makes it easier to construct sophisticated combinations of transactional operations. Although we touched on STM having some overhead, Simon revisits the Async API and uses some of the advanced features of Haskell STM to build a multiple-wait implementation that is more efficient than its MVar-based predecessor.
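To give a flavour of how transactional operations combine, here is a minimal sketch that waits on whichever of two results is ready first. It assumes the stm package that ships with GHC; readWhenFull, waitEither and demo are my own illustrative names, not the book's API.

```haskell
import Control.Concurrent.STM
  (STM, TVar, atomically, newTVarIO, orElse, readTVar, retry)

-- Return the value once the slot has been filled; until then, retry.
readWhenFull :: TVar (Maybe a) -> STM a
readWhenFull var = do
  contents <- readTVar var
  case contents of
    Nothing -> retry      -- abandon this attempt and wait for a change
    Just x  -> return x

-- Try the left slot first; if it would block, try the right one.
waitEither :: TVar (Maybe a) -> TVar (Maybe b) -> STM (Either a b)
waitEither l r = (Left <$> readWhenFull l) `orElse` (Right <$> readWhenFull r)

-- Only the right slot is filled, so the result is Right "ready".
demo :: IO (Either Int String)
demo = do
  l <- newTVarIO Nothing
  r <- newTVarIO (Just "ready")
  atomically (waitEither l r)
```

The STM type in the signatures is what keeps side-effecting code out: nothing of type IO can be called from inside readWhenFull or waitEither.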
In chapter 14, Simon covers Cloud Haskell, a set of fascinating packages that implement Erlang-style distributed message passing, complete with monitoring and restart of remote nodes. I admire Cloud Haskell for its practical willingness to adopt wholesale the very solid ideas of the Erlang community, as they have a quarter of a century of positive experience with their distinctive approach to constructing robust distributed applications.
If you don't already know Haskell, this book offers two significant gifts. The first is a vigorous and compelling argument for why Haskell is an uncommonly good language for the kind of concurrent programming that is fundamental to much of today's computing. The second is an eye-opening illustration of some beautiful and powerful APIs that transcend any particular language. Concise, elegant design is worth celebrating wherever you see it, and this book is brimful of examples.
On the other hand, if you're already a Haskell programmer, it is very likely that this book will awaken you to bugs you didn't know your concurrent code had, abstractions that you could be building to make your applications cleaner, and practical lessons in how to start simple and then refine your code as you learn more about your needs.
Finally, for me as a writer of books about computing, this book has lessons too. It is understated, letting the quality of its examples and abstractions convince more deeply than bombast could reach. It is minimalist, revisiting the same few initially simple ideas through successive waves of refinement and teaching. And it is clear, with nary a word out of place.
In short, if you care about Haskell, if you are interested in concurrency, if you appreciate good design, if you have an ear for well-crafted teaching,
text, attoparsec, and aeson libraries on Hackage, and there’s a surprising amount to look forward to in them.
The summary for the impatient: some core operations in text and aeson are now much more efficient. With text, UTF-8 encoding is up to four times faster, while with aeson, encoding and decoding of JSON bytestrings are both up to twice as fast.
Perhaps the least interesting release is attoparsec. It adds a new dependency on Bas van Dijk’s scientific package to allow efficient and more accurate parsing of floating point numbers, a longstanding minor weakness. It also introduces two new functions for single-token lookahead, which are used by the new release of aeson; read on for more details.
The new release of the text library has much better support for encoding to a UTF-8 bytestring via the encodeUtf8 function. The new encoder is up to four times faster than in the previous major release.
Simon Meier contributed a pair of UTF-8 encoding functions that can encode to the new Builder type in the latest version of the bytestring library. These functions are slower than the new encodeUtf8 implementation, but still twice as fast as the old encodeUtf8.
Not only are the new Builder encoders admirably fast, they’re more flexible than encodeUtf8, as Builders can be used to efficiently glue together many small fragments. Once again, read on for more details about how this helped with the new release of aeson. (Note: if you don’t have the latest version of bytestring in your library environment, you won’t get the new Builder encoders.)
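Here is a small sketch of the two styles side by side. I believe the Builder encoder is exposed as encodeUtf8Builder in Data.Text.Encoding; direct and viaBuilder are names invented for this example.

```haskell
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Encode one Text value to UTF-8 in a single step.
direct :: T.Text -> BL.ByteString
direct = BL.fromStrict . TE.encodeUtf8

-- Encode many fragments and glue them together through a Builder,
-- without first concatenating the Text values.
viaBuilder :: [T.Text] -> BL.ByteString
viaBuilder = B.toLazyByteString . foldMap TE.encodeUtf8Builder
```

Both routes produce the same bytes; the Builder route is the one to reach for when the output is assembled from many small pieces.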
The second major change to the text library came about when I finally decided to expose all of the library’s internal modules. The newly exposed modules can be found in the Data.Text.Internal hierarchy. Before you get too excited, please understand that I can’t make guarantees of release-to-release stability for any functions or types that are documented as internal.
Finally, the new release of the aeson library focuses on improved performance and accuracy. We parse floating point numbers more accurately thanks once again to Bas van Dijk’s scientific library. And for performance, both decoding and encoding of JSON bytestrings are up to twice as fast as in the previous release.
On the decoding side, I used the new lookahead primitives from attoparsec to make parsing faster and less memory intensive (by avoiding backtracking, if you’re curious). Meanwhile, Simon Meier contributed a patch that uses his new Builder-based UTF-8 encoder from the text library to double encoding performance. (Encoding performance is improved even if you don’t have the necessary new version of bytestring, but only by about 10%.)
On my crummy old Mac laptop, I can decode at 30-40 megabytes per second, and encode at 100-170 megabytes per second. Not bad!
I'd particularly like to thank Bas van Dijk and Simon Meier for their excellent contributions during this most recent development cycle. It's really a pleasure to work with such smart, friendly people.
Simon and Bas deserve some kind of an additional medal for being forgiving of my sometimes embarrassingly long review latencies: some of Simon's patches against the text library are almost two years old! (Please pardon me while I grasp at straws in my slightly shamefaced partial defence here: the necessary version of bytestring wasn't released until three months ago, so I'm not the only person in the Haskell community with long review latencies...)
text library: for some invalid inputs, the UTF-8 decoder was truncating the invalid data instead of throwing an exception.
Thanks to Michael providing an easy repro, I quickly bisected the origin of the regression to a commit from September that added support for incremental decoding of UTF-8. That work was motivated by applications that need to be able to consume incomplete input (e.g. a network packet containing possibly truncated data) as early as possible.
The low-level UTF-8 decoder is implemented as a state machine in C to squeeze as much performance out as possible. The machine has two visible end states: UTF8_ACCEPT indicates that a buffer was decoded completely and successfully, while UTF8_REJECT specifies that the input contained invalid UTF-8 data. When the decoder stops, all other machine states count as work in progress, i.e. a decode that couldn’t complete because we reached the end of a buffer.
When the old all-or-nothing decoder encountered an incomplete or invalid input, it would back up by a single byte to indicate the location of the error. The incremental decoder is a refactoring of the old decoder, and the new all-or-nothing decoder calls it.
The critical error arose in the refactoring process. Here’s the old code for backing up a byte.
/* Error recovery - if we're not in a
   valid finishing state, back up. */
if (state != UTF8_ACCEPT)
    s -= 1;
This is what the refactoring changed it to:
/* Invalid encoding, back up to the
   errant character. */
if (state == UTF8_REJECT)
    s -= 1;
To preserve correctness, the refactoring should have added a check to the new all-or-nothing decoder so that it would step back a byte if the final state of the incremental decoder was neither UTF8_ACCEPT nor UTF8_REJECT. Oops! A very simple bug with unhappy consequences.
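The difference between the three versions is easy to see if we model just the final state. This is a sketch in Haskell rather than the real C code; the Final type and the three function names are invented for illustration, with Incomplete standing for every state other than the two end states.

```haskell
-- How the state machine ended: cleanly, on invalid input, or mid-character.
data Final = Accept | Reject | Incomplete
  deriving (Eq, Show)

-- Old all-or-nothing decoder: back up unless decoding finished cleanly.
oldBackUp :: Final -> Bool
oldBackUp st = st /= Accept

-- Refactored incremental decoder: back up only on definitely invalid input.
incrementalBackUp :: Final -> Bool
incrementalBackUp st = st == Reject

-- What the new all-or-nothing wrapper needed: also back up when the
-- incremental decoder stopped in the middle of a character.
fixedBackUp :: Final -> Bool
fixedBackUp st = incrementalBackUp st || st == Incomplete
```

The old and fixed versions agree on all three states, while incrementalBackUp alone disagrees exactly on Incomplete, which is where truncated input slipped through.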
The text library has quite a large test suite that has revealed many bugs over the years, often before they ever escaped into the wild. Why did this ugly critter make it over the fence?
Well, a glance at the original code for trying to test UTF-8 error handling is telling; in fact, you don’t even need to be able to read a programming language, because the confession is in the comment.
-- This is a poor attempt to ensure that
-- the error handling paths on decode are
-- exercised in some way. Proper testing
-- would be rather more involved.
“Proper testing” indeed. All that I did in the original test was generate a random byte sequence, and see if it provoked the decoder into throwing an exception. The chances of such a dumb test really offering any value are not great, but I had more or less forgotten about it, and so I had a sense of security without the accompanying security. But hey, at least past-me had left a mea culpa note for present-day-me. Right?
While finding and fixing the bug took just a few minutes, I spent several more hours strengthening the test for the UTF-8 decoder, and this was far more interesting.
As a variable-length self-synchronizing encoding, UTF-8 is very clever and elegant, but its cleverness allows for a number of implementation bugs. For reference, here is a table (lightly edited from Wikipedia) of the allowable bit patterns used in UTF-8.
first code point | last code point | byte 1   | byte 2   | byte 3   | byte 4
U+0000           | U+007F          | 0xxxxxxx |          |          |
U+0080           | U+07FF          | 110xxxxx | 10xxxxxx |          |
U+0800           | U+FFFF          | 1110xxxx | 10xxxxxx | 10xxxxxx |
U+10000          | U+1FFFFF        | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx
The best known of these bugs involves accepting non-canonical encodings. What a canonical encoding means takes a little explaining. UTF-8 can represent any ASCII character in a single byte, and in fact every ASCII character must be represented as a single byte. However, an illegal two-byte encoding of an ASCII character can be achieved by starting with 0xC0, followed by the ASCII character with the high bit set. For instance, the ASCII forward slash U+002F is represented in UTF-8 as 0x2F, but a decoder with this bug would also accept 0xC0 0xAF (three- and four-byte encodings are of course also possible).
This bug may seem innocent, but it was widely used to remotely exploit IIS 4 and IIS 5 servers over a decade ago. Correct UTF-8 decoders must reject non-canonical encodings. (These are also known as overlong encodings.)
In fact, the bytes 0xC0 and 0xC1 will never appear in a valid UTF-8 bytestream, as they can only be used to start two-byte sequences that cannot be canonical.
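The arithmetic behind an overlong two-byte sequence is short enough to write out. This is a sketch for illustration only; overlong2 is a made-up name, and its argument is assumed to be an ASCII value below 0x80.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Word (Word8)

-- Illegal two-byte encoding of an ASCII value (assumed to be < 0x80).
-- The top bit of the 7-bit value lands in the leading byte, and the
-- low six bits go into a continuation byte.
overlong2 :: Word8 -> [Word8]
overlong2 c = [0xC0 .|. (c `shiftR` 6), 0x80 .|. (c .&. 0x3F)]
```

Since an ASCII value shifted right by six bits is either 0 or 1, the leading byte can only ever be 0xC0 or 0xC1, which is why those two bytes never occur in valid input. For the forward slash, overlong2 0x2F gives 0xC0 0xAF.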
To test our UTF-8 decoder’s ability to spot bogus input, then, we might want to generate byte sequences that start with 0xC0 or 0xC1. Haskell’s QuickCheck library provides us with just such a generating function, choose, which generates a random value in the given range (inclusive).
choose (0xC0, 0xC1)
Once we have a bad leading byte, we may want to follow it with a continuation byte. The value of a particular continuation byte doesn’t much matter, but we would like it to be valid. A continuation byte always contains the bit pattern 0x80 combined with six bits of data in its least significant bits. Here’s a generator for a random continuation byte.
contByte = (0x80 +) <$> choose (0, 0x3F)
Our bogus leading byte should be rejected immediately, since it can never generate a canonical encoding. For the sake of thoroughness, we should sometimes follow it with a valid continuation byte to ensure that the two-byte sequence is also rejected.
To do this, we write a general combinator, upTo, that will generate a list of up to n random values.
upTo :: Int -> Gen a -> Gen [a]
upTo n gen = do
  k <- choose (0,n)
  vectorOf k gen -- a QuickCheck combinator
And now we have a very simple way of saying “either 0xC0 or 0xC1, optionally followed by a continuation byte”.
-- invalid leading byte of a 2-byte sequence.
(:) <$> choose (0xC0,0xC1) <*> upTo 1 contByte
Notice in the table above that a 4-byte sequence can encode any code point up to U+1FFFFF. The highest legal Unicode code point is U+10FFFF, so by implication there exists a range of leading bytes for 4-byte sequences that can never appear in valid UTF-8.
-- invalid leading byte of a 4-byte sequence.
(:) <$> choose (0xF5,0xFF) <*> upTo 3 contByte
We should never encounter a continuation byte without a leading byte somewhere before it.
-- Continuation bytes without a start byte.
listOf1 contByte
-- The listOf1 combinator generates a list
-- containing at least one element.
Similarly, a bit pattern that introduces a 2-byte sequence must be followed by one continuation byte, so it’s worth generating such a leading byte without its continuation byte.
-- Short 2-byte sequence.
(:[]) <$> choose (0xC2, 0xDF)
We do the same for 3-byte and 4-byte sequences.
-- Short 3-byte sequence.
(:) <$> choose (0xE0, 0xEF) <*> upTo 1 contByte
-- Short 4-byte sequence.
(:) <$> choose (0xF0, 0xF4) <*> upTo 2 contByte
Earlier, we generated 4-byte sequences beginning with a byte in the range 0xF5 to 0xFF. Although 0xF4 is a valid leading byte for a 4-byte sequence, it’s possible for a perverse choice of continuation bytes to yield an illegal code point between U+110000 and U+13FFFF. This code generates just such illegal sequences.
-- 4-byte sequence greater than U+10FFFF.
do k <- choose (0x11, 0x13)
   let w0 = 0xF0 + (k `Bits.shiftR` 2)
       w1 = 0x80 + ((k .&. 3) `Bits.shiftL` 4)
   ([w0,w1]++) <$> vectorOf 2 contByte
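To convince ourselves that these bytes really land above U+10FFFF, we can reassemble the code point by hand. This is a standalone sketch using plain Int values; leading and codePoint are names made up here, and k plays the same role as in the generator, supplying bits 16 to 20 of the code point (bits 12 to 15 are left at zero).

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))

-- First two bytes of a 4-byte sequence whose code point has
-- bits 16..20 equal to k and bits 12..15 equal to zero.
leading :: Int -> (Int, Int)
leading k = (0xF0 + (k `shiftR` 2), 0x80 + ((k .&. 3) `shiftL` 4))

-- Reassemble a code point from the payload bits of four bytes:
-- 3 bits from the leading byte, then 6 bits from each continuation byte.
codePoint :: Int -> Int -> Int -> Int -> Int
codePoint w0 w1 w2 w3 =
      ((w0 .&. 0x07) `shiftL` 18)
  .|. ((w1 .&. 0x3F) `shiftL` 12)
  .|. ((w2 .&. 0x3F) `shiftL` 6)
  .|.  (w3 .&. 0x3F)
```

With k = 0x11 the leading pair is 0xF4 0x90, and following it with two 0x80 continuation bytes gives the code point 0x110000, the first value past the legal range.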
Finally, we arrive at the general case of non-canonical encodings. We take a one-byte code point and encode it as two, three, or four bytes; and so on for two-byte and three-byte characters.
-- Overlong encoding.
do k <- choose (0,0xFFFF)
   let c = chr k
   case k of
     _ | k < 0x80 -> oneof [
             let (w,x) = ord2 c in return [w,x]
           , let (w,x,y) = ord3 c in return [w,x,y]
           , let (w,x,y,z) = ord4 c in return [w,x,y,z] ]
       | k < 0x7FF -> oneof [
             let (w,x,y) = ord3 c in return [w,x,y]
           , let (w,x,y,z) = ord4 c in return [w,x,y,z] ]
       | otherwise ->
             let (w,x,y,z) = ord4 c in return [w,x,y,z]
-- The oneof combinator chooses a generator at random.
-- Functions ord2, ord3, and ord4 break down a character
-- into its 2-, 3-, or 4-byte encoding.
Armed with a generator that uses oneof to choose one of the above invalid UTF-8 encodings at random, we embed the invalid bytestream in one of three cases: by itself, at the end of an otherwise valid buffer, and at the beginning of an otherwise valid buffer. This variety gives us some assurance of catching buffer overrun errors.
Sure enough, this vastly more elaborate QuickCheck test immediately demonstrates the bug that Michael found.
The original test is a classic case of basic fuzzing: it simply generates random junk and hopes for the best. The fact that it let the decoder bug through underlines the weakness of fuzzing. If I had cranked the number of randomly generated test inputs up high enough, I’d probably have found the bug, but the approach of pure randomness would have caused the bug to remain difficult to reproduce and understand.
The revised test is much more sophisticated, as it generates only test cases that are known to be invalid, with a rich assortment of precisely generated invalid encodings to choose from. While it has the same probabilistic nature as the fuzzing approach, it excludes a huge universe of uninteresting inputs from being tested, and hence is much more likely to reveal a weakness quickly and efficiently.
The moral of the story: even QuickCheck tests, though vastly more powerful than unit tests and fuzz tests, are only as good as you make them!
]]>For a while now, I’ve had it in mind to improve the encoding performance of my Haskell JSON package, aeson.
Over the weekend, I went from hazy notion to a proof of concept for what I think could be a reasonable approach.
This post is a case of me “thinking out loud” about the initial design I came up with. I’m very interested in hearing if you have a cleaner idea.
The problem with the encoding method currently used by aeson is that it occurs via a translation to the Value type. While this is simple and uniform, it involves a large amount of intermediate work that is essentially wasted. When encoding a complex value, the Value that we build up is expensive, and it will become garbage immediately.
It should be much more efficient to simply serialize straight to a Builder, the type that is optimized for concatenating many short string fragments. But before marching down that road, I want to make sure that I provide a clean API that is easy to use correctly.
I’ve posted a gist that contains a complete copy of this proof-of-concept code.
{-# LANGUAGE GeneralizedNewtypeDeriving, FlexibleInstances,
    OverloadedStrings #-}
import Data.Monoid (Monoid(..), (<>))
import Data.Text (Text)
import Data.Text.Lazy.Builder (Builder, singleton)
import qualified Data.Text.Lazy.Builder as Builder
import qualified Data.Text.Lazy.Builder.Int as Builder
The core Build type has a phantom type that allows us to say “I am encoding a value of type a”. We’ll see where this type tracking is helpful (and annoying) below.
data Build a = Build {
    _count :: !Int
  , run    :: Builder
  }
The internals of the Build type would be hidden from users; here’s what they mean. The _count field tracks the number of elements we’re encoding of an aggregate JSON value (an array or object); we’ll see why this matters shortly. The run field lets us access the underlying Builder.
We provide three empty types to use as parameters for the Build type.
data Object
data Array
data Mixed
We’ll want to use the Mixed type if we’re cramming a set of disparate Haskell values into a JSON array; read on for more.
When it comes to gluing values together, the Monoid class is exactly what we need.
instance Monoid (Build a) where
    mempty = Build 0 mempty
    mappend (Build i a) (Build j b)
      | ij > 1    = Build ij (a <> singleton ',' <> b)
      | otherwise = Build ij (a <> b)
      where ij = i + j
Here’s where the _count field comes in; we want to separate elements of an array or object using commas, but this is necessary only when the array or object contains more than one value.
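For anyone who wants to play with the comma logic without installing anything, here is a cut-down sketch that swaps Builder for String. It is an illustration, not the gist's code: build, int and array are simplified stand-ins for the helpers of the same name in this post. Recent GHC releases want a Semigroup instance alongside Monoid, so the sketch supplies one, and it tightens the rule slightly so that a comma appears only when both sides are non-empty.

```haskell
-- Phantom-typed builder over String instead of Builder.
data Build a = Build { _count :: !Int, run :: String }

-- Empty type used as a phantom parameter for encoded arrays.
data Array

instance Semigroup (Build a) where
  Build i a <> Build j b
    | i > 0 && j > 0 = Build (i + j) (a ++ "," ++ b)  -- both sides have elements
    | otherwise      = Build (i + j) (a ++ b)         -- one side is empty

instance Monoid (Build a) where
  mempty = Build 0 ""

-- Wrap a single encoded element.
build :: String -> Build a
build = Build 1

int :: Int -> Build Int
int = build . show

-- Surround the accumulated elements with brackets.
array :: Build a -> Build Array
array (Build 0 _)  = build "[]"
array (Build _ vs) = build ("[" ++ vs ++ "]")
```

In ghci, run (array (int 1 <> int 2 <> mempty)) evaluates to "[1,2]", with no stray comma from the trailing mempty.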
To encode a simple value, we provide a few obvious helpers. (These are clearly so simple as to be wrong, but remember: my purpose here is to explore the API design, not to provide a proper implementation.)
build :: Builder -> Build a
build = Build 1
int :: Integral a => a -> Build a
int = build . Builder.decimal
text :: Text -> Build Text
text = build . Builder.fromText
Encoding a JSON array is easy.
array :: Build a -> Build Array
array (Build 0 _) = build "[]"
array (Build _ vs) = build $ singleton '[' <> vs <> singleton ']'
If we try this out in ghci, it behaves as we might hope.
?> array $ int 1 <> int 2
"[1,2]"
JSON puts no constraints on the types of the elements of an array. Unfortunately, our phantom type causes us difficulty here.
An expression of this form will not typecheck, as it’s trying to join a Build Int with a Build Text.
?> array $ int 1 <> text "foo"
This is where the Mixed type from earlier comes in. We use it to forget the original phantom type so that we can construct an array with elements of different types.
mixed :: Build a -> Build Mixed
mixed (Build a b) = Build a b
Our new mixed function gets the types to be the same, giving us something that typechecks.
?> array $ mixed (int 1) <> mixed (text "foo")
"[1,foo]"
This seems like a fair compromise to me. A Haskell programmer will normally want the types of values in an array to be the same, so the default behaviour of requiring this makes sense (at least to my current thinking), but we get a back door for when we absolutely have to go nuts with mixing types.
The last complication stems from the need to build JSON objects. Each key in an object must be a string, but the value can be of any type.
-- Encode a key-value pair.
(<:>) :: Build Text -> Build a -> Build Object
k <:> v = Build 1 (run k <> ":" <> run v)
object :: Build Object -> Build Object
object (Build 0 _) = build "{}"
object (Build _ kvs) = build $ singleton '{' <> kvs <> singleton '}'
If you’ve had your morning coffee, you’ll notice that I am not living up to my high-minded principles from earlier. Perhaps the types involved here should be something closer to this:
data Object a
(<:>) :: Build Text -> Build a -> Build (Object a)
object :: Build (Object a) -> Build (Object a)
(In which case we’d need a mixed-like function to forget the phantom types for when we want to get mucky and unsafe, but I digress.)
How does this work out in practice?
?> object $ "foo" <:> int 1 <> "bar" <:> int 3
"{foo:1,bar:3}"
Hey look, that’s more or less as we might have hoped!
Open questions, for which I appeal to you for help:
Does this design appeal to you at all?
If not, what would you change?
If yes, to what extent am I wallowing in the “types for thee, but not for me” sin bin by omitting a phantom parameter for Object?
Helpful answers welcome!
]]>Big data[1][2] is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage,[3] search, sharing, transfer, analysis,[4] and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to “spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”
Now what if we get tired of the current hype cycle?
Big fucking deal[1][2] is a collection of deals so fucking large and complex that it becomes difficult to process using on-hand fuck giving tools or traditional shit giving techniques. The challenges include capture, curation, storage,[3] search, sharing, transfer, analysis,[4] and all kinds of who the fuck knows what else. The trend to larger fucking deals is due to the additional shit derivable from giving a fuck about a single large fucking pile of related shit, as compared to separate smaller piles with the same total amount of bullshit, allowing correlations to be found to “spot business shit, determine quality of whatever, prevent some nasty shit, link legal shit right the fuck together, combat fucking crime no I am not making this up it’s like fucking Batman, and determine real-time traffic shittiness.”]]>
A couple of days ago, Alexey Khudyakov did a little digging into the accuracy of criterion’s measurements. I thought his results were interesting enough to be worth some deeper analysis.
First, let’s briefly discuss Alexey’s method and findings. He created 1,000 identical copies of a benchmark, and looked to see if the measurements changed over time. They did, slowly increasing in a linear fashion. (This is a phenomenon known to statisticians as serial correlation, where a measurement influences a future measurement.)
If every benchmark is the same, why do the measurements increase? Criterion does its bookkeeping in memory. For every run, it saves a piece of data in memory. Not until all benchmarks have finished does it write out that data to a Javascript+HTML report or CSV file.
I thought that the slow increase in measurements was probably related to this bookkeeping, but how to test this hypothesis?
I created 200 identical copies of the same benchmark (I’m not patient enough to wait for 1,000!) and dumped a heap profile while it ran, then I plotted the numbers measured by criterion against heap size.
For this particular benchmark, criterion spends about 4% of its time in the garbage collector. The size of the heap increases by 300% as the program runs. If we expect garbage collection overhead to affect measurements, then the time we measure should increase by 12% as we repeat the benchmark over and over, slowly accumulating data.
This prediction exactly matches what we actually see: we start off measuring exp at 25.5 nanoseconds, and by the end we see it taking 28.5 nanoseconds.
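The back-of-envelope sum can be written down in a few lines. This sketch assumes, as the estimate above does, that garbage collection cost grows in proportion to heap size; the names are made up for illustration.

```haskell
-- GC takes about 4% of the time, and the heap grows by 300%,
-- so the expected extra cost is 0.04 * 3.0 = 0.12, i.e. 12%.
predictedSlowdown :: Double
predictedSlowdown = 0.04 * 3.0

-- Starting from 25.5 ns, a 12% increase predicts roughly 28.56 ns.
predictedFinal :: Double
predictedFinal = 25.5 * (1 + predictedSlowdown)
```

The predicted 28.56 nanoseconds sits very close to the 28.5 nanoseconds measured at the end of the run.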
The obvious next question is: how realistic a concern is this? A normal criterion program consists of a handful of benchmarks, usually all very different in what they do. I have not seen any cases of more than a few dozen benchmarks in a single suite. If only a few benchmarks get run, then there is essentially no opportunity for this inflation effect to become noticeable.
Nevertheless, I could definitely make some improvements (or even better, someone could contribute patches and I could continue to do nothing).
It would probably help to write data to a file after running each benchmark, and then to load that data back again before writing out the final report. [Edit: I wrote a patch that does just this; the increase in memory use vanishes, and along with it, the gradual inflation in measured times. Bingo!]
There is no benefit to looking for serial correlation across different benchmark runs, because nobody (except Alexey!) makes identical duplicates of a benchmark.
For the series of measurements collected for a single benchmark, it would probably be helpful to add an autocorrelation test, if only to have another opportunity to raise a red flag. Criterion is already careful to cry foul if its numbers look too messy, but first-order serial correlation would be likely to slip past the sophisticated tests it uses (like the bootstrap). I’ve long wanted to add a Durbin-Watson test, but I’ve been lazy for even longer.
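For the curious, the Durbin-Watson statistic itself is tiny. The sketch below is not criterion code; it is a plain-list illustration that assumes the residuals are simply deviations from the sample mean, and the function names are my own.

```haskell
-- Deviations of each measurement from the sample mean.
residuals :: [Double] -> [Double]
residuals xs = map (subtract mean) xs
  where mean = sum xs / fromIntegral (length xs)

-- Durbin-Watson statistic: the sum of squared differences between
-- successive residuals, divided by the sum of squared residuals.
-- The result is NaN if every residual is zero.
durbinWatson :: [Double] -> Double
durbinWatson es = numerator / denominator
  where
    numerator   = sum [ (b - a) ^ (2 :: Int) | (a, b) <- zip es (drop 1 es) ]
    denominator = sum [ e ^ (2 :: Int) | e <- es ]
```

A value near 2 suggests no first-order serial correlation, values toward 0 suggest positive correlation (the slow upward drift seen here), and values toward 4 suggest negative correlation. For a steadily rising series such as durbinWatson (residuals [1..10]), the statistic comes out close to 0.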
If you were to run every benchmark in a large suite one after the other in a single pass, then your final numbers could indeed be inflated by a few percent [edit: at least until I release the patch]. However, there are many other ways to confound your measurements, most of which will be far larger than this bookkeeping effect.
If you simply change the order in which you run your benchmarks, this can dramatically affect the numbers you’ll see.
The size of the heap that the GHC runtime uses makes a big difference, as do the threaded runtime, number of OS threads, and use of the parallel garbage collector. Any of these can change performance by a factor of two or more (!).
You should close busy tabs in a web browser (or preferably quit it entirely), kill your mail client and antivirus software, and try to eliminate other sources of system noise. You’ll be surprised by how big a difference these can make; anywhere from a handful to a few hundred percent.
If you want high-quality numbers, it is best to run just one benchmark from a suite at a time; on the quietest system you can manage; to watch for criterion's warnings about outliers affecting results; and to always compare several runs to see if your measurements are stable.
[Edit: Here is a chart of the measurements with the bug fixed, complete with a linear fit to indicate that the numbers are basically flat. Hooray!]
]]>