[Edit: a few hours after I wrote this post, I wrote some code to get rid of the inflation phenomenon it describes, and I'll publish a corresponding update to criterion shortly. See below for details, and the bottom for a new chart that shows the effect of the fix.]
A couple of days ago, Alexey Khudyakov did a little digging into the accuracy of criterion’s measurements. I thought his results were interesting enough to be worth some deeper analysis.
First, let’s briefly discuss Alexey’s method and findings. He created 1,000 identical copies of a benchmark, and looked to see if the measurements changed over time. They did, slowly increasing in a linear fashion. (This is a phenomenon known to statisticians as serial correlation, where a measurement influences a future measurement.)
I thought that the slow increase in measurements was probably related to this book-keeping, but how to test this hypothesis?
I created 200 identical copies of the same benchmark (I’m not patient enough to wait for 1,000!) and dumped a heap profile while it ran, then I plotted the numbers measured by criterion against heap size.
For this particular benchmark, criterion spends about 4% of its time in the garbage collector. The size of the heap increases by 300% as the program runs. If we expect garbage collection overhead to affect measurements, then the time we measure should increase by 12% as we repeat the benchmark over and over, slowly accumulating data.
This prediction exactly matches what we actually see: we start off measuring
exp at 25.5 nanoseconds, and by the end we see it taking 28.5 nanoseconds.
The obvious next question is: how realistic a concern is this? A normal criterion program consists of a handful of benchmarks, usually all very different in what they do. I have not seen any cases of more than a few dozen benchmarks in a single suite. If only a few benchmarks get run, then there is essentially no opportunity for this inflation effect to become noticeable.
Nevertheless, I could definitely make some improvements (or even better, someone could contribute patches and I could continue to do nothing).
It would probably help to write data to a file after running each benchmark, and then to load that data back again before writing out the final report. [Edit: I wrote a patch that does just this; the increase in memory use vanishes, and along with it, the gradual inflation in measured times. Bingo!]
There is no benefit to looking for serial correlation across different benchmark runs, because nobody (except Alexey!) makes identical duplicates of a benchmark.
For the series of measurements collected for a single benchmark, it would probably be helpful to add an autocorrelation test, if only to have another opportunity to raise a red flag. Criterion is already careful to cry foul if its numbers look too messy, but first-order serial correlation would be likely to slip past the sophisticated tests it uses (like the bootstrap). I’ve long wanted to add a Durbin-Watson test, but I’ve been lazy for even longer.
If you were to run every benchmark in a large suite one after the other in a single pass, then your final numbers could indeed be inflated by a few percent [edit: at least until I release the patch]. However, there are many other ways to confound your measurements, most of which will be far larger than this book-keeping effect.
If you simply change the order in which you run your benchmarks, this can dramatically affect the numbers you’ll see.
The size of the heap that the GHC runtime uses makes a big difference, as do the threaded runtime, number of OS threads, and use of the parallel garbage collector. Any of these can change performance by a factor of two or more (!).
You should close busy tabs in a web browser (or preferably quit it entirely), kill your mail client and antivirus software, and try to eliminate other sources of system noise. You’ll be surprised by how big a difference these can make; anywhere from a handful to a few hundred percent.
If you want high-quality numbers, it is best to run just one benchmark from a suite at a time; on the quietest system you can manage; to watch for criterion's warnings about outliers affecting results; and to always compare several runs to see if your measurements are stable.
[Edit: Here is a chart of the measurements with the bug fixed, complete with a linear fit to indicate that the numbers are basically flat. Hooray!]