<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>teideal glic deisbhéalach</title>
	<atom:link href="http://www.serpentine.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.serpentine.com/blog</link>
	<description>Bryan O&#039;Sullivan&#039;s blog</description>
	<lastBuildDate>Wed, 13 May 2015 17:43:20 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=4.2.2</generator>
	<item>
		<title>Sometimes, the old ways are the best</title>
		<link>http://www.serpentine.com/blog/2015/05/13/sometimes-the-old-ways-are-the-best/</link>
		<comments>http://www.serpentine.com/blog/2015/05/13/sometimes-the-old-ways-are-the-best/#comments</comments>
		<pubDate>Wed, 13 May 2015 16:13:43 +0000</pubDate>
		<dc:creator><![CDATA[Bryan O'Sullivan]]></dc:creator>
				<category><![CDATA[haskell]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=1096</guid>
		<description><![CDATA[Over the past few months, the Sigma engineering team at Facebook has rolled out a major Haskell project: a rewrite of Sigma, an important weapon in our armory for fighting spam and malware. Sigma has a mission-critical job, and it<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2015/05/13/sometimes-the-old-ways-are-the-best/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>Over the past few months, the Sigma engineering team at Facebook has rolled out a major Haskell project: a <a href="https://code.facebook.com/posts/302060973291128/open-sourcing-haxl-a-library-for-haskell/">rewrite of Sigma</a>, an important weapon in our armory for fighting spam and malware.</p>
<p>Sigma has a mission-critical job, and it needs to scale: its growing workload currently sees it handling tens of millions of requests per minute.</p>
<p>The rewrite of Sigma in Haskell, using the <a href="http://hackage.haskell.org/package/haxl">Haxl library</a> that Simon Marlow developed, has been a success. Throughput is higher than under its predecessor, and CPU usage is lower. Sweet!</p>
<p>Nevertheless, success brings with it surprises, and even though I haven’t worked on Sigma or Haxl, I’ve been implicated in one such surprise. To understand my accidental bit part in the show, let’s begin by mentioning that Sigma uses JSON internally for various purposes. These days, the Haskell-powered Sigma uses aeson, the JSON library I wrote, to handle JSON data.</p>
<p>A few months ago, the Haxl rewrite of Sigma was going through an episode of crazytown, in which it would intermittently and unpredictably use huge amounts of CPU and memory. The culprit turned out to be JSON strings containing zillions of backslashes. (I have no idea why. If you’ve worked with large volumes of data for a long time, you won’t even bat an eyelash at the idea that a data store somewhere contains some really weird records.)</p>
<p>The team quickly mitigated the problem, and gave me a nudge that I might want to look into the problem. On Sunday evening, with a glass of red wine in hand, I finally dove in to see what was wrong.</p>
<p>Since the Sigma developers had figured out what was causing these time and space explosions, I immediately had a test case to work with, and the results were grim: decoding a mere megabyte of continuous backslashes took over a second, consumed over a gigabyte of memory, and killed concurrency by causing the runtime system to spend almost 90% of its time in the garbage collector. Yikes!</p>
<p>Whatever was going on? If you <a href="https://github.com/bos/aeson/blob/deb59828ac24cc4feb859a711ca16b257078ffbb/Data/Aeson/Parser/Internal.hs#L212">look at the old implementation of aeson’s <code>unescape</code> function</a>, it seems quite efficient and innocuous. It’s reasonably tightly optimized low-level Haskell.</p>
<p>Trouble is, <code>unescape</code> uses an API (a bytestring builder) that is intended for <em>streaming</em> a result incrementally. Unfortunately the <code>unescape</code> function can’t hand any data back to its caller until it has processed an entire string.</p>
<p>The result is as you’d expect: we build a huge chain of thunks. In this case, the thunks will eventually write data efficiently into buffers. Alas, the thunks have nobody demanding the evaluation of their contents. This chain consumes a lot (a <em>lot</em>!) of memory and incurs a huge amount of GC overhead (long chains of thunks are expensive). Sadness ensues.</p>
<p>The “old ways” in the title refer to the fix: in place of a fancy streaming API, I simply <a href="https://github.com/bos/aeson/commit/05c9e0cbbebc861303fc7dd6b3dfd03844490621">allocate a single big buffer and blast the bytes straight into it</a>.</p>
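<p>To get a feel for the two styles, here’s a toy sketch of my own (not aeson’s actual code) that produces a run of backslash bytes both ways: first through a bytestring builder, where <code>mconcat</code> composes a long chain of deferred writes that nothing forces until <code>toLazyByteString</code> runs, and then by filling a single strict buffer up front.</p>

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy as BL

-- Builder-based: each append is deferred, so before toLazyByteString
-- demands the result we are holding a chain of n composed closures.
-- With no consumer demanding output incrementally, that chain is pure
-- memory overhead.
streamed :: Int -> B.ByteString
streamed n =
  BL.toStrict (BB.toLazyByteString (mconcat (replicate n (BB.word8 92))))

-- Single-buffer: allocate once and fill strictly, with no intermediate
-- chain of deferred writes to hold in memory.
direct :: Int -> B.ByteString
direct n = B.replicate n 92  -- 92 is the byte value of '\\'

main :: IO ()
main = print (streamed 1000 == direct 1000)  -- prints True
```

<p>Both produce the same bytes; the difference is only in how much deferred work piles up along the way.</p>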
<p>For that pathological string with almost a megabyte of consecutive backslashes, the new implementation is 27x faster and uses 42x less memory, all for the cost of perhaps an hour of Sunday evening hacking (including <a href="https://github.com/bos/attoparsec/commit/4f56f0d6508acfdb60b09d9f945645242211cbca">a little enabling work</a> that incidentally illustrates just how easy it is to work with monad transformers). Not bad!</p>]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2015/05/13/sometimes-the-old-ways-are-the-best/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>criterion 1.0</title>
		<link>http://www.serpentine.com/blog/2014/08/08/criterion-1-0/</link>
		<comments>http://www.serpentine.com/blog/2014/08/08/criterion-1-0/#comments</comments>
		<pubDate>Fri, 08 Aug 2014 10:02:48 +0000</pubDate>
		<dc:creator><![CDATA[Bryan O'Sullivan]]></dc:creator>
				<category><![CDATA[haskell]]></category>
		<category><![CDATA[open source]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=1089</guid>
		<description><![CDATA[Almost five years after I initially released criterion, I'm delighted to announce a major release with a large number of appealing new features. As always, you can install the latest goodness using cabal install criterion, or fetch the source from<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2014/08/08/criterion-1-0/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="generator" content="pandoc" />
  <title></title>
  <style type="text/css">code{white-space: pre;}</style>
</head>
<body>
<p>Almost five years after I initially released criterion, I'm delighted to announce a major release with a large number of appealing new features.</p>
<p>As always, you can install the latest goodness using <code>cabal install criterion</code>, or fetch the source <a href="https://github.com/bos/criterion">from github</a>.</p>
<p><a href="/criterion/fibber.html" target="_blank"><img src="/criterion/fibber-screenshot.png" /></a></p>
<p>Please let me know if you find criterion useful!</p>
<h1 id="new-documentation">New documentation</h1>
<p>I built both a <a href="/criterion">home page</a> and a thorough <a href="/criterion/tutorial.html">tutorial</a> for criterion. I've also extended the inline documentation and added a number of new examples.</p>
<p>All of the documentation lives in the <a href="https://github.com/bos/criterion/tree/master/www">github repo</a>, so if you'd like to see something improved, please send a bug report or pull request.</p>
<h1 id="new-execution-engine">New execution engine</h1>
<p>Criterion's model of execution has evolved, becoming vastly more reliable and accurate. It can now measure events that take just a few hundred picoseconds.</p>
<pre><code>benchmarking return ()
time                 512.9 ps   (512.8 ps .. 513.1 ps)</code></pre>
<p>While almost all of the core types have changed, criterion should remain API-compatible with the vast majority of your benchmarking code.</p>
<h1 id="new-metrics">New metrics</h1>
<p>In addition to wall-clock time, criterion can now measure and regress on the following metrics:</p>
<ul>
<li>CPU time</li>
<li>CPU cycles</li>
<li>bytes allocated</li>
<li>number of garbage collections</li>
<li>number of bytes copied during GC</li>
<li>wall-clock time spent in mutator threads</li>
<li>CPU time spent running mutator threads</li>
<li>wall-clock time spent doing GC</li>
<li>CPU time spent doing GC</li>
</ul>
<h1 id="linear-regression">Linear regression</h1>
<p>Criterion now supports linear regression of a number of metrics.</p>
<p>Here's a regression conducted using <code>--regress cycles:iters</code>:</p>
<pre><code>cycles:              1.000 R²   (1.000 R² .. 1.000 R²)
  iters              47.718     (47.657 .. 47.805)</code></pre>
<p>The first line of the output is the R² goodness-of-fit measure for this regression, and the second is the number of CPU cycles (measured using the <code>rdtsc</code> instruction) to execute the operation in question (integer division).</p>
<p>This next regression uses <code>--regress allocated:iters</code> to measure the number of bytes allocated while constructing an <code>IntMap</code> of 40,000 values.</p>
<pre><code>allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              4.382e7    (4.379e7 .. 4.384e7)</code></pre>
<p>(That's a little under 42 megabytes.)</p>
<h1 id="new-outputs">New outputs</h1>
<p>While its support for <a href="/criterion/report.html">active HTML</a> has improved, criterion can also now output JSON and JUnit XML files.</p>
<h1 id="new-internals">New internals</h1>
<p>Criterion has received its first spring cleaning, and is much easier to understand as a result.</p>
<h1 id="acknowledgments">Acknowledgments</h1>
<p>I was inspired into some of this work by the efforts of the authors of the OCaml <a href="https://blogs.janestreet.com/core_bench-micro-benchmarking-for-ocaml/">Core_bench</a> package.</p>
</body>
</html>]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2014/08/08/criterion-1-0/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Win bigger statistical fights with a better jackknife</title>
		<link>http://www.serpentine.com/blog/2014/06/10/win-bigger-statistical-fights-with-a-better-jackknife/</link>
		<comments>http://www.serpentine.com/blog/2014/06/10/win-bigger-statistical-fights-with-a-better-jackknife/#comments</comments>
		<pubDate>Wed, 11 Jun 2014 04:26:09 +0000</pubDate>
		<dc:creator><![CDATA[Bryan O'Sullivan]]></dc:creator>
				<category><![CDATA[haskell]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=1079</guid>
		<description><![CDATA[(Summary: I’ve developed some algorithms for a statistical technique called the jackknife that run in O(n) time instead of O(n2).) In statistics, an estimation technique called “the jackknife” has been widely used for over half a century. It’s a mainstay<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2014/06/10/win-bigger-statistical-fights-with-a-better-jackknife/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>(Summary: I’ve developed some algorithms for a statistical technique called the jackknife that run in <span class="math"><em>O</em>(<em>n</em>)</span> time instead of <span class="math"><em>O</em>(<em>n</em><sup>2</sup>)</span>.)</p>
<p>In statistics, an estimation technique called “the jackknife” has been widely used for over half a century. It’s a mainstay for taking a quick look at the quality of an estimator of a sample. (An estimator is a summary function over a sample, such as its mean or variance.)</p>
<p>Suppose we have a noisy sample. Our first stopping point might be to look at the variance of the sample, to get a sense of how much the values in the sample “spread out” around the average.</p>
<p>If the variance is not close to zero, then we know that the sample is somewhat noisy. But our curiosity may persist: is the variance unduly influenced by a few big spikes, or is the sample consistently noisy? The jackknife is a simple analytic tool that lets us quickly answer questions like this. There are more accurate, sophisticated approaches to this kind of problem, but they’re not nearly so easy to understand and use, so the jackknife has stayed popular since the 1950s.</p>
<p>The jackknife is easy to describe. We take the original sample, drop the first value out, and calculate the variance (or whatever the estimator is) over this subsample. We repeat this, dropping out only the second value, and continue. For an original sample with <span class="math"><em>n</em></span> elements, we end up with a collection of <span class="math"><em>n</em></span> jackknifed estimates of all the subsamples, each with one element left out. Once we’re done, there’s an optional last step: we compute the mean of these jackknifed estimates, which gives us the jackknifed variance.</p>
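<p>The procedure translates directly into code. Here’s a naive sketch of mine (not from any library) that jackknifes an arbitrary estimator by rebuilding every leave-one-out subsample:</p>

```haskell
-- Apply an estimator to every leave-one-out subsample.
-- Rebuilding each subsample from scratch is what makes this O(n^2).
jackknife :: ([a] -> b) -> [a] -> [b]
jackknife f xs = [ f (take i xs ++ drop (i + 1) xs)
                 | i <- [0 .. length xs - 1] ]
```

<p>For example, <code>jackknife sum [1,3,2,1]</code> gives <code>[6,4,5,6]</code>, the sum of each subsample with one element dropped.</p>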
<p>For example, suppose we have the sample <code>[1,3,2,1]</code>. (I’m going to write all my examples in Haskell for brevity, but the code in this post should be easy to port to any statistical language.)</p>
<p>The simplest way to compute variance is as follows:</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">var xs <span class="fu">=</span> (sum (map (<span class="fu">^</span><span class="dv">2</span>) xs) <span class="fu">-</span> sum xs <span class="fu">^</span> <span class="dv">2</span> <span class="fu">/</span> n) <span class="fu">/</span> n
  <span class="kw">where</span> n <span class="fu">=</span> fromIntegral (length xs)</code></pre>
<p>Using this method, the variance of <code>[1,3,2,1]</code> is <code>0.6875</code>.</p>
<p>To jackknife the variance:</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">var [<span class="dv">1</span>,<span class="dv">3</span>,<span class="dv">2</span>,<span class="dv">1</span>]  <span class="fu">==</span> <span class="dv">0</span><span class="fu">.</span><span class="dv">6875</span>

<span class="co">-- leave out each element in succession</span>
<span class="co">-- (I&#39;m using &quot;..&quot; to denote repeating expansions)</span>
var [  <span class="dv">3</span>,<span class="dv">2</span>,<span class="dv">1</span>]  <span class="fu">==</span> <span class="dv">0</span><span class="fu">.</span><span class="dv">6666</span><span class="fu">..</span>
var [<span class="dv">1</span>,  <span class="dv">2</span>,<span class="dv">1</span>]  <span class="fu">==</span> <span class="dv">0</span><span class="fu">.</span><span class="dv">2222</span><span class="fu">..</span>
var [<span class="dv">1</span>,<span class="dv">3</span>,  <span class="dv">1</span>]  <span class="fu">==</span> <span class="dv">0</span><span class="fu">.</span><span class="dv">8888</span><span class="fu">..</span>
var [<span class="dv">1</span>,<span class="dv">3</span>,<span class="dv">2</span>  ]  <span class="fu">==</span> <span class="dv">0</span><span class="fu">.</span><span class="dv">6666</span><span class="fu">..</span>

<span class="co">-- compute the mean of the estimates over the subsamples</span>
mean [<span class="dv">0</span><span class="fu">.</span><span class="dv">6666</span>,<span class="dv">0</span><span class="fu">.</span><span class="dv">2222</span>,<span class="dv">0</span><span class="fu">.</span><span class="dv">8888</span>,<span class="dv">0</span><span class="fu">.</span><span class="dv">6666</span>]
               <span class="fu">==</span> <span class="dv">0</span><span class="fu">.</span><span class="dv">6111</span><span class="fu">..</span></code></pre>
<p>Since 0.6111 is quite different from 0.6875, we can see that the variance of this sample is affected rather a lot by bias.</p>
<p>While the jackknife is simple, it’s also <em>slow</em>. We can easily see that the approach outlined above takes <span class="math"><em>O</em>(<em>n</em><sup>2</sup>)</span> time, which means that we can’t jackknife samples above a modest size in a reasonable amount of time.</p>
<p>This approach to the jackknife is the one everybody actually uses. Nevertheless, it’s possible to improve the time complexity of the jackknife for some important estimators from <span class="math"><em>O</em>(<em>n</em><sup>2</sup>)</span> to <span class="math"><em>O</em>(<em>n</em>)</span>. Here’s how.</p>
<h1 id="jackknifing-the-mean">Jackknifing the mean</h1>
<p>Let’s start with the simple case of the mean. Here’s the obvious way to measure the mean of a sample.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">mean xs <span class="fu">=</span> sum xs <span class="fu">/</span> n
  <span class="kw">where</span> n <span class="fu">=</span> fromIntegral (length xs)</code></pre>
<p>And here are the computations we need to perform during the naive approach to jackknifing the mean.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- n = fromIntegral (length xs - 1)</span>
sum [  <span class="dv">3</span>,<span class="dv">2</span>,<span class="dv">1</span>] <span class="fu">/</span> n
sum [<span class="dv">1</span>,  <span class="dv">2</span>,<span class="dv">1</span>] <span class="fu">/</span> n
sum [<span class="dv">1</span>,<span class="dv">3</span>,  <span class="dv">1</span>] <span class="fu">/</span> n
sum [<span class="dv">1</span>,<span class="dv">3</span>,<span class="dv">2</span>  ] <span class="fu">/</span> n</code></pre>
<p>Let’s decompose the <code>sum</code> operations into two triangles as follows, and see what jumps out:</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">sum [  <span class="dv">3</span>,<span class="dv">2</span>,<span class="dv">1</span>] <span class="fu">=</span> sum [] <span class="fu">+</span> sum [<span class="dv">3</span>,<span class="dv">2</span>,<span class="dv">1</span>]
sum [<span class="dv">1</span>,  <span class="dv">2</span>,<span class="dv">1</span>] <span class="fu">=</span> sum [<span class="dv">1</span>]  <span class="fu">+</span> sum [<span class="dv">2</span>,<span class="dv">1</span>]
sum [<span class="dv">1</span>,<span class="dv">3</span>,  <span class="dv">1</span>] <span class="fu">=</span> sum [<span class="dv">1</span>,<span class="dv">3</span>]  <span class="fu">+</span> sum [<span class="dv">1</span>]
sum [<span class="dv">1</span>,<span class="dv">3</span>,<span class="dv">2</span>  ] <span class="fu">=</span> sum [<span class="dv">1</span>,<span class="dv">3</span>,<span class="dv">2</span>] <span class="fu">+</span> sum []</code></pre>
<p>From this perspective, we’re doing a lot of redundant work. For example, to calculate <code>sum [1,3,2]</code>, it would be very helpful if we could reuse the work we did in the previous calculation to calculate <code>sum [1,3]</code>.</p>
<h1 id="prefix-sums">Prefix sums</h1>
<p>We can achieve our desired reuse of earlier work if we store each intermediate sum in a separate list. This technique is called <em>prefix summation</em>, or (if you’re a Haskeller) <em>scanning</em>.</p>
<p>Here’s the bottom left triangle of sums we want to calculate.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">sum [] <span class="co">{- + sum [3,2,1] -}</span>
sum [<span class="dv">1</span>]  <span class="co">{- + sum [2,1] -}</span>
sum [<span class="dv">1</span>,<span class="dv">3</span>]  <span class="co">{- + sum [1] -}</span>
sum [<span class="dv">1</span>,<span class="dv">3</span>,<span class="dv">2</span>] <span class="co">{- + sum [] -}</span></code></pre>
<p>We can prefix-sum these using Haskell’s standard <code>scanl</code> function.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="fu">&gt;&gt;&gt;</span> init (scanl (<span class="fu">+</span>) <span class="dv">0</span> [<span class="dv">1</span>,<span class="dv">3</span>,<span class="dv">2</span>,<span class="dv">1</span>])
[<span class="dv">0</span>,<span class="dv">1</span>,<span class="dv">4</span>,<span class="dv">6</span>]

<span class="co">{- e.g. [0,</span>
<span class="co">         0 + 1,</span>
<span class="co">         0 + 1 + 3,</span>
<span class="co">         0 + 1 + 3 + 2]   -}</span></code></pre>
<p>(We use <code>init</code> to drop out the final term, which we don’t want.)</p>
<p>And here’s the top right of the triangle.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">{- sum [] + -}</span> sum [<span class="dv">3</span>,<span class="dv">2</span>,<span class="dv">1</span>]
<span class="co">{- sum [1] + -}</span>  sum [<span class="dv">2</span>,<span class="dv">1</span>]
<span class="co">{- sum [1,3] + -}</span>  sum [<span class="dv">1</span>]
<span class="co">{- sum [1,3,2] + -}</span> sum []</code></pre>
<p>To prefix-sum these, we can use <code>scanr</code>, which scans “from the right”.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="fu">&gt;&gt;&gt;</span> tail (scanr (<span class="fu">+</span>) <span class="dv">0</span> [<span class="dv">1</span>,<span class="dv">3</span>,<span class="dv">2</span>,<span class="dv">1</span>])
[<span class="dv">6</span>,<span class="dv">3</span>,<span class="dv">1</span>,<span class="dv">0</span>]

<span class="co">{- e.g. [3 + 2 + 1 + 0,</span>
<span class="co">         2 + 1 + 0,</span>
<span class="co">         1 + 0,</span>
<span class="co">         0]               -}</span></code></pre>
<p>(As in the previous case, we use <code>tail</code> to drop out the first term, which we don’t want.)</p>
<p>Now we have two lists:</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">[<span class="dv">0</span>,<span class="dv">1</span>,<span class="dv">4</span>,<span class="dv">6</span>]
[<span class="dv">6</span>,<span class="dv">3</span>,<span class="dv">1</span>,<span class="dv">0</span>]</code></pre>
<p>Next, we sum the lists pairwise, which gives us exactly the sums we need:</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">sum [  <span class="dv">3</span>,<span class="dv">2</span>,<span class="dv">1</span>]  <span class="fu">==</span> <span class="dv">0</span> <span class="fu">+</span> <span class="dv">6</span> <span class="fu">==</span> <span class="dv">6</span>
sum [<span class="dv">1</span>,  <span class="dv">2</span>,<span class="dv">1</span>]  <span class="fu">==</span> <span class="dv">1</span> <span class="fu">+</span> <span class="dv">3</span> <span class="fu">==</span> <span class="dv">4</span>
sum [<span class="dv">1</span>,<span class="dv">3</span>,  <span class="dv">1</span>]  <span class="fu">==</span> <span class="dv">4</span> <span class="fu">+</span> <span class="dv">1</span> <span class="fu">==</span> <span class="dv">5</span>
sum [<span class="dv">1</span>,<span class="dv">3</span>,<span class="dv">2</span>  ]  <span class="fu">==</span> <span class="dv">6</span> <span class="fu">+</span> <span class="dv">0</span> <span class="fu">==</span> <span class="dv">6</span></code></pre>
<p>Divide each sum by <span class="math"><em>n</em>-1</span>, and we have the four subsample means we were hoping for—but in linear time, not quadratic time!</p>
<p>Here’s the complete method for jackknifing the mean in <span class="math"><em>O</em>(<em>n</em>)</span> time.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">jackknifeMean ::</span> <span class="dt">Fractional</span> a <span class="ot">=&gt;</span> [a] <span class="ot">-&gt;</span> [a]
jackknifeMean xs <span class="fu">=</span>
    map (<span class="fu">/</span> n) <span class="fu">$</span>
    zipWith (<span class="fu">+</span>)
    (init (scanl (<span class="fu">+</span>) <span class="dv">0</span> xs))
    (tail (scanr (<span class="fu">+</span>) <span class="dv">0</span> xs))
  <span class="kw">where</span> n <span class="fu">=</span> fromIntegral (length xs <span class="fu">-</span> <span class="dv">1</span>)</code></pre>
<p>If we’re jackknifing the mean, there’s no point in taking the extra step of computing the mean of the jackknifed subsamples to estimate the bias. Since the mean is an unbiased estimator, the mean of the jackknifed means should be the same as the sample mean, so the bias will always be zero.</p>
<p>However, the jackknifed subsamples <em>do</em> serve a useful purpose: each one tells us how much its corresponding left-out data point affects the sample mean. Let’s see what this means.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="fu">&gt;&gt;&gt;</span> mean [<span class="dv">1</span>,<span class="dv">3</span>,<span class="dv">2</span>,<span class="dv">1</span>]
<span class="dv">1</span><span class="fu">.</span><span class="dv">75</span></code></pre>
<p>The sample mean is <code>1.75</code>, and let’s see which subsample mean is farthest from this value:</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="fu">&gt;&gt;&gt;</span> jackknifeMean [<span class="dv">1</span>,<span class="dv">3</span>,<span class="dv">2</span>,<span class="dv">1</span>]
[<span class="dv">2</span>, <span class="dv">1</span><span class="fu">.</span><span class="dv">3333</span>, <span class="dv">1</span><span class="fu">.</span><span class="dv">6666</span>, <span class="dv">2</span>]</code></pre>
<p>So if we left out <code>1</code> from the sample, the mean would be <code>2</code>, but if we left out <code>3</code>, the mean would become <code>1.3333</code>. Clearly, this is the subsample mean that is farthest from the sample mean, so <code>3</code> is the most significant outlier in our estimate of the mean.</p>
<h1 id="prefix-sums-and-variance">Prefix sums and variance</h1>
<p>Let’s look again at the naive formula for calculating variance:</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">var xs <span class="fu">=</span> (sum (map (<span class="fu">^</span><span class="dv">2</span>) xs) <span class="fu">-</span> sum xs <span class="fu">^</span> <span class="dv">2</span> <span class="fu">/</span> n) <span class="fu">/</span> n
  <span class="kw">where</span> n <span class="fu">=</span> fromIntegral (length xs)</code></pre>
<p>Since this approach is based on sums, it looks like maybe we can use the same prefix summation technique to compute the variance in <span class="math"><em>O</em>(<em>n</em>)</span> time.</p>
<p>Because we’re computing a sum of squares and an ordinary sum, we need to perform two sets of prefix sum computations:</p>
<ul>
<li><p>Two to compute the sum of squares, one from the left and another from the right</p></li>
<li><p>And two more for the ordinary sums, whose square appears in the formula</p></li>
</ul>
<pre class="sourceCode haskell"><code class="sourceCode haskell">jackknifeVar xs <span class="fu">=</span>
    zipWith4 var squaresLeft squaresRight sumsLeft sumsRight
  <span class="kw">where</span>
    var l2 r2 l r <span class="fu">=</span> ((l2 <span class="fu">+</span> r2) <span class="fu">-</span> (l <span class="fu">+</span> r) <span class="fu">^</span> <span class="dv">2</span> <span class="fu">/</span> n) <span class="fu">/</span> n
    squares       <span class="fu">=</span> map (<span class="fu">^</span><span class="dv">2</span>) xs
    squaresLeft   <span class="fu">=</span> init (scanl (<span class="fu">+</span>) <span class="dv">0</span> squares)
    squaresRight  <span class="fu">=</span> tail (scanr (<span class="fu">+</span>) <span class="dv">0</span> squares)
    sumsLeft      <span class="fu">=</span> init (scanl (<span class="fu">+</span>) <span class="dv">0</span> xs)
    sumsRight     <span class="fu">=</span> tail (scanr (<span class="fu">+</span>) <span class="dv">0</span> xs)
    n             <span class="fu">=</span> fromIntegral (length xs <span class="fu">-</span> <span class="dv">1</span>)</code></pre>
<p>Look closely at the local function <code>var</code> above and you’ll see almost exactly the naive formulation of variance, constructed from the relevant pieces of our four prefix sums.</p>
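<p>As a sanity check (my test, not code from the post), the <span class="math"><em>O</em>(<em>n</em>)</span> version should agree with naively recomputing the variance over every leave-one-out subsample:</p>

```haskell
import Data.List (zipWith4)

-- Naive variance, as earlier in the post.
var :: Fractional a => [a] -> a
var xs = (sum (map (^2) xs) - sum xs ^ 2 / n) / n
  where n = fromIntegral (length xs)

-- O(n) jackknifed variance via four prefix sums, as above.
jackknifeVar :: Fractional a => [a] -> [a]
jackknifeVar xs =
    zipWith4 go squaresLeft squaresRight sumsLeft sumsRight
  where
    go l2 r2 l r = ((l2 + r2) - (l + r) ^ 2 / n) / n
    squares      = map (^2) xs
    squaresLeft  = init (scanl (+) 0 squares)
    squaresRight = tail (scanr (+) 0 squares)
    sumsLeft     = init (scanl (+) 0 xs)
    sumsRight    = tail (scanr (+) 0 xs)
    n            = fromIntegral (length xs - 1)

-- O(n^2) reference: drop each element and recompute from scratch.
naive :: Fractional a => [a] -> [a]
naive xs = [ var (take i xs ++ drop (i + 1) xs)
           | i <- [0 .. length xs - 1] ]

main :: IO ()
main = print (all (< 1e-12)
               (zipWith (\a b -> abs (a - b))
                        (jackknifeVar sample) (naive sample)))
  where sample = [1,3,2,1,5] :: [Double]
```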
<h1 id="skewness-kurtosis-and-more">Skewness, kurtosis, and more</h1>
<p>Exactly the same prefix sum approach applies to jackknifing higher order moment statistics, such as skewness (lopsidedness of the distribution curve) and kurtosis (shape of the tails of the distribution).</p>
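<p>For instance, the third central moment (the numerator of skewness) also decomposes into plain sums of <em>x</em>, <em>x</em><sup>2</sup>, and <em>x</em><sup>3</sup>, so the same pairing of <code>scanl</code> and <code>scanr</code> works. Here’s a sketch of mine (not code from the post):</p>

```haskell
-- Third central moment of a sample, expanded into plain sums:
--   sum (x - mu)^3  =  sum x^3  -  3 * mu * sum x^2  +  2 * n * mu^3
-- where mu = sum x / n. Since everything is a sum, the leave-one-out
-- versions come from pairing scanl/scanr prefix sums, exactly as for
-- the mean and variance.
jackknifeThirdMoment :: Fractional a => [a] -> [a]
jackknifeThirdMoment xs =
    zipWith3 m3 (loo xs) (loo (map (^2) xs)) (loo (map (^3) xs))
  where
    -- leave-one-out sums of a list, in O(n)
    loo ys = zipWith (+) (init (scanl (+) 0 ys)) (tail (scanr (+) 0 ys))
    m3 s s2 s3 = (s3 - 3 * mu * s2 + 2 * n * mu ^ 3) / n
      where mu = s / n
    n = fromIntegral (length xs - 1)

main :: IO ()
main = print (jackknifeThirdMoment ([1,3,2,1,5] :: [Double]))
```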
<h1 id="numerical-accuracy-of-the-jackknifed-mean">Numerical accuracy of the jackknifed mean</h1>
<p>When we’re dealing with a lot of floating point numbers, the ever-present concerns about numerical stability and accuracy arise.</p>
<p>For example, suppose we compute the sum of ten million pseudo-random floating point numbers between zero and one.</p>
<p>The most accurate way to sum numbers is by first converting them to <code>Rational</code>, summing, then converting back to <code>Double</code>. We’ll call this the “true sum”. The standard Haskell <code>sum</code> function (“basic sum” below) simply adds numbers as it goes. It manages 14 decimal digits of accuracy before losing precision.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">true sum<span class="fu">:</span>    <span class="dv">5000754</span><span class="fu">.</span><span class="dv">656937315</span>
basic sum<span class="fu">:</span>   <span class="dv">5000754</span><span class="fu">.</span><span class="dv">65693705</span>
                           <span class="fu">^</span></code></pre>
<p>However, Kahan’s algorithm does even better.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">true sum<span class="fu">:</span>    <span class="dv">5000754</span><span class="fu">.</span><span class="dv">656937315</span>
kahan sum<span class="fu">:</span>   <span class="dv">5000754</span><span class="fu">.</span><span class="dv">656937315</span></code></pre>
<p>If you haven’t come across Kahan’s algorithm before, it looks like this.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">kahanStep (sum, c) x <span class="fu">=</span> (sum&#39;, c&#39;)
  <span class="kw">where</span> y    <span class="fu">=</span> x <span class="fu">-</span> c
        sum&#39; <span class="fu">=</span> sum <span class="fu">+</span> y
        c&#39;   <span class="fu">=</span> (sum&#39; <span class="fu">-</span> sum) <span class="fu">-</span> y</code></pre>
<p>The <code>c</code> term maintains a running correction of the errors introduced by each addition.</p>
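<p>Folded strictly over a list, the step function becomes a complete compensated sum (a small wrapper of my own):</p>

```haskell
import Data.List (foldl')

-- One step of Kahan's compensated summation, as above.
kahanStep :: (Double, Double) -> Double -> (Double, Double)
kahanStep (s, c) x = (s', c')
  where y  = x - c
        s' = s + y
        c' = (s' - s) - y

-- Fold the step over a list, discarding the final correction term.
kahanSum :: [Double] -> Double
kahanSum = fst . foldl' kahanStep (0, 0)

main :: IO ()
main = do
  -- Ten copies of 0.1: naive summation accumulates rounding error,
  -- while the compensated sum lands essentially on 1.0.
  print (sum (replicate 10 0.1))       -- not exactly 1.0
  print (kahanSum (replicate 10 0.1))
```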
<p>Naive summation seems to do just fine, right? Well, watch what happens if we simply add <span class="math">10<sup>10</sup></span> to each number, sum these, then subtract <span class="math">10<sup>17</sup></span> at the end.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">true sum<span class="fu">:</span>    <span class="dv">4999628</span><span class="fu">.</span><span class="dv">983274754</span>
basic sum<span class="fu">:</span>    <span class="dv">450000</span><span class="fu">.</span><span class="dv">0</span>
kahan sum<span class="fu">:</span>   <span class="dv">4999632</span><span class="fu">.</span><span class="dv">0</span>
                  <span class="fu">^</span></code></pre>
<p>The naive approach goes completely off the rails, and produces a result that is off by an order of magnitude!</p>
<p>This catastrophic accumulation of error is often cited as the reason why the naive formula for the mean can’t be trusted.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">mean xs <span class="fu">=</span> sum xs <span class="fu">/</span> n
  <span class="kw">where</span> n <span class="fu">=</span> fromIntegral (length xs)</code></pre>
<p>Thanks to Don Knuth, what is usually suggested as a replacement is Welford’s algorithm.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="kw">import </span><span class="dt">Data.List</span> (foldl&#39;)

<span class="kw">data</span> <span class="dt">WelfordMean</span> a <span class="fu">=</span> <span class="dt">M</span> <span class="fu">!</span>a <span class="fu">!</span><span class="dt">Int</span>
              <span class="kw">deriving</span> (<span class="dt">Show</span>)

welfordMean <span class="fu">=</span> end <span class="fu">.</span> foldl&#39; step zero
  <span class="kw">where</span> end  (<span class="dt">M</span> m _)   <span class="fu">=</span> m
        step (<span class="dt">M</span> m n) x <span class="fu">=</span> <span class="dt">M</span> m&#39; n&#39;
          <span class="kw">where</span> m&#39;     <span class="fu">=</span> m <span class="fu">+</span> (x <span class="fu">-</span> m) <span class="fu">/</span> fromIntegral n&#39;
                n&#39;     <span class="fu">=</span> n <span class="fu">+</span> <span class="dv">1</span>
        zero           <span class="fu">=</span> <span class="dt">M</span> <span class="dv">0</span> <span class="dv">0</span></code></pre>
<p>Here’s what we get if we compare the three approaches:</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">true mean<span class="fu">:</span>    <span class="dv">0</span><span class="fu">.</span><span class="dv">49996289832747537</span>
naive mean<span class="fu">:</span>   <span class="dv">0</span><span class="fu">.</span><span class="dv">04500007629394531</span>
welford mean<span class="fu">:</span> <span class="dv">0</span><span class="fu">.</span><span class="dv">4998035430908203</span></code></pre>
<p>Not surprisingly, the naive mean is worse than useless, but the long-respected Welford method only gives us three decimal digits of precision. That’s not so hot.</p>
<p>More accurate is the Kahan mean, which is simply the sum calculated using Kahan’s algorithm, then divided by the length:</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">true mean<span class="fu">:</span>    <span class="dv">0</span><span class="fu">.</span><span class="dv">49996289832747537</span>
kahan mean<span class="fu">:</span>   <span class="dv">0</span><span class="fu">.</span><span class="dv">4999632</span>
welford mean<span class="fu">:</span> <span class="dv">0</span><span class="fu">.</span><span class="dv">4998035430908203</span></code></pre>
<p>This at least gets us to five decimal digits of precision.</p>
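<p>Spelled out, the Kahan mean is nothing more than a fold of <code>kahanStep</code> followed by a division (a sketch; <code>kahanMean</code> is my own name for it):</p>

```haskell
import Data.List (foldl')

kahanStep :: (Double, Double) -> Double -> (Double, Double)
kahanStep (sum, c) x = (sum', c')
  where y    = x - c
        sum' = sum + y
        c'   = (sum' - sum) - y

-- The "Kahan mean": a compensated sum divided by the count.
kahanMean :: [Double] -> Double
kahanMean xs = fst (foldl' kahanStep (0, 0) xs) / fromIntegral (length xs)
```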
<p>So is the Kahan mean the answer? Well, Kahan summation has its own problems. Let’s try out a test vector.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- originally due to Tim Peters</span>
<span class="fu">&gt;&gt;&gt;</span> <span class="kw">let</span> vec <span class="fu">=</span> concat (replicate <span class="dv">1000</span> [<span class="dv">1</span>,1e100,<span class="dv">1</span>,<span class="fu">-</span>1e100])

<span class="co">-- accurate sum</span>
<span class="fu">&gt;&gt;&gt;</span> sum (map toRational vec)
<span class="dv">2000</span>

<span class="co">-- naive sum</span>
<span class="fu">&gt;&gt;&gt;</span> sum vec
<span class="dv">0</span><span class="fu">.</span><span class="dv">0</span>

<span class="co">-- Kahan sum</span>
<span class="fu">&gt;&gt;&gt;</span> foldl kahanStep (<span class="dt">S</span> <span class="dv">0</span> <span class="dv">0</span>) vec
<span class="dt">S</span> <span class="dv">0</span><span class="fu">.</span><span class="dv">0</span> <span class="dv">0</span><span class="fu">.</span><span class="dv">0</span></code></pre>
<p>Ugh, the Kahan algorithm doesn’t do any better than naive addition. Fortunately, there’s an even better summation algorithm available, called the Kahan-Babuška-Neumaier algorithm.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">kbnSum <span class="fu">=</span> uncurry (<span class="fu">+</span>) <span class="fu">.</span> foldl&#39; step (<span class="dv">0</span>,<span class="dv">0</span>)
  <span class="kw">where</span>
    step (sum, c) x <span class="fu">=</span> (t, c&#39;)
      <span class="kw">where</span> c&#39; <span class="fu">|</span> abs sum <span class="fu">&gt;=</span> abs x <span class="fu">=</span> c <span class="fu">+</span> ((sum <span class="fu">-</span> t) <span class="fu">+</span> x)
               <span class="fu">|</span> otherwise        <span class="fu">=</span> c <span class="fu">+</span> ((x <span class="fu">-</span> t) <span class="fu">+</span> sum)
            t                     <span class="fu">=</span> sum <span class="fu">+</span> x</code></pre>
<p>If we try this on the same test vector, we taste sweet success! Thank goodness!</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="fu">&gt;&gt;&gt;</span> kbnSum vec
<span class="dv">2000</span><span class="fu">.</span><span class="dv">0</span></code></pre>
<p>Not only is Kahan-Babuška-Neumaier (let’s call it “KBN”) more accurate than the Welford mean, it has the advantage of being directly usable in our desired prefix sum form. We’ll accumulate floating point error proportional to <span class="math"><em>O</em>(1)</span> instead of the <span class="math"><em>O</em>(<em>n</em>)</span> that naive summation gives.</p>
<p>Poor old Welford’s formula for the mean just can’t get a break! Not only is it less accurate than KBN, but since it’s a recurrence relation with a divisor that keeps changing, we simply can’t monkeywrench it into suitability for the same prefix-sum purpose.</p>
<h1 id="numerical-accuracy-of-the-jackknifed-variance">Numerical accuracy of the jackknifed variance</h1>
<p>In our jackknifed variance, we used almost exactly the same calculation as the naive variance, merely adjusted to prefix sums. Here’s the plain old naive variance function once again.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">var xs <span class="fu">=</span> (sum (map (<span class="fu">^</span><span class="dv">2</span>) xs) <span class="fu">-</span> sum xs <span class="fu">^</span> <span class="dv">2</span> <span class="fu">/</span> n) <span class="fu">/</span> n
  <span class="kw">where</span> n <span class="fu">=</span> fromIntegral (length xs)</code></pre>
<p>The problem with this algorithm arises as the size of the input grows. These two terms are likely to converge for large <span class="math"><em>n</em></span>:</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">sum (map (<span class="fu">^</span><span class="dv">2</span>) xs)

sum xs <span class="fu">^</span> <span class="dv">2</span> <span class="fu">/</span> n</code></pre>
<p>When we subtract them, floating point cancellation leads to a large error term that turns our result into nonsense.</p>
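<p>The underlying hazard is easy to demonstrate in isolation: when the quantities involved are large, the small differences we actually care about can vanish entirely. A minimal sketch:</p>

```haskell
-- 1e16 is exactly representable as a Double, but the gap between
-- adjacent Doubles at that magnitude is 2, so adding 1 is absorbed
-- entirely -- and the subtraction that "should" recover it yields 0.
main :: IO ()
main = print (((1e16 :: Double) + 1) - 1e16)
```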
<p>The usual way to deal with this is to switch to a two-pass algorithm. (In case it’s not clear at first glance, the first pass below calculates <code>mean</code>.)</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">var2 xs    <span class="fu">=</span> (sum (map (<span class="fu">^</span><span class="dv">2</span>) ys) <span class="fu">-</span> sum ys <span class="fu">^</span> <span class="dv">2</span> <span class="fu">/</span> n) <span class="fu">/</span> n
  <span class="kw">where</span> n  <span class="fu">=</span> fromIntegral (length xs)
        ys <span class="fu">=</span> map (subtract (mean xs)) xs</code></pre>
<p>By subtracting the mean from every term, we keep the numbers smaller, so the two sum terms are less likely to converge.</p>
<p>This approach poses yet another conundrum: we want to jackknife the variance. If we have to correct for the mean to avoid cancellation errors, do we need to calculate each subsample mean? Well, no. We can get away with a cheat: instead of subtracting the subsample mean, we subtract the <em>sample</em> mean, on the assumption that it’s “close enough” to each of the subsample means to be a good enough substitute.</p>
<p>So. To calculate the jackknifed variance, we use KBN summation to avoid a big cumulative error penalty during addition, subtract the sample mean to avoid cancellation error when subtracting the sum terms, and then we’ve finally got a pretty reliable floating point algorithm.</p>
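<p>Putting the two ingredients together in ordinary (non-jackknifed) form looks like this. This is an illustrative sketch under my own names (<code>varShifted</code>); the <code>statistics</code> library’s internals differ in detail.</p>

```haskell
import Data.List (foldl')

-- Kahan-Babuška-Neumaier summation, as defined earlier.
kbnSum :: [Double] -> Double
kbnSum = uncurry (+) . foldl' step (0, 0)
  where step (sum, c) x = (t, c')
          where c' | abs sum >= abs x = c + ((sum - t) + x)
                   | otherwise        = c + ((x - t) + sum)
                t                     = sum + x

-- Shift every element by the sample mean, then compute the naive
-- variance formula using KBN for every sum involved.
varShifted :: [Double] -> Double
varShifted xs = (kbnSum (map (^2) ys) - kbnSum ys ^ 2 / n) / n
  where n  = fromIntegral (length xs)
        m  = kbnSum xs / n
        ys = map (subtract m) xs
```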
<h1 id="where-can-you-use-this">Where can you use this?</h1>
<p>The <a href="http://hackage.haskell.org/package/statistics/docs/Statistics-Resampling.html#v:jackknife"><code>jackknife</code></a> function in the Haskell <a href="http://hackage.haskell.org/package/statistics"><code>statistics</code></a> library uses all of these techniques where applicable, and the <a href="http://hackage.haskell.org/package/math-functions/docs/Numeric-Sum.html"><code>Sum</code></a> module of the <a href="http://hackage.haskell.org/package/math-functions"><code>math-functions</code></a> library provides reliable summation (including second-order Kahan-Babuška summation, if you gotta catch all those least significant bits).</p>
<p>(If you’re not already bored to death of summation algorithms, take a look into pairwise summation. It’s less accurate than KBN summation, but claims to be quite a bit faster—claims I found to be only barely true in my benchmarks, and not worth the loss of precision.)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2014/06/10/win-bigger-statistical-fights-with-a-better-jackknife/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A major upgrade to attoparsec: more speed, more power</title>
		<link>http://www.serpentine.com/blog/2014/05/31/attoparsec/</link>
		<comments>http://www.serpentine.com/blog/2014/05/31/attoparsec/#comments</comments>
		<pubDate>Sat, 31 May 2014 07:34:55 +0000</pubDate>
		<dc:creator><![CDATA[Bryan O'Sullivan]]></dc:creator>
				<category><![CDATA[haskell]]></category>
		<category><![CDATA[open source]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=1051</guid>
		<description><![CDATA[I’m pleased to introduce the third generation of my attoparsec parsing library. With a major change to its internals, it is both faster and more powerful than previous versions, while remaining backwards compatible. Comparing to C Let’s start with a<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2014/05/31/attoparsec/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>I’m pleased to introduce the third generation of my <a href="https://github.com/bos/attoparsec">attoparsec parsing library</a>. With a major change to its internals, it is both faster and more powerful than previous versions, while remaining backwards compatible.</p>
<h1 id="comparing-to-c">Comparing to C</h1>
<p>Let’s start with a speed comparison between the <a href="https://github.com/joyent/http-parser/blob/master/http_parser.c">hand-written C code</a> that powers Node.js’s HTTP parser and an <a href="https://github.com/bos/attoparsec/blob/master/examples/RFC2616.hs">idiomatic Haskell parser</a> that uses attoparsec. There are good reasons to take these numbers with a fistful of salt, so imagine huge error bars, warning signs, and whatnot—but they’re still interesting.</p>
<iframe height=371 width=600 src="https://docs.google.com/a/serpentine.com/spreadsheets/d/1OapXAdci7YtuDBuqeHDccwKxUBMKNmRmAMfPlZdP_Nw/gviz/chartiframe?oid=1066929795" seamless frameborder=0 scrolling=no></iframe>

<p>A little explanation is in order for why there are two entries for http-parser. The “null” driver consists of a series of empty callbacks, and represents the best possible performance we can get. The “naive” http-parser driver allocates memory for both a request and each of its headers, and frees this memory once a request parse has finished. (A real user of http-parser is likely to be slower than the naive driver, as http-parser forces its clients to do complex book-keeping.)</p>
<p>Meanwhile, the attoparsec parser is of course tiny: a few dozen lines of code, instead of a few thousand. More interestingly, it’s <em>faster</em> than its do-nothing C counterpart. When <a href="http://www.serpentine.com/blog/2010/03/03/whats-in-a-parser-attoparsec-rewired-2/">I last compared the two</a>, back in 2010, attoparsec was a little over half the speed of http-parser, so <em>to surpass it</em> feels like an exciting development.</p>
<p>To be clear, you really shouldn’t treat comparing the two as anything other than a fast-and-loose exercise. The attoparsec parser does less work in some ways, for instance by not special-casing the Content-Length header. At the same time, it does <em>more</em> work in a different, but perhaps more important case: there’s no equivalent of the maze of edge cases that arise with http-parser when a parse spans a boundary between blocks of input. The attoparsec programming model is simply way less hairy.</p>
<p>Caveats aside, my purpose with this comparison is to paint with broad strokes what I hope is a compelling picture: you can write a compact, clean parser using attoparsec, <em>and</em> you can expect it to perform well.</p>
<h1 id="speed-improvements">Speed improvements</h1>
<p>Compared to the previous version of attoparsec, the new internals of this version yield some solid speedups. On attoparsec’s own microbenchmark suite, speedups range from flat to nearly 2x.</p>
<iframe height=371 width=600
src="https://docs.google.com/a/serpentine.com/spreadsheets/d/1OapXAdci7YtuDBuqeHDccwKxUBMKNmRmAMfPlZdP_Nw/gviz/chartiframe?oid=736840736"
seamless frameborder=0 scrolling=no></iframe>

<p>If you use the <a href="https://github.com/bos/aeson">aeson JSON library</a> to parse JSON data that contains a lot of numbers, you can expect a nice boost in performance.</p>
<h1 id="space-usage">Space usage</h1>
<p>In addition to being faster, attoparsec is now generally more space efficient too. In a test of an application that uses Johan Tibell’s <a href="https://hackage.haskell.org/package/cassava">cassava library for handling CSV files</a>, the app used 39% less memory with the new version of attoparsec than before, while running 5% faster.</p>
<h1 id="new-api-fun">New API fun</h1>
<p>The new internals of attoparsec allowed me to add a feature I’ve wanted for a few years, one I had given up on as impossible with the previous internals.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">match ::</span> <span class="dt">Parser</span> a <span class="ot">-&gt;</span> <span class="dt">Parser</span> (<span class="dt">ByteString</span>, a)</code></pre>
<p>Given an arbitrary parser, <code>match</code> returns both the result of the parse and the string that it consumed while matching.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="fu">&gt;&gt;&gt;</span> <span class="kw">let</span> p <span class="fu">=</span> (,) <span class="fu">&lt;$&gt;</span> decimal <span class="fu">&lt;*&gt;</span> (<span class="st">&quot;,&quot;</span> <span class="fu">*&gt;</span> decimal)
<span class="fu">&gt;&gt;&gt;</span> parseOnly (match p) <span class="st">&quot;1,31337&quot;</span>
<span class="dt">Right</span> (<span class="st">&quot;1,31337&quot;</span>,(<span class="dv">1</span>,<span class="dv">31337</span>))</code></pre>
<p>This is very handy when what you’re interested in is not just the components of a parse result, but also the precise input string that the parser matched. (Imagine using this to save the contents of a comment while parsing a programming language, for instance.)</p>
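<p>As a sketch of that comment idea (the grammar and names here are hypothetical, not part of attoparsec), <code>match</code> hands back the verbatim comment text alongside its parsed body:</p>

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.ByteString.Char8 (Parser, match, parseOnly, string)
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString.Char8 as C

-- Parse a "--" line comment, keeping both the exact source text the
-- parser consumed and the comment body after the marker.
comment :: Parser (C.ByteString, C.ByteString)
comment = match (string "--" *> A.takeWhile (/= '\n'))

main :: IO ()
main = print (parseOnly comment (C.pack "-- hello\nmore input"))
```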
<h1 id="the-old-internals">The old internals</h1>
<p>What changed to yield both big performance improvements and previously impossible capabilities? To understand this, let’s discuss how attoparsec worked until today.</p>
<p>The age-old way to write parser libraries in Haskell is to treat parsing as a job of consuming input from the front of a string. If you want to match the string <code>&quot;foo&quot;</code> and your input is <code>&quot;foobar&quot;</code>, you pull the prefix from <code>&quot;foobar&quot;</code> and hand <code>&quot;bar&quot;</code> to your successor parser as its input. This is how attoparsec used to work, and we’ll see where it becomes relevant in a moment.</p>
<p>One of attoparsec’s major selling points is that it works with incomplete input. If we give it insufficient input to make a decision about what to do, it will tell us.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="fu">&gt;&gt;&gt;</span> parse (<span class="st">&quot;bar&quot;</span> <span class="fu">&lt;|&gt;</span> <span class="st">&quot;baz&quot;</span>) <span class="st">&quot;ba&quot;</span>
<span class="dt">Partial</span> _</code></pre>
<p>If we get a <code>Partial</code> constructor, we resume parsing by feeding more input to the continuation it hands us. The easiest way is to use <code>feed</code>:</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="fu">&gt;&gt;&gt;</span> <span class="kw">let</span> cont <span class="fu">=</span> parse (<span class="st">&quot;bar&quot;</span> <span class="fu">&lt;|&gt;</span> <span class="st">&quot;baz&quot;</span>) <span class="st">&quot;ba&quot;</span>
<span class="fu">&gt;&gt;&gt;</span> cont <span class="ot">`feed`</span> <span class="st">&quot;r&quot;</span>
<span class="dt">Done</span> <span class="st">&quot;&quot;</span> <span class="st">&quot;bar&quot;</span></code></pre>
<p>Continuations interact in an interesting way with backtracking. Let’s talk a little about backtracking in isolation first.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="fu">&gt;&gt;&gt;</span> <span class="kw">let</span> lefty <span class="fu">=</span> <span class="dt">Left</span> <span class="fu">&lt;$&gt;</span> decimal <span class="fu">&lt;*</span> <span class="st">&quot;.!&quot;</span>
<span class="fu">&gt;&gt;&gt;</span> <span class="kw">let</span> righty <span class="fu">=</span> <span class="dt">Right</span> <span class="fu">&lt;$&gt;</span> rational</code></pre>
<p>The parser <code>lefty</code> will not succeed until it has read a decimal number followed by some nonsense.</p>
<p>Suppose we get partway through a parse on input like this.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="fu">&gt;&gt;&gt;</span> <span class="kw">let</span> cont <span class="fu">=</span> parse (lefty <span class="fu">&lt;|&gt;</span> righty) <span class="st">&quot;123.&quot;</span>
<span class="fu">&gt;&gt;&gt;</span> cont
<span class="dt">Partial</span> _</code></pre>
<p>Even though the <code>decimal</code> portion of <code>lefty</code> has succeeded, if we <code>feed</code> the string <code>&quot;1!&quot;</code> to the continuation, <code>lefty</code> as a whole will fail, parsing will backtrack to the beginning of the input, and <code>righty</code> will succeed.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="fu">&gt;&gt;&gt;</span> cont <span class="ot">`feed`</span> <span class="st">&quot;1!&quot;</span>
<span class="dt">Done</span> <span class="st">&quot;!&quot;</span> <span class="dt">Right</span> <span class="dv">123</span><span class="fu">.</span><span class="dv">1</span></code></pre>
<p>What’s happening behind the scenes here is important.</p>
<p>Under the old version of attoparsec, parsing proceeds by consuming input. By the time we reach the <code>&quot;.&quot;</code> in the input of <code>&quot;123.&quot;</code>, we have thrown away the leading <code>&quot;123&quot;</code> as a result of <code>decimal</code> succeeding, so our remaining input is <code>&quot;.&quot;</code> when we ask for more.</p>
<p>The <code>&lt;|&gt;</code> combinator holds onto the original input in case a parse fails. Since a parse may need to ask for <em>more</em> input before it fails (as in this case), the old attoparsec has to keep track of this additional continuation-fed input separately, and glue the saved and added inputs together on each backtrack. Worse yet, sometimes we have to throw away added input in order to avoid double-counting it.</p>
<p>This surely sounds complicated and fragile, but it was the only scheme I could think of that would work under the “parsing as consuming input” model that attoparsec started with. I managed to make this setup run fast enough that (once I’d worked the bugs out) I wasn’t too bothered by the additional complexity.</p>
<h1 id="from-strings-to-buffers-and-cursors">From strings to buffers and cursors</h1>
<p>The model that attoparsec used to follow was that we consumed input, and for correctness when backtracking did our book-keeping of added input separately.</p>
<p>Under the new model, we manage input and added input in one unified <code>Buffer</code> abstraction. We track our position using a separate cursor, which is simply an integer index into a <code>Buffer</code>.</p>
<p>If we need to backtrack, we simply hand the current <code>Buffer</code> to the alternate parser, along with the cursor that will restart parsing at the right spot.</p>
<p>The idea of parsing with a cursor isn’t mine; it came up during a late night IRC conversation with Ed Kmett. I’m excited that this change happened to make it easy to add a new combinator, <code>match</code>, which had previously seemed impossible to write.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">match ::</span> <span class="dt">Parser</span> a <span class="ot">-&gt;</span> <span class="dt">Parser</span> (<span class="dt">ByteString</span>, a)</code></pre>
<p>In the new cursor-based world, all we need to build <code>match</code> is to remember the cursor position when we start parsing. If the parse succeeds, we extract the substring that spans the old and new cursor positions. I spent quite a bit of time pondering this problem with the old representation without getting anywhere, but by changing the internal representation, it suddenly became trivial.</p>
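<p>A toy rendition of the idea (the representation and names here are mine, not attoparsec’s internals): with a shared buffer and integer cursors, the text a parser matched falls out of plain index arithmetic.</p>

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C

type Cursor = Int

-- Extract the span between the cursor where a parse started and the
-- cursor where it succeeded: the essence of the match combinator.
substring :: B.ByteString -> Cursor -> Cursor -> B.ByteString
substring buf from to = B.take (to - from) (B.drop from buf)

main :: IO ()
main = C.putStrLn (substring (C.pack "1,31337 trailing") 0 7)  -- prints "1,31337"
```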
<p>Switching to the cursor-based representation accounts for some of the performance improvements in the new release, as it opened up a few new avenues for further small tweaks.</p>
<h1 id="bust-my-buffers">Bust my buffers!</h1>
<p>There’s another implementation twist, though: why is the <code>Buffer</code> type not simply a <code>ByteString</code>? Here, the question is one of efficiency, specifically behaviour in response to pathologically crafted inputs.</p>
<p>Every time someone feeds us input via the <code>Partial</code> continuation, we have to add this to the input we already have. The obvious thing to do is treat <code>Buffer</code> as a glorified <code>ByteString</code> and simply string-append the new input to the existing input and get on with life.</p>
<p>Troublingly, this approach would require two string copies per append: we’d allocate a new string, copy the original string into it, then tack the appended string on the end. It’s easy to see that this has quadratic time complexity, which would allow a hostile attacker to DoS us by simply drip-feeding us a large volume of valid data, one byte at a time.</p>
<p>The new <code>Buffer</code> structure addresses such attacks by exponential doubling, such that most appends require only one string copy instead of two. This improves the worst-case time complexity of being drip-fed extra input from <span class="math"><em>O</em>(<em>n</em><sup>2</sup>)</span> to <span class="math"><em>O</em>(<em>n</em>log<em>n</em>)</span>.</p>
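<p>The growth policy itself is simple to sketch (this is an illustration of geometric doubling, not attoparsec’s actual <code>Buffer</code> code): keep doubling the capacity until the needed size fits, so a long run of tiny appends triggers only logarithmically many reallocations.</p>

```haskell
-- Illustrative doubling policy: grow capacity geometrically so that a
-- sequence of small appends costs far less copying than per-append
-- reallocation would.
newCapacity :: Int -> Int -> Int
newCapacity cap needed
  | cap >= needed = cap
  | otherwise     = newCapacity (max 1 (cap * 2)) needed

main :: IO ()
main = print (map (newCapacity 4) [1, 5, 9, 100])
```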
<h1 id="preserving-safety-and-speed">Preserving safety and speed</h1>
<p>Making this work took a bit of a hack. The <code>Buffer</code> type contains a mutable array that contains both an immutable portion (visible to users) and an invisible mutable part at the end. Every time we append, we write to the mutable array, and hand back a <code>Buffer</code> that widens its immutable portion to include the part we just wrote to. The array is shared across successive <code>Buffer</code>s until we run out of space.</p>
<p>This is very fast, but it’s also unsafe: nobody should ever append to the same <code>Buffer</code> twice, as the sharing of the array can lead to data corruption. Let’s think about how this could arise. Our original <code>Buffer</code> still thinks it can write to the mutable portion of an array, while our <em>new</em> <code>Buffer</code> considers the <em>same area of memory</em> to be immutable. If we append to the original <code>Buffer</code> again, we will scribble on memory that the new <code>Buffer</code> thinks is immutable.</p>
<p>Since neither our choice of API nor Haskell’s type system can prevent bad actions here, users are free to make the programming error of appending to a <code>Buffer</code> more than once, even though it makes no sense to do so. It’s not satisfactory to have pure code react badly even when the programmer is doing something wrong, so I addressed this problem in an interesting way.</p>
<p>The immutable shell of a <code>Buffer</code> contains a generation number. We embed a mutable generation number in the shared array that each <code>Buffer</code> points to. We increment the mutable generation number every time we append to a <code>Buffer</code>, and hand back a <code>Buffer</code> that also has an incremented immutable generation number.</p>
<p>The mutable and immutable generation numbers should always agree. If they fall out of sync, we know that someone is appending to a <code>Buffer</code> more than once. We react by duplicating the mutable array, so that the new append cannot interfere with the existing array. This amounts to a cheap copy-on-write scheme: copies never occur in the typical case of users behaving sensibly, while we preserve correctness if a programmer starts doing daft things.</p>
<h1 id="assurance">Assurance</h1>
<p>Before I embarked on this redesign, I doubled the size of attoparsec’s test and benchmark suites. This gave me a fair sense of safety that I wouldn’t accidentally break code as I went.</p>
<p>Once the rate of churn settled down, I found the most significant packages using attoparsec on Hackage and tried them out.</p>
<p>This revealed that an incompatible change I’d made in the core <code>Parser</code> type caused quite a lot of downstream build breakage, with a third of the packages that I tried failing to build. This was a good motivator for me to <a href="https://github.com/bos/attoparsec/commit/e22d19512c9f606e4ea83f62b009c5d6cb608559#diff-19f60ca4c52af86b5f063634b784f8fcL91">learn how to fix the problem</a>.</p>
<p>Once I fixed this self-imposed difficulty, <em>all</em> of the top packages turned out to be API-compatible with the new release. It was definitely helpful to have a tool that let me find important users of the package.</p>
<p>Between the expanded test suite, better benchmarks, and this extra degree of checking, I am now feeling moderately confident that the sweeping changes I’ve made should be fairly safe to inflict on people. I hope I’m right! Please enjoy the results of my work.</p>
<table style="font-size:80%">
<tbody>
<tr><td>
<i>package</i>
</td><td style="text-align:right">
<i>mojo</i>
</td><td>
<i>status</i>
</td></tr>
<tr><td>
aeson
</td><td style="text-align:right">
10000
</td><td>
clean
</td></tr>
<tr><td>
snap-core
</td><td style="text-align:right">
2030
</td><td>
requires <code>--allow-newer</code>
</td></tr>
<tr><td>
conduit-extra
</td><td style="text-align:right">
1816
</td><td>
clean
</td></tr>
<tr><td>
fay
</td><td style="text-align:right">
1740
</td><td>
clean
</td></tr>
<tr><td>
snap
</td><td style="text-align:right">
1681
</td><td>
requires <code>--allow-newer</code>
</td></tr>
<tr><td>
conduit-extra
</td><td style="text-align:right">
1492
</td><td>
clean
</td></tr>
<tr><td>
persistent
</td><td style="text-align:right">
1487
</td><td>
clean
</td></tr>
<tr><td>
yaml
</td><td style="text-align:right">
1313
</td><td>
clean
</td></tr>
<tr><td>
io-streams
</td><td style="text-align:right">
1205
</td><td>
requires <code>--allow-newer</code>
</td></tr>
<tr><td>
configurator
</td><td style="text-align:right">
1161
</td><td>
clean
</td></tr>
<tr><td>
yesod-form
</td><td style="text-align:right">
1077
</td><td>
requires <code>--allow-newer</code>
</td></tr>
<tr><td>
snap-server
</td><td style="text-align:right">
889
</td><td>
requires <code>--allow-newer</code>
</td></tr>
<tr><td>
heist
</td><td style="text-align:right">
881
</td><td>
requires <code>--allow-newer</code>
</td></tr>
<tr><td>
parsers
</td><td style="text-align:right">
817
</td><td>
clean
</td></tr>
<tr><td>
cassava
</td><td style="text-align:right">
643
</td><td>
clean
</td></tr>
</tbody>
</table>

<h1 id="and-finally">And finally</h1>
<p>When I was compiling the list of significant packages using attoparsec, I made a guess that the Unix <code>rev</code> command would reverse the order of lines in a file. What it does instead seems much less useful: it reverses the bytes on each line.</p>
<p>Why do I mention this? Because my mistake led to the discovery that there’s a surprising number of Haskell packages whose names read at least as well backwards as forwards.</p>
<pre><code>citats-dosey           revres-foornus
corpetic-codnap        rotaremune-cesrapotta
eroc-ognid             rotaremune-ptth
eroc-pans              rotarugifnoc
forp-colla-emit-chg    sloot-ipa
kramtsop               stekcosbew
morf-gnirtsetyb        teppup-egaugnal
nosea                  tropmish
revirdbew              troppus-ipa-krowten</code></pre>
<p>(And finally-most-of-all, if you’re curious about where I measured my numbers, I used my 2011-era 2.2GHz MacBook Pro running 64-bit GHC 7.6.3. Server-class hardware should do <em>way</em> better.)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2014/05/31/attoparsec/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Top Haskell packages seen through graph centrality beer goggles</title>
		<link>http://www.serpentine.com/blog/2014/05/18/top-haskell-packages-a-graph-centrality-perspective/</link>
		<comments>http://www.serpentine.com/blog/2014/05/18/top-haskell-packages-a-graph-centrality-perspective/#comments</comments>
		<pubDate>Sun, 18 May 2014 07:40:38 +0000</pubDate>
		<dc:creator><![CDATA[Bryan O'Sullivan]]></dc:creator>
				<category><![CDATA[haskell]]></category>
		<category><![CDATA[open source]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=1048</guid>
		<description><![CDATA[I threw together a little code tonight to calculate the Katz centrality of packages on Hackage. This is a measure that states that a package is important if an important package depends on it. The definition is recursive, as is<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2014/05/18/top-haskell-packages-a-graph-centrality-perspective/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>I threw together a little code tonight to calculate the <a href="https://en.wikipedia.org/wiki/Katz_centrality">Katz centrality</a> of packages on Hackage. This is a measure that states that a package is important if an important package depends on it. The definition is recursive, as is the matrix computation that converges towards a fixpoint to calculate it.</p>
<p>Here are the top hundred Hackage packages as calculated by this method, along with their numeric measures of centrality, to which I’ve given the slightly catchier name “mojo” here.</p>
<p>This method has a few obvious flaws: it doesn’t count downloads, nor can it take into account packages that only contain executables. That said, the results still look pretty robust.</p>
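<p>For the curious, the iteration itself is tiny. Here’s a minimal sketch of the kind of fixpoint computation involved, over a made-up four-package graph (illustrative Haskell of mine, not the code I actually ran):</p>

```haskell
import qualified Data.Map.Strict as M

-- A made-up miniature dependency graph: each package maps to the
-- packages it depends on.
deps :: M.Map String [String]
deps = M.fromList
  [ ("aeson",      ["text", "bytestring", "base"])
  , ("text",       ["bytestring", "base"])
  , ("bytestring", ["base"])
  , ("base",       [])
  ]

-- One Katz update: a package's new score is a constant (beta) plus a
-- damped (alpha) sum of the scores of the packages that depend on it.
step :: Double -> Double -> M.Map String Double -> M.Map String Double
step alpha beta scores = M.mapWithKey update deps
  where
    update pkg _ = beta + alpha * sum
      [ scores M.! d | (d, ds) <- M.toList deps, pkg `elem` ds ]

-- Iterate to a fixpoint: stop when no score moves by more than epsilon.
katz :: Double -> Double -> M.Map String Double
katz alpha beta = go (M.map (const beta) deps)
  where
    go xs
      | maximum (M.elems diffs) < 1e-9 = xs'
      | otherwise                      = go xs'
      where
        xs'   = step alpha beta xs
        diffs = M.unionWith (\x y -> abs (x - y)) xs xs'

main :: IO ()
main = mapM_ print (M.toList (katz 0.3 1.0))
```

<p>Because package dependency graphs are acyclic, the iteration settles after finitely many steps; in general the damping factor must be smaller than the reciprocal of the adjacency matrix’s largest eigenvalue for the computation to converge. Unsurprisingly, <code>base</code> comes out on top even in this toy graph.</p>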
<table>
<tr><th>
package
</th><th>
mojo
</th></tr>
<tr><td>
base
</td><td style="text-align:right">
10000
</td></tr>
<tr><td>
ghc-prim
</td><td style="text-align:right">
9178
</td></tr>
<tr><td>
array
</td><td style="text-align:right">
1354
</td></tr>
<tr><td>
bytestring
</td><td style="text-align:right">
1278
</td></tr>
<tr><td>
deepseq
</td><td style="text-align:right">
1197
</td></tr>
<tr><td>
containers
</td><td style="text-align:right">
994
</td></tr>
<tr><td>
transformers
</td><td style="text-align:right">
925
</td></tr>
<tr><td>
mtl
</td><td style="text-align:right">
840
</td></tr>
<tr><td>
text
</td><td style="text-align:right">
546
</td></tr>
<tr><td>
time
</td><td style="text-align:right">
460
</td></tr>
<tr><td>
filepath
</td><td style="text-align:right">
441
</td></tr>
<tr><td>
directory
</td><td style="text-align:right">
351
</td></tr>
<tr><td>
parsec
</td><td style="text-align:right">
299
</td></tr>
<tr><td>
old-locale
</td><td style="text-align:right">
267
</td></tr>
<tr><td>
template-haskell
</td><td style="text-align:right">
247
</td></tr>
<tr><td>
network
</td><td style="text-align:right">
213
</td></tr>
<tr><td>
process
</td><td style="text-align:right">
208
</td></tr>
<tr><td>
vector
</td><td style="text-align:right">
208
</td></tr>
<tr><td>
pretty
</td><td style="text-align:right">
187
</td></tr>
<tr><td>
random
</td><td style="text-align:right">
172
</td></tr>
<tr><td>
binary
</td><td style="text-align:right">
158
</td></tr>
<tr><td>
QuickCheck
</td><td style="text-align:right">
130
</td></tr>
<tr><td>
utf8-string
</td><td style="text-align:right">
128
</td></tr>
<tr><td>
stm
</td><td style="text-align:right">
119
</td></tr>
<tr><td>
unix
</td><td style="text-align:right">
116
</td></tr>
<tr><td>
haskell98
</td><td style="text-align:right">
100
</td></tr>
<tr><td>
hashable
</td><td style="text-align:right">
96
</td></tr>
<tr><td>
attoparsec
</td><td style="text-align:right">
92
</td></tr>
<tr><td>
old-time
</td><td style="text-align:right">
88
</td></tr>
<tr><td>
primitive
</td><td style="text-align:right">
87
</td></tr>
<tr><td>
aeson
</td><td style="text-align:right">
72
</td></tr>
<tr><td>
unordered-containers
</td><td style="text-align:right">
70
</td></tr>
<tr><td>
syb
</td><td style="text-align:right">
69
</td></tr>
<tr><td>
data-default
</td><td style="text-align:right">
67
</td></tr>
<tr><td>
split
</td><td style="text-align:right">
64
</td></tr>
<tr><td>
transformers-base
</td><td style="text-align:right">
63
</td></tr>
<tr><td>
blaze-builder
</td><td style="text-align:right">
62
</td></tr>
<tr><td>
monad-control
</td><td style="text-align:right">
62
</td></tr>
<tr><td>
conduit
</td><td style="text-align:right">
62
</td></tr>
<tr><td>
semigroups
</td><td style="text-align:right">
59
</td></tr>
<tr><td>
cereal
</td><td style="text-align:right">
57
</td></tr>
<tr><td>
tagged
</td><td style="text-align:right">
57
</td></tr>
<tr><td>
bindings-DSL
</td><td style="text-align:right">
55
</td></tr>
<tr><td>
HUnit
</td><td style="text-align:right">
55
</td></tr>
<tr><td>
gtk
</td><td style="text-align:right">
54
</td></tr>
<tr><td>
Cabal
</td><td style="text-align:right">
54
</td></tr>
<tr><td>
lens
</td><td style="text-align:right">
50
</td></tr>
<tr><td>
OpenGL
</td><td style="text-align:right">
46
</td></tr>
<tr><td>
haskell-src-exts
</td><td style="text-align:right">
45
</td></tr>
<tr><td>
cmdargs
</td><td style="text-align:right">
45
</td></tr>
<tr><td>
HTTP
</td><td style="text-align:right">
44
</td></tr>
<tr><td>
http-types
</td><td style="text-align:right">
43
</td></tr>
<tr><td>
extensible-exceptions
</td><td style="text-align:right">
43
</td></tr>
<tr><td>
glib
</td><td style="text-align:right">
42
</td></tr>
<tr><td>
utility-ht
</td><td style="text-align:right">
41
</td></tr>
<tr><td>
data-default-class
</td><td style="text-align:right">
38
</td></tr>
<tr><td>
parallel
</td><td style="text-align:right">
35
</td></tr>
<tr><td>
resourcet
</td><td style="text-align:right">
34
</td></tr>
<tr><td>
semigroupoids
</td><td style="text-align:right">
34
</td></tr>
<tr><td>
xml
</td><td style="text-align:right">
34
</td></tr>
<tr><td>
comonad
</td><td style="text-align:right">
33
</td></tr>
<tr><td>
lifted-base
</td><td style="text-align:right">
33
</td></tr>
<tr><td>
cairo
</td><td style="text-align:right">
33
</td></tr>
<tr><td>
safe
</td><td style="text-align:right">
32
</td></tr>
<tr><td>
MissingH
</td><td style="text-align:right">
31
</td></tr>
<tr><td>
exceptions
</td><td style="text-align:right">
31
</td></tr>
<tr><td>
base-unicode-symbols
</td><td style="text-align:right">
31
</td></tr>
<tr><td>
ansi-terminal
</td><td style="text-align:right">
31
</td></tr>
<tr><td>
vector-space
</td><td style="text-align:right">
30
</td></tr>
<tr><td>
nats
</td><td style="text-align:right">
30
</td></tr>
<tr><td>
OpenGLRaw
</td><td style="text-align:right">
30
</td></tr>
<tr><td>
monads-tf
</td><td style="text-align:right">
28
</td></tr>
<tr><td>
wai
</td><td style="text-align:right">
28
</td></tr>
<tr><td>
hslogger
</td><td style="text-align:right">
28
</td></tr>
<tr><td>
regex-compat
</td><td style="text-align:right">
28
</td></tr>
<tr><td>
GLUT
</td><td style="text-align:right">
27
</td></tr>
<tr><td>
void
</td><td style="text-align:right">
27
</td></tr>
<tr><td>
blaze-html
</td><td style="text-align:right">
26
</td></tr>
<tr><td>
hxt
</td><td style="text-align:right">
25
</td></tr>
<tr><td>
dlist
</td><td style="text-align:right">
25
</td></tr>
<tr><td>
zlib
</td><td style="text-align:right">
25
</td></tr>
<tr><td>
hmatrix
</td><td style="text-align:right">
24
</td></tr>
<tr><td>
SDL
</td><td style="text-align:right">
24
</td></tr>
<tr><td>
case-insensitive
</td><td style="text-align:right">
24
</td></tr>
<tr><td>
scientific
</td><td style="text-align:right">
23
</td></tr>
<tr><td>
X11
</td><td style="text-align:right">
23
</td></tr>
<tr><td>
tagsoup
</td><td style="text-align:right">
22
</td></tr>
<tr><td>
regex-posix
</td><td style="text-align:right">
22
</td></tr>
<tr><td>
HaXml
</td><td style="text-align:right">
22
</td></tr>
<tr><td>
system-filepath
</td><td style="text-align:right">
22
</td></tr>
<tr><td>
enumerator
</td><td style="text-align:right">
22
</td></tr>
<tr><td>
contravariant
</td><td style="text-align:right">
21
</td></tr>
<tr><td>
base64-bytestring
</td><td style="text-align:right">
21
</td></tr>
<tr><td>
http-conduit
</td><td style="text-align:right">
21
</td></tr>
<tr><td>
blaze-markup
</td><td style="text-align:right">
21
</td></tr>
<tr><td>
MonadRandom
</td><td style="text-align:right">
20
</td></tr>
<tr><td>
failure
</td><td style="text-align:right">
20
</td></tr>
<tr><td>
test-framework
</td><td style="text-align:right">
20
</td></tr>
<tr><td>
xhtml
</td><td style="text-align:right">
20
</td></tr>
<tr><td>
distributive
</td><td style="text-align:right">
19
</td></tr>
</table>

]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2014/05/18/top-haskell-packages-a-graph-centrality-perspective/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Once more into the teach, dear friends</title>
		<link>http://www.serpentine.com/blog/2014/05/13/once-more-into-the-teach-dear-friends/</link>
		<comments>http://www.serpentine.com/blog/2014/05/13/once-more-into-the-teach-dear-friends/#comments</comments>
		<pubDate>Wed, 14 May 2014 05:18:22 +0000</pubDate>
		<dc:creator><![CDATA[Bryan O'Sullivan]]></dc:creator>
				<category><![CDATA[haskell]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=1046</guid>
		<description><![CDATA[Since the beginning of April, David Mazières and I have been back in the saddle teaching CS240H at Stanford again. If you’re tuning in recently, David and I both love systems programming, and we particularly get a kick out of<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2014/05/13/once-more-into-the-teach-dear-friends/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>Since the beginning of April, <a href="http://www.scs.stanford.edu/~dm/">David Mazières</a> and I have been back in the saddle <a href="http://www.scs.stanford.edu/14sp-cs240h/">teaching CS240H at Stanford</a> again.</p>
<p>If you’re tuning in recently, David and I both love systems programming, and we particularly get a kick out of doing it in Haskell. Let me state this more plainly: Haskell is an <em>excellent</em> systems programming language.</p>
<p>Our aim with this class is to teach both enough advanced Haskell that students really get a feel for how different it is from other programming languages, and to apply this leverage to the kinds of problems that people typically think of as “systemsy”: How do I write solid concurrent software? How do I design it cleanly? What do I do to make it fast? How do I talk to other stuff, like databases and web servers?</p>
<p>As before, <a href="http://www.scs.stanford.edu/14sp-cs240h/slides/">we’re making our lecture notes freely available</a>. In my case, the notes are complete rewrites compared to the <a href="http://www.scs.stanford.edu/11au-cs240h/notes/">2011 notes</a>.</p>
<p>I had a few reasons for rewriting everything. I have changed the way I teach: every class has at least some amount of interactivity, including in-class assignments to give students a chance to absorb what I’m throwing at them. Compared to the first time around, I’ve dialed back the sheer volume of information in each lecture, to make the pace less overwhelming. Everything is simply fresher in my mind if I write the material right before I deliver it.</p>
<p>And finally, sometimes I can throw away plans at the last minute. On the syllabus for today, I was supposed to rehash an old talk about <a href="http://www.scs.stanford.edu/11au-cs240h/notes/par.html">folds and parallel programming</a>, but I found myself unable to get motivated by either subject at 8pm last night, once I’d gotten the kids to bed and settled down to start on the lecture notes. So I hemmed and hawed for a few minutes, decided that <a href="http://www.scs.stanford.edu/14sp-cs240h/slides/lenses.html">talking about lenses was <em>way</em> more important</a>, and went with that.</p>
<p>Some of my favourite parts of the teaching experience are the most humbling. I hold office hours every week; this always feels like a place where I have to bring my “A” game, because there’s no longer a script. Some student will wander in with a problem where I have no idea what the answer is, but I vaguely remember reading a paper four years ago that covered it, so when I’m lucky I get to play glorified librarian and point people at really fun research.</p>
<p>I do get asked why we don’t do this as a MOOC.</p>
<p>It is frankly a pleasure to actually engage with a room full of bright, motivated people, and to try to find ways to help them and encourage them. I don’t know quite how I’d replicate that visceral feedback with an anonymous audience, but it qualitatively matters to me.</p>
<p>And to be honest, I’ve been skeptical of the MOOC phenomenon, because while the hype around them was huge, it’s always been clear that almost nobody knew what they were doing, or what it would even mean for that model to be successful. If the MOOC world converges on a few models that make some sense and don’t take a vast effort to do well, I’m sure we’ll revisit the possibility.</p>
<p>Until then, enjoy the slides, and happy hacking!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2014/05/13/once-more-into-the-teach-dear-friends/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Book review: Parallel and Concurrent Programming in Haskell</title>
		<link>http://www.serpentine.com/blog/2014/03/18/book-review-parallel-and-concurrent-programming-in-haskell/</link>
		<comments>http://www.serpentine.com/blog/2014/03/18/book-review-parallel-and-concurrent-programming-in-haskell/#comments</comments>
		<pubDate>Wed, 19 Mar 2014 05:15:30 +0000</pubDate>
		<dc:creator><![CDATA[Bryan O'Sullivan]]></dc:creator>
				<category><![CDATA[haskell]]></category>
		<category><![CDATA[reading]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=1034</guid>
		<description><![CDATA[It's time someone finally wrote a proper review of Simon Marlow's amazing book, Parallel and Concurrent Programming in Haskell. I am really not the right person to tackle this job objectively, because I have known Simon for 20 years and<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2014/03/18/book-review-parallel-and-concurrent-programming-in-haskell/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="generator" content="pandoc" />
  <title></title>
</head>
<body>
<p>It's time someone finally wrote a proper review of Simon Marlow's amazing book, <a href="http://chimera.labs.oreilly.com/books/1230000000929"><em>Parallel and Concurrent Programming in Haskell</em></a>.</p>
<p>I am really not the right person to tackle this job objectively, because I have known Simon for 20 years and I currently happen to be his boss at Facebook. Nevertheless, I fly my flag of editorial bias proudly, and in any case a moment's glance at Simon's book will convince you that the absurdly purple review I am about to write is entirely justified.</p>
<p>Moreover, this book is sufficiently clear, and introduces so many elegant ideas and beautiful abstractions, that you would do well to learn the minimal amount of Haskell necessary to absorb its lessons, simply so that you can become enriched in the reading.</p>
<p>Simon's book makes an overdue departure from the usual Haskell literature (including my own book, which in case you didn't know is fully titled &quot;Real World Haskell Of Six Years Ago Which We Should Have Edited A Little More Carefully&quot;) in assuming that you already have a modest degree of understanding of the language. This alone is slightly pathetically refreshing! I can't tell you how glad I am that functional programming has finally reached the point where we no longer have to start every bloody book by explaining what it is.</p>
<p>Actually, there's a second reason that I might not be an ideal person to review this book: I have only skimmed most of the first half, which concerns itself with parallel programming. Just between you and me, I will confess that parallel programming in Haskell hasn't lit my internal fire of enthusiasm. I used to do a lot of parallel programming in a previous life, largely using MPI, and the experience burned me out. While parallel programming in Haskell is far nicer than grinding away in MPI ever was, I do not love the subject enough that I want to read about it.</p>
<p>So what I'm really reviewing here is the second part of Simon's book, which if issued all by itself at the same price as the current entire tome, would <em>still be a bargain</em>. Let's talk about just how good it is.</p>
<p>The second half of the book concerns itself with concurrent programming, an area where Haskell particularly shines, and which happens to be the bread-and-butter of many a working programmer today. The treatment of concurrency does not depend in any way on the preceding chapters, so if you're so inclined, you can read chapter one and then skip to the second half of the book without missing any necessary information.</p>
<p>Chapter 7 begins by introducing some of the basic components of concurrent Haskell, threads (familiar to all) and a data type called an <code>MVar</code>. An <code>MVar</code> acts a bit like a single-item box: you can put one item into it if it's empty, otherwise you must wait; and you can take an item out if it's full, otherwise you must wait.</p>
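<p>If you haven’t met <code>MVar</code> before, the single-item-box behaviour takes only a few lines to see in action (this sketch is mine, not the book’s code):</p>

```haskell
import Control.Concurrent
import Control.Concurrent.MVar

main :: IO ()
main = do
  box <- newEmptyMVar                  -- the box starts out empty
  _   <- forkIO (putMVar box "hello")  -- a writer thread fills it
  msg <- takeMVar box                  -- blocks until the box is full
  putStrLn msg
  -- The box is now empty again: another putMVar would succeed at
  -- once, while another takeMVar would block.
```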
<p>As humble as the <code>MVar</code> is, Simon uses it as a simple communication channel with which he builds a simple concurrent logging service. He then deftly identifies the performance problem that a concurrent service will have when an <code>MVar</code> acts as a bottleneck. Not content with this bottleneck, he illustrates how to construct an efficient <em>unbounded</em> channel using <code>MVar</code> as the building block, and clearly explains how this more complex structure works safely.</p>
<p>This is the heart of Simon's teaching technique: he presents an idea that is simple to grasp, then pokes a hole in it. With this hole as motivation, he presents a slightly more complicated approach that corrects the weaknesses of the prior step, without sacrificing that clarity.</p>
<p>For instance, the mechanism behind unbounded channels is an intricate dance of two <code>MVar</code>s, where Simon clearly explains how they ensure that a writer will not block, while a reader will block only if the channel is empty. He then goes on to show how this channel type can be extended to support <em>multicast</em>, such that one writer can send messages to several readers. His initial implementation is subtly incorrect, which he once again explains and uses as a springboard to a final version. By this time, you've accumulated enough lessons from the progression of examples that you can appreciate the good design taste and durability of these unbounded channels.</p>
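<p>The construction he builds up to is essentially the one inside <code>Control.Concurrent.Chan</code>: a linked list threaded through <code>MVar</code>s, with one <code>MVar</code> guarding each end of the channel. A stripped-down, single-reader rendering of the idea (names mine) looks like this:</p>

```haskell
import Control.Concurrent.MVar

-- The channel is a linked list of items threaded through MVars. The
-- "hole" at the end of the list is an empty MVar that the next
-- writer will fill.
type Stream a = MVar (Item a)
data Item a   = Item a (Stream a)

-- readEnd points at the first unread item; writeEnd points at the hole.
data Channel a = Channel (MVar (Stream a)) (MVar (Stream a))

newChannel :: IO (Channel a)
newChannel = do
  hole     <- newEmptyMVar
  readEnd  <- newMVar hole
  writeEnd <- newMVar hole
  return (Channel readEnd writeEnd)

-- Writers never wait for readers: they fill the current hole and
-- install a fresh one.
writeChannel :: Channel a -> a -> IO ()
writeChannel (Channel _ writeEnd) x = do
  newHole <- newEmptyMVar
  oldHole <- takeMVar writeEnd
  putMVar oldHole (Item x newHole)
  putMVar writeEnd newHole

-- Readers block only when the read end has caught up with the hole,
-- i.e. when the channel is empty.
readChannel :: Channel a -> IO a
readChannel (Channel readEnd _) = do
  stream <- takeMVar readEnd
  Item x rest <- takeMVar stream
  putMVar readEnd rest
  return x
```

<p>The multicast extension the book describes needs a further twist (reading without consuming), which is exactly where Simon’s subtly-wrong first version and its fix come in.</p>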
<p>Incidentally, this is a good time to talk about the chapter on parallel computing that I made sure <em>not</em> to skip: chapter 4, which covers dataflow parallelism using an abstraction called <code>Par</code>. Many of the types and concerns in this chapter will be familiar to you if you're used to concurrent programming with threads, which makes this the most practical chapter to start with if you want to venture into parallel programming in Haskell, but don't know where to begin. <code>Par</code> is simply wonderfully put together, and is an inspiring example of tasteful, parsimonious API design. So put chapter 4 on your must-read list.</p>
<p>Returning to the concurrent world, chapter 8 introduces exceptions, using asynchronous operations as the motivation. Simon builds a data type called <code>Async</code>, which is similar to &quot;futures&quot; or &quot;promises&quot; from other languages (and to the <code>IVar</code> type from chapter 4), and proceeds to make <code>Async</code> operations progressively more robust in the face of exceptions, then more powerful so that we can wait on the completion of one of several <code>Async</code> operations.</p>
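<p>The core of that <code>Async</code> type fits in a dozen lines. Here’s a first cut along the lines the chapter develops (simplified; the book refines it considerably):</p>

```haskell
import Control.Concurrent
import Control.Exception

-- The MVar holds either the child thread's exception or its result.
data Async a = Async (MVar (Either SomeException a))

-- Run an action in its own thread, capturing any exception it throws.
async :: IO a -> IO (Async a)
async action = do
  var <- newEmptyMVar
  _ <- forkIO (try action >>= putMVar var)
  return (Async var)

-- Wait for the result, rethrowing the child's exception if it failed.
wait :: Async a -> IO a
wait (Async var) = do
  r <- readMVar var
  case r of
    Left e  -> throwIO e
    Right a -> return a
```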
<p>Chapter 9 resumes the progress up the robustness curve, by showing how we can safely cancel <code>Async</code> operations that have not yet completed, how to deal with the trouble that exceptions can cause when thrown at an inopportune time (hello, resource leaks!), and how to put an upper bound on the amount of time that an operation can run for.</p>
<p>Software transactional memory gets an extended treatment in chapters 10 and 11. STM has gotten a bad rap in the concurrent programming community, mostly because the implementations of STM that target traditional programming languages have drawbacks so huge that they are deeply unappealing. In the same way that the Java and C++ of 10-15 years ago ruined the reputation of static type systems when there were vastly better alternatives out there, STM in Haskell might be easy to consign to the intellectual dustbin by association, when in fact it's a much more interesting beast than its relatives.</p>
<p>A key problem with traditional STM is that its performance is killed stone dead by the amount of mutable state that needs to be tracked during a transaction. Haskell sidesteps much of this need for book-keeping with its default stance that favours immutable data. Nevertheless, STM in Haskell does have a cost, and Simon shows how to structure code that uses STM to make its overheads acceptable.</p>
<p>Another huge difficulty with traditional STM lies in the messy boundary between transactional code and code that has side effects (and which hence cannot be safely called from a transaction). Haskell's type system eliminates these difficulties, and in fact makes it easier to construct sophisticated combinations of transactional operations. Although we touched on STM having some overhead, Simon revisits the <code>Async</code> API and uses some of the advanced features of Haskell STM to build a multiple-wait implementation that is <em>more</em> efficient than its <code>MVar</code>-based predecessor.</p>
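<p>A small taste of that composability: waiting on whichever of two sources delivers first is a one-liner with <code>orElse</code>, where the <code>MVar</code> version needs a helper thread per source. (A sketch of mine using the <code>stm</code> package, not the book’s code.)</p>

```haskell
import Control.Concurrent
import Control.Concurrent.STM

-- Wait on whichever of two transactional boxes fills first. If both
-- takeTMVars would retry, the whole transaction blocks until one can
-- run; no auxiliary threads are needed.
waitEitherOf :: TMVar a -> TMVar b -> STM (Either a b)
waitEitherOf l r =
  fmap Left (takeTMVar l) `orElse` fmap Right (takeTMVar r)

main :: IO ()
main = do
  a <- newEmptyTMVarIO :: IO (TMVar String)
  b <- newEmptyTMVarIO
  _ <- forkIO (atomically (putTMVar b "second source"))
  result <- atomically (waitEitherOf a b)
  print result
```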
<p>In chapter 14, Simon covers Cloud Haskell, a set of fascinating packages that implement Erlang-style distributed message passing, complete with monitoring and restart of remote nodes. I admire Cloud Haskell for its practical willingness to adopt wholesale the very solid ideas of the Erlang community, as they have a quarter of a century of positive experience with their distinctive approach to constructing robust distributed applications.</p>
<p>If you don't already know Haskell, this book offers two significant gifts. The first is a vigorous and compelling argument for why Haskell is an uncommonly good language for the kind of concurrent programming that is fundamental to much of today's computing. The second is an eye-opening illustration of some beautiful and powerful APIs that transcend any particular language. Concise, elegant design is worth celebrating wherever you see it, and this book is brimful of examples.</p>
<p>On the other hand, if you're already a Haskell programmer, it is very likely that this book will awaken you to bugs you didn't know your concurrent code had, abstractions that you could be building to make your applications cleaner, and practical lessons in how to start simple and then refine your code as you learn more about your needs.</p>
<p>Finally, for me as a writer of books about computing, this book has lessons too. It is understated, letting the quality of its examples and abstractions convince more deeply than bombast could reach. It is minimalist, revisiting the same few initially simple ideas through successive waves of refinement and teaching. And it is clear, with nary a word out of place.</p>
<p>In short, if you care about Haskell, if you are interested in concurrency, if you appreciate good design, if you have an ear for well-crafted teaching, <a href="http://chimera.labs.oreilly.com/books/1230000000929"><em>Parallel and Concurrent Programming in Haskell</em></a> is a book that you simply <em>must</em> read. We simply do not see books of this quality very often, so treasure 'em when you see 'em.</p>
</body>
</html>
]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2014/03/18/book-review-parallel-and-concurrent-programming-in-haskell/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>New year, new library releases, new levels of speed</title>
		<link>http://www.serpentine.com/blog/2014/01/09/new-year-new-library-releases-new-levels-of-speed/</link>
		<comments>http://www.serpentine.com/blog/2014/01/09/new-year-new-library-releases-new-levels-of-speed/#comments</comments>
		<pubDate>Thu, 09 Jan 2014 07:11:19 +0000</pubDate>
		<dc:creator><![CDATA[Bryan O'Sullivan]]></dc:creator>
				<category><![CDATA[haskell]]></category>
		<category><![CDATA[open source]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=1026</guid>
		<description><![CDATA[I just released new versions of the Haskell text, attoparsec, and aeson libraries on Hackage, and there’s a surprising amount to look forward to in them. The summary for the impatient: some core operations in text and aeson are now<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2014/01/09/new-year-new-library-releases-new-levels-of-speed/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>I just released new versions of the Haskell <a href="http://hackage.haskell.org/package/text"><code>text</code></a>, <a href="http://hackage.haskell.org/package/attoparsec"><code>attoparsec</code></a>, and <a href="http://hackage.haskell.org/package/aeson"><code>aeson</code></a> libraries on Hackage, and there’s a surprising amount to look forward to in them.</p>
<p>The summary for the impatient: some core operations in <code>text</code> and <code>aeson</code> are now much more efficient. With <code>text</code>, UTF-8 encoding is up to <em>four times faster</em>, while with <code>aeson</code>, encoding and decoding of JSON bytestrings are both up to <em>twice as fast</em>.</p>
<h2 id="attoparsec-0.11.1.0">attoparsec 0.11.1.0</h2>
<p>Perhaps the least interesting release is <code>attoparsec</code>. It adds a new dependency on Bas van Dijk’s <a href="http://hackage.haskell.org/package/scientific"><code>scientific</code></a> package to allow efficient and more accurate parsing of floating point numbers, a longstanding minor weakness. It also introduces two new functions for single-token lookahead, which are used by the new release of <code>aeson</code>; read on for more details.</p>
<h2 id="text-1.1.0.0">text 1.1.0.0</h2>
<p>The new release of the <code>text</code> library has much better support for encoding to a UTF-8 bytestring via the <code>encodeUtf8</code> function. The new encoder is up to <em>four times faster</em> than in the previous major release.</p>
<p>Simon Meier contributed a pair of UTF-8 encoding functions that can encode to the <a href="http://hackage.haskell.org/package/bytestring-0.10.4.0/docs/Data-ByteString-Builder.html">new <code>Builder</code> type</a> in the latest version of the <a href="http://hackage.haskell.org/package/bytestring"><code>bytestring</code></a> library. These functions are slower than the new <code>encodeUtf8</code> implementation, but still twice as fast as the old <code>encodeUtf8</code>.</p>
<p>Not only are the new <code>Builder</code> encoders admirably fast, they’re more flexible than <code>encodeUtf8</code>, as <code>Builder</code>s can be used to efficiently glue together many small fragments. Once again, read on for more details about how this helped with the new release of <code>aeson</code>. (Note: if you don’t have the latest version of <code>bytestring</code> in your library environment, you won’t get the new <code>Builder</code> encoders.)</p>
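<p>To see why <code>Builder</code>s glue so cheaply, here’s the basic pattern using <code>bytestring</code>’s own <code>Builder</code> API (a toy example of mine, not taken from either library):</p>

```haskell
import           Data.ByteString.Builder
import qualified Data.ByteString.Lazy as L
import           Data.Monoid ((<>))

-- Each fragment becomes a cheap Builder; a single toLazyByteString
-- at the end materialises the result, rather than copying the
-- intermediate pieces into ever-larger buffers at each append.
greeting :: L.ByteString
greeting = toLazyByteString $
     stringUtf8 "hello, "
  <> stringUtf8 "world"
  <> charUtf8 '!'

main :: IO ()
main = print (L.length greeting)  -- 13 bytes
```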
<p>The second major change to the <code>text</code> library came about when I finally decided to expose all of the library’s internal modules. The newly exposed modules can be found in the <code>Data.Text.Internal</code> hierarchy. Before you get too excited, please understand that I can’t make guarantees of release-to-release stability for any functions or types that are documented as internal.</p>
<h2 id="aeson-0.7.0.0">aeson 0.7.0.0</h2>
<p>Finally, the new release of the <code>aeson</code> library focuses on improved performance and accuracy. We parse floating point numbers more accurately thanks once again to Bas van Dijk’s <code>scientific</code> library. And for performance, both decoding and encoding of JSON bytestrings are up to <em>twice as fast</em> as in the previous release.</p>
<p>On the decoding side, I used the new lookahead primitives from <code>attoparsec</code> to make parsing faster and less memory intensive (by avoiding backtracking, if you’re curious). Meanwhile, Simon Meier contributed a patch that uses his new <code>Builder</code> based UTF-8 encoder from the <code>text</code> library to double encoding performance. (Encoding performance is improved even if you don’t have the necessary new version of <code>bytestring</code>, but only by about 10%.)</p>
<p>On my crummy old Mac laptop, I can decode at 30-40 megabytes per second, and encode at 100-170 megabytes per second. Not bad!</p>
<h2 id="thanks">Thanks</h2>
<p>I'd particularly like to thank Bas van Dijk and Simon Meier for their excellent contributions during this most recent development cycle. It's really a pleasure to work with such smart, friendly people.</p>
<p>Simon and Bas deserve some kind of an additional medal for being forgiving of my sometimes embarrassingly long review latencies: some of Simon's patches against the <code>text</code> library are almost two years old! (Please pardon me while I grasp at straws in my slightly shamefaced partial defence here: the necessary version of <code>bytestring</code> wasn't released until three months ago, so I'm not the only person in the Haskell community with long review latencies...)</p>]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2014/01/09/new-year-new-library-releases-new-levels-of-speed/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Testing a UTF-8 decoder with vigour</title>
		<link>http://www.serpentine.com/blog/2013/12/30/testing-a-utf-8-decoder-with-vigour/</link>
		<comments>http://www.serpentine.com/blog/2013/12/30/testing-a-utf-8-decoder-with-vigour/#comments</comments>
		<pubDate>Tue, 31 Dec 2013 05:28:12 +0000</pubDate>
		<dc:creator><![CDATA[Bryan O'Sullivan]]></dc:creator>
				<category><![CDATA[haskell]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=1023</guid>
		<description><![CDATA[Yesterday, Michael Snoyman reported a surprising regression in version 1.0 of my Haskell text library: for some invalid inputs, the UTF-8 decoder was truncating the invalid data instead of throwing an exception. Thanks to Michael providing an easy repro, I<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2013/12/30/testing-a-utf-8-decoder-with-vigour/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>Yesterday, Michael Snoyman reported a surprising regression in version 1.0 of my Haskell <code>text</code> library: for some invalid inputs, the UTF-8 decoder was truncating the invalid data instead of throwing an exception.</p>
<p>Thanks to Michael providing an easy repro, I quickly bisected the origin of the regression to a commit from September that added support for incremental decoding of UTF-8. That work was motivated by applications that need to be able to consume incomplete input (e.g. a network packet containing possibly truncated data) as early as possible.</p>
<p>The low-level UTF-8 decoder is implemented as a state machine in C to squeeze out as much performance as possible. The machine has two visible end states: <code>UTF8_ACCEPT</code> indicates that a buffer was decoded completely and successfully, while <code>UTF8_REJECT</code> specifies that the input contained invalid UTF-8 data. When the decoder stops, all other machine states count as work in progress, i.e. a decode that couldn’t complete because we reached the end of a buffer.</p>
<p>When the old all-or-nothing decoder encountered an incomplete or invalid input, it would back up by a single byte to indicate the location of the error. The incremental decoder is a refactoring of the old decoder, and the new all-or-nothing decoder calls it.</p>
<p>The critical error arose in the refactoring process. Here’s the old code for backing up a byte.</p>
<pre class="sourceCode c"><code class="sourceCode c">    <span class="co">/* Error recovery - if we&#39;re not in a</span>
<span class="co">       valid finishing state, back up. */</span>
    <span class="kw">if</span> (state != UTF8_ACCEPT)
        s -= <span class="dv">1</span>;</code></pre>
<p>This is what the refactoring changed it to:</p>
<pre class="sourceCode c"><code class="sourceCode c">    <span class="co">/* Invalid encoding, back up to the</span>
<span class="co">       errant character. */</span>
    <span class="kw">if</span> (state == UTF8_REJECT)
        s -= <span class="dv">1</span>;</code></pre>
<p>To preserve correctness, the refactoring should have added a check to the new all-or-nothing decoder so that it would step back a byte if the final state of the incremental decoder was <em>neither</em> <code>UTF8_ACCEPT</code> nor <code>UTF8_REJECT</code>. Oops! A very simple bug with unhappy consequences.</p>
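<p>As a toy model of that end-state logic (this is my own Haskell sketch, not the C implementation), the refactored incremental decoder backs up only on rejection, and the missing wrapper check is the one that also backs up when the machine stopped mid-sequence:</p>

```haskell
-- Toy model of the decoder's visible end states; the real machine
-- has many internal states, all of which count as "in progress".
data Utf8State = Accept | Reject | InProgress deriving (Eq, Show)

-- What the refactored incremental decoder does: back up only on
-- invalid input.
incrementalBackUp :: Utf8State -> Int -> Int
incrementalBackUp Reject s = s - 1
incrementalBackUp _      s = s

-- The check the all-or-nothing wrapper should have added: an
-- unfinished decode must also step back to mark the error position.
allOrNothingBackUp :: Utf8State -> Int -> Int
allOrNothingBackUp InProgress s = s - 1
allOrNothingBackUp st         s = incrementalBackUp st s
```

<p>With the extra case, the combined behaviour matches the old code’s “anything but <code>UTF8_ACCEPT</code>” rule.</p>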
<p>The <code>text</code> library has quite a large test suite that has revealed many bugs over the years, often before they ever escaped into the wild. Why did this ugly critter make it over the fence?</p>
<p>Well, a glance at the original code for trying to test UTF-8 error handling is telling—in fact, you don’t even need to be able to read a programming language, because the confession is in the comment.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- This is a poor attempt to ensure that</span>
<span class="co">-- the error handling paths on decode are</span>
<span class="co">-- exercised in some way.  Proper testing</span>
<span class="co">-- would be rather more involved.</span></code></pre>
<p>“Proper testing” indeed. All that I did in the original test was generate a random byte sequence, and see if it provoked the decoder into throwing an exception. The chances of such a dumb test really offering any value are not great, but I had more or less forgotten about it, and so I had a sense of security without the accompanying security. But hey, at least past-me had left a <em>mea culpa</em> note for present-day-me. Right?</p>
<p>While finding and fixing the bug took just a few minutes, I spent several more hours strengthening the test for the UTF-8 decoder, and this was far more interesting.</p>
<p>As a <a href="http://en.wikipedia.org/wiki/UTF-8">variable-length self-synchronizing encoding</a>, UTF-8 is very clever and elegant, but its cleverness allows for a number of implementation bugs. For reference, here is a table (lightly edited from Wikipedia) of the allowable bit patterns used in UTF-8.</p>
<table>
  <tbody>
    <tr>
      <th>
first<br />code point
</th>
      <th>
last<br />code point
</th>
      <th>
byte 1
</th>
      <th>
byte 2
</th>
      <th>
byte 3
</th>
      <th>
byte 4
</th>
    </tr>
    <tr>
      <td>
U+0000
</td>
      <td>
U+007F
</td>
      <td>
<code>0xxxxxxx</code>
</td>
    </tr>
    <tr>
      <td>
U+0080
</td>
      <td>
U+07FF
</td>
      <td>
<code>110xxxxx</code>
</td>
      <td>
<code>10xxxxxx</code>
</td>
    </tr>
    <tr>
      <td>
U+0800
</td>
      <td>
U+FFFF
</td>
      <td>
<code>1110xxxx</code>
</td>
      <td>
<code>10xxxxxx</code>
</td>
      <td>
<code>10xxxxxx</code>
</td>
    </tr>
    <tr>
      <td>
U+10000
</td>
      <td>
U+1FFFFF
</td>
      <td>
<code>11110xxx</code>
</td>
      <td>
<code>10xxxxxx</code>
</td>
      <td>
<code>10xxxxxx</code>
</td>
      <td>
<code>10xxxxxx</code>
</td>
    </tr>
  </tbody>
</table>

<p>The best known of these bugs involves accepting non-canonical encodings. What a canonical encoding <em>means</em> takes a little explaining. UTF-8 can represent any ASCII character in a single byte, and in fact every ASCII character <em>must</em> be represented as a single byte. However, an illegal two-byte encoding of an ASCII character below 0x40 can be achieved by starting with 0xC0, followed by the ASCII character with its high bit set (in general, the character’s seven bits are split across the payload bits of the two-byte pattern). For instance, the ASCII forward slash U+002F is represented in UTF-8 as 0x2F, but a decoder with this bug would also accept 0xC0 0xAF (three- and four-byte encodings are of course also possible).</p>
<p>This bug may seem innocent, but it was widely used to <a href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2000-0884">remotely exploit IIS 4 and IIS 5 servers</a> over a decade ago. Correct UTF-8 decoders must reject non-canonical encodings. (These are also known as <em>overlong</em> encodings.)</p>
<p>In fact, the bytes 0xC0 and 0xC1 will <em>never</em> appear in a valid UTF-8 bytestream, as they can only be used to start two-byte sequences that cannot be canonical.</p>
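<p>To make that concrete, here is a small sketch (the <code>overlong2</code> helper is mine, purely for illustration) that builds the non-canonical two-byte form of an ASCII character by spreading its seven bits across a <code>110xxxxx 10xxxxxx</code> pair:</p>

```haskell
import Data.Bits ((.&.), (.|.), shiftR)
import Data.Char (ord)
import Data.Word (Word8)

-- Overlong two-byte encoding of an ASCII character: the top bit of
-- the seven-bit payload goes into the lead byte, the low six bits
-- into the continuation byte.
overlong2 :: Char -> [Word8]
overlong2 c = [ 0xC0 .|. fromIntegral (n `shiftR` 6)
              , 0x80 .|. fromIntegral (n .&. 0x3F) ]
  where n = ord c
```

<p>For the forward slash, <code>overlong2 '/'</code> yields the byte pair 0xC0 0xAF from the IIS exploit; a correct decoder must reject it.</p>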
<p>To test our UTF-8 decoder’s ability to spot bogus input, then, we might want to generate byte sequences that start with 0xC0 or 0xC1. Haskell’s <a href="http://hackage.haskell.org/package/QuickCheck">QuickCheck</a> library provides us with just such a generating function, <code>choose</code>, which generates a random value in the given range (inclusive).</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">choose (<span class="dv">0xC0</span>, <span class="dv">0xC1</span>)</code></pre>
<p>Once we have a bad leading byte, we may want to follow it with a continuation byte. The value of a particular continuation byte doesn’t much matter, but we would like it to be valid. A continuation byte always contains the bit pattern 0x80 combined with six bits of data in its least significant bits. Here’s a generator for a random continuation byte.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">contByte <span class="fu">=</span> (<span class="dv">0x80</span> <span class="fu">+</span>) <span class="fu">&lt;$&gt;</span> choose (<span class="dv">0</span>, <span class="dv">0x3F</span>)</code></pre>
<p>Our bogus leading byte should be rejected immediately, since it can never generate a canonical encoding. For the sake of thoroughness, we should sometimes follow it with a valid continuation byte to ensure that the two-byte sequence is also rejected.</p>
<p>To do this, we write a general combinator, <code>upTo</code>, that will generate a list of up to <code>n</code> random values.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">upTo ::</span> <span class="dt">Int</span> <span class="ot">-&gt;</span> <span class="dt">Gen</span> a <span class="ot">-&gt;</span> <span class="dt">Gen</span> [a]
upTo n gen <span class="fu">=</span> <span class="kw">do</span>
  k <span class="ot">&lt;-</span> choose (<span class="dv">0</span>,n)
  vectorOf k gen <span class="co">-- a QuickCheck combinator</span></code></pre>
<p>And now we have a very simple way of saying “either 0xC0 or 0xC1, optionally followed by a continuation byte”.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- invalid leading byte of a 2-byte sequence.</span>
(<span class="fu">:</span>) <span class="fu">&lt;$&gt;</span> choose (<span class="dv">0xC0</span>,<span class="dv">0xC1</span>) <span class="fu">&lt;*&gt;</span> upTo <span class="dv">1</span> contByte</code></pre>
<p>Notice in the table above that a 4-byte sequence can encode any code point up to U+1FFFFF. The highest legal Unicode code point is U+10FFFF, so by implication there exists a range of leading bytes for 4-byte sequences that can never appear in valid UTF-8.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- invalid leading byte of a 4-byte sequence.</span>
(<span class="fu">:</span>) <span class="fu">&lt;$&gt;</span> choose (<span class="dv">0xF5</span>,<span class="dv">0xFF</span>) <span class="fu">&lt;*&gt;</span> upTo <span class="dv">3</span> contByte</code></pre>
<p>We should never encounter a continuation byte without a leading byte somewhere before it.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- Continuation bytes without a start byte.</span>
listOf1 contByte
<span class="co">-- The listOf1 combinator generates a list</span>
<span class="co">-- containing at least one element.</span></code></pre>
<p>Similarly, a bit pattern that introduces a 2-byte sequence must be followed by one continuation byte, so it’s worth generating such a leading byte <em>without</em> its continuation byte.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- Short 2-byte sequence.</span>
(<span class="fu">:</span>[]) <span class="fu">&lt;$&gt;</span> choose (<span class="dv">0xC2</span>, <span class="dv">0xDF</span>)</code></pre>
<p>We do the same for 3-byte and 4-byte sequences.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- Short 3-byte sequence.</span>
(<span class="fu">:</span>) <span class="fu">&lt;$&gt;</span> choose (<span class="dv">0xE0</span>, <span class="dv">0xEF</span>) <span class="fu">&lt;*&gt;</span> upTo <span class="dv">1</span> contByte
<span class="co">-- Short 4-byte sequence.</span>
(<span class="fu">:</span>) <span class="fu">&lt;$&gt;</span> choose (<span class="dv">0xF0</span>, <span class="dv">0xF4</span>) <span class="fu">&lt;*&gt;</span> upTo <span class="dv">2</span> contByte</code></pre>
<p>Earlier, we generated 4-byte sequences beginning with a byte in the range 0xF5 to 0xFF. Although 0xF4 is a valid leading byte for a 4-byte sequence, it’s possible for a perverse choice of continuation bytes to yield an illegal code point between U+110000 and U+13FFFF. This code generates just such illegal sequences.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- 4-byte sequence greater than U+10FFFF.</span>
k <span class="ot">&lt;-</span> choose (<span class="dv">0x11</span>, <span class="dv">0x13</span>)
<span class="kw">let</span> w0 <span class="fu">=</span> <span class="dv">0xF0</span> <span class="fu">+</span> (k <span class="ot">`Bits.shiftR`</span> <span class="dv">2</span>)
    w1 <span class="fu">=</span> <span class="dv">0x80</span> <span class="fu">+</span> ((k <span class="fu">.&amp;.</span> <span class="dv">3</span>) <span class="ot">`Bits.shiftL`</span> <span class="dv">4</span>)
([w0,w1]<span class="fu">++</span>) <span class="fu">&lt;$&gt;</span> vectorOf <span class="dv">2</span> contByte</code></pre>
<p>Finally, we arrive at the general case of non-canonical encodings. We take a one-byte code point and encode it as two, three, or four bytes; and so on for two-byte and three-byte characters.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- Overlong encoding.</span>
k <span class="ot">&lt;-</span> choose (<span class="dv">0</span>,<span class="dv">0xFFFF</span>)
<span class="kw">let</span> c <span class="fu">=</span> <span class="fu">chr</span> k
<span class="kw">case</span> k <span class="kw">of</span>
  _ <span class="fu">|</span> k <span class="fu">&lt;</span> <span class="dv">0x80</span>  <span class="ot">-&gt;</span> oneof [
          <span class="kw">let</span> (w,x)     <span class="fu">=</span> ord2 c <span class="kw">in</span> <span class="fu">return</span> [w,x]
        , <span class="kw">let</span> (w,x,y)   <span class="fu">=</span> ord3 c <span class="kw">in</span> <span class="fu">return</span> [w,x,y]
        , <span class="kw">let</span> (w,x,y,z) <span class="fu">=</span> ord4 c <span class="kw">in</span> <span class="fu">return</span> [w,x,y,z] ]
    <span class="fu">|</span> k <span class="fu">&lt;</span> <span class="dv">0x7FF</span> <span class="ot">-&gt;</span> oneof [
          <span class="kw">let</span> (w,x,y)   <span class="fu">=</span> ord3 c <span class="kw">in</span> <span class="fu">return</span> [w,x,y]
        , <span class="kw">let</span> (w,x,y,z) <span class="fu">=</span> ord4 c <span class="kw">in</span> <span class="fu">return</span> [w,x,y,z] ]
    <span class="fu">|</span> <span class="fu">otherwise</span> <span class="ot">-&gt;</span>
          <span class="kw">let</span> (w,x,y,z) <span class="fu">=</span> ord4 c <span class="kw">in</span> <span class="fu">return</span> [w,x,y,z]
<span class="co">-- The oneof combinator chooses a generator at random.</span>
<span class="co">-- Functions ord2, ord3, and ord4 break down a character</span>
<span class="co">-- into its 2, 3, or 4 byte encoding.</span></code></pre>
<p>Armed with a generator that uses <code>oneof</code> to choose one of the above invalid UTF-8 encodings at random, we embed the invalid bytestream in one of three cases: by itself, at the end of an otherwise valid buffer, and at the beginning of an otherwise valid buffer. This variety gives us some assurance of catching buffer overrun errors.</p>
<p>Sure enough, this vastly more elaborate QuickCheck test immediately demonstrates the bug that Michael found.</p>
<p>The original test is a classic case of basic fuzzing: it simply generates random junk and hopes for the best. The fact that it <a href="https://github.com/bos/text/issues/61">let the decoder bug through</a> underlines the weakness of fuzzing. If I had cranked the number of randomly generated test inputs up high enough, I’d probably have found the bug, but the approach of pure randomness would have caused the bug to remain difficult to reproduce and understand.</p>
<p>The revised test is much more sophisticated, as it generates only test cases that are known to be invalid, with a rich assortment of precisely generated invalid encodings to choose from. While it has the same probabilistic nature as the fuzzing approach, it excludes a huge universe of uninteresting inputs from being tested, and hence is much more likely to reveal a weakness quickly and efficiently.</p>
<p>The moral of the story: even QuickCheck tests, though vastly more powerful than unit tests and fuzz tests, are only as good as you make them!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2013/12/30/testing-a-utf-8-decoder-with-vigour/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Open question: help me design a new encoding API for aeson</title>
		<link>http://www.serpentine.com/blog/2013/10/14/open-question-help-me-design-a-new-encoding-api-for-aeson/</link>
		<comments>http://www.serpentine.com/blog/2013/10/14/open-question-help-me-design-a-new-encoding-api-for-aeson/#comments</comments>
		<pubDate>Tue, 15 Oct 2013 05:01:03 +0000</pubDate>
		<dc:creator><![CDATA[Bryan O'Sullivan]]></dc:creator>
				<category><![CDATA[haskell]]></category>
		<category><![CDATA[open source]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=1018</guid>
		<description><![CDATA[For a while now, I’ve had it in mind to improve the encoding performance of my Haskell JSON package, aeson. Over the weekend, I went from hazy notion to a proof of concept for what I think could be a<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2013/10/14/open-question-help-me-design-a-new-encoding-api-for-aeson/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<body>
<p>For a while now, I’ve had it in mind to improve the encoding performance of my Haskell JSON package, <a href="https://github.com/bos/aeson">aeson</a>.</p>
<p>Over the weekend, I went from hazy notion to a proof of concept for what I think could be a reasonable approach.</p>
<p>This post is a case of me “thinking out loud” about the initial design I came up with. I’m very interested in hearing if you have a cleaner idea.</p>
<p>The problem with the encoding method currently used by aeson is that it occurs via a translation to the <a href="http://hackage.haskell.org/package/aeson-0.6.2.1/docs/Data-Aeson-Types.html#g:1"><code>Value</code></a> type. While this is simple and uniform, it involves a large amount of intermediate work that is essentially wasted. When encoding a complex value, the <code>Value</code> that we build up is expensive, and it will become garbage immediately.</p>
<p>It <em>should</em> be much more efficient to simply serialize straight to a <code>Builder</code>, the type that is optimized for concatenating many short string fragments. But before marching down that road, I want to make sure that I provide a clean API that is easy to use correctly.</p>
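<p>As a sketch of what “straight to a <code>Builder</code>” means (the <code>encodeInts</code> function below is an invented example, not aeson code), encoding a list of integers as a JSON array need only concatenate small fragments:</p>

```haskell
import Data.List (intersperse)
import Data.Text.Lazy.Builder (Builder, singleton, toLazyText)
import qualified Data.Text.Lazy.Builder.Int as Builder

-- Serialize directly to a Builder: no intermediate Value-like
-- structure is ever constructed, so nothing becomes instant garbage.
encodeInts :: [Int] -> Builder
encodeInts xs = singleton '[' <> body <> singleton ']'
  where body = mconcat (intersperse (singleton ',') (map Builder.decimal xs))
```

<p>The design question is how to offer this directness without forcing every user to hand-roll the punctuation, which is what the <code>Build</code> type below tries to package up.</p>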
<p>I’ve posted <a href="https://gist.github.com/bos/6986451">a gist</a> that contains a complete copy of this proof-of-concept code.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">{-# LANGUAGE GeneralizedNewtypeDeriving, FlexibleInstances,</span>
<span class="co">    OverloadedStrings #-}</span>

<span class="kw">import</span> <span class="dt">Data.Monoid</span> (<span class="dt">Monoid</span>(<span class="fu">..</span>), (<span class="fu">&lt;&gt;</span>))
<span class="kw">import</span> <span class="dt">Data.Text</span> (<span class="dt">Text</span>)
<span class="kw">import</span> <span class="dt">Data.Text.Lazy.Builder</span> (<span class="dt">Builder</span>, singleton)
<span class="kw">import</span> <span class="kw">qualified</span> <span class="dt">Data.Text.Lazy.Builder</span> <span class="kw">as</span> <span class="dt">Builder</span>
<span class="kw">import</span> <span class="kw">qualified</span> <span class="dt">Data.Text.Lazy.Builder.Int</span> <span class="kw">as</span> <span class="dt">Builder</span></code></pre>
<p>The core <code>Build</code> type has a phantom type that allows us to say “I am encoding a value of type <code>t</code>”. We’ll see where this type tracking is helpful (and annoying) below.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="kw">data</span> <span class="dt">Build</span> a <span class="fu">=</span> <span class="dt">Build</span> {
    _<span class="ot">count ::</span> <span class="fu">!</span><span class="dt">Int</span>
  ,<span class="ot"> run    ::</span> <span class="dt">Builder</span>
  }</code></pre>
<p>The internals of the <code>Build</code> type would be hidden from users; here’s what they mean. The <code>_count</code> field tracks the number of elements we’re encoding of an aggregate JSON value (an array or object); we’ll see why this matters shortly. The <code>run</code> field lets us access the underlying <code>Builder</code>.</p>
<p>We provide three empty types to use as parameters for the <code>Build</code> type.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="kw">data</span> <span class="dt">Object</span>
<span class="kw">data</span> <span class="dt">Array</span>
<span class="kw">data</span> <span class="dt">Mixed</span></code></pre>
<p>We’ll want to use the <code>Mixed</code> type if we’re cramming a set of disparate Haskell values into a JSON array; read on for more.</p>
<p>When it comes to gluing values together, the <code>Monoid</code> class is exactly what we need.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="kw">instance</span> <span class="dt">Monoid</span> (<span class="dt">Build</span> a) <span class="kw">where</span>
    mempty <span class="fu">=</span> <span class="dt">Build</span> <span class="dv">0</span> mempty
    mappend (<span class="dt">Build</span> i a) (<span class="dt">Build</span> j b)
      <span class="fu">|</span> i <span class="fu">&gt;</span> <span class="dv">0</span> <span class="fu">&amp;&amp;</span> j <span class="fu">&gt;</span> <span class="dv">0</span> <span class="fu">=</span> <span class="dt">Build</span> ij (a <span class="fu">&lt;&gt;</span> singleton <span class="ch">&#39;,&#39;</span> <span class="fu">&lt;&gt;</span> b)
      <span class="fu">|</span> <span class="fu">otherwise</span>      <span class="fu">=</span> <span class="dt">Build</span> ij (a <span class="fu">&lt;&gt;</span> b)
      <span class="kw">where</span> ij <span class="fu">=</span> i <span class="fu">+</span> j</code></pre>
<p>Here’s where the <code>_count</code> field comes in; we want to separate elements of an array or object using commas, but this is necessary only when the array or object contains more than one value.</p>
<p>To encode a simple value, we provide a few obvious helpers. (These are clearly so simple as to be wrong, but remember: my purpose here is to explore the API design, not to provide a proper implementation.)</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">build ::</span> <span class="dt">Builder</span> <span class="ot">-&gt;</span> <span class="dt">Build</span> a
build <span class="fu">=</span> <span class="dt">Build</span> <span class="dv">1</span>

<span class="ot">int ::</span> <span class="kw">Integral</span> a <span class="ot">=&gt;</span> a <span class="ot">-&gt;</span> <span class="dt">Build</span> a
int <span class="fu">=</span> build <span class="fu">.</span> Builder.decimal

<span class="ot">text ::</span> <span class="dt">Text</span> <span class="ot">-&gt;</span> <span class="dt">Build</span> <span class="dt">Text</span>
text <span class="fu">=</span> build <span class="fu">.</span> Builder.fromText</code></pre>
<p>Encoding a JSON array is easy.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">array ::</span> <span class="dt">Build</span> a <span class="ot">-&gt;</span> <span class="dt">Build</span> <span class="dt">Array</span>
array (<span class="dt">Build</span> <span class="dv">0</span> _)  <span class="fu">=</span> build <span class="st">&quot;[]&quot;</span>
array (<span class="dt">Build</span> _ vs) <span class="fu">=</span> build <span class="fu">$</span> singleton <span class="ch">&#39;[&#39;</span> <span class="fu">&lt;&gt;</span> vs <span class="fu">&lt;&gt;</span> singleton <span class="ch">&#39;]&#39;</span></code></pre>
<p>If we try this out in <code>ghci</code>, it behaves as we might hope.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">?&gt; array <span class="fu">$</span> int <span class="dv">1</span> <span class="fu">&lt;&gt;</span> int <span class="dv">2</span>
<span class="st">&quot;[1,2]&quot;</span></code></pre>
<p>JSON puts no constraints on the types of the elements of an array. Unfortunately, our phantom type causes us difficulty here.</p>
<p>An expression of this form will not typecheck, as it’s trying to join a <code>Build Int</code> with a <code>Build Text</code>.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">?&gt; array <span class="fu">$</span> int <span class="dv">1</span> <span class="fu">&lt;&gt;</span> text <span class="st">&quot;foo&quot;</span></code></pre>
<p>This is where the <code>Mixed</code> type from earlier comes in. We use it to forget the original phantom type so that we can construct an array with elements of different types.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">mixed ::</span> <span class="dt">Build</span> a <span class="ot">-&gt;</span> <span class="dt">Build</span> <span class="dt">Mixed</span>
mixed (<span class="dt">Build</span> a b) <span class="fu">=</span> <span class="dt">Build</span> a b</code></pre>
<p>Our new <code>mixed</code> function gets the types to be the same, giving us something that typechecks.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">?&gt; array <span class="fu">$</span> mixed (int <span class="dv">1</span>) <span class="fu">&lt;&gt;</span> mixed (text <span class="st">&quot;foo&quot;</span>)
<span class="st">&quot;[1,foo]&quot;</span></code></pre>
<p>This seems like a fair compromise to me. A Haskell programmer will normally want the types of values in an array to be the same, so the default behaviour of requiring this makes sense (at least to my current thinking), but we get a back door for when we absolutely have to go nuts with mixing types.</p>
<p>The last complication stems from the need to build JSON objects. Each key in an object must be a string, but the value can be of any type.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- Encode a key-value pair.</span>
<span class="ot">(&lt;:&gt;) ::</span> <span class="dt">Build</span> <span class="dt">Text</span> <span class="ot">-&gt;</span> <span class="dt">Build</span> a <span class="ot">-&gt;</span> <span class="dt">Build</span> <span class="dt">Object</span>
k <span class="fu">&lt;:&gt;</span> v <span class="fu">=</span> <span class="dt">Build</span> <span class="dv">1</span> (run k <span class="fu">&lt;&gt;</span> <span class="st">&quot;:&quot;</span> <span class="fu">&lt;&gt;</span> run v)

<span class="ot">object ::</span> <span class="dt">Build</span> <span class="dt">Object</span> <span class="ot">-&gt;</span> <span class="dt">Build</span> <span class="dt">Object</span>
object (<span class="dt">Build</span> <span class="dv">0</span> _)   <span class="fu">=</span> build <span class="st">&quot;{}&quot;</span>
object (<span class="dt">Build</span> _ kvs) <span class="fu">=</span> build <span class="fu">$</span> singleton <span class="ch">&#39;{&#39;</span> <span class="fu">&lt;&gt;</span> kvs <span class="fu">&lt;&gt;</span> singleton <span class="ch">&#39;}&#39;</span></code></pre>
<p>If you’ve had your morning coffee, you’ll notice that I am not living up to my high-minded principles from earlier. Perhaps the types involved here should be something closer to this:</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="kw">data</span> <span class="dt">Object</span> a

<span class="ot">(&lt;:&gt;) ::</span> <span class="dt">Build</span> <span class="dt">Text</span> <span class="ot">-&gt;</span> <span class="dt">Build</span> a <span class="ot">-&gt;</span> <span class="dt">Build</span> (<span class="dt">Object</span> a)

<span class="ot">object ::</span> <span class="dt">Build</span> (<span class="dt">Object</span> a) <span class="ot">-&gt;</span> <span class="dt">Build</span> (<span class="dt">Object</span> a)</code></pre>
<p>(In which case we’d need a <code>mixed</code>-like function to forget the phantom types for when we want to get mucky and unsafe—but I digress.)</p>
<p>How does this work out in practice?</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">?&gt; object <span class="fu">$</span> <span class="st">&quot;foo&quot;</span> <span class="fu">&lt;:&gt;</span> int <span class="dv">1</span> <span class="fu">&lt;&gt;</span> <span class="st">&quot;bar&quot;</span> <span class="fu">&lt;:&gt;</span> int <span class="dv">3</span>
<span class="st">&quot;{foo:1,bar:3}&quot;</span></code></pre>
<p>Hey look, that’s more or less as we might have hoped!</p>
<p>Open questions, for which I appeal to you for help:</p>
<ul>
<li><p>Does this design appeal to you at all?</p></li>
<li><p>If not, what would you change?</p></li>
<li><p>If yes, to what extent am I wallowing in the “types for thee, but not for me” sin bin by omitting a phantom parameter for <code>Object</code>?</p></li>
</ul>
<p>Helpful answers welcome!</p>
</body>
]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2013/10/14/open-question-help-me-design-a-new-encoding-api-for-aeson/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

