<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>teideal glic deisbhéalach</title>
	<atom:link href="http://www.serpentine.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.serpentine.com/blog</link>
	<description>Bryan O&#039;Sullivan&#039;s blog</description>
	<lastBuildDate>Wed, 01 May 2013 06:48:51 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Big fucking deal</title>
		<link>http://www.serpentine.com/blog/2013/04/30/big-fucking-deal/</link>
		<comments>http://www.serpentine.com/blog/2013/04/30/big-fucking-deal/#comments</comments>
		<pubDate>Wed, 01 May 2013 06:27:29 +0000</pubDate>
		<dc:creator>Bryan O'Sullivan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=1008</guid>
		<description><![CDATA[Quoth Wikipedia: Big data[1][2] is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage,[3] search, sharing, transfer, analysis,[4]<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2013/04/30/big-fucking-deal/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>Quoth Wikipedia:</p>

<blockquote>Big data[1][2] is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage,[3] search, sharing, transfer, analysis,[4] and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to &#8220;spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.&#8221;</blockquote>

<p>Now what if we get tired of the current hype cycle?</p>



<blockquote>Big fucking deal[1][2] is a collection of deals so fucking large and complex that it becomes difficult to process using on-hand fuck giving tools or traditional shit giving techniques. The challenges include capture, curation, storage,[3] search, sharing, transfer, analysis,[4] and all kinds of who the fuck knows what else. The trend to larger fucking deals is due to the additional shit derivable from giving a fuck about a single large fucking pile of related shit, as compared to separate smaller piles with the same total amount of bullshit, allowing correlations to be found to &#8220;spot business shit, determine quality of whatever, prevent some nasty shit, link legal shit right the fuck together, combat fucking crime no I am not making this up it&#8217;s like fucking Batman, and determine real-time traffic shittiness.&#8221;</blockquote>

]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2013/04/30/big-fucking-deal/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>What&#8217;s in a number? criterion edition</title>
		<link>http://www.serpentine.com/blog/2013/04/13/whats-in-a-number-criterion-edition/</link>
		<comments>http://www.serpentine.com/blog/2013/04/13/whats-in-a-number-criterion-edition/#comments</comments>
		<pubDate>Sun, 14 Apr 2013 01:01:43 +0000</pubDate>
		<dc:creator>Bryan O'Sullivan</dc:creator>
				<category><![CDATA[haskell]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=996</guid>
		<description><![CDATA[[Edit: a few hours after I wrote this post, I wrote some code to get rid of the inflation phenomenon it describes, and I'll publish a corresponding update to criterion shortly. See below for details, and the bottom for a<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2013/04/13/whats-in-a-number-criterion-edition/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p><em>[<strong>Edit:</strong> a few hours after I wrote this post, I wrote some code to get rid of the inflation phenomenon it describes, and I'll publish a corresponding update to criterion shortly. See below for details, and the bottom for a <a href="#bottom">new chart</a> that shows the effect of the fix.]</em></p>
<p>A couple of days ago, Alexey Khudyakov did a little <a href="http://sepulcarium.org/blog/posts/2013-04-07-criterion.html">digging into the accuracy of criterion’s measurements</a>. I thought his results were interesting enough to be worth some deeper analysis.</p>
<p>First, let’s briefly discuss Alexey’s method and findings. He created 1,000 identical copies of a benchmark, and looked to see if the measurements changed over time. They did, slowly increasing in a linear fashion. (This is a phenomenon known to statisticians as <em>serial correlation</em>, where a measurement influences a future measurement.)</p>
<p>If every benchmark is the same, why do the measurements increase? Criterion does its book-keeping in memory: for every run, it saves a piece of data, and it does not write that data out to a JavaScript+HTML report or CSV file until all benchmarks have finished.</p>
<p>I thought that the slow increase in measurements was probably related to this book-keeping, but how to test this hypothesis?</p>
<p>I created 200 identical copies of the same benchmark (I’m not patient enough to wait for 1,000!) and dumped a heap profile while it ran, then I plotted the numbers measured by criterion against heap size.</p>
<a href="http://www.serpentine.com/wordpress/wp-content/uploads/2013/04/heap-vs-time.png"><img src="http://www.serpentine.com/wordpress/wp-content/uploads/2013/04/heap-vs-time.png" alt="heap vs time" width="446" height="589" class="aligncenter size-full wp-image-997" /></a>
<p>For this particular benchmark, criterion spends about 4% of its time in the garbage collector. The size of the heap increases by 300% as the program runs. If we expect garbage collection overhead to affect measurements, then the time we measure should increase by 12% as we repeat the benchmark over and over, slowly accumulating data.</p>
<p>This prediction matches what we actually see almost <em>exactly</em>: we start off measuring <code>exp</code> at 25.5 nanoseconds, and by the end we see it taking 28.5 nanoseconds, an increase of just under 12%.</p>
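<p>For anyone who wants to check the arithmetic, here is a tiny sketch. It reads the prediction as a simple product of GC share and heap growth, which is a modelling assumption; the names <code>predictedInflation</code> and <code>observedInflation</code> are made up for this example.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">-- Predicted inflation: share of time spent in the GC (4%) multiplied by
-- the growth of the heap over the run (300%, i.e. a factor of 3).
predictedInflation :: Double
predictedInflation = 0.04 * 3.0

-- Observed inflation: relative change from the first measurement
-- (25.5 ns) to the last one (28.5 ns).
observedInflation :: Double
observedInflation = (28.5 - 25.5) / 25.5

main :: IO ()
main = do
  putStrLn ("predicted: " ++ show (100 * predictedInflation) ++ "%")
  putStrLn ("observed:  " ++ show (100 * observedInflation) ++ "%")</code></pre>
<p>The two figures agree to within a fraction of a percentage point.</p>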
<p>The obvious next question is: how realistic a concern is this? A normal criterion program consists of a handful of benchmarks, usually all very different in what they do. I have not seen any cases of more than a few dozen benchmarks in a single suite. If only a few benchmarks get run, then there is essentially no opportunity for this inflation effect to become noticeable.</p>
<p>Nevertheless, I could definitely make some improvements (or even better, someone could contribute patches and I could continue to do nothing).</p>
<ul>
<li><p>It would probably help to write data to a file after running each benchmark, and then to load that data back again before writing out the final report. [<em><strong>Edit:</strong> I wrote a patch that does just this; the increase in memory use vanishes, and along with it, the gradual inflation in measured times. <strong>Bingo!</strong></em>]</p></li>
<li><p>There is no benefit to looking for serial correlation across different benchmark runs, because nobody (except Alexey!) makes identical duplicates of a benchmark.</p></li>
<li><p>For the series of measurements collected for a single benchmark, it would probably be helpful to add an autocorrelation test, if only to have another opportunity to raise a red flag. Criterion is already careful to cry foul if its numbers look too messy, but first-order serial correlation would be likely to slip past the sophisticated tests it uses (like the bootstrap). I’ve long wanted to add a Durbin-Watson test, but I’ve been lazy for even longer.</p></li>
</ul>
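<p>To make the Durbin-Watson idea from the list above concrete, here is a minimal sketch of the statistic. It is not code from criterion. It assumes the residuals are simply the deviations of each measurement from the sample mean, and <code>durbinWatson</code> is a name invented for this example. Values near 2 suggest no first-order serial correlation, values near 0 suggest positive correlation, and values near 4 suggest negative correlation.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">-- Durbin-Watson statistic of a series of measurements.
-- Assumption for this sketch: residuals are deviations from the mean.
-- The result is NaN for an empty or constant series (zero denominator).
durbinWatson :: [Double] -&gt; Double
durbinWatson xs = numerator / denominator
  where
    mean        = sum xs / fromIntegral (length xs)
    residuals   = map (subtract mean) xs
    -- Sum of squared differences between successive residuals.
    numerator   = sum [ (b - a) ^ (2 :: Int)
                      | (a, b) &lt;- zip residuals (drop 1 residuals) ]
    -- Sum of squared residuals.
    denominator = sum [ e ^ (2 :: Int) | e &lt;- residuals ]

main :: IO ()
main = do
  -- A steadily drifting series: the statistic comes out well below 2.
  print (durbinWatson (map fromIntegral [1 .. 10 :: Int]))
  -- A series that flips sign at every step: well above 2.
  print (durbinWatson (take 10 (cycle [1, -1])))</code></pre>
<p>A slowly inflating series of timings, like the one Alexey found, behaves like the first case.</p>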
<p>If you were to run every benchmark in a large suite one after the other in a single pass, then your final numbers could indeed be inflated by a few percent <em>[<strong>edit:</strong> at least until I release the patch]</em>. However, there are many other ways to confound your measurements, most of which will be far larger than this book-keeping effect.</p>
<ul>
<li><p>If you simply change the order in which you run your benchmarks, this can dramatically affect the numbers you’ll see.</p></li>
<li><p>The size of the heap that the GHC runtime uses makes a big difference, as do the threaded runtime, number of OS threads, and use of the parallel garbage collector. Any of these can change performance by a factor of two or more (!).</p></li>
<li><p>You should close busy tabs in a web browser (or preferably quit it entirely), kill your mail client and antivirus software, and try to eliminate other sources of system noise. You’ll be surprised by how big a difference these can make; anywhere from a handful to a few hundred percent.</p></li>
</ul>
<p>If you want high-quality numbers, it is best to run just one benchmark from a suite at a time; on the quietest system you can manage; to watch for criterion's warnings about outliers affecting results; and to always compare several runs to see if your measurements are stable.</p>
<p><em>[<a name="bottom"><strong>Edit:</strong></a> Here is a chart of the measurements with the bug fixed, complete with a linear fit to indicate that the numbers are basically flat. Hooray!]</em></p>
<a href="http://www.serpentine.com/wordpress/wp-content/uploads/2013/04/new-time.png"><img src="http://www.serpentine.com/wordpress/wp-content/uploads/2013/04/new-time.png" alt="new-time" width="469" height="352" class="aligncenter size-full wp-image-1004" /></a>]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2013/04/13/whats-in-a-number-criterion-edition/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>What&#8217;s good for C++ is good for &#8230; Haskell!?</title>
		<link>http://www.serpentine.com/blog/2013/03/20/whats-good-for-c-is-good-for-haskell/</link>
		<comments>http://www.serpentine.com/blog/2013/03/20/whats-good-for-c-is-good-for-haskell/#comments</comments>
		<pubDate>Wed, 20 Mar 2013 19:29:32 +0000</pubDate>
		<dc:creator>Bryan O'Sullivan</dc:creator>
				<category><![CDATA[haskell]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=970</guid>
		<description><![CDATA[A few days ago, my Facebook colleague Andrei Alexandrescu posted a note entitled Three Optimization Tips for C++, which reminded me that I had unfinished business with Haskell’s text package. I took his code, applied it to the text package,<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2013/03/20/whats-good-for-c-is-good-for-haskell/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>A few days ago, my Facebook colleague <a href="http://erdani.com/">Andrei Alexandrescu</a> posted a note entitled <a href="https://www.facebook.com/notes/facebook-engineering/three-optimization-tips-for-c/10151361643253920">Three Optimization Tips for C++</a>, which reminded me that I had unfinished business with Haskell’s <a href="http://hackage.haskell.org/package/text"><code>text</code> package</a>.</p>
<p>I took his code, applied it to the <code>text</code> package, and this is the story of what happened.</p>
<script type="text/javascript" src="//ajax.googleapis.com/ajax/static/modules/gviz/1.0/chart.js"> {"dataSourceUrl":"//docs.google.com/a/serpentine.com/spreadsheet/tq?key=0AlCjMsgkVJXcdHIyeHV0QnpjM3pjVXdGNnhfSE1pVUE&#038;transpose=0&#038;headers=1&#038;range=A1%3AD100&#038;gid=0&#038;pub=1","options":{"titleTextStyle":{"bold":true,"color":"#000","fontSize":16},"series":{"0":{"color":"#ff9900"},"1":{"color":"#3366cc"},"2":{"color":"#dc3912"}},"curveType":"","animation":{"duration":500},"theme":"maximized","lineWidth":2,"hAxis":{"title":"x=digits","useFormatFromData":true,"minValue":null,"viewWindow":{"min":null,"max":null},"gridlines":{"count":"6"},"maxValue":null},"vAxes":[{"title":"y=nanoseconds","useFormatFromData":true,"minValue":null,"viewWindow":{"min":null,"max":null},"logScale":false,"maxValue":null},{"useFormatFromData":true,"minValue":null,"viewWindow":{"min":null,"max":null},"logScale":false,"maxValue":null}],"title":"Integer rendering times","booleanRole":"certainty","legend":"in","focusTarget":"category","useFirstColumnAsDomain":true,"tooltip":{},"width":450,"height":320},"state":{},"view":{},"isDefaultVisualization":true,"chartType":"LineChart","chartName":"Chart 1"} </script>

<p>The <code>text</code> package provides a <a href="http://hackage.haskell.org/packages/archive/text/latest/doc/html/Data-Text-Lazy-Builder.html">type named <code>Builder</code></a> for efficiently constructing Unicode strings from smaller fragments. A couple of years after <a href="http://blog.johantibell.com/">Johan</a> contributed the initial <code>Builder</code> implementation, I wrote some helper functions to perform recurring tasks, such as rendering numbers.</p>
<p>In my initial number renderer, I put no effort into making my code fast—as this snippet demonstrates. Simplicity first.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">positive ::</span> (<span class="kw">Integral</span> a) <span class="ot">=&gt;</span> a <span class="ot">-&gt;</span> <span class="dt">Builder</span>
positive <span class="fu">=</span> go
  <span class="kw">where</span>
    go n <span class="fu">|</span> n <span class="fu">&lt;</span> <span class="dv">10</span>    <span class="fu">=</span> digit n
         <span class="fu">|</span> <span class="fu">otherwise</span> <span class="fu">=</span> go (n <span class="ot">`quot`</span> <span class="dv">10</span>) <span class="fu">&lt;&gt;</span> digit (n <span class="ot">`rem`</span> <span class="dv">10</span>)
    digit n <span class="fu">=</span> singleton (intToDigit (<span class="fu">fromIntegral</span> n))</code></pre>
<p>Having gotten the code working, I promptly forgot about it for a while. When I saw Andrei’s note nine months later, it tickled my memory: didn’t I have code that I could possibly improve? Indeed, when I took a look at my number rendering code, I found that I hadn’t even bothered to write benchmarks to measure its performance.</p>
<p>Before I discuss how I improved it, a brief description of how a <code>Builder</code> works is in order. A <code>Builder</code> provides a safe way to destructively write data into a list of fixed-size buffers, and to convert the result into an immutable <code>Text</code> value. External users of <code>Builder</code> never see the mutable buffers; they can only use safe access methods.</p>
<p>The code above uses the safe API. Each use of <code>singleton</code> is bounds-checked internally: if a write will not fit into the current buffer, <code>Builder</code> “finalizes” that buffer (putting it at the tail of the list), allocates a new one, and starts writing there. The <code>&lt;&gt;</code> operator sequences the writes.</p>
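<p>For readers who have not used the safe API, here is a small sketch built only from functions exported by <code>Data.Text.Lazy.Builder</code>. The name <code>greeting</code> is made up for the example.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">import Data.Monoid ((&lt;&gt;))
import Data.Text.Lazy.Builder (Builder, fromString, singleton, toLazyText)
import qualified Data.Text.Lazy.IO as TLIO

-- Each fragment is a bounds-checked write into the current buffer;
-- the &lt;&gt; operator sequences the writes from left to right.
greeting :: Builder
greeting = fromString "hello" &lt;&gt; singleton ',' &lt;&gt; singleton ' ' &lt;&gt; fromString "world"

main :: IO ()
main = TLIO.putStrLn (toLazyText greeting)  -- prints "hello, world"</code></pre>
<p>Nothing here can observe the mutable buffers: the only way to get data out is <code>toLazyText</code>, which hands back an immutable value.</p>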
<p>While this implementation is correct, it is far from fast, partly due to the overhead of performing a bounds check for every character to be written out. In fact, this approach is almost always slower than simply using the <code>show</code> function, then converting the resulting <code>[Char]</code> value to a <code>Builder</code>!</p>
<p>My first observation was that if I knew the number of digits I needed up front, I could perform just one buffer-size check, instead of a separate check prior to rendering each digit.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">countDigits ::</span> (<span class="kw">Integral</span> a) <span class="ot">=&gt;</span> a <span class="ot">-&gt;</span> <span class="dt">Int</span>
countDigits v0 <span class="fu">=</span> go <span class="dv">1</span> (<span class="fu">fromIntegral</span><span class="ot"> v0 ::</span> <span class="dt">Word64</span>)
  <span class="kw">where</span> go <span class="fu">!</span>k v
           <span class="fu">|</span> v <span class="fu">&lt;</span> <span class="dv">10</span>    <span class="fu">=</span> k
           <span class="fu">|</span> v <span class="fu">&lt;</span> <span class="dv">100</span>   <span class="fu">=</span> k <span class="fu">+</span> <span class="dv">1</span>
           <span class="fu">|</span> v <span class="fu">&lt;</span> <span class="dv">1000</span>  <span class="fu">=</span> k <span class="fu">+</span> <span class="dv">2</span>
           <span class="fu">|</span> v <span class="fu">&lt;</span> <span class="dv">10000</span> <span class="fu">=</span> k <span class="fu">+</span> <span class="dv">3</span>
           <span class="fu">|</span> <span class="fu">otherwise</span> <span class="fu">=</span> go (k<span class="dv">+4</span>) (v <span class="ot">`quot`</span> <span class="dv">10000</span>)</code></pre>
<p>This is almost identical to Andrei’s second <code>digits10</code> function.</p>
<p>Once I was able to count digits, I could use the internal <code>writeN</code> function to destructively update the buffer. <code>writeN</code> ensures that a buffer is available with at least the requested amount of space; it then calls a user-supplied function, giving it the buffer to write to (<code>marr</code>) and the position to start writing at (<code>off</code>).</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">positive ::</span> (<span class="kw">Integral</span> a) <span class="ot">=&gt;</span> a <span class="ot">-&gt;</span> <span class="dt">Builder</span>
positive i
    <span class="co">-- we win when we special-case single-digit numbers</span>
    <span class="fu">|</span> i <span class="fu">&lt;</span> <span class="dv">10</span>    <span class="fu">=</span> writeN <span class="dv">1</span> <span class="fu">$</span> \marr off <span class="ot">-&gt;</span>
                  unsafeWrite marr off (i2w i)
    <span class="fu">|</span> <span class="fu">otherwise</span> <span class="fu">=</span> <span class="kw">let</span> <span class="fu">!</span>n <span class="fu">=</span> countDigits i
                  <span class="kw">in</span> writeN n <span class="fu">$</span> \marr off <span class="ot">-&gt;</span>
                     posDecimal marr off n i

<span class="ot">posDecimal ::</span> (<span class="kw">Integral</span> a) <span class="ot">=&gt;</span>
              forall s<span class="fu">.</span> <span class="dt">MArray</span> s <span class="ot">-&gt;</span> <span class="dt">Int</span> <span class="ot">-&gt;</span> <span class="dt">Int</span> <span class="ot">-&gt;</span> a <span class="ot">-&gt;</span> <span class="dt">ST</span> s ()
posDecimal marr off0 ds v0 <span class="fu">=</span> go (off0 <span class="fu">+</span> ds <span class="fu">-</span> <span class="dv">1</span>) v0
  <span class="kw">where</span> go off v
          <span class="fu">|</span> v <span class="fu">&lt;</span> <span class="dv">10</span> <span class="fu">=</span> unsafeWrite marr off (i2w v)
          <span class="fu">|</span> <span class="fu">otherwise</span> <span class="fu">=</span> <span class="kw">do</span>
              unsafeWrite marr off (i2w (v <span class="ot">`rem`</span> <span class="dv">10</span>))
              go (off<span class="dv">-1</span>) (v <span class="ot">`quot`</span> <span class="dv">10</span>)

<span class="ot">i2w ::</span> (<span class="kw">Integral</span> a) <span class="ot">=&gt;</span> a <span class="ot">-&gt;</span> <span class="dt">Word16</span>
i2w v <span class="fu">=</span> <span class="dv">48</span> <span class="fu">+</span> <span class="fu">fromIntegral</span> v</code></pre>
<p>I then took Andrei’s very clever third <code>digits10</code> function and translated that to Haskell. Syntax apart, there is a small difference between his function and mine: his is recursive, while mine is tail recursive (i.e. a loop).</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">countDigits ::</span> (<span class="kw">Integral</span> a) <span class="ot">=&gt;</span> a <span class="ot">-&gt;</span> <span class="dt">Int</span>
countDigits v0 <span class="fu">=</span> go <span class="dv">1</span> (<span class="fu">fromIntegral</span><span class="ot"> v0 ::</span> <span class="dt">Word64</span>)
  <span class="kw">where</span>
    go <span class="fu">!</span>k v
      <span class="fu">|</span> v <span class="fu">&lt;</span> <span class="dv">10</span>    <span class="fu">=</span> k
      <span class="fu">|</span> v <span class="fu">&lt;</span> <span class="dv">100</span>   <span class="fu">=</span> k <span class="fu">+</span> <span class="dv">1</span>
      <span class="fu">|</span> v <span class="fu">&lt;</span> <span class="dv">1000</span>  <span class="fu">=</span> k <span class="fu">+</span> <span class="dv">2</span>
      <span class="fu">|</span> v <span class="fu">&lt;</span> <span class="dv">1000000000000</span> <span class="fu">=</span>
          k <span class="fu">+</span> <span class="kw">if</span> v <span class="fu">&lt;</span> <span class="dv">100000000</span>
              <span class="kw">then</span> <span class="kw">if</span> v <span class="fu">&lt;</span> <span class="dv">1000000</span>
                   <span class="kw">then</span> <span class="kw">if</span> v <span class="fu">&lt;</span> <span class="dv">10000</span>
                        <span class="kw">then</span> <span class="dv">3</span>
                        <span class="kw">else</span> <span class="dv">4</span> <span class="fu">+</span> fin v <span class="dv">100000</span>
                   <span class="kw">else</span> <span class="dv">6</span> <span class="fu">+</span> fin v <span class="dv">10000000</span>
              <span class="kw">else</span> <span class="kw">if</span> v <span class="fu">&lt;</span> <span class="dv">10000000000</span>
                   <span class="kw">then</span> <span class="dv">8</span> <span class="fu">+</span> fin v <span class="dv">1000000000</span>
                   <span class="kw">else</span> <span class="dv">10</span> <span class="fu">+</span> fin v <span class="dv">100000000000</span>
      <span class="fu">|</span> <span class="fu">otherwise</span> <span class="fu">=</span> go (k <span class="fu">+</span> <span class="dv">12</span>) (v <span class="ot">`quot`</span> <span class="dv">1000000000000</span>)
    fin v n <span class="fu">=</span> <span class="kw">if</span> v <span class="fu">&gt;=</span> n <span class="kw">then</span> <span class="dv">1</span> <span class="kw">else</span> <span class="dv">0</span></code></pre>
<p>(To be sure my intuition was correct, I did indeed measure recursive against tail recursive versions of my Haskell translation, and tail recursion wins by a few percent here.)</p>
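<p>A hand-unrolled function like this is easy to get subtly wrong, so it is worth checking it against a slow but obvious reference. The sketch below repeats the function (specialised to <code>Word64</code> for simplicity) and compares it with <code>length . show</code> on either side of every power of ten. The names <code>boundaries</code> and <code>mismatches</code> are invented for this example.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">{-# LANGUAGE BangPatterns #-}

import Data.Word (Word64)

countDigits :: Word64 -&gt; Int
countDigits = go 1
  where
    go !k v
      | v &lt; 10    = k
      | v &lt; 100   = k + 1
      | v &lt; 1000  = k + 2
      | v &lt; 1000000000000 =
          k + if v &lt; 100000000
              then if v &lt; 1000000
                   then if v &lt; 10000
                        then 3
                        else 4 + fin v 100000
                   else 6 + fin v 10000000
              else if v &lt; 10000000000
                   then 8 + fin v 1000000000
                   else 10 + fin v 100000000000
      | otherwise = go (k + 12) (v `quot` 1000000000000)
    fin v n = if v &gt;= n then 1 else 0

-- Zero, plus the values just below, at, and just above each power of
-- ten that fits in a Word64 (10^19 is the largest).
boundaries :: [Word64]
boundaries = 0 : concat [ [p - 1, p, p + 1] | e &lt;- [1 .. 19 :: Int], let p = 10 ^ e ]

-- Inputs on which countDigits disagrees with the reference.
mismatches :: [Word64]
mismatches = [ v | v &lt;- boundaries, countDigits v /= length (show v) ]

main :: IO ()
main = print mismatches  -- prints []</code></pre>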
<p>While this <code>countDigits</code> function helped performance by quite a bit, there was another step remaining in following Andrei’s example: converting two digits at a time.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">posDecimal ::</span> (<span class="kw">Integral</span> a) <span class="ot">=&gt;</span>
              forall s<span class="fu">.</span> <span class="dt">MArray</span> s <span class="ot">-&gt;</span> <span class="dt">Int</span> <span class="ot">-&gt;</span> <span class="dt">Int</span> <span class="ot">-&gt;</span> a <span class="ot">-&gt;</span> <span class="dt">ST</span> s ()
posDecimal marr off0 ds v0 <span class="fu">=</span> go (off0 <span class="fu">+</span> ds <span class="fu">-</span> <span class="dv">1</span>) v0
  <span class="kw">where</span> go off v
           <span class="fu">|</span> v <span class="fu">&gt;=</span> <span class="dv">100</span> <span class="fu">=</span> <span class="kw">do</span>
               <span class="kw">let</span> (q, r) <span class="fu">=</span> v <span class="ot">`quotRem`</span> <span class="dv">100</span>
               write2 off r
               go (off <span class="fu">-</span> <span class="dv">2</span>) q
           <span class="fu">|</span> v <span class="fu">&lt;</span> <span class="dv">10</span>    <span class="fu">=</span> unsafeWrite marr off (i2w v)
           <span class="fu">|</span> <span class="fu">otherwise</span> <span class="fu">=</span> write2 off v
        write2 off i0 <span class="fu">=</span> <span class="kw">do</span>
          <span class="kw">let</span> i <span class="fu">=</span> <span class="fu">fromIntegral</span> i0; j <span class="fu">=</span> i <span class="fu">+</span> i
          unsafeWrite marr off <span class="fu">$</span> get (j <span class="fu">+</span> <span class="dv">1</span>)
          unsafeWrite marr (off <span class="fu">-</span> <span class="dv">1</span>) <span class="fu">$</span> get j
        get <span class="fu">=</span> <span class="fu">fromIntegral</span> <span class="fu">.</span> B.unsafeIndex digits</code></pre>
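<p>The snippet above relies on internals of the <code>text</code> package (<code>unsafeWrite</code>, and a <code>digits</code> table that is not shown). To make the technique easier to try out, here is a self-contained model of it. This is a sketch, not the package's code: it swaps in an unboxed <code>Char</code> array from the <code>array</code> package for the real buffer, uses bounds-checked writes, and the names <code>digitTable</code> and <code>render</code> are made up. The table holds the text <code>000102...9899</code>, so entries <code>2*i</code> and <code>2*i+1</code> are the two digits of <code>i</code>.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">import Control.Monad.ST (ST)
import Data.Array.ST (STUArray, newArray, runSTUArray, writeArray)
import Data.Array.Unboxed (UArray, elems, listArray, (!))
import Data.Word (Word64)

-- "000102...9899": both digits of every number from 0 to 99.
digitTable :: UArray Int Char
digitTable = listArray (0, 199) (concat [ [t, o] | t &lt;- ['0' .. '9'], o &lt;- ['0' .. '9'] ])

-- A deliberately simple digit counter; any correct one will do here.
countDigits :: Word64 -&gt; Int
countDigits = go 1
  where
    go k v
      | v &lt; 10    = k
      | otherwise = go (k + 1) (v `quot` 10)

render :: Word64 -&gt; String
render v0 = elems (runSTUArray build)
  where
    -- Size the buffer once, then fill it from the last position backwards.
    build :: ST s (STUArray s Int Char)
    build = do
      let n = countDigits v0
      marr &lt;- newArray (0, n - 1) '0'
      go marr (n - 1) v0
      return marr

    go :: STUArray s Int Char -&gt; Int -&gt; Word64 -&gt; ST s ()
    go marr off v
      | v &gt;= 100 = do
          let (q, r) = v `quotRem` 100
          write2 marr off r
          go marr (off - 2) q
      | v &lt; 10    = writeArray marr off (digitTable ! (2 * fromIntegral v + 1))
      | otherwise = write2 marr off v

    -- Write the two digits of a number below 100, ending at position off.
    write2 :: STUArray s Int Char -&gt; Int -&gt; Word64 -&gt; ST s ()
    write2 marr off i = do
      let j = 2 * fromIntegral i
      writeArray marr off (digitTable ! (j + 1))
      writeArray marr (off - 1) (digitTable ! j)

main :: IO ()
main = mapM_ (putStrLn . render) [0, 7, 42, 123, 1234, 18446744073709551615]</code></pre>
<p>Each trip around the loop performs one division and two table lookups, instead of two divisions.</p>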
<p>A final surprise came when I decided to try an experiment: what if I replaced the separate uses of <code>quot</code> and <code>rem</code> with a single use of <code>quotRem</code>? This improved performance by a further 30% on large 64-bit numbers! Why such a big difference? Because <code>quotRem</code> can often be emitted as a single machine instruction instead of two, and division is expensive enough that in a hot loop, this helps a lot.</p>
<p>Many modern optimizing compilers can spot this kind of opportunity automatically. Although GHC’s optimizer performs many complex high-level transformations, its machinery for handling low-level optimizations is currently weak. (This is why you’ll see <code>i + i</code> instead of <code>i * 2</code> above, where I’m strength-reducing operations by hand instead of trusting the compiler.)</p>
<p>I was not at all surprised that Andrei’s optimization tips should translate perfectly to Haskell, as most of what he says has nothing to do with C++ itself. His advice should apply well to <em>any</em> language that provides low-level access to the machine and ends up running as native code. Since Haskell is often not understood to be a perfectly good imperative language, it’s easy for less experienced programmers to overlook the relevance of low-level concerns to Haskell performance.</p>
<p>The end result of all this tweaking is that the new number renderer in the <code>text</code> package is not just much faster than its predecessor, it is also a lot faster than the venerable <code>show</code> function. We retain the same API, external immutability, and type safety as before, but have a very nice five-fold increase in performance to show for our efforts!</p>
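<p>From the outside, none of this machinery is visible. As a small sketch of what calling code looks like, the following compares the <code>decimal</code> function from <code>Data.Text.Lazy.Builder.Int</code> with the <code>show</code>-based approach. The names <code>viaDecimal</code> and <code>viaShow</code> are made up, and this checks only that the output agrees; it says nothing about speed.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">import qualified Data.Text.Lazy as TL
import Data.Text.Lazy.Builder (fromString, toLazyText)
import Data.Text.Lazy.Builder.Int (decimal)

-- Render with the number renderer from the text package.
viaDecimal :: Int -&gt; TL.Text
viaDecimal = toLazyText . decimal

-- Render by going through show and a String.
viaShow :: Int -&gt; TL.Text
viaShow = toLazyText . fromString . show

main :: IO ()
main = print (all (\n -&gt; viaDecimal n == viaShow n) [-1000 .. 1000])</code></pre>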
]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2013/03/20/whats-good-for-c-is-good-for-haskell/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>A major new release of the Haskell hashable library</title>
		<link>http://www.serpentine.com/blog/2012/12/13/a-major-new-release-of-the-haskell-hashable-library/</link>
		<comments>http://www.serpentine.com/blog/2012/12/13/a-major-new-release-of-the-haskell-hashable-library/#comments</comments>
		<pubDate>Fri, 14 Dec 2012 00:29:22 +0000</pubDate>
		<dc:creator>Bryan O'Sullivan</dc:creator>
				<category><![CDATA[haskell]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=962</guid>
		<description><![CDATA[I have spent quite some time over the last couple of months improving the Haskell hashable library, and all of my efforts eventually turned into a near-complete rewrite of the library. The 1.2 release of hashable is not backwards compatible,<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2012/12/13/a-major-new-release-of-the-haskell-hashable-library/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>I have spent quite some time over the last couple of months improving the Haskell <a href="http://hackage.haskell.org/package/hashable">hashable</a> library, and all of my efforts eventually turned into a near-complete rewrite of the library.</p>
<p>The 1.2 release of hashable is not backwards compatible, and for several good reasons. Read on for the details.</p>
<h1 id="hash-flooding">Hash flooding</h1>
<p>The threat landscape faced by applications that use hash-based data structures has shifted radically this year, with hash flooding emerging as an attack vector after a long period of obscurity (the earliest publication I know of on the subject dates back to 1998, but I doubt that it was new even then).</p>
<p>We know that both networked and local applications that use weak hash algorithms are vulnerable, with <a href="http://crypto.junod.info/2012/12/13/hash-dos-and-btrfs/">new attacks being published all the time</a>.</p>
<p>Recently, even modern hash algorithms such as the MurmurHash family and CityHash have succumbed to surprising <a href="https://131002.net/data/talks/appsec12_slides.pdf">attacks via differential cryptanalysis</a>, in which large numbers of collisions can be generated efficiently <em>without knowledge of the salt</em>.</p>
<h1 id="siphash">SipHash</h1>
<p>The <a href="https://131002.net/siphash/">SipHash algorithm</a> offers a strong defence against differential cryptanalysis. SipHash is now the algorithm used for hashing all string-like types.</p>
<p>We automatically choose optimized implementations for 64-bit and 32-bit platforms, depending on their capabilities (e.g. availability of SSE2 on 32-bit systems).</p>
<p>For inputs of just a few bytes in length, SipHash is about half the speed of the previous string hash algorithm, FNV-1. Its performance breaks even with FNV-1 at about 50 bytes, and it improves to become more than twice as fast for large inputs.</p>
<p>(FNV-1 is also particularly easy to attack, so escaping from it was imperative, even if that cost a little performance.)</p>
<h1 id="random-choice-of-salt">Random choice of salt</h1>
<p>To provide an additional measure of security against hash flooding attacks, the standard <code>hash</code> function chooses a random salt at program startup time, using the system’s cryptographic pseudo-random number generator (<code>/dev/urandom</code> on Unix, <code>CryptGenRandom</code> on Windows).</p>
<h1 id="effortless-fast-hashing">Effortless, fast hashing</h1>
<p>I have also demoted <code>hash</code> out of the <code>Hashable</code> typeclass, so the API now looks like this.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="kw">class</span> <span class="dt">Hashable</span> a <span class="kw">where</span>
<span class="ot">    hashWithSalt ::</span> <span class="dt">Int</span> <span class="co">-- salt</span>
                 <span class="ot">-&gt;</span> a   <span class="co">-- value to hash</span>
                 <span class="ot">-&gt;</span> <span class="dt">Int</span> <span class="co">-- result</span>

<span class="ot">hash ::</span> <span class="dt">Hashable</span> a <span class="ot">=&gt;</span> a <span class="ot">-&gt;</span> <span class="dt">Int</span>
hash <span class="fu">=</span> hashWithSalt defaultSalt</code></pre>
<p>In other words, if you are writing an instance of <code>Hashable</code> by hand, you had better pay attention to the salt, and if you use the friendly <code>hash</code> function, you get the randomly-chosen default salt every time.</p>
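<p>As a rough sketch of what paying attention to the salt can look like, here is a hand-written instance for a hypothetical <code>Point</code> type (the type and its fields are invented for illustration). It threads the incoming salt through each field instead of ignoring it.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">import Data.Hashable (Hashable(..), hash)

-- Hypothetical two-field type, used only for this example.
data Point = Point Int Int

instance Hashable Point where
    -- Feed the salt into the first field, then use that result
    -- as the salt for the second field.
    hashWithSalt salt (Point x y) =
        salt `hashWithSalt` x `hashWithSalt` y

main :: IO ()
main = do
    -- hash uses the default salt; hashWithSalt lets us pick our own.
    print (hash (Point 1 2))
    print (hashWithSalt 42 (Point 1 2))</code></pre>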
<p>Now that writing <code>Hashable</code> instances by hand is slightly more painful, I have also made it almost effortless to hash your custom datatypes.</p>
<p>Using GHC’s no-longer-so-new generics machinery, you can have an efficient <code>Hashable</code> instance generated “for free” with just a few lines of code. Good code is generated for both product and sum types.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">{-# LANGUAGE DeriveGeneric #-}</span>

<span class="kw">import</span> <span class="dt">GHC.Generics</span>

<span class="kw">data</span> <span class="dt">MyType</span> <span class="fu">=</span> <span class="dt">MyStr</span> <span class="dt">String</span>
            <span class="fu">|</span> <span class="dt">MyInt</span> <span class="dt">Integer</span>
    <span class="kw">deriving</span> (<span class="dt">Generic</span>)

<span class="kw">instance</span> <span class="dt">Hashable</span> <span class="dt">MyType</span></code></pre>
<p>If your types are polymorphic, no problem; they’re just as easy to generate hash functions for.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="kw">data</span> <span class="dt">Poly</span> a <span class="fu">=</span> <span class="dt">Poly</span> a
    <span class="kw">deriving</span> (<span class="dt">Generic</span>)

<span class="kw">instance</span> <span class="dt">Hashable</span> a <span class="ot">=&gt;</span> <span class="dt">Hashable</span> (<span class="dt">Poly</span> a)</code></pre>
<h1 id="improved-avalanche-for-basic-types">Improved avalanche for basic types</h1>
<p>We do not currently use SipHash to hash basic types such as numbers, as strings are by a huge margin the most common vector for hash flooding attacks.</p>
<p>Nevertheless, we’ve made improvements to our hashing of basic types. Previous versions of hashable did nothing at all to numbers: the hash function for a number was simply the identity. While (obviously) fast and good for locality of reference in a few narrow cases, the identity function makes a terrible hash for many data structures.</p>
<p>For instance, if a hash table uses open addressing with linear probing, the identity hash can cause quadratic performance when inserting values into the table that are identical modulo the table size.</p>
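<p>To make the quadratic behaviour concrete, here is a small simulation. It is only a sketch: the table is modelled as a set of occupied slots, and the names <code>insertKey</code> and <code>totalProbes</code> are made up for this example. It counts how many slots are inspected while inserting keys under the identity hash.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">import qualified Data.IntSet as IntSet
import Data.List (foldl')

-- Insert one key into a table of the given size using the identity
-- hash and linear probing. The state is the set of occupied slots
-- plus a running count of slots inspected so far.
insertKey :: Int -&gt; (IntSet.IntSet, Int) -&gt; Int -&gt; (IntSet.IntSet, Int)
insertKey size (occupied, probes) key = go (key `mod` size) 1
  where
    go slot n
      | slot `IntSet.member` occupied = go ((slot + 1) `mod` size) (n + 1)
      | otherwise = (IntSet.insert slot occupied, probes + n)

-- Total number of slots inspected while inserting every key.
totalProbes :: Int -&gt; [Int] -&gt; Int
totalProbes size = snd . foldl' (insertKey size) (IntSet.empty, 0)

main :: IO ()
main = do
  let size = 1024
      n    = 256
  -- Keys 0, 1024, 2048, ... all start at slot 0: 1 + 2 + ... + 256 probes.
  print (totalProbes size [0, size .. size * (n - 1)])   -- prints 32896
  -- Keys 0, 1, 2, ... each find an empty slot straight away.
  print (totalProbes size [0 .. n - 1])                  -- prints 256</code></pre>
<p>The same number of keys costs 256 probes in the friendly case and 32896 in the hostile one, and the gap grows with the square of the number of keys.</p>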
<p>It’s generally considered desirable for a hash function to have strong “avalanche” properties, meaning that a single-bit change in an input should cause every bit of the output to have a 50% probability of changing.</p>
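<p>One rough way to measure avalanche is to flip each input bit in turn and count how many output bits change. The sketch below does that for 64-bit words. The <code>mix</code> function is only a stand-in multiply-xorshift mixer (its constants are, as far as I recall, those of MurmurHash3's 64-bit finalizer); it is not the function hashable uses.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">import Data.Bits (complementBit, popCount, shiftR, xor)
import Data.Word (Word64)

-- Average number of output bits that change when a single input bit
-- is flipped. An ideal 64-bit hash would score close to 32.
avalanche :: (Word64 -&gt; Word64) -&gt; [Word64] -&gt; Double
avalanche f inputs = fromIntegral (sum flips) / fromIntegral (length flips)
  where
    flips = [ popCount (f x `xor` f (x `complementBit` i))
            | x &lt;- inputs, i &lt;- [0 .. 63] ]

-- Stand-in mixer for illustration only (an assumption, see above).
mix :: Word64 -&gt; Word64
mix = step . (* 0xc4ceb9fe1a85ec53) . step . (* 0xff51afd7ed558ccd) . step
  where
    step x = x `xor` (x `shiftR` 33)

main :: IO ()
main = do
  -- The identity function changes exactly one output bit per input bit.
  print (avalanche id [1 .. 1000])    -- prints 1.0
  -- A mixer with good avalanche should land somewhere near 32.
  print (avalanche mix [1 .. 1000])</code></pre>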
<p>For numeric types, we now use a couple of algorithms developed by Thomas Wang that are both fast and have good avalanche properties. These functions are more expensive than doing nothing, but quite a lot faster than SipHash. If it turns out that integers become a popular vector for hash flooding, we may well switch to SipHash for everything at some point.</p>
<h1 id="in-summary">In summary</h1>
<p>This release of the hashable library is a fairly big deal. While the performance implications aren’t entirely happy, I’ve done my best to write the fastest possible code.</p>
<p>I hope you’ll agree that the improvements in ease of use, security, and applicability (thanks to no longer using identity hashes) are more than worth the cost. Happy hacking!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2012/12/13/a-major-new-release-of-the-haskell-hashable-library/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A fast new SipHash implementation in Haskell</title>
		<link>http://www.serpentine.com/blog/2012/10/02/a-fast-new-siphash-implementation-in-haskell/</link>
		<comments>http://www.serpentine.com/blog/2012/10/02/a-fast-new-siphash-implementation-in-haskell/#comments</comments>
		<pubDate>Tue, 02 Oct 2012 20:30:18 +0000</pubDate>
		<dc:creator>Bryan O'Sullivan</dc:creator>
				<category><![CDATA[haskell]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=938</guid>
		<description><![CDATA[I’ve recently been talking with Johan Tibell about submitting his hashable package to become a part of the Haskell Platform. Once we get that submission accepted, we can fold Johan’s excellent hash-based data structures from his unordered-containers package into the<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2012/10/02/a-fast-new-siphash-implementation-in-haskell/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>I’ve recently been talking with Johan Tibell about submitting his <a href="http://hackage.haskell.org/package/hashable">hashable</a> package to become a part of the Haskell Platform.</p>
<p>Once we get that submission accepted, we can fold Johan’s excellent hash-based data structures from his <a href="http://hackage.haskell.org/package/unordered-containers">unordered-containers</a> package into the standard containers package. For many applications, the hash array mapped tries in unordered-containers offer a decent performance advantage over the current standard of ordered search trees.</p>
<p>Prior to making the submission, I found myself wanting to spring clean the hashable package. Although it is simple and well put together, it has some robustness problems that are important in practice.</p>
<p>Chief among these weaknesses is that it uses the FNV-1 hash, which is well known to be vulnerable to collisions. Since it currently uses FNV-1 with a fixed key, Internet-facing web applications that use hash trees are at risk.</p>
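<p>For reference, here is a sketch of the textbook 64-bit FNV-1 algorithm. This is the published algorithm with its standard constants, not necessarily the exact variant hashable ships. The thing to notice is that the only input besides the data is the starting value, so when that value is fixed and public, an attacker can compute every hash ahead of time.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">import Data.Bits (xor)
import Data.Char (ord)
import Data.List (foldl')
import Data.Word (Word64, Word8)

-- Textbook 64-bit FNV-1: for each byte, multiply the running hash
-- by the FNV prime, then xor in the byte.
fnv1 :: Word64 -&gt; [Word8] -&gt; Word64
fnv1 = foldl' step
  where
    step h b = (h * 1099511628211) `xor` fromIntegral b

-- The standard 64-bit FNV offset basis, playing the role of a fixed key.
offsetBasis :: Word64
offsetBasis = 14695981039346656037

main :: IO ()
main = do
  let bytes = map (fromIntegral . ord) &quot;hello&quot;
  print (fnv1 offsetBasis bytes)
  -- With no input, the hash is just the starting value.
  print (fnv1 offsetBasis [] == offsetBasis)   -- prints True</code></pre>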
<p>The current state of the art in reasonably fast, secure hash algorithms is Aumasson and Bernstein’s <a href="http://cr.yp.to/siphash/siphash-20120727.pdf">SipHash</a>. There already existed a Haskell implementation in the form of the <a href="http://hackage.haskell.org/package/siphash">siphash</a> package, but I thought it was worth implementing a version of my own.</p>
<p>My criteria for developing a new Haskell SipHash implementation were that it had to be fast and easily reusable. I have a <a href="https://github.com/bos/hashable/blob/sip/Data/Hashable/SipHash.hs">version in progress</a> right now, and so far it is turning out well.</p>
<p>The main features of this implementation (specifically when hashing ByteStrings) are as follows:</p>
<ul>
<li><p>All arithmetic is unboxed. I carefully inspected the generated Core at every step to ensure this would be the case. (It's usually easier to improve a code base that starts out with good performance properties than to retrofit good performance into existing code.)</p></li>
<li><p>The main loop performs 8-byte reads when possible, and if the final block is less than 8 bytes, unrolls that last bytewise fill to be as efficient as possible.</p></li>
<li><p>The common cases of SipHash's <span class="math"><em>c</em></span> and <span class="math"><em>d</em></span> parameters being fixed at 2 and 4 are unrolled.</p></li>
<li><p>The implementation is written in continuation passing style. This had the happy consequence of making unrolling particularly easy (read the source to see).</p></li>
</ul>
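<p>To show what the 8-byte blocks look like, here is a deliberately slow, list-based sketch of how SipHash carves its input into little-endian 64-bit words. The names <code>packLE</code> and <code>blocks</code> are invented for this example, and none of the unboxing or unrolling described above is attempted. The final block holds the leftover bytes, with the input length (mod 256) in its top byte.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">import Data.Bits (shiftL, (.|.))
import Data.Word (Word64, Word8)
import Numeric (showHex)

-- Pack up to 8 bytes into a word, first byte in the lowest position.
packLE :: [Word8] -&gt; Word64
packLE = foldr (\b acc -&gt; (acc `shiftL` 8) .|. fromIntegral b) 0

-- Split the input into 64-bit blocks. There is always a final block,
-- even when the input length is a multiple of 8.
blocks :: [Word8] -&gt; [Word64]
blocks bytes = go bytes
  where
    total = length bytes
    go bs = case splitAt 8 bs of
      (chunk, rest)
        | length chunk == 8 -&gt; packLE chunk : go rest
        | otherwise -&gt; [packLE chunk .|. (fromIntegral total `shiftL` 56)]

main :: IO ()
main = putStrLn (unwords (map (`showHex` &quot;&quot;) (blocks [1, 2, 3])))
-- prints 300000000030201</code></pre>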
<p>It’s worth looking at what “unrolling” means in this context. The SipHash algorithm is organized around a loop that performs a “round” a small number of times, where a round is a step of an ARX (add-rotate-xor) cipher.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">sipRound ::</span> (<span class="dt">Sip</span> <span class="ot">-&gt;</span> r) <span class="ot">-&gt;</span> <span class="dt">Sip</span> <span class="ot">-&gt;</span> r</code></pre>
<p>Each 8-byte block of input is put through two rounds. After processing is complete, the state is put through four rounds.</p>
<p>Using a CPS representation, we can easily special-case these values so that each can be executed with just a single conditional branch.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">runRounds ::</span> <span class="dt">Int</span> <span class="ot">-&gt;</span> (<span class="dt">Sip</span> <span class="ot">-&gt;</span> r) <span class="ot">-&gt;</span> <span class="dt">Sip</span> <span class="ot">-&gt;</span> r
runRounds <span class="dv">2</span> k <span class="fu">=</span> sipRound (sipRound k)
runRounds <span class="dv">4</span> k <span class="fu">=</span> sipRound (sipRound (sipRound (sipRound k)))</code></pre>
<p>GHC 7.6 generates excellent Core for the above code, inlining <code>sipRound</code> and the continuation <code>k</code> everywhere, and avoiding both jumps and uses of boxed values.</p>
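<p>For readers who want something to compile, here is a self-contained sketch of the pieces above. The <code>Sip</code> state layout is an assumption, the body of <code>sipRound</code> is transcribed from the round function in the SipHash paper as I understand it, and the generic fallback clause in <code>runRounds</code> is an addition for illustration; see the linked source for the real thing.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">import Data.Bits (rotateL, xor)
import Data.Word (Word64)

-- Assumed layout of the SipHash state: four strict 64-bit words.
data Sip = Sip !Word64 !Word64 !Word64 !Word64
  deriving (Eq, Show)

-- One ARX round in continuation passing style: compute the new
-- state, then hand it to the continuation k.
sipRound :: (Sip -&gt; r) -&gt; Sip -&gt; r
sipRound k (Sip v0 v1 v2 v3) = k (Sip v0c v1d v2c v3d)
  where
    v0a = v0 + v1
    v1a = v1 `rotateL` 13
    v1b = v1a `xor` v0a
    v0b = v0a `rotateL` 32
    v2a = v2 + v3
    v3a = v3 `rotateL` 16
    v3b = v3a `xor` v2a
    v0c = v0b + v3b
    v3c = v3b `rotateL` 21
    v3d = v3c `xor` v0c
    v2b = v2a + v1b
    v1c = v1b `rotateL` 17
    v1d = v1c `xor` v2b
    v2c = v2b `rotateL` 32

runRounds :: Int -&gt; (Sip -&gt; r) -&gt; Sip -&gt; r
runRounds 2 k = sipRound (sipRound k)
runRounds 4 k = sipRound (sipRound (sipRound (sipRound k)))
runRounds n k                 -- generic fallback, not in the original
  | n &lt;= 0    = k
  | otherwise = sipRound (runRounds (n - 1) k)

main :: IO ()
main = do
  let s = Sip 1 2 3 4         -- arbitrary starting state
  -- Four rounds are the same as two rounds followed by two more.
  print (runRounds 4 id s == runRounds 2 (runRounds 2 id) s)   -- prints True
  -- The generic path agrees with the unrolled one.
  print (runRounds 3 (sipRound id) s == runRounds 4 id s)      -- prints True</code></pre>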
<p>I measured the performance of my in-progress SipHash implementation against the FNV-1 hash currently used by the hashable package, and against two versions of Vincent Hanquez’s siphash package. Different input sizes are across the <span class="math"><em>x</em></span> axis; the <span class="math"><em>y</em></span> axis represents speedup relative to FNV-1 (less than 1 is slower, greater is faster).</p>
<a href="http://www.serpentine.com/wordpress/wp-content/uploads/2012/10/siphash2.png"><img src="http://www.serpentine.com/wordpress/wp-content/uploads/2012/10/siphash2.png" alt="" title="Performance of different SipHash implementations" width="335" height="291" class="aligncenter size-full wp-image-952" /></a>
<p>My SipHash implementation starts off at a little over half the speed of FNV-1 on 5-byte inputs; breaks even at around 45 bytes; and reaches twice the speed at 512 bytes. For comparison, I included a C implementation; my Haskell code is just 10% slower. I’m pretty pleased by these numbers.</p>
<p>(After I published an initial benchmark, Vincent Hanquez sped up his siphash package by a large amount in just a couple of hours. The dramatic effect is shown above in the huge jump in performance between versions 1.0.1 and 1.0.2 of his package. This demonstrates the value of even a brief focus on performance!)</p>
<p>Incidentally, SipHash is another case where compiling with <code>-fllvm</code> really helps; compared to GHC’s normal code generator, I saw speed jump by a very pleasing 35%. (Sure enough, the numbers I show above are with <code>-fllvm</code>.)</p>
<p>With some extra work and a little time to focus, this new code should be ready to incorporate into the hashable package, hopefully within a few days.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2012/10/02/a-fast-new-siphash-implementation-in-haskell/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The case of the mysterious explosion in space</title>
		<link>http://www.serpentine.com/blog/2012/09/12/the-case-of-the-mysterious-explosion-in-space/</link>
		<comments>http://www.serpentine.com/blog/2012/09/12/the-case-of-the-mysterious-explosion-in-space/#comments</comments>
		<pubDate>Wed, 12 Sep 2012 17:00:10 +0000</pubDate>
		<dc:creator>Bryan O'Sullivan</dc:creator>
				<category><![CDATA[haskell]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=931</guid>
		<description><![CDATA[The case of the mysterious explosion in space A few months ago, reports began to filter in of an unhappy problem with the Haskell text package: it was causing huge object files to be generated when a file contained lots<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2012/09/12/the-case-of-the-mysterious-explosion-in-space/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[
<p>A few months ago, reports began to filter in of an unhappy problem with the Haskell <code>text</code> package: it was <a href="http://www.haskell.org/pipermail/haskell-cafe/2012-March/100162.html">causing huge object files to be generated when a file contained lots of string literals</a>.</p>
<p>I didn&#8217;t notice the initial report (it was posted to a busy mailing list that I don&#8217;t try to keep up with), but Michael Snoyman was kind enough to take that message and <a href="https://github.com/bos/text/issues/19">file a bug</a>.</p>
<p>The culprit was this very simple function definition, which converts a string from the venerable Haskell <code>String</code> type to the more modern <code>Text</code>.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="fu">pack</span><span class="ot"> ::</span> <span class="dt">String</span> <span class="ot">-&gt;</span> <span class="dt">Text</span>
<span class="fu">pack</span> txt <span class="fu">=</span> unstream
             (Stream.map safe
               (Stream.streamList txt))</code></pre>
<p>The definition of <code>pack</code> is too innocent to be at fault; the problem lies with the extra directives that <code>text</code> gives to the compiler.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">{-# INLINE pack #-}</span>
<span class="ot">{-# INLINE Stream.unstream #-}</span>
<span class="ot">{-# INLINE Stream.map #-}</span>
<span class="ot">{-# INLINE Stream.streamList #-}</span></code></pre>
<p>By asking the compiler to inline every function, we guaranteed that every string literal would result in a lot of code being generated. Worse, all of this space would be entirely redundant, consisting of repeated copies of exactly the same code.</p>
<p>(You might well wonder why we&#8217;d insist on inlining <em>any</em> of these functions, if the cost in space is so high. The answer is that inlining is key to why the <code>text</code> package achieves good performance. That deserves an article of its own, so I&#8217;ll return to the subject soon.)</p>
<p>Have you ever wondered how GHC represents string literals? Instead of somehow statically constructing a linked list of characters and emitting that into an object file, it&#8217;s smarter.</p>
<p>For strings of pure ASCII, GHC generates a packed zero-terminated byte sequence that looks like this.</p>
<pre class="sourceCode gnuassembler"><code class="sourceCode gnuassembler">.const
<span class="kw">.align</span> <span class="dv">3</span>
<span class="kw">.align</span> <span class="dv">0</span>
<span class="kw">_co0_str:</span>
        <span class="kw">.byte</span>   <span class="dv">102</span>    <span class="co"># f</span>
        <span class="kw">.byte</span>   <span class="dv">111</span>    <span class="co"># o</span>
        <span class="kw">.byte</span>   <span class="dv">111</span>    <span class="co"># o</span>
        <span class="kw">.byte</span>   <span class="dv">0</span></code></pre>
<p>(For strings that contain Unicode or control characters, GHC still generates a packed sequence of bytes, but this time they&#8217;re specially encoded.)</p>
<p>In Haskell, these byte sequences have a type that is simply a fixed address, <code>Addr#</code>. During compilation, GHC takes a string literal and prefixes it with a function to convert from <code>Addr#</code> to <code>String</code>.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- What we write:</span>
<span class="ot">foo ::</span> <span class="dt">String</span>
foo <span class="fu">=</span> <span class="st">&quot;foo&quot;</span>

<span class="co">-- What GHC generates:</span>
<span class="ot">foo ::</span> <span class="dt">String</span>
foo <span class="fu">=</span> GHC.CString.unpackCString<span class="fu">#</span> co0_str
  <span class="kw">where</span><span class="ot"> co0_str ::</span> <span class="dt">Addr</span><span class="fu">#</span>
        co0_str <span class="fu">=</span> <span class="st">&quot;foo&quot;</span></code></pre>
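<p>You can reproduce this by hand. The sketch below assumes the <code>MagicHash</code> extension, which allows <code>Addr#</code> literals to be written with a trailing <code>#</code>, and imports <code>GHC.CString</code> from the ghc-prim package (a Cabal project may need to list that package as a dependency).</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">{-# LANGUAGE MagicHash #-}

import GHC.CString (unpackCString#)

main :: IO ()
main =
  -- &quot;foo&quot;# is an Addr# literal; unpackCString# turns it into a String.
  putStrLn (unpackCString# &quot;foo&quot;#)   -- prints foo</code></pre>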
<p>One of the lovelier features of GHC is that it exposes some of its internal machinery to authors. We&#8217;re paying a price for our aggressive use of its <code>INLINE</code> directive; is there another GHC feature we can use to save the day?</p>
<p>Enter the rewrite rule, a way of telling GHC how to perform source-to-source transformations.</p>
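<p>If you have not met rewrite rules before, here is a toy example that is unrelated to <code>text</code>. The function <code>myMap</code> is a made-up stand-in for a library function, and the rule fuses two traversals into one. Rules only fire when optimisation is enabled (for example with <code>-O</code>), but the program prints the same result either way.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">module Main (main) where

-- Hypothetical stand-in for a library function. NOINLINE keeps calls
-- to it visible to the simplifier, so the rule has something to match.
myMap :: (a -&gt; b) -&gt; [a] -&gt; [b]
myMap _ []       = []
myMap f (x : xs) = f x : myMap f xs
{-# NOINLINE myMap #-}

-- Replace two passes over the list with a single pass.
{-# RULES &quot;myMap/myMap&quot; forall f g xs.
    myMap f (myMap g xs) = myMap (f . g) xs
  #-}

main :: IO ()
main = print (myMap (+ 1) (myMap (* 2) [1, 2, 3 :: Int]))   -- prints [3,5,7]</code></pre>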
<p>Here is a naive attempt to specify a rewrite rule that might help us. First, we define a version of <code>pack</code> that we tell the compiler to <em>never</em> inline, then we supply a rewrite rule that tells the compiler to substitute the never-inlined version of <code>pack</code> for the normal version.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">packNOINLINE ::</span> <span class="dt">String</span> <span class="ot">-&gt;</span> <span class="dt">Text</span>
packNOINLINE <span class="fu">=</span> unstream <span class="fu">.</span> Stream.map safe <span class="fu">.</span> Stream.streamList
<span class="ot">{-# NOINLINE packNOINLINE #-}</span>

<span class="co">{-# RULES &quot;TEXT literal&quot; forall s.</span>
<span class="co">    pack s = packNOINLINE s</span>
<span class="co">  #-}</span></code></pre>
<p>Although this rule works and generates correct code, it swaps one problem for another: the object files we generate shrink dramatically, but we&#8217;ve defeated some of the compiler&#8217;s opportunities to improve the code it emits.</p>
<p>Oh, and remember what happens to a string literal during compilation: we start out with an <code>Addr#</code>, then GHC converts it to a <code>String</code> for us, and finally we convert that to a <code>Text</code>. That intermediate step galls me, even though it really has no practical consequences.</p>
<p>Happily for us, GHC&#8217;s rewrite rules are applied cleverly: rather than being a simple one-shot affair, GHC keeps trying to apply rewrite rules as it optimises a program.</p>
<p>The critical addition to our rule is to recognise that when we write a string literal, it will be transformed into an application of <code>GHC.CString.unpackCString#</code>, and target our rule to an expression containing this.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- Introduce a new function ...</span>
Text.unpackCString<span class="fu">#</span><span class="ot"> ::</span> <span class="dt">Addr</span><span class="fu">#</span> <span class="ot">-&gt;</span> <span class="dt">Text</span>
Text.unpackCString<span class="fu">#</span> addr<span class="fu">#</span>
  <span class="fu">=</span> unstream (Stream.streamCString<span class="fu">#</span> addr<span class="fu">#</span>)
<span class="ot">{-# NOINLINE unpackCString# #-}</span>

<span class="co">-- ... and use it!</span>
<span class="co">{-# RULES &quot;TEXT literal&quot; forall a.</span>
<span class="co">    unstream (Stream.map safe</span>
<span class="co">      (Stream.streamList</span>
<span class="co">        (GHC.CString.unpackCString# a)))</span>
<span class="co">      = Text.unpackCString# a #-}</span></code></pre>
<p>With this rewrite rule, GHC will transform code that we <em>never actually wrote</em>, using a type (<code>Addr#</code>) that we don&#8217;t use in our code. The conditions that trigger this rule will arise only when we define a literal <code>Text</code> value. This means that productive uses of stream fusion will not be affected. Even better, this rule eliminates that pesky intermediate <code>String</code> value, since the new <code>unpackCString#</code> performs a direct translation. Not a bad trick!</p>
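<p>For context, this is the kind of user code in which such a literal appears. The sketch assumes the <code>OverloadedStrings</code> extension. As I understand it, the literal is desugared to <code>fromString</code> applied to an ordinary <code>String</code> literal, and the <code>IsString</code> instance for <code>Text</code> goes through <code>pack</code>, which is what brings the pattern above into view once everything has been inlined.</p>
<pre class="sourceCode haskell"><code class="sourceCode haskell">{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text.IO as T

-- A literal Text value: no explicit call to pack in sight.
greeting :: Text
greeting = &quot;hello&quot;

main :: IO ()
main = T.putStrLn greeting   -- prints hello</code></pre>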
]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2012/09/12/the-case-of-the-mysterious-explosion-in-space/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Performance: yes, it&#8217;s worth looking at the small stuff</title>
		<link>http://www.serpentine.com/blog/2012/06/25/yes-its-worth-looking-at-the-small-stuff/</link>
		<comments>http://www.serpentine.com/blog/2012/06/25/yes-its-worth-looking-at-the-small-stuff/#comments</comments>
		<pubDate>Mon, 25 Jun 2012 15:17:05 +0000</pubDate>
		<dc:creator>Bryan O'Sullivan</dc:creator>
				<category><![CDATA[haskell]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=909</guid>
		<description><![CDATA[While I was in New York for QCon last week, the temperatures started out quite mild, but soared back to their usual sweltering summertime levels by midweek. I thus found myself confined to my hotel room for a few hours<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2012/06/25/yes-its-worth-looking-at-the-small-stuff/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>While I was in New York for QCon last week, the temperatures started out quite mild, but soared back to their usual sweltering summertime levels by midweek. I thus found myself confined to my hotel room for a few hours one afternoon, feeling grateful for the air conditioning.</p>
<p>As I waited for the sun to go down so that I might venture out in the heat without being cooked by both air and glare, I decided to return my attention to the Haskell <code>text</code> library for the first time in a while.</p>
<p>This library is in a happy state of feature and bug stability, and it has grown in popularity to the point where it is now one of the ten most used libraries on Hackage (counting the number of other packages depending on it).</p>
<table>
<tr><th>
library
</th><th>
# of dependent packages
</th></tr>
<tr><td>
base
</td><td align="right">
4757
</td></tr>
<tr><td>
containers
</td><td align="right">
1490
</td></tr>
<tr><td>
bytestring
</td><td align="right">
1368
</td></tr>
<tr><td>
mtl
</td><td align="right">
1243
</td></tr>
<tr><td>
directory
</td><td align="right">
742
</td></tr>
<tr><td>
filepath
</td><td align="right">
669
</td></tr>
<tr><td>
array
</td><td align="right">
552
</td></tr>
<tr><td>
transformers
</td><td align="right">
502
</td></tr>
<tr><td>
parsec
</td><td align="right">
491
</td></tr>
<tr><td>
text
</td><td align="right">
454
</td></tr>
</table>

<p>Very pleasing, right? But there are always improvements to be made, and I came across a couple of nice candidates after a little inspection.</p>
<p>I&#8217;m going to talk about one of these in some detail, because it&#8217;s worth digging into a worked example of how to further improve the performance of code that is already fast, and this is nice and brief.</p>
<p>The easiest of these candidates to explain is the function for decoding a bytestring containing UTF-8 text. Generally when we&#8217;re decoding UTF-8, the strings we use are relatively large, and so the small memory savings below will not usually be significant.</p>
<p>However, the optimisation I discuss below also appears in other performance-sensitive contexts within the library in which small strings are common, so it is definitely a meaningful one. For instance, the code for converting between the internal <tt>Stream</tt> type and a <tt>Text</tt> admits the same optimisation.</p>
<p>Somewhat simplified, the beginning of our UTF-8 decoding function reads as follows:</p>
<pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">decodeUtf8With </span><span class="ot">::</span> <span class="dt">OnDecodeError</span> <span class="ot">-&gt;</span> <span class="dt">ByteString</span> <span class="ot">-&gt;</span> <span class="dt">Text</span><br /><br />decodeUtf8With onErr bs <span class="fu">=</span> textP ary <span class="dv">0</span> alen<br /> <span class="kw">where</span><br />  (ary,alen) <span class="fu">=</span> A.run2 (A.new (<span class="fu">length</span> bs) <span class="fu">&gt;&gt;=</span> go)<br />  go dest <span class="fu">=</span> <span class="co">{- the actual decoding loop -}</span></code></pre>
<p>The <code>A.run2</code> function referred to above is important to understand. It runs an action that returns a <em>mutable</em> array, and it &#8220;freezes&#8221; that array into an <em>immutable</em> array.</p>
<pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">run2 </span><span class="ot">::</span> (forall s<span class="fu">.</span> <span class="dt">ST</span> s (<span class="dt">MArray</span> s, a)) <span class="ot">-&gt;</span> (<span class="dt">Array</span>, a)<br /><br />run2 k <span class="fu">=</span> runST (<span class="kw">do</span><br />                 (marr,b) <span class="ot">&lt;-</span> k<br />                 arr <span class="ot">&lt;-</span> unsafeFreeze marr<br />                 <span class="fu">return</span> (arr,b))</code></pre>
<p>Our <code>go</code> function is given an initial mutable array allocated by <code>A.new</code>, fills it with text, and returns the final mutable array (which may have been reallocated) and its size.</p>
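<p>The internal array module of <code>text</code> is not something you can easily play with on its own, so here is a rough analogue built on the array package instead. <code>STUArray</code> and <code>UArray</code> stand in for the library&#8217;s <code>MArray</code> and <code>Array</code>, and <code>fill</code> is a made-up action that pretends to decode two bytes.</p>
<pre class="sourceCode"><code class="sourceCode haskell">{-# LANGUAGE RankNTypes #-}

import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, newArray, writeArray)
import Data.Array.Unboxed (UArray, elems)
import Data.Array.Unsafe (unsafeFreeze)
import Data.Word (Word8)

-- Run an action that builds a mutable array, then freeze the array
-- in place and return it alongside the action's second result.
run2 :: (forall s. ST s (STUArray s Int Word8, a)) -&gt; (UArray Int Word8, a)
run2 k = runST (do
    (marr, b) &lt;- k
    arr &lt;- unsafeFreeze marr
    return (arr, b))

-- Hypothetical producer: allocates room for 4 bytes but uses only 2.
fill :: ST s (STUArray s Int Word8, Int)
fill = do
    marr &lt;- newArray (0, 3) 0
    writeArray marr 0 104
    writeArray marr 1 105
    return (marr, 2)

main :: IO ()
main = do
    let (arr, len) = run2 fill
    print (take len (elems arr))   -- prints [104,105]</code></pre>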
<p>I was curious about the performance of this code for very short strings, based on visual inspection alone.</p>
<ul>
<li><p>The <code>run2</code> function forces a tuple to be allocated by its callee, and both of the values in that tuple must be boxed.</p></li>
<li><p>It then allocates another tuple, containing one new boxed value.</p></li>
<li><p>That tuple is immediately deconstructed, and its components are passed to <code>textP</code>. This is a simple function that ensures that we always use the same value for zero-length strings, to save on allocation.</p></li>
<li><p>Because the caller of <code>decodeUtf8With</code> may not demand a result from <code>textP</code>, it is possible that the result of this function could be an unevaluated thunk.</p></li>
</ul>
<p>For very short inputs, the overhead of allocating these tuples and boxing their parameters (required because tuples are polymorphic, and polymorphic values must be boxed) worried me as potentially a significant fraction of the &quot;real work&quot;. (For longer strings, the actual decoding will dominate, so this becomes less of an issue. But short strings are common, and should be fast.)</p>
<p>But first, can we be sure that those tuples really exist once the compiler has worked its magic? GHC has for a long time performed a clever optimisation called <a href="http://research.microsoft.com/en-us/um/people/simonpj/Papers/cpr/index.htm"><em>constructed product result analysis</em></a>, or CPR. This tries to ensure that if a function returns multiple results (e.g. in a tuple or some other product type), it can (under some circumstances) use machine registers, avoiding boxing and memory allocation. Unfortunately, I checked the generated Core intermediate code, and CPR does not kick in for us here. It&#8217;s often fragile, and a case like this, where we carry a tuple from the <code>ST</code> monad into pure code, can easily defeat it. (It&#8217;s far from obvious when it will or will not work, so it is always best to check the Core when it matters.)</p>
<p>Remaining optimistic, we can manually get a little bit of a CPR-like effect by judicious choice of product types. Recall the type of <code>run2</code>:</p>
<pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">run2 </span><span class="ot">::</span> (forall s<span class="fu">.</span> <span class="dt">ST</span> s (<span class="dt">MArray</span> s, a)) <span class="ot">-&gt;</span> (<span class="dt">Array</span>, a)</code></pre>
<p>There are two profitable observations we can make here.</p>
<p>To begin with, we know that the <code>a</code> parameter of the first tuple will be an <code>Int</code> in some particularly important cases. We use a specialised tuple to represent this idea.</p>
<pre class="sourceCode"><code class="sourceCode haskell"><span class="co">-- Instead of (MArray s, a):</span><br /><span class="kw">data</span> <span class="dt">Run</span> s <span class="fu">=</span> <span class="dt">Run</span> (<span class="dt">MArray</span> s) <span class="ot">{-# UNPACK #-}</span> <span class="fu">!</span><span class="dt">Int</span></code></pre>
<p>We have explicitly instructed GHC to <em>not</em> box the <code>Int</code>; instead it will be stored <em>unboxed</em>, directly in the <code>Run</code> structure. This eliminates one memory allocation and the performance-sapping pointer indirection it would induce. (You might wonder why we don&#8217;t direct GHC to unbox the <code>MArray s</code> parameter. This value is just a pointer to an array, and we certainly don&#8217;t want to copy arrays around.)</p>
<p>Our next observation is that any time we want to return a <code>Run</code>, we will always use the result to construct a <code>Text</code>. So why not construct the <code>Text</code> directly, and save our callers from doing it in many places?</p>
<pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">runText </span><span class="ot">::</span> (forall s<span class="fu">.</span> <span class="dt">ST</span> s (<span class="dt">Run</span> s)) <span class="ot">-&gt;</span> <span class="dt">Text</span></code></pre>
<p>Here&#8217;s what the body of <code>runText</code> looks like. We are no longer allocating a second tuple, and the <code>len</code> variable will never be boxed and then immediately unboxed. This brings us from allocating two tuples and an <code>Int</code> down to allocating just one <code>Run</code>.</p>
<pre class="sourceCode"><code class="sourceCode haskell">runText act <span class="fu">=</span> runST <span class="fu">$</span> <span class="kw">do</span><br />  <span class="dt">Run</span> marr len <span class="ot">&lt;-</span> act<br />  arr <span class="ot">&lt;-</span> A.unsafeFreeze marr<br />  <span class="fu">return</span> (textP arr <span class="dv">0</span> len)</code></pre>
<p>Experienced eyes will spot one last catch: because we are not forcing the result of the <code>textP</code> expression to be evaluated before we <code>return</code>, we are unnecessarily allocating a thunk here.</p>
<p>Nevertheless, even with that small inefficiency, both time and memory usage did improve: we allocated a little less, so we did a little less work, and thus a microbenchmark that decoded millions of short words became a few percent faster.</p>
<p>As gratifying as this is, can we improve on it further? Apart from the silly oversight of not forcing <code>textP</code>, there&#8217;s still that bothersome allocation of a <code>Run</code>. Reading the Core code generated by GHC indicates that CPR is still not happening, and so a <code>Run</code> really is being allocated and immediately thrown away. (I think that the polymorphic first parameter to <code>Run</code> might be defeating CPR, but I am not sure of this.)</p>
<p>All along, we&#8217;ve been either allocating tuples or <code>Run</code>s and then immediately throwing them away. Allocation might be cheap, but <em>ceteris paribus</em>, not allocating anything will always be cheaper.</p>
<p>Instead of returning a <code>Run</code> value, what if we were to have our action call a continuation to construct the final <code>Text</code> value?</p>
<pre class="sourceCode"><code class="sourceCode haskell">runText <span class="ot">::</span><br />  (forall s<span class="fu">.</span><br />   (<span class="dt">MArray</span> s <span class="ot">-&gt;</span> <span class="dt">Int</span> <span class="ot">-&gt;</span> <span class="dt">ST</span> s <span class="dt">Text</span>)<br />   <span class="ot">-&gt;</span> <span class="dt">ST</span> s <span class="dt">Text</span>)<br />  <span class="ot">-&gt;</span> <span class="dt">Text</span></code></pre>
<p>The function that we call as our continuation captures no variables from its environment, so it has no hidden state for which we might need to allocate memory to save.</p>
<pre class="sourceCode"><code class="sourceCode haskell">runText act <span class="fu">=</span> runST (act <span class="fu">$</span><br />  \ <span class="fu">!</span>marr <span class="fu">!</span>len <span class="ot">-&gt;</span> <span class="kw">do</span><br />    arr <span class="ot">&lt;-</span> A.unsafeFreeze marr<br />    <span class="fu">return</span> <span class="fu">$!</span> textP arr <span class="dv">0</span> len)</code></pre>
<p>We&#8217;ve now avoided the allocation of a <tt>Run</tt>, and just as importantly, we remembered to have the new <code>runText</code> force the result of the <code>textP</code> expression, so it will not allocate a thunk.</p>
<p>The only downside to this approach is that GHC does very little inlining of continuation-based code, so our use of a continuation leaves a jump in the code path that we&#8217;d have preferred to see the inliner eliminate.</p>
<p>This change in tack causes a significant reduction in memory allocation: on my small-decode microbenchmark, we allocate 17% less memory, and run 10% faster. I see the same improvement with another microbenchmark that exercises the <tt>Stream</tt> to <tt>Text</tt> conversion code, where I made the same optimisation. Given that I changed just a few lines of code in each case, this result makes me happy. If you&#8217;re interested, it might help to take a look at the final <a href="https://github.com/bos/text/blob/6e29fac297fe5d68a08e6314d508e4c67c265595/Data/Text/Encoding.hs#L98">continuation-using version of <code>decodeUtf8With</code> in context</a>.</p>
<p>Although simple, I hope that working through this in some detail has been interesting. Please let me know if you&#8217;d like to see more hands-on posts like this.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2012/06/25/yes-its-worth-looking-at-the-small-stuff/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>(re)announcing statprof, a statistical profiler for Python</title>
		<link>http://www.serpentine.com/blog/2012/04/09/reannouncing-statprof-a-statistical-profiler-for-python/</link>
		<comments>http://www.serpentine.com/blog/2012/04/09/reannouncing-statprof-a-statistical-profiler-for-python/#comments</comments>
		<pubDate>Mon, 09 Apr 2012 18:51:52 +0000</pubDate>
		<dc:creator>Bryan O'Sullivan</dc:creator>
				<category><![CDATA[open source]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=906</guid>
		<description><![CDATA[Back in 2005, Andy Wingo wrote a neat little statistical profiler named statprof that promptly disappeared into obscurity. It has since languished almost unknown, with a handful of people writing semi-private forks that themselves seem to be dead. Statistical profiling<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2012/04/09/reannouncing-statprof-a-statistical-profiler-for-python/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>Back in 2005, Andy Wingo <a href="http://wingolog.org/archives/2005/10/28/profiling">wrote a neat little statistical profiler</a> named <code>statprof</code> that promptly disappeared into obscurity. It has since languished almost unknown, with a handful of people writing semi-private forks that themselves seem to be dead.</p>

<p>Statistical profiling (also known as sampling profiling) is simple and sweet: the profiler periodically wakes up and samples the stack, then when all is done, it prints a simple report of which lines showed up most often in the profile.</p>
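
<p>To make the idea concrete, here is a minimal sketch of a sampling profiler. It is not how statprof itself is implemented: it assumes CPython, uses a background thread instead of a timer, and the names <code>record_sample</code>, <code>Sampler</code> and <code>busy</code> are made up for this illustration.</p>

```python
import collections
import sys
import threading
import time


def record_sample(frame, counts):
    """Attribute one sample to the line the frame is executing right now."""
    code = frame.f_code
    counts[(code.co_filename, frame.f_lineno, code.co_name)] += 1


class Sampler:
    """Hypothetical helper: samples one thread's stack every `interval` seconds."""

    def __init__(self, target_ident, interval=0.001):
        self.target_ident = target_ident
        self.interval = interval
        self.counts = collections.Counter()
        self._finished = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait doubles as an interruptible sleep between samples.
        while not self._finished.wait(self.interval):
            # sys._current_frames() is CPython-specific.
            frame = sys._current_frames().get(self.target_ident)
            if frame is not None:
                record_sample(frame, self.counts)

    def start(self):
        self._thread.start()

    def stop(self):
        self._finished.set()
        self._thread.join()


def busy(seconds):
    """Burn CPU for roughly `seconds` so the sampler has something to catch."""
    deadline = time.monotonic() + seconds
    total = 0
    while deadline > time.monotonic():
        total += sum(range(100))
    return total


sampler = Sampler(threading.get_ident())
sampler.start()
busy(0.2)
sampler.stop()

# Report the most frequently sampled lines, hottest first.
total_samples = sum(sampler.counts.values())
for (filename, lineno, name), hits in sampler.counts.most_common(5):
    print(f"{100.0 * hits / total_samples:6.2f}%  {filename}:{lineno}:{name}")
```

<p>The exact counts vary from run to run, which is the nature of sampling; what matters is that the lines inside <code>busy</code> dominate the report.</p>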

<p>Why would this matter, though? Python already has two built-in profilers: lsprof and the long-deprecated hotshot. The trouble with lsprof is that it only tracks function calls. If you have a few hot loops <i>within</i> a function, lsprof is nearly worthless for figuring out which ones are actually important.</p>
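
<p>A small sketch of that limitation, using the standard library&#8217;s <code>cProfile</code> module (which is built on lsprof). The function <code>work</code> is a made-up example with one cheap loop and one expensive loop; the profile contains a single entry for it, keyed by the line where the function starts, so it cannot tell the two loops apart.</p>

```python
import cProfile
import pstats


def work():
    total = 0
    for i in range(1_000):        # cheap loop
        total += i
    for i in range(200_000):      # expensive loop
        total += i * i
    return total


profiler = cProfile.Profile()
result = profiler.runcall(work)

# pstats keys each entry by (filename, first line of function, function name),
# so both loops are folded into one record for work().
stats = pstats.Stats(profiler)
work_entries = [key for key in stats.stats if key[2] == "work"]
print(work_entries)
```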

<p>A few days ago, I found myself in exactly the situation in which lsprof fails: it was telling me that <a href="http://selenic.com/repo/hg/file/b9bd95e61b49/mercurial/scmutil.py#l520">I had a hot function</a>, but the function was unfamiliar to me, and long enough that it wasn&#8217;t immediately obvious where the problem was.</p>

<p>After a bit of begging on Twitter and Google+, someone pointed me at statprof. But there was a problem: although it was doing statistical sampling (yay!), it was only tracking the first line of a function when sampling (wtf!?). So I fixed that, spiffed up the documentation, and now it&#8217;s both usable and not misleading. Here&#8217;s an example of its output, locating the offending line in that hot function more accurately:</p>

<pre>
  %   cumulative      self          
 time    seconds   seconds  name    
 68.75      0.14      0.14  scmutil.py:546:revrange
  6.25      0.01      0.01  cmdutil.py:1006:walkchangerevs
  6.25      0.01      0.01  revlog.py:241:__init__
  [...blah blah blah...]
  0.00      0.01      0.00  util.py:237:__get__
---
Sample count: 16
Total time: 0.200000 seconds
</pre>
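
<p>The difference between attributing a sample to the first line of a function and to the line that is actually executing comes down to which frame attribute you read. The following is only a sketch of that distinction, not the statprof code itself; <code>describe</code> and <code>long_function</code> are hypothetical names, and <code>inspect.currentframe()</code> stands in for a real profiler sample (it relies on CPython-style frame support).</p>

```python
import inspect


def describe(frame):
    """Return both candidate line numbers for a sampled frame."""
    return {
        "function": frame.f_code.co_name,
        "first_line": frame.f_code.co_firstlineno,  # where the def starts
        "current_line": frame.f_lineno,             # what is executing now
    }


def long_function():
    x = 1
    y = x + 1
    z = y * 2
    return describe(inspect.currentframe())  # stands in for a profiler sample


info = long_function()
print(info)
```

<p>Recording <code>first_line</code> makes every sample in a function look identical; recording <code>current_line</code> is what lets a report point at a specific line such as <code>scmutil.py:546</code>.</p>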

<p>I have uploaded statprof to the <a href="http://pypi.python.org/pypi/statprof/">Python package index</a>, so it&#8217;s almost trivial to install: &#8220;<code>easy_install statprof</code>&#8221; and you&#8217;re up and running.</p>

<p>Since <a href="https://github.com/bos/statprof.py">the code is up on github</a>, please feel welcome to contribute bug reports and improvements. Enjoy!</p>]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2012/04/09/reannouncing-statprof-a-statistical-profiler-for-python/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>github is making me feel stupid(er)</title>
		<link>http://www.serpentine.com/blog/2012/04/08/github-is-making-me-feel-stupider/</link>
		<comments>http://www.serpentine.com/blog/2012/04/08/github-is-making-me-feel-stupider/#comments</comments>
		<pubDate>Sun, 08 Apr 2012 17:17:38 +0000</pubDate>
		<dc:creator>Bryan O'Sullivan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=898</guid>
		<description><![CDATA[I&#8217;m approaching my fourth anniversary of using github. I should hardly have to state that it&#8217;s a wonderful service, and especially so for being kept freely available to the open source community. At the same time, I&#8217;ve noticed over the<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2012/04/08/github-is-making-me-feel-stupider/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>I&#8217;m approaching my fourth anniversary of using github. I should hardly have to state that it&#8217;s a wonderful service, and especially so for being kept freely available to the open source community. At the same time, I&#8217;ve noticed over the past year or so that in many ways I feel less efficient using it now than I used to, even though the github team continues to roll out new features that make me shout &#8220;hooray!&#8221;</p>

<p>I doubt that these difficulties are unique to me, or even related to the fact that I&#8217;ve got a new baby (so I have the cognitive sharpness of a cotton ball). So here&#8217;s what I&#8217;m seeing; I hope that these observations are helpful to the github folks in understanding how their service is used.</p>

<p>Firstly, a spot of cognitive organizing: I really like the newish &#8220;issues across all of my projects&#8221; dashboard, but when I&#8217;m thinking about &#8220;stuff that&#8217;s mine&#8221;, I tend to navigate to <tt>github.com/bos</tt>, and that dashboard isn&#8217;t there. Instead, I kick myself and navigate to plain old <tt>github.com</tt>.  You could reasonably respond &#8220;okay, fine, just remember that, and you&#8217;re done&#8221;. And yet somehow this knowledge refuses to stick in my head.</p>

<p>What I find more confusing is the visual clutter at the top of a project page. There are now <i>seven</i> short-but-wide horizontal rows of stuff (both information and links) at the top of a project&#8217;s main page. Here&#8217;s an annotated screenshot that I hope illustrates what I&#8217;m talking about.</p>

<a href="http://www.serpentine.com/wordpress/wp-content/uploads/2012/04/github.png"><img src="http://www.serpentine.com/wordpress/wp-content/uploads/2012/04/github.png" alt="" title="github" width="600" height="246" class="aligncenter size-full wp-image-902" /></a>

<p>I frequently find myself looking for the <a href="https://github.com/bos/text/commits/master">commits page</a>, which is in the middle of row number 6. At least for me, there seems to be no escaping the need to scan across every row in turn until I reach row 6, where I find the word &#8220;commits&#8221;. That is, I <i>usually</i> find it; I can easily miss it among all the similar entries if I&#8217;m not paying close attention. I find it difficult to visually distinguish the rows at a glance, so there&#8217;s no skipping past clusters of stuff that aren&#8217;t relevant.</p>

<p>These aren&#8217;t killer problems by any stretch, but I do all too often find myself staring at github web pages for 30 seconds at a time, wondering &#8220;am I looking at the right page? Did I miss the row of stuff I&#8217;m looking for?&#8221; I imagine there might be a way to organize these things better, though I&#8217;m no visual designer, and I&#8217;m afraid I don&#8217;t have any crisp suggestions for what might work.</p>]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2012/04/08/github-is-making-me-feel-stupider/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>aeson 0.4: easier, faster, better</title>
		<link>http://www.serpentine.com/blog/2011/11/30/893/</link>
		<comments>http://www.serpentine.com/blog/2011/11/30/893/#comments</comments>
		<pubDate>Thu, 01 Dec 2011 06:20:57 +0000</pubDate>
		<dc:creator>Bryan O'Sullivan</dc:creator>
				<category><![CDATA[haskell]]></category>
		<category><![CDATA[open source]]></category>

		<guid isPermaLink="false">http://www.serpentine.com/blog/?p=893</guid>
		<description><![CDATA[After months of work, and a number of great contributions from other developers, I just released version 0.4 of aeson, the de facto standard Haskell JSON library. This is a major release, with a number of improvements. Enjoy! Ease of<span class="ellipsis">&#8230;</span><div class="read-more"><a href="http://www.serpentine.com/blog/2011/11/30/893/">Read more &#8250;</a></div><!-- end of .read-more -->]]></description>
				<content:encoded><![CDATA[<p>After months of work, and a number of great contributions from other developers, I just released version 0.4 of <a href="http://hackage.haskell.org/package/aeson">aeson</a>, the de facto standard Haskell JSON library. This is a major release, with a number of improvements. Enjoy!</p>
<h2 id="ease-of-use">Ease of use</h2>
<p>The new <a href="http://hackage.haskell.org/packages/archive/aeson/latest/doc/html/Data-Aeson.html#v:decode"><code>decode</code> function</a> complements the longstanding <code>encode</code> function, and makes the API simpler.</p>
<p><a href="https://github.com/bos/aeson/tree/master/examples">New examples</a> make it easier to learn to use the package.</p>
<h2 id="generics-support">Generics support</h2>
<p>aeson&#8217;s support for datatype-generic programming makes it possible to use JSON encodings of most data types without writing any boilerplate instances.</p>
<p>Thanks to Bas Van Dijk, aeson now supports the two major schemes for doing datatype-generic programming:</p>
<ul>
<li><p>the modern mechanism, <a href="http://www.haskell.org/ghc/docs/latest/html/users_guide/generic-programming.html">built into GHC itself</a></p></li>
<li><p>the older mechanism, based on SYB (aka &quot;scrap your boilerplate&quot;)</p></li>
</ul>
<p>The modern GHC-based generic mechanism is fast and terse: in fact, its performance is generally comparable to that of hand-written and TH-derived <code>ToJSON</code> and <code>FromJSON</code> instances. To see how to use GHC generics, refer to <a href="https://github.com/bos/aeson/blob/master/examples/Generic.hs"><code>examples/Generic.hs</code></a>.</p>
<p>The SYB-based generics support lives in <a href="http://hackage.haskell.org/packages/archive/aeson/latest/doc/html/Data-Aeson-Generic.html">Data.Aeson.Generic</a>, and is provided mainly for users of GHC older than 7.2. SYB is far slower (by about 10x) than the more modern generic mechanism. To see how to use SYB generics, refer to <a href="https://github.com/bos/aeson/blob/master/examples/GenericSYB.hs"><code>examples/GenericSYB.hs</code></a>.</p>
<h2 id="improved-performance">Improved performance</h2>
<ul>
<li><p>We switched the intermediate representation of JSON objects from <code>Data.Map</code> to <a href="http://hackage.haskell.org/package/unordered-containers"><code>Data.HashMap</code></a>, which has improved type conversion performance.</p></li>
<li><p>Instances of <code>ToJSON</code> and <code>FromJSON</code> for tuples are between 45% and 70% faster than in 0.3.</p></li>
</ul>
<h2 id="evaluation-control">Evaluation control</h2>
<p>This version of aeson makes explicit the decoupling between <em>identifying</em> an element of a JSON document and <em>converting</em> it to Haskell. See the <a href="http://hackage.haskell.org/packages/archive/aeson/latest/doc/html/Data-Aeson-Parser.html"><code>Data.Aeson.Parser</code></a> documentation for details.</p>
<p>The normal aeson <code>decode</code> function performs identification strictly, but defers conversion until needed. This can result in improved performance (e.g. if the results of some conversions are never needed), but at a cost in increased memory consumption.</p>
<p>The new <code>decode'</code> function performs identification and conversion immediately. This incurs an up-front cost in CPU cycles, but reduces memory consumption.</p>]]></content:encoded>
			<wfw:commentRss>http://www.serpentine.com/blog/2011/11/30/893/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
