Performance: yes, it’s worth looking at the small stuff

Posted on 2012-06-25 by Bryan O'Sullivan — 11 Comments ↓

While I was in New York for QCon last week, the temperatures started out quite mild, but soared back to their usual sweltering summertime levels by midweek. I thus found myself confined to my hotel room for a few hours one afternoon, feeling grateful for the air conditioning.

As I waited for the sun to go down so that I might venture out in the heat without being cooked by both air and glare, I decided to return my attention to the Haskell text library for the first time in a while.

This library is in a happy state of feature and bug stability, and it has grown in popularity to the point where it is now one of the ten most used libraries on Hackage (counting the number of other packages depending on it).

library	# deps
base	4757
containers	1490
bytestring	1368
mtl	1243
directory	742
filepath	669
array	552
transformers	502
parsec	491
text	454

Very pleasing, right? But there are always improvements to be made, and I came across a couple of nice candidates after a little inspection.

I’m going to talk about one of these in some detail, because it’s worth digging into a worked example of how to further improve the performance of code that is already fast, and this is nice and brief.

The easiest of these candidates to explain is the function for decoding a bytestring containing UTF-8 text. Generally when we’re decoding UTF-8, the strings we use are relatively large, and so the small memory savings below will not usually be significant.

The optimisation I discuss below appears in other performance sensitive contexts within the library in which small strings are common, so this is definitely a meaningful optimisation. For instance, the code for converting between the internal Stream type and a Text admits the same optimisation.

Somewhat simplified, the beginning of our UTF-8 decoding function reads as follows:

decodeUtf8With :: OnDecodeError -> ByteString -> Text

decodeUtf8With onErr bs = textP ary 0 alen
 where
  (ary,alen) = A.run2 (A.new (length bs) >>= go)
  go dest = {- the actual decoding loop -}

The A.run2 function referred to above is important to understand. It runs an action that returns a mutable array, and it “freezes” that array into an immutable array.

run2 :: (forall s. ST s (MArray s, a)) -> (Array, a)

run2 k = runST (do
                 (marr,b) <- k
                 arr <- unsafeFreeze marr
                 return (arr,b))

Our go function is given an initial mutable array allocated by A.new, fills it with text, and returns the final mutable array (which may have been reallocated) and its size.

I was curious about the performance of this code for very short strings, based on visual inspection alone.

The run2 function forces a tuple to be allocated by its callee, and both of the values in that tuple must be boxed.
It then allocates another tuple, containing one new boxed value.
That tuple is immediately deconstructed, and its components are passed to textP. This is a simple function that ensures that we always use the same value for zero-length strings, to save on allocation.
Because the caller of decodeUtf8With may not demand a result from textP, it is possible that the result of this function could be an unevaluated thunk.

For very short inputs, the overhead of allocating these tuples and boxing their parameters (required because tuples are polymorphic, and polymorphic values must be boxed) worried me as potentially a significant fraction of the "real work". (For longer strings, the actual decoding will dominate, so this becomes less of an issue. But short strings are common, and should be fast.)

But first, can we be sure that those tuples really exist once the compiler has worked its magic? GHC has for a long time performed a clever optimisation called constructed product result analysis, or CPR. This tries to ensure that if a function returns multiple results (e.g. in a tuple or some other product type), it can (under some circumstances) use machine registers, avoiding boxing and memory allocation. Unfortunately, I checked the generated Core intermediate code, and CPR does not kick in for us here. It’s often fragile, and a case like this, where we carry a tuple from the ST monad into pure code, can easily defeat it. (It’s far from obvious when it will or will not work, so always best to check if it matters.)

Remaining optimistic, we can manually get a little bit of a CPR-like effect by judicious choice of product types. Recall the type of run2:

run2 :: (forall s. ST s (MArray s, a)) -> (Array, a)

There are two profitable observations we can make here.

To begin with, we know that the a parameter of the first tuple will be an Int in some particularly important cases. We use a specialised tuple to represent this idea.

-- Instead of (MArray s, a):
data Run s = Run (MArray s) {-# UNPACK #-} !Int

We have explicitly instructed GHC to not box the Int; instead it will be stored unboxed, directly in the Run structure. This eliminates one memory allocation and the performance-sapping pointer indirection it would induce. (You might wonder why we don’t direct GHC to unbox the MArray s parameter. This value is just a pointer to an array, and we certainly don’t want to copy arrays around.)

Our next observation is that any time we want to return a Run, we will always use the result to construct a Text. So why not construct the Text directly, and save our callers from doing it in many places?

runText :: (forall s. ST s (Run s)) -> Text

Here’s what the body of runText looks like. We are no longer allocating a second tuple, and the len variable will never be boxed and then immediately unboxed. This brings us from allocating two tuples and an Int down to allocating just one Run.

runText act = runST $ do
  Run marr len <- act
  arr <- A.unsafeFreeze marr
  return (textP arr 0 len)

Experienced eyes will spot one last catch: because we are not forcing the result of the textP expression to be evaluated before we return, we are unnecessarily allocating a thunk here.

Nevertheless, even with that small inefficiency, both time and memory usage did improve: we allocated a little less, so we did a little less work, and thus a microbenchmark that decoded millions of short words became a few percent faster.

As gratifying as this is, can we improve on it further? Apart from the silly oversight of not forcing textP, there’s still that bothersome allocation of a Run. Reading the Core code generated by GHC indicates that CPR is still not happening, and so a Run really is being allocated and immediately thrown away. (I think that the polymorphic first parameter to Run might be defeating CPR, but I am not sure of this.)

All along, we’ve been either allocating tuples or Runs and then immediately throwing them away. Allocation might be cheap, but ceteris paribus, not allocating anything will always be cheaper.

Instead of returning a Run value, what if we were to have our action call a continuation to construct the final Text value?

runText ::
  (forall s.
   (MArray s -> Int -> ST s Text)
   -> ST s Text)
  -> Text

The function that we call as our continuation captures no variables from its environment, so it has no hidden state for which we might need to allocate memory to save.

runText act = runST (act $
  \ !marr !len -> do
    arr <- A.unsafeFreeze marr
    return $! textP arr 0 len)

We’ve now avoided the allocation of a Run, and just as importantly, we remembered to have the new runText force the result of the textP expression, so it will not allocate a thunk.

The only downside to this approach is that GHC does very little inlining of continuation-based code, so our use of a continuation leaves a jump in the code path that we’d have preferred to see the inliner eliminate.

This change in tack causes a significant reduction in memory allocation: on my small-decode microbenchmark, we allocate 17% less memory, and run 10% faster. I see the same improvement with another microbenchmark that exercises the Stream to Text conversion code, where I made the same optimisation. Given that I changed just a few lines of code in each case, this result makes me happy. If you’re interested, it might help to take a look at the final continuation-using version of decodeUtf8With in context.

Although simple, I hope that working through this in some detail has been interesting. Please let me know if you’d like to see more hands-on posts like this.

Posted in haskell

11 comments on “Performance: yes, it’s worth looking at the small stuff”

Johan Tibell says:

2012-06-25 at 08:55

I would like to see more hands-on posts like this, but you already knew that. A few comments:

First, runST tends to get away of some optimizations as it’s marked as NOINLINE, as in presence of the full laziness optimization inlining it might give incorrect results. If you don’t need full laziness you can turn it off and use the trick I used here:

https://github.com/tibbe/unordered-containers/blob/master/Data/HashMap/Unsafe.hs

Since runST can’t be inlined it 1) causes a closure to be allocated (the argument to runST) and 2) inhibits other optimizations that would have happened if the body of runST would have been inlined.

Second, you do want to unpack the MArray! It’s definition is:

data MArray s = MArray {
maBA :: MutableByteArray# s
}

MutableByteArray# is already a pointer to a byte array. There’s no way (regrettably) to unpack the array content into the MArray constructor even if you wanted to. If you include an MArray as a field in another constructor, you get a pointer to a pointer to a byte array.
Bryan O'Sullivan says:

2012-06-25 at 11:24

That’s interesting about runST; I hadn’t really paid attention to it not being inlined. I’m nervous about silently breaking something if I play games with this. Namely that I’ll either lose due to marking realWorld# as a CAF, or that turning off full laziness will defeat other possible optimizations that could outweigh the value of inlining runSTRep.

GHC seems to be using the worker-wrapper transformation to keep MArray unboxed for me, so there’s actually nothing to do there.
Oleksandr Manzyuk says:

2012-06-25 at 12:35

Please count me in. I too would love to see more hands-on posts like this.

Writing in continuation-passing style to avoid consing seems to be a common pattern in the Lisp/Scheme world, but it’s the first time I see it mentioned in the context of Haskell.
Jake McArthur says:

2012-06-25 at 12:51

Another vote for more such posts!
Johan Tibell says:

2012-06-25 at 13:11

Bryan, I understand if you don’t want to mess with working code.

While the W/W transformation might save you this time, you can avoid the risk of it having to by unpacking that field, it doesn’t hurt and can only help. Might also save on allocation, depending on how the Run constructor is constructed.

Oleksandr, Haskell parser/serialization libraries tend to use continuation-passing style quite a lot (see e.g. binary, attoparsec.)
dons says:

2012-06-25 at 13:16

Re. CPR and nested CPR, see the ticket http://hackage.haskell.org/trac/ghc/ticket/2289 for a bit of a history (when CPRing into tuples).

There’s also the cpr-sum-types branch for even more goodies.
Luke Hoersten says:

2012-06-26 at 05:59

I would love to see more posts like this. Great write-up!
Bram says:

2012-06-26 at 08:50

Although the post went a bit over my head, it was still enjoyable. I like the fact that you went through the steps and explained you thought at each step.
Felipe Lessa says:

2012-06-27 at 15:18

+1
Lemming says:

2012-07-08 at 14:30

It seems that for efficient Haskell code we need a lot of experience, understanding of GHC’s working and luck, that GHC will continue to work the way it works today. I wonder whether there are more direct and reliable ways of writing efficient code in Haskell. So far I have programmed a lot with the llvm package. It gives me full control over allocations and inlining.
John Lato says:

2012-09-20 at 18:18

Very nice example. I’ve often been curious as to when ghc performs CPR optimizations, because it’s usually not as robust as I’d like. The overhead of tupling and boxing can often dominate the actual work in tight loops.

The CPS transform looks very similar to one I wrote about at http://johnlato.blogspot.sg/2012/03/faking-it-with-higher-rank-existential.html (which was incidentally also brought about by efforts to get the compiler to unbox more data).

1 Pings/Trackbacks for "Performance: yes, it’s worth looking at the small stuff"

» Crouching Laziness, Hidden Strictness mergeconflict says:

2012-07-07 at 09:47

[…] tuning, we don’t use the profiler (nope, that’s a debugging tool) but instead we manually inspect the Core code that the compiler […]

Performance: yes, it’s worth looking at the small stuff

11 comments on “Performance: yes, it’s worth looking at the small stuff”

1 Pings/Trackbacks for "Performance: yes, it’s worth looking at the small stuff"

Leave a Reply Cancel reply