What’s good for C++ is good for … Haskell!?

A few days ago, my Facebook colleague Andrei Alexandrescu posted a note entitled Three Optimization Tips for C++, which reminded me that I had unfinished business with Haskell’s text package.

I took his code, applied it to the text package, and this is the story of what happened.

The text package provides a type named Builder for efficiently constructing Unicode strings from smaller fragments. A couple of years after Johan contributed the initial Builder implementation, I wrote some helper functions to perform recurring tasks, such as rendering numbers.

In my initial number renderer, I put no effort into making my code fast—as this snippet demonstrates. Simplicity first.

positive :: (Integral a) => a -> Builder
positive = go
    go n | n < 10    = digit n
         | otherwise = go (n `quot` 10) <> digit (n `rem` 10)
    digit n = singleton . intTodigit . fromIntegral

Having gotten the code working, I promptly forgot about it for a while. When I saw Andrei’s note nine months later, it tickled my memory: didn’t I have code that I could possibly improve? Indeed, when I took a look at my number rendering code, I found that I hadn’t even bothered to write benchmarks to measure its performance.

Before I discuss how I improved it, a brief description of how a Builder works is in order. A Builder provides a safe way to destructively write data into a list of fixed-size buffers, and to convert the result into an immutable Text value. External users of Builder never see the mutable buffers; they can only use safe access methods.

The code above uses the safe API. Each use of singleton is bounds-checked internally: if a write will not fit into the current buffer, Builder “finalizes” that buffer (putting it at the tail of the list), allocates a new one, and starts writing there. The <> operator sequences the writes.

While this implementation is correct, it is far from fast, partly due to the overhead of performing a bounds check for every character to be written out. In fact, this approach is almost always slower than simply using the show function, then converting the resulting [Char] value to a Builder!

My first observation was that if I knew the number of digits I needed up front, I could perform just one buffer-size check, instead of a separate check prior to rendering each digit.

countDigits :: (Integral a) => a -> Int
countDigits v0 = go 1 (fromIntegral v0 :: Word64)
  where go !k v
           | v < 10    = k
           | v < 100   = k + 1
           | v < 1000  = k + 2
           | v < 10000 = k + 3
           | otherwise = go (k+4) (v `quot` 10000)

This is almost identical to Andrei’s second digits10 function.

Once I was able to count digits, I could use the internal writeN function to destructively update the buffer. writeN ensures that a buffer is available with at least the requested amount of space; it then calls a user-supplied function, giving it the buffer to write to (marr) and the position to start writing at (off).

positive :: (Integral a) => a -> Builder
positive i
    -- we win when we special-case single-digit numbers
    | i < 10    = writeN 1 $ \marr off ->
                  unsafeWrite marr off (i2w i)
    | otherwise = let !n = countDigits i
                  in writeN n $ \marr off ->
                     posDecimal marr off n i

posDecimal :: (Integral a) =>
              forall s. MArray s -> Int -> Int -> a -> ST s ()
posDecimal marr off0 ds v0 = go (off0 + ds - 1) v0
  where go off v
          | v < 10 = unsafeWrite marr off (i2w v)
          | otherwise = do
              unsafeWrite marr off (i2w (v `rem` 10))
              go (off-1) (v `div` 10)

i2w :: (Integral a) => a -> Word16
i2w v = 48 + fromIntegral v

I then took Andrei’s very clever third digits10 function and translated that to Haskell. Syntax apart, there is a small difference between his function and mine: his is recursive, while mine is tail recursive (i.e. a loop).

countDigits :: (Integral a) => a -> Int
countDigits v0 = go 1 (fromIntegral v0 :: Word64)
    go !k v
      | v < 10    = k
      | v < 100   = k + 1
      | v < 1000  = k + 2
      | v < 1000000000000 =
          k + if v < 100000000
              then if v < 1000000
                   then if v < 10000
                        then 3
                        else 4 + fin v 100000
                   else 6 + fin v 10000000
              else if v < 10000000000
                   then 8 + fin v 1000000000
                   else 10 + fin v 100000000000
      | otherwise = go (k + 12) (v `quot` 1000000000000)
   fin v n = if v >= n then 1 else 0

(To be sure my intuition was correct, I did indeed measure recursive against tail recursive versions of my Haskell translation, and tail recursion wins by a few percent here.)

While this countDigits function helped performance by quite a bit, there was another step remaining in following Andrei’s example: converting two digits at a time.

posDecimal :: (Integral a) =>
              forall s. MArray s -> Int -> Int -> a -> ST s ()
posDecimal marr off0 ds v0 = go (off0 + ds - 1) v0
  where go off v
           | v >= 100 = do
               let (q, r) = v `quotRem` 100
               write2 off r
               go (off - 2) q
           | v < 10    = unsafeWrite marr off (i2w v)
           | otherwise = write2 off v
        write2 off i0 = do
          let i = fromIntegral i0; j = i + i
          unsafeWrite marr off $ get (j + 1)
          unsafeWrite marr (off - 1) $ get j
        get = fromIntegral . B.unsafeIndex digits

A final surprise came when I decided to try an experiment: what if I replaced the separate uses of quot and rem with a single use of quotRem? This improved performance by a further 30% on large 64-bit numbers! Why such a big difference? Because quotRem can often be emitted as a single machine instruction instead of two, and division is expensive enough that in a hot loop, this helps a lot.

Many modern optimizing compilers can spot this kind of opportunity automatically. Although GHC’s optimizer performs many complex high-level transformations, its machinery for handling low-level optimizations is currently weak. (This is why you’ll see a few cases of v+v instead of v*2 above, where I’m strength-reducing operations by hand instead of trusting the compiler.)

I was not at all surprised that Andrei’s optimization tips should translate perfectly to Haskell, as most of what he says has nothing to do with C++ itself. His advice should apply well to any language that provides low-level access to the machine and ends up running as native code. Since Haskell is often not understood to be a perfectly good imperative language, it’s easy for less experienced programmers to overlook the relevance of low-level concerns to Haskell performance.

The end result of all this tweaking is that the new number renderer in the text package is not just much faster than its predecessor, it is also a lot faster than the venerable show function. We retain the same API, external immutability, and type safety as before, but have a very nice five-fold increase in performance to show for our efforts!

Posted in haskell
4 comments on “What’s good for C++ is good for … Haskell!?
  1. Kirill Zaborsky says:

    I suppose that “Access denied” – https://docs.google.com/file/d/0B_CUoZk6rWOIejN5NDZYRmxGcTA/edit isn’t the right picture?

  2. Oops, fixed. Thanks!

  3. Boris Lykah says:

    New primops were created for quotRem as a solution for GHC issue 5598 with duplicate divisions. So, if quot and rem are used separately, there will be still two primops and two calculations which might be joined by LLVM, but not by NCG.

  4. Yves Pares says:

    This may be overrated, depending on how log performs but you can count digits this way :

    baseLog = log 10 — computed once and reused

    countDigits :: (Integral a) => a -> Int
    countDigits x = log x/baseLog + 1

Leave a Reply

Your email address will not be published. Required fields are marked *