What’s good for C++ is good for … Haskell!?

A few days ago, my Facebook colleague Andrei Alexandrescu posted a note entitled Three Optimization Tips for C++, which reminded me that I had unfinished business with Haskell’s text package.

I took his code, applied it to the text package, and this is the story of what happened.

The text package provides a type named Builder for efficiently constructing Unicode strings from smaller fragments. A couple of years after Johan contributed the initial Builder implementation, I wrote some helper functions to perform recurring tasks, such as rendering numbers.

In my initial number renderer, I put no effort into making my code fast—as this snippet demonstrates. Simplicity first.

positive :: (Integral a) => a -> Builder
positive = go
  where
    go n | n < 10    = digit n
         | otherwise = go (n `quot` 10) <> digit (n `rem` 10)
    digit n = singleton . intTodigit . fromIntegral

Having gotten the code working, I promptly forgot about it for a while. When I saw Andrei’s note nine months later, it tickled my memory: didn’t I have code that I could possibly improve? Indeed, when I took a look at my number rendering code, I found that I hadn’t even bothered to write benchmarks to measure its performance.

Before I discuss how I improved it, a brief description of how a Builder works is in order. A Builder provides a safe way to destructively write data into a list of fixed-size buffers, and to convert the result into an immutable Text value. External users of Builder never see the mutable buffers; they can only use safe access methods.

The code above uses the safe API. Each use of singleton is bounds-checked internally: if a write will not fit into the current buffer, Builder “finalizes” that buffer (putting it at the tail of the list), allocates a new one, and starts writing there. The <> operator sequences the writes.

While this implementation is correct, it is far from fast, partly due to the overhead of performing a bounds check for every character to be written out. In fact, this approach is almost always slower than simply using the show function, then converting the resulting [Char] value to a Builder!

My first observation was that if I knew the number of digits I needed up front, I could perform just one buffer-size check, instead of a separate check prior to rendering each digit.

countDigits :: (Integral a) => a -> Int
countDigits v0 = go 1 (fromIntegral v0 :: Word64)
  where go !k v
           | v < 10    = k
           | v < 100   = k + 1
           | v < 1000  = k + 2
           | v < 10000 = k + 3
           | otherwise = go (k+4) (v `quot` 10000)

This is almost identical to Andrei’s second digits10 function.

Once I was able to count digits, I could use the internal writeN function to destructively update the buffer. writeN ensures that a buffer is available with at least the requested amount of space; it then calls a user-supplied function, giving it the buffer to write to (marr) and the position to start writing at (off).

positive :: (Integral a) => a -> Builder
positive i
    -- we win when we special-case single-digit numbers
    | i < 10    = writeN 1 $ \marr off ->
                  unsafeWrite marr off (i2w i)
    | otherwise = let !n = countDigits i
                  in writeN n $ \marr off ->
                     posDecimal marr off n i

posDecimal :: (Integral a) =>
              forall s. MArray s -> Int -> Int -> a -> ST s ()
posDecimal marr off0 ds v0 = go (off0 + ds - 1) v0
  where go off v
          | v < 10 = unsafeWrite marr off (i2w v)
          | otherwise = do
              unsafeWrite marr off (i2w (v `rem` 10))
              go (off-1) (v `div` 10)

i2w :: (Integral a) => a -> Word16
i2w v = 48 + fromIntegral v

I then took Andrei’s very clever third digits10 function and translated that to Haskell. Syntax apart, there is a small difference between his function and mine: his is recursive, while mine is tail recursive (i.e. a loop).

countDigits :: (Integral a) => a -> Int
countDigits v0 = go 1 (fromIntegral v0 :: Word64)
  where
    go !k v
      | v < 10    = k
      | v < 100   = k + 1
      | v < 1000  = k + 2
      | v < 1000000000000 =
          k + if v < 100000000
              then if v < 1000000
                   then if v < 10000
                        then 3
                        else 4 + fin v 100000
                   else 6 + fin v 10000000
              else if v < 10000000000
                   then 8 + fin v 1000000000
                   else 10 + fin v 100000000000
      | otherwise = go (k + 12) (v `quot` 1000000000000)
   fin v n = if v >= n then 1 else 0

(To be sure my intuition was correct, I did indeed measure recursive against tail recursive versions of my Haskell translation, and tail recursion wins by a few percent here.)

While this countDigits function helped performance by quite a bit, there was another step remaining in following Andrei’s example: converting two digits at a time.

posDecimal :: (Integral a) =>
              forall s. MArray s -> Int -> Int -> a -> ST s ()
posDecimal marr off0 ds v0 = go (off0 + ds - 1) v0
  where go off v
           | v >= 100 = do
               let (q, r) = v `quotRem` 100
               write2 off r
               go (off - 2) q
           | v < 10    = unsafeWrite marr off (i2w v)
           | otherwise = write2 off v
        write2 off i0 = do
          let i = fromIntegral i0; j = i + i
          unsafeWrite marr off $ get (j + 1)
          unsafeWrite marr (off - 1) $ get j
        get = fromIntegral . B.unsafeIndex digits

A final surprise came when I decided to try an experiment: what if I replaced the separate uses of quot and rem with a single use of quotRem? This improved performance by a further 30% on large 64-bit numbers! Why such a big difference? Because quotRem can often be emitted as a single machine instruction instead of two, and division is expensive enough that in a hot loop, this helps a lot.

Many modern optimizing compilers can spot this kind of opportunity automatically. Although GHC’s optimizer performs many complex high-level transformations, its machinery for handling low-level optimizations is currently weak. (This is why you’ll see a few cases of v+v instead of v*2 above, where I’m strength-reducing operations by hand instead of trusting the compiler.)

I was not at all surprised that Andrei’s optimization tips should translate perfectly to Haskell, as most of what he says has nothing to do with C++ itself. His advice should apply well to any language that provides low-level access to the machine and ends up running as native code. Since Haskell is often not understood to be a perfectly good imperative language, it’s easy for less experienced programmers to overlook the relevance of low-level concerns to Haskell performance.

The end result of all this tweaking is that the new number renderer in the text package is not just much faster than its predecessor, it is also a lot faster than the venerable show function. We retain the same API, external immutability, and type safety as before, but have a very nice five-fold increase in performance to show for our efforts!

Posted in haskell
11 comments on “What’s good for C++ is good for … Haskell!?
  1. Kirill Zaborsky says:

    I suppose that “Access denied” – https://docs.google.com/file/d/0B_CUoZk6rWOIejN5NDZYRmxGcTA/edit isn’t the right picture?

  2. Oops, fixed. Thanks!

  3. Boris Lykah says:

    New primops were created for quotRem as a solution for GHC issue 5598 with duplicate divisions. So, if quot and rem are used separately, there will be still two primops and two calculations which might be joined by LLVM, but not by NCG.

  4. Yves Pares says:

    This may be overrated, depending on how log performs but you can count digits this way :

    baseLog = log 10 — computed once and reused

    countDigits :: (Integral a) => a -> Int
    countDigits x = log x/baseLog + 1

  5. A picture of the bride smiling at you may be pretty, but it is also
    mundane. Another reason for the myth is that
    it seems logical that sitting on a seat and then repeatedly closing and opening your legs against resistance
    will burn up fat in your thighs, the very areas where you feel “the burn. ” As we mentioned earlier, everybody has cellulite, even the skinniest of people.

    My blog post – how to get thigh gap fast

  6. These give you an idea of what is meant by the requirements of the plan. This a match made in heaven as grains contain low quantities of another essential amino acid lysine, which pulses do contain.
    Allot of people have a hard time starting a diet journey, first step is to buy walking shoes.

    My website how to lose belly fat

  7. Let me start this by stating that I have never performed an Indignant Birds game before doing
    this overview. Do not use more or less effort on longer
    or shorter putts simple make your pendulum -type stroke larger or smaller.
    Mental condition characterized by perceptions that are abnormal and expressions of reality is called as Schizophrenia.

    Here is my blog: geometry dash hack

  8. It would be nice if a new game was released with no
    problems, no glitches, and no issues of any kind. But eventually, its supporters
    were able to spread the news of the tasty new treat
    and won over its detractors. Dialing this number provides you
    with the baseball bat, handgun, shotgun, MP5, M4, sniper rifle, RPG, and grenades.

    Feel free to surf to my web site … Telecharger Gta 5 gratuit pc complet

  9. For this reason it is the best to schedule the 3D ultrasound exam after the 25th week of pregnancy.
    Players can also own two properties at the same time with the upcoming ‘GTA Online’ High Life Update.
    Live concerts often incorporate a wide range of visual effects and lighting
    systems which enhance the excitement of an artist
    performing.

    Feel free to visit my web site; telecharger gta 5 pc gratuit complet en francais utorrent

  10. funny jokes says:

    Also, some of their wittiness is sure to brush off on you.
    These are exciting gag gift suggestions you may consider that will certainly
    entertain anyone on his or her 75th birthday.

    You can have a superhero costume party in which the birthday celebrant
    himself wears a groovy costume; an old-fashioned children’s celebration where magicinas, clowns, mascots, and mimes, and other entertainers perform; and many other interesting ideas.

    Feel free to visit my web-site :: funny jokes

  11. Humanity may not find it easy to eat what ancestors ate, but remember old is
    gold and will always remain no, no matter
    what is the era. Comparing Organic, Free-Range and Commercial Chickens.
    Crossfit is sweeping the globe and athletes are picking up this new method of
    working out to train their bodies more efficiently.

    My webpage: paleo diet recipes

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>