I took his code, applied it to the
text package, and this is the story of what happened.
text package provides a type named
Builder for efficiently constructing Unicode strings from smaller fragments. A couple of years after Johan contributed the initial
Builder implementation, I wrote some helper functions to perform recurring tasks, such as rendering numbers.
In my initial number renderer, I put no effort into making my code fast—as this snippet demonstrates. Simplicity first.
positive :: (Integral a) => a -> Builder positive = go where go n | n < 10 = digit n | otherwise = go (n `quot` 10) <> digit (n `rem` 10) digit n = singleton . intTodigit . fromIntegral
Having gotten the code working, I promptly forgot about it for a while. When I saw Andrei’s note nine months later, it tickled my memory: didn’t I have code that I could possibly improve? Indeed, when I took a look at my number rendering code, I found that I hadn’t even bothered to write benchmarks to measure its performance.
Before I discuss how I improved it, a brief description of how a
Builder works is in order. A
Builder provides a safe way to destructively write data into a list of fixed-size buffers, and to convert the result into an immutable
Text value. External users of
Builder never see the mutable buffers; they can only use safe access methods.
The code above uses the safe API. Each use of
singleton is bounds-checked internally: if a write will not fit into the current buffer,
Builder “finalizes” that buffer (putting it at the tail of the list), allocates a new one, and starts writing there. The
<> operator sequences the writes.
While this implementation is correct, it is far from fast, partly due to the overhead of performing a bounds check for every character to be written out. In fact, this approach is almost always slower than simply using the
show function, then converting the resulting
[Char] value to a
My first observation was that if I knew the number of digits I needed up front, I could perform just one buffer-size check, instead of a separate check prior to rendering each digit.
countDigits :: (Integral a) => a -> Int countDigits v0 = go 1 (fromIntegral v0 :: Word64) where go !k v | v < 10 = k | v < 100 = k + 1 | v < 1000 = k + 2 | v < 10000 = k + 3 | otherwise = go (k+4) (v `quot` 10000)
This is almost identical to Andrei’s second
Once I was able to count digits, I could use the internal
writeN function to destructively update the buffer.
writeN ensures that a buffer is available with at least the requested amount of space; it then calls a user-supplied function, giving it the buffer to write to (
marr) and the position to start writing at (
positive :: (Integral a) => a -> Builder positive i -- we win when we special-case single-digit numbers | i < 10 = writeN 1 $ \marr off -> unsafeWrite marr off (i2w i) | otherwise = let !n = countDigits i in writeN n $ \marr off -> posDecimal marr off n i posDecimal :: (Integral a) => forall s. MArray s -> Int -> Int -> a -> ST s () posDecimal marr off0 ds v0 = go (off0 + ds - 1) v0 where go off v | v < 10 = unsafeWrite marr off (i2w v) | otherwise = do unsafeWrite marr off (i2w (v `rem` 10)) go (off-1) (v `div` 10) i2w :: (Integral a) => a -> Word16 i2w v = 48 + fromIntegral v
I then took Andrei’s very clever third
digits10 function and translated that to Haskell. Syntax apart, there is a small difference between his function and mine: his is recursive, while mine is tail recursive (i.e. a loop).
countDigits :: (Integral a) => a -> Int countDigits v0 = go 1 (fromIntegral v0 :: Word64) where go !k v | v < 10 = k | v < 100 = k + 1 | v < 1000 = k + 2 | v < 1000000000000 = k + if v < 100000000 then if v < 1000000 then if v < 10000 then 3 else 4 + fin v 100000 else 6 + fin v 10000000 else if v < 10000000000 then 8 + fin v 1000000000 else 10 + fin v 100000000000 | otherwise = go (k + 12) (v `quot` 1000000000000) fin v n = if v >= n then 1 else 0
(To be sure my intuition was correct, I did indeed measure recursive against tail recursive versions of my Haskell translation, and tail recursion wins by a few percent here.)
countDigits function helped performance by quite a bit, there was another step remaining in following Andrei’s example: converting two digits at a time.
posDecimal :: (Integral a) => forall s. MArray s -> Int -> Int -> a -> ST s () posDecimal marr off0 ds v0 = go (off0 + ds - 1) v0 where go off v | v >= 100 = do let (q, r) = v `quotRem` 100 write2 off r go (off - 2) q | v < 10 = unsafeWrite marr off (i2w v) | otherwise = write2 off v write2 off i0 = do let i = fromIntegral i0; j = i + i unsafeWrite marr off $ get (j + 1) unsafeWrite marr (off - 1) $ get j get = fromIntegral . B.unsafeIndex digits
A final surprise came when I decided to try an experiment: what if I replaced the separate uses of
rem with a single use of
quotRem? This improved performance by a further 30% on large 64-bit numbers! Why such a big difference? Because
quotRem can often be emitted as a single machine instruction instead of two, and division is expensive enough that in a hot loop, this helps a lot.
Many modern optimizing compilers can spot this kind of opportunity automatically. Although GHC’s optimizer performs many complex high-level transformations, its machinery for handling low-level optimizations is currently weak. (This is why you’ll see a few cases of
v+v instead of
v*2 above, where I’m strength-reducing operations by hand instead of trusting the compiler.)
I was not at all surprised that Andrei’s optimization tips should translate perfectly to Haskell, as most of what he says has nothing to do with C++ itself. His advice should apply well to any language that provides low-level access to the machine and ends up running as native code. Since Haskell is often not understood to be a perfectly good imperative language, it’s easy for less experienced programmers to overlook the relevance of low-level concerns to Haskell performance.
The end result of all this tweaking is that the new number renderer in the
text package is not just much faster than its predecessor, it is also a lot faster than the venerable
show function. We retain the same API, external immutability, and type safety as before, but have a very nice five-fold increase in performance to show for our efforts!