Over the past few months, the Sigma engineering team at Facebook has rolled out a major Haskell project: a rewrite of Sigma, an important weapon in our armory for fighting spam and malware.
Sigma has a mission-critical job, and it needs to scale: its growing workload currently sees it handling tens of millions of requests per minute.
The rewrite of Sigma in Haskell, using the Haxl library that Simon Marlow developed, has been a success. Throughput is higher than under its predecessor, and CPU usage is lower. Sweet!
Nevertheless, success brings with it surprises, and even though I haven’t worked on Sigma or Haxl, I’ve been implicated in one such surprise. To understand my accidental bit part in the show, let's begin by mentioning that Sigma uses JSON internally for various purposes. These days, the Haskell-powered Sigma uses aeson, the JSON library I wrote, to handle JSON data.
A few months ago, the Haxl rewrite of Sigma was going through an episode of crazytown, in which it would intermittently and unpredictably use huge amounts of CPU and memory. The culprit turned out to be JSON strings containing zillions of backslashes. (I have no idea why. If you’ve worked with large volumes of data for a long time, you won’t even bat an eyelash at the idea that a data store somewhere contains some really weird records.)
The team quickly mitigated the problem, and nudged me to look into it. On Sunday evening, with a glass of red wine in hand, I finally dove in to see what was wrong.
Since the Sigma developers had figured out what was causing these time and space explosions, I immediately had a test case to work with, and the results were grim: decoding a mere megabyte of continuous backslashes took over a second, consumed over a gigabyte of memory, and killed concurrency by causing the runtime system to spend almost 90% of its time in the garbage collector. Yikes!
Whatever was going on? If you look at the old implementation of aeson’s unescape function, it seems quite efficient and innocuous. It’s reasonably tightly optimized low-level Haskell.
unescape uses an API (a bytestring builder) that is intended for streaming a result incrementally. Unfortunately the unescape function can’t hand any data back to its caller until it has processed an entire string.
The result is as you’d expect: we build a huge chain of thunks. In this case, the thunks will eventually write data efficiently into buffers. Alas, the thunks have nobody demanding the evaluation of their contents. This chain consumes a lot (a lot!) of memory and incurs a huge amount of GC overhead (long chains of thunks are expensive). Sadness ensues.
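To make the shape of the problem concrete, here is a minimal sketch of a builder-based unescaper. It is not aeson's actual code: the name unescapeViaBuilder is hypothetical, and it handles only the simplest case, dropping each backslash and keeping the byte that follows it. What it shares with the description above is the accumulator: every escape appends to a Builder that nobody runs until the whole input has been scanned.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy as BL

-- | Hypothetical, simplified unescaper (an assumption for illustration,
-- not the aeson implementation). Each backslash is dropped and the byte
-- after it is kept verbatim; a lone trailing backslash is discarded.
unescapeViaBuilder :: B.ByteString -> B.ByteString
unescapeViaBuilder = BL.toStrict . BB.toLazyByteString . go mempty
  where
    -- 'acc' is a lazy accumulator: every escape appends to it, but the
    -- builder is only run once the entire input has been consumed.
    go :: BB.Builder -> B.ByteString -> BB.Builder
    go acc s =
      case B.elemIndex 92 s of          -- 92 is the ASCII code for '\'
        Nothing -> acc <> BB.byteString s
        Just i ->
          let (plain, rest) = B.splitAt i s
              -- 'rest' starts with the backslash; keep the byte after it
              escaped = B.take 1 (B.drop 1 rest)
          in go (acc <> BB.byteString plain <> BB.byteString escaped)
                (B.drop 2 rest)
```

With an input that is nearly all backslashes, the accumulator picks up a new pending append for every couple of input bytes, which is the kind of long chain described above.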
The “old ways” in the title refer to the fix: in place of a fancy streaming API, I simply allocate a single big buffer and blast the bytes straight into it.
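As a rough illustration of the single-buffer approach, here is a minimal sketch. It is not the code that went into aeson: unescapeSimple and decode are hypothetical names, and the sketch handles only single-character escapes, with no \u sequences and no error reporting. It relies on one fact: unescaping never makes a string longer, so a buffer the size of the input is always big enough.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Internal as BI
import qualified Data.ByteString.Unsafe as BU
import Data.Word (Word8)
import Foreign.Ptr (Ptr)
import Foreign.Storable (pokeByteOff)
import System.IO.Unsafe (unsafePerformIO)

-- | Hypothetical, simplified unescaper (an assumption for illustration,
-- not the aeson implementation). Allocates one buffer as large as the
-- input and writes the output bytes straight into it.
unescapeSimple :: B.ByteString -> B.ByteString
unescapeSimple src =
    -- createAndTrim allocates 'len' bytes, runs the writer, and trims the
    -- result to the number of bytes the writer reports (copying if shorter).
    unsafePerformIO (BI.createAndTrim len (\dst -> go dst 0 0))
  where
    len = B.length src

    -- 'i' is the read offset into 'src'; 'o' is the write offset into 'dst'.
    -- Since o <= i < len whenever we write, no bounds check is needed.
    go :: Ptr Word8 -> Int -> Int -> IO Int
    go dst i o
      | i >= len = return o
      | w == 92 && i + 1 < len = do     -- 92 is the ASCII code for '\'
          pokeByteOff dst o (decode (BU.unsafeIndex src (i + 1)))
          go dst (i + 2) (o + 1)
      | otherwise = do                  -- ordinary byte (or a trailing '\')
          pokeByteOff dst o w
          go dst (i + 1) (o + 1)
      where
        w = BU.unsafeIndex src i

-- | Map the byte after a backslash to the byte it stands for.
decode :: Word8 -> Word8
decode 110 = 10   -- \n
decode 116 = 9    -- \t
decode 114 = 13   -- \r
decode 98  = 8    -- \b
decode 102 = 12   -- \f
decode c   = c    -- \\, \", \/ and anything else map to themselves
```

The loop does a constant amount of work per input byte and leaves no deferred work behind, so a megabyte of backslashes costs one allocation of about a megabyte plus the final trim.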
For that pathological string with almost a megabyte of consecutive backslashes, the new implementation is 27x faster and uses 42x less memory, all for the cost of perhaps an hour of Sunday evening hacking (including a little enabling work that incidentally illustrates just how easy it is to work with monad transformers). Not bad!