Over the past few months, the Sigma engineering team at Facebook has rolled out a major Haskell project: a rewrite of Sigma, an important weapon in our armory for fighting spam and malware.
Sigma has a mission-critical job, and it needs to scale: its growing workload currently sees it handling tens of millions of requests per minute.
The rewrite of Sigma in Haskell, using the Haxl library that Simon Marlow developed, has been a success. Throughput is higher than under its predecessor, and CPU usage is lower. Sweet!
Nevertheless, success brings with it surprises, and even though I haven’t worked on Sigma or Haxl, I’ve been implicated in one such surprise. To understand my accidental bit part in the show, let's begin by mentioning that Sigma uses JSON internally for various purposes. These days, the Haskell-powered Sigma uses aeson, the JSON library I wrote, to handle JSON data.
A few months ago, the Haxl rewrite of Sigma was going through an episode of crazytown, in which it would intermittently and unpredictably use huge amounts of CPU and memory. The culprit turned out to be JSON strings containing zillions of backslashes. (I have no idea why. If you’ve worked with large volumes of data for a long time, you won’t even bat an eyelash at the idea that a data store somewhere contains some really weird records.)
The team quickly mitigated the problem, and nudged me to look into the underlying cause. On Sunday evening, with a glass of red wine in hand, I finally dove in to see what was wrong.
Since the Sigma developers had figured out what was causing these time and space explosions, I immediately had a test case to work with, and the results were grim: decoding a mere megabyte of continuous backslashes took over a second, consumed over a gigabyte of memory, and killed concurrency by causing the runtime system to spend almost 90% of its time in the garbage collector. Yikes!
Whatever was going on? If you look at the old implementation of aeson’s unescape function, it seems quite efficient and innocuous. It’s reasonably tightly optimized low-level Haskell.
Trouble is, unescape uses an API (a bytestring builder) that is intended for streaming a result incrementally. Unfortunately the unescape function can’t hand any data back to its caller until it has processed an entire string.
The result is as you’d expect: we build a huge chain of thunks. In this case, the thunks will eventually write data efficiently into buffers. Alas, the thunks have nobody demanding the evaluation of their contents. This chain consumes a lot (a lot!) of memory and incurs a huge amount of GC overhead (long chains of thunks are expensive). Sadness ensues.
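To make the shape of the problem concrete, here is a minimal sketch of a builder-based unescaper. It is not aeson's actual code: the name unescapeViaBuilder is hypothetical, and it only handles the simplest case (drop the backslash, keep the byte after it). The point is that the accumulated Builder is just a growing pile of deferred work, and nothing is written anywhere until the whole input has been consumed.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy as BL

-- Hypothetical, simplified unescaper (an illustration, not aeson's code).
-- Every step appends to a lazily accumulated Builder. No bytes are
-- written until toLazyByteString finally runs the whole chain at the end.
unescapeViaBuilder :: B.ByteString -> B.ByteString
unescapeViaBuilder src = BL.toStrict (BB.toLazyByteString (go mempty src))
  where
    go acc bs = case B.uncons bs of
      Nothing -> acc
      Just (w, rest)
        -- 92 is the ASCII code for a backslash: skip it, keep the next byte.
        | w == 92, Just (c, rest') <- B.uncons rest ->
            go (acc <> BB.word8 c) rest'
        -- Any other byte (or a lone trailing backslash) is copied as-is.
        | otherwise -> go (acc <> BB.word8 w) rest
```

On a short input this is harmless. On a megabyte of backslashes, the accumulator grows by one deferred append per escape before any of them is run, which is the kind of chain described above.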
The “old ways” in the title refer to the fix: in place of a fancy streaming API, I simply allocate a single big buffer and blast the bytes straight into it.
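As a rough sketch of what "one big buffer" looks like in practice (again, not the actual aeson patch), the following uses unsafeCreateUptoN from Data.ByteString.Internal. The function name unescapeIntoBuffer is hypothetical, and it only handles the simplest escape: drop the backslash, keep the byte after it. The key assumption is that unescaping never makes a string longer, so a buffer the size of the input is always big enough.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Internal as BI
import qualified Data.ByteString.Unsafe as BU
import Data.Word (Word8)
import Foreign.Ptr (Ptr)
import Foreign.Storable (pokeByteOff)

-- Hypothetical, simplified unescaper (an illustration, not aeson's code).
-- Allocates one buffer as long as the input, writes bytes directly into
-- it, and reports how many bytes were actually used.
unescapeIntoBuffer :: B.ByteString -> B.ByteString
unescapeIntoBuffer src = BI.unsafeCreateUptoN len (go 0 0)
  where
    len = B.length src

    backslash :: Word8
    backslash = 92

    -- i is the read offset into src, o is the write offset into dst.
    go :: Int -> Int -> Ptr Word8 -> IO Int
    go i o dst
      | i >= len = return o
      | w == backslash && i + 1 < len = do
          -- Skip the backslash and copy the escaped byte through.
          pokeByteOff dst o (BU.unsafeIndex src (i + 1))
          go (i + 2) (o + 1) dst
      | otherwise = do
          -- Ordinary byte (or a lone trailing backslash): copy as-is.
          pokeByteOff dst o w
          go (i + 1) (o + 1) dst
      where
        w = BU.unsafeIndex src i
```

Each iteration writes a byte immediately, so there is nothing left pending by the time the loop finishes. The price is dropping down to raw pointers and unsafe indexing, which is why the bounds checks in the guards matter.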
For that pathological string with almost a megabyte of consecutive backslashes, the new implementation is 27x faster and uses 42x less memory, all for the cost of perhaps an hour of Sunday evening hacking (including a little enabling work that incidentally illustrates just how easy it is to work with monad transformers). Not bad!