I’m pleased to introduce the third generation of my attoparsec parsing library. With a major change to its internals, it is both faster and more powerful than previous versions, while remaining backwards compatible.
Comparing to C
Let’s start with a speed comparison between the hand-written C code that powers Node.js’s HTTP parser and an idiomatic Haskell parser that uses attoparsec. There are good reasons to take these numbers with a fistful of salt, so imagine huge error bars, warning signs, and whatnot—but they’re still interesting.
A little explanation is in order for why there are two entries for http-parser. The “null” driver consists of a series of empty callbacks, and represents the best possible performance we can get. The “naive” http-parser driver allocates memory for both a request and each of its headers, and frees this memory once a request parse has finished. (A real user of http-parser is likely to be slower than the naive driver, as http-parser forces its clients to do complex book-keeping.)
Meanwhile, the attoparsec parser is of course tiny: a few dozen lines of code, instead of a few thousand. More interestingly, it’s faster than its do-nothing C counterpart. When I last compared the two, back in 2010, attoparsec was a little over half the speed of http-parser, so pulling ahead of it feels like an exciting development.
To be clear, you really shouldn’t treat comparing the two as anything other than a fast-and-loose exercise. The attoparsec parser does less work in some ways, for instance by not special-casing the Content-Length header. At the same time, it does more work in a different, but perhaps more important case: there’s no equivalent of the maze of edge cases that arise with http-parser when a parse spans a boundary between blocks of input. The attoparsec programming model is simply way less hairy.
Caveats aside, my purpose with this comparison is to paint with broad strokes what I hope is a compelling picture: you can write a compact, clean parser using attoparsec, and you can expect it to perform well.
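To make that picture concrete, here is a minimal sketch of what an attoparsec request-line parser can look like. This is an illustration written for this post, not the benchmark code itself, so treat the names and record fields as assumptions.

{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<$>), (<*>), (<*))
import Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString as B

data Request = Request {
    requestMethod  :: B.ByteString
  , requestUri     :: B.ByteString
  , requestVersion :: (Int, Int)
  } deriving (Show)

-- parse a request line such as "GET /index.html HTTP/1.1\r\n"
requestLine :: Parser Request
requestLine = Request
  <$> takeWhile1 isAlpha_ascii                   -- method
  <*  char ' '
  <*> takeWhile1 (/= ' ')                        -- request-URI
  <*  string " HTTP/"
  <*> ((,) <$> decimal <* char '.' <*> decimal)  -- version
  <*  endOfLine

Running parseOnly requestLine on "GET / HTTP/1.1\r\n" yields Right (Request "GET" "/" (1,1)).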
Speed improvements
Compared to the previous version of attoparsec, the new internals of this version yield some solid speedups. On attoparsec’s own microbenchmark suite, speedups range from flat to nearly 2x.
If you use the aeson JSON library to parse JSON data that contains a lot of numbers, you can expect a nice boost in performance.
Space usage
In addition to being faster, attoparsec is now generally more space efficient too. In a test of an application that uses Johan Tibell’s cassava library for handling CSV files, the app used 39% less memory with the new version of attoparsec than before, while running 5% faster.
New API fun
The new internals of attoparsec allowed me to add a feature I’ve wanted for a few years, one I had given up on as impossible with the previous internals.
match :: Parser a -> Parser (ByteString, a)
Given an arbitrary parser, match returns both the result of the parse and the string that it consumed while matching.
>>> let p = (,) <$> decimal <*> ("," *> decimal)
>>> parseOnly (match p) "1,31337"
Right ("1,31337",(1,31337))
This is very handy when what you’re interested in is not just the components of a parse result, but also the precise input string that the parser matched. (Imagine using this to save the contents of a comment while parsing a programming language, for instance.)
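For instance, here is a small sketch of that comment-capturing idea. The comment syntax and the names commentBody and comment are invented for illustration; only match itself comes from the library.

{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((*>))
import Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString as B

-- the body of a line comment, minus its "--" prefix
commentBody :: Parser B.ByteString
commentBody = string "--" *> takeTill (== '\n')

-- the exact source text of the comment, paired with its body
comment :: Parser (B.ByteString, B.ByteString)
comment = match commentBody

>>> parseOnly comment "-- hi\n"
Right ("-- hi"," hi")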
The old internals
What changed to yield both big performance improvements and previously impossible capabilities? To understand this, let’s discuss how attoparsec worked until today.
The age-old way to write parser libraries in Haskell is to treat parsing as a job of consuming input from the front of a string. If you want to match the string "foo" and your input is "foobar", you pull the prefix off "foobar" and hand "bar" to your successor parser as its input. This is how attoparsec used to work, and we’ll see where it becomes relevant in a moment.
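As a toy illustration of that model (not attoparsec’s actual representation), a parser can be seen as a function from the remaining input to a result plus the unconsumed suffix:

import qualified Data.ByteString as B

-- a "consuming" parser: take the remaining input, return a result
-- and whatever input was not consumed (Nothing on failure)
type Consuming a = B.ByteString -> Maybe (a, B.ByteString)

-- match a literal string, handing the suffix to the successor parser
toyString :: B.ByteString -> Consuming B.ByteString
toyString s input
  | s `B.isPrefixOf` input = Just (s, B.drop (B.length s) input)
  | otherwise              = Nothing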
One of attoparsec’s major selling points is that it works with incomplete input. If we give it insufficient input to make a decision about what to do, it will tell us.
>>> parse ("bar" <|> "baz") "ba"
Partial _
If we get a Partial constructor, we resume parsing by feeding more input to the continuation it hands us. The easiest way to do that is with feed:
>>> let cont = parse ("bar" <|> "baz") "ba"
>>> cont `feed` "r"
Done "" "bar"
Continuations interact in an interesting way with backtracking. Let’s talk a little about backtracking in isolation first.
>>> let lefty = Left <$> decimal <* ".!"
>>> let righty = Right <$> rational
The parser lefty will not succeed until it has read a decimal number followed by the string ".!".
Suppose we get partway through a parse on input like this.
>>> let cont = parse (lefty <|> righty) "123."
>>> cont
Partial _
Even though the decimal portion of lefty has succeeded, if we feed the string "1!" to the continuation, lefty as a whole will fail, parsing will backtrack to the beginning of the input, and righty will succeed.
>>> cont `feed` "1!"
Done "!" Right 123.1
What’s happening behind the scenes here is important.
Under the old version of attoparsec, parsing proceeds by consuming input. By the time we reach the "." in the input "123.", we have thrown away the leading "123" as a result of decimal succeeding, so our remaining input is "." when we ask for more.
The <|> combinator holds onto the original input in case a parse fails. Since a parse may need to ask for more input before it fails (as in this case), the old attoparsec had to keep track of this additional continuation-fed input separately, and glue the saved and added inputs together on each backtrack. Worse yet, sometimes it had to throw away added input in order to avoid double-counting it.
This surely sounds complicated and fragile, but it was the only scheme I could think of that would work under the “parsing as consuming input” model that attoparsec started with. I managed to make this setup run fast enough that (once I’d worked the bugs out) I wasn’t too bothered by the additional complexity.
From strings to buffers and cursors
The old model, then, was that we consumed input as we went, and for correctness when backtracking we did our book-keeping of added input separately.
Under the new model, we manage the original input and all added input in one unified Buffer abstraction. We track our position using a separate cursor, which is simply an integer index into the Buffer. If we need to backtrack, we simply hand the current Buffer to the alternate parser, along with the cursor that will restart parsing at the right spot.
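In toy form (a sketch of the idea, not attoparsec’s real types), the parser state now looks something like this:

import qualified Data.ByteString as B

-- all input seen so far lives in one buffer that only ever grows;
-- a parser's position is just an index into it, so backtracking is
-- simply a matter of restoring an earlier index
data ToyState = ToyState {
    toyBuffer :: B.ByteString
  , toyCursor :: Int
  }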
The idea of parsing with a cursor isn’t mine; it came up during a late night IRC conversation with Ed Kmett. I’m excited that this change happened to make it easy to add a new combinator, match, which had previously seemed impossible to write.
match :: Parser a -> Parser (ByteString, a)
In the new cursor-based world, all we need to build match is to remember the cursor position when we start parsing. If the parse succeeds, we extract the substring that spans the old and new cursor positions. I spent quite a bit of time pondering this problem under the old representation without getting anywhere, but with the change of internal representation it suddenly became trivial.
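Here is the idea as a sketch, using an invented cursor-based parser type rather than attoparsec’s real internals:

import qualified Data.ByteString as B

-- a toy cursor-based parser: given the whole buffer and a starting
-- cursor, return a result and the new cursor on success
type Cursored a = B.ByteString -> Int -> Maybe (a, Int)

-- remember where we started; on success, slice out the bytes between
-- the old and new cursor positions
toyMatch :: Cursored a -> Cursored (B.ByteString, a)
toyMatch p buf start = do
  (a, end) <- p buf start
  let consumed = B.take (end - start) (B.drop start buf)
  return ((consumed, a), end)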
Switching to the cursor-based representation accounts for some of the performance improvements in the new release, as it opened up a few new avenues for further small tweaks.
Bust my buffers!
There’s another implementation twist, though: why is the Buffer type not simply a ByteString? Here, the question is one of efficiency, specifically behaviour in response to pathologically crafted inputs.
Every time someone feeds us input via the Partial continuation, we have to add it to the input we already have. The obvious thing to do is to treat Buffer as a glorified ByteString: simply string-append the new input to the existing input and get on with life.
Troublingly, this approach would require two string copies per append: we’d allocate a new string, copy the original string into it, then tack the appended string on the end. It’s easy to see that this has quadratic time complexity, which would allow a hostile attacker to DoS us by simply drip-feeding us a large volume of valid data, one byte at a time.
The new Buffer structure addresses such attacks by exponential doubling, such that most appends require only one string copy instead of two. This improves the worst-case time complexity of being drip-fed extra input from O(n²) to O(n log n).
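To see the attack concretely: with plain appends, feeding one byte at a time until the input reaches n bytes copies roughly 1 + 2 + ... + n ≈ n²/2 bytes in total. Here is a sketch of the geometric growth policy; this is my own simplification of the idea, with invented names.

-- decide how large the backing store must be to fit an append;
-- growing geometrically keeps reallocations (and hence copies of
-- the existing input) rare
newCapacity :: Int -> Int -> Int -> Int
newCapacity capacity used wanted
  | used + wanted <= capacity = capacity                  -- fits in spare space
  | otherwise                 = max (capacity * 2) (used + wanted)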
Preserving safety and speed
Making this work took a bit of a hack. The Buffer type contains a mutable array, consisting of an immutable portion (visible to users) and an invisible mutable part at the end. Every time we append, we write into the mutable part of the array and hand back a Buffer whose immutable portion has widened to include the bytes we just wrote. The array is shared across successive Buffers until we run out of space.
This is very fast, but it’s also unsafe: nobody should ever append to the same Buffer twice, as the sharing of the array can lead to data corruption. Here is how that could arise. Our original Buffer still thinks it can write to the mutable portion of the array, while our new Buffer considers the same region of memory to be immutable. If we append to the original Buffer again, we will scribble on memory that the new Buffer thinks is immutable.
Since neither our choice of API nor Haskell’s type system can prevent bad actions here, users are free to make the programming error of appending to a Buffer more than once, even though it makes no sense to do so. Pure code should not misbehave even when the programmer is doing something wrong, so I addressed this problem in an interesting way.
The immutable shell of a Buffer contains a generation number, and we embed a mutable generation number in the shared array that each Buffer points to. Every time we append, we increment the mutable generation number and hand back a Buffer whose immutable generation number has also been incremented.
The mutable and immutable generation numbers should always agree. If they fall out of sync, we know that someone is appending to a Buffer more than once, and we react by duplicating the mutable array, so that the new append cannot interfere with the existing one. This amounts to a cheap copy-on-write scheme: copies never occur in the typical case of users behaving sensibly, while we preserve correctness if a programmer starts doing daft things.
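As a toy model of that scheme (names invented, with an IORef standing in for the generation slot inside the shared mutable array):

import Data.IORef

data ToyBuffer = ToyBuffer {
    immutableGen :: Int        -- frozen into each ToyBuffer value
  , sharedGen    :: IORef Int  -- lives alongside the shared array
  }

-- append safely: if the generations disagree, someone already
-- appended to this ToyBuffer, so copy before writing
toyAppend :: ToyBuffer -> IO ToyBuffer
toyAppend buf = do
  gen <- readIORef (sharedGen buf)
  shared <- if gen /= immutableGen buf
            then newIORef (immutableGen buf)  -- copy-on-write
            else return (sharedGen buf)
  -- ... copy the appended bytes into the (possibly fresh) array ...
  writeIORef shared (immutableGen buf + 1)
  return buf { immutableGen = immutableGen buf + 1, sharedGen = shared }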
Assurance
Before I embarked on this redesign, I doubled the size of attoparsec’s test and benchmark suites. This gave me a fair sense of safety that I wouldn’t accidentally break code as I went.
Once the rate of churn settled down, I found the most significant packages using attoparsec on Hackage and tried them out.
This revealed that an incompatible change I’d made to the core Parser type caused quite a lot of downstream build breakage, with a third of the packages I tried failing to build. This was a good motivator for me to figure out how to fix the problem.
Once I fixed this self-imposed difficulty, all of the top packages turned out to be API-compatible with the new release. It was definitely helpful to have a tool that let me find important users of the package.
Between the expanded test suite, better benchmarks, and this extra degree of checking, I am now feeling moderately confident that the sweeping changes I’ve made should be fairly safe to inflict on people. I hope I’m right! Please enjoy the results of my work.
package       | mojo  | status
------------- | ----- | ----------------------
aeson         | 10000 | clean
snap-core     | 2030  | requires --allow-newer
conduit-extra | 1816  | clean
fay           | 1740  | clean
snap          | 1681  | requires --allow-newer
conduit-extra | 1492  | clean
persistent    | 1487  | clean
yaml          | 1313  | clean
io-streams    | 1205  | requires --allow-newer
configurator  | 1161  | clean
yesod-form    | 1077  | requires --allow-newer
snap-server   | 889   | requires --allow-newer
heist         | 881   | requires --allow-newer
parsers       | 817   | clean
cassava       | 643   | clean
And finally
When I was compiling the list of significant packages using attoparsec, I guessed that the Unix rev command would reverse the order of lines in a file. What it actually does seems much less useful: it reverses the bytes on each line.
Why do I mention this? Because my mistake led to the discovery that there’s a surprising number of Haskell packages whose names read at least as well backwards as forwards.
citats-dosey revres-foornus
corpetic-codnap rotaremune-cesrapotta
eroc-ognid rotaremune-ptth
eroc-pans rotarugifnoc
forp-colla-emit-chg sloot-ipa
kramtsop stekcosbew
morf-gnirtsetyb teppup-egaugnal
nosea tropmish
revirdbew troppus-ipa-krowten
(And finally-most-of-all, if you’re curious about where I measured my numbers, I used my 2011-era 2.2GHz MacBook Pro running 64-bit GHC 7.6.3. Server-class hardware should do way better.)