A major upgrade to attoparsec: more speed, more power

I’m pleased to introduce the third generation of my attoparsec parsing library. With a major change to its internals, it is both faster and more powerful than previous versions, while remaining backwards compatible.

Comparing to C

Let’s start with a speed comparison between the hand-written C code that powers Node.js’s HTTP parser and an idiomatic Haskell parser that uses attoparsec. There are good reasons to take these numbers with a fistful of salt, so imagine huge error bars, warning signs, and whatnot—but they’re still interesting.

A little explanation is in order for why there are two entries for http-parser. The “null” driver consists of a series of empty callbacks, and represents the best possible performance we can get. The “naive” http-parser driver allocates memory for both a request and each of its headers, and frees this memory once a request parse has finished. (A real user of http-parser is likely to be slower than the naive driver, as http-parser forces its clients to do complex book-keeping.)

Meanwhile, the attoparsec parser is of course tiny: a few dozen lines of code, instead of a few thousand. More interestingly, it’s faster than its do-nothing C counterpart. When I last compared the two, back in 2010, attoparsec was a little over half the speed of http-parser, so to pass it feels like an exciting development.

To be clear, you really shouldn’t treat comparing the two as anything other than a fast-and-loose exercise. The attoparsec parser does less work in some ways, for instance by not special-casing the Content-Length header. At the same time, it does more work in a different, but perhaps more important case: there’s no equivalent of the maze of edge cases that arise with http-parser when a parse spans a boundary between blocks of input. The attoparsec programming model is simply way less hairy.

Caveats aside, my purpose with this comparison is to paint with broad strokes what I hope is a compelling picture: you can write a compact, clean parser using attoparsec, and you can expect it to perform well.

Speed improvements

Compared to the previous version of attoparsec, the new internals of this version yield some solid speedups. On attoparsec’s own microbenchmark suite, speedups range from flat to nearly 2x.

If you use the aeson JSON library to parse JSON data that contains a lot of numbers, you can expect a nice boost in performance.

Space usage

In addition to being faster, attoparsec is now generally more space efficient too. In a test of an application that uses Johan Tibell’s cassava library for handling CSV files, the app used 39% less memory with the new version of attoparsec than before, while running 5% faster.

New API fun

The new internals of attoparsec allowed me to add a feature I’ve wanted for a few years, one I had given up on as impossible with the previous internals.

match :: Parser a -> Parser (ByteString, a)

Given an arbitrary parser, match returns both the result of the parse and the string that it consumed while matching.

>>> let p = (,) <$> decimal <*> ("," *> decimal)
>>> parseOnly (match p) "1,31337"
Right ("1,31337",(1,31337))

This is very handy when what you’re interested in is not just the components of a parse result, but also the precise input string that the parser matched. (Imagine using this to save the contents of a comment while parsing a programming language, for instance.)
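As a small illustration of that comment-saving idea, here is one way to keep the exact source text of a line comment using match. The lineComment parser is invented for this example; only match, string, and takeTill are attoparsec's own.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch: use match to capture both a comment's body and the exact
-- bytes the parser consumed. lineComment is made up for this example.
import Data.Attoparsec.ByteString.Char8
import Data.ByteString (ByteString)

-- Matches "-- comment text" up to the end of the line, returning the
-- full matched text alongside just the body after the dashes.
lineComment :: Parser (ByteString, ByteString)
lineComment = match (string "--" *> takeTill (== '\n'))
```

Running `parseOnly lineComment "-- hi\nnext"` should yield `Right ("-- hi", " hi")`: the first component is everything the parser touched, the second is only the result of the inner parser.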

The old internals

What changed to yield both big performance improvements and previously impossible capabilities? To understand this, let’s discuss how attoparsec worked until today.

The age-old way to write parser libraries in Haskell is to treat parsing as a job of consuming input from the front of a string. If you want to match the string "foo" and your input is "foobar", you pull the prefix from "foobar" and hand "bar" to your successor parser as its input. This is how attoparsec used to work, and we’ll see where it becomes relevant in a moment.
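The consuming model can be sketched in a few lines of ordinary Haskell. This is an illustration only, not attoparsec's actual definition, which also has to cope with partial input and failure continuations.

```haskell
-- A minimal model of "parsing as consuming input": a parser takes the
-- remaining input and, on success, returns a result plus what's left.
newtype Consume a = Consume { runConsume :: String -> Maybe (a, String) }

-- Match a literal prefix and hand the leftover input to whoever runs next.
literal :: String -> Consume String
literal s = Consume $ \inp ->
  case splitAt (length s) inp of
    (pre, rest) | pre == s -> Just (pre, rest)
    _                      -> Nothing
```

So `runConsume (literal "foo") "foobar"` gives `Just ("foo", "bar")`: the prefix is gone for good, and only `"bar"` flows onward.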

One of attoparsec’s major selling points is that it works with incomplete input. If we give it insufficient input to make a decision about what to do, it will tell us.

>>> parse ("bar" <|> "baz") "ba"
Partial _

If we get a Partial constructor, we resume parsing by feeding more input to the continuation it hands us. The easiest way is to use feed:

>>> let cont = parse ("bar" <|> "baz") "ba"
>>> cont `feed` "r"
Done "" "bar"
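For a stream of chunks, the feed step generalizes to a small driver loop. The runChunks helper below is a sketch I've invented to show the shape of such a loop (attoparsec also ships a parseWith combinator for this kind of job):

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch: drive a parser over a list of input chunks, feeding more
-- input whenever the parser returns Partial. runChunks is invented here.
import Data.Attoparsec.ByteString.Char8
import Data.ByteString (ByteString)

runChunks :: Parser a -> [ByteString] -> Result a
runChunks p chunks = go (parse p first) rest
  where
    (first, rest) = case chunks of
                      []       -> ("", [])
                      (c : cs) -> (c, cs)
    go r []       = feed r ""             -- out of input: signal end of stream
    go r (c : cs) = case r of
      Partial _ -> go (feed r c) cs       -- parser wants more; feed a chunk
      done      -> done                   -- finished (or failed) early
```

For example, `runChunks (string "bar") ["b", "a", "r"]` ends in `Done "" "bar"`, just as the single feed did above.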

Continuations interact in an interesting way with backtracking. Let’s talk a little about backtracking in isolation first.

>>> let lefty = Left <$> decimal <* ".!"
>>> let righty = Right <$> rational

The parser lefty will not succeed until it has read a decimal number followed by the literal string ".!".

Suppose we get partway through a parse on input like this.

>>> let cont = parse (lefty <|> righty) "123."
>>> cont
Partial _

Even though the decimal portion of lefty has succeeded, if we feed the string "1!" to the continuation, lefty as a whole will fail, parsing will backtrack to the beginning of the input, and righty will succeed.

>>> cont `feed` "1!"
Done "!" (Right 123.1)

What’s happening behind the scenes here is important.

Under the old version of attoparsec, parsing proceeds by consuming input. By the time we reach the "." in the input of "123.", we have thrown away the leading "123" as a result of decimal succeeding, so our remaining input is "." when we ask for more.

The <|> combinator holds onto the original input in case a parse fails. Since a parse may need to ask for more input before it fails (as in this case), the old attoparsec has to keep track of this additional continuation-fed input separately, and glue the saved and added inputs together on each backtrack. Worse yet, sometimes we have to throw away added input in order to avoid double-counting it.

This surely sounds complicated and fragile, but it was the only scheme I could think of that would work under the “parsing as consuming input” model that attoparsec started with. I managed to make this setup run fast enough that (once I’d worked the bugs out) I wasn’t too bothered by the additional complexity.

From strings to buffers and cursors

Under the old model, then, attoparsec consumed input, and for correctness when backtracking it kept its book-keeping of added input separately.

Under the new model, we manage input and added input in one unified Buffer abstraction. We track our position using a separate cursor, which is simply an integer index into a Buffer.

If we need to backtrack, we simply hand the current Buffer to the alternate parser, along with the cursor that will restart parsing at the right spot.
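A toy version of the idea makes the payoff concrete. All the names and types here are assumed for illustration; the real Buffer is considerably more elaborate.

```haskell
-- Toy cursor-based parser: the whole input is one shared buffer, and a
-- parser's position is just an Int. (Illustrative; not attoparsec's code.)
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC  -- for packing literals

newtype Cursor a = Cursor { runCursor :: ByteString -> Int -> Maybe (Int, a) }

-- Match a literal at the current position, advancing the cursor.
lit :: ByteString -> Cursor ByteString
lit s = Cursor $ \buf pos ->
  if s `B.isPrefixOf` B.drop pos buf
    then Just (pos + B.length s, s)
    else Nothing

-- Backtracking is trivial: on failure, rerun from the *same* cursor.
-- Nothing is consumed, so nothing has to be glued back together.
orElse :: Cursor a -> Cursor a -> Cursor a
orElse (Cursor f) (Cursor g) = Cursor $ \buf pos ->
  maybe (g buf pos) Just (f buf pos)
```

Here `runCursor (lit (BC.pack "bar") `orElse` lit (BC.pack "baz")) (BC.pack "bazzz") 0` succeeds at cursor 3: the failed first branch left the buffer and cursor untouched.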

The idea of parsing with a cursor isn’t mine; it came up during a late night IRC conversation with Ed Kmett. I’m excited that this change happened to make it easy to add a new combinator, match, which had previously seemed impossible to write.

match :: Parser a -> Parser (ByteString, a)

In the new cursor-based world, all we need to build match is to remember the cursor position when we start parsing. If the parse succeeds, we extract the substring that spans the old and new cursor positions. I spent quite a bit of time pondering this problem with the old representation without getting anywhere, but by changing the internal representation, it suddenly became trivial.
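In a toy cursor representation (same caveat as before: invented names, not attoparsec's real internals), the whole trick fits in a couple of lines:

```haskell
-- A toy cursor parser, just enough to show the idea behind match.
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC  -- for packing literals

newtype Cursor a = Cursor { runCursor :: ByteString -> Int -> Maybe (Int, a) }

lit :: ByteString -> Cursor ByteString
lit s = Cursor $ \buf pos ->
  if s `B.isPrefixOf` B.drop pos buf
    then Just (pos + B.length s, s)
    else Nothing

-- Remember the cursor before the parse; on success, the matched input
-- is exactly the bytes between the old and new cursor positions.
matched :: Cursor a -> Cursor (ByteString, a)
matched (Cursor f) = Cursor $ \buf pos ->
  case f buf pos of
    Just (pos', a) -> Just (pos', (B.take (pos' - pos) (B.drop pos buf), a))
    Nothing        -> Nothing
```

Because the buffer is never consumed, the matched span is always still there to slice out; under the consuming model, those bytes were long gone.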

Switching to the cursor-based representation accounts for some of the performance improvements in the new release, as it opened up a few new avenues for further small tweaks.

Bust my buffers!

There’s another implementation twist, though: why is the Buffer type not simply a ByteString? Here, the question is one of efficiency, specifically behaviour in response to pathologically crafted inputs.

Every time someone feeds us input via the Partial continuation, we have to add this to the input we already have. The obvious thing to do is treat Buffer as a glorified ByteString and simply string-append the new input to the existing input and get on with life.

Troublingly, this approach would require two string copies per append: we’d allocate a new string, copy the original string into it, then tack the appended string on the end. It’s easy to see that this has quadratic time complexity, which would allow a hostile attacker to DoS us by simply drip-feeding us a large volume of valid data, one byte at a time.

The new Buffer structure addresses such attacks by exponential doubling, such that most appends require only one string copy instead of two. This improves the worst-case time complexity of being drip-fed extra input from O(n²) to O(n log n).
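A back-of-the-envelope model of the two append policies (illustrative arithmetic only, not the real implementation) shows why doubling matters:

```haskell
-- Count bytes copied while being drip-fed n one-byte chunks.

-- Naive append: every append copies the entire accumulated buffer, so
-- the k-th append copies k bytes -- quadratic in total.
naiveCopies :: Int -> Int
naiveCopies n = sum [1 .. n]

-- Doubling: a full copy happens only when capacity runs out, at sizes
-- 1, 2, 4, ..., so the total copied grows far more slowly.
doublingCopies :: Int -> Int
doublingCopies n = sum (takeWhile (< n) (iterate (* 2) 1))
```

At n = 10000, the naive policy copies 50,005,000 bytes while the doubling policy copies 16,383: the drip-feed attack stops being worth an attacker's while.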

Preserving safety and speed

Making this work took a bit of a hack. The Buffer type contains a mutable array that contains both an immutable portion (visible to users) and an invisible mutable part at the end. Every time we append, we write to the mutable array, and hand back a Buffer that widens its immutable portion to include the part we just wrote to. The array is shared across successive Buffers until we run out of space.

This is very fast, but it’s also unsafe: nobody should ever append to the same Buffer twice, as the sharing of the array can lead to data corruption. Let’s think about how this could arise. Our original Buffer still thinks it can write to the mutable portion of an array, while our new Buffer considers the same area of memory to be immutable. If we append to the original Buffer again, we will scribble on memory that the new Buffer thinks is immutable.

Since neither our choice of API nor Haskell’s type system can prevent bad actions here, users are free to make the programming error of appending to a Buffer more than once, even though it makes no sense to do so. It’s not satisfactory to have pure code react badly even when the programmer is doing something wrong, so I addressed this problem in an interesting way.

The immutable shell of a Buffer contains a generation number. We embed a mutable generation number in the shared array that each Buffer points to. We increment the mutable generation number every time we append to a Buffer, and hand back a Buffer that also has an incremented immutable generation number.

The mutable and immutable generation numbers should always agree. If they fall out of sync, we know that someone is appending to a Buffer more than once. We react by duplicating the mutable array, so that the new append cannot interfere with the existing array. This amounts to a cheap copy-on-write scheme: copies never occur in the typical case of users behaving sensibly, while we preserve correctness if a programmer starts doing daft things.
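The scheme can be mimicked in miniature with an IORef standing in for the shared mutable array. Every name here is invented for illustration; the real Buffer works on raw bytes, not Strings.

```haskell
-- Toy model of the generation-number trick: an immutable shell
-- (generation + visible length) around a shared mutable cell.
import Data.IORef

data Buf = Buf { bufGen :: Int, bufLen :: Int, bufCell :: IORef (Int, String) }

newBuf :: String -> IO Buf
newBuf s = Buf 0 (length s) <$> newIORef (0, s)

-- Read only the immutable portion this Buf is allowed to see.
contents :: Buf -> IO String
contents (Buf _ n ref) = take n . snd <$> readIORef ref

append :: Buf -> String -> IO Buf
append (Buf g n ref) s = do
  (g', str) <- readIORef ref
  if g == g'
    then do  -- generations agree: extend in place, bump both generations
      writeIORef ref (g + 1, str ++ s)
      return (Buf (g + 1) (n + length s) ref)
    else do  -- mismatch: someone already appended to this Buf, so
             -- copy-on-write our immutable prefix into a fresh cell
      ref' <- newIORef (g + 1, take n str ++ s)
      return (Buf (g + 1) (n + length s) ref')
```

Appending twice to the same Buf exercises the slow path: the second append sees the generation mismatch, copies its own immutable prefix, and leaves the first append's result untouched.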

Assurance

Before I embarked on this redesign, I doubled the size of attoparsec’s test and benchmark suites. This gave me a fair sense of safety that I wouldn’t accidentally break code as I went.

Once the rate of churn settled down, I found the most significant packages using attoparsec on Hackage and tried them out.

This revealed that an incompatible change I’d made in the core Parser type caused quite a lot of downstream build breakage, with a third of the packages that I tried failing to build. This was a good motivator for me to learn how to fix the problem.

Once I fixed this self-imposed difficulty, all of the top packages turned out to be API-compatible with the new release. It was definitely helpful to have a tool that let me find important users of the package.

Between the expanded test suite, better benchmarks, and this extra degree of checking, I am now feeling moderately confident that the sweeping changes I’ve made should be fairly safe to inflict on people. I hope I’m right! Please enjoy the results of my work.

package         mojo   status
aeson          10000   clean
snap-core       2030   requires --allow-newer
conduit-extra   1816   clean
fay             1740   clean
snap            1681   requires --allow-newer
conduit-extra   1492   clean
persistent      1487   clean
yaml            1313   clean
io-streams      1205   requires --allow-newer
configurator    1161   clean
yesod-form      1077   requires --allow-newer
snap-server      889   requires --allow-newer
heist            881   requires --allow-newer
parsers          817   clean
cassava          643   clean

And finally

When I was compiling the list of significant packages using attoparsec, I made a guess that the Unix rev command would reverse the order of lines in a file. What it does instead seems much less useful: it reverses the bytes on each line.

Why do I mention this? Because my mistake led to the discovery that there’s a surprising number of Haskell packages whose names read at least as well backwards as forwards.

citats-dosey           revres-foornus
corpetic-codnap        rotaremune-cesrapotta
eroc-ognid             rotaremune-ptth
eroc-pans              rotarugifnoc
forp-colla-emit-chg    sloot-ipa
kramtsop               stekcosbew
morf-gnirtsetyb        teppup-egaugnal
nosea                  tropmish
revirdbew              troppus-ipa-krowten

(And finally-most-of-all, if you’re curious about where I measured my numbers, I used my 2011-era 2.2GHz MacBook Pro running 64-bit GHC 7.6.3. Server-class hardware should do way better.)

Comments
  1. Great work!

    BTW, “tac” reverses files line-by-line.

  2. rntz says:

    Is your copy-on-write scheme thread-safe? What if two threads try to append data at the same time and race?

  3. Mason Bially says:

    The generation numbers appear to use an atomic operation before any other actions, and since copy-on-write is otherwise inherently thread-safe when copying immutable data, it would appear so.

    Commit: https://github.com/bos/attoparsec/commit/bf37cfd07b8206a5bf4fc4fe02b079faad27af2c#diff-ea6084ddf77b2c9f6460b1871fbd34adL70

    Thank you for this post, I’m about to write a parsing library for a project I’m working on and this gave me some good tips.
