attoparsec 0.9, a major (and abortive) release [updated]

Update: I just released attoparsec 0.9.1.0, which undoes all of the changes described below. The problem? While removing backtracking, I accidentally changed the semantics of the <|> operator in an unforeseen and unfortunate way. The bug I introduced was that a parser of the form (char 'a' *> char 'b') <|> char 'c' would now accept as valid the input "ac", which is clearly highly undesirable. Once my error was pointed out (see the comments at the end of this post), I quickly came up with a fix, but the fix itself had a problem: it regressed performance to being slower than when attoparsec performed backtracking! So now attoparsec backtracks in <|> again, until such time as I can come up with an approach that I'm happier with (and who knows when that will be). Whew.

Original article, for posterity:

A couple of days ago, I released version 0.9 of my attoparsec parsing library. Although its visible API remains completely unchanged over 0.8, I made a large and important change to the semantics. So if you use attoparsec, read on before you bump your build dependencies!

Prior to 0.9, attoparsec backtracked too aggressively, and hence held onto too much memory.

Consider the following parser:

(char 'f' *> char 'o') <|> (char 'f' *> char 'i')

Under attoparsec 0.8 and earlier, the <|> (choice) operator saves the input state before executing the left branch. If the left branch fails, it restores the input state, then executes the right branch. So if we supply an input of "fo", the left branch will succeed, and so the whole parser succeeds. But if we supply an input of "fi", the left branch will fail. The <|> operator will restore the input to "fi", and then the right branch will succeed.

This behaviour is very appealing, because it makes it easy to write readable parsers. Unfortunately, it comes with a big downside: it makes it too easy to write inefficient parsers, since <|> needs to hold onto the entire input of its left branch in case it fails.

In attoparsec 0.9, I've dropped that backtracking-by-default behaviour. This will require changes to some parsers that use attoparsec; I'll describe what's involved below. (By the way, this change makes attoparsec's behaviour more compatible with the classic Parsec library.)

Given the parser above, here's how it will now behave when given an input of "fi". The left branch will succeed on its first match against 'f', so the input will become "i" after 'f' has been consumed. At this point, the left branch will fail because it cannot match 'o' against the input of "i". The <|> combinator will switch to executing the right branch, but it will not reset the input. At this point, the right branch will attempt to match 'f' against an input that is still "i", and will thus fail.

This change affects not just the <|> combinator, but also its synonyms mplus and mappend. It also affects the behaviour of higher-level combinators that are constructed using <|>, such as many.

If your parser is affected by this change, there are two good courses of action.

The easier path is to use the try combinator on the left hand side of a <|> branch. This combinator causes a parser to backtrack on failure: if a parser wrapped by try fails, it will reset the input before passing the failure along. What would this look like in practice?

try (char 'f' *> char 'o') <|> (char 'f' *> char 'i')

This is certainly easy to do, and it won't affect the performance of your code compared to attoparsec 0.8: you'll have exactly the same backtracking and memory usage behaviours as before.

There is an alternative that doesn't require backtracking, and which should make your parsers both faster and less memory-hungry: the classic technique of left factoring your grammar. For the toy parser above, this is easy to do:

char 'f' *> (char 'o' <|> char 'i')

It's clear that we'll see a small performance boost here because we only match against 'f' once in this parser, while our first version did some unnecessary extra work by matching 'f' on each side of the branch. Left factoring is not always this easy, and it can make parsers more difficult to read, so sometimes using try is the better course to pursue.

Given that this change in semantics introduces some awkwardness, you might reasonably wonder why I made it. By making the use of backtracking explicit, parsers that don't need backtracking will become faster "for free". A realistic instance of this is my aeson JSON library, where by using the no-backtracking version of attoparsec, I saw improvements of up to 10% in parsing performance, and slightly larger reductions in memory footprint.

Posted in haskell
9 comments on “attoparsec 0.9, a major (and abortive) release [updated]
  1. Sebastian says:

    I like about parser combinators that the parsers correspond to a grammar.

    Does the change imply that the parser

    (char ‘f’ *> char ‘o’) (char ‘f’ *> char ‘i’)

    now accepts the word “ffi” when using attoparsec?

  2. Sebastian, you’re correct that it does, and I think that’s likely to be a bug. Hmm.

  3. augustss says:

    That sounds horrible. A more sane behavior would be that if the left branch matches at least one token you’re committed to that branch and will not try the right branch.

  4. Lennart, you’re right.

  5. Indeed, that is a horrific new bug. FWIW, classic Parsec uses the “if consumed then commit” version of the choice combinator, which can at least be reasoned about even if it is obnoxious.

    Do you have any plans for being (ability to be) able to detect extraneous calls to try? Modulo the necessary considerations of performance and resolving ambiguous parses, I’ve always viewed the choice operator as a symmetric operation in terms of the semantics of the grammar. Given this perspective, the only reliable way to build grammars with Parsec-like semantics is to (a) add a lot of unnecessary backtracking, or (b) offer backtracking and non-backtracking variants of all the various building block parsers. The former can introduce performance regressions over the attoparsec 0.8 behavior, and the latter is the same sort of code bloat that happens with monadic and non-monadic variants of HOFs.

    Another option, which I’ve considered pursuing in a new library, is to explicitly track these things in the type system. In particular, tracking whether a parser can consume input and then fail. (Other things worth tracking are whether the parser always succeeds, and whether it accepts the empty string.)

  6. If you do end up changing the semantics of choice to be more like classic Parsec, perhaps you should use a new combinator in order to reduce the backwards compatibility issues. The incremental-parser library uses (<) for this purpose, which seems like a good name for left-biased choice.

  7. Er, that’s (<<|>)

  8. Mathieu Boespflug says:

    Hi Brian,

    with the old (and current) semantics for (), what is the use for the ‘try’ combinator? Could it not be dispensed with?

  9. Han Don Son says:

    Have you ever considered including a function like take1While which takes the first element without the test and then does exactly the same as takeWhile? Or takeNWhile to generalize. (slightly different from takeWhile1 – which does do the test on the first element)

    I think a lot of parsers may benefit from this function in terms of performance. Although takeWhile is the recommended way to get ByteString fragments, it can’t handle situations like “Stop! There’s a sequence that might be a token! Oh, never mind.”

    Oh, by the way, thank you so much for this wonderful ByteString parser.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>