Update: I just released attoparsec 0.9.1.0, which undoes all of the changes described below. The problem? While removing backtracking, I accidentally changed the semantics of the
<|> operator in an unforeseen and unfortunate way. The bug I introduced was that a parser of the form
(char 'a' *> char 'b') <|> char 'c' would now accept as valid the input
"ac", which is clearly highly undesirable. Once my error was pointed out (see the comments at the end of this post), I quickly came up with a fix, but the fix itself had a problem: it regressed performance to being slower than when attoparsec performed backtracking! So now attoparsec backtracks in
<|> again, until such time as I can come up with an approach that I'm happier with (and who knows when that will be). Whew.
Original article, for posterity:
A couple of days ago, I released version 0.9 of my attoparsec parsing library. Although its visible API remains completely unchanged over 0.8, I made a large and important change to the semantics. So if you use attoparsec, read on before you bump your build dependencies!
Prior to 0.9, attoparsec backtracked too aggressively, and hence held onto too much memory.
Consider the following parser:
(char 'f' *> char 'o') <|> (char 'f' *> char 'i')
Under attoparsec 0.8 and earlier, the
<|> (choice) operator saves the input state before executing the left branch. If the left branch fails, it restores the input state, then executes the right branch. So if we supply an input of
"fo", the left branch will succeed, and so the whole parser succeeds. But if we supply an input of
"fi", the left branch will fail. The
<|> operator will restore the input to
"fi", and then the right branch will succeed.
This behaviour is very appealing, because it makes it easy to write readable parsers. Unfortunately, it comes with a big downside: it makes it too easy to write inefficient parsers, since
<|> needs to hold onto the entire input of its left branch in case it fails.
In attoparsec 0.9, I've dropped that backtracking-by-default behaviour. This will require changes to some parsers that use attoparsec; I'll describe what's involved below. (By the way, this change makes attoparsec's behaviour more compatible with the classic Parsec library.)
Given the parser above, here's how it will now behave when given an input of
"fi". The left branch will succeed on its first match against
'f', so the input will become
'f' has been consumed. At this point, the left branch will fail because it cannot match
'o' against the input of
<|> combinator will switch to executing the right branch, but it will not reset the input. At this point, the right branch will attempt to match
'f' against an input that is still
"i", and will thus fail.
This change affects not just the
<|> combinator, but also its synonyms
mappend. It also affects the behaviour of higher-level combinators that are constructed using
<|>, such as
If your parser is affected by this change, there are two good courses of action.
The easier path is to use the
try combinator on the left hand side of a
<|> branch. This combinator causes a parser to backtrack on failure: if a parser wrapped by
try fails, it will reset the input before passing the failure along. What would this look like in practice?
try (char 'f' *> char 'o') <|> (char 'f' *> char 'i')
This is certainly easy to do, and it won't affect the performance of your code compared to attoparsec 0.8: you'll have exactly the same backtracking and memory usage behaviours as before.
There is an alternative that doesn't require backtracking, and which should make your parsers both faster and less memory-hungry: the classic technique of left factoring your grammar. For the toy parser above, this is easy to do:
char 'f' *> (char 'o' <|> char 'i')
It's clear that we'll see a small performance boost here because we only match against
'f' once in this parser, while our first version did some unnecessary extra work by matching
'f' on each side of the branch. Left factoring is not always this easy, and it can make parsers more difficult to read, so sometimes using
try is the better course to pursue.
Given that this change in semantics introduces some awkwardness, you might reasonably wonder why I made it. By making the use of backtracking explicit, parsers that don't need backtracking will become faster "for free". A realistic instance of this is my aeson JSON library, where by using the no-backtracking version of attoparsec, I saw improvements of up to 10% in parsing performance, and slightly larger reductions in memory footprint.