For the past number of years, whenever I’ve needed to write a parser for a language, I’ve turned to Terence Parr’s ANTLR
. It’s a wonderful piece of software, far more capable than the old standbys lex and yacc. Just as appealing is the fact that it generates recursive-descent parsers that are easy to read and debug. Compare an ANTLR-generated parser with the pages of pushdown automaton state transition muck that yacc and its descendants generate and you’ll quickly appreciate the clean nature of ANTLR’s code.
In the Haskell world, a close approximation to ANTLR is available in the form of Parsec
, a library for building parsers using combinators. Where ANTR has its own special syntax, and you run a tool that generates code in some target language, you construct a Parsec parser entirely in Haskell.
The Parsec documentation is excellent, with a great tutorial
that makes it clear how easy it is to use Parsec to build a parser. In my case, I wanted to parse a simple language that has some structure, but no formal grammar, and a few nasty corners.
This language makes heavy use of key/value pairs, in which the keys and values are usually just contiguous non-whitespace strings. In other words, a typical pair might be “foo=bar
“, or “offset=17
“. Sequences of pairs are separated by white space, so three pairs might look like “a=1 b=2 c=3
“. It’s easy to recognise one of these pairs.
normalPair = do name <- notSpaces
value <- notSpaces
return (name, value)
notSpaces = many1 (satisfy (not . isSpace))
One of the lurking nasty bits of this language lies in how it represents string literals. A key in this case will be the word "string
", and the value will be the text of the string. Unfortunately, the text of the string isn't escaped in any way, so a literal might look like "string=text with embedded spaces and newlines
". Oh, no! Where did my nice regular language go?
Fortunately, a string literal is always followed immediately by a pair that tells me how long it was, with a key of "length
", and a value that's an integer.
To further complicate matters, there can be an arbitrary amount of white space between the end of the string literal and the start of the length pair; some of this white space can belong to the string, while the remainder is padding. So in order to recognise a string properly, I have to identify the start of the string, scan for the length pair, parse the number for the length, then use that to chop out the actual string to return as the value of this production.
string=my damn string length=14 hurt=bad
(Aside: The astute reader will notice that we're sunk if a string literal contains a piece of text that looks like a valid length pair. This would allow an attacker to inject key/value pairs that could lead us off into the weeds, depending on what we are trying to do with all of this data. Consider the example below, where the "real" string is in italics, but it contains an embedded string that we can't distinguish.
string=trojan length=6 nukes=launch_em_all length=35
Kids, remember what mom said! When a stranger comes up to you and suggests that you cruft together a simple file format, just say "no
" to dumping unquoted literal strings out! You never know who'll be watching!)
If I want to be able to recognise a normal pair or one of these bizarre string creatures, it's easy enough.
anyPair = try stringPair
(yet to be defined) doesn't match,
will "unconsume" the characters it ate trying, and then
will try to use
= do s <- string "string"
(len, str) <- scanForLength
return (s, take len str)
The key to
is that it uses
to recognise the "length=n
" string. That function returns both the accumulated string that it scanned and the value of the length pair:
scanForLength = do r <- try lengthPair
return (r, )
do x <- anyChar
(r, xs) <- scanForLength
return (r, (x:xs))
This definition of
is very similar to Parsec's standard
combinator. Finally, let's provide a definition of
that avoids "do" notation, just because I could:
lengthPair = string "length=" >> many1 digit >>= return . read
What I like about Parsec is the ability to do things in a freewheeling manner when necessary, like parse these nasty string literals; this simple task is rather more painful with a traditional parser/lexer generator.
Oh, and I haven't mentioned the error messages that Parsec generates. If a parse fails, it does a nice job of telling you where the problem occurred, and what it was expecting to match at the time. Compared to trying to produce meaningful errors from yacc, it's rather a lot nicer.
There are a few downsides to using Parsec, compared to more traditional tools. The biggest is that a normal parser generator will tell you if your production rules will lead to nontermination (e.g. via a production whose RHS matches its LHS, like foo -> foo). Since Parsec isn't a parser generator, it doesn't have the ability to inspect productions; you'll get an infinite loop at runtime. However, the Parsec documentation does a good job of explaining its pitfalls, and how you can work around them. It's a lovely piece of software.