Parsing a simple config file in Haskell

Even though I wrote my Haskell blog helper tool purely for my own use, I don’t want to store hard-coded strings in it, lest my username and password escape into the wild.

This suggests that I need a small config file of some kind. I’m going to walk through the parser I wrote for this config file, not as a tutorial, but as an example of how to solve a simple pratical problem in Haskell.

The simplest kind of config file format that’s of any real use tends to look like this:

# Haskell blog config.

xmlrpc = http://www.example.com/wordpress/xmlrpc.php
editpost = http://www.example.com/wordpress/wp-admin/post.php?action=edit&post=
user = blogdude  # comments flow to the end of a line
password = h8x

From inspection, we can see a few informal rules. Config items are name/value pairs. Empty lines are okay, and comments start with #, spanning to the end of a line.

This is a format that’s easy to parse by hand or with regular expressions, but I lately prefer to use Parsec for these kinds of jobs. Regexps are difficult to read and debug, and often slower than Parsec parsers. Compared to a handwritten parser, a Parsec parser sacrifices a little performance, but I win in considerably faster development, clearer code, and excellent error messages “for free”.

Here follows the boilerplate for the beginning of the module. We’ll only be exporting two names from ConfigFile.

> module ConfigFile (
>                    Config,
>                    readConfig
>                   ) where

We’ll be needing a few modules.

> import Char
> import Control.Monad
> import qualified Data.Map as Map
> import Text.ParserCombinators.Parsec
> import Data.Either
> import Data.Maybe

The natural result of parsing a config file, at least in my mind, is a finite map from keys to values, where both are represented as strings.

> type Config = Map.Map String String

In my case, I’ll start with the left-hand-side of a config item, which I’d like to be a “C-like” identifier. The first character should be an alphabetic letter or underscore character. There might be only one character in an identifier, but if there are more, I’ll be expansive, and allow them to be digits, too.

> ident :: Parser String

> ident = do c <- letter <|> char '_'
>            cs <- many (letter <|> digit <|> char '_')
>            return (c:cs)
>       <?> "identifier"

This definition provides a clear illustration of how readable Parsec code can be. (By the way, the <?> means “this is the name to use when printing an error message”.)

When I’m developing a parser, I like to work from the bottom up, starting with simple productions and moving “up the food chain” as I debug each production. I’ll typically use Parsec’s parseTest function repeatedly from within ghci to test each production as I go. Testing interactively from within Emacs makes the process even more efficient; I can reload a module and start testing it with just a few keystrokes.

The result of a successful match is as we’d expect:

*ConfigFile> parseTest ident "ok"
"ok"

Compared with "([a-aA-Z_]\w*)" in Perl-like regexp notation, the Parsec description of ident is much longer. But if a match fails, we get a useful error message, which is a good reason to prefer Parsec. (When I use a normal Parsec entry point instead of parseTest, Parsec will give me a file name in its error message, too. Nice!)

*ConfigFile> parseTest ident "7"
parse error at (line 1, column 1):
unexpected "7"
expecting identifier

Comments are easily dealt with.

> comment :: Parser ()

> comment = do char '#'
>              skipMany (noneOf "\r\n")
>         <?> "comment"

We’d like to be agnostic about line endings.

> eol :: Parser ()

> eol = do oneOf "\n\r"
>          return ()
>     <?> "end of line"

Now to parse an actual config item.

> item :: Parser (String, String)

> item = do key <- ident
>           skipMany space
>           char '='
>           skipMany space
>           value <- manyTill anyChar (try eol <|> try comment <|> eof)
>           return (key, rstrip value)
>     where rstrip = reverse . dropWhile isSpace . reverse

The manyTill combinator builds up a result from each match of the parser that is its first argument, until the parser that is its second argument successfully matches. After matching a config item, we return it as a pair, stripping any trailing white space from the value.

A line can either be empty, or contain a comment or a config item. This makes it a good candidate for using the Maybe class. If we match a comment, we’ll return Nothing; if we match a config item, we return Just that item.

> line :: Parser (Maybe (String, String))

> line = do skipMany space
>           try (comment >> return Nothing) <|> (item >>= return . Just)

Note that skipMany space above will happily consume newlines, so we don’t need to explicitly check for empty lines. It also consumes leading whitespace.

Finally, we need to parse an entire file. This is an example of how it helps to spend some time browsing the standard Haskell libraries. In this case, we’ll use catMaybe from Data.Maybe to turn our list of Maybe values into a list of the Just config items (think of it as dropping all of the Nothing entries from the list, and stripping the Just from every other entry).

> file :: Parser [(String, String)]

> file = do lines <- many line
>           return (catMaybes lines)

The readConfig action parses a config file, so it must run in the IO monad. If the parse fails, it returns a parse error; on success, it returns a Config map.

> readConfig :: SourceName -> IO (Either ParseError Config)

> readConfig name =
>     parseFromFile file name >>=
>     return . fmap (foldr (uncurry Map.insert) Map.empty . reverse)

This is a dauntingly dense definition. Rather than jumping in to explain it piecewise, let’s first turn the code into something more like “beginner Haskell”.

> readConf2 :: SourceName -> IO (Either ParseError Config)

> readConf2 name =
>     do result <- parseFromFile file name
>        return $ case result of
>          Left err -> Left err
>          Right xs -> Right (listToMap (reverse xs))

Okay! We get the result of a parse; if the parse was an error, we pass the error along unmodified. Otherwise, we turn the list into a map, and return it.

> listToMap :: [(String, String)] -> Config

> listToMap ((k,v):xs) = Map.insert k v (listToMap xs)
> listToMap []         = Map.empty

A Haskell programmer with a little bit of experience will notice that the above function looks almost like a fold from the right of a list. The only problem is the (k,v) pattern match that we used to “pick apart” the arguments to Map.insert, so that we could pass it the three arguments it wants.

Map.insert :: (Ord k) => k -> v -> Map.Map k v -> Map.Map k v

It would be great if Map.insert took arguments, instead of three. We can fix this problem using uncurry.

uncurry Map.insert :: (Ord k) => (k, v) -> Map.Map k v -> Map.Map k v

So now we can rewrite listToMap in terms of foldr.

> listToMap3 :: [(String, String)] -> Config

> listToMap3 = foldr (uncurry Map.insert) Map.empty

We can eliminate the case expression in readConf2 by noticing that Either is an instance of Haskell’s Functor class. This means that we can use the Functor class’s fmap function. Calling fmap f on Left l will return Left l, but fmap f (Right r) will return Right (f r), which is exactly what we want.

Knowing about fmap and foldr, it’s now easy to go backwards from the relatively chatty definition of readConf2 to the more austere readConfig.

It’s possible to pare readConfig back even further, so that it’s entirely point-free, but this renders the code more confusing, at least to my eyes.

> readConf3 =
>     (return . fmap (foldr (uncurry Map.insert) Map.empty . reverse) =<<) .
>     parseFromFile file

I’m in no way claiming that this parser is the be-all and end-all of config file parsers. It doesn’t care if a value is redefined (it will use the last definition); it doesn’t impose any logical structure or constraints on the contents of a file; and it turns everything into strings. But it was quick to write (about an hour, not including the time to write this article); it doesn’t have any dependencies on third-party libraries; and it perfectly fits the needs of my trivial blog helper.

Posted in haskell, web
3 comments on “Parsing a simple config file in Haskell
  1. Chris Eidhof says:

    Nice tutorial! Most of the times, when building more advanced parser, you’d want to make use of a Lexer. Parsec also provides very good support for that. A lexer recognizes separate tokens, and the parser parses those tokens and makes something useful of it.

    For example, a very common and useful task of the Lexer is to deal with the whitespace. Parsec also provides good helpers for that, for example the function “lexeme”. When you use it (for example like: lexeme identifier), it’ll eat all the trailing whitespace, which is quite useful in practice!

  2. A nice article. It was discovering how to do parsing (as explained by J.S.Rohl in his “An Introduction to Compiler Writing”) that first got me into Pascal programming back in 1980. I think your article has done the same for me with Haskell. I have couple of questions though:

    1. Is there any reason that you used listToMap rather than Map.fromList in readconf2?

    2. Would you expect the “point-free” readConfig to be more efficient than the “beginner’s Haskell” readCon2?

  3. Tristram -

    I wrote listToMap because I overlooked the existence of Map.fromList.

    I have no intuition as to whether readConfig should be more or less efficient than readConf2, as Haskell compiler technology has advanced a long way since the last time I looked at it in any depth.

1 Pings/Trackbacks for "Parsing a simple config file in Haskell"
  1. [...] teideal glic deisbhéalach » Blog Archive » Parsing a simple config file in Haskell Yet another parsec article. This one is about parsing a simple configuration file. (tags: haskell parsec parser tutorial) [...]

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>