Subscribe to
Posts
Comments

Simon Peyton Jones will be giving two Haskell-related talks at OSCON in July! As far as I know, this will be the first time that Haskell gets an airing at a general-interest conference. Simon is a fantastic speaker, so if you were going to OSCON anyway, you absolutely should not miss his talks.

One of Simon’s talks is to be a three-hour Haskell tutorial. It’s explicitly not aimed at “pointy-headed academics”, so if you like to get practical things done quickly and safely, this should be just the job.

The other talk will be about nested data parallelism, which is a long-dormant subject that holds great promise. The idea here is that we’d like to be able to write parallel programs that are composable and modular, but that retain good performance characteristics, while not doing great violence to our source as we attempt to scale.

This is a hard problem, as it requires solving a number of not obviously related subproblems. A close imperative-land approximation I’ve seen to the work that the NPH people are doing is Charm++, which is itself very cool. And you can see from how long the Charm++ has been around that this isn’t a short-term campaign.

This year, the Haskell.org community has had nine projects funded by Google, out of 64 student applications. We had mentoring capacity for over 20 projects, so the final number was decided by Google, not by our ability to deal with them. Congratulations to those who were accepted! And to those who were not, take heart; the competition was tight.

Most of this year’s successful applications are centered around library and infrastructure improvements, so I’m hopeful that we’ll see quite a lot of generally useful code come out at the end of this summer.

One Haskell project is being sponsored by Portland State University: Spencer Janssen’s XHSB: an X binding for Haskell, based on XCB, mentored by James Sharp. So really, we have ten projects this year, not nine.

We plan to have at least one “backup” mentor for each project.

Malcolm Wallace has done a sterling job so far as the coordinator for Haskell.org. Thanks, Malcolm!

I recently bought a Thinkpad X60 that came with Windows preinstalled. I never actually use Windows, but it’s convenient to have around for the occasional BIOS update.

My Linux distribution of choice has long been Fedora. Unfortunately, unlike other modern Linux distros, Fedora doesn’t provide a simple graphical means to shrink an NTFS partition in its installer.

Since the X60 doesn’t have an optical drive, and I don’t have any USB storage like a memory stick or CD-ROM drive, I couldn’t easily boot into something like Knoppix. If I’d had such gear available, I could easily have shrunk the Windows partition using qtparted. So here are my notes on how to get dual-booting working, by hand, in a bare-bones environment.

Really, if there’s some way you can use a sane all-in-one GUI tool like qtparted to do all of this messy work, you should. You should think of these as the instructions of last resort for the terminally optimistic.

Since I had no suitable media handy, I installed Fedora onto the X60 over the network using PXE. (Getting a PXE server configured sounds scary, but I had one up and serving in less than five minutes on my older laptop. Granted, I’ve done this many times.)

When the Fedora installer (Anaconda) is running, it runs you through a series of questions in text mode, and then starts up an X-based GUI. Don’t run through any of the GUI setup steps yet. Behind the scenes, Anaconda also sets up some virtual terminals, one of which contains a root shell prompt. That root prompt is the key to shrinking your Windows partition.

To get to the virtual terminal with the root prompt, you’ll first need to press the key combination of Control+Alt+F1 to switch to the first virtual terminal. After that, you can press Alt+F2 to get to the second, Alt+F3 the third (I think this is the one where the root prompt lives), and so on. Alt+F6 will bring you back to the GUI.

Once you have a root prompt, you’ll need to find another system, such as the one you’re installing from, and download an ntfsprogs source tarball. Unpack it, and build static binaries. (You must have static binaries, because the install environment doesn’t provide shared libraries.)

tar zxf ntfsprogs-1.13.1.tar.gz
cd ntfsprogs-1.13.1
./configure --enable-really-static
make

(Don’t forget to make sure that you’re building binaries that can actually run on the machine you’re installing on! If you’re installing an x86_64 version of Fedora, for example, it doesn’t matter whether you build a 32-bit or 64-bit version of ntfsprogs. But the converse is not the case.)

Once the build has finished, find the ntfsresize program in the ntfsprogs subdirectory. This is what you’ll need to get onto the machine you’re installing on.

Go back to the root shell on the system you’re installing onto. Recent versions of Anaconda make ssh and scp commands available from the shell, so you can simply scp the ntfsresize program from the system where you built it into /tmp on the machine you’re installing on.

Before I continue, I must point out what should already be obvious: if you are foolhardy enough to follow any directions after this point, you could completely destroy all data on your disk. Proceed at your own risk! If you have any important data on the system, back it up.

You’ll need to use the fdisk command to find out the name of your Windows partition. If you have a newish machine, it probably contains a SATA drive, so you’ll probably need to invoke it thus:

fdisk /dev/sda

Just type p to print the partition table, and note the name of the big partition (there’s probably only one) that’s marked as HPFS/NTFS. Once you’ve done that, type q to quit out of fdisk.

Let’s assume that /dev/sda1 is your NTFS partition. The next thing to do is use ntfsresize to see what size it is:

/tmp/ntfsresize -i /dev/sda1

This will run for a few seconds, and spit out lots of information. The pertinent line to look for in the output begins with “You might resize at …”. Record that number: it’s the minimum you’re likely to get away with.

Now try a dry run of resizing. Let’s say you’ve chosen ten gigabytes as your target Windows partition size.

/tmp/ntfsresize -n -s 10G

Again, ntfsresize will run for a few seconds, and print lots of information. If it prints an error message, don’t try to follow any directions beyond this point. You’re on your own.

However, if the dry run completes successfully, you ought to be safe to resize the partition. This is the first point at which you can destroy things, so it’s not for the faint of heart. Simply run the same command as before, only without the -n option.

/tmp/ntfsresize -s 10G

This will take a few minutes to run. If it exits with an error message, this might be a good time to take up religion and prayer, or at least to look for your Windows install CD. That is, if your system shipped with one.

Assuming ntfsresize did its job, the next step is even more scary: you’ll have to fiddle with the disk’s partition table so that the NTFS partition’s size matches the size of the actual filesystem. Messing with the partition table is even more scary than resizing filesystems, so dust off that rabbit’s foot your well-meaning aunt gave you.

It’s time to run fdisk again, this time with prejudice.

fdisk /dev/sda

Once again, type p to print the partition table. This time, make a note of the start offset of the NTFS partition.

Next, you’ll need to delete that partition table entry, with d. It will ask you for the number of the partition to delete; if your NTFS partition is on /dev/sda1, you’ll be deleting partition 1.

With that partition table entry gone, you’ll be creating a new, smaller entry to replace it. To get started on this process, use n.

  1. fdisk will ask you whether to create a primary or extended partition. Choose p for primary.
  2. It will prompt you for the start offset of the partition. You must use the same start offset as the old partition had; this is likely to be the default.
  3. When fdisk asks you for an end offset, give it a relative offset that’s the exact same size as you gave earlier to ntfsresize. For example, if you resized to ten gigabytes, type +10G as the end offset.

The new shrunken partition you created will not have an NTFS system ID. Use t from the main fdisk prompt to change its ID. The number to change it to is 7.

Once you’re done, use p to take a look at the new partition table. This won’t actually be written to disk yet. Once you’re sure it looks correct (has the same start offset as before, the correct smaller size, and the right system ID), use w to write out the new partition table and quit fdisk.

In case I haven’t been successful in stating the risks of using fdisk by hand, here are a few charming possibilities for you to consider.

  • fdisk will make no attempt to ensure that your partition table is correct or consistent. Neither, most likely, will your operating system. It’s thus quite possible to accidentally cause your partitions to overlap. This can result in data corruption problems that might not surface until months later.
  • If you hork your partition table badly enough, it can be a pain to subsequently install anything on the system at all. I’ve been forced once or twice to use dd to splatter zeroes all over a partition table in order to get it back to a state where I could do anything with the disk.
  • Don’t say I didn’t warn you!

At this point, you can return to the Anaconda GUI installer by typing Alt+F6. Step through a few install screens. When Anaconda reaches the point of asking you about partitioning, you might be okay to let it partition automatically for you; I haven’t tried this, so I don’t know. I simply go with the custom disk layout option instead, and find to my satisfaction that I now have a big chunk of empty space to fill with swap and ext3 partitions. As you continue through the installation process, Anaconda should automatically set up the boot loader to let you boot into Windows or Linux.

One final thing to bear in mind: the next time you boot into Windows after your install is complete, Windows will probably run chkdsk during boot, to verify that its filesystem is still sane.

From diffbavis on #haskell:
> sum (map ord "al gore")
666

Someone showed up on #haskell yesterday, asking how to use regular expressions. This isn’t a completely straightforward question to answer. While Haskell’s regexp libraries provide the same functionality you’ll find in Perl, Python, and Java, they provide a rich and fairly abstract interface that can be daunting to newcomers.

So let’s fix that, and strip away the abstractions to show how we might actually use regexps in practice. I’m assuming that you’re already familiar with regexps from using them in some other language, and simply want to find your feet in Haskell. The standard library that implements regexps is Text.Regex.Posix. As the name suggests, this wraps the system’s native POSIX extended regexp library.

If you’re coming from a language like Perl, Python, or Java, you may not have encountered POSIX regular expressions before. Perl-style regexps are much more expressive, and hence far more widely used. POSIX regexps look superficially similar to Perl-style regexps, but they’re not as expressive, and they have different matching behaviour. There aren’t any concise online descriptions of the syntactic and semantic differences between the two regexp families, but the Wikipedia entry on regexps will help you to understand some of the differences. The only advantage of the POSIX regexp library is that it’s bundled with GHC; you don’t need to fetch any extra bits to get it to work.

Before we continue, let’s start up GHC’s interpreter, ghci.

  $ ghci
  Loading package base ... linking ... done.

To use the Text.Regex.Posix module, we use the :mod (or simply :m) command to load it and bring it into scope.

  Prelude> :mod +Text.Regex.Posix
  Prelude Text.Regex.Posix>

The simplest way to use regexps is via the “=~” function, which we normally use as an infix operator. And, of course, it’s exactly the same operator that Perl uses for regexp matching.

What’s interesting is that this function is polymorphic in both its arguments and its return type; this is why its documentation can be hard to read if you’re new to the language. Here’s a simplified type signature, to give you the idea.

  (=~) :: string -> pattern -> result

(I’ve left out the constraints on these type variables. The real type signature of =~ is huge, and only obscures early understanding of what on earth is going on, so I’m not going to reproduce it here.)

You can use either a normal Haskell string or a much more efficient ByteString as the pattern or string. You can use them in whatever combination you like; a String for one, a ByteString for the other, or whatever suits your needs.

If you try to use =~ interactively at the ghci prompt, ghci gives a fearsome error message.

  > "bar" =~ "(foo|bar)"

  <interactive>:1:0:
  No instance for (Text.Regex.Base.RegexLike.RegexContext Regex
                              [Char]
                              result)
    arising from use of `=~' at <interactive>:1:0-19
  Possible fix:
    add an instance declaration for
    (Text.Regex.Base.RegexLike.RegexContext Regex [Char] result)
  In the expression: "bar" =~ "(foo|bar)"
  In the definition of `it': it = "bar" =~ "(foo|bar)"

Yikes! What’s happened here is that we haven’t given a type for the result of the =~ operator. Since the result type is polymorphic, ghci has no way to infer what type we might actually want. We can easily fix this, by suffixing the expression with a type signature.

  > "bar" =~ "(foo|bar)" :: Bool
  True

  > "quux" =~ "(foo|bar)" :: Bool
  False

By constraining the type of the result to Bool, we get a simple “yes” or “no” answer when we ask if the match has succeeded.

But we can also use String as the result type, which gives us the first string that matches, or an empty string if the match fails.

  > let pat = "(foo[a-z]*bar|quux)"

  > "foodiebar, fooquuxbar" =~ pat :: String
  "foodiebar"

  > "nomatch" =~ pat :: String
  ""

If we use [String], we get a list of every string that matches.

  > "foodiebar, fooquuxbar" =~ pat :: String
  ["foodiebar","fooquuxbar"]

It’s also possible to get more detail about the context in which a match occurred. The 3-tuple in this result consists of the text before a match; the matched text; and the text after the match.

  > "before foodiebar after" =~ pat :: (String,String,String)
  ("before ","foodiebar"," after")

The 4-tuple below adds the text of all subgroups of the match.

  > "before foodiebar after" =~ pat :: (String,String,String,[String])
  ("before ","foodiebar"," after",["foodiebar"])

But wait, there’s more! You can get plain, simple offset information, either singly or in a list.

  > :mod +Text.Regex.Base

  > "the foodiebar" =~ pat :: (MatchOffset,MatchLength)
  (4,9)

  > "no match" =~ pat :: [(MatchOffset,MatchLength)]
  []

All of these different result types are instances of the RegexContext type class. By this point, you should have enough examples that you can read the complete documentation for the other RegexContext instances at Text.Regex.Base.Context without it seeming completely overwhelming. There are many more instances, some of which give you a lot of detailed information about a match.

There’s a monadic variant of the =~ operator, called =~~. You can use this to perform matches inside a monad, for example as follows. This binds a to the number of matches.

  > a <- "foo" =~~ "foo":: IO Int
  1

You can also use the =~~ operator outside of a monad in some cases. For example, you can try a match in the Maybe monad to tidily handle the possibility of a failure.

  > "foo" =~~ "bar":: Maybe String
  Nothing

  > "foo" =~~ "foo" :: Maybe String
  Just "foo"

Here’s an interesting pitfall to watch out for. There’s a small chance you could shoot yourself in the foot if you use a list as a RegexContext. Let me show you what I mean. This expression returns a list of all matches, which is what I’d normally expect.

  > "foo foo foo" =~ "foo" :: [String]
  ["foo","foo","foo"]

But this expression, which differs in only one character, runs the match in the list monad! It’s only ever going to return an empty or single-entry list. Eeek!

  > "foo foo foo" =~~ "foo" :: [String]
  ["foo"]

There’s normally no need to compile regular expression patterns. A pattern will be compiled the first time it’s used, and your Haskell runtime should memoise the compiled representation for you.

This is necessarily only a basic introduction to the Haskell regexp API. In practice, you will want to avoid the Text.Regex.Posix implementation; it’s terribly slow, and too strict besides. When I have time, I’ll talk about these problems in more detail, and what alternatives you can use. Until then, happy regexp matching!

I spent a while this evening reading through the documentation for the beta release of NVIDIA’s CUDA GPGPU system. My motivation for this was that nvcc, the CUDA compiler, is based on a code drop of the EkoPath compiler, which I’ve worked on intermittently over the past few years.

The programming model that these GPUs enforce is incredibly complex. It’s more than a little reminiscent of the Connection Machine (for a blast from the past, see a collection of scanned CM-5 documentation.

The idea is that the GPU executes a “kernel” of compute-intensive, highly parallelisable, code on behalf of the CPU. Data is transferred to the GPU when a kernel starts to execute, and back to the host when it completes. The GPU may execute multiple kernels simultaneously, if it is capable of it.

A kernel is executed as a collection of threads, with threads organised into “blocks”. The blocks are further organised into “grids”. Each thread has thread-local storage, and a block of threads shares a chunk read-write memory. This is how they communicate and synchronise with each other. Threads can also access the GPU’s global memory, although blocks of threads cannot synchronise with each other via global memory.

Global memory has a pile of constraints. There’s not one single kind of global memory; instead, there are three. Unlike the thread-local and block-shared memories, global memories can be accessed by the host (although accesses are expensive).

  • Plain old read-write memory. Accesses to this memory are not cached. When a thread block is scheduled to execute, and accesses global memory, its performance is heavily penalised if threads don’t access global memory in a suitable pattern (see section 6.1.2.1 of the Programming Guide).

  • “Constant” memory is read-only, and reads are cached. Again, there are restrictions on access patterns to constant memory that affect performance.

  • “Texture” memory is also read-only, and cached. Cached texture data is laid out to give best performance for 2D access patterns.

Block-shared memory isn’t immune to peculiar performance constraints, either. It’s organised into “banks”, to allow parallel access to multiple banks in a single operation. In other words, n threads can access n banks in one cycle. However, concurrent accesses to a single bank are forcibly serialised, so n threads hitting a single bank will take n cycles.

So at every level of the GPU memory hierarchy, there are multiple factors that you have to keep in mind if you want to achieve good performance.

The extensions that NVIDIA has made to the C language are fairly minimal. There are some new keywords to control layout of data, to determine where variables should live in the memory hierarchy. Not surprisingly, there are new vector types that look very much like OpenGL vector types.

A few similar keywords control function layout. Functions can be marked as device accessible only, resident on the device but called from the host (i.e. a kernel), or host-only. A weird piece of syntax `<<< Dg, Db, Ns >>>’ is required when calling a kernel, to control the execution parameters for the kernel: grid size, block size (and number of threads), and memory allocation.

With hardware of this complexity, a good optimising compiler can make a substantial difference to application performance. The heritage of NVCC is unimpeachable. It’s based on the EkoPath compiler, which has been the most frequent compiler used for SPEC benchmark submissions by AMD64 hardware vendors for several years. The EkoPath compiler was in turn based on Open64, a compiler that SGI GPLed when it dropped out of the compiler business.

Among the memory hierarchy related optimisations in the Open64 compiler family that are available to NVCC are the following (this is just a short list of highlights; the real number is big). I don’t know how many of these are enabled in NVCC, nor do I know enough about individual cache sizes or miss penalties to have a clue as to which ones are likely to be effective.

  • Loop nest optimisation. For a set of nested loops, this can change the order in which the inner and outer loops are executed, to improve the pattern of access to memory.

  • Vectorised intrinsics. If application code is, for example, computing sin(x[i]) for all i in a vector, the compiler can replace this with a single call to a highly optimised sin specialised for vectors.

  • Cache blocking. Replace a single loop over a large vector with smaller loops that operate on cache-sized chunks of the vector.

Given the degree of manual control over variable placement that NVIDIA’s C extensions seem to enforce, it’s not clear to me that their compiler team has had a chance to automate any of the transfer of data between levels of the memory hierarchy yet.

I also find it telling that the programmer’s guide includes specific guidelines on how to avoid bank conflicts in block-shared memory, where in at least some of these cases it’s clear that the compiler could be automating the job.

It’s worth reading the programmer’s guide in its entirety to get a sense of just how complex CUDA is, and how many different constraints the determined application programmer will have to keep in mind at a time. There’s not a lot of abstraction going on here (vendor-provided BLAS and FFT libraries notwithstanding).

We have plenty of previous examples of hardware that failed to live up to their early marketing promise, from the i860 to the PS3. CUDA looks set to follow in their footsteps: I expect that it will take vast amounts of work for programmers to get halfway decent performance out of a CUDA application, and that few will achieve more than 10% of theoretical peak performance.

People with the expertise, persistence, and bloody-mindedness to keep slogging away will undoubtedly see phenomenal speedups for some application kernels. I’m sure that the DOE and NSA, in particular, are drooling over this stuff, as are the quants on Wall Street. But those groups have a tolerance for pain that is fairly unique. This technology is a long way from anything like true accessibility, even to those already versed with parallel programming using environments like MPI or OpenMP. Still, it’s a great first step.

“Now, imagine a Beowulf cluster of these!”

The Haskell community has a very nice implementation-independent mechanism for building libraries and applications, called Cabal.

I spent a few hours over the past couple of days hacking on Cabal to add the ability to build RPM packages. You can fetch my darcs repository from here:

darcs get --partial http://darcs.serpentine.com/cabal-rpm

This new capability is easy to use. It adds a single new Cabal command, called rpm.

runhaskell Setup.*hs rpm

This generates a spec file, and builds source and binary RPMs.

Here’s a quick example, of trying to build the Haskell XML-RPC library:

~/src/darcs/haxr $ runghc Setup.*hs rpm
Source tarball created: dist/SOURCES/haxr-3000.0.0.tar.gz
error: Failed build dependencies:
    HaXml-ghc66 >= 1.13 is needed by haxr-3000.0.0-1.i386
    HaXml-ghc66 < 1.14 is needed by haxr-3000.0.0-1.i386
Setup.lhs: rpmbuild failed with status 1

The rpm command has converted the dependencies in the haxr.cabal file into build-time and runtime dependencies in the haxr.spec file that it generated, but rpmbuild can’t find the HaXml package.

Having earlier built a HaXml package using the rpm command, I can install it with the system’s rpm command.

# rpm -i haxml-ghc66-1.13.2-1.i386.rpm 
Reading package info from stdin ... done.
Saving old package config file... done.
Writing new package config file... done.

The RPM’s post-install scriptlet informs GHC’s package manager about the package’s availability:

# ghc-pkg list --simple | tr ' ' '\n' | grep -i haxml
HaXml-1.13.2

Now if I try to build haxr again, it will succeed.

By default, the rpm command builds both normal and profiling-enabled libraries. It also uses Haddock to generate library documentation. It’s possible to control these behaviours from the command line.

The command also provides a --gen-spec option, which only generates a spec file. You can use this >spec file as a basis for crafting one of your own.

I’ve used the rpm command to build about a third of the packages listed in the Hackage package database, with no problems.

In 1995, I moved from Ireland to the San Francisco Bay Area, because I’d been offered a job that promised to combine Unix and Scheme hacking. The prospect tickled me pink.

At Sun Microsystems, the SPARC design team used a home-built tool called DReAM to manage their ranch of servers as they ran huge batches of EDA design synthesis, verification and simulation runs. I was to join the DReAM group to maintain and extend the software.

DReAM was written in a combination of C++ and Scheme. The RPC and system code was all C++, while the scheduler was written in Scheme. It was a lovely piece of software to work on, and it scaled well. A single uninteresting server could handle a cluster of 3,000 servers with 600 clients without a hitch, not even breaking a sweat in the storm of network traffic after a power failure.

One reason DReAM scaled well was the choice of Scheme implementation for the scheduler. It was called esh (for Embeddable Shell); it was developed internally at Sun; and it was written by John Rose.

Esh compiled Scheme code to native (via C, as I recall), but it also had the usual interpreter, and interpreted and compiled code could call each other transparently. This doesn’t sound like a big deal if you’re used to Common Lisp, but it’s still a rare feature in Scheme implementations, 15 years later. The native code was idiomatic and performed well, and the environment coped just fine with native functions being replaced with interpreted versions at runtime.

Esh also had a very nice bidirectional foreign function interface, so it was easy to call into C from Scheme with a minimum of boilerplate, and vice versa.

We were able to write a lot of the core DReAM scheduling code in Scheme that was then compiled to native code, leaving “policy” decisions to interpreted code. I could telnet to a REPL to tweak variables, or replace entire functions with instrumented versions to hunt down problems on the fly without affecting the running system.

Unfortunately, by the time I joined Sun, John wasn’t working on esh any more; since there was essentially nobody else using it, it had already slipped into an early obscurity.

Even though I haven’t touched esh in ten years, I still remember it as a fine piece of software. A blog entry hardly qualifies as rescuing it from the Internet memory hole, but maybe John will be tickled to know that someone remembers his baby.

Here’s a wonderful excerpt from a book review written by Daniel Dennett, in which he paraphrases Rapaport on how to argue constructively.
Serious argument depends on mutual respect, and this is often hard to engender when disagreements turn vehement. The social psychologist and game theorist Anatol Rapoport (creator of the winning Tit-for-Tat strategy in Robert Axelrod’s legendary prisoner’s dilemma tournament) once promulgated a list of rules for how to write a successful critical commentary on an opponent’s work. First, he said, you must attempt to re-express your opponent’s position so clearly, vividly and fairly that your opponent says “Thanks, I wish I’d thought of putting it that way.” Then, you should list any points of agreement (especially if they are not matters of general or widespread agreement), and third, you should mention anything you have learned from your opponent. Only then are you permitted to say so much as a word of rebuttal or criticism. I have found this a salutary discipline to follow–or, since it is challenging, to attempt to follow. When it succeeds, the results are gratifying: your opponent is in a mood to be enlightened and eagerly attentive.

Even though I wrote my Haskell blog helper tool purely for my own use, I don’t want to store hard-coded strings in it, lest my username and password escape into the wild.

This suggests that I need a small config file of some kind. I’m going to walk through the parser I wrote for this config file, not as a tutorial, but as an example of how to solve a simple pratical problem in Haskell.

The simplest kind of config file format that’s of any real use tends to look like this:

# Haskell blog config.

xmlrpc = http://www.example.com/wordpress/xmlrpc.php
editpost = http://www.example.com/wordpress/wp-admin/post.php?action=edit&post=
user = blogdude  # comments flow to the end of a line
password = h8x

From inspection, we can see a few informal rules. Config items are name/value pairs. Empty lines are okay, and comments start with #, spanning to the end of a line.

This is a format that’s easy to parse by hand or with regular expressions, but I lately prefer to use Parsec for these kinds of jobs. Regexps are difficult to read and debug, and often slower than Parsec parsers. Compared to a handwritten parser, a Parsec parser sacrifices a little performance, but I win in considerably faster development, clearer code, and excellent error messages “for free”.

Here follows the boilerplate for the beginning of the module. We’ll only be exporting two names from ConfigFile.

> module ConfigFile (
>                    Config,
>                    readConfig
>                   ) where

We’ll be needing a few modules.

> import Char
> import Control.Monad
> import qualified Data.Map as Map
> import Text.ParserCombinators.Parsec
> import Data.Either
> import Data.Maybe

The natural result of parsing a config file, at least in my mind, is a finite map from keys to values, where both are represented as strings.

> type Config = Map.Map String String

In my case, I’ll start with the left-hand-side of a config item, which I’d like to be a “C-like” identifier. The first character should be an alphabetic letter or underscore character. There might be only one character in an identifier, but if there are more, I’ll be expansive, and allow them to be digits, too.

> ident :: Parser String

> ident = do c <- letter <|> char '_'
>            cs <- many (letter <|> digit <|> char '_')
>            return (c:cs)
>       <?> "identifier"

This definition provides a clear illustration of how readable Parsec code can be. (By the way, the <?> means “this is the name to use when printing an error message”.)

When I’m developing a parser, I like to work from the bottom up, starting with simple productions and moving “up the food chain” as I debug each production. I’ll typically use Parsec’s parseTest function repeatedly from within ghci to test each production as I go. Testing interactively from within Emacs makes the process even more efficient; I can reload a module and start testing it with just a few keystrokes.

The result of a successful match is as we’d expect:

*ConfigFile> parseTest ident "ok"
"ok"

Compared with "([a-aA-Z_]\w*)” in Perl-like regexp notation, the Parsec description of ident is much longer. But if a match fails, we get a useful error message, which is a good reason to prefer Parsec. (When I use a normal Parsec entry point instead of parseTest, Parsec will give me a file name in its error message, too. Nice!)

*ConfigFile> parseTest ident "7"
parse error at (line 1, column 1):
unexpected "7"
expecting identifier

Comments are easily dealt with.

> comment :: Parser ()

> comment = do char '#'
>              skipMany (noneOf "\r\n")
>         <?> "comment"

We’d like to be agnostic about line endings.

> eol :: Parser ()

> eol = do oneOf "\n\r"
>          return ()
>     <?> "end of line"

Now to parse an actual config item.

> item :: Parser (String, String)

> item = do key <- ident
>           skipMany space
>           char '='
>           skipMany space
>           value <- manyTill anyChar (try eol <|> try comment <|> eof)
>           return (key, rstrip value)
>     where rstrip = reverse . dropWhile isSpace . reverse

The manyTill combinator builds up a result from each match of the parser that is its first argument, until the parser that is its second argument successfully matches. After matching a config item, we return it as a pair, stripping any trailing white space from the value.

A line can either be empty, or contain a comment or a config item. This makes it a good candidate for using the Maybe class. If we match a comment, we’ll return Nothing; if we match a config item, we return Just that item.

> line :: Parser (Maybe (String, String))

> line = do skipMany space
>           try (comment >> return Nothing) <|> (item >>= return . Just)

Note that skipMany space above will happily consume newlines, so we don’t need to explicitly check for empty lines. It also consumes leading whitespace.

Finally, we need to parse an entire file. This is an example of how it helps to spend some time browsing the standard Haskell libraries. In this case, we’ll use catMaybe from Data.Maybe to turn our list of Maybe values into a list of the Just config items (think of it as dropping all of the Nothing entries from the list, and stripping the Just from every other entry).

> file :: Parser [(String, String)]

> file = do lines <- many line
>           return (catMaybes lines)

The readConfig action parses a config file, so it must run in the IO monad. If the parse fails, it returns a parse error; on success, it returns a Config map.

> readConfig :: SourceName -> IO (Either ParseError Config)

> readConfig name =
>     parseFromFile file name >>=
>     return . fmap (foldr (uncurry Map.insert) Map.empty . reverse)

This is a dauntingly dense definition. Rather than jumping in to explain it piecewise, let’s first turn the code into something more like “beginner Haskell”.

> readConf2 :: SourceName -> IO (Either ParseError Config)

> readConf2 name =
>     do result <- parseFromFile file name
>        return $ case result of
>          Left err -> Left err
>          Right xs -> Right (listToMap (reverse xs))

Okay! We get the result of a parse; if the parse was an error, we pass the error along unmodified. Otherwise, we turn the list into a map, and return it.

> listToMap :: [(String, String)] -> Config

> listToMap ((k,v):xs) = Map.insert k v (listToMap xs)
> listToMap []         = Map.empty

A Haskell programmer with a little bit of experience will notice that the above function looks almost like a fold from the right of a list. The only problem is the (k,v) pattern match that we used to “pick apart” the arguments to Map.insert, so that we could pass it the three arguments it wants.

Map.insert :: (Ord k) => k -> v -> Map.Map k v -> Map.Map k v

It would be great if Map.insert took arguments, instead of three. We can fix this problem using uncurry.

uncurry Map.insert :: (Ord k) => (k, v) -> Map.Map k v -> Map.Map k v

So now we can rewrite listToMap in terms of foldr.

> listToMap3 :: [(String, String)] -> Config

> listToMap3 = foldr (uncurry Map.insert) Map.empty

We can eliminate the case expression in readConf2 by noticing that Either is an instance of Haskell’s Functor class. This means that we can use the Functor class’s fmap function. Calling fmap f on Left l will return Left l, but fmap f (Right r) will return Right (f r), which is exactly what we want.

Knowing about fmap and foldr, it’s now easy to go backwards from the relatively chatty definition of readConf2 to the more austere readConfig.

It’s possible to pare readConfig back even further, so that it’s entirely point-free, but this renders the code more confusing, at least to my eyes.

> readConf3 =
>     (return . fmap (foldr (uncurry Map.insert) Map.empty . reverse) =<<) .
>     parseFromFile file

I’m in no way claiming that this parser is the be-all and end-all of config file parsers. It doesn’t care if a value is redefined (it will use the last definition); it doesn’t impose any logical structure or constraints on the contents of a file; and it turns everything into strings. But it was quick to write (about an hour, not including the time to write this article); it doesn’t have any dependencies on third-party libraries; and it perfectly fits the needs of my trivial blog helper.

« Prev - Next »