A Haskell regular expression tutorial

Someone showed up on #haskell yesterday, asking how to use regular expressions. This isn’t a completely straightforward question to answer. While Haskell’s regexp libraries provide the same functionality you’ll find in Perl, Python, and Java, they provide a rich and fairly abstract interface that can be daunting to newcomers.

So let’s fix that, and strip away the abstractions to show how we might actually use regexps in practice. I’m assuming that you’re already familiar with regexps from using them in some other language, and simply want to find your feet in Haskell. The standard library that implements regexps is Text.Regex.Posix. As the name suggests, this wraps the system’s native POSIX extended regexp library.

If you’re coming from a language like Perl, Python, or Java, you may not have encountered POSIX regular expressions before. Perl-style regexps are much more expressive, and hence far more widely used. POSIX regexps look superficially similar to Perl-style regexps, but they’re not as expressive, and they have different matching behaviour. There aren’t any concise online descriptions of the syntactic and semantic differences between the two regexp families, but the Wikipedia entry on regexps will help you to understand some of the differences. The only advantage of the POSIX regexp library is that it’s bundled with GHC; you don’t need to fetch any extra bits to get it to work.

Before we continue, let’s start up GHC’s interpreter, ghci.

  $ ghci
  Loading package base ... linking ... done.

To use the Text.Regex.Posix module, we use the :mod (or simply :m) command to load it and bring it into scope.

  Prelude> :mod +Text.Regex.Posix
  Prelude Text.Regex.Posix>

The simplest way to use regexps is via the “=~” function, which we normally use as an infix operator. And, of course, it’s exactly the same operator that Perl uses for regexp matching.

What’s interesting is that this function is polymorphic in both its arguments and its return type; this is why its documentation can be hard to read if you’re new to the language. Here’s a simplified type signature, to give you the idea.

  (=~) :: string -> pattern -> result

(I’ve left out the constraints on these type variables. The real type signature of =~ is huge, and only obscures early understanding of what on earth is going on, so I’m not going to reproduce it here.)

You can use either a normal Haskell string or a much more efficient ByteString as the pattern or string. You can use them in whatever combination you like; a String for one, a ByteString for the other, or whatever suits your needs.

If you try to use =~ interactively at the ghci prompt, ghci gives a fearsome error message.

  > "bar" =~ "(foo|bar)"

  <interactive>:1:0:
  No instance for (Text.Regex.Base.RegexLike.RegexContext Regex
                              [Char]
                              result)
    arising from use of `=~' at <interactive>:1:0-19
  Possible fix:
    add an instance declaration for
    (Text.Regex.Base.RegexLike.RegexContext Regex [Char] result)
  In the expression: "bar" =~ "(foo|bar)"
  In the definition of `it': it = "bar" =~ "(foo|bar)"

Yikes! What’s happened here is that we haven’t given a type for the result of the =~ operator. Since the result type is polymorphic, ghci has no way to infer what type we might actually want. We can easily fix this, by suffixing the expression with a type signature.

  > "bar" =~ "(foo|bar)" :: Bool
  True

  > "quux" =~ "(foo|bar)" :: Bool
  False

By constraining the type of the result to Bool, we get a simple “yes” or “no” answer when we ask if the match has succeeded.

But we can also use String as the result type, which gives us the first string that matches, or an empty string if the match fails.

  > let pat = "(foo[a-z]*bar|quux)"

  > "foodiebar, fooquuxbar" =~ pat :: String
  "foodiebar"

  > "nomatch" =~ pat :: String
  ""

If we use [String], we get a list of every string that matches.

  > "foodiebar, fooquuxbar" =~ pat :: String
  ["foodiebar","fooquuxbar"]

It’s also possible to get more detail about the context in which a match occurred. The 3-tuple in this result consists of the text before a match; the matched text; and the text after the match.

  > "before foodiebar after" =~ pat :: (String,String,String)
  ("before ","foodiebar"," after")

The 4-tuple below adds the text of all subgroups of the match.

  > "before foodiebar after" =~ pat :: (String,String,String,[String])
  ("before ","foodiebar"," after",["foodiebar"])

But wait, there’s more! You can get plain, simple offset information, either singly or in a list.

  > :mod +Text.Regex.Base

  > "the foodiebar" =~ pat :: (MatchOffset,MatchLength)
  (4,9)

  > "no match" =~ pat :: [(MatchOffset,MatchLength)]
  []

All of these different result types are instances of the RegexContext type class. By this point, you should have enough examples that you can read the complete documentation for the other RegexContext instances at Text.Regex.Base.Context without it seeming completely overwhelming. There are many more instances, some of which give you a lot of detailed information about a match.

There’s a monadic variant of the =~ operator, called =~~. You can use this to perform matches inside a monad, for example as follows. This binds a to the number of matches.

  > a <- "foo" =~~ "foo":: IO Int
  1

You can also use the =~~ operator outside of a monad in some cases. For example, you can try a match in the Maybe monad to tidily handle the possibility of a failure.

  > "foo" =~~ "bar":: Maybe String
  Nothing

  > "foo" =~~ "foo" :: Maybe String
  Just "foo"

Here’s an interesting pitfall to watch out for. There’s a small chance you could shoot yourself in the foot if you use a list as a RegexContext. Let me show you what I mean. This expression returns a list of all matches, which is what I’d normally expect.

  > "foo foo foo" =~ "foo" :: [String]
  ["foo","foo","foo"]

But this expression, which differs in only one character, runs the match in the list monad! It’s only ever going to return an empty or single-entry list. Eeek!

  > "foo foo foo" =~~ "foo" :: [String]
  ["foo"]

There’s normally no need to compile regular expression patterns. A pattern will be compiled the first time it’s used, and your Haskell runtime should memoise the compiled representation for you.

This is necessarily only a basic introduction to the Haskell regexp API. In practice, you will want to avoid the Text.Regex.Posix implementation; it’s terribly slow, and too strict besides. When I have time, I’ll talk about these problems in more detail, and what alternatives you can use. Until then, happy regexp matching!

Posted in haskell
29 comments on “A Haskell regular expression tutorial
  1. Nick Burlett says:

    Could you update the Haskell Cookbook with this information? There’s currently nothing there with regard to regular expressions.
    http://www.haskell.org/haskellwiki/Cookbook#Regular_expressions

    Also, this doesn’t seem to work in ghci 6.4.2:

    Prelude Text.Regex.Posix> “foo foo foo” =~ “foo” :: [String]

    :1:14: Not in scope: `=~’

    Thanks,
    Nick

  2. Yep, ghc 6.4.2 is too old. You’ll need 6.6.

  3. Kig says:

    Wow, thanks!
    I’ve been hoping someone would make a brief regex tutorial for some time now. It seems simple now that you walk through it, but when I first took a swing at it based on the documentation, the type signatures scared me off.

    Also confusing to new folks is the lack of a regex replace in the new API.

    I made an ad hoc one that is semi-speedy, but dealing with this issue in general might be a good advanced topic.
    Example: http://hpaste.org/697

  4. Thanks for the intro, its very good!

    Unfortunately regular expressions just don’t work on Hugs, which makes them unuseable for me…

  5. Nathaniel says:

    Typo?: “If we use [String], we get a list of every string that matches. > “foodiebar, fooquuxbar” =~ pat :: String”

  6. Nathaniel says:

    Err… wow, I was expecting I might get to play the learn-the-local-formatting-syntax-by-trial-and-error game, but that _really_ managed to trash my comment :-). Trying again:

    Typo?: “If we use [String], we get a list of every string that matches… “foodiebar, fooquuxbar” =~ pat :: String” I’m pretty sure this should be [String], not String?

    Typo?: “It’s only ever going to return an empty single-entry list.” What is an empty single-entry list?

    Not sure what I think about the API design where the return type is so magical. Isn’t it hard to remember, say, what exactly the third output from the four-tuple-of-strings-plus-a-list-of-strings overload does? Interesting to read about, though :-).

  7. Chris Eidhof says:

    Very cool! I’m also really interested in the “next time”. I really like these practical Haskell examples, which show that Haskell is ready for “real world” programming too. I’m looking forward to an article about the performance and/or other disadvantages (hint, hint ;)).

  8. Nathaniel -

    Thanks for pointing out the typo. I meant “an empty or single-item list”, and I’ve corrected that.

    As far as the magic polymorphic return types are concerned, in this instance I find them easy to remember. I put on my “what would I expect this field of the tuple to contain?” thinking cap, and lo, it gives me the right answer.

  9. Andy says:

    My Haskell-fu is weak, but is there any particular reason to style the regex lib like this? Why not just separate functions rather than adding a type signature to indicate what you’re looking for in terms of return value?

  10. ChrisKuklewicz says:

    Hi…I edited the latest versions of regex packages described above. Thanks for the tutorial page.

    I am using my time to update the packages at the moment, and will spend time on documentation after that, so this community help is great.

    Additional info, in no particular order:

    Using “import Text.Regex.Base” helps expose the whole API. This will be re-exported by backends in the next version. In particular I expect that regex-posix can be made to work with GHC-6.4.2 since I originally created it with that compiler.

    =~ is just a function that combines ‘makeRegex’ from the RegexMaker class and ‘match’ from the RegexContext class:

    (=~) x r =
    let make :: RegexMaker Regex CompOption ExecOption a =>
    a -> Regex
    make = makeRegex
    in match (make r) x

    [ The explicit type signature is needed to avoid (show . read) type ambiguity. ]

    You wrote:
    > There’s normally no need to compile regular expression
    > patterns. A pattern will be compiled the first time it’s used,
    > and your Haskell runtime should memoise the compiled
    > representation for you.

    Which is wrong. If you use =~ then the regexp is compiled with makeRegex and probably not cached. If you need to cache the compiled form then you need to use makeRegex and hold onto the result and then use match separately (and matchM for =~~).

    Andy wrote:
    > Why not just separate functions rather than adding a type
    > signature to indicate what you’re looking for in terms of
    > return value?

    The return-type-polymorphic match and matchM of RegexContex are built on top of the RegexLike class. The RegexLike class defined the separate functions you are looking for: matchAll, matchOnce, matchCount, matchTest, matchAllText, and matchOnceText with normally defined return types.

    As for the GHC-centric nature….I could not make the fancy polymorphic RegexContext work without various type system extensions. The actual backends work without this high level type class API — I suspect one should be able strip the code for regex-base and regex-posix/pcre/parsec/dfa down a bit and get it working on Hugs, just without the fancy =~ and =~~. This would be best to do once I announce the next release.

    In particular, if you look at the type-specific modules such as “Text.Regex.Posix.String” or “Text.Regex.Posix.Bytestring” they all export three functions named compile, regexec and execute. These can be used if you don’t want to go through any of the type classes at all. The Wrap modules like “Text.Regex.Posix.Wrap” expose even more below-type-class level functions, types, and constants.

  11. Miles Gould says:

    Neil:

    Unfortunately regular expressions just don’t work on Hugs, which makes them unuseable for me…

    As someone coming to Haskell from Perl, my first thought was “Unfortunately regular expressions just don’t work on Hugs, which makes Hugs unuseable for me…” :-)

  12. Anonymous says:

    Wow, awesome tutorial! :)

    Coming from Perl, not being able to use regexes was a semi-major hurdle for me; thanks! :)

    (Also, I like the fact that (=~) is so conveniently overloaded, very dwimmy! :))

  13. John J. Rood says:

    About giving up on Hugs vs giving up on Regex. What if I don’t WANT to do a compilation using a compiler everytime I want to try something? How serious do I have to be about this language to “play the game” at all?

  14. John: ghci is an interactive interpreter that’s a part of GHC, and runghc is its non-interactive counterpart, so you can certainly avoid the need to compile while not using Hugs.

  15. John J. Rood says:

    Thank you for the hint! It’s nice to have such quick turnaround. Also nice to see you’re still “alive” on this thread.

  16. John Meacham says:

    Is it ironic I had to read this tutorial to remember how to use the regex syntax I invented in the first place? hmm….

  17. Jason Dusek says:

    Thanks for posting — the multiple return types of Haskell regexen are really cool!

  18. Stephen Marsh says:

    Awesome post! It would have taken me…forever…to figure out these regexn otherwise.

    I did some tests with =~, and it seems it *does* only compile the regexp once, if you compile using ghc -O. =~ is a very small function, so with optimizations turned on (-O) ghc will inline it. If you look at the core, the Regex is put in a top-level expression, so it will only be compiled once per program invocation.

  19. Ubaldo says:

    This tutorial is exactly what I was looking for! Thanks!

  20. abuiles says:

    Great post, also It would be nice if you have mentioned splitRegex. I noticed that you guys didn’t used this in RWH, pag 305 you did your own split rather than using it, any particular reason ?

  21. Dave Bayer says:

    Coding regular expressions as strings is great for getting quick work down, but a ghastly experience as the regular expressions grow.

    Once one has used regular expressions written out in code as e.g. Bigloo Scheme allows, returning to regex strings is a huge “What could I possibly have been thinking!” recrimination. Sure, regex strings are familiar, but do I WANT to stay familiar with Perl?

    The parser alternatives that Haskell offers work well, and express the regex as code, not crammed into a string.

  22. Peter White says:

    I am unable to get the regular expression matching to work when I use Posix character classes such as [:digit:]. For example, if I use “^[:digit:] +INTRODUCTION”, this does not match the string “1 INTRODUCTION”, but if I use what is supposed to be equivalent, namely “^[0-9] +INTRODUCTION”, this does match the string “1 INTRODUCTION”. Any ideas what I am doing wrong?

  23. Peter White says:

    Oh I forgot to say that the tutorial was very useful!

  24. Bastien says:

    => Peter,
    use [[:digit:]] instead of [:digit:]. Common pitfall. Think about what you would have to write if you would enclose two or more character classes in the same range.

  25. czhedu says:

    “foo foo foo” =~ “foo” :: [String] is really awesome. However it doesn’t work any more in GHC 7.0.3.

  26. czhedu says:

    OK. I figure out how to do the above in latest GHC 7. Just use

    “foo foo foo” =~ “foo” :: [[String]]

    What is [[String]] ??? Yes, its the List of String List. Then you can use head/tail etc to get subgroup of matches. This is extremely useful, when you want to use captured parenthesized subexpressions.

  27. czhedu says:

    “In practice, you will want to avoid the Text.Regex.Posix implementation; it’s terribly slow, and too strict besides.” —- Then the question is: what shall we use???

    According to the comparision on
    http://www.haskell.org/haskellwiki/Regular_expressions#Overview
    http://www.haskell.org/haskellwiki/Regular_expressions#Sample_benchmark

    I recommend regex-tdfa, which is very fast and also native implementation in haskell.
    http://hackage.haskell.org/package/regex-tdfa

  28. Doesn’t work for me.

    haskell>(“22″ =~ “*”) :: Bool

    :1:7:
    No instances for (RegexMaker Regex CompOption ExecOption [Char],
    RegexContext Regex [Char] Bool)
    arising from a use of `=~’
    Possible fix:
    add instance declarations for
    (RegexMaker Regex CompOption ExecOption [Char],
    RegexContext Regex [Char] Bool)
    In the expression: (“22″ =~ “*”) :: Bool
    In an equation for `it': it = (“22″ =~ “*”) :: Bool

  29. Johnathan says:

    Great article! You can also use regular expression tester likes http://liveregex.com for realtime regex testing purpose. :)

3 Pings/Trackbacks for "A Haskell regular expression tutorial"
  1. [...] programs and doing useful work. Using just the techniques covered above (and maybe a dash of regular expressions) you can start to rock and roll. [...]

  2. [...] searches, the -l and -L flags, and regular expressions. Regular expressions were difficult. This blog post explains how to use Test.Regex.Posix, but I had difficulty compiling the code. Instead [...]

  3. [...] module.  This module contains regular expression functions and in particular exposes the =~ operator. This operator lets you match a regex pattern against a string and its behavior is overloaded based [...]

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>