Norvig’s spell checker and idiomatic Haskell

A few weeks ago, I spent a little time porting Peter Norvig’s Python code from his article How to Write a Spelling Corrector to Haskell. It’s a concise vehicle for a few Haskelly topics that don’t get much airing in front of a general audience: what idiomatic Haskell looks like, and the dreaded space leak.

Bear in mind as you read this that I’m porting Norvig’s algorithm, not implementing the approach that would be best suited to Haskell (dynamic programming). And I’m trying to do so idiomatically, so that the resulting code looks like decent Haskell. This isn’t an attempt to “golf” the code for either performance or terseness. Neither am I trying to show that either language is better than the other. Whew, enough with the disclaimers.

A Haskell module starts with a series of import statements that make names from other modules available in our namespace. A typical import section is likely to explicitly list every name it imports from a module, as here.

> import Control.Monad (liftM)
> import Data.Char (isAlpha, toLower)
> import Data.List (foldl', maximumBy)

There are some good reasons to list every name we import. It’s a helpful piece of self-documentation. By naming everything we import, we can later see which potentially unfamiliar module we’re importing a name from, without needing to search documentation or scratch our heads.

Explicit naming also reduces the likelihood of namespace collisions. It’s extremely common for Haskell modules to export functions with identical names but different types. If we import three modules that each export the name foo, but we only import the name foo from one of those modules, it’s clear to both the compiler and our reader which one we must want to use.

Another good way to avoid namespace collisions while preserving readability is to import all names from a module en masse, but in such a way that we have to qualify those names when we refer to them. Here’s what I mean by this.

> import qualified Data.Map as M

Given the above qualified import, if we want to refer to the name empty as exported by Data.Map, we’ll need to write M.empty.

Now for our first mention of performance. Haskell’s built-in String type is well known to be death to performance, as it’s no more than a linked list of boxed Chars. The effect this has on performance is easily measurable on data sets as small as a few hundred kilobytes, even on a fast CPU. The modern response is to use the newer, much faster ByteString types.

There are two kinds of ByteString. The “lazy” variant follows the Haskell norm of deferring work until it’s actually needed, while the “strict” variant completes its work immediately. To give you an idea of what this means, the lazy ByteString version of the readFile function reads a file in small chunks, on demand, while the strict variant reads the entire file into memory in one go. Strict ByteStrings can beat their lazy cousins on performance metrics, but this is fairly rare. In this case, when I ran some time measurements, they showed lazy beating strict by a factor of two.

> import qualified Data.ByteString.Lazy.Char8 as B

Let’s return to namespace clashes for a moment. The ByteString modules each contain dozens of names that clash with names defined in Prelude, the module that is always implicitly imported into every Haskell module. This is why we import them using the qualified mechanism. (And by the way, switching from lazy to strict ByteStrings throughout this module is a simple matter of deleting the .Lazy string from the imported module name above.)

In his Python spellchecker, Norvig uses a Python defaultdict, in other words a hash table, to hold his model data. Hash tables aren’t suitable for use in pure functional code, so we use a Haskell Map instead.

> type Model = M.Map B.ByteString Int

Norvig’s words function has a similar form in Haskell.

> -- def words(text): return re.findall('[a-z]+', text.lower()) 

> mkWords :: B.ByteString -> [B.ByteString]

> mkWords = filter (not . B.null) . B.splitWith (not . isAlpha) . B.map toLower

Our mkWords is a composed pipeline of functions, which we read from right to left.

  • B.map toLower returns a lower-cased copy of its input string.
  • B.splitWith (not . isAlpha) returns a list of strings, split everywhere that isAlpha returns False.
  • filter (not . B.null) eliminates any empty strings from the result of the call to B.splitWith.

The mkWords function is written in “point free” style, meaning that although it takes an argument, the argument isn’t referred to in the function definition. (To make matters confusing, the word “point” in “point free” really means “value”, not any kind of punctuation symbol.) Point free style can make one-line functions clearer, but it makes longer definitions much harder to understand and modify.
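To make the contrast concrete, here is the same pipeline with its point written out explicitly (a standalone sketch; mkWordsPointful is my own name, not part of the module):

```haskell
import Data.Char (isAlpha, toLower)
import qualified Data.ByteString.Lazy.Char8 as B

-- mkWords with its argument named explicitly, rather than point free.
mkWordsPointful :: B.ByteString -> [B.ByteString]
mkWordsPointful text =
    filter (not . B.null) (B.splitWith (not . isAlpha) (B.map toLower text))

main :: IO ()
main = print (mkWordsPointful (B.pack "Hello, world!"))
-- prints ["hello","world"]
```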

Notice that my mkWords function has a type signature. None of the type signatures in this module is actually necessary; the compiler can infer correct types for every value without ambiguity. But each one makes the code more readable to me, which is a huge benefit. If I were trying to golf this code, I’d have killed off all of the type signatures, and with them about half of the code’s grokkability. For example, I’d have to work out for myself that the point-free function definition above is in fact a function of only one argument, and what its type is.

Where there’s a clear similarity between the Python and Haskell words functions, the train function has a radically different form in Haskell.

> -- def train(features):
> --     model = collections.defaultdict(lambda: 1)
> --     for f in features:
> --         model[f] += 1
> --     return model

> train :: [B.ByteString] -> Model

> train = foldl' updateMap M.empty
>     where updateMap model word = M.insertWith' (+) word 1 model

The observation here is that building the model is actually a “reduction” or “fold” operation: we start with an empty model, and fold over the input. Each fold/reduction step constructs a new model from the previous model and the next piece of input.

The Python code is a fold, too; there’s just not a clean way to write a fold that augments a dict in Python. So instead we have a loop that somewhat obfuscates the fact that we’re folding.
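The correspondence is easy to see if we transcribe the loop into explicit Haskell recursion (a schematic sketch; countLoop is a hypothetical name, not part of the module):

```haskell
import qualified Data.Map as M

-- Each recursive call is one trip around Norvig's loop: take the next
-- feature, bump its count in the model, and carry the model forward.
countLoop :: Ord a => [a] -> M.Map a Int -> M.Map a Int
countLoop []     model = model                                      -- loop done
countLoop (f:fs) model = countLoop fs (M.insertWith (+) f 1 model)  -- model[f] += 1

main :: IO ()
main = print (M.toList (countLoop ["the", "cat", "the"] M.empty))
-- prints [("cat",1),("the",2)]
```

A fold is exactly this recursion pattern, abstracted out so we only have to supply the step function and the starting value.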

Folding isn’t an especially hard thing for neophytes to grasp, but here lurk a few demons. It is in the definition of train that Haskell’s laziness holds the power to bite us, in the form of the dreaded space leak. There are two potential pitfalls here, one of which nearly bit me as I was writing this article.

Our first pitfall is that there are two ways we can fold over a list. We can consume it from left to right, using foldl', or from right to left, using foldr. The choice of which order to use is not at all arbitrary.

The strict left-to-right fold is best for cases when an intermediate result is of no use to our caller. This is one of those cases: the caller will use the final complete Model; it can’t see or do anything with incremental pieces of it.

Note that if we didn’t use the strict left-to-right fold foldl', but instead used the lazy variant (foldl) we’d potentially get a space leak. Instead of constructing a new value for the next step in the fold, each step would construct a lazily deferred expression (called a thunk) that wouldn’t be evaluated until the entire fold had completed. As a result, we’d construct and return a huge thicket of thunks instead of a single tidy value.

A lazy right-to-left fold is frequently used in lazy functional programs, as it’s suitable for a caller that will lazily consume a result. However, if the caller needs a complete result (as in our case), foldr will generate a pile of thunks as it recurses down its input list, and they won’t be consumed until foldr reaches the end of the list.

Understanding when to use foldr or foldl' isn’t actually difficult; it’s just not something that newcomers to Haskell will naturally think about. Worse, the compiler can hide some of the effects of the choice of fold under some circumstances. When I first wrote this article, I was using foldl', but there seemed to be no difference in performance or space usage compared to foldl (lazy), so I switched to that. GHC’s strictness analysis was preventing me from noticing the problem with foldl at high optimisation levels, because it turned the lazy fold into a strict one for me.
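The classic demonstration, in miniature (compile without optimisation, or GHC’s strictness analyser will rescue the lazy version just as it rescued mine):

```haskell
import Data.List (foldl')

-- Compiled without optimisation, lazySum builds a chain of a million
-- unevaluated (+) thunks before forcing any of them; strictSum forces
-- each partial sum as it goes, and so runs in constant space.
lazySum, strictSum :: [Int] -> Int
lazySum   = foldl  (+) 0
strictSum = foldl' (+) 0

main :: IO ()
main = print (strictSum [1..1000000])
-- prints 500000500000
```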

Our second potential problem is that the choice of fold direction isn’t the only possible source of a space leak above. The expression M.insertWith' (+) word 1 model does one of two things when adding a key to the Model map.

  • If no such key exists in the map, it adds a new entry, with a key of word and value of 1.
  • If the key is already present, it creates a new map with the entry’s value updated, incremented by one.

The potential problem here is that in the “update an existing key” case, nothing is immediately going to inspect the new value. Thus a Haskell implementation will construct a thunk to defer its evaluation. Repeated updates of the same entry will thus construct a series of thunks of the form 1 + 1 + 1 + ..., instead of a single Int, and this pile of thunks won’t be evaluated until the first time someone tries to use that value.

Here, we’ve avoided this possible space leak by using the M.insertWith' function, which is deliberately strict: it forces the new value to be evaluated immediately. Haskell functions that are stricter than the default usually have a tick (') character at the ends of their names.
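The semantics of M.insertWith' can be sketched in terms of the lazy insertWith machinery and seq (this is only a model of its behaviour, written with my own hypothetical name; the real library function does the lookup and insert in a single traversal):

```haskell
import qualified Data.Map as M

-- A behavioural sketch of insertWith': force the combined value with
-- seq before it goes into the map, so no chain of thunks can build up.
insertWithStrict :: Ord k => (a -> a -> a) -> k -> a -> M.Map k a -> M.Map k a
insertWithStrict f k v m =
    case M.lookup k m of
      Nothing  -> M.insert k v m
      Just old -> let new = f v old in new `seq` M.insert k new m

main :: IO ()
main = print (M.toList (foldl (\acc w -> insertWithStrict (+) w 1 acc)
                              M.empty ["a", "b", "a"]))
-- prints [("a",2),("b",1)]
```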

We now have two examples—in a simple-looking two-line function!—to demonstrate that avoiding space leaks in Haskell can be a subtle business for the uninitiated.

Space leaks are bad not merely for the memory they consume, but also because of their time effects. Build a big data structure in a leaky way, and the garbage collector will expend a lot of futile cycles traversing it.

And now, on we go to our next stop. The edits1 function in Python generates a set of all single-character edits of a word.

> -- def edits1(word):
> --     n = len(word)
> --     return set([word[0:i]+word[i+1:] for i in range(n)] +
> --                [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] +
> --                [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] +
> --                [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])

> edits1 :: B.ByteString -> [B.ByteString]

> edits1 word = concat
>     [[h `B.append` B.tail t | (h, t) <- hts'],
>      [B.concat [h, (B.take 1 . B.drop 1) t, B.take 1 t, B.drop 2 t] | (h, t) <- hts''],
>      [B.concat [h, B.singleton c, B.tail t] | (h, t) <- hts', c <- alpha],
>      [B.concat [h, B.singleton c, t] | (h, t) <- hts, c <- alpha]]
>   where hts   = [B.splitAt n word | n <- [0..fromIntegral len]]
>         hts'  = take len hts        -- splits with a non-empty suffix
>         hts'' = take (len - 1) hts  -- splits with a suffix of length at least two
>         len   = fromIntegral (B.length word)
>         alpha = ['a'..'z']

If you find the above Haskell code less readable than its Python counterpart, this is not especially unusual. String manipulation code looks particularly clean in Python, at least to my eyes, while the corresponding Haskell is often more clunky.

The Haskell version of edits1 dispenses with sets, instead returning a list of strings. This list can contain duplicate entries, while the set returned by the Python version cannot. It’s surprisingly expensive to try to eliminate the duplicates in Haskell, but having them present has no effect on correctness, as far as I can tell.

Our penultimate stop is the correct function. The Python version uses the short-circuiting or operator to avoid the potentially expensive work of searching for single-character edits if the word is already present in the model, and it avoids the even more expensive search for two-character edits if it finds a suitable single-character edit.

> -- def known_edits2(word):
> --     return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

> -- def known(words): return set(w for w in words if w in NWORDS)

> -- def correct(word):
> --     candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
> --     return max(candidates, key=lambda w: NWORDS[w])

Our Haskell counterpart looks horribly inefficient by comparison. Oh noes! It constructs a complete list of all single- and two-character edits!

> correct :: Model -> B.ByteString -> B.ByteString

> correct model word = (maximumBy cmpFreq . head . filter (not . null))
>                      (map known [[word], edits1 word, edits2 word] ++ [[word]])
>     where known = filter (`M.member` model)
>           edits2 = concatMap edits1 . edits1
>           findFreq word = M.findWithDefault 1 word model
>           cmpFreq a b = compare (findFreq a) (findFreq b)

However, lazy evaluation works to our advantage here. What actually happens is that the call to head causes only the first non-null candidate (from filter) to be consumed. The construction of the rest of the list is deferred as a thunk that will never subsequently be evaluated. We thus get the short-circuiting behaviour of Python’s or operator for free: our expression could compute all of these expensive matches, but it doesn’t actually do so unless we really need them.
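We can watch the short circuit happen in miniature: below, the third candidate list is an error that would abort the program if evaluated, but head takes the first non-null list and never looks further (firstNonNull is a hypothetical stand-in for the pipeline inside correct):

```haskell
-- head . filter (not . null) stops at the first non-empty list; the
-- rest of the input, including the error, is never forced.
firstNonNull :: [[a]] -> [a]
firstNonNull = head . filter (not . null)

main :: IO ()
main = print (firstNonNull [[], ["cat"], error "expensive search, never run"])
-- prints ["cat"]
```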

Finally, we separate the reading of data and interaction from our referentially transparent code above with a little bit of “lifting”. Our model construction and correction functions are completely pure; they have no side effects. But we need to read data from a file, which is an effectful operation. The liftM function takes the pure code on the left, and hooks it up to the impure action on the right, so that the result of the pure code is available within the impure IO monad.

> readModel :: FilePath -> IO Model

> readModel name = (train . mkWords) `liftM` B.readFile name

> main :: IO ()

(Hark back to the discussion of lazy ByteStrings earlier. The B.readFile function reads the entire model in small chunks, so the memory usage of this function is proportional to the size of the model we construct, with a constant factor for I/O that’s independent of the size of the input file.)

> main = do
>     model <- readModel "big.txt"
>     interact (unlines . map (B.unpack . correct model . B.pack) . lines)

The definition of main tersely says that we should construct our model, then use it to correct words fed to us on stdin, one per line. The argument to interact is once again a pure function; interact handles all the dirty work of reading input and writing output.

That’s us about done. Separation between pure and impure code; space leaks partly demystified; and no cheerleading.

Posted in haskell
11 comments on “Norvig’s spell checker and idiomatic Haskell”
  1. petekaz says:

    Why not use foldl’ instead of foldl? As I understand, foldl will just accumulate thunks. I thought in order to be completely strict you need both foldl’ as well as a strict function such as insertWith’?

  2. Pied says:

    Wow! I’m not sure whether this was intended to be used as a tutorial, but it really is the best example of Haskell I have ever read!
    I wish you had written it a year or two earlier…

    I’m going to show your article to many people I know who have been frightened off by the usual Haskell articles…


  3. Daniel Lyons says:


    I worked on this problem a month or two ago when Norvig’s spellchecker hit Reddit, but my Haskell implementation was ugly, inefficient and probably incorrect. Your elucidation clarifies all of the problems I ran into and results in beautiful, concise code. This is a service to the world and I thank you!

  4. petekaz says:

    For more discussion on the subject, see this thread on the cafe mailing list:

  5. Derek Elkins says:

    petekaz is correct. As illustrated in that cafe thread, foldl will accumulate thunks and lead to a stack overflow in just about all cases. foldl is pretty much useless. foldl’ is what should be used if you have a strict function, and foldr otherwise.

  6. I’ve fixed the article. Oops! In my defence, I had substituted foldl for foldl' and found no difference in performance or memory use, and was even looking at profiler output and seeing a nice slow heap growth instead of an explosion. I conclude that the compiler was being cleverer than I was.

  7. It does that. With optimisation, it seems to me foldl never causes trouble. Why shouldn’t the compiler be better at strictness analysis than me?

  8. Miles Gould says:

    The Python code is a fold, too; there’s just not a clean way to write a fold that augments a dict in Python. So instead we have a loop that somewhat obfuscates the fact that we’re folding.

    Huh? The reduce() function has been part of Python for as long as I’ve been aware of the language, and the following works fine for me:
        def updateWith(f, default):
            def curried(dict, index):
                if index in dict:
                    dict[index] = f(dict[index])
                else:
                    dict[index] = default
                return dict
            return curried

        print reduce(updateWith(lambda x: x + 1, 1), [1,2,3,4,5,1,3,7,0], {})

    This prints {0: 1, 1: 2, 2: 1, 3: 2, 4: 1, 5: 1, 7: 1} as expected.

    It would probably be considered poor style, though, and I believe there’s an ongoing argument in the Python community as to whether reduce() and map() should be removed from the language entirely: I guess they don’t buy your argument that code using map and fold is clearer than the equivalent code using loops 🙂

    Other than that, great article!

  9. Miles, I know about reduce, but as I pointed out and you confirm, the code you end up writing to use it to update a dict is messy.

  10. Miles Gould says:

    Ah, OK. Though I think it’s only messy because you have to define updateWith() yourself (and because defaultdicts aren’t available on the Python interpreter I have to hand). If you want to do this kind of thing with folds rather than loops, then write updateWith once, stick it in a module somewhere, and import it when you need it 🙂

  11. Daniel says:

    “… not implementing the approach that would be best suited to Haskell (dynamic programming).”

    Could you be convinced to do that? Would *love* to see that explained as nicely as you have explained this implementation.

    I really enjoyed your article.
