A few weeks ago, I spent a little time porting Peter Norvig’s Python code from his article How to Write a Spelling Corrector to Haskell. It’s a concise vehicle for a few Haskelly topics that don’t get much airing in front of a general audience: what idiomatic Haskell looks like, and the dreaded space leak.
Bear in mind as you read this that I’m porting Norvig’s algorithm, not implementing the approach that would be best suited to Haskell (dynamic programming). And I’m trying to do so idiomatically, so that the resulting code looks like decent Haskell. This isn’t an attempt to “golf” the code for either performance or terseness. Neither am I trying to show that either language is better than the other. Whew, enough with the disclaimers.
A Haskell module starts with a series of import statements that make
names from other modules available in our namespace. A typical
import section is likely to explicitly list every name it imports
from a module, as here.
> import Control.Monad (liftM)
> import Data.Char (isAlpha, toLower)
> import Data.List (foldl', maximumBy)
There are some good reasons to list every name we import. It’s a helpful piece of self-documentation. By naming everything we import, we can later see which potentially unfamiliar module we’re importing a name from, without needing to search documentation or scratch our heads.
Explicit naming also reduces the likelihood of namespace collisions.
It’s extremely common for Haskell modules to export functions with
identical names but different types. If we import three modules that
each export the name
foo, but we only import that name from
one of those modules, it’s clear to both the compiler and our reader
which one we must want to use.
Another good way to avoid namespace collisions while preserving readability is to import all names from a module en masse, but in such a way that we have to qualify those names when we refer to them. Here’s what I mean by this.
> import qualified Data.Map as M
Given the above qualified import, if we want to refer to the name
empty as exported by
Data.Map, we’ll need to write M.empty.
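As a quick illustration (emptyModel and oneWord are throwaway names for this sketch, not part of the spellchecker), qualified references look like this:

```haskell
import qualified Data.Map as M

-- The qualifier makes it unambiguous that these names come from
-- Data.Map, even though several container modules export an "empty"
-- and an "insert" of their own.
emptyModel :: M.Map String Int
emptyModel = M.empty

oneWord :: M.Map String Int
oneWord = M.insert "hello" 1 emptyModel
```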
Now for our first mention of performance. Haskell’s built-in
String type is well known to be death to performance, as it’s no more than a
linked list of boxed
Chars. The effect this has on performance is
easily measurable on data sets as small as a few hundred kilobytes,
even on a fast CPU. The modern response is to use the newer, much
faster ByteString type.
There are two kinds of
ByteString. The “lazy” variant follows the
Haskell norm of deferring work until it’s actually needed, while the
“strict” variant completes its work immediately. To give you an idea
of what this means, the lazy
ByteString version of the readFile
function reads a file in small chunks, on demand, while the strict
variant reads the entire file into memory in one go. Strict
ByteStrings can beat their lazy cousins on performance metrics, but
this is fairly rare. In this case, when I ran some time measurements,
they showed lazy beating strict by a factor of two.
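A tiny sketch of what “lazy” buys us (nothing here is from the spellchecker itself): a lazy ByteString can even be conceptually infinite, because its chunks are only produced on demand.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L

-- L.repeat builds an endless string, but only the chunks we actually
-- demand are ever materialised.  The strict API has no equivalent,
-- since a strict ByteString must hold its entire contents in memory.
firstFive :: L.ByteString
firstFive = L.take 5 (L.repeat 'x')
```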
> import qualified Data.ByteString.Lazy.Char8 as B
Let’s return to namespace clashes for a moment. The ByteString
modules each contain dozens of names that clash with names defined in the
Prelude, the module that is always implicitly imported into every
Haskell module. This is why we import them using the qualified import
mechanism. (And by the way, switching from lazy to strict
ByteStrings throughout this module is a simple matter of deleting the
.Lazy string from the imported module name above.)
In his Python spellchecker, Norvig uses a Python dict, in
other words a hash table, to hold his model data. Hash tables aren’t
suitable for use in pure functional code, so we use a Haskell Map
instead.
> type Model = M.Map B.ByteString Int
Norvig’s Python
words function has a similar form in Haskell.
> -- def words(text): return re.findall('[a-z]+', text.lower())
> mkWords :: B.ByteString -> [B.ByteString]
> mkWords = filter (not . B.null) . B.splitWith (not . isAlpha) . B.map toLower
mkWords is a composed pipeline of functions, which we read from
right to left.
- B.map toLower returns a lower-cased copy of its input string.
- B.splitWith (not . isAlpha) returns a list of strings, split everywhere that a non-alphabetic character occurs.
- filter (not . B.null) eliminates any empty strings from the result of the call to B.splitWith.
The mkWords function is written in “point free” style, meaning that
although it takes an argument, the argument isn’t referred to in the
function definition. (To make matters confusing, the word “point” in
“point free” really means “value”, not any kind of punctuation
symbol.) Point free style can make one-line functions clearer, but it
makes longer definitions much harder to understand and modify.
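To make the point-free pipeline concrete, here is a plain-String sketch of the same three stages (mkWordsStr and splitWith are names of my own; Data.List has no splitWith, so this sketch defines one by hand):

```haskell
import Data.Char (isAlpha, toLower)

-- The same right-to-left pipeline as mkWords, over ordinary Strings:
-- lower-case, split at non-letters, then drop the empty fragments.
mkWordsStr :: String -> [String]
mkWordsStr = filter (not . null) . splitWith (not . isAlpha) . map toLower
  where
    -- Break the string at every element satisfying the predicate.
    splitWith p s = case break p s of
                      (w, [])     -> [w]
                      (w, _:rest) -> w : splitWith p rest
```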
Notice that my
mkWords function has a type signature. None of the
type signatures in this module is actually necessary; the compiler can
infer correct types for every value without ambiguity. But each one
makes the code more readable to me, which is a huge benefit. If I
were trying to golf this code, I’d have killed off all of the type
signatures, and along with them about half of the code’s grokkability. For example, I’d have to work out for myself that the point-free function definition above is in fact a function of only one argument, and what its type is.
Where there’s a clear similarity between the Python and Haskell
words functions, the
train function has a radically different form in the two languages.
> -- def train(features):
> --     model = collections.defaultdict(lambda: 1)
> --     for f in features:
> --         model[f] += 1
> --     return model
> train :: [B.ByteString] -> Model
> train = foldl' updateMap M.empty
>     where updateMap model word = M.insertWith' (+) word 1 model
The observation here is that building the model is actually a “reduction” or “fold” operation: we start with an empty model, and fold over the input. Each fold/reduction step constructs a new model from the previous model and the next piece of input.
The Python code is a fold, too; there’s just not a clean way to write
a fold that augments a
dict in Python. So instead we have a loop
that somewhat obfuscates the fact that we’re really folding.
Folding isn’t an especially hard thing for neophytes to grasp, but
here lurk a few demons. It is in the definition of train
that Haskell’s laziness holds the power to bite us, in the form of
the dreaded space leak. There are two potential pitfalls here, one of which nearly bit me as I was writing this article.
Our first pitfall is that there are two ways we can fold over a list.
We can consume it from left to right, using
foldl', or from right to left, using
foldr. The choice of which order to use is not at all arbitrary.
The strict left-to-right fold is best for cases when an intermediate result
is of no use to our caller. This is one of these cases: the caller
will use the final complete
Model; it can’t see or do anything with
incremental pieces of it.
Note that if we didn’t use the strict left-to-right fold
foldl', but instead used the lazy variant (
foldl), we’d potentially get a space leak. Instead of constructing a new value
for the next step in the fold, each step would construct a lazily
deferred expression (called a thunk) that wouldn’t be evaluated
until the entire fold had completed. As a result, we’d construct and
return a huge thicket of thunks instead of a single tidy value.
A lazy right-to-left fold is frequently used in lazy functional programs,
as it’s suitable for a caller that will lazily consume a result.
However, if the caller needs a complete result (as in our case),
foldr will generate a pile of thunks as it recurses down its input
list, and they won’t be consumed until
foldr reaches the end of the list.
Understanding when to use
foldl' isn’t actually difficult;
it’s just not something that newcomers to Haskell will naturally think
about. Worse, the compiler can hide some of the effects of the choice of fold under some circumstances. When I first wrote this article, I was using
foldl', but there seemed to be no difference in performance or space usage compared to
foldl (lazy), so I switched to that. GHC’s strictness analysis was preventing me from noticing the problem with
foldl at high optimisation levels, because it turned the lazy fold into a strict one for me.
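A minimal way to see the distinction (lazySum and strictSum are my names, not the library’s):

```haskell
import Data.List (foldl')

-- The lazy foldl builds a chain of thunks ((0 + 1) + 2) + 3 ... that
-- is only collapsed when the final result is demanded; foldl' forces
-- each intermediate sum as it goes, keeping space usage constant.
-- Compiled without optimisation, the difference is easy to observe
-- on a large enough list.
lazySum, strictSum :: [Int] -> Int
lazySum   = foldl  (+) 0
strictSum = foldl' (+) 0
```

Both produce the same answer; only the shape of the intermediate computation differs.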
Our second potential problem is that the choice of fold direction
isn’t the only possible source of a space leak above. The expression
M.insertWith' (+) word 1 model does one of two things when adding a
key to the map:
- If no such key exists in the map, it adds a new entry, with a key of word and a value of 1.
- If the key is already present, it creates a new map with the entry’s value updated, incremented by one.
The potential problem here is that in the “update an existing key”
case, nothing is immediately going to inspect the new value. Thus a
Haskell implementation will construct a thunk to defer its evaluation.
Repeated updates of the same entry will thus construct a series of
thunks of the form
1 + 1 + 1 + ..., instead of a single number, and
this pile of thunks won’t be evaluated until the first time someone
tries to use that value.
Here, we’ve avoided this possible space leak by using the
M.insertWith' function, which is deliberately strict: it forces the
new value to be evaluated immediately. Haskell functions that are
stricter than the default usually have a tick (
') character at the
ends of their names.
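If you are following along with a newer version of the containers library, the same strictness is available through the Data.Map.Strict module, whose insertWith evaluates the combined value before storing it. A sketch (bump and counts are illustrative names of mine):

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as MS

-- Data.Map.Strict.insertWith forces the combined value before it is
-- stored, giving the same protection as the ticked insertWith'.
bump :: MS.Map String Int -> String -> MS.Map String Int
bump m w = MS.insertWith (+) w 1 m

counts :: MS.Map String Int
counts = foldl' bump MS.empty ["a", "b", "a", "a"]
```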
We now have two examples—in a simple-looking two-line function!—to demonstrate that avoiding space leaks in Haskell can be a subtle business for the uninitiated.
Space leaks are bad not merely for the memory they consume, but also because of their time effects. Build a big data structure in a leaky way, and the garbage collector will expend a lot of futile cycles traversing it.
And now, on we go to our next stop. The
edits1 function in Python generates the
set of all single-character edits of a word.
> -- def edits1(word):
> --     n = len(word)
> --     return set([word[0:i]+word[i+1:] for i in range(n)] +
> --                [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] +
> --                [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] +
> --                [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])
> edits1 :: B.ByteString -> [B.ByteString]
> edits1 word = concat
>     [[h `B.append` B.tail t | (h, t) <- hts'],
>      [B.concat [h, (B.take 1 . B.drop 1) t, B.take 1 t, B.drop 2 t] | (h, t) <- hts],
>      [B.concat [h, B.singleton c, B.tail t] | (h, t) <- hts', c <- alpha],
>      [B.concat [h, B.singleton c, t] | (h, t) <- hts, c <- alpha]]
>   where hts = [B.splitAt n word | n <- [0..fromIntegral len]]
>         hts' = take len hts
>         len = fromIntegral (B.length word)
>         alpha = ['a'..'z']
If you find the above Haskell code less readable than its Python counterpart, this is not especially unusual. String manipulation code looks particularly clean in Python, at least to my eyes, while the corresponding Haskell is often more clunky.
The Haskell version of
edits1 dispenses with sets, instead returning
a list of strings. This list can contain duplicate entries, while the
set returned by the Python version cannot. It’s surprisingly
expensive to try to eliminate the duplicates in Haskell, but having
them present has no effect on correctness, as far as I can tell.
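To see where the duplicates come from, here is a String-based sketch of just the replacement family of edits (replacements and distinctReplacements are hypothetical helper names, not part of the program above): replacing a character with itself regenerates the original word, once per position.

```haskell
import Data.List (nub)

-- Replace each character position with every letter of the alphabet.
replacements :: String -> [String]
replacements word = [ pre ++ [c] ++ rest
                    | i <- [0 .. length word - 1]
                    , let (pre, _:rest) = splitAt i word
                    , c <- ['a'..'z'] ]

-- Deduplicating afterwards is possible, but costs extra work.
distinctReplacements :: String -> [String]
distinctReplacements = nub . replacements
```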
Our penultimate stop is the
correct function. The Python version
uses the short-circuiting
or operator to avoid the potentially
expensive work of searching for single-character edits if the word is
already present in the model, and it avoids the even more expensive
search for two-character edits if it finds a suitable single-character
edit.
> -- def known_edits2(word):
> --     return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
> -- def known(words): return set(w for w in words if w in NWORDS)
> -- def correct(word):
> --     candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
> --     return max(candidates, key=lambda w: NWORDS[w])
Our Haskell counterpart looks horribly inefficient by comparison. Oh noes! It constructs a complete list of all single- and two-character edits!
> correct :: Model -> B.ByteString -> B.ByteString
> correct model word = (maximumBy cmpFreq . head . filter (not . null))
>                      (map known [[word], edits1 word, edits2 word] ++ [[word]])
>     where known = filter (`M.member` model)
>           edits2 = concatMap edits1 . edits1
>           findFreq word = M.findWithDefault 1 word model
>           cmpFreq a b = compare (findFreq a) (findFreq b)
However, lazy evaluation works to our advantage here. What actually
happens is that the call to
head causes only the first non-null list (the output of
filter) to be consumed. The construction of the
rest of the list is deferred as a thunk that will never subsequently
be evaluated. We thus get the short-circuiting behaviour of Python’s
or operator for free: our expression could compute all of
these expensive matches, but it doesn’t actually do so unless we
really need them.
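The short-circuit is easy to demonstrate in miniature (firstNonNull is an illustrative helper, not part of the program):

```haskell
-- head . filter (not . null) stops at the first non-empty list, so
-- the error thunk standing in for an "expensive search" below is
-- never forced.
firstNonNull :: [[a]] -> [a]
firstNonNull = head . filter (not . null)

demo :: [Int]
demo = firstNonNull [[], [1, 2], error "expensive search, never run"]
```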
Finally, we separate the reading of data and interaction from our
referentially transparent code above with a little bit of “lifting”.
Our model construction and correction functions are completely pure;
they have no side effects. But we need to read data from a file,
which is an effectful operation. The
liftM function takes the pure
code on the left, and hooks it up to the impure action on the right,
so that the result of the pure code is available within the impure world.
> readModel :: FilePath -> IO Model
> readModel name = (train . mkWords) `liftM` B.readFile name

> main :: IO ()
(Hark back to the discussion of lazy
ByteStrings earlier. The
B.readFile function reads the input file in small chunks, so the
memory usage of this function is proportional to the size of the model
we construct, with a constant factor for I/O that’s independent of the
size of the input file.)
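A small self-contained sketch of what liftM is doing (the file name in the comment is hypothetical; the snippet uses return so it needs no file at all):

```haskell
import Control.Monad (liftM)
import Data.Char (toUpper)

-- liftM applies a pure function to the result of a monadic action;
-- in real code the action on the right might be readFile "input.txt".
shout :: IO String
shout = map toUpper `liftM` return "hello"
```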
> main = do
>     model <- readModel "big.txt"
>     interact (unlines . map (B.unpack . correct model . B.pack) . lines)
The definition of
main tersely says that we should construct our
model, then use it to correct words fed to us on
stdin, one per
line. The argument to
interact is once again a pure function;
interact handles all the dirty work of reading input and writing output.
That’s us about done. Separation between pure and impure code; space leaks partly demystified; and no cheerleading.