A few weeks ago, I spent a little time porting Peter Norvig’s Python
code from his article How to Write a Spelling
Corrector to Haskell. It’s a
concise vehicle for a few Haskelly topics that don’t get much airing
in front of a general audience: what idiomatic Haskell looks like, and
the dreaded space leak.
Bear in mind as you read this that I’m porting Norvig’s algorithm, not
implementing the approach that would be best suited to Haskell
(dynamic programming). And I’m trying to do so idiomatically, so that
the resulting code looks like decent Haskell. This isn’t an attempt
to “golf” the code for either performance or terseness. Neither am I
trying to show that either language is better than the other. Whew, enough with the disclaimers.
A Haskell module starts with a series of import statements that make
names from other modules available in our namespace. A typical
import section is likely to explicitly list every name it imports
from a module, as here.
> import Control.Monad (liftM)
> import Data.Char (isAlpha, toLower)
> import Data.List (maximumBy)
There are some good reasons to list every name we import. It’s a
helpful piece of self-documentation. By naming everything we import,
we can later see which potentially unfamiliar module we’re importing a
name from, without needing to search documentation or scratch our
heads.
Explicit naming also reduces the likelihood of namespace collisions.
It’s extremely common for Haskell modules to export functions with
identical names but different types. If we import three modules that
each export the name foo, but we only import the name foo from
one of those modules, it’s clear to both the compiler and our reader
which one we must want to use.
Another good way to avoid namespace collisions while preserving
readability is to import all names from a module en masse, but in
such a way that we have to qualify those names when we refer to
them. Here’s what I mean by this.
> import qualified Data.Map as M
Given the above qualified import, if we want to refer to the name
empty as exported by Data.Map, we’ll need to write M.empty.
Now for our first mention of performance. Haskell’s built-in String
type is well known to be death to performance, as it’s no more than a
linked list of boxed Chars. The effect this has on performance is
easily measurable on data sets as small as a few hundred kilobytes,
even on a fast CPU. The modern response is to use the newer, much
faster ByteString types.
There are two kinds of ByteString. The “lazy” variant follows the
Haskell norm of deferring work until it’s actually needed, while the
“strict” variant completes its work immediately. To give you an idea
of what this means, the lazy ByteString version of the readFile
function reads a file in small chunks, on demand, while the strict
variant reads the entire file into memory in one go. Strict
ByteStrings can beat their lazy cousins on performance metrics, but
this is fairly rare. In this case, when I ran some time measurements,
they showed lazy beating strict by a factor of two.
> import qualified Data.ByteString.Lazy.Char8 as B
Let’s return to namespace clashes for a moment. The ByteString
modules each contain dozens of names that clash with names defined
in Prelude, the module that is always implicitly imported into every
Haskell module. This is why we import them using the qualified
mechanism. (And by the way, switching from lazy to strict
ByteStrings throughout this module is a simple matter of deleting
the .Lazy string from the imported module name above.)
In his Python spellchecker, Norvig uses a Python defaultdict, in
other words a hash table, to hold his model data. Hash tables aren’t
suitable for use in pure functional code, so we use a Haskell Map
instead.
> type Model = M.Map B.ByteString Int
Norvig’s words function has a similar form in Haskell.
> -- def words(text): return re.findall('[a-z]+', text.lower())
> mkWords :: B.ByteString -> [B.ByteString]
> mkWords = filter (not . B.null) . B.splitWith (not . isAlpha) . B.map toLower
Our mkWords is a composed pipeline of functions, which we read from
right to left.
B.map toLower returns a lower-cased copy of its input string.
B.splitWith (not . isAlpha) returns a list of strings, split
everywhere that isAlpha returns False.
filter (not . B.null) eliminates any empty strings from the result
of the call to B.splitWith.
The mkWords function is written in “point free” style, meaning that
although it takes an argument, the argument isn’t referred to in the
function definition. (To make matters confusing, the word “point” in
“point free” really means “value”, not any kind of punctuation
symbol.) Point free style can make one-line functions clearer, but it
makes longer definitions much harder to understand and modify.
Notice that my mkWords function has a type signature. None of the
type signatures in this module is actually necessary; the compiler can
infer correct types for every value without ambiguity. But each one
makes the code more readable to me, which is a huge benefit. If I
were trying to golf this code, I’d have killed off all of the type
signatures, and along with it about half of the code’s grokkability. For example, I’d have to work out for myself that the point-free function definition above is in fact a function of only one argument, and what its type is.
Where there’s a clear similarity between the Python and Haskell
words functions, the train function has a radically different form
in Haskell.
> -- def train(features):
> -- model = collections.defaultdict(lambda: 1)
> -- for f in features:
> -- model[f] += 1
> -- return model
> train :: [B.ByteString] -> Model
> train = foldl' updateMap M.empty
> where updateMap model word = M.insertWith' (+) word 1 model
The observation here is that building the model is actually a
“reduction” or “fold” operation: we start with an empty model, and
fold over the input. Each fold/reduction step constructs a new model
from the previous model and the next piece of input.
The Python code is a fold, too; there’s just not a clean way to write
a fold that augments a dict in Python. So instead we have a loop
that somewhat obfuscates the fact that we’re
folding.
Folding isn’t an especially hard thing for neophytes to grasp, but
here lurk a few demons. It is in the definition of train
that Haskell’s laziness holds the power to bite us, in the form of
the dreaded space leak. There are two potential pitfalls here, one of which nearly bit me as I was writing this article.
Our first pitfall is that there are two ways we can fold over a list.
We can consume it from left to right, using foldl', or from right to
left, using foldr. The choice of which order to use is not at all
arbitrary.
The strict left-to-right fold is best for cases when an intermediate result
is of no use to our caller. This is one of these cases: the caller
will use the final complete Model; it can’t see or do anything with
incremental pieces of it.
Note that if we didn’t use the strict left-to-right fold foldl', but instead used the lazy variant (foldl) we’d potentially get a space leak. Instead of constructing a new value
for the next step in the fold, each step would construct a lazily
deferred expression (called a thunk) that wouldn’t be evaluated
until the entire fold had completed. As a result, we’d construct and
return a huge thicket of thunks instead of a single tidy value.
A lazy right-to-left fold is frequently used in lazy functional programs,
as it’s suitable for a caller that will lazily consume a result.
However, if the caller needs a complete result (as in our case),
foldr will generate a pile of thunks as it recurses down its input
list, and they won’t be consumed until foldr reaches the end of the
list.
Understanding when to use foldr or foldl' isn’t actually difficult;
it’s just not something that newcomers to Haskell will naturally think
about. Worse, the compiler can hide some of the effects of the choice of fold under some circumstances. When I first wrote this article, I was using foldl', but there seemed to be no difference in performance or space usage compared to foldl (lazy), so I switched to that. GHC’s strictness analysis was preventing me from noticing the problem with foldl at high optimisation levels, because it turned the lazy fold into a strict one for me.
Our second potential problem is that the choice of fold direction
isn’t the only possible source of a space leak above. The expression
M.insertWith' (+) word 1 model does one of two things when adding a
key to the Model map.
- If no such key exists in the map, it adds a new entry, with a key of
word and value of 1.
- If the key is already present, it creates a new map with the entry’s
value updated, incremented by one.
The potential problem here is that in the “update an existing key”
case, nothing is immediately going to inspect the new value. Thus a
Haskell implementation will construct a thunk to defer its evaluation.
Repeated updates of the same entry will thus construct a series of
thunks of the form 1 + 1 + 1 + ..., instead of a single Int, and
this pile of thunks won’t be evaluated until the first time someone
tries to use that value.
Here, we’ve avoided this possible space leak by using the
M.insertWith' function, which is deliberately strict: it forces the
new value to be evaluated immediately. Haskell functions that are
stricter than the default usually have a tick (') character at the
ends of their names.
We now have two examples—in a simple-looking two-line
function!—to demonstrate that avoiding space leaks in Haskell
can be a subtle business for the uninitiated.
Space leaks are bad not merely for the memory they consume, but also
because of their time effects. Build a big data structure in a
leaky way, and the garbage collector will expend a lot of futile
cycles traversing it.
And now, on we go to our next stop. The edits1 function in Python
generates a set of all single-character edits of a word.
> -- def edits1(word):
> -- n = len(word)
> -- return set([word[0:i]+word[i+1:] for i in range(n)] +
> -- [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)]
> -- [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet]
> -- [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])
> edits1 :: B.ByteString -> [B.ByteString]
> edits1 word = concat
> [[h `B.append` B.tail t | (h, t) <- hts'],
> [B.concat [h, (B.take 1 . B.drop 1) t, B.take 1 t, B.drop 2 t] | (h, t) <- hts],
> [B.concat [h, B.singleton c, B.tail t] | (h, t) <- hts', c <- alpha],
> [B.concat [h, B.singleton c, t] | (h, t) <- hts, c <- alpha]]
> where hts = [B.splitAt n word | n <- [0..fromIntegral len]]
> hts' = take (len - 1) hts
> len = fromIntegral (B.length word)
> alpha = ['a'..'z']
If you find the above Haskell code less readable than its Python
counterpart, this is not especially unusual. String manipulation code
looks particularly clean in Python, at least to my eyes, while the
corresponding Haskell is often more clunky.
The Haskell version of edits1 dispenses with sets, instead returning
a list of strings. This list can contain duplicate entries, while the
set returned by the Python version cannot. It’s surprisingly
expensive to try to eliminate the duplicates in Haskell, but having
them present has no effect on correctness, as far as I can tell.
Our penultimate stop is the correct function. The Python version
uses the short-circuiting or operator to avoid the potentially
expensive work of searching for single-character edits if the word is
already present in the model, and it avoids the even more expensive
search for two-character edits if it finds a suitable single-character
edit.
> -- def known_edits2(word):
> -- return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
> -- def known(words): return set(w for w in words if w in NWORDS)
> -- def correct(word):
> -- candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
> -- return max(candidates, key=lambda w: NWORDS[w])
Our Haskell counterpart looks horribly inefficient by comparison. Oh
noes! It constructs a complete list of all single- and two-character
edits!
> correct :: Model -> B.ByteString -> B.ByteString
> correct model word = (maximumBy cmpFreq . head . filter (not . null))
> (map known [[word], edits1 word, edits2 word] ++ [[word]])
> where known = filter (`M.member` model)
> edits2 = concatMap edits1 . edits1
> findFreq word = M.findWithDefault 1 word model
> cmpFreq a b = compare (findFreq a) (findFreq b)
However, lazy evaluation works to our advantage here. What actually
happens is that the call to head causes only the first non-null
candidate (from filter) to be consumed. The construction of the
rest of the list is deferred as a thunk that will never subsequently
be evaluated. We thus get the short-circuiting behaviour of Python’s
or operator for free: our expression could compute all of
these expensive matches, but it doesn’t actually do so unless we
really need them.
Finally, we separate the reading of data and interaction from our
referentially transparent code above with a little bit of “lifting”.
Our model construction and correction functions are completely pure;
they have no side effects. But we need to read data from a file,
which is an effectful operation. The liftM function takes the pure
code on the left, and hooks it up to the impure action on the right,
so that the result of the pure code is available within the impure
IO monad.
> readModel :: FilePath -> IO Model
> readModel name = (train . mkWords) `liftM` B.readFile name
> main :: IO ()
(Hark back to the discussion of lazy ByteStrings earlier. The
B.readFile function reads the entire model in small chunks, so the
memory usage of this function is proportional to the size of the model
we construct, with a constant factor for I/O that’s independent of the
size of the input file.)
> main = do
> model <- readModel "big.txt"
> interact (unlines . map (B.unpack . correct model . B.pack) . lines)
The definition of main tersely says that we should construct our
model, then use it to correct words fed to us on stdin, one per
line. The argument to interact is once again a pure function;
interact handles all the dirty work of reading input and writing
output.
That’s us about done. Separation between pure and impure code; space
leaks partly demystified; and no cheerleading.