What the heck is a Wide Finder, anyway?

Tim Bray has recently been writing about a simple log file processing task, giving his efforts the (decidedly peculiar) name of Wide Finder. The task at hand is to count popular links in an Apache log file.

Here’s my two minutes’ worth of hat in the ring, in Haskell.

> import Control.Monad (forM_)
> import qualified Data.ByteString.Lazy.Char8 as LB
> import Data.List (foldl', sortBy)
> import qualified Data.Map as M
> import System.Environment (getArgs)
> import Text.Regex.Posix ((=~))
>
> main = do
>     args <- getArgs
>     forM_ args $ \name -> do
>         m <- (foldl' count M.empty . LB.lines) `fmap` LB.readFile name
>         mapM_ print ((take 10 . sortBy (flip compare `on` snd) . M.toList) m)
>   where on f g x y = g x `f` g y
>         count m line = case line =~ "\"GET /en/([^ ]+\\.html)" of
>                          ((_:g:_):_) -> M.insertWith' (+) g 1 m
>                          _ -> m :: M.Map LB.ByteString Int

Here’s some comparable Python:

> import re
> import sys
>
> pat = re.compile(r'.*"GET /en/([^ ]+\.html)')

> for name in sys.argv[1:]:
>     d = {}
>     for line in open(name):
>         m = pat.match(line)
>         if m:
>             d[m.group(1)] = d.setdefault(m.group(1), 0) + 1
>     for i in sorted(d.items(), key=lambda x:x[1], reverse=True)[:10]:
>         print i

The Haskell code can chew through 3.2 million records in 10.5 seconds on my laptop, while the Python takes 11.6 seconds.

Both of these programs spend about 90% of their time in regexp matching code, which makes this just a regexp engine and I/O benchmark. Yawn! Can we squeeze a little bit of entertainment out of the problem?
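
(The 90% figure comes from profiling. For the Haskell version, a typical measurement looks like this, where widefinder.lhs is my placeholder for wherever the code lives:

    $ ghc -O2 -prof -auto-all --make widefinder.lhs -o widefinder
    $ ./widefinder access.log +RTS -p -RTS

This drops a time profile into widefinder.prof; on the Python side, the cProfile module tells a similar story.)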

This is a trivially parallelisable problem: there are no data dependencies between different parts of a log file, so we can process them however we please.

Let’s split the input file into chunks of approximately equal size, aligned to line boundaries. This is dead easy to do with almost no I/O: just seek to the nearest chunk boundary, and read a little until we hit a newline.

> chunkedLineBoundaries :: Int -> FilePath -> IO [(Int64, Int64)]
> chunkedLineBoundaries numChunks path = do
>     totalSize <- (fromIntegral . fileSize) `fmap` getFileStatus path
>     let chunkSize = totalSize `div` fromIntegral numChunks
>     bracket (openFile path ReadMode) hClose $ \h ->
>       -- outer loop: find where each successive chunk should start
>       flip fix 0 $ \findOffsets offset -> do
>         let newOffset = offset + chunkSize
>         hSeek h AbsoluteSeek (fromIntegral newOffset)
>         -- inner loop: read ahead until we hit a line boundary
>         flip fix newOffset $ \loop off -> do
>           eof <- hIsEOF h
>           if eof
>             then return [(offset, totalSize - offset)]
>             else do
>               bytes <- LB.hGet h 4096
>               case LB.elemIndex '\n' bytes of
>                 Just n -> do
>                   offsets <- findOffsets (off + n + 1)
>                   return ((offset, fst (head offsets) - offset):offsets)
>                 Nothing -> loop (off + LB.length bytes)

The chunkedLineBoundaries function returns a list of (offset, length) pairs. We’ll use this to fire off multiple threads, each of which will consume a single chunk of the file in parallel.
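
Before we do, note the invariant these pairs satisfy: each offset plus its length is exactly the next pair's offset, so the chunks tile the file with no gaps or overlaps. Here's a quick sanity check (a sketch, not part of the program; checkBoundaries is a hypothetical helper):

    checkBoundaries :: [(Int64, Int64)] -> Bool
    checkBoundaries cs = and (zipWith ok cs (tail cs))
      where ok (off, len) (off', _) = off + len == off'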

> withChunks :: Int -> (LB.ByteString -> a) -> FilePath -> IO [a]
> withChunks numThreads f path = do
>   offsets <- chunkedLineBoundaries numThreads path
>   ch <- newChan
>   forM_ offsets $ \(offset, count) -> forkIO $
>     -- forward any exception from a worker to the main thread
>     handle (writeChan ch . Left) $
>       bracket (openFile path ReadMode) hClose $ \h -> do
>         hSeek h AbsoluteSeek (fromIntegral offset)
>         ret <- (f . LB.take count) `fmap` LB.hGetContents h
>         -- force the result before hClose closes the handle
>         ret `seq` writeChan ch (Right ret)
>   forM offsets (const (readChan ch >>= either throwIO return))

With this process-a-file-in-chunks function in hand, we must restructure our original code a little to fit in. Here’s the core scan-and-update-the-map loop, which does no I/O.

> reCountLines :: LB.ByteString -> M.Map LB.ByteString Int
> reCountLines = foldl' count M.empty . LB.lines
>     where count m line = case line =~ "\"GET /en/([^ ]+\\.html)" of
>                            ((_:g:_):_) -> M.insertWith' (+) g 1 m
>                            _ -> m

We’ll give it an alternate name so we can swap in a better implementation later.

> countLines = reCountLines

Because this function does no I/O, we can run it either sequentially or in parallel.

> sequential = fmap countLines . LB.readFile
> parallel = fmap (M.unionsWith (+)) . withChunks 2 countLines

The parallel function takes the maps returned by each thread and reduces them into a single map, giving a result of exactly the same type as the sequential function.
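
For instance, M.unionsWith (+) sums the counts for any key that appears in more than one per-thread map. An illustrative GHCi session (not part of the program):

    ghci> :m + Data.Map
    ghci> unionsWith (+) [fromList [("a",1),("b",2)], fromList [("b",3)]]
    fromList [("a",1),("b",5)]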

> -- kind = sequential
> kind = parallel

By changing the definition of kind above, we can switch between the sequential and parallel versions of our code. Now main becomes just a framework:

> main = do
>     args <- getArgs
>     forM_ args $ \name -> kind name >>= \m -> 
>       mapM_ print ((take 10 . sortBy (flip compare `on` snd) . M.toList) m)
>   where on f g x y = g x `f` g y

In order to benefit from the potential parallelism, we have to recompile to use GHC’s threaded runtime. This imposes about a 4% penalty in execution time, so the serial version of the code processes our 3.2 million records in 10.9 seconds instead of 10.5.
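
For reference, here is a typical build-and-run incantation for the threaded version (again, widefinder.lhs is my placeholder file name):

    $ ghc -O2 -threaded --make widefinder.lhs -o widefinder
    $ ./widefinder +RTS -N2 -RTS access.log

The -N2 runtime flag tells GHC’s runtime to run on two cores.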

Switching to the parallel code, it takes 7.7 seconds to process the same data. We get a less-than-perfect speedup partly because GHC’s garbage collector runs serially; that accounts for about 0.7 seconds of serial execution. Still, this is about fifty percent faster than the Python code.

Next, let’s get rid of the gratuitous regular expressions, since they’re surely doing a lot more work than necessary for such a simple problem. Here’s a short handwritten replacement:

> fastCountLines :: LB.ByteString -> M.Map LB.ByteString Int
> fastCountLines = foldl' count M.empty . LB.lines
>   where count m line =
>           -- we assume every log line contains a '"' and a space
>           -- after the URI; fromJust will fail on malformed lines
>           let quote = LB.drop (fromJust (LB.elemIndex '\"' line)) line
>           in if LB.pack "\"GET /en/" `LB.isPrefixOf` quote
>              then let pfx = LB.drop 9 quote
>                       uri = LB.take (fromJust (LB.elemIndex ' ' pfx)) pfx
>                   in if LB.pack ".html" `LB.isSuffixOf` uri
>                      then M.insertWith' (+) uri 1 m
>                      else m
>              else m

Using fastCountLines as the value of countLines, this brings best-case serial execution time (i.e. without the threaded runtime) down to 5.1 seconds, and parallel execution time drops to 3.5 seconds, or a third the time required by the original serial-with-regexps Haskell code.

I would expect a four-core machine to improve performance further, though with diminishing returns due to GHC’s single-threaded garbage collector.

The withChunks function isn’t at all specialised to this task; we can use it to process any large text file in parallel.
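
For instance, here is a quick sketch that counts the lines of a big file on two cores (parallelLineCount is a hypothetical helper, not part of the code above):

    parallelLineCount :: FilePath -> IO Int64
    parallelLineCount = fmap sum . withChunks 2 (fromIntegral . length . LB.lines)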

9 comments on “What the heck is a Wide Finder, anyway?”
  1. Luke Plant says:

    A nit: isn’t “compare `on` snd” the same as “comparing snd” (using Data.Ord)? You can remove the definition of ‘on’ and save a line.

  2. Itkovian says:

    Not to nitpick, but 7.7 is hardly twice as fast as 11.6 (Python). Nice job though.

  3. Kig says:

    Really cool post.
    Things like this make me happy that you’re part of the “Real World Haskell” thing, because this is exactly the kind of code that reminds me that Haskell is useful for day-to-day code-duggery.

  4. Greg says:

    Can you give some hints about how to get the two-minute version to compile?

    Most of the imports are easily guessed except LB. LB should include lines, readFile, and ByteString, but these seem to be in Prelude, Prelude, and Data.ByteString.

    Looking ahead, how is the polymorphism for =~ resolved?

  5. Greg, here’s the list of imports.

    import Control.Monad (forM_)
    import qualified Data.ByteString.Lazy.Char8 as LB
    import Data.List (foldl', sortBy)
    import qualified Data.Map as M
    import System.Environment (getArgs)
    import Text.Regex.POSIX ((=~))

    The result of (=~) is a list of lists, as the first pattern in the case expression indicates. Each element of the inner list is a lazy bytestring, representing a match subgroup.

  6. Greg says:

    I think these two statements are true:

    1. ghc 6.6 knows about Text.Regex.Posix, but not about Text.Regex.POSIX.

    2. Text.Regex.Posix doesn’t know about lazy bytestrings:

    “No instance for (Text.Regex.Base.RegexLike.RegexLike Text.Regex.Posix.Wrap.Regex
                      Data.ByteString.Base.LazyByteString)”

    Or am I misunderstanding something? The example compiles with Data.ByteString.Char8.

    Thanks again.

  7. Oops, yes, typo there. And I’m using a newer version of the regexp libraries, which do work with lazy ByteStrings.

  8. Alexy says:

    Bryan — many thanks for very instructive Haskell. Is there a complete version to download and benchmark with both GHC 6.6.1 and 6.8.1?

  9. the drive-by reader says:

    Nice speedup.

    It would be interesting to see how Haskell compares to a similarly sped-up python version.
