Subscribe to
Posts
Comments

A while ago, I had a need to monitor filesystem modifications, and I looked around for Python bindings for the Linux kernel’s inotify subsystem. At the time, the only existing library was pyinotify, so being a lazy sort, I naturally tried to use it.

On first glance, the documentation seems impressive, and the API looks reasonable. Effective use of inotify is a subtle affair, however, and pyinotify is not, shall we say, the best tool for the job. It’s difficult to tell what those problems might be from external inspection, though, so here are a few notes from my experience.

Correctness

A program using pyinotify can easily lose track of parts of its directory hierarchy. The library doesn’t raise an OSError exception if the inotify_add_watch system call fails: instead, it propagates the -1 error result up to the caller as a value in a dict, but without the value of errno to tell the caller why the error occurred.

It’s thus trivial to miss errors entirely, because the usual mechanism of raising exceptions isn’t used. Almost as bad, it’s impossible to distinguish between recoverable (tried to add a watch on a directory that no longer exists) and fatal (hit the system max_user_watches limit) errors.

Performance

To a regular Python hacker, the interface that pyinotify provides will probably look reasonable. If you want to handle some kind of event, just write a method that will get invoked with an Event object when that event occurs. How reassuringly normal.

Under the hood, though, the implementation is terrible. On every event, the library scans every event that the inotify interface could possibly report, and checks to see if your class implements one of several possible appropriately named methods. This means it’s traversing a 20-element dict, and performing up to 60 attribute lookups (of which up to 40 are based on %-formatted names), for every reported event.

This has disastrous performance implications. If you write a simple monitoring tool that uses pyinotify, use it to monitor activity in a Linux kernel source tree, and then start a build in that tree, try running top while your build runs. When I did this, I found that pyinotify was consuming an entire CPU trying to keep up with the flood of notification events.

Locking

All that needless attribute lookup churn isn’t the only problem: pyinotify uses a threading.RLock to protect every access to every attribute of its Watch class, by providing its own __getattribute__ and __setattr__ methods.

I can’t guess what the author thinks he’s protecting himself from, but he’s got a solid defence mounted against both correctness and performance there. (Blindly locking individual attributes isn’t going to protect the consistency of an entire data structure, and delegating responsibility for locking out to callers, who are probably all single-threaded anyway, might help to recover a bit of the execrable performance. Watch isn’t often on the fast path, thank goodness.)

Is it possible to do better?

A potential rejoinder to my performance criticisms is that Python isn’t a fast language. However, this doesn’t bear up in general: I’ve written plenty of nippy Python code. In this particular case, in response to my mounting horror at reading and fixing the pyinotify source, I wrote bindings of my own. In contrast to pyinotify consuming an entire CPU during moderately heavy filesystem activity, an app using my bindings consumes about 5% of a CPU, even in the face intensive activities like untarring a big file archive.

In part, this is because my bindings are less abstracted than those of pyinotify. I don’t dispatch out to user methods at all; the caller is responsible for checking a bitmask instead. The readability of application code isn’t really affected by this, but stripping out all the cruft massively improves performance.

In addition, the application itself is also responsible for using the library in an informed way. To get decent performance with inotify, you must delay calls to read so that the kernel has a chance to aggregate multiple notifications into a single buffer write. In other words, if a call to poll says “you’ve got events”, you have to wait a good fraction of a second before seeing what they are. I provide a Threshold class to help with this.

While it is certainly possible to call into pyinotify in a similarly informed way, I suspect that all its flab and abstraction will gull the unwary coder into thinking that maybe they’re not writing performance-critical code after all, when in fact they are.

There are other Python inotify interfaces available. One is, like mine, named python-inotify, but a quick glance at its source code revealed some of the same silliness with unnecessary locking that plagues pyinotify, so I quickly averted my eyes. There’s also a Python API to gamin. I have no opinion about it, beyond not wanting to run another daemon if I can avoid it.

My general advice would be to avoid writing code that involves monitoring filesystem activity. It’s all too easy to write code that looks sensible, but is actually racy, usually under circumstances that are difficult to reproduce. Tuning performance without introducing more races or bugs is tough. You’re getting the idea now: hard! scary! find something fun instead!

The corollary to this is, of course, that as a user, you ought to be suspicious of any programs you use that monitor filesystem activity. I bet the Beagle and Google Desktop teams have armloads of horror stories.

I’ve spent a bit of time over the past few days putting together some LLVM bindings for Haskell, based on Gordon Henriksen’s C bindings.

(If you don’t know what LLVM is, it’s a wonderful toybox of compiler components, from a complete toolchain supporting multiple architectures through a set of well-defined APIs and intermediate representation file formats that are designed for building interesting software.)

The C bindings are almost untyped, but the Haskell bindings re-add type safety to prevent runtime crashes and general badness.

Currently, almost the entire code generation system is implemented, with most LLVM data types supported (notably absent are structs). Also plugged in is JIT support, so you can generate code at runtime from Haskell and run it immediately. I’ve attached an example.

Please join in the hacking fun! Here’s the darcs repository:

darcs get http://darcs.serpentine.com/llvm

If you want a source tarball, fetch it from here for now. Hackage can’t yet host code that uses GHC 6.8.2’s language extension names.

There’s very light documentation at present, but it ought to be enough to get you going.

Here’s a quick example of some code “in the wild”:

buildFib :: T.Module -> IO (V.Function T.Int32 T.Int32)
buildFib m = do
  let one = C.const (1::Int32)
      two = C.const (2::Int32)
  -- the compiler infers the type for the function from our signature
  (fib, entry) <- U.defineFunction m "fib" (T.function undefined undefined)

  -- a builder is an instruction emitter
  bld <- B.createBuilder
  exit <- Core.appendBasicBlock fib "return"
  recurse <- Core.appendBasicBlock fib "recurse"
  let arg = V.params fib

  -- make the builder emit instructions in the "entry" basic block
  B.positionAtEnd bld entry
  -- if our argument is less than two, exit, else recurse
  test <- B.icmp bld "" I.IntSLE arg two
  B.condBr bld test exit recurse

  -- this is the exit basic block
  B.positionAtEnd bld exit
  B.ret bld one

  -- here's the recursion case
  B.positionAtEnd bld recurse
  x1 <- B.sub bld "" arg one
  fibx1 <- B.call bld "" fib x1

  x2 <- B.sub bld "" arg two
  fibx2 <- B.call bld "" fib x2

  B.add bld "" fibx1 fibx2 >>= B.ret bld

  -- hand the function definition back to our caller
  return fib

This emits a function definition that computes the Fibonacci series in LLVM assembly language. Run it under a JIT:

main :: IO ()
main = do
  args <- getArgs
  let args' = if null args then ["10"] else args

  m <- Core.createModule “fib”
  fib <- buildFib m
  — print the function definition to the screen
  V.dumpValue fib

  — create a JIT
  prov <- Core.createModuleProviderForExistingModule m
  ee <- EE.createExecutionEngine prov

  — evaluate the JITted fib function over every command line argument
  forM_ args’ $ \num -> do
    putStr $ “fib ” ++ num ++ ” = ”
    parm <- EE.createGeneric (read num :: Int)
    gv <- EE.runFunction ee fib [parm]
    print (EE.fromGeneric gv :: Int)

Running this on the command line gives the following output:

$ ./Fibonacci 

define i32 @fib(i32) {
entry:
        icmp sle i32 %0, 2              ; <i1>:1 [#uses=1]
        br i1 %1, label %return, label %recurse

return:         ; preds = %entry
        ret i32 1

recurse:                ; preds = %entry
        sub i32 %0, 1           ; <i32>:2 [#uses=1]
        call i32 @fib( i32 %2 )         ; :3 [#uses=1]
        sub i32 %0, 2           ; <i32>:4 [#uses=1]
        call i32 @fib( i32 %4 )         ; <i32>:5 [#uses=1]
        add i32 %3, %5          ; <i32>:6 [#uses=1]
        ret i32 %6
}

fib 10 = 55

Every language’s standard library has its weak spots. In C, for example, the stdio functions don’t have a consistent notion of where the FILE * belongs in the argument list. For fwrite, it goes at the end; for fseek, it’s at the beginning. This makes it harder to abstract away the details of the API. Instead of remembering “the FILE * always goes at the front”, I have to memorise the order in which each specific function takes its arguments. Of such niggles are programmer annoyance born.

Haskell isn’t immune to poor API design. Partial function application is ubiquitous, as is the use of maps (the equivalent of hash tables, dicts, or what have you in imperative languages). Unfortunately, although the Data.Map API is admirably thorough, it firmly resists partial application.

Yesterday, I wrote a quick program to find out, for a given GHC RPM, what versions of various libraries it’s bundled with. The easiest way to do this is by parsing the output of “rpm -ql ghc682“, looking for lines like this (library name and version in bold):

/usr/lib64/ghc-6.8.2/lib/Cabal-1.2.3.0

This information is easy to pull out using a regexp.

type PkgName = String
type PkgVersion = String

packageInfo :: FilePath -> Maybe (PkgName, PkgVersion)
packageInfo path = do
  ([_, name, version]:_) <- path =~~ “.*/lib/([^/]+)-([0-9][.0-9]*)”
  return (name, version)

If I slurp the entire output of rpm into a single string, I can build a map of name to version quite cleanly. (I’m importing the Data.Map module under the name M here.)

buildMap :: String -> M.Map PkgName PkgVersion
buildMap = foldl' updateMap M.empty . map packageInfo . lines

The problem is that updateMap isn’t as tidy as I’d like, because Data.Map’s functions are generally not friendly to partial application.

updateMap :: PkgMap -> Maybe PkgInfo -> PkgMap
updateMap map (Just (name, version)) = M.insert name version map
updateMap map _ = map

Here, foldl' wants the accumulator (my map) as its first argument, but M.insert for some reason puts the map argument at the end of its list. As a result, I had to write the above bulky definition for updateMap to fix up the argument ordering.

Here’s an alternative order of arguments for M.insert.

insertM :: (Ord a) => M.Map a b -> a -> b -> M.Map a b
insertM map key value = M.insert key value map

As you can see, it just moves the map argument to the front. If the API looked like this, my updateMap function would be considerably shorter. I could get rid of the pattern matching and use maybe instead.


updateMap’ m = maybe m (uncurry (insertM m))

It’s a happy accident that foldl' wants its arguments in what I think of as the “natural” order for partial application. Just so we don’t conflate the two considerations, a comparable example that has nothing to do with folds is M.findWithDefault, which has the following type.


findWithDefault :: (Ord k) => a -> k -> Map k a -> a

This takes a default value as its first argument, which is returned if the key isn’t present in the map. Viewed through the “does it make sense for partial application?” lens, this is good: findWithDefault "foo" gives us a function that always returns "foo" if a lookup fails.

Weirdly, the second argument is the key to look up, not the map to perform the lookup in. So findWithDefault "foo" "bar" gives me a function that will look for the key "bar" in a given map. In practice, I’ve found that this is almost never the order of arguments I want. Just about always, I find myself wanting “given this map, look up some key” instead.

There is, to be sure, an element of judgment and experience in deciding on the order of arguments to a function. However, unlike in C, where argument ordering is merely an annoyance, making an API harder to learn, in Haskell this decision has practical consequences: it directly affects how usable your API is.

If you find yourself having to write noddy adapter functions or use flip too frequently, chances are you’re using an API that didn’t have enough thought put into partial application.

Here is an absolute treat: a long, lively interview with Kim Stanley Robinson, conducted on one of my favourite blogs, BLDGBLOG.

At its best (the Three Californias trilogy, Antarctica), Robinson’s writing is at once haunting and beautifully evocative of a sense of place.

If you use Fedora 8, GHC 6.8.1 will be in the stable repository within 24 hours or so. I’ve also pushed a compatible build of gtk2hs 0.9.12.1 to the stable repository.

Upgrading my laptop from F-7 to F-8 yesterday was painless, so I’ve been able to verify that the latest version of GHC works smoothly. I’ve pushed the built packages to the F-8 testing repository, and will bump them to release in about a week.

If you’re feeling impatient, you can download the final RPMs from the koji build page. I’ve got a build of Gtk2Hs pending, too, so I’ll be releasing that at the same time as the bump of GHC itself.

This evening, I’ll be speaking at the peculiar but fun Ignite SF. My talk is notionally about functional programming, but it’s really about imposing constraints on yourself, and what you can get out of it.

When debugging or profiling Haskell code, it’s common practice to pepper it with cost centre annotations, often called SCC (for strongly connected component set cost centre) annotations. If you compile a program using ghc -prof -auto-all, this causes all of the top-level names in every module you compile to automatically be annotated with their names, which will show up in a profile file later on when you run your program with +RTS -P -RTS.

However, since much Haskell code tends to consist of a top-level declaration with a thicket of smaller local functions and partial applications “inside”, merely seeing top-level names in profiling output is rarely enough to identify the source of a problem with much accuracy: you get roughly the right place to look, but not a precise enough picture to actually fix it.

For example, here’s a piece of code that has a space leak.

count lines = foldl (λm i → Map.insertWith (+) i 1 m) Map.empty lines

Doing a bare-bones profiling run over a program containing a definition like this will readily identify the count function as leaky, but which part of it is leaky?

Rather than hoist a bunch of local functions up to the top level, or pull fragments of partially applied code out and give them top-level names, it’s much easier to manually add SCC annotations.

count lines = foldl (λm i → {-# SCC “insert” #-} Map.insertWith (+) i 1 m)
    Map.empty lines

Rerunning the program gives us a new entry, named insert, which GHC uses when accounting for time spent and memory allocated during evaluation of the code to the right of the SCC annotation, until the right parenthesis is hit. For more details about SCC annotations, cost centres, and profiling, see the GHC User’s Guide on Profiling.

Adding and removing SCC annotations by hand is moderately annoying, so I’ve automated the process in Emacs. This function attempts to somewhat intelligently insert an SCC annotation wherever the cursor is. The name within the annotation is empty, but the cursor is centred on it, eliminating all boilerplate keyboarding, and letting you just type the name you want to use.

(defun insert-scc-at-point ()
  "Insert an SCC annotation at point."
  (interactive)
  (if (or (looking-at "\\b\\|[ \t]\\|$”) (and (not (bolp))
					  (save-excursion
					    (forward-char -1)
					    (looking-at “\\b\\|[ \t]“))))
      (let ((space-at-point (looking-at “[ \t]“)))
	(unless (and (not (bolp)) (save-excursion
				    (forward-char -1)
				    (looking-at “[ \t]“)))
	  (insert ” “))
	(insert “{-# SCC \”\” #-}”)
	(unless space-at-point
	  (insert ” “))
	(forward-char (if space-at-point -5 -6)))
    (error “Not over an area of whitespace”)))

Nuking an SCC annotation can be made just as convenient.

(defun kill-scc-at-point ()
  "Kill the SCC annotation at point."
  (interactive)
  (save-excursion
    (let ((old-point (point))
	  (scc "\\({-#[ \t]*SCC \”[^\"]*\”[ \t]*#-}\\)[ \t]*”))
      (while (not (or (looking-at scc) (bolp)))
	(forward-char -1))
      (if (and (looking-at scc)
	       (<= (match-beginning 1) old-point)
	       (> (match-end 1) old-point))
	  (kill-region (match-beginning 0) (match-end 0))
	(error “No SCC at point”)))))

(By the way, the annotated code above has two space leaks. The first is due to the use of foldl instead of its strict counterpart foldl', while the second is due to the use of Data.Map’s insertWith instead of its respective strict counterpart, insertWith'.)

Update: Josef Sveningsson pointed out that “SCC” doesn’t mean “strongly connected component” in this context, but he didn’t know what it does mean. I asked Simon Marlow, who filled me in: it means “set cost centre”.

Tim Bray has recently been writing about a simple log file processing task, giving his efforts the (decidedly peculiar) name of Wide Finder. The task at hand is to count popular links in an Apache log file.

Here’s my two minutes’ worth of hat in the ring, in Haskell.

> main = do
>     args <- getArgs
>     forM_ args $ \name -> do
>         m <- (foldl' count M.empty . LB.lines) `fmap` LB.readFile name
>         mapM_ print ((take 10 . sortBy (flip compare `on` snd) . M.toList) m)
>   where on f g x y = g x `f` (g y)
>         count m line = case line =~ "\"GET /en/([^ ]+\\.html)” of
>                          ((_:g:_):_) -> M.insertWith’ (+) g 1 m
>                          _ -> m :: M.Map LB.ByteString Int

Here’s some comparable Python:

> pat = re.compile(r'.*"GET /en/([^ ]+\.html)’)

> for name in sys.argv[1:]:
>     d = {}
>     for line in open(name):
>         m = pat.match(line)
>         if m:
>             d[m.group(1)] = d.setdefault(m.group(1), 0) + 1
>     for i in sorted(d.items(), key=lambda x:x[1], reverse=True)[:10]:
>         print i

The Haskell code can chew through 3.2 million records in 10.5 seconds on my laptop, while the Python takes 11.6 seconds.

Both of these programs spend about 90% of their time in regexp matching code, which makes this just a regexp engine and I/O benchmark. Yawn! Can we squeeze a little bit of entertainment out of the problem?

This is a trivially parallelisable problem: there are no data dependencies between different parts of a log file, so we can process them however we please.

Let’s split the input file into chunks of approximately equal size, aligned to line boundaries. This is dead easy to do with almost no I/O: just seek to the nearest chunk boundary, and read a little until we hit a newline.

> chunkedLineBoundaries :: Int -> FilePath -> IO [(Int64, Int64)]
> chunkedLineBoundaries numChunks path = do
>     totalSize <- (fromIntegral . fileSize) `fmap` getFileStatus path
>     let chunkSize = totalSize `div` fromIntegral numChunks
>     bracket (openFile path ReadMode) hClose $ \h ->
>       flip fix 0 $ \findOffsets offset -> do
>         let newOffset = offset + chunkSize
>         hSeek h AbsoluteSeek (fromIntegral newOffset)
>         flip fix newOffset $ \loop off -> do
>           eof <- hIsEOF h
>           if eof
>             then return [(offset, totalSize - offset)]
>             else do
>               bytes <- LB.hGet h 4096
>               case LB.elemIndex ‘\n’ bytes of
>                 Just n -> do
>                   offsets <- findOffsets (off + n + 1)
>                   return ((offset, fst (head offsets) - offset):offsets)
>                 Nothing -> loop (off + LB.length bytes)

The chunkedLineBoundaries function returns a list of (offset, length) pairs. We’ll use this to fire off multiple threads, each of which will consume a single chunk of the file in parallel.

> withChunks :: Int -> (LB.ByteString -> a) -> FilePath -> IO [a]
> withChunks numThreads f path = do
>   offsets <- chunkedLineBoundaries numThreads path
>   ch <- newChan
>   forM_ offsets $ \(offset, count) -> forkIO $
>     handle (writeChan ch . Left) $
>       bracket (openFile path ReadMode) hClose $ \h -> do
>         hSeek h AbsoluteSeek (fromIntegral offset)
>         ret <- (f . LB.take count) `fmap` LB.hGetContents h
>         ret `seq` writeChan ch (Right ret)
>   forM offsets (const (readChan ch >>= either throwIO return))

With this process-a-file-in-chunks function in hand, we must restructure our original code a little to fit in. Here’s the core scan-and-update-the-map loop, which does no I/O.

> reCountLines :: LB.ByteString -> M.Map LB.ByteString Int
> reCountLines = foldl' count M.empty . LB.lines
>     where count m line = case line =~ "\"GET /en/([^ ]+\\.html)” of
>                            ((_:g:_):_) -> M.insertWith’ (+) g 1 m
>                            _ -> m

We’ll give it an alternate name so we can swap in a better implementation later.

> countLines = reCountLines

Because this function does no I/O, we can run it either sequentially or in parallel.

> sequential = fmap countLines . LB.readFile
> parallel = fmap (M.unionsWith (+) . map snd) . withChunks 2 countLines

The parallel function takes the maps returned by each thread and reduces them into a single map, giving a result of exactly the same type as the sequential function.

> -- kind = sequential
> kind = parallel

By changing the definition of kind above, we can switch between the sequential and parallel versions of our code. Now main becomes just a framework:

> main = do
>     args <- getArgs
>     forM_ args $ \name -> kind name >>= \m -> 
>       mapM_ print ((take 10 . sortBy (flip compare `on` snd) . M.toList) m)
>   where on f g x y = g x `f` g y

In order to benefit from the potential parallelism, we have to recompile to use GHC’s threaded runtime. This imposes about a 4% penalty in execution time, so the serial version of the code processs our 3.2 million records in 10.9 seconds instead of 10.5.

Switching to the parallel code, it takes 7.7 seconds to process the same data. We get a less than perfect speedup in part because GHC’s garbage collector runs serially; that results in about 0.7 seconds of serial execution. Still, this is almost twice as fast as the Python code.

Next, let’s get rid of the gratuitous regular expressions, since they’re surely doing a lot more work than necessary for such a simple problem. Here’s a short handwritten replacement:

> fastCountLines :: LB.ByteString -> M.Map LB.ByteString Int
> fastCountLines = foldl' count M.empty . LB.lines
>   where count m line =
>           let quote = LB.drop (fromJust (LB.elemIndex '\"' line)) line
>           in if LB.pack "\"GET /en/" `LB.isPrefixOf` quote
>              then let pfx = LB.drop 9 quote
>                       uri = LB.take (fromJust (LB.elemIndex ' ' pfx)) pfx
>                   in if LB.pack ".html" `isSuffixOf` uri
>                      then M.insertWith' (+) uri 1 m
>                      else m
>              else m

Using fastCountLines as the value of countLines, this brings best-case serial execution time (i.e. without the threaded runtime) down to 5.1 seconds, and parallel execution time drops to 3.5 seconds, or a third the time required by the original serial-with-regexps Haskell code.

I would expect a four-core machine to further improve performance, though with an added drop in speedup due to GHC’s single-threaded garbage collector.

The withChunks function isn’t at all specialised to this task; we can use it to process any large text file in parallel.

I’ve spent a few spare hours here and there working on a pure Haskell interface to MySQL recently. On the principle that perhaps someone else might want to join in the fun, I’ve published a darcs repository already (see the link above for more details):

darcs get http://darcs.serpentine.com/mysql

There are a few reasons I’m doing this. One is licensing-related: the existing hsql-mysql bindings claim to be BSD-licensed, but they’re just an FFI wrapper around the MySQL client library, so they (and any program that uses them) are subject to the GPL. My library speaks the MySQL wire protocol in pure Haskell, and I’ve BSD-licensed it.

The code isn’t actually usable for applications yet, as I’ve mainly done the work of getting login (a major pain) and querying working. I’ve spent a lot of time with the following very handy references:

This has also been a great opportunity to see how well Hackage, Haskell’s equivalent of CPAN, is working out in practice. When I wanted a library that would help me to decode and encode the wire protocol efficiently and flexibly, I was able to grab binary. When I subsequently needed SHA-1 hash support to implement the password hashing protocol, I downloaded the crypto package. No muss, no fuss.

« Prev - Next »