A few weeks ago, I decided that I'd like to focus for a while on getting a 1.0 release of the Haskell text library ready. That work has gone fairly well so far. I've focused on making sure that I like the API, and that a few key functions have good performance characteristics.
One such function is substring search, which is the underpinning for a number of useful functions (split a string on a delimiter, find all substrings, etc). My initial cut at that function had an obvious API, but ran in O(nm) time, which complexity measure I found mildly embarrassing. So I looked around for an alternative that would be faster, but still simple to implement, and settled upon Fredrik Lundh's elegant and trim adaptation of Boyer-Moore-Sunday-Horspool, which has O(n+m) complexity in the typical case.
Using QuickCheck, I was quickly able to satisfy myself that my new code was correct, so now I was faced with demonstrating to myself that the code was fast, too. I looked on Hackage for a good benchmarking package, but although there were several benchmarking packages available, they were all extremely simple, and I wasn't happy with any of them.
I have sort of a mutt's pedigree in computing, having spent several stints working in the high performance scientific computing world, so I felt antsy about measuring performance reliably. Over the course of a few days of hacking, I came up with a good framework that I am currently finishing off.
One of the things a good benchmarking framework needs is some statistical heavy lifting, to do things like autocorrelation analysis, kernel density function estimation, and bootstrapping, so I had to write some complicated statistical code along the way. I've now packaged that code up and released it as the statistics library on Hackage. Some of the features implemented in this first release of the statistics library include:
- Support for common discrete and continuous probability distributions (binomial, gamma, exponential, geometric, hypergeometric, normal, and Poisson)
- Kernel density estimation
- Autocorrelation analysis
- Functions over sample data
- Quantile estimation
- Resampling techniques: jackknife and bootstrap estimation
The statistics library certainly isn't yet comprehensive, but it has some features that I think make it very attractive as a base for further work:
- It's very fast, building on some of the fantastic software that's available on Hackage these days. I make heavy use of Don Stewart's uvector library (itself a port of Roman Leshchinskiy's vector library), which means that many functions allocate no memory and execute tight loops using only machine registers. I use Dan Doel's uvector-algorithms library to perform fast partial sorts. I also use Don's mersenne-random library for fast random number generation when doing bootstrap analysis.
- I've put a fair amount of effort into finding and using algorithms that are numerically stable (trying to avoid problems like catastrophic cancellation). Whenever possible, I indicate which methods are used in the documentation. (For more information on numerical stability, see What Every Scientist Should Know About Floating-Point Arithmetic).
For me, one of the killer features of Haskell for statistical computation is its combination of terseness and fabulous libraries. In particular, the modern collection libraries like uvector and text, which are built on stream fusion frameworks, provide a combination of expressiveness and high performance that is simply delightful to work with.
Speaking of improving the statistics library, I've already received a few patches to the statistics library even before announcing it anywhere, and I'm very much looking forward to receiving more. If you want to contribute, go get the source code and hack away:
darcs get http://darcs.serpentine.com/statistics
So now that the statistics library is in good enough shape, I can return to work on the benchmarking framework. Once that's done, I'll get back to finishing off the text library. It's been funny to see how rewriting 50 lines of code has spawned off weeks of enjoyable work on other projects! I hope that people will find all of this work useful.
By the way, the philosophy underlying the text, statistics, and benchmarking libraries is uniform across the lot of them: build and release something thatâ€”even though initially incompleteâ€”is thorough enough and of such high quality that other people will be drawn to improving your work instead of creating half a dozen tiny alternatives. So please, join in, and let the patches fly!