A few weeks ago, I decided that I'd like to focus for a while on getting a 1.0 release of the Haskell text library ready. That work has gone fairly well so far. I've focused on making sure that I like the API, and that a few key functions have good performance characteristics.

One such function is substring search, which is the underpinning for a number of useful functions (split a string on a delimiter, find all substrings, etc). My initial cut at that function had an obvious API, but ran in O(*nm*) time, which complexity measure I found mildly embarrassing. So I looked around for an alternative that would be faster, but still simple to implement, and settled upon Fredrik Lundh's elegant and trim adaptation of Boyer-Moore-Sunday-Horspool, which has O(*n*+*m*) complexity in the typical case.

Using QuickCheck, I was quickly able to satisfy myself that my new code was correct, so now I was faced with demonstrating to myself that the code was *fast*, too. I looked on Hackage for a good benchmarking package, but although there were several benchmarking packages available, they were all extremely simple, and I wasn't happy with any of them.

I have sort of a mutt's pedigree in computing, having spent several stints working in the high performance scientific computing world, so I felt antsy about measuring performance reliably. Over the course of a few days of hacking, I came up with a good framework that I am currently finishing off.

One of the things a good benchmarking framework needs is some statistical heavy lifting, to do things like autocorrelation analysis, kernel density function estimation, and bootstrapping, so I had to write some complicated statistical code along the way. I've now packaged that code up and released it as the statistics library on Hackage. Some of the features implemented in this first release of the statistics library include:

- Support for common discrete and continuous probability distributions (binomial, gamma, exponential, geometric, hypergeometric, normal, and Poisson)
- Kernel density estimation
- Autocorrelation analysis
- Functions over sample data
- Quantile estimation
- Resampling techniques: jackknife and bootstrap estimation

The statistics library certainly isn't yet comprehensive, but it has some features that I think make it very attractive as a base for further work:

- It's very fast, building on some of the fantastic software that's available on Hackage these days. I make heavy use of Don Stewart's uvector library (itself a port of Roman Leshchinskiy's vector library), which means that many functions allocate no memory and execute tight loops using only machine registers. I use Dan Doel's uvector-algorithms library to perform fast partial sorts. I also use Don's mersenne-random library for fast random number generation when doing bootstrap analysis.
- I've put a fair amount of effort into finding and using algorithms that are numerically stable (trying to avoid problems like catastrophic cancellation). Whenever possible, I indicate which methods are used in the documentation. (For more information on numerical stability, see What Every Scientist Should Know About Floating-Point Arithmetic).

For me, one of the killer features of Haskell for statistical computation is its combination of terseness and fabulous libraries. In particular, the modern collection libraries like uvector and text, which are built on stream fusion frameworks, provide a combination of expressiveness and high performance that is simply delightful to work with.

Speaking of improving the statistics library, I've already received a few patches to the statistics library even before announcing it anywhere, and I'm very much looking forward to receiving more. If you want to contribute, go get the source code and hack away:

darcs get http://darcs.serpentine.com/statistics

So now that the statistics library is in good enough shape, I can return to work on the benchmarking framework. Once that's done, I'll get back to finishing off the text library. It's been funny to see how rewriting 50 lines of code has spawned off weeks of enjoyable work on other projects! I hope that people will find all of this work useful.

By the way, the philosophy underlying the text, statistics, and benchmarking libraries is uniform across the lot of them: build and release something thatâ€”even though initially incompleteâ€”is thorough enough and of such high quality that other people will be drawn to improving your work instead of creating half a dozen tiny alternatives. So please, join in, and let the patches fly!

The floating point article at sun.com appears to be a reprint of a Computing Surveys article by David Goldberg:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.22.6768

For some reason Sun cites Computing Surveys but doesn’t name the original author of the article. They also say it was reprinted by permission but also have an enormous blob of legalbol claiming it is copyright by Sun!

By the way your stats library looks very interesting. I’m going to try understanding it, despite being somewhat stats-clueless (but trying to change that). Some tutorial docs (or anyway some pointers to where to find more info) would probably be helpful, though I guess I can start with Wikipedia.

Doesn’t the Boyer-Moore string searching algorithm have a complexity of O(n/m) in the average case and O(n+m) in the worst case?

I wonder if timsort can be made to work in the Unicode case.

http://en.wikipedia.org/wiki/Timsort

Thanks for the library Brian, once more very interesting code for us all to study and use.

I use R for statistical work all the time, it is robust, efficient and has packages for practically everything. At some point I’m sure (/I hope) someone will write the necessary binding layer so that Haskell can call into R via the direct C api or else through Rserve. However, I think we’ll also benefit from having a decent native Haskell implementation of standard statistical functionality; for one thing it means we won’t need to depend on an external library like R, and the functions will be easier for Haskell coders to use than requiring them to learn R, and the unique features of Haskell will provide many benefits to the implementation

So I’m rather confused as to what the “correct” configuration is, and would like to know what you think of this? It seems that the way that applications and libraries are structured, the oss community is destined to massive duplication of effort, again and again. Every language gets its own implementation of statistics functionality, some better than others, often coded by people with limited stats knowledge doing their best to add to their favourite language, even though we already have this functionality, written by statisticians. Think of the thousands of hours spent debugging and refining all the various implementations of essentially the same functionality… of course this isn’t unique to statistics libraries, BLASs are another pertinent example and there are dozens more.

Sorry for the long comment, and looking forward to the benchmarking lib.