Using Bloom filters for large scale gene sequence analysis in Haskell

Ketil Malde and I submitted this paper to PADL 09. Updated 2008-10-13: the paper was accepted!

Here is the PDF copy.

And for the impatient, here is the abstract.

Analysis of biological data often involves large data sets and computationally expensive algorithms. Databases of biological data continue to grow, leading to an increasing demand for improved algorithms and data structures. Despite having many advantages over more traditional indexing structures, the Bloom filter is almost unused in bioinformatics. Here we present a robust and efficient Bloom filter implementation in Haskell, and implement a simple bioinformatics application for indexing and matching sequence data. We use this to index the chromosomes that make up the human genome, and map all available gene sequences to it. Our experiences with developing and tuning our application suggest that for bioinformatics applications, Haskell offers a compelling combination of rapid development, quality assurance, and high performance.

I’ll write a more friendly overview of the paper later. The Cliff’s Notes version: Bloom filters are almost unused in bioinformatics; they’re tremendously useful; and our Haskell code is really fast.

Right now, I have to catch a plane from Victoria back to San Francisco, now that ICFP is safely put to bed.

Tagged with: , , ,
Posted in haskell, software
5 comments on “Using Bloom filters for large scale gene sequence analysis in Haskell
  1. Dan G says:

    I think this project uses similar techniques for network forensics, http://isis.poly.edu/projects/fornet/

  2. Yatima says:

    You’ll be pleased to hear that Haskell’s laziness and type-inferencing, along with Anathem, were major topics of conversation at my date night last night. I adore geeks. That is all.

  3. Interesting paper. Aligning ESTs is a bit old-school, but there is a lot of interest in aligning many very short sequences (<30bp) sequences to the genome at high or exact thresholds. Due to its k-mer based heuristics, BLAT has not been very good at finding these matches. A lot of researchers have been turning to suffix trees and as a result they are spending a lot more time at home with their families.
    I think would be interesting to implement a short sequence alignment tool along these lines in Haskell using your Bloom filters. I’m not sure the bottleneck is in storage, but perhaps the decreased footprint could make a distributed solution more attractive.

  4. Tim Yates says:

    Nice paper :) I’m evaluating using a Bloom filter for getting my 25mer probe sequences pre-filtered into sets per chromosome rather than searching for millions of them for each chromosome in turn.. How do you ensure you are not out of phase on the words you extract from the target sequence? ie: If I read 8mer words with an overlap of 2, how do I ensure I am just not out by one base, thereby missing the words existence when running it through my hashes?

    I’ve probably got the algorithm wrong in my head… More reading required 😉

  5. I’m currently interested in the subject. Pitty that the pdf link is down.

    Is it possible for you to upload it again?

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>