A Haskell library for lazy, efficient suffix tree construction, search, and traversal.
The implementation is based on the first of those described in the following paper:
- Robert Giegerich and Stefan Kurtz, “A comparison of imperative and purely functional suffix tree constructions“, Science of Computer Programming 25(2-3):187-218, 1995
The library includes functions for constructing, searching, and traversing suffix trees. It only constructs parts of the suffix tree as they are needed, thus avoiding unnecessary work. Construction typically takes O(n log n) time, with searches taking linear time.
API documentation is available here.
You can obtain releases from here:
The darcs source repository is available from here:
darcs get http://darcs.serpentine.com/suffixtree/
That’s way cool!
I’ve got a question if you don’t mind:
How do I label leaves? For example, in one application of suffix trees – Suffix Tree Clustering algorithm – leaves contain an index to documents where leaf label occurs. I could probably extend Leaf data constructor to contain labels of arbitrary type ‘b’, but ‘construct’ function gives no way to provide these labels anyway.
p.s. Is Haskell-Cafe better place for this question?
p.p.s. I’ve not read the article yet and just skimmed through API docs, so my question may be plain stupid.
Gleb, there’s a fold function that you can use to transform an existing tree. You could use this to generate a new tree of a different type that has annotations on its leaves.
Brian, I worded my question somewhat poorly.
I had in mind something like Generalized Suffix Trees that can store suffixes of more than one string. According to Gusfield’s book Ukkonen’s algorithm can be extended to Generalized Suffix Trees, but I don’t know whether it’s possible and worth the trouble to try extending Giegerich & Kurtz’s lazy construction algorithm. Basically, what is needed is a function addToSuffixTree :: STree a lbl -> ([a], lbl) -> STree a lbl.
lbl is a unique identifier of string being added (think filename or URL). fold is not appropriate for this kind of annotations – this info can only be given during construction.
The only way I can deal with this using current interface is to concat all documents into one string with unique delimiters between them. This forces me 1) to use smth. like Either Filename Char instead of Char 2) load all documents into memory or deal with peculiarities of lazy I/O. I’m afraid both issues place limit on amount of data that can be processed.
Thanks for reading this far. Your library is cool even without Generalized Suffix Tree support.
How can I use this to get k-th lexicographically small substring?