Subscribe to
Posts
Comments

Now that I’ve got the DEFUN 2009 schedule sorted out (you are coming, aren’t you?), I’ve had time to take a breath and think about the Haskell text library again. Its API is currently a clone of the ancient and venerable Haskell list API. If you’ve used the list API to do much text processing, you’ve probably spilled more than a few tears into your whiskey. The bytestring library also mostly clones the list API, albeit with a few improvements. This state of affairs makes me somewhat sad: here we are with a fabulous language, but a 1991-era API for mangling text.

To put this state of affairs into perspective, here is a function-by-function comparison of the string manipulation APIs of Python 2.6 and Haskell. This is intentionally somewhat pessimistic: I focus on aspects of the Python API that are either absent from or not trivially reimplemented in Haskell, but not the reverse. (If the details that follow make your eyes glaze over, skip them and read on after the table below.)

PythonHaskell
x + yx `append` y
x in yx `isInfixOf` y
x < yx < y
x <= yx <= y
x == yx == y
x != yx /= y
x > yx > y
x >= yx >= y
x % (...)
x[i] x `index` i
x[i:j](j-i) `take` (i `drop` x)
hash(x)
len(x)length x
x * yy `replicate` x
x.capitalize()
x.center(y)
x.count()
x.decode()decode... family
x.encode()encode... family
x.endswith(y)y `isSuffixOf` x
x.expandtabs()
x.find(y)
x.format(...)
x.index(y)
x.isalnum()all isAlphaNum x
x.isalpha()all isAlpha x
x.isdigit()all isDigit x
x.islower()all isLower x
x.isspace()all isSpace x
x.istitle()
x.isupper()all isUpper x
x.join(y)intercalate x y
x.ljust(w)
x.lower()toLower x
x.lstrip()dropWhile isSpace
x.partition(y)break (==y) x
x.replace(y,z)
x.rfind(y)
x.rindex(y)
x.rjust(y)
x.rpartition(y)
x.rsplit(y)
x.rstrip(y)
x.split(y)
x.splitlines()lines x
x.startswith(y)y `isPrefixOf` x
x.strip()
x.swapcase()
x.title()
x.translate(y)
x.upper()toUpper x
x.zfill()

For now, I’m intentionally not looking at Python’s unicodedata or string packages, even though each contains a handful of additional useful functions.

How would I broadly categorise what’s missing from the current Haskell APIs?

  1. Formatting. The format method that’s new in Python 2.6 is well designed and extremely useful. While there are a few formatting libraries on Hackage, each has flaws which I think are substantial enough to make them undesirable for wide use. As examples of those shortcomings, I’m thinking of a lack of static type safety or a poor fit for automated translation tools.
  2. Searching and splitting text. The Haskell APIs are based on predicates over individual characters, whereas what’s usually needed is predicates over strings. In other words, don’t just find me a character; find me a substring.
  3. Parsing. I’m not overly concerned about this, since Haskell’s libraries far outshine those of Python in this area. Although they currently lack support for the text library, the Parsec and attoparsec libraries will acquire it, I’m sure, as soon as there’s demand. What would be welcome is a decent Unicode-capable regular expression engine, for those times when you just have to get yourself into trouble in the name of expediency.

I intend to address each of these areas over the coming months, and I’ll write up the APIs I intend to flesh out here before I actually implement them, to solicit feedback from the community. One step that I think I’ll probably take, for instance, is to move a few of the functions in the Data.Text module that clone the list API into a new module, Data.Text.Legacy, so that I can use the same function names in Data.Text, but with more useful types. As an example of what I have in mind, I’d be inclined to move split :: Char -> Text -> [Text] into the legacy module, and replace it with split :: Text -> Text -> [Text].

There’s something of a tension between the goals of providing a small, focused text library and getting all the API details right in a way that will make it truly useful. I find the proliferation of tiny libraries on Hackage, each providing a few little pieces of missing functionality, to be pretty dispiriting from the point of view of getting dug in and producing useful application code quickly, so I intend for the text and text-icu libraries to be broadly useful from the get-go.

If you have opinions, or better yet patches, to contribute, let’s get things rolling!

I've released version 0.3 of the Haskell text package to Hackage. It supports the new error handling API that I wrote about the other day, along with proper support for case conversion.

What is "proper support for case conversion"? Correctly converting the case of a single Unicode code point can yield one, two, or three code points. You can't just call toUpper on each character in a string, as long-time monolingual C and Python programmers are likely to expect. Here are a few examples of the several hundred non-obvious rules to follow:

  • If converting to lower case, the Latin capital letter I with dot above (U+0130) maps to the bigram "Latin small letter i" followed by "combining dot above" (U+0069 U+0307).

  • In a conversion to upper case, the German "eszett" ß (U+00DF) maps to the bigram SS. (Try u'ß'.upper() in Python for fun.)

  • There also exists "folded" case, which is used for caseless (i.e. case insensitive) conversion. For instance, the Armenian small ligature men now (U+FB13) is case folded to the bigram men now (U+0574 U+0576), while the micro sign (U+00B5) is case folded to the Greek small letter letter mu (U+03BC) instead of itself.

To make matters more fun, case conversion for a few languages of the Eastern Mediterranean requires knowledge of the current locale, and I've chosen to skip that for now. Patches from Turkish and Azeri Haskell hackers are welcome.

Case conversion might seem like an odd thing to focus on. To be considered thorough, I think that a good Unicode implementation needs support for at least the following three "big ticket" items:

  • Normalization

  • Case conversion

  • Collation

Of these three, case conversion is by far the easiest, so I hit it first with a total of a few hours of work.

I've spent a few nights improving the Haskell text library's resilience in the face of bad input. In version 0.2, the library simply calls error if presented with invalid input. This is clearly not adequate, since there's no shortage of data out there that either contains coding errors or is misidentified (e.g. ISO-8859-1 served up as UTF-8).

The new API remains substantially the same as before: if you invoke decodeUtf8 with bad input, you'll now get a UnicodeException instead of a less helpful exception via error. I've also introduced a new decodeUtf8With function which takes an error handler as first parameter. This handler is a normal Haskell function. It can do one of several things:

  • Call error or throw an exception.

  • Replace the bogus input with something else, e.g. the Unicode replacement character U+FFFD.

  • Skip over the bad input.

The new Data.Text.Encoding.Error module predefines a few useful handler functions. I've extended the other decoding APIs to behave similarly. In my benchmarks, parameterised error handling imposes no measurable runtime cost.

I haven't released the updated code yet; I'll try to publish it within the next few days.

Update 2009-05-28: My Haskell code is currently the fastest pidgits implementation, beating even the C code. It's also the briefest submission by a fairly large margin. Nice!

I purposely pay little attention to the Computer Language Benchmarks Game, but I couldn't resist this: Arnaud Payement wrote a nice Haskell implementation of the pidgits benchmark, which was quickly eclipsed by a C implementation.

The speed difference between the two is small, just 23%, so I tried to see if I could improve on Arnaud's code.

import System

pidgits n = 0%(0#(1,0,1)) where
 i%ds
  | i >= n = return ()
  | True = putStrLn (concat h ++ "\t:" ++ show j) >> j%t
  where k = i+10; j = min n k
        (h,t) | k > n = (take (n`mod`10) ds ++ replicate (k-n) " ",[])
              | True = splitAt 10 ds
j#s | n>a || r+n>=d = k#t
    | True = show q : k#(n*10,(a-(q*d))*10,d)
 where k = j+1; t@(n,a,d)=k&#038;s; (q,r)=(n*3+a)`divMod`d
j&#038;(n,a,d) = (n*j,(a+n*2)*y,d*y) where y=(j*2+1)

main = pidgits.read.head =<< getArgs

(Yes, there's some HTML damage in there. Hooray Wordpress. Sigh.)

This version is 6% slower than the fastest C implementation, but 35% the amount of code. That seems like a good tradeoff.

I'm amused by the sole difference between the fastest and second fastest C implementations. Each avoids an expensive GMP multiplication via manual strength reduction: the faster one uses multiplication by a power of two followed by an add, while the slower version (which is the same speed as my Haskell version) uses two adds. Alas, while GHC's Integer implementation uses GMP, it doesn't use the multiply-by-a-power-of-two function (mpz_mul_2exp): if it did, I'd probably have performance on par with the fastest C code.

I just released version 0.2 of the Haskell text library that I announced back in February. This version fixes a number of bugs, but much more significantly, it adds a streaming mode: you can process a huge amount of text lazily using a small, fixed amount of memory, while maintaining high performance.

In case you need to report problems or suggest improvements, there’s now a Trac instance for the text and text-icu libraries.

As always, you can easily download the source:

darcs get http://code.haskell.org/text/

As of about a week ago, O’Reilly’s production team has the manuscript of the Mercurial book. Thanks to everyone who has submitted so many comments during the writing process!

If you watch the Mercurial development tree, you’ll have noticed that over the past few years, I’ve done almost no work on Mercurial itself. The software has a healthy community of able and enthusiastic contributors, and I’ve enjoyed focusing my energy elsewhere.

But I did fire a Parthian shot a few days ago: I improved Mercurial’s I/O performance on Windows, with speedups of up to a factor of 5 for some common operations. That change should be present in the July release. Windows users, enjoy!

I had a wonderful time at the Bay Area Erlang Factory this morning, speaking to an Erlang audience about the different perspective that Haskell brings to functional programming. It was a relaxed and friendly crowd, and speaking to a receptive audience is always a thrill.

Here are the slides from my talk.

ACM SIGPLAN 2009 Developer Tracks on Functional Programming
http://www.defun2009.info/
Edinburgh, Scotland, September 3 and 5, 2009
The workshop will be held in conjunction with ICFP 2009
http://www.cs.nott.ac.uk/~gmh/icfp09.html

Important dates

Proposal Deadline: June 5, 2009, 0:00 UTC
Notification: June 19, 2009

DEFUN 2009 invites functional programmers and researchers who know how to solve problems with functional progamming to give talks and lead tutorials at the The ICFP Developer Tracks.

We want to know about your favorite programming techniques, powerful libraries, and engineering approaches you’ve used that the world should know about and apply to other projects. We want to know how to be productive using functional programming, write better code, and avoid common pitfalls.

We invite proposals for presentations in the following categories.

Lightning talks

5- to 10-minute talks that introduce exciting and promising research or techniques that may be in progress or not yet ready for widespread use, but that offer a glimpse into the near future of real world functional programming.
Examples:
  • Clustered high performance computing in a functional language
  • Making advanced type systems more accessible to working programmers
  • How and why we’re infiltrating category theory info industry

How-to talks

45-minute “how-to” talks that provide specific information on how to solve specific problems using functional programming. These talks focus on concrete examples, but provide useful information for developers working on different projects or in different contexts.
Examples:
  • “How I use Haskell for oilfield simulations.”
  • “How I replaced /sbin/init by a Scheme program.”
  • “How I hooked up my home appliances to an Erlang control system.”
  • “How I got an SML program to drive my BMW.”

General language tutorials

Half-day general language tutorials for specific functional languages, given by recognized experts for the respective languages.

Technology tutorials

Half-day tutorials on techniques, technologies, or solving specific problems in functional programming.
Examples:
  • How to make the best use of specific FP programming techniques
  • How to inject FP into a development team used to more conventional technologies
  • How to connect FP to existing libraries / frameworks / platforms
  • How to deliver high-performance systems with FP
  • How to deliver high-reliability systems with FP

Remember that your audience will include computing professionals who are not academics and who may not already be experts on functional programming.

Presenters of tutorials will receive free registration to CUFP 2009.

Submission guidelines

Submit a proposal of 150 words or less for either a 45-minute talk with a short Q&A session at the end, or a 300-word-or-less proposal for a 3-hour tutorial, where you present your material, but also give participants a chance to practice it on their own laptops.

Some advice:

  • Give it a simple and straightforward title or name; avoid fancy titles or puns that would make it harder for attendees to figure out what you’ll be talking about.
  • Clearly identify the level of the talk: What knowledge should people have when they come to the presentation or tutorial?
  • Explain why people will want to attend:
    • Is the language or library useful for a wide range of attendees?
    • Is the pitfall you’re identifying common enough that a wide range of attendees is likely to encounter it?
  • Explain what benefits attendees are expected to take home to their own projects.
  • For a tutorial, explain how you want to structure the time, and what you expect to have attendees to do on their laptops. List what software you’ll expect attendees to have installed prior to coming.

Submit your proposal in plain text electronically to defun-2009-submissions@serpentine.com by the beginning of Friday, June 5 2009, Universal Coordinated Time.

Organizers

  • Yaron Minsky (Jane Street Capital)
  • Ulf Wiger (Erlang Training and Consulting)
  • Mike Sperber - co-chair (DeinProgramm)
  • Bryan O’Sullivan - co-chair (Linden Lab)

If you’ve looked at the Mercurial book site in the past 24 hours, you’ll have noticed that both its look and the name of the book have changed.

First, the cosmetic news. The change in appearance is due to my switching over to the system I used to publish Real World Haskell. I’ve always made the source code to the Mercurial book available, but now it includes the source of the Django app and jQuery code that drives the comment system. So if you’re writing an online book, have at it!

The change in title occurs because O’Reilly Media is going to publish the book in a few months, once I’ve tidied it up (with, I hope, your help!). The content is still licensed under the OPL, so you remain free to use and modify my work under remarkably liberal terms.

O’Reilly and I are donating my royalties on this book to the Software Freedom Conservancy, so if you buy a hardcopy, you’ll be helping one of the finest causes in the software world.

This is a great time to get involved in the production and polishing of the book, and as I did with the Haskell book, I’ll give credit to everyone who provides their name. If you’re interested in helping out, be it by contributing content, corrections, typo fixes, or suggestions, you have two easy ways to do so:

  • Clone a copy of the source tree and start sending patches
  • Use the comment system on the web site to point out errors and omissions, or to make suggestions
  • For bigger stuff, report an issue

So please dive on in, and let’s make this book an even more outstanding resource for the Mercurial community!

Scintillation

I think that this video is remarkably beautiful.


SCINTILLATION, by Xavier Chassaing on Vimeo.

Next »