Uncategorized – teideal glic deisbhéalach

Big fucking deal

Bryan O'Sullivan — Wed, 01 May 2013 06:27:29 +0000

Quoth Wikipedia:

Big data[1][2] is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage,[3] search, sharing, transfer, analysis,[4] and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to “spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”

Now what if we get tired of the current hype cycle?

Big fucking deal[1][2] is a collection of deals so fucking large and complex that it becomes difficult to process using on-hand fuck giving tools or traditional shit giving techniques. The challenges include capture, curation, storage,[3] search, sharing, transfer, analysis,[4] and all kinds of who the fuck knows what else. The trend to larger fucking deals is due to the additional shit derivable from giving a fuck about a single large fucking pile of related shit, as compared to separate smaller piles with the same total amount of bullshit, allowing correlations to be found to “spot business shit, determine quality of whatever, prevent some nasty shit, link legal shit right the fuck together, combat fucking crime no I am not making this up it’s like fucking Batman, and determine real-time traffic shittiness.”

github is making me feel stupid(er)

Bryan O'Sullivan — Sun, 08 Apr 2012 17:17:38 +0000

I’m approaching my fourth anniversary of using github. I should hardly have to state that it’s a wonderful service, and especially so for being kept freely available to the open source community. At the same time, I’ve noticed over the past year or so that in many ways I feel less efficient using it now than I used to, even though the github team continues to roll out new features that make me shout “hooray!”

I doubt that these difficulties are unique to me, or even related to the fact that I’ve got a new baby (so I have the cognitive sharpness of a cotton ball). So here’s what I’m seeing; I hope that these observations are helpful to the github folks in understanding how their service is used.

Firstly, a spot of cognitive organizing: I really like the newish “issues across all of my projects” dashboard, but when I’m thinking about “stuff that’s mine”, I tend to navigate to github.com/bos, and that dashboard isn’t there. Instead, I kick myself and navigate to plain old github.com. You could reasonably respond “okay, fine, just remember that, and you’re done”. And yet somehow this knowledge refuses to stick in my head.

What I find more confusing is the visual clutter at the top of a project page. There are now seven short-but-wide horizontal rows of stuff (both information and links) at the top of a project’s main page. Here’s an annotated screenshot that I hope illustrates what I’m talking about.

I frequently find myself looking for the commits page, which is in the middle of row number 6. At least for me, there seems to be no escaping the need to scan across every row in turn until I reach row 6, where I find the word “commits”. That is, I usually find it; I can easily miss it among all the similar entries if I’m not paying close attention. I find it difficult to visually distinguish the rows at a glance, so there’s no skipping past clusters of stuff that aren’t relevant.

These aren’t killer problems by any stretch, but I do all too often find myself staring at github web pages for 30 seconds at a time, wondering “am I looking at the right page? Did I miss the row of stuff I’m looking for?” I imagine there might be a way to organize these things better, though I’m no visual designer, and I’m afraid I don’t have any crisp suggestions for what might work.

Here be dragons: advances in problems you didn’t even know you had

Bryan O'Sullivan — Wed, 29 Jun 2011 07:27:08 +0000

Here’s something I bet you never think about, and for good reason: how are floating-point numbers rendered as text strings? This is a surprisingly tough problem, but it’s been regarded as essentially solved since about 1990.

Prior to Steele and White’s "How to print floating-point numbers accurately", implementations of printf and similar rendering functions did their best to render floating point numbers, but there was wide variation in how well they behaved. A number such as 1.3 might be rendered as 1.29999999, for instance, or if a number was put through a feedback loop of being written out and its written representation read back, each successive result could drift further and further away from the original.
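To make the round-trip problem concrete, here is a small sketch in Python (chosen only for brevity; the helper name round_trips is made up for this illustration). It assumes a modern Python, whose repr produces the shortest string that reads back as the same double, and contrasts that with fixed-precision printf-style formatting.

```python
def round_trips(x, fmt):
    """True if formatting x with fmt and reading it back recovers x exactly."""
    return float(fmt % x) == x

x = 0.1 + 0.2                       # not exactly 0.3 as a double

# Too few digits: the text no longer identifies the original number.
print(round_trips(x, "%.6g"))       # False

# 17 significant digits are always enough for a double, but often ugly.
print(round_trips(x, "%.17g"))      # True

# Shortest round-tripping rendering versus fixed 17-digit rendering.
print(repr(0.1), "%.17g" % 0.1)     # 0.1 0.10000000000000001
```

The interesting part of the problem is the last line: finding the short string that still reads back exactly, rather than dumping every digit.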

Steele and White effectively solved the problem with a clever algorithm named "Dragon4" (the fourth version of the "Dragon" algorithm, which acquired its name because the authors were inspired to obscure puns by Heighway’s dragon curve).

The Dragon4 algorithm spread quickly across language runtimes, such that few programmers today understand that this was ever a problem, much less how hairy it was (and is). Indeed, prior to last year, there was almost no activity in this area: two papers proposed widely used refinements to Dragon4, and that was about it. (Alas, the problem was originally solved around a decade before Steele and White published their work, but nobody noticed. If you have a clever idea and sufficient chutzpah, try to enlist Guy Steele as a coauthor. Your work will be read.)

But how solved was the problem? Dragon4 and its derivatives are complicated and tricky, and they have a hefty performance cost, since they rely on arbitrary-precision integer arithmetic to compute their results. There might be a significant performance improvement to be gained if someone could figure out how to use native machine integers instead.
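One way to see where the arbitrary-precision arithmetic comes from: the exact value of a double is a ratio of integers with a power-of-two denominator, and for very large or very small doubles those integers are far wider than a 64-bit machine word. The Python sketch below only illustrates that observation and is not part of Dragon4 itself; the helper name bits_needed is hypothetical.

```python
def bits_needed(x):
    """Bits in the larger of the numerator/denominator of x's exact value."""
    num, den = x.as_integer_ratio()
    return max(num.bit_length(), den.bit_length())

# A mid-range value fits in a machine word; the extremes do not.
for x in (0.1, 1e300, 5e-324):
    print(x, bits_needed(x))
```

Here 0.1 needs 56 bits, while the smallest subnormal double (5e-324) needs 1075, which is why exact digit generation over the full range reaches for bignums.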

In 2010, Florian Loitsch published a wonderful paper in PLDI, "Printing floating-point numbers quickly and accurately with integers", which represents the biggest step in this field in 20 years: he mostly figured out how to use machine integers to perform accurate rendering! Why do I say "mostly"? Because although Loitsch’s "Grisu3" algorithm is very fast, it gives up on about 0.5% of numbers, in which case you have to fall back to Dragon4 or a derivative.

If you’re a language runtime author, the Grisu algorithms are a big deal: Grisu3 is about 5 times faster than the algorithm used by printf in GNU libc, for instance. A few language implementors have already taken note: Google hired Loitsch, and the Grisu family serves as the default rendering algorithm in both the V8 and Mozilla JavaScript engines (replacing David Gay’s 17-year-old dtoa code). Loitsch has kindly released implementations of his Grisu algorithms as a library named double-conversion.

And of course I can’t talk about performance without mentioning Haskell somewhere :-) I’ve taken Loitsch’s library and written a Haskell interface, which I’ve measured to be 30 times faster than the default renderer used in the Haskell runtime libraries. This has some nice knock-on effects: my aeson JSON library is now 10 times faster at rendering big arrays of floating point numbers, for instance. I accidentally noticed in the course of that work that my Haskell text Unicode library’s UTF-8 encoder wasn’t as fast as it could be, so I improved its performance by about 50% along the way. Hooray for faster code!

(By the way, the punnery in algorithm naming continues: the Grisu algorithms are named for Grisù, the little dragon.)

text 0.10.0.0 is here

Bryan O'Sullivan — Fri, 22 Oct 2010 05:36:29 +0000

Haskell in the Real World discussion at CUFP

Bryan O'Sullivan — Wed, 25 Aug 2010 17:20:20 +0000

What’s in a parser? Attoparsec rewired (2/2)

Bryan O'Sullivan — Wed, 03 Mar 2010 18:03:27 +0000

In the first of this pair of articles, I laid out some of the qualities I've been looking for in a parsing library.

Before I dive back into detail, I want to show off some numbers. The new Attoparsec code is fast.

What did I benchmark? I captured some real HTTP GET requests from a live public web server, averaging 431 bytes per request. I chucked them into a file, and measured the time needed to parse the entire contents of the file with the following libraries:

Ryan Dahl's http-parser library, which is 1,672 lines of hand-rolled C craziness. I mean "craziness" in a special sort of awed way, with that hushed voice you reserve for the guy who builds the scale model of Neuschwanstein Castle in matchsticks. The library's heritage is closely based on the Ragel-generated parser used by Mongrel. This library is a fair approximation to about as fast as you can get, since it's been tuned for just one purpose. I wrote a smallish, but reasonably realistic, driver program to wire it up to file-based data, adding another 210 lines of code.
An Attoparsec-based HTTP request parser, 54 lines long, with about 30 lines of driver program. (Attoparsec itself is about 900 lines of code.)
Several Parsec-3-based parsers, which are almost identical in length to the Attoparsec-based version.

The Parsec 3 parsers come in three varieties:

The fastest uses a patch that Antoine Latter wrote to switch Parsec 3's internal machinery over to using continuation passing style (CPS). This parser uses ByteString for input, and reads the entire 18.8MB file in one chunk.
Next is the same parser, using lazy ByteString I/O to read the file in 64KB chunks. This costs about 40% in performance, but is almost mandatory to maintain a reasonable memory footprint on large inputs.
In last place is the official version of Parsec 3, reading the input in one chunk. (Reading lazily still costs an additional 40%, but I didn't want to further clutter the chart with more big numbers.)

What's interesting to me is that the tiny Attoparsec-based parser, which is more or less a transliteration of the relevant parts of RFC 2616, is so fast.

I went back and remeasured performance of the Attoparsec and C parsers on a larger data set (295,568 URLs), and got these numbers:

Attoparsec: 2.889 seconds, or 102,308 requests/second
C: 1.614 seconds, or 183,128 requests/second

That clocks the Attoparsec-based parser at about 56% of the speed of the C parser. Not bad, given that the parser itself has about 3.2% of the number of lines of code (54 versus 1,672)!
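The arithmetic behind these figures is easy to check. The Python snippet below uses only the numbers quoted in this post; the variable names are just for illustration, and the line-count ratio compares the 54-line parser with the 1,672-line C library, leaving both driver programs out.

```python
requests = 295_568                  # size of the larger data set
atto_secs, c_secs = 2.889, 1.614    # measured wall-clock times

atto_rate = requests / atto_secs    # requests per second, Attoparsec
c_rate = requests / c_secs          # requests per second, C parser

relative_speed = atto_rate / c_rate # fraction of the C parser's speed
loc_ratio = 54 / 1672               # parser size relative to the C library

print(round(atto_rate), round(c_rate))
print(f"{relative_speed:.0%} of the speed, {loc_ratio:.1%} of the code")
```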

Of course there are tradeoffs involved here.

Parsec 3 emits much more friendly error messages, and can handle many different input types. Attoparsec, being aimed at plumbing-oriented network protocols, considers friendly error messages to not be worth the effort, and is specialised to the arrays of bytes you get straight off the network.
Parsec 3 requires all of its input to be available when the parser is run (either in one large chunk or via lazy I/O). If Attoparsec has insufficient data to return a complete result, it hands back a continuation that you provide extra data to. This eliminates the need for lazy I/O and any additional buffering, and makes for a beautiful, pure API that doesn't care what its input source is.
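The continuation idea in the second point can be sketched in a few lines. What follows is a Python illustration of the general technique, not Attoparsec's actual API: the names Done, Partial, feed and parse_line are all hypothetical, and the parser only handles a single CRLF-terminated line.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Done:
    value: bytes    # the parsed line, without its CRLF
    rest: bytes     # input left over for the next parser

@dataclass
class Partial:
    feed: Callable  # continuation: call it with the next chunk of input

def parse_line(data):
    """Parse one CRLF-terminated line, asking for more input if needed."""
    end = data.find(b"\r\n")
    if end >= 0:
        return Done(data[:end], data[end + 2:])
    # Not enough input yet: hand back a continuation that remembers
    # what has been seen so far and resumes when given more bytes.
    return Partial(lambda more: parse_line(data + more))

result = parse_line(b"GET / HT")            # first chunk off the wire
assert isinstance(result, Partial)
result = result.feed(b"TP/1.1\r\nHost: x")  # second chunk completes the line
print(result)
```

The caller decides where each chunk comes from and when to supply it, so the parser never needs to know about sockets, files, or lazy I/O. This toy version rescans the accumulated input on every chunk; a serious implementation would resume where it left off.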

The memory footprint of the Attoparsec-based parser is small: it will run in 568KB of heap on my 64-bit laptop. The smallest heap size that the Parsec 3 parser can survive in isn't all that much larger: with lazily read input, it will run in a 750KB heap.

Overall, this is yet another instance where a little careful attention to performance yields very exciting results. Personally, I'd be quite happy to trade a 97% reduction in code size for such a small performance hit, especially given the clarity, ease of use, and flexibility of the resulting code. (The http_parser API is frankly not so much fun to use, even though I completely understand the motivation behind it.)

What’s in a parsing library? (1/2)

Bryan O'Sullivan — Wed, 03 Mar 2010 07:12:28 +0000

My goal in working on the new GHC I/O manager has been to get the Haskell network stack into a state where it could be used to attack high-performance and scalable networking problems, domains in which it has historically been weak.

While it's encouraging to have an excellent networking stack (Johan and I now have this thoroughly in hand), the next thing I'd look for is libraries to help build networked applications. One of the fundamental things that such apps need to do well is parse data, be it received from the network or read from files.

The Haskell parsing library of first resort has for years been Parsec. While other capable libraries exist (e.g. polyparse and uu-parsinglib), they don't appear to see much use.

As appealing as Parsec's API is, it has a few problems:

Parsec 2 is slow, and it has high memory overhead, due to its use of Haskell's String type for tokens. Parsec 3 can use the more efficient ByteString type (which is in any case much more appropriate for networked applications that deal in octets), but it achieves this flexibility at the cost of being even slower than Parsec 2.
Parsec's API demands that all of a parser's input be available at once. People usually work around this by feeding a Parsec parser with lazily read data, but lazy I/O is at odds with my goal of writing solid networked code.

What properties should a parsing library for networked applications ideally possess? There are a few obvious desiderata that have been well known for years. For example, it's important to have an appealing API and programming model. Parsec squarely fits this desire.

Performance is also a big consideration. Ideally, a parsing library would be fast enough that you wouldn't feel any real need for either of the following:

A few weeks to write an insane hand-bummed parser.
Mechanical parser generators or lexers (e.g. happy or alex).

There are some additional important constraints on a realistic library: it must fit well into a highly concurrent networked world full of unreliable, hostile and incompetent clients.

High concurrency levels demand a low per-connection memory footprint.
The need to cope with poorly behaved clients requires that applications must be able to throttle connections that are too busy, or kill connections that are too slow or attempting to consume too many server resources. A good parsing library will not get in the way of these needs.

A few years ago, I made a few half-hearted attempts to write a specialised version of Parsec, which I eventually named Attoparsec.

I began with a stripped-down Parsec that was specialised to accept ByteString input. I then extended the API to allow a parser to consume small chunks of input at a time.

Because I wasn't using Attoparsec "in anger" at the time, I made sure that my library worked (more or less), but I was not measuring its performance.

In late January of this year, I began to think about using Attoparsec as the parser for a simple HTTP server that I could use to benchmark our new GHC I/O manager code. Clearly, I'd want the parser to perform well, or it would distort my numbers rather badly.

By coincidence, John MacFarlane emailed me around the same time, with disturbing findings: he'd tried Attoparsec, and found its performance to be terrible! In fact, it was 4 to 20 times slower than plain Parsec with his experimental parser and test data. Clearly, I had some hard work to look forward to.

Happily, that work is now almost complete, and I am pleased with the results. In the next post, I'll have some details of what this all entails.

Minuscule linkscrape of mischief

Bryan O'Sullivan — Sat, 30 Jan 2010 00:32:26 +0000

While I’ve been in my corner hacking on low-level Haskell nonsense, apparently someone figured out how to make the internets more better.

To wit, a few judiciously curated sources of visual edification:

for great justice

unhappy hipsters

riot right click

New GHC I/O manager, first sets of benchmark numbers

Bryan O'Sullivan — Fri, 22 Jan 2010 07:28:31 +0000

Progress on GHC’s I/O manager

Bryan O'Sullivan — Mon, 11 Jan 2010 08:52:53 +0000

Over the past couple of weeks, I have been working with Johan Tibell on an event library to use for replacing GHC’s existing I/O manager. The work has been progressing rather nicely: I now have both the epoll and kqueue back ends working, while Johan has been focusing on a fast priority queue data structure for managing timeouts.

We’ve been working from a pair of git repositories:

Johan’s is at http://github.com/tibbe/event
Mine is at http://github.com/bos/event

(Incidentally, I’m using the excellent hg-git plugin for Mercurial, which has allowed me to continue to avoid using git. It’s been working very well. Many thanks to Scott Chacon and Augie Fackler for their work on it!)

Of course, now that we have the basic plumbing working, I have a few numbers to report. These are all for sending and receiving 1,000,000 messages over Unix pipes.

On 64-bit Linux, I can easily create 100,000 pipes (i.e. 200,000 file descriptors), and pass all messages through them in 6.69 seconds.
Under 32-bit Snow Leopard, for some reason I can’t create more than 2,048 pipes. Passing all messages takes 7.41 seconds.
With the same 2,048 pipes under Linux, the time required is 4.15 seconds.
Under Linux, if I try to create a very large number of pipes, I get an error message “VFS: file-max limit 291248 reached”, and as I just discovered, the machine starts to misbehave. Good times!
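For a feel of what this kind of test involves, here is a minimal, Unix-only sketch in Python. It is not the benchmark that produced the numbers above, and everything in it is an assumption made for illustration: the function name pass_messages, the one-byte messages, and the round-robin batching. The standard library's selectors.DefaultSelector picks epoll on Linux and kqueue on BSD and macOS.

```python
import os
import selectors

def pass_messages(num_pipes, num_messages):
    """Send one-byte messages across num_pipes pipes; return how many arrive."""
    sel = selectors.DefaultSelector()       # epoll or kqueue where available
    pipes = [os.pipe() for _ in range(num_pipes)]
    for r, _w in pipes:
        os.set_blocking(r, False)
        sel.register(r, selectors.EVENT_READ)
    sent = received = 0
    try:
        while received < num_messages:
            # Write one message to each pipe in this batch.
            batch = min(num_pipes, num_messages - sent)
            for i in range(batch):
                os.write(pipes[i][1], b"x")
            sent += batch
            # Read from whichever pipes the selector reports as ready.
            pending = batch
            while pending:
                for key, _events in sel.select():
                    data = os.read(key.fd, 4096)
                    pending -= len(data)
                    received += len(data)
    finally:
        sel.close()
        for r, w in pipes:
            os.close(r)
            os.close(w)
    return received

print(pass_messages(num_pipes=8, num_messages=100))
```

Scaling num_pipes up runs into the same per-process and system-wide file descriptor limits mentioned above, since every pipe costs two descriptors.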

These are pretty gratifying numbers to have. This has been a very fun project so far, both in the technical parts and the dealings with the other people involved. I’m looking forward to continuing to work on it with Johan and others!