open source – teideal glic deisbhéalach

criterion 1.0

Bryan O'Sullivan — Fri, 08 Aug 2014 10:02:48 +0000

Almost five years after I initially released criterion, I'm delighted to announce a major release with a large number of appealing new features.

As always, you can install the latest goodness using cabal install criterion, or fetch the source from github.

Please let me know if you find criterion useful!

New documentation

I built both a home page and a thorough tutorial for criterion. I've also extended the inline documentation and added a number of new examples.

All of the documentation lives in the github repo, so if you'd like to see something improved, please send a bug report or pull request.

New execution engine

Criterion's model of execution has evolved, becoming vastly more reliable and accurate. It can now measure events that take just a few hundred picoseconds.

benchmarking return ()
time                 512.9 ps   (512.8 ps .. 513.1 ps)

While almost all of the core types have changed, criterion should remain API-compatible with the vast majority of your benchmarking code.

New metrics

In addition to wall-clock time, criterion can now measure and regress on the following metrics:

CPU time
CPU cycles
bytes allocated
number of garbage collections
number of bytes copied during GC
wall-clock time spent in mutator threads
CPU time spent running mutator threads
wall-clock time spent doing GC
CPU time spent doing GC

Linear regression

Criterion now supports linear regression of a number of metrics.

Here's a regression conducted using --regress cycles:iters:

cycles:              1.000 R²   (1.000 R² .. 1.000 R²)
  iters              47.718     (47.657 .. 47.805)

The first line of the output is the R² goodness-of-fit measure for this regression, and the second is the number of CPU cycles (measured using the rdtsc instruction) to execute the operation in question (integer division).

This next regression uses --regress allocated:iters to measure the number of bytes allocated while constructing an IntMap of 40,000 values.

allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              4.382e7    (4.379e7 .. 4.384e7)

(That's a little under 42 megabytes.)
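To make the two numbers in each report concrete, here is a minimal sketch of an ordinary least-squares fit that returns a slope and an R² value. It illustrates the general technique only and does not describe criterion's own code; the name fitLine and the sample data are made up for illustration.

```haskell
-- | Fit y = a + b*x by ordinary least squares.
-- Returns (intercept, slope, R²). Assumes at least two distinct x values.
fitLine :: [(Double, Double)] -> (Double, Double, Double)
fitLine pts = (a, b, r2)
  where
    n     = fromIntegral (length pts)
    xs    = map fst pts
    ys    = map snd pts
    meanX = sum xs / n
    meanY = sum ys / n
    sxx   = sum [ (x - meanX) ^ (2 :: Int) | x <- xs ]
    sxy   = sum [ (x - meanX) * (y - meanY) | (x, y) <- pts ]
    b     = sxy / sxx                 -- slope: cost per iteration
    a     = meanY - b * meanX         -- intercept: fixed overhead
    ssRes = sum [ (y - (a + b * x)) ^ (2 :: Int) | (x, y) <- pts ]
    ssTot = sum [ (y - meanY) ^ (2 :: Int) | y <- ys ]
    r2    = 1 - ssRes / ssTot         -- 1.0 means a perfect linear fit

-- Hypothetical (iterations, cycles) measurements lying exactly on
-- cycles = 100 + 48 * iters.
samples :: [(Double, Double)]
samples = [(1, 148), (2, 196), (4, 292), (8, 484)]
```

Evaluating fitLine samples in GHCi recovers the slope that the made-up data was built from, with R² equal to 1 because the points lie exactly on a line.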

New outputs

While its support for active HTML has improved, criterion can also now output JSON and JUnit XML files.

New internals

Criterion has received its first spring cleaning, and is much easier to understand as a result.

Acknowledgments

I was inspired into some of this work by the efforts of the authors of the OCaml Core_bench package.

A major upgrade to attoparsec: more speed, more power

Bryan O'Sullivan — Sat, 31 May 2014 07:34:55 +0000

I’m pleased to introduce the third generation of my attoparsec parsing library. With a major change to its internals, it is both faster and more powerful than previous versions, while remaining backwards compatible.

Comparing to C

Let’s start with a speed comparison between the hand-written C code that powers Node.js’s HTTP parser and an idiomatic Haskell parser that uses attoparsec. There are good reasons to take these numbers with a fistful of salt, so imagine huge error bars, warning signs, and whatnot—but they’re still interesting.

A little explanation is in order for why there are two entries for http-parser. The “null” driver consists of a series of empty callbacks, and represents the best possible performance we can get. The “naive” http-parser driver allocates memory for both a request and each of its headers, and frees this memory once a request parse has finished. (A real user of http-parser is likely to be slower than the naive driver, as http-parser forces its clients to do complex book-keeping.)

Meanwhile, the attoparsec parser is of course tiny: a few dozen lines of code, instead of a few thousand. More interestingly, it’s faster than its do-nothing C counterpart. When I last compared the two, back in 2010, attoparsec was a little over half the speed of http-parser, so to pass it feels like an exciting development.

To be clear, you really shouldn’t treat comparing the two as anything other than a fast-and-loose exercise. The attoparsec parser does less work in some ways, for instance by not special-casing the Content-Length header. At the same time, it does more work in a different, but perhaps more important case: there’s no equivalent of the maze of edge cases that arise with http-parser when a parse spans a boundary between blocks of input. The attoparsec programming model is simply way less hairy.

Caveats aside, my purpose with this comparison is to paint with broad strokes what I hope is a compelling picture: you can write a compact, clean parser using attoparsec, and you can expect it to perform well.

Speed improvements

Compared to the previous version of attoparsec, the new internals of this version yield some solid speedups. On attoparsec’s own microbenchmark suite, speedups range from flat to nearly 2x.

If you use the aeson JSON library to parse JSON data that contains a lot of numbers, you can expect a nice boost in performance.

Space usage

In addition to being faster, attoparsec is now generally more space efficient too. In a test of an application that uses Johan Tibell’s cassava library for handling CSV files, the app used 39% less memory with the new version of attoparsec than before, while running 5% faster.

New API fun

The new internals of attoparsec allowed me to add a feature I’ve wanted for a few years, one I had given up on as impossible with the previous internals.

match :: Parser a -> Parser (ByteString, a)

Given an arbitrary parser, match returns both the result of the parse and the string that it consumed while matching.

>>> let p = (,) <$> decimal <*> ("," *> decimal)
>>> parseOnly (match p) "1,31337"
Right ("1,31337",(1,31337))

This is very handy when what you’re interested in is not just the components of a parse result, but also the precise input string that the parser matched. (Imagine using this to save the contents of a comment while parsing a programming language, for instance.)

The old internals

What changed to yield both big performance improvements and previously impossible capabilities? To understand this, let’s discuss how attoparsec worked until today.

The age-old way to write parser libraries in Haskell is to treat parsing as a job of consuming input from the front of a string. If you want to match the string "foo" and your input is "foobar", you pull the prefix from "foobar" and hand "bar" to your successor parser as its input. This is how attoparsec used to work, and we’ll see where it becomes relevant in a moment.

One of attoparsec’s major selling points is that it works with incomplete input. If we give it insufficient input to make a decision about what to do, it will tell us.

>>> parse ("bar" <|> "baz") "ba"
Partial _

If we get a Partial constructor, we resume parsing by feeding more input to the continuation it hands us. The easiest way is to use feed:

>>> let cont = parse ("bar" <|> "baz") "ba"
>>> cont `feed` "r"
Done "" "bar"

Continuations interact in an interesting way with backtracking. Let’s talk a little about backtracking in isolation first.

>>> let lefty = Left <$> decimal <* ".!"
>>> let righty = Right <$> rational

The parser lefty will not succeed until it has read a decimal number followed by some nonsense.

Suppose we get partway through a parse on input like this.

>>> let cont = parse (lefty <|> righty) "123."
>>> cont
Partial _

Even though the decimal portion of lefty has succeeded, if we feed the string "1!" to the continuation, lefty as a whole will fail, parsing will backtrack to the beginning of the input, and righty will succeed.

>>> cont `feed` "1!"
Done "!" Right 123.1

What’s happening behind the scenes here is important.

Under the old version of attoparsec, parsing proceeds by consuming input. By the time we reach the "." in the input of "123.", we have thrown away the leading "123" as a result of decimal succeeding, so our remaining input is "." when we ask for more.

The <|> combinator holds onto the original input in case a parse fails. Since a parse may need to ask for more input before it fails (as in this case), the old attoparsec has to keep track of this additional continuation-fed input separately, and glue the saved and added inputs together on each backtrack. Worse yet, sometimes we have to throw away added input in order to avoid double-counting it.

This surely sounds complicated and fragile, but it was the only scheme I could think of that would work under the “parsing as consuming input” model that attoparsec started with. I managed to make this setup run fast enough that (once I’d worked the bugs out) I wasn’t too bothered by the additional complexity.

From strings to buffers and cursors

The model that attoparsec used to follow was that we consumed input, and for correctness when backtracking did our book-keeping of added input separately.

Under the new model, we manage input and added input in one unified Buffer abstraction. We track our position using a separate cursor, which is simply an integer index into a Buffer.

If we need to backtrack, we simply hand the current Buffer to the alternate parser, along with the cursor that will restart parsing at the right spot.

The idea of parsing with a cursor isn’t mine; it came up during a late night IRC conversation with Ed Kmett. I’m excited that this change happened to make it easy to add a new combinator, match, which had previously seemed impossible to write.

match :: Parser a -> Parser (ByteString, a)

In the new cursor-based world, all we need to build match is to remember the cursor position when we start parsing. If the parse succeeds, we extract the substring that spans the old and new cursor positions. I spent quite a bit of time pondering this problem with the old representation without getting anywhere, but by changing the internal representation, it suddenly became trivial.

Switching to the cursor-based representation accounts for some of the performance improvements in the new release, as it opened up a few new avenues for further small tweaks.
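The cursor idea can be sketched with a toy parser type. This is not attoparsec's implementation: the Parser newtype below, its String buffer, and the helpers digits, char, pair, and orElse are all simplifications invented for illustration. The point is only that match and backtracking both become a matter of remembering an integer.

```haskell
import Data.Char (isDigit)

-- A toy cursor-based parser: the input is never consumed, only indexed.
newtype Parser a = Parser
  { runParser :: String -> Int -> Maybe (Int, a) }

-- | Parse one or more decimal digits starting at the cursor.
digits :: Parser Int
digits = Parser $ \buf pos ->
  case takeWhile isDigit (drop pos buf) of
    [] -> Nothing
    ds -> Just (pos + length ds, read ds)

-- | Match a single literal character at the cursor.
char :: Char -> Parser ()
char c = Parser $ \buf pos ->
  case drop pos buf of
    (x:_) | x == c -> Just (pos + 1, ())
    _              -> Nothing

-- | Run two parsers in sequence and pair their results.
pair :: Parser a -> Parser b -> Parser (a, b)
pair (Parser p) (Parser q) = Parser $ \buf pos -> do
  (pos1, a) <- p buf pos
  (pos2, b) <- q buf pos1
  Just (pos2, (a, b))

-- | Return the result together with the slice of the buffer lying
-- between the starting cursor and the final cursor.
match :: Parser a -> Parser (String, a)
match (Parser p) = Parser $ \buf start -> do
  (end, a) <- p buf start
  Just (end, (take (end - start) (drop start buf), a))

-- | Backtracking: on failure, hand the same buffer and the same
-- starting cursor to the alternative parser.
orElse :: Parser a -> Parser a -> Parser a
orElse (Parser p) (Parser q) = Parser $ \buf pos ->
  case p buf pos of
    Nothing -> q buf pos
    ok      -> ok
```

With these definitions, runParser (match (pair digits (pair (char ',') digits))) "1,31337" 0 yields the matched text "1,31337" alongside the parsed numbers, mirroring the earlier match example.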

Bust my buffers!

There’s another implementation twist, though: why is the Buffer type not simply a ByteString? Here, the question is one of efficiency, specifically behaviour in response to pathologically crafted inputs.

Every time someone feeds us input via the Partial continuation, we have to add this to the input we already have. The obvious thing to do is treat Buffer as a glorified ByteString and simply string-append the new input to the existing input and get on with life.

Troublingly, this approach would require two string copies per append: we’d allocate a new string, copy the original string into it, then tack the appended string on the end. It’s easy to see that this has quadratic time complexity, which would allow a hostile attacker to DoS us by simply drip-feeding us a large volume of valid data, one byte at a time.

The new Buffer structure addresses such attacks by exponential doubling, such that most appends require only one string copy instead of two. This improves the worst-case time complexity of being drip-fed extra input from O(n²) to O(n log n).
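A toy cost model shows how much copying the doubling strategy saves. It only counts bytes copied and is not attoparsec's Buffer; the function names and the growth rule (double the new length whenever an append does not fit) are assumptions made for this sketch.

```haskell
import Data.List (foldl')

-- | Bytes copied when every append allocates a fresh string and copies
-- both the existing contents and the new chunk into it.
-- The argument is the list of chunk sizes, in arrival order.
naiveCopies :: [Int] -> Int
naiveCopies = snd . foldl' step (0, 0)
  where
    step (len, copied) m = (len + m, copied + len + m)

-- | Bytes copied when the buffer keeps spare capacity and grows to
-- twice the required length whenever an append does not fit.
doublingCopies :: [Int] -> Int
doublingCopies chunks = copiedOut
  where
    (_, _, copiedOut) = foldl' step (0, 0, 0) chunks
    step (len, cap, copied) m
      -- Fits in the spare capacity: copy only the new chunk.
      | len + m <= cap = (len + m, cap, copied + m)
      -- Does not fit: copy the old contents and the new chunk.
      | otherwise      = (len + m, 2 * (len + m), copied + len + m)
```

For 1024 one-byte appends, naiveCopies (replicate 1024 1) counts 524800 bytes copied, while doublingCopies (replicate 1024 1) counts 3050 under this model.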

Preserving safety and speed

Making this work took a bit of a hack. The Buffer type contains a mutable array that contains both an immutable portion (visible to users) and an invisible mutable part at the end. Every time we append, we write to the mutable array, and hand back a Buffer that widens its immutable portion to include the part we just wrote to. The array is shared across successive Buffers until we run out of space.

This is very fast, but it’s also unsafe: nobody should ever append to the same Buffer twice, as the sharing of the array can lead to data corruption. Let’s think about how this could arise. Our original Buffer still thinks it can write to the mutable portion of an array, while our new Buffer considers the same area of memory to be immutable. If we append to the original Buffer again, we will scribble on memory that the new Buffer thinks is immutable.

Since neither our choice of API nor Haskell’s type system can prevent bad actions here, users are free to make the programming error of appending to a Buffer more than once, even though it makes no sense to do so. It’s not satisfactory to have pure code react badly even when the programmer is doing something wrong, so I addressed this problem in an interesting way.

The immutable shell of a Buffer contains a generation number. We embed a mutable generation number in the shared array that each Buffer points to. We increment the mutable generation number every time we append to a Buffer, and hand back a Buffer that also has an incremented immutable generation number.

The mutable and immutable generation numbers should always agree. If they fall out of sync, we know that someone is appending to a Buffer more than once. We react by duplicating the mutable array, so that the new append cannot interfere with the existing array. This amounts to a cheap copy-on-write scheme: copies never occur in the typical case of users behaving sensibly, while we preserve correctness if a programmer starts doing daft things.
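The generation check can be modelled in a few lines. The sketch below runs in IO with IORef cells standing in for the shared mutable array, whereas the real Buffer hides this behind a pure interface; every name here (Store, Buffer, newBuffer, contents, append) is hypothetical.

```haskell
import Data.IORef

-- Mutable storage shared by successive buffers, plus its generation counter.
data Store = Store
  { storeGen   :: IORef Int
  , storeBytes :: IORef String   -- stand-in for a mutable byte array
  }

-- Immutable view: the first bufLen bytes of the store, stamped with the
-- generation at which this view was handed out.
data Buffer = Buffer
  { bufStore :: Store
  , bufLen   :: Int
  , bufGen   :: Int
  }

newBuffer :: String -> IO Buffer
newBuffer s = do
  g <- newIORef 0
  b <- newIORef s
  return (Buffer (Store g b) (length s) 0)

-- | The bytes visible through this view.
contents :: Buffer -> IO String
contents (Buffer st n _) = take n <$> readIORef (storeBytes st)

append :: Buffer -> String -> IO Buffer
append buf@(Buffer st n g) new = do
  current <- readIORef (storeGen st)
  if current == g
    then do
      -- Generations agree: nobody has appended past this view yet,
      -- so write after our visible prefix and bump both generations.
      modifyIORef' (storeBytes st) (\old -> take n old ++ new)
      writeIORef (storeGen st) (g + 1)
      return (Buffer st (n + length new) (g + 1))
    else do
      -- Generations disagree: this view was already appended to.
      -- Copy the visible prefix into fresh storage and append there.
      visible <- contents buf
      fresh <- newBuffer visible
      append fresh new
```

Appending "cd" and then "XY" to the same two-byte buffer "ab" gives two independent results, "abcd" and "abXY", because the second append notices the stale generation and copies.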

Assurance

Before I embarked on this redesign, I doubled the size of attoparsec’s test and benchmark suites. This gave me a fair sense of safety that I wouldn’t accidentally break code as I went.

Once the rate of churn settled down, I found the most significant packages using attoparsec on Hackage and tried them out.

This revealed that an incompatible change I’d made in the core Parser type caused quite a lot of downstream build breakage, with a third of the packages that I tried failing to build. This was a good motivator for me to learn how to fix the problem.

Once I fixed this self-imposed difficulty, all of the top packages turned out to be API-compatible with the new release. It was definitely helpful to have a tool that let me find important users of the package.

Between the expanded test suite, better benchmarks, and this extra degree of checking, I am now feeling moderately confident that the sweeping changes I’ve made should be fairly safe to inflict on people. I hope I’m right! Please enjoy the results of my work.

package	mojo	status
aeson	10000	clean
snap-core	2030	requires `--allow-newer`
conduit-extra	1816	clean
fay	1740	clean
snap	1681	requires `--allow-newer`
conduit-extra	1492	clean
persistent	1487	clean
yaml	1313	clean
io-streams	1205	requires `--allow-newer`
configurator	1161	clean
yesod-form	1077	requires `--allow-newer`
snap-server	889	requires `--allow-newer`
heist	881	requires `--allow-newer`
parsers	817	clean
cassava	643	clean

And finally

When I was compiling the list of significant packages using attoparsec, I made a guess that the Unix rev command would reverse the order of lines in a file. What it does instead seems much less useful: it reverses the bytes on each line.

Why do I mention this? Because my mistake led to the discovery that there’s a surprising number of Haskell packages whose names read at least as well backwards as forwards.

citats-dosey           revres-foornus
corpetic-codnap        rotaremune-cesrapotta
eroc-ognid             rotaremune-ptth
eroc-pans              rotarugifnoc
forp-colla-emit-chg    sloot-ipa
kramtsop               stekcosbew
morf-gnirtsetyb        teppup-egaugnal
nosea                  tropmish
revirdbew              troppus-ipa-krowten

(And finally-most-of-all, if you’re curious about where I measured my numbers, I used my 2011-era 2.2GHz MacBook Pro running 64-bit GHC 7.6.3. Server-class hardware should do way better.)

Top Haskell packages seen through graph centrality beer goggles

Bryan O'Sullivan — Sun, 18 May 2014 07:40:38 +0000

I threw together a little code tonight to calculate the Katz centrality of packages on Hackage. This is a measure that states that a package is important if an important package depends on it. The definition is recursive, as is the matrix computation that converges towards a fixpoint to calculate it.

Here are the top hundred Hackage packages as calculated by this method, along with their numeric measures of centrality, to which I’ve given the slightly catchier name “mojo” here.

This method has a few obvious flaws: it doesn’t count downloads, nor can it take into account packages that only contain executables. That said, the results still look pretty robust.
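For readers who want to see the shape of the computation, here is a minimal sketch of an iterative Katz-style update over a dependency list. It is not the code used to produce the table: the names katzStep, katz, and toyDeps, the tiny graph, and the constants passed for alpha and beta are all assumptions, and no rescaling to a 10000-point scale is attempted.

```haskell
type Package = String

-- | One round of the update: every package gets a base score beta plus
-- alpha times the current scores of the packages that depend on it.
katzStep :: Double -> Double
         -> [(Package, [Package])]   -- each package with its dependencies
         -> [(Package, Double)]      -- current scores
         -> [(Package, Double)]
katzStep alpha beta deps scores =
  [ (p, beta + alpha * sum [ s | (q, s) <- scores, dependsOn q p ])
  | (p, _) <- deps ]
  where
    dependsOn q p = maybe False (p `elem`) (lookup q deps)

-- | Iterate the update a fixed number of rounds from an all-zero start.
katz :: Int -> Double -> Double -> [(Package, [Package])] -> [(Package, Double)]
katz rounds alpha beta deps =
  iterate (katzStep alpha beta deps) [ (p, 0) | (p, _) <- deps ] !! rounds

-- A made-up dependency graph, for illustration only.
toyDeps :: [(Package, [Package])]
toyDeps =
  [ ("base", [])
  , ("text", ["base"])
  , ("attoparsec", ["base", "text"])
  , ("aeson", ["base", "text", "attoparsec"])
  ]
```

On this toy graph, katz 10 0.5 1 toyDeps settles with base scoring highest and aeson lowest, since everything depends on base and nothing depends on aeson.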

package	mojo
base	10000
ghc-prim	9178
array	1354
bytestring	1278
deepseq	1197
containers	994
transformers	925
mtl	840
text	546
time	460
filepath	441
directory	351
parsec	299
old-locale	267
template-haskell	247
network	213
process	208
vector	208
pretty	187
random	172
binary	158
QuickCheck	130
utf8-string	128
stm	119
unix	116
haskell98	100
hashable	96
attoparsec	92
old-time	88
primitive	87
aeson	72
unordered-containers	70
syb	69
data-default	67
split	64
transformers-base	63
blaze-builder	62
monad-control	62
conduit	62
semigroups	59
cereal	57
tagged	57
bindings-DSL	55
HUnit	55
gtk	54
Cabal	54
lens	50
OpenGL	46
haskell-src-exts	45
cmdargs	45
HTTP	44
http-types	43
extensible-exceptions	43
glib	42
utility-ht	41
data-default-class	38
parallel	35
resourcet	34
semigroupoids	34
xml	34
comonad	33
lifted-base	33
cairo	33
safe	32
MissingH	31
exceptions	31
base-unicode-symbols	31
ansi-terminal	31
vector-space	30
nats	30
OpenGLRaw	30
monads-tf	28
wai	28
hslogger	28
regex-compat	28
GLUT	27
void	27
blaze-html	26
hxt	25
dlist	25
zlib	25
hmatrix	24
SDL	24
case-insensitive	24
scientific	23
X11	23
tagsoup	22
regex-posix	22
HaXml	22
system-filepath	22
enumerator	22
contravariant	21
base64-bytestring	21
http-conduit	21
blaze-markup	21
MonadRandom	20
failure	20
test-framework	20
xhtml	20
distributive	19

New year, new library releases, new levels of speed

Bryan O'Sullivan — Thu, 09 Jan 2014 07:11:19 +0000

I just released new versions of the Haskell text, attoparsec, and aeson libraries on Hackage, and there’s a surprising amount to look forward to in them.

The summary for the impatient: some core operations in text and aeson are now much more efficient. With text, UTF-8 encoding is up to four times faster, while with aeson, encoding and decoding of JSON bytestrings are both up to twice as fast.

attoparsec 0.11.1.0

Perhaps the least interesting release is attoparsec. It adds a new dependency on Bas van Dijk’s scientific package to allow efficient and more accurate parsing of floating point numbers, a longstanding minor weakness. It also introduces two new functions for single-token lookahead, which are used by the new release of aeson; read on for more details.

text 1.1.0.0

The new release of the text library has much better support for encoding to a UTF-8 bytestring via the encodeUtf8 function. The new encoder is up to four times faster than in the previous major release.

Simon Meier contributed a pair of UTF-8 encoding functions that can encode to the new Builder type in the latest version of the bytestring library. These functions are slower than the new encodeUtf8 implementation, but still twice as fast as the old encodeUtf8.

Not only are the new Builder encoders admirably fast, they’re more flexible than encodeUtf8, as Builders can be used to efficiently glue together many small fragments. Once again, read on for more details about how this helped with the new release of aeson. (Note: if you don’t have the latest version of bytestring in your library environment, you won’t get the new Builder encoders.)

The second major change to the text library came about when I finally decided to expose all of the library’s internal modules. The newly exposed modules can be found in the Data.Text.Internal hierarchy. Before you get too excited, please understand that I can’t make guarantees of release-to-release stability for any functions or types that are documented as internal.

aeson 0.7.0.0

Finally, the new release of the aeson library focuses on improved performance and accuracy. We parse floating point numbers more accurately thanks once again to Bas van Dijk’s scientific library. And for performance, both decoding and encoding of JSON bytestrings are up to twice as fast as in the previous release.

On the decoding side, I used the new lookahead primitives from attoparsec to make parsing faster and less memory intensive (by avoiding backtracking, if you’re curious). Meanwhile, Simon Meier contributed a patch that uses his new Builder-based UTF-8 encoder from the text library to double encoding performance. (Encoding performance is improved even if you don’t have the necessary new version of bytestring, but only by about 10%.)

On my crummy old Mac laptop, I can decode at 30-40 megabytes per second, and encode at 100-170 megabytes per second. Not bad!

Thanks

I'd particularly like to thank Bas van Dijk and Simon Meier for their excellent contributions during this most recent development cycle. It's really a pleasure to work with such smart, friendly people.

Simon and Bas deserve some kind of an additional medal for being forgiving of my sometimes embarrassingly long review latencies: some of Simon's patches against the text library are almost two years old! (Please pardon me while I grasp at straws in my slightly shamefaced partial defence here: the necessary version of bytestring wasn't released until three months ago, so I'm not the only person in the Haskell community with long review latencies...)

Testing a UTF-8 decoder with vigour

Bryan O'Sullivan — Tue, 31 Dec 2013 05:28:12 +0000

Yesterday, Michael Snoyman reported a surprising regression in version 1.0 of my Haskell text library: for some invalid inputs, the UTF-8 decoder was truncating the invalid data instead of throwing an exception.

Thanks to Michael providing an easy repro, I quickly bisected the origin of the regression to a commit from September that added support for incremental decoding of UTF-8. That work was motivated by applications that need to be able to consume incomplete input (e.g. a network packet containing possibly truncated data) as early as possible.

The low-level UTF-8 decoder is implemented as a state machine in C to squeeze as much performance out as possible. The machine has two visible end states: UTF8_ACCEPT indicates that a buffer was completely successfully decoded, while UTF8_REJECT specifies that the input contained invalid UTF-8 data. When the decoder stops, all other machine states count as work in progress, i.e. a decode that couldn’t complete because we reached the end of a buffer.

When the old all-or-nothing decoder encountered an incomplete or invalid input, it would back up by a single byte to indicate the location of the error. The incremental decoder is a refactoring of the old decoder, and the new all-or-nothing decoder calls it.

The critical error arose in the refactoring process. Here’s the old code for backing up a byte.

    /* Error recovery - if we're not in a
       valid finishing state, back up. */
    if (state != UTF8_ACCEPT)
        s -= 1;

This is what the refactoring changed it to:

    /* Invalid encoding, back up to the
       errant character. */
    if (state == UTF8_REJECT)
        s -= 1;

To preserve correctness, the refactoring should have added a check to the new all-or-nothing decoder so that it would step back a byte if the final state of the incremental decoder was neither UTF8_ACCEPT nor UTF8_REJECT. Oops! A very simple bug with unhappy consequences.
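The difference between the three behaviours can be written down as a small model. This is a Haskell paraphrase of the C logic described above, not the library's code; State, oldStop, incrementalStop, and fixedStop are names invented for the sketch, and the Int stands in for the pointer s.

```haskell
-- Final states of the decoder's state machine: the two visible end
-- states, plus "anything else", which means a sequence was cut short.
data State = Accept | Reject | InProgress
  deriving (Eq, Show)

-- | Old all-or-nothing decoder: back up one byte unless decoding
-- finished cleanly.
oldStop :: State -> Int -> Int
oldStop st pos = if st /= Accept then pos - 1 else pos

-- | Refactored incremental decoder: back up only on definitely
-- invalid input, because a cut-short sequence may be completed later.
incrementalStop :: State -> Int -> Int
incrementalStop st pos = if st == Reject then pos - 1 else pos

-- | All-or-nothing wrapper with the missing check restored: a
-- sequence still in progress at the end of the buffer is an error too.
fixedStop :: State -> Int -> Int
fixedStop st pos
  | st == InProgress = incrementalStop st pos - 1
  | otherwise        = incrementalStop st pos
```

In this model fixedStop agrees with oldStop for every state, while incrementalStop on its own leaves the position untouched for InProgress, which is exactly the case that slipped through.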

The text library has quite a large test suite that has revealed many bugs over the years, often before they ever escaped into the wild. Why did this ugly critter make it over the fence?

Well, a glance at the original code for trying to test UTF-8 error handling is telling—in fact, you don’t even need to be able to read a programming language, because the confession is in the comment.

-- This is a poor attempt to ensure that
-- the error handling paths on decode are
-- exercised in some way.  Proper testing
-- would be rather more involved.

“Proper testing” indeed. All that I did in the original test was generate a random byte sequence, and see if it provoked the decoder into throwing an exception. The chances of such a dumb test really offering any value are not great, but I had more or less forgotten about it, and so I had a sense of security without the accompanying security. But hey, at least past-me had left a mea culpa note for present-day-me. Right?

While finding and fixing the bug took just a few minutes, I spent several more hours strengthening the test for the UTF-8 decoder, and this was far more interesting.

As a variable-length self-synchronizing encoding, UTF-8 is very clever and elegant, but its cleverness allows for a number of implementation bugs. For reference, here is a table (lightly edited from Wikipedia) of the allowable bit patterns used in UTF-8.

first code point	last code point	byte 1	byte 2	byte 3	byte 4
U+0000	U+007F	`0xxxxxxx`
U+0080	U+07FF	`110xxxxx`	`10xxxxxx`
U+0800	U+FFFF	`1110xxxx`	`10xxxxxx`	`10xxxxxx`
U+10000	U+1FFFFF	`11110xxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`

The best known of these bugs involves accepting non-canonical encodings. What a canonical encoding means takes a little explaining. UTF-8 can represent any ASCII character in a single byte, and in fact every ASCII character must be represented as a single byte. However, an illegal two-byte encoding of an ASCII character can be achieved by starting with 0xC0, followed by the ASCII character with the high bit set. For instance, the ASCII forward slash U+002F is represented in UTF-8 as 0x2F, but a decoder with this bug would also accept 0xC0 0xAF (three- and four-byte encodings are of course also possible).

This bug may seem innocent, but it was widely used to remotely exploit IIS 4 and IIS 5 servers over a decade ago. Correct UTF-8 decoders must reject non-canonical encodings. (These are also known as overlong encodings.)

In fact, the bytes 0xC0 and 0xC1 will never appear in a valid UTF-8 bytestream, as they can only be used to start two-byte sequences that cannot be canonical.

To test our UTF-8 decoder’s ability to spot bogus input, then, we might want to generate byte sequences that start with 0xC0 or 0xC1. Haskell’s QuickCheck library provides us with just such a generating function, choose, which generates a random value in the given range (inclusive).

choose (0xC0, 0xC1)

Once we have a bad leading byte, we may want to follow it with a continuation byte. The value of a particular continuation byte doesn’t much matter, but we would like it to be valid. A continuation byte always contains the bit pattern 0x80 combined with six bits of data in its least significant bits. Here’s a generator for a random continuation byte.

contByte = (0x80 +) <$> choose (0, 0x3F)

Our bogus leading byte should be rejected immediately, since it can never generate a canonical encoding. For the sake of thoroughness, we should sometimes follow it with a valid continuation byte to ensure that the two-byte sequence is also rejected.

To do this, we write a general combinator, upTo, that will generate a list of up to n random values.

upTo :: Int -> Gen a -> Gen [a]
upTo n gen = do
  k <- choose (0,n)
  vectorOf k gen -- a QuickCheck combinator

And now we have a very simple way of saying “either 0xC0 or 0xC1, optionally followed by a continuation byte”.

-- invalid leading byte of a 2-byte sequence.
(:) <$> choose (0xC0,0xC1) <*> upTo 1 contByte

Notice in the table above that a 4-byte sequence can encode any code point up to U+1FFFFF. The highest legal Unicode code point is U+10FFFF, so by implication there exists a range of leading bytes for 4-byte sequences that can never appear in valid UTF-8.

-- invalid leading byte of a 4-byte sequence.
(:) <$> choose (0xF5,0xFF) <*> upTo 3 contByte

We should never encounter a continuation byte without a leading byte somewhere before it.

-- Continuation bytes without a start byte.
listOf1 contByte
-- The listOf1 combinator generates a list
-- containing at least one element.

Similarly, a bit pattern that introduces a 2-byte sequence must be followed by one continuation byte, so it’s worth generating such a leading byte without its continuation byte.

-- Short 2-byte sequence.
(:[]) <$> choose (0xC2, 0xDF)

We do the same for 3-byte and 4-byte sequences.

-- Short 3-byte sequence.
(:) <$> choose (0xE0, 0xEF) <*> upTo 1 contByte
-- Short 4-byte sequence.
(:) <$> choose (0xF0, 0xF4) <*> upTo 2 contByte

Earlier, we generated 4-byte sequences beginning with a byte in the range 0xF5 to 0xFF. Although 0xF4 is a valid leading byte for a 4-byte sequence, it’s possible for a perverse choice of continuation bytes to yield an illegal code point between U+110000 and U+13FFFF. This code generates just such illegal sequences.

-- 4-byte sequence greater than U+10FFFF.
k <- choose (0x11, 0x13)
let w0 = 0xF0 + (k `Bits.shiftR` 2)
    w1 = 0x80 + ((k .&. 3) `Bits.shiftL` 4)
([w0,w1]++) <$> vectorOf 2 contByte
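The bit arithmetic in this generator is easy to check by decoding the bytes it produces. The sketch below repeats the two leading-byte formulas and applies the 4-byte layout from the table with no validity checks; leadBytes, decode4, and codePointRange are hypothetical helper names.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))

-- | The two leading bytes built by the generator for a given k.
leadBytes :: Int -> (Int, Int)
leadBytes k = (0xF0 + (k `shiftR` 2), 0x80 + ((k .&. 3) `shiftL` 4))

-- | Decode a 4-byte sequence purely by bit layout:
-- 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
decode4 :: Int -> Int -> Int -> Int -> Int
decode4 w0 w1 w2 w3 =
      ((w0 .&. 0x07) `shiftL` 18)
  .|. ((w1 .&. 0x3F) `shiftL` 12)
  .|. ((w2 .&. 0x3F) `shiftL` 6)
  .|.  (w3 .&. 0x3F)

-- | Smallest and largest code points reachable for a given k, using
-- the lowest (0x80) and highest (0xBF) continuation bytes.
codePointRange :: Int -> (Int, Int)
codePointRange k = (decode4 w0 w1 0x80 0x80, decode4 w0 w1 0xBF 0xBF)
  where (w0, w1) = leadBytes k
```

For k = 0x11 the leading byte is 0xF4 and the decoded range starts at 0x110000, the first code point past the legal maximum; every k from 0x11 to 0x13 stays above U+10FFFF.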

Finally, we arrive at the general case of non-canonical encodings. We take a one-byte code point and encode it as two, three, or four bytes; and so on for two-byte and three-byte characters.

-- Overlong encoding.
k <- choose (0,0xFFFF)
let c = chr k
case k of
  _ | k < 0x80  -> oneof [
          let (w,x)     = ord2 c in return [w,x]
        , let (w,x,y)   = ord3 c in return [w,x,y]
        , let (w,x,y,z) = ord4 c in return [w,x,y,z] ]
    | k < 0x800 -> oneof [
          let (w,x,y)   = ord3 c in return [w,x,y]
        , let (w,x,y,z) = ord4 c in return [w,x,y,z] ]
    | otherwise ->
          let (w,x,y,z) = ord4 c in return [w,x,y,z]
-- The oneof combinator chooses a generator at random.
-- Functions ord2, ord3, and ord4 break down a character
-- into its 2, 3, or 4 byte encoding.
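The helpers ord2, ord3, and ord4 are not shown in the test above. A minimal sketch of what such functions could look like follows; these definitions are written for illustration and are not copied from the text library. They perform no range checks, which is precisely what lets them produce overlong encodings when given a character that needs fewer bytes.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8)

-- | A continuation byte (10xxxxxx) carrying the six bits of the code
-- point that start at bit position n.
cont :: Int -> Int -> Word8
cont cp n = fromIntegral (0x80 + ((cp `shiftR` n) .&. 0x3F))

-- | Force a character into the 2-byte layout (110xxxxx 10xxxxxx).
ord2 :: Char -> (Word8, Word8)
ord2 c = (fromIntegral (0xC0 + (cp `shiftR` 6)), cont cp 0)
  where cp = ord c

-- | Force a character into the 3-byte layout.
ord3 :: Char -> (Word8, Word8, Word8)
ord3 c = (fromIntegral (0xE0 + (cp `shiftR` 12)), cont cp 6, cont cp 0)
  where cp = ord c

-- | Force a character into the 4-byte layout.
ord4 :: Char -> (Word8, Word8, Word8, Word8)
ord4 c = (fromIntegral (0xF0 + (cp `shiftR` 18)), cont cp 12, cont cp 6, cont cp 0)
  where cp = ord c
```

Applied to the forward slash, ord2 '/' gives (0xC0, 0xAF), the overlong pair mentioned earlier, and ord3 and ord4 give the corresponding three- and four-byte forms.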

Armed with a generator that uses oneof to choose one of the above invalid UTF-8 encodings at random, we embed the invalid bytestream in one of three cases: by itself, at the end of an otherwise valid buffer, and at the beginning of an otherwise valid buffer. This variety gives us some assurance of catching buffer overrun errors.

Sure enough, this vastly more elaborate QuickCheck test immediately demonstrates the bug that Michael found.

The original test is a classic case of basic fuzzing: it simply generates random junk and hopes for the best. The fact that it let the decoder bug through underlines the weakness of fuzzing. If I had cranked the number of randomly generated test inputs up high enough, I’d probably have found the bug, but the approach of pure randomness would have caused the bug to remain difficult to reproduce and understand.

The revised test is much more sophisticated, as it generates only test cases that are known to be invalid, with a rich assortment of precisely generated invalid encodings to choose from. While it has the same probabilistic nature as the fuzzing approach, it excludes a huge universe of uninteresting inputs from being tested, and hence is much more likely to reveal a weakness quickly and efficiently.

The moral of the story: even QuickCheck tests, though vastly more powerful than unit tests and fuzz tests, are only as good as you make them!

Open question: help me design a new encoding API for aeson

Bryan O'Sullivan — Tue, 15 Oct 2013 05:01:03 +0000

For a while now, I’ve had it in mind to improve the encoding performance of my Haskell JSON package, aeson.

Over the weekend, I went from hazy notion to a proof of concept for what I think could be a reasonable approach.

This post is a case of me “thinking out loud” about the initial design I came up with. I’m very interested in hearing if you have a cleaner idea.

The problem with the encoding method currently used by aeson is that it occurs via a translation to the Value type. While this is simple and uniform, it involves a large amount of intermediate work that is essentially wasted. When encoding a complex value, the Value that we build up is expensive, and it will become garbage immediately.

It should be much more efficient to simply serialize straight to a Builder, the type that is optimized for concatenating many short string fragments. But before marching down that road, I want to make sure that I provide a clean API that is easy to use correctly.

I’ve posted a gist that contains a complete copy of this proof-of-concept code.

{-# LANGUAGE GeneralizedNewtypeDeriving, FlexibleInstances,
    OverloadedStrings #-}

import Data.Monoid (Monoid(..), (<>))
import Data.Text (Text)
import Data.Text.Lazy.Builder (Builder, singleton)
import qualified Data.Text.Lazy.Builder as Builder
import qualified Data.Text.Lazy.Builder.Int as Builder

The core Build type has a phantom type parameter that allows us to say “I am encoding a value of type a”. We’ll see where this type tracking is helpful (and annoying) below.

data Build a = Build {
    _count :: !Int
  , run    :: Builder
  }

The internals of the Build type would be hidden from users; here’s what they mean. The _count field tracks the number of elements we’re encoding of an aggregate JSON value (an array or object); we’ll see why this matters shortly. The run field lets us access the underlying Builder.

We provide three empty types to use as parameters for the Build type.

data Object
data Array
data Mixed

We’ll want to use the Mixed type if we’re cramming a set of disparate Haskell values into a JSON array; read on for more.

When it comes to gluing values together, the Monoid class is exactly what we need.

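instance Monoid (Build a) where
    mempty = Build 0 mempty
    -- Insert a comma only when both sides are non-empty, so that
    -- mempty stays a true identity for mappend.
    mappend (Build i a) (Build j b)
      | i > 0 && j > 0 = Build ij (a <> singleton ',' <> b)
      | otherwise      = Build ij (a <> b)
      where ij = i + j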

Here’s where the _count field comes in; we want to separate elements of an array or object using commas, but this is necessary only when the array or object contains more than one value.

To encode a simple value, we provide a few obvious helpers. (These are clearly so simple as to be wrong, but remember: my purpose here is to explore the API design, not to provide a proper implementation.)

build :: Builder -> Build a
build = Build 1

int :: Integral a => a -> Build a
int = build . Builder.decimal

text :: Text -> Build Text
text = build . Builder.fromText

Encoding a JSON array is easy.

array :: Build a -> Build Array
array (Build 0 _)  = build "[]"
array (Build _ vs) = build $ singleton '[' <> vs <> singleton ']'
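To experiment with the pieces so far on a current GHC, something like the following self-contained sketch can be loaded into ghci. It makes two assumptions that are not part of the post: a Semigroup instance (which recent versions of base require alongside Monoid), and a hypothetical render helper for looking at the encoded text. The comma guard here requires both sides to be non-empty, so that mempty acts as an identity.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text.Lazy (Text)
import Data.Text.Lazy.Builder (Builder, singleton, toLazyText)
import qualified Data.Text.Lazy.Builder.Int as Builder

-- Phantom-typed wrapper around a Builder.
data Build a = Build
  { _count :: !Int      -- number of elements glued together so far
  , run    :: Builder   -- the encoded output
  }

data Array

-- Recent GHCs need Semigroup as a superclass of Monoid.
instance Semigroup (Build a) where
  Build i a <> Build j b
    | i > 0 && j > 0 = Build (i + j) (a <> singleton ',' <> b)
    | otherwise      = Build (i + j) (a <> b)

instance Monoid (Build a) where
  mempty = Build 0 mempty

build :: Builder -> Build a
build = Build 1

int :: Integral a => a -> Build a
int = build . Builder.decimal

array :: Build a -> Build Array
array (Build 0 _)  = build "[]"
array (Build _ vs) = build (singleton '[' <> vs <> singleton ']')

-- Hypothetical helper (not in the post): extract the encoded text.
render :: Build a -> Text
render = toLazyText . run
```

With this loaded, render (array (int 1 <> int 2)) evaluates to "[1,2]", and render (array mempty) to "[]".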

If we try this out in ghci, it behaves as we might hope.

?> array $ int 1 <> int 2
"[1,2]"

JSON puts no constraints on the types of the elements of an array. Unfortunately, our phantom type causes us difficulty here.

An expression of this form will not typecheck, as it’s trying to join a Build Int with a Build Text.

?> array $ int 1 <> text "foo"

This is where the Mixed type from earlier comes in. We use it to forget the original phantom type so that we can construct an array with elements of different types.

mixed :: Build a -> Build Mixed
mixed (Build a b) = Build a b

Our new mixed function gets the types to be the same, giving us something that typechecks.

?> array $ mixed (int 1) <> mixed (text "foo")
"[1,foo]"

This seems like a fair compromise to me. A Haskell programmer will normally want the types of values in an array to be the same, so the default behaviour of requiring this makes sense (at least to my current thinking), but we get a back door for when we absolutely have to go nuts with mixing types.

The last complication stems from the need to build JSON objects. Each key in an object must be a string, but the value can be of any type.

-- Encode a key-value pair.
(<:>) :: Build Text -> Build a -> Build Object
k <:> v = Build 1 (run k <> ":" <> run v)

object :: Build Object -> Build Object
object (Build 0 _)   = build "{}"
object (Build _ kvs) = build $ singleton '{' <> kvs <> singleton '}'

If you’ve had your morning coffee, you’ll notice that I am not living up to my high-minded principles from earlier. Perhaps the types involved here should be something closer to this:

data Object a

(<:>) :: Build Text -> Build a -> Build (Object a)

object :: Build (Object a) -> Build (Object a)

(In which case we’d need a mixed-like function to forget the phantom types for when we want to get mucky and unsafe—but I digress.)
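To get a feel for how the stricter variant might look, here is a minimal self-contained sketch. The name mixedObject is hypothetical, as are the render helper and the Semigroup instance (needed on current GHCs); none of them come from the gist.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text.Lazy as TL
import Data.Text.Lazy.Builder (Builder, fromText, singleton, toLazyText)
import qualified Data.Text.Lazy.Builder.Int as Builder

data Build a = Build { _count :: !Int, run :: Builder }

data Object a   -- phantom: the type of the values stored in the object
data Mixed

instance Semigroup (Build a) where
  Build i a <> Build j b
    | i > 0 && j > 0 = Build (i + j) (a <> singleton ',' <> b)
    | otherwise      = Build (i + j) (a <> b)

instance Monoid (Build a) where
  mempty = Build 0 mempty

build :: Builder -> Build a
build = Build 1

int :: Integral a => a -> Build a
int = build . Builder.decimal

text :: Text -> Build Text
text = build . fromText

-- A key-value pair now remembers the type of its value.
(<:>) :: Build Text -> Build a -> Build (Object a)
k <:> v = Build 1 (run k <> singleton ':' <> run v)

object :: Build (Object a) -> Build (Object a)
object (Build 0 _)   = build "{}"
object (Build _ kvs) = build (singleton '{' <> kvs <> singleton '}')

-- Hypothetical escape hatch: forget the value type of a pair.
mixedObject :: Build (Object a) -> Build (Object Mixed)
mixedObject (Build n b) = Build n b

-- Hypothetical helper for inspecting the output.
render :: Build a -> TL.Text
render = toLazyText . run
```

With these types, text "a" <:> int 1 <> text "b" <:> text "x" is rejected by the typechecker, while wrapping each pair in mixedObject lets the two be combined.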

How does this work out in practice?

?> object $ "foo" <:> int 1 <> "bar" <:> int 3
"{foo:1,bar:3}"

Hey look, that’s more or less as we might have hoped!
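The bare string literals used as keys in that session need some way to become a Build Text. One plausible route, and this is an assumption since the snippets above don't show it, is an IsString instance picked up via OverloadedStrings. A minimal sketch, with render again being a hypothetical helper:

```haskell
{-# LANGUAGE FlexibleInstances, OverloadedStrings #-}

import Data.String (IsString(..))
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import Data.Text.Lazy.Builder (Builder, fromText, singleton, toLazyText)

data Build a = Build { _count :: !Int, run :: Builder }

data Object

text :: Text -> Build Text
text = Build 1 . fromText

-- Assumed instance: lets a string literal stand in for a Build Text.
-- FlexibleInstances is needed because Text is a concrete type argument.
instance IsString (Build Text) where
  fromString = text . T.pack

(<:>) :: Build Text -> Build a -> Build Object
k <:> v = Build 1 (run k <> singleton ':' <> run v)

-- Hypothetical helper for inspecting the output.
render :: Build a -> TL.Text
render = toLazyText . run
```

With this in scope, "foo" <:> text "bar" typechecks, and the literal on the left is converted through fromString.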

Open questions, for which I appeal to you for help:

Does this design appeal to you at all?
If not, what would you change?
If yes, to what extent am I wallowing in the “types for thee, but not for me” sin bin by omitting a phantom parameter for Object?

Helpful answers welcome!

(re)announcing statprof, a statistical profiler for Python

Bryan O'Sullivan — Mon, 09 Apr 2012 18:51:52 +0000

Back in 2005, Andy Wingo wrote a neat little statistical profiler named statprof that promptly disappeared into obscurity. It has since languished almost unknown, with a handful of people writing semi-private forks that themselves seem to be dead.

Statistical profiling (also known as sampling profiling) is simple and sweet: the profiler periodically wakes up and samples the stack, then when all is done, it prints a simple report of which lines showed up most often in the profile.

Why would this matter, though? Python already has two built-in profilers: lsprof and the long-deprecated hotshot. The trouble with lsprof is that it only tracks function calls. If you have a few hot loops within a function, lsprof is nearly worthless for figuring out which ones are actually important.

A few days ago, I found myself in exactly the situation in which lsprof fails: it was telling me that I had a hot function, but the function was unfamiliar to me, and long enough that it wasn’t immediately obvious where the problem was.

After a bit of begging on Twitter and Google+, someone pointed me at statprof. But there was a problem: although it was doing statistical sampling (yay!), it was only tracking the first line of a function when sampling (wtf!?). So I fixed that, spiffed up the documentation, and now it’s both usable and not misleading. Here’s an example of its output, locating the offending line in that hot function more accurately:

  %   cumulative      self          
 time    seconds   seconds  name    
 68.75      0.14      0.14  scmutil.py:546:revrange
  6.25      0.01      0.01  cmdutil.py:1006:walkchangerevs
  6.25      0.01      0.01  revlog.py:241:__init__
  [...blah blah blah...]
  0.00      0.01      0.00  util.py:237:__get__
---
Sample count: 16
Total time: 0.200000 seconds

I have uploaded statprof to the Python package index, so it’s almost trivial to install: “easy_install statprof” and you’re up and running.

Since the code is up on github, please feel welcome to contribute bug reports and improvements. Enjoy!

aeson 0.4: easier, faster, better

Bryan O'Sullivan — Thu, 01 Dec 2011 06:20:57 +0000

After months of work, and a number of great contributions from other developers, I just released version 0.4 of aeson, the de facto standard Haskell JSON library. This is a major release, with a number of improvements. Enjoy!

Ease of use

The new decode function complements the longstanding encode function, and makes the API simpler.

New examples make it easier to learn to use the package.

Generics support

aeson’s support for data-type generic programming makes it possible to use JSON encodings of most data types without writing any boilerplate instances.

Thanks to Bas van Dijk, aeson now supports the two major schemes for doing datatype-generic programming:

the modern mechanism, built into GHC itself
the older mechanism, based on SYB (aka "scrap your boilerplate")

The modern GHC-based generic mechanism is fast and terse: in fact, its performance is generally comparable to that of hand-written and TH-derived ToJSON and FromJSON instances. To see how to use GHC generics, refer to examples/Generic.hs.

The SYB-based generics support lives in Data.Aeson.Generic, and is provided mainly for users of GHC older than 7.2. SYB is far slower (by about 10x) than the more modern generic mechanism. To see how to use SYB generics, refer to examples/GenericSYB.hs.
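For a quick taste of the GHC-based mechanism, here is a minimal sketch. The Coord type is made up for illustration, and the snippet is written against a recent aeson and GHC rather than copied from examples/Generic.hs.

```haskell
{-# LANGUAGE DeriveGeneric #-}

import Data.Aeson (FromJSON, ToJSON, decode, encode)
import GHC.Generics (Generic)

-- A made-up record type; deriving Generic is all the boilerplate needed.
data Coord = Coord { x :: Double, y :: Double }
  deriving (Eq, Show, Generic)

-- Empty instance bodies: the default methods are filled in via Generic.
instance ToJSON Coord
instance FromJSON Coord

-- Encode a value to JSON and decode it straight back.
roundTrip :: Coord -> Maybe Coord
roundTrip = decode . encode
```

Here roundTrip (Coord 1 2) should give back Just (Coord 1 2).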

Improved performance

We switched the intermediate representation of JSON objects from Data.Map to Data.HashMap, which has improved type conversion performance.
Instances of ToJSON and FromJSON for tuples are between 45% and 70% faster than in 0.3.

Evaluation control

This version of aeson makes explicit the decoupling between identifying an element of a JSON document and converting it to Haskell. See the Data.Aeson.Parser documentation for details.

The normal aeson decode function performs identification strictly, but defers conversion until needed. This can result in improved performance (e.g. if the results of some conversions are never needed), but at a cost in increased memory consumption.

The new decode' function performs identification and conversion immediately. This incurs an up-front cost in CPU cycles, but reduces memory consumption.

The future of MailRank’s open source technologies

Bryan O'Sullivan — Tue, 15 Nov 2011 22:23:53 +0000

(Cross-posted from the MailRank engineering blog.)

You may have seen my exciting news about our upcoming move to Facebook.

It’s been a total blast working on our product, and of course as we did so we released a number of open source libraries and tools. It only added to our pleasure to see so much of that code used outside of our own domain. I will continue to develop and maintain the code that we have released.

Here is a quick rundown of the code we have released, roughly ordered by significance. Yep, we wrote all of these projects in Haskell, definitely a decision that in retrospect I’m very happy about.

pronk (not yet actually released) is an application for load testing web servers. Think of it as similar to httperf or ab, only more modern, simpler to deal with, and with vastly better analytic and reporting capabilities.
configurator is a library that allows fast, dynamic reconfiguration of a Haskell application or daemon.
aeson is a JSON encoding and decoding library optimized for high performance and ease of use.
text-format is a library for printf-like text formatting.
mysql-simple is an easy-to-use client library for the MySQL database. It is several times faster than its competitors, and easier to use. It is built on top of the low-level mysql library.
riak-haskell-client is a client for the Riak decentralized data store.
blaze-textual is a library for efficiently rendering Haskell data as text.
double-conversion is a very fast library for rendering double precision floating point numbers as text, based on the code from the V8 JavaScript engine.
resource-pool is a fast resource pooling library.
snappy provides Haskell bindings to Google’s extremely fast snappy compression library.
base16-bytestring provides fast handling of base16-encoded data.
hdbc-mysql provides a MySQL transport for the HDBC database access library. (Yes, we recommend using mysql-simple instead!)

Thanks to all of you who have contributed patches and bug reports. It’s going to be an exciting future!

A major new release of the Haskell statistics library

Bryan O'Sullivan — Fri, 11 Nov 2011 04:45:36 +0000

I'm pleased to announce a major release of the Haskell statistics library, version 0.10.0.0.

I'd particularly like to thank Alexey Khudyakov for his wonderful work on this release.

New features:

Student-T, Fisher-Snedecor (F), and Cauchy-Lorentz distributions are added.
Histogram computation is added, in Sample.Histogram.
Forward and inverse discrete Fourier and cosine transforms are added, in Transform.
Root finding is added, in Math.RootFinding.

Major changes:

The Sample.KernelDensity module has been renamed, and completely rewritten to be much more robust. The older module oversmoothed multi-modal data. (The older module is still available under the name Sample.KernelDensity.Simple.)
The type classes Mean and Variance are split in two. This is required for distributions which do not have finite variance or mean.

Smaller changes:

The complCumulative function is added to the Distribution class in order to accurately assess probabilities P(X>x), which are used in one-tailed tests.
A stdDev function is added to the Variance class for distributions.
The constructor Distribution.normalDistr now takes standard deviation instead of variance as its parameter.
A bug in Quantile.weightedAvg is fixed. It produced a wrong answer if a sample contained only one element.
Bugs in quantile estimations for chi-square and gamma distribution are fixed.
Integer overflow in mannWhitneyUCriticalValue is fixed. It produced incorrect critical values for moderately large samples: sample sizes of around 20 on 32-bit machines and 40 on 64-bit ones.
A bug in mannWhitneyUSignificant is fixed. If either sample was larger than 20, it produced a completely incorrect answer.
One- and two-tailed tests in Tests.NonParametric are selected with sum types instead of Bool.
Test results are returned as an enumeration instead of Bool.
Performance improvements for Mann-Whitney U and Wilcoxon tests.
Module Tests.NonParametric is split into Tests.MannWhitneyU and Tests.WilcoxonT.
sortBy is added to Function.
Mean and variance for gamma distribution are fixed.
Much faster cumulative probability functions for Poisson and hypergeometric distributions.
Better density functions for gamma and Poisson distributions.
The function Function.create is removed. Use generateM from the vector package instead.
A function to perform approximate comparison of doubles is added to Function.Comparison.
Regularized incomplete beta function and its inverse are added to Function.