Testing a UTF-8 decoder with vigour

Yesterday, Michael Snoyman reported a surprising regression in version 1.0 of my Haskell text library: for some invalid inputs, the UTF-8 decoder was truncating the invalid data instead of throwing an exception.

Thanks to Michael providing an easy repro, I quickly bisected the origin of the regression to a commit from September that added support for incremental decoding of UTF-8. That work was motivated by applications that need to be able to consume incomplete input (e.g. a network packet containing possibly truncated data) as early as possible.

The low-level UTF-8 decoder is implemented as a state machine in C to squeeze as much performance out as possible. The machine has two visible end states: UTF8_ACCEPT indicates that a buffer was completely and successfully decoded, while UTF8_REJECT specifies that the input contained invalid UTF-8 data. When the decoder stops, all other machine states count as work in progress, i.e. a decode that couldn’t complete because we reached the end of a buffer.

When the old all-or-nothing decoder encountered an incomplete or invalid input, it would back up by a single byte to indicate the location of the error. The incremental decoder is a refactoring of the old decoder, and the new all-or-nothing decoder calls it.

The critical error arose in the refactoring process. Here’s the old code for backing up a byte.

    /* Error recovery - if we're not in a
       valid finishing state, back up. */
    if (state != UTF8_ACCEPT)
        s -= 1;

This is what the refactoring changed it to:

    /* Invalid encoding, back up to the
       errant character. */
    if (state == UTF8_REJECT)
        s -= 1;

To preserve correctness, the refactoring should have added a check to the new all-or-nothing decoder so that it would step back a byte if the final state of the incremental decoder was neither UTF8_ACCEPT nor UTF8_REJECT. Oops! A very simple bug with unhappy consequences.
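
To make the relationship between the three pieces of logic concrete, here is a toy model in Haskell. It is not the library's C code: the DecoderState type, the function names, and the use of a plain Int as a buffer position are all made up for illustration.

```haskell
-- Toy model of the back-up logic; not the library's C code.
-- DecoderState and every function name here are hypothetical.
data DecoderState = Accept | Reject | InProgress
  deriving (Eq, Show)

-- Old all-or-nothing decoder: back up unless decoding finished cleanly.
oldEnd :: DecoderState -> Int -> Int
oldEnd st pos
  | st /= Accept = pos - 1
  | otherwise    = pos

-- Incremental decoder: back up only on definitely invalid input.
incrementalEnd :: DecoderState -> Int -> Int
incrementalEnd st pos
  | st == Reject = pos - 1
  | otherwise    = pos

-- The all-or-nothing wrapper as refactored, without the extra check.
buggyEnd :: DecoderState -> Int -> Int
buggyEnd = incrementalEnd

-- The wrapper with the missing check: a sequence left unfinished at
-- the end of the buffer is also an error for all-or-nothing decoding.
fixedEnd :: DecoderState -> Int -> Int
fixedEnd st pos
  | st /= Accept && st /= Reject = incrementalEnd st pos - 1
  | otherwise                    = incrementalEnd st pos
```

In this model, fixedEnd agrees with oldEnd for all three states, while buggyEnd differs from oldEnd only when the decoder stops in the InProgress state.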

The text library has quite a large test suite that has revealed many bugs over the years, often before they ever escaped into the wild. Why did this ugly critter make it over the fence?

Well, a glance at the original code for trying to test UTF-8 error handling is telling—in fact, you don’t even need to be able to read a programming language, because the confession is in the comment.

    -- This is a poor attempt to ensure that
    -- the error handling paths on decode are
    -- exercised in some way.  Proper testing
    -- would be rather more involved.

“Proper testing” indeed. All that I did in the original test was generate a random byte sequence, and see if it provoked the decoder into throwing an exception. The chances of such a dumb test really offering any value are not great, but I had more or less forgotten about it, and so I had a sense of security without the accompanying security. But hey, at least past-me had left a mea culpa note for present-day-me. Right?

While finding and fixing the bug took just a few minutes, I spent several more hours strengthening the test for the UTF-8 decoder, and this was far more interesting.

As a variable-length self-synchronizing encoding, UTF-8 is very clever and elegant, but its cleverness allows for a number of implementation bugs. For reference, here is a table (lightly edited from Wikipedia) of the allowable bit patterns used in UTF-8.

    first code point   last code point   byte 1     byte 2     byte 3     byte 4
    U+0000             U+007F            0xxxxxxx
    U+0080             U+07FF            110xxxxx   10xxxxxx
    U+0800             U+FFFF            1110xxxx   10xxxxxx   10xxxxxx
    U+10000            U+1FFFFF          11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

The best known of these bugs involves accepting non-canonical encodings. What a canonical encoding means takes a little explaining. UTF-8 can represent any ASCII character in a single byte, and in fact every ASCII character must be represented as a single byte. However, an illegal two-byte encoding of an ASCII character can be achieved by starting with 0xC0 or 0xC1, followed by a continuation byte that carries the low six bits of the character. For instance, the ASCII forward slash U+002F is represented in UTF-8 as 0x2F, but a decoder with this bug would also accept 0xC0 0xAF (three- and four-byte encodings are of course also possible).

This bug may seem innocent, but it was widely used to remotely exploit IIS 4 and IIS 5 servers over a decade ago. Correct UTF-8 decoders must reject non-canonical encodings. (These are also known as overlong encodings.)

In fact, the bytes 0xC0 and 0xC1 will never appear in a valid UTF-8 bytestream, as they can only be used to start two-byte sequences that cannot be canonical.
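
Seen from the decoder's side, the bug amounts to reassembling the data bits without checking that the result needed that many bytes. The following is a minimal sketch of the difference, not the text library's decoder; naiveDecode2 and strictDecode2 are hypothetical names, and both assume the caller has already checked that the bytes have the shapes 110xxxxx and 10xxxxxx.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Word (Word8)

-- Reassemble a two-byte sequence with no canonicity check at all.
naiveDecode2 :: Word8 -> Word8 -> Int
naiveDecode2 w0 w1 =
  ((fromIntegral w0 .&. 0x1F) `shiftL` 6) .|. (fromIntegral w1 .&. 0x3F)

-- Reject any result that would have fitted in a single byte.
strictDecode2 :: Word8 -> Word8 -> Maybe Int
strictDecode2 w0 w1
  | cp < 0x80 = Nothing
  | otherwise = Just cp
  where
    cp = naiveDecode2 w0 w1
```

Here naiveDecode2 0xC0 0xAF yields 0x2F, the forward slash, while strictDecode2 0xC0 0xAF yields Nothing. The largest value reachable from a 0xC1 leading byte is naiveDecode2 0xC1 0xBF, which is 0x7F, so every sequence starting with 0xC0 or 0xC1 is overlong.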

To test our UTF-8 decoder’s ability to spot bogus input, then, we might want to generate byte sequences that start with 0xC0 or 0xC1. Haskell’s QuickCheck library provides us with just such a generating function, choose, which generates a random value in the given range (inclusive).

    choose (0xC0, 0xC1)

Once we have a bad leading byte, we may want to follow it with a continuation byte. The value of a particular continuation byte doesn’t much matter, but we would like it to be valid. A continuation byte always contains the bit pattern 0x80 combined with six bits of data in its least significant bits. Here’s a generator for a random continuation byte.

    contByte = (0x80 +) <$> choose (0, 0x3F)
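
As a sanity check on that range, a continuation byte can be recognised by masking off its top two bits. This small sketch uses only base; isContinuation and generatedRange are names I've made up for illustration.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- A continuation byte has the bit pattern 10xxxxxx.
isContinuation :: Word8 -> Bool
isContinuation b = (b .&. 0xC0) == 0x80

-- Every value the generator can produce: 0x80 plus six bits of data.
generatedRange :: [Word8]
generatedRange = [0x80 + d | d <- [0 .. 0x3F]]
```

Every element of generatedRange satisfies isContinuation, and filtering all 256 byte values with isContinuation gives back exactly that list, so the generator covers the continuation bytes and nothing else.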

Our bogus leading byte should be rejected immediately, since it can never generate a canonical encoding. For the sake of thoroughness, we should sometimes follow it with a valid continuation byte to ensure that the two-byte sequence is also rejected.

To do this, we write a general combinator, upTo, that will generate a list of up to n random values.

    upTo :: Int -> Gen a -> Gen [a]
    upTo n gen = do
      k <- choose (0,n)
      vectorOf k gen -- a QuickCheck combinator

And now we have a very simple way of saying “either 0xC0 or 0xC1, optionally followed by a continuation byte”.

    -- invalid leading byte of a 2-byte sequence.
    (:) <$> choose (0xC0,0xC1) <*> upTo 1 contByte

Notice in the table above that a 4-byte sequence can encode any code point up to U+1FFFFF. The highest legal Unicode code point is U+10FFFF, so by implication there exists a range of leading bytes for 4-byte sequences that can never appear in valid UTF-8.

    -- invalid leading byte of a 4-byte sequence.
    (:) <$> choose (0xF5,0xFF) <*> upTo 3 contByte
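
The lower end of that range falls out of a little arithmetic: a 4-byte leading byte contributes its low three bits as bits 18 to 20 of the code point. Here is a quick sketch of that calculation; minCodePoint4 and maxCodePoint are hypothetical names, and the function is only meaningful for bytes of the form 11110xxx (0xF0 to 0xF7). Bytes from 0xF8 to 0xFF do not match any pattern in the table at all.

```haskell
import Data.Bits (shiftL, (.&.))
import Data.Word (Word8)

-- Smallest code point a 4-byte sequence can encode for a leading byte
-- of the form 11110xxx, i.e. with every continuation data bit zero.
minCodePoint4 :: Word8 -> Int
minCodePoint4 w0 = (fromIntegral w0 .&. 0x07) `shiftL` 18

-- The highest legal Unicode code point.
maxCodePoint :: Int
maxCodePoint = 0x10FFFF
```

minCodePoint4 0xF4 is 0x100000, which is still legal, but minCodePoint4 0xF5 is 0x140000, already beyond maxCodePoint before any continuation byte is considered.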

We should never encounter a continuation byte without a leading byte somewhere before it.

    -- Continuation bytes without a start byte.
    listOf1 contByte
    -- The listOf1 combinator generates a list
    -- containing at least one element.

Similarly, a bit pattern that introduces a 2-byte sequence must be followed by one continuation byte, so it’s worth generating such a leading byte without its continuation byte.

    -- Short 2-byte sequence.
    (:[]) <$> choose (0xC2, 0xDF)

We do the same for 3-byte and 4-byte sequences.

    -- Short 3-byte sequence.
    (:) <$> choose (0xE0, 0xEF) <*> upTo 1 contByte
    -- Short 4-byte sequence.
    (:) <$> choose (0xF0, 0xF4) <*> upTo 2 contByte
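
The bounds passed to upTo follow from the length that each leading byte announces: a short sequence has at most one continuation byte fewer than it needs. A rough sketch of that relationship, with sequenceLength and isShort as hypothetical helper names:

```haskell
import Data.Word (Word8)

-- Total length of the sequence a leading byte announces, or Nothing
-- if the byte can never start a valid sequence.
sequenceLength :: Word8 -> Maybe Int
sequenceLength w
  | w <= 0x7F              = Just 1
  | w >= 0xC2 && w <= 0xDF = Just 2
  | w >= 0xE0 && w <= 0xEF = Just 3
  | w >= 0xF0 && w <= 0xF4 = Just 4
  | otherwise              = Nothing

-- True when a leading byte is followed by fewer bytes than it announces.
-- Only lengths are compared; the following bytes are not inspected.
isShort :: [Word8] -> Bool
isShort []       = False
isShort (w : ws) = case sequenceLength w of
  Just n  -> length ws < n - 1
  Nothing -> False
```

A 3-byte leader followed by zero or one continuation bytes, or a 4-byte leader followed by at most two, always satisfies isShort, which matches the upTo 1 and upTo 2 bounds in the generators.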

Earlier, we generated 4-byte sequences beginning with a byte in the range 0xF5 to 0xFF. Although 0xF4 is a valid leading byte for a 4-byte sequence, it’s possible for a perverse choice of continuation bytes to yield an illegal code point between U+110000 and U+13FFFF. This code generates just such illegal sequences.

    -- 4-byte sequence greater than U+10FFFF.
    k <- choose (0x11, 0x13)
    let w0 = 0xF0 + (k `Bits.shiftR` 2)
        w1 = 0x80 + ((k .&. 3) `Bits.shiftL` 4)
    ([w0,w1]++) <$> vectorOf 2 contByte
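
The bit-twiddling is easier to trust after working one case through by hand. This sketch mirrors the generator's arithmetic in a pure function and reassembles the resulting bytes; leadingPair and decode4 are names I've invented for the purpose, and decode4 performs no validity checks.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word8)

-- First two bytes for a code point whose bits 16 and up are k,
-- mirroring the arithmetic in the generator.
leadingPair :: Int -> (Word8, Word8)
leadingPair k =
  ( fromIntegral (0xF0 + (k `shiftR` 2))
  , fromIntegral (0x80 + ((k .&. 3) `shiftL` 4)) )

-- Reassemble a 4-byte sequence with no validity checks.
decode4 :: Word8 -> Word8 -> Word8 -> Word8 -> Int
decode4 w0 w1 w2 w3 =
      ((fromIntegral w0 .&. 0x07) `shiftL` 18)
  .|. ((fromIntegral w1 .&. 0x3F) `shiftL` 12)
  .|. ((fromIntegral w2 .&. 0x3F) `shiftL` 6)
  .|.  (fromIntegral w3 .&. 0x3F)
```

For k = 0x11, leadingPair gives (0xF4, 0x90), and decode4 0xF4 0x90 0x80 0x80 is 0x110000, the first illegal code point. By contrast, decode4 0xF4 0x8F 0xBF 0xBF is 0x10FFFF, the last legal one.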

Finally, we arrive at the general case of non-canonical encodings. We take a one-byte code point and encode it as two, three, or four bytes; and so on for two-byte and three-byte characters.

    -- Overlong encoding.
    k <- choose (0,0xFFFF)
    let c = chr k
    case k of
      _ | k < 0x80  -> oneof [
              let (w,x)     = ord2 c in return [w,x]
            , let (w,x,y)   = ord3 c in return [w,x,y]
            , let (w,x,y,z) = ord4 c in return [w,x,y,z] ]
        | k < 0x800 -> oneof [
              let (w,x,y)   = ord3 c in return [w,x,y]
            , let (w,x,y,z) = ord4 c in return [w,x,y,z] ]
        | otherwise ->
              let (w,x,y,z) = ord4 c in return [w,x,y,z]
    -- The oneof combinator chooses a generator at random.
    -- Functions ord2, ord3, and ord4 break down a character
    -- into its 2, 3, or 4 byte encoding.
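
The definitions of ord2, ord3, and ord4 aren't shown above. As a rough sketch, assuming they do nothing more than slice the code point into six-bit groups, they could look like the following. The library's real definitions may well differ, for example in how they check their arguments, and the cont helper is a name of my own.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8)

-- Continuation byte carrying the six bits of n that start at bit s.
cont :: Int -> Int -> Word8
cont n s = fromIntegral (0x80 + ((n `shiftR` s) .&. 0x3F))

-- Sketches only: there are no range checks, so a small code point
-- simply comes out as an overlong encoding.
ord2 :: Char -> (Word8, Word8)
ord2 c = (fromIntegral (0xC0 + (n `shiftR` 6)), cont n 0)
  where n = ord c

ord3 :: Char -> (Word8, Word8, Word8)
ord3 c = (fromIntegral (0xE0 + (n `shiftR` 12)), cont n 6, cont n 0)
  where n = ord c

ord4 :: Char -> (Word8, Word8, Word8, Word8)
ord4 c = (fromIntegral (0xF0 + (n `shiftR` 18)), cont n 12, cont n 6, cont n 0)
  where n = ord c
```

With these definitions, ord2 '/' is (0xC0, 0xAF), ord3 '/' is (0xE0, 0x80, 0xAF), and ord4 '/' is (0xF0, 0x80, 0x80, 0xAF): three different overlong spellings of the forward slash.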

Armed with a generator that uses oneof to choose one of the above invalid UTF-8 encodings at random, we embed the invalid bytestream in one of three cases: by itself, at the end of an otherwise valid buffer, and at the beginning of an otherwise valid buffer. This variety gives us some assurance of catching buffer overrun errors.
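
Stripped of the randomness, the three placements amount to the following. This is only an illustration of the idea, not the test suite's code; placements is a hypothetical name, and the real test draws both the valid and the invalid bytes from generators.

```haskell
import Data.Word (Word8)

-- The three placements of an invalid byte sequence relative to some
-- valid UTF-8 bytes.
placements :: [Word8] -> [Word8] -> [[Word8]]
placements valid bad =
  [ bad            -- by itself
  , valid ++ bad   -- at the end of an otherwise valid buffer
  , bad ++ valid   -- at the beginning of an otherwise valid buffer
  ]
```

A correct decoder should report an error for every one of the three resulting buffers.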

Sure enough, this vastly more elaborate QuickCheck test immediately demonstrates the bug that Michael found.

The original test is a classic case of basic fuzzing: it simply generates random junk and hopes for the best. The fact that it let the decoder bug through underlines the weakness of fuzzing. If I had cranked the number of randomly generated test inputs up high enough, I’d probably have found the bug, but the approach of pure randomness would have caused the bug to remain difficult to reproduce and understand.

The revised test is much more sophisticated, as it generates only test cases that are known to be invalid, with a rich assortment of precisely generated invalid encodings to choose from. While it has the same probabilistic nature as the fuzzing approach, it excludes a huge universe of uninteresting inputs from being tested, and hence is much more likely to reveal a weakness quickly and efficiently.

The moral of the story: even QuickCheck tests, though vastly more powerful than unit tests and fuzz tests, are only as good as you make them!

Posted in haskell, open source, software
Comments on “Testing a UTF-8 decoder with vigour”
  1. Sergei says:

    Hia Brian! Very nice and thorough explanation!

    Should your test suite have the ‘ASSERTS’ define enabled?

    It causes tests to fail on text- / text#git-master:

    dev/git/text $ runhaskell Setup.lhs test
    Running 1 test suites…
    Test suite tests: RUNNING…
    t_pack_unpack: [OK, passed 100 tests]
    t_ascii: [OK, passed 100 tests]
    t_utf8_err: [Failed]
    tests: Data/Text/Internal/Encoding/Utf8.hs:84:5-10: Assertion failed


  2. Sergei says:

    It rather lacks -DTEST_SUITE.

    Apologies for name misspelling!
