A small matter of illegal characters

Posted on 2010-11-29 by Bryan O'Sullivan — 18 Comments ↓

Posted in haskell

18 comments on “A small matter of illegal characters”

Robert Massaioli says:

2010-11-29 at 23:23

I do not really like the idea of failing silently, the user needs to know that something has gone wrong and option three will just make garbage appear in the stream when they did not expect it. I will admit that calling error is icky. Maybe the problem is that pack is guaranteed to succeed; maybe it should know that sometimes it will fail like: pack :: String -> Maybe Text

Though that might not be a good idea either. My vote would go to 1 or 2 (let them know it has failed) though I could see how some people may like 3 and 4; and I would probably choose 4 before 3.
Tom Harper says:

2010-11-29 at 23:30

Using fallback characters is pretty common in Unicode processing; that’s why the fallback character exists. It lets you deal with erroneous characters without letting them crash the work you were doing, which could be rather irritating in massive amounts of text. This doesn’t keep you throwing errors yourself if you encounter a fallback character.
SiniÅ¡a says:

2010-11-29 at 23:35

Perhaps two different packs could be exposed. One (the default) handles this error silently (#3 sounds great and better than just dropping it) while the other (maybePack?) uses Maybe.
Anonymous says:

2010-11-30 at 02:03

#3 is what the Unicode standards recommend. #4 may be convenient but it can potentially introduce security vulnerabilities if you’re not careful.

Imagine a webapp that takes some user input as a String, sanitizes it, runs it through Text.pack, and outputs it as part of a web page. Now imagine someone sends the string ”. The sanitization step may consider this string safe since it doesn’t contain a valid tag, but after being run through Text.pack and the illegal character is stripped it becomes unsafe.
Anonymous says:

2010-11-30 at 02:09

Ugh, your blog ate part of my comment. Here’s the mangled sentence: http://hpaste.org/41942/httpwwwserpentinecomblog.
porges says:

2010-11-30 at 04:16

Dropping the character silently is actually against Unicode rules (this was changed in 5.0, because of security concerns, and ended up breaking some programs when Windows’ string functions were changed to conform to it).
Peter Davis says:

2010-11-30 at 04:20

That’s quite a bit worse than “garbage in garbage out”…any more details on the cause of the random return values? Unchecked table index in native code wrapped with unsafePerformIO?
Alessandro Stamatto says:

2010-11-30 at 04:34

Since unicode recommends #3, i’ll stand for it.
Daniel Lyons says:

2010-11-30 at 04:38

I agree with SiniÅ¡a: the API remains easy to understand and secure to use with #3, and another API can be exposed for power users or people who are concerned about handling invalid data being handed to the function.
David Powell says:

2010-11-30 at 08:09

My vote is for #3. This matches my typical use cases where I often read external strings that I don’t have control over. With any of the other suggestions, I’d just have to pre-filter the strings to put in the replacement character as appropriate. Although it may be nice to also have another function like:
packOnlyValid :: String -> Maybe Text
(or perhaps return Either Char Text to indicate the problem?)
Ketil says:

2010-11-30 at 08:46

How to deal with this depends a bit on how this problem arises. How did the String get to contain an invalid value? I don’t see how this is not a bug in the encoding, and if it is, I’d like to know about it – thus I much prefer exceptions to any silent replacement. Ideally, there should be a way to go back and do the right thing (whatever it is for the specific use case), though.

I don’t think Maybe is going to work – if I understand correctly, Text is a chunked list, and needs to work for infinite inputs.

One option might be to provide pack with a function telling it what to do with illegal characters:

pack’ :: (Char -> Text) -> String -> Text

packError = pack’ (\c -> error (“Non-unicode character: “++show c))
packReplace = pack (const ‘\xxxx’)

etc. This might have an impact on performance, but perhaps it can be inlined away?
Bryan O'Sullivan says:

2010-11-30 at 08:46

As it turns out, option #3 is actually the only one that works at all 🙂

The bug isn’t actually in the pack function, it’s in the difference between what’s a valid Char (a Unicode code point) and what’s a valid element of a Text (a Unicode scalar value).

The Char type allows values in the range U+D800 through U+DFFF, but Text disallows them as they’re not valid Unicode scalars (and more importantly, they’re UTF-16 surrogates). You could thus produce an invalid Text using cons '\xd800' empty or the like, not just via pack!

So now all functions in the text package that use a programmer-supplied Char value to construct a Text inspect every Char, and replace it with the replacement character ‘�’ (U+FFFD) if it’s in the forbidden range. Problem solved, text 0.11.0.0 released!
Johan Tibell says:

2010-11-30 at 10:42

Shouldn’t Char be fixed to not allow for values in the range U+D800 through U+DFFF?
Mark Lentczner says:

2010-11-30 at 15:19

@Johan – That is a *very* interesting question (at least to a Unicode fanboy like me!) Forthcoming post on this very topic on my blog in the next few days…
Jeremy Bowers says:

2010-11-30 at 15:45

#3 is also the best answer because it’s an easy post-processing step to explicitly remove it if you choose.

(Sanitization libraries that get confused when they remove the characters are broken. Step one of sanitization is to normalize the text itself before doing any HTML-level processing. Any HTML sanitizer still getting illegal characters has already lost.)
Johan Tibell says:

2010-11-30 at 23:23

If we fixed Char we could detect invalid code points in string literals at compile time.
Stephan Bergmann says:

2010-12-02 at 20:35

@Johan: My thoughts, too. Per the Haskell report, “The character type Char is an enumeration whose values represent Unicode characters”. The report is vague on how to interpret numeric escapes in character/string literals, but assuming the numeric escape values shall denote Unicode scalar values (which in turn denote Unicode abstract characters), then the given example of “\55296” would simply be illegal Haskell input, as 55296 (i.e., 0xD800) does not denote a Unicode scalar value.
Marc Ziegert (coeus) says:

2010-12-10 at 08:11

There is no reason to replace invalid characters, iff they will be automatically corrected by ignoring them.

Between the utf8 and utf16 encodings, there exist a modified-utf8 encoding, which is used by at least the JNI: Java uses 16bit-chars, which are to be interpreted as utf16. On the c/c++ side those 16bit-values are packed into utf8, where 0x0000 is packed into the two byte alternative in utf8 ([0xc0,0x80]) instead of one single zero-byte. With this, all unicode-chars will be encoded in 1- or 2- or 3- or 6-byte sequences.

Now, suppose, we interpret such a modified-utf8-string simply as utf8-string and then convert it to Text. IMHO, this should be allowed and the conversion back to String should then be as valid as possible. In other words, all inputs should be accepted: valid unicode input should â€“of courseâ€“ result in the same output, while valid modified-utf8 strings / unicode with matching utf16 surrogates should be corrected(=ignored) on the fly, and these invalid sequences of utf16 surrogates should be treated as they are.

The last part is new: The conversion from Text to String has to search for utf16 surrogates and should then combine only the matching(!) surrogates to single unicode chars and pass the others through â€“ unmodified.

The bug (pack “\55296”) was caused by reading a non-existant matching second utf16-surrogate â€¦from behind that input string. It is an unhandled buffer overflow.

A small matter of illegal characters

18 comments on “A small matter of illegal characters”

Leave a Reply Cancel reply