Unicode text performance improvements

Posted on 2010-10-15 by Bryan O'Sullivan — 7 Comments ↓

Posted in haskell, open source

7 comments on “Unicode text performance improvements”

Mattias Bengtsson says:

2010-10-15 at 09:41

I was thinking, shouldn’t the encode/decode-functions be laid out something like this instead (for a nicer API):

data Encoding = UTF8 | UTF16LE | UTF16BE | UTF32LE | UTF32BE

encode :: Text -> Encoding -> ByteString
decode :: ByteString -> Encoding -> Text
brian says:

2010-10-15 at 14:03

This is great.

If the optimizations are general principles, it’d be great if you could explain them a little.
Alessandro Stamatto says:

2010-10-15 at 14:55

This is Awesome! Great work.
Stefan Wehr says:

2010-10-15 at 15:49

In a recent project, we discovered that equality comparison (==) for Text (either strict or lazy) is much slower than for strings. To summarize, == on strict Text is more than 5 times slower than on plain strings, and == on lazy Text is more than 12 times slower than for plain strings. Considering that equality on strings is a very important operation, it would be great if this operation could be optimized for text.

Here is the code for the benchmark:

—
{-# LANGUAGE ScopedTypeVariables #-}

import Data.String
import System.Environment
import System.Exit
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import Control.DeepSeq
import Criterion.Main

prepareStringBenchmark :: FilePath -> String -> IO ([String], String)
prepareStringBenchmark fp s =
do content [String] -> String -> ([s], s)
prepareTextBenchmark l s =
let l’ = map fromString l
s’ = fromString s
in l’ `deepseq` s’ `deepseq` (l’, s’)

search :: (NFData a, Eq a) => [a] -> a -> Pure
search l x = nf (filter (== x)) l

main =
do args <- getArgs
(fileName, searchTerm, restArgs) return (x, y, rest)
[“–gen”] ->
do writeFile “sample.txt” (unlines $ replicate 100000 $
concatMap show [1..70])
exitWith ExitSuccess
_ -> do putStrLn “USAGE: TextBenchmark FILE SEARCH_TERM”
exitWith (ExitFailure 1)
(sl, sst) <- prepareStringBenchmark fileName searchTerm
let (tsl, tsst :: T.Text) = prepareTextBenchmark sl sst
(tll, tlst :: TL.Text) = prepareTextBenchmark sl sst
putStrLn "starting benchmarks …"
withArgs restArgs $
defaultMain [bench "string" $ search sl sst
,bench "text-strict" $ search tsl tsst
,bench "text-lazy" $ search tll tlst]
—

And here are the results:

—
benchmarking string
collecting 10 samples, 1 iterations each, in estimated 5.555408 s
bootstrapping with 100000 resamples
mean: 555.4005 ms, lb 554.5234 ms, ub 557.1491 ms, ci 0.950
std dev: 1.985358 ms, lb 691.4959 us, ub 2.982229 ms, ci 0.950
found 2 outliers among 10 samples (20.0%)
1 (10.0%) high mild
1 (10.0%) high severe
variance introduced by outliers: 9.000%
variance is slightly inflated by outliers

benchmarking text-strict
collecting 10 samples, 1 iterations each, in estimated 144.9850 s
bootstrapping with 100000 resamples
mean: 2.804224 s, lb 2.756461 s, ub 2.953606 s, ci 0.950
std dev: 129.8160 ms, lb 36.29755 ms, ub 215.2957 ms, ci 0.950
found 2 outliers among 10 samples (20.0%)
1 (10.0%) high mild
1 (10.0%) high severe
variance introduced by outliers: 9.271%
variance is slightly inflated by outliers

benchmarking text-lazy
collecting 10 samples, 1 iterations each, in estimated 243.2759 s
bootstrapping with 100000 resamples
mean: 6.154018 s, lb 5.966649 s, ub 6.545954 s, ci 0.950
std dev: 441.2791 ms, lb 222.0038 ms, ub 691.5188 ms, ci 0.950
found 1 outliers among 10 samples (10.0%)
1 (10.0%) high mild
variance introduced by outliers: 9.696%
variance is slightly inflated by outliers
—

Here is how to compile, prepare, and run the benchmark, assuming the code is in TextBenchmark.hs:

$ ghc -O2 –make TextBenchmark.hs
$ ./TextBenchmark –gen
$ ./TextBenchmark sample.txt 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970 -s 10
Stefan Wehr says:

2010-10-15 at 15:52

Hm, somehow the blog software messed up the source code. You can also download it from http://www.stefanwehr.de/TextBenchmark.hs
Jeremy Shaw says:

2010-10-17 at 03:54

Nice!

I think it is a shame that many people are designing Haskell apps using ByteStrings which (they hope) contain utf-8 encoded strings instead of using proper Unicode supporting types, because they think that is the only way to get acceptable speed.

Making Text fast is a great way to prevent many needless bugs from ever happening 🙂

– jeremy
Bryan O'Sullivan says:

2010-10-19 at 06:41

Stefan, thanks for the report.

Your performance problem was a little tricky, because equality comparison is far faster under GHC 7 than 6 due to the improved inliner. Here is the writeup of my work: http://www.serpentine.com/blog/2010/10/19/a-brief-tale-of-faster-equality/

2 Pings/Trackbacks for "Unicode text performance improvements"

teideal glic deisbhÃ©alach » Blog Archive » A brief tale of faster equality says:

2010-10-19 at 06:36

[…] I posted my note about speeding up UTF-8 transcoding the other day, Stefan Wehr posted a very helpful observation that comparing Text values for equality was surprisingly […]
teideal glic deisbhÃ©alach » Blog Archive » text 0.10.0.0 is here says:

2010-10-22 at 05:36

[…] decoding and encoding are now very fast. They're up to 9x faster than they were, and close to the performance of pure C UTF-8 decoding and […]

Unicode text performance improvements

7 comments on “Unicode text performance improvements”

2 Pings/Trackbacks for "Unicode text performance improvements"

Leave a Reply Cancel reply