Unicode text performance improvements

Posted in haskell, open source
7 comments on “Unicode text performance improvements
  1. Mattias Bengtsson says:

    I was thinking, shouldn’t the encode/decode-functions be laid out something like this instead (for a nicer API):

    data Encoding = UTF8 | UTF16LE | UTF16BE | UTF32LE | UTF32BE

    encode :: Text -> Encoding -> ByteString
    decode :: ByteString -> Encoding -> Text

  2. brian says:

    This is great.

    If the optimizations are general principles, it’d be great if you could explain them a little.

  3. Alessandro Stamatto says:

    This is Awesome! Great work.

  4. Stefan Wehr says:

    In a recent project, we discovered that equality comparison (==) for Text (either strict or lazy) is much slower than for strings. To summarize, == on strict Text is more than 5 times slower than on plain strings, and == on lazy Text is more than 12 times slower than for plain strings. Considering that equality on strings is a very important operation, it would be great if this operation could be optimized for text.

    Here is the code for the benchmark:


    {-# LANGUAGE ScopedTypeVariables #-}

    import Data.String
    import System.Environment
    import System.Exit
    import qualified Data.Text as T
    import qualified Data.Text.Lazy as TL
    import Control.DeepSeq
    import Criterion.Main

    prepareStringBenchmark :: FilePath -> String -> IO ([String], String)
    prepareStringBenchmark fp s =
    do content [String] -> String -> ([s], s)
    prepareTextBenchmark l s =
    let l’ = map fromString l
    s’ = fromString s
    in l’ `deepseq` s’ `deepseq` (l’, s’)

    search :: (NFData a, Eq a) => [a] -> a -> Pure
    search l x = nf (filter (== x)) l

    main =
    do args <- getArgs
    (fileName, searchTerm, restArgs) return (x, y, rest)
    ["--gen"] ->
    do writeFile “sample.txt” (unlines $ replicate 100000 $
    concatMap show [1..70])
    exitWith ExitSuccess
    _ -> do putStrLn “USAGE: TextBenchmark FILE SEARCH_TERM”
    exitWith (ExitFailure 1)
    (sl, sst) <- prepareStringBenchmark fileName searchTerm
    let (tsl, tsst :: T.Text) = prepareTextBenchmark sl sst
    (tll, tlst :: TL.Text) = prepareTextBenchmark sl sst
    putStrLn "starting benchmarks …"
    withArgs restArgs $
    defaultMain [bench "string" $ search sl sst
    ,bench "text-strict" $ search tsl tsst
    ,bench "text-lazy" $ search tll tlst]

    And here are the results:


    benchmarking string
    collecting 10 samples, 1 iterations each, in estimated 5.555408 s
    bootstrapping with 100000 resamples
    mean: 555.4005 ms, lb 554.5234 ms, ub 557.1491 ms, ci 0.950
    std dev: 1.985358 ms, lb 691.4959 us, ub 2.982229 ms, ci 0.950
    found 2 outliers among 10 samples (20.0%)
    1 (10.0%) high mild
    1 (10.0%) high severe
    variance introduced by outliers: 9.000%
    variance is slightly inflated by outliers

    benchmarking text-strict
    collecting 10 samples, 1 iterations each, in estimated 144.9850 s
    bootstrapping with 100000 resamples
    mean: 2.804224 s, lb 2.756461 s, ub 2.953606 s, ci 0.950
    std dev: 129.8160 ms, lb 36.29755 ms, ub 215.2957 ms, ci 0.950
    found 2 outliers among 10 samples (20.0%)
    1 (10.0%) high mild
    1 (10.0%) high severe
    variance introduced by outliers: 9.271%
    variance is slightly inflated by outliers

    benchmarking text-lazy
    collecting 10 samples, 1 iterations each, in estimated 243.2759 s
    bootstrapping with 100000 resamples
    mean: 6.154018 s, lb 5.966649 s, ub 6.545954 s, ci 0.950
    std dev: 441.2791 ms, lb 222.0038 ms, ub 691.5188 ms, ci 0.950
    found 1 outliers among 10 samples (10.0%)
    1 (10.0%) high mild
    variance introduced by outliers: 9.696%
    variance is slightly inflated by outliers

    Here is how to compile, prepare, and run the benchmark, assuming the code is in TextBenchmark.hs:

    $ ghc -O2 –make TextBenchmark.hs
    $ ./TextBenchmark –gen
    $ ./TextBenchmark sample.txt 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970 -s 10

  5. Stefan Wehr says:

    Hm, somehow the blog software messed up the source code. You can also download it from http://www.stefanwehr.de/TextBenchmark.hs

  6. Jeremy Shaw says:

    Nice!

    I think it is a shame that many people are designing Haskell apps using ByteStrings which (they hope) contain utf-8 encoded strings instead of using proper Unicode supporting types, because they think that is the only way to get acceptable speed.

    Making Text fast is a great way to prevent many needless bugs from ever happening :)

    - jeremy

  7. Stefan, thanks for the report.

    Your performance problem was a little tricky, because equality comparison is far faster under GHC 7 than 6 due to the improved inliner. Here is the writeup of my work: http://www.serpentine.com/blog/2010/10/19/a-brief-tale-of-faster-equality/

2 Pings/Trackbacks for "Unicode text performance improvements"
  1. [...] I posted my note about speeding up UTF-8 transcoding the other day, Stefan Wehr posted a very helpful observation that comparing Text values for equality was surprisingly [...]

  2. [...] decoding and encoding are now very fast. They're up to 9x faster than they were, and close to the performance of pure C UTF-8 decoding and [...]

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>