Open question: help me design a new encoding API for aeson

For a while now, I’ve had it in mind to improve the encoding performance of my Haskell JSON package, aeson.

Over the weekend, I went from hazy notion to a proof of concept for what I think could be a reasonable approach.

This post is a case of me “thinking out loud” about the initial design I came up with. I’m very interested in hearing if you have a cleaner idea.

The problem with the encoding method currently used by aeson is that it occurs via a translation to the Value type. While this is simple and uniform, it involves a large amount of intermediate work that is essentially wasted. When encoding a complex value, the Value that we build up is expensive, and it will become garbage immediately.

It should be much more efficient to simply serialize straight to a Builder, the type that is optimized for concatenating many short string fragments. But before marching down that road, I want to make sure that I provide a clean API that is easy to use correctly.

I’ve posted a gist that contains a complete copy of this proof-of-concept code.

{-# LANGUAGE GeneralizedNewtypeDeriving, FlexibleInstances,
    OverloadedStrings #-}

import Data.Monoid (Monoid(..), (<>))
import Data.Text (Text)
import Data.Text.Lazy.Builder (Builder, singleton)
import qualified Data.Text.Lazy.Builder as Builder
import qualified Data.Text.Lazy.Builder.Int as Builder

The core Build type has a phantom type that allows us to say “I am encoding a value of type t”. We’ll see where this type tracking is helpful (and annoying) below.

data Build a = Build {
    _count :: !Int
  , run    :: Builder
  }

The internals of the Build type would be hidden from users; here’s what they mean. The _count field tracks the number of elements we’re encoding of an aggregate JSON value (an array or object); we’ll see why this matters shortly. The run field lets us access the underlying Builder.

We provide three empty types to use as parameters for the Build type.

data Object
data Array
data Mixed

We’ll want to use the Mixed type if we’re cramming a set of disparate Haskell values into a JSON array; read on for more.

When it comes to gluing values together, the Monoid class is exactly what we need.

instance Monoid (Build a) where
    mempty = Build 0 mempty
    mappend (Build i a) (Build j b)
      | ij > 1    = Build ij (a <> singleton ',' <> b)
      | otherwise = Build ij (a <> b)
      where ij = i + j

Here’s where the _count field comes in; we want to separate elements of an array or object using commas, but this is necessary only when the array or object contains more than one value.

To encode a simple value, we provide a few obvious helpers. (These are clearly so simple as to be wrong, but remember: my purpose here is to explore the API design, not to provide a proper implementation.)

build :: Builder -> Build a
build = Build 1

int :: Integral a => a -> Build a
int = build . Builder.decimal

text :: Text -> Build Text
text = build . Builder.fromText

Encoding a JSON array is easy.

array :: Build a -> Build Array
array (Build 0 _)  = build "[]"
array (Build _ vs) = build $ singleton '[' <> vs <> singleton ']'

If we try this out in ghci, it behaves as we might hope.

?> array $ int 1 <> int 2
"[1,2]"

JSON puts no constraints on the types of the elements of an array. Unfortunately, our phantom type causes us difficulty here.

An expression of this form will not typecheck, as it’s trying to join a Build Int with a Build Text.

?> array $ int 1 <> text "foo"

This is where the Mixed type from earlier comes in. We use it to forget the original phantom type so that we can construct an array with elements of different types.

mixed :: Build a -> Build Mixed
mixed (Build a b) = Build a b

Our new mixed function gets the types to be the same, giving us something that typechecks.

?> array $ mixed (int 1) <> mixed (text "foo")
"[1,foo]"

This seems like a fair compromise to me. A Haskell programmer will normally want the types of values in an array to be the same, so the default behaviour of requiring this makes sense (at least to my current thinking), but we get a back door for when we absolutely have to go nuts with mixing types.

The last complication stems from the need to build JSON objects. Each key in an object must be a string, but the value can be of any type.

-- Encode a key-value pair.
(<:>) :: Build Text -> Build a -> Build Object
k <:> v = Build 1 (run k <> ":" <> run v)

object :: Build Object -> Build Object
object (Build 0 _)   = build "{}"
object (Build _ kvs) = build $ singleton '{' <> kvs <> singleton '}'

If you’ve had your morning coffee, you’ll notice that I am not living up to my high-minded principles from earlier. Perhaps the types involved here should be something closer to this:

data Object a

(<:>) :: Build Text -> Build a -> Build (Object a)

object :: Build (Object a) -> Build (Object a)

(In which case we’d need a mixed-like function to forget the phantom types for when we want to get mucky and unsafe—but I digress.)

How does this work out in practice?

?> object $ "foo" <:> int 1 <> "bar" <:> int 3
"{foo:1,bar:3}"

Hey look, that’s more or less as we might have hoped!

Open questions, for which I appeal to you for help:

  • Does this design appeal to you at all?

  • If not, what would you change?

  • If yes, to what extent am I wallowing in the “types for thee, but not for me” sin bin by omitting a phantom parameter for Object?

Helpful answers welcome!

Posted in haskell, open source
3 comments on “Open question: help me design a new encoding API for aeson
  1. Anders Kaseorg says:

    JSON objects are occasionally used as ‘Map Text a’, where the value type is the same across all keys. But more often, I think, they’re used as records, where the value type depends on the key. It would certainly be possible to capture that in the type system, but I suspect it may involve an uncomfortable amount of syntactic overhead.

  2. Leon P Smith says:

    This sounds similar in approach to my json-builder package, which I tweaked to be able to serialize Aeson’s Value structure at the same speed as aeson itself. Have you tried my package? Anything you don’t like about it?

    One thing I don’t like here is I don’t see the value in explicitly tracking the number of elements that will be serialized to a json array or object. This introduces a strictness property that forces more computation before serialization can start. Json-builder on the other hand can start serializing json objects and arrays before the whole object or array has been computed.

    I thought about things such as phantom types, but just ended up having separate type of Json values, a type of Json Arrays, and a type of Json Objects. You can always serialize arrays and objects and treat them as values, but if you want to append more elements to either, then you do actually need the Array/Object type.

    As for syntactic overhead, I don’t know what your standard is exactly, but I’ve found json-builder fairly comfortable to use. At least, I haven’t found it worse (and sometimes better) than the existing ToJSON interface.

  3. Bas van Dijk says:

    It would be nice to both have the ability to encode a Haskell value to the concrete aeson Value type as well as going to a Builder directly. So how about using a kind of “finally tagless” style approach like the following:

    https://gist.github.com/bos/6986451#comment-930228

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>