Little Big Detail #3: Detecting Maps

internals

March 6, 2018

Little Big Detail #3: Detecting Maps

This Little Big Detail is about how quicktype decides whether a JSON object should be represented as a class or a map, using Markov chains to analyze property names.

Mark

Little Big Details are the subtle niceties that quicktype takes care of when turning your JSON into lovely, readable code. This Little Big Detail is about how quicktype decides whether a JSON object should be represented as a class or a map.

The problem

Suppose you're writing an app that uses the Bitcoin blockchain and it downloads data like this:

{
  "000000000000000000c846dee2e13c6408f5": {
    "hash": "000000000000000000c846dee2e13c6408f5",
    "height": 503162,
    "time": 1515187634
  },
  "00000000000000000024fb37364cbf81fd95": {
    "hash": "00000000000000000024fb37364cbf81fd95",
    "height": 503163,
    "time": 1515188162
  },
  "000000000000000000210b10f0a96a3b318e": {
    "hash": "000000000000000000210b10f0a96a3b318e",
    "height": 503164,
    "time": 1515188571
  }
}

How would you represent this data as a type in your program? Here's a naive translation of the JSON into Swift types:

struct Blocks {
    let the000000000000000000C846Dee2E13C6408F5: Block
    let the00000000000000000024Fb37364Cbf81Fd95: Block
    let the000000000000000000210B10F0A96A3B318E: Block
}

struct Block {
    let hash: String
    let height: Int
    let time: Int
}

Is this the Swift you would write? No, of course not! How about:

typealias Blocks = [String: Block]

struct Block {
    let hash: String
    let height: Int
    let time: Int
}

That's more like it! We're faced with this decision because JSON syntax doesn't differentiate maps from classes—they both just look like:

{ "key1": value1, "key2": value2, ... }

quicktype must decide which JSON represents a class (fixed properties) and which JSON represents a map (dynamic keys) so it can generate the same code you would in the example above. In this post we'll see how quicktype does that. You can also try it yourself in our map detection playground.

Simplistic heuristics for map detection

A first observation about maps is that their values are almost always the same type or "homogeneous". In fact, in statically-typed programming languages maps must be homogeneous. Our first heuristic for map detection is:

If all values in a JSON object have the same type, consider it a map.

Unfortunately this heuristic fails immediately:

{
  "name": "Alice",
  "email": "alice@example.com",
  "bio": "Software engineer"
}

You'd probably want a class for that data even though all values are strings.

Another observation is that maps often have more keys than classes have properties; if a JSON object has 100 properties, all of which are strings, it's probably a map, so let's add another condition to our heuristic:

If all values in a JSON object have the same type and it has at least 20 properties, consider it a map.

This was the previous heuristic quicktype used. It was adequate for large, regular samples but failed whenever a map had fewer than 20 keys, which happens all the time.

A heuristic that decides like a human programmer does

Let's come back to our original Bitcoin example. Most developers would correctly deduce that the type for each block should be a class, but that the outer object should be a map. Our previous heuristic would fail in this case, since there are only three blocks in the map.

How are we as human programmers able to infer that this data is a map? One strong clue is that the names of the properties (e.g. 000000000000000...c846dee2e13c6408f5) don't look like class property names—they look like data, like values. What we need is a way to determine whether a string looks like data or looks like a property name. To do that, quicktype uses a Markov chain.

What's a Markov chain?

If you're a developer, you may know Markov chains as a way of generating apparently meaningful nonsense, but they have many other important applications.

A Markov chain is a set of states and transitions among those states, where each transition has a probability. Transitions occur randomly, weighted by the probabilities.

For example, let's use a Markov chain to model what I'll eat for lunch. On most days I eat sushi, but now and then I'll have tacos. We can model this with two states: 🍣 and 🌮. The transitions between states have these probabilities:

🍣 → 🍣: if I eat sushi today, what's the likelihood I'll eat sushi tomorrow? Let's say 85%.
🍣 → 🌮: if I eat sushi today, how likely am I to have tacos tomorrow? This must be 15%, since I only eat either sushi or tacos, and we already know what the probability is for sushi.
🌮 → 🍣: if I eat tacos today, what's the likelihood I'll eat sushi tomorrow? Let's say 60%.
🌮 → 🌮: by similar reasoning, this must be 40%.

Diagram of the sushi/taco Markov chain

If you run this model you could get this sequence:

🍣 🍣 🍣 🍣 🍣 🍣 🍣 🍣 🌮 🌮 🍣 🍣 🍣 🍣 🍣 🍣 🍣 🌮 🌮 🌮 🍣 🌮 🌮 🌮 🌮 🍣 🍣 🍣

Yum! So how does this relate to detecting maps versus classes in JSON?

Markov chains in quicktype

Consider a Markov chain for English words that looks at three letters at a time. Our state is the first two letters of any three-letter sequence, and our transitions give the probabilities for each possible third letter.

For example, if the two letters are qu, then there is a high-probability transition for the letter i because many English words contain qui (e.g. acquire, colloquial, equip). On the other hand, the transition from qu to x has nearly zero probability because no English words contain qux.

Part of the English Markov chain

Rather than using the Markov chain to generate pseudo-English words, we can feed it a sequence of letters, look at the probabilities of the transitions that would have produced that sequence, then use the geometric mean of those probabilities to arrive at a single score.^[1]

Try the interactive widget at the bottom of this post to explore transition probabilities by typing. The color of each letter corresponds to its probability given its two predecessors (green is likely, red is unlikely).

Actually, the Markov chain used for this widget^[2] is not built for detecting English, but rather for detecting class property names. In fact, it's the same Markov chain quicktype uses to judge whether a property name looks like a class property name or whether it looks like data.^[3]

How did I arrive at the probability threshold for determining class property names? Ideally I'd have access to all the JSON that ever has been, and ever will be produced, and enough computing power to train on all of it. In practice I wrote a little script and ran it over JSON data I had lying around.

See map detection in action in our map detection playground.

What if it fails?

If quicktype still fails to identify your JSON object correctly, here's what you can do:

If your object should be a class but quicktype thinks it's a map, you can disable map detection in the options.
If quicktype thinks your object is a class when it should be a map, a bit more work is needed. The easiest way to correct this is to trick quicktype by making the property names look "weird"—add some numbers or special characters.

If you want total control over the types generated by quicktype, you can always output JSON Schema, correct the schema, then use the schema as the input to quicktype. This is the workflow we recommend for serious use.

Future improvements

Here are a few areas where we can improve map detection even further, and we'd love help from contributors on any of them:

quicktype should take the length of JSON property names into account. If a property name is a whole English sentence, that object is likely not a class.
English is not the only language in the world! Right now, Mandarin is Greek to quicktype (and so is Greek). Maybe quicktype needs to classify the JSON's "language" first and run a different Markov chain for each language.
Another clue in the Bitcoin JSON is that the map values are complex (Bitcoin objects in this case). Some classes have relatively many string properties, but it's less common to find classes with lots of complex object properties.
If quicktype sees more than one sample for the same type, and they have few shared property names, that makes that type more likely to be a map.
Some common forms of map keys are not easy for a Markov chain to detect, but can be covered easily with regular expressions. Email addresses are an example.

If you're interested in working on these problems, please come talk to us on Slack and check out quicktype on GitHub. Thanks!

[1] With probabilities we can't just take the arithmetic mean of our *n* individual probabilities—we need the geometric mean, which is the *n*th root of their product.

[2] Explore the Markov widget further at github.com/quicktype/markov-react.

[3] The details of that are not very scientific.

Explore transition probabilities by typing below. The color of each letter corresponds to its probability given its predecessors (green is likely, red is unlikely).