JSON is a ubiquitous data interchange format supported by many systems, including CockroachDB. Historically, at CockroachDB, the majority of our efforts around JSON focused on supporting various JSON functions, operators, and inverted indexes, while our JSON parser did not receive much attention. And for good reason.
If you are reading this blog, then you probably already appreciate Golang. There are many aspects to like in Go, and one of them is an excellent collection of standard (and third-party) libraries. Because Go ships with a “jack-of-all-trades” JSON parser out of the box (the encoding/json package), it made sense for CockroachDB to use the standard parser when converting strings to our internal JSON representation. However, at some point, we began to wonder if the time had arrived to invest in finding a better, more performant alternative.
This blog post is an exploration of JSON parser performance, and, ultimately, a description of the high-performance JSON parser used in CockroachDB.
So, let’s jump into the world of state machines, parsing, performance, and all things JSON.
A good performance exploration should start with a reason. Why is it necessary to improve a particular performance aspect? Is it worth the effort? This performance exploration begins with a customer who casually mentioned that they use a BYTES column in a table, even though they insert JSON. The reason cited by this customer was that their INSERT-heavy workload was multiple orders of magnitude slower when inserting into JSONB columns. They preferred to use JSONB, but couldn’t due to performance considerations.
Naturally, the above conversation piqued our curiosity: are we as slow as the customer claims? (Side note: the customer is always right, but sometimes their conclusions are not correct.)
Turns out, in this case, the customer was correct.
We don’t need to look further than the ParseJSON function, which is invoked whenever CockroachDB needs to convert an input STRING into JSON. For example, the following query will invoke ParseJSON twice – once to produce a JSON object, and once to produce a JSON array of strings:

INSERT INTO table (json_a, json_b) VALUES ('{"key": "value"}'::JSONB, '["a", "r", "r", "a", "y"]'::JSONB)
The ParseJSON function is obviously inefficient. So much so that it included a helpful comment explaining why: “Parsing goes in two phases - first it parses the string into raw interface{}s using the Go encoding/json package, then it transforms that into a JSON”.
A two-phase approach makes sense – it’s simple and small – but it’s obviously inefficient. Read on to see just how inefficient it is.
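For illustration, here is a minimal sketch of what such a two-phase conversion looks like. The names parseJSONTwoPhase and toJSON are hypothetical stand-ins, not CockroachDB’s actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseJSONTwoPhase mirrors the two-phase approach: phase one asks
// encoding/json to materialize the input as generic interface{} values;
// phase two walks that tree to build the final representation. Every map,
// slice, and boxed scalar produced in phase one is a throwaway allocation.
func parseJSONTwoPhase(s string) (interface{}, error) {
	var raw interface{}
	if err := json.Unmarshal([]byte(s), &raw); err != nil { // phase 1
		return nil, err
	}
	return toJSON(raw) // phase 2: transform into the target representation
}

// toJSON stands in for the conversion into an internal JSON type.
func toJSON(raw interface{}) (interface{}, error) {
	return raw, nil
}

func main() {
	j, err := parseJSONTwoPhase(`{"key": "value"}`)
	fmt.Println(j, err)
}
```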
Before we can begin work on a performance improvement, we must first understand where the existing implementation falls short. To do this, we write a benchmark to measure the baseline performance: just how fast (or slow) the ParseJSON function is.
There is one more ingredient we need in order to understand performance: what is the representative input for our benchmark? Some customers insert small JSON blobs, some large; some use simple JSON objects, some use highly complex, nested structures, etc. Luckily, we already have a testing library that generates JSON objects of varying complexity and size.
Our job, therefore, is straightforward: generate a good corpus of JSON objects of varying sizes and complexity, and write a benchmark to measure the performance of the ParseJSON function.
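A sketch of what such a benchmark can look like, assuming ParseJSON is the function under test and benchInput stands in for our randomized corpus generator (both names here are illustrative):

```go
package json_test

import (
	"fmt"
	"strings"
	"testing"
)

// benchInput builds a deterministic JSON object whose string values are
// capped at maxLen bytes; a stand-in for the randomized corpus drawn from
// our JSON-generating test library.
func benchInput(maxLen int) string {
	return fmt.Sprintf(`{"a": %q, "b": [1, 2.5, true, null]}`, strings.Repeat("x", maxLen))
}

// BenchmarkParseJSON reports time/op, MB/s, and allocations per op.
func BenchmarkParseJSON(b *testing.B) {
	for _, maxLen := range []int{128, 512, 4096, 16384, 262144, 524288} {
		input := benchInput(maxLen)
		b.Run(fmt.Sprint(maxLen), func(b *testing.B) {
			b.SetBytes(int64(len(input))) // lets the framework report MB/s
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				if _, err := ParseJSON(input); err != nil {
					b.Fatal(err)
				}
			}
		})
	}
}
```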
The results of the benchmark on an n2d-standard machine (AMD EPYC 7B13 processor, 2.5GHz, 32 cores) are shown below:
Max Len | Iterations | Time/op (ns) | MB/s | Avg. Input Len | Alloc B/op | Allocs/op |
---|---|---|---|---|---|---|
128 | 997956 | 6566 | 25.59 | 168.8 | 2439 | 39 |
512 | 593576 | 8887 | 47.94 | 426 | 3245 | 39 |
4096 | 190588 | 30081 | 97.14 | 2922 | 12976 | 41 |
16384 | 57171 | 101695 | 111.13 | 11302 | 45261 | 41 |
262144 | 3507 | 1477386 | 126.88 | 187453 | 723096 | 44 |
524288 | 1824 | 2800348 | 126.68 | 354745 | 1362211 | 45 |
The input data in the above benchmark had no escaped characters (no UTF8, no escaped quotes, etc).
The “Max Len” column indicates the maximum length of strings inside the randomly generated JSON. This is not the same as the length of the whole JSON object – it refers only to the strings inside JSON object values. The “Avg. Input Len” column indicates the average length of the generated inputs.
The MB/s indicates the throughput of parsing input JSON. Please note that smaller input benchmarks have lower throughput. This makes sense since we simply are not driving enough data to get a representative throughput number.
First, the good news: ParseJSON, using Go standard encoding/json, has fairly stable allocation behavior – as our input size grows, the number of allocations per op stays fairly constant.
Now, the bad news: allocated bytes per op are a large multiple of the input size: for the large input benchmark, we had an average input length of ~350KB, but we allocated almost 1.3MB to handle that input.
If you look at the small inputs, the memory allocation behavior is, arguably, worse: for an average input size of 168 bytes, we allocated 2439 bytes – 14 times the input size.
As input size increases, the time to process it grows to 2.8 milliseconds per op – which is to be expected, but we will try to answer whether this increase is excessive.
If we allow the input to contain escaped characters, such as the “smiley face” emoji (U+1F600), the performance of encoding/json drops proportionally. For example, if we allow up to 3% of the input strings to contain escaped characters, the performance of the base benchmark is as follows:
Max Len | Iterations | Time/op (ns) | MB/s | Avg. Input Len | Alloc B/op | Allocs/op |
---|---|---|---|---|---|---|
128 | 1042497 | 5577 | 31.02 | 173.6 | 2535 | 40 |
512 | 551197 | 10177 | 44.51 | 453.1 | 3753 | 41 |
4096 | 139592 | 40349 | 75.79 | 3058 | 16700 | 42 |
16384 | 49492 | 114133 | 104.78 | 11959 | 60130 | 43 |
262144 | 3232 | 1912541 | 101.41 | 193950 | 947012 | 46 |
524288 | 1501 | 3582581 | 108.62 | 389153 | 1864626 | 46 |
And allowing for almost “binary” data, with 40% escaped characters:
Max Len | Iterations | Time/op (ns) | MB/s | Avg. Input Len | Alloc B/op | Allocs/op |
---|---|---|---|---|---|---|
128 | 902941 | 7192 | 28.22 | 203.7 | 2679 | 41 |
512 | 452046 | 12044 | 43.26 | 521.7 | 4004 | 40 |
4096 | 97638 | 55651 | 70.58 | 3929 | 20314 | 42 |
16384 | 29611 | 200888 | 74.64 | 14996 | 72766 | 43 |
262144 | 1892 | 3033428 | 78.91 | 239361 | 1127602 | 47 |
524288 | 1026 | 5694102 | 78.73 | 448318 | 2123032 | 46 |
Time per operation grows from 2.8 ms/op to 5.7 ms/op when processing heavily escaped, large inputs.
With these results in mind, we now have an immediate target for performance optimization: namely, we will try to improve allocation behavior. We could choose to work on improving time per op, or throughput, but we won’t, because allocation behavior offers the largest opportunity for improvement. ParseJSON is inefficient – we know that because it is a two-phase implementation. Removing the intermediate step will certainly yield better allocation behavior.
In addition, improving allocation efficiency not only provides immediate benefits (i.e. better allocation behavior), it also improves speed (fewer allocations mean lower time per op) and overall system efficiency, as reduced pressure on the Go garbage collector ultimately improves the performance of the system as a whole.
To rewrite, or not to rewrite – that is the eternal question. A good rule of thumb is to start with “no, let’s not rewrite”, and then see if you are still convinced to try something new.
We know that we are looking for a library that has better allocation behavior, allows us to implement a single-pass conversion from string to internal JSON representation, and (hopefully) runs faster than encoding/json.
The good news is that there are many libraries that claim to be faster than the standard Go library – numerous, one can even say overwhelmingly numerous, options. But which one is the best?
If you are looking for a silver bullet answer to the above question, I’m sorry to disappoint you. The answer is – it really depends on your use case. For example, ffjson is reasonably fast, but it is not appropriate for our use case because it relies on code generation, whereas we need to convert arbitrary input strings (i.e. schemaless inputs) into an internal JSON representation. This requirement eliminates ffjson, along with other libraries that use code-generation techniques.
Other libraries, such as jsonparser, seem to be optimized for retrieving a single key or a few keys. However, we need to convert the entire input into an internal JSON representation, and traversing all values in the parsed JSON with such an API would likely result in suboptimal performance.
Other implementations use cgo – something we wanted to avoid if possible.
The library that seemed to hit all of our goals was jsoniter. Not only is the jsoniter library reasonably fast, it also exposes an Iterator API that allows us to build the final JSON object on the fly as we parse tokens from the input stream (a sketch of this iterator-driven style follows the results table below). Overall, this is an excellent library. We switched the internal implementation to use jsoniter, with excellent results (see https://github.com/cockroachdb/cockroach/pull/89117):
name | old speed | new speed | delta |
---|---|---|---|
ParseJSON/escape=0.00%/128-32 | 30.6MB/s ± 4% | 51.2MB/s ± 3% | +67.60% (p=0.000 n=9+10) |
ParseJSON/escape=0.00%/512-32 | 58.4MB/s ± 6% | 117.5MB/s ± 1% | +101.26% (p=0.000 n=10+10) |
ParseJSON/escape=0.00%/4096-32 | 110MB/s ± 1% | 344MB/s ± 3% | +211.29% (p=0.000 n=8+10) |
ParseJSON/escape=0.00%/16384-32 | 133MB/s ± 1% | 532MB/s ± 3% | +300.49% (p=0.000 n=9+10) |
ParseJSON/escape=0.00%/65536-32 | 144MB/s ± 1% | 628MB/s ± 4% | +336.42% (p=0.000 n=9+10) |
ParseJSON/escape=0.00%/262144-32 | 138MB/s ± 1% | 710MB/s ± 1% | +415.99% (p=0.000 n=8+10) |
ParseJSON/escape=0.00%/524288-32 | 139MB/s ± 1% | 669MB/s ± 7% | +381.80% (p=0.000 n=9+10) |
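Here is a minimal sketch of the iterator-driven conversion mentioned above, building a generic tree for illustration (CockroachDB’s real code builds its internal JSON type instead, and production code must also account for jsoniter’s empty-string end-of-object convention when an object key is itself the empty string):

```go
package main

import (
	"fmt"

	jsoniter "github.com/json-iterator/go"
)

// buildValue converts the next value in the token stream directly into a
// tree, without first materializing interface{} values via Unmarshal.
func buildValue(iter *jsoniter.Iterator) interface{} {
	switch iter.WhatIsNext() {
	case jsoniter.StringValue:
		return iter.ReadString()
	case jsoniter.NumberValue:
		return iter.ReadFloat64() // the real code parses exact decimals instead
	case jsoniter.BoolValue:
		return iter.ReadBool()
	case jsoniter.NilValue:
		iter.ReadNil()
		return nil
	case jsoniter.ArrayValue:
		var arr []interface{}
		for iter.ReadArray() {
			arr = append(arr, buildValue(iter))
		}
		return arr
	case jsoniter.ObjectValue:
		obj := map[string]interface{}{}
		for field := iter.ReadObject(); field != ""; field = iter.ReadObject() {
			obj[field] = buildValue(iter)
		}
		return obj
	default:
		iter.ReportError("buildValue", "unexpected token")
		return nil
	}
}

func main() {
	iter := jsoniter.ParseString(jsoniter.ConfigDefault, `{"key": [1, "two", true]}`)
	fmt.Println(buildValue(iter), iter.Error)
}
```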
The improvements in throughput are great – up to five times on large inputs. The one area that was a bit of a concern was escaped characters: the gains there were far more modest, and jsoniter seemed to be less efficient in its allocation characteristics than the standard library when dealing with such data:
name | old speed | new speed | delta |
---|---|---|---|
ParseJSON/escape=3.00%/128-32 | 40.0MB/s ± 0% | 41.0MB/s ± 0% | +2.50% (p=0.000 n=10+10) |
ParseJSON/escape=3.00%/512-32 | 41.0MB/s ± 0% | 43.0MB/s ± 0% | +4.88% (p=0.000 n=10+10) |
ParseJSON/escape=3.00%/4096-32 | 42.0MB/s ± 0% | 48.0MB/s ± 0% | +14.29% (p=0.000 n=10+10) |
ParseJSON/escape=3.00%/16384-32 | 43.0MB/s ± 0% | 53.0MB/s ± 0% | +23.26% (p=0.000 n=10+10) |
ParseJSON/escape=3.00%/262144-32 | 46.0MB/s ± 0% | 64.0MB/s ± 0% | +39.13% (p=0.000 n=10+8) |
ParseJSON/escape=3.00%/524288-32 | 45.7MB/s ± 2% | 67.0MB/s ± 0% | +46.61% (p=0.000 n=10+10) |
We wanted to find a library that was faster and had better memory allocation behavior. Since jsoniter appeared to have worse allocation behavior than the standard Go library, we decided to continue our exploration.
Still resisting the urge to rewrite, and trying to find a more memory-efficient implementation than jsoniter, a Google search yielded a very interesting blog post on high-performance JSON parsing, which described a parser implementation that produced over 1GB/s of throughput while promising to “eliminate allocations through API design”. That sounded very intriguing, and almost too good to be true. The package described in that post, pkg/json, is listed as “not ready for production” – and for good reasons, which we’ll discuss below. However, it was worth a try, and the initial benchmarks seemed to confirm that, as it says on the “sticker”, the package can deliver over 1GB/s of parsing throughput.
So, are we set on a “no rewrite” approach? Not quite. There were many things missing from the pkg/json package. In particular, it did not handle escaped characters correctly, and it did not handle Unicode escape sequences (\uXXXX) at all. What it did have was an excellent API, and a great starting point for us to build upon.
It’s worth pausing for a second to appreciate this point of API design. The package’s choice to return a byte slice representing the next token in the input stream, as opposed to returning json.Token (as the encoding/json package does), significantly improves allocation behavior: json.Token is an interface{}, which causes the returned value to escape to the heap (i.e. causes more allocations). API choices matter. If the goal is to produce a high-performance library, you must think about those details.
We decided to take the pkg/json package, fork it, add the missing functionality, and adapt it for our internal use.
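To illustrate the byte-slice token style, here is a sketch of a consuming loop; tokenReader and countStrings are illustrative names, not the actual pkg/json API:

```go
package tokenizer

import "io"

// tokenReader is an illustrative stand-in for a pkg/json-style tokenizer:
// Next returns the raw bytes of the next token, aliasing the decoder's
// internal buffer rather than boxing each token into an interface{} the
// way encoding/json's Token() does.
type tokenReader interface {
	Next() ([]byte, error) // io.EOF signals end of input
}

// countStrings consumes a token stream without per-token allocations:
// because tokens are plain []byte values, they never escape to the heap.
func countStrings(r tokenReader) (int, error) {
	n := 0
	for {
		tok, err := r.Next()
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		if len(tok) > 0 && tok[0] == '"' { // string tokens begin with a quote
			n++
		}
	}
}
```

Rerunning the baseline benchmark (no escaped characters) against the adapted implementation: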
Max Len | Iterations | Time/op (ns) | MB/s | Avg. Input Len | Alloc B/op | Allocs/op |
---|---|---|---|---|---|---|
128 | 2529363 | 2206 | 76.15 | 168.8 | 969 | 26 |
512 | 2257388 | 2489 | 171.14 | 426.1 | 1229 | 26 |
4096 | 1139653 | 4688 | 623.04 | 2922 | 3916 | 26 |
16384 | 454411 | 11675 | 968.32 | 11305 | 12710 | 26 |
262144 | 36289 | 157838 | 1180.9 | 186392 | 191801 | 26 |
524288 | 19600 | 302933 | 1193.54 | 361565 | 366902 | 26 |
Across the board, this implementation used fewer, smaller allocations. Speed is set to 11 out of 10, so to speak:
name | old speed | new speed | delta |
---|---|---|---|
ParseJSON/escape=0.00%/128-32 | 30.6MB/s ± 4% | 66.9MB/s ± 3% | +118.89% (p=0.000 n=9+8) |
ParseJSON/escape=0.00%/512-32 | 58.4MB/s ± 6% | 153.6MB/s ± 1% | +163.12% (p=0.000 n=10+8) |
ParseJSON/escape=0.00%/4096-32 | 110MB/s ± 1% | 501MB/s ± 2% | +353.49% (p=0.000 n=8+8) |
ParseJSON/escape=0.00%/16384-32 | 133MB/s ± 1% | 769MB/s ± 2% | +478.79% (p=0.000 n=9+9) |
ParseJSON/escape=0.00%/262144-32 | 138MB/s ± 1% | 1038MB/s ± 2% | +654.61% (p=0.000 n=8+10) |
ParseJSON/escape=0.00%/524288-32 | 139MB/s ± 1% | 1110MB/s ± 0% | +699.16% (p=0.000 n=9+8) |
Parsing data without escape sequences is eight times faster than encoding/json, and almost twice as fast as jsoniter.
What about “elimination of allocations through API design” – clearly, we still allocate? Well, the above results ultimately produce a final, CockroachDB-specific JSON object – and thus we must perform some allocations. The underlying pkg/json library performs almost no allocations of its own. Our allocated bytes now track the input size, while the number of allocations remains stable across the board. Allocation behavior is stable even when decoding data with 40% escape sequences:
Max Len | Iterations | Time/op (ns) | MB/s | Avg. Input Len | Alloc B/op | Allocs/op |
---|---|---|---|---|---|---|
128 | 2012869 | 2849 | 71.26 | 203.7 | 996 | 27 |
512 | 1396340 | 4113 | 126.67 | 521.7 | 1217 | 26 |
4096 | 285523 | 18243 | 215.37 | 3929 | 4010 | 26 |
16384 | 86448 | 62476 | 239.6 | 14970 | 12863 | 26 |
262144 | 5745 | 972994 | 249.86 | 243116 | 190634 | 27 |
524288 | 2439 | 2091165 | 232.39 | 485975 | 376054 | 26 |
The performance, when data contains up to 40% escaped characters, is “only” 2.25 times that of the Go standard library. In allocation behavior, however, our implementation significantly outperforms both encoding/json and jsoniter. Why such a difference between escaped and unescaped data? Without escapes, we simply consume the input and produce a direct translation to the JSON object. With escape sequences present, we must copy data into a temporary buffer as we go along – and copying is not free.
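A sketch of that trade-off, with decodeString as a hypothetical helper (real code must also handle \uXXXX sequences and reject malformed escapes):

```go
package tokenizer

import "bytes"

// decodeString returns the decoded contents of a JSON string token (the
// bytes between the surrounding quotes). Escape-free strings alias the
// input with zero copying; escaped strings are rewritten byte by byte
// into the caller-provided scratch buffer. Assumes a well-formed token.
func decodeString(tok, scratch []byte) []byte {
	if bytes.IndexByte(tok, '\\') < 0 {
		return tok // fast path: no escapes, no copy
	}
	buf := scratch[:0]
	for i := 0; i < len(tok); i++ {
		c := tok[i]
		if c != '\\' {
			buf = append(buf, c)
			continue
		}
		i++ // consume the backslash; decode the escape that follows
		switch tok[i] {
		case 'n':
			buf = append(buf, '\n')
		case 't':
			buf = append(buf, '\t')
		case 'r':
			buf = append(buf, '\r')
		default: // \", \\ and \/ decode to themselves; \b, \f, \uXXXX elided
			buf = append(buf, tok[i])
		}
	}
	return buf
}
```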
The blog post about building a high-performance JSON parser mentioned above is a must read. It contains a treasure trove of information on building a high-performance parser. We have utilized many techniques mentioned in the above blog post.
For example, using a method expression to store the next state transition in the state machine results in around a 5% performance gain. Arguably not much, but 5% here, 5% there, and soon enough you’re talking real gains.
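A minimal sketch of the technique, with illustrative names:

```go
package tokenizer

// scanner stores its next state as a function value. A method expression
// such as (*scanner).scanObject has type func(*scanner) ([]byte, error),
// so the main loop dispatches with one indirect call instead of switching
// on a state enum on every token.
type scanner struct {
	step func(*scanner) ([]byte, error) // the next state transition
	data []byte
	pos  int
}

func (s *scanner) scanValue() ([]byte, error) {
	// ...inspect the current byte, emit a token...
	s.step = (*scanner).scanObject // method expression stored as next state
	return nil, nil
}

func (s *scanner) scanObject() ([]byte, error) {
	// ...scan an object member, then pick the following state...
	s.step = (*scanner).scanValue
	return nil, nil
}

// Next drives the state machine: each call invokes the stored transition.
func (s *scanner) Next() ([]byte, error) {
	return s.step(s)
}
```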
Another valuable technique is to compile code with escape analysis output enabled to understand allocation behavior. Compiling Go code with the “-gcflags=-m=2” flag produces lots of interesting output indicating whether variables escape, and whether or not functions can be inlined (you can grep the output for specific files or functions).
For example, if we look for buffer.go, we see:

pkg/util/json/tokenizer/buffer.go:74:6: can inline grow with cost 51 as: func([]byte, int) []byte { if n < smallBufferSize { n = smallBufferSize }; if buf == nil { return make([]byte, n) }; c := len(buf) + n; if c < 2 * cap(buf) { c = 2 * cap(buf) }; nb := append(([]byte)(nil), make([]byte, c)...); copy(nb, buf); return nb[:cap(nb)] }
pkg/util/json/tokenizer/buffer.go:36:6: can inline (*buffer).AppendByte with cost 79 as: method(*buffer) func(byte) { if cap(b.buf) - b.l <= 1 { b.buf = grow(b.buf, 1) }; b.buf[b.l] = c; b.l += 1 }
pkg/util/json/tokenizer/buffer.go:38:15: inlining call to grow
We can see that some of the functions called in a hot loop can be inlined (grow costs 51, and AppendByte costs 79; the compiler’s current threshold for inlining a function is a cost of 80). I do not recommend hunting for inline optimizations as a general rule of thumb. But when they work, the performance improvements can be noticeable. Sometimes, making a small tweak can make the difference between a hot function being inlinable or not.
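The output above also shows the shape that keeps AppendByte under the budget: the method body contains only the fast path and delegates the rare reallocation to a separate grow function. Reconstructed from that output (smallBufferSize is an assumed value):

```go
package tokenizer

// buffer is a minimal append-only byte buffer, reconstructed from the
// escape-analysis output above.
type buffer struct {
	buf []byte
	l   int // logical length of buf
}

// grow returns a slice with room for at least n more bytes.
func grow(buf []byte, n int) []byte {
	const smallBufferSize = 64 // assumption: the actual constant may differ
	if n < smallBufferSize {
		n = smallBufferSize
	}
	if buf == nil {
		return make([]byte, n)
	}
	c := len(buf) + n
	if c < 2*cap(buf) {
		c = 2 * cap(buf)
	}
	nb := append([]byte(nil), make([]byte, c)...)
	copy(nb, buf)
	return nb[:cap(nb)]
}

// AppendByte keeps only the fast path in the method body (cost 79, just
// under the inlining threshold of 80) and delegates reallocation to grow,
// so the compiler can inline this hot method at its call sites.
func (b *buffer) AppendByte(c byte) {
	if cap(b.buf)-b.l <= 1 {
		b.buf = grow(b.buf, 1)
	}
	b.buf[b.l] = c
	b.l++
}
```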
This has been a fun optimization project. Open source offers many alternatives to almost any problem – explore what’s out there before jumping into your own implementation. It is important to understand the reason for attempting a particular optimization, and it is equally important to have clear goals in mind. A word of caution: do not forget about the reader of your code. A modest optimization sometimes results in incomprehensible code – avoid those when possible.
Final, final thought: we might be able to contribute our changes back to the pkg/json package so that you all can benefit.
Huge thanks to all of the authors who wrote open source packages and blog posts mentioned here.