In this article, I will introduce you to a new package called astjson that I have been working on for the last couple of weeks. It is a Go package for transforming and merging JSON objects at very high speed. It builds on the jsonparser package by buger, aka Leonid Bugaev, and extends it with fast JSON transformation and merging.
By leveraging the astjson package, we were able to speed up our GraphQL API Gateway (Cosmo Router) while reducing its memory footprint. At the macro level, we increased requests per second by 23% and reduced p99 latency by 44% over the previous version of Cosmo Router. At the micro level, we reduced the memory usage of a benchmark by 60%.
For comparison, we benchmarked the Cosmo Router against the Apollo Router, which is written in Rust. Our benchmark showed that the Cosmo Router, although written in Go, achieves 8.6x higher throughput and 9x lower latency than the Apollo Router on a 169 KB JSON response payload. With smaller payloads the difference shrinks, but it remains significant.
Given the 44% reduction in p99 latency compared to the previous version of the Cosmo Router, we are confident that the astjson package is a significant contributor to these performance improvements.
Are you looking for an Open Source Graph Manager? Cosmo is the most complete solution including Schema Registry, Router, Studio, Metrics, Analytics, Distributed Tracing, Breaking Change detection and more.
Why we needed astjson and what problem it solves for us
At WunderGraph, we are building a GraphQL API Gateway, also known as a Router. The Router is responsible for aggregating data from multiple Subgraphs and exposing them as a single GraphQL API. Subgraphs are GraphQL APIs that can be combined into a single unified GraphQL API using the Router.
When a client sends a GraphQL Operation to the Router, the Router parses it and uses its configuration to determine which requests to send to which Subgraph. When the data comes back from the Subgraphs, the Router merges the results together. It's often the case that some requests depend on the results of other requests.
Here's a typical pattern of the request pipeline (a simplified sketch in Go follows the list):
- The Router receives a GraphQL Request from a client
- It makes a request to one Subgraph to fetch the data for a root field
- It then drills into the result and makes 3 more requests in parallel to fetch the data for the nested fields
- It merges all of the results and drills even deeper into the merged data
- It makes 2 more requests to fetch the data for the nested fields
- It merges all of the results
- It renders the response and sends it back to the client
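To make this concrete, here is a heavily simplified sketch of such a pipeline in Go. Everything in it (the Subgraph names, the queries, and the fetchSubgraph helper) is made up for illustration; the real Router plans these fetches from its configuration:

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// fetchSubgraph stands in for an HTTP request to a Subgraph.
func fetchSubgraph(ctx context.Context, subgraph, query string) ([]byte, error) {
	return []byte(fmt.Sprintf(`{"from":%q}`, subgraph)), nil
}

func main() {
	ctx := context.Background()

	// Fetch the root field from one Subgraph first.
	root, err := fetchSubgraph(ctx, "products", `{ topProducts { upc } }`)
	if err != nil {
		panic(err)
	}

	// Drill into the result and fetch nested fields from three Subgraphs in parallel.
	g, gctx := errgroup.WithContext(ctx)
	nested := make([][]byte, 3)
	for i, subgraph := range []string{"reviews", "inventory", "shipping"} {
		i, subgraph := i, subgraph
		g.Go(func() error {
			res, err := fetchSubgraph(gctx, subgraph, `{ _entities { ... } }`)
			nested[i] = res
			return err
		})
	}
	if err := g.Wait(); err != nil {
		panic(err)
	}

	// Merge all results and render the response; in the real Router,
	// this merging step is where astjson comes in.
	fmt.Println(string(root), len(nested), "nested results to merge")
}
```

The real pipeline repeats the drill-in and merge steps for deeper nesting levels, which is exactly where the cost of repeatedly parsing and serializing JSON adds up.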
In a previous post, we've written about Dataloader 3.0 and how we're using breadth-first data loading to reduce concurrency and the number of requests to the Subgraphs.
In this post, we will expand on that and show you how we're using the new astjson package to efficiently transform and merge JSON objects during the request pipeline. Let's start by looking at some of the benchmarks we've run and then dive into the details of how we're using astjson to achieve these results.
Benchmarks
Cosmo Router 0.33.0 vs Cosmo Router 0.35.0 vs Apollo Router 1.33.2
Cosmo Router 0.33.0 nested batching benchmark without astjson
Memory profile:
jens@MacBook-Pro-3 resolve % go test -run=nothing -bench=Benchmark_NestedBatchingWithoutChecks -memprofile memprofile.out -benchtime 3s && go tool pprof memprofile.out
goos: darwin
goarch: arm64
pkg: github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve
Benchmark_NestedBatchingWithoutChecks-10 338119 11003 ns/op 33.72 MB/s 7769 B/op 101 allocs/op
PASS
ok github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve 4.016s
Type: alloc_space
Time: Nov 22, 2023 at 9:13am (CET)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5
Showing nodes accounting for 1538.83MB, 58.95% of 2610.46MB total
Dropped 53 nodes (cum <= 13.05MB)
Showing top 5 nodes out of 46
flat flat% sum% cum cum%
570.69MB 21.86% 21.86% 570.69MB 21.86% github.com/buger/jsonparser.Set
402.56MB 15.42% 37.28% 405.56MB 15.54% github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve.(*Loader).mergeJSONWithMergePath
308.54MB 11.82% 49.10% 717.10MB 27.47% github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve.(*Loader).resolveBatchEntityFetch
131.51MB 5.04% 54.14% 131.51MB 5.04% github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve.(*Loader).resolveBatchEntityFetch.func1
125.54MB 4.81% 58.95% 125.54MB 4.81% bytes.growSlice
(pprof)
CPU profile:
go test -run=nothing -bench=Benchmark_NestedBatchingWithoutChecks -cpuprofile profile.out -benchtime 3s && go tool pprof profile.out
goos: darwin
goarch: arm64
pkg: github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve
Benchmark_NestedBatchingWithoutChecks-10 278344 10863 ns/op 34.15 MB/s 7514 B/op 101 allocs/op
PASS
ok github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve 6.022s
Type: cpu
Time: Nov 22, 2023 at 12:36pm (CET)
Duration: 5.83s, Total samples = 130.82s (2245.54%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 125.45s, 95.90% of 130.82s total
Dropped 293 nodes (cum <= 0.65s)
Showing top 20 nodes out of 105
flat flat% sum% cum cum%
86.66s 66.24% 66.24% 86.66s 66.24% runtime.usleep
22.62s 17.29% 83.53% 22.62s 17.29% runtime.pthread_cond_wait
6.45s 4.93% 88.47% 6.45s 4.93% runtime.pthread_kill
5.51s 4.21% 92.68% 5.51s 4.21% runtime.pthread_cond_signal
1.48s 1.13% 93.81% 1.48s 1.13% runtime.madvise
1.44s 1.10% 94.91% 1.44s 1.10% runtime.pthread_cond_timedwait_relative_np
0.29s 0.22% 95.13% 0.71s 0.54% github.com/buger/jsonparser.blockEnd
0.28s 0.21% 95.34% 1.11s 0.85% runtime.mallocgc
0.13s 0.099% 95.44% 1.05s 0.8% github.com/buger/jsonparser.searchKeys
0.10s 0.076% 95.52% 0.66s 0.5% github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve.(*Loader).mergeJSON
0.09s 0.069% 95.59% 57.05s 43.61% runtime.lock2
0.09s 0.069% 95.66% 8.90s 6.80% sync.(*Mutex).lockSlow
0.07s 0.054% 95.71% 26.99s 20.63% github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve.(*Loader).resolveBatchFetch
0.06s 0.046% 95.76% 3.79s 2.90% runtime.unlock2
0.04s 0.031% 95.79% 0.91s 0.7% github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve.(*SimpleResolver).resolveObject
0.04s 0.031% 95.82% 2.40s 1.83% runtime.gcDrain
0.03s 0.023% 95.84% 24.46s 18.70% runtime.semrelease1
0.03s 0.023% 95.86% 2.47s 1.89% sync.(*WaitGroup).Add
0.02s 0.015% 95.88% 0.86s 0.66% github.com/buger/jsonparser.ArrayEach
0.02s 0.015% 95.90% 1.65s 1.26% github.com/buger/jsonparser.internalGet
Cosmo Router 0.35.0 nested batching benchmark with astjson
Memory profile:
go test -run=nothing -bench=Benchmark_NestedBatchingWithoutChecks -memprofile memprofile.out -benchtime 3s && go tool pprof memprofile.out
goos: darwin
goarch: arm64
pkg: github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve
Benchmark_NestedBatchingWithoutChecks-10 492805 7361 ns/op 50.40 MB/s 2066 B/op 35 allocs/op
PASS
ok github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve 3.815s
Type: alloc_space
Time: Nov 22, 2023 at 9:11am (CET)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5
Showing nodes accounting for 898.63MB, 86.74% of 1036.04MB total
Dropped 39 nodes (cum <= 5.18MB)
Showing top 5 nodes out of 28
flat flat% sum% cum cum%
274.08MB 26.45% 26.45% 275.08MB 26.55% github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve.(*Group).doCall.func2
240.03MB 23.17% 49.62% 580.06MB 55.99% github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve.(*Loader).resolveAndMergeFetch
199.51MB 19.26% 68.88% 510.59MB 49.28% github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve.(*Loader).loadBatchEntityFetch
132.01MB 12.74% 81.62% 407.09MB 39.29% github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve.(*Group).Do
53MB 5.12% 86.74% 53MB 5.12% github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve.(*Loader).selectNodeItems
CPU profile:
go test -run=nothing -bench=Benchmark_NestedBatchingWithoutChecks -cpuprofile profile.out -benchtime 3s && go tool pprof profile.out
goos: darwin
goarch: arm64
pkg: github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve
Benchmark_NestedBatchingWithoutChecks-10 468690 7289 ns/op 50.90 MB/s 2067 B/op 35 allocs/op
PASS
ok github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve 3.716s
Type: cpu
Time: Nov 22, 2023 at 12:32pm (CET)
Duration: 3.60s, Total samples = 95.33s (2646.29%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 93.28s, 97.85% of 95.33s total
Dropped 227 nodes (cum <= 0.48s)
Showing top 20 nodes out of 67
flat flat% sum% cum cum%
75.21s 78.89% 78.89% 75.21s 78.89% runtime.usleep
15.70s 16.47% 95.36% 15.70s 16.47% runtime.pthread_cond_wait
1.61s 1.69% 97.05% 1.61s 1.69% runtime.pthread_cond_signal
0.53s 0.56% 97.61% 0.53s 0.56% runtime.pthread_kill
0.04s 0.042% 97.65% 51.07s 53.57% runtime.lock2
0.02s 0.021% 97.67% 25.70s 26.96% runtime.runqgrab
0.02s 0.021% 97.69% 25.76s 27.02% runtime.stealWork
0.02s 0.021% 97.71% 1.72s 1.80% runtime.systemstack
0.02s 0.021% 97.73% 14.38s 15.08% sync.(*Mutex).lockSlow
0.02s 0.021% 97.76% 26.74s 28.05% sync.(*Mutex).unlockSlow
0.01s 0.01% 97.77% 45.20s 47.41% github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve.(*Group).Do
0.01s 0.01% 97.78% 10.71s 11.23% github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve.(*Loader).walkArray
0.01s 0.01% 97.79% 46.48s 48.76% runtime.schedule
0.01s 0.01% 97.80% 15.88s 16.66% runtime.semacquire1
0.01s 0.01% 97.81% 29.25s 30.68% runtime.semrelease1
0.01s 0.01% 97.82% 0.55s 0.58% runtime.unlock2
0.01s 0.01% 97.83% 26.75s 28.06% sync.(*Mutex).Unlock (partial-inline)
0.01s 0.01% 97.84% 2.55s 2.67% sync.(*WaitGroup).Add
0.01s 0.01% 97.85% 1.55s 1.63% sync.runtime_Semacquire
0 0% 97.85% 16.14s 16.93% github.com/wundergraph/graphql-go-tools/v2/pkg/engine/resolve.(*Group).doCall
Analyzing the CPU and Memory Profiles
Let's have a close look at the profiles and see what we can learn from them.
If we look at the memory profile of 0.33.0, we can see that it's dominated by the jsonparser.Set function as well as mergeJSONWithMergePath and resolveBatchEntityFetch. All of them make heavy use of the jsonparser package to parse, transform, and merge JSON objects.
If we compare this to the memory profile of 0.35.0, we can see that the jsonparser package is no longer visible, and the memory usage for resolving batch entities has been significantly reduced.
If we look at the CPU profiles, we can see that they are dominated by runtime calls like runtime.usleep and runtime.pthread_cond_wait. This is because we're using concurrency to make requests to the Subgraphs in parallel. In a benchmark like this, we're sending an unrealistic number of requests, so we simply ignore the runtime calls and focus on the function calls from our own code.
What we can see is that the jsonparser package is visible in the CPU profile of 0.33.0 but not in the CPU profile of 0.35.0. In addition, 0.33.0 shows mergeJSON with significant CPU usage, while 0.35.0 really only shows (*Group).Do, the synchronization construct we use to handle concurrency.
We can summarize that 0.33.0 spends a lot of CPU time and memory parsing, transforming, and merging JSON objects using the jsonparser package. In contrast, 0.35.0 seems to have eliminated the jsonparser package from the critical path. You might be surprised to hear that 0.35.0 still uses the jsonparser package, just differently than 0.33.0 did.
Let's have a look at the code to see what's going on.
How does the astjson package work?
You can check out the full code including tests and benchmarks on GitHub. It's part of graphql-go-tools, the GraphQL Router / API Gateway framework we've been working on for the last couple of years. It's the "Engine" that powers the Cosmo Router.
Here's an example from the tests to illustrate how the astjson package can be used:
// create a new JSON object
// this object has an internal JSON AST Node Cache
// it's advised to "Free" the object and re-use it e.g. using sync.Pool
js := &JSON{}
// parse a JSON object into the internal JSON AST Node Cache
// It's important to note that the AST Nodes don't contain the actual JSON content
// they only contain the structure of the JSON object
// We store the actual JSON content in a separate buffer
// The JSON AST Nodes only contain references to positions in the buffer
// This makes the AST Nodes very lightweight
err := js.ParseObject([]byte(`{"a":1,"b":2}`))
assert.NoError(t, err)
// append another JSON object to the internal JSON AST Node Cache
// this is the secret sauce!
c, err := js.AppendObject([]byte(`{"c":3}`))
assert.NoError(t, err)
assert.NotEqual(t, -1, c)
// as we've parsed both JSON objects into the same internal JSON AST Node Cache
// we can merge them at the AST level
// we don't actually merge the content of the JSON objects,
// we only merge the AST trees
merged := js.MergeNodes(js.RootNode, c)
assert.NotEqual(t, -1, merged)
assert.Equal(t, js.RootNode, merged)
out := &bytes.Buffer{}
// once we have all the data in the internal JSON AST Node Cache
// we can print the result to a buffer
// Only at this point, we actually touch the content of the JSON objects
err = js.PrintNode(js.Nodes[js.RootNode], out)
assert.NoError(t, err)
assert.Equal(t, `{"a":1,"b":2,"c":3}`, out.String())
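Since the comments above advise freeing and re-using instances, a pooling pattern along these lines seems natural. This is my own sketch, assuming that Free resets the internal node cache and buffer as the test comment suggests:

```go
import (
	"sync"

	"github.com/wundergraph/graphql-go-tools/v2/pkg/astjson"
)

// jsonPool re-uses JSON instances so the node cache and buffer
// allocations amortize across requests.
var jsonPool = sync.Pool{
	New: func() any { return &astjson.JSON{} },
}

func withJSON(fn func(js *astjson.JSON) error) error {
	js := jsonPool.Get().(*astjson.JSON)
	defer func() {
		js.Free() // assumed to reset internal state for the next user
		jsonPool.Put(js)
	}()
	return fn(js)
}
```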
The fundamental idea behind the astjson package is to parse each incoming JSON object into an AST exactly once, then merge and transform the ASTs as needed without touching the actual JSON content. Once we're done adding, transforming, and merging the ASTs, we print the final result to a buffer.
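To make the "structure only" idea concrete, here is a simplified sketch of what such a node layout could look like. These are illustrative types of my own, not the actual astjson definitions; the point is that nodes hold offsets into a shared buffer and indices of other nodes, never copies of the JSON content:

```go
type NodeKind int

const (
	NodeKindObject NodeKind = iota
	NodeKindArray
	NodeKindString
	NodeKindNumber
	NodeKindBoolean
	NodeKindNull
)

// Node references content by position instead of owning it.
type Node struct {
	Kind         NodeKind
	ObjectFields []int // indices of field nodes (objects only)
	ArrayValues  []int // indices of element nodes (arrays only)
	KeyStart     int   // offsets of the field name in the shared buffer
	KeyEnd       int
	ValueStart   int // offsets of the raw value in the shared buffer
	ValueEnd     int
}
```

With a layout like this, merging two objects is mostly a matter of appending node indices to ObjectFields; no JSON text is read or written until the final print.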
Before this, we were using the jsonparser package to parse each JSON object, merge it with the parent JSON object into a new JSON object, then drill into the resulting JSON object and merge it with other JSON objects. While the jsonparser package is very fast and memory-efficient, the allocations and CPU time required to parse, transform, and merge JSON objects add up over time.
At some point, I realized that we were parsing the same JSON objects over and over again, only to add a field or child object and then parse them again.
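For contrast, here is roughly what a merge looks like when all you have is the jsonparser package. This is a simplified reconstruction, not the actual Router code: every jsonparser.Set call re-parses the path and returns a newly allocated slice, which is exactly the repeated work described above:

```go
import "github.com/buger/jsonparser"

// mergeJSON copies every top-level field of b into a.
func mergeJSON(a, b []byte) ([]byte, error) {
	err := jsonparser.ObjectEach(b, func(key, value []byte, dataType jsonparser.ValueType, _ int) error {
		if dataType == jsonparser.String {
			// jsonparser hands string values over without quotes; re-add them
			value = append(append([]byte(`"`), value...), '"')
		}
		var setErr error
		a, setErr = jsonparser.Set(a, value, string(key)) // allocates a new slice every time
		return setErr
	})
	return a, err
}
```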
Another improvement we've made is to not even print the result to a buffer anymore. In the end, we have to use the AST from the GraphQL Request to define the structure of the JSON response. So instead of printing the raw JSON to a buffer, we've added another package that traverses the GraphQL AST and JSON AST in parallel and prints the result into a buffer. This way, we can avoid another round of parsing and printing the JSON. But there's another benefit to this approach!
How the astjson package simplifies non-null error bubbling
In GraphQL, you can define fields as non-null. This means that the field must be present in the response and must not be null. However, as we are merging data from multiple Subgraphs, it's possible that a field is missing in the response of one Subgraph, so the Gateway will have to "correct" the response.
The process of "correcting" the response is called "error bubbling". When a non-null field is missing, we bubble up the error to the nearest nullable parent field and set its value to null. This way, we can ensure that the response is always valid. In addition, we add an error to the errors array in the response, so the client can see that there was an error.
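As a made-up example, suppose the schema declares Product.name as String! (non-null) but the product field itself as nullable, and a Subgraph response comes back incomplete:

```json
{"data":{"product":{"name":null,"price":10}}}
```

The Gateway bubbles the null up to the nearest nullable field, product, and records the error:

```json
{
  "errors": [
    {
      "message": "Cannot return null for non-nullable field Product.name.",
      "path": ["product", "name"]
    }
  ],
  "data": { "product": null }
}
```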
How does the astjson package help us with this? As explained earlier, we're using the astjson package to merge all the results from the Subgraphs into one large JSON AST. We then traverse the GraphQL AST and the JSON AST in parallel and print the result into a buffer. Actually, that was the short version of the story.
In reality, we walk both ASTs in parallel and "delete" all the fields that are not present in the GraphQL AST. As you might expect, we don't actually delete the fields from the JSON AST, we only set references to the fields to -1 to mark them as "deleted". In addition, we check if a field is nullable or not, and if it's not nullable and missing, we bubble up the error until we find a nullable parent field. All of this happens while we're still just walking through the ASTs, so we haven't printed anything to a buffer yet.
Once the "pre-flight" walk through both ASTs is done, all that's left to do is to print the resulting JSON AST to a buffer.
How does the astjson package perform?
Here's a benchmark from the astjson package:
func BenchmarkJSON_MergeNodesWithPath(b *testing.B) {
	js := &JSON{}
	first := []byte(`{"a":1}`)
	second := []byte(`{"c":3}`)
	third := []byte(`{"d":5}`)
	fourth := []byte(`{"bool":true}`)
	expected := []byte(`{"a":1,"b":{"c":{"d":true}}}`)
	out := &bytes.Buffer{}
	b.SetBytes(int64(len(first) + len(second) + len(third)))
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = js.ParseObject(first)
		c, _ := js.AppendObject(second)
		js.MergeNodesWithPath(js.RootNode, c, []string{"b"})
		d, _ := js.AppendObject(third)
		js.MergeNodesWithPath(js.RootNode, d, []string{"b", "c"})
		boolObj, _ := js.AppendObject(fourth)
		boolRef := js.Get(boolObj, []string{"bool"})
		js.MergeNodesWithPath(js.RootNode, boolRef, []string{"b", "c", "d"})
		out.Reset()
		err := js.PrintNode(js.Nodes[js.RootNode], out)
		if err != nil {
			b.Fatal(err)
		}
		if !bytes.Equal(expected, out.Bytes()) {
			b.Fatal("not equal")
		}
	}
}
This benchmark parses 4 JSON objects into the JSON AST, merges them into a single JSON object, and prints the result to a buffer.
Here are the results:
pkg: github.com/wundergraph/graphql-go-tools/v2/pkg/astjson
BenchmarkJSON_MergeNodesWithPath
BenchmarkJSON_MergeNodesWithPath-10 2050671 623.7 ns/op 33.67 MB/s 0 B/op 0 allocs/op
As you can see, we're able to eliminate all allocations, making this operation very garbage collector friendly.
Conclusion
In this article, we've introduced you to the astjson package and shown how we're using it to speed up our GraphQL API Gateway. It allows us to parse each incoming JSON object exactly once, then merge and transform the JSON AST, and finally print the result to a buffer.
As we've seen in the benchmarks, this approach doesn't just look good on paper; it actually helps us reduce the CPU time and memory usage of our GraphQL API Gateway, which in turn allows us to increase throughput and reduce latency.
As a side effect, the code is now much simpler and easier to understand. The non-null error bubbling logic, in particular, is much easier to implement and understand.
If you found this article interesting, consider following me on Twitter.