Stefan 🚀

Posted on Mar 11, 2022

Thunk-based Resolvers: How to build a powerful and flexible GraphQL Gateway

#webdev #graphql #api #apigateway

At the core of WunderGraph is a very powerful and flexible GraphQL API Gateway. Today, we'll dive deep into the pattern that powers it: Thunk-based GraphQL Resolvers.

Learning more about the Thunk-based Resolver pattern will help you understand how you can build your own custom GraphQL Gateway. If you're familiar with Go, you also don't have to start from scratch as you can build upon our Open Source Framework. If you'd like to implement this idea in another language, you can still learn a lot about the patterns as this post will not focus on the Go implementation.

To start off, let's define the goals of a GraphQL API Gateway:

It should be easy to configure and extend
It should be able to mediate between different services and protocols
It should be possible to re-deploy without a code-generation / compilation step
It should support schema stitching
It should support Apollo Federation
It should support Subscriptions
It should support REST APIs

Building a GraphQL API Gateway that is able to support these requirements is a real challenge. To better understand the problem and how Thunk-based Resolvers help solve it, let's take a look at the differences between a "regular" resolver and a "thunk-based" resolver first.

A Regular GraphQL Resolver

const userResolver = async (id) => {
    const user = await db.userByID(id);
    return user;
}

This is a simple user resolver. It takes a user ID as an argument and returns a user object. What's important to note is that this function returns data if you execute it.

A Thunk-based GraphQL Resolver

A thunk-based resolver on the other hand doesn't return data immediately. Instead, it returns a function that you can execute later.

Here's an example:

const userResolver = async () => {
    return async (root, args, context, info) => {
        const user = await db.userByID(args.id);
        return user;
    }
}

This is a thunk-based resolver. It doesn't return data immediately. Instead, it returns a function that can be used later to load the user.

However, this is a drastic simplification of what thunk-based resolvers look like in practice.

To add more context, we'll now look at how GraphQL Servers usually resolve GraphQL queries.

A Regular GraphQL Server

In a regular GraphQL server, the execution flow of a GraphQL Query looks like this:

Parsing of the GraphQL Query
Normalization
Validation
Execution

During the Execution phase, the GraphQL Server will execute the resolvers and assemble the result.

Let's assume a simple GraphQL Schema...

type Query {
    user(id: ID!): User
}
type User {
    id: ID!
    name: String!
    posts: [Post]
}
type Post {
    id: ID!
    title: String!
    body: String!
}

...with these two GraphQL Resolvers...

const userResolver = async (userID: string) => {
    const user = await db.userByID(userID);
    return user;
}
const postResolver = async (userID: string) => {
    const posts = await db.postsByUserID(userID);
    return posts;
}

...and the following GraphQL Query:

query {
    user(id: "1") {
        id
        name
        posts {
            id
            title
            body
        }
    }
}

The execution of this query will look like this:

walk into the field user and execute the userResolver with the argument "1"
resolve the fields id and name of the user object
walk into the field posts and execute the postResolver with the argument "1"
resolve the fields id, title and body of the post object

Once executed, the response might look like this:

{
    "data": {
        "user": {
            "id": "1",
            "name": "John Doe",
            "posts": [
                {
                    "id": "1",
                    "title": "Hello World",
                    "body": "This is a post"
                }
            ]
        }
    }
}

This should be enough context to understand the problem of building an API Gateway. Let's assume you'd like to build a GraphQL API Gateway that supports the above GraphQL Schema. You should be able to point your GraphQL API Gateway at your GraphQL Server, and it should be able to execute the GraphQL Query.

Remember, the gateway should work without generating code or any compilation steps. What this means is that we're not able to create resolvers like we did in the previous example. That's because the Gateway cannot know ahead of time what Schema you're going to use. It needs to be flexible enough to support any Schema you want to use.

So, how do we solve this problem? This is where thunk-based resolvers come into play.

A Thunk-based GraphQL Server

Thunk-based GraphQL Servers are different from regular GraphQL Servers in that they split the execution of a GraphQL Query into two phases, planning and execution.

During the planning phase, the GraphQL Query will be transformed into an execution plan. Once planned, the execution plan can be executed. It's worth noting that the plan can be cached, so future executions of the same query don't have to go through the planning phase again.
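To illustrate the caching idea, here is a minimal sketch of a plan cache. It is not the WunderGraph implementation: `CachedPlan`, `PlanCache` and `getPlan` are hypothetical names, and collapsing whitespace is only an assumed stand-in for real query normalization.

```typescript
// Hypothetical plan shape: a real plan would describe fetches and the response structure.
type CachedPlan = { steps: string[] };

class PlanCache {
  private plans = new Map<string, CachedPlan>();
  private planner: (query: string) => CachedPlan;
  planningRuns = 0; // counts how often the expensive planning phase ran

  constructor(planner: (query: string) => CachedPlan) {
    this.planner = planner;
  }

  // Returns the plan for a query, running the planner only on a cache miss.
  getPlan(query: string): CachedPlan {
    // Assumption: collapsing whitespace stands in for real query normalization.
    const key = query.replace(/\s+/g, " ").trim();
    let plan = this.plans.get(key);
    if (plan === undefined) {
      this.planningRuns++;
      plan = this.planner(key);
      this.plans.set(key, plan);
    }
    return plan;
  }
}

// The planner passed in here is a dummy that only records the query it saw.
const planCache = new PlanCache((query) => ({ steps: [`fetch for: ${query}`] }));
const firstPlan = planCache.getPlan('query { user(id: "1") { id name } }');
const secondPlan = planCache.getPlan('query {  user(id: "1")   { id name } }');
```

Both calls return the same plan object, and the planner only runs once.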

To make this as easy to understand as possible, let's work backwards from the execution phase to the planning phase.

The Execution Phase of a Thunk-based GraphQL Server

The execution phase depends on the execution plan. As these plans can be quite complex, we're using a simplified version for illustration purposes.

We can actually use the response from above and turn it into an execution plan.

{
    "data": {
        "__fetch": {
            "url": "http://localhost:8080/graphql",
            "body": "{\"query\":\"query{user(id:\\\"1\\\"){id,name,posts{id,title,body}}}\"}",
            "method": "POST"
        },
        "user": {
            "__type": "object",
            "__path": ["user"],
            "__fields": [
              {
                "id": {
                  "__type": "string",
                  "__path": ["id"]
                },
                "name": {
                  "__type": "string",
                  "__path": ["name"]
                },
                "posts": {
                  "__type": "array",
                  "__path": ["posts"],
                  "__item": {
                    "__type": "object",
                    "__fields": [
                      {
                        "id": {
                          "__type": "string",
                          "__path": ["id"]
                        },
                        "title": {
                          "__type": "string",
                          "__path": ["title"]
                        },
                        "body": {
                          "__type": "string",
                          "__path": ["body"]
                        }
                      }
                    ]
                  }
                }
              }
            ]
        }
    }
}

In plain English, the execution engine creates a JSON response. It walks into the data field and executes the fetch operation. Once the fetch operation is executed, it traverses the rest of the fields and extracts the data from the upstream response. This is how the final response is built.
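The walk described above can be sketched in a few lines. This is an illustration under assumptions, not the real engine: the plan shape is a trimmed-down variant of the simplified plan (`kind` instead of `__type`, a plain object for fields), all type and function names are hypothetical, and the fetcher is passed in as a synchronous function so the sketch runs without a network. A real engine would fetch asynchronously.

```typescript
type PlanNode =
  | { kind: "string"; path: string[] }
  | { kind: "object"; path: string[]; fields: Record<string, PlanNode> }
  | { kind: "array"; path: string[]; item: PlanNode };
type FetchConfig = { url: string; method: string; body: string };
type ExecPlan = { fetch: FetchConfig; fields: Record<string, PlanNode> };

// Follows a path of keys into a JSON value; returns undefined if the path breaks.
function getPath(value: unknown, path: string[]): unknown {
  let current: unknown = value;
  for (const key of path) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Extracts the value for one plan node from the upstream data.
function resolveNode(node: PlanNode, parent: unknown): unknown {
  const value = getPath(parent, node.path);
  if (value === undefined || value === null) return null;
  switch (node.kind) {
    case "string":
      return String(value);
    case "object":
      return resolveFields(node.fields, value);
    case "array": {
      const itemNode = node.item;
      return Array.isArray(value) ? value.map((item) => resolveNode(itemNode, item)) : null;
    }
  }
}

function resolveFields(fields: Record<string, PlanNode>, parent: unknown): Record<string, unknown> {
  const result: Record<string, unknown> = {};
  for (const [name, node] of Object.entries(fields)) {
    result[name] = resolveNode(node, parent);
  }
  return result;
}

// Executes the fetch first, then builds the response from the planned fields only.
function executePlan(plan: ExecPlan, fetcher: (fetch: FetchConfig) => unknown): { data: Record<string, unknown> } {
  const upstream = fetcher(plan.fetch);
  return { data: resolveFields(plan.fields, getPath(upstream, ["data"])) };
}

const examplePlan: ExecPlan = {
  fetch: {
    url: "http://localhost:8080/graphql",
    method: "POST",
    body: JSON.stringify({ query: 'query{user(id:"1"){id,name,posts{title}}}' }),
  },
  fields: {
    user: {
      kind: "object",
      path: ["user"],
      fields: {
        id: { kind: "string", path: ["id"] },
        name: { kind: "string", path: ["name"] },
        posts: {
          kind: "array",
          path: ["posts"],
          item: { kind: "object", path: [], fields: { title: { kind: "string", path: ["title"] } } },
        },
      },
    },
  },
};

// Fake upstream that returns more data than the plan asks for.
const fakeUpstream = () => ({
  data: { user: { id: "1", name: "John Doe", email: "john@example.com", posts: [{ id: "1", title: "Hello World" }] } },
});
const gatewayResponse = executePlan(examplePlan, fakeUpstream);
```

Note that the response only contains what the plan describes. The extra `email` and post `id` values from the fake upstream are dropped.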

A few things to note:

The userID is hardcoded to "1". In reality, this would be a dynamic value. So in the real world, you'd have to turn this into a variable and inject it into the query.
Another difference between the real world and the example above is that you'd usually make multiple nested fetch operations
Fetches could also need to be parallelized or batched
Apollo Federation and Schema Stitching add extra complexity
Talking to databases means it's not just fetches but also more complex requests
Subscriptions need to be treated completely differently, as they open up a stream of data
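
As a sketch of the first note, the hardcoded user ID could be replaced by a variable that is filled in at execution time. The `FetchTemplate` shape and `renderFetchBody` are assumptions made for this illustration, not part of any real API.

```typescript
type FetchTemplate = {
  url: string;
  method: "POST";
  query: string; // upstream query that declares its variables
  variables: Record<string, string[]>; // upstream variable name -> path into the client variables
};

// Builds the request body for one execution by looking up each variable value.
function renderFetchBody(template: FetchTemplate, clientVariables: Record<string, unknown>): string {
  const variables: Record<string, unknown> = {};
  for (const [name, path] of Object.entries(template.variables)) {
    let value: unknown = clientVariables;
    for (const key of path) {
      value = typeof value === "object" && value !== null
        ? (value as Record<string, unknown>)[key]
        : undefined;
    }
    variables[name] = value;
  }
  return JSON.stringify({ query: template.query, variables });
}

const userFetchTemplate: FetchTemplate = {
  url: "http://localhost:8080/graphql",
  method: "POST",
  query: "query($id: ID!){user(id:$id){id,name,posts{id,title,body}}}",
  variables: { id: ["id"] },
};

// The same cached template serves different users.
const bodyForUser1 = renderFetchBody(userFetchTemplate, { id: "1" });
const bodyForUser2 = renderFetchBody(userFetchTemplate, { id: "2" });
```

Because the query text no longer contains the ID, the plan stays identical across requests and only the variables change.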

Alright, so now we have an execution plan. As we've mentioned earlier, we're working backwards. Now let's move to the planning phase and see how such an execution plan can be generated from a GraphQL Query.

The Planning Phase of a Thunk-based GraphQL Server

The first ingredient for the execution planner is the configuration. The configuration contains all information about the GraphQL Schema and how to resolve Queries.

Here's a simplified version of the configuration:

{
  "dataSources": [
    {
      "type": "graphql",
      "url": "http://localhost:8080/graphql",
      "rootFields" : [
        {
          "type": "Query",
          "field": "user",
          "args": {
            "id": {
              "type": "string",
              "path": ["id"]
            }
          }
        }
      ],
      "childFields": [
          {
            "type": "User",
            "field": "id"
          },
          {
            "type": "User",
            "field": "name"
          },
          {
            "type": "User",
            "field": "posts"
          },
          {
            "type": "Post",
            "field": "id"
          },
          {
            "type": "Post",
            "field": "title"
          },
          {
            "type": "Post",
            "field": "body"
          }
        ]
    }
  ]
}

In plain English, the configuration says the following: If a query starts with the root field "user", it will be resolved by the DataSource. This DataSource is also responsible for the child fields "id", "name", and "posts" on the type User, as well as the fields "id", "title", and "body" on the type Post.

As we know that it's an upstream of type "graphql", we know that we need to create a GraphQL Query to fetch the data.
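
To make the lookup concrete, here is a small sketch of how a planner might query such a configuration. The helper names (`findRootDataSource`, `ownsChildField`) are hypothetical, and the config is shortened to a few fields.

```typescript
type FieldRef = { type: string; field: string };
type DataSourceConfig = {
  type: "graphql" | "rest";
  url: string;
  rootFields: FieldRef[];
  childFields: FieldRef[];
};

const hasField = (refs: FieldRef[], type: string, field: string): boolean =>
  refs.some((ref) => ref.type === type && ref.field === field);

// Finds the data source that starts at the given root field, if there is one.
function findRootDataSource(
  dataSources: DataSourceConfig[],
  type: string,
  field: string,
): DataSourceConfig | undefined {
  return dataSources.find((ds) => hasField(ds.rootFields, type, field));
}

// Checks whether a data source is still responsible for a nested field.
function ownsChildField(ds: DataSourceConfig, type: string, field: string): boolean {
  return hasField(ds.childFields, type, field);
}

const dataSources: DataSourceConfig[] = [
  {
    type: "graphql",
    url: "http://localhost:8080/graphql",
    rootFields: [{ type: "Query", field: "user" }],
    childFields: [
      { type: "User", field: "id" },
      { type: "User", field: "posts" },
      { type: "Post", field: "title" },
    ],
  },
];

const userSource = findRootDataSource(dataSources, "Query", "user");
```

A planner would call `findRootDataSource` when it enters a root field and `ownsChildField` for every field below it.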

Ok, we have the configuration. Let's look at the "resolvers".

How do you transform the GraphQL Query into the execution plan we've seen above?

We do so by visiting each node of the GraphQL Query. Here's a simplified example:

import { parse, visit, FieldNode, SelectionSetNode } from "graphql";

// graphQLOperation is the incoming GraphQL Operation as a string.
// The visitor uses the graphql-js shape: one enter/leave pair per node kind.
const document = parse(graphQLOperation);
visit(document, {
    SelectionSet: {
        enter(node: SelectionSetNode) {
            // open json object
        },
        leave(node: SelectionSetNode) {
            // close json object
        },
    },
    Field: {
        enter(node: FieldNode) {
            // add field to json object
            // if it's a scalar, add the type
            // if it's an object or array, create a nested structure
            //
            // if it's a root field, create a fetch and add it to the field
        },
        leave(node: FieldNode) {
            // close objects and arrays
        },
    },
});

As you can see, we're not resolving any data. Instead, we're "walking" through all the nodes of the parsed GraphQL Query. We're using the visitor pattern, as it allows us to visit all fields and selection sets in a predictable way. The walker walks depth first through all nodes and calls the "callbacks" on the visitor as it enters or leaves a node.

While walking through the nodes, we can look into the configuration to see if we need to fetch data from an upstream. If we do, we create a fetch and add it to the field.

So, this process does two things. First, it creates the structure of the response that we'll return to the client. Second, it configures the fetches that we'll execute to fetch the data.

As these fetches are just functions that return data and can be executed later, we can also just call them thunks. That's how this approach got its name.
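
Here is a self-contained sketch of that idea. To avoid depending on a GraphQL library, it walks a hand-written selection tree instead of a parsed document, ignores arguments, and the thunk returns a description of the request instead of performing it. `SelectedField`, `SchemaInfo`, `planField` and the `rootUrls` lookup are all hypothetical names chosen for this illustration.

```typescript
type SelectedField = { name: string; selections: SelectedField[] };
type FieldInfo = { returns: string; list: boolean };
type SchemaInfo = Record<string, Record<string, FieldInfo>>;
type Thunk = () => string; // deferred fetch; here it only describes the request
type Planned = {
  type: "scalar" | "object" | "array";
  path: string[];
  fields?: Record<string, Planned>;
  fetch?: Thunk;
};

// Prints a selection back into GraphQL syntax, e.g. user{id,posts{title}}.
function printSelection(field: SelectedField): string {
  if (field.selections.length === 0) return field.name;
  return `${field.name}{${field.selections.map(printSelection).join(",")}}`;
}

// Plans one field depth first: builds the response shape and attaches a thunk to root fields.
function planField(
  schema: SchemaInfo,
  parentType: string,
  field: SelectedField,
  rootUrls: Record<string, string>,
): Planned {
  const info = schema[parentType]?.[field.name];
  if (info === undefined) throw new Error(`unknown field ${parentType}.${field.name}`);
  const planned: Planned = {
    type: field.selections.length === 0 ? "scalar" : info.list ? "array" : "object",
    path: [field.name],
  };
  if (field.selections.length > 0) {
    const fields: Record<string, Planned> = {};
    for (const child of field.selections) {
      fields[child.name] = planField(schema, info.returns, child, rootUrls);
    }
    planned.fields = fields;
  }
  const url = rootUrls[`${parentType}.${field.name}`];
  if (url !== undefined) {
    const query = `query{${printSelection(field)}}`;
    // The thunk captures everything the fetch needs but does not run it yet.
    planned.fetch = () => `POST ${url} ${query}`;
  }
  return planned;
}

const scalar = (returns: string): FieldInfo => ({ returns, list: false });
const schemaInfo: SchemaInfo = {
  Query: { user: { returns: "User", list: false } },
  User: { id: scalar("ID"), name: scalar("String"), posts: { returns: "Post", list: true } },
  Post: { id: scalar("ID"), title: scalar("String"), body: scalar("String") },
};
const operation: SelectedField = {
  name: "user",
  selections: [
    { name: "id", selections: [] },
    { name: "posts", selections: [{ name: "title", selections: [] }] },
  ],
};
const userPlan = planField(schemaInfo, "Query", operation, {
  "Query.user": "http://localhost:8080/graphql",
});
```

After planning, `userPlan.fetch` is just a function value. Nothing has been requested until the execution phase calls it.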

The Challenges of building a Thunk-based GraphQL Framework

What might look simple at first turns out to be a really hard challenge. I'd like to mention a few hard problems that we encountered while building this framework.

You can't just "attach" a datasource to a type field tuple

This was one of the first expensive learnings. While building the first iteration of the "engine", I thought it would be OK to attach DataSources directly to a combination of type and field. The tuple of a type and a field seems unique at first.

Well, the problem is that GraphQL allows for recursive operations. For example, if you have a type User, you can have a field "posts" that returns an array of Post objects. A Post object on the other hand can have a field "user" that returns a User object.

What this means is that you can have the exact same combination of type and field multiple times. So, if you really want to uniquely identify a datasource, you need to take into consideration the type, field AND path in the Operation.

How did we solve the problem ourselves? We're "walking" through the Operation twice.

During the first walk, we identify how many DataSources we need and instantiate them. The second walk is then used to actually build the execution plan. This way, the "execution planner" doesn't have to worry about the boundaries of the datasources.

This is also why we've decided to have "root nodes" and "child nodes" in the planner configuration. Together, they define the boundaries of each DataSource. If you enter a "root node" during the first walk, you instantiate a datasource. Then, as long as you stay within the "child nodes", the same datasource is still responsible. Once you hit a node that's not a child node, the responsibility for this field is handed over to the next datasource.
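
The first walk can be sketched as follows. This is a simplified illustration, not the actual planner: the operation is a hand-written tree that already carries the return type of each field, and `collectInstances`, `SourceConfig` and the two example data sources are made up for the example.

```typescript
type OpField = { name: string; returns: string; selections: OpField[] };
type SourceConfig = { name: string; rootNodes: string[]; childNodes: string[] };
type SourceInstance = { source: string; path: string };

// First walk: instantiate a data source whenever a field leaves the current
// source's child nodes and matches the root node of some data source.
function collectInstances(
  configs: SourceConfig[],
  parentType: string,
  fields: OpField[],
  path: string[] = [],
  current?: SourceConfig,
  found: SourceInstance[] = [],
): SourceInstance[] {
  for (const field of fields) {
    const ref = `${parentType}.${field.name}`;
    const fieldPath = [...path, field.name];
    let owner = current;
    if (owner === undefined || !owner.childNodes.includes(ref)) {
      owner = configs.find((config) => config.rootNodes.includes(ref));
      if (owner !== undefined) {
        // The path makes the instance unique, even for a repeated type and field.
        found.push({ source: owner.name, path: fieldPath.join(".") });
      }
    }
    collectInstances(configs, field.returns, field.selections, fieldPath, owner, found);
  }
  return found;
}

const sourceConfigs: SourceConfig[] = [
  { name: "users", rootNodes: ["Query.user", "Post.user"], childNodes: ["User.id", "User.name"] },
  { name: "posts", rootNodes: ["User.posts"], childNodes: ["Post.id", "Post.title"] },
];
const leaf = (name: string, returns: string): OpField => ({ name, returns, selections: [] });

// user { id posts { title user { name posts { title } } } }
const recursiveOperation: OpField[] = [
  { name: "user", returns: "User", selections: [
    leaf("id", "ID"),
    { name: "posts", returns: "Post", selections: [
      leaf("title", "String"),
      { name: "user", returns: "User", selections: [
        leaf("name", "String"),
        { name: "posts", returns: "Post", selections: [leaf("title", "String")] },
      ] },
    ] },
  ] },
];
const instances = collectInstances(sourceConfigs, "Query", recursiveOperation);
```

For this recursive operation, the walk yields four instances. The tuple of `User` and `posts` shows up twice, at `user.posts` and at `user.posts.user.posts`, which is why the path has to be part of the identity.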

We'll dedicate a separate paragraph to this problem at the end of the post.

You should distinguish between downstream and upstream Schema

Another issue we were facing came with scaling our approach. It's simple to start with one schema, but what if you combine many of them?

Joining multiple GraphQL Schemas is super powerful, as it allows you to Query and combine data from multiple DataSources with a single GraphQL Operation.

At the same time, doing so comes with a few challenges. One large problem area is "naming collisions", which happen all the time. Without thinking too much about it, most if not all schemas have a type named "User". Obviously, a Stripe User is not the same thing as a Salesforce User.

So, what's inevitable with a tool like this is that you have to think about solving naming collisions.

Our first approach was to create an API that would allow the user to rename types. It works, but it's not very user-friendly. You will always run into problems and have to manually rename types before you can combine two schemas.

Another problem is that you will also have naming collisions for fields on the root types Query, Mutation and Subscription. At first, we created another API that allowed users to also rename fields. However, this approach is no more user-friendly than the previous one: it's a tedious process, and it doesn't scale well.

Having this manual step would mean that you have to rename types and fields whenever a schema changes.

We went back to the drawing board and searched for a solution that would allow us to combine multiple schemas automatically, with no manual intervention at all.

The solution we came up with is "API namespacing". By allowing the user to choose a namespace for each API, we can automatically prefix all types and fields with the namespace. This way, all naming collisions are solved automatically.

That said, "API namespacing" doesn't come for free.

Imagine the following GraphQL Schema:

type User {
  id: ID!
  name: String!
}
type Anonymous {
  name: String!
}
union Viewer = User | Anonymous
type Query {
    viewer: Viewer
}

If we "namespace" this schema with the prefix "identity", we'll end up with this Schema:

type identity_User {
  id: ID!
  name: String!
}
type identity_Anonymous {
  name: String!
}
union identity_Viewer = identity_User | identity_Anonymous
type Query {
    identity_viewer: identity_Viewer
}

Now, let's assume a "downstream" Query, a query coming from a client to the "Gateway".

query {
  identity_viewer {
    ... on identity_User {
      id
      name
    }
    ... on identity_Anonymous {
      name
    }
  }
}

We can't just send this query to the "identity" API. We have to "un-namespace" it.

Here's what the upstream Query needs to look like:

query {
  viewer {
    ... on User {
      id
      name
    }
    ... on Anonymous {
      name
    }
  }
}

As you can see, namespacing a GraphQL Schema is not that simple. But it can get even more complex.
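
For the simple case shown above, the rewrite can be sketched with a naive text replacement. This is only an illustration: a real gateway would rewrite the parsed operation rather than the query text, because a plain replacement would also touch string arguments and aliases that happen to contain the prefix. `unNamespace` is a hypothetical helper.

```typescript
// Escapes characters that have a special meaning in a regular expression.
const escapeRegExp = (text: string): string => text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Naive sketch: strips "<namespace>_" wherever it starts a name in the query text.
function unNamespace(query: string, namespace: string): string {
  const prefix = new RegExp(`\\b${escapeRegExp(namespace)}_`, "g");
  return query.replace(prefix, "");
}

const downstreamQuery =
  "query { identity_viewer { ... on identity_User { id name } ... on identity_Anonymous { name } } }";
const upstreamQuery = unNamespace(downstreamQuery, "identity");
```

The opposite direction needs care as well. If the client selects `__typename`, the upstream answers with `User`, and the gateway would have to map that back to `identity_User` before responding.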

What about variable definitions in the downstream query that are only used in some upstream queries? What about namespaced directives that only exist in some upstream Schemas? API Namespacing is a great solution to solve the problem, but it's not as simple to implement as it might sound.

Benefits of the Thunk-based approach to GraphQL Resolvers

Enough on the downsides and challenges, let's talk about the benefits as well!

Thunk-based resolving, if done right, means that you can move a lot of complex computational logic out of your GraphQL resolvers. Static analysis allows you to move a lot of the code out of the hot path, making the execution of your GraphQL resolvers more performant.

What this means is that you can actually completely remove "GraphQL" from the runtime. Once an execution plan is created for a Query, it can be cached and executed at a later time. On subsequent requests, the cached plan is used instead of the original query. This means we can completely skip all GraphQL related parts of the execution, like parsing, normalization, validation, planning, etc...

All we do at runtime is to execute the pre-compiled execution plan, which is simply a data structure that defines the shape of the response and when to execute which fetch. That said, this wasn't enough for us. We went one step further and removed GraphQL altogether. During development, you still write GraphQL queries and mutations, but at runtime, we replace GraphQL with JSON RPC.

We generate a typesafe client that knows exactly what GraphQL Operations you've defined. The developer experience still remains the same, but it's a lot more performant and secure. Not only is the generated client just a few kilobytes in size, this approach also solves most of the 13 most common GraphQL vulnerabilities.

At this point, it should be clear why we introduced the concept of the virtual Graph. If you have a GraphQL API, composed of multiple other APIs, but really only expose a generated REST/JSON RPC API, it makes sense to call it a "virtual" Graph, as the composed GraphQL Schema only really exists virtually.

The idea of the virtual Graph is so powerful, it allows you to integrate a heterogeneous set of APIs using a single unified GraphQL schema and interact with them using plain GraphQL Operations. Join and Stitch data from multiple APIs, as if they were one. A universal interface to all your services, APIs and databases.

Another benefit of the thunk-based approach is that you can be a lot more efficient when it comes to batching. With traditional GraphQL resolvers, you have to use patterns like "DataLoader" to batch requests. DataLoader is a technique that waits for a small period of time to batch multiple requests.

With thunk-based resolvers, you don't have to wait for a "batch window" to fill. You can use static analysis to insert batching into an execution plan. What this also means is that you can actually batch multiple requests with a single "Thread".

This is very important to mention, because synchronization of multiple Threads is very expensive. Imagine you're resolving 10 child fields of an array, and for each array item, you have to make another fetch which can be batched.

With the DataLoader pattern, 10 Threads would be blocking until the batch is resolved. With static analysis on the other hand, a single thread can resolve all 10 fields synchronously.

When "walking" into the first field of all the batch siblings, it's already known through static analysis (at compile time) that a batch request is needed. This means we can immediately create the batch request and execute it. Once fetched, we can continue resolving all the sibling fields synchronously.

Using the thunk-based approach, in this case, means we can make use of the CPU much better because we don't create situations where CPU threads have to wait for each other.
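
A rough sketch of the statically planned batch could look like this. The data shapes (`PostRow`, `Author`) and the synchronous `fetchAuthors` function are assumptions for illustration, and the upstream is faked in memory.

```typescript
type PostRow = { id: string; authorId: string };
type Author = { id: string; name: string };
type BatchFetch = (ids: string[]) => Map<string, Author>;

// The plan already states that authors are batched, so the keys are collected
// up front and all siblings are resolved after a single request.
function resolveAuthorsBatched(posts: PostRow[], fetchAuthors: BatchFetch) {
  const ids = Array.from(new Set(posts.map((post) => post.authorId)));
  const authors = fetchAuthors(ids); // one upstream request for all siblings
  return posts.map((post) => ({
    id: post.id,
    author: authors.get(post.authorId) ?? null,
  }));
}

// Fake upstream that counts how often it is called.
let batchCalls = 0;
const fakeAuthorFetch: BatchFetch = (ids) => {
  batchCalls++;
  return new Map(ids.map((id): [string, Author] => [id, { id, name: `Author ${id}` }]));
};

const tenPosts: PostRow[] = Array.from({ length: 10 }, (_, index) => ({
  id: String(index + 1),
  authorId: String((index % 3) + 1),
}));
const postsWithAuthors = resolveAuthorsBatched(tenPosts, fakeAuthorFetch);
```

Ten sibling fields are resolved with one call to the upstream and without any batch window or cross-thread coordination.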

Thunk-based Resolvers are no replacement for the "classic" Resolver approach

With all the benefits of the thunk-based approach, you might be thinking that you should replace all your existing GraphQL resolvers with the thunk-based approach.

While this might be possible, thunk-based resolvers are not really meant to replace the "classic" Resolver approach. It's a technique that makes sense for building API Gateways and Proxies.

If you're building a middleware, it makes a lot of sense to use this approach, as you already have an API implementation.

However, building a GraphQL Server just using thunk-based resolvers is really hard, and I wouldn't recommend it.

How we've made configuring thunk-based resolvers easy

You've learned a lot about thunk-based resolvers so far, and one takeaway should definitely be that configuring them is both crucial for correct execution and hard to do manually. That's why we've decided that it needs the right level of abstraction without sacrificing flexibility.

Our solution to the problem was to create a TypeScript SDK that allows you to "generate" the configuration for your thunk-based resolvers in just a few lines of code.

Let's look at an example of combining three APIs using the SDK:

const db = introspect.postgresql({
    apiNamespace: "db",
    databaseURL: "postgresql://admin:admin@localhost:54322/example?schema=public"
});

const countries = introspect.graphql({
    apiNamespace: "countries",
    url: "https://countries.trevorblades.com/",
});

const stripe = introspect.openApi({
    apiNamespace: 'stripe',
    statusCodeUnions: true,
    source: {
        kind: 'file',
        filePath: './stripe.yaml',
    },
    headers: builder => {
        return builder
            .addStaticHeader('Authorization', `Bearer ${process.env["STRIPE_SECRET_KEY"]}`)
    },
});
const myApplication = new Application({
    name: "app",
    apis: [
        db,
        countries,
        stripe,
    ],
});

Here, we're introspecting a PostgreSQL database, a Country GraphQL API and a Stripe OpenAPI specification. As you can see, the API namespace is a simple configuration parameter. All the rest of the configuration is generated.

Imagine you'd have to manually adjust and configure a combination of n APIs. Without such a framework, the complexity of the configuration would probably be: O(n^2). The more APIs you add, the more complex the configuration will be. If a single API changes, you have to go through all steps again to re-configure and test everything.

Contrast this to an automated approach using the SDK. The complexity of the configuration looks more like this: O(n). For each API you add, you only have to adjust the configuration once.

If an origin API changes, you can re-run the introspection process and update the configuration. This can even be done using a CI/CD pipeline.

If you're interested in trying the approach, have a look at our Quickstart Guide which gets you up and running on your local machine in a few minutes.

Did we meet the requirements?

Let's go back and look at the requirements we've defined in the introduction and see if we've met them.

It should be easy to configure and extend

As we've shown, the combination of GraphQL Engine and Configuration SDK makes for a very flexible solution, while keeping it easy to configure and extend.

It should be able to mediate between different services and protocols

The Middleware speaks GraphQL to the client. Actually, it speaks JSON RPC over HTTP but let's ignore this for a moment.

The Thunk-based resolvers allow us to implement both the planning and execution phase in a way that we can mediate between GraphQL and any other protocol, such as REST, GraphQL, gRPC, Kafka or even SQL.

Simply gather all the information you need during the "planning phase", e.g. the tables and columns of the database, or the topics and partitions of a Kafka service, and then execute the thunks during the "execution phase" which actually talk to the database or Kafka.

It should be possible to re-deploy without a code-generation / compilation step

Having to re-compile and deploy the GraphQL Engine every time you want to change the schema would be a huge pain. As we've described earlier, the output of the planning phase, the "execution plans" can be cached in memory and even serialized/deserialized. This means, we can change the configuration without a compilation step and horizontally scale the Engine up and down.

It should support schema stitching

If we recall how the GraphQL Engine works, we've described earlier that you have to define a GraphQL Schema and then attach DataSources to tuples of types and fields. This is the only requirement to enable schema stitching, meaning that it works out of the box.

It should support Apollo Federation

The GraphQL DataSource (It's all Open Source) is implemented in such a way that it understands the Apollo Federation Specification. All you have to do is pass the correct configuration parameters to the DataSource and the rest works automatically.

Here's an example of how to configure it:

const federatedApi = introspect.federation({
    apiNamespace: "federated",
    upstreams: [
        {
            url: "http://localhost:4001/graphql"
        },
        {
            url: "http://localhost:4002/graphql"
        },
    ]
});

The GraphQL+Federation DataSource shows the power of the thunk-based approach. It started off as a simple GraphQL DataSource, which then got extended to support the Apollo Federation Specification. As planning and execution are split into two phases, it's also very easy to test.

It should support Subscriptions

Subscriptions are one of those features that a lot of developers shy away from when implementing a GraphQL Gateway. Especially when it comes to Federation, I was asked by someone from Apollo how we manage all the connections between client and Gateway (downstream) and between Gateway and Subgraphs (upstream).

The answer is simple: Divide and Conquer. The Thunk-based approach allows us to split hard problems like Subscriptions into small pieces.

When a client connects to the Gateway and wants to start the first Subscription, you go through the planning steps and figure out what GraphQL Operation you have to send to the origin. If there's no active WebSocket connection yet to the origin, you initiate one using the Protocol that all GraphQL Servers agreed on. When you get a message back, you forward it to the client. If a second client connects and wants to start a second Subscription, you can reuse the same WebSocket connection to the origin.

Once all clients have disconnected, you can close the WebSocket connection to the origin.

In short, because we can put whatever logic we want into the planner implementation as well as the executor implementation, we can also support Subscriptions.
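
The connection sharing described above boils down to reference counting. Here is a minimal sketch, assuming a hypothetical `UpstreamConnection` interface in place of a real WebSocket and leaving out message forwarding entirely.

```typescript
// Hypothetical stand-in for a WebSocket connection to the origin.
interface UpstreamConnection {
  close(): void;
}

class SharedUpstream {
  private connection: UpstreamConnection | undefined;
  private subscribers = 0;
  private connect: () => UpstreamConnection;

  constructor(connect: () => UpstreamConnection) {
    this.connect = connect;
  }

  // Called when a client starts a Subscription; connects on first use only.
  acquire(): UpstreamConnection {
    if (this.connection === undefined) {
      this.connection = this.connect();
    }
    this.subscribers++;
    return this.connection;
  }

  // Called when a client stops; closes the connection once the last one is gone.
  release(): void {
    if (this.subscribers === 0) return;
    this.subscribers--;
    if (this.subscribers === 0 && this.connection !== undefined) {
      this.connection.close();
      this.connection = undefined;
    }
  }
}

// Counters make the behaviour observable without a real network.
let opened = 0;
let closed = 0;
const origin = new SharedUpstream(() => {
  opened++;
  return { close: () => { closed++; } };
});

const firstClient = origin.acquire();  // opens the upstream connection
const secondClient = origin.acquire(); // reuses it
origin.release();                      // one client left, connection stays open
```

At this point, one connection has been opened and none closed. Releasing the second client would close it, and a later `acquire` would open a fresh one.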

It should support REST APIs

The same principle applies to REST APIs, which means they are supported as well. The REST API DataSource was probably the easiest to implement. That's because you don't have to "traverse" child fields to identify if the REST API call needs to be modified. If you compare the REST DataSource vs the implementation of the GraphQL DataSource, you'll see that the former is a lot simpler.

Summary

In this article, we've covered the secrets behind building a GraphQL API Gateway: Thunk-based Resolvers. It should be clear that the classic approach to writing Resolvers makes most sense for writing GraphQL APIs, whereas the thunk-based approach is the right choice for building API Gateways, Proxies and Middleware in general.

We're soon going to fully open source both the Thunk-based GraphQL Engine and the SDK, so make sure to sign up to get notified about the release.

Outlook

I could imagine writing more about the "behind the scenes" of GraphQL Gateways, e.g. how we've implemented Federated Subscriptions or how the Engine can automatically batch requests without having to rely on DataLoader. Please drop a line on Twitter or Discord if you're interested in such topics.

DEV Community