My team ran into an interesting issue today with our configuration of Apollo Server, data sources, and our caching layer. It stems from our architecture decisions, and how seemingly sensible defaults can lead to failures. Fortunately, it has a happy ending with a fix in less than a dozen lines of code.
Let's start from the outside and work our way in...
Outermost Layer: Apollo Server
My team, for various reasons that I'm happy to elaborate on in another post, went with an entirely serverless architecture on AWS for our application. That means all our code is deployed in Lambda functions. We love it, but it comes with tradeoffs. At the time of this writing, the biggest of these has to do with interacting with certain other AWS services that can only be deployed within a VPC (Virtual Private Cloud). Services such as RDS (Relational Database Service), and importantly for this story, ElastiCache (managed Redis & Memcache).
By default, when you deploy a Lambda function, it is not deployed into a VPC. The API calls to invoke the function are open to the Internet, but protected by IAM Authorization. You do have the option to specify a VPC to attach the function to. However, here's where the tradeoffs come in. First of all, you need to configure the VPC to have multiple IP subnets large enough to accommodate the maximum concurrency (how many instances of your function can execute at the same time) across all the functions that you want to attach to this VPC. Second, you need to configure Security Groups that allow your functions in the Lambda subnets to talk to the resources, such as Redis, in the other subnets. Third, you need to set up VPC Endpoints for any services that you use that aren't natively resident inside a VPC, such as S3 and DynamoDB. Finally, if you access any APIs out on the Internet, you'll also need to set up a NAT Gateway. That's a lot of networking setup, and a lot of things to misconfigure!
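To make that concrete, here's a minimal sketch of the wiring in AWS CDK (v2, JavaScript). This is illustrative, not our actual stack: the resource names are made up, and your subnet sizing depends on the concurrency you need to support.

```js
const { Stack, aws_ec2: ec2, aws_lambda: lambda } = require('aws-cdk-lib');

class VpcApiStack extends Stack {
  constructor(scope, id, props) {
    super(scope, id, props);

    // Subnets need enough free IPs for one ENI per concurrent function instance.
    const vpc = new ec2.Vpc(this, 'ApiVpc', {
      maxAzs: 2,
      natGateways: 1, // only needed if your functions call out to the Internet
    });

    // Gateway endpoints keep S3/DynamoDB traffic inside the VPC, off the NAT.
    vpc.addGatewayEndpoint('S3', { service: ec2.GatewayVpcEndpointAwsService.S3 });
    vpc.addGatewayEndpoint('Dynamo', { service: ec2.GatewayVpcEndpointAwsService.DYNAMODB });

    // Attaching the function here is exactly what incurs the ENI cold-start cost.
    new lambda.Function(this, 'GraphQLFn', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('dist'),
      vpc,
    });
  }
}
```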
Even if you set up your VPC correctly, you'll soon discover that the cold-start time (the first time an instance of your function is run) is atrocious, on the order of 10-20 seconds before your code can start executing. That's because attaching a Lambda function to a VPC requires setting up an ENI (Elastic Network Interface) for each instance of your function, and ENIs were not designed for quick setup and teardown. This reason alone makes attaching a function to a VPC a non-starter for an API that services a web app.
For that reason, we opted to forgo anything that required a VPC and deploy our Apollo Server Lambda unattached. That means that Redis and Memcache are out of the question for caching.
Caching
When I talk about caching here, I'm speaking about DataSource caching, and specifically how it interacts with `RESTDataSource`. `RESTDataSource` allows you to create an abstraction for calling REST APIs in your resolver functions. Its interface allows the Apollo GraphQL engine to insert a cache into the process, such that the first request hits the network and duplicate requests are served from the cache. By default, this is an in-memory cache, but if you're running with a concurrency greater than 1, which you almost certainly will with Lambda, you're going to want to use an external cache service, such as Redis. However, Redis needs to be deployed inside a VPC, and as previously discussed, this is a non-starter for our architecture.
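If you haven't used it, a `RESTDataSource` subclass looks something like this minimal sketch (the `MoviesAPI` name and base URL are placeholders, not our actual service):

```js
const { RESTDataSource } = require('apollo-datasource-rest');

class MoviesAPI extends RESTDataSource {
  constructor() {
    super();
    this.baseURL = 'https://movies-api.example.com/';
  }

  async getMovie(id) {
    // GET requests like this one are the ones the cache intercepts.
    return this.get(`movies/${encodeURIComponent(id)}`);
  }
}
```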
Enter `apollo-server-cache-dynamodb`, a Node module that I wrote to use DynamoDB as a key-value cache with automatic key expiration. You configure this and plug it into the Apollo Server configuration, and it takes care of injecting it into your Data Sources. All your Data Source GET requests will be cached in DynamoDB, using the request url as the key.
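Wiring it into Apollo Server looks roughly like this. I'm sketching the options from memory, so check the module's README for the exact names; `typeDefs` and `resolvers` are assumed to be defined elsewhere:

```js
const { ApolloServer } = require('apollo-server-lambda');
const { DynamoDBCache } = require('apollo-server-cache-dynamodb');
const DynamoDB = require('aws-sdk/clients/dynamodb');

const server = new ApolloServer({
  typeDefs,
  resolvers,
  dataSources: () => ({ moviesAPI: new MoviesAPI() }),
  // Apollo injects this cache into every data source it constructs.
  cache: new DynamoDBCache(new DynamoDB.DocumentClient(), {
    tableName: 'KeyValueCache', // a table with TTL enabled, for key expiration
  }),
});

exports.handler = server.createHandler();
```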
Cache Key
If your Spidey-sense tingled at the thought of using the request url as the cache key, you'd be right. This, in fact, turned out to be the source of our problems. The way we are querying an API results in a very long query string. So long, in fact, that it breached the DynamoDB limit of 2048 bytes for a partition key. When the `RESTDataSource` class went to cache the response for one of these very long urls in DynamoDB, it would raise an error and cause the entire GraphQL request to fail.
Fortunately for us, `RESTDataSource` provides a number of ways to hook into interesting request and response events. For example, there's `willSendRequest`, which allows you to set an Authorization header for every request. There's also `cacheKeyFor`, which allows you to calculate your own cache key for the request. This is the hook we needed in order to generate a cache key suitable for use as a DynamoDB partition key.
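Continuing the `MoviesAPI` sketch from above, the first hook looks like this; where the token comes from (`this.context.token`) is an assumption about how your context object is shaped:

```js
const { RESTDataSource } = require('apollo-datasource-rest');

class MoviesAPI extends RESTDataSource {
  willSendRequest(request) {
    // Called before every outgoing request, so auth lives in one place.
    request.headers.set('Authorization', this.context.token);
  }
}
```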
Cache Key Calculation
We decided that we'd use a hashing function to calculate a unique identifier for the request url. We quickly realized that we would make our cache ineffective if we didn't take care to sort the query string parameters with a stable sort, so that equivalent requests always produce the same key. As it turns out, the WHATWG `URLSearchParams` interface provides a `sort` method:
> The `URLSearchParams.sort()` method sorts all key/value pairs contained in this object in place and returns `undefined`. The sort order is according to unicode code points of the keys. This method uses a stable sorting algorithm (i.e. the relative order between key/value pairs with equal keys will be preserved).
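A quick check of that behavior:

```js
const params = new URLSearchParams('b=2&a=1&a=0');
params.sort();
console.log(params.toString()); // "a=1&a=0&b=2" — sorted by key, equal keys keep their order
```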
Bazinga! Once we had a way to stably sort the query params, generating an identifier for the request was very straightforward. Initially, we went with a straight `sha1` hex digest, but ultimately we opted to go with a UUID v5.
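For reference, that first iteration was essentially a one-liner over Node's built-in crypto module:

```js
const crypto = require('crypto');

// 40-character hex digest of the (sorted) request url
const sha1Key = (url) => crypto.createHash('sha1').update(url).digest('hex');
```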
Here's what our `cacheKeyFor` implementation looks like, using the `uuid` package:
```js
const { v5: uuidv5 } = require('uuid');

cacheKeyFor(request) {
  const requestUrl = new URL(request.url);
  // sort() is stable and in place, so equivalent urls always yield the same key
  requestUrl.searchParams.sort();
  // deterministic UUID v5 in the URL namespace: a short, fixed-length partition key
  return uuidv5(requestUrl.toString(), uuidv5.URL);
}
```