Introduction
In my organization, we built a multi-tenant SaaS application. We host it on AWS to deliver the best experience to users across the globe. The application spans multiple regions, which helps us distribute and isolate infrastructure, improves availability, and limits the blast radius of outages: if one region goes down, only that region is affected, and the others keep serving traffic.
Our application has two main components: a frontend module, a single-page web application (React), and a backend module, a set of microservices running on Kubernetes clusters. It's quite a basic architecture. However, there are challenges to deal with, especially since the application is multi-tenant and multi-region.
In this post, let’s talk about the frontend module.
Challenges
As mentioned, the frontend module is designed and deployed as a region-specific application. Initially, the module was deployed in regional Kubernetes clusters as Nginx pods. For each region, the module is built and hosted in a separate directory of a Docker image. Based on the region in which it's deployed, the corresponding directory is used to serve requests.
This deployment architecture requires us to operate and maintain Nginx in Kubernetes clusters as well as handle scaling to meet on-demand user traffic. It's also not good in terms of latency, since every end-user request has to reach the Nginx pods in that specific region. Say a user located in the US accesses a tenant in Singapore at https://xyz.example.com. That user's requests are routed from the US to Singapore and back, which increases latency and makes site loading slow.
Requirements
To overcome the above challenges and deliver a better user experience, we looked for a solution that meets the requirements below:
- Reduce latency as much as possible so that site performance is good no matter where end-users are
- Remove as much operational cost as we can
- For business reasons, we want some regions to go live before or after others, so the application must remain region-specific
Solutions
Fortunately, a CDN (AWS CloudFront) is the best fit for our case. It's an ideal solution that meets the above requirements.
There are two possible solutions:
1. A CloudFront distribution for each region
This is the first solution that comes to mind, and the simplest one. However, we quickly realized it cannot be done, because of a CloudFront limitation on alternate domain names. Below is the error when setting up a second distribution with the same alternate domain name *.example.com:
Invalid request provided: One or more of the CNAMEs you provided are already associated with a different resource
Read more: alternate-domain-names-restrictions
2. One CloudFront distribution + Lambda@Edge for all regions
We leverage CloudFront, Lambda@Edge, and a DynamoDB global table. Here is a high-level overview of the solution:
Since we host the frontend module for each region in a directory of an S3 bucket, we have to implement some kind of dynamic routing that sends origin requests to the correct directory of the S3 bucket for the CloudFront distribution.
To implement that dynamic routing, we use Lambda@Edge. It allows us to use any attribute of the HTTP request, such as Host, URI Path, Headers, Cookies, or Query String, and set the origin accordingly.
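For reference, here is a minimal sketch (assuming the standard CloudFront origin-request event shape that Lambda@Edge receives) of where those attributes live:

def lambda_handler(event, context):
    # The request object sits inside the CloudFront event record.
    request = event['Records'][0]['cf']['request']

    host = request['headers']['host'][0]['value']   # Host header
    uri = request['uri']                            # URI path, e.g. /index.html
    query = request['querystring']                  # raw query string
    cookies = request['headers'].get('cookie', [])  # cookie headers, if any

    # ...inspect any of the above and modify the origin accordingly...
    return request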
In our case, we'll use the Origin request event to trigger a Lambda@Edge function that inspects the Host header to determine the tenant's location and routes the request to the correct directory of the S3 origin bucket.
The following diagram illustrates the sequence of events for our case.
Here is how the process works:
- A user navigates to the tenant, e.g. https://xyz.example.com.
- CloudFront serves content from the cache if available; otherwise, it goes to step 3.
- Only after a CloudFront cache miss is the origin request trigger fired for that behavior. This invokes the Lambda@Edge function to modify the origin request.
- The Lambda@Edge function queries the DynamoDB table to determine which folder should be served for that tenant (a sketch of the assumed table layout follows this list).
- The function then forwards the request to the chosen folder.
- The object is returned to CloudFront from Amazon S3, served to the viewer, and cached, if applicable.
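The exact schema isn't critical, but based on the function shown later, a tenant record in the tenant-location table might look like the sketch below (partition key Tenant, attribute Region holding the S3 directory; the table may also have a sort key, which we omit here):

import boto3

# Minimal sketch, assuming the table layout implied by the Lambda@Edge
# function below: partition key "Tenant", attribute "Region".
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('tenant-location')

# Map the tenant's hostname to the regional directory in the S3 bucket.
table.put_item(Item={
    'Tenant': 'xyz.example.com',  # tenant identity = Host header value
    'Region': 'ap-southeast-1',   # S3 directory holding this region's build
})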
Issues
1. Cannot get the tenant identity from the origin request
To determine the tenant's location, we need the Host header, which is also the tenant identity. However, the origin request overrides the Host header with the S3 bucket host (see HTTP request headers and CloudFront behavior). So we use an X-Forwarded-Host header instead. Wait, where does X-Forwarded-Host come from? It's a copy of the Host header, added by a CloudFront function triggered by the Viewer request event (a lightweight CloudFront Function is enough here, since it only copies a header on every viewer request).
Here is what the CloudFront function (viewer request) looks like:
// Copy the original Host header into X-Forwarded-Host so it survives
// the origin request, where CloudFront rewrites Host to the S3 bucket host.
function handler(event) {
    event.request.headers['x-forwarded-host'] = {value: event.request.headers.host.value};
    return event.request;
}
Here is what the Lambda@Edge function (origin request) looks like:
import boto3
from boto3.dynamodb.conditions import Key
from botocore.exceptions import ClientError

def lambda_handler(event, context):
    request = event['Records'][0]['cf']['request']
    table_name = 'tenant-location'
    # Tenant identity: the original Host header, forwarded by the
    # viewer-request CloudFront function above.
    tenant = request['headers']['x-forwarded-host'][0]['value']
    response = None
    try:
        # Query the global table replica in the function's local region.
        table = boto3.resource('dynamodb').Table(table_name)
        response = table.query(
            KeyConditionExpression=Key('Tenant').eq(tenant),
            ScanIndexForward=False
        )
    except ClientError:
        # No replica in this region: fall back to the us-east-1 table.
        table = boto3.resource('dynamodb', region_name='us-east-1').Table(table_name)
        response = table.query(
            KeyConditionExpression=Key('Tenant').eq(tenant),
            ScanIndexForward=False
        )
    if response and len(response['Items']) > 0:
        # Route the origin request to the tenant's regional directory.
        request['origin']['s3']['path'] = '/' + response['Items'][0]['Region']
        return request
    else:
        # Unknown tenant: redirect to the main site.
        return {
            'status': '302',
            'headers': {
                'location': [{
                    'key': 'Location',
                    'value': 'https://www.example.com',
                }]
            }
        }
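To sanity-check the routing logic outside of CloudFront, a minimal sketch of a local test follows. The event below is a hypothetical, trimmed origin-request event (field names follow the standard CloudFront event shape); running it requires AWS credentials and the tenant-location table to exist:

# Hypothetical, trimmed origin-request event for local testing.
event = {
    'Records': [{
        'cf': {
            'request': {
                'headers': {
                    'x-forwarded-host': [{'key': 'X-Forwarded-Host',
                                          'value': 'xyz.example.com'}],
                },
                'origin': {'s3': {'path': ''}},
            }
        }
    }]
}

result = lambda_handler(event, None)
# Expect either a request whose origin path is '/<region-directory>'
# or a 302 redirect for an unknown tenant.
print(result)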
2. High latency on a cache miss at an edge region
This issue answers the question "why a DynamoDB global table?"
In the first implementation, a normal DynamoDB table was used. We experienced poor latency (3.57 seconds) when loading the site on a CloudFront cache miss at the edge region. Inspecting the CloudWatch logs, we found that the Lambda function took more than 2.2 seconds to complete, and that querying tenant info from the DynamoDB table was the most time-consuming step.
REPORT RequestId: c12f91db-5880-4ff6-94c3-d5d1f454092c Duration: 2274.74 ms Billed Duration: 2275 ms Memory Size: 128 MB Max Memory Used: 69 MB Init Duration: 335.50 ms
After CloudFront caches the response at the edge region, the latency is good. So only users who first access the tenant in a specific region experience high latency. Still, it's better to eliminate this issue too.
A DynamoDB global table helps us overcome this issue: with a replica in each region where the Lambda@Edge function runs, the query is served locally instead of crossing regions.
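For illustration, a hedged sketch of adding a replica to an existing table (global tables version 2019.11.21); the region names are examples, and the table must have streams enabled as global tables require:

import boto3

client = boto3.client('dynamodb', region_name='us-east-1')
# Create a replica of the table in another region, turning it into a
# global table. Requires DynamoDB Streams (NEW_AND_OLD_IMAGES) enabled.
client.update_table(
    TableName='tenant-location',
    ReplicaUpdates=[
        {'Create': {'RegionName': 'ap-southeast-1'}},  # Singapore replica
    ],
)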
After enabling the DynamoDB global table, the request latency was reduced from 3.57 seconds to 968 milliseconds. The Lambda function now takes 254 milliseconds to complete.
REPORT RequestId: af3889c5-838d-4aed-bc0c-2d96e890d444 Duration: 253.61 ms Billed Duration: 254 ms Memory Size: 128 MB Max Memory Used: 70 MB