Fully Isolating Data in a Multi-Tenant SaaS on Google Cloud using a Token Vending Machine

#googlecloud #security #webdev #architecture

If you're building a multi-tenant SaaS, securely isolating customer data, not only from other customers but also from your own developers, is a conversation that you'll have sooner or later. At a former startup, our customers' data was highly confidential, and we went to great lengths to protect it both from inadvertent exposure caused by software bugs and from internal access by employees unless absolutely necessary.

We strove for a hybrid pool/silo architecture. My favorite security strategy for achieving this is one that AWS promotes, known as the Token Vending Machine, which leverages IAM to isolate customer data.

Essentially, an authorized user (1) makes an API request through the API Gateway (2), which calls a custom authorizer to validate the credentials and generate a dynamic IAM policy (3). The dynamic IAM policy is passed to the handler function (4) and locks all further processing into a specific set of resources (5). The elegance of this solution is that it removes the burden of handling tenant security from the developers' hands and moves it down to the platform level. The threat of exposing tenant data, whether inadvertently or at the hands of a malicious developer, is almost completely mitigated.

The Problem

Google Cloud doesn't offer the same functionality out of the box:

Endpoints and API Gateway don't support custom authorizers.
Dynamically generated IAM policies aren't supported.

The proposed solutions you'll find on StackOverflow, Reddit and even GCP's own whitepapers all basically say the same thing: "Tenant security should be handled at the app level."

Yuck!

But after days of trial and error, I found a solution that gave us the highly secure tenant isolation we needed on Google Cloud!

The Solution

As before, the user in Tenant A (1) makes an authorized request to list the users in their tenant (2). The API Gateway passes that to the UsersEndpoint service (3), which has no inherent permission to access any database, so it passes the user's auth token to the TokenVendingMachine (4). The TokenVendingMachine validates the token and, based on the custom claims, retrieves the tenant's Service Account key file from our secure bucket (5) and returns it to the UsersEndpoint service. Finally, the UsersEndpoint service calls our database using the key file (6) and returns the results to the user.

Step 1: Onboarding

When a new tenant is created, a tenant-specific Service Account is created asynchronously and its JSON key file is stored in a highly secured bucket that holds the tenant key files.
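
As a rough illustration of the onboarding step, the sketch below creates a service account, requests a key and writes it to the key bucket. The `ServiceAccountAdmin` and `KeyBucket` interfaces, the `sa-` prefix and the `keys/<tenant>.json` object path are all hypothetical stand-ins, not the real Google Cloud client APIs or our actual naming scheme; in a real system they would wrap the IAM and Cloud Storage clients.

```typescript
// Hypothetical abstraction over the IAM API (not a real client library).
interface ServiceAccountAdmin {
  // Resolves to the new service account's email address.
  createServiceAccount(accountId: string): Promise<string>;
  // Resolves to the contents of the JSON key file.
  createKey(email: string): Promise<string>;
}

// Hypothetical abstraction over the secured key bucket.
interface KeyBucket {
  write(objectPath: string, contents: string): Promise<void>;
}

// Assumed naming scheme; the article does not specify one.
const keyPathFor = (tenantId: string): string => `keys/${tenantId}.json`;

async function onboardTenant(
  tenantId: string,
  iam: ServiceAccountAdmin,
  bucket: KeyBucket,
): Promise<string> {
  // One service account per tenant, so that access can be granted
  // to that tenant's resources only.
  const email = await iam.createServiceAccount(`sa-${tenantId}`);
  const keyJson = await iam.createKey(email);

  // Store the key file where the TokenVendingMachine can later read it.
  await bucket.write(keyPathFor(tenantId), keyJson);
  return email;
}
```

Only this onboarding path needs write access to the bucket, which keeps the write permission away from anything that serves user traffic.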

Step 2: Authentication

We use Identity Platform with multi-tenancy enabled to authenticate users. When a user logs in, they exchange their initial token for a custom token containing custom claims such as the user's tenant and role, and that custom token is sent with every subsequent request.

Those custom claims look something like this:

{
  tn: 'tn-xyz987', // Tenant ID
  rl: 'editor', // Role
  rg: 1, // Region
  ...
}

The claims identify the user's tenant, their role and the region that their data resides in.
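
A service that receives these claims would typically validate their shape before trusting them. The following is a minimal sketch, assuming the token's signature has already been verified and the payload decoded; the `TenantClaims` type and `parseTenantClaims` function are illustrative names, not part of Identity Platform.

```typescript
// Illustrative shape for the custom claims shown above.
interface TenantClaims {
  tenantId: string; // from `tn`
  role: string;     // from `rl`
  region: number;   // from `rg`
}

// Validates an already-verified, decoded token payload and maps the
// short claim names to descriptive ones. Throws if a claim is missing
// or has the wrong type.
function parseTenantClaims(payload: unknown): TenantClaims {
  if (typeof payload !== 'object' || payload === null) {
    throw new Error('Token payload is not an object');
  }
  const { tn, rl, rg } = payload as Record<string, unknown>;
  if (typeof tn !== 'string' || tn.length === 0) {
    throw new Error('Missing tenant claim (tn)');
  }
  if (typeof rl !== 'string' || rl.length === 0) {
    throw new Error('Missing role claim (rl)');
  }
  if (typeof rg !== 'number' || !isFinite(rg)) {
    throw new Error('Missing region claim (rg)');
  }
  return { tenantId: tn, role: rl, region: rg };
}

const claims = parseTenantClaims({ tn: 'tn-xyz987', rl: 'editor', rg: 1 });
console.log(claims.tenantId); // tn-xyz987
```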

Step 3: API Requests

When a user's authenticated request hits the API Gateway, it's sent to a Cloud Run service that runs our API. The database and storage buckets are abstracted behind like-named services and require a valid JSON key file in order to access any resource.

So if a user requests a list of users within their tenant, the API's code can be as simple as this pseudocode:

app.get('/users', (req: Request, res: Response) => {
  // Create a new instance of our TokenVendingMachine class
  const tvm = new TokenVendingMachine();

  // Request the key file using the user's auth token
  tvm.get(req.headers.authorization)
    .then(async (key: Credentials) => {
      // The tenant's database name has been embedded in the key
      const db = new Database(key);

      const rows = await db.query("SELECT ...");

      res.json(rows);
    })
    // Reject the request if the token is invalid or the query fails
    .catch(() => res.status(403).end());
});

Main Takeaway: The developers can write code as if this is a single-tenant environment!
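
For readers wondering what sits behind `tvm.get(...)`, here is a minimal sketch of the lookup logic. It is not our production code: `TokenVerifier` and `KeyStore` are hypothetical interfaces standing in for Identity Platform token verification and the Cloud Storage bucket, the constructor takes them as arguments for testability (the pseudocode above constructs the class without any), and the `keys/<tenant>.json` path is an assumed naming scheme.

```typescript
// Parsed service account key file; real key files carry more fields.
interface Credentials {
  client_email: string;
  private_key: string;
}

// Hypothetical: verifies the token's signature and returns its claims.
interface TokenVerifier {
  verify(token: string): Promise<{ tn?: unknown }>;
}

// Hypothetical: read-only view of the bucket holding tenant key files.
interface KeyStore {
  read(objectPath: string): Promise<string>;
}

class TokenVendingMachine {
  constructor(private verifier: TokenVerifier, private keys: KeyStore) {}

  async get(authorization: string | undefined): Promise<Credentials> {
    // Expect an "Authorization: Bearer <token>" header.
    const match = /^Bearer (.+)$/.exec(authorization ?? '');
    if (!match) throw new Error('Missing bearer token');

    // The tenant comes only from the verified token, never from
    // anything else the caller controls.
    const claims = await this.verifier.verify(match[1]);
    if (typeof claims.tn !== 'string' || !/^[\w-]+$/.test(claims.tn)) {
      throw new Error('Token has no valid tenant claim');
    }

    // Assumed object path; fetch and parse the tenant's key file.
    const json = await this.keys.read(`keys/${claims.tn}.json`);
    return JSON.parse(json) as Credentials;
  }
}
```

Because the tenant ID is taken only from the verified claims, a handler has no way to request another tenant's key by passing a different ID.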

I know what you're going to say...

Why not issue short-lived service account credentials?
Latency. Retrieving an existing key file from a GCS bucket is extremely fast compared to requesting new credentials on each request. Sure, you could cache those short-lived credentials, but then you have the new problem of storing them securely if your goal is total isolation.

Why not use Secret Manager to store the key files?
In a word, cost. At $0.03 per 10,000 operations, the costs will add up fast for an API.
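
To put that rate in perspective, here is a back-of-the-envelope calculation assuming one secret access per API request. The 100 million requests per month figure is a made-up volume for illustration, not our actual traffic.

```typescript
// Rate quoted above: $0.03 per 10,000 access operations.
const USD_PER_10K_OPERATIONS = 0.03;

// Monthly cost if every API request triggers one secret access.
function monthlySecretAccessCost(requestsPerMonth: number): number {
  return (requestsPerMonth / 10_000) * USD_PER_10K_OPERATIONS;
}

// Hypothetical volume: 100 million requests per month.
console.log(monthlySecretAccessCost(100_000_000).toFixed(2)); // 300.00
```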

Isn't a storage bucket full of key files dangerous?
Not if properly secured. The TokenVendingMachine service has read-only access to all objects in that bucket, and a separate service that generates the key file during onboarding has write access. There's also a backend service that regularly cycles the keys so that they don't live on in perpetuity.

Conclusion

What's important is that by separating tenant security from the app level, we achieve reliable, secure storage and access of our customers' data while removing the responsibility of tenant security from our developers' hands.