Nicholas Frush

Authn/Authz in a distributed, microservice architecture

Introduction

When handling user authentication today, one must consider a myriad of problems that are inherently unique to a distributed system:

  • Cache stampede (e.g. the “Thundering Herd” problem)
  • Place of evaluation
  • Duration of access

Most systems today have opted for an access/refresh token paradigm to keep access tokens short-lived - and thus limit the window of exploitation if one is leaked. Moreover, these tokens are kept stateless so that the backend services remain scalable and authentication/authorization can be pushed down to the service level, instead of a monolithic gateway.

The standard authentication flow


Within a standard authentication flow, a user exchanges their credentials for a short-lived access token (< 15 minutes) and a long-lived refresh token. The refresh token’s sole purpose is to let the client refresh an expired (or expiring) access token. Meanwhile, the access token’s sole job is to present proof of identity to the service - i.e. this is who I am.
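For concreteness, here is a minimal sketch (in Go, using the github.com/golang-jwt/jwt/v5 library) of issuing such a token pair. The claim layout, lifetimes, and HMAC signing are illustrative assumptions, not a prescription.

```go
package auth

import (
	"time"

	"github.com/golang-jwt/jwt/v5"
)

// signingSecret is the HMAC secret that never leaves our backend.
var signingSecret = []byte("replace-with-a-real-secret")

// mintToken signs a JWT for the given user with the given lifetime.
func mintToken(userID string, ttl time.Duration) (string, error) {
	claims := jwt.RegisteredClaims{
		Subject:   userID, // the authenticated user's id
		IssuedAt:  jwt.NewNumericDate(time.Now()),
		ExpiresAt: jwt.NewNumericDate(time.Now().Add(ttl)),
	}
	return jwt.NewWithClaims(jwt.SigningMethodHS256, claims).SignedString(signingSecret)
}

// IssueTokenPair returns a short-lived access token and a long-lived refresh token.
func IssueTokenPair(userID string) (access, refresh string, err error) {
	if access, err = mintToken(userID, 15*time.Minute); err != nil {
		return "", "", err
	}
	if refresh, err = mintToken(userID, 30*24*time.Hour); err != nil {
		return "", "", err
	}
	return access, refresh, nil
}
```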

In this flow, we tend to provide the access and refresh tokens as cookies (with the Secure and HttpOnly flags set); this prevents arbitrary JavaScript run by an attacker during an XSS attack from reading the cookies. In more advanced scenarios, we may also provide the same values as response headers - such as X-Auth-Token and X-Refresh-Token - and follow something like the double submit cookie pattern, which guards against CSRF. In this pattern, a client making a request is expected to provide both the request headers (i.e. X-Auth-Token and X-Refresh-Token, or Authorization and X-Refresh-Token) and the cookies. Backend middleware will then verify that the two values match (an attacker cannot forge a signed JWT, since the signing secret never leaves our backend) and then verify the validity of the provided JWT (i.e. has this expired, was it signed by us).
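Continuing the sketch above, the issuance side of this pattern might look like the following; the cookie and header names (access_token, X-Auth-Token, etc.) are just illustrative choices.

```go
package auth

import "net/http"

// setAuthCookies attaches the tokens as Secure, HttpOnly cookies and mirrors
// them in response headers so the client can replay them per the double
// submit cookie pattern.
func setAuthCookies(w http.ResponseWriter, access, refresh string) {
	http.SetCookie(w, &http.Cookie{
		Name:     "access_token",
		Value:    access,
		Path:     "/",
		Secure:   true,
		HttpOnly: true,
		SameSite: http.SameSiteStrictMode,
	})
	http.SetCookie(w, &http.Cookie{
		Name:     "refresh_token",
		Value:    refresh,
		Path:     "/",
		Secure:   true,
		HttpOnly: true,
		SameSite: http.SameSiteStrictMode,
	})
	// Mirror the values as headers; since the cookies are HttpOnly, this is how
	// the client learns the values it must send back as request headers.
	w.Header().Set("X-Auth-Token", access)
	w.Header().Set("X-Refresh-Token", refresh)
}
```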

Using our access token


As explained previously, during normal use of the access token, server-side middleware is provided that looks for the particular headers (e.g. Authorization, X-Auth-Token, X-Refresh-Token, Cookie, etc.) through which a client provides their access (and refresh) token. This middleware will often verify that the provided JWT has not expired and was signed by your backend services. After successful verification, the decoded claims of the JWT (which should include the user’s id, typically as the subject claim) are added to the context of the request, so the ultimate handling function can simply check for the presence of proof of authentication. This decouples verifying and decoding a JWT from the business logic in each handler.
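Below is a hedged sketch of what such middleware could look like, continuing the earlier Go example. The header/cookie names, the context key, and the use of jwt.RegisteredClaims are assumptions for illustration.

```go
package auth

import (
	"context"
	"net/http"

	"github.com/golang-jwt/jwt/v5"
)

type ctxKey string

// ClaimsKey is the request-context key handlers use to read the verified claims.
const ClaimsKey ctxKey = "authClaims"

// RequireAuth verifies the double-submit pair and the JWT itself, then places
// the decoded claims on the request context for downstream handlers.
func RequireAuth(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		header := r.Header.Get("X-Auth-Token")
		cookie, err := r.Cookie("access_token")
		if err != nil || header == "" || header != cookie.Value {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}

		claims := &jwt.RegisteredClaims{}
		token, err := jwt.ParseWithClaims(header, claims, func(t *jwt.Token) (any, error) {
			return signingSecret, nil // reject anything not signed by us
		}, jwt.WithValidMethods([]string{"HS256"}))
		if err != nil || !token.Valid {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}

		// Handlers can now check for proof of authentication without decoding JWTs.
		ctx := context.WithValue(r.Context(), ClaimsKey, claims)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}
```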

When a function needs to assess the roles or permissions of a user, it will often reach out to an authorization service that - given a particular user id - can provide the roles and permissions of that user. Often, these roles and permissions take the form of scopes. In my mind, scopes come in two flavors - atomic and aggregated. An atomic scope is a permission that is fine-grained and particular, such as “edit user account information”. An aggregated scope encompasses atomic scopes in some way to form a larger, logical piece - what would often be referred to as a role. For example, an aggregated scope of admin may encompass atomic scopes such as “disable user accounts”, “edit organization information”, etc. In practice, I like my aggregated scopes to take the form “${namespace}.${role}/${restriction}” - for example, org.admin/:orgId. For atomic scopes, I like to follow “${namespace}:${action}:${target}”, such as “user:edit:account”.
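As an illustration, a scope check under this convention might look roughly like the following. The roleGrants mapping is hypothetical - in practice it would come from the authorization service - and the sketch ignores the /${restriction} suffix beyond stripping it.

```go
package auth

import "strings"

// roleGrants maps aggregated scopes (roles) to the atomic scopes they imply.
// Illustrative only; a real system would fetch this from the authorization service.
var roleGrants = map[string][]string{
	"org.admin": {"user:edit:account", "user:disable:account", "org:edit:info"},
}

// HasScope reports whether the user's scopes grant the required atomic scope,
// either directly or via an aggregated scope such as "org.admin/:orgId".
func HasScope(userScopes []string, required string) bool {
	for _, s := range userScopes {
		if s == required {
			return true
		}
		// Strip the "/${restriction}" suffix from aggregated scopes before lookup.
		role, _, _ := strings.Cut(s, "/")
		for _, granted := range roleGrants[role] {
			if granted == required {
				return true
			}
		}
	}
	return false
}
```

For example, HasScope([]string{"org.admin/:orgId"}, "user:edit:account") would return true under this mapping.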

To further decouple your handler from having to assess the particular roles/permissions of a user, you may choose to offload that evaluation to a sidecar running alongside your backend service called Open Policy Agent. Open Policy Agent (OPA for short) is a policy engine designed to do one thing well: evaluate authored policies against some input in a very quick, performant manner. It’s often used today - by many FAANG companies, open-source and commercial databases, etc. - to assess whether an input describing a user’s roles/permissions/context allows them to perform some set of actions. This is often constructed as an “allow/deny” pattern.
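A backend service could query the OPA sidecar over its Data API roughly like this. The policy path (httpapi/authz/allow), the input shape, and the sidecar address are assumptions about how you have authored and deployed your Rego policy.

```go
package auth

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// opaURL points at the OPA sidecar's Data API for an assumed policy package
// (e.g. a Rego package named httpapi.authz exposing an `allow` rule).
const opaURL = "http://localhost:8181/v1/data/httpapi/authz/allow"

// AuthzInput is the document OPA evaluates: who is acting, with what scopes,
// against which action and resource.
type AuthzInput struct {
	UserID   string   `json:"user_id"`
	Scopes   []string `json:"scopes"`
	Action   string   `json:"action"`
	Resource string   `json:"resource"`
}

// IsAllowed asks the OPA sidecar whether the input satisfies the allow rule.
func IsAllowed(ctx context.Context, in AuthzInput) (bool, error) {
	body, err := json.Marshal(map[string]any{"input": in})
	if err != nil {
		return false, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, opaURL, bytes.NewReader(body))
	if err != nil {
		return false, err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var out struct {
		Result bool `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return false, fmt.Errorf("decoding OPA response: %w", err)
	}
	return out.Result, nil
}
```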

When our access token expires


Often, when an access token expires, you expect the client to refresh the token itself. This can be somewhat problematic when a client is in the middle of a transaction or a series of REST calls that need to happen in succession without loss of access.

In these scenarios, it can be beneficial to provide users with a renewed access and refresh token when our backend services receive an expired access token alongside a valid refresh token. However, if a client is making 100 REST calls in parallel, we don’t want to refresh this token 100 times. This is a variant of the cache stampede or “Thundering Herd” problem, where you don’t want a particular cache miss (or, in our case, an expired access token) to cause an exponential increase in unnecessary work (e.g. refreshing the cache, renewing the access token).

In this case, the first caller will acquire a lock in the database specifying that a request is in progress to refresh the token. Of course, before requesting this lock, the first caller may check whether the “cache” already holds a recently renewed access token for the user (thus skipping this process entirely). If we do need to refresh the token, this lock is used by subsequent callers to stop and wait, preventing a flurry of access token refresh requests. These requests enter a loop (with some known upper bound and backoff) that continually checks the cache for the presence of the lock or a recently renewed access token. In the off chance our process has failed and neither is present, we begin this process anew and the next caller grabs the lock and begins the token refresh process. Otherwise, once the lock is removed and the access token is made available in the cache, all waiting requests are expected to return the same renewed access token.
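One way to sketch this single-flight refresh is with a shared cache such as Redis, using SETNX as the lock. The key names, TTLs, and retry bounds below are illustrative, and the renew callback stands in for the actual call that exchanges the refresh token with your auth service; a production system would also tag the lock with an owner token before deleting it.

```go
package auth

import (
	"context"
	"errors"
	"time"

	"github.com/redis/go-redis/v9"
)

var rdb = redis.NewClient(&redis.Options{Addr: "localhost:6379"})

// ErrRefreshTimeout is returned when no caller manages to refresh in time.
var ErrRefreshTimeout = errors.New("timed out waiting for token refresh")

// RefreshOnce ensures that, across many parallel requests, only one caller
// actually renews the user's access token; everyone else waits on the cache.
func RefreshOnce(ctx context.Context, userID string, renew func(context.Context) (string, error)) (string, error) {
	cacheKey := "renewed_access:" + userID
	lockKey := "refresh_lock:" + userID

	for attempt := 0; attempt < 10; attempt++ {
		// 1. Someone may already have refreshed: check the cache first.
		if token, err := rdb.Get(ctx, cacheKey).Result(); err == nil {
			return token, nil
		} else if !errors.Is(err, redis.Nil) {
			return "", err
		}

		// 2. Try to become the single refresher. SETNX only succeeds for one caller.
		gotLock, err := rdb.SetNX(ctx, lockKey, "1", 30*time.Second).Result()
		if err != nil {
			return "", err
		}
		if gotLock {
			access, err := renew(ctx) // the actual refresh-token exchange
			if err != nil {
				rdb.Del(ctx, lockKey) // let the next caller retry
				return "", err
			}
			// Publish the renewed token for the waiting callers, then release the lock.
			if err := rdb.Set(ctx, cacheKey, access, time.Minute).Err(); err != nil {
				return "", err
			}
			rdb.Del(ctx, lockKey)
			return access, nil
		}

		// 3. Someone else holds the lock: back off and re-check the cache.
		time.Sleep(time.Duration(attempt+1) * 100 * time.Millisecond)
	}
	return "", ErrRefreshTimeout
}
```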

If you want me to elaborate on anything I’ve talked about here, feel free to drop me a comment or DM (or go find me on LinkedIn).
