edA‑qa mort‑ora‑y

Posted on Apr 15, 2021 • Originally published at mortoray.com

System Architecture for Edaqa's Room

#webdev #architecture #gamedev #cloud

I tried explaining to a friend how my games were set up, but it became confusing quickly. Drawing all the component boxes, I was surprised to see how complex it has become. I think it’s a decent example of modern system architecture, and I’ll go through the setup here. This is for a multiplayer game, so I’ll point out how this might differ from a more typical web application.

I could reasonably call this architecture the platform on which my game runs. A higher level of code runs on top of, but is intimately tied to, this platform.

Client

I like to start at the user’s view on the system, as it keeps me grounded in the system's purpose. Mostly the user interacts via the website, but I also send email confirmation on purchase. The starting point to the game could be via the immediate web link, or the link in the email.

I was tempted to split the client into a game and website proper, as they are fairly distinct aspects of the system. But the discussion of the website’s logical structure is better left for another article.

Note the two lines from the browser to the HTTP server. One is normal HTTP traffic, and the other is for WebSocket. Though they go through the same machines, they are handled differently. I’ll provide more detail later, but the way I handle WebSocket is specific to a multiplayer game — a need for a fast response motivates the design.

In terms of fault tolerance, it’s the client which is most likely to fail. From browser incompatibility to crashes, and slow or lost connections, the client is an endless pool of problems. The servers are virtually faultless by comparison. As this is an interactive multiplayer game, it’s vital to handle common client problems correctly. The higher-level code handles most of the faults, with this architecture supporting it.

Cloud Processing Services

The three red boxes contain the abstract aspects of the cloud service. These services are mainly configurations and I have no insight into their internal structure. They contain only transient data.

Content Delivery Network (CDN): The CDN serves all the static assets of the website and the game. Most of these resources use the web server as the origin, as it gives me the cleanest control over versions. The CDN provides faster loading to the client and reduces load on the host machines. I could do an entire article on the challenges of getting this working. (Service: AWS CloudFront)
HTTP Frontend: This takes care of the incoming connections, as well as SSL handling. It provides, when needed, a slow rollout when upgrading the hosts. It’s a security barrier between the public world and my private hosts. Thankfully, it routes both normal HTTP and WebSocket traffic. (Service: AWS Elastic Load Balancer)
Email Sender: Sends purchase confirmation emails to the user. I mentioned the client layer is fault prone, and email is no exception. You absolutely want a third-party service handling the challenging requirements of modern email. (Service: AWS Simple Email Service)

Host

My host contains several microservices, which I’m grouping into a large block. With Python as the main server language, I was forced into the microservice architecture. Separate processes are the only way I can get stability and parallel processing of these services.

These are all launched as systemd services on an AWS Linux image.

Web Server: Handles all web requests, including static files, templates, game launchers, and APIs. These requests are stateless. (Service: Python Code with Eventlet and Flask)
Game Server: Implements the game message queues, which are shared message rooms per game (think of it as a chat server with channels). This is stateful per game. It handles client connections and transmits messages but does not understand the logical game state. For fault tolerance, it was vital that misbehaving clients don’t interfere with other games. (Service: Python Code with Asyncio and Websockets)
Message Service: Migrates game messages from the live database to the long-term database store. This happens regularly to minimize the memory use of the live database, allowing more games to live on one host. (Service: Python Code)
Confirm Service: Sends emails when somebody purchases a game. I avoid doing any external processing in the web server itself, instead having it post a job that is handled by this service. This keeps the web server responsive and stable. (Service: Python Code)
Stats Service: This is a relatively fresh addition, needed for my affiliate program. I previously calculated game stats offline for analysis, but am working on features to present those at the end of the game. There is a bit of ping-pong with the web server to get this working. It runs outside the web server because it has slow DB queries and slow processing. It operates sequentially, as I do not want multiple stats jobs running in parallel. (Service: Python Code)
Live Database: Contains game state for all games on this host. The game uses a sequenced message queue. For a synchronized visual response between players, it is vital this service is fast. Therefore I use a local Redis store to keep live messages, with the message service moving them offline. (Service: Redis)
Message Queue: Provides the message queue for these services to talk to each other. This is per-host because a few of the services need access to the Live Data for a game. The Confirm service does not need live data, and I could orchestrate the stats service to not need it either. However, having an additional shared message queue is unnecessary overhead. (Service: Redis)
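
To make the "chat server with channels" idea concrete, here is a minimal sketch of per-game rooms using only the standard library. It is not the game's actual server code: the class name RoomHub, its methods, and the queue size are assumptions for illustration, and plain asyncio queues stand in for real WebSocket connections. The hub relays opaque messages without interpreting them, and a client that stops reading is dropped so it cannot stall anyone else.

```python
import asyncio
from collections import defaultdict


class RoomHub:
    """Hypothetical in-process stand-in for per-game message rooms."""

    def __init__(self):
        # game_id -> set of per-client outbound queues
        self._rooms = defaultdict(set)

    def join(self, game_id, maxsize=100):
        """Register a client in a game's room and return its outbound queue."""
        queue = asyncio.Queue(maxsize=maxsize)
        self._rooms[game_id].add(queue)
        return queue

    def leave(self, game_id, queue):
        """Remove a client; forget the room once it is empty."""
        self._rooms[game_id].discard(queue)
        if not self._rooms[game_id]:
            del self._rooms[game_id]

    def broadcast(self, game_id, message):
        """Relay an opaque message to one game's clients; return deliveries."""
        delivered = 0
        # Iterate over a copy because leave() may mutate the room.
        for queue in list(self._rooms.get(game_id, ())):
            try:
                queue.put_nowait(message)
                delivered += 1
            except asyncio.QueueFull:
                # A client that stopped reading is dropped rather than
                # being allowed to hold up the rest of the room.
                self.leave(game_id, queue)
        return delivered
```

Because each room is keyed by game, a broadcast for one game never touches the queues of another, which is the isolation property the Game Server entry asks for.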

The diagram draws the Live Database and Message Queue boxes as siblings, since the same process implements both. This is another point where the needs of the game dictate this local Redis server. Most web apps can probably use an off-host queue and an external DB service. When you look at my alternate design later, you’ll see I’d be happy to have this part even faster.
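
As a rough illustration of the sweep the Message Service performs, the sketch below uses plain Python dictionaries in place of both Redis and the long-term database. The names LiveStore, LongTermStore, sweep, and load_all are invented for this example and are not the game's actual code. Loading a game reads the archived messages first and then the live ones, so the sequence order is preserved.

```python
from dataclasses import dataclass, field


@dataclass
class LiveStore:
    """Stand-in for the per-host live store: game_id -> pending messages."""
    pending: dict = field(default_factory=dict)

    def append(self, game_id, message):
        self.pending.setdefault(game_id, []).append(message)


@dataclass
class LongTermStore:
    """Stand-in for the long-term database: game_id -> archived messages."""
    archived: dict = field(default_factory=dict)


def sweep(live, long_term):
    """Move all pending messages to long-term storage; return the count moved."""
    moved = 0
    for game_id in list(live.pending):
        batch = list(live.pending[game_id])
        # Archive first, then clear the live copy, so a failure between
        # the two steps leaves a duplicate rather than a lost message.
        long_term.archived.setdefault(game_id, []).extend(batch)
        del live.pending[game_id]
        moved += len(batch)
    return moved


def load_all(game_id, live, long_term):
    """Return a game's full message history: archived first, then live."""
    return list(long_term.archived.get(game_id, [])) + list(
        live.pending.get(game_id, [])
    )
```

Running sweep on a timer keeps the live store small, which is what lets more games fit on one host.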

I estimate a host can handle at least 100 concurrent games, around 400 users, and I dream about the day when I need many hosts. I can also add region-specific hosts, providing faster turnaround for groups playing in other countries.

WebSocket

The diagram shows two different connections between the client and the HTTP Frontend, which continue to the backend.

The black HTTP connection is stateless, and it doesn’t matter which host it ends up at. Ultimately, when my dreams of high load come to fruition, I’d separate this, putting it on a different host pool, or potentially recreate it as lambda functions.

The orange WebSocket connection is stateful and must always arrive at the same machine. This is sticky per game; all players of the same game must reach the same machine. This must be done as a single host to minimize turnaround time. Shared non-local queues, lambda functions, and DBs all introduce too much response lag. This is particular to a multiplayer game.
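
The article does not say how a game is pinned to a host, so the following is only one possible approach, not the author's method or a load balancer feature: derive the host from a stable hash of the game ID. The function name host_for_game and the host names are hypothetical.

```python
import hashlib


def host_for_game(game_id, hosts):
    """Pick the same host for every player of a game via a stable hash."""
    if not hosts:
        raise ValueError("at least one host is required")
    # hashlib gives the same digest in every process, unlike the built-in
    # hash(), which is randomized per interpreter run for strings.
    digest = hashlib.sha256(game_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(hosts)
    return hosts[index]


# Hypothetical host names, for illustration only.
HOSTS = ["host-a.example", "host-b.example", "host-c.example"]
chosen = host_for_game("game-42", HOSTS)
```

A plain modulo like this remaps most games whenever the host list changes, so a real deployment would more likely record the assignment when the game is created or use consistent hashing.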

Alternate Game Server Design

Again, I’m kind of forced into the above architecture because of Python. Should I ever need more performance, or wish to reduce hardware needs, I’d reimplement this, likely choosing C++, though any compiled, statically typed language with good threading and async IO would work.

The new server would be a single application replacing these services:

game server: Depending on the language and framework, this socket handling code could look very different. Much of the speed improvement though would come simply from better data parsing and encoding.
message service: I’d gain more control over when this runs and have an easier time reloading messages for clients.
stats service: I would make this a lot simpler since it wouldn’t need as much cross-process coordination to work.
live database: Simple in-memory collections replace the Redis DB, providing faster turnaround, but complicating persistence and fault management.
message queue: The remaining job messages would migrate to a shared queue, like SQS.
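
To show what "simple in-memory collections" could look like for a sequenced message queue, here is a small sketch. It is written in Python only to match the rest of the platform; the alternate design itself would use a compiled language. The class GameLog and its methods are invented for this example. Sequence numbers are 1-based positions in a list, which makes it cheap to hand a reconnecting client everything it missed.

```python
from collections import defaultdict


class GameLog:
    """Hypothetical in-memory sequenced message log for one game."""

    def __init__(self):
        self._messages = []

    def append(self, payload):
        """Store a message and return its 1-based sequence number."""
        self._messages.append(payload)
        return len(self._messages)

    def since(self, last_seen):
        """Return (sequence, payload) pairs newer than last_seen."""
        return list(enumerate(self._messages[last_seen:], start=last_seen + 1))


# One log per game, created on first use.
logs = defaultdict(GameLog)
logs["game-1"].append("a")
logs["game-1"].append("b")
missed = logs["game-1"].since(1)
```

Nothing here survives a process restart, which is the persistence and fault-management cost the list above mentions.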

This alternate architecture is simpler, at least to me, and I estimate it could easily handle 100x as many games on a single host. Or rather, it’d let me handle as many games as now, but with several much smaller hosts. That would improve fault tolerance.

Added coding time keeps this on the long-term backlog. Unless some hitherto unknown feature appears where I need this, it’ll be cheaper to keep the microservices model and spin up more hosts as required.

An intermediate solution is to code only the WebSocket channels in another language, since they’re the most inefficient part. However, I recently reprogrammed this part, still in Python, to be massively more efficient, so further rewrites are on the long-term backlog.

Storage

The storage boxes contain all the long-term data for my game. There are no game assets here; I store them on the host where I upload each game. This provides the easiest way to manage game versions.

Media Store: Holds large static assets which aren’t part of the game proper, such as trailers and marketing materials. I synchronize this on-demand with a local work computer. (Service: AWS S3)
Log Store: Collects and stores the logs from the HTTP Frontend. I analyze these offline regularly. (Service: AWS S3)
Database: This is the heart of my business data, storing purchase information and persisting long-term game state. (Service: Mongo)

What’s Missing

I’ve left several components out of the diagram to focus on the core experience. I’ll describe them briefly here.

I don’t show monitoring, partially because it’s incomplete, but also because it’s merely a line from every box to a monitoring agent. The structure doesn’t change for monitoring, but it’s part of the live environment.

I’ve left DNS out of the diagram for simplicity. I use multiple endpoints for the client, the web server and the CDN, as well as for email, which adds up to many DNS entries. In AWS one has Route 53, but the individual services can thankfully configure and maintain most of their entries automatically.

I have many offline scripts that access the database and the log store. This includes accounting scripts which calculate cross-currency payments and affiliate payouts — world sales with tax are a nightmare! I also do analysis of game records to help me design future games.

There’s an additional system used to manage the mailing list. As the sign-up form is part of the website, and people can follow links from the emails to the website, it is a legitimate part of the architecture.

Layers upon layers

I’m tempted to call this the hardware architecture, but with cloud services, everything is logical. It’s a definite layer in my system. Can I call it the “DevOps Layer”?

The website on top of this is fairly standard, but the game is not. I will come back and do some articles about how the game functions. I can also show how the system architecture and game architecture work together.

Other than a few game-specific parts, the architecture is fairly standard for an internet application. I believe this is a good approach for what I needed.

Top comments (4)

Nested Software • Apr 17 '21 • Edited

My understanding is that the message service finds old/stale data in the live database and uses redis to move this over to the long-term storage database - i.e. if a game is being actively played, the data is not moved. If a user returns after a period of time to continue a game, is the idea that the game server issues a message on the redis queue to retrieve the data from the long-term storage database back into the live database?

edA‑qa mort‑ora‑y • Apr 17 '21

Messages for live games are moved to the long-term storage as well, clearing out the Redis store on each sweep. When a user connects to the game it will load both the messages from the long-term storage as well as the live database.

The engine is event source based, so no "current" state is ever stored. This will need to be changed long-term if I wish to support longer games.

That said, there is a minimal state in the Redis DB that needs to be restored if the game has been completely purged (which happens infrequently). This is no more than the count of messages and total count of players.

Nested Software • Apr 17 '21

Does that mean the game server collects/replays the existing messages when resuming a game to send the appropriate initial state to the client?

edA‑qa mort‑ora‑y • Apr 17 '21

Yes. The client starts a "reset" state for the game and replays all the messages to get back to where it was.

Even for 6-7 players over an hour or two this only adds up to 6-7k messages.
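
The replay described in this thread can be sketched as a fold over the message history. The message format, the state fields, and the function names below are all hypothetical, since the real game messages are not shown here; the point is only that the state is rebuilt from a reset state and is never stored directly.

```python
from functools import reduce


def initial_state():
    """The hypothetical 'reset' state a client starts from."""
    return {"players": [], "moves": 0}


def apply_message(state, message):
    """Return the state that results from applying one message."""
    kind = message["type"]
    if kind == "join":
        return {**state, "players": state["players"] + [message["player"]]}
    if kind == "move":
        return {**state, "moves": state["moves"] + 1}
    # Unknown message types leave the state unchanged in this sketch.
    return state


def replay(messages):
    """Rebuild the current state by applying every message in order."""
    return reduce(apply_message, messages, initial_state())


history = [
    {"type": "join", "player": "ana"},
    {"type": "join", "player": "ben"},
    {"type": "move"},
]
state = replay(history)
```

With a few thousand small messages per game, a linear replay like this stays cheap, which fits the numbers given above.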
