In today's article, we’ll cover performance from a different angle: The choices we made in our stack.
Usually we write about changes and features we release for AppSignal that are public on our changelog and here on the blog. But besides these public-facing features, we also spend a lot of time on making sure AppSignal can cope with the growth of traffic.
Because we are developers ourselves working on problems like this, we think we do a pretty good job at helping you as well with our APM (shameless plug 🤪). But today we use that experience to discuss our own stack. We will go over one of the bigger changes we made in the past few years ourselves to handle tends of billions of requests per month. We'll cover why we make that choice and the pros and cons of our approach.
AppSignal started out as a pretty standard Rails setup. We used a Rails app that collected data through an API endpoint which created Sidekiq jobs to process in the background.
After a while we replaced the Rails API with a Rack middleware to gain a bit of speed and later this was replaced with a Go web server that pushed Sidekiq compatible jobs to Redis.
While this setup worked well for a long time, we began to run into issues where the databases couldn’t keep up with the amount of queries run against them. At this point we were processing tens of billions of requests already. The main reason for this was that each Sidekiq process needed to get the entire app's state from the database in order to increment the correct counters and update the right documents.
We could alleviate this somewhat with local caching of data, but because of the round-robin nature of our setup it still meant that each server needed to have a full cache of all data, because we couldn’t be sure on what server the payload would end up. We realised that with the data growth we were experiencing this setup would become impossible in the future.
In search for a better way to handle the data we settled on using Kafka as the data processing pipeline. Instead of aggregating metrics in the database, we now aggregate the metrics in Kafka processors. Our goal is that our Kafka pipeline never queries the database until the aggregated data has to be flushed. This drives the amount of queries per payload down from up to ten reads and writes to just one write at the end of the pipeline.
We specify a key for each Kafka message and Kafka guarantees that the same keys end up on the same partition, that's consumed by the same server. We use the app's ID as a key for messages, this means that instead of having a cache for all customers on the server, we only have to cache data for the apps a server receives from Kafka, not all apps.
Kafka is a great system and we’ve migrated over in the past two years. Right now almost all processing is done in Rust through Kafka, but there are still things that are easier done in Ruby, such as sending Notifications and other database-heavy tasks. This meant that we needed some way to get data from Kafka to our Rails stack.
When we began this transition there were a couple Kafka Ruby gems, but none worked with the latest (at the time 0.10.x) release of Kafka and most were unmaintained.
We looked at writing our own gem (which we eventually did). We will write more about that in a different article. But having a nice driver is only part of the requirements. We also needed a system to consume the data and execute the tasks in Ruby and spawn new workers when old ones crash.
Eventually we came up with a different solution. Our Kafka stack is built in Rust and we wrote a small binary that consumes a
sidekiq_out topic and creates Sidekiq compatible jobs in Redis. This way we could deploy this binary on our worker machines and it would feed new jobs into Sidekiq just as you would do within Rails itself.
The binary has a few options such as limiting the amount of data in Redis to stop consuming the Kafka topic until the threshold is cleared. This way all the data from Kafka won’t end up in Redis' memory on the workers if there is a backlog.
From Ruby’s point of view, there is no difference at all between jobs generated in Rails and those that come from Kafka. It allows us to prototype new workers that get data from Kafka and process it in Rails–to send notifications and update the database–without having to know anything about Kafka.
It made the migration to Kafka easier as we could switch over to Kafka and back without having to deploy new Ruby code. It also made testing super easy as you could easily generate jobs in the test suite to be consumed by Ruby without having to setup an entire Kafka stack locally.
We use Protobuf to define all our (internal) messages, this way we can be pretty sure that if the test passes, the worker will correctly process jobs from Kafka.
In the end this solution saved us a lot of time and energy and made life a lot simpler for our Ruby team.
As with everything there are a few pros and cons for this setup:
- No changes in Ruby required, API compatible
- Easy to deploy and revert
- Easy to switch between Kafka and Ruby
- Redis isn’t overloaded by messages when using the limiter, saves memory on the server, keeping the messages in Kafka instead.
- Horizontal scaling leads to smaller caches on each server, because of the keyed messages.
- Still has the issue that each Sidekiq thread needs access to a cache of all data for the apps from the partitions the server consumes. (e.g. Memcache).
- Separate process running on the server
- The rust processor commits the message offset when the message is flushed to Redis, this means that it’s guaranteed to be in Redis, but there’s no guarantee the message is processed by Ruby, this means that in case of a server crash, there is a chance some messages that were in Redis, but not processed are not processed.
Using Sidekiq helped us tremendously while migrating our processing pipeline to Kafka. We've now almost completely moved away from Sidekiq and handling everything via our Kafka driver directly, but that's for another article.
This is it for today. We hope you enjoyed this perspective on performance and scaling, and our experience scaling AppSignal. And follow us to keep an eye on when the next episode about Kafka is published.