This blog post is adapted from a talk given by Ali Hamidi at Data Council SF '19 titled "Operating Multi-Tenant Kafka Services for Developers on Heroku."
Thousands of developers use Heroku’s Apache Kafka service to process millions of transactions on our platform—and many of them do so through our multi-tenant Kafka service. Operating Kafka clusters at this scale requires careful planning to ensure capacity and uptime across a wide range of customer use cases. With significant automation and test suites, we’re able to do this without a massive operations team.
In this post, we're going to talk about how we've architected our infrastructure to be secure, reliable, and efficient for your needs, even as your events are processed in a multi-tenant environment.
Kafka is a distributed streaming platform, with "producers" generating messages and "consumers" reading those messages. The message store is written to disk and highly available. Instances that run Kafka are called brokers, and multiple brokers can serve as replicas for your data. If a single broker fails, the replicas are promoted, enabling continued operations with no downtime. Each Kafka cluster comes with a fully-managed ZooKeeper cluster that handles the configuration and synchronization of these different services, including access rights, health checks, and partition management.
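To make the replica-promotion idea concrete, here is a deliberately simplified sketch in Python. Real Kafka elects leaders through the cluster controller and ZooKeeper; the broker IDs and the in-order promotion rule below are illustrative assumptions, not Kafka's actual election logic:

```python
# Toy model of one partition's replica set. The first broker in the
# list acts as leader; if it fails, the next in-sync replica takes over.
class Partition:
    def __init__(self, replicas):
        self.replicas = list(replicas)  # broker IDs, leader first

    @property
    def leader(self):
        return self.replicas[0]

    def broker_failed(self, broker_id):
        # Drop the failed broker; the next replica is promoted to leader.
        self.replicas = [b for b in self.replicas if b != broker_id]

p = Partition(replicas=[1, 2, 3])
p.broker_failed(1)   # leader (broker 1) goes offline
print(p.leader)      # 2
```

With three replicas, losing a single broker leaves two copies of the data, which is why operations can continue without downtime.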
Kafka is the key enabling technology in a number of data-heavy use cases. Some customers use Kafka to ingest a large amount of data from disparate sources. Some use Kafka to build event-driven architectures to process, aggregate, and act on data in real-time. And others use Kafka to migrate from a monolith to microservices, where it functions as an intermediary to process events while extracting out embedded functionality into smaller services during the migration process. (You can read all about how SHIFT Commerce did this on our blog.)
Heroku's least expensive single-tenant plan comes with three brokers. Everything within this cluster belongs to you, and it's a good option if performance and isolation are critical to your application.
The trade-off, of course, is that these plans tend to be about fifteen times more expensive than our least expensive multi-tenant plan. The multi-tenant plans are better suited for development and testing environments, or even for applications that simply don't need the immense power of a dedicated cluster.
Our multi-tenant plans have feature and operational parity with our single-tenant offerings: the same number of brokers, the same number of ZooKeeper nodes, and exactly the same behavior. While it would have been more cost-effective for us to provide just a single-broker/single-ZooKeeper option, our users would then be developing against something that isn't representative of real life. Instead, we give you access to something that looks like a real production cluster at a fraction of the cost. For example, if your code only runs against a single node, you have no opportunity to anticipate and build for common failure scenarios. When you're ready to upgrade from a multi-tenant Kafka service to a single-tenant setup, your application is already prepared for it.
We're also committed to ensuring that our multi-tenant options are still secure and performant no matter how many customers are sharing a single cluster. The rest of this post goes into more details on how we've accomplished that.
Even though many different customers can be situated on a single multi-tenant Kafka cluster, our top priority is to ensure that no one can access anyone else's data. We do this primarily in two ways.
First, Kafka has support for access control lists (ACLs), which are enforced at the cluster level. These allow us to specify which users can perform which actions on which topics. Second, we namespace our resources on a per-tenant basis. When you create a new Kafka add-on, we automatically generate a name, such as wabash-58799, and associate it with your account. Thereafter, any activity on that resource is accessible only by your application. This provides another level of security that is essentially unique to Heroku.
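One way to picture per-tenant namespacing is as a prefix applied to every resource name, so topics from different tenants can never collide. The helper below is a hypothetical sketch, not Heroku's actual implementation:

```python
def namespaced(tenant, topic):
    """Prefix a tenant's generated name (e.g. wabash-58799) onto a
    topic name so that resources are isolated per tenant."""
    return f"{tenant}.{topic}"

print(namespaced("wabash-58799", "clickstream"))  # wabash-58799.clickstream
```

Combined with ACLs that only grant a tenant access to its own prefix, this ensures that one customer's topics are invisible to every other customer on the cluster.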
A single tenant should also not disturb any of the other tenants on a cluster, particularly when it comes to performance. Your usage of Kafka should not degrade because another Heroku user is processing an immense number of events. To mitigate this, Kafka supports quotas for producers and consumers, and we enforce limits on the number of bytes per second each user can write or read. This way, no matter how many events are available to act upon, every user on the cluster gets a fair share of the computing resources.
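Conceptually, a bytes-per-second quota behaves like a token bucket. The sketch below is an illustrative model, not Kafka's implementation — Kafka enforces quotas by delaying responses rather than rejecting requests, and the class name and rate here are assumptions:

```python
import time

class ByteQuota:
    """Simplified token bucket modeling a bytes-per-second quota."""
    def __init__(self, bytes_per_sec):
        self.rate = bytes_per_sec
        self.tokens = bytes_per_sec       # start with a full bucket
        self.last = time.monotonic()

    def try_send(self, nbytes):
        # Refill tokens in proportion to elapsed time, capped at one
        # second's worth of budget.
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # over quota: the caller must back off

q = ByteQuota(bytes_per_sec=1_000_000)
print(q.try_send(500_000))  # True
print(q.try_send(600_000))  # False: budget temporarily exhausted
```

Because each tenant draws from its own bucket, a noisy neighbor exhausts only its own budget and cannot starve anyone else.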
When you buy Kafka from Heroku, we immediately provision the full set of resources that you purchased; we never take shortcuts or oversubscribe the underlying hardware. When it comes to storage, for example, we will automatically expand the amount of disk space you have available. If you reach 80% of your quota, we expand the disk so that you can continue to write data to Kafka. We'll keep expanding it (while sending you email reminders that you've exceeded your plan's capacity) so as not to interrupt your workflow. If you go far above your limit without addressing the problem, we'll throttle your throughput, but still keep the cluster operational for you to use.
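The 80% expansion trigger can be expressed as a simple predicate. The threshold comes from the description above; the function name and signature are illustrative, not part of any real automation API:

```python
def should_expand(used_bytes, capacity_bytes, threshold=0.80):
    """Decide whether to grow a broker's disk: expand once usage
    crosses 80% of the current capacity."""
    return used_bytes / capacity_bytes >= threshold

print(should_expand(79, 100))  # False: still under the threshold
print(should_expand(80, 100))  # True: time to grow the disk
```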
In many cases, for our sake and yours, we set configuration defaults that are sane and safe, and often higher than what Kafka itself recommends. These include more partitions (from one to eight, to maximize throughput), additional replicas (from one to three, to ensure your data is not lost), and more in-sync replicas (from one to two, to confirm that a replica actually received a write).
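Those values map onto real Kafka topic settings. The dictionary below is just an illustrative summary of the defaults described above, not a configuration format that Kafka itself consumes:

```python
# Topic defaults described above. The keys mirror Kafka's setting
# names; "partitions" and "replication.factor" are set at topic
# creation, while min.insync.replicas is a topic-level config.
topic_defaults = {
    "partitions": 8,           # up from 1, for parallelism and throughput
    "replication.factor": 3,   # up from 1, so data survives a broker loss
    "min.insync.replicas": 2,  # up from 1, so an ack means a replica has the write
}

print(topic_defaults["min.insync.replicas"])  # 2
```

With a replication factor of three and two in-sync replicas required, a producer using `acks=all` knows each write is durable even if a broker fails immediately afterward.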
We tested our Kafka infrastructure across a variety of permutations in order to simulate how the multi-tenant clusters would behave in the real world.
To find the optimal configurations and observe how Kafka's performance changed, we tried many combinations of settings, from throughput limits to the maximum number of partitions.
Similarly, we replicated several failure scenarios to see whether the service could survive unexpected changes. We would hammer a cluster with a million messages and then take one of the brokers offline. Or we'd operate a cluster normally, then stop and restart a broker, to verify that the failover process worked.
Our tests create an empty cluster, and then generate 50 users that attach the Kafka add-on to their application; they then create many producers and consumers for these simulated users. From there, we define usage profiles for each user. For example, we'll say that 10% of our users send very small amounts of traffic, and 20% of them send very large messages but at a very slow speed, and then the rest of them do some random distribution of data. Through this process, and while gradually increasing the number of users, we’re able to determine the multi-tenant cluster’s operational limits.
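A usage-profile simulation along these lines can be sketched in a few lines of Python. The profile names, message sizes, rates, and exact weights below are hypothetical stand-ins for the distributions described above:

```python
import random

random.seed(42)  # deterministic assignment for the example

# Hypothetical usage profiles: (name, traffic shape, share of users).
PROFILES = [
    ("tiny",  {"msg_bytes": 100,     "msgs_per_sec": 1.0}, 0.10),
    ("bulky", {"msg_bytes": 500_000, "msgs_per_sec": 0.2}, 0.20),
    ("mixed", {"msg_bytes": 10_000,  "msgs_per_sec": 50},  0.70),
]

def assign_profiles(n_users):
    """Randomly assign each simulated user a usage profile,
    weighted by the target share of the user population."""
    names = [name for name, _, _ in PROFILES]
    weights = [share for _, _, share in PROFILES]
    return random.choices(names, weights=weights, k=n_users)

users = assign_profiles(50)
print(len(users))  # 50 simulated tenants, each with a traffic profile
```

Ramping `n_users` up while driving producers and consumers according to each profile is what lets a test harness find the point where a multi-tenant cluster stops meeting its service targets.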
From these incidents and observations, we were able to identify issues in our setup and adjust our assumptions, making the infrastructure truly resilient before problems ever reached our users. For a deeper look into how we've set up all this testing and automation, my colleague Tom Crayford has given a talk called "Running Hundreds of Kafka Clusters with 5 People."
If you'd like to learn more about how we operate Kafka, or what our customers have built with this service, check out our previous blog posts. Or, you can watch this demo for a more technical examination on what Apache Kafka on Heroku can do for you.