John Preston for AWS Community Builders

Posted on Nov 22, 2022 • Edited on Dec 15, 2022

AWS MSK, Confluent Cloud, Aiven. How to choose your managed Kafka service provider?

#aws #kafka #confluent

TL;DR

This blog post provides an overview of different managed Kafka service providers, including AWS MSK, Confluent Cloud, and Aiven. It compares their features, including cost, operational capabilities, and security, to help you decide which provider is best suited to your needs.

A little background.

I am by no means a Kafka guru: I haven't contributed to it, and I don't hold any certification for or affiliation with it. I am simply a "power user" who has been using AWS for years and has spent the past few years working with a managed Kafka service provider, which now gives me plenty to compare in practice.

Solutions/Offering comparison

In today's comparison, I am going to use AWS MSK and Confluent Cloud.

At times, I will also mention Aiven, but my credits ran out before I could explore all of its features, so I recommend you explore that option yourselves too.

MSK Serverless being new and, by nature, limited in its use cases, I am leaving it out of this comparison. Equally, I am not considering Confluent "Platform", given it's not a managed service.

Security

This is to me the first and most important criterion. With Kafka becoming more and more popular, it is crucial to ensure that the information is secured and access is restricted.

Criteria	AWS MSK	Confluent
Encryption at rest	Default AWS encryption key; customer encryption key (CMK)	KMS key, at premium cost; no details on default encryption
Authentication methods	SASL with PLAIN/SCRAM/IAM; TLS/SSL	SASL PLAIN
Audits & broker logs	Full audits with IAM; no audits without IAM (rely on broker logs); broker logs available, with long-term persistence	Audit logs for Kafka actions (querying them takes effort); no access to broker logs

Encryption at-rest

AWS MSK, Aiven, and Confluent Cloud all support encryption in transit and at rest. AWS MSK allows you to use your own AWS KMS key (or CMK) to encrypt the cluster data at rest, with no restriction on broker size (however, MSK Serverless does not allow you to set that up).
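As a rough illustration of the customer-managed key option on MSK, here is a minimal sketch that builds the arguments one might pass to the `create_cluster` call of boto3's `kafka` client. The cluster name, subnets, security group, key ARN, Kafka version, and instance type are all placeholder assumptions, not values taken from my setup.

```python
def msk_cluster_request(name, subnets, security_groups, kms_key_arn):
    """Build keyword arguments for boto3's kafka create_cluster call (sketch)."""
    return {
        "ClusterName": name,
        "KafkaVersion": "2.8.1",  # assumption: pick any version MSK supports
        "NumberOfBrokerNodes": len(subnets),  # one broker per subnet
        "BrokerNodeGroupInfo": {
            "InstanceType": "kafka.m5.large",  # assumption: example size
            "ClientSubnets": subnets,
            "SecurityGroups": security_groups,
        },
        "EncryptionInfo": {
            # Your own KMS key (CMK) for data at rest.
            "EncryptionAtRest": {"DataVolumeKMSKeyId": kms_key_arn},
            # TLS between clients and brokers, and between brokers.
            "EncryptionInTransit": {"ClientBroker": "TLS", "InCluster": True},
        },
    }


# All identifiers below are hypothetical placeholders.
request = msk_cluster_request(
    name="example-cluster",
    subnets=["subnet-aaa", "subnet-bbb", "subnet-ccc"],
    security_groups=["sg-example"],
    kms_key_arn="arn:aws:kms:eu-west-1:111122223333:key/example-key-id",
)

# With AWS credentials configured, this could be submitted with:
#   import boto3
#   boto3.client("kafka").create_cluster(**request)
```

Keeping the request as plain data makes it easy to review the encryption settings before anything is created.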

Aiven does not seem to have an option (at least, not as per their console/wizard) to import your encryption key, regardless of the compute tier you select.

Confluent offers this option, but only at a premium cost: you must choose the most expensive compute option to support importing an encryption key. At the time of writing this article, that's an option only available to AWS customers.

However, the permissions that Confluent requires you to grant on your CMK are so wide open that technically they could be using it for anything they like. When asked to list the services leveraging the key, they provided no answer.

Kafka authentication methods

Kafka allows for different authentication methods, each with its pros & cons. We won't get into that here: there is plenty of material out there that explains it in more detail than I could.

In a nutshell, Apache Kafka natively supports the following with SASL:

PLAIN (username/password)
SCRAM (username/password)
OAUTHBEARER (OAuth2)
GSSAPI (Kerberos)
LDAP (not a SASL mechanism as such, but commonly used as the credential store behind PLAIN via a custom or vendor-provided callback handler)

Apache Kafka also supports mutual TLS (or mTLS), which uses certificate-based authentication.

With regards to authorization (what a given client can/can't do), Apache Kafka supports setting ACLs to grant selective permissions for each user. You have to use tools such as JulieOps or CFN Kafka Admin, or just the Kafka CLI/Admin API, to set these permissions.
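To make the ACL idea more concrete, here is a deliberately simplified model of how ACL bindings are evaluated: a matching DENY always wins, and no matching ALLOW means access is refused. This is a sketch of the semantics only, not the Kafka Admin API. The `Acl` class and the helper names are made up for illustration, and it ignores details such as implied operations, hosts, and super users.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Acl:
    principal: str      # e.g. "User:orders"
    resource_type: str  # "TOPIC", "GROUP", ...
    pattern_type: str   # "LITERAL" or "PREFIXED"
    name: str           # resource name, prefix, or "*" for any
    operation: str      # "READ", "WRITE", ... or "ALL"
    permission: str     # "ALLOW" or "DENY"


def matches(acl, principal, resource_type, name, operation):
    """Return True when the ACL applies to this request."""
    if acl.principal not in (principal, "User:*"):
        return False
    if acl.resource_type != resource_type:
        return False
    if acl.operation not in (operation, "ALL"):
        return False
    if acl.pattern_type == "PREFIXED":
        return name.startswith(acl.name)
    return acl.name in (name, "*")


def is_authorized(acls, principal, resource_type, name, operation):
    """DENY takes precedence; without a matching ALLOW, access is refused."""
    relevant = [a for a in acls if matches(a, principal, resource_type, name, operation)]
    if any(a.permission == "DENY" for a in relevant):
        return False
    return any(a.permission == "ALLOW" for a in relevant)


acls = [
    Acl("User:orders", "TOPIC", "PREFIXED", "orders.", "READ", "ALLOW"),
    Acl("User:orders", "TOPIC", "LITERAL", "orders.internal", "READ", "DENY"),
]
print(is_authorized(acls, "User:orders", "TOPIC", "orders.created", "READ"))
print(is_authorized(acls, "User:orders", "TOPIC", "orders.internal", "READ"))
```

The first check passes through the prefixed ALLOW, while the second is refused because the literal DENY matches as well.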

Confluent Cloud only supports SASL_SSL with PLAIN (username/password) authentication.
Their concept of service accounts makes access management easy across multiple clusters.
But over the past year, the information provided to you via API/CLI has broken native Apache Kafka compatibility: the principal given for ACLs is not a valid one. You must therefore request or query the correct Kafka user principal to get things working.
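For reference, this is roughly what a SASL_SSL with PLAIN client configuration looks like for a librdkafka-based client (such as confluent-kafka-python). It is a minimal sketch: the bootstrap address and the credentials are placeholders, and the function name is mine.

```python
def sasl_plain_client_config(bootstrap_servers, username, password):
    """Return librdkafka-style properties for SASL_SSL with PLAIN."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "security.protocol": "SASL_SSL",  # TLS transport, SASL authentication
        "sasl.mechanisms": "PLAIN",
        "sasl.username": username,  # for Confluent Cloud: the API key
        "sasl.password": password,  # for Confluent Cloud: the API secret
    }


# Placeholder values; in practice, load secrets from a secret store.
config = sasl_plain_client_config(
    "broker.example.com:9092", "EXAMPLE_KEY", "EXAMPLE_SECRET"
)
```

The resulting dictionary is what you would hand to the producer or consumer constructor of such a client.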

Confluent also has its own "Roles"/RBAC-driven access control layer, which is an attempt at making the management of said ACLs more user-friendly.

AWS MSK supports more authentication methods than Confluent Cloud. It also implemented an IAM Native SASL mechanism, allowing you to use IAM credentials (Access/Secret Keys & IAM role-based tokens, etc.) to authenticate.

MSK goes even further, as you can also define ACLs by setting IAM policies that grant the users access to resources (topics, groups, etc.).
You do not need any additional tooling to provide your clients access to Kafka. AWS MSK with IAM provides you with fine-grained auditability, as you can log these calls into AWS CloudTrail.
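As an illustration of IAM policies standing in for ACLs, here is a sketch of a read-only policy for one consumer application, generated as a Python dictionary. The region, account ID, cluster name and UUID, topic, and group are hypothetical placeholders. The `kafka-cluster:*` actions and ARN layout follow the MSK IAM access control documentation as I understand it, so double-check them against the current docs.

```python
import json


def msk_consumer_policy(region, account_id, cluster_name, cluster_uuid, topic, group):
    """IAM policy letting a client connect, read one topic, and use one group."""
    prefix = f"arn:aws:kafka:{region}:{account_id}"
    cluster_path = f"{cluster_name}/{cluster_uuid}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["kafka-cluster:Connect"],
                "Resource": f"{prefix}:cluster/{cluster_path}",
            },
            {
                "Effect": "Allow",
                "Action": ["kafka-cluster:DescribeTopic", "kafka-cluster:ReadData"],
                "Resource": f"{prefix}:topic/{cluster_path}/{topic}",
            },
            {
                "Effect": "Allow",
                "Action": ["kafka-cluster:DescribeGroup", "kafka-cluster:AlterGroup"],
                "Resource": f"{prefix}:group/{cluster_path}/{group}",
            },
        ],
    }


policy = msk_consumer_policy(
    "eu-west-1", "111122223333", "example-cluster", "example-uuid",
    topic="orders", group="orders-consumers",
)
print(json.dumps(policy, indent=2))
```

Attach a policy like this to the IAM role of the application, and the role itself becomes the Kafka identity.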

Note that MSK with IAM is very useful and powerful, but AWS needs to keep in mind that its other service offerings must also support Apache Kafka native authentication methods.

Audits

I haven't been able to evaluate that capability with Aiven, but yet again I could not find any option in their provisioning UI to configure it.

Confluent Cloud has some audits, but these are provided to you in the form of a Kafka topic to which they publish Kafka action events for you. You cannot specify where this audit topic is located. Because the logs are in a topic, you have to retrieve/export the data yourself into a data store to intelligently query these events. I have an S3 Sink connector which stores the data in S3, and I use Athena to query the logs.
Confluent does not provide you with a schema of the data, so I had to figure that out myself to make intelligent queries possible on fields.
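Figuring out a schema from exported events can be partly automated. The sketch below walks JSON records and collects every field path with the value types seen for it. The sample events are entirely made up for illustration and are not the real audit log format.

```python
import json


def field_paths(value, prefix=""):
    """Yield (dotted_path, type_name) for every leaf of a decoded JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from field_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for child in value:
            yield from field_paths(child, f"{prefix}[]")
    else:
        yield prefix, type(value).__name__


def infer_schema(lines):
    """Map each field path to the set of Python type names observed."""
    schema = {}
    for line in lines:
        for path, type_name in field_paths(json.loads(line)):
            schema.setdefault(path, set()).add(type_name)
    return schema


# Hypothetical events, one JSON document per line.
sample = [
    '{"id": "a1", "data": {"action": "create-topic", "granted": true}}',
    '{"id": "a2", "data": {"action": "delete-topic", "granted": false, "tags": ["x"]}}',
]
for path, types in sorted(infer_schema(sample).items()):
    print(path, sorted(types))
```

The list of paths and types is a reasonable starting point for writing the table definition used by a query engine such as Athena.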

As mentioned above, MSK provides that audit capability natively when using IAM to authenticate, but for other authentication methods, you will have to rely on the broker logs.

Speaking of broker logs, Confluent simply does not share these or make them available to you, period. That makes troubleshooting very frustrating. But I also see it as a means for them to do all sorts of operations and changes without you having any visibility over these.

AWS MSK can deliver broker logs to three different destinations: CloudWatch Logs, Kinesis Firehose, and S3. All these have pros and cons, but ultimately, the option is there.
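A minimal sketch of how those three destinations could be expressed as the `LoggingInfo` argument of boto3's `kafka` client `create_cluster` call. The log group, delivery stream, and bucket names are placeholders, the `broker-logs/` prefix is an arbitrary choice, and the helper function is mine.

```python
def broker_logging_info(log_group=None, delivery_stream=None, bucket=None):
    """Build a LoggingInfo structure, enabling only the destinations given."""
    broker_logs = {
        "CloudWatchLogs": {"Enabled": False},
        "Firehose": {"Enabled": False},
        "S3": {"Enabled": False},
    }
    if log_group:
        broker_logs["CloudWatchLogs"] = {"Enabled": True, "LogGroup": log_group}
    if delivery_stream:
        broker_logs["Firehose"] = {"Enabled": True, "DeliveryStream": delivery_stream}
    if bucket:
        broker_logs["S3"] = {"Enabled": True, "Bucket": bucket, "Prefix": "broker-logs/"}
    return {"BrokerLogs": broker_logs}


# Placeholder names: CloudWatch for short-term search, S3 for long-term storage.
logging_info = broker_logging_info(
    log_group="/msk/example-cluster", bucket="example-log-bucket"
)
```

Combining CloudWatch Logs and S3 is one way to get both quick searches and cheap long-term retention.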

Operational Capabilities

On security alone, I already have my preference. But let's look at another aspect that these days you simply cannot do without: operability - at least that's what I call it.

Criteria	AWS MSK	Confluent
Kafka version	Can be selected by user	Selected by Confluent, no options
Infrastructure as Code & API	Full AWS API to CRUD resources; SDK support for multiple languages	API without OpenAPI spec; no Confluent-maintained SDK
Monitoring	Full monitoring of brokers; auto-generated client metrics; Open Monitoring with Prometheus & JMX	High-level cluster metrics; heavily rate-limited (80 calls/h)
Network availability	Private & public access	Private & public access; limited options when using private networking

Kafka version

With Confluent Cloud, you cannot choose. They pick the version, run it for you, and all you get to know is the compatibility level.
According to their website, they run the same version as what's available in "Confluent Platform".

With AWS MSK, you get to choose which version to use. In a way, it makes you responsible for choosing said version and knowing how it differs from the others. But equally, if you were in the process of migrating from a self-hosted cluster, for example, that allows you to ensure that compatibility will be the same for your clients, limiting risks.

Some versions give you access to additional features, such as the "2.8.2.tiered" version, which allows you to configure tiered storage for additional cost savings.

Infrastructure as Code & API.

As is common with vendors these days, a maintained Terraform provider exists for each of them. AWS MSK also has CloudFormation support (and therefore CDK/Troposphere support).

All three vendors also have a CLI that allows you to provision resources.

And all three vendors have an API, although AWS has a clear lead in maturity and security for it. And AWS maintains an SDK for nearly every language relevant to this century.

AWS never creates a service without an API for it. Confluent, however, had an API that only recently reached a "mature" state. They have a Go "library", but that's the extent of it.

I created a CloudFormation resource for Confluent Cloud, to manage my service accounts that way. I also have a Lambda function that is used to perform Confluent SASL credentials rotation.
Both these things led me to create a Python SDK to manage Confluent Cloud, which mostly catered to my immediate needs. But the development of said SDK was slowed down by the state of the API before it went "GA".

Monitoring

We have already gone over logs & audits, so we are going to focus on "metrics".

Confluent Cloud being very secretive, you cannot access the JVM of the Kafka clusters. Sadly, that results in very limited capabilities for monitoring. Confluent Cloud does offer a telemetry API that you can use to request exporting data in a Prometheus format, but the API itself is very heavily rate-limited, so you have to make sure you are not going to make too many queries.
This further limits some operational abilities, such as getting a close-to-real-time set of metrics like your consumer groups' lag.

Overall, I found the monitoring capabilities of Confluent Cloud to be too limited, and I had to deploy other services, such as the excellent kafka-lag-exporter to get operationally relevant metrics.
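Two small calculations sit behind this. Consumer lag for a partition is the log end offset minus the group's committed offset, and a limit of 80 calls per hour (the figure from the table above) bounds how often you can poll. A minimal sketch with made-up offsets:

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the group's committed offset."""
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def min_poll_interval_seconds(calls_per_hour):
    """Shortest polling interval that stays within an hourly call budget."""
    return 3600 / calls_per_hour


# Hypothetical offsets for a two-partition topic.
lag = consumer_lag({0: 1500, 1: 900}, {0: 1450, 1: 900})
print(lag, sum(lag.values()))
print(min_poll_interval_seconds(80))
```

With 80 calls per hour, a single query can run at most once every 45 seconds, and every additional query shares that same budget.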

AWS MSK provides metrics all around (cluster metrics, consumer metrics, etc.), stored for free in AWS CloudWatch. That allows you to implement alarms using the native tools of the AWS ecosystem, and trigger all sorts of actions (autoscaling your services, alarms, and so on).

It also supports exporting your JMX metrics in the Prometheus format, allowing you to scrape the brokers/nodes for additional information or export it to another metrics storage system.

Cluster evolution & operations

The Confluent Cloud offering gives you a level of granularity on the compute size and storage capacity of the cluster. With their "Dedicated" offering, you can choose the number of Confluent Units for Kafka (CKUs) to match your business needs. But there is a catch: if you want a multi-AZ cluster (redundant within a region), you must use at least 2 CKUs. That brings the costs to a significant amount, regardless of whether you need that capacity or not. Combined with the encryption key requirement, this forces you to use their Dedicated offering.

As it is a managed service, you do not get to perform any operations such as rebalancing partition leaders and so on.
You have to trust that Confluent will be on top of infrastructure issues to perform these operations reliably for you. Also, because it is a managed service and the computing unit obfuscates things for you, you don't get to know how much actual computing the cluster uses. Confluent provides you with an "average load" metric, and that's all.

You also cannot change any settings, such as the minimum number of in-sync replicas, or, generally speaking, any default or cluster-level settings.

With AWS MSK, the number of brokers is driven by the number of subnets you deploy your cluster into: the number of brokers must be a multiple of that number. I assume that it is to guarantee that you get 1 broker per zone, unless you decided to place all your brokers in subnets using the same zone. You can choose the compute size of your brokers, but you must be wary that some features are not supported on all broker sizes.
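The relationship between broker count and subnet count is simple enough to check before you attempt a deployment. A minimal sketch (the function name is mine):

```python
def valid_broker_count(brokers, subnets):
    """True when brokers can be spread evenly across the client subnets."""
    return brokers > 0 and subnets > 0 and brokers % subnets == 0


# Broker counts up to 12 that work with a cluster spanning 3 subnets.
print([n for n in range(1, 13) if valid_broker_count(n, 3)])
```

For three subnets, this prints `[3, 6, 9, 12]`, so scaling out always happens in steps of three brokers.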

You can create MSK Configurations that allow you to define cluster level settings, fine-tune these for your use-cases, and associate these with your MSK Cluster.
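As an example, an MSK Configuration is essentially a `server.properties` style document. The sketch below renders one from a dictionary. The three settings and their values are only illustrative, and the configuration name and Kafka version in the comment are placeholders.

```python
def render_server_properties(settings):
    """Render settings as key=value lines, encoded for ServerProperties."""
    lines = [f"{key}={value}" for key, value in sorted(settings.items())]
    return "\n".join(lines).encode("utf-8")


# Illustrative cluster-level defaults; tune these for your own use case.
settings = {
    "min.insync.replicas": 2,
    "default.replication.factor": 3,
    "auto.create.topics.enable": "false",
}
server_properties = render_server_properties(settings)
print(server_properties.decode("utf-8"))

# With AWS credentials configured, this could be registered with:
#   import boto3
#   boto3.client("kafka").create_configuration(
#       Name="example-config",
#       KafkaVersions=["2.8.1"],
#       ServerProperties=server_properties,
#   )
```

Keeping the settings in code means they can be reviewed and versioned like any other change.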

In terms of storage, Confluent Cloud advertises an "unlimited" amount of storage, whereas AWS MSK can auto-scale the storage capacity of each broker, from 1GB to 16TB. Both allow adding more brokers, although technically with Confluent, you are changing the number of CKUs.

Network availability

Both Confluent & AWS MSK allow having clusters hosted publicly or privately, but not both at once.

It is important to note that Confluent Cloud requires extremely large CIDR ranges - at the time of writing - if you are looking at connecting to these via AWS Transit Gateway or VPC Peering, making the legacy integrations of existing large IT networks near impossible.

This leaves AWS users with either VPC Private Link or public access. Latency and costs (public traffic being 48 times more expensive per GB via a NAT Gateway) favor Private Link, but Private Link only works one way, so if you were planning to use Confluent-managed connectors, a lot of these are off the table right away.
The way Confluent implemented network ingress on their end also will deny you multipathing: to get from your client to a broker, you must use the endpoint in the same AZ. Any attempt at using an alternative endpoint will be dropped.

Ecosystem

Confluent Cloud offers some features only available in Confluent Cloud and only possible among clusters hosted by Confluent Cloud (although these are very specific and somewhat limited).
They have KSQL as a Service and some connectors. But these are yet again limited in number and/or security options. For example, not all options supported by the S3 sink connector are available in Confluent Cloud.

But for the customers out there not on AWS, Confluent & Aiven can make a very compelling offer.

AWS MSK integrates natively, thanks to its IAM authentication method, with many other AWS services. The number of services that you will be able to integrate with MSK is only going to go up.

If you want KSQL-like capabilities, you can use a service such as Kinesis Data Analytics, which is a managed Apache Flink cluster and has similar semantics and capabilities to KSQL.

They both have a managed Schema Registry service which will allow your application teams to store data schemas, which will help tremendously on your data-as-a-product journey.

Pricing

With both Confluent & AWS MSK, you have a model of pay-as-you-go which makes it very easy to get started with and scale as your needs do.

If you get in touch with the Sales team of Confluent, you might be able to get a discount based on volume and length of contractual engagement, classic IT style.

It is worth noting that having a paid subscription to Confluent Cloud can also get you a license key that will allow you to use some of the Confluent services which are under Confluent licensing. That said, there is often a truly open source alternative to the Confluent "purchased" feature, which is worth considering.

Technically, you can get a smaller & cheaper MSK cluster with all the bells and whistles for security (encryption, audits, etc.), whereas to get all the options available with Confluent Cloud, your costs will be higher by quite a factor.

Because AWS API & Kafka's API are both so rich, one could imagine implementing further logic such as binding consumers to partitions for which the leader is in the same zone as the consumer, reducing cross-AZ traffic costs. Enabling tiered storage with MSK can also further reduce the storage requirements.
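To give an idea of what such zone-aware logic could look like, here is a minimal sketch that picks the partitions whose leader sits in the client's availability zone. The leader and zone mappings are hard-coded, hypothetical data. In practice, they would come from the Kafka Admin API and the AWS API.

```python
def partitions_led_in_zone(partition_leaders, broker_zones, client_zone):
    """Return the partitions whose leader broker is in the client's zone."""
    return sorted(
        partition
        for partition, leader in partition_leaders.items()
        if broker_zones.get(leader) == client_zone
    )


# Hypothetical metadata: partition -> leader broker id, broker id -> zone.
partition_leaders = {0: 1, 1: 2, 2: 3, 3: 1}
broker_zones = {1: "eu-west-1a", 2: "eu-west-1b", 3: "eu-west-1c"}

print(partitions_led_in_zone(partition_leaders, broker_zones, "eu-west-1a"))
```

A consumer running in `eu-west-1a` would be assigned partitions 0 and 3 here. Leaders move over time, so any real implementation would need to refresh this mapping and reassign.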

Conclusion

In the Kafka world, competition on getting the best offering is fierce, with each vendor contributing to Kafka in their very own way. On different aspects, I sincerely wish for MSK & Confluent, as well as anyone involved in improving the Kafka ecosystem, to work together, progress KIPs along, and not forget the root of Apache Kafka is with the Open Source community. Implementing features that work within their ecosystem is a fair and logical business decision. And so long as the Kafka users come first, choosing your Kafka vendor should only be a question of features that meet your business requirements.

As a long-term AWS user, I think that MSK is only going to add more and more features that directly serve customers and their operational capabilities, as their feature focus is always on the customer & security, first.

If you are an AWS user today and are heading towards micro-services architectures, where each application has its own set of permissions, using AWS MSK with IAM authentication is a no-brainer and will get you up and running extremely fast.

In contrast, doing this with Confluent, which has very limited automation around the creation of service accounts, SASL credentials, and operational capabilities, means you will end up creating a few credentials, likely shared among different applications. To stay secure, this requires a lot of discipline and a very good company-wide strategy & maturity.

With the creation of MSK Serverless, MSK Connect, and integration with AWS Glue Schema Registry, the wealth of ETL services that AWS has not only makes Kafka a part of it, it empowers it and gets you into a future-proof position. There is only so much other vendors will do that will get you further than having a managed, hosted Kafka cluster: you will still have to do everything else yourselves.

So if you were undecided as you started reading, I hope this guide has helped you reach a decision.
