Apache Kafka® is the backbone of modern data platforms, allowing data to flow to where it is needed. Kafka Connect is the magic that enables the integration of Apache Kafka with a wide selection of different technologies as data sources or sinks by defining a few rows in a JSON configuration file.
Sometimes, Kafka Connect might appear as dark magic: its plethora of connectors with sometime overlapping functionality, the inconsistent configuration parameters and vague error messages can give the feeling of a hidden art behind the tool. Therefore, specifically if you're new in this field, a little advice can help you go from willing to sacrifice a cockroach to the connect gods to a perfectly working and reliable streaming pipeline.
Before jumping to the tips, we need to share the fundamental law to become a Kafka Connect magician: WE NEED TO READ THE MANUAL!
Kafka Connect is solving a complex integration problem, and doing a great job in taking part of the complexity away. Still the space is huge, with a great variety of technologies and several partially overlapping connectors solving similar integration problems.
Our first duty to become the next Kafka Connect Houdini is to browse for information, to take the time to understand which connectors exist for the integration problem we're trying to solve, and to carefully read their instructions to evaluate their usage for our case.
Now, it's officially time to start with the tips!
Like magicians use their hats to store rabbits for tricks, we need to prepare well so that our tricks, or data, will be successful. Whether we're sourcing or sinking data from Kafka, it's best to pre-create all the data structures needed to receive the data.
Most of the time, there are shortcuts such as
auto.evolve delegating the target topic or table creation to Kafka Connect but, by doing so, we lose control of these artifacts and can generate problems for the downstream data pipelines. Kafka Connect will generate a topic with the default partition count, or tables without the partitioning you have in mind. Therefore the suggestion is: read the documentation carefully and pre-create the necessary landing spots accordingly.
If, after reading the documentation, we're still unsure where the data will land, then we can create a test environment, enable the
auto.create and take note of which artifacts are created so we can properly define them in production.
Like magicians need to learn all the spells to perform tricks, we need to gather the similar knowledge on the Kafka Connect space. As mentioned above Kafka Connect is an amazing and wide space, with different connectors solving similar problems in slightly different methods.
Part of working successfully with Kafka Connect is understanding which available connectors can solve the integration problem we are facing, and understanding the benefits, risks, and capabilities of each one. Once we have a clear map of the options, chose the best fit for our needs.
A practical example; to source database data into Apache Kafka we have two choices: a polling mechanism based on JDBC queries or the Debezium push mechanism. Both seem like valid alternatives, but when you start pushing the boundaries the JDBC solution shows its limits. Knowing the limits of a solution can inform our choice.
Magicians need to check they have all the ingredients for their potions. To build a successful connector, we need to take the same care in validating that all the pre-requisites are satisfied!
Kafka connect is Java-based, so to run a specific connector we need to put all the required JAR dependencies under a particular folder (check out the case for Twitter). This is quite a task by itself, and is where using managed platforms like Aiven for Apache Kafka Connect can help by removing the friction.
Once dependencies are sorted, we still need to properly test that everything we need is there:
- Check the network paths: Can we ping the database? Is the Google Cloud Storage accessible from the Kafka Connect cluster?
- Evaluate the credentials and priviledges: Can the user login? Can it read from the table or write to the S3 bucket?
- Validate that required objects are in place: Is the target S3 bucket already in place? What about the database replication needed by Debezium?
Ensuring we have all the bits and pieces in place before starting the connector will provide a smoother experience. The last thing we want is to lose two hours of time checking at the connector configuration while the problem is that there's no network connectivity between our endpoints.
Data formats are often overlooked but this is a key consideration for setting up a successful pipeline.
When using the default configuration, most of the source connectors push data to Apache Kafka topics in JSON format. This approach works for the majority of the data pipelines but the lack of a defined schema means that we are unable to sink data to technologies, like relational database, where understanding the data structure is required. We'll get the error
No fields found using key and value schemas for table and there are no current workarounds for using these connectors without a defined schema.
The remedy is to use data formats that specify schemas in every possible scenario. From a connector configuration standpoint, this means adding some settings (the
value.converter in the Debezium example are perfect examples), or to make use of tools like Karapace to store key and value schemas.
Once we have the schema properly defined during data ingestion to Apache Kafka, we can use the same schema registry functionality to let a sink connector understand the data shape and push it to any of the downstream technology, whether or not that destination requires a schema.
Kafka Connect provides the magical ability to change the data shape while sourcing/sinking. The power is given by the Single Message Tranformations (SMT), enabling us to alter the data in several different ways including:
- Filtering: pass only a subset of the incoming dataset
- Routing: send different events to separate destinations based on particular field values
- Defining Keys: define the set of fields to be used as event key (this will be discussed more in depth later)
- Masking: obfuscate or remove a field, useful for PII data
SMTs are a very powerful swiss army knife to customize the shape of the data during the integration phase.
Keys are used in Apache Kafka to define how records land in partitions, and in the target system to perform lookups. When defined properly, keys drive performance benefits both when sourcing (parallel writes to partitions) and sinking (e.g. partition identification in database tables).
Therefore it is very important to analyze and accurately define the keys to achieve correct data (ordering in Apache Kafka is guaranteed only within a partition) and good pipeline performance. In this blog, Olena dives deep into how to balance your data across partitions and the tradeoffs you might encounter when selecting the best partitioning strategy.
To strengthen our Kafka Connect magic powers, we need to make our connectors more robust and less vulnerable to errors. Continuously monitoring the pipelines to make sure that they meet the performance expectations, and checking for errors, are an important part of healthy data platforms. There are also a couple of design decisions you can make to improve the probability of success.
Almost every connector allows definition of the data collection/push frequency. Sinking data to a target environment once per day means that the connector needs to hold the whole daily dataset until the appointed moment; if the data volume becomes too big, the connector could crash, forcing us to restart the process and therefore adding delay to the pipeline. Writing less data more often can mitigate the risk, we need to find the right balance between frequency and "batch" size.
Kafka Connect has the concept of tasks, and these can be used to parallelize the load.
As an example, when needing to source 15 tables from a database, having a single connector with a single task to carry all the load could be a risky choice. Instead, consider distributing the data ingestion either by defining one connector with 15 tasks or 15 different connectors with a single task each.
Parallelizing the work helps both in increasing performance and in reducing risks related to single point of failure.
Nobody becomes a Kafka Connect magician without mistakes; we'll hardly nail every setting on the first attempt. Our work is to understand what is wrong and fix. In the face of a problem, start with these tips:
- Check the logs: logs are the source of truth and can give us a hint of where to look at. The error description contained in the logs can help us understanding the nature of the problem and give us hints on where to look.
- Browse for a solution: the internet will provide plenty of fixes to a specific error message. We need to take time and understand carefully if the suggested fix applies to our situation. We can't assume a solution is valid only because the error message posted is similar, but it is a good way to find clues for what to do next.
Since Kafka is a relatively new technology, we may even find that nobody has experienced the same error before us. However, in that case, a second look at the logs and a read of the connector code might take us to a solution... that's the beauty of Open Source!
The error tolerance defines how many parsing mistakes we can accept before having the connector fail. The two options are
none where Kafka connect will crash on the first error, and
all where the connector will continue running even the in case that none of the messages can be processed successfully.
A middle ground is represented by the dead letter queue, a separate topic where we will receive all the erroring messages. The dead letter queue is a great way to make the connector more robust to single message errors but, if set, needs to be actively monitored. The last thing we want is to discover, one year later, 200.000 unparsed messages in the dead letter queue for our
orders topic because of a silly formatting error.
Related to the tolerance, another useful parameter is the automatic restart allowing us to try reanimating a crashed connector. Setting it to
on can be a good way to rescue our connector from transient errors, but will not save us in cases where our configuration is wrong and will cause the connector to always fail.
The concept of "spell book" translates really well to Kafka Connect and approaching the connector configuration with an iterative method can take us a long way towards success.
First we need to read the set of configurations available, then analyze what parameters are necessary and keep our connector configs as simple as possible to build a minimal integration example that can be evolved over time.
Linked with the above, it's also worth spending time on properly setting a version control system for the configuration and automate as much as possible the deployment. This approach will save us time when needing to revert non working changes and reduce the risk of human errors in the deployment. Kafka Connect REST APIs or tools like the Aiven Client, Aiven Terraform Provider or kcctl can help in the automation process.
Understanding Kafka Connect's dark magic can be overwhelming at first sight; but by browsing its ecosystem, reading the various connectors documentation and listening to the above tips we can start using our new wizardry to build fast, scalable and resilient streaming data pipelines.
Some more resources to get you started:
- Aiven for Apache Kafka Connect: don't lose time with setting up the Kafka Connect cluster, focus on creating the integration instead.
- How To Guides for Source and Sink Connectors: check prebuilt integration examples for all the major connectors.
- Twitter example: check out how to self-host a Kafka Connect cluster to run any connector you need, while still benefitting from a managed Apache Kafka cluster.