Muhammad Ibrahim Anis
Learning Kafka Part 4 (III): Kafka Connect

Sometimes, our source of data might be an external system, like a database, data warehouse or file system. We might also want to send the data in Kafka topics to these external systems. Instead of writing a custom implementation, Kafka has a component called Kafka Connect for exactly this.

Kafka Connect

Kafka Connect, also known simply as Connect, is a framework for connecting Kafka to external systems such as databases, search indexes, message queues and file systems, using connectors.
Connect runs in its own process, separate from the Kafka cluster. Using Kafka Connect requires no programming; it is completely configuration-based.

Connectors are ready-to-use components that import data from external systems into Kafka topics and/or export data from Kafka topics into external systems. We can use existing connector implementations or implement our own.

There are two types of connectors in Kafka Connect.
A source connector collects data from an external system. Source systems can be databases, filesystems, cloud object stores, etc. For example, the JDBCSourceConnector imports data from a relational database into Kafka.

A sink connector delivers data from Kafka topics to external systems, which can also be databases, filesystems, cloud object stores, etc. For example, the HDFSSinkConnector exports data from Kafka topics to the Hadoop Distributed File System.

Image of source and sink connectors
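As a concrete illustration, here is a minimal sketch of what a configuration for the JDBC source connector mentioned above might look like. The database URL, credentials, table and topic prefix are made-up values for illustration only.

```properties
# Hypothetical JDBC source connector configuration (values are illustrative)
name=postgres-orders-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:postgresql://localhost:5432/shop
connection.user=connect
connection.password=connect-secret
# copy only the orders table, tracking new rows by an incrementing id column
mode=incrementing
incrementing.column.name=id
table.whitelist=orders
# records from the orders table end up in the topic postgres-orders
topic.prefix=postgres-
tasks.max=1
```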

A deep dive into the Connect pipeline

A Connect pipeline is not actually made up of connectors only; it comprises several components, each responsible for carrying out a specific function.

Connectors
Connectors do not perform the actual data copying themselves. Their configuration describes the set of data to be copied, and the connector is responsible for breaking that job into tasks that can be distributed to Connect workers.

Connectors run in a Java process called a worker. Connectors can be run in standalone mode or distributed mode.
In standalone mode, we run a single worker process. Data only flows through the Connect pipeline as long as this single process is up, and we can’t make any changes to the pipeline once it is running. Standalone mode is intended for testing and temporary or one-time connections, and is not recommended for production environments.
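For illustration, a standalone worker is typically launched with the script that ships with Kafka, passing the worker properties file and one or more connector properties files; the file names below are assumptions.

```bash
# Start a single standalone worker running one connector (paths are illustrative)
bin/connect-standalone.sh config/connect-standalone.properties config/postgres-orders-source.properties
```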

In distributed mode, a group of workers runs as a cluster and collaborates to spread out the workload (like consumers in a consumer group). The Connect pipeline can be reconfigured while running. Having multiple workers means the pipeline can keep sending or receiving data even if a worker goes down, and we can add or remove workers as needed.
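In distributed mode, connectors are created and reconfigured through the workers’ REST API (port 8083 by default) rather than through local properties files. A sketch, reusing the illustrative JDBC connector from earlier:

```bash
# Submit a connector to a distributed Connect cluster via the REST API
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
        "name": "postgres-orders-source",
        "config": {
          "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
          "connection.url": "jdbc:postgresql://localhost:5432/shop",
          "mode": "incrementing",
          "incrementing.column.name": "id",
          "table.whitelist": "orders",
          "topic.prefix": "postgres-",
          "tasks.max": "2"
        }
      }'
```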

Tasks
Tasks contain the code that actually copies data to or from the external system; they receive their configuration from their parent connector. The connector splits the copying job into tasks, and the tasks then pull data from or push data to the external system.
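In distributed mode we can see how a connector’s tasks have been spread across workers by querying the REST API; the connector name here is the illustrative one from earlier.

```bash
# Show the state of a connector and which worker each of its tasks is running on
curl http://localhost:8083/connectors/postgres-orders-source/status
```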

Converters
Producers and consumers use serializers and deserializers to control how data is translated before being sent to or retrieved from Kafka. If Connect is getting data out of Kafka, it needs to be aware of the serializers that were used to produce that data; if Connect is sending data to Kafka, it needs to serialize it to a format the consumers will be able to understand.

For Connect, we don’t configure a serializer and deserializer separately, like we do for producers and consumers. Instead, we provide a single library, a converter, that can both serialize and deserialize data to and from our chosen format.
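Converters are configured on the worker and can be overridden per connector. Below is a sketch of worker properties assuming Avro values with a Schema Registry at an illustrative address and schemaless JSON keys.

```properties
# Worker-level converter settings (individual connectors can override these)
key.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
```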

For source connectors, converters are invoked after the connector in the Connect pipeline; for sink connectors, converters are invoked before the connector.

Image of Kafka Connect converter

Transforms
Transforms are simple pieces of logic that alter each message produced by or sent to Connect. They allow us to transform messages as they flow through the Connect pipeline, which helps us get the data into the right shape for our use case before it reaches either Kafka or the external system. Transformation is an optional component of the Connect pipeline.

While it is possible to perform complex transformations, it is considered best practice to stick to fast and simple logic. A slow or heavy transformation will affect the performance of the Connect pipeline. For advanced transformations, it’s best to use Kafka Streams, a dedicated framework for stream processing.

For source connectors, transformations are invoked after the connector and before the converter; for sink connectors, transformations are invoked after the converter but before the connector.

Image of Connect transforms
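Transforms (often called Single Message Transforms, or SMTs) are configured on the connector. As a sketch using two transforms that ship with Kafka, this masks a field and stamps each record with a static field; the field names and values are made up for illustration.

```properties
# Chain of two single message transforms, applied in the order listed
transforms=maskEmail,addSource
transforms.maskEmail.type=org.apache.kafka.connect.transforms.MaskField$Value
transforms.maskEmail.fields=email
transforms.addSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.addSource.static.field=origin
transforms.addSource.static.value=postgres-orders
```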

Error Handling

In an ideal world, once our Connect pipeline is up and running, we can call it a night. Sadly, the world is never ideal; something somewhere will always go wrong, including in our Connect pipeline. An example is when a sink connector receives data in an invalid format (say, JSON instead of Avro). And when something invariably does go wrong, we want our pipeline to be able to handle it gracefully. Error handling is an important part of a reliable data pipeline. In Kafka Connect, error handling is implemented by the sink connector, not the source connector.

Kafka Connect provides the following error handling options.

Fail Fast
By default, our sink connector terminates and stops processing messages when it encounters an error. This is a good option if an error in our sink connector indicates a serious problem.
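This corresponds to the default error tolerance, which can also be stated explicitly in the sink connector configuration:

```properties
# Default behaviour: a record that fails conversion or transformation fails the task
errors.tolerance=none
```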

Ignore Errors
We can optionally configure our sink connector to ignore all errors and never stop processing messages. This is good if we want to avoid downtime, but we run the risk of missing problems in our connector, since we are not notified when something goes wrong. (It always does.)
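To ignore errors, the tolerance is set to all; enabling the error log at least leaves a trace of what was skipped.

```properties
# Skip records that fail conversion or transformation, but log them
errors.tolerance=all
errors.log.enable=true
errors.log.include.messages=true
```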

Dead Letter Queue
A Dead Letter Queue (DLQ) is a mechanism used in messaging systems and data pipelines to store messages that are not processed successfully. Instead of crashing the pipeline or ignoring errors, the system moves such messages to a dead letter queue so we can handle them later.

In Connect, the dead letter queue is a separate topic that stores messages that cannot be processed. The invalid messages can then be ignored, or fixed and reprocessed. This is the most graceful and recommended way to handle errors.
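A sketch of a sink connector configured to route failed records to a dead letter queue topic; the topic name and replication factor are illustrative.

```properties
# Tolerate errors, but capture failing records in a separate topic
errors.tolerance=all
errors.deadletterqueue.topic.name=dlq-postgres-orders
errors.deadletterqueue.topic.replication.factor=3
# add headers to each DLQ record describing why it failed
errors.deadletterqueue.context.headers.enable=true
```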

And this brings us to the end of Kafka Connect, and also the end of this part of the series, where we discussed the components that interact with a Kafka cluster. We have had a look at producers and consumers, the applications that write messages to and read messages from Kafka; Kafka Streams, for processing these messages in real time; and lastly Kafka Connect, a framework used to connect Kafka to external systems.

Coming up, a look at the medium these components use to communicate with Kafka.
