Shahab Ranjbary

Posted on Dec 8, 2023 • Edited on May 22, 2024

Using Benthos: A Practical Guide for Kafka and PostgreSQL Integration

#kafka #postgres #benthos #ingestion

Introduction

Benthos is a versatile and powerful stream processor that simplifies the task of handling real-time data streams. In this article, we'll delve into Benthos, its key features, and how it can be seamlessly integrated with Kafka and PostgreSQL to build robust data pipelines. To illustrate these concepts, we'll use a practical example: BenthosPipelineDB.

BenthosPipelineDB is a sample project showcasing the seamless integration of Benthos with Kafka and PostgreSQL. This project is designed to produce random data, publish it to a Kafka topic, and consume and insert it into a PostgreSQL database using Benthos. The project structure and configurations provide a clear understanding of how these technologies can be orchestrated to create a reliable and scalable data processing pipeline.

Benthos Overview

Benthos is a stream processor designed for real-time data handling. Its features include:

Stream Processing: Benthos allows you to process data streams with ease, making it a valuable tool for scenarios that require real-time data transformations.
Connectivity: With native support for various data sources and sinks, including Kafka and PostgreSQL, Benthos simplifies the integration process.
Extensibility: Benthos supports a wide range of processors and plugins, providing flexibility in designing your data processing pipeline.

Kafka Overview

Kafka is a distributed streaming platform that enables the building of real-time data pipelines. Key Kafka concepts include:

Topics: Kafka organizes data streams into topics, allowing for efficient data segregation and distribution.
Producers and Consumers: Producers publish data to Kafka topics, while consumers subscribe to these topics, creating a scalable and fault-tolerant system.

PostgreSQL Overview

PostgreSQL is a powerful, open-source relational database system. It offers:

Relational Model: PostgreSQL follows a robust relational database model, supporting the creation of structured and organized datasets.
ACID Compliance: ACID (Atomicity, Consistency, Isolation, Durability) compliance ensures data integrity, making PostgreSQL suitable for critical applications.

Project Structure

Here's an overview of the project structure:

Benthos Pipeline: The core Benthos configuration is stored in pipeline/benthos.yml.
PostgreSQL Setup: SQL scripts for creating tables are located in postgres/sql/create_table.sql, and data is stored in postgres/data.
Data Producer: The data generator, responsible for producing random data, is in the data-producer directory.
Docker Setup: The docker-compose.yaml file orchestrates the Docker containers for PostgreSQL, Kafka, Benthos, and the data producer.
Configuration: Environment variables are stored in config.env, facilitating easy configuration management.

How It Works

Data Generation: The data producer (message-producer.py) generates random messages with the format {"name": "hRnWJsIf", "age": 82} and publishes them to the Kafka topic (users).
Benthos Processing: Benthos reads messages from the Kafka topic, processes them according to the defined pipeline in pipeline/benthos.yml, and sends them to the specified output.
PostgreSQL Storage: Processed data is inserted into the PostgreSQL table (users) as defined in postgres/sql/create_table.sql.

Running the Project

1- Navigate to the project directory:

cd /path/to/BenthosPipelineDB

2- Start the project using Docker Compose:

docker-compose up -d

3- Monitor logs to ensure everything is running smoothly:

docker-compose logs -f

4- Kafka Console Consumer
If you want to observe the data flowing through the users topic in real time, you can use the Kafka console consumer. Open your terminal and run the following command:

docker-compose exec kafka kafka-console-consumer.sh --topic users --from-beginning --bootstrap-server kafka:9092

5- Connecting to PostgreSQL
To inspect the data in the PostgreSQL database, you can use a PostgreSQL client. Assuming you have PostgreSQL installed locally, you can connect using the following command:

psql -h localhost -p 5432 -U postgres -d postgres

6- Now, let's run a simple query to fetch the first 10 records from the users' table:

SELECT * FROM users LIMIT 10;

Now, you have a robust data processing pipeline using Benthos, Kafka, and PostgreSQL!

Conclusion

BenthosPipelineDB demonstrates the power and flexibility of Benthos in combination with Kafka and PostgreSQL. With a clear project structure and straightforward configuration, it provides a foundation for building scalable and reliable data processing systems.

Explore the BenthosPipelineDB repository for hands-on experience and customization.

Happy streaming!

DEV Community