Introduction
In modern enterprise applications, the ability to capture and process data changes in real time has become a critical requirement. Change Data Capture (CDC) addresses this need, enabling organizations to efficiently track, capture, and propagate data changes across different systems as they happen.
What is Change Data Capture?
Change Data Capture is a technique that identifies and captures changes made to a database so they can be propagated to other systems with low latency. It’s like having a constant observer recording every insert, update, or delete in your database.
Key Components of Our CDC Architecture
- Debezium: Open-source platform for change data capture.
- Apache Kafka: Distributed messaging system.
- PostgreSQL: Relational database.
- Python: Programming language for processing logic.
Technical Architecture of the System
Data Flow
- Changes occur in PostgreSQL.
- Debezium detects and captures these changes.
- Events are sent to Kafka.
- Python application processes the events.
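To make this flow concrete, here is a minimal sketch of what a Debezium change event for an insert into the products table might look like, and how its op and after fields are read. The structure mirrors Debezium's event envelope, but the values are illustrative, not output from a real connector.

```python
import json

# Illustrative Debezium change event for an INSERT into "products"
# (simplified; a real event also carries schema and source metadata).
raw_event = json.dumps({
    "payload": {
        "op": "c",      # c = create, u = update, d = delete, r = snapshot read
        "before": None,  # row state before the change (None for an insert)
        "after": {"id": 1, "name": "Laptop", "price": 999.99, "stock": 10},
        "source": {"table": "products"}
    }
})

event = json.loads(raw_event)
# Unwrap the Debezium envelope if present.
payload = event.get("payload", event)

if payload["op"] == "c":
    print(f"New row in {payload['source']['table']}: {payload['after']}")
```

The before/after pair is what makes CDC events self-describing: a consumer can react to the new row state without querying the source database.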
Environment Setup
Prerequisites
- Docker
- Python 3.8+
- Basic knowledge of distributed systems
Installing Dependencies
# Update package lists
sudo apt-get update
# Install Docker and Docker Compose (docker.io is the engine package on Debian/Ubuntu)
sudo apt-get install -y docker.io docker-compose python3-pip
# Install Python libraries
pip3 install psycopg2-binary confluent-kafka sqlalchemy
Detailed Configuration
Docker Compose: Orchestrating Services
version: '3'
services:
  zookeeper:
    image: wurstmeister/zookeeper
    ports:
      - "2181:2181"
  kafka:
    image: wurstmeister/kafka
    ports:
      - "9092:9092"
    environment:
      KAFKA_ADVERTISED_HOST_NAME: localhost
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    depends_on:
      - zookeeper
  postgres:
    image: postgres:13
    ports:
      - "5432:5432"
    environment:
      POSTGRES_DB: inventorydb
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
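As written, this compose file does not yet run Debezium itself, and PostgreSQL's default wal_level does not permit logical decoding. A sketch of the missing Kafka Connect service, assuming the standard debezium/connect image (service and topic names here are illustrative):

```yaml
  connect:
    image: debezium/connect:1.9
    ports:
      - "8083:8083"
    environment:
      BOOTSTRAP_SERVERS: kafka:9092
      GROUP_ID: cdc-connect
      CONFIG_STORAGE_TOPIC: connect_configs
      OFFSET_STORAGE_TOPIC: connect_offsets
      STATUS_STORAGE_TOPIC: connect_statuses
    depends_on:
      - kafka
      - postgres
```

The postgres service also needs `command: postgres -c wal_level=logical` so Debezium can read the write-ahead log. Once Connect is up, the PostgreSQL connector is registered over its REST API; the connector name and server name below are assumptions:

```shell
curl -X POST -H "Content-Type: application/json" localhost:8083/connectors -d '{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "inventorydb",
    "database.server.name": "inventory",
    "plugin.name": "pgoutput"
  }
}'
```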
Source Code: CDC Implementation
Data Model (database.py)
from sqlalchemy import create_engine, Column, Integer, String, Float
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Product(Base):
    __tablename__ = 'products'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    price = Column(Float)
    stock = Column(Integer)

class DatabaseManager:
    def __init__(self, connection_string):
        self.engine = create_engine(connection_string)
        Base.metadata.create_all(self.engine)
        self.Session = sessionmaker(bind=self.engine)

    def add_product(self, name, price, stock):
        session = self.Session()
        product = Product(name=name, price=price, stock=stock)
        session.add(product)
        session.commit()
        session.close()
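Because DatabaseManager takes the connection string as a parameter, the model can be exercised against an in-memory SQLite database without any running services. The classes are repeated below so the sketch is self-contained; in the project itself you would import them from database.py:

```python
from sqlalchemy import create_engine, Column, Integer, String, Float
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Product(Base):
    __tablename__ = 'products'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    price = Column(Float)
    stock = Column(Integer)

class DatabaseManager:
    def __init__(self, connection_string):
        self.engine = create_engine(connection_string)
        Base.metadata.create_all(self.engine)
        self.Session = sessionmaker(bind=self.engine)

    def add_product(self, name, price, stock):
        session = self.Session()
        session.add(Product(name=name, price=price, stock=stock))
        session.commit()
        session.close()

# Swap the PostgreSQL URL for in-memory SQLite to test the model in isolation.
db = DatabaseManager('sqlite://')
db.add_product('Laptop', 999.99, 10)

session = db.Session()
count = session.query(Product).count()
session.close()
print(count)
```

Against the compose setup above, the connection string would instead be `postgresql://postgres:postgres@localhost:5432/inventorydb`.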
CDC Processor (cdc_processor.py)
from confluent_kafka import Consumer
import json
from database import DatabaseManager

class CDCProcessor:
    def __init__(self, kafka_broker, source_topic, database_manager):
        self.consumer = Consumer({
            'bootstrap.servers': kafka_broker,
            'group.id': 'cdc-consumer-group',
            'auto.offset.reset': 'earliest'
        })
        self.source_topic = source_topic
        self.db_manager = database_manager

    def process_changes(self):
        self.consumer.subscribe([self.source_topic])
        try:
            while True:
                msg = self.consumer.poll(1.0)
                if msg is None:
                    continue
                if msg.error():
                    print(f"Consumer error: {msg.error()}")
                    continue
                event = json.loads(msg.value().decode('utf-8'))
                # Debezium wraps the change in a 'payload' envelope unless the
                # ExtractNewRecordState (unwrap) SMT is configured.
                self.handle_change(event.get('payload', event))
        except KeyboardInterrupt:
            pass
        finally:
            self.consumer.close()

    def handle_change(self, payload):
        if payload['op'] == 'c':  # Create (insert)
            after = payload['after']
            self.db_manager.add_product(
                name=after['name'],
                price=after['price'],
                stock=after['stock']
            )
            print(f"Product created: {after['name']}")
Benefits of This Architecture
- Real-Time: Instant propagation of changes.
- Decoupling: Independent systems.
- Flexibility: Easy integration with other services.
- Scalability: Grows with business needs.
Challenges and Considerations
- Initial Configuration Complexity: Setting up the system can be challenging.
- Processing Overhead: Requires resources for real-time processing.
- Data Consistency Management: Ensuring data accuracy across systems.
- Information Security: Protecting sensitive data during transfer.
Real-World Use Cases
- Inventory Synchronization: Keeping inventory systems updated.
- Data Replication Between Microservices: Ensuring data consistency.
- Real-Time Reporting Systems: Generating up-to-date reports.
- Legacy System Integration: Bridging old and new systems seamlessly.
Best Practices
- Implement Retry Mechanisms: Handle transient failures.
- Use Transactions: Maintain data integrity.
- Gracefully Handle Errors: Prevent system crashes.
- Implement Monitoring and Logging: Ensure visibility and traceability.
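The first of these practices can be sketched in a few lines. Below is a minimal retry decorator with exponential backoff; the delay values, attempt count, and the flaky handler are illustrative, not part of the project's code:

```python
import time
import functools

def with_retries(max_attempts=3, base_delay=0.1):
    """Retry a function on failure, doubling the delay between attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # give up after the last attempt
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator

# Hypothetical flaky handler: fails twice, then succeeds.
calls = {'n': 0}

@with_retries(max_attempts=3, base_delay=0.01)
def flaky_handler():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError("transient failure")
    return "processed"

result = flaky_handler()  # succeeds on the third attempt
print(result)
```

In the CDC processor, such a decorator would wrap the database write in handle_change so that a transient connection failure does not drop the change event.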
Conclusion
Change Data Capture is not just a technology; it’s an integration strategy that allows organizations to be more agile, responsive, and efficient in handling data.
The combination of Debezium, Kafka, and Python offers a powerful and flexible solution for capturing and processing data changes in real-time.
GitHub repository
https://github.com/jesus20202/CDC-con-Debezium-Kafka-y-PostgreSQL.git