Kafka: distributed task queue

Apache Kafka is an open-source distributed streaming platform developed by the Apache Software Foundation. It is designed for high-throughput, fault-tolerant, real-time data streaming. Kafka handles large volumes of data streams, making it suitable for use cases such as real-time data pipelines, event sourcing, messaging systems, log aggregation, and more.

Key concepts in Kafka:

  1. Topics: Topics are the core abstraction in Kafka and represent a specific stream of records. Producers publish messages to topics, and consumers subscribe to topics to consume those messages.

  2. Producers: Producers are responsible for publishing messages to Kafka topics. They write messages to specific topics, which are then stored and made available for consumption by consumers.

  3. Consumers: Consumers subscribe to Kafka topics and consume messages from them in real time. Multiple consumers can be part of a consumer group, where each consumer in the group reads from a different subset of partitions for parallel processing (see the sketch after this list).

  4. Partitions: Topics can be divided into multiple partitions, allowing for parallelism and scalability. Each partition is an ordered, immutable sequence of messages. Kafka distributes the partitions across different brokers in a Kafka cluster.

  5. Brokers: Brokers form the Kafka cluster and are responsible for receiving messages from producers, storing them on disk, and serving them to consumers. Kafka brokers ensure fault tolerance by replicating partitions across multiple brokers.

  6. ZooKeeper: Kafka relies on Apache ZooKeeper for cluster coordination, managing metadata, and maintaining broker and consumer group information.

  7. Connect: Kafka Connect is a framework for scalable and reliable integration of Kafka with external systems. It simplifies the development and management of connectors for data import/export to/from Kafka.
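To make the consumer-group idea concrete, here is a minimal kafka-python sketch; the topic name 'example_topic' and group name 'example_group' are placeholders. Running the same script in two terminals with the same group_id makes Kafka split the topic's partitions between the two processes, so each message is delivered to only one of them.

from kafka import KafkaConsumer

# Consumers that share a group_id divide the topic's partitions among themselves.
consumer = KafkaConsumer(
    'example_topic',
    bootstrap_servers='localhost:9092',
    group_id='example_group',
)

for message in consumer:
    # Each message is consumed by exactly one member of the group.
    print(f'partition={message.partition} offset={message.offset} value={message.value}')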

Distributed task queue system: using Kafka to distribute tasks across multiple workers. Kafka is used to enqueue tasks, distribute them among workers, and track the status and results of task execution. This is useful for implementing parallel processing or load balancing in data processing workflows. The complete code is available at github.com/James-Wachuka/python-kafka_distributed_task_queue

Example consumer code:

import logging
from kafka import KafkaConsumer, KafkaProducer

# Kafka configuration
bootstrap_servers = 'localhost:9092'
task_topic = 'task_topic'
result_topic = 'result_topic'

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Create Kafka consumer and producer
# (a shared group_id lets multiple workers split the topic's partitions so each
# task is handled by only one worker; 'task_workers' is an example group name)
consumer = KafkaConsumer(task_topic, bootstrap_servers=bootstrap_servers, group_id='task_workers')
producer = KafkaProducer(bootstrap_servers=bootstrap_servers)

# Process tasks
for message in consumer:
    task = message.value.decode('utf-8')
    logger.info(f'Received task: {task}')

    # Perform task processing logic here
    result = task.upper()  # Example: Uppercase the task
    logger.info(f'Processed task: {task} --> Result: {result}')

    # Send the result to the result topic
    producer.send(result_topic, result.encode('utf-8'))
    producer.flush()
    logger.info(f'Result sent to {result_topic}')


The code above shows a Kafka consumer that listens to task_topic and receives incoming tasks. Each task is processed by applying some logic (here, converting the task to uppercase), and the result is sent to result_topic using a Kafka producer.

Producer code:

from kafka import KafkaProducer

# Kafka producer configuration
bootstrap_servers = 'localhost:9092'
task_topic = 'task_topic'

# Create Kafka producer
producer = KafkaProducer(bootstrap_servers=bootstrap_servers)

# Enqueue tasks
tasks = ['task1', 'task2', 'task3']  # Example tasks
for task in tasks:
    # Enqueue the task to the task topic
    producer.send(task_topic, task.encode('utf-8'))
    producer.flush()
    print(f"Task enqueued: {task}")

# Close the producer connection
producer.close()

In this code, we create a Kafka producer to enqueue tasks. We define a list of tasks, and each task is sent to the task_topic using the producer.

To run this example, start multiple instances of the worker code in separate terminals. Then, execute the enqueuing code to send tasks to the Kafka topic. The workers will consume the tasks, process them, and send the results to the result topic.
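For example, assuming the worker and enqueuing scripts are saved as consumer.py and producer.py (as in the linked repo):

    # terminal 1 and terminal 2: start two workers
    python consumer.py

    # terminal 3: enqueue the tasks
    python producer.py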

Setup

  1. Download and extract Kafka.
  2. Build the Kafka project (only needed for a source download); inside the kafka folder, run:

    ./gradlew jar -PscalaVersion=2.13.10

  3. Start ZooKeeper:

    bin/zookeeper-server-start.sh config/zookeeper.properties

  4. Start the Kafka broker:

    bin/kafka-server-start.sh config/server.properties

  5. Install kafka-python:

    pip install kafka-python

  6. Create the Kafka topics (to let several workers share the load, give task_topic more than one partition, e.g. --partitions 3):

    bin/kafka-topics.sh --create --topic task_topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
    bin/kafka-topics.sh --create --topic result_topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

  7. Run consumer.py and producer.py in separate terminals.

  8. Verify the output: the consumer will process the tasks produced by the producer and print the results to the console.

Customization: adding error handling and task acknowledgement

To add error handling and task acknowledgement to the Kafka implementation, you can use Kafka's offset-commit mechanism as an acknowledgement and handle any exceptions that occur during processing. Set enable_auto_commit=False when creating the Kafka consumer to disable automatic offset commits; this lets you commit the offset manually only after a task has been processed successfully. After processing a task, use add_callback() on the future returned by the producer's send() call to register a callback that runs when the result is successfully delivered to the result topic, and add_errback() to register an error callback that runs if delivery fails. Finally, call consumer.commit() explicitly to commit the offset, marking the task as processed.
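A minimal sketch of this pattern with kafka-python (the group_id 'task_workers' and the callback names are illustrative, not part of the original repo):

import logging
from kafka import KafkaConsumer, KafkaProducer

bootstrap_servers = 'localhost:9092'
task_topic = 'task_topic'
result_topic = 'result_topic'

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Manual offset commits require a consumer group.
consumer = KafkaConsumer(
    task_topic,
    bootstrap_servers=bootstrap_servers,
    group_id='task_workers',
    enable_auto_commit=False,  # acknowledge tasks manually via consumer.commit()
)
producer = KafkaProducer(bootstrap_servers=bootstrap_servers)

for message in consumer:
    task = message.value.decode('utf-8')
    logger.info(f'Received task: {task}')

    try:
        result = task.upper()  # task processing logic
    except Exception as exc:
        # Processing failed: log and skip the commit so the task
        # can be redelivered after a restart.
        logger.error(f'Task failed: {task}: {exc}')
        continue

    # Default arguments capture the current task for the callbacks.
    def on_success(record_metadata, task=task):
        logger.info(f'Task processed: {task} --> result sent to {record_metadata.topic}')

    def on_error(exc, task=task):
        logger.error(f'Failed to send result for task {task}: {exc}')

    future = producer.send(result_topic, result.encode('utf-8'))
    future.add_callback(on_success)
    future.add_errback(on_error)
    producer.flush()  # block until delivery succeeds or fails

    # Acknowledge the task by committing the offset only after processing.
    consumer.commit()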

Customization: enhancing the logic

Enhanced consumer code:

import time
from kafka import KafkaConsumer

# Kafka configuration
bootstrap_servers = 'localhost:9092'
task_topic = 'task_topic'
result_topic = 'result_topic'

# Create Kafka consumer
consumer = KafkaConsumer(task_topic, bootstrap_servers=bootstrap_servers)

# Start consuming messages
for message in consumer:
    task = message.value.decode('utf-8')
    print(f"Received task: {task}")

    # Simulate time-consuming task processing
    time.sleep(5)  # Delay of 5 seconds

    # Perform task processing logic here
    result = task.upper()  # Example: Uppercase the task

    # Report the result (this simplified variant prints instead of producing to result_topic)
    print(f"Task processed: {task} --> Result: {result}")

In this updated code, we've added a time.sleep(5) call to simulate a time-consuming task that takes 5 seconds to process; you can adjust the sleep duration to match your desired processing time. With this enhancement, each task received by the worker undergoes a 5-second delay before completing, which simulates scenarios where tasks require significant computation or access to external resources.

The Kafka implementation described above, which is a distributed task queue, has several significant applications and benefits in the real world:

  1. Scalable and Fault-tolerant Task Processing: Kafka's distributed nature allows the implementation to handle high-volume task processing with scalability and fault tolerance. By running multiple worker consumers, you can distribute the workload across multiple machines or processes, achieving parallel processing and load balancing.

  2. Real-time Data Processing: Kafka provides real-time data streaming capabilities. By using Kafka as the underlying messaging system for task distribution, you can process tasks as they arrive, enabling real-time data processing and reducing latency.

  3. Microservices Architecture: Kafka is commonly used as a communication bus in microservices architectures. In this context, the distributed task queue can be used to coordinate and distribute tasks among various microservices. Each microservice can subscribe to the task topic, process the tasks independently, and publish the results to other topics or services.

  4. Event-driven Architectures: Kafka enables event-driven architectures where systems react to events asynchronously. The task queue can be used to handle event-driven tasks, allowing systems to react to events in real time and trigger corresponding actions or workflows.

  5. Big Data Processing: Kafka is commonly used in big data processing pipelines. By incorporating the distributed task queue in such pipelines, you can distribute data processing tasks across multiple workers, enabling efficient and parallel processing of large datasets.

  6. Workflow Orchestration: The task queue can be integrated into workflow orchestration systems, where tasks represent individual steps or actions in a larger workflow. The distributed task queue allows for efficient coordination, tracking, and monitoring of tasks in complex workflows.

  7. Real-time Analytics and Monitoring: By processing tasks in real-time and generating results or metrics, the distributed task queue can facilitate real-time analytics and monitoring. This can be useful for applications such as fraud detection, anomaly detection, real-time analytics dashboards, and system monitoring.

Overall, the Kafka implementation of a distributed task queue provides a flexible and scalable solution for handling tasks and processing data in real-time, making it a valuable component in various real-world scenarios, including microservices, big data processing, event-driven architectures, and real-time analytics.
