Well, we encounter lots of data in our everyday life, e.g. weather reports, flight timings, food deliveries, etc. All of this data is continuous and flows from a variety of sources. Such continuously flowing data is called a Data Stream.
A raw data stream is useless for any application unless it has some identifiable elements to it.
Let me explain with a few examples: Joe joined Acme Corp as a Developer on 25 October 2023; Mini placed an order for two pizzas at 12:00 PM. If we take out the verbs (actions) "joined" and "ordered" along with the times "25 Oct" and "12:00 PM", the data becomes useless.
For an application to effectively use a data stream, each piece of data in the stream should be an event: data that has a time and an action associated with it. A stream of events brings great value to applications via analytics, triggers, chaining, etc.
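To make that concrete, here is a minimal sketch of what such an event could look like as a data structure. The record and field names are purely illustrative (not from any particular framework); the sketch is in Java.

```java
import java.time.Instant;

// A minimal illustration of an event: plain data that carries the action
// that happened and the time at which it happened.
// The record and field names are purely illustrative.
public record OrderPlaced(String customer,    // who acted
                          String action,      // what happened, e.g. "ORDER_PLACED"
                          int pizzaCount,     // the payload of the event
                          Instant occurredAt) // when it happened
{
    // "Mini placed an order for two pizzas at 12:00 PM"
    public static OrderPlaced forMini() {
        return new OrderPlaced("Mini", "ORDER_PLACED", 2,
                Instant.parse("2023-10-25T12:00:00Z"));
    }
}
```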
From the "Mini placed an order for two pizzas at 12:00 PM" event, we could extract analytical information such as:
- Is Mini placing an order for two pizzas every day?
- Is the order placed at 12:00 PM every day?
- Is the order delivered by the same food chain?
- What pizzas were ordered?
- Are food chains delivering the orders on time?
The process of using events to build such analytical information is called Data Processing. Data Processing could be done by a human, an application, an IoT device, etc.
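As a rough sketch of data processing, the snippet below reuses the hypothetical `OrderPlaced` record from earlier and counts orders per day for a single customer, which is enough to start answering the "is Mini ordering every day?" question. It is an in-memory illustration, not a production pipeline.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class OrderAnalytics {
    // Counts orders per day for one customer over a small batch of events
    // held in memory, grouping by the calendar date of the event time.
    static Map<LocalDate, Long> ordersPerDay(List<OrderPlaced> events, String customer) {
        return events.stream()
                .filter(e -> e.customer().equals(customer))
                .collect(Collectors.groupingBy(
                        e -> LocalDate.ofInstant(e.occurredAt(), ZoneOffset.UTC),
                        Collectors.counting()));
    }
}
```

Run this over a week's worth of events and you can immediately see whether the "two pizzas at noon" pattern holds every day.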
In any data streaming scenario there are always two essential primitive entities:
- Event Producer - someone or something that produces an event. In our example above, Mini is the event producer who places the order for pizza.
- Event Consumer - someone or something that consumes or uses the event produced by the Event Producer. Taking the same order example, it could be the pizza house that takes, processes and delivers Mini's order.
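Here is a minimal sketch of these two roles. A plain in-memory queue stands in for the streaming platform (real platforms come later in the post), and the class names are hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ProducerConsumerSketch {
    public static void main(String[] args) throws InterruptedException {
        // A stand-in for the streaming platform: a simple in-memory queue.
        BlockingQueue<OrderPlaced> stream = new LinkedBlockingQueue<>();

        // Event Producer: Mini places an order.
        Thread producer = new Thread(() -> stream.offer(OrderPlaced.forMini()));

        // Event Consumer: the pizza house takes the order off the stream.
        Thread consumer = new Thread(() -> {
            try {
                OrderPlaced order = stream.take();
                System.out.println("Preparing " + order.pizzaCount()
                        + " pizza(s) for " + order.customer());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}
```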
As Event Producers produce events that are consumed by Event Consumers, software architectures have evolved around building applications that act as an Event Producer/Consumer and leverage these streams of events. This architectural style of building applications around events is called Event Driven Architecture (EDA).
Building applications using EDA enforces a few basic requirements on the platform and the software framework that will be used to build them.
The platform needs to be:
- Scalable - as events are continuously flowing, there could be sudden spikes in the number of incoming events; the platform needs to be scalable or elastic to handle such spikes
- Durable - as events can be consumed immediately or later in time, the platform should support a mechanism for delivering events durably at the time of need
- Resilient - the platform should be capable of handling failures and recovering from them without data loss
- Data Retention - retaining data for a configurable amount of time
- Responding to Events - the platform should be able to respond to events, at a bare minimum acknowledging an event on receipt
- Ordering - as events are associated with time, ordering the events helps consumers who need to process them in a specific order, e.g. within a time range or date range
A platform alone might not be enough to build an effective EDA-styled application. There is also a need for integration with the platform via plugins, APIs, etc. In other words, a framework that is extensible, pluggable, and works on common semantics.
The framework should support:
- Data Sources - the sources from which events are generated, i.e. Event Producers
- Data Sinks - the destinations into which processed events are drained, i.e. Event Consumers
- API - An interface to connect and work with the platform, data sources and data sinks.
Some great Data Streaming platforms:
- Apache Kafka - developed at LinkedIn and open-sourced to the Apache Software Foundation
- Redpanda - a simple, powerful, and cost-efficient streaming data platform that is compatible with Kafka® APIs while eliminating Kafka complexity
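To tie a platform like Kafka back to the requirements above, here is a hedged sketch of creating a topic with Kafka's Java AdminClient: partitions relate to scalability and per-partition ordering, the replication factor to durability and resilience, and `retention.ms` to data retention. The broker address, topic name and sizing numbers are assumptions for illustration only.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumes a broker reachable on localhost:9092 (illustrative).
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions for scalability and per-partition ordering,
            // replication factor 3 for durability and resilience.
            NewTopic ordersTopic = new NewTopic("pizza-orders", 6, (short) 3)
                    // Retain events for 7 days (data retention requirement).
                    .configs(Map.of("retention.ms", "604800000"));

            admin.createTopics(List.of(ordersTopic)).all().get();
        }
    }
}
```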
Apache Kafka also supports processing of streaming data through its ecosystem of the Kafka Streams API and ksqlDB. But for an effective architecture, it is always nice to have the core data streaming and data processing decoupled (Separation of Concerns). Such decoupling helps in processing data from heterogeneous sources, e.g. Apache Kafka, databases, CSV files on a file system, etc.
Apache Flink is one such framework: a distributed processing engine for stateful computations over unbounded data streams (e.g. Apache Kafka) and bounded data streams (e.g. a database).
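Below is a rough sketch of what a Flink job reading from the hypothetical pizza-orders topic could look like, showing the data source, data sink and API pieces a framework should support. It assumes the flink-connector-kafka dependency and a recent Flink version; the exact builder API may differ between releases.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PizzaOrderStreamJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Data source: the unbounded stream of order events from Kafka.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")   // assumed broker address
                .setTopics("pizza-orders")               // topic from the earlier sketch
                .setGroupId("pizza-analytics")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Data sink: print to stdout; a real job would drain into a database,
        // another topic, or a file system instead.
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "pizza-orders")
           .print();

        env.execute("pizza-order-stream");
    }
}
```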
Just to summarise, we learnt:
- What a Data Stream and an Event are
- What an Event Producer and an Event Consumer are
- The architectural style used to build applications around events (EDA)
- What makes an effective EDA platform and framework
- Some great platforms and frameworks that could be used to build EDA applications