This post is the fourth in a series on real-time analytics. It is an excerpt from Real-time analytics, a definitive guide, which can be read in full here.
--
Building a real-time analytics application can feel daunting. In particular, seven key challenges arise when building real-time analytics:
- Using the right tools for the job
- Adopting a real-time mindset
- Handling scale
- Enabling real-time observability
- Evolving data projects in production
- Managing cross-team collaboration
- Controlling costs
Using the right tools for real-time analytics
Real-time analytics demands a different toolset than traditional data pipelines or app development. Instead of data warehouses, batch ETLs, DAGs, and OLTP or document-store app databases, engineers building real-time analytics need to use streaming technologies, real-time databases, and API layers effectively.
And because speed is so critical in real-time analytics, engineers must bridge these components with minimal latency, or turn to a real-time analytics platform that integrates each function.
Either way, developers and data teams must adopt new tools when building real-time applications.
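To make the shape of this new toolset concrete, here is a minimal Python sketch of the chain described above: events flow from a stream into a real-time database, which then backs a query API. The topic name, broker address, and ingest endpoint are assumptions for illustration, not a prescribed stack.

```python
# Minimal sketch: stream -> real-time database -> API layer.
# The topic name, broker address, and ingest endpoint are illustrative.
import json
import urllib.request

from kafka import KafkaConsumer  # pip install kafka-python

INGEST_URL = "http://localhost:8000/ingest"  # hypothetical real-time DB endpoint

consumer = KafkaConsumer(
    "page_views",                        # assumed topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Forward each event to the real-time database as it arrives,
    # rather than accumulating it for a nightly batch job.
    req = urllib.request.Request(
        INGEST_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```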
Adopting a real-time mindset
Of course, using new tools won’t help if you’re stuck in a batch mindset.
Batch processing (and batch tooling like dbt or Airflow) often involves running the same query over your data on a schedule, recalculating results as new data arrives. In effect, much of the same data gets processed many times.
But if you need access to those results in real-time (or over fresh data), that way of thinking does not help you.
Engineers comfortable with batch processes need to think differently when building real-time analytics.
A real-time mindset focuses on minimizing data processing - optimizing to process raw data only once - to both improve performance and keep costs low.
In order to minimize query latencies and process data at scale while it’s still fresh, you have to:
- Filter out and avoid processing anything that’s not absolutely essential to your use case, to keep things light and fast.
- Consider materializing or enriching data at ingestion time rather than query time, so that your downstream queries are more performant (and you avoid constantly scanning the same data); see the sketch after this list.
- Keep an optimization mindset at all times: the less data you have to scan or process, the lower the latency you’ll be able to provide within your applications, and the more queries you’ll be able to push through each CPU core.
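To make the "process raw data only once" idea concrete, here is a minimal Python sketch contrasting a batch-style recomputation with a materialized aggregate that is updated once per event at ingestion time. The event shape and field names are invented for illustration.

```python
from collections import defaultdict

# Batch mindset: re-scan every stored event on each query run.
def revenue_by_country_batch(all_events):
    totals = defaultdict(float)
    for event in all_events:          # same rows re-processed every run
        totals[event["country"]] += event["amount"]
    return totals

# Real-time mindset: touch each raw event exactly once at ingestion,
# keeping a materialized aggregate that queries read directly.
class RevenueByCountry:
    def __init__(self):
        self.totals = defaultdict(float)

    def ingest(self, event):
        # Filter out what the use case doesn't need before any heavy work.
        if event.get("type") != "purchase":
            return
        self.totals[event["country"]] += event["amount"]

    def query(self, country):
        return self.totals[country]   # O(1): no scan at query time

agg = RevenueByCountry()
agg.ingest({"type": "purchase", "country": "US", "amount": 42.0})
agg.ingest({"type": "pageview", "country": "US"})  # dropped by the filter
print(agg.query("US"))  # 42.0
```

The batch version re-scans every stored row on each run; the real-time version answers in constant time because the work happened once, at ingest.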
Handling scale
Real-time analytics combines the scale of “Big Data” with the performance and uptime requirements of user-facing applications.
Batch processes are less prone to the negative effects caused by spikes in data production. Like a dam, they can control the flow of data. But real-time applications must be able to handle and process ingestion peaks in real-time. Consider an eCommerce store on Black Friday. To support use cases like in-session personalization during traffic surges, your real-time infrastructure must respond to and scale with massive data spikes.
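One common way to absorb such peaks is to decouple ingestion from writes with a bounded buffer and flush in micro-batches. A rough Python sketch, in which the buffer size, batch size, and flush interval are arbitrary placeholders:

```python
import queue
import threading
import time

buffer = queue.Queue(maxsize=10_000)  # bounded: applies backpressure at the peak

def write_to_database(batch):
    print(f"wrote {len(batch)} events")  # placeholder for a real bulk insert

def writer():
    # Drain the buffer in micro-batches so a traffic spike becomes
    # a short queue of batched writes instead of a flood of single inserts.
    while True:
        batch = []
        deadline = time.monotonic() + 0.1        # flush at least every 100 ms
        while len(batch) < 500 and time.monotonic() < deadline:
            try:
                batch.append(buffer.get(timeout=0.01))
            except queue.Empty:
                pass
        if batch:
            write_to_database(batch)

threading.Thread(target=writer, daemon=True).start()

for i in range(2_000):           # a simulated Black Friday burst
    buffer.put({"event_id": i})  # blocks if the buffer is full (backpressure)
time.sleep(1)                    # let the writer drain the buffer
```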
To succeed with real-time analytics, engineers need to be able to manage and maintain data projects at scale and in production. This can be difficult without additional tooling and resources.
Enabling real-time observability
Failures in real-time infrastructure happen fast. Detecting and remediating scenarios that can negatively impact production requires real-time observability that can keep up with real-time infrastructure.
If you’re building real-time analytics in applications, it’s not enough for those applications to serve low-latency APIs. Your observability and alerting tools need to have similarly fast response times so that you can detect user-affecting problems quickly.
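As a toy illustration of observability that keeps up, here is a Python sketch that tracks request latencies over a short sliding window and raises an alert within seconds rather than waiting for the next batch report. The window length, threshold, and minimum sample count are made-up numbers.

```python
import time
from collections import deque

WINDOW_SECONDS = 30
P95_THRESHOLD_MS = 250   # made-up SLO for illustration

samples = deque()        # (timestamp, latency_ms)

def record(latency_ms):
    now = time.monotonic()
    samples.append((now, latency_ms))
    # Evict anything older than the window so the signal stays fresh.
    while samples and samples[0][0] < now - WINDOW_SECONDS:
        samples.popleft()
    check_alert()

def check_alert():
    if len(samples) < 20:  # avoid alerting on too little data
        return
    latencies = sorted(l for _, l in samples)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    if p95 > P95_THRESHOLD_MS:
        # In practice this would page someone; here we just print.
        print(f"ALERT: p95 latency {p95} ms over last {WINDOW_SECONDS}s")

for ms in [40] * 30 + [400] * 10:  # a sudden degradation
    record(ms)
```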
Evolving data projects in production
In a batch context, schema migrations and failed data pipelines might only affect internal consumers, and the effects appear more slowly. But in real-time applications, these changes will have immediate and often external ramifications.
For example, changing a schema in a dbt pipeline that runs every hour gives you exactly one hour to deploy and test new changes without affecting any business process.
Changes in real-time infrastructure, on the other hand, offer only milliseconds before downstream processes are affected. In real-time applications, schema evolutions and business logic changes are more akin to changes in backend software, where an introduced bug has an immediate, user-facing effect.
Schema migrations in real-time have zero margin for error.
In other words, changing a schema while you are writing and querying over 200,000 records per second is challenging, so a good migration strategy and tooling around deployments are critical.
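One pragmatic pattern for evolving a schema while data keeps flowing is to version events and accept both shapes during the transition, normalizing old records on read. A minimal Python sketch with an invented v1/v2 event shape:

```python
# Sketch: handle two event versions side by side during a live migration.
# The field names and version numbers are invented for illustration.

def normalize(event):
    version = event.get("schema_version", 1)
    if version == 1:
        # v1 stored a single "name" field; v2 splits it into two fields.
        first, _, last = event["name"].partition(" ")
        return {"schema_version": 2, "first_name": first, "last_name": last,
                "amount": event["amount"]}
    return event

# Old and new producers can write concurrently; readers see one shape.
v1 = {"name": "Ada Lovelace", "amount": 10.0}
v2 = {"schema_version": 2, "first_name": "Ada", "last_name": "Lovelace",
      "amount": 10.0}
assert normalize(v1) == normalize(v2)
```

Once every producer emits v2 and old records are backfilled, the v1 branch can be deleted, all without pausing ingestion.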
Managing cross-team collaboration
Until recently, data engineers and software developers often focused on different objectives. Data engineers and data platform teams built infrastructure and pipelines to serve business intelligence needs. Software developers and product teams designed and built applications for external users.
With real-time analytics, these two functions must come together. Companies pursuing real-time analytics must lean on data engineers and platform teams to build real-time infrastructure or APIs that developers can easily discover and build with. Developers must understand how to use these APIs to build real-time applications.
As you and your data grow, managing this collaboration becomes critical. You need systems and workflows in place that let developers and engineers “flow” in their work while still enabling effective cross-team work.
This shift in workflows may feel unfamiliar and slow. Still, data engineers and software developers will have to work closely to succeed with real-time analytics.
Controlling the cost of real-time analytics
This final challenge is ultimately a culmination of the prior six. New tools, new ways of working, increased collaboration, added scale, and complex deployment models all introduce new dependencies and requirements that, depending on your design, can yield either massive cost savings or - if you get it wrong - serious cost sinks.
If you’re not careful, added costs can appear anywhere and in many ways: more infrastructure and maintenance, more SREs, slower time to market, added tooling. Many are concerned that the cost of real-time analytics will outweigh the benefits.
There is always a cost associated with change, but if you do it right, you can achieve an impressive ROI. With the right tools, mindset, and architecture, real-time analytics can cut the cost of building new user-facing features while boosting revenue through powerful differentiation.
Despite its challenges, real-time analytics not only increases cost efficiency but also boosts revenue - if approached the right way.