Welcome to this series of blog posts which covers Redis Streams with the help of a practical example. We will use a sample application to make Twitter data available for search and query in real-time. RediSearch and Redis Streams serve as the backbone of this solution that consists of several co-operating components, each of which will we covered in a dedicated blog post.
The code is available in this GitHub repo - https://github.com/abhirockzz/redis-streams-in-action
This is the first part which explores the use case, motivations and provides a high level overview of the Redis features used in the solution.
The use case is relatively simple. As an end goal, we want to have a service that allows us to search for tweets based on some criteria such as hashtags, user, location etc. Of course, there are existing solutions for this. The one presented in this blog series is an example scenario and can be applied to similar problems.
Here is a summary of the individual components:
- Twitter Stream Consumer: A Rust application to consume streaming Twitter data and pass them on to Redis Streams. I will demonstrate how to run this as a Docker container in Azure Container Instances
- Tweets Processor: The tweets from Redis Streams are processed by a Java application - this too will be deployed (and scaled) using Azure Container Instances.
- Monitoring service: The last part is a Go application to monitor the progress of the tweets processor service and ensure that any failed records are re-processed. This is a Serverless component which will be deployed to Azure Functions where you can run it based on a Timer trigger and only pay for the duration it runs for.
I have used a few Azure services (including Enterprise tier of Azure Cache for Redis that supports Redis modules such as
RedisTimeSeries and Redis Bloom) to run different parts of the solution, but you can tweak the instructions a little bit and apply them as per your environment e.g. you can use use Docker to run everything locally! Although the individual services have been written in different programming languages, the same concepts apply (in terms of Redis Streams,
RediSearch, scalability etc.) and can be implemented in the language of your choice.
I had written a blog post RediSearch in Action that covered the same use case i.e. how to implement a set of applications for consuming tweets in real-time, index them in RediSearch and query them using a REST API. However, the solution presented here has been implemented with the help of Redis Streams along with other components in order to make the architecture scalable and fault-tolerant. In this specific example, it's the ability to process large volume of tweets, but the same idea can be extended/applied to other use-cases which deal with high velocity data e.g. IoT, log analytics, etc. Such problems benefit from an architecture where you can horizontally scale out your applications to handle increasing data volumes. Typically, this involves introducing a Messaging system to act a buffer between producers and consumers. Since this is a common requirement and the problem space is well understood, there are lot of established solutions in the distributed messaging world ranging from JMS (Java Messaging Service), Apache Kafka, RabbitMQ, NATS, and of course Redis.
There is something unique about Redis though. From a messaging point of view, Redis is quite flexible since it provides multiple options to support different paradigms, hence serving a wide range of use cases. It's features include Pub-Sub, Lists (worker queue approach) and Redis Streams. Since this blog series is focuses on Redis Streams, I will provide a quick over view of the other possibilities before moving on.
- Pub-Sub: it follows a based broadcast paradigm where multiple receivers can consume messages sent to a specific channel. Producers and consumers are completely decoupled, but note that there is no concept of message persistence i.e. if a consumer app is not up and running, it does not get those messages when it comes back on later.
- Lists: they allow us to adopt a worker-queue based approach which can distribute load among worker apps. the messages are removed once they are consumed. it can provide some level of fault-tolerance and reliability using
Introduced in Redis 5.0, Redis Streams provides the best of Pub/Sub and Lists along with reliable messaging, durability for messages replay, Consumer Groups for load balancing, Pending Entry List for monitoring and much more! What makes it different is that fact it is a
append-only log data structure. In a nutshell, producers can add records (using
XADD), consumers can subscribe to new items arriving to the stream (with
XREAD). It supports range queries (
XRANGE etc.) and thanks to consumer groups, a group of apps can distribute the processing load (
XREADGROUP) and its possible to monitor its state (
Since the magic of Redis lies in its powerful command system, let's go over some of the Redis Streams commands, grouped by functionality for easier understanding:
There is only one way you can add messages to a Redis Stream. XADD appends the specified stream entry to the stream at the specified key. If the key does not exist, as a side effect of running this command the key is created with a stream value.
XRANGE returns the stream entries matching a given range of IDs (the
+special IDs mean respectively the minimum ID possible and the maximum ID possible inside a stream)
XREVRANGE is exactly like
XRANGE, but with the difference of returning the entries in reverse order (use the end ID first and the start ID later)
- XREAD reads data from one or multiple streams, only returning entries with an ID greater than the last received ID reported by the caller.
XREADGROUP is a special version of the
XREADcommand with support for consumer groups. You can create groups of clients that consume different parts of the messages arriving in a given stream
Manage Redis Streams
XACK removes one or multiple messages from the Pending Entries List (
PEL) of a stream consumer group.
- XGROUP is used to manage the consumer groups associated with a Redis stream.
- XPENDING is the used to inspect the list of pending messages to observe and understand what is happening with a streams consumer groups.
- XCLAIM is used to acquire the ownership of the message and continue processing.
XAUTOCLIAM transfers ownership of pending stream entries that match the specified criteria. Conceptually,
XAUTOCLAIMis equivalent to calling
- XDEL removes the specified entries from a stream, and returns the number of entries deleted, that may be different from the number of IDs passed to the command in case certain IDs do not exist.
- XTRIM trims the stream by evicting older entries (entries with lower IDs) if needed.
For a detailed, I would highly recommend reading "Introduction to Redis Streams" (from the official Redis docs).
Redis has a versatile set of data structures ranging from simple Strings all the way to powerful abstractions such as Redis Streams. The native data types can take you a long way, but there are certain use cases that may require a workaround. One example is the requirement to use secondary indexes in Redis in order to go beyond the key-based search/lookup for richer query capabilities. Though you can use Sorted Sets, Lists, and so on to get the job done, you’ll need to factor in some trade-offs.
Available as a Redis module,
RediSearch provides flexible search capabilities, thanks to a first-class secondary indexing engine. Some of its key features include full-text search, auto completion, and geographical indexing. There are a bunch of other features whose detailed exploration is out of scope of this blog series. I would highly recommend you to go through the documentation to explore further. For now, here is a quick overview of some of the
RediSearch commands. You will see them in action in subsequent blog posts.
Two of the most important commands include creating an index and executing search queries:
- FT.CREATE is used to create an index with a given schema and associated details.
- FT.SEARCH searches the index with a textual query, returning either documents or just ids.
You can execute other operations on indices:
- FT.DROPINDEX deletes the index. Note that by default, it does not delete the document hashes associated with the index
- FT.INFO returns information and statistics on the index such as number of documents, number of distinct terms and more.
- FT.ALTER SCHEMA ADD adds a new field to the index. This causes future document updates to use the new field when indexing and re-indexing of existing documents.
To work with auto-complete features, you can use "suggestions":
- FT.SUGADD adds a suggestion string to an auto-complete suggestion dictionary.
- FT.SUGGET gets completion suggestions for a prefix.
RediSearch supports synonyms which is a data structure comprised of a set of groups, each of which contains synonym terms. FT.SYNUPDATE can be used to create or update a synonym group with additional terms.
If you want query spell check correction (similar to "did you mean" feature), you can use FT.SPELLCHECK which performs spelling correction on a query, returning suggestions for misspelled terms.
A dictionary is a set of terms. Dictionaries can be used to modify the behavior of RediSearch's query spelling correction, by including or excluding their contents from potential spelling correction suggestions. You can use FT.DICTADD and FT.DICTDEL to add and delete terms, respectively.
That's it for this part!
As I mentioned, this was just an introduction. Over the course of next three blog posts, you will learn about the details of the individual components used to build the solution. You will deploy, run and validate them on Azure as well as walk through the code to get a better understanding to what's happening "behind the scenes". Stay tuned!