The Team @ Redpanda

Originally published at redpanda.com

How we built our tiered storage subsystem

Introduction

One of the key premises of Redpanda is the unification of real-time and historical data by giving users the ability to store infinite data. However, in the modern cloud, the price of storage often dominates the price of all other computing resources. The cost of object storage is vastly lower than the cost of the local disk attached to a compute node, and object storage is often more reliable than the nodes themselves. To these ends, we created Shadow Indexing, a capability (available as a tech preview as of this writing) that allows us to capitalize on the 99.995% availability guarantee of a tier-4 datacenter.

What is Shadow Indexing?

Shadow Indexing is a subsystem that allows Redpanda to move large amounts of data between brokers and cloud object stores efficiently and cost-effectively, without any human intervention. This allows Redpanda to overcome data-center reliability limitations by storing data in object stores such as Amazon Simple Storage Service (S3) or Google Cloud Storage (GCS). S3 storage is far more cost-effective than attached storage in EC2 (around six times cheaper than provisioned-IOPS io2 EBS storage, and three to five times cheaper than GP2/GP3). It is also far more reliable than EC2 instances, providing a whopping 11 nines of durability.

For the end-user, Shadow Indexing means the ability to fetch both historical and real-time data using the same API transparently and without much performance overhead. With Shadow Indexing enabled, Redpanda migrates data to an object store asynchronously without any added complexity. This makes for a seamless transition of data from hot in-memory, to warm local storage, and to lukewarm object storage, allowing users to create topics with infinite retention while maintaining high performance.

By allowing for tiered storage, Shadow Indexing decouples the cluster’s load capacity from storage capacity. This allows operators to deploy Redpanda on fewer, smaller brokers, and with less storage, thereby reducing infrastructure costs and administrative overhead. By eliminating total storage capacity as a constraint, operators can freely size their clusters strictly according to the live load. Deploying brokers with less storage also has the benefit of improving mean-time-to-recovery (MTTR), as it greatly reduces the amount of log data that needs to be replicated in the event of a broker failure. Finally, Redpanda has the ability to restore topic data from the archive, giving administrators an additional method for disaster recovery in case of accidental deletion, or in the unlikely event of a cluster-wide failure.

Next, we’ll go into detail about the key components of Shadow Indexing, and how we built it.

Understanding the Shadow Indexing architecture

The Shadow Indexing subsystem has four main components:

  1. The scheduler_service, which uploads log segments and Shadow Indexing metadata to the object store
  2. The archival_metadata_stm, which stores information about uploaded segments locally in the Redpanda data directory (default /var/lib/redpanda/data)
  3. The cache_service, which temporarily stores data downloaded from the object store
  4. The remote_partition, which downloads data from the object store and serves it to clients

The scheduler_service and archival_metadata_stm are the main components of the write path, which uploads data to the object store bucket. The cache_service and remote_partition are elements of the read path, which is used to download data from the object store to satisfy a client request.


How the write path works

The scheduler_service component is responsible for scheduling uploads. It creates an ntp_archiver object for every partition and invokes individual archivers periodically in a fair manner to guarantee that all partitions are uploaded evenly. Note that uploads are always done by the partition leader. The archiver follows a naming scheme that defines where the log segments should go in the bucket. It is also responsible for maintaining the manifest, which is a JSON file that contains information about every uploaded log segment.
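To make the fairness requirement concrete, here is a minimal sketch (plain C++, not Redpanda's actual Seastar-based scheduler_service) of a round-robin pass that gives each partition's archiver one upload opportunity per cycle. The ntp_archiver_stub type and its method are illustrative.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Illustrative stand-in for a per-partition archiver; the real ntp_archiver
// uploads the next closed log segment of its partition to the object store.
struct ntp_archiver_stub {
    std::string partition;
    void upload_next_candidate() {
        std::printf("uploading next segment of %s\n", partition.c_str());
    }
};

int main() {
    std::vector<ntp_archiver_stub> archivers = {
        {"topic-a/0"}, {"topic-a/1"}, {"topic-b/0"}};
    // One scheduling pass: every archiver gets exactly one turn, so no
    // partition can be starved of upload bandwidth.
    for (auto& archiver : archivers) {
        archiver.upload_next_candidate();
    }
}
```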

Object stores shard workloads based on an object name prefix, so if all log segments have the same prefix, they will hit the same storage server. This leads to throttling and limits upload throughput. To ensure good upload performance, Redpanda inserts a randomized prefix into every object name. The prefix is computed using an xxHash hash function.
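As a rough illustration of this naming trick, the sketch below derives a prefix from the segment path with the xxhash library's XXH32() and prepends it to the object name. The exact key layout and seed that Redpanda uses are not shown here, and the helper name is made up.

```cpp
#include <xxhash.h>

#include <cstdio>
#include <string>

// Build an object name with a hash-derived prefix so uploads spread evenly
// across the object store's internal partitions (illustrative layout only).
std::string prefixed_object_name(const std::string& segment_path) {
    XXH32_hash_t h = XXH32(segment_path.data(), segment_path.size(), 0);
    char prefix[16];
    std::snprintf(prefix, sizeof(prefix), "%08x", (unsigned)h);
    return std::string(prefix) + "/" + segment_path;
}

int main() {
    std::printf("%s\n",
        prefixed_object_name("kafka/topic-a/0_1/100-200-v1.log").c_str());
}
```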

The archival subsystem uses a PID regulator to control the upload speed in order to prevent uploads from overwhelming the cluster. The regulator measures the size of the upload backlog (the total amount of data that needs to be uploaded to the object store) and changes the upload priority based on that. If the upload backlog is small, the priority of the upload process will be low and the occasional segment uploads won’t interfere with any other activities in the cluster. However, if the backlog is large, the priority will be higher and the uploads will use more network bandwidth and CPU resources.
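Here is a hedged sketch of the idea: a small PID-style controller that turns the measured backlog size into an upload priority. The gains, clamping range, and units are invented for illustration; the real regulator adjusts Redpanda's internal upload priority rather than returning a number.

```cpp
#include <algorithm>
#include <cstdio>

// Maps upload backlog size to an upload priority. Setpoint is an empty
// backlog, so the error term is simply the backlog itself.
class backlog_regulator {
public:
    // backlog_bytes: data not yet uploaded; dt: seconds since last update.
    double update(double backlog_bytes, double dt) {
        double error = backlog_bytes;
        integral_ += error * dt;
        double derivative = (error - prev_error_) / dt;
        prev_error_ = error;
        double output = kp_ * error + ki_ * integral_ + kd_ * derivative;
        return std::clamp(output, min_priority_, max_priority_);
    }

private:
    double kp_ = 1e-9, ki_ = 1e-11, kd_ = 0.0;   // illustrative gains
    double integral_ = 0.0, prev_error_ = 0.0;
    double min_priority_ = 1.0, max_priority_ = 1000.0;
};

int main() {
    backlog_regulator reg;
    std::printf("small backlog -> priority %.1f\n", reg.update(50e6, 1.0));
    std::printf("large backlog -> priority %.1f\n", reg.update(50e9, 1.0));
}
```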


Redpanda maintains some metadata in the cloud so the data can be used without the cluster. For every topic, we maintain a manifest file that contains information about that topic (for instance, the retention duration, number of partitions, segment size, etc.). For every topic partition, we also maintain a separate partition manifest that lists all log segments uploaded to cloud storage. This metadata, together with the individual object names, makes the bucket content self-sufficient and portable: it can be used to discover and recreate the topics, and the content of the bucket can also be accessed from different AWS regions.
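Conceptually, the uploaded metadata looks roughly like the structures below. The field names are illustrative only and do not reflect the exact JSON schema Redpanda writes to the bucket.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// What a topic manifest conceptually carries (illustrative fields).
struct topic_manifest {
    std::string ns;              // namespace, e.g. "kafka"
    std::string topic;
    int32_t partition_count;
    int64_t retention_ms;        // configured retention duration
    int64_t segment_size_bytes;  // configured segment size
};

// Per-segment metadata recorded in a partition manifest.
struct segment_meta {
    std::string name;            // object name of the uploaded segment
    int64_t base_offset;         // first offset in the segment
    int64_t committed_offset;    // last offset in the segment
    int64_t base_timestamp_ms;
    int64_t max_timestamp_ms;
};

// A partition manifest: one entry per uploaded segment.
struct partition_manifest {
    std::string ns;
    std::string topic;
    int32_t partition;
    std::vector<segment_meta> segments;
};

int main() {
    partition_manifest pm{"kafka", "orders", 0, {{"0-999-v1.log", 0, 999, 0, 1000}}};
    return pm.segments.size() == 1 ? 0 : 1;
}
```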

When the archiver uploads a segment to the object store, it adds segment metadata (e.g. segment name, base and last offsets, timestamps, etc) to the partition manifest. It also adds this information to the archival_metadata_stm — the state machine that manages the archival metadata snapshot.

The partition leader handles the write path and manages the archival state. For consistency, durability, and fault tolerance, this state needs to be replicated to the followers as well. Redpanda does this by sending state changes as configuration batches in the same Raft log as the user data, which allows the followers to update their local snapshots. In case of a leadership transfer, this ensures that any replica that takes over as the leader has the latest state and can start uploading new data immediately.

The snapshot is stored within every replica of the partition. Every time the leader of the partition uploads a segment and the manifest, it adds a configuration batch with information about the uploaded segment to the Raft log. This configuration batch gets replicated to the followers and then added to the snapshot. Because of that, every replica of the partition “knows” the whereabouts of every log segment that was uploaded to the object store bucket. In other words, we track the data stored inside the object store bucket using the same Raft group that is used to store and replicate the user data. This solution has some nice benefits. For example, a replica that is not the leader still has an up-to-date archival snapshot, so when a leadership transfer happens, the new leader can start uploading new data based on the snapshot state without downloading the manifest from the object store.

Another benefit the snapshot enables is smarter data retention. Because the archival metadata is available locally, the partition can use it to figure out which part of the log has already been uploaded and can therefore be deleted locally. This constitutes a safety mechanism in Redpanda that prevents the retention policy from deleting log segments that have not been uploaded to the object store yet.
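A minimal sketch of that safety mechanism, assuming the snapshot exposes the last offset known to be in the bucket; the types and helper are illustrative, not Redpanda's actual retention code.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct local_segment {
    int64_t base_offset;
    int64_t committed_offset;
    bool expired_by_retention;  // the retention policy wants to delete it
};

// A segment may only be deleted locally if retention expired it AND all of
// its offsets are covered by the archival metadata snapshot (i.e. uploaded).
bool safe_to_delete(const local_segment& seg, int64_t last_uploaded_offset) {
    return seg.expired_by_retention && seg.committed_offset <= last_uploaded_offset;
}

int main() {
    std::vector<local_segment> log = {
        {0, 999, true},      // old and uploaded -> deletable
        {1000, 1999, true},  // old but not yet uploaded -> must be kept
    };
    int64_t last_uploaded_offset = 999;
    for (const auto& seg : log) {
        std::printf("segment [%lld, %lld]: %s\n",
                    (long long)seg.base_offset, (long long)seg.committed_offset,
                    safe_to_delete(seg, last_uploaded_offset) ? "delete" : "keep");
    }
}
```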

How the read path works


The remote_partition component is responsible for handling the reads from cloud storage. The component uses data from the archival metadata snapshot to locate every log segment in the object store bucket. It also knows the offset range that it can handle based on the snapshot data.

When an Apache Kafka® client sends a fetch request, Redpanda decides how the request should be served. If possible, it is served using local data stored in the Raft log. However, if the data is not available locally, it is served using the remote_partition component. This means that even if the partition on the Redpanda node stores only recent data, the client will see offsets available starting from offset zero.

When Redpanda processes a fetch request, it checks whether the offsets are available locally and, if so, serves the local data back to the client. If the data is only available in cloud storage, it uses the remote_partition to retrieve it. The remote_partition can’t serve a fetch request directly from cloud storage: it first checks the archival snapshot to find the log segment that hosts the required offsets and copies that segment into the cache. The log segment is then scanned to return the record batches to the client. The cache is configured to store only a certain number of log segments at a time and evicts unused segments when there is not enough space left.
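The dispatch decision can be pictured roughly like this (illustrative types, not Redpanda's real classes): serve from the local log when the requested offset is still on disk, otherwise hydrate the segment via the remote_partition and cache.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

struct partition_view {
    int64_t local_start_offset;  // first offset still present locally
    int64_t log_end_offset;      // last offset in the partition
};

std::string route_fetch(const partition_view& p, int64_t fetch_offset) {
    if (fetch_offset > p.log_end_offset) {
        return "wait for new data";
    }
    if (fetch_offset >= p.local_start_offset) {
        return "serve from local log";
    }
    // Older offsets only exist in the bucket: download the segment into the
    // cache via remote_partition, then scan it to build the response.
    return "serve via remote_partition (download to cache first)";
}

int main() {
    partition_view p{5000, 9000};
    std::printf("offset 7000: %s\n", route_fetch(p, 7000).c_str());
    std::printf("offset   10: %s\n", route_fetch(p, 10).c_str());
}
```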

Internally, the remote_partition object stores a collection of lightweight objects that represent uploaded segments. When the segment is accessed, this lightweight representation is used to create a remote_segment. The remote_segment deals with a single log segment in the cloud. It can download the segment to the local storage, and can be used to create reader objects, which are used to serve fetch requests. Readers can fetch data from the log segment and materialize record batches. Think of them as something loosely similar to database cursors that can scan data in one direction. The remote_segment also maintains a compressed in-memory index, which is used to translate Redpanda offsets to offsets in the file.
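As a loose analogy for that index, the sketch below maps sampled offsets to byte positions in the downloaded segment file and returns a seek hint for a reader. The real index is compressed; this simplified version just keeps a sorted map.

```cpp
#include <cstdint>
#include <cstdio>
#include <iterator>
#include <map>

class segment_index {
public:
    void add(int64_t offset, int64_t file_pos) { index_[offset] = file_pos; }

    // Return the file position of the closest indexed offset at or below the
    // requested one; the reader scans forward from there.
    int64_t seek_hint(int64_t offset) const {
        auto it = index_.upper_bound(offset);
        if (it == index_.begin()) return 0;
        return std::prev(it)->second;
    }

private:
    std::map<int64_t, int64_t> index_;  // offset -> byte position in the file
};

int main() {
    segment_index idx;
    idx.add(1000, 0);
    idx.add(2000, 1 << 20);
    idx.add(3000, 2 << 20);
    std::printf("start scanning at byte %lld\n", (long long)idx.seek_hint(2500));
}
```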

The remote_partition also hosts the reader cache. This cache stores idle readers that can be reused by later fetch requests whose offset ranges match one of the cached readers. The remote_partition can also evict unused remote_segment instances by transitioning them to the offloaded state, in which they don’t consume system resources.

The latency profile of Shadow Indexing reads differs from that of normal reads: Shadow Indexing first has to retrieve data from the object store into the cache before it can serve fetch requests, and it doesn’t use the record batch cache. Because of that, it’s better suited to batch workloads.

Implementing Shadow Indexing in Redpanda

The development of the Shadow Indexing subsystem started with the write path. Seastar didn’t allow us to use existing object store clients efficiently, and the framework didn’t have an HTTP client that could be used to access the object store API. To overcome this challenge, we developed our own HTTP client, and on top of it our own S3 client, using Seastar.

The next step was developing the scheduler_service. This service schedules individual uploads from different partitions. This sounds easy on paper, but the task is actually quite challenging. Firstly, the scheduler needs to provide a fairness guarantee to prevent a situation in which one of the partitions doesn’t receive enough upload bandwidth and lags behind. That situation would be dangerous because it could cause a disk space leak by preventing the retention policy from doing its job. Secondly, every replica of a partition may have different segment alignment: the segments may begin and end on different offsets despite containing the same data. Because of this, we may end up in a situation where, after a leadership transfer, only part of a segment needs to be uploaded. The newly elected leader must be able to see what offset range has already been uploaded for the partition and compute a starting point for the next upload inside one of its segments, as sketched below.
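To make the alignment problem concrete, here is a small sketch (illustrative types only) of how a newly elected leader might compute where the next upload starts, given the last offset already uploaded according to the archival snapshot and its own, differently aligned local segments.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <optional>
#include <utility>
#include <vector>

struct local_segment {
    int64_t base_offset;
    int64_t committed_offset;
};

// Find which local segment the next upload starts in, and at which offset
// within it, given the last offset already present in the bucket.
std::optional<std::pair<size_t, int64_t>>
next_upload_start(const std::vector<local_segment>& log, int64_t last_uploaded) {
    for (size_t i = 0; i < log.size(); ++i) {
        if (log[i].committed_offset > last_uploaded) {
            int64_t start = std::max(log[i].base_offset, last_uploaded + 1);
            return std::make_pair(i, start);
        }
    }
    return std::nullopt;  // everything is already uploaded
}

int main() {
    // The new leader's segments don't line up with the old leader's uploads.
    std::vector<local_segment> log = {{0, 1499}, {1500, 2999}};
    auto next = next_upload_start(log, /*last_uploaded=*/1999);
    if (next) {
        std::printf("resume upload in segment %zu at offset %lld\n",
                    next->first, (long long)next->second);
    }
}
```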

Once all the bits and pieces of the write path were in place, we started working on topic recovery, which allows Redpanda to restore a topic using data from the object store. However, it’s not possible to just download the data: for Redpanda to be able to use it, we need to create a proper log and bootstrap a Raft group. There is a lot of bookkeeping used to manage Raft state outside the log itself, and this needs to be taken care of. The recovery process has to create entries in the internal KV-store, a Raft snapshot, an archival metadata snapshot, and so on. The log itself also needs to be pre-processed upon download: it contains non-data messages used by Raft and other subsystems, and these messages can break the newly created Raft group. To remove them, the downloaded log has to be patched on the fly, with offsets updated and checksums recalculated.

Next, we developed the components of the read path: the cache_service and the remote_partition, which relies on the archival metadata snapshot maintained by the archival_metadata_stm.

The cache_service is a tricky affair because it is global (per node), but Seastar likes everything to be sharded per CPU. If we sharded the cache, one hot partition could theoretically cause a lot of downloads from the object store while other shards sat underutilized. With one global cache_service this isn’t a problem, though it does mean we have to consider other issues. Cache eviction, for instance, has to be done globally, which requires coordination between the shards.
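A toy version of the single, node-wide cache with a size budget might look like the sketch below; the real cache_service works on files in the cache directory and coordinates eviction across Seastar shards, neither of which is shown here.

```cpp
#include <cstdint>
#include <cstdio>
#include <list>
#include <string>
#include <unordered_map>

class segment_cache {
public:
    explicit segment_cache(int64_t max_bytes) : max_bytes_(max_bytes) {}

    void put(const std::string& name, int64_t size) {
        // Evict least recently used entries until the new segment fits.
        while (used_bytes_ + size > max_bytes_ && !lru_.empty()) {
            const std::string& victim = lru_.back();
            std::printf("evicting %s\n", victim.c_str());
            used_bytes_ -= sizes_[victim];
            sizes_.erase(victim);
            lru_.pop_back();
        }
        lru_.push_front(name);
        sizes_[name] = size;
        used_bytes_ += size;
    }

private:
    int64_t max_bytes_;
    int64_t used_bytes_ = 0;
    std::list<std::string> lru_;                       // front = most recent
    std::unordered_map<std::string, int64_t> sizes_;
};

int main() {
    segment_cache cache(2UL << 30);  // 2 GiB budget (illustrative)
    cache.put("seg-1", 1UL << 30);
    cache.put("seg-2", 1UL << 30);
    cache.put("seg-3", 1UL << 30);   // forces eviction of seg-1
}
```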

The remote_partition is interesting because it has to track all uploaded segments at once. In order to serve fetch requests, it should be able to locate individual segments, download them to the cache directory, and create readers (the cursor-like objects). The problem here is that the bucket may contain much more data than a Redpanda node can handle (think of how much data Redpanda can send to the object store over several years). Because of that, the remote_partition can’t just create a remote_segment object for every log segment it tracks. Instead, it creates remote_segment instances on demand and destroys them when they have been idle long enough. Because of this, the remote_partition has to be a complex state machine that keeps a bit of internal state for every uploaded segment.
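The on-demand lifecycle can be sketched as a tiny state machine (state names and types are illustrative): each tracked segment stays in a cheap offloaded state until a reader needs it, and falls back to that state after sitting idle.

```cpp
#include <cstdio>
#include <string>

enum class segment_state {
    offloaded,     // only lightweight metadata is kept; no resources held
    materialized,  // a remote_segment exists; data may be in the local cache
};

struct tracked_segment {
    std::string name;
    segment_state state = segment_state::offloaded;

    void materialize() {
        if (state == segment_state::offloaded) {
            // In Redpanda this would create a remote_segment and start
            // hydrating it into the cache directory.
            state = segment_state::materialized;
        }
    }
    void offload_if_idle(bool idle_long_enough) {
        if (idle_long_enough && state == segment_state::materialized) {
            state = segment_state::offloaded;  // free memory and cache space
        }
    }
};

int main() {
    tracked_segment seg{"500-999-v1.log"};
    seg.materialize();          // a fetch touched this offset range
    seg.offload_if_idle(true);  // no readers for a while -> shrink back
    std::printf("final state: %s\n",
                seg.state == segment_state::offloaded ? "offloaded" : "materialized");
}
```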

The process of developing Shadow Indexing had its fair share of challenges but, in the end, we dare say, the system came together nicely.

Conclusion

The new Shadow Indexing feature allows for infinite data retention with good performance at a low cost. It offers application developers ease of use and flexibility when designing applications. For operators, it allows cluster infrastructure to scale optimally according to live load, provides additional tools for data recovery, and helps improve MTTR by reducing the amount of data that needs to be replicated when a broker fails.

This is just the start. We have further enhancements planned for Redpanda in the future, including:

  • Full cluster recovery. Redpanda will support a complete restore from the object store based on data that has been uploaded to the archive.
  • Faster data balancing. When adding new brokers to the cluster, Redpanda supports automatic balancing by replicating partition data to the new brokers using Raft. This process can be made more resource-efficient by having the new brokers fetch only the parts of the log that haven’t been archived and serve the rest from the object store using Shadow Indexing.
  • Analytical clusters. Some workloads are analytical in nature and involve reading large chunks of historical data to perform analysis. These workloads tend to be ad-hoc, disruptive, and have an adverse effect on operational workloads with strict SLAs. Redpanda will provide a way to deploy analytical clusters that have read-only access to archived data in the object store. This will allow for true elasticity, as multiple read-only clusters can be deployed and decommissioned according to need. It will also provide true isolation of real-time operational workloads from analytical workloads.

To learn more about Shadow Indexing and how to use it, please view our documentation here, or join our Slack community if you have specific questions, feedback, or ideas for enhancements!
