tarantool

Posted on Jun 29, 2022

How we compress data in large projects

#programming #database #datascience #http

Hi everyone! My name is Alexander Klenov, and I work at Tarantool. In April, we released Tarantool 2.10 Enterprise Edition, an updated version of the Tarantool in-memory computing platform. Version 2.10 includes several new features.

In this article, I want to talk in detail about one of these features: data compression in RAM. I am going to tell you how to use it, what it can and cannot do, and what to pay attention to.

How to enable data compression

It's very easy to do: just specify in the field's settings that it needs to be compressed, using compression = '<option>'.

There are two compression options available:

• zstd,
• lz4.

For example:

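-- space_name is assumed to be defined earlier as a string with the space's name
local space = box.schema.space.create(space_name, { if_not_exists = true })
space:format({
    {name = 'uid', type = 'string'},
    -- only the 'body' field is stored compressed
    {name = 'body', type = 'string', compression = 'zstd'},
})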

If your project already has some uploaded data, you'll need to perform a background migration:

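-- The upgrade needs a function to apply to every tuple; this one returns the tuple unchanged
box.schema.func.create('noop', {
    is_deterministic = true, body = 'function(t) return t end'
})

-- Rewrite the existing tuples according to the new format, with 'body' compressed
space:upgrade{func = 'noop', format = {
    {name = 'uid', type = 'string'},
    {name = 'body', type = 'string', compression = 'zstd'},
}}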
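If several fields or spaces need the same change, the format table can be built programmatically. The sketch below is plain Lua and only manipulates tables; the helper name with_compression is hypothetical and not part of Tarantool. It assumes the two options listed above (zstd and lz4) are the only valid values.

```lua
-- Hypothetical helper (not a Tarantool API): returns a copy of a space format
-- in which the listed fields carry the given compression option.
local SUPPORTED = { zstd = true, lz4 = true }

local function with_compression(format, fields, algo)
    assert(SUPPORTED[algo], 'unsupported compression option: ' .. tostring(algo))
    local wanted = {}
    for _, name in ipairs(fields) do
        wanted[name] = true
    end
    local result = {}
    for i, field in ipairs(format) do
        -- Copy each field description so the original format stays untouched.
        local copy = {}
        for k, v in pairs(field) do
            copy[k] = v
        end
        if wanted[field.name] then
            copy.compression = algo
        end
        result[i] = copy
    end
    return result
end

local base_format = {
    {name = 'uid', type = 'string'},
    {name = 'body', type = 'string'},
}
local new_format = with_compression(base_format, {'body'}, 'lz4')
-- In Tarantool EE, new_format could then be passed to space:format(new_format)
-- or to space:upgrade{func = 'noop', format = new_format}.
```

Returning a copy rather than modifying the table in place keeps the old format available in case the migration has to be described or rolled back elsewhere.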

See the documentation for more details.

What can be achieved by using data compression?

So, we've set up the feature. Let's see what it can give us. As an example, we'll use real data from a major telecom company: 100 thousand different JSON documents with a total volume of 316 MB.

We will try out different compression mechanisms. The CPU in our case runs at 3.6 GHz. To make the comparison more informative, let's also add compression using the external ZLIB library.

Testing showed the following results:

* ZLIB is an external library.

See the source code of the test on GitHub.

The key conclusion is that the three methods behave differently. ZSTD compresses efficiently but slowly, although it unpacks documents rather quickly. LZ4 compresses and unpacks quickly, but its compression ratio is lower than that of the other methods. ZLIB achieves a compression ratio comparable to ZSTD and compresses reasonably fast, but its decompression speed is rather low.
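A comparison like this boils down to three numbers per method: compression ratio, compression time, and decompression time. The following is a minimal sketch of such a measurement in plain Lua, not the test code from the repository. The codec is passed in as a table with compress and decompress functions; toy_rle is a stand-in run-length codec used only so the sketch runs on its own, and a real test would plug in ZSTD, LZ4, or ZLIB bindings instead.

```lua
-- Measures compression ratio and CPU time for one codec over a list of documents.
-- `codec` is any table with compress(s) and decompress(s) functions.
local function benchmark(codec, docs)
    local raw_bytes, packed_bytes = 0, 0
    local packed = {}

    local started = os.clock()
    for i, doc in ipairs(docs) do
        packed[i] = codec.compress(doc)
        raw_bytes = raw_bytes + #doc
        packed_bytes = packed_bytes + #packed[i]
    end
    local compress_time = os.clock() - started

    started = os.clock()
    for i, blob in ipairs(packed) do
        -- A codec that does not round-trip is useless, so fail loudly.
        assert(codec.decompress(blob) == docs[i], 'round-trip mismatch')
    end
    local decompress_time = os.clock() - started

    return {
        ratio = raw_bytes / packed_bytes,
        compress_time = compress_time,
        decompress_time = decompress_time,
    }
end

-- Stand-in codec: naive run-length encoding as (count byte, character) pairs.
local toy_rle = {
    compress = function(s)
        local out, i = {}, 1
        while i <= #s do
            local ch = s:sub(i, i)
            local run = 1
            while run < 255 and s:sub(i + run, i + run) == ch do
                run = run + 1
            end
            out[#out + 1] = string.char(run) .. ch
            i = i + run
        end
        return table.concat(out)
    end,
    decompress = function(s)
        local out = {}
        for i = 1, #s, 2 do
            out[#out + 1] = s:sub(i + 1, i + 1):rep(s:byte(i))
        end
        return table.concat(out)
    end,
}

local docs = { string.rep('a', 100), 'abc' }
local result = benchmark(toy_rle, docs)
print(string.format('ratio %.2f, compress %.4fs, decompress %.4fs',
    result.ratio, result.compress_time, result.decompress_time))
```

Measuring compression and decompression separately matters here because, as noted above, a method can be fast in one direction and slow in the other.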

Compression in a cluster

In real-life projects, Tarantool instances are divided by roles. For our example, it's important to distinguish two roles:

• Router
• Storage

Compressing data in a cluster is not as straightforward as compressing it on a single node. There are several ways to do it, depending on where the compression is performed, and each has its advantages and disadvantages. Let's take a look at three options.

Option one — quick result.

Here, compression and decompression happen in the storage. This can be implemented with the built-in functionality described in this article. It's useful when you need to free a lot of space quickly with minimal changes.

Sounds good, but this puts additional load on the storage's CPU and slows it down. Moreover, each replica repeats the same compression. As a result, the performance of the entire cluster decreases. A storage with a built-in packer is also more expensive to scale: instead of adding just one more packer, you have to add a whole storage with a packer in it.

See an example of how to do this here:

The local variable enable_zlib in the router must be set to false. In the storage, on line 11, you must specify the compression type.

This option is good when you need to free space quickly. It works well as a temporary solution, but can also become a permanent one, if the load allows it. It's cheap and fast.

But if you want to keep the cluster performance at the same level, it will take more effort.

Option two — maximum performance

Compression and decompression happen in the router. This method offers better performance, but the built-in compression won't work here: you have to connect an external library and implement additional logic for it. The master and replicas will receive already packed data.

As an example, we'll use the router-storage pair described above, except that here the enable_zlib variable in the router must be set to true. We don't need to specify the compression type in the storage in this case.

Advantages:

• We won't load the storage.
• We are reducing the traffic between the router and the storage.
• It's easier to scale — we just need to set up the necessary number of routers with a packer until we reach the required level of performance.

Disadvantages:

• We need to implement packer logic on the router.
• Compression and decompression add latency on the router.
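The packer logic on the router can be sketched as a thin wrapper around the calls to the storage. This is a simplified illustration in plain Lua, not the router from the example repository. The names new_compressing_router, storage, and codec are hypothetical. The storage is an in-memory table, and the codec only tags the data so the flow is visible. In a real project, the codec would be an external library such as ZLIB and the storage calls would go over the network.

```lua
-- Wraps a storage client so that values are packed before they leave the router
-- and unpacked after they come back. `storage` and `codec` are injected.
local function new_compressing_router(storage, codec)
    local router = {}

    function router.put(uid, body)
        -- The storage (and therefore its replicas) only ever sees the packed body.
        return storage.put(uid, codec.compress(body))
    end

    function router.get(uid)
        local packed = storage.get(uid)
        if packed == nil then
            return nil
        end
        return codec.decompress(packed)
    end

    return router
end

-- Stand-ins for illustration only: an in-memory storage and a codec that
-- just prefixes the data instead of really compressing it.
local data = {}
local memory_storage = {
    put = function(uid, body) data[uid] = body end,
    get = function(uid) return data[uid] end,
}
local tagging_codec = {
    compress = function(s) return 'Z:' .. s end,
    decompress = function(s) return s:sub(3) end,
}

local router = new_compressing_router(memory_storage, tagging_codec)
router.put('u1', 'hello')
print(data['u1'], router.get('u1'))
```

Because the codec is injected, the same wrapper can be run with compression switched off by passing a codec whose functions return their argument unchanged, which is similar in spirit to the enable_zlib switch mentioned above.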

If we need more flexibility, we can create a separate role to manage data compression. It would allow configuring the number of routers separately from the number of compression instances, so resource utilization becomes more flexible. Let's take a look at the third option.

Option three — lazy compression

We compress the data in a separate instance and decompress it in the router. So, we write the data to the storage as it is, uncompressed. Then we implement a separate packer role that goes through master storages and packs everything that is not packed. If new data is uploaded too fast, and you need to pack more quickly, you can simply add more packer instances. The replicas will receive already packed data.

For better understanding of this concept, see an example here.

It can be tested with the same load test. The test will show the same results as writing without compression, since the compression does not occur during writing but separately, in parallel.
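The idea behind the packer role can be sketched as follows. This is plain Lua over an in-memory list, not the example from the repository. The function names, the packed flag, and the limit parameter are assumptions made for illustration. In a real cluster, the records would be tuples read from the master storages and the codec would be a real compression library.

```lua
-- One pass of a hypothetical packer: compress records that are not packed yet,
-- at most `limit` of them per pass so that a pass stays short.
local function pack_pending(records, compress, limit)
    local packed_count = 0
    for _, record in ipairs(records) do
        if packed_count >= limit then
            break
        end
        if not record.packed then
            record.body = compress(record.body)
            record.packed = true
            packed_count = packed_count + 1
        end
    end
    return packed_count
end

-- Router-side read: unpack only what the packer has already processed.
local function read_body(record, decompress)
    if record.packed then
        return decompress(record.body)
    end
    return record.body
end

-- Stand-in codec for illustration: it only tags the data.
local function toy_compress(s) return 'Z:' .. s end
local function toy_decompress(s) return s:sub(3) end

local records = {
    {uid = 'a', body = 'one'},
    {uid = 'b', body = 'two'},
    {uid = 'c', body = 'three'},
}
local first_pass = pack_pending(records, toy_compress, 2)
local second_pass = pack_pending(records, toy_compress, 2)
print(first_pass, second_pass)
```

The router has to handle both states, because at any moment some records are already packed and some are not. That is why the sketch keeps an explicit flag next to the body.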

Advantages:

• There is no lag on the router for data write.
• Data compression scales with maximum efficiency.
• There are fewer cluster instances.
• We don't load the storage with compression as in the first case.

Disadvantages:

• Additional traffic in the cluster — between the storage and the packer.
• Additional load on the storage and one more read and write operation.
• We need to implement a separate role — packer.
• We need to implement unpacking logic on the router.

It can be useful if:

• You want to write data quickly — at once, without compression.
• You want to regulate compression speed by increasing or decreasing the number of instances, and use resources more flexibly.

Moving compression to the router

After looking at different options, let's see what happens when we move compression from the storage to the router, since this is the most efficient method. Let's build a small cluster with two instances — a router and a storage. Then, we'll load it while using different compression methods.

Test bench implementation:

• Router implementation.
• Storage implementation.
• Module for write load testing. Supplemental module for generating data.

The maximum test duration is set to 30 seconds: short enough that the tests finish quickly, yet long enough to put the system under load. During this time, we generate 100,000 write requests to the router on behalf of 10 virtual users. See the console output here. The results are summarized in the table below.

Based on the key parameters — number of requests per second (RPS), latency (http_req_duration), and total request execution time (iteration_duration) — compression on the router performs much better than compression on the storage.

This is primarily because ZSTD compression takes longer than ZLIB compression. In addition, data travels from the router to the storage already compressed, so less data is written to the storage. Another advantage is that the storage is not loaded with compression work.

Conclusion

Tarantool keeps growing and gaining new features. The compression mechanism built into 2.10 EE is useful, but it can't solve everything. The best way to compress data depends on the specifics of your project, hence the following recommendations.

If you have no need or no opportunity to use an external library, you can use the built-in LZ4 compression. It frees about 40% of the space at the cost of a roughly 10% drop in RPS (6351 with LZ4 vs. 7076 without compression). And it can be enabled with a single setting in the space format, which, in my opinion, is a great advantage.
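The trade-off can be checked with a little arithmetic. The sketch below uses only figures from this article; applying the 40% saving to the 316 MB test data set is an assumption made for illustration.

```lua
-- Relative drop of `value` compared to `baseline`, as a fraction.
local function relative_drop(baseline, value)
    return (baseline - value) / baseline
end

local rps_drop = relative_drop(7076, 6351)   -- RPS without compression vs. with LZ4
local size_after_mb = 316 * (1 - 0.40)       -- assumed: 40% saved on the 316 MB data set

print(string.format('RPS drop: %.1f%%', rps_drop * 100))        -- prints "RPS drop: 10.2%"
print(string.format('Size after LZ4: %.1f MB', size_after_mb))  -- prints "Size after LZ4: 189.6 MB"
```

In other words, under these assumptions roughly a tenth of the write throughput is traded for about 126 MB of RAM on a data set of this size.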

If you want better, more efficient, or cheaper data compression, you'll need to put in the additional effort described above.

You can download Tarantool on the official website and get help in our Telegram chat.

Top comments (1)

Sokolov Yura • Jul 1 '22

Why is the ZSTD compression time so huge? It looks like you set too high a compression level. If you lower the compression level until the compression ratio equals that of ZLIB, ZSTD should compress considerably faster than ZLIB.
Also, disable multithreaded compression. I don't remember whether it is enabled by default; that needs to be checked.