Migrating ClickHouse's warm & cold data to object storage with JuiceFS

ClickHouse is quite popular now, but growing data volumes bring problems: users face a conflict between storage capacity, query performance, and storage cost. To cope with this, ClickHouse provides some architectural strategies for data tiering.
Meanwhile, object storage has gained favor with its low price and elastic, scalable capacity. However, using ClickHouse directly on object storage hurts query performance and causes compatibility issues.

This article will elaborate on the fundamentals of hot and cold data tiering in ClickHouse and introduce how to use JuiceFS to solve the problems that occur on object storage.

ClickHouse’s data tier architecture

The image below shows a set of Russian nesting dolls, from large to small, which illustrates ClickHouse's data management model in the MergeTree engine. The MergeTree engine consists of the following elements:

  • Table. It is the largest concept, shown on the far right of the image, and the first thing a user creates or accesses in ClickHouse.
  • Partition. A smaller dimension with finer granularity than the table. In ClickHouse, data is divided into partitions for storage, and each partition has an identity.
  • Part. Each partition can be further divided into smaller parts. If you have checked the format of the data stored on ClickHouse disks, each subdirectory can be considered a part.
  • Column. A part contains data at an even smaller granularity, i.e., columns. Data is stored by column in ClickHouse, so a part directory contains many column files; for example, a table with 100 columns will have 100 column files.
  • Block. Each column file is organized at the granularity of blocks.

(Figure: ClickHouse's MergeTree data model, from table down to block.)

In the following example, there are four subdirectories under the table directory, and each subdirectory is a part.

$ ls -l /var/lib/clickhouse/data/<database>/<table>
drwxr-xr-x  2 test  test    64B Aug  8 13:46 202208_1_3_0
drwxr-xr-x  2 test  test    64B Aug  8 13:46 202208_4_6_1
drwxr-xr-x  2 test  test    64B Sep  8 13:46 202209_1_1_0
drwxr-xr-x  2 test  test    64B Sep  8 13:46 202209_4_4_0

In the rightmost column of the listing above, each subdirectory name starts with a time prefix, e.g. 202208. This prefix is the partition name, which can be defined by the user but is usually time-based.

For example, the partition 202208 has two subdirectories (i.e., parts); a partition usually consists of multiple parts. When data is written to ClickHouse, it is written to memory first and then persisted to disk according to the in-memory data structure. If the data in a partition is too large, the partition will end up as many parts on disk. ClickHouse does not recommend keeping too many parts under one table, so it also merges parts to reduce their total number; this is one of the reasons it is called the MergeTree engine.
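To see how a table's partitions map to parts on a running node, you can query the system.parts table. The sketch below assumes a table named test in the default database; the names are illustrative.

-- List active parts with the partition and disk each one belongs to
SELECT partition, name AS part, rows, disk_name
FROM system.parts
WHERE database = 'default' AND table = 'test' AND active
ORDER BY partition, part;

-- Optionally trigger merges for one partition to reduce the part count
OPTIMIZE TABLE test PARTITION 202208 FINAL;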

Another example helps us understand the concept of a part in ClickHouse. A part contains many small files, some of which hold meta information, such as index information that speeds up lookups.

$ ls -l /var/lib/clickhouse/data/<database>/<table>/202208_1_3_0
-rw-r--r--  1 test  test    ?? Aug  8 14:06 ColumnA.bin
-rw-r--r--  1 test  test    ?? Aug  8 14:06 ColumnA.mrk
-rw-r--r--  1 test  test    ?? Aug  8 14:06 ColumnB.bin
-rw-r--r--  1 test  test    ?? Aug  8 14:06 ColumnB.mrk
-rw-r--r--  1 test  test    ?? Aug  8 14:06 checksums.txt
-rw-r--r--  1 test  test    ?? Aug  8 14:06 columns.txt
-rw-r--r--  1 test  test    ?? Aug  8 14:06 count.txt
-rw-r--r--  1 test  test    ?? Aug  8 14:06 minmax_ColumnC.idx
-rw-r--r--  1 test  test    ?? Aug  8 14:06 partition.dat
-rw-r--r--  1 test  test    ?? Aug  8 14:06 primary.idx

In the rightmost column of the above listing, the files prefixed with Column are the actual data files, which are relatively large compared to the meta information. There are only two columns in this example, A and B, but a table may consist of many columns in real use. All these files, including the meta and index information, together help users quickly jump between and look up files.
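You can also inspect the per-column files of a part without looking at the disk, for example through the system.parts_columns table. A minimal sketch, reusing the part name from the listing above and an assumed table named test:

-- Storage details of each column in one part (table and part names are examples)
SELECT column, column_bytes_on_disk, column_data_compressed_bytes
FROM system.parts_columns
WHERE table = 'test' AND name = '202208_1_3_0' AND active;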

ClickHouse storage policy

If you want to tier hot and cold data in ClickHouse, you use a mechanism similar to the lifecycle policy in Elasticsearch (ES), which is called a storage policy in ClickHouse.
Slightly different from ES, ClickHouse does not divide data into stages such as hot, warm, and cold. Instead, it provides rules and configuration options that let users build their own data tiering policy.
Each ClickHouse node supports configuring multiple disks at the same time, and the storage media can vary. For example, users usually equip a ClickHouse node with an SSD for better performance, while warm and cold data can be stored on a cheaper medium such as a mechanical disk. ClickHouse users are not aware of the underlying storage medium.
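A quick way to see which disks a node has been configured with, and how much space is left on each, is the system.disks table:

-- Disks known to this node and their remaining space
SELECT name, path, formatReadableSize(free_space) AS free, formatReadableSize(total_space) AS total
FROM system.disks;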

Similar to ES, ClickHouse users need to create a storage policy based on data characteristics, such as the size of each part subdirectory or the proportion of free space left on the disk. The policy is triggered when one of these characteristics is observed, and it migrates a part from one disk to another. Multiple disks configured on the same node have priorities, and by default data lands on the highest-priority disk. This is how a part is transferred from one storage medium to another.

Data migration can also be triggered manually in ClickHouse through SQL commands such as MOVE PARTITION/PART, which users can use for functional validation as well. In addition, there are cases where a part explicitly needs to be moved from its current storage medium to another one by hand.
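A hedged sketch of such manual moves, assuming the table test, the partition 202208, and the disk and volume names (jfs, cold) used in the configuration later in this article:

-- Move a whole partition to the JuiceFS disk
ALTER TABLE test MOVE PARTITION 202208 TO DISK 'jfs';

-- Move a single part to the cold volume
ALTER TABLE test MOVE PART '202208_1_3_0' TO VOLUME 'cold';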

ClickHouse also supports a time-based migration policy, which is independent of the storage policy. After data is written, ClickHouse triggers data migration on disk according to the TTL set on each table. For example, if the TTL is set to 7 days, ClickHouse will rewrite the data in the table that is older than 7 days from the current disk (e.g. the default SSD) to another, lower-priority disk (e.g. JuiceFS).
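For an existing table, such a time-based rule can be attached with ALTER TABLE ... MODIFY TTL. The sketch below assumes a DateTime column d and a disk named jfs:

-- Move data older than 7 days to the disk named 'jfs'
ALTER TABLE test MODIFY TTL d + INTERVAL 7 DAY TO DISK 'jfs';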

What is JuiceFS?

JuiceFS is a high-performance, open-source distributed POSIX file system, which can be built on top of any object storage. For more details: https://github.com/juicedata/juicefs

Integration of ClickHouse + JuiceFS

Step 1: Mount the JuiceFS file system on all ClickHouse nodes. Any path works, because ClickHouse points to the mount point through its configuration file.

Step 2: Modify the ClickHouse configuration to add a new JuiceFS disk. Add the JuiceFS file system mount point that you just mounted in ClickHouse so that ClickHouse can recognize this new disk.

Step 3: Add a new storage policy and set the rules for sinking data. This storage policy will automatically sink data from the default disk to the specified store, such as JuiceFS, according to the user's rules.

Step 4: Set the storage policy and TTL for a specific table. After the storage policy is created, you need to apply it to a table. In the pre-testing and validation phase, it is recommended to use a relatively large table; if you want data to sink based on the time dimension, you also need to set a TTL on the table. The whole sinking process is automatic; you can check which parts are currently being migrated, and the migration progress, through ClickHouse's system tables (see the query sketch after this list).

Step 5: Manually move the part for validation. You can verify whether the current configuration or storage policy is in effect by manually executing the MOVE PARTITION command.
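A hedged sketch of those checks, again assuming a table named test:

-- Background moves currently in progress (empty when nothing is migrating)
SELECT database, table, part_name, target_disk_name, elapsed
FROM system.moves;

-- How many parts, and how much data, already sit on each disk
SELECT disk_name, count() AS parts, formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE table = 'test' AND active
GROUP BY disk_name;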

In the example below, ClickHouse has a configuration item called storage_configuration, which contains the disks configuration. JuiceFS is added there as a disk named "jfs" (the name is arbitrary) with the mount point /jfs.

<storage_configuration>
    <disks>
        <jfs>
            <path>/jfs</path>
        </jfs>
    </disks>
    <policies>
        <hot_and_cold>
            <volumes>
                <hot>
                    <disk>default</disk>
                    <max_data_part_size_bytes>1073741824</max_data_part_size_bytes>
                </hot>
                <cold>
                    <disk>jfs</disk>
                </cold>
            </volumes>
            <move_factor>0.1</move_factor>
        </hot_and_cold>
    </policies>
</storage_configuration>

Further down is the policies configuration, where a storage policy called hot_and_cold is defined. The user needs to define specific rules, such as ordering the volumes hot first and then cold: data first lands on the first disk in the hot volume, which is the default ClickHouse disk (usually the local SSD).

The max_data_part_size_bytes configuration in the hot volume means that when the size of a part exceeds the set value (1 GiB here), the storage policy is triggered and the part sinks to the next volume, i.e. the cold volume. In the above example, JuiceFS is the cold volume.
The move_factor configuration at the bottom means that ClickHouse also triggers the storage policy based on the fraction of remaining disk space: with a move_factor of 0.1, data starts to move to the next volume when less than 10% of the space on the hot volume is free.

CREATE TABLE test (
  d DateTime,
  ...
) ENGINE = MergeTree
...
TTL d + INTERVAL 1 DAY TO DISK 'jfs'
SETTINGS storage_policy = 'hot_and_cold';

As the above snippet shows, you can set storage_policy to the previously defined hot_and_cold policy in SETTINGS when creating a table, or later when modifying the table. The TTL line near the end is the time-based tiering rule mentioned above: it refers to a column d of type DateTime in the table, and with INTERVAL 1 DAY it states that data will be moved to JuiceFS once it has been written for more than one day.
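For a table that already exists, a minimal sketch of switching it to the same policy (ClickHouse only accepts a new policy that still contains all of the table's current disks):

-- Attach the previously defined storage policy to an existing table
ALTER TABLE test MODIFY SETTING storage_policy = 'hot_and_cold';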

From JuiceFS/Juicedata.
