Man yin Mandy Wong for Tencent Cloud

COS-Based ClickHouse Data Tiering Solution

ClickHouse is a columnar database management system (DBMS) for online analytical processing (OLAP) that supports interactive analysis of petabytes of data. As a distributed DBMS, it differs from other mainstream big data components in that it doesn't rely on the Hadoop Distributed File System (HDFS). Instead, it stores data on the local disks of its servers, uses data replicas to guarantee high data availability, and leverages distributed tables to implement distributed data storage and queries.

Shard: A shard is a server that stores one part of the data; to read all the data, you must access every shard. Spreading a distributed table's data across multiple shards enables horizontal scaling of both computing and storage.

Replica: Each shard contains multiple data replicas, so you can read data from any of them. The replica mechanism keeps data available when a single storage node fails. Replication is supported only by the MergeTree family of table engines (through their Replicated* variants), and ClickHouse implements it in the table engine rather than the database engine, so replicas are table-level rather than server-level. When data is inserted into a ReplicatedMergeTree table, it is synced across the replicas to produce multiple copies of the data, with ZooKeeper handling the distributed coordination during the sync.

Distributed table: A distributed table, created with the Distributed engine, spreads query tasks across multiple servers for processing but doesn't store any data itself. A local table is first created on each shard and is visible only on that node; the distributed table is then created as a mapping over these local tables. When you access the distributed table, ClickHouse automatically forwards your request to the corresponding local tables based on the cluster's architecture information.
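As a minimal sketch of how these pieces fit together (the cluster name, ZooKeeper path, and table and column names below are illustrative, not taken from the article), a replicated local table is created on the cluster's nodes and then exposed through a Distributed table:

```sql
-- Local, replicated table created on every node of the cluster.
-- '{shard}' and '{replica}' are macros defined in each server's configuration.
CREATE TABLE default.events_local ON CLUSTER my_cluster
(
    event_date Date,
    event_time DateTime,
    user_id    UInt64,
    payload    String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);

-- Distributed table that stores no data itself; it routes queries to
-- events_local on every shard of my_cluster.
CREATE TABLE default.events ON CLUSTER my_cluster AS default.events_local
ENGINE = Distributed(my_cluster, default, events_local, rand());
```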

In summary, a ClickHouse cluster consists of multiple shards, each of which contains multiple data replicas. A replica corresponds to a server node in the cluster and stores data on its local disk. With distributed tables, shards, and replicas, ClickHouse achieves horizontal scalability and high data availability.

1. Tiered data storage

Starting from v19.15, ClickHouse supports multi-volume storage, which lets a ClickHouse table span volumes backed by multiple devices. This makes it possible to place different types of disks in different volumes and tier hot and cold data accordingly, striking a balance between performance and cost.
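Once disks and storage policies are configured (see the sections below), they can be inspected through ClickHouse's system tables; for example:

```sql
-- Disks ClickHouse knows about and their capacity.
SELECT
    name,
    path,
    formatReadableSize(free_space)  AS free,
    formatReadableSize(total_space) AS total
FROM system.disks;

-- Storage policies and the volumes/disks they are built from.
SELECT
    policy_name,
    volume_name,
    disks,
    move_factor
FROM system.storage_policies;
```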

2. Disk types supported by ClickHouse

ClickHouse mainly supports two disk types: DiskLocal, which stores data under a locally mounted path, and DiskS3, which stores data in an S3-compatible object store such as COS.
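As a rough, illustrative fragment (disk names, paths, and the bucket endpoint are placeholders; section 6.1 shows a fuller example), the two disk types are declared in the <disks> section of the server's storage configuration:

```xml
<disks>
  <!-- DiskLocal: data on a locally mounted path (the path must end with "/") -->
  <local_disk>
    <path>/data/clickhouse/</path>
  </local_disk>
  <!-- DiskS3: data in an S3-compatible bucket, e.g. a COS bucket -->
  <cos_disk>
    <type>s3</type>
    <endpoint>https://examplebucket-1250000000.cos.ap-guangzhou.myqcloud.com/clickhouse/</endpoint>
    <access_key_id>SecretId</access_key_id>
    <secret_access_key>SecretKey</secret_access_key>
  </cos_disk>
</disks>
```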

3. Data movement policy

ClickHouse can store data in different storage media by configuring disks of different types and storage policies in the configuration file. It also supports movement policies to automatically move data between storage media.
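As an illustrative sketch (disk and volume names are placeholders), a storage policy groups disks into ordered volumes; move_factor triggers automatic movement when a volume's free space drops below the given ratio, and TTL expressions (section 6.2) are the other way to move data automatically:

```xml
<policies>
  <tiered>
    <volumes>
      <!-- Volumes are used in order: new data parts land on "hot" first -->
      <hot>
        <disk>local_disk</disk>
      </hot>
      <cold>
        <disk>cos_disk</disk>
      </cold>
    </volumes>
    <!-- Move parts to the next volume once free space on a volume falls below 20% -->
    <move_factor>0.2</move_factor>
  </tiered>
</policies>
```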

4. Current problems with data storage in ClickHouse

Many users choose ClickHouse for its superior query performance. To make the most of it, they generally store ClickHouse data on Tencent Cloud Enhanced SSD cloud disks for their high performance; however, Enhanced SSD disks are expensive. Trading performance against cost, they may end up purging legacy data from ClickHouse. Yet although most queries hit the latest data, the business side still needs to access legacy data occasionally. Balancing storage cost against this occasional access to legacy data is a persistent headache for ClickHouse system admins.

5. COS strengths

Cloud Object Storage (COS) is a distributed storage service launched by Tencent Cloud. It has no directory hierarchy or data format restrictions, can accommodate an unlimited amount of data, and supports access over HTTP/HTTPS protocols.

COS organizes data into pay-as-you-go buckets with unlimited capacity, which can be used and scaled on demand.

6. COS-based ClickHouse data tiering

Prepare the following environments before configuring data tiering:

• Local storage: Format an Enhanced SSD cloud disk and mount it to the "/data" path for storing hot data.

• COS bucket: Create a COS bucket for storing cold data and get the "SecretId" and "SecretKey" of the account that can access the bucket.

6.1 Configure the ClickHouse disk and policy

First, you need to configure the "/etc/clickhouse-server/config.d/storage.xml" file. In the <disks> section, define the local disk path, the COS bucket URL, and the "SecretId" and "SecretKey" of the access account. In the <policies> section, define the storage policy, which consists of the "ttlhot" and "ttlcold" volumes that contain the local disk and the COS bucket respectively.
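Putting this together, the storage.xml might look roughly like the following sketch. The "/data" mount point and the "ttlhot"/"ttlcold" volume names follow the article; the policy name "ttl", the disk names, and the bucket endpoint are illustrative assumptions, and SecretId/SecretKey must be replaced with the real credentials of the account that can access the bucket:

```xml
<clickhouse> <!-- use <yandex> as the root element on older ClickHouse releases -->
  <storage_configuration>
    <disks>
      <!-- Hot tier: Enhanced SSD cloud disk mounted at /data -->
      <disk_local>
        <path>/data/clickhouse/</path>
      </disk_local>
      <!-- Cold tier: COS bucket accessed through the S3-compatible disk type -->
      <disk_cos>
        <type>s3</type>
        <endpoint>https://examplebucket-1250000000.cos.ap-guangzhou.myqcloud.com/clickhouse/</endpoint>
        <access_key_id>SecretId</access_key_id>
        <secret_access_key>SecretKey</secret_access_key>
      </disk_cos>
    </disks>
    <policies>
      <ttl>
        <volumes>
          <ttlhot>
            <disk>disk_local</disk>
          </ttlhot>
          <ttlcold>
            <disk>disk_cos</disk>
          </ttlcold>
        </volumes>
      </ttl>
    </policies>
  </storage_configuration>
</clickhouse>
```

After editing the file, reload or restart clickhouse-server and check the system.disks and system.storage_policies tables (section 1) to confirm the configuration was picked up.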

6.2 Import data to ClickHouse

After completing the storage configuration, create a table with a TTL clause and import data into it to verify that the tiering policy works.

Here, a COS bucket inventory is used as the data source for the import. First, create a table named "cos_inventory_ttl" in ClickHouse whose columns match those of the inventory. Then, configure the TTL clause: based on the "LastModifiedDate" value, hot data is kept in the "ttlhot" volume, while cold data at least three months old is moved to "ttlcold".
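A rough sketch of the table definition follows. The exact column set depends on the fields chosen when the COS inventory was configured, so the columns below are illustrative; the volume names and the "ttl" storage policy match the configuration sketched in section 6.1:

```sql
-- Columns mirror the COS bucket inventory report (illustrative set).
CREATE TABLE default.cos_inventory_ttl
(
    appid            UInt64,
    bucket           String,
    key              String,
    size             UInt64,
    LastModifiedDate DateTime,
    etag             String,
    storage_class    String
)
ENGINE = MergeTree()
ORDER BY LastModifiedDate
-- Fresh rows stay on the "ttlhot" volume; rows older than three months
-- are moved to the "ttlcold" volume backed by the COS disk.
TTL LastModifiedDate TO VOLUME 'ttlhot',
    LastModifiedDate + toIntervalMonth(3) TO VOLUME 'ttlcold'
SETTINGS storage_policy = 'ttl';
```

The inventory files can then be loaded in any input format ClickHouse supports, for example by piping the inventory CSV files into clickhouse-client with an INSERT ... FORMAT CSV query.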

6.3 Verify data

After the import, check the total number of data rows. Then query which disks the data parts are stored on. Finally, you can run a test query that sums the size of the files generated in the past three months under the "cos-user/" directory.
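A sketch of these three checks, assuming the table from section 6.2 (system.parts records which disk each data part resides on):

```sql
-- 1. Total number of imported rows.
SELECT count() FROM default.cos_inventory_ttl;

-- 2. Which disks hold the table's active data parts (hot local disk vs. COS).
SELECT
    disk_name,
    count()                                 AS parts,
    formatReadableSize(sum(bytes_on_disk))  AS size
FROM system.parts
WHERE database = 'default' AND table = 'cos_inventory_ttl' AND active
GROUP BY disk_name;

-- 3. Total size of files written under "cos-user/" in the past three months.
SELECT formatReadableSize(sum(size))
FROM default.cos_inventory_ttl
WHERE LastModifiedDate > (now() - toIntervalMonth(3))
  AND key LIKE 'cos-user/%';
```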

7. Conclusion

In ClickHouse, configuring different storage media and storage policies enables automatic tiered storage of data. Thanks to the unlimited capacity and cost-effectiveness of COS, ClickHouse clusters can retain data for the long term at low cost while still delivering superior query performance.

Read more at: https://www.tencentcloud.com/dynamic/blogs/sample-article/100384
