In the previous parts, we saw an introduction to ClickHouse and its features. Furthermore, we learned about its different table engine families and their most usable members. In this part, I will walk through the special keys and indexes in ClickHouse, which can help reduce query latency and database load significantly.
It should be said that these concepts are only applicable to the default table engine family: Merge-Trees.
ClickHouse indexes are based on sparse indexing, an alternative to the B-tree index utilized by traditional DBMSs. In a B-tree, every row is indexed, which is suitable for locating and updating a single row (so-called point queries, common in OLTP workloads). This comes at the cost of poor performance on high-volume inserts and high memory and storage consumption. In contrast, a sparse index splits data into multiple parts, each divided into fixed-size groups of rows called granules. ClickHouse keeps one index entry per granule (group of rows) instead of per row, and that's where the term sparse index comes from. Given a query filtered on the primary keys, ClickHouse finds the matching granules and loads them into memory in parallel. That brings notable performance on range queries, which are common in OLAP workloads. Additionally, as data is stored column by column in multiple files, it can be compressed well, resulting in much lower storage consumption.
The sparse index is built on the LSM-tree approach, allowing you to insert high volumes of data per second. All of this comes at the cost of being unsuitable for point queries, which is not the purpose of ClickHouse anyway.
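The granule size is controlled per table by the index_granularity setting, which defaults to 8192 rows. A minimal sketch of setting it explicitly (the table and column names are just examples):

```sql
-- Each granule holds up to 8192 rows by default; the sparse index
-- stores one entry per granule rather than one per row.
CREATE TABLE example
(
    `id` UInt32,
    `payload` String
)
ENGINE = MergeTree
ORDER BY id
SETTINGS index_granularity = 8192;
```

Smaller granules mean a larger index but finer-grained pruning; the default is a reasonable trade-off for most workloads.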
In the figure below, we can see how ClickHouse stores data:
- Data is split into multiple parts (by the ClickHouse default or a user-defined partition key).
- Parts are split into granules. A granule is a logical concept: ClickHouse doesn't physically split the data into granules; instead, it locates them via marks.
- Granules' locations (start and end) are defined in the mark files with the .mrk extension.
- Index values are stored in the primary.idx file, which contains one row per granule.
- Columns are stored as compressed blocks in .bin files: one file per column in the Wide format and a single file for all columns in the Compact format. Whether a part is Wide or Compact is determined by ClickHouse based on the part's size.
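You can inspect these structures yourself: the system.parts table exposes, among other things, the number of marks (granules) in each part and whether it uses the Wide or Compact format. A query along these lines should work:

```sql
-- One row per active data part; 'marks' is the number of granules,
-- 'part_type' shows Wide or Compact storage.
SELECT name, part_type, rows, marks
FROM system.parts
WHERE table = 'projects' AND active;
```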
Now let's see how ClickHouse finds the matching rows using primary keys:
- ClickHouse finds the matching granule marks via binary search on the primary.idx file.
- It looks into the mark files to find the granules' locations in the .bin files.
- It loads the matching granules from the .bin files into memory in parallel and scans them for the matching rows.
To clarify the flow mentioned above, let's create a table and insert data into it:
```sql
CREATE TABLE default.projects
(
    `project_id` UInt32,
    `name` String,
    `created_date` Date
)
ENGINE = MergeTree
ORDER BY (project_id, created_date)
```
```sql
INSERT INTO projects
SELECT *
FROM generateRandom('project_id Int32, name String, created_date Date', 10, 10, 1)
LIMIT 10000000;
```
First, if you don't specify primary keys separately, ClickHouse will use the sort keys (in ORDER BY) as the primary keys. Hence, in this table, project_id and created_date are the primary keys. Every time you insert data into this table, ClickHouse will sort it first by project_id and then by created_date.
If we look into the data structure stored on the hard drive, we see the following:
We have five parts, and one of them is all_1_1_0. You can visit this link if you're curious about the naming convention. As you can see, columns are stored in .bin files, and we see mark files named after the columns, along with the primary.idx file.
Now let's filter on project_id, the first primary key, and explain its index usage:
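The analysis can be produced with a query along these lines (the filter value 700 is arbitrary):

```sql
EXPLAIN indexes = 1
SELECT *
FROM projects
WHERE project_id = 700
```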
As you can see, the database has detected project_id as a primary key and ruled out 1224 granules out of 1225 using it!
What if we filter on created_date, the second primary key?
```sql
EXPLAIN indexes = 1
SELECT *
FROM projects
WHERE created_date = today()
```
The database has detected created_date as a primary key, but it hasn't been able to rule out any granules. Why? Because ClickHouse uses binary search only for the first key and a generic exclusion search for the other keys, which is much less efficient than the former. So how can we make it more efficient?
If we put created_date first in the sort keys while defining the table, we will achieve better results when filtering on the non-first keys, since created_date has lower cardinality (fewer unique values) than project_id:
```sql
CREATE TABLE default.projects
(
    `project_id` UInt32,
    `name` String,
    `created_date` Date
)
ENGINE = MergeTree
ORDER BY (created_date, project_id)
```
```sql
EXPLAIN indexes = 1
SELECT *
FROM projects
WHERE project_id = 700
```
If we filter on project_id, now the second key, ClickHouse reads only 909 granules instead of the whole table.
So, to summarize: always try to order the primary keys from low to high cardinality.
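If you're unsure about the cardinality of candidate key columns, you can measure it with the uniq aggregate function before deciding on the order, for example:

```sql
-- Here, created_date spans far fewer distinct values than project_id,
-- so it belongs earlier in the sort key.
SELECT
    uniq(created_date) AS date_cardinality,
    uniq(project_id)   AS id_cardinality
FROM projects;
```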
I mentioned earlier that if you don't specify the PRIMARY KEY option, ClickHouse uses the sort keys as the primary keys. However, if you want to set the primary keys separately, they must be a prefix of the sort keys. As a result, the additional keys specified in the sort keys are only utilized for sorting purposes and don't play any role in indexing.
```sql
CREATE TABLE default.projects
(
    `project_id` UInt32,
    `name` String,
    `created_date` Date
)
ENGINE = MergeTree
PRIMARY KEY (created_date, project_id)
ORDER BY (created_date, project_id, name)
```
In this example, the created_date and project_id columns are utilized in the sparse index and for sorting, while the name column is only used as the last sorting criterion.
Use this option if you wish to use a column in the ORDER BY part of your queries, since it eliminates the database's sorting effort while running them.
A partition is a logical combination of parts in ClickHouse. By default, it considers all parts to be under one unnamed partition. To find out more, look into the system.parts table for the projects table defined in the previous section:
```sql
SELECT name, partition
FROM system.parts
WHERE table = 'projects';
```
You can see that the projects table has no particular partition. However, you can customize this using the PARTITION BY option:
```sql
CREATE TABLE default.projects_partitioned
(
    `project_id` UInt32,
    `name` String,
    `created_date` Date
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_date)
PRIMARY KEY (created_date, project_id)
ORDER BY (created_date, project_id, name)
```
In the above table, ClickHouse partitions data based on the month of the created_date column.
ClickHouse creates a min-max index for the partition key and uses it as the first filter layer when running queries. Let's see what happens when we filter data by a column present in the partition key:
```sql
EXPLAIN indexes = 1
SELECT *
FROM projects_partitioned
WHERE created_date = '2020-02-01'
```
You can see that the database has chosen one part out of 16 using the min-max index of the partition key.
Partitioning in ClickHouse aims to bring data manipulation capabilities to the table. For instance, you can delete or move parts belonging to partitions older than a year. This is far more efficient than on an unpartitioned table, since ClickHouse has physically split the data by month on storage. Consequently, such operations can be performed easily.
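As a sketch of such manipulations (the partition ID 202002 is just an example month produced by toYYYYMM):

```sql
-- Drop a whole month of data almost instantly:
ALTER TABLE projects_partitioned DROP PARTITION 202002;

-- Alternatively, detach a partition instead of dropping it,
-- e.g., to archive its part directories:
ALTER TABLE projects_partitioned DETACH PARTITION 202002;
```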
Although ClickHouse creates an additional index for the partition key, it should never be considered a query performance improvement method, because it loses the performance battle to defining the column in the sort keys. So if you wish to enhance query performance, put those columns in the sort keys, and use a column as the partition key only if you have particular plans for data manipulation based on that column.
Finally, don't confuse partitions in ClickHouse with the same term in distributed systems, where data is split across different nodes. You should use shards and distributed tables if you're inclined to achieve such purposes.
You may have noticed that defining a column among the last items of the sort key isn't very helpful, especially if you filter only on that column without the leading sort keys. What should you do in those cases?
Consider a dictionary you want to read. You can find words using the table of contents, sorted alphabetically. Those items are the sort keys in the table. You can easily find a word starting with W, but how can you find pages containing words related to wars?
You can put marks or sticky notes on those pages, reducing your effort the next time. That's how a skip index works. It helps the database rule out granules that don't contain the desired values of some columns by creating additional indexes.
Let's get back to the projects table defined in the Order By section, where created_date and project_id were defined as primary keys. Now, if we filter on the name column, we'll encounter this:
```sql
EXPLAIN indexes = 1
SELECT *
FROM projects
WHERE name = 'hamed'
```
The result was expected: the primary index can't help here, so all granules are selected. Now, what if we define a skip index on it?
```sql
ALTER TABLE projects ADD INDEX name_index name TYPE bloom_filter GRANULARITY 1;
```
The above command creates a skip index on the name column. I've used the bloom filter type because the column is a string. You can find more about the other kinds here. Note that this command only builds the index for newly inserted data. To build it for the data already inserted, you can use this:
```sql
ALTER TABLE projects MATERIALIZE INDEX name_index;
```
Let's see the query analysis this time:
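The analysis is produced by the same EXPLAIN query as before:

```sql
EXPLAIN indexes = 1
SELECT *
FROM projects
WHERE name = 'hamed'
```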
As you can see, the skip index greatly improved the granule rule-out and, with it, the performance.
While the skip index performed efficiently in this example, it can show poor performance in other cases. It depends on the correlation between your specified column and the sort keys, and on settings like the index granularity and the index type.
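As an illustration of another kind, a minmax skip index stores the minimum and maximum values of a column per block of granules; whether it prunes anything depends on how well the column's values correlate with the sort order. A sketch on the created_date column of the first table (the one ordered by project_id first, where the index name is arbitrary):

```sql
-- 'GRANULARITY 4' means one min-max entry per 4 granules.
ALTER TABLE projects ADD INDEX date_index created_date TYPE minmax GRANULARITY 4;

-- Build the index for the data that is already inserted:
ALTER TABLE projects MATERIALIZE INDEX date_index;
```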
In conclusion, understanding and utilizing ClickHouse's primary keys, order keys, partition keys, and skip index is crucial for optimizing query performance and scalability. Choosing appropriate primary keys, order keys, and partitioning strategies can enhance data distribution, improve query execution speed, and prevent overloading. Additionally, leveraging the skip index feature intelligently helps minimize disk I/O and reduce query execution time. By considering these factors in your ClickHouse schema design, you can unlock the full potential of ClickHouse for efficient and performant data solutions.