Using the ALTER TABLE ... UPDATE statement in ClickHouse is a heavy operation, not designed for frequent use. However, if we design our schema to insert/update a whole partition at a time, we can update large amounts of data easily. Doing this on a plain MergeTree table is straightforward, but doing it in a cluster with replicated tables is trickier. Let's see how it can be done.
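For reference, this is the kind of per-row mutation we want to avoid running frequently (table and column names here are illustrative, matching the schema we will create below):

ALTER TABLE test.hits_local UPDATE pageviews = 0 WHERE user_id = 'user 1'

Each such mutation rewrites whole data parts in the background, which is why it doesn't scale for frequent updates.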
In this example we use a cluster called test_cluster that consists of 2 nodes. We will create a replicated table on each node and a distributed table that we can use to parallelize reads. On each node, the test_cluster must be configured in the /etc/clickhouse-server/config.xml configuration file:
<remote_servers>
<test_cluster>
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>192.168.33.201</host>
<port>9000</port>
</replica>
<replica>
<host>192.168.33.202</host>
<port>9000</port>
</replica>
</shard>
</test_cluster>
</remote_servers>
In this example we use IP addresses, but we could use hostnames instead if they were correctly configured in the /etc/hosts file.
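For reference, a minimal /etc/hosts mapping for the two nodes (hypothetical names, matching the addresses above) could look like this:

192.168.33.201 ch1
192.168.33.202 ch2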
We will also use a couple of macros defined in /etc/clickhouse-server/config.d/macros.xml. They will be substituted as parameters when we run ON CLUSTER DDL statements. On the first ClickHouse server (ch1) we have:
<yandex>
<macros>
<shard>shard_01</shard>
<replica>replica_01</replica>
</macros>
</yandex>
And in the second one (ch2):
<yandex>
<macros>
<shard>shard_01</shard>
<replica>replica_02</replica>
</macros>
</yandex>
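We can check that the macros were picked up by querying the system.macros table on each node, for example:

ch1 :) SELECT * FROM system.macros

It should list the shard and replica substitutions we just configured.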
Let's connect to one of the nodes and create the database test. Since we use ON CLUSTER, the statement will be executed on all the nodes of the cluster:
CREATE DATABASE IF NOT EXISTS test ON CLUSTER test_cluster
Let's create the replicated table. This is the table that will store all the data. We partition it by hour and will insert all the data for a given hour in a single insert.
CREATE TABLE IF NOT EXISTS test.hits_local ON CLUSTER test_cluster
(
`date` DateTime,
`user_id` String,
`pageviews` Int32
)
ENGINE = ReplicatedMergeTree('/clickhouse/test_cluster/tables/{shard}/hits_local', '{replica}')
PARTITION BY toStartOfHour(date)
ORDER BY (date)
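To see how toStartOfHour maps rows to partitions, we can run a quick check; any timestamp within the same hour ends up in the same partition:

ch1 :) SELECT toStartOfHour(toDateTime('2020-10-27 10:23:45'))

This returns 2020-10-27 10:00:00.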
Then we create a table that will be used only to perform the inserts. It must be created on all the nodes of the cluster and its engine must be MergeTree, otherwise we could end up with a lot of ZooKeeper errors. We will call this table the temporary table (note that it's not a real TEMPORARY table).
CREATE TABLE IF NOT EXISTS test.hits_local_tmp ON CLUSTER test_cluster
AS test.hits_local
ENGINE = MergeTree()
PARTITION BY toStartOfHour(date)
ORDER BY (date)
Finally, we create the distributed table:
CREATE TABLE IF NOT EXISTS test.hits ON CLUSTER 'test_cluster'
AS test.hits_local
ENGINE = Distributed('test_cluster', 'test', 'hits_local')
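Note that since test_cluster has a single shard, the Distributed engine needs no sharding key. With several shards we would pass one as a fourth parameter, for example:

ENGINE = Distributed('test_cluster', 'test', 'hits_local', rand())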
So now we have 3 tables in each of the nodes. In ch1:
ch1 :) use test
ch1 :) show tables
┌─name───────────┐
│ hits │
│ hits_local │
│ hits_local_tmp │
└────────────────┘
And in ch2:
ch2 :) use test
ch2 :) show tables
┌─name───────────┐
│ hits │
│ hits_local │
│ hits_local_tmp │
└────────────────┘
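We can also double-check that each table uses the expected engine:

ch1 :) SELECT name, engine FROM system.tables WHERE database = 'test'

This should report Distributed, ReplicatedMergeTree and MergeTree for hits, hits_local and hits_local_tmp respectively.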
Let's insert some values in the temporary table:
ch1 :) INSERT INTO test.hits_local_tmp VALUES
('2020-10-27 10:00:00', 'user 1', 7),
('2020-10-27 10:00:00', 'user 2', 3),
('2020-10-27 10:00:00', 'user 3', 1),
('2020-10-27 10:00:00', 'user 4', 12)
We can check that now we have a new partition:
ch1 :) SELECT partition, name, active FROM system.parts WHERE table = 'hits_local_tmp'
┌─partition───────────┬─name─────────────┬─active─┐
│ 2020-10-27 10:00:00 │ 1603792800_1_1_0 │ 1 │
└─────────────────────┴──────────────────┴────────┘
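The partition id 1603792800 in the part name is simply the Unix timestamp of the start of the hour (assuming the server timezone is UTC), which we can verify:

ch1 :) SELECT toUnixTimestamp(toDateTime('2020-10-27 10:00:00'))

This returns 1603792800 with a UTC server timezone.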
The idea is to move this partition from the temporary table to the replicated table. But we cannot do it using MOVE PARTITION:
ch1 :) ALTER TABLE test.hits_local_tmp MOVE PARTITION '2020-10-27 10:00:00' TO TABLE test.hits_local
Received exception from server (version 20.10.2):
Code: 48. DB::Exception: Received from 127.0.0.1:9000. DB::Exception: Table test.hits_local_tmp (400fd602-745e-4904-9f61-76ccda9eaf0a) supports movePartitionToTable only for MergeTree family of table engines. Got ReplicatedMergeTree.
The good news is that we can do a REPLACE PARTITION and then delete the original partition in the temporary table ourselves (REPLACE PARTITION does not delete the source partition). REPLACE PARTITION works even if there is no partition to replace in the destination table:
ch1 :) ALTER TABLE test.hits_local REPLACE PARTITION '2020-10-27 10:00:00' FROM test.hits_local_tmp
ch1 :) ALTER TABLE test.hits_local_tmp DROP PARTITION '2020-10-27 10:00:00'
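After the DROP PARTITION, the temporary table is empty again, ready for the next batch:

ch1 :) SELECT count() FROM test.hits_local_tmp

This should return 0.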
Now we have the partition in the replicated table:
ch1 :) SELECT partition, name, active FROM system.parts WHERE table = 'hits_local'
┌─partition───────────┬─name─────────────┬─active─┐
│ 2020-10-27 10:00:00 │ 1603792800_1_1_0 │ 1 │
└─────────────────────┴──────────────────┴────────┘
The data are there:
ch1 :) select * from test.hits_local
┌────────────────date─┬─user_id─┬─pageviews─┐
│ 2020-10-27 10:00:00 │ user 1 │ 7 │
│ 2020-10-27 10:00:00 │ user 2 │ 3 │
│ 2020-10-27 10:00:00 │ user 3 │ 1 │
│ 2020-10-27 10:00:00 │ user 4 │ 12 │
└─────────────────────┴─────────┴───────────┘
We can query them through the distributed table:
ch1 :) select * from test.hits
┌────────────────date─┬─user_id─┬─pageviews─┐
│ 2020-10-27 10:00:00 │ user 1 │ 7 │
│ 2020-10-27 10:00:00 │ user 2 │ 3 │
│ 2020-10-27 10:00:00 │ user 3 │ 1 │
│ 2020-10-27 10:00:00 │ user 4 │ 12 │
└─────────────────────┴─────────┴───────────┘
They are replicated to the other node:
ch2 :) select * from test.hits_local
┌────────────────date─┬─user_id─┬─pageviews─┐
│ 2020-10-27 10:00:00 │ user 1 │ 7 │
│ 2020-10-27 10:00:00 │ user 2 │ 3 │
│ 2020-10-27 10:00:00 │ user 3 │ 1 │
│ 2020-10-27 10:00:00 │ user 4 │ 12 │
└─────────────────────┴─────────┴───────────┘
And the partition has been created in the second node too:
ch2 :) SELECT partition, name, active FROM system.parts WHERE table = 'hits_local'
┌─partition───────────┬─name─────────────┬─active─┐
│ 2020-10-27 10:00:00 │ 1603792800_1_1_0 │ 1 │
└─────────────────────┴──────────────────┴────────┘
Now let's update the whole partition. We perform the insert again in the temporary table, this time with the new values:
ch1 :) INSERT INTO test.hits_local_tmp VALUES
('2020-10-27 10:00:00', 'user 1', 6),
('2020-10-27 10:00:00', 'user 2', 4),
('2020-10-27 10:00:00', 'user 3', 2),
('2020-10-27 10:00:00', 'user 4', 15),
('2020-10-27 10:00:00', 'user 5', 14)
A new partition is created:
ch1 :) SELECT partition, name, active FROM system.parts WHERE table = 'hits_local_tmp'
┌─partition───────────┬─name─────────────┬─active─┐
│ 2020-10-27 10:00:00 │ 1603792800_3_3_0 │ 1 │
└─────────────────────┴──────────────────┴────────┘
We can use it to replace the data in the replicated table:
ch1 :) ALTER TABLE test.hits_local REPLACE PARTITION '2020-10-27 10:00:00' FROM test.hits_local_tmp
ch1 :) ALTER TABLE test.hits_local_tmp DROP PARTITION '2020-10-27 10:00:00'
We can see that the first node now contains only the new part:
ch1 :) SELECT partition, name, active FROM system.parts WHERE table = 'hits_local'
┌─partition───────────┬─name─────────────┬─active─┐
│ 2020-10-27 10:00:00 │ 1603792800_3_3_0 │ 1 │
└─────────────────────┴──────────────────┴────────┘
The new data are there:
ch1 :) select * from test.hits_local
┌────────────────date─┬─user_id─┬─pageviews─┐
│ 2020-10-27 10:00:00 │ user 1 │ 6 │
│ 2020-10-27 10:00:00 │ user 2 │ 4 │
│ 2020-10-27 10:00:00 │ user 3 │ 2 │
│ 2020-10-27 10:00:00 │ user 4 │ 15 │
│ 2020-10-27 10:00:00 │ user 5 │ 14 │
└─────────────────────┴─────────┴───────────┘
They are replicated to the second node:
ch2 :) select * from test.hits_local
┌────────────────date─┬─user_id─┬─pageviews─┐
│ 2020-10-27 10:00:00 │ user 1 │ 6 │
│ 2020-10-27 10:00:00 │ user 2 │ 4 │
│ 2020-10-27 10:00:00 │ user 3 │ 2 │
│ 2020-10-27 10:00:00 │ user 4 │ 15 │
│ 2020-10-27 10:00:00 │ user 5 │ 14 │
└─────────────────────┴─────────┴───────────┘
And the distributed table is fine too:
ch1 :) select * from test.hits
┌────────────────date─┬─user_id─┬─pageviews─┐
│ 2020-10-27 10:00:00 │ user 1 │ 6 │
│ 2020-10-27 10:00:00 │ user 2 │ 4 │
│ 2020-10-27 10:00:00 │ user 3 │ 2 │
│ 2020-10-27 10:00:00 │ user 4 │ 15 │
│ 2020-10-27 10:00:00 │ user 5 │ 14 │
└─────────────────────┴─────────┴───────────┘
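To sum up, updating a whole partition boils down to three statements. Here is the generic recipe, with <partition> standing in for the partition expression (e.g. '2020-10-27 10:00:00'):

-- 1. Insert the full, updated data for the partition into the temporary table
INSERT INTO test.hits_local_tmp VALUES ...
-- 2. Swap the partition into the replicated table (replication to the other node happens automatically)
ALTER TABLE test.hits_local REPLACE PARTITION <partition> FROM test.hits_local_tmp
-- 3. Clean up the source copy, since REPLACE PARTITION does not remove it
ALTER TABLE test.hits_local_tmp DROP PARTITION <partition>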