João Victor Martins

Posted on Jul 26, 2021

Partition and Clustering Key on Scylla DB

#database #scylladb

In the last 2 previous posts, we talked about Scylla DB. Today I will keep talking about the database, more specifically how works Primary Key on Scylla. We will see that the PK can be divided into two: Partition Key and Clustering Key. Let's get the difference between both and some differences in relation to a relational database.

Partition Key

Partition Key is responsible for data distribution across the nodes. It determines which node will store a given row. It can be one or more columns.

Clustering Key

Clustering Key is responsible for sorting the rows within the partition. It can be zero or more columns.

The problem of Car Table

Now that we know the difference about the Composites Keys, let's remember how we modeling our PK in Car table. In the cqlsh after selecting the Keyspace, use the command below.

DESCRIBE car

The result will be.

CREATE TABLE automobilies.car (
    id uuid PRIMARY KEY,
    brand text,
    color text,
    model text
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

How we can see, the unique PK that we have is id and could be calling of Simple Primary Key. If we do a select in the database, will retrieve only one car

cqlsh:automobilies> select * from car;

 id                                   | brand | color | model
--------------------------------------+-------+-------+-------
 f714e8e5-b160-4341-807d-f4cd92b973a4 |    VW |   Red |  Golf

To improve our test, let's include more cars in the table. The same model of car will have the same id.

cqlsh:automobilies> insert into car (id, brand, color, model) values (e0625c94-e9c2-11eb-9a03-0242ac130003, 'Ford', 'Red', 'Focus');
cqlsh:automobilies> insert into car (id, brand, color, model) values (fbd04f2c-511a-43c5-b588-9e29ebcb5d7a, 'VW', 'Nardo Grey', 'Passat');

And now the select clause result in

 id                                   | brand | color      | model
--------------------------------------+-------+------------+--------
 fbd04f2c-511a-43c5-b588-9e29ebcb5d7a |    VW | Nardo Grey | Passat
 e0625c94-e9c2-11eb-9a03-0242ac130003 |  Ford |        Red |  Focus
 f714e8e5-b160-4341-807d-f4cd92b973a4 |    VW |        Red |   Golf

If we want to find just a Golf car, we can query by id

cqlsh:automobilies> select * from car where id = f714e8e5-b160-4341-807d-f4cd92b973a4;

 id                                   | brand | color | model
--------------------------------------+-------+-------+-------
 f714e8e5-b160-4341-807d-f4cd92b973a4 |    VW |   Red |  Golf

(1 rows)

But what happening if we want to query by id and color? Well, maybe just need to add the and in the query and choose color, right?

cqlsh:automobilies> select * from car where id = f714e8e5-b160-4341-807d-f4cd92b973a4 and color = 'Red';

InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"

Like we can see, when add color, we get an error, but why?
To answer this question, we need to understand how Scylla saves data (I will use an example with another domain).

In the image above, we see that Scylla uses a PK to divide the partitions. When we use just id for query, Scylla knows exactly which node contains the data by hashing the ID (which is the partition key) and that is why we got results in the first query. In the second query, when we tried to filter for color, wasn't possible, because we not defined color as Clustering Key or as Partition Key. We would still know on which partition the data exists because we have the partition key, but we would have to do a linear scan of the entire data on that partition to find the specific row we are looking for (using ALLOW FILTERING), but performing this query would be inefficient and possibly very slow. To use ALLOW FILTERING is like this

cqlsh:automobilies> select * from car where id = f714e8e5-b160-4341-807d-f4cd92b973a4 and color = 'Red' ALLOW FILTERING;

 id                                   | brand | color | model
--------------------------------------+-------+-------+-------
 f714e8e5-b160-4341-807d-f4cd92b973a4 |    VW |   Red |  Golf

(1 rows)


cqlsh:automobilies> select * from car where id = f714e8e5-b160-4341-807d-f4cd92b973a4 and color = 'Green' ALLOW FILTERING;

---MORE---
(0 rows)

How we saw, this it's not good, so let's improve our data modeling.

Remodeling Car Table

We should know for which attributes we will want to query to define our keys. How we want a query for color and id, both will be our Pk, being id a Partition Key and color a Clustering Key.

cqlsh:automobilies> CREATE TABLE car (id uuid, brand text, color text, model text, PRIMARY KEY (id, color));

cqlsh:automobilies> DESCRIBE car;

CREATE TABLE automobilies.car (
    id uuid,
    color text,
    brand text,
    model text,
    PRIMARY KEY (id, color)
) WITH CLUSTERING ORDER BY (color ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

Now that we have a Clustering Key we can insert some cars with the same id, but different colors and Scylla will sort for us.

cqlsh:automobilies> insert into car (id, brand, color, model) values (f714e8e5-b160-4341-807d-f4cd92b973a4, 'VW', 'Red', 'Golf');
cqlsh:automobilies> insert into car (id, brand, color, model) values (e0625c94-e9c2-11eb-9a03-0242ac130003, 'Ford', 'Red', 'Focus');
cqlsh:automobilies> insert into car (id, brand, color, model) values (fbd04f2c-511a-43c5-b588-9e29ebcb5d7a, 'VW', 'Red', 'Passat');
cqlsh:automobilies> insert into car (id, brand, color, model) values (fbd04f2c-511a-43c5-b588-9e29ebcb5d7a, 'VW', 'Green', 'Passat');
cqlsh:automobilies> insert into car (id, brand, color, model) values (f714e8e5-b160-4341-807d-f4cd92b973a4, 'VW', 'Green', 'Golf');

Our partition will stay like this

Thus we have the partitions divided for id and rows of cars sorted by color

cqlsh:automobilies> select * from car;

 id                                   | color | brand | model
--------------------------------------+-------+-------+--------
 fbd04f2c-511a-43c5-b588-9e29ebcb5d7a | Green |    VW | Passat
 fbd04f2c-511a-43c5-b588-9e29ebcb5d7a |   Red |    VW | Passat
 e0625c94-e9c2-11eb-9a03-0242ac130003 |   Red |  Ford |  Focus
 f714e8e5-b160-4341-807d-f4cd92b973a4 | Green |    VW |   Golf
 f714e8e5-b160-4341-807d-f4cd92b973a4 |   Red |    VW |   Golf

And now we can query by id and color

cqlsh:automobilies> select * from car where id = fbd04f2c-511a-43c5-b588-9e29ebcb5d7a and color = 'Red';

 id                                   | color | brand | model
--------------------------------------+-------+-------+--------
 fbd04f2c-511a-43c5-b588-9e29ebcb5d7a |   Red |    VW | Passat

(1 rows)

Conclusion

In general, we saw how we define our PK in Scylla, can being Partition Key and Clustering Key. We need first think about our queries and after to model our tables. There are many others rules for work with these PK's, but in a unique post is not possible to cover all subject matter. I recommend that you do the courses of https://university.scylladb.com/. Well, this is what I wanted to introduce in this post, I really hope that you like it. Doubts, critics, and suggestions, I will be a disposition.

DEV Community