Franck Pachot for YugabyteDB Distributed PostgreSQL Database

Posted on Jul 13, 2022 • Edited on Sep 1, 2022

find the partition key with a duplicate covering index local read

#yugabytedb #distributed #sql #database

In the previous posts I described multiple ways to read from local replicas. This was for cases where your table is global but accessed mostly from one region. I'll continue this series with another case where you want to geo-partition data. The table is global, at logical level, but data is stored in regional partitions of it. This uses PostgreSQL declarative partitioning and works fine when the partition key is part of the primary key, and mentioned in the query. If it is not, then all partition have to be read to find the right one. But there's a possibility to have a local read query without knowing the partition key. Here is an example.

I'll do this in a lab on my laptop, using Docker Compose containers, tagging the servers with the region. Of course, I'll not measure the latency, as all is physically there. But I'll check the execution plan to verify that I read only local partitions.

Create lab cluster

I start a regional cluster with 3 regions: earth, moon and mars. I start a RF=1 so that I need only one server by region.

I've used this docker-compose

Now, I'll connect to the node on earth with psql -p 5433 yugabyte


yugabyte=# show listen_addresses;

 listen_addresses
------------------
 yb-tserver-0

(1 row)

yugabyte=# select host,node_type,cloud,region,zone 
           from yb_servers() order by 1;

     host     | node_type | cloud | region | zone
--------------+-----------+-------+--------+------
 yb-tserver-0 | primary   | star  | earth  | base
 yb-tserver-1 | primary   | star  | moon   | base
 yb-tserver-2 | primary   | star  | mars   | base
(3 rows)

I create one tablespace per region:

create tablespace earth with (replica_placement=$placement${
"num_replicas": 1,"placement_blocks": [
{"cloud": "star","region": "earth","zone": "base","min_num_replicas": 1}
]}$placement$);

create tablespace moon with (replica_placement=$placement${
"num_replicas": 1,"placement_blocks": [
{"cloud": "star","region": "moon","zone": "base","min_num_replicas": 1}
]}$placement$);

create tablespace mars with (replica_placement=$placement${
"num_replicas": 1,"placement_blocks": [
{"cloud": "star","region": "mars","zone": "base","min_num_replicas": 1}
]}$placement$);

Create partitioned table

I create a list-partitioned table with a partition in each tablespace:

create extension if not exists pgcrypto;

create table customers(
 id uuid default gen_random_uuid()
 , planet text
 , info text
 , primary key(id,planet)
) partition by list(planet);

create table customers_earth partition of customers
for values in ('earth') tablespace earth;

create table customers_moon partition of customers
for values in ('moon') tablespace moon;

create table customers_mars partition of customers
for values in ('mars') tablespace mars;

Here my customers are identified by an id. However, I need to add the partition key planet to the primary key. There's no way to enforce that the id is unique across all regions, because PostgreSQL, and then YugabyteDB, has no global indexe. If you need, you can use a trigger for that. But, given that it is a uuid you probably don't need it. Just take the full primary key as an identifier of the customer.

Now adding some data:

with planets(name) as (
values('earth'),('moon'),('mars')
) insert into customers(planet, info) select name, n::text
from planets, generate_series(1,10000) n;

My docker-compose has a container that shows the tablet statistics every 10 seconds. This is how how check which reads and writes go to each region.

Querying without the partition key

Now we get to the main point. I said that ideally you have the full key where querying for a customer:

yugabyte=#

explain analyze select id,planet from customers 
where id='1c888089-c4f2-4da4-ba61-59957b965bf8' and planet='earth';
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..4.11 rows=1 width=48) (actual time=0.981..0.984 rows=1 loops=1)
   ->  Index Scan using customers_earth_pkey on customers_earth  (cost=0.00..4.11 rows=1 width=48) (actual time=0.980..0.983 rows=1 loops=1)
         Index Cond: ((id = '1c888089-c4f2-4da4-ba61-59957b965bf8'::uuid) AND (planet = 'earth'::text))
 Planning Time: 0.227 ms
 Execution Time: 1.031 ms
 Peak Memory Usage: 24 kB
(6 rows)

I query a customer from planet earth, it reads only one partition, stored in planet earth.

But what if I have only the id?

yugabyte=#

explain analyze select id,planet from customers 
where id='1c888089-c4f2-4da4-ba61-59957b965bf8' ;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..46.65 rows=300 width=48) (actual time=1.598..8.184 rows=1 loops=1)
   ->  Index Scan using customers_earth_pkey on customers_earth  (cost=0.00..14.65 rows=100 width=48) (actual time=1.597..1.599 rows=1 loops=1)
         Index Cond: (id = '1c888089-c4f2-4da4-ba61-59957b965bf8'::uuid)
   ->  Index Scan using customers_mars_pkey on customers_mars  (cost=0.00..15.25 rows=100 width=48) (actual time=1.554..1.554 rows=0 loops=1)
         Index Cond: (id = '1c888089-c4f2-4da4-ba61-59957b965bf8'::uuid)
   ->  Index Scan using customers_moon_pkey on customers_moon  (cost=0.00..15.25 rows=100 width=48) (actual time=5.027..5.028 rows=0 loops=1)
         Index Cond: (id = '1c888089-c4f2-4da4-ba61-59957b965bf8'::uuid)
 Planning Time: 0.252 ms
 Execution Time: 8.245 ms
 Peak Memory Usage: 56 kB
(10 rows)

This is not so good because the customer can be anywhere. We have to query all partitions to find the right one

Don't look at the time here, because it is a lab on my laptop. But if you really deploy between Earth, Moon and Mars you will see a latency of 2.51 second for Index Scan using customers_moon_pkey and minutes for customers_mars_pkey. You may increase some timeouts for that, and set a preferred location for the yb-master. Anyway, even with cross-region deployment on Earth, we need a solution for those who have the id only.

Duplicate Indexes

In order to avoid cross-region reads in this case, I can create a Duplicate Covering Indexes in each region. This is a redundant index for the other regions than the primary key. For this I create two additional indexes for each table partition, mentioning the tablespace where I want it to be stored:

create index customers_moon_id_planet_earth on customers_moon(id,planet) tablespace earth;
create index customers_mars_id_planet_earth on customers_mars(id,planet) tablespace earth;

create index customers_earth_id_planet_moon on customers_earth(id,planet) tablespace moon;
create index customers_mars_id_planet_moon on customers_mars(id,planet) tablespace moon;

create index customers_earth_id_planet_mars on customers_earth(id,planet) tablespace mars;
create index customers_moon_id_planet_mars on customers_moon(id,planet) tablespace mars;

It is good to have meaningful naming conventions and, even better, generate them. Or at least check them:


yugabyte=# select tablename, tablespace, indexname
           , pg_get_indexdef(indexname::regclass)
           from pg_indexes
           where tablename like 'customers%' order by 1,2,3;

yugabyte=# select tablename, tablespace, indexname
 , pg_get_indexdef(indexname::regclass)
 from pg_indexes
 where tablename like 'customers%' order by 1,2,3;
    tablename    | tablespace |           indexname            |                                            pg_get_indexdef
-----------------+------------+--------------------------------+-------------------------------------------------------------------------------------------------------
 customers_earth | earth      | customers_earth_pkey           | CREATE UNIQUE INDEX customers_earth_pkey ON public.customers_earth USING lsm (id HASH, planet ASC)
 customers_earth | mars       | customers_earth_id_planet_mars | CREATE INDEX customers_earth_id_planet_mars ON public.customers_earth USING lsm (id HASH, planet ASC)
 customers_earth | moon       | customers_earth_id_planet_moon | CREATE INDEX customers_earth_id_planet_moon ON public.customers_earth USING lsm (id HASH, planet ASC)
 customers_mars  | earth      | customers_mars_id_planet_earth | CREATE INDEX customers_mars_id_planet_earth ON public.customers_mars USING lsm (id HASH, planet ASC)
 customers_mars  | mars       | customers_mars_pkey            | CREATE UNIQUE INDEX customers_mars_pkey ON public.customers_mars USING lsm (id HASH, planet ASC)
 customers_mars  | moon       | customers_mars_id_planet_moon  | CREATE INDEX customers_mars_id_planet_moon ON public.customers_mars USING lsm (id HASH, planet ASC)
 customers_moon  | earth      | customers_moon_id_planet_earth | CREATE INDEX customers_moon_id_planet_earth ON public.customers_moon USING lsm (id HASH, planet ASC)
 customers_moon  | mars       | customers_moon_id_planet_mars  | CREATE INDEX customers_moon_id_planet_mars ON public.customers_moon USING lsm (id HASH, planet ASC)
 customers_moon  | moon       | customers_moon_pkey            | CREATE UNIQUE INDEX customers_moon_pkey ON public.customers_moon USING lsm (id HASH, planet ASC)
(9 rows)

Each table partition (tablename) has, in each region (tablespace) an index on (id,planet). And, yes, I should have created it as UNIQUE. That's a harmless mistake.

Now running my query above, looking for the region by providing only the id:

yugabyte=#

explain analyze select id,planet from customers 
where id='1c888089-c4f2-4da4-ba61-59957b965bf8' ;

                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..45.45 rows=300 width=48) (actual time=0.890..3.030 rows=1 loops=1)
   ->  Index Scan using customers_earth_pkey on customers_earth  (cost=0.00..14.65 rows=100 width=48) (actual time=0.890..0.891 rows=1 loops=1)
         Index Cond: (id = '1c888089-c4f2-4da4-ba61-59957b965bf8'::uuid)
   ->  Index Only Scan using customers_mars_id_planet_earth on customers_mars  (cost=0.00..14.65 rows=100 width=48) (actual time=1.293..1.293 rows=0 loops=1)
         Index Cond: (id = '1c888089-c4f2-4da4-ba61-59957b965bf8'::uuid)
         Heap Fetches: 0
   ->  Index Only Scan using customers_moon_id_planet_earth on customers_moon  (cost=0.00..14.65 rows=100 width=48) (actual time=0.844..0.844 rows=0 loops=1)
         Index Cond: (id = '1c888089-c4f2-4da4-ba61-59957b965bf8'::uuid)
         Heap Fetches: 0
 Planning Time: 43.061 ms
 Execution Time: 3.105 ms
 Peak Memory Usage: 56 kB
(12 rows)

Remember that I'm connected to earth. Finding the country for any id from there reads the primary key customers_earth_pkey for the local partition customers_earth of the table and reads the duplicate indexes customers_mars_id_planet_earth and customers_moon_id_planet_earth for the partitions customers_mars and customers_moon

Those are all local

Then, if there are other columns to read, you should have them included into the local index [but be careful if you expect updates on them - See Denis comment below]. But if there's more, like joining to other table, better to connect to the right region, to get SQL processed there, and add the region in the where clauses:

 $ docker exec -it yb-tserver-0 ysqlsh -h yb-tserver-0
ysqlsh (11.2-YB-2.15.0.1-b0)
Type "help" for help.

yugabyte=# show listen_addresses ;
 listen_addresses
------------------
 yb-tserver-0
(1 row)

yugabyte=# select id,planet from customers
where id='85338353-162a-4a62-bcb9-78fd58a6b500' ;

                  id                  | planet
--------------------------------------+--------
 85338353-162a-4a62-bcb9-78fd58a6b500 | mars
(1 row)

yugabyte=# select host from yb_servers() where region='mars';
     host
--------------
 yb-tserver-2

yugabyte=# \c yugabyte yugabyte yb-tserver-2
You are now connected to database "yugabyte" as user "yugabyte" on host "yb-tserver-2" at port "5433".

yugabyte=# show listen_addresses ;

 listen_addresses
------------------
 yb-tserver-2
(1 row)

yugabyte=# explain analyze select * from customers where id='85338353-162a-4a62-bcb9-78fd58a6b500' and planet='mars';
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..4.11 rows=1 width=80) (actual time=2.763..2.765 rows=1 loops=1)
   ->  Index Scan using customers_mars_pkey on customers_mars  (cost=0.00..4.11 rows=1 width=80) (actual time=2.762..2.764 rows=1 loops=1)
         Index Cond: ((id = '85338353-162a-4a62-bcb9-78fd58a6b500'::uuid) AND (planet = 'mars'::text))
 Planning Time: 34.833 ms
 Execution Time: 2.842 ms
 Peak Memory Usage: 40 kB
(6 rows)

Another advantage of having the region in the primary key is that you will have it in all child table's foreign key. Then, if they are also partitioned, the local partition only will be read.

Note that maintaining the additional indexes has a cost. Here is the same insert as I did above, but now with indexes to maintain in all regions:

with planets(name) as (
values('earth'),('moon'),('mars')
) insert into customers(planet, info) select name, n::text
from planets, generate_series(1,10000) n;

Top comments (2)

Denis Magda • Jul 13 '22

It's coincidence but I was researching how to query a geo-partitioned cluster efficiently when all you have is an id of a record (with no geo-column). Very timely! Just curious how big would be an impact for writes as long as the app needs to update all the indexes across distant regions.

Btw, love the planetary data model!

Franck Pachot • Jul 14 '22

Maybe not a coincidence as this was driven by discussions with colleagues and users 😀
Only the inserts and delete have to update all indexes because I don't expect an update in the primary key. Except if you include more columns in the indexes -> I'll add a comment on this suggestion, yes, good point

DEV Community

find the partition key with a duplicate covering index local read

Create lab cluster

Create partitioned table

Querying without the partition key

Duplicate Indexes

Top comments (2)

Read next

Amazon Redshift now supports refresh interval in a zero-ETL integration

Vector Search and Semantic Search in Depth

Filament Database Notification Sound

In-Memory Databases vs. Relational Databases: Key Advantages and Use Cases