Franck Pachot for YugabyteDB

Posted on Mar 18, 2022

Efficient pagination in YugabyteDB & PostgreSQL 📃🐘🚀

#yugabytedb #postgres #pagination #sql

If you want to query with pagination, do not use offset. The right way is explained by Markus Windand in Use The Index, Luke. But what is the best indexing pattern in PostgreSQL and YugabyteDB?

TL;DR:

Create a unique index which includes, after the search key, the columns used in ORDER BY, and, if they are not unique, the columns from the table primary key
Use a query that returns all those columns, in order to identify the last row displayed
Use those columns for the where clause, to have the range scan start just after the last row
Always check the execution plan to be sure that you read only the rows that needs to be fetched for the next page only
Love stored procedures to keep the data processing logic efficient and easy to maintain
YugabyteDB or PostgreSQL is an infrastructure choice, keep the application code compatible to stay free

Test case

I'm running this on YugabyteDB in order to be sure that this solution is efficient in a distributed SQL database. It works on PostgreSQL and I'll explain the differences, if any. I'm using YugabyteDB 2.8 here (and WARNING: 'analyze' is a beta feature! is expected in this version).

Joins may be involved in paginated queries. Here is one fact table and a dimension table:

drop table demo1;
drop table demo2;
create extension pgcrypto;

create table demo2(
 id int primary key
, name text);

insert into demo2 select generate_series(0,1000),'x';

create table demo1(
 id uuid default gen_random_uuid()
, key int
, ts timestamptz not null
, val int
, ref int not null references demo2
);

insert into demo1(key, ts, val, ref)
select 1,timestamp'2022-01-01 00:00:00'+mod(n,500)*interval '1 minute',n,mod(n,10) from generate_series(1,1000000) n
;

analyze demo1;
analyze demo2;

I've not added lot of data. I don't need it because I'll check the execution plan. This is the correct way to write efficient queries. Just increase the range in generate_series() if you want to scale it and verify that the response time depends only on the pagination size.

Index

The goal is to get rows from "demo1", joined to "demo2, for one "key", ordered by the most recent "ts". To optimize this, I create the following index:

create unique index demo1_key_ts_id
 on demo1(key, ts desc, id desc);

Because the timestamp may not be unique, I have added the "id", and then I am able to create it as a unique index. This will be important for pagination in order to seek to the exact next page.

SELECT ... WHERE ... ORDER BY ... LIMIT

I'll use a function to make it easier:

create or replace function read_by_page(int,timestamptz,uuid,int) 
returns table(id uuid, key int, ts timestamptz, val int, ref int, name text)
as $$
select demo1.*,demo2.name
from demo2 join demo1 on demo1.ref=demo2.id
where demo1.key=$1 
 and ( 
 -- for the first page, pass null values
 ($2 is null and $3 is null) 
 -- for the next page, pass last values fetched
 or (demo1.ts,demo1.id)<($2,$3)
) 
order by demo1.ts desc,demo1.id desc
limit $4;
$$ language sql;

The execution is really simple. This gets the first page for id=1, with a page of 4 rows:

select * from read_by_page(1,null,null,4);

The result:

yugabyte=> select * from read_by_page(1,null,null,4);

                   id                  | key |           ts           |  val   | ref | name
--------------------------------------+-----+------------------------+--------+-----+------
 fffc4d86-cef6-4caa-9dd9-6dc47866d7aa |   1 | 2022-01-01 08:19:00+00 | 559999 |   9 | x
 ffb4fde0-0cba-4e52-b011-385b0821160d |   1 | 2022-01-01 08:19:00+00 | 130499 |   9 | x
 ff866c28-4105-4557-b227-feb9e6f2cc48 |   1 | 2022-01-01 08:19:00+00 | 633499 |   9 | x
 ff7ff77d-6e42-48fe-8691-8b0d8f6b8dac |   1 | 2022-01-01 08:19:00+00 | 669999 |   9 | x
(4 rows)

Then I call the next page:

select * from read_by_page(1,'2022-01-01 08:19:00+00','ff7ff77d-6e42-48fe-8691-8b0d8f6b8dac',4);

This gives the next rows:

yugabyte=> select * from read_by_page(1,'2022-01-01 08:19:00+00','ff7ff77d-6e42-48fe-8691-8b0d8f6b8dac',4);

                  id                  | key |           ts           |  val   | ref | name
--------------------------------------+-----+------------------------+--------+-----+------
 ff05778d-b494-4fab-b144-0e90376974ab |   1 | 2022-01-01 08:19:00+00 | 688999 |   9 | x
 fef0ef1c-9e89-4ba1-ac6f-1791f4c58a96 |   1 | 2022-01-01 08:19:00+00 | 170499 |   9 | x
 fec4b1f8-c0e4-48bb-8738-6adc64348376 |   1 | 2022-01-01 08:19:00+00 | 147499 |   9 | x
 febd8f05-dde5-4575-8dc2-d230e390fbc1 |   1 | 2022-01-01 08:19:00+00 | 971999 |   9 | x
(4 rows)

This is fast (milliseconds). But imagine that you want to go to the latest page, it can take many seconds on this 1 million rows table:

yugabyte=> explain analyze select * from read_by_page(1,'2022-01-01 00:00:01+00','000fffff-ffff-ffff-ffff-ffffffffffff',1000);

                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Function Scan on read_by_page  (cost=0.25..10.25 rows=1000 width=68) (actual time=21418.485..21418.485 rows=1 loops=1)
 Planning Time: 0.031 ms
 Execution Time: 21418.510 ms
(3 rows)

This 21 seconds to get only the last row, probably because all previous rows had to be fetched. But the execution plan on a function doesn't tell a lot

Prepared statement

I would have prefer to use prepared statement but the problem is, because of the condition on ($2 is null and $3 is null) a generic plan cannot be good. I'm showing it to show the problem, and also to get to look at the execution plan. But the right solution is the one above with the function call. Better parse it each time and get a custom plan.

Here is the prepared statement with the same code as the function:

deallocate all;
prepare read_by_page(int,timestamptz,uuid,int) as
select demo1.*,demo2.name
from demo2 join demo1 on demo1.ref=demo2.id
where demo1.key=$1 
 and ( 
 -- for the first page, pass null values
 ($2 is null and $3 is null) 
 -- for the next page, pass last values fetched
 or (demo1.ts,demo1.id)<($2,$3)
) 
order by demo1.ts desc,demo1.id desc
limit $4;

Obviously with the same result:

yugabyte=> execute read_by_page(1,null,null,4);

                  id                  | key |           ts           |  val   | ref | name
--------------------------------------+-----+------------------------+--------+-----+------
 fffc4d86-cef6-4caa-9dd9-6dc47866d7aa |   1 | 2022-01-01 08:19:00+00 | 559999 |   9 | x
 ffb4fde0-0cba-4e52-b011-385b0821160d |   1 | 2022-01-01 08:19:00+00 | 130499 |   9 | x
 ff866c28-4105-4557-b227-feb9e6f2cc48 |   1 | 2022-01-01 08:19:00+00 | 633499 |   9 | x
 ff7ff77d-6e42-48fe-8691-8b0d8f6b8dac |   1 | 2022-01-01 08:19:00+00 | 669999 |   9 | x
(4 rows)

yugabyte=> execute read_by_page(1,'2022-01-01 08:19:00+00','ff7ff77d-6e42-48fe-8691-8b0d8f6b8dac',4);

                  id                  | key |           ts           |  val   | ref | name
--------------------------------------+-----+------------------------+--------+-----+------
 ff05778d-b494-4fab-b144-0e90376974ab |   1 | 2022-01-01 08:19:00+00 | 688999 |   9 | x
 fef0ef1c-9e89-4ba1-ac6f-1791f4c58a96 |   1 | 2022-01-01 08:19:00+00 | 170499 |   9 | x
 fec4b1f8-c0e4-48bb-8738-6adc64348376 |   1 | 2022-01-01 08:19:00+00 | 147499 |   9 | x
 febd8f05-dde5-4575-8dc2-d230e390fbc1 |   1 | 2022-01-01 08:19:00+00 | 971999 |   9 | x
(4 rows)

and the same time for the last page. But now with more insights on the execution.

Explain Analyze

To understand the performance complexity, I look at the execution plan, with EXPLAIN to be sure that I'm doing a range scan on the index, and , with ANALYZE check that I read exactly the number of rows I need.

I take care that the custom plan is used (no $1 $2 $3 $4 in the conditions). If not, I'll reallocate and re-prepare the statement. The current YugabyteDB version is PostgreSQL 11.2 compatible which doesn't have a plan_cache_mode parameter.

yugabyte=> explain analyze execute read_by_page(1,'2022-01-01 00:00:01+00','000fffff-ffff-ffff-ffff-ffffffffffff',1000);
                                                                           QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------
-------------------
 Limit  (cost=0.00..237.42 rows=1000 width=68) (actual time=26446.024..26748.760 rows=1000 loops=1)
   ->  Nested Loop  (cost=0.00..23753.30 rows=100046 width=68) (actual time=26446.022..26748.240 rows=1000 loops=1)
         ->  Index Scan using demo1_key_ts_id on demo1  (cost=0.00..12759.84 rows=100046 width=36) (actual time=26444.459..26447.198 rows=1000
 loops=1)
               Index Cond: ((key = 1) AND (ROW(ts, id) < ROW('2022-01-01 00:00:01+00'::timestamp with time zone, '000fffff-ffff-ffff-ffff-ffff
ffffffff'::uuid)))
               Rows Removed by Index Recheck: 998000
         ->  Index Scan using demo2_pkey on demo2  (cost=0.00..0.11 rows=1 width=36) (actual time=0.285..0.285 rows=1 loops=1000)
               Index Cond: (id = demo1.ref)
 Planning Time: 12.457 ms
 Execution Time: 26751.767 ms
(9 rows)

The Rows Removed by Index Recheck tells me where the problem is. It means that I've read more rows than what I needed. The same as OFFSET would have done. With an offset of 4, this is still fast, but here for the last page, 998000 have been read to be discarded later. Rows Removed by Index Recheck is to avoid in a distributed SQL database like YugabyteDB because those rows have to be shipped between nodes before being discarded.

I wanted to be smart with row-wise comparison (demo1.ts,demo1.id)<($2,$3). It was already not so smart with the need to test for null before (because the semantic of this is nulls last). For correct index usage, I need to do column-wise comparison, replacing this with (DEMO1.TS<=2 and (DEMO1.TS <$2 or (DEMO1.TS =$2 and DEMO1.ID <$3)))

deallocate all;
prepare read_by_page(int,timestamptz,uuid,int) as
select demo1.*,demo2.name
from demo2 join demo1 on demo1.ref=demo2.id
where demo1.key=$1 
 and ( 
 -- for the first page, pass null values
 ($2 is null and $3 is null) 
 -- for the next page, pass last values fetched
 or ((DEMO1.TS<=$2 and (DEMO1.TS <$2 or (DEMO1.TS =$2 and DEMO1.ID <$3))))
) 
order by demo1.ts desc,demo1.id desc
limit $4;

Trying again the second page with this change:

yugabyte=# explain analyze execute read_by_page(1,'2022-01-01 00:00:01+00','000fffff-ffff-ffff-ffff-ffffffffffff',1000);

                                                                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..249.92 rows=1000 width=68) (actual time=18.769..305.666 rows=1000 loops=1)
   ->  Nested Loop  (cost=0.00..25069.13 rows=100307 width=68) (actual time=18.768..305.425 rows=1000 loops=1)
         ->  Index Scan using demo1_key_ts_id on demo1  (cost=0.00..14046.96 rows=100307 width=36) (actual time=18.121..19.741 rows=1000 loops=1)
               Index Cond: ((key = 1) AND (ts <= '2022-01-01 00:00:01+00'::timestamp with time zone))
               Filter: ((ts < '2022-01-01 00:00:01+00'::timestamp with time zone) OR ((ts = '2022-01-01 00:00:01+00'::timestamp with time zone) AND (id < '000fffff-ffff-ffff-ffff-ffffffffffff'::uuid)))
         ->  Index Scan using demo2_pkey on demo2  (cost=0.00..0.11 rows=1 width=36) (actual time=0.275..0.275 rows=1 loops=1000)
               Index Cond: (id = demo1.ref)
 Planning Time: 6.833 ms
 Execution Time: 309.985 ms
(9 rows)

Now the plan still seeks to the right timestamp, with no Rows Removed by Index Recheck. This means that the condition has filtered all the rows when reading in the storage layer, seeking directly to the right point.

In this example, I can execute the prepared statement many times and it still uses the custom plan without switching to the generic one, which is good. But I would not rely on it. Better use a function there.

Trying the function again

Here is the function I would use for this case, the code being generic, the plan being custom, and the condition verified as full pushed down on YugabyteDB:

create or replace function read_by_page(int,timestamptz,uuid,int) 
returns table(id uuid, key int, ts timestamptz, val int, ref int, name text)
as $$
select demo1.*,demo2.name
from demo2 join demo1 on demo1.ref=demo2.id
where demo1.key=$1 
 and ( 
 -- for the first page, pass null values
 ($2 is null and $3 is null) 
 -- for the next page, pass last values fetched
 or (DEMO1.TS<=$2 and (DEMO1.TS <$2 or (DEMO1.TS =$2 and DEMO1.ID <$3)))
) 
order by demo1.ts desc,demo1.id desc
limit $4;
$$ language sql;

I expected this to be fast, but:

yugabyte=> explain analyze select * from read_by_page(1,'2022-01-01 00:00:01+00','000fffff-ffff-ffff-ffff-ffffffffffff',1000);

                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Function Scan on read_by_page  (cost=0.25..10.25 rows=1000 width=68) (actual time=21657.987..21657.988 rows=1 loops=1)
 Planning Time: 0.028 ms
 Execution Time: 21658.008 ms
(3 rows)

My guess is that it uses a generic plan. Then, the solution if you want to use a function is to define two queries in a plpgsql function for the first and next page:

create or replace function read_by_page(p_key int, p_ts timestamptz, p_id uuid, p_rows int) 
returns table(id uuid, key int, ts timestamptz, val int, ref int, name text)
as $$ begin
if (p_id is null and p_ts is null) then
 -- for the first page, pass null values
return query
 select demo1.*,demo2.name
  from demo2 join demo1 on demo1.ref=demo2.id
  where demo1.key=p_key
  order by demo1.ts desc,demo1.id desc
  limit p_rows;
else
 -- for the next page, pass last values fetched
return query
 select demo1.*,demo2.name
  from demo2 join demo1 on demo1.ref=demo2.id
  where demo1.key=p_key
   and (DEMO1.TS<=$2 and (DEMO1.TS <$2 or (DEMO1.TS =$2 and DEMO1.ID <$3)))
  order by demo1.ts desc,demo1.id desc
  limit p_rows;
end if;
end; $$ language plpgsql;

This gives even the last page fast:

yugabyte=# select * from read_by_page(1,'2022-01-01 00:00:01+00','000fffff-ffff-ffff-ffff-ffffffffffff',1000);

                  id                  | key |           ts           |  val  | ref | name
--------------------------------------+-----+------------------------+-------+-----+------
 00015f08-b5b1-460b-97e1-7995b9a8f1f8 |   1 | 2022-01-01 00:00:00+00 | 78000 |   0 | xy
(1 row)

yugabyte=#  explain analyze select * from read_by_page(1,'2022-01-01 00:00:01+00','000fffff-ffff-ffff-ffff-ffffffffffff',1000);
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Function Scan on read_by_page  (cost=0.25..10.25 rows=1000 width=68) (actual time=9.291..9.291 rows=1 loops=1)
 Planning Time: 0.027 ms
 Execution Time: 9.318 ms

Of course, you can also build the custom query, as:

select demo1.*,demo2.name
from demo2 join demo1 on demo1.ref=demo2.id
where demo1.key= 1 
  and (
    DEMO1.TS <= '2022-01-01 00:00:01+00'
    and (
      DEMO1.TS < '2022-01-01 00:00:01+00'
      or (
        DEMO1.TS = '2022-01-01 00:00:01+00'
        and DEMO1.ID < 'ffffffff-ffff-ffff-ffff-ffffffffffff'
      )
    )
  )
order by demo1.ts desc,demo1.id desc
limit 42;

If you struggle converting (demo1.ts,demo1.id) <= ('2022-01-01 00:00:01+00','ffffffff-ffff-ffff-ffff-ffffffffffff') you can use the excellent jOOQ translation tool: https://www.jooq.org/translate/ with a target that doesn't support row-wise comparison like "MS Access", "SQL Server" or "Oracle".

For recent PostgreSQL versions, it is possible to use a generic prepared statement, with plan_cache_mode=force_custom_plan.

For YugabyteDB I've opened issue 11794 to get the same performance with row-wise comparison, but the workaround is easy.

In all cases, I like the plpgsql function: this is not business logic, but purely data processing tied to the table definition, and this is where stored procedures are easy and efficient.