Franck Pachot for YugabyteDB

Posted on Sep 3, 2021 • Edited on May 11, 2022

Index Scan in YugabyteDB

#yugabytedb #index #database

The goal of this post is to help reading the index access path in a YugabyteDB execution plan.

In a previous post I've explained the importance of the Index Only Scan access in PostgreSQL. YugabyteDB uses PostgreSQL for its query layer (and then the same explain plan semantic) but plugs-in a different storage, DocDB, to store tables and indexes (instead of heap tables and B*Tree indexes). And then, because the primary index contains the whole table rows (like a clustered index in SQL Server, an IOT in Oracle, an InnoDB table in MySQL...), what is displayed as "Index Scan" is actually the equivalent of an "Index Only Scan" when on the primary index.

From the application developer point of view, it may be preferable to show Index Only. But our compatibility with PostgreSQL is not only about on the API. We also consider the community of contributors (and that's why YugabyteDB is fully open source). And changing the semantic in the PostgreSQL index access code may add confusion there.

Note: this may change in the future, using the PostgreSQL "IndexScan" to "IndexOnlyScan" for primary key, as if they have all columns in "include" - which is actually how it is implemented. Follow https://github.com/yugabyte/yugabyte-db/issues/6955 to see when this changes (I'll also update this post)

Test case

Here is a simple table with 3 columns:

"a" is the primary key
"b" has a secondary index on it
"c" is not in the index

yugabyte=# create table demo (
 a int constraint demo_a_pk primary key,
 b int, c int) split into 1 tablets;

CREATE TABLE

yugabyte=# create index demo_b_si on demo(b) split into 1 tablets;

CREATE INDEX

yugabyte=# insert into demo (a, b, c)
 select n, n, n from generate_series(1,1000) n;

INSERT 0 1000

Index Scan on Primary Index

I'll query on the primary key (where a=42) to retreive only the primary key so that I'm sure I hit only the index structure.

yugabyte=# explain (analyze, verbose, summary false)
 select a from demo where a=42;

                                                      QUERY PLAN
-----------------------------------------------------------------
 Index Scan using demo_a_pk on public.demo  (cost=0.00..4.11 rows=1 width=4) (actual time=0.459..0.461 rows=1 loops=1)
   Output: a
   Index Cond: (demo.a = 42)

This would be an "Index Only Scan" in PostgreSQL but is displayed as an "Index Scan" on YugabyteDB because there's nothing specific there: the access by the primary index does not depend on which columns are in the index. The optimization for it is down to the storage layer.

I'll run it in a loop to look at the execution time:

yugabyte=# \timing on
yugabyte=# do $$ begin for i in 1..50000 loop
 perform a from demo where a=42;
 end loop; end; $$;

DO
Time: 15405.692 ms (00:15.406)

This takes 15 seconds here and I can see that only one node is doing read operations (3199 per second here):

This node holds my table tablet which is a LSM-tree structure. Its RaftConfig shows this node as the leader - where strong consistent read happen:

Table: yugabyte.demo (000030af000030008000000000004311) 
    FOLLOWER: 10.0.0.63
    LEADER:   10.0.0.61
    FOLLOWER: 10.0.0.62

Now, if I query a column that is not in the primary key:

yugabyte=# explain (analyze, verbose, summary false)
 select c from demo where a=42;

                                                      QUERY PLAN
-----------------------------------------------------------------
 Index Scan using demo_a_pk on public.demo  (cost=0.00..4.11 rows=1 width=4) (actual time=0.636..0.638 rows=1 loops=1)
   Output: c
   Index Cond: (demo.a = 42)

\timing on
yugabyte=# do $$ begin for i in 1..50000 loop
 perform c from demo where a=42;
 end loop; end; $$;

DO
Time: 15636.146 ms (00:15.636)

I see exactly the same plan and time. In PostgreSQL those are different operations (Index Only Scan and Index Scan) but here in YugabyteDB they are the same and they are handled by the Index Scan code and show up as this in the execution plan.

In summary: if you see Index Scan on a primary index, where you would have expected an Index Only Scan, don't panic, this is the most efficient operation that can be done. YugabyteDB is designed for OLTP where "select * from table where pk=$1" is fully optimized.

Index Scan on Secondary Index

Now, what about the additional indexes? They contain the index keys, and a reference to the table row when additional columns are required.

I'm doing something similar to the above example, but now on the index on "b", reading only that column for the selection and the projection:

yugabyte=# explain (analyze, verbose, summary false)
 select b from demo where b=42;

\timing on
yugabyte=# do $$ begin for i in 1..50000 loop
 perform b from demo where b=42;
 end loop; end; $$;
                                                         QUERY PLAN
-----------------------------------------------------------------
 Index Only Scan using demo_b_si on public.demo  (cost=0.00..5.12 rows=10 width=4) (actual time=0.932..0.934 rows=1 loops=1)
   Output: b
   Index Cond: (demo.b = 42)
   Heap Fetches: 0

Because this is not the primary index, it is displayed as Index Only Scan in the execution plan - even if it is very similar to the previous one: a secondary index like another table at DocDB level. The PostgreSQL layer (YSQL) is linking them though an internal pointer based on the primary key (ybctid)

You can ignore the "Heap Fetches: 0". This is for PostgreSQL where Index Only may have to read the visibility from the table but in YugabyteDB the index has all information.

Now if I read a column that is not indexed, it needs to access to the table:

yugabyte=# explain (analyze, verbose, summary false)
 select c from demo where b=42;

                                                       QUERY PLAN
-----------------------------------------------------------------
 Index Scan using demo_b_si on public.demo  (cost=0.00..5.22 rows=10 width=4) (actual time=1.784..1.787 rows=1 loops=1)
   Output: c
   Index Cond: (demo.b = 42)

\timing on
yugabyte=# do $$ begin for i in 1..50000 loop
 perform c from demo where b=42;
 end loop; end; $$;

DO
Time: 31771.423 ms (00:31.771)

You see twice the time and because I have the table and the index in different nodes (I explicitly defined 1 tablet and didn't define them colocated) the reads are distributed to two nodes:

Guess where is the Raft leader for the index tablet?

Table: yugabyte.demo_b_si (000030af000030008000000000004316) 
    LEADER:   10.0.0.62
    FOLLOWER: 10.0.0.61
    FOLLOWER: 10.0.0.63

Did I mention that at DocDB level indexes are tables? That's why you see my index "demo_b_si" as a table type. This is all accessible from the web console (yb-master:7000) if you have any doubt. And you can play with yb-admin to modify the placement (this is used in geo-distributed databases).

Note that the total throughput in "Read ops/s" here doesn't change because I'm running one session only so I clearly see that the work is split between reading from the index and reading from the table. It runs in twice the time because now we have two of those ops for each row accessed.

If I want to avoid this double read, I can add include (c) to my index creation and this brings back the Index Only Scan - as mentioned in the previous post.

Now, one more thing that may surprise you. What if I select "a" which is the primary key. It is not part of the secondary index, but you can expect it to be there internally to reference the table row.

yugabyte=# explain (analyze, verbose, summary false)
 select a from demo where b=42;

                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Index Scan using demo_b_si on public.demo  (cost=0.00..5.22 rows=10 width=4) (actual time=1.482..1.484 rows=1 loops=1)
   Output: a
   Index Cond: (demo.b = 42)

\timing on
yugabyte=# do $$ begin for i in 1..50000 loop
 perform a from demo where b=42;
 end loop; end; $$;

DO
Time: 36701.568 ms (00:36.702)

This is an Index Scan, exactly like the above, reading from the secondary index and the table. The internal pointer to the table row ("ybctid") is there, encoded from the primary key columns, but is not used (currently) to get the value for the projection.

If you want an Index Only Scan here, you need to include explicitly the primary key in the index:

yugabyte=# create index demo_b_si_covering on demo(b)
 include (a)
 split into 1 tablets;

CREATE INDEX
Time: 3272.519 ms (00:03.273)

yugabyte=# explain (analyze, verbose, summary false)
 select a from demo where b=42;

                                                              QUERY PLAN
-----------------------------------------------------------------
 Index Only Scan using demo_b_si_covering on public.demo  (cost=0.00..5.12 rows=10 width=4) (actual time=0.622..0.623 rows=1 loops=1)
   Output: a
   Index Cond: (demo.b = 42)
   Heap Fetches: 0

\timing on
yugabyte=# do $$ begin for i in 1..50000 loop
 perform a from demo where b=42;
 end loop; end; $$;

DO
Time: 15397.191 ms (00:15.397)

We are back to an Index Only Scan here. The overhead is minimal: a few bytes to store (compressed) and a low probability to update it (it is a primary key and should be immutable).

TL;DR

YugabyteDB uses the same explain plan semantic as PostgreSQL. "Index Only Scan" is available when all columns used by the query are in the secondary index, either in the indexed column list or the "include" one.
on a primary index you always see Index Scan, because the PostgreSQL code is designed for heap tables were all indexes are secondary indexes. However, the internal implementation makes it as efficient as an Index Only Scan, whatever columns are in the primary key. Think of it like an implicit "INCLUDE (*)" in the index that enforces the primary key, but that doesn't show off because it didn't feel like doing any additional work for this.

DEV Community

Index Scan in YugabyteDB

Test case

Index Scan on Primary Index

Index Scan on Secondary Index

TL;DR

Oldest comments (0)