Franck Pachot for YugabyteDB

Posted on Dec 5, 2022

Index usage monitoring in YugabyteDB & PostgreSQL

#yugabytedb #postgres #index #monitoring

Secondary indexes add overhead on DML, so removing unused ones is a good idea. But how are you sure they are unused? Gunnar Morling asked about identifying unused indexes and the result is that only few people have a process fot that:

PostgreSQL (and then YugabyteDB) gathers statistics, visible from pg_stat_user_indexes, showing the number of Index Scan and number of index entries read from it:

idx_scan: Number of index scans initiated on this index
idx_tup_read: Number of index entries returned by scans on this index

However, I think the meaning of those values is not well known (and this is visible from this poll). What is an initiated scan?

I like to run small examples to get full understanding. I prepare a query (index_usage(tablename_pattern)) to show those index statistics. I also use my ybwr.sql script to show the reads at YugabyteDB distributed storage level.

prepare index_usage(text) as
 select indexrelname,idx_scan,idx_tup_read,idx_tup_fetch from pg_stat_user_indexes
 where relname like $1;

\! curl -s https://raw.githubusercontent.com/FranckPachot/ybdemo/main/docker/yb-lab/client/ybwr.sql | grep -v '\watch' > ybwr.sql
\i ybwr.sql

I create two tables, because I also want to show what happens with a Nested Loop join:

create table demo1 ( id1 int, primary key (id1 asc), n int);
create unique index demo1_n_key on demo1 ( n asc );
alter table demo1 add unique using index demo1_n_key;
create table demo2 ( id1 int references demo1 , x int);
create index demo2_id1 on demo2(id1 asc) include(x);
insert into  demo1 select n, n from generate_series(1,10) n;
insert into  demo2 select id1,0 from demo1, generate_series(1,7);

Index Scan on the primary key

For each test, I'll reset the statistics. This is a simple range scan on the primary key for id1 <= 6:

select pg_stat_reset(); execute snap_reset;
explain (analyze, costs off, summary off)
 select * from demo1 where id1 <= 6 ;
execute snap_table; execute index_usage('demo%');

This is an Index Scan on the primary key (which, in YugabyteDB, is actually a covering index because the table is stored in the primary key), returning 6 rows:

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Index Scan using demo1_pkey on demo1 (actual time=0.858..0.861 rows=6 loops=1)
   Index Cond: (id1 <= 6)
(2 rows)

From the YugabyteDB tablet, 6 rows have been read with a next() in the LSM-Tree, after a seek() to the start of the range:

yugabyte=# execute snap_table; execute index_usage('demo%');
 rocksdb_seek | rocksdb_next | rocksdb_insert |        dbname / relname / tserver / tabletid / leader
--------------+--------------+----------------+--------------------------------------------------------------
            1 |            6 |                | yugabyte demo1 c200c76ebc304d858082053b1363a01f L 10.0.0.141
(1 row)

This shows the table name (demo1) because the index (demo1_pkey) is only a logical object seen by YSQL, the PostgreSQL-compatible query layer.

In pg_stat_user_indexes, provided by YSQL, we see the statistics with the index name (demo1_pkey):

 indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
--------------+----------+--------------+---------------
 demo1_pkey   |        1 |            6 |             0
 demo1_n_key  |        0 |            0 |             0
 demo2_id1    |        0 |            0 |             0
(3 rows)

One scan, 6 rows... no surprise here.

Index Scan on a Secondary Index

I have created a secondary index, used by the unique constraint. Here are the statistics for a similar range scan:

select pg_stat_reset(); execute snap_reset;
explain (analyze, costs off, summary off)
 select * from demo1 where n <= 6 ;
execute snap_table; execute index_usage('demo%');

This shows as Index Scan and, because it is not the primary key, it will read the table rows for each index entry:

                                   QUERY PLAN
---------------------------------------------------------------------------------
 Index Scan using demo1_n_key on demo1 (actual time=1.059..1.063 rows=6 loops=1)
   Index Cond: (n <= 6)
(2 rows)

The table reads are scattered, each one being a seek() to get to the row by its primary key, and a next() to get the values. The scan of the index shows one seek() for the range and one next() per index entry:

yugabyte=# execute snap_table; execute index_usage('demo%');
 rocksdb_seek | rocksdb_next | rocksdb_insert |           dbname / relname / tserver / tabletid / leader
--------------+--------------+----------------+--------------------------------------------------------------------
            6 |            6 |                | yugabyte demo1 c200c76ebc304d858082053b1363a01f L 10.0.0.141
            1 |            6 |                | yugabyte demo1_n_key d677b498cd3b4194999c5df7532b0d3e L 10.0.0.142
(2 rows)

There is one Index Scan in the plan, which means one idx_scan (Number of index scans initiated on this index) and 6 rows read from it:

 indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
--------------+----------+--------------+---------------
 demo1_pkey   |        0 |            0 |             0
 demo1_n_key  |        1 |            6 |             0
 demo2_id1    |        0 |            0 |             0
(3 rows)

Index Only Scan

Without the need to go to the table:

select pg_stat_reset(); execute snap_reset;
explain (analyze, costs off, summary off)
 select count(*) from demo1 where n <= 6 ;
execute snap_table; execute index_usage('demo%');

The Index Only Scan shows the same from index usage perspective (but of course no read from the table's tablet):

yugabyte=# explain (analyze, costs off, summary off)
yugabyte-#  select count(*) from demo1 where n <= 6 ;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Aggregate (actual time=0.821..0.821 rows=1 loops=1)
   ->  Index Only Scan using demo1_n_key on demo1 (actual time=0.813..0.816 rows=6 loops=1)
         Index Cond: (n <= 6)
         Heap Fetches: 0
(4 rows)

yugabyte=# execute snap_table; execute index_usage('demo%');
 rocksdb_seek | rocksdb_next | rocksdb_insert |           dbname / relname / tserver / tabletid / leader
--------------+--------------+----------------+--------------------------------------------------------------------
            1 |            6 |                | yugabyte demo1_n_key 8e4c8ed7e99e4de78fa6c7b3598fba59 L 10.0.0.143
(1 row)

 indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
--------------+----------+--------------+---------------
 demo2_id1    |        0 |            0 |             0
 demo1_pkey   |        0 |            0 |             0
 demo1_n_key  |        1 |            6 |             0
(3 rows)

Filter on the table columns

With an Index Scan and an additional filter on non-indexed column:

select pg_stat_reset(); execute snap_reset;
explain (analyze, costs off, summary off)
 select count(*) from demo1 where n <= 6 and mod(id1,2)=0;
execute snap_table; execute index_usage('demo%');

Even if only 3 rows are returned by the Index Scan, 6 index entries have been read:

yugabyte=# explain (analyze, costs off, summary off)
yugabyte-#  select count(*) from demo1 where n <= 6 and mod(id1,2)=0;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Aggregate (actual time=1.062..1.062 rows=1 loops=1)
   ->  Index Scan using demo1_n_key on demo1 (actual time=1.055..1.058 rows=3 loops=1)
         Index Cond: (n <= 6)
         Filter: (mod(id1, 2) = 0)
         Rows Removed by Filter: 3
(5 rows)

yugabyte=# execute snap_table; execute index_usage('demo%');
 rocksdb_seek | rocksdb_next | rocksdb_insert |           dbname / relname / tserver / tabletid / leader
--------------+--------------+----------------+--------------------------------------------------------------------
            6 |            6 |                | yugabyte demo1 34321a1389e4473e95c03b34706ee0fb L 10.0.0.142
            1 |            6 |                | yugabyte demo1_n_key 8e4c8ed7e99e4de78fa6c7b3598fba59 L 10.0.0.143
(2 rows)

 indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
--------------+----------+--------------+---------------
 demo2_id1    |        0 |            0 |             0
 demo1_pkey   |        0 |            0 |             0
 demo1_n_key  |        1 |            6 |             0
(3 rows)

Nested Loop join

Now, the most interesting to understand the statistics, from those 6 rows I'll join with a Nested Loop to get 7 rows per loop:

select pg_stat_reset(); execute snap_reset;
explain (analyze, costs off, summary off)
 select * from demo1 join demo2 using(id1) where id1 <= 6 ;
execute snap_table; execute index_usage('demo%');

The plan shows the two scans. The one on demo2_id1 has been repeated 6 times (loops=6) with each loop returning 7 rows (rows=7):

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Nested Loop (actual time=0.966..2.489 rows=42 loops=1)
   ->  Index Scan using demo1_pkey on demo1 (actual time=0.613..0.618 rows=6 loops=1)
         Index Cond: (id1 <= 6)
   ->  Index Only Scan using demo2_id1 on demo2 (actual time=0.300..0.303 rows=7 loops=6)
         Index Cond: (id1 = demo1.id1)
         Heap Fetches: 0
(6 rows)

This is 6 seek() to the inner index, with a total of 42 rows read from it:

yugabyte=# execute snap_table; execute index_usage('demo%');
 rocksdb_seek | rocksdb_next | rocksdb_insert |          dbname / relname / tserver / tabletid / leader
--------------+--------------+----------------+------------------------------------------------------------------
            1 |            6 |                | yugabyte demo1 c200c76ebc304d858082053b1363a01f L 10.0.0.141
            6 |           42 |                | yugabyte demo2_id1 33bad7eeee794a779ff805fc1ad6598b L 10.0.0.142
(2 rows)

In the index usage statistics, this is only one scan initiated on demo2_id1 (even if there were multiple executions of it) and 42 rows out of it:

 indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
--------------+----------+--------------+---------------
 demo1_pkey   |        1 |            6 |             0
 demo1_n_key  |        0 |            0 |             0
 demo2_id1    |        1 |           42 |             0
(3 rows)

The number of loops doesn't matter here: one scan returned 42 rows. This is important to understand because you cannot infer the index selectivity from it. The actual selectivity of the id1= predicate is 14% of the table:

yugabyte=# select count(*), count(distinct id1) from demo2;
 count | count
-------+-------
    70 |    10
(1 row)

This is not visible from the 1 idx_scan and 42 idx_tup_read information because we don't know the number of predicate executions from it (the loops from the execution plan).

With Rows Removed by Filter

Here is another example with an additional condition that is covered by the index:

select pg_stat_reset(); execute snap_reset;
explain (analyze, costs off, summary off)
 select * from demo1 join demo2 using(id1) where id1 <= 6 and x=1;
execute snap_table; execute index_usage('demo%');

By default on this YugabyteDB version (2.16.3) there is no expression pushdown. This is visible with Rows Removed by Filter which filters in YSQL:

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Nested Loop (actual time=2.816..2.817 rows=0 loops=1)
   ->  Index Scan using demo1_pkey on demo1 (actual time=0.798..0.805 rows=6 loops=1)
         Index Cond: (id1 <= 6)
   ->  Index Only Scan using demo2_id1 on demo2 (actual time=0.327..0.327 rows=0 loops=6)
         Index Cond: (id1 = demo1.id1)
         Filter: (x = 1)
         Rows Removed by Filter: 7
         Heap Fetches: 0
(8 rows)

This reads the same index entries as before:

yugabyte=# execute snap_table; execute index_usage('demo%');
 rocksdb_seek | rocksdb_next | rocksdb_insert |          dbname / relname / tserver / tabletid / leader
--------------+--------------+----------------+------------------------------------------------------------------
            1 |            6 |                | yugabyte demo1 c200c76ebc304d858082053b1363a01f L 10.0.0.141
            6 |           42 |                | yugabyte demo2_id1 33bad7eeee794a779ff805fc1ad6598b L 10.0.0.142
(2 rows)

and it counts the same in index usage monitoring:

 indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
--------------+----------+--------------+---------------
 demo1_pkey   |        1 |            6 |             0
 demo1_n_key  |        0 |            0 |             0
 demo2_id1    |        1 |           42 |             0
(3 rows)

With Remote Filter

Now enabling the pushdown of the condition, so that rows are filtered when reading on the table servers:

yugabyte=# set yb_enable_expression_pushdown = on;
SET

The plan shows it as Remote Filter:

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Nested Loop (actual time=2.765..2.765 rows=0 loops=1)
   ->  Index Scan using demo1_pkey on demo1 (actual time=0.827..0.839 rows=6 loops=1)
         Index Cond: (id1 <= 6)
   ->  Index Only Scan using demo2_id1 on demo2 (actual time=0.307..0.307 rows=0 loops=6)
         Index Cond: (id1 = demo1.id1)
         Remote Filter: (x = 1)
         Heap Fetches: 0
(7 rows)

And now the index usage statistics, which runs at YSQL level, counts index entries already filtered:

 indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
--------------+----------+--------------+---------------
 demo1_pkey   |        1 |            6 |             0
 demo1_n_key  |        0 |            0 |             0
 demo2_id1    |        1 |            0 |             0
(3 rows)

multiple use of the same index

A last test, when one query has two scans of the same index:

select pg_stat_reset(); execute snap_reset;
explain (analyze, costs off, summary off)
 select id1 from demo1 a join demo1 b using(id1) where id1<=6;
execute snap_table; execute index_usage('demo%');

This counts two idx_scan because, even if on the same index, there are two scan operations in the execution plan:

yugabyte=# explain (analyze, costs off, summary off)
yugabyte-#  select id1 from demo1 a join demo1 b using(id1) where id1<=6;
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Nested Loop (actual time=1.376..3.533 rows=6 loops=1)
   ->  Index Scan using demo1_pkey on demo1 a (actual time=0.797..0.805 rows=6 loops=1)
         Index Cond: (id1 <= 6)
   ->  Index Scan using demo1_pkey on demo1 b (actual time=0.445..0.445 rows=1 loops=6)
         Index Cond: (id1 = a.id1)
(5 rows)

yugabyte=# execute snap_table; execute index_usage('demo%');
 rocksdb_seek | rocksdb_next | rocksdb_insert |          dbname / relname / tserver / tabletid / leader
--------------+--------------+----------------+------------------------------------------------------------------
            1 |            6 |                | yugabyte demo1 34321a1389e4473e95c03b34706ee0fb L 10.0.0.142
            7 |           12 |                | yugabyte demo1 3c66854f503c43fc919867d8c1ecb27f L 10.0.0.141
            6 |          126 |                | yugabyte demo2_id1 33bad7eeee794a779ff805fc1ad6598b L 10.0.0.142
(3 rows)

 indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
--------------+----------+--------------+---------------
 demo1_pkey   |        2 |           12 |             0
 demo1_n_key  |        0 |            0 |             0
 demo2_id1    |        0 |            0 |             0
(3 rows)

Overhead on DML

There are two problems when having too many indexes. One is that it can confuse the query planner. The less choices, the less possibility of mistakes. The other is the overhead on DML and this can be observed from YugabyteDB rocksdb_insert related to index entries. I insert 70 more rows into demo2:

yugabyte=# insert into  demo2 select id1,0 from demo1, generate_series(1,7)
;
INSERT 0 70

yugabyte=# execute snap_table; 

 rocksdb_seek | rocksdb_next | rocksdb_insert |          dbname / relname / tserver / tabletid / leader
--------------+--------------+----------------+------------------------------------------------------------------
           21 |           27 |                | yugabyte demo1 c200c76ebc304d858082053b1363a01f L 10.0.0.141
              |              |             18 | yugabyte demo2 hash_split: [0x0000, 0x5554]   10.0.0.141
              |              |             18 | yugabyte demo2 hash_split: [0x0000, 0x5554]   10.0.0.142
           72 |              |             18 | yugabyte demo2 hash_split: [0x0000, 0x5554] L 10.0.0.143
              |              |             29 | yugabyte demo2 hash_split: [0x5555, 0xAAA9]   10.0.0.142
              |              |             29 | yugabyte demo2 hash_split: [0x5555, 0xAAA9]   10.0.0.143
          116 |              |             29 | yugabyte demo2 hash_split: [0x5555, 0xAAA9] L 10.0.0.141
              |              |             23 | yugabyte demo2 hash_split: [0xAAAA, 0xFFFF]   10.0.0.141
              |              |             23 | yugabyte demo2 hash_split: [0xAAAA, 0xFFFF]   10.0.0.143
           92 |              |             23 | yugabyte demo2 hash_split: [0xAAAA, 0xFFFF] L 10.0.0.142
              |              |             70 | yugabyte demo2_id1 33bad7eeee794a779ff805fc1ad6598b   10.0.0.141
              |              |             70 | yugabyte demo2_id1 33bad7eeee794a779ff805fc1ad6598b   10.0.0.143
          140 |              |             70 | yugabyte demo2_id1 33bad7eeee794a779ff805fc1ad6598b L 10.0.0.142
(13 rows)

The reads on demo1 are for referential integrity validation. The inserts into demo2 table are those 70 rows: 18+29+23 packed rows written to the Raft leaders (L), into to 3 hash sharded tablets. Each one has been written into the Raft followers (this is a RF=3 cluster) waiting for the majority to acknowledge.

Note that I've not defined a primary key for demo2 and an internal one has been created, with hash sharding, and 3 tablets, which you can see with the range of the hash code from 0x0000 to 0xFFFF. I made it simple for this demo but in general you shoiuld define a primary key for the most important access pattern.

Now about the index, I have only one secondary index here (demo2_id1) and 70 entries have been written to it. If you see lot of rocksdb_insert for indexes that are not used by queries, then this may be something to look at.

Make the index invisible before dropping it

A final note, if you want to make the index invisible before dropping it, an unsupported way is to set pg_index.indisvalid=false:

yugabyte=# explain (analyze, costs off, summary off)
yugabyte-#  select id1 from demo2 where id1<=6;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Index Only Scan using demo2_id1 on demo2 (actual time=1.267..1.303 rows=84 loops=1)
   Index Cond: (id1 <= 6)
   Heap Fetches: 0
(3 rows)

yugabyte=# update pg_index set indisvalid=false where indexrelid='demo2_id1'::regclass;
UPDATE 1

yugabyte=# \c
psql (13.7, server 11.2-YB-2.15.3.2-b0)
You are now connected to database "yugabyte" as user "yugabyte".
yugabyte=# explain (analyze, costs off, summary off)
 select id1 from demo2 where id1<=6;
                          QUERY PLAN
--------------------------------------------------------------
 Seq Scan on demo2 (actual time=0.773..1.926 rows=84 loops=1)
   Filter: (id1 <= 6)
   Rows Removed by Filter: 56
(3 rows)

Note that this requires a re-connect on YugabyteDB (the metadata is cached and not refreshed with manual update). It is a bad idea anyway because this flasg was not made for this purpose. An example of possible problem in PostgreSQL:

Keep an history

Looking at cumulative statistics is never a good idea be it is easy to store snapshot of them and query for delta values:

create table if not exists ybwr_index_usage (
 primary key ( (schemaname,relname) hash, indexrelname, ts desc )
 , schemaname name, relname name, indexrelname name
 , ts timestamptz, idx_scan bigint, idx_tup_read bigint
);

insert into ybwr_index_usage
 select schemaname, relname, indexrelname, now(), idx_scan, idx_tup_read
 from pg_stat_user_indexes;

select schemaname, relname, indexrelname, ts, new_idx_scan, new_idx_tup_read, seconds from (
select *
, idx_scan     - lag(idx_scan    ) over idx as new_idx_scan
, idx_tup_read - lag(idx_tup_read) over idx as new_idx_tup_read
, extract(epoch from ts - lag(ts) over idx) as seconds
from ybwr_index_usage
window idx as (partition by schemaname, relname, indexrelname order by ts asc)
) history 
where new_idx_scan>0
order by ts, schemaname, relname, indexrelname;

In summary

Observing idx_scan increase during a time window helps to identify the usage of an index in an execution plan operation. This is what helps to identify indexes that are not used. Do you need to drop them? It depends. If the index was created on purpose, then maybe the problem is that it is not chosen by the query planner.

Look at the meaning of the index and try to understand why it is not used (bad statistics, or another index covers the same cases), and then decide what to do. Also look at the impact of the index: its size on storage, its number of tablets, and the number of rocksdb_insert to it.

The number of idx_scan is the number if Index Scan (or Index Only Scan) visible in the execution plan.

Observing idx_tup_read can give an idea of the index usage efficiency. A small number probably means that the index is used efficiently, to access quickly to few rows. But a high number may have many reasons. It may come from a bad selectivity (and them maybe a Seq Scan is acceptable) or it may come from a high number of Nested Loop, which is also a problem to look at, but nothing related to the index you are looking at.

Apart from the statistics on distributed storage, the same applies to PostgreSQL. Note that the maintenance of unused indexes is more problematic given the PostgreSQL implementation of MVCC.

DEV Community