DEV Community

Franck Pachot for AWS Heroes

Posted on • Updated on

Index Only Scan on Functional Indexes

In a past post I detailed how the most common RDBMS can avoid the most expensive operation in an access by index, the lookup to the rows scattered in the table, with Index Only Scan. I mentioned the limitation with PostgreSQL where the ACID visibility of the row is not stored in the index and then Index Only Scan makes sense with freshly vacuumed tables only. There's another limitation, which is easy to workaround with a little redundant storage in the index or the table.

Here is my table without indexes for the moment:

postgres=# set enable_bitmapscan=false;
SET
postgres=# create table demo (id bigint, username text);
CREATE TABLE
postgres=# insert into demo select n,'Number'||to_hex(n) from generate_series(1,1000) n;
INSERT 0 1000
postgres=# vacuum demo;
VACUUM
postgres=# select * from demo where username='Number42';
 id | username
----+----------
 66 | Number42
(1 row)
Enter fullscreen mode Exit fullscreen mode

Note that I've disabled Bitmap Scan to get the testcase simpler. The goal is to show Index Scan vs. Index Only Scan.

My goal is to search with a function applied on the column, en return only that value (with the function applied):

postgres=# explain analyze select * from demo 
           where upper(username)='NUMBER42';

                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Seq Scan on demo  (cost=0.00..22.00 rows=5 width=17) (actual time=0.057..0.657 rows=1 loops=1)
   Filter: (upper(username) = 'NUMBER42'::text)
   Rows Removed by Filter: 999
Enter fullscreen mode Exit fullscreen mode

I can add a Function Based Index to get fast access to those rows:

postgres=# create index demo_upper on demo( (upper(username)) );
CREATE INDEX
postgres=# explain analyze select upper(username) from demo 
           where upper(username)='NUMBER42';

                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Index Scan using demo_upper on demo  (cost=0.28..20.38 rows=5 width=32) (actual time=0.025..0.026 rows=1 loops=1)
   Index Cond: (upper(username) = 'NUMBER42'::text)
Enter fullscreen mode Exit fullscreen mode

That's not bad but we can do better. Why not using an Index Only Scan here as all information I need (upper(username)) is there in the index and the visibility map (I got no changes since the last vacuum)? The problem is that the query planner just takes the expression, sees that there's an index on it, knows that I'll select the "username" column and apply a function on it. Then it thinks it needs the "username" column without realizing it already has the value with the function applied.

covering index

One workaround is easy: add the column to the index. And this can be done with a covering index:

postgres=# create index demo_upper_covering on demo( (upper(username))) include (username )
CREATE INDEX
postgres=# analyze demo;
ANALYZE

postgres=# explain analyze select upper(username) from demo 
           where upper(username)='NUMBER42';

                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using demo_upper_covering on demo  (cost=0.28..4.38 rows=5 width=32) (actual time=0.027..0.029 rows=1 loops=1)
   Index Cond: ((upper(username)) = 'NUMBER42'::text)
   Heap Fetches: 0
Enter fullscreen mode Exit fullscreen mode

Here is my Index Only Scan. However, the index is a bit larger, which means less pages stays in the shared buffers and filesystem cache.

calculated column

Another possibility since PG12 is to add this (upper(username) within the table. The big advantage is that queries will use this generated column without having to code the expression each time, without the risk of being inconsistent between the where clause and the select clause. And we will index it so that there's no doubt about its presence in the index.

postgres=# alter table demo add column upper_username text generated always as (upper(username)) stored;
ALTER TABLE
postgres=# create index demo_upper_stored on demo( upper_username );
CREATE INDEX
postgres=# analyze demo;
ANALYZE

postgres=# explain analyze select upper_username from demo
           where upper_username='NUMBER42';

                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using demo_upper_stored on demo  (cost=0.28..8.29 rows=1 width=9) (actual time=0.022..0.023 rows=1 loops=1)
   Index Cond: (upper_username = 'NUMBER42'::text)
   Heap Fetches: 1
Enter fullscreen mode Exit fullscreen mode

Here it is: easy to query for the developer, easy to index for the DBA, easy to optimize for the query planner. There's still some redundant storage here, but in the table which is a smaller problem (as the goal is not to read the table here...)

versions and alternatives

Currently (PG13) only STORED generated columns are supported. If one day virtual ones are allowed, it may be even better. About PostgreSQL-compatible databases, AWS Aurora supports all those with the provisioned version. The serverless one is PG10 compatible and supports functional indexes but not generated columns. YugabyteDB is currently PG11 compatible so the solution is functional index. CockroachDB has no function based indexes, but generated columns (stored and virtual but only indexes on stored ones do not need to go to the primary one)

db<>fiddle for this: https://dbfiddle.uk/?rdbms=postgres_13&fiddle=7a4dc212f126a17e4014c41dbcd78a74

The best performance optimization for OLTP is to avoid the random reads of accessing the table for the critical use-cases that read many rows. SSD have reduce the random read latency, but now with distributed databases we want to avoid cross-node latency. RDBMS have many possibilities for storing redundant data to keep clustered what will be queried together. And, thanks to SQL (and Edgar F. Codd relational rules 8 and 9), this is only DDL with no need to change the DML in your code, for better scalability without compromising agility. Function Based Indexes and Covering Indexes are key features for that, allowing Index Only Scan even on Secondary Indexes.

Discussion (2)

Collapse
hunleyd profile image
Douglas J Hunley

That's not a covering index. A covering index allows a user to perform an index-only scan if the select list in the query matches the columns that are included in the index. You can specify the additional columns for the index using the "INCLUDE" keyword, e.g.

CREATE INDEX a_b_idx ON x (a,b) INCLUDE (c);

What you have is a multicolumn index.

Collapse
franckpachot profile image
Franck Pachot Author

Right, include is the one I wanted to put there and I'll update the post. Doesn't change the idea anyway as if you update the column you will move the entry anyway. Both are covering the projection and the query planner see it, that's the important point.