Advanced SQL Tuning: Principles of Index Design
Indexes are one of the most important tools for optimizing database performance. However, creating too many indexes or indexing the wrong columns can also have a negative impact on performance. Therefore, it is important to follow certain principles when designing indexes.
The most important principle for creating efficient indexes is to create indexes based on your workload, not your table structure. The purpose of an index is to improve the efficiency of operations in the database, and the SQL statements executed against a database make up its workload. Therefore, index creation approaches that do not start from the workload are misguided.
When building a set of indexes for a workload, we need to consider the following characteristics of the workload:
- SQL type: In OLTP scenarios, where users frequently insert new data and modify existing data, too many indexes can hurt performance; it is recommended to create the minimum number of indexes that meets your requirements. In OLAP scenarios, where queries are the dominant usage, you can add more indexes, each index can include multiple columns, and you can even create function indexes and partial (conditional) indexes.
- SQL frequency: Indexes should be created for the most frequently used queries. By creating the best indexes for these queries, the overall performance of the system can be maximized.
- Importance of SQL queries: The more important a query is, the more you may want to optimize its performance by creating an index.
- The structure of the SQL query itself.
The functions of indexes are as follows:
- Quickly locate data
- Avoid sorting
- Avoid table lookup
Indexes can quickly locate data by matching the query conditions, which may be in the WHERE clause, HAVING clause, or ON clause. The matching principle between indexes and conditions follows the leftmost prefix matching principle.
The leftmost prefix matching principle means that when equality conditions exactly match the leftmost column, or several leftmost consecutive columns, of an index, those columns can be used for index matching. When a range condition (>, <, BETWEEN, LIKE) is encountered, matching stops, but the range condition itself is still matched.
For the composite index lineitem(l_shipdate, l_quantity), the first two SQL queries below satisfy the leftmost prefix matching principle and can use the index; the last one does not and cannot.
select * from lineitem where l_shipdate = date '2021-12-01'; -- index can be used
select * from lineitem where l_shipdate = date '2021-12-01' and l_quantity = 100; -- index can be used
select * from lineitem where l_quantity = 100; -- index cannot be used
The execution plans for these three SQL queries are as follows:
-> Index lookup on lineitem using lidx (L_SHIPDATE=DATE'2021-12-01') (cost=0.35 rows=1)
-> Index lookup on lineitem using lidx (L_QUANTITY=100.00, L_SHIPDATE=DATE'2021-12-01') (cost=0.35 rows=1)
-> Filter: (lineitem.L_QUANTITY = 100.00) (cost=15208.05 rows=49486)
    -> Table scan on lineitem (cost=15208.05 rows=148473)
Besides the leftmost prefix principle, when creating a composite index you should also consider the number of distinct values (cardinality) of each column when deciding the order of the index fields. Fields with higher cardinality should be placed first.
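As a minimal sketch of this check, you can compare column cardinalities with COUNT(DISTINCT ...) before choosing the column order; the index name below is illustrative.

```sql
-- Compare the cardinality of the candidate columns
-- (illustrative query against the lineitem table used above).
select count(distinct l_partkey)  as partkey_cardinality,
       count(distinct l_shipdate) as shipdate_cardinality
from lineitem;

-- Assuming l_partkey turns out to have the higher cardinality,
-- place it first in the composite index:
create index l_part_ship_idx on lineitem(l_partkey, l_shipdate);
```

Note that this heuristic applies to equality-matched columns; the workload's actual predicates should still take precedence over raw cardinality numbers.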
Single-table equal conditions:
- COL = 'A'
- COL IN ('A')
In table joins, when a table is used as the driven (inner) table, the equality join condition can also serve as an equal condition for index matching.
- T1.COL = T2.COL
select * from orders, lineitem where o_orderkey = l_orderkey;
- Execution plan of the above query
-> Nested loop inner join (cost=484815.77 rows=1326500)
    -> Table scan on orders (cost=20540.71 rows=200128)
    -> Index lookup on lineitem using lineitem_idx (L_ORDERKEY=orders.O_ORDERKEY) (cost=1.66 rows=7)
- Range operators (>, >=, <, <=, BETWEEN)
- IN ('A','B')
- IS NOT NULL
- IS NULL
- LIKE 'ABC%'
- COL = 'A' OR COL = 'B'
Range conditions can also be used to quickly locate data:
create index lshipdate_idx on lineitem(l_shipdate);
explain format = tree select * from lineitem where l_shipdate >= date '2021-12-01';
-> Index range scan on lineitem using lshipdate_idx over ('2021-12-01' <= L_SHIPDATE), with index condition: (lineitem.L_SHIPDATE >= DATE'2021-12-01') (cost=11855.06 rows=26344)
Indexed columns that follow a range condition cannot take advantage of the index due to the leftmost matching principle.
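To illustrate, the following sketch (with an illustrative index name) shows a composite index whose second column sits after a range condition; the range on l_shipdate is matched by the index, but the equality on l_quantity can no longer be used for index positioning and is applied only as a filter on the scanned entries.

```sql
-- Illustrative composite index on the lineitem table used above.
create index l_ship_qty_idx on lineitem(l_shipdate, l_quantity);

-- The range condition on l_shipdate uses the index; the equality
-- condition on l_quantity that follows it cannot narrow the index
-- scan and is evaluated as a filter instead.
select * from lineitem
where l_shipdate > date '2021-12-01' and l_quantity = 100;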
For B+ tree indexes, since they are sorted by index key, they can be used to avoid sorting in SQL queries. The SQL structures involved mainly include:
- GROUP BY
- ORDER BY
- PARTITION BY... ORDER BY...
Suppose we have an index lshipdate_idx as follows:
create index lshipdate_idx on lineitem(l_shipdate);
You can see that the execution plans for the following SQL queries leverage the lshipdate_idx index to avoid sorting.
- SQL1 (ORDER BY)
select * from lineitem order by l_shipdate limit 10;
- Execution Plan for SQL1
-> Limit: 10 row(s) (cost=0.02 rows=10)
    -> Index scan on lineitem using lshipdate_idx (cost=0.02 rows=10)
- SQL2 (GROUP BY)
select l_shipdate, sum(l_quantity) as sum_qty from lineitem group by l_shipdate;
- Execution Plan for SQL2
-> Group aggregate: sum(lineitem.L_QUANTITY) (cost=30055.35 rows=148473)
    -> Index scan on lineitem using lshipdate_idx (cost=15208.05 rows=148473)
- SQL3 (DISTINCT)
select DISTINCT l_shipdate from lineitem;
- Execution Plan for SQL3
-> Covering index skip scan for deduplication on lineitem using lshipdate_idx (cost=4954.90 rows=15973)
- SQL4 (PARTITION BY... ORDER BY...)
select rank() over (partition by L_SHIPDATE order by L_ORDERKEY) from lineitem;
- Execution Plan for SQL4
WindowAgg (cost=0.29..545.28 rows=10000 width=28)
  -> Index Only Scan using lshipdate_idx on lineitem (cost=0.29..370.28 rows=10000 width=20)
- For grouping and deduplication, the order of the columns does not matter.
- For sorting, the order of the sort columns must match the order of the indexed columns; otherwise, the index cannot be used to avoid sorting.
- If sorting and grouping appear in the same query, the sort column needs to come first in the index.
For example, for the following SQL statement:
select l_shipdate, l_orderkey, sum(l_quantity) as sum_qty from lineitem group by l_shipdate, l_orderkey order by l_orderkey;
- Case 1: Create an index on (l_shipdate, l_orderkey): the query uses index access but still requires a sort, with an actual execution time of about 486.5 ms.
-> Sort: lineitem.L_ORDERKEY (actual time=479.465..486.526 rows=149413 loops=1)
    -> Stream results (cost=30055.35 rows=148473) (actual time=0.175..423.447 rows=149413 loops=1)
        -> Group aggregate: sum(lineitem.L_QUANTITY) (cost=30055.35 rows=148473) (actual time=0.170..394.978 rows=149413 loops=1)
            -> Index scan on lineitem using lshipdate_idx2 (cost=15208.05 rows=148473) (actual time=0.145..359.567 rows=149814 loops=1)
- Case 2: Create an index on (l_orderkey, l_shipdate): the query uses index access and avoids the sort, with an actual execution time of about 228.4 ms.
-> Group aggregate: sum(lineitem.L_QUANTITY) (cost=30055.35 rows=148473) (actual time=0.067..228.401 rows=149413 loops=1)
    -> Index scan on lineitem using lshipdate_idx3 (cost=15208.05 rows=148473) (actual time=0.052..194.479 rows=149814 loops=1)
When all the columns in a query are in the index columns, the database only needs to access the index to obtain the required data, avoiding table lookups. In some scenarios, this can greatly improve query efficiency.
For the following SQL statement:
select l_shipdate, l_orderkey, sum(l_quantity) as sum_qty from lineitem group by l_orderkey,l_shipdate;
An index on (l_orderkey, l_shipdate) does not include l_quantity, so a table lookup is required, with an actual execution time of about 194.9 ms.
-> Group aggregate: sum(lineitem.L_QUANTITY) (cost=30055.35 rows=148473) (actual time=0.044..194.875 rows=149413 loops=1)
    -> Index scan on lineitem using lshipdate_idx3 (cost=15208.05 rows=148473) (actual time=0.034..159.863 rows=149814 loops=1)
An index on (l_orderkey, l_shipdate, l_quantity) includes l_quantity, so no table lookup is needed; the actual execution time is about 113.4 ms, a performance improvement of approximately 71.8%.
-> Group aggregate: sum(lineitem.L_QUANTITY) (cost=30055.35 rows=148473) (actual time=0.035..113.433 rows=149413 loops=1)
    -> Covering index scan on lineitem using lshipdate_idx4 (cost=15208.05 rows=148473) (actual time=0.026..82.266 rows=149814 loops=1)
For partitioned tables, different databases support different types of indexes. Generally, partitioned tables can have the following three types of indexes:
- Local partitioned indexes (PostgreSQL/MySQL/Oracle/openGauss)
- Global partitioned indexes (Oracle)
- Global non-partitioned indexes (Oracle/openGauss)
In terms of index maintenance, local indexes are easier to manage than global indexes: when you add, drop, or truncate a table partition, local indexes automatically maintain their index partitions. MySQL and PostgreSQL support only local partitioned indexes; in Oracle and openGauss, the keyword local must be specified when creating a local partitioned index.
create index lshipdate_idx on lineitem(l_shipdate) local;
Similar to table partitioning, the partition key for the index and the partition key for the table do not necessarily have a relationship, and even non-partitioned tables can have global partitioned indexes. Oracle supports global partitioned indexes.
For global non-partitioned indexes, when you perform partition operations on the table, the index may become unavailable, and you need to explicitly update or rebuild it. In terms of efficiency, global indexes outperform local partitioned indexes for queries that do not include the partitioning columns. Oracle and openGauss create global non-partitioned indexes by default for partitioned tables.
create index lshipdate_idx on lineitem(l_shipdate) global;
create index lshipdate_idx on lineitem(l_shipdate);
When performing partition operations, you need to add the update global index clause to rebuild the index; otherwise, the index becomes unavailable.
alter table t DROP PARTITION partition_name update global index;
A function index (or expression index) is an index based on a function or an expression. It uses a function or expression to provide a calculated value as an index column to build an index, which can improve query performance without modifying the application.
Using a function index requires the expression in the SQL query to match the index expression exactly, so the conditions for its use are relatively strict; it is best suited for targeted optimization of important or high-frequency queries.
select * from lineitem where EXTRACT(DAY from l_shipdate) = 1;
If there is only a normal index on l_shipdate, the database optimizer will not be able to use it.
Seq Scan on lineitem (cost=0.00..1870.24 rows=238 width=158) (actual time=0.502..10.655 rows=1616 loops=1)
  Filter: (EXTRACT(day FROM l_shipdate) = '1'::numeric)
  Rows Removed by Filter: 46000
Planning Time: 0.107 ms
Execution Time: 10.709 ms
If a function index is created on EXTRACT(DAY from l_shipdate) as follows, the optimizer uses it and chooses a much better execution plan.
create index idx on lineitem(EXTRACT(DAY from l_shipdate));
Bitmap Heap Scan on lineitem (cost=6.13..593.60 rows=238 width=158) (actual time=0.216..0.981 rows=1620 loops=1)
  Recheck Cond: (EXTRACT(day FROM l_shipdate) = '1'::numeric)
  Heap Blocks: exact=889
  -> Bitmap Index Scan on idx (cost=0.00..6.08 rows=238 width=0) (actual time=0.149..0.149 rows=1620 loops=1)
       Index Cond: (EXTRACT(day FROM l_shipdate) = '1'::numeric)
Planning Time: 0.102 ms
Execution Time: 1.075 ms
From the execution plans, we can see that with the function index, execution time drops from about 10.7 ms to about 1.1 ms, roughly a 10x (900%) improvement.
A partial index (also called a conditional index) is an index built on a subset of a table's rows; the subset is defined by a conditional expression, and the index includes only the rows that satisfy it.
The conditions for using partial indexes are relatively strict: the index can be used only when the database can recognize that the query's WHERE condition logically implies the index's conditional expression.
Take the following partial index as an example; its conditional expression is l_shipdate > '2022-01-01':
create index l_partkey_idx on lineitem(l_partkey) where l_shipdate > '2022-01-01';
Since the condition l_shipdate = date '2021-12-01' in the following query does not fall within the scope of the index's conditional expression, the index cannot be used, and the execution plan falls back to a full table scan.
select l_partkey , count(1) from lineitem where l_shipdate = date '2021-12-01' and l_partkey < 100 group by l_partkey ;
GroupAggregate (cost=1870.25..1870.27 rows=1 width=12)
  Group Key: l_partkey
  -> Sort (cost=1870.25..1870.26 rows=1 width=4)
       Sort Key: l_partkey
       -> Seq Scan on lineitem (cost=0.00..1870.24 rows=1 width=4)
            Filter: ((l_partkey < 100) AND (l_shipdate = '2021-12-01'::date))
However, the condition l_shipdate = date '2022-12-01' in the following query does fall within the scope of the conditional expression, so the optimizer uses the index, and performance is greatly improved.
select l_partkey , count(1) from lineitem where l_shipdate = date '2022-12-01' and l_partkey < 100 group by l_partkey ;
GroupAggregate (cost=402.37..402.39 rows=1 width=12)
  Group Key: l_partkey
  -> Sort (cost=402.37..402.38 rows=1 width=4)
       Sort Key: l_partkey
       -> Index Scan using l_partkey_idx on lineitem (cost=0.28..402.36 rows=1 width=4)
            Filter: ((l_partkey < 100) AND (l_shipdate = '2022-12-01'::date))
Note: MySQL does not currently support partial (conditional) indexes, but PostgreSQL, openGauss, and Oracle all support them.
Index Merge is an optimization technique that uses multiple indexes to perform a single-table data access. When multiple conditions on a table are involved in a query and each condition has a suitable index, Index Merge can merge the results of multiple indexes before retrieving the data, which can improve query performance.
Suppose that on the lineitem table there are single-column indexes on l_shipdate and l_partkey. For the following SQL statement:
select * from lineitem where l_shipdate = date '2010-12-01' or l_partkey=100;
Execution Plan in PostgreSQL
Bitmap Heap Scan on lineitem (cost=9.05..202.96 rows=59 width=158)
  Recheck Cond: ((l_shipdate = '2010-12-01'::date) OR (l_partkey = 100))
  -> BitmapOr (cost=9.05..9.05 rows=59 width=0)
       -> Bitmap Index Scan on l_shipdate_idx (cost=0.00..4.70 rows=54 width=0)
            Index Cond: (l_shipdate = '2010-12-01'::date)
       -> Bitmap Index Scan on l_partkey_idx (cost=0.00..4.33 rows=5 width=0)
            Index Cond: (l_partkey = 100)
Execution Plan in MySQL
-> Filter: ((lineitem.L_SHIPDATE = DATE'2010-12-01') or (lineitem.L_PARTKEY = 100)) (cost=12.53 rows=21)
    -> Deduplicate rows sorted by row ID (cost=12.53 rows=21)
        -> Index range scan on lineitem using l_shipdate_idx over (L_SHIPDATE = '2010-12-01') (cost=1.11 rows=1)
        -> Index range scan on lineitem using l_partkey_idx over (L_PARTKEY = 100) (cost=3.03 rows=20)
The optimizer can use both indexes to satisfy the query, and then merge the results of the two indexes. This approach avoids the need for a full table scan and can improve query performance.
Index Merge is supported by various databases such as MySQL, PostgreSQL, Oracle, and SQL Server. However, not all databases support all types of Index Merge, and the optimization behavior may vary across different versions of the same database.
It is recommended to create an index on the foreign key. This principle may seem to contradict the first principle (create indexes based on your workload), but in fact, it is consistent, because in real-world applications, most table relationships are based on primary and foreign keys. Creating an index on the foreign key can improve the efficiency of table association.
In MySQL, if a field is defined as a foreign key, an index is created on it by default. However, in PostgreSQL and its related databases, setting certain fields as foreign keys does not automatically create an index on those fields.
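In PostgreSQL, therefore, the index on the foreign key column must be created explicitly. The following is a minimal sketch using the TPC-H tables referenced above; the constraint and index names are illustrative.

```sql
-- Declaring the foreign key does not create an index in PostgreSQL.
alter table lineitem
  add constraint lineitem_order_fk
  foreign key (l_orderkey) references orders(o_orderkey);

-- Create the index on the foreign key column explicitly,
-- so that joins driven into lineitem can use an index lookup.
create index l_orderkey_idx on lineitem(l_orderkey);
```

Besides speeding up joins, this index also accelerates the referential checks performed when rows in the parent table are deleted or their keys updated.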
When designing indexes, it is important to consider the following constraints:
- Maximum number of indexes per table: Creating too many indexes can have a negative impact on write performance, as every insert or update to the table must also update the indexes. Therefore, it is important to limit the number of indexes per table.
- Maximum number of columns per index: Creating indexes with too many columns can also have a negative impact on performance, as the index may become too large to be efficient. Therefore, it is important to limit the number of columns per index.
- Disk space usage: Indexes can take up a significant amount of disk space. Therefore, it is important to consider the amount of available disk space when designing indexes.
To design and maintain indexes that meet these constraints, the following methods can be used:
- Index selection: Indexes should be created for the most important SQL statements or the most frequently executed queries.
- Column selection for indexing: Indexes should be created on the columns that have the best filtering effect based on the evaluation of the single value selectivity of the column. You should avoid creating indexes on columns that are frequently updated.
- Index consolidation: By ordering the columns of a composite index correctly, a single index can be used to accelerate multiple queries.
- Index deletion: Unused indexes can be periodically deleted to free up disk space and reduce maintenance overhead.
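For the index-deletion step, one hedged way to find candidates in PostgreSQL is to query the statistics view pg_stat_user_indexes for indexes that have never been scanned since the last statistics reset:

```sql
-- Indexes with zero scans are candidates for removal,
-- but review them manually first: statistics may have been
-- reset recently, and unique/constraint indexes must stay.
select schemaname, relname, indexrelname, idx_scan
from pg_stat_user_indexes
where idx_scan = 0
order by schemaname, relname;
```

Other databases expose similar usage statistics (for example, MySQL's sys.schema_unused_indexes view), so the same review process can be applied there.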
In summary, the process of creating indexes can be abstracted as follows: define the benefit of an index for the workload, then use heuristic algorithms to compute the index set that maximizes the overall workload benefit under the constraints above. This is also the internal logic of the PawSQL index engine.