Pavel Tiunov

Posted on Mar 6, 2019 • Originally published at statsbot.co

Simple Tips for PostgreSQL Query Optimization

#webdev #database #postgres #sql

A single query optimization tip can boost your database performance by 100x. At one point, we advised one of our customers that had a 10TB database to use a date-based multi-column index. As a result, their date range query sped up by 112x. In this post, I share five simple but still powerful tips for PostgreSQL query optimization.

I use these techniques a lot to optimize our customers PostgreSQL databases with billions of data points during Cube.js deployments.

To keep it easy, I ran examples for this article on a test dataset. Although it doesn’t show the actual performance improvement, you will see that our tips solve the significant set of optimization problems and work well in real-world case scenarios.

Explain analyze

Postgres has a cool extension to the well-known EXPLAIN command, which is called EXPLAIN ANALYZE. The difference is that EXPLAIN shows you query cost based on collected statistics about your database, and EXPLAIN ANALYZE actually runs it to show the processed time for every stage.

I highly recommend you use EXPLAIN ANALYZE because there are a lot of cases when EXPLAIN shows a higher query cost, while the time to execute is actually less and vice versa. The most important thing is that the EXPLAIN command will help you to understand if a specific index is used and how.

The ability to see indexes is the first step to learning PostgreSQL query optimization.

One index per query

Indexes are materialized copies of your table. They contain only specific columns of the table, so you can quickly find data based on the values in these columns. Indexes in Postgres also store row identifiers or row addresses used to speed up the original table scans.

It’s always a trade-off between storage space and query time, and a lot of indexes can introduce overhead for DML operations. However, when read query performance is a priority, as is the case with business analytics, it’s usually a well-working approach.

I advise to create one index per unique query for better performance. Look further in this post to learn how to create indexes for specific queries.

Using multiple columns in index

Let’s review the explain analyze plan of the following simple query without indexes:

EXPLAIN ANALYZE SELECT line_items.product_id, SUM(line_items.price)
FROM line_items
WHERE product_id > 80
GROUP BY 1

An explain analyze returns:

HashAggregate (cost=13.81..14.52 rows=71 width=12) (actual time=0.137..0.141 rows=20 loops=1)
Group Key: product_id
-> Seq Scan on line_items (cost=0.00..13.25 rows=112 width=8) (actual time=0.017..0.082 rows=112 loops=1)
Filter: (product_id > 80)
Rows Removed by Filter: 388
Planning time: 0.082 ms
Execution time: 0.187 ms

This query scans all of the line items to find a product with an id that is greater than 80, and then sums up all the values grouped by that product id.

Now we’ll add the index to this table:

CREATE INDEX items_product_id ON line_items(product_id)

We created a B-tree index, which contains only one column: product_id. After reading lots of articles about the benefits of using index, one can expect a query boost from such an operation. Sorry, bad news.

As we need to sum up the price column in the query above, we still need to scan the original table. Depending on the table statistics, Postgres will choose to scan the original table instead of the index. The thing is, index lacks a price column.

We can tweak this index by adding a price column as follows:

CREATE INDEX items_product_id_price ON line_items(product_id, price)

If we rerun the explain plan, we’ll see our index is the forth line:

GroupAggregate (cost=0.27..7.50 rows=71 width=12) (actual time=0.034..0.090 rows=20 loops=1)
Group Key: product_id
-> Index Only Scan using items_product_id_price on line_items (cost=0.27..6.23 rows=112 width=8) (actual time=0.024..0.049 rows=112 loops=1)
Index Cond: (product_id > 80)
Heap Fetches: 0
Planning time: 0.271 ms
Execution time: 0.136 ms

But how would putting the price column first affect the PostgreSQL query optimization?

Column order in a multicolumn index

Well, we figured out that a multicolumn index is used in the previous query because we included both columns. The interesting thing is that we can use another order for these columns while defining the index:

CREATE INDEX items_product_id_price_reversed ON line_items(price, product_id)

If we rerun explain analyze, we’ll see that items_product_id_price_reversed is not used. That’s because this index is sorted firstly on price and then on product_id. Using this index will lead to a full index scan, which is nearly equivalent to scanning the whole table. That’s why Postgres opts to use scan for an original table.

It's a good practice to put in the first place columns, which you use in filters with the biggest number of unique values.

Filters + joins

It’s time to figure out what the best set of indexes is for a specific join query, which also has some filter conditions. Usually, you can achieve optimal results by trial and error.

As in the case of simple filtering, choose the most restrictive filtering condition and add an index for it.

Let’s consider an example:

SELECT orders.product_id, SUM(line_items.price)
FROM line_items
LEFT JOIN orders ON line_items.order_id = orders.id
WHERE line_items.created_at BETWEEN '2018-01-01' and '2018-01-02'
GROUP BY 1

Here we have join on order_id and filter on created_at. This way, we can create a multicolumn index that will contain created_at in the first place, order_id in the second and price in the third:

CREATE INDEX line_items_created_at_order_id_price ON line_items(created_at, order_id, price)

We’ll get the following explain plan:

GroupAggregate (cost=12.62..12.64 rows=1 width=12) (actual time=0.029..0.029 rows=1 loops=1)
Group Key: orders.product_id
-> Sort (cost=12.62..12.62 rows=1 width=8) (actual time=0.025..0.026 rows=1 loops=1)
Sort Key: orders.product_id
Sort Method: quicksort Memory: 25kB
-> Nested Loop Left Join (cost=0.56..12.61 rows=1 width=8) (actual time=0.015..0.017 rows=1 loops=1)
-> Index Only Scan using line_items_created_at_order_id_price on line_items (cost=0.27..4.29 rows=1 width=8) (actual time=0.009..0.010 rows=1 loops=1)
Index Cond: ((created_at >= '2018-01-01 00:00:00'::timestamp without time zone) AND (created_at <= '2018-01-02 00:00:00'::timestamp without time zone))
Heap Fetches: 0
-> Index Scan using orders_pkey on orders (cost=0.29..8.30 rows=1 width=8) (actual time=0.004..0.005 rows=1 loops=1)
Index Cond: (line_items.order_id = id)
Planning time: 0.303 ms
Execution time: 0.072 ms

As you can see, line_items_created_at_order_id_price is used to reduce scan by date condition. After that, it’s joined with orders using the orders_pkey index scan.

Date columns are usually one of the best candidates for the first column in a multicolumn index as it reduces scanning throughput in a predictable manner.

Conclusion

Above tips for PostgreSQL query optimization will help you to speed up queries 10-100x for multi-GB databases. They can solve most of your performance bottlenecks in an 80/20 manner. It doesn’t mean you shouldn’t double check your queries with EXPLAIN for real-world case scenarios.

DEV Community