A single query optimization tip can boost your database performance by 100x. At one point, we advised one of our customers that had a 10TB database to use a date-based multi-column index. As a result, their date range query sped up by 112x. In this post, I share five simple but still powerful tips for PostgreSQL query optimization.
I use these techniques a lot to optimize our customers PostgreSQL databases with billions of data points during Cube.js deployments.
To keep it easy, I ran examples for this article on a test dataset. Although it doesn’t show the actual performance improvement, you will see that our tips solve the significant set of optimization problems and work well in real-world case scenarios.
Postgres has a cool extension to the well-known
EXPLAIN command, which is called
EXPLAIN ANALYZE. The difference is that
EXPLAIN shows you query cost based on collected statistics about your database, and
EXPLAIN ANALYZE actually runs it to show the processed time for every stage.
I highly recommend you use
EXPLAIN ANALYZE because there are a lot of cases when
EXPLAIN shows a higher query cost, while the time to execute is actually less and vice versa. The most important thing is that the
EXPLAIN command will help you to understand if a specific index is used and how.
The ability to see indexes is the first step to learning PostgreSQL query optimization.
Indexes are materialized copies of your table. They contain only specific columns of the table, so you can quickly find data based on the values in these columns. Indexes in Postgres also store row identifiers or row addresses used to speed up the original table scans.
It’s always a trade-off between storage space and query time, and a lot of indexes can introduce overhead for DML operations. However, when read query performance is a priority, as is the case with business analytics, it’s usually a well-working approach.
I advise to create one index per unique query for better performance. Look further in this post to learn how to create indexes for specific queries.
Let’s review the explain analyze plan of the following simple query without indexes:
EXPLAIN ANALYZE SELECT line_items.product_id, SUM(line_items.price) FROM line_items WHERE product_id > 80 GROUP BY 1
An explain analyze returns:
HashAggregate (cost=13.81..14.52 rows=71 width=12) (actual time=0.137..0.141 rows=20 loops=1) Group Key: product_id -> Seq Scan on line_items (cost=0.00..13.25 rows=112 width=8) (actual time=0.017..0.082 rows=112 loops=1) Filter: (product_id > 80) Rows Removed by Filter: 388 Planning time: 0.082 ms Execution time: 0.187 ms
This query scans all of the line items to find a product with an id that is greater than 80, and then sums up all the values grouped by that product id.
Now we’ll add the index to this table:
CREATE INDEX items_product_id ON line_items(product_id)
We created a B-tree index, which contains only one column:
product_id. After reading lots of articles about the benefits of using index, one can expect a query boost from such an operation. Sorry, bad news.
As we need to sum up the price column in the query above, we still need to scan the original table. Depending on the table statistics, Postgres will choose to scan the original table instead of the index. The thing is, index lacks a
We can tweak this index by adding a price column as follows:
CREATE INDEX items_product_id_price ON line_items(product_id, price)
If we rerun the explain plan, we’ll see our index is the forth line:
GroupAggregate (cost=0.27..7.50 rows=71 width=12) (actual time=0.034..0.090 rows=20 loops=1) Group Key: product_id -> Index Only Scan using items_product_id_price on line_items (cost=0.27..6.23 rows=112 width=8) (actual time=0.024..0.049 rows=112 loops=1) Index Cond: (product_id > 80) Heap Fetches: 0 Planning time: 0.271 ms Execution time: 0.136 ms
But how would putting the price column first affect the PostgreSQL query optimization?
Well, we figured out that a multicolumn index is used in the previous query because we included both columns. The interesting thing is that we can use another order for these columns while defining the index:
CREATE INDEX items_product_id_price_reversed ON line_items(price, product_id)
If we rerun explain analyze, we’ll see that
items_product_id_price_reversed is not used. That’s because this index is sorted firstly on
price and then on
product_id. Using this index will lead to a full index scan, which is nearly equivalent to scanning the whole table. That’s why Postgres opts to use scan for an original table.
It's a good practice to put in the first place columns, which you use in filters with the biggest number of unique values.
It’s time to figure out what the best set of indexes is for a specific join query, which also has some filter conditions. Usually, you can achieve optimal results by trial and error.
As in the case of simple filtering, choose the most restrictive filtering condition and add an index for it.
Let’s consider an example:
SELECT orders.product_id, SUM(line_items.price) FROM line_items LEFT JOIN orders ON line_items.order_id = orders.id WHERE line_items.created_at BETWEEN '2018-01-01' and '2018-01-02' GROUP BY 1
Here we have join on
order_id and filter on
created_at. This way, we can create a multicolumn index that will contain
created_at in the first place,
order_id in the second and
price in the third:
CREATE INDEX line_items_created_at_order_id_price ON line_items(created_at, order_id, price)
We’ll get the following explain plan:
GroupAggregate (cost=12.62..12.64 rows=1 width=12) (actual time=0.029..0.029 rows=1 loops=1) Group Key: orders.product_id -> Sort (cost=12.62..12.62 rows=1 width=8) (actual time=0.025..0.026 rows=1 loops=1) Sort Key: orders.product_id Sort Method: quicksort Memory: 25kB -> Nested Loop Left Join (cost=0.56..12.61 rows=1 width=8) (actual time=0.015..0.017 rows=1 loops=1) -> Index Only Scan using line_items_created_at_order_id_price on line_items (cost=0.27..4.29 rows=1 width=8) (actual time=0.009..0.010 rows=1 loops=1) Index Cond: ((created_at >= '2018-01-01 00:00:00'::timestamp without time zone) AND (created_at <= '2018-01-02 00:00:00'::timestamp without time zone)) Heap Fetches: 0 -> Index Scan using orders_pkey on orders (cost=0.29..8.30 rows=1 width=8) (actual time=0.004..0.005 rows=1 loops=1) Index Cond: (line_items.order_id = id) Planning time: 0.303 ms Execution time: 0.072 ms
As you can see,
line_items_created_at_order_id_price is used to reduce scan by date condition. After that, it’s joined with orders using the
orders_pkey index scan.
Date columns are usually one of the best candidates for the first column in a multicolumn index as it reduces scanning throughput in a predictable manner.
Above tips for PostgreSQL query optimization will help you to speed up queries 10-100x for multi-GB databases. They can solve most of your performance bottlenecks in an 80/20 manner. It doesn’t mean you shouldn’t double check your queries with
EXPLAIN for real-world case scenarios.