One of the common things most programmers do is writing database queries. The minimum expectation is to get the expected results from the written query. Say, you write a query and expect it to run only once or to work in a small dataset alone. In these cases, it doesn't matter whether the query is efficient or not.
But.. but when you hit cases where your query is expected to run many times for fetching live data for a website or arriving at analytics out of your company’s huge data set. Here your queries impact your time and cost. Writing them in an optimised fashion is the way to go.
Most database engines have a query optimiser that interprets or executes your query in the most effective way possible. But there are several strategies that can yield better results in many cases.
I would like to share ten ways for writing optimised queries. In this blog, I have used Postgres queries as examples.
1. Ask for what you need
Following this not only helps in leading a peaceful life, but also in building faster queries :)
Most of the time it is tempting to start queries with “SELECT * FROM”. It can be convenient because you don't need to think about which columns you need. But convenience comes with a cost. The cost here is the amount of data your query will process.
Instead of
SELECT * FROM PAYMENT
Use
SELECT AMOUNT,
PAYMENT_DATE
FROM PAYMENT;
This will load only these two columns. On the other hand, * will process unnecessary data as well. So ask for what you need alone!
2. WHERE Vs HAVING
WHERE and HAVING are used for specifying the conditions to fetch records. But using them interchangeably may cost you. WHERE
fetches the records satisfying the conditions but HAVING
fetches all the records and then applies the condition.
Instead of
SELECT PAYMENT_DATE, COUNT(AMOUNT)
FROM PAYMENT
GROUP BY PAYMENT_DATE
HAVING PAYMENT_DATE >= '04-02-2007';
Use
SELECT PAYMENT_DATE, COUNT(AMOUNT)
FROM PAYMENT
WHERE PAYMENT_DATE >= '04-02-2007'
GROUP BY PAYMENT_DATE;
Use HAVING only when you deal with aggregated results and you cannot use WHERE. Otherwise WHERE clause would be faster.
3. You may know by a handful the whole sack
If you like to check, whether records matching your conditions exist in your system, you need not fetch all matching records instead just check for the sample record's existence.
Instead of
SELECT * FROM PAYMENT WHERE PAYMENT_DATE >= '01-01-2021'
Use
SELECT EXISTS
(SELECT 1
FROM PAYMENT
WHERE PAYMENT_DATE >= '01-01-2021');
EXISTS halts when it hits the first matching record. Also it is not bothered on the data to be fetched to the result set. Hence SELECT 1 (as shown in the example) / 0 / NULL / {any_column} will work and will not affect the execution or result of the query.
4. JOIN Vs Subqueries
Subqueries may be easier to write/read but Joins are better optimised by the server.
Instead of
SELECT P.AMOUNT, P.PAYMENT_DATE
FROM PAYMENT P
WHERE PAYMENT_DATE >= '04-02-2007'
AND P.CUSTOMER_ID
IN
(SELECT C.CUSTOMER_ID
FROM CUSTOMER C
WHERE C.LAST_NAME = 'SOTO');
Use
SELECT P.AMOUNT, P.PAYMENT_DATE FROM PAYMENT P
JOIN CUSTOMER C
ON P.CUSTOMER_ID = C.CUSTOMER_ID
WHERE PAYMENT_DATE >= '04-02-2007' AND C.LAST_NAME = 'SOTO';
Most of the data engines evaluate subquery as separate queries. The internal queries are first ran and then from the result set of all, the actual results are derived. On the other hand Join produces the result in one go.
5. EXISTS Vs DISTINCT
If you are interested in fetching distinct records in Join queries, using DISTINCT will be costlier than using EXISTS.
Instead of
SELECT DISTINCT C.FIRST_NAME
FROM CUSTOMER C
JOIN PAYMENT P ON P.CUSTOMER_ID = C.CUSTOMER_ID
WHERE AMOUNT = 4.99;
Use
SELECT C.FIRST_NAME FROM CUSTOMER C
WHERE EXISTS (SELECT 1 FROM PAYMENT P
WHERE P.CUSTOMER_ID = C.CUSTOMER_ID
AND AMOUNT = 4.99);
In DISTINCT, duplicate rows are suppressed by first sorting all the fetched results and then returning the unique ones alone from the sorted result. This sort operation is pretty expensive and can be avoided using EXISTS in this case. EXISTS checks just for the existence of row returned by the subquery, the first occurrence is only considered and unique list is formed in one go. As no costly Sort operation is needed, EXISTS is preferred over DISTINCT.
6.UNION vs UNION ALL
If you are not worried about duplicate records, go for UNION ALL instead of UNION!
Instead of
SELECT FIRST_NAME, LAST_NAME, 'ACTOR' AS ROLE FROM ACTOR
UNION
SELECT FIRST_NAME, LAST_NAME, 'CUSTOMER' AS ROLE FROM CUSTOMER;
Use
SELECT FIRST_NAME, LAST_NAME, 'ACTOR' AS ROLE FROM ACTOR
UNION ALL
SELECT FIRST_NAME, LAST_NAME, 'CUSTOMER' AS ROLE FROM CUSTOMER;
UNION ALL performs better than UNION because by using UNION ALL we are avoiding an expensive sort operation. The Sort is done by the DISTINCT operation which is triggered internally by UNION.
7.Indexes to retrieve records on demand
Indexes are very useful if you want to retrieve records frequently. If you are looking for a small number of rows having vast column information to be retrieved frequently, then Indexes will make your job faster!
Say, If you need to retrieve the list of films using titles frequently.
SELECT *
FROM FILM
WHERE TITLE LIKE '{any_film_prefix}%';
will always make a sequential scan which takes time. So if you create an index on the hot field ‘title’,
CREATE AN INDEX ON FILM(TITLE);
The same query will be much faster.
The query planner will be using an Index scan instead of a sequential scan. You can create indexes on multiple columns as well.
Note: You can check if the query planner is using Index scan or sequential scan by prefixing explain analyse to your query.
EXPLAIN ANALYSE SELECT * FROM FILM WHERE TITLE LIKE 'Bird%';
Few points to keep in mind before going for Indexes:
- Index scan happens when the retrieved row count is not huge, if it is so, most of the query planners will compare Indexed and sequential scan and go for the latter.
SELECT *
FROM FILM
WHERE TITLE LIKE '%b%';
As it makes more sense to navigate sequentially the huge recordset rather than making the index scan for these many records and then retrieving the records.
- Index scan will not work with SQL functions applied on indexed keys. Even though the below query will perform an Index scan,
SELECT *
FROM FILM
WHERE FILM_ID BETWEEN 20 AND 30;
While the below query will perform a sequential scan because of mod function.
SELECT *
FROM FILM
WHERE MOD(FILM_ID,2) = 0;
- The index comes with the cost. Faster retrieval comes with space costs for storing indexes and maintenance costs for keeping the indexes updated.
8. Use CROSS JOIN - iff needed
This Join results in cartesian product which may not be needed itself. Say you are cross joining film and actor.
SELECT FILM.NAME, ACTOR.*
FROM FILM, ACTOR
WHERE ACTOR.ACTOR_ID = FILM.ACTOR_ID;
If there are 100 records in the film and actor table each, it would first form 10,000 records (for 1 film record all 100 actor records will be present), and then it filters records having matching actor_id. This will be expensive for huge record sets.
Instead of
SELECT FILM.NAME, ACTOR.*
FROM FILM
CROSS JOIN ACTOR
WHERE ACTOR.ACTOR_ID = FILM.ACTOR_ID;
Use
SELECT FILM.NAME, ACTOR.*
FROM FILM
INNER JOIN ACTOR
WHERE ACTOR.ACTOR_ID = FILM.ACTOR_ID;
This will form only the required 100 records in one go.
Note: Cross join and ‘,’ can be used for getting the cartesian product result set.
9. Materialized views
If you are using views with more joins or complex select values and the underlying data of the view is less frequently modified, then using Materialized view will improve your retrieval performance.
CREATE MATERIALIZED VIEW AVAILABLE_FILMS AS
SELECT ARRAY_AGG(F.TITLE) AS FILMS,
C.NAME AS CATEGORY,
COUNT(F.TITLE) AS NUMBER_OF_FILMS
FROM FILM F
INNER JOIN FILM_CATEGORY FC ON F.FILM_ID = FC.FILM_ID
INNER JOIN CATEGORY C ON FC.CATEGORY_ID = C.CATEGORY_ID
GROUP BY CATEGORY
HAVING COUNT(F.TITLE) > 15
ORDER BY COUNT(F.TITLE) DESC;
When a materialized view is referenced in a query, the data is returned directly from the materialized view, like from a single table. Hence no overhead of joins or select happens every time the data is retrieved.
SELECT *
FROM AVAILABLE_FILMS;
The cost here is maintaining/refreshing the materialized view with updated table values. Materialized views can be updated on a regular basis either through triggers or commit changes.
REFRESH MATERIALIZED VIEW AVAILABLE_FILMS;
10. Avoid OR conditions on frequently used queries
Say you are interested in getting customers whose first name starts with {given_input} or email address starts with {given_input} and has ‘.org’ suffix.
For example the given input is Kim
SELECT FIRST_NAME,
LAST_NAME,
EMAIL
FROM CUSTOMER
WHERE FIRST_NAME LIKE 'Kim%'
OR EMAIL Like 'kim%.org';
Also this query frequently runs. So you would intend to create an index on both columns.
CREATE INDEX ON CUSTOMER(FIRST_NAME, EMAIL);
But this will not work, the query planner will still use sequential scan because the condition here is OR - condition on first_name OR condition on email
.
In this case, creating separate indexes for separate columns alone would not help. A better performance approach will be splitting the queries and running them with union along with separate indexes.
CREATE INDEX ON CUSTOMER(EMAIL);
CREATE INDEX ON CUSTOMER(FIRST_NAME);
SELECT FIRST_NAME,
LAST_NAME,
EMAIL
FROM CUSTOMER
WHERE EMAIL LIKE 'kim%.org'
UNION
SELECT FIRST_NAME,
LAST_NAME,
EMAIL
FROM CUSTOMER
WHERE FIRST_NAME LIKE 'Kim%';
Going by this way, we can accommodate another column to be added in the where clause as well. As mentioned if you don’t mind duplicate records, replace UNION with UNION ALL which is more performant.
That's it for now! Next time when you get to write SQL queries give these tips a thought, think about your use cases and choose the wiser approach.
Happy learning!
Top comments (2)
Thanks for the article, really useful.
I have an observation regarding the item 10. Avoid OR conditions on frequently used queries.
You mentioned that it is an
OR
condition, but your snippet used anAND
, like thisI thought it is a typo and may you wanna fix that.
Glad you liked it.
Yes it is 'OR', thanks for pointing it out! Changed it.