Tomas@PawSQL

Posted on Mar 30, 2023

IN Subquery Optimization

#sql #database

Channel of advanced SQL tuning

Problem Definition

An IN subquery is a type of subquery that takes the following form.

(expr1, expr2...) [NOT] IN (SELECT expr3, expr4, ...)

An IN subquery can be rewritten as an equivalent correlated EXISTS subquery or inner join, which can create a extra filtering condition. If the filtering condition has an appropriate index or is recommended by the PawSQL index recommendation engine, better performance can be achieved.

IN Subquery to EXISTS conversion

The following IN subquery is used to retrieve user information for users who have placed orders within the past year:

select * from customer where c_custkey in (select o_custkey from orders where O_ORDERDATE>=current_date - interval 1 year)

It can be rewritten as an EXISTS subquery, which creates a extra filtering condition (c_custkey = o_custkey):

select * from customer where exists (select * from orders where c_custkey = o_custkey and O_ORDERDATE>=current_date - interval 1 year)

IN Subquery to INNER JOIN conversion

If the query result of a subquery is distinct, then the IN subquery can be rewritten as a join between two tables. This allows the database optimizer to plan a better table join sequence and also enables PawSQL to recommend better optimization methods.

For example, consider the following SQL query where c_custkey is the primary key of the customer table:

SELECT * FROM orders WHERE o_custkey IN (SELECT c_custkey FROM customer);

Here, orders and customer are two related tables, where c_custkey is the primary key of the customer table and o_custkey is the foreign key in the orders table that relates to the customer table.

If the query result of the subquery is distinct, we can rewrite the query as a JOIN query, as shown below:

SELECT * FROM orders,customer where orders.o_custkey = customer.c_custkey;

By rewriting the IN subquery as a JOIN query, the database optimizer can plan a better table join sequence and choose a better execution plan while executing the query. In addition, such a query is also easier to understand and maintain.

Optimization of IN Subqueries in Databases

Both MySQL and PostgreSQL can adopt the optimal execution plan for IN and EXISTS.

If there is no index on O_ORDERDATE, the execution plans of Query1 and Query2 on MySQL will both use pseudo-code implementation logic of IN subquery:

-> Nested loop inner join  (cost=19847117.66 rows=198449671)
    -> Table scan on customer  (cost=1155.80 rows=9948)
    -> Single-row index lookup on <subquery2> using <auto_distinct_key> (o_custkey=customer.C_CUSTKEY)
        -> Materialize with deduplication  (cost=22471.48..22471.48 rows=19949)
            -> Filter: (orders.O_ORDERDATE = <cache>((curdate() - interval 1 year)))  (cost=20476.61 rows=19949)
                -> Table scan on orders  (cost=20476.61 rows=199487)

If an index is built on O_ORDERDATE, then their execution plans on MySQL will both use pseudo-code implementation logic of EXISTS subquery:

-> Nested loop semijoin  (cost=22777.29 rows=5705)
    -> Table scan on customer  (cost=1155.80 rows=9948)
    -> Filter: (orders.O_ORDERDATE = <cache>((curdate() - interval 1 year)))  (cost=0.92 rows=1)
        -> Index lookup on orders using o_idx_key (O_CUSTKEY=customer.C_CUSTKEY)  (cost=0.92 rows=6)

If the queried columns in the subquery are unique, the database will convert it to an inner join.

For the following SQL as an example

select * from orders where o_custkey in (select c_custkey from customer where c_phone like '139%')

The execution plan on MySQL is as follows (PostgreSQL is similar):

-> Nested loop inner join  (cost=3541.61 rows=6313)
    -> Filter: (customer.C_PHONE like '139%')  (cost=1148.89 rows=1099)
        -> Table scan on customer  (cost=1148.89 rows=9888)
    -> Index lookup on orders using idx_orders_ckey (O_CUSTKEY=customer.C_CUSTKEY)  (cost=1.60 rows=6)

Note: In MySQL and PostgreSQL databases, IN or EXISTS are equivalent, and the database always try to choose the optimal execution plan based on indexes and statistical information.

Optimization of IN Subqueries in PawSQL

PawSQL will rewrite the IN subquery to EXISTS subquery or inner join query to help the index recommendation engine recommend suitable indexes, prompting the optimizer to adopt the optimal execution plan.

IN subquery to EXISTS conversion

Original Query

select *
from tpch.customer
where customer.c_custkey in (
           select orders.o_custkey
           from tpch.orders
           where orders.O_ORDERDATE >= current_date - interval '1' YEAR)

Rewritten Query

select /*QB_1*/ *
from tpch.customer
where exists (select /*QB_2*/ orders.o_custkey
             from tpch.orders
             where orders.O_ORDERDATE >= current_date - interval '1' YEAR
         and orders.o_custkey = customer.c_custkey)

Index Recommended

CREATE INDEX PAW_IDX1072908633 ON tpch.ORDERS(O_ORDERDATE,O_CUSTKEY);
-- When table ORDERS(referenced in query block QB_2) serves as a DRIVE table in the join planning, PAW_IDX1072908633 can be used to do a COVERING index RANGE SCAN with condition(orders.O_ORDERDATE >= current_date - interval '1' YEAR).

CREATE INDEX PAW_IDX0031767282 ON tpch.ORDERS(O_CUSTKEY,O_ORDERDATE);
-- When table ORDERS(referenced in query block QB_2) serves as a DRIVEN table in the join planning, PAW_IDX0031767282 can be used to do a COVERING index RANGE SCAN with condition(orders.o_custkey = customer.c_custkey and orders.O_ORDERDATE >= current_date - interval '1' YEAR).

Validation Details

Query Plan(before)

-> Nested loop inner join  (cost=65987720.69 rows=659855821)
  -> Table scan on customer  (cost=1149.80 rows=9888)
  -> Single-row index lookup on <subquery2> using <auto_distinct_key> (o_custkey=customer.C_CUSTKEY)
      -> Materialize with deduplication  (cost=13874.51..13874.51 rows=66733)
          -> Filter: (orders.O_ORDERDATE >= <cache>((curdate() - interval '1' year)))  (cost=7201.21 rows=66733)
              -> Table scan on orders  (cost=7201.21 rows=200219)

Query Plan(after)

-> Nested loop inner join  (cost=3771444.20 rows=37693056)
  -> Table scan on customer  (cost=1149.80 rows=9888)
  -> Single-row index lookup on <subquery2> using <auto_distinct_key> (o_custkey=customer.C_CUSTKEY)
      -> Materialize with deduplication  (cost=1150.65..1150.65 rows=3812)
          -> Filter: (orders.O_ORDERDATE >= <cache>((curdate() - interval '1' year)))  (cost=769.45 rows=3812)
              -> Covering index range scan on orders using PAW_IDX1072908633 over ('2022-03-28' <= O_ORDERDATE)  (cost=769.45 rows=3812)

Performance improves by 1648.67% after PawSQL optimization.

IN subquery to INNER JOIN conversion

Original Query，c_custkey is the primary key of table customer.

select *
from tpch.orders
where orders.o_custkey in (
           select customer.c_custkey
           from tpch.customer)

Rewritten Query

select *
from tpch.orders, tpch.customer
where customer.c_custkey = orders.o_custkey

Index Recommended

CREATE INDEX PAW_IDX0455857015 ON tpch.ORDERS(O_CUSTKEY,O_CLERK);
-- When table ORDERS serves as a DRIVEN table in the join planning, PAW_IDX0455857015 can be used to do a index LOOKUP with condition(customer.c_custkey = orders.o_custkey).

Validation Details

Query Plan(before)

-> Nested loop inner join  (cost=240790.71 rows=200219)
  -> Table scan on orders  (cost=20549.81 rows=200219)
  -> Single-row covering index lookup on customer using key_idx (C_CUSTKEY=orders.O_CUSTKEY)  (cost=1.00 rows=1)

Query Plan(after)

-> Nested loop inner join  (cost=21289.23 rows=53135)
  -> Index scan on customer using key_idx  (cost=1149.80 rows=9888)
  -> Index lookup on orders using PAW_IDX0455857015 (O_CUSTKEY=customer.C_CUSTKEY)  (cost=1.50 rows=5)

Performance improves by 1064.60% after PawSQL optimization.

DEV Community

IN Subquery Optimization

Problem Definition

Optimization of IN Subqueries in Databases

Optimization of IN Subqueries in PawSQL

IN subquery to EXISTS conversion

IN subquery to INNER JOIN conversion

Top comments (0)

Read next

The Justin Beiber database problem!

Designing a Production-Grade Database for High-Traffic Applications on AWS RDS MySQL

Aurora Limitless - Sequences

Adding new columns - lowCalAlt_update5