DEV Community

Tomas@PawSQL
Tomas@PawSQL

Posted on

IN Subquery Optimization

Channel of advanced SQL tuning

Problem Definition

An IN subquery is a type of subquery that takes the following form.

(expr1, expr2...) [NOT] IN (SELECT expr3, expr4, ...)
Enter fullscreen mode Exit fullscreen mode

An IN subquery can be rewritten as an equivalent correlated EXISTS subquery or inner join, which can create a extra filtering condition. If the filtering condition has an appropriate index or is recommended by the PawSQL index recommendation engine, better performance can be achieved.

  • IN Subquery to EXISTS conversion

The following IN subquery is used to retrieve user information for users who have placed orders within the past year:

select * from customer where c_custkey in (select o_custkey from orders where O_ORDERDATE>=current_date - interval 1 year)
Enter fullscreen mode Exit fullscreen mode

It can be rewritten as an EXISTS subquery, which creates a extra filtering condition (c_custkey = o_custkey):

select * from customer where exists (select * from orders where c_custkey = o_custkey and O_ORDERDATE>=current_date - interval 1 year)
Enter fullscreen mode Exit fullscreen mode
  • IN Subquery to INNER JOIN conversion

If the query result of a subquery is distinct, then the IN subquery can be rewritten as a join between two tables. This allows the database optimizer to plan a better table join sequence and also enables PawSQL to recommend better optimization methods.

For example, consider the following SQL query where c_custkey is the primary key of the customer table:

SELECT * FROM orders WHERE o_custkey IN (SELECT c_custkey FROM customer);
Enter fullscreen mode Exit fullscreen mode

Here, orders and customer are two related tables, where c_custkey is the primary key of the customer table and o_custkey is the foreign key in the orders table that relates to the customer table.

If the query result of the subquery is distinct, we can rewrite the query as a JOIN query, as shown below:

SELECT * FROM orders,customer where orders.o_custkey = customer.c_custkey;
Enter fullscreen mode Exit fullscreen mode

By rewriting the IN subquery as a JOIN query, the database optimizer can plan a better table join sequence and choose a better execution plan while executing the query. In addition, such a query is also easier to understand and maintain.

Optimization of IN Subqueries in Databases

Both MySQL and PostgreSQL can adopt the optimal execution plan for IN and EXISTS.

  • If there is no index on O_ORDERDATE, the execution plans of Query1 and Query2 on MySQL will both use pseudo-code implementation logic of IN subquery:
-> Nested loop inner join  (cost=19847117.66 rows=198449671)
    -> Table scan on customer  (cost=1155.80 rows=9948)
    -> Single-row index lookup on <subquery2> using <auto_distinct_key> (o_custkey=customer.C_CUSTKEY)
        -> Materialize with deduplication  (cost=22471.48..22471.48 rows=19949)
            -> Filter: (orders.O_ORDERDATE = <cache>((curdate() - interval 1 year)))  (cost=20476.61 rows=19949)
                -> Table scan on orders  (cost=20476.61 rows=199487)
Enter fullscreen mode Exit fullscreen mode
  • If an index is built on O_ORDERDATE, then their execution plans on MySQL will both use pseudo-code implementation logic of EXISTS subquery:
-> Nested loop semijoin  (cost=22777.29 rows=5705)
    -> Table scan on customer  (cost=1155.80 rows=9948)
    -> Filter: (orders.O_ORDERDATE = <cache>((curdate() - interval 1 year)))  (cost=0.92 rows=1)
        -> Index lookup on orders using o_idx_key (O_CUSTKEY=customer.C_CUSTKEY)  (cost=0.92 rows=6)
Enter fullscreen mode Exit fullscreen mode
  • If the queried columns in the subquery are unique, the database will convert it to an inner join.

For the following SQL as an example

select * from orders where o_custkey in (select c_custkey from customer where c_phone like '139%')
Enter fullscreen mode Exit fullscreen mode

The execution plan on MySQL is as follows (PostgreSQL is similar):

-> Nested loop inner join  (cost=3541.61 rows=6313)
    -> Filter: (customer.C_PHONE like '139%')  (cost=1148.89 rows=1099)
        -> Table scan on customer  (cost=1148.89 rows=9888)
    -> Index lookup on orders using idx_orders_ckey (O_CUSTKEY=customer.C_CUSTKEY)  (cost=1.60 rows=6)
Enter fullscreen mode Exit fullscreen mode

Note: In MySQL and PostgreSQL databases, IN or EXISTS are equivalent, and the database always try to choose the optimal execution plan based on indexes and statistical information.

Optimization of IN Subqueries in PawSQL

PawSQL will rewrite the IN subquery to EXISTS subquery or inner join query to help the index recommendation engine recommend suitable indexes, prompting the optimizer to adopt the optimal execution plan.

IN subquery to EXISTS conversion

  1. Original Query
select *
from tpch.customer
where customer.c_custkey in (
           select orders.o_custkey
           from tpch.orders
           where orders.O_ORDERDATE >= current_date - interval '1' YEAR)
Enter fullscreen mode Exit fullscreen mode
  1. Rewritten Query
select /*QB_1*/ *
from tpch.customer
where exists (select /*QB_2*/ orders.o_custkey
             from tpch.orders
             where orders.O_ORDERDATE >= current_date - interval '1' YEAR
         and orders.o_custkey = customer.c_custkey)
Enter fullscreen mode Exit fullscreen mode
  1. Index Recommended
CREATE INDEX PAW_IDX1072908633 ON tpch.ORDERS(O_ORDERDATE,O_CUSTKEY);
-- When table ORDERS(referenced in query block QB_2) serves as a DRIVE table in the join planning, PAW_IDX1072908633 can be used to do a COVERING index RANGE SCAN with condition(orders.O_ORDERDATE >= current_date - interval '1' YEAR).

CREATE INDEX PAW_IDX0031767282 ON tpch.ORDERS(O_CUSTKEY,O_ORDERDATE);
-- When table ORDERS(referenced in query block QB_2) serves as a DRIVEN table in the join planning, PAW_IDX0031767282 can be used to do a COVERING index RANGE SCAN with condition(orders.o_custkey = customer.c_custkey and orders.O_ORDERDATE >= current_date - interval '1' YEAR).

Enter fullscreen mode Exit fullscreen mode
  1. Validation Details
  • Query Plan(before)
-> Nested loop inner join  (cost=65987720.69 rows=659855821)
  -> Table scan on customer  (cost=1149.80 rows=9888)
  -> Single-row index lookup on <subquery2> using <auto_distinct_key> (o_custkey=customer.C_CUSTKEY)
      -> Materialize with deduplication  (cost=13874.51..13874.51 rows=66733)
          -> Filter: (orders.O_ORDERDATE >= <cache>((curdate() - interval '1' year)))  (cost=7201.21 rows=66733)
              -> Table scan on orders  (cost=7201.21 rows=200219)
Enter fullscreen mode Exit fullscreen mode
  • Query Plan(after)
-> Nested loop inner join  (cost=3771444.20 rows=37693056)
  -> Table scan on customer  (cost=1149.80 rows=9888)
  -> Single-row index lookup on <subquery2> using <auto_distinct_key> (o_custkey=customer.C_CUSTKEY)
      -> Materialize with deduplication  (cost=1150.65..1150.65 rows=3812)
          -> Filter: (orders.O_ORDERDATE >= <cache>((curdate() - interval '1' year)))  (cost=769.45 rows=3812)
              -> Covering index range scan on orders using PAW_IDX1072908633 over ('2022-03-28' <= O_ORDERDATE)  (cost=769.45 rows=3812)
Enter fullscreen mode Exit fullscreen mode

Performance improves by 1648.67% after PawSQL optimization.

IN subquery to INNER JOIN conversion

  1. Original Query,c_custkey is the primary key of table customer.
select *
from tpch.orders
where orders.o_custkey in (
           select customer.c_custkey
           from tpch.customer)
Enter fullscreen mode Exit fullscreen mode
  1. Rewritten Query
select *
from tpch.orders, tpch.customer
where customer.c_custkey = orders.o_custkey
Enter fullscreen mode Exit fullscreen mode
  1. Index Recommended
CREATE INDEX PAW_IDX0455857015 ON tpch.ORDERS(O_CUSTKEY,O_CLERK);
-- When table ORDERS serves as a DRIVEN table in the join planning, PAW_IDX0455857015 can be used to do a index LOOKUP with condition(customer.c_custkey = orders.o_custkey).
Enter fullscreen mode Exit fullscreen mode
  1. Validation Details
  • Query Plan(before)
-> Nested loop inner join  (cost=240790.71 rows=200219)
  -> Table scan on orders  (cost=20549.81 rows=200219)
  -> Single-row covering index lookup on customer using key_idx (C_CUSTKEY=orders.O_CUSTKEY)  (cost=1.00 rows=1)
Enter fullscreen mode Exit fullscreen mode
  • Query Plan(after)
-> Nested loop inner join  (cost=21289.23 rows=53135)
  -> Index scan on customer using key_idx  (cost=1149.80 rows=9888)
  -> Index lookup on orders using PAW_IDX0455857015 (O_CUSTKEY=customer.C_CUSTKEY)  (cost=1.50 rows=5)
Enter fullscreen mode Exit fullscreen mode

Performance improves by 1064.60% after PawSQL optimization.

Top comments (0)