Tomas@PawSQL

Posted on

Order by random Optimization

Problem

Function`rand`(MySQL) or `random`(PostgreSQL) returns a random floating-point value in the range 0 to 1.0. It is useful to get a random sample of dataset using following query.

``````select * from orders order by rand() limit 1;
``````

This works fine if `customer` table is less than 10,000 rows. But when you have 1,000,000 rows, the overhead for sorting the rows becomes unacceptable. The reason is quiet clear: we sort all rows, but only keep one.

MySQL Plan

``````-> Limit: 1 row(s)  (actual time=203.465..203.465 rows=1 loops=1)
-> Sort: `rand()`, limit input to 1 row(s) per chunk  (actual time=203.464..203.464 rows=1 loops=1)
-> Stream results  (cost=20540.71 rows=200128) (actual time=0.035..173.789 rows=201800 loops=1)
-> Table scan on l  (cost=20540.71 rows=200128) (actual time=0.030..121.293 rows=201800 loops=1)
``````

PostgreSQL Plan

``````Limit  (cost=4962.22..4962.22 rows=1 width=66) (actual time=58.080..58.082 rows=1 loops=1)
->  Sort  (cost=4962.22..5391.96 rows=171898 width=66) (actual time=58.078..58.079 rows=1 loops=1)
Sort Key: (random())
Sort Method: top-N heapsort  Memory: 25kB
->  Seq Scan on orders l  (cost=0.00..4102.73 rows=171898 width=66) (actual time=0.010..26.818 rows=171898 loops=1)
``````

Solutions

1. If there is an unique index on a numeric column with uniformly distributed values, the query can be rewrite to more efficient one to avoid full table scan and all row sorting. `create unique index ord_idx_key on orders(o_orderkey)`

Rewritten SQL

``````select
*
from
orders
where
o_orderkey >= (
select
floor( RAND() * ((select MAX(o_orderkey) from orders)-(select MIN(o_orderkey) from orders)) + (select MIN(o_orderkey) from orders)))
order by
o_orderkey
limit 10;
``````

Analyze Plan of MySQL

``````-> Limit: 10 row(s)  (cost=0.03 rows=3) (actual time=76.877..124.157 rows=10 loops=1)
-> Filter: (orders.O_ORDERKEY >= floor(((rand() * <cache>(((select #3) - (select #4)))) + (select #5))))  (cost=0.03 rows=3) (actual time=76.876..124.155 rows=10 loops=1)
-> Index scan on orders using PAW_IDX0334337551  (cost=0.03 rows=10) (actual time=0.015..111.731 rows=106410 loops=1)
-> Select #3 (subquery in condition; run only once)
-> Rows fetched before execution  (cost=0.00..0.00 rows=1) (actual time=0.000..0.000 rows=1 loops=1)
-> Select #4 (subquery in condition; run only once)
-> Rows fetched before execution  (cost=0.00..0.00 rows=1) (actual time=0.000..0.000 rows=1 loops=1)
-> Select #5 (subquery in condition; run only once)
-> Rows fetched before execution  (cost=0.00..0.00 rows=1) (actual time=0.000..0.000 rows=1 loops=1)
``````

Analyze Plan of PostgreSQL

``````Limit  (cost=1.27..1.44 rows=1 width=66) (actual time=27.840..27.847 rows=1 loops=1)
->  Nested Loop  (cost=1.27..9724.17 rows=57299 width=66) (actual time=27.838..27.845 rows=1 loops=1)
Join Filter: ((t1.o_orderkey)::double precision >= (floor(((random() * ((\$1 - \$3))::double precision) + (\$5)::double precision))))
Rows Removed by Join Filter: 91974
->  Index Scan using ord_idx_key on orders t1  (cost=0.29..6714.94 rows=171898 width=58) (actual time=0.022..9.406 rows=91975 loops=1)
->  Materialize  (cost=0.98..1.02 rows=1 width=8) (actual time=0.000..0.000 rows=1 loops=91975)
->  Result  (cost=0.98..1.01 rows=1 width=8) (actual time=0.112..0.117 rows=1 loops=1)
InitPlan 2 (returns \$1)
->  Result  (cost=0.32..0.33 rows=1 width=4) (actual time=0.080..0.082 rows=1 loops=1)
InitPlan 1 (returns \$0)
->  Limit  (cost=0.29..0.32 rows=1 width=4) (actual time=0.079..0.079 rows=1 loops=1)
->  Index Only Scan Backward using ord_idx_key on orders  (cost=0.29..3704.51 rows=171898 width=4) (actual time=0.078..0.079 rows=1 loops=1)
Index Cond: (o_orderkey IS NOT NULL)
Heap Fetches: 0
InitPlan 4 (returns \$3)
->  Result  (cost=0.32..0.33 rows=1 width=4) (actual time=0.013..0.014 rows=1 loops=1)
InitPlan 3 (returns \$2)
->  Limit  (cost=0.29..0.32 rows=1 width=4) (actual time=0.013..0.013 rows=1 loops=1)
->  Index Only Scan using ord_idx_key on orders orders_1  (cost=0.29..3704.51 rows=171898 width=4) (actual time=0.013..0.013 rows=1 loops=1)
Index Cond: (o_orderkey IS NOT NULL)
Heap Fetches: 0
InitPlan 6 (returns \$5)
->  Result  (cost=0.32..0.33 rows=1 width=4) (actual time=0.007..0.009 rows=1 loops=1)
InitPlan 5 (returns \$4)
->  Limit  (cost=0.29..0.32 rows=1 width=4) (actual time=0.006..0.007 rows=1 loops=1)
->  Index Only Scan using ord_idx_key on orders orders_2  (cost=0.29..3704.51 rows=171898 width=4) (actual time=0.006..0.007 rows=1 loops=1)
Index Cond: (o_orderkey IS NOT NULL)
Heap Fetches: 0
``````
1. Otherwise, we need to create a map table to solve this problem.
``````   create table orders_key_map ( row_id int not NULL primary key, o_orderkey int not null);
SET @id = 0;
INSERT INTO orders_key_map SELECT @id := @id + 1, id FROM holes;
``````
``````   > select * from orders_key_map;
+--------+-----------+
| row_id | o_orderkey |
+--------+-----------+
|      1 |         100 |
|      2 |         102 |
|      3 |         300 |
|      4 |         833 |
|      5 |        1116 |
+--------+-----------+
``````

And then we can use following SQL to get a random sample of `orders` table:

``````   select
*
from
orders o, orders_key_map m
where
o.o_orderkey = m.o_orderkey
m.row_id >= (
select
floor( RAND() * ((select MAX(row_id) from orders_key_map)-(select MIN(row_id) from orders_key_map)) + (select MIN(row_id) from orders_key_map)))
order by
row_id
limit 1;
``````

PawSQL is dedicated to automatic and intelligent database performance optimization. The products provided by PawSQL include:

• PawSQL Cloud, an online automated SQL optimization tool that supports SQL auditing, intelligent query rewriting, cost-based index recommendations, suitable for database administrators and data application developers.
• PawSQL Advisor, an IntelliJ plugin that is suitable for data application developers and can be installed via the IDEA/DataGrip marketplace by searching for "PawSQL Advisor" by name.
• PawSQL Engine, which is the backend optimization engine of the PawSQL series of products, and provides optimization capability through http/json interfaces. PawSQL Engine is provided for deployment and installation as a docker image.