DEV Community

Franck Pachot for YugabyteDB Distributed PostgreSQL Database

Posted on • Updated on

ORDER BY is mandatory in SQL to get a sorted result

In IT, just like in math, a negative test can demonstrate that an assertion is incorrect. However, a positive test doesn't definitively establish the correctness of the assertion. This challenge is amplified in IT because algorithms can yield apparently correct results with a probability that exceeds mere chance. Simply put, one should never make assumptions based solely on results. It's essential to understand how it works.

Here's a classic example:

postgres=# create table demo (n int primary key);
CREATE TABLE

postgres=# insert into demo  
           select n from generate_series(1,1e6) n;
INSERT 0 1000000

postgres=# select n from demo limit 10;

 n
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
(10 rows)
Enter fullscreen mode Exit fullscreen mode

Using this example, one might assume that a SELECT query returns rows in a specific order. However, this is a misconception. SQL is a declarative language, and without an ORDER BY clause, the result is essentially unordered. The apparent sorting in this scenario is merely a consequence of inserting data into heap tables by appending it at the end of the file and reading the file sequentially with a single thread. Rows are displayed as they are retrieved.

When inserting the rows in a different order:

postgres=# drop table if exists demo;
DROP TABLE
postgres=# create table demo (n int primary key);
CREATE TABLE
postgres=# insert into demo  select 1e6-n from generate_series(1,1e6) n;
INSERT 0 1000000
postgres=# select n from demo limit 10;
   n
--------
 999999
 999998
 999997
 999996
 999995
 999994
 999993
 999992
 999991
 999990
(10 rows)
Enter fullscreen mode Exit fullscreen mode

they appear as they were originally stored. This behavior is typical of a serial SeqScan:

postgres=# explain select n from demo limit 10;
                             QUERY PLAN
--------------------------------------------------------
 Limit  (cost=0.00..0.14 rows=10 width=4)
   ->  Seq Scan on demo  (cost=0.00..14425.00 rows=1000000 width=4)
(2 rows)
Enter fullscreen mode Exit fullscreen mode

This behavior cannot be relied upon. It can change with a different execution plan, and this can happen for many reasons like different data distribution, new indexes, parallelization, or simply some configuration changes:

postgres=# set enable_seqscan=false;
SET
postgres=# select n from demo limit 10;
 n
---
 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
(10 rows)

postgres=# explain select n from demo limit 10;

                                        QUERY PLAN
--------------------------------------------------------
 Limit  (cost=0.42..0.77 rows=10 width=4)
   ->  Index Only Scan using demo_pkey on demo  (cost=0.42..34712.43 rows=1000000 width=4)
(2 rows)
Enter fullscreen mode Exit fullscreen mode

Several factors can influence this behavior. For example, an index scan instead of a full table scan, parallel query execution, and updates can all alter the way rows are read and their physical location:

postgres=# set enable_seqscan=true;
SET
postgres=# alter table demo add column x char;
ALTER TABLE
postgres=# update demo set x=1 where mod(n,3)=0;
UPDATE 333334
postgres=# select n from demo limit 10;
   n
--------
 999998
 999997
 999995
 999994
 999992
 999991
 999989
 999988
 999986
 999985
(10 rows)
Enter fullscreen mode Exit fullscreen mode

The only way to obtain a reliably sorted result is by using an ORDER BY clause:

postgres=# select n from demo order by n limit 10;
 n
---
 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
(10 rows)
Enter fullscreen mode Exit fullscreen mode

Don't assume that an ORDER BY clause will necessarily add a sorting operation. The query planner may choose a physical structure that returns the rows in the desired order without an explicit sorting step:

postgres=# explain select n from demo order by n limit 10;
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Limit  (cost=0.42..0.77 rows=10 width=4)
   ->  Index Only Scan using demo_pkey on demo  (cost=0.42..34716.43 rows=1000000 width=4)
(2 rows)
Enter fullscreen mode Exit fullscreen mode

In SQL, you declare the desired order using an ORDER BY clause, and the query planner selects the appropriate access methods to retrieve the data in that order. The optimizer has this knowledge, but it's typically hidden from you unless you've read and comprehended the source code for the exact version you're using. That's why you must add an ORDER BY if you expect a sorted result.

YugabyteDB

In a distributed SQL database, it's highly likely that the default order won't align with your expectations:

yugabyte=# create table demo (n int primary key);
CREATE TABLE

yugabyte=# insert into demo  select 1e6-n from generate_series(1,1e6) n;
INSERT 0 1000000

yugabyte=# select n from demo limit 10;

   n
--------
 110359
 192735
 219128
 237047
 310517
 593962
 627995
 651891
 669921
 790562
(10 rows)

Enter fullscreen mode Exit fullscreen mode

This may appear random at first glance, as it exhibits an increasing order not tied to the insertion sequence, unlike PostgreSQL. However, there is a underlying logic to it. In this case, the default sharding method involves applying a hash function to the first column of the primary key. We can query this hashing function ourselves with yb_hash_code()

yugabyte=# select min(h),max(h),avg(h), 
            count(*)::float/count(distinct h) per_hash from (
            select yb_hash_code( generate_series(1,1e6) ) h 
           ) v;

 min |  max  |        avg         |   per_hash
-----+-------+--------------------+---------------
   0 | 65535 | 32774.509179000000 | 15.2587890625
(1 row)
Enter fullscreen mode Exit fullscreen mode

This function can return one of 65536 values, ranging from 0 to 65535. These values are used to distribute rows into tablets within the distributed storage system of YugabyteDB, called DocDB.

Without an ORDER BY clause, the rows are returned in the order of their hash code:

yugabyte=# select yb_hash_code(n), n from demo limit 10;

 yb_hash_code |   n
--------------+--------
            0 | 110359
            0 | 192735
            0 | 219128
            0 | 237047
            0 | 310517
            0 | 593962
            0 | 627995
            0 | 651891
            0 | 669921
            0 | 790562
(10 rows)
Enter fullscreen mode Exit fullscreen mode

Indeed, when querying all rows with a specific hash code, they are returned in order, because for each hash value, in each DocDB partition, the rows are clustered and sorted on their primary key.

yugabyte=# select n from demo where yb_hash_code(n)=0;

   n
--------
 110359
 192735
 219128
 237047
 310517
 593962
 627995
 651891
 669921
 790562
 792363
 819768
 891493
 984191
(14 rows)
Enter fullscreen mode Exit fullscreen mode

When you delve into the physical storage and retrieval process using a SeqScan, the initial order may seem random, but it's not mysterious. With 1 million rows and 65536 hash values, there are roughly 15.2 rows per hash code on average. What appears to be an order when selecting a small number of rows exhibits a different pattern when selecting more rows than the number of rows per hash code:

yugabyte=# select n, yb_hash_code(n) from demo limit 50;

   n    | yb_hash_code
--------+--------------
 110359 |            0
 192735 |            0
 219128 |            0
 237047 |            0
 310517 |            0
 593962 |            0
 627995 |            0
 651891 |            0
 669921 |            0
 790562 |            0
 792363 |            0
 819768 |            0
 891493 |            0
 984191 |            0
  17012 |            1
  24685 |            1
 153595 |            1
 186378 |            1
 219742 |            1
 258869 |            1
 271029 |            1
 547922 |            1
 565568 |            1
 763430 |            1
 766123 |            1
 772002 |            1
 781840 |            1
 840822 |            1
 844655 |            1
 953917 |            1
 162485 |            2
 168413 |            2
 271551 |            2
 285516 |            2
 407063 |            2
 420509 |            2
 440160 |            2
 572540 |            2
 585722 |            2
 589471 |            2
 628271 |            2
 719191 |            2
 837125 |            2
 866379 |            2
 951013 |            2
 976519 |            2
 994652 |            2
    854 |            3
  57757 |            3
  70079 |            3
(50 rows)
Enter fullscreen mode Exit fullscreen mode

Internally, within YugabyteDB, table rows are organized in order based on the DocDB key. This key is essentially the primary key with a prefix of the hash code. This design facilitates fast access to specific key points or ranges: starting from the hash code, it identifies the correct tablet, and subsequently locates the desired row within the SSTable (Sorted Sequence Table) structure.

Should I choose to shard based on a range, even a SeqScan would return results in an ordered fashion:

yugabyte=# drop table demo;
DROP TABLE

yugabyte=# create table demo (n int, primary key(n asc))
           split at values ( (333333),(666666) );
CREATE TABLE

yugabyte=# insert into demo  select 1e6-n from generate_series(1,1e6) n;
INSERT 0 1000000

yugabyte=# select n from demo limit 10;
 n
---
 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
(10 rows)

yugabyte=# explain analyze select n from demo limit 10;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..1.00 rows=10 width=4) (actual time=0.788..0.796 rows=10 loops=1)
   ->  Seq Scan on demo  (cost=0.00..100.00 rows=1000 width=4) (actual time=0.787..0.791 rows=10 loops=1)
 Planning Time: 0.047 ms
 Execution Time: 0.855 ms
(4 rows)
Enter fullscreen mode Exit fullscreen mode

The Seq Scan displays an order similar to that of an Index Scan on the primary key:

yugabyte=# explain analyze select n from demo order by n limit 10;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..1.14 rows=10 width=4) (actual time=0.809..0.816 rows=10 loops=1)
   ->  Index Scan using demo_pkey on demo  (cost=0.00..114.00 rows=1000 width=4) (actual time=0.808..0.813 rows=10 loops=1)
 Planning Time: 0.066 ms
 Execution Time: 0.843 ms
(4 rows)
Enter fullscreen mode Exit fullscreen mode

This occurs because the table is stored the primary key structure, similar to Oracle's Index Organized Table, SQL Server's Clustered Index, or MySQL's InnoDB tables.

I mentioned earlier that the hash code has 65536 possible values:

yugabyte=# drop table demo;
DROP TABLE
yugabyte=# create table demo (n int primary key)
           split into 16 tablets;
CREATE TABLE
yugabyte=# insert into demo
           select n from generate_series(1,1e6) n;
INSERT 0 1000000
yugabyte=# select count(distinct yb_hash_code(n)) from demo;
 count
-------
 65536
(1 row)
Enter fullscreen mode Exit fullscreen mode

However, it's important to note that the number of tablets is typically smaller, with just a few per node, to facilitate rebalancing when adding nodes.

To determine the number of tablets from the execution plan, I execute an aggregate function that gets pushed down to each tablet. This way, the number of rows in the execution plan corresponds to the number of aggregations performed:

yugabyte=# explain analyze select count(*) from demo;

yugabyte=# explain analyze select count(*) from demo;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=102.50..102.51 rows=1 width=8) (actual time=984.031..984.031 rows=1 loops=1)
   ->  Seq Scan on demo  (cost=0.00..100.00 rows=1000 width=0) (actual time=357.419..984.010 rows=16 loops=1)
 Planning Time: 0.055 ms
 Execution Time: 984.145 ms
(4 rows)
Enter fullscreen mode Exit fullscreen mode

The rows=16 corresponds to the 16 count(*) results obtained from each tablet. In this case, there are 16 tablets, and each tablet handles 4096 hash codes.

What about GROUP BY or DISTINCT ?

Because many databases implement deduplication by sorting the row set, it frequently leads to the misconception that GROUP BY or DISTINCT queries yield ordered results. To illustrate this, here's an example of a misguided and unhelpful question on Twitter with numerous incorrect responses:

To truly master SQL, you should disregard the myths disseminated by self-proclaimed experts who make broad generalizations based on isolated observations. Instead, prioritize understanding execution plans and remember these key points that apply to all SQL databases:

  • SQL is a declarative language, so you can't predict how rows are processed without examining the execution plan
  • The use of ORDER BY is the sole declaration in a SQL query that ensures a sorted result.

Back to my example on YugabyteDB, and you can see the same on PostgreSQL, without an ORDER BY the result is not sorted:

yugabyte=> select n from demo group by n limit 10;

   n
--------
    790
 662718
 694339
 233338
 129976
 164125
 756002
 502084
 157514
 503514
(10 rows)

yugabyte=> explain analyze
           select n from demo group by n limit 10;

                                                       QUERY PLAN                                               
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=102.50..102.60 rows=10 width=4) (actual time=2811.675..2811.680 rows=10 loops=1)
   ->  HashAggregate  (cost=102.50..112.50 rows=1000 width=4) (actual time=2811.673..2811.676 rows=10 loops=1)
         Group Key: n
         ->  Seq Scan on demo  (cost=0.00..100.00 rows=1000 width=4) (actual time=2.195..2463.991 rows=1000000 loops=1)
 Planning Time: 0.064 ms
 Execution Time: 2818.669 ms
 Peak Memory Usage: 139813 kB
(7 rows)

Enter fullscreen mode Exit fullscreen mode

The execution plan makes it clear that an Hash Aggregate was used.

I have the option to disable it and enforce a Sort Aggregate, which can be achieved using pg_hint_plan. This extension is installed by default in YugabyteDB, and is handy to restrict the setting to the scope of one query, but you can also set the parameter at the scope of the session:

yugabyte=> /*+ Set(enable_hashagg off) */
           select n from demo group by n limit 10;

 n
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
(10 rows)

yugabyte=> /*+ Set(enable_hashagg off) */
           explain analyze
           select n from demo group by n limit 10;

                                                           QUERY PLAN                                            
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=149.83..149.88 rows=10 width=4) (actual time=2894.237..2894.244 rows=10 loops=1)
   ->  Group  (cost=149.83..154.83 rows=1000 width=4) (actual time=2894.236..2894.240 rows=10 loops=1)
         Group Key: n
         ->  Sort  (cost=149.83..152.33 rows=1000 width=4) (actual time=2894.234..2894.235 rows=10 loops=1)
               Sort Key: n
               Sort Method: external merge  Disk: 13800kB
               ->  Seq Scan on demo  (cost=0.00..100.00 rows=1000 width=4) (actual time=2.306..2621.073 rows=1000000 loops=1)
 Planning Time: 0.085 ms
 Execution Time: 2896.456 ms
 Peak Memory Usage: 6495 kB
(10 rows)
Enter fullscreen mode Exit fullscreen mode

The ascending order of the result is merely a result of the grouping algorithm used. It's essential to remember that without an ORDER BY clause, you cannot depend on obtaining ordered results.

Do you believe that adding an ORDER BY clause will impact response time? Simply inspect the execution plan, and you'll observe that it remains unchanged. The query planner recognizes that the Group operation retains the order and doesn't require an additional sorting step:

yugabyte=> /*+ Set(enable_hashagg off) */
           explain analyze
           select n from demo group by n order by n limit 10;
                                                          QUERY PLAN                                            
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=149.83..149.88 rows=10 width=4) (actual time=2870.805..2870.812 rows=10 loops=1)
   ->  Group  (cost=149.83..154.83 rows=1000 width=4) (actual time=2870.804..2870.808 rows=10 loops=1)
         Group Key: n
         ->  Sort  (cost=149.83..152.33 rows=1000 width=4) (actual time=2870.802..2870.804 rows=10 loops=1)
               Sort Key: n
               Sort Method: external merge  Disk: 13800kB
               ->  Seq Scan on demo  (cost=0.00..100.00 rows=1000 width=4) (actual time=2.242..2603.396 rows=1000000 loops=1)
 Planning Time: 0.089 ms
 Execution Time: 2872.848 ms
 Peak Memory Usage: 6495 kB
(10 rows)
Enter fullscreen mode Exit fullscreen mode

To Summarize...

It's crucial to understand how rows are stored and fetched to optimize performance, and that's where the execution plan comes in. However, when executing application queries, remember that SQL is declarative, and the actual plan can vary. If you require a sorted result, explicitly declare it using ORDER BY, and the query planner will determine the appropriate steps. Any claim about an expected ordering without an ORDER BY is a mistake, by definition.

Top comments (0)