Cameron Archer for Tinybird

Posted on May 8, 2023 • Originally published at tinybird.co

The 5 rules for writing faster SQL

#sql #database #analytics #backend

When you’re trying to process huge amounts of data in real-time, there are some best practices to follow. These are hard-learned lessons from years of building real-time systems at huge scale. Now, we're sharing them with you.

The 5 Rules of Fast Queries

At Tinybird, we want our users to be able to go fast. Both when building queries, and then when those queries run. So, we've created a cheatsheet of rules you can consult whenever you need to go fast.

We like to call them The 5 Rules of Fast Queries:

Rule № 1 ⟶ The best data is the data you don’t write.
Rule № 2 ⟶ The second best data is the data you don’t read.
Rule № 3 ⟶ Sequential reads are 100x faster.
Rule № 4 ⟶ The less data you process (after read), the better.
Rule № 5 ⟶ Move complex operations later in the processing pipeline.

Let’s go one by one, analyzing the performance improvement after the implementation of each rule using data stored in Tinybird. We will use the well-known NYC Taxi Trip dataset. You can get a sample here and import it directly into Tinybird by creating a new Data Source from your dashboard.

Rule 1: The best data is the data you don’t write

This rule seems obvious, but it’s not always followed. There is no reason to save data that you don’t need: it will impact the memory needed (and the money!) and the queries will take more time, so it only has disadvantages.

SELECT 'do not save what you don\'t need!' AS first_rule

Rule 2: The second best data is the data you don’t read

To avoid reading data that you don’t need, you should apply filters as soon as possible.

Using the NYC taxi data, let’s suppose we want a list of the trips whose distance is greater than 10 miles and that took place between ‘2017-01-31 14:00:00’ and ‘2017-01-31 15:00:00’. Additionally, we want to get those trips ordered by date.

Let’s see the difference in performance when we apply the filters at the end of our query execution versus the beginning.

First, let’s start the first approach by ordering all the data by date:

--NODE: rule2_data_read_NOT_OK
SELECT *
FROM nyc_taxi
ORDER BY tpep_pickup_datetime ASC
---
10.31 MB processed, 139.26k rows, 9.16 ms elapsed

Once the data is sorted, we filter it:

A screenshot showing a node of SQL that runs over filtered data. — In Tinybird, you can run multiple nodes of SQL and query over prior nodes. This node runs over the filtered SQL from the previous query above.

This first approach takes around 30-60 ms, if you add the processed time of both nodes.

Pay attention to the statistics: The first node scanned 139.26k rows processed 10.31 MB of data. The second node scanned 24.58k rows and 1.82 MB of data. Why would we scan 139.26k rows in the first place if we just really need to scan 24.58k?

It’s important to be aware that these two values directly impact the query execution time and also affect other queries you may be running at the same time. IO bandwidth is also something you need to keep in mind.

Now, let’s see what happens if the filter is applied before the sorting:

--NODE: rule2_data_read_OK
SELECT * FROM nyc_taxi
WHERE (trip_distance > 10) 
AND ((tpep_pickup_datetime >= '2017-01-31 14:00:00') 
AND (tpep_pickup_datetime <= '2017-01-31 15:00:00'))
ORDER BY tpep_pickup_datetime ASC
---
1.82 MB processed, 24.58k rows, 4.33 ms elapsed

As you can see, if the filter is applied before the sorting, the query takes < 10 ms. If you take a look at the size of the data read, it’s 1.82MB, while the number of rows read is 24.58k. Compared to the previous approach, these are much smaller and more efficient.

This significant difference happens because in the first approach, we are sorting all the data available (even the data that we don’t need for our query) while in the second approach, we are sorting just the rows we need.

Filtering is the fastest operation, so always filter first.

Rule 3: Sequential reads are 100x faster

To be able to carry out sequential reads, it’s essential to define indexes correctly. These indexes should be defined based on the queries that we're going to perform. In this example, we'll simulate indexing data by sorting our tables before we query them to see how filtering by non-indexed columns affects performance.

For example, if we want to query the data and filter by tpep_pickup_time, let’s compare what happens when the data is sorted by tpep_pickup_time versus when it’s sorted by any other column.

In the first approach, we will sort the data by another column, for instance, passenger_count:

--NODE: rule3_sequential_read_NOT_OK
SELECT * FROM nyc_taxi
ORDER BY passenger_count ASC

Once we have the data sorted by passenger_count, we filter it by tpep_pickup_time:

A screenshot of a suboptimal SQL query that filters by a column which is not indexed. — This node filters by date over data that hasn't been sorted by date. The results are suboptimal.

This approach takes around 5-10 ms, the number of scanned rows is 26.73k and the size of data is 1.98MB.

For the second approach, we'll sort the data by tpep_pickup_time:

--NODE: rule3_sequential_read_OK
SELECT * FROM nyc_taxi
ORDER BY tpep_pickup_datetime ASC

And once it’s sorted by date, we filter it:

A screenshot of an optimized SQL query that filters on indexed data. — This node filters by date over data that has been sorted by date. The query scans much less data and responds much faster.

We can see that if the data is sorted by tpep_pickup_time and the query uses tpep_pickup_time for filtering, it just takes 1-2 ms, scans only 10.35k rows, and processes only 765.53 KB. The first approach, filtering by another column, takes around 5-10 ms, scans 26.73k rows, and processes 1.98 MB.

It’s important to highlight that the more data we have, the greater the difference between both approaches. When dealing with tons of data, sequential reads can be up 100x faster or more.

Therefore, it’s essential to define the indexes taking into account the queries that will be made.

Rule 4: The less data you process (after read), the better

The less data you process, the faster and cheaper your queries will be. So, if you just need two columns, only retrieve two columns in your SELECT.

Let’s suppose that for our use case, we just need three columns: vendorid, tpep_pickup_datetime, and trip_distance.

Let’s analyze the difference between selecting all the columns versus just the ones we need.

When we select all the columns, the query takes around 140-180 ms and processes 718.55 MB of data:

--NODE: rule4_read_all_columns_NOT_OK
SELECT *
FROM
(
    SELECT 
        *
    FROM nyc_taxi
    order by tpep_dropoff_datetime
)
---
718.55 MB processed, 9.71m rows, 112.18ms elapsed

However, when we select just the columns we need, the query only takes around 35-60 ms and process ~20% of the data:

--NODE: rule4_read_some_columns_OK
SELECT *
FROM
(
    SELECT
        vendorid,
        tpep_pickup_datetime,
        trip_distance
    FROM nyc_taxi
    order by tpep_dropoff_datetime
)
---
155.36 MB processed, 9.71m rows, 35.10ms elapsed

As we mentioned before, you can check how the size of scanned data is much less, now just 155.36MB. With analytical databases, if you do not need to retrieve a column, those files are not read and it is much more efficient.

Therefore, you should only process the data that you need.

Rule 5: Move complex operations later in the processing pipeline

Complex operations, such as joins or aggregations, should be performed as late as possible in the processing pipeline. This is because in the first steps you should filter all the data, so the number of rows at the end will be less than at the beginning and, therefore, the cost of executing complex operations will be lower.

So first, let’s aggregate the data:

--NODE: rule5_complex_operation_NOT_OK
SELECT
    vendorid,
    pulocationid,
    count(*)
FROM nyc_taxi
GROUP BY
    vendorid,
    pulocationid
---
77.68 MB processed, 9.71m rows, 25.09ms elapsed

Now, let’s apply the filter:

A screenshot of a suboptimal SQL query that filters aggregated data — This query filters over aggregated data. The results are suboptimal.

If the aggregations are performing before filtering, the query takes around 50-70 ms in total, and it scans 9.71 million rows and processes 77.68 MB.

Let’s see what happens if we filter first and then aggregate:

--NODE: rule5_complex_operation_OK
SELECT
    vendorid,
    pulocationid,
    count(*)
FROM nyc_taxi
WHERE vendorid < 10
GROUP BY
    vendorid,
    pulocationid
---
77.68 MB processed, 9.71m rows, 22.13ms elapsed

This approach takes only 20-40 ms even though the number of scanned rows and the size of data is the same as in the previous approach.

Therefore, you should perform complex operations as late as possible in the processing pipeline.

Some additional guidance

In addition to these 5 rules, here’s some more general advice for optimal queries in Tinybird:

Avoid Full Scans

The less data you read in your queries, the faster they are. There are different strategies you can follow in Tinybird to avoid reading all the data in a data source (doing a full scan) from your queries:

Always filter first
Use indices by setting a proper ENGINE_SORTING_KEY in the Data Source.
The column names present in the ENGINE_SORTING_KEY should be the ones you will use for filtering in the WHERE clause. You don’t need to sort by all the columns you use for filtering, only the ones to filter first.
The order of the columns in the ENGINE_SORTING_KEY is important: from left to right ordered by relevance (the more important ones for filtering) and cardinality (less cardinality goes first).

--Tinybird Data Source schema
SCHEMA >
  `id` Int64,
  `amount` Int64,
  `date` DateTime

ENGINE "MergeTree"
ENGINE_SORTING_KEY "id, date"

--BAD: not filtering by any column in the sorting key
SELECT *
FROM data_source_sorted_by_date
WHERE amount > 30

--GOOD: filtering by columns present in the sorting key
SELECT *
FROM data_source_sorted_by_date
WHERE
  id = 135246 AND
  date > now() - INTERVAL 3 DAY AND
  amount > 30

Avoid Big Joins

When doing a JOIN in Tinybird, the data in the right Data Source is loaded in memory to perform the JOIN. Therefore, it’s recommended to avoid joining big Data Sources by filtering the data in the right Data Source.

JOINs over tables of more than 1M rows might lead to MEMORY_LIMIT errors when used in Materialized Views, affecting ingestion.

A common pattern to improve JOIN performance is the one below:

--BAD: doing a join with a large right Data Source
SELECT
    left.id AS id,
    left.date AS day,
    right.response_id AS response_id
FROM left_data_source AS left
INNER JOIN big_right_data_source AS right ON left.id = right.id

--GOOD: pre-filtering the joined Data Source
SELECT
    left.id AS id,
    left.date AS day,
    right.response_id AS response_id
FROM left_data_source AS left
INNER JOIN (
  SELECT id, response_id
  FROM big_right_data_source
  WHERE id IN (SELECT id FROM left_data_source)
) AS right ON left.id = right.id

New to Tinybird?

Tinybird is a real-time data platform for developers and data teams. Ingest data from anywhere, query it with SQL, and publish your (optimized) queries as low-latency APIs in a click. You can sign up for free with no time limit and no credit card required.

Top comments (1)

Tomas@PawSQL • Jun 1 '23

Nice!

DEV Community