Improving the performance of LIMIT,OFFSET in MySQL

#mysql

When using medium or large datasets, the use of LIMIT can be a performance bottle neck.

The reason it's a bottle neck is to do with how LIMIT works with OFFSET.

LIMIT is not able to just retrieve the number of rows you ask for, it retrieves all rows up until it has enough data to return the rows being asked for, it then discards everything not needed.

In the query below, 20 rows will be retrieved and 10 returned in the result set.

SELECT *
FROM salaries
LIMIT 10
OFFSET 10

In the query below, 800010 rows are retrieved and 10 returned in the result set.

SELECT *
FROM salaries
LIMIT 10
OFFSET 800000

If your website uses Previous/Next navigation to move through the pages of data, then one option is to return a cursor with your result set.

What is a cursor?

A cursor is an indicator to tell the query where in the result set the data needs to continue searching from.

It's important to remember that because data doesn't always return the same order each time, that the data needs to be ordered in a way that will make it consistent.

A good way to do this is to include an ORDER BY in the query.

SELECT *
FROM salaries
ORDER BY id
LIMIT 10
OFFSET 10

Once you receive the data from this query, the next query might be:

SELECT *
FROM salaries
ORDER BY id
LIMIT 10
OFFSET 20

Where the offset is now 20 instead of 10 to retrieve the next set of 10 rows.

Using a cursor here can help improve performance. To achieve this, your code would need to send the last read ID, so the new query can take that into account

SELECT *
FROM salaries
WHERE id > 10
ORDER BY id
LIMIT 10

This will be especially beneficial when the page offset is large. Instead of retrieving 800000 rows and returning 10 rows, the query is only retrieves 10 rows.

SELECT *
FROM salaries
WHERE id > 800000
ORDER BY id
LIMIT 10

Cursors are not for every situation, sometimes you may need to navigate directly to page 10 of the result set, in a scenario like this, a cursor won't work, you won't know what the last ID read is.

One way to improve a query like this is to use a deferred join.

Deferred Joins

Deferred joins are used to prefetch a dataset before retrieving all the data needed.

To compare the query execution time with a standard LIMIT/OFFSET query, I am using the following base query.

SELECT *
FROM salaries
ORDER BY salary, id
LIMIT 100
OFFSET 10

In the query above the average time to run on a table with about 1 million records was: 0.095 Seconds.

Creating a deferred join.

To create a deferred join, most of the work the query does is moved into a subquery that returns the PRIMARY KEY or UNIQUE indexed field, in this case the id field. It just needs to be a field that can self join without generating multiple rows in the returned dataset.

This reduces the number of rows the outer query needs to retrieve before returning all the data required.

It's no longer retrieving all the table columns and performing the LIMIT/OFFSET and discard on all the data, it's only doing that work on the id field data.

An added benefit to this is that the table index can be better utilized.

Under some circumstances you may be able to use a covering query in the sub query, which can be more performant.

Below is the deferred query method, INNER JOINING the id field returned from the subquery with the main data, only those 10 rows of data will be retrieved and returned in the result set.

SELECT *
FROM salaries
  INNER JOIN (
    SELECT id
    FROM salaries
    ORDER BY salary, id
    LIMIT 100
    OFFSET 10
  ) s USING (id)

The average time for this query was: 0.0901 Seconds

While the deferred query was faster, it is not significantly faster when the OFFSET is very small.

Lets change the OFFSET to a much larger number

SELECT *
FROM salaries
ORDER BY salary, id
LIMIT 100
OFFSET 800000

This time the average time for the query was: 0.3337 Seconds.

What about with a deferred join?

SELECT *
FROM salaries
  INNER JOIN (
    SELECT id
    FROM salaries
    ORDER BY salary, id
    LIMIT 100
    OFFSET 800000
  ) s USING (id)

For this query the time taken was: 0.1757 Seconds.

The deferred query method was about 2x faster than the standard query.

Like anything in SQL, Deferred joins are not always going to be the best choice, but they can provide a lot of benefits to a LIMIT/OFFSET query.

The more rows your query needs to traverse to reach the data it needs to return, the larger the savings will be, but even with a small offset there was still benefit to using a deferred join.