1 SQL Query You Should Stop Using

Abdisalan on June 16, 2020

Let's talk about pages. You know, like this or infinite scrolling pages like this. Because we never want to give our website visitors all of our...
 
Alain D'Ettorre

It comes down to how much work the database has to do: using the ID is much faster because the ID field is usually the primary key and therefore indexed, and an index keeps its values in sorted order. For example, when the database sees "WHERE id > 123 LIMIT 10" it can skip every smaller id right away and read just those 10 rows, while "OFFSET 123 LIMIT 10" means it produces 133 rows and then discards the first 123.

In order to get rid of OFFSET in pagination queries and still have filters in the WHERE clause and custom sorting, you just need to index any table field you want rows to be sorted by, so that the database can seek to a position in that sort order instead of scanning up to it.
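To make the comparison concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than the PostgreSQL setup from the article. The `posts` table, its columns, and the helper names are made up for illustration. It only shows that both queries return the same page; it does not measure speed.

```python
import sqlite3

def make_db(n_rows=200):
    """Build an in-memory table with ids 1..n_rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO posts (id, title) VALUES (?, ?)",
        [(i, f"post {i}") for i in range(1, n_rows + 1)],
    )
    return conn

def page_by_offset(conn, offset, limit):
    # The database has to walk past `offset` rows before returning any.
    sql = "SELECT id FROM posts ORDER BY id LIMIT ? OFFSET ?"
    return [row[0] for row in conn.execute(sql, (limit, offset))]

def page_by_keyset(conn, last_seen_id, limit):
    # The primary-key index lets the database start right after last_seen_id.
    sql = "SELECT id FROM posts WHERE id > ? ORDER BY id LIMIT ?"
    return [row[0] for row in conn.execute(sql, (last_seen_id, limit))]

conn = make_db()
offset_page = page_by_offset(conn, 123, 10)   # ids 124..133
keyset_page = page_by_keyset(conn, 123, 10)   # ids 124..133
```

The two pages only line up here because the ids are contiguous; with gaps in the ids, OFFSET counts rows while the keyset query compares values.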

Abdisalan

That’s a good clarification: the cursor pagination methods rely on an index to reduce the lookup time for the WHERE clause.

Without an index, the query is much slower.

Patrick Nelson

Did you have any performance charts on the second technique outlined? Curious to know if having a slightly more complex WHERE clause with two sorted columns affects performance much when dealing with ~2M rows.

Also: Was this MySQL? I wonder if different databases have similar results. Worth noting in case results vary between the SQL databases out there. Since I'm lazy and I'd be interested in tinkering with this first hand, do you still have the source code you used to generate these stats? Specifically, the code used to generate the fake data (e.g. a script inserting output from Faker)?

Thanks for the article!

Abdisalan

Found my code and added it to GitHub here: github.com/Abdisalan/blog-code-exa...

I've also included the db schema SQL file to help create the database. Each file is named after the pagination method it uses, and I collected the data by writing the times to a file like this:

python offset_pagination.py > data.csv
 
Abdisalan

The performance when sorting with two columns was slightly slower, running at 425k rows per second vs 476k. Unfortunately I don’t have the charts for that one; I should have made them before I deleted the DB!

The database was PostgreSQL. I wonder if another db would perform better 🤔

The script is also gone! 😢 I can try to recreate it and get back to you though.

Thanks for reading!

David Sullenbarger

(sorry, I didn't actually read anything but the title) ... you could pull in all of the data at once and let your (application) code send it out in chunks which would mean you could write the SQL any way you wanted (code review should clean it up)

or if you have database access: write a function or procedure from that side to help.

Using indexes (and partitioning for huge tables) really helps.

Abdisalan

I promise the rest of the article is good too 😂

David Sullenbarger

Ok, I read it :-)

I've never used 'offset' because it doesn't seem like a good idea to me (though I do keep an eye out for an actual use case), but I'm also writing apps that are used on a robust enterprise WAN and have access to, well, just about anything I could possibly need from a hardware perspective, so YMMV.

calvintwr

This is exactly right. More pertinently, ORMs should be providing this manner of pagination by default. The other reason is that if database rows are inserted between queries, OFFSET-based pages will start to overlap.
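The overlap is easy to reproduce. Below is a small sketch with Python's sqlite3 module and a made-up `posts` table listed newest first. The table, the ids, and the helper names are assumptions for the demo, not anything from the article's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO posts (id, title) VALUES (?, ?)",
    [(i, f"post {i}") for i in range(1, 11)],
)

def newest_by_offset(offset, limit=3):
    sql = "SELECT id FROM posts ORDER BY id DESC LIMIT ? OFFSET ?"
    return [row[0] for row in conn.execute(sql, (limit, offset))]

def newest_by_keyset(before_id, limit=3):
    sql = "SELECT id FROM posts WHERE id < ? ORDER BY id DESC LIMIT ?"
    return [row[0] for row in conn.execute(sql, (before_id, limit))]

page_1 = newest_by_offset(0)                   # [10, 9, 8]

# A new post arrives while the visitor is still reading page 1.
conn.execute("INSERT INTO posts (id, title) VALUES (11, 'post 11')")

offset_page_2 = newest_by_offset(3)            # [8, 7, 6]  (8 is shown twice)
keyset_page_2 = newest_by_keyset(page_1[-1])   # [7, 6, 5]  (no repeat)
```

The keyset version is unaffected because it asks for rows older than the last one seen, not for rows at a position that shifts when new rows arrive.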

Abdisalan

Absolutely, this should be standard or at least an option in most ORMs!

Dian Fay

Markus Winand maintains a "hall of fame" for data access tools which offer keyset pagination.

Abdisalan

That’s awesome! Thanks for sharing 🤩

Michael Lee

This was a terrific and quick read. My MySQL skills are limited but this makes perfect sense for getting in/out of the db as efficiently as possible (though it's an interesting proposition to just grab all results and filter server-side, away from the db). Great stuff.

Abdisalan

Thanks! Glad you liked it!

Jer

I'd worry about using an auto-incremented ID column as the offset. I imagine in practice that they're sequential, but I don't think it's part of the spec (of course, I'm no expert). I have seen MySQL and Postgres spit out non-sequential IDs though.

Like @robloche mentioned, I think using OFFSET + FETCH might help with the issue. It'd be interesting to see results, anyway. Postgres supports it.

Abdisalan

The auto-incremented ID was just a demonstration that you can use any ordered column. I'll have to look into FETCH! There are so many awesome suggestions from this community 😁

Robloche

Are you talking about MySQL?
Because I have an SQL Server stored procedure that uses OFFSET and FETCH NEXT to browse through a 500K-row table and the execution time is stable throughout the table.

Abdisalan

I'm using PostgreSQL and no stored procedures. That's awesome that MySQL can do that!

Robloche

Sorry for the confusion. I'm using SQL Server, and yes, it seems correctly optimized for this situation.

Igor Santos

Great article! I guess I never thought about ever-changing data with long pagination.

Mini-tip: avoid the clickbait trap! A better title would be "1 SQL feature" instead, or even one that mentions paging directly.

Bhupesh Varshney 👾

I'm not good at this, but wouldn't indexing help here?

Luke Carrier

With the disclaimer that I'm no expert, there are two different types of index:

  • Clustered indices affect the layout of the data on the disk. This is why primary key lookups are relatively fast: the database engine is able to calculate the offset within the data file where it expects to find a row from the index.
  • Non-clustered indices are an intermediary between a set of values in the row and the row's primary key. These incur an additional lookup.
 
Abdisalan

Interesting, I'll have to learn more about this Luke. Good point!

Abdisalan

That's okay! I'm learning as well :)

In my testing, I used an index to help with the OFFSET and the results improved, but they were still nowhere near as good as the cursor method.

The results you see in the article are me using an index.

Josef Biehler • Edited

I already know about the internals of pagination. But how can we avoid this if we have a pool of items with one to ten filters, where the user decides which filters to combine and how to sort? Is there a solution that always gives me ten items per page (except the last page, of course)?

Abdisalan • Edited

As long as your results are ordered, you can use this method with any number of filters. You'll have to get creative on how to add the filters, because you don't want to create 2^10 SQL statements for every combination of filters.

The solution would be to pick 1 or 2 fields you can order your results by and then apply your filters as well. You can query the next page based on the field you are ordering your data by.

Example query for employees, where the filters are the department and whether they are on vacation, ordered by birthday:

-- page 1 (id is assumed to be a unique column, used only to break ties)
SELECT * FROM employees
WHERE
  department != 'finance' AND
  on_vacation = false
ORDER BY birthday DESC, id DESC
LIMIT 10

-- page 2: continue strictly after the last row of page 1
SELECT * FROM employees
WHERE
  department != 'finance' AND
  on_vacation = false AND
  (birthday, id) < ($last_birthday_on_page_1, $last_id_on_page_1)
ORDER BY birthday DESC, id DESC
LIMIT 10

Hope that helps
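One way to avoid writing a statement per filter combination is to assemble the WHERE clause from an allow-list of fragments and append the cursor condition last. The sketch below uses Python's sqlite3 module. The `employees` schema, the `FILTERS` mapping, and `fetch_page` are hypothetical names, and the unique `id` column used as a tie-breaker for equal birthdays is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, department TEXT,"
    " on_vacation INTEGER, birthday TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (
            i,
            "finance" if i % 5 == 0 else "engineering",
            1 if i % 7 == 0 else 0,        # SQLite stores booleans as 0/1
            f"1990-01-{i % 6 + 1:02d}",    # deliberately repeated birthdays
        )
        for i in range(1, 26)
    ],
)

# Allow-list of optional filters: one SQL fragment each, combined on demand.
FILTERS = {
    "exclude_department": "department != ?",
    "on_vacation": "on_vacation = ?",
}

def fetch_page(filters, after=None, limit=10):
    """Return (id, birthday) rows; `after` is the (birthday, id) of the last row seen."""
    clauses, params = [], []
    for name, value in filters.items():
        clauses.append(FILTERS[name])  # unknown filter names raise KeyError
        params.append(value)
    if after is not None:
        last_birthday, last_id = after
        # Strictly after the cursor in (birthday DESC, id DESC) order.
        clauses.append("(birthday < ? OR (birthday = ? AND id < ?))")
        params += [last_birthday, last_birthday, last_id]
    where = "WHERE " + " AND ".join(clauses) if clauses else ""
    sql = (
        f"SELECT id, birthday FROM employees {where} "
        "ORDER BY birthday DESC, id DESC LIMIT ?"
    )
    return conn.execute(sql, params + [limit]).fetchall()

chosen = {"exclude_department": "finance", "on_vacation": 0}
page_1 = fetch_page(chosen)
last_id, last_birthday = page_1[-1]
page_2 = fetch_page(chosen, after=(last_birthday, last_id))
```

Only fragments from the allow-list ever reach the SQL text; the user-supplied values are always bound as parameters. Every page except the last comes back full, whatever filters are chosen.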

Austin Gil

Yeah, these are some good lil tips. I think your "Order Base Pagination" example relies on the ID column being an auto-incrementing index. I've heard this is no longer a best practice because it would make sharding really hard, so it's better to reach for UUIDs. Do you have a solution for pagination with UUIDs?

Abdisalan

To sort with UUIDs, you'll need to rely on another column that you can order your data by. The 3rd example with created_at is a common solution but you could use anything you can order your data by :)

Austin Gil

OK, I figured as much. I wonder if there is a way to do something like DynamoDB where you would pass the ID to start after. So rather than an offset, you tell the DB like "get me the next 10 posts starting after this ID" but the ID is not like an integer.

Abdisalan

That would be pretty interesting! I'll have to look up that feature in Dynamo

Austin Gil

Yeah. I'm not sure it's possible with SQL. Part of the reason Dynamo is so fast is the way it looks up data, but that also makes it very restrictive in how you access that data and plan your primary keys. So far I still prefer SQL, but it's good to know about the strengths and weaknesses.
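For what it's worth, a "start after this ID" call can be approximated in plain SQL when the cursor is a UUID: look up the sort value of the cursor row and continue after the (created_at, id) pair. This is a rough sketch with Python's sqlite3 module, not a description of how DynamoDB works. The `posts` table, the integer `created_at`, and the function names are assumptions.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id TEXT PRIMARY KEY, created_at INTEGER NOT NULL)")
conn.execute("CREATE INDEX posts_created_id ON posts (created_at, id)")
conn.executemany(
    "INSERT INTO posts (id, created_at) VALUES (?, ?)",
    # i // 3 makes several posts share a timestamp, so ties really occur.
    [(str(uuid.uuid4()), i // 3) for i in range(25)],
)

FIRST_PAGE_SQL = "SELECT id FROM posts ORDER BY created_at, id LIMIT :limit"

# The UUID is the only thing the caller passes; its created_at is looked up here.
NEXT_PAGE_SQL = """
SELECT id FROM posts
WHERE created_at > (SELECT created_at FROM posts WHERE id = :cursor)
   OR (created_at = (SELECT created_at FROM posts WHERE id = :cursor)
       AND id > :cursor)
ORDER BY created_at, id
LIMIT :limit
"""

def posts_after(cursor_id=None, limit=10):
    """Return the next `limit` post ids after the post with id `cursor_id`."""
    if cursor_id is None:
        rows = conn.execute(FIRST_PAGE_SQL, {"limit": limit})
    else:
        rows = conn.execute(NEXT_PAGE_SQL, {"cursor": cursor_id, "limit": limit})
    return [row[0] for row in rows]

def walk_all(limit=10):
    """Follow the cursor from page to page until an empty page comes back."""
    seen, cursor_id = [], None
    while True:
        page = posts_after(cursor_id, limit)
        if not page:
            return seen
        seen.extend(page)
        cursor_id = page[-1]
```

The UUID itself carries no useful order; it only breaks ties between posts with the same `created_at`, which is what keeps the pages from skipping or repeating rows.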

Chris Jasper

And how would you propose that you page data that is ordered by the user? For example, a name column that a user can toggle asc/desc?

Abdisalan

I would do the same as the created_at example. Order your query by the name whether asc or desc and then use the where clause to filter the names you've already seen.

If page one is:
AAA
AAB
AAC

your query for the next page could be

SELECT * FROM users
WHERE name > 'AAC'
ORDER BY name ASC
LIMIT 3

For descending order, flip the comparison to < and the sort to DESC.

It's a little bit harder if the user can sort things by a lot of different factors. You can decide to just use OFFSET anyway, depending on how much data you have.
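The comparison operator has to flip together with the sort direction, which a small helper can take care of. This is a sketch with Python's sqlite3 module; the `users` table and `names_page` are hypothetical, and it assumes names are unique (otherwise a tie-breaker column is needed).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO users (name) VALUES (?)",
    [(n,) for n in ["AAA", "AAB", "AAC", "AAD", "AAE", "AAF", "AAG"]],
)

def names_page(direction, last_name=None, limit=3):
    """Return one page of names; `last_name` is the last name of the previous page."""
    # Only these two fixed values ever reach the SQL text, never raw user input.
    if direction not in ("asc", "desc"):
        raise ValueError("direction must be 'asc' or 'desc'")
    op, order = (">", "ASC") if direction == "asc" else ("<", "DESC")
    if last_name is None:
        sql = f"SELECT name FROM users ORDER BY name {order} LIMIT ?"
        params = (limit,)
    else:
        sql = f"SELECT name FROM users WHERE name {op} ? ORDER BY name {order} LIMIT ?"
        params = (last_name, limit)
    return [row[0] for row in conn.execute(sql, params)]

asc_page_2 = names_page("asc", "AAC")    # ['AAD', 'AAE', 'AAF']
desc_page_2 = names_page("desc", "AAE")  # ['AAD', 'AAC', 'AAB']
```

If the user toggles the direction mid-way, the simplest option is to drop the cursor and start again from the first page in the new order.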

Ahmed Mohamed Abd El Ftah

nice tip thanks for sharing

Luke Carrier

How would you approach this problem if you're using UUIDs instead of numeric IDs? I think you'd need a separate sort order column whose value is derived from a sequence?

Abdisalan

Exactly, in order to benefit from this you need a field that you can order your data by.
A common example is using the created_at field along with your UUID to break ties.

There's a deeper explanation here if you'd like
use-the-index-luke.com/no-offset

yiannisdesp

Didn't give much thought to it until now! Thanks

major200322 • Edited

Thank you for this interesting post :)
Can I have the name of the tool you used for analysing the database?

Abdisalan

Thanks for reading! I didn't use a tool, but my Python script printed how long each page took to load and I put that into Google Sheets to make a chart. Here's all the code I used for this project!

github.com/Abdisalan/blog-code-exa...

webstuff

Interesting.. thank you

Abdisalan

Glad you enjoyed it 😁