DEV Community

Cover image for The best database pagination technique is ...
Matej Baฤo for Appwrite

Posted on

The best database pagination technique is ...

๐Ÿ‘‹ Introduction

The Database is one of the cornerstones of every application. It's where you store everything your app needs to remember, compute later, or display to other users online. It's all fun and games until the database grows and your application starts lagging because you were trying to fetch and render 1,000 posts at once. Well, you're a smart engineer right? You quickly patch that with a Show more button. A few weeks later, you're presented with a new Timeout error! You head to Stack Overflow but quickly realize that Ctrl and V have stopped working due to excessive usage ๐Ÿคฆ With no more options at your disposal, you actually start debugging and realize that the database returns over 50,000 posts each time a user opens your app! What do we do now ?

This is fine meme

To prevent these horrific scenarios, we should be aware of risks from the beginning because a well-prepared developer will never have to risk. This article will prepare you to fight database-related performance issues using offset and cursor pagination.

"An ounce of prevention is worth a pound of cure." - Benjamin Franklin

๐Ÿ“š What is pagination?

Pagination is a strategy employed when querying any dataset that holds more than just a few hundred records. Thanks to pagination, we can split our large dataset into chunks ( or pages ) that we can gradually fetch and display to the user, thus reducing the load on the database. Pagination also solves a lot of performance issues both on the client and server-side! Without pagination, you'd have to load the entire chat history only to read the latest message sent to you.

These days, pagination has almost become a necessity since every application is very likely to deal with large amounts of data. This data could be anything from user-generated content, content added by administrators or editors, or automatically generated audits and logs. As soon as your list grows to more than a few thousand items, your database will take too long to resolve each request and your front-end's speed and accessibility will suffer. As for your users, their experience will look something like this.

loading

Now that we know what pagination is, how do we actually use it? And why is it necessary?

๐Ÿ” Types of pagination

There are two pagination strategies that are widely used - offset and cursor. Before digging in-depth and learning everything about them, let's look at some websites using them.

First, let's visit GitHub's Stargazer page and notice how the tab says 5,000+ and not an absolute number? Also, instead of standard page numbers, they use Previous and Next buttons.

GitHub pagination

Now, let's switch to Amazon's products list and notice the exact amount of results 364, and standard pagination with all page numbers you can click through 1 2 3 ... 20.

Amazon pagination

It's very clear that two tech giants could not agree on which solution is better! Why? Well, we'll need to use an answer developers hate, Because it depends. Let's explore both methods to understand their advantages, limitations, and performance implications.

Offset pagination

Most websites use offset pagination because of its simplicity and how intuitive pagination is to users. To implement offset pagination, we will usually need two pieces of information:

  • limit - Number of rows to fetch from the database
  • offset - Number of rows to skip. Offset is like a page number, but with a bit of math around it (offset = (page-1) * limit)

To get the first page of our data, we set the limit to 10 (because we want 10 items on the page) and offset to 0 (because we want to start counting 10 items from the 0th item). As a result, we will get ten rows.

To get the second page, we keep the limit at 10 ( this doesn't change since we want every page to contain 10 rows ) and set offset to 10 ( return results from the 10th row onwards ). We continue this approach thereby allowing the end user to paginate through the results and see all of their content.

In the SQL world, such a query would be written as SELECT * FROM posts OFFSET 10 LIMIT 10.

Some websites implementing offset pagination also show the page number of the last page. How do they do it? Alongside results for each page, they also tend to return a sum attribute telling you how many rows there are in total. Using limit, sum and a bit of math, you can calculate last page number using lastPage = ceil(sum / limit)

As convenient as this feature is for the user, developers struggle to scale this type of pagination. Looking at sum attribute, we can already see that it can take quite some time to count all rows in a database to the exact number. Alongside that, the offset in databases is implemented in a way that loops through rows to know how many should be skipped. That means that the higher our offset is, the longer our database query will take.

Another downside of offset pagination is that it doesn't play well with real-time data or data that changes often. Offset says how many rows we want to skip but doesn't account for row deletion or new rows being created. Such an offset can result in showing duplicate data or some data missing.

Cursor pagination

Cursors are successors to offsets, as they solve all issues that offset pagination has - performance, missing data and data duplication because it does not rely on the relative ordering of the rows as in the case of offset pagination. Instead, it relies on an index created and managed by the database. To implement cursor pagination, we will need the following information:

  • limit - Same as before, amount of rows we want to show on one page
  • cursor - ID of a reference element in the list. This can be the first item if you're querying the previous page and the last item if querying the next page.
  • cursorDirection - If user clicked Next or Previous (after or before)

When requesting the first page, we don't need to provide anything, just the limit 10, saying how many rows we want to get. As a result, we get our ten rows.

To get the next page, we use the ID of the last row as the cursor and set cursorDirection to after.

Similarly, if we want to go to the previous page, we use the ID of the first row as cursor and set direction to before.

To compare, in the SQL world, we could write our query as SELECT * FROM posts WHERE id > 10 LIMIT 10 ORDER BY id DESC.

Queries that use a cursor instead of offset are more performant because the WHERE query helps skip unwanted rows, while OFFSET needs to iterate over them, resulting in a full-table scan. Skipping rows using WHERE can get even faster if you set up proper indexes on your IDs. The index is created by default in case of your primary key.

Not only that, you no longer need to worry about rows being inserted or deleted. If you were using an offset of 10, you would expect exactly 10 rows to be present ahead of your current page. If this condition were not met, your query will return inconsistent results leading to data duplication and even missing rows. This can happen if any of the rows ahead of your current page were deleted or new rows were added. Cursor pagination solves this by using the index of the last row you fetched and it knows exactly where to start looking from, when you request for more.

It's not all sunshine and rainbows. Cursor pagination is a really complex problem if you need to implement it on the backend on your own. To implement cursor pagination, you will need WHERE and ORDER BY clauses in your query. In addition, you will also need WHERE clauses to filter by your required conditions. This can get quite complex very quickly and you might end up with a huge nested query. Alongside that, you will also need to create indexes for all the columns you need to query.

Great! We got rid of duplicates and missing data by switching to cursor pagination! But we still have one problem left. Since you should not expose incremental numeric IDs to the user (for security reasons), you must now maintain a hashed version of each ID. Whenever you need to query a database, you convert this string ID to its numeric ID by looking at a table that holds these pairs. What if this row is missing? What if you click the Next button, take the last row's ID, and request the next page, but the database can't find the ID?

This is a really rare condition and only occurs if the row's ID that you are about to use as cursor has been just deleted. We can solve this problem by trying previous rows or re-fetching data of earlier requests to update the last row with a new ID, but all of that brings a whole new level of complexity, and the developer needs to understand a bunch of new concepts, such as recursion and proper state management. Thankfully, services such as Appwrite take care of that, so you can simply use cursor pagination as a feature.

๐Ÿš€ Pagination in Appwrite

Appwrite is open source backend-as-a-service that abstracts all the complexity involved in building a modern application by providing you with a set of REST APIs for your core backend needs. Appwrite handles user authentication and authorization, databases, file storage, cloud functions, webhooks and much more! If there is anything missing, you can extend Appwrite using your favourite backend language.

Appwrite Database lets you store any text-based data that needs to be shared across your users. Appwrite's database allows you to create multiple collections (tables) and store multiple documents (rows) in it. Each collection has attributes (columns) configured to give your dataset a proper schema. You can also configure indexes to make your search queries more performant. When reading your data, you can use a host of powerful queries, filter them, sort them, limit the number of results, and paginate over them. And all this comes out-of-the-box!

What makes Appwrite Database even better is Appwrite's pagination support, as we support both offset and cursor pagination! Let's imagine we have collection with ID articles, we can get documents from this collection with either offset or cursor pagination:

// Setup
import { Appwrite, Query } from "appwrite";
const sdk = new Appwrite();

sdk
    .setEndpoint('https://demo.appwrite.io/v1') // Your API Endpoint
    .setProject('articles-demo') // Your project ID
;

// Offset pagination
sdk.database.listDocuments(
    'articles', // Collection ID
    [ Query.equal('status', 'published') ], // Filters
    10, // Limit
    500, // Offset, amount of documents to skip
).then((response) => {
    console.log(response);
});

// Cursor pagination
sdk.database.listDocuments(
    'articles', // Collection ID
    [ Query.equal('status', 'published') ], // Filters
    10, // Limit
    undefined, // Not using offset
    '61d6eb2281fce3650c2c' // ID of document I want to paginate after
).then((response) => {
    console.log(response);
});
Enter fullscreen mode Exit fullscreen mode

First, we import Appwrite SDK library and setup an instance that connects to specific Appwrite instance and a specific project. Then, we list 10 documents using offset pagination while having a filter to only show those that are published. Right after, we write the exact same list documents query, but this time using cursor instead of offset pagination.

๐Ÿ“Š Benchmarks

We've used the word performance pretty often in this article without providing any actual numbers, so let's create a benchmark together! We will be using Appwrite as our backend server because it supports both offset and cursor pagination and Node.JS to write the benchmark scripts. After all, Javascript is pretty easy to follow along.

You can find complete source code as GitHub repository.

First, we set up Appwrite, register a user, create a project and create a collection called posts with collection-level permission and read permission set to role:all. To learn more about this process, visit the Appwrite docs. We should now have Appwrite ready to be used.

We can't do a benchmark yet, because our database is empty! Let's fill our tables with some data. We use the following script to load data into our MariadDB database and prepare for the benchmark.

const config = {};
// Don't forget to fill config variable with secret information

console.log("๐Ÿค– Connecting to database ...");

const connection = await mysql.createConnection({
    host: config.mariadbHost,
    port: config.mariadbPost,
    user: config.mariadbUser,
    password: config.mariadbPassword,
    database: `appwrite`,
});

const promises = [];

console.log("๐Ÿค– Database connection established");
console.log("๐Ÿค– Preparing database queries ...");

let index = 1;
for(let i = 0; i < 100; i++) {
    const queryValues = [];

    for(let l = 0; l < 10000; l++) {
        queryValues.push(`('id${index}', '[]', '[]')`);
        index++;
    }

    const query = `INSERT INTO _project_${config.projectId}_collection_posts (_uid, _read, _write) VALUES ${queryValues.join(", ")}`;
    promises.push(connection.execute(query));
}

console.log("๐Ÿค– Pushing data. Get ready, this will take quite some time ...");

await Promise.all(promises);

console.error(`๐ŸŒŸ Successfully finished`);
Enter fullscreen mode Exit fullscreen mode

We used two layers for loops to increase the speed of the script. First for loop creates query executions that need to be awaited, and the second loop creates a long query holding multiple insert requests. Ideally, we would want everything in one request, but that is impossible due to MySQL configuration, so we split it into 100 requests.

We have 1 million documents inserted in less than a minute, and we are ready to start our benchmarks. We will be using the k6 load testing library for this demo.

Let's benchmark the well-known and widely used offset pagination first. During each test scenario, we try to fetch a page with 10 documents, from different parts of our dataset. We will start with offset 0 and go all the way to an offset of 900k in increments of 100k. The benchmark is written in a way, that it makes only one request at a time to keep it as accurate as possible. We will also run the same benchmark ten times and measure average response times to ensure statistical significance. We'll be using k6's HTTP client to make requests to Appwrite's REST API.

// script_offset.sh

import http from 'k6/http';

// Before running, make sure to run setup.js
export const options = {
    iterations: 10,
    summaryTimeUnit: "ms",
    summaryTrendStats: ["avg"]
};

const config = JSON.parse(open("config.json"));

export default function () {
    http.get(`${config.endpoint}/database/collections/posts/documents?offset=${__ENV.OFFSET}&limit=10`, {
        headers: {
            'content-type': 'application/json',
            'X-Appwrite-Project': config.projectId
        }
    });
}
Enter fullscreen mode Exit fullscreen mode

To run the benchmark with different offset configurations and store output in CSV files, I created a simple bash script. This script executes k6 ten times, with a different offset configuration each time. The output will be provided as console output.

#!/bin/bash
# benchmark_offset.sh

k6 -e OFFSET=0 run script.js
k6 -e OFFSET=100000 run script.js
k6 -e OFFSET=200000 run script.js
k6 -e OFFSET=300000 run script.js
k6 -e OFFSET=400000 run script.js
k6 -e OFFSET=500000 run script.js
k6 -e OFFSET=600000 run script.js
k6 -e OFFSET=700000 run script.js
k6 -e OFFSET=800000 run script.js
k6 -e OFFSET=900000 run script.js
Enter fullscreen mode Exit fullscreen mode

Within a minute, all benchmarks have finished and provided me with the average response time for each offset configuration. The results were as expected but not satisfying at all.

Offset pagination (ms)
0% offset 3.73
10% offset 52.39
20% offset 96.83
30% offset 144.13
40% offset 216.06
50% offset 257.71
60% offset 313.06
70% offset 371.03
80% offset 424.63
90% offset 482.71

Graph with offset pagination

As we can see, offset 0 was pretty fast, responding in less than 4ms. Our first jump was to offset 100k, and the change was drastic, increasing response times to 52ms. With each increase in the offset, the duration went up, resulting in almost 500ms to get ten documents after an offset of 900k documents. That is crazy!

Now let's update our script to use cursor pagination. We will update our script to use a cursor instead of offset and update our bash script to provide a cursor (document ID) instead of an offset number.

// script_cursor.js
import http from 'k6/http';

// Before running, make sure to run setup.js
export const options = {
    iterations: 10,
    summaryTimeUnit: "ms",
    summaryTrendStats: ["avg"]
};

const config = JSON.parse(open("config.json"));

export default function () {
    http.get(`${config.endpoint}/database/collections/posts/documents?cursor=${__ENV.CURSOR}&cursorDirection=after&limit=10`, {
        headers: {
            'content-type': 'application/json',
            'X-Appwrite-Project': config.projectId
        }
    });
}
Enter fullscreen mode Exit fullscreen mode
#!/bin/bash
# benchmark_cursor.sh

k6 -e CURSOR=id1 run script_cursor.js
k6 -e CURSOR=id100000 run script_cursor.js
k6 -e CURSOR=id200000 run script_cursor.js
k6 -e CURSOR=id300000 run script_cursor.js
k6 -e CURSOR=id400000 run script_cursor.js
k6 -e CURSOR=id500000 run script_cursor.js
k6 -e CURSOR=id600000 run script_cursor.js
k6 -e CURSOR=id700000 run script_cursor.js
k6 -e CURSOR=id800000 run script_cursor.js
k6 -e CURSOR=id900000 run script_cursor.js
Enter fullscreen mode Exit fullscreen mode

After running the script, we could already tell that there was a performance boost as there was a noticeable differences in response times. We've put the results into a table to compare these two pagination methods side-by-side.

Offset pagination (ms) Cursor pagination (ms)
0% offset 3.73 6.27
10% offset 52.39 4.07
20% offset 96.83 5.15
30% offset 144.13 5.29
40% offset 216.06 6.65
50% offset 257.71 7.26
60% offset 313.06 4.61
70% offset 371.03 6.00
80% offset 424.63 5.60
90% offset 482.71 5.05

Graph with offset and cursor pagination

Wow! Cursor pagination rocks! The graph shows that cursor pagination does not care about offset size, and every query is as performant as the first or last one. Can you imagine how much harm can be done by loading the last page of a huge list repeatedly? ๐Ÿ˜ฌ

If you are interested in running tests on your own machine, you can find the complete source code as GitHub repository. The repository includes README.md explaining the whole process of installation and running scripts.

๐Ÿ‘จโ€๐ŸŽ“ Summary

Offset pagination offers a well-known pagination method where you can see page numbers and click through them. This intuitive method comes with a bunch of downsides, such as terrible performance with high offsets and a chance of data duplication and missing data.

Cursor pagination solves all of these problems and brings a reliable pagination system that is fast and can handle real-time (often changing) data. The downside of cursor pagination is not showing page numbers, its complexity to implement, and a new set of challenges to overcome, such as missing cursor ID.

Let's now get back to our original question, why does GitHub use cursor pagination, but Amazon decided to go with offset pagination? Performance is not always the key... User experience is much more valuable than how many servers your business has to pay for.

I believe Amazon decided to go with offset because it improves UX, but that is a topic for another research. We can already notice that if we visit amazon.com and search for a pen, it says there are exactly 10 000 results, but you can only visit the first seven pages (350 results).

First, there are way more than just 10k results, but Amazon limits it. Secondly, you can visit the first seven pages anyway. If you try to visit page 8, it shows a 404 error. As we can see, Amazon is aware of offset pagination's performance but still decided to keep it because their user-base prefers seeing page numbers. They had to include some limits, but who goes to page 100 of search results anyway. ? ๐Ÿคท

Do you know what is better than reading about pagination? Trying it out! I would encourage you to try both methods because it's best to get first-hand experience. Setting up Appwrite takes less than a few minutes, and you can start playing with both pagination methods. If you have any questions, you can also reach us on our Discord server.

๐Ÿ”— Resources

Discussion (5)

Collapse
thumbone profile image
Bernd Wechner

Thanks, a really interesting read and great links at end. I love what you're doing in that you're clearly writing an appwriter ad, but it's done in such a general, informative way, that serves as a real solid read, easing into it with a kind of while-we're-here appwriter reveal late in the piece and then actually helping me to understand better what appwriter is. I am not being cynical here nor sarcastic, but genuinely appreciate this excellent balance.

The quality of the writing in the first half is awesome (there is a missing Amazon pagination image though I think), much better than many amateur blog posts I read here, and I think that can be achieved because of the reality that it also serves to as an ad for appwriter and so appwriter can invest, pay someone with talent and possibly even a copy editor, and the so gently introduce the product in an informative context that it remains informative and interesting to boot.

I look forward to more.

Collapse
meldiron profile image
Matej Baฤo Author

I am happy to hear you enjoyed the article ๐Ÿ˜Š We focused on introducing problems and solutions in a generic way without using Appwrite-specific terms or tools, so anyone can educate about this topic.

Collapse
elreyes profile image
Fabian Reyes

Hi, a question, do u have any document with the new available queries to filter documents?

Collapse
eldadfux profile image
Eldad A. Fux

Check this section in the official docs: appwrite.io/docs/database#querying...

Collapse
pointy profile image
Mike McNally

Two things:

1: Indexes on all sortable columns are not at all necessary, unless the WHERE clause predicate (not including the cursor predicate) does not hit any indexes, which in my experience is pretty rare. A user looking through their own forum posts, for example, is going to result in a query involving the user ID, and there's almost certainly an index for that.

2: Exposing database keys (either serial primary keys or UUID primary keys) has never been considered a security issue in any source I've ever read. For practical terms for a real web application, the primary key is vital for constructing a form that will update a resource when posted from the client. From outside the application (like, from the wilds of the Internet) the primary key cannot be directly used for any purpose; it's a number or string with no semantic content. Such keys may not even be permanent, as a table in the application database may be updated or migrated in a way that generates new keys for old data. Because including primary keys in a cursor is necessary to disambiguate sort column values that are duplicated between rows (as with city names or people names), the actual value must used and not a hash because the database server does not have access to the hash: the primary key will be added implicitly as the least-significant sort key (after all user-requested or implicit application keys), so an ascending or descending column on a group of sorted-together duplicates is required for the cursor approach to work.