DEV Community

loading...
Cover image for Hassle-free cursor pagination for Ecto — part 1

Hassle-free cursor pagination for Ecto — part 1

Ivan Yurov
A Master in Computer Science student at EPFL interested in modern Web-dev and type-safe functional languages
・5 min read

TLDR; Cursor-based pagination is tricky. The distinct ordering of the query is necessary to make it deterministic, and the potential arbitrary complexity of order by clause makes it hard to provide a universal solution. Existing libraries for Ecto do not support complex expressions and rarely even work with joined tables. EctoCursor does all of that out of the box and without any configuration. You don’t even need to declare what fields would be used for the cursor, it is just derived from the ordering.

This is the first part of two articles, providing the motivation behind the implementation of EctoCursor. More specifically we talk about why this should work in general case without any configuration. In the next part we will talk about interesting parts of implementation: mostly digging in the internals of Ecto queries and manipulating them.

If you don't want to wait for the next part, feel free to take a look at the source code and test cases right away :)

Why cursor?

Offset-based pagination is slow. Query execution time grows proportionally to the number of rows you need to skip, because database engines scan those sequentially. Keyset- or cursor-based pagination, on the contrary, relies on concrete values in columns which unlocks the power of indexing. As a result, 1000th page loads as quickly as the first.

This approach works especially good for feeds and so called infinite scroll. Yet it does not work for explicit pagination, because it would impose prohibitive overhead on generating cursors for every page-link.

Why is it tricky?

First important constraint is making sure that order is distinct. Imagine a table like this:

id name timestamp
1 Post 1 1
2 Post 2 1
3 Post 3 2

Suppose we decided to order by timestamp without considering that it is not impossible that more than one row can have the same value (think of bulk loading, for example). Now if we request for the first page consisting of one row for simplicity, we get this tuple back:

[{1, "Post 1", 1}]

How we encode the cursor for subsequent queries? We just take that last value in the list and assume that the starting value in the next frame will have to be strictly larger, so:

timestamp > 1

And now we have a problem. Next row will be skipped because of this condition, and second page we receive will be:

[{3, "Post 3", 2}]

This is easy to fix though. We can just add a column that is guaranteed to be unique to disambiguate the query, like so:

SELECT * FROM posts ORDER BY timestamp, id;

Now the cursor that comes with the first page, will be composed of two values timestamp = 1 and id = 1. And the condition in the next query will have to tackle the case of equality:

SELECT * FROM posts
  WHERE timestamp > cursor_1
    OR timestamp = cursor_1 AND id > cursor_2
  ORDER BY timestamp, id;

Cursor components here are denoted with _N, obviously we need to make sure somehow that number of components and their types correspond to the ORDER BY clauses in query. We will talk about it later.

The case discussed above was extremely simple, in real life we deal with arbitrary complex expressions in ORDER BY clauses as well as multiple tables joins and groupings. I'm going to propose a solution of unknown degree of completeness, and I'm certainly open for corrections, additions or any other contributions that would help to prove or disprove completeness.

Dissecting the Query

This is a basic syntax of SQL Select query. We need to apply some transformations in order to be able to generate next cursor and apply existing cursor.

SELECT {select_exprs}
  FROM {source_tables}
  WHERE {where_exprs}
  GROUP BY {grouping_exprs}
  HAVING {having_exprs}
  ORDER BY {ordering_exprs}

Theoretically it is possible to do it in one request, but for the sake of simplicity we are going to separate these tasks, because expressions (fields, aggregates) used in ordering condition will not necessarily appear in select_exprs. So we should start from separating the task into two subtasks.

  1. Applying current cursor
  2. Generating next cursor

Applying cursor condition

This is relatively easy to do in the general case. Suppose we have N subexpressions in order by, we will call them ord_N for brevity. Then the cursor must have N components — cur_N respectively. Now all we need to do is to extend where_exprs blindly reusing parts of ordering_exprs and ignoring direction for now:

SELECT {select_exprs}
  FROM {source_tables}
  WHERE {
    where_exprs
    AND (
      ord_1 > cur_1
      OR ord_1 = cur_1 AND ord_2 > cur_2
      OR ord_1 = cur_1 AND ord_2 = cursor_2 AND ord_3 > cur_3
      ...etc
    )
  }
  -- GROUP BY {grouping_exprs}
  -- HAVING {having_exprs AND the above extensions}
  ORDER BY {ord_1 DIR, ord_2 DIR, ord_3 DIR, ...}
  LIMIT lim

This will return the original query result after the row that is cascade-matching the cursor. One important note: whenever the query has GROUP BY clause, the extended conditions will have to be added into HAVING and not where, due to possibility of having the aggregates in ordering.

Generating next cursor

Now we need to capture the values of order expressions in the last row of the request above. Again, theoretically it can be done in one request by just extending the select_exprs, but this might become tricky to implement when we dig into Ecto. So for now we just replace original select_exprs and make appropriate adjustments in LIMIT/OFFSET:

SELECT {ord_1, ord_2, ord_3, ...}
  FROM {source_tables}
  WHERE {
    where_exprs
    AND (
      ord_1 > cur_1
      OR ord_1 = cur_1 AND ord_2 > cur_2
      OR ord_1 = cur_1 AND ord_2 = cursor_2 AND ord_3 > cur_3
      ...etc
    )
  }
  -- GROUP BY {grouping_exprs}
  -- HAVING {having_exprs AND the above extensions}
  ORDER BY {ord_1 DIR, ord_2 DIR, ord_3 DIR, ...}
  LIMIT 1
  OFFSET lim - 1

This will return the cursor itself. All we need to do afterwards is just to serialize it to make it URL-friendly.

Validation of cursor

From the modifications we proposed above, it is clear that cursor absolutely must have the same number of components as the ordering expressions, and they have to have same types respectively. What if cursor is incompatible with the query? We simply will not be able to compose properly conditioned query. One of the easy ways to make sure is to hash the original query shape without arguments and sign both cursor and the query hash using some common MAC algorithm, this will let to reject the damaged cursor early. Reporting the error or just falling back to no cursor — should be up to the application developer.

In the next part we will dig into Ecto to try and implement this proposed approach. Again if you don't want to wait for the next part, feel free to take a look at the source code and test cases right away. I will always appreciate corrections, suggestions and any input here as well as in github issues.

Discussion (0)