In this post I'll show you several approaches for writing SQL queries that retrieve the latest row per group from the database, ranked by how performant I found them to be on my generated dataset. The main idea behind SQL (a declarative language) is that you don't have to provide precise instructions on how to retrieve data; you just express what you want in the form of a query, and the database component called the planner determines the best way to retrieve it. But as we shall see, different ways of expressing the same need can have quite an impact on performance.
To explain why the alternatives presented are slower or faster, I use visualisations of the Postgres explain plans, showing how the database processes the query internally. As you will see, some changes will make quite a difference - we will go from hundreds of ms to about 5ms.
To support the story we'll use the following scenario: we have a set of meters that regularly record readings and store them in the database. We have to retrieve the latest reading for every meter. I'm using Postgres version 16.2 (default configuration) on my laptop.
Setup
First, we create the table of meters.
create table meters
(
id bigserial,
-- other columns which I omitted for simplicity
primary key (id)
);
Then, we create the table of readings.
create table readings
(
id bigserial,
meter_id bigint,
date date,
reading double precision,
primary key (id),
constraint fk__readings_meters foreign key (meter_id) references meters (id)
);
For integrity, the tables are linked together with a foreign key constraint.
Let's now populate our tables with some rows: we add 500 meters, all having one reading every day, for a year. For generating test data like this, you can use the generate_series function:
insert into meters select * from generate_series(1, 500) seq;
insert into readings(meter_id, date, reading)
select m.id, seq, random() from generate_series('2024-02-01'::date, '2025-02-01'::date, '1 day'::interval) seq, meters m;
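Before moving on, it's worth sanity-checking what we just generated. A quick count query (a hypothetical check, not part of the original setup) should show 500 meters and, since both endpoints of the series are inclusive and 2024 contains a leap day, 367 readings per meter:

```sql
-- Sanity check: 500 meters x 367 daily readings
-- (2024-02-01 through 2025-02-01, both inclusive) = 183,500 rows
select count(*)                 as total_readings,
       count(distinct meter_id) as meters
from readings;
```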
Our tables are now populated and ready to be queried. Let's discuss the first approach.
1. Window functions: 250ms
If we use window functions to write this query, we first partition the readings by meter ID, then use the row_number function to assign an "order number" to every reading based on its recency among its peers within the partition. We then return only the rows at the top (row number equal to 1) of every partition. Here's the query:
explain(analyse, buffers)
with readings_with_rownums as (
select
meters.id as meter_id,
readings.reading as reading,
row_number() over (partition by meters.id order by readings.date desc) as rownum
from
readings
join
meters on meters.id = readings.meter_id
)
select
meter_id,
reading
from
readings_with_rownums
where
readings_with_rownums.rownum = 1;
This runs in about 250ms. It is not the worst, but this timing does mean that users will not perceive the response as instantaneous (under 100ms). Let's try to do better.
Time to have a look at the explain plan to understand how the database retrieves our data and get some clues. As you can see in the query above, I am using the buffers option; see the motivation for why that is helpful here. In terms of visualisation tools, there are several great alternatives. For the rest of this post I'm going to use the one from Dalibo.
I've opened the relevant explain plan nodes which I want to explore further; they are in the upper part of the image above. We can quickly discern why this is not performing very well: the query efficiency is low. It discards 99% of the rows it has read just before returning the final results. You can see that in the Sort node, where 183k rows arrive from the nodes below, but only 500 are actually returned in the final result set. Retrieving our results like this is also less scalable, being very sensitive to growth of the dataset.
Let's consider resource utilisation for a moment as well. When the database performs sorts like this over and over again (as in an application used by several users concurrently) on potentially very large tables, you will see a spike in CPU utilisation - sorting is a CPU-bound operation. If possible, we should preserve the CPU for situations where there is actually no alternative. Let's proceed then and look at other options.
2. DISTINCT ON: 250ms
For details about how this works, have a look at the DISTINCT clause section in the Postgres docs. I find this approach rather elegant, actually. Here's how it looks applied to our use-case:
explain (analyse, buffers)
select
distinct on (meters.id)
meters.id as meter_id,
readings.reading as reading
from
meters
join
readings on meters.id = readings.meter_id
order by
meters.id,
readings.date desc;
I didn't get a noticeable difference in runtime compared with the previous approach using window functions. The explain plan looks quite similar to the one we have seen before. Have a look.
We can indeed observe a difference in that it no longer has the WindowAgg node, but this didn't get us very far. It is still inefficient, reading a lot of rows and discarding the majority before returning the results. One thing I noticed in the text version of the explain plan, before visualising it, is the following detail, which is worth a closer look:
Sort Method: external merge Disk: 3968kB
This means the sort operation is being slowed down because it is forced to use the disk. This happens when the work_mem setting is too low for the size of the dataset: the sort can't be done fully in memory, so it has no choice but to "spill" the operation to disk. The default work_mem in Postgres is 4MB, which in this case proves to be too little.
Let's increase this setting to make sure the sort can be done in memory. But note that if you give it too much, you can run out of memory when load increases unexpectedly and start seeing errors - it's a balancing act.
set work_mem='16MB';
I don't have to restart Postgres for this to take effect. For other settings, we have to.
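If you want to check whether a given setting needs a restart, the pg_settings view exposes a context column that tells you. A small illustrative query (the exact setting values shown will depend on your configuration):

```sql
-- work_mem has context 'user': changeable per session, no restart needed.
-- Settings with context 'postmaster' (e.g. shared_buffers) require a restart.
select name, setting, context
from pg_settings
where name in ('work_mem', 'shared_buffers');
```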
We retry the query above, and we can confirm that the sort now happens in memory; the proof is the following detail in the explain plan:
Sort Method: quicksort Memory: 14951kB
Cool! Did this make a difference though? Not really - we're still at ~250ms response time. We still have other options though.
One thing to note at this point: this experiment is on my laptop, where the CPU, memory and storage are all on one machine. But when using Amazon RDS, for example, the storage relies on another service - EBS - which is not on the same machine as your database but separated from it by the network. In those scenarios, avoiding disk spills like the one I showed you above makes more of a difference, because the data has to "travel" more between memory and disk. As a side note, AWS recently introduced Optimized Reads for Amazon RDS for PostgreSQL (https://aws.amazon.com/blogs/database/introducing-optimized-reads-for-amazon-rds-for-postgresql/), where for these operations the database instance can use a fast local NVMe SSD instead of a network-attached EBS volume.
3. MAX(DATE): 115ms
Time to look at another approach: using SQL aggregate functions.
explain(analyse, buffers)
with latest_reads_per_meter as (
select
readings.meter_id as meter_id,
max(date) as reading_date
from
readings
group by
readings.meter_id
)
select
readings.meter_id,
readings.reading
from
meters
join
readings on meters.id = readings.meter_id
join
latest_reads_per_meter lrpm on lrpm.meter_id = meters.id
and readings.date = lrpm.reading_date;
As usual, the explain plan:
Hmm, this looks a bit different from what we have seen before - in a good way! Let's understand why. It now does the sort much earlier, and so it discards the irrelevant rows earlier instead of "carrying them over" all the way through the retrieval process. This consumes less memory because the intermediate results are smaller. But does it speed up our query?
Indeed it does! Have a look at this.
Execution Time: 114.942 ms
Finally, some solid progress. We cut the runtime we started with in half. It's now very close to the threshold where the response is perceived as instantaneous. But can we do better? You bet - by an order of magnitude, even.
4. Loose index scans: 14ms
Let's have a look at how the loose index scan works - by now you're probably wondering: when are you going to bring in the indexes?!
For the index definition, the order of columns matters. The columns have to be defined exactly in the order I'm about to show you: the "grouping" column first, then the column used for determining the "latest" row within a group.
create index idx on readings(meter_id, date);
We can then write the query like this.
explain (analyse, buffers)
select
meters.id as meter_id,
readings.reading as reading
from
meters
cross join lateral (
select
meter_id,
reading
from
readings
where
readings.meter_id = meters.id
order by
date desc
limit 1
) readings;
Execution Time: 13.814 ms
This is much faster. Let's see why. For starters, you don't see the 183k rows anywhere in the explain plan. We are not sorting anything anymore as part of the query, because the index keeps our data sorted, at the expense of some insertion overhead.
But let's push the envelope, since we're here anyway. Let's open the IO & Buffers tab of the index scan node in the above explain plan and have a look. Here it is:
Let's take a step back and consider how index-based retrieval works at a high level. As a first step, the B-tree index is traversed to determine which rows have to be retrieved to satisfy the query. After this step, Postgres still has to actually fetch those rows from somewhere else, namely the table (also called the heap).
As we can see from the explain plan, this amounts to 2000 blocks read, about 16MiB (2000 * 8 kibibytes). A block (or page) is the fundamental unit by which Postgres stores data; have a look here for details.
But wait a minute - let's put the 2000 blocks in some context: how many blocks does the readings table have in total on disk?
SELECT relpages FROM pg_class WHERE relname = 'readings';
relpages
----------
1350
It seems we read more blocks with this index-based approach than if we simply did a sequential scan - reading the entire table, sorting, and discarding, like we saw above with the window functions. This trade-off is something to watch out for: the index-based approach is faster, but needs more I/O (more blocks touched) to do it. It is good to know that on Amazon Aurora and other cloud databases, for example, you pay for every block that has to be retrieved from storage because it couldn't be found in memory. A nice cautionary tale about keeping I/O under control on Aurora would be this one. However, this situation in my generated dataset is a byproduct of how "narrow" my tables are (small number of columns) - it's an artificial setup after all. If there were more columns, the table would have many more pages, so the B-tree traversal would account for proportionally fewer blocks.
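You can compare the on-disk footprint of the table and the index yourself with a query like the following (the index name idx matches the one created above; the exact page counts will vary with your data):

```sql
-- On-disk size, in 8kB blocks, of the table versus the index
select relname, relpages
from pg_class
where relname in ('readings', 'idx');
```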
Let's see if we can reduce the number of blocks touched. You might wonder: can we avoid the extra step of reading from the table after the index? Exactly!
5. Loose index-only scans: 5ms
We can implement an index-only scan by using the include option when creating the same index as above, like so:
create index idx_with_including on readings(meter_id, date) include (reading);
Let's retry the query and look at the relevant tabs again.
Two important things to observe here. First, notice the Heap Fetches: 0, which indicates that it no longer goes to the heap to get the rows - they are already in the index, so it can get them from there. Secondly, the total number of blocks is now ~1500, 500 fewer than before. This confirms again that it indeed doesn't read anything from the heap.
It can happen that you try this out and don't see Heap Fetches: 0, and you might wonder why, given that you've set everything up correctly. This can happen when the visibility map is not up to date and Postgres has no option but to go to the heap for the visibility information about a row, because that is where it is stored. As the visibility map is kept up to date by the autovacuum process, it is important to regularly revisit its configuration settings to make sure it is able to keep up.
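If you hit this and want to rule out a stale visibility map, you can vacuum the table manually and check when autovacuum last ran on it - a quick diagnostic sketch:

```sql
-- Bring the visibility map up to date for this table
vacuum readings;

-- Check when the table was last vacuumed, manually or by autovacuum
select relname, last_vacuum, last_autovacuum
from pg_stat_user_tables
where relname = 'readings';
```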
Let's look at the final result. How fast did we get it?
Execution Time: 5.448 ms
This is very nice. Let's now understand a bit better how exactly the B-tree index is traversed to retrieve the data. Here's a diagram. In Postgres, we have three types of nodes in a B-tree: the root node, intermediate nodes (used only for traversal) and leaf nodes (holding pointers to tuples on the heap, plus any included data).
For each row returned in the final result, it does 3 page read operations: first the root page, then an intermediate/internal B-tree page, and lastly the leaf page, from where it collects the reading. I have marked these steps with 1, 2, 3 next to the arrows that represent descending down the tree. For our 500 meters, that adds up to roughly 500 * 3 = 1500 page reads, which matches the ~1500 blocks we saw earlier.
Note though that index-only scans are not a panacea and should be used judiciously: including columns in an index increases its overall size, which may increase the time needed to traverse it for other use-cases.
Conclusion
After quite the journey, we've finally arrived at the fastest solution - a loose index-only scan; meaning it's end of the line for this post. Let me know if you've tried it out and it improved the performance of your queries!
Note though that, as they say, there is no free lunch - every additional index gives the database more work to do on every insert, so you will have to decide if it's worth it. The results also depend on the data distribution: this approach is so fast because in our dataset we have many readings for every meter, but things might be different in your setup. As a last remark, if you kept the indexes added in the latter queries and retried the initial queries of the post, those would be faster too, though not as fast as the loose index scans.
References
The first time I heard about loose index scanning was from this answer on SO.
Here's some other great resources on the topic:
- Loose Index Scan aka Skip Scan in PostgreSQL
- Select the Most Recent Record (of Many Items) With PostgreSQL
- PERFORMANCE TUNING: MAX AND GROUP BY
There is also a Postgres wiki entry on the topic.
There's an ongoing discussion about implementing skip scans in Postgres.
Thanks for reading!