Ordering Django querysets via order_by causes Django to ask the database to order the rows by a given column. As long as the number of rows being ordered is not too large, this operation should be quick. However, there is a situation where using the built-in order_by can trigger a series of very expensive and slow operations: getting a random row from the database.
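For example, ordering by a regular column maps directly to an ORDER BY clause in the SQL. A minimal sketch, assuming a hypothetical created field on the Entry model used throughout this post:

# 'created' is a hypothetical field used for illustration.
# Django translates this into roughly:
#   SELECT * FROM entry ORDER BY created DESC
newest_first = Entry.objects.order_by('-created')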
Read one random row inefficiently
Django's built-in way to get a random record is via order_by('?')[0]. Note though that the Django docs themselves give this warning:
order_by('?') queries may be expensive and slow
The reason for the poor performance is that Django generates the following inefficient SQL:
SELECT * FROM entry ORDER BY RAND() LIMIT 1
Let's break that SQL down:
SELECT * FROM entry
This tells the database to include all columns from the entry table.
ORDER BY RAND()
This tells the database to generate a random number for every row (exactly how depends on the database being used; some do it differently).
LIMIT 1
This tells the database to scan the generated numbers for the smallest one. That row is then read.
Let that sink in: if we have 100,000 rows then 100,000 random numbers are first generated, then all 100,000 are scanned for the smallest one. Generating random numbers is slow, and scanning is not quick. For perspective, this whole operation smells a lot like mining Bitcoin, but ultimately all we get out of it is one row read from the database! A lot of work must be done for a small pay-off.
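To get a feel for that work, here is a rough Python analogy of what the database does under the hood (purely illustrative, not code Django runs):

import random

# Generate one random number per row, then scan them all for the
# smallest one: roughly what ORDER BY RAND() LIMIT 1 asks for.
rows = range(100_000)
winner = min(rows, key=lambda row: random.random())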
If there are a lot of rows in the database and order_by('?')[0] is used, expect bad performance.
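You can check the SQL Django will run by inspecting the queryset yourself. A small sketch, noting that the exact random function in the output depends on the database backend:

# Prints something like: SELECT ... FROM entry ORDER BY RAND() LIMIT 1
# (other backends use their own random function, e.g. RANDOM()).
print(Entry.objects.order_by('?')[:1].query)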
Read one random row efficiently
In short, databases are not good at random. Applications are though. So consider splitting responsibilities between the database and the application:
- At database level, determine the number of rows.
- At application level, generate a random number between 0 and that count minus one.
- At database level, select the row at that index.
This will mean two database reads are performed, but those are very efficient operations and so significantly quicker than the database copying the table and ordering all the rows randomly.
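As a minimal sketch of those three steps, assuming the same Entry model, before tidying the logic away into a manager below:

from random import randint

# 1. Database: count the rows.
count = Entry.objects.count()
# 2. Application: pick a random index between 0 and count - 1.
index = randint(0, count - 1)
# 3. Database: read only the row at that index (an OFFSET/LIMIT query).
random_entry = Entry.objects.all()[index]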
With that in mind, this is the advice Django Doctor gives when reviewing Pull Requests and it spots order_by('?')[0] being used:
Self-contained
Consider improving the layering of the solution by moving the logic to a custom model manager like so:
from random import randint

from django.db import models, transaction


class RandomManager(models.Manager):
    @transaction.atomic
    def random(self):
        # Count in the database, pick the index in the application,
        # then read just that one row from the database.
        index = randint(0, self.count() - 1)
        return self.all()[index]


class Entry(models.Model):
    objects = RandomManager()


random_item = Entry.objects.random()
Notice the use of transaction.atomic to make both the count and the row read happen in the same transaction, preventing a rare race condition that could occur if:
- the last row in the table was deleted after the count was read but before the random row was read.
- the random number generated was the index of the last row in the table.
Without an atomic transaction, under very rare circumstances an IndexError could occur.
Does your Django codebase have tech debt like this?
Django Doctor can find and fix over 40 types of common Django tech debt, and can even review your GitHub Pull Requests for you.