DEV Community

Django Doctor


The Django feature that can slow your website to a crawl

Ordering Django querysets via order_by causes Django to ask the database to order the rows by a given column.

As long as the number of rows being ordered is not too large, this operation should be quick. However, there is a situation where using the built-in order_by can trigger a series of very expensive and slow operations: getting a random row from the database.

Read one random row inefficiently

Django's built-in way to get a random record is via order_by('?')[0]. Note though that the Django docs themselves give this warning:

order_by('?') queries may be expensive and slow

The reason for the poor performance is the fact Django generates the following inefficient SQL:

SELECT * FROM entry ORDER BY RAND() LIMIT 1

Let's break that SQL down:

SELECT * FROM entry

This tells the database to include all columns from the entry table.

ORDER BY RAND()

This tells the database to generate a random number for every row. (The exact function varies by database: MySQL uses RAND(), while PostgreSQL and SQLite use RANDOM().)

LIMIT 1

This tells the database to scan the generated numbers for the smallest one. That row is then read.

Let that sink in: if we have 100,000 rows then 100,000 random numbers are first generated, then all 100,000 are scanned for the smallest one. Generating random numbers is slow, and scanning is not quick. For perspective, this whole operation smells a lot like mining Bitcoin, but ultimately all we get out of it is one row read from the database! A lot of work must be done for a small pay-off.
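To make the cost concrete, here is a small pure-Python sketch (not Django code, and the names are our own) of what ORDER BY RAND() LIMIT 1 forces the database to do: one random number per row, then a scan over every key to find the smallest.

```python
import random


def slow_random_row(rows):
    """Simulates ORDER BY RAND() LIMIT 1: a random sort key is
    generated for EVERY row, then all keys are scanned to find
    the smallest one. The work grows with the size of the table."""
    keyed = [(random.random(), row) for row in rows]   # one RNG call per row
    return min(keyed, key=lambda pair: pair[0])[1]     # full scan of all keys


rows = list(range(100_000))        # stand-in for a 100,000-row table
random_row = slow_random_row(rows)  # O(n) work just to read one row
```

Every element of `rows` is touched twice: once to generate its key and once during the scan — which is exactly why the cost balloons as the table grows.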

If there are a lot of rows in the database and order_by('?')[0] is used then expect bad performance.

Read one random row efficiently

In short, databases are not good at random. Applications are, though. So consider splitting the responsibilities between the database and the application:

  • At database level, determine the number of rows.
  • At application level, pick a random number between 0 and that count minus one.
  • At database level, select the row at that index.

This means two database reads are performed, but both are very efficient operations, so the whole thing is significantly quicker than having the database assign random numbers to every row and sort them all.
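The three steps above can be sketched in plain Python, with a list standing in for the table (the function name is ours; in Django the count and the indexed read each become one cheap query):

```python
import random


def fast_random_row(rows):
    """Count, pick an index in the application, read one row.
    Mirrors Entry.objects.count() followed by a LIMIT 1 OFFSET n query."""
    count = len(rows)                     # database: SELECT COUNT(*)
    index = random.randint(0, count - 1)  # application: random index
    return rows[index]                    # database: LIMIT 1 OFFSET index
```

No per-row work happens anywhere: the random number is generated once, in the application.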

That is why, when Django Doctor reviews Pull Requests and spots order_by('?')[0] being used, it suggests the more efficient approach above.

Self-contained

Consider improving the layering of the solution by moving the logic to a custom model manager like so:

from random import randint

from django.db import models, transaction


class RandomManager(models.Manager):

    @transaction.atomic
    def random(self):
        # Count the rows, pick a random index in the application,
        # then read just the row at that offset.
        index = randint(0, self.count() - 1)
        return self.all()[index]


class Entry(models.Model):
    objects = RandomManager()


random_item = Entry.objects.random()

Notice the use of transaction.atomic to make both the count and the row read happen in the same transaction, preventing a rare race condition that could occur if:

  • the last row in the table was deleted after the count was read but before the random row was read, and
  • the random number generated pointed at that last row.

Without an atomic transaction, under those very rare circumstances an IndexError could occur.
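For reference, the two queries the manager issues look roughly like this (the OFFSET value of 42 is just an illustrative random index; the exact SQL Django emits may differ slightly by backend):

```sql
SELECT COUNT(*) FROM entry;
-- the application then picks e.g. 42 via randint(0, count - 1)
SELECT * FROM entry LIMIT 1 OFFSET 42;
```

Neither query asks the database to generate or sort random numbers, which is the whole point.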

Does your Django codebase have tech debt like this?

Django Doctor can find and fix over 40 types of common Django tech debt, and can even review your GitHub Pull Requests for you.
