This article showcases optimization techniques that had great results at GitGuardian. You’ll find basic and advanced optimization techniques using Django and PostgreSQL, yet most of the principles we discuss are general enough to be applied to other frameworks or relational databases. However, a basic understanding of database transactions and how ORMs work is necessary to get the most out of it.
But don’t believe me, test them yourself! We provide a fully functional Django Playground where you can run all the queries shown in this article.
Here it is: https://github.com/GitGuardian/blog-playground
Optimization is a state of mind
Database theories provide a lot of good practices that we should always follow… at first. Because sometimes, as the usage of your app grows, those practices may reveal insufficient or may not perform as expected.
Always remember, there is no magic recipe to get the best out of your database in all cases. As you’ll read later in this article, some tips will be super effective in some cases and not perform well in others. As your application keeps living (and hopefully growing), you may get from one to another, reverting to the optimization that made you proud a few months ago. Don’t see that as a failure; it was helpful for a time. What’s important here is to continue to monitor, test, measure, and improve.
So, the first and most important tip is that optimization is not a one-off task. It’s a never-ending process.
A good method to iterate fast
Django ORM is a super handy tool to persist and query your business concepts using pythonic code instead of SQL. But, you need to see the generated SQL to understand how your code will execute. Many tools provide this ability, but as you search for the best optimization, the closer you stay to your code, the better it is. Testing queries directly in Django Shell (or a Notebook) may finally be the fastest way to iterate and find the best solution.
Did you know that Django allows you to display all queries done by a code block?
Let’s consider the following code:
from books.models import Person
from django.db import connection, reset_queries
reset_queries()
qs = Person.objects.only("id")
person = qs.first()
print("SQL Query", qs[:10].query)
print("PostgreSQL query: ", connection.queries[0]) # needs DEBUG=True
print("pg explain analyze:", qs[:10].explain(ANALYZE=True))
And its output:
SQL Query SELECT "books_person"."id" FROM "books_person" LIMIT 10
PostgreSQL query: {'sql': 'SELECT "books_person"."id" FROM "books_person" ORDER BY "books_person"."id" ASC LIMIT 1', 'time': '0.000'}
pg explain analyze: Limit (cost=0.00..0.20 rows=10 width=8) (actual time=0.013..0.016 rows=10 loops=1)
-> Seq Scan on books_person (cost=0.00..20251.18 rows=1000018 width=8) (actual time=0.012..0.014 rows=10 loops=1)
Planning Time: 0.100 ms
Execution Time: 0.034 ms
Django gives you access to the SQL query related to a Queryset, but you can also list queries executed at the connection level, and for the lucky PostgreSQL users, you can even get the executed plan for the query, and the time of execution. That’s an excellent starting point for comparing ideas on how to optimize a query.
Be aware that the production is not like your local machine
In the previous example, we extracted the execution plan executed by PostgreSQL. But if you compare the same code executed on your local machine and your production, you may have very different results.
Why? As your tables grow, PostgreSQL optimizes them(or not!), and—depending on the database cluster’s many settings—the planner may make different decisions. For example, if a table is small, the cost of using an index may not be smaller than doing a full scan. So on your local machine with three users in your database, you can get different results than in your production instance with millions (let’s think great!).
On your local machine, you should only be concerned by the number of queries you do and how big/complex they are. Leave the in-depth execution plan analysis for queries executed on the prod (or a prod-like) environment. Even there, be careful about your tables’ statistics. If the tables change often, statistics may not be up to date and can lead to poor decisions from the planner (you can activate auto vacuum to ensure your stats are always updated). For instance, with wrong statistics, the planner may “think” that your table is small and use a full scan, as in fact, the table is full of new rows, and using the index is the right choice.
Select only what you need
Now, let’s talk about some basic good practices that should always be followed. Running a query from your app can be broken down into 3 main tasks:
- Sending the query to the database
- Executing the query
- Sending back the results
We often focus on the second part which can be memory and CPU bound, depending on the query and data, but the first and third parts, which are more network bound, are important too.
You may argue that the query will always be small… and maybe you are too confident about how ORMs work. You may also have forgotten the time you filtered using a list:
Person.objects.filter(email__in=all_population_of_north_america)
In this example, the generated SQL itself can be huge. Even if your table is small and no row matches the query, the time to send the query will be very long. In the worst case, it could even raise an error.
Most of the time though, the result is often a lot bigger than the query. So if you only need to work on a small set of fields, don’t hesitate to use only()
or defer()
Person.objects.filter(name='tolstoy').defer(bio) # will not retrieve Tolstoy's bio
Person.objects.filter(name='tolstoy').only("name", "email") # will only retrieve name and email
In the second example, be aware that if you want to access a field that is not listed in only()
, Django will execute another query to retrieve it. If you do that in a loop, that’s many more queries. Dumping connection.queries
can be useful to identify such cases and decide whether you should add a field to the only()
clause.
After receiving the data, Django will instantiate models for each row returned. If you want to get some field values, use values()
or values_list()
to save a lot of time.
To be complete on the subject, and if you plan to serialize the result of your query, be aware that the more you retrieve, the more you will have to transform (and maybe send to some client). Pagination, when applicable, will make your queries run faster, and will also significantly reduce the time to process the result of the query and to send it to the client.
Index what you’re searching for
In databases, indexes use optimized data structures to avoid looking for every row of a table when searching for those matching a certain condition. Although indexes can significantly increase the size of your database, they are a cornerstone database performance, saving a lot of time and CPU.
That’s a well-known SQL good practice to index columns that are used in WHERE
, ORDER BY
, and GROUP BY
clauses, and you can easily find where you should add indexes by looking at the costly “Seq scans” (aka full scans) in an execution plan. As a minimum, always index your primary keys, foreign keys (Django does it automatically), and all the fields you often use in filtering.
That’s not all. When creating an index, you should carefully choose its index type; otherwise, it could be less effective than expected, or worse PostgreSQL may not even use it! For example, HASH
indexes can only operate on a =
comparison and do not affect an ORDER BY
, while B-TREE
indexes will also be used on LIKE
, inequalities, and sorting operations. GIN index will operate on complex data types like array or jsonb…
Finally, be careful: functions can sabotage your indexes! Let’s take an example:
Let’s say we want to search Persons by name. At first, we naively think we’ll be fine indexing the name field…
class Person(models.Model):
email = models.CharField(max_length=255)
name = models.CharField(max_length=255, index=True)
But now imagine we want to do a case-insensitive search with this filter:
Person.objects.filter(name__iexact='tolstoy').only('email')
Which translates to:
SELECT email FROM books_person WHERE UPPER(name) = UPPER('tolstoy');
Execution Time: 4740.735 ms
In this case, our regular index is useless because we are using the UPPER function on the index! Instead, we should define the index as:
class Person(models.Model):
email = models.CharField(max_length=255)
name = models.CharField(max_length=255)
class Meta:
indexes = [
Index(
Upper('name'),
name='books_person_name_upper_idx',
),
]
Or in SQL:
CREATE INDEX users_user_name_upper_idx on books_person (UPPER(name));
Execution Time: 470.121 ms
Select_related and prefetch_related are not always the best match
When working on a Django Model, you often need to access its relations. To demonstrate that, we create a second model, Book, which is related to a Person through an author's foreign key and a readers many to many relation (a person can read many books, a book can be read by many persons. I hope it will be the case for this blog post).
Class Book(models.Model):
title = models.CharField(max_length=256)
author = models.ForeignKey(Person, on_delete=models.CASCADE, related_name="writings")
readers = models.ManyToManyField(Person, related_name="readings")
If you have to iterate through Books and access information about their author, you could simply do:
for book in Book.objects.all():
author = book.author
This is generally a bad idea as it will generate N+1 queries for N books. Instead, you should rather use select_related
in order to minimize queries:
SELECT "books_book"."id", "books_book"."title", "books_book"."author_id", "books_person"."id", "books_person"."email", "books_person"."name", "books_person"."bio" FROM "books_book" INNER JOIN "books_person" ON ("books_book"."author_id" = "books_person"."id")
prefetch_related
is also a viable option, for OneToMany or ManyToMany relations:
Person.objects.prefetch_related("writings").all()
In this case, only two queries are generated:
SELECT "books_person"."id", "books_person"."email", "books_person"."name", "books_person"."bio" FROM "books_person"
'SELECT "books_book"."id", "books_book"."title", "books_book"."author_id" FROM "books_book" WHERE "books_book"."author_id" IN ( …)
Query duration: 1.0s
Total time: 8.12s
But note that in the second query, the IN
clause contains the full list of all Person.id
selected in the first query, so the query can be huge. If you implement this solution on a big dataset, you may find performances very disappointing.
result = {}
for user in User.objects.prefetch_related("writings")[:100_000]:
result[user.email] = [book.title for book in user.writings.all()]
print("query duration", sum(float(query["time"]) for query in connection.queries))
Query duration 0.9s
The output of this code should tell you that the total query duration is small compared to the total time of execution. There are mainly 2 reasons for that:
- Django must fetch the complete list of users in order to get the necessary
user.id
to build the second query - Each time you access a user, Django will need to look for matching books in the second query
Django gives you a way to optimize the second point:
“When using
to_attr
the prefetched result is stored in a list. This can provide a significant speed improvement over traditional prefetch_related calls which store the cached result within a QuerySet instance”
from django.db.models import Prefetch
result = {
user.email: [book.title for book in user.prefetched_writings]
for user in User.objects.prefetch_related(
Prefetch("writings", to_attr="prefetched_writings")
)
}
But, if you need only to access some fields of the relation, the performances can be significantly better by using aggregation instead of prefetch_related
:
from django.contrib.postgres.aggregates import ArrayAgg
result = {
user.email: user.writings_titles
for user in User.objects.annotate(
writings_titles=ArrayAgg("writings__title")
)
}
Which will generate only one query and Django’s work will be reduced to the minimum:
SELECT "users_user"."id", "users_user"."email", "users_user"."name", ARRAY_AGG("books_book"."title" ) AS "writings_titles" FROM "users_user" LEFT OUTER JOIN "books_book" ON ("users_user"."id" = "books_book"."author_id") GROUP BY "users_user"."id"
Query duration 0.6s
Total time: 1.65s
Limiting back & forths between your database and your Django backend is generally a good idea as your relational database will perform better than Django to join and aggregate data from several tables. You will save some network traffic and memory because there is no need to fetch and keep intermediate data for prefetch_related
.
Still, want the extra mile? As in the previous example, we want emails and titles, meaning we don’t need to wake up the full Models. Here value_list
is just what we need, allowing us to work directly on Python objects
result = {
user.email: user.writings_titles
for user in User.objects.annotate(
writings_titles=ArrayAgg("writings__title")
).values_list("email", "writings_titles", named=True)
}
Query duration 0.25s
Total time: 0.55s
Aggregations vs Subqueries
In the previous example, we saw that we could rely on the power of PostgreSQL to save our life and RAM (select what’s more important for you). But if there is something to get from this article, it is that what works well in some cases may not perform in another. Always keep an eye on how queries are performing and stay open-minded.
Let’s take the case of subqueries. They are often seen as bad practice as database planners are not able to generate a single optimized execution plan from them. But in some cases, they can give surprisingly good results.
In our example, if we want to get all books published by our authors as well as their total count of readers, we will end with this kind of query:
writers_stats_qs = Person.objects.annotate(
writings_title=ArrayAgg("writings__title"),
readers_count=Count("writings__readers"),
).values_list("name", "bio", "writings_title", "readers_count")
Or in SQL:
SELECT "books_person"."email", ARRAY_AGG("books_book"."title" ) AS "writings_title", COUNT("books_book_readers"."person_id") AS "readers_count" FROM "books_person" LEFT OUTER JOIN "books_book" ON ("books_person"."id" = "books_book"."author_id") LEFT OUTER JOIN "books_book_readers" ON ("books_book"."id" = "books_book_readers"."book_id") GROUP BY "books_person"."id"
Query duration 68.92s
At some point, when a query needs to JOIN
many tables or perform GROUP BY
, the database will need a lot of memory to create all temporary structures that will allow resolving the query. It will also need to load all needed indexes, check for unicity on all temporary data, etc… So sometimes, using subqueries is worth a try, especially if they allow replacing a huge complex query with smaller ones, a lot easier to solve.
writings_subquery = Book.objects.filter(author_id=OuterRef("id")).values(
"title"
)
readers_subquery = (
Book.objects.filter(author_id=OuterRef("id"))
.values("author_id")
.values(count=Count("readers__id"))[:1]
)
writers_stats_qs = Person.objects.annotate(
writings_title=ArraySubquery(writings_subquery), readers_count=readers_subquery
).values_list("name", "bio", "writings_title", "readers_count")
This query will generate the SQL:
SELECT "books_person"."name", "books_person"."bio", ARRAY(SELECT U0."title" FROM "books_book" U0 WHERE U0."author_id" = ("books_person"."id")) AS "writings_title", (SELECT COUNT(U2."person_id") AS "count" FROM "books_book" U0 LEFT OUTER JOIN "books_book_readers" U2 ON (U0."id" = U2."book_id") WHERE U0."author_id" = ("books_person"."id") GROUP BY U0."author_id" LIMIT 1) AS "readers_count" FROM "books_person"
Query duration 18.5s
Big writes, but not too big
Until now, we talked about querying the database but write operations may also lead to performance issues when done on a large scale.
In Django, bulk_create()
, bulk_update()
, and Queryset.update()
allow to pack writes in grouped queries. This is far more efficient than iterating through a list of objects and saving each of them individually.
If you are working on a large dataset, it goes without saying that many issues can happen. First, the query size can be problematic. As mentioned earlier, a huge query will consume a lot of memory in your Django process and isn’t optimal on the database side, either. For that reason, bulk methods allow passing a batch_size
to break the big query into smaller bits, meaning that many small batches of this size will be executed one after another.
Then, another possible bottleneck is the transaction. Transactions are great to ensure that the data stays consistent. But when inserting or updating a lot of data, the database will have to maintain all the transaction changes in a specific space, and may also lock tables to ensure no race condition is happening during the transaction. This is good for consistency but bad for concurrent queries waiting for the transaction to release the locks.
Therefore if atomicity is not mandatory for the whole data you’re writing, it could be a good idea to break your bulk writes into smaller transactions.
from itertools import grouper
from django.db import transaction
CHUNK_SIZE = 500
for books_chunk in grouper(big_list_of_books, CHUNK_SIZE):
with transaction.atomic():
Book.objects.bulk_create(books_chunk)
In this example, books are created in chunks of 500, and each bulk_create()
is isolated in a transaction.
Save your RAM
We often think optimization has to do with CPU usage. But most of the time, optimization is more memory bound. Databases are more efficient when they can resolve a query without needing to read or write to the disk. The cost of accessing big data structures is important in Python too, even in memory. That’s why we should avoid, as much as possible, creating lists from querysets. We have a much more memory-efficient option: iterators. The good news is that you can transform one into the other almost for free:
def iter_over_persons():
for person in Person.objects.all():
pass
RAM consumption: 1770.63671875 MB
def iter_over_persons_with_iterator():
for person in Person.objects.iterator():
pass
RAM consumption: 114.70703125 MB
Note that using an iterator with prefetching wasn’t possible before Django 4.1. Since this version, the documentation states:
If you use
iterator()
to run the query,prefetch_related()
calls will only be observed if a value forchunk_size
is provided.
Monitor your performances
As we repeated throughout this article, optimization is a long game. This is not the first thing you should think about when coding a new feature —there is even the well-known mantra: _avoid premature optimization—_but rather a thing you should prepare. Observability is definitely a must-have to detect bottlenecks, timeouts, locks, and all those nasty problems that are inseparable from the success of your great product.
We have seen how to optimize at the query level, but you first need to identify queries you should work on.
At GitGuardian, we use Silk profiler on our development environments, and we use a Database APM scrapping PostgreSQL slow queries to identify bad performers. We also monitor database locks and our instances’ vitals (CPU, memory, io, …)
With all this tooling, we can quickly spot issues and work on them. Django queries’ optimization is only one way to resolve those issues. We sometimes change our models’ architectures or add asynchronous processes to compute some values, like we did to manage permissions at scale. This is another story, yet the state of mind is the same: never stop improving performances, never stop improving user experience.
To sum up our tips
- Keep an open mind; optimization is a long game, and what is good one day may become bad a few months later
- Create an environment where you can experiment easily and get immediate feedback
- Be aware of the differences between your production and your test playground
- Be careful of the amount of data you send or get from the database
- Create indexes, and make sure they are used the way they are supposed to
- Minimize the count of queries your code will generate and be aware of the overhead of processing on the Django side...
- ...but sometimes many subqueries can perform better than a single complex query
- Find the best chunk size for your writes to get short and efficient transactions
- Measure RAM consumption on the Django side for a better scaling
- Use monitoring to detect performance bottlenecks at the earliest
Top comments (0)