Andrew Tate for Scout APM

Posted on Jul 28, 2023 • Updated on Oct 18, 2023

Identifying and Troubleshooting Inefficiencies in Django ORM

#django #apm #python #development

The Python web framework Django is an excellent choice for building performant web apps. Disqus, Pinterest, and Instagram are all built on a Django foundation. Django offers simplicity, flexibility, reliability, scalability, and all the tools you need to build a web app from admin to authentication to ORM.

But this “batteries included” approach has disadvantages. It can hide a lot of the true complexity of an app from developers. Django ORM is a great example. Django ORM abstracts away the complexity of database querying from the developer but in doing so, buries some of the problems that might be slowing your application down or causing other performance issues.

Identifying and troubleshooting these inefficiencies then becomes an important part of the Django workflow. Here we want to take you through how these problems can manifest, how you can find them, and how you can start fixing them.

The Django ORM and its problems

Django ORM, or Object-Relational Mapping, is a feature of Django that allows you to interact with your database, like MySQL, PostgreSQL, or SQLite, in the same way you would interact with Python objects.

The ORM allows you to create a Python class in Django, known as a model, which is then translated into a database table. Instances of these models represent rows in the table. Django ORM provides an abstraction over SQL, meaning that you can perform Create, Read, Update, and Delete (CRUD) operations using Python code rather than writing raw SQL queries.

For example, if you have a model called Article in Django and you want to retrieve all articles from the database, you would write Article.objects.all().

You define models like this:

from django.db import models

class Author(models.Model):
    name = models.CharField(max_length=100)

class Article(models.Model):
    title = models.CharField(max_length=100)
    content = models.TextField()
    published_date = models.DateTimeField()
    author = models.ForeignKey(Author, on_delete=models.CASCADE)

Then you can use this model to perform CRUD operations, like creating records:

# Create an Author
author = Author.objects.create(name="John Doe")

# Create an Article
article = Article.objects.create(
    title="Django ORM Tutorial",
    content="This is a tutorial on Django ORM...",
    published_date=datetime.datetime.now(),
    author=author,
)

You can then retrieve your records, either by fetching them all, or using filters to select just the records you want:

# Get all articles
articles = Article.objects.all()

# Get a single article by id
article = Article.objects.get(id=1)

# Get all articles written by a specific author
johns_articles = Article.objects.filter(author__name="John Doe")

Django ORM will translate this into the appropriate SQL query behind the scenes. For instance, the Article.objects.get(id=1) query will become:

SELECT * FROM article WHERE id = 1 LIMIT 1;
A primary advantage of Django's ORM is that it is database-agnostic. You can switch your database backend with minimal changes to your code since Django ORM takes care of translating your Python code into the appropriate SQL for your database.

This is obviously hugely convenient. A developer doesn’t have to hand code SQL queries for all their tables–they don’t even need to create tables. It’s all provided within the language of Python.

But with great power comes great responsibility. The Article.objects.filter(author__name="John Doe") query above can become:

SELECT * FROM article INNER JOIN author ON article.author_id = author.id WHERE author.name = 'John Doe';

Joins can be tricky in SQL. If you haven’t indexed your tables correctly or you just have extremely large tables, joins are going to be slow (as you have to traverse two tables). The problem isn’t that Django ORM is using joins–joins are an integral part of database querying–it is that Django ORM abstracts away your joins so you, as a developer, don’t even know you are using them. When queries start slowing down, this abstraction makes troubleshooting more complicated.

Inefficient queries can have several negative impacts on the performance of a Django application:

Increased Server Load: If a particular query is inefficient and it's being run multiple times, it can put an unnecessary load on your database server, potentially slowing down other processes or even causing crashes in extreme cases.
Increased Load Times: Inefficient queries take longer to execute, which can slow down the response time of your web pages, resulting in a poor user experience.
Data Overload: Inefficient queries might fetch more data than necessary (over-fetching), leading to higher memory usage and slower processing times.

This is where application performance monitoring tools can help. These tools can look for performance issues within your application and tie them back to not just the code, but the underlying queries that were actually performed. Thus, you get to see your actual queries and performance and tweak your ORM code as necessary.

Here we’ll use Scout APM to find three common ORM problems–N+1 queries, slow queries, and over-fetching–and then how they might be fixed.

Finding N+1 queries that increase server load

The "N+1" query problem occurs when you load a parent object and then load its related child objects in separate queries. For Django, this often occurs when accessing related fields in a loop, causing Django to execute one query to fetch the parent objects, and then additional queries for each parent to fetch its child objects (hence "N+1"). This results in many unnecessary database hits and can significantly slow down an application.

Here's an example of how the N+1 query problem might occur in Django:

Assume you have a blog application with two models, Blog and Entry. A blog can have multiple entries.

class Blog(models.Model):
    name = models.CharField(max_length=100)

class Entry(models.Model):
    blog = models.ForeignKey(Blog, on_delete=models.CASCADE)
    headline = models.CharField(max_length=255)

Now, suppose you want to display a list of all entries and the name of the blog to which each entry belongs. You might write a view like this:

def entries(request):
    entries = Entry.objects.all()
    return render(request, 'entries.html', {'entries': entries})

And then in your template (entries.html), you might have something like this:

{% for entry in entries %}
  <p>{{ entry.headline }} - {{ entry.blog.name }}</p>
{% endfor %}

In this scenario, Django's ORM will first execute a query to fetch all entries. Then, for each entry, when entry.blog.name is accessed, a new query is made to fetch the related Blog object. So if you have N entries, you'll end up making N+1 queries (1 to fetch all entries, plus N to fetch the related blog for each entry). This is the N+1 query problem.

This won’t necessarily be obvious with Django ORM–you might think you are just making that single query and retrieving the blog.name for every entry, but that isn’t what’s happening under the hood.

In Scout APM, N+1 query issues will be highlighted for you in your dashboard:

In this example, we can see that the app is making 200 queries within this view. We can then dig down into the endpoint to find out more about the specific query:

Here, we can see all the sequential calls and the underlying SQL query causing the problem. Likely the core_author.name field is a foreign key from another Authors or Person table that we’re calling again and again, 199 times (plus the 1 call to the core_author table).

To fix this, Django provides methods like select_related and prefetch_related that you can use to fetch related objects in the same query.

select_related(): This is a performance booster that results in a single, more complex query using a SQL join. It's useful for fetching "one-to-many" and "one-to-one" related objects in a single query when you know you will need the related objects for each of the primary objects you're retrieving. For instance, in a blog application where an Entry model has a foreign key to Blog, Entry.objects.select_related('blog') will cache the related Blog objects, avoiding additional database queries when you access entry.blog.
prefetch_related(): This is similar to select_related, but it works on "many-to-many" and "many-to-one" relationships and does a separate lookup for each relationship, which can result in more queries than select_related, but less overall time if the related table is large. For example, if you have a Blog model that has a many-to-many relationship with the Tag model, Blog.objects.prefetch_related('tags') will fetch all the related Tag objects for each Blog in one go.

In this example, you could modify the view with something like this:

def authors(request):
    authors = Core_Authors.objects.select_related(‘name’).all()
    return render(request, authors.html', {‘authors’: authors})

This will result in a single query with a SQL join to fetch both core authors and their related names, thereby eliminating the N+1 problem.

Finding slow queries that increased load time

N+1 queries aren’t the only poorly optimized queries that can slow down your application performance. Simply not thinking through how your data is fetched can also cause issues. Again, this problem is exacerbated by Django ORM. It’s so easy to fetch data, so why not fetch it all!

In SQL, it’s generally seen as bad practice to use SELECT * in a query (which retrieves all columns of data), but as the examples above show, that is effectively what Article.objects.all() or Article.objects.get(id=1) is doing. Under the hood, hidden by ORM, you are grabbing all the data for a given entry.

We can see how that looks in Scout APM. Here’s a different example that suffers from this problem. We can see in the dashboard that one query is taken almost a second to return:

The RefreshArticleSelect query is slow. When we click on it we can see why:

Oh boy. For every article we are querying, we are returning every column of data. Really this app only needs a few of these fields. This is because the query is:

class RefreshViewSet(viewsets.ModelViewSet):

serializer_class = RefreshSerializer

def get_queryset(self):
unique_id = self.kwargs['unique_id']
return Article.objects.filter(unique_id=unique_id).filter(refresh=1)

We are effectively using SELECT *. With this insight, we can refactor this code to use only():

Article.objects.filter(unique_id=unique_id, refresh=1).only('title', 'url')

only() is used when you only need certain fields from a model. This helps save memory and reduce the load on your database by only pulling in the specified fields rather than all fields from a model.

Finding over-fetching that increases memory bloat

The above issue is a good example of over-fetching when you retrieve more data than you need. This isn’t only slow, it can also lead to memory bloat.

Memory bloat refers to a situation where an application uses more memory than necessary, often due to inefficient data handling, such as loading more data into memory than needed or failing to release memory when it's no longer needed. It can lead to reduced performance, increased infrastructure costs, and in extreme cases, application crashes or slowdowns.

Scout APM tells us we are also seeing memory bloat in this application:

We have a view being allocated over 100MB of memory. This stems from the same query as above, but a more subtle problem with it. When we write an ORM statement like this:

Article.objects.filter(unique_id=unique_id).filter(refresh=1)
Not only are we retrieving all our data, but we are also retrieving entire QuerySets. Retrieving full model instances is more memory and CPU intensive than retrieving a subset of fields, especially if the model has a large number of fields, or if some fields contain large amounts of data.

We can solve this problem by using the only() option above. However, that also returns a QuerySet, just a smaller one that only includes the fields you want. Another option we can use here is the values() or values_list() methods to select only a specific subset of fields in your query:

Article.objects.filter(unique_id=unique_id, refresh=1).values('title', 'url')

This returns a list (or a tuple with values_list()) instead of a QuerySet so can be more memory efficient.

Understand your queries

An interesting element of the fixes above is that they also have trade-offs. If you choose to use only() or values() to reduce your memory footprint, you’ll now need more database calls if you do want other fields.

This is why it’s important to understand what is happening with Django ORM. This excellent tool reduces complex queries to single lines of Python code. But the complexity of the queries isn’t reduced. It is still there.

So you need to see it. Tools like Scout APM are made for this type of introspection. Setting up Scout APM with Django is easy. All you need to do is install the package and add change some settings and data will start to be piped immediately. You can find out more in our documentation.

Ultimately, the key to efficient Django ORM usage lies in the careful balance of convenience and performance - understanding your queries, monitoring their impact, and utilizing the right tools such as Scout APM. This awareness, coupled with the power of Django ORM, will empower you to build efficient, scalable, and maintainable applications.

DEV Community

Identifying and Troubleshooting Inefficiencies in Django ORM

The Django ORM and its problems

Finding N+1 queries that increase server load

Finding slow queries that increased load time

Finding over-fetching that increases memory bloat

Understand your queries

Top comments (0)

Read next

How to deploy to multiple providers with dynamic credentials at once in Terraform CDK for Python

How to configure Makefile to dynamically generate list of commands

I created Ragrank 🎯- An open source ecosystem to evaluate LLM and RAG.

Difference between Microservices vs Monolithic Architecture