Ivan Slavko Matić

Django 4.1: Improving Database Accessibility

Introduction

Speed is today's commodity in applications of every kind, whatever language they are written in: how fast we get our render, send our information, process complex operations, or, in general, how fluent the whole experience is while using an application. We've developed a certain standard we expect from our applications. So how do we maintain that standard with Django?

There are many factors that decide how fast or how reliable an application is, and those factors become especially noticeable as our application grows with each new feature and operation we add. After a while, it's fairly common for a developer to try to improve the performance of their application. Django's strength lies in procuring easily assembled and fast queries from a database, and that is the angle we want to explore today.

Speed Traps

First, you might wonder, what are 'speed traps'? The term is meant to capture the following: to sustain a level of reliability in our applications, some elements, such as caching, indexing and query optimisation, are not fully automated in Django and are left to the end user. They come at a cost: a trade-in where we sacrifice memory and storage space (which provide 'breathing room' and stability for our applications) to procure speed. This is not just a Django quandary; it is present in game engines, database systems and programming languages, and engineers face it while planning ahead and upgrading. While some performance boosts are straightforward (performance is added, not transferred or traded), trading memory and storage space for speed is not. As a matter of fact, a wrong allocation of resources can hurt the application and slow it down, or even worse, crash it. And as we know, user experience is affected by both speed and reliability. So in short, by saying speed traps, I'm referring to the trade between memory, storage space and speed, which can be costly if done wrong.

But here’s a cool thing, with some careful and well-thought-out fine-tuning, we can optimize our application to run at full speed. We can take control of the unknown in our Django application and push it to new heights. Let’s set up some tests and examples, and see how we can use these boosts to our best advantage.

Quick Setup Of Testing Environment

We will be using PostgreSQL 13.6 and Python 3.9 for this project.

Start project

django-admin startproject speedster

Start app

./manage.py startapp speedy

Create venv

python -m venv venv

Python packages (use ‘pip install’ or your IDE’s package manager to install them)

# Packages
asgiref==3.5.2
Django==4.1.1
psycopg2==2.9.3
django-debug-toolbar==3.6.0
sqlparse==0.4.2
tzdata==2022.2

Note: ‘django-debug-toolbar’ is an optional third-party package. It is used primarily for monitoring and troubleshooting. For full installation details, follow this link https://django-debug-toolbar.readthedocs.io/en/latest/installation.html.

Define our app in settings.py

INSTALLED_APPS = [
  'django.contrib.admin',
  ...
  'speedy'
 ]

Set up our DB: log in as a superuser in the ‘psql’ shell (‘postgres’ is the superuser in my case)

psql -U postgres

Creating DB and user with privileges

CREATE DATABASE speedster_db;
CREATE USER speedster_user WITH PASSWORD 'speedsterpass';
ALTER ROLE speedster_user SET client_encoding TO 'utf8';
ALTER ROLE speedster_user SET default_transaction_isolation TO 'read committed';
GRANT ALL PRIVILEGES ON DATABASE speedster_db TO speedster_user;
\q

Add our database configuration to our settings.py

DATABASES = {
  'default': {
   'ENGINE': 'django.db.backends.postgresql',
   'NAME': 'speedster_db',
   'USER': 'speedster_user',
   'PASSWORD': 'speedsterpass',
   'HOST': 'localhost',
   'PORT': '5432',
  }
 }

Run migrate for our default apps like ‘admin’, ‘contenttypes’, ‘auth’... (We don’t have any models yet, so the ‘makemigrations’ command is not necessary)

./manage.py migrate
./manage.py check   # Can't hurt

Models

We are keeping our models modest and simple. Altogether, I will be filling tables with approx. 10000 objects where each object has a similar size (that does not include the ‘through’ relations table autogenerated from the m2m field).

from django.db import models
from django.db.models import CASCADE


class Employees(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)
    address = models.CharField(max_length=200)
    age = models.PositiveSmallIntegerField()
    date_of_birth = models.DateField()


class Stores(models.Model):
    manager = models.ForeignKey(Employees, on_delete=CASCADE)
    city = models.CharField(max_length=100)


class Shifts(models.Model):
    name = models.CharField(max_length=100)
    employees_shift = models.ManyToManyField(Employees)
    wage = models.DecimalField(max_digits=5, decimal_places=2)
    hours = models.CharField(max_length=15)

Ok, the initial setup for our project is done. Let’s explore some ways to gain performance boosts.

Indexes

We’ve all encountered long renders, or waited on a slow API, caused by a query sifting through large amounts of data in our database (imagine a query fetching over a million objects that aren’t small integers; that takes a while). Let’s say we want to fetch all employees whose id is 284 or higher:

# Django Query
# Note: filter() never raises DoesNotExist; it simply returns an empty queryset
from django.db import connection

employees = Employees.objects.filter(id__gte=284)
print(connection.queries)   # requires DEBUG = True in settings.py
print(employees.explain())
# Postgres explain: 
# Seq Scan on speedy_employees  (cost=0.00..228.00 rows=9971 width=50)
# SQL: 
# SELECT "speedy_employees"."id", "speedy_employees"."first_name", "speedy_employees"."last_name", "speedy_employees"."address", "speedy_employees"."age", "speedy_employees"."date_of_birth" FROM "speedy_employees" WHERE "speedy_employees"."id" >= 284

By running Postgres’ ‘EXPLAIN’ command on our query, we can see that Postgres did a sequential scan over 9971 rows, meaning the scanner went row by row, comparing each value against our filter condition. Let’s apply indexes and compare with an index scan. In Django, we can add indexes with ‘Meta.indexes’ or with ‘Field.db_index’.

Field.db_index example:

  • Use this one if your needs are simple and if you’re looking for an ‘inline declaration’
last_name = models.CharField(max_length=100, db_index=True)

Meta.indexes example:

  • With this approach, you can define custom names for your indexes
class Meta:
    indexes = [
        models.Index(fields=['id'], name='employees_id_idx')
    ]

After declaring it, run ‘makemigrations’ and ‘migrate’ to apply the changes.

After we run Postgres’ ‘EXPLAIN’ on our employee query again, we can see the cost has dropped to 203.00. The lower the cost, the faster the query executes, and in this case we got a faster query as a result. For more information on how the cost is calculated, please check reference [1].
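The same before-and-after can be reproduced locally. This is a minimal sketch using Python's stdlib sqlite3 instead of PostgreSQL (the planner's wording differs, but the principle is identical): the same filter goes from a full table scan to an index lookup once an index exists. The table and index names here are illustrative, not part of our project.

```python
import sqlite3

# In-memory table standing in for our Employees model (illustrative schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, age INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(i, 20 + i % 40) for i in range(10000)],
)

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry a human-readable detail in the last column
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM employees WHERE age = 49"
before = plan(query)  # full table scan ('SCAN ...')
conn.execute("CREATE INDEX idx_employees_age ON employees (age)")
after = plan(query)   # index lookup ('SEARCH ... USING INDEX ...')
```

Printing `before` and `after` shows the planner switching strategies, which is exactly what the dropping cost number in Postgres reflects.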

Now we’ve indexed our field ‘id’, which means that whenever we filter on ‘id’ in filter(), exclude() and other methods that filter on that field, those rows will be fetched faster.

Important note:

Employees.objects.filter(id__gte=284, age=49)

The ‘id’ field in the example above is indexed, while the ‘age’ field is not, which means you get an index scan on one part of the query and a sequential scan on the other, so plan accordingly. Where to apply indexing depends on your situation. Django’s documentation recommends indexing fields that are unique (primary keys, foreign keys, unique fields); lookups are faster because there is no chance of matching a duplicate value.

My personal rule of thumb is to apply indexing to the fields that are used most in filter(), order_by() and similar methods.

Storage space
An index table can take roughly as much space as the data inside the column it covers, and that is one of the bigger costs of using indexing: the space taken by your indexes scales with the size of your columns.

Pros:

  • There is no need for constant reindexing or maintenance; our model’s index declaration handles it for us (in PostgreSQL’s case, no need for aggressive ‘VACUUM’ usage).
  • Reduces the cost and speeds up query lookup
  • Today’s servers and host devices are pretty powerful, so storage space and computing power (at least for now) are bountiful

Cons:

  • Uses storage space (a limited resource, no matter how big)
  • Trading query speed for index maintenance may yield no performance gain (and may even reduce it) if utilized incorrectly
  • The cost of maintenance scales with the number of indexed fields (even though it’s automated, the index still has to be rebuilt)

Where and when to use:
Usually, it’s unnecessary to use indexing while in the early development stage of the application. Indexing is something I wouldn’t do proactively but rather reactively — there should be a need for indexing, a need for performance upgrade. And that need often appears in the processing of large quantities of data and long query lookups.

Caching

An alternative to waiting for long renders of complex pages (or pages with expensive, time-consuming operations) is caching. The idea behind caching is to predict the future needs of end users/clients and prepare data (which can come in many forms) ahead of time. The prepared data is stored in a cache. Configuring caches is quite simple; the part that requires the most work is deciding what to cache and where. Caching isn’t designed for permanent data storage but for temporary storage, which means that if your server crashes, the data is lost; keep that in mind. Django’s support for caching is quite extensive, and you can learn more by visiting reference [4]. Let’s try out one of the caching options Django offers.

Memcached

A really powerful addition to performance, but also a very serious speed trap: that is how I would summarize Memcached. Before we try it, let's look at what Memcached actually does. Quoting the official Django documentation:

Memcached runs as a daemon and is allotted a specified amount of RAM. All it does is provide a fast interface for adding, retrieving and deleting data in the cache. All data is stored directly in memory, so there’s no overhead of database or filesystem usage.

An important thing to notice is that Memcached uses RAM, and more importantly, the RAM of the host itself. That should factor into your calculations and planning if you decide to put Memcached on your host/remote server. Let’s install it and try it out.

Memcached is third-party software and requires downloading a release from https://memcached.org/downloads. Note that the releases there are meant for Debian/Ubuntu; to make it work on Windows, you will have to search the internet a bit for a Windows build (it’s not particularly hard to find). Personally, I recommend installing it on Linux (or a Linux VM if you’re on Windows); support is better, and the configuration does not require you to dig through the registry like we are just about to do on Windows.

After unpacking it, you need to install it and start it so that the daemon runs in the background.

Navigate to your unzipped folder and run these commands to start the daemon

memcached.exe -d install
memcached.exe -d start -p 11211 -m 2048
# -p is already the default; shown here for demonstration

By default, the Memcached server listens on port 11211 with a 64 MB memory limit (unless specified otherwise); both can be changed. The port can stay the same for now, but we would like to raise the memory limit. For starters, I recommend setting it to 512 MB. In Windows Run, type ‘regedit’ and navigate to:

HKEY_LOCAL_MACHINE/SYSTEM/CurrentControlSet/Services/memcached/ImagePath

On the ‘ImagePath’ variable you can edit the parameters manually and set the memory limit. If you open ‘Services’, you should find the ‘memcached’ service running. Ok, now we need a ‘middleman’ Python library to communicate with our Memcached server. Django recommends pylibmc and pymemcache; we will be using pymemcache. Let’s run a quick pip install:

pip install pymemcache

The final step is to define our CACHE configuration in our settings.py

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.PyMemcacheCache',
        'LOCATION': '127.0.0.1:11211',
    }
}

And our setup is done. Now comes an equally important part: deciding what to cache. Some of the possibilities are caching a whole site, a view or a template fragment. Let’s do a simple site cache, which is done by updating MIDDLEWARE in settings.py:

MIDDLEWARE = [
    'django.middleware.cache.UpdateCacheMiddleware',   # must come first
    'django.middleware.common.CommonMiddleware',
    ...
    'django.middleware.cache.FetchFromCacheMiddleware',  # must come last
]

Note: the order of the classes above is strict; ‘update’ comes before ‘fetch’. And now we just have to tell Django which cache from our CACHES configuration in settings.py to use, and set the timeout for cache validity:

CACHE_MIDDLEWARE_ALIAS = 'default'  # Which cache alias to use
CACHE_MIDDLEWARE_SECONDS = 600      # How long to keep the cache, in seconds

Now, if you look closely, we just fell into a speed trap: we used expensive RAM to cache a whole site. Usually that is very ineffective and costly performance-wise. Instead of caching an entire site, a better use for Memcached is something smaller and more focused, such as the per-view cache. It is always recommended to cache only the most important parts of your application first, then expand if needed. The per-view cache is quite simple: you either add a decorator to your view or wrap the view with ‘cache_page()’ in your URL path, like this:

from django.views.decorators.cache import cache_page

@cache_page(60 * 15)
def your_view(request):
    ...

# or

from django.views.decorators.cache import cache_page

urlpatterns = [
    path('object/<int:object_id>/', cache_page(60 * 15)(your_view)),
]

With a little tinkering, this can be instrumental in providing a user with a satisfying experience. For instance, this can be useful for fetching a user’s profile page in repeated succession.

Pros:

  • Can use RAM, for even greater speeds
  • Customizable amount of RAM/storage space
  • Caching has a large variety of choices and customizations
  • If properly configured, a cached copy can actually save the day by serving content to users while the server is down

Cons:

  • Just like indexing, caching can be expensive performance-wise if overused
  • In some cases, requires third-party software installations (for instance, Redis and Memcached)
  • Cached data is temporary; when a server or service is stopped or crashes, the data is lost

Where and when to use:
If there is a need to reduce access latency (especially where data is frequently accessed) or a need for alternative data storage. Rule of thumb: find the busiest view with the most query calls in your application and cache it.

Querysets

Unlike caching and indexing, which are best used later on in development phases or reactively (when there is a performance hit), you can always improve your query writing game and get a performance boost for your application as you write your code!

Iteration as a last resort

This is more view-oriented, but it involves queries as well. You have probably heard this before, but I will repeat it because it’s very important: let the queries do the heavy lifting. An ORM query already does caching and data assembly; iterating through those objects again to form another data bundle is a delay, and you should avoid it as much as possible.

Instead of rushing ahead, stop and take your time with queries. One quality query in Django is the centrepiece of an efficient view.

As a matter of fact, you can even avoid caching in your queries by using ‘iterator()’. For example, say you create a queryset (which caches its results to provide quicker access) and, for the rest of the view, use those results only once (say, to send data to the template in render()). In that case, we are creating and carrying a cache through our view that will never be reused. If we use iterator(), we eliminate that default caching from our query.

Example:

employee = Employees.objects.filter(id__gte=284).iterator()

By default, the iterator’s chunk_size parameter is set to 2000 objects. Depending on the amount of data you are fetching, it can always be increased. Granted, this may not be a large speed boost on its own, but if you write many queries, this practice can pay off really quickly.
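For intuition on what chunked fetching buys you, here is a minimal stdlib sqlite3 sketch (an analogy, not Django's actual implementation): the cursor streams rows in fixed-size batches instead of materialising and caching the whole result set at once, much like iterator(chunk_size=2000).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO employees VALUES (?)", [(i,) for i in range(10000)])

cur = conn.execute("SELECT id FROM employees")
chunks, total = 0, 0
while True:
    rows = cur.fetchmany(2000)  # pull 2000 rows per batch, akin to chunk_size=2000
    if not rows:
        break
    chunks += 1
    total += len(rows)
# 10000 rows arrive in 5 batches of 2000, never all in memory as one cached list
```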

‘select_related()’ and ‘prefetch_related()’

These two methods are getting more notice by the day. First, let’s explain what they do. Unlike iterator(), ‘select_related’ and ‘prefetch_related’ actually add to the query caching: they fetch related objects of foreign-key and many-to-many fields up front. select_related follows foreign keys within the same query using an SQL JOIN, while prefetch_related runs a separate query per relation and joins the results in Python. It is best shown through an example. This is our query:

stores = Stores.objects.filter(id__gte=284)

Now, what seems simple and benign to the eye: if you access a related object via the foreign key,

for obj in stores:
    print(obj.manager.first_name)  # each iteration hits the database again

you are actually performing an additional query per row to fetch the related object, which means that on top of your cached query you are performing many more queries. And that's where prefetch and select related come in:

Stores.objects.select_related('manager').filter(id__gte=284)
# or 
Shifts.objects.prefetch_related('employees_shift').all()

Imagine fetching 10000 objects and performing a for loop in which you access a related object: that means ~10000 extra queries to your database, which I think you will agree is a mistake. That mistake can be avoided by using select_related and prefetch_related, which bring it down to a single query (or one extra query per prefetched relation) and thus prevent all those unnecessary calls to the database.
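Stripped of the ORM, the N+1 pattern and its fix look like this. This is a sketch with stdlib sqlite3 and made-up tables mirroring Stores and Employees; the loop issues one extra query per store, while the JOIN (what select_related generates) fetches everything at once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE stores (id INTEGER PRIMARY KEY, manager_id INTEGER, city TEXT);
""")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(i, f"emp{i}") for i in range(100)])
conn.executemany("INSERT INTO stores VALUES (?, ?, ?)", [(i, i, f"city{i}") for i in range(100)])

# N+1: one query for the stores, then one more per store for its manager
extra_queries = 0
managers = []
for _store_id, manager_id, _city in conn.execute("SELECT id, manager_id, city FROM stores"):
    name = conn.execute(
        "SELECT first_name FROM employees WHERE id = ?", (manager_id,)
    ).fetchone()[0]
    extra_queries += 1
    managers.append(name)

# What select_related() does instead: a single JOIN returns both rows at once
joined = conn.execute(
    "SELECT s.city, e.first_name FROM stores s JOIN employees e ON e.id = s.manager_id"
).fetchall()
```

One hundred extra round trips versus one JOIN; at 10000 objects the gap only widens.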

I heartily recommend checking the references below. I believe you will find valuable additional information there to bring out your A game in Django.

References

[0] docs.djangoproject.com, “Database access optimization”, https://docs.djangoproject.com/en/4.1/topics/db/optimization/

[1] scalegrid.io, “PostgreSQL EXPLAIN — What are the Query Costs?”, https://scalegrid.io/blog/postgres-explain-cost/, March 30, 2022

[2] kaiv.wordpress.com, “Decreasing the index size on wide columns”, https://kaiv.wordpress.com/2007/07/23/decreasing-the-index-size-on-wide-columns/, July 23, 2007

[3] wiki.postgresql.org, “Index Maintenance”, https://wiki.postgresql.org/wiki/Index_Maintenance, June 9, 2021

[4] docs.djangoproject.com, “Django’s cache framework”, https://docs.djangoproject.com/en/4.1/topics/cache/

[5] honeybadger.io, “Everything You Need to Know About Caching in Django”, https://www.honeybadger.io/blog/caching-in-django/, June 16, 2022

[6] tutorialspoint.com, “Django Caching”, https://www.tutorialspoint.com/django/django_caching.htm
