In a recent project I had to add full-text search to an existing Django project. Below are notes on the challenges I encountered and how I solved them.
For easy reading I have listed a brief walk-through and the limitations I found, followed by a more detailed log.
Project Overview
This project involved loading a 30GB+ CSV of book data into the database and implementing full text search on each book's title, author, tags and categories.
Overall walk-through
- Adding FTS with Postgres is super easy:
  - Add `django.contrib.postgres` to `INSTALLED_APPS`
  - Perform the search as follows: `Book.objects.filter(title__search='A little girl')`
- Indexing to increase performance
  - Add a `SearchVectorField` field to the model
    # for pre-computed search vectors
    search_vector = SearchVectorField(null=True, blank=True)
  - Create an index
    class Meta(object):
        indexes = [GinIndex(fields=['search_vector'])]
- Increase `work_mem`
  The default `work_mem` of Postgres is too low for millions of rows. Edit `work_mem` in the `postgresql.conf` file and restart your db; a bit of `sed` does it if you are using Docker (a per-connection alternative is sketched right after this list).
- Caching
  I cached the whole webpage on a Redis instance, along with certain queries like the result count (a very heavy one) which repeats for every search page load; this required overriding the `Paginator` and `ListView` classes.
- Increase shared memory
  Generally not required, but my Docker container was running out of memory for some queries.
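If editing `postgresql.conf` is awkward in your setup (for example on a managed database), `work_mem` can also be raised per connection from Django. This is only a sketch of that alternative, not what I did in this project, and the 64MB value is a placeholder you would have to tune:

# sketch: raise work_mem per connection instead of editing postgresql.conf
# register this in an app that is loaded at startup (e.g. apps.py or signals.py)
from django.db.backends.signals import connection_created
from django.dispatch import receiver

@receiver(connection_created)
def raise_work_mem(sender, connection, **kwargs):
    if connection.vendor == 'postgresql':
        with connection.cursor() as cursor:
            cursor.execute("SET work_mem = '64MB';")  # placeholder value, tune for your data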
Limitations Found
- Complex queries would be really slow, for example:
  - sorting results based on similarity
  - sorting search results (aka `ORDER BY`) based on the number of comments on a book, or any other non-text column (roughly the query sketched after this list)
- A bit hard to tune
  - Trading off between relevant results and the maximum possible number of results (good for SEO) requires complex queries which take too long to process.
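To make the sorting point concrete, below is roughly the kind of query that was painfully slow for me: filter on the search vector, rank by relevance, then order by a non-text column. It is a sketch against the `Book` model described later in this post:

# sketch: rank matches by relevance, then order by comment count
from django.contrib.postgres.search import SearchQuery, SearchRank
from django.db.models import F

query = SearchQuery('a little girl')
slow_results = (
    Book.objects
    .filter(search_vector=query)                          # the GIN index helps here
    .annotate(rank=SearchRank(F('search_vector'), query))
    .order_by('-rank', '-comments_count')                 # the sort is what gets slow
)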
Detailed log
Available options
There are two major ways of achieving this:
- django-haystack plugin
  With this you can integrate a dedicated search engine with your Django application; you have several options like Solr and Elasticsearch as the search backend. These are really good at handling text search over a large number of documents, but they come with overhead in the form of server cost, development effort, etc.
- PostgreSQL's full text search
  Postgres has a built-in full text search feature: in SQL you just add a `WHERE` clause and you have fully working text search, and on Django's side you can use the `.filter` method. It is not a dedicated search application, so it has shortcomings; for small applications it works great out of the box, but as the database grows you'll have to do some tweaking.
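To give a feel for how thin the Django layer is, the lookup below (assuming the `Book` model shown later) is the whole integration; printing the queryset's query shows the `to_tsvector(...) @@ plainto_tsquery(...)` WHERE clause Postgres runs:

results = Book.objects.filter(title__search='a little girl')
print(results.query)  # inspect the generated SQL and its WHERE clause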
Implementation
Config
Add `django.contrib.postgres` to `INSTALLED_APPS`.
# settings.py
...
INSTALLED_APPS = [
    ...
    'django.contrib.postgres',  # for fts search
    ...
]
...
Model
from django.db import models
from django.contrib.postgres.search import SearchVectorField
from django.contrib.postgres.indexes import GinIndex


class Book(models.Model):
    title = models.CharField(max_length=300)
    poster_url = models.URLField()
    downloads = models.IntegerField()
    likes = models.IntegerField()
    comments_count = models.IntegerField()
    search_vector = SearchVectorField(null=True, blank=True)  # for pre-computed search vectors

    # tags, categories, authors remaining
    # raw data fields
    _tags = models.TextField(default="", blank=True)
    _categories = models.TextField(default="", blank=True)
    _authors = models.TextField(default="", blank=True)

    class Meta(object):
        indexes = [GinIndex(fields=['search_vector'])]
Above is a typical Django ORM model; `search_vector` holds the vector representation of a book's title, tags, categories and authors. Postgres converts both the search query and the text fields into vectors and then compares them for a match; by pre-computing the search vector and indexing it with a `GinIndex` we improve query speed.
`search_vector` can be computed with the Python snippet below (it needs the `SearchVector` import):

from django.contrib.postgres.search import SearchVector

Book.objects.update(search_vector=SearchVector('title', '_tags', '_categories', '_authors'))

Using authors, tags and categories as `TextField`s helps in loading the huge CSV file faster.
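The `update()` call above is a one-off bulk computation and will not keep `search_vector` in sync as rows change later. One way to keep it fresh (a sketch only; you would not want this firing for every row during the bulk CSV load) is to recompute it on save via a signal:

# sketch: recompute the vector whenever a single Book is saved
from django.contrib.postgres.search import SearchVector
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=Book)
def update_search_vector(sender, instance, **kwargs):
    # .update() writes directly to the DB and does not re-trigger post_save
    Book.objects.filter(pk=instance.pk).update(
        search_vector=SearchVector('title', '_tags', '_categories', '_authors')
    )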
View
The view was implemented with Django's generic `ListView`.
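Roughly, it looked like the sketch below; the `q` parameter name and the template path are placeholders, not the project's actual values:

# sketch of the search view, assuming the query arrives as ?q=...
from django.contrib.postgres.search import SearchQuery
from django.views.generic import ListView

class BookSearchView(ListView):
    model = Book
    paginate_by = 20
    template_name = 'books/search.html'  # placeholder template path

    def get_queryset(self):
        q = self.request.GET.get('q', '').strip()
        if not q:
            return Book.objects.none()
        # filter against the pre-computed search_vector column
        return Book.objects.filter(search_vector=SearchQuery(q))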
Profiling
After this I used the Django Debug Toolbar to look at the queries being performed; there were two major issues.
- COUNT(*) was slow for queries with ~100K+ results
  COUNT(*) is a notoriously expensive operation in SQL: you basically have to scan through the whole table to do it. There are workarounds like storing the count separately or partial indexes, but none of them fit our use case, so I cached these queries instead (a sketch of the cached paginator follows after this list).
- Query time increased drastically once the number of search results grew past a certain point.
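The count caching boiled down to giving the paginator a cached `count`. Below is a sketch of the idea, assuming a configured Django cache backend (Redis in my case); the class name and cache key handling are made up for illustration:

# sketch: a paginator whose expensive COUNT(*) is cached per search query
from django.core.cache import cache
from django.core.paginator import Paginator
from django.utils.functional import cached_property

class CachedCountPaginator(Paginator):
    def __init__(self, *args, cache_key=None, timeout=60 * 60, **kwargs):
        super().__init__(*args, **kwargs)
        self.cache_key = cache_key  # e.g. derived from the search term
        self.timeout = timeout

    @cached_property
    def count(self):
        if self.cache_key:
            cached = cache.get(self.cache_key)
            if cached is not None:
                return cached
        total = super().count  # the slow COUNT(*)
        if self.cache_key:
            cache.set(self.cache_key, total, self.timeout)
        return total

In the `ListView` you would point `paginator_class` at this class and override `get_paginator()` to pass a `cache_key` derived from the search query.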