In a recent project I had to add full-text search to an existing Django project. Below are notes on the challenges I encountered and how I solved them.
For easy reading I have listed a brief walk-through and the limitations I found, followed by a more detailed log.
Project Overview
This project involved loading a 30GB+ CSV of book data into the database and implementing full text search on each book's title, author, tags and categories.
Overall walk-through
- Adding FTS with Postgres is super easy:
  - Add `django.contrib.postgres` to `INSTALLED_APPS`
  - Perform the search as follows: `Book.objects.filter(title__search='A little girl')`
- Indexing to increase performance
  - Add a `SearchVectorField` field to the model
    # for pre-computed search vectors
    search_vector = SearchVectorField(null=True, blank=True)
  - Create an index
    class Meta(object):
        indexes = [GinIndex(fields=['search_vector'])]
- Increase `work_mem`
  The default `work_mem` of Postgres is too low for millions of rows. Edit `work_mem` in the `postgresql.conf` file and restart your db; a bit of `sed` does it if you are using Docker (a per-connection alternative is sketched right after this list).
- Caching
  I cached the whole webpage on a Redis instance, along with certain queries like the result count (a very heavy one) which repeats for every search page load; this required overriding the `Paginator` and `ListView` classes.
- Increase shared memory
  Generally not required, but my Docker container was running out of memory for some queries.
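If editing `postgresql.conf` is awkward in your setup (for example on a managed database), `work_mem` can also be raised per connection from Django. This is only a sketch of that alternative, not what I did in this project, and the 64MB value is a placeholder you would have to tune:

# sketch: raise work_mem per connection instead of editing postgresql.conf
# register this in an app that is loaded at startup (e.g. apps.py or signals.py)
from django.db.backends.signals import connection_created
from django.dispatch import receiver

@receiver(connection_created)
def raise_work_mem(sender, connection, **kwargs):
    if connection.vendor == 'postgresql':
        with connection.cursor() as cursor:
            cursor.execute("SET work_mem = '64MB';")  # placeholder value, tune for your data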
Limitations Found
- Complex queries would be really slow, for example:
  - sorting results based on similarity
  - sorting search results (aka `ORDER BY`) based on the number of comments on a book, or any other non-text column (roughly the query sketched after this list)
- A bit hard to tune
  - Trading off between relevant results and the maximum possible number of results (good for SEO) requires complex queries which take too long to process.
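To make the sorting point concrete, below is roughly the kind of query that was painfully slow for me: filter on the search vector, rank by relevance, then order by a non-text column. It is a sketch against the `Book` model described later in this post:

# sketch: rank matches by relevance, then order by comment count
from django.contrib.postgres.search import SearchQuery, SearchRank
from django.db.models import F

query = SearchQuery('a little girl')
slow_results = (
    Book.objects
    .filter(search_vector=query)                          # the GIN index helps here
    .annotate(rank=SearchRank(F('search_vector'), query))
    .order_by('-rank', '-comments_count')                 # the sort is what gets slow
)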
Detailed log
Available options
There are two major ways of achieving this:
- django-haystack plugin
  With this you can integrate a dedicated search engine with your Django application; you have several options like Solr and Elasticsearch as the search backend. These are really good at handling text search over a large number of documents, but they come with overhead in the form of server cost, development effort, etc.
- PostgreSQL's full text search
  Postgres has a built-in full text search feature: in SQL you just add a `WHERE` clause and you have fully working text search, and on Django's side you can use the `.filter` method. It is not a dedicated search application, so it has shortcomings; for small applications it works great out of the box, but as the database grows you'll have to do some tweaking.
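To give a feel for how thin the Django layer is, the lookup below (assuming the `Book` model shown later) is the whole integration; printing the queryset's query shows the `to_tsvector(...) @@ plainto_tsquery(...)` WHERE clause Postgres runs:

results = Book.objects.filter(title__search='a little girl')
print(results.query)  # inspect the generated SQL and its WHERE clause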
Implementation
Config
Add `django.contrib.postgres` to `INSTALLED_APPS`.
# settings.py
...
INSTALLED_APPS = [
    ...
    'django.contrib.postgres',  # for fts search
    ...
]
...
Model
from django.db import models
from django.contrib.postgres.search import SearchVectorField
from django.contrib.postgres.indexes import GinIndex


class Book(models.Model):
    title = models.CharField(max_length=300)
    poster_url = models.URLField()
    downloads = models.IntegerField()
    likes = models.IntegerField()
    comments_count = models.IntegerField()
    search_vector = SearchVectorField(null=True, blank=True)  # for pre-computed search vectors

    # tags, categories, authors remaining
    # raw data fields
    _tags = models.TextField(default="", blank=True)
    _categories = models.TextField(default="", blank=True)
    _authors = models.TextField(default="", blank=True)

    class Meta(object):
        indexes = [GinIndex(fields=['search_vector'])]
Above is a typical Django ORM model; `search_vector` holds the vector representation of a book's title, tags, categories and authors. Postgres converts both the search query and the text fields into vectors and then compares them for a match; by pre-computing the search vector and indexing it with a `GinIndex` we improve query speed.
`search_vector` can be computed with the Python snippet below (it needs the `SearchVector` import):

from django.contrib.postgres.search import SearchVector

Book.objects.update(search_vector=SearchVector('title', '_tags', '_categories', '_authors'))

Using authors, tags and categories as `TextField`s helps in loading the huge CSV file faster.
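The `update()` call above is a one-off bulk computation and will not keep `search_vector` in sync as rows change later. One way to keep it fresh (a sketch only; you would not want this firing for every row during the bulk CSV load) is to recompute it on save via a signal:

# sketch: recompute the vector whenever a single Book is saved
from django.contrib.postgres.search import SearchVector
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=Book)
def update_search_vector(sender, instance, **kwargs):
    # .update() writes directly to the DB and does not re-trigger post_save
    Book.objects.filter(pk=instance.pk).update(
        search_vector=SearchVector('title', '_tags', '_categories', '_authors')
    )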
View
The view was implemented with Django's generic `ListView`.
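Roughly, it looked like the sketch below; the `q` parameter name and the template path are placeholders, not the project's actual values:

# sketch of the search view, assuming the query arrives as ?q=...
from django.contrib.postgres.search import SearchQuery
from django.views.generic import ListView

class BookSearchView(ListView):
    model = Book
    paginate_by = 20
    template_name = 'books/search.html'  # placeholder template path

    def get_queryset(self):
        q = self.request.GET.get('q', '').strip()
        if not q:
            return Book.objects.none()
        # filter against the pre-computed search_vector column
        return Book.objects.filter(search_vector=SearchQuery(q))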
Profiling
After this I used the Django Debug Toolbar to look at the queries being performed; there were two major issues.
- COUNT(*) was slow for queries with ~100K+ results
  COUNT(*) is a notoriously expensive operation in SQL: you basically have to scan through the whole table to do it. There are workarounds like storing the count separately or partial indexes, but none of them fit our use case, so I cached these queries instead (a sketch of the cached paginator follows after this list).
- Query time increased drastically once the number of search results grew past a certain point.
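The count caching boiled down to giving the paginator a cached `count`. Below is a sketch of the idea, assuming a configured Django cache backend (Redis in my case); the class name and cache key handling are made up for illustration:

# sketch: a paginator whose expensive COUNT(*) is cached per search query
from django.core.cache import cache
from django.core.paginator import Paginator
from django.utils.functional import cached_property

class CachedCountPaginator(Paginator):
    def __init__(self, *args, cache_key=None, timeout=60 * 60, **kwargs):
        super().__init__(*args, **kwargs)
        self.cache_key = cache_key  # e.g. derived from the search term
        self.timeout = timeout

    @cached_property
    def count(self):
        if self.cache_key:
            cached = cache.get(self.cache_key)
            if cached is not None:
                return cached
        total = super().count  # the slow COUNT(*)
        if self.cache_key:
            cache.set(self.cache_key, total, self.timeout)
        return total

In the `ListView` you would point `paginator_class` at this class and override `get_paginator()` to pass a `cache_key` derived from the search query.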