DEV Community

Cover image for Fast Fulltext Search With Postgres Gin Index
Marcell Cruz
Marcell Cruz

Posted on • Updated on

Fast Fulltext Search With Postgres Gin Index

  1. Creating Fake Realistic Data
    1. List Of English Words
    2. Postgres Data Directory
    3. Generating The Data
  2. Testing Before The Index
  3. Creating The Index And Testing
  4. Expression Indexes

The first thing we need to do is to create a lot of fake data so we can test our indexex, it's also important to use real words dependending on the index that you're trying
to use, there's a clever way of doing this on linux, if you're not on linux you can download
a list of words here

Creating Fake Realistic Data

The first thing and maybe the most important part when dealing with indexes or any code for that matter is being able to test it, you need to play around with it to figure out what is the best index for your specific query, to test indexes we need a lot of data otherwise the query planner
won't use our index, if you wanna to know more about indexes you can read this

List Of Words

In order to create realistic data we need a list of words, on Linux you can grab the word list generated by the spell checker utility on /usr/share/dict/, the name of the file varies depending on the distribution, if you're not on linux you can grab your list of words here

Postgres Data Directory

Now that we have a list of words we need a way of reading this list into postgres so we can generate our data. the function pg_read_file can read anything inside the data directory, you can find out where your data directory is running

SHOW data_directory;
Enter fullscreen mode Exit fullscreen mode

The result will probably be something like /var/lib/postgres/, this also dependends on your OS,
after getting the path you just need to copy your list of words to this directory so we can use pg_read_file
to read this list of words and generate our data.

Generating The Data

Let's create a table called notes with two fields, title and description

CREATE TABLE notes(id serial PRIMARY KEY, title text, description text);
Enter fullscreen mode Exit fullscreen mode

We can then create a function that will feed us the words

CREATE OR REPLACE FUNCTION getNArrayS(el text[], count int) RETURNS text AS $$
  SELECT string_agg(el[random()*(array_length(el,1)-1)+1], ' ') FROM generate_series(1,count) g(i)
$$
VOLATILE
LANGUAGE SQL;
Enter fullscreen mode Exit fullscreen mode

Just copy and paste, press enter and things will work, I promise :P

Now we can create the data, it will take some time so go grab some tea meanwhile.

WITH t(ray) AS(
  SELECT (string_to_array(pg_read_file('words.list')::text,E'\n')) 
) 
INSERT INTO notes(title, description)
SELECT getNArrayS(T.ray, 3), getNArrayS(T.ray, 3) FROM T, generate_series(1,100000);
Enter fullscreen mode Exit fullscreen mode

We're creating one million records [insert dr evil meme here] with real words in two fields, not bad,
we can increase the amount of words in each field if you want you just need to change '3' in getNArrayS, 3 in this case is the number of words in each field, now we can test our indexes variations and easily change it depending on what query and index we want to test.

Testing Before The Index

We need to test our query now and see how fast it's, so we can take note and compare later on after we create our index, unfortunatly as far as I know postgres don't have a way of forcing the query planner to use a specific index or not, so we need to test first then compare the times later.
run

\timing
Enter fullscreen mode Exit fullscreen mode

Now every query will show the time it took to run.
First let's test a simple like, we also need to choose one word from our word list ofc, I chose
asimov.

SELECT * FROM notes WHERE title ilike '%asi%';
Enter fullscreen mode Exit fullscreen mode

Time: 615.921 ms
with an AND

SELECT * from notes where title ilike '%asi%' and description ilike '%gar%';
Enter fullscreen mode Exit fullscreen mode

Time: 561.811 ms
with a OR

SELECT * from notes where title ilike '%asi%' or description ilike '%gar%';
Enter fullscreen mode Exit fullscreen mode

Time: 1027.195 ms (00:01.027)

concating the two fields

select * from notes where coalesce(title, '') || ' ' || coalesce(description, '') ilike '%gar%';
Enter fullscreen mode Exit fullscreen mode

Time: 810.602 ms

Ok, now let's create the index and compare the time afterwards for each query.

Creating The Index And Testing

Let's first create the trigrams extensions, more about this later.

CREATE EXTENSION IF NOT EXISTS pg_trgm;
Enter fullscreen mode Exit fullscreen mode

Now let's create an index just on the title field using the trigrams extension

CREATE INDEX notes_title_idx ON notes USING gin (title gin_trgm_ops);
Enter fullscreen mode Exit fullscreen mode

Since it's very common to search for the first three letters of a word postgres has a trigrams(3 letters) that speed up just that, it creates a index with the first 3
letters of the words in the field so our queries run even faster, now let's test our index with the same query that we used before.

SELECT * FROM notes WHERE title ilike '%asi%';
Enter fullscreen mode Exit fullscreen mode

Time: 211.155 ms
Almost three times faster, not bad, we can use EXPLAIN to see how the query planner is using our index

EXPLAIN SELECT * FROM notes WHERE title ilike '%asi%';
Enter fullscreen mode Exit fullscreen mode
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Bitmap Heap Scan on notes  (cost=200.58..11682.82 rows=20204 width=58)
   Recheck Cond: (title ~~* '%asi%'::text)
   ->  Bitmap Index Scan on notes_title_idx  (cost=0.00..195.53 rows=20204 width=0)
         Index Cond: (title ~~* '%asi%'::text)
Enter fullscreen mode Exit fullscreen mode

We can see that the query is using our index notes_title_idx, subsequent runs will
return even faster results since the index is already in memory, the second run was 51.841 ms
more than 10x faster, the second run without the index woulnd't have made any difference.

The second query that we want to test has an AND using the description field, let's first run the query without a description index to see what happens.

SELECT * from notes where title ilike '%asi%' and description ilike '%gar%';
Enter fullscreen mode Exit fullscreen mode

Time: 51.215 ms

Even though we don't have a index in description, our results are already 10x faster, that's because title is almost instant now, and the and query on description don't need to scan the whole table, just the result set of the title index, before creating the index in the description field let's run the query with the OR clause to see
how it compares to the and query

SELECT * from notes where title ilike '%asi%' or description ilike '%gar%';
Enter fullscreen mode Exit fullscreen mode

Time: 1.059 ms

The same time as before, what happened? with an OR the query planner needs to scan the whole
table again so it decides to not use the index, if we run with an EXPLAIN we can see that

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Gather  (cost=1000.00..20295.83 rows=20302 width=58)
   Workers Planned: 2
   ->  Parallel Seq Scan on notes  (cost=0.00..17265.62 rows=8459 width=58)
         Filter: ((title ~~* '%asi%'::text) OR (description ~~* '%gar%'::text))

Enter fullscreen mode Exit fullscreen mode

The query is not using notes_title_idx because it would be even slower in this case, the reason is because
the disc would have to be hit three times, one for the index read, one for the title read and one for the description.

Now let's create the description index and see how it compares

CREATE INDEX notes_description_idx ON notes USING gin (description gin_trgm_ops);
Enter fullscreen mode Exit fullscreen mode

After running the OR query again we have a much better result Time: 141.310 ms almost 10x faster again, we can use explain to see that both indexes are being used now

                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Bitmap Heap Scan on notes  (cost=222.43..11745.76 rows=20302 width=58)
   Recheck Cond: ((title ~~* '%asi%'::text) OR (description ~~* '%gar%'::text))
   ->  BitmapOr  (cost=222.43..222.43 rows=20304 width=0)
         ->  Bitmap Index Scan on notes_title_idx  (cost=0.00..195.53 rows=20204 width=0)
               Index Cond: (title ~~* '%asi%'::text)
         ->  Bitmap Index Scan on notes_description_idx  (cost=0.00..16.75 rows=100 width=0)
               Index Cond: (description ~~* '%gar%'::text)
Enter fullscreen mode Exit fullscreen mode

The last query that we need to improve is the one where we concatenate the two fields title and description and search them, we could use an index on both fields together or use one in each like we did in the previous example, let's do both and see what's faster.

To test for both we just need to run the query again since we already have two indexes one in each field

select * from notes where coalesce(title, '') || ' ' || coalesce(description, '') ilike '%gar%';
Enter fullscreen mode Exit fullscreen mode

Time: 849.965 ms

Using the coalesce the query planer doesn't use any of our indexes, now let's create an index on both fields and see what happens

create index notes_title_description_idx ON notes USING gin (title gin_trgm_ops, description gin_trgm_ops);
Enter fullscreen mode Exit fullscreen mode

Running the query again we get the same result, in order to speed up a query like this is much better to change the query to use a OR and to use a expression index.

SELECT * from notes where title ilike '%gar%' or description ilike '%gar%';
Enter fullscreen mode Exit fullscreen mode

with an OR we get 126 ms and the query planer is using our multicolumn index notes_title_description_idx that we just created, the results of using an multicolumn index are better but not by a lot and depending on how many fields you have in your table the index won't even be used, now let's try to create a concatenated index using the two fields that mimics the query that we tried.

Expression Indexes

CREATE INDEX notes_coales_title_description_idx ON notes USING gin ((coalesce(title, '') || ' ' || coalesce(description, '')) gin_trgm_ops);
Enter fullscreen mode Exit fullscreen mode

We need the extra parenteshis for the expression, now let's run the query again

select * from notes where coalesce(title, '') || ' ' || coalesce(description, '') ilike '%gar%';
Enter fullscreen mode Exit fullscreen mode

Time: 103.706 ms
10x faster again, postgres is smart enough to use our new index notes_coales_title_description_idx, so that was faster than using the two different indexes in each field with an OR and faster than multicolumn index with an OR.

The most important thing we can take from these tests is that is hard to predict what postgres will do so you should always test your indexes with the exact query that you're going to use, you should try different approaches to figure out what's the fastest for your query and don't be afraid of creating different indexes for different queries, and also using indexes for your queries makes them a LOT faster, 10x faster for all these queries that we tried, so you should definitely use them.

Discussion (0)