
Build a fuzzy search with PostgreSQL

moritz rieger
Cloud Native, loves everything from DEV to OPS and in between. :)

Google has accustomed us to simply typing whatever comes to mind when searching for something.
How can we achieve such a fuzzy search with a PostgreSQL database?

There are basically three approaches.

  1. Pattern matching with ILIKE '%searchterm%' to find all occurrences of searchterm inside of a string.
  2. Postgres full-text search
  3. Postgres pg_trgm (trigram) extension

In this article, I will focus on the third approach, because I want a search so fuzzy that it is insensitive to typing errors and matches even if a word is incomplete. These two criteria rule out the first two approaches.
So let's dive into the third approach: trigrams!

Trigrams

Let's see what the Postgres documentation tells us about trigrams:

A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of two strings by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages.

To get a grasp of what this means, check the show_trgm() function of the pg_trgm extension.
But first, we need to enable the trigram extension.

CREATE EXTENSION pg_trgm;
SELECT show_trgm('Hello');
            show_trgm            
---------------------------------
 {"  h"," he",ell,hel,llo,"lo "}
SELECT show_trgm('Hello World');
                           show_trgm                           
---------------------------------------------------------------
 {"  h","  w"," he"," wo",ell,hel,"ld ",llo,"lo ",orl,rld,wor}
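To see where those sets come from, here is a small Python sketch of the extraction rule (my own illustration of the documented behavior, not pg_trgm's actual implementation): each word is lowercased, padded with two leading spaces and one trailing space, and then split into overlapping three-character windows.

```python
import re

def trigrams(text):
    """Model of pg_trgm's trigram extraction: lowercase the input,
    split it into alphanumeric words, pad each word with two leading
    and one trailing space, and collect all 3-character windows.
    (pg_trgm's word splitting is locale-aware; this sketch only
    handles ASCII words.)"""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result

# the same six trigrams that SELECT show_trgm('Hello') returns
print(sorted(trigrams("Hello")))
```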

similarity

The similarity function returns a number from 0 to 1, which indicates how similar two strings are. A value of 1 means the strings are identical.
In our case it is 0.5: the two strings share 6 trigrams, and the union of both trigram sets contains 12 trigrams.

SELECT similarity('Hello', 'Hello world');
 similarity 
-----------------
        0.5

But if you search for a short string inside a much larger string with the similarity function, the score drops drastically, because many trigrams appear in only one of the two strings.

SELECT similarity('Hello', 'Hello world, you wonderful planet!');
 similarity 
-----------------
 0.19354838
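Under the hood, the score is just set arithmetic: shared trigrams divided by the size of the union of both trigram sets. A self-contained Python sketch (an illustration of the formula, not the extension's implementation):

```python
import re

def trigrams(text):
    """Trigram set as pg_trgm builds it: lowercased, space-padded words."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result

def similarity(a, b):
    """Shared trigrams divided by the union of both trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(similarity("Hello", "Hello world"))                         # 0.5
print(similarity("Hello", "Hello world, you wonderful planet!"))  # ~0.1935
```

The second call reproduces the score drop from above: the 6 trigrams of 'Hello' are shared, but the union has grown to 31 trigrams, so 6/31 ≈ 0.1935.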

Therefore, Postgres provides two other functions that respect word boundaries in strings.

word similarity

The word similarity score reflects the greatest similarity between the first string and any substring (word_similarity) or any sequence of whole words (strict_word_similarity) of the second string.

Note: No matter how long the second string gets, the (strict) word similarity will always be 1, because Hello matches exactly a whole word in the example.

SELECT 
word_similarity('Hello', 'Hello world, you wonderful planet!'),
strict_word_similarity('Hello', 'Hello world, you wonderful planet!');
 word_similarity | strict_word_similarity 
----------------------+------------------------
               1 |                      1

The word similarity score can be understood as the greatest similarity between the first string and any substring of the second string. In other words, it is useful for matching parts of words.

The strict word similarity score is useful for finding the similarity of whole words.

The cool thing about trigram comparison is that even a misspelled word still scores a relatively high similarity.

SELECT similarity('wonderfull', 'wonderful'), strict_word_similarity('wonderfull', 'Hello World, you wonderful planet!'), word_similarity('wonderfull', 'Hello World, you wonderful planet!');
 similarity | strict_word_similarity | word_similarity 
-----------------+------------------------+-----------------
       0.75 |                   0.75 |       0.8181818
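The documented semantics of strict_word_similarity — the greatest similarity between the first string and any contiguous sequence of whole words of the second — can be sketched with a brute-force loop in Python (my own illustration, not the extension's actual algorithm; trigrams() models pg_trgm's extraction rule):

```python
import re

def trigrams(text):
    """Trigram set as pg_trgm builds it: lowercased, space-padded words."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result

def strict_word_similarity(needle, haystack):
    """Greatest trigram similarity between `needle` and any contiguous
    sequence of whole words in `haystack` (brute force over all extents)."""
    n = trigrams(needle)
    words = re.findall(r"[a-z0-9]+", haystack.lower())
    best = 0.0
    for i in range(len(words)):
        for j in range(i, len(words)):
            t = trigrams(" ".join(words[i:j + 1]))
            best = max(best, len(n & t) / len(n | t))
    return best

print(strict_word_similarity("wonderfull", "Hello World, you wonderful planet!"))  # 0.75
```

The best extent here is the single word 'wonderful': 9 shared trigrams out of a union of 12, i.e. 0.75, matching the SQL result above.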

Example

So far so good, let's get our hands dirty.

Imagine you run a bookstore. Customers often ask about a book when they only know fragments of the title, or they don't know how to spell the author's name correctly. Sounds like a perfect fit for our fuzzy search!

For the sake of simplicity, let us generate a dataset. Yeah, I know this is not the most beautiful sample data for a search, but I hope you get the point.

CREATE TABLE books (
  id TEXT,
  title TEXT,
  abstract TEXT,
  author TEXT,
  PRIMARY KEY(id)
);

INSERT INTO books (
    id, title, abstract, author
)
SELECT
    left(md5(i::text), 10),
    left(md5(random()::text), 15),
    md5(random()::text),
    concat_ws(' ', left(md5(random()::text), 10), left(md5(random()::text), 10))
FROM generate_series(1, 100000) s(i);

SELECT * FROM books LIMIT 5;
     id     |      title      |             abstract             |        author         
-----------------+-----------------+----------------------------------+-----------------------
 c4ca4238a0 | 06629ea8b04aaa7 | 9f0ad3e3542ecf1efbdddd98cca507a6 | 4dceb23e53 0f01b57e71
 c81e728d9d | 0a284b57f95d997 | fdd45a7d9eda64c9deb4882ccbb42296 | f3fbf4ed2a ef99e869d2
 eccbc87e4b | b464a24deba866e | 0eda8641e61327719906b493080cd96f | fca64a442c db03ca6331
 a87ff679a2 | 6275469f9e41990 | f3771fb3afdb463d302d7849c38d5641 | fbe106a816 b4313ad5c3
 e4da3b7fbb | b2e9f3a8bad3aec | a9f4b8432c1b6655ae8775dd5b904498 | 6df1b43a13 87c7a303a4

But wait. How can we search over multiple columns?

Good catch. To search across multiple columns, we concatenate all columns that should be searched with the concat_ws function.

Note: The <% operator returns true if the word_similarity of its two arguments is above the pg_trgm.word_similarity_threshold parameter.

The <% operator has the commutator %>.
So 'Hello' <% 'Hello World' is equivalent to 'Hello World' %> 'Hello'

SHOW pg_trgm.word_similarity_threshold;
 pg_trgm.word_similarity_threshold 
----------------------------------------
 0.6

SELECT * FROM books WHERE '9c9f6' <% concat_ws(' ', title, abstract, author);
     id     |      title      |             abstract             |        author         
-----------------+-----------------+----------------------------------+-----------------------
 2804d14b1b | 0152e58b57b94ff | 9c9f468fa58fc140427f7354f1a2b88e | b520b29fae ae1297ce25
 c764c89e73 | 2f81a4878b09a36 | 667ddc5f403d9c96b43912628e1cb73c | ebab7d5c7c 9c9ffbfd82
 5fa8101625 | 30071f6f0e9c9f6 | 5f3250393ac1b214340af81f239e7ca5 | ffec138ddb 3b9060d0c4
 7cb394c278 | ce5ad25e38577fc | 805d0e2ca8081c6a8a178b1f0650c9f6 | 9c5332d434 cd79b0e802
 9b39d07008 | a58ecb6f4e41c96 | 2df452e5bd0b1102733bbc89dd78e6ee | 9c9fb041fa 980becdc8f
(5 rows)
Time: 1245,955 ms (00:01,246)

Ouch, that took waaaaay too long. Now, it's time for GIN!

GIN


Postgres has a special index type for this use case: GIN (Generalized Inverted Index).

"GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items."

Sounds like a perfect fit. But there is a little problem: we cannot create an index on the concat_ws function directly, since it is not immutable. We have to build an immutable function wrapper to create an index on the concatenated columns.

-- create immutable function wrapper for concat_ws
CREATE OR REPLACE FUNCTION f_immutable_concat_ws(t1 text, t2 text, t3 text)
  RETURNS text AS
$func$
SELECT concat_ws(' ', t1, t2, t3)
$func$ LANGUAGE sql IMMUTABLE;

-- create a GIN index
CREATE INDEX search_gin_trgm_idx ON books
USING gin (f_immutable_concat_ws(title, abstract, author) gin_trgm_ops);
-- validate performance improvements
EXPLAIN ANALYZE SELECT * FROM books WHERE '9c9f6' <% f_immutable_concat_ws(title, abstract, author);
                                                           QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on books  (cost=21.68..152.10 rows=100 width=82) (actual time=1.937..2.394 rows=5 loops=1)
   Filter: ('9c9f6'::text <% f_immutable_concat_ws(title, abstract, author))
   Rows Removed by Filter: 13
   Heap Blocks: exact=18
   ->  Bitmap Index Scan on search_gin_trgm_idx  (cost=0.00..21.65 rows=100 width=0) (actual time=1.653..1.663 rows=18 loops=1)
         Index Cond: (f_immutable_concat_ws(title, abstract, author) %> '9c9f6'::text)
 Planning Time: 4.487 ms
 Execution Time: 2.533 ms
(8 rows)

Time: 9,965 ms

Yes, this was 125 times faster! 🏎

Where to go from here?

You could try:

  • strict_word_similarity with its <<% operator
  • setting a different similarity threshold via the GUC parameters
  • if your search string contains multiple words, using the WHERE condition from above for every word and combining them with AND
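The last point can be sketched as a tiny query builder (my own illustration, reusing the books table and columns from the example; the search words must be passed as query parameters, never interpolated into the SQL string):

```python
def build_fuzzy_query(words):
    """Build a query that requires every search word to fuzzily match
    the concatenated columns: one `<%` condition per word, joined with
    AND, using positional parameters ($1, $2, ...) for safe value passing."""
    conditions = " AND ".join(
        f"${i} <% concat_ws(' ', title, abstract, author)"
        for i in range(1, len(words) + 1)
    )
    return f"SELECT * FROM books WHERE {conditions}", list(words)

sql, params = build_fuzzy_query(["wonderfull", "planett"])
print(sql)
```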

Syntactic similarity is not enough for you?
Word2vec could be an option for comparing the semantic similarity of words. Luckily, there is already a postgres-word2vec extension for this.

I would love to hear about your solution for a fuzzy search on Postgres in the comments.

Cheers!
