loading...

Pg fulltext search with DBIx::Class

jplindstrom profile image Johan Lindstrom Updated on ・1 min read

Recently I wrote an admin interface for an internal service. It's basically a list of things processed by the service.

For troubleshooting problems it seemed quite useful to be able to filter the list by some text columns containing parameters and the processed result. Using Postgres fulltext search turned out to be a trivial affair, so I thought I'd share it here.

Fulltext search in Postgres

In Postgres, performing a fulltext search is done using the Text Search datatypes and operators. Here's a good introduction.

As an example, assume we have a book table with a blurb TEXT column containing a free form description of the book.

Let's find books in The Laundry File series by Charles Stross, because the latest one, "The Labyrinth Index" was just published. Yay!

select id, title, blurb from book where
    to_tsvector('english', blurb)
    @@
    to_tsquery(
        'english',
        '(scifi | sci-fi) & paranormal & bureacracy
    );

Briefly what's going on here is that the blurb column is parsed into a tsvector which contains English words we can match against. The matching is done using the to_tsquery function, which contains a search query in a certain format. From the example you can probably figure out the overall syntax.

That's it! Working full text search on any column. Well done Pg!

There's a lot more to this and I only know the very first things about it, so go read the Postgres manual for more information.

Improving performance

With the query above, the to_tsvector('english', blurb) part will run this function for every row, so that's going to be a bit slow.

In this particular case, the data volumes weren't that huge initially. Even with several thousand rows this felt quite snappy, with queries returning in tens of ms.

That won't last with increasing data volumes though, so let's create an expression index on the blurb column for this one query.

CREATE INDEX CONCURRENTLY book_blurb_gin
    ON book USING GIN ( to_tsvector('english', blurb) );

Note how to_tsvector('english', blurb) matches exactly what we did in the query above.

Search usability

As a good starting point, especially for non-technical users, it's probably a good idea to not overcomplicate things with the tsquery format.

Just say that all words the they type must match the text: construct the tsquery string yourself by joining words in the user input with ' & '.

If you do allow your users to enter free-form tsquery search strings, you'll have to explain it somewhat plainly to them. For technically inclined users, the briefest explanation I've come up with is:

Use &, |, ! between words, and () to group expressions.
Match 'single-quoted phrases'.

Examples are very powerful for conveying details like this. You should pick some useful search terms related to your domain, maybe similar to this:


(scifi | sci-fi) & horror & 'paranormal bureacracy'

But do try to include all the available features in the example, because people don't read text, and the example may be the only thing they actually cast their busy eyes at. ¯\(ツ)

Simpler-to-use query formats

In Postgres 11 and onwards, there are more user friendly ways to get user input. Read further about plainto_tsquery and websearch_to_tsquery.

tsquery errors

If the user enters invalid tsquery syntax, Postgres will throw an exception when the query is executed. You'll need to be prepared to handle that and rephrase the error into something nicer. You can match the error message against this:

syntax error in tsquery

Chained DBIx::Class search methods

The Perl module DBIx::Class is an ORM on top of the SQL-level DBI module. It is excellent in many ways.

One of the nicer features of DBIC is chained resultsets. This means you can have a resultset object with a where clause applied to it, and then pass around the object and modify the query by applying further conditions, joining more tables etc.

my $book_rs = $schema->resultset("Book");
my $title_book_rs = $book_rs->search({
    title => "...",
});
my $title_and_author_book_rs = $title_book_rs->search({
    author => "...",
});

# The query is only executed when we ask for the rows
my @book_rows = $title_and_author_book_rs->all;

Of course, we can define one of these calls to search in its own method on the ResultSet::Book class.

sub search_title_rs($self, $title) {
    return $self->search({ author => $title });
}

Now, this is a trivial example, but encapsulating the details about the query into a search method like this makes the actual application code very readable, even when we need to construct quite complicated queries.

Extracting small query fragments like this is also a good way to ensure that details related to each resultset is placed in the class actually responsible for it, rather than spread around the code base as part of larger queries.

This technique is what we'll use to search the fulltext column.

search_tsquery_rs

Using the to_tsvector and to_tsquery syntax in SQL is a bit ardous, and in DBIC it is even more fiddly. But as we just saw it can be nicely encapsulated in a search_tsquery_rs method that can be chained onto the resultset.

Put this in the ResultSet::Book class for your table, or better, in a common base class which all your ResultSet classes inherit from:


=head2 search_tsquery_rs($column, $tsquery) : $rs

Search the $column as a Postgres tsquery fulltext search.

Use &, |, ! between words, and () to group expressions.

Invalid tsqueries will throw an exception that matches
m/syntax error in tsquery/.

=cut

sub search_tsquery_rs($self, $column, $tsquery) {
    $self->search({
        -and => [
            \[
                "to_tsvector('english', $column) @@ to_tsquery('english', ?)",
                $tsquery,
            ],
        ],
    });
}

Note the peculiarities needed to make this work in DBIC:

  • Literal SQL is inserted in the query, but (and this is cool) we can still use placeholders for this
  • This feels more like a hack, but to get a syntactically correct where clause, we need to combine this with an AND operator

Now we can use this method seamlessly in our application code:

my $book_rs = $schema->resultset("Book");
my $book_count = $book_rs
    ->search({ author => "..." })
    ->search_tsquery_rs("blurb", "horror & scifi")
    ->count;

Handling tsquery errors in DBIC

While you can pass around resultset objects and chain them together, the actual SQL query isn't executed until you ask for some data to be returned, e.g. by calling ->all or ->count.

This means that we can't encapsulate the error message handling inside the search_tsquery_rs method itself, because it's not necessarily near the place and time where the exception will be thrown.

Instead you'll have to wrap the place of execution in an eval (or better: use Try::Tiny) and identify the tsquery related exception there.

Wrap-up

That's it!

I hope that was a helpful intro to Postgres fulltext search, and how to use it neatly with DBIx::Class.

To get an idea for how nicely Postgres supports this: all in all, it probably took longer to write up this blog post than to research and implement this for my simple needs. It was also an absolute delight to be using software where well designed parts just seem to work out when composed togeher.

Bonus: Postgres regex search-method

Sometimes, fulltext search is overkill. So while we're at it, here's how to do a LIKE query, except using Postgres' regular expressions.

=head2 search_regex_rs($column, $regex) : $rs

Search the $column as a Postgres SIMILAR TO regex match.

Invalid regexes will throw an exception that matches
m/invalid regular expression/.

=cut

sub search_regex_rs($self, $column, $regex) {
    return $self->search({
        $column => \[ "SIMILAR TO ?" => $regex ],
    });
}

Posted on Nov 5 '18 by:

Discussion

markdown guide
 

Thanks for writing this up! This is immediately useful for a current project.

 

You can avoid the -and by just passing the arrayref-ref as the first arguemnt to search, by the way. It's much nicer. I always chain my literal sql so I don't have to use -and.