DEV Community

Sergey Nikolaev

Posted on • Originally published at manticoresearch.com

Wordforms vs exceptions in Manticore Search

Exceptions and wordforms are two useful tools built into Manticore Search, which you can use to improve search recall and precision. They have a lot in common, but also have important differences that I’d like to cover in this article.

About tokenization

What’s the difference between full-text search (also called free-text search) and wildcard-style search, such as:

  • the commonly known LIKE operator in one form or another
  • or more complex regular expressions?

There are plenty of differences, but it all starts with what each approach does with the initial input text:

  • with the wildcard search approach we normally consider the text as a whole
  • while in the area of full-text search it’s essential to first tokenize the text and then consider each token as a separate entity

When you want to tokenize text you need to decide how to do it, in particular:

  1. Which characters are separators and which are word characters. Normally a separator is a character that doesn’t occur inside a word, for example punctuation marks: ., ,, ?, !, - etc.
  2. Whether the tokens’ letter case should be retained. Normally it’s not, since it would be bad for search if the keyword orange didn’t find Orange.

Manticore does it all automatically. For example, the text “What do I have? The list is: a cat, a dog and a parrot.” gets tokenized into:

mysql> drop table if exists t;
mysql> create table t(f text);
mysql> call keywords('What do I have? The list is: a cat, a dog and a parrot.', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | what      | what       |
| 2    | do        | do         |
| 3    | i         | i          |
| 4    | have      | have       |
| 5    | the       | the        |
| 6    | list      | list       |
| 7    | is        | is         |
| 8    | a         | a          |
| 9    | cat       | cat        |
| 10   | a         | a          |
| 11   | dog       | dog        |
| 12   | and       | and        |
| 13   | a         | a          |
| 14   | parrot    | parrot     |
+------+-----------+------------+
14 rows in set (0.00 sec)

As you can see:

  • the punctuation marks were removed
  • and all the words were lowercased
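
Both effects can be reproduced with a toy tokenizer. The Python sketch below is only an illustration, not Manticore’s implementation: the function name tokenize is hypothetical, and it assumes that word characters are just ASCII letters and digits, which is far narrower than Manticore’s real character tables.

```python
import re


def tokenize(text):
    """Toy tokenizer: lowercase the text, then keep runs of ASCII letters and digits.

    Everything else (spaces, punctuation) acts as a separator.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


tokens = tokenize("What do I have? The list is: a cat, a dog and a parrot.")
print(len(tokens), tokens)
```

Under these assumptions the sentence yields the same 14 lowercase tokens as the call keywords output above.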

Problem

Here comes the first problem: in some cases characters that are normally separators should be treated as regular word characters. For example, in “Is c++ the most powerful language?” it’s obvious that c++ is a separate word. That’s easy for people to understand, but not for a full-text search algorithm: it sees the plus sign, doesn’t find it in its list of word characters and removes it from the token, so you end up with:

mysql> drop table if exists t;
mysql> create table t(f text);
mysql> call keywords('Is c++ the most powerful language?', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c         | c          |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
+------+-----------+------------+
6 rows in set (0.00 sec)

OK, but what’s the problem?

The problem is that after this tokenization if you search for c#, for example, you will find the above sentence:

mysql> drop table if exists t;
mysql> create table t(f text);
mysql> insert into t values(0,'Is c++ the most powerful language?');
mysql> select highlight() from t where match('c#');

+-------------------------------------------+
| highlight()                               |
+-------------------------------------------+
| Is <b>c</b>++ the most powerful language? |
+-------------------------------------------+
1 row in set (0.01 sec)

It happens because c# is also tokenized to just c, so the c from the search query matches the c from the document and the document is returned.

What’s the solution? There are a few options. The first one which probably comes to mind is:

OK, why don’t I add + and # to the list of word characters?

It’s a good and fair question. Let’s try.

mysql> drop table if exists t;
mysql> create table t(f text) charset_table='non_cjk,+,#';
mysql> call keywords('Is c++ the most powerful language?', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c++       | c++        |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
+------+-----------+------------+
6 rows in set (0.00 sec)

It works, but + in the list immediately starts affecting other words and searches, for example:

mysql> drop table if exists t;
mysql> create table t(f text) charset_table='non_cjk,+,#';
mysql> call keywords('What is 2+2?', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | what      | what       |
| 2    | is        | is         |
| 3    | 2+2       | 2+2        |
+------+-----------+------------+
3 rows in set (0.00 sec)

You wanted c++ to be a separate word, but not 2+2, didn’t you?
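
The trade-off can be sketched in a few lines of Python. This is a toy model, not how Manticore processes charset_table: the tokenize function and its extra_word_chars parameter are hypothetical, and the default word characters are assumed to be ASCII letters and digits only.

```python
import re


def tokenize(text, extra_word_chars=""):
    """Toy tokenizer with a configurable set of extra word characters."""
    # Word characters: ASCII letters and digits, plus whatever the caller adds.
    char_class = "a-z0-9" + re.escape(extra_word_chars)
    return re.findall("[" + char_class + "]+", text.lower())


# Default word characters: "c++" and "c#" collapse into the same token "c".
print(tokenize("Is c++ the most powerful language?"))
print(tokenize("c#"))

# With "+" and "#" as word characters the two terms stay distinct...
print(tokenize("Is c++ the most powerful language?", "+#"))
# ...but "2+2" is now glued into a single token as well.
print(tokenize("What is 2+2?", "+#"))
```

The change is global: there is no way to tell this tokenizer that + belongs to c++ but not to 2+2.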

Right, so what can we do?

To treat c++ in a special way, you can make it an exception.

Exceptions

So, exceptions (also known as synonyms) let you map one or more terms (including terms with characters that would normally be excluded) to a single keyword.

Let’s make c++ an exception by putting it into an exceptions file:

➜  ~ cat /tmp/exceptions
c++ => c++

and using the file when we create the table:

mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> call keywords('Is c++ the most powerful language? What is 2+2?', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c++       | c++        |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
| 7    | what      | what       |
| 8    | is        | is         |
| 9    | 2         | 2          |
| 10   | 2         | 2          |
+------+-----------+------------+
10 rows in set (0.01 sec)

Hooray, c++ is now a separate word and the plus signs are not lost, and all is ok with 2+2 too.

What you need to remember about exceptions is that they are not smart at all: they do exactly what you ask them to do and nothing more. In particular:

  • they don’t change the case
  • if you make a mistake and put a double space, they don’t convert it into a single space

and so on. They literally treat your input as an array of bytes.

For example, people write c++ in both lower and upper case. Let’s try the above exception with upper case:

mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> call keywords('Is C++ the most powerful language? How about c++?', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c         | c          |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
| 7    | how       | how        |
| 8    | about     | about      |
| 9    | c++       | c++        |
+------+-----------+------------+
9 rows in set (0.00 sec)

Oops, C++ was tokenized as just c, because the exception is c++ (lower case), not C++ (upper case).

But did you notice that an exception is a pair of items, not a single one: c++ => c++? The left part is what triggers the exceptions algorithm in the text, and the right part is the resulting token. Let’s try adding a mapping of C++ to c++:

➜  ~ cat /tmp/exceptions
c++ => c++
C++ => c++

mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> call keywords('Is C++ the most powerful language? How about c++?', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c++       | c++        |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
| 7    | how       | how        |
| 8    | about     | about      |
| 9    | c++       | c++        |
+------+-----------+------------+
9 rows in set (0.00 sec)

Alright, now it’s fine again, since both C++ and c++ are tokenized into token c++. So satisfying.
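
Here is a rough Python sketch of the behavior seen in the last two examples: exceptions are matched against the raw text exactly as written, and only the remaining text goes through the regular tokenizer. It is a toy model and not Manticore’s algorithm. The names tokenize and tokenize_with_exceptions are hypothetical, and the rule that an exception fires only when it is not glued to a letter or digit is my simplifying assumption.

```python
import re


def tokenize(text):
    """Toy tokenizer: lowercase, keep runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tokenize_with_exceptions(text, exceptions):
    """Apply exact, case-sensitive exceptions to the raw text, then tokenize the rest."""
    if not exceptions:
        return tokenize(text)
    # Longest keys first so that a longer exception wins over its own prefix.
    keys = sorted(exceptions, key=len, reverse=True)
    # Assumption: an exception only fires when it is not glued to a letter or digit.
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(?:"
        + "|".join(re.escape(key) for key in keys)
        + r")(?![A-Za-z0-9])"
    )
    tokens, pos = [], 0
    for match in pattern.finditer(text):
        tokens.extend(tokenize(text[pos:match.start()]))  # ordinary text before the hit
        tokens.append(exceptions[match.group(0)])  # mapped token, emitted as-is
        pos = match.end()
    tokens.extend(tokenize(text[pos:]))  # ordinary text after the last hit
    return tokens


text = "Is C++ the most powerful language? How about c++?"
print(tokenize_with_exceptions(text, {"c++": "c++"}))
print(tokenize_with_exceptions(text, {"c++": "c++", "C++": "c++"}))
```

With only the lower-case rule, C++ falls through to the regular tokenizer and becomes c. With both rules, both spellings end up as the token c++.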

What are other good examples of exceptions?

  • AT&T => AT&T and at&t => AT&T.
  • M&M's => M&M's and m&m's => M&M's and M&m's => M&M's
  • U.S.A. => USA and US => USA

What are the bad examples?

  • us => USA, because we don’t want every us to become USA.

So the rule of thumb with the exceptions is:

Tip:
If a term includes special characters and that’s how it’s normally written in text and in a search query - make it an exception.

Synonyms

Manticore Search users also often call exceptions synonyms, because another use case for them is not just to retain special characters and letter case, but to map terms written completely differently to the same token, for example:

MS Windows => ms windows
Microsoft Windows => ms windows

Why is this important? Because it makes it easy to find documents containing Microsoft Windows by searching for MS Windows and vice versa.

Example:

mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> insert into t values(0, 'Microsoft Windows is one of the first operating systems');
mysql> select * from t where match('MS Windows');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976139 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.00 sec)

So at first glance it works fine, but thinking further about it and recalling that exceptions are case- and byte-sensitive, you may ask yourself: “Can’t people write MicroSoft windows, MS WINDOWS, microsoft Windows and so on?”

Yes, they can. So if you want to use the exceptions for that be ready for what’s called in mathematics a combinatorial explosion.
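
To get a feel for the numbers, here is a small Python sketch that counts only capitalization variants (extra spaces and other spellings would add even more). The helper names count_case_variants and case_variants are made up for this illustration, and the terms are assumed to consist of ASCII letters.

```python
from itertools import product


def count_case_variants(term):
    """Each letter can be lower or upper case, so the count is 2 ** (number of letters)."""
    letters = sum(1 for ch in term if ch.isalpha())
    return 2 ** letters


def case_variants(term):
    """Yield every capitalization of `term`; only practical for very short terms."""
    choices = [(ch.lower(), ch.upper()) if ch.isalpha() else (ch,) for ch in term]
    for combo in product(*choices):
        yield "".join(combo)


print(sorted(case_variants("ms")))               # 4 variants
print(count_case_variants("MS Windows"))         # 9 letters  -> 512
print(count_case_variants("Microsoft Windows"))  # 16 letters -> 65536
```

So covering every capitalization of just these two phrases would take tens of thousands of exception lines.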

That doesn’t look good at all. What can we do about it?

Wordforms

Another tool which is similar to exceptions is wordforms. Unlike exceptions, wordforms are applied after the incoming text is tokenized. So they:

  • are case insensitive (unless your charset_table enables case sensitivity)
  • don’t care about special characters

They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form. For example, to normalize all the variants such as “walks”, “walked”, “walking” to the normal form “walk”:

➜  ~ cat /tmp/wordforms
walks => walk
walked => walk
walking => walk

and use the file when creating the table:

mysql> drop table if exists t;
mysql> create table t(f text) wordforms='/tmp/wordforms';
mysql> call keywords('walks _WaLkeD! walking', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | walks     | walk       |
| 2    | walked    | walk       |
| 3    | walking   | walk       |
+------+-----------+------------+
3 rows in set (0.00 sec)

As you can see, all 3 words were converted to just walk and, note, the 2nd word _WaLkeD!, even though badly deformed, was also normalized fine. Do you see where I’m going with this? Yes, the MS Windows example. Let’s test whether wordforms can help solve that issue.

Let’s put just 2 lines into the wordforms file:

➜  ~ cat /tmp/wordforms
ms windows => ms windows
microsoft windows => ms windows

and populate the table with a few documents:

mysql> drop table if exists t;
mysql> create table t(f text) wordforms='/tmp/wordforms';
mysql> insert into t values(0, 'Microsoft Windows is one of the first operating systems'), (0, 'porch windows'),(0, 'Windows are rolled down');

mysql> select * from t;
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
| 1514841286668976167 | porch windows                                           |
| 1514841286668976168 | Windows are rolled down                                 |
+---------------------+---------------------------------------------------------+
3 rows in set (0.00 sec)

Let’s now try different queries:

mysql> select * from t where match('MS Windows');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.00 sec)

MS Windows finds Microsoft Windows fine.

mysql> select * from t where match('ms WINDOWS');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.01 sec)

ms WINDOWS works fine too.

mysql> select * from t where match('mIcRoSoFt WiNdOwS');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.00 sec)

✅ And even mIcRoSoFt WiNdOwS finds the same document.

mysql> select * from t where match('windows');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
| 1514841286668976167 | porch windows                                           |
| 1514841286668976168 | Windows are rolled down                                 |
+---------------------+---------------------------------------------------------+
3 rows in set (0.00 sec)

✅ Just basic windows finds all the documents.

So indeed, wordforms help to solve the issue.
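
Why does this work regardless of letter case? Because the replacement happens on tokens, not on raw text. The Python sketch below models that idea. It is not Manticore’s implementation: load_wordforms, normalize and matches are hypothetical helpers, word characters are assumed to be ASCII letters and digits, I assume a multi-word destination becomes several tokens, and matching is simplified to “every query token must occur in the document”.

```python
import re


def tokenize(text):
    """Toy tokenizer: lowercase, keep runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def load_wordforms(lines):
    """Parse 'source => destination' lines; both sides go through the tokenizer."""
    forms = {}
    for line in lines:
        source, destination = line.split("=>")
        forms[tuple(tokenize(source))] = tokenize(destination)
    return forms


def normalize(text, forms):
    """Tokenize first, then replace the longest matching token sequence at each position."""
    tokens = tokenize(text)
    longest = max((len(key) for key in forms), default=0)
    result, i = [], 0
    while i < len(tokens):
        for size in range(min(longest, len(tokens) - i), 0, -1):
            replacement = forms.get(tuple(tokens[i:i + size]))
            if replacement is not None:
                result.extend(replacement)
                i += size
                break
        else:
            result.append(tokens[i])  # no wordform starts here, keep the token
            i += 1
    return result


def matches(query, document, forms):
    """Toy matching: every normalized query token must occur in the document."""
    doc_tokens = set(normalize(document, forms))
    return all(token in doc_tokens for token in normalize(query, forms))


forms = load_wordforms(["ms windows => ms windows", "microsoft windows => ms windows"])
docs = [
    "Microsoft Windows is one of the first operating systems",
    "porch windows",
    "Windows are rolled down",
]
for query in ["MS Windows", "ms WINDOWS", "mIcRoSoFt WiNdOwS", "windows"]:
    print(query, "->", [doc for doc in docs if matches(query, doc, forms)])
```

In this model the first three queries normalize to the tokens ms and windows and hit only the first document, while plain windows hits all three, which mirrors the results above.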

The rule of thumb with the wordforms is:

Tip:
Use wordforms for words and phrases that can be written in different forms and don’t contain special characters.

Floor & Decor

Let’s take a look at another example: we want to improve search for the brand name Floor & Decor. We can assume people can write this name in the following forms:

Floor & Decor
Floor & decor
floor & decor
Floor and Decor
floor and decor

and other letter capitalization combinations.

Also:

Floor & Decor Holdings
Floor & Decor Holdings, inc.

and, again, various combinations with different letter capitalized.

Now that we know how exceptions and wordforms work what do we do to cover this brand name?

First of all we can easily notice that the canonical brand name is Floor & Decor, i.e. it includes a special character which is normally considered a word separator, so should we use exceptions? But the name is long and can be written in many ways. If we use exceptions we can end up with a huge list of all the combinations. Moreover there are extended forms Floor & Decor Holdings and Floor & Decor Holdings, inc. which can make the list even longer.

The best solution in this case is to just use wordforms like this:

➜  ~ cat /tmp/wordforms
floor & decor => fnd
floor and decor => fnd
floor & decor holdings => fnd
floor and decor holdings => fnd
floor & decor holdings inc => fnd
floor and decor holdings inc => fnd

Why does it include &? Actually you can skip it:

floor decor => fnd
floor and decor => fnd
floor decor holdings => fnd
floor and decor holdings => fnd
floor decor holdings inc => fnd
floor and decor holdings inc => fnd

because wordforms ignore non-word characters anyway, but it was left in for ease of reading.

As a result, each combination gets tokenized as fnd, which will be our shorthand for this brand name.

mysql> drop table if exists t; create table t(f text) wordforms='/tmp/wordforms';
mysql> call keywords('Floor & Decor', 't');
+------+-------------+------------+
| qpos | tokenized   | normalized |
+------+-------------+------------+
| 1    | floor decor | fnd        |
+------+-------------+------------+
1 row in set (0.00 sec)

mysql> call keywords('floor and Decor', 't');
+------+-----------------+------------+
| qpos | tokenized       | normalized |
+------+-----------------+------------+
| 1    | floor and decor | fnd        |
+------+-----------------+------------+
1 row in set (0.00 sec)

mysql> call keywords('Floor & Decor holdings', 't');
+------+----------------------+------------+
| qpos | tokenized            | normalized |
+------+----------------------+------------+
| 1    | floor decor holdings | fnd        |
+------+----------------------+------------+
1 row in set (0.00 sec)

mysql> call keywords('Floor & Decor HOLDINGS INC.', 't');
+------+--------------------------+------------+
| qpos | tokenized                | normalized |
+------+--------------------------+------------+
| 1    | floor decor holdings inc | fnd        |
+------+--------------------------+------------+
1 row in set (0.00 sec)

Is this the perfect, ultimate solution? Unfortunately not, like many other things in the area of full-text search. There are always rare cases, and that is true here as well. For example:

mysql> drop table if exists t; create table t(f text) wordforms='/tmp/wordforms';
mysql> insert into t values(0,'It\'s located on the 2nd floor. Decor is also nice');
mysql> select * from t where match('Floor & Decor Holdings');

+---------------------+---------------------------------------------------+
| id                  | f                                                 |
+---------------------+---------------------------------------------------+
| 1514841286668976231 | It's located on the 2nd floor. Decor is also nice |
+---------------------+---------------------------------------------------+
1 row in set (0.00 sec)

We can see here that Floor & Decor Holdings finds the document in which the first sentence ends with floor and the next one starts with Decor. This happens because floor. Decor also gets tokenized to fnd, since we use just wordforms, which are insensitive to letter case and special characters:

mysql> call keywords('floor. Decor', 't');
+------+-------------+------------+
| qpos | tokenized   | normalized |
+------+-------------+------------+
| 1    | floor decor | fnd        |
+------+-------------+------------+
1 row in set (0.00 sec)

The false match is not good. To solve this particular problem we can use Manticore’s ability to detect sentences and paragraphs, which is enabled with the index_sp setting.

Once it is enabled, the document no longer matches the query:

mysql> drop table if exists t; create table t(f text) wordforms='/tmp/wordforms' index_sp='1';
mysql> insert into t values(0,'It\'s located on the 2nd floor. Decor is also nice');
mysql> select * from t where match('Floor & Decor Holdings');

Empty set (0.00 sec)

because:

  1. Floor & Decor, as we remember, is converted into fnd by wordforms
  2. index_sp='1' splits the text into sentences
  3. after splitting, floor. and Decor end up in different sentences
  4. so together they no longer produce fnd, and therefore no longer match any of its original forms
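
The effect of these steps can be modeled with a short Python sketch. It only mimics the outcome described in the list, not how index_sp works internally: the functions, the FORMS dictionary and the split_sentences flag are hypothetical, and a sentence is assumed to end at ., ! or ? followed by whitespace.

```python
import re


def tokenize(text):
    """Toy tokenizer: lowercase, keep runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def apply_wordforms(tokens, forms):
    """Replace the longest matching token sequence at each position."""
    longest = max((len(key) for key in forms), default=0)
    result, i = [], 0
    while i < len(tokens):
        for size in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + size])
            if key in forms:
                result.append(forms[key])
                i += size
                break
        else:
            result.append(tokens[i])  # no wordform starts here, keep the token
            i += 1
    return result


def normalize(text, forms, split_sentences=False):
    """Normalize text; optionally stop wordforms from spanning sentence boundaries."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text) if split_sentences else [text]
    result = []
    for part in parts:
        result.extend(apply_wordforms(tokenize(part), forms))
    return result


# A reduced version of the wordforms file, already tokenized.
FORMS = {("floor", "decor"): "fnd", ("floor", "decor", "holdings"): "fnd"}
doc = "It's located on the 2nd floor. Decor is also nice"

print(normalize(doc, FORMS))                        # contains the false "fnd"
print(normalize(doc, FORMS, split_sentences=True))  # "floor" and "decor" stay separate
print(normalize("Floor & Decor Holdings", FORMS, split_sentences=True))
```

With sentence splitting on, the query still normalizes to fnd, but the document no longer contains that token, so there is nothing to match.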

Conclusion

Manticore’s exceptions and wordforms are powerful tools that can help you fine-tune your search, in particular improve recall and precision for short terms with special characters that should be retained and for longer terms that should be aliased to one another. But you need to help Manticore do it, since it can’t decide for you what the names should be.

Thank you for reading this article!

References:
Documentation about exceptions - https://manual.manticoresearch.com/Creating_an_index/NLP_and_tokenization/Exceptions
Documentation about wordforms - https://manual.manticoresearch.com/Creating_an_index/NLP_and_tokenization/Wordforms
