People are bound to make typos. You probably know how Google handles these situations: the search engine suggests a corrected search term, or even automatically corrects the terms you searched for if nothing could be found. This keeps users in their search flow.
Only so much can be solved by optimizing the matching of your search query. This post shows you how to find corrected search terms in Elasticsearch using the suggester query type, and how to apply this in an effective way. If you want to try it out yourself, check out this example repo.
The problem
The worst thing you can do is show no results when you actually have some idea of what results to show. This hurts the user directly: they will either need to fix their search term, or will think the item they are looking for simply is not available. Users are likely to be at least annoyed when presented with no results, and in many cases they might give up their search completely.
Setting up the index
Elasticsearch provides an extensive suggest feature which allows you to autocomplete user input. Today we are going to use the Phrase Suggester to find search terms that are related to some input search terms.
To get the Phrase Suggester to work, we'll need to supply it with a list of generators. Each of these generators will generate suggestions for a single term. The Phrase Suggester will then combine these suggestions in a smart way into a single correction for the whole input.
For example, imagine you have a document with the text "Harry Potter poster". If a user now searches for "poter", you could correct that to either "Potter" or "poster". Without context, you don't know which one is better though. If a user now searches for "harry poter", you do know which suggestion to give: "Potter".
To take the context into account, the Phrase Suggester uses a shingled version of the specified field so it can suggest combinations of terms correctly. The shingled representation of the data lets it determine how commonly the corrected words appear next to each other.
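To get a feel for what a shingle filter produces, here is a rough sketch in Python (my own illustration, not the actual Lucene implementation; note that the real filter also emits the original single terms by default):

```python
def shingles(tokens, min_size=2, max_size=3):
    """Emit word n-grams ("shingles") the way a shingle token filter would."""
    out = []
    for size in range(min_size, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out

print(shingles("harry potter poster".split()))
# ['harry potter', 'potter poster', 'harry potter poster']
```

Because "potter poster" is a stored shingle while "poster potter" is not, the suggester can tell that "harry potter" is a far more plausible correction of "harry poter" than "harry poster".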
Depending on how you configure the shingle filter, the suggester will be able to suggest shorter or longer sentences. For most cases, a sensible analyzer would be set up like this:
...
"analysis": {
  "analyzer": {
    "shingle": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [
        "standard",
        "lowercase",
        "shingle_filter"
      ]
    }
  },
  "filter": {
    "shingle_filter": {
      "type": "shingle",
      "min_shingle_size": 2,
      "max_shingle_size": 3
    }
  }
}
...
We need to apply this analyzer to a field, so we add it to the mapping. In this example, we've got an index containing books. To be able to suggest terms based on the title field, we would set up the following mapping:
...
"mappings": {
  "book": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "shingle": {
            "type": "text",
            "analyzer": "shingle"
          }
        }
      },
      "author": {
        "type": "keyword"
      },
      "year": {
        "type": "integer"
      }
    }
  }
}
...
This adds the shingle analyzer we just defined to the title field.
Querying related search terms
Now that we have the index set up, we're going to add some documents to it. In the example repository you can do this by running npm start. We can then execute the following query to get suggestions for any text a user might type:
{
  "suggest": {
    "text": "prisoner of axaban",
    "phrase_suggester": {
      "phrase": {
        "field": "title.shingle",
        "confidence": 0.0,
        "direct_generator": [
          {
            "field": "title.shingle"
          }
        ]
      }
    }
  }
}
The only non-default value here is confidence. By default, confidence is set to 1.0, but that would limit the results too much for this use case.
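If you build this request in application code, a small helper keeps the confidence tweakable in one place (a sketch using the field and suggester names from the query above):

```python
def build_suggest_body(text, field="title.shingle", confidence=0.0):
    """Build a phrase-suggester request body for the given user input.

    The direct generator produces per-term candidates from `field`;
    the phrase suggester combines them into whole-phrase corrections.
    """
    return {
        "suggest": {
            "text": text,
            "phrase_suggester": {
                "phrase": {
                    "field": field,
                    "confidence": confidence,
                    "direct_generator": [{"field": field}],
                }
            },
        }
    }

body = build_suggest_body("prisoner of axaban")
```

The resulting dict can be passed straight to your Elasticsearch client as the request body.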
Running this query will give us a list of options:
...
"options": [
  {
    "text": "prisoner of azkaban",
    "score": 0.22856912
  },
  {
    "text": "prisoner of axaban",
    "score": 0.040878013
  }
]
...
As you can see, a corrected version of the whole search term is returned. However, with the confidence set this low, the original search term is also included as an option. With this query alone we cannot guarantee that each option will actually return results. Luckily, there is a way to filter out options that do not improve on our current state of showing no results.
Pruning the options
The Phrase Suggester accepts a collate field, which contains a scripted query that will be executed for each returned option. By adding this to our suggest query, each option will include a boolean telling us whether it returned results for the collate query.
In our example, we're probably already happy if the suggested option fuzzy matches the title field. The following query contains the collate query with a fuzzy match:
{
  "suggest": {
    "text": "prisoner of axaban",
    "simple_phrase": {
      "phrase": {
        "field": "title.shingle",
        "confidence": 0.0,
        "direct_generator": [
          {
            "field": "title.shingle"
          }
        ],
        "collate": {
          "query": {
            "source": {
              "match": {
                "title": {
                  "query": "{{suggestion}}",
                  "fuzziness": "1",
                  "operator": "and"
                }
              }
            }
          },
          "prune": true
        }
      }
    }
  }
}
Running this query adds the collate_match field to the response:
"options": [
  {
    "text": "prisoner of azkaban",
    "score": 0.22856912,
    "collate_match": true
  },
  {
    "text": "prisoner of axaban",
    "score": 0.040878013,
    "collate_match": false
  }
]
We can now filter out options based on collate_match. In the example, we only run this suggest query when the search query returns no results. If we find a suggestion that matches, we replace the current search term with the suggestion and execute the search query again.
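That fallback flow can be sketched like this (a sketch, not the example repo's actual code; `search` and `suggest` are hypothetical callables wrapping your Elasticsearch client):

```python
def best_suggestion(suggest_response, name="simple_phrase"):
    """Return the first suggested text whose collate query matched documents."""
    for entry in suggest_response["suggest"][name]:
        for option in entry["options"]:
            if option.get("collate_match"):
                return option["text"]
    return None

def search_with_fallback(search, suggest, term):
    """Run the search; on zero hits, retry once with the best collated suggestion."""
    result = search(term)
    if result["hits"]["total"]["value"] > 0:
        return term, result
    corrected = best_suggestion(suggest(term))
    if corrected is None:
        return term, result  # no usable suggestion; keep the empty result
    return corrected, search(corrected)
```

Returning both the (possibly corrected) term and the result lets the UI tell the user that their input was rewritten.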
Improving the suggestions
Depending on your use case, you may want to play with the confidence parameter. The default value of 1.0 is sensible if you are going for functionality that will automatically change user input. Lower values make sense if you want to determine whether to show suggestions to users based on your own logic (e.g. few or no results), or based on the score that is returned.
By default, the suggestions will correct at most one term. You can increase this by setting the max_errors parameter to 2 or more, or to a fraction of the input term count (for example, setting it to 0.5 allows 50% of the terms to be corrected). Setting this parameter to 2 can be useful if the input usually contains multiple terms. As the documentation for this parameter notes, setting this value too high will negatively impact performance.
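The two interpretations of max_errors can be illustrated like this (my own sketch of the documented semantics; the exact rounding Elasticsearch applies to fractional values is an assumption here):

```python
def max_correctable_terms(query, max_errors):
    """How many terms may be corrected for a given query.

    Values >= 1 are an absolute count; values between 0 and 1 are a
    fraction of the input term count (truncation assumed for illustration).
    """
    term_count = len(query.split())
    if max_errors >= 1:
        return int(max_errors)
    return int(term_count * max_errors)

print(max_correctable_terms("harry poter and the philosophers stone", 0.5))
# 6 terms * 0.5 -> 3 corrections allowed
```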
Conclusion
The Elasticsearch suggest functionality is a really powerful and versatile tool. Most of the hard work is abstracted away, but you still get a lot of control over the results by tweaking the numerous parameters the query accepts. There are a lot of ways I could see someone using this functionality:
- Auto correcting errors in search queries if confidence is high enough
- Showing spelling corrections in autocomplete/typeahead
- Detecting possible spelling errors in a CMS, based on data already in the index
- Finding human input errors in large datasets
If you want to try any of this out, get started by checking out the Github repo.
Top comments (7)
Every single time I used this, I kinda regretted it. It was hard to configure for bigger datasets and I often got funny suggestions.
For instance beer would be suggested as bear 😂
Yeah agreed, it can give interesting suggestions. That's why I would mostly use it in cases where any alternative is better than showing no results.
Out of curiosity, what was your use case? Did you try limiting your suggestions with a collate query?
To solve the problem in this example ("bear" vs "beer"), we'd need to provide the user's query history alongside the current request, so the search can determine whether the user hunts often, uses the Bear framework, or brews wheat beer.
That's a very good suggestion. Thank you.
It was classic search. Search data consisted of categories, countries, companies, brands, other attributes. Millions of possible combinations of these.
I did use collate query but it was very likely that suggestion with bear also would have brought back results. But there's a chance I did something wrong. 🤔
Same to you, bro... I'm in your situation.
With these things I always wonder how much of this is pure ElasticSearch or "just" some icing on Lucene.