How do you choose between different analyzers and queries to get the best search performance? Benchmarking, of course!
Deploying a large-scale full-text search engine can be very hard. Elasticsearch makes the job much easier but it’s not one size fits all — quite the contrary.
Elasticsearch has many configurations and features, but having many features also means many ways to achieve the same goal, and it's not always obvious which one is best for the product you're building.
Let's start by looking at the main ways to find users by their username or name, measuring their performance, advantages, and drawbacks.
Match Query + Fuzziness

This matches terms using the fuzziness parameter, which tolerates a bounded number of character edits between the search term and the indexed term.

Pros:

- Simple to use
- Doesn't use much space
- Allows fuzzy search

Cons:

- If the indexed word is longer than the search term plus the allowed edit distance, it won't match
- Fuzzy search can slow things down
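As a minimal sketch, here is what a fuzzy match query body could look like. The index and field names (`users`, `username`) are illustrative assumptions, not taken from the post:

```python
def match_fuzzy_query(term: str, fuzziness: str = "AUTO") -> dict:
    """Build a match query body that tolerates small typos via fuzziness."""
    return {
        "query": {
            "match": {
                "username": {
                    "query": term,
                    # "AUTO" scales the allowed edit distance with term length
                    "fuzziness": fuzziness,
                }
            }
        }
    }
```

You would pass this body to your client's search call against the hypothetical `users` index.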
Prefix Query

Pros:

- Simple to use
- Potentially very fast (especially if you use the index_prefixes option)

Cons:

- Only matches if the indexed term starts with the search term
- The index_prefixes option uses more disk space
- No fuzzy search
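A sketch of the two pieces involved: a mapping that enables index_prefixes (which pre-indexes prefixes in a configurable length range, trading disk for speed) and a prefix query body. Field name, index layout, and the 1–10 character range are assumptions for illustration:

```python
# Mapping with index_prefixes enabled on the (hypothetical) "username" field.
USERS_MAPPING = {
    "mappings": {
        "properties": {
            "username": {
                "type": "text",
                # Pre-index prefixes of 1 to 10 characters (illustrative values)
                "index_prefixes": {"min_chars": 1, "max_chars": 10},
            }
        }
    }
}

def prefix_query(term: str) -> dict:
    """Build a prefix query body; only matches terms starting with `term`."""
    return {"query": {"prefix": {"username": {"value": term}}}}
```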
Wildcard Query

Works much the same way as "LIKE %term%" in a relational database SELECT.

Pros:

- Easy to implement and debug
- Matches even if the search term is in the middle of a word

Cons:

- Usually the slowest option, especially if the wildcard is placed at the start or very few characters are used
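For completeness, a sketch of the query body being benchmarked here; the field name is again an illustrative assumption:

```python
def wildcard_query(term: str) -> dict:
    """Build a wildcard query body matching `term` anywhere in the word."""
    # The leading "*" is what makes this so slow: Elasticsearch can no longer
    # seek into the term dictionary and has to examine far more terms.
    return {"query": {"wildcard": {"username": {"value": f"*{term}*"}}}}
```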
Match Query + Ngram Analyzer

Pros:

- Good search performance
- Allows a degree of "fuzzy" matching, since it indexes segments of each word

Cons:

- Requires a specialized analyzer
- Uses more disk space
- Only matches if the search term is at least as long as the smallest "gram"
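A minimal sketch of index settings for this option, using an ngram tokenizer. The analyzer/tokenizer names, the field name, and the choice of 3 for both gram sizes are assumptions; the key point (keeping min_gram and max_gram close together) is discussed further below:

```python
# Illustrative index settings: a trigram analyzer at index time, with the
# standard analyzer at search time so queries aren't re-chopped into grams.
NGRAM_INDEX = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "trigram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 3,  # keep min/max close to limit disk usage
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "trigram_analyzer": {
                    "type": "custom",
                    "tokenizer": "trigram_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "username": {
                "type": "text",
                "analyzer": "trigram_analyzer",
                "search_analyzer": "standard",
            }
        }
    },
}
```

With these settings, a plain match query on `username` will hit any document whose name contains the searched trigrams.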
To run the benchmarks, I've created a small Python script that spawns 4 parallel processes, each running 1,000 consecutive queries. It repeats that for each kind of query. The main objective is not to measure how long each query takes in absolute terms, but to compare their execution times under the same conditions.
- Time in seconds is calculated by summing the time of the 1,000 runs and then averaging across the 4 parallel processes
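The setup above can be sketched as follows. The actual Elasticsearch call is stubbed out with `noop_search`, since the post doesn't show the client or index details; in a real run you'd replace it with something like `es.search(index=..., body=query_body)`:

```python
import time
from concurrent.futures import ProcessPoolExecutor
from functools import partial

N_PROCESSES = 4
N_QUERIES = 1000

def run_queries(search_fn, query_body: dict) -> float:
    """Time N_QUERIES consecutive calls to search_fn with the same body."""
    start = time.perf_counter()
    for _ in range(N_QUERIES):
        search_fn(query_body)
    return time.perf_counter() - start

def noop_search(query_body: dict) -> None:
    """Stand-in for a real Elasticsearch client call."""

def benchmark(search_fn, query_body: dict, processes: int = N_PROCESSES) -> float:
    """Sum the time of 1,000 runs per process, then average across processes."""
    worker = partial(run_queries, search_fn)
    with ProcessPoolExecutor(max_workers=processes) as pool:
        totals = list(pool.map(worker, [query_body] * processes))
    return sum(totals) / processes
```

Note that on platforms using the spawn start method, `benchmark` must be invoked from under an `if __name__ == "__main__":` guard.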
- Avoid the wildcard query at all costs: I see the wildcard query recommended everywhere, but as we saw, it is the slowest option and you can get better results with any of the others.
- If you can live with matching only the beginning of a word: The prefix query can do this job, and it can do it really fast. If your use case fits this, it’s a good choice. There is also the possibility of using the index_prefix option to speed things up even more at the cost of disk space.
- If you want to save on disk space: using the standard analyzer with a match query plus the fuzziness parameter should do the trick.
- If you need matches even when the search term is in the middle of a word, and you really need it to be fast: ngram seems to be the choice in this case. It can be "dangerous" to use at times, though.
When using the ngram analyzer, avoid a big gap between the min and max gram sizes, and also avoid very small gram sizes (like 1, used just to return results for single-letter searches).
If you have a big range of gram sizes, it will become very expensive disk-wise and potentially degrade your performance.
Instead, you could, for example, fall back to a field that uses the standard analyzer and perform a simple match or prefix query whenever your search_term is shorter than min_ngram_size.
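That fallback can be sketched in a few lines. The sub-field names (`username.raw` for the standard-analyzed field, `username.ngram` for the ngram-analyzed one) and the minimum gram size are hypothetical:

```python
MIN_NGRAM_SIZE = 3  # assumed to match the analyzer's min_gram setting

def build_search(term: str) -> dict:
    """Pick a query type based on the search term's length."""
    if len(term) < MIN_NGRAM_SIZE:
        # Too short for the ngram field to match: fall back to a prefix
        # query on a standard-analyzed sub-field (hypothetical name).
        return {"query": {"prefix": {"username.raw": {"value": term}}}}
    # Long enough: a plain match against the ngram-analyzed sub-field.
    return {"query": {"match": {"username.ngram": term}}}
```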
- The Complete Guide to Increase Your Elasticsearch Write Throughput
- Using Event Sourcing to Increase Elasticsearch Performance
How does this all sound? Is there anything you’d like me to expand on? Let me know your thoughts in the comments section below (and hit the clap if this was useful)!
Stay tuned for the next post. Follow so you won’t miss it!