DEV Community

Le Dev Novice

Must and filter, a slight subtlety with serious consequences

When writing an Elasticsearch query to search and retrieve data from an index, the number of available options can easily raise questions. Sometimes it is hard to see any real difference between one choice and another. This is particularly true of the two keywords must and filter. They look very similar, but there is a subtle difference between them that is worth understanding, because it allows a more informed choice of one keyword or the other.

At first glance, the same search query run against the same Elasticsearch index seems to behave identically with the must keyword and with the filter keyword. To illustrate the point of this article, I set up an example index containing more than 2 million documents.

Below are two different queries on our example index. One uses the filter keyword, the other the must keyword. Both aim to retrieve only the documents that have a code field at the root.

{
    "query": {
        "bool": {
            "filter": [
                {
                    "exists": {
                        "field": "code"
                    }
                }
            ]
        }
    }
}
The same query with must:
{
    "query": {
        "bool": {
            "must": [
                {
                    "exists": {
                        "field": "code"
                    }
                }
            ]
        }
    }
}
Response (same total for both queries):
"hits": {
        "total": {
            "value": 194026,
            "relation": "eq"
        },
...

The responses show that both queries return the same number of documents. So what is the real difference between the two keywords if their logic seems so similar?

Elasticsearch can calculate a relevance score for each document returned by a search query. It can then rank the results of the search according to that score.

This is where the subtle difference between must and filter lies. A filter clause only checks whether a document matches, and no score is calculated for it during the search. A must clause, by contrast, always contributes to the score. Putting it in a search query means accepting that we also want a relevance calculation for each of the documents returned.
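As a rough illustration of how the two contexts can sit side by side, here is a minimal Python sketch that builds a search body as a plain dict. The `title` field and the search text are hypothetical and do not come from the example index; only the `exists` clause on `code` is taken from the queries above.

```python
import json


def build_search_body(text, scored_field="title", required_field="code"):
    """Build a bool query mixing a scored clause and a non-scored clause.

    `scored_field` is a hypothetical full-text field used for illustration.
    """
    return {
        "query": {
            "bool": {
                # Query context: this clause contributes to each hit's _score.
                "must": [{"match": {scored_field: text}}],
                # Filter context: a yes/no test only, no score is computed.
                "filter": [{"exists": {"field": required_field}}],
            }
        }
    }


body = build_search_body("elasticsearch")
print(json.dumps(body, indent=4))
```

The resulting dict serialises to the same JSON shape as the queries shown earlier, so it could be sent as the body of a search request.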

We can also see the difference in the response of queries each using one of the two keywords.

Filter: "max_score": 0.0
Must: "max_score": 1.0

This is a very basic search: every document returned has the code field, so must gives all of them the same relevance score. The filter response also contains a score, but it is set to 0 by default, which shows that no score calculation was carried out.
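One possible way to check this programmatically is to look at `max_score` in the parsed response. The sketch below assumes the response has already been decoded into a Python dict; the two sample dicts only reuse the values quoted in this article, and the helper name is made up for the example.

```python
def scoring_was_applied(response):
    """Return True when the response carries a non-zero max_score."""
    max_score = response["hits"].get("max_score")
    # max_score may be missing or None in some responses, so guard for that.
    return max_score is not None and max_score > 0


# Trimmed sample responses using the figures quoted in the article.
filter_response = {
    "hits": {"total": {"value": 194026, "relation": "eq"}, "max_score": 0.0}
}
must_response = {
    "hits": {"total": {"value": 194026, "relation": "eq"}, "max_score": 1.0}
}

print(scoring_was_applied(filter_response))  # False
print(scoring_was_applied(must_response))    # True
```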

If we run a performance test of the two keywords on an index with only a few dozen or even a few hundred documents, the difference in query execution time will probably be negligible. However, on our example index with several million documents, the query execution time for each keyword is as follows:

Filter: 30 ms
Must: 425 ms

In our example, the query using the filter keyword runs about 14 times faster than the same query, in the same environment and returning the same results, written with the must keyword, which triggers the calculation of a relevance score for each document matching the search criteria.
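The ratio is simple arithmetic on the two timings above. The article does not say how they were collected; one option, assumed here, is the `took` value (in milliseconds) that Elasticsearch includes in each search response. The helper below is a hypothetical sketch that works on any pair of millisecond timings.

```python
def speedup(slow_ms, fast_ms):
    """Return how many times faster the fast query is than the slow one."""
    if fast_ms <= 0:
        raise ValueError("fast_ms must be a positive duration")
    return slow_ms / fast_ms


# Timings quoted in the article: must = 425 ms, filter = 30 ms.
ratio = speedup(425, 30)
print(f"filter is about {ratio:.1f}x faster")  # filter is about 14.2x faster
```

A single pair of measurements is only indicative, so it is probably worth repeating each query several times and comparing typical values.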

It is therefore important to understand clearly what your application needs and the context in which the search is carried out. Using the must keyword involves additional work during the search query. In a small-scale project, the choice of one keyword or the other will make little difference. On larger projects with much bigger Elasticsearch indices, however, this choice can prove far more costly for the application's search performance.

To avoid these pitfalls as the project grows in scale, only use the must keyword when your application really requires the returned documents to be ranked. These are the scenarios where users have a real need to sort their search results by relevance and where this feature brings them real added value.

In all other cases, the filter keyword will be sufficient, and will even be the better option.
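When ranking is not needed, an existing query can be adapted mechanically. The following is a minimal sketch, not an official Elasticsearch utility: a hypothetical helper that takes a search body as a Python dict and moves its must clauses into filter. Both clause types require a match, so the set of matching documents stays the same and only the scoring changes.

```python
import copy


def to_filter_context(body):
    """Return a copy of a search body with bool.must moved into bool.filter."""
    result = copy.deepcopy(body)  # leave the caller's dict untouched
    bool_query = result["query"]["bool"]
    must = bool_query.pop("must", [])
    existing = bool_query.get("filter", [])
    # A bool section may hold a single clause or a list; normalise to lists.
    if isinstance(must, dict):
        must = [must]
    if isinstance(existing, dict):
        existing = [existing]
    bool_query["filter"] = existing + must
    return result


# The must query from the beginning of the article.
must_query = {"query": {"bool": {"must": [{"exists": {"field": "code"}}]}}}
filter_query = to_filter_context(must_query)
print(filter_query)
```

Applied to the must query from the start of the article, this produces the filter query shown next to it.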
