Scenario
Let's consider a Mongo collection topics that looks somewhat like:
[
  {
    "_id": 1,
    "title": "computer science"
  },
  {
    "_id": 2,
    "title": "numerical computation"
  },
  {
    "_id": 3,
    "title": "medical science"
  },
  {
    "_id": 4,
    "title": "information technology"
  },
  {
    "_id": 5,
    "title": "sci fi"
  },
  ...
]
We would like to retrieve documents whose title "contains" comp.
Here are a few possible ways we can consider:
OPTION 1 - Simple regex query
We can perform a regex query on the title field, like this:
db.topics.find({title: /.*comp.*/i})
OR
db.topics.find({title: {$regex: "comp", $options: "i"}})
NOTE: The i option makes the search case insensitive.
Mongo will need to examine every document in the collection and perform a regex match on the title field.
We can define an index (nothing fancy, the usual index) on the title field, like:
db.topics.createIndex({title: 1})
Now if we perform the regex query, Mongo will not examine all the documents, but it still needs to examine all the index keys (a non-anchored, case-insensitive regex cannot use index bounds), and then examine and retrieve only the documents that matched.
This works well for a small collection, but for large collections (more than 100k or 1M docs) the query becomes slow, and it also consumes a lot of resources on the MongoDB instances, which is definitely bad.
This solution is the simplest to implement, but doesn't scale well.
OPTION 2 - Index on tokenised words
If we are okay with supporting search only by whole words, and not partial words, then tokenising the title field can help. E.g.:
[
  {
    "_id": 1,
    "title": "computer science",
    "titleTokens": ["computer", "science"]
  },
  {
    "_id": 2,
    "title": "numerical computation",
    "titleTokens": ["numerical", "computation"]
  },
  {
    "_id": 3,
    "title": "medical science",
    "titleTokens": ["medical", "science"]
  },
  {
    "_id": 4,
    "title": "information technology",
    "titleTokens": ["information", "technology"]
  },
  {
    "_id": 5,
    "title": "sci fi",
    "titleTokens": ["sci", "fi"]
  },
  ...
]
While saving / updating documents, we can:
- convert the title field to lower case (to support case-insensitive searches)
- split the text into multiple words
- save those words as an array, as titleTokens
And then we can define a multikey index on titleTokens.
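The save/update-time steps above can be sketched as a small helper. This is just an illustration; the function name and the whitespace-splitting rule are our own choices, not part of any library:

```javascript
// Hypothetical tokeniser: lower-cases the title and splits it on runs
// of whitespace, mirroring the steps described above. The same function
// must be reused on the search text so the tokens line up.
function tokenise(title) {
  return title
    .toLowerCase()
    .split(/\s+/)               // split on runs of whitespace
    .filter(t => t.length > 0); // drop empties from leading/trailing spaces
}

// Before saving / updating a document, attach the tokens:
// doc.titleTokens = tokenise(doc.title);
```

With the tokens in place, the multikey index is created the usual way: db.topics.createIndex({titleTokens: 1})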
Now if we need to search for a text, we can:
- convert the search text to lower case
- split the search text into multiple words (using the same algorithm used to tokenise the title field)
- query the titleTokens array with the tokens obtained from the search text, e.g.:
db.topics.find({titleTokens: {$all: ["science"]}})
The above query will retrieve all documents having the token "science" in the titleTokens field, i.e. documents #1 and #3.
OR
db.topics.find({titleTokens: {$all: ["computer", "science"]}})
The above query will retrieve all documents having both the "computer" and "science" tokens, and hence will retrieve only document #1.
Using this solution we cannot search by partial words like "comp", but searches by whole words are pretty fast.
OPTION 3 - Text index
Mongo allows one text index per collection. It works pretty much the same way as OPTION 2: the title field's value is tokenised into words, and only search by full words (not partial words) is supported. (A text index also applies language-specific stemming and stop-word removal by default.)
The index can be created like this:
db.topics.createIndex({title: "text"})
This is one way to query the text index:
db.topics.find({$text: {$search: "computer"}})
The above query will retrieve all documents having the word "computer" in the title field.
Here is another:
db.topics.find({$text: {$search: "computer science"}})
The above query will retrieve all documents having either "computer" or "science" in the title field, i.e. documents #1 and #3.
Yet another:
db.topics.find({$text: {$search: "\"computer sci\""}})
This is a phrase query. It will:
- tokenise the search text into "computer" and "sci"
- fetch documents containing either "computer" (matching doc #1) or "sci" (matching doc #5)
- among these documents, examine the title field and check whether the exact phrase "computer sci" is present. Only doc #1 passes this check, and it is returned as the result of the query.
Using this solution we can efficiently search for full words or for phrases containing at least one full word.
OPTION 4 - Atlas Search
This is only applicable when MongoDB is running on Atlas.
Atlas provides a very powerful way to search through documents for partial words. Autocomplete indexing and searching perfectly meets our use case.
We can define a search index on topics following these steps:
- Choose an analyser, maybe whitespace. This will tokenise the field's value into words.
- Disable dynamic mapping (assuming that we already know which field to index, and the field name / hierarchy doesn't change).
- Define explicit fields to index; in this case it can be the title field.
- Also define two datatypes for the field:
  - String: for complete text matching (also needed alongside autocomplete if we want to increase the search score of completely matched strings)
  - Autocomplete: for partial matching
- Choose the tokeniser for the Autocomplete datatype:
  - edgeGram: allows prefix searches against tokenised words. E.g. searching by "comp" will return entities #1 and #2, but searching by "puter" won't return anything.
  - nGram: behaves like a contains search. E.g. searching by "put" will return entities #1 and #2.
- Define:
  - minGrams: the minimum length from which tokenisation starts. If minGrams is 2, then searching for "c" returns no results; we need at least 2 characters. We need to be careful with this field: a very low minGrams value (like 1) with the nGram tokeniser results in a huge index, while a very high value won't be useful enough.
  - maxGrams: the maximum number of characters that will be tokenised.
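To make the edgeGram / minGrams / maxGrams behaviour concrete, here is a sketch in plain JavaScript (not an Atlas API) of the prefix tokens an edgeGram tokeniser would conceptually store for one word:

```javascript
// Illustrative only: generates the edge-gram tokens for one word, the
// way an edgeGram tokeniser conceptually does inside the index.
function edgeGrams(word, minGrams, maxGrams) {
  const grams = [];
  const max = Math.min(maxGrams, word.length);
  for (let len = minGrams; len <= max; len++) {
    grams.push(word.slice(0, len)); // prefixes only: "co", "com", ...
  }
  return grams;
}

edgeGrams("computer", 2, 5);
// → ["co", "com", "comp", "compu"]
```

Searching by "comp" matches because "comp" is one of the stored grams; "puter" is not a prefix, so it matches nothing, which is exactly the edgeGram behaviour described above.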
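Putting the steps above together, the Atlas Search index definition could look roughly like this (the minGrams / maxGrams values and the choice of the lucene.whitespace analyser are illustrative assumptions; check the Atlas docs for the full schema):

```json
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "title": [
        { "type": "string", "analyzer": "lucene.whitespace" },
        {
          "type": "autocomplete",
          "analyzer": "lucene.whitespace",
          "tokenization": "edgeGram",
          "minGrams": 2,
          "maxGrams": 15
        }
      ]
    }
  }
}
```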
Once the Atlas search index is ready, we can query the mongo collection using an aggregation pipeline like this:
db.topics.aggregate([
  {
    $search: {
      index: <name of atlas search index>,
      autocomplete: {
        query: "comp",
        path: "title"
      }
    }
  }
])
NOTE:
- These search indexes can accomplish a lot more (fuzzy search, compound queries, and more).
- The search index can be defined via Terraform (if we want to follow infrastructure as code :-)).
Using this solution we can efficiently search for prefixes of words, or for any part of the text, provided we run MongoDB on Atlas.
The options above were the easy solutions to the problem. Below are some not-so-easy solutions:
ELASTICSEARCH
If we strictly do not want to use Atlas Search, we may consider:
- continuously syncing the Mongo collection (maybe only the title and id fields) to Elasticsearch
- defining appropriate autocomplete edgeGram / nGram indexes
- executing queries against Elasticsearch and retrieving matched documents by id from Mongo. Or we may store the whole document in Elasticsearch itself; in that case we need not retrieve docs by id from Mongo, as we already get the full document from Elasticsearch.
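As a rough sketch of the second step (the index, tokenizer, and analyzer names here are made up; consult the Elasticsearch docs for the exact schema), an index with an edge_ngram analyzer for the synced title field could be defined like this:

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "title_edge": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "title_autocomplete": {
          "type": "custom",
          "tokenizer": "title_edge",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "title_autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
```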
FROM SCRATCH!
If none of the above options work for us, we can build our own indexes using a prefix trie or a suffix trie, and point to Mongo document ids (or any database reference) from them. Searching the index for all possibilities of a given search text may become a difficult / costly operation, so we may consider caching the top results in each trie node. If space becomes an issue, we can distribute the trie across multiple nodes.
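A minimal sketch of the prefix-trie idea, assuming words have already been tokenised as in OPTION 2 (the class and field names are our own; per-node result caching and sharding are left out):

```javascript
// Minimal prefix trie mapping word prefixes to Mongo document ids.
// Every node stores the ids of all documents containing a word that
// starts with the prefix spelled out by the path to that node.
class TrieNode {
  constructor() {
    this.children = new Map(); // char -> TrieNode
    this.ids = new Set();      // document ids reachable via this prefix
  }
}

class PrefixTrie {
  constructor() { this.root = new TrieNode(); }

  // Index every prefix of `word`, pointing back to the document id.
  insert(word, id) {
    let node = this.root;
    for (const ch of word.toLowerCase()) {
      if (!node.children.has(ch)) node.children.set(ch, new TrieNode());
      node = node.children.get(ch);
      node.ids.add(id);
    }
  }

  // Return the ids of all documents with a word starting with `prefix`.
  search(prefix) {
    let node = this.root;
    for (const ch of prefix.toLowerCase()) {
      node = node.children.get(ch);
      if (!node) return [];
    }
    return [...node.ids];
  }
}

const trie = new PrefixTrie();
trie.insert("computer", 1);
trie.insert("computation", 2);
trie.search("comp"); // → [1, 2]
```

The matched ids would then be used to fetch the full documents from Mongo.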
This is definitely not an easy job. We'll encounter multiple challenges / questions, related to:
- Pre-computation of search results in each trie node: what algorithm to use? How do we keep the trie index in sync with the underlying Mongo collection? How much latency is acceptable? What will the data infrastructure look like?
- Trie sharding: how do we evenly distribute the trie across multiple nodes?
Please feel free to suggest better ideas / corrections to this content. Thanks in advance.