Simple Scalable Search Autocomplete Systems

#go #redis #elasticsearch #sql

Here I'll discuss 4 ways to build a simple scalable search autocomplete service that I know of and some pros and cons of each of them.

Requirements

One API for inserting texts/words that were searched by users. For example :

POST /insert
{
  "search string" : "Harry Potter and the Prisoner of Azkaban"
}

One API for getting autocomplete suggestions as the user keeps typing. For example :

GET /search?SearchString=potter

One API to clear all searches made by the user. For Example -

DELETE /delete

The system should be able to return top N frequent relevant queries/searches.
The system should be horizontally scalable using shards and read replicas.
The calculations should be In-memory to make the system fast and real-time.

Way 1: Using SQL

SELECT * FROM autocomplete 
WHERE search_items LIKE '%{SEARCH_STRING}% 
ORDER BY frequency DESC 
LIMIT 5';

PostgreSQL (and some other SQL databases) have LIKE operator which can be used to search patterns in text data. Percent sign (%) matches any sequence of zero or more characters. So %{SEARCH_STRING}% will return all rows in autocomplete table where {SEARCH_STRING} exists in any position of the search-items field. We would also want to get rid of less frequent search items before they pile up. There is no direct way to do this. We could have an "expire" field which we would update every time insert API is called, that is when a user searches something. Another process could periodically delete the expired items. This can do infix matching (i.e. matching terms at any position within the input). This is the simplest way to do this but this actually doesn't happen in memory.

Using Redis Sorted Set

Redis is an in-memory database. So we could try to exploit Redis's data structures to make a search autocomplete system. Salvatore Sanfilippo Antirez, the creator of Redis, wrote about 2 ways about how it could be done in his old blog post from 2010. You can read the original blog post but here I'm gonna explain both of the ways. Both of the ways use sorted set data structure of Redis which is implemented using the skip list data structure under the hood.

Way 2: Using 1 Sorted Set

If any 2 members in a Redis's sorted set have an equal score they are sorted lexicographically. For each prefix in the search string, we'll store them in a sorted set with the same score. So, they will be lexicographically sorted in the set. To mark an end word we'll add '*' to it in the end. For example, if we searched "bangladesh" and "bangkok", our sorted set will look like this -

 0 b
 1 ba
 2 ban
 3 bang
 4 bangk
 5 bangko
 6 bangkok
 7 bangkok*
 8 bangl
 9 bangla
10 banglad
11 banglade
12 banglades
13 bangladesh
14 bangladesh*

Each insertion in the sorted set happens at O(logN) complexity where N is the size of the sorted set. So total complexity of insert each time the insert API is called is O(L*logN) where L is the average length of a search string. We can use the ZRANK command to find any member's index/position in the sorted set. We then can use ZRANGE command to fetch M members from that position in O(M) complexity. If any fetched member has '*' in the end and it contains the search string as a prefix, then we can return it as a suggestion.
For example, if the user types 'bang' we can find the index of 'bang' in a sorted set, it's 3. Now if we fetch 12 items from position 3 we would fetch -

[
bang,bangk,bangko,bangkok,bangkok*,
bangl,bangla,banglad,banglade,
banglades,bangladesh,bangladesh*
]

Among these members bangkok* and bangladesh* has "bang" in the prefix and "*" in the end. So they can go into the suggestion. Our API would return [bangkok,bangladesh].
But this method cannot return top N frequent relevant searches. Also, there is no way to get rid of less frequent searched items. On top of that, this cannot do infix matching (i.e. matching terms at any position within the input).

Here is my full code implementing this method

Way 3: Using Multiple Sorted Sets

For each prefix of the searched string, we'll have 1 sorted set. The searched string will be a member of each of those sorted sets. The score of each member will be the frequency. The complexity of insert here is L*log(N) where L average length of a search string, N is the size of the sorted set. For example, if we searched "bangladesh" 2 times and "bangkok" 3 times, our Redis would look like this


|Sorted Set| Members                   | 
| ---------|---------------------------|
|b         | bangladesh:2 , bangkok:3  |           
|ba        | bangladesh:2 , bangkok:3  |
|ban       | bangladesh:2 , bangkok:3  |
|bang      | bangladesh:2 , bangkok:3  |
|bangk     | bangladesh:2 , bangkok:3  |
|bangko    | bangladesh:2 , bangkok:3  |
|bangkok   | bangladesh:2 , bangkok:3  |
|bangl     | bangladesh:2 , bangkok:3  |
|bangla    | bangladesh:2 , bangkok:3  |
|banglad   | bangladesh:2 , bangkok:3  |
|banglade  | bangladesh:2 , bangkok:3  |
|banglades | bangladesh:2 , bangkok:3  |
|bangladesh| bangladesh:2 , bangkok:3  |

So if the user types 'bang' we can fetch top N frequent items easily from the sorted set 'bang' using the ZREVRANGE command as our suggestions are sorted according to frequency. We can also use EXPIRE command to delete sorted sets containing less frequent search items.
Now we need to cap the length of the sorted set. We cannot let it reach infinite length. For the top 5 suggestions, keeping 300 elements is quite enough. Whenever a new member would come in the sorted set we'll pop the element with the lowest score and insert the new element with a score equal to the popped element's score plus 1. So this new element would have a chance to rise to the top. Now, this is a stream-based algorithm. So the accuracy of this algorithm depends on the distribution of input. If all the strings have very close frequency this might not be very accurate. But in the case of searching problems, usually, a small subset occupies a big percentage of all the searches. So in a real-life scenario, this will be very feasible. But this also can not do infix matching.

Here is my full code implementing this method

Way 4: Using Elasticsearch

There are multiple ways to do it using Elasticsearch. But almost all of them does most of the work at query time and are not guaranteed to be in memory. Except if we map our field as search_as_you_type data type. Then this does most of the work at index time by updating in-memory Finite State Transducers which are a variant of the trie data structure. Also, this can do infix matching. This is the most efficient way to solve this problem.
our mapping would look like this -

{
   "mappings":{
      "properties":{
         "search-text":{
            "type":"search_as_you_type"
         },
         "frequency":{
            "type":"long"
         },
         "expire":{
            "type":"long"
         }
      }
   }
}

"frequency" is the frequency of the search-text, so we can fetch top N relevant searches. "expire" field can be used to delete expired less frequent items by another process.

We don't want one search to appear multiple times in the database. One way to solve this is to use the hash of the search string as the document ID.

Our insert operation will be an upsert operation where we would increase the "frequency" of the document and update the "expire" field if it already exists or insert the document if it doesn't exist. We can take help of the "script" field for this -

POST autocomplete/_update/{DOCUMENT_ID}
{
   "script":{
      "source":"ctx._source.frequency++;ctx._source.expire={EXPIRE_TIME}",
      "lang":"painless"
   },
   "upsert":{
      "search-text":"{SEARCH_STRING}",
      "frequency":1,
      "expire":"{EXPIRE_TIME}"
   }
}

In search API we want to fetch the top N relevant element sorted by frequency -

GET /autocomplete/_search
{
   "size":5,
   "sort":[
      {
         "frequency":"desc"
      }
   ],
   "query":{
      "multi_match":{
         "query":{SEARCH_STRING},
         "type":"bool_prefix",
         "fields":[
            "search-text",
            "search-text._2gram",
            "search-text._3gram"
         ]
      }
   }
}

Here is my full code implementing this method.

Conclusion :

Thanks a lot for taking your time to read it. Let me know what you think of it in the comment. Let me know if there are any other solutions you know of or how I could improve my solutions. Any feedback is welcome.