Most developers have side projects, and I have my own too. About a month ago I started writing a full-text search library to get a little better at C# and algorithms. Today I know there is still a lot of work and learning ahead of me. However, I have some thoughts, and in this short article I'll do my best to share them with you.
- I created Coresearch, a library plus an in-memory command-line tool for efficiently matching content with its resource at large scale: https://github.com/pilotpirxie/coresearch ⭐🐙
- Database engines commonly use an inverted index to speed up queries.
- A Prefix Tree is one of the structures used in this approach. It splits information into fixed-length (single-character) common parts and stores them in nodes.
- A Radix Tree is an optimized version of a Prefix Tree, where chains of nodes without data (resources) are compressed together, so fewer nodes handle more information.
- A PATRICIA Tree is a binary version (radix equal to 2) of the Radix Tree, with at most two children in each node.
- You can apply logical connectives to search results to find differences, sums, etc.
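To illustrate the last point: once each search returns a set of matching resources, combining results is just set algebra. A quick sketch in Python (the resource names here are made up for illustration; Coresearch itself is written in C#):

```python
# Hypothetical results of two searches: each is a set of resource names.
results_alan = {"a", "b"}       # resources matching "alan*"
results_alice = {"b", "c"}      # resources matching "alice*"

union = results_alan | results_alice         # OR: in either result (sum)
intersection = results_alan & results_alice  # AND: in both results
difference = results_alan - results_alice    # NOT: in the first but not the second (diff)

print(sorted(union))         # ['a', 'b', 'c']
print(sorted(intersection))  # ['b']
print(sorted(difference))    # ['a']
```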
I started this project by writing down on paper the purpose of the program and the features it should have. Because the senior engineers at my work were excited about Solr and ElasticSearch, I decided to learn a little more about how they work and which algorithms they use, and to learn .NET Core as well (day to day I am a JS/React dev).
My design notes called for a library and an application that search resources (TXT, JSON, CSV, YAML, etc.) for words picked by the user. Moreover, it should be able to find resources even if the user types just part of a word, like in real-world search engines.
My first approach was to create a class with a simple
word -> resource
map using the inverted index pattern. To accomplish that, I used a Dictionary with a string as the key and a HashSet of strings as the value, to avoid duplicated resources. However, as you may expect, it took an extremely large amount of memory to store all that information, even though the Dictionary in C# is similar to the HashMap. For four words like:
alan -> a
ala -> b
alice -> c
alek -> d
it creates 4 entries. Exact search in the dictionary is very fast: simply compute the hash of the key and look up the value. But I decided to add wildcard search, and this is the moment when the real fun starts.
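In Python terms, that first approach looks roughly like this (only a sketch — the real library is C# and uses `Dictionary<string, HashSet<string>>`):

```python
from collections import defaultdict

# Inverted index: word -> set of resources the word occurs in.
index: dict[str, set[str]] = defaultdict(set)

def add(word: str, resource: str) -> None:
    index[word].add(resource)  # the set silently drops duplicated resources

for word, resource in [("alan", "a"), ("ala", "b"), ("alice", "c"), ("alek", "d")]:
    add(word, resource)

# Exact lookup: hash the key, fetch the value.
print(index["alan"])  # {'a'}
```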
With a dictionary, to find every value under a key like "al*", the program must walk through the entries one by one and match each key against the pattern. With 4 entries that's not a problem. Even after scaling up to hundreds of thousands of records, given the power of modern devices, it's still fairly fast. The problem starts when you have a really large amount of data to search.
Imagine a web search handling 10k requests/s, each one scanning 100 million entries in a linear way. It's very slow and wastes resources, which nowadays, in the era of cloud infrastructure, can get very expensive if left uncontrolled.
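The linear scan described above looks roughly like this (a Python sketch; `fnmatch` stands in here for whatever pattern matching the real code uses):

```python
from fnmatch import fnmatch

index = {"alan": {"a"}, "ala": {"b"}, "alice": {"c"}, "alek": {"d"}}

def wildcard_search(pattern: str) -> set[str]:
    # Must touch every key: O(number of entries) work per query.
    results: set[str] = set()
    for word, resources in index.items():
        if fnmatch(word, pattern):
            results |= resources
    return results

print(sorted(wildcard_search("al*")))   # ['a', 'b', 'c', 'd'] -- all four words start with "al"
print(sorted(wildcard_search("ali*")))  # ['c'] -- only "alice" matches
```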
So I spent the next few days reading and learning about how databases and tools like Solr and ElasticSearch handle data and searching, and I found the concept of organizing data in a tree structure and traversing it to find the nodes with data. I tried a few different structures and finally decided on a Prefix Tree, because it's extremely fast and allows searching both down and up through the nodes. When adding new keys, new nodes are created only for characters that don't exist yet. With that said, the four words from above look like this in a Prefix Tree:
        i-c-e
        |
a - l - a - n
        |
        e - k
Every node that terminates a word contains information about a resource. Even if at this scale it may not look very optimal, it scales very well. Think about inserting a new word like "aleksander": instead of a new entry with a long hash code / literal value, you create the few missing nodes with the characters s, a, n, d, e, r and attach the resource to the newly created "r" node.
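A minimal insertion routine might look like this (again a Python sketch under my own naming; Coresearch's actual C# classes may differ):

```python
class Node:
    def __init__(self):
        self.children: dict[str, "Node"] = {}
        self.resources: set[str] = set()  # non-empty only on word-ending nodes

root = Node()

def insert(word: str, resource: str) -> None:
    node = root
    for ch in word:
        # Create a node only for characters that don't exist yet on this path.
        if ch not in node.children:
            node.children[ch] = Node()
        node = node.children[ch]
    node.resources.add(resource)

for w, r in [("alan", "a"), ("ala", "b"), ("alice", "c"), ("alek", "d")]:
    insert(w, r)

# "aleksander" reuses the existing a-l-e-k path; only s, a, n, d, e, r are new nodes.
insert("aleksander", "e")

node = root
for ch in "aleksander":
    node = node.children[ch]
print(node.resources)  # {'e'}
```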
It's even better when you look at how you can search with this structure. Suppose you are looking for words starting with "ali". In the dictionary, as I wrote above, you must check every entry to determine whether the key contains the requested characters in the specified order. With a prefix tree, starting from the top-level nodes, you check each attached child and compare the next letter. If it matches, you move to that node and repeat the process until you reach the last character or there is no child with the required character. In our case, you will stop at the "i" node. From there, you go down recursively and return every resource in every child.
When you have a dictionary with 100k resources and you are looking for words matching "calypso", the scan touches every one of the 100k entries. In a Trie, reaching the right subtree takes 7 steps. So search in a Prefix Tree is O(n), where n is the length of the searched word.
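Putting the two steps together — follow one child pointer per character, then collect everything below — a self-contained Python sketch of that prefix search (my own simplified structure, not the library's actual API):

```python
class Node:
    def __init__(self):
        self.children = {}
        self.resources = set()

root = Node()

def insert(word, resource):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, Node())
    node.resources.add(resource)

def prefix_search(prefix):
    # Follow one child per character: O(len(prefix)) steps to reach the subtree.
    node = root
    for ch in prefix:
        if ch not in node.children:
            return set()  # no stored word starts with this prefix
        node = node.children[ch]
    # Collect resources from this node and everything below it.
    found, stack = set(), [node]
    while stack:
        n = stack.pop()
        found |= n.resources
        stack.extend(n.children.values())
    return found

for w, r in [("alan", "a"), ("ala", "b"), ("alice", "c"), ("alek", "d")]:
    insert(w, r)

print(sorted(prefix_search("ali")))  # ['c'] -- stop at the "i" node, then walk down
print(sorted(prefix_search("al")))   # ['a', 'b', 'c', 'd']
```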
With this knowledge, I started working on the second version of my library. I found a lot of resources on the Internet. Some of the more notable ones:
After creating the Trie and Node classes for the library, I created a small CLI program inspired by Redis, so people can interact with the Trie without being forced to implement it in code. For testing, I used sample BBC articles from http://mlg.ucd.ie/datasets/bbc.html and enwik8. I spent the last few days implementing new arguments, refactoring, and adding wildcard and depth+1 search. There is still a lot of optimization that could be done, and maybe I will do it in the future. Personally, I completed my plan, so now it's time to celebrate a little ;p
To sum up, I really recommend trying to create a library like this to better understand what goes on under the hood of web search engines as we type each letter.
If you like the article, want to try it on your own, make some modifications, or check how it's implemented, you can download the source code from GitHub: https://github.com/pilotpirxie/coresearch