Introduction
Recently, at the health-tech company where I work, I was tasked with introducing Elasticsearch into our CRM repository. Our Chief Product Officer gave me three weeks to complete this project, and while the requirements initially seemed straightforward, the process turned out to be more complex than anticipated.
The objectives were clear:
- Implement Elasticsearch search functionality.
- Ensure the code is testable.
- Document the implementation thoroughly.
In this series of articles, I will walk through the process of integrating Elasticsearch into a simple Django repository. The first article will provide a theoretical overview, followed by a second article detailing the implementation and programming aspects. The final article will focus on testing and running these tests within a GitLab/GitHub CI environment.
Prerequisites
Before diving into the implementation, I highly recommend familiarizing yourself with the following resources:
Additionally, I want to acknowledge the invaluable insights gained from the following articles:
What is Elasticsearch?
Elasticsearch is a specialized search engine designed for efficiently querying and retrieving data. It runs on the Java platform, is built on top of Apache Lucene, and relies on an indexing structure known as an inverted index, the same idea behind the Generalized Inverted Index (GIN) described later in this article.
As the name suggests, Elasticsearch is mainly used for searching. Many user read requests involve some sort of search, so we use Elasticsearch to offload those read requests from our main database. Additionally, Elasticsearch makes search more intuitive through fuzzy searching that tolerates misspellings, auto-completion, and other powerful search capabilities.
Intricacies of Elasticsearch
Cluster
A cluster in Elasticsearch is a collection of one or more nodes (servers) that collectively store data and provide indexing and search capabilities. Clusters are used for scalability, fault tolerance, and load distribution. They are similar in spirit to Kubernetes clusters.
Node
A node is a single server that is part of a cluster. It stores data, participates in the cluster's indexing and search capabilities, and can be configured to perform specific roles such as master-eligible or data node. Nodes are similar in spirit to Kubernetes nodes. Think of a node as a virtual machine or, when you are developing locally, your laptop.
Shard
A shard is a basic unit of data in Elasticsearch. It is a subset of an index, containing a portion of the index's data. Elasticsearch distributes shards across nodes in the cluster to enable parallel processing and scalability.
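To make the distribution idea concrete, here is a minimal sketch of how a document ID can be mapped to a shard. In simplified form, Elasticsearch hashes a routing value (the document ID by default) and takes it modulo the number of primary shards. The `pick_shard` helper and the use of `zlib.crc32` are assumptions for illustration only; the real engine uses a different hash function.

```python
import zlib


def pick_shard(doc_id: str, number_of_primary_shards: int) -> int:
    """Map a document ID to a shard number (illustrative only)."""
    # crc32 is a stand-in hash; it is deterministic, so the same ID
    # always lands on the same shard.
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards


if __name__ == "__main__":
    for doc_id in ["contact-1", "contact-2", "contact-3"]:
        print(doc_id, "-> shard", pick_shard(doc_id, 3))
```

Because the shard is derived from the number of primary shards, that number cannot simply be changed later without moving documents around, which is why it is chosen when the index is created.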
Replica
A replica is a copy of a shard. Replicas provide fault tolerance and high availability by allowing data to be replicated across multiple nodes. Elasticsearch automatically manages the distribution of replicas to ensure data resilience.
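As a rough sketch, shards and replicas are configured per index through the `number_of_shards` and `number_of_replicas` settings. The helper names and the default values below are assumptions chosen for illustration, not recommendations.

```python
import json


def build_index_settings(primary_shards: int = 3, replicas: int = 1) -> dict:
    """Build the settings part of an index-creation request body."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                # Replicas are counted per primary shard.
                "number_of_replicas": replicas,
            }
        }
    }


def total_shard_copies(body: dict) -> int:
    """Primaries plus all of their replica copies."""
    index = body["settings"]["index"]
    return index["number_of_shards"] * (1 + index["number_of_replicas"])


if __name__ == "__main__":
    body = build_index_settings()
    print(json.dumps(body, indent=2))
    print("shard copies in the cluster:", total_shard_copies(body))
```

With three primaries and one replica each, the cluster holds six shard copies in total, and a replica is never placed on the same node as its primary.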
Type (Deprecated in Elasticsearch 7.x)
In older versions of Elasticsearch, a type was a way to logically partition data within an index (an index is like a table in a relational database). Starting with Elasticsearch 7.x, types are deprecated, and an index holds a single mapping type.
Document
A document is a basic unit of information stored in Elasticsearch. It is a JSON object that contains data and its associated metadata. Each document belongs to an index and is stored in one of that index's shards. Think of it as a row in a relational database.
Field
A field is a key-value pair within a document that represents a specific attribute or property of the data. Fields can be indexed for search, aggregated for analysis, and retrieved in query results. Think of this as a column in a relational database.
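The row and column analogy can be sketched with a plain Python dictionary serialised to JSON. The contact below and all of its field names are hypothetical and only serve as an illustration.

```python
import json

# A hypothetical CRM contact; every field name here is made up.
contact_document = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "tags": ["vip", "newsletter"],
    "created_at": "2024-01-15",
}


def field_names(document):
    """Return the document's top-level fields (the 'columns')."""
    return sorted(document)


if __name__ == "__main__":
    # The JSON text is what would be sent to Elasticsearch for indexing.
    print(json.dumps(contact_document, indent=2))
    print(field_names(contact_document))
```

Unlike a relational row, a field can hold a list (as `tags` does here) or a nested object.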
Strategies used by Elasticsearch to store data
A Generalized Inverted Index (GIN) is a data structure used in databases to efficiently support complex queries. It stores lists of indexed items along with their corresponding keys, enabling fast retrieval based on search terms. Inverted indexes of this kind are particularly effective for full-text search and support various operations like AND, OR, and phrase searches.
Traditional storage engines often utilize indices by mapping unique keys to corresponding values. For instance, PostgreSQL uses B-trees, where a key is associated with a file offset pointing to the stored data. In contrast, Elasticsearch adopts a different approach: it maps textual data to unique keys, which are then linked to lists of postings representing the documents containing that text. This strategy facilitates rapid searching and retrieval. To enable this, text needs to undergo stemming and lemmatisation as well as the removal of stop words.
Stop words are words that are very common in a language. For example, in English, words like "the", "and", and "or" carry little meaning on their own and do not need to be stored.
Lemmatisation and stemming are techniques used in natural language processing to reduce words to their base or root form. Lemmatisation aims to accurately identify a word's lemma, or dictionary form, considering its context and grammatical features. Stemming, on the other hand, applies simpler rules to remove suffixes and prefixes from words, often resulting in the root form but sometimes leading to inaccuracies. While lemmatisation produces linguistically valid roots, stemming is faster and more suitable for tasks where linguistic accuracy is less critical, such as information retrieval.
Lemma: The lemma of "running" is "run". Lemmatisation would convert "running" to "run" by considering its context and grammatical features.
Stem: The stem of "running" using a stemming algorithm might be "run". Stemming applies simple rules to remove suffixes, so it may not always produce a valid root or lemma. In this case, "run" is a valid stem, but it's not necessarily the correct lemma.
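The difference between the two techniques can be sketched with a toy rule-based stemmer and a toy dictionary-based lemmatiser. Both are deliberately simplistic assumptions for illustration: real stemmers (such as Porter's) have many more rules, and real lemmatisers use full dictionaries and part-of-speech information.

```python
# Tiny hand-written lookup table standing in for a real dictionary.
LEMMA_TABLE = {"running": "run", "ran": "run", "studies": "study"}
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix; may return a non-word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undouble a trailing consonant: "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


if __name__ == "__main__":
    for word in ["running", "ran", "studies"]:
        print(word, "| stem:", toy_stem(word), "| lemma:", toy_lemmatize(word))
```

For "running" both approaches agree on "run". For "studies" the stemmer produces "studi", which is not a word, and for the irregular form "ran" it changes nothing, while the dictionary lookup still returns the proper lemma.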
Let's work through an example using a variation of the famous sentence that contains (almost) every letter of the alphabet:
The quick brown fox jumped over the lazy dog.
Remove stop words: Stop words are typically common words like "the", "of", "and", etc. In this sentence we treat "the" and "over" as stop words. After removing them, we get:
"quick brown fox jumped lazy dog."
Lemmatization: Lemmatization involves reducing words to their base or dictionary form. Here, lemmatization converts "jumped" to "jump".
Stemming: Stemming involves removing affixes from words to obtain their root forms. For example, "jumped" can be stemmed to "jump", "brown" remains "brown", etc.
We thus get:
quick brown fox jump lazy dog.
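The steps above can be sketched as a small analysis function. The stop-word set and the lemma table are hand-picked assumptions that cover only this sentence; Elasticsearch's built-in analyzers use their own, much larger rules and word lists.

```python
import re

# Hand-picked for this example only.
STOP_WORDS = {"the", "over", "a", "an", "and", "or", "of"}
LEMMAS = {"jumped": "jump"}


def analyze(text):
    """Lowercase, tokenise, drop stop words, then normalise each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [LEMMAS.get(token, token) for token in tokens if token not in STOP_WORDS]


if __name__ == "__main__":
    print(analyze("The quick brown fox jumped over the lazy dog."))
    # prints ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
```

On a running cluster, the `_analyze` API lets you inspect the tokens a real analyzer produces, and its output will usually differ from this toy version.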
Example
- The pen will be used to write in a red book by the students.
- The red book is hated by most of our students.
- The students will write to the red book.
Remove stop words:
- "pen used write red book students"
- "red book hated students"
- "students write red book"
Lemmatization:
- "pen use write red book student"
- "red book hate student"
- "student write red book"
Stemming (no further change here, since every word is already in its base form):
- "pen use write red book student"
- "red book hate student"
- "student write red book"
Now we index these words:
Texts are mapped to unique keys which are mapped to the postings.
Text | Unique Keys | List of Postings |
---|---|---|
pen | 1 | [1] |
use | 2 | [1] |
write | 3 | [1,3] |
red | 4 | [1,2,3] |
book | 5 | [1,2,3] |
student | 6 | [1,2,3] |
hate | 7 | [2] |
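Here is a minimal sketch that builds the postings for the three example sentences and runs a simple AND search over them. The stop-word set, the lemma table, and the function names are assumptions made for this example; this is not how Lucene stores its index internally.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "will", "be", "to", "in", "a", "by", "is", "most", "of", "our"}
LEMMAS = {"used": "use", "students": "student", "hated": "hate"}

DOCUMENTS = {
    1: "The pen will be used to write in a red book by the students.",
    2: "The red book is hated by most of our students.",
    3: "The students will write to the red book.",
}


def analyze(text):
    """Lowercase, tokenise, drop stop words, normalise."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]


def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        # dict.fromkeys removes duplicate terms while keeping their order.
        for term in dict.fromkeys(analyze(documents[doc_id])):
            index[term].append(doc_id)
    return dict(index)


def search(index, query):
    """AND search: IDs of documents that contain every query term."""
    postings = [set(index.get(term, [])) for term in analyze(query)]
    return sorted(set.intersection(*postings)) if postings else []


if __name__ == "__main__":
    index = build_inverted_index(DOCUMENTS)
    for term, doc_ids in index.items():
        print(term, doc_ids)
    print(search(index, "students write"))
```

Note that the query goes through the same `analyze` step as the documents, so a search for "students" matches the indexed term "student".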
How querying works under the hood
Scoring in Elasticsearch refers to the process of ranking search results by their relevance to the query. By default, Elasticsearch scores with the BM25 algorithm, which weighs how often a search term appears in a document, how rare the term is across the index, and how long the field is. For phrase queries, the proximity of the terms to each other matters as well.
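To give a feel for how term frequency and document length feed into a relevance score, here is a sketch of the textbook BM25 formula. The function name and the tiny corpus are assumptions for illustration, and Lucene's actual implementation differs in some details, so the numbers will not match real Elasticsearch scores.

```python
import math


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Textbook BM25 score of one tokenised document for a query."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        doc_freq = sum(1 for doc in corpus if term in doc)
        if doc_freq == 0:
            continue  # the term appears nowhere, so it contributes nothing
        # Rare terms get a higher inverse document frequency.
        idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
        tf = doc_tokens.count(term)
        # Longer-than-average documents are penalised through b.
        norm = tf + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score


CORPUS = [
    ["pen", "use", "write", "red", "book", "student"],
    ["red", "book", "hate", "student"],
    ["student", "write", "red", "book"],
]

if __name__ == "__main__":
    for doc in CORPUS:
        print(doc, round(bm25_score(["hate", "student"], doc, CORPUS), 3))
```

The second document ranks first for this query because it is the only one containing the rare term "hate"; among documents that match only "student", the shorter one scores slightly higher.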
Boosting allows users to assign greater importance to certain fields or documents when calculating relevance scores. This feature enables fine-tuning of search results to emphasize specific criteria.
Must and should are terms used in Elasticsearch's Boolean query syntax to define mandatory and optional criteria, respectively. Must clauses must match for a document to be considered a relevant result, while should clauses enhance the score of documents if they match but are not required for a successful search.
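A Boolean query with a boosted field could look roughly like the sketch below, written as the Python dictionary that would be serialised to JSON. The field names (`name`, `notes`, `is_active`) and the boost values are hypothetical; `bool`, `must`, `should`, `multi_match`, and the `field^boost` notation are part of the Elasticsearch Query DSL.

```python
import json


def build_contact_query(text, name_boost=3):
    """Build a search body: the text must match, active contacts rank higher."""
    return {
        "query": {
            "bool": {
                # Mandatory: the text has to match one of these fields.
                # A match in "name" counts name_boost times as much.
                "must": [
                    {
                        "multi_match": {
                            "query": text,
                            "fields": [f"name^{name_boost}", "notes"],
                        }
                    }
                ],
                # Optional: matching this clause only raises the score.
                "should": [
                    {"term": {"is_active": {"value": True, "boost": 2.0}}}
                ],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_contact_query("jane"), indent=2))
```

The resulting body is what gets sent to an index's `_search` endpoint. Documents that fail the `must` clause are excluded, while inactive contacts still appear, just lower in the ranking.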
Fuzziness in Elasticsearch refers to the capability to find approximate matches for a given search term. This feature is particularly useful for handling typographical errors, misspellings, and variations in word forms. Elasticsearch employs algorithms like the Levenshtein distance algorithm to determine the similarity between terms and their potential matches.
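The Levenshtein distance mentioned above counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Below is a minimal sketch of the classic dynamic-programming version, with a hypothetical `fuzzy_candidates` helper on top; it illustrates the idea and is not Elasticsearch's own implementation.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]


def fuzzy_candidates(term, vocabulary, max_edits=1):
    """Return the indexed terms within max_edits of the search term."""
    return [word for word in vocabulary if levenshtein(term, word) <= max_edits]


if __name__ == "__main__":
    vocabulary = ["pen", "use", "write", "red", "book", "student", "hate"]
    print(fuzzy_candidates("bock", vocabulary))     # a typo for "book"
    print(fuzzy_candidates("studnet", vocabulary, max_edits=2))
```

In a real query the allowed distance is controlled by the `fuzziness` parameter, for example `"AUTO"` on a `match` query. By default Elasticsearch also treats a swap of two adjacent characters as a single edit, which this plain version counts as two.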
With these building blocks, you can shape a query as precisely as you need. It is up to you, as the relevance engineer, to choose which fields to boost so that the results become more relevant.
From a business perspective, we want our search to be rich and relevant so that our users find it useful. The search also has to make business sense, for example by recommending products that are about to expire or products that have higher margins. Such factors should be considered when building a query.
In summary, Elasticsearch offers a powerful and flexible solution for searching and retrieving data, leveraging advanced indexing techniques and query capabilities. Understanding its features, such as scoring, boosting, Boolean queries, and fuzzy matching, allows developers to build efficient and accurate search applications tailored to their specific needs.
Conclusion
In conclusion, Elasticsearch is a powerful tool for building scalable, real-time search applications. Understanding its architecture and intricacies, such as clusters, nodes, shards, and documents, is essential for effectively implementing and managing Elasticsearch in a production environment. By leveraging its capabilities, developers can create robust search solutions that meet the demands of modern applications.
The next two articles will delve into the code, so stay tuned!