Raphael Jambalos for AWS Community ASEAN

Posted on Aug 27, 2022

ElasticSearch: Zero to Hero in 12 Commands

#elasticsearch #codenewbie #aws #programming

It's relatively easy to get started with ElasticSearch. But as our use cases got more specific, we found the documentation lacking. This guided cheatsheet walks through 12 commands: from setting up your ES index to making advanced ES queries that support advanced (but common) use cases.

The 12 commands work when run sequentially. I will explain each of them, but trying them for yourself is still the best way to learn.

This post is part of a broader series on ElasticSearch that will be released in the coming weeks:

The Guided ElasticSearch Cheatsheet you need to Get Started with ES - you are here
Using DynamoDB + ElasticSearch for prod workloads - coming soon
And how to create DynamoDB Streams to sync data changes from DynamoDB to ES asynchronously - coming soon

0 | Prerequisites

Install Elasticsearch by following the official ES guide. Then start the ES server on localhost:9200.

For easier testing, installing an API platform like Postman is a must.

A | Set up the index

In ElasticSearch, we store our data in indexes (similar to tables in your MySQL database). We populate indexes with documents (similar to rows). We will create and set up your first index in the subsequent commands.

[1] Verify the ES cluster is accessible



GET localhost:9200

First, make sure your local ES server is online and you have Postman open. Create a new GET request to localhost:9200. You should see a JSON response that includes the cluster name and the ES version.
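If you prefer a script over Postman, the same request can be sketched in Python with only the standard library. This is a minimal sketch, assuming an unsecured server at http://localhost:9200; the helper names build_request and send are hypothetical and not part of any ES client.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local ES server without security enabled


def build_request(method, path="", body=None):
    """Build (but do not send) an HTTP request for the local ES server."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        url=f"{ES_URL}/{path.lstrip('/')}",
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )


def send(request):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


ping = build_request("GET")
print(ping.get_method(), ping.full_url)
# With the server running, print(send(ping)) would show the cluster details.
```

The later commands follow the same shape, for example build_request("PUT", "mynewindex") for command #2.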

[2] Create an index



PUT localhost:9200/mynewindex

Now, let's create our first index. Indexes store our data. Creating one is equivalent to creating a table in a relational database.

[3] Create the mapping for the index

The index we just created has no mapping. A mapping is similar to a schema in SQL databases. It dictates the form of the documents that our index will ingest. Once defined, the index will refuse to accept documents that cannot fit into this mapping (e.g., we define stocks as integer below, so if we try to insert a document with stocks="none", the operation will fail).

One thing you'll notice with ES is that mappings are permissive by default. If I push a document with a new attribute "perishable" = true, ES adds that attribute to the mapping and infers its data type. In this case, it adds "perishable" to the mapping with data type "boolean".

You can change this behavior. For example, setting the mapping's "dynamic" parameter to "strict" makes the index reject documents that contain attributes not defined in the mapping.
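As an illustration of one such option, the sketch below builds a mapping body that sets the "dynamic" parameter to "strict". The function name strict_mapping is hypothetical, and only two of the fields are shown to keep it short. This is not part of the 12 commands.

```python
import json


def strict_mapping(properties):
    """Return a mapping body that rejects fields not listed in `properties`."""
    return {"dynamic": "strict", "properties": properties}


body = strict_mapping({
    "product_id": {"type": "keyword"},
    "stocks": {"type": "integer"},
})
print(json.dumps(body, indent=4))
```

The printed JSON could be used as the request body for PUT localhost:9200/mynewindex/_mapping.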

In this command, we create the mapping for our newly created index.



PUT localhost:9200/mynewindex/_mapping

{
    "properties": {
        "product_id": {
            "type": "keyword"
        },
        "price": {
            "type": "float"
        },
        "stocks": {
            "type": "integer"
        },
        "published": {
            "type": "boolean"
        },
        "title": {
            "type": "text"
        },
        "sortable_title": {
            "type": "text"
        },
        "tags": {
            "type": "text"
        }
    }
}

Most of the data types are straightforward, except for Text and Keyword. This article explains the difference clearly.

TL;DR: Text allows you to query words inside the field (e.g., querying "Burger" will show the product "Cheese Burger with Fries"). It does this by breaking the text into individual, lowercased tokens that can each be searched: "cheese", "burger", "with", "fries".

On the other hand, Keyword treats the content of the field as one, so if you want to get the cheeseburger with fries, you'd have to query it: "Cheese Burger with Fries". Querying "burger" will return nothing.
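The difference can be imitated locally in a few lines of Python. This is a rough sketch, not how ES is implemented: the analyze function is a simplified stand-in for a text analyzer (lowercase, then split into words), and all function names are hypothetical.

```python
import re


def analyze(text):
    """Rough stand-in for a text analyzer: lowercase, then split into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def text_field_matches(stored, query):
    """A text field matches when any query token is among the stored tokens."""
    stored_tokens = set(analyze(stored))
    return any(token in stored_tokens for token in analyze(query))


def keyword_field_matches(stored, query):
    """A keyword field matches only when the whole value is identical."""
    return stored == query


title = "Cheese Burger with Fries"
print(analyze(title))                          # ['cheese', 'burger', 'with', 'fries']
print(text_field_matches(title, "Burger"))     # True
print(keyword_field_matches(title, "burger"))  # False
print(keyword_field_matches(title, title))     # True
```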

[4] Show the mapping of the index

Let's verify if we have successfully created the mapping for the index by sending a GET request.



GET localhost:9200/mynewindex

B | Data Operations with our ES Index

With our index already set up, let's add data and chip away at the more exciting bits of ES!

[5] Create data for the index

For this section, let's send three consecutive POST requests, each with a different request body. This adds 3 "rows" to our Elasticsearch index.



POST localhost:9200/mynewindex/_doc

{
    "product_id": "123",
    "price": 99.75,
    "stocks": 10,
    "published": true,
    "sortable_title": "Kenny Rogers Chicken Sauce",
    "title": "Kenny Rogers Chicken Sauce",
    "tags": "chicken sauce poultry cooked party"
}

POST localhost:9200/mynewindex/_doc

{
    "product_id": "456",
    "price": 200.75,
    "stocks": 0,
    "published": true,
    "sortable_title": "Best Selling Beer Flavor",
    "title": "Best Selling Beer Flavor",
    "tags": "beer best-seller party"
}

POST localhost:9200/mynewindex/_doc


{
    "product_id": "789",
    "price": 350.5,
    "stocks": 200,
    "published": false,
    "sortable_title": "Female Lotion",
    "title": "Female Lotion",
    "tags": "lotion female"
}
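As an optional alternative to three separate requests, ES also has a _bulk endpoint that accepts newline-delimited JSON. The sketch below only builds that request body. It assumes you would send it with POST localhost:9200/mynewindex/_bulk and a Content-Type of application/x-ndjson. The function name bulk_body is hypothetical.

```python
import json

# The same three documents as in the requests above.
docs = [
    {"product_id": "123", "price": 99.75, "stocks": 10, "published": True,
     "sortable_title": "Kenny Rogers Chicken Sauce",
     "title": "Kenny Rogers Chicken Sauce",
     "tags": "chicken sauce poultry cooked party"},
    {"product_id": "456", "price": 200.75, "stocks": 0, "published": True,
     "sortable_title": "Best Selling Beer Flavor",
     "title": "Best Selling Beer Flavor",
     "tags": "beer best-seller party"},
    {"product_id": "789", "price": 350.5, "stocks": 200, "published": False,
     "sortable_title": "Female Lotion",
     "title": "Female Lotion",
     "tags": "lotion female"},
]


def bulk_body(documents):
    """Pair each document with an "index" action line, one JSON object per line."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {}}))  # action line
        lines.append(json.dumps(doc))            # document line
    return "\n".join(lines) + "\n"               # the body must end with a newline


payload = bulk_body(docs)
print(payload)
```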

[6] Display all the data

Now, let's see if the three documents we inserted via command #5 got inside our index. This command shows all the documents inside your index:



POST localhost:9200/mynewindex/_search

{
    "query": {
        "match_all": {}
    }
}

They did! The response lists all three documents.

[7] Exact search with product id

Now, let's start with a simple search. Let's search by product id.



POST localhost:9200/mynewindex/_search

{
    "query": {
        "term": {
            "product_id": "456"
        }
    }
}

In the command above, we are using a "term query" because we are looking for a product with a "product_id" that exactly matches the string "456". The term query works because the data type of "product_id" is "keyword".

[8] Fuzzy search with titles

Now, onto the more exciting bits.

ES is known for its comprehensive search capability. Let's sample that by creating our first fuzzy search. Fuzzy searches allow us to search for products by typing just a few words instead of the whole text of the field. Instead of typing the full product name (e.g., Incredible Tuna Mayo Jumbo 250), the customer only has to search for the part of the name they recall (e.g., Tuna Mayo).



POST localhost:9200/mynewindex/_search

{
    "query": {
        "match": {
            "title": "Beer Flavor"
        }
    }
}

With the default settings, we get the product "Best Selling Beer Flavor" even with our incomplete query "Beer Flavor". There are other settings that tolerate misspellings or incomplete words and still show results (e.g., Bee Flavo).

Also, notice carefully that we now use a "match query" instead of a "term query" because we want to be able to get results even if we didn't type the full product name. The match query works because the title field is of type "text".
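One of the settings that tolerates misspellings is the match query's "fuzziness" parameter. The sketch below builds such a request body. It was not run as part of the 12 commands, so treat it as a starting point; the function name fuzzy_title_query is hypothetical.

```python
import json


def fuzzy_title_query(text, fuzziness="AUTO"):
    """Build a match query on "title" that tolerates small spelling differences."""
    return {
        "query": {
            "match": {
                "title": {
                    "query": text,
                    "fuzziness": fuzziness,
                }
            }
        }
    }


body = fuzzy_title_query("Bee Flavo")
print(json.dumps(body, indent=4))
```

The printed JSON would be sent to POST localhost:9200/mynewindex/_search, like the other search requests.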

[9] Sorted by prices

Another thing we usually have to do with an e-commerce website is to sort products by specific categories like price or rating:



POST localhost:9200/mynewindex/_search

{
    "query": {
        "match_all": {}
    },
    "sort": [
        {"price": "desc"},
        "_score"
    ]
}

With our query above, we return all the products sorted from most expensive to cheapest. Notice that the sort parameter is a list, which allows us to add multiple criteria for sorting. We also added "_score", which is an Elasticsearch keyword for search relevance. We will see it again in later examples.
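To see the ordering this should produce for our three documents, here is the same two-criteria sort done locally in plain Python. It is only an illustration, not an ES call, and the score of 1.0 is a placeholder assumption.

```python
# Prices copied from the three documents inserted in command #5.
# The score of 1.0 is a placeholder assumption, not a value returned by ES.
hits = [
    {"product_id": "123", "price": 99.75, "score": 1.0},
    {"product_id": "456", "price": 200.75, "score": 1.0},
    {"product_id": "789", "price": 350.5, "score": 1.0},
]

# First criterion: price, descending. Tie-breaker: score, descending.
ordered = sorted(hits, key=lambda hit: (-hit["price"], -hit["score"]))
print([hit["product_id"] for hit in ordered])  # ['789', '456', '123']
```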

[10] Search for all "beer" products that are PUBLISHED, and in stock. Sorted by cheapest to most expensive

To make things more interesting, let's add several more beer products. We do this by sending a POST request thrice, with a different request body each time.




POST localhost:9200/mynewindex/_doc

{
    "product_id": "111",
    "price": 350.55,
    "stocks": 10,
    "published": true,
    "sortable_title": "Tudor Beer Lights",
    "title": "Tudor Beer Lights",
    "tags": "beer tudor party"
}

POST localhost:9200/mynewindex/_doc

{
    "product_id": "222",
    "price": 700.50,
    "stocks": 500,
    "published": false,
    "sortable_title": "Stella Beer 6pack",
    "title": "Stella Beer 6pack",
    "tags": "beer stella party"
}

POST localhost:9200/mynewindex/_doc

{
    "product_id": "333",
    "price": 340,
    "stocks": 500,
    "published": true,
    "sortable_title": "Kampai Beer 6pack",
    "title": "Kampai Beer 6pack",
    "tags": "beer kampai party"
}

With more documents in our index, we can now do the query. This is a complex query that has three conditions that must be fulfilled. We analyze the query below.



POST localhost:9200/mynewindex/_search

{
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "title": "Beer"
                    }
                },
                {
                    "term": {
                        "published": true
                    }
                },
                {
                    "range": {
                        "stocks": {
                            "gt": 0
                        }
                    }
                }
            ]
        }
    },
   "sort": [
        {"price": "asc"},
        "_score"
    ]
}

With our recent additions, there are four products with the word beer:

456: Best Selling Beer Flavor
111: Tudor Beer Lights
222: Stella Beer 6pack
333: Kampai Beer 6pack

Since we filter out items whose inventory is zero (or below), we remove product 456 from the list. Another filter is that the product must be published (published = true). With this filter, product 222 is removed. We are left with the 2 products below. They must be sorted by cheapest to most expensive, as is shown below:

333: Kampai Beer 6pack (price = 340)
111: Tudor Beer Lights (price = 350.55)

In this example, the key "must" was used, with a list as its value. The list contains conditions that must all be met for a document to match. In this example, it is "title must have the word 'beer'" AND "published attribute is equal to true" AND "stocks is greater than zero".
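The walkthrough above can be checked with a small local simulation. This sketch applies the same three conditions to the six documents in plain Python. It imitates the logic of the query rather than calling ES, and the tokens function is a simplified stand-in for text analysis.

```python
import re

# (product_id, title, price, stocks, published) for all six documents.
PRODUCTS = [
    ("123", "Kenny Rogers Chicken Sauce", 99.75, 10, True),
    ("456", "Best Selling Beer Flavor", 200.75, 0, True),
    ("789", "Female Lotion", 350.5, 200, False),
    ("111", "Tudor Beer Lights", 350.55, 10, True),
    ("222", "Stella Beer 6pack", 700.50, 500, False),
    ("333", "Kampai Beer 6pack", 340, 500, True),
]


def tokens(text):
    """Simplified analysis: lowercase words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches_all(product):
    """All three "must" conditions have to hold (A AND B AND C)."""
    _, title, _, stocks, published = product
    return "beer" in tokens(title) and published is True and stocks > 0


# Keep matching products, then sort by price ascending.
hits = sorted((p for p in PRODUCTS if matches_all(p)), key=lambda p: p[2])
print([(p[0], p[2]) for p in hits])  # [('333', 340), ('111', 350.55)]
```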

[11] Search for all products that have at least 1 of the following tags ['poultry', 'kampai', 'best-seller'], that are PUBLISHED, and in stock. Sorted by cheapest to most expensive

Our previous query just involved three conditions that must be ALL TRUE to hold. That's equivalent to "A and B and C".

In this query, we still have three conditions that have to be all true, but the 1st condition is true if the product has any of the tags "poultry", "kampai", or "best-seller". In this example, we introduce the syntax for "OR":



POST localhost:9200/mynewindex/_search

{
    "query": {
        "bool": {
            "must": [
                {
                    "bool": {
                        "should": [
                            {
                                "match": {
                                    "tags": "poultry"
                                }
                            },
                            {
                                "match": {
                                    "tags": "kampai"
                                }
                            },
                            {
                                "match": {
                                    "tags": "best-seller"
                                }
                            }
                        ],
                        "minimum_should_match": 1
                    }
                },
                {
                    "term": {
                        "published": true
                    }
                },
                {
                    "range": {
                        "stocks": {
                            "gt": 0
                        }
                    }
                }
            ]
        }
    },
    "sort": [
        {
            "price": "asc"
        },
        "_score"
    ]
}

In this query, we still have a "must" keyword, but its first element contains a nested "bool" with a "should" keyword. The whole query is equivalent to: (A or B or C) AND D AND E. The "should" implies that as long as one condition is met, the (A or B or C) statement returns true.

One tweak we can make is to adjust the "minimum_should_match" (msm) parameter, so we can require that two, three, or N conditions be met for the statement to be true. In our example, if msm=2, a product has to have two matching tags to be considered a match (e.g., a product has to be tagged both poultry and kampai).

We analyze the query below:

The product should have at least 1 of these tags: poultry, kampai, best-seller
- This matches 3 products: poultry (pid: 123), kampai (pid: 333) and best-seller (pid: 456)
That is published
- All 3 PIDs from the previous step are already published. So no changes.
Should have stocks
- Since pid 456 does not have stocks, we are left with pid 123 and pid 333
Sorted by price
- pid 333 is 340 pesos
- pid 123 is 99.75 pesos
- therefore, the order should be pid 123 => pid 333
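The analysis above can be reproduced with a local simulation of the "should" clauses and "minimum_should_match". This is a plain-Python sketch of the logic, not an ES call. The tokens function is a simplified stand-in for text analysis, and the search function is hypothetical.

```python
import re

# (product_id, tags, price, stocks, published) for all six documents.
PRODUCTS = [
    ("123", "chicken sauce poultry cooked party", 99.75, 10, True),
    ("456", "beer best-seller party", 200.75, 0, True),
    ("789", "lotion female", 350.5, 200, False),
    ("111", "beer tudor party", 350.55, 10, True),
    ("222", "beer stella party", 700.50, 500, False),
    ("333", "beer kampai party", 340, 500, True),
]


def tokens(text):
    """Simplified analysis: lowercase words ("best-seller" becomes best, seller)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(wanted_tags, minimum_should_match=1):
    """(A or B or C) AND published AND in stock, sorted by price ascending."""
    hits = []
    for pid, tags, price, stocks, published in PRODUCTS:
        doc_tokens = tokens(tags)
        # Count how many "should" clauses this document satisfies.
        matched = sum(1 for tag in wanted_tags if tokens(tag) & doc_tokens)
        if matched >= minimum_should_match and published and stocks > 0:
            hits.append((pid, price))
    return sorted(hits, key=lambda hit: hit[1])


wanted = ["poultry", "kampai", "best-seller"]
print(search(wanted))                          # [('123', 99.75), ('333', 340)]
print(search(wanted, minimum_should_match=2))  # []
```

With msm=2 the result is empty because no product in our index carries two of the three tags.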

[12] Search for all products that have at least 1 of the following tags ['poultry', 'kampai', 'best-seller'], that are PUBLISHED, and in stock. The price should be between 0 and 300 only. Sorted by cheapest to most expensive

This query is similar to #11, but we added another criterion: the price of the products returned must be between 0 and 300.



POST localhost:9200/mynewindex/_search

{
    "query": {
        "bool": {
            "must": [
                {
                    "bool": {
                        "should": [
                            {
                                "match": {
                                    "tags": "poultry"
                                }
                            },
                            {
                                "match": {
                                    "tags": "kampai"
                                }
                            },
                            {
                                "match": {
                                    "tags": "best-seller"
                                }
                            }
                        ],
                        "minimum_should_match": 1
                    }
                },
                {
                    "term": {
                        "published": true
                    }
                },
                {
                    "range": {
                        "stocks": {
                            "gt": 0
                        }
                    }
                },
                {
                    "range": {
                        "price": {
                            "gt": 0,
                            "lt": 300
                        }
                    }
                }
            ]
        }
    },
    "sort": [
        {
            "price": "asc"
        },
        "_score"
    ]
}

This query adds a second "range" clause. The "range" keyword allows us to filter for items whose values fall within a specific range. For the price, we set a condition that it must be between 0 and 300 (exclusive, since we use "gt" and "lt"). For the stock, we only require that it be greater than zero.

Let's analyze the query:

From the results in #11, we have pid 333 (340 pesos) and pid 123 (99.75 pesos)
With the 0-300 price filter, our only result will be pid 123 (99.75 pesos)
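To make the range operators concrete, here is a small local sketch of how "gt", "gte", "lt", and "lte" bounds behave. It is plain Python imitating the comparison, not an ES call, and the function name in_range is hypothetical.

```python
def in_range(value, bounds):
    """Check a value against range bounds such as {"gt": 0, "lt": 300}."""
    checks = {
        "gt": lambda v, b: v > b,    # strictly greater than
        "gte": lambda v, b: v >= b,  # greater than or equal
        "lt": lambda v, b: v < b,    # strictly less than
        "lte": lambda v, b: v <= b,  # less than or equal
    }
    return all(checks[op](value, bound) for op, bound in bounds.items())


price_filter = {"gt": 0, "lt": 300}
candidates = {"123": 99.75, "333": 340}  # the two results from #11
kept = [pid for pid, price in candidates.items() if in_range(price, price_filter)]
print(kept)                                   # ['123']
print(in_range(300, {"gt": 0, "lt": 300}))    # False
print(in_range(300, {"gt": 0, "lte": 300}))   # True
```

If a product priced at exactly 300 should be included, "lte" is the bound to use instead of "lt".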

Conclusion

Getting started with ElasticSearch is easy! But your searching needs can become more complex as your business needs grow. This cheatsheet helps you navigate that complexity.

An alternative to learning ES syntax at this level is to use a DSL library for Elasticsearch that "abstracts" the long-form syntax of Elasticsearch. It is a powerful tool for general-purpose usage of ES. However, as your query needs grow, learning the syntax under that DSL will keep you informed on the options you can add to make your searching richer.
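To give a feel for what such an abstraction does, here is a tiny hand-rolled sketch in Python. It is not the API of any real DSL library. Every helper name below is hypothetical, and the helpers only assemble the same JSON we wrote by hand in command #10.

```python
import json


def match(field, text):
    return {"match": {field: text}}


def term(field, value):
    return {"term": {field: value}}


def greater_than(field, value):
    return {"range": {field: {"gt": value}}}


def must(*clauses):
    """Combine clauses so that all of them have to match."""
    return {"bool": {"must": list(clauses)}}


def search_body(query, sort_field, order="asc"):
    """Wrap a query with a sort on one field, then on relevance."""
    return {"query": query, "sort": [{sort_field: order}, "_score"]}


body = search_body(
    must(match("title", "Beer"), term("published", True), greater_than("stocks", 0)),
    sort_field="price",
)
print(json.dumps(body, indent=4))
```

The helpers keep the query readable, but knowing the JSON underneath is what lets you spot where an extra option belongs.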

How about you? Is there other ElasticSearch syntax you want to learn?

Maybe I can help! Type it in the comments, and I'll try to add it to the article.

Photo by TK on Unsplash
