Lucas Rivelles

Document Analysis in Elasticsearch

When we index a document in Elasticsearch, its text values pass through an analysis process. In this article we'll cover what happens in this process and how Elasticsearch's standard analyzer works.

Introduction to analysis

The main objective of analysis is to store documents in a way that makes them efficient to search. It happens at the moment we index a document in Elasticsearch, which uses three mechanisms to do so:

  • Character filter
  • Tokenizer
  • Token filter

[Figure: the analysis process]

Character filters

The first step consists of receiving the full text and adding, removing, or changing characters. For example, we can remove HTML tags:

Input: <p>I <strong>REALLY</strong> love to go hiking!</p>
Result: I REALLY love to go hiking!

An analyzer may contain zero or more character filters, and the result of the operation is passed to the tokenizer.
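
Elasticsearch ships with a built-in html_strip character filter for exactly this case. As a quick sketch (using the analyze API, which we'll cover in more detail below, together with the keyword tokenizer so the output stays a single token):

POST /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "keyword",
  "text": "<p>I <strong>REALLY</strong> love to go hiking!</p>"
}

The response contains a single token with the HTML tags stripped out.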

Tokenizer

Unlike character filters, an analyzer must contain exactly one tokenizer, and its responsibility is to split a string into tokens. In this process, some characters may be removed from the text, such as punctuation. An example of this would be:

Input: I REALLY love to go hiking!
Result: "I", "REALLY", "love", "to", "go", "hiking"

Token filters

The token filters receive the tokens produced by the tokenizer and operate on them. A simple example is the lowercase filter:

Input: "I", "REALLY", "love", "to", "go", "hiking"
Result: "i", "really", "love", "to", "go", "hiking"

An analyzer may also have zero or more token filters.
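
Putting the three mechanisms together, here's a minimal sketch of a custom analyzer defined in the index settings (the index and analyzer names are just placeholders):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

Any field mapped with my_custom_analyzer would then have its text run through all three stages at index time.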

For more examples of built-in character filters, tokenizers and token filters, we can check the official documentation.

Elasticsearch's standard analyzer consists of:

  • No character filters
  • The standard tokenizer
  • The lowercase token filter and an optional stop words token filter, which is disabled by default (see the sketch below)
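
Since the stop words filter is disabled by default, the standard analyzer accepts a stopwords parameter to turn it on. A sketch, assuming the predefined English stop word list (my_index is a placeholder):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_with_stopwords": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}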

Using the analyze API

Elasticsearch provides a way of visualizing how a string gets analyzed: the analyze API. To use it, we just need to send a POST request to the /_analyze endpoint with a "text" parameter. Let's try it out!

POST /_analyze
{
  "text": "The 2 QUICK     Brown-Foxes jumped over the lazy dog's bone. :)"
}

In the response, we can see the generated tokens:

{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "foxes",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "jumped",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "lazy",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "dog's",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "bone",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}

In this API, we can also specify the character filters, the tokenizer, and the token filters that we want to use. We would have gotten the same result if we had made the request like this:

POST /_analyze
{
  "text": "The 2 QUICK     Brown-Foxes jumped over the lazy dog's bone. :)",
  "char_filter": [],
  "tokenizer": "standard",
  "filter": ["lowercase"]
}
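
Instead of listing the components individually, we can also reference a named analyzer directly with the analyzer parameter, which is equivalent for the standard analyzer:

POST /_analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK     Brown-Foxes jumped over the lazy dog's bone. :)"
}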
