Elasticsearch Tokenizers – Partial Word Tokenizers

loizenai
Software Engineer - Founder at https://loizenai.com
・2 min read

https://grokonez.com/elasticsearch/elasticsearch-tokenizers-partial-word-tokenizers

In this tutorial, we'll look at two tokenizers that break up text or words into small fragments for partial word matching: the N-Gram Tokenizer and the Edge N-Gram Tokenizer.

I. N-Gram Tokenizer

The ngram tokenizer does two things:

  • breaks up text into words when it encounters any of the specified characters (whitespace, punctuation, ...)
  • emits N-grams of each word at the specified lengths (e.g. quick with length = 2 -> [qu, ui, ic, ck])

=> N-grams are like a sliding window of continuous letters.
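The sliding-window idea can be sketched in plain Python. This is an illustration only, not Elasticsearch's implementation; the grams are emitted in position order, matching what _analyze returns:

```python
def ngrams(word, min_gram=1, max_gram=2):
    """Emit every substring of `word` whose length lies between
    min_gram and max_gram, ordered by start position."""
    return [word[i:i + n]
            for i in range(len(word))
            for n in range(min_gram, max_gram + 1)
            if i + n <= len(word)]

print(ngrams("quick", 2, 2))  # ['qu', 'ui', 'ic', 'ck']
```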

For example:


POST _analyze
{
  "tokenizer": "ngram",
  "text": "Spring 5"
}

It will generate terms with a sliding window (min width 1 char, max width 2 chars):


[ "S", "Sp", "p", "pr", "r", "ri", "i", "in", "n", "ng", "g", "g ", " ", " 5", "5" ]

Configuration

  • min_gram: minimum length of characters in a gram (min-width of the sliding window). Defaults to 1.
  • max_gram: maximum length of characters in a gram (max-width of the sliding window). Defaults to 2.
  • token_chars: character classes that will be included in a token. Elasticsearch splits on characters that don't belong to any of the specified classes:
      • letter (a, b, ...)
      • digit (1, 2, ...)
      • whitespace (" ", "\n", ...)
      • punctuation (!, ", ...)
      • symbol ($, %, ...)

token_chars defaults to [] (keep all characters, i.e. never split the text).
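Splitting on characters outside the configured classes can be approximated with a regular expression. This is a rough sketch for ASCII text with token_chars: ["letter", "digit"]; Elasticsearch's real character classification follows Unicode categories:

```python
import re

def split_tokens(text, keep_pattern=r"[A-Za-z0-9]+"):
    """Keep only runs of allowed characters; everything else
    (whitespace, punctuation, symbols) acts as a split point."""
    return re.findall(keep_pattern, text)

print(split_tokens("Tut101: Spring 5"))  # ['Tut101', 'Spring', '5']
```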

For example, we will create a tokenizer with a fixed-width sliding window (min_gram = max_gram = 3) and only the letter & digit character classes:


PUT jsa_index_n-gram
{
  "settings": {
    "analysis": {
      "analyzer": {
        "jsa_analyzer": {
          "tokenizer": "jsa_tokenizer"
        }
      },
      "tokenizer": {
        "jsa_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

POST jsa_index_n-gram/_analyze
{
  "analyzer": "jsa_analyzer",
  "text": "Tut101: Spring 5"
}

Terms:


[ "Tut", "ut1", "t10", "101", "Spr", "pri", "rin", "ing" ]

Note that the standalone "5" produces no term because it is shorter than min_gram (3).
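Putting the two steps together, a rough Python approximation of the jsa_tokenizer above (split on anything that isn't a letter or digit, then emit fixed-width trigrams). Because min_gram and max_gram are both 3, only trigrams are emitted, and words shorter than 3 characters yield no terms:

```python
import re

def jsa_like_tokenize(text, min_gram=3, max_gram=3):
    """Approximation of the jsa_tokenizer config: keep letter/digit runs,
    then emit n-grams of each run, in position order."""
    terms = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        for i in range(len(word)):
            for n in range(min_gram, max_gram + 1):
                if i + n <= len(word):
                    terms.append(word[i:i + n])
    return terms

print(jsa_like_tokenize("Tut101: Spring 5"))
# ['Tut', 'ut1', 't10', '101', 'Spr', 'pri', 'rin', 'ing']
```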
