DEV Community

Elasticsearch Tokenizers – Word Oriented Tokenizers

loizenai
Software Engineer - Founder at https://loizenai.com
・2 min read

https://grokonez.com/elasticsearch/elasticsearch-tokenizers-word-oriented-tokenizers

A tokenizer breaks a stream of characters up into individual tokens (characters, words, ...) and outputs a stream of tokens. A tokenizer can also record the order or position of each term (for phrase and word-proximity queries), and the start and end character offsets of the original word each term represents (for highlighting search snippets).

In this tutorial, we're gonna look at some Word Oriented Tokenizers, which tokenize full text into individual words.
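To build intuition for what a word-oriented tokenizer emits, here is a minimal Python sketch: each token carries its position and the character offsets of the original word. This is only an approximation (a simple regex, not Lucene's Unicode text segmentation), and the `tokenize` function name is ours, not part of Elasticsearch:

```python
import re

def tokenize(text):
    """Rough sketch of a word-oriented tokenizer: emit each word with
    its position in the stream and its character offsets in the text."""
    return [
        {
            "token": m.group(),
            "start_offset": m.start(),
            "end_offset": m.end(),
            "position": position,
        }
        # [\w']+ roughly mimics word splitting: hyphens and periods
        # separate tokens, while apostrophes (dog's) are kept
        for position, m in enumerate(re.finditer(r"[\w']+", text))
    ]

tokens = tokenize("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.")
print([t["token"] for t in tokens])
# ['The', '2', 'QUICK', 'Brown', 'Foxes', 'jumped', 'over', 'the', 'lazy', "dog's", 'bone']
```

Note how the offsets point back into the original string, which is exactly what makes search-snippet highlighting possible.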

1. Standard Tokenizer

The standard tokenizer provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm) and works well for most languages:


POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Tokens:


{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "QUICK",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "Brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "Foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "dog's",
      "start_offset": 45,
      "end_offset": 50,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 10
    }
  ]
}

To keep things simple, we can list just the terms from the tokens:


[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]

Max Token Length

We can configure the maximum token length (max_token_length, which defaults to 255).
If a token exceeds this length, it is split at max_token_length intervals.
For example, with max_token_length set to 4, QUICK is split into QUIC and K.
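To apply this setting, we define a custom tokenizer of type standard in the index settings, in the same request style as above. A sketch, where my_index, my_analyzer and my_tokenizer are just placeholder names:

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 4
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

With this setting, QUICK comes back as the two tokens QUIC and K.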

More at:

https://grokonez.com/elasticsearch/elasticsearch-tokenizers-word-oriented-tokenizers
