Elasticsearch Tokenizers – Structured Text Tokenizers

loizenai
Software Engineer - Founder at https://loizenai.com
・1 min read

https://grokonez.com/elasticsearch/elasticsearch-tokenizers-structured-text-tokenizers


In this tutorial, we're going to look at Structured Text Tokenizers, which are typically used with structured text such as identifiers, email addresses, zip codes, and file paths.

I. Keyword Tokenizer

The keyword tokenizer is the simplest tokenizer: it accepts whatever text it is given and outputs the exact same text as a single term.

For example:


POST _analyze
{
  "tokenizer": "keyword",
  "text": "Java Sample Approach"
}

Term:


[ Java Sample Approach ]
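In an index, the keyword tokenizer is usually referenced from a custom analyzer, often combined with token filters to normalize the single term. A minimal sketch (the index and analyzer names here are illustrative):

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "keyword_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
```

With the lowercase filter applied, "Java Sample Approach" would be indexed as the single term "java sample approach".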

II. Pattern Tokenizer

The pattern tokenizer uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms.

The default pattern is \W+, which splits text whenever it encounters non-word characters.

For example:


POST _analyze
{
  "tokenizer": "pattern",
  "text": "Java_Sample_Approach's tutorials are helpful."
}

Terms:


[ "Java_Sample_Approach", "s", "tutorials", "are", "helpful" ]
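The default \W+ behavior can be reproduced with Python's re module, which is a handy way to sanity-check a pattern before putting it into an index (this is just an analog, not Elasticsearch itself):

```python
import re

text = "Java_Sample_Approach's tutorials are helpful."

# Split on runs of non-word characters, as the default pattern tokenizer does.
# re.split leaves an empty string when the text ends with a separator (the
# trailing period here), so filter empties out.
terms = [t for t in re.split(r"\W+", text) if t]
print(terms)
# → ['Java_Sample_Approach', 's', 'tutorials', 'are', 'helpful']
```

Note that the underscore is a word character in regex, which is why "Java_Sample_Approach" survives as one term while the apostrophe splits off the "s".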

Configuration

  • pattern: a Java regular expression; defaults to \W+.
  • flags: Java regular expression flags (for example: "CASE_INSENSITIVE|COMMENTS"). See the Java regex Pattern documentation for the full list.
  • group: the capture group to extract as tokens. Defaults to -1 (split on matches).
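The difference between splitting (group -1) and capturing (group >= 0) can also be illustrated with Python's re module (again just an analog; the sample text and patterns are made up for illustration):

```python
import re

text = '"value1" and "value2"'

# group = -1: the pattern marks the separators, and the tokens are
# whatever lies between the matches.
split_terms = [t for t in re.split(r"\s+", text) if t]
print(split_terms)     # → ['"value1"', 'and', '"value2"']

# group = 1: the pattern marks the tokens themselves, and only the text
# captured by group 1 is kept.
captured_terms = re.findall(r'"([^"]*)"', text)
print(captured_terms)  # → ['value1', 'value2']
```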

For example, we want to break text into tokens when it encounters commas:
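One way to configure this, following the shape of the official Elasticsearch pattern-tokenizer example (the index, analyzer, and tokenizer names are illustrative):

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "comma,separated,values"
}
```

Terms:

[ comma, separated, values ]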

More at:

https://grokonez.com/elasticsearch/elasticsearch-tokenizers-structured-text-tokenizers

