Elasticsearch Character Filters

loizenai
Software Engineer - Founder at https://loizenai.com

https://grokonez.com/elasticsearch/elasticsearch-character-filters

Elasticsearch character filters preprocess the stream of characters (by adding, removing, or changing characters) before it is passed to the tokenizer. In this tutorial, we'll look at three types of character filters: HTML Strip, Mapping, and Pattern Replace. All three are important building blocks for custom analyzers.
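To picture where character filters sit in the analysis chain, here is a small conceptual model in Python. It is only a sketch and not Elasticsearch code: `mapping_filter`, `pattern_replace_filter`, and `analyze` are hypothetical helpers, the sample text is made up, and the mapping step uses plain sequential string replacement, which is simpler than what the real Mapping filter does.

```python
import re
from typing import Callable, Dict, List

# A character filter here is just a function from raw text to raw text.
CharFilter = Callable[[str], str]


def mapping_filter(mappings: Dict[str, str]) -> CharFilter:
    """Loosely mimics a Mapping filter: replace each key with its value."""
    def apply(text: str) -> str:
        for source, target in mappings.items():
            text = text.replace(source, target)
        return text
    return apply


def pattern_replace_filter(pattern: str, replacement: str) -> CharFilter:
    """Loosely mimics a Pattern Replace filter using a regular expression."""
    compiled = re.compile(pattern)
    return lambda text: compiled.sub(replacement, text)


def analyze(text: str, char_filters: List[CharFilter],
            tokenizer: Callable[[str], List[str]]) -> List[str]:
    # Character filters run first, in order, on the whole character stream...
    for char_filter in char_filters:
        text = char_filter(text)
    # ...and only then does the tokenizer see the text.
    return tokenizer(text)


terms = analyze(
    "I :) Elasticsearch 123-456",
    [
        mapping_filter({":)": "_happy_"}),
        pattern_replace_filter(r"(\d+)-(\d+)", r"\1_\2"),
    ],
    str.split,  # stand-in for a whitespace tokenizer
)
print(terms)  # ['I', '_happy_', 'Elasticsearch', '123_456']
```

The point of the sketch is the ordering: by the time the tokenizer splits the text, the emoticon and the hyphenated number have already been rewritten.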

1. HTML Strip Character Filter

The html_strip character filter can:

  • strip out HTML elements (like <b>)
  • replace HTML entities with their decoded value (&amp; becomes &).

For example:


POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<p>JavaSampleApproach&apos;s <b>tutorials</b> are so helpful!</p>"
}

Terms:


[ \nJavaSampleApproach's tutorials are so helpful!\n ]
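If you don't have a cluster at hand, the two behaviors can be approximated locally with Python's standard library. This is a rough sketch, not how Elasticsearch implements the filter: `html_strip_sketch` and `BLOCK_TAGS` are hypothetical names, the set of block-level tags is a small hand-picked assumption, and the sample markup is assumed as well.

```python
from html.parser import HTMLParser

# Assumption: a tiny set of block-level tags that turn into newlines.
# The real filter recognizes more tags than this.
BLOCK_TAGS = {"p", "div", "li"}


class HtmlStripSketch(HTMLParser):
    """Collects text content, dropping tags and decoding entities."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &amp; or &apos; before handing text to handle_data().
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")  # block tags become a newline

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)  # inline tags like <b> simply vanish


def html_strip_sketch(text: str) -> str:
    parser = HtmlStripSketch()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


sample = "<p>JavaSampleApproach&apos;s <b>tutorials</b> are so helpful!</p>"
print(repr(html_strip_sketch(sample)))
```

With this sample, the sketch yields the same single term as the Terms output above, including the leading and trailing newline where the paragraph tags were.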

Configuration

escaped_tags: an array of HTML tags that should not be stripped from the original text.

For example, suppose we want to leave <b> and <p> tags in place:


PUT jsa_index_char_filter_html
{
  "settings": {
    "analysis": {
      "analyzer": {
        "jsa_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["jsa_char_filter"]
        }
      },
      "char_filter": {
        "jsa_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b", "p"]
        }
      }
    }
  }
}

POST jsa_index_char_filter_html/_analyze
{
  "analyzer": "jsa_analyzer",
  "text": "<p>JavaSampleApproach&apos;s <b>tutorials</b> are so helpful!</p>"
}

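The same two requests can also be sent from a script instead of the console. The sketch below uses only Python's standard library and assumes an unsecured cluster at `http://localhost:9200`; `index_settings`, `analyze_body`, `send`, and `run_example` are hypothetical helper names, and the sample markup is an assumption too.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local cluster, no auth
INDEX = "jsa_index_char_filter_html"


def index_settings(escaped_tags):
    """Build the index settings with a custom html_strip char filter."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "jsa_analyzer": {
                        "tokenizer": "keyword",
                        "char_filter": ["jsa_char_filter"],
                    }
                },
                "char_filter": {
                    "jsa_char_filter": {
                        "type": "html_strip",
                        "escaped_tags": escaped_tags,
                    }
                },
            }
        }
    }


def analyze_body(text):
    """Build the body for the index-level _analyze request."""
    return {"analyzer": "jsa_analyzer", "text": text}


def send(method, path, body):
    """Send a JSON request to the cluster and return the parsed response."""
    request = urllib.request.Request(
        f"{ES_URL}/{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def run_example():
    """Create the index, then analyze the sample text (needs a live cluster)."""
    send("PUT", INDEX, index_settings(["b", "p"]))
    sample = "<p>JavaSampleApproach&apos;s <b>tutorials</b> are so helpful!</p>"
    result = send("POST", f"{INDEX}/_analyze", analyze_body(sample))
    return [token["token"] for token in result["tokens"]]
```

Calling `run_example()` against a running cluster should return a single term in which the <p> and <b> tags are still in place, since both are listed in `escaped_tags`.
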
More at:

https://grokonez.com/elasticsearch/elasticsearch-character-filters
