https://grokonez.com/elasticsearch/elasticsearch-character-filters
Elasticsearch Character Filters
Elasticsearch Character Filters preprocess the stream of characters (by adding, removing, or changing characters) before it is passed to the Tokenizer. In this tutorial, we're gonna look at 3 types of Character Filters: HTML Strip, Mapping, and Pattern Replace, which are the building blocks for the character-filter stage of Custom Analyzers.
1. HTML Strip Character Filter
The html_strip character filter can:
- strip out HTML elements (like <b>)
- replace HTML entities with their decoded value (&amp; becomes &).
For example:
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ],
  "text": "<p>JavaSampleApproach&apos;s <b>tutorials</b> are so helpful!</p>"
}
Terms:
[ \nJavaSampleApproach's tutorials are so helpful!\n ]
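To see why the term above starts and ends with \n, here is a rough Python sketch of the two things html_strip does: drop tags (block-level ones become a newline) and decode entities. It is only an approximation written with the standard library, not the Lucene implementation that Elasticsearch uses. The function name `strip_html`, the `BLOCK_TAGS` list, and the sample input with its `<p>` and `<b>` tags are all assumptions made for illustration.

```python
import html
import re

# Tags that this sketch turns into a newline (an assumed, partial list).
BLOCK_TAGS = {"p", "div", "br", "li"}

# Matches an opening or closing tag and captures its name.
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def strip_html(text):
    """Rough stand-in for html_strip: drop tags, then decode entities."""
    def replace_tag(match):
        name = match.group(1).lower()
        # Block-level tags leave a newline behind, inline tags vanish.
        return "\n" if name in BLOCK_TAGS else ""

    without_tags = TAG_RE.sub(replace_tag, text)
    return html.unescape(without_tags)


# Hypothetical input: a paragraph with an entity and a bold word.
sample = "<p>JavaSampleApproach&apos;s <b>tutorials</b> are so helpful!</p>"
result = strip_html(sample)
print(repr(result))
```

With this sample, the opening and closing `<p>` each turn into a newline, `<b>` disappears, and `&apos;` is decoded to an apostrophe, which is the same shape as the single term returned by the keyword tokenizer.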
Configuration
escaped_tags: an array of HTML tags which should not be stripped.
For example, we want to leave <b> and <p> tags in place:
PUT jsa_index_char_filter_html
{
  "settings": {
    "analysis": {
      "analyzer": {
        "jsa_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["jsa_char_filter"]
        }
      },
      "char_filter": {
        "jsa_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b", "p"]
        }
      }
    }
  }
}
POST jsa_index_char_filter_html/_analyze
{
  "analyzer": "jsa_analyzer",
  "text": "<p>JavaSampleApproach&apos;s <b>tutorials</b> are so helpful!</p>"
}
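The effect of escaped_tags can be approximated in Python as well: tags named in the list are passed through untouched, everything else is stripped, and entities are still decoded. As before, this is a hedged sketch and not Elasticsearch's own code. `strip_html_except`, `BLOCK_TAGS`, and the sample input (which adds an `<i>` element so that something actually gets stripped) are hypothetical names and data chosen for illustration.

```python
import html
import re

# Tags that this sketch turns into a newline when stripped (assumed list).
BLOCK_TAGS = {"p", "div", "br", "li"}

# Matches an opening or closing tag and captures its name.
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def strip_html_except(text, escaped_tags):
    """Strip every tag except those listed in escaped_tags, then decode entities."""
    keep = {tag.lower() for tag in escaped_tags}

    def replace_tag(match):
        name = match.group(1).lower()
        if name in keep:
            return match.group(0)  # leave escaped tags exactly as written
        return "\n" if name in BLOCK_TAGS else ""

    return html.unescape(TAG_RE.sub(replace_tag, text))


# Hypothetical input: <p> and <b> should survive, <i> should not.
sample = "<p>JavaSampleApproach&apos;s <b>tutorials</b> are so <i>helpful</i>!</p>"
kept = strip_html_except(sample, ["b", "p"])
print(kept)
```

In this sketch the `<p>` and `<b>` tags stay in the output while `<i>` is removed and `&apos;` becomes an apostrophe. To compare against the real filter, send the same text to the _analyze request above.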
More at:
https://grokonez.com/elasticsearch/elasticsearch-character-filters