Priscilla Parodi for Elastic

NLP and Elastic: Getting started

| Menu | Next Post: NLP HandsOn |

Natural language processing (NLP) is the branch of artificial intelligence (AI) that focuses on understanding human language as closely as possible to human interpretation, combining computational linguistics with statistical, machine learning and deep learning models.

Some examples of NLP tasks:

  • Named entity recognition is a type of information extraction, identifying words or phrases as entities.

  • Sentiment analysis is a type of text classification, attempting to extract subjective emotions from text.

Other NLP tasks are available, depending on your use case.

BERT

In 2018, Google open-sourced a new technique for NLP pre-training called BERT (Bidirectional Encoder Representations from Transformers).

BERT uses “transfer learning”: a model is first pre-trained to learn general linguistic representations and is then adapted to specific tasks. Pre-training refers to how BERT was first trained using unsupervised learning on a large source of plain text: BooksCorpus (800 million words) and English Wikipedia (2,500 million words). Because this text needs no manual labeling, pre-training can draw on far more data than earlier approaches that depended on labeled datasets.

BERT was pre-trained on two tasks: masked language modeling (15% of tokens were masked and BERT was trained to predict them from context) and next sentence prediction (BERT was trained to predict whether a chosen next sentence was probable given the first sentence). With this understanding of language, BERT can be fine-tuned for many other types of NLP tasks with relatively little additional training.
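To make the masking step concrete, here is a minimal sketch of selecting roughly 15% of tokens and hiding them. It is a simplification and not BERT's actual pipeline: it splits on whitespace instead of using WordPiece, and it always substitutes `[MASK]`, whereas the original BERT recipe sometimes keeps or randomly replaces the selected token. The function name `mask_tokens` and its parameters are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, labels); labels maps each masked position to its original token."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one token so short inputs still yield a training target.
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]   # what the model would be trained to predict
        masked[pos] = MASK_TOKEN    # what the model actually sees
    return masked, labels

# Whitespace tokenization is an assumption made for brevity.
tokens = "the quick brown fox jumps over the lazy dog near the quiet river bank today".split()
masked, labels = mask_tokens(tokens)
print(" ".join(masked))
print(labels)
```

During pre-training, the model only receives the masked sequence and is scored on how well it recovers the tokens stored in `labels`.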

By capturing intent and context, and not just keywords, these models come closer to the way humans understand language.

NLP with Elastic

To support models that use the same tokenizer as BERT, Elastic supports the PyTorch library. PyTorch is one of the most popular machine learning libraries and supports neural networks such as the Transformer architecture that BERT uses, which enables NLP tasks.

In general, any trained model that has a supported architecture is deployable in Elasticsearch, including BERT and variants.

These models are listed by NLP task. Currently, these are the tasks supported:

  • Extract information: named entity recognition, fill-mask, question answering

  • Classify text: language identification, text classification, zero-shot text classification

  • Search and compare text: text embedding, text similarity

As with classification and regression, once a trained model is imported you can use it to make predictions (inference).

Note: For NLP tasks you must choose and deploy a third-party NLP model. The exception is language identification, for which Elastic provides a trained model, lang_ident_model_1, in the cluster.
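As an illustration of using the built-in model, the sketch below only builds the JSON body you could send to `POST _ingest/pipeline/_simulate` to try lang_ident_model_1 through an inference processor; it does not contact a cluster. It assumes the model reads its input from a field named `text`; the helper name `build_lang_ident_simulation` and the sample sentence are hypothetical, so check the request shape against the documentation for your Elasticsearch version.

```python
import json

def build_lang_ident_simulation(texts, num_top_classes=3):
    """Build a request body for POST _ingest/pipeline/_simulate (assumed shape)."""
    return {
        "pipeline": {
            "processors": [
                {
                    "inference": {
                        "model_id": "lang_ident_model_1",
                        # Language identification is a classification task.
                        "inference_config": {
                            "classification": {"num_top_classes": num_top_classes}
                        },
                        # Empty map: the document already uses the field name
                        # the model is assumed to expect ("text").
                        "field_map": {},
                    }
                }
            ]
        },
        "docs": [{"_source": {"text": t}} for t in texts],
    }

body = build_lang_ident_simulation(["Sziasztok! Ez egy rövid magyar szöveg."])
print(json.dumps(body, indent=2, ensure_ascii=False))
```

The printed JSON can be pasted into Kibana Dev Tools after the `POST _ingest/pipeline/_simulate` line.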

NLP with Elastic Solutions

There are many ways to add NLP capabilities to your Elastic project. Here are some examples:

  • Security

Spam detection: Text classification is useful for scanning emails for language that often indicates spam, so that the content can be blocked or deleted and malicious emails are kept away from users.

PUT spam-detection/_doc/1
{
  "email subject": "Camera - You are awarded a SiPix Digital Camera! Call 09061221066. Delivery within 28 days.",
  "is_spam": true
}
  • Enterprise Search

Analysis of unstructured text: Entity recognition is useful for structuring text data, adding new field types to your documents and allowing you to analyze more data and obtain even more valuable insights.

PUT /source-index
{
  "mappings": {
    "properties": {
      "input":    { "type": "text" }
    }
  }
}
PUT /new-index
{
  "mappings": {
    "properties": {
      "input":    { "type": "text" },  
      "organization":  { "type": "keyword"  }, 
      "location":   { "type": "keyword"  }     
    }
  }
}
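The step between the two indices can be sketched as follows: take the entities returned by a named entity recognition model and copy them into the `organization` and `location` keyword fields of the new index. This is a sketch under assumptions, not Elastic's implementation. It assumes each entity is a dict with `entity`, `class_name` and `class_probability` keys and that the model labels organizations `ORG` and locations `LOC`; the helper name, the threshold and the sample data are hypothetical.

```python
# Assumed model labels mapped to the keyword fields of new-index.
CLASS_TO_FIELD = {"ORG": "organization", "LOC": "location"}

def to_new_index_doc(source_doc, entities, min_probability=0.5):
    """Build a new-index document from a source-index document and NER entities."""
    doc = {"input": source_doc["input"]}
    for ent in entities:
        field = CLASS_TO_FIELD.get(ent["class_name"])
        # Skip entity classes we do not map, and low-confidence predictions.
        if field is None or ent.get("class_probability", 1.0) < min_probability:
            continue
        values = doc.setdefault(field, [])
        if ent["entity"] not in values:  # avoid duplicate keyword values
            values.append(ent["entity"])
    return doc

# Hypothetical input document and model output.
source = {"input": "Elastic is headquartered in Mountain View, California."}
entities = [
    {"entity": "Elastic", "class_name": "ORG", "class_probability": 0.99},
    {"entity": "Mountain View", "class_name": "LOC", "class_probability": 0.98},
    {"entity": "California", "class_name": "LOC", "class_probability": 0.97},
]
print(to_new_index_doc(source, entities))
```

Keyword fields accept arrays, so a document that mentions several locations keeps all of them and can be aggregated on each value.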
  • Observability

Service request and incident data: Extracting meaning from operational data, including ticket resolution comments, allows you to not only generate alerts during incidents, but also go further by observing your application, predicting behavior, and having more data to improve ticket resolution time.

...
  "_source": {
    "support_ticket_id": 119237,
    "customer_id": 283823,
    "timestamp": "2021-06-06T17:23:02.770Z",
    "text_field": "Response to the case was fast and problem was solved after first response, did not need to provide any additional info.",
    "ml": {
      "inference": {
        "predicted_value": "positive",
        "prediction_probability": 0.9499962712516151,
        "model_id": "heBERT_sentiment_analysis"
      }
    }
  }
...
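Once documents carry an `ml.inference` object like the one above, downstream code can summarize it. The sketch below counts sentiment labels across search hits and files low-confidence predictions under `uncertain`. The hit structure mirrors the excerpt above; the function name, the 0.8 threshold and the second and third sample hits are hypothetical.

```python
from collections import Counter

def summarize_sentiment(hits, min_probability=0.8):
    """Count predicted sentiment labels, treating weak predictions as 'uncertain'."""
    counts = Counter()
    for hit in hits:
        inference = hit.get("_source", {}).get("ml", {}).get("inference", {})
        label = inference.get("predicted_value")
        if label is None:
            continue  # document was not enriched by the model
        probability = inference.get("prediction_probability", 0.0)
        counts[label if probability >= min_probability else "uncertain"] += 1
    return dict(counts)

hits = [
    # Values taken from the excerpt above.
    {"_source": {"support_ticket_id": 119237, "ml": {"inference": {
        "predicted_value": "positive",
        "prediction_probability": 0.9499962712516151}}}},
    # Hypothetical hits for illustration.
    {"_source": {"support_ticket_id": 1, "ml": {"inference": {
        "predicted_value": "negative", "prediction_probability": 0.91}}}},
    {"_source": {"support_ticket_id": 2, "ml": {"inference": {
        "predicted_value": "positive", "prediction_probability": 0.55}}}},
]
print(summarize_sentiment(hits))
```

A summary like this could feed a dashboard or an alert on the share of negative tickets, with the threshold tuned to how much you trust the model.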

NLP HandsOn

Now, let's proceed with an end-to-end example! To prepare for the NLP HandsOn, we will need an Elasticsearch cluster running at least version 8.0 with an ML node. If you haven't created your Elastic Cloud Trial yet, now is the time.

This post is part of a series on artificial intelligence with a focus on the machine learning solution from Elastic (creators of Elasticsearch, Kibana, Logstash and Beats). The series introduces and illustrates the available options and possibilities, and also covers their context and usability.
