Priscilla Parodi for Elastic

Posted on Jun 2, 2022 • Edited on Apr 6, 2023

NLP HandsOn

#nlp #tutorial #elasticsearch

Note: This HandsOn assumes that you have already followed the step-by-step Setup of your Elastic Cloud Trial account, and also that you have read the blog NLP and Elastic: Getting started.
Config: To prepare for the NLP HandsOn, we will need an Elasticsearch cluster running at least version 8.0 with an ML node.

To start using NLP in your Stack, the first thing you need to do is upload a model into your cluster.

In our eland library, a Python Elasticsearch client for exploring and analyzing data in Elasticsearch, we have some simple methods and scripts that allow you to upload models from local disk, or to pull models down from the Hugging Face model hub.

Once models are uploaded into the cluster, you’ll be able to allocate those models to specific ML nodes. Once model allocation is complete, we’re ready for inference.

Eland can be installed from PyPI via pip.

Before you go any further, make sure you have Python installed.

You can check this by running:

Unix/macOS
python3 --version

You should get some output like:
Python 3.8.8

Additionally, you’ll need to make sure you have pip available.

You can check this by running:

Unix/macOS
python3 -m pip --version

You should get some output like:
pip 21.0.1 from …

If you installed Python from source, with an installer from python.org, or via Homebrew you should already have pip.

If you don't have Python and pip installed, install them first.

With that, Eland can be installed from PyPI via pip:

$ python3 -m pip install eland
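If you prefer to run the checks above in one go, here is a minimal sketch in Python. The minimum version used here is an assumption taken from the sample output above, not an official eland requirement, and the function name is just illustrative.

```python
import importlib.util
import sys

# Assumption: 3.8 is used as the floor only because it matches the sample
# output shown earlier; check the eland documentation for the real requirement.
MIN_PYTHON = (3, 8)


def environment_report(min_python=MIN_PYTHON):
    """Summarize whether this interpreter looks ready for eland."""
    return {
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        "python_ok": sys.version_info[:2] >= tuple(min_python),
        # find_spec returns None when a top-level module cannot be found
        "pip_available": importlib.util.find_spec("pip") is not None,
        "eland_installed": importlib.util.find_spec("eland") is not None,
    }


if __name__ == "__main__":
    for name, value in environment_report().items():
        print(f"{name}: {value}")
```

If eland_installed is False after the pip command, you are probably running a different interpreter than the one pip installed into.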

Getting started

To interact with your cluster through the API, we will need to use your Elasticsearch cluster endpoint information.

The endpoint looks like:
https://<user>:<password>@<hostname>:<port>

Open your deployment settings to find your endpoint information and click on the gear icon.

Copy your Elasticsearch endpoint as in the image below.

Note: If you want to try out examples with your own cluster, remember to include your endpoint URLs and authentication details.

Now add the username and password so your request can be authenticated. Your endpoint will look like this:

https://elastic:123456789@00c1f8.es.uscentral1.gcp.cloud.es.io:9243

username: elastic is a built-in superuser that grants full access to cluster management and data indices.

password: If you don't have your password, you will need to reset it and generate a new password.

Copy your endpoint; you'll need it later.
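One detail that is easy to miss: if your password contains characters such as @ or /, they need to be percent-encoded before they go into the URL. Below is a small sketch of how the endpoint could be assembled in Python. The build_endpoint helper is hypothetical, and the values are the sample ones from this post.

```python
from urllib.parse import quote


def build_endpoint(user, password, hostname, port, scheme="https"):
    """Return scheme://user:password@hostname:port with encoded credentials."""
    # safe="" makes quote() encode every reserved character, including "/" and "@"
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"{scheme}://{safe_user}:{safe_password}@{hostname}:{port}"


if __name__ == "__main__":
    print(build_endpoint("elastic", "123456789",
                         "00c1f8.es.uscentral1.gcp.cloud.es.io", 9243))
```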

In parallel, let's locate the first model to import.

We will import the model from Hugging Face, an AI community to build, train and deploy open source machine learning models.

In this demo we will use an arbitrarily chosen sentiment analysis model, but feel free to import whichever model you want to use. You can read more details about this model on the Hugging Face webpage.

Copy the model name as in the image below.

Now that we have all the necessary information (the Elasticsearch cluster endpoint and the name of the model we want to import), let's proceed by importing the model:

Open your terminal and update the following command with your endpoint and model name:

eland_import_hub_model --url https://<user>:<password>@<hostname>:<port> \
--hub-model-id <model_name> \
--task-type <task_type>

In this case we are importing the bhadresh-savani/distilbert-base-uncased-emotion model to run the text_classification task.

In the Hugging Face filters you will be able to see the task of each model. Supported values are fill_mask, ner, question_answering, text_classification, text_embedding, and zero_shot_classification.

eland_import_hub_model --url https://elastic:<password>@<hostname>:<port> \
--hub-model-id bhadresh-savani/distilbert-base-uncased-emotion \
--task-type text_classification
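If you want to script the import instead of typing it, the command can be assembled in Python. This is only a sketch: build_import_command is a hypothetical helper, it uses just the three flags shown above, and the list of task types is the one given in this post.

```python
import shlex

# Task types listed in this post as supported values for --task-type
SUPPORTED_TASK_TYPES = (
    "fill_mask",
    "ner",
    "question_answering",
    "text_classification",
    "text_embedding",
    "zero_shot_classification",
)


def build_import_command(url, hub_model_id, task_type):
    """Return the eland_import_hub_model call as an argument list."""
    if task_type not in SUPPORTED_TASK_TYPES:
        raise ValueError(f"unsupported task type: {task_type!r}")
    return [
        "eland_import_hub_model",
        "--url", url,
        "--hub-model-id", hub_model_id,
        "--task-type", task_type,
    ]


if __name__ == "__main__":
    command = build_import_command(
        "https://elastic:<password>@<hostname>:<port>",  # placeholder endpoint
        "bhadresh-savani/distilbert-base-uncased-emotion",
        "text_classification",
    )
    # Print a shell-quoted version for review; nothing is executed here
    print(shlex.join(command))
```

To actually run it, you could pass the list to subprocess.run(command, check=True). Passing a list rather than a single string avoids shell quoting problems with the password.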

You will see that the Hugging Face model will be loaded directly from the model hub and then your model will be imported into Elasticsearch.

Wait for the process to end.

Let's check if the model was imported.

Click Machine Learning in your Kibana menu.

Under model management click Trained Models:

Your model needs to be on this list, as shown in the image below. If it is not, check whether there was an error message in the previous process.

If your model is on this list, it was imported, but you still need to start the deployment. To do this, in the last column under Actions click Start deployment.

After deploying, the State column will have the value started and under Actions the Start deployment option will be disabled, which means that the deployment is complete.

Let's test our model!

Copy your model ID:

In Kibana's menu, click Dev Tools.

In this UI you will have a console to interact with the REST API of Elasticsearch.

We will use the infer trained model deployment API to evaluate this model.

POST _ml/trained_models/<model_id>/deployment/_infer
{
  "docs": [{ "text_field": "<input>" }]
}

The body of this POST request contains a docs array with a field matching your configured trained model input; typically the field name is text_field. The text_field value is the input you want to run inference on.

In our case it will be:

POST _ml/trained_models/bhadresh-savani__distilbert-base-uncased-emotion/deployment/_infer
{
  "docs": [{ "text_field": "Elastic is the perfect platform for knowledgebase NLP applications" }]
}

Where the model_id is bhadresh-savani__distilbert-base-uncased-emotion and the value that I am using as a test is Elastic is the perfect platform for knowledgebase NLP applications.
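Notice that the model ID in Elasticsearch is not exactly the Hugging Face name: the slash became a double underscore. The sketch below builds the request path and body from the Hugging Face name. The conversion rule (replace the slash, lowercase, keep at most 64 characters) is my understanding of what eland does and should be treated as an assumption, so always confirm the ID in the Trained Models list. Both helper names are hypothetical.

```python
import json


def to_es_model_id(hub_model_id):
    """Approximate the Elasticsearch model ID for a Hugging Face model name."""
    # Assumption: "/" becomes "__", the ID is lowercased and capped at 64 chars
    return hub_model_id.replace("/", "__").lower()[:64]


def build_infer_request(hub_model_id, text, input_field="text_field"):
    """Return (path, body) for the _infer request used in this post."""
    model_id = to_es_model_id(hub_model_id)
    path = f"_ml/trained_models/{model_id}/deployment/_infer"
    # docs is an array holding one document with the model's input field
    body = {"docs": [{input_field: text}]}
    return path, body


if __name__ == "__main__":
    path, body = build_infer_request(
        "bhadresh-savani/distilbert-base-uncased-emotion",
        "Elastic is the perfect platform for knowledgebase NLP applications",
    )
    print("POST", path)
    print(json.dumps(body, indent=2))
```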

Click the play button to send the request:

In this case the predicted sentiment is "joy".

That's it, the model is working. 🚀

Note: You can run more tests to determine if this model works for what you need.

To get all the statistics of your model you can use the _stats request:

GET _ml/trained_models/<model_id>/_stats

Let's continue with part 2: how to run this model on data being ingested.

To do this, let's start by importing a .csv file into Elasticsearch, so we can run the model while importing data.

I think it's interesting to run an analysis on random texts, and tweets are a good use case.

Recently Elon Musk announced his interest in buying Twitter, but before that he was famously active on the platform. As we have a sentiment analysis model, let's proceed with analyzing a sample of Elon's tweets.

I found this dataset on Kaggle, which is a good website for locating datasets.

Note: We don't have a huge amount of data, only 172 KB of tweets posted between November 16, 2012 and September 29, 2017. But as this is not a research paper, this is not a problem.

Feel free to use whatever data you prefer, or even the Twitter API.

Let's download this file:

And import into Elasticsearch.

There are different ways to do this, but since this is a small .csv file, we can use the Upload a file integration.

In the Kibana menu, click Integrations. You will see a list of the integrations we have for collecting data.

Search for Upload a file as in the image below:

Then click Select or drag and drop a file and choose your .csv file, in our case the data_elonmusk.csv that you downloaded earlier.

You will see something similar to the image below:

Click Override settings, rename the Tweet column to text_field, and click Apply. As explained before, there needs to be a field that matches your configured trained model input, which is typically called text_field. With this, the model will be able to identify the field to be analyzed.

After the page loads, click Import.

And then click Advanced to edit the import process settings.

The import process has several steps:

Processing file - Turning the data into NDJSON documents so they can be ingested using the bulk API

Creating index - Creating the index using the settings and mappings objects

Creating ingest pipeline - Creating the ingest pipeline using the ingest pipeline object

Uploading data - Loading data into the new Elasticsearch index

Creating a data view (index pattern) - Creating a Kibana data view (if the user has opted to)
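To make the first step more concrete, here is a rough sketch of what turning CSV rows into NDJSON for the bulk API can look like. This is not Kibana's actual implementation, just an illustration. The csv_to_bulk_ndjson helper and the sample row are hypothetical; only the Tweet and Time column names come from this walkthrough.

```python
import csv
import io
import json


def csv_to_bulk_ndjson(csv_text, rename=None):
    """Convert CSV text into a bulk API body: an action line, then a document line."""
    rename = rename or {}
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Rename columns on the fly, e.g. Tweet -> text_field
        doc = {rename.get(key, key): value for key, value in row.items()}
        # Empty index action: the target index is expected in the request path
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    # A bulk body must end with a newline
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    sample = "Tweet,Time\nhello world,2017-09-29 17:39:19\n"  # made-up row
    print(csv_to_bulk_ndjson(sample, rename={"Tweet": "text_field"}), end="")
```

Each document takes two lines, which is why NDJSON is convenient for streaming large files.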

As you can see the CSV processor is being used in the ingest pipeline to import your document.

Feel free to edit the mapping or ingest pipeline.

In our case we need to edit the ingest pipeline to add our previously trained and imported model.

Add an inference processor that references your model to the processors list, so the model runs on the data being ingested, as in the image below:

{
  "inference": {
    "model_id": "bhadresh-savani__distilbert-base-uncased-emotion"
  }
}
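If you would like to try the processor before importing the whole file, one option is the simulate pipeline API (POST _ingest/pipeline/_simulate in Dev Tools). The sketch below only builds the request body. build_simulate_request is a hypothetical helper and the sample text is made up.

```python
import json


def build_simulate_request(model_id, sample_texts, input_field="text_field"):
    """Build a body for POST _ingest/pipeline/_simulate with one inference processor."""
    return {
        "pipeline": {
            "processors": [
                {"inference": {"model_id": model_id}},
            ]
        },
        # Each simulated document carries its fields under _source
        "docs": [{"_source": {input_field: text}} for text in sample_texts],
    }


if __name__ == "__main__":
    body = build_simulate_request(
        "bhadresh-savani__distilbert-base-uncased-emotion",
        ["What a great launch today"],  # made-up sample text
    )
    print(json.dumps(body, indent=2))
```

The response should show each document as it would look after the pipeline runs, which makes typos in the model ID easy to spot.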

After that add your index name and click Import. If for some reason it doesn't work, repeat the process and check if you typed something incorrectly.

Note: What we are doing is adding your model for inference in the ingest pipeline, so the source doesn't need to be a .csv file. Read more about it here.

When it finishes loading, your screen will look like mine. Click View index in Discover.

If you didn't disable Create data view when you were importing data, you should be able to locate your index by the name you used. Now you can explore your index data.

Next to the word Documents, click Field statistics. This is still a beta feature, but it is excellent for exploring your data. As we can see, according to this sentiment analysis model Elon was feeling joyful in 70% of the analyzed tweets. The second most common sentiment in Elon's tweets was anger, followed by fear.

Let's click on the lens button on the right side of the screen to open Kibana Lens and explore this data.

When the screen loads, click and drag the Time field to explore this data considering the date of each tweet.

Some suggestions based on time will appear. I liked one of them, but I changed the interval from every 30 days to yearly. Also try filtering to keep only predictions with a prediction probability between 0.90 and 1, so that only the results the model is most confident about are charted. From here you can have fun with whatever analysis you want to run.
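The same filter and yearly grouping can be sketched outside of Lens. The example below assumes that the inference results are stored under ml.inference with the fields predicted_value and prediction_probability, and that Time is a string starting with the year. Check one of your documents in Discover to confirm. The documents here are invented for illustration.

```python
from collections import Counter, defaultdict


def confident_labels_by_year(docs, min_probability=0.90):
    """Count predicted labels per year, keeping only confident predictions."""
    counts = defaultdict(Counter)
    for doc in docs:
        inference = doc["ml"]["inference"]  # assumed result location
        if inference["prediction_probability"] >= min_probability:
            year = doc["Time"][:4]  # assumes a "YYYY-..." timestamp string
            counts[year][inference["predicted_value"]] += 1
    return {year: dict(counter) for year, counter in sorted(counts.items())}


if __name__ == "__main__":
    # Invented documents, not real results from the dataset
    sample_docs = [
        {"Time": "2015-03-01 10:00:00",
         "ml": {"inference": {"predicted_value": "joy", "prediction_probability": 0.97}}},
        {"Time": "2015-06-11 08:30:00",
         "ml": {"inference": {"predicted_value": "anger", "prediction_probability": 0.55}}},
        {"Time": "2016-02-20 21:15:00",
         "ml": {"inference": {"predicted_value": "anger", "prediction_probability": 0.91}}},
    ]
    print(confident_labels_by_year(sample_docs))
```

The low-probability anger prediction is dropped, which mirrors what the filter does in the chart.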

Apparently anger has increased over time, but joy remains the most common in Elon's Tweets. Fear increased until the beginning of 2016 but decreased in 2017.

Well, data can be interpreted in several ways. We always need to take into account the model used, its accuracy, the quality of our data, the information we seek, the type of analysis, and our own interpretation, context and knowledge. Still, I believe it is now possible to see how useful it is to analyze language.

For example, try running a classification model with the inference data (which is now a new field) to predict sentiment in addition to checking for influencers. Also try importing other models and using other datasets.

I also imported a NER model to identify entities in the same dataset so we can start to correlate text topics (keywords) with sentiment. The year Elon talked about Tesla the most in this dataset was 2015, which coincides with the year with the greatest increase in joy.

This news is from 2015 and Elon was really positive about Tesla even with the company reporting losses.

Again, these are not necessarily facts. But my goal is to show a little bit of what we can do with NLP analysis and correlation (which does not imply causation 😅).

Let's proceed with the last part: how to run this model on an existing index.

If your data is already indexed and you want to run your model on it without changing the content of the existing index, this is possible. If this is your case, let's proceed with this test.

In the Kibana menu click Ingest Pipeline and then Create pipeline and New pipeline.

Give your pipeline a name and click Add a processor.

The first step is to rename the field that will be inferred to text_field.

For that, add the Rename processor. In Field add the name of the field to be renamed (message in my case) and in Target field add text_field. Then click Add.

Now we will add the Inference processor. Click Add a processor again and then under Model ID add your model ID, in our case: bhadresh-savani__distilbert-base-uncased-emotion

Click Add.

Click Create pipeline and copy the name of your pipeline. You will need it later.
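The pipeline created in the UI steps above can also be written as a request body for PUT _ingest/pipeline/<your-pipeline-name>. Here is a minimal sketch. build_nlp_pipeline is a hypothetical helper, and message is only an example source field name, so replace it with your own.

```python
import json


def build_nlp_pipeline(source_field, model_id, input_field="text_field",
                       description="Rename the input field and run NLP inference"):
    """Build a pipeline body with a rename processor followed by an inference processor."""
    processors = []
    if source_field != input_field:
        # Rename the existing field to the input name the model expects
        processors.append(
            {"rename": {"field": source_field, "target_field": input_field}}
        )
    processors.append({"inference": {"model_id": model_id}})
    return {"description": description, "processors": processors}


if __name__ == "__main__":
    body = build_nlp_pipeline(
        "message",  # example source field, replace with yours
        "bhadresh-savani__distilbert-base-uncased-emotion",
    )
    print(json.dumps(body, indent=2))
```

Order matters here: the rename has to run before the inference processor, otherwise the model will not find its input field.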

Now open Dev Tools and run the following request (adding your source index, dest index and pipeline name):

POST _reindex
{
  "source": {
    "index": "<your-source-index-name>"
  },
  "dest": {
    "index": "<your-ml-dest-index-name>",
    "pipeline": "<your-pipeline-name>"
  }
}

The reindex API copies documents from a source to a destination. You can copy all documents to the destination index or reindex only a subset of them. You can also use source filtering to reindex a subset of the fields in the original documents.

This will take some time. Wait for the successful response, as in the image below:

This new index doesn't have a data view yet, and you need one to access the Elasticsearch data that you want to explore. To create it, click Stack Management in the Kibana menu and then click Data Views.
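To illustrate the subset options, here is a sketch that builds the _reindex body with an optional query (subset of documents) and an optional _source list (subset of fields). build_reindex_request is a hypothetical helper, and the index, pipeline and field names in the usage are examples only.

```python
import json


def build_reindex_request(source_index, dest_index, pipeline,
                          query=None, source_fields=None):
    """Build a _reindex body that sends documents through an ingest pipeline."""
    source = {"index": source_index}
    if query is not None:
        source["query"] = query  # reindex only the matching documents
    if source_fields is not None:
        source["_source"] = list(source_fields)  # copy only these fields
    return {
        "source": source,
        "dest": {"index": dest_index, "pipeline": pipeline},
    }


if __name__ == "__main__":
    body = build_reindex_request(
        "elon-tweets",       # example source index name
        "elon-output-ml",
        "my-nlp-pipeline",   # example pipeline name
        query={"range": {"Time": {"gte": "2015-01-01"}}},
        source_fields=["message", "Time"],
    )
    print(json.dumps(body, indent=2))
```

If you filter fields, remember to keep the field that the pipeline renames to text_field, or the inference processor will have nothing to analyze.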

Click Create new data view and then for the Name field add the name of your new index, in my case it is elon-output-ml. Click Create data view.

Now open Discover and select the new index.

That's it, without making changes to your current index you have a new index with the result of this model.

I hope you enjoy using NLP with the Elastic Stack! Feedback is always welcome.

This post is part of a series that covers Artificial Intelligence with a focus on Elastic's (Creators of Elasticsearch) Machine Learning solution, aiming to introduce and exemplify the possibilities and options available, in addition to addressing the context and usability.