Note: This Hands-On assumes that you have already followed the step-by-step setup of your Elastic Cloud account and added the Samples available there, so you can replicate the analysis shown here. If not, please follow the steps mentioned there.
We've added a Sample containing data you don't know anything about.
As mentioned here, it is very important to understand the data and its types in order to know what kinds of analysis we can do.
So, before proceeding with Anomaly Detection we can, for example, use the Data Visualizer
Kibana>Machine Learning>Data Visualizer to learn more about the available data.
In a real use case you may prefer to change some fields, mappings, or wrong/empty data, but here we will use the available data exactly as it is.
Let's select the Index Pattern
[eCommerce] Orders. When the page loads, you usually won't see any data, because the initial time interval (15 minutes) is too short for this type of data, which is not continuously being updated. To access all available data, click
Use full kibana_sample_data_ecommerce_data.
Now we have some important information: the data type (text/keyword/geo_point/number...), the percentage of documents containing each field, distinct values, distributions, and the option to visualize the data in a chart under
Actions, something we won't cover at this point. This can help us with the most important part before analyzing data, which is: What answers don't we have? What is this data not telling us that would be important to know, based on the needs that I/my company have?
Let's pretend we are an e-commerce owner and we want to analyze our data. The first thing we notice is that we only have one month of data
May 26, 2021 -> Jun 26, 2021, which may not be ideal for a complete analysis, because holidays and events can momentarily change customer behavior.
Maybe it's our first month with the company; after a few years we could do a full analysis and possibly map and add rules to skip holidays and momentary events. For now, though, what we want is an analysis that will be useful at this point, clearly with limited possibilities.
Something that makes sense is to understand who our customers are. After all, in just one month we already have a total of 4675 events with 3321 distinct values for
customer_full_name.keyword, which means that we have a good number of unique customers.
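As a side note, the distinct count the Data Visualizer shows can also be reproduced directly against the index with a cardinality aggregation. Below is a minimal sketch of the request body; actually sending it would need a live cluster (e.g. through the official `elasticsearch` Python client, shown only as a comment):

```python
# Sketch: reproduce the Data Visualizer's distinct-customer count with a
# cardinality aggregation on the same field from the eCommerce sample.
import json

query = {
    "size": 0,  # we only want the aggregation result, not the documents
    "aggs": {
        "unique_customers": {
            "cardinality": {"field": "customer_full_name.keyword"}
        }
    },
}

# With a connected client this would be something like:
#   es.search(index="kibana_sample_data_ecommerce", body=query)
# and the count would come back under
#   response["aggregations"]["unique_customers"]["value"]
print(json.dumps(query, indent=2))
```

Note that `cardinality` is an approximate count on large datasets, which is fine for this kind of exploratory question.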
Another thing that stands out is that these customers are from different continents and countries; we are a global e-commerce company. But how does this population spend its money? Do they behave similarly?
Let's run an Anomaly Detection analysis:
Machine Learning>Anomaly Detection>Create job and select
[eCommerce] Orders then, click
Use a Wizard>Population
Again, let's use the full sample: click
Use full kibana_sample_data_ecommerce_data>Next
To define a population we need a relationship between the data points: not necessarily the same value for all of them, but at least one common field that characterizes the data as belonging to the same population.
For the Population field, let's add some location data.
Now, considering a population characterized by groups from different regions, we want to know how these people spend their money. So, as a metric to identify abnormal behavior, let's add the sum of
taxful_total_price, to understand whether there is abnormal behavior in the total amount spent over time.
Your screen should look like this image:
We can also set the Bucket span on the left side. The Bucket span is usually 15 minutes, but you can click
Estimate bucket span to automatically set a good interval for time-series analysis based on your data. In my case the estimated span was 3h.
And on the right side you will see the
Influencers, in case you want to see the influence of other fields on the result; as you can see, the region will already be there.
Click Next and then define a name for the Job ID; I used the name
pop_price_region. For now we won't add additional or advanced settings. After that, click Next and your page should look like this image:
If your page looks like this, click Next one more time; otherwise, check the error message. Finally, click Create job.
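Under the hood, the wizard assembles a job configuration that is submitted through the ML create-job API (PUT _ml/anomaly_detectors/&lt;job_id&gt;). The sketch below shows roughly what that body looks like for this job; the region field name (geoip.region_name) and time field are assumptions based on the standard eCommerce sample, so adjust them to whatever your wizard shows:

```python
# Sketch of the population job built by the wizard, expressed as the JSON
# body of PUT _ml/anomaly_detectors/pop_price_region. Field names marked
# "assumed" come from the standard eCommerce sample mapping, not from the
# wizard screenshots themselves.
import json

job_id = "pop_price_region"

job_config = {
    "description": "Population analysis of total spend per region",
    "analysis_config": {
        "bucket_span": "3h",  # the span the wizard estimated in my case
        "detectors": [
            {
                # sum(taxful_total_price) analyzed OVER the population field
                "function": "sum",
                "field_name": "taxful_total_price",
                "over_field_name": "geoip.region_name",  # assumed field name
            }
        ],
        "influencers": ["geoip.region_name"],  # assumed field name
    },
    "data_description": {"time_field": "order_date"},  # assumed time field
}

print(f"PUT _ml/anomaly_detectors/{job_id}")
print(json.dumps(job_config, indent=2))
```

The `over_field_name` is what makes this a population job: each region's spend is compared against the behavior of all regions together, rather than against its own history alone.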
After it loads, click View Results. In this case we don't want the job running in real time, but you could do that if you wished.
A new page will load, and as you can see, we don't have any anomaly with a score > 75, which means we don't have a single high-severity event, but we do have two anomalies > 50, in orange.
The two events with
severity>50, taking into account the population and not just a single metric, came from New York on June 17th, 2021 (Current: $998.88 / Typical: $117.59, 8x higher, Probability: 0.00148...), and from Cairo Governorate on June 21st, 2021 (Current: $885.97 / Typical: $118.45, 7x higher, Probability: 0.00234). Although all of this detailed information is important, it is worth remembering that the severity value is a score normalized to 0-100 across all of this data, considering the behavior of the population over the analyzed period of time. This means that a single data point, on its own, is not necessarily relevant: it is possible to find other purchases also 7x higher than typical but with lower severity, for example.
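The "times higher" figures quoted above are simply the ratio between the current and typical values, rounded by the UI. A quick check of the two anomalies:

```python
# Quick arithmetic check of the "x higher" figures for the two anomalies:
# they are the ratio of the current (observed) value to the typical value.
anomalies = [
    ("New York", 998.88, 117.59),
    ("Cairo Governorate", 885.97, 118.45),
]

for region, current, typical in anomalies:
    ratio = current / typical
    print(f"{region}: {round(ratio)}x higher (exact ratio {ratio:.2f})")
```

This recovers the 8x and 7x shown in the results table, while the severity score itself comes from the normalization described above, not from this ratio alone.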
If you want to leave this job running, as mentioned above, you just need to click
Machine Learning>Anomaly Detection>(your job)>Start datafeed, set the start date, and for the end time select
No end time to search in real time; then click Start.
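The same action is available through the start-datafeed API (POST _ml/datafeeds/&lt;datafeed_id&gt;/_start). A minimal sketch of the request, where the datafeed id is an assumption based on Elasticsearch's usual `datafeed-&lt;job_id&gt;` naming and the start date is just the beginning of the sample's time range:

```python
# Sketch: "Start datafeed" via the ML API. Omitting the "end" key is the
# API equivalent of selecting "No end time": the datafeed keeps running
# in real time. The datafeed id below is an assumed name.
import json

datafeed_id = "datafeed-pop_price_region"  # assumed naming convention

start_body = {
    "start": "2021-05-26T00:00:00Z",  # where to begin reading the data
    # no "end" key -> run in real time, like "No end time" in the UI
}

print(f"POST _ml/datafeeds/{datafeed_id}/_start")
print(json.dumps(start_body, indent=2))
```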
You can also create alerts based on severity and connect them to services like Email, IBM Resilient, Jira, Microsoft Teams, and Slack, or even write to an index or create a Webhook connector. There are also APIs to perform machine learning anomaly detection tasks.
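As one example of those APIs, the anomalies shown in the UI can also be fetched programmatically through the get-records results API (GET _ml/anomaly_detectors/&lt;job_id&gt;/results/records), for instance to feed a custom alerting pipeline. A sketch of a query that mirrors the orange (>50) severity band discussed above:

```python
# Sketch: fetch anomaly records above a severity threshold through the
# ML results API, mirroring the orange (>50) band seen in the UI.
import json

job_id = "pop_price_region"

records_query = {
    "record_score": 50,      # only records with severity above 50
    "sort": "record_score",  # most anomalous first
    "desc": True,
}

print(f"GET _ml/anomaly_detectors/{job_id}/results/records")
print(json.dumps(records_query, indent=2))
```

Each returned record carries the actual and typical values, probability, and influencers, i.e. the same details we read off the anomaly table earlier.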
This post is part of a series covering Artificial Intelligence with a focus on the Machine Learning solution from Elastic (creators of Elasticsearch, Kibana, Logstash, and Beats), aiming to introduce and exemplify the available possibilities and options, as well as addressing context and usability.