As the name suggests, the algorithm needs to identify anomalies in the data.
But how does the model identify anomalies?
How do we identify anomalies?
What is abnormal in this image, for example?
And now, what is abnormal?
Identifying patterns is an essential part of how we learn, but the answers are not always obvious. You know what a cat is and what a dog is not because of the pictures I showed you (I never told you which was which), but because you learned it over the course of your life.
We must always remember that algorithms only process the data you send them.
In the case of a child who is still learning the difference between a dog and a cow, for example, the answer might be that all the animals in the image belong to the same pattern: animals. That is not wrong; it simply applies different criteria, grouping by the similar characteristics visible in the available data.
If we are looking for a more specific answer, one that accounts for every relevant detail, variable, and behavior, we must ensure that all data that may be part of the answer is analyzed over time; the more data, the better.
For a child to identify the cat as "abnormal", they would need more examples: more data would need to be analyzed. The conclusion is exactly the same for the algorithms.
With this information you already know that the question "What is abnormal?" is answered in terms of what is normal, and that to learn what is normal the algorithm identifies patterns over time.
There are four types of Anomaly Detection analysis available in Elastic's ML solution:
Single Metric analysis, for jobs that analyze a single time series;
Multi-Metric analysis, to split a single time series into multiple time series (for example, by a field such as host);
Population analysis, to identify abnormal behaviors within a homogeneous "population" over a period of time;
Categorization analysis, a machine learning process that tokenizes a text field, clusters similar data together, and classifies it into categories.
The Anomaly Detection feature analyzes the input data stream, models its behavior using techniques that construct a model best matching your data, and performs analysis based on the detectors you define in your job, taking into account any rules and date ranges you want to ignore or exclude from modeling.
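To make this concrete, the snippet below builds the body of a single-metric job as it would be sent to the Elasticsearch ML API (`PUT _ml/anomaly_detectors/<job_id>`). The job description, metric field, and bucket span are assumptions chosen for illustration, not values from this post.

```python
import json

# Hypothetical body for PUT _ml/anomaly_detectors/<job_id>.
# The metric field and bucket span below are illustrative assumptions.
job_config = {
    "description": "Single metric job: mean CPU usage (illustrative)",
    "analysis_config": {
        "bucket_span": "15m",  # interval used to score each bucket
        "detectors": [
            # One detector: track the mean of a numeric field over time.
            {"function": "mean", "field_name": "system.cpu.total.pct"}
        ],
    },
    "data_description": {"time_field": "@timestamp"},
}

print(json.dumps(job_config, indent=2))
```

The `detectors` list is where the analysis is defined: each detector pairs a function (here `mean`) with the field it watches.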
The blue line in the chart represents the actual data values, and the shaded blue area represents the bounds of the expected values. In the beginning, the range of expected values is wide because the model has not yet seen a significant amount of data for the analyzed time period, so it is not yet capturing the periodicity in the data.
After more data is processed, the model produces expected values close to the actual values, and the shaded blue area hugs the blue line. From that point on, the analysis checks whether values fall outside this area and monitors the anomaly score to indicate the severity of possible anomalies.
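The idea of an "expected area" can be imitated with a toy model: keep a history of observed values and derive bounds from their mean and spread. This is only a sketch with made-up numbers; Elastic's actual modeling is far more sophisticated and also learns trend and periodicity.

```python
import statistics

def expected_bounds(history, k=3.0):
    """Toy expected-value bounds: mean +/- k standard deviations.

    A stand-in for the shaded blue area in the chart; the real model
    learns trend and periodicity rather than a static band.
    """
    mean = statistics.fmean(history)
    sd = statistics.pstdev(history)
    return mean - k * sd, mean + k * sd

history = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]  # made-up metric values
low, high = expected_bounds(history)
print(low <= 10.4 <= high)        # True: inside the expected area
print(not (low <= 25.0 <= high))  # True: outside it, would be flagged
```

As the history grows and stabilizes, the bounds tighten around the typical value, which mirrors the shaded area closing in on the blue line.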
The anomaly score (severity) is a value from 0 to 100, which indicates the significance of the observed anomaly compared to previously seen anomalies. Highly anomalous values are shown in red.
In order to provide a sensible view of the results, an anomaly score is calculated for each bucket time interval (a bucket divides the continuous stream of data into batches, typically between 10 minutes and 1 hour, for processing).
When you review your machine learning results, there is a multi_bucket_impact property that indicates how strongly the final anomaly score is influenced by multi-bucket analysis; anomalies with medium or high multi-bucket impact are represented with a cross symbol instead of a dot.
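As a sketch, the check behind that symbol could look like the following. The record values are made up, and the 2.0 cutoff for "medium" impact is an assumption for illustration, not Kibana's exact threshold; multi_bucket_impact itself ranges from -5.0 to +5.0.

```python
def shown_as_cross(record, medium_threshold=2.0):
    """Return True if a record would be drawn as a cross instead of a dot.

    multi_bucket_impact runs from -5.0 (purely single-bucket) to +5.0
    (purely multi-bucket); the 2.0 "medium" cutoff is an assumption.
    """
    return record.get("multi_bucket_impact", -5.0) >= medium_threshold

# Hypothetical anomaly records; only the fields used here are shown.
records = [
    {"record_score": 92.1, "multi_bucket_impact": 4.3},
    {"record_score": 75.0, "multi_bucket_impact": -2.0},
]
print([shown_as_cross(r) for r in records])  # [True, False]
```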
This post is part of a series that covers Artificial Intelligence with a focus on Elastic's (Creators of Elasticsearch, Kibana, Logstash and Beats) Machine Learning solution, aiming to introduce and exemplify the possibilities and options available, in addition to addressing the context and usability.