Priscilla Parodi for Elastic

Elastic Anomaly Detection - Categorization

| Menu | Next Post: Elastic Anomaly Detection and Data Visualizer HandsOn|

For categorization analysis, the learning process is the same as in other anomaly detection jobs, but the text goes through additional processing steps first.

The input data must be a text field, typically one containing repeated elements, such as log messages. Categorization is not natural language processing (NLP), so it works best on machine-written messages.

When you create a categorization anomaly detection job, the machine learning model sorts the input text into categories and identifies patterns over time, as this example shows:

Input text

Log message:

Jul 20 15:02:19 localhost sshd[8903]: Invalid user admin from 58.218.92.41 port 26062
Jul 20 15:02:19 localhost sshd[8903]: input_userauth_request: invalid user admin [preauth]
Jul 20 15:02:20 localhost sshd[8903]: Connection closed by 58.218.92.41 port 26062 [preauth]
Jul 20 17:10:23 localhost sshd[2074]: Received disconnect from 41.43.112.199 port 41805:11: disconnected by user
Jul 20 17:10:23 localhost sshd[2074]: Disconnected from 41.43.112.199 port 26062
Jul 20 17:10:23 localhost sshd[2072]: pam_unix (sshd:session): session closed for user ec2-user
Jul 20 19:14:55 localhost sshd[8944]: pam_unix (sshd:session): session closed for user ec2-user by (uid=0)
Jul 20 19:17:22 localhost runner: pam_unix(runuser-1:session): session closed for user ec2-user 
Jul 20 19:17:22 localhost runner: pam_unix(runuser-1:session): session opened for user ec2-user by (uid=0)
Jul 20 19:17:23 localhost runner: pam_unix(runuser-1:session): session closed for user ec2-user 

Step 1 - Remove mutable text

Mutable text, such as the date and time, is not taken into account. Because these values are always changing, including them would surface anomalies or patterns where there is nothing relevant.

localhost sshd: Invalid user from port
localhost sshd: input_userauth_request: invalid user [preauth]
localhost sshd: Connection closed by port [preauth]
localhost sshd: Received disconnect from port disconnected by user
localhost sshd: Disconnected from port
localhost sshd: pam_unix session: session closed for user ec2-user
localhost sshd: pam_unix session: session closed for user ec2-user by (uid=0)
localhost runner: pam_unix session: session closed for user ec2-user 
localhost runner: pam_unix session: session opened for user ec2-user by (uid=0)
localhost runner: pam_unix session: session closed for user ec2-user 
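
Step 1 can be approximated with a few regular expressions. The sketch below is only an illustration: the patterns are hand-written assumptions for this sshd sample and are not how Elastic's categorization analyzer is implemented. It strips the timestamp, process id, IP address and port number, but, unlike the output above, it leaves user names such as `admin` in place.

```python
import re

# Hypothetical patterns chosen for this sample only (not Elastic's analyzer).
MUTABLE_PATTERNS = [
    re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+"),  # leading syslog timestamp
    re.compile(r"\[\d+\]"),                       # process id, e.g. [8903]
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),   # IPv4 address
    re.compile(r"(?<=port )\d+(?::\d+:)?"),       # port number, e.g. 26062 or 41805:11:
]


def remove_mutable_text(line: str) -> str:
    """Drop the always-changing parts of a log line and tidy the whitespace."""
    for pattern in MUTABLE_PATTERNS:
        line = pattern.sub("", line)
    return " ".join(line.split())


raw = "Jul 20 15:02:20 localhost sshd[8903]: Connection closed by 58.218.92.41 port 26062 [preauth]"
print(remove_mutable_text(raw))
# localhost sshd: Connection closed by port [preauth]
```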

Step 2 - Cluster similar messages together

A cluster can consist of a single line or of several lines that follow the same pattern, for example lines that are part of the same task.

->mlcategory:1
localhost sshd: Invalid user from port

->mlcategory:2
localhost sshd: input_userauth_request: invalid user [preauth]

->mlcategory:3
localhost sshd: Connection closed by port [preauth]

->mlcategory:4
localhost sshd: Received disconnect from port disconnected by user

->mlcategory:5
localhost sshd: Disconnected from port

->mlcategory:6
localhost sshd: pam_unix session: session closed for user ec2-user
localhost sshd: pam_unix session: session closed for user ec2-user by (uid=0)
localhost runner: pam_unix session: session closed for user ec2-user
localhost runner: pam_unix session: session opened for user ec2-user by (uid=0)
localhost runner: pam_unix session: session closed for user ec2-user
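
To make the idea of clustering concrete, here is a minimal sketch that groups the cleaned messages by token overlap (Jaccard similarity). This technique is swapped in purely for illustration and is not Elastic's categorization algorithm. The `categorize` helper and the 0.6 threshold are assumptions, with the threshold picked so that this toy input reproduces the six categories listed above.

```python
def jaccard(a: set, b: set) -> float:
    """Share of tokens that two messages have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def categorize(messages, threshold=0.6):
    """Greedy clustering: join the first category that holds a similar message."""
    categories = []  # one list of token sets per category
    labels = []      # 1-based category id for each input message
    for message in messages:
        tokens = set(message.split())
        for index, members in enumerate(categories):
            if any(jaccard(tokens, member) >= threshold for member in members):
                members.append(tokens)
                labels.append(index + 1)
                break
        else:  # no existing category was similar enough
            categories.append([tokens])
            labels.append(len(categories))
    return labels


cleaned = [
    "localhost sshd: Invalid user from port",
    "localhost sshd: input_userauth_request: invalid user [preauth]",
    "localhost sshd: Connection closed by port [preauth]",
    "localhost sshd: Received disconnect from port disconnected by user",
    "localhost sshd: Disconnected from port",
    "localhost sshd: pam_unix session: session closed for user ec2-user",
    "localhost sshd: pam_unix session: session closed for user ec2-user by (uid=0)",
    "localhost runner: pam_unix session: session closed for user ec2-user",
    "localhost runner: pam_unix session: session opened for user ec2-user by (uid=0)",
    "localhost runner: pam_unix session: session closed for user ec2-user",
]
for label, message in zip(categorize(cleaned), cleaned):
    print(f"mlcategory:{label}  {message}")
```

With a lower threshold, some of the sshd messages would start to merge (categories 1 and 4 share half of their tokens), which shows how sensitive a simple similarity rule is to its tuning.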

Step 3 - Count per time bucket

By analyzing time buckets, the behavior of each cluster can be identified more easily and checked for anomalies.
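
Counting per time bucket can be sketched as follows. The one-hour bucket span, the year 2020 (syslog timestamps carry no year) and the helper names are assumptions made for this example. The category ids follow the clustering shown in Step 2.

```python
from collections import Counter
from datetime import datetime, timedelta


def bucket_start(ts: datetime, span: timedelta) -> datetime:
    """Round a timestamp down to the start of its time bucket."""
    epoch = datetime(1970, 1, 1)
    return epoch + ((ts - epoch) // span) * span


def count_per_bucket(events, span):
    """Count events per (bucket start, category) pair."""
    counts = Counter()
    for ts, category in events:
        counts[(bucket_start(ts, span), category)] += 1
    return counts


# (time of day, mlcategory) for the ten sample log lines; the year is assumed.
sample = [
    ("15:02:19", 1), ("15:02:19", 2), ("15:02:20", 3),
    ("17:10:23", 4), ("17:10:23", 5), ("17:10:23", 6),
    ("19:14:55", 6), ("19:17:22", 6), ("19:17:22", 6), ("19:17:23", 6),
]
events = [
    (datetime.strptime(f"2020-07-20 {clock}", "%Y-%m-%d %H:%M:%S"), category)
    for clock, category in sample
]

counts = count_per_bucket(events, timedelta(hours=1))
for (start, category), count in sorted(counts.items()):
    print(f"{start:%H:%M}  mlcategory:{category}  count={count}")
```

A series of counts like this, one per category and bucket, is the kind of input an anomaly check can then compare against what is normal for that category.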

The image below shows an example of how each mlcategory behaves over time, which is the basis for further time bucket analysis:

[Image: behavior of each mlcategory over time]

For example, in a specific time bucket, we could see mlcategory:1 followed by mlcategory:4, twice:

mlcategory:1 -> mlcategory:4 -> mlcategory:1 -> mlcategory:4.

For reference, we could call this bucket 1, the next one bucket 2, and so on.

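
For orientation, the request body of a categorization job can be sketched as a Python dictionary. This follows the documented shape of Elasticsearch's create anomaly detection job API (`PUT _ml/anomaly_detectors/<job_id>`). The `message` field, the `@timestamp` time field, the 15-minute bucket span and the description are assumptions for this example, so adapt them to your own index.

```python
import json

# Hypothetical job settings: field names and bucket span are example values.
job_body = {
    "description": "Categorize sshd log messages (example)",
    "analysis_config": {
        "bucket_span": "15m",                    # size of each time bucket
        "categorization_field_name": "message",  # text field to categorize
        "detectors": [
            {
                "function": "count",             # count messages per bucket
                "by_field_name": "mlcategory",   # one series per category
            }
        ],
    },
    "data_description": {"time_field": "@timestamp"},
}

print(json.dumps(job_body, indent=2))
```

Setting `by_field_name` to `mlcategory` is what links the detector to the categories built from the text field, so the count is tracked separately for each category.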

This post is part of a series on Artificial Intelligence that focuses on the Machine Learning solution from Elastic (the creators of Elasticsearch). The series introduces and exemplifies the possibilities and options available, and also addresses context and usability.
