Practical log anomaly detection using machine learning

#devops #testing

Catching Faults Missed by Automated Test and Monitoring Tools

As software gets more complex, it gets harder to test all possible failure modes within a reasonable time. Monitoring can catch known problems, but only through pre-defined instrumentation. New (unknown) software problems slip through when detection depends on that pre-instrumentation and only looks for known failure modes.

What’s needed is a way to learn normal software patterns automatically and reliably detect abnormal ones. There have been earlier attempts to do this, but a good solution is hard: it needs to work close to real time, work without (impractical) training, and not annoy developers with too many false positives.

Our team has worked on these challenges for a long time and has built something that works. We improve the accuracy of pattern learning by first learning the foundational “dictionary” of all unique event types generated by your software. Our ML can do this with surprisingly little data (as little as a couple of MB, although more data helps). It extracts all event structure in near real time, including typed variables and metrics embedded in logs.

Building this structured event dictionary lets us accurately learn the normal patterns of every unique event type, and lets us reliably detect “anomalies”, meaning events that break pattern. Factors considered include: occurrence of a new event type, change in frequency or periodicity, severity, and correlations between anomalies within one or more files or streams. Because it fully learns (and keeps adapting to) event structure, our software is also a strong building block for capturing known failure patterns.
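To make the idea concrete, here is a minimal Python sketch of the general technique: build a dictionary of event types, then flag events that break pattern. It is not our actual implementation. The regex-based variable masking, the function names (`event_type`, `learn_baseline`, `find_anomalies`), and the fixed `ratio` threshold are all assumptions chosen for illustration, and it covers only two of the factors listed above (new event types and frequency changes).

```python
import re
from collections import Counter

# Assumption: variables are recognised with a few hand-written regexes.
# A real system would learn event structure rather than hard-code it.
_VARIABLE_PATTERNS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<NUM>"),
]


def event_type(line):
    """Mask variable-looking tokens so that lines differing only in their
    parameters map to the same event type."""
    for pattern, placeholder in _VARIABLE_PATTERNS:
        line = pattern.sub(placeholder, line)
    return line.strip()


def learn_baseline(windows):
    """Build the event dictionary from historical windows (lists of log
    lines) and return the mean count per window for each event type."""
    totals = Counter()
    for window in windows:
        totals.update(event_type(line) for line in window)
    return {etype: count / len(windows) for etype, count in totals.items()}


def find_anomalies(baseline, window, ratio=3.0):
    """Flag event types that are new, or whose count in this window differs
    from the baseline mean by more than a factor of `ratio`."""
    counts = Counter(event_type(line) for line in window)
    anomalies = []
    for etype in set(baseline) | set(counts):
        expected = baseline.get(etype)
        count = counts.get(etype, 0)
        if expected is None:
            anomalies.append(("new_event_type", etype))
        elif count > ratio * expected or count < expected / ratio:
            anomalies.append(("frequency_change", etype))
    return sorted(anomalies)
```

Calling `learn_baseline` on a handful of past time windows and then `find_anomalies` on the latest window returns a list of `(reason, event_type)` pairs. A fixed ratio is a crude stand-in for learned per-event behavior; it will be noisy for rare events, which is part of why periodicity, severity, and cross-stream correlation matter in practice.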

How can you test its effectiveness?

From a user’s perspective, it can be hard to verify claims about the effectiveness of machine learning and anomaly detection. Positive anecdotes from other users may not apply to your application. Free trials help, but take some commitment in terms of planning and effort.

So here’s the easiest way we could come up with to test our log anomaly detection. Just enter an email address and upload up to 5 (related) log files at a time, for example from 5 different services within your stack. Within minutes you’ll get a report with a list of the anomalies we found (including the factors that caused them to be flagged) and a fingerprint visualizing the event patterns within your logs. You’ll also see examples of how we auto-parsed the structure in your log events. The service is designed with security in mind: your data is encrypted in transit and at rest, and deleted upon completion of the test.

Naturally our ML gets better at extracting event structure and characterizing anomalies with more data (including data from multiple services in your application stack). So if you like the fault signal report, we also welcome you to request a password and try our full functionality. It’s free for up to 500MB per day – no time limits. The full service provides a richer set of capabilities than the fingerprint report, including easy options for log streaming, real time alerts, rich and customizable visualizations, the ability to create your own fault signatures, and a whole lot more.

To get started with the anomaly report, just click here.

Note: Posted with permission of the author Ajay Singh@Zebrium
