Building a new shift-left approach for alerting

#opensource #monitoring #discuss #devops

Hi Community!
Looking forward to hearing your thoughts on this!

Keep is an open-source alerting CLI tool that @shaharglazner and I wrote to address a pain we felt throughout our careers as developers and developer managers.
Alerting (aka monitors/alarms) has always felt like a second-class citizen in monitoring, observability, and infrastructure tools, with a very narrow feature set. That in turn results in poor alerts, alert fatigue (yes, your muted Slack channel), unreliable products, and complete alerting hell.

It's not only that we couldn't create better application/infrastructure alerts; it's also tough to maintain them and ensure they keep working over time.

Organizations today have so many tools they use for alerting that it's becoming an absolute nightmare.

Alerting as a first-class citizen

The best way to describe what we had in mind when we first built Keep is how one of our first users puts it:

Keep is doing to alerting what GitHub Actions did to CI/CD

There were three main guidelines when we started coding:

Good alerts are not just thresholds over metrics or logs; they should be treated as workflows with multiple "tests" (steps/actions).
The tool should be 100% data agnostic - agnostic to where data resides (& not only "traditional" data sources but also a DB, for example). There's no real reason why it shouldn't be abstracted from developers.
Maintained and lives in your code - allowing it to be integrated with all CI/CD processes (imagine a gate that fails your PR when you break alerts).
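
To make the three guidelines concrete, here is a minimal Python sketch. It is not Keep's actual syntax or API; every name in it (`Provider`, `StaticProvider`, `Step`, `Alert`, `validate`, `evaluate`) is hypothetical and only illustrates the idea: an alert is a workflow of steps, each step talks to a data source through a common interface, and the definition lives in code so a CI job could validate it.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


class Provider(Protocol):
    """Hypothetical data-source interface: anything that can answer a query."""

    def query(self, expression: str) -> float: ...


@dataclass
class StaticProvider:
    """Stand-in provider backed by a dict; a real one might wrap a DB or a metrics API."""

    values: dict[str, float]

    def query(self, expression: str) -> float:
        return self.values[expression]


@dataclass
class Step:
    """One "test" in the alert workflow."""

    name: str
    provider: str                       # key into the providers mapping
    query: str                          # expression passed to the provider
    condition: Callable[[float], bool]  # True means this step's check is met


@dataclass
class Alert:
    id: str
    steps: list[Step] = field(default_factory=list)


def validate(alert: Alert, providers: dict[str, Provider]) -> list[str]:
    """Static checks that a CI gate could run on every PR (assumed policy)."""
    errors = []
    if not alert.steps:
        errors.append(f"{alert.id}: alert has no steps")
    for step in alert.steps:
        if step.provider not in providers:
            errors.append(f"{alert.id}/{step.name}: unknown provider '{step.provider}'")
    return errors


def evaluate(alert: Alert, providers: dict[str, Provider]) -> bool:
    """The alert fires only if it has steps and every step's condition holds."""
    return bool(alert.steps) and all(
        step.condition(providers[step.provider].query(step.query))
        for step in alert.steps
    )


# Example: two steps against two different kinds of data source.
providers = {
    "metrics": StaticProvider({"error_rate": 0.07}),
    "db": StaticProvider({"failed_orders": 12}),
}
alert = Alert(
    id="checkout-errors",
    steps=[
        Step("error-rate-high", "metrics", "error_rate", lambda v: v > 0.05),
        Step("orders-failing", "db", "failed_orders", lambda v: v > 10),
    ],
)
```

In this sketch, a CI job would fail the PR whenever `validate(alert, providers)` returns a non-empty list, while `evaluate(alert, providers)` is what would run against live data after deployment.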

What's Ahead?

We constantly try to improve on our promise:

Try our first mock alert and get it up and running in <5 minutes

So we're adding plenty more deployment options, providers, and functions. We're also working on simplifying the syntax further.

What do you think about the need for this kind of "abstraction"? What do you think about alerts as post-production tests? How do you manage and control your alerting chaos right now?

Would love to hear your thoughts; feel free to comment here, on our GitHub repo, or in our Slack.

Top comments (4)

Vijay Jangir • Nov 10 '23

A couple of questions.
From an implementation point of view:

If alerts are part of the PR, how will changing only an alert threshold for an existing release version work? Will a new PR mean cutting a new version and a new deployment? Keeping the alerts in a separate repository becomes a problem again, as we still try to avoid creating a new repo for GitOps as well.
From a configuration point of view:
How do we configure alerts? Do you have any examples that cover most of the scenarios for classic alerts used with SRE principles?

Tal Borenstein • Nov 16 '23

If alerts are part of the PR, how will changing only an alert threshold for an existing release version work? Will a new PR mean cutting a new version and a new deployment? Keeping the alerts in a separate repository becomes a problem again, as we still try to avoid creating a new repo for GitOps as well.

It's an interesting question and depends on the implementation details, but the user should configure how to handle it.
Perhaps a separate version for alerts?