DEV Community

For the Love of Bleep! Building a Scalable Monitoring System

Molly Struve (she/her) on May 01, 2019

One of my jobs as a Site Reliability Engineer (SRE) is to ensure that our application has a solid monitoring system. You cannot guarantee reliabilit...


rhymes • May 4 '19

Hi Molly, nice article! Consolidating monitoring tools is definitely a plus, and for the on-call people it's probably even something that preserves their sanity :D

On the note of consolidation and observability of a distributed system, have you had the chance to take a look at Honeycomb? I'd love to read your take on it.

Molly Struve (she/her) • May 5 '19

I just looked at Honeycomb for the first time and it looks pretty nice! A couple of things I consider when choosing a monitoring solution:

- Does it integrate with all the third-party services you need it to integrate with?
- Does it integrate well into your application? For example, New Relic was a pain for us to integrate with our Rails application. We had a lot of monkey patches in place just to get it to work, and we had code included all over the place in our app. Datadog, on the other hand, integrated seamlessly. We included the gem, set up a single configuration file, deployed the app, and that was it. There was a large net code deletion when we switched.

rhymes • May 5 '19

Thanks for the reply!

New Relic has a weird pricing model as well.

About Honeycomb, you should follow its CEO regardless; she's a very interesting voice in the observability/monitoring landscape: twitter.com/mipsytipsy

I think its Ruby client library is still a bit raw (they are a Go shop IIRC), but the idea behind it is super solid.

Molly Struve (she/her) • May 5 '19

Oh yeah, I do follow her!!! She seems pretty awesome 😃

Timothy McGrath • May 1 '19

Good post. I have this issue, too... there are too many alerts that show up that are false positives, which causes us to not trust any of the alerts.

I'm going to check out Datadog. I've tried to use Azure Application Insights for this, but it gets expensive really quickly.

Valentin Eleftheriou • May 2 '19

Nice post!

I'm happy to see we're not the only ones using Datadog for our monitoring. We use it as a central console, and with a few Twilio webhooks we are now able to have customized voice call & SMS alerting for whoever is on-call.

But still, you have to be careful and closely manage your Datadog setup, otherwise it will end up like all the others: clogged and false-alerting for pretty much everything. To avoid that, we conduct a monthly review of the alerts and tune them so they are as accurate as possible :-)
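A monthly review like this could be sketched roughly as follows, assuming you can export alert history as records of which alert fired and whether anyone actually had to act on it. The `AlertEvent` shape, the `noisy_alerts` helper, and the thresholds are all hypothetical names and values made up for illustration, not anything Datadog provides:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertEvent:
    """One firing of an alert (hypothetical export format)."""
    alert_name: str
    actionable: bool  # True if the on-call person actually had to do something


def noisy_alerts(events, max_false_ratio=0.5, min_firings=3):
    """Return {alert_name: false_positive_ratio} for alerts worth tuning.

    An alert is flagged when it fired at least `min_firings` times and more
    than `max_false_ratio` of those firings were not actionable.
    Both thresholds are arbitrary assumptions; pick your own.
    """
    fired = defaultdict(int)
    false_positives = defaultdict(int)
    for event in events:
        fired[event.alert_name] += 1
        if not event.actionable:
            false_positives[event.alert_name] += 1

    flagged = {}
    for name, count in fired.items():
        ratio = false_positives[name] / count
        if count >= min_firings and ratio > max_false_ratio:
            flagged[name] = ratio
    return flagged


if __name__ == "__main__":
    # Made-up history: disk_space mostly cries wolf, api_5xx is always real.
    history = (
        [AlertEvent("disk_space", False)] * 3
        + [AlertEvent("disk_space", True)]
        + [AlertEvent("api_5xx", True)] * 3
    )
    print(noisy_alerts(history))
```

Anything that shows up in the result is a candidate for a tighter threshold, a longer evaluation window, or deletion.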

Molly Struve (she/her) • May 2 '19

Thanks!

So true, you definitely have to stay on top of all your alerts. I like the idea of a monthly review, I will keep that in mind. It is nice, though, that when you do that monthly review you only have one place you have to go 😃

Rafael Jesus • Jun 3 '19 • Edited

Good to see yet another SRE team taking ownership of monitoring! At HelloFresh we were in the same situation; actually, we had tons of infrastructure and product services w/out any sort of monitoring.

With our move to k8s, anything that runs on top of it can leverage system metrics (CPU, Memory, Network etc...). Services that expose HTTP endpoints at the k8s edge (ingress) can have RED metrics (Req, Err, Duration) automatically. Since edge metrics are common, we were able to automate away dashboards by creating one general dashboard that lets ppl filter by service name. Automating alerts was also possible.
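For anyone unfamiliar with RED metrics, here is a minimal sketch of the idea: group edge requests by service and compute request rate, error ratio, and duration for each. The `Request` record and `red_metrics` function are hypothetical, and counting only 5xx responses as errors is an assumption; a real setup would read these numbers from the ingress controller's own metrics rather than from raw records like this.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One HTTP request seen at the edge (hypothetical record shape)."""
    service: str
    status: int
    duration_ms: float


def red_metrics(requests, window_seconds):
    """Compute per-service RED metrics over a single time window."""
    grouped = defaultdict(list)
    for req in requests:
        grouped[req.service].append(req)

    metrics = {}
    for service, reqs in grouped.items():
        # Assumption: only 5xx responses count as errors.
        errors = sum(1 for r in reqs if r.status >= 500)
        metrics[service] = {
            "rate_per_s": len(reqs) / window_seconds,
            "error_ratio": errors / len(reqs),
            "avg_duration_ms": sum(r.duration_ms for r in reqs) / len(reqs),
        }
    return metrics


if __name__ == "__main__":
    sample = [
        Request("checkout", 200, 100.0),
        Request("checkout", 500, 300.0),
        Request("menu", 200, 50.0),
    ]
    all_metrics = red_metrics(sample, window_seconds=60)
    # "Filter by service name" is just a lookup on the shared structure.
    print(all_metrics["checkout"])
```

Because every service produces the same three numbers, one generic dashboard (or alert rule) keyed on the service name can cover all of them.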

We truly believe that w/out monitoring, software ownership is not possible. Incidents are now detected (MTTD) and recovered from (MTTR) much faster.

We tune alerts religiously. TBH I don't even know how we could be flying w/out the monitoring we have nowadays.

Molly Struve (she/her) • Jun 3 '19

Right?! Once you have a good monitoring system in place it's hard to envision life without it!

Vinay Hegde • May 2 '19 • Edited

Some very good points on why having detailed visibility into an application is important, thanks Molly!

We were also exploring Datadog, which is indeed comprehensive but quickly ends up being very expensive since it bills by the number of hosts / APMs.

For small / medium scale organizations, cost is a major factor, so we eventually discovered AppSignal, which bills you only on the number of requests (both web & background) rather than hosts. It has very effective alerting and 30-day retention, and its pricing is very affordable. They also have lots of 3rd party add-ons integrated right out of the box, plus you can always add custom webhooks should you need more.

The downside? It's currently available only for Ruby / Elixir apps.

PS: They have a fantastic sales & engineering team who'll help you with any kind of query, so it's worth checking out.

Molly Struve (she/her) • May 2 '19

Awesome! Thanks for the alternative suggestion and insight 😊

Ryan Holton • Dec 22 '20

I've built a website monitoring tool that might be of interest to some of you, take a look here