donis

Posted on Mar 7, 2019 • Originally published at donis.github.io on Feb 19, 2019

Tracking one metric opened a whole new world for me

#metrics #datadog #monitoring #sre

I now know these things about my API after tracking one metric:

how many successful requests are made
which endpoints are most used
how heavily loaded my API is at certain times
how many responses are unsuccessful and which status codes they return
which endpoints are slow
top API client devices
performance of servers
various cuts of API response times (by client, by endpoint)
requests per second
top erroring endpoints

It looks something like this:

It’s a fountain of information! Just look at those beautiful graphs 💖

How? The only metric I track to get all of the aforementioned information is represented by this PHP code:

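    // One timing metric per request; the tags are what make it sliceable.
    $metric->duration('api.response', [
        'route' => $request->getAttribute('route'),               // which endpoint was called
        'code' => $response->getStatusCode(),                     // HTTP status code
        'client-agent' => $request->getAttribute('client-agent'), // client device
        'successful' => $response->getStatusCode() < 400 ? 'true' : 'false',
        'environment' => 'production',
        'webserver' => 'web-1'
    ], $request->getAttribute('startTime')); // request start time, used to measure the duration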

Looks simple and quick to implement? Yes, but I will try to explain what it all means.

If you know a thing or two about SRE (Site Reliability Engineering), you probably already see that this covers three of the four “golden signals” of monitoring - latency, traffic and errors (saturation is the fourth).

Disclaimers

I’m not a good writer nor a good teacher, but I hope someone will find this useful. If you could spare a few minutes to write some feedback - negative or positive - it will help me to learn, thank you!

I’m using Datadog as a metric service because it’s a managed SaaS, works great and I know it well. I’m a customer, not affiliated with them. I mention a few alternative services at the end of the article.

PHP is just an example; the same can easily be done in any other language that powers your API (or anything else).

I’m only focusing on an API, but the possibilities are unlimited. Use this as an inspiration.

There are no implementation details in this article. I’ll try to link to resources at the end to help you along.

Why do I want metrics for my team and my employer?

Assumptions are the root of all bad decisions, I learned once. It doesn’t matter how big or small the project is - metrics reveal a whole new layer of visibility and let me avoid assumptions. To make a good decision on how to prioritise my work, where to pay back technical debt first or which parts of my project are riskier to change, it is very important that I have some piece of data in my hands to base it on. I want to know what happens after I deploy a new version of my code. Does it break responses? Does it slow everyone down? I want to know if this new feature I released is used! Don’t you? It gives me a sense of accomplishment - “Yes, this works, mission accomplished and I can move on!”.

If my employer doesn’t have some form of automated deployment, having metrics would be the first step towards more frequent feature delivery and automated deployment. It’s such an immediate feedback loop - deployment happens, metrics show a bunch of 500s, oops, rollback is initiated, no more 500s, everyone’s happy. Customers might even have written off this downtime as a hiccup. I don’t like customers calling my colleagues in the call centre and telling them that the project I’m responsible for is not working. It means I failed at my job! Having metrics helps me and my team to succeed at our job.

Once I had to comply with a rule to deploy only at certain times or on certain days. I hate these rules! No deploys on Friday, no deploys after 16:00 - sound familiar? It’s all because of fear and uncertainty: if something breaks, customers won’t be able to reach anyone and the system will be down until the next morning or, worse, over the weekend. When I have metrics, I have an immediate feedback loop on the health of my system. If my change, deployed on Friday at 17:00, breaks something - I just revert, deploy again, see that my application landscape metrics are green and go have a nice weekend. This immediate feedback loop allows me to deploy much, much more frequently and toss those stupid rules into the trash bin.

Warning, strong opinion incoming. If the team, or my employer, does not have any sort of metrics for the things they are responsible for - they are out of control. Perhaps there is an illusion of control, but it is created by assumptions, some sort of feedback from the clients and gut feelings. How do you focus on the most important things in growing your company without actually having any data to back the decision?

I had a situation once - our team spent many long months building this brilliant feature, the project manager was excited, the owners were excited and hyped. But there was no concept of “baby steps”, no MVP or A/B testing to actually validate whether the idea was worth investing in. After finishing it, deploying it and being happy that it worked, we patted each other on the back and went on to work on the next big feature. We ruled the world. A couple of months later our project got some metrics, so we could finally see how our hyped, hard-won feature was actually used! To our jaw-dropping surprise - it was utterly forgotten and no customers cared about it. We were blind and thought everything was alright.

All that glitters is not gold

There is a big catch to having metrics. It’s easy to get them in and plot the data into lovely graphs, but they might lie to you a little. Here’s an example - I thought our API client devices split like this - 60% Android, 30% iOS and the remaining 10% desktop. When I got the metrics in, they showed me a completely different picture - 60% desktop, 30% iOS and 10% Android. My jaw dropped that it was so different from my assumptions and gathered knowledge.

After deeper analysis I understood that not all clients make API calls equally. I had to plot this particular graph from more specific data - by taking one API endpoint which I know for sure all platforms call the same number of times per session. I got the correct information after that, as well as a new question - why is there such a big difference?

It is very important to give the metrics a few days, or better a few weeks, to flow in. It gave me a whole span of different timeframes to analyze and interpret, and let me spot weird information like the device percentages. I could see the difference between weekday and weekend usage, day and night usage. When I had doubts, I could verify what I saw with my peers, and having a bigger set of data helped us both to understand the trend. If you plan to base decisions on the metrics you collect, always double-check what you see against other sources of data (if you have any) and with your peers, and tune the graphs to reflect reality before making big choices.

There are different ways to plot data on graphs and Datadog has good articles about it. For example, you could plot response time as an average while in reality you have many very quick endpoints and a few very slow ones. The average blends the two together and creates a false sense that everything is alright. It reflects the real-life situation better if you plot the 95th percentile, or exclude those outliers from the main graphs completely and make separate graphs only for them.
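To make the device example concrete, here is a minimal sketch with made-up traffic. The `shareByClient()` helper, the `/login` and `/feed` routes and the session numbers are all assumptions for illustration, not my real data: every session calls `/login` exactly once, but desktop sessions make far more follow-up calls than mobile ones.

```php
<?php
declare(strict_types=1);

/**
 * Percentage of requests per client, optionally restricted to one route.
 * Hypothetical helper for illustration only.
 */
function shareByClient(array $requests, ?string $route = null): array
{
    $counts = [];
    foreach ($requests as [$client, $requestRoute]) {
        if ($route !== null && $requestRoute !== $route) {
            continue; // skip requests to other routes
        }
        $counts[$client] = ($counts[$client] ?? 0) + 1;
    }

    $total = array_sum($counts);
    $shares = [];
    foreach ($counts as $client => $count) {
        $shares[$client] = round($count / $total * 100, 1);
    }
    return $shares;
}

// Made-up profiles: client => [number of sessions, extra /feed calls per session].
$profiles = ['android' => [6, 0], 'ios' => [3, 5], 'desktop' => [1, 35]];

$requests = [];
foreach ($profiles as $client => [$sessions, $extraCalls]) {
    for ($s = 0; $s < $sessions; $s++) {
        $requests[] = [$client, '/login']; // exactly once per session
        for ($i = 0; $i < $extraCalls; $i++) {
            $requests[] = [$client, '/feed'];
        }
    }
}

print_r(shareByClient($requests));           // android 10, ios 30, desktop 60
print_r(shareByClient($requests, '/login')); // android 60, ios 30, desktop 10
```

Counting all requests answers "who generates the load", while counting a once-per-session endpoint answers "who are my users". Both are valid graphs, they just answer different questions.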
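A small worked example of average versus percentile, using made-up durations. The `mean()` and `percentile()` helpers are my own illustration (nearest-rank method); a metrics service may compute percentiles differently.

```php
<?php
declare(strict_types=1);

/** Arithmetic mean of a non-empty list of numbers. */
function mean(array $values): float
{
    return array_sum($values) / count($values);
}

/** Nearest-rank percentile: the smallest value with at least $percent% of the data at or below it. */
function percentile(array $values, float $percent): float
{
    if ($values === []) {
        throw new InvalidArgumentException('Need at least one value.');
    }
    sort($values);
    $rank = (int) ceil($percent * count($values) / 100);
    return (float) $values[max($rank, 1) - 1];
}

// 18 quick responses and 2 very slow ones, in milliseconds (made-up numbers).
$durations = array_merge(array_fill(0, 18, 50), [2000, 2500]);

printf("mean: %.0f ms\n", mean($durations));             // mean: 270 ms
printf("p50:  %.0f ms\n", percentile($durations, 50.0)); // p50:  50 ms
printf("p95:  %.0f ms\n", percentile($durations, 95.0)); // p95:  2000 ms
```

The mean of 270 ms describes no real request here: most take 50 ms and a few take two seconds or more. The median and the 95th percentile together show both groups.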

Performance

I bet you have a question, especially if you work on a bigger project - how will this affect the performance of the project that is being monitored? The most I’ve sent to Datadog was, I think, around a hundred different metrics per request, each with 5-8 tags. The Datadog agent ships with DogStatsD, its own flavour of statsd, which receives metrics over UDP. UDP is not as reliable as TCP, but it is blazing fast - basically fire & forget. In the background the agent aggregates that information and sends it to Datadog’s servers at customisable intervals. The impact on request performance is negligible.

When the amount of metrics increases, the agent might need its own CPU core to deal with all the aggregation, so if there’s really a huge amount of metrics being sent, you might need an instance with a bit more CPU.

There definitely is a delay between sending a metric and seeing it on the graph. It’s not big, say 15-30 seconds or so, and only annoying if you have very little traffic.
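To show why this is cheap, here is a rough sketch of what "fire & forget" over UDP looks like. The helper names `formatTimingDatagram()` and `sendDatagram()` are hypothetical and this is not the actual client library code; I am assuming an agent listening on the default local port 8125 and the DogStatsD datagram layout `name:value|ms|#tag:value,tag:value`.

```php
<?php
declare(strict_types=1);

/**
 * Build a DogStatsD-style timing datagram: name:value|ms|#tag:value,...
 * Hypothetical helper for illustration.
 */
function formatTimingDatagram(string $name, float $milliseconds, array $tags): string
{
    $pairs = [];
    foreach ($tags as $key => $value) {
        // Commas, pipes and whitespace are delimiters in the datagram, so replace them in values.
        $pairs[] = $key . ':' . preg_replace('/[,|\s]+/', '_', (string) $value);
    }

    $datagram = sprintf('%s:%d|ms', $name, (int) round($milliseconds));
    return $pairs === [] ? $datagram : $datagram . '|#' . implode(',', $pairs);
}

/**
 * Fire and forget: write one UDP packet and never wait for a reply.
 * Returns false only if the local write itself failed.
 */
function sendDatagram(string $datagram, string $host = '127.0.0.1', int $port = 8125): bool
{
    $socket = @stream_socket_client("udp://{$host}:{$port}", $errno, $errstr, 1.0);
    if ($socket === false) {
        return false; // metrics must never break the request
    }
    $written = @fwrite($socket, $datagram);
    fclose($socket);
    return $written !== false;
}

$startTime = microtime(true);
// ... handle the request here ...
$elapsedMs = (microtime(true) - $startTime) * 1000;

sendDatagram(formatTimingDatagram('api.response', $elapsedMs, [
    'route' => '/orders',
    'code' => 200,
]));
```

There is no handshake and no response to wait for, so if the agent is down the packet is simply lost and the request carries on.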

Alternatives to Datadog

There are many great tools out there to help you get started on metrics! I used Cacti and Zabbix a long, long time ago, and I have run an ELK stack and a Graphite + Grafana setup. But lately I’m all in love with Datadog. It’s a managed service, setup is super fast and you can be sending metrics in a matter of minutes. It’s not overly expensive either, and the value, if used correctly, is immense! If that’s not up your alley - here are some alternatives to explore and choose from:

Prometheus + Grafana - I know Rafael Dohms talked about it
Zabbix - one of the granddads of monitoring
Graphite (together with Grafana)
Librato, aka SolarWinds AppOptics
InfluxDB & TICK stack

There are probably even more but, alas, I only know so much.

What about SRE?

Site Reliability Engineering is a very big topic and the best resource I’ve found on it comes from Google. Well, they invented it. Their SRE books are a big help. To support this article, the chapter “Service Level Objectives” is a must-read. It helped me understand how I should use the data I gathered from the metrics. Service Level Objectives and Service Level Indicators sound like mumbo-jumbo, but when defined correctly, they provide the best overview of my system.

I always try to set objectives that describe how healthy my API is from one glance at the screen. On the example dashboard below, even without reading the titles, you can see something is starting to go wrong here.

If the system is much bigger, SLIs can be aggregated into smaller boxes, but also separated between different teams. I’m assuming, of course, that multiple teams at your company are responsible for separate big domains. So you can have a payments dashboard, an orders dashboard, a content dashboard and so on, each of which monitors its own SLIs - the most important things that say whether that part of the system is healthy or not.

I love these screens! We had a monitor for each team and all of them showed a dashboard like this for the systems they were responsible for. Walking around the office I could see who had problems and where I shouldn’t step in for a quick chat, but maybe should offer my help.

Alarm, alarm, alarm!

I have metrics and I have SLIs defined. All I am missing is a way to eliminate my alarm fatigue from all the useless “CPU load > 60% on web1” or “Network rate is > 85MB/s!” alerts. Some of the alerts we have are triggered multiple times every day and fix themselves. Out of 26 monitoring alerts fired, all 26 resolved by themselves and none of them impacted my system. If you ignore an alert - it’s not an alert, stop firing it.

Using the same one metric I defined at the beginning of this text, I have defined some SLIs that properly represent what my end-users are experiencing. I see yellow boxes and red boxes on my screen, so maybe these should also be the alerts that wake me up at night?

Yes! If my SLIs are not satisfied, the alert goes into my team’s Slack channel. I am confident that this is a rare occasion and an important one to act upon. If things really hit the fan - for example, successful reads go below 70% or the error rate rises above 10% - then it’s a ping to the PagerDuty service, which will then wake someone up to act on this.
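The routing rule can be sketched in a few lines. The paging thresholds (70% successful reads, 10% error rate) come from the paragraph above; the `alertLevel()` helper and the SLO targets of 99% successful reads and 1% errors are assumptions for illustration, so pick your own. In practice this logic would live in the monitoring service's alert configuration rather than in application code.

```php
<?php
declare(strict_types=1);

/**
 * Decide where an alert should go based on two SLIs.
 * Hypothetical helper; the default targets are assumed values.
 */
function alertLevel(
    float $readSuccessPercent,
    float $errorPercent,
    float $readSuccessTarget = 99.0,
    float $errorTarget = 1.0
): string {
    // Things really hit the fan: wake someone up.
    if ($readSuccessPercent < 70.0 || $errorPercent > 10.0) {
        return 'page';
    }
    // An SLI is not satisfied: tell the team channel.
    if ($readSuccessPercent < $readSuccessTarget || $errorPercent > $errorTarget) {
        return 'slack';
    }
    return 'ok';
}

echo alertLevel(99.7, 0.2), "\n";  // ok
echo alertLevel(97.0, 0.5), "\n";  // slack
echo alertLevel(65.0, 2.0), "\n";  // page
echo alertLevel(99.5, 12.0), "\n"; // page
```

Note that neither branch mentions CPU load or network rate. Both levels are driven only by what the end-user experiences.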

Little effort, big win

This one metric gave me a very different perspective on what I’m working on. It took very little effort to start - some package dependencies and a few lines of code to cover the widest scope of a project - the HTTP request.

I got a lot of visibility into how my API is being used and how it behaves in production.

I got an easy way to define basic Service Level Objectives and proactively monitor them.

I got rid of alarm fatigue in my Slack channels by replacing the noisy alarms with ones that represent end-user problems.

I love it and I highly recommend trying the same.