As an SRE, your main job is to look at infrastructure through a software developer's eyes. Because of that, one of the questions you should constantly be asking is: how can the application perform better from both a software and an infrastructure perspective, and how do I get the data to answer that?
The answer is with application monitoring.
When you're monitoring an application, you start to understand its bottlenecks. When you understand its bottlenecks, you begin to understand how to fix them from a development perspective.
In this blog post, you'll learn about what performance means and how to monitor your applications.
When you're thinking about performance, it's not just about uptime or how long the servers have been running. It's about how the application responds to user interactions. For example, a server could have great uptime, but the application may still be slow to reach because there isn't enough network bandwidth to handle all of the incoming requests.
When you're thinking about performance, you should think about:
- Is the application reachable in an acceptable amount of time?
- Do the systems the application runs on have enough bandwidth, auto-scaling, and high availability? This is a huge one because an application can be up and running, but if there are frequent timeouts or it takes users forever to reach it, they'll most likely go somewhere else.
- Are there SLOs and SLAs in place?
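To make the SLO point concrete, here is a minimal sketch of how an error budget falls out of an availability SLO. The 99.9% target and the request counts are illustrative numbers, not from any particular service:

```python
# Illustrative sketch: computing the remaining error budget for a
# hypothetical 99.9% availability SLO. The numbers are examples only.

def error_budget_remaining(slo: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent (0.0-1.0)."""
    allowed_failures = total_requests * (1 - slo)
    if allowed_failures == 0:
        return 0.0  # a 100% SLO leaves no budget at all
    spent = failed_requests / allowed_failures
    return max(0.0, 1.0 - spent)

# Example: 1,000,000 requests against a 99.9% SLO allows ~1,000 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget remains")  # 75%
```

Tracking the budget this way turns "are we meeting the SLO?" into a number you can alert on before the SLA is breached.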
The biggest performance killer for any application is degradation. You need to ensure that you have proper monitoring and alerting around slowness and timeouts.

Performance is all about how the application behaves when one, five, or thousands of users are interacting with it.
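One common way to monitor for degradation is to alert on a latency percentile rather than an average, since a few very slow requests can hide behind a healthy mean. Below is a small sketch using the nearest-rank p95; the 500 ms threshold and the sample values are made up for illustration:

```python
# Sketch: computing a p95 latency from recent request samples and deciding
# whether to alert. Threshold and samples are illustrative.
from math import ceil

def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of a list of latency samples."""
    ordered = sorted(latencies_ms)
    rank = ceil(0.95 * len(ordered))  # nearest-rank method
    return ordered[rank - 1]

def should_alert(latencies_ms: list[float],
                 threshold_ms: float = 500.0) -> bool:
    """Fire when the slowest 5% of requests exceed the latency budget."""
    return p95(latencies_ms) > threshold_ms

samples = [120, 95, 300, 110, 480, 150, 900, 130, 140, 105]
print(p95(samples))           # 900
print(should_alert(samples))  # True
```

In practice a metrics backend (Datadog, New Relic, and so on) computes these percentiles for you, but the alerting logic is the same idea.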
As an SRE, you're going to be on-call. However, that doesn't mean you should be paged every five minutes for an issue that could easily be solved with an automated runbook. When an alert comes through for a performance issue, you should think about how to automate the response.
For example, let's say you have an auto-scaling group in AWS. Perhaps you set up two EC2 instances to handle the load for an application. As your application becomes more popular, more people will be interacting with it, and it may require extra EC2 instances to run efficiently. If this occurs, you shouldn't have to wake up at 2:00 AM and manually add a new EC2 instance to the auto-scaling group. Instead, you should have an automated runbook handle that for you.
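The decision logic inside such a runbook can be very simple. Here is a minimal sketch; the CPU thresholds and capacity bounds are illustrative, and in a real setup the chosen capacity would be applied through the AWS API (for example via boto3) rather than just returned:

```python
# Minimal sketch of the scaling decision an automated runbook might make
# for an AWS auto-scaling group. Thresholds and bounds are illustrative.

def desired_capacity(current: int, avg_cpu_percent: float,
                     minimum: int = 2, maximum: int = 10) -> int:
    """Scale out when average CPU is high, scale in when it's low."""
    if avg_cpu_percent > 75 and current < maximum:
        return current + 1
    if avg_cpu_percent < 25 and current > minimum:
        return current - 1
    return current

print(desired_capacity(2, 90))  # 3 -- scale out under load
print(desired_capacity(3, 10))  # 2 -- scale back in, with no one paged
```

Tools like Rundeck can run this kind of logic on a schedule or in response to an alert, so the 2:00 AM page never happens.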
In a traditional monitored environment, system administrators and infrastructure folks would monitor RAM, CPU, and the hard disk. Although monitoring the hardware or virtual hardware is still important, you also need to monitor the application itself.
For example, you may have a Service running in Kubernetes. If that Service goes down, the server hosting Kubernetes will still be running, so monitoring only RAM, CPU, and disk won't tell you much in this scenario. However, if you're monitoring the Kubernetes Service itself, you'll know whether the application is still reachable.
Ensure that you're monitoring at the application level, the binary level, and even the runtime that's running the application.
Below are some of the top tools across the observability and performance monitoring categories:
- Serverless monitoring: AWS X-Ray and New Relic
- Overall monitoring: Datadog
- Container monitoring: New Relic and Nagios
- Automated runbooks: Rundeck by PagerDuty and xMatters