Unmesh Gundecha for AWS Community Builders

AWS services for SREs

SREs using the AWS platform have access to a wide range of tools for creating, running, and managing systems & services in the cloud. This blog post examines some of AWS's essential services and features that SREs can utilize.

Amazon CloudWatch

Monitoring and observability are crucial tenets of Site Reliability Engineering (SRE). Amazon CloudWatch is the monitoring service that the AWS platform offers SREs for watching and understanding the state of their systems and services. Monitoring is based on gathering predefined sets of metrics or logs, while observability enables SREs to actively debug their systems. CloudWatch is the built-in monitoring solution in AWS, and many AWS services publish their metrics to it by default.

SREs can monitor and manage on-premises, hybrid, and AWS applications and resources with Amazon CloudWatch. Instead of tracking operational and performance data separately, SREs can gather and access all of it from a single platform in the form of logs and metrics (server, network, or database). Applications, infrastructure, and services can all be monitored with CloudWatch, and SREs can use alarms, logs, and events data to automate actions and reduce mean time to resolution (MTTR).
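As a rough illustration of using alarms to automate actions, the sketch below builds the arguments for CloudWatch's PutMetricAlarm API for an EC2 CPU alarm. The function names, instance ID, SNS topic ARN, and thresholds are hypothetical placeholders rather than anything prescribed by this post, and the actual API call assumes boto3 is installed and AWS credentials are configured.

```python
from typing import Any, Dict


def build_cpu_alarm(instance_id: str, sns_topic_arn: str,
                    threshold: float = 80.0) -> Dict[str, Any]:
    """Return keyword arguments for CloudWatch's PutMetricAlarm API.

    All values here are illustrative assumptions; tune them per workload.
    """
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,             # seconds per evaluation window
        "EvaluationPeriods": 2,    # alarm after two consecutive breaches
        "Threshold": threshold,    # percent CPU
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": [sns_topic_arn],  # notify this SNS topic on ALARM
    }


def create_alarm(params: Dict[str, Any]) -> None:
    """Create or update the alarm (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without AWS access

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameter builder separate from the API call makes the alarm definition easy to review and unit test before anything is created in an account.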

You can find more about AWS CloudWatch here

AWS Systems Manager

Driving toil out of the system is another key tenet of SRE. The term toil refers to the tedious, repetitive tasks associated with running a production environment. Site Reliability Engineering (SRE) teams strive to minimize or even eliminate toil to maximize the time spent on engineering and innovation.

AWS Systems Manager enables SREs to automate processes across AWS resources and centralizes operational data from several AWS services. SREs can arrange resources into logical groups, such as applications, the layers of an application stack, or development and production environments. By selecting a resource group in Systems Manager, SREs can view its recent API activity, resource configuration changes, related notifications, operational alerts, software inventory, and patch compliance status.
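One way to automate a repetitive task across a logical group of resources is Systems Manager Run Command. The sketch below builds a SendCommand request that runs shell commands on every managed instance carrying a given tag. The Environment tag key, the function names, and the concurrency limits are assumptions chosen for illustration; the call itself assumes boto3, AWS credentials, and instances registered with Systems Manager.

```python
from typing import Any, Dict, List


def build_run_command(environment: str, commands: List[str]) -> Dict[str, Any]:
    """Return keyword arguments for the Systems Manager SendCommand API.

    Targets instances by a hypothetical 'Environment' tag.
    """
    return {
        "DocumentName": "AWS-RunShellScript",  # AWS-managed SSM document
        "Targets": [{"Key": "tag:Environment", "Values": [environment]}],
        "Parameters": {"commands": commands},
        "MaxConcurrency": "10%",  # roll out to 10% of targets at a time
        "MaxErrors": "1",         # stop sending after one failed target
        "Comment": f"Automated task for {environment}",
    }


def send(request: Dict[str, Any]) -> str:
    """Send the command and return its ID (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without AWS access

    response = boto3.client("ssm").send_command(**request)
    return response["Command"]["CommandId"]
```

For example, build_run_command("production", ["sudo systemctl restart nginx"]) describes a rolling restart that would otherwise be done by hand on each host.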

You can find more about AWS Systems Manager here

AWS Systems Manager Incident Manager

Performance problems and unplanned outages can affect systems, online applications, servers, and devices at some point. SREs accept as a fact that such failures will occur and cannot be entirely avoided. These unexpected failures can result in significant revenue losses, a loss of customer confidence, and, depending on the industry sector, possible fines. Therefore, SREs use incident management as one of the fundamental techniques to reduce the disruption caused by unforeseen problems.

The AWS Systems Manager Incident Manager service enables incident management for SREs.

Critical application availability and performance issues can be resolved more quickly with Incident Manager, a capability of AWS Systems Manager. Automated response plans that bring together the appropriate on-call engineers and information help SREs prepare for incidents. For example, when a significant issue is detected by an Amazon EventBridge event or an Amazon CloudWatch alarm, Incident Manager can respond automatically. In addition to executing AWS Systems Manager Automation runbooks, Incident Manager links selected chat channels using AWS Chatbot and activates pre-configured response plans to engage responders via SMS and phone calls. Incident Manager was developed at Amazon based on decades of experience in incident response and analysis. By recommending post-incident action items, such as automating a runbook step or adding a new alert, it helps SREs increase service reliability.
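Response plans are usually triggered by alarms or events, but an incident can also be opened programmatically. The following is a minimal sketch using the StartIncident API of the ssm-incidents service; the function names and the example ARN and title are hypothetical, and it assumes a response plan already exists, boto3 is installed, and AWS credentials are configured.

```python
from typing import Any, Dict


def build_incident_request(response_plan_arn: str, title: str,
                           impact: int = 3) -> Dict[str, Any]:
    """Return keyword arguments for Incident Manager's StartIncident API.

    Impact ranges from 1 (critical) to 5 (no impact).
    """
    if not 1 <= impact <= 5:
        raise ValueError("impact must be between 1 and 5")
    return {
        "responsePlanArn": response_plan_arn,
        "title": title,
        "impact": impact,
    }


def start_incident(request: Dict[str, Any]) -> str:
    """Open the incident and return its record ARN (needs boto3)."""
    import boto3  # imported lazily so the builder works without AWS access

    client = boto3.client("ssm-incidents")
    return client.start_incident(**request)["incidentRecordArn"]
```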

You can find more about AWS Systems Manager Incident Manager here

Amazon Managed Service for Prometheus

Prometheus is a popular monitoring tool for microservices, distributed systems, and container workloads.

Amazon Managed Service for Prometheus provides monitoring and alerting for widely used container environments. With it, SREs can gather and access performance and operational data from container workloads running on AWS and on-premises. The service is fully compatible with the well-known open-source Prometheus project from the Cloud Native Computing Foundation (CNCF). Because it is managed by AWS, it makes Prometheus easier to deploy and set up, and it automates many routine operations and maintenance tasks for SREs.

Amazon Managed Service for Prometheus automatically scales as container workloads scale up and down, providing cost-effective performance metrics and reliable query response times.
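Because the service is compatible with the Prometheus HTTP API, existing PromQL queries can be pointed at a workspace endpoint. The sketch below only constructs the instant-query URL for a workspace; the function name, region, and workspace ID are hypothetical placeholders, and a real request additionally needs to be signed with AWS Signature Version 4 (service name "aps"), which is omitted here.

```python
from urllib.parse import urlencode


def amp_query_url(region: str, workspace_id: str, promql: str) -> str:
    """Build the Prometheus-compatible instant-query URL for a workspace.

    Only builds the URL; the request must still be SigV4-signed before
    Amazon Managed Service for Prometheus will accept it.
    """
    base = (
        f"https://aps-workspaces.{region}.amazonaws.com"
        f"/workspaces/{workspace_id}/api/v1/query"
    )
    return f"{base}?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # 'ws-EXAMPLE' is a placeholder workspace ID, not a real one.
    print(amp_query_url("us-east-1", "ws-EXAMPLE", "up"))
```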

You can find more about AWS Managed Prometheus here

Amazon Managed Service for Grafana

SREs need operational dashboards that connect them with monitoring and observability tools to visualize system performance, key metrics, and alerts.

Based on open-source Grafana, Amazon Managed Grafana is a highly scalable, highly available, and fully managed service that offers interactive data visualization of operational and monitoring data. Using Amazon Managed Grafana, SREs can view, analyze, and set alarms on crucial metrics, logs, and traces gathered from various data sources in the observability system, including AWS, third-party ISVs, and other resources across environments and applications. Amazon Managed Grafana offloads the operational management of Grafana by automatically scaling compute and database infrastructure as usage demands increase and by handling version upgrades and security patching.

You can find more about AWS Managed Grafana here

AWS Fault Injection Simulator

SREs aim to minimize downtime and ensure that systems operate as planned. One of the critical objectives of SRE teams is to ensure that systems meet their Service-Level Objectives (SLOs), the specific targets SRE teams aim for, so that the business can honor its Service-Level Agreements (SLAs), the commitments, such as system uptime, made to consumers.
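An availability SLO translates directly into an error budget: the amount of downtime a system can accumulate in a window before the objective is missed. The short sketch below shows the arithmetic; the function names, the 30-day window, and the sample figures are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed by an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if exceeded)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(99.9), 1))
    # After 10.8 minutes of downtime, about 75% of the budget is left.
    print(round(budget_remaining(99.9, 10.8), 2))
```

An SLO is typically set tighter than the SLA it supports, so the error budget runs out before the contractual commitment is at risk.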

SRE teams gain important insights into system dependability by introducing actual failures within the confines of a chaos engineering experiment. These insights would not be possible through purely theoretical modeling. By applying the real post-incident procedure to real system outputs, SREs acquire a more realistic view of the system's condition and capabilities. The alternative is living with hidden infrastructure flaws that can lead to incidents or, worse still, discovering them during an incident.

AWS Fault Injection Simulator (FIS) is a fully managed service for running chaos engineering experiments to improve an application's performance, observability, and resiliency. FIS simplifies setting up and running controlled experiments across various AWS services so SREs can build confidence in their application behavior.

SREs can use AWS FIS for defining and running Chaos Engineering experiments to build resilient systems and services.
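To make this concrete, the sketch below assembles an FIS experiment template that stops one tagged EC2 instance for five minutes and halts the experiment if a CloudWatch alarm fires. The ChaosReady tag, the target and action names, the function names, and the ARNs passed in are hypothetical; creating the template assumes boto3, AWS credentials, and an IAM role that FIS is allowed to assume.

```python
import uuid
from typing import Any, Dict


def build_stop_instance_experiment(role_arn: str,
                                   stop_alarm_arn: str) -> Dict[str, Any]:
    """Return keyword arguments for the FIS CreateExperimentTemplate API."""
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "description": "Stop one tagged EC2 instance for five minutes",
        "roleArn": role_arn,
        "targets": {
            "tagged-instances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"ChaosReady": "true"},  # hypothetical tag
                "selectionMode": "COUNT(1)",  # pick one matching instance
            }
        },
        "actions": {
            "stop-one-instance": {
                "actionId": "aws:ec2:stop-instances",
                # ISO 8601 duration: restart the instance after 5 minutes
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "tagged-instances"},
            }
        },
        # Guardrail: abort the experiment if this alarm enters ALARM state
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": stop_alarm_arn}
        ],
    }


def create_template(template: Dict[str, Any]) -> str:
    """Create the template and return its ID (needs boto3, credentials)."""
    import boto3  # imported lazily so the builder works without AWS access

    response = boto3.client("fis").create_experiment_template(**template)
    return response["experimentTemplate"]["id"]
```

The stop condition is what keeps the experiment controlled: tying it to an alarm on a user-facing metric limits the blast radius if the system does not recover as expected.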

You can find more about AWS Fault Injection Simulator here
