Indika_Wimalasuriya for AWS Community Builders

Posted on Dec 27, 2023

AWS Observability: Building a Comprehensive Solution for Distributed Systems

#observability #aws #sre #awscloudoperations

Observability is now a key theme in maintaining modern distributed systems. Why? Because businesses require better insight into customer and system behavior, demand continuous monitoring of performance and adoption, need to troubleshoot issues, and aim to fix them faster. Real-time monitoring of systems is required to obtain production wisdom; security is paramount, and scalability is a must. All of these factors lead to a world where observability is the oracle of modern-day distributed systems.

In this blog post, we'll dive into key aspects of observability and how it can be leveraged to increase revenue for your customers. Well, at the end of the day, everyone is here to make money, aren't they? It's not as bad as it sounds, I hope.

What the heck is observability?

So, what is observability? We already have monitoring, so what the heck is observability? Is it a new buzzword? Well, think of it like this: monitoring is based on what you're already aware of; it's like having a thermometer to measure your body temperature. If it goes above a certain level, you conclude that you have a fever. So, you're simply measuring a known indicator. Is that enough? For that need, of course. But imagine a patient in an ICU bed at the hospital. Have you seen all the gadgets that are plugged in to get the signals out? There, a mere temperature check is not enough: doctors need to observe the whole body and gather every insight and signal they possibly can to make decisions.

Fast forward to now, and that is what a lot of modern-day distributed systems require. Complexity is as high as it can get, and the problems that inevitably come with it need to be proactively identified and fixed. Monitoring simply falls behind in this game since you're tracking only what you already know, so you need to go a few steps further to observe and then make decisions. That's observability in a nutshell.

Why should you be serious about observability?

It can cost you money, save money, or increase your revenue. Let's dig deep: distributed systems are complex in nature, and while you can pour any amount of money in, nature will take its course, and failures are inevitable. So, we all embrace failures, we plan, but when they actually happen, can you restore your services within a limited amount of time? Are you able to identify issues before your end users find out? Or better yet, can you fix problems before your end users even notice them? This is one area where observability will play a huge role.

Now let's focus on how you can improve clients' revenue. Walmart research found that for every 1-second improvement in page load time, conversions increased by 2%. That's just mind-blowing, and it makes sense too. All our systems are coupled with customer experience; bad customer experiences drive customers away, while good customer experiences result in more revenue. It's a no-brainer, actually. This elevates the role of observability to even greater heights; page load time is just one piece on the chessboard. There are so many variables at play now.

Okay, what is observability actually?

The textbook definition is the ability to gain insight into the internal workings and behavior of a system or application through the analysis of its outputs, often without direct access to its internal state. We call all of this output data telemetry. In short, observability is about understanding the internal state of a system by analyzing the telemetry it emits.

Observability is based on a few key components as follows:

Logs - Logs are the original data type; they are basically lines of text that systems produce while running particular code blocks. Logs depend on developers to write code following best practices and insert meaningful logs when the code is getting executed.

Metrics - These are typically numeric values that describe the system at a certain point in time.
Traces - Also referred to as distributed traces, these are samples of a causal chain of events, or transactions, between components. They help find out exactly what a unit of code was doing.
Events - These are discrete occurrences that take place within the system being monitored. Modern-day systems emit thousands of events.
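To make the four signal types a bit more concrete, here is a minimal Python sketch of what a single request might emit. The function name, field names, and ID format are all made up for illustration; real agents and SDKs define their own schemas.

```python
import json
import time
import uuid


def handle_request(path):
    """Simulate one request and return the telemetry it would emit."""
    trace_id = uuid.uuid4().hex   # illustrative ID only, not a real X-Ray format
    started_at = time.time()      # wall-clock timestamp attached to each record
    timer = time.perf_counter()   # monotonic clock used to measure duration
    # ... the real work of the request would happen here ...
    duration_ms = (time.perf_counter() - timer) * 1000.0

    return {
        # Log: a line of text describing what the code did.
        "log": f"INFO handled GET {path} trace_id={trace_id}",
        # Metric: a numeric value at a point in time.
        "metric": {"name": "request_latency_ms", "value": duration_ms,
                   "timestamp": started_at},
        # Trace: one unit of work in a causal chain, tied together by trace_id.
        "trace": {"trace_id": trace_id, "name": f"GET {path}",
                  "start": started_at, "duration_ms": duration_ms},
        # Event: a discrete occurrence in the system.
        "event": {"type": "request_completed", "path": path,
                  "timestamp": started_at},
    }


if __name__ == "__main__":
    print(json.dumps(handle_request("/orders"), indent=2))
```

Notice that the same trace ID appears in both the log line and the trace record; that shared identifier is what lets you jump from one signal to another when investigating an issue.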

How to design a robust Full-Stack Observability solution

Before you jump in, note that full-stack observability involves observing the entire technology stack and all layers of your systems, so that every area and corner is covered in your observability solution. The areas below are the ones you need to monitor to achieve this goal.

Enable Real User Monitoring (RUM): Real User Monitoring (RUM) involves monitoring your system's front-end performance, aiming to understand exactly what your end-users are facing and identifying the customer experience. It provides a clear picture of the customer experience related to the application.
Enable Application Performance Monitoring (APM): Application Performance Monitoring (APM) focuses on how your code performs and behaves in a production environment. This enables full visibility into the application's performance.
Enable Distributed Tracing: Tracing identifies each unit of work done by your code and helps you drill down into performance from the front end to the back end.
Enable Logs & Events: Logs and events, of course, help you keep track of the system status.
Enable Metrics and Define SLOs for Your Services: Metrics and Service Level Objectives (SLOs) provide insight into your system's performance.
Enable Infrastructure Monitoring: Infrastructure health is the final checkpoint to ensure all areas of your system's health are covered.
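Since SLOs come up in the list above, here is a small sketch of the arithmetic behind an availability SLO and its error budget. The function name and the numbers are hypothetical; treat it as an illustration of the idea rather than a prescribed way to compute SLOs.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Compare observed availability against an SLO target (e.g. 0.999)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The error budget is the number of failures the SLO tolerates.
    allowed_failures = (1.0 - slo_target) * total_requests
    availability = 1.0 - failed_requests / total_requests
    return {
        "availability": availability,
        "allowed_failures": allowed_failures,
        "budget_remaining": allowed_failures - failed_requests,
        "slo_met": failed_requests <= allowed_failures,
    }


if __name__ == "__main__":
    # Hypothetical month: 100,000 requests, 40 failures, 99.9% target.
    print(error_budget(0.999, 100_000, 40))
```

With a 99.9% target over 100,000 requests, roughly 100 failures are tolerated, so 40 failures leaves a bit more than half of the budget unspent.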

Now let's delve into how AWS allows us to build a comprehensive full-stack observability solution by leveraging some of its services.

One good thing about AWS is that it uses most of its own services to monitor Amazon.com, its flagship retail site. With that wealth of learning behind them, AWS services are almost always battle-ready. AWS offers multiple services to support the observability journey.

AWS CloudWatch - AWS CloudWatch is one of the leading observability products in the market. It provides a comprehensive observability implementation with support for logs, metrics, and traces.
AWS X-Ray - AWS X-Ray brings traces into CloudWatch. With support for OpenTelemetry, X-Ray provides service discovery and tracing.
CloudWatch RUM - Real User Monitoring is integrated with CloudWatch, allowing you to proactively monitor end-user front-end experiences.
CloudWatch Canaries - It provides the capability to create synthetic monitors to mimic end-user experiences, allowing us to set up meaningful monitors to proactively monitor systems.

Building a Comprehensive Observability Solution with AWS

Let's delve into the detailed steps to leverage AWS CloudWatch services for constructing a comprehensive observability solution.

Step 1:

The initial step involves downloading the aws-otel-collector installation file compatible with your EC2 operating system. In this example, I am using Ubuntu, and the commands are as follows:

wget https://aws-otel-collector.s3.amazonaws.com/ubuntu/amd64/latest/aws-otel-collector.deb
sudo dpkg -i -E ./aws-otel-collector.deb
sudo /opt/aws/aws-otel-collector/bin/aws-otel-collector-ctl -a start
sudo /opt/aws/aws-otel-collector/bin/aws-otel-collector-ctl -a status

Step 2:
Next, download the aws-opentelemetry-agent.jar:

wget https://github.com/aws-observability/aws-otel-java-instrumentation/releases/latest/download/aws-opentelemetry-agent.jar

Step 3:
Bring up the respective Microservices by injecting the aws-opentelemetry-agent.jar:

nohup java -Xms<TBC>m -Xmx<TBC>m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./java_pid.hprof -javaagent:/<TBC>/aws-opentelemetry-agent.jar -jar /<TBC>/<Microservices.jar> -c /<TBC>opt/aws/aws-otel-collector/etc/config.yaml > <TBC>.log 2>&1 &

Step 4:
Download the CloudWatch Agent and configure it to ship the log files:

wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i -E ./amazon-cloudwatch-agent.deb
sudo systemctl status amazon-cloudwatch-agent
sudo nano /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a append-config -s -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json

The CloudWatch Agent JSON file is where you will configure log paths:

{
    "agent": {
        "region": "ap-southeast-1",
        "run_as_user": "root",
        "metrics_collection_interval": 10
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "<Log Path>.log",
                        "log_group_name": "<Log Group Name>",
                        "log_stream_name": "_test_discovery_service_stream",
                        "timestamp_format": "%Y-%m-%d %H:%M:%S"
                    }
                ]
            }
        },
        "force_flush_interval": 15
    }
}
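The timestamp_format in the config above only helps if your application actually writes timestamps in that shape. As a rough sketch, assuming a Python service (the microservices in this post are Java, so this is purely illustrative), the standard logging module can be set up to match. The logger name and the helper functions are hypothetical.

```python
import io
import logging
from datetime import datetime

# Copied from the "timestamp_format" value in the agent config above.
AGENT_TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_logger(name, handler):
    """Attach a handler whose lines start with an agent-compatible timestamp."""
    formatter = logging.Formatter(
        fmt="%(asctime)s %(levelname)s %(name)s %(message)s",
        datefmt=AGENT_TIMESTAMP_FORMAT,  # overrides the default millisecond suffix
    )
    handler.setFormatter(formatter)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


def leading_timestamp(line):
    """Parse the first 19 characters (YYYY-MM-DD HH:MM:SS) of a log line."""
    return datetime.strptime(line[:19], AGENT_TIMESTAMP_FORMAT)


if __name__ == "__main__":
    buffer = io.StringIO()  # stand-in for the log file the agent would tail
    log = build_logger("discovery-service", logging.StreamHandler(buffer))
    log.info("service started")
    print(buffer.getvalue(), end="")
    print(leading_timestamp(buffer.getvalue()))
```

In a real deployment you would swap the in-memory buffer for a logging.FileHandler pointing at the same path you put in file_path.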

These steps guide you through the process of setting up AWS services to create a robust observability solution, utilizing AWS CloudWatch and associated tools.

Now let’s log in to AWS CloudWatch and start observing our application.

AWS CloudWatch automatically discovers system services using X-Ray.

As discussed, we can leverage X-Ray with our OpenTelemetry integration to visualize the ecosystem. A sample view is included below.

Above, you will see it mapping the full ecosystem. Starting from the client, requests reach out to the API Gateway, then to respective Microservices, and finally to the Database. Below are some of the major benefits you can leverage by using X-Ray to illustrate your services.

End-to-End Visibility: With X-Ray's Service view, you gain a holistic, end-to-end view of your application's architecture, allowing you to trace requests across various services.
Performance Monitoring: The Service view in X-Ray provides real-time insights into the performance of each service, enabling quick identification and resolution of bottlenecks or issues impacting overall system performance.
Troubleshooting Simplified: Quickly pinpoint and troubleshoot errors or latency issues by visualizing the flow of requests and responses through your services, streamlining the debugging process.
Dependency Mapping: X-Ray's Service view generates automatic dependency maps, illustrating the relationships and dependencies between different components, helping you understand the overall structure of your application.
Resource Optimization: By visualizing the interactions between services, X-Ray's Service view facilitates efficient resource optimization, allowing you to allocate resources where they are most needed based on actual usage patterns.
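To give a feel for how a dependency map can be derived from trace data, here is a simplified sketch. It is not how X-Ray is implemented; it only shows the idea of turning parent/child links between spans into service-to-service edges. The span records and service names are hypothetical.

```python
from collections import defaultdict


def service_edges(spans):
    """Return {caller_service: [callee_service, ...]} from parent links."""
    by_id = {span["id"]: span for span in spans}
    edges = defaultdict(set)
    for span in spans:
        parent = by_id.get(span["parent_id"])
        # Only record an edge when a call crosses a service boundary.
        if parent is not None and parent["service"] != span["service"]:
            edges[parent["service"]].add(span["service"])
    return {caller: sorted(callees) for caller, callees in edges.items()}


# Hypothetical spans for one request: client -> gateway -> services -> database.
SPANS = [
    {"id": "a", "parent_id": None, "service": "client"},
    {"id": "b", "parent_id": "a", "service": "api-gateway"},
    {"id": "c", "parent_id": "b", "service": "orders-service"},
    {"id": "d", "parent_id": "c", "service": "database"},
    {"id": "e", "parent_id": "b", "service": "users-service"},
]

if __name__ == "__main__":
    for caller, callees in service_edges(SPANS).items():
        print(caller, "->", ", ".join(callees))
```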

Connecting the Front End for Real User Monitoring

Next, I recommend connecting AWS CloudWatch RUM with your system's front-end code. You can create an application and then obtain either the commands or code snippet to be injected into your frontend code. This will automatically initiate CloudWatch monitoring for your front-end performance.

Some of the key benefits of CloudWatch RUM that you can leverage include:

Real-Time User Insights: CloudWatch RUM (Real User Monitoring) provides real-time insights into end-user experiences, allowing you to understand how users interact with your applications instantly.
Proactive Issue Detection: With CloudWatch RUM, you can proactively detect issues affecting the front-end user experience, enabling quicker identification and resolution of performance bottlenecks or errors.
Improved User Satisfaction: By monitoring and optimizing the front-end experience in real time, CloudWatch RUM helps enhance user satisfaction, ensuring a smoother and more enjoyable interaction with your applications.
Data-Driven Performance Optimization: CloudWatch RUM offers data-driven insights into user interactions, enabling you to make informed decisions for performance optimization and feature enhancements based on actual user behavior.
Integrated with CloudWatch: As part of the AWS CloudWatch suite, CloudWatch RUM seamlessly integrates with other CloudWatch services, providing a comprehensive observability solution for both infrastructure and user experience monitoring.

Log Monitoring using CloudWatch

Let’s go and check the logs. Just to remind you, we have already brought up the CloudWatch Agent with the paths to our log files, so the agent will start shipping the logs.

Below are some of the benefits you can gain by connecting your logs with CloudWatch:

Centralized Log Management: Integrating logs with CloudWatch allows for centralized log storage, making it easier to search, analyze, and manage logs from various sources in a unified platform.
Real-Time Monitoring: CloudWatch provides real-time log monitoring, enabling quick detection and response to events, errors, or issues within your applications or infrastructure.
Customizable Log Retention: CloudWatch allows you to define retention policies for logs, ensuring that you retain relevant log data for compliance, auditing, or troubleshooting purposes while managing costs effectively.
Automated Alerts and Notifications: You can set up CloudWatch Alarms to trigger notifications based on specific log events or patterns, allowing for proactive monitoring and immediate response to critical events.
Integration with Other AWS Services: CloudWatch log integration extends to other AWS services, facilitating a seamless connection with metrics and events, providing a comprehensive view of your AWS environment for better troubleshooting and optimization.

Enabling Traces with CloudWatch X-Ray

We have already looked at how X-Ray provides service discovery. Next, let's examine traces. The advantage of traces lies in their ability to provide more insightful information than logs when telling a story and diagnosing real-time issues. A trace represents the narrative of internal processing triggered by an external request. Applications emit events or traces.

If we go deeper into traces, these are the key concepts:

Trace: Represents the entire journey of a transaction through a system.
Segment: Unit of work within a trace, depicting a specific service or component's contribution.
Annotation: Additional metadata providing context and details for a specific segment.
Metadata: Extra information about a trace or segment, enhancing contextual understanding.
Trace ID and Span ID: Unique identifiers linking traces and segments for correlation.
Timestamps: Record when specific events or segments occur for chronological analysis.
Context Propagation: Mechanism ensuring seamless trace context flow between different components.
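Context propagation is easier to picture with a concrete header. The sketch below builds and parses a header in the shape AWS documents for X-Ray's X-Amzn-Trace-Id (Root, Parent, Sampled), to the best of my knowledge. In practice the OpenTelemetry agent or X-Ray SDK does this for you, so the helper functions here are hypothetical and only for illustration.

```python
import os
import time


def new_trace_id(now=None):
    """X-Ray style trace ID: version, 8 hex digits of epoch time, 24 random hex."""
    epoch = int(time.time() if now is None else now)
    return f"1-{epoch:08x}-{os.urandom(12).hex()}"


def new_segment_id():
    """Segment (span) ID: 16 random hex digits."""
    return os.urandom(8).hex()


def build_header(trace_id, parent_id, sampled=True):
    return f"Root={trace_id};Parent={parent_id};Sampled={1 if sampled else 0}"


def parse_header(header):
    fields = dict(
        part.strip().split("=", 1) for part in header.split(";") if "=" in part
    )
    return {
        "trace_id": fields.get("Root"),
        "parent_id": fields.get("Parent"),
        "sampled": fields.get("Sampled") == "1",
    }


def propagate(incoming_header):
    """Keep the trace ID, but make this service's segment the new parent."""
    context = parse_header(incoming_header)
    my_segment = new_segment_id()
    outgoing = build_header(context["trace_id"], my_segment, context["sampled"])
    return my_segment, outgoing


if __name__ == "__main__":
    first = build_header(new_trace_id(), new_segment_id())
    print("incoming:", first)
    print("outgoing:", propagate(first)[1])
```

The trace ID stays constant across every hop, while the parent ID changes at each service; that is what lets the backend stitch segments back into a single trace.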

Key benefits that traces bring to your observability solution are as follows:

Enhanced Visibility: Traces offer a granular view of transaction journeys, enabling a detailed understanding of how requests traverse various components and services.
Efficient Troubleshooting: With the ability to trace individual segments, identifying and isolating issues within a system becomes more precise, streamlining the troubleshooting process.
Performance Optimization: Traces provide insights into the duration of each segment, enabling the identification of bottlenecks and areas for performance improvement within the system.
Root Cause Analysis: Traces assist in pinpointing the root cause of issues by revealing the sequence of events and the specific components involved in a transaction.
Resource Allocation: Understanding the flow of transactions through traces aids in optimizing resource allocation, ensuring that resources are efficiently utilized based on actual usage patterns.
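As an illustration of how segment durations point to a bottleneck, the sketch below computes each segment's "self time" (its own duration minus the time spent in its children). It assumes child calls run one after another rather than in parallel, and the segment records are hypothetical.

```python
def self_times(segments):
    """Map segment name to time spent in that segment excluding its children."""
    child_total = {}
    for seg in segments:
        if seg["parent_id"] is not None:
            child_total[seg["parent_id"]] = (
                child_total.get(seg["parent_id"], 0.0) + seg["duration_ms"]
            )
    return {
        seg["name"]: seg["duration_ms"] - child_total.get(seg["id"], 0.0)
        for seg in segments
    }


def bottleneck(segments):
    """Return the name of the segment with the largest self time."""
    times = self_times(segments)
    return max(times, key=times.get)


# Hypothetical trace: the gateway calls a service, which calls the database.
TRACE = [
    {"id": 1, "parent_id": None, "name": "api-gateway", "duration_ms": 480.0},
    {"id": 2, "parent_id": 1, "name": "orders-service", "duration_ms": 450.0},
    {"id": 3, "parent_id": 2, "name": "database", "duration_ms": 400.0},
]

if __name__ == "__main__":
    print(self_times(TRACE))
    print("bottleneck:", bottleneck(TRACE))
```

Here the gateway looks slowest at first glance (480 ms), but almost all of that time is inherited from the database call further down the chain.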

Creating synthetic tests to mimic end users

Next, you can leverage AWS CloudWatch Canaries to create synthetic tests.

CloudWatch Canaries provide great benefits, such as the following:

Synthetic Monitoring: CloudWatch Canaries enable synthetic monitoring by allowing you to create and run scripts that mimic end-user interactions, helping simulate real-world scenarios and detect potential issues proactively.
Proactive Issue Identification: By continuously running synthetic tests, CloudWatch Canaries help identify performance degradation or errors in your application, allowing you to address issues before they impact real users.
Realistic User Scenarios: Canaries allow you to create scripts that emulate specific user scenarios, helping you assess the performance and functionality of critical pathways within your application.
Automation of Monitoring: CloudWatch Canaries automate the monitoring process by executing predefined scripts at scheduled intervals, providing consistent and reliable performance data without manual intervention.
Meaningful Performance Metrics: With the ability to set up meaningful monitors based on user scenarios, CloudWatch Canaries offer insights into the end-user experience and help you measure key performance metrics in a controlled and reproducible manner.
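To show what a synthetic check boils down to, here is a plain-Python sketch that probes a URL and judges the result on status code and latency. This is not the CloudWatch Synthetics runtime or its API; the function names, the threshold, and the URL are assumptions for illustration.

```python
import time
import urllib.request


def http_fetch(url, timeout_s=5.0):
    """Default fetcher: issue a GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status


def run_check(url, fetch=http_fetch, max_latency_s=2.0):
    """Run one synthetic check and report whether it passed."""
    start = time.perf_counter()
    try:
        status, error = fetch(url), None
    except Exception as exc:  # network errors and HTTP 4xx/5xx end up here
        status, error = None, str(exc)
    latency_s = time.perf_counter() - start

    passed = error is None and status == 200 and latency_s <= max_latency_s
    return {"url": url, "status": status, "latency_s": latency_s,
            "passed": passed, "error": error}


if __name__ == "__main__":
    # Hypothetical endpoint; replace with a health-check URL of your own.
    print(run_check("https://example.com/"))
```

The fetch parameter is injectable so the check logic can be exercised without network access; a scheduler (which CloudWatch provides for real canaries) would call run_check at a fixed interval.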

Observing System Metrics with CloudWatch

While you're setting these up, you will start having access to a rich set of metrics. Metrics cover the traffic, errors, latency, and saturation of your system. Together, they provide a holistic view of your system's performance.
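Here is a rough sketch of how those four signals could be computed from raw request records. The record fields, the choice of p95 (nearest-rank method), and passing CPU utilization in as the saturation value are my own assumptions for illustration; CloudWatch computes its statistics for you.

```python
def golden_signals(requests, window_s, cpu_utilization):
    """Summarize traffic, errors, latency, and saturation for one time window."""
    total = len(requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    latencies = sorted(r["latency_ms"] for r in requests)
    # Nearest-rank p95: ceil(0.95 * total), done with integer arithmetic.
    rank = (95 * total + 99) // 100
    p95 = latencies[rank - 1] if total else 0.0
    return {
        "traffic_rps": total / window_s,
        "error_rate": errors / total if total else 0.0,
        "latency_p95_ms": p95,
        "saturation": cpu_utilization,
    }


# Hypothetical 10-second window: 20 requests, one of which failed.
SAMPLE = [
    {"status": 500 if i == 0 else 200, "latency_ms": 10 * (i + 1)}
    for i in range(20)
]

if __name__ == "__main__":
    print(golden_signals(SAMPLE, window_s=10, cpu_utilization=0.42))
```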

Key benefits you can leverage using Metrics are as follows:

Comprehensive Monitoring: AWS CloudWatch Metrics offer a comprehensive monitoring solution, providing insights into various aspects of your infrastructure, applications, and services.
Real-Time Visibility: CloudWatch Metrics provide real-time data on system performance, allowing you to monitor and respond promptly to changes, anomalies, or issues.
Automated Scaling: Utilize CloudWatch Metrics to set up alarms and automate scaling activities based on predefined thresholds, ensuring optimal resource utilization and system performance.
Customizable Dashboards: Create custom dashboards using CloudWatch Metrics to visualize and analyze key performance indicators, tailoring the monitoring experience to specific needs and priorities.
Unified Observability: CloudWatch Metrics integrate seamlessly with other AWS services, offering a unified observability platform that simplifies the monitoring and management of your entire AWS environment.

Infrastructure Observability with CloudWatch

Finally, CloudWatch provides detailed infrastructure monitoring, including Container Insights, Lambda Insights, Contributor Insights, Application Insights, and resource health. You can create custom dashboards, and numerous automatic dashboards are also available.

CloudWatch Alarms

CloudWatch Alarms play an important role in AWS monitoring and observability. These intelligent triggers allow users to set thresholds for specific metrics and receive notifications when those thresholds are breached. With CloudWatch Alarms, users can proactively respond to anomalies, mitigate potential issues, and optimize resource utilization. The alarms enable automated actions, such as scaling or stopping instances, ensuring optimal performance and cost-efficiency. Offering a robust alerting system, CloudWatch Alarms empower users to maintain the health and reliability of their AWS resources. With flexibility in customization and integration, CloudWatch Alarms are a fundamental tool for ensuring a resilient and well-managed AWS environment.
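The threshold logic behind an alarm can be sketched in a few lines. This is a simplified model that borrows CloudWatch's state names and the idea of "M out of N" datapoints; it ignores details such as missing-data handling, and the function itself is hypothetical rather than part of any AWS SDK.

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Evaluate the most recent periods against a greater-than threshold."""
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


if __name__ == "__main__":
    # Hypothetical CPU utilization (%) per period; alarm on 2 of the last 3 above 80.
    cpu = [40, 85, 90, 95]
    print(alarm_state(cpu, threshold=80, evaluation_periods=3, datapoints_to_alarm=2))
```

Requiring several breaching datapoints instead of one is a common way to avoid paging on a single short spike.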

And that's a wrap! With AWS, building a complete observability solution becomes a breeze. We gain deep insights into our systems (infrastructure, applications, and services), ensuring optimal performance and efficiency. AWS empowers us to monitor, analyze, and enhance our operations seamlessly.