Darko Mesaroš ⛅️ for AWS

Build On Live | Observability - Show notes

Welcome to our second Build On Live event, which is all about observability. AWS experts and community members cover topics ranging from tracing and OpenTelemetry to SLOs, eBPF, and more!

This event was hosted by Jacquie Grindrod and Ricardo Ferreira, and it was a sight to see. With both in-person and remote guests, the industry experts covered a variety of specialty topics within the observability field.

These are the show notes from that event, which was live streamed on the 29th of September 2022. The recording of this event is available on our YouTube channel, but in this article I will be linking each segment's notes to its video, so make sure to hit that subscribe button! 👏


Individual session notes:

Intro and Welcome

Guest: Curtis Evans, Developer Advocate at AWS

We kick off the day with your hosts Ricardo and Jacquie and guest star Curtis, discussing the basics of observability and what we can expect throughout the day.

Asked to "explain it to us like we are 5", Curtis takes the example of an e-commerce website: everything that goes on inside the application, and how we track each execution and process at each step. That tracking can give us more insight in the future and help us detect and solve a problem before it even occurs.

Go Beyond Observability with AI

Guest: A.B. Vijay Kumar, IBM Fellow

Did you know AI can be used to help you understand the stream of events from observability data and alert you when something is a bit off? In this session, A.B. Vijay Kumar from IBM gives you an overview of how to leverage AI in what the industry calls AIOps (also called Augmented SRE), which can be a powerful component of your observability strategy. He shows a nice demo where AI is used to detect anomalies and trigger conversations on Slack, where the people on call can discuss what the problems are and what actions the AI recommends based on past incidents.

Also, check out the 1:14 mark in the video: something funky happened to the camera 😅

Ricardo and Jacquie talking to A.B. remotely

Some links that were shared:


How to Migrate Observability Platforms With OpenSLO

Guest: Ian Bartholomew, Site Reliability Engineering Manager at Nobl9

One of the best ways to measure the availability of services is to set SLOs (Service Level Objectives) for them. They give you a more pragmatic way to measure things and to calculate how much it costs to have a service unavailable. What you may not know is that you can define them in YAML using a vendor-agnostic standard called OpenSLO. In this session, Ian Bartholomew from Nobl9 gets into the weeds of how to do that, showing step by step the definition, implementation, and migration of OpenSLO definitions from on-premises to the cloud. And he does all of this with a hands-on approach using only the terminal. How cool is that?

How do you explain or justify error budgets to engineering and upper management? You need to state very clearly that service uptime will never be 100%. It can get very close, but every extra nine costs a lot of extra money, and that cost grows exponentially. The error budget defines how much unreliability is acceptable.
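To make the "extra nines" argument concrete, here is a minimal sketch of the arithmetic behind an error budget. It is not taken from the session: the function names and the 30-day window are assumptions chosen for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO target.

    The 30-day default window is an assumption for this sketch.
    """
    # A 100% target leaves no budget at all, so it is rejected here.
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100, exclusive")
    return (1 - slo_percent / 100) * window_days * MINUTES_PER_DAY


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return 1 - downtime_minutes / budget


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = error_budget_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes per 30 days")
```

Each added nine cuts the budget by a factor of ten: roughly 432 minutes at 99%, 43 minutes at 99.9%, a little over 4 minutes at 99.99%, and under half a minute at 99.999% over 30 days.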

Quote from Charity Majors: "Nine nines mean nothing if your customers are unhappy"

Ian, Ricardo and Jacquie in the Studio

Some links that were shared:


Intro to eBPF: Explain to Me As if I Was Five

Guest: Antón Rodríguez, Principal Software Engineer at New Relic

Implementing observability these days is a one-way trip into the world of instrumentation. But instrumenting services can be a lot of work, depending on the approach you use. To reduce that workload, you can use eBPF, a technology that runs sandboxed programs inside the operating system kernel and gives you a non-intrusive way to instrument your services. In this session, Antón Rodríguez from New Relic explains how he got involved with this technology, how he applies it to his daily job as a software engineer, and also shows a cool demo of eBPF using the CNCF project called Pixie.

Antón showing us the Pixie dashboard

Some links that were shared:


OpenTelemetry—The Industry Telemetry Standard

Guest: Michael Hausenblas, Open Source - Observability at AWS

OpenTelemetry these days is arguably a synonym for observability. This CNCF project, which provides a vendor-agnostic way for developers to instrument their services to produce telemetry data such as traces and metrics, is catching on. But a project of that magnitude can often be hard to understand. In this session, Michael Hausenblas from AWS gives us an overview of OpenTelemetry, his involvement in this project with CNCF, and how you can get started with it using the right resources. He also clarifies the many sub-projects that exist for OpenTelemetry, which are handy to know as you find your way into this technology.

Michael, Jacquie and Ricardo

Some links that were shared:


That's a lot of data! How to manage ingestion and storage costs?

Guest: Richard "RichiH" Hartmann, Director of Community at Grafana

All monitoring initiatives start the same: you collect as much data as you can and store it indefinitely so you can analyze it someday. Then one day the bill comes and upper management asks you to decrease infra costs. Suddenly, you shrink your monitoring infra and end up with a shared cluster for all the 100+ services that need some visibility. In this session, Richard Hartmann from Grafana Labs explains why this is an anti-pattern that should not occur every time people panic about infra costs related to monitoring. He shares some best practices for optimizing the ingestion and storage costs of telemetry data. As he walks us through the different types of telemetry data, he also explains what golden signals are and how you should implement them.
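As a rough illustration of golden signals, the sketch below computes the four signals described in Google's SRE book (latency, traffic, errors, saturation) from a batch of request records. The session may have framed them differently; the `Request` type, the field names, and the choice of p95 latency and CPU utilization are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; real telemetry would carry more fields."""
    duration_ms: float
    status: int  # HTTP status code


def golden_signals(requests: Sequence[Request], window_seconds: float,
                   cpu_utilization: float) -> Dict[str, float]:
    """Summarize one time window of requests into four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")

    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    rank = max(1, (95 * len(durations) + 99) // 100)
    server_errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": server_errors / len(requests),
        # Saturation is passed in; here it stands for CPU utilization (0..1).
        "saturation": cpu_utilization,
    }
```

Keeping a handful of aggregates like these per service, instead of every raw event forever, is one way to keep visibility while reducing what you ingest and store.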

Richard and the hosts on stream

Some links that were shared:


Observability and Distributed Tracing at CNCF

Guest: Yuri Shkuro, Creator of Jaeger, Co-founder of OpenTelemetry

Despite what most people think, logs are not the only way to understand the execution path of code. Traces can tell that story too, and when combined with logs, they provide an even more interesting story. This is the reason distributed tracing is so mainstream these days. In this session, Yuri Shkuro, from Meta, gives his perspective on the importance of distributed tracing, how he bumped into this technology by accident, and how he helped it grow during his tenure at CNCF. He also explains how projects like OpenTracing, Jaeger, OpenCensus, and Zipkin helped pave the way to what we know as OpenTelemetry.

Yuri, Ricardo and Jacquie on stream

Some links that were shared:


Creating trace data with OpenTelemetry

Guest: Curtis Evans, Developer Advocate at AWS

There is no better way for developers to learn how to use a specific technology than rolling up their sleeves and writing some code for it to build an example. Curtis Evans from AWS strongly believes that, and he steps up to help us understand OpenTelemetry in a more hands-on way. In this session, he takes his time to instrument a microservice written in Python to produce telemetry data for traces and metrics. He highlights the difference between a piece of code with and without instrumentation, so you can have a sense of how much work is needed.

Curtis showing some telemetry data on screen

Some links that were shared:


Talking Observability with Liz Fong-Jones

Guest: Liz Fong-Jones, Field CTO at Honeycomb.io

Working with observability sometimes looks like building a puzzle with curvy edge pieces. The task at hand is to understand each piece and never lose track of the big picture. The problem is, with observability, knowing what the big picture looks like is not always possible. In this session, Liz Fong-Jones from Honeycomb shares her extensive experience with SRE and observability, and shows us what observability looks like and how to pull each technology together to build that bigger picture. With an outstanding end-to-end example of troubleshooting a problem, fixing it in production, and coming back to development to apply the lessons learned with instrumentation, Liz shares when and how to use observability technologies to improve the system's reliability and customer satisfaction.

Also, to quote Liz from the day: "Do not be afraid of failure, failure is a learning opportunity. Break your systems as long as you learn from it, so you don't break them the same way twice!"

Liz running a demo

Some links that were shared:
