Observability has gained increasing attention over the past few years - and for good reason. The ability for engineers to easily discover, walk through, and reason about the state of their systems and services is crucial to acting on outages, bugs, and failure states efficiently and effectively.
Honeycomb, a leader in this space, has developed a powerful observability tool for event ingestion and interrogation.
There are plenty of blog posts out there about observability, what it is, and what it is not, as well as plenty of "getting started with Honeycomb" guides. Honeycomb even offers live playgrounds of their product that you can demo for free! If you are unfamiliar with the concept of observability, Honeycomb, or both, I encourage you to seek out and start with those posts and tutorials and return to this series later. This series is primarily aimed at engineers who have a basic familiarity with observability, may have recently joined a team or company using Honeycomb, and want to see some real-world examples of everyday use.
I am by no means an expert on the use of Honeycomb - if a feature exists that I am unfamiliar with, or you know something about the tool that could help me or others, please chime in in the comments or feel free to write up your own post about it and let me know! The intent of this series is to show how I use it in my day-to-day work, to share with others, and to learn as well.
I am a senior software engineer at Blackbaud, a mid-size enterprise/SaaS company focused on software targeting the Social Good sector. Over the past several months, a team of my colleagues took on the work of creating standard libraries and tooling to emit events from more than 400 microservices to our Honeycomb dataset. While I won't go into the implementation specifics in this blog post, it's important to understand the sheer magnitude of the events we're dealing with. Since event sampling was put in place, we usually see about 250-275 million successful events ingested into Honeycomb every workday.
My team is primarily responsible for the identity and authentication services that allow our customers (and theirs) to authenticate and interact with our systems. Needless to say, our services are among the most critical of the entire stack -- if authentication is down, so is the rest of our stack. As such, our ability to respond to, triage, diagnose, and resolve issues accurately and efficiently is extremely important. Honeycomb is a tool that helps us answer questions effectively and feel more confident in the status of our services.
Traditionally, the industry has leaned on logs, metrics, and alerting to diagnose, troubleshoot, and resolve issues:
- We develop dashboards to monitor "key areas" of our systems for issues.
- We rely on symptoms like "high CPU", "error rate", and "connection status" to alert us to problems with our services.
- When problems arise, we sift through logs, scan metric charts, and try to guess at where the issue is.
We have an idea of where problems might arise, so we structure our logging, monitoring, and alerting to focus on those known areas.
The core tenets of Observability require us to shift from a symptom-reactive, guess-and-check mindset to one that is proactive and interrogative. Distributed microservice architectures have brought with them added complexity - a single call to one API service could result in dozens of calls to other services on the back-end, any one of which could fail for any number of reasons.
We can no longer reasonably expect to predict, monitor, and prevent all failure states of our application. We can put logging, monitoring, and alerting in place to watch for symptoms, but truly understanding root cause and pinpointing complex issues quickly requires the ability to ask our telemetry questions that we haven't even thought of yet.
Honeycomb is a tool for doing just that -- answering questions about our systems.
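To make "questions we haven't thought of yet" a little more concrete, here is a minimal sketch in plain Python. It is not Honeycomb's API or query language, and the event fields (`service`, `endpoint`, `status`, `duration_ms`, `tenant`, `build`) and the `breakdown` helper are made up for illustration. The idea it tries to show is that when each request emits one wide event carrying lots of context, you can slice by any field after the fact, rather than only by the dimensions you decided to chart in advance.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical wide events: one dict per request, with as much context
# attached as possible. Field names and values are invented for this sketch.
events = [
    {"service": "auth-api", "endpoint": "/token", "status": 200,
     "duration_ms": 42, "tenant": "a", "build": "1.4.2"},
    {"service": "auth-api", "endpoint": "/token", "status": 500,
     "duration_ms": 910, "tenant": "b", "build": "1.4.3"},
    {"service": "auth-api", "endpoint": "/token", "status": 200,
     "duration_ms": 38, "tenant": "a", "build": "1.4.2"},
    {"service": "auth-api", "endpoint": "/userinfo", "status": 500,
     "duration_ms": 875, "tenant": "c", "build": "1.4.3"},
]


def breakdown(events, group_by, where=lambda e: True):
    """Count events and average their duration per value of any field."""
    groups = defaultdict(list)
    for event in events:
        if where(event):
            groups[event.get(group_by)].append(event["duration_ms"])
    return {
        key: {"count": len(durations), "avg_duration_ms": mean(durations)}
        for key, durations in groups.items()
    }


# A question nobody planned a dashboard for: are errors tied to a build?
errors_by_build = breakdown(events, "build", where=lambda e: e["status"] >= 500)
print(errors_by_build)
```

The same `breakdown` call works for `tenant`, `endpoint`, or any other field on the event, which is the property that matters here: the question gets chosen at query time, not at instrumentation time.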
In this series, I will walk through a few real-world examples of Honeycomb in use that demonstrate its practicality and the benefits of adding it to your toolbelt - no matter the stage of your career or your familiarity with the service.
Scenarios will differ in complexity and technicality. I intend this to be a running series of common and interesting real-world use-cases for Honeycomb that I run into. Feel free to offer up feedback and suggestions of your own in the comments!
- Performance and Cost Optimization
- Solving for Customer Delight
- Incident Response
This is a living list of features that I would find helpful in Honeycomb that do not yet exist (I think!) as of this post.
- The ability to re-arrange query operators and visualizations.
  - Right now, in order to re-arrange operators/visualizations, you need to remove them and re-add them. It would be nice to be able to re-arrange/edit these on the fly.
- Allow me to restrict Marker visibility to certain conditions, such as microservice name.
  - This would make Markers useful for larger customers with hundreds of services and developers using the tool and querying across a shared dataset.