Guillaume Lagrange for Serverless By Theodo

Posted on Jan 19, 2022 • Updated on Feb 8, 2022

We Tested the Best Serverless Monitoring Solutions so You Don’t Have To

#serverless #monitoring #aws

Written by Charles Géry and Guillaume Lagrange

Go beyond AWS alerts with these monitoring tools for your serverless application

Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live. — John F. Woods

📉 Introduction

The ultimate goal of monitoring is to help you deliver quality software to your users. Monitoring is a key source of insight into the health of your application: it helps you find the cause of bugs, get an overview of performance and track custom metrics.

Serverless applications don't escape this need. However, in a serverless context your architecture is distributed across many different services, so you need specific monitoring techniques to aggregate that distributed data.

TL;DR: We liked Epsagon

For those who want a quick answer, here is our conclusion:

If you only need to monitor a serverless backend, we highly recommend Epsagon. Lumigo is a pricey but good alternative if you want a solution that is even quicker to set up.
If you already use Bugsnag or Sentry to monitor some part of your app, their serverless integration is the way to go.
If you need to use the Datadog ecosystem, you can. However, their serverless offering is not up to the standard of their other monitoring services.

EDIT 08/02/2022: If you want a deeper look at Epsagon, its cool features and how to set it up, Guillaume Duboc published a great article about it.

Let's talk about what matters

Here's a list of the aspects we deem important in a monitoring solution.

Finding the origin of a bug as fast as possible.
Alert notification channels: Slack and Teams are a must-have.
Customization of alert triggers, as well as of the alert notification content itself.
AWS services which can be monitored, with a particular emphasis on Lambda, DynamoDB, API Gateway, EventBridge and Step Functions.
- Any service outside of AWS is also a bonus, the more the merrier.
Ability to monitor custom metrics and to classify errors.
Pricing 💸
The overall UX and intuitiveness of the monitoring solution's dashboard.
Ease of installation and documentation.
Funds and number of clients: we do not want the service provider to cease activity when we need it most!

We first tested the different solutions on a small proof-of-concept project that generates various errors, available here: https://github.com/theodo/monitoring-serverless. From this first test, we selected our favorite services and tested them on larger existing projects. This gave us a shortlist of usable solutions, as well as a few bad experiences.

⌚️ But first, why not just use CloudWatch and get on with our day?

Our story begins when we received an email alert from AWS CloudWatch, telling us that one of the Lambda function executions in our project had failed. Finding the origin and the cause of the bug proved to be a complicated task. Not only was the error stack trace missing from the message, but we could not even tell which Lambda had thrown the error! To find the origin of the bug, we had to go to AWS CloudWatch, investigate to find which Lambda had an error, find the logs of that Lambda, and finally browse through them to find the error stack trace. To narrow down the search, you could set up an AWS alarm for each Lambda, but it quickly gets expensive. Either way, it is a painful and tiresome process.

That's when we asked ourselves: is there a better solution? Our cognitive time could be used to create value instead of playing hide-and-seek with every bug. We decided to test the different monitoring solutions the market had to offer to shortlist the ones that were the most suited to our needs.

Our main goal was to be able to find the origin of bugs as fast as possible. To achieve that, we needed to be informed of errors on relevant notification channels (e.g. Slack, Teams...) and to have the origin and stack trace of the errors easily available. Starting from this objective, we listed a set of criteria to compare the existing serverless monitoring solutions.
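To make that objective concrete, here is a minimal sketch of the kind of alert we were missing: a wrapper that catches a Lambda handler's error, sends a message with the failing function's name, the error message and the stack trace, then rethrows. `withErrorAlerts`, the `Notifier` type and the trimmed-down `LambdaContext` interface are hypothetical names used for illustration, not the API of any tool reviewed below; `Notifier` stands in for whatever Slack or Teams integration you would actually call.

```typescript
// Minimal stand-in for the Lambda context object: only the field we use is declared.
interface LambdaContext {
  functionName: string;
}

type LambdaHandler<E, R> = (event: E, context: LambdaContext) => Promise<R>;

// Hypothetical notification sink (Slack webhook, Teams connector, email...).
type Notifier = (message: string) => Promise<void>;

// Wraps a handler so that any thrown error triggers an alert containing the
// failing function's name, the error message and its stack trace.
function withErrorAlerts<E, R>(
  handler: LambdaHandler<E, R>,
  notify: Notifier,
): LambdaHandler<E, R> {
  return async (event, context) => {
    try {
      return await handler(event, context);
    } catch (err) {
      const error = err instanceof Error ? err : new Error(String(err));
      await notify(
        [
          `Lambda ${context.functionName} failed: ${error.message}`,
          error.stack ?? "(no stack trace available)",
        ].join("\n"),
      );
      // Rethrow so the invocation is still reported as failed.
      throw error;
    }
  };
}
```

Rethrowing is the important design choice here: from AWS's point of view the invocation still fails, so retries and the built-in Lambda error metrics keep working alongside the richer alert.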

🐰 Epsagon

Let's be real, Epsagon is the winner. It ticked all the boxes of our checklist. The installation could not be simpler: deploying the CloudFormation stack and selecting the AWS resources we wanted to trace was a matter of minutes. The only drawback is the out-of-the-box dashboard experience. While some services provide an excellent default dashboard, Epsagon gives you an empty dashboard that you have to configure yourself. Fortunately, some templates are available to get you started, depending on the specifics of your application. A better default dashboard would have been appreciated. Other than that, the experience has been absolutely stellar.

We were able to do everything we wanted with Epsagon, and more:

Set up instant alerts via email/Slack/Teams/Telegram, triggered by customizable filters
Custom metrics using the provided Epsagon SDK
Trace requests across multiple AWS services
Customize our dashboard with very flexible widgets
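To give an idea of what alerts "triggered by customizable filters" involve, here is a hypothetical sketch of filter-based routing: each rule can match on the function name and the error type, and names the channel to notify. The `ReportedError` and `AlertRule` shapes and the `routeAlert` function are assumptions made for illustration; this is not Epsagon's configuration format.

```typescript
// Simplified error event, as a monitoring tool might record it (hypothetical shape).
interface ReportedError {
  functionName: string;
  errorType: string;
}

// Hypothetical alert rule: every criterion that is defined must match.
// Use non-global regexes: a /g regex keeps state between .test() calls.
interface AlertRule {
  functionNamePattern?: RegExp;
  errorTypes?: string[];
  channel: "email" | "slack" | "teams" | "telegram";
}

// Returns the channels to notify for an error, each channel at most once.
function routeAlert(error: ReportedError, rules: AlertRule[]): string[] {
  const channels = new Set<string>();
  for (const rule of rules) {
    // An undefined criterion matches everything.
    const nameMatches = rule.functionNamePattern?.test(error.functionName) ?? true;
    const typeMatches = rule.errorTypes?.includes(error.errorType) ?? true;
    if (nameMatches && typeMatches) {
      channels.add(rule.channel);
    }
  }
  return Array.from(channels);
}
```

For instance, with the rules `{ functionNamePattern: /^payment-/, channel: "slack" }` and `{ errorTypes: ["TimeoutError"], channel: "teams" }`, a `TimeoutError` in `payment-charge` is sent to both Slack and Teams, while a `ValidationError` in `user-signup` triggers no alert at all.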

By passively using the app and having the events traced, Epsagon was able to build an up-to-date Service Map. From this map, you can access each traced invocation of monitored services. This makes Epsagon not only an excellent monitoring service, but also a nice tool to visualize what actually happens in your application.

Exported view of an application through Epsagon's Service map. Names have been hidden for privacy.
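The idea behind a service map can be sketched in a few lines: if every traced span records the service it ran in and the span that triggered it, each parent/child pair spanning two different services is an edge of the map. The `Span` shape and the `buildServiceMap` function below are simplified assumptions for illustration, not Epsagon's actual trace format.

```typescript
// Simplified trace span: which service handled it and which span triggered it.
interface Span {
  id: string;
  parentId?: string;
  service: string; // e.g. "API Gateway", "createOrder Lambda", "DynamoDB"
}

// Builds a service map as "caller -> callee" edges with the number of calls observed.
function buildServiceMap(spans: Span[]): Map<string, number> {
  const byId = new Map<string, Span>();
  for (const span of spans) byId.set(span.id, span);

  const edges = new Map<string, number>();
  for (const span of spans) {
    const parent = span.parentId !== undefined ? byId.get(span.parentId) : undefined;
    // Skip root spans, spans whose parent was not traced, and calls within one service.
    if (!parent || parent.service === span.service) continue;
    const edge = `${parent.service} -> ${span.service}`;
    edges.set(edge, (edges.get(edge) ?? 0) + 1);
  }
  return edges;
}
```

Because the map is derived only from traffic that was actually traced, it stays up to date as the application evolves, which matches what we observed by passively using the app.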

Epsagon was acquired by Cisco last October and extended its free tier to 10M traces per month. This makes Epsagon by far the most cost-effective solution we tested. We can only recommend that you join the hype and try it out! Keep in mind that the Epsagon stack uses CloudTrail, which can add costs to your project.

Because nothing can be perfect, we have to mention that we had issues inviting team members while using Google SSO. This was resolved by creating an account directly with Epsagon, but there's definitely room for improvement here.

+	-
Extensive and very customizable filters	Dashboard can be painful to set up
Very generous free tier
Out of the box tracing is very performant
Service map view

If you want to monitor your AWS serverless app with highly customizable alerts and are willing to spend the time configuring the tool, we highly recommend Epsagon.

☄️ Lumigo

Lumigo started out as our favorite, since the initial experience was better than with Epsagon. The installation process is perfect, and Lumigo can be set up in a matter of minutes. The default dashboard is functional. However, the default widgets cannot be removed, and in our opinion forcing them on everyone is not a good idea.

In terms of features, however, Lumigo is packed with useful monitoring tools for your serverless application. You can quickly access the logs related to an alert, trace a request through your different services, access the occurrences of categorized errors, and more.
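Categorizing errors typically comes down to grouping occurrences that share a fingerprint, so that a thousand failures caused by the same bug show up as a single issue with a count. Here is a minimal sketch, assuming a fingerprint built from the function name, the error type and the first stack frame (in the V8 format used by Node.js). `fingerprint` and `groupErrors` are hypothetical names, and real tools use their own, more elaborate grouping rules.

```typescript
interface ErrorOccurrence {
  functionName: string;
  error: Error;
}

// Hypothetical fingerprint: function name + error type + first stack frame.
// The message is ignored on purpose, since it often embeds ids or timestamps.
function fingerprint({ functionName, error }: ErrorOccurrence): string {
  const firstFrame =
    (error.stack ?? "")
      .split("\n")
      .map((line) => line.trim())
      .find((line) => line.startsWith("at ")) ?? "unknown frame";
  return [functionName, error.name, firstFrame].join(" | ");
}

// Groups occurrences by fingerprint so each group can be shown as one issue with a count.
function groupErrors(occurrences: ErrorOccurrence[]): Map<string, ErrorOccurrence[]> {
  const groups = new Map<string, ErrorOccurrence[]>();
  for (const occurrence of occurrences) {
    const key = fingerprint(occurrence);
    const group = groups.get(key) ?? [];
    group.push(occurrence);
    groups.set(key, group);
  }
  return groups;
}
```

With this fingerprint, `Order 1 not found` and `Order 2 not found` thrown from the same line of code in the same Lambda land in one group, while a `TypeError` raised in that same Lambda becomes a separate issue.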

Lumigo is least competitive on pricing. Its free tier is limited to 150K traces a month. To put this in perspective, its $300-a-month plan only offers half as many traces as Epsagon's free tier (5M versus 10M).

Lumigo also provides a live view of your Lambdas, so you can track what is happening in your application at all times!

+	-
Easy installation	Quickly expensive
Default dashboard is already satisfactory	Default widgets cannot be removed
Live view of lambda invocations

If you want to quickly set up a monitoring tool for your serverless application and make use of features like Live Tail, we would recommend Lumigo. While pricier than Epsagon, it will definitely save you debugging time.

🔺 Sentry

The next two services both offer great experiences, but lack specific features for serverless monitoring. The first one, Sentry, is a great generalist monitoring solution. It is easy to set up and the UX feels smooth. It offers a few options to monitor your Lambdas and get notified quickly when one of them fails (we received error notifications in ~30s during our tests). Yet, the range of options offered for serverless monitoring is limited compared to our main challengers: you can only monitor Lambda functions, and you can't add custom metrics inside your Lambdas, for instance. The range of notification integrations was also quite limited (at least with the free tier) compared to its competitors.

But Sentry really shines when you have to monitor applications that use many technologies: it integrates with a great many of them. Overall, we recommend Sentry if you want an easy-to-use monitoring solution and serverless monitoring is not your main concern.

Sentry's default email alert contains information about the Lambda that caused the error and its stack trace

+	-
Lightweight installation	Limited Options for Serverless monitoring
Smooth UX	Limited notification channels in free tier
Great when you need to monitor services that are not serverless
Great to quickly find the origin and the cause of errors

🌀 Bugsnag

Bugsnag offers a clean and functional experience. You just need to wrap your lambdas with Bugsnag code, and you are good to go. Error notification is lightning fast (we received Slack notifications for errors in ~1s during some of our tests), and it integrates with dozens of other apps. The website looks awesome (even though you can't create custom dashboards) and the documentation is crystal clear.

However, Bugsnag was not developed with serverless in mind. Lambda monitoring is added through a plugin and has limited options compared to the rest of the framework (you can't use custom metrics, and you can't filter errors by Lambda, for instance). Serverless services other than Lambda cannot be monitored through Bugsnag.

All in all, Bugsnag is a cool tool, but it has limited customization options and was not designed with serverless in mind.

+	-
Lightweight installation	Limited Options for Serverless monitoring
Smooth UX	No customizable dashboard
Great when you need to monitor services that are not serverless (but a little worse than Sentry)
Great to quickly find the origin and the cause of errors

The bad students

Let's be honest, the services that follow were not all that bad. They just did not meet our set of criteria, but they might still suit your project.

🐶 Datadog

Our experience with Datadog was kind of a surprise. Setting it up to monitor a serverless application was complex, and the documentation didn't help. Once set up, the error classification and the notification system proved too limited for our specific serverless needs. For instance, out of the box you cannot trigger a Slack notification with the name and error message whenever you get a new error; you have to set up a monitor for this manually. On the bright side, Datadog offers tons of integrations with different services, the possibility to track custom metrics, and many options to monitor just about every existing system.

While we don't recommend using Datadog on a fully serverless project, it might be interesting on a project that mixes serverless and other technologies. Datadog remains an excellent monitoring solution if you know you are not going to use serverless!

+	-
Market leader, with the widest monitoring offering out there	Limited Options for Serverless monitoring
A lot of potential with Datadog's track record	Cumbersome installation
	Documentation could be clearer, and is sometimes outdated

⚡️ Serverless Dashboard

The developers behind the famous Serverless Framework also offer a monitoring service which easily integrates with their framework. It is super easy to add to your project (you literally add one to three lines to your serverless.yml and that's all) and the dashboard interface is very neat, but the monitoring options it offered were more limited than its rivals'. In particular, its notification and error classification options were not as good as those of its counterparts. We also did not find a way to create custom metrics. As its name indicates, Serverless Dashboard is limited to serverless monitoring.

As a side note, we encountered a few issues when using Serverless Dashboard with TypeScript that prevented us from getting monitoring insights.

Serverless Dashboard might be the solution for you if you need a simple and easy-to-set-up dashboard when using the Serverless Framework.

+	-
Super easy installation	Limited Monitoring Options
Beautiful UI	No customizable dashboard
	We had a few issues when using TypeScript

🕊 Dashbird

Dashbird is the penultimate service on our list. While its interface is pretty and it integrates with many AWS services, we found its UX pretty poor and its serverless monitoring options limited. There were no custom metrics, the error classification was limited, and we had trouble getting notified of errors.

+	-
Nice UI	Poor UX
Integrates with many AWS services	Limited Monitoring Options
	A few issues when trying to get notified

🧿 New Relic

Unfortunately, we were not able to test New Relic. We got errors during the account creation and the setup was too cumbersome. Hopefully, we will be able to test it in the future!

Conclusion

While Epsagon is the winner of our comparison, we want to remind you that all the solutions presented here were compared in light of our specific set of criteria. We wanted a monitoring solution that fit perfectly into the serverless ecosystem, integrated well with existing notification channels, and helped retrieve the origin and stack traces of errors quickly and effectively.

Charles Géry & Guillaume Lagrange are software engineers at Kumo, serverless expertise by Theodo

Top comments

Ismail Egilmez • Jan 20 '22

Hey Charles and Guillaume, congrats on this great article!
I couldn't see Thundra in your post and felt a bit upset about it. Maybe you can try out Thundra APM for the next time if it feels right.

Guillaume Lagrange • Jan 20 '22

Hello Ismail, thank you very much for your comment.
We were under the impression that Thundra's focus shifted from their serverless service and this is why we did not test it out, but it does indeed look like it would belong in the article!

Ismail Egilmez • Jan 20 '22

Ah, I see! Thundra actually created 2 more products on top of its oldest product - APM. APM is actually kinda among the mature serverless monitoring tools as of now, I believe. Thanks very much for your kind response.