DEV Community

Roy Ronalds

The 1000 papercuts of AWS Serverless

AWS serverless is great for scaling up to heavy usage and down to zero. It is a space to do engineering in the true sense of the word: putting together systems and composing pieces of a puzzle in a way that servers and monoliths can't match. It's how I make my living. It's why I get paid the big bucks. :p

However, the more I use serverless, the more intimately I come to know some parts of it that I wish were better. From trying to tamp down runaway costs, to trying to "see" what these very complex interlacing systems are doing internally, to just avoiding having to constantly context switch, the obstacles tend to pile up. Individually, each obstacle to doing things easily and simply doesn't amount to much on its own, but like "death by a thousand (paper) cuts", the cumulative effect weighs down attempts at good engineering.

One of the first things that I'll talk about is the surprises lying in wait in terms of cost; then I will move on to structural and complexity issues, and end with context switching and finding your place again. But costs first, because as AWS's Dr. Werner Vogels says, cost is a proxy for sustainability, so if you can't control it well, things can get unsustainable fast.

Cost Boundaries

Scaling up Spikes

In the server-based world, it was common to just spin up a server and have a relatively stable expectation of cost. Years ago, if I spun up a virtual server on Heroku or DigitalOcean or Rackspace or EC2, I generally wouldn't worry much about how much it was going to cost me, because the most it would cost was one server's worth, and that was pretty stable.

The problem, of course, is that the scaling and performance were also pretty static. Essentially every time I created a server, I was way over-provisioning it in the hope that I would never run up against its upper limits. Or if I did, I would have time to adjust. It's not great to overprovision, but at least the cost had natural maximums that were usually stable and predictable.

A Year Of An EC2 Server

On the serverless cloud, since it is often on-demand by default, the cost can spiral out of control if I don't take the time to set up limits up front. AWS ships with a 1,000 concurrent Lambda invocation limit on the account, but if someone (coughs, we won't say who) writes an infinite loop in a Lambda or a poison pill in an SQS queue, the system could run 10 invocations per second across 1,000 concurrent instances, for 10,000 invocations per second. Speaking from experience, this can rack up thousands of dollars in a day.

A Month of Serverless Services (roughly the same timeframe)

This is by design; you only pay for what you use. However, the dark side of that is that a spike in usage can rack up a major spike in cost.
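To put a number on that spike, here is a back-of-the-envelope sketch. The prices are illustrative assumptions based on the published pay-per-use model (check the current Lambda pricing page before relying on them), as are the memory size and duration:

```python
# Rough cost of a runaway Lambda loop, using example prices:
# $0.20 per 1M requests, $0.0000166667 per GB-second of compute.
REQUEST_PRICE_PER_MILLION = 0.20
GB_SECOND_PRICE = 0.0000166667

def runaway_daily_cost(invocations_per_sec, mem_gb=0.5, avg_duration_sec=1.0):
    """Daily cost of a loop that keeps the account's concurrency pegged."""
    daily_invocations = invocations_per_sec * 86_400  # seconds in a day
    request_cost = daily_invocations / 1_000_000 * REQUEST_PRICE_PER_MILLION
    compute_cost = daily_invocations * mem_gb * avg_duration_sec * GB_SECOND_PRICE
    return request_cost + compute_cost

# 1,000 concurrent instances each re-invoking ~10 times per second:
print(f"${runaway_daily_cost(10_000):,.2f} per day")
```

Under those assumptions the loop lands comfortably in the "thousands of dollars per day" range, which matches my experience.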

Missing low-granularity budget controls

The granularity of budgets is either really long, like a month, in which case I hear about a problem way too late (finding out on the 15th of the month that there's a major cost spike can be too late and costly), or it is really noisy, in which case the signal that I want to hear about gets lost in the noise ("Oh no, a cost overrun is currently happening!" quickly gets lost in "another daily budget adjustment as it swings back and forth").

RDS costs by month
Here is a monthly cost bump that I got in RDS (the maroon) that, because it is viewed at a monthly granularity, is difficult to track down.

When viewed at a daily granularity, it becomes clearer that this was a service that was added:

RDS costs by day

So the more granular you can make your cost monitoring, the easier it can be to debug.

Unfortunately, for budgets the granularity bottoms out at one day, so for alerting and alarming against cost overruns, budgets are not particularly effective.
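Daily is still better than monthly, though, so it is worth setting the floor. Here is a minimal sketch of a daily cost budget body for `aws budgets create-budget` (the name and the $50 limit are placeholders to adjust for your account):

```json
{
  "BudgetName": "daily-cost-guardrail",
  "BudgetLimit": { "Amount": "50", "Unit": "USD" },
  "TimeUnit": "DAILY",
  "BudgetType": "COST"
}
```

You would pass it as something like `aws budgets create-budget --account-id 123456789012 --budget file://daily-budget.json` (the account id is a placeholder), adding `--notifications-with-subscribers` so the alert actually reaches an email or SNS topic.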

Cost Anomaly detection

Ideally, automatic anomaly detection would fill this gap, catching weird cost overruns in unexpected areas. In practice, I have not found Cost Anomaly Detection to be a helpful assistant in detecting cost spikes. It just responds too late, and to too-regular anomalies:

At a granularity of a day, by the time 24 hours have gone by, I can have already racked up thousands of dollars before ever getting an email that something has gone wrong.

Cost Anomalies with Low Granularity
Here is an example of some cost anomalies that do get caught, but where the anomalies are often so minor that they amount to noise: an "anomaly" of 0.7 cents.

Architectural Design Simplicity

Automating code deployments has been made easy; automating Infrastructure-as-Code deployments is still rough

As I have gone in depth into using Infrastructure-as-Code (hereafter IaC) for work and personal projects, I commonly want to automate deployment of my infrastructure alongside my code. Code in the same repository, infrastructure in the same repository, data fixtures in the same repository, and I can build the whole application all at once. Guides that talk about code deployment often take great, direct approaches. Guides that talk about IaC deployments, on the other hand, are often another matter.
There seems to be a substantial lack of visualization tools for IaC to make it easier to handle the complexity it brings with it. I have spent many hours at work designing architectural diagrams... ...of already existing infrastructure, in an attempt to understand it and capture its purpose. Conversely, if I am diagramming something that I am architecting from scratch, it would make a lot of sense if there were a way to map from a visual diagram to a baseline architecture and modify both in concert.

There are possibilities around this space:

  • AWS CodeStar
  • AWS Amplify Studio
  • AWS CodeCatalyst
  • etc

But these generally seem to revolve around using a standard blueprint and spinning up a pre-generated set of infrastructure, as opposed to creating infrastructure and then splitting apart the pieces that are provisioned, in order to retroactively refactor as time goes on.

I find a gap with CloudFormation and AWS CDK, where it can be very difficult to see holistically what has already been created. For example, if a legacy architectural stack was created with little institutional architectural documentation, it is quite difficult to go and see what all the resources of that legacy stack's app/ecosystem are.

So it feels like there is a missing toolset around IaC, one that code itself has matured enough to already have. Unfortunately, because the power of IaC is so strong and far-reaching, architects and engineers need those tools all the more to keep complexity under control. Just like the thoughts that early global explorers probably had while navigating new coastlines by the stars: "I really wish I had a good map to see where I am NOW."

Visibility

While we're near the topic of visibility, let's talk about the somewhat unrealized promise of observability on the AWS cloud. With a complex system, it becomes important to "see" what each of the pieces is doing when each operates independently of the others. With microservices upon microservices, each piece could be the failure point that I am looking for when debugging, so I need to be able to "see" each one. But observability is pretty difficult to set up on AWS when not working directly in the console, and after-the-fact monitoring is just no longer enough when trying to track down which piece of a 30-link service chain is the one that is broken.

Take an SQS-to-Lambda-to-DynamoDB stack.

SQS to Lambda to DynamoDB

If I look in DynamoDB, I can see the stored data. If I look at SQS, I may see the messages in flight if something is wrong or slow, but more likely I will see an empty queue and no evidence of whether it is empty because messages have already flowed through it, or because messages have not yet flowed through it.

With the Lambda, I will see the traces of events after the fact, the invocation metrics, and the logging entries that I have to manually create, but what I actually want to see is the events that are continually flowing through the Lambda.

So serverless requires connecting together these many-link chains of services, and it stops being effective to just monitor the input end of the chain and expect certain output at the other end. The number of chain links where things could be breaking down is too high; real-time observability of the behavior of the data through the chain becomes necessary.
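One workaround here is not an AWS feature at all, just discipline: stamp a correlation id on each message at the head of the chain, and emit structured JSON at every hop, so that one request can at least be reconstructed from the logs after the fact. A minimal sketch (the service names are made up):

```python
import json
import time
import uuid

def log_hop(service, event, correlation_id):
    """Emit one structured log line; searching logs for the
    correlation_id later reconstructs the whole chain for one request."""
    line = json.dumps({
        "ts": time.time(),
        "service": service,
        "correlation_id": correlation_id,
        "event": event,
    })
    print(line)
    return line

cid = str(uuid.uuid4())  # minted once, at the head of the chain
log_hop("producer", "enqueued order", cid)
log_hop("worker-lambda", "processing order", cid)
log_hop("writer-lambda", "wrote item to table", cid)
```

It is a poor substitute for real tracing, but it costs almost nothing to adopt and works across every link of the chain.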

Unfortunately, observability seems to have a high setup overhead, so much so that it is usually relegated to the domain of large corporations, and it adds a lot of overhead for individual engineers. For example, I was interested in getting real-time tracing and observability with AWS X-Ray, but even after setting it up for a reduced test case stack, it ended up not actually showing any live data, and I gave up on it because the overhead, not just the learning curve, seems to be too high to be an efficient use of time in standard use cases.

Cloudwatch Log Ease of Access

Ok, so if observability isn't easily accessible, we have to fall back to monitoring (e.g. monitoring Lambdas, since most custom code will be in the Lambdas), and for that we need CloudWatch.

Now, CloudWatch logs have two problems: out-of-time-order output, and difficult findability.

The out-of-time-order problem is generally an interface problem. It is common to want to trigger behavior X in your app and check for output Y in the CloudWatch logs. However, because looking at one stream doesn't take you to the next stream when a new Lambda instance is created, you may never see your output Y while watching a single log. Since every Lambda instance creates its own stream, and you can't predict when a new Lambda instance will spin up, whether for concurrency or because an old instance was killed off, you can't predict where your logged information will appear.

The way that CloudWatch Logs is presented in the AWS console is... ...like a jigsaw puzzle that has been split apart. Each stream is its own separate piece, and (ironically), they don't flow into each other.
I can't tell you how many times I have been watching for a log output message to come through in log stream A... ...when it never will, because a new stream B has been spun up in the meantime, and the log output is actually in stream B.
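What a log viewer actually has to do is merge every stream in the group into one chronological feed. Here is a toy sketch of that merge (the timestamps and messages are made up):

```python
import heapq

# Hypothetical events from two Lambda log streams: (timestamp_ms, message).
# CloudWatch keeps these in separate streams, so tailing stream A alone
# misses the output that lands in stream B.
stream_a = [(1000, "START request 1"), (3000, "still waiting..."), (5000, "END request 1")]
stream_b = [(2000, "START request 2"), (4000, "output Y"), (6000, "END request 2")]

# Interleave every stream in the group by timestamp to recover
# one chronological view of what the function actually did.
merged = list(heapq.merge(stream_a, stream_b, key=lambda e: e[0]))
for ts, msg in merged:
    print(ts, msg)
```

For what it's worth, AWS CLI v2's `aws logs tail <log-group> --follow` performs this aggregation across a group's streams for you, and is the closest thing I have found to the real-time view I keep wishing the console had.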

Log output findability is low

The findability / search problem is just that searching CloudWatch log streams is not a very user-friendly experience. It's case-sensitive by default, for some reason, and as for the documentation on operators, well, there almost isn't any, so you're often limited to knowing a priori the exact string that you need to find.
For an egregious example, to trivially find all "Error", "ERROR" and "error" instances, none of these will get you everything you need:

  • error (1/3rd of results)
  • ERROR (1/3rd of results)
  • error ERROR Error (0 results, because all three terms must match at once)

The actual search terms and operators you need, after quite a bit of research, end up being along the lines of:

  • ?error ?ERROR ?Error

And good luck if your error actually looks like ERR or something, because the matching is case-sensitive.
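To make the semantics concrete, here is a small local approximation of how these term filters behave, based on my understanding of the documented behavior (this is a sketch, not CloudWatch's actual engine): plain terms are all required, `?`-prefixed terms form an "any of these" group, and matching is case-sensitive and whole-word.

```python
import re

def matches(filter_pattern, log_line):
    """Approximate CloudWatch Logs term-filter semantics locally:
    plain terms must ALL appear (case-sensitive, whole words),
    while ?-prefixed terms match if ANY one appears."""
    tokens = set(re.split(r"\W+", log_line))
    terms = filter_pattern.split()
    required = [t for t in terms if not t.startswith("?")]
    optional = [t[1:] for t in terms if t.startswith("?")]
    required_ok = all(t in tokens for t in required)
    optional_ok = any(t in tokens for t in optional) if optional else True
    return required_ok and optional_ok

print(matches("error ERROR Error", "ERROR: it broke"))    # False: all 3 required
print(matches("?error ?ERROR ?Error", "ERROR: it broke")) # True: any one suffices
print(matches("?error ?ERROR ?Error", "ERR code 7"))      # False: still case/word sensitive
```

The last line is the papercut: even the `?` trick can't rescue you from a variant spelling you didn't anticipate.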

Frankly, even after all the time that I have been using CloudWatch logs, and getting used to the problems above, I still have no idea how other engineers are getting good information out of Lambda invocation logs. I am still searching for a real-time log viewer & aggregator tool, so that I could just see output pop up, in real time, whichever log stream it ends up in.

These CloudWatch papercuts are small, but because CloudWatch and monitoring are so core to everything, and because you'll be looking at logs in so many scenarios (invocations, alarms, Dynamo errors, even billing), those usability issues are going to hit constantly, probably daily.

Navigation

Now, in a hypothetical monolith, if you have a problem, you generally know where to go to debug: you go to the application (and its server, and the infrastructure built by other people that runs that app). Crucially, because it's a monolith, it's generally all in the same place. When running serverless, though, you have to collect together the services and context switch between them.

Again, with a monolith, it's all there, all in the same place. That place is complex and intermingled, but hopefully you at least know where it is; it's on that server, or whatever.
With serverless, you have to navigate and metaphorically travel back and forth, between contexts and between services. It can be disruptive to your process to hop between Lambda, SQS, alarms in CloudWatch, and DynamoDB. So navigation becomes a key activity.
Navigation should be a smooth helper for moving between these things. Navigation, through a good information architecture, should highlight the things that are likely to be most important to the user, and deprioritize those things that the user rarely uses.
Here is what my current navigation in AWS looks like, though:

manually favorited SERVICES in aws

Or on the homepage there is one other jumping-off point:

Recently visited SERVICES

Unfortunately, while services are great for engineering and putting together applications, they don't really work for navigation, for finding things.
For example, I really don't want to navigate to the base page of AWS Lambda; I want MY Lambda functions that I was most recently working on. I don't want the general category of DynamoDB; I just want to get back to the actual Chat table that I was most recently working on. Because I had to context switch out to a different service, and now I am trying to get back into flow.

The current navigation is very "flat". There are hundreds of AWS services, and each is presented as if they were all equally important to the current user. Of course they aren't. They can't all be equally important. Manually favoriting and bookmarking services is one way to communicate a bit of priority, but the "services" are really never what I'm looking for anyway. The services are like a whole herd of similar zebras, and I am searching for the one that I previously shot a tranquilizer dart into.

UI Hierarchy

Traditional websites go to a ton of trouble to make it easy for me to get to the most useful thing that I want to do, especially when they are selling a product. "Oh, you're ordering seeds online again? Let me make the [Buy This Seed] button 5 times bigger, with an image." By contrast, with AWS services you are switching between concepts that are treated a bit too equally, in both concept and navigation. That flat treatment of services makes serverless context switching that much more of a difficulty.

Fixing the UI Hierarchy

You can create an application and a resource group to gather up the pieces within a category that you are working on, but that is a very manual process. What is really needed and desirable within the AWS UI is a landing page where the most recently used resources (not just services) are continuously collected. Last 5 viewed Lambdas. Last 5 viewed queues. Mix the services together into a "last 10", so that returning to a serverless piece is just 1 click away, and getting back to work is easy.

Service Icons

Finally, an anecdote about the AWS service icons:

There is a joke that floats around the AWS community about a quiz you can take to test your knowledge of the 3D AWS service icons. It's a fun little 5 minutes of satire rooted in the reality of using all the services: https://news.ycombinator.com/item?id=17697366 I personally have memorized some of the icons as a way to navigate quickly, but is memorization really the best navigational aid we can have?
