Deploying a single "hello world" Lambda function is easy. Growing an app that consists of tens or hundreds of functions and many other resources is also fairly easy, although it can be intimidating and it is not always obvious where to start or how to test it. It requires some discipline too. In this post I'll share my experience of running serverless applications in production.
The first essential thing is a deployment pipeline. No matter which infrastructure-as-code tool you use, the pipeline needs to be there. All application code and infrastructure changes should be reviewed by your peers and shipped to production from day one. It takes time to learn how to deploy continuously so that partial changes do not break the app for users, which is why it's worth shipping to production while the project is still fresh.
Running your application locally is a bad approach. There is no point in trying to set up a local environment: it is wasted effort to get things working on your laptop only to find out, once you deploy, that some IAM permission is still missing. Even if your app is just a bunch of Lambdas triggered by HTTP events and could profit from solutions like serverless-offline, it rarely stays that way for long. It's also highly likely that there is no offline version of the new service you want to use. Thus, from now on, your app will run only in the cloud. This includes staging and production, as well as individual stacks for testing new features, deployed on demand. Serverless requires a mindset switch: the ability to create and destroy any cloud resource on demand is a superpower. Embrace it.
Deploying the app every time something changes is not feasible, though, and a quick feedback loop is essential for developers. This is why serverless requires you to unit test religiously. If you already test-drive your code, that's a great fit. If you don't yet, it's a skill worth learning.
- Contrary to popular belief, a design phase is required before applying TDD. You need to know what you want to achieve, then test-drive it. TDD is not a silver bullet, just another learnable skill that you need to practice (and it's hard at first!) and that helps you avoid bugs in the long run.
- Events that trigger Lambda functions can be captured in logs and reused for testing. It's a very handy technique for discovering what a DynamoDB stream event for a particular table looks like, or exactly what information will come your way in an SQS message. Frequently, the first implementation of a function does nothing but log the event properly (with some sampling if scale is involved).
- Mocks are fine, but pure functions are better :)
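To illustrate the last two points, here is a minimal sketch of replaying a captured event against pure functions. The event is a trimmed, made-up SQS payload (a real one carries more fields), and the names Signup, parse_signups and paying_customers are hypothetical, invented only for this example.

```python
import json
from dataclasses import dataclass
from typing import List

# A trimmed SQS event of the shape Lambda delivers. In practice you would
# paste one captured from your function's logs; these values are made up.
CAPTURED_SQS_EVENT = {
    "Records": [
        {
            "messageId": "example-1",
            "eventSource": "aws:sqs",
            "body": json.dumps({"email": "Ada@Example.com", "plan": "pro"}),
        },
        {
            "messageId": "example-2",
            "eventSource": "aws:sqs",
            "body": json.dumps({"email": "bob@example.com", "plan": "free"}),
        },
    ]
}


@dataclass(frozen=True)
class Signup:
    email: str
    plan: str


def parse_signups(event: dict) -> List[Signup]:
    """Pure function: SQS event in, domain values out. No I/O, nothing to mock."""
    signups = []
    for record in event.get("Records", []):
        payload = json.loads(record["body"])
        signups.append(Signup(email=payload["email"].lower(), plan=payload["plan"]))
    return signups


def paying_customers(signups: List[Signup]) -> List[str]:
    """Pure business rule: emails of everyone who is not on the free plan."""
    return [s.email for s in signups if s.plan != "free"]


def handler(event, context=None):
    """Thin Lambda entry point: log the raw event, then delegate."""
    print(json.dumps(event))  # this is the log line you later capture and replay
    return paying_customers(parse_signups(event))
```

A unit test then simply calls parse_signups(CAPTURED_SQS_EVENT) and asserts on the result; nothing needs to be deployed or mocked.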
Some APIs are complicated enough that mocking them won't give you much confidence going forward. This is where integration tests run on your local machine against dockerized services may help. For example, there is an official DynamoDB image. In a similar fashion I was also able to integration test a service that talks to Elasticsearch.
Using hexagonal architecture is a must. Its main goal is to decouple application logic from technical details. Your data doesn't care whether it arrived via SQS or Kinesis. Plan your app properly, so that domain code is not scattered across Lambdas. Lambdas are just entry points that translate artificial events (file users.csv arrived on S3 bucket) into domain structs (User) and pass those as parameters to your domain services.
On top of that, your packages should be cohesive and loosely coupled, so that you can wire up all dependencies and run the code from every place that needs it: tests, migration pipelines (see below) or Lambda functions.
Separating application logic from the environment that the app runs on also helps with testability.
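A rough sketch of that separation, under stated assumptions: the CSV has name and email columns, the "valid email" rule is a placeholder, and UserRepository, register_users and make_s3_handler are hypothetical names made up for this example. Only the S3 event shape (Records, s3.bucket.name, s3.object.key) comes from AWS.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Iterable, List, Protocol


@dataclass(frozen=True)
class User:
    name: str
    email: str


class UserRepository(Protocol):
    """Port: the domain knows it can save users, not where they end up."""

    def save(self, user: User) -> None: ...


class InMemoryUserRepository:
    """Adapter for tests; a DynamoDB-backed adapter would expose the same method."""

    def __init__(self) -> None:
        self.saved: List[User] = []

    def save(self, user: User) -> None:
        self.saved.append(user)


def register_users(users: Iterable[User], repo: UserRepository) -> int:
    """Domain service: no knowledge of S3, SQS or Lambda."""
    count = 0
    for user in users:
        if "@" not in user.email:  # placeholder business rule
            continue
        repo.save(user)
        count += 1
    return count


def parse_users_csv(text: str) -> List[User]:
    """Translate file contents into domain structs."""
    reader = csv.DictReader(io.StringIO(text))
    return [User(name=row["name"], email=row["email"]) for row in reader]


def make_s3_handler(read_object: Callable[[str, str], str], repo: UserRepository):
    """Wire dependencies once and return the Lambda entry point."""

    def handler(event, context=None):
        total = 0
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            users = parse_users_csv(read_object(bucket, key))
            total += register_users(users, repo)
        return total

    return handler
```

In a test, read_object can be a lambda over a dict of fake files and repo an InMemoryUserRepository; in the deployed function the same handler would be wired with real S3 and database adapters.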
There are a few flavours of E2E tests:
- Automated backend tests: performed on stacks that you can deploy on demand from branch pipelines. After deployment you inject a lot of data into your system via various scripts and check that it behaves the way you expect. These kinds of tests help you make sure that all the required resources and IAM permissions are in place and that most of your business-essential features work as expected.
- Browser-based tests: these exercise your frontend and backend together and are best run on staging, with tools like Cypress.
- Synthetic tests: these make sure that nothing is broken on production by exercising some common happy paths. They should run at regular intervals, like every 5 minutes, so that you can be certain that users can log in and render crucial pages.
Every change to the system should be documented. This includes operational jobs such as sending one-time emails, fixing the data format in a table or requeuing SQS DLQ messages after a bug fix. This can be achieved with a migration mechanism: after deploying to an environment, your pipeline runs the scripts that have not been run there yet, so that the environment is up to date. I am using a command-line utility for this, which keeps the state of past migrations in a DynamoDB table.
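The core of such a mechanism is small. The sketch below is not the utility mentioned above, just an illustration of the idea: MigrationState, InMemoryState and run_pending are hypothetical names, and the in-memory state stands in for the DynamoDB table.

```python
from typing import Callable, Dict, List, Protocol, Set


class MigrationState(Protocol):
    """Where we remember which migrations already ran (e.g. a DynamoDB table)."""

    def has_run(self, name: str) -> bool: ...

    def mark_done(self, name: str) -> None: ...


class InMemoryState:
    """Stand-in for the real table, good enough for tests and for this sketch."""

    def __init__(self) -> None:
        self._done: Set[str] = set()

    def has_run(self, name: str) -> bool:
        return name in self._done

    def mark_done(self, name: str) -> None:
        self._done.add(name)


def run_pending(migrations: Dict[str, Callable[[], None]], state: MigrationState) -> List[str]:
    """Run every migration not yet recorded, in name order, and record it."""
    executed = []
    for name in sorted(migrations):
        if state.has_run(name):
            continue
        # If the script raises, it is not marked as done and will be retried
        # on the next pipeline run.
        migrations[name]()
        state.mark_done(name)
        executed.append(name)
    return executed
```

Prefixing script names with a sequence number (001_, 002_, ...) gives a stable order, and because the state lives per environment, a freshly created stack replays the whole history while staging and production only run what is new.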
Set up alarms for everything, then decide what is an important signal and what is noise. Learning how your system behaves may take some time, but alarms are a very important tool for uncovering bad assumptions and finding optimisations that can be applied to the system. Alarms may use math expressions and your custom metrics, as well as anomaly detection based on past traffic.
Sending metrics to CloudWatch via API calls can be a complicated endeavour (did you know there is a limit of 20 metrics per PutMetricData call?). It adds latency to your requests and complicates your code with new dependencies. Luckily, there is a format called EMF (Embedded Metric Format) that lets you simply log your metrics to standard output and shave off some precious milliseconds by avoiding HTTP calls.
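For a feel of what that looks like, here is a minimal sketch that builds a single-metric EMF log line by hand. The helper name emf_record, the namespace MyApp, the metric ProcessingTime and the Service dimension are all made up for the example; in real code you would more likely reach for one of the EMF client libraries.

```python
import json
import time
from typing import Dict, Optional


def emf_record(
    namespace: str,
    name: str,
    value: float,
    unit: str = "Count",
    dimensions: Optional[Dict[str, str]] = None,
    timestamp_ms: Optional[int] = None,
) -> str:
    """Build one EMF log line carrying a single metric."""
    dimensions = dimensions or {}
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # EMF expects epoch milliseconds
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set, listing the keys defined at the root below.
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [{"Name": name, "Unit": unit}],
                }
            ],
        },
        # Metric value and dimension values live at the root of the document.
        name: value,
        **dimensions,
    }
    return json.dumps(record)


def handler(event, context=None):
    started = time.perf_counter()
    # ... the actual work of the function would happen here ...
    elapsed_ms = (time.perf_counter() - started) * 1000
    # Printing is enough: in Lambda, stdout ends up in CloudWatch Logs.
    print(emf_record("MyApp", "ProcessingTime", elapsed_ms, "Milliseconds", {"Service": "orders"}))
```

The function makes no network call of its own; the metric is extracted from the log line on the CloudWatch side.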
The hard limit of 200 (now 500, yay :)) resources per CloudFormation stack is real, so plan ahead and split your services properly. The same goes for other services. There are plenty of traps: for example, a batch write request to DynamoDB can contain no more than 25 items. You will learn all those gotchas while working your way through the docs or when your code breaks in production. The best way to avoid the latter is to take testing really seriously. Knowing your limits is an important skill, and so is knowing your costs. Serverless total cost of ownership is favourable compared with on-prem solutions. Still, it's worth being aware of what the costs are and knowing tricks like optimising your function to take less than 100 ms (billing costs increase every 100 ms, see explanation here).
Oh, it turned out overnight that the 100 ms part is no longer true. Per-ms billing is rolling out as I am typing this.
Maciej Winnicki (07:32 AM - 01 Dec 2020): We knew that this day would come. twitter.com/astuyve/status…
AJ Stuyvenberg (@astuyve): BREAKING (fine it's a few hours old): Lambda just got per-ms billing! Check your logs, this is a huge savings: (alt: Duration: 35.05 ms, Billed Duration: 36 ms) @goserverless #Serverless https://t.co/hVs0PsDWeM
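To make the difference concrete, here is the rounding arithmetic for the 35.05 ms invocation from the tweet. The helper billed_ms is a hypothetical name and only models the rounding of the billed duration, not the price itself.

```python
import math


def billed_ms(duration_ms: float, granularity_ms: int = 1) -> int:
    """Round a duration up to the next multiple of the billing granularity."""
    return math.ceil(duration_ms / granularity_ms) * granularity_ms


old = billed_ms(35.05, granularity_ms=100)  # billed in 100 ms increments
new = billed_ms(35.05, granularity_ms=1)    # billed per millisecond
print(old, new)
```

With 100 ms increments that invocation is billed as 100 ms; with per-ms billing it is billed as 36 ms, which matches the log line in the tweet. Squeezing a function under a 100 ms boundary stops being a special trick: every millisecond saved now counts.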
That's my take on building reliable applications with the use of serverless technologies. I gathered this experience while working with the GOLD stack.
The above list is subjective and covers points I most frequently discuss with folks approaching serverless development for the first time. It is by no means complete. Did I miss anything important? What is your secret for building serverlessly?