Here I have collected a few tips that may be useful for those just beginning their AWS journey.
1. Account separation
It is worth separating production and dev accounts from the very beginning. It costs almost nothing, but if you do not do it initially, you will have to spend precious time on rework.
This is important to do for the following reasons:
- It allows you to limit people accessing the production environment
- It allows you to create a loosely coupled architecture where the prod environment doesn't depend on the testing environment
- You get separate billing, so you can see exactly how much you pay for production
- This is necessary for passing different security and compliance checks
2. Permissions
The question may quickly arise: how do you limit permissions for developers? For example:
- Admins assuming roles between accounts. By default, developers with administrator rights could assume a role from another account
- If you have not restricted access to SSM, decrypt methods, etc., developers can access various secrets stored in the environment (tokens, keys, etc.)
- Access to the Billing Dashboard
The simple solution: grant only the necessary permissions, but this is more complex than it looks because the scope of rights changes constantly and needs to be continuously reviewed. With this approach, developers will periodically lose access when they need to develop something new that was not considered earlier.
To solve this, you can grant a broader range of rights, e.g., up to Administrator on a test account, and then restrict the risky actions with permission boundaries.
I also recommend setting up MFA for all accounts to reduce the risk from leaked credentials.
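As a rough illustration of the permission-boundary idea above, the sketch below builds a policy document that allows everything and then explicitly denies the three risks listed earlier. The helper name `buildDevBoundary`, the sample account ID, and the exact action list are assumptions made for this example, not a complete or recommended policy.

```typescript
// Hypothetical helper: builds a permissions-boundary policy for a dev account.
// The action list is an illustrative assumption, not a complete policy.
interface Statement {
  Sid: string;
  Effect: "Allow" | "Deny";
  Action: string | string[];
  Resource?: string | string[];
  NotResource?: string | string[];
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: Statement[];
}

function buildDevBoundary(accountId: string): PolicyDocument {
  return {
    Version: "2012-10-17",
    Statement: [
      // Broad rights inside the dev account.
      { Sid: "AllowAll", Effect: "Allow", Action: "*", Resource: "*" },
      // Block assuming roles that live in any other account.
      {
        Sid: "DenyCrossAccountAssumeRole",
        Effect: "Deny",
        Action: "sts:AssumeRole",
        NotResource: `arn:aws:iam::${accountId}:role/*`,
      },
      // Block reading or decrypting stored secrets.
      {
        Sid: "DenySecretAccess",
        Effect: "Deny",
        Action: ["ssm:GetParameter*", "secretsmanager:GetSecretValue", "kms:Decrypt"],
        Resource: "*",
      },
      // Block billing access.
      { Sid: "DenyBilling", Effect: "Deny", Action: ["billing:*"], Resource: "*" },
    ],
  };
}

console.log(JSON.stringify(buildDevBoundary("111111111111"), null, 2));
```

Denying `kms:Decrypt` on every resource is deliberately blunt here; in practice you would probably scope it to the keys that protect real secrets.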
3. Automation
AWS makes it temptingly simple to set things up manually, especially at the beginning of a project.
But behind this simplicity lie many complexities:
- There is a very high probability of breaking something by accident
- It is difficult to reproduce the environment and, for example, create an identical staging environment for a developer
- It isn't easy to understand what the system looks like as a whole because this knowledge exists only in our heads
To solve this, I suggest using one of the popular solutions:
4. CloudFormation
The 3 solutions above use CloudFormation (CF). So if you use CF to manage infrastructure, you should familiarize yourself with its limitations.
The most frustrating one is the limit of 500 resources per stack, which you will hit very quickly if you prefer a serverless approach or have a large project.
So, to avoid this problem at a crucial moment, I suggest you think early about a strategy for splitting resources into nested stacks (e.g., with a plugin for Serverless Framework).
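To catch the limit before it bites, one option is a small check in CI that counts the resources in a generated template. The sketch below is one possible shape for such a check; the function name `checkStackUsage`, the `warnRatio` threshold, and the sample template are assumptions for illustration.

```typescript
// CloudFormation's per-stack resource limit mentioned above.
const RESOURCE_LIMIT = 500;

interface Template {
  Resources: Record<string, { Type: string }>;
}

interface StackUsage {
  count: number;
  remaining: number;
  nearLimit: boolean;
}

// Hypothetical check: reports how close a template is to the limit.
// `warnRatio` (default 0.8) is an arbitrary threshold chosen for this sketch.
function checkStackUsage(template: Template, warnRatio = 0.8): StackUsage {
  const count = Object.keys(template.Resources).length;
  return {
    count,
    remaining: RESOURCE_LIMIT - count,
    nearLimit: count >= RESOURCE_LIMIT * warnRatio,
  };
}

// Tiny made-up template, just to exercise the function.
const sampleTemplate: Template = {
  Resources: {
    ApiFunction: { Type: "AWS::Lambda::Function" },
    ApiRole: { Type: "AWS::IAM::Role" },
    UsersTable: { Type: "AWS::DynamoDB::Table" },
  },
};

console.log(checkStackUsage(sampleTemplate));
```

When `nearLimit` becomes true, that is a good moment to move a group of related resources into a nested stack rather than waiting for a failed deployment.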
5. Cognito
AWS Cognito is a powerful service for customer identity and access management.
But when setting it up, you can make two mistakes that can cause serious trouble in production:
- Case-sensitive user pool. If you create a User Pool programmatically, you must set the CaseSensitive parameter to false; otherwise, it will default to true. This complicates the sign-in process (e.g., User@test.co vs user@test.co). And most importantly, after creating the User Pool, you can no longer change this setting
- The same applies to mutable attributes. If, for example, you do not mark email as Mutable, users will not be able to change their email addresses, and most importantly, you cannot change the value of Mutable without recreating the User Pool
Recreating the User Pool means resetting user passwords, which would look suspicious to your users. Therefore, if you made such mistakes when configuring your User Pool and have already launched the product, you may face unexpected problems that could easily have been avoided at the start.
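Both pitfalls can be checked before the pool is ever created. The sketch below models a small subset of the CreateUserPool request (`UsernameConfiguration.CaseSensitive` and the `Schema` attribute flags) as plain TypeScript types and flags the two settings discussed above. The function name `findUserPoolPitfalls` and the pool name are hypothetical; this is a pre-deploy lint idea, not part of any AWS SDK.

```typescript
// Subset of the CreateUserPool request shape used in this sketch.
interface SchemaAttribute {
  Name: string;
  AttributeDataType: "String" | "Number" | "DateTime" | "Boolean";
  Mutable?: boolean;
  Required?: boolean;
}

interface CreateUserPoolParams {
  PoolName: string;
  UsernameConfiguration?: { CaseSensitive: boolean };
  Schema?: SchemaAttribute[];
}

// Hypothetical pre-deploy check for the two pitfalls described above.
function findUserPoolPitfalls(params: CreateUserPoolParams): string[] {
  const problems: string[] = [];
  // Require an explicit `false` instead of relying on the default.
  if (params.UsernameConfiguration?.CaseSensitive !== false) {
    problems.push("UsernameConfiguration.CaseSensitive is not explicitly false");
  }
  // Require every declared attribute to be explicitly mutable.
  for (const attr of params.Schema ?? []) {
    if (attr.Mutable !== true) {
      problems.push(`Attribute "${attr.Name}" is not explicitly marked Mutable`);
    }
  }
  return problems;
}

const poolParams: CreateUserPoolParams = {
  PoolName: "example-users", // placeholder name
  UsernameConfiguration: { CaseSensitive: false },
  Schema: [{ Name: "email", AttributeDataType: "String", Mutable: true, Required: true }],
};

console.log(findUserPoolPitfalls(poolParams));
```

The check is intentionally strict: it treats an omitted value as a problem, so the decision is always written down in the configuration.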
6. DynamoDB
With DynamoDB, unlike with the usual relational databases, the recommended approach is single-table design.
For Node.js users, I recommend looking at the OneTable library, which makes this approach easier to understand and the code much simpler.
Here are some benefits:
- Single-table design
- TypeScript support
- Local unit test support
- Simple API
- Migrations support
- Indexes & conditions support
- Built-in encryption support
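For readers new to the idea, here is a minimal plain-TypeScript sketch of what single-table design means in practice: different entity types share one table and are told apart by their key prefixes. This is not the OneTable API; the entity names, key formats, and the in-memory "query" are assumptions made purely for illustration.

```typescript
// Plain TypeScript sketch of single-table key design; this is NOT the OneTable API.
// Entity names and key formats are assumptions for illustration.
interface Item {
  pk: string;
  sk: string;
  [attribute: string]: unknown;
}

// A user and its orders share a partition key, so one query on
// pk = "USER#<id>" can return the user together with all of its orders.
function userItem(userId: string, email: string): Item {
  return { pk: `USER#${userId}`, sk: `PROFILE#${userId}`, email };
}

function orderItem(userId: string, orderId: string, total: number): Item {
  return { pk: `USER#${userId}`, sk: `ORDER#${orderId}`, total };
}

// In-memory stand-in for a query with a begins_with condition on the sort key.
function queryBySortKeyPrefix(table: Item[], pk: string, skPrefix: string): Item[] {
  return table.filter((item) => item.pk === pk && item.sk.startsWith(skPrefix));
}

const singleTable: Item[] = [
  userItem("42", "user@test.co"),
  orderItem("42", "1001", 25),
  orderItem("42", "1002", 40),
  orderItem("7", "1003", 10),
];

// Orders of user 42 only; the order of user 7 lives in another partition.
console.log(queryBySortKeyPrefix(singleTable, "USER#42", "ORDER#"));
```

A library such as OneTable takes care of building and parsing keys like these for you, which is where most of the simplification comes from.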
7. Backups
AWS has a built-in backup service. This is a very powerful service that, for example, allows you to create a particular type of backup without the possibility of deleting previously created snapshots.
But this service has its limitations. We found that when restoring a DynamoDB table, we had to re-enable DynamoDB Streams manually. This is a problem if, as in our case, everything is deployed through CloudFormation, which does not "like" manual changes.
Therefore, to avoid surprises at the most inconvenient moment, in addition to creating backups, you need to develop, test, and document the recovery procedure and periodically perform test recovery to ensure that the process is still up to date.
8. Lambda
If you are using Lambda, you have probably heard about the cold start problem.
There are different workarounds for it: warm-up requests, language-specific solutions, Lambda provisioned concurrency, etc.
But I want to focus on what is, in my opinion, the most critical factor besides the chosen language and memory size: the function bundle size.
Your bundle size directly affects the cold start time because AWS needs to prepare the code before it can be run. Even if you use Lambda Layers and extract dependencies, these Layers still need to be loaded.
So I advise minimizing the bundle size in every possible way. For example, we use Node.js and never bundle the AWS SDK because it is already present in the runtime environment. We build with esbuild because it showed the best results in terms of speed and size. As a result, we do not have a single function larger than 500 KB.
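A setup along those lines might look like the sketch below: a set of options of the kind accepted by esbuild's `build()` call, plus a small size-budget check that could run in CI. The entry point, output path, Node target, the `@aws-sdk/*` pattern (which assumes the v3 SDK shipped with the runtime), and the helper name `overBudget` are all assumptions for illustration.

```typescript
// Options of the kind one might pass to esbuild's build() call.
// Entry point, output path, and Node target are placeholder assumptions.
const buildOptions = {
  entryPoints: ["src/handler.ts"],
  outfile: "dist/handler.js",
  bundle: true,
  minify: true,
  platform: "node",
  target: "node18",
  // Leave the SDK out of the bundle, assuming the runtime already provides it.
  external: ["@aws-sdk/*"],
};

// Size budget taken from the 500 KB figure mentioned above.
const BUDGET_BYTES = 500 * 1024;

// Hypothetical CI check: returns the names of bundles that exceed the budget.
function overBudget(sizes: Record<string, number>, budget: number = BUDGET_BYTES): string[] {
  return Object.keys(sizes).filter((name) => sizes[name] > budget);
}

// Made-up sizes in bytes, just to exercise the check.
const bundleSizes = { "handler.js": 180000, "report.js": 900000 };

console.log(buildOptions.external, overBudget(bundleSizes));
```

Failing the build when `overBudget` returns a non-empty list keeps bundle growth visible instead of letting it creep into cold start times.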
9. Alarms
The alarm system is incredibly useful for detecting suspicious activity and important alerts.
For example, we set up alarms when various errors occur in the system:
- Lambda errors & throttling
- Errors in logs (e.g. from docker)
- WAF alarms
- Billing alarms (spending more than a certain amount in $)
- Custom metrics
We then connected SNS to deliver these alarms to a dedicated Slack channel.
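As an example of the first item in the list, the sketch below generates CloudFormation-style `AWS::CloudWatch::Alarm` resources for a Lambda function's `Errors` and `Throttles` metrics and points them at an SNS topic. The function name, topic ARN, one-minute period, and threshold of 1 are placeholder assumptions; tune them to your own traffic.

```typescript
// Shape of the alarm resource emitted by this sketch.
interface AlarmResource {
  Type: "AWS::CloudWatch::Alarm";
  Properties: {
    AlarmName: string;
    Namespace: string;
    MetricName: string;
    Dimensions: { Name: string; Value: string }[];
    Statistic: string;
    Period: number;
    EvaluationPeriods: number;
    Threshold: number;
    ComparisonOperator: string;
    TreatMissingData: string;
    AlarmActions: string[];
  };
}

// Hypothetical helper: one alarm per Lambda metric, notifying an SNS topic.
function lambdaAlarm(
  functionName: string,
  metric: "Errors" | "Throttles",
  topicArn: string,
): AlarmResource {
  return {
    Type: "AWS::CloudWatch::Alarm",
    Properties: {
      AlarmName: `${functionName}-${metric}`,
      Namespace: "AWS/Lambda",
      MetricName: metric,
      Dimensions: [{ Name: "FunctionName", Value: functionName }],
      Statistic: "Sum",
      Period: 60, // seconds
      EvaluationPeriods: 1,
      Threshold: 1, // fire on the first error or throttle in a period
      ComparisonOperator: "GreaterThanOrEqualToThreshold",
      TreatMissingData: "notBreaching", // no invocations is not an error
      AlarmActions: [topicArn],
    },
  };
}

// Placeholder function name and topic ARN.
const alarmTopic = "arn:aws:sns:us-east-1:111111111111:alerts";
const alarmResources = {
  ApiErrorsAlarm: lambdaAlarm("api-handler", "Errors", alarmTopic),
  ApiThrottlesAlarm: lambdaAlarm("api-handler", "Throttles", alarmTopic),
};

console.log(JSON.stringify(alarmResources, null, 2));
```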
10. Audit logs
You can enable activity logging via CloudTrail. It costs almost nothing but can be indispensable when analyzing suspicious activity or determining why something is not working in the system.
We had an incident with user registration in Cognito, and this logging helped to sort out the real reasons.
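To give a feel for this kind of investigation, the sketch below filters a list of CloudTrail-style records down to failed calls against one service. The field names (`eventTime`, `eventSource`, `eventName`, `errorCode`) follow the CloudTrail record format, but the helper `failedCalls` and the sample records are made up for illustration and do not reproduce our actual incident.

```typescript
// Subset of the CloudTrail record fields used in this sketch.
interface TrailRecord {
  eventTime: string;
  eventSource: string;
  eventName: string;
  errorCode?: string;
  errorMessage?: string;
}

// Hypothetical helper: keeps only the failed calls made to one service.
function failedCalls(records: TrailRecord[], eventSource: string): TrailRecord[] {
  return records.filter(
    (record) => record.eventSource === eventSource && record.errorCode !== undefined,
  );
}

// Made-up sample records, not real log data.
const sampleRecords: TrailRecord[] = [
  { eventTime: "2023-01-01T10:00:00Z", eventSource: "cognito-idp.amazonaws.com", eventName: "SignUp" },
  {
    eventTime: "2023-01-01T10:01:00Z",
    eventSource: "cognito-idp.amazonaws.com",
    eventName: "SignUp",
    errorCode: "InvalidParameterException",
  },
  {
    eventTime: "2023-01-01T10:02:00Z",
    eventSource: "s3.amazonaws.com",
    eventName: "GetObject",
    errorCode: "AccessDenied",
  },
];

for (const record of failedCalls(sampleRecords, "cognito-idp.amazonaws.com")) {
  console.log(record.eventTime, record.eventName, record.errorCode);
}
```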
Top comments (2)
How do you recommend transferring changes from the dev account to prod? Via the automation from section 3?
Exactly. All infrastructure should be deployed via code. After testing the creation or update of a resource in dev, you can run the same scripts against prod.