9 Things I Wish I Knew When I Started Managing AWS Workloads 🤷

#aws #devops #beginners #architecture

Around 3 years ago, I started managing the AWS workload of a medium-sized enterprise. I learned a lot on the job and made my fair share of mistakes. In this article, I will share 9 things I wish I knew when I started managing AWS workloads.

1 | Studying deeply always pays

AWS has a lot of services, and it would be almost impossible to study all of them. The key is to study deeply the handful of services that your company actually uses. In doing this, you will discover features that help you do certain tasks faster and more securely.

The classic example is creating a copy of the database. If you didn't know much about AWS, you would probably generate an SQL dump from your database, create a new RDS instance and restore the dump to the new database. The whole process can take hours for a production database. But there is a much faster way in AWS: you can create a snapshot of your RDS instance and restore a new database instance from that snapshot. You can also copy the snapshot to another region, or share it with other AWS accounts, and restore it there.
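As a rough sketch of that snapshot-and-restore flow in Python: the function below takes an RDS client as an argument (for example, one created with boto3.client("rds")). The function name and the identifiers are made up for illustration, and a real restore usually needs more parameters, such as the instance class, subnet group and security groups.

```python
def clone_database(rds, source_id, snapshot_id, clone_id):
    """Snapshot an RDS instance, then restore the snapshot as a new instance.

    `rds` is assumed to be a boto3 RDS client, e.g. boto3.client("rds").
    All identifiers are hypothetical names supplied by the caller.
    """
    # 1. Take a manual snapshot of the source instance.
    rds.create_db_snapshot(
        DBInstanceIdentifier=source_id,
        DBSnapshotIdentifier=snapshot_id,
    )
    # 2. Block until the snapshot is ready to be restored.
    rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier=snapshot_id)
    # 3. Create a brand-new instance from the snapshot.
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=clone_id,
        DBSnapshotIdentifier=snapshot_id,
    )
    return clone_id
```

Passing the client in, instead of creating it inside the function, makes it easy to try the flow against a stub before pointing it at a real account.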

Knowing your services deeply also makes debugging problems faster.

AWS often doesn't give you an obvious error when you misconfigure something. The service just stops working, and it is up to you to retrace every step to make sure you set everything up correctly. A misconfigured security group or a wrong target group setup can take hours to debug. If you don't know how AWS works, the only way to debug is to adjust the configuration of each component manually, one at a time, until you find the one that makes the entire setup work.

2 | Always plan for failure

When making changes to your infrastructure, always plan for failure. Including failure scenarios when planning your deployments not only saves you when something goes wrong, it also reduces the chance of failure (because you have already thought through what could go wrong).

If you're going to make a major change in your RDS/EC2 instance, take a backup of it first (via RDS snapshots for databases, or EBS snapshots/AMIs for EC2 instances). If your change is big enough, schedule a maintenance window where all your systems are down, take snapshots of your RDS/EC2 instances, then apply the changes you're planning to make. If anything goes wrong, since all your systems are down anyway, you can just restore your RDS/EC2 instance from the snapshot and everything will be back to normal. Another benefit of a maintenance window is that, should anything go wrong, the data restored from the snapshot is not stale (since no changes are made during the downtime), so no data is lost.
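The backup-first habit can be sketched as a small Python helper. This is only an illustration of the pattern: the three callables are hypothetical placeholders for whatever snapshot, change and restore steps apply to your setup.

```python
def apply_with_rollback(take_snapshot, apply_change, restore):
    """Take a backup, try a change, and restore the backup if the change fails.

    All three arguments are hypothetical callables you would supply:
    take_snapshot() returns a snapshot ID, apply_change() raises on failure,
    and restore(snapshot_id) rolls back to that snapshot.
    """
    snapshot_id = take_snapshot()  # always back up before touching anything
    try:
        apply_change()
    except Exception:
        restore(snapshot_id)  # the failure path was planned in advance
        return False
    return True
```

The function returns True when the change went through and False when it had to roll back, so the caller can decide whether to end the maintenance window or investigate.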

3 | Be creative in double-checking your work

This goes without saying, of course. But when I started managing AWS workloads, I didn't know how to test out the infrastructure changes that I was deploying. I just knew what the production setup was, and how to change it. I didn't have the right mindset on how making big changes on the cloud should go.

The typical way to test is to try out what you are about to do in a staging environment. The staging environment is a copy of your production environment, but in another region (or another AWS account). If it works there, it should go fine on production. Most of the time, this kind of testing is adequate. But the production and staging environments are never exactly the same, so we might need to "test on production" (okay, I know that sounds dangerous but hear me out!!).

Testing via the blue/green technique 🔵🟢

You can do this by using the "blue/green" deployment technique.

🔵BLUE ENVIRONMENT - The blue environment is your existing environment, where your existing EC2 instances are. This could be our main site that our customers visit. As an example, let's call this "shop.example.com".

🟢GREEN ENVIRONMENT - To test our changes, we could provision another set of EC2 instances with our changes deployed. These EC2 instances are connected to all production assets (e.g., database, cache service, etc.), so this is exactly what our site would look like after we deploy our changes. We can make this site accessible via "test-production.example.com". Since the hostname is different, you can use the green environment for testing without affecting the main site (the blue environment).

Of course, doing this is not without dangers. If you run non-backward-compatible database migrations on your green environment (e.g., deleting/renaming columns, removing database constraints), you will bring down functionality on your main site (the blue environment), because both environments share the same database. But understanding that this could be an option for you will allow you to be more rigorous when testing big-ticket changes.
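One way to catch risky migrations before they reach the shared database is a simple pre-flight check. The sketch below is a heuristic I am assuming for illustration, not an exhaustive safety check: it flags SQL statements that drop or rename things, which are the kinds of changes mentioned above.

```python
import re

# Statement patterns that commonly break an older app version still running
# against the same database. This list is a heuristic, not an exhaustive check.
RISKY_PATTERNS = [
    r"\bDROP\s+(TABLE|COLUMN)\b",
    r"\bRENAME\s+(COLUMN|TO)\b",
    r"\bDROP\s+CONSTRAINT\b",
]


def risky_statements(migration_sql):
    """Return the statements in a migration that match a risky pattern."""
    # Naive split on ';' is good enough for simple migration files.
    statements = [s.strip() for s in migration_sql.split(";") if s.strip()]
    return [
        s for s in statements
        if any(re.search(p, s, re.IGNORECASE) for p in RISKY_PATTERNS)
    ]
```

An empty result does not prove a migration is safe for the blue environment; it only means none of the listed patterns matched.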

4 | Always document yourself ✍🏽

Working as an infrastructure engineer, most of the work that you do happens during setup. When done right, maintenance after that should be a breeze. I usually make a lot of mistakes during setup, so I have to repeat the whole process all over again. In each iteration, I document every step I took and every mistake I made. That way, I don't repeat the same mistakes in the next iterations. I also get to keep a cheat sheet of the setup. This usually comes in handy when I have to apply the same setup on new projects.

However, the ultimate self-documenting workflow is Infrastructure as Code (IaC). Instead of clicking around the AWS Management Console and creating resources by hand, you can create a CloudFormation template (a JSON or YAML file describing the AWS resources you want to provision), upload that to AWS, and have CloudFormation provision the resources for you.

Doing infrastructure as code can take a lot of effort at the start. But it pays off because 1) your infrastructure can be pushed to a Git repository, 2) changes can be peer-reviewed and logged, and 3) your entire infrastructure can be replicated with just a click of a button.
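To give a feel for what such a template contains, here is a minimal sketch that builds one in Python and serializes it to JSON. The logical ID "WebSecurityGroup", the description text and the default CIDR are assumptions made for this example. Because no VpcId is set, the security group would land in the account's default VPC.

```python
import json


def build_template(allowed_cidr="0.0.0.0/0"):
    """Return a minimal CloudFormation template as a Python dict."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Security group that allows inbound HTTPS",
        "Resources": {
            "WebSecurityGroup": {  # logical ID, chosen for this example
                "Type": "AWS::EC2::SecurityGroup",
                "Properties": {
                    "GroupDescription": "Allow inbound HTTPS",
                    "SecurityGroupIngress": [
                        {
                            "IpProtocol": "tcp",
                            "FromPort": 443,
                            "ToPort": 443,
                            "CidrIp": allowed_cidr,
                        }
                    ],
                },
            }
        },
    }


def render(template):
    """Serialize the template to JSON text that can be committed to Git."""
    return json.dumps(template, indent=2, sort_keys=True)
```

The rendered JSON is what you would commit, review and hand to CloudFormation. Sorting the keys keeps diffs between versions small and readable.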

5 | Use managed services whenever applicable

For a bootstrapped start-up company, time is essential ⏳. If you have to maintain a PostgreSQL server hosted on an EC2 instance (patching its OS every month, for instance), that takes valuable time away from other things you could have otherwise been doing. Besides, there's always the risk that you'd forget to do your typical maintenance jobs and the site goes down in a bleep.

RDS is an indispensable service for hosting your PostgreSQL workloads. The added reliability and reduced maintenance time more than pay for the added cost of hosting your databases in RDS.

For a medium-sized enterprise, using managed services reduces the risk of downtime and increases the reliability of the individual components that make up the system. It allows your team to focus on what truly matters - deploying features that deliver value to your customers (and indirectly, to your bottom line).

6 | Follow the two-tier architecture

It is best practice to divide your network into tiers: a public tier and a private tier. This is usually implemented by dividing your VPC into public and private subnets.

Public Tier

A public tier is a group of subnets that is open to the internet. Instances and resources here have a public IP address that entities on the open internet can connect to. Since they are exposed to the open internet, they are vulnerable to attacks. The more resources you add in the public tier, the more potential entrances you give hackers. This is why it is generally prudent to "reduce the surface area" and limit the resources you add here.

For a typical web-based or API-based AWS environment, the only two resources that should be in the public tier are the Application Load Balancer (ALB) and the bastion host.

The bastion host is typically a hardened EC2 instance that you can SSH into. Once inside the bastion host, you can jump into the instances in the private subnet by running the SSH command again. The bastion host is typically hardened by creating it from a hardened Amazon Machine Image (AMI), such as the ones published by the Center for Internet Security (CIS).

The ALB allows you to host multiple applications with just one load balancer. For example, you have two websites: "shop.example.com" and "staff.example.com". You can add a rule wherein if the ALB sees that the request has a hostname of "shop.example.com", it routes the request to a group of instances that serve the traffic for "shop.example.com". The same thing goes for "staff.example.com".

The instances serving traffic for both applications (shop and staff) are inside the private subnet. This way, we can just keep on adding rules in the ALB to support more and more websites. The instances supporting each website are kept in the private subnet, where they can't be accessed directly (only indirectly through the load balancer).
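The host-based rule described above can be sketched as the keyword arguments you would pass to the ELBv2 create_rule call. The helper name is made up for this example, and the ARNs are placeholders you would replace with your own listener and target group.

```python
def host_rule(listener_arn, hostname, target_group_arn, priority):
    """Build keyword arguments for an ELBv2 `create_rule` call.

    The ARNs passed in are placeholders you would replace with real ones.
    """
    return {
        "ListenerArn": listener_arn,
        "Priority": priority,  # must be unique per listener
        "Conditions": [
            # Match on the Host header of the incoming request.
            {"Field": "host-header", "HostHeaderConfig": {"Values": [hostname]}}
        ],
        # Forward matching requests to the instances behind this target group.
        "Actions": [{"Type": "forward", "TargetGroupArn": target_group_arn}],
    }
```

With boto3 installed, the rule could then be created with boto3.client("elbv2").create_rule(**host_rule(...)), once for "shop.example.com" and once for "staff.example.com", each with its own priority and target group.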

Private Tier

A private tier is a group of private subnets that cannot be accessed directly from the open internet. Most of your resources should be here. If you need to SSH into them, SSH into the bastion host first, then SSH into the instance. If you need them to be accessible via HTTP/HTTPS, you can use the ALB.

You can learn more about the two-tier architecture here.

7 | Always follow a naming convention

When creating resources in AWS, the most common field among all resources is the name field. AWS asks you to give an identifier to almost every resource, from your RDS databases to your snapshots. It’s everywhere. When I first started, I just used random names to make things work. It didn’t matter at first, but when our infrastructure grew from just a few EC2/RDS instances to a couple of dozen, I found it hard to navigate through the AWS console.

This becomes more annoying during outages when there’s time pressure to bring the site back up online immediately. Decide what kind of information you'd want your names to include, and stick to that naming convention.

As an example, for EC2 Amazon Machine Images (AMIs), my naming convention is <instance-name>-<date-and-time>. So for the AMI of the instance named "app1", my AMI's name would be app1-jan15-2020-430pm.
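A convention is easiest to stick to when a small helper generates the names for you. Here is a sketch in Python that follows the AMI format above. The function name is my own, and padding the day and minute to two digits is an assumption on my part.

```python
from datetime import datetime

# Fixed month abbreviations, so the result does not depend on the system locale.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def ami_name(instance_name, taken_at):
    """Format an AMI name as <instance-name>-<date-and-time>."""
    hour = taken_at.hour % 12 or 12  # 12-hour clock, so 0 becomes 12
    suffix = "am" if taken_at.hour < 12 else "pm"
    return (
        f"{instance_name}-{MONTHS[taken_at.month - 1]}{taken_at.day:02d}"
        f"-{taken_at.year}-{hour}{taken_at.minute:02d}{suffix}"
    )
```

For example, ami_name("app1", datetime(2020, 1, 15, 16, 30)) returns "app1-jan15-2020-430pm".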

8 | Use AWS Config

In a very small startup, you are probably going to be the only one handling the AWS account. As your company gets bigger, it would become impossible to keep the AWS account just to yourself. You would have to give people access, one way or another. These people can have varying degrees of knowledge of the best practices in AWS. For instance, they can create security groups with all ports open to the world, or they can create unencrypted databases.

Using AWS Config allows you to monitor how the configuration of every resource changes over time (and who made the change). You can also choose from dozens of predefined rules to check your environment against. So the next time someone creates an unencrypted database, you'd know. You can learn more about AWS Config in my other post.
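As an illustration, enabling one of those predefined (AWS-managed) rules comes down to a single API call. The sketch below builds the arguments for AWS Config's put_config_rule call using the managed rule that checks RDS storage encryption. The helper name and the rule name are assumptions for this example, and it presumes AWS Config recording is already turned on in the account.

```python
def encrypted_rds_rule(rule_name="rds-storage-encrypted"):
    """Build keyword arguments for AWS Config's `put_config_rule` call."""
    return {
        "ConfigRule": {
            "ConfigRuleName": rule_name,  # the name is our own choice
            "Source": {
                "Owner": "AWS",  # an AWS-managed rule, not a custom Lambda rule
                "SourceIdentifier": "RDS_STORAGE_ENCRYPTED",
            },
        }
    }
```

With boto3 installed, this could be applied with boto3.client("config").put_config_rule(**encrypted_rds_rule()).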

9 | Automate whenever you can, set alarms whenever you can't ⏰

AWS has a lot of features that can automate certain tasks we do in our environment. Here are some of the more underrated AWS services that can help you automate different aspects of your AWS workload:

Auto Scaling: This allows your fleet of EC2 instances to scale from 1 to a few thousand in just a few minutes. It can be configured to handle a sudden uptick in traffic by adding instances (scaling out) when the average CPU utilization of your EC2 instances passes a certain threshold.
RDS Storage Auto Scaling: This allows your RDS storage to scale up when your workload needs more storage space. (Unfortunately, it doesn't scale the storage back down when your database underutilizes the space.)
Systems Manager: This automates a lot of the tasks you commonly do in AWS, like creating snapshots of your EBS volumes or running scripts to keep your EC2 instances patched with the latest security updates.
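For the Auto Scaling case, a CPU-based policy can be sketched as the arguments to the Auto Scaling put_scaling_policy call. The helper name, the policy name and the 60% default target are assumptions made for illustration. The group itself is assumed to exist already.

```python
def cpu_target_tracking_policy(group_name, target_cpu_percent=60.0):
    """Build keyword arguments for the Auto Scaling `put_scaling_policy` call."""
    return {
        "AutoScalingGroupName": group_name,
        "PolicyName": f"{group_name}-cpu-target",  # name chosen for this example
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            # Track the average CPU utilization across the whole group.
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            # Instances are added or removed to keep the average near this value.
            "TargetValue": float(target_cpu_percent),
        },
    }
```

With boto3 installed, this could be applied with boto3.client("autoscaling").put_scaling_policy(**cpu_target_tracking_policy("app1-asg")), where "app1-asg" is a placeholder group name.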

However, there are times when you won't be able to automate certain tasks. Maybe AWS doesn't have the feature for it yet, or you don't have enough time to implement a big change to make automation happen. But that's okay - what matters is you don't leave them uncovered and hope that you'd be the first one to know when your site goes down because of it.

For these times, you can create alarms to inform you and your team of abnormalities in your AWS account. The easiest way to do this is to create alarms in Amazon CloudWatch. You specify a threshold value for a specific metric (e.g., CPU utilization of your prod database) and when the metric breaches the threshold for a specific amount of time (e.g., above 80% for 3 minutes), you can have an email sent out to your team (for example, through an SNS topic) asking them to fix it.
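The "80% for 3 minutes" example could look roughly like this as arguments to CloudWatch's put_metric_alarm call. The helper name, the alarm name and the SNS topic ARN are placeholders. The email itself would come from subscribing your team's address to that SNS topic.

```python
def db_cpu_alarm(db_instance_id, sns_topic_arn, threshold=80.0, minutes=3):
    """Build keyword arguments for CloudWatch's `put_metric_alarm` call."""
    return {
        "AlarmName": f"{db_instance_id}-high-cpu",  # name chosen for this example
        "Namespace": "AWS/RDS",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",
        "Period": 60,  # evaluate the metric in one-minute periods
        "EvaluationPeriods": minutes,  # breach must last this many periods
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notify this SNS topic on alarm
    }
```

With boto3 installed, the alarm could be created with boto3.client("cloudwatch").put_metric_alarm(**db_cpu_alarm("prod-db", topic_arn)), where "prod-db" and topic_arn stand in for your own values.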