Guille Ojeda for AWS Community Builders

Posted on Aug 18, 2023 • Edited on Aug 19, 2023 • Originally published at newsletter.simpleaws.dev

AWS Lambda in a VPC

#aws #cloud #serverless #networking

Note: This content was originally published at the Simple AWS newsletter.

Use case: Secure access to RDS and Secrets Manager from a Lambda function

Scenario

You've deployed a database in RDS and a Lambda function that needs to talk to that database. You've read our previous issue with 20 tips for Lambda, and you learned to put the database password in Secrets Manager. Then you put the Lambda in the database VPC, and everything broke.

Services

Lambda: Serverless compute. We won't dive deep, we've done that already.
RDS: Managed Relational Database Service. We've also talked about this in a previous issue.
Secrets Manager: A service where you store encrypted strings (like passwords) and access them securely.
VPC: A virtual network where your RDS instance is placed. Here's where we'll put our focus.

Solution

Note: If you're actually facing this scenario, these steps will cause downtime. I wrote them in this order to make it easier to understand the final solution, but if you need to fix this specific problem, let me know and I'll help you.

What it looks like

Step by step instructions

First, we're going to “put the Lambda in the VPC”. To do that, go to the Lambda service, choose your Lambda, click on Configuration, on the left click VPC, and click Edit. Select your VPC, pick a few subnets, a Security Group (we'll get back to security groups) and click Save.
Our Lambda function still runs on AWS's shared servers used for Lambda, but it now has an IP address in that VPC's address space. This is important because now our Lambda function can access the RDS instance by sending packets through the VPC instead of the public internet (faster and more secure).
Now we broke internet access for our Lambda! It turns out Lambdas that are “not in a VPC” actually reside in a “secret” VPC with internet access, and when we moved our Lambda to our VPC we broke that. If our Lambda needs to access the internet, here's how to fix that problem. If all our Lambda needs to access is other AWS services, don't bother with internet access, read on.
We also broke our Lambda function's access to Secrets Manager, and we definitely care about that. To fix it, we're going to add a VPC Interface Endpoint, which is like giving an AWS service a private IP address in our VPC, so we can call that service on that address (since we can't access the public one, because we broke internet access).
Go to the VPC console, select "Endpoints" and click "Create Endpoint". Choose the "com.amazonaws.{region}.secretsmanager" service name, select your VPC, the subnets that you picked for the Lambda function, choose a security group (we'll get back to security groups) and click "Create endpoint".
We've got everything inside the same VPC; now we just need to allow traffic to reach the different components (while blocking all other traffic). Security Groups are these really cool firewalls that'll let us do that.
First, create a security group called SecretsManagerSG and associate it with the VPC Endpoint, another one called LambdaSG and associate it with the Lambda function, and another one called DatabaseSG and associate it with the RDS instance.
Next, edit the DatabaseSG to allow inbound traffic to the database port (5432, 3306, etc) originating from the LambdaSG.
Finally, edit the SecretsManagerSG to allow inbound traffic originating from the LambdaSG. Allowing all protocols and ports works, but the endpoint only needs HTTPS (TCP port 443), so you can restrict the rule to that.
That's it for the networking part. Now we just need to configure the proper permissions using IAM Roles. Go to the IAM service, click “Roles”, and click “Create Role”. Choose "AWS service" as the trusted entity and select "Lambda" as the service that will use this role. Click "Next: Permissions", click "Create policy", and use the following sample policy (replace the values between {} with your actual values). Save the policy, save the role with that policy associated, and configure the Lambda function to use that IAM Role.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "rds-db:connect"
            ],
            "Resource": [
                "arn:aws:rds-db:{region}:{account_id}:dbuser:{db-resource-id}/{db-user-name}"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": [
                "arn:aws:secretsmanager:{region}:{account_id}:secret:{secret-name}-??????"
            ]
        }
    ]
}
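
The console steps above can also be scripted. Here's a minimal sketch using boto3, assuming the three security groups already exist and that you have their IDs. The function names (`lambda_vpc_config`, `secrets_manager_endpoint`, `allow_from_sg`, `apply_network_setup`) are made up for this example. One deliberate difference from the steps: the endpoint rule here opens only TCP 443 instead of all ports, and private DNS is assumed to be enabled on the endpoint.

```python
def lambda_vpc_config(subnet_ids, lambda_sg_id):
    """VpcConfig argument for lambda.update_function_configuration."""
    return {"SubnetIds": list(subnet_ids), "SecurityGroupIds": [lambda_sg_id]}


def secrets_manager_endpoint(region, vpc_id, subnet_ids, endpoint_sg_id):
    """Arguments for ec2.create_vpc_endpoint (Interface Endpoint)."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.secretsmanager",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": [endpoint_sg_id],
        # Assumption: private DNS on, so the regular Secrets Manager hostname
        # resolves to the endpoint's private IP addresses.
        "PrivateDnsEnabled": True,
    }


def allow_from_sg(target_sg_id, source_sg_id, port):
    """Arguments for ec2.authorize_security_group_ingress: TCP port from another SG."""
    return {
        "GroupId": target_sg_id,
        "IpPermissions": [{
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "UserIdGroupPairs": [{"GroupId": source_sg_id}],
        }],
    }


def apply_network_setup(function_name, region, vpc_id, subnet_ids,
                        lambda_sg, db_sg, endpoint_sg, db_port):
    """Send the requests. Needs boto3 and AWS credentials; never runs on import."""
    import boto3  # imported here so the builders above work without it

    ec2 = boto3.client("ec2", region_name=region)
    lam = boto3.client("lambda", region_name=region)
    lam.update_function_configuration(
        FunctionName=function_name,
        VpcConfig=lambda_vpc_config(subnet_ids, lambda_sg),
    )
    ec2.create_vpc_endpoint(
        **secrets_manager_endpoint(region, vpc_id, subnet_ids, endpoint_sg)
    )
    # DatabaseSG: database port, only from LambdaSG.
    ec2.authorize_security_group_ingress(**allow_from_sg(db_sg, lambda_sg, db_port))
    # SecretsManagerSG: HTTPS only, only from LambdaSG.
    ec2.authorize_security_group_ingress(**allow_from_sg(endpoint_sg, lambda_sg, 443))
```

The builders are plain functions that return dictionaries, so you can print and review each request before calling `apply_network_setup` against a real account.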

Discussion

Do we really need all of that? Yes, we do. Here's why:

Putting the Lambda in the VPC: For performance and security. If you don't put the Lambda in a VPC, it can only connect to your RDS instance through the public internet. On one hand, that's a lot more network hops, which makes it slower. On the other hand, you'd need to open up your database to the internet at the network level, which is less secure. The password would still protect it, and while that can be enough in most cases, you don't want to risk it. Adding multiple defense layers (network and authentication in this case) is a strategy called defense in depth.
Adding a VPC Endpoint: This one's pretty similar, but in reverse. Secrets Manager is a public service, which means it resides in the AWS shared VPC (as opposed to residing in a VPC you own, like an EC2 or RDS instance). If you want to connect to a public service, you can do it from another public service (e.g. Lambda when the function is not in a VPC), through the internet, or through a VPC Endpoint in your VPC. The first one's no longer an option (the Lambda function is either in your VPC or in the shared VPC, it can't be in both). You could add a route to a NAT Gateway in your subnets, so the Lambda function can reach the internet (in fact, you need to do it if your function needs internet access), but the traffic would go through the internet (bad for performance and security). Instead, you can give Secrets Manager a private IP address in your VPC (that's essentially what a VPC Endpoint does), and have the connection go through your VPC and AWS's internal network.
Security Groups: This is essentially about defense in depth. By restricting the inbound traffic on the RDS Security Group you're preventing anyone other than your Lambda function from connecting to the database. That means both evil hackers from the internet, and even other resources such as EC2 instances that are inside your VPC. The reasoning behind this is that if you have multiple resources (e.g. multiple Lambdas or instances), it's possible that one of them gets compromised, and you want to limit what an evil hacker with access to that resource can do. Note that you're not actually restricting traffic to that specific Lambda function, but rather to any resource with the LambdaSG security group.
IAM Roles: Access to AWS services is forbidden by default; you need to allow it explicitly. You could add an IAM Role to your Lambda function with a policy that allows all actions on all resources, but that would mean that if the function is compromised (e.g. someone steals your GitHub password and pushes their own code to the function) they'd have full access to your AWS account. You can't just deny all actions though, since the function needs some permissions. So, you figure out the minimum permissions that the function needs, review and refine that, and craft an IAM policy that grants only those permissions (this is called minimum permissions, also known as least privilege).
In this case (and in most), the process looks something like this: “The function needs access to secrets manager and RDS" → “It doesn't need full access, it just needs to read secrets and connect to a database" → “It only needs to read this specific secret, and connect to this specific database" → search for the permissions needed to read a secret and to connect to RDS → write policy.

This issue is based on a real case of one of my clients, from my consulting service. It doesn't get any more real than this!

Best Practices

Operational Excellence

Use VPC Flow Logs: Enable VPC Flow Logs to monitor network traffic in your VPC and identify potential security issues.
Monitor Service Quotas: Specifically, you need to keep an eye on the ENI quota, since a Lambda function in a VPC uses one ENI per unique combination of subnet and security group. In this case we're only creating one VPC Endpoint because we're only accessing one service, but if you create several, there's a low quota for that as well.
Single responsibility services: I'm building this with just one Lambda function for simplicity, assuming your system does exactly one thing. In reality, your system is going to do multiple things, which should be split across multiple services (and multiple Lambda functions). I don't want to dive into service design in this issue, because it would move the focus out of the technical aspects of Lambda in a VPC. But if you're implementing this for real, don't put all your code in a single Lambda.

Security

Configure CloudTrail and GuardDuty: CloudTrail logs every action on the AWS API, and GuardDuty scans the CloudTrail logs for any suspicious activity. Here's a step by step guide to configure it.
Add Resource Policies: Some resources, such as Secrets Manager secrets, can have policies that determine who can access them. Think S3 Bucket Policies, these are the same but for Secrets Manager. Consider who needs access (probably just your Lambda and the password rotation Lambda) and write a policy that restricts access.
Add VPC Endpoint Policies: Same idea as resource policies, but for the VPC Endpoint (instead of the resource that is accessed through that endpoint). It's more common to use them when using VPC Endpoints to access resources that don't support resource policies, but in this case they add another layer of defense (defense in depth). Specifically, you're defending against your own potential mistakes in the IAM Policy of the Lambda function and in the resource policy of the secret.
Use Private Subnets: Public subnet = route to/from the internet, private subnet = no route to/from the internet (this is a route to the Internet Gateway in the subnet's Route Table, and you can change it). You're not going to access your database from the internet, so you should outright remove that possibility by placing it in a private subnet. You were already blocking access from the internet in your security group, but again, we're doing defense in depth (i.e. protecting ourselves from our own potential mistakes).
Regularly rotate database passwords: When was the last time you rotated a password? Probably way too long ago. Don't risk your data like that. You can even automate this with a Lambda function.
Regularly review and update IAM policies: You should write your policies with minimum permissions, and I trust you'll do your best. But we all make mistakes. If you're missing a permission, you'll notice right away, because things won't work. But if you have one extra permission that you didn't need, you're only going to notice that on a regular review or on a post-mortem of a security incident. I recommend the former method.
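
For the IAM policy review mentioned above, part of the job can be automated. Here's a small sketch that flags `Allow` statements containing a wildcard in their actions or resources. The function name `find_broad_statements` is made up for this example, and a wildcard isn't always a mistake, so treat the output as a list of things to look at, not a verdict.

```python
def _as_list(value):
    """IAM allows a single string or a list in most fields; normalize to a list."""
    return value if isinstance(value, list) else [value]


def find_broad_statements(policy):
    """Return (statement index, field, value) for every wildcard in an Allow statement."""
    findings = []
    for index, statement in enumerate(_as_list(policy.get("Statement", []))):
        if statement.get("Effect") != "Allow":
            continue
        for field in ("Action", "Resource"):
            for value in _as_list(statement.get(field, [])):
                if "*" in value:
                    findings.append((index, field, value))
    return findings


if __name__ == "__main__":
    example = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["secretsmanager:GetSecretValue"],
                "Resource": ["arn:aws:secretsmanager:us-east-1:111122223333:secret:db-password"],
            },
            # Leftover from debugging: far more than the function needs.
            {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
        ],
    }
    for finding in find_broad_statements(example):
        print(finding)
```

Running it on the example prints one finding for `s3:*` and one for the `*` resource, both in the second statement. The account ID and secret name in the example are placeholders.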

Reliability

Create a Disaster Recovery Plan and test it: Your database is going to fail at some point. Set up regular backups to protect from that, and test them frequently. Also, consider what happens if the AZ where your instance is placed goes down, or if the entire region goes down. "Too much work, if the AZ or region is down I'm cool with some downtime" is a perfectly valid strategy, but only if it's a conscious decision.
Use one NAT Gateway per AZ (for prod): If you want to give your Lambda functions access to the internet, you'll need NAT. For the production environment, use one NAT Gateway per AZ, so when an AZ goes down your Lambdas can still access the internet. Of course, in this particular scenario this only makes sense if your RDS database is also highly available.

Performance Efficiency

Retrieve secrets and start DB connections in the init code: Lambda functions have initialization code (outside the handler method) and execution code (inside the handler method). When a Lambda function is invoked for the first time, it launches a new instance, executes the init code (this is called a cold start), and then passes the invocation event to that instance to start the execution of the handler method. Future invocations use existing instances, and the init code is not executed again. If there's no available instance (e.g. all instances are already executing an invocation, or they died after waiting several minutes with no invocations), a new instance is created and the init code is run again.
How does this help? Simple: If you put the code that starts the DB connection inside the handler, a new connection is started every time the function is invoked, and terminated when that execution ends. Put it in the init code (outside the handler), and the database connection will get reused across invocations, so long as that Lambda instance stays alive. Same goes for retrieving the secret from Secrets Manager.
Cache what you can: For data that is updated infrequently and/or you can tolerate not getting the most up to date version, you can add a cache between the Lambda function and the RDS database. It's cheaper to read the cached response than to re-calculate the response on every query, and caches often scale better than relational databases.
Throttle database queries: Lambda scales like a beast, RDS doesn't. If your system is prepared for 100 concurrent users and you suddenly get 200, your Lambda function is going to scale seamlessly, but 200 queries to your database at the same time is going to be too much for your RDS instance. Preemptively increasing the size doesn't help, you'll be paying more per month and you'll still hit this problem at some point (be it 200 users, 500, 1000, etc). Instead, for reads you can read from a cache (as explained above), and for writes you can put the write operations into a queue and have another Lambda read from that queue at a more controlled rate.
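
The init code tip above can be sketched like this. It's a minimal example, not a full implementation: `make_handler` and `fetch_secret_string` are names invented for illustration, and the database driver is passed in as the `connect` and `run_query` callables so the sketch doesn't assume any particular library. It does assume the secret is stored as a JSON string.

```python
import json


def fetch_secret_string(secret_id):
    """Read a secret from Secrets Manager. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the rest of the sketch runs without AWS

    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=secret_id)["SecretString"]


def make_handler(fetch_secret, connect, run_query):
    """Build the Lambda handler. Call this at module level (init code).

    Everything before the inner function runs once per Lambda instance,
    during the cold start. The inner function runs on every invocation.
    """
    credentials = json.loads(fetch_secret())  # one Secrets Manager call per instance
    connection = connect(credentials)         # one database connection per instance

    def handler(event, context):
        # Reuses the connection opened during init.
        return run_query(connection, event)

    return handler
```

In the real function module you'd wire it up outside any function, for example `handler = make_handler(lambda: fetch_secret_string(os.environ["SECRET_ID"]), connect_to_db, run_query)`, where `SECRET_ID`, `connect_to_db` and `run_query` are placeholders for your own configuration and driver code. A long-lived connection can still drop, so real code should be ready to reconnect.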

Cost Optimization

Use one NAT Instance per VPC (for dev): Dev environments don't need high availability, so instead of multiple NAT Gateways, consider a single NAT Gateway. Since NAT Gateways are zonal resources, if the AZ goes down your dev environment will go down, but you're probably fine with that. And I'll do you one better: Instead of paying $32.40/month (yes, NAT Gateways cost that much!), set up a t4g.micro EC2 instance as a NAT Instance. You can even make that instance self-healing.
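
The monthly figure above is just an hourly rate multiplied out over a 30-day month. Here's the arithmetic as a quick sketch. Both hourly rates are assumptions (on-demand prices vary by region and change over time, so check the current pricing pages), and the numbers leave out data processing and storage charges.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours


def monthly_cost(hourly_usd, count=1):
    """Fixed monthly cost in USD for `count` resources billed by the hour."""
    return round(hourly_usd * HOURS_PER_MONTH * count, 2)


# Assumed on-demand hourly rates, not taken from the article.
NAT_GATEWAY_HOURLY = 0.045
T4G_MICRO_HOURLY = 0.0084

print(monthly_cost(NAT_GATEWAY_HOURLY))           # one NAT Gateway: 32.4
print(monthly_cost(NAT_GATEWAY_HOURLY, count=3))  # one per AZ across three AZs
print(monthly_cost(T4G_MICRO_HOURLY))             # one t4g.micro NAT Instance
```

Under these assumptions a single NAT Instance costs a fraction of a single NAT Gateway, which is why it makes sense for dev.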

If you'd like to know more about me, you can find me on LinkedIn or at www.guilleojeda.com