DEV Community

🛡️🔗 Secure Data Pipelines: Connect AWS Glue to Amazon RDS in a VPC 🛠️💡

👋 Hey there! I’m Sarvar, a Cloud Architect passionate about cutting-edge technologies. With years of experience in Cloud Operations (Azure and AWS), Data Operations, Data Analytics, and DevOps, I've had the privilege of working with clients around the globe, delivering top-notch results. I’m always exploring the latest tech trends and love sharing what I learn along the way. Let’s dive into the world of cloud and tech together! 🚀

In this article we will see how a Virtual Private Cloud (VPC) helps us build highly secure connectivity between Amazon RDS and AWS Glue. We will look at the challenges of connecting services that are isolated within a VPC and at keeping data transfers secure without exposing RDS to the public internet. The process involves configuring AWS Glue to run in the same VPC as RDS, creating the necessary security groups and IAM role, and making sure the right database ports are used for communication. After reading this article, you should be able to run Extract, Transform, and Load (ETL) operations between Glue and RDS, with secure and effective data processing inside a VPC.

I will guide you through setting up an RDS instance inside a VPC using CloudFormation, configuring Glue to run on the same network as that instance, and making sure the two services communicate securely by defining the security group and subnet settings. This high-level tutorial demonstrates how to create a private, secure connection that follows cloud security best practices and lets Glue run ETL operations on RDS data.
Now let's get started.


GitHub Repository

You can find the complete code and setup instructions in my GitHub repository: GitHub Repository


Let's Create a Virtual Private Cloud (VPC):

Follow this link for the CloudFormation template (CFT) - Link
Designed to create a secure, isolated VPC with private subnets, this stack is well suited to hosting RDS instances in a production-grade environment. Because the RDS database sits in private subnets, it is not reachable from the internet. A VPC endpoint is included so that resources in the private subnets can connect to AWS services such as S3 and DynamoDB without using the internet. In addition, a security group gives precise control over network traffic to and from RDS. Together these pieces give a secure base configuration for running production workloads.
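
The exact rules in the template are not reproduced here, so the following is only a minimal sketch of the kind of security group rules this setup typically needs. It assumes Glue and RDS share one security group (or that you know both group IDs), uses the default PostgreSQL port 5432, and the helper names are hypothetical. AWS Glue expects a self-referencing inbound rule on the security group attached to its connection, which is what `self_referencing_rule` builds.

```python
from typing import Any, Dict, List


def self_referencing_rule(security_group_id: str) -> Dict[str, Any]:
    """Allow all TCP traffic between members of the same security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": security_group_id}],
    }


def database_port_rule(source_group_id: str, port: int = 5432) -> Dict[str, Any]:
    """Allow only the database port, and only from members of source_group_id."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "UserIdGroupPairs": [{"GroupId": source_group_id}],
    }


def authorize(ec2_client: Any, group_id: str, rules: List[Dict[str, Any]]) -> None:
    """Apply the rules through the EC2 AuthorizeSecurityGroupIngress API."""
    ec2_client.authorize_security_group_ingress(GroupId=group_id, IpPermissions=rules)
```

With boto3 installed and credentials configured, you could pass `boto3.client("ec2")` as `ec2_client`. If RDS uses a separate security group, `database_port_rule` with the Glue group ID as the source keeps the database port closed to everything else.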

Important

Why is a VPC endpoint required?

An AWS Glue job needs access to Amazon S3 even when it only connects to an RDS database, because Glue uses S3 for temporary storage, logs, scripts, and library dependencies. Since the VPC subnets here have no NAT gateway or internet connectivity, Glue cannot reach S3 unless an S3 VPC endpoint is configured. Without the endpoint, the job cannot access the S3 resources it needs and fails. Configuring the S3 VPC endpoint lets Glue reach S3 without internet access, which fixes the problem.
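
In this tutorial the endpoint comes from the CloudFormation stack, but if you ever need to add one to an existing VPC, a rough sketch with the EC2 API could look like this. The function names are hypothetical, and the VPC ID, region, and route table IDs are placeholders you would supply yourself.

```python
from typing import Any, Dict, List


def s3_gateway_endpoint_params(vpc_id: str, region: str,
                               route_table_ids: List[str]) -> Dict[str, Any]:
    """Build the arguments for ec2.create_vpc_endpoint for an S3 gateway endpoint."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        # Route tables of the private subnets that Glue will run in.
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(ec2_client: Any, vpc_id: str, region: str,
                               route_table_ids: List[str]) -> str:
    """Create the endpoint and return its ID."""
    params = s3_gateway_endpoint_params(vpc_id, region, route_table_ids)
    response = ec2_client.create_vpc_endpoint(**params)
    return response["VpcEndpoint"]["VpcEndpointId"]
```

A gateway endpoint works by adding a route to the listed route tables, so the route tables you pass must be the ones associated with the private subnets used by Glue.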


Let's Create Amazon RDS:

Follow this link for the CloudFormation template (CFT) - Link
This CloudFormation template provisions an Aurora PostgreSQL Serverless v2 database with a minimal resource footprint in private subnets within the provided VPC. Parameters collect the key inputs: the VPC ID, two private subnet IDs, a security group ID, and the name of the Secrets Manager secret that contains the database credentials. The template restricts the database to a serverless capacity range of 0.5 to 1 Aurora Capacity Units (ACUs), enables the Data API for simpler access, and configures the database to use as few resources as possible. When deploying the template, you specify the database name, the RDS cluster name, and the initial master username. The master username and password are stored securely in AWS Secrets Manager for later use, and the RDS instance is created inside private subnets, so the database is not publicly accessible.
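
To make the description above more concrete, here is a small Python sketch that builds a comparable CloudFormation `Resources` fragment and prints it as JSON. This is not the template from the repository: the logical IDs, the parameter names referenced with `Ref`, and the `password` key inside the secret are all assumptions made for illustration.

```python
import json
from typing import Any, Dict


def aurora_serverless_v2_resources(secret_name: str) -> Dict[str, Any]:
    """Return a CloudFormation Resources fragment for a small Aurora PostgreSQL cluster."""
    # Dynamic reference resolved by CloudFormation at deploy time (assumed secret key).
    password_ref = "{{resolve:secretsmanager:" + secret_name + ":SecretString:password}}"
    return {
        "DBSubnetGroup": {
            "Type": "AWS::RDS::DBSubnetGroup",
            "Properties": {
                "DBSubnetGroupDescription": "Private subnets for Aurora",
                "SubnetIds": [{"Ref": "PrivateSubnet1"}, {"Ref": "PrivateSubnet2"}],
            },
        },
        "DBCluster": {
            "Type": "AWS::RDS::DBCluster",
            "Properties": {
                "Engine": "aurora-postgresql",
                "DatabaseName": {"Ref": "DatabaseName"},
                "MasterUsername": {"Ref": "MasterUsername"},
                "MasterUserPassword": password_ref,
                "DBSubnetGroupName": {"Ref": "DBSubnetGroup"},
                "VpcSecurityGroupIds": [{"Ref": "SecurityGroupId"}],
                "EnableHttpEndpoint": True,  # turns on the Data API
                "ServerlessV2ScalingConfiguration": {"MinCapacity": 0.5, "MaxCapacity": 1},
            },
        },
        "DBInstance": {
            "Type": "AWS::RDS::DBInstance",
            "Properties": {
                "Engine": "aurora-postgresql",
                "DBClusterIdentifier": {"Ref": "DBCluster"},
                "DBInstanceClass": "db.serverless",
                "PubliclyAccessible": False,
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(aurora_serverless_v2_resources("my-db-secret"), indent=2))
```

The parts worth noticing are the scaling range (0.5 to 1 ACU), `EnableHttpEndpoint` for the Data API, and `PubliclyAccessible` set to false, which map directly to the points described above.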

The Amazon RDS database has now been created.

Load Some Sample Data into the Database:

To load data into your RDS database using the RDS Query Editor:

  1. Open the Amazon RDS console and go to Query Editor.
  2. Connect to your database by selecting the cluster, database name, and required credentials.
  3. Write SQL queries to create tables and insert data.

Get the sample data from here - Link

Note: Query Editor works through the Data API. If you are using Amazon Aurora Serverless and have enabled the Data API for your cluster, loading data into the database with Query Editor is a simple process.
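
If you would rather script the load than paste SQL into Query Editor, the same Data API can be called from Python. The sketch below is only an illustration: the `ec2_instance` columns and the sample row are made up (the real sample data is in the linked file), and the cluster ARN, secret ARN, and database name are placeholders.

```python
from typing import Any, Iterable

# Hypothetical schema and row, for illustration only.
SAMPLE_STATEMENTS = [
    "CREATE TABLE IF NOT EXISTS ec2_instance ("
    "instance_id VARCHAR(32) PRIMARY KEY, "
    "instance_type VARCHAR(32), "
    "state VARCHAR(16))",
    "INSERT INTO ec2_instance (instance_id, instance_type, state) "
    "VALUES ('i-0123456789abcdef0', 't3.micro', 'running')",
]


def run_statements(rds_data_client: Any, cluster_arn: str, secret_arn: str,
                   database: str, statements: Iterable[str]) -> int:
    """Run each SQL statement through the RDS Data API and return how many ran."""
    executed = 0
    for sql in statements:
        rds_data_client.execute_statement(
            resourceArn=cluster_arn,
            secretArn=secret_arn,
            database=database,
            sql=sql,
        )
        executed += 1
    return executed
```

With boto3, `rds_data_client` would be `boto3.client("rds-data")`. Because the Data API is an HTTPS endpoint, this works from outside the VPC too, as long as the caller has IAM permission to use the cluster and the secret.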


Let's Connect to Amazon RDS from AWS Glue:

This brings us to the last section of the tutorial: connecting to Amazon RDS from AWS Glue. In this section we'll go over the final steps needed for Glue to reach the RDS database.

First, create the Glue connection that holds the connection details and credentials for your Amazon RDS instance. The steps below walk through the console flow.

On the Connections page in AWS Glue, step 1 is choosing the data source: search for Postgres. We pick it because our database engine is Amazon Aurora PostgreSQL. Select it to continue setting up the connection.

In step 2, enter the connection information for your Amazon Aurora PostgreSQL instance: the database name, the credential type (which needs to be set to "Username and Password"), and the matching username and password for the database. Once you've entered these details, you can proceed to finish the setup.

Now all you need to do is enter a connection name and a description that explains the connection's purpose. This will help you identify and manage the connection later.

That's it! The connection we have just created will be used to access the Amazon RDS database.
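
The console steps above can also be sketched with the Glue API. This is a rough equivalent, not exactly what the console creates: it assumes a connection of type `JDBC`, and the endpoint, credentials, subnet, security group, and Availability Zone values are placeholders you would replace with your own.

```python
from typing import Any, Dict, List


def build_jdbc_url(host: str, port: int, database: str) -> str:
    """Build a PostgreSQL JDBC URL for the Aurora cluster endpoint."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def build_connection_input(name: str, description: str, jdbc_url: str,
                           username: str, password: str, subnet_id: str,
                           security_group_ids: List[str],
                           availability_zone: str) -> Dict[str, Any]:
    """Build the ConnectionInput structure for glue.create_connection."""
    return {
        "Name": name,
        "Description": description,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": username,
            "PASSWORD": password,
        },
        # Tells Glue which subnet and security groups to attach its network interfaces to.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def create_connection(glue_client: Any, connection_input: Dict[str, Any]) -> None:
    """Create the connection through the Glue CreateConnection API."""
    glue_client.create_connection(ConnectionInput=connection_input)
```

The `PhysicalConnectionRequirements` block is what places Glue inside the same VPC as RDS. Passing the password in plain text is done here only to mirror the "Username and Password" credential type used in the console steps.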


To verify the connection to the Amazon RDS database, let's now go to AWS Glue Studio. A full Glue ETL job could also be used to test the connection, but that would add some length to this article. To keep things simple, we'll test the connection directly in Glue Studio, where it is easy to verify the connection and inspect the data in the table. Once the connection has been established and tested, it can be reused in any Glue job.

In AWS Glue Studio, go to the Jobs section and choose to create a job from a blank graph so that you can build the job from scratch.

Look for Postgres under the source options. Double-click it to open the configuration panel, where you can enter the connection details.

This is the final phase. Give the job a name and fill in the JDBC connection details: select the connection we created earlier and make sure to include the table name (in this example, the ec2_instance table, which holds data about EC2 instances). Select an IAM role with the required access permissions, then click Run. After a few moments, the run should finish and the table's data should be visible.
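
For reference, here is roughly how the same source could look in a Glue job script. Treat it as a sketch: the connection name is whatever you chose earlier, the helper names are hypothetical, and `glue_context` is the `GlueContext` object that exists only inside a Glue job, so `read_table` cannot run on a laptop.

```python
from typing import Any, Dict


def postgres_source_options(connection_name: str, table: str) -> Dict[str, str]:
    """Connection options that reuse the URL and credentials stored in a Glue connection."""
    return {
        "useConnectionProperties": "true",
        "connectionName": connection_name,
        "dbtable": table,
    }


def read_table(glue_context: Any, connection_name: str, table: str) -> Any:
    """Read a table into a DynamicFrame; call this from inside a Glue job."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="postgresql",
        connection_options=postgres_source_options(connection_name, table),
    )
```

Inside a job you would call something like `read_table(glueContext, "my-rds-connection", "ec2_instance")`, where the connection name is a placeholder. The job must also list the connection in its settings so that Glue launches its workers inside the VPC.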

If you can see the table data in the output, you have successfully connected AWS Glue to Amazon RDS. Congratulations!

How to Establish Cross-Account RDS Data Connections for AWS Glue Jobs:

I do not have two separate AWS accounts, but clients sometimes ask us to implement cross-account connections. If you need to set up a cross-account RDS data connection for an AWS Glue job, the basic approach is the same, with a few extra steps. First, to let network traffic pass between the two accounts, make sure the VPCs in the two accounts are peered, or set up a transit gateway. Second, create a cross-account IAM role whose trust relationship policy includes the ARNs of the RDS and Glue roles for each account. These two additions, VPC peering or a transit gateway plus the appropriate cross-account IAM roles, are the essential extra steps. The remaining steps are the same as above.
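
Since I could not test this across two accounts, the following is only a sketch of the two extra pieces described above. The account IDs, role ARNs, and VPC IDs are placeholders, and the helper names are hypothetical.

```python
import json
from typing import Any, Dict, Iterable


def cross_account_trust_policy(principal_arns: Iterable[str]) -> Dict[str, Any]:
    """Trust policy that lets the listed roles assume the cross-account role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": sorted(principal_arns)},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def peering_request_params(vpc_id: str, peer_vpc_id: str,
                           peer_account_id: str) -> Dict[str, str]:
    """Arguments for ec2.create_vpc_peering_connection, sent from the requester account."""
    return {"VpcId": vpc_id, "PeerVpcId": peer_vpc_id, "PeerOwnerId": peer_account_id}


if __name__ == "__main__":
    policy = cross_account_trust_policy(["arn:aws:iam::111111111111:role/GlueJobRole"])
    print(json.dumps(policy, indent=2))
```

After the peering request is accepted in the other account, both sides still need routes to each other's CIDR range and security group rules that allow the database port, otherwise the Glue job will time out when it tries to connect.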

Conclusion: To provide safe, effective data transfers and processing without exposing resources to the public internet, you need a secure connection between Amazon RDS and AWS Glue within a VPC. By running both services in the same VPC and carefully configuring security groups, subnets, and database ports, you can create a private, isolated environment that follows cloud security best practices. CloudFormation streamlines the process of building an RDS instance inside the VPC, freeing up your time to concentrate on dependable, secure ETL processes between Glue and RDS. This approach keeps data well protected because traffic between Glue and RDS stays inside the VPC.

— — — — — — — —
Here is the End!

Thank you for reading! ✨ I hope this article helped simplify the process and gave you valuable insights. As I continue to explore the ever-evolving world of technology, I’m excited to share more guides, tips, and updates with you. 🚀 Stay tuned for more content that breaks down complex concepts and makes them easier to grasp. Let’s keep learning and growing together! 💡
