AWS Glue is a serverless data integration service provided by Amazon as part of Amazon Web Services. It lets you discover, prepare, and move data between data stores while automatically managing the computing resources required to do so.
As a popular ETL service, Glue offers numerous options to connect to various databases, including PostgreSQL, a widely used RDBMS.
Glue provides several ways to set up ETL (Extract, Transform, Load) processes, as shown below:
With its visual setup, performing ETL tasks becomes much easier.
You only need a few clicks to create an ETL job that transforms data from an S3 input to a PostgreSQL output.
However, this setup is restrictive: you are limited to the options the visual editor exposes, and every step must be configured through it before the ETL job will run properly.
If you are looking for more flexibility in configuration, you can consider using a script setup.
With a script setup, you can connect to your data source or output directly from the script. To do this, switch from the visual setup to the script page as shown below:
For the code, you can use a simple script like the following:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Initialize the Glue context and job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read data from S3
s3_path = 's3://your-S3-REPO/'
datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [s3_path]},
    format="csv",  # Adjust the format as necessary
    format_options={"withHeader": True, "separator": ","}
)
datasource.printSchema()

# Transform the data: cast the string columns read from the CSV
# to the types expected by the target table
transformed = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("id", "string", "id", "int"),
        ("name", "string", "name", "string"),
        ("age", "string", "age", "int")
    ]
)
transformed.printSchema()

# Write the data to PostgreSQL
glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://your-PostgresqlDB-Endpoint:5432/your-database",
        "dbtable": "your_table",
        "user": "your-Postgresql-User",
        "password": "your-Postgresql-Password"
    }
)

# Commit the job
job.commit()
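Hard-coding credentials in the script is fine for a quick test, but for anything longer-lived you may prefer a Glue connection created beforehand in the console, so the credentials stay out of the script. A minimal sketch of the write step using that approach, assuming a JDBC connection named my-postgres-connection already exists (the connection, database, and table names here are placeholders):

# Alternative write path: use a pre-created Glue connection
# instead of inline credentials (names below are placeholders)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=transformed,
    catalog_connection="my-postgres-connection",
    connection_options={
        "dbtable": "your_table",
        "database": "your_database"
    }
)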
For the input, you can use a CSV file like this:
id,name,age
1,John Doe,30
2,Jane Smith,15
3,Bob Yellow,20
4,Roshan Brown,18
5,Bam Black,55
After that, you can start the job and wait for it to finish. If it succeeds, as shown below:
you can check the latest results in your PostgreSQL database.
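If you prefer to start and monitor the job programmatically rather than from the console, you can do so with boto3 from any machine with AWS credentials. A small sketch, assuming the Glue job was saved under the name s3-to-postgres-etl (the job name is a placeholder):

import time

import boto3

# Start the Glue job and poll until it reaches a terminal state
# (the job name below is a placeholder for whatever you named your job)
glue = boto3.client('glue')
run_id = glue.start_job_run(JobName='s3-to-postgres-etl')['JobRunId']

while True:
    run = glue.get_job_run(JobName='s3-to-postgres-etl', RunId=run_id)
    state = run['JobRun']['JobRunState']
    if state in ('SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT'):
        print(f'Job finished with state: {state}')
        break
    time.sleep(30)  # Glue runs usually take at least a couple of minutes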
That's it for this article. Leave a comment below with your thoughts! Thanks.