AWS Glue is a serverless data integration service provided by Amazon as part of Amazon Web Services. It lets you discover, prepare, and move data between data stores while automatically managing the computing resources required to do so.
As a popular ETL service, Glue offers numerous options to connect to various databases, including PostgreSQL, a widely used RDBMS.
Glue provides several ways to set up ETL (Extract, Transform, Load) processes, as shown below:
With its visual setup, performing ETL tasks becomes much easier.
You only need a few clicks to create an ETL job that transforms data from an S3 input to a PostgreSQL output.
However, this setup is restrictive: you are limited to the options the visual editor exposes, and every step must be configured through it before the ETL job will run properly.
If you are looking for more flexibility in configuration, you can consider using a script setup.
With a script setup, you can connect to your data source or output directly from the script. To do this, switch from the visual setup to the script page as shown below:
For the code, you can use a simple script like the following:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Initialize the Glue context and job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read data from S3
s3_path = 's3://your-S3-REPO/'
datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [s3_path]},
    format="csv",  # Adjust the format as necessary
    format_options={"withHeader": True, "separator": ","}
)
datasource.printSchema()

# Transform the data: cast the string columns read from the CSV
# to the types expected by the target table
transformed = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("id", "string", "id", "int"),
        ("name", "string", "name", "string"),
        ("age", "string", "age", "int")
    ]
)
transformed.printSchema()

# Write the data to PostgreSQL
glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://your-PostgresqlDB-Endpoint:5432/your-database",
        "dbtable": "your_table",
        "user": "your-Postgresql-User",
        "password": "your-Postgresql-Password"
    }
)

# Commit the job
job.commit()
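Hard-coding credentials in the script is fine for a quick test, but for anything longer-lived you may prefer a Glue connection created beforehand in the console, so the credentials stay out of the script. A minimal sketch of the write step using that approach, assuming a JDBC connection named my-postgres-connection already exists (the connection, database, and table names here are placeholders):

# Alternative write path: use a pre-created Glue connection
# instead of inline credentials (names below are placeholders)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=transformed,
    catalog_connection="my-postgres-connection",
    connection_options={
        "dbtable": "your_table",
        "database": "your_database"
    }
)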
For the input, you can use a CSV file like this:
id,name,age
1,John Doe,30
2,Jane Smith,15
3,Bob Yellow,20
4,Roshan Brown,18
5,Bam Black,55
After that, you can start the job and wait for it to finish. If it succeeds, as shown below:
you can check the latest results in your PostgreSQL database.
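If you prefer to start and monitor the job programmatically rather than from the console, you can do so with boto3 from any machine with AWS credentials. A small sketch, assuming the Glue job was saved under the name s3-to-postgres-etl (the job name is a placeholder):

import time

import boto3

# Start the Glue job and poll until it reaches a terminal state
# (the job name below is a placeholder for whatever you named your job)
glue = boto3.client('glue')
run_id = glue.start_job_run(JobName='s3-to-postgres-etl')['JobRunId']

while True:
    run = glue.get_job_run(JobName='s3-to-postgres-etl', RunId=run_id)
    state = run['JobRun']['JobRunState']
    if state in ('SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT'):
        print(f'Job finished with state: {state}')
        break
    time.sleep(30)  # Glue runs usually take at least a couple of minutes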
That's it for this article. Leave a comment below with your thoughts! Thanks.