In One Minute : AWS Glue

#aws #awsglue #beginners #oneminute

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.
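To give a feel for the Python code the ETL engine works with, here is a minimal sketch in the style of a Glue job script. The database, table, column names, and S3 path are all hypothetical. The `awsglue` modules are only available inside the Glue job runtime, so this sketch imports them inside the function.

```python
# Hypothetical column mappings: (source name, source type, target name, target type).
MAPPINGS = [
    ("id", "string", "order_id", "string"),
    ("amount", "string", "amount", "double"),
]


def run_etl(argv):
    """Read a Data Catalog table, rename/cast columns, and write Parquet to S3."""
    # These modules exist only in the AWS Glue job runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Extract: load a table that is registered in the Data Catalog.
    orders = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db",      # hypothetical database name
        table_name="raw_orders",  # hypothetical table name
    )

    # Transform: rename and cast columns according to MAPPINGS.
    cleaned = ApplyMapping.apply(frame=orders, mappings=MAPPINGS)

    # Load: write the result to S3 as Parquet (hypothetical bucket).
    glue_context.write_dynamic_frame.from_options(
        frame=cleaned,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/clean/orders/"},
        format="parquet",
    )
    job.commit()
```

Inside a Glue job, the script would end by importing `sys` and calling `run_etl(sys.argv)`.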

AWS Glue consists of the following components:

A data catalog (implementing the functionality of a Hive Metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS, including Amazon RDS and Amazon Redshift
Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
A distributed data processing framework, which extends PySpark with functionality for increased schema flexibility
Code generation tools to template and bootstrap data processing scripts
Scheduling for crawlers and data processing scripts
Serverless development and execution of scripts in an Apache Spark (2.x) environment
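As a rough illustration of how the crawler and scheduling components fit together, the sketch below builds the arguments for the Glue `create_crawler` API call as exposed by boto3. The crawler name, IAM role ARN, database, S3 path, and cron expression in the usage note are placeholders, and the helper function names are made up for this example.

```python
def crawler_request(name, role_arn, database, s3_path, cron):
    """Build keyword arguments for the Glue `create_crawler` API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,                       # catalog database to register tables in
        "Targets": {"S3Targets": [{"Path": s3_path}]},  # S3 location to crawl
        "Schedule": f"cron({cron})",                    # when the crawler runs
    }


def register_crawler(glue_client, **kwargs):
    """Create the crawler through a boto3 Glue client (or any object with the same method)."""
    request = crawler_request(**kwargs)
    glue_client.create_crawler(**request)
    return request
```

With boto3 installed and credentials configured, this could be called as `register_crawler(boto3.client("glue"), name="orders-crawler", role_arn="arn:aws:iam::123456789012:role/GlueRole", database="sales_db", s3_path="s3://example-bucket/raw/orders/", cron="0 2 * * ? *")`. Every value in that call is hypothetical.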

AWS Glue was introduced in August 2017.

Because AWS Glue scales on demand, offers built-in high availability, and is billed pay-as-you-go, you can spend your time on the data work itself rather than on provisioning and operating infrastructure.

Data registered in the AWS Glue Data Catalog is available to many AWS services, including:

Amazon Redshift Spectrum
Amazon EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
Amazon Athena
AWS Glue scripts
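As one example of another service using the catalog, the sketch below starts an Amazon Athena query against a catalog database through boto3's `start_query_execution` call. The function name, database, table, and results bucket are hypothetical.

```python
def start_catalog_query(athena_client, database, sql, output_location):
    """Start an Athena query against a Glue Data Catalog database.

    Returns the query execution ID, which can be used to poll for results.
    """
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},             # catalog database name
        ResultConfiguration={"OutputLocation": output_location},  # S3 path for query results
    )
    return response["QueryExecutionId"]
```

A call might look like `start_catalog_query(boto3.client("athena"), "sales_db", "SELECT COUNT(*) FROM raw_orders", "s3://example-bucket/athena-results/")`, assuming a crawler has already registered `raw_orders` in `sales_db`.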

Official website: https://aws.amazon.com/glue
