AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.
AWS Glue consists of a number of components:
- A data catalog (implementing the functionality of a Hive Metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS, including Amazon RDS and Amazon Redshift
- Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
- A distributed data processing framework that extends PySpark with functionality for increased schema flexibility
- Code generation tools to template and bootstrap data processing scripts
- Scheduling for crawlers and data processing scripts
- Serverless development and execution of scripts in an Apache Spark (2.x) environment
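
To make "schema discovery" and "schema flexibility" a little more concrete, here is a minimal plain-Python sketch of the idea. It is not Glue's API or its actual implementation: the function names `infer_schema` and `describe` and the sample records are hypothetical, made up for illustration. It scans a few records, collects every type seen for each field, and labels a field with more than one type as a "choice", which is roughly the situation Glue's DynamicFrame is designed to tolerate (and that its `resolveChoice` transform lets you settle later).

```python
from collections import defaultdict


def infer_schema(records):
    """Map each field name to the sorted list of Python type names seen for it."""
    seen = defaultdict(set)
    for record in records:
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


def describe(schema):
    """Render a single type as-is and several types as a 'choice'."""
    return {
        field: types[0] if len(types) == 1 else "choice<" + ",".join(types) + ">"
        for field, types in schema.items()
    }


# Hypothetical sample data: "id" and "price" are not typed consistently,
# and "in_stock" only appears in some records.
records = [
    {"id": 1, "price": 9.99, "name": "widget"},
    {"id": 2, "price": "12.50", "name": "gadget"},
    {"id": "3", "name": "gizmo", "in_stock": True},
]

schema = describe(infer_schema(records))
for field, type_name in schema.items():
    print(f"{field}: {type_name}")
```

A fixed-schema approach would have to reject or coerce the inconsistent `id` and `price` values up front. Recording the ambiguity and deferring the decision is the design choice this sketch tries to show.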
AWS Glue was introduced in August 2017.
With the ability to scale on demand, AWS Glue helps you focus on high-value activities that maximize the value of your data.
To increase agility and optimize costs, AWS Glue provides built-in high availability and pay-as-you-go billing.
Data registered in the AWS Glue Data Catalog is available to many AWS services, including:
- Amazon Redshift Spectrum
- EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
- Amazon Athena
- AWS Glue scripts
Official website: https://aws.amazon.com/glue