AWS Glue was initially introduced as a serverless ETL service that allows customers to crawl, catalog, transform, and ingest data into AWS for analytics. Over the years, however, it has evolved into a fully managed serverless data integration service.
AWS Glue simplifies the process of data integration (ETL), which typically involves discovering, preparing, extracting, and combining data for analysis. These tasks are often handled by multiple individuals or teams with a diverse set of skills in an organization.
Data discovery (AWS Glue Data Catalog)
AWS Glue Data Catalog can be used to discover and search data across all our datasets. Data Catalog allows us to store metadata for our datasets and makes the data easy to analyze. AWS Glue Data Catalog can be used by AWS services like AWS Glue, Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum, and also by on-premises or third-party products that work with an Apache Hive Metastore. AWS Glue Data Catalog is a persistent metastore for data assets. The dataset can be stored anywhere (in AWS, on premises, or with a third-party provider) and Data Catalog can still be used.
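As a rough illustration of reading metadata from the Data Catalog, the sketch below pages through the tables of one database with boto3's Glue client and reduces each table to its name, storage location, and column names. The helper names are made up for this example, and `list_tables` assumes boto3 is installed and that AWS credentials and a region are configured.

```python
def summarize_table(table):
    """Reduce a Glue table description to (name, location, column names)."""
    descriptor = table.get("StorageDescriptor", {})
    columns = [col["Name"] for col in descriptor.get("Columns", [])]
    return table["Name"], descriptor.get("Location"), columns


def list_tables(database_name):
    """Yield a summary for every table in one Data Catalog database."""
    import boto3  # imported lazily so summarize_table works without boto3

    glue = boto3.client("glue")
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page["TableList"]:
            yield summarize_table(table)
```

For a hypothetical database called `sales_db`, `list(list_tables("sales_db"))` would return one tuple per catalogued table.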
AWS Glue Crawlers allow us to populate the Data Catalog with metadata for our datasets by crawling the data stores according to a user-defined configuration.
An AWS Glue Crawler is an AWS Glue component that crawls the data in different types of data stores, infers the schema, and populates AWS Glue Data Catalog with the metadata for the dataset that was crawled.
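A minimal sketch of what such a user-defined crawler configuration might look like with boto3 follows. The function names are hypothetical, and the crawler name, IAM role ARN, database, and S3 path are placeholders the caller must supply. Only `create_and_start_crawler` talks to AWS, and it assumes configured credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for Glue's create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes
        "DatabaseName": database,    # Data Catalog database to populate
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request):
    """Register the crawler, then start a crawl run."""
    import boto3  # imported lazily; the builder above needs no AWS access

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```

Keeping the request builder separate from the AWS call makes the configuration easy to inspect or unit test before anything is created.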
For a crawler to crawl a VPC resource or an on-premises data store, such as Amazon Redshift or a JDBC data store (including Amazon RDS), a Glue connection is required. Crawlers can crawl S3 buckets without a Glue connection. However, a connection of type Network is required if you must keep S3 request traffic off the public internet.
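The sketch below shows one way the Network connection case could be expressed with boto3. It is an assumption-laden outline, not a complete network setup: the helper names are invented for this example, and the subnet, security group, and Availability Zone values must come from your own VPC.

```python
def build_network_connection(name, subnet_id, security_group_ids, availability_zone):
    """Build the ConnectionInput for a NETWORK-type Glue connection."""
    return {
        "Name": name,
        "ConnectionType": "NETWORK",
        "ConnectionProperties": {},  # no JDBC URL or credentials for NETWORK
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def s3_target(path, connection_name=None):
    """Build an S3 crawler target; attach a connection only when one is given."""
    target = {"Path": path}
    if connection_name is not None:
        target["ConnectionName"] = connection_name
    return target


def register_connection(connection_input):
    """Create the connection in Glue (requires AWS credentials)."""
    import boto3

    boto3.client("glue").create_connection(ConnectionInput=connection_input)
```

A crawler that does not need private networking would simply use `s3_target(path)` with no connection name.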
AWS Glue Schema Registry allows us to manage and enforce schemas for data streams.
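To make this concrete, here is a hedged sketch of registering an Avro schema with boto3. The registry name, schema name, and the `Order` record are invented for illustration, and `BACKWARD` is just one of the compatibility modes the service accepts. The registry itself is assumed to exist already.

```python
import json


def build_schema_request(registry_name, schema_name, avro_schema):
    """Build keyword arguments for Glue's create_schema call."""
    return {
        "RegistryId": {"RegistryName": registry_name},
        "SchemaName": schema_name,
        "DataFormat": "AVRO",
        "Compatibility": "BACKWARD",  # assumed policy; choose what fits your stream
        "SchemaDefinition": json.dumps(avro_schema),
    }


def register_schema(request):
    """Create the schema in the registry (requires AWS credentials)."""
    import boto3

    return boto3.client("glue").create_schema(**request)


# Hypothetical Avro record describing one event on the stream.
ORDER_SCHEMA = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}
```

Calling `register_schema(build_schema_request("demo-registry", "orders", ORDER_SCHEMA))` would submit the definition to the registry.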
AWS Glue Data Catalog is composed of Databases, Tables, and Partitions.
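The relationship between the three levels can be sketched as follows: a database holds tables, a table declares partition keys, and each partition supplies one value per key. The helper names are hypothetical, and `iter_partitions` assumes boto3 with configured AWS credentials.

```python
def describe_partition(table, partition):
    """Pair a table's partition keys with one partition's values."""
    keys = [key["Name"] for key in table.get("PartitionKeys", [])]
    return dict(zip(keys, partition["Values"]))


def iter_partitions(database_name, table_name):
    """Yield {key: value} mappings for every partition of one table."""
    import boto3  # imported lazily; describe_partition needs no AWS access

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database_name, Name=table_name)["Table"]
    paginator = glue.get_paginator("get_partitions")
    pages = paginator.paginate(DatabaseName=database_name, TableName=table_name)
    for page in pages:
        for partition in page["Partitions"]:
            yield describe_partition(table, partition)
```

For a table partitioned by `year` and `month`, each yielded mapping would carry one value for each of those two keys.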
AWS Glue makes it easy to ingest data from various sources like HDFS, Amazon S3, JDBC, and AWS Glue. It also allows data to be ingested from SaaS and custom data stores via custom and Marketplace connectors.
AWS Glue also allows us to interactively create and debug our ETL code using AWS Glue development endpoints, AWS Glue interactive sessions, and AWS Glue Jupyter Notebooks.
AWS Glue DataBrew provides an interactive visual interface for cleaning and normalizing data without writing code. This is especially useful to users who do not have Apache Spark programming skills in Python or Scala. AWS Glue DataBrew comes pre-packed with over 250 transformations/recipes that can be used to transform data as per our requirements.