Data Lake:
A centralized repository that lets you migrate, store and manage all of your structured and unstructured data at virtually unlimited scale. Once the data is centralized, we can extract value and gain insights from it through analytics and ML. It also makes the data available to more users across more lines of business, so they can get the insights they need.
S3 is a good fit for a Data Lake because it provides virtually unlimited scalability.
Data Cataloging:
S3 PUT event -> Lambda function extracts the object's metadata -> metadata is written to DynamoDB and Elasticsearch -> the catalog can then be queried.
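The cataloging flow above can be sketched as a small Lambda handler. This is a minimal sketch under assumptions, not a definitive implementation: the field names of the catalog record and the `put_item` hook are hypothetical, and the actual DynamoDB or Elasticsearch write is left to whatever callable you pass in. Only the S3 event layout (`Records`, `s3.bucket.name`, `s3.object.key`) follows the standard S3 notification format.

```python
from urllib.parse import unquote_plus


def extract_metadata(event):
    """Turn an S3 put-event payload into a list of flat catalog records."""
    records = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        obj = s3["object"]
        records.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(obj["key"]),
            "size_bytes": obj.get("size", 0),
            "etag": obj.get("eTag", ""),
            "event_time": record.get("eventTime", ""),
        })
    return records


def handler(event, context, put_item=None):
    """Lambda entry point.

    put_item is a hypothetical hook: any callable that stores one record,
    for example a thin wrapper around a DynamoDB table write.
    """
    items = extract_metadata(event)
    if put_item is not None:
        for item in items:
            put_item(item)
    return {"indexed": len(items)}
```

Keeping the parsing in `extract_metadata` separate from the storage call makes the handler easy to test locally with a hand-written event before wiring it to a real bucket.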
AWS Glue:
Fully managed ETL service. It can organize, cleanse, validate and format data.
In-Place data querying:
We can transform and query the data without provisioning or managing servers or clusters, so there is no need to copy and load the data into a separate analytics platform. Athena and Redshift Spectrum both provide in-place querying of an S3 data lake.
Amazon Athena:
Serverless, interactive query service that analyzes data directly in S3 using standard SQL. You pay for the amount of data scanned by the queries you run. Integrates with QuickSight for easy visualization.
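Because Athena bills on data scanned, a rough cost estimate is simple arithmetic. The sketch below assumes a price of 5 USD per TB scanned, rounding up to the nearest megabyte and a 10 MB minimum per query, with binary units. Treat all of these as assumptions and check the current pricing for your region; the function name and parameters are illustrative only.

```python
import math

MB = 1024 ** 2  # assumption: binary megabyte
TB = 1024 ** 4  # assumption: binary terabyte


def estimate_query_cost(bytes_scanned, price_per_tb=5.0, min_mb=10):
    """Estimate the cost in USD of one query from the bytes it scans.

    price_per_tb and min_mb are assumed defaults, not authoritative pricing.
    """
    # Round up to whole megabytes, then apply the per-query minimum.
    billed_mb = max(math.ceil(bytes_scanned / MB), min_mb)
    return billed_mb * MB / TB * price_per_tb


# Scanning a full table versus a single partition of the same data.
full_scan = estimate_query_cost(2 * TB)
one_partition = estimate_query_cost(20 * 1024 ** 3)  # 20 GB
print(f"full scan: ${full_scan:.2f}, one partition: ${one_partition:.4f}")
```

The comparison shows why partitioning, compression and columnar formats matter here: anything that reduces the bytes scanned reduces the bill in direct proportion.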
Redshift Spectrum:
Better suited to more complex queries and to cases where a large number of data lake users need to run concurrent workloads.