Augusto Valdivia for AWS Community Builders


AWS Data Lake with Terraform - Part 2 of 6

AWS Glue

At this point you are halfway to completing your data lake foundation. Let's recap what you have achieved so far: you created individual Terraform templates for your services, tested them, added security policies, and streamed your data from EC2 to your S3 landing-zone bucket using Kinesis. Now you need a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load the data for analytics. AWS Glue is perfect for this job. With AWS Glue you can catalog the data in two ways: define the table manually, or let an AWS Glue crawler do it for you.

Once your Glue database is ready, you can run the crawler. After a minute or two it extracts the metadata from your S3 bucket into a table schema. Here the crawler does not assign column headers or partition names, so you need to edit those manually. It does, however, recognize the data type of each column (e.g. string, bigint, double).
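For comparison, the manual route mentioned above could look roughly like the sketch below. It reuses the `athena_db` database and `data_logs` bucket names from this article's Terraform. The table name, the column names, and the choice of CSV as the file format are assumptions made purely for illustration, so adjust them to match your actual data.

```hcl
# Sketch only: a hand-written Glue table as an alternative to the crawler.
# "manual_logs", the three columns, and the CSV format are hypothetical.
resource "aws_glue_catalog_table" "manual_logs" {
  name          = "manual_logs"
  database_name = aws_glue_catalog_database.athena_db.name
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    classification = "csv"
  }

  storage_descriptor {
    # The data stays in the landing bucket; the table only describes it.
    location      = "s3://${aws_s3_bucket.data_logs.bucket}/"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      serialization_library = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
      parameters = {
        "field.delim" = ","
      }
    }

    # Column names and types are declared explicitly,
    # so nothing needs renaming afterwards.
    columns {
      name = "event_time"
      type = "string"
    }
    columns {
      name = "user_id"
      type = "bigint"
    }
    columns {
      name = "amount"
      type = "double"
    }
  }
}
```

The trade-off is that you have to keep this definition in sync with the data yourself, whereas the crawler infers the types for you but leaves the naming to be fixed by hand.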


Terraform Glue section

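# Glue Data Catalog database that holds the table definitions Athena will query.
resource "aws_glue_catalog_database" "athena_db" {
  name = "athena_db"
}

# Crawler that scans the S3 landing bucket and writes the inferred schema
# into the database above. The IAM role and the bucket are defined
# outside this snippet.
resource "aws_glue_crawler" "glue_crawler" {
  database_name = aws_glue_catalog_database.athena_db.name
  name          = "s3_crawler"
  role          = aws_iam_role.crawler_role.arn

  s3_target {
    path = "s3://${aws_s3_bucket.data_logs.bucket}"
  }
}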
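
The crawler references an IAM role that is not shown in this part. As a rough sketch, and not necessarily how the role in this series is defined, a role that Glue can assume might look like the following. The resource name `crawler_role_example` and the policy names are hypothetical, and the S3 permissions are a minimal read-only assumption.

```hcl
# Sketch only: one possible shape for a role the crawler can assume.
# All resource and policy names here are hypothetical.
data "aws_iam_policy_document" "glue_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["glue.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "crawler_role_example" {
  name               = "crawler-role-example"
  assume_role_policy = data.aws_iam_policy_document.glue_assume.json
}

# AWS-managed policy that carries the baseline Glue permissions.
resource "aws_iam_role_policy_attachment" "glue_service" {
  role       = aws_iam_role.crawler_role_example.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}

# Read access to the landing bucket so the crawler can list and sample objects.
data "aws_iam_policy_document" "crawler_s3_read" {
  statement {
    actions = ["s3:GetObject", "s3:ListBucket"]
    resources = [
      aws_s3_bucket.data_logs.arn,
      "${aws_s3_bucket.data_logs.arn}/*",
    ]
  }
}

resource "aws_iam_role_policy" "crawler_s3_read" {
  name   = "crawler-s3-read"
  role   = aws_iam_role.crawler_role_example.id
  policy = data.aws_iam_policy_document.crawler_s3_read.json
}
```

If you already created `crawler_role` along with your earlier security policies, you do not need this; it is only meant to show what the `role` argument expects.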

Now that you have a structured table in AWS Glue for the data stored in your S3 bucket, you can start treating S3 as a data lake database. Next stop: Athena.

Amazon Athena

Athena is serverless, so there is no infrastructure for you to manage. This is super cool, personally speaking. Athena is also an interactive query service for S3 that offers a console for querying S3 data with standard SQL. Athena supports a variety of data formats, such as:

  • CSV
  • JSON
  • ORC
  • Parquet
  • Avro
  • TSV

Image: Athena interface example

With Athena there is no need to load your data out of S3. What do I mean by this? The data stays in S3, and Athena knows how to interpret it and query it interactively, in place. For those familiar with Presto: Athena uses Presto under the hood.
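
To make the "query in place" idea concrete, here is a minimal Terraform sketch of an Athena workgroup and a saved query that point at the Glue database from this article. Athena writes query results to an S3 location, so the sketch includes a results bucket. The bucket name, workgroup name, query name, and the table name `example_table` are all hypothetical; use the table name your crawler actually creates.

```hcl
# Sketch only: names below are hypothetical placeholders.
resource "aws_s3_bucket" "athena_results" {
  # S3 bucket names must be globally unique; replace with your own.
  bucket = "example-athena-query-results"
}

# Workgroup that tells Athena where to write query results.
resource "aws_athena_workgroup" "lake" {
  name = "data_lake_example"

  configuration {
    result_configuration {
      output_location = "s3://${aws_s3_bucket.athena_results.bucket}/results/"
    }
  }
}

# Saved query that reads straight from S3 through the Glue table definition.
resource "aws_athena_named_query" "sample" {
  name      = "sample_rows"
  workgroup = aws_athena_workgroup.lake.id
  database  = aws_glue_catalog_database.athena_db.name
  query     = "SELECT * FROM example_table LIMIT 10;"
}
```

Nothing here copies the data anywhere: the query scans the objects in the landing bucket, and only the result set is written to the results bucket.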

Diagram version 2: Data lake

Diagram final version: Data lake

It is time for a mini break. Please stay tuned for Part 3.
