AWS Data Lake with Terraform - Part 2 of 6

#aws #terraform #awsdatalake #bigdata

AWS Glue

By this point you are halfway to completing your data lake foundation. Let’s recap what you have achieved so far. First you created individual Terraform templates for your services, tested them, added security policies, and streamed your data from EC2 to your S3 bucket landing zone using Kinesis. Now you need a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load the data for analytics. AWS Glue is perfect for this job. With AWS Glue you can catalog your data in two ways: define the table manually, or let an AWS Glue crawler do it for you.

Once your database is ready you can run the Glue crawler, which after a minute or two extracts the metadata from your S3 bucket into a tidy table schema. Glue may not give this schema meaningful headers or partition names, so you may need to edit them manually. Nonetheless, Glue is able to recognize the data types in the schema (e.g. string, bigint, double).

Terraform Glue section

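# Glue Data Catalog database that will hold the crawled table(s)
resource "aws_glue_catalog_database" "athena_db" {
  name = "athena_db"
}

# Crawler that scans the S3 bucket and writes the inferred schema
# into the catalog database above
resource "aws_glue_crawler" "glue_crawler" {
  database_name = aws_glue_catalog_database.athena_db.name
  name          = "s3_crawler"
  # IAM role the crawler assumes; it needs read access to the bucket
  role          = aws_iam_role.crawler_role.arn

  # Crawl the whole bucket
  s3_target {
    path = "s3://${aws_s3_bucket.data_logs.bucket}"
  }
}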
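
If you would rather take the manual route mentioned earlier, or you want proper column names instead of editing the crawled schema by hand, one option is to declare the table yourself. The snippet below is only a rough sketch: the table name, the three columns, and the JSON format are assumptions made for illustration, so swap in whatever matches your own data.

```hcl
# Sketch only: the table name, the columns, and the JSON format are
# assumptions for illustration; adjust them to match your actual data.
resource "aws_glue_catalog_table" "logs_manual" {
  name          = "data_logs_manual" # hypothetical table name
  database_name = aws_glue_catalog_database.athena_db.name
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    classification = "json" # assumed data format
  }

  storage_descriptor {
    # Same bucket the crawler scans
    location      = "s3://${aws_s3_bucket.data_logs.bucket}/"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      serialization_library = "org.openx.data.jsonserde.JsonSerDe"
    }

    # Hypothetical columns, named explicitly instead of relying on inferred headers
    columns {
      name = "event_time"
      type = "string"
    }
    columns {
      name = "request_count"
      type = "bigint"
    }
    columns {
      name = "latency_ms"
      type = "double"
    }
  }
}
```

Declaring the table this way keeps the column names under version control, at the cost of having to update the Terraform whenever the shape of the data changes.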

Now that you have a structured table in AWS Glue for the data stored in your S3 bucket, you can start treating S3 as a data lake database. Next stop: Athena.

Amazon Athena

Athena is a serverless service, so you do not need to worry about managing any infrastructure. This is super cool, personally speaking. Athena is also an interactive query service for S3 that gives you a console to query S3 data with standard SQL. Athena also supports a variety of data formats, such as:

CSV
JSON
ORC
Parquet
Avro
TSV

Athena interface example shown below:

With Athena there is no need to load your data out of S3. What do I mean by this? The data stays in S3, and Athena uses the table schema in the Glue Data Catalog to interpret the data and query it interactively. For those who are familiar with Presto: Athena uses Presto under the hood.
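
To make this concrete, here is a minimal sketch of how the Athena side could look in Terraform. It is not part of the original templates: the results bucket, the workgroup name, and the `data_logs` table name in the query are all placeholders, so replace them with your own values (in particular, use the table name your crawler actually created).

```hcl
# Hypothetical bucket for Athena query results (not part of the original setup)
resource "aws_s3_bucket" "athena_results" {
  bucket = "my-athena-query-results" # placeholder; bucket names must be globally unique
}

# Workgroup that tells Athena where to write query results
resource "aws_athena_workgroup" "data_lake" {
  name = "data_lake" # hypothetical workgroup name

  configuration {
    result_configuration {
      output_location = "s3://${aws_s3_bucket.athena_results.bucket}/results/"
    }
  }
}

# Saved query that reads the data in place from S3 via the Glue catalog
resource "aws_athena_named_query" "sample" {
  name      = "sample_rows"
  workgroup = aws_athena_workgroup.data_lake.id
  database  = aws_glue_catalog_database.athena_db.name
  # "data_logs" is an assumed table name; use whatever the crawler created
  query     = "SELECT * FROM data_logs LIMIT 10;"
}
```

Notice that nothing here copies or imports the data. The query only points at the Glue database, and Athena reads the underlying files straight from S3 when it runs.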

Diagram version 2: Data lake