AWS Data Lake with Terraform - Part 3 of 6

#awsdatalake #terraform #awsredshift #awsbigdata

Amazon Redshift

Amazon Redshift is AWS's massively parallel processing (MPP) data warehouse, designed for large-scale data sets. A very useful feature that Redshift offers is Amazon Redshift Spectrum. In this section I will give a high-level summary of why it is so powerful.

Redshift Parallel Processing

Let’s start by reiterating that Amazon Redshift Spectrum is serverless: there is nothing to provision or manage for the Spectrum layer itself. You pay only for the Redshift Spectrum queries you run, based on the amount of data they scan in Amazon S3.

Redshift Spectrum is a query processing engine that lets you analyze data stored in Amazon S3 using standard Structured Query Language (SQL), without any ETL processing.

What do I mean by analyzing data that is sitting in Amazon S3?

I mean that you do not need to load the data into Amazon Redshift storage, because the data always lives on Amazon S3. You can pull, aggregate and filter all sorts of data using Amazon Redshift Spectrum. Remember that Amazon Redshift Spectrum is serverless, so you do not need to worry about anything other than your code.
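To make "data that lives on S3" a little more concrete, here is a minimal sketch of how such data could be described as an external table in the AWS Glue Data Catalog, which is where Redshift Spectrum looks up table definitions. Everything named here is an assumption for illustration only: the catalog database `data_logs_catalog`, the table `access_logs`, the bucket path, the two columns and the Parquet format are all hypothetical and not part of this series' setup.

```hcl
# Hypothetical Glue Data Catalog database that holds the external table.
resource "aws_glue_catalog_database" "data_logs" {
  name = "data_logs_catalog"
}

# Hypothetical external table: only metadata is stored here,
# the rows themselves stay in S3 as Parquet files.
resource "aws_glue_catalog_table" "access_logs" {
  name          = "access_logs"
  database_name = aws_glue_catalog_database.data_logs.name
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    classification = "parquet"
  }

  storage_descriptor {
    # Placeholder bucket and prefix, replace with your own data lake path.
    location      = "s3://example-data-lake-bucket/access_logs/"
    input_format  = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"

    ser_de_info {
      serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }

    columns {
      name = "request_time"
      type = "timestamp"
    }

    columns {
      name = "status_code"
      type = "int"
    }
  }
}
```

The point of the sketch is that nothing is copied into Redshift: the table is just a pointer to files in S3 plus a schema, and Spectrum reads those files at query time.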

Redshift clusters

Redshift Terraform

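# Pass the password in at plan/apply time instead of committing it to source control.
variable "redshift_master_password" {
  type      = string
  sensitive = true
}

resource "aws_redshift_cluster" "data_logs_db" {
  cluster_identifier  = "data-logs-cluster"
  database_name       = "data_logs_db"
  master_username     = "admin"
  master_password     = var.redshift_master_password # 8+ characters; Redshift does not accept "@" here
  node_type           = "dc2.large"                  # dc1 is a previous-generation node type
  cluster_type        = "single-node"
  skip_final_snapshot = true                         # lets "terraform destroy" run without a final snapshot
}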
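
Redshift Spectrum queries are issued from a Redshift cluster, and the cluster needs an IAM role that allows it to read S3 and the Glue Data Catalog. The following is a rough sketch of how that role could be attached to the cluster above, not a definitive setup. The role name `data-logs-spectrum-role`, the resource labels and the output name are hypothetical; the two AWS managed policies are deliberately broad for a demo and should be narrowed for real workloads. It also assumes an AWS provider version that includes the `aws_redshift_cluster_iam_roles` resource.

```hcl
# Hypothetical role that Redshift assumes when it runs Spectrum queries.
resource "aws_iam_role" "spectrum" {
  name = "data-logs-spectrum-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "redshift.amazonaws.com" }
    }]
  })
}

# Read access to the data files in S3 (broad managed policy, demo only).
resource "aws_iam_role_policy_attachment" "spectrum_s3_read" {
  role       = aws_iam_role.spectrum.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
}

# Access to table definitions in the Glue Data Catalog (broad managed policy, demo only).
resource "aws_iam_role_policy_attachment" "spectrum_glue" {
  role       = aws_iam_role.spectrum.name
  policy_arn = "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess"
}

# Attach the role to the cluster defined above.
resource "aws_redshift_cluster_iam_roles" "data_logs_db" {
  cluster_identifier = aws_redshift_cluster.data_logs_db.cluster_identifier
  iam_role_arns      = [aws_iam_role.spectrum.arn]
}

# The ARN is needed later when creating the external schema in SQL.
output "spectrum_role_arn" {
  value = aws_iam_role.spectrum.arn
}

# Once applied, an external schema can be created from a SQL client, for example
# (schema name and placeholders are hypothetical):
#
#   create external schema spectrum_logs
#   from data catalog
#   database '<glue_database_name>'
#   iam_role '<spectrum_role_arn>'
#   create external database if not exists;
```

Keeping the role in its own resources, rather than inside the cluster block, means the Spectrum permissions can be changed or removed without touching the cluster definition.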

Diagram version 3: Data lake

Diagram final version: Data lake

Please stay tuned for part 4.

DEV Community
