Big data has been a growing topic for a while now, and it is obvious that data is powerful. Data is indeed the new oil, and businesses everywhere are investing in data research. There are many terms nowadays that describe data and how it is organized, and a data lake is one of them. So, what is it?
In simple words, a data lake is a centralized repository that collects, stores, and organizes huge volumes of data, including structured and semi-structured data. It allows multiple organizational units (OUs) to explore and investigate their current business state in minutes, and it gives users the ability to run ad-hoc analysis over diverse processing engines such as serverless, in-memory processing, interactive queries, and batch jobs.
The challenge
In this series of blog posts I will explain how I translated the MVP core services for a large e-commerce company into Infrastructure as Code (IaC) using Terraform, to allow for fast and repeatable deployments, efficient testing, and reduced recovery time in case of an unplanned event. Version one of this Data Lake architecture uses the following services:
- EC2 for elastic compute
- Kinesis to collect, process, and analyze data streams in real time (or near real time)
- S3 for the data landing and data consumption zones
Each of these services is a huge topic in its own right, so throughout this article I will highlight how they work and how I integrated them.
Diagram (final version): Data Lake
What method will we be using to deploy this infrastructure?
We will be deploying this infrastructure as code (IaC) using Terraform. Below is the EC2 instance that holds the log data; its user data script installs the Kinesis agent at boot.
resource "aws_instance" "logs" {
count = var.ec2_count
ami = "ami-0742b4e673072066f"
instance_type = "t2.micro"
subnet_id = aws_subnet.dlogssub.id
associate_public_ip_address = true
vpc_security_group_ids = [aws_security_group.web_sg.id]
depends_on = [aws_internet_gateway.bigdataigw]
key_name = aws_key_pair.logskey.key_name
iam_instance_profile = aws_iam_instance_profile.ec2_profile.name
user_data = <<-EOF
#!/bin/bash -xe
yum -y update
yum install -y aws-kinesis-agent
EOF
tags = {
"Name" = "ec2-app-02"
}
}
Terraform's new default_tags feature
provider "aws" {
default_tags {
tags = {
Enviroment = "DataLake-test"
Project = "DataLake-infrastructure"
}
}
region = "us-east-1"
}
Amazon Elastic Compute Cloud (EC2)
EC2 is the backbone of this infrastructure: it holds the e-commerce company's large data logs while the business analysis takes place, and it provides resizable compute capacity for the environment. You can spin up a new server optimized for your workload in minutes and rapidly scale it up or down as your computing requirements change.
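The instance resource above also references aws_iam_instance_profile.ec2_profile, which is not shown in that snippet. Below is a minimal sketch of what that profile could look like, assuming the only permission the instance needs is to let the Kinesis agent put records into Firehose; the names and policy scope are placeholders, and the actual repo may differ.

# Role the EC2 instance assumes (hypothetical names)
resource "aws_iam_role" "ec2_role" {
  name = "datalake-ec2-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# Allow the Kinesis agent on the instance to push records into Firehose
resource "aws_iam_role_policy" "firehose_put" {
  name = "firehose-put"
  role = aws_iam_role.ec2_role.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["firehose:PutRecord", "firehose:PutRecordBatch"]
      Resource = "*" # in a real setup, scope this to the delivery stream ARN
    }]
  })
}

# The instance profile referenced by the aws_instance above
resource "aws_iam_instance_profile" "ec2_profile" {
  name = "datalake-ec2-profile"
  role = aws_iam_role.ec2_role.name
}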
Amazon Kinesis
Kinesis plays a dual role within this infrastructure. First, a Kinesis Data Firehose delivery stream captures data from a server log generated on the Amazon EC2 instance and delivers it to the data lake landing zone in your Amazon S3 bucket. Second, the Amazon Kinesis agent running on the instance publishes that data ("direct put") into the Firehose delivery stream.
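As a rough sketch of how the agent could be pointed at the delivery stream: the agent reads a JSON configuration from /etc/aws-kinesis/agent.json, which could be rendered in Terraform and written by the instance's user data. The log path and stream name below are hypothetical.

locals {
  # Hypothetical Kinesis agent configuration, rendered as JSON
  kinesis_agent_config = jsonencode({
    flows = [
      {
        filePattern    = "/var/log/ecommerce/app*.log" # hypothetical log location on the instance
        deliveryStream = "datalake-firehose"           # must match the Firehose delivery stream name
      }
    ]
  })
}

Once the agent is running, its log shows progress lines like the one below: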
2021-06-01 02:13:11.683+000 (Agent.MetricsEmitter RUNNING) com.amazon.kinesis.streaming.agent.Agent [INFO] Agent: Progress: 500000 records parsed (42036691 bytes), and 500000 records sent successfully to destinations. Uptime: 330039ms
A powerful mechanism that Kinesis Firehose offers is the ability to configure how your data is written to S3, based on buffer size and buffer interval. For this project I selected a buffer size of 5 megabytes, meaning incoming data from Firehose is split into files of roughly five megabytes each, and I set the buffer interval to the lowest allowed value, 60 seconds. Tip to remember: Kinesis Firehose is "near real-time" and cannot go lower than that.
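Putting it together, a minimal sketch of the delivery stream with those buffering values might look like the following. The stream, role, and bucket names are assumptions, and note that older AWS provider versions (before 5.0) name the buffering arguments buffer_size and buffer_interval.

resource "aws_kinesis_firehose_delivery_stream" "datalake" {
  name        = "datalake-firehose" # hypothetical stream name
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = aws_iam_role.firehose_role.arn # assumed IAM role allowing Firehose to write to S3 (not shown)
    bucket_arn = aws_s3_bucket.landing_zone.arn # assumed landing-zone bucket (see the S3 section below)

    # 5 MB buffer size and 60 second buffer interval, as chosen above
    buffering_size     = 5
    buffering_interval = 60
  }
}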
Amazon Simple Storage Service (S3)
S3 is the go-to data lake storage solution because of its cost-effective, secure storage with 11 9s of durability and its virtually unlimited scalability. It makes sense to store your vast data logs in S3, don't you think?
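For this architecture, the landing and consumption zones could be modeled as two buckets along the lines of the sketch below. The bucket names are placeholders (S3 bucket names must be globally unique), and the separate versioning resource assumes AWS provider v4 or later.

resource "aws_s3_bucket" "landing_zone" {
  bucket = "datalake-landing-zone-example" # placeholder name
}

resource "aws_s3_bucket" "consumption_zone" {
  bucket = "datalake-consumption-zone-example" # placeholder name
}

# Versioning on the landing zone guards against accidental overwrites of raw data
resource "aws_s3_bucket_versioning" "landing_zone" {
  bucket = aws_s3_bucket.landing_zone.id
  versioning_configuration {
    status = "Enabled"
  }
}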
The goal for individuals or businesses using this data lake solution would be to integrate Amazon S3 with Amazon Kinesis, Amazon Athena, Amazon Redshift Spectrum, and AWS Glue so that data scientists and engineers can query and process large amounts of data.
It is important to note that this infrastructure is not fully developed yet. I will be adding other services such as AWS Glue, Amazon Athena, Amazon Redshift, Amazon CloudWatch, and Amazon QuickSight 😊 so please stay tuned.
Terraform functions, arguments, and expressions used in the above project:
- providers
- variables
- modules
- resources
- types and values
- splat or [*] – one of my favorites (see the sketch just below this list)
- default_tags in the Terraform AWS provider – new feature
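As a quick illustration of the splat expression, an output along these lines (a sketch; not necessarily present in the repo) collects the public IP of every instance that the count argument on aws_instance.logs creates:

output "logs_public_ips" {
  description = "Public IPs of all EC2 instances created through count"
  value       = aws_instance.logs[*].public_ip
}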
Find the Terraform repo and directions for this project here
I would like to give a big shout-out to my mentor Derek Morgan. Thank you for all of your support these past months and for the amazing course "More Than Certified in Terraform", the best course out there. Link to the course here. If you want to connect with him and ask questions about his course, contact him via LinkedIn (Derek Morgan) or join the Discord channel here.