Barbara

Cloud computing quickstart for data engineering

What

Cloud computing is the use of a network of remote servers hosted on the internet to store, manage and process data.

  • no need to invest in hardware upfront
  • rapid provisioning of resources
  • efficient global access through deployments in different regions

Major cloud providers include Amazon, Microsoft, Google, Alibaba, Oracle and IBM. As Amazon is the largest, this overview uses its platform to cover the basics needed for data engineering.

AWS - Amazon Web Services

AWS offers more than 140 services for computation, storage, databases, networking and development tools.
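The services can be accessed in 3 ways:

  • the AWS Management Console (web interface)
  • the AWS Command Line Interface (CLI)
  • the AWS Software Development Kits (SDKs), for example boto3 for Python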

With over a hundred services available, AWS can feel overwhelming at first sight. To make the start a bit easier, the glossary below covers the services you will need for data engineering. There are many more services than the ones mentioned here, so feel free to dive deeper into the AWS documentation.

IAM - Identity and Access Management

User

A user is an entity (a person or an application) that interacts with AWS.

Role

A role is a set of permissions that can be assumed by anyone who needs it. Unlike a user, it is not uniquely tied to one person or application.
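As a minimal sketch of what "can be assumed by whoever needs it" means in practice, the snippet below builds the trust policy document that names who may assume a role. The helper name `build_trust_policy` and the choice of Redshift as the service are assumptions made for illustration; they are not taken from the original post.

```python
import json


def build_trust_policy(service_principal: str) -> dict:
    """Return a trust policy that lets one AWS service assume a role.

    `service_principal` is an illustrative parameter,
    for example "redshift.amazonaws.com".
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # IAM expects the policy as a JSON string, so serialize the dict.
    policy = build_trust_policy("redshift.amazonaws.com")
    print(json.dumps(policy, indent=2))
```

The permissions the role grants are attached separately; the trust policy only states who is allowed to take on the role.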

VPC - Virtual Private Cloud

A VPC lets you launch AWS resources in a virtual network that you define according to your needs. It resembles the network of a traditional data center, with the benefits of cloud infrastructure.
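Defining such a network usually starts with choosing an address range and dividing it into subnets. The sketch below does that planning step with Python's standard `ipaddress` module only; it does not call AWS. The `10.0.0.0/16` range and the helper name `plan_subnets` are assumptions for illustration.

```python
import ipaddress
from itertools import islice
from typing import List


def plan_subnets(vpc_cidr: str, new_prefix: int, count: int) -> List[str]:
    """Split a VPC address range into the first `count` equally sized subnets."""
    network = ipaddress.ip_network(vpc_cidr)
    # subnets() yields sub-networks lazily; islice stops after `count` of them.
    return [str(subnet) for subnet in islice(network.subnets(new_prefix=new_prefix), count)]


if __name__ == "__main__":
    for cidr in plan_subnets("10.0.0.0/16", new_prefix=24, count=3):
        print(cidr)
```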

S3 - Simple Storage Service

S3 stores any number of objects in buckets and lets you retrieve them at any time. Several storage classes are available, depending on how often the data is accessed and how much you want to pay.

S3 Buckets

A bucket is a container for objects. Useful bucket properties include:

  • Versioning: keep multiple versions of an object in the same bucket
  • Static website hosting: a very cost-effective way to serve static web content
  • Requester pays: makes the requester pay for requests and data transfer costs
  • Permission management
  • Data management: lifecycle rules that transition, archive or delete data
  • Metrics for usage, request, data transfer, bucket size, number of objects
  • Access points: Create access points to share the bucket at scale
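Two of the properties above, versioning and lifecycle rules, can be set programmatically. The following is a minimal sketch using boto3; the helper names, the rule ID, and the example day counts are assumptions for illustration, and `apply_bucket_settings` requires boto3, AWS credentials, and an existing bucket.

```python
def build_lifecycle_rule(prefix: str, archive_after_days: int, delete_after_days: int) -> dict:
    """Build a lifecycle configuration: archive objects first, delete them later."""
    if delete_after_days <= archive_after_days:
        raise ValueError("objects must be archived before they are deleted")
    return {
        "Rules": [
            {
                "ID": "archive-then-expire",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": delete_after_days},
            }
        ]
    }


def apply_bucket_settings(bucket_name: str, lifecycle: dict) -> None:
    """Enable versioning and attach the lifecycle rules to an existing bucket."""
    import boto3  # imported lazily so the builder above works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_versioning(
        Bucket=bucket_name,
        VersioningConfiguration={"Status": "Enabled"},
    )
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=lifecycle,
    )
```

A call such as `apply_bucket_settings("my-example-bucket", build_lifecycle_rule("logs/", 30, 365))` would archive objects under `logs/` after 30 days and delete them after a year; the bucket name is a placeholder.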

S3 Objects

An object is a file together with any metadata that describes that file.

EC2 - Elastic Compute Cloud

A web service that provides secure, resizable compute capacity in the cloud. If we prefer to manage things ourselves, we can run EC2 + PostgreSQL or EC2 + a Unix file system instead of using the managed services Amazon RDS, Amazon DynamoDB and Amazon S3.

RDS - Relational Database Service

A managed relational database service that takes care of common database administration tasks, offers resizable capacity, and is cost-efficient.

Redshift

  • column-oriented storage
  • MPP (massively parallel processing) database
  • well suited to OLAP workloads, such as summing over a long history
  • internally based on a modified PostgreSQL
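To see why column-oriented storage suits sums over a long history, the toy sketch below stores the same records row by row and column by column. It is a plain Python illustration with made-up data, not a description of how Redshift is implemented: the point is only that an aggregate over one column needs to touch just that column.

```python
# Row-oriented layout: each record keeps all of its fields together.
ROWS = [
    {"order_id": 1, "country": "AT", "amount": 20.0},
    {"order_id": 2, "country": "DE", "amount": 35.5},
    {"order_id": 3, "country": "AT", "amount": 12.5},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_amount_row_store(rows):
    """Visits every record, even though only one field is needed."""
    return sum(row["amount"] for row in rows)


def total_amount_column_store(columns):
    """Reads a single contiguous column and ignores the others."""
    return sum(columns["amount"])


if __name__ == "__main__":
    columns = to_columns(ROWS)
    print(total_amount_row_store(ROWS), total_amount_column_store(columns))
```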

IaC - Infrastructure as Code Example with boto3
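The following is a minimal sketch of what infrastructure as code with boto3 can look like: it creates an IAM role that Redshift may assume, grants it read access to S3, and launches a Redshift cluster. All names, the region, the `dc2.large` node type, and the `REDSHIFT_PASSWORD` environment variable are assumptions for illustration, and running `provision` requires boto3, valid AWS credentials, and will create billable resources.

```python
import json
import os


def redshift_cluster_params(identifier, db_name, user, password, role_arn, nodes=2):
    """Build the keyword arguments for the Redshift create_cluster call."""
    params = {
        "ClusterIdentifier": identifier,
        "DBName": db_name,
        "MasterUsername": user,
        "MasterUserPassword": password,
        "NodeType": "dc2.large",  # assumed node type
        "IamRoles": [role_arn],
    }
    if nodes > 1:
        params.update(ClusterType="multi-node", NumberOfNodes=nodes)
    else:
        params["ClusterType"] = "single-node"
    return params


def provision(role_name, cluster_identifier, region="us-west-2"):
    """Create the role and the cluster. Not executed on import."""
    import boto3  # imported lazily so the builder above works without boto3

    iam = boto3.client("iam", region_name=region)
    redshift = boto3.client("redshift", region_name=region)

    # Trust policy: allow the Redshift service to assume the role.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
    )

    params = redshift_cluster_params(
        identifier=cluster_identifier,
        db_name="dev",
        user="admin",
        password=os.environ["REDSHIFT_PASSWORD"],  # never hard-code secrets
        role_arn=role["Role"]["Arn"],
    )
    return redshift.create_cluster(**params)
```

Keeping the parameter builder separate from the AWS calls means the configuration can be reviewed and tested without touching an account; `provision("my-redshift-role", "my-cluster")` would then apply it, with both names being placeholders.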
