Kinyungu Denis

Learning Boto3 and AWS Services the right way in Data Engineering.

Greetings to my esteemed readers!

In this article we will learn about AWS Boto3 and use it together with other AWS services. It will also cover AWS services that are essential in data engineering. The prerequisites for this article are just basic knowledge of Python and AWS services.

What is AWS Boto3?

Boto3 is the Amazon Web Services (AWS) SDK for Python.
Boto3 is your new friend when it comes to creating Python scripts for AWS resources.
It allows you to directly create, configure, update, and delete AWS resources from your Python scripts. Boto3 provides an easy-to-use, object-oriented API, as well as low-level access to AWS services.

How to install and configure Boto3

Before you install Boto3, you should have Python version 3.7 or later.
To install Boto3 via pip:

pip install boto3

You can also install it using Anaconda, if you want it in your Anaconda environment:

conda install -c anaconda boto3

You can also install it in Google Colab to perform your operations in the cloud:

!pip install boto3

Before using Boto3, you need to set up authentication credentials for your AWS account using either the AWS IAM Console or the AWS CLI. You can either choose an existing user or create a new one.

If you have the AWS CLI installed, use the aws configure command to configure your credentials file:

aws configure

You can also create the credentials file yourself. By default, its location is ~/.aws/credentials. The credentials file should specify the access key and secret access key. Replace YOUR_ACCESS_KEY_ID with your user's access key ID and YOUR_SECRET_ACCESS_KEY with your user's secret access key.

[default] 
aws_access_key_id = YOUR_ACCESS_KEY_ID 
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

Save the file.
Now that you have set up these credentials, you have a default profile, which will be used by Boto3 to interact with your AWS account.
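If you keep more than one profile in your credentials file, you can also point Boto3 at a specific one. The sketch below assumes a hypothetical profile named dev exists alongside default; the STS call is just a quick way to confirm which account and region the credentials resolve to.

import boto3

# Assumes a hypothetical [dev] profile exists in ~/.aws/credentials
dev_session = boto3.session.Session(profile_name='dev')

# Quick sanity check: which region and account do these credentials resolve to?
print(dev_session.region_name)
sts = dev_session.client('sts')
print(sts.get_caller_identity()['Account'])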

Boto3 SDK features

Session
A session manages state about a particular configuration. By default, a session is created for you when needed. However, it's possible for you to maintain your own session. Sessions store the following:

  • Credentials
  • AWS Region
  • Other configurations related to your profile

Default Session

Boto3 acts as a proxy to the default session. This is created when you create a low-level client or resource client:

import boto3

# Using the default session
rds = boto3.client('rds')
s3 = boto3.resource('s3')

Custom Session

You can also manage your own session and create low-level clients or resource clients from it:

import boto3
import boto3.session

# Create your own session
current_session = boto3.session.Session()

# Now we can create low-level clients or resource clients from our custom session
rds = current_session.client('rds')
s3 = current_session.resource('s3')

Clients

Clients provide a low-level interface to AWS whose methods map roughly 1:1 to the service APIs. All service operations are supported by clients. Clients are generated from a JSON service definition file.

import boto3

# Create a low-level client with the service name
s3 = boto3.client('s3')

To access a low-level client from an existing resource:

# Create the resource
s3_resource = boto3.resource('s3')

# Get the client from the resource
s3 = s3_resource.meta.client

Resources

Resources represent an object-oriented interface to Amazon Web Services (AWS). They provide a higher-level abstraction than the raw, low-level calls made by service clients. To use resources, you invoke the resource() method of a Session and pass in a service name:

import boto3

# Get a resource from the default session
s3 = boto3.resource('s3')

Every resource instance has a number of attributes and methods.
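For example, here is a minimal sketch of one resource instance, a Bucket; the bucket name and file path are placeholders:

import boto3

s3 = boto3.resource('s3')

# A Bucket is a resource instance identified by its name
bucket = s3.Bucket('sample-bucket')

# Attribute: loaded lazily from the service when first accessed
print(bucket.creation_date)

# Method: acts directly on the underlying AWS resource
bucket.upload_file('/tmp/data.csv', 'data.csv')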

Collections

A collection provides an iterable interface to a group of resources. A collection seamlessly handles pagination for you, making it possible to easily iterate over all items from all pages of data.

import boto3

# List all S3 buckets using the buckets collection
s3 = boto3.resource('s3')
for bucket in s3.buckets.all():
    print(bucket.name)

Paginators

Pagination refers to the process of sending subsequent requests to continue where a previous request left off, because some AWS operations return truncated results.

import boto3

# Create a client
client = boto3.client('s3', region_name='ap-south-1')

# Create a reusable Paginator
paginator = client.get_paginator('list_objects')

# Create a PageIterator from the Paginator
page_iterator = paginator.paginate(Bucket='sample-bucket')

for page in page_iterator:
    print(page['Contents'])

Client vs Resource: which one should you use?

Resources offer higher-level, object-oriented service access, whereas clients offer low-level service access.

The question is, “Which one should I use?”

Understanding how the client and the resource are generated helps you decide which one to choose:

Boto3 generates the client from a JSON service definition file. The client’s methods support every single type of interaction with the target AWS service.
Resources, on the other hand, are generated from JSON resource definition files.

Boto3 generates the client and the resource from different definitions. As a result, you may find cases in which an operation supported by the client isn’t offered by the resource.

  • With clients, there is more programmatic work to be done. The majority of the client operations give you a dictionary response. To get the exact information that you need, you’ll have to parse that dictionary yourself. With resource methods, the SDK does that work for you.
  • With the client, you might see some slight performance improvements. The disadvantage is that your code becomes less readable than it would be if you were using the resource. Resources offer a better abstraction, and your code will be easier to comprehend, as the short comparison below shows.
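To make the trade-off concrete, here is a rough sketch of the same task, listing bucket names, done both ways:

import boto3

# Client: you parse the response dictionary yourself
s3_client = boto3.client('s3')
response = s3_client.list_buckets()
names_from_client = [b['Name'] for b in response['Buckets']]

# Resource: the SDK hands you objects with attributes instead
s3_resource = boto3.resource('s3')
names_from_resource = [bucket.name for bucket in s3_resource.buckets.all()]

print(names_from_client)
print(names_from_resource)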

Amazon S3

Amazon S3 is an object storage platform that allows you to store and retrieve any amount of data at any time. It is a storage service that makes web-scale computing easier for users and developers.

Storage

S3 offers four main storage classes, with unlimited data storage capacity:

  • S3 Standard
  • S3 Standard-Infrequent Access (also known as S3 Standard-IA)
  • S3 One Zone-Infrequent Access
  • Glacier

Amazon S3 Standard

S3 Standard offers high-durability, high-availability, and high-performance object storage for frequently accessed data. It delivers low latency and high throughput. It is perfect for a wide variety of use cases, including cloud applications, dynamic websites, content distribution, mobile applications, and big data analytics.

For example, a web application collecting farm video uploads: with unlimited storage, there will never be a disk size issue.

S3 Standard-Infrequent Access (IA)

S3 Standard-IA is designed for data that is accessed less frequently but requires rapid access when needed. It offers the high durability, high throughput, and low latency of S3 Standard, with a low per-GB storage price and a per-GB retrieval fee. This combination of low cost and high performance makes S3 Standard-IA ideal for long-term storage, backups, and as a data store for disaster recovery.

For example, the web application collecting farm video uploads on a daily basis will soon have videos that are rarely watched, such as year-old farm videos. With IA we can move those objects to a different storage class without affecting their durability, as sketched below.
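As a rough sketch of that idea, the storage class can be set when an object is uploaded, or changed later by copying the object onto itself; the bucket and key names here are placeholders, and in practice lifecycle rules are the usual way to automate such transitions.

import boto3

s3 = boto3.client('s3')

# Upload a new object straight into Standard-IA (placeholder names)
with open('old-farm-video.mp4', 'rb') as video:
    s3.put_object(
        Bucket='sample-bucket',
        Key='videos/old-farm-video.mp4',
        Body=video,
        StorageClass='STANDARD_IA'
    )

# Move an existing object to Standard-IA by copying it onto itself
s3.copy_object(
    Bucket='sample-bucket',
    Key='videos/year-old-video.mp4',
    CopySource={'Bucket': 'sample-bucket', 'Key': 'videos/year-old-video.mp4'},
    StorageClass='STANDARD_IA'
)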

S3 One Zone-IA

S3 One Zone-IA is designed for data that is accessed less frequently but requires rapid access when needed. Unlike the other storage classes, it stores data in a single Availability Zone (AZ). Because of this, storing data in S3 One Zone-IA costs 20% less than storing it in S3 Standard-IA. It's a good choice for storing secondary backup copies of on-premises data or easily re-creatable data.

S3 Reduced Redundancy Storage

Reduced Redundancy Storage (RRS) is an Amazon S3 storage option that enables customers to store noncritical, reproducible data at lower levels of redundancy than Amazon S3's standard storage.

Amazon Glacier

Amazon Glacier is a secure, durable, and extremely low-cost storage service for data archiving. Customers can store data for as little as $0.004 per gigabyte per month. To keep costs low yet suitable for varying retrieval needs, Amazon Glacier provides different options for access to archives, from a few minutes to several hours.

Object Store

Amazon S3 is a simple key-value store designed to store as many objects as you want. You store these objects in one or more buckets. An object consists of the following (see the short sketch after this list):

  • Key — The name that you assign to an object. You use the object key to retrieve the object.
  • Version ID — Within a bucket, a key and version ID uniquely identify an object.
  • Value — The content that we are storing.
  • Metadata — A set of name-value pairs with which you can store information regarding the object.
  • Subresources — Amazon S3 uses the subresource mechanism to store object-specific additional information.
  • Access Control Information — We can control access to the objects in Amazon S3.
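Here is a small sketch that ties these pieces together by storing an object with custom metadata and reading it back; the bucket and key names are hypothetical.

import boto3

s3 = boto3.client('s3')

# Key + value + metadata: store content under a key with name-value pairs
s3.put_object(
    Bucket='sample-bucket',
    Key='farm/report.csv',
    Body=b'field,yield\nmaize,120\n',
    Metadata={'source': 'field-sensor', 'owner': 'data-team'}
)

# Read it back: the response exposes the value, the metadata,
# and a VersionId if versioning is enabled on the bucket
obj = s3.get_object(Bucket='sample-bucket', Key='farm/report.csv')
print(obj['Body'].read())
print(obj['Metadata'])
print(obj.get('VersionId'))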

Connect to Amazon S3

As long as the credentials file from above has been created, you should be able to connect to your S3 object storage.

import boto3

# High-level S3 resource (not a low-level client)
s3_resource = boto3.resource('s3')

Create and View Buckets

When creating a bucket there is a lot you can configure (location constraint, read access, write access), and you can use the client API to do that; here we use the high-level resource() API. Once we create a new bucket, let's view all the buckets available in S3.

# create a bucket with the given name (S3 bucket names must be globally
# unique and cannot contain underscores)
sampled_bucket = s3_resource.create_bucket(Bucket='sampled-bucket')

# view all buckets in S3
for bucket in s3_resource.buckets.all():
    print(bucket.name)

View Objects within a Bucket

Let's add objects to the bucket and then view all objects within it.

# point to bucket and add objects
sampled_bucket.put_object(Key='sampled/object1')
sampled_bucket.put_object(Key='sampled/object2')

# view objects within a bucket
for obj in sampled_bucket.objects.all():
     print(obj.key)

Upload, Download, and Delete Objects

Let's upload a CSV file, then view the objects within our bucket again.

# upload local csv file to a specific s3 bucket
local_file_path = '/Users/Desktop/data.csv'
key_object = 'sampled/data.csv'

sampled_bucket.upload_file(local_file_path, key_object)
for obj in sampled_bucket.objects.all():
    print(obj.key)
# download an s3 file to local machine
filename = 'downloaded_s3_data.csv'

sampled_bucket.download_file(key_object, filename)

Now let's delete some of these objects. You can either delete a specific object or delete all objects within a bucket.

# delete a specific object
sampled_bucket.Object('sampled/object2').delete()
# delete all objects in a bucket
sampled_bucket.objects.delete()

You can only delete an empty bucket, so before deleting a bucket, ensure it contains no objects.

# delete specific bucket
sampled_bucket.delete()

Bucket vs Object

A bucket has a name that is unique across all of S3, and it may contain many objects. An object's key is its full path from the bucket root, and it is unique within the bucket.

AWS Redshift

Amazon Redshift is a fully managed, columnar cloud data warehouse that you can use to run complex analytical queries on large datasets through massively parallel processing (MPP). The datasets can range from gigabytes to petabytes. It supports SQL and provides ODBC and JDBC interfaces.
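Since this article is centred on Boto3, note that Redshift clusters can be inspected and provisioned from Python too. A minimal sketch is shown below; the cluster identifier, node type, and credentials are placeholders, and creating a cluster incurs real cost.

import boto3

redshift = boto3.client('redshift', region_name='ap-south-1')

# Inspect existing clusters and their status
for cluster in redshift.describe_clusters()['Clusters']:
    print(cluster['ClusterIdentifier'], cluster['NodeType'], cluster['ClusterStatus'])

# Provision a small two-node cluster (placeholder values, billable resources)
redshift.create_cluster(
    ClusterIdentifier='sample-cluster',
    NodeType='dc2.large',
    NumberOfNodes=2,
    MasterUsername='awsuser',
    MasterUserPassword='Replace-With-A-Strong-Passw0rd',
    DBName='dev'
)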

AWS Redshift Architecture


The components of Redshift Architecture

Cluster
A cluster in Redshift is a set of one or more compute nodes. There are two types of nodes: the leader node and compute nodes. If a cluster has two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication with client applications.

Leader node
The leader node interacts with client applications and communicates with compute nodes to carry out operations. It parses queries and generates an execution plan to carry out database operations. Based on the execution plan, it compiles code, distributes the compiled code to all provisioned compute nodes, and assigns a portion of the data to each node.

Compute nodes
The leader node compiles each step of the execution plan and assigns it to the compute nodes. Compute nodes execute the compiled code and send intermediate results back to the leader node, which aggregates the final result for each client application request.
Each compute node has its own dedicated CPU, memory, and storage, which are essentially determined by the node type.

AWS Redshift provides two node types at a high level:

  • Dense storage nodes (ds1 or ds2)
  • Dense compute nodes (dc1 or dc2)

Node slices
Each compute node is further partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, where it carries out part of the workload assigned to the node. The leader node manages distributing data to the slices for queries and other database operations. All slices work in parallel to complete the operation.

Internal network
The internal network is used for communication between the leader node and compute nodes to perform various database operations. Redshift uses very high-bandwidth connections and custom communication protocols to provide high-speed, private, and secure communication between the leader node and compute nodes.

Databases
A cluster contains one or more databases. User data is stored on the compute nodes. Redshift provides the same functionality as a typical RDBMS, including OLTP functions such as DML; however, it is optimized for high-performance analysis and reporting on large datasets.

Connections
Redshift interacts with client applications using JDBC and ODBC drivers for PostgreSQL.

Client applications
AWS Redshift provides the flexibility to connect with various client tools such as ETL, business intelligence reporting, and analytics tools. Because it is based on industry-standard PostgreSQL, most existing SQL client applications are compatible and work with little or no change.

Redshift Distribution Keys

  • AUTO — if we do not specify a distribution style, Redshift assigns one based on the size of the data.
  • EVEN — rows are distributed across slices in a round-robin fashion. This is appropriate when the table does not participate in joins, or when there is no clear choice between KEY and ALL distribution. It spreads rows evenly without trying to cluster data that is accessed at the same time.
  • KEY — rows are distributed according to the values in one column. All the data with a specific key value is stored on the same slice.
  • ALL — the entire table is copied to every node. Appropriate for slow-moving (rarely updated) tables.

Sort Keys

A sort key is similar to an index and makes range queries fast.
Rows are stored on disk in sorted order based on the column you designate as the sort key. A sketch combining a distribution key and a compound sort key follows the list of sort key types below.

Types of sort keys:
  • Single column
  • Compound
  • Interleaved — gives equal weight to each column
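To make the distribution and sort key options concrete, here is a hedged sketch that creates a table with a KEY distribution style and a compound sort key through the Redshift Data API; the cluster, database, user, and table names are placeholders.

import boto3

redshift_data = boto3.client('redshift-data')

# Hypothetical sales table: distribute on the join column, sort by date
create_sql = """
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(10, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
COMPOUND SORTKEY (sale_date, customer_id);
"""

redshift_data.execute_statement(
    ClusterIdentifier='sample-cluster',
    Database='dev',
    DbUser='awsuser',
    Sql=create_sql
)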

Importing and Exporting Data

  • UNLOAD command (exporting) — unloads data from a table into files in S3.
  • COPY command (importing) — reads from multiple data files or data streams simultaneously; use COPY to load large amounts of data from outside of Redshift (see the sketch after this list).
  • Gzip and bzip2 compression are supported to speed loading up further.
  • Automatic compression option — analyzes the data being loaded and figures out the optimal compression scheme for storing it.
  • Special case: narrow tables (lots of rows, few columns) — load them with a single COPY transaction if possible.
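Below is a hedged sketch of both commands issued through the Redshift Data API; the S3 paths, IAM role ARN, and cluster details are all placeholders.

import boto3

redshift_data = boto3.client('redshift-data')

# COPY: bulk-load gzip-compressed CSV files from S3 into the sales table
copy_sql = """
COPY sales
FROM 's3://sample-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/sample-redshift-role'
FORMAT AS CSV
GZIP;
"""

# UNLOAD: export a query result back to S3, split across files in parallel
unload_sql = """
UNLOAD ('SELECT * FROM sales WHERE sale_date >= ''2022-01-01''')
TO 's3://sample-bucket/exports/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/sample-redshift-role'
FORMAT AS PARQUET;
"""

for sql in (copy_sql, unload_sql):
    redshift_data.execute_statement(
        ClusterIdentifier='sample-cluster',
        Database='dev',
        DbUser='awsuser',
        Sql=sql
    )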

Short Query Acceleration (SQA)

  • Prioritizes short-running queries over longer-running ones.
  • Short queries run in a dedicated space and won't wait in a queue behind long queries.
  • Can be used in place of Workload Management (WLM) queues for short queries.
  • Works with CREATE TABLE AS (CTAS) statements and read-only queries (SELECT statements).

Concurrency Scaling

  • Automatically adds cluster capacity to handle an increase in concurrent read queries.
  • Supports virtually unlimited concurrent users and queries.
  • WLM queues manage which queries are sent to the concurrency scaling cluster.

Vacuum Command

The VACUUM command recovers space from deleted rows and re-sorts rows:
  • VACUUM FULL — the default vacuum operation; re-sorts all the rows and reclaims space from deleted rows.
  • VACUUM DELETE ONLY — reclaims space from deleted rows without re-sorting.
  • VACUUM SORT ONLY — re-sorts the table but does not reclaim disk space.
  • VACUUM REINDEX — re-analyzes the interleaved sort key columns and then performs a full vacuum operation.

Resizing Redshift Cluster

Elastic resize
  • Quickly add or remove nodes of the same type.
  • The cluster is down for a few minutes.
  • Redshift tries to keep connections open across the downtime.
  • Limited to doubling or halving for some dc2 and ra3 node types.

Classic resize

  • Change the node type and/or the number of nodes.
  • The cluster is read-only for hours to days.

Snapshot, restore, resize

  • Used to keep the cluster available during a classic resize.
  • Copy the cluster, then resize the new cluster.

Operations on AWS Redshift

There are numerous operations we can perform on the database, such as querying; creating, modifying, or removing database objects and records; and loading and unloading data to and from Simple Storage Service (S3).

Query
Redshift allows you to use the SELECT statement to extract data from tables. You can extract specific columns and restrict rows based on given conditions using the WHERE clause. Data can be sorted in ascending or descending order. Redshift also allows extracting data using joins and subqueries, and calling built-in and user-defined functions.

Data Manipulation Language (DML)
Redshift allows you to modify data using the INSERT, UPDATE, and DELETE commands. DML statements require a COMMIT to be saved permanently in the database, or a ROLLBACK to revert the changes. A set of DML statements is known as a transaction. A transaction is completed when a COMMIT, a ROLLBACK, or any DDL statement is performed.

Loading and Unloading Data
Load and unload operations in Redshift are done with the COPY and UNLOAD commands. The COPY command copies data from files in S3, while UNLOAD dumps data into S3 buckets in various formats. COPY can load data into Redshift from data files or from multiple data streams simultaneously. Redshift recommends using COPY rather than INSERT for bulk inserts.

Amazon Redshift splits the result of a SELECT statement across a set of one or more files per node slice to simplify parallel loading of the data. While unloading the data into S3, files can be generated serially or in parallel. UNLOAD encrypts the data files using Amazon S3 server-side encryption (SSE-S3).

Data Definition Language
CREATE, ALTER, and DROP are a few of the commands used to create, modify, and delete databases, schemas, users, and database objects such as tables, views, stored procedures, and user-defined functions. TRUNCATE can be used to delete a table's data; it is faster than DELETE and releases the space immediately.

Grant, Revoke
Access can be shared with and restricted from different sets of users and user groups using the GRANT and REVOKE statements. Access can be granted individually or in the form of roles.

Functions
Functions are database objects with predefined code that performs a specific operation. They are stored in the database as precompiled code and can be used in SELECT statements, DML, and any expression. Functions provide reusability and avoid redundant code. There are two types of functions.

User defined functions
Redshift allows you to create a custom user-defined scalar function (UDF) using either a SQL SELECT clause or a Python program. User-defined functions are stored in the database and can be used by any user who has sufficient privileges to run them. Functions are created with the CREATE FUNCTION command.
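As a small, hedged sketch, here is a scalar Python UDF created through the Redshift Data API; the function itself and the cluster details are hypothetical.

import boto3

redshift_data = boto3.client('redshift-data')

# A scalar Python UDF that converts kilograms to pounds
udf_sql = """
CREATE OR REPLACE FUNCTION f_kg_to_lb (kg FLOAT)
RETURNS FLOAT
IMMUTABLE
AS $$
    return kg * 2.20462
$$ LANGUAGE plpythonu;
"""

redshift_data.execute_statement(
    ClusterIdentifier='sample-cluster',
    Database='dev',
    DbUser='awsuser',
    Sql=udf_sql
)

# Once created, it can be called like a built-in function, for example:
# SELECT f_kg_to_lb(weight_kg) FROM livestock;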

In-Built Functions

  • Character functions
  • Number and Math functions
  • JSON functions
  • Date Type formatting functions
  • Aggregate/Group Functions

Stored Procedures
Stored procedures can be created in Redshift using the PostgreSQL procedural language (PL/pgSQL). A stored procedure contains a set of queries and logical conditions in its block. Parameters in procedures can be of IN, OUT, or INOUT type. We can use DML, DDL, and SELECT statements in stored procedures. Stored procedures can be reused and remove duplicated pieces of code.

Use cases of Redshift:

Trading and Risk Management
To make decisions for future trades, set exposure limits, and mitigate risk against a counterparty. Redshift features such as data compression, result caching, and encryption options for securing critical data make it a suitable data warehouse solution for this industry.

Build a data lake for pricing data
Such data can be helpful for implementing price forecasting systems in the oil, gas, and power sectors. Redshift's columnar storage is a good fit for time series data.

Supply chain management
Supply chain systems generate huge amounts of data that are used in planning, scheduling, optimization, and dispatching. For querying and analyzing such huge volumes of data, features like parallel processing with powerful node types make Redshift a good option.

Conclusion

In this article, you learned about Boto3, AWS S3, and AWS Redshift. It is quite brief and provides only the basics of these services. You need to create your own AWS account and practice on your own to understand the concepts clearly.
