MariaZentsova

Posted on Jan 24, 2022

How to load data from S3 to AWS SageMaker

#aws #python #datascience #tutorial

AWS SageMaker is a great way to analyse data in the cloud and train machine learning models. The most convenient place to store data for machine learning and analysis is an S3 bucket, which can hold any type of data: csv, pickle, zip, or photos and videos.

Here I want to review how to load different data formats from S3.

AWS provides the boto3 library, which gives easy access to the AWS ecosystem of tools and products. SageMaker is part of that ecosystem, so it has easy access to S3.

One of the key concepts in boto3 is a resource, an object-oriented abstraction over the AWS API. Each resource instance has a unique identifier (for a bucket, its name), which we use to refer to it.

Let's create an S3 resource and get a handle to a bucket.

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket_id')

If you don't know your bucket name and can't look it up in the console, you can list all the buckets available to you.

# Use a separate loop variable so the `bucket` handle above is not overwritten
for b in s3.buckets.all():
    print(b.name)

Let's print the keys of all the files in our bucket.

for file in bucket.objects.all():
    print(file.key)
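
For a large bucket, printing every key is rarely what you want. Here is a minimal sketch of narrowing the listing to a key prefix. It assumes `bucket` is a boto3 `Bucket` resource like the one created above; the helper name `list_keys` and the prefix `'raw/'` in the usage note are made up for illustration.

```python
def list_keys(bucket, prefix=""):
    """Return the keys of all objects in `bucket` whose key starts with `prefix`.

    `bucket` is expected to be a boto3 Bucket resource (an assumption of this
    sketch). An empty prefix lists every object.
    """
    if prefix:
        # objects.filter(Prefix=...) asks S3 to return only the matching keys
        objects = bucket.objects.filter(Prefix=prefix)
    else:
        objects = bucket.objects.all()
    return [obj.key for obj in objects]
```

Calling `list_keys(bucket, 'raw/')` would then return only the keys that start with `raw/`.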

1. Loading csv or txt files

Data most often comes in csv or txt format, which is quite easy to load.

import pandas as pd

# Get the object from the S3 bucket using the low-level client API
s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket='process-news', Key='techcrunch_data.csv')

# obj['Body'] is a file-like stream that pandas can read directly
techcrunch_data = pd.read_csv(obj['Body'])

If you don't want to use boto3 directly, pandas can read csv data straight from an S3 path (this requires the s3fs package to be installed).

news_df = pd.read_csv('s3://bucket_name/your_data.csv')
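
Plain txt files can be read without pandas. Below is a minimal sketch that assumes the file is UTF-8 encoded. `read_text_lines` is a hypothetical helper; it accepts any object with a `read()` method that returns bytes, such as `obj['Body']` from `get_object`. An in-memory `BytesIO` stands in for the S3 body so the example runs without AWS access.

```python
from io import BytesIO

def read_text_lines(body, encoding="utf-8"):
    """Read a bytes stream to the end and return its lines as a list of strings."""
    return body.read().decode(encoding).splitlines()

# Stand-in for obj['Body']; with real S3 you would pass the body from get_object
fake_body = BytesIO(b"first line\nsecond line\n")
lines = read_text_lines(fake_body)
print(lines)  # ['first line', 'second line']
```

This reads the whole file into memory, which is fine for small text files but may not suit very large ones.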

2. Loading Pickle

Pickle is a compact binary format for serialized Python objects. Python's pickle module allows us to read this type of file from an s3.Object.

import pickle

data = pickle.loads(bucket.Object("your_file.pickle").get()['Body'].read())

Machine learning models can also be saved as pickle files.
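
Saving works the same way in reverse. The sketch below assumes `bucket` is a boto3 `Bucket` resource; `save_pickle` and `load_pickle` are hypothetical helper names, and the key in the usage note is a placeholder. Only unpickle files you trust, since loading a pickle can execute arbitrary code.

```python
import pickle

def save_pickle(bucket, key, obj):
    """Serialize `obj` with pickle and upload the bytes to `bucket` under `key`."""
    bucket.put_object(Key=key, Body=pickle.dumps(obj))

def load_pickle(bucket, key):
    """Download the object stored under `key` and deserialize it with pickle."""
    return pickle.loads(bucket.Object(key).get()["Body"].read())
```

For example, `save_pickle(bucket, 'models/model.pickle', model)` would upload a trained model, and `load_pickle(bucket, 'models/model.pickle')` would restore it later.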

3. Loading JSON

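JSON is a very popular format, and APIs most often return data in it.

import json

s3 = boto3.resource('s3')  # Object() belongs to the resource API
obj = s3.Object('test', 'sample_json.txt')
# Read the body and decode the bytes to text
content = obj.get()['Body'].read().decode('utf-8')
data = json.loads(content)  # don't name the result `json`, it would shadow the module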
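
A slightly shorter variant is to hand the body straight to `json.load`, which calls `read()` on whatever it is given and, on Python 3.6+, accepts UTF-8 bytes. This is a sketch under that assumption: `parse_json_body` is a hypothetical helper, and a `BytesIO` with made-up content stands in for the S3 body.

```python
import json
from io import BytesIO

def parse_json_body(body):
    """Parse JSON from any object whose read() returns UTF-8 encoded bytes."""
    return json.load(body)

# Stand-in for the 'Body' stream returned by S3; the content is illustrative
fake_body = BytesIO(b'{"title": "example", "tags": ["aws", "s3"]}')
data = parse_json_body(fake_body)
print(data["tags"])  # ['aws', 's3']
```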

4. Loading zip archives

Often the data we're working with are quite big and stored in a zip archive.


# Import required libraries
from io import BytesIO
from zipfile import ZipFile
import json

# Get a zip object

obj = bucket.Object('corpus-webis-tldr-17.zip')

bytes_ = BytesIO(obj.get()["Body"].read())

z = ZipFile(bytes_)

# check what files we have in our archive
for file in z.namelist():
    print(file)

# here we have a json file in our archive - example.json
# load file from the archive
file = z.open('example.json')
data = json.load(file)
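
When an archive holds several JSON files, the steps above can be wrapped in a loop. This is a minimal sketch, not a definitive implementation: `read_json_members` is a hypothetical helper, and a small archive is built in memory to stand in for the bytes read from S3, so the member names and contents are made up.

```python
import json
from io import BytesIO
from zipfile import ZipFile

def read_json_members(zip_bytes):
    """Parse every .json member of a zip archive held in memory.

    Returns a dict mapping member name to the parsed JSON value.
    """
    result = {}
    with ZipFile(BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(".json"):
                with archive.open(name) as member:
                    result[name] = json.load(member)
    return result

# Build a small archive in memory as a stand-in for obj.get()["Body"].read()
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("example.json", json.dumps({"id": 1}))
    archive.writestr("notes.txt", "not json")

parsed = read_json_members(buffer.getvalue())
print(parsed)  # {'example.json': {'id': 1}}
```

Nothing is written to disk here: the archive is opened directly from the bytes in memory.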

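5. Loading images from a folder

An image can be downloaded by key, like any other file.

# KEY is the object's key in the bucket, for example 'images/photo.jpg' (hypothetical)
bucket.download_file(KEY, 'my_local_image.jpg')  # saves to a local file and returns None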
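
S3 has no real folders; a "folder" is just a shared key prefix. To fetch every image under one, you can combine a prefix filter with `download_file`. The sketch below assumes `bucket` is a boto3 `Bucket` resource; `download_images`, the extension list, and the paths in the usage note are illustrative choices, not part of boto3.

```python
import os

# Extensions treated as images in this sketch; adjust to your data
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")

def download_images(bucket, prefix, dest_dir):
    """Download every image under `prefix` into `dest_dir` and return the local paths.

    Files are saved under their base name, so two keys with the same file
    name in different sub-prefixes would overwrite each other.
    """
    os.makedirs(dest_dir, exist_ok=True)
    local_paths = []
    for obj in bucket.objects.filter(Prefix=prefix):
        # Skip non-image keys, including empty "folder" placeholder keys
        if not obj.key.lower().endswith(IMAGE_EXTENSIONS):
            continue
        local_path = os.path.join(dest_dir, os.path.basename(obj.key))
        bucket.download_file(obj.key, local_path)
        local_paths.append(local_path)
    return local_paths
```

For example, `download_images(bucket, 'images/', 'local_images')` would save every matching file into a `local_images` directory.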
