DEV Community

MariaZentsova

How to load data from S3 to AWS SageMaker

AWS SageMaker is a great way to analyse data in the cloud and train machine learning models. The most convenient place to store data for machine learning and analysis is an S3 bucket, which can hold any type of data: csv, pickle, zip, photos or videos.

Here I want to review how to load different data formats from S3.

AWS has created a great library, boto3, which gives easy access to the AWS ecosystem of tools and products. SageMaker is part of that ecosystem, so S3 is easy to reach from a SageMaker notebook.

One of the key concepts in boto3 is a resource, a higher-level, object-oriented abstraction over the AWS API. Each resource instance has a unique identifier, such as a bucket name, which we use to refer to it.

Let's get a resource and call a bucket.

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket_id')

If you don't know your bucket id and can't look it up in the console, you can list all available buckets.

for bucket in s3.buckets.all():
    print(bucket)

Let's print keys for all files we have in our bucket.

for file in bucket.objects.all():
    print(file.key)
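
When a bucket holds many objects, printing every key gets noisy. As a minimal sketch, the helper below narrows the listing with `bucket.objects.filter(Prefix=...)`. The function name `list_keys`, its `suffix` argument and the `data/` prefix in the usage comment are my own illustrative choices, not part of boto3.

```python
def list_keys(bucket, prefix="", suffix=""):
    """Return object keys from a boto3 Bucket resource, optionally filtered.

    `list_keys`, `prefix` and `suffix` are illustrative names, not boto3 API.
    """
    # Prefix filtering is done by S3; suffix filtering is done locally.
    objects = bucket.objects.filter(Prefix=prefix) if prefix else bucket.objects.all()
    return [obj.key for obj in objects if obj.key.endswith(suffix)]


# Usage (assumes the `bucket` resource created earlier; 'data/' is a placeholder):
# csv_keys = list_keys(bucket, prefix="data/", suffix=".csv")
```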

1. Loading csv or txt files

Data most often comes in csv or txt format, which is quite easy to load.

import boto3
import pandas as pd

# Getting data from AWS S3 bucket
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='process-news', Key='techcrunch_data.csv')

# obj['Body'] is a file-like stream that pandas can read directly
techcrunch_data = pd.read_csv(obj['Body'])
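
Plain txt files that are not tabular can be read the same way, without pandas. The sketch below is one possible approach: it assumes UTF-8 text, and the helper name `read_text_lines` plus the bucket and key in the usage comment are placeholders of mine.

```python
def read_text_lines(s3_client, bucket_name, key, encoding="utf-8"):
    """Download a text object with a boto3 S3 client and return its lines."""
    obj = s3_client.get_object(Bucket=bucket_name, Key=key)
    # obj["Body"] is a streaming body; read() returns the raw bytes.
    text = obj["Body"].read().decode(encoding)
    return text.splitlines()


# Usage (bucket and key names are placeholders):
# import boto3
# lines = read_text_lines(boto3.client("s3"), "my-bucket", "notes.txt")
```

Passing the client in as an argument keeps the helper easy to try out with a stub before pointing it at a real bucket.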

If you don't want to use boto3, you can load csv data from S3 using just pandas. This requires the s3fs package to be installed, and the path starts with the bucket name.

news_df = pd.read_csv('s3://bucket_name/your_data.csv')

2. Loading Pickle

Pickle is a Python serialization format that uses a very compact binary representation. The standard pickle module allows us to read this type of file from an s3.Object.

import pickle

data = pickle.loads(bucket.Object("your_file.pickle").get()['Body'].read())

Machine learning models can also be saved as pickle files.
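
As a rough sketch of that round trip, the two helpers below upload a pickled Python object with `bucket.put_object` and read it back with `bucket.Object(...).get()`. The names `save_pickle` and `load_pickle` and the `model.pickle` key are hypothetical, and `bucket` is assumed to be a boto3 Bucket resource like the one created earlier.

```python
import pickle


def save_pickle(bucket, key, obj):
    """Serialize `obj` with pickle and upload it to the bucket under `key`."""
    bucket.put_object(Key=key, Body=pickle.dumps(obj))


def load_pickle(bucket, key):
    """Download the object stored under `key` and unpickle it."""
    # Only unpickle data you trust: pickle can run arbitrary code on load.
    return pickle.loads(bucket.Object(key).get()["Body"].read())


# Usage ('model.pickle' is a placeholder key, `model` any picklable object):
# save_pickle(bucket, "model.pickle", model)
# model = load_pickle(bucket, "model.pickle")
```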

3. Loading JSON

JSON format is very popular and APIs most often return data in this format.

import json
import boto3

# Use a resource here (not the client from the csv example)
s3 = boto3.resource('s3')
obj = s3.Object('test', 'sample_json.txt')
file_content = obj.get()['Body'].read().decode('utf-8')
data = json.loads(file_content)

4. Loading zip archives

Often the data we're working with is quite big and stored in a zip archive.


# Import required libraries
from io import BytesIO
from zipfile import ZipFile
import json

# Get a zip object

obj = bucket.Object('corpus-webis-tldr-17.zip')

bytes_ = BytesIO(obj.get()["Body"].read())

z = ZipFile(bytes_)

# check what files we have in our archive
for file in z.namelist():
    print(file)

# here we have a json file in our archive - example.json
# load file from the archive
file = z.open('example.json')
data = json.load(file)
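
If the archive contains several JSON files, the same steps can be wrapped in a loop. The following is a minimal sketch that works on the raw bytes of a zip archive; `load_json_members` is a name I made up for illustration.

```python
import json
from io import BytesIO
from zipfile import ZipFile


def load_json_members(zip_bytes):
    """Return {member_name: parsed_json} for every .json file in a zip archive."""
    results = {}
    with ZipFile(BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(".json"):
                # archive.open() returns a binary file object; json.load accepts it.
                with archive.open(name) as member:
                    results[name] = json.load(member)
    return results


# Usage (assumes the `bucket` resource and the archive key used above):
# raw = bucket.Object('corpus-webis-tldr-17.zip').get()["Body"].read()
# data = load_json_members(raw)
```

Note that this reads the whole archive into memory, just like the example above, so it suits archives that fit in the notebook's RAM.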

5. Loading images from a folder

An image can be downloaded by key, like any other file.

# KEY is the S3 key of the image; download_file saves it locally and returns None
bucket.download_file(KEY, 'my_local_image.jpg')
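
To fetch every image in a folder rather than a single key, one option is to list the keys under a prefix and download them one by one. This is a sketch under my own assumptions: `download_prefix` is a hypothetical helper, the `images/` prefix and local directory are placeholders, and files are flattened into one local directory, so two keys with the same file name would overwrite each other.

```python
import os


def download_prefix(bucket, prefix, local_dir):
    """Download every object under `prefix` into `local_dir`; return local paths."""
    os.makedirs(local_dir, exist_ok=True)
    paths = []
    for obj in bucket.objects.filter(Prefix=prefix):
        # Skip "folder" placeholder keys that end with a slash.
        if obj.key.endswith("/"):
            continue
        local_path = os.path.join(local_dir, os.path.basename(obj.key))
        bucket.download_file(obj.key, local_path)
        paths.append(local_path)
    return paths


# Usage (prefix and directory are placeholders):
# image_paths = download_prefix(bucket, "images/", "local_images")
```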
