AWS SageMaker is a great way to analyse data in the cloud and train machine learning models. The most convenient way to store data for machine learning and analysis is an S3 bucket, which can contain any type of data: csv, pickle, zip, or photos and videos.
Here I want to review how to load different data formats from S3.
AWS has created the great boto3 library, which allows easy access to the AWS ecosystem of tools and products. SageMaker is part of that ecosystem, so it provides easy access to S3.
One of the key concepts in boto3 is a resource, an abstraction that provides access to the AWS API and AWS resources. Each AWS resource instance has a unique identifier, which we can use to call it.
Let's get a resource and call a bucket.
import boto3

s3 = boto3.resource('s3')          # high-level resource interface to S3
bucket = s3.Bucket('bucket-name')  # a bucket is identified by its name
If you don't know your bucket name and can't see it in the console, you can list all available buckets:
for bucket in s3.buckets.all():
    print(bucket)
Let's print the keys of all the files in our bucket:
for file in bucket.objects.all():
    print(file.key)
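If the bucket is large, it can be handy to list only a subset of keys. A minimal sketch, assuming a hypothetical photos/ prefix:

# List only the objects whose keys start with a given prefix
for file in bucket.objects.filter(Prefix='photos/'):
    print(file.key)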
1. Loading csv or txt files
Most often data comes in csv or txt format, and this one is quite easy to load.
# Getting data from an AWS S3 bucket
import pandas as pd

# This time we use the lower-level client interface instead of the resource
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='process-news', Key='techcrunch_data.csv')
techcrunch_data = pd.read_csv(obj['Body'])
If you don't want to use boto3 directly, you can load csv data from S3 using just pandas (under the hood this requires the s3fs package to be installed).
news_df = pd.read_csv('s3://bucket-name/your_data.csv')
2. Loading Pickle
Pickle is a data format that uses a very compact binary representation. Python's pickle module allows us to read these types of files from an s3.Object.
import pickle

# Read the object's raw bytes and deserialize them
data = pickle.loads(bucket.Object("your_file.pickle").get()['Body'].read())
Machine learning models can also be saved as a pickle file.
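And going the other way, a trained model can be serialized and uploaded back to the bucket. A minimal sketch, where model stands for any picklable Python object and the key is hypothetical:

# Serialize a picklable object (e.g. a trained model) and upload it
bucket.put_object(Key='models/my_model.pickle', Body=pickle.dumps(model))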
3. Loading JSON
The JSON format is very popular, and APIs most often return data in it.
import json

s3 = boto3.resource('s3')  # make sure s3 is the resource, not the client
obj = s3.Object('test', 'sample_json.txt')
content = obj.get()['Body'].read().decode('utf-8')
data = json.loads(content)  # don't call this variable json, or it will shadow the module
4. Loading zip archives
Often the data we're working with is quite big and stored in a zip archive.
# Import required libraries
from io import BytesIO
from zipfile import ZipFile
import json

# Get a zip object
obj = bucket.Object('corpus-webis-tldr-17.zip')
bytes_ = BytesIO(obj.get()["Body"].read())
z = ZipFile(bytes_)
# check what files we have in our archive
for file in z.namelist():
    print(file)
# here we have a json file in our archive - example.json
# load file from the archive
file = z.open('example.json')
data = json.load(file)
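The same pattern works for other formats inside the archive. A minimal sketch for a csv, assuming a hypothetical example.csv in the archive and pandas imported as pd:

# A csv inside the archive can be passed to pandas without extracting it to disk
df = pd.read_csv(z.open('example.csv'))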
5. Loading images from a folder
An image can be downloaded by its key like any other file.
# download_file saves the object to the given local path and returns None
bucket.download_file(KEY, 'my_local_image.jpg')
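If you'd rather keep the image in memory instead of writing it to disk, you can open it straight from the object's bytes. A minimal sketch, assuming the Pillow package is installed and a hypothetical key my_image.jpg:

from io import BytesIO
from PIL import Image

# Read the object's bytes and open them as an image without touching the disk
obj = bucket.Object('my_image.jpg')
image = Image.open(BytesIO(obj.get()['Body'].read()))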