How To Load Data From AWS S3 into Sagemaker (Using Boto3 or AWSWrangler)
S3 is a storage service from AWS. You can store any type of file in it, such as CSV or text files. SageMaker provides the compute capacity to build, train, and deploy ML models.
You can load data from AWS S3 into AWS SageMaker using Boto3 or AWS Wrangler.
In this tutorial, you'll learn how to load data from AWS S3 into a SageMaker Jupyter notebook.
This only accesses the data from S3; the files are not downloaded to the SageMaker instance itself. If you want to download the file to the SageMaker instance, read How to Download File From S3 Using Boto3 [Python]?
Prerequisites
- The SageMaker instance MUST have read access to your S3 buckets. Assign the role `AmazonSageMakerServiceCatalogProductsUseRole` while creating the SageMaker instance. Refer to the SageMaker Roles documentation for more details.
- Install pandas using `pip install pandas` to read the CSV file as a dataframe. In most cases it is available as a default package.
Loading CSV file from S3 Bucket Using URI
In this section, you'll load the CSV file from the S3 bucket using the S3 URI.
There are two options to generate the S3 URI:
- Copy the object URL from the AWS S3 console.
- Generate the URI manually using the string format option. (This is demonstrated in the sketch after this list and in the example below.)
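For reference, a minimal sketch of both options, using the example bucket and key from this tutorial (the hardcoded URI on the first line stands in for what you would copy from the console):
# Option 1: URI copied directly from the S3 console
s3uri_copied = 's3://stackvidhya/csv_files/IRIS.csv'
# Option 2: build the URI from the bucket name and object key
bucket = 'stackvidhya'
file_key = 'csv_files/IRIS.csv'
s3uri_built = 's3://{}/{}'.format(bucket, file_key)
assert s3uri_copied == s3uri_built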
Follow the below steps to load the CSV file from the S3 bucket:
- Import the `pandas` package to read the CSV file as a dataframe.
- Create a variable `bucket` to hold the bucket name.
- Create the variable `file_key` to hold the name of the S3 object. You can prefix the subfolder names if your object is under any subfolder of the bucket.
- Concatenate the bucket name and the object name with the prefix `s3://` to generate the URI of the S3 object.
- Use the generated URI in the `read_csv()` method of the pandas package and store the result in a dataframe object called `df`.
In the example, the object is available in the bucket `stackvidhya` under the sub-folder `csv_files`. Hence you'll use `stackvidhya` as the bucket name and `csv_files/IRIS.csv` as the `file_key`.
Snippet
import pandas as pd
bucket='stackvidhya'
file_key = 'csv_files/IRIS.csv'
s3uri = 's3://{}/{}'.format(bucket, file_key)
df = pd.read_csv(s3uri)
df.head()
The CSV file will be read from the S3 location as a pandas dataframe.
You can print the dataframe using `df.head()`, which will print the first five rows of the dataframe as shown below.
Dataframe Will Look Like
|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
This is how you can load S3 data into a SageMaker Jupyter notebook without writing any explicit S3 client code. Under the hood, pandas uses the s3fs/fsspec packages to resolve s3:// paths.
In this method, the file is also not downloaded to the notebook instance directly.
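If `read_csv()` raises an ImportError about a missing s3fs dependency, you can install it straight from the notebook. This is a minimal fix, assuming a standard SageMaker Python kernel:
%pip install s3fs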
Next, you'll learn about using the external libraries to load the data.
Loading CSV file from S3 Bucket using Boto3
In this section, you'll use Boto3.
Boto3 is the AWS SDK for Python, used for creating, managing, and accessing AWS services such as S3 and EC2 instances.
Follow the below steps to access the file from S3 using Boto3:
- Import the `pandas` package to read the CSV file as a dataframe.
- Create a variable `bucket` to hold the bucket name.
- Create the variable `file_key` to hold the name of the S3 object. You can prefix the subfolder names if your object is under any subfolder of the bucket.
- Create an S3 client using `boto3.client('s3')`. A Boto3 client is a low-level representation of the AWS service.
- Get the S3 object using the `s3_client.get_object()` method. Pass the bucket name and the file key you created in the previous step. It returns the S3 data as a response (stored as `obj`).
- Read the object body using `obj['Body'].read()`. It returns the contents as bytes. Wrap these bytes in an in-memory file-like object using `io.BytesIO()`.
- This file-like object can be passed to `read_csv()` available in pandas. Then you'll get a dataframe.
Snippet
import pandas as pd
import boto3
import io
bucket='stackvidhya'
file_key = 'csv_files/IRIS.csv'
s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket=bucket, Key=file_key)
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
df.head()
The dataframe can be printed using the `df.head()` method. It'll print the first five rows of the dataframe as shown below.
Dataframe Will Look Like
|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
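If your bucket holds several CSVs under the same prefix, you can list them with the same client and read each one. The following is a rough sketch that reuses the `s3_client`, `bucket`, `pd`, and `io` names from the snippet above and assumes the csv_files/ folder contains only CSV objects you want to combine:
# List every object under the csv_files/ prefix and load each CSV
response = s3_client.list_objects_v2(Bucket=bucket, Prefix='csv_files/')
frames = []
for item in response.get('Contents', []):
    if item['Key'].endswith('.csv'):
        body = s3_client.get_object(Bucket=bucket, Key=item['Key'])['Body'].read()
        frames.append(pd.read_csv(io.BytesIO(body)))
# Stack the individual dataframes into one
combined = pd.concat(frames, ignore_index=True)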
You can also use the same steps to access files from S3 in a Jupyter notebook *outside of SageMaker*.
Just pass the AWS API security credentials while creating the Boto3 client, as shown below. Refer to the tutorial How to Create AWS Security Credentials to create the credentials.
Snippet
s3_client = boto3.client(
    's3',
    aws_access_key_id='AWS_SERVER_PUBLIC_KEY',
    aws_secret_access_key='AWS_SERVER_SECRET_KEY',
    region_name='REGION_NAME'
)
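Hardcoding keys in a notebook is easy to leak. A safer sketch, assuming you have a named profile configured locally in ~/.aws/credentials (the profile name my-profile is hypothetical):
import boto3

# Create a session from a locally configured named profile instead of inline keys
session = boto3.Session(profile_name='my-profile')
s3_client = session.client('s3')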
This is how you can read a CSV file into SageMaker using Boto3.
Next, you'll learn about the `awswrangler` package.
Loading CSV File into Sagemaker using AWS Wrangler
In this section, you'll learn how to access data from AWS S3 using AWS Wrangler.
AWS Wrangler is an open-source Python library from AWS Professional Services that extends the functionality of pandas to AWS by connecting dataframes with AWS data-related services.
This package is not installed by default.
Installing AWSWrangler
Install `awswrangler` using the pip install command. The `%` needs to be prefixed to the `pip` command so that the installation works directly from the Jupyter notebook.
Snippet
%pip install awswrangler
You'll see the below messages and AWS Data Wrangler will be installed.
Output
Collecting awswrangler
Downloading awswrangler-2.8.0-py3-none-any.whl (179 kB)
Installing collected packages: scramp, redshift-connector, pymysql, pg8000, awswrangler
Successfully installed awswrangler-2.8.0 pg8000-1.19.5 pymysql-1.0.2 redshift-connector-2.0.881 scramp-1.4.0
Note: you may need to restart the kernel to use updated packages.
Now, restart the kernel using the Kernel -> Restart option to activate the package.
Once the kernel is restarted, you can use awswrangler to access data from AWS S3 in your SageMaker notebook.
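As a quick sanity check after the restart, you can import the package and print its version (assuming the install above succeeded):
import awswrangler as wr

# Confirm the package is importable after the kernel restart
print(wr.__version__)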
Follow the below steps to access the file from S3 using AWSWrangler:
- Import the `pandas` package to read the CSV file as a dataframe.
- Import `awswrangler` as `wr`.
- Create a variable `bucket` to hold the bucket name.
- Create the variable `file_key` to hold the name of the S3 object. You can prefix the subfolder names if your object is under any subfolder of the bucket.
- Concatenate the bucket name and the file key to generate the `s3uri`.
- Use the `read_csv()` method in `awswrangler` to fetch the S3 data using the line `wr.s3.read_csv(path=s3uri)`.
Snippet
import awswrangler as wr
import pandas as pd
bucket='stackvidhya'
file_key = 'csv_files/IRIS.csv'
s3uri = 's3://{}/{}'.format(bucket, file_key)
df = wr.s3.read_csv(path=s3uri)
df.head()
The `read_csv()` method will return a pandas dataframe out of the CSV data. You can print the dataframe using `df.head()`, which will return the first five rows of the dataframe as shown below.
Dataframe Will Look Like
|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
This is how you can load the CSV file from S3 using `awswrangler`.
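awswrangler can also read every CSV under a prefix in a single call. A brief sketch, assuming the csv_files/ folder from this tutorial contains more than one CSV file:
import awswrangler as wr

# Passing a folder prefix instead of a single object reads all CSVs under it
df_all = wr.s3.read_csv(path='s3://stackvidhya/csv_files/')
print(df_all.shape)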
Next, you'll see how to read a normal text file.
Read Text File from S3
You've seen how to read a CSV file from S3 in a SageMaker notebook.
In this section, you'll see how to access a normal text file from S3 and read its content.
As seen before, you can create an S3 client and get the object from S3 using the bucket name and the object key.
Then you can read the object body using the `read()` method.
The `read()` method will return the file contents as bytes.
You can decode the bytes into a string using `contents.decode('utf-8')`. UTF-8 is the most widely used character encoding.
Snippet
import boto3
bucket='stackvidhya'
data_key = 'text_files/testfile.txt'
s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket=bucket, Key=data_key)
contents = obj['Body'].read()
print(contents.decode("utf-8"))
Output
This is a test file to demonstrate the file access functionlity from AWS S3 into sagemaker notebook
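If you need to process the file line by line rather than print it in one go, here is a small sketch that reuses the `contents` variable from the snippet above:
# Split the decoded text into individual lines and process each one
for line_number, line in enumerate(contents.decode('utf-8').splitlines(), start=1):
    print(line_number, line)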
Conclusion
To summarize, you've learnt how to access or load a file from AWS S3 into a SageMaker Jupyter notebook using the packages `boto3` and `awswrangler`.
You've also learnt how to access the file directly through the S3 URI without writing any explicit S3 client code.
If you've any questions, feel free to comment below.