Vikram Aruchamy for AWS Community Builders

Originally published at stackvidhya.com

How To Load Data From AWS S3 into Sagemaker (Using Boto3 or AWSWrangler)

S3 is a storage service from AWS where you can store any type of file, such as CSV or text files. SageMaker provides the compute capacity to build, train, and deploy ML models.

You can load data from AWS S3 into AWS SageMaker using Boto3 or AWS Wrangler.

In this tutorial, you'll learn how to load data from AWS S3 into a SageMaker Jupyter notebook.

This only accesses the data from S3; the files are not downloaded to the SageMaker instance itself. If you want to download the file to the SageMaker instance, read How to Download File From S3 Using Boto3 [Python]?

Prerequisites

  • The SageMaker instance MUST have read access to your S3 buckets. Assign the role AmazonSageMakerServiceCatalogProductsUseRole while creating the SageMaker instance. Refer to this link for more details about SageMaker roles. A quick way to verify this access is shown after this list.
  • Install pandas using pip install pandas to read the CSV file as a dataframe. In most cases it is available as a default package.
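
If you want to confirm that the notebook role can actually read your bucket before proceeding, you can run a quick check like the sketch below. It assumes the example bucket stackvidhya used in this tutorial; replace it with your own bucket name.

Snippet

import boto3

s3_client = boto3.client('s3')

# List a single object to verify the notebook role can read the bucket.
# An AccessDenied error here means the instance role lacks S3 read permissions.
response = s3_client.list_objects_v2(Bucket='stackvidhya', MaxKeys=1)
print(response.get('KeyCount', 0), 'object(s) visible')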

Loading CSV file from S3 Bucket Using URI

In this section, you'll load the CSV file from the S3 bucket using the S3 URI.

There are two options to generate the S3 URI:

  • Copying object URL from the AWS S3 Console.
  • Generate the URI manually by using the String format option. (This is demonstrated in the below example)

Follow the below steps to load the CSV file from the S3 bucket.

  • Import the pandas package to read the CSV file as a dataframe.
  • Create a variable bucket to hold the bucket name.
  • Create the file_key to hold the name of the S3 object. You can prefix the subfolder names if your object is under any subfolder of the bucket.
  • Concatenate the bucket name and the object name with the prefix s3:// to generate the URI of the S3 object.
  • Use the generated URI in the read_csv() method of the pandas package and store the result in the dataframe object called df.

In the example, the object is available in the bucket stackvidhya under a sub-folder called csv_files. Hence you'll use stackvidhya as the bucket name and csv_files/IRIS.csv as the file_key.

Snippet

import pandas as pd

bucket = 'stackvidhya'
file_key = 'csv_files/IRIS.csv'

# Build the S3 URI in the form s3://<bucket>/<key>
s3uri = 's3://{}/{}'.format(bucket, file_key)

# pandas reads the object directly from S3
df = pd.read_csv(s3uri)
df.head()

The CSV file will be read from the S3 location into a pandas dataframe.

You can print the dataframe using df.head(), which will print the first five rows of the dataframe as shown below.

Dataframe Will Look Like

   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

This is how you can load S3 data into a SageMaker Jupyter notebook without explicitly using any extra libraries (under the hood, pandas uses the s3fs package to read s3:// URIs).

In this method too, the file is not downloaded to the notebook instance; it is read directly from S3.
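
If the CSV file is too large to fit in memory, pandas can read it from the same S3 URI in chunks. A minimal sketch (the chunk size of 10,000 rows is an arbitrary choice):

Snippet

import pandas as pd

s3uri = 's3://stackvidhya/csv_files/IRIS.csv'

# Read the CSV in chunks of 10,000 rows instead of loading it all at once
for chunk in pd.read_csv(s3uri, chunksize=10000):
    print(chunk.shape)  # process each chunk here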

Next, you'll learn about using the external libraries to load the data.

Loading CSV file from S3 Bucket using Boto3

In this section, you'll use Boto3.

Boto3 is the AWS SDK for Python, used for creating, managing, and accessing AWS services such as S3 and EC2 instances.

Follow the below steps to access the file from S3.

  1. Import the pandas package to read the CSV file as a dataframe.
  2. Create a variable bucket to hold the bucket name.
  3. Create the file_key to hold the name of the S3 object. You can prefix the subfolder names if your object is under any subfolder of the bucket.
  4. Create an S3 client using boto3.client('s3'). The Boto3 client is a low-level representation of the AWS services.
  5. Get the S3 object using the s3_client.get_object() method. Pass the bucket name and the file key you created in the previous step. It'll return the S3 data as a response (stored as obj).
  6. Read the object body using obj['Body'].read(). It'll return bytes. Wrap these bytes in a file-like object using io.BytesIO().
  7. This file-like object can be passed to read_csv() in pandas, which returns a dataframe.

Snippet

import pandas as pd
import boto3
import io

bucket = 'stackvidhya'
file_key = 'csv_files/IRIS.csv'

# Low-level S3 client
s3_client = boto3.client('s3')

# Fetch the object; the response body is a byte stream
obj = s3_client.get_object(Bucket=bucket, Key=file_key)

# Wrap the bytes in a file-like object so pandas can read it
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
df.head()

The dataframe can be printed using the df.head() method. It'll print the first five rows of the dataframe as shown below.

Dataframe Will Look Like

   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
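
If the object key doesn't exist or the role lacks permission, get_object() raises a botocore.exceptions.ClientError. Wrapping the call makes failures easier to diagnose; a minimal sketch:

Snippet

import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client('s3')

try:
    obj = s3_client.get_object(Bucket='stackvidhya', Key='csv_files/IRIS.csv')
except ClientError as e:
    # 'NoSuchKey' means the object is missing; 'AccessDenied' indicates a permissions issue
    print('S3 error:', e.response['Error']['Code'])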

You can also use the same steps to access files from S3 in a Jupyter notebook (outside of SageMaker).

Just pass the AWS API security credentials while creating the boto3 client, as shown below. Refer to the tutorial How to create AWS security credentials to create credentials.

Snippet

s3_client = boto3.client(
    's3',
    aws_access_key_id='AWS_SERVER_PUBLIC_KEY',
    aws_secret_access_key='AWS_SERVER_SECRET_KEY',
    region_name='AWS_REGION'  # e.g. 'us-east-1'
)
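
Hard-coding keys in a notebook is risky. If you have AWS credentials configured locally, a named profile keeps them out of the code. A sketch, assuming a profile called my-profile (a hypothetical name) exists in your ~/.aws/credentials file:

Snippet

import boto3

# Load credentials from a named profile instead of embedding keys in the notebook
session = boto3.Session(profile_name='my-profile')
s3_client = session.client('s3')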

This is how you can read a CSV file into SageMaker using boto3.

Next, you'll learn about the package awswrangler.

Loading CSV File into Sagemaker using AWS Wrangler

In this section, you'll learn how to access data from AWS S3 using AWS Wrangler.

AWS Wrangler is an open source Python library from AWS Professional Services that extends the functionality of pandas to AWS by connecting dataframes with AWS data-related services.

This package is not installed by default.

Installing AWSWrangler

Install awswrangler using the pip install command.

The % prefix on the pip command lets the installation run directly from the Jupyter notebook.

Snippet

%pip install awswrangler
Enter fullscreen mode Exit fullscreen mode

You'll see the messages below, and AWS Data Wrangler will be installed.

Output

    Collecting awswrangler
      Downloading awswrangler-2.8.0-py3-none-any.whl (179 kB)

    Installing collected packages: scramp, redshift-connector, pymysql, pg8000, awswrangler
    Successfully installed awswrangler-2.8.0 pg8000-1.19.5 pymysql-1.0.2 redshift-connector-2.0.881 scramp-1.4.0
    Note: you may need to restart the kernel to use updated packages.

Now, restart the kernel using the Kernel -> Restart option to activate the package.

Once the kernel is restarted, you can use awswrangler to access data from AWS S3 in your SageMaker notebook.

Follow the below steps to access the file from S3 using AWSWrangler.

  1. Import the pandas package to read the CSV file as a dataframe.
  2. Import awswrangler as wr.
  3. Create a variable bucket to hold the bucket name.
  4. Create the file_key to hold the name of the S3 object. You can prefix the subfolder names if your object is under any subfolder of the bucket.
  5. Concatenate the bucket name and the file key to generate the s3uri.
  6. Use the read_csv() method in awswrangler to fetch the S3 data using the line wr.s3.read_csv(path=s3uri).

Snippet

import awswrangler as wr
import pandas as pd

bucket = 'stackvidhya'
file_key = 'csv_files/IRIS.csv'

# Build the S3 URI in the form s3://<bucket>/<key>
s3uri = 's3://{}/{}'.format(bucket, file_key)

# awswrangler reads the CSV from S3 straight into a pandas dataframe
df = wr.s3.read_csv(path=s3uri)
df.head()

The read_csv() method will return a pandas dataframe from the CSV data. You can print the dataframe using df.head(), which will return the first five rows of the dataframe as shown below.

Dataframe Will Look Like

   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

This is how you can load the CSV file from S3 using awswrangler.
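
awswrangler can also read every CSV file under a prefix in one call by passing the folder path instead of a single key. A sketch, assuming the csv_files/ prefix from the example above:

Snippet

import awswrangler as wr

# Passing a prefix (note the trailing slash) reads all CSV files
# under it and concatenates them into a single dataframe
df = wr.s3.read_csv(path='s3://stackvidhya/csv_files/')
df.head()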

Next, you'll see how to read a normal text file.

Read Text File from S3

You've seen how to read a CSV file from S3 in a SageMaker notebook.

In this section, you'll see how to access a normal text file from S3 and read its content.

As seen before, you can create an S3 client and get the object from S3 using the bucket name and the object key.

Then you can read the object body using the read() method.

The read() method will return the file contents as bytes.

You can decode the bytes into a string using contents.decode('utf-8'). UTF-8 is the most widely used character encoding.

Snippet

import boto3

bucket = 'stackvidhya'
data_key = 'text_files/testfile.txt'

s3_client = boto3.client('s3')

# Fetch the object and read its body as bytes
obj = s3_client.get_object(Bucket=bucket, Key=data_key)
contents = obj['Body'].read()

# Decode the bytes into a string
print(contents.decode('utf-8'))

Output

This is a test file to demonstrate the file access functionality from AWS S3 into sagemaker notebook
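
For large text files, you don't have to read the whole body at once; the streaming body returned by get_object() can be iterated line by line. A minimal sketch:

Snippet

import boto3

s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket='stackvidhya', Key='text_files/testfile.txt')

# iter_lines() streams the object line by line without loading it fully into memory
for line in obj['Body'].iter_lines():
    print(line.decode('utf-8'))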

Conclusion

To summarize, you've learnt how to access and load files from AWS S3 into a SageMaker Jupyter notebook using the packages boto3 and awswrangler.

You've also learnt how to access the file without using any additional packages.

If you have any questions, feel free to comment below.
