Gilbert Young Jr for AWS Community Builders


SageMaker Data Ingestion using Kaggle

Recently, I have been focusing on learning AI/ML. After overcoming many roadblocks and mistakes, I can now confidently share a successful solution.

I want to explain one specific foundation of ML: data ingestion. I will demonstrate how to import a Kaggle dataset into a SageMaker Studio notebook. Amazon SageMaker is an ML service that enables developers to rapidly create, train, and deploy machine learning models in the cloud. Kaggle is an online community platform that hosts numerous datasets and ML challenges for data scientists and machine learning enthusiasts. Integrated properly, the two tools complement each other: Kaggle supplies the data, and SageMaker supplies the environment to work with it.


Prerequisites

  • A SageMaker Studio notebook. If you don't have one already, you can follow this guide to create one.
  • A Kaggle account. You can register for one here.

🛠 Let's build!!

First, we will import the Python packages used in the notebook.

Import Packages

import pandas as pd
import time

Install Kaggle CLI

!pip install --quiet kaggle

To use the Kaggle API, you must have an account and an API token. You can follow this guide to generate your API token; it is completely free. The cell below creates a JSON file to store your Kaggle credentials. Insert your username and API key in the code block.

!mkdir -p ~/.kaggle && touch ~/.kaggle/kaggle.json # Creates the folder (if missing) and the JSON file that stores the Kaggle API credentials
kaggle_api_token = {"username":"<username>","key":"<api_key>"}  # Insert your own username and API key here

We then write our Kaggle credentials to the JSON file we created.

import json
import os

# Writes the API credentials to the Kaggle file in the current user's home directory
with open(os.path.expanduser('~/.kaggle/kaggle.json'), 'w') as file:
    json.dump(kaggle_api_token, file)

For security reasons, we must ensure that other users do not have read access to our Kaggle credentials.

!chmod 600 ~/.kaggle/kaggle.json
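
If you would rather keep these three steps (create the folder, write the file, restrict the permissions) in a single Python cell, a minimal sketch could look like the following. The function name `write_kaggle_credentials` and its `config_dir` parameter are hypothetical names chosen for illustration, and the sketch assumes a Linux environment such as the one SageMaker Studio provides.

```python
import json
import os
from pathlib import Path


def write_kaggle_credentials(username, key, config_dir=None):
    """Write kaggle.json into config_dir (default: ~/.kaggle) with owner-only access.

    Hypothetical helper for illustration; not part of the Kaggle package.
    """
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    config_dir.mkdir(parents=True, exist_ok=True)  # the folder may not exist yet

    token_path = config_dir / "kaggle.json"
    # Create the file with mode 600 so that a new file is never readable by others.
    fd = os.open(token_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as file:
        json.dump({"username": username, "key": key}, file)
    os.chmod(token_path, 0o600)  # also tightens a file that already existed
    return token_path
```

Calling `write_kaggle_credentials("<username>", "<api_key>")` with your own values would then replace the separate shell and Python cells.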

Now that our access token is configured, we can list the available datasets.

!kaggle datasets list # List available datasets
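
The full listing can be long. As a rough sketch, the CLI output can also be captured in Python and narrowed down with a search term. This assumes the `-s` (search) and `--csv` options of `kaggle datasets list` are available in your installed version of the CLI; the helper names are hypothetical, and any warning lines the CLI prints before the CSV header would need to be stripped first.

```python
import csv
import io
import subprocess


def parse_dataset_listing(csv_text):
    """Turn CSV text (header row plus one row per dataset) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def search_datasets(term):
    """Run the Kaggle CLI with a search term and return the parsed rows.

    Assumes the `kaggle` command is installed and the credentials are configured.
    """
    result = subprocess.run(
        ["kaggle", "datasets", "list", "-s", term, "--csv"],
        capture_output=True,
        text=True,
        check=True,  # raise an error if the CLI exits with a failure code
    )
    return parse_dataset_listing(result.stdout)
```

For example, `search_datasets("game of thrones")` should return one dictionary per matching dataset, keyed by whatever column names the CLI prints in its header row.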

If the above command was successful, you will see a list of available datasets.


The command below downloads the specified dataset. You can change the dataset name to any of the names returned in the list of datasets. The download might take some time, depending on your network connection.

%%time

!kaggle datasets download -d iamsouravbanerjee/game-of-thrones-dataset --unzip # Downloads & Unzip dataset
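
Once the download finishes, it can help to check which files were actually unpacked before hard-coding a file name in the next step. A small sketch, assuming the archive was unzipped into the notebook's working directory (the helper name `find_csv_files` is hypothetical):

```python
from pathlib import Path


def find_csv_files(directory="."):
    """Return the CSV files found directly inside `directory`, sorted by name."""
    return sorted(Path(directory).glob("*.csv"))
```

Running `find_csv_files()` in the notebook returns the matching paths, and any one of them can be passed to `pd.read_csv`.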

Now that the dataset is downloaded, let us look at the CSV file. We will use pandas to load the data and display its first rows.

data = pd.read_csv("Game_of_Thrones.csv", header=0)
df = data.copy()
df.head() 
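
Beyond `df.head()`, a quick sanity check of the loaded data is often a useful next step. The sketch below is one possible way to do it with pandas; the function name `summarize` is hypothetical, and the checks it performs are a suggestion rather than a fixed recipe.

```python
import pandas as pd


def summarize(df):
    """Collect a few basic facts about a DataFrame before deeper analysis."""
    return {
        "rows": len(df),
        "columns": list(df.columns),
        # Number of missing values in each column
        "missing_per_column": df.isna().sum().to_dict(),
    }
```

Calling `summarize(df)` on the DataFrame loaded above returns the row count, the column names, and the number of missing values per column.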


GitHub Repo

You can find the complete SageMaker Studio Notebook on my GitHub.

Additional Features

This was a simple demonstration of data ingestion. You can build on this solution by extracting insights from the data using pandas, or perhaps by training an ML model. If you do, please feel free to share your project with me.

Stay curious, keep learning and keep building!!!
