Gilbert Young Jr for AWS Community Builders

Posted on Nov 17, 2022

SageMaker Data Ingestion using Kaggle

#sagemaker #kaggle #aws #machinelearning

Recently, I have been focusing on learning AI/ML. After overcoming many roadblocks and mistakes, I can now confidently share a successful solution.

I want to explain one specific foundation of ML: data ingestion. I will demonstrate how to import a Kaggle dataset into a SageMaker Studio notebook. Amazon SageMaker is an ML service that enables developers to rapidly create, train, and deploy machine learning models in the cloud. Kaggle is an online community platform that hosts numerous datasets and ML challenges for data scientists and machine learning enthusiasts. Integrated properly, the two tools complement each other well.

Prerequisite

A SageMaker Studio Notebook is needed. If you don't have one already, you can follow this guide to create one.
A Kaggle account is needed; you can register for one here.

🛠 Let's build!!

First, we will import the Python packages that will be used in the notebook.

Import Packages

import pandas as pd
import time

Install Kaggle CLI

!pip install -q kaggle

To use the Kaggle API, you must have an account and an API token. You can follow this guide to generate your API token; it is completely free. The commands below create a JSON file to store your Kaggle credentials. Insert your username and API key in the code block.

!mkdir -p ~/.kaggle && touch ~/.kaggle/kaggle.json # Creates the .kaggle directory if needed, then the JSON file to store Kaggle API credentials
kaggle_api_token = {"username":"<username>","key":"<api_key>"}  # Insert your own username and API Key here

We then write our Kaggle credentials to the JSON file we created.

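import json
import os

# Writes API credentials to the Kaggle file; expanduser resolves ~ to the home directory (/root in the original notebook)
with open(os.path.expanduser('~/.kaggle/kaggle.json'), 'w') as file:
    json.dump(kaggle_api_token, file)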

For security reasons, we must ensure that other users do not have read access to our Kaggle credentials.

!chmod 600 ~/.kaggle/kaggle.json
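
As an alternative to writing the file and tightening its permissions afterwards, the file can be created with owner-only permissions from the start. The following is a minimal sketch, not part of the original notebook: the helper name `write_kaggle_credentials` and its `path` parameter are hypothetical, the placeholder values stand in for your own credentials, and the demonstration writes to a temporary directory rather than the real `~/.kaggle`.

```python
import json
import os
import stat
import tempfile


def write_kaggle_credentials(token, path):
    """Write the token dict as JSON to path, readable and writable by the owner only."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # O_CREAT with mode 0o600 creates a new file with owner-only permissions
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as file:
        json.dump(token, file)
    # A file that already existed keeps its old mode, so set it explicitly
    os.chmod(path, 0o600)
    return path


# Demonstration against a temporary directory (hypothetical location)
demo_dir = tempfile.mkdtemp()
demo_path = write_kaggle_credentials(
    {"username": "<username>", "key": "<api_key>"},
    os.path.join(demo_dir, ".kaggle", "kaggle.json"),
)
demo_mode = stat.S_IMODE(os.stat(demo_path).st_mode)
print(oct(demo_mode))
```

In the notebook, the same helper could be pointed at `os.path.expanduser('~/.kaggle/kaggle.json')`, assuming that is where the Kaggle CLI looks for credentials in your environment.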

Now that our access token is configured, we can list the available datasets.

!kaggle datasets list # List available datasets

If the above command was successful, you will see a list of available datasets.
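
The full list can be long, so it may help to narrow it with the CLI's `-s` search option. The sketch below wraps that call in Python; the helper names `build_search_command` and `search_datasets` are hypothetical, and actually running the search assumes the Kaggle CLI is installed and the credentials above are in place.

```python
import subprocess


def build_search_command(term):
    """Build the argument list for `kaggle datasets list -s <term>`."""
    return ["kaggle", "datasets", "list", "-s", term]


def search_datasets(term):
    """Run the search and return the CLI's text output.

    Requires the Kaggle CLI and valid credentials, so it is not called here.
    """
    result = subprocess.run(
        build_search_command(term), capture_output=True, text=True, check=True
    )
    return result.stdout


# Only build and show the command; call search_datasets(...) in the notebook
search_command = build_search_command("game of thrones")
print(" ".join(search_command))
```

In a notebook cell, `print(search_datasets("game of thrones"))` would show the matching datasets.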

The command below downloads the specified dataset. You can replace the dataset reference with any of the references returned in the list of datasets. The download might take some time depending on your network connection.

%%time

!kaggle datasets download -d iamsouravbanerjee/game-of-thrones-dataset --unzip # Downloads and unzips the dataset
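
After the download, it can help to confirm which files were extracted before loading them, since the file name inside the archive may differ from the dataset name. Here is a minimal sketch using only the standard library; the helper name `list_csv_files` is hypothetical, and the demonstration uses a temporary directory with made-up files rather than the real download.

```python
import tempfile
from pathlib import Path


def list_csv_files(directory="."):
    """Return the names of the CSV files in directory, sorted alphabetically."""
    return sorted(p.name for p in Path(directory).glob("*.csv"))


# Demonstration with a temporary directory and made-up files
csv_demo_dir = Path(tempfile.mkdtemp())
(csv_demo_dir / "example.csv").write_text("a,b\n1,2\n")
(csv_demo_dir / "notes.txt").write_text("not a csv\n")
found = list_csv_files(csv_demo_dir)
print(found)
```

In the notebook, calling `list_csv_files()` with no argument inspects the current working directory, which is where the download command above places the files.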

Now that the dataset is downloaded, let us see what the CSV file looks like. We will use pandas to load and display the data.

data = pd.read_csv("Game_of_Thrones.csv", header=0)
df = data.copy()
df.head()
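
Beyond `df.head()`, a few more pandas calls give a quick picture of the data. The sketch below uses a tiny made-up table instead of the real file, so the column names `Season` and `Title` are assumptions for illustration and may not match the dataset.

```python
import io

import pandas as pd

# Made-up sample standing in for the downloaded CSV; the columns are hypothetical
sample_csv = io.StringIO(
    "Season,Title\n"
    "1,Episode A\n"
    "1,Episode B\n"
    "2,\n"
)
sample_df = pd.read_csv(sample_csv, header=0)

n_rows, n_cols = sample_df.shape                 # number of rows and columns
missing = sample_df.isna().sum()                 # missing values per column
per_season = sample_df.groupby("Season").size()  # number of rows per season

print(n_rows, n_cols)
print(missing)
print(per_season)
```

The same three calls can be applied to `df` once the real file is loaded, swapping `Season` for a column that exists in it.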

GitHub Repo

You can find the complete SageMaker Studio Notebook on my GitHub.

Additional Features

This was a simple demonstration of data ingestion. You can build on this solution by extracting insights from the data using pandas, or perhaps by training an ML model. If you do, please feel free to share your project with me.

Stay curious, keep learning and keep building!!!

DEV Community
