Recently, I have been focusing on learning AI/ML. After overcoming many roadblocks and mistakes, I can now confidently share a successful solution.
I intend to explain one specific foundation of ML: data ingestion. I will demonstrate how you can import a Kaggle dataset into a SageMaker Studio notebook. Amazon SageMaker is an ML tool that enables developers to rapidly create, train, and deploy machine-learning models in the cloud. Kaggle is an online community platform that hosts numerous datasets and ML challenges for data scientists and machine learning enthusiasts. Integrated properly, the two tools complement each other: Kaggle supplies the data, and SageMaker supplies the environment to work with it.
Prerequisites
- A SageMaker Studio Notebook is needed. If you don't have one already, you can follow this guide to create one.
- A Kaggle account is needed. You can register for one here.
🛠Let's build!!
First, we will import the Python packages that will be used in the notebook.
Import Packages
import pandas as pd
import time
Install Kaggle CLI
!pip install -q kaggle
To use the Kaggle API, you must have an account and an API token. You can follow this guide to generate your API token; it is completely free. The commands below create a JSON file to store your Kaggle credentials. Replace the <username> and <api_key> placeholders with your own username and API key.
!mkdir -p ~/.kaggle && touch ~/.kaggle/kaggle.json # Creates the .kaggle directory (if missing) and the JSON file that stores the Kaggle API credentials
kaggle_api_token = {"username":"<username>","key":"<api_key>"} # Insert your own username and API Key here
We then write our Kaggle credentials to the JSON file we created. The hard-coded /root path matches ~/.kaggle only when the notebook runs as root, as it does here.
import json
# Writes API Credentials to Kaggle file
with open('/root/.kaggle/kaggle.json', 'w') as file:
    json.dump(kaggle_api_token, file)
For security reasons, we must ensure that other users do not have read access to our Kaggle credentials.
!chmod 600 ~/.kaggle/kaggle.json
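The three steps above (create the file, write the credentials, restrict the permissions) can also be done in a single Python cell. The following is only a minimal sketch: the function name write_kaggle_credentials and its config_dir parameter are hypothetical and are not part of the Kaggle package. It assumes a Linux environment such as the one this notebook runs in.

```python
import json
import os
from pathlib import Path


def write_kaggle_credentials(username, key, config_dir=None):
    """Write kaggle.json with owner-only permissions and return its path.

    Hypothetical helper for illustration; not part of the Kaggle package.
    """
    # Default to ~/.kaggle, the location used by the commands above.
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "kaggle.json"

    # Create the file with mode 0o600 so a new file is never readable by other users.
    fd = os.open(config_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as file:
        json.dump({"username": username, "key": key}, file)

    # Re-apply the mode in case the file already existed with looser permissions.
    os.chmod(config_path, 0o600)
    return config_path
```

Calling write_kaggle_credentials("<username>", "<api_key>") with your own values would then replace the touch, json.dump, and chmod cells.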
Now that our access token is configured, we can list the available datasets.
!kaggle datasets list # List available datasets
If the above command was successful, you will see a list of available datasets.
The command below downloads the specified dataset and unzips it into the current working directory. You can replace the dataset name with any of the names returned in the list of datasets. Downloading the dataset might take some time depending on your network connection.
%%time
!kaggle datasets download -d iamsouravbanerjee/game-of-thrones-dataset --unzip # Downloads and unzips the dataset
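If you download a different dataset, the name of the extracted file will differ too, so it can help to check what actually landed in the working directory before loading anything. Here is a small sketch of that check; list_csv_files is a hypothetical helper name, and it assumes the dataset was unzipped into the current directory.

```python
from pathlib import Path


def list_csv_files(directory="."):
    """Return the names of the CSV files in `directory`, sorted alphabetically."""
    # glob("*.csv") only looks at the top level of the directory, not subfolders.
    return sorted(path.name for path in Path(directory).glob("*.csv"))


# Show the CSV files available in the current working directory.
print(list_csv_files())
```

Any name in the printed list can be passed to pd.read_csv.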
Now that the dataset is downloaded, let us look at the contents of the CSV file. We will use pandas to load the data and display the first few rows.
data = pd.read_csv("Game_of_Thrones.csv", header=0)
df = data.copy()
df.head()
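Beyond df.head(), a quick sanity check of the loaded data is often useful before any analysis. The sketch below is one way to do it; quick_overview is a hypothetical helper, and it makes no assumptions about the column names in this particular dataset.

```python
import pandas as pd


def quick_overview(df):
    """Summarize a DataFrame: row count, column names, and missing values per column."""
    return {
        "rows": len(df),
        "columns": list(df.columns),
        # isna() flags missing cells; sum() counts them per column.
        "missing": df.isna().sum().to_dict(),
    }
```

Running quick_overview(df) on the DataFrame loaded above shows at a glance how large the dataset is and which columns have gaps.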
GitHub Repo
You can find the complete SageMaker Studio Notebook on my GitHub.
Additional Features
This was a simple demonstration of data ingestion. You can build on this solution by extracting insights from the data using pandas or by training an ML model. If you do, please feel free to share your project with me.
Stay curious, keep learning and keep building!!!