DEV Community

Tomasz Wegrzanowski

How to access Kaggle data from command line

The war isn't going anywhere for now, so every couple of days I have to go through the following steps to update my Russian losses tracker:

  • download zip from Kaggle
  • run update_csv script
  • optionally, verify that data looks right with git diff, as occasionally there's a typo which makes losses go backwards (it happened only a few times, and it always gets corrected in the next update)
  • delete archive.zip

The annoying part is that Kaggle requires me to be logged in through a browser to download data, so I can't just replace that step with a curl request.

So let's try to improve this flow a bit.

Get Kaggle API token

You need to create an account on Kaggle.

Then go to your account settings by clicking the icon in the top right and selecting Account (https://www.kaggle.com/<name>/account).

There's a "Create New API Token" button, which creates a new API token and downloads it as kaggle.json. Create the ~/.kaggle folder, and save that file as ~/.kaggle/kaggle.json.

Kaggle will complain if you don't secure the file, so run this: chmod 0600 ~/.kaggle/kaggle.json
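
The token setup above can be sketched as a small shell function. The function name install_kaggle_token and the default source path ~/Downloads/kaggle.json are assumptions for illustration; pass whatever path your browser actually saved the file to.

```shell
# Hypothetical helper: move a downloaded kaggle.json into place and lock it down.
# $1 = where the browser saved the token (assumed default: ~/Downloads/kaggle.json)
# $2 = target directory (default: ~/.kaggle, as described above)
install_kaggle_token() {
  src="${1:-$HOME/Downloads/kaggle.json}"
  dest_dir="${2:-$HOME/.kaggle}"
  if [ ! -f "$src" ]; then
    echo "token file not found: $src" >&2
    return 1
  fi
  # Create the folder, move the token, then make it readable by the owner only.
  mkdir -p "$dest_dir" &&
    mv "$src" "$dest_dir/kaggle.json" &&
    chmod 0600 "$dest_dir/kaggle.json"
}
```

Calling install_kaggle_token with no arguments uses the assumed defaults; it returns a non-zero status if the token file is not where it expects.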

Install Kaggle CLI tools

If you have Python 3 installed, you just need to run pip3 install kaggle

Download dataset

The user name and the ID of the dataset are both in the URL, so to download https://www.kaggle.com/datasets/piterfm/2022-ukraine-russian-war you need to run:

$ kaggle datasets download piterfm/2022-ukraine-russian-war

It will save the dataset as 2022-ukraine-russian-war.zip. There are extra options, such as -p to choose the download directory and --unzip to extract the archive after downloading.
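
If you start from the dataset URL, the argument for the download command is just the part after /datasets/. Here is a minimal sketch of that mapping; dataset_slug is a hypothetical helper name, and it assumes a plain dataset URL without a query string.

```shell
# Hypothetical helper: turn a Kaggle dataset URL into the owner/dataset slug.
# Assumes the URL has the form https://www.kaggle.com/datasets/<owner>/<dataset>
dataset_slug() {
  url="${1%/}"                        # drop a trailing slash if present
  printf '%s\n' "${url#*/datasets/}"  # keep everything after /datasets/
}
```

The result can be passed straight to the CLI, for example kaggle datasets download "$(dataset_slug "$url")".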

Full process

Now I can automate the whole process:

$ kaggle datasets download piterfm/2022-ukraine-russian-war
$ ./update_csv 2022-ukraine-russian-war.zip
$ trash 2022-ukraine-russian-war.zip
$ git add -u
$ git commit -m 'Data Update'
$ git push

And since it's just a series of commands, I can even make it run automatically every day, without any intervention.

I could also add some kind of data checks to the process, so if there's anything weird like numbers going backwards, it would stop the update and wait for the next day. But overall, I'm happy with how it all ended up.
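
A check for numbers going backwards could look roughly like this. It is a sketch, not what the tracker actually runs: check_non_decreasing is a hypothetical name, and it assumes a CSV with a header row, rows in chronological order, and no quoted commas inside fields.

```shell
# Hypothetical check: succeed only if column $2 of CSV file $1 never decreases.
# Assumes a header row, chronological rows, and simple comma-separated fields.
check_non_decreasing() {
  awk -F, -v col="$2" '
    NR == 1 { next }                  # skip the header row
    NR > 2 && $col + 0 < prev {       # value dropped below the previous row
      printf "line %d: %s < %s\n", NR, $col, prev
      bad = 1
    }
    { prev = $col + 0 }               # remember this row for the next comparison
    END { exit bad + 0 }              # non-zero exit status if any drop was seen
  ' "$1" >&2
}
```

In an automated run, a failing check would stop the update before the commit, for example check_non_decreasing data.csv 2 || exit 1, where data.csv and column 2 are placeholders for the real file and column.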
