Rahul Gupta
How to get a free GPU and train a spaCy model?

We have all been there: you have an interesting dataset that you want to train your shiny new model on, but no dedicated GPU — in my case, a 2015 MacBook. Unless you regularly run graphics-intensive applications such as games or numerical processing software, buying a dedicated GPU doesn't make much sense.

Luckily, there are plenty of remote GPU options available. Depending on your use case, you can choose the one that fits your needs:

| Option | Pros | Cons |
| --- | --- | --- |
| Cloud providers (GCloud, AWS, Azure) | Flexible; your data persists | Higher ramp-up time |
| Colaboratory notebook | Good documentation | Short runtimes, slow GPU, not good for long training jobs |
| JupyterHub | Open source, multi-language support | No free GPU support |
| Kaggle Notebooks | 43 free hours of GPU compute | Data I/O to the machine is a little inconvenient |

So, today we will talk about how to use a GPU on Kaggle to train a spaCy model for the Hindi language. The biggest challenge in training a model is getting clean data that accurately represents your machine learning problem. Let's do a quick search to get a list of the available datasets.

A quick search on GitHub for "Hindi tagger" yields a handful of repositories.

After browsing through these datasets, you will notice that most of them are relatively small and follow inconsistent tagging schemes that are incompatible with spaCy's input data format. Luckily, there is another dataset we can use: the Hindi Universal Dependencies treebank, used in the CoNLL shared tasks. From its README:

Summary

The Hindi UD treebank is based on the Hindi Dependency Treebank (HDTB) created at IIIT Hyderabad, India.

Introduction

The Hindi Universal Dependency Treebank was automatically converted from Hindi Dependency Treebank (HDTB) which is part of an ongoing effort of creating multi-layered treebanks for Hindi and Urdu. HDTB is developed at IIIT-H India.

Acknowledgments

The project is supported by NSF Grant (Award Number: CNS 0751202; CFDA Number: 47.070).

Any publication reporting the work done using this data should cite the following references:

Riyaz Ahmad Bhat, Rajesh Bhatt, Annahita Farudi, Prescott Klassen, Bhuvana Narasimhan, Martha Palmer, Owen Rambow, Dipti Misra Sharma, Ashwini Vaidya, Sri Ramagurumurthy Vishnu, and Fei Xia. The Hindi/Urdu Treebank Project. In the Handbook of Linguistic Annotation (edited by Nancy Ide and James Pustejovsky), Springer Press

@InCollection{bhathindi,
  Title     = {The Hindi/Urdu Treebank Project},
  Author    = {Bhat, Riyaz Ahmad and Bhatt, Rajesh and Farudi, Annahita and Klassen, Prescott and Narasimhan, Bhuvana and Palmer, Martha and Rambow, Owen and Sharma, Dipti Misra and Vaidya, Ashwini and Vishnu, Sri Ramagurumurthy and Xia, Fei},
  Booktitle = {Handbook of Linguistic Annotation},
  Editor    = {Ide, Nancy and Pustejovsky, James},
  Publisher = {Springer Press}
}

Browsing the stats.xml file in the repository gives us an overview of the different POS tags available in the dataset.

Let's open a Kaggle notebook and enable the GPU for the session from the three-dots menu > Accelerator > GPU. Note that there is a TPU option as well, but TPUs can only be used with Keras and TensorFlow models. spaCy uses neither; it has its own custom neural network library, Thinc.

Let's clone the treebank repository with the command below in the Kaggle notebook. This downloads the data from the repo into the working directory.

! git clone https://github.com/UniversalDependencies/UD_Hindi-HDTB

Let's quickly check that we have access to a GPU:

import tensorflow as tf
tf.test.gpu_device_name()  # prints something like '/device:GPU:0' when a GPU is attached
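TensorFlow just happens to be preinstalled on Kaggle, which makes it a convenient check; spaCy itself goes through Thinc and CuPy instead. A minimal sketch of the same check from spaCy's side (assuming spaCy v2.0.14 or later, where prefer_gpu() is available):

import spacy

# True if Thinc/CuPy can see a CUDA device, False otherwise
print(spacy.prefer_gpu())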

spaCy expects training input data to be in the form of JSON documents, but our downloaded data is in the .conllu format. So, we will use spacy convert to produce JSON:

! mkdir data
! spacy convert UD_Hindi-HDTB/hi_hdtb-ud-dev.conllu data
! spacy convert UD_Hindi-HDTB/hi_hdtb-ud-train.conllu data
! spacy convert UD_Hindi-HDTB/hi_hdtb-ud-test.conllu data
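If you are curious what the converted files look like, you can peek at one (a sketch; the docs > paragraphs > sentences > tokens nesting below is spaCy v2's training JSON layout, and the file name follows from the convert commands above):

import json

with open("data/hi_hdtb-ud-dev.json") as f:
    docs = json.load(f)

# each doc holds paragraphs, which hold sentences, which hold token dicts
# with 'orth', 'tag', 'head', 'dep' and 'ner' fields
first_sentence = docs[0]["paragraphs"][0]["sentences"][0]
print(first_sentence["tokens"][:3])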

Now we are all set up to start training the model:

! spacy train hi model_dir data/hi_hdtb-ud-train.json data/hi_hdtb-ud-dev.json  -g 0

Don't forget to pass the -g 0 argument to enable GPU usage for training. The trained model is saved in the model_dir directory. Training ran about 6x faster on the GPU than on my local machine; there are probably ways to make it faster still, since the job on the Kaggle notebook was CPU-bound. In any case, the whole job finished in about half an hour.
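Since we also converted the test split above, a quick way to sanity-check the final model is spaCy's evaluate command (a sketch in the same CLI style; -g 0 again selects the GPU):

! spacy evaluate model_dir/model-best data/hi_hdtb-ud-test.json -g 0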

Let's load the model and run some inference:

from spacy.lang.hi import Hindi
from spacy.gold import docs_to_json

# build a blank Hindi pipeline with the same components we trained
nlp_hi = Hindi()
nlp_hi.add_pipe(nlp_hi.create_pipe('tagger'))
nlp_hi.add_pipe(nlp_hi.create_pipe('parser'))
nlp_hi.add_pipe(nlp_hi.create_pipe('ner'))

# load the trained weights from disk
nlp_hi = nlp_hi.from_disk("model_dir/model-best/")


sentence = "मैं खाना खा रहा हूँ।"
doc = nlp_hi(sentence)
print(docs_to_json([doc]))
# ...
# {'id': 0, 'orth': 'मैं', 'tag': 'PRP', 'head': 2, 'dep': 'nsubj', 'ner': 'O'}
# ...
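You can also inspect the predictions directly on the Doc object instead of converting to JSON:

for token in doc:
    print(token.text, token.tag_, token.dep_, token.head.text)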

After the run finishes, let's gzip the model and download it locally from the file-viewer pane on the right in the Kaggle notebook.

! tar -cvzf model.tgz model_dir/model-best 
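Back on your local machine, extracting the archive recreates the model directory, which then loads exactly like the inference snippet above (a sketch; the paths follow the tar command):

# hypothetical local session, after downloading model.tgz
#   tar -xvzf model.tgz   # recreates model_dir/model-best/
from spacy.lang.hi import Hindi

nlp = Hindi()
for name in ("tagger", "parser", "ner"):
    nlp.add_pipe(nlp.create_pipe(name))
nlp = nlp.from_disk("model_dir/model-best/")
print([(t.text, t.tag_) for t in nlp("मैं खाना खा रहा हूँ।")])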

Hurray!

Here is the Kaggle notebook link, if you want to play around.
https://www.kaggle.com/rahul1990gupta/training-a-spacy-hindi-model?scriptVersionId=41283884
