And I wrote my happy songs
Every child may joy to hear
— William Blake
In the previous chapter we prepared an environment to launch our lovely job. The only thing left before we let the ship sail freely is to split the whole job into logical pieces. Why?
Well, things might go wrong. One might want to tune something on the last step only. Cloud computing saves our laptop uptime, but it costs money. In most cases, while playing with model parameters, we do not need to start from scratch and redo the whole bunch of preparation steps, like cleaning up text, splitting the input into train and test datasets, and the like.
For text processing I would suggest separating the following steps (for images or anything else the steps might differ, but the whole approach would still work):
- drop all the unnecessary data and keep only what we are going to process
- clean the input with regular expressions of a sort
- split the input into train and test datasets
- train the model on the train dataset
- check the model on the test dataset
My advice would be to split the script into several classes, each playing its own role. Each file would also contain a __main__ section, allowing it to be executed as a standalone script accepting the parameters relevant to that particular step.
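A minimal sketch of one such step file might look like this (the Janitor class name is the one used in the final script below; the command-line parameter and the cleanup body are placeholders):

```python
import argparse


class Janitor:
    """Step: drop unused data and clean up the text."""

    def __init__(self, **params):
        # keep the step parameters around; each step accepts only its own
        self.params = params

    def cleanup(self, remote_dir):
        ...  # load the input from remote_dir, clean it, save it back


if __name__ == "__main__":
    # the __main__ section makes the file runnable as a standalone script
    parser = argparse.ArgumentParser()
    parser.add_argument("--remote-dir", default="data/")
    args = parser.parse_args()
    Janitor().cleanup(args.remote_dir)
```

The same skeleton works for the Splitter and Modeller steps; only the parameters and the method bodies differ.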
The first step should be done locally to decrease the size of the file we are going to upload. That saves us upload time and the space taken by the input data in our bucket.
For that, one should simply load the CSV, get rid of all the unnecessary columns, and save it back. That is it. The result should be copied to the bucket, and all subsequent steps will be done in the cloud.
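A sketch of this local shrinking step, using nothing but the standard library (the column names here are hypothetical; substitute the ones your dataset actually has):

```python
import csv

# keep only the columns we are actually going to process
KEEP = ["text", "label"]


def shrink(src, dst):
    """Copy src to dst, dropping every column not listed in KEEP."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=KEEP)
        writer.writeheader()
        for row in reader:
            writer.writerow({k: row[k] for k in KEEP})
```

After running it, `dst` is what gets copied to the bucket.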
In most articles, books, and talks on the subject, people recommend getting rid of all the punctuation and expanding contractions (“I’d’ve been” ⇒ “I would have been”) to help the trainer recognize the same words. Most if not all propose obsolete regular expressions to do that.
Texts on the internet differ from those produced by typewriters. They all use Unicode now. Nowadays the apostrophe might be both “'” typed by a lazy post author and “’” if the creator of the text has a sense of typographical beauty. The same goes for quotes ' " ’ ”, dashes - -- – –, numbers 1 2 ¹ ² ½, and even ʟᴇᴛᴛᴇʀs 𝒾𝓃 𝓉𝒽𝑒 𝖆𝖗𝖙𝖎𝖈𝖑𝖊. Legacy regular expressions would not recognize all that zoo.
Luckily enough, modern regular expressions have matchers that match exactly what we need semantically. Those are called character classes and might be used to match, e.g., all the punctuation, or all the letters and digits.
For text processing I would suggest preserving only alphanumerics (letters and digits) and punctuation. The latter should be unified (all double quotes converted to typewriter double quotes, and the same for single quotes and dashes).
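A sketch of this cleanup using the standard `unicodedata` module instead of a regular expression: NFKC normalization folds superscripts and mathematical script letters into their plain forms, the Unicode general category tells letters (L*), digits (N*) and punctuation (P*) apart, and a small mapping table unifies typographic quotes and dashes (the table here is illustrative, not exhaustive):

```python
import unicodedata

# map typographic variants to their typewriter equivalents
UNIFY = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
}


def clean(text):
    # fold compatibility characters first: "¹" becomes "1", "𝒾" becomes "i"
    text = unicodedata.normalize("NFKC", text)
    out = []
    for ch in text:
        ch = UNIFY.get(ch, ch)
        cat = unicodedata.category(ch)
        # keep letters (L*), numbers (N*), punctuation (P*) and whitespace
        if cat[0] in "LNP" or ch.isspace():
            out.append(ch)
    return "".join(out)
```

Everything outside those classes (emoji, dingbats, other symbols) simply disappears.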
After this is done, a list of shortened forms might be used to expand all the “I’m”s to “I am”s, etc. That is basically it. Dump the result of this step to the bucket. Until the input data changes, this step might be skipped in subsequent executions of the model training process. I personally use pickles for that, but the format does not matter at all.
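The expansion itself is a table lookup; a minimal sketch with a tiny illustrative table of shortened forms (a real one would be much longer) and a pickle dump of the result:

```python
import pickle
import re

# a tiny illustrative table of shortened forms
CONTRACTIONS = {
    "i'm": "i am", "i'd": "i would", "can't": "cannot", "won't": "will not",
}


def expand(text):
    # lowercase first, so the table needs only one spelling per form;
    # \b word boundaries keep forms from matching inside other tokens
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b")
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0)], text.lower())


cleaned = expand("I'm sure I can't")
blob = pickle.dumps(cleaned)  # this is what gets dumped to the bucket
```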
The common approach is to split the dataset into two parts, one to train the model and another to test it. Split and save the result to the bucket. This process usually takes a few parameters, like padding and max-words, which are rarely changed. Chances are you would not tweak them very often.
Save the result.
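The split itself can be sketched in a few lines of plain Python; the fixed seed matters, because reruns of this step should reproduce the same split (the function name and ratio are my own choices here):

```python
import random


def train_test_split(samples, test_ratio=0.2, seed=42):
    """Shuffle deterministically, then cut off the last test_ratio part."""
    rnd = random.Random(seed)  # fixed seed: reruns give the same split
    shuffled = samples[:]
    rnd.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


train, test = train_test_split(list(range(100)))
```

Both halves then go to the bucket, so the check step can pick the test set up without resplitting.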
Train the model. Make sure you log everything to the logger, not to standard output, to preserve the logs of the process. I usually use
logging.info for debugging messages and
logging.warning for important ones I want to be reported (logging.warn is deprecated). Google ML Log Viewer allows filtering log messages by severity, so later one might glance at the important stuff only.
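A minimal sketch of that logging setup (the logger name and the messages are made up for illustration):

```python
import logging

# route everything through the logging module, not print();
# the cloud runtime collects these records with their severity attached
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("trainer")

logger.info("epoch %d: loss %.4f", 1, 0.4321)          # debugging detail
logger.warning("accuracy plateaued, stopping early")   # important, filterable
```

Filtering by severity in the log viewer then shows only the warnings and above.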
Save the model.
I run the check in the cloud as well. Later on, one might download the model locally and use the test set saved in step 3 to test it, but I am fine with examining logs in the cloud.
If you are like me and have the steps put into classes, the resulting __main__ of the package that will be run in the cloud would look like:

```python
if __name__ == "__main__":
    # parse all the arguments
    if arguments.pop('janitize'):
        Janitor(**arguments).cleanup(remote_dir)
    if arguments.pop('split'):
        Splitter(keep_n=arguments.pop('keep_n')).split(remote_dir)
    if arguments.pop('train'):
        model = Modeller(**arguments).train_model(remote_dir)
    if arguments.pop('check'):
        Modeller(**arguments).check_model(remote_dir, model=model)
```
And that is all I wanted to share for now. Happy clouding!