Autodidactic Data Science (5 Part Series)
Here's my weekly accountability report for my self-study approach to learning data science.
Fast.ai Week 3
I made progress on the week 3 instruction, which covered datablocks, multi-class labeling, image regression, and image segmentation.
The foundational class of fast.ai is the datablock, which is used to create a databunch. A datablock essentially strings together a series of preprocessing steps: specifying where the data is located, how to import and organize it so that it can be labeled, how to split it into training and validation sets, and which transformations and augmentations to apply. The result is an object that we can then use for model building.
I also learned why last week's Aurebesh project was behaving oddly: it was mirroring the letters horizontally to augment the data. That makes sense for classifying cats, but not so much for letters. I quickly reran my model from last week and was able to lower the error rate a bit.
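To make the mirroring problem concrete, here is a minimal sketch in plain Python. It is not fast.ai code, and the two glyphs are made-up shapes rather than real Aurebesh letters. It only shows that a horizontal flip leaves a symmetric shape alone but turns an asymmetric one into a different picture, which is exactly what you don't want when the label depends on orientation. (In fastai v1 I believe the relevant switch is the `do_flip` argument of `get_transforms`, but check the docs for your version.)

```python
# Toy 5x5 glyphs: '#' is ink, '.' is background.
# These shapes are invented for illustration; they are not real Aurebesh letters.
ASYMMETRIC_GLYPH = [
    "#....",
    "#....",
    "#....",
    "#....",
    "####.",
]  # roughly an "L"

SYMMETRIC_GLYPH = [
    "..#..",
    ".#.#.",
    "#...#",
    "#####",
    "#...#",
]  # roughly an "A"


def flip_horizontal(glyph):
    """Mirror a glyph left-to-right by reversing each row."""
    return [row[::-1] for row in glyph]


def is_changed_by_flip(glyph):
    """Return True if mirroring produces a different picture."""
    return flip_horizontal(glyph) != glyph


if __name__ == "__main__":
    for name, glyph in [("asymmetric", ASYMMETRIC_GLYPH),
                        ("symmetric", SYMMETRIC_GLYPH)]:
        print(name, "changed by flip:", is_changed_by_flip(glyph))
        print("\n".join(flip_horizontal(glyph)))
        print()
```

A cat facing left is still a cat, so flipping gives the model free extra examples. A mirrored letter may be a different letter or no letter at all, so the same augmentation adds mislabeled data.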
After that, I dug into the multi-class labeling problem by searching Kaggle for datasets to explore. I settled on Human Protein Classification, a competition from nine months ago, while skipping over the interesting but not-immediately-obvious-how-to-go-about-modeling-it datasets on Hierarchical Taxonomy of Wikipedia Articles and The Toxic Comment Classification Challenge. I was able to get the data into a datablock in part because of the helpful pre-processing kernels for the competition. (Note to self: datasets for competitions will not only be cleaner but also have far more forum posts and kernels to learn from.) For example, each image comes as four files, one per channel (RGBY), but the green channel was the most relevant for proteins, so in the interest of getting something up and running, I ignored the other three channels.
With that pointer and a few lines of code, I was able to get a model running. On the first attempt, its f-score was 0.63, high enough to place it in the top 40 (out of 2000+ entries), which was both reassuring and not. Jeremy's comment was that if you are in the top 10% you're doing well, because the folks there know what they are doing. This was reassuring until I realized I don't feel like I know what I'm doing. I then tried to rerun it using more detailed versions of the images and in the process ran out of memory, got CUDA errors, and broke the kernel. And then I ran out of time to investigate further.
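For anyone wondering what an f-score means when each image can carry several labels at once, here is a rough sketch in plain Python. It assumes macro averaging (compute F1 per label, then take the plain mean) and a 0.5 probability threshold. Both are my assumptions for illustration, and the tiny dataset is invented. It is not the competition's scoring code or fast.ai's metric.

```python
def binarize(probabilities, threshold=0.5):
    """Turn per-label probabilities into 0/1 predictions.

    The 0.5 threshold is an illustrative assumption.
    """
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]


def f1_for_label(y_true, y_pred, label):
    """F1 score for a single label column across all samples."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 1 and p[label] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 0 and p[label] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 1 and p[label] == 0)
    if tp == 0:
        return 0.0  # no true positives: treat F1 as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    n_labels = len(y_true[0])
    return sum(f1_for_label(y_true, y_pred, k) for k in range(n_labels)) / n_labels


if __name__ == "__main__":
    # Four made-up samples with three possible labels each.
    y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
    probs = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.3], [0.7, 0.4, 0.1], [0.6, 0.1, 0.9]]
    y_pred = binarize(probs)
    print("macro F1:", round(macro_f1(y_true, y_pred), 3))
```

Because every label counts equally in the macro average, doing badly on a rare label drags the score down as much as doing badly on a common one.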
Starting this week, I have to decide whether to spend more time to really grok what's happening or to move forward. I would love to linger and really get what's going on, but there's also something to be said for keeping up the momentum. I do wonder how folks who chose self-study make those decisions.
I'm also working my way through a Udemy Machine Learning course. It's review for me, but this time I'm taking notes and benefiting from a better perspective on how all the pieces fit together, plus more solid coding skills. This week covered common preprocessing steps, as well as simple and multiple linear regression using StatsModels and scikit-learn. I made loads of flashcards.
And today I had the opportunity to chat with an industry veteran to just pick his brain. Grateful for all the senior folks who take 30 minutes out of their day to share what they know.
A crimp in my learning flow this week occurred when my laptop went from occasionally acting erratically to constantly acting erratically. This required some sleuthing on my part. IT repairs are so not my thing. Gimme data to work with. But as with my memory issues above, there's a whole ecosystem at play in doing data science, and as much as I dislike hardware and systems-level issues, it helps to be resilient and knowledgeable enough to work through the problem. In this case, a shoutout to all those mom-and-pop repair shops that put out "how to fix your X" YouTube videos. I was able to repair not only my MacBook Pro this week but also my sewing machine, which went kaput mid-Halloween costume assembly. The real lesson (besides having the right tools) is that there are a lot of layers of technical components involved in doing tech. Knowing who to turn to for help while not panicking isn't limited to machine learning models.