Thinking about setting up your first machine learning project and don't know where to start? This beginner's checklist will walk you through a step-by-step thought process to get you started!
I'm assuming you arrived on this blog because you've heard of the concept of Machine Learning (ML) and Artificial Intelligence (AI) and watched a couple of videos here and there!
If you haven't done so already, you can explore more by watching some cool TED talks here.
You can also explore an online course - there are many free courses available. You can check out this one by Udacity on 'Intro to Machine Learning' 🤖.
Don't worry about writing the code yet, just get a feel for what's happening in the Machine Learning world. Machine Learning is a concept within Artificial Intelligence (AI), as AI covers many fields. Here's a fantastic blog if you would like to explore more about 'Machine Learning vs. Artificial Intelligence'.
Ok, onto the next step! 😊 Don't worry about achieving something perfect the first time round. The best way to learn is to get stuck into a small project.
First things first, follow a tutorial to help you get started! Build something small to begin with and ask questions like:
- 'What does the data source look like?'
- 'How is the data being formatted?'
- 'What is the function of the code?'
- 'Why is this line of code here?'
- 'How is the machine learning model working?'
- 'How is this implemented in the code?'
This tutorial from Scikit-Learn is a good starting point to help you to get stuck in. The example shows how Scikit-Learn can be used to recognise images of hand-written digits.
Experiment! Try changing the type of classifier and performance metrics to see if this makes a difference to the ability of your model to identify the handwritten digits.
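If you'd like to see what that experiment could look like, here is a minimal sketch. It assumes you have Scikit-Learn installed and uses its built-in digits dataset. The two classifiers, the two metrics, and the 75% / 25% split are just my picks for illustration, not the only sensible choices.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the built-in handwritten digits dataset (8x8 images, flattened).
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Two classifiers to compare; swap in others to experiment.
classifiers = {
    "SVC": SVC(gamma=0.001),
    "LogisticRegression": LogisticRegression(max_iter=5000),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    # Two different performance metrics for the same predictions.
    results[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "macro_f1": f1_score(y_test, predictions, average="macro"),
    }
    print(name, results[name])
```

Try adding a third classifier to the dictionary and see where it lands.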
Congratulations! 🎉😎 You just built your first machine learning project! Take it easy ok, there's a lot to take in already.
Once you've tried one or two example projects, you can start to tackle one of your very own!
Here are some questions to help you:
- Is Machine Learning the right approach for your project?
Sketch out some ideas in your notebook and refine them. What questions are you trying to answer? What is your goal? Start small! Machine Learning may or may not be the right approach for your project, so before you invest a lot of time, share your idea around to sense-check that it is right for you.
- Are you trying to work with images? Are you working with numerical data?
Understand what kind of data you will be working with - this will guide you towards the appropriate solution for your problem.
- Where are you going to get your dataset from?
Before you can build the Machine Learning model, you need access to a dataset. For all projects, data acquisition is a very important step.
- How big is your dataset? Is it the best dataset for your project? Are there issues with the data?
Delve into your dataset and understand its structure. What is the format of the data? What are the key features of the dataset? Which parts of the dataset do you want to capture? Which bits are relevant? Is your dataset big enough?
N.B. You may not need all of your dataset. Be aware of biases in the dataset sample itself!
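A few lines of code can answer several of those questions quickly. This is only a sketch: it uses Scikit-Learn's built-in digits dataset as a stand-in for your own data, and the checks shown (size, value range, missing values, examples per class) are a starting point rather than a full audit.

```python
import numpy as np
from sklearn.datasets import load_digits

# Stand-in dataset; replace with your own features X and labels y.
digits = load_digits()
X, y = digits.data, digits.target

# How big is the dataset, and what do the values look like?
print("samples, features:", X.shape)
print("value range:", X.min(), "to", X.max())
print("missing values:", int(np.isnan(X).sum()))

# Is any class badly under-represented? That can be a source of bias.
labels, counts = np.unique(y, return_counts=True)
for label, count in zip(labels, counts):
    print(f"class {label}: {count} samples")
```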
Wow! That's a big step out of the way, now onto choosing your model. :)
Are you going to let the model find patterns by itself in unlabelled data (unsupervised learning), or are you going to guide the training with labelled examples (supervised learning)? From the previous steps, you should hopefully have the gist of the problem type. Is it a classification, regression or clustering problem, or something else?
Here's a cool Machine Learning Map to help you decide.
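To make the supervised vs. unsupervised difference concrete, here is a small sketch on Scikit-Learn's built-in iris dataset. The choice of dataset and of the two models (k-nearest neighbours and k-means) is mine, purely for illustration. The thing to notice is which call receives the labels `y` and which does not.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Supervised: the model is given the features AND the correct labels.
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X, y)
print("supervised predictions:", classifier.predict(X[:3]))

# Unsupervised: the model is given only the features and groups similar rows.
# The cluster ids it returns are arbitrary and are not the original labels.
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = clusterer.fit_predict(X)
print("cluster ids:", cluster_ids[:3])
```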
Ok, data is never in the form you want it to be... There will be some data processing and formatting to do before the data is in a form that's suitable for your machine learning project.
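As a rough idea of what that processing can involve, the sketch below fills in missing values and rescales each column. The tiny table of numbers is made up for this example, and mean imputation plus standard scaling are assumed choices; your own data may need something quite different.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up toy data: two numeric columns with a couple of gaps (NaN).
raw = np.array([
    [170.0, 65.0],
    [np.nan, 80.0],
    [182.0, np.nan],
    [158.0, 52.0],
])

# Step 1: replace each missing value with its column mean.
# Step 2: rescale each column to zero mean and unit variance.
prep = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
clean = prep.fit_transform(raw)

print(clean)
```

In a real project you would fit these steps on the training data only and then apply them to the test data, so nothing leaks from one to the other.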
There are so many options out there. Best to explore for yourself and pick what rocks your boat 🚣. TensorFlow and Keras are a good combo, and so is Scikit-Learn :). There are pros and cons to whichever technologies you choose. If you want, you can even set up an online coding notebook like a Colab notebook 📔 (pretty much a Jupyter notebook for the Python fans out there), so you can experiment a bit. Did I mention you can run your machine learning using a GPU for super speedy stuff?
If you want a quick run down on the techniques of Machine Learning, check out the crash course from Google.
Once you have your dataset ready, a consideration is splitting your dataset into a training and a testing dataset. The training dataset is the dataset your ML model will train on; your testing dataset is the dataset your model will be tested against to check how well the model performs.
Top Tip! It is important to randomise the dataset before you split it, so the order of your dataset doesn't have a major impact on the model training process.
There are many mathematical approaches to measuring model performance, but it is important to be aware of model overfitting. This is when the model fits the training dataset too closely: it scores well on the data it was trained on but performs poorly on data it has never seen.
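One simple way to spot overfitting is to compare the score on the training data with the score on the testing data. Here is a sketch, assuming the built-in digits dataset and a decision tree with no depth limit (a model I picked because it tends to memorise its training data).

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A tree with no depth limit can keep splitting until it memorises the data.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)

# A large gap between the two numbers is a warning sign of overfitting.
print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```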
There is no single rule for the proportions. A 90% / 10% split between training and testing is one rule of thumb, but we have also seen 75% / 25% splits as well as 80% / 20% splits.
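In Scikit-Learn, shuffling and splitting can be done in one call with `train_test_split`. The sketch below assumes the built-in digits dataset and an 80% / 20% split. The `random_state` value is arbitrary and only there to make the shuffle repeatable, and `stratify=y` is an optional extra that keeps the class proportions similar in both parts.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# shuffle=True randomises the row order before the split is made.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,      # 20% of the rows are held back for testing
    shuffle=True,
    random_state=42,    # arbitrary seed so the split is reproducible
    stratify=y,         # keep class proportions similar in both parts
)

print("training rows:", len(X_train))
print("testing rows: ", len(X_test))
```

Change `test_size` to 0.1 or 0.25 to try the other proportions mentioned above.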
Model training is the official term to mean "Run the Machine Learning model LOL! It's about time!"
All the hard work so far has paid off! You are ready to train your model! Good luck! 👍
Here is a non-exhaustive list of the things you may want to consider:
Where are you going to do the model training? If your dataset is massive, you may consider how long the training process may take.
Consider doing test runs on a small sample of your dataset to check that your model can actually train! Seriously, you don't want to be waiting around for ages and come back to find that there were bugs in the way you interfaced the data to the machine learning model! (Been there and done that LOL 😭)
How many times is your model going to run through the training dataset?
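Here is a sketch of both ideas: a quick test run on a small sample first, then the full training with a set number of passes (often called epochs) through the training dataset. The model (`SGDClassifier`), the 100-row sample, and the 5 epochs are all assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X = X / 16.0  # pixel values run from 0 to 16; rescale to 0..1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
classes = np.unique(y)

# Test run: train on 100 rows to check the data and model fit together.
smoke_model = SGDClassifier(random_state=0)
smoke_model.partial_fit(X_train[:100], y_train[:100], classes=classes)
smoke_model.predict(X_test[:5])  # raises an error if the shapes are wrong

# Full run: each epoch is one pass through the whole training dataset.
model = SGDClassifier(random_state=0)
rng = np.random.default_rng(0)
n_epochs = 5
history = []
for epoch in range(n_epochs):
    order = rng.permutation(len(X_train))  # reshuffle rows every epoch
    model.partial_fit(X_train[order], y_train[order], classes=classes)
    history.append(model.score(X_test, y_test))
    print(f"epoch {epoch + 1}: held-out accuracy {history[-1]:.3f}")
```

In a real project it is better to track progress on a separate validation set and keep the test set for the very end.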
Once you have a trained machine learning model, check how well it performs by testing it against a test dataset (a fancy way of saying the "data your machine learning model has never seen before").
Have a think about how you measure the model performance.
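For a classification problem, Scikit-Learn offers several ready-made measurements. The sketch below assumes the digits dataset and an `SVC` model again, and shows three of them: overall accuracy, a per-class report (precision, recall, F1), and a confusion matrix.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = SVC(gamma=0.001)
model.fit(X_train, y_train)
predictions = model.predict(X_test)  # data the model has never seen

# One number summarising how often the model is right.
accuracy = accuracy_score(y_test, predictions)
print(f"accuracy: {accuracy:.3f}")

# Precision, recall and F1 for each digit separately.
print(classification_report(y_test, predictions))

# Rows are the true digits, columns are the predicted digits.
matrix = confusion_matrix(y_test, predictions)
print(matrix)
```

The confusion matrix is handy for spotting which digits get mixed up with each other.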
Here are some strategies to improve the performance of your machine learning model. Beware of overfitting, of course!
Go back to the data source! Is this the best data source for your model? Are there any pitfalls in your selected dataset? If not, maybe you can increase the sample size (how much data you're using).
Try choosing another machine learning algorithm and do an exercise to see which one yields the best result.
Play around with the proportion of data you set aside for training and testing.
Refine the training process: see if you can increase the number of times you run through the dataset, although this will slow down the training process.
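The "try another algorithm" exercise can be sketched with cross-validation, which trains and tests each model on several different splits and averages the scores. The three candidate models and the 5 folds below are assumptions for illustration, again on the built-in digits dataset.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

candidates = {
    "SVC": SVC(gamma=0.001),
    "KNeighbors": KNeighborsClassifier(n_neighbors=5),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}

mean_scores = {}
for name, model in candidates.items():
    # cv=5: split the data into 5 parts, test on each part in turn.
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {mean_scores[name]:.3f}")

best = max(mean_scores, key=mean_scores.get)
print("best on this dataset:", best)
```

The winner here only tells you what works best on this particular dataset, so repeat the exercise on your own data.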
You totally rock! Give yourself a pat on the back! Congratulations on doing Machine Learning 🎉🎉🎉🎉🎉🎉🎈🙌