Patrick Loeber

Posted on Oct 19, 2020 • Originally published at python-engineer.com

5 Machine Learning BEGINNER Projects (+ Datasets & Solutions)

#python #machinelearning

In this tutorial I share 5 beginner machine learning projects with you and give you tips on how to solve all of them. These projects are for complete beginners and should teach you some basic machine learning concepts. With each project the difficulty increases a little and you'll learn a new algorithm.

For each project I give you an algorithm that you can use and include the links to the datasets, so you can start right away!

For all of these projects I recommend using the scikit-learn library. This is the go-to library in Python when it comes to machine learning. It's incredibly easy to get started with this library and to implement your own machine learning algorithms with it.

Regression vs. Classification

Before we go over the projects, you should know about the 2 basic types of machine learning tasks: regression and classification.

Fundamentally, classification is about predicting a label, that is, a discrete class value, while regression is about predicting a quantity, that is, a continuous value.

Project 1

As a first project I recommend starting with a regression problem. For this I actually recommend doing 2 projects. The first is a super simple project to predict a salary based on the number of years of experience. It only contains 2 variables, so you stay in 2 dimensions, which should give you a good understanding of how the model works. After that I recommend the Boston Housing dataset. Here you should predict the price of a home based on multiple different variables. The algorithm you should use here is the Linear Regression model. This is one of the easiest algorithms and shouldn't be too hard to understand.

Datasets

Algorithm

Linear Regression
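To make the salary project more concrete, here is a minimal sketch of a one-variable linear regression written in plain Python. The numbers are invented toy data, not the real dataset, and the helper names `fit_line` and `predict` are just for illustration. In practice scikit-learn's `LinearRegression` class does the fitting for you and also handles multiple input variables.

```python
# Toy data, invented for illustration: years of experience -> salary.
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [40000.0, 45000.0, 50000.0, 55000.0, 60000.0]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict a continuous value (the salary) for a single input."""
    return slope * x + intercept


slope, intercept = fit_line(years, salary)
print(slope, intercept)
print(predict(6.0, slope, intercept))
```

Because there is only one input variable, you can plot the points and the fitted line in 2 dimensions and see directly what the model has learned.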

Project 2

After that I recommend tackling your first classification problem. The dataset is the Iris dataset. This is probably the most famous dataset in the world of machine learning, and everyone should have solved it at least once. Here we have samples from 3 different flower species, and for each sample we have 4 different features that describe the flower. With this information we then want to predict the species of the flower. As algorithm I recommend the K Nearest Neighbor (KNN) algorithm. This is one of the simplest classification algorithms but works pretty well here. The species are clearly distinguishable, so you should be able to train a good KNN model and reach close to 100% correct predictions.

I know everyone uses the Iris dataset as a first example, so if you're tired of seeing it and want an alternative, you can check out the Penguin dataset, where we want to predict the species of a penguin based on certain features.

Datasets

Algorithm

K Nearest Neighbor
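If you want to see the idea behind KNN before using a library, here is a small sketch in plain Python. The samples are made up for illustration and only use 2 features instead of the 4 in the Iris dataset, and `knn_predict` is a hypothetical helper name. In scikit-learn the corresponding class is `KNeighborsClassifier`.

```python
import math
from collections import Counter

# Toy samples, invented for illustration: (feature vector, label).
# Real Iris samples have 4 features; 2 are used here to keep it short.
training_data = [
    ((1.4, 0.2), "setosa"),
    ((1.5, 0.2), "setosa"),
    ((1.3, 0.3), "setosa"),
    ((4.5, 1.5), "versicolor"),
    ((4.7, 1.4), "versicolor"),
    ((4.4, 1.3), "versicolor"),
]


def knn_predict(sample, data, k=3):
    """Predict the label of `sample` by majority vote among its k nearest neighbors."""
    # Sort the training points by Euclidean distance to the sample
    by_distance = sorted(data, key=lambda item: math.dist(sample, item[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common label among the k neighbors wins
    return Counter(nearest_labels).most_common(1)[0][0]


print(knn_predict((1.6, 0.25), training_data))
print(knn_predict((4.6, 1.45), training_data))
```

The only real choice you have to make is `k`, the number of neighbors that vote, so it is worth trying a few values and comparing the results.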

Project 3

Next, I recommend the Breast Cancer dataset. This is another famous dataset with the interesting task of predicting whether a cancer cell is good or bad (or in medical terms: benign or malignant). Here we have 30 different features for each cancer cell that have been computed from medical images. This is certainly more complex and more difficult than the project before, but you should still be able to reach an accuracy of 95% here. As algorithm I recommend trying out the Logistic Regression model. This is similar to the Linear Regression model from the beginning. Don't be confused by the name: even though it has Regression in its name, it is actually used for classification tasks. The Logistic Regression algorithm also models a continuous value, but this value is a probability between 0 and 1 and can therefore be used for classification. I also recommend having a look at a new technique called feature standardization. The 30 different features may have values in very different ranges, and this might confuse the model a little bit. So play around with feature standardization here and see if you can improve your model even further with it. (Note: Feature standardization is not required for Logistic Regression, but it's still an important technique and can be important for other classifiers here.)

Dataset

https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Algorithm

Logistic Regression
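The two ideas from this project, feature standardization and turning a continuous value into a class, can be sketched in a few lines of plain Python. The feature values, the weights, and the helper names are all made up for illustration; a real model learns its weights from the training data. In scikit-learn you would typically reach for `StandardScaler` and `LogisticRegression` instead.

```python
import math
from statistics import mean, pstdev


def standardize(column):
    """Rescale a list of values to zero mean and unit standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    return [(value - mu) / sigma for value in column]


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(features, weights, bias, threshold=0.5):
    """Return (probability, class) for one sample under a logistic model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Two made-up feature columns on very different scales
radius = [10.0, 12.0, 14.0, 16.0, 18.0]
area = [300.0, 450.0, 600.0, 800.0, 1000.0]
radius_std = standardize(radius)
area_std = standardize(area)

# Hypothetical weights; a real model learns these during training
weights, bias = [1.5, 1.0], 0.0
for r, a in zip(radius_std, area_std):
    print(predict_class([r, a], weights, bias))
```

After standardization both columns live on the same scale, so neither one dominates just because its raw numbers are larger. The threshold of 0.5 is the usual default for turning the probability into a class, but you can move it if one kind of mistake is more costly than the other.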

Project 4

The fourth project is interesting because it is implemented in everyone's email client. Here we want to create a spam filter based on the Spambase dataset. This dataset contains the frequency of different words and characters: for each word, the number of appearances is counted and divided by the total number of words in the email. Spam emails clearly contain certain key words more often than normal emails, so with this information we are able to create a spam classifier. As algorithm I recommend having a look at the Naive Bayes algorithm. The new challenge here is not only to use this dataset and evaluate your model, but also to apply your trained classifier in a real application. So what do you do with a new email? What do you have to do before you pass it to the classifier? You have to find out how to transform the text of the email into the same format that your classifier expects. This should give you a better understanding of how datasets are shaped and created.

Dataset

https://archive.ics.uci.edu/ml/datasets/spambase

Algorithm

Naive Bayes
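To give you an idea of the last step, turning a new email into the format the classifier expects, here is a rough sketch in plain Python. The word list is only a small illustrative subset and `email_to_features` is a hypothetical helper; the real dataset defines its own, longer list of words and characters, and you should check its description for the exact scaling (for example whether values are stored as percentages).

```python
import re

# Illustrative subset of tracked words (an assumption for this sketch;
# the real dataset defines its own, longer list of words and characters).
VOCABULARY = ["free", "money", "credit", "meeting"]


def email_to_features(text, vocabulary=VOCABULARY):
    """Turn raw email text into one relative word frequency per tracked word."""
    # Lowercase the text and split it into alphanumeric tokens
    words = re.findall(r"[a-z0-9]+", text.lower())
    total = len(words)
    if total == 0:
        return [0.0] * len(vocabulary)
    # Appearances of each tracked word divided by the total word count
    return [words.count(word) / total for word in vocabulary]


new_email = "Free money! Claim your free credit now"
print(email_to_features(new_email))
```

The important part is that the features come out in the same order and on the same scale as the columns the classifier was trained on. Otherwise the predictions are meaningless, even if the code runs without errors.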

Project 5

The last project I recommend is the Titanic dataset. This is the first beginner project that Kaggle recommends on their site in the Getting Started section. Here we have a list of Titanic passengers with certain features like the age, the name, or the sex of the person, and we want to predict whether each passenger survived or not. The Titanic dataset requires a little more work before we can use it, because not all information in this dataset is useful and we even have missing values. So here you should learn some preprocessing techniques and how to visualize, analyze, and clean the data. Up to this point we could use the datasets right away, but in real world applications this is almost never the case, so you should definitely learn how to analyze datasets. As algorithm I recommend having a look at Decision Trees, and also at a second algorithm, the Random Forest algorithm, which extends decision trees. As another tip, I recommend having a look at the pandas library here. It makes your life a lot easier when it comes to visualizing and preprocessing the data.

Dataset

https://www.kaggle.com/c/titanic/data

Algorithm

Decision Tree

Random Forest
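Here is a small sketch of the kind of cleaning this project asks for: filling in missing ages and turning text into numbers. The rows are made up in the style of the Kaggle file (they are not real passengers), and I only use the standard library so the example runs anywhere. With pandas you would do the same thing with `read_csv` and `fillna`.

```python
import csv
import io
from statistics import median

# A few made-up rows in the style of the Kaggle file (not real passengers).
# An empty Age field stands for a missing value.
raw_csv = """Survived,Pclass,Sex,Age
0,3,male,22
1,1,female,38
1,3,female,
0,2,male,30
"""

rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Median of the ages that are present, used to fill the gaps
known_ages = [float(row["Age"]) for row in rows if row["Age"]]
median_age = median(known_ages)

features, labels = [], []
for row in rows:
    age = float(row["Age"]) if row["Age"] else median_age
    sex = 1 if row["Sex"] == "female" else 0  # encode the text as a number
    features.append([int(row["Pclass"]), sex, age])
    labels.append(int(row["Survived"]))

print(features)
print(labels)
```

Filling gaps with the median is just one simple option. Part of the project is to try different strategies and see how they change the accuracy of your Decision Tree or Random Forest.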

Conclusion

If you complete all projects you should have a good understanding of 6 popular machine learning algorithms, and you should also have a feeling for different datasets and some knowledge of how to analyze and process the data.
