Jay Codes

Posted on • Originally published at Medium

Introduction to Supervised Learning Algorithms in Machine Learning

Machine learning is a field that enables computers to learn from data and make predictions or decisions without being explicitly programmed for each task. Among the various machine learning techniques, supervised learning is one of the most widely used. This article is an introductory guide to supervised learning for beginners. We will explore its fundamental principles, discuss popular algorithms such as Linear Regression, Decision Trees, and k-Nearest Neighbors (k-NN), and work through practical examples with Python code snippets using the Scikit-learn library.

Prerequisites

Before diving into the exciting world of supervised learning algorithms, it is essential to have a basic understanding of the Python programming language and data manipulation techniques. Familiarity with concepts like variables, data types, loops, and conditional statements will be beneficial. Additionally, having a grasp of NumPy, a Python library for numerical computing, will be helpful, as it simplifies many mathematical operations that are integral to machine learning.

If you are new to Python, there are several online tutorials and resources available that can provide you with a solid foundation. As you progress through this article, we will provide Python code snippets and explanations, but a prior understanding of Python will enhance your learning experience.

What is Supervised Learning?

Imagine you are a gardener learning to differentiate between two types of flowers: roses and sunflowers. You are given a collection of flowers, and each one has a label telling you whether it's a rose or a sunflower. By observing and learning from these labeled flowers, you start to recognize patterns that distinguish the two types.

In supervised learning, the computer follows a similar process. It learns from labeled examples to make predictions on new, unseen data. The "supervision" comes from the labeled data, which acts as a teacher guiding the algorithm's learning process.

Supervised learning can be used for both regression and classification tasks. In regression tasks, the algorithm predicts continuous values, like predicting the price of a house based on its features. In classification tasks, the algorithm predicts discrete labels, such as classifying an email as spam or not spam.
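To make the distinction concrete, here is a minimal sketch using Scikit-learn. The flower measurements, house sizes, and variable names are made up for illustration. The point is only that a classifier returns a discrete label while a regressor returns a number.

```python
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Classification: hypothetical flowers described by [petal_length_cm, stem_height_cm]
flower_features = [[4.0, 40.0], [5.0, 45.0], [12.0, 150.0], [14.0, 180.0]]
flower_labels = ["rose", "rose", "sunflower", "sunflower"]

classifier = KNeighborsClassifier(n_neighbors=1)
classifier.fit(flower_features, flower_labels)
predicted_label = classifier.predict([[13.0, 160.0]])[0]  # a discrete label

# Regression: hypothetical houses described by [area]
house_features = [[1000], [1500], [2000]]
house_prices = [200000, 300000, 400000]

regressor = LinearRegression()
regressor.fit(house_features, house_prices)
predicted_price = regressor.predict([[1750]])[0]  # a continuous value

print(predicted_label)
print(round(predicted_price))
```

Both estimators share the same fit and predict pattern. What differs is the type of target they learn from and return.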

Linear Regression: Predicting Trends

Let's think of Linear Regression as a tool that helps predict trends. Picture a scenario where you have a list of house prices and their respective areas. You observe that as the area of a house increases, its price tends to go up as well. Linear Regression aims to draw a straight line through this data, capturing the overall trend. Once the line is established, you can predict the price of a house based on its area using the line's equation.

Mathematically, a linear regression model represents a linear relationship between the input variable (independent variable) and the output variable (dependent variable). The line is represented by the equation:

y = mx + b

Where:

  • y is the predicted value (dependent variable),
  • x is the input value (independent variable),
  • m is the slope of the line, and
  • b is the y-intercept.

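The goal of Linear Regression is to find the best-fitting line, the one that minimizes the error between the predicted values and the actual values in the training data. In practice, "error" usually means the sum of squared differences between predicted and actual values, an approach known as ordinary least squares, which is what Scikit-learn's LinearRegression uses.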
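For a single input variable, the least-squares slope and intercept have a closed-form solution. The sketch below computes them with NumPy for the small house dataset used later in this article, assuming squared error as the measure of fit. The variable names are illustrative.

```python
import numpy as np

# Areas and prices from the article's sample dataset
areas = np.array([1200, 1500, 1800, 2000, 2200], dtype=float)
prices = np.array([300000, 350000, 400000, 420000, 450000], dtype=float)

# Closed-form least-squares estimates for a single input variable:
#   m = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   b = mean_y - m * mean_x
x_centered = areas - areas.mean()
y_centered = prices - prices.mean()
m = (x_centered * y_centered).sum() / (x_centered ** 2).sum()
b = prices.mean() - m * areas.mean()

# Sum of squared errors: the quantity the fitted line minimizes
residuals = prices - (m * areas + b)
sse = (residuals ** 2).sum()

print(f"slope m = {m:.2f}, intercept b = {b:.2f}, SSE = {sse:.0f}")
```

Any other choice of m and b gives a larger sum of squared errors on this data, which is what "best-fitting" means here.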

Decision Trees: Making Decisions like a Detective

Imagine you're in the market for a new car, and you're trying to decide between two options: a blue car and a purple car.

The blue car has more miles on it, but it comes with some extra features and a lower price than the purple car. The purple car has fewer miles on it and a higher price tag, but it lacks some of the additional features that the blue car offers.

To make your decision, you start by considering your priorities. If you value having the latest features at a more affordable price, the blue car might be the better option for you. However, if you prioritize lower mileage and don't mind paying a bit extra for it, the purple car could be more appealing.

You also consider other factors like the maintenance history, fuel efficiency, and overall condition of each car, which can further influence your decision.

By using a decision tree, you can create a visual representation of these factors and weigh their importance according to your preferences. As you make your way down the branches of the decision tree, you can compare and contrast the attributes of both cars and ultimately make an informed choice that aligns with your needs and budget.

[Figure: a decision tree]

In this example, the decision tree helps you navigate the complex process of choosing a car, taking into account different variables and personal preferences to reach the best possible decision for you.

Similarly, Decision Trees in machine learning ask a series of questions about the data to classify it or make predictions. Each question splits the data into subsets, leading to a tree-like structure. This allows the algorithm to make decisions based on the features present in the data.

The decision-making process of a Decision Tree involves selecting the most informative features that effectively divide the data into distinct classes. Each internal node represents a question, each branch represents an answer to the question, and each leaf node represents a final decision or outcome.

Constructing a Decision Tree means choosing, at each node, the feature and split point that best separate the training data. "Best" is measured by a criterion such as Gini impurity or entropy for classification, or squared error for regression. To classify a new data point, the algorithm follows the path from the root node to a leaf node, applying the learned rules along the way.
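The idea of picking the most informative question can be sketched in plain Python. The example below scores every candidate threshold on a single feature using Gini impurity, which is one common criterion and Scikit-learn's default for classification trees. The helper names and the mileage data are hypothetical, and real implementations are considerably more refined.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when all labels agree, higher when classes are mixed."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def best_split(values, labels):
    """Try every midpoint between sorted values; return (threshold, weighted impurity)."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_impurity = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity

# Hypothetical data: car mileage (thousands of miles) and whether a buyer chose the car
mileage = [10, 20, 30, 80, 90, 100]
chosen = ["yes", "yes", "yes", "no", "no", "no"]

threshold, impurity = best_split(mileage, chosen)
print(threshold, impurity)
```

A full tree builder would repeat this search over every feature and then recurse into the left and right subsets until a stopping rule is met.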

k-Nearest Neighbors (k-NN): Learning from Neighbors

Let's imagine you just moved to a new neighborhood, and you want to know whether it's a friendly and safe area. You decide to ask your k-nearest neighbors, the people living closest to your house, about their experiences. By gathering information from them, you can get an idea of what to expect in the neighborhood.

In the k-Nearest Neighbors (k-NN) algorithm, the "k" represents the number of neighbors considered. The algorithm looks at the data points closest to the one you want to predict and makes a decision based on their labels. For classification, if most of the nearby points belong to a certain class, the algorithm assigns that class to the new data point. For regression, it instead averages the neighbors' target values.

The k-NN algorithm doesn't build an explicit model during the training phase. Instead, it stores the training data and uses it to make predictions at runtime. The key decisions in k-NN are the value of "k" and the distance metric used to measure the similarity between data points.
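The whole procedure fits in a few lines of NumPy. The sketch below assumes Euclidean distance and a simple majority vote. The function name, points, and labels are made up for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(train_points, train_labels, query_point, k=3):
    """Classify query_point by majority vote among its k nearest training points."""
    train_points = np.asarray(train_points, dtype=float)
    query_point = np.asarray(query_point, dtype=float)
    # Euclidean distance from the query to every stored training point
    distances = np.sqrt(((train_points - query_point) ** 2).sum(axis=1))
    nearest_indices = np.argsort(distances)[:k]
    votes = Counter(train_labels[i] for i in nearest_indices)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points with two classes
points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
labels = ["friendly", "friendly", "friendly", "unfriendly", "unfriendly", "unfriendly"]

near_first_cluster = knn_predict(points, labels, [2, 2], k=3)
near_second_cluster = knn_predict(points, labels, [7, 9], k=3)
print(near_first_cluster, near_second_cluster)
```

Note that there is no training step at all: every prediction scans the stored points, which is why k-NN can become slow on large datasets.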

Implementing Supervised Learning Algorithms with Python and Scikit-learn

To apply these algorithms in practice, we'll use Python and the Scikit-learn library, which provides powerful tools for machine learning. If you haven't already installed Scikit-learn, you can do so using the following command:

pip install scikit-learn

We'll start with data preparation, where we organize and preprocess our labeled data. Next, we'll train our models on the prepared data using Linear Regression, Decision Trees, and k-NN algorithms. Finally, we'll evaluate the model's performance and make predictions based on new, unseen data.

Data Preparation

Data preparation is a crucial step in the machine learning process. It involves cleaning and organizing the data to ensure that it is suitable for training and testing our models. Data may come from various sources and might require handling missing values, scaling, and converting categorical variables into numerical representations.
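As a rough sketch of those three steps, the example below fills a missing value, converts a categorical column into numeric indicator columns, and scales a numeric feature. The DataFrame, including the neighborhood column, is hypothetical and is not the dataset used in the rest of this article.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with one missing area and one categorical column
raw = pd.DataFrame({
    "area": [1200.0, 1500.0, None, 2000.0],
    "neighborhood": ["north", "south", "north", "east"],
    "price": [300000, 350000, 400000, 420000],
})

prepared = raw.copy()

# 1. Handle missing values: fill the gap with the column median
prepared["area"] = prepared["area"].fillna(prepared["area"].median())

# 2. Convert the categorical column into one indicator column per category
prepared = pd.get_dummies(prepared, columns=["neighborhood"])

# 3. Scale the numeric feature to zero mean and unit variance
scaler = StandardScaler()
prepared["area"] = scaler.fit_transform(prepared[["area"]]).ravel()

print(prepared)
```

In a real project the scaler would normally be fitted on the training portion only and then applied to the test portion, so that no information from the test data leaks into training.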

Let's assume we have a dataset of houses with their respective areas and prices, saved as a CSV file named house_data.csv:

area,price
1200,300000
1500,350000
1800,400000
2000,420000
2200,450000

We can use Pandas, a popular Python library for data manipulation, to load and preprocess the data:

import pandas as pd

# Load the dataset from CSV file
data = pd.read_csv('house_data.csv')

# Separate the features (areas) and target variable (prices)
X = data['area'].values.reshape(-1, 1)
y = data['price'].values
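If you don't have the CSV file on disk, the same rows can be loaded from a string. This sketch uses io.StringIO as a stand-in for the file and adds a couple of quick sanity checks. It is an optional convenience, not part of the original workflow.

```python
import io
import pandas as pd

# The sample rows from above, held in a string instead of a file on disk
csv_text = """area,price
1200,300000
1500,350000
1800,400000
2000,420000
2200,450000
"""

data = pd.read_csv(io.StringIO(csv_text))

# Quick sanity checks before modelling
print(data.shape)
print(data.isna().sum())

X = data["area"].values.reshape(-1, 1)  # 2-D: one row per house, one column per feature
y = data["price"].values                # 1-D: one target value per house
```

The reshape(-1, 1) call matters because Scikit-learn estimators expect the features as a two-dimensional array, even when there is only one feature.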

Training the Model - Linear Regression

With our data prepared, we can now proceed to train the Linear Regression model:

from sklearn.linear_model import LinearRegression

# Create and train the Linear Regression model
model = LinearRegression()
model.fit(X, y)
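Once fitted, the model exposes the slope and intercept from the equation y = mx + b through its coef_ and intercept_ attributes. The self-contained sketch below rebuilds the sample data inline and checks that a prediction matches the equation. The 1600 area is a hypothetical new house.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data from the article, rebuilt inline so the snippet runs on its own
X = np.array([1200, 1500, 1800, 2000, 2200]).reshape(-1, 1)
y = np.array([300000, 350000, 400000, 420000, 450000])

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]        # m in y = mx + b
intercept = model.intercept_  # b in y = mx + b

new_area = 1600  # hypothetical house
predicted = model.predict([[new_area]])[0]
manual = slope * new_area + intercept  # same result, computed from the equation

print(f"m = {slope:.2f}, b = {intercept:.2f}")
print(f"predicted = {predicted:.2f}, manual = {manual:.2f}")
```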

Training the Model - Decision Trees

For Decision Trees, we use the DecisionTreeRegressor class for regression tasks:

from sklearn.tree import DecisionTreeRegressor

# Create and train the Decision Tree model
model = DecisionTreeRegressor()
model.fit(X, y)
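One thing worth knowing about Decision Trees is that, left unconstrained, they can memorize the training data. The sketch below compares a fully grown tree with one limited by max_depth, using the sample data rebuilt inline. The random_state value is an arbitrary choice to make the run repeatable.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Sample data from the article, rebuilt inline so the snippet runs on its own
X = np.array([1200, 1500, 1800, 2000, 2200]).reshape(-1, 1)
y = np.array([300000, 350000, 400000, 420000, 450000])

# A fully grown tree keeps splitting until every leaf holds a single house
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# max_depth=1 allows one question only, so the tree can output just two values
shallow_tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)

full_predictions = full_tree.predict(X)
shallow_predictions = shallow_tree.predict(X)

print(full_predictions)
print(shallow_predictions)
```

The full tree reproduces every training price exactly, which looks perfect but says little about how it will behave on houses it has not seen. Parameters such as max_depth trade some training accuracy for simpler rules.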

Training the Model - k-Nearest Neighbors (k-NN)

For k-Nearest Neighbors, we use the KNeighborsRegressor class for regression tasks:

from sklearn.neighbors import KNeighborsRegressor

# Create and train the k-NN model
model = KNeighborsRegressor(n_neighbors=3)
model.fit(X, y)
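To see what the regressor does with n_neighbors=3, the sketch below asks the fitted model which training houses it considers nearest to a hypothetical 1900 area and confirms that the prediction is the average of their prices. The sample data is rebuilt inline so the snippet runs on its own.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Sample data from the article, rebuilt inline
X = np.array([1200, 1500, 1800, 2000, 2200]).reshape(-1, 1)
y = np.array([300000, 350000, 400000, 420000, 450000])

model = KNeighborsRegressor(n_neighbors=3)
model.fit(X, y)

new_area = 1900  # hypothetical house
predicted = model.predict([[new_area]])[0]

# kneighbors returns the distances to, and row indices of, the nearest training points
distances, indices = model.kneighbors([[new_area]])
neighbor_areas = sorted(X[indices[0]].ravel().tolist())
neighbor_average = y[indices[0]].mean()

print(neighbor_areas)
print(predicted, neighbor_average)
```

With the default settings every neighbor counts equally. KNeighborsRegressor also accepts weights="distance" to give closer neighbors more influence.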

Evaluating the Model

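After training the model, it's essential to evaluate its performance. To keep the example short, the snippet below scores the model on the same data it was trained on, which gives an optimistic estimate; in practice, you should evaluate on data the model has not seen. For regression tasks, common evaluation metrics include Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE):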

from sklearn.metrics import mean_absolute_error, mean_squared_error

# Make predictions on the training data
predictions = model.predict(X)

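# Calculate MAE and RMSE (RMSE is the square root of the mean squared error;
# the older squared=False argument is not available in recent Scikit-learn versions)
mae = mean_absolute_error(y, predictions)
rmse = mean_squared_error(y, predictions) ** 0.5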

print("Mean Absolute Error:", mae)
print("Root Mean Squared Error:", rmse)
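A more realistic evaluation holds some data back for testing. The five-row sample is too small to split meaningfully, so the sketch below generates a synthetic dataset instead. The linear relationship, noise level, and seed are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): price grows linearly with area, plus noise
rng = np.random.default_rng(seed=0)
areas = rng.uniform(800, 3000, size=200)
prices = 150 * areas + 120000 + rng.normal(0, 20000, size=200)

X = areas.reshape(-1, 1)
y = prices

# Keep 25% of the rows aside; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)

# Score the model on the held-out rows only
test_predictions = model.predict(X_test)
test_mae = mean_absolute_error(y_test, test_predictions)
test_rmse = mean_squared_error(y_test, test_predictions) ** 0.5

print("Test MAE:", test_mae)
print("Test RMSE:", test_rmse)
```

RMSE is never smaller than MAE because squaring penalizes large errors more heavily, so a big gap between the two hints at a few badly mispredicted houses.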

Conclusion

Congratulations! You've taken your first steps into the world of supervised learning algorithms. We covered the basic concepts of supervised learning and explored popular algorithms like Linear Regression, Decision Trees, and k-Nearest Neighbors (k-NN). We also worked through practical examples and Python code snippets using the Scikit-learn library, so you can start building your own machine learning models.

Remember, practice makes perfect. As you continue your journey in machine learning, try experimenting with different datasets, tweaking parameters, and exploring other algorithms. The more you explore and learn, the more proficient you'll become in this exciting field.

Keep in mind also that this article serves as an introduction to supervised learning, and there is much more to learn as you progress in your machine learning journey. You may encounter challenges, but don't be discouraged. Embrace them as opportunities to learn and grow as a machine learning practitioner.

Enjoyed reading? Follow me on Twitter and LinkedIn.
Happy coding!
