A few days ago, I completed the "Machine Learning with Python" course by IBM on Coursera. Because this field can seem quite challenging to many, I decided to write a set of simple, concise, and fun articles to share my knowledge and guide new learners in the machine learning field!
In this easy, fun, and simple-to-understand guide, I will unravel the mystery of machine learning by helping you create your first project: predicting customer categories for a telecommunications provider.
Setting Up Your Environment
Before we begin, you can use Google Colab to run all the code provided here. This will allow you to execute the code in the cloud and analyze the output directly without using your machine's resources.
Project Overview
Imagine a telecommunications provider has segmented its customer base by service usage patterns, categorizing the customers into four groups:
- Basic Service
- E-Service
- Plus Service
- Total Service
Why Categorize (or Classify) Customers?
Categorizing customers allows a company to tailor offers based on specific needs. For example, new customers might receive welcome discounts, while loyal customers could get exclusive early access to sales.
Our objective in this project is to build a classifier that can assign new customers to one of these four groups based on previously categorized data. For this, we will use a specific type of classification algorithm called K-Nearest Neighbors (KNN).
To learn more about different ML algorithms, check out this informative article: Types of Machine Learning.
Anyway, let the fun begin!
Data
Of course, machines learn from the data you give them; that is the essence of machine learning algorithms. They analyze data, learn patterns and relationships, and make predictions based on what they've learned.
Downloading and Understanding Our Dataset
To make it easier to understand our dataset, let's first download it and output the first five rows using the Pandas library.
Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data.
So, go ahead and open Google Colab, create a new notebook, and in the first code cell, type:
import pandas as pd
import matplotlib.pyplot as plt # For creating visualizations and plots
and then run it to import the libraries.
Next, in the second code cell, type the following code to read the dataset from the provided URL and display the first five rows:
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/teleCust1000t.csv')
df.head()
As you can see in the first five rows, we have customers with attributes such as region, age, and marital status. We will use these attributes to predict the category of a new customer. In machine learning terminology, these attributes are called features. The target field, custcat (short for customer category), is the attribute we want to predict; it has four possible values corresponding to the four customer groups we discussed earlier, and it is known as the label.
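If you want to list all the column names yourself (the eleven features plus the custcat label), you can run this small optional snippet:
# Print every column in the DataFrame
print(df.columns.tolist())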
Let's perform a quick analysis with Pandas to see how many customers are in each class. Type the following in a new code cell to get the result:
df['custcat'].value_counts()
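If you'd rather see each class as a proportion of the whole dataset instead of raw counts, value_counts supports that too (a small optional extra, not part of the original lab):
# Show each class as a fraction of all customers
df['custcat'].value_counts(normalize=True)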
Extracting Features and Labels
We will use the variable X to store our feature set, and y to store our labels:
import numpy as np
X = df[['region', 'tenure', 'age', 'marital', 'address', 'income', 'ed', 'employ', 'retire', 'gender', 'reside']].values
The line of code above selects specific columns from the DataFrame df to be used as features for our machine learning model.
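Since custcat is the only column we are not using as a feature, an equivalent (optional) way to build the same array is to drop the label column, assuming the CSV contains exactly the eleven feature columns plus custcat:
# Equivalent: keep every column except the label
X = df.drop('custcat', axis=1).values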
To see a sample of the data, let's output the first five rows of the array X using X[0:5]:
array([[ 2., 13., 44., 1., 9., 64., 4., 5., 0., 0., 2.],
[ 3., 11., 33., 1., 7., 136., 5., 5., 0., 0., 6.],
[ 3., 68., 52., 1., 24., 116., 1., 29., 0., 1., 2.],
[ 2., 33., 33., 0., 12., 33., 2., 0., 0., 1., 1.],
[ 2., 23., 30., 1., 9., 30., 1., 2., 0., 0., 4.]])
Now, let's store the values of our label in y:
y = df['custcat'].values
y[0:5]
Data Standardization
Standardizing our data converts it to a uniform scale. Why? Because imagine that in our raw data we have age in years, income in dollars, and height in centimeters; the algorithm may give more importance to features with larger scales. This can skew the results and lead to a biased model.
The line of code below scales and normalizes the data. It standardizes the features so they have a mean of 0 and a standard deviation of 1. This is a good practice in general.
First, make sure the scikit-learn library is installed (it comes pre-installed in Google Colab; otherwise, run):
pip install scikit-learn
Scikit-learn is a Python library that provides many supervised and unsupervised learning algorithms.
Then, go ahead and run the code below and see in the output how all the values are unified on a similar scale:
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
X[0:5]
Again, don't worry if the line above seems complex; the important thing right now is to understand the importance of it 😊.
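If you are curious, though, StandardScaler simply applies z = (x - mean) / standard deviation to each column. Here is a quick sanity check (my own addition, not part of the original lab) to confirm the result:
import numpy as np

# After standardization, every column should have a mean of ~0
# and a standard deviation of ~1
print(np.round(X.mean(axis=0), 6))
print(np.round(X.std(axis=0), 6))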
Train-Test Split
Since we are working with a single dataset, the train/test split involves splitting it into training data (the data we will give to our model so it can learn the patterns in it) and testing data (the data we will use to evaluate its predictions).
In future articles, I’ll cover various and more in-depth evaluation techniques for models, including Train/Test Split, K-Fold Cross-Validation, and more.
So, to perform the train/test split, we import the function from the scikit-learn library:
from sklearn.model_selection import train_test_split
And now let's use our imported function and explain what is happening in the line below:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
train_test_split() divides the features (X) and labels (y) into two parts: 80% for training the model (X_train and y_train) and 20% for testing it (X_test and y_test). The random_state=4 argument ensures that each time you run the code, the split result is the same.
To see the size (number of rows and columns) of our train and test datasets, we will use the .shape attribute:
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)
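Since our dataset has 1,000 rows, the 80/20 split should give you 800 training rows and 200 test rows, each with 11 feature columns.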
Classification
As I mentioned before, we will use the K-Nearest Neighbors (KNN) algorithm to train our model and predict the classes of new customers.
(If you want to know how this algorithm works, just leave a comment; I’ll be glad to write an article about it 😉)
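In the meantime, here is a minimal sketch of the idea behind KNN, written from scratch with NumPy (the function name and structure are my own illustration, not how scikit-learn implements it): to classify a new point, find the k closest training points and take a majority vote among their labels.
import numpy as np

def knn_predict(X_train, y_train, x_new, k=4):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Labels of the k closest training points
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # Majority vote: the most common label among the neighbors wins
    values, counts = np.unique(nearest_labels, return_counts=True)
    return values[np.argmax(counts)]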
Good news! We don't need to implement the algorithm ourselves, as we can easily import the model from the scikit-learn library:
from sklearn.neighbors import KNeighborsClassifier
Training
The block of code below creates a K-Nearest Neighbors model. The fit() method trains the model with the training data (X_train) and its corresponding labels (y_train), allowing the model to learn from this data:
k = 4
neigh = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
Predicting
Now, let's make some predictions!
Let's create a variable named y_predict to store our model's predictions on the test data we created earlier. We will then compare its results with y_test, which holds the real, correct values. In other words, we will compare our predictions with the actual values to test our model's accuracy.
y_predict = neigh.predict(X_test)
y_predict[0:5]
Accuracy Evaluation
The y_predict[0:5] line will output the first five prediction results; you can do some comparison yourself with y_test.
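For example, here is a quick side-by-side look at the first five predictions and the corresponding true labels (a small optional extra):
# Compare the first five predictions with the true labels
print("Predicted:", y_predict[0:5])
print("Actual:   ", y_test[0:5])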
But comparing results manually can be error-prone and inconsistent. That's why scikit-learn offers various accuracy evaluation methods that standardize this process. For example, metrics.accuracy_score calculates the proportion of correct predictions, providing a clear and objective measure of how well your model performs.
Let's use metrics.accuracy_score to see how well our trained model is working:
from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, y_predict))
The accuracy scores in our output show that the model correctly predicted the labels for about 54.75% of the training data but only about 32% of the test data.
The large gap between training and test accuracy might suggest overfitting, where the model learns the training data too well but struggles with new, unseen data.
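One simple experiment you can already try (my own addition, not part of the original lab) is to loop over several values of k and watch how the test accuracy changes:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Train a model for each k from 1 to 10 and report its test accuracy
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracy = metrics.accuracy_score(y_test, model.predict(X_test))
    print(f"k={k}: test accuracy = {accuracy:.3f}")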
Of course, if time permits, in future articles I will discuss strategies to improve a model's accuracy on new, unseen data.
Conclusion ✨
I really hope this guide was easy to follow and I hope it helped you learn something new.
Always keep experimenting and exploring new techniques. There’s always something new to learn in machine learning, and I’m excited to share more with you in future articles.
Feel free to drop your thoughts or questions in the comments. I’m here to help and would love to hear about your experiences and progress.
Happy coding 👨💻
Bye for now! 😊