Building your first machine learning model in Python

Machine learning is the use of algorithms that learn patterns from data and improve as they see more of it. Machine learning approaches are commonly divided into supervised, unsupervised, and reinforcement learning. Many of the most widely used algorithms fall under supervised learning, and linear regression is usually the first model you will encounter in this category.

Linear regression models come in two forms: simple and multiple. A simple linear model has one independent variable and one dependent variable. A multiple linear model, on the other hand, has one dependent variable and two or more independent variables. In this article, I will take you through the process of creating your first multiple linear model for predicting the tips that customers give waiters in restaurants.

Getting started

Before we start, there are some prerequisites that you should be familiar with.

  • Basic understanding of Python
  • Some familiarity with statistics
  • Python libraries including pandas, numpy, matplotlib, seaborn, and scikit-learn

Linear regression

Linear regression is one of the simplest and most commonly used algorithms, especially when the goal is to determine how variables are related. A linear regression model finds the best-fit line (or, with several features, a plane or hyperplane) that minimizes the sum of squared differences between the actual and predicted values.
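To make the least-squares idea concrete, here is a minimal sketch using NumPy. The (bill, tip) pairs are made-up values chosen so that the tip is exactly 1 + 0.1 × bill; they are an assumption for illustration, not rows from the tips dataset.

```python
import numpy as np

# Hypothetical (total_bill, tip) pairs where tip = 1 + 0.1 * bill exactly
bill = np.array([10.0, 20.0, 30.0, 40.0])
tip = np.array([2.0, 3.0, 4.0, 5.0])

# Design matrix: a column of ones for the intercept, then the feature
A = np.column_stack([np.ones_like(bill), bill])

# Least squares picks the coefficients that minimize the sum of squared residuals
coef, *_ = np.linalg.lstsq(A, tip, rcond=None)
intercept, slope = coef

# Sum of squared errors between the actual and predicted tips
predicted = A @ coef
sse = float(np.sum((tip - predicted) ** 2))
print(f"intercept={intercept:.3f}, slope={slope:.3f}, SSE={sse:.6f}")
```

Because the made-up points lie exactly on a line, the fit should recover that line with essentially zero error. Real data, like the tips dataset, will leave some error that the line cannot explain.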

Linear regression models have many uses, including market analysis, sports analysis, and financial analysis.

Loading and understanding the dataset

We will use the tips dataset available through the Seaborn library. The tips dataset contains data on the tips that waiters receive in restaurants, along with other attributes.
For this demonstration, this is the complete Google Colab notebook that I used. We start by importing the necessary libraries.

import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

After loading the libraries, we first list the datasets available through the Seaborn library.

sns.get_dataset_names()

After looking through the available datasets and choosing one, we can load it.

tips = sns.load_dataset('tips')
# Preview the first five rows
tips.head()


dataset head

The table above shows that there are 7 variables in the dataset. The numerical columns are total_bill, tip, and size, while the categorical columns are sex, smoker, day, and time.
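The split between numerical and categorical columns can also be checked in code rather than by eye. The sketch below uses a small hand-typed stand-in frame with the same column names as the tips dataset (an assumption made so the example runs without downloading anything); the same two select_dtypes calls should work on the real tips DataFrame.

```python
import pandas as pd

# Hand-typed stand-in rows using the tips column names (illustrative only)
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
    "sex": pd.Categorical(["Female", "Male", "Male"]),
    "smoker": pd.Categorical(["No", "No", "No"]),
    "day": pd.Categorical(["Sun", "Sun", "Sun"]),
    "time": pd.Categorical(["Dinner", "Dinner", "Dinner"]),
    "size": [2, 3, 3],
})

# Select columns by dtype instead of listing them manually
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(include="category").columns.tolist()

print("Numerical:", numeric_cols)
print("Categorical:", categorical_cols)
```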

For basic statistics, we can use the describe() method.

tips.describe()

basic statistics

By default, the describe() method gives the summary statistics of the numerical variables only. From the output, we can see the count, mean, standard deviation, minimum, maximum, and percentiles of the variables.
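If we also want a summary of the categorical columns, describe() accepts an include argument. The following is a small sketch on a made-up two-column frame (the values are assumptions for illustration), showing the default numerical summary next to a categorical one.

```python
import pandas as pd

# Made-up frame with one numerical and one categorical column
sample = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0],
    "day": pd.Categorical(["Sat", "Sun", "Sat"]),
})

# Default behaviour: numerical columns only
numeric_summary = sample.describe()

# Ask explicitly for categorical columns: count, unique, top, freq
categorical_summary = sample.describe(include=["category"])

print(numeric_summary)
print(categorical_summary)
```

For categorical columns, the summary reports the number of unique values, the most frequent value (top), and how often it occurs (freq).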

Data visualizations

Distribution of sex variable

sns.countplot(x ='sex', data = tips)
plt.title('Distribution of Sex variable')

sex distribution
We can see from the plot above that men make up the majority of the customers in the dataset.

Distribution of total bill variable

sns.histplot(x ='total_bill', data = tips)
plt.title('Histogram of the Total bill variable')

total bill histogram
The histogram above shows the distribution of the total_bill variable. We can see that the majority of the bills fall between $10 and $20.


sns.scatterplot(x='total_bill', y='tip', data=tips)
plt.title('Scatter plot of total bill and tip variables')


Correlation plot

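# Keep only the numerical columns, since correlation needs numeric data
num_cols = tips.select_dtypes(include='number')
# Pairwise Pearson correlation between the numerical columns
corr_matrix = num_cols.corr()
sns.heatmap(corr_matrix, annot=True)
plt.title('Correlation Heatmap')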

Correlation plot
The scatter plot above shows that tip and total_bill have a positive, roughly linear relationship. The correlation plot confirms this: total_bill and tip have a correlation coefficient of 0.68, indicating a moderately strong positive correlation.
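To show what the correlation coefficient measures, the sketch below computes the Pearson correlation by hand and compares it with the pandas result. The four (total_bill, tip) pairs are made-up values for illustration, so the number will differ from the 0.68 reported for the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration, not rows from the tips dataset
df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 3.0, 2.0, 5.0],
})

x = df["total_bill"].to_numpy()
y = df["tip"].to_numpy()

# Center both variables around their means
x_c = x - x.mean()
y_c = y - y.mean()

# Pearson r: summed co-variation divided by the product of the two spreads
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# The same quantity via pandas (Pearson is the default method)
r_pandas = df["total_bill"].corr(df["tip"])

print(round(r_manual, 4), round(r_pandas, 4))
```

A value near 1 means the two variables rise together almost perfectly along a line, a value near 0 means there is little linear relationship, and a value near -1 means one falls as the other rises.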

Model building

Before building the model, the data has to be converted into a format that the machine learning algorithm can work with. Most machine learning algorithms, including linear regression, expect numerical input, so the categorical values must be converted to numbers. There are various approaches for this, such as label encoding and one-hot encoding. For this project, we will use one-hot encoding.

tips = pd.get_dummies(tips, columns=['sex', 'smoker', 'day', 'time'], dtype=int)

One-hot encoding creates a new binary variable for each value of a categorical variable. For example, we had a variable named sex which has Male and Female as the values. After using the pd.get_dummies() function, which one-hot encodes the data, the sex variable is replaced by two new variables named sex_Male and sex_Female. Note that we started our data analysis with 7 variables and now, after applying one-hot encoding, we have 13 variables.
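The effect of one-hot encoding is easier to see on a tiny frame. The sketch below uses three made-up rows (an assumption for illustration) and also shows the drop_first option of pd.get_dummies, which the article's model does not use but which is a common variation.

```python
import pandas as pd

# Three made-up rows with two categorical columns
sample = pd.DataFrame({
    "tip": [1.01, 1.66, 3.50],
    "sex": ["Female", "Male", "Male"],
    "time": ["Dinner", "Dinner", "Lunch"],
})

# One new 0/1 column per category value
encoded = pd.get_dummies(sample, columns=["sex", "time"], dtype=int)
print(encoded.columns.tolist())

# Variation: drop the first category of each variable, since it is
# implied when all the remaining indicator columns are 0
encoded_reduced = pd.get_dummies(
    sample, columns=["sex", "time"], dtype=int, drop_first=True
)
print(encoded_reduced.columns.tolist())
```

Keeping every indicator column, as we do in this article, makes the columns of each variable sum to 1, so they are perfectly collinear with the intercept. scikit-learn's LinearRegression can still fit such data, but the individual coefficients become harder to interpret, which is one reason drop_first is sometimes preferred.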

After encoding the data, we scale it so that the features fall within a similar range. For example, values in the total_bill column vary between about 3 and 50, while most of the remaining columns hold values of 0 or 1. Scaling puts the features on a comparable footing, which matters for algorithms that are sensitive to the magnitude of the features. Note that min-max scaling does not remove extreme values; it only rescales them into a fixed range. For this, we are using the MinMaxScaler class of the scikit-learn library.

Unscaled data

from sklearn.preprocessing import MinMaxScaler
# Instantiate the scaler
MM = MinMaxScaler()
col_to_scale = ['total_bill']
# Fit the scaler and transform the selected column
scaled_data = MM.fit_transform(tips[col_to_scale])
# Convert the scaled array into a DataFrame, keeping the original row index
scaled_df = pd.DataFrame(scaled_data, columns=col_to_scale, index=tips.index)
# Drop the original column and join the scaled version in its place
tips_df = tips.drop(columns=col_to_scale).join(scaled_df)

After scaling the total_bill column, we have the results below. You can see that the values in the total_bill column now range between 0 and 1, like the one-hot encoded variables.

Scaled data
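Under the hood, min-max scaling subtracts the column minimum and divides by the column range. The following sketch checks that formula against MinMaxScaler on a few made-up bill amounts (the values are assumptions for illustration).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up bill amounts as a single-column 2D array, as scikit-learn expects
bills = np.array([[3.0], [10.0], [26.5], [50.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (bills - bills.min()) / (bills.max() - bills.min())

# The same transformation with scikit-learn
scaled = MinMaxScaler().fit_transform(bills)

print(manual.ravel())
print(scaled.ravel())
```

The smallest value maps to 0, the largest to 1, and everything else keeps its relative position in between.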

Next, we split the data into train and test sets. We will use the training data to train the model and test data to test the performance of our model.

from sklearn.model_selection import train_test_split
# Features: every column except the target
X = tips_df.drop(columns='tip')
# Target: the tip amount
y = tips_df['tip']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


The capital X represents the independent variables (features) that will be fed to our model, and the lowercase y represents the target variable.
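To see what the split produces, here is a small sketch on synthetic data (20 made-up rows, an assumption for illustration). It also shows a common variation on the order of steps used in this article: fitting the scaler on the training rows only and reusing it on the test rows, so that no information from the test set reaches the preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic feature and target, for illustration only
rng = np.random.default_rng(42)
X = rng.uniform(3, 50, size=(20, 1))
y = 1.0 + 0.1 * X[:, 0]

# 80% of the rows for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)

# Learn the min and max from the training rows only...
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# ...then apply the same min and max to the test rows
X_test_scaled = scaler.transform(X_test)
```

With this approach, test values can fall slightly outside the 0 to 1 range if they are smaller or larger than anything seen in training, which is expected.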

After splitting the data, we now proceed to instantiate the model and fit it to the training data as shown by the code below.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Instantiate the model and fit it to the training data
LR = LinearRegression()
LR.fit(X_train, y_train)

Model evaluation

After fitting the model to the training data, we now proceed to test the model with our unseen data. Evaluating the model is important as it tells us how well the model performs. For regression models, common evaluation metrics include the mean absolute error, mean squared error, root mean squared error, and R-squared, among others.

y_pred = LR.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
r2 = r2_score(y_test, y_pred)
print("R-Squared (R2) Score:", r2)

The output is:
Mean Squared Error: 0.7033566017436106
R-Squared (R2) Score: 0.43730181943482493

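The mean squared error of about 0.70 corresponds to a root mean squared error of about 0.84, so a typical prediction is off by roughly 84 cents. The R-squared value of about 0.44 means that the model explains less than half of the variation in tips in the test set, so it is not fitting the data well. Ideally, the mean squared error should be low and the R-squared value should be close to 1.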
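The regression metrics mentioned above can all be computed by hand, which helps in reading them. The sketch below uses four made-up actual and predicted tips (assumptions for illustration) and checks the manual results against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted tips, for illustration only
actual = np.array([2.0, 3.0, 4.0, 5.0])
predicted = np.array([2.5, 2.5, 4.0, 4.0])

errors = actual - predicted

# Mean absolute error: average size of the errors
mae = np.mean(np.abs(errors))
# Mean squared error: average of the squared errors
mse = np.mean(errors ** 2)
# Root mean squared error: back in the same units as the target
rmse = np.sqrt(mse)
# R-squared: 1 minus (unexplained variation / total variation)
r2 = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

print(mae, mse, rmse, r2)

# The scikit-learn functions should agree with the manual calculations
assert np.isclose(mae, mean_absolute_error(actual, predicted))
assert np.isclose(mse, mean_squared_error(actual, predicted))
assert np.isclose(r2, r2_score(actual, predicted))
```

Because the root mean squared error is in the same units as the target (dollars, in our case), it is often easier to interpret than the mean squared error.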
Visualizing the results:

plt.scatter(y_test, y_pred)
# Add labels to the plot
plt.xlabel("The Actual Tip Amount")
plt.ylabel("The Predicted Tip Amount")
plt.title("Plot of Actual versus Predicted Tip Amount")
# Diagonal reference line: points on it are predicted exactly
plt.plot([0, max(y_test)], [0, max(y_test)], color='green', linestyle='--')


Model performance
From the plot, we can see that there are many points below the diagonal line. Since the actual tip is on the horizontal axis and the predicted tip is on the vertical axis, this means that in many cases the predicted tip amount is lower than the actual tip amount.
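The same observation can be checked numerically by looking at the residuals (actual minus predicted). The sketch below uses made-up values as an assumption for illustration; with the article's variables, y_test and y_pred would take the place of actual and predicted.

```python
import numpy as np

# Made-up actual and predicted tips, for illustration only
actual = np.array([2.0, 3.0, 4.0, 5.0])
predicted = np.array([2.5, 2.5, 3.0, 4.0])

# Positive residual: the model predicted less than the actual tip,
# which corresponds to a point below the diagonal in the plot
residuals = actual - predicted

share_under = np.mean(residuals > 0)
mean_residual = residuals.mean()

print(f"Share of under-predictions: {share_under:.2f}")
print(f"Mean residual: {mean_residual:.2f}")
```

A share well above 0.5, together with a positive mean residual, would suggest that the model under-predicts more often than it over-predicts.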


In this article, we built our first machine learning model to predict the tips that customers give. This regression model has provided us with a starting point to understand the relationship between several independent features and the tip amount. We also saw in the model evaluation that our model did not perform well in predicting the tip amount.

The performance of our model highlights an important aspect of data science and machine learning: models are improved iteratively. To further improve our model, we may have to use feature engineering, perform hyperparameter tuning, or do data quality checks. As you embark on this machine learning journey, remember that your model may need several improvements before it achieves the desired performance.
