DEV Community

Charles Ndavu

Predictive Modeling with Jupyter Notebook

Businesses and institutions commonly use prediction to anticipate future performance. It is used in many fields, including finance, agriculture, marketing, and healthcare, to make informed decisions and improve outcomes.

In this article, we will create models that predict outcomes from selected data. The article would not be complete without practical examples, so we will work through two of them.

So, let’s jump right into it.

TABLE OF CONTENTS
· What is Predictive Modeling?
· Benefits of Predictive Modeling
· Practical Examples of Predictive Modeling
· Conclusion

What is Predictive Modeling?

Predictive modeling is the process of using historical data to predict future events. You use statistical and machine learning methods to build a model that identifies patterns and relationships in your data and uses them to make predictions.
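
As a minimal sketch of that idea, assume a made-up history in which the outcome grows linearly with a single input. The numbers and the names `history_x` and `history_y` are illustrative only and do not come from the datasets used later. The model learns the pattern from the past rows and applies it to a value it has not seen:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: the outcome follows y = 2x + 1 exactly.
history_x = np.array([[1], [2], [3], [4], [5]])
history_y = np.array([3, 5, 7, 9, 11])

# Learn the relationship from the past observations.
model = LinearRegression()
model.fit(history_x, history_y)

# Apply the learned pattern to an input the model has not seen.
future_prediction = model.predict(np.array([[6]]))[0]
print(future_prediction)
```

Real data is noisier than this, which is why the examples below hold back part of the data to check how far the predictions are from the truth.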

Benefits of Predictive Modeling

Predictive modeling matters because it shapes how businesses and institutions make decisions about performance. Its benefits include:

  • Improved accuracy. By identifying patterns and relationships in past data, predictive modeling helps organizations make more accurate predictions and better decisions based on them.
  • Resource allocation. Predictive modeling can help organizations allocate resources effectively by identifying the areas that need them most. This helps reduce costs and meet company goals.
  • Customer satisfaction. Predictive modeling can help organizations improve customer experience and satisfaction by identifying patterns in customer experiences and behaviors.
  • Risk management. Predictive modeling can help businesses foresee risks and take measures to prevent them, reducing the likelihood of major losses that disrupt operations.

Predictive modeling is therefore a powerful tool that helps organizations make better decisions by identifying patterns and relationships in data. Let’s now dive into examples and see how it works in Jupyter Notebook using the Scikit-learn library.

Practical Examples of Predictive Modeling

Example 1

We have a sales dataset and want to build a model that predicts profits of product categories, using quantity and discount as independent variables.

Now, let’s open Jupyter Notebook and code the steps.

Libraries required in this process

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

Load the sales data

df = pd.read_csv('Sales.csv')
df

Split the dataset into training and testing sets

X = df[['Quantity', 'Discount']]
y = df['Profit']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
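
To see what the split produces, here is a small sketch on a hypothetical stand-in for Sales.csv. The 100 random rows are invented. Only the column names `Quantity`, `Discount`, and `Profit` come from the example above, and the `demo` names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for Sales.csv: 100 random rows with the same column names.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'Quantity': rng.integers(1, 15, size=100),
    'Discount': rng.uniform(0.0, 0.8, size=100),
    'Profit': rng.normal(30.0, 100.0, size=100),
})

X_demo = demo[['Quantity', 'Discount']]
y_demo = demo['Profit']

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)  # (80, 2) (20, 2)
```

The model only ever sees the 80 training rows. The 20 held-out rows are used later to check how well it generalizes.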

Creating a linear regression model and fitting it to the training data

model = LinearRegression()
model.fit(X_train, y_train)
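
After fitting, it can help to look at what the model learned. The sketch below uses hypothetical data built from a known formula, so the fitted coefficients can be checked against it. `coef_` and `intercept_` are attributes of a fitted `LinearRegression`, while the data and the `demo` names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data built so that Profit = 10 * Quantity - 50 * Discount + 5.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Quantity': rng.integers(1, 15, size=50),
    'Discount': rng.uniform(0.0, 0.8, size=50),
})
demo['Profit'] = 10 * demo['Quantity'] - 50 * demo['Discount'] + 5

demo_model = LinearRegression()
demo_model.fit(demo[['Quantity', 'Discount']], demo['Profit'])

# One coefficient per feature, in column order, plus the intercept.
coefficients = dict(zip(['Quantity', 'Discount'], demo_model.coef_))
print(coefficients)
print(demo_model.intercept_)
```

A positive coefficient means the predicted profit rises with that feature, and a negative one means it falls. On real sales data the values will not be this clean.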

Predicting the profits

y_pred = model.predict(X_test)
print(y_pred)


Output:

[Screenshot of the predicted profit values]
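
A raw array of predictions is hard to judge by eye. One option, sketched here with made-up numbers standing in for `y_test` and `y_pred`, is to place the actual and predicted values side by side:

```python
import numpy as np
import pandas as pd

# Hypothetical actual and predicted profits standing in for y_test and y_pred.
y_test_demo = pd.Series([120.0, -35.0, 60.0, 15.0], index=[7, 42, 3, 19])
y_pred_demo = np.array([100.0, -10.0, 75.0, 20.0])

# Reuse the index of the actual values so each row can be traced back to the data.
comparison = pd.DataFrame(
    {'Actual': y_test_demo.to_numpy(), 'Predicted': y_pred_demo},
    index=y_test_demo.index,
)

# Residual = actual minus predicted; values near zero mean a close prediction.
comparison['Residual'] = comparison['Actual'] - comparison['Predicted']
print(comparison)
```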

Testing the model’s accuracy and printing the scores

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print('MSE:', mse)
print('MAE:', mae)
print('RMSE:', rmse)
print('R2 score:', r2)

Output:

[Screenshot of the printed MSE, MAE, RMSE, and R2 score]

Whether this accuracy is good or merely average is open to argument. The mean absolute error is 72.87346513555711, which means the predicted profit is off by about 72.87 on average, in the same units as Profit. Because the profit values in this dataset are large, an error of that size is reasonable, and the MAE is a useful measure of how reliable the model is.
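
To make those scores less abstract, the sketch below computes them by hand on hypothetical numbers that are not taken from Sales.csv. It also adds one assumption of my own: a simple baseline that always predicts the mean. An MAE is easier to judge when it is compared with such a baseline or with the typical size of the target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted values, not taken from Sales.csv.
actual = np.array([100.0, 250.0, -40.0, 10.0])
predicted = np.array([80.0, 300.0, -10.0, 0.0])

errors = actual - predicted                # 20, -50, -30, 10

mae_manual = np.mean(np.abs(errors))       # average size of the miss
mse_manual = np.mean(errors ** 2)          # squaring penalizes large misses more
rmse_manual = np.sqrt(mse_manual)          # back in the units of the target

# A simple baseline: always predict the mean of the actual values.
baseline_mae = np.mean(np.abs(actual - actual.mean()))

print(mae_manual, mse_manual, rmse_manual, baseline_mae)
```

Here the model’s MAE of 27.5 is well below the baseline’s 95.0, so on these made-up numbers it adds real information beyond guessing the average.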

Example 2

The validation data contains variables that characterize the demographic and socio-economic situation of 181 galaxies over a period of at most 26 years. We are required to predict the well-being index with the highest possible level of certainty.

You have to ask yourself which variables explain the variance of the well-being index.

Let’s start

Libraries required in this process

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

Load the validation data

df = pd.read_csv('Validation.csv')

Dropping columns that we don’t need

df.columns 
df = df[['', '', … ]].copy()  # add '#' in front of the columns to drop; keep the important columns

Put a # in front of any column name you are not going to use. Remember, we started by asking ourselves which variables best explain the variance of the well-being index.
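
The commenting-out approach is easiest to read when each column name sits on its own line. Here is a sketch on a hypothetical frame. `Unused column A` and `Unused column B` are made-up names and the values are invented.

```python
import pandas as pd

# Hypothetical frame; 'Unused column A' and 'Unused column B' are made-up names.
demo = pd.DataFrame({
    'Income Index': [0.7, 0.8],
    'Gross income per capita': [21000.0, 30500.0],
    'Unused column A': [1, 2],
    'Unused column B': [3, 4],
})

columns_to_keep = [
    'Income Index',
    'Gross income per capita',
    # 'Unused column A',   # commented out, so it is dropped
    # 'Unused column B',
]

# .copy() gives an independent frame, so later edits do not touch the original.
demo_small = demo[columns_to_keep].copy()
print(list(demo_small.columns))
```

Bringing a column back later is then a matter of deleting one `#`.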

Checking the correlation between the variables I narrowed down as best explaining the variance of the well-being index, after dropping the NaN values

df_corr = df[['Gross income per capita', 'Income Index',
 'Population using at least basic sanitation services (%)',
 'Population using at least basic drinking-water services (%)',
 'Expected years of education (galactic years)'
 ]].dropna().corr()
df_corr

Output:

[Screenshot of the correlation matrix]
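
A full correlation matrix can be a lot to scan. One way to read it, sketched here on invented values (only the column names come from this example), is to pull out the target’s column and rank the other variables by the strength of their correlation with it:

```python
import pandas as pd

# Hypothetical values; only the column names come from the article.
demo = pd.DataFrame({
    'Income Index': [0.55, 0.62, 0.70, 0.81, 0.90],
    'Gross income per capita': [12000.0, 15500.0, 21000.0, 30500.0, 41000.0],
    'Expected years of education (galactic years)': [14.0, 12.5, 15.0, 16.5, 16.0],
})

demo_corr = demo.dropna().corr()

# Correlation of every other column with the target, strongest first.
ranked = (
    demo_corr['Income Index']
    .drop('Income Index')
    .sort_values(key=abs, ascending=False)
)
print(ranked)
```

Sorting by absolute value keeps strongly negative correlations near the top, since they can be as informative as strongly positive ones.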

Defining the dependent and independent variables and splitting the data

x = df[['Gross income per capita',
 'Population using at least basic sanitation services (%)',
 'Population using at least basic drinking-water services (%)',
 'Expected years of education (galactic years)']]
y = df['Income Index']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
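
One thing to watch for: the correlation step dropped NaN values only for `df_corr`, so `df` itself may still contain them, and `LinearRegression` raises an error when its input contains NaN. A possible fix, sketched on hypothetical rows with a shortened feature list, is to drop incomplete rows before defining x and y:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ['Gross income per capita',
            'Expected years of education (galactic years)']
target = 'Income Index'

# Hypothetical rows with gaps; only the column names come from the article.
demo = pd.DataFrame({
    'Gross income per capita': [12000.0, np.nan, 21000.0, 30500.0, 41000.0],
    'Expected years of education (galactic years)': [14.0, 12.5, 15.0, np.nan, 16.0],
    'Income Index': [0.55, 0.62, 0.70, 0.81, 0.90],
})

# Keep only the rows where every feature and the target are present.
clean = demo.dropna(subset=features + [target])

demo_ml = LinearRegression()
demo_ml.fit(clean[features], clean[target])
print(len(demo), len(clean))
```

Dropping rows is the simplest choice, not the only one. If many rows are lost this way, filling the gaps (imputation) may be worth considering instead.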

Creating a linear regression model and fitting it to the training data

ml= LinearRegression()
ml.fit(x_train,y_train)

Predicting the well-being index

y_pred=ml.predict(x_test)
print(y_pred)

Output:

[Screenshot of the predicted values]

Testing the model’s accuracy and printing the score

r2_score(y_test,y_pred)


Output:

[Screenshot of the R2 score]

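Our model’s accuracy is very good. Remember, an R2 score of 1 is a perfect fit, and a score of 0 means the model does no better than always predicting the mean. On test data the score can even be negative for a model that does worse than that.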
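
The R2 score compares the model’s squared error with the squared error of always predicting the mean, which is also why it can fall below 0. The sketch below computes it by hand on hypothetical numbers, not the validation data. `r2_manual` is an illustrative helper and not part of Scikit-learn.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical values, not taken from the validation data.
actual = np.array([0.5, 0.6, 0.7, 0.8, 0.9])
good_pred = np.array([0.52, 0.58, 0.71, 0.79, 0.91])
bad_pred = np.array([0.9, 0.8, 0.7, 0.6, 0.5])   # trend reversed

def r2_manual(y_true, y_hat):
    ss_res = np.sum((y_true - y_hat) ** 2)          # variation the model fails to explain
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1 - ss_res / ss_tot

r2_good = r2_manual(actual, good_pred)
r2_bad = r2_manual(actual, bad_pred)
print(r2_good, r2_bad)
```

The close predictions score just under 1, while the reversed ones score well below 0.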

I know you will argue that the data has no well-being index column (the target variable) to train against. Yes, it is impossible to predict the well-being index directly if you do not have a column for it.

Instead, you need to build a model of an outcome that is associated with well-being. In other words, you pick a related column as the target and work with the other columns as independent variables.

In this case, we used Income Index as the target variable, with Gross income per capita, Population using at least basic sanitation services (%), Population using at least basic drinking-water services (%), and Expected years of education (galactic years) as the independent variables. You have to think outside the box😎.

Conclusion

Predictive modeling is an important way for organizations and institutions to be better informed about what the future may hold. With the Scikit-learn library (though not this alone), you can explore your data to help a client or institution make an informed decision.

This article focused on the LinearRegression model to create a predictive model. It is one of the most common and easiest ways to create one.

For data scientists, I hope this article is a good reference for your next predictive modeling project or a guide on how to get started. Share any feedback and suggestions in the comment section or with @cndavu. Thank you for taking the time to read this post.
