Apoorva Dave

Posted on Feb 23, 2019 • Edited on Mar 17, 2019

Regression from scratch - Wine quality prediction

#machinelearning #regression #python #beginners

In our previous posts, we covered the basics of machine learning and types of regression. In this article, we will build our first machine learning project. This will give you an idea of how to implement regression on different datasets. It should take about an hour to set up, understand and code. So let’s get started! 😃

The task here is to predict the quality of red wine on a scale of 0–10 given a set of features as inputs. I have solved it as a regression problem using Linear Regression.

The dataset used is the Wine Quality Data Set from the UCI Machine Learning Repository. You can check the dataset here

Input variables are fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol. And the output variable (based on sensory data) is quality (score between 0 and 10). Below is a screenshot of the top 5 rows of the dataset.

Top 5 rows of Wine Quality dataset

Dependencies

The code is in Python. Other than this, please install the following libraries using pip.

Pandas: pip install pandas
matplotlib: pip install matplotlib
numpy: pip install numpy
scikit-learn: pip install scikit-learn
seaborn: pip install seaborn

And that’s it! You are halfway through 😄. Next, follow the below steps in order to build a linear regression model in no time!

Approach

Create a new IPython Notebook and insert the below code to import the necessary modules. If any import fails, install the missing package using pip.

import pandas as pd 
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression 
from sklearn import metrics 
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns

Read the data using pandas into a dataframe. To check the top 5 rows of the dataset, use df.head()

df = pd.read_csv('winequality-red.csv')
df.head()
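One thing worth checking at this step: the copy of winequality-red.csv distributed by UCI is semicolon-delimited, so a plain read_csv call can put every value into a single column and leave you without a 'quality' column. Below is a minimal sketch of the difference, using a made-up two-row sample instead of the real file. The sample text and variable names are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical two-row sample in a semicolon-delimited layout (not the real data).
sample_text = (
    "fixed acidity;alcohol;quality\n"
    "7.4;9.4;5\n"
    "7.8;9.8;5\n"
)

# Default separator is ',': every row is read as one single column.
df_default = pd.read_csv(io.StringIO(sample_text))
print(df_default.shape[1], "quality" in df_default.columns)

# Passing sep=';' splits the columns as intended.
df_semicolon = pd.read_csv(io.StringIO(sample_text), sep=";")
print(df_semicolon.shape[1], "quality" in df_semicolon.columns)
```

If your file is comma-separated, the default call is already fine. Printing df.columns is a quick way to see which case you are in.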

Finding correlations between each attribute of dataset using corr()

# there are no categorical variables. each feature is a number. Regression problem. 
# Given the set of values for features, we have to predict the quality of wine. 
# finding correlation of each feature with our target variable - quality
correlations = df.corr()['quality'].drop('quality')
print(correlations)

Correlations between each attribute and target variable — quality

To draw a heatmap and get a detailed diagram of correlation, insert the below code.

sns.heatmap(df.corr())
plt.show()

Heatmap

Define a function get_features() which outputs only those features whose correlation is above a threshold value (passed as an input parameter to function).

def get_features(correlation_threshold):
    # keep features whose absolute correlation with quality exceeds the threshold
    abs_corrs = correlations.abs()
    high_correlations = abs_corrs[abs_corrs > correlation_threshold].index.values.tolist()
    return high_correlations
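To see the thresholding idea in isolation, here is a small standalone sketch. It uses a hypothetical helper, select_features, that takes the correlations as an argument rather than reading a global variable, and the correlation values are made up for illustration. They are not the values from the wine dataset.

```python
import pandas as pd

def select_features(correlations, threshold):
    """Return names of features whose absolute correlation exceeds threshold."""
    abs_corrs = correlations.abs()
    return abs_corrs[abs_corrs > threshold].index.tolist()

# Made-up correlation values, for illustration only.
toy_correlations = pd.Series(
    {"alcohol": 0.48, "volatile acidity": -0.39, "residual sugar": 0.01}
)

selected = select_features(toy_correlations, 0.05)
print(selected)  # ['alcohol', 'volatile acidity']
```

Taking the absolute value first keeps strongly negative correlations, which are just as informative for a linear model as strongly positive ones.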

Create two vectors, x containing input features and y containing the quality variable. In x, we get all the features except residual sugar. The threshold value can be increased if you want.

# taking features with correlation more than 0.05 as input x and quality as target variable y 
features = get_features(0.05) 
print(features) 
x = df[features] 
y = df['quality']

Create training and testing sets using train_test_split. Since test_size is not passed, the default split applies: 25% of the data is used for testing and 75% for training. You can check the size of each set using x_train.shape and x_test.shape

x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=3)
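As a quick sanity check of the default 75/25 split, here is a sketch on a toy array of 100 rows. The array contents and the variable names are assumptions for illustration. Only the shapes matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features each (values are arbitrary).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# No test_size given, so scikit-learn falls back to its default of 0.25.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=3)

print(X_tr.shape, X_te.shape)  # (75, 2) (25, 2)
```

Fixing random_state makes the split reproducible, so you get the same rows in each set every time you run the notebook.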

Once the training and testing sets are created, it is time to build your Linear Regression model. You can simply use the built-in function to create a model and then fit to training data. Once trained, coef_ gives the values of the coefficients for each feature.

# fitting linear regression to training data
regressor = LinearRegression()
regressor.fit(x_train,y_train)
# this gives the coefficients of the 10 features selected above. 

print(regressor.coef_)

To predict the quality of wine with this model, use predict().

train_pred = regressor.predict(x_train)
print(train_pred)
test_pred = regressor.predict(x_test) 
print(test_pred)

Next, we calculate the root mean squared error for the training as well as the testing set. The root-mean-square error (RMSE) is a frequently used measure of the differences between the values predicted by a model and the values actually observed. The RMSE for training and test sets should be very similar if we have built a good model. If the RMSE for the test set is much higher than that of the training set, it is likely that we’ve badly overfit the data.
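To make the definition concrete, here is a minimal sketch that computes RMSE by hand with NumPy on four made-up values. The helper name rmse and the numbers are assumptions for illustration and have nothing to do with the wine data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up observed and predicted scores.
actual = [5, 6, 7, 5]
predicted = [5, 5, 7, 7]

# Errors are 0, 1, 0, -2, so the squared errors average to 1.25.
toy_rmse = rmse(actual, predicted)
print(toy_rmse)
```

Because the errors are squared before averaging, one large miss (the last value here) weighs more than several small ones.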

# calculating rmse (scikit-learn metrics take the true values first, then the predictions)
train_rmse = metrics.mean_squared_error(y_train, train_pred) ** 0.5
print(train_rmse)
test_rmse = metrics.mean_squared_error(y_test, test_pred) ** 0.5
print(test_rmse)
# rounding off the predicted values for test set
predicted_data = np.round(test_pred)
print(predicted_data)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, test_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, test_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, test_pred)))
# displaying coefficients of each feature
coeffecients = pd.DataFrame(regressor.coef_, features)
coeffecients.columns = ['Coeffecient']
print(coeffecients)

Coefficients of each feature

These numbers mean that, holding all other features fixed, a 1 unit increase in sulphates raises the predicted quality of the wine by about 0.8. Likewise, holding all other features fixed, a 1 unit increase in volatile acidity lowers the predicted quality by about 0.99. The coefficients of the other features are read the same way.
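This reading of the coefficients can be checked directly. Here is a small sketch on synthetic, noise-free data generated from y = 2*x0 - 1*x1 + 3, which is an assumption made purely for illustration. Raising one feature by 1 while keeping the other fixed changes the prediction by exactly that feature's coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from a known linear rule (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 3.0

model = LinearRegression().fit(X, y)

# Two inputs that differ only in the first feature, by exactly 1 unit.
base = np.array([[1.0, 1.0]])
bumped = base.copy()
bumped[0, 0] += 1.0

change = model.predict(bumped)[0] - model.predict(base)[0]
print(change, model.coef_[0])  # both are approximately 2.0
```

Keep in mind that this describes how the model's prediction moves. It does not by itself prove that changing the feature in a real wine would change its quality.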

Thus, with a few lines of code, we were able to build a linear regression model to predict the quality of wine with RMSE scores of 0.65 and 0.63 for the training and testing sets respectively. This is just an idea to help you start with regression. You can play with the threshold value, try other regression models and try feature engineering as well 😍.

To get the entire code, please use this link to my repository. The dataset is also uploaded :) Clone the repository and run the notebook to see the results.

The next articles would be on Classification and a similar small project on it. Stay tuned for more! Till then happy learning 😸

Top comments (7)

Eric Alcaraz del Pico • May 28 '19

I have a problem:
correlations = df.corr()['quality'].drop('quality')
Keyerror : 'quality'
Some idea?

Apoorva Dave • Jun 2 '19

The dataframe into which you have read csv file should contain the column 'quality'.
correlations = df.corr()['quality'].drop('quality')
Here we are trying to find correlations between column quality and all the other columns other than quality. 'quality' is our target variable.

Eric Alcaraz del Pico • Jun 3 '19 • Edited

This my csv

I have this column named 'quality'.
This is my code:

And i am having the problem:

Help pls :,(

Apoorva Dave • Jun 4 '19

I see you have values for the 'quality' column in the dataset, but it is not being read properly. As you can see in the output, rows 2 and 3 are showing .... but values are present in the actual dataset. Can you try printing df['quality'] and see whether there are blank values for it?