In our previous posts, we covered the basics of machine learning and types of regression. In this article, we will build our first machine learning project, which should give you an idea of how to implement regression on different datasets. It takes about an hour to set up, understand, and code. So let’s get started! 😃
The task here is to predict the quality of red wine on a scale of 0–10 given a set of features as inputs. I have solved it as a regression problem using Linear Regression.
The dataset used is the Wine Quality Data Set from the UCI Machine Learning Repository. You can check the dataset here.
Input variables are fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol. And the output variable (based on sensory data) is quality (score between 0 and 10). Below is a screenshot of the top 5 rows of the dataset.
Dependencies
The code is written in Python. In addition, please install the following libraries using pip.
- Pandas: pip install pandas
- matplotlib: pip install matplotlib
- numpy: pip install numpy
- scikit-learn: pip install scikit-learn
- seaborn: pip install seaborn
And that’s it! You are halfway through 😄. Next, follow the below steps in order to build a linear regression model in no time!
Approach
Create a new IPython Notebook and insert the below code to import the necessary modules. In case you get any error, do install the necessary packages using pip.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
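Read the data using pandas into a dataframe. To check the top 5 rows of the dataset, use df.head(). If every column ends up in a single column, your copy of the file is probably semicolon-delimited; in that case pass sep=';' to read_csv.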
df = pd.read_csv('winequality-red.csv')
df.head()
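If the file is semicolon-delimited, a default read_csv call puts everything into one column and a later lookup of 'quality' fails. The sketch below illustrates this on a tiny in-memory sample; the three columns and their values are made up for illustration and are not the real dataset.

```python
import io
import pandas as pd

# A made-up, semicolon-delimited sample (illustrative values only)
raw = (
    'fixed acidity;alcohol;quality\n'
    '7.4;9.4;5\n'
    '7.8;9.8;6\n'
)

# Default separator is a comma, so each line is read as one field
wrong = pd.read_csv(io.StringIO(raw))

# Passing the separator splits the columns as intended
right = pd.read_csv(io.StringIO(raw), sep=';')

print(wrong.shape)
print(right.shape)
print('quality' in wrong.columns, 'quality' in right.columns)
```

Checking df.shape or df.columns right after loading is a quick way to catch this before moving on.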
Find the correlation of each feature with the target variable using corr().
# there are no categorical variables. each feature is a number. Regression problem.
# Given the set of values for features, we have to predict the quality of wine.
# finding correlation of each feature with our target variable - quality
correlations = df.corr()['quality'].drop('quality')
print(correlations)
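To see what this correlation step produces, here is a small sketch on synthetic data. The column names and the relationship between them are assumptions chosen for illustration: one feature drives the target, the other is pure noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: 'alcohol' influences 'quality', 'noise' does not
alcohol = rng.normal(10, 1, 200)
toy = pd.DataFrame({
    'alcohol': alcohol,
    'noise': rng.normal(0, 1, 200),
    'quality': 0.5 * alcohol + rng.normal(0, 0.3, 200),
})

# Same pattern as above: correlation with the target, target itself dropped
toy_corr = toy.corr()['quality'].drop('quality')
print(toy_corr)
```

The informative feature should show a correlation well above that of the noise feature, which is the signal the feature selection step relies on.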
To draw a heatmap and get a detailed diagram of correlation, insert the below code.
sns.heatmap(df.corr())
plt.show()
Define a function get_features() which returns only those features whose absolute correlation with quality is above a threshold value (passed as an input parameter to the function).
def get_features(correlation_threshold):
    # use absolute values so strong negative correlations are kept as well
    abs_corrs = correlations.abs()
    high_correlations = abs_corrs[abs_corrs > correlation_threshold].index.values.tolist()
    return high_correlations
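The function above reads the global correlations variable. A variant that takes the correlations as an argument is easier to try out in isolation; the sketch below uses a hypothetical name, select_features, and made-up correlation values.

```python
import pandas as pd

def select_features(correlations, correlation_threshold):
    """Return names of features whose absolute correlation exceeds the threshold."""
    abs_corrs = correlations.abs()
    return abs_corrs[abs_corrs > correlation_threshold].index.tolist()

# Illustrative correlation values, not computed from the real dataset
example_corrs = pd.Series({
    'alcohol': 0.48,
    'volatile acidity': -0.39,
    'residual sugar': 0.01,
})

selected = select_features(example_corrs, 0.05)
print(selected)
```

Because the comparison uses absolute values, a strongly negative feature such as the second one here is kept, while the near-zero one is dropped.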
Create two variables: x, containing the input features, and y, containing the quality variable. With a threshold of 0.05, x contains all the features except residual sugar. The threshold value can be increased if you want.
# taking features with correlation more than 0.05 as input x and quality as target variable y
features = get_features(0.05)
print(features)
x = df[features]
y = df['quality']
Create training and testing sets using train_test_split. By default, 25% of the data is used for testing and 75% for training. You can check the size of each set using x_train.shape and x_test.shape.
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=3)
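The 75/25 split comes from the default test_size of train_test_split. A minimal sketch on a dummy array of 100 rows (the arrays are placeholders, not the wine data) shows the resulting shapes and how to change the ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 100 samples with 2 features each
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Default split: 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=3)
print(X_tr.shape, X_te.shape)

# Explicit split: 20% of the rows go to the test set
X_tr20, X_te20, y_tr20, y_te20 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=3
)
print(X_tr20.shape, X_te20.shape)
```

Fixing random_state makes the split reproducible, so the reported errors stay the same between runs.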
Once the training and testing sets are created, it is time to build your Linear Regression model. You can simply use the built-in class to create a model and then fit it to the training data. Once trained, coef_ gives the values of the coefficients for each feature.
# fitting linear regression to training data
regressor = LinearRegression()
regressor.fit(x_train,y_train)
# this gives the coefficients of the 10 features selected above.
print(regressor.coef_)
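To build some intuition for coef_, the sketch below fits LinearRegression on synthetic data generated from known coefficients and checks that the fitted values come out close to them. The data and the chosen coefficients (0.8, -1.0, intercept 3.0) are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic data: y = 3.0 + 0.8 * x0 - 1.0 * x1 + small noise
X_syn = rng.normal(size=(500, 2))
y_syn = 3.0 + 0.8 * X_syn[:, 0] - 1.0 * X_syn[:, 1] + rng.normal(scale=0.1, size=500)

model = LinearRegression().fit(X_syn, y_syn)

# coef_ holds one weight per feature; intercept_ is stored separately
print(model.coef_)
print(model.intercept_)
```

Note that coef_ does not include the intercept, so a full description of the fitted line needs both attributes.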
To predict the quality of wine with this model, use predict().
train_pred = regressor.predict(x_train)
print(train_pred)
test_pred = regressor.predict(x_test)
print(test_pred)
Calculate the root mean squared error (RMSE) for the training and testing sets. RMSE is a frequently used measure of the difference between the values predicted by a model and the values actually observed: it is the square root of the mean of the squared prediction errors. The RMSE for the training and test sets should be very similar if we have built a good model. If the RMSE for the test set is much higher than that of the training set, it is likely that we’ve badly overfit the data.
# calculating rmse (mean_squared_error lives in the metrics module imported above)
train_rmse = metrics.mean_squared_error(y_train, train_pred) ** 0.5
print(train_rmse)
test_rmse = metrics.mean_squared_error(y_test, test_pred) ** 0.5
print(test_rmse)
# rounding off the predicted values for test set
predicted_data = np.round(test_pred)
print(predicted_data)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, test_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, test_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, test_pred)))
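As a sanity check on what these metrics compute, RMSE can be written out by hand with NumPy and compared against scikit-learn. The four true and predicted values below are made up for illustration.

```python
import numpy as np
from sklearn import metrics

# Made-up quality scores and predictions
y_true = np.array([5, 6, 5, 7])
y_pred = np.array([5.5, 5.5, 6.0, 6.0])

# RMSE by hand: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's mean squared error
rmse_sklearn = metrics.mean_squared_error(y_true, y_pred) ** 0.5

# MAE for comparison: average of the absolute errors
mae = metrics.mean_absolute_error(y_true, y_pred)

print(rmse_manual, rmse_sklearn, mae)
```

Because the errors are squared before averaging, RMSE penalizes the two larger misses more than MAE does, which is why it comes out higher here.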
# displaying coefficients of each feature
coefficients = pd.DataFrame(regressor.coef_, index=features, columns=['Coefficient'])
print(coefficients)
These numbers mean that, holding all other features fixed, a 1 unit increase in sulphates is associated with a predicted increase of about 0.8 in the quality score.
Likewise, holding all other features fixed, a 1 unit increase in volatile acidity is associated with a predicted decrease of about 0.99, and similarly for the other features.
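This reading of the coefficients can be checked directly: raise one feature by 1 while keeping the others fixed, and the prediction moves by exactly that feature's coefficient. The sketch below uses a tiny noise-free toy dataset with assumed coefficients of 0.8 and -1.0, not the wine data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated exactly from y = 2.0 + 0.8 * x0 - 1.0 * x1
X_toy = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]])
y_toy = 2.0 + 0.8 * X_toy[:, 0] - 1.0 * X_toy[:, 1]

toy_model = LinearRegression().fit(X_toy, y_toy)

# Increase the first feature by one unit, keep the second fixed
base = np.array([[1.0, 1.0]])
bumped = base.copy()
bumped[0, 0] += 1.0

delta = toy_model.predict(bumped)[0] - toy_model.predict(base)[0]
print(delta, toy_model.coef_[0])
```

Keep in mind that a "1 unit" change means very different things for features on different scales, so coefficient sizes are not directly comparable across features unless the inputs are standardized.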
Thus, with a few lines of code, we were able to build a Linear Regression model to predict the quality of wine with RMSE scores of 0.65 and 0.63 for the training and testing sets respectively. This is just an idea to help you start with regression. You can play with the threshold value, try other regression models, and try feature engineering as well 😍.
To get the entire code, please use this link to my repository. The dataset is also uploaded :) Clone the repository and run the notebook to see the results.
The next articles will cover Classification and a similar small project on it. Stay tuned for more! Till then, happy learning 😸
Top comments (7)
I have a problem:
correlations = df.corr()['quality'].drop('quality')
KeyError: 'quality'
Any ideas?
The dataframe into which you have read csv file should contain the column 'quality'.
correlations = df.corr()['quality'].drop('quality')
Here we are trying to find correlations between column quality and all the other columns other than quality. 'quality' is our target variable.
This is my CSV
I have this column named 'quality'.
This is my code:
And I am having this problem:
Help pls :,(
I see you have values for the 'quality' column in the dataset, but they are not being read properly. As you can see in the output, rows 2 and 3 are showing .... but values are present in the actual dataset. Can you try printing df['quality'] and see whether there are blank values for it?
It is reading all the columns as one column. You need to pass the separator while reading the CSV file.
ex:
df = pd.read_csv('winequality-red.csv', sep=";")
Hey Eric,
Please pass the separator value while reading the CSV file.
ex:
df = pd.read_csv('winequality-red.csv', sep=";")
Congratulations...