In our previous posts, we covered the basics of machine learning and types of regression. In this article, we will build our first Machine Learning project. It will give you an idea of how to implement regression on different datasets. It should take about an hour to set up, understand and code. So let’s get started! 😃
The task here is to predict the quality of red wine on a scale of 0–10 given a set of features as inputs. I have solved it as a regression problem using Linear Regression.
The dataset used is the Wine Quality Data Set from the UCI Machine Learning Repository. You can check the dataset there.
The input variables are fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol. The output variable (based on sensory data) is quality, a score between 0 and 10. Below is a screenshot of the top 5 rows of the dataset.
The code is in Python. In addition, please install the following libraries using pip.
- pandas: pip install pandas
- matplotlib: pip install matplotlib
- numpy: pip install numpy
- scikit-learn: pip install scikit-learn
- seaborn: pip install seaborn
And that’s it! You are halfway through 😄. Next, follow the steps below to build a linear regression model in no time!
Create a new IPython Notebook and insert the code below to import the necessary modules. If you get an import error, install the missing package using pip.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn import metrics
    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns
Read the data using pandas into a dataframe. To check the top 5 rows of the dataset, use
    df = pd.read_csv('winequality-red.csv')
    df.head()
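One thing to watch for: as far as I know, the copy of the file distributed by UCI is semicolon-separated, while other copies are comma-separated. If `df.head()` shows everything squeezed into a single column, a delimiter-sniffing loader like the sketch below may help. The helper name `load_wine_csv` and the tiny sample file are my own, made up for the demonstration.

```python
import os
import tempfile

import pandas as pd


def load_wine_csv(path):
    """Read the wine CSV whether it is comma- or semicolon-separated."""
    # sep=None asks pandas to detect the delimiter; this needs engine="python".
    return pd.read_csv(path, sep=None, engine="python")


# Tiny made-up file in the semicolon-separated layout, for demonstration only.
sample = '"alcohol";"pH";"quality"\n9.4;3.51;5\n9.8;3.2;5\n11.2;3.16;6\n'
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w") as handle:
        handle.write(sample)
    demo_df = load_wine_csv(sample_path)

print(demo_df.columns.tolist())
print(demo_df.shape)
```

With the real file, `df = load_wine_csv('winequality-red.csv')` would take the place of the plain `pd.read_csv` call.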
Find the correlation of each feature with the target variable, quality, using
    # there are no categorical variables. each feature is a number. Regression problem.
    # Given the set of values for features, we have to predict the quality of wine.
    # finding correlation of each feature with our target variable - quality
    correlations = df.corr()['quality'].drop('quality')
    print(correlations)
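By default, `df.corr()` computes the Pearson correlation coefficient. To make that number less of a black box, here is a minimal hand-written version. The function name `pearson` and the sample values are mine and are not rows from the wine dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only.
alcohol = [9.0, 10.0, 11.0, 12.0]
quality = [5, 5, 6, 7]
r = pearson(alcohol, quality)
print(round(r, 3))
```

A value close to 1 or -1 means a strong linear relationship with quality, while a value near 0 means the feature tells a linear model very little on its own.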
To get a more detailed picture of the correlations between all the attributes, draw a heatmap with seaborn.
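A heatmap of the full correlation matrix could be drawn roughly like this. Treat it as a sketch: the helper name `plot_correlation_heatmap`, the figure size, the colour map and the title are my assumptions, and the toy dataframe only stands in for `df` so the snippet runs on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_correlation_heatmap(frame):
    """Draw an annotated heatmap of the pairwise correlations in `frame`."""
    corr_matrix = frame.corr()
    fig, ax = plt.subplots(figsize=(10, 8))
    # annot=True writes the correlation value inside each cell.
    sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
    ax.set_title("Correlation between attributes")
    return ax


# Made-up stand-in for df, for illustration only.
toy_df = pd.DataFrame(
    {
        "alcohol": [9.4, 9.8, 11.2, 10.5, 12.0],
        "pH": [3.51, 3.20, 3.16, 3.30, 3.40],
        "quality": [5, 5, 6, 6, 7],
    }
)
heatmap_ax = plot_correlation_heatmap(toy_df)
```

With the real data you would call `plot_correlation_heatmap(df)`. In a notebook the figure renders at the end of the cell; in a plain script, add `plt.show()`.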
Define a function get_features() which returns only those features whose absolute correlation with quality is above a threshold value (passed as an input parameter to the function).
    def get_features(correlation_threshold):
        abs_corrs = correlations.abs()
        high_correlations = abs_corrs[abs_corrs > correlation_threshold].index.values.tolist()
        return high_correlations
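Note that get_features() reads the global `correlations` variable. If you would rather pass the correlations in explicitly, a variant could look like the sketch below. The name `select_features` and the correlation values are made up for illustration and are not computed from the dataset.

```python
import pandas as pd


def select_features(correlations, correlation_threshold):
    """Return the names whose absolute correlation exceeds the threshold."""
    abs_corrs = correlations.abs()
    return abs_corrs[abs_corrs > correlation_threshold].index.tolist()


# Illustrative, made-up correlations with the target variable.
toy_correlations = pd.Series(
    {"feature_a": 0.48, "feature_b": -0.39, "feature_c": 0.01}
)
selected = select_features(toy_correlations, 0.05)
print(selected)
```

Because the absolute value is used, a strongly negative correlation such as `feature_b` is kept, and only the near-zero `feature_c` is dropped.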
Create two variables, x containing the input features and y containing the quality variable. With a threshold of 0.05, x gets all the features except residual sugar. You can increase the threshold if you want to keep fewer features.
    # taking features with correlation more than 0.05 as input x and quality as target variable y
    features = get_features(0.05)
    print(features)
    x = df[features]
    y = df['quality']
Create the training and testing sets using train_test_split. 25% of the data is used for testing and 75% for training. You can check the size of each set using its shape attribute, for example x_train.shape.
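The split can be sketched as follows. The `random_state` value is an assumption of mine and only makes the split reproducible. The random `x` and `y` here are stand-ins so the snippet runs on its own; in the notebook you would use the `x` and `y` created in the previous step.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-ins for the x and y built earlier (random data, for illustration only).
rng = np.random.default_rng(0)
x = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y = pd.Series(rng.integers(3, 9, size=100), name="quality")

# test_size=0.25 keeps 25% of the rows for testing and 75% for training.
# random_state is an assumed seed, not a value taken from the original notebook.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42
)

# checking the size of each set
print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
```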
Once the training and testing sets are created, it is time to build your Linear Regression model. You can simply use scikit-learn's built-in LinearRegression class to create a model and then fit it to the training data. Once trained, coef_ gives the values of the coefficients for each feature.
    # fitting linear regression to training data
    regressor = LinearRegression()
    regressor.fit(x_train, y_train)

    # this gives the coefficients of the 10 features selected above.
    print(regressor.coef_)
To predict the quality of wine with this model, use
    train_pred = regressor.predict(x_train)
    print(train_pred)
    test_pred = regressor.predict(x_test)
    print(test_pred)
Calculate the root mean squared error (RMSE) for the training and testing sets. RMSE is a frequently used measure of the differences between the values predicted by a model and the values actually observed. If the model generalizes well, the RMSE for the training and test sets should be very similar. If the RMSE for the test set is much higher than that of the training set, it is likely that we’ve badly overfit the data.
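To see what the metric actually does, here is RMSE written out by hand: square each error, average the squares, then take the square root. The function name `rmse` and the sample numbers are made up for illustration.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up predictions and true scores, for illustration only.
predicted = [5.2, 5.9, 6.4, 4.8]
observed = [5, 6, 7, 5]
print(round(rmse(predicted, observed), 3))
```

Because the errors are squared before averaging, one prediction that is far off (here 6.4 against a true score of 7) contributes much more than several small misses.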
    # calculating rmse (mean_squared_error is accessed through the imported metrics module)
    train_rmse = metrics.mean_squared_error(y_train, train_pred) ** 0.5
    print(train_rmse)
    test_rmse = metrics.mean_squared_error(y_test, test_pred) ** 0.5
    print(test_rmse)

    # rounding off the predicted values for test set
    # (np.round replaces np.round_, which was removed in NumPy 2.0)
    predicted_data = np.round(test_pred)
    print(predicted_data)

    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, test_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, test_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, test_pred)))

    # displaying coefficients of each feature
    coefficients = pd.DataFrame(regressor.coef_, features)
    coefficients.columns = ['Coefficient']
    print(coefficients)
These numbers mean that, holding all other features fixed, a 1 unit increase in sulphates is associated with a predicted increase of about 0.8 in the quality of wine, and similarly for the other features.
Likewise, holding all other features fixed, a 1 unit increase in volatile acidity is associated with a predicted decrease of about 0.99 in the quality of wine.
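A full 1 unit change is large for these features, so it can help to scale the effect down. In a linear model the predicted change is just the coefficient multiplied by the change in the feature. The sketch below uses the two rounded coefficients quoted above; the helper name `predicted_change` and the 0.1 step are my own choices.

```python
# Coefficients as quoted in the text above (rounded).
coefficients = {"sulphates": 0.8, "volatile acidity": -0.99}


def predicted_change(feature, delta):
    """Change in predicted quality when one feature moves by delta, others fixed."""
    return coefficients[feature] * delta


# Effect of a 0.1 increase in each feature.
print(round(predicted_change("sulphates", 0.1), 3))
print(round(predicted_change("volatile acidity", 0.1), 3))
```

So a 0.1 increase in sulphates moves the predicted quality up by about 0.08, and a 0.1 increase in volatile acidity moves it down by about 0.1.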
Thus, with a few lines of code, we were able to build a Linear Regression model to predict the quality of wine with RMSE scores of 0.65 and 0.63 for the training and testing sets respectively. This is just an idea to help you get started with regression. You can play with the threshold value, try other regression models and experiment with feature engineering as well 😍.
To get the entire code, please use this link to my repository. The dataset is also uploaded :) Clone the repository and run the notebook to see the results.
The next articles will be on Classification, along with a similar small project. Stay tuned for more! Till then, happy learning 😸