Validation & Cross-Validation for Predictive Modelling with Linear and Multiple Linear Models
Before starting the topic, let’s get familiar with some terms.
Validation: The act of confirming something as true or correct. In this context, validation is the process of establishing documentary evidence that a procedure, process, or activity was tested before being put into production.
Cross-validation: Also known as rotation estimation or out-of-sample testing, cross-validation is a family of model validation procedures for assessing how well the results of a statistical analysis will generalize to new data.
Linear model: A model that assumes a linear relationship between the target (dependent) variable and the independent variable.
Multiple linear model: A regression model that uses a linear equation, y = b0 + b1x1 + b2x2 + ... + e, to relate a quantitative dependent variable to two or more independent variables is known as multiple linear regression.
Here we will use R’s built-in mtcars data for the coding examples. First, let’s divide the data into a training set and a test set in a 70:30 ratio. While doing so, never forget to use the set.seed() function.
set.seed(): Initializes R’s random number generator. The generator needs a starting value (the seed); by default it is seeded from the current system time, so random draws change from run to run. Fixing the seed makes the random split reproducible.
#Define the mtcars data as “data”:
data <- mtcars
#Use random seed to replicate the result
set.seed(123)
#Do random sampling to divide the cases into two independent samples
ind <- sample(2, nrow(mtcars), replace = TRUE, prob = c(0.7, 0.3))
#Data partition
train.data <- data[ind==1,]
test.data <- data[ind==2,]
We have now divided our data into training and test sets in a 70:30 ratio.
Let’s fit a Linear Model
Set miles per gallon (mpg) as the dependent variable and weight (wt) as the independent variable. (Note that lm() has no method = "lm" option; passing it only triggers a warning, so it is dropped here.)
lmodel <- lm(mpg ~ wt, data = train.data)
Now let’s generate predictions from the model.
pred <- predict(lmodel, data = test.data) #caution: see the note after the RMSE output below
To check the R-squared and error values, we first need to load the caret package into our R session.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
R2 <- R2(pred, train.data$mpg)
R2
## [1] 0.7377021
Here we obtain an R-squared of 73.77%, which means the model explains about 73.77% of the variance. Let’s check the error:
RMSE <- RMSE(pred, test.data$mpg)
## Warning in pred - obs: longer object length is not a multiple of shorter object
## length
RMSE
## [1] 8.786064
Hence the RMSE for this model is 8.79.
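A note of caution about the predict() calls above: predict() takes new data through its newdata argument, not data, so data = test.data was silently ignored and the predictions were actually made on the training rows. That is also why RMSE() warned that the object lengths differ. A corrected hold-out evaluation would look like the sketch below; its output is not shown here, and the numbers will differ from those above.
#Predict on the held-out test set: predict() needs newdata, not data
pred_test <- predict(lmodel, newdata = test.data)
#Compare the test-set predictions against the test-set responses
R2(pred_test, test.data$mpg)
RMSE(pred_test, test.data$mpg)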
Leave-One-Out Cross-Validation approach
It’s usual practice when building a machine learning model to validate your methods by setting aside a subset of your data as a test set.
LOOCV (leave-one-out cross-validation) is a type of cross-validation that uses each individual observation as a “test” set. It is a form of k-fold cross-validation in which the number of folds, k, equals the number of observations in the dataset.
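To make the mechanics concrete, here is a minimal base-R sketch of what LOOCV does (caret automates exactly this loop): each row is held out once, the model is fitted on the remaining rows, and the held-out prediction error is collected.
#Manual LOOCV: leave each row out once, fit on the rest, predict the left-out row
errs <- sapply(seq_len(nrow(mtcars)), function(i) {
  fit <- lm(mpg ~ wt, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(fit, newdata = mtcars[i, ])
})
sqrt(mean(errs^2)) #LOOCV RMSE; should agree with caret's value below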
library(caret)
# Define training control
train.control <- trainControl(method = "LOOCV")
# Train the model
model1 <- train(mpg ~ wt, data = mtcars, method = "lm",
                trControl = train.control)
print(model1)
## Linear Regression
##
## 32 samples
## 1 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 31, 31, 31, 31, 31, 31, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 3.201673 0.7104641 2.517436
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
pred1 <- predict(model1, test.data)
R2 <- R2(pred1, test.data$mpg)
R2
## [1] 0.7864736
When we fit the model with the leave-one-out strategy and predict on the test set, we get an R-squared of 78.65%, which is higher than that of the plain linear regression model.
RMSE <- RMSE(pred1, test.data$mpg)
RMSE
## [1] 2.843768
The RMSE is only 2.84, much lower than the previous one.
Let’s fit the model using the K-fold Cross-Validation approach
A K-fold CV is one in which a given data set is divided into K sections (folds), with each fold serving as the testing set at some point. Consider 10-fold cross-validation (K = 10): the data set is divided into ten folds. In the first iteration the first fold is used to test the model while the others train it; in the second iteration the second fold is the testing set and the rest form the training set. The procedure repeats until each of the ten folds has been used as a test set.
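For intuition, here is the same procedure in plain base R (caret's trainControl() does this bookkeeping for us; fold assignment is random, so results depend on the seed):
#Manual 10-fold CV: assign each row to one of 10 folds at random
set.seed(123)
folds <- sample(rep(1:10, length.out = nrow(mtcars)))
cv_rmse <- sapply(1:10, function(k) {
  fit <- lm(mpg ~ wt, data = mtcars[folds != k, ]) #train on nine folds
  pred <- predict(fit, newdata = mtcars[folds == k, ]) #test on the kth fold
  sqrt(mean((mtcars$mpg[folds == k] - pred)^2))
})
mean(cv_rmse) #average RMSE across the 10 folds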
#k-fold cross validation
library(caret)
# Define training control
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model2 <- train(mpg ~ wt, data = train.data, method = "lm",
                trControl = train.control)
Let’s calculate the R-squared value and the error, and see whether they differ from the previous results.
library(caret)
pred2 <- predict(model2, train.data)
R2 <- R2(pred2, train.data$mpg)
R2
## [1] 0.7377021
This method gives an R-squared of 73.77%, which means the model explains about 73.77% of the variance in the training data.
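Note that pred2 is computed on the training data, so this is an in-sample figure. The cross-validated estimates produced during resampling are stored in the train object itself:
print(model2) #shows the resampled RMSE, Rsquared and MAE
model2$results #the same metrics as a data frame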
Fit the model using the Repeated K-fold Cross-Validation approach
Repeated k-fold cross-validation is a technique for obtaining a more reliable estimate of a machine learning model’s predictive performance: simply repeat the cross-validation procedure several times, with different random fold assignments, and return the mean result across all folds from all runs.
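As a sketch, here is the same idea by hand, reusing the fold loop from the previous section: draw fresh random folds, rerun 10-fold CV, and average over the repeats.
#Repeated 10-fold CV in miniature: 3 repeats with fresh random folds
set.seed(123)
rep_rmse <- replicate(3, {
  folds <- sample(rep(1:10, length.out = nrow(mtcars)))
  mean(sapply(1:10, function(k) {
    fit <- lm(mpg ~ wt, data = mtcars[folds != k, ])
    pred <- predict(fit, newdata = mtcars[folds == k, ])
    sqrt(mean((mtcars$mpg[folds == k] - pred)^2))
  }))
})
mean(rep_rmse) #mean RMSE over all folds and repeats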
#repeated k-fold cross validation
library(caret)
# Define training control
set.seed(123)
train.control <- trainControl(method = "repeatedcv",
number = 10, repeats = 3)
# Train the model
model <- train(mpg ~wt, data = mtcars, method =
"lm",
trControl = train.control)
# Summarize the results
print(model)
## Linear Regression
##
## 32 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 28, 28, 29, 29, 29, 30, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2.975392 0.8351572 2.539797
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Hence we get an R-squared of 83.52% and, similarly, an RMSE of 2.98.
Summary: Which one should be used based on the R-squared values of the “lm” model?
R-squared for training set: 0.7013
R-squared for training with LOOCV: 0.7104641
R-squared for training with k-fold CV: 0.7346939
R-squared for training with repeated k-fold CV: 0.8351572
R-squared for testing set: 0.9031085
R-squared for testing with LOOCV: 0.9031085
R-squared for testing with k-fold CV: 0.9031085
R-squared for testing with repeated k-fold CV: 0.9031085
Which one should be used based on the RMSE value?
RMSE for training set: 3.08648
RMSE for training with LOOCV: 3.201673
RMSE for training with k-fold CV: 2.85133
RMSE for training with repeated k-fold CV: 2.975392
RMSE for testing set: 2.279303
RMSE for testing with LOOCV: 2.244232
RMSE for testing with k-fold CV: 2.244232
RMSE for testing with repeated k-fold CV: 2.244232
Let’s repeat the same process for a Multiple Linear Regression Model
Multiple linear regression is an extension of simple linear regression. It has more than one (two or more) independent variables and one continuous dependent variable, and it is a supervised learning method. All the assumptions of simple linear regression also apply here, with one additional condition:
Multicollinearity must not be present, i.e. the correlations between the independent variables must not be “high”. A quick check is sketched below.
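A simple way to eyeball multicollinearity before fitting is a correlation matrix of the predictors:
#Pairwise correlations among the predictors (column 1 of mtcars is mpg)
round(cor(mtcars[, -1]), 2)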
Fitting the Multiple Linear Regression Model
mlr <- lm(mpg~., data = mtcars)
Let’s check the variance inflation factor (VIF) of mlr. The VIF of a coefficient is the ratio of its variance when estimated in a model with many other factors to its variance in a model with only that term, so it measures how much multicollinearity inflates the estimate. The vif() function is available in the car package.
library(car)
## Loading required package: carData
vif(mlr)
## cyl disp hp drat wt qsec vs am
## 15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873 4.648487
## gear carb
## 5.357452 7.908747
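To see where these numbers come from: the VIF of a predictor equals 1 / (1 - R-squared) from regressing that predictor on all the other predictors. A quick check by hand for disp, which should reproduce the value of about 21.62 shown above:
#VIF of disp by hand: regress disp on the other predictors
aux <- lm(disp ~ . - mpg, data = mtcars)
1 / (1 - summary(aux)$r.squared)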
We need to drop the independent variable with the highest VIF and run the model again, repeating until all VIFs are below 10. An automated sketch of this loop follows; we then step through it manually.
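Here is an illustrative base-R sketch of that loop, using reformulate() to rebuild the formula on each pass:
#Repeatedly drop the predictor with the largest VIF until all are below 10
preds <- setdiff(names(mtcars), "mpg")
repeat {
  fit <- lm(reformulate(preds, response = "mpg"), data = mtcars)
  v <- vif(fit)
  if (max(v) < 10) break
  preds <- setdiff(preds, names(which.max(v)))
}
preds #predictors that survive the screening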
#Removing “disp” variable:
mlr1 <- lm(mpg ~ cyl+hp+drat+wt+qsec+vs+am+gear+carb, data = mtcars)
vif(mlr) #note: this looks like a typo for vif(mlr1); the output below simply repeats the full model's VIFs
## cyl disp hp drat wt qsec vs am
## 15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873 4.648487
## gear carb
## 5.357452 7.908747
#Removing “cyl” variable:
mlr2 <- lm(mpg ~ hp+drat+wt+qsec+vs+am+gear+carb, data = mtcars)
summary(mlr1) #summary of the model after dropping disp (still includes cyl)
##
## Call:
## lm(formula = mpg ~ cyl + hp + drat + wt + qsec + vs + am + gear +
## carb, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7863 -1.4055 -0.2635 1.2029 4.4753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.55052 18.52585 0.677 0.5052
## cyl 0.09627 0.99715 0.097 0.9240
## hp -0.01295 0.01834 -0.706 0.4876
## drat 0.92864 1.60794 0.578 0.5694
## wt -2.62694 1.19800 -2.193 0.0392 *
## qsec 0.66523 0.69335 0.959 0.3478
## vs 0.16035 2.07277 0.077 0.9390
## am 2.47882 2.03513 1.218 0.2361
## gear 0.74300 1.47360 0.504 0.6191
## carb -0.61686 0.60566 -1.018 0.3195
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.623 on 22 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8105
## F-statistic: 15.73 on 9 and 22 DF, p-value: 1.183e-07
vif(mlr2)
## hp drat wt qsec vs am gear carb
## 6.015788 3.111501 6.051127 5.918682 4.270956 4.285815 4.690187 4.290468
Now all VIFs are below 10, so the data is ready for fitting the different prediction models.
Leave-One-Out Cross-Validation approach on the Multiple Linear Regression Model
#Leave one out CV
library(caret)
# Define training control
train.control <- trainControl(method = "LOOCV")
# Train the model
mlr <- train(mpg ~ hp+drat+wt+qsec+vs+am+gear+carb, data = mtcars, method = "lm",
trControl = train.control)
# Summarize
summary(mlr)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8187 -1.3903 -0.3045 1.2269 4.5183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.80810 12.88582 1.072 0.2950
## hp -0.01225 0.01649 -0.743 0.4650
## drat 0.88894 1.52061 0.585 0.5645
## wt -2.60968 1.15878 -2.252 0.0342 *
## qsec 0.63983 0.62752 1.020 0.3185
## vs 0.08786 1.88992 0.046 0.9633
## am 2.42418 1.91227 1.268 0.2176
## gear 0.69390 1.35294 0.513 0.6129
## carb -0.61286 0.59109 -1.037 0.3106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.566 on 23 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8187
## F-statistic: 18.5 on 8 and 23 DF, p-value: 2.627e-08
We get a multiple R-squared of 86.55% and a residual standard error of 2.566 on 23 degrees of freedom. Note that summary() describes the final model refitted on all the data; printing the train object (print(mlr)) would show the LOOCV resampling estimates instead.
Let’s fit the model using the K-fold Cross-Validation approach on the Multiple Linear Regression Model.
#K- folds Cross- Validation
library(caret)
# Define training control
train.control <- trainControl(method = "cv", number = 10)
# Train the model
mlr1<- train(mpg ~ hp+drat+wt+qsec+vs+am+gear+carb, data = mtcars, method = "lm",
trControl = train.control)
# Summarize
summary(mlr1)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8187 -1.3903 -0.3045 1.2269 4.5183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.80810 12.88582 1.072 0.2950
## hp -0.01225 0.01649 -0.743 0.4650
## drat 0.88894 1.52061 0.585 0.5645
## wt -2.60968 1.15878 -2.252 0.0342 *
## qsec 0.63983 0.62752 1.020 0.3185
## vs 0.08786 1.88992 0.046 0.9633
## am 2.42418 1.91227 1.268 0.2176
## gear 0.69390 1.35294 0.513 0.6129
## carb -0.61286 0.59109 -1.037 0.3106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.566 on 23 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8187
## F-statistic: 18.5 on 8 and 23 DF, p-value: 2.627e-08
Again we get an R-squared of 86.55% and an error of 2.566. The numbers repeat because summary() always shows the same final refitted model, whatever the resampling scheme.
Fit the model using the Repeated K-fold Cross-Validation approach
set.seed(224)
# Repeated K- folds Cross- Validation
library(caret)
# Define training control
train.control <- trainControl(method = "repeatedcv",
number = 10, repeats = 3)
# Train the model
mlr2<- train(mpg ~ hp+drat+wt+qsec+vs+am+gear+carb, data = mtcars, method = "lm",
trControl = train.control)
# Summarize
summary(mlr2)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8187 -1.3903 -0.3045 1.2269 4.5183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.80810 12.88582 1.072 0.2950
## hp -0.01225 0.01649 -0.743 0.4650
## drat 0.88894 1.52061 0.585 0.5645
## wt -2.60968 1.15878 -2.252 0.0342 *
## qsec 0.63983 0.62752 1.020 0.3185
## vs 0.08786 1.88992 0.046 0.9633
## am 2.42418 1.91227 1.268 0.2176
## gear 0.69390 1.35294 0.513 0.6129
## carb -0.61286 0.59109 -1.037 0.3106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.566 on 23 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8187
## F-statistic: 18.5 on 8 and 23 DF, p-value: 2.627e-08
We get an R-squared of 86.55% and an error of 2.566, the same final-model figures as before.
Thank you for reading!
Top comments (2)
Hi Durga,
Excellent work, liked it a lot.
I have one comment on the section "Which one should be used based on RMSE value?"
In that section you list the training and testing set RMSE values, but you do not answer your own question (as far as I can see) about which one to choose. Pointing that out would be instructive.
Thanks ;)
We always prefer a lower RMSE value, so we would choose the model with the lowest RMSE.