*Originally published on:* Wordpress.

Recently I have been working on a Kaggle competition where participants are tasked with predicting Russian housing prices. In developing a model for the challenge, I came across a few methods for selecting the best regression

model for a given dataset. Let's load up some data and take a look.

```
library(ISLR)
set.seed(123)
swiss <- data.frame(datasets::swiss)
dim(swiss)
```

```
## [1] 47 6
```

```
head(swiss)
```

```
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
```

This data provides data on Swiss fertility and some socioeconomic

indicators. Suppose we want to predict fertility rate. Each observation

is as a percentage. In order to assess our prediction ability, create

test and training data sets.

```
index <- sample(nrow(swiss), 5)
train <- swiss[-index,]
test <- swiss[index,]
```

Next, have a quick look at the data.

```
par(mfrow=c(2,2))
plot(train$Education, train$Fertility)
plot(train$Catholic, train$Fertility)
plot(train$Infant.Mortality, train$Fertility)
hist(train$Fertility)
```

Looks like there could be some interesting relationships here. The

following block of code will take a model formula (or model matrix) and

search for the best combination of predictors.

```
library(leaps)
leap <- regsubsets(Fertility~., data = train, nbest = 10)
leapsum <- summary(leap)
plot(leap, scale = 'adjr2')
```

According to the adjusted R-squared value (larger is better), the best

two models are: those with all predictors and all predictors less

Examination. Both have adjusted R-squared values around .69 - a decent fit. Fit these models so we can evaluate them further.

```
swisslm <- lm(Fertility~., data = train)
swisslm2 <- lm(Fertility~.-Examination, data = train)
#use summary() for a more detailed look.
```

First we'll compute the mean squared error (MSE).

```
mean((train$Fertility-predict(swisslm, train))[index]^2)
```

```
## [1] 44.21879
```

```
mean((train$Fertility-predict(swisslm2, train))[index]^2)
```

```
## [1] 36.4982
```

Looks like the model with all predictors is marginally better. We're

looking for a low value of MSE here. We can also look at the Akaike

information criterion (AIC) for more information. Lower is better here

as well.

```
library(MASS)
step1 <- stepAIC(swisslm, direction = "both")
```

```
step1$anova
```

```
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## Fertility ~ Agriculture + Examination + Education + Catholic +
## Infant.Mortality
##
## Final Model:
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 36 1746.780 168.5701
## 2 - Examination 1 53.77608 37 1800.556 167.8436
```

Here, the model without `Examination`

scores lower than the full

model. It seems that both models are evenly matched, though I might be

inclined to use the model without `Examination`

.

Since we sampled our original data to make the train and test datasets,

the difference in these tests is subject to change based on the training

data used. I encourage anyone who wants to test themselves to change the

`set.seed`

at the beginning and evaluate the above results again. Even

better: use your own data!

If you learned something or have questions feel free to leave a comment

or shoot me an email.

Happy modeling,

Kiefer

## Top comments (0)