Ganiyu Olalekan

Steps Involved in Selecting a Model (Model Selection)

Model selection is a key part of the long and essential series of steps involved in creating a machine learning (ML) model that will be deployed into production.

This article aims to act as a guide to machine learning engineers new to the process of model selection in machine learning (ML).

We’ll start by understanding what model selection is:

What is Model Selection?

Model selection is the task (or process) of selecting a statistical model from a set of candidate models, given data. Wikipedia.

What this implies is that model selection is the activity of undergoing a series of tasks or processes that help us determine which statistical model (among several candidates) is best suited to make predictions for a given task.

In selecting a model we start by inspecting our dataset because everything we do afterward only matters when we know the kind of data we’re working with.

Is the dataset clean?

To begin with, we look into the dataset for issues like missing data, incorrectly formatted values, etc. This process is called data cleaning. It is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. Tableau.

Trust me! Data cleaning is a very lengthy and tiring process. It is a whole subject of its own, and valuable materials to assist those new to it are available in the further reading section below.

What is the size of the dataset?

The next thing we look into is the size of the data. How big is the data? Is it big enough to be split into 3 sets (Train, Validation, and Test set), or is it so small that we can't even extract a good enough test set (for example, the iris dataset)?

Let’s start by identifying how we can address the small dataset.

How do we define a small dataset?

A dataset of 1,000 rows or fewer can be considered small. A dataset larger than 1,000 rows can still be considered small depending on the problem you're trying to solve.

if you try to process a small data set naively, it will still work. If you try to process a large data set naively, it will take orders of magnitude longer than acceptable (and possibly exhaust your computing resources as well). ~Carlos Barge

I consider the metric by Carlos Barge to be more appropriate for distinguishing a small dataset from a large one. What constitutes a large dataset isn't just the number of rows but also the number of columns.

After defining a dataset as small, various steps should be taken to select a model for that dataset.

Note: When performing a model evaluation, consider the rule of thumb for training a model.

Your model should train on at least an order of magnitude more examples than trainable parameters developers.google.com

These steps include:

  1. Transform categorical columns to numeric (If any)
  2. Perform a k-fold cross-validation
  3. Elect candidate models
  4. Perform Model Evaluation
  5. Model selection

To explain this better, I will make use of the iris dataset to examine the measures listed above. The complete notebook on the model selection process for the iris dataset can be found on my Kaggle page.

Transform categorical columns to numeric

Machine learning models are unable to interpret non-numeric values, so before proceeding, all non-numeric columns need to be transformed to numeric values.

In most cases, columns that would need to be transformed to numeric values would be categorical columns like [low, medium, high] or [Yes, No] or [Male, Female].

Scikit-learn provides tools built to handle these conversions: they include LabelEncoder, OrdinalEncoder, OneHotEncoder, etc. All of these are available in sklearn.preprocessing.
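
As a minimal sketch (the tiny table below is made up purely for illustration, not taken from any dataset in this article), converting such columns might look like this:

"""
Encoding categorical columns with scikit-learn.
The tiny DataFrame below is made up purely for illustration.
"""

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

df = pd.DataFrame({
    'risk': ['low', 'medium', 'high', 'medium'],      # ordered categories
    'gender': ['Male', 'Female', 'Female', 'Male'],   # unordered categories
})

# OrdinalEncoder suits columns with a natural order (low < medium < high)
ordinal = OrdinalEncoder(categories=[['low', 'medium', 'high']])
df['risk_encoded'] = ordinal.fit_transform(df[['risk']]).ravel()

# OneHotEncoder suits columns with no natural order
onehot = OneHotEncoder()
gender_encoded = onehot.fit_transform(df[['gender']]).toarray()

print(df)
print(onehot.categories_)
print(gender_encoded)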

Links to articles that provide clarification on these tools can be found in the further reading section of this article.

Perform a k-fold cross-validation

K-fold cross-validation is a procedure used to estimate the skill of the model on new data. Machine Learning Mastery.

K-fold Cross-Validation

K-fold cross-validation works by splitting the dataset into a specified number of folds (say 5) and then shifting the position of the test set to a different fold at each iteration (as illustrated above).

After performing k-fold cross-validation, we end up with N copies of the same dataset, each with a different training and testing split (where N is the number of folds applied to the dataset).

There are two (2) ways to use k-fold cross-validation:

  1. Using k-fold cross-validation for evaluating a model’s performance
  2. Using k-fold cross-validation for hyper-parameter tuning

There’s a lovely article by Rukshan Pramoditha titled k-fold cross-validation explained in plain English which explains both. We will, however, use k-fold cross-validation for evaluating model performance in this case.

"""
Creating a K cross validation fold with sklearn using the iris dataset
"""

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold


# Loads iris dataset
data, target = load_iris(return_X_y=True)

# Splits dataset into 5 folds
iris_kf = KFold(n_splits=5, shuffle=True, random_state=42)

# List to store the train/test splits across the various folds
kf_data_list = [
    (
        data[train_index], 
        data[test_index], 
        target[train_index], 
        target[test_index]
    )
    for train_index, test_index in iris_kf.split(data, target)
]

The purpose of performing k-fold cross-validation here is to effectively expand the dataset.

What do I mean by this? The iris dataset, for instance, has a total of 150 examples, which is so small that extracting a test and cross-validation set would leave us with very little to train with.

By splitting the dataset into a training and test set across 5 different folds, we maximize the use of the available data for training while still being able to test the model.
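
Continuing from the kf_data_list created above, we can confirm how much data each fold leaves for training and testing. With 150 examples and 5 folds, every iteration trains on 120 examples and tests on the remaining 30:

# Inspecting how much data each fold leaves for training vs. testing
for i, (train_X, test_X, train_y, test_y) in enumerate(kf_data_list, start=1):
    print(f"Fold {i}: {train_X.shape[0]} training examples, {test_X.shape[0]} test examples")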

Elect candidate models

Now that we’ve successfully split our dataset into 5 folds, we can proceed to elect the candidate models. This is where we look at the kind of task we are solving and the models that can solve/address it.

Iris Flower Classification

The Iris dataset is a classification task. It has four (4) feature columns which are sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm). All are continuous feature columns.

By visualizing the dataset, we can tell that on the petal width (cm) and petal length (cm) feature columns the classes are largely linearly separable. Well, this and probably more relationships.
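
A minimal sketch of such a visualization (assuming matplotlib is installed; it isn't used elsewhere in this article) could look like this:

"""
Visualizing the iris classes on the petal length / petal width plane.
A minimal sketch, assuming matplotlib is installed.
"""

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]  # petal length (cm)
petal_width = iris.data[:, 3]   # petal width (cm)

# Scatter plot coloured by class; the classes separate well on these two features
scatter = plt.scatter(petal_length, petal_width, c=iris.target)
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.legend(scatter.legend_elements()[0], iris.target_names.tolist())
plt.show()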

Question: What models best decide these relationships?

I’ll go straight to listing out models that can determine these relationships. For more on the reasons we picked these models, check out the further reading section.

We’ll be electing the LogisticRegression, SVC, KNN, and RandomForestClassifier.

Perform Model Evaluation

Now that we’ve decided on the machine learning (ML) models, we can proceed to evaluate the models with our dataset using cross-validation.

We would make use of the sklearn.model_selection.cross_val_score to cross-validate the dataset and get the scores on the model performance across each fold.

"""
Model performance on the iris dataset
Trying to evaluate best performing models using cross validation.
"""

from sklearn.datasets import load_iris

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import cross_val_score


def model_performance(data, target, *models):
    """
    Takes a record of the model performance during cross validation
    returns the record of the model performance along with the
            model performance rating of the stating which model performed
            best and which performed worst
    """

    record = {
        'Logistic Regression': {},
        'K-Nearest Neighbor': {},
        'Random Forest Classifier': {},
        'Support Vector Classifier': {},
    }

    avg_model_performance = []

    for model, name in zip(models, record.keys()):
        scores = cross_val_score(model, data, target, cv=5, scoring='accuracy')

        record[name]['scores'] = scores
        record[name]['mean_score'] = scores.mean()
        avg_model_performance.append((round(float(scores.mean()) * 100, 2), name))

    record['Model Performance Rating'] = sorted(avg_model_performance, reverse=True)

    return record


data, target = load_iris(return_X_y=True)

record = model_performance(
    data, target,
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
    RandomForestClassifier(), SVC()
)

for model in list(record.keys())[:-1]:
    print(model, record[model])

print(
    "\n\nModel Performance Rating\n",
    record['Model Performance Rating']
)

Iris Model Performance

Model Selection

After cross-validating the dataset we can now conclude that the best performing models are the Logistic Regression and the K-Nearest Neighbor models which both have an accuracy of 97.33%.

This implies that either of them would be efficient for deployment. Based on the needs of the problem, we can now decide between the two. If you need a model-based learning algorithm, you can choose the Logistic Regression; if you prefer instance-based learning, go with the KNN.


Performing cross-validation experiments like this on a large dataset would be very computationally expensive.

Now that we’ve figured out how to address the smaller datasets, how do we address larger ones?

How do we define a large dataset?

What do I mean by a large dataset? A dataset of about 10,000 rows upwards is large, while datasets within the range of say 2,000 to 10,000 are reasonably medium. Of course, this metric isn’t the best.

If you try processing a large dataset naively, it will take a much longer processing time and exhaust computing power. This is a more precise metric.

After determining your dataset is large, what are the steps for selecting a model for it?

Well, unlike with smaller datasets, we can't process this dataset naively. Thus, we have to split it. This is where reducing the dataset to three (3) sets for training and evaluation comes into play.

Before we proceed though, let’s list the steps required to select a model for larger datasets:

  1. Transform Categorical Columns to Numeric (If any)
  2. Scale Continuous Columns (if necessary)
  3. Split the Dataset
  4. Elect Candidate Model
  5. Perform Model Evaluation
  6. Model Selection

You can proceed with these steps if you have a cleaned dataset. The House Prices — Advanced Regression Techniques dataset will be utilized for tutorial purposes as we analyze the steps involved in selecting models for larger datasets.

The House Prices dataset isn’t so large a dataset itself but should explain the concept behind our steps nicely.

The notebook compiling the codes for the dataset and the work we did can be found on my Kaggle page.

I'll jump right into splitting the dataset. Below is the code for cleaning the dataset and transforming the columns, in case you want to follow along with the House Prices dataset.

"""
Cleaning and transforming the housing price dataset
House Prices - Advanced Regression Techniques
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
"""

import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder


# Loading both train and test set into a dataframe
train_dataset = pd.read_csv("house_prices/train.csv", index_col='Id')
test_dataset = pd.read_csv("house_prices/test.csv", index_col='Id')

# Merging both train and test set into one data frame
# (the test set has no SalePrice, so it's only used here to fit the transformers)
dataset = pd.concat((train_dataset, test_dataset))

# Extracting the target we hope to predict (only the training rows have one)
target = train_dataset["SalePrice"].to_numpy()

# Dropping some dataset columns
dataset.drop([
    "Alley", "FireplaceQu", "PoolQC", "Fence", "MiscFeature", "SalePrice"
], axis=1, inplace=True)

# Specifying the continuous columns
continuous_col = list(dataset.describe().columns)

# Specifying the categorical columns
categorical_col = [
    col
    for col in dataset.columns
    if col not in continuous_col
]

# Creating the continuous columns data pipeline
continuous_data_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('num_scaler', StandardScaler()),
])

# Creating the categorical columns data pipeline
categorical_data_pipeline = Pipeline([
    ('freq_imputer', SimpleImputer(strategy='most_frequent')),
    ('cat_encoder', OrdinalEncoder())
])

# Creating a data pipeline for the whole dataset
housing_price_pipeline = ColumnTransformer([
    ("continous", continuous_data_pipeline, continuous_col),
    ("categorical", categorical_data_pipeline, categorical_col),
])

# Transformed instance of the dataset, keeping only the training rows
# (the ones whose SalePrice is held in the target variable)
transformed_dataset = housing_price_pipeline.fit_transform(dataset)[:len(train_dataset)]

Split the Dataset

The reason we perform an evaluation on machine learning (ML) models is to ensure they don’t under-fit or over-fit.

We were able to evaluate the iris dataset (a small dataset) using cross-validation, but given our dataset isn't as small, validating naively would be computationally expensive.

Therefore, we have to split the dataset into a train and test set. Given the entire dataset has a shape of (1460, 80), and (1460, 74) after cleaning and transformation, we can train on the train set and evaluate our model performance on the test set.

"""
Splitting the merged dataset of the housing price dataset
Merger:
https://gist.github.com/ganiyuolalekan/8e2acab87a0d4c51ff7fcd59a9ad8c4c
House Prices - Advanced Regression Techniques
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
"""

from sklearn.model_selection import train_test_split

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(
    transformed_dataset, target,
    test_size=.3, shuffle=True, random_state=42
)

Elect Candidate Model

Now that we’ve perfectly split the dataset into both train and test sets, we then proceed to elect models that can solve this task.

We have to understand the dataset. I talked about it in my notebook House Prices Prediction (Beginner) where I gave an overview of the dataset.

Since we're dealing with a regression task consisting of lots of categorical features, models with linear and decision-making abilities would be useful, like the Decision Tree Regressor or Random Forest Regressor. Let's go for the Random Forest Regressor, since it's essentially an ensemble of Decision Trees.

We should also pick models like Support Vector Regressor, Linear Regression, and K-Neighbors Regressor since we’re performing evaluations.

XGBoost will prove to be a very vital tool in your ML journey, and I suggest examining its usage in the notebook XGBoost by Kaggle grandmaster Dan Becker. More resources on XGBoost are in the further reading section.

Perform Model Evaluation

Now that we've successfully split our dataset and elected the models we want to use, it's time to see how the individual models perform after training on the train set.
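
The full evaluation lives in the notebook; a minimal sketch of the idea (fit each candidate on the train set from the split above and score its mean absolute error on the held-out test set) might look like this:

"""
Evaluating the candidate regressors on the housing price split.
A minimal sketch; the complete evaluation is in the accompanying notebook.
"""

from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

candidate_models = {
    'Linear Regression': LinearRegression(),
    'K-Neighbors Regressor': KNeighborsRegressor(),
    'Support Vector Regressor': SVR(),
    'Random Forest Regressor': RandomForestRegressor(random_state=42),
}

# Train each model on the train set and report its MAE on the test set
for name, model in candidate_models.items():
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {mae:,.2f}")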

Housing Price Performance

Beyond doubt, the Random Forest Regressor performed best, outperforming the Linear Regression model by roughly 3x. Since our focus is on model selection, though, I avoided cross-validating and fine-tuning the models.

In most cases, I would fine-tune and cross-validate each model (using grid search) to find the best accuracy it can produce before making a decision. But the models' default parameters are decent enough for this task, so let's keep it simple.
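
For reference, here is a hedged sketch of what that fine-tuning step could look like with GridSearchCV on the split from earlier; the parameter grid below is just an illustrative guess, not a recommendation:

"""
Fine-tuning the Random Forest Regressor with grid search.
The parameter grid here is illustrative only.
"""

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='neg_mean_absolute_error',
)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(-grid_search.best_score_)  # cross-validated MAE of the best parameter combination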

Model Selection

After splitting the dataset, electing the candidate model, and performing model evaluation we can come to the conclusion that the Random Forest Regressor will be best suited for deployment having a mean absolute error (MAE) of 6732.92.

Although we didn't quite fine-tune the model, we could get a much better MAE by fine-tuning the Random Forest Regressor; but the point has been established.

You could try out XGBoost and compare it to see if it performs better. What if you fine-tune the XGBoost model as well!
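
If you do try that comparison, a minimal sketch (assuming the xgboost package is installed and reusing the split from above) could look like this:

"""
Trying an XGBoost regressor on the same housing price split.
A minimal sketch, assuming the xgboost package is installed.
"""

from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

xgb_model = XGBRegressor(random_state=42)
xgb_model.fit(X_train, y_train)

xgb_mae = mean_absolute_error(y_test, xgb_model.predict(X_test))
print(f"XGBoost Regressor: MAE = {xgb_mae:,.2f}")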

Conclusion

We've seen that model selection is a key ingredient in the lengthy series of steps involved in creating a machine learning (ML) model that will be deployed into production.

We discussed heuristics for deciding whether a dataset is small or large, and the reasons for cross-validating smaller datasets and splitting the larger ones.

We also talked about why we evaluate models and how we elect candidate models before model evaluation.

I hope this guide proves effective as you apply these steps to your own machine learning tasks.

This article was originally published on Medium by me.

Further Reading

Data Cleaning

Encoding Categorical Columns

Scikit-Learn Models

Further Reading On Model Selection

Associated Notebooks

Book
