Arnold Chris
Strategies in Evaluating Machine Learning Models.

We are going to look at strategies for evaluating regression, classification, and clustering models. At the end we will also look at creating text reports of evaluation metrics and visualizing the effects of hyperparameter values.

Introduction

Models are only as useful as the quality of their predictions, so fundamentally our goal is not simply to create models, but to create high-quality models.
Let's begin:

1. Cross-Validating Models

Our method of evaluation should help us understand how well our models are able to make predictions from data they have never seen before.
One strategy is to hold out a slice of data for testing. This is called validation (or hold-out). In validation our observations (features and targets) are split into two sets: the training set and the test set. Next we train the model on the training set, using the features and target vector to teach the model how to make the best predictions. Finally, we simulate having never-before-seen external data by evaluating how our model performs on the test set.
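Here is a minimal sketch of that plain hold-out approach on the digits dataset (the 20% test split and the max_iter value are arbitrary choices for illustration):

# A minimal hold-out validation sketch on the digits dataset
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load data
features, target = load_digits(return_X_y=True)

# Hold out 20% of the observations as a test set
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=0)

# Standardize using statistics learned from the training set only
scaler = StandardScaler().fit(features_train)

# Train on the training set, then evaluate on the held-out test set
model = LogisticRegression(max_iter=1000)
model.fit(scaler.transform(features_train), target_train)
model.score(scaler.transform(features_test), target_test)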

The next code sample shows, step by step, how to do this with the digits dataset, except that it uses an improved version of the hold-out approach (k-fold cross-validation), as explained below.

# Load libraries
from sklearn import datasets
from sklearn import metrics
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Load digits dataset
digits = datasets.load_digits()

# Create features matrix
features = digits.data

# Create target vector
target = digits.target

# Create standardizer
standardizer = StandardScaler()

# Create logistic regression object
logit = LogisticRegression()

# Create a pipeline that standardizes, then runs logistic regression
pipeline = make_pipeline(standardizer, logit)

# Create k-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Conduct k-fold cross-validation
cv_results = cross_val_score(pipeline,            # Pipeline
                             features,            # Feature matrix
                             target,              # Target vector
                             cv=kf,               # Cross-validation technique
                             scoring="accuracy",  # Performance metric
                             n_jobs=-1)           # Use all CPU cores

# Calculate mean
cv_results.mean()

In my case I got a mean of 0.96995

Weaknesses of this approach

  1. The performance of the model can be highly dependent on which few observations were selected for the test set.
  2. The model is not being trained using all the available data, and it's not being evaluated on all the available data.

K-fold Cross Validation (KFCV)

In this method we split the data into k parts called folds. The model is trained using k − 1 folds (combined into one training set), with the remaining fold used as the test set. This is repeated k times, each time holding out a different fold, and the model's performance across the k iterations is averaged to produce an overall measurement.

In our code sample above we conducted k-fold cross-validation using five folds and stored the evaluation scores in cv_results.

The result is an array containing the score for each of the five folds.
I got:

array([0.96111111, 0.96388889, 0.98050139, 0.97214485, 0.97214485])

Points to consider when using KFCV

  1. It assumes that each observation was created independently of the others (i.e., the data is independent and identically distributed [IID]). If the data is IID, it is a good idea to shuffle observations when assigning them to folds. In scikit-learn we can set shuffle=True to perform shuffling.

  2. When using KFCV to evaluate a classifier, it's beneficial to have folds containing roughly the same percentage of observations from each of the different target classes (called stratified k-fold). For example, if our target vector contained gender and 80% of the observations were male, then each fold would contain 80% male and 20% female observations. In scikit-learn this is done by replacing the KFold class with StratifiedKFold (see the sketch at the end of this section).

  3. When using validation sets or cross-validation, it is important to preprocess the data based on the training set only and then apply those transformations to both the training and test sets. For example, when we fit our standardization object, standardizer, we calculate the mean and variance of only the training set. Then we apply that transformation (using transform) to both the training and test sets, as shown in the code block below:

# Import library
from sklearn.model_selection import train_test_split

# Create training and test sets
features_train, features_test, target_train, target_test = train_test_split(
features, target, test_size=0.1, random_state=1)

# Fit standardizer to training set
standardizer.fit(features_train)

# Apply to both training and test sets which can then be used to train models
features_train_std = standardizer.transform(features_train)
features_test_std = standardizer.transform(features_test)

The reason for this is that we are pretending the test set is unknown data.
If we fit our preprocessors using observations from both the training and test sets, some of the information from the test set leaks into the training set.
This rule applies to any preprocessing step, such as feature selection.
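As a minimal sketch of point 2 above, we can swap StratifiedKFold in for KFold. This reuses the same digits data and pipeline idea; max_iter is set only to avoid convergence warnings:

# Stratified k-fold keeps the class proportions roughly equal in every fold
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features, target = load_digits(return_X_y=True)

# Same pipeline as before: standardize, then run logistic regression
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Drop-in replacement for KFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Conduct stratified k-fold cross-validation
cross_val_score(pipeline, features, target, cv=skf, scoring="accuracy").mean()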

2. Creating a Baseline Regression Model.

A common approach is to create a baseline regression model to use as a comparison against other models that we train.
We can use scikit-learn's DummyRegressor for this. It is often useful for simulating a "naive" existing prediction process in a product or system.

For example, a product might have been originally hardcoded to assume that all new users will spend $100 in the first month, regardless
of their features.
If we encode that assumption into a baseline model, we are able to concretely state the benefits of using a machine learning approach by comparing the dummy model’s score with that of a trained model.

# Load libraries
from sklearn.datasets import load_wine
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

# Load data
wine = load_wine()

# Create features
features, target = wine.data, wine.target

# Make test and training split
features_train, features_test, target_train, target_test = train_test_split(
features, target, random_state=0)

# Create a dummy regressor
dummy = DummyRegressor(strategy='mean')

# "Train" dummy regressor
dummy.fit(features_train, target_train)

# Get R-squared score
dummy.score(features_test, target_test)

-0.0480213580840978  #Result.

To compare, we train our model and evaluate the performance score:

# Load library
from sklearn.linear_model import LinearRegression

# Train simple linear regression model
ols = LinearRegression()
ols.fit(features_train, target_train)

# Get R-squared score
ols.score(features_test, target_test)

0.804353263176954 #Result

DummyRegressor uses the **strategy** parameter to set the method of making predictions, including the mean or median value of the training set. Furthermore, if we set **strategy** to constant and use the constant parameter, we can make the dummy regressor predict some constant value for every observation.

# Create dummy regressor that predicts 1s for everything
clf = DummyRegressor(strategy='constant', constant=1)
clf.fit(features_train, target_train)

# Evaluate score
clf.score(features_test, target_test)
-0.06299212598425186  #Result

One small note regarding score: by default it returns the coefficient of determination (R-squared).

3. Creating a Baseline Classification Model.

This is basically the same concept as creating a regression baseline model with a few changes.

Note that a common measure of a classifier's performance is how much better it is than random guessing.

Scikit-learn's DummyClassifier makes this comparison easy.
The following code block shows how to effectively create the dummy classifier.

# Load libraries
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Load data
iris = load_iris()

# Create target vector and feature matrix
features, target = iris.data, iris.target

# Split into training and test set
features_train, features_test, target_train, target_test = train_test_split(
features, target, random_state=0)

# Create dummy classifier
dummy = DummyClassifier(strategy='uniform', random_state=1)

# "Train" model
dummy.fit(features_train, target_train)
# Get accuracy score

dummy.score(features_test, target_test)

0.42105263157894735  # Result.

By comparing the baseline classifier to our trained classifier, we can see the improvement.

# Load the library
from sklearn.ensemble import RandomForestClassifier

# Create classifier
classifier = RandomForestClassifier()

# Train model.
classifier.fit(features_train, target_train)

# Get accuracy score.
classifier.score(features_test, target_test)

0.9736842105263158   # Result.

The strategy parameter gives us a number of options for generating baseline predictions.
There are two particularly useful strategies:

  1. Stratified makes predictions proportional to the class proportions of the training set's target vector (e.g., if 20% of the observations in the training data are women, then DummyClassifier will predict women 20% of the time).

  2. Uniform generates predictions uniformly at random across the different classes. For example, if 20% of observations are women and 80% are men, uniform will produce predictions that are roughly 50% women and 50% men.
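To see the difference between the two strategies in practice, here is a rough sketch that reuses the iris split from the code above and scores each baseline (the random_state is arbitrary):

# Compare the 'stratified' and 'uniform' baseline strategies on the same split
from sklearn.dummy import DummyClassifier

for strategy in ["stratified", "uniform"]:
    baseline = DummyClassifier(strategy=strategy, random_state=1)
    baseline.fit(features_train, target_train)
    print(strategy, baseline.score(features_test, target_test))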

4. Evaluating Binary Classifier Predictions.

  • Given a trained classification model, you want to evaluate its quality.

We can define one of a number of performance metrics, including accuracy, precision, recall and F1.

Accuracy is a common performance metric; it's simply the proportion of observations predicted correctly:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Where:

**TP** is the number of true positives: observations that are part of the positive class and that we predicted correctly.

**TN** is the number of true negatives: observations that are part of the negative class and that we predicted correctly.

**FP** is the number of false positives, also called a Type I error: observations predicted to be part of the positive class that are actually part of the negative class.

**FN** is the number of false negatives, also called a Type II error: observations predicted to be part of the negative class that are actually part of the positive class.

We can measure accuracy in five-fold (the default number of folds) cross-validation by setting scoring="accuracy":

# Load libraries
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate features matrix and target vector
X, y = make_classification(n_samples = 10000,
                           n_features = 3,
                           n_informative = 3,
                           n_redundant = 0,
                           n_classes = 2,
                           random_state = 1)

# Create logistic regression
logit = LogisticRegression()

# Cross-validate model using accuracy
cross_val_score(logit, X, y, scoring="accuracy")

array([0.9555, 0.95 , 0.9585, 0.9555, 0.956 ])  # Result

Accuracy works well with balanced data. However, in the presence of imbalanced classes (e.g., 99.9% of observations belong to one class and 0.1% to the other), accuracy suffers from a paradox where a model can be highly accurate yet lack predictive power.

For example, imagine we are trying to predict the presence of a very rare cancer that occurs in 0.1% of the population.

After training our model, we find the accuracy is 95%. However, 99.9% of people do not have the cancer: if we simply created a model that “predicted” that nobody had that form of cancer, our _naive_ model would be 4.9 percentage points more accurate, yet it clearly would not be able to predict anything. For this reason, we are often motivated to use other metrics such as precision, recall, and the F1 score.
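The paradox is easy to reproduce on synthetic data. The sketch below (an imbalanced make_classification dataset and a majority-class baseline, both chosen purely for illustration) shows a model that is roughly 99% accurate yet never finds a single positive:

# Sketch of the accuracy paradox on an imbalanced dataset
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Roughly 99% of observations belong to class 0
X_imb, y_imb = make_classification(n_samples=10000, weights=[0.99], random_state=1)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(X_imb, y_imb, random_state=1)

# Baseline that always predicts the majority class
majority = DummyClassifier(strategy="most_frequent").fit(Xi_train, yi_train)
majority_predictions = majority.predict(Xi_test)

print("Accuracy:", accuracy_score(yi_test, majority_predictions))  # roughly 0.99
print("Recall:", recall_score(yi_test, majority_predictions))      # 0.0, it never finds a positive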

Precision is the proportion of observations predicted to be positive that are actually positive, i.e., how likely we are to be right when we predict something is positive:

Precision = TP / (TP + FP)

# Cross-validate model using precision
cross_val_score(logit, X, y, scoring="precision")

# array([0.95963673, 0.94820717, 0.9635996 , 0.96149949, 0.96060606])


Models with high precision are pessimistic in that they predict an observation is of the positive class only when they are very certain about it.

Recall is the proportion of truly positive observations that the model correctly predicts to be positive. Recall measures the model's ability to identify observations of the positive class. Models with high recall are optimistic in that they have a low bar for predicting that an observation is in the positive class.

# Cross-validate model using recall
cross_val_score(logit, X, y, scoring="recall")

# array([0.951, 0.952, 0.953, 0.949, 0.951])


Recall = TP / (TP + FN)

Since precision and recall are less intuitive on their own, we usually want some kind of balance between the two, and this role is filled by the F1 score.

The F1 score is the harmonic mean (a kind of average used for ratios) of precision and recall:

# Cross-validate model using F1
cross_val_score(logit, X, y, scoring="f1")

# array([0.95529884, 0.9500998 , 0.95827049, 0.95520886, 0.95577889])


F1 = 2 × (Precision × Recall) / (Precision + Recall)

This score is a measure of correctness achieved in positive prediction, that is, of the observations labelled as positive how many are actually positive.

As an alternative to using cross_val_score, if we already have the true y values and the predicted y values, we can calculate the metrics accuracy and recall directly.

# Load libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create training and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                            test_size=0.1,
                                            random_state=1)

# Predict values for the test set
y_hat = logit.fit(X_train, y_train).predict(X_test)

# Calculate accuracy
accuracy_score(y_test, y_hat)
#   0.947
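The same pattern works for the other metrics; a quick sketch that reuses y_test and y_hat from the block above:

# Calculate precision, recall, and F1 directly from true and predicted labels
from sklearn.metrics import precision_score, recall_score, f1_score

print("Precision:", precision_score(y_test, y_hat))
print("Recall:", recall_score(y_test, y_hat))
print("F1:", f1_score(y_test, y_hat))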

5. Evaluating Binary Classifier Thresholds.

This is when we want to evaluate a binary classifier at various probability thresholds.

To do this we can use the receiver operating characteristic (ROC) curve to evaluate the quality of the binary classifier. roc_curve calculates the true and false positive rates at each threshold, which we can then plot:

# Load libraries
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Create feature matrix and target vector
features, target = make_classification(n_samples=10000,
                         n_features=10,
                         n_classes=2,
                         n_informative=3,
                         random_state=3)

# Split into training and test sets
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.1, random_state=1)

# Create classifier
logit = LogisticRegression()

# Train model
logit.fit(features_train, target_train)

# Get predicted probabilities
target_probabilities = logit.predict_proba(features_test)[:,1]

# Create true and false positive rates
false_positive_rate, true_positive_rate, threshold = roc_curve(
              target_test,
             target_probabilities
)

# Plot ROC curve
plt.title("Receiver Operating Characteristic")
plt.plot(false_positive_rate, true_positive_rate)
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0] , c=".7"), plt.plot([1, 1] , c=".7")
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.show()

The graph should look something like this:

(ROC curve plot: true positive rate against false positive rate)

ROC compares the presence of true positives and false positives at every probability threshold (the probability at which an observation is predicted to belong to a class).

A classifier that predicts every observation **correctly** would look like the solid light gray line in the ROC output in the previous figure, going straight up to the top immediately.
A classifier that predicts at **random** will appear as the diagonal line. The better the model, the closer its curve is to the solid line.

Predicted Probabilities.

Until now we have only examined models based on the values they predict.
However, in many learning algorithms, those predicted values are based on probability estimates. That is, each observation is given an explicit probability of belonging in each class.
In our solution, we can use predict_proba to see the predicted probabilities for the first observation:

# Get predicted probabilities
logit.predict_proba(features_test)[0:1]

# array([[0.86891533, 0.13108467]])


We can see the classes using classes_:

logit.classes_
# array([0, 1])

In the above example the first observation has ~87% chance of being in the negative class (0) and a 13% chance of being in the positive class (1).

By default, scikit-learn predicts an observation is part of the positive class if its predicted probability is greater than 0.5 (the threshold). However, instead of this middle ground we might want to explicitly bias our model toward a different threshold for substantive reasons. For example, if a false positive is very costly to our company, we might prefer a model with a high probability threshold.
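One way to do this is to threshold the predicted probabilities ourselves. A minimal sketch, reusing target_probabilities and target_test from the ROC example above (the 0.8 cutoff is an arbitrary illustration):

# Apply a custom 0.8 probability threshold instead of the default 0.5
from sklearn.metrics import precision_score, recall_score

high_threshold_predictions = (target_probabilities >= 0.8).astype(int)

# Fewer observations are predicted positive, but those predictions are more trustworthy
print("Precision:", precision_score(target_test, high_threshold_predictions))
print("Recall:", recall_score(target_test, high_threshold_predictions))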

We fail to predict some positives, but when an observation is predicted to be positive, we can be very confident that the prediction is correct. This trade-off is represented by the true positive rate (TPR) and the false positive rate (FPR).
The TPR is the number of observations correctly predicted positive divided by all truly positive observations:

TPR = TP / (TP + FN)

FPR is the number of incorrectly predicted positives divided by all true negative observations:

FPR = FP / (FP + TN)

The ROC curve represents the respective TPR and FPR for every probability threshold.
In our solution a threshold of roughly 0.50 has a TPR of ~0.83 and an FPR of ~0.16

print("Threshold:", threshold[124])
print("True Positive Rate:", true_positive_rate[124])
print("False Positive Rate:", false_positive_rate[124])

# Threshold: 0.5008252732632008
# True Positive Rate: 0.8346938775510204
# False Positive Rate: 0.1607843137254902


However, if we increase the threshold to ~80% (i.e., increase how certain the model has to be before it predicts an observation as positive) the TPR drops significantly but so does the FPR:
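The exact array index depends on the data, so as a minimal sketch we can look up the ROC point whose threshold is closest to 0.80, using the arrays returned by roc_curve above:

# Find the ROC point whose threshold is closest to 0.80
import numpy as np

index = np.argmin(np.abs(threshold - 0.80))

print("Threshold:", threshold[index])
print("True Positive Rate:", true_positive_rate[index])
print("False Positive Rate:", false_positive_rate[index])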

This is because our higher requirement for being predicted to be in the positive class has caused the model to not identify a number of positive observations (the lower TPR) but has also reduced the noise from negative observations being predicted as positive (the lower FPR).

The ROC curve can also be used as a general metric for a model: the better the model, the higher its curve and thus the greater the area under it.
For this reason it's common to calculate the area under the ROC curve (AUC ROC) to judge the overall quality of a model at all possible thresholds. The closer the AUC ROC is to 1, the better the model.

We can make this calculation as shown below:

# Calculate area under curve
roc_auc_score(target_test, target_probabilities)

# 0.9073389355742297


6. Evaluating Multiclass Classifier Predictions.

This is useful when we have a model that predicts three or more classes and want to evaluate the model's performance.

The solution is to use cross-validation with an evaluation metric capable of handling more than two classes like so:

# Load libraries
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate features matrix and target vector
features, target = make_classification(n_samples = 10000,
                                         n_features = 3,
                                         n_informative = 3,
                                         n_redundant = 0,
                                         n_classes = 3,
                                         random_state = 1)

# Create logistic regression
logit = LogisticRegression()

# Cross-validate model using accuracy
cross_val_score(logit, features, target, scoring='accuracy')

# array([0.841 , 0.829 , 0.8265, 0.8155, 0.82 ])

For balanced classes (a roughly equal number of observations in each class of the target vector), **accuracy** is a simple and interpretable choice of evaluation metric. However, if the classes are imbalanced, we should be inclined to use other evaluation metrics.

Note that many of scikit-learn's built-in metrics, including precision, recall, and the F1 score, were originally designed for evaluating binary classifiers. We can still apply them to multiclass settings by treating our data as a set of binary problems.

This lets us apply the metric to each class as if it were the only class in the data, and then aggregate the evaluation scores for all the classes by averaging them.

# Cross-validate model using macro averaged F1 score
cross_val_score(logit, features, target, scoring='f1_macro')

# array([0.84061272, 0.82895312, 0.82625661, 0.81515121, 0.81992692])

In this code, **macro** refers to the method used to average the evaluation scores from the classes.

The options are macro, weighted, and micro:

macro
Calculate the mean of metric scores for each class, weighting each class equally.

weighted
Calculate the mean of metric scores for each class, weighting each class proportional to its size in the data.

micro
Calculate the mean of metric scores for each observation-class combination.
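To see how the three options differ on the same predictions, here is a quick sketch that computes the F1 score directly with each averaging method. It reuses features, target, and logit from the code above; the train/test split is added just for illustration:

# Compare macro, weighted, and micro averaging of the F1 score
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hold out a test set and get multiclass predictions
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=1)
predictions = logit.fit(X_train, y_train).predict(X_test)

for average in ["macro", "weighted", "micro"]:
    print(average, f1_score(y_test, predictions, average=average))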

7. Visualizing a Classifier's Performance.

We do this when we have predicted classes and true classes of the test data and we want to visually compare the model's quality.

We can start by creating a confusion matrix, which compares the predicted classes and true classes.

# Load libraries
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd

# Load data
iris = datasets.load_iris()

# Create features matrix
features = iris.data

# Create target vector
target = iris.target

# Create list of target class names
class_names = iris.target_names

# Create training and test set
features_train, features_test, target_train, target_test = train_test_split(features, target, random_state=2)

# Create logistic regression
classifier = LogisticRegression()

# Train model and make predictions
target_predicted = classifier.fit(features_train,
                                  target_train).predict(features_test)

# Create confusion matrix
matrix = confusion_matrix(target_test, target_predicted)

# Create pandas dataframe
dataframe = pd.DataFrame(matrix, index=class_names, columns=class_names)

# Create heatmap
sns.heatmap(dataframe, annot=True, cbar=None, cmap="Blues")
plt.title("Confusion Matrix"), plt.tight_layout()
plt.ylabel("True Class"), plt.xlabel("Predicted Class")
plt.show()

(Confusion matrix heatmap: true classes on the rows, predicted classes on the columns)

One of the major benefits of confusion matrices is their interpretability. Each column of the matrix (often represented as a heatmap) represents predicted classes, while every row shows true classes.

In the solution, the top-left cell is the number of observations predicted to be Iris setosa (indicated by the column) that are actually Iris setosa (indicated by the row). This means the model accurately predicted all Iris setosa flowers.

However, the model does not do as well at predicting Iris virginica. The bottom-right cell indicates that the model successfully predicted eleven observations were Iris virginica, but (looking one cell up) predicted one flower to be virginica that was actually Iris versicolor.

Noteworthy things about confusion matrices.

  1. A perfect model will have values along the diagonal and zeros everywhere else. A bad model will have the observation counts spread evenly around cells.

  2. A confusion matrix helps us see where the model was wrong and how wrong it was, i.e., we can look at the patterns of misclassification.

  3. Confusion matrices work with any number of classes.

8. Evaluating Regression Models.

The simplest method of evaluating a regression model is to calculate the mean squared error (MSE):

# Load libraries
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Generate features matrix, target vector
features, target = make_regression(n_samples = 100,
                                   n_features = 3,
                                   n_informative = 3,
                                   n_targets = 1,
                                   noise = 50,
                                   coef = False,
                                   random_state = 1)

# Create a linear regression object
ols = LinearRegression()

# Cross-validate the linear regression using (negative) MSE
cross_val_score(ols, features, target, scoring='neg_mean_squared_error')

# array([-1974.65337976, -2004.54137625, -3935.19355723, -1060.04361386, -1598.74104702])

Another common regression metric is the coefficient of determination, (R squared).

# Cross-validate the linear regression using R-squared
cross_val_score(ols, features, target, scoring='r2')

# array([0.8622399 , 0.85838075, 0.74723548, 0.91354743, 0.84469331])


MSE is the average of the squared distances between the predicted and true values. The higher the MSE, the greater the total squared error and thus the worse the model.

MSE = (1/n) Σᵢ (ŷᵢ − yᵢ)²

Mathematical benefits of squaring the error term

  1. Forces all the error values to be positive.

  2. It penalizes a few large errors more than many small errors, even if the absolute value of the errors is the same.
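A tiny worked example (with made-up numbers) makes the second point concrete: two sets of errors with the same total absolute error produce very different MSE values.

# Same total absolute error (8), very different MSE
import numpy as np

true_values = np.zeros(4)
many_small_errors = np.array([2.0, 2.0, 2.0, 2.0])  # absolute errors sum to 8
one_large_error = np.array([8.0, 0.0, 0.0, 0.0])    # absolute errors also sum to 8

print(np.mean((many_small_errors - true_values) ** 2))  # 4.0
print(np.mean((one_large_error - true_values) ** 2))    # 16.0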

Note:
By default, in scikit-learn, arguments of the _scoring_ parameter assume that higher values are better than lower values.

However, this is not the case for MSE, where higher values mean a worse model. For this reason, scikit-learn looks at the negative MSE via the _neg_mean_squared_error_ argument.

R-squared measures the amount of variance in the target vector that is explained by the model:

R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²

where yᵢ is the true target value of the ith observation, ŷᵢ is the predicted value for the ith observation, and ȳ is the mean value of the target vector. The closer R² is to 1.0, the better the model.
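As a quick sanity check (with made-up numbers), we can compute R-squared from this formula directly and compare it with scikit-learn's r2_score:

# Compute R-squared from its definition and compare with r2_score
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_residual = np.sum((y_true - y_pred) ** 2)        # sum of squared errors
ss_total = np.sum((y_true - np.mean(y_true)) ** 2)  # total variance around the mean

print(1 - ss_residual / ss_total)  # ~0.9486
print(r2_score(y_true, y_pred))    # same value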

9. Evaluating Clustering Models.

This involves evaluating the performance of an unsupervised learning algorithm.

We can use _silhouette coefficients_ to measure the quality of the clusters (not their predictive performance).

# Load libraries
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate features matrix
features, _ = make_blobs(n_samples = 1000,
                         n_features = 10,
                         centers = 2,
                         cluster_std = 0.5,
                         shuffle = True,
                         random_state = 1)
# Cluster data using k-means to predict classes
model = KMeans(n_clusters=2, random_state=1).fit(features)

# Get predicted classes
target_predicted = model.labels_

# Evaluate model
silhouette_score(features, target_predicted)

# 0.8916265564072141


Supervised model evaluation compares predictions (e.g classes or quantitative values) with the corresponding true values in the target vector.

The most common motivation for using clustering is that your data doesn't have a target vector.

While we cannot evaluate predictions versus true values if we don't have a target vector, we can evaluate the nature of the clusters themselves.

Silhouette coefficients provide a single value that measures both of the following:

  1. How small the distances are between observations in the same cluster (i.e., how dense the clusters are).

  2. How large the distances are between different clusters (i.e., how well separated the clusters are).

Formally, the ith observation's silhouette coefficient is:

sᵢ = (bᵢ − aᵢ) / max(aᵢ, bᵢ)

where sᵢ is the silhouette coefficient for observation i, aᵢ is the mean distance between i and all observations of the same class, and bᵢ is the mean distance between i and all observations from the closest cluster of a different class.

The value returned by silhouette_score is the mean silhouette coefficient for all observations. Silhouette coefficients range between −1 and 1, with 1 indicating dense, well-separated clusters.

10. Creating a Custom Evaluation Metric.

Sometimes you might want to evaluate a model using a metric you created.

Create the metric as a function and convert it into a scorer function using scikit-learn’s make_scorer:

# Load libraries
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

# Generate features matrix and target vector
features, target = make_regression(n_samples = 100,
                                   n_features = 3,
                                   random_state = 1)

# Create training set and test set
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.10, random_state=1)

# Create custom metric
def custom_metric(target_test, target_predicted):

     # Calculate R-squared score
     r2 = r2_score(target_test, target_predicted)

     # Return R-squared score
     return r2

# Make scorer and define that higher scores are better
score = make_scorer(custom_metric, greater_is_better=True)

# Create ridge regression object
classifier = Ridge()

# Train ridge regression model
model = classifier.fit(features_train, target_train)

# Apply custom scorer
score(model, features_test, target_test)

#  0.9997906102882058

First, we define a function that takes in two targets - the ground truth target vector and our predicted values - and outputs some score.

Second, we use make_scorer to create a scorer object, making sure to specify whether higher or lower scores are desirable (using the **greater_is_better **parameter).

The custom metric in the solution (custom_metric) is a toy example, since it simply wraps a built-in metric for calculating the R² score. In a real-world situation, we would replace the **custom_metric** function with whatever custom metric we wanted. However, we can see that a custom metric that calculates R² does work by comparing its result to scikit-learn's built-in **r2_score** method:

# Predict values
target_predicted = model.predict(features_test)

# Calculate R-squared score
r2_score(target_test, target_predicted)

#  0.9997906102882058

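The scorer object can also be passed straight to cross-validation helpers through the scoring parameter; a minimal sketch reusing the features, target, and score objects defined above:

# Use the custom scorer with cross-validation
from sklearn.model_selection import cross_val_score

cross_val_score(Ridge(), features, target, scoring=score)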

11. Visualizing the Effect of Training Set Size.

In some cases you would like to evaluate the effect of the number of observations in your training set on some metric (accuracy, F1, etc.).

We can then plot the accuracy against the training set size:

# Load libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve

# Load data
digits = load_digits()
# Create feature matrix and target vector
features, target = digits.data, digits.target
# Create CV training and test scores for various training set sizes
train_sizes, train_scores, test_scores = learning_curve(
              RandomForestClassifier(),   # Classifier
              features,                   # Feature matrix
              target,                     # Target vector
              cv=10,                      # Number of folds
              scoring='accuracy',         # Performance metric
              n_jobs=-1,                  # Use all computer cores
              train_sizes=np.linspace(0.01, 1.0, 50))  # 50 training set sizes

# Create means and standard deviations of training set scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)

# Create means and standard deviations of test set scores
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

# Draw lines
plt.plot(train_sizes, train_mean, '--', color="#111111", label="Training score")
plt.plot(train_sizes, test_mean, color="#111111", label="Cross-validation score")

# Draw bands
plt.fill_between(train_sizes, train_mean - train_std,
                 train_mean + train_std, color="#DDDDDD")
plt.fill_between(train_sizes, test_mean - test_std,
                 test_mean + test_std, color="#DDDDDD")

# Create plot
plt.title("Learning Curve")
plt.xlabel("Training Set Size"), plt.ylabel("Accuracy Score"),
plt.legend(loc="best")
plt.tight_layout()
plt.show()

(Learning curve plot: training and cross-validation accuracy versus training set size)

Learning curves visualize the performance (e.g., accuracy, recall) of a model on the training set and during cross-validation as the number of observations in the training set increases.
They are commonly used to determine if our learning algorithms would benefit from gathering additional training data.

In our solution, we plot the accuracy of a random forest classifier at 50 different training set sizes, ranging from 1% to 100% of the observations.
The increasing accuracy score of the cross-validated models tells us that we would likely benefit from additional observations (although in practice this might not be feasible).

12. Creating a Text Report of Evaluation Metrics.

Text reports are important when we want a quick description of a classifier's performance.

# Load libraries
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load data
iris = datasets.load_iris()

# Create features matrix
features = iris.data

# Create target vector
target = iris.target

# Create list of target class names
class_names = iris.target_names

# Create training and test set
features_train, features_test, target_train, target_test = train_test_split(features, target, random_state=0)

# Create logistic regression
classifier = LogisticRegression()

# Train model and make predictions
model = classifier.fit(features_train, target_train)
target_predicted = model.predict(features_test)

# Create a classification report
print(classification_report(target_test, target_predicted, target_names=class_names))


(Classification report output: precision, recall, F1-score, and support for each class)

13. Visualizing the Effects of Hyperparameter Values

We want to understand how the performance of a model changes as the value of some hyperparameter changes.

We can plot the hyperparameter against the model accuracy (validation curve).

# Load libraries
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Load data
digits = load_digits()

# Create feature matrix and target vector
features, target = digits.data, digits.target

# Create range of values for parameter
param_range = np.arange(1, 250, 2)

# Calculate accuracy on training and test set using range of parameter values 
train_scores, test_scores = validation_curve(
               # Classifier
               RandomForestClassifier(),
               # Feature matrix
               features,
               # Target vector
               target,
               # Hyperparameter to examine
               param_name="n_estimators",
               # Range of hyperparameter's values
               param_range=param_range,
               # Number of folds
               cv=3,
               # Performance metric
               scoring="accuracy",
               # Use all computer cores
               n_jobs=-1)

# Calculate mean and standard deviation for training set scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)

# Calculate mean and standard deviation for test set scores
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

# Plot mean accuracy scores for training and test sets
plt.plot(param_range, train_mean, label="Training score", 
                      color="black")
plt.plot(param_range, test_mean, label="Cross-validation score",
                      color="dimgrey")

# Plot accuracy bands for training and test sets
plt.fill_between(param_range, train_mean - train_std,
                 train_mean + train_std, color="gray")
plt.fill_between(param_range, test_mean - test_std,
                 test_mean + test_std, color="gainsboro")

# Create plot
plt.title("Validation Curve With Random Forest")
plt.xlabel("Number Of Trees")
plt.ylabel("Accuracy Score")
plt.tight_layout()
plt.legend(loc="best")
plt.show()


(Validation curve plot: training and cross-validation accuracy versus the number of trees)

Most training algorithms contain **hyperparameters** that must be chosen before the training process begins. For example, a random forest classifier creates a “forest” of decision trees, each of which votes on the predicted class of an observation.

One hyperparameter in random forest classifiers is the number of trees in the forest. Most often hyperparameter values are selected during model selection. However, it is occasionally useful to visualize how model performance changes as the hyperparameter value changes.

In our solution, we plot the changes in accuracy for a random forest classifier for the training set and during cross-validation as the number of trees increases. When we have a small number of trees, both the training and cross-validation score are low, suggesting the model is underfitted.

As the number of trees increases to 250, the accuracy of both levels off, suggesting there is probably not much value in the computational cost of training a massive forest.

In scikit-learn, we can calculate the validation curve using validation_curve, which contains three important parameters:

param_name
Name of the hyperparameter to vary

param_range
Values of the hyperparameter to use

scoring
Evaluation metric used to judge the model

This was a quick view of some of the best practices in evaluating machine learning models. I'd recommend reading books specifically on **model evaluation** to get a more in-depth explanation of the concepts discussed above.

Happy Coding :)
