Jakub Czakon

Posted on Dec 20, 2019 • Edited on Dec 23, 2019 • Originally published at neptune.ml

24 Evaluation Metrics for Binary Classification (And When to Use Them)

#machinelearning #python

This article was originally posted on neptune.ml/blog where you can find more in-depth articles for machine learning practitioners.

Not sure which evaluation metric you should choose for your binary classification problem? After reading this blog post you should have a good idea.

You will learn about a bunch of common and lesser-known evaluation metrics and charts to understand how to choose the model performance metric for your problem. Specifically, for each metric, I will talk about:

What is the definition and intuition behind it,
The non-technical explanation that you can communicate to business stakeholders,
How to calculate or plot it,
When should you use it.

With that, you will understand the trade-offs so that making metric related decisions will be easier.

I will present all the good stuff in a moment, but first, let’s define our classification problem.

Before we start: problem definition

You will be using those evaluation metrics in the context of a project, so I prepared an example fraud-detection problem based on a recent Kaggle competition.

I selected 43 features and sampled 66000 observations from the original dataset adjusting the fraction of positive class to 0.09.

Then I trained a bunch of lightGBM classifiers with different hyperparameters. I only used learning_rate and n_estimators parameters because I wanted to have an intuition as to which models are “truly” better. Specifically, I suspect that the model with only 10 trees is worse than a model with 100 trees. Of course, as we use more trees and smaller learning rates, it gets tricky but I think it is a decent proxy.

So for combinations of learning_rate and n_estimators, I did the following:

defined hyperparameter values:

MODEL_PARAMS = {'random_state': 1234,    
                'learning_rate': 0.1,                
                'n_estimators': 10}

trained the model:

model = lightgbm.LGBMClassifier(**MODEL_PARAMS)
model.fit(X_train, y_train)

predicted on test data:

y_test_pred = model.predict_proba(X_test)

logged all the metrics for each run:

log_binary_classification_metrics(y_test, y_test_pred)

For full code base go to this repository or scroll down to the example script.

You can also explore experiment runs with:

evaluation metrics
performance charts
metric by threshold plots

Ok, now we are ready to talk about those classification metrics!

Learn about the following evaluation metrics

Confusion Matrix
False positive rate | Type-I error
False negative rate | Type-II error
True negative rate | Specificity
Negative predictive value
False discovery rate
True positive rate | Recall | Sensitivity
Positive predictive value | Precision
Accuracy
F beta score
F1 score
F2 score
Cohen Kappa
Matthews correlation coefficient | MCC
ROC curve
ROC AUC score
Precision-Recall curve
PR AUC | Average precision
Log loss
Brier score
Cumulative gain chart
Lift curve | Lift chart
Kolmogorov-Smirnov plot
Kolmogorov-Smirnov statistic

I know it is a lot to go over at once. That is why you can jump to the section that is interesting to you and read just that.

1. Confusion Matrix

How to compute:

It is a common way of presenting true positive (tp), true negative (tn), false positive (fp) and false negative (fn) predictions. Those values are presented in the form of a matrix where the Y-axis shows the true classes while the X-axis shows the predicted classes.

It is calculated on class predictions, which means the outputs from your model need to be thresholded first.

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
cm = confusion_matrix(y_true, y_pred_class)
tn, fp, fn, tp = cm.ravel()
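To make the thresholding step concrete, here is a minimal pure-Python sketch of what the snippet above computes. The helper name `confusion_counts` and the toy labels and scores are made up for illustration; in a real project you would keep using `confusion_matrix` from scikit-learn.

```python
def confusion_counts(y_true, y_scores, threshold=0.5):
    """Return (tn, fp, fn, tp), the same order as confusion_matrix(...).ravel()."""
    tn = fp = fn = tp = 0
    for label, score in zip(y_true, y_scores):
        predicted_positive = score > threshold  # the thresholding step
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Toy data (hypothetical): 1 = fraudulent, 0 = clean
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_scores_toy = [0.1, 0.2, 0.6, 0.3, 0.9, 0.4, 0.7, 0.05]

tn, fp, fn, tp = confusion_counts(y_true_toy, y_scores_toy, threshold=0.5)
print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
```

Changing `threshold` moves observations between the predicted classes, which is why every metric built on these four counts depends on it.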

How does it look:

So in this example, we can see that:

11918 predictions were true negatives,
872 were true positives,
82 were false positives,
333 predictions were false negatives.

Also, as we already know, this is an imbalanced problem. By the way, if you want to read more about imbalanced problems I recommend taking a look at this article by Tom Fawcett.

When to use it:

Pretty much always. I like to see the raw counts rather than normalized values to get a feeling for how the model is doing on different, often imbalanced, classes.

Jump back to the evaluation metrics list ->

2. False Positive Rate | Type I error

When we predict that something is positive when it isn’t, we are contributing to the false positive rate. You can think of it as the fraction of clean transactions that will raise a false alert based on your model predictions.

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_positive_rate = fp / (fp + tn)

How models score in this metric (threshold=0.5):

For all the models type-1 error alerts are pretty low but by adjusting the threshold we can get an even lower ratio. Since we have true negatives in the denominator, our error will tend to be low just because the dataset is imbalanced.

How does it depend on the threshold:

Obviously, if we increase the threshold only higher scored observations will be classified as positive. In our example, we can see that to reach perfect FPR of 0 we need to increase the threshold to 0.83. However, that will likely mean that only very few predictions are classified as positive.
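The trade-off described above can be sketched with a small threshold sweep. The data and the helper `fpr_and_alert_count` are hypothetical and only meant to show the direction of the effect: as the threshold goes up, the false positive rate falls, but so does the number of alerts raised.

```python
def fpr_and_alert_count(y_true, y_scores, threshold):
    """Return (false positive rate, number of positive predictions) at a threshold."""
    fp = sum(1 for y, s in zip(y_true, y_scores) if y == 0 and s > threshold)
    tn = sum(1 for y, s in zip(y_true, y_scores) if y == 0 and s <= threshold)
    n_alerts = sum(1 for s in y_scores if s > threshold)
    return fp / (fp + tn), n_alerts


# Toy data (hypothetical): 6 clean and 4 fraudulent transactions
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_scores_toy = [0.05, 0.1, 0.2, 0.35, 0.55, 0.7, 0.3, 0.6, 0.8, 0.95]

sweep = {}
for threshold in (0.1, 0.3, 0.5, 0.7, 0.9):
    sweep[threshold] = fpr_and_alert_count(y_true_toy, y_scores_toy, threshold)
    fpr, n_alerts = sweep[threshold]
    print(f"threshold={threshold:.1f}  FPR={fpr:.2f}  alerts={n_alerts}")
```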

When to use it:

You rarely would use this metric alone. Usually as an auxiliary one with some other metric,
If the cost of dealing with an alert is high you should consider increasing the threshold to get fewer alerts.


3. False Negative Rate | Type II error

When we fail to predict something that is actually there, we are contributing to the false negative rate. You can think of it as the fraction of fraudulent transactions that your model misses and lets through.

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_negative_rate = fn / (tp + fn)

How models score in this metric (threshold=0.5):

We can see that in our example, type-2 errors are quite a bit higher than type-1 errors. Interestingly our BIN-98 experiment that had the lowest type-1 error has the highest type-2 error. There is a simple explanation based on the fact that our dataset is imbalanced and with type-2 error we don’t have true negatives in the denominator.

How does it depend on the threshold:

If we decrease the threshold, more observations will be classified as positive. At a certain threshold, we will mark everything as positive (fraudulent for example). We can actually get to the FNR of 0.083 by decreasing the threshold to 0.01.

When to use it:

Usually, it is not used alone but rather with some other metric,
If the cost of letting fraudulent transactions through is high and the value you get from the users isn’t, you can consider focusing on this number.


4. True Negative Rate | Specificity

It measures how many observations out of all negative observations have we classified as negative. In our fraud detection example, it tells us how many transactions, out of all non-fraudulent transactions, we marked as clean.

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_negative_rate = tn / (tn + fp)

How models score in this metric (threshold=0.5):

Very high specificity for all the models. If you think about it, in our imbalanced problem you would expect that. Classifying negative cases as negative is a lot easier than classifying positive cases and hence the score is high.

How does it depend on the threshold:

The higher the threshold, the more truly negative observations we can recall. We can see that starting from, say, threshold=0.4 our model is doing really well in classifying negative cases as negative.

When to use it:

Usually, you don’t use it alone but rather as an auxiliary metric,
When you really want to be sure that you are right when you say something is safe. A typical example would be a doctor telling a patient “you are healthy”. Making a mistake here and telling a sick person they are safe and can go home is something you may want to avoid.


5. Negative Predictive Value

It measures how many predictions out of all negative predictions were correct. You can think of it as precision for negative class. With our example, it tells us what is the fraction of correctly predicted clean transactions in all non-fraudulent predictions.

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
negative_predictive_value = tn / (tn + fn)

How models score in this metric (threshold=0.5):

All models score really high and no wonder, since with an imbalanced problem it is easy to predict negative class.

How does it depend on the threshold:

The higher the threshold the more cases are classified as negative and the score goes down. However, in our imbalanced example even at a very high threshold, the negative predictive value is still good.

When to use it:

When we care about high precision on negative predictions. For example, imagine we really don’t want to have any additional process for screening the transactions predicted as clean. In that case, we may want to make sure that our negative predictive value is high.


6. False Discovery Rate

It measures how many predictions out of all positive predictions were incorrect. You can think of it as simply 1-precision. With our example, it tells us what is the fraction of incorrectly predicted fraudulent transactions in all fraudulent predictions.
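As a quick worked check of the "1-precision" relationship, the sketch below plugs in the counts reported in the confusion matrix example earlier in the article. Nothing here is specific to any library; it is plain arithmetic on those four numbers.

```python
# Counts from the confusion matrix example earlier in the article
tn, fp, fn, tp = 11918, 82, 333, 872

precision = tp / (tp + fp)
false_discovery_rate = fp / (tp + fp)

# Both share the denominator tp + fp, so they always sum to 1
print(f"precision = {precision:.4f}")
print(f"FDR       = {false_discovery_rate:.4f}")
print(f"sum       = {precision + false_discovery_rate:.4f}")
```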

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_discovery_rate = fp / (tp + fp)

How models score in this metric (threshold=0.5):

The “best model” is an incredibly shallow lightGBM, which we expect to be incorrect (a deeper model should work better).

That is an important takeaway: looking at precision (or recall) alone can lead to you selecting a suboptimal model.

How does it depend on the threshold:

The higher the threshold, the fewer positive predictions. With fewer positive predictions, the ones that are classified as positive have higher certainty scores. Hence, the false discovery rate goes down.

When to use it:

Again, it usually doesn’t make sense to use it alone but rather coupled with other metrics like recall.
When raising false alerts is costly and when you want all the positive predictions to be worth looking at you should optimize for precision.


7. True Positive Rate | Recall | Sensitivity

It measures how many observations out of all positive observations have we classified as positive. It tells us how many fraudulent transactions we recalled from all fraudulent transactions.

When you are optimizing recall you want to put all guilty in prison.

How to compute:

from sklearn.metrics import confusion_matrix, recall_score

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_positive_rate = tp / (tp + fn)

# or simply

recall_score(y_true, y_pred_class)

How models score in this metric (threshold=0.5):

Our best model can recall 72% of fraudulent transactions at the threshold of 0.5. The difference in recall between our models is quite significant and we can clearly see better and worse models. Of course, for every model, we can adjust the threshold to recall all fraudulent transactions.

How does it depend on the threshold:

For the threshold of 0.1, we classify the vast majority of transactions as fraudulent and hence get really high recall of 0.917. As the threshold increases the recall falls.

When to use it:

Usually, you will not use it alone but rather coupled with other metrics like precision,
That being said, recall is a go-to metric, when you really care about catching all fraudulent transactions even at a cost of false alerts. Potentially it is cheap for you to process those alerts and very expensive when the transaction goes unseen.


8. Positive Predictive Value | Precision

It measures how many observations predicted as positive are in fact positive. Taking our fraud detection example, it tells us what is the ratio of transactions correctly classified as fraudulent.

When you are optimizing precision you want to make sure that people that you put in prison are guilty.

How to compute:

from sklearn.metrics import confusion_matrix, precision_score

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
positive_predictive_value = tp / (tp + fp)

# or simply

precision_score(y_true, y_pred_class)

How models score in this metric (threshold=0.5):

It seems like all the models have pretty high precision at this threshold. The “best model” is an incredibly shallow lightGBM, which obviously smells fishy. That is an important takeaway: looking at precision (or recall) alone can lead to you selecting a suboptimal model.

Of course, for every model, we can adjust the threshold to increase precision. That is because if we take a small fraction of high scoring predictions the precision on those will likely be high.

How does it depend on the threshold:

The higher the threshold the better the precision and with a threshold of 0.68 we can actually get a perfectly precise model. Over this threshold, the model doesn’t classify anything as positive and so we don’t plot it.

When to use it:

Again, it usually doesn’t make sense to use it alone but rather coupled with other metrics like recall.
When raising false alerts is costly and when you want all the positive predictions to be worth looking at you should optimize for precision.


9. Accuracy

It measures how many observations, both positive and negative, were correctly classified.

You shouldn’t use accuracy on imbalanced problems. Then, it is easy to get a high accuracy score by simply classifying all observations as the majority class. For example in our case, by classifying all transactions as non-fraudulent we can get an accuracy of over 0.9.
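To see why accuracy is misleading here, the following sketch compares the accuracy implied by the confusion matrix example earlier in the article with a dummy model that marks every transaction as clean. It only uses the four counts reported above.

```python
# Counts from the confusion matrix example earlier in the article
tn, fp, fn, tp = 11918, 82, 333, 872

n_total = tn + fp + fn + tp
n_negative = tn + fp  # all truly clean transactions

model_accuracy = (tp + tn) / n_total

# A dummy model that predicts "clean" for everything gets every
# negative right and every positive wrong
dummy_accuracy = n_negative / n_total

print(f"model accuracy: {model_accuracy:.4f}")
print(f"dummy accuracy: {dummy_accuracy:.4f}")
```

The dummy model already scores above 0.9 without catching a single fraudulent transaction, so only the gap between the two numbers carries information.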

How to compute:

from sklearn.metrics import confusion_matrix, accuracy_score

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
accuracy = (tp + tn) / (tp + fp + fn + tn)

# or simply

accuracy_score(y_true, y_pred_class)

How models score in this metric (threshold=0.5):

We can see that for all the models we beat the dummy model (all clean transactions) by a large margin. Also, the models that we’d expect to be better are in fact at the top.

How does it depend on the threshold:

With accuracy, you can really use charts like the one above to determine the optimal threshold. In this case, choosing something a bit over standard 0.5 could bump the score by a tiny bit 0.9686->0.9688.

When to use it:

When your problem is balanced using accuracy is usually a good start. An additional benefit is that it is really easy to explain it to non-technical stakeholders in your project,
When every class is equally important to you.


10. F beta score

Simply put, it combines precision and recall into one metric. The higher the score the better our model is. You can calculate it in the following way:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$

When choosing beta in your F-beta score, the more you care about recall over precision the higher beta you should choose. For example, with F1 score we care equally about recall and precision, while with F2 score recall is twice as important to us.

With 0<beta<1 we care more about precision, so the optimal threshold moves toward higher thresholds. With beta>1 our optimal threshold moves toward lower thresholds and with beta=1 it is somewhere in the middle.
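A minimal sketch of how beta shifts the balance, using the standard F-beta definition (a weighted harmonic mean of precision and recall) and the counts from the confusion matrix example earlier in the article. The helper name `f_beta` is made up for this illustration.

```python
def f_beta(precision, recall, beta):
    """(1 + beta^2) * P * R / (beta^2 * P + R), the standard F-beta definition."""
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Counts from the confusion matrix example earlier in the article
tn, fp, fn, tp = 11918, 82, 333, 872
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f_half = f_beta(precision, recall, beta=0.5)  # leans toward precision
f1 = f_beta(precision, recall, beta=1)        # balanced
f2 = f_beta(precision, recall, beta=2)        # leans toward recall

print(f"precision={precision:.4f} recall={recall:.4f}")
print(f"F0.5={f_half:.4f}  F1={f1:.4f}  F2={f2:.4f}")
```

Because precision is higher than recall at this threshold, the score drops as beta grows: the more weight recall gets, the more the weaker of the two numbers dominates.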

How to compute:

from sklearn.metrics import fbeta_score

y_pred_class = y_pred_pos > threshold
fbeta_score(y_true, y_pred_class, beta=beta)


11. F1 score (beta=1)

It’s the harmonic mean of precision and recall.

How models score in this metric (threshold=0.5):

As we can see combining precision and recall gave us a more realistic view of our models. We get 0.808 for the best one and a lot of room for improvement.

What is good is that it seems to be ranking our models correctly with those larger lightGBMs at the top.

How does it depend on the threshold:

We can adjust the threshold to optimize F1 score. Notice that for both precision and recall you could get perfect scores by increasing or decreasing the threshold. Good thing is, you can find a sweet spot for F1 metric. As you can see, getting the threshold just right can actually improve your score by a bit 0.8077->0.8121.

When to use it:

Pretty much in every binary classification problem. It is my go-to metric when working on those problems. It can be easily explained to business stakeholders.


12. F2 score (beta=2)

It’s a metric that combines precision and recall, putting 2x emphasis on recall.

How models score in this metric (threshold=0.5):

This score is even lower for all the models than F1 but can be increased by adjusting the threshold considerably.
Again, it seems to be ranking our models correctly, at least in this simple example.

How does it depend on the threshold:

We can see that with a lower threshold and therefore more true positives recalled we get a higher score. You can usually find a sweet spot for the threshold. The possible gain from 0.755 -> 0.803 shows how important threshold adjustments can be here.

When to use it:

I’d consider using it when recalling positive observations (fraudulent transactions) is more important than being precise about it.


13. Cohen Kappa Metric

In simple words, Cohen Kappa tells you how much better your model is than a random classifier that predicts based on class frequencies.

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

To calculate it one needs to calculate two things: “observed agreement” (po) and “expected agreement” (pe). Observed agreement (po) is simply how our classifier predictions agree with the ground truth, which means it is just accuracy. The expected agreement (pe) is how the predictions of the random classifier that samples according to class frequencies agree with the ground truth, or accuracy of the random classifier.

From an interpretation standpoint, I like that it extends something very easy to explain (accuracy) to situations where your dataset is imbalanced by incorporating a baseline (dummy) classifier.
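The calculation described above can be sketched directly from confusion matrix counts. This uses the standard Cohen Kappa definition, where the expected agreement is computed from the marginal frequencies of both the true labels and the predictions, and plugs in the counts from the confusion matrix example earlier in the article.

```python
# Counts from the confusion matrix example earlier in the article
tn, fp, fn, tp = 11918, 82, 333, 872
n = tn + fp + fn + tp

# Observed agreement: plain accuracy
p_o = (tp + tn) / n

# Expected agreement: chance that truth and prediction agree if they
# were independent but kept their class frequencies
p_both_positive = ((tp + fn) / n) * ((tp + fp) / n)
p_both_negative = ((tn + fp) / n) * ((tn + fn) / n)
p_e = p_both_positive + p_both_negative

kappa = (p_o - p_e) / (1 - p_e)
print(f"p_o={p_o:.4f}  p_e={p_e:.4f}  kappa={kappa:.4f}")
```

Note how a high accuracy shrinks once the equally high chance agreement of an imbalanced dataset is subtracted.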

How to compute:

from sklearn.metrics import cohen_kappa_score

y_pred_class = y_pred_pos > threshold
cohen_kappa_score(y_true, y_pred_class)

How models score in this metric (threshold=0.5):

We can easily distinguish the worst/best models based on this metric. Also, we can see that there is still a lot of room to improve our best model.

How does it depend on the threshold:

With a chart just like the one above we can find a threshold that optimizes Cohen Kappa. In this case, it is at 0.31 giving us some improvement 0.7909 -> 0.7947 from the standard 0.5.

When to use it:

This metric is not used heavily in the context of classification. Yet it can work really well for imbalanced problems and seems like a great companion/alternative to accuracy.


14. Matthews Correlation Coefficient | MCC

It’s a correlation between predicted classes and ground truth. It can be calculated based on values from the confusion matrix:

$$MCC = \frac{tp \cdot tn - fp \cdot fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}$$

Alternatively, you could also calculate the correlation between y_true and y_pred_class.
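As an illustration of the confusion matrix route, here is a short sketch using the standard MCC formula and the counts from the confusion matrix example earlier in the article. It assumes none of the four marginal sums is zero, otherwise the denominator vanishes.

```python
import math

# Counts from the confusion matrix example earlier in the article
tn, fp, fn, tp = 11918, 82, 333, 872

numerator = tp * tn - fp * fn
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

mcc = numerator / denominator
print(f"MCC = {mcc:.4f}")
```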

How to compute:

from sklearn.metrics import matthews_corrcoef

y_pred_class = y_pred_pos > threshold
matthews_corrcoef(y_true, y_pred_class)

How models score in this metric (threshold=0.5):

We can clearly see improvements in our model quality and a lot of room to grow, which I really like. Also, it ranks our models reasonably and puts models that you’d expect to be better on top. Of course, MCC depends on the threshold that we choose.

How does it depend on the threshold:

We can adjust the threshold to optimize MCC. In our case, the best score is at 0.53 but what I really like is that it is not super sensitive to threshold changes.

When to use it:

When working on imbalanced problems,
When you want to have something easily interpretable.


15. ROC Curve

It is a chart that visualizes the tradeoff between true positive rate (TPR) and false positive rate (FPR). Basically, for every threshold, we calculate TPR and FPR and plot it on one chart.

Of course, the higher the TPR and the lower the FPR for each threshold the better, so classifiers whose curves sit closer to the top-left corner are better.

Extensive discussion of ROC Curve and ROC AUC score can be found in this article by Tom Fawcett.

How to compute:

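import matplotlib.pyplot as plt
from scikitplot.metrics import plot_roc

# y_pred holds the predicted probabilities for both classes (output of predict_proba)
fig, ax = plt.subplots()
plot_roc(y_true, y_pred, ax=ax)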

How does it look:

We can see a healthy ROC curve, pushed towards the top-left side both for positive and negative class. It is not clear which one performs better across the board as with FPR < ~0.15 positive class is higher and starting from FPR~0.15 the negative class is above.


16. ROC AUC score

In order to get one number that tells us how good our curve is, we can calculate the Area Under the ROC Curve, or ROC AUC score. The more top-left your curve is the higher the area and hence higher ROC AUC score.

Alternatively, it can be shown that ROC AUC score is equivalent to calculating the rank correlation between predictions and targets. From an interpretation standpoint, it is more useful because it tells us that this metric shows how good at ranking predictions your model is. It tells you what is the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.
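The probabilistic interpretation can be checked by brute force: compare every positive with every negative and count how often the positive gets the higher score. The sketch below is for intuition only (it is quadratic in the number of observations), and the function name and toy data are made up for illustration.

```python
def roc_auc_pairwise(y_true, y_scores):
    """Fraction of (positive, negative) pairs where the positive is ranked higher."""
    positives = [s for y, s in zip(y_true, y_scores) if y == 1]
    negatives = [s for y, s in zip(y_true, y_scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical)
y_true_toy = [0, 0, 1, 1]
y_scores_toy = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_pairwise(y_true_toy, y_scores_toy)
print(f"ROC AUC = {auc:.2f}")
```

Only the ordering of the scores matters here, which is why ROC AUC says nothing about how well calibrated the probabilities are.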

How to compute:

from sklearn.metrics import roc_auc_score

roc_auc = roc_auc_score(y_true, y_pred_pos)

How models score in this metric:

We can see improvements and the models that one would guess to be better are indeed scoring higher. Also, the score is independent of the threshold which comes in handy.

When to use it:

You should use it when you ultimately care about ranking predictions and not necessarily about outputting well-calibrated probabilities (read this article by Jason Brownlee if you want to learn about probability calibration).
You should not use it when your data is heavily imbalanced. It was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: false positive rate for highly imbalanced datasets is pulled down due to a large number of true negatives.
You should use it when you care equally about positive and negative classes. It naturally extends the imbalanced data discussion from the last section. If we care about true negatives as much as we care about true positives then it totally makes sense to use ROC AUC.


17. Precision-Recall Curve

It is a curve that combines precision (PPV) and Recall (TPR) in a single visualization. For every threshold, you calculate PPV and TPR and plot it. The higher on y-axis your curve is the better your model performance.

You can use this plot to make an educated decision when it comes to the classic precision/recall dilemma. Obviously, the higher the recall the lower the precision. Knowing at which recall your precision starts to fall fast can help you choose the threshold and deliver a better model.

How to compute:

import matplotlib.pyplot as plt
from scikitplot.metrics import plot_precision_recall

fig, ax = plt.subplots()
plot_precision_recall(y_true, y_pred, ax=ax)

How does it look:

We can see that for the negative class we maintain high precision and high recall almost throughout the entire range of thresholds. For the positive class precision is starting to fall as soon as we are recalling 0.2 of true positives and by the time we hit 0.8, it decreases to around 0.7.


18. PR AUC score | Average precision

Similarly to ROC AUC score you can calculate the Area Under the Precision-Recall Curve to get one number that describes model performance.

You can also think about PR AUC as the average of precision scores calculated for each recall threshold [0.0, 1.0]. You can also adjust this definition to suit your business needs by choosing/clipping recall thresholds if needed.
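One common way to turn that idea into a number is to walk down the ranked predictions and add up the precision at every point where recall increases. The sketch below follows that step-wise definition; the function name and toy data are hypothetical, and it assumes there are no tied scores.

```python
def average_precision(y_true, y_scores):
    """Sum of precision at each true positive, weighted by the recall step.

    Assumes no tied scores (ties would need to be grouped together).
    """
    ranked = sorted(zip(y_scores, y_true), reverse=True)
    n_positive = sum(y_true)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_at_rank = hits / rank
            # each recalled positive raises recall by 1 / n_positive
            total += precision_at_rank / n_positive
    return total


# Toy data (hypothetical)
y_true_toy = [0, 0, 1, 1]
y_scores_toy = [0.1, 0.4, 0.35, 0.8]

ap = average_precision(y_true_toy, y_scores_toy)
print(f"average precision = {ap:.4f}")
```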

How to compute:

from sklearn.metrics import average_precision_score

average_precision_score(y_true, y_pred_pos)

How models score in this metric:

The models that we suspect to be “truly” better are in fact better in this metric which is definitely a good thing. Overall, we can see high scores but way less optimistic than ROC AUC scores (0.96+).

When to use it:

when you want to communicate precision/recall decision to other stakeholders
when you want to choose the threshold that fits the business problem.
when your data is heavily imbalanced. As mentioned before, it was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: since PR AUC focuses mainly on the positive class (PPV and TPR) it cares less about the frequent negative class.
when you care more about positive than negative class. If you care more about the positive class and hence PPV and TPR you should go with Precision-Recall curve and PR AUC (average precision).


19. Log loss

Log loss is often used as the objective function that is optimized under the hood of machine learning models. Yet, it can also be used as a performance metric.

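Basically, we calculate the difference between ground truth and predicted score for every observation and average those errors over all observations. For one observation the error formula reads:

$$-\left(y \log(p) + (1 - y) \log(1 - p)\right)$$

where y is the true label (0 or 1) and p is the predicted probability of the positive class.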

The more certain our model is that an observation is positive when it is, in fact, positive the lower the error. But this is not a linear relationship. It is good to take a look at how the error changes as that difference increases:

So our model gets punished very heavily when we are certain about something that is untrue. For example, when we give a score of 0.9999 to an observation that is negative our loss jumps through the roof. That is why sometimes it makes sense to clip your predictions to decrease the risk of that happening.
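To put numbers on that punishment, the sketch below evaluates the standard per-observation log loss for a negative observation at a few predicted scores, and shows the effect of clipping. The clipping bound `eps=0.01` is an arbitrary value picked for illustration, not a recommendation.

```python
import math


def log_loss_single(y, p):
    """Per-observation log loss; p must lie strictly between 0 and 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def clip(p, eps=0.01):
    """Keep predictions inside [eps, 1 - eps] (eps chosen arbitrarily here)."""
    return min(max(p, eps), 1 - eps)


# A truly negative observation (y = 0) scored with growing, wrong confidence
confident_and_right = log_loss_single(0, 0.0001)
unsure = log_loss_single(0, 0.5)
confident_and_wrong = log_loss_single(0, 0.9999)
clipped_wrong = log_loss_single(0, clip(0.9999))

print(f"p=0.0001 -> {confident_and_right:.4f}")
print(f"p=0.5    -> {unsure:.4f}")
print(f"p=0.9999 -> {confident_and_wrong:.4f}")
print(f"p=0.9999 clipped to 0.99 -> {clipped_wrong:.4f}")
```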

If you want to learn more about log-loss read this article by Daniel Godoy.

How to compute:

from sklearn.metrics import log_loss

log_loss(y_true, y_pred)

How models score in this metric:

It is difficult to really see strong improvement and get an intuitive feeling for how strong the model is. Also, the model that was chosen as the best one before (BIN-101) is in the middle of the pack. That can suggest that using log-loss as a performance metric can be a risky proposition.

When to use it:

Pretty much always there is a performance metric that better matches your business problem. Because of that, I would use log-loss as an objective for your model with some other metric to evaluate performance.


20. Brier score

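It is a measure of how far your predictions lie from the true values. For one observation it simply reads:

$$(y - p)^2$$

where y is the true label (0 or 1) and p is the predicted probability of the positive class.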

Basically, it is a mean square error in the probability space and because of that, it is usually used to calibrate probabilities of the machine learning models. If you want to read more about probability calibration I recommend that you read this article by Jason Brownlee.

It can be a great supplement to your ROC AUC score and other metrics that focus on other things.

How to compute:

from sklearn.metrics import brier_score_loss

brier_score_loss(y_true, y_pred_pos)

How models score in this metric:

The model from experiment BIN-101 has the best calibration. Its Brier score of 0.0263309 means that the root mean squared error of our predicted probabilities was about 0.16 (√0.0263309).
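For reference, here is a minimal sketch of the Brier score as a mean squared error over predicted probabilities, together with the square-root step used above for BIN-101. The toy labels and probabilities are hypothetical.

```python
import math


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and true label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy data (hypothetical)
y_true_toy = [0, 0, 1, 1]
y_prob_toy = [0.1, 0.4, 0.35, 0.8]

score = brier_score(y_true_toy, y_prob_toy)
print(f"Brier score = {score:.6f}")

# Square root of the Brier score reported for BIN-101 in the article
rms_error = math.sqrt(0.0263309)
print(f"root of 0.0263309 = {rms_error:.4f}")
```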

When to use it:

When you care about calibrated probabilities.


21. Cumulative gains chart

In simple words, it helps you gauge how much you gain by using your model over a random model for a given fraction of top scored predictions.

Simply put:

you order your predictions from highest to lowest and
for every percentile you calculate the fraction of true positive observations up to that percentile.

It makes it easy to see the benefits of using your model to target given groups of users/accounts/transactions especially if you really care about sorting them.
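The two steps above can be sketched as a small function that returns one point of the cumulative gains chart. The function name and toy data are made up for illustration, and ties in the scores are not handled specially.

```python
def cumulative_gain(y_true, y_scores, fraction):
    """Share of all positives found in the top `fraction` of scored observations."""
    # Step 1: order predictions from highest to lowest score
    ranked = sorted(zip(y_scores, y_true), reverse=True)
    # Step 2: count the positives captured up to the cutoff
    n_top = int(round(fraction * len(ranked)))
    captured = sum(label for _, label in ranked[:n_top])
    return captured / sum(y_true)


# Toy data (hypothetical): 6 clean and 4 fraudulent transactions
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_scores_toy = [0.05, 0.1, 0.2, 0.35, 0.55, 0.7, 0.3, 0.6, 0.8, 0.95]

for fraction in (0.2, 0.5, 1.0):
    gain = cumulative_gain(y_true_toy, y_scores_toy, fraction)
    print(f"top {fraction:.0%} of predictions -> {gain:.0%} of positives")
```

A random ordering would capture roughly the same share of positives as the share of data inspected, which is the diagonal baseline on the chart.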

How to compute:

import matplotlib.pyplot as plt
from scikitplot.metrics import plot_cumulative_gain

fig, ax = plt.subplots()
plot_cumulative_gain(y_true, y_pred, ax=ax)

How does it look:

We can see that our cumulative gains chart shoots up very quickly as we increase the sample of highest-scored predictions. By the time we get to the 20th percentile over 90% of positive cases are covered. You could use this chart to prioritize and filter out possible fraudulent transactions for processing.

Say we were to use our model to assign possible fraudulent transactions for processing and we needed to prioritize. We could use this chart to tell us where it makes the most sense to choose a cutoff.

When to use it:

Whenever you want to select the most promising customers or transactions to target and you want to use your model for sorting.
It can be a good addition to ROC AUC score which measures ranking/sorting performance of your model.


22. Lift curve | lift chart

It is pretty much just a different representation of the cumulative gains chart:

we order the predictions from highest to lowest
for every percentile, we calculate the fraction of true positive observations up to that percentile for our model and for the random model,
we calculate the ratio of those fractions and plot it.

It tells you how much better your model is than a random model for the given percentile of top scored predictions.
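A minimal sketch of that ratio: the share of positives captured by the model in the top fraction, divided by the share a random ordering would be expected to capture (which is simply the fraction itself). The function name and toy data are hypothetical.

```python
def lift(y_true, y_scores, fraction):
    """Model gain divided by the gain of a random model at the same cutoff."""
    ranked = sorted(zip(y_scores, y_true), reverse=True)
    n_top = int(round(fraction * len(ranked)))
    model_gain = sum(label for _, label in ranked[:n_top]) / sum(y_true)
    random_gain = fraction  # a random ordering finds positives at the base rate
    return model_gain / random_gain


# Toy data (hypothetical): 6 clean and 4 fraudulent transactions
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_scores_toy = [0.05, 0.1, 0.2, 0.35, 0.55, 0.7, 0.3, 0.6, 0.8, 0.95]

for fraction in (0.2, 0.5, 1.0):
    print(f"top {fraction:.0%}: lift = {lift(y_true_toy, y_scores_toy, fraction):.2f}")
```

At 100% of the data the lift is always 1, since both the model and the random ordering have seen every positive by then.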

How to compute:

import matplotlib.pyplot as plt
from scikitplot.metrics import plot_lift_curve

fig, ax = plt.subplots()
plot_lift_curve(y_true, y_pred, ax=ax)

How does it look:

So for the top 10% of predictions, our model is over 10x better than random, for the top 20% it is over 4x better and so on.

When to use it:

Whenever you want to select the most promising customers or transactions to target and you want to use your model for sorting.
It can be a good addition to ROC AUC score which measures ranking/sorting performance of your model.


23. Kolmogorov-Smirnov plot

KS plot helps to assess the separation between prediction distributions for positive and negative classes.

In order to create it you:

sort your observations by the prediction score,
for every cutoff point [0.0, 1.0] of the sorted dataset (depth) calculate the proportion of true positives and true negatives in this depth,
plot those fractions, positive(depth)/positive(all), negative(depth)/negative(all), on Y-axis and dataset depth on X-axis.

So it works similarly to cumulative gains chart but instead of just looking at positive class it looks at the separation between positive and negative class.

A good explanation of KS plot and KS statistic can be found in this article by Riaz Khan.

How to compute:

import matplotlib.pyplot as plt
from scikitplot.metrics import plot_ks_statistic

fig, ax = plt.subplots()
plot_ks_statistic(y_true, y_pred, ax=ax)

How does it look:

So we can see that the largest difference is at a cutoff point of 0.034 of top predictions. After that threshold, the separation decreases at a moderate rate as we increase the percentage of top predictions. Around 0.8 it starts getting worse really fast. So even though the best separation is at 0.034, we could potentially push the cutoff a bit higher to get more positively classified observations.


24. Kolmogorov-Smirnov statistic

If we want to take the KS plot and get one number that we can use as a metric, we can look at all thresholds (dataset cutoffs) from the KS plot and find the one for which the distance (separation) between the distributions of truly positive and truly negative observations is the largest. That maximum distance is the KS statistic.

If there is a threshold for which all observations above are truly positive and all observations below are truly negative we get a perfect KS statistic of 1.0.
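As a quick sanity check of that claim, here is a small NumPy sketch that computes the statistic as the largest gap between the empirical score distributions of the two classes. The function name `ks_statistic` is my own, the labels are assumed to be coded as 0/1, and this is an illustration rather than the code scikit-plot runs internally.

```python
import numpy as np


def ks_statistic(y_true, y_score):
    """Largest gap between the score CDFs of the positive and negative class.

    Hypothetical helper for illustration; assumes labels coded as 0/1.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)

    positive_scores = np.sort(y_score[y_true == 1])
    negative_scores = np.sort(y_score[y_true == 0])

    # Evaluate both empirical CDFs at every distinct score value.
    thresholds = np.unique(y_score)
    positive_cdf = np.searchsorted(positive_scores, thresholds, side="right") / len(positive_scores)
    negative_cdf = np.searchsorted(negative_scores, thresholds, side="right") / len(negative_scores)

    return float(np.max(np.abs(positive_cdf - negative_cdf)))


# Perfect separation: every negative scores below every positive.
perfect = ks_statistic([0, 0, 0, 1, 1], [0.1, 0.2, 0.3, 0.8, 0.9])

# Interleaved scores: the two classes overlap, so the separation is weaker.
overlapping = ks_statistic([0, 1, 0, 1], [0.1, 0.2, 0.3, 0.4])

print(perfect, overlapping)
```

If you have SciPy installed, `scipy.stats.ks_2samp` applied to the positive and negative score arrays should give the same statistic.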

How to compute:

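from scikitplot.helpers import binary_ks_curve

# y_pred_pos: predicted probability of the positive class
res = binary_ks_curve(y_true, y_pred_pos)
# the fourth element of the returned tuple is the KS statistic
ks_stat = res[3]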

How models score in this metric:

By using the KS statistic as the metric, we were able to rank BIN-101 as the best model, which is the one we expect to be “truly” best.

When to use it:

When your problem is about sorting/prioritizing the most relevant observations and you care equally about positive and negative classes.
It can be a good addition to the ROC AUC score, which measures the ranking/sorting performance of your model.


Final Thoughts

In this blog post, you’ve learned about various classification metrics and performance charts.

We went over metric definitions and interpretations, learned how to calculate them, and talked about when to use them.

Hopefully, with all that knowledge you will be fully equipped to deal with metric-related problems in your future projects.

Bonus

To help you use the information from this blog post to the fullest, I have prepared:

a logging helper function that calculates and logs all the metrics, performance charts, and metric-by-threshold charts,
a binary classification metrics cheatsheet with everything I talked about digested into a few pages.

Check those out below!

Logging helper function

If you want to log all of the metrics and performance charts that we covered for your machine learning project with just one function call and explore them in Neptune, follow these steps:

install the package:

pip install neptune-contrib[all]

import and run:

import neptunecontrib.monitoring.metrics as npt_metrics

npt_metrics.log_binary_classification_metrics(y_true, y_pred)

explore everything in the app:

Binary classification metrics cheatsheet

We’ve created a cheatsheet that takes all the content I went over in this blog post and puts it into a digestible, few-page document that you can print and use whenever you need anything related to binary classification metrics.

Download binary classification metrics cheatsheet

Example script

import lightgbm
import matplotlib.pyplot as plt
import neptune
from neptunecontrib.monitoring.utils import pickle_and_send_artifact
from neptunecontrib.monitoring.metrics import log_binary_classification_metrics
from neptunecontrib.versioning.data import log_data_version
import pandas as pd

plt.rcParams.update({'font.size': 18})
plt.rcParams.update({'figure.figsize': [16, 12]})
plt.style.use('seaborn-whitegrid')

# Define parameters
PROJECT_NAME = 'neptune-ml/binary-classification-metrics'

TRAIN_PATH = 'data/train.csv'
TEST_PATH = 'data/test.csv'
NROWS = None

MODEL_PARAMS = {'random_state': 1234,
                'learning_rate': 0.1,
                'n_estimators': 1500}

# Load data
train = pd.read_csv(TRAIN_PATH, nrows=NROWS)
test = pd.read_csv(TEST_PATH, nrows=NROWS)

feature_names = [col for col in train.columns if col not in ['isFraud']]

X_train, y_train = train[feature_names], train['isFraud']
X_test, y_test = test[feature_names], test['isFraud']

# Start experiment
neptune.init(PROJECT_NAME)
neptune.create_experiment(name='lightGBM training',
                          params=MODEL_PARAMS,
                          upload_source_files=['train.py', 'environment.yaml'])
log_data_version(TRAIN_PATH, prefix='train_')
log_data_version(TEST_PATH, prefix='test_')

# Train model
model = lightgbm.LGBMClassifier(**MODEL_PARAMS)
model.fit(X_train, y_train)

# Evaluate model
y_test_pred = model.predict_proba(X_test)

log_binary_classification_metrics(y_test, y_test_pred)
pickle_and_send_artifact((y_test, y_test_pred), 'test_predictions.pkl')

neptune.stop()