
Machine Learning Model Performance Evaluation

In Supervised Learning, we divide our problem into two broad Types.

  1. Regression
  2. Classification

You can read more about them in this blog.

Both problems have their own performance evaluation parameters/criteria.

Why and When?

We built our model and made some predictions with it, and now we want to know how well the model did. So we perform a performance evaluation for our model. This helps us understand its performance and decide whether we need to make changes to the model or everything is OK.

Performance Evaluation for Classification Problem

Important Terminologies

When it comes to performance evaluation, these terms are very important to understand.

For the sake of understanding, suppose we have a binary classification problem with two classes SPAM or HAM (Not Spam).

Now, we have to mark one as the positive class and the other as the negative class. For this problem, let's say SPAM is the positive class and HAM is the negative class.

For imbalanced classification problems, the majority class is typically referred to as the negative outcome (e.g. such as “no change” or “negative test result“), and the minority class is typically referred to as the positive outcome (e.g. “change” or “positive test result”).

True Vs. False

Whether it's a True Positive or a True Negative, it means the class predicted by our machine learning model is the same as the actual class. Similarly, whether it's a False Positive or a False Negative, it means the class predicted by our machine learning model is not the same as (is the opposite of) the actual class.

1. True Positive

True indicates that the class predicted by our model is the same as the actual class. Moreover, Positive means that instances/examples of the positive class are also marked positive by our model.

Example:

The email was SPAM (positive), and our machine learning model also predicted SPAM (positive). The prediction matches the actual class, and SPAM is the positive class as we decided earlier. So, this is nothing but a True Positive.

2. False Positive

False indicates that the class predicted by our model is the opposite of the actual class. Positive means that our model marked a negative-class instance as positive.

Example:

The Email was HAM (Negative), but our model predicted it as SPAM (Positive).

3. True Negative

True indicates that the class predicted by our model is the same as the actual class. Moreover, Negative means that instances/examples of the negative class are also marked negative by our model.

Example:

The email was HAM (negative), and our machine learning model also predicted HAM (negative). The prediction matches the actual class, and HAM is the negative class as we decided earlier. So, this is nothing but a True Negative.

4. False Negative

False indicates that the class predicted by our model is the opposite of the actual class. Negative means that our model marked a positive-class instance as negative.

Example:

The Email was SPAM (Positive), but our model predicted it as HAM (Negative).
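
To make these four outcomes concrete, here is a minimal sketch in plain Python (the actual and predicted labels are made up for illustration) that counts TP, FP, TN, and FN by comparing the two lists:

```python
# Hypothetical actual and predicted labels for 10 emails.
# SPAM is the positive class, HAM is the negative class.
actual    = ["SPAM", "SPAM", "HAM", "SPAM", "HAM", "HAM", "SPAM", "HAM", "SPAM", "HAM"]
predicted = ["SPAM", "HAM",  "HAM", "SPAM", "SPAM", "HAM", "SPAM", "HAM", "SPAM", "HAM"]

tp = sum(1 for a, p in zip(actual, predicted) if a == "SPAM" and p == "SPAM")  # True Positive
fp = sum(1 for a, p in zip(actual, predicted) if a == "HAM" and p == "SPAM")   # False Positive
tn = sum(1 for a, p in zip(actual, predicted) if a == "HAM" and p == "HAM")    # True Negative
fn = sum(1 for a, p in zip(actual, predicted) if a == "SPAM" and p == "HAM")   # False Negative

print(tp, fp, tn, fn)  # 4 1 4 1
```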

Confusion Matrix

A confusion matrix is used to assess the performance of a machine learning classification model. It is represented in matrix form.

Confusion Matrix gives a comparison between Actual and predicted values.

The confusion matrix is an N x N matrix, where N is the number of classes or outputs.

For 2 classes, we get a 2 x 2 confusion matrix.

For 3 classes, we get a 3 x 3 confusion matrix.

Analogy

If you are familiar with the Pandas DataFrame, you can think of the actual values (classes) as the indices and the predicted values (classes) as the column names.

We can say that we have the actual values (classes) on the Y-axis and the predicted values (classes) on the X-axis.

For Binary Classification Problem

[Image: 2 x 2 confusion matrix for the SPAM/HAM binary classification problem]

For Muti-class Classification Problem

The basics remain the same as in the binary classification problem. The only difference is that the number of rows and columns equals the number of classes, which is more than 2 for a multi-class classification problem.

Let's say the dataset has 3 flowers as outputs or classes: Versicolor, Virginica, and Setosa.

[Image: 3 x 3 confusion matrix for the Setosa/Versicolor/Virginica classification problem]

All the values along the diagonal are correct predictions, while the values off the diagonal are errors.

For class Setosa, we have a total of 16 instances and all of them are marked Setosa correctly.

For class Versicolor, we have a total of 18 instances, out of which 17 are classified correctly as Versicolor while one is misclassified as Virginica.

For class Virginica, we have a total of 11 instances, and all 11 are classified correctly as Virginica.
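
A confusion matrix like this can be computed with scikit-learn's `confusion_matrix`. The sketch below uses a small set of made-up labels rather than the counts from the figure above:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for a 3-class problem.
y_true = ["Setosa", "Setosa", "Versicolor", "Versicolor", "Virginica", "Virginica"]
y_pred = ["Setosa", "Setosa", "Versicolor", "Virginica",  "Virginica", "Virginica"]

labels = ["Setosa", "Versicolor", "Virginica"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# Rows are actual classes, columns are predicted classes:
# [[2 0 0]
#  [0 1 1]
#  [0 0 2]]
```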

Accuracy

Accuracy is one of the most common classification metrics. It is also the easiest to understand.

Accuracy in a classification problem is the number of correct predictions made by the model divided by the total number of predictions.

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

For example:

If we have a total of 100 emails and our model correctly predicted 80 of them as SPAM or HAM, then we get 80/100 = 0.8 = 80% accuracy.
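
As a minimal sketch, the same calculation can be done with scikit-learn's `accuracy_score` (the labels below are made up so that 8 out of 10 predictions are correct):

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: 8 out of 10 predictions are correct.
y_true = ["SPAM"] * 5 + ["HAM"] * 5
y_pred = ["SPAM", "SPAM", "SPAM", "SPAM", "HAM",   # 4 correct, 1 wrong
          "HAM", "HAM", "HAM", "HAM", "SPAM"]      # 4 correct, 1 wrong

print(accuracy_score(y_true, y_pred))  # 0.8, i.e. 80% accuracy
```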

When to use Accuracy or when not?

Accuracy is really useful when the target classes are well balanced.
For example: in a cat-or-dog prediction dataset, if we have roughly the same number of cat images as dog images, the dataset is balanced, so we can use accuracy for model evaluation.

Accuracy is not a good choice for performance evaluation on an imbalanced dataset, i.e. one in which the classes are unevenly represented.

For example:

Imagine we had 99 SPAM emails and 1 HAM email out of 100. If our model always predicts SPAM, we would get 99% accuracy. However, this would not be a good evaluation of our machine learning model's performance.
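
Here is a small sketch of that trap, using made-up labels: a model that always predicts SPAM scores 99% accuracy on a 99-to-1 dataset even though it never detects a HAM email.

```python
from sklearn.metrics import accuracy_score

# 99 SPAM emails and 1 HAM email (hypothetical, imbalanced dataset).
y_true = ["SPAM"] * 99 + ["HAM"]
y_pred = ["SPAM"] * 100  # a useless model that always predicts SPAM

print(accuracy_score(y_true, y_pred))  # 0.99 -> looks great, but HAM is never detected
```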

For an imbalanced dataset, we use recall, precision, and F1 score.

Recall

Recall is a metric that quantifies the number of correct positive predictions made out of all positive predictions that could have been made.

Recall is based on actual values.

Recall = \frac{TP}{TP + FN}

The result is a value between 0.0 for no recall and 1.0 for full or perfect recall. Recall should ideally be 1 (high) for a good classifier. Recall becomes 1 only when the numerator and denominator are equal, i.e. TP = TP + FN, which also means FN is zero. As FN increases, the denominator becomes greater than the numerator and the recall value decreases (which we don't want).
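
As a sketch, recall can be computed by hand from TP and FN or with scikit-learn's `recall_score`; the labels below are made up for illustration:

```python
from sklearn.metrics import recall_score

y_true = ["SPAM", "SPAM", "SPAM", "HAM", "HAM"]
y_pred = ["SPAM", "SPAM", "HAM",  "HAM", "SPAM"]

# TP = 2 (SPAM correctly predicted), FN = 1 (SPAM predicted as HAM)
print(recall_score(y_true, y_pred, pos_label="SPAM"))  # 2 / (2 + 1) = 0.666...
```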

Precision

Precision is nothing but, out of the total positive predictions made by our model, how many are actually positive.

Precision is based on predicted values.

Precision = \frac{TP}{TP + FP}

Precision should ideally be 1 (high) for a good classifier. Precision becomes 1 only when the numerator and denominator are equal, i.e. TP = TP + FP, which also means FP is zero. As FP increases, the denominator becomes greater than the numerator and the precision value decreases (which we don't want).
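
Likewise, a minimal sketch with scikit-learn's `precision_score` on the same made-up labels:

```python
from sklearn.metrics import precision_score

y_true = ["SPAM", "SPAM", "SPAM", "HAM", "HAM"]
y_pred = ["SPAM", "SPAM", "HAM",  "HAM", "SPAM"]

# TP = 2 (SPAM correctly predicted), FP = 1 (a HAM email predicted as SPAM)
print(precision_score(y_true, y_pred, pos_label="SPAM"))  # 2 / (2 + 1) = 0.666...
```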

Precision Recall Trade Off

Maximizing precision will minimize the number of false positives, whereas maximizing recall will minimize the number of false negatives.

Precision: Appropriate when minimizing false positives is the focus.
Recall: Appropriate when minimizing false negatives is the focus.
Sometimes, we want excellent predictions of the positive class. We want high precision and high recall.

This can be challenging, as increases in recall often come at the expense of decreases in precision.
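
One way to see this trade-off is to move the decision threshold on a model's predicted scores. The sketch below uses made-up scores rather than a real model: raising the threshold increases precision but lowers recall.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical predicted probabilities of being SPAM (the positive class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]                  # 1 = SPAM, 0 = HAM
scores = [0.9, 0.8, 0.6, 0.4, 0.55, 0.3, 0.2, 0.1]

for threshold in (0.3, 0.5, 0.7):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")

# threshold=0.3: precision=0.67, recall=1.00
# threshold=0.5: precision=0.75, recall=0.75
# threshold=0.7: precision=1.00, recall=0.50
```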

Nevertheless, instead of picking one measure or the other, we can choose a new metric that combines both precision and recall into one score called F1-score.

F1-Score

F-Measure provides a way to combine both precision and recall into a single measure that captures both properties.

Alone, neither precision nor recall tells the whole story. We can have excellent precision with terrible recall, or alternately, terrible precision with excellent recall. F-measure provides a way to express both concerns with a single score.

Once precision and recall have been calculated for a binary or multiclass classification problem, the two scores can be combined into the calculation of the F-Measure.

The traditional F measure is calculated as follows:

F1 = 2 \times \frac{precision \times recall}{precision + recall}

This is the harmonic mean of the two fractions. This is sometimes called the F-Score or the F1-Score and might be the most common metric used on imbalanced classification problems.

Like precision and recall, a poor F-Measure score is 0.0 and a best or perfect F-Measure score is 1.0.

Why do we use the harmonic mean instead of a simple average?

We use the harmonic mean instead of a simple average because it punishes extreme values.

For instance a classifier with precision of 1 and recall of 0 has a simple average of 0.5 but an F1 score of 0.
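
This punishing effect is easy to check numerically. The sketch below (with made-up precision/recall pairs) compares the simple average with the harmonic mean:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

for p, r in [(1.0, 0.0), (0.5, 0.5), (0.9, 0.1)]:
    simple_average = (p + r) / 2
    print(f"precision={p}, recall={r}: average={simple_average:.2f}, F1={f1(p, r):.2f}")

# precision=1.0, recall=0.0: average=0.50, F1=0.00
# precision=0.5, recall=0.5: average=0.50, F1=0.50
# precision=0.9, recall=0.1: average=0.50, F1=0.18
```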

An interpretation of the F1-score

The F1 score will be lower when there is a large difference between the precision and recall values, and vice versa: the closer they are, the larger the F1 score will be.

Performance Evaluation for Regression Problem

Why can't we use recall or accuracy for evaluating performance on a regression problem?

These sorts of metrics are not useful for regression problems; we need metrics designed for continuous values.

Common evaluation metrics for regression

  1. Mean Absolute Error
  2. Mean Squared Error
  3. Root Mean Square Error

These are explained briefly in this blog.
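
For completeness, here is a minimal sketch of how these three metrics can be computed with scikit-learn and NumPy (the values are made up):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted continuous values.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)   # Mean Absolute Error
mse = mean_squared_error(y_true, y_pred)    # Mean Squared Error
rmse = np.sqrt(mse)                         # Root Mean Square Error

print(mae, mse, rmse)  # 0.75 0.875 0.935...
```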

References

  1. https://www.analyticsvidhya.com/blog/2021/06/confusion-matrix-for-multi-class-classification/#:~:text=Confusion%20Matrix%20is%20used%20to,number%20of%20classes%20or%20outputs.
  2. https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/
  3. https://www.youtube.com/watch?v=2osIZ-dSPGE
  4. https://medium.com/analytics-vidhya/confusion-matrix-accuracy-precision-recall-f1-score-ade299cf63cd#:~:text=F1%20Score%20becomes%201%20only%20when%20precision%20and%20recall%20are%20both%201.
