"F1-score" is one of the main metrics that have always been suggested to evaluate the result of any imbalance classification model. But if you had tried to use it as your key metric, may be faced with different variations of this metric... f1-score, f-score weighted, f-score macro, f-score micro, f-score binary, and f-score class-wise!!!
So, when choose which? Or which helps when?
Introduction
If you have any experience in classification modeling with an imbalanced dataset, one of the metrics most often suggested by the experts is the famous "f1-score".
Imbalanced datasets are those with an asymmetric proportion of items belonging to the different classes. In my project, I wrangled a 90-10 imbalanced dataset of adverts written by "Realtors" and "People". My main goal was to find the classification model that misclassifies the fewest items.
Problem
In imbalanced datasets, accuracy, the most common metric for classification problems, does not describe the model well. Why?
If I define a dummy model that labels every item as "Realtor", this dummy model reaches 90 percent accuracy. It looks like there is no need to spend much time and effort developing a better model!
On the other hand, the model has not seen the same proportion of each class, so it cannot learn them equally; it is more likely to learn the majority class than the minority one. Yet accuracy gives both classes the same importance when evaluating the model.
In these cases, the f1-score is a much better metric for assessing the model's performance.
Confusion Matrix
In any classification problem, the first and most useful step toward valuable insight into the model is to calculate the value of each cell in the confusion matrix.
The confusion matrix clearly shows how many items the proposed model classified correctly or incorrectly.
In this matrix, there is one row and one column for each class, so in the binary case there are four main cells that categorize the model's results.
If I suppose that the positive label belongs to the "Realtor" class and the negative label to the "People" class:
- TP (true-positive): number of items classified as "Realtor" whose true label is "Realtor"
- TN (true-negative): number of items classified as "People" whose true label is "People"
- FP (false-positive): number of items classified as "Realtor" whose true label is "People"
- FN (false-negative): number of items classified as "People" whose true label is "Realtor"
Clearly, the denser the main diagonal, the more reliable the model.
So, if I develop the model that predicts the most TPs and TNs among all the candidate models, then I have done my job :)
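As a sketch, such a confusion matrix can be computed with scikit-learn; the labels and predictions below are made up purely for illustration:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for a toy "Realtor" vs "People" dataset.
y_true = ["Realtor", "Realtor", "Realtor", "Realtor", "People", "People"]
y_pred = ["Realtor", "Realtor", "Realtor", "People", "Realtor", "People"]

# Putting "Realtor" first makes it the positive class:
# rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred, labels=["Realtor", "People"])
tp, fn = cm[0, 0], cm[0, 1]  # true "Realtor" row
fp, tn = cm[1, 0], cm[1, 1]  # true "People" row
print(cm)  # TP=3, FN=1, FP=1, TN=1
```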
Well, there are well-known metrics defined over the confusion matrix. The two most important ones that can help me are:
- Recall = TP / (TP + FN)
- Precision = TP / (TP + FP)
So, if I manage to increase both metrics at the same time, then I will get my best model.
The "F1-score" metric is the one that does this for me!
F1-score Variations in Azure ML
In the previous section, you can clearly see that both "Precision" and "Recall" lie between 0 and 1. At the same time, it is important that our evaluation accounts for the errors caused by the imbalanced dataset. Taking the harmonic mean of Precision and Recall serves both purposes: the F1-score is the harmonic mean of Recall and Precision. For more information about how the harmonic mean helps in this case, you can take a look at this article.
I work with the Azure Machine Learning service, which reports many variations of the f1-score metric. At first it can be quite confusing to choose the right one, but once you know what each metric means, it even becomes helpful to consider more than one of them when evaluating the model's performance.
An example of Azure ML metrics:
F1-score as a harmonic average:
F1-score = 2 * (precision * recall) / (precision + recall)
Class-wise F1-score:
In the Azure ML service, the class-wise f1-score is shown as a dictionary with the f1-score of each class. In binary classification, each value is calculated from the formula above. For multiclass problems, a One-vs-Rest scheme is used to calculate the f1-score of each class.
A sample class-wise f1-score for a binary classification problem: {'True': 0.80, 'False': 0.70}
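A sketch of how such a per-class dictionary could be reproduced with scikit-learn's `f1_score`; the data here is invented, and `average=None` returns one score per class:

```python
from sklearn.metrics import f1_score

# Made-up labels and predictions for illustration.
y_true = ["Realtor", "Realtor", "Realtor", "Realtor", "People", "People"]
y_pred = ["Realtor", "Realtor", "Realtor", "People", "Realtor", "People"]

# average=None returns the f1-score of each class, in the order of `labels`.
per_class = f1_score(y_true, y_pred, average=None, labels=["Realtor", "People"])
class_wise = dict(zip(["Realtor", "People"], per_class))
print(class_wise)  # {'Realtor': 0.75, 'People': 0.5}
```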
Macro F1-score:
As its name suggests, the f1-scores of all classes are taken into account to calculate the macro f1-score.
This metric assumes that all classes have the same weight, so each of them participates as an equally weighted part of the calculation.
For example, if the f-score of the "Realtor" class is 0.80 and the f-score of the "People" class is 0.70, the f-score macro of this model is:
Macro f1-score = (0.80 + 0.70) / 2 = 0.75
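That calculation is just an unweighted mean of the per-class scores:

```python
# Per-class f1-scores from the example in the text.
class_f1 = {"Realtor": 0.80, "People": 0.70}

# Macro f1: every class counts equally, regardless of its size.
macro_f1 = sum(class_f1.values()) / len(class_f1)
print(macro_f1)  # 0.75
```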
Micro F1-score:
For the micro f1-score, we sum up the micro-level elements needed to calculate this metric.
That is, it can be calculated once we have the total TP, FN, and FP over all classes. To get the total TP, we sum the TPs of every class, and we do the same for the FNs and FPs. Then we calculate the micro f1-score from the total TP, total FN, and total FP.
So, again the name of this metric reveals that it builds the overall score from the micro-level TP, FN, and FP counts.
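A minimal sketch of that summation, using hypothetical per-class counts:

```python
# Hypothetical per-class counts (class -> (TP, FP, FN)), for illustration only.
counts = {
    "Realtor": (3, 1, 1),
    "People": (1, 1, 1),
}

# Sum the micro-level elements over all classes first...
total_tp = sum(tp for tp, fp, fn in counts.values())
total_fp = sum(fp for tp, fp, fn in counts.values())
total_fn = sum(fn for tp, fp, fn in counts.values())

# ...then compute precision, recall, and f1 from the totals.
micro_precision = total_tp / (total_tp + total_fp)
micro_recall = total_tp / (total_tp + total_fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)
print(micro_f1)
```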
Weighted F1-score:
As I mentioned earlier, my dataset is imbalanced. Clearly, if the class proportions are imbalanced, we need a technique that accounts for those proportions in the calculation.
In the weighted f1-score, the f1-score of each class is weighted by that class's share of the data (its support), so larger classes contribute more and the score reflects performance on the dataset as it actually is.
In my example, the weighted f1-score will be calculated in this way:
Weighted f1-score = 0.90 * 0.80 + 0.10 * 0.70 = 0.79
Clearly, if I had a balanced dataset, the macro f1-score and the weighted f1-score would have the same value.
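The weighted calculation can be sketched in a few lines; the 90/10 proportions and per-class scores come from the running example in the text:

```python
# Class proportions (support) and per-class f1-scores from the running example.
support = {"Realtor": 0.90, "People": 0.10}
class_f1 = {"Realtor": 0.80, "People": 0.70}

# Weighted f1: each class contributes in proportion to its share of the data.
weighted_f1 = sum(support[c] * class_f1[c] for c in class_f1)
print(weighted_f1)
```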
Binary F1-score:
It is the f1-score of the positive class in a binary classification problem.
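A sketch with scikit-learn, where `pos_label` selects which class counts as positive; the data is invented for illustration:

```python
from sklearn.metrics import f1_score

# Made-up labels and predictions for illustration.
y_true = ["Realtor", "Realtor", "Realtor", "Realtor", "People", "People"]
y_pred = ["Realtor", "Realtor", "Realtor", "People", "Realtor", "People"]

# The default average="binary" reports the f1-score of the positive class only.
binary_f1 = f1_score(y_true, y_pred, pos_label="Realtor")
print(binary_f1)  # 0.75
```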
To Sum up...
To sum up this article, the f1-score is one of the most useful metrics, and it really helped me evaluate the results of my experiments... I always consider all the metrics above to assess the validity of the model and the effect of the imbalanced dataset on my modeling. I hope it helps you reach a better evaluation of your own machine learning models.