Imbalanced data is data in which one class or category has far more samples than all the others. It is a problem that arises in classification, one of the most significant tasks in supervised machine learning. Classification means predicting the labels of the samples in a dataset (a collection of data with features and labels). It comes in two types: binary classification and multi-class classification. In binary classification the samples are divided into two categories, known as classes in machine learning terms. A simple example of binary classification is identifying whether an image shows a dog or a cat. An example of multi-class classification is identifying digits from images, where there are 10 classes, one for each digit.
Imbalanced data is common in real-world scenarios. In e-mail spam detection, very few e-mails are spam; in cancer detection, very few cases are cancerous. For such datasets, accuracy (the ratio of correctly predicted samples to the total number of samples) is a misleading measure of performance. Suppose a dataset contains 1000 e-mails, of which 10 are spam. A model that predicts every e-mail to be non-spam gets 990 of the 1000 right, so its accuracy is 99%, even though it never detects a single spam e-mail. Accuracy is therefore not a good way to assess a model on this dataset, and we need other performance measures for imbalanced data.
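The arithmetic behind this example can be checked with a few lines of Python. This is only a sketch: the labels (1 for spam, 0 for non-spam) and the always-non-spam "model" are made up for illustration.

```python
# Hypothetical inbox from the example: 1000 e-mails, 10 of them spam.
# Label 1 = spam (positive class), label 0 = non-spam.
y_true = [1] * 10 + [0] * 990

# A "model" that ignores its input and always predicts non-spam.
y_pred = [0] * len(y_true)

# Accuracy: correctly predicted samples divided by all samples.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Number of spam e-mails the model actually caught.
spam_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"accuracy    = {accuracy:.2%}")
print(f"spam caught = {spam_caught} of 10")
```

The accuracy comes out high because 990 non-spam e-mails dominate the count, while the number of spam e-mails caught is zero.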
Some of the performance measures for models trained on imbalanced datasets are:
- Precision
- Recall
- Specificity
- F1-score
- Geometric Mean
- Index Balanced Accuracy
To understand these performance measures we first need to understand some terms: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
A True Positive is a positive sample that is correctly predicted as positive, and a True Negative is a negative sample that is correctly predicted as negative. If the actual class is positive but the predicted class is negative, it is a False Negative. If the actual class is negative but the predicted class is positive, it is a False Positive. These four cases are summarized in the table below.
| |Predicted Positive|Predicted Negative|
|---|---|---|
|Actual Positive|True Positive (TP)|False Negative (FN)|
|Actual Negative|False Positive (FP)|True Negative (TN)|
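The four cells of the table can be counted directly from a list of actual labels and a list of predicted labels. The sketch below uses a hypothetical helper, `confusion_counts`, and made-up labels; it is not part of any library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN), treating `positive` as the positive label."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1  # actual positive, predicted positive
            else:
                fn += 1  # actual positive, predicted negative
        else:
            if predicted == positive:
                fp += 1  # actual negative, predicted positive
            else:
                tn += 1  # actual negative, predicted negative
    return tp, fn, fp, tn


# Made-up labels for illustration: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")  # TP=2 FN=1 FP=1 TN=4
```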
Also, an image describing the TP, TN, FP, and FN is given below
Precision is the ratio of correctly predicted positive samples to the total number of samples predicted as positive. The formula for precision is
Precision = TP/(TP+FP)
Recall is also known as Sensitivity or True Positive Rate. It is the ratio of correctly predicted positive samples to the total number of actual positive samples. The formula for recall is
Recall = TP/(TP+FN)
Specificity is also known as True Negative Rate. It is the ratio of correctly predicted negative samples to the total number of actual negative samples. The formula for specificity is
Specificity = TN/(TN+FP)
F1-score is the harmonic mean of precision and recall. The formula for F1-score is
F1-score = 2*Precision*Recall/(Precision+Recall)
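The four formulas above translate almost directly into code. In this sketch the counts are hypothetical (a spam filter evaluated on 1000 e-mails, 10 of them spam), and returning 0.0 when a denominator is zero is a convention chosen here for illustration, not something the formulas prescribe.

```python
def precision(tp, fp):
    # Correct positive predictions over all positive predictions.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    # Correct positive predictions over all actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    # Correct negative predictions over all actual negatives.
    return tn / (tn + fp) if (tn + fp) else 0.0


def f1_score(prec, rec):
    # Harmonic mean of precision and recall.
    return 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0


# Hypothetical counts: 10 spam e-mails (8 caught, 2 missed) and
# 990 non-spam e-mails (12 wrongly flagged, 978 correctly passed).
tp, fn, fp, tn = 8, 2, 12, 978

prec = precision(tp, fp)
rec = recall(tp, fn)
spec = specificity(tn, fp)
f1 = f1_score(prec, rec)

print(f"precision   = {prec:.3f}")
print(f"recall      = {rec:.3f}")
print(f"specificity = {spec:.3f}")
print(f"f1-score    = {f1:.3f}")
```

With these counts the accuracy would be 98.6%, yet the precision shows that fewer than half of the e-mails flagged as spam really are spam.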
Geometric mean is the square root of the product of Recall and Specificity. The formula for Geometric Mean is
Geometric Mean = √(Recall*Specificity), or equivalently
Geometric Mean = √(True Positive Rate * True Negative Rate)
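A short sketch of the geometric mean shows why it suits imbalanced data: because the two rates are multiplied, a model that fails completely on one class scores zero. The function name `geometric_mean` and the input values are illustrative only.

```python
import math


def geometric_mean(recall, specificity):
    # Square root of the product of the two per-class rates.
    return math.sqrt(recall * specificity)


# A model that always predicts the negative class:
# recall is 0 and specificity is 1, so the geometric mean collapses to 0.
always_negative = geometric_mean(0.0, 1.0)

# A model that does reasonably well on both classes.
balanced = geometric_mean(0.8, 0.9)

print(f"always-negative model: {always_negative:.4f}")
print(f"balanced model:        {balanced:.4f}")
```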
Index Balanced Accuracy (IBA) is a more recent metric for measuring performance. First, Dominance is calculated, which is the difference between the True Positive Rate and the True Negative Rate. The formula is
Dominance = Recall - Specificity
Dominance is then weighted by a factor α. The formula for Index Balanced Accuracy (IBA) is
IBA = (1 + α*Dominance)*(GMean²)
In simplified terms it is
IBA = (1 + α*(Recall-Specificity))*(Recall*Specificity)
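The effect of the dominance term is easiest to see by comparing two models that have the same product of recall and specificity. The sketch below implements the simplified formula as written; the function name `index_balanced_accuracy` and the two sets of rates are made up for illustration.

```python
def index_balanced_accuracy(recall, specificity, alpha=0.1):
    # Dominance: how much better the model is on positives than on negatives.
    dominance = recall - specificity
    # Squared geometric mean, scaled by the weighted dominance.
    return (1 + alpha * dominance) * (recall * specificity)


# Two hypothetical models with the same Recall*Specificity (0.54).
favours_positives = index_balanced_accuracy(recall=0.9, specificity=0.6)
favours_negatives = index_balanced_accuracy(recall=0.6, specificity=0.9)

print(f"higher recall:      {favours_positives:.4f}")
print(f"higher specificity: {favours_negatives:.4f}")

# With alpha = 0 the dominance term vanishes and IBA is just GMean squared.
print(f"alpha = 0:          {index_balanced_accuracy(0.9, 0.6, alpha=0):.4f}")
```

For a positive α, the model with the higher recall receives the higher score, so when the positive class is the rare one, IBA rewards the model that does better on it.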
The imbalanced-learn library for Python provides these metrics for measuring performance on imbalanced classes. Its metrics module can be imported as follows:
from imblearn import metrics
An example of the code for finding different measures is
y_true = [1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2]
y_pred = [1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]
target_names = ['class 1', 'class 2']
result = metrics.classification_report_imbalanced(y_true, y_pred, target_names=target_names)
print(result)
In the printed report, pre is precision, rec is recall, spe is specificity, f1 is F1-score, geo is geometric mean, iba is index balanced accuracy, and sup is support (the number of actual samples of each class). The default value of α for IBA is 0.1.
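As a rough cross-check that does not depend on the library, the same quantities can be computed by hand from the formulas above. The sketch assumes each class is scored one-vs-rest (the class in question is treated as positive and the other as negative) and that no denominator is zero, which holds for this data. The helper `one_vs_rest_metrics` is hypothetical, and its results are not guaranteed to match the library's report digit for digit.

```python
import math

y_true = [1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2]
y_pred = [1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]


def one_vs_rest_metrics(y_true, y_pred, positive, alpha=0.1):
    """Compute the metrics with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    spe = tn / (tn + fp)
    return {
        "pre": pre,
        "rec": rec,
        "spe": spe,
        "f1": 2 * pre * rec / (pre + rec),
        "geo": math.sqrt(rec * spe),
        "iba": (1 + alpha * (rec - spe)) * rec * spe,
        "sup": tp + fn,  # number of actual samples of this class
    }


for label in (1, 2):
    scores = one_vs_rest_metrics(y_true, y_pred, positive=label)
    print(f"class {label}:", {k: round(v, 4) for k, v in scores.items()})
```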