The task of classification existed long before the invention of machine learning. A common problem when working with any learning algorithm is choosing an error function that tells us whether the algorithm is good enough, and classification algorithms are no different.
One of the most widely used metrics for these algorithms is accuracy: given the total number of samples and the predictions made, we return the percentage of samples that were correctly classified. But this metric does not always work well. Imagine that we have 1000 samples and an algorithm called DummyAlgorithm that tries to classify them into two classes (A and B). Unfortunately, DummyAlgorithm knows nothing about the data distribution; as a result, it always predicts that a given sample is of class A. Now imagine that all the samples really are of class A (you might see where I'm going). In this case, it is easy to see that even though DummyAlgorithm achieves 100% accuracy, it is not a good algorithm: it has learned nothing, and it would misclassify every sample of class B it ever encountered.
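The scenario above can be sketched in a few lines of plain Python. The names `dummy_algorithm` and `accuracy` are illustrative, not from any particular library:

```python
def dummy_algorithm(sample):
    """Ignores the sample entirely and always predicts class 'A'."""
    return "A"

def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# 1000 samples, every single one of class "A"
y_true = ["A"] * 1000
y_pred = [dummy_algorithm(s) for s in y_true]

print(accuracy(y_true, y_pred))  # 1.0 -- perfect accuracy, useless model
```

The moment a single sample of class B appears, the same "perfect" model gets it wrong, which is exactly why we need metrics beyond accuracy.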
In this post, we'll learn how to complement the accuracy metric with other strategies that do take the problem described above into account, and we'll see a method to avoid it.
Before going any further, let's define some basic concepts.
Accuracy: the percentage of correctly classified samples in a dataset
True Positives (TP): samples that were correctly classified as their positive class
True Negatives (TN): samples that were correctly classified as their negative class
False Positives (FP): samples that were classified as positive but were actually negative
False Negatives (FN): samples that were classified as negative but were actually positive
Precision: accuracy of the positive predictions, TP / (TP + FP)
Recall: ratio of positive instances that are correctly classified, TP / (TP + FN)
Note: when we talk about positives/negatives, we are always talking about a specific class
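The definitions above can be made concrete with a short, self-contained sketch: count TP/FP/FN/TN for one chosen positive class, then derive precision and recall. The function names are illustrative:

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (TP, FP, FN, TN), treating `positive` as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

y_true = ["A", "A", "B", "B", "A"]
y_pred = ["A", "B", "B", "A", "A"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive="A")
print(precision(tp, fp), recall(tp, fn))  # 0.666... 0.666...
```

Note how the note above plays out in code: which samples count as TP or FP depends entirely on which class we pass as `positive`.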
The confusion matrix creates a cell for each of the four possible categorizations, and it generalizes to multiclass classification as well. In the following example we are performing a binary classification that separates red dots from dots of other colors.
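As a small illustration of the red-vs-other example, here is how the matrix could be computed with scikit-learn (assuming it is installed; the labels and predictions are made up for this sketch):

```python
from sklearn.metrics import confusion_matrix

y_true = ["red", "red", "other", "other", "red", "other"]
y_pred = ["red", "other", "other", "red", "red", "other"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=["red", "other"])
print(cm)
```

With `labels=["red", "other"]`, `cm[0, 0]` holds the true positives for "red", `cm[0, 1]` the false negatives, `cm[1, 0]` the false positives, and `cm[1, 1]` the true negatives.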
As with other metrics, you have to decide whether you want the classifier to favor precision or recall. Sometimes you care more about precision than recall. For example, if you wish to detect safe-for-work posts on a social network, you would probably prefer a classifier that rejects many harmless posts (low recall) but keeps only safe ones (high precision). On the other hand, suppose you train a classifier to detect shoplifters: it is probably better for the classifier to have the highest recall possible (the security system will raise some false alerts, but almost all shoplifters will get caught).
Based on this tradeoff we can define a curve called the precision/recall curve.
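A minimal sketch of computing this curve with scikit-learn (assuming it is installed); the labels and decision scores below are toy values, not from a real model:

```python
from sklearn.metrics import precision_recall_curve

y_true = [0, 0, 1, 1]            # ground-truth labels
scores = [0.1, 0.4, 0.35, 0.8]   # classifier decision scores

# Each threshold on the scores yields one (precision, recall) point;
# sweeping the threshold traces out the precision/recall curve.
precisions, recalls, thresholds = precision_recall_curve(y_true, scores)
print(precisions, recalls, thresholds)
```

Plotting `recalls` against `precisions` gives the curve itself; the last point is always precision 1 at recall 0 by convention.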
The ROC curve (receiver operating characteristic curve) is a very common tool used with binary classifiers. It is very similar to the precision/recall curve, but it plots the true positive rate against the false positive rate. One way to compare classifiers is to measure the area under the curve (AUC): a perfect classifier has an AUC equal to 1, while a purely random classifier has a ROC AUC of 0.5.
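The two extreme AUC values mentioned above can be reproduced with scikit-learn (assuming it is installed) on toy scores: a perfect ranking places every positive above every negative, while identical scores carry no information at all.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
perfect = [0.1, 0.2, 0.8, 0.9]   # every positive scored above every negative
no_skill = [0.5, 0.5, 0.5, 0.5]  # identical scores: no discrimination

print(roc_auc_score(y_true, perfect))   # 1.0
print(roc_auc_score(y_true, no_skill))  # 0.5
```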
As the ROC curve and the precision/recall curve are very similar, it might be difficult to choose between them. A common approach is to use the precision/recall curve whenever the positive class is rare and when you care more about the false positives than the false negatives, and the ROC curve otherwise.
The accuracy problem essentially happens when the data the model is tested on is imbalanced. There are several approaches to solve this issue:
- If you have a lot of training data, you can discard some of it to obtain a more balanced dataset. Since your model might generalize worse with less data, this approach should be reserved for special cases.
- Use a data augmentation technique to increase the amount of available data.
- Use a resampling technique, in which you enlarge the training set by repeating existing samples (oversampling the minority class); this is useful when data augmentation is too complicated.
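The resampling idea can be sketched in pure Python as naive random oversampling: duplicate minority-class samples until every class matches the largest one. The function name is illustrative; libraries such as imbalanced-learn offer more refined variants (e.g. SMOTE):

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples (with replacement) until every
    class has as many samples as the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        idx = [i for i, label in enumerate(y) if label == cls]
        for _ in range(target - n):
            i = rng.choice(idx)       # pick an existing minority sample
            X_out.append(X[i])        # and repeat it
            y_out.append(y[i])
    return X_out, y_out

X = list(range(10))
y = ["A"] * 8 + ["B"] * 2
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # Counter({'A': 8, 'B': 8})
```

Note that only the training split should be resampled; the test set must keep its original distribution so the evaluation stays honest.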