In every machine learning application you may have found that every machine learning algorithm you try has its flaws. As the no free lunch theorem states there is no single learning algorithm in any domain always induces the most accurate learner. Learning is ill-posed since we do not have all the data and each algorithm converges for a specific subset of the original data. What if there was a way to combine the knowledge of different algorithms?(Spoiler: there is!)
The idea is to have multiple base learners, each an expert in a subset of the dataset and accurate in it. This way each learner complements each other. There are two aspects we must consider:
- We must choose base-learners that complement each other (otherwise we would have unnecessary redundancy). A synonym to this is that they must not be correlated.
- Each learner must try to have the highest accuracy, we should also find a way to combine the output of each learner to achieve it.
There are multiple ways in which we can diversify each learning algorithm. For instance, we can use different algorithms, we can also try similar algorithms but with different hyperparameters or each algorithm could be trained with a different data subset.
Voting is the most simple technique in which the output of multiple classifiers is combined using a linear combination (i.e. sum, product, median, weighted sum, maximum, minimum). The main idea behind this method is the we create a voting scheme over high variance and low bias models, thus bias remains small and variance is reduced. An analogy to this is when we have multiple noisy samples of the same data, if we combine them it is much easier to sort out the noise.
This is variance of the voting method. The difference is that each base-learner is trained on different training sets. To generate each set we draw instances randomly with replacement. By replacing the instances it is possible that many instances are repeated and that some are not used at all.
This type of mixer works by using simple learners that actively learn on the mistake of the previous learners, the most commonly and effective use of boosting is called AdaBoost. For AdaBoost to work, the learners must be very weak, otherwise we have a high risk of overfitting. AdaBoost combines the votes based on weights proportional to the accuracy of each base-learner in the training set.
Stacking is mechanism in which multiple base-learners are trained using the train dataset. After that, a new training data is created by voting the predictions of each base-learner. Finally, another learner (called the mixer) learns from the predictions of the other algorithms.
This is based on a sequence of classifiers, in which each learner is more complex than the latter, and thus generalize less. We first make a prediction with a simple learner. This learner tries to generalize as well as we can, if the confidence is not sufficient, we continue with the next learner. We continue making predictions until the prediction confidence of a learner is high enough. A common approach to this method is to use a weak algorithm such as linear regression as the first learner and a non-parametric approach for subsequent learners. The idea behind this method is that most samples can be explained with simple rules, and that few exceptions are explained with more complex learners.
Note: Combining learners is not always the best approach. It increases the speed and it has an execution overhead so you must be sure if it is worth the effort.