Decision Trees Splitting Algorithms
When building a decision tree, we need to split the data effectively at each node. In the Sklearn library this is controlled by the criterion parameter: it defines how split quality is measured, and at each node the tree chooses the split that most reduces that impurity measure.
Let's see what common splitting functions are used in Sklearn. For classification trees, we can use Gini impurity or entropy (information gain). Regression trees, on the other hand, use reduction of variance, where a node's impurity is the average of the squared residuals around the node mean.
I have avoided mathematical formulas in favor of an intuitive explanation of the concepts. You can find the equations here.
Gini Index
To get a better perspective on how the Gini index works, consider where the term comes from: in economics it describes economic inequality by measuring the income distribution. The average income alone gives no information about how income varies across the population, so the Gini index captures that spread. We can apply the same approach to our binary classification problems, using the Gini index to measure how mixed the classes are on each side of a candidate split.
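To make the intuition concrete, here is a small sketch of how Gini impurity could be computed for a node (the function name gini_impurity is my own, for illustration, not a library API):

```python
import numpy as np

def gini_impurity(labels):
    """Probability that two samples drawn at random from the node
    (with replacement) belong to different classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # class proportions in the node
    return 1.0 - np.sum(p ** 2)

# A perfectly pure node scores 0; a 50/50 binary node scores 0.5,
# the worst case for two classes.
print(gini_impurity([1, 1, 1, 1]))  # 0.0
print(gini_impurity([0, 1, 0, 1]))  # 0.5
```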
Entropy or Information Gain
Entropy uses a logarithm in its formula and, essentially, measures how homogeneous or organized the system is after the split. In practice it behaves very similarly to the Gini index.
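A sketch of the same idea with entropy, again with a helper name of my own choosing:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of the class distribution in a node, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # class proportions in the node
    return -np.sum(p * np.log2(p))

# Like Gini impurity, entropy is 0 for a pure node, and is maximal
# for an evenly mixed one: a 50/50 binary node gives exactly 1 bit.
print(entropy([0, 1, 0, 1]))  # 1.0
```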
Reduction of Variance
This approach is similar to linear regression, where we calculate the sum of squared residuals.
So, for regression trees, at each node after the split we take the mean of the target values of the data points in that node. Then, for each data point, we compute the residual: the difference between the true value and that mean. Finally, we average the squared residuals.
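The steps above can be sketched as follows (the helper names are mine, for illustration):

```python
import numpy as np

def node_variance(y):
    """Mean of the squared residuals around the node mean --
    the quantity a regression tree tries to reduce."""
    y = np.asarray(y, dtype=float)
    residuals = y - y.mean()           # difference from the node mean
    return np.mean(residuals ** 2)     # average of squared residuals

def split_score(y_left, y_right):
    """Sample-weighted variance of the two child nodes;
    the best split is the one that minimizes this."""
    n = len(y_left) + len(y_right)
    return (len(y_left) * node_variance(y_left)
            + len(y_right) * node_variance(y_right)) / n

# Splitting [1, 2, 10, 11] between 2 and 10 leaves two tight groups:
print(node_variance([1, 2, 10, 11]))   # 20.5 before the split
print(split_score([1, 2], [10, 11]))   # 0.25 after, a large reduction
```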
In the next post, I will show the application of each method in Python and Sklearn.