What is a decision tree?
A decision tree is a model used across a wide array of industries that visually maps out the if-else flow of a particular scenario. As an example, suppose you want to plan out the next day's schedule. A decision tree is well suited to visualising the possible outcomes:
The decision tree starts by asking a single question: "Will it rain tomorrow?". Depending on the answer, we move on to the next corresponding question, and we repeat this process until we reach the end. To introduce some jargon, each question/decision is called a node (i.e. the rounded rectangles in the diagram). This decision tree contains 5 nodes. The starting node (i.e. the node with "Will it rain tomorrow?") is called the root node. The nodes at the very end of the tree are called leaf nodes. In this example, we have 3 leaf nodes.
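Because a decision tree is just an if-else flow, the same idea can be expressed directly in code. The sketch below mirrors a 5-node tree with 3 leaves; the root question comes from the example above, while the follow-up question and the outcomes are hypothetical placeholders (the actual diagram may differ):

```python
def plan_tomorrow(will_rain: bool, have_umbrella: bool) -> str:
    """A tiny decision tree with 5 nodes: a root node, one internal
    node, and 3 leaf nodes (the return statements)."""
    if will_rain:                          # root node: "Will it rain tomorrow?"
        if have_umbrella:                  # hypothetical follow-up question
            return "run errands in town"   # leaf node
        return "stay home and read"        # leaf node
    return "go hiking"                     # leaf node


print(plan_tomorrow(will_rain=True, have_umbrella=False))  # stay home and read
```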
Decision trees in machine learning
In the context of machine learning, decision trees are a suite of tree-based models used for classification and regression. Just like other machine learning algorithms, decision trees are constructed from data.
Transparency
What makes decision trees stand out is their transparency in how results are reached. Many machine learning models, such as neural networks, are notorious for being black boxes, that is, they cannot easily justify why they returned a particular output. Decision trees do not share this caveat; we can easily visualise the reasoning behind their output, as we will demonstrate later in this tutorial. This makes them extremely appealing for applications such as market segmentation, where we want to differentiate the traits of customers.
Simple numerical example
To keep things concrete, suppose we wanted to predict whether or not students would pass their exam given their gender and group. The following is our dataset, which we will use as the training data:
| gender | group | is_pass |
| ------ | ----- | ------- |
| male   | A     | true    |
| male   | B     | true    |
| female | A     | false   |
| male   | A     | false   |
| female | C     | true    |
| male   | B     | false   |
| female | C     | true    |
Our training data contains a total of 7 observations and 2 categorical features: gender and group. gender is a binary categorical variable, whereas group is a multi-class categorical variable with three distinct values (A, B and C). The last column, is_pass, is the target label, that is, the value that we want to predict.
Our goal is to build a decision tree using this training data and predict whether or not students would pass the exam given their gender and group.
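If you would like to follow along in code, one way to represent this training data is as a pandas DataFrame (using pandas here is simply a convenience assumption; any tabular representation works):

```python
import pandas as pd

# The 7 training observations from the table above.
df = pd.DataFrame({
    "gender":  ["male", "male", "female", "male", "female", "male", "female"],
    "group":   ["A", "B", "A", "A", "C", "B", "C"],
    "is_pass": [True, True, False, False, True, False, True],
})

print(df)
```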
- Listing all possible binary splits. Our first sub-goal is to determine which binary split we should use for the root node. Since decision trees typically use binary splits, the four possible binary splits for this problem are as follows (see the sketch after the list):
male vs female
group A or not A
group B or not B
group C or not C
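As a rough sketch, these candidate splits can also be enumerated programmatically. Each split is treated as a (feature, value) one-vs-rest partition, and for the binary gender feature we only need one of its two values. The DataFrame from the earlier snippet is rebuilt here so the sketch runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "gender":  ["male", "male", "female", "male", "female", "male", "female"],
    "group":   ["A", "B", "A", "A", "C", "B", "C"],
    "is_pass": [True, True, False, False, True, False, True],
})

candidate_splits = []
for feature in ["gender", "group"]:
    values = sorted(df[feature].unique())
    # For a binary feature, "male vs female" and "female vs male"
    # describe the same split, so keep only one of the two values.
    if len(values) == 2:
        values = values[:1]
    for value in values:
        candidate_splits.append((feature, value))

print(candidate_splits)
# [('gender', 'female'), ('group', 'A'), ('group', 'B'), ('group', 'C')]
```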
The key question here is: which of the above four provides the best split? To answer this, we first need to understand what pure and impure subsets are.
- Building a frequency table. To keep things concrete, we first focus on the male vs female split. The following is our original dataset with only the gender and is_pass columns kept, sorted by gender:
| gender | is_pass |
| ------ | ------- |
| male   | true    |
| male   | true    |
| male   | false   |
| male   | false   |
| female | false   |
| female | true    |
| female | true    |
We have 4 records with gender=male, and 3 with gender=female. Out of the 4 male records, 2 have is_pass=true and 2 have is_pass=false; out of the 3 female records, 2 have is_pass=true and 1 has is_pass=false. We can easily summarise these counts using a frequency table like so:
| gender | is_pass (true:false) |
| ------ | -------------------- |
| male   | 2:2                  |
| female | 2:1                  |
Here, the ratio 2:2 simply means that, among the male records, there are 2 with is_pass=true and 2 with is_pass=false. In this case, neither partition is absolute: we have both is_pass=true and is_pass=false in each partition. We call such partitions impure subsets. On the other hand, pure subsets are partitions where the target class is completely one-sided, that is, the ratio contains a 0 (e.g. 5:0 or 0:7).
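Assuming the pandas DataFrame from the earlier sketch, the same frequency table can be reproduced with a cross-tabulation (only the columns needed here are recreated below):

```python
import pandas as pd

df = pd.DataFrame({
    "gender":  ["male", "male", "female", "male", "female", "male", "female"],
    "is_pass": [True, True, False, False, True, False, True],
})

# Count is_pass outcomes within each gender partition.
freq = pd.crosstab(df["gender"], df["is_pass"])
print(freq)
# is_pass  False  True
# female       1     2
# male         2     2
```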
Intuition behind pure and impure subsets
We can measure how good a candidate a feature is for the root node by focusing on its impurity. We want to minimise impurity, that is, we prefer pure subsets over impure subsets. To illustrate this point, suppose for a moment that the gender split were as follows:
| gender | is_pass (true:false) |
| ------ | -------------------- |
| male   | 0:3                  |
| female | 2:0                  |
We now have two pure subsets, making the gender split ideal. In simple terms, this means that all male students failed the exam, while all female students passed. Intuitively, such a split is extremely useful for predicting whether a student would pass the exam: if the student is male, we predict a fail, and if the student is female, we predict a pass.
In this case, the perfect decision tree would be as follows:
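In code, this perfect tree amounts to a single if-else on gender (a sketch of the hypothetical pure split above, not of our actual data):

```python
def predict_is_pass(gender: str) -> bool:
    # Root node: split on gender (hypothetical pure split).
    if gender == "male":
        return False  # leaf node: all male students failed
    return True       # leaf node: all female students passed


print(predict_is_pass("female"))  # True
```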
Just as a comparison, suppose the ratio table for the gender split was instead as follows:
| gender | is_pass (true:false) |
| ------ | -------------------- |
| male   | 2:2                  |
| female | 3:3                  |
This time, we have two impure subsets. Can you see how this is inferior to the case when we had two pure subsets? Judging by the ratios alone, the gender split appears to have little impact on whether the student passes or fails the exam.
From this comparison between pure and impure subsets, we can intuitively understand that we want to choose the feature with the least impurity for the root node; pure subsets are to be preferred.
- Computing the Gini Impurity for each split. With the ratio tables for our candidate splits in place, we must decide which split to use for the root node. There are numerous criteria for making this decision, but a commonly used one is the Gini impurity. The Gini impurity is a numerical value, computed for each subset produced by a split (e.g. the male and female subsets), that measures how mixed the target labels within that subset are: the lower the Gini impurity, the purer the subset.
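As a quick preview before walking through the computation, here is a minimal sketch using the standard Gini impurity formula, 1 - Σ pₖ², applied to the subsets from the gender frequency table above; the helper function is illustrative, not part of any library:

```python
def gini(counts):
    """Gini impurity of a subset, given its class counts, e.g. (2, 2).

    Uses the standard formula: 1 - sum(p_k ** 2) over the classes.
    """
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Class counts (is_pass true:false) from the gender frequency table.
print(gini((2, 2)))  # male subset   -> 0.5
print(gini((2, 1)))  # female subset -> 0.444...
print(gini((0, 3)))  # a pure subset -> 0.0 (no impurity)
```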