Last week we explored Decision Trees, and I said that the algorithm chooses the split of data that is most advantageous as it makes each branch in the tree. But how does the algorithm know which split is the most advantageous? In other words, what criterion does the algorithm use to determine which split will help it answer a question about the data, as opposed to other possible splits?
Entropy is this criterion.
Entropy measures randomness. But that might sound a little abstract if you aren't used to thinking about randomness, so instead let's shift our attention to something concrete: socks.
1st Situation: Imagine you have a laundry basket filled with 50 red socks and 50 blue socks. What is the likelihood that you will pull a red sock out of the basket? There is a 50/50 chance that it will be a red sock, so the chance is 0.5. When there is an equal chance that you will get a red sock or a blue sock, we say that the information is very random - there is high entropy in this situation.
2nd Situation: Now you divide those socks into two different boxes. You put the 50 red socks into a box, labeled Box 1, and the 50 blue socks into another box, labeled Box 2. The likelihood that you will pull a red sock out of Box 1 is 1.0 and the likelihood that you will pull a red sock out of Box 2 is 0. Similarly, the likelihood that you'll pull a blue sock out of Box 2 is 1.0 and the likelihood that you will pull a red sock out of that same box is 0. There is no randomness here, because you know whether you will get a red sock or a blue sock. In this situation, there is no entropy.
3rd Situation: But let's suppose that we split up the basket of socks another way - 50 socks in Box 1 and 50 socks in Box 2, but we kept the distribution of the socks the same - so there are 25 red socks in Box 1 and 25 blue socks. And in Box 2 there are also 25 red socks and 25 blue socks. In this situation, although we have divided the total amount of socks in half, the entropy is the same as the first situation, because the information of whether you will grab a red sock or a blue sock is just as unpredictable as the first example.
So we can say that entropy measures the unpredictability or randomness of the information within a given distribution.
Just so you know, this concept is also called impurity. In the 2nd situation we reduced the entropy, which gives us low impurity. In the 1st and 3rd examples we had high impurity.
A Decision Tree is tasked with reducing the impurity of the data when it is making a new branch in the tree. With those situations above, a better split of the data would be the kind seen in Situation 2, where we separated the blue socks from the red socks. A bad split would be Situation 3, because we didn't make it any easier for us to guess whether a sock would be red or blue.
Why do we want to reduce the randomness/impurity? Remember, the goal of a Decision Tree is to predict which class a given sample belongs to, by looking at all the features of that sample. So we want to reduce the unpredictability as much as possible, and then make an accurate prediction.
This is called information gain. We are measuring how much information a given feature gives us about the class that a sample belongs to.
- We have a sample
- We know the value of a feature of that sample (for example, in Box 1 or Box 2)
- Does knowing that feature reduce the randomness of predicting which class that sample belongs to?
- If it does, than we have reduced the entropy and achieved information gain
The term information gain is more or less self-explanatory, but in case it is less for you: it is telling us how much we learned from a feature. We want to find the most informative feature — the feature that we learn the most from. And when a feature is informative we say we have gained information from it, hence information gain.
I'm going to do my best to break down the mathy part now, for calculating entropy and information gain.
Here is some notation for probabilities:
- True (i.e. a sample belongs to a certain class): 1
- False (i.e. a sample doesn't belong to a certain class): 0
- A given sample: X
- Probability (likelihood) that given sample X is of a certain class (True): p(X = 1)
- Probability that X is not of a certain class (False): p(X = 0)
Notation for entropy:
- H(X): entropy of X
- Note: when* p*(X=1)=0 or p(X=0)=0 there is no entropy
Notation for** information gain**:
IG(X,a) = H(X)−H(X|a)
- IG(X,a): information gain for X when we know an attribute a
- IG(X, a) is defined as the entropy of X, minus the conditional entropy of X given a
- The idea behind information gain is that it is the entropy for a given attribute a that we lose *if we know the value of *a
Specific conditional entropy: for any actual value a0 of the attribute a we calculate:
- So let's imagine there are several different possible attribute values for a that are numbered as follows: a0, a1, a2,...an
- We want to calculate the specific conditional entropy for each of those values...
Conditional entropy: for all possible values of a
- Entropy of X given a
- We add together all the values for specific conditional entropy of each attribute a0 through an...
- ...that's what ∑aip(a=ai) stands for
- ∑ means "sum"
- It is conditional *because it is the entropy that depends on an attribute *a
So here are the steps you can take with those equations to actually find the conditional entropy for a given attribute:
- Calculate each probability for your first possible value of a:
- Calculate the specific conditional entropy for the probability found in Step 1
- Multiply the values from step 1 and step 2 together
- Do that for each possible value for a
- Add together all those specific conditional entropy values you found for a
- ...the sum reached in Step 5 is your conditional entropy
Then to calculate information gain for an attribute a:
- Subtract the conditional entropy from the entropy
- At this point you have found the value for your information gain with respect to that attribute a
What you have now discovered is how random your data really is, if you know the value of a given attribute a (i.e. a given feature). Information gain lets us know how much knowing a reduces the unpredictability.
And this is what the decision tree algorithm does. It uses the entropy to figure out which is the most informative question to ask the data.
This awesome Twitter thread by Tivadar Danka gives a more detailed breakdown into the mathematical concept of entropy, and why there are all those logarithms in the equation. And while we're on the subject, I highly recommend that you follow him on Twitter for great threads on math, probability theory, and machine learning.
If you want a refresher on what logarithms are, check out this very no-nonsense explanation from Math Is Fun. I also highly recommend visiting this cite before Wikipedia for all of your math basics needs. It is written with a younger audience in mind, which is perfect if your brain isn't used to all of the fancy math lingo and symbols yet.
In case you're wondering, my understanding of entropy was drawn from the book Doing Data Science by Rachel Schutt and Cathy O'Neil and the Udacity Machine Learning course lectures.
I hope you enjoyed this Machine Learning Log, and please share any thoughts, questions, concerns, or witty replies in the comments section. Thank you for reading!