Binary Classification Problem: Random Forest & OneHot Encoder

#beginners #python #datascience #flatiron

For more projects, check out my GitHub: https://github.com/bmor2552

With all the uncertainty in the air right now, picking up a new project is one of the best things one can do to kill idle time at home. So, why not learn how to solve a binary classification problem?

This week for our Module 3 Project here at Flatiron School, I was able to team up with my fellow 02-17-2020 Data Science cohort member Taki Yasuoka (GitHub link: https://github.com/Tyasuoka) to figure out how to predict whether or not a customer would discontinue their services with a telecommunication company. Below you will find the GitHub link to this project.

https://github.com/Tyasuoka/Module_3_Project/tree/master

The Data

The data was provided by the Kaggle member david_becks and consisted of records for customers who churned and customers who stayed with a telecommunication company. The columns contained information on the features customers used, from area codes to evening minutes. The link below will lead you to it.

https://www.kaggle.com/becksddf/churn-in-telecoms-dataset

The Process

First things first: clean, explore, and transform the data!! This data set had no missing values. We did find some object (text) columns that would interfere with the model of our choice, so we dropped the ones we didn't need and transformed the ones we did need.
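As a rough illustration of that cleaning step, here is a minimal pandas sketch. The tiny DataFrame, its column names, and the choice of which columns to drop are assumptions made up for the example; the real notebook in the project repo may do this differently.

```python
import pandas as pd

# Tiny stand-in for the churn data; rows and column names are illustrative assumptions.
df = pd.DataFrame({
    "state": ["KS", "OH", "NJ", "OH"],
    "phone number": ["382-4657", "371-7191", "358-1921", "375-9999"],
    "international plan": ["no", "no", "yes", "yes"],
    "total eve minutes": [197.4, 195.5, 121.2, 61.9],
    "churn": [False, False, False, True],
})

# 1. Count missing values in every column.
missing_per_column = df.isna().sum()

# 2. Find the non-numeric (text) columns a model cannot use directly.
text_columns = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

# 3. Drop the text columns we do not need and transform the one we keep.
cleaned = df.drop(columns=["state", "phone number"])
cleaned["international plan"] = cleaned["international plan"].map({"no": 0, "yes": 1})
cleaned["churn"] = cleaned["churn"].astype(int)

print(missing_per_column)
print(text_columns)
print(cleaned)
```

After this, every remaining column is numeric, which is what scikit-learn models expect.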

Another thing we noticed about the data was that some columns looked as if they would interact with one another, e.g. evening minutes and evening charges. The churn column was also imbalanced; there were fewer churns than there were non-churns!
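Both observations are quick to check in pandas. The sketch below uses synthetic data, since the point is only to show the two checks; the column names, the 15% churn share, and the per-minute rate are assumptions for illustration, not values taken from the Kaggle data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
eve_minutes = rng.normal(200, 50, n)

# Synthetic stand-in: the charge is a fixed (assumed) rate times the minutes.
df = pd.DataFrame({
    "total eve minutes": eve_minutes,
    "total eve charge": eve_minutes * 0.085,
    "churn": (rng.random(n) < 0.15).astype(int),
})

# Share of each class in the target: a lopsided split signals imbalance.
class_share = df["churn"].value_counts(normalize=True)

# Pearson correlation between the two columns: close to 1 means they carry the same information.
corr = df["total eve minutes"].corr(df["total eve charge"])

print(class_share)
print(round(corr, 3))
```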

This would later become an issue, so we decided to use the random forest model to help deal with it. To measure how our model was performing we went with ROC_AUC (how well our model ranks churners above non-churners across all probability thresholds) and Accuracy Score (the share of predictions that are correct).
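To make the two scores concrete, here is a small hand-checkable sketch using scikit-learn's accuracy_score and roc_auc_score. The six labels and probabilities are invented for the example.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Invented example: 1 = churn, 0 = non-churn.
y_true = [0, 0, 0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.6, 0.3, 0.7, 0.4]  # predicted probability of churn

# Accuracy needs hard labels, so we cut the probabilities at 0.5.
y_pred = [int(p >= 0.5) for p in y_prob]

# Accuracy: 4 of the 6 predictions match the true labels.
accuracy = accuracy_score(y_true, y_pred)

# ROC AUC: of the 8 (churner, non-churner) pairs, 7 are ranked correctly.
auc = roc_auc_score(y_true, y_prob)

print(accuracy, auc)
```

Note that accuracy is computed from the hard predictions, while ROC AUC is computed from the probabilities, so the two can disagree about which model is better.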

The Model
The random forest model is a group of decision trees, THE END.
Just kidding, let's start with what a decision tree is by using our data as an example.

A decision tree model in our case will split its predictions into churns and non-churns. Think of it like sorting apples and oranges, or sorting change. The decision tree is classifying/sorting its predictions! The only issue with using this model is that our non-churns outweigh our churns, so the sorting of our predictions will be weighted heavily toward the non-churn side.

We don't want a model to tell us everyone will keep their services and not tell us who will disconnect their services! We want a model that can predict both who is a loyal customer and who needs some extra attention before it's too late!!
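Here is a small sketch of why that matters, using scikit-learn's DummyClassifier as a stand-in for a model that always says "non-churn". The 85/15 split is an assumption for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Invented target: 85 non-churns (0) and 15 churns (1).
y = np.array([0] * 85 + [1] * 15)
X = np.zeros((100, 1))  # features are irrelevant for this baseline

# A "model" that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

accuracy = accuracy_score(y, pred)    # looks good: 85 of 100 correct
churn_recall = recall_score(y, pred)  # but it finds none of the 15 churners

print(accuracy, churn_recall)
```

A high accuracy score on imbalanced data can hide a model that never flags a single churner, which is one reason to look at a second metric as well.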

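To help with the imbalance of our data we can use, you guessed it, more trees! This is where the random forest came in!! Random forest uses the bagging method (each tree is trained on a random sample of the rows, drawn with replacement) and a random selection of columns at each split to build its collection of decision trees. This reduces the impact of multicollinearity (aka columns interacting with one another), because correlated columns are not competing in every split, and averaging many trees makes the predictions more stable than those of a single tree.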

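Side Note: Since random forest uses random sampling, every tree sees a slightly different mix of churns and non-churns, so no single unlucky sample dominates the predictions. Random sampling on its own keeps roughly the same churn ratio, though; if the imbalance is still a problem, scikit-learn's RandomForestClassifier has a class_weight="balanced" option that gives the rarer class more weight.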
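A minimal sketch of a random forest on imbalanced data with scikit-learn follows. The data is synthetic (make_classification with an assumed 85/15 class split), and the hyperparameters, including class_weight="balanced", are illustrative choices rather than the settings used in the project notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 85% class 0 (non-churn), 15% class 1 (churn).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=42,
)

# stratify=y keeps the same churn ratio in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

forest = RandomForestClassifier(
    n_estimators=100,         # number of trees in the forest
    bootstrap=True,           # bagging: each tree gets a bootstrap sample of rows
    max_features="sqrt",      # random subset of columns considered at each split
    class_weight="balanced",  # weight the rarer class more heavily
    random_state=42,
)
forest.fit(X_train, y_train)

churn_prob = forest.predict_proba(X_test)[:, 1]
accuracy = accuracy_score(y_test, forest.predict(X_test))
auc = roc_auc_score(y_test, churn_prob)

print(f"accuracy={accuracy:.3f} roc_auc={auc:.3f}")
```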

The Results:

Here is the github link to the project; the notebooks include details on findings and the code to obtain those findings.

https://github.com/Tyasuoka/Module_3_Project/tree/master

Random Forest without OneHot Encoder

Accuracy Score: 0.936 aka about 94%
ROC_AUC Score: 0.909 aka about 91%

Random Forest with OneHot Encoder

Accuracy Score: 0.942 aka about 94% (but a higher 94%)
ROC_AUC Score: 0.934 aka about 93%

Side Note: Use OneHot encoder on a categorical column, ideally one that is distributed better than your target. In our case, area code is really a category that was well distributed between 3 numbers, so to assist the model's learning we applied OneHot encoder to the area code column, and the scores above went up after that change.

Conclusion

When you are dealing with a binary classification problem in the real world, keep in mind that if there is multicollinearity present and your data is imbalanced, a good model to consider would be the random forest. To assist the performance of the model, try a preprocessing step like OneHot encoding on your categorical columns.

Future Recommendations

If I were to do this project all over again without a time frame on the outcome, I would start with a less complicated model like Logistic Regression. To see if I can improve the performance of that model I would use feature engineering and ensemble methods. Then I would move on to a decision tree, and finally a random forest. From there I would compare the models and choose the one that best fits the business problem I am trying to solve.
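That comparison could be sketched with scikit-learn's cross_val_score, as below. The synthetic data, the three default-ish models, and the use of 5-fold ROC AUC are assumptions for the example; a real comparison would use the churn data and whichever metric matches the business problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the churn data.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)

# Simplest model first, then the more complex ones.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean ROC AUC over 5 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}

best_model = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("best:", best_model)
```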

References: Below are great sites to help break down everything I discussed in this blog.

Random Forest & Decision Trees

https://victorzhou.com/blog/intro-to-random-forests/

Accuracy Score

https://blog.floydhub.com/a-pirates-guide-to-accuracy-precision-recall-and-other-scores/

ROC_AUC Score

https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

OneHot Encoding

https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/

Hopefully this will motivate you to learn something new or even help you solve a different binary classification problem. Happy coding!!!