We were asked to develop a machine learning model to predict whether a customer of a telecommunications company will cancel their subscribed service in the near future.
This is the problem known as churn prediction: predicting which customers are going to stop using the company's services in the future.
The given input is a dataset of customers active during the last 12 months, one row per customer, with several columns including demographic and behavioral data, plus a final value indicating whether that customer left the company during that period.
This is a typical machine learning problem, best addressed as a classification problem using Logistic Regression, where the sigmoid function (or logistic function) outputs the conditional probability of the prediction as a number between 0 and 1.
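As a quick illustration (a minimal sketch, not the production model), the logistic function looks like this:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real value to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5, the decision boundary
print(sigmoid(4.0))   # close to 1: likely churn
print(sigmoid(-4.0))  # close to 0: likely stays
```

The output can then be thresholded (typically at 0.5) to decide between "will quit" and "will stay".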
The client is a telecommunications company in Chile with millions of customers. The dataset is an extract of 100,000 customers from Región Metropolitana, the area where Santiago de Chile is located. The columns included are:
- Id: internal customer identification number
- Nombre: customer name
- Sexo: gender (Female or Male)
- FNAC: date of birth
- Estudios: highest level of education reached
- Actividad: classification of current economic activity
- Comuna: location of customer's home (and location of wired contracted services). Since it is in Chile, it corresponds to the administrative division called "Comuna".
- Plan: plan subscribed to with the company
- recl_3: number of complaints received in the last 3 months
- recl_12: number of complaints received in the last 12 months
- Exited: 1 if the customer left during the last 12 months, 0 otherwise
The data shows a churn rate of 18.2%, i.e., 18,179 out of 100,000 rows have a '1' in the Exited column. The goal is to obtain a churn prediction using this dataset as training data for a machine learning program.
Each row of the dataset contains the columns that represent the features of one sample. Logistic regression algorithms need numeric features, so we must assign a number to every text feature. Also, algorithms like gradient descent converge faster when features have similar scales, so a good choice is to normalize the data to [0, 1]. Let's look at a sample from the dataset (omitting the real ID and last name):
10 columns are available for every sample. We will use the last one, Exited, as the label for the training process (it is also the value to be predicted for any customer). We decided to drop the first two features (the customer's ID and name) from the training model, since they are irrelevant.
Most features are text, and we need to convert them before running any model. This is not a minor challenge, and the chosen encoding is a key factor in the success of the churn prediction.
One option is to map each text value to an integer taken from a sequence. The LabelEncoder class from the scikit-learn preprocessing package does this automatically.
Using this code:
```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(dataset['comuna'])
dataset['comuna'] = encoder.transform(dataset['comuna'])
```
This transformation changes the comuna column to numbers. For example, applying it to the same dataset extract from above, we get:
El Bosque is now 0, La Granja is 1, Lampa is 2, and so on. If we apply the same transformation to every text feature (and keep only the year of birth instead of the whole date), every column becomes a number:
The problem with this encoding is that it introduces an arbitrary order into the data, so the model will understand, for example, that La Granja lies between El Bosque and Lampa, or that there is some close relationship between those locations. That is false, so a misleading relationship has been introduced.
Another option is to use OneHotEncoder, which maps a column with N possible values into N binary columns with values 0 or 1.
For example, the column Sexo contains 2 possible values ['F','M'], so OneHotEncoder would map it into two binary vectors.
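Conceptually, the mapping looks like this (a hand-rolled sketch, just to illustrate what OneHotEncoder produces):

```python
# One-hot mapping for the two values of Sexo:
# each value becomes its own binary column.
mapping = {"F": [1, 0], "M": [0, 1]}
column = ["F", "M", "M", "F"]

encoded = [mapping[value] for value in column]
print(encoded)  # [[1, 0], [0, 1], [0, 1], [1, 0]]
```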
Using this code:
```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

preprocess = make_column_transformer(
    (OneHotEncoder(), ['sexo']),
    remainder='passthrough'
)
dataset = preprocess.fit_transform(dataset)
```
The Sexo column has been replaced by two columns, one for each of the two possible values.
For the current problem, applying OneHotEncoder to every text column yields a final training dataset with approximately 87 columns.
We can apply another technique, using the fact that every feature takes its values from a known domain, meaning that we know in advance all possible values of each feature. For example, each value of Comuna is one of the 52 administrative divisions of Región Metropolitana (you can read about them on Wikipedia).
And so on, for every feature:
- Sexo: 2 possible values
- Estudios: 6 possible values
- Actividad: 8 possible values
- Comuna: 52 possible values
- Plan: 12 possible values
For each of those features, we can choose an ordering that reflects a useful, real relationship between the values, which can be used to obtain a better prediction.
For example, the Estudios column represents the highest level of education reached by the customer. It is convenient to order it from the lowest level to the highest, because a higher level could imply a steadier monthly income.
The Actividad column will be ordered from unemployed (Cesante) to professional with a job (Profesional Dependiente).
The Plan column will be ordered from the least expensive to the most expensive.
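The same index-based mapping used for Comuna below works for any of these features. A sketch for Estudios (the level names here are hypothetical; the real dataset's Spanish labels may differ):

```python
import pandas as pd

# Hypothetical education levels, ordered lowest to highest
# (the dataset's actual labels may differ).
estudios_order = ["Basica", "Media", "Tecnica",
                  "Universitaria", "Postgrado", "Doctorado"]

# Toy frame standing in for the real dataset.
df = pd.DataFrame({"estudios": ["Media", "Doctorado", "Basica"]})
df["estudios"] = df["estudios"].apply(lambda x: estudios_order.index(x))
print(df["estudios"].tolist())  # [1, 5, 0]
```

Each text value becomes its position in the chosen ordering, so the numeric distance between codes now carries meaning.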
The Comuna column could be ordered geographically, but it is also interesting to order it using a quality-of-life ranking, which could be more related to the probability of staying with the subscribed service when living in a location with better infrastructure and more jobs. Luckily, such a ranking exists and is published in ICVU 2018.
So, for example applying this code (where every Comuna is ordered according to ICVU 2018):
```python
# Comunas ordered according to the ICVU 2018 quality-of-life ranking.
glo_comuna = ["Providencia", "Las Condes", "Vitacura", "Lo Barnechea", "San Miguel",
              "La Reina", "Ñuñoa", "Santiago", "Macul", "Maipú", "La Florida",
              "Estación Central", "San Joaquín", "Quilicura", "Peñalolén", "Cerrillos",
              "La Granja", "Quinta Normal", "Lampa", "Huechuraba", "Independencia",
              "La Cisterna", "Lo Prado", "Pudahuel", "Padre Hurtado", "Puente Alto",
              "Conchalí", "Renca", "San Bernardo", "Recoleta", "Colina", "Lo Espejo",
              "El Bosque", "San Ramón", "Melipilla", "Cerro Navia", "Buin",
              "Pedro Aguirre Cerda", "Paine", "Peñaflor", "Talagante", "La Pintana"]

dataset['comuna'] = dataset['comuna'].apply(lambda x: glo_comuna.index(x))
```
Every Comuna has been mapped to its position in the ICVU ranking.
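The encoded integer columns can then be scaled to [0, 1] as discussed earlier. A minimal min-max sketch with numpy (the code values here are hypothetical):

```python
import numpy as np

# Hypothetical encoded comuna codes for five customers.
codes = np.array([0.0, 10.0, 21.0, 33.0, 41.0])

# Min-max normalization to [0, 1].
normalized = (codes - codes.min()) / (codes.max() - codes.min())
print(normalized.min(), normalized.max())  # 0.0 1.0
```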
Logistic Regression will try to split the multidimensional space into two parts: the customers that will probably leave the company, and those that will not.
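A minimal sketch of this classifier with scikit-learn, on synthetic stand-in data (the feature count matches our dataset, but the values and the churn rule are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 200 customers with 8 normalized features.
rng = np.random.default_rng(0)
X = rng.random((200, 8))
y = (X[:, 6] + X[:, 7] > 1.0).astype(int)  # toy churn rule on the complaint columns

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X[:3])[:, 1]  # churn probability for three customers
print(proba.shape)
```

Each probability lies in (0, 1); thresholding it at 0.5 gives the two-class split described above.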
With XGBoost the code is very simple:
```python
import xgboost as xgb

gbm = xgb.XGBClassifier(max_depth=16, n_estimators=25, learning_rate=0.01)
gbm.fit(train_x, train_y.values.ravel())
```
where train_x is the normalized dataset and train_y contains the Exited column.
- max_depth is the maximum tree depth of the base learners
- n_estimators is how many trees will be fit
- learning_rate shrinks the contribution of each new tree (the step size of the boosting process)
You can learn more in the XGBoost documentation.
The testing dataset contains another 1,000 customers. We apply the XGBoost classifier to that dataset and get the following results:
- Hits: % of correct churn predictions (the customer left and the model predicted it).
- Misses: % of missed churn predictions (the customer left, but the model did not predict it).
- False hits: % of churn predictions where the customer did not actually leave.
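These three rates can be computed from the predictions as follows (the arrays here are hypothetical, just to show the arithmetic):

```python
import numpy as np

# Hypothetical ground truth and predictions for six customers.
y_true = np.array([1, 1, 0, 0, 1, 0])  # the Exited column of the test set
y_pred = np.array([1, 0, 0, 1, 1, 0])  # classifier output

churners = y_true == 1
hits = np.sum((y_pred == 1) & churners) / churners.sum()            # correct churn predictions
misses = np.sum((y_pred == 0) & churners) / churners.sum()          # churners the model missed
false_hits = np.sum((y_pred == 1) & ~churners) / (~churners).sum()  # loyal customers flagged as churn

print(hits, misses, false_hits)
```

Note that hits and misses are fractions of the actual churners (162 in our test set), while false hits are a fraction of the customers who stayed (838).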
The classifier results are:
| | Hits | Misses | False hits | Time used in fit |
|---|---|---|---|---|
| LabelEncoder | 74% (120/162) | 25% (42/162) | 6% (57/838) | 28 sec |
| Custom Order | 76% (124/162) | 23% (38/162) | 6% (51/838) | 25 sec |
| OneHotEncoder | 93% (151/162) | 6% (11/162) | 1% (10/838) | 96 sec |
Using OneHotEncoder, the model correctly identifies 93% of the actual churners, which is a very good result, but it is a bit slow.
This regression tries to fit a linear function to the dataset and computes its cost using the logistic function. But a deeper analysis of the dataset may show that a higher polynomial order could work better. If we have 8 features to fit, instead of fitting a linear function like this:

z = θ0 + θ1·Sexo + θ2·FNAC + θ3·Estudios + θ4·Actividad + θ5·Comuna + θ6·Plan + θ7·recl_3 + θ8·recl_12
we could try to find a different formula, with squared or interaction terms, that indirectly adds weight to the features and binds concepts related to churn. Like this:

z = θ0 + θ1·Sexo + θ2·FNAC + θ3·Estudios + θ4·Actividad + θ5·Comuna + θ6·Plan + θ7·recl_3 + θ8·recl_12 + θ9·(Sexo·FNAC) + θ10·(recl_3·recl_12) + θ11·(Estudios·Actividad) + θ12·Plan²
where we say:
- add a new feature: Sexo × FNAC.
- add a new feature: recl_3 × recl_12 (a customer who complains a lot has more weight).
- add a new feature: Estudios × Actividad.
- consider Plan squared, giving more weight to the subscribed plan.
These new features allow the algorithm to fit the dataset more closely, which could yield a better prediction.
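A sketch of how these features could be added with pandas (column names and values here are illustrative, already encoded and normalized):

```python
import pandas as pd

# Toy rows standing in for the encoded, normalized dataset.
df = pd.DataFrame({
    "sexo": [0, 1], "fnac": [0.4, 0.7],
    "estudios": [0.2, 0.8], "actividad": [0.5, 1.0],
    "plan": [0.3, 0.6], "recl_3": [0.0, 0.5], "recl_12": [0.1, 0.9],
})

# Interaction and squared terms described above.
df["sexo_fnac"] = df["sexo"] * df["fnac"]
df["recl_3_12"] = df["recl_3"] * df["recl_12"]
df["estudios_actividad"] = df["estudios"] * df["actividad"]
df["plan_sq"] = df["plan"] ** 2

print(df.shape)  # (2, 11)
```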
The predictions with these new features are:
| | Hits | Misses | False hits | Time used in fit |
|---|---|---|---|---|
| LabelEncoder | 85% (138/162) | 14% (24/162) | 0% (8/838) | 25 sec |
| Custom Order | 91% (148/162) | 8% (14/162) | 0% (2/838) | 24 sec |
| OneHotEncoder | 25% (41/162) | 74% (121/162) | 1% (9/838) | 176 sec |
Now we have a tool fast and precise enough to obtain the valuable churn prediction. Using our Custom Order, the model returns fewer false hits than OneHotEncoder, and it is faster.
The code is available here.
Thanks for your time.