After getting the data, it’s very tempting to jump straight into fitting several models and evaluating their performance. However, the first step should be an exploratory data analysis (EDA), which allows us to explore the structure of our data and to understand the relationships between the variables. Any EDA should involve creating and analysing several plots and computing summary statistics to uncover the patterns present in our dataset.
If you want to see how I performed the EDA for this particular project, you can read this previous post.
From the EDA on this project, we have learnt some important features of our dataset. First of all, it doesn’t suffer from class imbalance, which occurs when the total number of observations in one class is significantly lower than the number of observations in the other class. Also, some of our variables showed skewness that was fixed by log-transforming them, and no variable showed a perfect linear relationship with another, though in some of them we could observe a trend towards an interaction.
One of the main decisions to make when performing machine learning is choosing the appropriate algorithm that fits the current problem we are dealing with.
Supervised learning refers to the task of inferring a function from a labeled training dataset. We fit the model to the labeled training set with the main goal of finding the optimal parameters that will predict unknown labels of new examples included in the test dataset. There are two main types of supervised learning: regression, in which we want to predict a label that is a real number, and classification, in which we want to predict a categorical label.
In our case, we have a labeled dataset and we want to use a classification algorithm to predict a categorical label with two possible values: 0 and 1.
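We can find many supervised classification algorithms: some are simple but efficient, such as linear classifiers or logistic regression, and others are more complex but powerful, such as decision trees and k-nearest neighbours (k-means, which is sometimes confused with the latter, is an unsupervised clustering method rather than a classifier).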
In this case, we will choose the random forest algorithm. Random forest is one of the most widely used machine learning algorithms because it is simple, flexible and easy to use, yet produces reliable results.
Briefly, random forest creates ‘a forest’ of multiple decision trees and combines them into an ensemble in order to obtain a more accurate prediction. The advantages of random forest over a single decision tree are that combining the individual models improves the overall result and that it reduces overfitting by building the trees from random subsets of the features.
So we will first load the packages from scikit-learn that we need to run random forest and to evaluate the model afterwards. We will also replace the categorical values with 0, 1 or NaN, convert all variables to float and log-transform the skewed variables, as we did in the EDA. We will then check again the total number of missing values in each variable:
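As a minimal sketch of these preprocessing steps, the snippet below uses a tiny invented DataFrame rather than the real hepatitis data. The column values and the "yes"/"no" encoding are assumptions made for illustration only; the actual raw encoding may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the dataset; all values are invented.
df = pd.DataFrame({
    "age": [30, 50, 78, 31],
    "malaise": ["no", "yes", None, "no"],
    "bilirubin": [1.0, 0.9, None, 0.7],
    "Class": [1, 1, 0, 1],
})

# Map categorical answers to 1/0; anything not in the mapping becomes NaN.
df["malaise"] = df["malaise"].map({"yes": 1, "no": 0})

# Convert every column to float so that NaN is represented consistently.
df = df.astype(float)

# Log-transform a skewed continuous variable (NaN stays NaN).
df["bilirubin"] = np.log(df["bilirubin"])

# Count the missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

Here only bilirubin is log-transformed; which variables need it depends on the skewness found in the EDA.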
In the EDA, we dropped all NaN values. Here, we need to decide which method is best for handling them.
There are several ways to deal with missing data, but none of them is perfect. The first step is to understand why the data went missing. In our case, we can guess that the missing values in the categorical variables could be due to the feature being absent and left blank instead of being recorded as ‘no’, or to it not having been tested. Likewise, missing values in continuous variables could be explained by the biochemical studies not having been performed on that particular patient, or by the parameters being within the normal range and not written down.
In both cases, we could be dealing with values that are Missing at Random (the fact that the value is missing has nothing to do with the hypothetical value) or Missing Not at Random (the fact that the value is missing depends on the hypothetical value). In the first case, we could drop the NaN values safely, while in the second it would not be safe to drop them, because the missingness itself tells us something about the hypothetical value. So in our case, we will impute the missing values just before we train our model.
Feature scaling or data normalization, a method used to standardize the range of the independent variables, is also a very important step before training many classifiers. Some models can perform very poorly if the variables are not within the same range. Another advantage of random forest is that it does not require this step.
Splitting the dataset into training and test datasets
In order to train and test our model, we need to split our dataset into two subsets, the training and the test dataset. The model will learn from the training dataset so that it can generalize to other data; the test dataset will be used to “test” what the model learnt in the training and fitting step.
It is common to use an 80%-20% rule to split the original dataset. It is important to use a reliable method to split the dataset to avoid data leakage, that is, the presence in the test set of examples that were also in the training set, which can hide overfitting and inflate the measured test performance.
First, we will assign all the columns except our dependent variable (“Class”) to the variable X and the column “Class” to the variable Y.
And then we will use train_test_split from the scikit-learn library to split them into X_train, X_test, Y_train and Y_test. It is important to set random_state because this will allow us to get the same results every time we run the code.
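The two steps above could look roughly like this. The DataFrame is a hypothetical placeholder with invented values, and test_size=0.2 and random_state=42 are assumptions chosen to match the 80%-20% rule and to make the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, two features and a binary "Class" column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 80, size=10).astype(float),
    "bilirubin": rng.random(10),
    "Class": [0, 1] * 5,
})

X = df.drop(columns="Class")  # all independent variables
Y = df["Class"]               # dependent variable

# 80/20 split; random_state fixes the shuffle so the split is reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Because every row ends up in exactly one of the two subsets, no example is shared between training and test data.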
Note: Train/test splitting has some disadvantages, because some models require hyperparameter tuning which, in this setup, is also done on the training set. One way to avoid this is to create a train/validation/test split following a 60/20/20% rule. There are several effective methods to do this that we will see below.
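One simple way to obtain a 60/20/20 split, sketched here on placeholder arrays, is to call train_test_split twice. This is only an illustration of the idea in the note; it is not the approach used later in this post, which relies on cross validation instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # 100 hypothetical samples
y = np.array([0, 1] * 50)

# First hold out 20% of the data as the test set ...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# ... then take 25% of the remaining 80% (20% overall) for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)
print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```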
Training Random Forest
It is now very easy to impute the missing values (using Imputer) and to create and train a basic random forest model with scikit-learn. We will start by applying .ravel() to Y_train and Y_test to flatten the arrays, as not doing so will raise warnings from our model.
Then, we will impute our missing values using Imputer with the strategy most_frequent, which replaces the missing values with the most frequent value in each column (axis = 0). It is worth noting that doing so can introduce errors and bias but, as we stated before, there is no perfect way to handle missing data.
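To make the most_frequent strategy concrete, here is a small sketch on an invented array. Note that it uses SimpleImputer from sklearn.impute, which replaces the older Imputer class in recent scikit-learn releases; if you are on an old version, the class name and import path will differ.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented training data with two missing entries.
X_train = np.array([
    [1.0, 0.0],
    [1.0, np.nan],
    [np.nan, 0.0],
    [2.0, 1.0],
])

# Each NaN is replaced by the most frequent value of its own column.
imputer = SimpleImputer(strategy="most_frequent")
X_train_imputed = imputer.fit_transform(X_train)

print(X_train_imputed)
# X_train[2, 0] becomes 1.0 and X_train[1, 1] becomes 0.0
```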
Our basic model has now been trained and has learnt the relationship between our independent variables and the target variable. Now, we can check how good our model is by making predictions on the test set. We can then compare the prediction with our known labels.
We will again impute the missing values, this time in our test set, and then use the function predict and the metric accuracy_score to evaluate the performance of our model.
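A minimal end-to-end sketch of training and evaluating the basic model follows. It runs on synthetic data from make_classification with values randomly removed, so the accuracy it prints will not match the figure reported for the real dataset. Reusing the imputer fitted on the training set for the test set is a design choice made here so that no information from the test data is used to compute the fill values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 10% of the values set to NaN.
X, y = make_classification(n_samples=155, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

imputer = SimpleImputer(strategy="most_frequent")
X_train_imp = imputer.fit_transform(X_train)  # learn fill values on training data
X_test_imp = imputer.transform(X_test)        # reuse them on the test data

# Basic random forest with default hyperparameters.
model = RandomForestClassifier(random_state=42)
model.fit(X_train_imp, y_train)

y_pred = model.predict(X_test_imp)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")
```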
As we can observe above, our basic model has an accuracy of 74.19%, which tells us that it has to be further improved.
There are several ways to improve our model: gather more data, tune the hyperparameters of the model or choose other models. We will choose the second one: we will now tune the hyperparameters of our random forest classifier.
Model parameters are normally learned during training; hyperparameters, however, must be set manually before training. In the case of random forest, hyperparameters include:
- n_estimators: number of trees in the forest
- max_features: maximum number of features considered when looking for the best split
- max_depth: maximum depth of each tree
- bootstrap: whether to use bootstrap samples when building the trees
- criterion: function used to measure the quality of a split (for example, Gini impurity or entropy)
Of course, when we fit a basic random forest, scikit-learn uses a set of default hyperparameters, but we cannot be sure that those values are optimal for our particular problem.
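To see which defaults your installed scikit-learn version uses for the hyperparameters listed above, you can inspect an untrained model with get_params. The values are not shown here because some of them have changed between releases.

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
defaults = model.get_params()  # dictionary of hyperparameter names and values

for name in ["n_estimators", "max_features", "max_depth", "bootstrap", "criterion"]:
    print(name, "=", defaults[name])
```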
At this point we need to consider two concepts: underfitting and overfitting. Underfitting occurs when the model is too simple and doesn’t fit the data well: it has low variance but high bias. Overfitting, on the other hand, occurs when the model fits the training set too well and performs poorly on new examples. If we tuned the hyperparameters by evaluating them on the same training data, we would be prone to overfitting our random forest classifier. So instead, we will go back to what was mentioned before: cross validation.
There are many cross validation methods; the best known are K-Folds Cross Validation and Leave One Out Cross Validation. In our case, we will use the first one: we will split our training data into K subsets (folds), using K-1 of them for training and the remaining one for validation, and repeat this so that each fold is used for validation once. In order to tune our hyperparameters, we will run this K-fold cross validation many times, using different model settings each time. Afterwards, we compare all the models and select the best one; then, we train the best model on the full training set and evaluate it on the test set. We will take advantage of the GridSearchCV class in scikit-learn to perform this task.
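The fold mechanics can be illustrated with scikit-learn's KFold on ten placeholder samples. The choice of 5 folds, the shuffling and the random_state are assumptions made for this sketch.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(10, 1)  # 10 hypothetical samples

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = []
validation_indices = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X), start=1):
    # Each iteration trains on 4 folds and validates on the remaining one.
    fold_sizes.append((len(train_idx), len(val_idx)))
    validation_indices.extend(val_idx.tolist())
    print(f"Fold {fold}: train={train_idx}, validation={val_idx}")
```

Across the five iterations, every sample appears in the validation fold exactly once.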
We will define the parameters and values that we want to optimize, then run GridSearchCV and set the best parameters obtained on our model.
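A sketch of such a grid search is shown below on synthetic data. The grid itself is hypothetical and deliberately small to keep the run time short; it is not the grid used for the results discussed in this post, and the scores it prints will differ from them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data.
X, y = make_classification(n_samples=155, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical grid of candidate hyperparameter values.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
    "criterion": ["gini", "entropy"],
}

# 5-fold cross validation on the training set for every combination.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)

# With the default refit=True, the best model is retrained on the full training set.
best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2%}")
```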
As we can see above, GridSearchCV improved our accuracy from 74% to 77%. Even though this is not a great improvement, other studies using this dataset have been reported to reach an accuracy of only 80%. Considering this, and the fact that the dataset has many missing values and is small (only 155 samples), we can go on and analyse other model metrics.
Test set metrics
Now that we have optimized our hyperparameters, we will proceed to evaluate our model. First of all, we will create a confusion matrix that will tell us the true negative, false positive, false negative and true positive counts according to our predicted values, and plot it using a seaborn heatmap:
True negative (TN)  | False positive (FP)
--------------------|--------------------
False negative (FN) | True positive (TP)
Analysing the confusion matrix, we can expect our model to show a higher recall (TP/(TP+FN)) than precision (TP/(TP+FP)), and both to be higher than the accuracy ((TP+TN)/Total). These three metrics can be weighed according to what we need our model to solve. We will come back to them later.
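The three formulas can be checked on a small example. The labels below are invented (they are not the model's real predictions) and were chosen so that recall is higher than precision, which in turn is higher than accuracy. scikit-learn's confusion_matrix returns the counts in the same TN, FP / FN, TP layout shown above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for 10 samples.
y_true = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 1, 1, 0, 0, 1, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)                     # 6 / 7
precision = tp / (tp + fp)                  # 6 / 8
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 7 / 10

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"recall={recall:.2f} precision={precision:.2f} accuracy={accuracy:.2f}")
```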
We can further investigate the false positive and true positive rates using the ROC curve and calculating the area under the curve (AUC), which is also a metric of the predictive power of our model (a value closer to 1 means that our model does a good job of separating the two classes, while 0.5 corresponds to random guessing).
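As a small illustration of how these quantities are computed, the sketch below uses four invented samples and scores. In practice the scores would typically be the predicted probabilities of the positive class, for example from predict_proba, which is an assumption about the workflow rather than something shown in this post.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted scores for the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# False positive rate and true positive rate at each decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", auc)  # 0.75
```

The AUC is 0.75 here because in three of the four positive/negative pairs the positive sample receives the higher score.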
From the ROC curve, we learn that our model does not do a good job of distinguishing between the two classes, as the AUC is 0.60. We could address this by collecting more data and adding it to the model.
Last, we can analyse the precision-recall curve:
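The points of such a curve can be obtained with precision_recall_curve. The sketch uses the same kind of invented labels and scores as a stand-in for real model output, so the values say nothing about our classifier.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical true labels and predicted scores for the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# One (precision, recall) pair per threshold, plus a final (1.0, 0.0) point.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

for p, r in zip(precision, recall):
    print(f"precision={p:.2f} recall={r:.2f}")
```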
We can observe that the precision-recall relationship is quite constant across the different values, indicating that our model has good precision and recall. This is because the number of true positives is quite high compared to the true negatives, false positives and false negatives. It is important to remember that precision and recall usually trade off against each other: raising one tends to lower the other, so we have to find a balance where both are high enough for our model.
Interpreting the results
The last thing that we can do before finishing our project is to evaluate the variable importance, that is, to quantify how useful each variable is for our model.
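A fitted random forest in scikit-learn exposes these values through its feature_importances_ attribute. The sketch below uses synthetic data and placeholder feature names (feature_0, feature_1 and so on), so the ranking it prints is unrelated to the variables of our dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 5 features, of which 2 are informative.
X, y = make_classification(
    n_samples=155, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances; they sum to 1 across all features.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```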
We can observe that age, protime, alk_phosphate, bilirubin, malaise, ascites are some of the most important variables for our model. This reflects what we have seen previously in our EDA and reinforces the importance of performing this exploratory analysis before starting the machine learning algorithm.
So after applying random forest to our dataset, we can conclude that our best model was able to predict the survival of patients with hepatitis with an accuracy of 77% and a precision and recall of around 80%. This is not ideal, since we want our model to perform better, especially in a case like this that involves the survival of patients. However, the moderately good results could be due to the small dataset and the large number of missing values.