This section demonstrates how to build predictive models in R with the party, rpart, and randomForest packages. It starts by building a decision tree with the party package and using the resulting tree for classification, then introduces an alternative way to build decision trees with the rpart package, and finally shows how to train a random forest model with the randomForest package.
Decision Trees using Package party
This section illustrates the process of constructing a decision tree for the iris data using the ctree() function from the party package. More specifically, the features Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width are utilized to predict the species of flowers. The ctree() function within the package builds the decision tree, while predict() enables predictions for new data. Prior to modeling, the iris data is divided into two subsets: training (70%) and test (30%). To ensure reproducibility of the results, a fixed value is set for the random seed.
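A split along these lines does the job; the seed value 1234 and the object names trainData and testData are arbitrary choices reused throughout this section:

```r
# Reproducible 70/30 split of the iris data
set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
trainData <- iris[ind == 1, ]
testData  <- iris[ind == 2, ]
```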
The code below loads the party package, builds the decision tree model, and checks its predictions on the training data. myFormula specifies Species as the target variable and the four measurements as predictors.
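A sketch of that step, using the trainData subset created above; the model object name iris_ctree is just a convenient label:

```r
library(party)

# Species is the target; the four measurements are the predictors
myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
iris_ctree <- ctree(myFormula, data = trainData)

# Check the predictions against the actual species in the training data
table(predict(iris_ctree), trainData$Species)
```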
Now let us explore the fitted tree by printing its rules with print() and by plotting it.
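Something along these lines prints the rules and draws both the default and the simple version of the tree plot:

```r
# Print the decision rules of the fitted tree
print(iris_ctree)

# Default plot with bar charts at the terminal nodes,
# then the compact version with class probabilities shown in the nodes
plot(iris_ctree)
plot(iris_ctree, type = "simple")
```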
In the first plot, the bar chart at each terminal node shows the probability of an instance belonging to each of the three species. In the simple plot, these probabilities appear as "y" inside the nodes. For example, node 2 is labeled "n=40, y=(1, 0, 0)", meaning it contains 40 training instances, all of which belong to the first class, "setosa". Next, the built tree needs to be tested on the test data.
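A sketch of that check, reusing the testData subset created earlier:

```r
# Predict on the test data and compare with the actual species
testPred <- predict(iris_ctree, newdata = testData)
table(testPred, testData$Species)
```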
The version of ctree() used here (party 0.9-9995) does not handle missing values well. An instance with a missing value may sometimes be assigned to the left sub-tree and sometimes to the right, possibly because of surrogate rules.
Another concern arises when a variable is present in the training data and provided to ctree(), but does not appear in the constructed decision tree. In such instances, the test data must also contain that variable in order for predictions to be made successfully using the predict() function. Additionally, if the categorical variable levels in the test data differ from those in the training data, prediction on the test data will fail.
To address the aforementioned issues, one possible solution is to construct a new decision tree using ctree() after the initial tree is built. This new tree should only include variables that exist in the first tree. Furthermore, it is essential to explicitly set the categorical variable levels in the test data to match the corresponding variable levels in the training data.
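As a purely hypothetical sketch of the second point, suppose the training and test data share a factor column named Color; rebuilding the test-data factor with the training levels keeps the level sets aligned:

```r
# Hypothetical example: both data frames contain a factor column named Color.
# Rebuilding the test-data factor with the training levels keeps them consistent.
testData$Color <- factor(testData$Color, levels = levels(trainData$Color))
```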
Decision Trees with Package rpart
To mix things up a little, we will use the bodyfat dataset alongside the rpart package to create a decision tree model. rpart() helps us build the model so that we can select the decision tree with the least prediction error. Thereafter, we apply the model to data it has never seen before and generate predictions using the usual suspect: predict(). But first things first, let us load the bodyfat dataset.
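The data set ships with the TH.data package (older write-ups load it from mboost), so something like this should work:

```r
# Load the bodyfat data and take a quick look at it
data("bodyfat", package = "TH.data")
dim(bodyfat)
head(bodyfat)
```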
The following code splits the dataset into training and test subsets and then builds a decision tree model on the training subset.
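A sketch of that step; the 70/30 proportions mirror the earlier example, and the choice of DEXfat as the target, the particular predictor list, and the minsplit = 10 control setting are typical choices for this data rather than a fixed recipe:

```r
library(rpart)

# 70/30 split of the bodyfat data
set.seed(1234)
ind <- sample(2, nrow(bodyfat), replace = TRUE, prob = c(0.7, 0.3))
bodyfat.train <- bodyfat[ind == 1, ]
bodyfat.test  <- bodyfat[ind == 2, ]

# DEXfat is the target; the body measurements are the predictors
myFormula <- DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth
bodyfat_rpart <- rpart(myFormula, data = bodyfat.train,
                       control = rpart.control(minsplit = 10))
print(bodyfat_rpart$cptable)
print(bodyfat_rpart)
```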
Let us visualize the fitted tree.
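Plotting the rpart object and labeling it with text() draws the tree diagram; use.n = TRUE adds the number of observations at each node:

```r
plot(bodyfat_rpart)
text(bodyfat_rpart, use.n = TRUE)
```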
Now let us select the tree with the least prediction error.
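Something like this picks the complexity parameter with the lowest cross-validation error (the xerror column of the cp table) and prunes the tree with it:

```r
# Pick the complexity parameter with the lowest cross-validation error ...
opt <- which.min(bodyfat_rpart$cptable[, "xerror"])
cp  <- bodyfat_rpart$cptable[opt, "CP"]

# ... and prune the tree with it
bodyfat_prune <- prune(bodyfat_rpart, cp = cp)
print(bodyfat_prune)
```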
We can now use the best tree to predict values and compare them with the observed values in the test subset of the bodyfat data. The following code uses abline() to draw the diagonal line on which predicted and observed values are equal. If the model is good, most of the points should lie on or close to this line.
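A sketch of that comparison, assuming the pruned tree bodyfat_prune and the test subset from above:

```r
# Predict DEXfat on the test data with the pruned tree
DEXfat_pred <- predict(bodyfat_prune, newdata = bodyfat.test)

# Observed vs. predicted; the diagonal marks perfect predictions
plot(DEXfat_pred ~ DEXfat, data = bodyfat.test,
     xlab = "Observed", ylab = "Predicted",
     xlim = c(0, 60), ylim = c(0, 60))
abline(a = 0, b = 1)
```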
Random Forest
Finally, let us install the randomForest package for our next predictive model, which will again use the iris dataset. Unfortunately, randomForest cannot handle datasets with missing values. Moreover, each categorical variable can have at most 32 levels; variables with more levels need to be transformed before being fed to randomForest. You can also use package party's cforest() function, which does not limit categorical attributes to 32 levels, although training will still use more memory and take longer when a variable has many levels. We begin by splitting the dataset into training and test subsets.
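A split along the same lines as before works here too; reusing the trainData and testData names is just a convenience:

```r
# 70/30 split of the iris data for the random forest example
set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
trainData <- iris[ind == 1, ]
testData  <- iris[ind == 2, ]
```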
The code below loads the required package and begins the training process. The logic is pretty much the same as in the other instances.
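A sketch of the training step; ntree = 100 and proximity = TRUE are typical settings rather than requirements:

```r
library(randomForest)

# Train a forest of 100 trees; Species ~ . uses all remaining columns as predictors
rf <- randomForest(Species ~ ., data = trainData, ntree = 100, proximity = TRUE)

# Confusion matrix on the training data, model summary, and variable importance
table(predict(rf), trainData$Species)
print(rf)
importance(rf)
```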
After that, we plot the error rates for various numbers of trees.
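Calling plot() on the fitted forest draws the out-of-bag error rates against the number of trees grown:

```r
# Out-of-bag error rates versus the number of trees grown
plot(rf)
```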
Finally, the built random forest is tested on the test data, and the result is checked with the functions table() and margin(). The margin of a data point is defined as the proportion of votes for the correct class minus the maximum proportion of votes for the other classes. Generally speaking, a positive margin means correct classification.
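A sketch of that final check; note that, depending on the randomForest version, margin() may compute the margins from the forest's own vote proportions regardless of the second argument:

```r
# Predict on the test data and build the confusion matrix
irisPred <- predict(rf, newdata = testData)
table(irisPred, testData$Species)

# Plot the margins; positive values indicate correct classification
plot(margin(rf, testData$Species))
```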