Decision Tree Regression: A Comprehensive Guide with Python Code Examples and Hyperparameter Tuning

Decision Tree Regression is a popular and powerful regression algorithm. But to get the full potential out of it, you have to do hyperparameter tuning, which means choosing the parameter values that best fit the data and give correct predictions. In this article we will focus mainly on the implementation using Python, and we will also learn some hyperparameter tuning techniques.

For more information on Decision Tree Regression, you can refer to this blog by Ashwin Prasad - Link.

Decision Tree Regression

Decision Tree Regression builds a tree-like structure by splitting the data based on the values of various features; in simple terms, it creates different subsets of the data. To predict a new sample, the average value of the target variable in the corresponding leaf node is used. It handles both categorical and continuous variables, making it a versatile algorithm for regression tasks.

But in some Python libraries like sklearn, categorical variables cannot be handled directly by decision tree regression. So we have to encode them using a suitable encoding method, depending on the data and the model.
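
For example, if the dataset contained a text column such as a job title, it could be one-hot encoded before fitting the tree. Below is a minimal sketch using a small hypothetical DataFrame (the column names here are illustrative and are not part of the dataset used later in this article):

## Encoding a hypothetical categorical column before decision tree regression
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

## Hypothetical data with one categorical and one numeric feature
df = pd.DataFrame({
    'Position': ['Analyst', 'Manager', 'Director', 'Manager'],
    'Level': [2, 4, 7, 5],
    'Salary': [50000, 80000, 200000, 110000]
})

## One-hot encode the categorical column and pass the numeric column through unchanged
pre = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['Position'])],
    remainder='passthrough'
)

## Chain the encoder and the regressor into one pipeline and fit it
model = Pipeline([('prep', pre), ('tree', DecisionTreeRegressor(random_state=21))])
model.fit(df[['Position', 'Level']], df['Salary'])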

Implementation Using Python

We will use the sklearn library in Python for the implementation.
The first step is to import the necessary libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Next we will read the dataset. In this article we will be using a simple salary dataset.

## Reading the data
sal_d = pd.read_csv('Position_Salaries.csv')

sal_d ## this line will print the data in jupyter notebook

Our data will look like this -

Salary Data

Now we will create the dependent and independent variables.

x = sal_d.iloc[:,1:-1].values
y = sal_d.iloc[:,-1].values

In the next step we will train our regression model using the above variables. Since our dataset is very small, we are training the model on the entire data.

There are various steps in data preparation or preprocessing; you can refer to all of those steps in this article - Link

Below is the code for training our model -


## Importing 
from sklearn.tree import DecisionTreeRegressor

## Creating model
reg = DecisionTreeRegressor(random_state=21)

## training our model
reg.fit(x,y)


Now we will make a prediction using the created model.

reg.predict([[6.5]])

Output of the above -

array([150000.])
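
If you want to see how the fitted tree arrives at this value, sklearn can print the learned split rules as text. Below is a minimal sketch, assuming the single feature is the position level (matching the axis labels used later in this article):

## Print the split rules of the fitted tree
from sklearn.tree import export_text
print(export_text(reg, feature_names=['Level']))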

Now we will visualize the predictions of our model. For higher resolution we will create x_grid, which makes the plotted line smooth.

## For smooth line
x_grid = np.arange(x.min(), x.max(), 0.01)
x_grid = x_grid.reshape(len(x_grid),1)

## this will plot points on chart
plt.scatter(x,y,color='red')

## this will plot the line connecting to points
plt.plot(x_grid,reg.predict(x_grid),color='blue')

## This will give title to our plot
plt.title('actual vs predict')

## this will give label to x axis
plt.xlabel('level')

## this will give label to y axis
plt.ylabel('salary')

## This line will save our plot as image on our computer
plt.savefig('decision_tree_regression.png',bbox_inches='tight')

## And this line is for showing the chart and ending.
plt.show() 

The output of the above plot code is shown below -

Decision Tree Regression Plot

Hyperparameter Tuning

Hyperparameter tuning means selecting the best values for the parameters of a machine learning algorithm. It involves searching over and evaluating different combinations of parameters to maximize the performance of the model.

To enhance the performance of decision tree regression, we can tune its parameters using classes from sklearn such as GridSearchCV and RandomizedSearchCV.

Grid Search

Grid search is a method to find the best set of parameter values by trying out every possible combination from a given grid and evaluating each one.

Below is the code for implementing GridSearchCV -


## importing class from library
from sklearn.model_selection import GridSearchCV

## Candidate values for the parameters to search over
param_grid = {
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

## creating instance
grid_search = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=5)

## fitting the data (we use the full x and y here, since we did not split the dataset)
grid_search.fit(x, y)

## getting best parameters 
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)


Using these best parameters we can retrain our model and enhance its performance.
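
One way to do this is to unpack the best parameters into a new regressor and refit it. Below is a minimal sketch, reusing the x and y from earlier and the best_params found above:

## Retrain a tree using the best parameters found by the grid search
best_reg = DecisionTreeRegressor(**best_params, random_state=21)
best_reg.fit(x, y)

## Predict with the tuned model
best_reg.predict([[6.5]])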

Randomized Search

Randomized search is a way to find the best values for different parameters by randomly trying out a subset of possible combinations, which makes the search process faster.

Below is the code for implementing RandomizedSearchCV -


## importing class from library
from sklearn.model_selection import RandomizedSearchCV

## Candidate values for the parameters to search over
param = {
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

## creating instance
random_search = RandomizedSearchCV(DecisionTreeRegressor(random_state=25), param, n_iter=10, cv=5)

## fitting the data (again on the full x and y)
random_search.fit(x, y)

## getting best parameters 
best_params = random_search.best_params_
print("Best Hyperparameters:", best_params)

These parameters, too, can be used to enhance the performance of our model.
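
Alternatively, both search classes expose the refitted best model directly through their best_estimator_ attribute, so you do not have to rebuild the regressor by hand. A minimal sketch:

## The search object keeps the best model, already refitted on the data passed to fit()
best_model = random_search.best_estimator_
best_model.predict([[6.5]])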

Of these two methods, RandomizedSearchCV is usually faster, because GridSearchCV evaluates every combination of parameters one by one, while randomized search only tries a sampled subset of them.

The size of the dataset and of the parameter grid also plays a role in which one is faster in practice.

There are various other methods for searching for the best parameters of a model, but these are the two I have personally implemented, so I explained them here as I learned them.

Conclusion

In this article we learned how to implement decision tree regression using Python. We also learned some hyperparameter tuning techniques like GridSearchCV and RandomizedSearchCV.

All code implementations were done by me, so if anyone finds a mistake in them, please leave a comment below.

Thank You!
