Tamal Barman
Running a Random Forest Using Python

Introduction: A Practical End-to-End Machine Learning Example

There has never been a better time to delve into machine learning. The abundance of learning resources available online, free open-source tools offering implementations of a wide range of algorithms, and affordable computing power through cloud services like AWS have truly democratized the field. Anyone with a laptop and a willingness to learn can experiment with state-of-the-art algorithms within minutes. With a little more time and effort, you can develop practical models to assist you in your daily life or work, or even transition into the machine learning field and reap its economic benefits.

In this post, I will guide you through an end-to-end implementation of the powerful random forest machine learning model. It complements my conceptual explanation of random forests, but it can also be understood on its own as long as you grasp the basic idea of decision trees and random forests. The complete project, including the data, is on GitHub, and you can download the data file and Jupyter Notebook from Google Drive. All you need is a laptop with Python installed and the ability to start a Jupyter Notebook to follow along. (For guidance on installing Python and running a Jupyter Notebook, refer to this guide.)

Although Python code will be used, its purpose is not to intimidate but to demonstrate how accessible machine learning has become with the resources available today! This project covers a few essential machine learning topics, which I will strive to explain clearly, and I will point to additional learning resources for those who are interested.


Problem Introduction

The problem we are addressing involves predicting tomorrow's maximum temperature in our city using one year of historical weather data. While I have chosen Seattle, WA as the city for this example, feel free to gather data for your own location using the NOAA Climate Data Online tool. Our goal is to make predictions without relying on existing weather forecasts, as it's more exciting to generate our own predictions. We have access to a year's worth of past maximum temperatures, as well as the temperatures from the previous two days and an estimate from a friend who claims to possess comprehensive weather knowledge. This is a supervised regression machine learning problem. It is considered supervised because we have both the features (data for the city) and the targets (temperature) that we want to predict. During the training process, we provide the random forest algorithm with both the features and targets, enabling it to learn how to map the data to a prediction. Furthermore, this task falls under regression since the target value is continuous, in contrast to discrete classes encountered in classification. With this background information established, let's dive into the implementation!

Roadmap

Before diving into programming, it's important to outline a concise guide to keep us focused. The following steps provide the foundation for any machine learning workflow once we have identified a problem and chosen a model:

  1. Clearly state the question and determine the necessary data.
  2. Obtain the data in a format that is easily accessible.
  3. Identify and address any missing data points or anomalies as necessary.
  4. Prepare the data to be suitable for the machine learning model.
  5. Establish a baseline model that you aim to surpass.
  6. Train the model using the training data.
  7. Utilize the model to make predictions on the test data.
  8. Compare the model's predictions to the known targets in the test set and calculate performance metrics.
  9. If the model's performance is unsatisfactory, consider adjusting the model, acquiring more data, or trying a different modeling technique.
  10. Interpret the model's outcomes and report the results in both visual and numerical formats.

Data Acquisition

To begin, we require a dataset. For the purpose of a realistic example, I obtained weather data for Seattle, WA from the year 2016 using the NOAA Climate Data Online tool. Typically, approximately 80% of the time dedicated to data analysis involves cleaning and retrieving data. However, this workload can be minimized by identifying high-quality data sources. The NOAA tool proves to be remarkably user-friendly, enabling us to download temperature data in the form of clean CSV files that can be parsed using programming languages like Python or R. For those who wish to follow along, the complete data file is available for download.

The following Python code loads the CSV data and displays the structure of the data:

# Pandas is used for data manipulation
import pandas as pd
# Read in data and display first 5 rows
features = pd.read_csv('temps.csv')
features.head(5)

Snapshot of the first five rows of the data (the full output is in the notebook). The columns are as follows:

  • year: The year, which is consistent at 2016 for all data points.
  • month: The numerical representation of the month in the year.
  • day: The numerical representation of the day of the month.
  • week: The day of the week, expressed as a character string.
  • temp_2: The maximum temperature recorded two days prior.
  • temp_1: The maximum temperature recorded one day prior.
  • average: The historical average maximum temperature.
  • actual: The actual measured maximum temperature.
  • friend: Your friend's prediction, which is a random number generated between 20 below the average and 20 above the average.

Identify Anomalies/Missing Data

Upon examining the dimensions of the data, we observe that there are only 348 rows, which does not align with the expected 366 days in the year 2016. Upon closer inspection of the NOAA data, I discovered that several days were missing. This serves as a valuable reminder that real-world data collection is never flawless. Missing data, as well as incorrect data or outliers, can impact the analysis. However, in this case, the impact of the missing data is expected to be minimal, and the overall data quality is good due to the reliable source.

print('The shape of our features is:', features.shape)
The shape of our features is: (348, 9)
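As a quick sanity check (not part of the original notebook), a sketch like the following could list which 2016 dates are absent, assuming the year/month/day columns shown above:

# Hypothetical check (not in the original notebook): count which 2016 dates are missing
import pandas as pd
# Assemble datetimes from the year, month, and day columns
observed = pd.to_datetime(features[['year', 'month', 'day']])
all_days = pd.date_range('2016-01-01', '2016-12-31')
print(len(all_days.difference(observed)), 'days are missing from the dataset')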

To identify anomalies, we can quickly compute summary statistics.

# Descriptive statistics for each column
features.describe()

(Summary statistics for each numerical column.)

Upon initial inspection, there don't appear to be any data points that immediately stand out as anomalies, and there are no zeros in any of the measurement columns. Another effective way to assess data quality is to create basic plots; graphical representations often make it easier to spot anomalies than numerical summaries alone. I have omitted the full plotting code here, since it is not the most intuitive part of Python, but you can find the complete implementation in the notebook (a rough sketch of the idea follows below). As a good practice, I must admit that I mostly leveraged existing plotting code from Stack Overflow, as many data scientists do.
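As an illustration, here is a minimal sketch of the kind of quality-check plot used; the notebook version builds a more polished multi-panel figure, and matplotlib is assumed here:

# A minimal sketch of a quality-check plot (the notebook version is more polished)
import matplotlib.pyplot as plt
# Plot the actual max temperature and the friend estimate over the year
plt.plot(features['actual'].values, label='actual max temp')
plt.plot(features['friend'].values, label='friend estimate', alpha=0.5)
plt.xlabel('Observation'); plt.ylabel('Maximum Temperature (F)')
plt.legend()
plt.show()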

(Time-series plots of the temperature variables over the year.)

Examining the quantitative statistics and the graphs, we can feel confident in the high quality of our data. There are no clear outliers, and although there are a few missing points, they will not detract from the analysis.

Data Preparation

However, we're not yet at a stage where we can directly input raw data into a model and expect it to provide accurate answers (although researchers are actively working on this!). We need to perform some preprocessing to make our data understandable by machine learning algorithms. For data manipulation, we will utilize the Python library Pandas, which provides a convenient data structure known as a dataframe, resembling an Excel spreadsheet with rows and columns.

The specific steps for data preparation will vary based on the chosen model and the collected data. However, some level of data manipulation is typically necessary for any machine learning application.

One important step in our case is known as one-hot encoding. This process converts categorical variables, such as days of the week, into a numerical representation without any arbitrary ordering. While we intuitively understand the concept of weekdays, machines lack this inherent knowledge. Computers primarily comprehend numbers, so it's crucial to accommodate them for machine learning purposes. Rather than simply mapping weekdays to numeric values from 1 to 7, which might introduce unintended bias due to the numerical order, we employ a technique called one-hot encoding. This transforms a single column representing weekdays into seven binary columns. Let me illustrate this visually:

(Image: the data with a single week column containing the day names)
and turns it into
(Image: the data with seven binary columns, one for each day of the week)

So, if a data point is a Wednesday, it will have a 1 in the Wednesday column and a 0 in all other columns. This process can be done in pandas in a single line!

# One-hot encode the data using pandas get_dummies
features = pd.get_dummies(features)
# Display the first 5 rows of the last 12 columns
features.iloc[:,5:].head(5)

Snapshot of data after one-hot encoding:


The dimensions of our data are now 348 x 15, with all columns consisting of numerical values, which is exactly what our algorithm needs!

Next, we need to split the data into features and targets. The target, also known as the label, represents the value we want to predict, which in this case is the actual maximum temperature. The features encompass all the columns that the model will utilize to make predictions. Additionally, we will convert the Pandas dataframes into Numpy arrays, as that is the expected format for the algorithm. To retain the column headers, which correspond to the feature names, we will store them in a list for potential visualization purposes later on.

# Use numpy to convert to arrays
import numpy as np
# Labels are the values we want to predict
labels = np.array(features['actual'])
# Remove the labels from the features
# axis 1 refers to the columns
features = features.drop('actual', axis = 1)
# Saving feature names for later use
feature_list = list(features.columns)
# Convert to numpy array
features = np.array(features)

The next step in data preparation involves splitting the data into training and testing sets. During the training phase, we expose the model to the answers (in this case, the actual temperatures) so that it can learn how to predict temperatures based on the given features. We anticipate a relationship between the features and the target value, and the model's task is to learn this relationship during training. When it comes to evaluating the model's performance, we ask it to make predictions on a separate testing set where it only has access to the features (without the answers). Since we have the actual answers for the test set, we can compare the model's predictions against the true values to assess its accuracy.

Typically, when training a model, we randomly split the data into training and testing sets to ensure a representative sample of all data points. If we were to train the model solely on the data from the first nine months of the year and then use the final three months for prediction, the model's performance would be suboptimal because it hasn't encountered any data from those last three months. In this case, I am setting the random state to 42, which ensures that the results of the split remain consistent across multiple runs, thus enabling reproducible results.
The following code splits the data sets with another single line:

# Using Scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.25, random_state = 42)

We can look at the shape of all the data to make sure we did everything correctly. We expect the training and testing features to have the same number of columns, and the number of rows to match between the features and labels within each of the training and testing sets:

print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)
Training Features Shape: (261, 14)
Training Labels Shape: (261,)
Testing Features Shape: (87, 14)
Testing Labels Shape: (87,)

It seems that everything is in order! Let's recap the steps we took to prepare the data for machine learning:

  1. One-hot encoded categorical variables.
  2. Split the data into features and labels.
  3. Converted the data into arrays.
  4. Split the data into training and testing sets.

Depending on the initial dataset, there may be additional tasks involved, such as handling outliers, imputing missing values, or transforming temporal variables into cyclical representations (a small sketch of the latter is shown below). These steps may appear arbitrary at first, but once you grasp the basic workflow, you'll find that it remains largely consistent across machine learning problems. Ultimately, the goal is to convert human-readable data into a format that a machine learning model can comprehend.
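To make that last idea concrete, here is a hedged sketch of cyclical encoding. It is not used in this project; it simply maps the month onto a circle so that December and January end up numerically close:

# Hypothetical example (not part of this project): cyclical encoding of the month
import numpy as np
# features is already a numpy array at this point, so index it by position
month = features[:, feature_list.index('month')]
month_sin = np.sin(2 * np.pi * month / 12)
month_cos = np.cos(2 * np.pi * month / 12)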

Establish Baseline

Prior to making and evaluating predictions, it is essential to establish a baseline—a reasonable benchmark that we aim to surpass with our model. If our model fails to improve upon the baseline, it indicates that either we should explore alternative models or acknowledge that machine learning may not be suitable for our specific problem. In our case, the baseline prediction can be derived from the historical average maximum temperatures. Put simply, our baseline represents the error we would incur if we were to predict the average maximum temperature for all days.

# The baseline predictions are the historical averages
baseline_preds = test_features[:, feature_list.index('average')]
# Baseline errors, and display average baseline error
baseline_errors = abs(baseline_preds - test_labels)
print('Average baseline error: ', round(np.mean(baseline_errors), 2), 'degrees.')
Average baseline error:  5.06 degrees.

We now have our goal! If we can’t beat an average error of 5 degrees, then we need to rethink our approach.

Train Model

After completing the data preparation steps, the process of creating and training the model becomes relatively straightforward using Scikit-learn. We can accomplish this by importing the random forest regression model from Scikit-learn, initializing an instance of the model, and fitting (Scikit-learn's term for training) the model with the training data. To ensure reproducible results, we can set the random state. Remarkably, this entire process can be achieved in just three lines of code in Scikit-learn!

# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
# Train the model on training data
rf.fit(train_features, train_labels);

Make Predictions on the Test Set

Now that our model has been trained to learn the relationships between the features and targets, the next step is to evaluate its performance. To do this, we make predictions on the test features (the model never sees the test answers) and compare those predictions to the known answers. For this regression task we will use the mean absolute error: the average difference between our predictions and the actual values, just as we did when establishing the baseline.

In Scikit-learn, making predictions with our model is as simple as executing a single line of code.

# Use the forest's predict method on the test data
predictions = rf.predict(test_features)
# Calculate the absolute errors
errors = abs(predictions - test_labels)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')
Mean Absolute Error: 3.83 degrees.

Our average estimate is off by 3.83 degrees. That is more than a 1 degree average improvement over the baseline. Although this might not seem significant, it is nearly 25% better than the baseline, which, depending on the field and the problem, could represent millions of dollars to a company.

Determine Performance Metrics

To put our predictions in perspective, we can calculate an accuracy as 100% minus the mean absolute percentage error (MAPE).

# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / test_labels)
# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
Accuracy: 93.99 %.

That looks pretty good! Our model has learned how to predict the maximum temperature for the next day in Seattle with 94% accuracy.

Improve Model if Necessary

At this point in the machine learning workflow, we would typically move on to hyperparameter tuning. This process involves adjusting the settings of the model to enhance its performance. These settings are known as hyperparameters, distinguishing them from the model parameters learned during training. The most common approach to hyperparameter tuning involves creating multiple models with different settings, evaluating them all on the same validation set, and determining which configuration yields the best performance. Manually conducting this process would be laborious, so Scikit-learn provides automated methods to simplify the task. Keep in mind that hyperparameter tuning is often more engineering practice than theory, and I encourage those interested to explore the documentation and begin experimenting. An accuracy of 94% is satisfactory for this problem, but the first model built is unlikely to be the one that makes it into production, since model improvement is an iterative process.
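As an illustration (not part of the original notebook), one of those automated methods is Scikit-learn's RandomizedSearchCV. A minimal sketch, with a purely illustrative parameter grid, might look like this:

# A sketch of randomized hyperparameter search; the grid below is illustrative only
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
param_grid = {
    'n_estimators': [100, 500, 1000],
    'max_depth': [None, 5, 10, 20],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 1.0],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state = 42),
    param_distributions = param_grid,
    n_iter = 10,                          # number of random settings to try
    scoring = 'neg_mean_absolute_error',  # match the error metric used above
    cv = 3,
    random_state = 42)
search.fit(train_features, train_labels)
print(search.best_params_)

Each candidate is scored with cross-validation on the training data, so the test set stays untouched until the final evaluation.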

Interpret Model and Report Results

At this point, we know our model is good, but it’s pretty much a black box. We feed in some Numpy arrays for training, ask it to make a prediction, evaluate the predictions, and see that they are reasonable. The question is: how does this model arrive at the values? There are two approaches to get under the hood of the random forest: first, we can look at a single tree in the forest, and second, we can look at the feature importances of our explanatory variables.

Visualizing a Single Decision Tree

One of the coolest parts of the random forest implementation in Scikit-learn is that we can actually examine any of the trees in the forest. We will select one tree and save the whole tree as an image.

The following code takes one tree from the forest and saves it as an image.

# Import tools needed for visualization
from sklearn.tree import export_graphviz
import pydot
# Pull out one tree from the forest
tree = rf.estimators_[5]
# Export the image to a dot file
export_graphviz(tree, out_file = 'tree.dot', feature_names = feature_list, rounded = True, precision = 1)
# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')
# Write graph to a png file
graph.write_png('tree.png')

Let’s take a look:

Wow! That looks like quite an expansive tree with 15 layers (in reality this is quite a small tree compared to some I’ve seen). You can download this image yourself and examine it in greater detail, but to make things easier, I will limit the depth of trees in the forest to produce an understandable image.

# Limit depth of tree to 3 levels
rf_small = RandomForestRegressor(n_estimators=10, max_depth = 3)
rf_small.fit(train_features, train_labels)
# Extract the small tree
tree_small = rf_small.estimators_[5]
# Save the tree as a png image
export_graphviz(tree_small, out_file = 'small_tree.dot', feature_names = feature_list, rounded = True, precision = 1)
(graph, ) = pydot.graph_from_dot_file('small_tree.dot')
graph.write_png('small_tree.png');

Here is the reduced-size tree, annotated with labels:

Based solely on this decision tree, we can make predictions for new data points. Let's consider an example of predicting the maximum temperature for Wednesday, December 27, 2017, with the following values: temp_2 = 39, temp_1 = 35, average = 44, and friend = 30.

Starting at the root node, we encounter the first question, where the answer is True because temp_1 ≤ 59.5. We proceed to the left and come across the second question, which is also True since average ≤ 46.8. Continuing to the left, we reach the third and final question, which is again True because temp_1 ≤ 44.5. As a result, we conclude that our estimate for the maximum temperature is 41.0 degrees, as indicated by the value in the leaf node.

An interesting observation is that the root node only contains 162 samples, despite there being 261 training data points. This is because each tree in the random forest is trained on a random subset of the data points with replacement, a technique known as bagging (bootstrap aggregating). If we want to use all the data points without sampling with replacement, we can disable it by setting bootstrap = False when constructing the forest. The combination of random sampling of data points and a subset of features at each node is why the model is referred to as a "random" forest.
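For instance, disabling bootstrapping is a one-argument change (a sketch; this is not done in the project):

# Sketch: train each tree on the full training set rather than a bootstrap sample
rf_full_data = RandomForestRegressor(n_estimators = 1000, bootstrap = False, random_state = 42)
rf_full_data.fit(train_features, train_labels)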

Furthermore, it is worth noting that in our decision tree, we only utilized two variables to make predictions. According to this specific tree, the remaining features such as the month of the year, day of the month, and our friend's prediction are deemed irrelevant for predicting tomorrow's maximum temperature. Our tree's visual representation has increased our understanding of the problem domain, enabling us to discern which data to consider when making predictions.

Variable Importances

To assess the significance of all the variables within the random forest, we can examine their relative importances. The importances, obtained from Scikit-learn, indicate how much including a particular variable enhances the prediction. While the precise calculation of importance is beyond the scope of this post, we can utilize these values to make relative comparisons between variables.

The provided code leverages several useful techniques in the Python language, including list comprehensions, zip, sorting, and argument unpacking. While comprehending these techniques is not crucial at the moment, they are valuable tools to have in your Python repertoire if you aspire to enhance your proficiency with the language.

# Get numerical feature importances
importances = list(rf.feature_importances_)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances 
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];
Variable: temp_1               Importance: 0.7
Variable: average              Importance: 0.19
Variable: day                  Importance: 0.03
Variable: temp_2               Importance: 0.02
Variable: friend               Importance: 0.02
Variable: month                Importance: 0.01
Variable: year                 Importance: 0.0
Variable: week_Fri             Importance: 0.0
Variable: week_Mon             Importance: 0.0
Variable: week_Sat             Importance: 0.0
Variable: week_Sun             Importance: 0.0
Variable: week_Thurs           Importance: 0.0
Variable: week_Tues            Importance: 0.0
Variable: week_Wed             Importance: 0.0

At the top of the importance list is "temp_1," the maximum temperature of the day before. This finding confirms that the best predictor of the maximum temperature for a given day is the maximum temperature recorded on the previous day, which aligns with our intuition. The second most influential factor is the historical average maximum temperature, which is also a logical result. Surprisingly, your friend's prediction, along with variables such as the day of the week, year, month, and temperature two days prior, appear to be unhelpful in predicting the maximum temperature. These importances make sense, as we wouldn't expect the day of the week to have any bearing on the weather. Additionally, the year remains the same for all data points, rendering it useless for predicting the maximum temperature.

In future implementations of the model, we can exclude these variables with negligible importance, and the performance will not suffer. Moreover, if we were to employ a different model, such as a support vector machine, we could utilize the random forest feature importances as a form of feature selection. To demonstrate this, we can swiftly construct a random forest using only the two most significant variables—the maximum temperature one day prior and the historical average—and compare its performance to the original model.

# New random forest with only the two most important variables
rf_most_important = RandomForestRegressor(n_estimators= 1000, random_state=42)
# Extract the two most important features
important_indices = [feature_list.index('temp_1'), feature_list.index('average')]
train_important = train_features[:, important_indices]
test_important = test_features[:, important_indices]
# Train the random forest
rf_most_important.fit(train_important, train_labels)
# Make predictions and determine the error
predictions = rf_most_important.predict(test_important)
errors = abs(predictions - test_labels)
# Display the performance metrics
print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')
mape = np.mean(100 * (errors / test_labels))
accuracy = 100 - mape
print('Accuracy:', round(accuracy, 2), '%.')
Mean Absolute Error: 3.9 degrees.
Accuracy: 93.8 %.

This insight highlights that we do not necessarily require all the collected data to make accurate predictions. In fact, if we were to continue using this model, we could narrow down our data collection to just the two most significant variables and achieve nearly the same level of performance. However, in a production setting, we would need to consider the trade-off between decreased accuracy and the additional time and resources required to gather more information. Striking the right balance between performance and cost is a vital skill for a machine learning engineer and will ultimately depend on the specific problem at hand.

At this stage, we have covered the fundamentals of implementing a random forest model for a supervised regression problem. We can be confident that our model can predict the maximum temperature for tomorrow with 94% accuracy, leveraging one year of historical data. From here, feel free to experiment with this example or apply the model to a dataset of your choice. To conclude, I will delve into a few visualizations. As a data scientist, I find great joy in creating graphs and models, and visualizations not only provide aesthetic pleasure but also assist us in diagnosing our model by condensing a wealth of numerical information into easily comprehensible images.

Visualizations

To visualize the discrepancies in the relative importance of the variables, I will create a straightforward bar plot of the feature importances. Plotting in Python can be a bit unintuitive, and I often find myself searching for solutions on Stack Overflow when creating graphs. Don't worry if the code provided doesn't fully make sense—sometimes, understanding every line of code isn't essential to achieve the desired outcome!

# Import matplotlib for plotting and use magic command for Jupyter Notebooks
import matplotlib.pyplot as plt
%matplotlib inline
# Set the style
plt.style.use('fivethirtyeight')
# list of x locations for plotting
x_values = list(range(len(importances)))
# Make a bar chart
plt.bar(x_values, importances, orientation = 'vertical')
# Tick labels for x axis
plt.xticks(x_values, feature_list, rotation='vertical')
# Axis labels and title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances');

(Bar chart of the variable importances.)
Next, we can plot the entire dataset with the predictions highlighted. This requires a little data manipulation, but it's not too difficult. We can use this plot to determine if there are any outliers in either the data or our predictions.

# Use datetime for creating date objects for plotting
import datetime
# Dates of training values
months = features[:, feature_list.index('month')]
days = features[:, feature_list.index('day')]
years = features[:, feature_list.index('year')]
# List and then convert to datetime object
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]
# Dataframe with true values and dates
true_data = pd.DataFrame(data = {'date': dates, 'actual': labels})
# Dates of predictions
months = test_features[:, feature_list.index('month')]
days = test_features[:, feature_list.index('day')]
years = test_features[:, feature_list.index('year')]
# Column of dates
test_dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
# Convert to datetime objects
test_dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in test_dates]
# Dataframe with predictions and dates
predictions_data = pd.DataFrame(data = {'date': test_dates, 'prediction': predictions})
# Plot the actual values
plt.plot(true_data['date'], true_data['actual'], 'b-', label = 'actual')
# Plot the predicted values
plt.plot(predictions_data['date'], predictions_data['prediction'], 'ro', label = 'prediction')
plt.xticks(rotation = '60'); 
plt.legend()
# Graph labels
plt.xlabel('Date'); plt.ylabel('Maximum Temperature (F)'); plt.title('Actual and Predicted Values');

(Plot of the actual values over the year with the test-set predictions highlighted.)
Creating a visually appealing graph does require a bit of effort, but the end result is worth it! From the data, it appears that we don't have any noticeable outliers that need to be addressed. To gain further insights into the model's performance, we can plot the residuals (i.e., the errors) to determine if the model tends to over-predict or under-predict. Additionally, examining the distribution of residuals can help assess if they follow a normal distribution. However, for the purpose of this final chart, I will focus on visualizing the actual values, the temperature one day prior, the historical average, and our friend's prediction. This visualization will help us discern the difference between useful variables and those that provide less valuable information.
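As an aside, a minimal residual check (not included in the original notebook) could look like the following; it reuses the full model rf and the matplotlib import from the bar chart above:

# Residual check: positive residuals mean the model predicted too low,
# negative residuals mean it predicted too high
residuals = test_labels - rf.predict(test_features)
plt.hist(residuals, bins = 20)
plt.xlabel('Residual (F)'); plt.ylabel('Count'); plt.title('Distribution of Residuals')
plt.show()

With that aside out of the way, here is the code for the final chart: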

# Make the data accessible for plotting
true_data['temp_1'] = features[:, feature_list.index('temp_1')]
true_data['average'] = features[:, feature_list.index('average')]
true_data['friend'] = features[:, feature_list.index('friend')]
# Plot all the data as lines
plt.plot(true_data['date'], true_data['actual'], 'b-', label  = 'actual', alpha = 1.0)
plt.plot(true_data['date'], true_data['temp_1'], 'y-', label  = 'temp_1', alpha = 1.0)
plt.plot(true_data['date'], true_data['average'], 'k-', label = 'average', alpha = 0.8)
plt.plot(true_data['date'], true_data['friend'], 'r-', label = 'friend', alpha = 0.3)
# Formatting plot
plt.legend(); plt.xticks(rotation = '60');
# Labels and title
plt.xlabel('Date'); plt.ylabel('Maximum Temperature (F)'); plt.title('Actual Max Temp and Variables');

(Plot of the actual max temperature alongside temp_1, average, and friend.)
The lines on the chart may appear a bit crowded, but we can still observe why the maximum temperature one day prior and the historical average maximum temperature are valuable for predicting the maximum temperature. Conversely, it's evident that our friend's prediction does not provide significant predictive power (but let's not completely dismiss our friend's input, although we should exercise caution in relying heavily on their estimate). Creating graphs like this in advance can assist us in selecting the appropriate variables to include in our model, and they also serve as valuable diagnostic tools. Just as in Anscombe's quartet, graphs often reveal insights that quantitative numbers alone may overlook. Including visualizations as part of any machine learning workflow is highly recommended.

Conclusions

With the inclusion of these graphs, we have successfully completed an end-to-end machine learning example! To further enhance our model, we can explore different hyperparameters, experiment with alternative algorithms, or, perhaps most effectively, gather more data. The performance of any model is directly influenced by the quantity and quality of the data it learns from, and our training data was relatively limited. I encourage everyone to continue refining this model and share their findings. Furthermore, for those interested in delving deeper into the theory and practical application of random forests, there are numerous free online resources available. If you're seeking a comprehensive book that covers both theory and Python implementations of machine learning models, I highly recommend "Hands-On Machine Learning with Scikit-Learn and TensorFlow." Lastly, I hope that those who have followed along with this example have realized the accessibility of machine learning and are motivated to join the inclusive and supportive machine learning community.
