In this third and final lecture of the "Data Science With Docker" series we will dig a little deeper into the essentials of Data Science by creating a model that actually predicts results. So bear with me as we go through each section of this tutorial. If you read my previous two lectures, then you're already familiar with the development environment I'm using with Docker-Compose. We're going to install new Python packages, load a new data set, create a model using the random forest algorithm that predicts results based on the data frame, and finally store the model in our MinIO service.
My repository is public, and there you'll find everything you need for this lecture:
If you're only interested in the code that I'll explain later on, go straight to the data-model folder.
OK! Without further ado, let's begin.
Part 1: The Data Set (a new one).
Before we introduce the data set I need to speak a little bit about scikit-learn. This is a Python library, designed to inter-operate with the scientific libraries NumPy and SciPy. Among other fancy things, it is used for regression and clustering algorithms, and most importantly for us: classification methods (more on that later). If you're interested in knowing more about scikit-learn here's the official documentation where you'll find lots of handy tutorials as well: https://scikit-learn.org/stable/
Install the library:
Back to business. Now that we've introduced our useful new friend, let's talk about the data set I chose for this lecture: the digits data set, which scikit-learn loads with load_digits.
Now, this data set has something special. As usual, it contains a bunch of data, and we as Python programmers and Data Scientists must find out what it is for. The name of the data set speaks for itself, though, so it's pretty easy to guess that we're loading the digits of some numbers.
Let's take a look at the data set (remember: always know the data you're dealing with). But first I need to add all my libraries to the notebook.
Next, I'm going to import and load the data set into the notebook:
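As a minimal sketch of this step, assuming scikit-learn is already installed in the notebook environment, loading the data set could look like this. The variable name digits comes from the text; the two print calls are only my addition to show the size of the data.

```python
from sklearn.datasets import load_digits

# Load the handwritten-digits data set that ships with scikit-learn
digits = load_digits()

# `data` holds one flattened image per row; `target` holds the label of each row
print(digits.data.shape)
print(digits.target.shape)
```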
Awesome, now that I have the data set in the digits variable, I can create a pandas data frame from it:
Okay, I know from the name of the data set that it contains numbers, but what does the data actually look like? Let's take a glimpse:
Looks like the .info() and .head() functions are not very useful this time. I know that the data has numbers, but what kind of numbers? And most importantly, why on earth are there so many float values?
We need something else to get a good picture of the type of data we're dealing with. I'm going to end the suspense here: this data set contains images of handwritten numbers, and the data we just saw corresponds to the pixels of each individual number. Okay, but how do we look at those numbers? For this we will use the magic of sklearn too. First let's take a look at the attributes of the data set, which is what Python's built-in dir() function is for:
Now it looks like we're going places. If you're wondering "how can I see those handwritten characters?", the short answer is "visualization", and when it comes to visualizing data, look no further than matplotlib. So now I'm going to create a for loop that shows the 10 numbers within this dataset:
Great! Now we know that we're dealing with images of 8 by 8 pixels each, stored as two-dimensional arrays of numbers. If you pay attention to the characters, they're actually numbers. That's why we saw just a bunch of numbers when we looked at the pandas data frame. Let's review our data frame knowing this new information.
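One way to build that data frame, sketched under the assumption that the frame is created directly from digits.data (the name df is my own choice for illustration):

```python
import pandas as pd
from sklearn.datasets import load_digits

digits = load_digits()

# One row per image, one column per pixel (64 columns for 8x8 images)
df = pd.DataFrame(digits.data)
print(df.shape)
```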
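A quick glimpse could be taken roughly like this; df is again a hypothetical name for the data frame built from digits.data:

```python
import pandas as pd
from sklearn.datasets import load_digits

digits = load_digits()
df = pd.DataFrame(digits.data)

# Column types and memory usage
df.info()

# First five rows: each value is the intensity of one pixel
print(df.head())
```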
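Here is a small sketch of that inspection. dir() is a Python built-in, so it is called on the object rather than as a method; the extra shape check is my addition.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# List the attributes exposed by the loaded data set
print(dir(digits))

# `images` keeps every sample in its original two-dimensional form
print(digits.images.shape)
```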
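The loop might look something like the sketch below. I'm assuming that "the 10 numbers" means the first ten samples, and the grid layout, color map and titles are my own choices rather than the original code.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

# Draw the first ten images in a 2x5 grid, each titled with its label
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for ax, image, label in zip(axes.ravel(), digits.images, digits.target):
    ax.imshow(image, cmap="gray_r")
    ax.set_title(str(label))
    ax.axis("off")
plt.show()
```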
I'm going to map the 'target' attribute, by creating a new column in my pandas data frame:
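A minimal sketch of that mapping, assuming the data frame was built from digits.data as before (df is a hypothetical name):

```python
import pandas as pd
from sklearn.datasets import load_digits

digits = load_digits()
df = pd.DataFrame(digits.data)

# Add the label of each image as a new column next to its 64 pixel values
df["target"] = digits.target
print(df[["target"]].head())
```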
So, the 64 values in each record are the pixels of one individual number, and the new target column tells us which number it is.
Part 2: The Model - Random Forest
We know that our data set contains images of 64 pixels (8x8) of 10 numbers (0 to 9), stored in a two-dimensional array that we already put in a pandas data frame. Now, the goal of this lecture is to train a model; a model that will provide a good understanding of what is going on in each sample of the data set, or to put it in simpler words: A MODEL THAT WILL TRY TO INDICATE THE NUMBER IN EACH SAMPLE.
In order to achieve this, we will use the Random Forest algorithm. This algorithm is widely used in Machine Learning for classification, that is, for determining the correct, or most probable, answer for a given input. The "forest" in its name comes from the fact that it uses many decision trees to arrive at that answer. Decision trees are simple: they look like a flowchart, with the terminal nodes representing classification outputs or decisions:
As the name suggests, the Random Forest Classifier consists of a large number of individual decision trees. Each tree in the forest gives an answer (a prediction), and the answer with the majority of votes across all trees becomes the final answer of the predictive model. In other words, the model creates a number of scenarios (trees), lets each one try to solve the given problem, and in the end the most-voted answer is the answer of the model.
Now that we know what a Random Forest is and what it does, let's make our own in Python using sklearn.
So I'm going to import train_test_split and specify what portion of the data is going to be test data and what portion will be used for training the model:
Notice that the 0.2 means that 20% of the entire dataset will be used as testing data, and the remaining 80% will be training data.
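To make the voting idea concrete, here is a tiny illustration in plain Python. The votes are made up, and majority_vote is a hypothetical helper, not part of sklearn. Note also that scikit-learn's own RandomForestClassifier combines its trees by averaging their predicted class probabilities rather than by counting hard votes, so treat this as a picture of the concept only.

```python
from collections import Counter


def majority_vote(votes):
    """Return the class that received the most votes."""
    return Counter(votes).most_common(1)[0][0]


# Hypothetical predictions from five trees for one image
tree_votes = [3, 8, 3, 3, 5]
print(majority_vote(tree_votes))  # → 3
```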
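A sketch of the split, assuming the features are the 64 pixel columns and the labels are the target values. I pass digits.data and digits.target directly here; the pandas data frame without its target column would work the same way.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# Hold out 20% of the samples for testing; the rest is used for training
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2
)
```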
So, I've created the x_train, x_test, y_train and y_test variables. If you want to know the exact size of each individual variable, use the len() function:
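For example, with the same split as described above (again assuming digits.data and digits.target as inputs):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2
)

# Number of samples in the training and testing portions
print(len(x_train), len(x_test))  # → 1437 360
```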
Now that I've defined my training and testing samples it's time to import the random forest classifier to train the model:
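Training could be sketched as follows. The text does not show which parameters the classifier was created with, so this assumes the library defaults.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2
)

# Build the forest with default settings (an assumption) and fit it
model = RandomForestClassifier()
model.fit(x_train, y_train)
print(model)
```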
Notice that we are importing from sklearn.ensemble. An ensemble combines multiple models to predict an answer; here we're building multiple decision trees, and the answer with the highest number of votes will be the definitive answer of the model. After you run the previous line of code, the output you'll get should look like this:
Out of all this information, perhaps the most important for us is "n_estimators", because this is the number of trees in our forest. Now let's see how accurate this model is. I'm going to use the .score() function on this model, passing x_test and y_test as arguments:
Seems like our model is over 96% accurate on the test data (pretty good). Your exact number may differ, because both the split and the forest are random. If you want a higher degree of accuracy, try increasing n_estimators in the random forest classifier, and experiment with the test_size parameter in train_test_split; keep in mind that a larger test_size leaves less data for training.
Moving forward, let's see the distribution of errors and how our model is performing so far. We will plot our results using a Confusion Matrix, which is basically a table that describes the performance of a classification model on a set of test data for which the true values are known. It lets us visualize the performance of an algorithm.
First we need to assign a value to our y_predicted variable:
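A sketch of the scoring step. The random_state values are my addition so that repeated runs give the same number; they are not from the original notebook, and the name accuracy is hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
# random_state is an assumption, added only for repeatable results
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0
)
model = RandomForestClassifier(random_state=0)
model.fit(x_train, y_train)

# Fraction of test images whose digit was predicted correctly
accuracy = model.score(x_test, y_test)
print(accuracy)
```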
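Roughly, assuming the model and split from the previous steps:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2
)
model = RandomForestClassifier().fit(x_train, y_train)

# One predicted digit per test image
y_predicted = model.predict(x_test)
print(y_predicted[:10])
```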
Now let's import that fancy "Confusion Matrix":
Great, now I'm going to name the confusion matrix just "matrix", and we need to provide the true labels of the test sample (y_test) and the prediction (y_predicted):
Okay, if you take a look at the matrix, it is just a two-dimensional array. With sklearn's confusion_matrix, each row (the Y axis) represents the actual number in the test data, and each column (the X axis) represents the number the model predicted (from 0 to 9). So, for example, for the number 0 the matrix is telling me that the model predicted 0 for 36 samples whose actual value was 0. Values off the diagonal are mistakes. I'm going to plot this very matrix, using a visualization library called Seaborn, to visualize these results in a more pleasant way:
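A minimal sketch of building the matrix, with the same assumed setup as before. In sklearn the first argument is the true labels and the second is the predictions.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2
)
model = RandomForestClassifier().fit(x_train, y_train)
y_predicted = model.predict(x_test)

# Rows are true digits, columns are predicted digits
matrix = confusion_matrix(y_test, y_predicted)
print(matrix)
```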
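The plot could be produced along these lines, assuming seaborn is installed in the environment. The figure size, annotations and axis labels are my own choices for illustration.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2
)
model = RandomForestClassifier().fit(x_train, y_train)
matrix = confusion_matrix(y_test, model.predict(x_test))

# Heatmap with the count written inside every cell
plt.figure(figsize=(10, 7))
ax = sns.heatmap(matrix, annot=True, fmt="d")
ax.set_xlabel("Predicted")
ax.set_ylabel("Truth")
plt.show()
```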
You be the judge on how well the model predicted the numbers.
Part 3: Persisting the Model - MinIO
We've successfully created a predictive model, but what if we need to keep it somewhere so we can have it whenever we need it? Let's start the process of storing it. If you read part 1 of Data Science With Docker, then you know that I'm going to use MinIO for this task, and we already have the MinIO package installed in our environment, so let's get our hands dirty again!
First we need to import joblib, which will later allow us to create a file out of our model, called "model":
Now it's time to package our model into a file, and I'm going to use the .pk1 extension. If you'd like to know more about this package, here's the documentation:
This step is as simple as it sounds: just dump the model into a .pk1 file:
This will save the "model.pk1" file in your Jupyter Server workspace, so go back to your home Files tab and take a look at it:
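A sketch of the dump, keeping the file name used in the text. joblib does not care about the extension; .pkl is the more common spelling for pickle files. The reload at the end is my addition, just to show that the file really contains a working model.

```python
import joblib
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2
)
model = RandomForestClassifier().fit(x_train, y_train)

# Serialize the trained model into the current working directory
joblib.dump(model, "model.pk1")

# Optional sanity check: load it back and score it again
restored = joblib.load("model.pk1")
print(restored.score(x_test, y_test))
```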
Moving on, the only step left is to persist it in our MinIO service.
To access MinIO, open a new tab in your web browser and type the URL/IP of your scientific environment, with port 9000:
The credentials of the MinIO service are in the .yml file of our Docker scientific environment. Now, the MinIO interface is pretty clean and straightforward. There's really not much to it: it's a place to store your Data Science, Machine Learning and Artificial Intelligence models.
We're almost done. In order to gain access to our local MinIO service from the Jupyter Server we need to create a Python client for it, using the MinIO package:
If you remember my previous lecture, this is like creating the connection string for PostgreSQL. Now, MinIO is an object storage service compatible with the Amazon S3 API, so it uses a concept called "buckets" to persist its content. If you're interested in learning more about MinIO and how to manage it, I'm leaving the official documentation here, and I encourage you to read it.
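Creating the client might look like the sketch below. build_client is a hypothetical helper, and the endpoint and keys in the commented-out call are placeholders: use the host, port and credentials from your own .yml file. I'm also assuming the local service is served over plain HTTP, hence secure=False.

```python
def build_client(endpoint, access_key, secret_key):
    """Create a MinIO client for a local, plain-HTTP service."""
    # Imported lazily so this helper can be defined even where
    # the minio package is not installed
    from minio import Minio

    return Minio(
        endpoint=endpoint,
        access_key=access_key,
        secret_key=secret_key,
        secure=False,  # assumption: no TLS in the local environment
    )


# Placeholder values, not the real ones from the environment:
# client = build_client("localhost:9000", "YOUR_ACCESS_KEY", "YOUR_SECRET_KEY")
```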
Now, I'm going to create the bucket for my model:
And the big ending: upload the model to our MinIO service, in its bucket:
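The bucket creation and the upload can be sketched together. upload_model is a hypothetical helper that expects an already connected MinIO client; bucket_exists, make_bucket and fput_object are methods of the MinIO Python client, while the bucket name in the commented-out call is only a placeholder.

```python
def upload_model(client, bucket_name, object_name, file_path):
    """Create the bucket if it is missing, then upload the file into it."""
    if not client.bucket_exists(bucket_name=bucket_name):
        client.make_bucket(bucket_name=bucket_name)
    client.fput_object(
        bucket_name=bucket_name,
        object_name=object_name,
        file_path=file_path,
    )


# Placeholder bucket name; `client` is the MinIO client created earlier:
# upload_model(client, "models", "model.pk1", "model.pk1")
```

Checking for the bucket first means the helper can be run more than once without failing on an existing bucket.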
Now, if we go to the MinIO service, we'll find our bucket with the "model.pk1" file ready to use:
Part 4: Final thoughts and conclusions.
"Gracias por tu tiempo y dedicación." Just a kind message to you in my mother tongue, meaning: "thanks for your time and dedication". It's been a long journey, but hopefully you'll find this 3-part "Data Science With Docker" series useful. We started from scratch: from the creation and configuration of our scientific environment for data science with Docker Compose, to understanding the basics of Python for data analysis and how to persist our data frame in a relational database. We even covered some dashboarding and data visualization, and finally we created and trained a model that is persisted in a dedicated data science service.
This is my humble, but very useful, contribution to this community. If you're like me, and you love technology and working with data, then I encourage you to keep learning; never stop learning. One of the most beautiful things about knowledge is that it does not belong to anyone; it belongs to everyone. Now let me finish this three-part series with one of my favorite quotes:
“No one can build you the bridge on which you, and only you, must cross the river of life. There may be countless trails and bridges and demigods who would gladly carry you across; but only at the price of pawning and forgoing yourself. There is one path in the world that none can walk but you. Where does it lead? Don’t ask, walk!”. ~ Friedrich Wilhelm Nietzsche.