DEV Community

loading...
Cover image for How I am learning machine learning - week 8: basic models in scikit-learn

How I am learning machine learning - week 8: basic models in scikit-learn

Gabriele Boccarusso
programmer and tech appassionate
・6 min read

Last week we saw how to clean our data, so now we should be ready to see some machine learning model. The most basics are the regression to predict a number and classification to predict a category.

Table of contents:

Importing a scikit learn toy dataset

A machine learning algorithm is usually called a model or estimator and scikit-learn offers us a cheat-sheet with everything we can use:

scikit learn algorithms cheat sheet
In our case, we already know that we want to try a regression estimator, but this cheat-sheet will come in hand very often.

Let's begin with importing one of the sample datasets that scikit offers:

importing one of the scikit learn sample datasets
from a quick look we can understand that this is a dictionary with four important keys:

  • data: a numpy array with a multidimensional shape that contains all our data
  • target: a series that contains the values that we'll predict
  • feature_names: the names of every column corresponding to the shape of the data.
  • DESCR: a description of our dataset.

note that this is a toy dataset, usually, our data will not come with all this information in handy

to import it and use it our code will be similar to:

# importing the sample dataset
from sklearn.datasets import load_diabetes
# assigning it to a variable
diabetes = load_diabetes()
# reading the dataset description
print(diabetes.DESCR)
Enter fullscreen mode Exit fullscreen mode

and the result will be:

viewing the information about the scikit learn dataset

Now, to convert the dictionary into a dataframe we have to first transform it into a pandas dataframe and we can do it easily with the functions that we already know:

diabetes_df = pd.DataFrame(diabetes['data'], 
                           columns = diabetes['feature_names'])
diabetes_df['target'] = pd.Series(diabetes['target'])
diabetes_df.head()
Enter fullscreen mode Exit fullscreen mode

Using a regression model

now that we have everything prepared we have to simply import the regression model that we need. So, based on the cheat-sheet, let's try the ridge model:

# importing the model
from sklearn.linear_model import Ridge

# setting up the seed 
np.random.seed(10)

# creating the data
x = diabetes_df.drop('target', axis = 1)
y = diabetes_df['target']

# splitting the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.15)

# instantiating the model
model = Ridge()
model.fit(x_train, y_train)

# checking up the score of the model
model.score(x_test, y_test)
Enter fullscreen mode Exit fullscreen mode

in this case, the output will be:

0.5011092582547553
Enter fullscreen mode Exit fullscreen mode

Now let's see in detail what we did: we first import the model and set up the seed with numpy, we split our data into train and test for then arriving at the clue part, the fit and the score functions.
We create the model variable and assign to it the Ridge() model, on the model we apply first the fit method. In machine learning fitting equals training. The training process finds the coefficients of the model equation, in this case, regression.
The score method evaluates the accuracy score of the model for then printing it on the screen.
In the notebook, all of this be:

using the ridge estimator
In this case, we have very low accuracy but, strangely enough, using a better model, the random forest regressor, we'll still have low results:

using the random forest regressor in scikit learn
Optimizing our models is something we'll see later on. It works, and this is already a great starting point.

classification models

Let's import another toy dataset from the one that scikit learn offers to see some classification models:

importing the iris dataset from scikit learn default datasets
after viewing the dictionary we can see that there isn't an array with the name of our columns but only an array with the corresponding string of what every target should be.
We'll have to do a bit of manual labor:

column_names = ['sepal length', 'sepal width', 'petal length', 'petal width']
iris_df = pd.DataFrame(iris.data, columns = column_names)
iris_df['target'] = iris['target']
iris_df.head()
Enter fullscreen mode Exit fullscreen mode

and the result will look like this:

crating a dataframe out of the scikit learn default dictionary data

but the numbers in the target column are confusing and it is not efficient to go back and forth to see what a category a number indicates, this was my solution:

# changing target values with strings
iris_df['target'] = iris_df['target'].astype(str)

for i in range(len(iris_df['target'])):
    if iris_df['target'][i] == '0':
        iris_df['target'][i] = iris.target_names[0]
    elif iris_df['target'][i] == '1':
        iris_df['target'][i] = iris.target_names[1]
    elif  iris_df['target'][i] == '2':
        iris_df['target'][i] = iris.target_names[2]
Enter fullscreen mode Exit fullscreen mode

this is the moment when we should convert our dataframe with one-hot encoding, but with the latest versions of scikit, we can even do not do it. Previously it would have thrown an error, now works without problems.

Everything is ready, we have to just train our model on the data, so let's try the random forest classifier:

# importing the random forest classifier estimator
from sklearn.ensemble import RandomForestClassifier

# set up random seed
np.random.seed(10)

# make the data
x = iris_df.drop('target', axis = 1)
y = iris_df['target']

# split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.15)

# instantiate random forest
clf = RandomForestClassifier() # clf is short for classifier
clf.fit(x_train, y_train)

# evaluate random forest
clf.score(x_test, y_test)
Enter fullscreen mode Exit fullscreen mode

using the random forest classifier

overview of the random forest logic

Now we have successfully used our model to predict the flower categories. We used the random forest classifier and regressor, two algorithms that have a simple logic behind them that makes them highly effective: they build and train various decision trees to then output the average of their decisions.

simplified explanation of the random forest algorithm

We can use the predict and predict_proba functions to see what we are doing a bit further.
the predict function output the vote of the trees. If used for the first 5 samples of x_test it will return the decision of what every target should be:
using the predict function in scikit learn
but every prediction isn't 100% accurate and every sample has a probability of being a class or another one. the predict_proba function shows us this probability in detail:
using the predict_proba function in scikit learn
It returns an array with all the data in it. For classification, every element of the array is a nested array with the probabilities in it. our possible cases are in order: Setosa, Versicolour, and Virginica. If we look at the top of the output we can see three numbers where everyone corresponds to the probability of a plant being a specific type: in the first case there is a probability of 0% that the plant is Setosa, 95% of it being Versicolor and 5% of it being Virginica, the prediction will then be Versicolor, and if we see the precedent figure we can see that the target of the test is Versicolor.

Last thoughts

This week we saw the basics of classifications and regression models for then having an overview of how they reach a result, next week we'll talk about evaluation.

Discussion (0)