Gabriele Boccarusso

How I am learning machine learning - week 7: organizing data with scikit-learn

Now that we know the fundamentals of the Python packages dedicated to machine learning, we can look into scikit-learn to begin doing something practical and to see some models.

Organizing the data

Before beginning a project it's important to organize our data. As we already know, we divide our dataset into 3 parts: training, validation and test.
We'll gather the data from a CSV file, so we begin by importing all the packages that we need:

# imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Then we have a quick look at the dataset:

looking at our dataset
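
The screenshot is not reproduced here, so the following is only a minimal sketch of this step. The CSV content is a tiny hypothetical stand-in for the baseball players file: the player names, the `Weight` and `Age` columns, and all values are assumptions made for illustration, and the variable name `baseball` is hypothetical too.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV file (the real one has many more rows)
csv_text = """Name,Team,Position,Height,Weight,Age
Player A,BAL,Catcher,74,180,22.9
Player B,BAL,Catcher,74,215,34.7
Player C,CWS,First Baseman,72,210,30.8
Player D,CWS,Outfielder,73,188,35.7
"""

# pd.read_csv accepts a file path or any file-like object
baseball = pd.read_csv(io.StringIO(csv_text))

print(baseball.head())   # the first rows of the dataframe
print(baseball.shape)    # (number of rows, number of columns)
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the path to the CSV.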

Splitting the data

Now that we know our dataset, it's time to divide it into training and test sets. To do so we first re-organize the data:

dividing our dataframe with the drop pandas function
We created the x-axis (the features, x_axis in the code) as our dataframe without the height column using the drop function, and then created the y-axis (the label that we want to predict) as just the height column.
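
As a rough sketch of that step, assuming the column is spelled `Height` and using a small made-up dataframe in place of the real one (the name `y_axis` is also an assumption, chosen to mirror `x_axis`):

```python
import pandas as pd

# Small made-up dataframe standing in for the baseball players dataset
baseball = pd.DataFrame({
    "Team": ["BAL", "BAL", "CWS", "CWS"],
    "Position": ["Catcher", "Catcher", "First Baseman", "Outfielder"],
    "Weight": [180, 215, 210, 188],
    "Height": [74, 74, 72, 73],
})

# Features: every column except the one we want to predict.
# drop returns a new dataframe and leaves the original untouched.
x_axis = baseball.drop("Height", axis=1)

# Label: only the column we want to predict
y_axis = baseball["Height"]

print(x_axis.columns.tolist())
print(y_axis.tolist())
```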

We can now split our dataset into the training and test sets using the dedicated scikit-learn function train_test_split:

splitting our dataset with the dedicated scikit-learn function
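
The split can be sketched as follows. The data is made up, and `test_size=0.2` and `random_state=42` are assumptions picked for the example, not necessarily the values used in the screenshot.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data standing in for the real dataset
baseball = pd.DataFrame({
    "Weight": [180, 215, 210, 188, 176, 209, 200, 231, 180, 188],
    "Height": [74, 74, 72, 72, 73, 69, 69, 71, 76, 71],
})
x_axis = baseball.drop("Height", axis=1)
y_axis = baseball["Height"]

# Keep 20% of the rows for testing; random_state makes the shuffle repeatable
x_train, x_test, y_train, y_test = train_test_split(
    x_axis, y_axis, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape)
```

The function shuffles the rows before splitting and keeps features and labels aligned, so every row in `x_train` still matches its value in `y_train`.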

One hot encoding

One hot encoding is a way to represent values that aren't numbers. In our dataframe about baseball players every player belongs to a team, but a team is a string of characters, and most machine learning models work only with numbers.
With one hot encoding we transform these categorical features into numeric columns: every category gets its own column, holding 1 where the row belongs to that category and 0 otherwise.
Before one hot encoding, our dataframe would be similar to this:

before one hot encoding
After one hot encoding, it would be similar to this:

example of one hot encoding
Now we can proceed with the one hot encoding. Our code will be:

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
categorical_features = ['Position', 'Team']
# transforming the columns
transformer = ColumnTransformer([('one_hot',
                                  OneHotEncoder(),
                                  categorical_features)])
encoded_features = transformer.fit_transform(x_axis)
encoded_features

We first create the transformer with ColumnTransformer, which takes as an argument a list of tuples (in this case only one), where every tuple holds a name, the transformer object (in our case the encoder), and the list of columns to apply it to.
We then call fit_transform, which learns the categories found in each column and returns the encoded result, and we store that result in a variable.
The output will be similar to:

<1033x39 sparse matrix of type '<class 'numpy.float64'>'
    with 2066 stored elements in Compressed Sparse Row format>

If we want to see what really happened, we could do the same thing with the pandas get_dummies function:
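
The original screenshot is missing, so this is only a sketch of what such a call could look like, on a small made-up dataframe:

```python
import pandas as pd

# Made-up features standing in for the real x_axis
x_axis = pd.DataFrame({
    "Team": ["BAL", "BAL", "CWS", "CWS"],
    "Position": ["Catcher", "Catcher", "First Baseman", "Outfielder"],
})

# get_dummies creates one column per category, named <column>_<category>
dummies = pd.get_dummies(x_axis[["Position", "Team"]])
print(dummies)
```

Unlike the sparse matrix, the result is a regular dataframe, so the new columns can be read directly.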


Both solutions produce the encoding that we need.

Missing values

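Filling missing values with scikit-learn is a very pragmatic process, but before jumping into the action it's important to know that we should always fill or transform our training and test datasets separately. Whatever we use to fill the gaps (for example, the mean of a column) is learned from the training set only and then applied to the test set, so that no information leaks from the test data into training and makes our results look better than they really are.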

As always, we begin by importing our dataset (link here) and then have a quick look at it:

viewing our dataframe

We can then chain two functions to know how many missing values are in each column of the dataset:

  • isna() returns a dataframe with every value replaced by True or False, depending on whether the element is missing
  • sum() counts, column by column, how many values are missing

using the isna and sum functions together
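
A minimal sketch of the two calls, on a made-up dataframe that reuses three of the column names from this dataset (the values are invented):

```python
import numpy as np
import pandas as pd

# Made-up grades with some missing values (np.nan)
grades = pd.DataFrame({
    "Assignment": [57.14, np.nan, 98.27],
    "Midterm": [64.38, 52.50, np.nan],
    "TakeHome": [np.nan, np.nan, 78.15],
})

# isna() marks every missing cell with True
missing_mask = grades.isna()

# sum() counts the True values of every column (True counts as 1)
missing_per_column = missing_mask.sum()
print(missing_per_column)
```

Here Assignment and Midterm have one missing value each and TakeHome has two.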

Now we have to split the dataframe into the x and y-axis, and then into training and test sets, to be able to handle the missing values:


Now that our data is ready, we can impute (fill) the missing values.
Our code will be similar to this:

# imputing means filling in missing values
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# filling the numerical values with their mean
num_imputer = SimpleImputer(strategy = 'mean')

# defining the columns
num_features = ['Prefix', 'Assignment', 'Tutorial','Midterm', 'TakeHome']

# creating the imputer
imputer = ColumnTransformer([
    ('num_features', num_imputer, num_features)
])

# filling train and test
filled_x_train = imputer.fit_transform(x_train)
filled_x_test = imputer.transform(x_test)

#checking the result
filled_x_train[:5]

Once the libraries are imported, we set up a transformer that fills each column with its average (strategy = 'mean') and then create a list with all the columns that we have to fill.
We can then create the imputer with ColumnTransformer, which accepts as a parameter a list of tuples (same as before) where every tuple has three elements: a name, the transformer, and where to apply the transformer (the names of the columns).
Finally, we fit the imputer on the training set and fill it, fill the test set with the same imputer, and check our results.
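
To see why fit_transform is used on the training set and only transform on the test set, here is a small sketch with made-up numbers. It reuses two column names from the code above; everything else is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Made-up training and test features with missing values
x_train = pd.DataFrame({
    "Midterm": [60.0, np.nan, 80.0],
    "TakeHome": [50.0, 70.0, np.nan],
})
x_test = pd.DataFrame({"Midterm": [np.nan], "TakeHome": [90.0]})

imputer = ColumnTransformer([
    ("num_features", SimpleImputer(strategy="mean"), ["Midterm", "TakeHome"])
])

# fit_transform learns the column means from the training set and fills it
filled_x_train = imputer.fit_transform(x_train)

# transform reuses the means learned from training; nothing is learned from the test set
filled_x_test = imputer.transform(x_test)

print(filled_x_train)
print(filled_x_test)
```

The missing Midterm value in the test set is filled with 70, the mean of the training Midterm column (60 and 80), not with anything computed from the test set itself.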

This way every missing value in our dataset is filled. If we had any string columns, we would create a transformer similar to

name_imputer = SimpleImputer(strategy="constant", fill_value="missing")

to fill every missing string with the value "missing".
A good article that goes deeper into ColumnTransformer is this one by Machine Learning Mastery.
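
As a rough sketch of how such a string imputer could sit next to the numerical one in the same ColumnTransformer (the `Name` column, the tuple name `name_features`, and the data are hypothetical, chosen only for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Made-up data with one string column and one numerical column
x = pd.DataFrame({
    "Name": pd.Series(["Ann", np.nan, "Carl"], dtype=object),
    "Midterm": [60.0, np.nan, 80.0],
})

name_imputer = SimpleImputer(strategy="constant", fill_value="missing")
num_imputer = SimpleImputer(strategy="mean")

# One tuple per group of columns: (name, transformer, columns)
imputer = ColumnTransformer([
    ("name_features", name_imputer, ["Name"]),
    ("num_features", num_imputer, ["Midterm"]),
])

filled = imputer.fit_transform(x)
print(filled)
```

The empty name becomes the string "missing", while the empty Midterm value becomes the mean of the column.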

Final thoughts

Now that we know how to organize our data and how to handle missing values, we can finally begin to look at some models next week. See you next time.
