Now that we know the fundamentals of the Python packages dedicated to machine learning, we can look into scikit-learn to start doing something practical and to see some models.
Before beginning a project, it's important to organize our data. As we already know, we divide our dataset into three parts: training, validation, and test.
We'll load the data from a CSV file, so we begin by importing all the packages that we need:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Then we take a quick look at the dataset:
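As a minimal sketch of this step, the snippet below loads a tiny CSV and inspects it. The original file is not shown here, so the CSV text, the column names, and the variable name players are all invented for illustration; with a real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file: rows and columns are invented.
csv_text = """Name,Team,Position,Height,Weight,Age
Player A,BAL,Catcher,74,180,22.99
Player B,BAL,Pitcher,72,215,34.69
Player C,NYY,Outfielder,75,210,30.78
Player D,BOS,Shortstop,71,190,25.10
"""

# read_csv accepts a file path or any file-like object
players = pd.read_csv(io.StringIO(csv_text))

print(players.head())   # first rows of the dataframe
print(players.shape)    # (number of rows, number of columns)
players.info()          # column types and non-null counts
```

head() is usually enough for a first impression, while info() is a quick way to spot columns that hold strings or have missing values.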
Now that we know our dataset, it's time to divide it into training and test sets. To do so, we first reorganize the data into features and target:
We can now split our dataset into the training and test sets using the dedicated scikit-learn function train_test_split:
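The two steps above could look roughly like this. It is only a sketch: the small dataframe is invented, and using Weight as the target column is an assumption made for the example, as are the names y_axis, x_train, x_test, y_train and y_test.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature of the players dataframe (invented rows).
players = pd.DataFrame({
    "Team": ["BAL", "BAL", "NYY", "NYY", "BOS", "BOS", "SEA", "SEA", "CHC", "CHC"],
    "Position": ["Catcher", "Pitcher", "Pitcher", "Shortstop", "Catcher",
                 "Outfielder", "Pitcher", "Outfielder", "Shortstop", "Pitcher"],
    "Height": [74, 72, 75, 71, 73, 76, 74, 72, 70, 77],
    "Weight": [180, 215, 210, 190, 188, 231, 200, 195, 176, 220],
})

# Re-organize the data: features on one side, target on the other.
# Choosing "Weight" as the target is an assumption for this example.
x_axis = players.drop("Weight", axis=1)
y_axis = players["Weight"]

# Keep 20% of the rows for testing; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x_axis, y_axis, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape)
```

Fixing random_state is optional, but it makes the same rows end up in each set every time the code runs, which helps when comparing results.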
One-hot encoding is a way to represent values that aren't numbers. In our dataframe of baseball players every player belongs to a team, but a team is a string, and machine learning models work only with numbers.
With one-hot encoding, we transform each categorical feature into a set of binary columns, one per category.
Before one-hot encoding, our dataframe would look similar to this:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ['Position', 'Team']

# transforming the columns
transformer = ColumnTransformer([('one_hot', OneHotEncoder(), categorical_features)])
encoded_features = transformer.fit_transform(x_axis)
encoded_features
We first create the transformer with ColumnTransformer, which takes as its argument a list of tuples (in this case only one). Every tuple holds a name, the transformer object (in our case the encoder), and the list of columns to apply it to.
We then call fit_transform, which fits the encoder on the data and returns the encoded result, and we store that result in a variable.
The output will be similar to:
<1033x39 sparse matrix of type '<class 'numpy.float64'>' with 2066 stored elements in Compressed Sparse Row format>
If we want to see what really happened, we can do the same thing with the pandas get_dummies function:
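A possible version of that comparison is sketched below on an invented four-row dataframe (the rows and the Height column are assumptions for the example). It builds the dummy columns with pandas, runs the same ColumnTransformer as before, and checks that both give the same 0/1 values.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of the players dataframe (invented rows).
x_axis = pd.DataFrame({
    "Team": ["BAL", "BAL", "NYY", "BOS"],
    "Position": ["Catcher", "Pitcher", "Pitcher", "Shortstop"],
    "Height": [74, 72, 75, 71],
})
categorical_features = ["Position", "Team"]

# pandas version: one readable 0/1 column per category
dummies = pd.get_dummies(x_axis[categorical_features], dtype=int)
print(dummies)

# scikit-learn version of the same encoding
transformer = ColumnTransformer([("one_hot", OneHotEncoder(), categorical_features)])
encoded_features = transformer.fit_transform(x_axis)

# The result may be a sparse matrix or a plain array depending on its density,
# so convert it only when a toarray method is available.
dense = encoded_features.toarray() if hasattr(encoded_features, "toarray") else encoded_features

# Both approaches order the categories alphabetically within each column,
# so the values can be compared position by position.
print((dense == dummies.to_numpy()).all())
```

The get_dummies output is easier to read because every column carries a name such as Team_BAL, while the encoder inside a ColumnTransformer is the one that can later be reused on new data.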
Filling missing values with scikit-learn is a straightforward process, but before jumping into action it's important to know that we should always fill or transform our training and test datasets separately: the imputer is fitted on the training set only and then applied to the test set. Otherwise information from the test set leaks into training and our evaluation looks better than it really is.
As always, we begin by importing our dataset and taking a quick look at it.
We can then chain two functions to find out how many missing values are in the dataset:
- isna() returns a dataframe with every value replaced by True or False depending on whether the element is missing
- sum() counts how many values are missing in each column
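Chained together, the two calls could be used as in this small sketch. The dataframe is invented; only the column names are borrowed from the imputing code later in this post, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the grades dataframe (invented values).
grades = pd.DataFrame({
    "Prefix": [5, 8, 8, np.nan],
    "Assignment": [57.14, 95.05, np.nan, 28.14],
    "Tutorial": [34.09, 105.49, 63.15, np.nan],
    "Midterm": [64.38, np.nan, 71.25, 86.26],
    "TakeHome": [51.48, 101.27, 58.23, np.nan],
})

# isna() gives True where a value is missing, False elsewhere
missing_mask = grades.isna()

# sum() treats True as 1, so it counts the missing values of each column
missing_per_column = missing_mask.sum()
print(missing_per_column)

# summing once more gives the total for the whole dataframe
print(missing_per_column.sum())
```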
Now we have to split the dataframe into x and y, and then into train and test, so that we can handle the missing values:
Now that our data is ready, we can impute (fill) the missing values.
Our code will be similar to this:
# imputing is a term used to fill missing values
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# filling the numerical values with their mean
num_imputer = SimpleImputer(strategy='mean')

# defining the columns
num_features = ['Prefix', 'Assignment', 'Tutorial', 'Midterm', 'TakeHome']

# creating the imputer
imputer = ColumnTransformer([
    ('num_features', num_imputer, num_features)
])

# filling train and test
filled_x_train = imputer.fit_transform(x_train)
filled_x_test = imputer.transform(x_test)

# checking the result
filled_x_train[:5]
Once the libraries are imported, we create a SimpleImputer that fills each column with its mean (strategy = 'mean') and then define a list with all the columns that we have to fill.
We can then create the imputer with ColumnTransformer, which accepts as a parameter a list of tuples (same as before) where every tuple has three elements: a name, the transformer, and the names of the columns to apply the transformer to.
Finally, we fit the imputer on the training set and fill it with fit_transform, apply the same fitted imputer to the test set with transform, and check our results.
This way every missing value in our dataset is filled. If we had any string columns, we would create a transformer similar to
name_imputer = SimpleImputer(strategy="constant", fill_value="missing")
to fill every missing string with the value "missing".
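To show how the string imputer could sit next to the numerical one, here is a minimal sketch on invented data. The Name column, its values, and the tiny train and test frames are assumptions for the example; the dataset used above may not contain any string column at all.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical train and test frames with one string and one numerical column.
x_train = pd.DataFrame({
    "Name": pd.Series(["Ann", np.nan, "Carl"], dtype=object),
    "Midterm": [60.0, np.nan, 80.0],
})
x_test = pd.DataFrame({
    "Name": pd.Series([np.nan], dtype=object),
    "Midterm": [np.nan],
})

# strings are filled with a constant, numbers with the column mean
name_imputer = SimpleImputer(strategy="constant", fill_value="missing")
num_imputer = SimpleImputer(strategy="mean")

# one tuple per group of columns: (name, transformer, columns)
imputer = ColumnTransformer([
    ("name_features", name_imputer, ["Name"]),
    ("num_features", num_imputer, ["Midterm"]),
])

# fit on the training set only, then reuse the fitted imputer on the test set
filled_x_train = imputer.fit_transform(x_train)
filled_x_test = imputer.transform(x_test)

print(filled_x_train)
print(filled_x_test)
```

Note that the missing Midterm value in the test row is filled with the mean computed on the training rows, which is exactly the separation between training and test data described earlier.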
A good article that goes deeper into ColumnTransformer is this one by Machine Learning Mastery.
Now that we know how to organize our data and how to handle missing values, we can finally start looking at some models next week. See you next time.