Hari Krishnan
Data preprocessing

Whenever you build a machine learning model, there is a data preprocessing phase to work through. The data has to be preprocessed correctly so that the model you build can be trained properly on it.

Dataset for preprocessing

Let us consider data from a hospital where each row contains the country, the age, the platelet count, and whether the patient survived or not. This data can be used to train a model to predict the survival rate for the following:

  1. Country-wise survival rate with respect to platelets
  2. Age-wise survival rate with respect to platelets
  3. Country- and age-wise survival rate with respect to platelets, etc.

Let us consider the following dataset:

    Country      Age    Platelets   Survived

1.  India        23     45000       Yes
2.  India        70     50000       No
3.  USA          40     70000       Yes
4.  Australia    60     100000      No
5.  USA          60     100000      Yes
6.  India               100000      Yes
7.  India        40                 Yes
...
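Saved as a CSV file, the same data would look roughly as follows (an assumed layout for the Data.csv file read in Step 2; missing values appear as empty fields):

Country,Age,Platelets,Survived
India,23,45000,Yes
India,70,50000,No
USA,40,70000,Yes
Australia,60,100000,No
USA,60,100000,Yes
India,,100000,Yes
India,40,,Yes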

Steps to be followed in Data Preprocessing

  1. Importing the libraries required for data preprocessing
  2. Importing the dataset required for data preprocessing
  3. Take care of missing data
  4. Encoding categorical data
  5. Splitting the data set into the training set and the test set
  6. Feature scaling

Prerequisites for development

Python installed and the environment set up, plus Visual Studio Code with the Jupyter Notebook extension added.

Step 1: Importing the libraries

The first step is to import the libraries and have them ready every time we build a new machine learning model. We will be importing three required libraries:

  1. NumPy:

    NumPy allows you to work with arrays. Most machine learning models require you to work with arrays.

  2. Matplotlib:

    Matplotlib allows you to plot charts based on the available dataset.

  3. Pandas:

    Pandas not only allows you to import the dataset, but also to create the matrix of features and the dependent variable vector. We will look at these concepts shortly.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Step 2: Importing the dataset

Using the pandas library, we will read the data from the CSV file. An additional important step is required: creating two new entities:

  1. Matrix of features
  2. Dependent variable vector

An important principle to know:

In any dataset used to train a machine learning model, you have the same two entities: the features and the dependent variable vector.

Feature columns are the ones used to predict the output; the output column is the dependent variable. The dataset itself should be prepared so that the dependent variable is in the last column. Features are also called independent variables.

In our case, the columns Country, Age and Platelets are the features and the last column, Survived, is the dependent variable. Always remember that in any machine learning dataset, the features should come first and the dependent variable should be in the last column.

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values
print(X)
print(Y)

In the above code, X will contain the features and Y will contain the dependent variable. iloc is the pandas indexer used to select rows and columns from a dataset by position, using indices or ranges.
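As a quick illustration of the slicing used above, here is a minimal sketch on a tiny made-up frame with the same column layout as the sample table:

import pandas as pd

# A tiny illustrative frame with the same column layout as the sample table
df = pd.DataFrame({
    'Country':   ['India', 'USA'],
    'Age':       [23, 40],
    'Platelets': [45000, 70000],
    'Survived':  ['Yes', 'Yes'],
})

# iloc[rows, columns] -- all rows, every column except the last: the features
print(df.iloc[:, :-1].values)
# all rows, only the last column: the dependent variable
print(df.iloc[:, -1].values)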

Step 3: Taking care of missing data

In the last two rows of our dataset, one row has a missing age and the other has missing platelets. Generally you shouldn't have any missing data in the dataset, as it can cause errors while training the machine learning model, so these rows have to be handled. There are several ways to handle them:

1. Ignoring the rows with missing values

This method works when you have a very large dataset and only 1% or 2% of the data is missing; removing those rows won't noticeably affect the learning quality of your model. When a larger amount of data is missing, however, it has to be handled differently.
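A minimal sketch of this first approach in pandas, assuming the dataset has been loaded as in Step 2:

# Drop every row that still contains at least one missing value (NaN)
dataset_no_missing = dataset.dropna()
print(dataset_no_missing)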

2. Replace the missing value with the average of all the available values in the column

For performing this kind of operation, we have one of the best data science libraries available: scikit-learn. It is a library that contains a lot of tools, including data preprocessing tools.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:,1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)

The above code replaces the missing values in the Age and Platelets columns with the mean of all the available values in the respective column. Please note that this strategy can only be applied to numerical columns.
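For intuition, the same mean imputation can be sketched directly in pandas (this is not part of the original pipeline; the column names are taken from the sample table):

# Equivalent idea in plain pandas: fill each numeric column's NaNs with its mean
for col in ['Age', 'Platelets']:
    dataset[col] = dataset[col].fillna(dataset[col].mean())
print(dataset)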

Step 4: Encoding categorical data

As you can see, the Country column contains categories such as India, Australia and the USA. Text categories like these make it difficult for a machine learning model to compute correlations between the feature columns and the dependent variable (the outcome).

Hence, we have to turn these categories into numbers. One method is to encode USA as 0, Australia as 1 and India as 2. However, if you do this, your machine learning model may conclude that, because USA is 0, Australia is 1 and India is 2, there is a numerical order between the three countries, and it could interpret that order as meaningful. You need to avoid this kind of interpretation by your model.

One hot encoding

This method turns the Country column into several columns, one per category. In our dataset there are three countries, so three columns are created; if there were more categories, that many columns would be generated.

It creates a binary vector for each country. Very simply, India could become 100, Australia 010 and USA 001 (the exact column order depends on the encoder), and hence there is no numerical order between the countries. Please follow the two steps below for one hot encoding.

1. Encoding the independent variable

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)

2. Encoding the dependent variable using Label Encoding

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
Y = le.fit_transform(Y)
print(Y)

Step 5: Splitting the data into Training set and Test set

This step creates two separate sets: a training set, on which you train your machine learning model on existing observations, and a test set, on which you evaluate the performance of the model.

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.2, random_state = 1)
print(X_train)
print(X_test)
print(Y_train)
print(Y_test)

Step 6: Feature scaling

It will allow you to put all your features on the same scale.

Why should you do this?

For some machine learning models, scaling is needed to prevent some features from being dominated by other features to such an extent that the dominated features are hardly considered by the model at all.

When to apply feature scaling?

You do not need to apply feature scaling for every machine learning model, only for those where the feature values vary widely in range, units, etc. Feature scaling helps weigh all the features equally and improves the quality of your machine learning model.
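To see why this matters, here is a minimal sketch (values taken from the sample table) of how a distance-based model would be dominated by the Platelets feature if Age and Platelets were left on their raw scales:

import numpy as np

# Two patients as (age, platelets); values taken from the sample table
a = np.array([23.0, 45000.0])
b = np.array([70.0, 50000.0])

# Raw Euclidean distance: dominated by the platelet difference,
# the 47-year age gap barely registers next to the 5000-platelet gap
print(np.linalg.norm(a - b))                 # ~5000.2

# Min-max scale each feature over these two points (purely illustrative)
lo, hi = np.minimum(a, b), np.maximum(a, b)
a_scaled = (a - lo) / (hi - lo)
b_scaled = (b - lo) / (hi - lo)

# Now both features contribute comparably to the distance
print(np.linalg.norm(a_scaled - b_scaled))   # ~1.41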

How to apply feature scaling?

There are two techniques to apply feature scaling:

1. Standardisation:

This typically puts the values of a feature roughly between -3 and +3. When you apply this transformation to all the different features, they all take values in roughly the same range.

2. Normalisation:

This rescales all the values of a feature to lie between 0 and 1.
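The two techniques as formulas, sketched with NumPy (a minimal illustration; x stands for any numeric feature column, here the Age values from the sample table):

import numpy as np

x = np.array([23, 70, 40, 60, 60], dtype=float)  # e.g. the Age column

# Standardisation: x' = (x - mean(x)) / std(x)
standardised = (x - x.mean()) / x.std()

# Normalisation (min-max scaling): x' = (x - min(x)) / (max(x) - min(x))
normalised = (x - x.min()) / (x.max() - x.min())

print(standardised)
print(normalised)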

Which of the two techniques should you opt for?

Normalisation is recommended when most of your features follow a normal distribution. A normal distribution is a probability distribution that resembles a bell-shaped curve, for example the heights of children.

Standardisation works well in practically all cases, so it is the technique you will use most of the time, except for features that follow a normal distribution. It will generally improve the training process.

We do not apply feature scaling on the full dataset; instead, the scaler is fitted on the training set and then used to transform the training set and the test set separately. Also note that we do not apply feature scaling to the one-hot encoded columns: their values are already 0 or 1, which is well within the scaled range, and scaling them would destroy their interpretation as categories.

In our dataset, we apply feature scaling to the Age and Platelets features. The code below transforms the values of these two features to roughly between -2 and +2. All the values of these two features are then on the same scale, which improves the training of most machine learning models.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])
print(X_train)
print(X_test)

Note: In practice we will be working with much larger datasets; the dataset I have prepared here contains fewer than 10 rows and is just for illustration.
