Nikhil Dhawan


Machine Learning - Data Preprocessing - 1

Hi all, in this article we will discuss the first step in building our model: data preprocessing.

Importing Libraries - We start by importing the most commonly used libraries, such as pandas and NumPy. Other libraries may be needed later, but these are enough for this example.

import numpy as np
import pandas as pd

Importing Datasets - For this example, we have a dataset that records whether or not a purchase was made by a person with particular attributes.

In this data, the first three columns are features and the last column is the dependent variable. We usually split the data along these lines, with X as the inputs and y as the output. The data is loaded with pandas and then split using the iloc indexer on the dataset.

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

So now we have our data separated and ready for further handling.
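To make the iloc split concrete, here is a minimal sketch that uses a small in-memory DataFrame instead of Data.csv. The column names and values are hypothetical and only mimic the layout described above (three feature columns followed by one target column).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Data.csv: three feature columns, one target column.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan],
    "Purchased": ["No", "Yes", "No", "No"],
})

X = dataset.iloc[:, :-1].values  # every row, all columns except the last
y = dataset.iloc[:, -1].values   # every row, only the last column

print(X.shape)  # (4, 3)
print(y.shape)  # (4,)
```

Because the features mix text and numbers, X comes back as a NumPy array of dtype object, which is why the later steps address columns by position.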

Handling missing data - If you look at the data, there are two missing values.

scikit-learn (sklearn) is one of the most widely used libraries for machine learning in Python, so we will use it here as well. The strategy is to replace each missing value with the mean of its column, using SimpleImputer from sklearn.impute.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

After this step, the missing values in columns 1 and 2 are replaced with the mean of their respective columns.
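As a small worked illustration of mean imputation, the sketch below runs SimpleImputer on a made-up numeric array (the numbers are hypothetical, not taken from Data.csv). It uses fit_transform, which combines the separate fit and transform calls shown above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns (for example age and salary), one gap in each.
values = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [40.0, 60000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(values)  # learn column means, then fill the gaps

print(imputer.statistics_)  # column means learned during fit: 37.0 and 62000.0
print(filled)               # the two NaN entries now hold those means
```

The means are computed only from the values that are present, so the first column's mean is (44 + 27 + 40) / 3 = 37.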

One more thing to take care of is the first column, which contains text values. Models generally cannot interpret strings directly, so we need to encode them as numbers before feeding them to the model.
For this, we use ColumnTransformer from sklearn.compose and OneHotEncoder from sklearn.preprocessing.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

Here, ColumnTransformer takes a transformers argument, which is a list of tuples. Each tuple holds a name for the transformer, the transformer itself, and the columns to transform. The remainder argument specifies what to do with the other columns; here it is passthrough, which means they are left unchanged.
In the output, the first column is replaced by one-hot encoded columns (one per distinct value), followed by the other columns unchanged.
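To show what the encoded matrix looks like, here is a minimal sketch on a made-up feature array. The country names and numbers are hypothetical; only the layout (a text column followed by two numeric columns) follows the article.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature matrix: a text column followed by two numeric columns.
X_demo = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
    ["Spain", 38.0, 61000.0],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
X_encoded = np.array(ct.fit_transform(X_demo))

print(X_encoded.shape)  # (4, 5): 3 one-hot columns + 2 passthrough columns
print(X_encoded[0])     # one-hot columns come first, then the untouched columns
```

OneHotEncoder orders the categories alphabetically, so in this sketch the three new columns correspond to France, Germany, and Spain, and each row has a 1 in exactly one of them.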

Hope it was helpful. In the next part of preprocessing, we will see how to encode labels, apply feature scaling, and split the data into training and test sets.
