
Manav Modi

Posted on • Originally published at manavmodi.hashnode.dev

Introduction to Data Preprocessing

What is Data Preprocessing?

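Data preprocessing comes right after you have cleaned up your data and done some Exploratory Data Analysis. It is the step where we prepare the data for modeling. Most modeling libraries in Python expect numerical input, so a large part of preprocessing is getting every column into a numerical form without missing values.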

Refreshing Pandas Skills

You can skip this section if you know the basics.

Before we proceed with the series, it is important to know the commands that help you get to know your dataset: what the first rows look like, which columns it has, and what datatype each column holds.

import pandas as pd
hiking = pd.read_json("datasets/hiking.json")
print(hiking.head())

print(hiking.columns)

print(hiking.dtypes)
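If you do not have hiking.json at hand, the same inspection commands can be tried on a small hand-made DataFrame. This is only a sketch: the `trails` name, its columns, and its values are invented for illustration and are not the hiking dataset.

```python
import pandas as pd

# Hypothetical stand-in data; column names and values are invented.
trails = pd.DataFrame(
    {
        "name": ["Ridge Loop", "Creek Path", "Summit Trail"],
        "length_km": [4.2, 1.5, 9.8],
        "difficulty": ["easy", "easy", "hard"],
    }
)

print(trails.head(2))   # first two rows
print(trails.columns)   # column labels
print(trails.dtypes)    # one datatype per column
print(trails.shape)     # (number of rows, number of columns)
trails.info()           # datatypes and non-null counts in one summary
```

`shape` and `info()` are not used above, but they are handy companions to `head()`, `columns`, and `dtypes` when you first open a dataset.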

Removing Missing Data

Sample Data

[Image: the sample DataFrame df]
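The original sample is only shown as an image, so here is a hypothetical stand-in you can build yourself. Only the column names "A", "B", and "C" come from the commands in this post; the values are invented, and "C" is stored as strings on purpose so that there is something to convert later.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: values are invented for illustration.
df = pd.DataFrame(
    {
        "A": [1.0, np.nan, 7.0, 4.0, 5.0],
        "B": [np.nan, 7.0, np.nan, 7.0, 3.0],
        "C": ["2", "3", "1", "5", "8"],  # numbers stored as text
    }
)

print(df)
print(df.isnull().sum())  # number of missing values in each column
```

Counting the nulls per column first tells you how much data each of the dropping strategies would cost you.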

Dropping rows with null values

print(df.dropna())

Dropping specific rows by passing a list of index labels

print(df.drop([1,2,3]))

Dropping a specific column (here axis=1 specifies that a column, not a row, should be dropped)

print(df.drop("A", axis=1))

Keeping only the rows where a specific column is not null

print(df[df["B"].notnull()])
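To see what each of the four commands above keeps, here is a small runnable sketch. The DataFrame is hypothetical (invented values, with the column names used in this post). Note that every call returns a new DataFrame and leaves df unchanged, so assign the result if you want to keep it.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with index labels 0..4.
df = pd.DataFrame(
    {
        "A": [1.0, np.nan, 7.0, 4.0, 5.0],
        "B": [np.nan, 7.0, np.nan, 7.0, 3.0],
        "C": ["2", "3", "1", "5", "8"],
    }
)

no_nulls = df.dropna()               # keeps only rows with no missing value
dropped_rows = df.drop([1, 2, 3])    # removes the rows labelled 1, 2 and 3
no_col_a = df.drop("A", axis=1)      # removes column "A"
b_present = df[df["B"].notnull()]    # keeps rows where "B" has a value

# dropna(subset=...) is another way to write the boolean filter above.
b_present_alt = df.dropna(subset=["B"])

print(no_nulls.index.tolist())       # [3, 4]
print(dropped_rows.index.tolist())   # [0, 4]
print(no_col_a.columns.tolist())     # ['B', 'C']
print(b_present.index.tolist())      # [1, 3, 4]
print(df.shape)                      # (5, 3), the original is untouched
```

Filtering on one column is usually less destructive than a plain dropna(): here it keeps three rows instead of two.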

Working on DataTypes

While preprocessing, you will often find that a column does not have the datatype you want, for example numbers that were read in as text. The astype method converts a column to another datatype.

Remember: Always choose a datatype that every value in the column can be converted to.

This code sample will help you convert column "C" to the float datatype.

df["C"] = df["C"].astype("float")
print(df.dtypes)

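The reminder about picking a datatype that fits all of the data matters in practice: astype raises an error as soon as one value cannot be converted. The sketch below uses a hypothetical column "D" with an invented "n/a" entry to show the failure, and then pd.to_numeric with errors="coerce" as one possible way around it. That workaround is my suggestion, not something the conversion above requires.

```python
import pandas as pd

# Hypothetical data: "C" is clean text, "D" contains one non-numeric entry.
df = pd.DataFrame({"C": ["2", "3", "1"], "D": ["4.5", "n/a", "6"]})

# Every value in "C" parses as a number, so the conversion succeeds.
df["C"] = df["C"].astype("float")

# "n/a" cannot be parsed, so astype raises a ValueError.
try:
    df["D"].astype("float")
except ValueError as err:
    print("Conversion failed:", err)

# errors="coerce" turns unparseable values into NaN instead of raising.
df["D"] = pd.to_numeric(df["D"], errors="coerce")

print(df.dtypes)
print(df["D"].isnull().sum())  # 1
```

Coercing creates a new missing value, so this step usually comes before the null handling from the previous section.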

Stratified Sampling

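A train/test split divides the dataset into one part for training the model and another for testing it.
Say the original dataset is 80% class 1 and 20% class 2. You would want a similar distribution in both the training and test sets so that each is representative of the whole dataset. A purely random split does not guarantee this, especially on small datasets; stratified sampling does.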

# Total "labels" counts
y["labels"].value_counts()

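from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions of y in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
print(y_train["labels"].value_counts())
print(y_test["labels"].value_counts())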
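Here is a self-contained sketch of the 80/20 scenario described above. The feature values and the 100-row size are invented for illustration, and random_state=0 is only there to make the run repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 80 rows of class 1 and 20 rows of class 2.
X = pd.DataFrame({"feature": range(100)})
y = pd.DataFrame({"labels": [1] * 80 + [2] * 20})

# The default test size is 25%, so 75 rows go to train and 25 to test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# normalize=True reports proportions instead of raw counts.
print(y_train["labels"].value_counts(normalize=True))
print(y_test["labels"].value_counts(normalize=True))
```

Both splits should show the same 80/20 balance as the full dataset. Dropping the stratify argument and re-running is a quick way to see how far a plain random split can drift from it.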

Check out the exercises linked to this here

Interested in Machine Learning content? Follow me on Twitter.
