
Nikhil Dhawan


Machine Learning - Data Preprocessing 2: Label Encoder, Splitting Data and Feature Scaling

Hi all. In this article we will discuss the label encoder, splitting the dataset, and feature scaling.
Let's start with the label encoder. It is used to encode the dependent variable, which is y here. If the dataset has values like "yes" or "no", they are encoded to the numeric values 0 and 1. We will use LabelEncoder from sklearn.preprocessing and transform the data as below:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Learn the classes present in y and replace each one with an integer code
y = le.fit_transform(y)
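
To make the encoding concrete, here is a minimal pure-Python sketch of what label encoding does conceptually. The sample values are made up for illustration and are not from the article's dataset. LabelEncoder assigns integer codes to the sorted unique classes, so with these labels "No" becomes 0 and "Yes" becomes 1.

```python
# Hypothetical sample of a yes/no dependent variable (illustration only)
y_raw = ["No", "Yes", "No", "No", "Yes"]

# "fit": collect the unique classes in sorted order
classes = sorted(set(y_raw))

# Map each class to its position in the sorted list
class_to_code = {label: code for code, label in enumerate(classes)}

# "transform": replace every label with its integer code
y_encoded = [class_to_code[label] for label in y_raw]

# The inverse mapping recovers the original labels from the codes
y_decoded = [classes[code] for code in y_encoded]

print(classes)    # ['No', 'Yes']
print(y_encoded)  # [0, 1, 0, 0, 1]
```

With the fitted encoder from the snippet above, le.classes_ holds the learned classes and le.inverse_transform maps codes back to the original labels.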

After this, we can split the data into training and test sets. A common choice is to keep 80% of the data for training and the rest as the test set. The test set is used to evaluate model performance once training is done. The split can be done with train_test_split from sklearn.model_selection:

from sklearn.model_selection import train_test_split
# test_size=0.2 holds out 20% of the rows; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
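
The idea behind the split can be sketched in plain Python: shuffle the row indices with a fixed seed, then set aside 20% of them as the test set. This is only an illustration of the concept with made-up data. It will not reproduce the exact rows that train_test_split selects.

```python
import random

# Hypothetical toy data: 10 samples with one feature each, and their labels
X = [[i] for i in range(10)]
y = [i % 2 for i in range(10)]

test_size = 0.2
rng = random.Random(1)  # fixed seed so the split is reproducible

# Shuffle indices rather than the data so X and y stay aligned
indices = list(range(len(X)))
rng.shuffle(indices)

n_test = int(len(X) * test_size)
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_train = [X[i] for i in train_idx]
X_test = [X[i] for i in test_idx]
y_train = [y[i] for i in train_idx]
y_test = [y[i] for i in test_idx]

print(len(X_train), len(X_test))  # 8 2
```

Every row ends up in exactly one of the two sets, and each feature row keeps its matching label.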

Once we have the two sets, we apply feature scaling if the data needs it, using StandardScaler from sklearn.preprocessing. Here we use standardization, which rescales each feature to zero mean and unit variance. There is another method called normalization (rescaling to a fixed range such as 0 to 1), but standardization is used more often. We first fit the scaler on the training data and scale it, then scale the test set with the same scaling parameters. It is recommended and general practice to scale after splitting, because the test data should stay unseen by the model. If we scale before splitting, information from the test set leaks into the scaling parameters, and the test results may not reflect how the model performs on truly new data.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Fit on the training set only, then reuse the same mean and std for the test set
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
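
Standardization rescales each value as z = (x - mean) / std. The sketch below uses made-up numbers and only the standard library to show the point made above: the mean and standard deviation are computed from the training data only and then reused for the test data. It uses the population standard deviation, which is also what StandardScaler uses.

```python
from statistics import fmean, pstdev

# Hypothetical single feature (illustration only), already split into train and test
train = [20.0, 30.0, 40.0, 50.0]
test = [25.0, 60.0]

# "fit": learn the scaling parameters from the training data only
mean = fmean(train)
std = pstdev(train)  # population standard deviation

def standardize(values, mean, std):
    """Apply z = (x - mean) / std with already-computed parameters."""
    return [(v - mean) / std for v in values]

train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)  # reuses the training mean and std

print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1. The scaled test values generally will not, and that is expected, since the parameters were not computed from them.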

Now the data is preprocessed and we are ready to develop models to predict values. We will cover that in future articles.

Hope it helped.
