# Machine Learning - Data preprocessing

Kristijan Pajtasev Updated on ・2 min read

As part of working through the Machine Learning A-Z course on Udemy, I am writing a series of posts, starting with this one, containing the notes I gather from it.

### Dependent vs independent variables

• Dependent – variable being tested and measured – predicted result
• Independent – variable being changed or controlled – features

Libraries used (Python):

• numpy – library containing mathematical tools
• matplotlib.pyplot – plotting library
• pandas – library for importing and managing datasets
• sklearn.preprocessing – library for preprocessing data

Importing dataset with pandas:

```python
import pandas as pd
```
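As a minimal sketch of what comes after the import, the snippet below loads a small dataset and separates the independent variables from the dependent one. The CSV content and its column names are made up for illustration, and it assumes the dependent variable is the last column; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a real dataset file.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""

dataset = pd.read_csv(io.StringIO(csv_text))

# Independent variables (features): every column except the last one.
X = dataset.iloc[:, :-1].values
# Dependent variable (predicted result): the last column.
y = dataset.iloc[:, -1].values
```

Here `X` is a 3x3 array of features and `y` holds the three `Purchased` values.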

### Missing data

Option 1:

• remove rows with missing data
• dangerous because we might be losing valuable information

Option 2:

• set missing values to the mean of that feature

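Library used:

```
sklearn.impute.SimpleImputer
```

Older scikit-learn releases provided this as sklearn.preprocessing.Imputer, which was removed in version 0.22.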
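Option 2 could look roughly like the sketch below. It assumes a recent scikit-learn release, where mean imputation is done with `sklearn.impute.SimpleImputer`; the feature matrix is invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix (age, salary) with one missing value per column.
X = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 60000.0],
])

# Replace every NaN with the mean of the known values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)
```

The missing salary becomes 62000, the mean of the three known salaries, and the missing age becomes the mean of the three known ages.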

### Categorical data

Labels need to be converted into numbers, because models work with numeric computations such as Euclidean distance, which can't be calculated on labels.

Library:

```
sklearn.preprocessing.LabelEncoder
```
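A small sketch of how `LabelEncoder` turns labels into numbers; the country names are just example data.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column.
countries = ["France", "Spain", "Germany", "Spain"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# Classes are sorted alphabetically: France=0, Germany=1, Spain=2,
# so codes is [0, 2, 1, 2].
```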

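Problem with LabelEncoder: numbers have an order, so a model can read a ranking into the encoded values (for example, 2 > 1 > 0) even when the labels themselves are not ordered.

Solution: create a separate feature per label, set to 1 when the row has that label and 0 otherwise.

Library: sklearn.preprocessing.OneHotEncoder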
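Here is a minimal sketch of creating one feature per label with `OneHotEncoder`, again on made-up country data. The encoder expects a 2D input (rows by columns) and returns a sparse matrix by default, which is why the result is converted with `toarray()`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column, shaped as a single-column 2D array.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(countries).toarray()

# Columns follow the sorted categories: France, Germany, Spain.
# Each row has a single 1 in the column of its label.
```

No column is "bigger" than another here, so the model can't infer an order between the countries.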

### Splitting data

To create a model, the data needs to be split into two sets, train and test. The train set is the one we use for creating the model, and the test set is the one we use to evaluate that model's correctness.

Library: sklearn.model_selection.train_test_split

Usual ratio: 70-80% of the data goes to the train set
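A quick sketch of an 80/20 split with `train_test_split`. The arrays are dummy data, and `random_state=0` is only there so the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, plus 10 targets.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 keeps 80% of the rows for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```

With 10 samples this gives 8 rows for training and 2 for testing.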

### Feature scaling

A feature with large values can dominate a feature with smaller values, for example when distances between data points are calculated. This is why all features should be brought to the same scale.

Option 1, standardization:
The mean is subtracted from each value, and the result is divided by the standard deviation.

Option 2, normalization:
The minimum value of x is subtracted from each x. After that, the result is divided by the difference between the maximum and minimum values of x.
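Written as formulas, the two options look like this. The symbols are my own notation: μ is the mean of the feature, σ its standard deviation, and x_min and x_max its smallest and largest values.

$$
x_{\text{standardized}} = \frac{x - \mu}{\sigma}
\qquad
x_{\text{normalized}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$

For example, if a feature has the values 27, 30 and 44, normalizing 30 gives (30 - 27) / (44 - 27) = 3/17, which is about 0.18.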

Library:

```
sklearn.preprocessing.StandardScaler
```
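Both options can be sketched as below. `StandardScaler` covers standardization; for normalization I am using `MinMaxScaler` from the same module, which the notes above don't mention. The numbers are invented example data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales: age and salary.
X = np.array([
    [27.0, 48000.0],
    [30.0, 54000.0],
    [44.0, 72000.0],
])

# Option 1: each column ends up with mean 0 and standard deviation 1.
standardized = StandardScaler().fit_transform(X)

# Option 2: each column is rescaled to the range [0, 1].
normalized = MinMaxScaler().fit_transform(X)
```

When the data is split into train and test sets, a common approach is to fit the scaler on the train set only and then call `transform` on the test set, so the test data doesn't influence the scaling.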