Abhinav Yadav
Taming the Wild West: Data Preprocessing for Machine Learning Success

Picture a chef preparing a dish with spoiled meat, vegetables, or other basic ingredients. No matter how skilled the chef, the dish will not turn out well. The same applies to machine learning models: as predictive models, they need clean, high-quality data to function properly. This is where data preprocessing comes in, so let's take a closer look at what this preparation phase entails before data is fed into a machine learning model.

Table of Contents

  • Why is Data Preprocessing Important?
  • The Data Preprocessing Toolbox
  • Dealing With Imbalanced Data
  • Challenges and Best Practices

Why is Data Preprocessing Important?

Raw data usually arrives unformatted and looks like the Wild West: crowded and crude. It may be missing values, contain contradictions, and be filled with irrelevant information. These imperfections can be disastrous for your machine learning model, resulting in the following:

  • Poor Model Performance: Imagine a model attempting to forecast house prices with no information on house size. Inaccurate predictions are likely.

  • Biased Results: Cleaning data unevenly across subgroups, or cleaning it improperly, introduces bias and produces a model skewed toward some populations.

  • Training Inefficiency: Irrelevant features convey no useful information and only slow down the model's training process.

Data preprocessing addresses these issues, transforming your wild west data into a well-organized town, ready for model training.

The Data Preprocessing Toolbox

Data preprocessing involves several key steps to clean, transform, and structure your data:

1. Data Cleaning: This tackles the mess by identifying and handling missing values, removing duplicate records, and dealing with outliers.

  • Missing Values: When entries are missing, we can impute them with the mean, median, or mode, drop the affected rows entirely, or even use predictive algorithms to fill them in.
# Imputing missing values with mean
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
data_imputed = imputer.fit_transform(data)
  • Duplicates: Removing duplicate records prevents the model from overfitting to redundant data.
# Remove duplicate rows
data = data.drop_duplicates()

  • Outliers: Extreme values can distort a model's results. We can apply statistical methods (for example, Z-scores or the IQR rule) to detect them and then decide whether to remove or transform them, as shown below.
# Removing outliers using Z-scores (assumes numeric columns)
import numpy as np
from scipy import stats

data = data[(np.abs(stats.zscore(data)) < 3).all(axis=1)]
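Alternatively, here is a quick sketch of the IQR rule using plain pandas (again assuming data holds only numeric columns):

# Keep rows whose values lie within 1.5 * IQR of the quartiles
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
data = data[~((data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)).any(axis=1)]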

2. Data Transformation: This prepares the data for the model's mathematical operations.

  • Normalization/Standardization: Scaling features to a common range (normalization) or transforming them to zero mean and unit variance (standardization) can enhance model performance.
# Standardize data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_standardized = scaler.fit_transform(data)

# Normalize data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_normalized = scaler.fit_transform(data)

3. Encoding Categorical Variables: Categorical data like "color" needs conversion to numerical format for the model to understand it.

  • One-Hot Encoding: Creates new binary features for each category.

  • Label Encoding: Assigns a numerical value to each category.

# One-hot encoding
from sklearn.preprocessing import OneHotEncoder
onehot_encoder = OneHotEncoder()
data_encoded = onehot_encoder.fit_transform(data[['categorical_feature']])

# Label encoding
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
data['categorical_feature_encoded'] = label_encoder.fit_transform(data['categorical_feature'])


4. Feature Engineering: This involves creating new features from existing ones to potentially improve model performance.

  • New Features: We can combine features, create interaction terms, or extract features from dates/times, as the sketches below show.
# Creating a new feature from existing ones
data['new_feature'] = data['feature1'] * data['feature2']
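For date/time columns, here is a quick sketch (assuming data is a pandas DataFrame with a datetime column named 'date'; the column name is purely illustrative):

import pandas as pd

# Extract calendar features from a datetime column
data['date'] = pd.to_datetime(data['date'])
data['month'] = data['date'].dt.month
data['day_of_week'] = data['date'].dt.dayofweek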

5. Feature Selection: Not all features are equally important. Selecting relevant features can improve model performance and reduce overfitting.

  • Filter Methods: Select features based on a score (e.g., correlation with the target variable).

  • Wrapper Methods: Use a machine learning model itself to evaluate feature subsets.

  • Embedded Methods: Leverage regularization techniques (like LASSO) that inherently perform feature selection. (Sketches of the wrapper and embedded methods follow the filter example below.)

# Selecting top 5 features based on ANOVA F-test
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

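For the wrapper and embedded methods, here is a minimal sketch (assuming numeric features X and a target y; the estimators, feature count, and alpha value are illustrative choices, not the only options):

# Wrapper method: recursive feature elimination (RFE) with a model
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression, Lasso

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)

# Embedded method: LASSO shrinks unhelpful feature weights to zero
lasso_selector = SelectFromModel(Lasso(alpha=0.1))
X_lasso = lasso_selector.fit_transform(X, y)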

Dealing With Imbalanced Data

1. Understanding Imbalance:

Imbalanced datasets pose a considerable challenge for machine learning. Imbalance occurs when one class (the minority class) appears far less frequently than the other (the majority class). Training then favours the majority class, which degrades the model's ability to predict the minority class.

2. Impact on Model Training:

Imbalanced datasets can lead to several issues:

  • Biased Models: Models tend to favour the majority class, leading to poor predictive performance for the minority class.

  • Misleading Accuracy: Accuracy can be misleadingly high if the model predicts the majority class well but fails on the minority class (see the small demo after this list).

  • Difficulty in Learning Patterns: The model may not learn enough about the minority class due to its limited representation.
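To see how misleading accuracy can be, here is a tiny synthetic demo (the 99:1 split and the always-majority DummyClassifier are purely illustrative):

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 990 majority samples (class 0), 10 minority (class 1)
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((1000, 1))  # feature values are irrelevant for this demo

# A baseline "model" that always predicts the majority class
model = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = model.predict(X)

print(accuracy_score(y, y_pred))             # 0.99 -- looks impressive
print(recall_score(y, y_pred, pos_label=1))  # 0.0  -- minority class never found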

3. Resampling Techniques:

To address imbalance, resampling techniques are commonly used:

  • Oversampling:

SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic minority-class samples by interpolating between pairs of similar existing instances. This balances the classes without simply duplicating samples.

from imblearn.over_sampling import SMOTE

# Applying SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)

  • Undersampling:

Undersampling reduces the number of instances in the majority class to match the minority class. It may discard useful data, but its simplicity makes it well suited to massive datasets.
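As a sketch, the imblearn library also provides RandomUnderSampler with the same fit_resample interface (assuming the same X and y as in the SMOTE example):

from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class samples until the classes are balanced
undersampler = RandomUnderSampler()
X_resampled, y_resampled = undersampler.fit_resample(X, y)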

  • Using Class Weights:

Another method, though often less effective than resampling, is to adjust class weights so that the classification algorithm penalises errors on the minority class more heavily than errors on the majority class. Many classifiers, such as Random Forest, support this via the parameter class_weight='balanced'.

from sklearn.ensemble import RandomForestClassifier

# Adjust class weights
model = RandomForestClassifier(class_weight='balanced')
model.fit(X_train, y_train)


Challenges and Best Practices

1. Data Quality: High-quality data is crucial for reliable model training, especially in addressing imbalanced classes.

2. Consistency: Maintain consistent preprocessing steps across training and test datasets to avoid bias and ensure fairness.

3. Automation: Use tools like Scikit-Learn Pipelines to automate preprocessing steps, ensuring efficiency and reproducibility; the sketch after this list shows one in action.
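Here is a minimal pipeline sketch tying points 2 and 3 together (assuming X_train, y_train, and X_test are already defined): preprocessing is fit on the training data only and re-applied unchanged to the test data.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Chain imputation, scaling, and the model into one object
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(class_weight='balanced')),
])

pipeline.fit(X_train, y_train)           # preprocessing is fit on train only
predictions = pipeline.predict(X_test)   # identical transforms applied to test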

By following these steps and practices, beginners can effectively preprocess data and handle imbalanced datasets in machine learning projects, improving model performance and reliability.

Happy Learning!

Please do comment below and let me know whether or not you liked the content.

If you have any questions or ideas, or want to collaborate on a project, here is my LinkedIn.
