Handling Outliers || Feature Engineering || Machine Learning

#machinelearning #datascience #beginners #tutorial

Hey reader👋Hope you are doing well😊
We know that feature engineering is a crucial step in improving the performance of a machine learning model. One of the most important tasks in feature engineering is handling outliers. In this blog we are going to discuss handling outliers in detail. So let's get started 🔥.

What are Outliers?

Outliers are extreme values that differ from most other data points in a dataset. They can have a big impact on statistical analysis and skew the results of hypothesis tests.

To understand it better, let's consider an example:
Dataset A = [1,2,3,4,5,6]
Mean => 3.5
Now let's add some more data points to the dataset.
A = [1,2,3,4,5,6,100,101]
Mean => 27.75
Here we can see that the mean rises sharply just by adding two points, and these two points are very different from the rest of the points in the dataset. These points are definitely outliers.
Outliers can negatively affect our data and modeling, so it is very important to handle them properly.
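The arithmetic above can be checked with a few lines of Python. This is a minimal sketch using only the standard library `statistics` module; the variable names are illustrative, and the median is printed next to the mean purely for comparison.

```python
from statistics import mean, median

a = [1, 2, 3, 4, 5, 6]
a_with_outliers = a + [100, 101]

# Mean and median of the original data
print(mean(a), median(a))  # 3.5 3.5

# Mean and median after adding the two extreme points
print(mean(a_with_outliers), median(a_with_outliers))  # 27.75 4.5
```

The two extra points pull the mean from 3.5 to 27.75, while the median only moves from 3.5 to 4.5.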

How Are Outliers Introduced in Data?

Outliers in a dataset can be introduced through various mechanisms, both intentional and unintentional. Here are some common ways outliers can be introduced:

Human Error: Manual data entry mistakes, such as typing errors, can lead to outliers. For example, entering an extra zero or a decimal point in the wrong place.
Instrument Error: Faulty measurement instruments or sensors can produce erroneous values that stand out as outliers.
Rare Events: Some outliers occur naturally due to rare events or extreme conditions. For example, an unusually high sales figure during a holiday season.
Merging Datasets: Combining datasets with different scales or units without proper alignment or adjustment can introduce outliers.
Intentional Manipulation: In some cases, outliers might be introduced intentionally, such as in fraudulent financial reporting or tampering with experimental data.

Types of Outliers

Based on their characteristics, outliers or anomalies can be divided into three categories:

1. Global Outliers
Observations or data points are considered global outliers if they deviate significantly from the rest of the data points in a dataset. For example, if you are collecting temperature observations for a city, then a value of 100 °C would be considered an outlier, as it is an extreme as well as impossible air temperature for a city.

2. Contextual Outliers
Data points or observations are considered contextual outliers if their value deviates significantly from the rest of the data points in a particular context. This means that the same value may not be considered an outlier in a different context. For example, if you have temperature observations for a city, then a value of 40 °C would be considered an outlier in winter, but the same value might be part of the normal observations in summer.

3. Collective Outliers
A group of observations or data points within a dataset is considered a collective outlier if these observations, taken together, deviate significantly from the rest of the dataset. This means that these values, looked at individually, are not considered either contextual or global outliers; they only stand out as a group. For example, a sensor that reports exactly the same ordinary reading for many hours in a row: each value is normal on its own, but the flat run is not.

Identifying Outliers

There are four common ways of identifying outliers:

1. Percentile Method
The percentile method identifies outliers in a dataset by comparing each observation to the rest of the data using percentiles. In this method, we first define the lower and upper bounds of a dataset using the desired percentiles.
For example, we may use the 5th and 95th percentiles as a dataset's lower and upper bounds, respectively. Any observations or data points that fall outside these bounds can be considered outliers.
This method is simple and useful for identifying outliers in symmetrical and normal distributions. Keep in mind that it always flags a fixed share of the data (about 10% with the 5th and 95th percentiles), whether or not those points are truly unusual.
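The percentile rule can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names `percentile_bounds` and `percentile_outliers` are made up for this example, and `statistics.quantiles` with `method="inclusive"` (Python 3.8+) is one of several possible percentile conventions.

```python
from statistics import quantiles

def percentile_bounds(values, lower_pct=5, upper_pct=95):
    """Return (lower, upper) bounds at the given whole-number percentiles."""
    # quantiles(..., n=100) returns 99 cut points; cuts[p - 1] is the p-th percentile.
    # method="inclusive" interpolates linearly between the sorted data points.
    cuts = quantiles(values, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]

def percentile_outliers(values, lower_pct=5, upper_pct=95):
    """Return the values that fall outside the percentile bounds."""
    lower, upper = percentile_bounds(values, lower_pct, upper_pct)
    return [v for v in values if v < lower or v > upper]

data = list(range(1, 101))        # 1, 2, ..., 100: evenly spread values
print(percentile_bounds(data))    # (5.95, 95.05)
print(percentile_outliers(data))  # [1, 2, 3, 4, 5, 96, 97, 98, 99, 100]
```

Here 10 of the 100 evenly spread values are flagged, so the percentiles should be chosen with the data in mind.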

2. Inter Quartile Range (IQR) Method
This method is similar to the percentile method; the slight difference is that here we define an Inter Quartile Range (IQR) for detecting outliers.
Q1 = 25th percentile
Q3 = 75th percentile
IQR = Q3-Q1
Upper bound = Q3+1.5*(IQR)
Lower bound = Q1-1.5*(IQR)
We check every data point: if the point is in the range [Lower bound, Upper bound] it is a valid point; otherwise it is an outlier.

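We use the 25th and 75th percentiles because the middle 50% of the data lies between them, so the IQR measures the spread of the typical values and is itself barely affected by extreme points. The method does not require the data to be normally distributed; for normally distributed data, the bounds fall roughly 2.7 standard deviations from the mean.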
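The IQR bounds defined above can be sketched in Python like this. The helper names `iqr_bounds` and `iqr_outliers` are hypothetical, and the quartiles are computed with `statistics.quantiles(..., method="inclusive")`, which is an assumption: other quartile conventions can give noticeably different bounds on very small datasets.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    # With n=4, quantiles returns the three quartiles Q1, Q2 (median), Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [lower, upper]."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]

a = [1, 2, 3, 4, 5, 6, 100, 101]
print(iqr_bounds(a))    # (-37.375, 69.625)
print(iqr_outliers(a))  # [100, 101]
```

For this dataset Q1 = 2.75 and Q3 = 29.5, so IQR = 26.75 and only 100 and 101 fall outside the bounds.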

3. Using Visualization
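In Python we can use a box plot (also called a box-and-whisker plot) to detect outliers in a dataset.

The box plot is simply a visualization of the IQR method: the box spans Q1 to Q3, and points drawn beyond the whiskers are the outliers.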

4. Using Z score method

For a given value, the z-score represents its distance from the mean in units of standard deviation: z = (value - mean) / standard deviation. For example, a z-score of 2 means that the data point is 2 standard deviations above the mean. To detect outliers using the z-score, we define lower and upper bounds for the dataset. The upper bound is z = 3 and the lower bound is z = -3, so any value more than 3 standard deviations away from the mean is considered an outlier.

Python Implementation for detecting outliers
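As one illustration, the z-score rule described above can be sketched with the standard library alone. The function names and the `readings` data are made up for this example, and using the population standard deviation (`pstdev`) rather than the sample one is an assumption.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the z-score of every value (assumes the values are not all identical)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

def z_score_outliers(values, threshold=3.0):
    """Return the values whose |z-score| exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical readings: twenty ordinary values and one extreme value
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
            12, 10, 11, 13, 12, 11, 10, 12, 13, 11, 100]
print(z_score_outliers(readings))  # [100]
```

On very small samples this rule can miss obvious outliers. For A = [1,2,3,4,5,6,100,101] it flags nothing, because the two extreme values inflate the standard deviation so much that even 101 has a z-score below 2.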

Handling Outliers

Depending on the dataset, there are various ways to handle outliers:

Removing Outliers
If the outliers are caused by manual error, it is better to remove them entirely from the dataset. If the dataset contains a large number of outliers, however, removing them may result in a significant loss of data.
Transforming Outliers
The impact of outliers can be reduced or eliminated by transforming the feature. For example, a log transformation of a feature can reduce the skewness in the data, reducing the impact of outliers.
(We will read about transformations in upcoming blogs)
Impute Outliers
In this approach, outliers are treated as missing values and replaced with the mean, median, mode, a nearest-neighbor estimate, etc.
Use robust statistical methods
Some statistical methods are less sensitive to outliers and can provide more reliable results when outliers are present in the data. For example, we can use the median and IQR for statistical analysis, as they are far less affected by the presence of outliers. This way we can minimize the impact of outliers in statistical analysis.

Python Implementation of Handling Outliers
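The sketch below applies three of the strategies above (removing, transforming, imputing) to the small example dataset. It is an illustration under assumptions, not a fixed recipe: outliers are detected with the IQR rule, the helper names are hypothetical, `log1p` is just one possible transformation, and the median of the remaining points is just one possible fill value.

```python
import math
from statistics import median, quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

a = [1, 2, 3, 4, 5, 6, 100, 101]
lower, upper = iqr_bounds(a)

def is_outlier(v):
    return v < lower or v > upper

# 1. Removing: keep only the points inside the bounds.
removed = [v for v in a if not is_outlier(v)]

# 2. Transforming: log1p(v) = log(1 + v) compresses large values (needs v > -1).
logged = [math.log1p(v) for v in a]

# 3. Imputing: treat outliers as missing and fill them with the median of the rest.
fill = median(removed)
imputed = [fill if is_outlier(v) else v for v in a]

print(removed)                # [1, 2, 3, 4, 5, 6]
print(imputed)                # [1, 2, 3, 4, 5, 6, 3.5, 3.5]
print(round(max(logged), 3))  # 4.625
```

After the log transformation the largest value is about 4.6 instead of 101, so the extreme points sit much closer to the rest of the data.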

I hope you have understood how outliers are handled in a dataset. In the next blog we are going to read about how to handle missing values. Till then stay connected and don't forget to follow me.
Thank you 💙