Ajaykrishnan Selucca

Machine Learning - Outliers, their types and causes

OUTLIERS:

Outliers are points which are like introverts who never mingle with other points or groups of points (the distribution), like me. An outlier is an extreme value that deviates from the other observations in the data; it is an observation that diverges from the overall pattern of a sample. If outliers are not dealt with during our EDA (Exploratory Data Analysis), the resulting machine learning model may suffer from problems such as low accuracy and larger errors.

TYPES OF OUTLIERS:

UNIVARIATE OUTLIERS: A data point that has an extreme value on a single variable.

MULTIVARIATE OUTLIERS: A data point that has an unusual combination of values on at least two variables, even if each value looks ordinary on its own.
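To make the difference concrete, here is a minimal sketch. The height and weight numbers are made up for illustration, and the two checks (z-scores for the univariate view, Mahalanobis distance for the multivariate view) are just one reasonable choice, not the only way to do it. The last point is unremarkable on each variable alone, yet its combination of values is the most unusual in the sample.

```python
import math
import statistics

# Hypothetical (height in cm, weight in kg) measurements, made up for illustration.
# The last pair is the suspect: a short person with a high weight.
heights = [150, 155, 160, 165, 170, 175, 180, 185, 190, 155]
weights = [50, 54, 59, 63, 68, 72, 77, 81, 86, 84]


def z_scores(values):
    """Univariate check: distance from the mean in standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def mahalanobis_2d(xs, ys):
    """Multivariate check: Mahalanobis distance of each (x, y) pair from the
    centre of the data, using the 2x2 sample covariance matrix."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    var_x = sum(a * a for a in dx) / (n - 1)
    var_y = sum(b * b for b in dy) / (n - 1)
    cov_xy = sum(a * b for a, b in zip(dx, dy)) / (n - 1)
    det = var_x * var_y - cov_xy ** 2
    # Closed-form inverse of the 2x2 covariance matrix applied to each deviation.
    return [
        math.sqrt((var_y * a * a - 2 * cov_xy * a * b + var_x * b * b) / det)
        for a, b in zip(dx, dy)
    ]


z_height = z_scores(heights)
z_weight = z_scores(weights)
distances = mahalanobis_2d(heights, weights)

suspect = len(heights) - 1
most_unusual = max(range(len(distances)), key=lambda i: distances[i])

print("suspect z-scores:", round(z_height[suspect], 2), round(z_weight[suspect], 2))
print("index with largest Mahalanobis distance:", most_unusual)
```

Both z-scores of the suspect stay well inside the usual cut-off of 2, so a variable-by-variable check would miss it, while the Mahalanobis distance ranks it as the most unusual point.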

TYPES OF OUTLIERS BASED ON ENVIRONMENT:

POINT OUTLIERS: Single data points that lie far from the rest of the distribution.

CONTEXTUAL OUTLIERS: Data points that deviate significantly with respect to a specific context, although the same value could be normal in another context (often treated as noise).

COLLECTIVE OUTLIERS: A subset of data points that, taken together, deviates significantly from the whole data set, even if the individual points are not outliers on their own.
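A contextual outlier is easy to miss if we only look at the data as a whole. The sketch below uses made-up temperature readings tagged with a season, and scores each value with a modified z-score based on the median absolute deviation (MAD). The 0.6745 constant and the 3.5 cut-off are the commonly quoted defaults for that score; the data, names and threshold are assumptions for illustration only.

```python
import statistics
from collections import defaultdict

# Hypothetical (season, temperature in Celsius) readings, made up for illustration.
readings = [
    ("summer", 28), ("summer", 30), ("summer", 31), ("summer", 29), ("summer", 30),
    ("winter", 2), ("winter", 4), ("winter", 3), ("winter", 22), ("winter", 1),
]

THRESHOLD = 3.5  # commonly quoted cut-off for the modified z-score


def modified_z(values):
    """Modified z-score: 0.6745 * (value - median) / MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate case (at least half the values are identical): skip scoring.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


# Global view: ignore the season and score every reading against all the others.
global_scores = modified_z([temp for _, temp in readings])
global_flags = [r for r, s in zip(readings, global_scores) if abs(s) > THRESHOLD]

# Contextual view: score each reading only against readings from the same season.
by_season = defaultdict(list)
for season, temp in readings:
    by_season[season].append(temp)

contextual_flags = []
for season, temps in by_season.items():
    for temp, score in zip(temps, modified_z(temps)):
        if abs(score) > THRESHOLD:
            contextual_flags.append((season, temp))

print("flagged globally:", global_flags)
print("flagged within season:", contextual_flags)
```

A reading of 22 degrees is nothing special across the whole year, but it stands out once it is compared only with other winter readings.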

CAUSES OF OUTLIERS:

Handling outliers is very important, because not all outliers are a bad thing. It's very important to understand that simply removing outliers from our dataset without considering how that will impact the results is a recipe for disaster.

Outliers can drastically affect the results of our analysis and statistical modelling. They have an especially large impact in logistic regression; let's discuss that in a separate post.
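As a small illustration of how much a single outlier can move a summary statistic, the sketch below compares the mean, median and standard deviation of a made-up salary list before and after one data entry error is added. The numbers are hypothetical; the point is only the size of the shift.

```python
import statistics

# Hypothetical monthly salaries in thousands, made up for illustration.
clean = [30, 32, 35, 31, 33]
# The same data plus one data entry error: 33 typed as 330.
with_outlier = clean + [330]

for name, data in (("clean", clean), ("with outlier", with_outlier)):
    print(
        name,
        "| mean:", round(statistics.mean(data), 2),
        "| median:", statistics.median(data),
        "| stdev:", round(statistics.stdev(data), 2),
    )

# How far each statistic moved because of the single bad value.
mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)
print("mean shift:", round(mean_shift, 2), "| median shift:", median_shift)
```

The mean and standard deviation jump sharply while the median barely moves, which is why any model or statistic built on means and variances needs a close look at outliers first.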

The causes of outliers are as follows:

  1. Data Entry Errors (Human Errors)
  2. Measurement Errors (Instrument errors)
  3. Experimental Errors (Data Extraction or execution errors)
  4. Intentional Errors (dummy outliers made to test detection methods)
  5. Data Processing Errors (Data manipulation or unintended mutations of the data set)
  6. Sampling Errors (Extracting or mixing data from wrong or various sources)
  7. Natural (Not an error, but a real extreme value in the data)

In the next post, we will look at detecting and dealing with outliers.
