
Nigel Lowa

4 Outlier Detection Strategies with R

In this article, I focus on four outlier detection strategies in R. I will start with univariate outlier detection before moving on to the Local Outlier Factor (LOF) algorithm, clustering-based detection, and finally time series outlier detection.

1. Univariate Outlier Detection
We sometimes need to detect outliers when we are dealing with a single variable. R provides an elegant function for this task: boxplot.stats(). It returns the statistics typically used to draw a boxplot, and its 'out' component contains a list of all the outliers, i.e. the values lying beyond the whiskers. We can also use the 'coef' argument to change the whiskers' boundaries, but that's a story for another day.
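
Here is a minimal sketch using simulated data; the seed and sample size are illustrative assumptions:

```r
set.seed(3147)
x <- rnorm(100)
summary(x)

# the 'out' component lists the values beyond the whiskers
boxplot.stats(x)$out

boxplot(x)
```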

The same univariate outlier detection process is also appropriate for multivariate data. The code below shows how we can use two variables, x and y, inside a data frame to detect outliers. We begin by detecting each variable's outliers separately before finding the rows that are outliers in both columns x and y.
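
A sketch along those lines, again with simulated data; intersect() keeps only the rows flagged in both columns:

```r
set.seed(3147)
x <- rnorm(100)
y <- rnorm(100)
df <- data.frame(x, y)

# indices of the outliers in each column
(a <- which(df$x %in% boxplot.stats(df$x)$out))
(b <- which(df$y %in% boxplot.stats(df$y)$out))

# rows that are outliers in both x and y
(outlier.list1 <- intersect(a, b))
plot(df)
points(df[outlier.list1, ], col = "red", pch = "+", cex = 2.5)
```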

We can also treat as an outlier any value that is extreme in either column.
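
Continuing from the previous chunk, union() gives the rows flagged in either column:

```r
# rows that are outliers in either x or y
(outlier.list2 <- union(a, b))
plot(df)
points(df[outlier.list2, ], col = "blue", pch = "x", cex = 2)
```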

2. Outlier Detection with LOF
LOF stands for "Local Outlier Factor." It is an unsupervised machine learning algorithm used for outlier detection in datasets. The LOF algorithm quantifies the 'outlierness' of each data point with respect to its local neighborhood. That is, a point's local density is compared with that of its neighbors, and a point whose LOF value is substantially greater than 1 sits in a sparser region than its neighbors, marking it as an outlier.

LOF is particularly useful when dealing with datasets containing irregularly shaped clusters or non-uniform density distributions. Note, however, that we can only use this method on numeric data. We use lofactor(), available in R's dprep and DMwR packages, to compute the local outlier factors. In the example below, we detect the outliers using the parameter k as the number of neighbors considered in the calculation. We have also generated a density chart of the factor scores.
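
Here is a sketch using the four numeric columns of the built-in iris data. Note that DMwR has since been archived on CRAN, so you may need its successor DMwR2, which also exports lofactor():

```r
library(DMwR)  # or library(DMwR2)

# LOF only works on numeric data, so drop the Species factor
iris2 <- iris[, 1:4]

# k is the number of neighbours used to estimate local density
outlier.scores <- lofactor(iris2, k = 5)
plot(density(outlier.scores))

# take the five highest scores as our outliers
outliers <- order(outlier.scores, decreasing = TRUE)[1:5]
print(outliers)
```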

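To see where those outliers sit, we can project the data onto its first two principal components; here is a sketch continuing from the previous chunk:

```r
# label the detected outliers with their row numbers, all other points with "."
n <- nrow(iris2)
labels <- 1:n
labels[-outliers] <- "."
biplot(prcomp(iris2), cex = 0.8, xlabs = labels)
```
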
In the code above, we use prcomp() to perform a principal component analysis and biplot() to generate the chart by plotting the data on the first two components. Principal Component Analysis (PCA) is a widely used statistical technique in data analysis and dimensionality reduction. It transforms high-dimensional data into a lower-dimensional space while preserving as much of the data's original variability as possible. PCA achieves this by finding a new set of orthogonal axes, known as principal components, along which the data has the maximum variance. In the chart, the x- and y-axes are respectively the first and second principal components, the arrows show the original columns (variables), and the five outliers are labeled with their row numbers.

Additionally, we can visualize outliers using a pairs plot, where we represent outliers with a red "+" symbol. The code is demonstrated below:
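
A sketch, reusing iris2, outliers, and n from the chunks above:

```r
# red "+" for outliers, black "." for everything else
pch <- rep(".", n)
pch[outliers] <- "+"
col <- rep("black", n)
col[outliers] <- "red"
pairs(iris2, pch = pch, col = col)
```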

Package dbscan has function lof(), an implementation of the LOF algorithm. While it serves the same purpose as lofactor(), it also offers extra flexibility in the choice of neighborhood size and distance measure. The following code shows an application of lof(): we first calculate the outlier scores and then detect the outliers by displaying the top values.
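
A sketch of the same idea; note that recent dbscan releases call the neighborhood-size argument minPts (older versions used k), so check your installed version:

```r
library(dbscan)

# compute LOF scores on the numeric iris columns
lof.scores <- lof(iris2, minPts = 5)

# the largest scores point to the strongest outliers
head(sort(lof.scores, decreasing = TRUE))
order(lof.scores, decreasing = TRUE)[1:5]
```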

3. Outlier Detection by Clustering
When we cluster data, any data points that do not belong to a specific group or category can be regarded as outliers. A good example is DBSCAN's density-based clustering, which groups objects into a single cluster if they belong to the same densely populated area. As a result, objects that are not assigned to any cluster are assumed to be outliers due to their isolation.

However, in this section, we will use the k-means algorithm to achieve our goal, as demonstrated in the code below. We first partition our data into k groups, assigning each data point to its nearest cluster center. The next step is to compute the dissimilarity (distance) between every item and its cluster center and flag the items with the largest distance values as outliers.
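
A sketch on the numeric iris columns, flagging the five points furthest from their cluster centers (k = 3 is an illustrative choice):

```r
iris2 <- iris[, 1:4]
kmeans.result <- kmeans(iris2, centers = 3)

# the center assigned to each observation
centers <- kmeans.result$centers[kmeans.result$cluster, ]

# Euclidean distance from each point to its cluster center
distances <- sqrt(rowSums((iris2 - centers)^2))

# the five points furthest from their centers are our outliers
outliers <- order(distances, decreasing = TRUE)[1:5]
print(outliers)
iris2[outliers, ]
```

Euclidean distance is used here because k-means itself minimizes squared Euclidean distance, so the two measures are consistent.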

4. Outlier Detection from Time Series
We will now use a time series dataset for the purpose of outlier detection. We begin by decomposing the data with the stl() function's robust regression option. The process is also known as robust fitting, and it allows us to identify the outliers. Check out http://cs.wellesley.edu/~cs315/Papers/stl%20statistical%20model if you wanna learn more about STL.
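
Here is a sketch on the built-in AirPassengers series. With robust = TRUE, stl() assigns near-zero robustness weights to observations the loess fit treats as aberrant, which is how we pick the outliers out:

```r
# robust STL decomposition of the monthly AirPassengers series
f <- stl(AirPassengers, s.window = "periodic", robust = TRUE)

# points with near-zero robustness weights are the outliers
(outliers <- which(f$weights < 1e-8))

# plot the decomposition and mark the outliers on the remainder panel
op <- par(mar = c(0, 4, 0, 3), oma = c(5, 0, 4, 0), mfcol = c(4, 1))
plot(f, set.pars = NULL)
sts <- f$time.series
points(time(sts)[outliers], 0.8 * sts[, "remainder"][outliers],
       pch = "x", col = "red")
par(op)
```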

Conclusion
If you made it this far, the least I could do is offer you a veritable easter egg in the form of a book quote:

"Mario, what do you get when you cross an insomniac, an unwilling agnostic and a dyslexic?"

"I give."

"You get someone who stays up all night torturing himself mentally over the question of whether or not there's a dog."

David Foster Wallace - Infinite Jest
