Originally published on Wordpress.
Hello, everyone! I've been meaning to get a new blog post out for the past couple of weeks. During that time I've been messing around with clustering. Clustering, or cluster analysis, is a method of data mining that groups similar observations together. Classification and clustering are quite alike, but clustering is unsupervised and more concerned with exploring the data than with predicting known labels.
Note: This post is far from an exhaustive look at all clustering has to offer. Check out this guide for more. I am reading Data Mining by Aggarwal presently, which is very informative.
data("iris")
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
For simplicity, we'll use the iris dataset. We're going to try to use the numeric data to correctly group the observations by species. There are 50 of each species in the dataset, so ideally we would end up with three clusters of 50.
library(ggplot2)
ggplot() +
geom_point(aes(iris$Sepal.Length, iris$Sepal.Width, col = iris$Species))
As you can see, there are already some groupings present. Let's use hierarchical clustering first.
iris2 <- iris[,c(1:4)] #not going to use the `Species` column.
medians <- apply(iris2, 2, median)
mads <- apply(iris2,2,mad)
iris3 <- scale(iris2, center = medians, scale = mads)
dist <- dist(iris3)
hclust <- hclust(dist, method = 'median') #there are several methods for hclust
cut <- cutree(hclust, 3)
table(cut)
## cut
## 1 2 3
## 49 34 67
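The linkage method can change the result quite a bit. As a quick illustrative aside (the rest of the post sticks with median linkage), Ward linkage is a common alternative to compare:
hclust_ward <- hclust(dist, method = 'ward.D2') #Ward linkage instead of median
table(cutree(hclust_ward, 3)) #cluster sizes under Ward linkage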
iris <- cbind(iris, cut)
iris$cut <- factor(iris$cut)
levels(iris$cut) <- c('setosa', 'versicolor', 'virginica')
err <- iris$Species == iris$cut
table(err)
## err
## FALSE TRUE
## 38 112
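To turn that table into an error rate directly (just the share of mismatched labels):
mean(iris$Species != iris$cut) #38/150, roughly 0.25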
ggplot() +
geom_point(aes(iris2$Sepal.Length, iris2$Sepal.Width, col = iris$cut))
Nice groupings here, but it looks like the algorithm has some trouble distinguishing versicolor from virginica. If we used this information to classify the observations, we'd get an error rate of about 0.25. Let's try another clustering technique: DBSCAN.
library(dbscan)
db <- dbscan(iris3, eps = 1, minPts = 20)
table(db$cluster)
##
## 0 1 2
## 5 48 97
iris2 <- cbind(iris2, db$cluster)
iris2$cluster <- factor(iris2$`db$cluster`)
ggplot() +
geom_point(aes(iris2$Sepal.Length, iris2$Sepal.Width, col = iris2$cluster))
DBSCAN classifies points into three categories on the basis of density: core, border, and noise points (the five observations in cluster 0 above are noise). Thus, the versicolor/virginica cluster is treated as one group. Since our data is not structured in such a way that density is meaningful, DBSCAN is probably not a wise choice here.
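If you do want to tune DBSCAN, eps and minPts matter a lot. A common way to choose eps is the k-nearest-neighbor distance plot provided by the dbscan package; here is a minimal sketch (k = 20 simply mirrors the minPts used above, and the horizontal line just marks the eps = 1 we tried):
kNNdistplot(iris3, k = 20) #sorted distance to each point's 20th nearest neighbor
abline(h = 1, lty = 2) #look for the 'knee' in the curve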
Let's look at one last algo: the ROCK. No, not that ROCK.
library(cba)
distm <- as.matrix(dist)
rock <- rockCluster(distm, 3, theta = .02)
## Clustering:
## computing distances ...
## computing links ...
## computing clusters ...
iris$rock <- rock$cl
levels(iris$rock) <- c('setosa', 'versicolor', 'virginica')
ggplot() +
geom_point(aes(iris2$Sepal.Length, iris2$Sepal.Width, col = rock$cl))
err <- (iris$Species == iris$rock)
table(err)
## err
## FALSE TRUE
## 24 126
While it may not look like it, the ROCK seems to do the best job of determining clusters in this data - the error rate dropped to 16%. Typically this method is reserved for categorical data, but since we used dist it shouldn't cause any problems.
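A quick cross-tabulation of the true species against the ROCK labels shows where the remaining 24 errors fall:
table(iris$Species, iris$rock) #rows are true species, columns are ROCK clusters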
I have been working on a project using some of these (and similar) data mining procedures to explore spatial data and search for distinct groups. While clustering the iris data may not be all that meaningful, I think it is illustrative of the power of clustering. I have yet to try higher-dimension clustering techniques, which might perform even better.
Have any comments? Questions? Suggestions for future posts? I am always happy to hear from readers; please contact me!
Happy clustering,
Kiefer
Top comments (2)
Thanks for the walkthrough, Kiefer. I'm always so fascinated by cluster or nearest-neighbor analysis. It seems to accomplish the very difficult -- modeling non-parametric, complex trends -- in such a straightforward way that's usually based on Euclidean distance. Woo!
Clustering wines
K-Means
Learn R here: hackr.io/tutorials/learn-r
This first example shows how to perform cluster analysis with R. The rattle library is loaded in order to use the wine data set.
install.packages('rattle')
data(wine, package='rattle')
head(wine)
Type Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids
1 1 14.23 1.71 2.43 15.6 127 2.80 3.06
2 1 13.20 1.78 2.14 11.2 100 2.65 2.76
3 1 13.16 2.36 2.67 18.6 101 2.80 3.24
4 1 14.37 1.95 2.50 16.8 113 3.85 3.49
5 1 13.24 2.59 2.87 21.0 118 2.80 2.69
6 1 14.20 1.76 2.45 15.2 112 3.27 3.39
Nonflavanoids Proanthocyanins Color Hue Dilution Proline
1 0.28 2.29 5.64 1.04 3.92 1065
2 0.26 1.28 4.38 1.05 3.40 1050
3 0.30 2.81 5.68 1.03 3.17 1185
4 0.24 2.18 7.80 0.86 3.45 1480
5 0.39 1.82 4.32 1.04 2.93 735
6 0.34 1.97 6.75 1.05 2.85 1450
In this data set we observe the composition of different wines. Given a set of observations $(x_1, x_2, \dots, x_n)$, where each observation is a $d$-dimensional real vector, $k$-means clustering aims to partition the $n$ observations into $k$ sets ($k \le n$), $S = \{S_1, S_2, \dots, S_k\}$, so as to minimize the within-cluster sum of squares (WCSS). In other words, its objective is to find:
$$\operatorname*{arg\,min}_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$
where $\mu_i$ is the mean of the points in $S_i$. The clustering optimization problem is solved with the function kmeans in R.
wine.stand <- scale(wine[-1]) # To standardize the variables
K-Means
k.means.fit <- kmeans(wine.stand, 3) # k = 3
k.means.fit contains all the elements of the cluster output:
attributes(k.means.fit)
$names
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
$class
[1] "kmeans"
Centroids:
k.means.fit$centers
Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids
1 0.1644 0.8691 0.1864 0.5229 -0.07526 -0.97658 -1.21183
2 0.8329 -0.3030 0.3637 -0.6085 0.57596 0.88275 0.97507
3 -0.9235 -0.3929 -0.4931 0.1701 -0.49033 -0.07577 0.02075
Nonflavanoids Proanthocyanins Color Hue Dilution Proline
1 0.72402 -0.7775 0.9389 -1.1615 -1.2888 -0.4059
2 -0.56051 0.5787 0.1706 0.4727 0.7771 1.1220
3 -0.03344 0.0581 -0.8994 0.4605 0.2700 -0.7517
Clusters:
k.means.fit$cluster
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3
[71] 3 3 3 2 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3
[106] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 2 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1
[141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[176] 1 1 1
Cluster size:
k.means.fit$size
[1] 51 62 65
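As a sanity check, the WCSS objective defined earlier can be recomputed by hand from the cluster assignments and centroids and compared to what kmeans reports (an illustrative sketch):
wcss <- sum(sapply(1:3, function(i) {
  pts <- wine.stand[k.means.fit$cluster == i, , drop = FALSE]
  sum(sweep(pts, 2, k.means.fit$centers[i, ])^2) #squared distances to centroid i
}))
all.equal(wcss, k.means.fit$tot.withinss) #should be TRUE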
A fundamental question is how to determine the value of the parameter k. One approach is to look at the percentage of variance explained as a function of the number of clusters: one should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion".
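A minimal sketch of that elbow plot (the range of k and the use of nstart = 25 are illustrative choices; the y-axis is betweenss/totss, i.e. the share of variance explained by the clustering):
var.explained <- sapply(1:15, function(k) {
  fit <- kmeans(wine.stand, centers = k, nstart = 25)
  fit$betweenss / fit$totss #proportion of total variance explained
})
plot(1:15, var.explained, type = 'b',
     xlab = 'Number of clusters k',
     ylab = 'Proportion of variance explained') #look for the elbow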