Cluster Analysis using K-means

Avinash Gupta

Introduction:

The k-means algorithm searches for a predetermined number of clusters in an unlabeled multidimensional dataset. It rests on a simple notion of what an optimal clustering looks like, built from two ideas:

  • First, each cluster center is the arithmetic mean (AM) of all the data points assigned to that cluster.
  • Second, each point lies closer to its own cluster center than to any other cluster center.

These two conditions are the foundation of the k-means clustering model.

You can think of a center as a point that summarizes the mean of its cluster; it need not itself be a member of the dataset.

In simple terms, k-means clustering groups the data by discovering the distinct categories in an unlabeled dataset on its own, with no labeled training data.

It is a centroid-based algorithm: each cluster is associated with a centroid, and the objective is to minimize the sum of distances between the data points and the centroids of their assigned clusters.
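In symbols (a standard formulation of the objective, not spelled out in the original post), k-means minimizes the within-cluster sum of squares

$$J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,$$

where $C_j$ is the set of points assigned to cluster $j$ and $\mu_j$ is its centroid.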

Concretely, the k-means algorithm performs two tasks:

  • It determines the positions of the K center points, or centroids, by an iterative method.
  • It assigns every data point to its nearest centroid; the points assigned to a given centroid form a cluster. Data points within a cluster are therefore similar to one another and distinct from points in other clusters.

Explanation:

K-means is an instance of the Expectation-Maximization (E-M) algorithm, a versatile approach that appears in many contexts across data science. The E-M procedure here has two parts:

  1. Guess some initial cluster centers.
  2. Repeat until the assignments stop changing:
  • E-step: assign each data point to the closest cluster center.
  • M-step: move each cluster center to the mean of the points assigned to it.

Here the E-step is the Expectation step: it updates our expectation of which cluster each data point belongs to. The M-step is the Maximization step: it maximizes a fitness function that determines the locations of the cluster centers; in this case, that maximization is achieved by setting each center to the mean of the data points in its cluster.

Under mild conditions, each repetition of the E-step and M-step is guaranteed to yield an improved estimate of the clusters' characteristics.
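To make the loop concrete, here is a minimal from-scratch sketch in NumPy. It is illustrative code, not from the original post: the function name and the random initialization are my own choices, and it assumes no cluster ever ends up empty.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Guess: pick k distinct data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # E-step: assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each center to the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels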

K-means uses an iterative procedure to produce its final clustering from a predefined number of clusters, chosen to suit the dataset and denoted by the variable K.

For instance, if K is set to 3, the dataset is partitioned into 3 clusters; if K is 4, the number of clusters is 4, and so on.

The fundamental aim is to define k centers, one for each cluster. These centers must be placed carefully, because different placements produce different outcomes, so it is best to put them as far away from each other as possible.

Also, the maximum number of plausible clusters equals the total number of observations in the dataset.
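In scikit-learn, the k-means++ initialization (the library default) handles this placement heuristic for you by seeding the centers far apart, and n_init restarts the algorithm several times to avoid a poor local optimum. A minimal sketch:

from sklearn.cluster import KMeans

# k-means++ seeding spreads the initial centers apart;
# n_init reruns the fit and keeps the best solution
model = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=55121)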

Code:

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from sklearn.decomposition import PCA
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi 
%matplotlib inline
RND_STATE = 55121  # fixed seed so the splits and fits are reproducible

Loading Data:

# Load the AddHealth data and normalize the column names to upper case
data = pd.read_csv("data/tree_addhealth.csv")
data.columns = map(str.upper, data.columns)

# Drop rows with missing values
data_clean = data.dropna()

# Subset the eleven clustering variables
cluster = data_clean[['ALCEVR1','MAREVER1','ALCPROBS1','DEVIANT1','VIOL1',
                      'DEP1','ESTEEM1','SCHCONN1','PARACTV','PARPRES','FAMCONCT']]

cluster.describe()

Preprocessing Data:

# Standardize each clustering variable to mean 0 and standard deviation 1
clustervar = cluster.copy()
for col in clustervar.columns:
    clustervar[col] = preprocessing.scale(clustervar[col].astype('float64'))

# Hold out 30% of the observations as a test split
clus_train, clus_test = train_test_split(clustervar, test_size=0.3, random_state=RND_STATE)
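The snippet above standardizes every column before splitting. A common variant, sketched below on the assumption that you want to keep the test rows from influencing the scaling statistics, splits first and fits a StandardScaler on the training rows only (train_raw and test_raw are illustrative names):

from sklearn.preprocessing import StandardScaler

# Split the raw variables first, then fit the scaler on the training rows
train_raw, test_raw = train_test_split(cluster, test_size=0.3, random_state=RND_STATE)
scaler = StandardScaler().fit(train_raw)
clus_train = pd.DataFrame(scaler.transform(train_raw),
                          columns=cluster.columns, index=train_raw.index)
clus_test = pd.DataFrame(scaler.transform(test_raw),
                         columns=cluster.columns, index=test_raw.index)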

K-means analysis for k = 1 through 9:

clusters = range(1, 10)
meandist = []

# Fit k-means for k = 1..9 and record the average distance from each
# point to its nearest cluster center
for k in clusters:
    model = KMeans(n_clusters=k, random_state=RND_STATE)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])

Relation between the number of clusters and average distance:

plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
plt.show()
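An alternative way to draw the same kind of elbow curve, if you are content with scikit-learn's own objective, is the fitted model's inertia_ attribute (the within-cluster sum of squared distances), which avoids the cdist computation:

inertias = []
for k in clusters:
    model = KMeans(n_clusters=k, random_state=RND_STATE).fit(clus_train)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances

plt.plot(clusters, inertias)
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Selecting k with the Elbow Method')
plt.show()

Note that inertia is a sum of squared distances while the post's metric is a mean Euclidean distance, so the two curves are not identical, but both serve as elbow heuristics.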

Plot of the elbow curve:

[Figure: average distance to the nearest cluster center versus the number of clusters]

Solution for the 3-cluster model:

# Fit the 3-cluster solution on the training data
model3 = KMeans(n_clusters=3, random_state=RND_STATE)
model3.fit(clus_train)
clusassign = model3.predict(clus_train)

# Project the 11 clustering variables onto 2 principal components
# so the cluster assignments can be plotted in the plane
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_)
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.title('Scatterplot of the First Two Principal Components for 3 Clusters')
plt.show()

# Pair each training-row index with its assigned cluster label
clus_train.reset_index(level=0, inplace=True)
cluslist = list(clus_train['index'])
labels = list(model3.labels_)
newlist = dict(zip(cluslist, labels))

# Turn the index-to-label mapping into a one-column DataFrame
newclus = DataFrame.from_dict(newlist, orient='index')
newclus.columns = ['cluster']
newclus.describe()
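As a complementary check on the choice k = 3 (not in the original post), the silhouette score summarizes how tight and well separated the clusters are. Run it right after model3.predict, before the reset_index call above, so the added index column is not treated as a feature:

from sklearn.metrics import silhouette_score

# Mean silhouette coefficient over the training points: values near 1
# indicate dense, well-separated clusters; values near 0, overlapping ones
print('Silhouette score for k=3:', silhouette_score(clus_train, model3.labels_))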

Plotting Clusters:

[Figure: training points on the first two principal components, colored by cluster]

# Attach the cluster labels to the training data by row index
newclus.reset_index(level=0, inplace=True)
merged_train = pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)

# Count how many training observations fall in each cluster
merged_train.cluster.value_counts()

Output of the merge:

[Table: merged training data with cluster labels, and observation counts per cluster]

# Mean of each clustering variable within each cluster
clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)

Output of cluster variable means:

[Table: mean of each clustering variable, by cluster]

# Validate the clusters against an external variable: does mean GPA (GPA1)
# differ across the three clusters?
gpa_data = data_clean['GPA1']
gpa_train, gpa_test = train_test_split(gpa_data, test_size=.3, random_state=RND_STATE)
gpa_train1 = pd.DataFrame(gpa_train)
gpa_train1.reset_index(level=0, inplace=True)
merged_train_all = pd.merge(gpa_train1, merged_train, on='index')
sub1 = merged_train_all[['GPA1', 'cluster']].dropna()

# One-way ANOVA: GPA as a function of cluster membership (categorical)
gpamod = smf.ols(formula='GPA1 ~ C(cluster)', data=sub1).fit()
print(gpamod.summary())

OLS regression results:

[Table: OLS regression of GPA1 on C(cluster)]
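Because the formula GPA1 ~ C(cluster) treats cluster as a categorical factor, this model is a one-way ANOVA of GPA across clusters. If you prefer the compact ANOVA table to the full OLS summary, statsmodels can derive one from the same fitted model (a small sketch reusing the gpamod object above):

from statsmodels.stats.anova import anova_lm

# F-test for any difference in mean GPA across the three clusters
print(anova_lm(gpamod))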

print('means for GPA by cluster')
m1 = sub1.groupby('cluster').mean()
print(m1)

[Table: mean GPA by cluster]

print('standard deviations for GPA by cluster')
m2 = sub1.groupby('cluster').std()
print(m2)

[Table: standard deviation of GPA by cluster]

# Tukey HSD post-hoc test: which pairs of clusters differ in mean GPA?
mc1 = multi.MultiComparison(sub1['GPA1'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())

Output for Comparison of means:

[Table: Tukey HSD pairwise comparisons of mean GPA between clusters]
