# k-Means Clustering: Unsupervised Machine Learning

### k-Means Clustering

#### Definition and Purpose

k-Means Clustering is an unsupervised machine learning algorithm used to partition a dataset into `k` distinct, non-overlapping clusters. Each cluster is defined by its centroid, which is the mean of the data points within that cluster. The main purpose of k-means clustering is to group similar data points together and discover underlying patterns or structures in the data without predefined labels.

#### Key Objectives:

• Cluster Formation: Partitioning the data into `k` clusters based on similarity.
• Centroid Calculation: Determining the mean position of all the points in a cluster.
• Minimizing Variance: Reducing the variance within each cluster so that points in the same cluster are as similar as possible (formalized just below).
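
Stated as an optimization problem, k-means searches for the cluster assignments and centroids that minimize the within-cluster sum of squared distances (the quantity called inertia later in this article). In the same notation style as the distance formula below:

`J = sum_{i=1}^{k} sum_{x in C_i} ||x - mu_i||^2`

where `C_i` is the set of points assigned to cluster `i` and `mu_i` is that cluster's centroid. Minimizing `J` is precisely the "minimizing variance" objective listed above.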

### How k-Means Clustering Works

1. Initialization: The algorithm starts by randomly selecting `k` initial centroids.

• Random Initialization: Choose `k` data points at random from the dataset to serve as initial centroids.

2. Assignment Step: Each data point is assigned to the nearest centroid, forming `k` clusters.

• Distance Metric: Commonly uses Euclidean distance to measure the closeness of data points to centroids.
• Euclidean Distance: `d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)`
• This measures the straight-line distance between two points `p` and `q` in n-dimensional space.

3. Update Step: The centroids are recalculated as the mean of all data points assigned to each cluster.

• Centroid Update:
• For each cluster, compute the mean of all the data points in that cluster.
• Update the centroid to this mean position.

4. Iteration: Steps 2 and 3 are repeated until convergence, i.e., until the centroids no longer change significantly (see the NumPy sketch after this list).

• Convergence: The algorithm stops when the centroids stabilize or the changes are minimal.
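
The assignment and update steps above translate directly into NumPy. The following is a minimal sketch of the full loop under illustrative assumptions (the names `X`, `k`, `max_iters`, and `tol` are chosen for this example, and empty clusters are handled in one simple way); it is not the scikit-learn implementation used later:

```python
import numpy as np

def kmeans_sketch(X, k, max_iters=100, tol=1e-4, seed=42):
    """Minimal k-means loop: random init, assignment step, update step, convergence check."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # 2. Assignment step: Euclidean distance from every point to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # 3. Update step: each centroid moves to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # 4. Convergence: stop once the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids

    return labels, centroids
```

Keeping the previous centroid when a cluster ends up empty is just one simple way to handle that edge case; library implementations typically use more careful strategies for initialization and empty clusters.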

### Key Concepts

1. Centroid: The central point of a cluster, calculated as the mean of all points in the cluster.

2. Inertia: A measure of how internally coherent the clusters are. It is the sum of squared distances of samples to their closest centroid.

• Inertia Minimization: The goal of k-means is to minimize inertia, thus creating compact clusters.

3. Cluster Assignment: The process of assigning data points to the nearest centroid based on a distance metric.

4. Number of Clusters (k): The value of `k` is a critical hyperparameter. It can be chosen based on prior knowledge, domain expertise, or methods like the elbow method or silhouette analysis (both are sketched in code after this list).

• Elbow Method: Plot inertia for different values of `k` and look for an "elbow point" where the rate of decrease slows.
• Silhouette Analysis: Measure how similar a data point is to its own cluster compared to other clusters.

5. Iterative Refinement: The process of repeatedly updating cluster assignments and centroids until convergence.
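
Both heuristics for choosing `k` can be sketched with scikit-learn. The snippet below is illustrative: the dataset comes from `make_blobs` with parameters picked for the example, and the range of `k` values tried is arbitrary:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 clusters by construction (pretend we do not know this)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

k_values = range(2, 9)
inertias, silhouettes = [], []

for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)                          # within-cluster sum of squares
    silhouettes.append(silhouette_score(X, km.labels_))   # mean silhouette over all points

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(list(k_values), inertias, marker='o')
ax1.set(title='Elbow method', xlabel='k', ylabel='Inertia')
ax2.plot(list(k_values), silhouettes, marker='o')
ax2.set(title='Silhouette analysis', xlabel='k', ylabel='Mean silhouette score')
plt.show()
```

The elbow plot is read for the point where inertia stops dropping sharply, while the silhouette plot is read for the `k` with the highest mean score; the two heuristics do not always agree.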

### k-Means Clustering Example

k-Means Clustering is an unsupervised learning algorithm that partitions a dataset into `k` distinct clusters. This example demonstrates how to implement k-Means clustering using synthetic data, evaluate the clustering results, and visualize the clusters along with their centroids.

#### Python Code Example

1. Import Libraries

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
```

This block imports the libraries needed for numerical work, plotting, and the KMeans implementation from scikit-learn.

2. Generate Sample Data with 3 Clusters

```python
np.random.seed(42)  # For reproducibility
n_samples = 300

# Cluster 1: Centered at (0, 0)
X1 = np.random.randn(n_samples // 3, 2) * 0.5

# Cluster 2: Centered at (3, 3)
X2 = np.random.randn(n_samples // 3, 2) * 0.5 + [3, 3]

# Cluster 3: Centered at (-2, 3)
X3 = np.random.randn(n_samples // 3, 2) * 0.5 + [-2, 3]

# Combine all clusters
X = np.vstack((X1, X2, X3))
```

This block generates synthetic data for three clusters located in different regions of the feature space.

3. Perform k-Means Clustering

```python
n_clusters = 3
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X)
```

This block initializes the k-Means algorithm with the specified number of clusters and fits it to the dataset, predicting the cluster labels for each data point.
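
Once fitted, the estimator exposes the learned centroids, the final inertia, and a `predict` method for assigning new points. A brief sketch (the query points below are made up for illustration):

```python
print("Cluster centers:\n", kmeans.cluster_centers_)   # learned centroids, one row per cluster
print("Inertia:", kmeans.inertia_)                      # sum of squared distances to closest centroid

# Assign previously unseen points to the nearest learned centroid
new_points = np.array([[0.2, -0.1], [2.8, 3.1]])        # hypothetical query points
print("Predicted clusters:", kmeans.predict(new_points))
```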

4. Visualize the Clustering Results

```python
plt.figure(figsize=(12, 8))

# Plot the data points
scatter = plt.scatter(X[:, 0], X[:, 1], c=cluster_labels, cmap='viridis', alpha=0.7)

# Plot the cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='x', s=200, linewidths=3)

plt.title(f'K-means Clustering (k={n_clusters})')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(scatter)

# Add text annotations for each cluster
for i, center in enumerate(centers):
    plt.annotate(f'Cluster {i}', center, fontsize=12, fontweight='bold',
                 xytext=(5, 5), textcoords='offset points')

plt.show()
```

This block visualizes the clustering results, showing the data points colored by their cluster labels and marking the cluster centroids with red 'X' symbols.

Output: a scatter plot of the three clusters, colored by label, with red 'X' markers at the centroids.

This visualization helps to illustrate how k-Means clustering groups the data points into distinct clusters, providing insights into the structure and distribution of the data.

By following this structured approach, you can effectively implement and visualize k-Means clustering for various datasets, gaining valuable insights into their underlying patterns.