K-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (called the cluster center or centroid), which serves as a prototype of the cluster. K-means clustering is an unsupervised machine learning algorithm: it does not require any labels or classes for the data points, but instead tries to discover the inherent structure or patterns in the data.
The main idea behind k-means clustering is to minimize the within-cluster variation, which is measured by the sum of squared distances from each data point to its assigned cluster center. The lower the within-cluster variation, the more compact and homogeneous the clusters are. The algorithm tries to find the optimal cluster centers that minimize this variation.
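To make the objective concrete, here is a minimal sketch of the within-cluster sum of squares in Python/numpy, assuming X is an (n, d) array of points, C an array of cluster assignments, and M a (k, d) array of centers (wcss is a hypothetical helper name for illustration, not a standard API):

    import numpy as np

    def wcss(X, C, M):
        # Sum of squared Euclidean distances from each point to its assigned center
        return np.sum((X - M[C]) ** 2)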
Working
Step 1: Choose k initial cluster centers randomly from the data points. These are the initial guesses for the cluster means.
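In numpy, a simple way to do this is to sample k distinct data points; a sketch on toy data:

    import numpy as np

    X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0]])  # toy data
    k = 2
    rng = np.random.default_rng(seed=0)
    # Pick k distinct rows of X as the initial cluster centers
    M = X[rng.choice(len(X), size=k, replace=False)]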
Step 2: Assign each data point to the closest cluster center based on some distance measure, such as Euclidean distance. This forms k clusters of data points.
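Continuing the sketch above, Step 2 comes down to two array operations:

    # distances[i, j] = Euclidean distance from point i to center j
    distances = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
    # Index of the nearest center for each point
    C = np.argmin(distances, axis=1)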
Step 3: Recalculate the cluster centers by taking the mean of all the data points assigned to each cluster. This updates the cluster means.
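And Step 3, still continuing the same sketch (with a guard that keeps the old center when a cluster ends up empty, a corner case the plain algorithm leaves undefined):

    for j in range(k):
        cluster_j = X[C == j]              # points assigned to cluster j
        if len(cluster_j) > 0:
            M[j] = cluster_j.mean(axis=0)  # new center = mean of cluster j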
Step 4: Repeat steps 2 and 3 until the cluster assignments do not change or a maximum number of iterations is reached. This means the algorithm has converged to a solution.
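One way to write the stopping test from Step 4 is to compare the assignments before and after an iteration, e.g.:

    C_old = C.copy()
    # ... rerun Steps 2 and 3 here ...
    converged = np.array_equal(C, C_old)  # True once no point changed cluster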
Pseudocode
# Input: data points X (an n-by-d numpy array), number of clusters k,
#        maximum number of iterations max_iter
# Output: cluster assignments C, cluster centers M
import numpy as np

def kmeans(X, k, max_iter=100):
    rng = np.random.default_rng()
    # Step 1: initialize the k cluster centers with distinct random data points
    M = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    # Initialize cluster assignments (-1 means "not assigned yet")
    C = np.full(len(X), -1)
    # Loop until convergence or the maximum number of iterations
    for _ in range(max_iter):
        # Step 2: assign each data point to the closest cluster center
        # distances[i, j] = Euclidean distance from point i to center j
        distances = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
        C_new = np.argmin(distances, axis=1)
        # Converged: stop early once the assignments no longer change
        if np.array_equal(C_new, C):
            break
        C = C_new
        # Step 3: recalculate each cluster center as the mean of its cluster
        for j in range(k):
            cluster_j = X[C == j]
            if len(cluster_j) > 0:  # keep the old center if a cluster is empty
                M[j] = cluster_j.mean(axis=0)
    # Return the final cluster assignments and cluster centers
    return C, M
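A quick usage sketch on toy data (the exact label values depend on the random initialization):

    X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                  [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
    C, M = kmeans(X, k=2)
    print(C)  # cluster label per point, e.g. [0 0 1 1 0 1]
    print(M)  # the two cluster centers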
Advantages
- It is simple and easy to implement.
- It is fast and efficient: each iteration costs roughly O(n·k·d) for n points, k clusters, and d dimensions.
- It can handle large data sets and high dimensional data.
- It can produce tight and spherical clusters.
Disadvantages
- It requires choosing k in advance, which can be difficult or arbitrary.
- It is sensitive to initialization and may converge to local optima (a common mitigation is sketched after this list).
- It is not robust to outliers and noise.
- It assumes spherical clusters and may fail for complex shapes.
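Because of the sensitivity to initialization, a common mitigation is to run the algorithm several times and keep the best result. A sketch, reusing the toy X from the usage example, the kmeans function from the pseudocode section, and the hypothetical wcss helper from earlier:

    best_score, best_C, best_M = float("inf"), None, None
    for _ in range(10):  # 10 random restarts
        C, M = kmeans(X, k=2)
        score = wcss(X, C, M)  # lower within-cluster variation is better
        if score < best_score:
            best_score, best_C, best_M = score, C, M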
References and Further Reading
- K Means Clustering with Simple Explanation for Beginners - Analytics Vidhya
- K means Clustering - Introduction - GeeksforGeeks
- k-means clustering - Wikipedia
This post is part of a multipart series that will cover more clustering algorithms, each with its working, pseudocode, advantages, and disadvantages.
Please stay tuned for more such content.
If you liked this post, please share it with your friends and fellow developers. And don’t forget to follow us for more programming tutorials and examples! 😊
And also,
have a look👀 @ my Portfolio
code👨‍💻 together @ Github
connect🔗 @ LinkedIn