Spectral clustering is a method of clustering data points based on their pairwise similarity, or affinity, rather than their distance to a cluster center. The algorithm uses the eigenvalues and eigenvectors of a graph Laplacian derived from a similarity matrix to project the data into a lower-dimensional space, where the clusters become more separable. Because it works on the similarity graph, it can handle clusters of arbitrary shapes and sizes, unlike k-means, which assumes roughly spherical clusters.
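To get a feel for this before the step-by-step walkthrough, here is a minimal sketch using scikit-learn's off-the-shelf SpectralClustering; the two-moons dataset and all parameter values are illustrative assumptions, not prescriptions:

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Two interlocking half-moons: a shape plain k-means typically splits badly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# affinity="nearest_neighbors" builds a k-NN similarity graph internally.
model = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                           n_neighbors=10, random_state=0)
labels = model.fit_predict(X)
```

In the spectral embedding, the two moons typically become linearly separable, which is exactly what the steps below construct by hand.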
Working
Step 1: Construct a similarity matrix S for the data points, where S[i][j] represents the similarity or affinity between the i-th and j-th data points. The similarity can be measured in various ways, such as a Gaussian kernel, a k-nearest-neighbor graph, or an epsilon-neighborhood graph.
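As an illustration of this step, a Gaussian-kernel similarity matrix takes only a few lines of NumPy; the bandwidth sigma is a hyperparameter you would tune for your data:

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    # Pairwise squared Euclidean distances between all rows of X.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian kernel: similarity decays smoothly with distance.
    return np.exp(-sq_dists / (2 * sigma ** 2))
```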
Step 2: Compute the degree matrix D for the similarity matrix S, where D[i][i] is the sum of the i-th row of S, and D[i][j] is zero for i not equal to j. The degree matrix represents the degree of connectivity of each data point.
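Given S as a NumPy array, this step is a one-liner (a sketch matching the definition above):

```python
import numpy as np

def degree_matrix(S):
    # D[i][i] = sum of row i of S; all off-diagonal entries are zero.
    return np.diag(S.sum(axis=1))
```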
Step 3: Compute the Laplacian matrix L from the degree matrix D and the similarity matrix S, where L = D - S. This unnormalized graph Laplacian encodes the structure of the similarity graph; its smallest eigenvalues and eigenvectors reveal how the graph splits into weakly connected groups. (Normalized variants of the Laplacian are also widely used.)
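A small helper covering both forms; the normalized branch is an aside and assumes no data point has zero degree:

```python
import numpy as np

def laplacian_matrix(D, S, normalized=False):
    # Unnormalized graph Laplacian: L = D - S.
    L = D - S
    if normalized:
        # Symmetric normalized Laplacian: D^(-1/2) (D - S) D^(-1/2).
        d_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
        L = d_inv_sqrt @ L @ d_inv_sqrt
    return L
```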
Step 4: Compute the eigenvalues and eigenvectors of the Laplacian matrix L, and sort the eigenvalues in ascending order, reordering the eigenvectors to match. Choose the k smallest eigenvalues and their corresponding eigenvectors, where k is the number of clusters desired. These eigenvectors act as new features that expose the cluster structure of the data.
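In NumPy this step needs no explicit sort: for a symmetric matrix, eigh already returns the eigenvalues in ascending order, with the eigenvectors as columns (the helper name here is my own):

```python
import numpy as np

def smallest_k_eigenvectors(L, k):
    # eigh: eigenvalues in ascending order, eigenvectors as columns.
    eigenvalues, eigenvectors = np.linalg.eigh(L)
    # Slice columns (not rows) to keep the k smallest eigenpairs.
    return eigenvalues[:k], eigenvectors[:, :k]
```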
Step 5: Form a matrix U by stacking the k eigenvectors as columns. Each row of U represents a data point in the new feature space.
Step 6: Apply k-means clustering on the rows of U to obtain k clusters.
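Steps 5 and 6 combined: the k selected eigenvectors, stacked as columns, already form U with one row per data point, and plain k-means runs on those rows. A sketch assuming scikit-learn's KMeans:

```python
from sklearn.cluster import KMeans

def cluster_embedding(eigenvectors_k, k):
    # Step 5: row i of U is the k-dimensional embedding of point i.
    U = eigenvectors_k
    # Step 6: ordinary k-means in the embedded space.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
```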
Pseudocode
# Input: data matrix X (n_samples x n_features), number of clusters k,
#        Gaussian kernel bandwidth sigma
# Output: cluster assignments C
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    # Step 1: Construct the similarity matrix S (Gaussian kernel here)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    S = np.exp(-sq_dists / (2 * sigma ** 2))
    # Step 2: Compute the degree matrix D
    D = np.diag(S.sum(axis=1))
    # Step 3: Compute the Laplacian matrix L
    L = D - S
    # Step 4: Compute the eigenvalues and eigenvectors of L;
    # eigh returns them already sorted in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(L)
    # Step 5: Keep the eigenvectors (columns, not rows) of the k
    # smallest eigenvalues; the rows of U are the points in the new space
    U = eigenvectors[:, :k]
    # Step 6: Apply k-means clustering on the rows of U
    C = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
    # Return the cluster assignments
    return C
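A quick usage check on synthetic blobs; make_blobs and all parameter values are illustrative assumptions, and sigma in particular must be tuned per dataset:

```python
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=0)
labels = spectral_clustering(X, k=3, sigma=1.0)
print(labels[:10])  # ten cluster ids drawn from {0, 1, 2}
```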
Advantages
- It can handle clusters with complex shapes and sizes, unlike k-means, which assumes roughly spherical clusters.
- It captures the global structure of the data through the spectral properties of the graph Laplacian built from the similarity matrix.
- It is easy to implement using standard linear algebra routines.
Disadvantages
- It requires choosing k in advance, which can be difficult or arbitrary.
- It is sensitive to the choice of similarity measure, which can affect the quality and number of clusters.
- It can be computationally expensive, as it requires building and eigendecomposing a large similarity matrix; a sparse-graph mitigation is sketched below.
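On the last point, one common mitigation is to use a sparse k-nearest-neighbor graph and an iterative eigensolver, so the dense n-by-n matrix is never formed. A sketch, assuming SciPy and scikit-learn are available:

```python
from scipy.sparse import csgraph
from scipy.sparse.linalg import eigsh
from sklearn.neighbors import kneighbors_graph

def sparse_spectral_embedding(X, k, n_neighbors=10):
    # Sparse symmetric k-NN affinity instead of a dense n x n matrix.
    A = kneighbors_graph(X, n_neighbors=n_neighbors, mode="connectivity")
    A = 0.5 * (A + A.T)
    # Sparse unnormalized Laplacian L = D - S.
    L = csgraph.laplacian(A, normed=False)
    # Compute only the k smallest-magnitude eigenpairs iteratively.
    # (which="SM" can converge slowly; shift-invert is a common speed-up.)
    vals, vecs = eigsh(L, k=k, which="SM")
    return vecs  # feed these rows to k-means as before
```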
This post is part of a multipart series covering more clustering algorithms, each with its working, pseudocode, advantages, and disadvantages.
Please stay tuned for more such content.
If you liked this post, please share it with your friends and fellow developers. And don’t forget to follow us for more programming tutorials and examples! 😊
And also,
have a look 👀 @ my Portfolio
code 👨‍💻 together @ GitHub
connect 🔗 @ LinkedIn