Introduction
This post explores the intersection of physics and programming, specifically how physical phenomena can be used to cluster data. A teacher at my university suggested implementing an interesting method. Our new clustering method, called Surface Tension Clustering, aims to address some limitations of existing clustering algorithms by leveraging the physical analogy of surface tension in liquids. It identifies cluster boundary points based on the imbalance of attractive forces, much as surface tension acts on molecules at the surface of a liquid.
But first, let's review what clustering is and which methods already exist.
General Information on Clustering
Clustering is an unsupervised learning method, meaning it does not rely on pre-labeled data to train a model. The main goal is to find natural groupings within the data. There are several main approaches to clustering:
- Centroid-based methods: Clusters are represented by a central point (centroid). Example: K-means clustering.
- Hierarchical methods: Clusters are formed into a hierarchy, either by merging smaller clusters into larger ones (agglomerative) or splitting larger clusters into smaller ones (divisive).
- Density-based methods: Clusters are defined as dense regions in the data space. Example: DBSCAN.
Popular Clustering Methods
K-means Clustering
K-means is one of the most widely used clustering algorithms. It partitions data into K clusters, where each data point belongs to the cluster with the nearest mean. Here’s a simple implementation in Python:
from sklearn.cluster import KMeans
import numpy as np
# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
# Create KMeans instance with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
# Get cluster centers
centers = kmeans.cluster_centers_
# Predict the cluster for new samples
predictions = kmeans.predict([[0, 0], [4, 4]])
Mean Shift Clustering
Mean shift is a sliding-window-based algorithm that attempts to find dense areas of data points. It automatically determines the number of clusters.
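As a minimal sketch of how this looks with scikit-learn (using the same toy points as the other examples; the `bandwidth` value here is an illustrative choice, and scikit-learn can also estimate it automatically when it is omitted):

```python
import numpy as np
from sklearn.cluster import MeanShift

# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Create MeanShift instance; the bandwidth controls the sliding-window radius
ms = MeanShift(bandwidth=2).fit(X)

# The number of clusters is inferred from the data, not specified up front
labels = ms.labels_
centers = ms.cluster_centers_
```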
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is a density-based clustering algorithm that groups points that are closely packed, marking points that lie alone in low-density regions as outliers.
from sklearn.cluster import DBSCAN
import numpy as np
# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
# Create DBSCAN instance
dbscan = DBSCAN(eps=1, min_samples=2).fit(X)
# Get labels
labels = dbscan.labels_
Gaussian Mixture Models (GMM)
GMM is a probabilistic model that assumes all data points are generated from a mixture of several Gaussian distributions with unknown parameters.
from sklearn.mixture import GaussianMixture
import numpy as np
# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
# Create GMM instance
gmm = GaussianMixture(n_components=2).fit(X)
# Predict cluster for new samples
predictions = gmm.predict([[0, 0], [4, 4]])
Hierarchical Clustering
Hierarchical clustering builds a hierarchy of clusters either by merging smaller clusters into larger ones (agglomerative) or splitting larger clusters into smaller ones (divisive).
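A short agglomerative example with scikit-learn, again on the same toy points (the choice of 2 clusters and the default Ward linkage are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Bottom-up (agglomerative) clustering: repeatedly merge the closest
# clusters until only n_clusters remain (Ward linkage by default)
agg = AgglomerativeClustering(n_clusters=2).fit(X)
labels = agg.labels_
```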
Introducing a New Clustering Method: Surface Tension Clustering
Concept and Algorithm
The Surface Tension Clustering method is inspired by the behavior of molecules in liquids, where surface molecules experience an inward force due to the lack of neighboring molecules above them. In this analogy, boundary points of clusters are determined by the imbalance of vector forces from surrounding points.
Here’s a step-by-step outline of the algorithm:
- Distance Calculation: Compute the distances between all pairs of points.
- Neighborhood Identification: For each point, identify the nearest neighbors (typically 4-6 points).
- Center Calculation: Determine the center of mass for the neighboring points.
- Force Calculation: Calculate the resultant force acting on each point. If the force is significant, the point is considered a boundary point.
- Cluster Formation: Use the boundary points to form clusters and identify the support points, which are the closest points between different clusters.
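The boundary-detection steps above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the exact implementation: the neighbourhood size `k` and the dimensionless `threshold` are assumptions, and the "force" is approximated as the displacement from a point to the centre of mass of its neighbours, normalised by the mean neighbour distance.

```python
import numpy as np

def boundary_points(X, k=5, threshold=0.5):
    """Flag boundary points via the surface-tension analogy (sketch).

    Interior points have neighbours on all sides, so the pull towards the
    neighbourhood's centre of mass roughly cancels out; boundary points
    feel a strong net inward force, like surface molecules in a liquid.
    """
    # Step 1: pairwise distances between all points
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)

    flags = np.zeros(len(X), dtype=bool)
    for i in range(len(X)):
        # Step 2: k nearest neighbours (index 0 is the point itself)
        neighbours = np.argsort(dists[i])[1:k + 1]
        # Step 3: centre of mass of the neighbourhood
        center = X[neighbours].mean(axis=0)
        # Step 4: resultant force ~ displacement towards that centre,
        # normalised by the local scale so the threshold is dimensionless
        force = np.linalg.norm(center - X[i]) / dists[i][neighbours].mean()
        flags[i] = force > threshold
    return flags
```

On a regular grid of points, for example, a point in the middle has symmetric neighbours and near-zero resultant force, while a corner point is pulled strongly inward and gets flagged as a boundary point.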
How did I improve it?
Before applying the main clustering algorithm, I divide the points into smaller clusters using k-means and then look for the boundary points within them. I also wrote an interface using PyQt, which I will demonstrate below.
Demonstration of the work
So, here you can see the interface of the program, where I have specified random points, and different colours represent different clusters within which the boundary points will be searched. You can also adjust the sensitivity of the algorithm here.
Research can also be carried out in 3D space:
Now I'll show you a more visual example of how the algorithm works. In the picture, you can see that the red dots are the boundary points; if you connect them with a line, you can imagine that the grey dots are inside a water drop, i.e. they belong to the same cluster.
Conclusion
Clustering is a powerful tool for uncovering hidden structures in data. Traditional methods like K-means and DBSCAN are widely used, but the new Surface Tension Clustering method aims to improve on them in efficiency, adaptability, and scalability. By leveraging the physical analogy of surface tension, we offer a novel approach to clustering that addresses some of the limitations of existing methods. You can find the GitHub repository here.
Thank you for reading this far, I'd love to read your thoughts in the comments!