DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a density based clustering algorithm that can find clusters of arbitrary shapes and sizes, and also identify outliers or noise in the data. DBSCAN does not require specifying the number of clusters in advance, unlike k-means. Instead, it relies on two parameters:
- epsilon
- min_samples
Epsilon is the maximum distance between two data points to be considered as neighbors. min_samples is the minimum number of data points required to form a dense region. A data point is considered as a core point if it has at least min_samples neighbors within epsilon distance. A data point is considered as a border point if it has fewer than min_samples neighbors, but is within epsilon distance of a core point. A data point is considered as a noise point if it is neither a core point nor a border point.
Working
Step 1: Mark all data points as unvisited.
Step 2: Choose a random unvisited data point p and mark it as visited.
Step 3: Find all the neighbors of p within epsilon distance and store them in a set N.
Step 4: If N has at least min_samples elements, then p is a core point and a new cluster C is formed with p and N. Otherwise, p is a noise point and no cluster is formed.
Step 5: For each core point q in N, mark it as visited and find its neighbors within epsilon distance. If q has at least min_samples neighbors, then add them to N. If q is not assigned to any cluster, then assign it to C.
Step 6: Repeat step 5 until all core points in N are visited and assigned to C.
Step 7: Repeat steps 2 to 6 until all data points are visited.
Pseudocode
# Input: data points X, maximum distance epsilon, minimum number of neighbors min_samples
# Output: cluster assignments C
# Initialize cluster index
C = 0
# Mark all data points as unvisited
visited = false_array(X.size)
# Loop through all data points
for i in range(X.size):
# Skip visited data points
if visited[i]:
continue
# Mark current data point as visited
visited[i] = true
# Find neighbors of current data point within epsilon distance
N = find_neighbors(X[i], X, epsilon)
# Check if current data point is a core point
if N.size >= min_samples:
# Increment cluster index
C = C + 1
# Assign current data point to current cluster
C[i] = C
# Loop through all neighbors of current data point
for j in N:
# Skip visited neighbors
if visited[j]:
continue
# Mark neighbor as visited
visited[j] = true
# Find neighbors of neighbor within epsilon distance
N' = find_neighbors(X[j], X, epsilon)
# Check if neighbor is a core point
if N'.size >= min_samples:
# Add neighbors of neighbor to N
N = N U N'
# Check if neighbor is not assigned to any cluster
if C[j] == 0:
# Assign neighbor to current cluster
C[j] = C
# Else current data point is a noise point
else:
# Assign current data point to noise cluster (-1)
C[i] = -1
# Return the final cluster assignments
return C
Advantages
- It can handle irregularly shaped and sized clusters, unlike k-means which assumes spherical clusters.
- It can detect outliers or noise points and separate them from the clusters.
- It does not require specifying the number of clusters in advance, unlike k-means which needs k as an input.
- It is insensitive to the order of the data points in the dataset.
Disadvantages
- It can struggle with clusters of similar density, as it may merge them into one cluster or split them into multiple clusters.
- It is sensitive to the choice of epsilon and min_samples parameters, which can affect the quality and number of clusters.
- It can have difficulty with high dimensional data, as the distance measure becomes less meaningful and the density varies a lot.
- It is not suitable when there are different densities involved in the clusters, as it may miss some clusters or include some noise points.
References and Further Reading
- Understand The DBSCAN Clustering Algorithm! - Analytics Vidhya
- DBSCAN Clustering: How Does It Work? | Baeldung on Computer Science
- DBSCAN Clustering in ML | Density based clustering
- DBSCAN - Wikipedia
- What is DBSCAN - Online Tutorials Library
- DBSCAN | Engati
- When to use DBSCAN - Crunching the Data
- Density-based algorithms. The pure apprehension of two… | by Zanfina…
This will be a multipart series which will follow up with more clustering algorithms with their working, pseudocode, advantages and disadvantages.
Please stay tuned for more such content.
If you liked this post, please share it with your friends and fellow developers. And don’t forget to follow us for more programming tutorials and examples! 😊
And also,
have a look👀 @ my Portfolio
code👨💻 together @ Github
connect🔗 @ LinkedIn
Top comments (0)