Rajaniraiyn R

DBSCAN Clustering Algorithm Demystified

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a density-based clustering algorithm that can find clusters of arbitrary shapes and sizes and identify outliers (noise) in the data. Unlike k-means, DBSCAN does not require specifying the number of clusters in advance. Instead, it relies on two parameters:

  • epsilon
  • min_samples

Epsilon is the maximum distance between two data points for them to be considered neighbors. min_samples is the minimum number of data points required to form a dense region. A data point is a core point if it has at least min_samples neighbors within epsilon distance. A data point is a border point if it has fewer than min_samples neighbors but lies within epsilon distance of a core point. A data point is a noise point if it is neither a core point nor a border point.
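
To make these definitions concrete, here is a minimal sketch in Python that labels each point as core, border, or noise. It is for illustration only (brute-force distances, and classify_points is an invented name, not a library function), and it follows the common convention that a point counts as its own neighbor:

import numpy as np

def classify_points(X, epsilon, min_samples):
    # Pairwise Euclidean distances between all points (X is an (n, d) array);
    # brute force, O(n^2) memory, fine for small illustrative datasets
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # A point is a core point if it has >= min_samples neighbors within
    # epsilon (counting itself, the usual DBSCAN convention)
    is_core = (dists <= epsilon).sum(axis=1) >= min_samples

    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append("core")
        elif is_core[dists[i] <= epsilon].any():
            labels.append("border")  # within epsilon of some core point
        else:
            labels.append("noise")   # neither core nor border
    return labels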

Working

Step 1: Mark all data points as unvisited.
Step 2: Choose a random unvisited data point p and mark it as visited.
Step 3: Find all the neighbors of p within epsilon distance and store them in a set N.
Step 4: If N has at least min_samples elements, then p is a core point and a new cluster C is formed from p and N. Otherwise, p is provisionally labeled noise (it may later turn out to be a border point of another cluster).
Step 5: For each point q in N, mark it as visited and find its neighbors within epsilon distance. If q has at least min_samples neighbors (that is, q is itself a core point), add those neighbors to N. If q is not yet assigned to any cluster, assign it to C.
Step 6: Repeat step 5 until every point in N has been visited and assigned to C.
Step 7: Repeat steps 2 to 6 until all data points are visited.
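
For instance, take the one-dimensional points {1, 2, 3, 10, 11, 12, 50} with epsilon = 2 and min_samples = 3 (counting a point as its own neighbor): the points 1, 2, 3 and 10, 11, 12 each have three neighbors within distance 2, so they form two clusters of core points, while 50 has no neighbors at all and is labeled noise.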

Pseudocode

# Input: data points X, maximum distance epsilon, minimum number of neighbors min_samples
# Output: cluster labels (0 = unassigned, -1 = noise, 1..k = cluster index)

# Initialize cluster index and label array
# (kept separate: C is a counter, labels holds per-point assignments)
C = 0
labels = [0] * len(X)

# Mark all data points as unvisited
visited = [False] * len(X)

# Loop through all data points
for i in range(len(X)):

    # Skip visited data points
    if visited[i]:
        continue

    # Mark current data point as visited
    visited[i] = True

    # Find neighbors of current data point within epsilon distance
    N = find_neighbors(X[i], X, epsilon)

    # Check if current data point is a core point
    if len(N) >= min_samples:

        # Increment cluster index and assign current data point to it
        C = C + 1
        labels[i] = C

        # Grow the cluster: N acts as a seed list that may expand while
        # we walk it, so iterate by index instead of a plain for-loop
        k = 0
        while k < len(N):
            j = N[k]
            k = k + 1

            # A point previously labeled noise is density-reachable
            # from this cluster: reclassify it as a border point
            if labels[j] == -1:
                labels[j] = C

            # Skip visited neighbors
            if visited[j]:
                continue

            # Mark neighbor as visited and assign it to the current cluster
            visited[j] = True
            labels[j] = C

            # Find neighbors of the neighbor within epsilon distance
            N2 = find_neighbors(X[j], X, epsilon)

            # If the neighbor is itself a core point, extend the seed list
            if len(N2) >= min_samples:
                N.extend(N2)

    # Else current data point is (provisionally) a noise point
    else:
        labels[i] = -1

# Return the final cluster labels
return labels
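
The find_neighbors helper above is left abstract. In practice you rarely implement DBSCAN by hand; scikit-learn ships an implementation. A minimal usage sketch on toy data (the eps and min_samples values here are illustrative, not tuned):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape k-means handles poorly
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)

# eps and min_samples are illustrative; tune them for your data
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("clusters found:", len(set(labels) - {-1}))
print("noise points:", np.sum(labels == -1))

Note that scikit-learn labels noise as -1 and numbers clusters from 0, whereas the pseudocode above numbers them from 1.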

Advantages

  • It can handle irregularly shaped and sized clusters, unlike k-means, which assumes roughly spherical clusters.
  • It can detect outliers or noise points and separate them from the clusters.
  • It does not require specifying the number of clusters in advance, unlike k-means, which needs k as an input.
  • It is largely insensitive to the order of the data points; only border points reachable from two clusters can end up assigned differently depending on processing order.

Disadvantages

  • It can struggle to separate clusters of similar density that lie close together, as bridges of dense points may merge them into one cluster; conversely, a poor parameter choice can split one cluster into several.
  • It is sensitive to the choice of the epsilon and min_samples parameters, which affect both the quality and the number of clusters (a common heuristic for choosing epsilon is sketched after this list).
  • It can have difficulty with high-dimensional data, as the distance measure becomes less meaningful and density varies a lot.
  • It is not suitable when the clusters have widely different densities, as a single epsilon may miss sparse clusters or absorb noise points into dense ones.
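
On the parameter-sensitivity point, one widely used heuristic for picking epsilon is the k-distance plot: sort every point's distance to its min_samples-th nearest neighbor and look for the elbow in the curve. A minimal sketch with scikit-learn, reusing the toy data from above (reading the elbow off the plot is still a manual judgment call):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)

min_samples = 5

# Distance from each point to its min_samples-th nearest neighbor
# (the point itself is returned as its own first neighbor)
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

plt.plot(k_dist)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {min_samples}th nearest neighbor")
plt.show()  # read a candidate epsilon off the elbow of this curve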

This post is the start of a multipart series that will cover more clustering algorithms, each with their working, pseudocode, advantages, and disadvantages.

Please stay tuned for more such content.


If you liked this post, please share it with your friends and fellow developers. And don’t forget to follow us for more programming tutorials and examples! 😊

And also,
have a look👀 @ my Portfolio
code👨‍💻 together @ Github
connect🔗 @ LinkedIn
