DBSCAN Clustering

A density-based clustering algorithm called DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clusters together densely packed data points while classifying outliers as noise. The user does not need to predetermine the number of clusters while using DBSCAN. Clusters are described as continuous high-density areas divided by low-density areas. DBSCAN can detect clusters of any shape and is resistant to outliers. Its two parameters are min_samples, which is the bare minimum of data points needed to build a dense region, and epsilon (eps), which specifies the radius of the neighborhood around a data point.

Implementing DBSCAN using PyTorch

1. Importing necessary libraries

Python3




import torch
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons


2. Generate Synthetic Data

Generate synthetic data using make_moons function with 200 samples, some noise (0.05), and a fixed random state.

Python3




# Generate synthetic data
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X = torch.tensor(X, dtype=torch.float)


3. DBSCAN Algorithm

Define Euclidean distance function to calculate the distance between two points.

Python3




def euclidean_distance(x1, x2):
    return torch.sqrt(torch.sum((x1 - x2) ** 2, dim=1))


DBSCAN Function performs clustering.

The dbscan function implements the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm, a density-based clustering method. It takes an input dataset X, maximum distance parameter eps, and minimum number of samples parameter min_samples. It iterates over each data point, finding its neighbors within the specified distance. If a point has fewer neighbors than min_samples, it is labeled as noise; otherwise, it is labeled with a new cluster label and its neighbors are recursively visited to expand the cluster. The function efficiently assigns cluster labels to each point in the dataset based on their density and proximity relationships, providing a flexible and effective clustering method without requiring the number of clusters to be specified beforehand.

Python3




def dbscan(X, eps, min_samples):
   
    n_samples = X.shape[0]
    labels = torch.zeros(n_samples, dtype=torch.int)
 
    # Initialize cluster label and visited flags
    cluster_label = 0
    visited = torch.zeros(n_samples, dtype=torch.bool)
 
    # Iterate over each point
    for i in range(n_samples):
        if visited[i]:
            continue
        visited[i] = True
 
        # Find neighbors
        neighbors = torch.nonzero(euclidean_distance(X[i], X) < eps).squeeze()
         
        if neighbors.shape[0] < min_samples:
            # Label as noise
            labels[i] = 0
        else:
            # Expand cluster
            cluster_label += 1
            labels[i] = cluster_label
            expand_cluster(X, labels, visited, neighbors, cluster_label, eps, min_samples)
 
    return labels


This function implements the core logic of the DBSCAN algorithm, which iteratively identifies core points, expands clusters, and labels noise points. It operates directly on the input data and efficiently finds clusters without requiring the user to specify the number of clusters in advance.

Expand Cluster Function expands the cluster by assigning the cluster label to all reachable points from the core point.

Python3




def expand_cluster(X, labels, visited, neighbors, cluster_label, eps, min_samples):
    i = 0
    while i < neighbors.shape[0]:
        neighbor_index = neighbors[i].item()
        if not visited[neighbor_index]:
            visited[neighbor_index] = True
            neighbor_neighbors = torch.nonzero(euclidean_distance(X[neighbor_index], X) < eps).squeeze()
            if neighbor_neighbors.shape[0] >= min_samples:
                neighbors = torch.cat((neighbors, neighbor_neighbors))
        if labels[neighbor_index] == 0:
            labels[neighbor_index] = cluster_label
        i += 1


4. Performing Clustering and Visualize Clusters

Python3




# DBSCAN parameters
eps = 0.3
min_samples = 5
 
# Perform clustering
labels = dbscan(X, eps, min_samples)
 
# Visualize clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('DBSCAN Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar()
plt.show()


Output:

PyTorch for Unsupervised Clustering

The aim of unsupervised clustering, a fundamental machine learning problem, is to divide data into groups or clusters based on resemblance or some underlying structure. One well-liked deep learning framework for unsupervised clustering problems is PyTorch.

Table of Content

  • What is Unsupervised Clustering?
  • K-means Clustering
  • Hierarchical Clustering
  • DBSCAN Clustering
  • Evaluating Clustering Performance

Similar Reads

What is Unsupervised Clustering?

Unsupervised clustering is a machine-learning method that does not require labelled instances in order to find hidden patterns or groupings within data. It entails dividing data points according to distance or similarity measures into discrete clusters....

K-means Clustering

A well-liked unsupervised machine learning technique for dividing data points into K clusters is K-means clustering. The approach updates the centroids to minimize the within-cluster sum of squared distances by iteratively assigning each data point to the closest centroid based on the Euclidean distance. K-means may converge to a local minimum and is sensitive to the centroids that are first chosen....

Hierarchical Clustering

...

DBSCAN Clustering

...

Evaluating Clustering Performance

...

Conclusion

...

Contact Us