Hierarchical Clustering

Hierarchical clustering groups data points into a hierarchy of clusters. Unlike K-means, it does not require the number of clusters to be fixed in advance. Instead, it builds a tree-like structure of clusters by repeatedly merging or splitting clusters based on a similarity measure, until all data points belong to a single cluster or each point sits in its own cluster. Hierarchical clustering comes in two forms: agglomerative (bottom-up) and divisive (top-down).

  1. Hierarchical Agglomerative Clustering: Starts with every data point as its own cluster and repeatedly merges the closest pair of clusters until all points belong to a single cluster (a minimal sketch of this merging loop follows this list).
  2. Hierarchical Divisive Clustering: Starts with all data points in one cluster and recursively splits clusters into smaller ones until every data point is in its own cluster.
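
To make the bottom-up merging concrete, here is a minimal, illustrative sketch of single-linkage agglomerative merging in plain PyTorch/Python. The function name single_linkage_merge_order and the toy data are our own for illustration, and the naive nested loops only show the logic; they are no replacement for the optimized SciPy routines used later in this section.

Python3

import torch

def single_linkage_merge_order(X, num_merges):
    # Every point starts as its own cluster
    clusters = [[i] for i in range(X.shape[0])]
    D = torch.cdist(X, X)  # pairwise Euclidean distances
    merges = []
    for _ in range(num_merges):
        best = (float('inf'), None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance = closest pair of members
                d = min(D[i, j].item() for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]  # fuse cluster b into cluster a
        del clusters[b]
    return merges

X = torch.tensor([[1., 2.], [1., 4.], [4., 2.]])
for left, right, dist in single_linkage_merge_order(X, num_merges=2):
    print(f'merge {left} + {right} at distance {dist:.2f}')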

Implementing Agglomerative Clustering using PyTorch

The following code performs hierarchical clustering with the linkage function from scipy.cluster.hierarchy, using pairwise distances computed in PyTorch, and visualizes the resulting dendrogram with Matplotlib.

  1. Import Libraries: Import necessary libraries including PyTorch for tensor operations, SciPy for hierarchical clustering, and Matplotlib for plotting.
  2. Sample Data: Create a tensor X containing sample data points.
  3. Standardize Data: Standardize the data with z-score normalization so that every feature contributes equally to the distance calculation (a short verification sketch follows this list).
  4. Calculate Pairwise Euclidean Distances: Use torch.cdist to calculate pairwise Euclidean distances between all points in the standardized data.
  5. Convert Distances to Condensed Form: Convert the distance tensor to a NumPy array and reduce the square matrix to the condensed (1-D) form that SciPy’s linkage function expects, using scipy.spatial.distance.squareform.
  6. Perform Hierarchical Clustering: Apply SciPy’s linkage function to the condensed distances. The method 'single' means the distance between two clusters is the minimum distance between any pair of their members (single linkage).
  7. Plot Dendrogram: Plot the dendrogram using Matplotlib, visualizing the hierarchical clustering results.
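
As mentioned in step 3, the z-score normalization can be sanity-checked in isolation before it is used inside the full pipeline. A small sketch (the data mirrors the sample tensor used below):

Python3

import torch

X = torch.tensor([[1., 2.], [1., 4.], [1., 0.],
                  [4., 2.], [4., 4.], [4., 0.]])

# z-score: subtract the per-feature mean and divide by the per-feature std
X_std = (X - X.mean(dim=0)) / X.std(dim=0)

print(X_std.mean(dim=0))  # approximately zero for each feature
print(X_std.std(dim=0))   # one for each feature (up to rounding)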

NOTE: We use SciPy for hierarchical clustering because PyTorch has no built-in hierarchical clustering functions. PyTorch computes the pairwise distances between data points; the resulting square distance matrix is then converted to the condensed NumPy form that SciPy’s linkage function expects.

Python3
import torch
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt
 
# Sample data
X = torch.tensor([[1, 2], [1, 4], [1, 0],
                 [4, 2], [4, 4], [4, 0]])
 
# Standardize data with z-score normalization (cast to float first)
X = X.float()
X_std = (X - X.mean(dim=0)) / X.std(dim=0)
 
# Calculate pairwise Euclidean distances using PyTorch
distances = torch.cdist(X_std, X_std, p=2)  # p=2 for Euclidean distance
 
# Convert the square distance matrix to the condensed NumPy form SciPy expects
distances = squareform(distances.numpy(), checks=False)

# Perform hierarchical clustering (single linkage) using SciPy
Z = linkage(distances, 'single')
 
# Plot dendrogram using matplotlib
plt.figure(figsize=(10, 5))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Index')
plt.ylabel('Distance')
dendrogram(Z)
plt.show()


Output:

[Output image: a dendrogram titled 'Hierarchical Clustering Dendrogram', with point index on the x-axis and merge distance on the y-axis]
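
The dendrogram shows the merge order, but downstream code often needs a flat cluster assignment. SciPy’s fcluster can cut the linkage matrix Z at a chosen number of clusters; a short sketch (the expected grouping is our reading of the sample data, where the first three points share x = 1 and the last three share x = 4):

Python3

from scipy.cluster.hierarchy import fcluster

# Cut the tree so that at most two flat clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # points 0-2 land in one cluster, points 3-5 in the other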
