Steps to Evaluate Clustering Using Sklearn

Let’s consider an example using the Iris dataset and the K-Means clustering algorithm. We will calculate the Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index, Adjusted Rand Index, and Mutual Information to evaluate the clustering.

Import Libraries

Import the necessary libraries, including scikit-learn (sklearn).

Python3
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.metrics import mutual_info_score, adjusted_rand_score

Load Your Data

Load or generate your dataset for clustering. The Iris dataset consists of 150 samples of iris flowers from three species (setosa, versicolor, and virginica), each described by four features: sepal length, sepal width, petal length, and petal width.

Python3
# Example using a built-in dataset (e.g., Iris dataset)
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data

Perform Clustering

Choose a clustering algorithm, such as K-Means, and fit it to your data.

K-Means is an unsupervised technique used for creating clusters based on similarity. It iteratively assigns data points to the nearest cluster center and updates the centroids until convergence.

Python3
# Fit K-Means with 3 clusters (one per iris species); fixing
# random_state makes the run reproducible
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)
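
Once the model has been fitted, you can inspect what it learned. The short sketch below reuses the kmeans and X objects from above and is only illustrative; the exact centroid values and cluster numbering depend on the initialization.

Python3
# Each row of cluster_centers_ is the centroid of one cluster
print(kmeans.cluster_centers_)

# labels_ holds the cluster index assigned to each training sample
print(kmeans.labels_[:10])

# predict() assigns new observations to the nearest centroid
print(kmeans.predict([[5.1, 3.5, 1.4, 0.2]]))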

Calculate Clustering Metrics

Use the appropriate clustering metrics to evaluate the clustering results.

Python3
# Internal metrics: computed from the data and the cluster labels only
silhouette = silhouette_score(X, kmeans.labels_)
db_index = davies_bouldin_score(X, kmeans.labels_)
ch_index = calinski_harabasz_score(X, kmeans.labels_)

# External metrics: require the ground-truth class labels (iris.target)
ari = adjusted_rand_score(iris.target, kmeans.labels_)
mi = mutual_info_score(iris.target, kmeans.labels_)

# Print the metric scores
print(f"Silhouette Score: {silhouette:.2f}")
print(f"Davies-Bouldin Index: {db_index:.2f}")
print(f"Calinski-Harabasz Index: {ch_index:.2f}")
print(f"Adjusted Rand Index: {ari:.2f}")
print(f"Mutual Information (MI): {mi:.2f}")

Output:

Silhouette Score: 0.55
Davies-Bouldin Index: 0.67
Calinski-Harabasz Index: 561.59
Adjusted Rand Index: 0.72
Mutual Information (MI): 0.81

Interpret the Metrics

Analyze the metric scores to assess the quality of your clustering results. For the Silhouette Score, Calinski-Harabasz Index, Adjusted Rand Index, and Mutual Information, higher values are better; for the Davies-Bouldin Index, lower values are better.

Here’s an interpretation of the metric scores obtained:

  • Silhouette Score (0.55): This score reveals how similar data points are to their own cluster compared with points in other clusters. A value of 0.55 indicates reasonable separation between the clusters, though there is still room for improvement; values closer to 1 indicate better-defined clusters.
  • Davies-Bouldin Index (0.67): This index measures the average similarity between each cluster and its most similar neighbor. A lower score is preferable, and 0.67 suggests fairly good separation across clusters.
  • Calinski-Harabasz Index (561.59): This index is the ratio of between-cluster variance to within-cluster variance. Higher values indicate more distinct groups, so a score of 561.59 points to compact, well-separated clusters.
  • Adjusted Rand Index (0.72): ARI compares the predicted cluster labels against the true class labels. A score of 0.72 shows that the clustering results correspond rather well with the actual species labels.
  • Mutual Information (MI) (0.81): This metric measures the agreement between the true class labels and the predicted cluster labels. A score of 0.81 indicates a substantial amount of shared information, meaning the clustering captures much of the underlying structure in the data and aligns well with the actual class labels.

In this article, we have demonstrated how to apply clustering metrics using scikit-learn, with the Iris dataset and K-Means clustering as a working example. These metrics provide quantifiable estimates of how well data points are clustered and how closely the clusters fit the data’s underlying structure, allowing data scientists to measure clustering quality quantitatively and make more informed decisions when refining clustering algorithms and applications.

Clustering Metrics in Machine Learning

Clustering is an unsupervised machine-learning approach used to group comparable data points based on specific traits or attributes. When using clustering techniques, it is critical to evaluate the quality of the clusters created. Clustering metrics are quantitative indicators used to evaluate the performance and quality of clustering algorithms. In this post, we will explore the principles behind clustering metrics, discuss their importance, and implement them using scikit-learn.

Table of Content

  • Silhouette Score
  • Davies-Bouldin Index
  • Calinski-Harabasz Index (Variance Ratio Criterion)
  • Adjusted Rand Index (ARI)
  • Mutual Information (MI)
  • Steps to Evaluate Clustering Using Sklearn

Clustering Metrics

Clustering metrics play a pivotal role in evaluating the effectiveness of machine learning algorithms designed to group similar data points. These metrics provide quantitative measures to assess the quality of clusters formed, helping practitioners choose optimal algorithms for diverse datasets. By gauging factors like compactness, separation, and variance, clustering metrics such as silhouette score, Davies–Bouldin index, and Calinski-Harabasz index offer insights into the performance of clustering techniques. Understanding and applying these metrics contribute to the refinement and selection of clustering algorithms, fostering better insights in unsupervised learning scenarios.


Silhouette Score

The Silhouette Score is a metric used to assess how well defined the clusters in a dataset are. It quantifies both the cohesion within clusters and the separation between them. Scores range from -1 to 1, with higher values indicating better-defined clusters. A score close to 1 means an object is well matched to its own cluster and poorly matched to neighboring clusters, while a score near -1 suggests the object may have been assigned to the wrong cluster. The Silhouette Score is useful for judging how appropriate a clustering method is and for choosing the best number of clusters for a particular dataset.
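
On a per-sample level, scikit-learn's silhouette_samples returns one coefficient per observation, and silhouette_score is simply their mean. The snippet below is a minimal sketch of this, using a small synthetic dataset from make_blobs rather than the article's Iris example.

Python3
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Synthetic data with three well-separated blobs (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Mean silhouette over all samples (the usual reported score)
print(silhouette_score(X, labels))

# Per-sample coefficients: values near 1 are well placed,
# values near -1 are likely in the wrong cluster
print(silhouette_samples(X, labels)[:5])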

Davies-Bouldin Index

The Davies-Bouldin Index is a statistic for assessing the effectiveness of clustering algorithms. It evaluates the compactness and separation of a dataset’s clusters: for each cluster it takes the ratio of within-cluster scatter to the distance from its most similar neighboring cluster, then averages these ratios. A lower Davies-Bouldin Index indicates better-defined clusters, since clusters with small intra-cluster distances and large inter-cluster distances produce a lower value. This makes it useful for determining the ideal number of clusters and for comparing clustering solutions across datasets.
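
A common way to use the index is to scan candidate numbers of clusters and keep the value of k with the lowest score. The sketch below illustrates that idea on the Iris features; the exact numbers will depend on the K-Means initialization.

Python3
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import davies_bouldin_score

X = load_iris().data

# Lower Davies-Bouldin values indicate more compact, better-separated clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(davies_bouldin_score(X, labels), 3))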

Calinski-Harabasz Index (Variance Ratio Criterion)

The Calinski-Harabasz Index (also called the Variance Ratio Criterion) is a clustering validation metric used to evaluate the quality of clusters within a dataset. It computes the ratio of between-cluster variance to within-cluster variance, so higher values indicate compact, well-separated clusters. Comparing the index across different numbers of clusters helps determine the ideal number for a given dataset. This measure is useful for assessing how well clustering algorithms work and for choosing the best clustering solution among candidates.
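
To make the variance-ratio idea concrete, the sketch below computes the index by hand from the between-cluster and within-cluster sums of squares and checks it against scikit-learn's calinski_harabasz_score. The manual calculation follows the standard definition and is shown for illustration only.

Python3
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import calinski_harabasz_score

X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

n, k = X.shape[0], 3
overall_mean = X.mean(axis=0)
between, within = 0.0, 0.0
for c in range(k):
    members = X[labels == c]
    centroid = members.mean(axis=0)
    # Between-cluster dispersion: cluster size times squared distance to the overall mean
    between += len(members) * np.sum((centroid - overall_mean) ** 2)
    # Within-cluster dispersion: squared distances of members to their centroid
    within += np.sum((members - centroid) ** 2)

ch = (between / (k - 1)) / (within / (n - k))
print(ch)                                  # manual variance ratio
print(calinski_harabasz_score(X, labels))  # should match the manual value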

Adjusted Rand Index (ARI)

The Adjusted Rand Index (ARI) is a metric that compares findings from segmentation or clustering to a ground truth in order to assess how accurate the results are. It evaluates whether data point pairs are clustered together or apart in both the true and predicted clusterings. Higher values of the index imply better agreement; it corrects for chance agreement and produces a score between -1 and 1. ARI is reliable and appropriate in situations where the cluster sizes in the ground truth may differ. It offers a thorough assessment of clustering performance in situations where class labels are known.
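
Two properties are worth seeing in code: ARI is unaffected by how the clusters are numbered, and it corrects for chance so that random assignments score near zero. The toy label vectors below are made up purely for illustration.

Python3
import numpy as np
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Same grouping with the cluster ids renamed -> ARI is exactly 1
permuted = [2, 2, 2, 0, 0, 0, 1, 1, 1]
print(adjusted_rand_score(true_labels, permuted))

# Random assignments -> ARI close to 0 (chance-level agreement)
rng = np.random.default_rng(0)
random_labels = rng.integers(0, 3, size=len(true_labels))
print(adjusted_rand_score(true_labels, random_labels))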

Mutual Information (MI)

Mutual Information is a metric that quantifies how much two variables depend on one another. In the context of clustering evaluation, it measures the agreement between the true class labels and the predicted cluster assignments: how much knowing one labelling reduces uncertainty about the other. Higher values indicate better agreement, with zero meaning the two labellings share no information. It provides a reliable indicator of how well a clustering algorithm is working and of how closely the predicted clusters match the actual classes.
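
Note that the raw mutual_info_score is not bounded to [0, 1]; its scale grows with the number of clusters. When comparing different clusterings it is often convenient to also look at the normalized variant. The small sketch below, with made-up label vectors, shows both.

Python3
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred_labels = [0, 0, 1, 1, 1, 1, 2, 2, 2]

# Raw mutual information (in nats); scale depends on the number of clusters
print(mutual_info_score(true_labels, pred_labels))

# Normalized mutual information rescaled to [0, 1], easier to compare across runs
print(normalized_mutual_info_score(true_labels, pred_labels))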


