Steps to Evaluate Clustering Using Sklearn
Let’s consider an example using the Iris dataset and the K-Means clustering algorithm. We will calculate the Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index, Adjusted Rand Index, and Mutual Information to evaluate the clustering.
Import Libraries
Import the necessary libraries, including scikit-learn (sklearn).
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.metrics import mutual_info_score, adjusted_rand_score
Load Your Data
Load or generate your dataset for clustering. The Iris dataset consists of 150 samples of iris flowers from three species (setosa, versicolor, and virginica), each described by four features: sepal length, sepal width, petal length, and petal width.
# Example using a built-in dataset (e.g., Iris dataset)
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
Perform Clustering
Choose a clustering algorithm, such as K-Means, and fit it to your data.
K-Means is an unsupervised technique that creates clusters based on similarity. It iteratively assigns data points to the nearest cluster centroid and updates the centroids until convergence.
# n_init and random_state make the result reproducible across runs
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)
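The assign-then-update loop described above can be sketched directly in NumPy. This is a minimal illustration of Lloyd's algorithm only, not sklearn's actual implementation (which uses the smarter k-means++ initialization and other optimizations); the function name and random-point initialization are assumptions for the sketch:

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal K-Means (Lloyd's algorithm): assign each point to the
    nearest centroid, then recompute centroids, until convergence."""
    rng = np.random.default_rng(seed)
    # Plain random init: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids
```

On well-separated data this converges in a handful of iterations; sklearn's `KMeans` additionally reruns the whole procedure `n_init` times and keeps the best result.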
Calculate Clustering Metrics
Use the appropriate clustering metrics to evaluate the clustering results.
# Calculate clustering metrics
silhouette = silhouette_score(X, kmeans.labels_)
db_index = davies_bouldin_score(X, kmeans.labels_)
ch_index = calinski_harabasz_score(X, kmeans.labels_)
ari = adjusted_rand_score(iris.target, kmeans.labels_)
mi = mutual_info_score(iris.target, kmeans.labels_)
# Print the metric scores
print(f"Silhouette Score: {silhouette:.2f}")
print(f"Davies-Bouldin Index: {db_index:.2f}")
print(f"Calinski-Harabasz Index: {ch_index:.2f}")
print(f"Adjusted Rand Index: {ari:.2f}")
print(f"Mutual Information (MI): {mi:.2f}")
Output:
Silhouette Score: 0.55
Davies-Bouldin Index: 0.67
Calinski-Harabasz Index: 561.59
Adjusted Rand Index: 0.72
Mutual Information (MI): 0.81
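Note that the internal metrics (Silhouette, Davies-Bouldin, Calinski-Harabasz) need no ground-truth labels, so a common use is to compare candidate values of `n_clusters`. A short sketch (the range of k values is an arbitrary choice for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score, davies_bouldin_score

X = load_iris().data

# Compare candidate cluster counts using label-free (internal) metrics.
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}  silhouette={silhouette_score(X, labels):.2f}  "
          f"davies_bouldin={davies_bouldin_score(X, labels):.2f}")
```

Keep in mind that such sweeps are a guide, not an oracle: on Iris the silhouette score actually favors k=2, because two of the three species overlap in feature space.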
Interpret the Metrics
Analyze the metric scores to assess the quality of your clustering results. For most of these metrics higher scores are better; the Davies-Bouldin Index is the exception, where lower is better.
Here’s an interpretation of the metric scores obtained:
- Silhouette Score (0.55): This score measures how similar data points are to their own cluster compared to other clusters. A value of 0.55 indicates some separation between the clusters, with room for improvement; values closer to 1 suggest better-defined clusters.
- Davies-Bouldin Index (0.67): This index calculates the average similarity between each cluster and its most similar neighbor. A lower score is preferable, and 0.67 suggests fairly strong separation between clusters.
- Calinski-Harabasz Index (561.59): This index is the ratio of between-cluster variance to within-cluster variance. Higher values suggest more distinct groups; a score of 561.59 indicates compact, well-separated clusters.
- Adjusted Rand Index (0.72): This index compares the predicted cluster labels against the true class labels, corrected for chance agreement. A score of 0.72 shows that the clustering results correspond rather well with the actual class labels.
- Mutual Information (MI) (0.81): This metric measures the agreement between the true class labels and the predicted cluster labels. A score of 0.81 indicates a substantial amount of shared information between the true labels and the clusters assigned by the algorithm, meaning the clustering captures a significant portion of the underlying class structure in the data.
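One property worth noting about ARI and MI: they compare partitions rather than label names, so permuting the cluster ids does not change the score. A small illustration with made-up labels:

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]
# The same grouping, but with the cluster ids permuted; a clustering
# algorithm has no way to know which id should map to which class.
pred_labels = [2, 2, 0, 0, 1, 1]

# ARI sees identical partitions, so the score is a perfect 1.0.
print(adjusted_rand_score(true_labels, pred_labels))  # 1.0
```

ARI is close to 0 for random labelings and can be negative for worse-than-random ones, which is why it is preferred over naive label-matching accuracy when evaluating clusterings against known classes.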
In this article, we have demonstrated how to apply clustering metrics using scikit-learn, with the Iris dataset and K-Means clustering as an example. These metrics provide quantifiable estimates of how well data points are clustered and how closely the clusters fit the data’s underlying structure, allowing data scientists to measure clustering quality quantitatively and make more informed decisions when tuning clustering algorithms and applications.
Clustering Metrics in Machine Learning
Clustering is an unsupervised machine-learning approach used to group similar data points based on shared traits or attributes. When applying clustering techniques, it is critical to evaluate the quality of the clusters produced. Clustering metrics are quantitative indicators used to assess the performance and quality of clustering algorithms. In this post, we will explore the principles behind these metrics, analyze their importance, and implement them using scikit-learn.
Table of Contents
- Silhouette Score
- Davies-Bouldin Index
- Calinski-Harabasz Index (Variance Ratio Criterion)
- Adjusted Rand Index (ARI)
- Mutual Information (MI)
- Steps to Evaluate Clustering Using Sklearn
Clustering Metrics
Clustering metrics play a pivotal role in evaluating the effectiveness of machine learning algorithms designed to group similar data points. These metrics provide quantitative measures to assess the quality of clusters formed, helping practitioners choose optimal algorithms for diverse datasets. By gauging factors like compactness, separation, and variance, clustering metrics such as silhouette score, Davies–Bouldin index, and Calinski-Harabasz index offer insights into the performance of clustering techniques. Understanding and applying these metrics contribute to the refinement and selection of clustering algorithms, fostering better insights in unsupervised learning scenarios.