Fuzzy Clustering in R on Medical Diagnosis dataset

1. Loading Required Libraries

R




# Loading Required Libraries
# For fuzzy clustering
library(e1071) 
library(ggplot2)


2. Loading the Dataset

We create a fictional dataset of patient health parameters. Synthetic data is generated for 100 patients using three variables: blood pressure, cholesterol and BMI (body mass index).

R




# Loading the Dataset
set.seed(123)  # for reproducibility
patients <- data.frame(
  patient_id = 1:100,
  blood_pressure = rnorm(100, mean = 120, sd = 10),
  cholesterol = rnorm(100, mean = 200, sd = 30),
  bmi = rnorm(100, mean = 25, sd = 5)
)


3. Data Preprocessing

This step is important to ensure that all the variables are on the same scale, which is common practice in clustering.

R




# Data Preprocessing
scaled_data <- scale(patients[, -1])
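As a quick sanity check (a self-contained sketch that re-creates the same synthetic data), each scaled column should now have mean 0 and standard deviation 1:

```r
# Re-create the synthetic patient data and scale it
set.seed(123)  # same seed as above, for reproducibility
patients <- data.frame(
  patient_id = 1:100,
  blood_pressure = rnorm(100, mean = 120, sd = 10),
  cholesterol = rnorm(100, mean = 200, sd = 30),
  bmi = rnorm(100, mean = 25, sd = 5)
)
scaled_data <- scale(patients[, -1])

# Every column should now have mean ~0 and sd 1
round(colMeans(scaled_data), 10)
apply(scaled_data, 2, sd)
```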


4. Data Selection for Clustering

This segment involves selecting the relevant variables for clustering.

R




# Data Selection for Clustering
selected_data <- scaled_data[, c("blood_pressure", "cholesterol", "bmi")]


5. Fuzzy C-means Clustering with FGK Algorithm

The Fuzzy Gustafson-Kessel (FGK) algorithm is a variant of the Fuzzy C-means (FCM) clustering algorithm designed for overlapping and non-spherical (ellipsoidal) clusters: instead of one global distance measure, it adapts the distance metric to each cluster's covariance. The membership grades are determined by the distance between data points and cluster centers; standard FCM uses the Euclidean distance, which measures the straight-line distance between two points in Euclidean space. The formula is given by:

d = √[(x2 − x1)² + (y2 − y1)²]
  • where (x1, y1) are the coordinates of one point,
  • (x2, y2) are the coordinates of the other point,
  • and d is the distance between them.
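As a quick illustration, the formula can be evaluated directly in R for two made-up points, (0, 0) and (3, 4):

```r
# Two hypothetical points in 2-D space
p1 <- c(0, 0)
p2 <- c(3, 4)

# d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
d <- sqrt(sum((p2 - p1)^2))
print(d)  # 5

# Base R's dist() computes the same Euclidean distance
print(dist(rbind(p1, p2)))
```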

R




# Fuzzy C-means Clustering with FGK algorithm
set.seed(456) 
fgk_clusters <- e1071::cmeans(selected_data, centers = 3, m = 2)$cluster


selected_data contains the columns chosen for clustering, centers = 3 requests three clusters, and the fuzzifier m controls how fuzzy the partition is (a higher value of m produces fuzzier clusters). Note that e1071::cmeans implements standard fuzzy c-means; the Gustafson-Kessel variant additionally adapts the distance metric per cluster.
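To see the effect of m, the sketch below (on toy standardized data, not the article's dataset) runs cmeans with two fuzzifier values; with a larger m, the highest membership per point is lower on average, i.e. the partition is fuzzier:

```r
library(e1071)

set.seed(123)
toy <- scale(matrix(rnorm(200), ncol = 2))  # 100 toy points

crisp <- cmeans(toy, centers = 3, m = 1.5)
fuzzy <- cmeans(toy, centers = 3, m = 3)

# Average of each point's largest membership: closer to 1 means crisper
mean(apply(crisp$membership, 1, max))
mean(apply(fuzzy$membership, 1, max))
```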

Data Membership Degree Matrix and the Cluster Prototype Evolution Matrices

In fuzzy clustering, each data point is assigned a degree of membership that quantifies how strongly it belongs to each cluster, while the cluster prototype matrix records the cluster center positions the algorithm has converged to (e1071::cmeans returns the final centers, not their position at every iteration).

R




# Fuzzy C-means Clustering with FGK algorithm
set.seed(456)  # for reproducibility
fuzzy_result <- e1071::cmeans(selected_data, centers = 3, m = 2)
  
# Access the membership matrix and cluster centers
membership_matrix <- fuzzy_result$membership
cluster_centers <- fuzzy_result$centers
  
# Print the membership matrix and cluster centers
print("Data Membership Degree Matrix:")
print(membership_matrix)
  
print("Cluster Prototype Evolution Matrices:")
print(cluster_centers)


Output:

"Data Membership Degree Matrix:"
        1          2          3
  [1,] 0.15137740 0.15999978 0.68862282
  [2,] 0.10702292 0.19489294 0.69808414
  [3,] 0.71018858 0.18352624 0.10628518
  [4,] 0.21623783 0.18849017 0.59527200
  [5,] 0.70780116 0.14281776 0.14938109
  [6,] 0.63998321 0.23731396 0.12270283
  [7,] 0.82691960 0.10470764 0.06837277
  [8,] 0.33246815 0.25745565 0.41007620
  [9,] 0.08219287 0.10368827 0.81411886
 [10,] 0.06659943 0.83694230 0.09645826....
[100,] 0.12656903 0.12155473 0.75187624

"Cluster Prototype Evolution Matrices:"
 blood_pressure cholesterol        bmi
1      0.6919000  -0.5087515 -0.4642972
2     -0.1031542   0.7724248 -0.3050143
3     -0.6279179  -0.3104457  0.8176061

A higher value in a row indicates a stronger association between that data point and the corresponding cluster; each row sums to 1. Not all 100 rows are shown here; running the code prints the full matrix.
The second matrix holds the final cluster center coordinates for each variable (blood pressure, cholesterol and BMI) on the standardized scale.
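To turn degrees of membership into a single crisp label per patient, take the cluster with the largest membership in each row. A sketch with a made-up three-point membership matrix:

```r
# Hypothetical membership matrix: 3 points x 3 clusters (each row sums to 1)
m <- matrix(c(0.15, 0.16, 0.69,
              0.71, 0.18, 0.11,
              0.07, 0.84, 0.09),
            nrow = 3, byrow = TRUE)

rowSums(m)  # each row sums to 1

# Crisp assignment: index of the largest membership per row
apply(m, 1, which.max)  # 3 1 2
```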

6. Interpret the Clustering Results

In this step we combine the clustering results with the original data using the cbind() function; the summary() function then gives an overview of the combined data.

R




# Interpret the Clustering Results
clustered_data <- cbind(patients, cluster = fgk_clusters)
summary(clustered_data)


Output:

   patient_id     blood_pressure    cholesterol         bmi           cluster    
 Min.   :  1.00   Min.   : 96.91   Min.   :138.4   Min.   :16.22   Min.   :1.00  
 1st Qu.: 25.75   1st Qu.:115.06   1st Qu.:176.0   1st Qu.:22.34   1st Qu.:1.00  
 Median : 50.50   Median :120.62   Median :193.2   Median :25.18   Median :2.00  
 Mean   : 50.50   Mean   :120.90   Mean   :196.8   Mean   :25.60   Mean   :2.02  
 3rd Qu.: 75.25   3rd Qu.:126.92   3rd Qu.:214.0   3rd Qu.:28.82   3rd Qu.:3.00  
 Max.   :100.00   Max.   :141.87   Max.   :297.2   Max.   :36.47   Max.   :3.00

The summary() output shows the minimum, first quartile, median, third quartile and maximum of each column in the dataset. This information can help researchers study the underlying patterns in the dataset for further decision making.
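One way to dig into those patterns is to compare per-cluster means of the raw variables. A self-contained sketch that re-creates the synthetic data and clustering from the steps above:

```r
library(e1071)

set.seed(123)
patients <- data.frame(
  patient_id = 1:100,
  blood_pressure = rnorm(100, mean = 120, sd = 10),
  cholesterol = rnorm(100, mean = 200, sd = 30),
  bmi = rnorm(100, mean = 25, sd = 5)
)
scaled_data <- scale(patients[, -1])

set.seed(456)
cl <- cmeans(scaled_data, centers = 3, m = 2)$cluster
clustered_data <- cbind(patients, cluster = cl)

# Mean of each health parameter within each cluster
aggregate(cbind(blood_pressure, cholesterol, bmi) ~ cluster,
          data = clustered_data, FUN = mean)
```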

GAP INDEX

The gap index (or gap statistic) is used to estimate the optimal number of clusters within a dataset: the number beyond which adding further clusters no longer plays a significant role in the analysis.

R




# Function to calculate the gap statistic
gap_statistic <- function(data, max_k, B = 50, seed = NULL) {
  require(cluster)
    
  set.seed(seed)
    
  # Compute the observed within-cluster dispersion for different values of k
  wss <- numeric(max_k)
  for (i in 1:max_k) {
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
    
  # Generate B reference datasets and calculate the within-cluster dispersion for each
  B_wss <- matrix(NA, B, max_k)
  for (b in 1:B) {
    ref_data <- matrix(rnorm(nrow(data) * ncol(data)), nrow = nrow(data))
    for (i in 1:max_k) {
      B_wss[b, i] <- sum(kmeans(ref_data, centers = i)$withinss)
    }
  }
    
  # Calculate the gap statistic: mean log reference dispersion minus observed log dispersion
  gap <- apply(log(B_wss), 2, mean) - log(wss)
  return(gap)
}
  
# Example usage of the gap_statistic function
gap_values <- gap_statistic(selected_data, max_k = 10, B = 50, seed = 123)
print(gap_values)


Output:

[1] -286.82712 -209.32084 -163.01342 -131.98106 -112.70612  -98.07825  -87.90545
 [8]  -77.92460  -69.81373  -63.42550

In general, the number of clusters that maximizes the gap statistic (or, by the common one-standard-error rule, the smallest k whose gap lies within one standard error of that maximum) is chosen as optimal. The absolute magnitudes depend on the randomly generated reference datasets, so the trend across k matters more than the raw values.
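The cluster package provides a reference implementation, clusGap(), which follows the Tibshirani et al. formulation (log-scale comparison against bootstrap reference sets) and also reports standard errors; it is a useful cross-check for the hand-rolled function above (a sketch on toy data standing in for selected_data):

```r
library(cluster)

set.seed(123)
toy <- scale(matrix(rnorm(300), ncol = 3))  # stand-in for selected_data

gap <- clusGap(toy, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 10)

# Tab holds logW, E.logW, gap and SE.sim for each k
head(gap$Tab)

# Suggested k: smallest k within one SE of the first local maximum of the gap
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
```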

Davies-Bouldin’s index

It assesses the average similarity between clusters, accounting for both the scatter within clusters and the separation between them, which helps in estimating the quality of a clustering: lower values indicate more compact, better-separated clusters.

R




# Function to calculate the Davies-Bouldin index
davies_bouldin_index <- function(data, cluster_centers, membership_matrix) {
  require(cluster)
  
  num_clusters <- nrow(cluster_centers)
  # Within-cluster scatter: membership-weighted mean distance to each center
  scatter <- numeric(num_clusters)
  for (i in 1:num_clusters) {
    dists <- sqrt(rowSums(sweep(data, 2, cluster_centers[i, ])^2))
    scatter[i] <- mean(dists * membership_matrix[, i])
  }
  
  # Calculate the cluster separation
  separation <- matrix(0, nrow = num_clusters, ncol = num_clusters)
  for (i in 1:num_clusters) {
    for (j in 1:num_clusters) {
      if (i != j) {
        separation[i, j] <- sqrt(sum((cluster_centers[i,] - cluster_centers[j,])^2))
      }
    }
  }
  
  # Calculate the Davies-Bouldin index
  db_index <- 0
  for (i in 1:num_clusters) {
    max_val <- -Inf
    for (j in 1:num_clusters) {
      if (i != j) {
        val <- (scatter[i] + scatter[j]) / separation[i, j]
        if (val > max_val) {
          max_val <- val
        }
      }
    }
    db_index <- db_index + max_val
  }
  db_index <- db_index / num_clusters
  return(db_index)
}
  
# Example usage of the Davies-Bouldin index function
db_index <- davies_bouldin_index(selected_data, cluster_centers, membership_matrix)
print(paste("Davies-Bouldin Index:", db_index))


Output:

"Davies-Bouldin Index: 0.77109024677212"

The relatively low value in our output indicates that the clusters are compact and well separated from each other.

7. Visualizing the Clustering Results

R




# Visualizing the Clustering Results
ggplot(clustered_data, aes(x = blood_pressure, y = cholesterol, 
                           color = factor(cluster))) +
  geom_point(size = 3) +
  labs(title = "Clustering of Patients Based on Health Parameters",
       x = "Blood Pressure", y = "Cholesterol") +
  scale_color_manual(values = c("darkgreen", "green3", "lightgreen")) +
  theme_minimal()


Output:

Cholesterol vs Blood pressure graph

In this graph each data point represents a patient, colored by its assigned cluster. The different shades make the separation between the clusters visible.

Data Point Cluster Representation

This representation is useful because it reduces a complex, high-dimensional structure to a simpler form: it can reveal underlying trends, patterns and outliers that are not easily detectable in the original dataset.

R




# Load the required library
library(cluster)
  
# Create a data frame including the cluster assignment
clustered_data$cluster <- as.factor(clustered_data$cluster)
  
# Plot the clusters using clusplot
clusplot(selected_data, clustered_data$cluster, color = TRUE, shade = TRUE,
         labels = 2, lines = 0)


Output:

Data Point Cluster Representation

Different clusters are represented in different colors here, and the clusters are shaded to give a clearer view of each data point. The "71.02% of the point variability" label means that the two principal components onto which clusplot projects the data capture 71.02% of the variance present in the original dataset.
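The percentage in clusplot's label comes from principal component analysis: the plot projects the data onto its first two principal components. prcomp() shows how that share of variance is computed (a sketch on toy standardized data):

```r
set.seed(123)
toy <- scale(matrix(rnorm(300), ncol = 3))

pr <- prcomp(toy)
var_share <- pr$sdev^2 / sum(pr$sdev^2)  # variance explained per component

# Share of variance captured by the first two components
# (the "point variability explained" figure clusplot reports)
round(sum(var_share[1:2]), 4)
```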

Variable Relationships Visualization

To visualize the relationships between the variables we plot a pairwise scatter plot matrix of the dataset, using the pairs() function.

R




# pairs() is part of base R, so no extra packages are required
  
# Create a scatter plot matrix
pairs(selected_data, main = "Scatter Plot Matrix of Health Parameters")


Output:

Variable Relationships Visualization

The diagonal panels show the distribution of each variable. The scatter plot matrix helps us visualize the relationships between blood pressure, cholesterol and BMI. In the context of patient health parameters, it can show how these variables relate to one another; understanding these patterns can help assess potential risk and support decision making.
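The pairwise relationships visible in the scatter plot matrix can be quantified with a correlation matrix. A sketch on the same kind of synthetic data (where the variables are generated independently, so the off-diagonal correlations should be near zero):

```r
set.seed(123)
patients <- data.frame(
  blood_pressure = rnorm(100, mean = 120, sd = 10),
  cholesterol = rnorm(100, mean = 200, sd = 30),
  bmi = rnorm(100, mean = 25, sd = 5)
)

# Pearson correlations between every pair of variables
round(cor(patients), 2)
```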

In this example, we created a fictional medical-diagnosis dataset and clustered it with fuzzy c-means (via the e1071 implementation; the Gustafson-Kessel variant extends this by adapting the distance metric per cluster). This kind of clustering can help medical practitioners draw conclusions from medical histories and the similarities between patients and their symptoms, which also makes treatment decisions easier.

Conclusion

In this article, we learned about the algorithms underlying fuzzy clustering and how it helps in various fields such as medicine, agriculture, traffic pattern analysis and customer segmentation. We applied it to various types of datasets from different sources and plotted the clustering results on graphs for better visualization. These results help researchers identify how each data point belongs or contributes to different clusters and how the clusters affect the study as a whole.



