Fuzzy Clustering in R on Medical Diagnosis dataset

1. Loading Required Libraries

R




# Loading Required Libraries
# For fuzzy clustering
library(e1071) 
library(ggplot2)


2. Loading the Dataset

We create a fictional dataset of patient health parameters. Synthetic data is generated for 100 patients using three variables: blood pressure, cholesterol and BMI (body mass index).

R




# Loading the Dataset
set.seed(123)  # for reproducibility
patients <- data.frame(
  patient_id = 1:100,
  blood_pressure = rnorm(100, mean = 120, sd = 10),
  cholesterol = rnorm(100, mean = 200, sd = 30),
  bmi = rnorm(100, mean = 25, sd = 5)
)


3. Data Preprocessing

This step is important to ensure that all the variables are on the same scale, which is common practice in clustering.

R




# Data Preprocessing
scaled_data <- scale(patients[, -1])
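As a quick sanity check (a self-contained sketch that re-creates the same synthetic data), each scaled column should now have mean 0 and standard deviation 1:

```r
# Re-create the synthetic patient data and scale it
set.seed(123)  # same seed as above, for reproducibility
patients <- data.frame(
  patient_id = 1:100,
  blood_pressure = rnorm(100, mean = 120, sd = 10),
  cholesterol = rnorm(100, mean = 200, sd = 30),
  bmi = rnorm(100, mean = 25, sd = 5)
)
scaled_data <- scale(patients[, -1])

# Every column should now have mean ~0 and sd 1
round(colMeans(scaled_data), 10)
apply(scaled_data, 2, sd)
```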


4. Data Selection for Clustering

This segment involves selecting the relevant variables for clustering.

R




# Data Selection for Clustering
selected_data <- scaled_data[, c("blood_pressure", "cholesterol", "bmi")]


5. Fuzzy C-means Clustering with FGK Algorithm

The Fuzzy Gustafson-Kessel (FGK) algorithm is a variant of the Fuzzy C-means (FCM) clustering algorithm designed for overlapping and non-spherical (ellipsoidal) clusters: instead of one global distance measure, it adapts the distance metric to each cluster's covariance. The membership grades are determined by the distance between data points and cluster centers; standard FCM uses the Euclidean distance, which measures the straight-line distance between two points in Euclidean space. The formula is given by:

d = √[(x2 − x1)² + (y2 − y1)²]
  • where (x1, y1) are the coordinates of one point,
  • (x2, y2) are the coordinates of the other point,
  • and d is the distance between them.
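As a quick illustration, the formula can be evaluated directly in R for two made-up points, (0, 0) and (3, 4):

```r
# Two hypothetical points in 2-D space
p1 <- c(0, 0)
p2 <- c(3, 4)

# d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
d <- sqrt(sum((p2 - p1)^2))
print(d)  # 5

# Base R's dist() computes the same Euclidean distance
print(dist(rbind(p1, p2)))
```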

R




# Fuzzy C-means Clustering with FGK algorithm
set.seed(456) 
fgk_clusters <- e1071::cmeans(selected_data, centers = 3, m = 2)$cluster


selected_data contains the columns chosen for clustering, centers = 3 requests three clusters, and the fuzzifier m controls how fuzzy the partition is (a higher value of m produces fuzzier clusters). Note that e1071::cmeans implements standard fuzzy c-means; the Gustafson-Kessel variant additionally adapts the distance metric per cluster.
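To see the effect of m, the sketch below (on toy standardized data, not the article's dataset) runs cmeans with two fuzzifier values; with a larger m, the highest membership per point is lower on average, i.e. the partition is fuzzier:

```r
library(e1071)

set.seed(123)
toy <- scale(matrix(rnorm(200), ncol = 2))  # 100 toy points

crisp <- cmeans(toy, centers = 3, m = 1.5)
fuzzy <- cmeans(toy, centers = 3, m = 3)

# Average of each point's largest membership: closer to 1 means crisper
mean(apply(crisp$membership, 1, max))
mean(apply(fuzzy$membership, 1, max))
```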

Data Membership Degree Matrix and the Cluster Prototype Evolution Matrices

In fuzzy clustering, each data point is assigned a degree of membership that quantifies how strongly it belongs to each cluster, while the cluster prototype matrix records the cluster center positions the algorithm has converged to (e1071::cmeans returns the final centers, not their position at every iteration).

R




# Fuzzy C-means Clustering with FGK algorithm
set.seed(456)  # for reproducibility
fuzzy_result <- e1071::cmeans(selected_data, centers = 3, m = 2)
  
# Access the membership matrix and cluster centers
membership_matrix <- fuzzy_result$membership
cluster_centers <- fuzzy_result$centers
  
# Print the membership matrix and cluster centers
print("Data Membership Degree Matrix:")
print(membership_matrix)
  
print("Cluster Prototype Evolution Matrices:")
print(cluster_centers)


Output:

"Data Membership Degree Matrix:"
        1          2          3
  [1,] 0.15137740 0.15999978 0.68862282
  [2,] 0.10702292 0.19489294 0.69808414
  [3,] 0.71018858 0.18352624 0.10628518
  [4,] 0.21623783 0.18849017 0.59527200
  [5,] 0.70780116 0.14281776 0.14938109
  [6,] 0.63998321 0.23731396 0.12270283
  [7,] 0.82691960 0.10470764 0.06837277
  [8,] 0.33246815 0.25745565 0.41007620
  [9,] 0.08219287 0.10368827 0.81411886
 [10,] 0.06659943 0.83694230 0.09645826....
[100,] 0.12656903 0.12155473 0.75187624

"Cluster Prototype Evolution Matrices:"
 blood_pressure cholesterol        bmi
1      0.6919000  -0.5087515 -0.4642972
2     -0.1031542   0.7724248 -0.3050143
3     -0.6279179  -0.3104457  0.8176061

A higher value in a row indicates a stronger association between that data point and the corresponding cluster; each row sums to 1. Not all 100 rows are shown here; running the code prints the full matrix.
The second matrix holds the final cluster center coordinates for each variable (blood pressure, cholesterol and BMI) on the standardized scale.
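To turn degrees of membership into a single crisp label per patient, take the cluster with the largest membership in each row. A sketch with a made-up three-point membership matrix:

```r
# Hypothetical membership matrix: 3 points x 3 clusters (each row sums to 1)
m <- matrix(c(0.15, 0.16, 0.69,
              0.71, 0.18, 0.11,
              0.07, 0.84, 0.09),
            nrow = 3, byrow = TRUE)

rowSums(m)  # each row sums to 1

# Crisp assignment: index of the largest membership per row
apply(m, 1, which.max)  # 3 1 2
```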

6. Interpret the Clustering Results

In this step we combine the clustering results with the original data using the cbind() function; the summary() function then gives an overview of the combined data.

R




# Interpret the Clustering Results
clustered_data <- cbind(patients, cluster = fgk_clusters)
summary(clustered_data)


Output:

   patient_id     blood_pressure    cholesterol         bmi           cluster    
 Min.   :  1.00   Min.   : 96.91   Min.   :138.4   Min.   :16.22   Min.   :1.00  
 1st Qu.: 25.75   1st Qu.:115.06   1st Qu.:176.0   1st Qu.:22.34   1st Qu.:1.00  
 Median : 50.50   Median :120.62   Median :193.2   Median :25.18   Median :2.00  
 Mean   : 50.50   Mean   :120.90   Mean   :196.8   Mean   :25.60   Mean   :2.02  
 3rd Qu.: 75.25   3rd Qu.:126.92   3rd Qu.:214.0   3rd Qu.:28.82   3rd Qu.:3.00  
 Max.   :100.00   Max.   :141.87   Max.   :297.2   Max.   :36.47   Max.   :3.00

The summary() output shows the minimum, first quartile, median, third quartile and maximum of each column in the dataset. This information can help researchers study the underlying patterns in the dataset for further decision making.
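One way to dig into those patterns is to compare per-cluster means of the raw variables. A self-contained sketch that re-creates the synthetic data and clustering from the steps above:

```r
library(e1071)

set.seed(123)
patients <- data.frame(
  patient_id = 1:100,
  blood_pressure = rnorm(100, mean = 120, sd = 10),
  cholesterol = rnorm(100, mean = 200, sd = 30),
  bmi = rnorm(100, mean = 25, sd = 5)
)
scaled_data <- scale(patients[, -1])

set.seed(456)
cl <- cmeans(scaled_data, centers = 3, m = 2)$cluster
clustered_data <- cbind(patients, cluster = cl)

# Mean of each health parameter within each cluster
aggregate(cbind(blood_pressure, cholesterol, bmi) ~ cluster,
          data = clustered_data, FUN = mean)
```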

GAP INDEX

The gap index (or gap statistic) is used to estimate the optimal number of clusters within a dataset: the number beyond which adding further clusters no longer plays a significant role in the analysis.

R




# Function to calculate the gap statistic
gap_statistic <- function(data, max_k, B = 50, seed = NULL) {
  require(cluster)
    
  set.seed(seed)
    
  # Compute the observed within-cluster dispersion for different values of k
  wss <- numeric(max_k)
  for (i in 1:max_k) {
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
    
  # Generate B reference datasets and calculate the within-cluster dispersion for each
  B_wss <- matrix(NA, B, max_k)
  for (b in 1:B) {
    ref_data <- matrix(rnorm(nrow(data) * ncol(data)), nrow = nrow(data))
    for (i in 1:max_k) {
      B_wss[b, i] <- sum(kmeans(ref_data, centers = i)$withinss)
    }
  }
    
  # Calculate the gap statistic: mean log reference dispersion minus observed log dispersion
  gap <- apply(log(B_wss), 2, mean) - log(wss)
  return(gap)
}
  
# Example usage of the gap_statistic function
gap_values <- gap_statistic(selected_data, max_k = 10, B = 50, seed = 123)
print(gap_values)


Output:

[1] -286.82712 -209.32084 -163.01342 -131.98106 -112.70612  -98.07825  -87.90545
 [8]  -77.92460  -69.81373  -63.42550

In general, the number of clusters that maximizes the gap statistic (or, by the common one-standard-error rule, the smallest k whose gap lies within one standard error of that maximum) is chosen as optimal. The absolute magnitudes depend on the randomly generated reference datasets, so the trend across k matters more than the raw values.
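The cluster package provides a reference implementation, clusGap(), which follows the Tibshirani et al. formulation (log-scale comparison against bootstrap reference sets) and also reports standard errors; it is a useful cross-check for the hand-rolled function above (a sketch on toy data standing in for selected_data):

```r
library(cluster)

set.seed(123)
toy <- scale(matrix(rnorm(300), ncol = 3))  # stand-in for selected_data

gap <- clusGap(toy, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 10)

# Tab holds logW, E.logW, gap and SE.sim for each k
head(gap$Tab)

# Suggested k: smallest k within one SE of the first local maximum of the gap
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
```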

Davies-Bouldin’s index

It assesses the average similarity between clusters, accounting for both the scatter within clusters and the separation between them, which helps in estimating the quality of a clustering: lower values indicate more compact, better-separated clusters.

R




# Function to calculate the Davies-Bouldin index
davies_bouldin_index <- function(data, cluster_centers, membership_matrix) {
  require(cluster)
  
  num_clusters <- nrow(cluster_centers)
  # Within-cluster scatter: membership-weighted mean distance to each center
  scatter <- numeric(num_clusters)
  for (i in 1:num_clusters) {
    dists <- sqrt(rowSums(sweep(data, 2, cluster_centers[i, ])^2))
    scatter[i] <- mean(dists * membership_matrix[, i])
  }
  
  # Calculate the cluster separation
  separation <- matrix(0, nrow = num_clusters, ncol = num_clusters)
  for (i in 1:num_clusters) {
    for (j in 1:num_clusters) {
      if (i != j) {
        separation[i, j] <- sqrt(sum((cluster_centers[i,] - cluster_centers[j,])^2))
      }
    }
  }
  
  # Calculate the Davies-Bouldin index
  db_index <- 0
  for (i in 1:num_clusters) {
    max_val <- -Inf
    for (j in 1:num_clusters) {
      if (i != j) {
        val <- (scatter[i] + scatter[j]) / separation[i, j]
        if (val > max_val) {
          max_val <- val
        }
      }
    }
    db_index <- db_index + max_val
  }
  db_index <- db_index / num_clusters
  return(db_index)
}
  
# Example usage of the Davies-Bouldin index function
db_index <- davies_bouldin_index(selected_data, cluster_centers, membership_matrix)
print(paste("Davies-Bouldin Index:", db_index))


Output:

"Davies-Bouldin Index: 0.77109024677212"

The relatively low value in our output indicates that the clusters are compact and well separated from each other.

7. Visualizing the Clustering Results

R




# Visualizing the Clustering Results
ggplot(clustered_data, aes(x = blood_pressure, y = cholesterol, 
                           color = factor(cluster))) +
  geom_point(size = 3) +
  labs(title = "Clustering of Patients Based on Health Parameters",
       x = "Blood Pressure", y = "Cholesterol") +
  scale_color_manual(values = c("darkgreen", "green3", "lightgreen")) +
  theme_minimal()


Output:

Cholesterol vs Blood pressure graph

In this graph each data point represents a patient, colored by its assigned cluster. The different shades make the separation between the clusters visible.

Data Point Cluster Representation

This representation is useful because it reduces a complex, high-dimensional structure to a simpler form: it can reveal underlying trends, patterns and outliers that are not easily detectable in the original dataset.

R




# Load the required library
library(cluster)
  
# Create a data frame including the cluster assignment
clustered_data$cluster <- as.factor(clustered_data$cluster)
  
# Plot the clusters using clusplot
clusplot(selected_data, clustered_data$cluster, color = TRUE, shade = TRUE,
         labels = 2, lines = 0)


Output:

Data Point Cluster Representation

Different clusters are represented in different colors here, and the clusters are shaded to give a clearer view of each data point. The "71.02% of the point variability" label means that the two principal components onto which clusplot projects the data capture 71.02% of the variance present in the original dataset.
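The percentage in clusplot's label comes from principal component analysis: the plot projects the data onto its first two principal components. prcomp() shows how that share of variance is computed (a sketch on toy standardized data):

```r
set.seed(123)
toy <- scale(matrix(rnorm(300), ncol = 3))

pr <- prcomp(toy)
var_share <- pr$sdev^2 / sum(pr$sdev^2)  # variance explained per component

# Share of variance captured by the first two components
# (the "point variability explained" figure clusplot reports)
round(sum(var_share[1:2]), 4)
```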

Variable Relationships Visualization

To visualize the relationships between the variables we plot a pairwise scatter plot matrix of the dataset, using the pairs() function.

R




# pairs() is part of base R, so no extra packages are required
  
# Create a scatter plot matrix
pairs(selected_data, main = "Scatter Plot Matrix of Health Parameters")


Output:

Variable Relationships Visualization

The diagonal panels show the distribution of each variable. The scatter plot matrix helps us visualize the relationships between blood pressure, cholesterol and BMI. In the context of patient health parameters, it can show how these variables relate to one another; understanding these patterns can help assess potential risk and support decision making.
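The pairwise relationships visible in the scatter plot matrix can be quantified with a correlation matrix. A sketch on the same kind of synthetic data (where the variables are generated independently, so the off-diagonal correlations should be near zero):

```r
set.seed(123)
patients <- data.frame(
  blood_pressure = rnorm(100, mean = 120, sd = 10),
  cholesterol = rnorm(100, mean = 200, sd = 30),
  bmi = rnorm(100, mean = 25, sd = 5)
)

# Pearson correlations between every pair of variables
round(cor(patients), 2)
```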

In this example, we created a fictional medical-diagnosis dataset and clustered it with fuzzy c-means (via the e1071 implementation; the Gustafson-Kessel variant extends this by adapting the distance metric per cluster). This kind of clustering can help medical practitioners draw conclusions from medical histories and the similarities between patients and their symptoms, which also makes treatment decisions easier.

Conclusion

In this article, we learned about the algorithms underlying fuzzy clustering and how it helps in various fields such as medicine, agriculture, traffic pattern analysis and customer segmentation. We applied it to various types of datasets from different sources and plotted the clustering results on graphs for better visualization. These results help researchers identify how each data point belongs or contributes to different clusters and how the clusters affect the study as a whole.



