Fuzzy Clustering in R using a Customer Segmentation Dataset

In this example we apply fuzzy clustering to a sample sales dataset downloaded from Kaggle.
The dataset contains order, sales, customer, and shipping information, which we use for analysis and clustering. The steps below walk through the implementation.

1. Loading Required Libraries

As discussed above, the libraries we need for clustering are e1071, cluster, factoextra, and ggplot2, whose roles were described earlier. The code to install and load them is:

R

# Install the libraries (only needed once)
install.packages("e1071")
install.packages("cluster")
install.packages("factoextra")
install.packages("ggplot2")
# Load the libraries
library(e1071)
library(cluster)
library(factoextra)
library(ggplot2)
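
Calling install.packages() on every run downloads packages that are already present. A minimal sketch, assuming you would rather install only what is missing; the helper name missing_packages is hypothetical and not part of any library:

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("e1071", "cluster", "factoextra", "ggplot2")
to_install <- missing_packages(required)

# Left commented out so that running the sketch never touches the network.
# if (length(to_install) > 0) install.packages(to_install)
```

Uncommenting the last line would install whatever to_install lists; the library() calls are still needed afterwards.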

2. Loading the Dataset

This code reads the dataset from the given path. Replace your_path.csv with the path to your own copy of the file.

R

data <- read.csv("your_path.csv") 
  
head(data)

Output:

 ORDERNUMBER QUANTITYORDERED PRICEEACH ORDERLINENUMBER   SALES       ORDERDATE
1       10107              30     95.70               2 2871.00  2/24/2003 0:00
2       10121              34     81.35               5 2765.90   5/7/2003 0:00
3       10134              41     94.74               2 3884.34   7/1/2003 0:00
4       10145              45     83.26               6 3746.70  8/25/2003 0:00
5       10159              49    100.00              14 5205.27 10/10/2003 0:00
6       10168              36     96.66               1 3479.76 10/28/2003 0:00
   STATUS QTR_ID MONTH_ID YEAR_ID PRODUCTLINE MSRP PRODUCTCODE
1 Shipped      1        2    2003 Motorcycles   95    S10_1678
2 Shipped      2        5    2003 Motorcycles   95    S10_1678
3 Shipped      3        7    2003 Motorcycles   95    S10_1678
4 Shipped      3        8    2003 Motorcycles   95    S10_1678
5 Shipped      4       10    2003 Motorcycles   95    S10_1678
6 Shipped      4       10    2003 Motorcycles   95    S10_1678
              CUSTOMERNAME            PHONE                  ADDRESSLINE1
1        Land of Toys Inc.       2125557818       897 Long Airport Avenue
2       Reims Collectables       26.47.1555            59 rue de l'Abbaye
3          Lyon Souveniers +33 1 46 62 7555 27 rue du Colonel Pierre Avia
4        Toys4GrownUps.com       6265557265            78934 Hillside Dr.
5 Corporate Gift Ideas Co.       6505551386               7734 Strong St.
6     Technics Stores Inc.       6505556809             9408 Furth Circle
  ADDRESSLINE2          CITY STATE POSTALCODE COUNTRY TERRITORY CONTACTLASTNAME
1                        NYC    NY      10022     USA      <NA>              Yu
2                      Reims            51100  France      EMEA         Henriot
3                      Paris            75508  France      EMEA        Da Cunha
4                   Pasadena    CA      90003     USA      <NA>           Young
5              San Francisco    CA                USA      <NA>           Brown
6                 Burlingame    CA      94217     USA      <NA>          Hirano
  CONTACTFIRSTNAME DEALSIZE
1             Kwai    Small
2             Paul    Small
3           Daniel   Medium
4            Julie   Medium
5            Julie   Medium
6             Juri   Medium

3. Data Preprocessing

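The na.omit() function removes rows that contain missing values. Missing values can distort the analysis, so it is important to deal with them. Here colSums(is.na(data)) reports that only TERRITORY has missing values (1074 of them), so na.omit() drops those 1074 rows.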

R

# column wise missing values 
colSums(is.na(data)) 
  
# Handle missing values 
data <- na.omit(data)

Output:

     ORDERNUMBER  QUANTITYORDERED        PRICEEACH  ORDERLINENUMBER            SALES 
               0                0                0                0                0 
       ORDERDATE           STATUS           QTR_ID         MONTH_ID          YEAR_ID 
               0                0                0                0                0 
     PRODUCTLINE             MSRP      PRODUCTCODE     CUSTOMERNAME            PHONE 
               0                0                0                0                0 
    ADDRESSLINE1     ADDRESSLINE2             CITY            STATE       POSTALCODE 
               0                0                0                0                0 
         COUNTRY        TERRITORY  CONTACTLASTNAME CONTACTFIRSTNAME         DEALSIZE 
               0             1074                0                0                0
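
Because na.omit() looks at every column, rows are dropped here only because TERRITORY is missing, even though TERRITORY is not one of the columns clustered later. A hedged alternative, shown on a small made-up data frame (the toy values are illustrative, not taken from the dataset), restricts the check to the columns that will be clustered:

```r
# Toy data frame standing in for the sales data (values are made up).
toy <- data.frame(
  QUANTITYORDERED = c(30, 34, NA, 45),
  SALES           = c(2871.00, 2765.90, 3884.34, 3746.70),
  TERRITORY       = c(NA, "EMEA", "EMEA", NA)
)

cluster_cols <- c("QUANTITYORDERED", "SALES")

# na.omit() drops any row with an NA in any column.
nrow(na.omit(toy))

# complete.cases() on the clustering columns keeps rows whose
# only missing value is in a column we do not cluster on.
toy_kept <- toy[complete.cases(toy[, cluster_cols]), ]
nrow(toy_kept)
```

It may also be worth checking the raw file: if the missing TERRITORY entries are the literal text NA (for example as an abbreviation), read.csv() would parse them as missing by default. That is an assumption about the file, not something the output above confirms.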

4. Data Selection for Clustering

The dataset has many columns, so we select only the ones we want to cluster on. Here we cluster on quantity ordered (QUANTITYORDERED), price each (PRICEEACH), sales (SALES), and the manufacturer's suggested retail price (MSRP). You can list the column names with colnames(data).

R

data_for_clustering <- data[, c("QUANTITYORDERED", "PRICEEACH", "SALES", "MSRP")]
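
The four selected variables are on very different scales (SALES is in the thousands, QUANTITYORDERED in the tens), so distance-based clustering on the raw values is dominated by SALES. One optional step, not part of the original walk-through, is to standardize the columns first. A minimal sketch using the first three rows of the head() output above:

```r
# First three rows of the selected columns, copied from the head() output.
sample_rows <- data.frame(
  QUANTITYORDERED = c(30, 34, 41),
  PRICEEACH       = c(95.70, 81.35, 94.74),
  SALES           = c(2871.00, 2765.90, 3884.34),
  MSRP            = c(95, 95, 95)
)

# Drop constant columns (MSRP is constant in this tiny sample and would
# produce NaN when divided by a standard deviation of zero).
varying <- sapply(sample_rows, function(col) sd(col) > 0)

# scale() centers each column to mean 0 and rescales it to sd 1.
scaled <- scale(sample_rows[, varying])

round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```

With the full dataset, scale(data_for_clustering) could be passed to cmeans() in place of the raw columns; the cluster centers would then be expressed in standardized units rather than in the original ones.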

5. Fuzzy C-means Clustering

Now we run the clustering on the selected data with the cmeans() function from e1071. We pass it the number of clusters (centers) and the fuzziness coefficient (m).

R

set.seed(123) 
n_cluster <- 5 
m <- 2 
result <- cmeans(data_for_clustering, centers = n_cluster, m = m)
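
To make the roles of centers and m concrete, here is a minimal base-R sketch of the standard fuzzy c-means updates. It is an illustration under simplifying assumptions (Euclidean distance, a fixed number of iterations, random initial memberships), not the implementation used inside cmeans(); the function name fcm_sketch is hypothetical.

```r
# Minimal fuzzy c-means sketch: alternate between updating the centers
# and updating the membership degrees.
fcm_sketch <- function(x, k, m = 2, iterations = 50) {
  x <- as.matrix(x)
  n <- nrow(x)
  # Random initial memberships, normalized so that each row sums to 1.
  u <- matrix(runif(n * k), nrow = n, ncol = k)
  u <- u / rowSums(u)
  for (iter in seq_len(iterations)) {
    w <- u^m
    # Centers: weighted means of the data, with weights u^m.
    centers <- (t(w) %*% x) / colSums(w)
    # Squared Euclidean distance from every point to every center.
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    d2 <- pmax(d2, 1e-12)  # guard against division by zero
    # Memberships: proportional to d2^(-1 / (m - 1)), normalized per row.
    inv <- d2^(-1 / (m - 1))
    u <- inv / rowSums(inv)
  }
  list(centers = centers, membership = u)
}

# Usage on synthetic two-group data (made up for illustration).
set.seed(123)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))
fit <- fcm_sketch(toy, k = 2)
range(rowSums(fit$membership))  # every row of memberships sums to 1
```

A larger m flattens the memberships towards 1/k, while m close to 1 pushes them towards hard, k-means-like assignments.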

Data Membership Degree Matrix:

The membership degree matrix, also known as the fuzzy membership matrix, is a fundamental concept in fuzzy clustering: it shows the degree to which each data point belongs to each cluster. Values range between 0 and 1, where 0 indicates no membership and 1 indicates full membership, and the degrees for each data point sum to 1.

R

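# Data Membership Degree Matrix (rows: data points, columns: clusters)
fuzzy_membership_matrix <- result$membership

# Cluster prototypes: cmeans() returns the final centers only,
# with one row per cluster and one column per variable
cluster_centers <- result$centers
# Transposed copy: one row per variable, one column per cluster
final_centers <- t(result$centers)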

6. Interpret the Clustering Results

R

cluster_membership <- as.data.frame(result$membership) 
data_with_clusters <- cbind(data, cluster_membership) 
head(data_with_clusters)

Output:

   ORDERNUMBER QUANTITYORDERED PRICEEACH ORDERLINENUMBER   SALES       ORDERDATE
2        10121              34     81.35               5 2765.90   5/7/2003 0:00
3        10134              41     94.74               2 3884.34   7/1/2003 0:00
7        10180              29     86.13               9 2497.77 11/11/2003 0:00
8        10188              48    100.00               1 5512.32 11/18/2003 0:00
10       10211              41    100.00              14 4708.44  1/15/2004 0:00
11       10223              37    100.00               1 3965.66  2/20/2004 0:00
    STATUS QTR_ID MONTH_ID YEAR_ID PRODUCTLINE MSRP PRODUCTCODE
2  Shipped      2        5    2003 Motorcycles   95    S10_1678
3  Shipped      3        7    2003 Motorcycles   95    S10_1678
7  Shipped      4       11    2003 Motorcycles   95    S10_1678
8  Shipped      4       11    2003 Motorcycles   95    S10_1678
10 Shipped      1        1    2004 Motorcycles   95    S10_1678
11 Shipped      1        2    2004 Motorcycles   95    S10_1678
                 CUSTOMERNAME            PHONE                  ADDRESSLINE1
2          Reims Collectables       26.47.1555            59 rue de l'Abbaye
3             Lyon Souveniers +33 1 46 62 7555 27 rue du Colonel Pierre Avia
7    Daedalus Designs Imports       20.16.1555       184, chausse de Tournai
8                Herkku Gifts    +47 2267 3215   Drammen 121, PR 744 Sentrum
10           Auto Canal Petit   (1) 47.55.6555             25, rue Lauriston
11 Australian Collectors, Co.     03 9520 4555             636 St Kilda Road
   ADDRESSLINE2      CITY    STATE POSTALCODE   COUNTRY TERRITORY CONTACTLASTNAME
2                   Reims               51100    France      EMEA         Henriot
3                   Paris               75508    France      EMEA        Da Cunha
7                   Lille               59000    France      EMEA           Rance
8                  Bergen              N 5804    Norway      EMEA          Oeztan
10                  Paris               75016    France      EMEA         Perrier
11      Level 3 Melbourne Victoria       3004 Australia      APAC        Ferguson
   CONTACTFIRSTNAME DEALSIZE            1           2            3            4
2              Paul    Small 0.0001541063 0.999690200 0.0001263451 2.272976e-05
3            Daniel   Medium 0.0048451207 0.020573748 0.9663439156 6.979966e-03
7           Martine    Small 0.0878851450 0.874531088 0.0290942132 6.458913e-03
8            Veysel   Medium 0.0045450830 0.009277472 0.0321427220 9.454499e-01
10        Dominique   Medium 0.0290347397 0.075020298 0.6323656865 2.425679e-01
11            Peter   Medium 0.0011707987 0.004626692 0.9918892502 1.975149e-03
              5
2  6.618887e-06
3  1.257249e-03
7  2.030640e-03
8  8.584810e-03
10 2.101140e-02
11 3.381099e-04

The head() function prints the first few rows of the combined data frame. The output can be read in four parts:

Customer Details: This section gives information about each order and customer, such as order number, quantity ordered, price each, total sales, order date, product information, and customer name. Each row is one order line, so the same customer can appear in several rows.
Cluster Membership Degrees: Columns 1 to 5 show the degree to which each row belongs to each of the five clusters found by the algorithm. For example, row 2 belongs almost entirely to cluster 2 (0.9997), while row 10 is split mainly between cluster 3 (0.63) and cluster 4 (0.24).
Deal Size Classification: DEALSIZE can take three values in this dataset (Small, Medium, and Large); the six rows shown here contain only Small and Medium deals.
Understanding Customer Behavior: This is the main purpose of the analysis. Dividing customers into clusters helps us understand their purchasing behaviour and buying patterns, give better recommendations to each group, and improve customer service.
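
A quick way to use the membership degrees is to take the highest-membership cluster as a hard label and flag rows whose highest degree is low. The sketch below uses the six membership rows printed above (rounded to four decimals); the 0.7 cut-off is an arbitrary assumption chosen for illustration.

```r
# Membership degrees copied from the head() output above (rounded).
membership <- rbind(
  "2"  = c(0.0002, 0.9997, 0.0001, 0.0000, 0.0000),
  "3"  = c(0.0048, 0.0206, 0.9663, 0.0070, 0.0013),
  "7"  = c(0.0879, 0.8745, 0.0291, 0.0065, 0.0020),
  "8"  = c(0.0045, 0.0093, 0.0321, 0.9454, 0.0086),
  "10" = c(0.0290, 0.0750, 0.6324, 0.2426, 0.0210),
  "11" = c(0.0012, 0.0046, 0.9919, 0.0020, 0.0003)
)

hard_label <- apply(membership, 1, which.max)  # strongest cluster per row
top_degree <- apply(membership, 1, max)        # how strong that assignment is

threshold <- 0.7  # assumed cut-off, not from the original analysis
ambiguous <- names(top_degree)[top_degree < threshold]

hard_label
ambiguous
```

With these six rows only row 10 falls below the cut-off, which matches the split between clusters 3 and 4 visible in the printed output.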

Cluster Separation Score or Gap Index

The Cluster Separation Score, or Gap index, is used to estimate the optimal number of clusters in a dataset. A higher Gap value suggests better-defined clusters.
It measures the gap between the observed clustering quality and the quality expected under a reference distribution with no cluster structure.

R

# Load required libraries 
library(clusterSim) 
  
# Clustering Comparison for Determining Optimal Clusters 
  
# Computing PAM clustering with k = 4 
cl1 <- pam(data_for_clustering, 4)  
  
# Computing PAM clustering with k = 5 
cl2 <- pam(data_for_clustering, 5)   
  
# Combine the clustering results 
cl_all <- cbind(cl1$clustering, cl2$clustering) 
  
# Calculate the Gap index for the dataset 
gap <- index.Gap(data_for_clustering, cl_all, reference.distribution = "unif",  
                 B = 10, method = "pam") 
  
# Print the Gap index 
print(gap)

Output:

$gap
[1] 0.3893237
$diffu
[1] -0.1060642

gap index (gap): the Gap statistic for the first clustering passed in (k = 4 here). Larger values mean the observed clusters are tighter than would be expected from uniformly distributed reference data.
difference value (diffu): according to the clusterSim documentation, this is Gap(u) - [Gap(u+1) - s(u+1)], the quantity used to choose the number of clusters: the smallest u for which diffu is non-negative is taken as optimal. The negative value here therefore suggests that 4 clusters does not yet satisfy the criterion and that larger values of k should also be examined.
Together these values help in choosing the number of clusters.
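
index.Gap() compares only two adjacent values of k per call. As an alternative sketch, and not the method used above, the cluster package's clusGap() evaluates the Gap statistic over a range of k in one call. The synthetic data, K.max = 6, and B = 20 are assumptions made for illustration.

```r
library(cluster)

set.seed(123)
# Synthetic data with three well-separated groups (made up for illustration).
x <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
           matrix(rnorm(100, mean = 6),  ncol = 2),
           matrix(rnorm(100, mean = 12), ncol = 2))

# clusGap() expects a function returning a list with a `cluster` component.
pam_k <- function(x, k) list(cluster = pam(x, k, cluster.only = TRUE))

gap_stat <- clusGap(x, FUNcluster = pam_k, K.max = 6, B = 20, verbose = FALSE)

# Pick k with the rule from Tibshirani et al. (2001): the smallest k
# whose gap is within one standard error of the gap at k + 1.
best_k <- maxSE(gap_stat$Tab[, "gap"], gap_stat$Tab[, "SE.sim"],
                method = "Tibs2001SEmax")

gap_stat$Tab
best_k
```

On the sales data, data_for_clustering would take the place of x; a larger B gives a more stable estimate at the cost of run time.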

Davies-Bouldin’s index

This index measures how similar the clusters are to each other. It takes into account both the scatter within clusters and the separation between clusters. A lower Davies-Bouldin index indicates better clustering.

R

# Load required libraries
library(cluster)     # pam()
library(clusterSim)  # index.DB()

# Calculate PAM clustering with k = 5
clustering_results <- pam(data_for_clustering, 5)
# Calculate Davies-Bouldin's index for the dataset
db_index <- index.DB(data_for_clustering, clustering_results$clustering,
                     centrotypes = "centroids")

# Print Davies-Bouldin's index
print(db_index)

Output:

$DB
[1] 0.6708421
$r
[1] 0.6288000 0.5934592 0.7515756 0.7515756 0.6288000
$R
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,]       Inf 0.5934592 0.2948785 0.3190663 0.6288000
[2,] 0.5934592       Inf 0.5603336 0.4305076 0.3335226
[3,] 0.2948785 0.5603336       Inf 0.7515756 0.2264422
[4,] 0.3190663 0.4305076 0.7515756       Inf 0.2724843
[5,] 0.6288000 0.3335226 0.2264422 0.2724843       Inf
$d
         1        2        3        4        5
1    0.000 1142.226 2635.563 4947.227 1088.237
2 1142.226    0.000 1493.378 3805.072 2230.441
3 2635.563 1493.378    0.000 2311.701 3723.724
4 4947.227 3805.072 2311.701    0.000 6035.324
5 1088.237 2230.441 3723.724 6035.324    0.000
$S
[1]  309.1230  368.7419  468.0478 1269.3704  375.1605
$centers
         [,1]     [,2]     [,3]      [,4]
[1,] 33.48421 81.46667 2704.752  89.80421
[2,] 35.98091 94.16315 3846.702 111.30788
[3,] 41.05263 99.80880 5339.980 126.80827
[4,] 45.48344 99.95550 7651.551 150.98013
[5,] 28.18721 59.94959 1616.948  68.52968

$DB: The value of the Davies-Bouldin index for the clustering result (the mean of the values in $r).
$r: For each cluster, the largest value in its row of $R, that is, its similarity to the cluster it resembles most.
$R: The matrix of pairwise cluster similarities, (S_i + S_j) / d_ij, for each pair of clusters i and j.
$d: The matrix of distances between the cluster centers (centroids here).
$S: The scatter (dispersion) value for each cluster.
$centers: The coordinates of the cluster centers for each variable.
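
The pieces of this output fit together arithmetically. As a check, the sketch below rebuilds $R, $r, and $DB from the printed $S and $d values using base R only; because the inputs are rounded as printed, the result should agree with the printed $DB only to about four decimals.

```r
# Scatter values ($S) and center-to-center distances ($d) from the output above.
S <- c(309.1230, 368.7419, 468.0478, 1269.3704, 375.1605)
d <- matrix(c(
     0.000, 1142.226, 2635.563, 4947.227, 1088.237,
  1142.226,    0.000, 1493.378, 3805.072, 2230.441,
  2635.563, 1493.378,    0.000, 2311.701, 3723.724,
  4947.227, 3805.072, 2311.701,    0.000, 6035.324,
  1088.237, 2230.441, 3723.724, 6035.324,    0.000
), nrow = 5, byrow = TRUE)

# R[i, j] = (S[i] + S[j]) / d[i, j]; the diagonal divides by zero and is Inf.
R <- outer(S, S, "+") / d

# r[i]: similarity of cluster i to its most similar other cluster.
R_offdiag <- R
diag(R_offdiag) <- -Inf
r <- apply(R_offdiag, 1, max)

# The Davies-Bouldin index is the mean of r.
DB <- mean(r)
round(DB, 4)
```

Clusters 3 and 4 drive the index up the most: cluster 4 has by far the largest scatter, and its center is relatively close to that of cluster 3.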

Variance Ratio Criterion or Calinski-Harabasz index

The Variance Ratio Criterion, or Calinski-Harabasz index, is the ratio of the between-cluster variance to the within-cluster variance. A higher value is preferred, as it suggests better-defined, more clearly separated clusters.
PAM (Partitioning Around Medoids) is a partitional clustering method that uses actual data points (medoids) as cluster centers.

R

# Cluster Evaluation using Calinski-Harabasz Index 
  
# Calculate PAM clustering with k = 10 
clustering_results <- pam(data_for_clustering, 10)   
# Calculate Calinski-Harabasz pseudo F-statistic for the dataset 
ch_index <- index.G1(data_for_clustering, clustering_results$clustering) 
  
# Print the Calinski-Harabasz pseudo F-statistic 
print(ch_index)

Output:

[1] 7433.806

This code first performs PAM clustering with 10 clusters and then computes the Variance Ratio Criterion for the resulting partition. The large value indicates that the between-cluster variance is much greater than the within-cluster variance. Because the index has no absolute scale, it is most informative when compared across different numbers of clusters.
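
To show what the index computes, here is a from-scratch sketch in base R. The function name ch_index_sketch is hypothetical, the data are synthetic, and k-means is used only as a convenient source of hard labels (the walk-through above uses PAM).

```r
# Calinski-Harabasz index from scratch for a data matrix and hard labels.
ch_index_sketch <- function(x, labels) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- length(unique(labels))
  overall_mean <- colMeans(x)
  wss <- 0  # within-cluster sum of squares
  bss <- 0  # between-cluster sum of squares
  for (g in unique(labels)) {
    members <- x[labels == g, , drop = FALSE]
    center <- colMeans(members)
    wss <- wss + sum(sweep(members, 2, center)^2)
    bss <- bss + nrow(members) * sum((center - overall_mean)^2)
  }
  (bss / (k - 1)) / (wss / (n - k))
}

set.seed(123)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 4), ncol = 2))
km <- kmeans(toy, centers = 2, nstart = 5)  # k-means used only to get labels

ch_manual <- ch_index_sketch(toy, km$cluster)
# Same ratio built from the sums of squares that kmeans() reports.
ch_kmeans <- (km$betweenss / (2 - 1)) / (km$tot.withinss / (nrow(toy) - 2))
all.equal(ch_manual, ch_kmeans)
```

Running the same function over several values of k and comparing the results is the usual way to use the index for choosing k.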

7. Visualizing the Clustering Results

To visualize the results we use ggplot2, a widely used plotting package. Each point is colored by the cluster in which it has the highest membership degree.

R

centers <- t(result$centers) 
data_with_clusters$Cluster <- apply(result$membership, 1, which.max) 
ggplot(data_with_clusters, aes(x = QUANTITYORDERED, y = PRICEEACH,  
                               color = as.factor(Cluster))) + 
  geom_point() + 
  labs(title = "Fuzzy C-means Clustering", x = "Quantity Ordered", y = "Price Each")

Output:

Fuzzy Clustering in R

Each color represents a cluster of orders with similar purchasing characteristics. Each point shows the quantity ordered and the price each for one order line. This gives insight into the customer segmentation.

Variable Relationships Visualization:

Variable relationship plots help us understand the relationships between the variables in our dataset. Pairwise scatter plots show each pair of variables against each other, which helps in identifying patterns and relationships between features.

R

pairs(data_for_clustering, pch = 16, col = as.numeric(result$cluster))

Output:

Variable Relationships Visualization

The scatter plots illustrate how, for example, QUANTITYORDERED and PRICEEACH or SALES and MSRP are related, highlighting trends or correlations in the data.

A data point cluster representation, or clusplot, visualizes the data points in relation to their clusters. It offers a different perspective on how the clusters are distributed and how they relate to one another, and it shows which cluster each data point belongs to based on its features.

R

clusplot(data_for_clustering, result$cluster, color = TRUE, shade = TRUE,
         labels = 2, lines = 0)

Output:

Fuzzy Clustering in R

The percentage of point variability reported on the plot is the share of the variation in the data that the two plotted components capture. The plot looks crowded because the dataset has many data points. Here the two components explain 88.42% of the point variability, so the two-dimensional view preserves most of the structure in the four variables and is a reasonable basis for inspecting the main patterns and the separation between clusters.
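
As far as the cluster package documentation describes it, this figure is the share of variance explained by the first two principal components. A rough sketch of that calculation with base R's prcomp() on synthetic data follows; the data and the choice to standardize the columns are assumptions, so the number will not match the 88.42% above.

```r
set.seed(123)
# Synthetic four-column data with two correlated pairs of variables.
a <- rnorm(200)
b <- rnorm(200)
toy <- cbind(v1 = a, v2 = a + rnorm(200, sd = 0.3),
             v3 = b, v4 = b + rnorm(200, sd = 0.3))

pca <- prcomp(toy, scale. = TRUE)

# Proportion of total variance carried by each component.
var_share <- pca$sdev^2 / sum(pca$sdev^2)

# Percentage explained by the two components a 2-D plot would show.
two_component_pct <- 100 * sum(var_share[1:2])
round(two_component_pct, 2)
```

When this percentage is low, the two-dimensional plot hides much of the structure and apparent overlaps between clusters should be read with caution.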

This analysis helps a business understand the purchasing patterns of its customers. Such segmentation supports informed decisions, for example on strategy or recommendations, which can improve customer satisfaction. It can also improve customer engagement through personalized recommendations.
