Churn Prediction for Subscription Services in R

Churn Predictions have become an important part of today’s economic world for all the companies providing subscriptions to their consumers. Churn prediction is a process identifying the customers who are more likely to cancel their subscriptions to a service. These predictions help the service provider retain their existing customer and identify the issues making the consumer cancel their subscriptions. Businesses use this technique for better marketing and sales. This article will use R Programming Language to predict customer churn for subscriptions.

Predicting customer churn

Predicting customer churn, the phenomenon where customers cease doing business with a company is crucial for businesses across various industries. By identifying customers at risk of churning, companies can take proactive measures to retain them, thereby reducing revenue loss and maintaining a loyal customer base.

Dataset Link: Customer Churn

Now we will discuss step-by-step Churn Prediction for Subscription Services in R.

Step 1: Data Preparation

In this step, we will load the necessary libraries and datasets. Make sure you replace the path with the original link.

R
# Load necessary libraries
library(tidyverse)
library(caret)
library(randomForest)
library(pROC)

# Load the dataset
data <- read.csv("subscription_data.csv")

# Handle missing values (remove rows with missing values)
data <- na.omit(data)

# Check the structure of the data
str(data)

Output:

'data.frame':    7032 obs. of  21 variables:
$ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
$ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
$ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
$ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
$ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
$ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
$ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
$ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
$ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
$ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
$ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
$ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
$ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
$ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
$ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
$ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
$ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
$ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
$ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
$ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
$ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
- attr(*, "na.action")= 'omit' Named int [1:11] 489 754 937 1083 1341 3332 3827 4381 5219 6671 ...
..- attr(*, "names")= chr [1:11] "489" "754" "937" "1083" ...

Step 2: Exploratory Data Analysis (EDA)

EDA helps us understand the relationship between the variables.

R
# Checking missing values and removed them
colSums(is.na(data))
# Drop  missing Values
data<-na.omit(data))

Output:

      customerID           gender    SeniorCitizen          Partner 
0 0 0 0
Dependents tenure PhoneService MultipleLines
0 0 0 0
InternetService OnlineSecurity OnlineBackup DeviceProtection
0 0 0 0
TechSupport StreamingTV StreamingMovies Contract
0 0 0 0
PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
0 0 0 11
Churn
0

So we have total 11 missing values persant in MonthlyCharges column and we will remove that values.

Visualization of the Dataset

We can also understand the churn relationship with contract type and payment method and how it plays a role in it. This relationship and pattern is better understood when we plot it on the graph using ggplot2.

R
# Create a new feature: average monthly charges
trainData <- data %>%
  mutate(avgMonthlyCharges = TotalCharges / (tenure + 1))

testData <- data%>%
  mutate(avgMonthlyCharges = TotalCharges / (tenure + 1))

# Plot the distribution of tenure by churn
ggplot(trainData, aes(x = tenure, fill = Churn)) +
  geom_histogram(binwidth = 1, position = "dodge") +
  labs(title = "Distribution of Tenure by Churn",
       x = "Tenure (months)", y = "Count")

Output:

Churn Prediction for Subscription Services in R

This plot visualizes the distribution of tenure (in months) for customers who churned (Churn = Yes) and those who did not churn (Churn = No). It helps in understanding the tenure distribution for both churned and non-churned customers, which can be useful for further analysis and model building.

Plot the distribution of monthly charges by churn

We create a histogram to visualize the distribution of monthly charges for customers who churned and those who did not churn.

R
# Plot the distribution of monthly charges by churn
ggplot(trainData, aes(x = MonthlyCharges, fill = Churn)) +
  geom_histogram(binwidth = 5, position = "dodge") +
  labs(title = "Distribution of Monthly Charges by Churn",
       x = "Monthly Charges", y = "Count")

Output:

Churn Prediction for Subscription Services in R

By visualizing the distribution of monthly charges by churn status, businesses can gain valuable insights into customer behavior and make data-driven decisions to mitigate churn and enhance customer retention strategies.

Plot churn distribution by payment method

We can also understand the churn relationship with contract type and payment method and how it plays a role in it. This relationship and pattern is better understood when we plot it on the graph using ggplot2.

R
# Plot churn distribution by payment method
ggplot(trainData, aes(x = PaymentMethod, fill = Churn)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Churn Distribution by Payment Method", x = "Payment Method",
       y = "Percentage", fill = "Churn") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Output:

Percentage Vs Payment Method

This bar chart helps us to understand that people paying through electronic check have the highest churn percentage.

Step 3: Model Bueilding

In this example we will use random forest algorithm to build and train our model. For this we need to install randomForest package in R. This algorithm uses multiple decision trees to improve the overall performance. Each tree is trained on a sample training dataset. Due to this method the accuracy is high.

R
# Split the data into training and test sets
set.seed(123)
trainIndex <- createDataPartition(data$Churn, p = 0.8, list = FALSE)
trainData <- data[trainIndex, ]
testData <- data[-trainIndex, ]

# Create a new feature: average monthly charges
trainData <- trainData %>%
  mutate(avgMonthlyCharges = TotalCharges / (tenure + 1))

testData <- testData %>%
  mutate(avgMonthlyCharges = TotalCharges / (tenure + 1))

# Build the Random Forest model
set.seed(123)
rfModel <- randomForest(Churn ~ ., data = trainData, ntree = 100, mtry = 3, 
                        importance = TRUE)

# Summary of the model
print(rfModel)

Output:

Call:
randomForest(formula = Churn ~ ., data = trainData, ntree = 100, mtry = 3, importance = TRUE)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 3

OOB estimate of error rate: 20.47%
Confusion matrix:
No Yes class.error
No 3715 416 0.1007020
Yes 736 760 0.4919786

The confusion matrix provides a detailed summary of the classification results, showing the counts of true positive, true negative, false positive, and false negative predictions.

  • The overall OOB error rate is 20.47%, indicating that the model’s predictions are incorrect about one-fifth of the time.
  • The confusion matrix shows that the model performs better in predicting ‘No’ (non-churn) compared to ‘Yes’ (churn).
  • The high class error for ‘Yes’ suggests that the model struggles more with correctly predicting churners.

Step 4: Model Evaluation

Now we will check that the accuracy of owr model.

R
# Predict on the test set
predictions <- predict(rfModel, newdata = testData)

# Confusion Matrix
confusionMatrix <- table(predictions, testData$Churn)
confusionMatrix

# Calculate accuracy, precision, and recall
accuracy <- sum(diag(confusionMatrix)) / sum(confusionMatrix)
precision <- confusionMatrix["Yes", "Yes"] / sum(confusionMatrix["Yes", ])
recall <- confusionMatrix["Yes", "Yes"] / sum(confusionMatrix[, "Yes"])

# Print metrics
cat("Accuracy: ", accuracy, "\n")
cat("Precision: ", precision, "\n")
cat("Recall: ", recall, "\n")

Output:

predictions  No Yes
No 919 177
Yes 113 196

Accuracy: 0.7935943

Precision: 0.6343042

Recall: 0.5254692

Area under the curve: 0.8332

The accuracy is approximately 79.36%, meaning the model correctly predicts the churn status about 79.36% of the time.

  • The precision is approximately 63.43%, meaning that when the model predicts a customer will churn, it is correct about 63.43% of the time.
  • The recall is approximately 52.55%, meaning the model correctly identifies 52.55% of the customers who actually churned.
  • The AUC is 0.8332, meaning the model has an 83.32% chance of correctly distinguishing between churners and non-churners.

Step 5: Making Prediction for Model

We will create a new data frame with the same structure as the training data for making predictions.

R
# Create a new data frame for prediction with matching structure
newData <- data.frame(
  customerID = "9999-AAAA",
  gender = factor("Male", levels = levels(trainData$gender)),
  SeniorCitizen = 0,
  Partner = factor("Yes", levels = levels(trainData$Partner)),
  Dependents = factor("No", levels = levels(trainData$Dependents)),
  tenure = 12,
  PhoneService = factor("Yes", levels = levels(trainData$PhoneService)),
  MultipleLines = factor("No", levels = levels(trainData$MultipleLines)),
  InternetService = factor("Fiber optic", levels = levels(trainData$InternetService)),
  OnlineSecurity = factor("No", levels = levels(trainData$OnlineSecurity)),
  OnlineBackup = factor("No", levels = levels(trainData$OnlineBackup)),
  DeviceProtection = factor("No", levels = levels(trainData$DeviceProtection)),
  TechSupport = factor("No", levels = levels(trainData$TechSupport)),
  StreamingTV = factor("No", levels = levels(trainData$StreamingTV)),
  StreamingMovies = factor("No", levels = levels(trainData$StreamingMovies)),
  Contract = factor("Month-to-month", levels = levels(trainData$Contract)),
  PaperlessBilling = factor("Yes", levels = levels(trainData$PaperlessBilling)),
  PaymentMethod = factor("Electronic check", levels = levels(trainData$PaymentMethod)),
  MonthlyCharges = 70.35,
  TotalCharges = 843.9,
  avgMonthlyCharges = 843.9 / (12 + 1)
)

# Predict churn for the new data
newPrediction <- predict(rfModel, newdata = newData)
newPrediction

Output:

  1 
Yes
Levels: No Yes

Now we plot the Feature Importance Plot tells us about the variable and how important they are for our predictions.

R
# Plot feature importance
importance <- importance(rfModel)
varImportance <- data.frame(Variables = row.names(importance), Importance = importance[,1])

# Plot
ggplot(varImportance, aes(x = reorder(Variables, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Feature Importance", x = "Features", y = "Importance")

Output:

VARIANCE IMPORTANCE PLOT

This plot informs us about the features that actually affect the prediction rate.

Conclusion

In this article, we understood the meaning of churn and how it affects our business, With the help of R Programming we used a dataset and predicted customer churn based on different variables. We also discussed about the algorithm to be used and how it works. This helps us in business management and economic growth.



Contact Us