Financial Fraud Detection in R

In today’s digital age, credit card fraud is a serious problem that undermines the convenience of cashless payments. Fortunately, machine learning offers powerful ways to detect and prevent fraud. These methods allow banks and other financial institutions to spot suspicious activities in real time and protect people’s hard-earned money. In this article, we will build such a project in the R programming language.

Understanding Financial Fraud Detection

Financial fraud has become a major threat to institutions in the banking, finance, and e-commerce sectors. Recent studies show that 72% of businesses worldwide have noticed an increase in fraud incidents in the past year, with credit card fraud being especially common. The financial impact is significant, resulting in billions of dollars in losses every year. But it doesn’t stop there – fraud also undermines customer trust and can seriously damage an institution’s reputation.

Challenges in Detecting Fraud

Detecting fraud is getting harder for financial institutions due to evolving schemes like phishing, skimming, synthetic identity fraud, and account takeovers. Advanced technologies such as machine learning, AI, and blockchain are crucial for analyzing data and identifying fraud, but they need constant updates and must minimize false alarms. Compliance with strict regulations adds complexity, as there needs to be a balance between preventing fraud and protecting customer privacy. Robust fraud detection requires both technology and human expertise to adapt to the ever-changing nature of financial fraud, ensuring the security and stability of our financial systems.

How Credit Card Default Happens

Sometimes, in order to gain market share, card-issuing banks issue credit cards to unqualified clients without sufficient information about their ability to repay their bills. When these card holders overuse their cards to purchase goods and services beyond their ability to repay, they accumulate heavy debts and eventually default.

About The Data Set

Our dataset, ‘Default of credit card clients’, contains information on the transactions of 30,000 credit card holders of a bank in Taiwan from April 2005 to September 2005. It has a binary response variable, ‘default.payment.next.month’, which takes the value 1 if the corresponding client defaulted on their payment and 0 otherwise. Out of the 30,000 clients, 6,636 (22.12%) had a default payment. There are 23 other independent (explanatory) variables covering the credit limit, demographic information (sex, education, marriage, age), past repayment statuses, monthly bill amounts and previous payment amounts.

Dataset Link: Financial Fraud Detection

Step 1: Loading Packages

We’ll first load the packages required for this project into the session: data.table and dplyr for data importing and wrangling; ggplot2, cowplot, pROC and ROCR for data visualization and diagnostic plotting; caret for model training; and several others, using the library() function. If a package is not installed, it must first be installed with install.packages("package name"). In our case the packages are already installed, so we just need to bring them into the session.

R
# Load required libraries for data manipulation and visualization
library(data.table)  # Efficient data manipulation
library(ggplot2)     # Data visualization
library(psych)       # Psychological, psychometric, and personality research
library(GGally)      # Extensions to ggplot2
library(dplyr)       # Data manipulation
library(cowplot)     # Plot arrangement
library(caret)       # Classification and regression training
library(pROC)        # ROC curve and AUC calculations
library(ROCR)        # Visualizing the performance of scoring classifiers
library(MASS)        # Functions and datasets for Venables and Ripley's MASS book
library(dummies)     # Create dummy/indicator variables
library(class)       # K-nearest neighbors classification
library(xgboost)     # Extreme Gradient Boosting
library(e1071)       # Misc functions of the Department of Statistics, Probability Theory Group (TU Wien)
library(nnet)        # Feed-forward neural networks and multinomial log-linear models
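
If any of these packages are missing, they can be installed first. A minimal sketch, assuming the package names listed above (note that the 'dummies' package has been archived from CRAN and may need to be installed from an archive file):

R
# Install any required packages that are not already present
pkgs=c("data.table","ggplot2","psych","GGally","dplyr","cowplot","caret",
       "pROC","ROCR","MASS","class","xgboost","e1071","nnet")
missing=pkgs[!pkgs %in% rownames(installed.packages())]
if(length(missing)>0) install.packages(missing)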

Step 2: Load the dataset

We’ll read the data ‘Default of credit card clients’ in as a CSV file into an object named credit.

R
#Reading the data in R session
credit=fread("default of credit card clients.csv")
# print the dataset
head(credit)

Output:

   LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6 BILL_AMT1 BILL_AMT2
1:     20000   2         2        1  24     2     2    -1    -1    -2    -2      3913      3102
2:    120000   2         2        2  26    -1     2     0     0     0     2      2682      1725
3:     90000   2         2        2  34     0     0     0     0     0     0     29239     14027
4:     50000   2         2        1  37     0     0     0     0     0     0     46990     48233
5:     50000   1         2        1  57    -1     0    -1     0     0     0      8617      5670
6:     50000   1         1        2  37     0     0     0     0     0     0     64400     57069

   BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
1:       689         0         0         0        0      689        0        0        0        0
2:      2682      3272      3455      3261        0     1000     1000     1000        0     2000
3:     13559     14331     14948     15549     1518     1500     1000     1000     1000     5000
4:     49291     28314     28959     29547     2000     2019     1200     1100     1069     1000
5:     35835     20940     19146     19131     2000    36681    10000     9000      689      679
6:     57608     19394     19619     20024     2500     1815      657     1000     1000      800

   default payment next month
1:                          1
2:                          1
3:                          0
4:                          0
5:                          0
6:                          0
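
Before moving on, it is worth confirming the dimensions and column types of what we just read in; a quick check:

R
# The 30000 rows and the columns described above are expected
dim(credit)
str(credit)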

Step 3: Perform Exploratory Data Analysis

The column ‘ID’ plays no role in our analysis, so it is dropped. We then check for missing values, rename the PAY_0 column to PAY_1 and the ‘default payment next month’ column to ‘target’, and transform all the qualitative variables into factor variables as per the data description.

R
## Dropping the unnecessary variable/column
credit=credit[,-1]   # the ID column plays no role in our analysis

## Looking for missing values
sum(is.na(credit))   # it is observed that there are no missing values

## Changing the name of the variable PAY_0 to PAY_1 and renaming the response to 'target'
names(credit)
names(credit)[6]="PAY_1"
names(credit)[24]="target"

## Transforming SEX, MARRIAGE, EDUCATION, the repayment statuses and the target into factors
df=as.data.frame(credit)
cat.vars=c("SEX","MARRIAGE","EDUCATION","target","PAY_1","PAY_2","PAY_3",
           "PAY_4","PAY_5","PAY_6")
df[cat.vars]=lapply(df[cat.vars],function(x) as.factor(x))

credit=df
rm(df)

Step 4: Visualization of Financial Fraud Detection dataset

Financial fraud detection is a critical area in the financial industry where machine learning and data visualization techniques play a pivotal role. In this step, we will explore the dataset visually using R.

Bivariate analysis

Now we’ll examine the correlations between the quantitative variables and check whether some of the features are highly correlated. We use a correlogram produced with ggcorr().

R
##Correlation analysis and Correlogram plot

df=as.data.frame(data.matrix(credit[,c(-2:-4,-6:-11,-24)]))
ggcorr(df,method=c("everything", "pearson"))+ggtitle("Correlation Steps")

rm(df)

Output:

Correlation steps

It can be observed that the correlations among the six monthly bill amounts are on the higher side, while all other features show low to moderate correlation with one another.
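
To see the exact figures behind the correlogram, one can also print the correlation matrix of the six bill-amount columns; a quick check, assuming the column names shown earlier:

R
# Pairwise Pearson correlations of the six monthly bill amounts
round(cor(credit[,paste0("BILL_AMT",1:6)]),2)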

Density plot for Credit Amount

We’ll now dive into visualizations of the dataset at hand: a density plot of the credit amount, a histogram of age, bar plots for marital status and gender, and scatter plots of the credit amount against the repayment statuses (PAY_1, ..., PAY_6) and bill amounts (BILL_AMT1, ..., BILL_AMT6).

R
## Visualizing the data

ggplot(credit,aes(x=LIMIT_BAL,fill=target))+
      geom_density(alpha=0.6,show.legend = T,color="blue")+
      ggtitle("Density plot of Credit Amount")+
      xlab("Credit Amount")

Output:

Density plot on credit amount

AGE for different customers with respect to default

From the density plot we see that customers with relatively lower credit amounts tend to be the defaulters. Next, let’s look at the age distribution of customers with respect to default.

R
ggplot(credit,aes(x=AGE,fill=target))+
  geom_histogram(show.legend = T,alpha=0.9)+
  ggtitle("AGE for different customers with respect to default")+
  xlab("AGE")

Output:

Age for different customers with respect to default

Different marital status

From the histogram, customers aged between 20 and 35 account for a relatively higher number of defaults. Next, we compare defaults across the different marital statuses.

R
ggplot(credit,aes(x=MARRIAGE,group=target))+
  geom_bar(show.legend = T,fill="lightblue")+
  ggtitle("Default for different marital status")+
  xlab("Marriage")+
  facet_grid(~target)

Output:

Default for different marital status

In terms of raw counts, more female than male customers default, females being the majority in the sample. Now, we arrange scatter plots of the limit balance against the bill amounts in a grid, colour-coded by default payment status, as sketched below.
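
A minimal sketch of how such a grid of bill-amount scatter plots can be built (the .data pronoun from ggplot2 is used here to pick each column by name; the object name p is just for illustration):

R
# Scatter plots of the six bill amounts against the credit limit,
# coloured by default status and arranged in a grid
p=lapply(paste0("BILL_AMT",1:6),function(col){
  ggplot(credit,aes(x=.data[[col]],y=LIMIT_BAL,color=target))+
    geom_point(show.legend = T)+
    xlab(col)+
    ylab("Limit Bal")+
    ggtitle(paste0(col," Vs Limit Bal"))
})

plot_grid(plotlist=p,nrow=3,ncol=2)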

We then make another grid of scatter plots of the repayment statuses against the limit balance, again colour-coded by default payment status.

R
## Creating the six scatter plots in a list. lapply gives each plot its own
## scope, so the column index i is still correct when the plots are finally
## drawn (in a plain for loop every panel would end up showing the last column).
q=lapply(6:11,function(i){
  ggplot(credit,aes(x=credit[,i],y=credit[,1],color=target))+
    geom_point(show.legend = T)+
    xlab(paste0("PAY_",i-5))+
    ylab("Limit Bal")+
    ggtitle(paste0("PAY_",i-5," Vs Limit Bal"))
})

plot_grid(plotlist=q,nrow=3,ncol=2)

Output:

scatter plots of Repayment Statuses with Limit Balance

Most of the defaulting customers show delays in their repayment statuses.
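
A quick tabulation of the default rate for each level of the most recent repayment status (PAY_1) backs this observation up; a sketch using dplyr:

R
# Default rate for each level of the most recent repayment status
credit %>%
  group_by(PAY_1) %>%
  summarise(n=n(),default_rate=mean(target=="1"))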

Step 5: Data Preprocessing

There are some undocumented labels in the factor variables EDUCATION and MARRIAGE. For example, the labels 4, 5 and 6 of EDUCATION are not clearly documented in the dataset description, so we merge them into the label 0, which then stands for any qualification other than graduate school, university or high school.

Similarly, we merge the undocumented label 0 of the MARRIAGE factor into the label 3, which covers statuses such as divorce and ‘others’.

R
credit$EDUCATION = recode_factor(credit$EDUCATION, '4' = "0", '5' = "0", '6' = "0",
                                 .default = levels(credit$EDUCATION))
credit$MARRIAGE = recode_factor(credit$MARRIAGE, '0'="3",
                                .default = levels(credit$MARRIAGE))
# Partitioning the data into quantitative and qualitative parts and defining the target
quanti=credit[,c(-2:-4,-6:-11,-24)]
quali=credit[,c(2:4,6:11)]
target=credit$target
(table(target)/length(target))

all.features=cbind(quanti,quali,target)
head(all.features)

Output:

   LIMIT_BAL AGE BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4
1:     20000  24      3913      3102       689         0         0         0        0      689        0        0
2:    120000  26      2682      1725      2682      3272      3455      3261        0     1000     1000     1000
3:     90000  34     29239     14027     13559     14331     14948     15549     1518     1500     1000     1000
4:     50000  37     46990     48233     49291     28314     28959     29547     2000     2019     1200     1100
5:     50000  57      8617      5670     35835     20940     19146     19131     2000    36681    10000     9000
6:     50000  37     64400     57069     57608     19394     19619     20024     2500     1815      657     1000

   PAY_AMT5 PAY_AMT6 SEX EDUCATION MARRIAGE PAY_1 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6 target
1:        0        0   2         2        1     2     2    -1    -1    -2    -2      1
2:        0     2000   2         2        2    -1     2     0     0     0     2      1
3:     1000     5000   2         2        2     0     0     0     0     0     0      0
4:     1069     1000   2         2        1     0     0     0     0     0     0      0
5:      689      679   1         2        1    -1     0    -1     0     0     0      0
6:     1000      800   1         1        2     0     0     0     0     0     0      0
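
To confirm that the undocumented labels were merged as intended, we can tabulate the recoded factors; a quick check:

R
# EDUCATION should now contain no clients labelled 4, 5 or 6,
# and MARRIAGE no clients labelled 0
table(credit$EDUCATION)
table(credit$MARRIAGE)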

We could go deeper into feature engineering, with variable transformations, selection of important variables and so on. These steps help when working with one or two models tuned to their particular criteria of fit, but when comparing a number of models, too much manipulation of the features can lead to misleading results and a loss of interpretability. Therefore, we won’t perform any further engineering.

Step 6: Test-Train split of the data

We split the combined data frame (or data table) into two parts: a training set consisting of 80% of the data, on which the model(s) will be trained, and a test set consisting of the remaining 20%, on which the model(s) will be validated.

R
# Splitting the data into train and test sets in an 80:20 ratio

set.seed(666)  # for reproducibility of results
ind=sample(nrow(all.features),24000,replace = F)

train.logit=all.features[ind,]
test.logit=all.features[-ind,]
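
Since only about 22% of the clients are defaulters, one may prefer a split that preserves this class ratio in both parts. A sketch using caret's createDataPartition() as an alternative to the simple random sample above (the object names here are illustrative; we continue with train.logit and test.logit below):

R
# Stratified 80:20 split that keeps the default/non-default ratio in both sets
set.seed(666)
ind.strat=createDataPartition(all.features$target,p=0.8,list=FALSE)
train.strat=all.features[ind.strat,]
test.strat=all.features[-ind.strat,]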

Step 7: Model fitting

For each of the six models, we’ll proceed according to the following template:

  • Train the model on the training set (tuning the hyper-parameters if needed)
  • Make predictions on both the train and test sets
  • Calculate the error rate for both sets and store them in two vectors
  • Plot the ROC curve for both sets and store the area under the curve (AUC) in two vectors
  • Lastly, plot a cumulative gain chart for the test set.

Fitting a logistic model
R
model.logit=glm(target~.,data=train.logit,family="binomial")

summary(model.logit)

Output:

Call:
glm(formula = target ~ ., family = "binomial", data = train.logit)

Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.464e+00 2.935e-01 -8.397 < 2e-16 ***
LIMIT_BAL -1.916e-06 1.956e-07 -9.795 < 2e-16 ***
AGE 2.809e-03 2.189e-03 1.283 0.199370
BILL_AMT1 -8.960e-07 1.192e-06 -0.752 0.452230
BILL_AMT2 1.537e-06 1.581e-06 0.972 0.331022
BILL_AMT3 2.813e-06 1.423e-06 1.977 0.048057 *
BILL_AMT4 -5.358e-07 1.460e-06 -0.367 0.713644
BILL_AMT5 -3.373e-07 1.653e-06 -0.204 0.838339
BILL_AMT6 -5.837e-07 1.317e-06 -0.443 0.657540
PAY_AMT1 -1.346e-05 2.709e-06 -4.967 6.81e-07 ***
PAY_AMT2 -7.198e-06 2.204e-06 -3.266 0.001091 **
PAY_AMT3 -4.088e-07 1.799e-06 -0.227 0.820274
PAY_AMT4 -2.835e-06 2.091e-06 -1.356 0.175033
PAY_AMT5 -3.042e-06 1.991e-06 -1.528 0.126540
PAY_AMT6 -2.053e-06 1.414e-06 -1.452 0.146383
SEX2 -1.495e-01 3.605e-02 -4.148 3.35e-05 ***
EDUCATION1 1.053e+00 2.204e-01 4.779 1.76e-06 ***
EDUCATION2 1.117e+00 2.195e-01 5.090 3.59e-07 ***
EDUCATION3 1.044e+00 2.222e-01 4.696 2.66e-06 ***
MARRIAGE1 2.303e-01 1.601e-01 1.438 0.150385
MARRIAGE2 8.493e-02 1.617e-01 0.525 0.599434
PAY_1-1 5.816e-01 1.210e-01 4.808 1.52e-06 ***
PAY_10 -1.234e-01 1.304e-01 -0.947 0.343713
PAY_11 8.537e-01 9.439e-02 9.045 < 2e-16 ***
PAY_12 2.070e+00 1.184e-01 17.485 < 2e-16 ***
PAY_13 2.168e+00 1.924e-01 11.266 < 2e-16 ***
PAY_14 1.841e+00 3.474e-01 5.299 1.16e-07 ***
PAY_15 1.659e+00 5.322e-01 3.118 0.001821 **
PAY_16 6.749e-01 1.009e+00 0.669 0.503492
PAY_17 2.402e+00 1.684e+00 1.427 0.153718
PAY_18 4.326e+00 7.572e+02 0.006 0.995441
PAY_2-1 -3.328e-01 1.268e-01 -2.625 0.008663 **
PAY_20 -1.368e-01 1.548e-01 -0.884 0.376871
PAY_21 -5.518e-01 6.741e-01 -0.819 0.413065
PAY_22 -5.650e-02 1.302e-01 -0.434 0.664206
PAY_23 -1.097e-01 1.983e-01 -0.553 0.580160
PAY_24 -9.950e-01 3.715e-01 -2.678 0.007398 **
PAY_25 1.644e-01 8.186e-01 0.201 0.840800
PAY_26 -6.654e-02 1.763e+00 -0.038 0.969897
PAY_27 NA NA NA NA
PAY_28 -8.565e-01 7.572e+02 -0.001 0.999097
PAY_3-1 8.224e-03 1.224e-01 0.067 0.946429
PAY_30 8.623e-02 1.418e-01 0.608 0.543082
PAY_31 -1.232e+01 5.354e+02 -0.023 0.981645
PAY_32 4.327e-01 1.433e-01 3.020 0.002529 **
PAY_33 4.238e-01 2.491e-01 1.702 0.088842 .
PAY_34 1.782e-01 4.829e-01 0.369 0.712113
PAY_35 -8.301e-01 1.038e+00 -0.800 0.423928
PAY_36 -1.870e+00 7.572e+02 -0.002 0.998029
PAY_37 -1.908e-02 9.266e-01 -0.021 0.983572
PAY_38 -2.428e+01 3.440e+02 -0.071 0.943717
PAY_4-1 -3.669e-02 1.233e-01 -0.298 0.766015
PAY_40 2.963e-02 1.368e-01 0.217 0.828576
PAY_41 1.416e+01 5.354e+02 0.026 0.978902
PAY_42 2.856e-01 1.464e-01 1.950 0.051139 .
PAY_43 1.446e-01 2.878e-01 0.502 0.615408
PAY_44 3.760e-01 5.144e-01 0.731 0.464847
PAY_45 -1.583e+00 8.779e-01 -1.803 0.071339 .
PAY_46 -1.327e+01 5.354e+02 -0.025 0.980222
PAY_47 9.932e+00 8.840e+02 0.011 0.991036
PAY_48 -1.605e+01 1.020e+03 -0.016 0.987437
PAY_5-1 -1.729e-01 1.190e-01 -1.453 0.146224
PAY_50 -1.181e-01 1.316e-01 -0.898 0.369334
PAY_52 1.613e-01 1.480e-01 1.090 0.275680
PAY_53 2.624e-01 2.791e-01 0.940 0.346987
PAY_54 -3.245e-01 5.371e-01 -0.604 0.545746
PAY_55 4.876e-01 9.817e-01 0.497 0.619405
PAY_56 2.683e+01 7.572e+02 0.035 0.971735
PAY_57 1.400e+00 8.677e+02 0.002 0.998713
PAY_58 2.678e+01 1.979e+03 0.014 0.989203
PAY_6-1 -6.052e-02 9.112e-02 -0.664 0.506581
PAY_60 -2.511e-01 9.820e-02 -2.557 0.010561 *
PAY_62 1.001e-01 1.148e-01 0.871 0.383576
PAY_63 9.828e-01 2.736e-01 3.592 0.000328 ***
PAY_64 5.379e-01 5.511e-01 0.976 0.329041
PAY_65 -9.197e-01 8.596e-01 -1.070 0.284683
PAY_66 1.141e+00 1.036e+00 1.101 0.270856
PAY_67 -1.065e+01 1.690e+02 -0.063 0.949755
PAY_68 2.621e+01 1.152e+03 0.023 0.981843
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 25493 on 23999 degrees of freedom
Residual deviance: 20929 on 23922 degrees of freedom
AIC: 21085

Number of Fisher Scoring iterations: 12

Step 8: Model Prediction

Making predictions for the test and train sets, and converting the predicted probabilities into class labels using a 0.5 threshold.

R
pred.logit=predict(model.logit,type="response",newdata = test.logit)

# Converting the predicted probabilities into class labels with a 0.5 threshold
pred.def.test=ifelse(pred.logit>0.5,"1","0")
pred.def.train=ifelse(predict(model.logit,type="response",newdata = train.logit)>0.5,"1","0")
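
The template above also asks for the misclassification (error) rate on both sets. A minimal sketch using the class labels just computed; caret's confusionMatrix() gives a more detailed breakdown:

R
# Misclassification (error) rates on the train and test sets
err.train=mean(pred.def.train!=train.logit$target)
err.test=mean(pred.def.test!=test.logit$target)
c(train=err.train,test=err.test)

# Optional: full confusion matrix on the test set
confusionMatrix(factor(pred.def.test,levels=levels(test.logit$target)),
                test.logit$target)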

Step 9: Plot Predicted values

Plotting the ROC curve and AUC for the train and test sets.

R
par(mfrow=c(1,2))
par(pty="s")

# For training
roc(train.logit$target,model.logit$fitted.values,plot=T,col="#69b3a2",
    print.auc=T,legacy.axes=TRUE,percent = T,
    xlab="False Positive percentage",ylab="True Positive percentage",
    lwd=5,main="Train Set")

# for testing

roc(test.logit$target,pred.logit,plot=T,col="navyblue",print.auc=T,legacy.axes=TRUE,
    percent = T,
    xlab="False Positive percentage",ylab="True Positive percentage",
    lwd=5,main="Test Set")

Output:

ROC curves with AUC for the train and test sets

The ROC curves, together with their AUC values, summarize how well the logistic model separates defaulters from non-defaulters on the train and test sets; comparable AUC values for the two sets indicate that the model generalizes well rather than overfitting the training data.
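
The template also calls for a cumulative gain chart on the test set. A sketch using the ROCR package loaded earlier, plotting the proportion of defaulters captured against the proportion of customers targeted (object names are illustrative):

R
# Cumulative gain chart for the test set using ROCR
pred.obj=prediction(pred.logit,test.logit$target)
gain=performance(pred.obj,"tpr","rpp")   # true positive rate vs rate of positive predictions
plot(gain,main="Cumulative Gain Chart - Test Set",
     xlab="Proportion of customers targeted",
     ylab="Proportion of defaulters captured")
abline(a=0,b=1,lty=2)   # baseline corresponding to targeting customers at random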

Conclusion

In this article, we explored the ‘Default of credit card clients’ dataset, performed exploratory analysis and preprocessing, and trained a logistic regression model to predict default, evaluating it with ROC curves and AUC on both the train and test sets. The same template can be applied to other classifiers. One natural candidate is linear discriminant analysis (LDA), a generalized version of Fisher’s discriminant rule that is also widely used for classification problems. LDA assumes that, within each class of the response variable, the features follow a multivariate normal distribution with a common variance-covariance matrix, and it discriminates between the categories using a linear combination of features, maximizing the distance between different categories while minimizing the distance within each category. Besides the formula and the training data, one more parameter, prior, can be passed to the function lda(): a vector specifying the prior probabilities of class membership, for which the class proportions in our dataset are a natural choice. Models like these allow banks and other financial institutions to flag risky accounts early and protect both their customers and themselves from losses.
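
A minimal sketch of how such an LDA model could be fitted on the same training data with MASS::lda() (object names are illustrative, and lda() may warn about collinear dummy variables on this dataset):

R
# Fitting LDA with the observed class proportions supplied as the prior
prior.probs=as.numeric(table(train.logit$target)/nrow(train.logit))
model.lda=lda(target~.,data=train.logit,prior=prior.probs)

# Class predictions on the test set
pred.lda=predict(model.lda,newdata=test.logit)
head(pred.lda$class)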


