Financial Fraud Detection in R

In today’s digital age, credit card fraud is a serious problem that undermines the convenience of cashless payments. Fortunately, machine learning offers powerful ways to detect and prevent fraud. These methods allow banks and other financial institutions to spot suspicious activities in real time and protect people’s hard-earned money. In this article, we will build such a project in the R programming language.

Understanding Financial Fraud Detection

Financial fraud has become a major threat to institutions in the banking, finance, and e-commerce sectors. Recent studies show that 72% of businesses worldwide have noticed an increase in fraud incidents in the past year, with credit card fraud being especially common. The financial impact is significant, resulting in billions of dollars in losses every year. But it doesn’t stop there – fraud also undermines customer trust and can seriously damage an institution’s reputation.

Challenges in Detecting Fraud

Detecting fraud is getting harder for financial institutions due to evolving schemes like phishing, skimming, synthetic identity fraud, and account takeovers. Advanced technologies such as machine learning, AI, and blockchain are crucial for analyzing data and identifying fraud, but they need constant updates and must minimize false alarms. Compliance with strict regulations adds complexity, as there needs to be a balance between preventing fraud and protecting customer privacy. Robust fraud detection requires both technology and human expertise to adapt to the ever-changing nature of financial fraud, ensuring the security and stability of our financial systems.

How Credit Card Default Happens

Sometimes, in order to gain market share, card-issuing banks issue credit cards to unqualified clients without sufficient information about their ability to repay their bills. When these card holders overuse their cards to purchase goods and services beyond their ability to repay, they accumulate heavy debts and eventually default.

About The Data Set

Our dataset, ‘Default of credit card clients’, contains information on the transactions of 30,000 credit card holders of a bank in Taiwan from April 2005 to September 2005. It has a binary response variable, ‘default.payment.next.month’, which takes the value 1 if the corresponding client defaulted on their payment and 0 otherwise. Out of the 30,000 clients, 6,636 (22.12%) had a default payment. There are 23 other independent (explanatory) variables covering the credit limit, demographic information (sex, education, marriage, age), past repayment statuses, monthly bill amounts and previous payment amounts.

Dataset Link: Financial Fraud Detection

Step 1: Loading Packages

We’ll first load the packages required for this project into the session: data.table and dplyr for data importing and wrangling; ggplot2, cowplot, pROC and ROCR for data visualization and diagnostic plotting; caret for model training; and several others, using the library() function. If a package is not installed, it must first be installed with install.packages("package name"). In our case the packages are already installed, so we just need to bring them into the session.

R
# Load required libraries for data manipulation and visualization
library(data.table)  # Efficient data manipulation
library(ggplot2)     # Data visualization
library(psych)       # Psychological, psychometric, and personality research
library(GGally)      # Extensions to ggplot2
library(dplyr)       # Data manipulation
library(cowplot)     # Plot arrangement
library(caret)       # Classification and regression training
library(pROC)        # ROC curve and AUC calculations
library(ROCR)        # Visualizing the performance of scoring classifiers
library(MASS)        # Functions and datasets for Venables and Ripley's MASS book
library(dummies)     # Create dummy/indicator variables
library(class)       # K-nearest neighbors classification
library(xgboost)     # Extreme Gradient Boosting
library(e1071)       # Misc functions of the Department of Statistics, Probability Theory Group (TU Wien)
library(nnet)        # Feed-forward neural networks and multinomial log-linear models
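
If any of these packages are missing, they can be installed first. A minimal sketch, assuming the package names listed above (note that the 'dummies' package has been archived from CRAN and may need to be installed from an archive file):

R
# Install any required packages that are not already present
pkgs=c("data.table","ggplot2","psych","GGally","dplyr","cowplot","caret",
       "pROC","ROCR","MASS","class","xgboost","e1071","nnet")
missing=pkgs[!pkgs %in% rownames(installed.packages())]
if(length(missing)>0) install.packages(missing)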

Step 2: Load the dataset

We’ll read the data ‘Default of credit card clients’ in as a CSV file into an object named credit.

R
#Reading the data in R session
credit=fread("default of credit card clients.csv")
# print the dataset
head(credit)

Output:

   LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6 BILL_AMT1 BILL_AMT2
1:     20000   2         2        1  24     2     2    -1    -1    -2    -2      3913      3102
2:    120000   2         2        2  26    -1     2     0     0     0     2      2682      1725
3:     90000   2         2        2  34     0     0     0     0     0     0     29239     14027
4:     50000   2         2        1  37     0     0     0     0     0     0     46990     48233
5:     50000   1         2        1  57    -1     0    -1     0     0     0      8617      5670
6:     50000   1         1        2  37     0     0     0     0     0     0     64400     57069

   BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
1:       689         0         0         0        0      689        0        0        0        0
2:      2682      3272      3455      3261        0     1000     1000     1000        0     2000
3:     13559     14331     14948     15549     1518     1500     1000     1000     1000     5000
4:     49291     28314     28959     29547     2000     2019     1200     1100     1069     1000
5:     35835     20940     19146     19131     2000    36681    10000     9000      689      679
6:     57608     19394     19619     20024     2500     1815      657     1000     1000      800

   default payment next month
1:                          1
2:                          1
3:                          0
4:                          0
5:                          0
6:                          0
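
Before moving on, it is worth confirming the dimensions and column types of what we just read in; a quick check:

R
# The 30000 rows and the columns described above are expected
dim(credit)
str(credit)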

Step 3: Perform Exploratory Data Analysis

The column ‘ID’ plays no role in our analysis, so it is dropped. We then check for missing values, rename the PAY_0 column to PAY_1 and the ‘default payment next month’ column to ‘target’, and transform all the qualitative variables into factor variables as per the data description.

R
## Dropping the unnecessary variable/column
credit=credit[,-1]   # the ID column plays no role in our analysis

## Looking for missing values
sum(is.na(credit))   # it is observed that there are no missing values

## Changing the name of the variable PAY_0 to PAY_1 and renaming the response to 'target'
names(credit)
names(credit)[6]="PAY_1"
names(credit)[24]="target"

## Transforming SEX, MARRIAGE, EDUCATION, the repayment statuses and the target into factors
df=as.data.frame(credit)
cat.vars=c("SEX","MARRIAGE","EDUCATION","target","PAY_1","PAY_2","PAY_3",
           "PAY_4","PAY_5","PAY_6")
df[cat.vars]=lapply(df[cat.vars],function(x) as.factor(x))

credit=df
rm(df)

Step 4: Visualization of Financial Fraud Detection dataset

Financial fraud detection is a critical area in the financial industry where machine learning and data visualization techniques play a pivotal role. In this step, we will explore the dataset visually using R.

Bivariate analysis

Now we’ll examine the correlations between the quantitative variables and check whether some of the features are highly correlated. We use a correlogram produced with ggcorr().

R
##Correlation analysis and Correlogram plot

df=as.data.frame(data.matrix(credit[,c(-2:-4,-6:-11,-24)]))
ggcorr(df,method=c("everything", "pearson"))+ggtitle("Correlation Steps")

rm(df)

Output:

Correlation steps

It can be observed that the correlations among the six monthly bill amounts are on the higher side, while all other features show low to moderate correlation with one another.
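
To see the exact figures behind the correlogram, one can also print the correlation matrix of the six bill-amount columns; a quick check, assuming the column names shown earlier:

R
# Pairwise Pearson correlations of the six monthly bill amounts
round(cor(credit[,paste0("BILL_AMT",1:6)]),2)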

Density plot for Credit Amount

We’ll now dive into visualizations of the dataset at hand: a density plot of the credit amount, a histogram of age, bar plots for marital status and gender, and scatter plots of the credit amount against the repayment statuses (PAY_1, ..., PAY_6) and bill amounts (BILL_AMT1, ..., BILL_AMT6).

R
## Visualizing the data

ggplot(credit,aes(x=LIMIT_BAL,fill=target))+
      geom_density(alpha=0.6,show.legend = T,color="blue")+
      ggtitle("Density plot of Credit Amount")+
      xlab("Credit Amount")

Output:

Density plot on credit amount

AGE for different customers with respect to default

From the density plot we see that customers with relatively lower credit amounts tend to be the defaulters. Next, let’s look at the age distribution of customers with respect to default.

R
ggplot(credit,aes(x=AGE,fill=target))+
  geom_histogram(show.legend = T,alpha=0.9)+
  ggtitle("AGE for different customers with respect to default")+
  xlab("AGE")

Output:

Age for different customers with respect to default

Different marital status

From the histogram, customers aged between 20 and 35 account for a relatively higher number of defaults. Next, we compare defaults across the different marital statuses.

R
ggplot(credit,aes(x=MARRIAGE,group=target))+
  geom_bar(show.legend = T,fill="lightblue")+
  ggtitle("Default for different marital status")+
  xlab("Marriage")+
  facet_grid(~target)

Output:

Default for different marital status

In terms of raw counts, more female than male customers default, females being the majority in the sample. Now, we arrange scatter plots of the limit balance against the bill amounts in a grid, colour-coded by default payment status, as sketched below.
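
A minimal sketch of how such a grid of bill-amount scatter plots can be built (the .data pronoun from ggplot2 is used here to pick each column by name; the object name p is just for illustration):

R
# Scatter plots of the six bill amounts against the credit limit,
# coloured by default status and arranged in a grid
p=lapply(paste0("BILL_AMT",1:6),function(col){
  ggplot(credit,aes(x=.data[[col]],y=LIMIT_BAL,color=target))+
    geom_point(show.legend = T)+
    xlab(col)+
    ylab("Limit Bal")+
    ggtitle(paste0(col," Vs Limit Bal"))
})

plot_grid(plotlist=p,nrow=3,ncol=2)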

We then make another grid of scatter plots of the repayment statuses against the limit balance, again colour-coded by default payment status.

R
## Creating the six scatter plots in a list. lapply gives each plot its own
## scope, so the column index i is still correct when the plots are finally
## drawn (in a plain for loop every panel would end up showing the last column).
q=lapply(6:11,function(i){
  ggplot(credit,aes(x=credit[,i],y=credit[,1],color=target))+
    geom_point(show.legend = T)+
    xlab(paste0("PAY_",i-5))+
    ylab("Limit Bal")+
    ggtitle(paste0("PAY_",i-5," Vs Limit Bal"))
})

plot_grid(plotlist=q,nrow=3,ncol=2)

Output:

scatter plots of Repayment Statuses with Limit Balance

Most of the defaulting customers show delays in their repayment statuses.
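
A quick tabulation of the default rate for each level of the most recent repayment status (PAY_1) backs this observation up; a sketch using dplyr:

R
# Default rate for each level of the most recent repayment status
credit %>%
  group_by(PAY_1) %>%
  summarise(n=n(),default_rate=mean(target=="1"))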

Step 5: Data Preprocessing

There are some undocumented labels in the factor variables EDUCATION and MARRIAGE. For example, the labels 4, 5 and 6 of EDUCATION are not clearly documented in the dataset description, so we merge them into the label 0, which then stands for any qualification other than graduate school, university or high school.

Similarly, we merge the undocumented label 0 of the MARRIAGE factor into the label 3, which covers statuses such as divorce and ‘others’.

R
credit$EDUCATION = recode_factor(credit$EDUCATION, '4' = "0", '5' = "0", '6' = "0",
                                 .default = levels(credit$EDUCATION))
credit$MARRIAGE = recode_factor(credit$MARRIAGE, '0'="3",
                                .default = levels(credit$MARRIAGE))
# Partitioning the data into quantitative and qualitative parts and defining the target
quanti=credit[,c(-2:-4,-6:-11,-24)]
quali=credit[,c(2:4,6:11)]
target=credit$target
(table(target)/length(target))

all.features=cbind(quanti,quali,target)
head(all.features)

Output:

   LIMIT_BAL AGE BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4
1:     20000  24      3913      3102       689         0         0         0        0      689        0        0
2:    120000  26      2682      1725      2682      3272      3455      3261        0     1000     1000     1000
3:     90000  34     29239     14027     13559     14331     14948     15549     1518     1500     1000     1000
4:     50000  37     46990     48233     49291     28314     28959     29547     2000     2019     1200     1100
5:     50000  57      8617      5670     35835     20940     19146     19131     2000    36681    10000     9000
6:     50000  37     64400     57069     57608     19394     19619     20024     2500     1815      657     1000

   PAY_AMT5 PAY_AMT6 SEX EDUCATION MARRIAGE PAY_1 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6 target
1:        0        0   2         2        1     2     2    -1    -1    -2    -2      1
2:        0     2000   2         2        2    -1     2     0     0     0     2      1
3:     1000     5000   2         2        2     0     0     0     0     0     0      0
4:     1069     1000   2         2        1     0     0     0     0     0     0      0
5:      689      679   1         2        1    -1     0    -1     0     0     0      0
6:     1000      800   1         1        2     0     0     0     0     0     0      0
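
To confirm that the undocumented labels were merged as intended, we can tabulate the recoded factors; a quick check:

R
# EDUCATION should now contain no clients labelled 4, 5 or 6,
# and MARRIAGE no clients labelled 0
table(credit$EDUCATION)
table(credit$MARRIAGE)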

We could go deeper into feature engineering, with variable transformations, selection of important variables and so on. These steps help when working with one or two models tuned to their particular criteria of fit, but when comparing a number of models, too much manipulation of the features can lead to misleading results and a loss of interpretability. Therefore, we won’t perform any further engineering.

Step 6: Test-Train split of the data

We split the combined data frame (or data table) into two parts: a training set consisting of 80% of the data, on which the model(s) will be trained, and a test set consisting of the remaining 20%, on which the model(s) will be validated.

R
# Splitting the data into train and test sets in an 80:20 ratio

set.seed(666)  # for reproducibility of results
ind=sample(nrow(all.features),24000,replace = F)

train.logit=all.features[ind,]
test.logit=all.features[-ind,]
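
Since only about 22% of the clients are defaulters, one may prefer a split that preserves this class ratio in both parts. A sketch using caret's createDataPartition() as an alternative to the simple random sample above (the object names here are illustrative; we continue with train.logit and test.logit below):

R
# Stratified 80:20 split that keeps the default/non-default ratio in both sets
set.seed(666)
ind.strat=createDataPartition(all.features$target,p=0.8,list=FALSE)
train.strat=all.features[ind.strat,]
test.strat=all.features[-ind.strat,]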

Step 7: Model fitting

For each of the six models, we’ll proceed according to the following template:

  • Train the model on the training set (tuning the hyper-parameters if needed)
  • Make predictions on both the train and test sets
  • Calculate the error rate for both sets and store them in two vectors
  • Plot the ROC curve for both sets and store the area under the curve (AUC) in two vectors
  • Lastly, plot a cumulative gain chart for the test set.

Fitting a logistic model
R
model.logit=glm(target~.,data=train.logit,family="binomial")

summary(model.logit)

Output:

Call:
glm(formula = target ~ ., family = "binomial", data = train.logit)

Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.464e+00 2.935e-01 -8.397 < 2e-16 ***
LIMIT_BAL -1.916e-06 1.956e-07 -9.795 < 2e-16 ***
AGE 2.809e-03 2.189e-03 1.283 0.199370
BILL_AMT1 -8.960e-07 1.192e-06 -0.752 0.452230
BILL_AMT2 1.537e-06 1.581e-06 0.972 0.331022
BILL_AMT3 2.813e-06 1.423e-06 1.977 0.048057 *
BILL_AMT4 -5.358e-07 1.460e-06 -0.367 0.713644
BILL_AMT5 -3.373e-07 1.653e-06 -0.204 0.838339
BILL_AMT6 -5.837e-07 1.317e-06 -0.443 0.657540
PAY_AMT1 -1.346e-05 2.709e-06 -4.967 6.81e-07 ***
PAY_AMT2 -7.198e-06 2.204e-06 -3.266 0.001091 **
PAY_AMT3 -4.088e-07 1.799e-06 -0.227 0.820274
PAY_AMT4 -2.835e-06 2.091e-06 -1.356 0.175033
PAY_AMT5 -3.042e-06 1.991e-06 -1.528 0.126540
PAY_AMT6 -2.053e-06 1.414e-06 -1.452 0.146383
SEX2 -1.495e-01 3.605e-02 -4.148 3.35e-05 ***
EDUCATION1 1.053e+00 2.204e-01 4.779 1.76e-06 ***
EDUCATION2 1.117e+00 2.195e-01 5.090 3.59e-07 ***
EDUCATION3 1.044e+00 2.222e-01 4.696 2.66e-06 ***
MARRIAGE1 2.303e-01 1.601e-01 1.438 0.150385
MARRIAGE2 8.493e-02 1.617e-01 0.525 0.599434
PAY_1-1 5.816e-01 1.210e-01 4.808 1.52e-06 ***
PAY_10 -1.234e-01 1.304e-01 -0.947 0.343713
PAY_11 8.537e-01 9.439e-02 9.045 < 2e-16 ***
PAY_12 2.070e+00 1.184e-01 17.485 < 2e-16 ***
PAY_13 2.168e+00 1.924e-01 11.266 < 2e-16 ***
PAY_14 1.841e+00 3.474e-01 5.299 1.16e-07 ***
PAY_15 1.659e+00 5.322e-01 3.118 0.001821 **
PAY_16 6.749e-01 1.009e+00 0.669 0.503492
PAY_17 2.402e+00 1.684e+00 1.427 0.153718
PAY_18 4.326e+00 7.572e+02 0.006 0.995441
PAY_2-1 -3.328e-01 1.268e-01 -2.625 0.008663 **
PAY_20 -1.368e-01 1.548e-01 -0.884 0.376871
PAY_21 -5.518e-01 6.741e-01 -0.819 0.413065
PAY_22 -5.650e-02 1.302e-01 -0.434 0.664206
PAY_23 -1.097e-01 1.983e-01 -0.553 0.580160
PAY_24 -9.950e-01 3.715e-01 -2.678 0.007398 **
PAY_25 1.644e-01 8.186e-01 0.201 0.840800
PAY_26 -6.654e-02 1.763e+00 -0.038 0.969897
PAY_27 NA NA NA NA
PAY_28 -8.565e-01 7.572e+02 -0.001 0.999097
PAY_3-1 8.224e-03 1.224e-01 0.067 0.946429
PAY_30 8.623e-02 1.418e-01 0.608 0.543082
PAY_31 -1.232e+01 5.354e+02 -0.023 0.981645
PAY_32 4.327e-01 1.433e-01 3.020 0.002529 **
PAY_33 4.238e-01 2.491e-01 1.702 0.088842 .
PAY_34 1.782e-01 4.829e-01 0.369 0.712113
PAY_35 -8.301e-01 1.038e+00 -0.800 0.423928
PAY_36 -1.870e+00 7.572e+02 -0.002 0.998029
PAY_37 -1.908e-02 9.266e-01 -0.021 0.983572
PAY_38 -2.428e+01 3.440e+02 -0.071 0.943717
PAY_4-1 -3.669e-02 1.233e-01 -0.298 0.766015
PAY_40 2.963e-02 1.368e-01 0.217 0.828576
PAY_41 1.416e+01 5.354e+02 0.026 0.978902
PAY_42 2.856e-01 1.464e-01 1.950 0.051139 .
PAY_43 1.446e-01 2.878e-01 0.502 0.615408
PAY_44 3.760e-01 5.144e-01 0.731 0.464847
PAY_45 -1.583e+00 8.779e-01 -1.803 0.071339 .
PAY_46 -1.327e+01 5.354e+02 -0.025 0.980222
PAY_47 9.932e+00 8.840e+02 0.011 0.991036
PAY_48 -1.605e+01 1.020e+03 -0.016 0.987437
PAY_5-1 -1.729e-01 1.190e-01 -1.453 0.146224
PAY_50 -1.181e-01 1.316e-01 -0.898 0.369334
PAY_52 1.613e-01 1.480e-01 1.090 0.275680
PAY_53 2.624e-01 2.791e-01 0.940 0.346987
PAY_54 -3.245e-01 5.371e-01 -0.604 0.545746
PAY_55 4.876e-01 9.817e-01 0.497 0.619405
PAY_56 2.683e+01 7.572e+02 0.035 0.971735
PAY_57 1.400e+00 8.677e+02 0.002 0.998713
PAY_58 2.678e+01 1.979e+03 0.014 0.989203
PAY_6-1 -6.052e-02 9.112e-02 -0.664 0.506581
PAY_60 -2.511e-01 9.820e-02 -2.557 0.010561 *
PAY_62 1.001e-01 1.148e-01 0.871 0.383576
PAY_63 9.828e-01 2.736e-01 3.592 0.000328 ***
PAY_64 5.379e-01 5.511e-01 0.976 0.329041
PAY_65 -9.197e-01 8.596e-01 -1.070 0.284683
PAY_66 1.141e+00 1.036e+00 1.101 0.270856
PAY_67 -1.065e+01 1.690e+02 -0.063 0.949755
PAY_68 2.621e+01 1.152e+03 0.023 0.981843
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 25493 on 23999 degrees of freedom
Residual deviance: 20929 on 23922 degrees of freedom
AIC: 21085

Number of Fisher Scoring iterations: 12

Step 8: Model Prediction

Making predictions for the test and train sets, and converting the predicted probabilities into class labels using a 0.5 threshold.

R
pred.logit=predict(model.logit,type="response",newdata = test.logit)

# Converting the predicted probabilities into class labels with a 0.5 threshold
pred.def.test=ifelse(pred.logit>0.5,"1","0")
pred.def.train=ifelse(predict(model.logit,type="response",newdata = train.logit)>0.5,"1","0")
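
The template above also asks for the misclassification (error) rate on both sets. A minimal sketch using the class labels just computed; caret's confusionMatrix() gives a more detailed breakdown:

R
# Misclassification (error) rates on the train and test sets
err.train=mean(pred.def.train!=train.logit$target)
err.test=mean(pred.def.test!=test.logit$target)
c(train=err.train,test=err.test)

# Optional: full confusion matrix on the test set
confusionMatrix(factor(pred.def.test,levels=levels(test.logit$target)),
                test.logit$target)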

Step 9: Plot Predicted values

Plotting the ROC curve and AUC for the train and test sets.

R
par(mfrow=c(1,2))
par(pty="s")

# For training
roc(train.logit$target,model.logit$fitted.values,plot=T,col="#69b3a2",
    print.auc=T,legacy.axes=TRUE,percent = T,
    xlab="False Positive percentage",ylab="True Positive percentage",
    lwd=5,main="Train Set")

# for testing

roc(test.logit$target,pred.logit,plot=T,col="navyblue",print.auc=T,legacy.axes=TRUE,
    percent = T,
    xlab="False Positive percentage",ylab="True Positive percentage",
    lwd=5,main="Test Set")

Output:

ROC curves with AUC for the train and test sets

The ROC curves, together with their AUC values, summarize how well the logistic model separates defaulters from non-defaulters on the train and test sets; comparable AUC values for the two sets indicate that the model generalizes well rather than overfitting the training data.
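
The template also calls for a cumulative gain chart on the test set. A sketch using the ROCR package loaded earlier, plotting the proportion of defaulters captured against the proportion of customers targeted (object names are illustrative):

R
# Cumulative gain chart for the test set using ROCR
pred.obj=prediction(pred.logit,test.logit$target)
gain=performance(pred.obj,"tpr","rpp")   # true positive rate vs rate of positive predictions
plot(gain,main="Cumulative Gain Chart - Test Set",
     xlab="Proportion of customers targeted",
     ylab="Proportion of defaulters captured")
abline(a=0,b=1,lty=2)   # baseline corresponding to targeting customers at random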

Conclusion

In this article, we explored the ‘Default of credit card clients’ dataset, performed exploratory analysis and preprocessing, and trained a logistic regression model to predict default, evaluating it with ROC curves and AUC on both the train and test sets. The same template can be applied to other classifiers. One natural candidate is linear discriminant analysis (LDA), a generalized version of Fisher’s discriminant rule that is also widely used for classification problems. LDA assumes that, within each class of the response variable, the features follow a multivariate normal distribution with a common variance-covariance matrix, and it discriminates between the categories using a linear combination of features, maximizing the distance between different categories while minimizing the distance within each category. Besides the formula and the training data, one more parameter, prior, can be passed to the function lda(): a vector specifying the prior probabilities of class membership, for which the class proportions in our dataset are a natural choice. Models like these allow banks and other financial institutions to flag risky accounts early and protect both their customers and themselves from losses.
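
A minimal sketch of how such an LDA model could be fitted on the same training data with MASS::lda() (object names are illustrative, and lda() may warn about collinear dummy variables on this dataset):

R
# Fitting LDA with the observed class proportions supplied as the prior
prior.probs=as.numeric(table(train.logit$target)/nrow(train.logit))
model.lda=lda(target~.,data=train.logit,prior=prior.probs)

# Class predictions on the test set
pred.lda=predict(model.lda,newdata=test.logit)
head(pred.lda$class)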


