Implementation with Isolation Forest
In this section, we implement Isolation Forest and use it to detect anomalies in credit card transactions, following the steps below:
Step 1: Importing required libraries
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
Step 2: Dataset loading and pre-processing
Now we load the well-known credit card fraud dataset from Kaggle, reading only the first 40,000 rows to keep processing fast. We then separate the features (X) from the target variable (y): ‘X’ contains all columns except ‘Class’, and ‘y’ contains only the ‘Class’ column, which indicates each transaction’s fraud status (1 for fraud, 0 otherwise). Finally, we standardize the features with StandardScaler so that each one has a mean of 0 and a standard deviation of 1, and store the result in a DataFrame (df) that the model is trained on.
credit_data = pd.read_csv('creditcard.csv', nrows=40000) # https://www.kaggle.com/mlg-ulb/creditcardfraud
# Separate features and target variable
X = credit_data.drop(columns=['Class'])
y = credit_data['Class']
# Standardize the features; fit_transform returns a NumPy array of scaled values
scaled_data = StandardScaler().fit_transform(X)
df = pd.DataFrame(data=scaled_data)
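As a quick sanity check of what standardization does, here is a minimal sketch on a small synthetic array (not the credit card data). The two made-up columns and their scales are assumptions chosen only to show that StandardScaler brings differently scaled features to mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric features on very different scales
rng = np.random.default_rng(0)
raw = np.column_stack([
    rng.normal(loc=100.0, scale=25.0, size=1000),  # a large-valued column
    rng.normal(loc=0.0, scale=0.01, size=1000),    # a small-valued column
])

scaled = StandardScaler().fit_transform(raw)

col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)  # population std (ddof=0), which StandardScaler uses
print("Column means after scaling:", col_means)
print("Column stds after scaling:", col_stds)
```

After scaling, both columns are on the same footing regardless of their original units.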
Defining Isolation Forest model
Now we train the Isolation Forest model. First, we estimate the fraction of outliers as the ratio of fraudulent transactions (‘Class’ equals 1) to non-fraudulent transactions (‘Class’ equals 0). Strictly speaking, ‘contamination’ is the expected proportion of outliers in the whole dataset (fraudulent rows divided by all rows); because fraud is so rare here, the ratio used below is only marginally larger than that proportion. We then create the model and fit it to the standardized data. The hyperparameters are set as follows: ‘n_estimators’ is 100, the number of base estimators (trees) in the ensemble; ‘contamination’ is the outlier fraction calculated above; and ‘random_state’ is fixed at 42 for reproducibility.
# Determine the fraction of outliers
outlier_fraction = len(credit_data[credit_data['Class']==1])/float(len(credit_data[credit_data['Class']==0]))
# Create and fit the Isolation Forest model
model = IsolationForest(n_estimators=100, contamination=outlier_fraction, random_state=42)
model.fit(df)
Output:
IsolationForest(contamination=0.0026067776218167233, random_state=42)
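To make the role of ‘contamination’ more concrete, the following sketch fits an Isolation Forest on synthetic two-dimensional data (an assumption for illustration, not the credit card data). It shows that predict() labels as outliers the points whose decision_function value is negative, and that the share of flagged training points is close to the contamination value. The names toy_model, inliers, and outliers are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 990 clustered points plus 10 far-away points
rng = np.random.default_rng(42)
inliers = rng.normal(loc=0.0, scale=1.0, size=(990, 2))
outliers = rng.uniform(low=6.0, high=9.0, size=(10, 2))
data = np.vstack([inliers, outliers])

contamination = 0.01  # expected share of outliers in the data
toy_model = IsolationForest(n_estimators=100, contamination=contamination, random_state=42)
toy_model.fit(data)

scores = toy_model.decision_function(data)  # lower means more anomalous
labels = toy_model.predict(data)            # -1 = outlier, 1 = inlier

# The score threshold is placed so that roughly `contamination` of the
# training points end up with a negative decision_function value
flagged_share = np.mean(labels == -1)
print("Share flagged as outliers:", flagged_share)
```

In other words, ‘contamination’ does not change how the trees are built; it only decides where the cut-off on the anomaly score is placed.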
Model evaluation
Now we evaluate how accurately the model separates the outliers (potential anomalies) from the normal transactions. We compute the anomaly scores with the model’s decision_function, predict a label for every row, and print the accuracy. Isolation Forest’s predict method returns 1 for inliers and -1 for outliers, so we remap these to 0 and 1 to match the ‘Class’ column.
# Predict outliers
scores_prediction = model.decision_function(df)
y_pred = model.predict(df)
y_pred[y_pred == 1] = 0
y_pred[y_pred == -1] = 1
# Print the accuracy in separating outliers or anomalies
print("Accuracy in finding anomaly:",accuracy_score(y,y_pred))
Output:
Accuracy in finding anomaly: 0.997175
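So, we have achieved an accuracy above 99%. Keep in mind that fraudulent transactions make up well under 1% of these rows, so accuracy is dominated by the normal class; precision and recall on the fraud class (for example, from the classification_report function imported in Step 1) give a clearer picture of detection quality.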
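The sketch below illustrates why accuracy alone can be misleading on imbalanced data. It uses hypothetical labels (1,000 transactions with 10 frauds), not the results above: a detector that never flags anything still scores 99% accuracy, while the confusion matrix and classification report expose how many frauds are actually caught.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Hypothetical labels: 1000 transactions, 10 of them fraudulent (class 1)
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A "detector" that never flags anything still looks accurate
y_all_normal = np.zeros(1000, dtype=int)
baseline_accuracy = accuracy_score(y_true, y_all_normal)
print("Accuracy when predicting all normal:", baseline_accuracy)

# A detector that catches 6 of the 10 frauds and raises 4 false alarms
y_detector = np.zeros(1000, dtype=int)
y_detector[:6] = 1     # 6 true positives
y_detector[10:14] = 1  # 4 false positives
detector_accuracy = accuracy_score(y_true, y_detector)
print("Detector accuracy:", detector_accuracy)

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_detector).ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)
print(classification_report(y_true, y_detector, target_names=["normal", "fraud"]))
```

The same two calls can be applied to y and y_pred from the evaluation step to see per-class precision and recall for the Isolation Forest model.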
Comparative visualization
Now we plot the normal versus anomalous instances for a single feature of the dataset. Here we use the ‘Amount’ feature, but you can change the feature name in the code to visualize any other column.
# Selecting the feature for y-axis
y_feature = credit_data['Amount'] # change the feature name to visualize another
# Adding the predicted labels to the original dataset
credit_data['predicted_class'] = y_pred
# Plotting the graph
plt.figure(figsize=(7, 4))
sns.scatterplot(x=credit_data.index, y=y_feature, hue=credit_data['predicted_class'], palette={0: 'blue', 1: 'red'}, s=50)
plt.title('Visualization of Normal vs Anomalous Transactions')
plt.xlabel('Data points')
plt.ylabel(y_feature.name)
plt.legend(title='Predicted Class', loc='best')
plt.show()
Output:
In the plot above, the transactions predicted as anomalous (red) are largely separated from those predicted as normal (blue), with very little overlap.
What is Isolation Forest?
Isolation Forest is a widely used anomaly detection algorithm known for its efficiency and simplicity. By isolating anomalies through random binary partitioning of the data, it identifies outliers with little computational overhead, which makes it a popular choice for anomaly detection in areas ranging from cybersecurity to finance. In this article, we explore the fundamentals of the Isolation Forest algorithm.
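The idea of isolating points by binary partitioning can be sketched in a few lines. This is a simplified one-dimensional illustration, not the scikit-learn implementation (which builds many trees on random subsamples and picks random features): the hypothetical helper isolation_depth counts how many random splits are needed before a chosen value is alone in its partition. A planted anomaly should need far fewer splits on average than a typical point.

```python
import numpy as np

def isolation_depth(values, target_index, rng):
    """Count random splits until values[target_index] is alone in its partition."""
    current = np.asarray(values, dtype=float)
    target = current[target_index]
    depth = 0
    while len(current) > 1:
        low, high = current.min(), current.max()
        if low == high:  # all remaining values are identical; cannot split further
            break
        split = rng.uniform(low, high)
        # Keep only the side of the split that contains the target value
        current = current[current < split] if target < split else current[current >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
values = np.append(rng.normal(0.0, 1.0, size=200), 10.0)  # last value is the planted anomaly
anomaly_index = len(values) - 1
typical_index = int(np.argmin(np.abs(values[:-1])))       # the normal point closest to 0

trials = 500
anomaly_depth = np.mean([isolation_depth(values, anomaly_index, rng) for _ in range(trials)])
typical_depth = np.mean([isolation_depth(values, typical_index, rng) for _ in range(trials)])
print("Average splits to isolate the anomaly:", anomaly_depth)
print("Average splits to isolate a typical point:", typical_depth)
```

Shorter average isolation depth is what the algorithm turns into a higher anomaly score.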
Table of Contents
- What is Isolation Forest?
- How Isolation forest Algorithm Works?
- Implementation with Isolation Forest
- Advantages of Isolation Forest
- Limitations of Isolation Forest