Detecting and Visualizing Multicollinearity
To better understand multicollinearity, we can make use of the iris dataset. It consists of 3 different types of irises (Setosa, Versicolour, and Virginica) and has 4 features: sepal length, sepal width, petal length, and petal width.
Let's load the iris dataset. The code is as follows:
from sklearn import datasets
# load iris dataset
iris = datasets.load_iris()
# features
iris.feature_names
Output:
['sepal length (cm)', 'sepal width (cm)',
'petal length (cm)', 'petal width (cm)']
Visualizing Correlation with a Scatter Plot Diagram
We can make use of a scatter plot diagram to plot every numerical attribute against every other numerical attribute. If more than one pair of features shows a linear upward or downward trend (correlation), we can conclude that there is multicollinearity. Let's visually identify which features are highly correlated. The code is as follows:
import pandas as pd
import seaborn as sns
# convert iris data from numpy array to pandas dataframe
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
# plot scatter plot to identify correlated features
sns.pairplot(data=iris_df)
Output:
Here, we converted the iris dataset from a NumPy array to a Pandas dataframe, which makes plotting and correlation calculation easier. The dataframe is passed to the pairplot() function to plot every numerical attribute against every other numerical attribute.
Clearly, there is a strong correlation between petal length and petal width. We can also notice a strong correlation between sepal length and petal length.
Calculating the Correlation Value
With a scatter plot, we cannot tell exactly how strongly each feature is correlated. To quantify the correlation, we can make use of the Pandas dataframe's corr() method. Let's look at how much each attribute correlates with the other attributes.
iris_corr = iris_df.corr()
print(iris_corr)
Output:
                   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
sepal length (cm)           1.000000         -0.117570           0.871754           0.817941
sepal width (cm)           -0.117570          1.000000          -0.428440          -0.366126
petal length (cm)           0.871754         -0.428440           1.000000           0.962865
petal width (cm)            0.817941         -0.366126           0.962865           1.000000
Using the corr() method from the Pandas dataframe, we identified how much each attribute correlates with the other attributes.
- Petal length and petal width show a strong positive correlation of 0.96.
- Sepal length and petal length show a positive correlation of 0.87.
- Sepal length and petal width show a positive correlation of 0.81.
Note: A correlation value close to 1 indicates a strong positive correlation, while a value close to -1 indicates a strong negative correlation.
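As a quick programmatic check (a small sketch that rebuilds iris_df so it runs standalone), we can extract the most correlated feature pairs from the correlation matrix instead of reading them off by eye:

```python
import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# absolute correlations, upper triangle only
# (drops the diagonal and duplicate mirror pairs)
corr = iris_df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# flatten to (feature, feature) -> correlation, strongest first
pairs = upper.stack().sort_values(ascending=False)
print(pairs.head(3))
```

The strongest pair printed is petal length and petal width at roughly 0.96, matching the correlation table above.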
Let's draw a heatmap for a better visualization experience.
sns.heatmap(iris_corr, vmin=-1, vmax=1, annot=True)
Output:
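Beyond pairwise correlations, another common multicollinearity diagnostic, not covered in this article, is the variance inflation factor (VIF). As a hedged sketch, each feature's VIF can be computed from the diagonal of the inverse of the correlation matrix; a VIF above roughly 5 to 10 is usually read as a sign of problematic multicollinearity:

```python
import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# the VIF for each feature is the corresponding diagonal entry
# of the inverse of the feature correlation matrix
corr = iris_df.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=iris_df.columns)
print(vif.round(2))
```

The petal measurements come out with VIFs well above 10, confirming the strong multicollinearity visible in the heatmap.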
Applying PCA to Logistic Regression to Remove Multicollinearity
Multicollinearity is a common issue in regression models, where predictor variables are highly correlated. This can lead to unstable estimates of regression coefficients, making it difficult to determine the effect of each predictor on the response variable. Principal Component Analysis (PCA) is a powerful technique to address this issue by transforming the original correlated variables into a set of uncorrelated variables called principal components. This article explores how PCA can be applied to logistic regression to remove multicollinearity and improve model performance.
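As a minimal sketch of the idea (the full implementation appears in the later sections, and n_components=2 is an illustrative choice, not a recommendation), the pipeline below scales the iris features, projects them onto two principal components, and fits a logistic regression. The printed correlation matrix of the components is numerically the identity, showing that the multicollinearity has been removed:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data, iris.target

# scale, project onto 2 principal components, then fit logistic regression
model = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
model.fit(X, y)

# principal components are uncorrelated by construction:
# their correlation matrix has (near-)zero off-diagonal entries
X_pca = model[:-1].transform(X)
comp_corr = np.corrcoef(X_pca, rowvar=False)
print(np.round(comp_corr, 3))
```

Because the components fed to the logistic regression are uncorrelated, the coefficient estimates no longer suffer from the instability that multicollinearity causes.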
Table of Contents
- Understanding Multicollinearity
- Principal Component Analysis (PCA) for Multicollinearity
- Detecting and Visualizing Multicollinearity
- Visualizing Correlation with a Scatter Plot Diagram
- Calculating the Correlation Value
- Steps to Perform PCA for Removing Multicollinearity
- 1. Implementing PCA to Remove Multicollinearity
- 2. Training Logistic Regression with PCA