Comparison Between Different Cross Decomposition Methods
Let’s look at the code implementation of the methods:
First, we create a dataset for analysis, and then we apply the cross decomposition methods to it.
PLSCanonical
Creation of dataset:
In the first step we import the numpy library, which is essential for mathematical operations, and set the random seed with np.random.seed so that the results are reproducible. We then create a dataset with 600 samples, 5 features in X and 3 target variables in Y. X is generated as a random normal matrix with n samples and p features. Y is created to have a linear relationship with X: X is multiplied by true_coef to give a single signal column, and that column is added to an n x q matrix of random noise, so all three columns of Y share the same signal.
Python3
import numpy as np

# Set the random seed for reproducibility
np.random.seed(42)

# Generate synthetic data
n = 600  # Total number of samples
p = 5    # Total number of features in X
q = 3    # Total number of features in Y

# Create the independent variables
X = np.random.normal(size=n * p).reshape((n, p))

# Dependent variables with some relationship to X
true_coef = np.array([0.1, -0.2, 0.3, 0.4, 0.5])
Y = X @ true_coef[:p].reshape((p, 1)) + np.random.normal(size=n * q).reshape((n, q))
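The way Y is assembled relies on NumPy broadcasting, which is easy to miss. The following is a minimal sketch that rebuilds the same data and checks the shapes; the names `signal`, `noise` and `shared` are introduced here for illustration only and are not part of the original code.

```python
import numpy as np

np.random.seed(42)
n, p, q = 600, 5, 3

X = np.random.normal(size=(n, p))
true_coef = np.array([0.1, -0.2, 0.3, 0.4, 0.5])

signal = X @ true_coef.reshape((p, 1))  # shape (n, 1): one signal column
noise = np.random.normal(size=(n, q))   # shape (n, q): independent noise
Y = signal + noise                      # (n, 1) is broadcast across q columns

print("X:", X.shape, "signal:", signal.shape, "Y:", Y.shape)

# Removing the noise leaves the same signal in every column of Y
shared = np.allclose(Y - noise, np.repeat(signal, q, axis=1))
print("All Y columns share one signal:", shared)
```

Because every column of Y carries the same signal, only one direction in X is really related to Y in this dataset, which is worth keeping in mind when reading the plots for the second pair of latent variables.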
Applying Canonical PLS to the Data:
After generating the data we import the PLSCanonical class from the cross_decomposition module of sklearn. We initialize PLSCanonical with 2 components and fit it to our synthetic dataset. Once the model is fitted, we transform both X and Y with it, so each transformed dataset has only 2 latent features.
Python3
from sklearn.cross_decomposition import PLSCanonical

# Initialize and fit the PLSCanonical model with 2 latent variables
pls_canonical = PLSCanonical(n_components=2)
pls_canonical.fit(X, Y)

# Transform the datasets
X_c, Y_c = pls_canonical.transform(X, Y)
Visualizing the Results:
For visualization we first import matplotlib. We then draw two subplots. The first shows the first pair of latent variables (also called canonical variates): the first latent variable of the X data matrix against the first latent variable of the Y data matrix. The second subplot shows the second pair of latent variables in the same way. The call plt.subplot(1, 2, 1) divides the figure, which has size 10 x 4, into a grid of 1 row and 2 columns and selects the first position in that grid. The correlation of each pair is shown in the title of its subplot.
Python3
import matplotlib.pyplot as plt

# Plot the first pair of canonical variates
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_c[:, 0], Y_c[:, 0], c='blue', label='First Canonical Variates')
plt.xlabel('First canonical variate of X')
plt.ylabel('First canonical variate of Y')
plt.title('Plot of the First Pair of Canonical Variates (corr = %.2f)'
          % np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1])
plt.legend()

# Plot the second pair of canonical variates
plt.subplot(1, 2, 2)
plt.scatter(X_c[:, 1], Y_c[:, 1], c='red', label='Second Canonical Variates')
plt.xlabel('Second canonical variate of X')
plt.ylabel('Second canonical variate of Y')
plt.title('Plot of the Second Pair of Canonical Variates (corr = %.2f)'
          % np.corrcoef(X_c[:, 1], Y_c[:, 1])[0, 1])
plt.legend()

plt.tight_layout()
plt.show()
Output:
In this plot the first pair of latent variables is clearly correlated, whereas the second pair shows much less correlation. Strictly speaking, PLSCanonical chooses each pair of latent variables to maximize their covariance rather than their correlation, so the correlations shown in the titles are a consequence of that criterion.
Next we visualize the correlation within each dataset: the first latent variable of the feature dataset against the second latent variable of the same dataset, and likewise for the response dataset. This shows whether PLSCanonical was able to remove the correlation between latent variables of the same data matrix.
Python3
# Plot the latent variables of the feature data
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_c[:, 0], X_c[:, 1], c='yellow', label="Corr data of feature matrix")
plt.xlabel('First canonical variate of X')
plt.ylabel('Second canonical variate of X')
plt.title('Canonical Variates of the X Matrix (corr = %.2f)'
          % np.corrcoef(X_c[:, 0], X_c[:, 1])[0, 1])
plt.legend()

# Plot the latent variables of the response data
plt.subplot(1, 2, 2)
plt.scatter(Y_c[:, 0], Y_c[:, 1], c='green', label="Corr data of response matrix")
plt.xlabel('First canonical variate of Y')
plt.ylabel('Second canonical variate of Y')
plt.title('Canonical Variates of the Y Matrix (corr = %.2f)'
          % np.corrcoef(Y_c[:, 0], Y_c[:, 1])[0, 1])
plt.legend()

plt.tight_layout()
plt.show()
Output:
This visualization shows that the latent variables of the same dataset are essentially uncorrelated, since the reported correlation is nearly zero. PLSCanonical has therefore produced latent variables that are related across the two matrices while being almost uncorrelated within each matrix.
The two components created by PLSCanonical are useful for exactly this reason: the paired components of X and Y are well correlated, and the components within each dataset carry almost no redundant information, which is the property we want before training a model on the reduced data.
PLS Regression with Multivariate Response
Here we first create a synthetic dataset and then apply PLSRegression to it to see how well a model built on the latent variables performs.
First we import numpy for numerical operations, PLSRegression from sklearn's cross_decomposition module, and mean_squared_error from sklearn's metrics module. We then define 500 samples with 10 predictor variables, so the predictor matrix Xr has shape n x p (500 x 10, i.e. 5000 values). The vector true_coefficients is combined linearly with Xr to produce Yr. Its first two entries are 1.5 and -2.0 and the rest are zeros, which means only the first two features of Xr are relevant for predicting Yr. Finally, Yr is generated by the linear model Yr = Xr · true_coefficients + noise, where the noise is drawn from a standard normal distribution. Note that Yr in this example is a single response column; PLSRegression also accepts a response with several columns.
Python3
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error

# Set random seed for reproducibility
np.random.seed(0)

# Generate some synthetic data
n = 500
p = 10

# Predictor variables
Xr = np.random.normal(size=n * p).reshape((n, p))

# Dependent variable with some noise
true_coefficients = np.array([1.5, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
Yr = np.dot(Xr, true_coefficients) + np.random.normal(size=n)
Training and Testing the Model:
The PLSRegression model is instantiated with 2 latent variables and fitted to the data. Predictions are then made with the fitted model on the same data, and the mean squared error is calculated to check how well the model fits.
Python3
# Number of latent variables
n_components = 2

# Create and fit the PLS regression model
pls = PLSRegression(n_components=n_components)
pls.fit(Xr, Yr)

# Predict with the fitted PLS regression model
Y_pred = pls.predict(Xr)

# Calculate the mean squared error
mse = mean_squared_error(Yr, Y_pred)
print(f"Mean Squared Error: {mse}")
Output:
Mean Squared Error: 1.0056612234303313
We can now visualize the PLSRegression model with the given data to see how well the model fits the given data.
Visualization of Performance:
For visualization we import Matplotlib and plot the values predicted by the model against the original values. The closer the data points lie to the diagonal, the better the model is performing.
Python3
import matplotlib.pyplot as plt

# Plot the true vs predicted values
plt.scatter(Yr, Y_pred)
plt.xlabel('True Values')
plt.ylabel('Predicted Values')
plt.title('PLS Regression: True vs Predicted')

# Diagonal reference line: perfect predictions would fall on it
plt.plot([Yr.min(), Yr.max()], [Yr.min(), Yr.max()], 'k--', lw=2)
plt.show()
Output:
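Here we can see that the predicted values follow the true values closely. The mean squared error of about 1.0 is close to the variance of the unit-variance noise added to Yr, so two latent variables capture most of the recoverable signal. This suggests PLSRegression is a reasonable approach for this kind of data, although the features in this synthetic example are independent, so the result does not by itself demonstrate behavior under multicollinearity.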
CCA (PLS mode B with symmetric deflation)
In this section we use the same data as in the PLSCanonical example above and apply Canonical Correlation Analysis (CCA) to it.
First we import the CCA class from the cross_decomposition module of sklearn and initialize it with 2 components, so the transformed data is reduced to two components. We then fit the model to the data and finally transform the dataset with the fitted model.
Python3
from sklearn.cross_decomposition import CCA

# Initialize and fit the CCA model with 2 components
cca = CCA(n_components=2)
cca.fit(X, Y)

# Transform the datasets
X_cca, Y_cca = cca.transform(X, Y)
After transforming the data we can visualize it with matplotlib. We draw two subplots in a single figure, one for each pair of canonical variables of the X and Y datasets, which helps us understand the relationship between the new components created by CCA. The first plot (subplot(1,2,1)) shows the first canonical component of X against the first canonical component of Y, and the second plot (subplot(1,2,2)) shows the second canonical component of X against the second canonical component of Y.
Python3
import matplotlib.pyplot as plt

# Plot the results
plt.figure(figsize=(12, 6))

# Plot for the 1st pair of canonical variables
plt.subplot(1, 2, 1)
plt.scatter(X_cca[:, 0], Y_cca[:, 0], color='blue', label='First Latent Var')
plt.xlabel('X canonical variable 1')
plt.ylabel('Y canonical variable 1')
plt.title('First pair of canonical variables (corr = %.2f)'
          % np.corrcoef(X_cca[:, 0], Y_cca[:, 0])[0, 1])
plt.legend()

# Plot for the 2nd pair of canonical variables
plt.subplot(1, 2, 2)
plt.scatter(X_cca[:, 1], Y_cca[:, 1], color='red', label='Second Latent Var')
plt.xlabel('X canonical variable 2')
plt.ylabel('Y canonical variable 2')
plt.title('Second pair of canonical variables (corr = %.2f)'
          % np.corrcoef(X_cca[:, 1], Y_cca[:, 1])[0, 1])
plt.legend()

plt.tight_layout()
plt.show()
Output:
Here we can see that the paired canonical variables of X and Y are visibly correlated, which suggests that CCA is suitable for the analysis and dimensionality reduction of correlated datasets. On this dataset, CCA and PLSCanonical produced similar results.
Understanding Cross Decomposition in Machine Learning
Usually, in real-world datasets, some of the features of the data are highly correlated with each other. Applying normal regression methods to highly correlated data is not an effective way to analyze such data, since multicollinearity makes the estimates highly sensitive to any change in the model. In this article, we will be diving deep into Cross Decomposition which will help us understand the optimal solutions to problems like multicollinearity in the data.