Comparison Between Different Cross Decomposition Methods

Let’s look at the code implementation of the methods:

First, we will create a synthetic dataset for analysis, and then we will apply the cross decomposition methods to it.

Canonical PLS (PLSCanonical)

Creation of dataset:

In the first step we will import the NumPy library, which is essential for the numerical operations, set the random seed value for reproducibility of results, and create our dataset with a sample size of 600 data points, with 5 features and 3 target variables. X is then generated as a random normal matrix with n samples and p features, and Y is created to have a linear relationship with X (X multiplied by true_coef) plus some random noise.

Python3

import numpy as np
 
# Setting random seed value for reproducibility
np.random.seed(42)
 
# Generating synthetic data
n = 600  # Total number of samples
p = 5   # Total number of features in X
q = 3    # Total number of features in Y
 
# Creating independent variables
X = np.random.normal(size=n * p).reshape((n, p))
 
# Dependent variables with some relationship to X
true_coef = np.array([0.1, -0.2, 0.3, 0.4, 0.5])
Y = X @ true_coef[:p].reshape((p, 1)) + np.random.normal(size=n * q).reshape((n, q))

                    


Applying Canonical PLS to the Data:

After generating the data we will import the PLSCanonical method from the cross_decomposition module of sklearn. We will initialize PLSCanonical with 2 components and fit our synthetic dataset to the model. Once training is complete we will transform the data with the fitted model, so each transformed dataset contains only 2 latent features.

Python3

from sklearn.cross_decomposition import PLSCanonical
 
# Initializing and fitting the PLSCanonical model with 2 latent variables
pls_canonical = PLSCanonical(n_components=2)
pls_canonical.fit(X, Y)
 
# Transforming the datasets
X_c, Y_c = pls_canonical.transform(X, Y)

                    


Visualizing the Results:

For visualization we will first import Matplotlib. Then we will plot two subplots showing the relationship between the pairs of canonical (latent) variables: the first subplot plots the first latent variable of the X data matrix against the first latent variable of the Y data matrix, and the second subplot does the same for the second pair of latent variables. The call plt.subplot(1, 2, 1) divides the figure, which has size (10, 4), into 1 row and 2 columns of subplots and selects the first one. The correlation between each pair is shown in the corresponding subplot title.

Python3

import matplotlib.pyplot as plt
 
# Plot the first pair of canonical variates
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_c[:, 0], Y_c[:, 0], c='blue', label='First Canonical Variates')
plt.xlabel('First canonical variate of X')
plt.ylabel('First canonical variate of Y')
plt.title('Plot of the First Pair of Canonical Variates (corr = %.2f)'
          % np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1]
)
plt.legend()
 
# Plot the second pair of canonical variates
plt.subplot(1, 2, 2)
plt.scatter(X_c[:, 1], Y_c[:, 1], c='red', label='Second Canonical Variates')
plt.xlabel('Second canonical variate of X')
plt.ylabel('Second canonical variate of Y')
plt.title('Plot of the Second Pair of Canonical Variates (corr = %.2f)'
         % np.corrcoef(X_c[:, 1], Y_c[:, 1])[0, 1]
)
plt.legend()
 
plt.tight_layout()
plt.show()

                    

Output:

Correlation plots between pairs of canonical variates


From this plot we can observe that the correlation between the first pair of latent variables is fairly strong, whereas the second pair shows much weaker correlation. These are the maximum correlations between the pairs of latent variables that the PLSCanonical method could produce.

Now we will visualize the correlation within each dataset, i.e. we will plot the first latent variable of the feature matrix against the second latent variable of the same matrix, and do the same for the response matrix. This will help us check whether the PLSCanonical method was able to minimize or remove the correlation within the same data matrix.

Python3

# Plotting the latent variables of the feature data
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_c[:, 0], X_c[:, 1], c='yellow', label="Corr data of feature matrix")
plt.xlabel('First canonical variate of X')
plt.ylabel('Second canonical variate of X')
plt.title('Canonical Variates of the X Matrix (corr = %.2f)'
         % np.corrcoef(X_c[:, 0], X_c[:, 1])[0, 1]
)
plt.legend()
 
# Plotting the latent variables of the response data
plt.subplot(1, 2, 2)
plt.scatter(Y_c[:, 0], Y_c[:, 1], c='green', label="Corr data of response matrix")
plt.xlabel('First canonical variate of Y')
plt.ylabel('Second canonical variate of Y')
plt.title('Canonical Variates of the Y Matrix (corr = %.2f)'
         % np.corrcoef(Y_c[:, 0], Y_c[:, 1])[0, 1]
)
plt.legend()
 
plt.tight_layout()
plt.show()

                    

Output:

Correlation plots between latent variables of same dataset


With the help of this visualization we can see that there is practically no correlation between the latent variables of the same dataset, as the reported correlation is nearly zero. The PLSCanonical method therefore selected the latent variables so that the correlation within each pair (one component from X, one from Y) is maximized, while the correlation between latent variables of the same dataset is close to zero.

The two components created by PLSCanonical are useful because there is good correlation between the corresponding components of the two matrices and nearly zero correlation between components of the same matrix, so the transformed data matches the structure we want before training a model on it.
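To back up what the plots show with numbers, here is a minimal optional sketch (not part of the original walkthrough) that reuses the X_c and Y_c scores computed above and prints the cross-correlations between paired components as well as the within-matrix correlations:

Python3

# Correlation between paired latent variables of X and Y
# (expected to be high, especially for the first pair)
for i in range(2):
    corr = np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1]
    print(f"Corr(X component {i + 1}, Y component {i + 1}) = {corr:.3f}")

# Correlation between the two latent variables of the same matrix
# (expected to be near zero)
print(f"Corr within X scores = {np.corrcoef(X_c[:, 0], X_c[:, 1])[0, 1]:.3f}")
print(f"Corr within Y scores = {np.corrcoef(Y_c[:, 0], Y_c[:, 1])[0, 1]:.3f}")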

PLS Regression with Multivariate Response

Here we will first create a synthetic dataset and then apply PLSRegression to it to see how well the method performs on the data after cross decomposition.

First we will import NumPy for numerical operations, PLSRegression from sklearn's cross_decomposition module, and mean_squared_error from sklearn's metrics module. After that we will define 500 samples with 10 predictor variables, so X becomes a predictor matrix of size n x p (500 x 10 = 5000 values). A variable true_coefficients is defined, which will be linearly combined with X to produce Y. The first two entries of true_coefficients are 1.5 and -2.0 and the rest are zeros, which means only the first two features in X are relevant for predicting Y. Finally, Y is generated by the linear model Y = X · true_coefficients + noise, where the noise is drawn from a normal distribution.

Python3

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error
 
# Set random seed for reproducibility
np.random.seed(0)
 
# Generate some synthetic data
n = 500
p = 10
 
# Predictor variables
Xr = np.random.normal(size=n * p).reshape((n, p))
 
# Dependent variable with some noise
true_coefficients = np.array([1.5, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
Yr = np.dot(Xr, true_coefficients) + np.random.normal(size=n)

                    


Training and Testing the Model:

The PLSRegression model is instantiated with 2 latent variables and fitted on the data. Predictions are then made with the trained model, and the mean squared error is calculated to check how well the model performs.

Python3

# Number of latent variables
n_components = 2
 
# Creating and fitting the PLS regression model
pls = PLSRegression(n_components=n_components)
pls.fit(Xr, Yr)
 
# Prediction using the PLS regression model
Y_pred = pls.predict(Xr)
 
# Calculating the Mean Squared Error
mse = mean_squared_error(Yr, Y_pred)
print(f"Mean Squared Error: {mse}")

                    

Output:

Mean Squared Error: 1.0056612234303313
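Since the noise added to Yr is drawn from a standard normal distribution (variance 1), an MSE close to 1 suggests the model has captured essentially all of the systematic signal. As an optional extra check, not part of the original snippet, we can also compute the R² score with scikit-learn:

Python3

from sklearn.metrics import r2_score

# Coefficient of determination: values closer to 1 indicate a better fit
print(f"R^2 score: {r2_score(Yr, Y_pred):.3f}")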

We can now visualize the PLSRegression model with the given data to see how well the model fits the given data.

Visualization of Performance:

For visualization we will use Matplotlib (already imported above) to plot the values predicted by the model against the original (true) values. The closer the points lie to the diagonal, the better the model is performing.

Python3

# Plotting the true vs predicted values
plt.scatter(Yr, Y_pred)
plt.xlabel('True Values')
plt.ylabel('Predicted Values')
plt.title('PLS Regression: True vs Predicted')
plt.plot([Yr.min(), Yr.max()], [Yr.min(), Yr.max()], 'k--', lw=2)
plt.show()

                    

Output:

True value vs Predicted value by PLSRegression method

Here we can see that the predicted values match the true values quite well, so we can conclude that PLS regression is a good approach for handling a dataset with correlated features.
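As an additional check (not part of the original walkthrough), we can compare the coefficients recovered by the fitted model with true_coefficients; only the first two should be noticeably different from zero. Note that pls.coef_ may be stored as (n_features, n_targets) or the transpose depending on the scikit-learn version, so we flatten it before printing:

Python3

# Comparing the estimated coefficients with the true ones;
# coef_ is flattened because its orientation differs across scikit-learn versions
estimated = pls.coef_.ravel()
for true_c, est_c in zip(true_coefficients, estimated):
    print(f"true: {true_c:5.2f}   estimated: {est_c:6.3f}")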

CCA (PLS mode B with symmetric deflation)

In this section we will be using the same data used in the above PLSCanonical example and we will be applying Canonical Correlation Analysis on it.

First we will import the CCA method from the cross_decomposition module of sklearn and initialize the CCA model with 2 components, so the final data will be reduced to two components. After assigning the CCA model to a variable we will train it on the data, and finally we will transform the dataset with the trained model.

Python3

from sklearn.cross_decomposition import CCA
 
cca = CCA(n_components=2)
cca.fit(X, Y)
X_cca, Y_cca = cca.transform(X, Y)

                    


After transforming the data we can visualize it with the help of Matplotlib. We will draw two subplots on a single figure, one for each pair of canonical variables of the X and Y datasets, which will help us understand the relationship between the new components created by the CCA method. The first subplot (subplot(1, 2, 1)) plots the first canonical component of X against the first canonical component of Y, and the second subplot (subplot(1, 2, 2)) plots the second canonical component of X against the second canonical component of Y.

Python3

import matplotlib.pyplot as plt
 
# Plot the results
plt.figure(figsize=(12, 6))
 
# Plot for the 1st pair of canonical variables
plt.subplot(1, 2, 1)
plt.scatter(X_cca[:, 0], Y_cca[:, 0], color='blue', label='First Latent Var')
plt.xlabel('X canonical variable 1')
plt.ylabel('Y canonical variable 1')
plt.title('First pair of canonical variables (corr = %.2f)'
         % np.corrcoef(X_cca[:, 0], Y_cca[:, 0])[0, 1])
plt.legend()
 
# Plot for the 2nd pair of canonical variables
plt.subplot(1, 2, 2)
plt.scatter(X_cca[:, 1], Y_cca[:, 1], color='red', label='Second Latent Var')
plt.xlabel('X canonical variable 2')
plt.ylabel('Y canonical variable 2')
plt.title('Second pair of canonical variables (corr = %.2f)'
         % np.corrcoef(X_cca[:, 1], Y_cca[:, 1])[0, 1])
plt.legend()
 
plt.tight_layout()
plt.show()

                    

Output:

Correlation between pair of canonical variables through CCA method


Here we can see that the transformed data again shows a clear correlation pattern between the paired components, which confirms that the CCA model is well suited to the analysis and dimensionality reduction of a correlated dataset. For this data, the CCA model and PLSCanonical produced very similar results.
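To make this comparison a bit more concrete, here is a small optional sketch (not part of the original walkthrough) that reuses X_c from the PLSCanonical example and X_cca from above and prints the correlation between the scores produced by the two models; the sign of a component can flip between methods, so we look at the absolute value:

Python3

# Comparing the X-scores found by PLSCanonical (X_c) and CCA (X_cca)
for i in range(2):
    corr = np.corrcoef(X_c[:, i], X_cca[:, i])[0, 1]
    print(f"|corr| between PLSCanonical and CCA X-scores, component {i + 1}: {abs(corr):.3f}")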
