Understanding Cross Decomposition in Machine Learning

In real-world datasets, some features are often highly correlated with each other. Applying ordinary regression methods to such data is not effective, since multicollinearity makes the coefficient estimates highly sensitive to small changes in the model. In this article, we will be diving deep into cross decomposition and seeing how it offers practical solutions to problems like multicollinearity in the data.

What is Cross Decomposition?

Cross decomposition is a supervised machine learning technique used in multivariate data analysis, in which the dataset is decomposed into matrices of components that capture different aspects of the data. The data might contain features that are correlated with each other, and fitting all the original features into a standard modelling technique is not an efficient way to model such data. Cross decomposition helps reduce the dimensionality of such datasets without losing the insights of the original data. It projects the input features and the targets into component matrices and then models the relationship between the two. When it compresses the input features into a smaller component matrix, it reduces the number of features in much the same way as principal component regression (PCR); the difference is that PCR extracts its components in an unsupervised way, ignoring the targets, whereas cross decomposition extracts its components using the target values as well. In the next section we will be discussing the components of cross decomposition, what it actually comprises, and how it is useful for modelling complex datasets.
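To make the idea concrete, here is a minimal sketch of supervised dimensionality reduction with a cross decomposition estimator. The toy data, variable names and choice of 2 components are illustrative assumptions, not the dataset used later in this article.

Python3

import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Illustrative toy data: 3 of the 6 features are near-duplicates of the others
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base + 0.01 * rng.normal(size=(200, 3))])  # correlated columns
Y = base @ np.array([[1.0], [-2.0], [0.5]]) + 0.1 * rng.normal(size=(200, 1))

# Supervised decomposition: components are chosen using Y, unlike PCA/PCR
pls = PLSRegression(n_components=2).fit(X, Y)
X_scores, Y_scores = pls.transform(X, Y)
print(X_scores.shape, Y_scores.shape)  # (200, 2) (200, 2) -- reduced representations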

Different Components of Cross Decomposition

Two of the most popular methods of cross decomposition are Partial Least Squares (PLS) and Canonical Correlation Analysis (CCA), both of which have several important applications in multivariate data analysis.

Partial Least Squares:

PLS is a cross decomposition method used for predictive modelling and for finding the relationships between two matrices of data. It is useful when the data is multicollinear or when the number of features exceeds the number of observations. The working principle of PLS is to find the best-fitting linear regression model by projecting the features and the targets into a new space: it extracts new components of the predictors, called latent variables, that have the best predictive power for the target variables.

How Does PLS Work:

  • PLS starts by standardizing both the predictor and target matrices, which puts the variables on comparable scales.
  • It then converts the predictor and target matrices into latent variables that are uncorrelated within the same set while retaining the information of the original set. These latent variables are chosen so that the corresponding predictor and target components have maximal covariance with each other.
  • The model is thus constructed in a new space with fewer variables, in which the latent variables of the same set are uncorrelated with each other.
  • Finally, linear regression is performed in this new space of latent variables (see the sketch after this list).
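The following is a minimal sketch of this workflow, assuming a generic feature matrix X and target matrix Y; the random toy data and the choice of 2 components are illustrative only.

Python3

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
Y = rng.normal(size=(100, 2))

# Step 1: standardize both matrices (sklearn's PLS also centers/scales internally)
X_std = StandardScaler().fit_transform(X)
Y_std = StandardScaler().fit_transform(Y)

# Steps 2-3: extract latent variables that maximize the X/Y covariance
pls = PLSRegression(n_components=2).fit(X_std, Y_std)
X_latent, Y_latent = pls.transform(X_std, Y_std)

# Step 4: the regression happens in the latent space; predictions are
# mapped back to the original Y space by the fitted model
Y_pred = pls.predict(X_std)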

The three main algorithms under the Partial Least Squares method are PLSCanonical, PLSSVD, and PLSRegression. All of them work in a similar way, but each has its own advantages and disadvantages (a short instantiation sketch follows the list below). Let’s discuss them one by one:

  • PLSCanonical Algorithm: PLSCanonical is the most commonly used PLS algorithm; it is used to find the relationship between two different matrices of data (categorical variables need to be numerically encoded first). It extracts latent variables from the predictor and target matrices and maximises the covariance between them.
  • Partial Least Squares Singular Value Decomposition (PLSSVD): PLSSVD is a more specialised algorithm that combines Partial Least Squares with Singular Value Decomposition; it works on numerical data and simply computes the SVD of the cross-covariance matrix between the two matrices.
  • PLSRegression: PLSRegression is a regression algorithm that can be used to predict one or more continuous target variables from a set of predictor variables.
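As a quick orientation, here is a minimal sketch showing how the three estimators are instantiated and used from scikit-learn's cross_decomposition module; the toy matrices and n_components=2 are just placeholders.

Python3

import numpy as np
from sklearn.cross_decomposition import PLSCanonical, PLSSVD, PLSRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
Y = rng.normal(size=(50, 3))

# PLSCanonical and PLSSVD model the symmetric relationship between X and Y
X_c, Y_c = PLSCanonical(n_components=2).fit_transform(X, Y)
X_s, Y_s = PLSSVD(n_components=2).fit_transform(X, Y)

# PLSRegression is asymmetric: X is used to predict Y
Y_hat = PLSRegression(n_components=2).fit(X, Y).predict(X)
print(X_c.shape, X_s.shape, Y_hat.shape)  # (50, 2) (50, 2) (50, 3)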


Canonical Correlation Analysis (CCA):

Canonical Correlation Analysis is similar to the PLSCanonical algorithm, but CCA finds the projections that maximize the correlation (rather than the covariance) between linear combinations of variables. Suppose the feature matrix is X with variables X1, X2, ..., Xp and the target matrix is Y with variables Y1, Y2, ..., Yq. CCA then finds the linear combinations:

Canonical Variate of X: U = a1*X1 + a2*X2 + ... + ap*Xp

Canonical Variate of Y: V = b1*Y1 + b2*Y2 + ... + bq*Yq

The weights a1, ..., ap and b1, ..., bq are chosen such that the correlation between the canonical variate of X and the canonical variate of Y is maximized.
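As a small numeric sketch of this idea, we can fit sklearn's CCA on illustrative matrices; the transformed columns are exactly the canonical variates U and V. The toy data below is an assumption used only for demonstration.

Python3

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))                                       # variables X1..X4
Y = X[:, :2] @ rng.normal(size=(2, 3)) + rng.normal(size=(100, 3))  # variables Y1..Y3

cca = CCA(n_components=1).fit(X, Y)
U, V = cca.transform(X, Y)  # first canonical variates of X and Y

# U and V are the linear combinations a1*X1 + ... and b1*Y1 + ...
# whose correlation is maximized:
print("canonical correlation:", np.corrcoef(U[:, 0], V[:, 0])[0, 1])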

Comparison Between Different Cross Decomposition Methods

Let’s look at the code implementation of the methods:

First, we will be creating a dataset for analysis purposes and after that we will be applying cross decomposition methods on the data.

PLSCanonical

Creation of dataset:

In the first step we will be importing the numpy library, which is essential for the numerical operations; then we will be setting the random seed value for reproducibility and creating our dataset with a sample size of 600 data points, 5 features and 3 target variables. After these steps X is generated as a random normal matrix with n samples and p features, and Y is created to have a linear relationship with X (X multiplied by true_coef) plus some random noise.

Python3

import numpy as np
 
# Setting random seed value for reproducibility
np.random.seed(42)
 
# Generating synthetic data
n = 600  # Total number of samples
p = 5   # Total number of features in X
q = 3    # Total number of features in Y
 
# Creating independent variables
X = np.random.normal(size=n * p).reshape((n, p))
 
# Dependent variables: each of the q target columns shares the same linear
# relationship with X (the X @ true_coef column is broadcast) plus independent noise
true_coef = np.array([0.1, -0.2, 0.3, 0.4, 0.5])
Y = X @ true_coef[:p].reshape((p, 1)) + np.random.normal(size=n * q).reshape((n, q))

                    


Applying Canonical PLS to the Data:

After generating the data we will be importing the PLSCanonical class from the cross_decomposition module of sklearn. We will be initializing PLSCanonical with 2 components and then fitting our synthetic dataset to the model for training. Once training is complete we will be transforming the data with the fitted model, so the transformed datasets are left with only 2 latent features each.

Python3

from sklearn.cross_decomposition import PLSCanonical
 
# Initializing and fitting the PLSCanonical model with 2 latent variables
pls_canonical = PLSCanonical(n_components=2)
pls_canonical.fit(X, Y)
 
# Transforming the datasets
X_c, Y_c = pls_canonical.transform(X, Y)

                    


Visualizing the Results:

For visualization purposes we first import matplotlib. We then draw two subplots showing the relationship between the pairs of canonical (latent) variables: the first subplot plots the first latent variable of the X data matrix (X_c[:, 0]) against the first latent variable of the Y data matrix (Y_c[:, 0]), and the second subplot does the same for the second pair (X_c[:, 1] vs Y_c[:, 1]). The call plt.subplot(1, 2, 1) signifies that the figure is divided into 1 row and 2 columns of subplots and that this subplot occupies the first position in the overall figure, which is of size (10 x 4). The correlation between the variates is shown in the title of each subplot.

Python3

import matplotlib.pyplot as plt
 
# Plot the first pair of canonical variates
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_c[:, 0], Y_c[:, 0], c='blue', label='First Canonical Variates')
plt.xlabel('First canonical variate of X')
plt.ylabel('First canonical variate of Y')
plt.title('Plot of the First Pair of Canonical Variates (corr = %.2f)'
          % np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1]
)
plt.legend()
 
# Plot the second pair of canonical variates
plt.subplot(1, 2, 2)
plt.scatter(X_c[:, 1], Y_c[:, 1], c='red', label='Second Canonical Variates')
plt.xlabel('Second canonical variate of X')
plt.ylabel('Second canonical variate of Y')
plt.title('Plot of the Second Pair of Canonical Variates (corr = %.2f)'
         % np.corrcoef(X_c[:, 1], Y_c[:, 1])[0, 1]
)
plt.legend()
 
plt.tight_layout()
plt.show()

                    

Output:

Correlation plots between pairs of canonical variates


From this plot we can observe that the correlation between the first pair of latent variables is fairly strong, whereas the second pair shows much weaker correlation. This is the maximized correlation between the pairs of latent variables that the PLSCanonical method could produce.

Now we will be visualizing the correlation within each dataset: we plot the first latent variable of the feature matrix against its second latent variable, and do the same for the response matrix. This will help us check whether the PLSCanonical method was able to minimize or remove the correlation between latent variables belonging to the same data matrix.

Python3

# Plotting the latent variables of the feature data
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_c[:, 0], X_c[:, 1], c='yellow', label="Corr data of feature matrix")
plt.xlabel('First canonical variate of X')
plt.ylabel('Second canonical variate of X')
plt.title('Canonical Variates of the X Matrix (corr = %.2f)'
         % np.corrcoef(X_c[:, 0], X_c[:, 1])[0, 1]
)
plt.legend()
 
# Plotting the latent variables of the response data
plt.subplot(1, 2, 2)
plt.scatter(Y_c[:, 0], Y_c[:, 1], c='green', label="Corr data of response matrix")
plt.xlabel('First canonical variate of Y')
plt.ylabel('Second canonical variate of Y')
plt.title('Canonical Variates of the Y Matrix (corr = %.2f)'
         % np.corrcoef(Y_c[:, 0], Y_c[:, 1])[0, 1]
)
plt.legend()
 
plt.tight_layout()
plt.show()

                    

Output:

Correlation plots between latent variables of same dataset


With the help of this visualization we can see that there is practically no correlation between the latent variables of the same dataset, as the correlations shown in the titles are nearly zero. The PLSCanonical method therefore selected the latent variables in such a way that the correlation between paired latent variables across the two matrices is maximized while the correlation within the same dataset is close to zero.

The two components created by PLSCanonical are therefore useful: there is good correlation between the corresponding components of the two matrices and nearly zero correlation between components of the same matrix, which is exactly the condition we want before model training. The numbers behind the plots can also be printed directly, as shown below.
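A small sketch to confirm both observations numerically; it reuses the X_c and Y_c scores computed above.

Python3

# Between-set correlations: high for the first pair, lower for the second
for i in range(2):
    r = np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1]
    print(f"corr(X_c[:, {i}], Y_c[:, {i}]) = {r:.2f}")

# Within-set correlations: should be close to zero
print("corr within X scores:", np.corrcoef(X_c[:, 0], X_c[:, 1])[0, 1].round(2))
print("corr within Y scores:", np.corrcoef(Y_c[:, 0], Y_c[:, 1])[0, 1].round(2))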

PLS Regression with Multivariate Response

Here we will first create a synthetic dataset and then apply PLSRegression on it to find out how the model performs after cross decomposition.

First we will be importing numpy for numerical operations, PLSRegression from sklearn's cross_decomposition module, and mean_squared_error from sklearn's metrics module. After that we define 500 samples (n) with 10 predictor variables (p), so Xr becomes a predictor matrix of shape n x p (500 x 10). A vector true_coefficients is defined which will be linearly combined with Xr to produce Yr; its first two entries are [1.5, -2.0] and the rest are zeros, which means only the first two features in Xr are relevant for predicting Yr. Finally, Yr is generated by the linear model Yr = Xr · true_coefficients + noise, where the noise is drawn from a standard normal distribution.

Python3

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error
 
# Set random seed for reproducibility
np.random.seed(0)
 
# Generate some synthetic data
n = 500
p = 10
 
# Predictor variables
Xr = np.random.normal(size=n * p).reshape((n, p))
 
# Dependent variable with some noise
true_coefficients = np.array([1.5, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
Yr = np.dot(Xr, true_coefficients) + np.random.normal(size=n)

                    


Training and Testing the Model:

The PLSRegression model is instantiated with 2 latent variables and fitted to the data for training. After that, predictions are made with the trained model and the mean squared error is calculated to check how well the model performs.

Python3

# Number of latent variables
n_components = 2
 
# Creating and fitting the PLS regression model
pls = PLSRegression(n_components=n_components)
pls.fit(Xr, Yr)
 
# Prediction using the PLS regression model
Y_pred = pls.predict(Xr)
 
# Calculating the Mean Squared Error
mse = mean_squared_error(Yr, Y_pred)
print(f"Mean Squared Error: {mse}")

                    

Output:

Mean Squared Error: 1.0056612234303313

We can now visualize the predictions of the PLSRegression model to see how well it fits the given data.

Visualization of Performance:

For visualization purposes we will be reusing Matplotlib, which was already imported as plt earlier, and plotting the values predicted by the model against the original (true) values.

The closer the data points lie to the diagonal line in the plot, the better the model is performing.

Python3

# Plotting the true vs predicted values
plt.scatter(Yr, Y_pred)
plt.xlabel('True Values')
plt.ylabel('Predicted Values')
plt.title('PLS Regression: True vs Predicted')
plt.plot([Yr.min(), Yr.max()], [Yr.min(), Yr.max()], 'k--', lw=2)
plt.show()

                    

Output:

True value vs Predicted value by PLSRegression method

Here we can see that the predicted and the true values match quite well, so we can conclude that this was a good approach for handling a correlated dataset. As a further check, the estimated coefficients can be compared with the true ones, as sketched below.
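A quick optional sketch comparing the coefficients recovered by the fitted model with true_coefficients; the .ravel() call simply flattens the coefficient array so the two vectors line up regardless of its orientation.

Python3

# Comparing the fitted coefficients with the true generating coefficients
print("True coefficients:     ", true_coefficients)
print("Estimated coefficients:", pls.coef_.ravel().round(2))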

CCA (PLS mode B with symmetric deflation)

In this section we will be using the same data used in the above PLSCanonical example and we will be applying Canonical Correlation Analysis on it.

First we will be importing the CCA class from the cross_decomposition module of sklearn and initializing the CCA model with 2 components, so the final data will be reduced to two components. After assigning the CCA model to a variable we will be training it on the data (X, Y) and finally transforming the datasets with the trained model.

Python3

from sklearn.cross_decomposition import CCA
 
cca = CCA(n_components=2)
cca.fit(X, Y)
X_cca, Y_cca = cca.transform(X, Y)

                    


After transforming the data we can visualize it with the help of matplotlib. We will be plotting two subplots in a single figure, one for each pair of canonical variables of the X and Y datasets, which will help us understand the relationship between the new components created by the CCA method. The first plot (subplot(1, 2, 1)) shows the first canonical component of X vs the first canonical component of Y, and in the same way the second plot (subplot(1, 2, 2)) shows the second canonical component of X vs the second canonical component of Y.

Python3

import matplotlib.pyplot as plt
 
# Plot the results
plt.figure(figsize=(12, 6))
 
# Plot for the 1st pair of canonical variables
plt.subplot(1, 2, 1)
plt.scatter(X_cca[:, 0], Y_cca[:, 0], color='blue', label='First Latent Var')
plt.xlabel('X canonical variable 1')
plt.ylabel('Y canonical variable 1')
plt.title('First pair of canonical variables (corr = %.2f)'
         % np.corrcoef(X_cca[:, 0], Y_cca[:, 0])[0, 1])
plt.legend()
 
# Plot for the 2nd pair of canonical variables
plt.subplot(1, 2, 2)
plt.scatter(X_cca[:, 1], Y_cca[:, 1], color='red', label='Second Latent Var')
plt.xlabel('X canonical variable 2')
plt.ylabel('Y canonical variable 2')
plt.title('Second pair of canonical variables (corr = %.2f)'
         % np.corrcoef(X_cca[:, 1], Y_cca[:, 1])[0, 1])
plt.legend()
 
plt.tight_layout()
plt.show()

                    

Output:

Correlation between pair of canonical variables through CCA method


Here we can see that the canonical variates show a clearly correlated pattern, which indicates that the CCA model is well suited for the analysis and dimensionality reduction of a correlated dataset. The CCA model and PLSCanonical produced similar results for the same dataset, which can also be verified numerically as sketched below.
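A short sketch comparing the first-pair correlations obtained by the two methods; it reuses X_c, Y_c from PLSCanonical and X_cca, Y_cca from CCA above.

Python3

# Correlation of the first pair of variates under each method
corr_pls = np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1]
corr_cca = np.corrcoef(X_cca[:, 0], Y_cca[:, 0])[0, 1]
print(f"PLSCanonical first-pair correlation: {corr_pls:.2f}")
print(f"CCA first-pair correlation:          {corr_cca:.2f}")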

Advantages and Disadvantages of Cross Decomposition

There are many advantages of using cross decomposition, but alongside them we must also consider the drawbacks. Here are some of the advantages and disadvantages of using cross decomposition.

Advantages

  • Handling Multicollinearity: Cross decomposition is a great tool to use when the predictor dataset is full of correlated features; collinearity is one of the main factors to address in order to build an efficient model.
  • Dimensionality Reduction: Cross decomposition reduces the original dataset to a set of latent variables that have far fewer dimensions than the original data yet still capture its insights well.
  • Multivariate Relationships: Cross decomposition efficiently handles complex datasets with multiple sets of variables, for example several target variables modelled at once.
  • Robustness to Imperfect Data: Cross decomposition methods cope reasonably well with noisy data and outliers, since the latent variables capture the dominant shared patterns; with suitable preprocessing (for example, imputation) they can also be applied to data with missing values.

Disadvantages

  • Overfitting: In situations where the number of features is greater than the number of observations, the PLS model may become prone to overfitting; therefore, we must carefully carry out proper model validation.
  • Assumptions: These methods assume a linear relationship between the data matrices; in the case of non-linear relationships, cross decomposition might not be the best choice.
  • Algorithm Sensitivity: The performance of cross decomposition algorithms can be sensitive to hyperparameters such as the number of components. A suitable number of components can be selected through cross validation (see the sketch after this list), but this adds complexity to the modelling process.
  • Computational Complexity: As the number of dimensions, features or target variables increases, the algorithms can become computationally intensive.
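As an illustration of the model-selection point above, here is a minimal sketch of choosing n_components by cross-validation with GridSearchCV; it reuses Xr and Yr from the PLSRegression example, and the search range is an arbitrary choice.

Python3

from sklearn.model_selection import GridSearchCV
from sklearn.cross_decomposition import PLSRegression

# Searching over the number of latent components with 5-fold cross-validation
param_grid = {"n_components": [1, 2, 3, 4, 5]}
search = GridSearchCV(PLSRegression(), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(Xr, Yr)

print("Best number of components:", search.best_params_["n_components"])
print("Best CV MSE:", -search.best_score_)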

Conclusion

In conclusion, cross decomposition is a powerful tool that reduces the dimensionality of the data using supervised learning algorithms, helps solve the problem of multicollinearity in the dataset, and can handle data with a large number of variables and relatively few observations. We must also be aware of the drawbacks that come with it, such as the possibility of overfitting, and we must choose the number of components carefully so that the cross decomposition model stays aligned with our goals.


