Stepwise Guide to How Decision Trees Handle Multicollinearity

  1. Generating Synthetic Data:
    • np.random.rand(100, 1) * 10 generates 100 random values in the range [0, 10), which serve as the single feature X.
    • np.sin(X) defines the target variable y as a sine function of X.
    • np.random.normal(0, 0.1, size=(100, 1)) adds Gaussian noise (mean 0, standard deviation 0.1) to y to make the relationship more realistic.
  2. Splitting the Dataset:
    • train_test_split divides the data into training and test sets, with 80% of the data used for training and 20% for testing.
  3. Computing the Correlation:
    • np.corrcoef computes the correlation matrix between X and y, which is then printed.
  4. Linear Regression Model:
    • We fit a Linear Regression model to the training data (lr.fit(X_train, y_train)) and make predictions on the test data (lr.predict(X_test)).
    • We calculate the Mean Squared Error (MSE) between the predicted and actual test values using mean_squared_error.
  5. Decision Tree Regression Model:
    • We fit a Decision Tree Regression model to the training data (dtr.fit(X_train, y_train)) and make predictions on the test data (dtr.predict(X_test)).
    • We calculate the MSE for the tree's predictions in the same way.
  6. Printing Results:
    • We print the MSE of both models to compare their performance.
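
The Decision Tree Regression step works because a regression tree predicts one constant value per leaf, so its fitted curve is a staircase that can follow a non-linear target. The following is a minimal sketch of that idea, not part of the original walkthrough: it reuses the data recipe from step 1, while the seed, the max_depth=2 setting, and the names shallow_tree, grid, and step_values are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Same data-generating recipe as step 1 (the seed here is an arbitrary choice)
rng = np.random.RandomState(0)
X = rng.rand(100, 1) * 10
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=100)

# A depth-2 tree has at most 2**2 = 4 leaves, hence at most 4 distinct predictions
shallow_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
shallow_tree.fit(X, y)

# Evaluate the tree on a dense grid to expose its piecewise-constant shape
grid = np.linspace(0, 10, 200).reshape(-1, 1)
step_values = np.unique(shallow_tree.predict(grid))
print("Number of leaves:", shallow_tree.get_n_leaves())
print("Distinct predicted values:", len(step_values))
```

Leaving max_depth unset, as the walkthrough does, lets the tree grow many more leaves, so the staircase becomes fine enough to track the sine curve closely.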

Python3




import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
 
# Create a synthetic dataset with a non-linear relationship
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = np.sin(X) + np.random.normal(0, 0.1, size=(100, 1))
 
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
 
# Calculate the correlation matrix between X and y
corr_matrix = np.corrcoef(X.squeeze(), y.squeeze())
 
# Display the correlation matrix
print("Correlation Matrix between X and y:")
print(corr_matrix)


Output:

Correlation Matrix between X and y:
[[ 1.         -0.94444709]
 [-0.94444709  1.        ]]
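
When reading a correlation matrix like this, it helps to remember that np.corrcoef returns the Pearson coefficient, which measures only linear association. The sketch below is an illustrative aside rather than part of the original example: it uses two made-up targets (y_quadratic and y_linear are hypothetical names) to show that a target fully determined by x can still have a Pearson coefficient near zero when the dependence is not linear.

```python
import numpy as np

# Inputs symmetric around zero
x = np.linspace(-1, 1, 201)

# Both targets are exact functions of x; only the second one is linear
y_quadratic = x ** 2
y_linear = 3 * x + 1

r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]
r_linear = np.corrcoef(x, y_linear)[0, 1]

print("Pearson r for y = x^2:   ", round(r_quadratic, 3))
print("Pearson r for y = 3x + 1:", round(r_linear, 3))
```

A low coefficient therefore does not rule out a strong relationship, which is one reason to compare a linear model against a tree rather than rely on the correlation alone.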

Fitting Linear Regression and Decision Tree Models for Comparison

Python3




# Fit a Linear Regression model
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
lr_mse = mean_squared_error(y_test, lr_pred)
 
# Fit a Decision Tree Regression model
dtr = DecisionTreeRegressor(random_state=42)
dtr.fit(X_train, y_train)
dtr_pred = dtr.predict(X_test)
dtr_mse = mean_squared_error(y_test, dtr_pred)
 
print("Linear Regression MSE:", lr_mse)
print("Decision Tree Regression MSE:", dtr_mse)


Output:

Linear Regression MSE: 0.4352358901582881
Decision Tree Regression MSE: 0.01578187903036423

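A lower MSE indicates a better fit of the model to the test data. In this example, the Decision Tree Regression model's MSE (about 0.016) is far lower than the Linear Regression model's (about 0.435). A straight line cannot follow the sine-shaped relationship between X and y, whereas the tree approximates it with many piecewise-constant segments, so the decision tree performs better on this dataset.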
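
The dataset above has a single feature, so a natural follow-up is to see what happens when a second, highly correlated feature is added. The sketch below is an assumption-laden illustration, not the article's own experiment: x2 is a hypothetical near-copy of x1, the sample size and noise levels are arbitrary, and the names tree_single, tree_both, and importances are invented for this example. It trains one tree on x1 alone and one on both columns, then prints the test MSEs and how the tree distributes feature importance across the correlated pair.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)
x1 = rng.rand(200) * 10
# Hypothetical second feature: x1 plus tiny noise, so the two are nearly collinear
x2 = x1 + rng.normal(0, 0.01, size=200)
y = np.sin(x1) + rng.normal(0, 0.1, size=200)

X_both = np.column_stack([x1, x2])
feature_corr = np.corrcoef(x1, x2)[0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X_both, y, test_size=0.2, random_state=42
)

# Tree trained on x1 only
tree_single = DecisionTreeRegressor(random_state=42)
tree_single.fit(X_train[:, [0]], y_train)
mse_single = mean_squared_error(y_test, tree_single.predict(X_test[:, [0]]))

# Tree trained on both correlated features
tree_both = DecisionTreeRegressor(random_state=42)
tree_both.fit(X_train, y_train)
mse_both = mean_squared_error(y_test, tree_both.predict(X_test))

# Importances sum to 1; how they are shared between x1 and x2 depends on
# which column happens to win each split
importances = tree_both.feature_importances_

print("Correlation between x1 and x2:", feature_corr)
print("Tree MSE with x1 only:", mse_single)
print("Tree MSE with x1 and x2:", mse_both)
print("Feature importances (x1, x2):", importances)
```

One would expect the two MSE values to be close, since each split uses only one feature at a time and either column carries nearly the same information. The importance scores are the part to treat with caution: with correlated features, credit can be divided between them in ways that say little about which one truly drives the target.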



Solving the Multicollinearity Problem with Decision Tree

Multicollinearity is a common issue in data science, affecting various types of models, including decision trees. This article explores what multicollinearity is, why it’s problematic for decision trees, and how to address it.
