Stepwise Guide to How Decision Trees Handle Multicollinearity

Generating Synthetic Data:
- We use np.random.rand(100, 1) * 10 to generate 100 random numbers between 0 and 10, which serve as our feature X.
- We use np.sin(X) to create the target variable y as a sine wave of the feature X.
- We add some random noise using np.random.normal(0, 0.1, size=(100, 1)) to make the relationship more realistic.
Splitting the Dataset:
- We split the dataset into training and test sets using train_test_split, with 80% of the data used for training and 20% for testing.
Linear Regression Model:
- We fit a Linear Regression model (lr.fit(X_train, y_train)) to the training data and make predictions on the test data (lr.predict(X_test)).
- We calculate the Mean Squared Error (MSE) between the predicted and actual values using mean_squared_error.
Decision Tree Regression Model:
- We fit a Decision Tree Regression model (dtr.fit(X_train, y_train)) to the training data and make predictions on the test data (dtr.predict(X_test)).
- We calculate the Mean Squared Error (MSE) between the predicted and actual values using mean_squared_error.
Printing Results:
- We print the MSE for both the Linear Regression and Decision Tree Regression models to compare their performance.

Python3

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
 
# Create a synthetic dataset with a non-linear relationship
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = np.sin(X) + np.random.normal(0, 0.1, size=(100, 1))
 
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
 
# Calculate the correlation matrix between X and y
corr_matrix = np.corrcoef(X.squeeze(), y.squeeze())
 
# Display the correlation matrix
print("Correlation Matrix between X and y:")
print(corr_matrix)

Output:

Correlation Matrix between X and y:
[[ 1.         -0.94444709]
 [-0.94444709  1.        ]]
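
The matrix above relates the feature X to the target y. Multicollinearity, by contrast, concerns correlation among the features themselves. As a minimal sketch, assuming three hypothetical features x1, x2 and x3 that are not part of the dataset above (x2 is a noisy copy of x1, x3 is independent), feature-to-feature correlation and a variance inflation factor (VIF) could be checked as follows. The helper vif is an illustrative function written here, not a scikit-learn API.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features for illustration only (not the article's dataset)
rng = np.random.default_rng(42)
x1 = rng.random(100) * 10                 # values in [0, 10)
x2 = x1 + rng.normal(0, 0.1, size=100)    # noisy copy of x1, so strongly collinear
x3 = rng.random(100) * 10                 # drawn independently of x1 and x2
features = np.column_stack([x1, x2, x3])

# Pairwise correlation between features (each column is one variable)
feature_corr = np.corrcoef(features, rowvar=False)

def vif(data, j):
    """VIF of column j: 1 / (1 - R^2), with R^2 from regressing column j on the other columns."""
    others = np.delete(data, j, axis=1)
    r2 = LinearRegression().fit(others, data[:, j]).score(others, data[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif(features, j) for j in range(features.shape[1])]

print("Feature correlation matrix:")
print(feature_corr)
print("VIF per feature:", vifs)
```

A correlation close to 1 between two features, or a VIF well above the commonly used rule-of-thumb thresholds of 5 or 10, suggests that the features carry largely the same information.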

Fitting Linear Regression and Decision Tree Models for Comparison

Python3

# Fit a Linear Regression model
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
lr_mse = mean_squared_error(y_test, lr_pred)
 
# Fit a Decision Tree Regression model
dtr = DecisionTreeRegressor(random_state=42)
dtr.fit(X_train, y_train)
dtr_pred = dtr.predict(X_test)
dtr_mse = mean_squared_error(y_test, dtr_pred)
 
print("Linear Regression MSE:", lr_mse)
print("Decision Tree Regression MSE:", dtr_mse)

Output:

Linear Regression MSE: 0.4352358901582881
Decision Tree Regression MSE: 0.01578187903036423

A lower MSE indicates a better fit to the test data. In this example, the Decision Tree Regression model has a much lower MSE (about 0.016) than the Linear Regression model (about 0.435). The tree can approximate the non-linear sine relationship with piecewise-constant predictions, whereas a straight line cannot follow the curve.
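
The comparison above uses a single feature, so it does not yet involve collinear inputs. One way to probe that case is sketched below, assuming a hypothetical second feature X2 that is a near-duplicate of the first (it is not part of the original example). The sketch fits the same DecisionTreeRegressor with and without X2 and prints the test MSE and the feature importances; tree_test_mse is an illustrative helper, and the exact numbers depend on the random seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X1 = rng.random((100, 1)) * 10
# Hypothetical near-duplicate of X1, added only to create multicollinearity
X2 = X1 + rng.normal(0, 0.01, size=(100, 1))
y = (np.sin(X1) + rng.normal(0, 0.1, size=(100, 1))).ravel()

X_single = X1
X_collinear = np.hstack([X1, X2])

def tree_test_mse(features):
    """Fit a decision tree on an 80/20 split and return (test MSE, fitted model)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y, test_size=0.2, random_state=42
    )
    model = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te)), model

mse_single, _ = tree_test_mse(X_single)
mse_collinear, tree = tree_test_mse(X_collinear)
feature_corr = np.corrcoef(X1.ravel(), X2.ravel())[0, 1]

print("Correlation between X1 and X2:", feature_corr)
print("Tree MSE with X1 only:", mse_single)
print("Tree MSE with X1 and X2:", mse_collinear)
print("Feature importances (X1, X2):", tree.feature_importances_)
```

Because a tree uses one feature per split, two near-duplicate features tend to compete for the same splits. Predictions typically change little, but the importance may be divided between the two features somewhat arbitrarily, so importances are best read with care when features are collinear.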

Solving the Multicollinearity Problem with Decision Tree

Multicollinearity is a common issue in data science, affecting various types of models, including decision trees. This article explores what multicollinearity is, why it’s problematic for decision trees, and how to address it.

