Stepwise Guide of how Decision Trees Handle Multicollinearity
- Generating Synthetic Data:
  - We use `np.random.rand(100, 1) * 10` to generate 100 random numbers between 0 and 10, which serves as our feature `X`.
  - We use `np.sin(X)` to create the target variable `y` as a sine wave of the feature `X`.
  - We add some random noise using `np.random.normal(0, 0.1, size=(100, 1))` to make the relationship more realistic.
- Splitting the Dataset:
  - We split the dataset into training and test sets using `train_test_split`, with 80% of the data used for training and 20% for testing.
- Linear Regression Model:
  - We fit a Linear Regression model (`lr.fit(X_train, y_train)`) to the training data and make predictions on the test data (`lr.predict(X_test)`).
  - We calculate the Mean Squared Error (MSE) between the predicted and actual values using `mean_squared_error`.
- Decision Tree Regression Model:
  - We fit a Decision Tree Regression model (`dtr.fit(X_train, y_train)`) to the training data and make predictions on the test data (`dtr.predict(X_test)`).
  - We calculate the Mean Squared Error (MSE) between the predicted and actual values using `mean_squared_error`.
- Printing Results:
  - We print the MSE for both the Linear Regression and Decision Tree Regression models to compare their performance.
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Create a synthetic dataset with a non-linear relationship
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = np.sin(X) + np.random.normal(0, 0.1, size=(100, 1))

# Split the dataset into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Calculate the correlation matrix between X and y
corr_matrix = np.corrcoef(X.squeeze(), y.squeeze())

# Display the correlation matrix
print("Correlation Matrix between X and y:")
print(corr_matrix)
```
Output:
Correlation Matrix between X and y:
[[ 1. -0.94444709]
[-0.94444709 1. ]]
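As a reference for reading this matrix, the off-diagonal entry returned by `np.corrcoef` is the Pearson correlation coefficient. The sketch below recomputes it by hand on small made-up arrays; the names `a`, `b`, `pearson_r`, `x_sym` and `y_sym` are illustrative assumptions and are not part of the article's dataset.

```python
import numpy as np

# Hypothetical toy data, chosen only to illustrate the formula
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.1, 3.9, 6.2, 7.8, 10.1])


def pearson_r(u, v):
    """Pearson r: sum of centered products over the root of the summed squares."""
    u_centered = u - u.mean()
    v_centered = v - v.mean()
    numerator = (u_centered * v_centered).sum()
    denominator = np.sqrt((u_centered ** 2).sum() * (v_centered ** 2).sum())
    return numerator / denominator


manual_r = pearson_r(a, b)
library_r = np.corrcoef(a, b)[0, 1]
print("Manual Pearson r:", manual_r)
print("np.corrcoef r:", library_r)

# Pearson r only measures linear association: a perfectly determined but
# symmetric non-linear relationship can still give a value near zero.
x_sym = np.linspace(-1, 1, 101)
y_sym = x_sym ** 2
nonlinear_r = np.corrcoef(x_sym, y_sym)[0, 1]
print("r for y = x^2 on a symmetric range:", nonlinear_r)
```

Because the coefficient captures only linear association, a value close to zero does not rule out a strong non-linear dependence between the two variables.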
Fitting Linear Regression and Decision Tree Models for Comparison
```python
# Fit a Linear Regression model
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
lr_mse = mean_squared_error(y_test, lr_pred)

# Fit a Decision Tree Regression model
dtr = DecisionTreeRegressor(random_state=42)
dtr.fit(X_train, y_train)
dtr_pred = dtr.predict(X_test)
dtr_mse = mean_squared_error(y_test, dtr_pred)

print("Linear Regression MSE:", lr_mse)
print("Decision Tree Regression MSE:", dtr_mse)
```
Output:
Linear Regression MSE: 0.4352358901582881
Decision Tree Regression MSE: 0.01578187903036423
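A lower MSE on the test set indicates that the model's predictions are closer to the actual values. In this example, the Decision Tree Regression model has a much lower MSE (about 0.016) than the Linear Regression model (about 0.435), which shows that the decision tree captures the non-linear sine relationship far better than a straight-line fit.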
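To make the metric concrete, here is a minimal sketch of what `mean_squared_error` computes: the average of the squared differences between actual and predicted values. The arrays `y_true` and `y_hat` are hypothetical toy values, not results from the models above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical toy values, chosen only for illustration
y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.0])

# MSE by hand: mean of the squared residuals
manual_mse = np.mean((y_true - y_hat) ** 2)

# The same quantity via scikit-learn
sklearn_mse = mean_squared_error(y_true, y_hat)

print("Manual MSE:", manual_mse)
print("scikit-learn MSE:", sklearn_mse)
```

Because the residuals are squared, a few large errors raise the MSE much more than many small ones.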
Solving the Multicollinearity Problem with Decision Tree
Multicollinearity is a common issue in data science, affecting various types of models, including decision trees. This article explores what multicollinearity is, why it’s problematic for decision trees, and how to address it.
Table of Contents
- Multicollinearity in Decision Trees
- Detecting Multicollinearity
- Stepwise Guide of how Decision Trees Handle Multicollinearity
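As a rough illustration of the multicollinearity question this article raises, the sketch below fits a decision tree once on a single feature and once on that feature plus a near-duplicate of it. Everything here is an assumption made for the example: the names `x1`, `x2` and `fit_and_score`, the noise levels, and the sample size of 200 do not come from the article's own code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Hypothetical setup: x2 is a near-copy of x1, so the two features are
# almost perfectly correlated (an artificial case of multicollinearity)
x1 = rng.uniform(0, 10, size=200)
x2 = x1 + rng.normal(0, 0.01, size=200)
y = np.sin(x1) + rng.normal(0, 0.1, size=200)

X_single = x1.reshape(-1, 1)
X_both = np.column_stack([x1, x2])


def fit_and_score(features):
    """Fit a decision tree on an 80/20 split and return it with its test MSE."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y, test_size=0.2, random_state=42
    )
    tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
    return tree, mean_squared_error(y_te, tree.predict(X_te))


tree_single, mse_single = fit_and_score(X_single)
tree_both, mse_both = fit_and_score(X_both)
feature_corr = np.corrcoef(x1, x2)[0, 1]

print("Correlation between x1 and x2:", feature_corr)
print("Test MSE with x1 only:", mse_single)
print("Test MSE with x1 and x2:", mse_both)
print("Feature importances with both features:", tree_both.feature_importances_)
```

In setups like this one, the test error often changes little when the redundant feature is added, while the importance that a single feature would have received tends to be divided between the correlated pair, which makes the importances harder to interpret.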