Step-by-Step Implementation of Feature Selection Using Random Forest

Step 1: Generate the Dataset

First, we’ll generate a synthetic dataset with informative and non-informative features, then split it into training and test sets. In the call below, n_informative=5 creates five truly predictive features, n_redundant=2 adds linear combinations of those, and the remaining features are uninformative noise.

Python
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate the dataset
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=5,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    random_state=42,
)

# Convert to DataFrame for ease of use
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=feature_names)
y = pd.Series(y)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Step 2: Train a Random Forest Model (Before Feature Selection)

Next, we’ll train a Random Forest classifier using all the features and evaluate its accuracy.

Python
from sklearn.ensemble import RandomForestClassifier

# Train the Random Forest model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Evaluate the model
accuracy_before = rf.score(X_test, y_test)
print(f'Accuracy before feature selection: {accuracy_before:.2f}')

Output:

Accuracy before feature selection: 0.89

Step 3: Perform Feature Selection Using Random Forest

Now, we’ll rank the features using the trained model’s feature_importances_ attribute and keep only the most important ones.

Python
# Extract feature importances
importances = rf.feature_importances_
feature_names = X.columns
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})

# Rank features by importance
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
print(feature_importance_df)

# Select the top N features (here, the top 10)
top_features = feature_importance_df['Feature'].head(10).values
X_train_selected = X_train[top_features]
X_test_selected = X_test[top_features]

Output:

       Feature  Importance
10  feature_10    0.166347
18  feature_18    0.129780
 9   feature_9    0.127592
15  feature_15    0.116865
 4   feature_4    0.113428
12  feature_12    0.059363
 1   feature_1    0.051482
14  feature_14    0.020885
 3   feature_3    0.020203
11  feature_11    0.019620
 2   feature_2    0.019236
17  feature_17    0.018607
 5   feature_5    0.018271
 6   feature_6    0.018121
 7   feature_7    0.017843
 8   feature_8    0.017514
 0   feature_0    0.017097
16  feature_16    0.016739
13  feature_13    0.015980
19  feature_19    0.015027
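
A bar chart often communicates this ranking more clearly than the raw table. The sketch below is illustrative: it reuses the feature_importance_df built in this step and assumes matplotlib is installed (it is not imported in the steps above).

Python
import matplotlib.pyplot as plt

# Horizontal bar chart of the ranked importances
plt.figure(figsize=(8, 6))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'])
plt.gca().invert_yaxis()  # most important feature at the top
plt.xlabel('Importance')
plt.title('Random Forest Feature Importances')
plt.tight_layout()
plt.show()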

Step 4: Train a Random Forest Model (After Feature Selection)

We’ll train a new Random Forest classifier using only the selected features and evaluate its accuracy.

Python
# Train the Random Forest model with selected features
rf_selected = RandomForestClassifier(n_estimators=100, random_state=42)
rf_selected.fit(X_train_selected, y_train)

# Evaluate the model
accuracy_after = rf_selected.score(X_test_selected, y_test)
print(f'Accuracy after feature selection: {accuracy_after:.2f}')

Output:

Accuracy after feature selection: 0.94

In this example, feature selection using Random Forest improved the test accuracy from 0.89 to 0.94. By keeping only the most informative features and discarding irrelevant ones, feature selection reduces overfitting and improves the model’s ability to generalize to unseen data.
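
The cut at the top 10 features above was an example rather than a tuned choice. One simple way to pick N is to sweep a few candidate values and compare test accuracy, as in the sketch below (the loop variable n_top is illustrative, and exact scores will vary with the environment):

Python
# Compare test accuracy while varying how many top-ranked features are kept
for n_top in (3, 5, 10, 15, 20):
    cols = feature_importance_df['Feature'].head(n_top).values
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train[cols], y_train)
    print(f'top {n_top:2d} features: accuracy = {model.score(X_test[cols], y_test):.2f}')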

This method is particularly useful in datasets with many features, where not all features contribute equally to the predictive power of the model. By selecting only the most relevant features, we can build more efficient, interpretable, and higher-performing models.
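
As a closing note, the manual ranking-and-slicing in Step 3 can also be automated with scikit-learn’s SelectFromModel, which wraps a fitted estimator and keeps the features whose importance makes the cut. Below is a minimal sketch mirroring the top-10 selection above; passing threshold=-np.inf disables the importance threshold so that max_features alone decides the cut, and prefit=True reuses the rf model fitted in Step 2 without retraining.

Python
import numpy as np
from sklearn.feature_selection import SelectFromModel

# Reuse the forest fitted in Step 2; keep the 10 most important features
selector = SelectFromModel(rf, max_features=10, threshold=-np.inf, prefit=True)

# Boolean mask over the original columns
mask = selector.get_support()
print('Selected features:', list(X.columns[mask]))

# transform() returns NumPy arrays containing only the selected columns
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)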


