Handling Missing Data in Logistic Regression by Deletion

In this method, we simply remove observations with missing values from the dataset. This approach is straightforward but may lead to loss of valuable information.

Pros of Handling Missing Data in Logistic Regression by Deletion

  1. Simplicity: The deletion process is easy to apply and understand. No further modeling steps or sophisticated imputation techniques are needed.
  2. Preservation of Data Structure: The remaining observations are left untouched; no values have to be estimated or manipulated, so the dataset’s structure is preserved.

Cons of Handling Missing Data in Logistic Regression by Deletion

  1. Loss of Important Information: The primary disadvantage of the deletion method is the information that is lost. Removing observations with missing values can discard potentially significant patterns or relationships in the data.
  2. Reduced Statistical Power: Deleting observations shrinks the sample size and, thus, lowers statistical power. Fewer observations can lead to less precise estimates and less trustworthy results.

Implementation

  • A synthetic dataset with missing values is generated using NumPy’s random functions.
  • The dataset includes 1000 samples and 5 features, with 20% missing values randomly inserted.
  • The dataset is split into training and testing sets using an 80-20 split ratio.
  • Observations with missing values are removed from the training set using boolean indexing.
  • A logistic regression model is trained on the modified training set without missing values.
  • The trained model’s accuracy is evaluated on the testing set, excluding observations with missing values.
  • The output indicates the accuracy achieved by the logistic regression model trained using the deletion method for handling missing data.
  • In this specific run, the accuracy obtained is approximately 51.56%.
  • The achieved accuracy may be relatively low due to the loss of valuable information caused by the deletion of observations with missing values.
Python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Step 1: Generate Synthetic Dataset with Missing Values
np.random.seed(0)
n_samples = 1000
n_features = 5
X = np.random.rand(n_samples, n_features)
y = np.random.randint(0, 2, n_samples)  # binary target variable
missing_mask = np.random.rand(n_samples, n_features) < 0.2  # 20% missing values
X[missing_mask] = np.nan

# Step 2: Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Deletion Method:
# Remove observations with missing values
complete_train = ~np.isnan(X_train).any(axis=1)
X_train_deleted = X_train[complete_train]
y_train_deleted = y_train[complete_train]

# Train logistic regression model
model_deleted = LogisticRegression()
model_deleted.fit(X_train_deleted, y_train_deleted)

# Evaluate model on test set, excluding observations with missing values
complete_test = ~np.isnan(X_test).any(axis=1)
accuracy_deleted = model_deleted.score(X_test[complete_test], y_test[complete_test])
print("Accuracy with Deletion Method:", accuracy_deleted)

Output:

Accuracy with Deletion Method: 0.515625

The output reflects the accuracy of a logistic regression model trained on data from which observations with missing values were removed (the deletion method). At about 51.56%, the model’s performance is relatively low, likely because deleting incomplete observations discards valuable information, shrinks the training set, and hinders the model’s ability to generalize to unseen data.

How to Handle Missing Data in Logistic Regression?

Logistic regression is a robust statistical method for modeling the probability of binary outcomes. Nevertheless, real-world datasets frequently contain missing values, which creates obstacles when fitting logistic regression models. Dealing with missing data effectively is essential to prevent biased estimates and maintain the model’s accuracy. In this article, we discuss how to handle missing data in logistic regression.

Table of Content

  • How to Handle Missing Data in Logistic Regression?
  • 1. Handling Missing Data in Logistic Regression by Deletion
  • 2. Handling Missing Data in Logistic Regression by Imputation
  • 3. Handling Missing Data in Logistic Regression using Missingness Indicator


Handling Missing Data in Logistic Regression by Imputation

Imputation involves replacing missing values with estimated values. Common imputation techniques include mean imputation, median imputation, and K-nearest neighbors (KNN) imputation.
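As a rough illustration (not part of the original article's code), the sketch below applies mean imputation with scikit-learn's SimpleImputer to the same kind of synthetic data used in the deletion example; the variable names are placeholders, and strategy="median" or KNNImputer from sklearn.impute can be swapped in for the mean strategy shown here.

Python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Synthetic data with roughly 20% missing values (mirrors the deletion example)
np.random.seed(0)
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, 1000)
X[np.random.rand(1000, 5) < 0.2] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the imputer on the training data only, then apply it to both splits
imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Train and evaluate logistic regression on the imputed data
model_imputed = LogisticRegression()
model_imputed.fit(X_train_imputed, y_train)
print("Accuracy with Mean Imputation:", model_imputed.score(X_test_imputed, y_test))

Fitting the imputer on the training split alone and reusing it on the test split keeps information from the test set from leaking into the imputation step.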

Handling Missing Data in Logistic Regression using Missingness Indicator

In this approach, we incorporate the missingness mechanism into the analysis by including indicator variables that flag whether a value was missing. The model can then learn from the missingness pattern and make more accurate predictions.
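A minimal sketch of this idea (again an illustration under assumed names, not the article's own code) is to mean-impute the features and append one binary indicator column for each feature that contained missing values; scikit-learn's SimpleImputer supports this directly through its add_indicator option.

Python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Synthetic data with roughly 20% missing values (same setup as above)
np.random.seed(0)
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, 1000)
X[np.random.rand(1000, 5) < 0.2] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Mean-impute the features and append one 0/1 indicator column for every
# feature that contained missing values, so the model can learn from the
# missingness pattern as well as from the imputed values.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_train_ind = imputer.fit_transform(X_train)
X_test_ind = imputer.transform(X_test)

# Train and evaluate logistic regression on features plus missingness indicators
model_indicator = LogisticRegression(max_iter=1000)
model_indicator.fit(X_train_ind, y_train)
print("Accuracy with Missingness Indicator:", model_indicator.score(X_test_ind, y_test))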

Conclusion

Handling missing data is crucial for building reliable logistic regression models. By understanding the types of missing data and employing appropriate techniques such as imputation or deletion, researchers can mitigate bias and ensure accurate predictions. With careful consideration and implementation, logistic regression can provide valuable insights into binary outcomes in various fields.
