Cook’s Distance Formula
In statistical modeling, it’s important to understand how each piece of data affects the overall picture. Cook’s Distance is a way to measure how much each point in a regression analysis influences the final results. Named after the statistician R. Dennis Cook, Cook’s Distance helps us pinpoint which data points have a big impact on the analysis. By showing us which points matter most, Cook’s Distance helps us make better decisions about our data and our models.
Formula for Cook’s Distance
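Cook’s Distance (D_i) for the ith observation in a regression model with p estimated coefficients (the predictors plus the intercept) is calculated using the formula:

$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p \cdot \mathrm{MSE}} = \frac{e_i^2}{p \cdot \mathrm{MSE}} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

Where,
- $\hat{y}_{j(i)}$ is the predicted value of the jth observation with the ith observation excluded from the model.
- $\hat{y}_j$ is the predicted value of the jth observation from the full model.
- $h_{ii}$ is the leverage of the ith observation.
- $e_i$ is the residual of the ith observation from the full model.
- MSE is the mean squared error of the model.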
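To make the terms above concrete, here is a minimal sketch that computes Cook’s Distance for a single observation in two ways: from the leave-one-out definition (refitting the model without that observation) and from the residual-and-leverage shortcut. It assumes the same mtcars model used later in this article. The choice `i <- 1` and the variable names are illustrative only, and p is taken to be the number of fitted coefficients including the intercept.

```r
# Sketch: Cook's Distance for one observation, computed two ways
data(mtcars)
model <- lm(mpg ~ wt + hp + disp, data = mtcars)

p   <- length(coef(model))                            # coefficients, intercept included
mse <- sum(residuals(model)^2) / df.residual(model)   # mean squared error of the full model

i <- 1  # illustrative choice: the first row of mtcars

# (a) Definition: refit without observation i, then compare
#     the predictions of both models on all observations
model_minus_i <- lm(mpg ~ wt + hp + disp, data = mtcars[-i, ])
shift <- predict(model, newdata = mtcars) - predict(model_minus_i, newdata = mtcars)
d_definition <- sum(shift^2) / (p * mse)

# (b) Shortcut: only the residual and leverage from the full model are needed
h <- hatvalues(model)[i]
e <- residuals(model)[i]
d_leverage <- unname(e^2 / (p * mse) * h / (1 - h)^2)

# Both should agree with R's built-in value up to floating-point error
c(definition = d_definition,
  leverage   = d_leverage,
  builtin    = unname(cooks.distance(model)[i]))
```

The shortcut in (b) is why Cook’s Distance is cheap to compute in practice: no refitting is required, only quantities already available from the full model.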
Now we will implement the Cook’s Distance Formula in R Programming Language.
Step 1: Fit a Linear Regression Model
The lm() function fits a linear regression model, where mpg is the dependent variable and wt, hp, and disp are the independent variables.
data(mtcars)
# Fit a linear regression model using mtcars dataset
model <- lm(mpg ~ wt + hp + disp, data = mtcars)
Step 2: Compute Cook’s Distance
cooks.distance() computes Cook’s Distance for each observation based on the fitted model. cooksd will contain the Cook’s Distance values for each observation in the mtcars dataset.
# Compute Cook's Distance
cooksd <- cooks.distance(model)
# View Cook's Distance values
cooksd
Output:
Mazda RX4            Mazda RX4 Wag        Datsun 710           Hornet 4 Drive
1.152035e-02         4.621112e-03         1.598334e-02         1.283888e-04
Hornet Sportabout    Valiant              Duster 360           Merc 240D
1.839055e-03         1.560119e-02         1.053270e-02         1.313511e-02
Merc 230             Merc 280             Merc 280C            Merc 450SE
2.525382e-03         3.671067e-03         2.606104e-02         1.551454e-03
Merc 450SL           Merc 450SLC          Cadillac Fleetwood   Lincoln Continental
1.049983e-04         5.648180e-03         7.218880e-05         1.298764e-02
Chrysler Imperial    Fiat 128             Honda Civic          Toyota Corolla
3.199707e-01         1.196019e-01         9.092102e-03         1.529771e-01
Toyota Corona        Dodge Challenger     AMC Javelin          Camaro Z28
2.215865e-02         4.218196e-02         4.909944e-02         7.181085e-03
Pontiac Firebird     Fiat X1-9            Porsche 914-2        Lotus Europa
6.980693e-02         4.163138e-04         1.732523e-06         5.959750e-02
Ford Pantera L       Ferrari Dino         Maserati Bora        Volvo 142E
7.279943e-03         1.100867e-02         3.402911e-01         8.796726e-03
Here, the output shows the Cook’s Distance value for each observation in the mtcars dataset. In this output, Maserati Bora (about 0.34) and Chrysler Imperial (about 0.32) stand out, followed by Toyota Corolla (about 0.15).
- Observations with larger Cook’s Distance values have a greater impact on the fitted regression model.
- Observations whose Cook’s Distance is much larger than the rest may be influential and should be investigated further.
- We can analyze these values to identify influential observations that may affect the reliability of the regression analysis.
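One way to act on the points above is to rank the observations by Cook’s Distance and look at the largest few alongside their leverage and residual. The sketch below assumes the same mtcars model as in Step 1. The cutoff `top_n <- 5` is a hypothetical value chosen only for illustration, not a recommended rule.

```r
# Sketch: rank observations by Cook's Distance and inspect the largest ones
data(mtcars)
model  <- lm(mpg ~ wt + hp + disp, data = mtcars)
cooksd <- cooks.distance(model)

top_n <- 5  # hypothetical cutoff, for illustration only

# Sort from most to least influential and keep the first top_n
top_influential <- head(sort(cooksd, decreasing = TRUE), top_n)

# Show each of them with its leverage and residual, which together
# determine the size of Cook's Distance
data.frame(
  cooksd   = round(top_influential, 4),
  leverage = round(hatvalues(model)[names(top_influential)], 4),
  residual = round(residuals(model)[names(top_influential)], 3)
)
```

A point can rank highly because of a large residual, high leverage, or a combination of both, so viewing the three columns side by side helps explain why a given observation is influential.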
Step 3: Visualizing Cook’s Distance
To visualize Cook’s Distance for the linear regression model fitted on the mtcars dataset in R, you can create a plot that highlights influential points. Here’s an example of how to do it:
# Load necessary library
library(ggplot2)
# Fit the linear regression model
model <- lm(mpg ~ wt + hp + disp, data = mtcars)
# Compute Cook's Distance
cooksd <- cooks.distance(model)
# Create a data frame for plotting
influence_data <- data.frame(
  index = seq_along(cooksd),
  cooksd = cooksd
)
# Plot Cook's Distance
ggplot(influence_data, aes(x = index, y = cooksd)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_hline(yintercept = 4 / (nrow(mtcars) - length(coef(model))),
             color = "red", linetype = "dashed") +
  labs(title = "Cook's Distance for Influential Observations",
       x = "Observation Index",
       y = "Cook's Distance") +
  theme_minimal()
# Print Cook's Distance values (optional)
print(cooksd)
Output:
- ggplot is used to create a bar plot of Cook’s Distance.
- A horizontal dashed red line is added at 4 / (n − p) as a threshold to identify potentially influential observations, where n is the number of observations and p is the number of predictors plus one (for the intercept).
- labs adds titles and labels to the plot.
- theme_minimal gives a clean look to the plot.
The plot will help you identify observations with high Cook’s Distance, which could indicate they have a significant influence on the fitted model. Observations above the red dashed line are typically considered influential.
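The plot shows which bars cross the line, but not their names. As a rough follow-up, the sketch below lists the observations above the same 4 / (n − p) cutoff and refits the model without them to see how much the coefficients move. It assumes the same mtcars model. The names `flagged` and `model_trimmed` are illustrative, and dropping flagged rows is shown here only as a sensitivity check, not as a recommendation to delete data.

```r
# Sketch: name the observations above the cutoff and check coefficient sensitivity
data(mtcars)
model  <- lm(mpg ~ wt + hp + disp, data = mtcars)
cooksd <- cooks.distance(model)

# Same cutoff as the dashed red line: 4 / (n - p)
threshold <- 4 / (nrow(mtcars) - length(coef(model)))

# Names of the observations whose Cook's Distance exceeds the cutoff
flagged <- names(cooksd)[cooksd > threshold]
flagged

# Refit without the flagged rows and compare coefficients side by side
model_trimmed <- lm(mpg ~ wt + hp + disp,
                    data = mtcars[!(rownames(mtcars) %in% flagged), ])
round(cbind(full = coef(model), trimmed = coef(model_trimmed)), 4)
```

If the coefficients change noticeably between the two columns, the flagged observations are driving part of the fit and deserve a closer look before any conclusions are drawn.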
Conclusion
Cook’s Distance is a valuable tool for detecting influential observations in regression analysis. It offers insight into data reliability and model performance, but it has limitations, so its results should be interpreted together with other diagnostic measures such as residuals and leverage. Used this way, Cook’s Distance can improve the quality and validity of our regression models and support better-informed decisions.