Cook’s Distance Formula

In regression analysis, a few observations can pull the fitted model strongly in their direction. Cook’s Distance measures this influence: for each observation, it summarizes how much all of the fitted values would change if that observation were removed and the model refitted. Named after the statistician R. Dennis Cook, it helps us pinpoint which data points have a large effect on the estimated coefficients, so we can check whether the conclusions of the model depend on only a handful of points.

Formula for Cook’s Distance

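Cook’s Distance $D_i$ for the $i$th observation in a regression model with $p$ estimated coefficients (the predictors plus the intercept) is calculated using the formula:

$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p \cdot \mathrm{MSE}} = \frac{e_i^2}{p \cdot \mathrm{MSE}} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

Where,

  • $\hat{y}_{j(i)}$ is the predicted value of the $j$th observation with the $i$th observation excluded from the model.
  • $\hat{y}_j$ is the predicted value of the $j$th observation from the full model.
  • $h_{ii}$ is the leverage of the $i$th observation.
  • $e_i$ is the residual of the $i$th observation in the full model.
  • MSE is the mean squared error of the model, that is, the residual sum of squares divided by $n - p$.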
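The two forms of the formula can be checked against each other numerically. The following is a minimal sketch, assuming the same mtcars model that is used in the steps below; the variable names (d_refit, d_leverage and so on) are chosen here for illustration only. It computes Cook’s Distance once by refitting the model without each observation and once from the leverage, and then compares both with R’s built-in cooks.distance().

```r
# Sketch: Cook's Distance computed by hand (illustrative variable names)
data(mtcars)
model <- lm(mpg ~ wt + hp + disp, data = mtcars)

n      <- nrow(mtcars)
p      <- length(coef(model))                 # coefficients, including the intercept
mse    <- sum(residuals(model)^2) / (n - p)   # residual mean square
y_full <- fitted(model)                       # predictions from the full model

# Definition: drop observation i, refit, and compare all n predictions
d_refit <- sapply(seq_len(n), function(i) {
  model_i     <- lm(mpg ~ wt + hp + disp, data = mtcars[-i, ])
  y_without_i <- predict(model_i, newdata = mtcars)
  sum((y_full - y_without_i)^2) / (p * mse)
})

# Shortcut: residual and leverage from the full model only
h <- hatvalues(model)
e <- residuals(model)
d_leverage <- e^2 / (p * mse) * h / (1 - h)^2

# Both comparisons are expected to return TRUE
all.equal(unname(d_leverage), d_refit)
all.equal(d_leverage, cooks.distance(model))
```

The leverage form is the practical one, because it needs only a single model fit instead of n refits.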

Now we will compute Cook’s Distance in the R programming language.

Step 1: Fit a Linear Regression Model

The lm() function fits a linear regression model, where mpg is the dependent variable and wt, hp, and disp are the independent variables.

R
data(mtcars)

# Fit a linear regression model using mtcars dataset
model <- lm(mpg ~ wt + hp + disp, data = mtcars)

Step 2: Compute Cook’s Distance

cooks.distance() computes Cook’s Distance for each observation based on the fitted model. cooksd will contain the Cook’s Distance values for each observation in the mtcars dataset.

R
# Compute Cook's Distance
cooksd <- cooks.distance(model)

# View Cook's Distance values
cooksd

Output:

          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
       1.152035e-02        4.621112e-03        1.598334e-02        1.283888e-04 
  Hornet Sportabout             Valiant          Duster 360           Merc 240D 
       1.839055e-03        1.560119e-02        1.053270e-02        1.313511e-02 
           Merc 230            Merc 280           Merc 280C          Merc 450SE 
       2.525382e-03        3.671067e-03        2.606104e-02        1.551454e-03 
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
       1.049983e-04        5.648180e-03        7.218880e-05        1.298764e-02 
  Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
       3.199707e-01        1.196019e-01        9.092102e-03        1.529771e-01 
      Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
       2.215865e-02        4.218196e-02        4.909944e-02        7.181085e-03 
   Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
       6.980693e-02        4.163138e-04        1.732523e-06        5.959750e-02 
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
       7.279943e-03        1.100867e-02        3.402911e-01        8.796726e-03 

The output shows the Cook’s Distance value for each observation in the mtcars dataset.

  • Observations with larger Cook’s Distance values have a greater impact on the fitted regression model.
  • Observations whose Cook’s Distance is much larger than the rest may be influential and should be investigated further.
  • We can analyze these values to identify influential observations that may affect the reliability of our regression analysis.
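Scanning 32 values by eye is error-prone, so it can help to sort them. A minimal sketch, assuming the same model as above (top5 is a name introduced here for illustration):

```r
data(mtcars)
model  <- lm(mpg ~ wt + hp + disp, data = mtcars)
cooksd <- cooks.distance(model)

# Rank observations from most to least influential and keep the five largest
top5 <- head(sort(cooksd, decreasing = TRUE), 5)
round(top5, 4)
```

Because cooksd is a named vector, the car names are kept alongside the sorted values.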

Step 3: Visualizing Cook’s Distance

To visualize Cook’s Distance for the linear regression model fitted on the mtcars dataset, we can create a plot that highlights influential points. Here is an example:

R
# Load necessary library
library(ggplot2)

# Fit the linear regression model
model <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Compute Cook's Distance
cooksd <- cooks.distance(model)

# Create a data frame for plotting
influence_data <- data.frame(
  index = seq_along(cooksd),
  cooksd = cooksd
)

# Threshold 4 / (n - p), where p counts the coefficients including the intercept
threshold <- 4 / (nrow(mtcars) - length(coef(model)))

# Plot Cook's Distance
ggplot(influence_data, aes(x = index, y = cooksd)) +
  geom_col(fill = "steelblue") +
  geom_hline(yintercept = threshold,
             color = "red", linetype = "dashed") +
  labs(title = "Cook's Distance for Influential Observations",
       x = "Observation Index",
       y = "Cook's Distance") +
  theme_minimal()

# Print Cook's Distance values (optional)
print(cooksd)

Output:

[Figure: bar plot of Cook’s Distance against observation index, with a dashed red line at the threshold]

ggplot is used to create a bar plot of Cook’s Distance.

  1. A horizontal dashed red line is added at $4 / (n - p)$ as a threshold to identify potentially influential observations, where $n$ is the number of observations and $p$ is the number of predictors plus one (for the intercept).
  2. labs adds titles and labels to the plot.
  3. theme_minimal gives a clean look to the plot.

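The plot helps identify observations with a high Cook’s Distance, which may have a strong influence on the fitted model. Observations above the red dashed line are commonly flagged as potentially influential. Here $n = 32$ and $p = 4$, so the threshold is $4 / 28 \approx 0.143$; in the output of Step 2, Chrysler Imperial, Toyota Corolla, and Maserati Bora exceed it. The threshold is a rule of thumb, so flagged points call for a closer look rather than automatic removal.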
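The same cutoff can be applied in code instead of reading it off the plot. The sketch below assumes the same model and threshold as above; flagged and model_trimmed are illustrative names. It lists the observations above the line and, as one possible way to investigate them, refits the model without them to see how much the coefficients move.

```r
data(mtcars)
model  <- lm(mpg ~ wt + hp + disp, data = mtcars)
cooksd <- cooks.distance(model)

# Same cutoff as the dashed line in the plot: 4 / (n - p)
threshold <- 4 / (nrow(mtcars) - length(coef(model)))

# Names of the observations above the cutoff
flagged <- names(cooksd)[cooksd > threshold]
flagged

# Refit without the flagged rows and compare the coefficients side by side
model_trimmed <- lm(mpg ~ wt + hp + disp,
                    data = mtcars[!rownames(mtcars) %in% flagged, ])
round(cbind(full = coef(model), trimmed = coef(model_trimmed)), 4)
```

A large shift between the two coefficient columns suggests that the conclusions of the model depend on the flagged points; it does not by itself show that those points are errors.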

Conclusion

Cook’s Distance is a valuable tool for detecting influential observations in regression analysis. It offers insight into how strongly a fitted model depends on individual data points, but its thresholds are rules of thumb, so the results should be interpreted together with other diagnostic measures. Used this way, Cook’s Distance helps us improve the quality and validity of our regression models and make better-informed decisions.
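As one way to look at Cook’s Distance next to other diagnostics, base R provides influence.measures(), which collects several influence statistics in a single table. This is a minimal sketch on the same model (infl and flagged_any are illustrative names); note that the function applies its own cutoffs, which differ from the $4 / (n - p)$ line used in the plot.

```r
data(mtcars)
model <- lm(mpg ~ wt + hp + disp, data = mtcars)

infl <- influence.measures(model)

# One row per car: DFBETAS, DFFITS, covariance ratio, Cook's Distance, leverage
head(round(infl$infmat, 3))

# Cars flagged by at least one of R's built-in rules of thumb
flagged_any <- rownames(infl$is.inf)[apply(infl$is.inf, 1, any)]
flagged_any
```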


