Methods for Dealing with Outliers in Regression Analysis

Outliers are unusual values that lie far outside the overall pattern of the data. Detecting them is one of the most important steps in data preprocessing, since outliers can distort statistical analysis and the training of a machine learning model.

In this article, we will explore different methods to deal with outliers in regression analysis.

Table of Contents

  • Robust Regression Techniques for Outliers
    • 1. Huber Regression
    • 2. RANSAC Regression
    • 3. Theil Sen Regression
    • 4. Quantile Regression
  • Choosing the Right Technique – When to Use?

Robust Regression Techniques for Outliers

Robust regression techniques are essential when dealing with outliers, as they aim to minimize the impact of outliers on the estimation of the regression model’s parameters. The main techniques for dealing with outliers in regression analysis are:

  • Huber Regression
  • RANSAC Regression
  • Theil Sen Regression
  • Quantile Regression

1. Huber Regression

The Huber Regressor is a robust regression algorithm that is less sensitive to outliers than traditional linear regression. It combines the advantages of the least-squares method and the least-absolute-deviation method: the squared (MSE) loss is strongly influenced by outliers, while the absolute (MAE) loss largely ignores them, and the Huber loss behaves like the former for small errors and like the latter for large ones. Its main advantages are:

  • HuberRegressor is scaling invariant. Once epsilon is set, it produces the same robustness to outliers even after scaling X and y.
  • It is more efficient for a small number of samples.

The Huber regressor applies the Huber loss to each sample. A sample is treated as an inlier if its scaled absolute error is less than a certain threshold (epsilon); otherwise it is categorized as an outlier. Outliers are not ignored, but they are given a smaller weight.

We can define it using the following equation:

[Tex]\min_{w, \sigma} \sum_{i=1}^{n}\left(\sigma + H_{\epsilon}\left(\frac{X_{i}w - y_{i}}{\sigma}\right)\sigma\right) + \alpha \|w\|_2^2[/Tex]

where σ denotes the scale parameter (a robust estimate of the standard deviation of the residuals), Xi represents the input features of sample i, yi is the regression target, w is the vector of estimated coefficients and α is the regularisation parameter. The loss [Tex]H_{\epsilon}[/Tex] is defined as shown below:

[Tex]H_{\epsilon}(z) = \begin{cases} z^2, & \text{if } |z| < \epsilon \\ 2\epsilon|z| - \epsilon^2, & \text{otherwise} \end{cases}[/Tex]

It optimizes the squared loss for the samples where |z| < ε and the absolute (linear) loss for the samples where |z| ≥ ε. The model coefficients w, the intercept c and the scale σ are the parameters to be optimized. Applying the linear penalty to larger residuals reduces the weight given to outliers, while the quadratic penalty is kept for the smaller residuals near the centre.
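To make the piecewise definition concrete, here is a minimal NumPy sketch of the Huber loss (the helper name huber_loss is illustrative; 1.35 is used only because it matches HuberRegressor’s default epsilon):

Python

import numpy as np

def huber_loss(z, epsilon=1.35):
    """Huber loss: quadratic for small residuals, linear beyond epsilon."""
    z = np.asarray(z, dtype=float)
    quadratic = z ** 2                               # used where |z| < epsilon
    linear = 2 * epsilon * np.abs(z) - epsilon ** 2  # used otherwise
    return np.where(np.abs(z) < epsilon, quadratic, linear)

# Small residuals are penalized quadratically, large ones only linearly
print(huber_loss([0.5, 1.0, 3.0, 10.0]))

Because the penalty grows only linearly beyond epsilon, a single extreme residual cannot dominate the objective the way it would under a pure squared loss.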

Let’s look at the implementation in Python using Sklearn. The code is as follows:

Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import HuberRegressor

n_samples = 500
n_outliers = 25

# Generate a random regression dataset
X, y, coef = datasets.make_regression(
    n_samples=n_samples,
    n_features=1,
    n_informative=1,
    noise=20,
    coef=True,
    random_state=42,
)

# Replace the first 25 observations with outliers
np.random.seed(42)
X[:n_outliers] = 10 + 0.75 * np.random.normal(size=(n_outliers, 1))
y[:n_outliers] = -15 + 20 * np.random.normal(size=n_outliers)

# Grid of x values used to draw the fitted line
x = np.linspace(X.min(), X.max(), 7)

# Huber regressor with epsilon 1.5
epsilon = 1.5
huber = HuberRegressor(alpha=0.0, epsilon=epsilon).fit(X, y)
hcoef_ = huber.coef_ * x + huber.intercept_

# Plot the data and the Huber fit
fig, ax = plt.subplots()
plt.plot(X, y, "b.")
plt.plot(x, hcoef_, "r-", label="huber loss")
plt.title("Huber regression on data with outliers")
plt.legend(loc=0)
plt.show()

Output:

Huber Regressor

In the above code, we generated a random regression problem using the make_regression() method and replaced the first 25 observations with outliers. Next, we used HuberRegressor from sklearn to fit the dataset containing outliers. The model does not ignore the outliers completely, but it is also not heavily influenced by them. Finally, we evaluate the fitted Huber line over a grid of x values and plot it with Matplotlib along with the input data.

Let’s compare our Huber Regression model with Linear Regression.

Python

from sklearn.linear_model import LinearRegression

# Ordinary linear regression on the same data
lr = LinearRegression().fit(X, y)
lcoef_ = lr.coef_ * x + lr.intercept_

# Plot both fitted lines
fig, ax = plt.subplots()
plt.plot(X, y, "b.")
plt.plot(x, hcoef_, "r-", label="huber regression")
plt.plot(x, lcoef_, "b-", label="linear regression")
plt.title("Comparison of HuberRegressor vs Linear")
plt.legend(loc=0)
plt.xlabel("X")
plt.ylabel("y")
plt.show()

Output:

Comparison of Huber Regressor and Linear Regressor

Here, we built a linear regression model, evaluated its fitted line on the same grid of x values, and plotted both lines using Matplotlib. Clearly, the linear regression model is pulled much more strongly by the outliers than the Huber regressor. Let’s look at how a change in the epsilon parameter affects the Huber regressor.

Python

fig, ax = plt.subplots()

# Fit the Huber regressor over a series of epsilon values
colors = ["r-", "y-", "m-"]
epsilon_values = [1, 1.5, 1.9]
for k, epsilon in enumerate(epsilon_values):
    huber = HuberRegressor(alpha=0.0, epsilon=epsilon)
    huber.fit(X, y)
    hcoef_ = huber.coef_ * x + huber.intercept_
    plt.plot(x, hcoef_, colors[k], label="huber loss, %s" % epsilon)

plt.plot(X, y, "b.")
plt.legend(loc=0)
plt.xlabel("X")
plt.ylabel("y")
plt.show()

Output:

Comparison of Huber Regressor with different epsilon value

In the above code, we fitted the Huber regressor for a series of epsilon values (1, 1.5, and 1.9). You can notice that the fitted line changes as the parameter epsilon increases: larger epsilon values treat more samples with the squared loss, so the fit becomes less robust to outliers.

2. RANSAC Regression

The RANSAC (RANdom SAmple Consensus) algorithm repeatedly fits a model to random subsets of the data, splits the samples into inliers and outliers, and estimates the final model using only the determined inliers.

  • RANSAC scales much better with the number of samples.
  • It provides a better solution with large outliers in the y direction.
  • It is used for linear and non-linear regression problems.
  • RANSAC is popular in the field of photogrammetric computer vision.

RANSAC uses repeated random sub-sampling, where it estimates the parameters of a model by random sampling of observed data, and then, based on a voting scheme, it identifies the optimal fitting result. Let’s look at the steps followed in the RANSAC algorithm.

  1. Select a random subset (hypothetical inliers) of the original data.
  2. A model is fitted to the set of hypothetical inliers, and the entire data is tested against the fitted model.
  3. Then it identifies the data points that fit the estimated model well based on some model-specific loss function. These data points are called the consensus set (set of inliers).
  4. The estimated model is considered a great fit if sufficient data points have been classified as part of the consensus set.
  5. The model may be improved by re-estimating it by using all the members of the consensus set.

This robust regression algorithm achieves its goal by repeating the above steps, each time evaluating how well the model fits its consensus set. Execution can be terminated once a performance threshold is reached or after a fixed number of iterations; the sketch below shows how these stopping criteria can be configured.
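As a quick illustration, the inlier threshold and the stopping criteria map directly onto parameters of scikit-learn’s RANSACRegressor. The data and the specific parameter values below are made up for demonstration:

Python

import numpy as np
from sklearn.linear_model import RANSACRegressor

# Illustrative data: a linear trend with a few gross outliers in y
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = 4.0 * X_demo.ravel() + rng.normal(scale=1.0, size=200)
y_demo[:10] += 50  # corrupt the first ten targets

ransac_demo = RANSACRegressor(
    min_samples=2,            # size of each random subset (step 1)
    residual_threshold=25.0,  # max residual for a sample to join the consensus set (step 3)
    stop_n_inliers=180,       # stop early once the consensus set is large enough (step 4)
    max_trials=200,           # otherwise stop after a fixed number of iterations
    random_state=42,
)
ransac_demo.fit(X_demo, y_demo)
print("Inliers found:", ransac_demo.inlier_mask_.sum())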

Let’s look at the implementation in Python using Sklearn. The code is as follows:

Python

from sklearn.linear_model import RANSACRegressor

# Robustly fit a linear model with the RANSAC algorithm
ransac = RANSACRegressor()
ransac.fit(X, y)

x = np.linspace(X.min(), X.max(), 7)
line_x = x[:, np.newaxis]
line_y_ransac = ransac.predict(line_x)

# Plot the data and the RANSAC fit
fig, ax = plt.subplots()
plt.plot(X, y, "b.")
plt.plot(line_x, line_y_ransac, "r-", label="ransac regression")
plt.title("RANSAC Regressor")
plt.legend(loc=0)
plt.xlabel("X")
plt.ylabel("y")
plt.show()

Output:

RANSAC Regressor

The above code makes use of the RANSACRegressor() method from Sklearn to robustly fit linear data. Here, the RANSAC regressor splits the data into inliers and outliers, and the fitted line is determined by the optimal inliers.

Let’s check the RANSAC coefficient. To get the coefficient, first we need to access the final estimator of the RANSAC model using estimator_, and from the estimator we can retrieve the coefficient. The code is as follows:

Python

ransac_coef = ransac.estimator_.coef_
print("RANSAC Coefficient::", ransac_coef)

Output:

RANSAC Coefficient:: [62.74662047]

Let’s inspect the inliers and outliers in RANSAC regression.

Python

inlier_mask = ransac.inlier_mask_
outlier_mask = ~inlier_mask
print(f"Total outliers: {sum(outlier_mask)}")

plt.plot(X[inlier_mask], y[inlier_mask], "b.", label="Inliers")
plt.plot(X[outlier_mask], y[outlier_mask], "r.", label="Outliers")
plt.title("RANSAC - outliers vs inliers")

Output:

Total outliers: 55
Text(0.5, 1.0, 'RANSAC - outliers vs inliers')


Outliers vs Inliers

In the above code, we separated the inliers (blue dots) and outliers (red dots) using Matplotlib. The inlier_mask_ attribute gives the inliers, and negating that mask gives the outliers.


3. Theil Sen Regression

In Theil-Sen regression, the slope is calculated by taking the median of the slopes between each pair of points in the data, and the intercept is calculated by taking the median of the intercepts (each intercept is calculated based on the Theil-Sen slope that passes through each point).

For a pair of points (xi, yi) and (xj, yj), the Theil-Sen estimator takes the median m of the slopes (yj − yi)/(xj − xi) over all pairs of sample points. Using this median slope m, a line is fitted through the sample points by setting the y-intercept b to the median of the values yi − m·xi (so the fitted line is y = m·x + b); a small numerical sketch follows the list below.

  • Theil Sen is better with medium-size outliers in the X direction.
  • It is robust to multivariate outliers since it uses a generalization of the median in multiple dimensions.
  • A disadvantage is that it loses its robustness properties in high dimensions.
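To make the pairwise-median idea concrete, here is a minimal NumPy sketch for a one-dimensional sample. The data values are made up for illustration; scikit-learn’s TheilSenRegressor generalizes the idea to multiple features using a spatial median.

Python

import numpy as np
from itertools import combinations

# Made-up 1-D sample; the fifth y value is an obvious outlier
x_s = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_s = np.array([2.1, 4.0, 6.2, 7.9, 40.0, 12.1])

# Slope m: median of the pairwise slopes (y_j - y_i) / (x_j - x_i)
slopes = [(y_s[j] - y_s[i]) / (x_s[j] - x_s[i])
          for i, j in combinations(range(len(x_s)), 2)]
m = np.median(slopes)

# Intercept b: median of y_i - m * x_i
b = np.median(y_s - m * x_s)

print(f"Theil-Sen fit: y = {m:.2f} * x + {b:.2f}")

Because both the slope and the intercept are medians, the single outlying point shifts the fitted line far less than it would shift an ordinary least-squares fit.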

Let’s look at the implementation in Python using Sklearn. The code is as follows:

Python

from sklearn.linear_model import TheilSenRegressor

# Theil-Sen regressor
x = np.linspace(X.min(), X.max(), 7)
line_x = x[:, np.newaxis]
theilsen = TheilSenRegressor(random_state=42)
theilsen.fit(X, y)
line_y_theilsen = theilsen.predict(line_x)

# Plot the data and the Theil-Sen fit
fig, ax = plt.subplots()
plt.plot(X, y, "b.")
plt.plot(line_x, line_y_theilsen, "r-", label="theilsen regression")
plt.title("Theilsen Regressor")
plt.legend(loc=0)
plt.xlabel("X")
plt.ylabel("y")
plt.show()

Output:

TheilSen Regressor

The above code makes use of the TheilSenRegressor from Sklearn, which fits a line whose slope is the median of the pairwise slopes and whose intercept is the median of the resulting per-point intercepts.

Let’s look at the coefficient of the Theil-Sen regressor.

Python

theilsen.coef_[0]

Output:

59.48730609420659

4. Quantile Regression

Quantile regression is a type of robust (outlier-resistant) regression that estimates conditional quantiles, such as the conditional median, of the response variable. It extends linear regression to explore different aspects of the relationship between the dependent variable and the independent variables, and it makes no assumptions about the distribution of the residuals.

  • Quantile regression is useful for predicting an interval instead of a point.
  • It provides sensible prediction intervals even for errors with non-constant (but predictable) variance or a non-normal distribution.

As a linear model, the QuantileRegressor gives linear predictions ŷ(w, X) = Xw for the q-th quantile, q ∈ (0, 1). The weights or coefficients w are then found by the following minimization problem:

[Tex]\min_{w} \frac{1}{n_{\text{samples}}} \sum_{i} PB_q(y_i - X_i w) + \alpha \|w\|_1[/Tex]

where PBq is the pinball (quantile) loss:

[Tex]PB_q(t) = q \max(t, 0) + (1 - q) \max(-t, 0)[/Tex]
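Here is a minimal NumPy sketch of the pinball loss to make its asymmetry visible (the helper name pinball_loss is illustrative; scikit-learn exposes a comparable metric as sklearn.metrics.mean_pinball_loss):

Python

import numpy as np

def pinball_loss(residual, q):
    """Pinball (quantile) loss for residuals t = y_true - y_pred."""
    t = np.asarray(residual, dtype=float)
    return q * np.maximum(t, 0) + (1 - q) * np.maximum(-t, 0)

residuals = np.array([-2.0, -0.5, 0.5, 2.0])
# With q = 0.95, positive residuals (prediction below the target) are penalized
# heavily, which pushes the fitted line up towards the 95th percentile.
print("q = 0.05:", pinball_loss(residuals, 0.05))
print("q = 0.95:", pinball_loss(residuals, 0.95))

Minimizing this loss over the training samples pulls the fitted line toward the chosen quantile rather than toward the mean, which is why extreme values in y have limited influence on the median (q = 0.5) fit.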

Let’s look at the implementation in Python using Sklearn. The code is as follows:

Python

from sklearn.linear_model import QuantileRegressor

# Quantile regressor for the median (q = 0.5)
qr = QuantileRegressor(quantile=0.5, alpha=0, solver="highs")
qr.fit(X, y)
y_quantile_pred = qr.predict(X)

# Plot the data and the quantile fit
fig, ax = plt.subplots()
plt.plot(X, y, "b.")
plt.plot(X, y_quantile_pred, "y-", label="quantile regression")
plt.title("Quantile Regressor")
plt.legend(loc=0)
plt.xlabel("X")
plt.ylabel("y")
plt.show()

Output:

Quantile Regressor

Here, we make use of the QuantileRegressor() method from Sklearn and plot the prediction for a quantile of 0.5.

Let’s try different quantile values.

Python

fig, ax = plt.subplots()

# Fit the quantile regressor over a series of quantile values
colors = ["r-", "y-", "m-"]
quantiles = [0.05, 0.5, 0.95]
for k, quantile in enumerate(quantiles):
    qr = QuantileRegressor(quantile=quantile, alpha=0, solver="highs")
    qr.fit(X, y)
    y_quantile_pred = qr.predict(X)
    plt.plot(X, y_quantile_pred, colors[k], label="Quantile: %s" % quantile)

plt.plot(X, y, "b.")
plt.legend(loc=0)
plt.xlabel("X")
plt.ylabel("y")
plt.show()

Output:

Quantile Regressor for series of quantiles

In the above code, we considered a series of quantile values (0.05, 0.5, and 0.95). You can notice the difference in the predictions for each quantile.

Choosing the Right Technique – When to Use?

Technique            | When to Use
Huber Regression     | Small to medium-sized outliers in the data; when a combination of the least-squares and absolute-deviation methods is desired; when scaling of the input features and target variable is expected
RANSAC Regression    | Large outliers in the y direction; linear and non-linear regression problems; situations where a subset of the data points can be treated as inliers (e.g., photogrammetry)
Theil Sen Regression | Medium-sized outliers in the X direction; robustness to multivariate outliers
Quantile Regression  | Predicting an interval instead of a point

Conclusion

In conclusion, when dealing with outliers in regression analysis, it’s crucial to choose the appropriate technique based on the characteristics of the data and the nature of the outliers.

  • Huber Regression strikes a balance between robustness and efficiency, making it suitable for scenarios with small to medium-sized outliers.
  • RANSAC Regression is ideal for situations where there are large outliers in the y direction and when a subset of data can be considered as inliers.
  • Theil Sen Regression offers robustness to medium-sized outliers in the X direction and multivariate outliers.
  • Lastly, Quantile Regression is valuable for predicting intervals and handling errors with non-constant variance or non-normal distribution.

