Huber regressor · scikit-learn/scikit-learn@be913c9 · GitHub

Commit be913c9
Huber regressor
- Add gradient calculation in _huber_loss_and_gradient
- Add tests to check the correctness of the loss and gradient
- Fix for old scipy
- Add parameter sigma for robust linear regression
- Add gradient formula to robust _huber_loss_and_gradient
- Add fit_intercept option and fix tests
- Add docs to HuberRegressor and the helper functions
- Add example demonstrating ridge_regression vs huber_regression
- Add sample_weight implementation
- Add scaling invariant huber test
- Remove exp and add bounds to fmin_l_bfgs_b
- Add sparse data support
- Add more tests and refactoring of code
- Add narrative docs
- Review huber regressor
- Minor additions to docs and tests
- Minor fixes dealing with NaN values in targets and old versions of SciPy and NumPy
- Add HuberRegressor to robust estimator
- Refactor computation of gradient and make docs render properly
- Temp
- Remove float64 dtype conversion
- Trivial optimizations and add a note about R
- Remove sample_weights special-casing
- Address @amueller comments
1 parent 323bcea commit be913c9
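The new sklearn/linear_model/huber.py file itself is not shown in this excerpt. As a rough sketch only, here is one way the objective that the _huber_loss_and_gradient helper mentioned above could be minimized with fmin_l_bfgs_b, using the loss formula from the narrative docs below. The parameter layout and names are hypothetical, the analytic gradient is replaced by a numerical one (approx_grad), and the lower bound on sigma echoes the "add bounds to fmin_l_bfgs_b" item:

import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def huber_loss(params, X, y, epsilon, alpha):
    # Hypothetical layout: params packs [coef (n_features), intercept, sigma].
    w, intercept, sigma = params[:-2], params[-2], params[-1]
    z = (X.dot(w) + intercept - y) / sigma
    # Piecewise Huber term H_m(z) from the narrative docs below.
    h = np.where(np.abs(z) < epsilon,
                 z ** 2,
                 2. * epsilon * np.abs(z) - epsilon ** 2)
    return np.sum(sigma + h * sigma) + alpha * np.dot(w, w)

rng = np.random.RandomState(0)
X = rng.randn(40, 2)
y = X.dot([1., 2.]) + 0.1 * rng.randn(40)
y[::10] += 20.  # a few strong outliers in y

x0 = np.zeros(X.shape[1] + 2)
x0[-1] = 1.  # initial sigma
bounds = [(None, None)] * (X.shape[1] + 1) + [(1e-10, None)]  # keep sigma positive
params, loss, info = fmin_l_bfgs_b(huber_loss, x0, args=(X, y, 1.35, 0.0001),
                                   bounds=bounds, approx_grad=True)

The actual helper also returns an analytic gradient, per the commit message; this sketch trades that for brevity.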

File tree

10 files changed: +647 −17 lines changed

doc/modules/classes.rst

Lines changed: 1 addition & 0 deletions
@@ -689,6 +689,7 @@ Kernels:
    linear_model.BayesianRidge
    linear_model.ElasticNet
    linear_model.ElasticNetCV
+   linear_model.HuberRegressor
    linear_model.Lars
    linear_model.LarsCV
    linear_model.Lasso

doc/modules/linear_model.rst

Lines changed: 77 additions & 7 deletions
@@ -902,15 +902,24 @@ in these settings.
 
 .. topic:: **Trade-offs: which estimator?**
 
-   Scikit-learn provides 2 robust regression estimators:
-   :ref:`RANSAC <ransac_regression>` and
-   :ref:`Theil Sen <theil_sen_regression>`
+   Scikit-learn provides 3 robust regression estimators:
+   :ref:`RANSAC <ransac_regression>`,
+   :ref:`Theil Sen <theil_sen_regression>` and
+   :ref:`HuberRegressor <huber_regression>`
 
-   * :ref:`RANSAC <ransac_regression>` is faster, and scales much better
-     with the number of samples
+   * :ref:`HuberRegressor <huber_regression>` should be faster than
+     :ref:`RANSAC <ransac_regression>` and :ref:`Theil Sen <theil_sen_regression>`
+     unless the number of samples is very large, i.e. ``n_samples`` >> ``n_features``.
+     This is because :ref:`RANSAC <ransac_regression>` and :ref:`Theil Sen <theil_sen_regression>`
+     fit on smaller subsets of the data. However, both :ref:`Theil Sen <theil_sen_regression>`
+     and :ref:`RANSAC <ransac_regression>` are unlikely to be as robust as
+     :ref:`HuberRegressor <huber_regression>` for the default parameters.
 
-   * :ref:`RANSAC <ransac_regression>` will deal better with large
-     outliers in the y direction (most common situation)
+   * :ref:`RANSAC <ransac_regression>` is faster than :ref:`Theil Sen <theil_sen_regression>`
+     and scales much better with the number of samples
+
+   * :ref:`RANSAC <ransac_regression>` will deal better with large
+     outliers in the y direction (most common situation)
 
    * :ref:`Theil Sen <theil_sen_regression>` will cope better with
      medium-size outliers in the X direction, but this property will
@@ -1050,6 +1059,67 @@ considering only a random subset of all possible combinations.
 
 .. [#f2] T. Kärkkäinen and S. Äyrämö: `On Computation of Spatial Median for Robust Data Mining. <http://users.jyu.fi/~samiayr/pdf/ayramo_eurogen05.pdf>`_
 
+.. _huber_regression:
+
+Huber Regression
+----------------
+
+The :class:`HuberRegressor` differs from :class:`Ridge` because it applies a
+linear loss to samples that are classified as outliers.
+A sample is classified as an inlier if the absolute error of that sample is
+less than a certain threshold. It differs from :class:`TheilSenRegressor`
+and :class:`RANSACRegressor` because it does not ignore the effect of the outliers
+but gives them a lesser weight.
+
+.. figure:: ../auto_examples/linear_model/images/plot_huber_vs_ridge_001.png
+   :target: ../auto_examples/linear_model/plot_huber_vs_ridge.html
+   :align: center
+   :scale: 50%
+
+The loss function that :class:`HuberRegressor` minimizes is given by
+
+.. math::
+
+   \underset{w, \sigma}{\min} \sum_{i=1}^n \left(\sigma + H_m\left(\frac{X_i w - y_i}{\sigma}\right) \sigma\right) + \alpha {||w||_2}^2
+
+where
+
+.. math::
+
+   H_m(z) = \begin{cases}
+            z^2, & \text{if } |z| < \epsilon, \\
+            2\epsilon|z| - \epsilon^2, & \text{otherwise}
+            \end{cases}
+
+It is advised to set the parameter ``epsilon`` to 1.35 to achieve 95% statistical efficiency.
+
+Notes
+-----
+The :class:`HuberRegressor` differs from using :class:`SGDRegressor` with loss set to `huber`
+in the following ways.
+
+- :class:`HuberRegressor` is scaling invariant. Once ``epsilon`` is set, scaling ``X`` and ``y``
+  down or up by different factors produces the same robustness to outliers as before,
+  whereas with :class:`SGDRegressor` ``epsilon`` has to be set again when ``X`` and ``y``
+  are scaled.
+
+- :class:`HuberRegressor` should be more efficient to use on data with a small number of
+  samples, while :class:`SGDRegressor` needs a number of passes on the training data to
+  produce the same robustness.
+
+.. topic:: Examples:
+
+  * :ref:`example_linear_model_plot_huber_vs_ridge.py`
+
+.. topic:: References:
+
+  .. [#f1] Peter J. Huber, Elvezio M. Ronchetti: Robust Statistics, Concomitant scale estimates, pg 172
+
+Also, this estimator differs from the R implementation of Robust Regression
+(http://www.ats.ucla.edu/stat/r/dae/rreg.htm) because the R implementation performs a weighted
+least squares fit, with weights given to each sample on the basis of how much the residual
+exceeds a certain threshold.
+
 .. _polynomial_regression:
 
 Polynomial regression: extending linear models with basis functions
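To illustrate the scaling-invariance note in the new Huber Regression docs above, here is a minimal sketch (not part of this commit) that fits the estimator on raw and rescaled targets; alpha=0.0 is used so the rescaling of the solution is exact up to optimizer tolerance:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import HuberRegressor

# Toy data with a few strong outliers in y.
X, y = make_regression(n_samples=50, n_features=2, noise=4.0, random_state=0)
y[::10] += 100.0

# Same epsilon for both fits; only the scale of y changes.
huber = HuberRegressor(epsilon=1.35, alpha=0.0).fit(X, y)
huber_scaled = HuberRegressor(epsilon=1.35, alpha=0.0).fit(X, 100.0 * y)

# The fitted coefficients simply rescale; epsilon needs no retuning,
# unlike SGDRegressor(loss='huber'), whose epsilon is in absolute units.
print(np.allclose(100.0 * huber.coef_, huber_scaled.coef_, rtol=1e-3))

This works because sigma is estimated jointly with w: scaling y by a factor c scales the optimal sigma and coefficients by c and leaves the inlier/outlier classification unchanged.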

doc/whats_new.rst

Lines changed: 3 additions & 0 deletions
@@ -35,6 +35,9 @@ New features
 - Added new supervised learning algorithm: :ref:`Multi-layer Perceptron <multilayer_perceptron>`
   (`#3204 <https://github.com/scikit-learn/scikit-learn/pull/3204>`_) by `Issam H. Laradji`_
 
+- Added :class:`linear_model.HuberRegressor`, a linear model robust to outliers
+  (`#5291 <https://github.com/scikit-learn/scikit-learn/pull/5291>`_) by `Manoj Kumar`_.
+
 Enhancements
 ............

examples/linear_model/plot_huber_vs_ridge.py

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
+"""
+=======================================================
+HuberRegressor vs Ridge on dataset with strong outliers
+=======================================================
+
+Fit Ridge and HuberRegressor on a dataset with outliers.
+
+The example shows that the predictions in ridge are strongly influenced
+by the outliers present in the dataset. The Huber regressor is less
+influenced by the outliers since the model uses the linear loss for these.
+As the parameter epsilon is increased for the Huber regressor, the decision
+function approaches that of the ridge.
+"""
+
+# Authors: Manoj Kumar mks542@nyu.edu
+# License: BSD 3 clause
+
+print(__doc__)
+
+import numpy as np
+import matplotlib.pyplot as plt
+
+from sklearn.datasets import make_regression
+from sklearn.linear_model import HuberRegressor, Ridge
+
+# Generate toy data.
+rng = np.random.RandomState(0)
+X, y = make_regression(n_samples=20, n_features=1, random_state=0, noise=4.0,
+                       bias=100.0)
+
+# Add four strong outliers to the dataset.
+X_outliers = rng.normal(0, 0.5, size=(4, 1))
+y_outliers = rng.normal(0, 2.0, size=4)
+X_outliers[:2, :] += X.max() + X.mean() / 4.
+X_outliers[2:, :] += X.min() - X.mean() / 4.
+y_outliers[:2] += y.min() - y.mean() / 4.
+y_outliers[2:] += y.max() + y.mean() / 4.
+X = np.vstack((X, X_outliers))
+y = np.concatenate((y, y_outliers))
+plt.plot(X, y, 'b.')
+
+# Fit the huber regressor over a series of epsilon values.
+colors = ['r-', 'b-', 'y-', 'm-']
+
+x = np.linspace(X.min(), X.max(), 7)
+epsilon_values = [1.35, 1.5, 1.75, 1.9]
+for k, epsilon in enumerate(epsilon_values):
+    huber = HuberRegressor(fit_intercept=True, alpha=0.0, max_iter=100,
+                           epsilon=epsilon)
+    huber.fit(X, y)
+    coef_ = huber.coef_ * x + huber.intercept_
+    plt.plot(x, coef_, colors[k], label="huber loss, %s" % epsilon)
+
+# Fit a ridge regressor to compare it to huber regressor.
+ridge = Ridge(fit_intercept=True, alpha=0.0, random_state=0, normalize=True)
+ridge.fit(X, y)
+coef_ridge = ridge.coef_
+coef_ = ridge.coef_ * x + ridge.intercept_
+plt.plot(x, coef_, 'g-', label="ridge regression")
+
+plt.title("Comparison of HuberRegressor vs Ridge")
+plt.xlabel("X")
+plt.ylabel("y")
+plt.legend(loc=0)
+plt.show()

examples/linear_model/plot_robust_fit.py

Lines changed: 16 additions & 8 deletions
@@ -22,14 +22,20 @@
 - RANSAC is good for strong outliers in the y direction
 
 - TheilSen is good for small outliers, both in direction X and y, but has
-  a break point above which it performs worst than OLS.
+  a break point above which it performs worse than OLS.
+
+- The scores of HuberRegressor may not be compared directly to both TheilSen
+  and RANSAC because it does not attempt to completely filter the outliers
+  but to lessen their effect.
 
 """
 
 from matplotlib import pyplot as plt
 import numpy as np
 
-from sklearn import linear_model, metrics
+from sklearn.linear_model import (
+    LinearRegression, TheilSenRegressor, RANSACRegressor, HuberRegressor)
+from sklearn.metrics import mean_squared_error
 from sklearn.preprocessing import PolynomialFeatures
 from sklearn.pipeline import make_pipeline
@@ -56,12 +62,14 @@
 X_errors_large = X.copy()
 X_errors_large[::3] = 10
 
-estimators = [('OLS', linear_model.LinearRegression()),
-              ('Theil-Sen', linear_model.TheilSenRegressor(random_state=42)),
-              ('RANSAC', linear_model.RANSACRegressor(random_state=42)), ]
-colors = {'OLS': 'turquoise', 'Theil-Sen': 'gold', 'RANSAC': 'lightgreen'}
-linestyle = {'OLS': '-', 'Theil-Sen': '-.', 'RANSAC': '--'}
+estimators = [('OLS', LinearRegression()),
+              ('Theil-Sen', TheilSenRegressor(random_state=42)),
+              ('RANSAC', RANSACRegressor(random_state=42)),
+              ('HuberRegressor', HuberRegressor())]
+colors = {'OLS': 'turquoise', 'Theil-Sen': 'gold',
+          'RANSAC': 'lightgreen', 'HuberRegressor': 'black'}
+linestyle = {'OLS': '-', 'Theil-Sen': '-.', 'RANSAC': '--', 'HuberRegressor': '--'}
 lw = 3
+
 x_plot = np.linspace(X.min(), X.max())
 for title, this_X, this_y in [
     ('Modeling Errors Only', X, y),

@@ -75,7 +83,7 @@
     for name, estimator in estimators:
         model = make_pipeline(PolynomialFeatures(3), estimator)
         model.fit(this_X, this_y)
-        mse = metrics.mean_squared_error(model.predict(X_test), y_test)
+        mse = mean_squared_error(model.predict(X_test), y_test)
         y_plot = model.predict(x_plot[:, np.newaxis])
         plt.plot(x_plot, y_plot, color=colors[name], linestyle=linestyle[name],
                  linewidth=lw, label='%s: error = %.3f' % (name, mse))

sklearn/linear_model/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -18,6 +18,7 @@
                          lasso_path, enet_path, MultiTaskLasso,
                          MultiTaskElasticNet, MultiTaskElasticNetCV,
                          MultiTaskLassoCV)
+from .huber import HuberRegressor
 from .sgd_fast import Hinge, Log, ModifiedHuber, SquaredLoss, Huber
 from .stochastic_gradient import SGDClassifier, SGDRegressor
 from .ridge import (Ridge, RidgeCV, RidgeClassifier, RidgeClassifierCV,

@@ -39,7 +40,7 @@
     'ElasticNet',
     'ElasticNetCV',
     'Hinge',
-    'Huber',
+    'HuberRegressor',
     'Lars',
     'LarsCV',
     'Lasso',
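With the import and __all__ change above, the estimator is exposed through the public API; a minimal usage sketch (illustrative data only):

from sklearn.datasets import make_regression
from sklearn.linear_model import HuberRegressor

# Fit the newly exported robust estimator with its default epsilon=1.35.
X, y = make_regression(n_samples=100, n_features=3, noise=4.0, random_state=0)
huber = HuberRegressor().fit(X, y)
print(huber.coef_, huber.intercept_)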
