Merge pull request #5291 from MechCoder/huber_loss · scikit-learn/scikit-learn@540c7c6 · GitHub
Commit 540c7c6 (2 parents: 7895d38 + 7e79adc)

Merge pull request #5291 from MechCoder/huber_loss

[MRG+1] Add Huber Estimator to sklearn linear models
10 files changed: +651 / -17 lines changed

doc/modules/classes.rst

Lines changed: 1 addition & 0 deletions

@@ -689,6 +689,7 @@ Kernels:
    linear_model.BayesianRidge
    linear_model.ElasticNet
    linear_model.ElasticNetCV
+   linear_model.HuberRegressor
    linear_model.Lars
    linear_model.LarsCV
    linear_model.Lasso

doc/modules/linear_model.rst

Lines changed: 77 additions & 7 deletions

@@ -902,15 +902,24 @@ in these settings.
 
 .. topic:: **Trade-offs: which estimator?**
 
-  Scikit-learn provides 2 robust regression estimators:
-  :ref:`RANSAC <ransac_regression>` and
-  :ref:`Theil Sen <theil_sen_regression>`
+  Scikit-learn provides 3 robust regression estimators:
+  :ref:`RANSAC <ransac_regression>`,
+  :ref:`Theil Sen <theil_sen_regression>` and
+  :ref:`HuberRegressor <huber_regression>`
 
-  * :ref:`RANSAC <ransac_regression>` is faster, and scales much better
-    with the number of samples
+  * :ref:`HuberRegressor <huber_regression>` should be faster than
+    :ref:`RANSAC <ransac_regression>` and :ref:`Theil Sen <theil_sen_regression>`
+    unless the number of samples is very large, i.e. ``n_samples`` >> ``n_features``.
+    This is because :ref:`RANSAC <ransac_regression>` and :ref:`Theil Sen <theil_sen_regression>`
+    fit on smaller subsets of the data. However, both :ref:`Theil Sen <theil_sen_regression>`
+    and :ref:`RANSAC <ransac_regression>` are unlikely to be as robust as
+    :ref:`HuberRegressor <huber_regression>` with the default parameters.
 
-  * :ref:`RANSAC <ransac_regression>` will deal better with large
-    outliers in the y direction (most common situation)
+  * :ref:`RANSAC <ransac_regression>` is faster than :ref:`Theil Sen <theil_sen_regression>`
+    and scales much better with the number of samples
+
+  * :ref:`RANSAC <ransac_regression>` will deal better with large
+    outliers in the y direction (most common situation)
 
   * :ref:`Theil Sen <theil_sen_regression>` will cope better with
     medium-size outliers in the X direction, but this property will
@@ -1050,6 +1059,67 @@ considering only a random subset of all possible combinations.
 
 .. [#f2] T. Kärkkäinen and S. Äyrämö: `On Computation of Spatial Median for Robust Data Mining. <http://users.jyu.fi/~samiayr/pdf/ayramo_eurogen05.pdf>`_
 
+.. _huber_regression:
+
+Huber Regression
+----------------
+
+The :class:`HuberRegressor` differs from :class:`Ridge` because it applies a
+linear loss to samples that are classified as outliers.
+A sample is classified as an inlier if the absolute error of that sample is
+less than a certain threshold. It differs from :class:`TheilSenRegressor`
+and :class:`RANSACRegressor` because it does not ignore the effect of the outliers
+but gives a lesser weight to them.
+
+.. figure:: ../auto_examples/linear_model/images/plot_huber_vs_ridge_001.png
+   :target: ../auto_examples/linear_model/plot_huber_vs_ridge.html
+   :align: center
+   :scale: 50%
+
+The loss function that :class:`HuberRegressor` minimizes is given by
+
+.. math::
+
+  \underset{w, \sigma}{\min} {\sum_{i=1}^n\left(\sigma + H_m\left(\frac{X_{i} w - y_{i}}{\sigma}\right)\sigma\right) + \alpha {||w||_2}^2}
+
+where
+
+.. math::
+
+  H_m(z) = \begin{cases}
+         z^2, & \text{if } |z| < \epsilon, \\
+         2\epsilon|z| - \epsilon^2, & \text{otherwise}
+  \end{cases}
+
+It is advised to set the parameter ``epsilon`` to 1.35 to achieve 95% statistical efficiency.
+
+Notes
+-----
+The :class:`HuberRegressor` differs from using :class:`SGDRegressor` with loss set to ``huber``
+in the following ways.
+
+- :class:`HuberRegressor` is scaling invariant. Once ``epsilon`` is set, scaling ``X`` and ``y``
+  down or up by different values produces the same robustness to outliers as before,
+  whereas with :class:`SGDRegressor` ``epsilon`` has to be set again when ``X`` and ``y`` are
+  scaled.
+
+- :class:`HuberRegressor` should be more efficient to use on data with a small number of
+  samples, while :class:`SGDRegressor` needs a number of passes over the training data to
+  produce the same robustness.
+
+.. topic:: Examples:
+
+  * :ref:`example_linear_model_plot_huber_vs_ridge.py`
+
+.. topic:: References:
+
+  .. [#f1] Peter J. Huber, Elvezio M. Ronchetti: Robust Statistics, Concomitant scale estimates, pg 172
+
+Also, this estimator is different from the R implementation of Robust Regression
+(http://www.ats.ucla.edu/stat/r/dae/rreg.htm) because the R implementation performs a weighted
+least squares fit, with weights given to each sample on the basis of how much the residual is
+greater than a certain threshold.
+
 .. _polynomial_regression:
 
 Polynomial regression: extending linear models with basis functions
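As a quick illustration of the documented behaviour (not part of this commit), the sketch below assumes the merged HuberRegressor API with its ``epsilon`` and ``alpha`` parameters: it uses the recommended ``epsilon=1.35`` and checks the scaling-invariance note above, i.e. rescaling ``y`` simply rescales the fitted coefficients without retuning ``epsilon``.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import HuberRegressor

    # Toy data with a few strong outliers in y.
    X, y = make_regression(n_samples=100, n_features=2, noise=4.0, random_state=0)
    y[::10] += 100.0

    # Recommended epsilon for 95% statistical efficiency; alpha=0 so the
    # scaling argument below holds exactly rather than approximately.
    huber = HuberRegressor(epsilon=1.35, alpha=0.0).fit(X, y)

    # Rescaling the targets rescales the solution; the same samples are
    # treated as outliers, unlike SGDRegressor where epsilon would need
    # to be set again after scaling.
    huber_scaled = HuberRegressor(epsilon=1.35, alpha=0.0).fit(X, 10.0 * y)
    print(np.allclose(10.0 * huber.coef_, huber_scaled.coef_, rtol=1e-3))
    # Expected to print True, up to optimizer tolerance.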

doc/whats_new.rst

Lines changed: 3 additions & 0 deletions

@@ -35,6 +35,9 @@ New features
    - Added new supervised learning algorithm: :ref:`Multi-layer Perceptron <multilayer_perceptron>`
      (`#3204 <https://github.com/scikit-learn/scikit-learn/pull/3204>`_) by `Issam H. Laradji`_
 
+   - Added :class:`linear_model.HuberRegressor`, a linear model robust to outliers.
+     (`#5291 <https://github.com/scikit-learn/scikit-learn/pull/5291>`_) by `Manoj Kumar`_.
+
 Enhancements
 ............

examples/linear_model/plot_huber_vs_ridge.py (new file)

Lines changed: 65 additions & 0 deletions

@@ -0,0 +1,65 @@
"""
=======================================================
HuberRegressor vs Ridge on dataset with strong outliers
=======================================================

Fit Ridge and HuberRegressor on a dataset with outliers.

The example shows that the predictions in ridge are strongly influenced
by the outliers present in the dataset. The Huber regressor is less
influenced by the outliers since the model uses the linear loss for these.
As the parameter epsilon is increased for the Huber regressor, the decision
function approaches that of the ridge.
"""

# Authors: Manoj Kumar mks542@nyu.edu
# License: BSD 3 clause

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_regression
from sklearn.linear_model import HuberRegressor, Ridge

# Generate toy data.
rng = np.random.RandomState(0)
X, y = make_regression(n_samples=20, n_features=1, random_state=0, noise=4.0,
                       bias=100.0)

# Add four strong outliers to the dataset.
X_outliers = rng.normal(0, 0.5, size=(4, 1))
y_outliers = rng.normal(0, 2.0, size=4)
X_outliers[:2, :] += X.max() + X.mean() / 4.
X_outliers[2:, :] += X.min() - X.mean() / 4.
y_outliers[:2] += y.min() - y.mean() / 4.
y_outliers[2:] += y.max() + y.mean() / 4.
X = np.vstack((X, X_outliers))
y = np.concatenate((y, y_outliers))
plt.plot(X, y, 'b.')

# Fit the Huber regressor over a series of epsilon values.
colors = ['r-', 'b-', 'y-', 'm-']

x = np.linspace(X.min(), X.max(), 7)
epsilon_values = [1.35, 1.5, 1.75, 1.9]
for k, epsilon in enumerate(epsilon_values):
    huber = HuberRegressor(fit_intercept=True, alpha=0.0, max_iter=100,
                           epsilon=epsilon)
    huber.fit(X, y)
    coef_ = huber.coef_ * x + huber.intercept_
    plt.plot(x, coef_, colors[k], label="huber loss, %s" % epsilon)

# Fit a ridge regressor to compare it to the Huber regressor.
ridge = Ridge(fit_intercept=True, alpha=0.0, random_state=0, normalize=True)
ridge.fit(X, y)
coef_ridge = ridge.coef_
coef_ = ridge.coef_ * x + ridge.intercept_
plt.plot(x, coef_, 'g-', label="ridge regression")

plt.title("Comparison of HuberRegressor vs Ridge")
plt.xlabel("X")
plt.ylabel("y")
plt.legend(loc=0)
plt.show()
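As a side note (not part of the commit), the docstring's claim that increasing ``epsilon`` drives the Huber fit toward the ridge fit can be checked numerically: with regularization turned off, a very large ``epsilon`` treats every sample as an inlier, so the fit approaches ordinary least squares (``LinearRegression`` stands in for ``Ridge(alpha=0.0)`` in this sketch).

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import HuberRegressor, LinearRegression

    X, y = make_regression(n_samples=50, n_features=1, noise=4.0, random_state=0)
    # With a huge epsilon no residual exceeds the threshold, so the Huber
    # loss reduces to squared loss.
    huber = HuberRegressor(epsilon=1e3, alpha=0.0).fit(X, y)
    ols = LinearRegression().fit(X, y)
    print(np.allclose(huber.coef_, ols.coef_, atol=1e-2))
    # Expected to print True, up to optimizer tolerance.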

examples/linear_model/plot_robust_fit.py

Lines changed: 16 additions & 8 deletions

@@ -22,14 +22,20 @@
 - RANSAC is good for strong outliers in the y direction
 
 - TheilSen is good for small outliers, both in direction X and y, but has
-  a break point above which it performs worst than OLS.
+  a break point above which it performs worse than OLS.
+
+- The scores of HuberRegressor may not be compared directly to both TheilSen
+  and RANSAC because it does not attempt to completely filter the outliers
+  but only to lessen their effect.
 
 """
 
 from matplotlib import pyplot as plt
 import numpy as np
 
-from sklearn import linear_model, metrics
+from sklearn.linear_model import (
+    LinearRegression, TheilSenRegressor, RANSACRegressor, HuberRegressor)
+from sklearn.metrics import mean_squared_error
 from sklearn.preprocessing import PolynomialFeatures
 from sklearn.pipeline import make_pipeline
 
@@ -56,12 +62,14 @@
 X_errors_large = X.copy()
 X_errors_large[::3] = 10
 
-estimators = [('OLS', linear_model.LinearRegression()),
-              ('Theil-Sen', linear_model.TheilSenRegressor(random_state=42)),
-              ('RANSAC', linear_model.RANSACRegressor(random_state=42)), ]
-colors = {'OLS': 'turquoise', 'Theil-Sen': 'gold', 'RANSAC': 'lightgreen'}
-linestyle = {'OLS': '-', 'Theil-Sen': '-.', 'RANSAC': '--'}
+estimators = [('OLS', LinearRegression()),
+              ('Theil-Sen', TheilSenRegressor(random_state=42)),
+              ('RANSAC', RANSACRegressor(random_state=42)),
+              ('HuberRegressor', HuberRegressor())]
+colors = {'OLS': 'turquoise', 'Theil-Sen': 'gold', 'RANSAC': 'lightgreen', 'HuberRegressor': 'black'}
+linestyle = {'OLS': '-', 'Theil-Sen': '-.', 'RANSAC': '--', 'HuberRegressor': '--'}
 lw = 3
+
 x_plot = np.linspace(X.min(), X.max())
 for title, this_X, this_y in [
     ('Modeling Errors Only', X, y),
@@ -75,7 +83,7 @@
     for name, estimator in estimators:
         model = make_pipeline(PolynomialFeatures(3), estimator)
         model.fit(this_X, this_y)
-        mse = metrics.mean_squared_error(model.predict(X_test), y_test)
+        mse = mean_squared_error(model.predict(X_test), y_test)
         y_plot = model.predict(x_plot[:, np.newaxis])
         plt.plot(x_plot, y_plot, color=colors[name], linestyle=linestyle[name],
                  linewidth=lw, label='%s: error = %.3f' % (name, mse))

sklearn/linear_model/__init__.py

Lines changed: 2 additions & 1 deletion

@@ -18,6 +18,7 @@
                                  lasso_path, enet_path, MultiTaskLasso,
                                  MultiTaskElasticNet, MultiTaskElasticNetCV,
                                  MultiTaskLassoCV)
+from .huber import HuberRegressor
 from .sgd_fast import Hinge, Log, ModifiedHuber, SquaredLoss, Huber
 from .stochastic_gradient import SGDClassifier, SGDRegressor
 from .ridge import (Ridge, RidgeCV, RidgeClassifier, RidgeClassifierCV,
@@ -39,7 +40,7 @@
            'ElasticNet',
            'ElasticNetCV',
            'Hinge',
-           'Huber',
+           'HuberRegressor',
            'Lars',
            'LarsCV',
            'Lasso',
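A short smoke test (not part of the commit) for the new public import path and the updated ``__all__`` entry:

    from sklearn import linear_model
    from sklearn.linear_model import HuberRegressor

    print('HuberRegressor' in linear_model.__all__)  # True once this change is in
    print(HuberRegressor().epsilon)                  # 1.35, the documented default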
