address @amueller comments · scikit-learn/scikit-learn@3a3a230 · GitHub

Commit 3a3a230

address @amueller comments
1 parent ada333d commit 3a3a230

5 files changed: +40 additions, −25 deletions

doc/modules/linear_model.rst

Lines changed: 4 additions & 4 deletions
@@ -909,7 +909,7 @@ in these settings.
 
 * :ref:`HuberRegressor <huber_regression>` should be faster than
   :ref:`RANSAC <ransac_regression>` and :ref:`Theil Sen <theil_sen_regression>`
-  unless the number of samples are very large, i.e n_samples >> n_features.
+  unless the number of samples are very large, i.e ``n_samples`` >> ``n_features``.
   This is because :ref:`RANSAC <ransac_regression>` and :ref:`Theil Sen <theil_sen_regression>`
   fit on smaller subsets of the data. However, both :ref:`Theil Sen <theil_sen_regression>`
   and :ref:`RANSAC <ransac_regression>` are unlikely to be as robust as
@@ -1064,8 +1064,8 @@ considering only a random subset of all possible combinations.
 Huber Regression
 ----------------
 
-The :class:`HuberRegressor` is similar to :class:`Ridge` that it also applies
-L2 regularization and a squared loss for samples which are classified as inliers.
+The :class:`HuberRegressor` is different to :class:`Ridge` because it applies a
+linear loss to samples that are classified as outliers.
 A sample is classified as an inlier if the absolute error of that sample is
 lesser than a certain threshold. It differs from :class:`TheilSenRegressor`
 and :class:`RANSACRegressor` because it does not ignore the effect of the outliers
@@ -1100,7 +1100,7 @@ in the following ways.
 
 - :class:`HuberRegressor` is scaling invariant. Once ``epsilon`` is set, scaling ``X`` and ``y``
   down or up by different values would produce the same robustness to outliers as before.
-  as compared to :class:`SGDRegessor` where `epsilon` has to be set again when ``X`` and ``y`` are
+  as compared to :class:`SGDRegressor` where ``epsilon`` has to be set again when ``X`` and ``y`` are
   scaled.
 
 - :class:`HuberRegressor` should be more efficient to use on data with small number of
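
(Not part of the commit.) As a quick illustration of the documented behaviour, a minimal sketch comparing HuberRegressor against Ridge when a few targets are grossly corrupted; the dataset and corruption values are arbitrary:

# Illustrative sketch only: a handful of corrupted targets should barely
# move HuberRegressor's coefficients, while Ridge's squared loss lets
# the outliers drag its fit away.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import HuberRegressor, Ridge

rng = np.random.RandomState(0)
X, y, coef = make_regression(n_samples=200, n_features=2, noise=4.0,
                             coef=True, random_state=0)
y[:4] += 500 * rng.normal(size=4)  # corrupt four targets

huber = HuberRegressor(epsilon=1.35, alpha=0.0001).fit(X, y)
ridge = Ridge(alpha=0.0001).fit(X, y)
print(coef, huber.coef_, ridge.coef_)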

examples/linear_model/plot_robust_fit.py

Lines changed: 3 additions & 2 deletions
@@ -24,8 +24,9 @@
 - TheilSen is good for small outliers, both in direction X and y, but has
   a break point above which it performs worse than OLS.
 
-- HuberRegressor should not differ much in performance to both RANSAC
-  and TheilSen due to outliers in both X and y direction, since it checks if
+- HuberRegressor may not be compared directly to both TheilSen and RANSAC
+  because it does not attempt to completely filter the outliers but
+  lessen their effect. The higher the deviance of the outliers
   the mean absolute error is lesser than a certain threshold.
 
 """

sklearn/linear_model/huber.py

Lines changed: 18 additions & 15 deletions
@@ -18,33 +18,33 @@ def _huber_loss_and_gradient(w, X, y, epsilon, alpha, sample_weight=None):
 
     Parameters
     ----------
-    w: ndarray, shape (n_features + 1,) or (n_features + 2,)
+    w : ndarray, shape (n_features + 1,) or (n_features + 2,)
         Feature vector.
-        w[:n_features] gives the feature vector
+        w[:n_features] gives the coefficients
         w[-1] gives the scale factor and if the intercept is fit w[-2]
         gives the intercept factor.
 
-    X: ndarray, shape (n_samples, n_features)
+    X : ndarray, shape (n_samples, n_features)
         Input data.
 
-    y: ndarray, shape (n_samples,)
+    y : ndarray, shape (n_samples,)
         Target vector.
 
-    epsilon: float
+    epsilon : float
        Robustness of the Huber estimator.
 
-    alpha: float
+    alpha : float
        Regularization parameter.
 
-    sample_weight: ndarray, shape (n_samples,), optional
+    sample_weight : ndarray, shape (n_samples,), optional
        Weight assigned to each sample.
 
    Returns
    -------
    loss: float
        Huber loss.
 
-    gradient: ndarray, shape (n_features + 1,) or (n_features + 2,)
+    gradient: ndarray, shape (len(w))
        Returns the derivative of the Huber loss with respect to each
        coefficient, intercept and the scale as a vector.
    """
@@ -129,8 +129,9 @@ class HuberRegressor(LinearModel, RegressorMixin, BaseEstimator):
     ``|(y - X'w) / sigma| < epsilon`` and the absolute loss for the samples
     where ``|(y - X'w) / sigma| > epsilon``, where w and sigma are parameters
     to be optimized. The parameter sigma makes sure that if y is scaled up
-    or down by a certain factor, one does not need to rescale epsilon to acheive
-    the same robustness.
+    or down by a certain factor, one does not need to rescale epsilon to
+    achieve the same robustness. Note that this does not take into account
+    the fact that the different features of X may be of different scales.
 
     This makes sure that the loss function is not heavily influenced by the
     outliers while not completely ignoring their effect.
@@ -141,11 +142,12 @@ class HuberRegressor(LinearModel, RegressorMixin, BaseEstimator):
     ----------
     epsilon : float, greater than 1.0, default 1.35
         The parameter epsilon controls the number of samples that should be
-        classified as outliers. The lesser the epsilon, the more robust it is
+        classified as outliers. The smaller the epsilon, the more robust it is
         to outliers.
 
     max_iter : int, default 100
-        Number of iterations that scipy.optimize.fmin_l_bfgs_b should run for.
+        Maximum number of iterations that scipy.optimize.fmin_l_bfgs_b
+        should run for.
 
     alpha : float, default 0.0001
         Regularization parameter.
@@ -174,7 +176,7 @@ class HuberRegressor(LinearModel, RegressorMixin, BaseEstimator):
     scale_ : float
         The value by which ``|y - X'w - c|`` is scaled down.
 
-    n_iter_: int
+    n_iter_ : int
        Number of iterations that fmin_l_bfgs_b has run for.
        Not available if SciPy version is 0.9 and below.
 
@@ -207,7 +209,7 @@ def fit(self, X, y, sample_weight=None):
         y : array-like, shape (n_samples,)
             Target vector relative to X.
 
-        sample_weight: array-like, shape (n_samples,)
+        sample_weight : array-like, shape (n_samples,)
            Weight given to each sample.
 
        Returns
@@ -225,7 +227,8 @@ def fit(self, X, y, sample_weight=None):
 
         if self.epsilon < 1.0:
             raise ValueError(
-                "epsilon should be greater than 1.0, got %f" % self.epsilon)
+                "epsilon should be greater than or equal to 1.0, got %f"
+                % self.epsilon)
 
         if self.warm_start and hasattr(self, 'coef_'):
             parameters = np.concatenate(
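
(Not part of the commit.) For reference, a hedged sketch of the objective that _huber_loss_and_gradient minimizes, reconstructed from the docstring above; it is simplified to omit sample weights, the intercept, and the gradient, so treat it as a reading aid rather than the implementation:

# Huber objective with concomitant scale: sigma is optimized jointly
# with the coefficients, so rescaling y merely rescales sigma and the
# same epsilon keeps working.
import numpy as np

def huber_objective(w, X, y, epsilon=1.35, alpha=0.0001):
    coef, sigma = w[:-1], w[-1]
    residual = np.abs(y - X.dot(coef))
    outliers = residual > epsilon * sigma
    squared_loss = np.sum(residual[~outliers] ** 2) / sigma
    linear_loss = (2 * epsilon * np.sum(residual[outliers])
                   - sigma * np.count_nonzero(outliers) * epsilon ** 2)
    return (X.shape[0] * sigma + squared_loss + linear_loss
            + alpha * np.dot(coef, coef))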

sklearn/linear_model/tests/test_huber.py

Lines changed: 14 additions & 3 deletions
@@ -83,14 +83,24 @@ def test_huber_sample_weights():
     assert_array_almost_equal(huber.coef_, huber_coef, 3)
     assert_array_almost_equal(huber.intercept_, huber_intercept, 3)
 
-    # Test sparse implementation with sparse weights.
-    # Checking sparse=non_sparse should be covered in the common tests.
+    # Test sparse implementation with sample weights.
     X_csr = sparse.csr_matrix(X)
     huber_sparse = HuberRegressor(fit_intercept=True, alpha=0.1)
     huber_sparse.fit(X_csr, y, sample_weight=[1, 3, 1, 2, 1])
     assert_array_almost_equal(huber_sparse.coef_, huber_coef, 3)
 
 
+def test_huber_sparse():
+    X, y = make_regression_with_outliers()
+    huber = HuberRegressor(fit_intercept=True, alpha=0.1)
+    huber.fit(X, y)
+
+    X_csr = sparse.csr_matrix(X)
+    huber_sparse = HuberRegressor(fit_intercept=True, alpha=0.1)
+    huber_sparse.fit(X_csr, y)
+    assert_array_almost_equal(huber_sparse.coef_, huber.coef_)
+
+
 def return_outliers(X, y, huber):
     """Return the number of outliers."""
     return np.abs(huber.predict(X) - y) > huber.epsilon * huber.scale_
@@ -165,6 +175,8 @@ def test_huber_better_r2_score():
     huber_score = huber.score(X[mask], y[mask])
     huber_outlier_score = huber.score(X[~mask], y[~mask])
 
+    # The Ridge regressor should be influenced by the outliers and hence
+    # give a worse score on the non-outliers as compared to the huber regressor.
     ridge = Ridge(fit_intercept=True, alpha=0.01)
     ridge.fit(X, y)
     ridge_score = ridge.score(X[mask], y[mask])
@@ -173,4 +185,3 @@ def test_huber_better_r2_score():
 
     # The huber model should also fit poorly on the outliers.
     assert_greater(ridge_outlier_score, huber_outlier_score)
-
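
(Not part of the commit.) A usage sketch mirroring the new test_huber_sparse; make_regression stands in for the test module's make_regression_with_outliers helper:

# Dense and CSR inputs should produce (nearly) identical coefficients.
import numpy as np
from scipy import sparse
from sklearn.datasets import make_regression
from sklearn.linear_model import HuberRegressor

X, y = make_regression(n_samples=50, n_features=20, random_state=0)
dense = HuberRegressor(fit_intercept=True, alpha=0.1).fit(X, y)
csr = HuberRegressor(fit_intercept=True, alpha=0.1).fit(
    sparse.csr_matrix(X), y)
print(np.allclose(dense.coef_, csr.coef_, atol=1e-6))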
sklearn/utils/estimator_checks.py

Lines changed: 1 addition & 1 deletion
@@ -1488,7 +1488,7 @@ def check_non_transformer_estimators_n_iter(name, estimator,
     estimator.fit(X, y_)
 
     # HuberRegressor depends on scipy.optimize.fmin_l_bfgs_b
-    # which does return a n_iter for old versions of SciPy.
+    # which doesn't return a n_iter for old versions of SciPy.
     if not (name == 'HuberRegressor' and estimator.n_iter_ is None):
         assert_greater(estimator.n_iter_, 0)
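
(Not part of the commit.) The corrected comment refers to the info dict returned by fmin_l_bfgs_b; a small sketch of where n_iter_ comes from, with an arbitrary quadratic as the objective:

# info['nit'] only exists in newer SciPy versions, hence n_iter_ can be
# None when running on old SciPy.
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def f(w):
    return np.sum((w - 3.0) ** 2)

def fprime(w):
    return 2.0 * (w - 3.0)

w_opt, f_min, info = fmin_l_bfgs_b(f, np.zeros(2), fprime=fprime)
print(w_opt, info.get('nit'))  # .get() guards against the missing key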

0 commit comments
