8000 Censor out-of-bounds x values for isotonic regression by mjbommar · Pull Request #3147 · scikit-learn/scikit-learn · GitHub

Censor out-of-bounds x values for isotonic regression #3147


Closed
mjbommar wants to merge 10 commits into scikit-learn:master from mjbommar:isotonic-censoring-2

Conversation

mjbommar
Copy link
Contributor

predict and transform in IsotonicRegression rely on interp1d, which fails if an element of X is outside of [x_min, x_max] for the vector that the model was fit on.

For example, the block below will fail with the following exception:

import sklearn.datasets
import sklearn.isotonic

# Load Boston house price data
boston_data = sklearn.datasets.load_boston()

# Load features/targets
features = boston_data.data
target = boston_data.target
indus_index = boston_data.feature_names.tolist().index('INDUS')

# Fit the model
reg = sklearn.isotonic.IsotonicRegression(x_censor=False, increasing=False)
reg.fit(features[:, indus_index], target)

# Now shift X up and down
y_all_up = reg.predict(features[:, indus_index] + 1)
y_all_down = reg.predict(features[:, indus_index] - 1)

ValueError: A value in x_new is above the interpolation range.

To make this more convenient for usage in real-world scenarios, I added a simple x_censor argument to the IsotonicRegression constructor. If set to True, the predict and transform methods will map out-of-bounds x values to the nearest bound. Test included.
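For context, the manual workaround that x_censor=True automates looks roughly like the following sketch (the toy data and variable names are illustrative, not from the PR): compute the training bounds yourself and clip the test inputs with np.clip before calling predict.

import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.RandomState(0)
X_train = np.sort(rng.uniform(0, 10, size=50))
y_train = np.log1p(X_train) + rng.normal(scale=0.1, size=50)

reg = IsotonicRegression()
reg.fit(X_train, y_train)

# Shifted test inputs fall partly outside the training range, which
# would otherwise raise a ValueError from interp1d.
X_test = X_train + 1.0
X_test_clipped = np.clip(X_test, X_train.min(), X_train.max())
y_pred = reg.predict(X_test_clipped)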

@agramfort
Copy link
Member

I think it's a bug that it fails when x_test is out of bounds.

I would remove x_censor and support this by default.

@NelleV @fabianp any opinion?

@jnothman
Copy link
Member

Thanks @mjbommar.

I don't use isotonic regression enough to know whether your suggestion is appropriate. However:

  • you should not modify the input to predict and transform (i.e. you need to make a copy)
  • you should use numpy.clip to trim back those outlying values (see the sketch just below this list)
  • x_censor is not a very good name; clip_x might be more apt
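As a quick illustration of the np.clip suggestion (hypothetical values; x_min and x_max stand in for the bounds learned during fit): np.clip returns a new array, so the caller's input is left untouched.

import numpy as np

T = np.array([-2.0, 0.5, 3.0, 7.5])  # values passed to predict/transform
x_min, x_max = 0.0, 5.0              # bounds of the training data

T_clipped = np.clip(T, x_min, x_max)
print(T)          # original array is unchanged
print(T_clipped)  # out-of-range values mapped to the nearest bound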

And tests are failing.

@mjbommar
Copy link
Contributor Author

@agramfort, that's another fine solution for the use case.

@jnothman, let me switch to np.clip/clip_x for now until a decision is made regarding default behavior.

@mjbommar
Copy link
Contributor Author
  • Switched arguments to clip_x
  • Switched to np.clip, don't touch input array T
  • Cleaned up unit tests, working on clean clone/make/nosetests
  • Updated docstrings

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling 346cf03 on mjbommar:isotonic-censoring-2 into d62971d on scikit-learn:master.

@fabianp
Copy link
Member
fabianp commented May 13, 2014

Glad to see this PR, I've stumbled upon this annoying behaviour before but didn't fix it. Thanks!

# Handle the left and right bounds on X if clipping
if self.clip_x:
    self.X_min = np.min(self.X_)
    self.X_max = np.max(self.X_)
Member

if it's estimated from the data it should be named self.X_min_ and self.X_max_

I personally think that clip_x should be the default behavior and that this extra param could be avoided.

Member

I personally think that clip_x should be the default behavior and that this extra param could be avoided.

+1

@mjbommar
Copy link
Contributor Author

OK, @agramfort, made clipping the default behavior, removed the clip_x argument, and changed the private attributes to X_min_ and X_max_.


# Handle the left and right bounds on X
self.X_min_ = np.min(self.X_)
self.X_max_ = np.max(self.X_)
Member

you should update the attributes in the docstring. Then +1 for merge.

@mjbommar
Copy link
Contributor Author

OK, sorry about that. Added now, as well as a short line in the class description indicating how out-of-bound values are handled.

@NelleV
Copy link
Member
NelleV commented May 16, 2014

@agramfort We actually discussed that matter before and decided that this was the correct behaviour. There are tons of ways to deal with out-of-bounds prediction, and IMO this should be done by the user and not silently in the program.

@ogrisel
Copy link
Member
ogrisel commented May 16, 2014

The travis failures are unrelated (unstable OMP test and random timeout in a joblib test that I fail to reproduce on my box). Please ignore.

@ogrisel
Copy link
Member
ogrisel commented May 16, 2014

@NelleV I think both @fabianp and @mjbommar find the previous behavior (without clipping) annoying, so I think this is a reasonable request to at least have the clipping behavior there by default.

I would be +1 to re-add the clip_input=True option, true by default, to let advanced users disable the clipping and handle that case manually if they wish. Out of curiosity @NelleV, do you have an example to share where auto-clipping would be detrimental and another strategy should be preferred?

@NelleV
Copy link
Member
NelleV commented May 16, 2014

@ogrisel In my case, for Xs smaller than those the isotonic regression was fitted on, the output should be 0, while for larger ones it depends on the dataset (i.e., the organism I study; often I expect a power-law fit, so clipping would be a terrible choice).

I think there is a reason why interpolate.interp1d never infers out-of-bounds values. Sure, it is annoying to have to deal with those, but it is even more annoying to have to deal with a random bug in your code because of an implementation choice that is very specific to one user's needs. That is why, when we implemented this, we chose the default behaviour of interp1d and let each user deal with out-of-bounds values.
I think a much more reasonable choice would be to return nan for out-of-bounds values (which is what interp1d does when you set bounds_error to False).
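For reference, a small standalone illustration of the interp1d behaviour mentioned here (not code from this PR): with bounds_error=False, fill_value defaults to nan, so out-of-range queries return nan instead of raising.

import numpy as np
from scipy.interpolate import interp1d

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.8, 0.9, 1.0])

f = interp1d(x, y, bounds_error=False)  # fill_value defaults to nan
print(f([-1.0, 1.5, 4.0]))              # [ nan  0.85  nan]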

@agramfort
Copy link
Member

Honestly, I have no strong preference besides making it work out of the box as much as possible.

@mjbommar
Copy link
Contributor Author

@NelleV, I don't disagree that clipping is not always desired, which is why my original approach was not to change the default behavior. However, don't we have other regression methods implemented in sklearn whose behavior might make no sense outside the training set domain? That doesn't stop us from extrapolating in those cases by default.

What's the resolution process for a decision? I would really like to finalize this in an acceptable way so that I can stop using my personal branch and switch back to sklearn master.

@agramfort
Copy link
Member

returning nan would be fine with me too. Again no strong feeling. I just think it should work out of the box as much as possible.

@GaelVaroquaux
Copy link
Member

I just think it should work out of the box as much as possible.

+1

@ogrisel
Copy link
Member
ogrisel commented May 19, 2014

However, don't we have other regression methods implemented in sklearn whose behavior might make no sense outside the training set domain?

The typical case is linear regression where this is almost always the case, but nobody would expect the model to refuse to make a prediction in that case.

@mjbommar please re-add a constructor param to control the behavior for the handling of out of bound values such as:

out_of_bounds="clip", out_of_bounds="raise", out_of_bounds="nan".

I would still use out_of_bounds="clip" as the default to make it work out of the box as much as possible while warning the user that depending on the application, clip might be misleading.
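To make the proposal concrete, here is a rough sketch (not the implementation that eventually landed in #3199; function and variable names are invented for illustration) of how the three modes could be wired around interp1d:

import numpy as np
from scipy.interpolate import interp1d

def fit_interpolator(X_train, y_fit, out_of_bounds="clip"):
    """Build the interpolation function; "raise" keeps interp1d's usual
    bounds check, while "clip" and "nan" disable it."""
    x_min, x_max = np.min(X_train), np.max(X_train)
    f = interp1d(X_train, y_fit, kind="linear",
                 bounds_error=(out_of_bounds == "raise"))
    return f, x_min, x_max

def predict_interpolator(f, x_min, x_max, T, out_of_bounds="clip"):
    """Apply the fitted function: with "clip", map out-of-range inputs to
    the nearest training bound; with "nan", interp1d built with
    bounds_error=False returns nan outside the range."""
    T = np.asarray(T, dtype=np.float64)
    if out_of_bounds == "clip":
        T = np.clip(T, x_min, x_max)
    return f(T)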

@GaelVaroquaux
Copy link
Member

I would still use out_of_bounds="clip" as the default to make it work out of the box as much as possible while warning the user that depending on the application, clip might be misleading.

+1

@NelleV
Copy link
Member
NelleV commented May 19, 2014

On 19 May 2014 19:01, Olivier Grisel notifications@github.com wrote:

However, don't we have other regression methods implemented in sklearn whose behavior might make no sense outside the training set domain?

The typical case is linear regression where this is almost always the case, but nobody would expect the model to refuse to make a prediction in that case.

It makes sense for any model that is parametric, but not so much for non-parametric models such as isotonic regression. In fact, clipping values can (and in most cases would) give extremely poor results.

@agramfort
Copy link
Member

It makes sense for any model that is parametric, but not so much for non-parametric models such as isotonic regression. In fact, clipping values can (and in most cases would) give extremely poor results.

Maybe, but in any cross-val loop you'll have this problem, as the extreme values will at some point be in the test set.

@ogrisel
Copy link
Member
ogrisel commented May 20, 2014

It makes sense for any model that is parametric, but not so much for non-parametric models such as isotonic regression. In fact, clipping values can (and in most cases would) give extremely poor results.

Then consider Random Forest Regressor and GBRT: non-parametric, piecewise-constant regression models.

@NelleV
Copy link
Member
NelleV commented May 20, 2014

It makes sense for any model that is parametric, but not so much for non-parametric models such as isotonic regression. In fact, clipping values can (and in most cases would) give extremely poor results.

Then consider Random Forest Regressor and GBRT: non-parametric, piecewise-constant regression models.

I don't know about GBRT, but for random forest (1) it is clearly defined in the model to be constant per area, and (2) there is bagging to compensate for this. Two points that are never part of isotonic regression models.

In the case of Fabian and Alex, it makes sense to clip the values, as they fit an isotonic regression on a cumulative distribution function (thus the values should be trimmed to either 0 or 1), but this is not the most common use case. For instance, if you look at the example on the Wikipedia page (which I think is the example I designed for scikit-learn), you can easily see that it would be a very poor choice to clip the values.

I understand that you'd want this to work "by default", in order to avoid angry emails on the mailing list. But it doesn't make it more right. So at least, let's not pretend it is.

Last but not least, I truly believe that it is better for a program to fail nicely instead of returning something that doesn't make sense. I am not against adding a parameter to have the clipping behavior available, but I really wouldn't make that the default behavior.

@ogrisel
Copy link
Member
ogrisel commented May 20, 2014

I don't know about GBRT, but for random forest (1) it is clearly defined in the model to be constant per area, and (2) there is bagging to compensate for this.

Bagging does not compensate for anything in areas outside of the training set. It has exactly the same clipping bias and cannot generalize any non-constant trend present in the data. However, Random Forests are still very useful in practice.

To me the clipping bias sounds like a very reasonable default, even in the Wikipedia example. The best way to counter its effect is to collect more samples at the boundaries of the training set.

@mjbommar
Copy link
Contributor Author

I'll add the out_of_bounds={clip, nan, raise} behavior, but if my PR 3157 isn't too controversial, could we finalize that one first so as to avoid merge issues?

GaelVaroquaux added a commit that referenced this pull request May 26, 2014
Handling out-of-bounds in IsotonicRegression (clean PR #3147)
@mjbommar
Copy link
Contributor Author

Closing unmerged, as this was completed in #3199.
