8000 Censor out-of-bounds x values for isotonic regression by mjbommar · Pull Request #3147 · scikit-learn/scikit-learn · GitHub

Censor out-of-bounds x values for isotonic regression #3147


Closed
mjbommar wants to merge 10 commits into scikit-learn:master from mjbommar:isotonic-censoring-2

Conversation

mjbommar
Copy link
Contributor

predict and transform in IsotonicRegression rely on interp1d, which fails if an element of X is outside of [x_min, x_max] for the vector that the model was fit on.

For example, the block below will fail with the following exception:

import sklearn.datasets
import sklearn.isotonic

# Load Boston house price data
boston_data = sklearn.datasets.load_boston()

# Load features/targets
features = boston_data.data
target = boston_data.target
indus_index = boston_data.feature_names.tolist().index('INDUS')

# Fit the model
reg = sklearn.isotonic.IsotonicRegression(x_censor=False, increasing=False)
reg.fit(features[:, indus_index], target)

# Now shift X up and down
y_all_up = reg.predict(features[:, indus_index] + 1)
y_all_down = reg.predict(features[:, indus_index] - 1)

ValueError: A value in x_new is above the interpolation range.

To make this more convenient for usage in real-world scenarios, I added a simple x_censor argument to the IsotonicRegression constructor. If set to True, the predict and transform methods will map out-of-bounds x values to the nearest bound. Test included.
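For context, the manual workaround that x_censor=True automates looks roughly like the following sketch (the toy data and variable names are illustrative, not from the PR): compute the training bounds yourself and clip the test inputs with np.clip before calling predict.

import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.RandomState(0)
X_train = np.sort(rng.uniform(0, 10, size=50))
y_train = np.log1p(X_train) + rng.normal(scale=0.1, size=50)

reg = IsotonicRegression()
reg.fit(X_train, y_train)

# Shifted test inputs fall partly outside the training range, which
# would otherwise raise a ValueError from interp1d.
X_test = X_train + 1.0
X_test_clipped = np.clip(X_test, X_train.min(), X_train.max())
y_pred = reg.predict(X_test_clipped)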

@agramfort
Copy link
Member

I think it's a bug that it fails when x_test is out of bounds.

I would remove x_censor and support this by default.

@NelleV @fabianp any opinion?

@jnothman
Copy link
Member

Thanks @mjbommar.

I don't use isotonic regression enough to know whether your suggestion is appropriate. However:

  • you should not modify the input to predict and transform (i.e. you need to make a copy)
  • you should use numpy.clip to trim back those outlying values (see the sketch just below this list)
  • x_censor is not a very good name; clip_x might be more apt
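As a quick illustration of the np.clip suggestion (hypothetical values; x_min and x_max stand in for the bounds learned during fit): np.clip returns a new array, so the caller's input is left untouched.

import numpy as np

T = np.array([-2.0, 0.5, 3.0, 7.5])  # values passed to predict/transform
x_min, x_max = 0.0, 5.0              # bounds of the training data

T_clipped = np.clip(T, x_min, x_max)
print(T)          # original array is unchanged
print(T_clipped)  # out-of-range values mapped to the nearest bound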

And tests are failing.

@mjbommar
Copy link
Contributor Author

@agramfort, that's another fine solution for the use case.

@jnothman, let me switch to np.clip/clip_x for now until a decision is made regarding default behavior.

@mjbommar
Copy link
Contributor Author
  • Switched arguments to clip_x
  • Switched to np.clip, don't touch input array T
  • Cleaned up unit tests, working on clean clone/make/nosetests
  • Updated docstrings

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling 346cf03 on mjbommar:isotonic-censoring-2 into d62971d on scikit-learn:master.

@fabianp
Copy link
Member
fabianp commented May 13, 2014

Glad to see this PR, I've stumbled upon this annoying behaviour before but didn't fix it. Thanks!

# Handle the left and right bounds on X if clipping
if self.clip_x:
    self.X_min = np.min(self.X_)
    self.X_max = np.max(self.X_)
Member

if it's estimated from the data it should be named self.X_min_ and self.X_max_

I personally think that clip_x should be the default behavior and that this extra param could be avoided.

Member

I personally think that clip_x should be the default behavior and that this extra param could be avoided.

+1

@mjbommar
Copy link
Contributor Author

OK, @agramfort, made clipping the default behavior, removed the clip_x argument, and changed the private attributes to X_min_ and X_max_.


# Handle the left and right bounds on X
self.X_min_ = np.min(self.X_)
self.X_max_ = np.max(self.X_)
Member

you should update the attributes in the docstring. Then +1 for merge.

@mjbommar
Copy link
Contributor Author

OK, sorry about that. Added now, as well as a short line in the class description indicating how out-of-bound values are handled.

@NelleV
Copy link
Member
NelleV commented May 16, 2014

@agramfort We actually discussed that matter before and decided that this was the correct behaviour. There are tons of ways to deal with out-of-bounds prediction, and IMO this should be done by the user and not silently in the program.

@ogrisel
Copy link
Member
ogrisel commented May 16, 2014

The travis failures are unrelated (unstable OMP test and random timeout in a joblib test that I fail to reproduce on my box). Please ignore.

@ogrisel
Copy link
Member
ogrisel commented May 16, 2014

@NelleV I think both @fabianp and @mjbommar find the previous behavior (without clipping) annoying, so I think this is a reasonable request to at least have the clipping behavior there by default.

I would be +1 to re-add the clip_input=True option, true by default, to let advanced users disable the clipping and handle that case manually if they wish. Out of curiosity @NelleV, do you have an example to share where auto-clipping would be detrimental and another strategy should be preferred?

@NelleV
Copy link
Member
NelleV commented May 16, 2014

@ogrisel In my case, for Xs smaller than those the isotonic regression was fitted on, the output should be 0, while for larger ones it depends on the dataset (i.e., the organism I study; often I expect a power-law fit, so clipping would be a terrible choice).

I think there is a reason why interpolate.interp1d never infers out-of-bounds values. Sure, it is annoying to have to deal with those, but it is even more annoying to have to deal with a random bug in your code because of an implementation choice that is very specific to one user's needs. That is why, when we implemented this, we chose the default behaviour of interp1d and let each user deal with out-of-bounds values.
I think a much more reasonable choice would be to return nan for out-of-bounds values (which is what interp1d does when you set bounds_error to False).
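For reference, a small standalone illustration of the interp1d behaviour mentioned here (not code from this PR): with bounds_error=False, fill_value defaults to nan, so out-of-range queries return nan instead of raising.

import numpy as np
from scipy.interpolate import interp1d

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.8, 0.9, 1.0])

f = interp1d(x, y, bounds_error=False)  # fill_value defaults to nan
print(f([-1.0, 1.5, 4.0]))              # [ nan  0.85  nan]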

@agramfort
Copy link
Member

Honestly, I have no strong preference besides making it work out of the box as much as possible.

@mjbommar
Copy link
Contributor Author

@NelleV, I don't disagree that clipping is not always desired, which is why my original approach was not to change the default behavior. However, don't we have other regression methods implemented in sklearn whose behavior might make no sense outside the training set domain? That doesn't stop us from extrapolating in those cases by default.

What's the resolution process for a decision? I would really like to finalize this in an acceptable way so that I can stop using my personal branch and switch back to sklearn master.

@agramfort
Copy link
Member

returning nan would be fine with me too. Again no strong feeling. I just think it should work out of the box as much as possible.

@GaelVaroquaux
Copy link
Member

I just think it should work out of the box as much as possible.

+1

@ogrisel
Copy link
Member
ogrisel commented May 19, 2014

However, don't we have other regression methods implemented in sklearn whose behavior might make no sense outside the training set domain?

The typical case is linear regression where this is almost always the case, but nobody would expect the model to refuse to make a prediction in that case.

@mjbommar please re-add a constructor param to control the behavior for the handling of out of bound values such as:

out_of_bounds="clip", out_of_bounds="raise", out_of_bounds="nan".

I would still use out_of_bounds="clip" as the default to make it work out of the box as much as possible while warning the user that depending on the application, clip might be misleading.
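To make the proposal concrete, here is a rough sketch (not the implementation that eventually landed in #3199; function and variable names are invented for illustration) of how the three modes could be wired around interp1d:

import numpy as np
from scipy.interpolate import interp1d

def fit_interpolator(X_train, y_fit, out_of_bounds="clip"):
    """Build the interpolation function; "raise" keeps interp1d's usual
    bounds check, while "clip" and "nan" disable it."""
    x_min, x_max = np.min(X_train), np.max(X_train)
    f = interp1d(X_train, y_fit, kind="linear",
                 bounds_error=(out_of_bounds == "raise"))
    return f, x_min, x_max

def predict_interpolator(f, x_min, x_max, T, out_of_bounds="clip"):
    """Apply the fitted function: with "clip", map out-of-range inputs to
    the nearest training bound; with "nan", interp1d built with
    bounds_error=False returns nan outside the range."""
    T = np.asarray(T, dtype=np.float64)
    if out_of_bounds == "clip":
        T = np.clip(T, x_min, x_max)
    return f(T)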

@GaelVaroquaux
Copy link
Member

I would still use out_of_bounds="clip" as the default to make it work out of the box as much as possible while warning the user that depending on the application, clip might be misleading.

+1

@NelleV
Copy link
Member
NelleV commented May 19, 2014

On 19 May 2014 19:01, Olivier Grisel notifications@github.com wrote:

However, don't we have other regression methods implemented in sklearn whose behavior might make no sense outside the training set domain?

The typical case is linear regression where this is almost always the case, but nobody would expect the model to refuse to make a prediction in that case.

It makes sense for any model that is parametric, but not so much for non-parametric models such as isotonic regression. In fact, clipping values can (and in most cases would) give extremely poor results.

@agramfort
Copy link
Member

It makes sense for any model that is parametric, but not so much for non-parametric models such as isotonic regression. In fact, clipping values can (and in most cases would) give extremely poor results.

Maybe, but in any cross-val loop you'll have this problem, as the extreme values will at some point be in the test set.

@ogrisel
Copy link
Member
ogrisel commented May 20, 2014

It makes sense for any model that is parametric, but not so much for non-parametric models such as isotonic regression. In fact, clipping values can (and in most cases would) give extremely poor results.

Then consider Random Forest Regressor and GBRT: non-parametric, piecewise-constant regression models.

@NelleV
Copy link
Member
NelleV commented May 20, 2014

It makes sense for any model that is parametric, but not so much for non-parametric models such as isotonic regression. In fact, clipping values can (and in most cases would) give extremely poor results.

Then consider Random Forest Regressor and GBRT: non-parametric, piecewise-constant regression models.

I don't know about GBRT, but for random forest (1) it is clearly defined in the model to be constant per area, and (2) there is bagging to compensate for this. Two points that are never part of isotonic regression models.

In the case of Fabian and Alex, it makes sense to clip the values, as they fit an isotonic regression on a cumulative distribution function (thus the values should be trimmed to either 0 or 1), but this is not the most common use case. For instance, if you look at the example on the Wikipedia page (which I think is the example I designed for scikit-learn), you can easily see that it would be a very poor choice to clip the values.

I understand that you'd want this to work "by default", in order to avoid angry emails on the mailing list. But it doesn't make it more right. So at least, let's not pretend it is.

Last but not least, I truly believe that it is better for a program to fail nicely instead of returning something that doesn't make sense. I am not against adding a parameter to have the clipping behavior available, but I really wouldn't make that the default behavior.

@ogrisel
Copy link
Member
ogrisel commented May 20, 2014

I don't know about GBRT, but for random forest (1) it is clearly defined in the model to be constant per area, and (2) there is bagging to compensate for this.

Bagging does not compensate for anything in areas outside of the training set. It has exactly the same clipping bias and cannot generalize any non-constant trend present in the data. However, Random Forests are still very useful in practice.

To me the clipping bias sounds like a very reasonable default, even in the Wikipedia example. The best way to counter its effect is to collect more samples at the boundaries of the training set.

@mjbommar
Copy link
Contributor Author

I'll add the out_of_bounds={clip, nan, raise} behavior, but if my PR 3157 isn't too controversial, could we finalize that one first so as to avoid merge issues?

GaelVaroquaux added a commit that referenced this pull request May 26, 2014
Handling out-of-bounds in IsotonicRegression (clean PR #3147)
@mjbommar
Copy link
Contributor Author

Closing unmerged, as this was completed in #3199.
