Censor out-of-bounds x values for isotonic regression #3147
Conversation
Thanks @mjbommar. I'm not enough of a user of isotonic regression to know whether your suggestion is appropriate. However:
And tests are failing.
@agramfort, that's another fine solution for the use case. @jnothman, let me switch to np.clip/clip_x for now until a decision is made regarding default behavior.

Glad to see this PR, I've stumbled upon this annoying behaviour before but didn't fix it. Thanks!
# Handle the left and right bounds on X if clipping
if self.clip_x:
    self.X_min = np.min(self.X_)
    self.X_max = np.max(self.X_)
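A minimal sketch of how transform could then use these stored bounds, assuming the clip_x flag and X_min/X_max names from the diff above (the f_ attribute holding the fitted interpolation function is an assumed name):

```python
import numpy as np

def transform(self, T):
    # Sketch only: clip incoming values to the range seen during fit so the
    # underlying interpolation never receives out-of-bounds input.
    T = np.asarray(T, dtype=float)
    if self.clip_x:
        T = np.clip(T, self.X_min, self.X_max)
    return self.f_(T)  # self.f_: the fitted interp1d callable (assumed name)
```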
If it's estimated from the data, it should be named self.X_min_ and self.X_max_.
I personally think that clip_x should be the default behavior and that this extra param could be avoided.
I personally think that clip_x should be the default behavior and that this extra param could be avoided.
+1
OK, @agramfort, made clipping the default behavior, removed the clip_x argument, and renamed the attributes to X_min_ and X_max_.
# Handle the left and right bounds on X
self.X_min_ = np.min(self.X_)
self.X_max_ = np.max(self.X_)
you should update the attributes in the docstring. Then +1 for merge.
OK, sorry about that. Added now, as well as a short line in the class description indicating how out-of-bound values are handled.
@agramfort We actually discussed that matter before and decided that this was the correct behaviour. There are tons of ways to deal with out-of-bounds prediction, and IMO this should be done by the user and not silently in the program.
The travis failures are unrelated (unstable OMP test and random timeout in a joblib test that I fail to reproduce on my box). Please ignore.
@NelleV I think both @fabianp and @mjbommar find the previous behavior (without clipping) annoying, so I think this is a reasonable request to at least have the clipping behavior there by default. I would be +1 to re-add the
@ogrisel In my case, for Xs smaller than those the isotonic regression was fitted on, the output should be 0, while for larger ones it depends on the dataset (i.e., the organism I study; often I expect a power-law fit, so clipping would be a terrible choice). I think there is a reason why interpolate.interp1d never infers out-of-bound values. Sure, it is annoying to have to deal with those, but it is even more annoying when you have to deal with a random bug in your code because of an implementation choice that is very specific to the user. This was the reason why, when we implemented this, we chose the default behaviour of interp1d and let each user deal with out-of-bounds values.
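For reference, a small example of the interp1d behaviour discussed above: out-of-range inputs raise by default, and any other handling has to be asked for explicitly.

```python
import numpy as np
from scipy.interpolate import interp1d

f = interp1d([1.0, 2.0, 3.0], [1.0, 4.0, 9.0])
print(f(2.5))            # inside the fitted range: fine

try:
    f(4.0)               # outside the range: interp1d refuses to extrapolate
except ValueError as exc:
    print(exc)

# Out-of-range handling must be opted into, e.g. by returning NaN:
g = interp1d([1.0, 2.0, 3.0], [1.0, 4.0, 9.0],
             bounds_error=False, fill_value=np.nan)
print(g(4.0))            # nan instead of an exception
```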
honestly I have no strong preference besides make it work out of the box as
@NelleV, I don't disagree that clipping is not always desired, which is why my original approach was not to change the default behavior. However, don't we have other regression methods implemented in sklearn whose behavior might make no sense outside the training set domain? That doesn't stop us from extrapolating in those cases by default. What's the resolution process for a decision? I would really like to finalize this in an acceptable way so that I can stop using my personal branch and switch back to sklearn master.
returning nan would be fine with me too. Again no strong feeling. I just think it should work out of the box as much as possible.
+1
The typical case is linear regression, where this is almost always the case, but nobody would expect the model to refuse to make a prediction there. @mjbommar please re-add a constructor param to control the behavior for the handling of out-of-bound values, such as:
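One hypothetical shape for such a parameter, sketched here as a standalone helper; the policy names below ("clip", "nan", "raise") are illustrative assumptions, not necessarily what was proposed in this thread:

```python
import numpy as np

def _handle_out_of_bounds(T, X_min, X_max, policy="clip"):
    """Apply a policy to values of T outside [X_min, X_max] (sketch only)."""
    T = np.asarray(T, dtype=float)
    outside = (T < X_min) | (T > X_max)
    if policy == "clip":
        # Map out-of-bounds values to the nearest training bound.
        return np.clip(T, X_min, X_max)
    if policy == "nan":
        # Mark out-of-bounds values as missing instead of predicting.
        T = T.copy()
        T[outside] = np.nan
        return T
    if policy == "raise":
        # Refuse to predict outside the training range.
        if np.any(outside):
            raise ValueError("X contains values outside the training range")
        return T
    raise ValueError("unknown out-of-bounds policy: %r" % policy)
```

In this sketch, "raise" mirrors interp1d's default, "clip" matches the behavior of this PR, and "nan" matches the option some reviewers mentioned.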
I would still use
+1
Maybe, but in any cross-val loop you'll have the problem as the extreme
Then consider Random Forest Regressor and GBRT: non-parametric, piecewise-constant regression models.
I don't know about GBRT, but for random forest (1) it is clearly defined in … In the case of Fabian and Alex, it makes sense to clip the values, as they … I understand that you'd want this to work "by default", in order to avoid … Last but not least, I truly believe that it is better for a program to fail …
Bagging does not compensate for anything in areas outside of the training set. It has exactly the same clipping bias and cannot generalize any non-constant trend present in the data. However, Random Forests are still very useful in practice. To me the clipping bias sounds like a very reasonable default, even in the Wikipedia example. The best way to counter its effect is to collect more samples on the boundaries of the training set.
I'll add the
Handling out-of-bounds in IsotonicRegression (clean PR #3147)
Closing as unmerged, as this was completed in #3199
`predict` and `transform` in `IsotonicRegression` rely on `interp1d`, which fails if an element of X is outside of [x_min, x_max] for the vector that the model was fit on. For example, the block below will fail with the following exception:

To make this more convenient for usage in real-world scenarios, I added a simple `x_censor` argument to the `IsotonicRegression` constructor. If set to `True`, the `predict` and `transform` methods will map out-of-bounds x values to the nearest bound. Test included.
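A rough sketch of the failure mode and the intended usage described above, assuming the API on this PR's branch (the `x_censor` keyword never landed in released scikit-learn, which later reworked out-of-bounds handling in #3199):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

X_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([1.0, 3.0, 2.0, 5.0])

# Without censoring, predicting outside [1.0, 4.0] hits interp1d's
# out-of-bounds check (the behaviour described in this PR's description):
ir = IsotonicRegression().fit(X_train, y_train)
try:
    ir.predict(np.array([0.5, 5.0]))
except ValueError as exc:
    print(exc)

# With the proposed (hypothetical) keyword, out-of-bounds values would be
# mapped to the nearest training bound before interpolation:
# ir = IsotonicRegression(x_censor=True).fit(X_train, y_train)
# ir.predict(np.array([0.5, 5.0]))   # predictions at x = 1.0 and x = 4.0
```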