RFE/RFECV step enhancements #13470
Conversation
Also, I've had trouble getting the expected string right for a deeply nested repr test. You will see in my pull request commit e8221f9 that I made the right changes, but the string must be formatted so that the assertion passes, and I cannot figure out how to do that. Would you have any advice here?
What do you both think would be a better parameter name for the second step size used after the step change threshold? I'm not a big fan of my changed_step name. Something like step_2?
@jnothman would you have any advice for me here? I sat for a while trying to figure out how to print out a deeply nested object repr so that I could update the expected string in the test, because whitespace and parameter order matter, but I cannot get the assertion to pass.
Oh dear. Maybe we needed to not be using real estimators in that repr test. It seems to now exceed the character limits we have in there, which a recent PR by @NicolasHug makes configurable... we should probably make the character limits infinite for this test once that PR is merged, or else copy the
@hermidalc for the repr test to pass, you need to put the in the I get:
@jnothman
How do you get to print the full nested object repr without the ellipsis (...)? I spent too long trying to figure out how to do that and couldn't. |
@NicolasHug @jnothman I'm just following Nicolas's advice and changing the expected string to be the result of print(rfe) with the ellipsis/truncation. It passes that way. Sorry, I was confused because the existing expected strings in all of the tests do not have truncated estimator reprs with ellipsis.
Yes, the
@jnothman I have a question: for
Have it all calculated from the original number of features. The user can compute the difference, but having different bases would be confusing imo.
@jnothman actually I forgot and realized why I wrote the code that way. To me it's more intuitive to calculate the
I failed to submit this comment a while ago.
sklearn/feature_selection/rfe.py
@@ -65,6 +67,26 @@ class RFE(BaseEstimator, MetaEstimatorMixin, SelectorMixin):
        If within (0.0, 1.0), then ``step`` corresponds to the percentage
        (rounded down) of features to remove at each iteration.

    step_change_at : int or float or None, optional (default=None)
        If specified ``step`` int or float equates to > 1 feature, then the
"Once the remaining number of features drops below this value, tuning_step is used rather than step. May be specified as an integer number of remaining features or a fraction (value between 0.0 and 1.0)"
How about "Once the remaining number of features reaches this value, ..." since that is what actually happens. The new code looks at how close the number of remaining features is to tune_step_at, and if step is larger, then it steps to exactly tune_step_at remaining features, not over.
@jnothman pls ignore last comment (which I deleted) as I believe I've finished all the tests and test coverage now.
@jnothman looks like a failure unrelated to this PR (failure downloading some archives?). My local build and tests work fine, as did the previous commit's GitHub build/test. The only thing I changed in the latest commit is to fix the conflict with merged #13529. Is there some way to restart all the GitHub checks?
You can push an empty commit to trigger the CI:
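The usual incantation is an empty commit (assuming a standard git remote setup on the PR branch):

```shell
# Create a commit with no changes, then push it to re-trigger CI
git commit --allow-empty -m "Trigger CI"
git push
```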
All GitHub tests are back in order now. So everything from my end is finished; when there is time on your side I guess we can do the final review and discussion, including any changes you might want.
@jnothman if you feel that we should calculate
Another reason to keep it the way it currently is, is that the
I'm fairly ambivalent about the float specification of the tuning step.
sklearn/utils/tests/test_pprint.py
                n_features_to_select=None, step=1, verbose=0),
            n_features_to_select=None, step=1, verbose=0),
        n_features_to_select=None, step=1, verbose=0)"""
... step=1, tune_step_at=None, tuning_step=1,
I really don't like having to change this test to include ellipsis!
@NicolasHug can comment. I couldn't get the test to pass without doing print(rfe) and using exactly that string, which includes an ellipsis.
use fewer nested RFEs, maybe?
Sorry, this comment on test_pprint was old, and got submitted with the other review. Never mind it. The file is unmodified here.
        Number of remaining features reached when ``tuning_step`` is used
        rather than ``step``. May be specified as an (integer) number of
        remaining features or, if within (0.0, 1.0), a percentage (rounded
        down) of the original number of features. If original number of
I suspect this last sentence is too much technical detail.
Should I just remove it?
        If greater than or equal to 1, then ``tuning_step`` corresponds to the
        (integer) number of features to remove at each iteration. If within
        (0.0, 1.0), then ``tuning_step`` corresponds to the percentage (rounded
        down) of features to remove at each iteration.
of which features?
Which features (original or remaining) is left out of this parameter description because it depends on the reducing_step boolean setting. The description for the step parameter is the same.
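As I read that explanation (my assumption about the semantics, not the PR's actual code), a float step is resolved against a base count that depends on reducing_step:

```python
def resolve_step(step, n_original, n_remaining, reducing_step):
    """Hypothetical sketch: convert a float step in (0.0, 1.0) to an
    integer feature count. The base is assumed to be the remaining
    feature count when reducing_step is True, otherwise the original
    feature count (my reading of the comment, not sklearn internals)."""
    if 0.0 < step < 1.0:
        base = n_remaining if reducing_step else n_original
        return max(1, int(step * base))
    return int(step)
```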
@@ -171,26 +219,28 @@ def _fit(self, X, y, step_score=None):
            self.scores_ = []

        # Elimination
        while np.sum(support_) > n_features_to_select:
        n_remaining_features = n_features
        self.n_remaining_feature_steps_ = [n_remaining_features]
This attribute should be documented
            # Eliminate the worse features
            threshold = min(step, np.sum(support_) - n_features_to_select)
            # Adjust step using special parameters if specified
            if self.tune_step_at is not None:
I wonder if this could be neatened by instead implementing the whole thing as:
self._main_rfe_loop(X, y, n_features, tune_step_at, self.step)
self._main_rfe_loop(X, y, tune_step_at, n_features_to_select, self.tuning_step)
This makes clearer the fact that there are two distinct phases of operation
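A dependency-free sketch of that two-phase idea, tracking only remaining-feature counts (helper names are hypothetical, not sklearn internals):

```python
def eliminate(n_start, n_target, step):
    """One elimination phase: remove `step` features per iteration,
    never overshooting `n_target`; returns the count after each step."""
    counts, n = [], n_start
    while n > n_target:
        n -= min(step, n - n_target)
        counts.append(n)
    return counts

def two_phase_counts(n_features, tune_step_at, n_select, step, tuning_step):
    # Phase 1: coarse elimination down to tune_step_at using `step`
    # Phase 2: fine elimination down to n_select using `tuning_step`
    return ([n_features]
            + eliminate(n_features, tune_step_at, step)
            + eliminate(tune_step_at, n_select, tuning_step))
```

For example, n_features=300, tune_step_at=105, step=100, tuning_step=10, n_select=50 yields the schedule [300, 200, 105, 95, 85, 75, 65, 55, 50].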
Thank you for the suggestion - will work on it
n_features_list = [300] * 15
n_features_to_select_list = [50] * 15
step_list = [100, 100, 100, 100, 100, 100,
I think this would all be clearer if the data was shown zipped in the code (i.e. (100, 105, 10, False, [300, 200, 105, 95, 85, 75, 65, 55, 50]) for the first example), perhaps using pytest.mark.parametrize
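A sketch of the zipped layout (case values copied from the first example above; the parameter names are my assumption about the test's variables, and with pytest each tuple would feed @pytest.mark.parametrize):

```python
# Each case zips (step, tune_step_at, tuning_step, reducing_step,
# expected remaining-feature counts) into one tuple, as suggested.
CASES = [
    (100, 105, 10, False, [300, 200, 105, 95, 85, 75, 65, 55, 50]),
]

def check_schedule(step, tune_step_at, tuning_step, reducing_step, expected):
    # Basic sanity checks on the expected elimination schedule
    assert all(a > b for a, b in zip(expected, expected[1:]))  # decreasing
    assert tune_step_at in expected  # lands exactly on the threshold

for case in CASES:
    check_schedule(*case)
```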
Co-Authored-By: hermidalc <hermidal@cs.umd.edu>
Sorry, the long delay on this PR is my fault. I believe I need to complete three remaining items:
I will do this next week.
Reference Issues/PRs
#10368
Also refer to old closed pull request #12578
What does this implement/fix? Explain your changes.
Adds three new RFE/RFECV parameters and associated functionality:
Comments
New options and their default settings do not alter existing default behavior.