RFE/RFECV step enhancements by hermidalc · Pull Request #13470 · scikit-learn/scikit-learn · GitHub

RFE/RFECV step enhancements #13470


Closed
wants to merge 40 commits into from

Conversation

hermidalc
Contributor
@hermidalc hermidalc commented Mar 18, 2019

Reference Issues/PRs

#10368

Also refer to old closed pull request #12578

What does this implement/fix? Explain your changes.

Adds three new RFE/RFECV parameters and associated functionality:

tune_step_at : int or float or None, optional (default=None)
    Number of remaining features reached when ``tuning_step`` is used
    rather than ``step``. May be specified as an (integer) number of
    remaining features or, if within (0.0, 1.0), a percentage (rounded
    down) of the original number of features. If original number of
    features and parameter settings would result in stepping past
    ``tune_step_at``, then the number of features removed in the iteration
    prior to stepping over will adjust to arrive at this value.

tuning_step : int or float, optional (default=1)
    Step to use starting at ``tune_step_at`` number of remaining features.
    If greater than or equal to 1, then ``tuning_step`` corresponds to the
    (integer) number of features to remove at each iteration. If within
    (0.0, 1.0), then ``tuning_step`` corresponds to the percentage (rounded
    down) of features to remove at each iteration.

reducing_step : boolean, optional (default=False)
    If true and ``step`` or ``tuning_step`` is a float, the number of
    features removed is calculated as a fraction of the remaining features
    in that iteration. If false, the number of features removed is constant
    across iterations and a fraction of the original number of features for
    ``step`` or fraction of the ``tune_step_at`` number of remaining
    features for ``tuning_step``.
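Taken together, these parameters define a per-iteration elimination schedule. The following is a minimal sketch of that schedule as documented above (an illustration of the described semantics, not the PR's actual code; the helper name ``rfe_schedule`` is made up):

```python
def rfe_schedule(n_features, n_features_to_select, step=1,
                 tune_step_at=None, tuning_step=1, reducing_step=False):
    """Illustrative only: sequence of remaining-feature counts per iteration."""
    if tune_step_at is not None and 0.0 < tune_step_at < 1.0:
        # fraction of the original number of features, rounded down
        tune_step_at = int(tune_step_at * n_features)
    remaining = n_features
    counts = [remaining]
    while remaining > n_features_to_select:
        if tune_step_at is not None and remaining <= tune_step_at:
            # past the threshold: use tuning_step; floats are a fraction of
            # tune_step_at (or of remaining features when reducing_step=True)
            base = remaining if reducing_step else tune_step_at
            s = tuning_step if tuning_step >= 1 else max(1, int(tuning_step * base))
        else:
            base = remaining if reducing_step else n_features
            s = step if step >= 1 else max(1, int(step * base))
        s = int(s)
        if tune_step_at is not None and remaining > tune_step_at:
            # never step past tune_step_at; shrink the step to land on it
            s = min(s, remaining - tune_step_at)
        remaining -= min(s, remaining - n_features_to_select)
        counts.append(remaining)
    return counts

rfe_schedule(300, 50, step=100, tune_step_at=105, tuning_step=10)
# → [300, 200, 105, 95, 85, 75, 65, 55, 50]
```

With the defaults (`step=1`, `tune_step_at=None`) the schedule reduces to the existing RFE behavior of removing one feature per iteration.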

Comments

New options and their default settings do not alter existing default behavior.

@hermidalc
Contributor Author

@jnothman, @amueller

Please tell me if you don't like the parameter name "changed_step" and have a better suggestion for the parameter name for the second step size.

@hermidalc
Contributor Author
hermidalc commented Mar 18, 2019

Also, I've had trouble getting the expected string right for a deeply nested repr test. You will see in my pull request commit e8221f9 that I made the right changes, but the expected string must be formatted in exactly the right way for the assertion to pass, and I cannot figure out how to do that. Would you have any advice here?

@hermidalc
Contributor Author

@jnothman, @amueller

Please tell me if you don't like the parameter name "changed_step" and have a better suggestion for the parameter name for the second step size.

What do you both think would be a better parameter name for the second step size at the step change threshold? I'm not a big fan of my changed_step name. Something like step_2?

@jnothman
Member

fine_tuning_step?? Or drop "fine" or "tuning"?

@hermidalc
Contributor Author

Also, I've had trouble getting the expected string right for a deeply nested repr test. You will see in my pull request commit e8221f9 that I made the right changes, but the expected string must be formatted in exactly the right way for the assertion to pass, and I cannot figure out how to do that. Would you have any advice here?

@jnothman would you have any advice for me here? I sat for a while trying to figure out how to print out a deeply nested object repr so that I could update the expected string in the test because whitespace and parameter order matter but I cannot get the assertion to pass.

@jnothman
Member

Oh dear. Maybe we needed to not be using real estimators in that repr test. It seems to now exceed the character limits we have in there, which a recent PR by @NicolasHug makes configurable... we should probably make the character limits infinite for this test once that PR is merged, or else copy the __init__ signature for LogisticRegression and RFE into this test file so that no future conflicts occur.

@NicolasHug
Member

@hermidalc for the repr test to pass, you need to put in the expected variable what print(estimator) gives you (no printing should be done in the test of course).

I get:

RFE(changed_step=1,
    estimator=RFE(changed_step=1,
                  estimator=RFE(changed_step=1,
                                estimator=RFE(changed_step=1,
                                              estimator=RFE(changed_step=1,
                                                            estimator=RFE(changed_step=1,
                    ...         n_features_to_select=None, reducing_step=False,
                                step=1, step_change_at=None, verbose=0),
                  n_features_to_select=None, reducing_step=False, step=1,
                  step_change_at=None, verbose=0),
    n_features_to_select=None, reducing_step=False, step=1, step_change_at=None,
    verbose=0)

@jnothman
Indeed this repr will change when #13436 is merged. Any change to the parameters of an estimator used in test_pprint.py will result in the need to change the expected strings in test_pprint.py.
We already raised this issue before #11705 (comment) and decided to keep the current (real) estimators, and not use fake ones.
I'd be fine with both options. On one hand it's nice to have "real world" checks, OTOH if it makes contributions harder then it makes sense to get rid of it.

@hermidalc
Contributor Author

@hermidalc for the repr test to pass, you need to put in the expected variable what print(estimator) gives you (no printing should be done in the test of course).

I get:

RFE(changed_step=1,
    estimator=RFE(changed_step=1,
                  estimator=RFE(changed_step=1,
                                estimator=RFE(changed_step=1,
                                              estimator=RFE(changed_step=1,
                                                            estimator=RFE(changed_step=1,
                    ...         n_features_to_select=None, reducing_step=False,
                                step=1, step_change_at=None, verbose=0),
                  n_features_to_select=None, reducing_step=False, step=1,
                  step_change_at=None, verbose=0),
    n_features_to_select=None, reducing_step=False, step=1, step_change_at=None,
    verbose=0)

How do you get to print the full nested object repr without the ellipsis (...)? I spent too long trying to figure out how to do that and couldn't.

@hermidalc
Contributor Author


@NicolasHug @jnothman I'm just following Nicolas's advice and changing the expected string to be the result of print(rfe), with the ellipsis/truncation. It passes that way. Sorry, I was confused because the existing expected strings in all of the tests do not have truncated estimator reprs with ellipses.

@NicolasHug
Member

Yes, the expected variable needs to be exactly what print gives you, whether there's an ellipsis or not.
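For illustration, the pattern under discussion boils down to comparing repr output against a literal expected string (a toy sketch with a made-up class, not the actual test_pprint code):

```python
# Toy stand-in class; the real test builds deeply nested estimators.
class Fake:
    def __repr__(self):
        return "Fake(step=1, verbose=0)"

# The expected string must match what print(obj) emits character for
# character, including any ellipsis the pretty-printer inserts.
expected = "Fake(step=1, verbose=0)"
assert repr(Fake()) == expected
```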

@hermidalc
Contributor Author
hermidalc commented Apr 2, 2019

@jnothman I have a question: for tuning_step, if the user specifies a float, does it make more sense to the user to have that calculated from the original number of features or from the tune_step_at number of features? In the current code I am calculating it from the tune_step_at number of features.

@jnothman
Member
jnothman commented Apr 3, 2019 via email

@hermidalc
Contributor Author
hermidalc commented Apr 3, 2019

@jnothman actually I forgot and realized why I wrote the code that way. To me it's more intuitive to calculate the tuning_step float from tune_step_at because then users would be able to avoid having to potentially specify very small tuning_step floats if the number of starting features is very high. Does that make sense? Otherwise another approach would be to only let tuning_step be an int.
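For a concrete (hypothetical) illustration of why the base matters when the starting feature count is large:

```python
# Hypothetical numbers, only to contrast the two interpretations of a
# tuning_step float such as 0.01.
n_features, tune_step_at = 10000, 100

# as a fraction of the original number of features:
per_iter_original = int(0.01 * n_features)     # 100 features removed per iteration
# as a fraction of tune_step_at:
per_iter_threshold = int(0.01 * tune_step_at)  # 1 feature removed per iteration
```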

Member
@jnothman jnothman left a comment

I failed to submit this comment a while ago.

@@ -65,6 +67,26 @@ class RFE(BaseEstimator, MetaEstimatorMixin, SelectorMixin):
If within (0.0, 1.0), then ``step`` corresponds to the percentage
(rounded down) of features to remove at each iteration.

step_change_at : int or float or None, optional (default=None)
If specified ``step`` int or float equates to > 1 feature, then the
Member

"Once the remaining number of features drops below this value, tuning_step is used rather than step. May be specified as an integer number of remaining features or a fraction (value between 0.0 and 1.0)"

Contributor Author
@hermidalc hermidalc Apr 5, 2019

How about "Once the remaining number of features reaches this value, ..." since that is what actually happens. The new code looks at how close the number of remaining features is to tune_step_at and, if step is larger, it steps to the tune_step_at number of remaining features, not over it.

@hermidalc
Contributor Author

@jnothman pls ignore last comment (which I deleted) as I believe I’ve finished all the tests and test coverage now.

@hermidalc
Contributor Author
hermidalc commented Apr 9, 2019

@jnothman looks like a failure unrelated to this PR (failure downloading some archives?). My local build and tests work fine, as did the previous commit's GitHub build/test. The only thing I changed in the latest commit is to fix the conflict with the merged #13529. Is there some way to restart all the GitHub checks?

@NicolasHug
Member

You can push an empty commit to trigger the CI:

git commit --allow-empty -m "Trigger CI"

@hermidalc
Contributor Author

All GitHub tests are back in order now. So everything from my end is finished; when there is time on your side I guess we can do the final review and discussion, including any changes you might want.

@hermidalc
Contributor Author
hermidalc commented Apr 9, 2019

@jnothman actually I forgot and realized why I wrote the code that way. To me it's more intuitive to calculate the tuning_step float from tune_step_at because then users would be able to avoid having to potentially specify very small tuning_step floats if the number of starting features is very high. Does that make sense? Otherwise another approach would be to only let tuning_step be an int.

@jnothman if you feel that we should calculate tuning_step float percentages from the original number of features when reducing_step is false (doesn't apply when true) I'm actually fine with that. It is minimal code to change. I thought about it more and if users have cases where they would need to specify a small float because number of original features is large then they will probably just specify an integer don't you think?

Another reason to keep the current behavior is that the tuning_step float specification would be more consistent whether reducing_step is true or false.

@jnothman
Member
jnothman commented Apr 9, 2019

I'm fairly ambivalent about the float specification of the tuning step.
A bigger issue: I think these features will easily get lost to most users. I wonder whether you can modify an example in our gallery to demonstrate these features (or do we need bigger datasets than our examples can run with?) or at least update the user guide.

n_features_to_select=None, step=1, verbose=0),
n_features_to_select=None, step=1, verbose=0),
n_features_to_select=None, step=1, verbose=0)"""
... step=1, tune_step_at=None, tuning_step=1,
Member

I really don't like having to change this test to include ellipsis!

Contributor Author

@NicolasHug can comment. I couldn't get the test to pass without doing print(rfe) and using exactly that string, which includes an ellipsis.

Member
@NicolasHug NicolasHug Apr 16, 2019

use fewer nested RFE maybe?

Member

Sorry, this comment on test_pprint was old, and got submitted with the other review. Never mind it. The file is unmodified here.

Number of remaining features reached when ``tuning_step`` is used
rather than ``step``. May be specified as an (integer) number of
remaining features or, if within (0.0, 1.0), a percentage (rounded
down) of the original number of features. If original number of
Member

I suspect this last sentence is too much technical detail.

Contributor Author

Should I just remove it?

If greater than or equal to 1, then ``tuning_step`` corresponds to the
(integer) number of features to remove at each iteration. If within
(0.0, 1.0), then ``tuning_step`` corresponds to the percentage (rounded
down) of features to remove at each iteration.
Member

of which features?

Contributor Author
@hermidalc hermidalc Apr 16, 2019

The reason which features (original or remaining) is left out of this parameter description is that it depends on the reducing_step boolean setting. The description for the step parameter is the same.

@@ -171,26 +219,28 @@ def _fit(self, X, y, step_score=None):
self.scores_ = []

# Elimination
while np.sum(support_) > n_features_to_select:
n_remaining_features = n_features
self.n_remaining_feature_steps_ = [n_remaining_features]
Member

This attribute should be documented

# Eliminate the worse features
threshold = min(step, np.sum(support_) - n_features_to_select)
# Adjust step using special parameters if specified
if self.tune_step_at is not None:
Member

I wonder if this could be neatened by instead implementing the whole thing as:

self._main_rfe_loop(X, y, n_features, tune_step_at, self.step)
self._main_rfe_loop(X, y, tune_step_at, n_features_to_select, self.tuning_step)

This makes clearer the fact that there are two distinct phases of operation

Contributor Author

Thank you for the suggestion - will work on it


n_features_list = [300] * 15
n_features_to_select_list = [50] * 15
step_list = [100, 100, 100, 100, 100, 100,
Member

I think this would all be clearer if the data was shown zipped in the code (i.e. (100, 105, 10, False, [300, 200, 105, 95, 85, 75, 65, 55, 50]) for the first example), perhaps using pytest.mark.parametrize

jnothman and others added 2 commits April 16, 2019 10:06
Co-Authored-By: hermidalc <hermidal@cs.umd.edu>
Co-Authored-By: hermidalc <hermidal@cs.umd.edu>
@hermidalc
Contributor Author

@amueller @jnothman

Sorry, the long delay on this PR is my fault. I believe I need to complete three remaining items:

  1. Refactor code using _main_rfe_loop recommendation
  2. Refactor test code using pytest.mark.parametrize recommendation
  3. Add examples and explanation of new features to User Guide.

I will do this next week.

Base automatically changed from master to main January 22, 2021 10:50
@hermidalc hermidalc closed this Jun 14, 2022
@hermidalc hermidalc deleted the rfe_step_enhancements branch June 14, 2022 15:07
5 participants