RFE/RFECV step enhancements by hermidalc · Pull Request #13470 · scikit-learn/scikit-learn · GitHub

RFE/RFECV step enhancements #13470


Closed
wants to merge 40 commits into from

Conversation

hermidalc
Contributor
@hermidalc hermidalc commented Mar 18, 2019

Reference Issues/PRs

#10368

Also refer to old closed pull request #12578

What does this implement/fix? Explain your changes.

Adds three new RFE/RFECV parameters and associated functionality:

tune_step_at : int or float or None, optional (default=None)
    Number of remaining features reached when ``tuning_step`` is used
    rather than ``step``. May be specified as an (integer) number of
    remaining features or, if within (0.0, 1.0), a percentage (rounded
    down) of the original number of features. If original number of
    features and parameter settings would result in stepping past
    ``tune_step_at``, then the number of features removed in the iteration
    prior to stepping over will adjust to arrive at this value.

tuning_step : int or float, optional (default=1)
    Step to use starting at ``tune_step_at`` number of remaining features.
    If greater than or equal to 1, then ``tuning_step`` corresponds to the
    (integer) number of features to remove at each iteration. If within
    (0.0, 1.0), then ``tuning_step`` corresponds to the percentage (rounded
    down) of features to remove at each iteration.

reducing_step : boolean, optional (default=False)
    If true and ``step`` or ``tuning_step`` is a float, the number of
    features removed is calculated as a fraction of the remaining features
    in that iteration. If false, the number of features removed is constant
    across iterations and a fraction of the original number of features for
    ``step`` or fraction of the ``tune_step_at`` number of remaining
    features for ``tuning_step``.
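Taken together, these parameters define a per-iteration elimination schedule. The following is a minimal sketch of that schedule as documented above (an illustration of the described semantics, not the PR's actual code; the helper name ``rfe_schedule`` is made up):

```python
def rfe_schedule(n_features, n_features_to_select, step=1,
                 tune_step_at=None, tuning_step=1, reducing_step=False):
    """Illustrative only: sequence of remaining-feature counts per iteration."""
    if tune_step_at is not None and 0.0 < tune_step_at < 1.0:
        # fraction of the original number of features, rounded down
        tune_step_at = int(tune_step_at * n_features)
    remaining = n_features
    counts = [remaining]
    while remaining > n_features_to_select:
        if tune_step_at is not None and remaining <= tune_step_at:
            # past the threshold: use tuning_step; floats are a fraction of
            # tune_step_at (or of remaining features when reducing_step=True)
            base = remaining if reducing_step else tune_step_at
            s = tuning_step if tuning_step >= 1 else max(1, int(tuning_step * base))
        else:
            base = remaining if reducing_step else n_features
            s = step if step >= 1 else max(1, int(step * base))
        s = int(s)
        if tune_step_at is not None and remaining > tune_step_at:
            # never step past tune_step_at; shrink the step to land on it
            s = min(s, remaining - tune_step_at)
        remaining -= min(s, remaining - n_features_to_select)
        counts.append(remaining)
    return counts

rfe_schedule(300, 50, step=100, tune_step_at=105, tuning_step=10)
# → [300, 200, 105, 95, 85, 75, 65, 55, 50]
```

With the defaults (`step=1`, `tune_step_at=None`) the schedule reduces to the existing RFE behavior of removing one feature per iteration.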

Comments

New options and their default settings do not alter existing default behavior.

@hermidalc
Contributor Author

@jnothman, @amueller

Please tell me if you don't like the parameter name "changed_step" and have a better suggestion for the parameter name for the second step size.

@hermidalc
Contributor Author
hermidalc commented Mar 18, 2019

Also, I've had trouble getting the expected string right for a deeply nested repr test. You will see in my pull request commit e8221f9 that I made the right changes, but the expected string must be formatted in exactly the right way for the assertion to pass, and I cannot figure out how to do that. Would you have any advice here?

@hermidalc
Contributor Author

@jnothman, @amueller

Please tell me if you don't like the parameter name "changed_step" and have a better suggestion for the parameter name for the second step size.

What do you both think would be a better parameter name for the second step size at the step change threshold? I'm not a big fan of my changed_step name. Something like step_2?

@jnothman
Member

fine_tuning_step?? Or drop "fine" or "tuning"?

@hermidalc
Contributor Author

Also, I've had trouble getting the expected string right for a deeply nested repr test. You will see in my pull request commit e8221f9 that I made the right changes, but the expected string must be formatted in exactly the right way for the assertion to pass, and I cannot figure out how to do that. Would you have any advice here?

@jnothman would you have any advice for me here? I sat for a while trying to figure out how to print out a deeply nested object repr so that I could update the expected string in the test because whitespace and parameter order matter but I cannot get the assertion to pass.

@jnothman
Member

Oh dear. Maybe we needed to not be using real estimators in that repr test. It seems to now exceed the character limits we have in there, which a recent PR by @NicolasHug makes configurable... we should probably make the character limits infinite for this test once that PR is merged, or else copy the __init__ signature for LogisticRegression and RFE into this test file so that no future conflicts occur.

@NicolasHug
Member

@hermidalc for the repr test to pass, you need to put in the expected variable what print(estimator) gives you (no printing should be done in the test of course).

I get:

RFE(changed_step=1,
    estimator=RFE(changed_step=1,
                  estimator=RFE(changed_step=1,
                                estimator=RFE(changed_step=1,
                                              estimator=RFE(changed_step=1,
                                                            estimator=RFE(changed_step=1,
                    ...         n_features_to_select=None, reducing_step=False,
                                step=1, step_change_at=None, verbose=0),
                  n_features_to_select=None, reducing_step=False, step=1,
                  step_change_at=None, verbose=0),
    n_features_to_select=None, reducing_step=False, step=1, step_change_at=None,
    verbose=0)

@jnothman
Indeed this repr will change when #13436 is merged. Any change to the parameters of an estimator used in test_pprint.py will result in the need to change the expected strings in test_pprint.py.
We already raised this issue before #11705 (comment) and decided to keep the current (real) estimators, and not use fake ones.
I'd be fine with both options. On one hand it's nice to have "real world" checks, OTOH if it makes contributions harder then it makes sense to get rid of it.

@hermidalc
Contributor Author

@hermidalc for the repr test to pass, you need to put in the expected variable what print(estimator) gives you (no printing should be done in the test of course).

I get:

RFE(changed_step=1,
    estimator=RFE(changed_step=1,
                  estimator=RFE(changed_step=1,
                                estimator=RFE(changed_step=1,
                                              estimator=RFE(changed_step=1,
                                                            estimator=RFE(changed_step=1,
                    ...         n_features_to_select=None, reducing_step=False,
                                step=1, step_change_at=None, verbose=0),
                  n_features_to_select=None, reducing_step=False, step=1,
                  step_change_at=None, verbose=0),
    n_features_to_select=None, reducing_step=False, step=1, step_change_at=None,
    verbose=0)

How do you get to print the full nested object repr without the ellipsis (...)? I spent too long trying to figure out how to do that and couldn't.

@hermidalc
Contributor Author


@NicolasHug @jnothman I'm just following Nicolas's advice and changing the expected string to be the result of print(rfe), with the ellipsis/truncation. It passes that way. Sorry, I was confused because the existing expected strings in all of the tests do not have truncated estimator reprs with ellipses.

@NicolasHug
Member

Yes, the expected variable needs to be exactly what print gives you, whether there's an ellipsis or not.
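For illustration, the pattern under discussion boils down to comparing repr output against a literal expected string (a toy sketch with a made-up class, not the actual test_pprint code):

```python
# Toy stand-in class; the real test builds deeply nested estimators.
class Fake:
    def __repr__(self):
        return "Fake(step=1, verbose=0)"

# The expected string must match what print(obj) emits character for
# character, including any ellipsis the pretty-printer inserts.
expected = "Fake(step=1, verbose=0)"
assert repr(Fake()) == expected
```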

@hermidalc
Contributor Author
hermidalc commented Apr 2, 2019

@jnothman I have a question: for tuning_step, if the user specifies a float, does it make more sense to the user to have that calculated from the original number of features or from the tune_step_at number of features? In the current code I am calculating it from the tune_step_at number of features.

@jnothman
Member
jnothman commented Apr 3, 2019 via email

@hermidalc
Contributor Author
hermidalc commented Apr 3, 2019

@jnothman actually I forgot and realized why I wrote the code that way. To me it's more intuitive to calculate the tuning_step float from tune_step_at because then users would be able to avoid having to potentially specify very small tuning_step floats if the number of starting features is very high. Does that make sense? Otherwise another approach would be to only let tuning_step be an int.
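For a concrete (hypothetical) illustration of why the base matters when the starting feature count is large:

```python
# Hypothetical numbers, only to contrast the two interpretations of a
# tuning_step float such as 0.01.
n_features, tune_step_at = 10000, 100

# as a fraction of the original number of features:
per_iter_original = int(0.01 * n_features)     # 100 features removed per iteration
# as a fraction of tune_step_at:
per_iter_threshold = int(0.01 * tune_step_at)  # 1 feature removed per iteration
```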

Member
@jnothman jnothman left a comment

I failed to submit this comment a while ago.

@@ -65,6 +67,26 @@ class RFE(BaseEstimator, MetaEstimatorMixin, SelectorMixin):
If within (0.0, 1.0), then ``step`` corresponds to the percentage
(rounded down) of features to remove at each iteration.

step_change_at : int or float or None, optional (default=None)
If specified ``step`` int or float equates to > 1 feature, then the
Member

"Once the remaining number of features drops below this value, tuning_step is used rather than step. May be specified as an integer number of remaining features or a fraction (value between 0.0 and 1.0)"

Contributor Author
@hermidalc hermidalc Apr 5, 2019

How about "Once the remaining number of features reaches this value, ..." since that is what actually happens. The new code looks at how close the number of remaining features is to tune_step_at and, if step is larger, it steps to the tune_step_at number of remaining features, not over it.

@hermidalc
Contributor Author

@jnothman pls ignore last comment (which I deleted) as I believe I’ve finished all the tests and test coverage now.

@hermidalc
Contributor Author
hermidalc commented Apr 9, 2019

@jnothman looks like a failure unrelated to this PR (failure downloading some archives?). My local build and tests work fine, as did the previous commit's GitHub build/test. The only thing I changed in the latest commit is to fix the conflict with the merged #13529. Is there some way to restart all the GitHub checks?

@NicolasHug
Member

You can push an empty commit to trigger the CI:

git commit --allow-empty -m "Trigger CI"

@hermidalc
Contributor Author

All GitHub tests are back in order now. So everything from my end is finished; when there is time on your side I guess we can do the final review and discussion, including any changes you might want.

@hermidalc
Contributor Author
hermidalc commented Apr 9, 2019

@jnothman actually I forgot and realized why I wrote the code that way. To me it's more intuitive to calculate the tuning_step float from tune_step_at because then users would be able to avoid having to potentially specify very small tuning_step floats if the number of starting features is very high. Does that make sense? Otherwise another approach would be to only let tuning_step be an int.

@jnothman if you feel that we should calculate tuning_step float percentages from the original number of features when reducing_step is false (doesn't apply when true) I'm actually fine with that. It is minimal code to change. I thought about it more and if users have cases where they would need to specify a small float because number of original features is large then they will probably just specify an integer don't you think?

Another reason to keep the current behavior is that the tuning_step float specification would be more consistent whether reducing_step is true or false.

@jnothman
Member
jnothman commented Apr 9, 2019

I'm fairly ambivalent about the float specification of the tuning step.
A bigger issue: I think these features will easily get lost to most users. I wonder whether you can modify an example in our gallery to demonstrate these features (or do we need bigger datasets than our examples can run with?) or at least update the user guide.

n_features_to_select=None, step=1, verbose=0),
n_features_to_select=None, step=1, verbose=0),
n_features_to_select=None, step=1, verbose=0)"""
... step=1, tune_step_at=None, tuning_step=1,
Member

I really don't like having to change this test to include ellipsis!

Contributor Author

@NicolasHug can comment. I couldn't get the test to pass without doing print(rfe) and using exactly that string, which includes an ellipsis.

Member
@NicolasHug NicolasHug Apr 16, 2019

use fewer nested RFE maybe?

Member

Sorry, this comment on test_pprint was old, and got submitted with the other review. Never mind it. The file is unmodified here.

Number of remaining features reached when ``tuning_step`` is used
rather than ``step``. May be specified as an (integer) number of
remaining features or, if within (0.0, 1.0), a percentage (rounded
down) of the original number of features. If original number of
Member

I suspect this last sentence is too much technical detail.

Contributor Author

Should I just remove it?

If greater than or equal to 1, then ``tuning_step`` corresponds to the
(integer) number of features to remove at each iteration. If within
(0.0, 1.0), then ``tuning_step`` corresponds to the percentage (rounded
down) of features to remove at each iteration.
Member

of which features?

Contributor Author
@hermidalc hermidalc Apr 16, 2019

The reason which features (original or remaining) is left out of this parameter description is that it depends on the reducing_step boolean setting. The description for the step parameter is the same.

@@ -171,26 +219,28 @@ def _fit(self, X, y, step_score=None):
self.scores_ = []

# Elimination
while np.sum(support_) > n_features_to_select:
n_remaining_features = n_features
self.n_remaining_feature_steps_ = [n_remaining_features]
Member

This attribute should be documented

# Eliminate the worse features
threshold = min(step, np.sum(support_) - n_features_to_select)
# Adjust step using special parameters if specified
if self.tune_step_at is not None:
Member

I wonder if this could be neatened by instead implementing the whole thing as:

self._main_rfe_loop(X, y, n_features, tune_step_at, self.step)
self._main_rfe_loop(X, y, tune_step_at, n_features_to_select, self.tuning_step)

This makes clearer the fact that there are two distinct phases of operation

Contributor Author

Thank you for the suggestion - will work on it


n_features_list = [300] * 15
n_features_to_select_list = [50] * 15
step_list = [100, 100, 100, 100, 100, 100,
Member

I think this would all be clearer if the data was shown zipped in the code (i.e. (100, 105, 10, False, [300, 200, 105, 95, 85, 75, 65, 55, 50]) for the first example), perhaps using pytest.mark.parametrize

jnothman and others added 2 commits April 16, 2019 10:06
Co-Authored-By: hermidalc <hermidal@cs.umd.edu>
Co-Authored-By: hermidalc <hermidal@cs.umd.edu>
@hermidalc
Contributor Author

@amueller @jnothman

Sorry, the long delay on this PR is my fault. I believe I need to complete three remaining items:

  1. Refactor code using _main_rfe_loop recommendation
  2. Refactor test code using pytest.mark.parametrize recommendation
  3. Add examples and explanation of new features to User Guide.

I will do this next week.

Base automatically changed from master to main January 22, 2021 10:50
@hermidalc hermidalc closed this Jun 14, 2022
@hermidalc hermidalc deleted the rfe_step_enhancements branch June 14, 2022 15:07
5 participants