sklearn.feature_selection.SequentialFeatureSelector Select features as long as score gets better. #20137

Closed
paapu88 opened this issue May 26, 2021 · 10 comments · Fixed by #20145

Comments

@paapu88
paapu88 commented May 26, 2021

Dear Developers,

Currently one can pass sklearn.feature_selection.SequentialFeatureSelector the parameter 'n_features_to_select' to obtain a desired number of features.

It would be desirable if, instead of giving a fixed 'n_features_to_select', one could use it as an upper limit and select features only up to the point where the score stops improving (so the number of selected features could be smaller than 'n_features_to_select').
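
For context, a minimal example of the current usage, where the number of selected features is fixed up front (Ridge and load_diabetes are only placeholder choices):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True)

# Exactly 5 features are kept, even if the last additions barely help the score.
sfs = SequentialFeatureSelector(Ridge(), n_features_to_select=5)
sfs.fit(X, y)
print(sfs.get_support())
```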

Best regards, Markus

murata-yu added a commit to murata-yu/scikit-learn that referenced this issue May 27, 2021
…es as long as score gets better. scikit-learn#20137

Add `censored_rate` parameter in SequentialFeatureSelector.
If `censored_rate` is not None, fitting is aborted when new_score is lower than old_score * (1 + censored_rate).
@ogrisel
Member
ogrisel commented May 27, 2021

I agree. The problem is how to manage backward compat with the current behavior without introducing too many spurious deprecation warnings and too many new parameters...

@norbusan suggested the following strategy in a private conversation: add a new tol parameter as an absolute tolerance on the score change before and after including the new feature, with a default value of None that would keep the existing behavior. If not None (for instance tol=0.), this value would be used to stop the loop earlier.

We would make _get_best_new_feature return (new_feature_idx, new_score) instead of just new_feature_idx.
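
A minimal sketch of that idea, assuming a simplified stand-in for the private helper that returns both the chosen feature index and its score (this is not the actual scikit-learn implementation):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

def _get_best_new_feature_score(estimator, X, y, current_mask):
    # Evaluate each remaining candidate feature and keep the best one.
    scores = {}
    for idx in np.flatnonzero(~current_mask):
        mask = current_mask.copy()
        mask[idx] = True
        scores[idx] = cross_val_score(clone(estimator), X[:, mask], y).mean()
    best_idx = max(scores, key=scores.get)
    return best_idx, scores[best_idx]

def forward_select(estimator, X, y, n_features_to_select, tol=None):
    current_mask = np.zeros(X.shape[1], dtype=bool)
    old_score = -np.inf
    for _ in range(n_features_to_select):
        idx, new_score = _get_best_new_feature_score(estimator, X, y, current_mask)
        # tol=None keeps the existing behavior (always add the feature);
        # otherwise stop as soon as the improvement falls below tol.
        if tol is not None and new_score - old_score < tol:
            break
        current_mask[idx] = True
        old_score = new_score
    return current_mask
```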

@ogrisel
Member
ogrisel commented May 27, 2021

What is good about the solution above:

  • we introduce a minimal number of new parameters,
  • backward compat is preserved, with no change in the default behavior.

There are several things I find itchy with the solution proposed above:

  • I find n_features_to_select defaulting to n_features // 2 potentially confusing when tol=0.; one would rather want max_features=n_features in this case.
  • it would be nice to use automated selection based on score change (tol=0) by default in the future.

@ogrisel
Member
ogrisel commented May 27, 2021

Maybe we could implement the following with an additional max_features:

  • SequentialFeatureSelector() default is equivalent to SequentialFeatureSelector(n_features_to_select=None, max_features=None, tol=None) and implements the current behavior
  • SequentialFeatureSelector() (both n_features_to_select is None and max_features is None) would raise a FutureWarning that in the future tol=0 will be used instead of selecting half of the features by default. We could tell the user to set tol=0 explicitly to silence the warning.
  • SequentialFeatureSelector(n_features_to_select=some_integer) would always respect this (ignoring tol and max_features).
  • SequentialFeatureSelector(tol=0.) would do the automatic selection, implicitly assuming max_features=n_features.
  • SequentialFeatureSelector(tol=0., max_features=some_integer) would do the automatic selection up to that number of features.
  • SequentialFeatureSelector(tol=0., n_features_to_select=some_integer) would raise ValueError with an informative error message
  • SequentialFeatureSelector(n_features_to_select=some_integer, max_features=some_integer_or_float) would raise ValueError with an informative error message

Optionally we could make max_features also accept a float between 0 and 1 to select a percentage of X_train.shape[1] for convenience and consistency with BaggingClassifier and the like.

WDYT @NicolasHug and others?

@paapu88
Author
paapu88 commented May 27, 2021

Just another comment that I forgot to include in the initial request:

It may be the case that adding one more feature will always improve the score (not sure about this). So some sort of threshold for how much the score must improve in order to continue adding features might be needed...

@ogrisel
Member
ogrisel commented May 27, 2021

It may be the case that adding one more feature will always improve the score (not sure about this). So some sort of threshold for how much the score must improve in order to continue adding features might be needed...

I agree, this is the purpose of the tol parameter I introduced above.

@NicolasHug
Member
NicolasHug commented May 27, 2021

@ogrisel you mentioned tol=0, but wouldn't tol=0 be equivalent to tol=None, i.e. equivalent to the current behaviour?

I'm not sure I properly understand this point:

it would be nice to use automated selection based on score change (tol=0) by default in the future.

Regarding this one:

I find n_features_to_select defaulting to n_features // 2 potentially confusing when tol=0.; one would rather want max_features=n_features in this case.

The current default for n_features_to_select is None. Could we extend None to actually mean n_features when tol != None? That would still be backward compatible, and it wouldn't be confusing as long as we properly document this in tol's and n_features_to_select's docstrings.

Either way, this is a form of early stopping. I think I would prefer what I just proposed, but if we're going to introduce more parameters, we might need to also introduce an early_stopping parameter to stay consistent with the rest of the code base.

@ogrisel
Member
ogrisel commented May 27, 2021

tol=0 would make it possible to switch to the new early stopping behavior, while the default tol=None would keep the current behavior during the deprecation cycle.

@ogrisel
Member
ogrisel commented May 27, 2021

The current default for n_features_to_select is None. Could we extend None to actually mean n_features when tol != None? That would still be backward compatible, and it wouldn't be confusing as long as we properly document this in tol's and n_features_to_select's docstrings.

That's an option. I think I like it.

The only remaining question would be: do we plan to enable the early stopping behavior by default in the future? (I think that would be nice). How do we handle the deprecation?

  • tol="warn" whose behavior would be equivalent to tol=None for now with a FutureWarning that mentions that in the future tol=0. will be the default;
  • tol=0 will be the future default value, with early stopping enabled as a result;
  • tol=eps with eps > 0 would also work, enabling early stopping with a minimum score-improvement gap;
  • passing tol=None explicitly makes it possible to keep using the current behavior.
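
A rough sketch of how the tol default could be resolved inside fit() under this plan (illustrative only, not the code that was eventually merged; _resolve_tol is a hypothetical helper):

```python
import warnings

def _resolve_tol(tol):
    # tol="warn" behaves like tol=None for now, but announces the future default.
    if tol == "warn":
        warnings.warn(
            "The default value of tol will change from None to 0. in a future "
            "release. Set tol explicitly to silence this warning.",
            FutureWarning,
        )
        return None
    # tol=None disables early stopping; a float >= 0 enables it.
    return tol
```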

@ogrisel
Member
ogrisel commented Jun 1, 2021

@NicolasHug do you agree with the plan above? If so we can implement it as part of #20145.

@NicolasHug
Member
NicolasHug commented Jun 1, 2021

Yes this sounds good.

An alternative that I slightly prefer would be to directly introduce tol=0 and instead do the deprecation on n_features_to_select:

  • set the default of n_features_to_select to 'warn', indicating that None is deprecated in favor of 'auto';
  • switch the default to 'auto' in 2 versions and remove None.

The reason the default is currently None is for consistency with RFE, but 'auto' is a more accurate name. Since we have to deprecate something anyway, we might as well introduce a better name in the process.

I'm fine with either strategy anyway, so feel free to implement the one you like best.
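
For reference, a usage sketch of roughly the API this alternative leads to (assuming n_features_to_select='auto' plus tol land as discussed in the linked PR; exact defaults here are illustrative):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True)

# Keep adding features only while the CV score improves by at least tol.
sfs = SequentialFeatureSelector(Ridge(), n_features_to_select="auto", tol=1e-3)
sfs.fit(X, y)
print(sfs.get_support().sum(), "features selected")
```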
