FEA Sequential feature selection by rasbt · Pull Request #8684 · scikit-learn/scikit-learn · GitHub

Closed
wants to merge 37 commits into from

Conversation

rasbt
Contributor
@rasbt rasbt commented Apr 1, 2017

Reference Issue

#6545

What does this implement/fix? Explain your changes.

Implements a new class for sequential feature selection

Any other comments?

This is a first PR to pick up the discussion from #6545 and to get some feedback for improvements.
Also, there are some more todos I have in mind as we get further along with this PR:

Todos:

  • Add documentation & examples
  • Add more unit tests (e.g., GridSearchCV)

@rasbt
Contributor Author
rasbt commented Apr 1, 2017

The unit tests are failing partly because of the non-meta-estimator checks, e.g.,

======================================================================
ERROR: /home/travis/build/scikit-learn/scikit-learn/sklearn/tests/test_common.py.test_non_meta_estimators:check_fit1d_1sample(SFS)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/miniconda/envs/testenv/lib/python3.6/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/utils/testing.py", line 741, in __call__
    return self.check(*args, **kwargs)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/utils/testing.py", line 292, in wrapper
    return fn(*args, **kwargs)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/utils/estimator_checks.py", line 640, in check_fit1d_1sample
    estimator = Estimator()
TypeError: __init__() missing 1 required positional argument: 'estimator'

What would be a good way to handle this particular case, since the SFS requires a positional argument estimator that is used to perform the feature selection? Do you think it's a good idea to use a default for that (e.g., logistic regression for classification and linear regression for regression), or would this SFS be better suited as a meta-estimator?

self.forward = forward
self.pre_dispatch = pre_dispatch
self.scoring = scoring
if scoring is None:
Contributor

The scikit-learn convention is not to alter parameters in the __init__ method. Validation must be performed in fit. The Pipeline object is an exception, I think.
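A minimal sketch of that convention (the class and parameter names here are hypothetical, not from the PR): __init__ stores parameters verbatim, and fit performs the validation and sets derived, trailing-underscore attributes.

```python
from sklearn.base import BaseEstimator


class ToySelector(BaseEstimator):
    """Illustrative estimator following the convention discussed above."""

    def __init__(self, scoring=None):
        # Store parameters untouched; no validation or defaulting here,
        # so clone() and get_params()/set_params() round-trip cleanly.
        self.scoring = scoring

    def fit(self, X, y=None):
        # Validation and derived values belong in fit.
        if self.scoring is not None and not isinstance(self.scoring, str):
            raise ValueError("scoring must be a string or None")
        self.scoring_ = self.scoring or "accuracy"
        return self
```

Because __init__ only mirrors its arguments into attributes, `clone(ToySelector(scoring="f1"))` reproduces the estimator exactly.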

self.pre_dispatch = pre_dispatch
self.scoring = scoring
if scoring is None:
if self.estimator._estimator_type == 'classifier':
Contributor
@chkoar chkoar Apr 3, 2017

You could add this to get the type

    @property
    def _estimator_type(self):
        return self.estimator._estimator_type
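For illustration, here is how that property delegates in context (the FakeClassifier stand-in is hypothetical; real scikit-learn classifiers expose the same `_estimator_type` attribute):

```python
from sklearn.base import BaseEstimator, MetaEstimatorMixin


class FakeClassifier:
    """Hypothetical stand-in; real classifiers expose the same attribute."""
    _estimator_type = "classifier"


class SFS(BaseEstimator, MetaEstimatorMixin):
    def __init__(self, estimator):
        self.estimator = estimator

    @property
    def _estimator_type(self):
        # Delegate, so the wrapper reports the wrapped estimator's type.
        return self.estimator._estimator_type


sfs = SFS(FakeClassifier())
```

This way helpers like is_classifier/is_regressor see the wrapper as the same kind of estimator as the one it wraps.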

@GaelVaroquaux
Member
GaelVaroquaux commented Apr 3, 2017 via email

raise ValueError('Estimator must '
'be a Classifier or Regressor.')

if isinstance(scoring, str):
Contributor

You could use the check_scoring from sklearn.metrics.scorer in the fit method to get the scorer
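As a sketch of that suggestion: check_scoring (importable from sklearn.metrics in current releases; it lived in sklearn.metrics.scorer at the time of this PR) resolves a string or None into a scorer callable.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import check_scoring  # sklearn.metrics.scorer in 2017

est = LogisticRegression()
f1_scorer = check_scoring(est, scoring="f1")       # string name -> scorer
default_scorer = check_scoring(est, scoring=None)  # falls back to est.score
# After fitting, a scorer is called as scorer(est, X_test, y_test).
```

So fit can call check_scoring once and use the returned callable everywhere, instead of branching on the estimator type.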

----------
estimator : scikit-learn Classifier or Regressor

k_features : int or tuple (default=1)
Contributor

I would name it n_features_to_select like in RFE.

from ..metrics import get_scorer


class SFS(BaseEstimator, MetaEstimatorMixin):
Contributor

I would expect to derive from SelectorMixin like RFE

Contributor Author

Agreed! I tried that first, though, and ran into some issues, i.e., TypeError: Can't instantiate abstract class SFS with abstract methods _get_support_mask. Do you know what would need to be added to make this work?

Contributor

You should implement the _get_support_mask method. This method must return a boolean array of shape [# input features] (True if the corresponding feature is selected, False otherwise). So, you will get the transform method from SelectorMixin.
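A toy sketch of that pattern (the class name and the feature rule are hypothetical; the import path was sklearn.feature_selection.base at the time of this PR):

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.feature_selection import SelectorMixin


class EveryOtherFeature(SelectorMixin, BaseEstimator):
    """Toy selector that keeps the even-indexed features."""

    def fit(self, X, y=None):
        X = np.asarray(X)
        self.n_features_in_ = X.shape[1]
        return self

    def _get_support_mask(self):
        # Boolean array of shape (n_input_features,); True = feature kept.
        mask = np.zeros(self.n_features_in_, dtype=bool)
        mask[::2] = True
        return mask
```

With the mask implemented, transform and get_support come for free from SelectorMixin.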

@chkoar
Contributor
chkoar commented Apr 3, 2017

Hey @rasbt, thanks for this contribution. I would name it SequentialFeatureSelector or SequentialFeatureSelection. It is not that long and clearly states what you would expect from that class.

@TomDLT
Member
TomDLT commented Apr 3, 2017

would this SFS be better suited as a meta-estimator?

I would say so. You can add it here:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/testing.py#L468

@rasbt
Contributor Author
rasbt commented Apr 3, 2017

Thanks for the feedback everyone. Will try to address all of your points soon and get an updated PR up.

@GaelVaroquaux @chkoar

You need to avoid using "SFS", as it's hard to know what it means. Maybe
"SeqFeatureSelection"?
...
would name it SequentialFeatureSelector or SequentialFeatureSelection

Completely agree with you both. I would prefer SequentialFeatureSelector (that's also how I called the orig implementation) -- if you think that's not too long, @GaelVaroquaux . Otherwise, we could use SeqFeatureSelection / SeqFeatureSelector like you suggested. (I went with SFS because of RFE, but I agree, no one would know what SFS is :))

@rasbt
Contributor Author
rasbt commented Apr 3, 2017

would this SFS be better suited as a meta-estimator?

I would say so. You can add it here:

Thanks @TomDLT , will make it a meta-estimator :) -- makes sense since it is a "wrapper algorithm" for feature selection.

@GaelVaroquaux
Member
GaelVaroquaux commented Apr 4, 2017 via email

@rasbt
Contributor Author
rasbt commented Apr 4, 2017

Sounds good!

In the meantime, I addressed all your code suggestions, @chkoar -- thanks for the very useful feedback! The unit tests seem to pass now as well. Let me know if you have further suggestions. I will try to get to the documentation in the next couple of days.

self.scoring = scoring
self.cv = cv
self.n_jobs = n_jobs
self.subsets_ = {}
Member

This should go in fit, shouldn't it?

Contributor

Conventionally, yes.

Contributor Author

good catch! will fix that in a sec

self : object

"""
if self.scoring is None:
Contributor

I think that we could remove these four lines, no?

check_scoring(self.estimator, scoring=self.scoring)

Contributor

Sorry for my ignorance, but is there a specific reason for attaching the scorer in the instance itself?

Contributor Author
@rasbt rasbt Apr 6, 2017

I think if we removed those lines, the evaluation metric would always be either 'accuracy' (for classifiers) or 'r2' (for regressors). I think these are good defaults, but users may prefer different metrics for the feature subset selection.

Contributor

By using this one

check_scoring(self.estimator, scoring=self.scoring)

a user could pass whatever metric they want.

Contributor Author

Oh nice, didn't know that! In this case, we can remove those lines, of course.

specific reason for attaching the scorer in the instance itself?

It's used in several helper methods, so it would be easier to fetch it once instead of passing it around as a function argument to the helper methods, I guess. I renamed it to self._scorer, but we can also completely remove it; I just thought that it makes the code a bit more verbose.

@jorisvandenbossche
Member

I would like to see this added to scikit-learn. @rasbt Is there anything I can help with, or does it mainly need review of core developers?

@glemaitre
Member

It looks something that could be pushed during the sprint (I am starting to make a list ;))

@rasbt
Contributor Author
rasbt commented Jun 3, 2017

I am happy to make any modifications if necessary. Other than that, I think it's just the documentation that's missing. I would take the one at http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/ as a template and simplify/modify it a bit so that it is consistent with the code as it is currently implemented.

@jorisvandenbossche I don't have anything particular in mind regarding potential changes, but if you have any comments regarding the code, please let me know.

@austinmw

Patiently waiting for forward selection based on R2 and/or p-vals to be added 😊

@rasbt
Contributor Author
rasbt commented Jun 14, 2017

Unless there's a bug, it should already support 'R2'-based feature selection. Regarding the p-values, I don't think adding this is a good idea, because approaches like sequential feature selection with k-fold cross-validation would violate the assumptions of typical statistical hypothesis tests, I'd say.
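To illustrate the R²-based selection on synthetic data (hypothetical example, not code from the PR), here is one forward-selection step done by hand with cross_val_score and scoring='r2':

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(60, 4)
y = X[:, 0] + 0.1 * rng.randn(60)  # only feature 0 is informative

est = LinearRegression()
# One forward-selection step: score each single-feature candidate by
# cross-validated R^2 and keep the best one.
scores = {j: cross_val_score(est, X[:, [j]], y, cv=5, scoring="r2").mean()
          for j in range(X.shape[1])}
best = max(scores, key=scores.get)  # the informative feature wins
```

Any scorer string accepted by cross_val_score works the same way, which is why the selection criterion is handled via scoring rather than p-values.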

@amueller
Member

mrg conflict? CI error? ;)

@rasbt
Contributor Author
rasbt commented Jul 15, 2017

mrg conflict? CI error? ;)

Ehm, yeah ... everything is under control now. It has been such a long time since I started this, and it desperately needed a rebase.

@scikit-learn scikit-learn deleted a comment from codecov bot Jul 17, 2017
@rasbt
Contributor Author
rasbt commented Jul 18, 2017

I was about to set up the documentation for this SFS class and remembered that there are currently two implementations for recursive feature elimination: RFE and RFECV. Now, I was wondering if this SFS class should be split up in the same way, i.e., having an SFS and an SFSCV? Or would that be overkill?

(Note that the current SFS implementation has a cv argument like RFECV; this class would hence become SFSCV.)

@glemaitre @chkoar @amueller , what do you think?

@jnothman
Member
jnothman commented Jul 18, 2017 via email

@rasbt
Contributor Author
rasbt commented Jul 18, 2017

Yeah you are right, SFS & RFE are really doing different things. Since SFS evaluates subsets based on estimator performance, I think it wouldn't make much sense to have an SFS without cv -- it would hopelessly overfit.

@rasbt
Contributor Author
rasbt commented Feb 27, 2018

Current todos (to keep track of what I need to do next based on the last round of feedback from @jnothman )

  • Add more unit tests (e.g., GridSearchCV)
  • change parallelism (from cross_val_score to feature subsets) {after unit tests are added}
  • rewrite to use separate code paths for select_in_range (set max and min to ints) {after unit tests are added}
  • refactor inclusion and exclusion {after unit tests are added}
  • reg. np.nanmean, handle it like it's done in GridSearchCV using error_score (see [MRG+1] Change default error_score to raise_deprecating #10677) {after unit tests are added}
  • Add documentation & examples

Edit:

The plan for moving forward would be to address the items in the order they are listed (unit tests first, then the rewrites, to make sure everything is rewritten properly with the least amount of hassle)

@jnothman jnothman changed the title SFS class for sequential feature selection FEA Sequential feature selection Mar 4, 2018
Member
@jnothman jnothman left a comment

Sorry for the incompleteness of my reviews. Why are there two test files?
Please add at least narrative docs, and perhaps an example (or modify one) comparing this to other feature selection methods in terms of runtime and optimality.
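For reference, the feature eventually landed in scikit-learn as sklearn.feature_selection.SequentialFeatureSelector; a minimal comparison against RFE along the lines jnothman suggests (dataset choice and parameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
est = LogisticRegression(max_iter=1000)

# SFS scores candidate subsets by cross-validated estimator performance ...
sfs = SequentialFeatureSelector(est, n_features_to_select=2, cv=3).fit(X, y)
# ... while RFE ranks features by the fitted model's coefficients.
rfe = RFE(est, n_features_to_select=2).fit(X, y)

print("SFS:", sfs.get_support())
print("RFE:", rfe.get_support())
```

SFS refits the estimator for every candidate subset and CV fold, so it is slower than RFE but can capture feature interactions that coefficient-based ranking misses.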

n_features : int or tuple (default=1)
An integer arguments specifies the number of features to select,
where n_features < the full feature set.
Optionally, a tuple containing a min and max value can be provided
Member

Optionally -> alternatively

In all other cases, ``KFold`` is used.

n_jobs : int (default=1)
The number of CPUs to use for cross validation.
Member

See :term:`the Glossary <n_jobs>`


subsets_ : dict
A dictionary containing the best selected feature subset
for each feature subset size selected by the
Member

Selected -> evaluated

@amueller
Member
amueller commented Aug 2, 2019

no-one is working on this right now, right?

@rasbt
Contributor Author
rasbt commented Aug 2, 2019

I don't think anyone is working on it right now. Somehow, I lost track of this. Is there still interest in adding it? I think we were almost done back then, only docs and tests were missing

@amueller
Member
amueller commented Aug 2, 2019

I'm definitely still interested in adding this!

@rasbt
Contributor Author
rasbt commented Aug 2, 2019

Same here! Seriously lost track of it at some point. I can get back to it mid-August (currently traveling).

@amueller
Member
amueller commented Aug 5, 2019

Awesome, thanks!

@NicolasHug
Member

Hi @rasbt , if that's OK with you I'll start finalizing this soon in another PR. But please LMK if you'd still want to finish it! Cheers

@rasbt
Contributor Author
rasbt commented May 7, 2020

No worries, that would be okay with me. I am really busy these days & I feel bad for not finishing this, and I'd actually appreciate it if someone else could :)
