FEA Sequential feature selection by rasbt · Pull Request #8684 · scikit-learn/scikit-learn · GitHub

Closed
wants to merge 37 commits into from

Conversation

rasbt
Contributor
@rasbt rasbt commented Apr 1, 2017

Reference Issue

#6545

What does this implement/fix? Explain your changes.

Implements a new class for sequential feature selection

Any other comments?

This is a first PR to pick up the discussion from #6545 and to get some feedback for improvements.
Also, there are some more todos I have in mind as we get further along with this PR:

Todos:

  • Add documentation & examples
  • Add more unit tests (e.g., GridSearchCV)

@rasbt
Contributor Author
rasbt commented Apr 1, 2017

The unit tests are failing partly because of the non-meta-estimator checks, e.g.,

======================================================================
ERROR: /home/travis/build/scikit-learn/scikit-learn/sklearn/tests/test_common.py.test_non_meta_estimators:check_fit1d_1sample(SFS)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/miniconda/envs/testenv/lib/python3.6/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/utils/testing.py", line 741, in __call__
    return self.check(*args, **kwargs)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/utils/testing.py", line 292, in wrapper
    return fn(*args, **kwargs)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/utils/estimator_checks.py", line 640, in check_fit1d_1sample
    estimator = Estimator()
TypeError: __init__() missing 1 required positional argument: 'estimator'

What would be a good way to handle this particular case, since the SFS requires a positional argument estimator that is used to perform the feature selection? Do you think it's a good idea to use a default for that (e.g., logistic regression for classification and linear regression for regression), or would this SFS be better suited as a meta-estimator?

self.forward = forward
self.pre_dispatch = pre_dispatch
self.scoring = scoring
if scoring is None:
Contributor

The scikit-learn convention is not to alter parameters in the __init__ method. Validation must be performed in fit. The Pipeline object is an exception, I think.
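A minimal sketch of that convention (the class and parameter names here are hypothetical, not from the PR): __init__ stores parameters verbatim, and fit performs the validation and sets derived, trailing-underscore attributes.

```python
from sklearn.base import BaseEstimator


class ToySelector(BaseEstimator):
    """Illustrative estimator following the convention discussed above."""

    def __init__(self, scoring=None):
        # Store parameters untouched; no validation or defaulting here,
        # so clone() and get_params()/set_params() round-trip cleanly.
        self.scoring = scoring

    def fit(self, X, y=None):
        # Validation and derived values belong in fit.
        if self.scoring is not None and not isinstance(self.scoring, str):
            raise ValueError("scoring must be a string or None")
        self.scoring_ = self.scoring or "accuracy"
        return self
```

Because __init__ only mirrors its arguments into attributes, `clone(ToySelector(scoring="f1"))` reproduces the estimator exactly.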

self.pre_dispatch = pre_dispatch
self.scoring = scoring
if scoring is None:
if self.estimator._estimator_type == 'classifier':
Contributor
@chkoar chkoar Apr 3, 2017

You could add this to get the type

    @property
    def _estimator_type(self):
        return self.estimator._estimator_type
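For illustration, here is how that property delegates in context (the FakeClassifier stand-in is hypothetical; real scikit-learn classifiers expose the same `_estimator_type` attribute):

```python
from sklearn.base import BaseEstimator, MetaEstimatorMixin


class FakeClassifier:
    """Hypothetical stand-in; real classifiers expose the same attribute."""
    _estimator_type = "classifier"


class SFS(BaseEstimator, MetaEstimatorMixin):
    def __init__(self, estimator):
        self.estimator = estimator

    @property
    def _estimator_type(self):
        # Delegate, so the wrapper reports the wrapped estimator's type.
        return self.estimator._estimator_type


sfs = SFS(FakeClassifier())
```

This way helpers like is_classifier/is_regressor see the wrapper as the same kind of estimator as the one it wraps.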

@GaelVaroquaux
Member
GaelVaroquaux commented Apr 3, 2017 via email

raise ValueError('Estimator must '
'be a Classifier or Regressor.')

if isinstance(scoring, str):
Contributor

You could use the check_scoring from sklearn.metrics.scorer in the fit method to get the scorer
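As a sketch of that suggestion: check_scoring (importable from sklearn.metrics in current releases; it lived in sklearn.metrics.scorer at the time of this PR) resolves a string or None into a scorer callable.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import check_scoring  # sklearn.metrics.scorer in 2017

est = LogisticRegression()
f1_scorer = check_scoring(est, scoring="f1")       # string name -> scorer
default_scorer = check_scoring(est, scoring=None)  # falls back to est.score
# After fitting, a scorer is called as scorer(est, X_test, y_test).
```

So fit can call check_scoring once and use the returned callable everywhere, instead of branching on the estimator type.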

----------
estimator : scikit-learn Classifier or Regressor

k_features : int or tuple (default=1)
Contributor

I would name it n_features_to_select like in RFE.

from ..metrics import get_scorer


class SFS(BaseEstimator, MetaEstimatorMixin):
Contributor

I would expect to derive from SelectorMixin like RFE

Contributor Author

Agreed! I tried that first, though, and ran into some issues, i.e., TypeError: Can't instantiate abstract class SFS with abstract methods _get_support_mask. Do you know what would need to be added to make this work?

Contributor

You should implement the _get_support_mask method. This method must return a boolean array of shape [# input features] (True if the corresponding feature is selected, False otherwise). So, you will get the transform method from SelectorMixin.
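A toy sketch of that pattern (the class name and the feature rule are hypothetical; the import path was sklearn.feature_selection.base at the time of this PR):

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.feature_selection import SelectorMixin


class EveryOtherFeature(SelectorMixin, BaseEstimator):
    """Toy selector that keeps the even-indexed features."""

    def fit(self, X, y=None):
        X = np.asarray(X)
        self.n_features_in_ = X.shape[1]
        return self

    def _get_support_mask(self):
        # Boolean array of shape (n_input_features,); True = feature kept.
        mask = np.zeros(self.n_features_in_, dtype=bool)
        mask[::2] = True
        return mask
```

With the mask implemented, transform and get_support come for free from SelectorMixin.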

@chkoar
Contributor
chkoar commented Apr 3, 2017

Hey @rasbt, thanks for this contribution. I would name it SequentialFeatureSelector or SequentialFeatureSelection. It is not that long and clearly states what you would expect from that class.

@TomDLT
Member
TomDLT commented Apr 3, 2017

would this SFS be better suited as a meta-estimator?

I would say so. You can add it here:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/testing.py#L468

@rasbt
Contributor Author
rasbt commented Apr 3, 2017

Thanks for the feedback everyone. Will try to address all of your points soon and get an updated PR up.

@GaelVaroquaux @chkoar

You need to avoid using "SFS", as it's hard to know what it means. Maybe
"SeqFeatureSelection"?
...
would name it SequentialFeatureSelector or SequentialFeatureSelection

Completely agree with you both. I would prefer SequentialFeatureSelector (that's also how I called the orig implementation) -- if you think that's not too long, @GaelVaroquaux . Otherwise, we could use SeqFeatureSelection / SeqFeatureSelector like you suggested. (I went with SFS because of RFE, but I agree, no one would know what SFS is :))

@rasbt
Contributor Author
rasbt commented Apr 3, 2017

would this SFS be better suited as a meta-estimator?

I would say so. You can add it here:

Thanks @TomDLT , will make it a meta-estimator :) -- makes sense since it is a "wrapper algorithm" for feature selection.

@GaelVaroquaux
Member
GaelVaroquaux commented Apr 4, 2017 via email

@rasbt
Contributor Author
rasbt commented Apr 4, 2017

Sounds good!

In the meantime, I addressed all your code suggestions, @chkoar -- thanks for the very useful feedback! The unit tests seem to pass now as well. Let me know if you have further suggestions. I will try to get to the documentation in the next couple of days.

self.scoring = scoring
self.cv = cv
self.n_jobs = n_jobs
self.subsets_ = {}
Member

This should go in fit, shouldn't it?

Contributor

Conventionally, yes.

Contributor Author

good catch! will fix that in a sec

self : object

"""
if self.scoring is None:
Contributor

I think that we could remove these four lines, no?

check_scoring(self.estimator, scoring=self.scoring)

Contributor

Sorry for my ignorance, but is there a specific reason for attaching the scorer in the instance itself?

Contributor Author
@rasbt rasbt Apr 6, 2017

I think if we removed those lines, the evaluation metric would always be either 'accuracy' (for classifiers) or 'r2' (for regressors). I think these are good defaults, but users may prefer different metrics for the feature subset selection.

Contributor

By using this one

check_scoring(self.estimator, scoring=self.scoring)

a user could pass whatever metric they want.

Contributor Author

Oh nice, didn't know that! In this case, we can remove those lines, of course.

specific reason for attaching the scorer in the instance itself?

It's used in several helper methods, so it would be easier to fetch it once instead of passing it around as a function argument to the helper methods, I guess. I renamed it to self._scorer, but we can also completely remove it; I just thought that it makes the code a bit more verbose.

@jorisvandenbossche
Member

I would like to see this added to scikit-learn. @rasbt Is there anything I can help with, or does it mainly need review of core developers?

@glemaitre
Member

It looks something that could be pushed during the sprint (I am starting to make a list ;))

@rasbt
Contributor Author
rasbt commented Jun 3, 2017

I am happy to make any modifications if necessary. Other than that, I think it's just the documentation that's missing. I would take the one at http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/ as a template and simplify/modify it a bit so that it is consistent with the code as it is currently implemented.

@jorisvandenbossche I don't have anything particular in mind regarding potential changes, but if you have any comments regarding the code, please let me know.

@austinmw

Patiently waiting for forward selection based on R2 and/or p-vals to be added 😊

@rasbt
Contributor Author
rasbt commented Jun 14, 2017

Unless there's a bug, it should already support 'R2'-based feature selection. Regarding the p-values, I don't think adding this is a good idea, because approaches like sequential feature selection with k-fold cross-validation would violate the assumptions of typical statistical hypothesis tests, I'd say.
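To illustrate the R²-based selection on synthetic data (hypothetical example, not code from the PR), here is one forward-selection step done by hand with cross_val_score and scoring='r2':

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(60, 4)
y = X[:, 0] + 0.1 * rng.randn(60)  # only feature 0 is informative

est = LinearRegression()
# One forward-selection step: score each single-feature candidate by
# cross-validated R^2 and keep the best one.
scores = {j: cross_val_score(est, X[:, [j]], y, cv=5, scoring="r2").mean()
          for j in range(X.shape[1])}
best = max(scores, key=scores.get)  # the informative feature wins
```

Any scorer string accepted by cross_val_score works the same way, which is why the selection criterion is handled via scoring rather than p-values.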

@amueller
Member

mrg conflict? CI error? ;)

@rasbt
Contributor Author
rasbt commented Jul 15, 2017

mrg conflict? CI error? ;)

Ehm, yeah ... everything is under control now. It has been such a long time since I started this, and it desperately needed a rebase.

@scikit-learn scikit-learn deleted a comment from codecov bot Jul 17, 2017
@rasbt
Contributor Author
rasbt commented Jul 18, 2017

I was about to set up the documentation for this SFS class and remembered that there are currently two implementations for recursive feature elimination: RFE and RFECV. Now, I was wondering if this SFS class should be split up in the same way, i.e., having an SFS and an SFSCV? Or would that be overkill?

(Note that the current SFS implementation has a cv argument like RFECV; this class would hence become SFSCV.)

@glemaitre @chkoar @amueller , what do you think?

@jnothman
Member
jnothman commented Jul 18, 2017 via email

@rasbt
Contributor Author
rasbt commented Jul 18, 2017

Yeah you are right, SFS & RFE are really doing different things. Since SFS evaluates subsets based on estimator performance, I think it wouldn't make much sense to have an SFS without cv -- it would hopelessly overfit.

@rasbt
Contributor Author
rasbt commented Feb 27, 2018

Current todos (to keep track of what I need to do next based on the last round of feedback from @jnothman )

  • Add more unit tests (e.g., GridSearchCV)
  • change parallelism (from cross_val_score to feature subsets) {after unit tests are added}
  • rewrite to use separate code paths for select_in_range (set max and min to ints) {after unit tests are added}
  • refactor inclusion and exclusion {after unit tests are added}
  • reg. np.nanmean, handle it like it's done in GridSearchCV using error_score (see [MRG+1] Change default error_score to raise_deprecating #10677) {after unit tests are added}
  • Add documentation & examples

Edit:

The plan for moving forward would be to address the items in the order they are listed (unit tests first, then the rewrites, to make sure everything is rewritten properly with the least amount of hassle)

@jnothman jnothman changed the title SFS class for sequential feature selection FEA Sequential feature selection Mar 4, 2018
Member
@jnothman jnothman left a comment

Sorry for the incompleteness of my reviews. Why are there two test files?
Please add at least narrative docs, and perhaps an example (or modify one) comparing this to other feature selection methods in terms of runtime and optimality.
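For reference, the feature eventually landed in scikit-learn as sklearn.feature_selection.SequentialFeatureSelector; a minimal comparison against RFE along the lines jnothman suggests (dataset choice and parameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
est = LogisticRegression(max_iter=1000)

# SFS scores candidate subsets by cross-validated estimator performance ...
sfs = SequentialFeatureSelector(est, n_features_to_select=2, cv=3).fit(X, y)
# ... while RFE ranks features by the fitted model's coefficients.
rfe = RFE(est, n_features_to_select=2).fit(X, y)

print("SFS:", sfs.get_support())
print("RFE:", rfe.get_support())
```

SFS refits the estimator for every candidate subset and CV fold, so it is slower than RFE but can capture feature interactions that coefficient-based ranking misses.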

n_features : int or tuple (default=1)
An integer arguments specifies the number of features to select,
where n_features < the full feature set.
Optionally, a tuple containing a min and max value can be provided
Member

Optionally -> alternatively

In all other cases, ``KFold`` is used.

n_jobs : int (default=1)
The number of CPUs to use for cross validation.
Member

See :term:`the Glossary <n_jobs>`


subsets_ : dict
A dictionary containing the best selected feature subset
for each feature subset size selected by the
Member

Selected -> evaluated

@amueller
Member
amueller commented Aug 2, 2019

no-one is working on this right now, right?

@rasbt
Contributor Author
rasbt commented Aug 2, 2019

I don't think anyone is working on it right now. Somehow, I lost track of this. Is there still interest in adding it? I think we were almost done back then, only docs and tests were missing

@amueller
Member
amueller commented Aug 2, 2019

I'm definitely still interested in adding this!

@rasbt
Contributor Author
rasbt commented Aug 2, 2019

Same here! Seriously lost track of it at some point. I can get back to it mid-August (currently traveling).

@amueller
Member
amueller commented Aug 5, 2019

Awesome, thanks!

@NicolasHug
Member

Hi @rasbt , if that's OK with you I'll start finalizing this soon in another PR. But please LMK if you'd still want to finish it! Cheers

@rasbt
Contributor Author
rasbt commented May 7, 2020

No worries, that would be okay with me. I am really busy these days & I feel bad for not finishing this, and I'd actually appreciate it if someone else could :)
