FEA Sequential feature selection #8684
Conversation
The unit tests are failing, partly because of the non-meta-estimator checks. What would be a good way to handle this particular case, since the SFS requires a positional argument?
sklearn/feature_selection/sfs.py (outdated)

    self.forward = forward
    self.pre_dispatch = pre_dispatch
    self.scoring = scoring
    if scoring is None:
The scikit-learn convention is not to alter parameters in the `__init__` method. Validation must be performed in `fit`. The `Pipeline` object is an exception, I think.
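For illustration, a minimal sketch of that convention; `MySelector` and its attributes are hypothetical, not the PR's code. Keeping `__init__` free of logic is what makes `clone()` and grid search behave correctly:

    from sklearn.base import BaseEstimator

    class MySelector(BaseEstimator):  # hypothetical estimator for illustration
        def __init__(self, estimator, scoring=None):
            # store parameters exactly as passed -- no validation or
            # transformation, so get_params/set_params and clone() work
            self.estimator = estimator
            self.scoring = scoring

        def fit(self, X, y):
            # validation and any derived values belong here instead
            if self.scoring is not None and not callable(self.scoring) \
                    and not isinstance(self.scoring, str):
                raise ValueError("scoring must be None, a string, or a callable")
            return self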
sklearn/feature_selection/sfs.py (outdated)

    self.pre_dispatch = pre_dispatch
    self.scoring = scoring
    if scoring is None:
        if self.estimator._estimator_type == 'classifier':
You could add this to get the type:

    @property
    def _estimator_type(self):
        return self.estimator._estimator_type

You need to avoid using "SFS", as it's hard to know what it means. Maybe "SeqFeatureSelection"?
sklearn/feature_selection/sfs.py (outdated)

    raise ValueError('Estimator must '
                     'be a Classifier or Regressor.')

    if isinstance(scoring, str):
You could use `check_scoring` from `sklearn.metrics.scorer` in the `fit` method to get the scorer.
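For reference, a small sketch of that usage; note the import path was `sklearn.metrics.scorer` at the time of this PR, while newer releases expose `check_scoring` directly from `sklearn.metrics`:

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import check_scoring  # sklearn.metrics.scorer in older releases

    est = LogisticRegression()
    # scoring=None falls back to the estimator's default score method;
    # a metric name such as 'accuracy' resolves to the named scorer
    scorer = check_scoring(est, scoring='accuracy')
    # the result is then called as scorer(fitted_estimator, X_test, y_test)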
sklearn/feature_selection/sfs.py (outdated)

    ----------
    estimator : scikit-learn Classifier or Regressor

    k_features : int or tuple (default=1)
I would name it `n_features_to_select`, like in RFE.
sklearn/feature_selection/sfs.py
Outdated
from ..metrics import get_scorer | ||
|
||
|
||
class SFS(BaseEstimator, MetaEstimatorMixin): |
I would expect it to derive from `SelectorMixin`, like RFE.
Agreed! I tried that first, though, and was having some issues, i.e., `TypeError: Can't instantiate abstract class SFS with abstract methods _get_support_mask`. Do you know what would need to be added to make this work?
You should implement the `_get_support_mask` method. It must return a boolean array of shape [# input features] (True if the corresponding feature is selected, False otherwise). You will then get the `transform` method from `SelectorMixin`.
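A minimal sketch of what that could look like; `n_features_` and `selected_idx_` are hypothetical fit attributes used here only for illustration:

    import numpy as np
    from sklearn.base import BaseEstimator, MetaEstimatorMixin
    from sklearn.feature_selection import SelectorMixin  # sklearn.feature_selection.base in older releases

    class SequentialFeatureSelector(SelectorMixin, MetaEstimatorMixin, BaseEstimator):
        # __init__ and fit omitted; fit is assumed to set
        # self.n_features_ (input feature count) and
        # self.selected_idx_ (indices of the chosen features)

        def _get_support_mask(self):
            # boolean array of shape (n_input_features,), True where selected;
            # SelectorMixin then supplies transform() and inverse_transform()
            mask = np.zeros(self.n_features_, dtype=bool)
            mask[list(self.selected_idx_)] = True
            return mask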
Hey @rasbt, thanks for this contribution. I would name it …

I would say so. You can add it here: …
Thanks for the feedback everyone. Will try to address all of your points soon and get an updated PR up.
Thanks @TomDLT, will make it a meta-estimator :) -- makes sense since it is a "wrapper algorithm" for feature selection.
Completely agree with you both. I would prefer SequentialFeatureSelector (that's also how I called the original implementation) -- if you think that's not too long, @GaelVaroquaux.
I am OK with that.
> I went with SFS because of RFE

I am not sure that today we would name RFE like that, with hindsight.
Sounds good! In the meantime, I addressed all your code suggestions, @chkoar -- thanks for the very useful feedback! The unit tests seem to pass now as well. Let me know if you have further suggestions. I will try to get to the documentation in the next couple of days.
    self.scoring = scoring
    self.cv = cv
    self.n_jobs = n_jobs
    self.subsets_ = {}
This should go in `fit`, shouldn't it?
Conventionally, yes.
good catch! will fix that in a sec
    self : object

    """
    if self.scoring is None:
I think that we could remove these four lines, no?

    check_scoring(self.estimator, scoring=self.scoring)
Sorry for my ignorance, but is there a specific reason for attaching the scorer to the instance itself?
I think if we removed those lines, the evaluation metric would always be either 'accuracy' (for classifiers) or 'r2' (for regressors). I think these are good defaults, but users may prefer different metrics for the feature subset selection.
By using this:

    check_scoring(self.estimator, scoring=self.scoring)

a user could pass whatever metric he/she wants.
Oh nice, didn't know that! In this case, we can remove those lines, of course.

> specific reason for attaching the scorer to the instance itself?

It's used in several helper methods, so it would be easier to fetch it once instead of passing it around as a function argument to the helper methods, I guess. I renamed it to `self._scorer`, but we can also completely remove it; I just thought it made the code a bit less verbose.
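To illustrate the point about custom metrics, a quick sketch (the metric choices are just examples; the `check_scoring` import path varies by scikit-learn version, as noted above):

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import check_scoring, make_scorer, f1_score

    est = LogisticRegression()
    # a built-in metric name works ...
    scorer = check_scoring(est, scoring='f1_macro')
    # ... and so does any callable built with make_scorer
    scorer = check_scoring(est, scoring=make_scorer(f1_score, average='macro'))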
I would like to see this added to scikit-learn. @rasbt Is there anything I can help with, or does it mainly need review from core developers?
It looks like something that could be pushed during the sprint (I am starting to make a list ;)).
I am happy to make any modifications if necessary. Other than that, I think it's just the documentation that's missing. I would take the one at http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/ as a template and simplify/modify it a bit so that it is consistent with the code as currently implemented. @jorisvandenbossche I don't have anything particular in mind regarding potential changes, but if you have any comments regarding the code, please let me know.
Patiently waiting for forward selection based on R2 and/or p-vals to be added 😊
Unless there's a bug, it should already support R2-based feature selection. Regarding the p-values, I don't think adding this is a good idea, because approaches like sequential feature selection with k-fold cross-validation would violate the assumptions of typical statistical hypothesis tests, I'd say.
Merge conflict? CI error? ;)
Ehm, yeah ... everything is under control now. It has been such a long time since I started this, and it desperately needed a rebase.
I was about to set up the documentation for this SFS class and remembered that there are currently two implementations for recursive feature elimination: RFE and RFECV. Now, I was wondering if this SFS class should be split up in the same way, i.e., having an SFS and an SFSCV? Or would that be overkill? (Note that the current SFS implementation has a cv argument like RFECV; this class would hence become SFSCV.) @glemaitre @chkoar @amueller, what do you think?
I think a single class would be acceptable. I think, but I might be wrong, that RFECV is a bit different in that it performs a full RFE fit for each train-test split and then aggregates over their results, rather than making each evaluation using CV.
Yeah, you are right, SFS & RFE are really doing different things. Since SFS evaluates subsets based on estimator performance, I think it wouldn't make much sense to have an SFS without cv.
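For context, the core idea of sequential forward selection with CV baked into every evaluation can be sketched roughly like this (a simplification, not the actual implementation in this PR):

    import numpy as np
    from sklearn.model_selection import cross_val_score

    def forward_select(estimator, X, y, n_features_to_select, cv=5):
        # greedily grow the subset: at each step, add the one remaining
        # feature whose inclusion yields the best mean cross-validated score
        selected, remaining = [], list(range(X.shape[1]))
        while len(selected) < n_features_to_select:
            scores = [cross_val_score(estimator, X[:, selected + [j]], y, cv=cv).mean()
                      for j in remaining]
            best = remaining[int(np.argmax(scores))]
            selected.append(best)
            remaining.remove(best)
        return selected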
Current todos (to keep track of what I need to do next, based on the last round of feedback from @jnothman):

Edit: The plan for moving forward would be to address the items in the order they are listed (unit tests first, then the rewrites, to make sure everything is rewritten properly with the least amount of hassle).
Hi, sorry for the incompleteness of my reviews. Why are there two test files?
Please add at least narrative docs, and perhaps an example (or modify one) comparing this to other feature selection methods in terms of runtime and optimality.
    n_features : int or tuple (default=1)
        An integer argument specifies the number of features to select,
        where n_features < the full feature set.
        Optionally, a tuple containing a min and max value can be provided
Optionally -> alternatively
    In all other cases, ``KFold`` is used.

    n_jobs : int (default=1)
        The number of CPUs to use for cross validation.
See :term:`the Glossary <n_jobs>`
    subsets_ : dict
        A dictionary containing the best selected feature subset
        for each feature subset size selected by the
Selected -> evaluated
No one is working on this right now, right?
I don't think anyone is working on it right now. Somehow, I lost track of this. Is there still interest in adding it? I think we were almost done back then; only docs and tests were missing.
I'm definitely still interested in adding this!
Same here! I seriously lost track of it at some point. I can get back to it mid-August (currently traveling).
Awesome, thanks!
Hi @rasbt, if that's OK with you, I'll start finalizing this soon in another PR. But please LMK if you'd still want to finish it! Cheers.
No worries, that would be okay with me. I am really busy these days & I feel bad for not finishing this, and I'd actually appreciate it if someone else could :)
Reference Issue
#6545
What does this implement/fix? Explain your changes.
Implements a new class for sequential feature selection
Any other comments?
This is a first PR to pick up the discussion from #6545 and to get some feedback for improvements.
Also, there are some more todos I have in mind when we get further down this PR:
Todos: