-
-
Notifications
You must be signed in to change notification settings - Fork 26k
[MRG] refactor feature selection by score #2093
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
"The old attribute will be removed in 0.15.", DeprecationWarning) | ||
return self.stop_words_ | ||
|
||
def restrict(self, support, indices=False, return_removed=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before someone else points it out, I think this could do with another few moments' work: fix the docstring and test. I also don't see the point in the indices
parameter, but copied it from DictVectorizer
.
Somehow I tend to say "if it's not broken don't fix it" can you clarify what is broken now? you need to realize that currently 2-3 people know this code very well and can fix it. |
Yes, I'm aware that's an issue. In part I've posted this because it's a cleaner version of something I had lying around, and I wanted it out of my way. Of course, bounding scores to create a mask is trivial, though limiting isn't. So "broken" may too strong a word, but here are some annoyances that I believe are best addressed through shared code, admittedly none of which I have mentioned above, nor properly documented, mostly because I had forgotten to do so:
In short, limiting the number of features should be an option anywhere bounding by value is (or it should be easy for someone to do so), and it's not trivial to implement. At best, these annoyances suggest that this PR is a bit premature and requires documentation. At worst, they suggest this PR is both unnecessary and unwanted. Fine. |
Is there a way to force Travis to re-check this, having rebased and added another commit? |
travis still checking this???? |
Closing this until it can be more strongly motivated. |
A number of places across the package perform feature selection by score, bounding the scores (specified absolutely or relatively) and/or limiting the number of features (specified absolutely or relatively).
While I think a
mask_by_score
utility could be useful for anyone playing with feature selection, I have particularly used it to assure correctness and common functionality in the many places this selection appears (univariate selection, learnt selector mixin, randomized l1, feature extraction).I am not sure whether
mask_by_score
, or its realisation as a transformer,SelectByScore
, should be part of the public API, or how they might come into examples or narrative documentation, and opinions are certainly welcome. It may be confusing thatunivariate_selection
is also selection by score, but there thescore_func
returns both scores and p-values, and here we don't care what the scores are as long as they are orderable.And I apologize for not having a negative total line count (but I think we lose a couple of lines if we remove comments, tests and blank lines)!