[MRG] refactor feature selection by score by jnothman · Pull Request #2093 · scikit-learn/scikit-learn

Closed
wants to merge 2 commits

Conversation

jnothman (Member)

A number of places across the package perform feature selection by score, bounding the scores (specified absolutely or relatively) and/or limiting the number of features (specified absolutely or relatively).

While I think a mask_by_score utility could be useful for anyone playing with feature selection, I have particularly used it to assure correctness and common functionality in the many places this selection appears (univariate selection, learnt selector mixin, randomized l1, feature extraction).

I am not sure whether mask_by_score, or its realisation as a transformer, SelectByScore, should be part of the public API, or how they might come into examples or narrative documentation, and opinions are certainly welcome. It may be confusing that univariate_selection is also selection by score, but there the score_func returns both scores and p-values, and here we don't care what the scores are as long as they are orderable.

And I apologize for not having a negative total line count (but I think we lose a couple of lines if we remove comments, tests and blank lines)!
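For readers unfamiliar with the idea, here is a minimal sketch of what a mask_by_score-style utility might look like. The signature, default behaviour, and tie handling below are assumptions for illustration, not taken from this PR's diff:

```python
import numpy as np

def mask_by_score(scores, minimum=None, maximum=None, limit=None):
    """Boolean mask selecting features by score (illustrative sketch only).

    Bounds the scores to [minimum, maximum] and, if ``limit`` is given,
    keeps at most that many of the highest-scoring in-bounds features.
    The real utility in the PR may differ in signature and tie handling.
    """
    scores = np.asarray(scores, dtype=float)
    mask = np.ones(scores.shape, dtype=bool)
    if minimum is not None:
        mask &= scores >= minimum
    if maximum is not None:
        mask &= scores <= maximum
    if limit is not None and mask.sum() > limit:
        # Among the in-bounds features, keep only the `limit` best.
        order = np.argsort(scores)[::-1]      # indices, best score first
        kept = order[mask[order]][:limit]
        mask = np.zeros_like(mask)
        mask[kept] = True
    return mask

mask_by_score([1, 5, 3, 2], minimum=2, limit=2)  # keeps the scores 5 and 3
```

Note that the scores only need to be orderable, as the description says: coefficients, counts, or p-values (negated) would all work.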

"The old attribute will be removed in 0.15.", DeprecationWarning)
return self.stop_words_

def restrict(self, support, indices=False, return_removed=False):
jnothman (Member Author)

Before someone else points it out, I think this could do with another few moments' work: fix the docstring and test. I also don't see the point in the indices parameter, but copied it from DictVectorizer.

@jnothman mentioned this pull request Jun 24, 2013
@agramfort (Member)

Somehow I tend to say "if it's not broken, don't fix it".

Can you clarify what is broken now?

You need to realize that currently 2-3 people know this code very well and can fix it. When we modify / refactor, you ask these 2-3 people to update their knowledge, and if they have no time you lose contributors.

@jnothman (Member Author)

Yes, I'm aware that's an issue. In part I've posted this because it's a cleaner version of something I had lying around, and I wanted it out of my way. Of course, bounding scores to create a mask is trivial, though limiting the number of features isn't. So "broken" may be too strong a word, but here are some annoyances that I believe are best addressed through shared code (admittedly none of which I mentioned above or properly documented, mostly because I had forgotten to do so):

  • SelectKBest and SelectPercentile warn whenever duplicate scores are present, rather than only when a tie actually needs to be broken (and perhaps the user should be able to specify that all ties on the selection boundary be accepted or discarded, for the sake of determinism; but that certainly requires them to use the same limit implementation)
  • CountVectorizer.max_features does the same without warning at all
  • CountVectorizer.max_features only allows an absolute number of features to be specified, rather than allowing the user to keep a specified proportion of the total vocabulary.
  • Is there a reason one shouldn't be able to select k or a percentage of features by their coefficients in a linear model, or by their scores in randomized L1? This PR doesn't support that directly, but it becomes available to the interested user with 3 additional lines of code, rather than requiring a full reimplementation of the non-trivial limiting code, or faking p-values to use with SelectKBest or SelectPercentile. Alternatively, SelectByScore could be used.
  • The selector mixin allows bounding the scores by the mean or median value (with some multiplier). Why not be consistent and provide the same for RandomizedLasso.selection_threshold, CountVectorizer.min_df or even SelectFpr.alpha?
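The mean/median bounding mentioned in the last bullet amounts to very little code, which is exactly why it is frustrating that each estimator exposes it (or doesn't) differently. An illustrative sketch, with made-up scores:

```python
import numpy as np

# Bounding scores at a multiple of the mean, in the spirit of the
# "1.5*mean"-style thresholds the selector mixin accepts.
# The scores array is invented for illustration.
scores = np.array([0.10, 0.50, 0.05, 0.90, 0.30])
threshold = 1.5 * scores.mean()   # np.median(scores) would work the same way
mask = scores >= threshold        # selects only the 0.90 feature here
```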

In short, limiting the number of features should be an option anywhere bounding by value is (or it should be easy for someone to do so), and it's not trivial to implement.
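The non-trivial part of that limiting logic, warning only when a tie at the selection boundary is actually broken, might look something like this. This is a sketch of the behaviour argued for above, not the PR's actual implementation; the function name and exact semantics are assumptions:

```python
import numpy as np
import warnings

def limit_features(scores, k):
    """Keep the k highest scores; warn only when a boundary tie is broken.

    Illustrative sketch. Contrast with warning whenever any duplicate
    scores exist, as described for SelectKBest/SelectPercentile above.
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")[::-1]  # stable, descending
    if 0 < k < len(scores) and scores[order[k - 1]] == scores[order[k]]:
        # The k-th and (k+1)-th best scores are equal, so which feature
        # is kept is arbitrary; this is the only case worth warning about.
        warnings.warn("tie broken arbitrarily at the selection boundary")
    mask = np.zeros(len(scores), dtype=bool)
    mask[order[:k]] = True
    return mask
```

With scores [3, 1, 3, 2] and k=2 no warning is needed (both 3s are kept), while [3, 3, 3, 1] with k=2 must break a tie and should warn.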

At best, these annoyances suggest that this PR is a bit premature and requires documentation. At worst, they suggest this PR is both unnecessary and unwanted. Fine.

@jnothman (Member Author)

Is there a way to force Travis to re-check this, having rebased and added another commit?

@jaquesgrobler (Member)

Is Travis still checking this?

@jnothman (Member Author)

Closing this until it can be more strongly motivated.

@jnothman jnothman closed this Aug 10, 2014