[MRG] refactor feature selection by score by jnothman · Pull Request #2093 · scikit-learn/scikit-learn

Closed
wants to merge 2 commits

Conversation

jnothman (Member)

A number of places across the package perform feature selection by score, bounding the scores (specified absolutely or relatively) and/or limiting the number of features (specified absolutely or relatively).

While I think a mask_by_score utility could be useful for anyone playing with feature selection, I have particularly used it to assure correctness and common functionality in the many places this selection appears (univariate selection, learnt selector mixin, randomized l1, feature extraction).

I am not sure whether mask_by_score, or its realisation as a transformer, SelectByScore, should be part of the public API, or how they might come into examples or narrative documentation, and opinions are certainly welcome. It may be confusing that univariate_selection is also selection by score, but there the score_func returns both scores and p-values, and here we don't care what the scores are as long as they are orderable.

And I apologize for not having a negative total line count (but I think we lose a couple of lines if we remove comments, tests and blank lines)!
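For readers unfamiliar with the idea, here is a minimal sketch of what a mask_by_score-style utility might look like. The signature, default behaviour, and tie handling below are assumptions for illustration, not taken from this PR's diff:

```python
import numpy as np

def mask_by_score(scores, minimum=None, maximum=None, limit=None):
    """Boolean mask selecting features by score (illustrative sketch only).

    Bounds the scores to [minimum, maximum] and, if ``limit`` is given,
    keeps at most that many of the highest-scoring in-bounds features.
    The real utility in the PR may differ in signature and tie handling.
    """
    scores = np.asarray(scores, dtype=float)
    mask = np.ones(scores.shape, dtype=bool)
    if minimum is not None:
        mask &= scores >= minimum
    if maximum is not None:
        mask &= scores <= maximum
    if limit is not None and mask.sum() > limit:
        # Among the in-bounds features, keep only the `limit` best.
        order = np.argsort(scores)[::-1]      # indices, best score first
        kept = order[mask[order]][:limit]
        mask = np.zeros_like(mask)
        mask[kept] = True
    return mask

mask_by_score([1, 5, 3, 2], minimum=2, limit=2)  # keeps the scores 5 and 3
```

Note that the scores only need to be orderable, as the description says: coefficients, counts, or p-values (negated) would all work.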

"The old attribute will be removed in 0.15.", DeprecationWarning)
return self.stop_words_

def restrict(self, support, indices=False, return_removed=False):
jnothman (Member Author)

Before someone else points it out, I think this could do with another few moments' work: fix the docstring and test. I also don't see the point in the indices parameter, but copied it from DictVectorizer.

@jnothman mentioned this pull request Jun 24, 2013
@agramfort (Member)

Somehow I tend to say "if it's not broken, don't fix it".

Can you clarify what is broken now?

You need to realize that currently 2-3 people know this code very well and can fix it. When we modify / refactor, you ask these 2-3 people to update their knowledge, and if they have no time you lose contributors.

@jnothman (Member Author)

Yes, I'm aware that's an issue. In part I've posted this because it's a cleaner version of something I had lying around, and I wanted it out of my way. Of course, bounding scores to create a mask is trivial, though limiting the number of features isn't. So "broken" may be too strong a word, but here are some annoyances that I believe are best addressed through shared code (admittedly none of which I mentioned above or properly documented, mostly because I had forgotten to do so):

  • SelectKBest and SelectPercentile warn whenever duplicate scores are present, rather than only when a tie actually needs to be broken (and perhaps the user should be able to specify that all ties on the selection boundary be accepted or discarded, for the sake of determinism; but that certainly requires them to use the same limit implementation)
  • CountVectorizer.max_features does the same without warning at all
  • CountVectorizer.max_features only allows an absolute number of features to be specified, rather than allowing the user to keep a specified proportion of the total vocabulary.
  • Is there a reason one shouldn't be able to select k or a percentage of features by their coefficients in a linear model, or by their scores in randomized L1? This PR doesn't support that directly, but it becomes available to the interested user with 3 additional lines of code, rather than requiring a full reimplementation of the non-trivial limiting code, or faking p-values to use with SelectKBest or SelectPercentile. Alternatively, SelectByScore could be used.
  • The selector mixin allows bounding the scores by the mean or median value (with some multiplier). Why not be consistent and provide the same for RandomizedLasso.selection_threshold, CountVectorizer.min_df or even SelectFpr.alpha?
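The mean/median bounding mentioned in the last bullet amounts to very little code, which is exactly why it is frustrating that each estimator exposes it (or doesn't) differently. An illustrative sketch, with made-up scores:

```python
import numpy as np

# Bounding scores at a multiple of the mean, in the spirit of the
# "1.5*mean"-style thresholds the selector mixin accepts.
# The scores array is invented for illustration.
scores = np.array([0.10, 0.50, 0.05, 0.90, 0.30])
threshold = 1.5 * scores.mean()   # np.median(scores) would work the same way
mask = scores >= threshold        # selects only the 0.90 feature here
```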

In short, limiting the number of features should be an option anywhere bounding by value is (or it should be easy for someone to do so), and it's not trivial to implement.
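The non-trivial part of that limiting logic, warning only when a tie at the selection boundary is actually broken, might look something like this. This is a sketch of the behaviour argued for above, not the PR's actual implementation; the function name and exact semantics are assumptions:

```python
import numpy as np
import warnings

def limit_features(scores, k):
    """Keep the k highest scores; warn only when a boundary tie is broken.

    Illustrative sketch. Contrast with warning whenever any duplicate
    scores exist, as described for SelectKBest/SelectPercentile above.
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")[::-1]  # stable, descending
    if 0 < k < len(scores) and scores[order[k - 1]] == scores[order[k]]:
        # The k-th and (k+1)-th best scores are equal, so which feature
        # is kept is arbitrary; this is the only case worth warning about.
        warnings.warn("tie broken arbitrarily at the selection boundary")
    mask = np.zeros(len(scores), dtype=bool)
    mask[order[:k]] = True
    return mask
```

With scores [3, 1, 3, 2] and k=2 no warning is needed (both 3s are kept), while [3, 3, 3, 1] with k=2 must break a tie and should warn.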

At best, these annoyances suggest that this PR is a bit premature and requires documentation. At worst, they suggest this PR is both unnecessary and unwanted. Fine.

@jnothman (Member Author)

Is there a way to force Travis to re-check this, having rebased and added another commit?

@jaquesgrobler (Member)

Is Travis still checking this?

@jnothman (Member Author)

Closing this until it can be more strongly motivated.

@jnothman jnothman closed this Aug 10, 2014