[MRG+1] MAINT Refactor univariate_selection module by arjoly · Pull Request #3131 · scikit-learn/scikit-learn · GitHub

[MRG+1] MAINT Refactor univariate_selection module #3131


Closed

arjoly wants to merge 8 commits into scikit-learn:master from arjoly:simplify-fselection

Conversation

@arjoly arjoly commented May 5, 2014

The refactoring:

@arjoly arjoly changed the title [MRG] Refactor a bit univariate_selection module [MRG] MAINT Refactor univariate_selection module May 5, 2014
@arjoly arjoly changed the title [MRG] MAINT Refactor univariate_selection module [WIP] MAINT Refactor univariate_selection module May 5, 2014
jnothman commented May 5, 2014

Testing failed because Coveralls went down (I can't say I've found Coveralls especially useful).

I'll try to take a look at this soon...

% (k, len(self.scores_)))
else:
scores = _clean_nans(self.scores_)
# XXX This should be refactored; we're getting an array of indices
Inline review comment (Member):

I think you can remove this comment. All our feature selectors output masks now (as they should, IMO). It will be turned back into indices only for sparse matrix indexing, but only until we require scipy >= 0.14 (?).
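
For context, here is a minimal sketch of the mask-based convention this comment describes (the helper name is illustrative, not the module's actual code): selectors compute a boolean support mask, and indices are derived from it only where sparse indexing still needs them.

    import numpy as np
    from scipy import sparse

    # Illustrative only: a k-best support computed as a boolean mask.
    def k_best_mask(scores, k):
        mask = np.zeros(len(scores), dtype=bool)
        # stable sort, then keep the k highest-scoring features
        mask[np.argsort(scores, kind="mergesort")[-k:]] = True
        return mask

    X = sparse.csr_matrix(np.arange(20.0).reshape(5, 4))
    mask = k_best_mask(np.array([0.1, 3.0, 0.2, 2.5]), k=2)
    # older scipy cannot index sparse columns with a boolean mask, so the
    # mask is converted back to indices just for this one step
    X_selected = X[:, np.where(mask)[0]]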

jnothman commented May 5, 2014

This LGTM. Pity it doesn't actually reduce the number of lines as well as the number of classes in univariate_selection.

jnothman commented May 5, 2014

Is WIP intentional?

arjoly commented May 6, 2014

I want to write some tests for parameter checking and was waiting for Travis.

@arjoly arjoly changed the title [WIP] MAINT Refactor univariate_selection module [MRG] MAINT Refactor univariate_selection module May 6, 2014
arjoly commented May 6, 2014

Rebased on top of master. This is ready for review.

@coveralls

Coverage Status

Coverage remained the same when pulling 1ca49ec on arjoly:simplify-fselection into 5aa5357 on scikit-learn:master.

jnothman commented May 6, 2014

Still LGTM. +1

@jnothman jnothman changed the title [MRG] MAINT Refactor univariate_selection module [+1] MAINT Refactor univariate_selection module May 6, 2014
@jnothman jnothman changed the title [+1] MAINT Refactor univariate_selection module [MRG+1] MAINT Refactor univariate_selection module May 6, 2014
arjoly commented May 6, 2014

Thanks @jnothman for the review!

(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)
Inline review comment (Member):

While you are at it, I think it would be great to make it possible to pass a "chi2" string constant for the score_func argument to spare the user an import statement:

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(score_func="chi2", k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)

The docstring will need to be updated to give the list of score_func values available by default.
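
A hedged sketch of how that string resolution could look (the registry and helper below are hypothetical, not scikit-learn API):

    from sklearn.feature_selection import chi2, f_classif, f_regression

    # Hypothetical name-to-callable registry.
    _SCORE_FUNCS = {"chi2": chi2, "f_classif": f_classif,
                    "f_regression": f_regression}

    def resolve_score_func(score_func):
        # Accept either a callable or one of the registered names.
        if callable(score_func):
            return score_func
        try:
            return _SCORE_FUNCS[score_func]
        except KeyError:
            raise ValueError("Unknown score_func %r, expected a callable or"
                             " one of %s" % (score_func, sorted(_SCORE_FUNCS)))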

ogrisel commented May 7, 2014

Apart from my 2 comments, looks good to me.

@GaelVaroquaux

Side note, unrelated to this PR, but the feature selection should really
be refactored to expose different objects for classification and
regression, as people keep getting it wrong. They would have default
values for score_func.

To avoid a class explosion, it would be useful to explore setting the
thresholding strategy (FDR, k_best, ...) as a parameter of these
classes.

jnothman commented May 8, 2014

To avoid a class explosion, it would be useful to explore setting the thresholding
strategy (FDR, k_best, ...) as a parameter of these classes.

which GenericUnivariateSelect already does. But I agree we can
deprecate the underlying classes, just as we don't have separate classes
for various SGD loss functions.
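
For reference, a small usage sketch of the existing class, which already takes the thresholding strategy as a mode parameter:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import GenericUnivariateSelect, chi2

    iris = load_iris()
    X, y = iris.data, iris.target
    # mode picks the thresholding strategy ('percentile', 'k_best', 'fpr',
    # 'fdr' or 'fwe'); param is the matching percentile/k/alpha value.
    X_new = GenericUnivariateSelect(chi2, mode='k_best', param=2).fit_transform(X, y)
    X_new.shape  # (150, 2)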


arjoly commented May 8, 2014

Side note, unrelated to this PR, but the feature selection should really
be refactored to expose different objects for classification and
regression, as people keep getting it wrong. They would have default
values for score_func.

The class hierarchy could be rewritten to have only a ClassificationUnivariateSelection and a RegressionUnivariateSelection class (and maybe an UnsupervisedUnivariateSelection). This means deprecating SelectPercentile, SelectKBest, SelectFpr, SelectFdr, SelectFwe and GenericUnivariateSelect.

By the way, fpr, fdr and fwe are horrible names for saying whether or not we correct for multiple tests (Bonferroni or Benjamini-Hochberg correction). SelectPercentile and SelectKBest could be more or less equivalent (percentile = k / n_features * 100) if both handle ties the same way.
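
A quick check of that equivalence on data without tied scores (illustrative only; ties would need extra care, as noted above):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

    iris = load_iris()
    X, y = iris.data, iris.target
    k = 2
    kbest = SelectKBest(f_classif, k=k).fit(X, y)
    # percentile = k / n_features * 100
    perc = SelectPercentile(f_classif, percentile=100.0 * k / X.shape[1]).fit(X, y)
    assert (kbest.get_support() == perc.get_support()).all()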

Is there a reason to have a score_func with the signature

    score_func : callable
        Function taking two arrays X and y, and returning a pair of arrays
        (scores, pvalues).

when you might want to select only based on a p-value (statistical test) or a score (e.g. correlation, variance)?

While you are at it, I think it would be great to make it possible to pass "chi2" string constant for the score_func argument to spare the user an import statement

Good idea! But it's not worth doing if we deprecate most classes of the module.

@coveralls

Coverage Status

Coverage remained the same when pulling ce3ccc7 on arjoly:simplify-fselection into 5e5ed7a on scikit-learn:master.

@jnothman

when you might want to select only based on a p-value (statistical test)
or a score (e.g. correlation, variance)?

I agree somewhat, in that I don't see why the same/similar facility
shouldn't be used for things like selecting text classification features by
document frequency (however, this is unsupervised, rather than univariate).
I have previously attempted to refactor this code to use a generic
mask_by_score which is designed to handle the sorts of selection and
threshold interpretation that happen in the LearntSelectorMixin and
feature_extraction as well (#2093), but it didn't receive any positive
attention (which I completely understand given the load and its priority).

Yet, when you have some selectors that require a score, and others that
require a p-value, the current interface requires functions returning two
values (assuming p-value and score aren't directly correlated). If we
make the input a string by default, then the interface for a
function could perhaps be more flexible...


larsmans commented May 8, 2014

@GaelVaroquaux says

They would have default values for score_func.

But which defaults? The appropriate test depends on the distribution of feature values in X. Only chi² works for sparse matrices. I know it goes against the conventions, but this is the one module where I've never heard someone ask "why does it not work with the default settings?"
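
To illustrate the sparse point, chi2 is the one bundled score function that accepts sparse input directly (the toy data below is made up):

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.feature_selection import chi2

    rng = np.random.RandomState(0)
    # chi2 expects non-negative feature values, e.g. term counts
    X = csr_matrix(rng.poisson(1.0, size=(20, 5)))
    y = rng.randint(2, size=20)
    scores, pvalues = chi2(X, y)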

larsmans commented May 8, 2014

I changed some of the classes previously to only look at the score, not the p-value, so that I could implement pointwise MI feature selection (which I never did) and because the probabilities from the chi² test were unstable in high-dimensional input. Having one API for scoring functions keeps the code simple. When you don't need probabilities, just return (scores, None). SelectKBest will ignore the None and the FPR procedure will complain.

Also, I've never noticed anyone trying to define their own feature selection metric and reporting trouble. I interpret that as "it's not broken, so don't fix it".
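
A minimal sketch of that pattern (the variance metric is just an example of a score-only function):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest

    def variance_score(X, y):
        # score-only metric: no p-values available, so return (scores, None)
        return np.var(X, axis=0), None

    iris = load_iris()
    X_new = SelectKBest(variance_score, k=2).fit_transform(iris.data, iris.target)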

@larsmans larsmans closed this in 494a91b May 8, 2014
larsmans commented May 8, 2014

Merged from the command line as 494a91b.

@GaelVaroquaux

Is there a reason to have a score_func with the signature

score_func : callable
    Function taking two arrays X and y, and returning a pair of arrays
    (scores, pvalues).

when you might want to select only based on a p-value (statistical test) or a
score (e.g. correlation, variance)?

That was to be compatible with functions in scipy.stats.
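
A sketch of that compatibility: scipy.stats tests return (statistic, pvalue) pairs, so a score_func can be assembled per feature from, e.g., stats.f_oneway (an illustrative wrapper, not the library's actual implementation):

    import numpy as np
    from scipy import stats

    def f_classif_like(X, y):
        # Per-feature one-way ANOVA built on scipy.stats.f_oneway, returning
        # the (scores, pvalues) pair the univariate selectors expect.
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        groups = [X[y == c] for c in np.unique(y)]
        results = [stats.f_oneway(*(g[:, j] for g in groups))
                   for j in range(X.shape[1])]
        F = np.array([r[0] for r in results])
        pv = np.array([r[1] for r in results])
        return F, pv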

@arjoly arjoly deleted the simplify-fselection branch June 23, 2014 09:09