[MRG+1] MAINT Refactor univariate_selection module by arjoly · Pull Request #3131 · scikit-learn/scikit-learn · GitHub

[MRG+1] MAINT Refactor univariate_selection module #3131


Closed

arjoly wants to merge 8 commits into scikit-learn:master from arjoly:simplify-fselection

Conversation

@arjoly arjoly commented May 5, 2014

The refactoring:

@arjoly arjoly changed the title [MRG] Refactor a bit univariate_selection module [MRG] MAINT Refactor univariate_selection module May 5, 2014
@arjoly arjoly changed the title [MRG] MAINT Refactor univariate_selection module [WIP] MAINT Refactor univariate_selection module May 5, 2014
jnothman commented May 5, 2014

Testing failed because Coveralls went down (I can't say I've found Coveralls especially useful).

I'll try to take a look at this soon...

% (k, len(self.scores_)))
else:
scores = _clean_nans(self.scores_)
# XXX This should be refactored; we're getting an array of indices
Inline review comment (Member):

I think you can remove this comment. All our feature selectors output masks now (as they should, IMO). It will be turned back into indices only for sparse matrix indexing, but only until we require scipy >= 0.14 (?).
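
For context, here is a minimal sketch of the mask-based convention this comment describes (the helper name is illustrative, not the module's actual code): selectors compute a boolean support mask, and indices are derived from it only where sparse indexing still needs them.

    import numpy as np
    from scipy import sparse

    # Illustrative only: a k-best support computed as a boolean mask.
    def k_best_mask(scores, k):
        mask = np.zeros(len(scores), dtype=bool)
        # stable sort, then keep the k highest-scoring features
        mask[np.argsort(scores, kind="mergesort")[-k:]] = True
        return mask

    X = sparse.csr_matrix(np.arange(20.0).reshape(5, 4))
    mask = k_best_mask(np.array([0.1, 3.0, 0.2, 2.5]), k=2)
    # older scipy cannot index sparse columns with a boolean mask, so the
    # mask is converted back to indices just for this one step
    X_selected = X[:, np.where(mask)[0]]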

jnothman commented May 5, 2014

This LGTM. Pity it doesn't actually reduce the number of lines as well as the number of classes in univariate_selection.

jnothman commented May 5, 2014

Is WIP intentional?

arjoly commented May 6, 2014

I want to write some tests for parameter checking and was waiting for Travis.

@arjoly arjoly changed the title [WIP] MAINT Refactor univariate_selection module [MRG] MAINT Refactor univariate_selection module May 6, 2014
arjoly commented May 6, 2014

Rebased on top of master. This is ready for review.

@coveralls

Coverage Status

Coverage remained the same when pulling 1ca49ec on arjoly:simplify-fselection into 5aa5357 on scikit-learn:master.

jnothman commented May 6, 2014

Still LGTM. +1

@jnothman jnothman changed the title [MRG] MAINT Refactor univariate_selection module [+1] MAINT Refactor univariate_selection module May 6, 2014
@jnothman jnothman changed the title [+1] MAINT Refactor univariate_selection module [MRG+1] MAINT Refactor univariate_selection module May 6, 2014
arjoly commented May 6, 2014

Thanks @jnothman for the review!

(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)
Inline review comment (Member):

While you are at it, I think it would be great to make it possible to pass a "chi2" string constant for the score_func argument to spare the user an import statement:

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(score_func="chi2", k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)

The docstring will need to be updated to give the list of score_func values available by default.
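
A hedged sketch of how that string resolution could look (the registry and helper below are hypothetical, not scikit-learn API):

    from sklearn.feature_selection import chi2, f_classif, f_regression

    # Hypothetical name-to-callable registry.
    _SCORE_FUNCS = {"chi2": chi2, "f_classif": f_classif,
                    "f_regression": f_regression}

    def resolve_score_func(score_func):
        # Accept either a callable or one of the registered names.
        if callable(score_func):
            return score_func
        try:
            return _SCORE_FUNCS[score_func]
        except KeyError:
            raise ValueError("Unknown score_func %r, expected a callable or"
                             " one of %s" % (score_func, sorted(_SCORE_FUNCS)))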

ogrisel commented May 7, 2014

Apart from my 2 comments, looks good to me.

@GaelVaroquaux

Side note, unrelated to this PR, but the feature selection should really
be refactored to expose different objects for classification and
regression, as people keep getting it wrong. They would have default
values for score_func.

To avoid a class explosion, it would be useful to explore setting the
thresholding strategy (FDR, k_best, ...) as a parameter of these
classes.

jnothman commented May 8, 2014

To avoid a class explosion, it would be useful to explore setting the thresholding
strategy (FDR, k_best, ...) as a parameter of these classes.

which GenericUnivariateSelect already does. But I agree we can
deprecate the underlying classes, just as we don't have separate classes
for various SGD loss functions.
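
For reference, a small usage sketch of the existing class, which already takes the thresholding strategy as a mode parameter:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import GenericUnivariateSelect, chi2

    iris = load_iris()
    X, y = iris.data, iris.target
    # mode picks the thresholding strategy ('percentile', 'k_best', 'fpr',
    # 'fdr' or 'fwe'); param is the matching percentile/k/alpha value.
    X_new = GenericUnivariateSelect(chi2, mode='k_best', param=2).fit_transform(X, y)
    X_new.shape  # (150, 2)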


arjoly commented May 8, 2014

Side note, unrelated to this PR, but the feature selection should really
be refactored to expose different objects for classification and
regression, as people keep getting it wrong. They would have default
values for score_func.

The class hierarchy could be rewritten to have only a ClassificationUnivariateSelection and a RegressionUnivariateSelection class (and maybe an UnsupervisedUnivariateSelection). This means deprecating SelectPercentile, SelectKBest, SelectFpr, SelectFdr, SelectFwe and GenericUnivariateSelect.

By the way, fpr, fdr and fwe are horrible names for saying whether or not we correct for multiple tests (Bonferroni or Benjamini-Hochberg correction). SelectPercentile and SelectKBest could be more or less equivalent (percentile = k / n_features * 100) if both handle ties the same way.
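
A quick check of that equivalence on data without tied scores (illustrative only; ties would need extra care, as noted above):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

    iris = load_iris()
    X, y = iris.data, iris.target
    k = 2
    kbest = SelectKBest(f_classif, k=k).fit(X, y)
    # percentile = k / n_features * 100
    perc = SelectPercentile(f_classif, percentile=100.0 * k / X.shape[1]).fit(X, y)
    assert (kbest.get_support() == perc.get_support()).all()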

Is there a reason to have a score_func with the signature

    score_func : callable
        Function taking two arrays X and y, and returning a pair of arrays
        (scores, pvalues).

when you might want to select only based on a p-value (statistical test) or a score (e.g. correlation, variance)?

While you are at it, I think it would be great to make it possible to pass "chi2" string constant for the score_func argument to spare the user an import statement

Good idea! But it's not worth doing if we deprecate most classes of the module.

@coveralls

Coverage Status

Coverage remained the same when pulling ce3ccc7 on arjoly:simplify-fselection into 5e5ed7a on scikit-learn:master.

@jnothman

when you might want to select only based on a p-value (statistical test)
or a score (e.g. correlation, variance)?

I agree somewhat, in that I don't see why the same/similar facility
shouldn't be used for things like selecting text classification features by
document frequency (however, this is unsupervised, rather than univariate).
I have previously attempted to refactor this code to use a generic
mask_by_score which is designed to handle the sorts of selection and
threshold interpretation that happen in the LearntSelectorMixin and
feature_extraction as well (#2093), but it didn't receive any positive
attention (which I completely understand given the load and its priority).

Yet, when you have some selectors that require a score, and others that
require a p-value, the current interface requires functions returning two
values (assuming p-value and score aren't directly correlated). If we
make the input a string by default, then the interface for a
function could perhaps be more flexible...


larsmans commented May 8, 2014

@GaelVaroquaux says

They would have default values for score_func.

But which defaults? The appropriate test depends on the distribution of feature values in X. Only chi² works for sparse matrices. I know it goes against the conventions, but this is the one module where I've never heard someone ask "why does it not work with the default settings?"
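
To illustrate the sparse point, chi2 is the one bundled score function that accepts sparse input directly (the toy data below is made up):

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.feature_selection import chi2

    rng = np.random.RandomState(0)
    # chi2 expects non-negative feature values, e.g. term counts
    X = csr_matrix(rng.poisson(1.0, size=(20, 5)))
    y = rng.randint(2, size=20)
    scores, pvalues = chi2(X, y)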

larsmans commented May 8, 2014

I changed some of the classes previously to only look at the score, not the p-value, so that I could implement pointwise MI feature selection (which I never did) and because the probabilities from the chi² test were unstable in high-dimensional input. Having one API for scoring functions keeps the code simple. When you don't need probabilities, just return (scores, None). SelectKBest will ignore the None and the FPR procedure will complain.

Also, I've never noticed anyone trying to define their own feature selection metric and reporting trouble. I interpret that as "it's not broken, so don't fix it".
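
A minimal sketch of that pattern (the variance metric is just an example of a score-only function):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest

    def variance_score(X, y):
        # score-only metric: no p-values available, so return (scores, None)
        return np.var(X, axis=0), None

    iris = load_iris()
    X_new = SelectKBest(variance_score, k=2).fit_transform(iris.data, iris.target)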

@larsmans larsmans closed this in 494a91b May 8, 2014
larsmans commented May 8, 2014

Merged from the command line as 494a91b.

@GaelVaroquaux

Is there a reason to have a score_func with the signature

score_func : callable
    Function taking two arrays X and y, and returning a pair of arrays
    (scores, pvalues).

when you might want to select only based on a p-value (statistical test) or a
score (e.g. correlation, variance)?

That was to be compatible with functions in scipy.stats.
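
A sketch of that compatibility: scipy.stats tests return (statistic, pvalue) pairs, so a score_func can be assembled per feature from, e.g., stats.f_oneway (an illustrative wrapper, not the library's actual implementation):

    import numpy as np
    from scipy import stats

    def f_classif_like(X, y):
        # Per-feature one-way ANOVA built on scipy.stats.f_oneway, returning
        # the (scores, pvalues) pair the univariate selectors expect.
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        groups = [X[y == c] for c in np.unique(y)]
        results = [stats.f_oneway(*(g[:, j] for g in groups))
                   for j in range(X.shape[1])]
        F = np.array([r[0] for r in results])
        pv = np.array([r[1] for r in results])
        return F, pv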

@arjoly arjoly deleted the simplify-fselection branch June 23, 2014 09:09