[MRG] Implementing SelectKBest feature in SelectFromModel by qmaruf · Pull Request #6717 · scikit-learn/scikit-learn

[MRG] Implementing SelectKBest feature in SelectFromModel #6717


Closed

Conversation

@qmaruf
@qmaruf qmaruf commented Apr 26, 2016

Working on #6689 to implement SelectKBest feature in SelectFromModel.
It should return the top k most important features.



    reference = np.median(importances)
elif reference == "mean":
    reference = np.mean(importances)

if threshold is None:
Contributor

Lines 50-57 are exactly the same as 40-57, right? Or am I missing something?

@hlin117
Contributor
hlin117 commented Apr 26, 2016

Thanks for your contribution, @qmaruf. We like seeing new contributors to the library. Please show that your code works by building a test case for it.

@jnothman
Member
  • Please add tests
  • I think k should be renamed n_selected (the model is not called SelectKBest here) or something similar (max_features?)
  • If there are tied features, this currently can return more than n_selected features. Is this desirable behaviour? It is not the behaviour of SelectKBest...
  • Rather than just assuming that a non-None k overrides threshold: what should the semantics be if both threshold and n_selected are set? It's easy to return the subset of features matching both criteria (see the sketch after this comment).
  • It might be useful to support n_selected as a float between 0 and 1, to function like SelectPercentile.
  • Please update docstring parameter description

FWIW, I addressed a number of these questions in #1939, in a mask_by_score function. I closed it due to lack of reviewer interest when I was new to the project. It was a bit of a space shuttle, but it could be borrowed from.
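
For concreteness, a minimal sketch of the "both criteria" idea from the list above, assuming importances are already computed; the function name and the n_selected parameter are placeholders from this discussion, not scikit-learn API:

import numpy as np

def mask_by_both(importances, threshold, n_selected):
    # Keep a feature only if it clears the threshold AND ranks in the
    # top n_selected by importance (a stable sort keeps ties deterministic).
    above = importances >= threshold
    top = np.argsort(-importances, kind="mergesort")[:n_selected]
    in_top = np.zeros_like(above)
    in_top[top] = True
    return above & in_top

importances = np.array([0.1, 0.4, 0.4, 0.05])
print(mask_by_both(importances, threshold=0.2, n_selected=1))
# [False  True False False]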

@qmaruf
Author
qmaruf commented May 1, 2016

Travis CI is showing the following test error.

ERROR: sklearn.feature_selection.tests.test_from_model.test_feature_importances
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/miniconda/envs/testenv/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/travis/sklearn_build_latest/scikit-learn/sklearn/utils/testing.py", line 690, in run_test
    return func(*args, **kwargs)
  File "/home/travis/sklearn_build_latest/scikit-learn/sklearn/feature_selection/tests/test_from_model.py", line 136, in test_feature_importances
    assert_raises(ValueError, transformer.fit, X, y)
  File "/home/travis/miniconda/envs/testenv/lib/python3.5/unittest/case.py", line 727, in assertRaises
    return context.handle('assertRaises', args, kwargs)
  File "/home/travis/miniconda/envs/testenv/lib/python3.5/unittest/case.py", line 176, in handle
    callable_obj(*args, **kwargs)
  File "/home/travis/sklearn_build_latest/scikit-learn/sklearn/feature_selection/from_model.py", line 252, in fit
    self._check_params(X, y)
  File "/home/travis/sklearn_build_latest/scikit-learn/sklearn/feature_selection/from_model.py", line 214, in _check_params
    if self.k is not None and not (self.k == "all" or 0 <= self.k <= X.shape[1]):
TypeError: unorderable types: int() <= str()

But I don't get a similar error when running nosetests sklearn/ on my PC. Moreover, the above test passes too! The complete output is here: https://gist.github.com/qmaruf/7d694c69c8dc7beac7047f1f082ce82e

What is the problem?

@jnothman
Member
jnothman commented May 1, 2016

That's a Python 3-only error. In Py3 some_string < some_number raises a TypeError while Py2 gives a result and no error.
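
A minimal illustration of the difference (the values below are made up, not from the PR):

k = "all"
n_features = 10

try:
    # Python 2 would order the mixed types and print a boolean;
    # Python 3 raises "unorderable types" / "'<=' not supported".
    print(0 <= k <= n_features)
except TypeError as exc:
    print("Python 3 raises:", exc)

# The fix is to test for the string sentinel before comparing numerically:
if k == "all" or (isinstance(k, int) and 0 <= k <= n_features):
    print("k is valid")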

    threshold = np.sort(importances)[-k]
    return threshold

if threshold is None:
Member

This block is repeated

@MechCoder
Member
MechCoder commented May 5, 2016

I am ambivalent on k vs n_selected. In fact, it is my mistake that threshold=None is doing what threshold='auto' is supposed to do :/

What do you think about renaming it to max_selected, so that we return at most that many of the features that are above the given threshold? To ignore the threshold option, the user can set threshold=0, and to ignore the max_selected option, they can set max_selected="all" (the default).

@hlin117
Contributor
hlin117 commented May 10, 2016

I like that change. That clarifies the language well.


@jnothman
Member

I'm okay with max_selected or max_features or max_n_features. Note that max_* implies something about tie-breaking, i.e. when the max_selected-th feature by rank has the same importance as the (max_selected+1)-th feature: it says that we don't include the (max_selected+1)-th, or maybe that we don't include any feature at that boundary threshold. Ties in SelectKBest and SelectPercentile are currently broken in an arbitrary, deterministic way.
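
To make the current tie-breaking concrete, a small sketch of how a stable sort resolves ties deterministically (the scores are made up):

import numpy as np

scores = np.array([0.2, 0.5, 0.5, 0.1, 0.5])
max_selected = 2

# 'mergesort' is stable, so among the three tied scores of 0.5 the
# features with lower indices always win: arbitrary, but deterministic.
ranking = np.argsort(-scores, kind="mergesort")
print(ranking)                    # [1 2 4 0 3]

mask = np.zeros_like(scores, dtype=bool)
mask[ranking[:max_selected]] = True
print(mask)                       # [False  True  True False False]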

@qmaruf
Author
qmaruf commented May 11, 2016

So, k should be replaced with max_n_features. Right?

@MechCoder
Member

yeah, that seems to be the consensus.

@hlin117
Contributor
hlin117 commented May 11, 2016

@qmaruf If you are going to continue with this PR, please continue in a test-oriented way. Show us (the reviewers) how you want to design the code interface, let us review it, and then dive into the implementation. This way, we could provide feedback prior to you investing development time in it.

@qmaruf
Author
qmaruf commented May 12, 2016

@hlin117 I am not sure about how to continue in a test-oriented way. Could you please provide an example?

@jnothman
Member

You don't have to work test-driven, but it usually helps clarify that we're all looking for the same thing, how boundary cases might be handled etc. You should extend the tests in test_from_model.py to cover expected functionality for this new case, even while the code it's testing doesn't exist (i.e. the tests will fail). So an example is really any unit test, but pretending the code it's testing doesn't exist. Then (perhaps after confirming with us that you're properly testing the expected functionality) you implement the changes required to pass those tests.
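
As an illustration of that workflow, a hedged sketch of such a test, written as if the feature already existed; max_n_features is the name proposed in this thread, not a real SelectFromModel parameter at this point, so the test fails (with a TypeError) until the feature is implemented, which is the point:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

def test_max_n_features_limits_selection():
    X, y = make_regression(n_samples=50, n_features=10, random_state=0)
    # Fails until max_n_features is implemented; writing the test first
    # pins down the expected behaviour before any code exists.
    transformer = SelectFromModel(Lasso(alpha=0.01),
                                  max_n_features=3, threshold=-np.inf)
    X_new = transformer.fit_transform(X, y)
    assert X_new.shape[1] <= 3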

@qmaruf
Author
qmaruf commented May 12, 2016

Do you have any other test in mind? I am out of ideas.

@@ -173,6 +173,10 @@ class SelectFromModel(BaseEstimator, SelectorMixin):
        Otherwise train the model using ``fit`` and then ``transform`` to do
        feature selection.

    k : int, between 0 and number of features, default None
Contributor

Change this to max_n_features

@hlin117
Contributor
hlin117 commented May 12, 2016

Overall, good tests @qmaruf. I'm still looking for two other tests:

  • Test this code against an actual model. The current test cases do this with Lasso on this line. Set max_n_features for a trained model.
  • @jnothman mentioned in this comment that the features picked are deterministic if their scores match. Can you write a test case to show how the tie breaking will work?

@qmaruf
Author
qmaruf commented May 13, 2016

@hlin117 I've added two new tests according to your suggestion. Please check.

def _check_params(self, X, y):
    X, y = check_X_y(X, y)
    if self.max_n_features is not None and not (
            self.max_n_features == "all" or
Member

Remind me why we're providing the "all" alternative?

Contributor

The "all" option is available for SelectKBest. So it makes sense that SelectFromModel supports this option.

@jnothman
Member

Didn't we decide that we should return the intersection of the features selected by threshold and max_n_features?

Also, I would eventually like to see support for 0.0 <= max_n_features < 1.0 indicating a proportion of features (like SelectPercentile).
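
A sketch of how such a fractional value could resolve to a count, under the convention proposed here; the helper name is made up and this illustrates the idea, not the merged API:

import numpy as np

def resolve_max_n_features(max_n_features, n_features):
    # A float in [0, 1) is read as a proportion, like SelectPercentile.
    if isinstance(max_n_features, float) and 0.0 <= max_n_features < 1.0:
        return int(max_n_features * n_features)
    return max_n_features  # an integer count (or the "all" sentinel)

print(resolve_max_n_features(0.25, 20))  # 5
print(resolve_max_n_features(3, 20))     # 3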

@hlin117
Contributor
hlin117 commented May 14, 2016

@jnothman The SelectKBest module currently does not allow you to pass in a percentage. I feel that if we allow users to pass in a percentage to SelectFromModel, it makes sense to also enhance SelectKBest to match the same code interface.

@hlin117
Contributor
hlin117 commented May 14, 2016

@qmaruf If you could write a test case that follows @jnothman's comment here, that would be great.

@jnothman
Member
jnothman commented Sep 7, 2016

Repeatedly needing to resolve conflicts sometimes happens if you resolve a conflict poorly in the first place, and so create more conflicts! It can also be avoided by squashing all your commits into one, e.g. with git merge --squash, git rebase --interactive, or a git-squash tool.

@jnothman
Member
jnothman commented Sep 7, 2016

It looks like this isn't rebased on master. You can do something like this:

$ git fetch upstream master  # assuming github.com/scikit-learn/scikit-learn is set up as this remote
$ git checkout -b tmp
$ git reset --hard upstream/master
$ git merge --squash add-selectKbest-feature-for-models
$ git commit -m "Select k-best features in SelectFromModel"
$ git checkout add-selectKbest-feature-for-models
$ git reset --hard tmp
$ git branch -d tmp
$ git push origin -f add-selectKbest-feature-for-models  # assuming origin is your fork

@qmaruf qmaruf force-pushed the add-selectKbest-feature-for-models branch from a7546a1 to 41b9679 on September 11, 2016 05:31
@qmaruf qmaruf force-pushed the add-selectKbest-feature-for-models branch from 6d8469f to 775facc on September 11, 2016 13:20
@hlin117
Contributor
hlin117 commented Sep 12, 2016

@qmaruf Please ping this thread again when you're ready for a final round of reviews.

@qmaruf
Author
qmaruf commented Sep 12, 2016

It seems OK to me. No test case failures locally after rebasing with master, but Travis CI shows an error. Not sure how to reproduce it locally.

@jnothman
Member

No test case failure locally after rebasing with master, but travis ci shows an error

The failure is in the docs; perhaps you're not testing those locally.

candidate_indices = np.argsort(-scores,
                               kind='mergesort')[:self.max_features]
mask[candidate_indices] = True
if self.threshold is not None:
Member

Please remove this condition.

Author

If this condition is removed, we need to use threshold=-np.inf to pass most of the test cases. Is that OK?
Also, I've edited the max_features doc; you can check it below.

max_features : int, between 0 and number of features, optional.
    Use threshold=-np.inf to ignore threshold and apply
    max_features only. threshold=None to select at most this many
    features that score above the threshold. Any int between 0 and
    number of features will return the desired number of features.

Member

Only if you changed the behavior, which you shouldn't have; otherwise it will use the default threshold.

Contributor
@hlin117 hlin117 Sep 15, 2016

@qmaruf Unfortunately this is a breaking change, so you can't change this behavior. I'm sure you could find another way to create the mask though. =]

@@ -63,6 +65,117 @@ def test_input_estimator_unchanged():
    assert_true(transformer.estimator is est)


def check_invalid_max_features(est, X, y):
Contributor

@qmaruf Why did you name each of these tests check_*? The scikit-learn standard is to name the unit tests with test_*.

Member

Often we use check_ for helpers in tests. test_ only applies when there are no required arguments.

X = X.copy()
max_features = X.shape[1]

check_valid_max_features(est, X, y)
Contributor

If this is a private function, please prefix it with _.

Member

we don't tend to use _ prefixes for helpers in tests.

        return scores >= self.threshold_
    mask = np.zeros_like(scores, dtype=bool)
    if self.max_features == 'all':
        self.max_features = scores.size
Contributor

In scikit-learn, we do not modify the attributes of an estimator after it is initialized.
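
A sketch of the idiomatic alternative, resolving the parameter into a local value rather than overwriting self.max_features (the helper is illustrative, not PR code):

def _resolve_max_features(max_features, n_features):
    # Resolve the "all" sentinel into a concrete count without writing
    # back to self.max_features; in a real estimator this would happen
    # in fit, possibly stored as a fitted attribute like max_features_.
    return n_features if max_features == "all" else max_features

print(_resolve_max_features("all", 5))  # 5
print(_resolve_max_features(3, 5))      # 3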

    If threshold and max_features are not provided, all features are
    returned; use threshold=None if it is not required.
    """
transformer = SelectFromModel(estimator=Lasso(alpha=0.1))
Contributor

The Lasso estimator is not deterministic (even though you can prove that it'll converge to a unique optimal solution every time). My suggestion is to mock a real estimator whose results we can easily interpret, and use it in these test cases.

Member

Regardless, if you continue to use Lasso or some other estimator with a random_state parameter, set that random_state to an integer.
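
In the spirit of the mocking suggestion above, a hypothetical deterministic estimator for such tests (FixedImportanceEstimator is made up for illustration):

import numpy as np
from sklearn.base import BaseEstimator

class FixedImportanceEstimator(BaseEstimator):
    """Deterministic by construction: importances are supplied directly."""

    def __init__(self, importances):
        self.importances = importances

    def fit(self, X, y=None):
        self.feature_importances_ = np.asarray(self.importances, dtype=float)
        return self

# Feature 2 is always ranked first, feature 0 last: easy to assert on.
est = FixedImportanceEstimator([0.1, 0.5, 0.9, 0.3])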

@jnothman
Member

@qmaruf do you intend to finish this, or should we find another contributor?

@qmaruf
Author
qmaruf commented May 17, 2017

Another contributor should do it. I can't manage time for this right now.

@jnothman
Member
jnothman commented May 17, 2017 via email

@naoyak
Contributor
naoyak commented May 17, 2017

I could look into this one around next week.

@jnothman
Member
jnothman commented May 17, 2017 via email

@agramfort
Member

this needs rebase

@amueller
Member

done in #9616

@amueller amueller closed this Jul 16, 2018
amueller pushed a commit that referenced this pull request Jul 16, 2018
#### Reference Issue
Continuation of work from [PR #6717](#6717).


#### What does this implement/fix? Explain your changes.
Will merge in master (this branch is a year old) and make the changes discussed in the previous PR to get it ready for merging.