[MRG] Implementing SelectKBest feature in SelectFromModel by qmaruf · Pull Request #6717 · scikit-learn/scikit-learn

[MRG] Implementing SelectKBest feature in SelectFromModel #6717


Closed

Conversation

@qmaruf
@qmaruf qmaruf commented Apr 26, 2016

Working on #6689 to implement SelectKBest feature in SelectFromModel.
It should return the top k most important features.



    reference = np.median(importances)
elif reference == "mean":
    reference = np.mean(importances)

if threshold is None:
Contributor

Lines 50-57 are exactly the same as 40-57, right? Or am I missing something?

@hlin117
Contributor
hlin117 commented Apr 26, 2016

Thanks for your contribution, @qmaruf. We like seeing new contributors to the library. Please show that your code works by building a test case for it.

@jnothman
Member
  • Please add tests
  • I think k should be renamed n_selected (the model is not called SelectKBest here) or something similar (max_features?)
  • If there are tied features, this currently can return more than n_selected features. Is this desirable behaviour? It is not the behaviour of SelectKBest...
  • Rather than just assuming that a non-None k overrides threshold: what should the semantics be if both threshold and n_selected are set? It's easy to return the subset of features matching both criteria (see the sketch after this comment).
  • It might be useful to support n_selected as a float between 0 and 1, to function like SelectPercentile.
  • Please update docstring parameter description

FWIW, I addressed a number of these questions in #1939, in a mask_by_score function. I closed it due to lack of reviewer interest when I was new to the project. It was a bit of a space shuttle, but it could be borrowed from.
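
For concreteness, a minimal sketch of the "both criteria" idea from the list above, assuming importances are already computed; the function name and the n_selected parameter are placeholders from this discussion, not scikit-learn API:

import numpy as np

def mask_by_both(importances, threshold, n_selected):
    # Keep a feature only if it clears the threshold AND ranks in the
    # top n_selected by importance (a stable sort keeps ties deterministic).
    above = importances >= threshold
    top = np.argsort(-importances, kind="mergesort")[:n_selected]
    in_top = np.zeros_like(above)
    in_top[top] = True
    return above & in_top

importances = np.array([0.1, 0.4, 0.4, 0.05])
print(mask_by_both(importances, threshold=0.2, n_selected=1))
# [False  True False False]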

@qmaruf
Author
qmaruf commented May 1, 2016

Travis CI is showing the following test error.

ERROR: sklearn.feature_selection.tests.test_from_model.test_feature_importances
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/miniconda/envs/testenv/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/travis/sklearn_build_latest/scikit-learn/sklearn/utils/testing.py", line 690, in run_test
    return func(*args, **kwargs)
  File "/home/travis/sklearn_build_latest/scikit-learn/sklearn/feature_selection/tests/test_from_model.py", line 136, in test_feature_importances
    assert_raises(ValueError, transformer.fit, X, y)
  File "/home/travis/miniconda/envs/testenv/lib/python3.5/unittest/case.py", line 727, in assertRaises
    return context.handle('assertRaises', args, kwargs)
  File "/home/travis/miniconda/envs/testenv/lib/python3.5/unittest/case.py", line 176, in handle
    callable_obj(*args, **kwargs)
  File "/home/travis/sklearn_build_latest/scikit-learn/sklearn/feature_selection/from_model.py", line 252, in fit
    self._check_params(X, y)
  File "/home/travis/sklearn_build_latest/scikit-learn/sklearn/feature_selection/from_model.py", line 214, in _check_params
    if self.k is not None and not (self.k == "all" or 0 <= self.k <= X.shape[1]):
TypeError: unorderable types: int() <= str()

But I don't get a similar error when running nosetests sklearn/ on my PC. Moreover, the above test passes too! The complete output is here: https://gist.github.com/qmaruf/7d694c69c8dc7beac7047f1f082ce82e

What is the problem?

@jnothman
Member
jnothman commented May 1, 2016

That's a Python 3-only error. In Py3 some_string < some_number raises a TypeError while Py2 gives a result and no error.
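
A minimal illustration of the difference (the values below are made up, not from the PR):

k = "all"
n_features = 10

try:
    # Python 2 would order the mixed types and print a boolean;
    # Python 3 raises "unorderable types" / "'<=' not supported".
    print(0 <= k <= n_features)
except TypeError as exc:
    print("Python 3 raises:", exc)

# The fix is to test for the string sentinel before comparing numerically:
if k == "all" or (isinstance(k, int) and 0 <= k <= n_features):
    print("k is valid")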

    threshold = np.sort(importances)[-k]
    return threshold

if threshold is None:
Member

This block is repeated

@MechCoder
Member
MechCoder commented May 5, 2016

I am ambivalent on k vs n_selected. In fact, it is my mistake that threshold=None is doing what threshold='auto' is supposed to do :/

What do you think about renaming it to max_selected, so that we return at most that many of the features that are above the given threshold? To ignore the threshold option, the user can set threshold=0, and to ignore the max_selected option, they can set max_selected="all" (the default).

@hlin117
Contributor
hlin117 commented May 10, 2016

I like that change. That clarifies the language well.


@jnothman
Member

I'm okay with max_selected or max_features or max_n_features. Note that max_* implies something about tie-breaking, i.e. when the max_selected-th feature by rank has the same importance as the (max_selected+1)-th feature: it says that we don't include the (max_selected+1)-th, or maybe that we don't include any feature at that boundary threshold. Ties in SelectKBest and SelectPercentile are currently broken in an arbitrary, deterministic way.
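
To make the current tie-breaking concrete, a small sketch of how a stable sort resolves ties deterministically (the scores are made up):

import numpy as np

scores = np.array([0.2, 0.5, 0.5, 0.1, 0.5])
max_selected = 2

# 'mergesort' is stable, so among the three tied scores of 0.5 the
# features with lower indices always win: arbitrary, but deterministic.
ranking = np.argsort(-scores, kind="mergesort")
print(ranking)                    # [1 2 4 0 3]

mask = np.zeros_like(scores, dtype=bool)
mask[ranking[:max_selected]] = True
print(mask)                       # [False  True  True False False]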

@qmaruf
Author
qmaruf commented May 11, 2016

So, k should be replaced with max_n_features. Right?

@MechCoder
Member

yeah, that seems to be the consensus.

@hlin117
Contributor
hlin117 commented May 11, 2016

@qmaruf If you are going to continue with this PR, please continue in a test-oriented way. Show us (the reviewers) how you want to design the code interface, let us review it, and then dive into the implementation. This way, we could provide feedback prior to you investing development time in it.

@qmaruf
Author
qmaruf commented May 12, 2016

@hlin117 I am not sure about how to continue in a test-oriented way. Could you please provide an example?

@jnothman
Member

You don't have to work test-driven, but it usually helps clarify that we're all looking for the same thing, how boundary cases might be handled etc. You should extend the tests in test_from_model.py to cover expected functionality for this new case, even while the code it's testing doesn't exist (i.e. the tests will fail). So an example is really any unit test, but pretending the code it's testing doesn't exist. Then (perhaps after confirming with us that you're properly testing the expected functionality) you implement the changes required to pass those tests.
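
As an illustration of that workflow, a hedged sketch of such a test, written as if the feature already existed; max_n_features is the name proposed in this thread, not a real SelectFromModel parameter at this point, so the test fails (with a TypeError) until the feature is implemented, which is the point:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

def test_max_n_features_limits_selection():
    X, y = make_regression(n_samples=50, n_features=10, random_state=0)
    # Fails until max_n_features is implemented; writing the test first
    # pins down the expected behaviour before any code exists.
    transformer = SelectFromModel(Lasso(alpha=0.01),
                                  max_n_features=3, threshold=-np.inf)
    X_new = transformer.fit_transform(X, y)
    assert X_new.shape[1] <= 3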

@qmaruf
Author
qmaruf commented May 12, 2016

Do you have any other test in mind? I am out of ideas.

@@ -173,6 +173,10 @@ class SelectFromModel(BaseEstimator, SelectorMixin):
        Otherwise train the model using ``fit`` and then ``transform`` to do
        feature selection.

    k : int, between 0 and number of features, default None
Contributor

Change this to max_n_features

@hlin117
Contributor
hlin117 commented May 12, 2016

Overall, good tests @qmaruf. I'm still looking for two other tests:

  • Test this code against an actual model. The current test cases do this with Lasso on this line. Set max_n_features for a trained model.
  • @jnothman mentioned in this comment that the features picked are deterministic if their scores match. Can you write a test case to show how the tie breaking will work?

@qmaruf
Author
qmaruf commented May 13, 2016

@hlin117 I've added two new tests according to your suggestion. Please check.

def _check_params(self, X, y):
    X, y = check_X_y(X, y)
    if self.max_n_features is not None and not (
            self.max_n_features == "all" or
Member

Remind me why we're providing the "all" alternative?

Contributor

The "all" option is available for SelectKBest. So it makes sense that SelectFromModel supports this option.

@jnothman
Member

Didn't we decide that we should return the intersection of the features selected by threshold and max_n_features?

Also, I would eventually like to see support for 0.0 <= max_n_features < 1.0 indicating a proportion of features (like SelectPercentile).
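
A sketch of how such a fractional value could resolve to a count, under the convention proposed here; the helper name is made up and this illustrates the idea, not the merged API:

import numpy as np

def resolve_max_n_features(max_n_features, n_features):
    # A float in [0, 1) is read as a proportion, like SelectPercentile.
    if isinstance(max_n_features, float) and 0.0 <= max_n_features < 1.0:
        return int(max_n_features * n_features)
    return max_n_features  # an integer count (or the "all" sentinel)

print(resolve_max_n_features(0.25, 20))  # 5
print(resolve_max_n_features(3, 20))     # 3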

@hlin117
Contributor
hlin117 commented May 14, 2016

@jnothman The SelectKBest module currently does not allow you to pass in a percentage. I feel that if we allow users to pass in a percentage to SelectFromModel, it makes sense to also enhance SelectKBest to match the same code interface.

@hlin117
Contributor
hlin117 commented May 14, 2016

@qmaruf If you could write a test case that follows @jnothman's comment here, that would be great.

@jnothman
Member
jnothman commented Sep 7, 2016

Repeatedly needing to resolve conflicts sometimes happens if you resolve a conflict poorly in the first place, and so create more conflicts! It can also be avoided by squashing all your commits into one, e.g. with git merge --squash, git rebase --interactive, or a git-squash tool.

@jnothman
Member
jnothman commented Sep 7, 2016

It looks like this isn't rebased on master. You can do something like this:

$ git fetch upstream master  # assuming github.com/scikit-learn/scikit-learn is set up as this remote
$ git checkout -b tmp
$ git reset --hard upstream/master
$ git merge --squash add-selectKbest-feature-for-models
$ git commit -m "Select k-best features in SelectFromModel"
$ git checkout add-selectKbest-feature-for-models
$ git reset --hard tmp
$ git branch -d tmp
$ git push origin -f add-selectKbest-feature-for-models  # assuming origin is your fork

@qmaruf qmaruf force-pushed the add-selectKbest-feature-for-models branch from a7546a1 to 41b9679 on September 11, 2016 05:31
@qmaruf qmaruf force-pushed the add-selectKbest-feature-for-models branch from 6d8469f to 775facc on September 11, 2016 13:20
@hlin117
Contributor
hlin117 commented Sep 12, 2016

@qmaruf Please ping this thread again when you're ready for a final round of reviews.

@qmaruf
Author
qmaruf commented Sep 12, 2016

It seems OK to me. No test case failures locally after rebasing with master, but Travis CI shows an error. Not sure how to reproduce it locally.

@jnothman
Member

No test case failure locally after rebasing with master, but travis ci shows an error

The failure is in the docs; perhaps you're not testing those locally.

candidate_indices = np.argsort(-scores,
                               kind='mergesort')[:self.max_features]
mask[candidate_indices] = True
if self.threshold is not None:
Member

Please remove this condition.

Author

If this condition is removed, we need to use threshold=-np.inf to pass most of the test cases. Is that OK?
Also, I've edited the max_features doc; you can check it below.

max_features : int, between 0 and number of features, optional.
    Use threshold=-np.inf to ignore threshold and apply
    max_features only. threshold=None to select at most this many
    features that score above the threshold. Any int between 0 and
    number of features will return the desired number of features.

Member

Only if you changed the behavior, which you shouldn't have; otherwise it will use the default threshold.

Contributor
@hlin117 hlin117 Sep 15, 2016

@qmaruf Unfortunately this is a breaking change, so you can't change this behavior. I'm sure you could find another way to create the mask though. =]

@@ -63,6 +65,117 @@ def test_input_estimator_unchanged():
    assert_true(transformer.estimator is est)


def check_invalid_max_features(est, X, y):
Contributor

@qmaruf Why did you name each of these tests check_*? The scikit-learn standard is to name the unit tests with test_*.

Member

Often we use check_ for helpers in tests. test_ only applies when there are no required arguments.

X = X.copy()
max_features = X.shape[1]

check_valid_max_features(est, X, y)
Contributor

If this is a private function, please prefix it with _.

Member

we don't tend to use _ prefixes for helpers in tests.

        return scores >= self.threshold_
    mask = np.zeros_like(scores, dtype=bool)
    if self.max_features == 'all':
        self.max_features = scores.size
Contributor

In scikit-learn, we do not modify the attributes of an estimator after it is initialized.
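
A sketch of the idiomatic alternative, resolving the parameter into a local value rather than overwriting self.max_features (the helper is illustrative, not PR code):

def _resolve_max_features(max_features, n_features):
    # Resolve the "all" sentinel into a concrete count without writing
    # back to self.max_features; in a real estimator this would happen
    # in fit, possibly stored as a fitted attribute like max_features_.
    return n_features if max_features == "all" else max_features

print(_resolve_max_features("all", 5))  # 5
print(_resolve_max_features(3, 5))      # 3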

    If threshold and max_features are not provided, all features are
    returned; use threshold=None if it is not required.
    """
transformer = SelectFromModel(estimator=Lasso(alpha=0.1))
Contributor

The Lasso estimator is not deterministic (even though you can prove that it'll converge to a unique optimal solution every time). My suggestion is to mock a real estimator whose results we can easily interpret, and use it in these test cases.

Member

Regardless, if you continue to use Lasso or some other estimator with a random_state parameter, set that random_state to an integer.
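
In the spirit of the mocking suggestion above, a hypothetical deterministic estimator for such tests (FixedImportanceEstimator is made up for illustration):

import numpy as np
from sklearn.base import BaseEstimator

class FixedImportanceEstimator(BaseEstimator):
    """Deterministic by construction: importances are supplied directly."""

    def __init__(self, importances):
        self.importances = importances

    def fit(self, X, y=None):
        self.feature_importances_ = np.asarray(self.importances, dtype=float)
        return self

# Feature 2 is always ranked first, feature 0 last: easy to assert on.
est = FixedImportanceEstimator([0.1, 0.5, 0.9, 0.3])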

@jnothman
Member

@qmaruf do you intend to finish this, or should we find another contributor?

@qmaruf
Author
qmaruf commented May 17, 2017

Another contributor should do it. I can't manage time for this right now.

@jnothman
Member
jnothman commented May 17, 2017 via email

@naoyak
Contributor
naoyak commented May 17, 2017

I could look into this one around next week.

@jnothman
Member
jnothman commented May 17, 2017 via email

@agramfort
Member

this needs rebase

@amueller
Member

done in #9616

@amueller amueller closed this Jul 16, 2018
amueller pushed a commit that referenced this pull request Jul 16, 2018
#### Reference Issue
Continuation of work from [PR #6717](#6717).


#### What does this implement/fix? Explain your changes.
Will merge in master (this branch is a year old) and make the changes discussed in the previous PR to get it ready for merging.