[MRG+1] Select k-best features in SelectFromModel by nsheth12 · Pull Request #9616 · scikit-learn/scikit-learn

Merged · 20 commits · Jul 16, 2018

Conversation

nsheth12 (Contributor):

Reference Issue

Continuation of work from PR #6717.

What does this implement/fix? Explain your changes.

Will merge in master (this branch is a year old) and make the changes discussed in the previous PR review to get it ready to merge.

@nsheth12 (Contributor Author):

The AppVeyor build continues to fail. It fails the _check_max_features() check on Windows with Python 2.7.8 and 64-bit architecture. For some reason, it's receiving "10L" as the max_features parameter. I'm not able to reproduce the issue locally. Any ideas as to what's going on, how to reproduce it, or how to fix it? Below is a screenshot of the error traceback in the AppVeyor console:

[Screenshot: AppVeyor console error traceback]
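A likely explanation (not confirmed in the thread): on 64-bit Windows, Python 2's native int is only 32 bits, so integer values coming out of numpy can be coerced to Python long, which reprs with an "L" suffix; an isinstance(x, int) check then fails. Checking against numbers.Integral covers both types:

    # Python 2 only: 10L is a `long`, not an `int`
    >>> isinstance(10L, int)
    False
    >>> import numbers
    >>> isinstance(10L, numbers.Integral)
    True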

@jnothman (Member) left a comment:

Thanks for taking this up!

        if 0 <= self.max_features <= X.shape[1]:
            return
        elif self.max_features == 'all':
            return
Member:

less indentation

        if isinstance(self.max_features, int):
            if 0 <= self.max_features <= X.shape[1]:
                return
        elif self.max_features == 'all':
Member:

what's the difference between None and 'all'?

@@ -108,6 +108,9 @@ class SelectFromModel(BaseEstimator, SelectorMixin, MetaEstimatorMixin):
        Otherwise train the model using ``fit`` and then ``transform`` to do
        feature selection.

    max_features : int, between 0 and number of features, optional.
        Select at most this many features that score above the threshold.
Member:

Could you please add a note that to use only `max_features`, with no threshold, `threshold=-np.inf` can be used? But perhaps we should allow users to disable the threshold with a string `'-inf'`? (I'd rather not `'none'`, which would get confused with the current `None`, meaning automatic threshold determination. Although we could in turn consider deprecating the use of `threshold=None` and renaming it to `threshold='auto'`.)
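For illustration, the usage pattern such a note would document (a sketch; est, X, and y are placeholders):

    import numpy as np
    from sklearn.feature_selection import SelectFromModel

    # Keep only the 5 highest-scoring features; -np.inf disables the threshold.
    transformer = SelectFromModel(est, max_features=5, threshold=-np.inf)
    X_new = transformer.fit_transform(X, y)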

    transformer = SelectFromModel(estimator=est,
                                  max_features=invalid_max_n_feature,
                                  threshold=-np.inf)
    assert_raises(ValueError, transformer.fit, X, y)
Member:

It's generally better to check that the right error is being raised, especially for something as generic as a ValueError. Use assert_raises_regexp or assert_raise_message.
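A sketch of the stricter check (the expected message fragment is illustrative):

    from sklearn.utils.testing import assert_raise_message

    assert_raise_message(ValueError, "max_features should be",
                         transformer.fit, X, y)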

            raise ValueError(
                'Either fit SelectFromModel before transform or set "prefit='
                'True" and pass a fitted estimator to the constructor.')
            raise ValueError('Either fit the model before transform or set'
Member:

It probably wasn't your doing, but generally you should avoid touching code not related to the change.

    assert_equal(X_new.shape[1], n_features)


def check_threshold_and_max_features(est, X, y):
Member:

I'd rather this be a separate test_threshold_and_max_features

Member:

I think this test currently tests threshold too much, when it is already covered above. A good set of tests should look like a proof by induction. We start by checking the basic features, and then that their combination makes sense. So I think this test would work well if we assume threshold and max_features work alone, and only confirm that their combination produces the features corresponding to the set intersection of their selections. I.e. this test should not bother with coef_ or with shape.
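A sketch of the reduced test described above (values illustrative; assumes est is deterministic and compares support masks directly):

    def test_threshold_and_max_features_intersection():
        # Selection under each criterion alone.
        mask_k = SelectFromModel(estimator=est, max_features=2,
                                 threshold=-np.inf).fit(X, y)._get_support_mask()
        mask_t = SelectFromModel(estimator=est,
                                 threshold=0.04).fit(X, y)._get_support_mask()
        # The combination should select exactly the intersection.
        mask_both = SelectFromModel(estimator=est, max_features=2,
                                    threshold=0.04).fit(X, y)._get_support_mask()
        assert_array_equal(mask_both, mask_k & mask_t)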


    # Test max_features against actual model.
    transformer1 = SelectFromModel(estimator=Lasso(alpha=0.025,
                                   random_state=42))
Member:

This is not appropriate indentation. It makes it look like the random_state belongs to SelectFromModel

    assert_array_equal(transformer1.estimator_.coef_,
                       transformer2.estimator_.coef_)

    # Test if max_features can break ties among feature importances
Member:

I think this should be a separate test function.

                                  threshold=-np.inf)
    X_new = transformer.fit_transform(X, y)
    selected_feature_indices = np.where(transformer._get_support_mask())[0]
    assert_array_equal(selected_feature_indices, np.arange(n_features))
Member:

I'm okay with this approach, but wonder if we'd be better off taking max_features literally and returning none of the tying features at the cutoff (to avoid users being surprised by the tie-breaking; although we do break ties like this in SelectKBest and SelectPercentile, and perhaps we should remain consistent). WDYT? Perhaps it rarely matters.

Contributor Author:

I personally think it is better to give users exactly the number of features they ask for. From my experience as a user, I don't care so much whether I get feature X or feature Y when both are tied in importance; I care that when I ask for Z features, I get Z features and not fewer. Consistency with SelectKBest and SelectPercentile is another argument in favor of keeping this as is. However, this is just my 2 cents, and I'll defer to you on the final decision.
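A toy illustration of the behavior under discussion: a stable mergesort argsort breaks ties by index and returns exactly max_features features.

    import numpy as np

    scores = np.array([0.9, 0.5, 0.5, 0.5, 0.1])  # three-way tie at the cutoff
    max_features = 3
    mask = np.zeros_like(scores, dtype=bool)
    mask[np.argsort(-scores, kind='mergesort')[:max_features]] = True
    print(mask)  # [ True  True  True False False] -- lower indices win ties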

    assert_array_equal(X_new3, X[:, selected_indices[0]])

    """
    If threshold and max_features are not provided, all features are
Member:

I don't think this comment makes sense. If threshold and max_features are not provided, the default threshold is used.

@jnothman (Member) commented Aug 28, 2017 via email

@jnothman (Member) left a comment:

Thanks

        self.norm_order = norm_order

    def _check_max_features(self, X, max_features):
        if self.max_features is None or self.max_features == 'all':
Member:

is there a reason to have both 'all' and None available? Can't we just allow the default value, either 'all' or None, to yield this behaviour?

Contributor Author:

Agreed. Removed support for "all".


    def _check_params(self, X, y):
        X, y = check_X_y(X, y)
        self._check_max_features(X, self.max_features)
Member:

We tend to avoid such nesting and would rather have the check_max_features logic inline here.

        n_features_to_select = self.max_features
        if self.max_features == 'all':
            n_features_to_select = scores.size
        candidate_indices = np.argsort(-scores,
Member:

This sort is unnecessary in the default max_features case, and so seems to be wasted computation.

Note that an alternative way to implement this, in O(n) time, is to just set threshold=max(threshold, np.percentile(scores, 100 * max_features / n_features)), and then handle ties explicitly if too many features are selected. I'm happy with mergesort for readability and comparison to SelectKBest

Contributor Author:

For now, I just added a check to avoid sorting in the default case. I think the tiebreaking code required for the percentile approach would decrease readability significantly (as you mention). Of course, if performance becomes an issue, I can always go back and change it.
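For reference, a sketch of the linear-time alternative jnothman outlines, using np.partition to find the cutoff and trimming ties explicitly. This is a hypothetical helper, not the merged implementation, and it assumes 0 < max_features <= scores.size:

    import numpy as np

    def top_k_mask(scores, threshold, max_features):
        n = scores.size
        # k-th largest score in O(n); everything >= it is a candidate.
        cutoff = np.partition(scores, n - max_features)[n - max_features]
        mask = scores >= max(threshold, cutoff)
        # Ties at the cutoff can let more than max_features through;
        # drop the surplus tied features explicitly.
        surplus = mask.sum() - max_features
        if surplus > 0:
            tied = np.flatnonzero(mask & (scores == cutoff))
            mask[tied[:surplus]] = False
        return mask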



def check_diff_models_threshold_and_max_features(est, X, y):
    """
Member:

I don't think this comment is clear. The default threshold is used, even if max_features is provided. I don't think it's the right place for the comment, either. It can just be removed.

                               n_repeated=0, shuffle=False, random_state=0)

    check_diff_models_threshold_and_max_features(
        RandomForestClassifier(n_estimators=50, random_state=0), X, y)
Member:

Again, this doesn't seem to be the place to test that SelectFromModel works with different kinds of models. We only need to test the interaction of threshold and max_features assuming that independently, they both work correctly.

@jnothman (Member) left a comment:

LGTM

                np.argsort(-scores, kind='mergesort')[:self.max_features]
            mask[candidate_indices] = True
        else:
            mask = np.logical_not(mask)
Member:

Nitpick: I'd rather see this as ones_like, with zeros_like in the if case. But whatever.

Member:

I approve this change :)

    assert_equal(X_new3.shape[1], min(X_new1.shape[1], X_new2.shape[1]))
    selected_indices = \
        transformer3.transform(np.arange(X.shape[1])[np.newaxis, :])
    assert_array_equal(X_new3, X[:, selected_indices[0]])
Member:

Were this to fail, the error would not be very clear. Much clearer if we were just comparing ranges. But it's okay.

@jnothman changed the title from "Select k-best features in SelectFromModel" to "[MRG+1] Select k-best features in SelectFromModel" on Aug 29, 2017
@nsheth12 (Contributor Author):

Is there anything else I need to do for this to be reviewed for moving to MRG+2?

@jnothman (Member):

The review will come, eventually. Feel free to take up another issue in the meantime.

Also, please add an entry to doc/whats_new/v0.20.rst citing @qmaruf and yourself.

@nsheth12 (Contributor Author):

Just wanted to check in: when will the second review for this PR happen?

@amueller (Member):

@nsheth12 when someone finds time ;) Sorry, a lot of us are pretty busy.

@amueller (Member):

Can you please resolve the conflict?

@nsheth12 (Contributor Author) commented Dec 1, 2017:

Resolved the conflict - sorry for the delay. Is there anything else I need to do?

@jnothman (Member):

Wait for a reviewer :\

@jnothman added this to the 0.20 milestone on Jun 17, 2018
@glemaitre (Member) left a comment:

@nsheth12 Can you address these minor issues?

        self.norm_order = norm_order

    def _check_params(self, X, y):
        X, y = check_X_y(X, y)
Member:

Is there any reason not to accept sparse matrices? I would think that the underlying estimator should take care of it.

Member:

you're right: we should not have any criteria on X or y as long as X has shape on its second axis, and can be indexed on it. It's a bit upsetting that we don't have a test for that!
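A sketch of such a test, assuming the input validation is relaxed as suggested (RandomForestClassifier itself accepts CSR input):

    from scipy.sparse import csr_matrix

    def test_fit_accepts_sparse_input():
        X_sparse = csr_matrix(X)
        transformer = SelectFromModel(
            RandomForestClassifier(n_estimators=10, random_state=0),
            max_features=2, threshold=-np.inf)
        X_new = transformer.fit_transform(X_sparse, y)
        assert X_new.shape[1] == 2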

@@ -108,6 +109,11 @@ class SelectFromModel(BaseEstimator, SelectorMixin, MetaEstimatorMixin):
        Otherwise train the model using ``fit`` and then ``transform`` to do
        feature selection.

    max_features : int, between 0 and number of features, optional.
Member:

Do not mention "between 0 and number of features" in the first line. Also remove the final full stop.

    max_features : int, between 0 and number of features, optional.
        Select at most this many features that score above the threshold.
        To disable the threshold, and only select based on max_features,
        set threshold = -np.inf.
Member:

put some backticks and no space around the equals sign

        X, y = check_X_y(X, y)

        if self.max_features is None:
            return
Member:

We should also return X and y if we check them.

        self.norm_order = norm_order

    def _check_params(self, X, y):
Member:

I would rename this function _check_inputs

@@ -40,6 +42,117 @@ def test_input_estimator_unchanged():
    assert_true(transformer.estimator is est)


def check_invalid_max_features(est, X, y):
Member:

We can parametrize the test using pytest from now on:

@pytest.mark.parametrize("max_features",
                         [-1, X.shape[1] + 1, 'gobbledigook', 'all'])
def check_invalid_max_features(est, X, y, max_features):
    transformer = SelectFromModel(estimator=est,
                                  max_features=max_features,
                                  threshold=-np.inf)
    with pytest.raises(ValueError, match=err_msg):
        transformer.fit(X, y)

Member:

Also, it is weird that you always get the same error, "max_features should be >=0"; it is not meaningful for strings. We need two if conditions, one for the type and one for the value.
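One way to write the two-branch check (error messages illustrative):

    import numbers

    if not isinstance(max_features, numbers.Integral):
        raise TypeError("max_features should be an integer; got %r"
                        % max_features)
    if not 0 <= max_features <= X.shape[1]:
        raise ValueError("max_features should be 0 <= max_features <= "
                         "n_features; got %r" % max_features)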


def check_valid_max_features(est, X, y):
    max_features = X.shape[1]
    for valid_max_n_feature in [0, max_features, 5]:
Member:

parametrize

                                      max_features=valid_max_n_feature,
                                      threshold=-np.inf)
        X_new = transformer.fit_transform(X, y)
        assert_equal(X_new.shape[1], valid_max_n_feature)
Member:

call it max_features

                       transformer2.estimator_.coef_)


def test_max_features_tiebreak():
Member:

At first glance, we can also parametrize this test.

    transformer3 = SelectFromModel(estimator=est, max_features=3,
                                   threshold=0.04)
    X_new3 = transformer3.fit_transform(X, y)
    assert_equal(X_new3.shape[1], min(X_new1.shape[1], X_new2.shape[1]))
Member:

use bare assert

@glemaitre (Member) commented Jun 25, 2018:

@jorisvandenbossche I made the changes that I requested. Can you have a look at the PR, as an extra pair of eyes, before merging?

@sklearn-lgtm:

This pull request introduces 1 alert when merging 2bdfb48 into eec7649 - view on LGTM.com

new alerts:

  • 1 for Unused import

Comment posted by LGTM.com

@@ -123,10 +129,12 @@ class SelectFromModel(BaseEstimator, SelectorMixin, MetaEstimatorMixin):
    threshold_ : float
        The threshold value used for feature selection.
    """
    def __init__(self, estimator, threshold=None, prefit=False, norm_order=1):
    def __init__(self, estimator, threshold=None, prefit=False,
                 max_features=None, norm_order=1):
Member:

to avoid API breakage max_features should be added at the end of the signature
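That is, the signature would become something like (this matches the argument order released in 0.20):

    def __init__(self, estimator, threshold=None, prefit=False,
                 norm_order=1, max_features=None):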

@glemaitre (Member):

@agramfort I made the change

@hermidalc (Contributor):

@nsheth12 @jnothman @glemaitre @amueller sorry that I have seen this so late and after the merge. I've had customized code to do this exact functionality for a long time. My question to everyone is why is the implementation here so complex? For consistency why did you not use k as a parameter? Was it because you wanted to combine threshold and k best to determine the number of features?

Here is the diff between my code and 0.19.1; it's very simple. I ignore any threshold if k is specified, which is the behavior I wanted, since I want it to be consistent and comparable to scoring functions.

>     k : int or "all", optional, default None
>         Number of top features to select.
>         The "all" option bypasses selection, for use in a parameter search.
>         If k is specified threshold is ignored.
> 
126c131
<     def __init__(self, estimator, threshold=None, prefit=False, norm_order=1):
---
>     def __init__(self, estimator, threshold=None, k=None, prefit=False, norm_order=1):
128a134
>         self.k = k
131a138,143
>     def _check_params(self, X, y):
>         if self.k is not None and not (self.k == "all" or 0 <= self.k <= X.shape[1]):
>             raise ValueError("k should be >=0, <= n_features; got %r."
>                              "Use k='all' to return all features."
>                              % self.k)
> 
142,144c154,167
<         scores = _get_feature_importances(estimator, self.norm_order)
<         threshold = _calculate_threshold(estimator, scores, self.threshold)
<         return scores >= threshold
---
>         self.scores_ = _get_feature_importances(estimator, self.norm_order)
>         if self.k is None:
>             threshold = _calculate_threshold(estimator, self.scores_, self.threshold)
>             return self.scores_ >= threshold
>         elif self.k == 'all':
>             return np.ones(self.scores_.shape, dtype=bool)
>         elif self.k == 0:
>             return np.zeros(self.scores_.shape, dtype=bool)
>         else:
>             mask = np.zeros(self.scores_.shape, dtype=bool)
>             # Request a stable sort. Mergesort takes more memory (~40MB per
>             # megafeature on x86-64).
>             mask[np.argsort(self.scores_, kind="mergesort")[-self.k:]] = True
>             return mask

@glemaitre (Member):

> My question to everyone is why is the implementation here so complex

I don't see why our solution is more complex. These are exactly the same steps, but we also allow using both max_features and threshold if desired.

        if self.max_features is not None:
            mask = np.zeros_like(scores, dtype=bool)
            candidate_indices = \
                np.argsort(-scores, kind='mergesort')[:self.max_features]
            mask[candidate_indices] = True
        else:
            mask = np.ones_like(scores, dtype=bool)
        mask[scores < threshold] = False
        return mask

> For consistency why did you not use k as a parameter

It is true that k could have been an option. IMO, max_features is more explicit, considering the documentation of SelectFromModel alone.

@jnothman (Member):

max_features can also be extended in the future to handle fractions of the number of features (like SelectPercentile), while k cannot as intuitively.
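Hypothetically, such an extension might interpret a float as a fraction of the input features (not part of this PR; sketch only):

    if isinstance(max_features, float):
        # e.g. max_features=0.5 keeps half of the features
        max_features = int(max_features * X.shape[1])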
