[MRG] Implementing SelectKBest feature in SelectFromModel #6717
Conversation
```python
    reference = np.median(importances)
elif reference == "mean":
    reference = np.mean(importances)
if threshold is None:
```
Lines 50-57 are exactly the same as 40-57, right? Or am I missing something?
Thanks for your contribution, @qmaruf. We like seeing new contributors to the library. Please show that your code works by building a test case for it.
FWIW, I addressed a number of these questions in #1939, in a […]
travis-ci is showing the following test error, but I don't get a similar error while running the tests locally. What is the problem?
That's a Python 3-only error. In Py3 […]
```python
threshold = np.sort(importances)[-k]
return threshold

if threshold is None:
```
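The `np.sort(importances)[-k]` line in the snippet above picks the k-th largest importance as the threshold, so that exactly the top k features pass (barring ties). A quick self-contained check of that trick, with made-up importance values:

```python
import numpy as np

# Hypothetical importance scores for five features.
importances = np.array([0.05, 0.40, 0.10, 0.25, 0.20])
k = 2

# The k-th largest value becomes the threshold ...
threshold = np.sort(importances)[-k]

# ... so exactly the top-k features pass (absent ties).
selected = importances >= threshold
```

Here `threshold` is 0.25 and `selected` keeps the two features scoring 0.40 and 0.25.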
This block is repeated
I am ambivalent on `k`. What do you think about renaming it to `max_n_features`?
I like that change. That clarifies the language well.
I'm okay with `max_n_features`.
So, `max_n_features`?
yeah, that seems to be the consensus.
@qmaruf If you are going to continue with this PR, please continue in a test-oriented way. Show us (the reviewers) how you want to design the code interface, let us review it, and then dive into the implementation. This way, we can provide feedback before you invest development time in it.
@hlin117 I am not sure how to continue in a test-oriented way. Could you please provide an example?
You don't have to work test-driven, but it usually helps clarify that we're all looking for the same thing, how boundary cases might be handled, etc. You should extend the tests in test_from_model.py to cover the expected functionality for this new case, even while the code being tested doesn't exist (i.e. the tests will fail). So an example is really any unit test, written as if the code it tests doesn't exist yet. Then (perhaps after confirming with us that you're properly testing the expected functionality) you implement the changes required to pass those tests.
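A sketch of what such a test-first assertion could look like (hypothetical, not scikit-learn code): the stand-in helper `top_k_mask` plays the role of the behaviour this PR should add, so the snippet is self-contained and runnable.

```python
import numpy as np

def top_k_mask(scores, k):
    """Stand-in for the behaviour the PR asks SelectFromModel to gain:
    a boolean mask keeping the k highest-scoring features."""
    mask = np.zeros_like(scores, dtype=bool)
    mask[np.argsort(-scores, kind='mergesort')[:k]] = True
    return mask

def test_top_k_mask_keeps_exactly_k_features():
    # Written before the real implementation exists: it pins down the
    # interface and the expected boundary behaviour.
    scores = np.array([0.1, 0.9, 0.3, 0.7])
    mask = top_k_mask(scores, 2)
    assert mask.sum() == 2
    assert mask.tolist() == [False, True, False, True]
```

In a real TDD workflow the test would target `SelectFromModel` directly and fail until the feature lands.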
Do you have any other test in mind? I am out of ideas.
```diff
@@ -173,6 +173,10 @@ class SelectFromModel(BaseEstimator, SelectorMixin):
     Otherwise train the model using ``fit`` and then ``transform`` to do
     feature selection.

     k : int, between 0 and number of features, default None
```
Change this to `max_n_features`.
Overall, good tests @qmaruf. I'm still looking for two other tests.
@hlin117 I've added two new tests according to your suggestion. Please check.
```python
def _check_params(self, X, y):
    X, y = check_X_y(X, y)
    if self.max_n_features is not None and not (
            self.max_n_features == "all" or
```
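The snippet above is the start of a parameter check accepting `None`, the string `"all"`, or an integer in range. A standalone sketch of that kind of validation (the function name and message are illustrative, not the PR's exact code):

```python
def check_max_n_features(max_n_features, n_features):
    """Accept None, the string "all", or an int in [0, n_features]."""
    if max_n_features is None or max_n_features == "all":
        return
    if isinstance(max_n_features, int) and 0 <= max_n_features <= n_features:
        return
    raise ValueError(
        "max_n_features should be 'all' or an int in [0, %d]; got %r"
        % (n_features, max_n_features))
```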
Remind me why we're providing the "all" alternative?
The `"all"` option is available for `SelectKBest`, so it makes sense that `SelectFromModel` supports this option.
Didn't we decide that we should return the intersection of the features selected by `threshold` and by `max_n_features`? Also, I would eventually like to see support for […]
@jnothman The […]
Repeatedly needing to resolve conflicts sometimes happens if you poorly resolve a conflict in the first place, and so create more conflicts! It can also be avoided by squashing all your commits into one, e.g. by doing […]
It looks like this isn't rebased on master. You can do something like this:
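The command block itself was lost in this excerpt; the usual rebase workflow it presumably described looks something like this (remote and branch names are assumptions, adjust to your setup):

```shell
# assuming "upstream" points at the main scikit-learn repository
git fetch upstream
git rebase upstream/master
# resolve any conflicts, then push the rewritten branch
git push --force-with-lease origin my-branch
```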
@qmaruf Please ping this thread again when you're ready for a final round of reviews.
It seems OK to me. There is no test case failure locally after rebasing with master, but Travis CI shows an error. Not sure how to reproduce it locally.
The failure is in […]
```python
candidate_indices = np.argsort(-scores,
                               kind='mergesort')[:self.max_features]
mask[candidate_indices] = True
if self.threshold is not None:
```
Please remove this condition.
If this condition is removed, we need to use `threshold=-np.inf` to pass most of the test cases. Is that OK? However, I've edited the `max_features` doc. You can check it below.
```
max_features : int, between 0 and number of features, optional.
    Use ``threshold=-np.inf`` to ignore ``threshold`` and apply
    ``max_features`` only. Use ``threshold=None`` to select at most this
    many features that score above the threshold. Any int between 0 and
    the number of features will return the desired number of features.
```
Only if you changed the behavior, which you shouldn't have; otherwise it will use the default threshold.
@qmaruf Unfortunately this is a breaking change, so you can't change this behavior. I'm sure you could find another way to create the mask though. =]
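To make the semantics debated above concrete, here is a self-contained sketch (an illustration of what this thread discusses, not the final merged implementation): the selected set is the threshold mask intersected with the top-`max_features` mask, the stable `mergesort` breaks ties in favour of the earlier feature, and `threshold=-np.inf` makes the threshold a no-op.

```python
import numpy as np

def select_mask(scores, threshold, max_features):
    # Top-max_features mask; stable sort means ties go to the lower index.
    top_k = np.zeros_like(scores, dtype=bool)
    top_k[np.argsort(-scores, kind='mergesort')[:max_features]] = True
    # Intersection with the threshold mask.
    return (scores >= threshold) & top_k

scores = np.array([0.05, 0.40, 0.10, 0.25])

# threshold=-np.inf disables the threshold, so max_features alone decides.
only_k = select_mask(scores, -np.inf, 3)

# A finite threshold can shrink the selection further.
both = select_mask(scores, 0.30, 3)
```

With these values `only_k` keeps three features, while the 0.30 threshold cuts `both` down to the single feature scoring 0.40.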
```diff
@@ -63,6 +65,117 @@ def test_input_estimator_unchanged():
     assert_true(transformer.estimator is est)


 def check_invalid_max_features(est, X, y):
```
@qmaruf Why did you name each of these tests `check_*`? The scikit-learn standard is to name the unit tests with `test_*`.
Often we use `check_` for helpers in tests; `test_` only applies when there are no required arguments.
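A minimal illustration of that naming convention (function names invented for the example): the `check_` helper takes required arguments, so the test runner skips it, while the argument-free `test_` function is collected and drives the helper.

```python
import numpy as np

def check_mask_size(mask, expected):
    # Helper with required arguments: prefixed check_, not collected by pytest.
    assert mask.sum() == expected

def test_masks():
    # No required arguments: prefixed test_, collected and run as a test.
    check_mask_size(np.array([True, False, True]), 2)
    check_mask_size(np.array([False, False]), 0)
```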
```python
X = X.copy()
max_features = X.shape[1]

check_valid_max_features(est, X, y)
```
If this is a private function, please prefix it with `_`.
We don't tend to use `_` prefixes for helpers in tests.
```python
    return scores >= self.threshold_
mask = np.zeros_like(scores, dtype=bool)
if self.max_features == 'all':
    self.max_features = scores.size
```
In scikit-learn, we do not modify the attributes of an estimator after it has been initialized.
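One way to honour that rule (an illustrative class, not the PR's actual code): resolve `'all'` into a local variable inside the method instead of overwriting `self.max_features`.

```python
import numpy as np

class TopKSelector:
    """Illustrative stand-in, not scikit-learn's SelectFromModel."""
    def __init__(self, max_features='all'):
        self.max_features = max_features  # never rewritten after __init__

    def get_support_mask(self, scores):
        # Resolve 'all' locally; the public attribute stays untouched.
        k = scores.size if self.max_features == 'all' else self.max_features
        mask = np.zeros_like(scores, dtype=bool)
        mask[np.argsort(-scores, kind='mergesort')[:k]] = True
        return mask
```

Keeping `__init__` parameters pristine is what lets `get_params`/`set_params` and cloning behave predictably.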
```python
    If threshold and max_features are not provided, all features are
    returned, use threshold=None if it is not required.
    """
transformer = SelectFromModel(estimator=Lasso(alpha=0.1))
```
The `Lasso` estimator is not deterministic (even though you can prove that it will converge to a unique optimal solution every time). My suggestion is to mock a real estimator whose results we can easily interpret, and use it in these test cases.
Regardless, if you continue to use `Lasso` or some other estimator with a `random_state` parameter, set that `random_state` to an integer.
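The suggested mock might look like this hypothetical test double, whose importances are fixed at construction so every run is deterministic and the expected selection is obvious by inspection:

```python
import numpy as np

class FixedImportanceEstimator:
    """Hypothetical test double: fit() does nothing, importances are fixed."""
    def __init__(self, importances):
        self.importances = np.asarray(importances, dtype=float)

    def fit(self, X, y=None):
        # Expose the canned values the way a fitted estimator would.
        self.feature_importances_ = self.importances
        return self
```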
@qmaruf do you intend to finish this, or should we find another contributor?
Another contributor should do it. I can't manage time for this right now.
Thanks for your work and for the prompt reply.
I could look into this one around next week.
Thanks!
This needs a rebase.
Done in #9616.
Working on #6689 to implement the SelectKBest feature in SelectFromModel. It should return the top k most important features.