-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG+1] ENH: Feature selection based on mutual information #5372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There is an issue with What is the best approach to deal with it? I can split the algorithm in classification / regression, but it looks like unnecessary duplication. |
Thanks for the PR. Not sure if providing To get rid of the failure, you could do the conversion in |
Are you OK if I introduce two classes |
|
||
Attributes | ||
---------- | ||
n_features_ : int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems unnecessary
I agree it might be in style, but only if it were We have So for now, leave |
@@ -0,0 +1,224 @@ | |||
# Authors: Andrea Bravi <a.bravi@uottawa.ca> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please drop _filtering
from this filename. Or just call it mutualinfo.py
@jnothman I have addressed your suggestions. I kept What would be the next step? |
Oh. Right. How unnecessarily verbose. Next step is wait for someone to give you a full review, including finding On 12 October 2015 at 05:45, Nikolay Mayorov notifications@github.com
|
Let's hope that it will ever happen. My thoughts on what should be do 8000 ne:
|
I can try having a look in the coming week. |
@MechCoder, that would be great. |
|
||
x_std = np.std(x) | ||
if x_std > 0: | ||
x = x / x_std |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we normalize X and y if they are not categorical at one go before the expensive looping?
Else this is done multiple times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I just decided to do it later in "optimization phase". I mean I will do it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we shouldn't to it for sparse X
: storing each row scaled can result in storing the whole X
in dense format, which probably was avoided for a good reason.
As for dense X
I think it's fine to introduce copy=True
parameter and either modify X
in place or make a copy then modify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About sparse matrices: I was very wrong, just scaling an element doesn't change it value from zero (obviously) so we can handle both cases equivalently.
@nmayorov Sorry for the looong delay. I made a first minor pass. Hopefully you are still here! |
Btw, I changed the title to MRG |
@MechCoder looking forward to finish this PR. Will try to do it on weekends. |
Sure.. I'm looking forward as well. |
Issues left to consider:
|
------- | ||
self | ||
""" | ||
X, y = check_X_y(X, y, accept_sparse='csr', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accept_sparse='csc'
here and do away with the conversion again below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My doubt was caused by the docstring of scale
: "To avoid memory copy the caller should pass a CSR matrix." What do you say considering this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
o.O This is odd indeed. I would expect it to be the other way around, no? since operations along each column is easy in csc_matrices.
See: #5791
Merged with master ! Thanks a lot @nmayorov for your patience and congrats 🍷 🍷 |
@MechCoder @agramfort thanks a lot for working with me. |
Hi! This is my attempt to finish/rework #2547
I tried to address code style issues and also added algorithms estimating mutual information with continuous variable involved.
There are places for trivial optimization, but for now I tried to keep the code as transparent as possible.
It would be great if some of the core developers can start seriously reviewing with PR.