[MRG+1] ENH: Feature selection based on mutual information by nmayorov · Pull Request #5372 · scikit-learn/scikit-learn · GitHub

[MRG+1] ENH: Feature selection based on mutual information #5372


Closed
wants to merge 25 commits into from

Conversation

nmayorov
Contributor
@nmayorov nmayorov commented Oct 8, 2015

Hi! This is my attempt to finish/rework #2547

I tried to address code style issues and also added algorithms estimating mutual information with continuous variable involved.

There are places for trivial optimization, but for now I tried to keep the code as transparent as possible.

It would be great if some of the core developers could start seriously reviewing this PR.

@nmayorov
Contributor Author
nmayorov commented Oct 8, 2015

There is an issue with y not being converted to numeric from object (in test check_dtype_object), which causes the error in numpy 1.6.2.

What is the best approach to deal with it? I can split the algorithm in classification / regression, but it looks like unnecessary duplication.

@amueller
Member
amueller commented Oct 9, 2015

Thanks for the PR.

Not sure if providing categorical_target as an option is the right way to go. Most things in sklearn work on either a discrete or a continuous y. On the other hand, adding separate classes for that is a bit too much, and trying to figure it out automatically might be too magical.

To get rid of the failure, you could do the conversion in _compute_mi when y is supposed to be continuous. That doesn't really solve the API issue, though.

@nmayorov
Contributor Author

Are you OK if I introduce two classes MutualInfoRegression and MutualInfoClassification? It will be very much in style of scikit-learn, I think.

jnothman

Attributes
----------
n_features_ : int
Member


This seems unnecessary

@jnothman
Member

I agree it might be in style, but only if it were MutualInfoRegressionSelector and similar, which is a nasty name. Could consider ClassMutualInfoSelector, but not sure what the regression variant is. Ultimately, I agree with @amueller that two classes may be excessive.

We have type_of_target to sniff classification targets, but when its output says something is binary or multiclass, that should only be taken to mean that this is the finest target type the values could encode.

So for now, leave categorical_target as it is.
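To illustrate why sniffing is unreliable, here is how `type_of_target` (from `sklearn.utils.multiclass`) behaves: it reports only the finest type the values could encode, not whether the user means them as class labels or as numbers.

```python
from sklearn.utils.multiclass import type_of_target

# These integer targets *could* be class labels or regression values;
# type_of_target can only report the finest type they could encode.
print(type_of_target([0, 1, 0, 1]))     # 'binary'
print(type_of_target([1, 2, 3]))        # 'multiclass'
print(type_of_target([0.5, 1.2, 3.7]))  # 'continuous'
```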

@@ -0,0 +1,224 @@
# Authors: Andrea Bravi <a.bravi@uottawa.ca>
Member


Please drop _filtering from this filename. Or just call it mutualinfo.py

@nmayorov
Contributor Author

@jnothman I have addressed your suggestions. I kept the n_features_to_select name, as in RFE / RFECV.

What would be the next step?

@jnothman
Member

Oh. Right. How unnecessarily verbose.

Next step is wait for someone to give you a full review, including finding
time to review the literature reference.


@nmayorov
Contributor Author

Let's hope that it will ever happen.

My thoughts on what should be done:

  1. Do all the optimizations, e.g. scale each column only once, fit NearestNeighbors only once per column.
  2. Add a parameter use_redundancy=True. If False, select features based only on relevance. Perhaps rename the class to MutualInfoSelector.
  3. Introduce a score_ attribute, which stores relevance - redundancy for each feature. The idea is that if we have already computed relevance_ and redundancy_, then this score_ is relatively cheap to compute, and having it we can change the number of features to select after the transformer was fit. Not sure if it's a common practice, but it seems useful here.
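Point 3 could be sketched like this (the function name and the exact redundancy aggregation are assumptions, not the PR's final code):

```python
import numpy as np

def compute_scores(relevance, redundancy):
    """Hypothetical score_ computation: relevance of each feature minus
    its average pairwise redundancy with the other features.  Once
    relevance_ and redundancy_ are stored, re-ranking for a different
    number of selected features costs only this subtraction."""
    relevance = np.asarray(relevance, dtype=float)
    redundancy = np.asarray(redundancy, dtype=float)
    return relevance - redundancy.mean(axis=1)
```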

@MechCoder
Member

I can try having a look in the coming week.

@nmayorov
Contributor Author

@MechCoder, that would be great.


x_std = np.std(x)
if x_std > 0:
x = x / x_std
Member


Should we normalize X and y (if they are not categorical) in one go before the expensive looping?

Otherwise this is done multiple times.
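A one-shot version of that scaling might look like this (a sketch for dense X, guarding against constant columns; not the PR's actual code):

```python
import numpy as np

def scale_columns(X):
    """Divide every column of dense X by its standard deviation in one
    pass, leaving constant (zero-std) columns untouched."""
    X = np.array(X, dtype=float)  # copies, so the caller's X is unmodified
    std = X.std(axis=0)
    nonzero = std > 0
    X[:, nonzero] /= std[nonzero]
    return X
```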

Contributor Author


Sure, I had just decided to do it later, in the "optimization phase". I mean, I will do it now.

Contributor Author


Seems like we shouldn't do it for sparse X: storing each row scaled can result in storing the whole X in dense format, which probably was avoided for a good reason.

As for dense X I think it's fine to introduce copy=True parameter and either modify X in place or make a copy then modify.

Contributor Author


About sparse matrices: I was very wrong. Just scaling an element doesn't change its value from zero (obviously), so we can handle both cases equivalently.
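That observation is easy to check with scipy: column-wise scaling only touches the stored entries, so the number of nonzeros (and the memory footprint) is unchanged.

```python
import numpy as np
from scipy import sparse

X = sparse.csr_matrix(np.array([[0.0, 2.0],
                                [0.0, 4.0]]))
scale = np.array([1.0, 0.5])
# Column-wise scaling of a sparse matrix multiplies only the stored
# nonzeros; zeros stay zero, so nothing densifies.
X_scaled = X.multiply(scale).tocsr()
print(X_scaled.nnz)  # 2, same as before scaling
```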

@MechCoder
Member

@nmayorov Sorry for the looong delay. I made a first minor pass. Hopefully you are still here!

@MechCoder
Member

Btw, I changed the title to MRG

@MechCoder MechCoder changed the title [WIP] ENH: Feature selection based on mutual information [MRG] ENH: Feature selection based on mutual information Nov 5, 2015
@nmayorov
Contributor Author
nmayorov commented Nov 5, 2015

@MechCoder looking forward to finishing this PR. Will try to work on it over the weekends.

@MechCoder
Member

Sure, I'm looking forward to it as well.

@nmayorov
Contributor Author
nmayorov commented Nov 7, 2015

Issues left to consider:

  1. Subset selection vs. ranking of all features.
  2. Which attributes to compute and in what way.
  3. Add the option to select based only on relevance (to reduce computations).
  4. Precompute KDTree for individual features for efficiency.
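Point 4 might look roughly like this (a sketch with an assumed helper name; the PR itself works with NearestNeighbors, as noted earlier):

```python
import numpy as np
from sklearn.neighbors import KDTree

def precompute_column_trees(X):
    """Build one KDTree per feature column so repeated k-NN queries in
    the MI estimator can reuse them instead of refitting each time."""
    return [KDTree(X[:, [j]]) for j in range(X.shape[1])]
```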

-------
self
"""
X, y = check_X_y(X, y, accept_sparse='csr',
Member


accept_sparse='csc' here and do away with the conversion again below.
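For example, requesting CSC directly in check_X_y returns a CSC matrix up front, so no second conversion is needed later:

```python
import numpy as np
from scipy import sparse
from sklearn.utils import check_X_y

X = sparse.random(10, 3, density=0.5, format='csr', random_state=0)
y = np.arange(10.0)
# accept_sparse='csc' makes check_X_y do the format conversion once.
X_checked, y_checked = check_X_y(X, y, accept_sparse='csc')
print(X_checked.format)  # 'csc'
```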

Contributor Author


My doubt was caused by the docstring of scale: "To avoid memory copy the caller should pass a CSR matrix." What do you say considering this?

Member


o.O This is odd indeed. I would expect it to be the other way around, no? Since operations along each column are easy in csc_matrices.
See: #5791

@MechCoder
Member

Merged with master ! Thanks a lot @nmayorov for your patience and congrats 🍷 🍷
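For readers arriving later: the estimators developed in this PR are exposed in scikit-learn as mutual_info_classif and mutual_info_regression, which plug into the existing univariate selectors. A typical use:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           random_state=0)
# Rank features by estimated mutual information with y and keep the top 2.
selector = SelectKBest(mutual_info_classif, k=2).fit(X, y)
print(selector.transform(X).shape)  # (200, 2)
```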

@MechCoder MechCoder closed this Jan 22, 2016
@nmayorov
Contributor Author

@MechCoder @agramfort thanks a lot for working with me.
