[MRG+1] ENH: Feature selection based on mutual information by nmayorov · Pull Request #5372 · scikit-learn/scikit-learn · GitHub

[MRG+1] ENH: Feature selection based on mutual information #5372


Closed
wants to merge 25 commits into from

Conversation

nmayorov
Contributor
@nmayorov nmayorov commented Oct 8, 2015

Hi! This is my attempt to finish/rework #2547

I tried to address code style issues and also added algorithms estimating mutual information with continuous variable involved.

There are places for trivial optimization, but for now I tried to keep the code as transparent as possible.

It would be great if some of the core developers could start seriously reviewing this PR.

@nmayorov
Contributor Author
nmayorov commented Oct 8, 2015

There is an issue with y not being converted to numeric from object (in test check_dtype_object), which causes the error in numpy 1.6.2.

What is the best approach to deal with it? I can split the algorithm in classification / regression, but it looks like unnecessary duplication.

@amueller
Member
amueller commented Oct 9, 2015

Thanks for the PR.

Not sure if providing categorical_target as an option is the right way to go. Most things in sklearn work on either a discrete or a continuous y. On the other hand, adding separate classes for that is a bit too much, and trying to figure it out automatically might be too magical.

To get rid of the failure, you could do the conversion in _compute_mi when y is supposed to be continuous. That doesn't really solve the API issue, though.

@nmayorov
Contributor Author

Are you OK if I introduce two classes MutualInfoRegression and MutualInfoClassification? It will be very much in style of scikit-learn, I think.

jnothman

Attributes
----------
n_features_ : int
Member


This seems unnecessary

@jnothman
Member

I agree it might be in style, but only if it were MutualInfoRegressionSelector and similar, which is a nasty name. Could consider ClassMutualInfoSelector, but not sure what the regression variant is. Ultimately, I agree with @amueller that two classes may be excessive.

We have type_of_target to sniff classification targets, but when its output says something is binary or multiclass, that should only be taken to mean that this is the finest target type the values could encode.

So for now, leave categorical_target as it is.
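To illustrate why sniffing is unreliable, here is how `type_of_target` (from `sklearn.utils.multiclass`) behaves: it reports only the finest type the values could encode, not whether the user means them as class labels or as numbers.

```python
from sklearn.utils.multiclass import type_of_target

# These integer targets *could* be class labels or regression values;
# type_of_target can only report the finest type they could encode.
print(type_of_target([0, 1, 0, 1]))     # 'binary'
print(type_of_target([1, 2, 3]))        # 'multiclass'
print(type_of_target([0.5, 1.2, 3.7]))  # 'continuous'
```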

@@ -0,0 +1,224 @@
# Authors: Andrea Bravi <a.bravi@uottawa.ca>
Member


Please drop _filtering from this filename. Or just call it mutualinfo.py

@nmayorov
Contributor Author

@jnothman I have addressed your suggestions. I kept the n_features_to_select name, as in RFE / RFECV.

What would be the next step?

@jnothman
Member

Oh. Right. How unnecessarily verbose.

Next step is wait for someone to give you a full review, including finding
time to review the literature reference.


@nmayorov
Contributor Author

Let's hope that it will ever happen.

My thoughts on what should be done:

  1. Do all the optimizations, e.g. scale each column only once, fit NearestNeighbors only once per column.
  2. Add a parameter use_redundancy=True. If False, select features based only on relevance. Perhaps rename the class to MutualInfoSelector.
  3. Introduce a score_ attribute, which stores relevance - redundancy for each feature. The idea is that if we have already computed relevance_ and redundancy_, then this score_ is relatively cheap to compute, and having it we can change the number of features to select after the transformer was fit. Not sure if it's a common practice, but it seems useful here.
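Point 3 could be sketched like this (the function name and the exact redundancy aggregation are assumptions, not the PR's final code):

```python
import numpy as np

def compute_scores(relevance, redundancy):
    """Hypothetical score_ computation: relevance of each feature minus
    its average pairwise redundancy with the other features.  Once
    relevance_ and redundancy_ are stored, re-ranking for a different
    number of selected features costs only this subtraction."""
    relevance = np.asarray(relevance, dtype=float)
    redundancy = np.asarray(redundancy, dtype=float)
    return relevance - redundancy.mean(axis=1)
```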

@MechCoder
Member

I can try having a look in the coming week.

@nmayorov
Contributor Author

@MechCoder, that would be great.


x_std = np.std(x)
if x_std > 0:
x = x / x_std
Member


Should we normalize X and y (if they are not categorical) in one go before the expensive looping?

Otherwise this is done multiple times.
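A one-shot version of that scaling might look like this (a sketch for dense X, guarding against constant columns; not the PR's actual code):

```python
import numpy as np

def scale_columns(X):
    """Divide every column of dense X by its standard deviation in one
    pass, leaving constant (zero-std) columns untouched."""
    X = np.array(X, dtype=float)  # copies, so the caller's X is unmodified
    std = X.std(axis=0)
    nonzero = std > 0
    X[:, nonzero] /= std[nonzero]
    return X
```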

Contributor Author


Sure, I had just decided to do it later, in the "optimization phase". I mean, I will do it now.

Contributor Author


Seems like we shouldn't do it for sparse X: storing each row scaled can result in storing the whole X in dense format, which probably was avoided for a good reason.

As for dense X I think it's fine to introduce copy=True parameter and either modify X in place or make a copy then modify.

Contributor Author


About sparse matrices: I was very wrong. Just scaling an element doesn't change its value from zero (obviously), so we can handle both cases equivalently.
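That observation is easy to check with scipy: column-wise scaling only touches the stored entries, so the number of nonzeros (and the memory footprint) is unchanged.

```python
import numpy as np
from scipy import sparse

X = sparse.csr_matrix(np.array([[0.0, 2.0],
                                [0.0, 4.0]]))
scale = np.array([1.0, 0.5])
# Column-wise scaling of a sparse matrix multiplies only the stored
# nonzeros; zeros stay zero, so nothing densifies.
X_scaled = X.multiply(scale).tocsr()
print(X_scaled.nnz)  # 2, same as before scaling
```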

@MechCoder
Member

@nmayorov Sorry for the looong delay. I made a first minor pass. Hopefully you are still here!

@MechCoder
Member

Btw, I changed the title to MRG

@MechCoder MechCoder changed the title [WIP] ENH: Feature selection based on mutual information [MRG] ENH: Feature selection based on mutual information Nov 5, 2015
@nmayorov
Contributor Author
nmayorov commented Nov 5, 2015

@MechCoder looking forward to finishing this PR. Will try to work on it over the weekends.

@MechCoder
Member

Sure, I'm looking forward to it as well.

@nmayorov
Contributor Author
nmayorov commented Nov 7, 2015

Issues left to consider:

  1. Subset selection vs. ranking of all features.
  2. Which attributes to compute and in what way.
  3. Add the option to select based only on relevance (to reduce computations).
  4. Precompute KDTree for individual features for efficiency.
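Point 4 might look roughly like this (a sketch with an assumed helper name; the PR itself works with NearestNeighbors, as noted earlier):

```python
import numpy as np
from sklearn.neighbors import KDTree

def precompute_column_trees(X):
    """Build one KDTree per feature column so repeated k-NN queries in
    the MI estimator can reuse them instead of refitting each time."""
    return [KDTree(X[:, [j]]) for j in range(X.shape[1])]
```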

-------
self
"""
X, y = check_X_y(X, y, accept_sparse='csr',
Member


accept_sparse='csc' here and do away with the conversion again below.
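For example, requesting CSC directly in check_X_y returns a CSC matrix up front, so no second conversion is needed later:

```python
import numpy as np
from scipy import sparse
from sklearn.utils import check_X_y

X = sparse.random(10, 3, density=0.5, format='csr', random_state=0)
y = np.arange(10.0)
# accept_sparse='csc' makes check_X_y do the format conversion once.
X_checked, y_checked = check_X_y(X, y, accept_sparse='csc')
print(X_checked.format)  # 'csc'
```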

Contributor Author


My doubt was caused by the docstring of scale: "To avoid memory copy the caller should pass a CSR matrix." What do you say considering this?

Member


o.O This is odd indeed. I would expect it to be the other way around, no? Since operations along each column are easy in csc_matrices.
See: #5791

@MechCoder
Member

Merged with master ! Thanks a lot @nmayorov for your patience and congrats 🍷 🍷
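For readers arriving later: the estimators developed in this PR are exposed in scikit-learn as mutual_info_classif and mutual_info_regression, which plug into the existing univariate selectors. A typical use:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           random_state=0)
# Rank features by estimated mutual information with y and keep the top 2.
selector = SelectKBest(mutual_info_classif, k=2).fit(X, y)
print(selector.transform(X).shape)  # (200, 2)
```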

@MechCoder MechCoder closed this Jan 22, 2016
@nmayorov
Contributor Author

@MechCoder @agramfort thanks a lot for working with me.
