Minimum redundancy maximum relevance (mRMR) feature selection #2547
@@ -0,0 +1,69 @@
""" | ||
=========================================== | ||
Minimum redundancy maximum relevance (mRMR) | ||
=========================================== | ||
|
||
Mutual information is a metric assessing the degree of statistical independence | ||
between two random variables. | ||
|
||
mRMR feature selection consists in selecting a subset of the available features | ||
showing high mutual information with the target and low mutual information with | ||
each other. | ||
|
||
This example compares mRMR feature selection with Recursive feature elimination | ||
(RFE) and Univariate feature selection (Uni), taking advantage of a synthetic | ||
dataset. | ||
|
||
This dataset has 100 samples and 3 features: A, B and C, enabling to | ||
respectively classify 60%, 50% and 40% of the data. | ||
|
||
Let's assume the plan is to choose only 2 of those 3 features. Given that A | ||
and B have higher accuracy, we would expect a selection algorithm to pick those | ||
two. However, it turns out that A and B are redundant with each other (i.e. | ||
they are able to classify the same samples). Conversely, C has lower accuracy, | ||
but provides indepedent information respect to A and B. | ||
|
||
As expectable, mRMR selects feature A and C, while the other two selection | ||
algorithm select features A and B. | ||
|
||
|
||
.. note:: | ||
|
||
See also :ref:`example_plot_rfe_digits.py`, | ||
:ref:`example_plot_feature_selection.py` | ||
|
||
""" | ||
print(__doc__) | ||
|
||
import numpy as np | ||
from sklearn.feature_selection import RFE, SelectKBest, chi2, \ | ||
MinRedundancyMaxRelevance | ||
from sklearn.linear_model import LogisticRegression | ||
|
||
|
||
# Number of samples in the dataset | ||
N = 100 | ||
|
||
# Associating a class to each sample in the dataset | ||
y = np.array([0] * 50 + [1] * 50) | ||
|
||
# Creating a feature able to classify 60% of the samples | ||
A = np.array([0] * 30 + [1] * 20 + [1] * 20 + [2] * 30) | ||
|
||
# Creating a feature able to classify 50% of the samples | ||
B = np.array([2] * 25 + [1] * 25 + [1] * 25 + [0] * 25) | ||
|
||
# Creating a feature able to classify 40% of the samples | ||
C = np.array([2] * 20 + [0] * 30 + [1] * 30 + [2] * 20) | ||
|
||
X = np.array([A, B, C]).T | ||
feature = ['A', 'B', 'C'] | ||
|
||
# We will be using the following three selectors | ||
selectors = [('RFE', RFE(LogisticRegression(), 2)), | ||
('Uni', SelectKBest(chi2, k=2)), | ||
('mRMR', MinRedundancyMaxRelevance(k=2))] | ||
|
||
for name, selector in selectors: | ||
k = selector.fit(X, y).get_support(True).tolist() | ||
print name, 'selected %s and %s' % (feature[k[0]], feature[k[1]]) | ||
Review comment: Please make this Python 3-compatible (i.e. use the function form of print with …)
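A minimal sketch of what that comment asks for, assuming a __future__ import at the top of the example script is acceptable:

from __future__ import print_function

for name, selector in selectors:
    k = selector.fit(X, y).get_support(True).tolist()
    print(name, 'selected %s and %s' % (feature[k[0]], feature[k[1]]))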
@@ -0,0 +1,154 @@
# Author: Andrea Bravi <a.bravi@uottawa.ca>
# License: 3-clause BSD

import numpy as np
from ..base import BaseEstimator
from .base import SelectorMixin
from ..metrics.cluster.supervised import mutual_info_score
from ..utils.validation import array2d


class MinRedundancyMaxRelevance(BaseEstimator, SelectorMixin):
    """
    Select the subset of features with minimal redundancy and maximal
    relevance (mRMR) with the outcome.

    IMPORTANT: This version only supports data in categorical or integer form.

    Attributes
    ----------
    k : int, default=2

Review comment: This should be in a …

        Number of features to select (selected_features)
    mask : list, len=selected_features
        Integer list of the features ordered by maximal relevance and
        minimal redundancy
    score : array, shape=[selected_features]
        mRMR score associated with each entry in mask
    relevance : array, shape=[n_features]

Review comment: All attributes representing the model should have a …

        Relevance of all the features
    redundancy : array, shape=[n_features]
        Redundancy of all the features
    rule : string, default='diff'
        Rule to combine relevance and redundancy, either
        'diff' - difference between the two
        'prod' - product between the two
    X : array, shape=[n_samples, n_features]
        Input dataset, must be either integer or categorical
    y : array, shape=[n_samples]
        Label vector, must be either integer or categorical

    References
    ----------
    .. [1] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual
           information: criteria of max-dependency, max-relevance, and
           min-redundancy", IEEE Transactions on Pattern Analysis and Machine
           Intelligence, Vol. 27, No. 8, pp. 1226-1238, 2005.
    """
    def __init__(self, k=2, rule='diff'):
        """
        Parameters

Review comment: Rather, this should move to the class docstring.

        ----------
        k : int, default=2
            Number of features to select
        rule : string, default='diff'
            Rule to combine relevance and redundancy, either
            'diff' - difference between the two
            'prod' - product between the two
        """
        self.k = k
        self.rule = rule
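A hedged sketch of the reorganisation those comments point towards, following the usual scikit-learn conventions (constructor parameters documented in the class docstring, fitted results exposed as trailing-underscore attributes); the names mask_ and score_ are illustrative, not part of the submitted code:

class MinRedundancyMaxRelevance(BaseEstimator, SelectorMixin):
    """Select the k features with minimal redundancy and maximal relevance.

    Parameters
    ----------
    k : int, default=2
        Number of features to select.
    rule : string, default='diff'
        Rule to combine relevance and redundancy, either 'diff' or 'prod'.

    Attributes
    ----------
    mask_ : list
        Indices of the selected features, in selection order.
    score_ : list
        mRMR score associated with each entry in mask_.
    """

    def __init__(self, k=2, rule='diff'):
        self.k = k
        self.rule = rule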
    def fit(self, X, y):
        """
        Parameters
        ----------
        X : array, shape=[n_samples, n_features]
            Input dataset, must be either integer or categorical
        y : array, shape=[n_samples]
            Label vector, must be either integer or categorical
        """
        X = array2d(X)

        self.X = X

Review comment: I don't think these should be stored.

        self.y = y
        self.mask, self.score = self._compute_mRMR(X, y)
        return self
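Along the lines of that comment, fit could keep only the results of the selection rather than the training data; a sketch:

    def fit(self, X, y):
        X = array2d(X)
        # Keep only the fitted results, not X and y themselves.
        self.mask, self.score = self._compute_mRMR(X, y)
        return self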
    def _get_support_mask(self):
        """
        Returns
        -------
        support : array, dtype=bool, shape=[n_features]
            Boolean mask with True for the selected features
        """
        support = np.zeros(self.n_features, dtype=bool)
        support[self.mask] = True
        return support

    def _compute_mRMR(self, X, y):
        """
        Parameters
        ----------
        X : array, shape=[n_samples, n_features]
            Input dataset, must be either integer or categorical
        y : array, shape=[n_samples]
            Label vector, must be either integer or categorical

        Returns
        -------
        mask : list, len=selected_features
            Integer list of the features ordered by maximal relevance and
            minimal redundancy
        score : list, len=selected_features
            mRMR score associated with each entry in mask
        """
        M = X.shape[1]  # Number of features

        # Computation of relevance and redundancy
        relevance = np.zeros(M)
        redundancy = np.zeros([M, M])
        for m1 in range(0, M):
            relevance[m1] = mutual_info_score(X[:, m1], y)

Review comment: It would be nice if we could reimplement mutual_info to support calculation over sparse feature spaces (in a CSC matrix).

Review comment: Also, we could possibly calculate MI for all the features more efficiently.
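As a rough illustration of the second point, a sketch of how the mutual information of two integer-coded variables could be computed from a single contingency table rather than through the generic scoring call; dense input only, so it does not cover the sparse/CSC case, and mutual_info_from_codes is an illustrative helper, not existing API:

import numpy as np

def mutual_info_from_codes(x, y):
    # Joint count table via one flat bincount over (x, y) code pairs;
    # assumes x and y hold small non-negative integer codes.
    x, y = np.asarray(x), np.asarray(y)
    n_x, n_y = x.max() + 1, y.max() + 1
    counts = np.bincount(x * n_y + y, minlength=n_x * n_y).reshape(n_x, n_y)
    pxy = counts / float(counts.sum())
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0  # empty cells contribute nothing to the sum
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px * py)[nz])))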
            for m2 in range(m1+1, M):
                redundancy[m1, m2] = mutual_info_score(X[:, m1],
                                                       X[:, m2])
                redundancy[m2, m1] = redundancy[m1, m2]

        # Sequential search optimization
        mask = []
        score = []
        search_space = range(0, M)

        score.append(max(relevance))
        ind = int(relevance.argmax(0))  # Optimal feature
        mask.append(ind)
        search_space.pop(ind)

        if self.rule == 'diff':
            for m in range(0, self.k-1):

Review comment: Rather than repeating yourself with the logic, you could define …

                tmp_score = relevance[search_space] - \
                    np.mean(redundancy[:, search_space]
                            .take(mask, axis=0), 0)
                score.append(max(tmp_score))

Review comment: Please use np.max when operating over an array.

Review comment: Although it may make more sense to just use …

                ind = tmp_score.argmax(0)
                mask.append(search_space[ind])
                search_space.pop(ind)

        elif self.rule == 'prod':
            for m in range(0, self.k-1):
                tmp_score = relevance[search_space] * \
                    np.mean(redundancy[:, search_space]
                            .take(mask, axis=0), 0)
                score.append(max(tmp_score))
                ind = tmp_score.argmax(0)
                mask.append(search_space[ind])
                search_space.pop(ind)
        else:
            raise ValueError("rule should be either 'diff' or 'prod'")
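Taken together, the three comments above might lead to a restructuring along these lines; the standalone helper and the name _sequential_mrmr are illustrative only, and the sketch deliberately mirrors the behaviour of the submitted code:

import numpy as np

def _sequential_mrmr(relevance, redundancy, k, rule):
    # Pick the combination rule once, then run a single greedy loop.
    if rule == 'diff':
        combine = np.subtract
    elif rule == 'prod':
        combine = np.multiply
    else:
        raise ValueError("rule should be either 'diff' or 'prod'")

    search_space = list(range(relevance.shape[0]))
    ind = int(relevance.argmax())
    mask, score = [ind], [relevance[ind]]
    search_space.pop(ind)

    for _ in range(k - 1):
        tmp_score = combine(relevance[search_space],
                            np.mean(redundancy[:, search_space]
                                    .take(mask, axis=0), axis=0))
        ind = int(tmp_score.argmax())
        score.append(tmp_score[ind])  # argmax + indexing instead of max()
        mask.append(search_space[ind])
        search_space.pop(ind)
    return mask, score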
        self.n_features = M
        self.relevance = relevance
        self.redundancy = redundancy

        return mask, score
@@ -0,0 +1,31 @@
from sklearn.utils.testing import (assert_array_equal, assert_raises)

import numpy as np

from sklearn.feature_selection import MinRedundancyMaxRelevance

X = np.array([[1, 3, 1],
              [3, 3, 3],
              [1, 3, 1],
              [1, 3, 3],
              [1, 3, 1]])

y = np.array([3, 1, 3, 1, 3])


def test_mRMR():
    """
    Test MinRedundancyMaxRelevance with default setting.
    """
    m = MinRedundancyMaxRelevance().fit(X, y)

    assert_array_equal([2, 0], m.mask)

    assert_array_equal(0.6730116670092563, m.score[0])

Review comment: It would be better to compare to results in the literature, which I presume are not reported to 16 decimal places ;)
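Until literature values are available, a looser scalar check is one possible interim form, assuming a 3-decimal tolerance is acceptable:

from sklearn.utils.testing import assert_almost_equal

    # Compare to a rounded reference value instead of 16 decimal places.
    assert_almost_equal(m.score[0], 0.673, decimal=3)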
    m = MinRedundancyMaxRelevance(rule='prod').fit(X, y)

    assert_array_equal(0.049793044493117354, m.score[1])

    assert_raises(ValueError, MinRedundancyMaxRelevance(rule='none').fit, X, y)
Review comment: Maybe this transformation should be provided as part of the feature selector.