Description
I am trying to use SelectFromModel to select the top X% of features ranked by feature importance. However, SelectFromModel only lets the selection threshold be the mean or median of the feature importances, a scaled multiple of either, or a fixed numeric value. I think it would be helpful for the API to also accept a percentile as the threshold, so that a given percentage of features can be selected. This functionality is not currently present in scikit-learn.
Steps/Code to Reproduce
Example:
from sklearn.feature_selection import SelectFromModel
help(SelectFromModel)
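For reference, here is a minimal sketch of the threshold forms that are accepted today, run on synthetic data. The dataset, the estimator settings, and the `n_selected` dictionary are illustrative assumptions of mine and are not part of the original report.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data, used only for illustration.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       random_state=0)

# Threshold forms currently accepted: "mean", "median",
# a scaled variant such as "1.25*mean", or a plain float.
thresholds = ["mean", "median", "1.25*mean", 0.05]

n_selected = {}
for threshold in thresholds:
    selector = SelectFromModel(
        RandomForestRegressor(n_estimators=20, random_state=0),
        threshold=threshold,
    )
    selector.fit(X, y)
    # Count how many features survive each threshold.
    n_selected[threshold] = selector.transform(X).shape[1]

print(n_selected)
```

None of these forms lets me ask directly for "the top 10% of features".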
Desired API
I propose this as a change to the SelectFromModel API: extend the threshold parameter to accept percentile strings of the form "X-percentile", which would select the top X% of features by importance.
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
select = SelectFromModel(RandomForestRegressor(), threshold="10-percentile") # following the format of mean/median scaling in docs, i.e. "1.25*mean"
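Until something like this exists, a possible workaround is to compute the cutoff by hand and pass it as a numeric threshold. The sketch below assumes the estimator exposes `feature_importances_` and that it is fitted first and then passed with `prefit=True`; the helper `top_percent_threshold` is a hypothetical name of mine, not a scikit-learn function, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel


def top_percent_threshold(importances, percent):
    """Hypothetical helper: importance cutoff for the top `percent`% of features."""
    # The top 10% of features sit at or above the 90th percentile.
    return np.percentile(importances, 100 - percent)


# Synthetic data, used only for illustration.
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       random_state=0)

# Fit first so the importances are available for the percentile.
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
threshold = top_percent_threshold(forest.feature_importances_, 10)

# Reuse the fitted estimator; features with importance >= threshold are kept.
select = SelectFromModel(forest, threshold=threshold, prefit=True)
X_selected = select.transform(X)
print(X_selected.shape)
```

This keeps roughly the top 10% of features, but ties in the importances can push the count slightly higher, and the two-step fit is awkward inside a Pipeline, which is part of why I would like the string form.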
Versions
Darwin-16.6.0-x86_64-i386-64bit
Python 3.6.1 |Anaconda 4.4.0 (x86_64)| (default, May 11 2017, 13:04:09) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.12.1
SciPy 0.19.0
Scikit-Learn 0.18.1