WIP Overhaul of feature_selection #1939
Conversation
'stds': _scale_standard_deviations,
'percentile': _scale_percentile,
'incrank': _scale_rank,
'decrank': lambda scores, n_samples: _scale_rank(-scores, n_samples),
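The dict above reads as a module-level registry of scaling callables, with `'decrank'` reusing the ascending-rank implementation on negated scores. A minimal stand-in sketch (the real `_scale_rank` in the PR is not shown, so this pure-Python version only illustrates the registry pattern and the rank-to-[0, 1] scaling it implies):

```python
def _scale_rank(scores, n_samples):
    # stand-in: map each score to its ascending rank, scaled into (0, 1]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank / len(scores)
    return ranks

SCALERS = {
    'incrank': _scale_rank,
    # 'decrank' reuses the ascending implementation on negated scores
    'decrank': lambda scores, n_samples: _scale_rank([-s for s in scores],
                                                     n_samples),
}
```

Keeping both directions as one function plus a negating lambda avoids duplicating the ranking logic.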
Does this properly pickle?
Yes, I expect it should: it's not part of a `__dict__`, and never intends to be. As it is, I'm not sure it's such a useful feature, so I don't mind removing it.
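The pickling point can be demonstrated with a toy example: because the lambda lives only in a module-level registry and estimators store just the registry *key*, the estimator round-trips through pickle even though the lambda itself is not picklable (names here are invented for illustration):

```python
import pickle

def _scale_rank(scores, n_samples):
    # stand-in for the real rank-scaling helper
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank / len(scores)
    return ranks

# module-level registry: the lambda lives here, not in any estimator's __dict__
SCALERS = {
    'incrank': _scale_rank,
    'decrank': lambda scores, n_samples: _scale_rank([-s for s in scores],
                                                     n_samples),
}

class Selector:
    """Toy selector that stores only the registry key, a plain string."""
    def __init__(self, scaling='decrank'):
        self.scaling = scaling
    def scale(self, scores):
        return SCALERS[self.scaling](scores, len(scores))

# round-trips fine: the instance __dict__ holds only the string 'decrank'
est = pickle.loads(pickle.dumps(Selector('decrank')))

# the lambda itself would not pickle, but it never needs to
try:
    pickle.dumps(SCALERS['decrank'])
    lambda_picklable = True
except Exception:
    lambda_picklable = False
```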
Now can be pickled
Now can set est params
It seems to me there are commonalities between feature selection processes that are not being exploited, and places where ad-hoc code is used instead of a library.
Some of these things may be split off into a separate PR, especially on request, and all are up for discussion.
- create a mixin to provide `[inverse_]transform` given `_get_support_mask()` in all feature selectors (but what to call it, given that `SelectorMixin` is taken by something else?); split off into MRG Centralise feature selection transformations in a mixin #1962
- `SelectByScore(score_func, minimum, maximum, scaling_func, limit)` for more generic score thresholding, e.g. supporting two-sided document frequency cutoffs. This implementation also reduces work at `transform` time by pre-transforming/scaling scores to match the threshold. What should be done with `nan`s?
- use `SelectByScore` for `Select{KBest,Percentile,Fdr,Fpr,Fwe,orMixin}`, closing Retrieve the mask in SelectFromModel #1459
- use `SelectByScore` to handle thresholding in `feature_extraction.text.CountVectorizer` and `randomized_l1`
- move `chi2`, `f_classif`, etc. to `sklearn.metrics.feature_scores` or similar (and add document frequency == `count_nonzero`; what else?)
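The first proposal above, a mixin deriving both transforms from one boolean mask, might look roughly like this; the mixin name and the toy selector are invented for illustration:

```python
import numpy as np

class MaskedSelectionMixin:
    """Hypothetical mixin: transform/inverse_transform from one support mask."""

    def transform(self, X):
        # keep only the columns flagged by the subclass's mask
        return np.asarray(X)[:, self._get_support_mask()]

    def inverse_transform(self, X):
        # scatter the kept columns back, zero-filling the dropped ones
        mask = self._get_support_mask()
        X = np.asarray(X)
        Xr = np.zeros((X.shape[0], mask.shape[0]), dtype=X.dtype)
        Xr[:, mask] = X
        return Xr


class EveryOtherFeature(MaskedSelectionMixin):
    """Toy selector keeping even-indexed features."""

    def __init__(self, n_features):
        self.n_features = n_features

    def _get_support_mask(self):
        mask = np.zeros(self.n_features, dtype=bool)
        mask[::2] = True
        return mask
```

With this, every concrete selector only has to implement `_get_support_mask()`, which is the centralisation the PR argues for.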
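The proposed `SelectByScore` thresholding could be sketched as follows; this is a simplification that ignores `scaling_func` and `limit`, with a document-frequency scorer (the `count_nonzero` idea above) standing in for `score_func`:

```python
import numpy as np

class SelectByScore:
    """Hypothetical sketch: keep features whose score lies in [minimum, maximum]."""

    def __init__(self, score_func, minimum=None, maximum=None):
        self.score_func = score_func
        self.minimum = minimum
        self.maximum = maximum

    def fit(self, X, y=None):
        self.scores_ = np.asarray(self.score_func(X, y))
        return self

    def _get_support_mask(self):
        # two-sided cutoff: either bound may be omitted
        mask = np.ones(self.scores_.shape, dtype=bool)
        if self.minimum is not None:
            mask &= self.scores_ >= self.minimum
        if self.maximum is not None:
            mask &= self.scores_ <= self.maximum
        return mask

    def transform(self, X):
        return np.asarray(X)[:, self._get_support_mask()]


def document_frequency(X, y=None):
    # number of documents each term occurs in (count_nonzero per column)
    return (np.asarray(X) > 0).sum(axis=0)
```

For example, `SelectByScore(document_frequency, minimum=2, maximum=2)` drops both rare and ubiquitous terms in one pass, the two-sided cutoff a plain `threshold` parameter cannot express.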