WIP Overhaul of feature_selection by jnothman · Pull Request #1939 · scikit-learn/scikit-learn · GitHub

WIP Overhaul of feature_selection #1939

Closed
jnothman wants to merge 16 commits

jnothman
Member
@jnothman jnothman commented May 6, 2013

It seems to me there are commonalities between feature selection processes that are not being exploited, and places where ad hoc code is used instead of a library.

Some of these things may be split off into a separate PR, especially on request, and all are up for discussion.

  • create a mixin to provide [inverse_]transform given _get_support_mask() in all feature selectors (but what to call it, given that SelectorMixin is taken by something else?); split off into "MRG Centralise feature selection transformations in a mixin" (#1962)
  • The following are included in [MRG] refactor feature selection by score #2093:
    • provide SelectByScore (score_func, minimum, maximum, scaling_func, limit) for more generic score thresholding, e.g. supporting two-sided document frequency cutoffs. This implementation also reduces work at transform time by pre-transforming/scaling scores to match the threshold.
    • work out how to deal with nan
      • do we want a way to force exclusion of nans?
    • documentation and testing of above
    • hopefully use SelectByScore for Select{KBest,Percentile,Fdr,Fpr,Fwe,orMixin}, closing "Retrieve the mask in SelectFromModel" (#1459)
    • use SelectByScore to handle thresholding in feature_extraction.text.CountVectorizer
    • use same in randomized_l1
  • support dummy cases: selection by given indices; selection by feature name given a feature extractor (or union thereof)
  • documentation and testing of above
  • move chi2, f_classif, etc. to sklearn.metrics.feature_scores or similar (and add document frequency == count_nonzero; what else?)
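
To make the first two proposals concrete, here is a minimal sketch of how a mask-based mixin and a two-sided score threshold could fit together. It is not the code from this PR: `SupportMaskMixin`, `ScoreRangeSelector` and `document_frequency` are hypothetical names, only the `minimum`/`maximum` part of the proposed SelectByScore signature is modelled, and NaN handling (still an open question in the list above) is left as a side effect of the comparisons.

```python
import numpy as np


class SupportMaskMixin:
    """Hypothetical mixin: derives transform/inverse_transform from a mask."""

    def _get_support_mask(self):
        # Subclasses return a boolean array with one entry per input feature.
        raise NotImplementedError

    def transform(self, X):
        X = np.asarray(X)
        mask = self._get_support_mask()
        if X.shape[1] != mask.shape[0]:
            raise ValueError("X has a different number of features than the mask")
        return X[:, mask]

    def inverse_transform(self, X_reduced):
        X_reduced = np.asarray(X_reduced)
        mask = self._get_support_mask()
        if X_reduced.shape[1] != mask.sum():
            raise ValueError("X_reduced does not match the number of kept features")
        # Dropped features are filled with zeros.
        X_full = np.zeros((X_reduced.shape[0], mask.shape[0]), dtype=X_reduced.dtype)
        X_full[:, mask] = X_reduced
        return X_full


def document_frequency(X, y=None):
    """Score each feature by the number of samples in which it is nonzero."""
    return np.count_nonzero(np.asarray(X), axis=0)


class ScoreRangeSelector(SupportMaskMixin):
    """Hypothetical stand-in for SelectByScore with only minimum/maximum."""

    def __init__(self, score_func, minimum=None, maximum=None):
        self.score_func = score_func
        self.minimum = minimum
        self.maximum = maximum

    def fit(self, X, y=None):
        self.scores_ = np.asarray(self.score_func(X, y), dtype=float)
        return self

    def _get_support_mask(self):
        mask = np.ones(self.scores_.shape, dtype=bool)
        # Assumption of this sketch: NaN compares False, so a NaN score is
        # dropped whenever a bound is set and kept when neither bound is set.
        if self.minimum is not None:
            mask &= self.scores_ >= self.minimum
        if self.maximum is not None:
            mask &= self.scores_ <= self.maximum
        return mask


# Two-sided document frequency cutoff: keep features seen in 1 or 2 samples.
X = np.array([[1, 0, 2, 0],
              [0, 0, 3, 1],
              [4, 0, 5, 0]])
selector = ScoreRangeSelector(document_frequency, minimum=1, maximum=2).fit(X)
X_kept = selector.transform(X)
X_back = selector.inverse_transform(X_kept)
```

With this split, a concrete selector only has to say which features survive; the array slicing and the zero-filled inverse live in one place.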

'stds': _scale_standard_deviations,
'percentile': _scale_percentile,
'incrank': _scale_rank,
'decrank': lambda scores, n_samples: _scale_rank(-scores, n_samples),
Member

Does this properly pickle?

Member Author

Yes, I expect it should: the mapping is not part of any instance's __dict__, and is never intended to be. As it is, I'm not sure it's such a useful feature, so I don't mind removing it.
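
The reasoning in this reply can be checked with a small sketch. It assumes the estimator stores only the string key (for example 'decrank') and looks the function up in a module-level dict at call time; that is an assumption, since the thread does not show the estimator code. `NameHolder`, `FunctionHolder` and the body of `_scale_rank` are placeholders invented for illustration.

```python
import pickle


def _scale_rank(scores, n_samples):
    """Placeholder body; the PR's real implementation is not shown here."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks


# Module-level lookup table, mirroring the shape of the reviewed snippet.
_SCALERS = {
    "incrank": _scale_rank,
    "decrank": lambda scores, n_samples: _scale_rank(
        [-s for s in scores], n_samples),
}


class NameHolder:
    """Keeps only the string key, so its state contains no function."""

    def __init__(self, scaling="decrank"):
        self.scaling = scaling

    def scale(self, scores, n_samples):
        return _SCALERS[self.scaling](scores, n_samples)


class FunctionHolder:
    """Keeps the looked-up function itself, which may be a lambda."""

    def __init__(self, scaling="decrank"):
        self.scaling_func = _SCALERS[scaling]


# By default pickle serialises an instance's __dict__, so pickling
# vars(obj) isolates exactly the state the review question is about.
restored_state = pickle.loads(pickle.dumps(vars(NameHolder())))

try:
    pickle.dumps(vars(FunctionHolder()))
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    # Functions are pickled by qualified name; "<lambda>" cannot be looked up.
    lambda_pickled = False
```

Under that assumption the lambda never enters the pickled state, which matches the expectation above; it would only become a problem if the looked-up function were cached on the instance.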
