Feature selection for categorical variables · Issue #8480 · scikit-learn/scikit-learn
Feature selection for categorical variables #8480
Open
@amueller

Currently, feature selection on categorical variables is hard.
With one-hot encoding we can select individual categories, but we can't easily remove a whole variable.
I think this is something we should be able to do.

One way would be to have feature selection methods that are aware of categorical variables - I guess SelectFromModel(RandomForestClassifier()) would do that after we add categorical variable support.

We should have some simpler, test-based methods, though.
Maybe f_regression and f_classif could be extended so that they can take categorical features into account?
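
To make the test-based idea concrete, here is a minimal sketch of scoring a whole categorical variable against a class label with a chi-squared test of independence on their contingency table. This is a swapped-in technique for illustration, not an extension of f_classif, and the function name categorical_chi2_pvalue is hypothetical (it is not part of scikit-learn). It assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy.stats import chi2_contingency


def categorical_chi2_pvalue(x, y):
    """Hypothetical helper: p-value of a chi-squared independence test
    between one categorical feature x and class labels y."""
    # Map arbitrary category / class values to integer codes 0..n-1.
    x_codes = np.unique(np.asarray(x).ravel(), return_inverse=True)[1]
    y_codes = np.unique(np.asarray(y).ravel(), return_inverse=True)[1]

    # Build the (n_categories, n_classes) contingency table of counts.
    table = np.zeros((x_codes.max() + 1, y_codes.max() + 1), dtype=int)
    np.add.at(table, (x_codes, y_codes), 1)

    # The test considers all categories of the variable jointly,
    # so the resulting p-value refers to the variable as a whole.
    _, p_value, _, _ = chi2_contingency(table)
    return p_value
```

A small p-value suggests the variable as a whole is associated with the target, so a selector could rank or threshold original variables on it instead of on individual one-hot columns.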

Alternatively, we could try to remove groups of features that correspond to the same original variable. That seems theoretically possible if we add feature name support, but using feature names for this would mean putting semantics in the names, which I don't entirely like. Doing it "properly" would require adding metadata about the features, similar to sample_props, only on the other axis. That seems pretty heavy-handed, though.
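
A rough sketch of the group-based variant, assuming we already have some per-column score and an array saying which original variable each encoded column came from. The function select_groups, the groups array, and the choice to score a group by its best column are all hypothetical, one possible aggregation among several and not an existing scikit-learn API.

```python
import numpy as np


def select_groups(column_scores, groups, k):
    """Hypothetical helper: keep every column belonging to the k
    highest-scoring groups and drop all other columns.

    column_scores : per-column scores (higher is better)
    groups        : id of the original variable for each column
    k             : number of original variables to keep
    """
    column_scores = np.asarray(column_scores, dtype=float)
    groups = np.asarray(groups)

    group_ids = np.unique(groups)
    # Assumption: a group is as good as its best column.
    group_scores = np.array(
        [column_scores[groups == g].max() for g in group_ids]
    )

    # Ids of the k best groups, then a boolean mask over columns.
    top_groups = group_ids[np.argsort(group_scores)[::-1][:k]]
    return np.isin(groups, top_groups)
```

For example, with groups = [0, 0, 0, 1, 2] (a three-category variable followed by two other variables), the returned mask keeps or drops the first three columns together. The groups array plays the role of the per-feature metadata mentioned above.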
