Feature selection for categorical variables

Currently feature selection on categorical variables is hard.
Using one-hot encoding we can select some of the categories, but we can't easily remove a whole variable.
I think this is something we should be able to do.

One way would be to have feature selection methods that are aware of categorical variables - I guess SelectFromModel(RandomForestClassifier()) would do that after we add categorical variable support.

We should have some more simple test-based methods, though.
Maybe f_regression and f_classif could be extended so that they can take categorical features into account?

Alternatively we could try to remove groups of features that correspond to the same original variable. That seems theoretically possible if we add feature name support, but using feature names for this would be putting semantics in the names which I don't entirely like. Doing it "properly" would require adding meta-data about the features, similarly to sample_props, only on the other axis. That seems pretty heavy-handed though.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions