Description
Currently feature selection on categorical variables is hard.
Using one-hot encoding we can select some of the categories, but we can't easily remove a whole variable.
I think this is something we should be able to do.
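To make the pain point concrete, here is a minimal sketch (with a made-up `color` variable) of how one-hot encoding plus univariate selection today keeps or drops individual category columns, rather than making a decision about the whole variable:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
# Hypothetical data: one categorical variable "color" with three levels.
color = rng.choice(["red", "green", "blue"], size=200).reshape(-1, 1)
# Target mostly depends on whether color == "red" (with some label noise).
y = (color.ravel() == "red").astype(int)
flip = rng.random(200) < 0.1
y = np.where(flip, 1 - y, y)

# One-hot encoding turns the single variable into three dummy columns.
X = OneHotEncoder().fit_transform(color).toarray()

# Univariate selection then scores each dummy column independently...
selector = SelectKBest(f_classif, k=1).fit(X, y)
print(selector.get_support())  # only one dummy column kept, not the whole variable
```

The selector can only keep or drop individual dummies; there is no built-in way to say "drop `color` entirely".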
One way would be to have feature selection methods that are aware of categorical variables; I guess SelectFromModel(RandomForestClassifier())
would do that once we add categorical variable support.
We should also have some simpler test-based methods, though.
Maybe f_regression and f_classif could be extended so that they take categorical features into account?
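For the regression case, one natural extension would be a one-way ANOVA F-test that treats the categorical variable as the grouping factor. A sketch, assuming a hypothetical helper name (this is not existing sklearn API):

```python
import numpy as np
from scipy.stats import f_oneway

def f_regression_categorical(x_cat, y):
    """One-way ANOVA F-test of a single categorical feature against a
    continuous target -- a sketch of what an extended f_regression might
    compute per original variable (hypothetical helper, not sklearn API)."""
    groups = [y[x_cat == level] for level in np.unique(x_cat)]
    return f_oneway(*groups)  # one F statistic and p-value for the whole variable

rng = np.random.default_rng(0)
x = rng.choice(["a", "b", "c"], size=300)
y = rng.normal(size=300) + (x == "a")  # level "a" shifts the target mean
F, p = f_regression_categorical(x, y)
print(F, p)
```

The point is that the test yields a single score per original variable, so selection naturally operates at the variable level.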
Alternatively, we could try to remove groups of features that correspond to the same original variable. That seems theoretically possible if we add feature name support, but using feature names for this would put semantics into the names, which I don't entirely like. Doing it "properly" would require adding metadata about the features, similar to sample_props
, only on the other axis. That seems pretty heavy-handed, though.