Feature selection for categorical variables · Issue #8480 · scikit-learn/scikit-learn · GitHub

Feature selection for categorical variables #8480

Open · amueller opened this issue Mar 1, 2017 · 16 comments

Comments

@amueller (Member) commented Mar 1, 2017

Currently feature selection on categorical variables is hard.
Using one-hot encoding we can select some of the categories, but we can't easily remove a whole variable.
I think this is something we should be able to do.

One way would be to have feature selection methods that are aware of categorical variables - I guess SelectFromModel(RandomForestClassifier()) would do that after we add categorical variable support.

We should have some simpler test-based methods, though.
Maybe f_regression and f_classif could be extended so that they can take categorical features into account?

Alternatively we could try to remove groups of features that correspond to the same original variable. That seems theoretically possible if we add feature name support, but using feature names for this would be putting semantics in the names which I don't entirely like. Doing it "properly" would require adding meta-data about the features, similarly to sample_props, only on the other axis. That seems pretty heavy-handed though.
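
As an illustration of the group-aware idea, here is a minimal sketch (not an existing scikit-learn API; the toy data and the 0.2 threshold are made up): fit a forest on the one-hot-encoded data, sum the importances of the columns that came from each original variable, and drop whole variables whose total falls below the threshold.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({
    "city": ["london", "paris", "paris", "london", "tokyo", "tokyo"],
    "fruit": ["apple", "apple", "orange", "orange", "apple", "orange"],
})
y = [0, 0, 1, 1, 0, 1]

enc = OneHotEncoder()
X_ohe = enc.fit_transform(X).toarray()
rf = RandomForestClassifier(random_state=0).fit(X_ohe, y)

# Map each one-hot column back to its source variable: categories_[i]
# lists the categories (and therefore the columns) produced by variable i.
group_sizes = [len(cats) for cats in enc.categories_]
groups = np.repeat(np.arange(len(group_sizes)), group_sizes)

# Sum importances per original variable and keep whole variables above a threshold.
importance_per_variable = np.bincount(groups, weights=rf.feature_importances_)
X_reduced = X.loc[:, importance_per_variable >= 0.2]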

@jnothman (Member) commented Mar 1, 2017 via email

@jnothman (Member) commented Mar 1, 2017 via email

@amueller (Member, Author) commented Mar 3, 2017

That's what I meant by the "heavy-handed" approach. I'm not sure how we would pass the grouping information around.

@amueller (Member, Author) commented

Hm, I guess using hierarchical column indices would solve the problem, if we could use data frames...
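
For what it's worth, pandas alone can already express this; here is a toy illustration of what hierarchical (MultiIndex) columns would buy us, assuming estimators accepted such frames:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"city": ["london", "paris", "tokyo"],
                  "fruit": ["apple", "orange", "apple"]})

enc = OneHotEncoder()
X_ohe = enc.fit_transform(X).toarray()

# Label the encoded columns with a (variable, category) MultiIndex.
columns = pd.MultiIndex.from_tuples(
    [(var, cat) for var, cats in zip(X.columns, enc.categories_) for cat in cats])
X_ohe = pd.DataFrame(X_ohe, columns=columns)

# Dropping a whole original variable is now a single, unambiguous operation.
X_without_city = X_ohe.drop(columns="city", level=0)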

@amueller (Member, Author) commented Feb 7, 2019

Thank you, past me, for opening this issue that I just wanted to open again.

I think with the column transformer we now have a way to do this using separate score functions. That should be pretty straightforward, right?
We couldn't easily do model-based selection with that (I think?), but having f_regression_categorical or something like that for a feature union should be good.

Interesting question though: what should the input look like? (which is related to the meta-data question above).
If we assume everything is one-hot-encoded we need to have metadata about groups.
My "easy" solution would be to pass the data via column transformer before doing one-hot-encoding. But that means that either the scoring function needs to somehow call OrdinalEncoder (or OneHotEncoder) internally, or we have to ask the user to call OrdinalEncoder first (which seems a bit weird, but also not that weird).
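
A rough sketch of that column-transformer idea, with the user calling OrdinalEncoder inside the categorical branch and a score function that is invariant to how the categories are coded (mutual_info_classif with discrete_features=True); the column lists and k values are just for illustration:

from functools import partial

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

# Numeric block: impute, then keep the best column by the usual ANOVA F-test.
numeric = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('select', SelectKBest(f_classif, k=1))])

# Categorical block: ordinal-encode, then score with a label-invariant criterion.
categorical = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ordinal', OrdinalEncoder()),
    ('select', SelectKBest(partial(mutual_info_classif, discrete_features=True), k=2))])

selector = ColumnTransformer(transformers=[
    ('num', numeric, ['age', 'fare']),
    ('cat', categorical, ['embarked', 'sex', 'pclass'])])

X_selected = selector.fit_transform(X, y)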

@jnothman (Member) commented Feb 7, 2019 via email

@caiohamamura commented Apr 8, 2019

Wouldn't f_classif expect categorical variables and a continuous response variable?

When I convert the categorical variable to integers manually, the integer values themselves are somehow used to compute the F-score. Isn't that undesirable behavior? Shouldn't the F-score be the same regardless of which labels are used to represent the categories?

That is the behavior I see when I run the same data through scipy.stats.f_oneway or statsmodels.formula.api.ols.
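
A small toy demonstration of the point (made-up data): f_classif treats the integer codes as numeric values, so recoding the same categories changes the F-score, whereas a one-way ANOVA with the categories as groups does not depend on the labels chosen.

import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.RandomState(0)
categories = rng.randint(0, 3, size=100)   # a 3-level categorical feature
y = rng.randint(0, 2, size=100)            # binary target

coding_a = categories.reshape(-1, 1)                        # codes 0, 1, 2
coding_b = np.array([5, 0, 2])[categories].reshape(-1, 1)   # same categories, other codes

# The two F-scores generally differ, even though the underlying variable is identical.
print(f_classif(coding_a, y)[0], f_classif(coding_b, y)[0])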

@jnothman (Member) commented Apr 9, 2019 via email

@caiohamamura commented Apr 9, 2019 via email

@jnothman (Member) commented Apr 9, 2019 via email

@thanasissdr commented

Maybe I'm misunderstanding something here, but from what I've read so far I agree with @caiohamamura. Please correct me if I'm wrong: f_classif seems to apply the f_oneway ANOVA test, and as far as I know the independent variable should be categorical and the response variable continuous.

@anhqngo commented Jun 29, 2020

Hi @amueller, if the input is nominal, can we just pass it through OrdinalEncoder and use SelectKBest(score_func=chi2)?

Otherwise, for the OneHot case, can we add an attribute to the OneHotEncoder class, mapping from the original variables to the encoded variables? I'm thinking of something like

OneHotEncoder.enc_feature_map = {fruit:[is_apple, is_orange], city:[is_london, is_paris]}

where the keys are original features and the value lists are the encoded features. I feel like doing this would make it easier to compute a score for one feature group :/
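
enc_feature_map does not exist in scikit-learn; the names above follow the proposal and are hypothetical. A rough sketch of how such a mapping could be built from a fitted OneHotEncoder:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"fruit": ["apple", "orange", "apple"],
                  "city": ["london", "paris", "paris"]})

enc = OneHotEncoder().fit(X)

# categories_ is ordered like the input columns, so it gives us the grouping.
enc_feature_map = {
    original: [f"is_{cat}" for cat in cats]
    for original, cats in zip(X.columns, enc.categories_)
}
# {'fruit': ['is_apple', 'is_orange'], 'city': ['is_london', 'is_paris']}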

@anhqngo commented Jun 29, 2020

Oh, also: @yashika51 and I are interested in working on this issue.

@anhqngo commented Jul 22, 2020

Right now, f_regression calculates the Pearson correlation coefficient. If our input/output is ordinal/continuous, would it make sense to implement an f_regression_ordinal that uses Spearman's rank correlation coefficient instead? If so, @yashika51 and I can implement it as the first step to tackling this issue. Ping @jnothman, @amueller, and @thomasjpfan.
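
This is not an existing scikit-learn function, but here is a sketch of what such an f_regression_ordinal score function could look like (the name and the |rho| scoring choice are assumptions), pluggable into SelectKBest:

import numpy as np
from scipy import stats

def f_regression_ordinal(X, y):
    """Score each feature by Spearman's rank correlation with y.

    Returns (scores, pvalues) so it can be used as SelectKBest(score_func=...).
    """
    X = np.asarray(X)
    scores, pvalues = [], []
    for j in range(X.shape[1]):
        rho, p = stats.spearmanr(X[:, j], y)
        scores.append(abs(rho))
        pvalues.append(p)
    return np.asarray(scores), np.asarray(pvalues)

# Usage: SelectKBest(score_func=f_regression_ordinal, k=5).fit_transform(X, y)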

@thomasjpfan (Member) commented

Consider a standard pipeline:

import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ordinal', OrdinalEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('select', SelectKBest(k=3)),  # default score_func is f_classif
                      ('classifier', LogisticRegression())])

clf.fit(X, y)

This already works with SelectKBest + OrdinalEncoder, and whole categorical features are removed. In this case, SelectKBest treats categorical data and continuous data equally.

Maybe f_regression and f_classif could be extended so that they can take categorical features into account?

If this is the issue here, then there isn't a clean way to do this in the Pipeline above.
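
To make the limitation concrete, the closest workaround seems to be a score function that hard-codes which preprocessor output columns are categorical. The indices below assume the column order produced by the ColumnTransformer above, which is exactly the kind of brittle coupling that makes this not clean (and the chi2 and F scores it mixes are not really comparable):

import numpy as np
from sklearn.feature_selection import chi2, f_classif

def mixed_score(X, y, categorical_idx=(2, 3, 4)):
    """f_classif for continuous columns, chi2 for the (ordinal-encoded) categorical ones."""
    X = np.asarray(X)
    cat = np.zeros(X.shape[1], dtype=bool)
    cat[list(categorical_idx)] = True

    scores = np.empty(X.shape[1])
    pvalues = np.empty(X.shape[1])
    scores[~cat], pvalues[~cat] = f_classif(X[:, ~cat], y)
    scores[cat], pvalues[cat] = chi2(X[:, cat], y)
    return scores, pvalues

# e.g. ('select', SelectKBest(score_func=mixed_score, k=3)) in the pipeline above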

@datascientist-nishant commented

Hi @amueller, I want to work on this feature request.
