Feature selection for categorical variables #8480
or we need a general way to score and select feature groups rather than
individual features
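For concreteness, here is a minimal sketch (not an existing scikit-learn class) of what such a group-level selector could look like; the `groups` mapping from original variables to encoded column indices is assumed to be supplied by the user or an encoder, and the mean is just one possible way to aggregate per-column scores:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import f_classif


class GroupSelectKBest(BaseEstimator, TransformerMixin):
    """Keep every column belonging to the k best-scoring feature groups."""

    def __init__(self, groups, k=2, score_func=f_classif):
        self.groups = groups          # e.g. {"pclass": [0, 1, 2], "age": [3]}
        self.k = k
        self.score_func = score_func

    def fit(self, X, y):
        scores, _ = self.score_func(X, y)
        # Aggregate per-column scores into one score per group (mean here).
        group_scores = {g: float(np.mean(scores[cols]))
                        for g, cols in self.groups.items()}
        best = sorted(group_scores, key=group_scores.get, reverse=True)[:self.k]
        self.support_ = sorted(c for g in best for c in self.groups[g])
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.support_]
```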
|
if the partition of features into groups can be stated statically, this can be independent of feature_names
|
That's what I meant by the "heavy-handed" approach. I'm not sure how we would pass the grouping information around. |
Hm, I guess using hierarchical column indices would solve the problem, if we could use data frames…
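A tiny illustration of that idea, assuming pandas MultiIndex columns were supported throughout: the dummy columns sit under their original variable, so the whole variable can be selected or dropped in one step (column names are made up for the example):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 0, 22.0], [0, 1, 38.0]],
    columns=pd.MultiIndex.from_tuples(
        [("sex", "male"), ("sex", "female"), ("age", "")]
    ),
)
print(df["sex"])                        # all one-hot columns of "sex"
print(df.drop("sex", axis=1, level=0))  # drop the whole variable at once
```
|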
Thank you, past me, for opening this issue that I just wanted to open again. I think with the ColumnTransformer we now have a way to do this using separate score functions. That should be pretty straightforward, right? Interesting question, though: what should the input look like? (Which is related to the meta-data question above.)
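One way to read that with today's API is a sketch like the following: each ColumnTransformer branch gets its own score function, e.g. f_classif for the numeric block and chi2 for the one-hot block (column names and k values are arbitrary here). Note this still scores one-hot columns individually rather than whole variables, and the selection budget is fixed per block:

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["age", "fare"]          # arbitrary example columns
categorical_features = ["embarked", "sex"]  # arbitrary example columns

preprocess_and_select = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_classif, k=1)),   # scores numeric columns
    ]), numeric_features),
    ("cat", Pipeline(steps=[
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ("select", SelectKBest(chi2, k=2)),        # scores one-hot columns
    ]), categorical_features),
])
# preprocess_and_select.fit_transform(X, y) selects within each block separately.
```
|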
I also think that using ColumnTransformer you cannot select the top k% of features regardless of whether they are categorical or numeric. Are you really aiming to have a composite score function?
|
Wouldn't f_classif expect a categorical X? When I convert the categorical variable to integers manually, the integer codes themselves are somehow used to compute the F-score. Isn't that undesirable behavior? Shouldn't the F-score be the same regardless of which labels are used to represent the categories? This is the behavior when I use the same data against … |
AFAIK, f_classif expects continuous X and categorical y, while f_regression assumes continuous X and continuous y.
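A quick illustration of that mismatch on made-up data: with an integer-coded nominal column, f_classif's ANOVA F-statistic depends on which integers happen to encode the categories:

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=200)              # binary target
cat = rng.choice(["a", "b", "c"], size=200)  # nominal feature

codes_1 = {"a": 0, "b": 1, "c": 2}
codes_2 = {"a": 2, "b": 0, "c": 1}           # same categories, different integers
X1 = np.array([codes_1[v] for v in cat]).reshape(-1, 1)
X2 = np.array([codes_2[v] for v in cat]).reshape(-1, 1)

print(f_classif(X1, y)[0], f_classif(X2, y)[0])  # the F values generally differ
```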
|
So there's no ANOVA categorical X and continuous Y?
|
> So there's no ANOVA categorical X and continuous Y?
Not in this library.
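For reference, a one-way ANOVA with a single categorical x and a continuous y can be computed outside scikit-learn with SciPy; a minimal sketch on made-up data, just to make the question concrete:

```python
import numpy as np
from scipy.stats import f_oneway

x = np.array(["a", "b", "a", "c", "b", "c", "a", "b"])  # categorical feature
y = np.array([1.0, 2.5, 0.8, 3.1, 2.9, 2.7, 1.2, 2.4])  # continuous target

groups = [y[x == category] for category in np.unique(x)]
F, p = f_oneway(*groups)
print(F, p)
```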
|
Maybe there's a misunderstanding on my part here, but from what I've read so far, I agree with @caiohamamura. Please correct me if I'm wrong. |
Hi @amueller, if the input is nominal, can we just pass the input through OrdinalEncoder and use …? Otherwise, for the one-hot case, can we add an attribute to the OneHotEncoder, a mapping where the keys are the original features and the values are lists of the encoded features? I feel like doing this would make it easier to compute a score for one feature group :/
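The grouping asked about above can already be reconstructed from the fitted encoder's categories_ attribute rather than a new one; a rough sketch that maps each original column to the indices of its one-hot columns (toy data, made-up column names):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["a", "x"], ["b", "y"], ["a", "z"]])  # two toy nominal columns
enc = OneHotEncoder().fit(X)

groups, start = {}, 0
for name, cats in zip(["col0", "col1"], enc.categories_):
    groups[name] = list(range(start, start + len(cats)))
    start += len(cats)

print(groups)  # {'col0': [0, 1], 'col1': [2, 3, 4]}
```
|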
oh also, @yashika51 and I are interested in working on this issue. |
right now, …

Consider a standard pipeline:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ordinal', OrdinalEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('select', SelectKBest(k=3)),
                      ('classifier', LogisticRegression())])

clf.fit(X, y)
```

This already works with …
If this is the issue here, then there isn't a clean way to do this in the Pipeline above. |
Hi @amueller, I want to work on this feature request. |
Currently feature selection on categorical variables is hard. Using one-hot encoding we can select some of the categories, but we can't easily remove a whole variable. I think this is something we should be able to do.

One way would be to have feature selection methods that are aware of categorical variables - I guess `SelectFromModel(RandomForestClassifier())` would do that after we add categorical variable support.

We should have some more simple test-based methods, though. Maybe `f_regression` and `f_classif` could be extended so that they can take categorical features into account?

Alternatively we could try to remove groups of features that correspond to the same original variable. That seems theoretically possible if we add feature name support, but using feature names for this would be putting semantics in the names, which I don't entirely like. Doing it "properly" would require adding meta-data about the features, similarly to `sample_props`, only on the other axis. That seems pretty heavy-handed though.
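As a sketch of the status quo the first paragraph describes: after one-hot encoding, SelectFromModel keeps or drops individual dummy columns, so a categorical variable can end up only partially removed (toy data and column names made up for the example):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"color": ["red", "blue", "green", "red"],
                  "size": ["S", "M", "L", "M"]})
y = [0, 1, 1, 0]

pipe = make_pipeline(
    OneHotEncoder(),
    SelectFromModel(RandomForestClassifier(random_state=0)),
)
pipe.fit(X, y)
# The support mask is per one-hot column, not per original variable:
print(pipe[-1].get_support())
```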