[RFC] Should scalers or other estimators warn when fit on constant features? #19547
Comments
My first reaction is to say "yes, it is a problem and the library should warn about this!", but thinking about it a bit more, I am more inclined to say that it is the user's responsibility to know the data and remove these features beforehand. I think this warning would be triggered very frequently in practice, and it has a similar feel to the chained assignment warning from pandas. However, if it were to be added, then rather than adding a parameter to each estimator, adding a general option for these warnings (again thinking of the chained assignment warning) might be better. There can be cases where the data you are training on now is constant in some feature, but you know the data you will get later will change that, and you might have ColumnTransformers with different scalers on those columns that would trigger the warning multiple times and would each have to be turned off individually. I can already see myself looking desperately for the one scaler on which I forgot to turn the warning off, while all the useful information gets lost among the warnings in a cross-validation loop.
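To illustrate the kind of global opt-out alluded to above (this is my own sketch, not something the comment proposes concretely; the warning message pattern is hypothetical, since no such warning exists in scikit-learn today), Python's standard `warnings` filters could silence such a warning in one place instead of per estimator:

```python
import warnings

# Hypothetical: assume scalers emitted a UserWarning whose message mentions
# "constant feature". A single global filter would silence it everywhere,
# instead of flipping a parameter on every scaler in every ColumnTransformer.
# The message pattern below is made up for illustration.
warnings.filterwarnings(
    "ignore",
    message=".*constant feature.*",
    category=UserWarning,
)
```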
I just discovered this while responding to #26357 and I think it would be a helpful addition.
Sounds like a good proposal @Micky774. I think I prefer not adding it to too many estimators, because that would lead to a flood of warnings, which leads to no one looking at them anymore. Adding it sparingly, to estimators where we have seen people step into this trap a few times, is a good way of selecting where to add it.
Another related data point: this can be very verbose, in particular when feature selection is performed in a pipeline that models interactions between categorical variables:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder
from sklearn.linear_model import RidgeCV

pipeline = make_pipeline(
    OneHotEncoder(sparse_output=True),
    PolynomialFeatures(interaction_only=True, include_bias=False),
    SelectKBest(k=500),
    RidgeCV(),
)
```

However, this is very verbose, especially if you cross-validate or hyperparameter-tune such a pipeline.

EDIT: the current behavior of `SelectKBest` when fit on constant features:

```python
>>> import numpy as np
>>> from sklearn.feature_selection import SelectKBest
>>> X = np.ones(shape=(100, 10))
>>> y = np.random.choice(range(5), size=X.shape[0])
>>> feature_selector = SelectKBest(k=3).fit(X, y)
/Users/ogrisel/code/scikit-learn/sklearn/feature_selection/_univariate_selection.py:111: UserWarning: Features [0 1 2 3 4 5 6 7 8 9] are constant.
warnings.warn("Features %s are constant." % constant_features_idx, UserWarning)
/Users/ogrisel/code/scikit-learn/sklearn/feature_selection/_univariate_selection.py:112: RuntimeWarning: invalid value encountered in divide
f = msb / msw
>>> feature_selector.scores_
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
>>> feature_selector.pvalues_
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
>>> feature_selector.get_feature_names_out()
array(['x7', 'x8', 'x9'], dtype=object)
```

The last call shows that 3 feature names are still returned even though all scores and p-values are `nan`. Dropping would also be an option, but it could be dangerous in the sense that it could generate an empty output feature set in the case where all input features are constant. So maybe the current behavior is a less surprising default.
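As a side note (my own addition, not part of the comment above): one way to drop constant features up front today is `VarianceThreshold`, which removes zero-variance columns by default. A minimal sketch, with an illustrative array of 3 constant and 2 non-constant columns:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
# 3 constant columns followed by 2 non-constant columns.
X = np.hstack([np.ones((100, 3)), rng.random((100, 2))])

selector = VarianceThreshold()  # threshold=0.0 by default: drop zero-variance features
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)         # (100, 2) -> only the non-constant columns remain
print(selector.get_support())  # [False False False  True  True]
```

This runs into the same caveat discussed above: if every input feature is constant, there is nothing left to keep (in that case `VarianceThreshold` refuses to produce an empty feature set and raises an error instead, if I remember its behavior correctly).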
As discussed in #19527, fitting models on data with constant features can be surprising.

For instance, a `StandardScaler(with_mean=False)` fit on a column with constant values set to `1000.` will let those values pass through unchanged because the variance of the column is zero (see the sketch below). It can be surprising, but is this a problem? Should we warn the user about the presence of such constant features, which are typically not predictive for machine learning models?

Which estimators should warn about such constant features? The scalers can naturally detect those when computing the `scale_` attribute. The `QuantileTransformer` could also probably warn about this degenerate case. `HistGradientBoosting*` and `KBinsDiscretizer` can also do it efficiently when binning the feature values.
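A minimal sketch of the behavior described above (my own illustration; the array shape is arbitrary). With zero variance, the fitted scale falls back to 1, so the constant values are returned unchanged:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single feature that is constant at 1000. for every sample.
X = np.full(shape=(5, 1), fill_value=1000.0)

scaler = StandardScaler(with_mean=False).fit(X)
print(scaler.var_)    # [0.]  -> the column has zero variance
print(scaler.scale_)  # [1.]  -> zero variance falls back to a scale of 1
print(scaler.transform(X).ravel())  # [1000. 1000. 1000. 1000. 1000.] -> unchanged
```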
If we do so, should we add a constructor parameter such as `constant_feature={'warn', 'drop', 'passthrough', 'zero', 'one'}` with `'warn'` as the default?

Are there legitimate cases where such a warning would be frequent and annoying? For instance, a `StandardScaler(with_mean=False)` after a `OneHotEncoder` with dense output, for a categorical feature with one category that is significantly more frequent than the others, in a cross-validation loop? A similar problem could happen after an `OrdinalEncoder`. But would `StandardScaler(with_mean=False)` actually make sense to use in those cases? (See the sketch after the list below.)

List of estimators to consider:
- `StandardScaler`, `RobustScaler`, `MinMaxScaler`, ...
- `HistGradientBoosting*` and `KBinsDiscretizer`
- `SelectKBest`
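As a concrete illustration of the cross-validation scenario mentioned above (my own sketch, not from the issue): a one-hot encoded column becomes constant in a training split when one of the declared categories happens not to occur in that split, for example when the category list is passed explicitly:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A training fold in which the rare category "c" happens not to appear.
X_train_fold = np.array([["a"], ["a"], ["b"], ["a"]])

encoder = OneHotEncoder(categories=[["a", "b", "c"]], sparse_output=False)
X_encoded = encoder.fit_transform(X_train_fold)
print(X_encoded)
# [[1. 0. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [1. 0. 0.]]
# The "c" column is all zeros, i.e. constant: a downstream
# StandardScaler(with_mean=False) would leave it unchanged and, under the
# proposal above, would warn about it on every fit in the cross-validation loop.
```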