[RFC] Should scalers or other estimators warn when fit on constant features? #19547

Open · ogrisel opened this issue Feb 24, 2021 · 4 comments

ogrisel (Member) commented Feb 24, 2021

As discussed in #19527, fitting models on data with constant features can be surprising.

For instance, a StandardScaler(with_mean=False) fit on a column whose values are all equal to 1000. will let those values pass through unchanged, because the variance of the column is zero. This can be surprising, but is it a problem? Should we warn the user about the presence of such constant features, which are typically not predictive for machine learning models?
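
A minimal reproduction of this behavior (a sketch assuming the current handling of zero variances, which are replaced by a scale of 1 to avoid dividing by zero):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.full(shape=(5, 1), fill_value=1000.0)  # a single constant feature
scaler = StandardScaler(with_mean=False).fit(X)
print(scaler.var_)    # [0.]: the column has zero variance
print(scaler.scale_)  # [1.]: zero variances are mapped to a scale of 1
print(scaler.transform(X).ravel())  # [1000. 1000. 1000. 1000. 1000.]: unchanged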

Which estimators should warn about such constant features? The scalers can naturally detect them when computing the scale_ attribute. The QuantileTransformer could probably also warn about this degenerate case.
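
To illustrate, a hypothetical sketch of the kind of check a scaler could run at the end of fit (warn_on_constant_features is an illustrative helper, not scikit-learn API):

import warnings
import numpy as np

def warn_on_constant_features(var_):
    """Illustrative helper: warn about zero-variance (constant) features."""
    constant_idx = np.flatnonzero(var_ == 0)
    if constant_idx.size > 0:
        warnings.warn(
            f"Features {constant_idx.tolist()} are constant and are typically "
            "not predictive; consider removing them.",
            UserWarning,
        )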

HistGradientBoosting* and KBinsDiscretizer can also do it efficiently when binning the feature values.

If we do so:

  • what should be the warning message? Should it be the same for all the models?
  • shall we add a standard constructor param to these estimators, constant_feature={'warn', 'drop', 'passthrough', 'zero', 'one'}, with "warn" as the default? (see the sketch after this list)
  • should we generalize this to all estimators? (ogrisel: probably not, because it would be an expensive and redundant input validation check, so we could restrict it to the estimators above, where the check is cheap)
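
To make the constant_feature proposal concrete, here is a hypothetical sketch of what the different options could mean for a single constant column (none of this is existing scikit-learn API):

import warnings
import numpy as np

def handle_constant_column(column, constant_feature="warn"):
    """Illustrative semantics for the proposed constant_feature parameter."""
    if constant_feature == "warn":
        warnings.warn("Constant feature detected.", UserWarning)
        return column                  # warn, then pass through unchanged
    if constant_feature == "passthrough":
        return column                  # leave the values untouched, silently
    if constant_feature == "zero":
        return np.zeros_like(column)   # map the constant column to all zeros
    if constant_feature == "one":
        return np.ones_like(column)    # map the constant column to all ones
    if constant_feature == "drop":
        return None                    # signal that the column should be removed
    raise ValueError(f"Unknown option: {constant_feature!r}")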

Are there legitimate cases where such a warning would be frequent and annoying? For instance, StandardScaler(with_mean=False) after one-hot encoding with dense output, in a cross-validation loop, on a categorical feature with one category significantly more frequent than the others? A similar problem could happen after ordinal encoding. But would StandardScaler(with_mean=False) actually make sense to use in those cases?
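
For illustration, here is one way such constant one-hot columns can arise in a cross-validation loop, sketched with toy folds (the explicit categories variant simulates a category that is known globally but absent from the fold):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A training fold in which the feature takes only its dominant value "a":
X_fold = np.array([["a"], ["a"], ["a"], ["a"]])
print(OneHotEncoder(sparse_output=False).fit_transform(X_fold).ravel())
# [1. 1. 1. 1.]: a single, constant all-ones column

# The same effect with explicit categories: "c" is known globally but absent
# from this fold, so its one-hot column is constant (all zeros):
enc = OneHotEncoder(categories=[["a", "b", "c"]], sparse_output=False)
print(enc.fit_transform(np.array([["a"], ["b"], ["a"]]))[:, 2])  # [0. 0. 0.]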

List of estimators to consider:

  • scalers (such as StandardScaler, RobustScaler, MinMaxScaler, ...),
  • estimators that do feature binning: HistGradientBoosting* and KBinsDiscretizer,
  • feature selectors such as SelectKBest.
@ogrisel ogrisel changed the title [RFC] Should scalers or other estimators warn when fit on constant features [RFC] Should scalers or other estimators warn when fit on constant features? Feb 24, 2021
azihna (Contributor) commented Feb 26, 2021

My first reaction is to say "yes, it is a problem and the library should warn about this!", but thinking about it a bit more, I am more inclined to say it is the user's responsibility to know the data and remove these features beforehand. I think this warning would fire very frequently in practice; it has a similar feel to the chained assignment warning from pandas.

However, if it were to be added, then rather than adding a parameter to each estimator, a general option (again, thinking of the chained assignment warning) might be better. There can be cases where the data you are training on now is constant but you know that later data will change that, and you might have ColumnTransformers with different scalers on these columns that would each trigger the warning and have to be silenced individually. I can already see myself looking desperately for the one scaler whose warning I forgot to turn off, while all the useful information gets lost among the warnings in a cross-validation loop.

Micky774 (Contributor) commented

I just discovered this while responding to #26357, and I think it would be a helpful addition to StandardScaler. It is a common enough pitfall that scikit-learn would benefit its users by including an easily-suppressed warning. I think the added keyword approach would be a good way to control the prevalence of the warning.

I think StandardScaler is most likely the least controversial estimator for something like this, and a good place to start.

betatim (Member) commented May 16, 2023

Sounds like a good proposal @Micky774. I think I'd prefer not adding it to too many estimators, because that would lead to a flood of warnings, which in turn leads to no one looking at them anymore. This means adding it sparingly, to estimators where we repeatedly see people step into this trap, is a good way of selecting where to add it.

ogrisel (Member, Author) commented Jan 9, 2025

Another related data point: SelectKBest currently warns about constant features indirectly: the f_oneway function it calls issues a UserWarning with the integer indices (but not the feature names...) of all the constant features. Moreover, it issues a RuntimeWarning about "invalid value encountered in divide" in the presence of constant features.

This can be very verbose, in particular when feature selection is performed in a pipeline that models interactions between categorical variables:

from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder
from sklearn.linear_model import RidgeCV

pipeline = make_pipeline(
    OneHotEncoder(sparse_output=True),
    PolynomialFeatures(interaction_only=True, include_bias=False),
    SelectKBest(k=500),
    RidgeCV(),
)

The output of PolynomialFeatures can be very high dimensional (in particular if the input has two or more high-cardinality categorical features), but it should be cheap to filter out the constant and otherwise non-informative generated features using SelectKBest.

However, this is very verbose, especially if you cross-validate or hyperparameter-tune such a pipeline.

EDIT: the current behavior of SelectKBest seems to be to warn without dropping the constant features ahead of time:

>>> import numpy as np
>>> from sklearn.feature_selection import SelectKBest
>>> X = np.ones(shape=(100, 10))
>>> y = np.random.choice(range(5), size=X.shape[0])
>>> feature_selector = SelectKBest(k=3).fit(X, y)
/Users/ogrisel/code/scikit-learn/sklearn/feature_selection/_univariate_selection.py:111: UserWarning: Features [0 1 2 3 4 5 6 7 8 9] are constant.
  warnings.warn("Features %s are constant." % constant_features_idx, UserWarning)
/Users/ogrisel/code/scikit-learn/sklearn/feature_selection/_univariate_selection.py:112: RuntimeWarning: invalid value encountered in divide
  f = msb / msw
>>> feature_selector.scores_
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
>>> feature_selector.pvalues_
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
>>> feature_selector.get_feature_names_out()
array(['x7', 'x8', 'x9'], dtype=object)

The last k (constant) features are arbitrarily selected.

Dropping would also be an option, but it could be dangerous in the sense that it could produce an empty output feature set when all input features are constant. So maybe the current behavior is the less surprising default.
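
For reference, an existing way to get the dropping behavior explicitly today is to chain a VarianceThreshold step before SelectKBest (a sketch that assumes at least k non-constant features remain after thresholding):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import VarianceThreshold, SelectKBest

X = np.hstack([np.ones((100, 5)), np.random.rand(100, 5)])  # 5 constant + 5 varying features
y = np.random.choice(range(5), size=X.shape[0])

# VarianceThreshold (default threshold of 0.0) removes the constant columns
# before SelectKBest scores the remainder, avoiding the warnings above.
selector = make_pipeline(VarianceThreshold(), SelectKBest(k=3)).fit(X, y)
print(selector.get_feature_names_out())  # a subset of ['x5' ... 'x9']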
