RFC / API add option to fit/predict without input validation #21804


Open
lorentzenchr opened this issue Nov 27, 2021 · 9 comments

@lorentzenchr (Member) commented Nov 27, 2021

Goal

Ideally, this issue helps to sort out an API.

Describe the workflow you want to enable

I'd like to be able to switch on/off:

  1. Parameter validation in fit
  2. Input array validation in fit and predict
  3. All other validation steps in fit and predict (e.g. check_is_fitted)

Something like:

model = RandomForestRegressor(validate_params=False)
model.fit(X, y, validate_input=False)
model.predict(X, validate_input=False)

Note that some estimators like Lasso already support check_input in fit.
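
For reference, a minimal sketch of that existing escape hatch (with check_input=False, the caller is responsible for providing the Fortran-ordered float64 arrays that the coordinate-descent solver expects):

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.utils import check_array

X = np.random.rand(100, 5)
y = np.random.rand(100)

# Validate once ourselves, then skip the estimator's own checks.
X = check_array(X, dtype=np.float64, order="F")
Lasso(alpha=0.1).fit(X, y, check_input=False)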

The main reason to do so is improved performance.

Additional context

Related issues and PRs are #16653, #20657, #21578 (in particular #21578 (comment))
Related discussions: #21810

@glemaitre (Member)

At the user level, I would think it could be friendlier to use the set_config context manager to skip the validation altogether.
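
For context, the closest existing knob is assume_finite, which skips only the costly NaN/inf scan inside input checks; a full validation switch would presumably be exposed through the same mechanism:

import numpy as np
from sklearn import config_context
from sklearn.linear_model import Ridge

X = np.random.rand(1000, 10)
y = np.random.rand(1000)

# assume_finite=True disables the finiteness check but not the other
# validation steps; it illustrates the config-based approach suggested here.
with config_context(assume_finite=True):
    model = Ridge().fit(X, y)
    pred = model.predict(X)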

@lorentzenchr (Member, Author)

I don't like global options; at the very least, they should not be the only way to accomplish this.

@glemaitre (Member)

> I don't like global options; at the very least, they should not be the only way to accomplish this.

You might have more experience putting models into production. I was under the impression that iterating on a subset of data locally with safeguards, and then having a global option during deployment, would be something super handy. Is this a misconception on my side?

On the other hand, even if the config could be set globally, you can still use it as a context manager locally, for specific processing:

clf = RandomForestClassifier().fit(X, y)
with context_config(validation=False):
    clf.predict(X)

I personally think that adding a new parameter will add some cognitive overhead for our users, and this overhead should not be disregarded.

@lorentzenchr (Member, Author)

Mmh 🤔 @glemaitre, you have a good point indeed. The context manager helps. My intuition is still to have it as a configurable option on estimators.
I would be very interested in the opinion and experience of others.

@thomasjpfan (Member)

I think we can have both a parameter and a global config option. By default, validate_input="use_global", which falls back to the global config option. This way a user can still override the setting at the method/function call.
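
A rough sketch of that sentinel pattern (the validate_input parameter and the "validate_input" config key are hypothetical; sklearn.get_config is real):

from sklearn import get_config

def _resolve_validate_input(validate_input="use_global"):
    # A per-call argument wins; the sentinel falls back to the global config.
    if validate_input == "use_global":
        # Hypothetical "validate_input" key; assume validation is on by default.
        return get_config().get("validate_input", True)
    return validate_input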

@eangius commented Nov 29, 2021

If we go with the fit/predict method parameter approach, then for composite components (i.e. Pipeline, FeatureUnion, VotingClassifier, OneVsRestClassifier, ...) I would expect the validate_input parameter of the container to be passed down to its parts unless explicitly overridden by the parts.

Perhaps someone can come up with a use-case for not wanting this behavior?
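
A toy sketch of the propagation described above (the validate_input kwarg and this container are hypothetical):

class ToyPipeline:
    """Hypothetical container that forwards validate_input to its parts."""

    def __init__(self, steps, validate_input=True):
        # steps: list of (name, estimator, per_step_flag_or_None) triples
        self.steps = steps
        self.validate_input = validate_input

    def fit(self, X, y=None):
        for name, estimator, per_step_flag in self.steps:
            # An explicit per-step setting wins; otherwise inherit the container's.
            flag = self.validate_input if per_step_flag is None else per_step_flag
            estimator.fit(X, y, validate_input=flag)  # hypothetical kwarg
            if hasattr(estimator, "transform"):
                X = estimator.transform(X)
        return self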

@ogrisel (Member) commented Nov 30, 2021

> If we go with the fit/predict method parameter approach, then for composite components (i.e. Pipeline, FeatureUnion, VotingClassifier, OneVsRestClassifier, ...) I would expect the validate_input parameter of the container to be passed down to its parts unless explicitly overridden by the parts.
> Perhaps someone can come up with a use-case for not wanting this behavior?

This is more complicated: to avoid expensive double validation, meta-estimators can follow one of two strategies:

  • a) do not validate the data but delegate input validation to the underlying base estimator(s);
  • b) do the validation themselves and then pass the validated data to the underlying base estimator(s) and make them skip any additional validation.

a) is useful because sometimes the base estimators can do smart, estimator-specific input validation (for instance, imagine a pipeline delegating validation of text inputs to a text vectorizer, or dataframe handling to a ColumnTransformer).

Other times, b) is more efficient because the input validation is done only once at the meta-estimator level instead of each time a clone of the base estimator is fit (e.g. at the forest level instead of the individual tree level).

Other times b) is necessary because the meta-estimator itself does some non-trivial data manipulation (e.g. fancy indexing by class labels for one-vs-rest multiclass handling).
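
A sketch of strategy b) with hooks that exist today (check_array plus Lasso's check_input), where the hand-rolled bagging loop below stands in for a real meta-estimator:

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Lasso
from sklearn.utils import check_array

def fit_bagged_lassos(X, y, n_estimators=10, seed=0):
    # Validate once, at the meta-estimator level...
    X = check_array(X, dtype=np.float64, order="F")
    y = check_array(y, dtype=np.float64, ensure_2d=False)
    rng = np.random.RandomState(seed)
    fitted = []
    for _ in range(n_estimators):
        idx = rng.randint(0, X.shape[0], X.shape[0])  # bootstrap sample
        est = clone(Lasso(alpha=0.1))
        # ...then tell each clone to skip its own, now redundant, validation.
        # (Fancy indexing returns a C-ordered copy, so restore Fortran order.)
        est.fit(np.asfortranarray(X[idx]), y[idx], check_input=False)
        fitted.append(est)
    return fitted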

@ogrisel (Member) commented Nov 30, 2021

There is also an impact on feature names:

If b) is adopted and a meta-estimator is fit on a dataframe, then the feature names are extracted by the meta-estimator and a numpy array without feature names is passed to the base estimators. But sometimes the feature names are useful to the base estimator (e.g. to specify which features to treat as categorical in a HistGradientBoosting model).
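
A small illustration of that name-stripping effect (assumes a scikit-learn version where categorical_features accepts column names, i.e. >= 1.2):

import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

df = pd.DataFrame({"color": [0, 1, 2, 1, 0, 2],
                   "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Fit on the dataframe: names survive, so "color" can be resolved.
model = HistGradientBoostingRegressor(categorical_features=["color"])
model.fit(df, y)
print(model.feature_names_in_)  # ['color' 'size']

# If a meta-estimator following b) had converted df to a bare ndarray first,
# the names would be gone and the by-name categorical spec should fail.
model.fit(df.to_numpy(), y)  # expected to raise: no feature names available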

@AhmedThahir

Any updates?
