RFC / API add option to fit/predict without input validation #21804


Open
lorentzenchr opened this issue Nov 27, 2021 · 9 comments

@lorentzenchr (Member) commented Nov 27, 2021

Goal

Ideally, this issue helps to sort out an API.

Describe the workflow you want to enable

I'd like to be able to switch on/off:

  1. Parameter validation in fit
  2. Input array validation in fit and predict
  3. All other validation steps in fit and predict (e.g. check_is_fitted)

Something like:

model = RandomForestRegressor(validate_params=False)
model.fit(X, y, validate_input=False)
model.predict(X, validate_input=False)

Note that some estimators like Lasso already support check_input in fit.
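
For reference, a minimal sketch of that existing escape hatch (with check_input=False, the caller is responsible for providing the Fortran-ordered float64 arrays that the coordinate-descent solver expects):

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.utils import check_array

X = np.random.rand(100, 5)
y = np.random.rand(100)

# Validate once ourselves, then skip the estimator's own checks.
X = check_array(X, dtype=np.float64, order="F")
Lasso(alpha=0.1).fit(X, y, check_input=False)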

The main reason to do so is improved performance.

Additional context

Related issues and PRs are #16653, #20657, #21578 (in particular #21578 (comment))
Related discussions: #21810

@glemaitre (Member)

At the user level, I would think it could be friendlier to use the set_config context manager to skip the validation altogether.
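
For context, the closest existing knob is assume_finite, which skips only the costly NaN/inf scan inside input checks; a full validation switch would presumably be exposed through the same mechanism:

import numpy as np
from sklearn import config_context
from sklearn.linear_model import Ridge

X = np.random.rand(1000, 10)
y = np.random.rand(1000)

# assume_finite=True disables the finiteness check but not the other
# validation steps; it illustrates the config-based approach suggested here.
with config_context(assume_finite=True):
    model = Ridge().fit(X, y)
    pred = model.predict(X)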

@lorentzenchr (Member, Author)

I don't like global options; at the very least, they should not be the only way to accomplish this.

@glemaitre (Member)

> I don't like global options; at the very least, they should not be the only way to accomplish this.

You might have more experience putting models into production. I was under the impression that iterating on a subset of data locally with safeguards, and then having a global option during deployment, would be something super handy. Is this a misconception on my side?

On the other hand, even if the config could be set globally, you can still use it as a context manager locally, for specific processing:

clf = RandomForestClassifier().fit(X, y)
with context_config(validation=False):
    clf.predict(X)

I personally think that adding a new parameter will add some cognitive overhead for our users, and this overhead should not be disregarded.

@lorentzenchr (Member, Author)

Mmh 🤔 @glemaitre, you have a good point indeed. The context manager helps. My intuition is still to have it as a configurable option on estimators.
I would be very interested in the opinion and experience of others.

@thomasjpfan (Member)

I think we can have both a parameter and a global config option. By default, validate_input="use_global", which falls back to the global config option. This way a user can still override the setting at the method/function call.
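
A rough sketch of that sentinel pattern (the validate_input parameter and the "validate_input" config key are hypothetical; sklearn.get_config is real):

from sklearn import get_config

def _resolve_validate_input(validate_input="use_global"):
    # A per-call argument wins; the sentinel falls back to the global config.
    if validate_input == "use_global":
        # Hypothetical "validate_input" key; assume validation is on by default.
        return get_config().get("validate_input", True)
    return validate_input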

@eangius commented Nov 29, 2021

If we go with the fit/predict method parameter approach, then for composite components (i.e. Pipeline, FeatureUnion, VotingClassifier, OneVsRestClassifier, ...) I would expect the validate_input parameter of the container to be passed down to its parts unless explicitly overridden by the parts.

Perhaps someone can come up with a use-case for not wanting this behavior?
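
A toy sketch of the propagation described above (the validate_input kwarg and this container are hypothetical):

class ToyPipeline:
    """Hypothetical container that forwards validate_input to its parts."""

    def __init__(self, steps, validate_input=True):
        # steps: list of (name, estimator, per_step_flag_or_None) triples
        self.steps = steps
        self.validate_input = validate_input

    def fit(self, X, y=None):
        for name, estimator, per_step_flag in self.steps:
            # An explicit per-step setting wins; otherwise inherit the container's.
            flag = self.validate_input if per_step_flag is None else per_step_flag
            estimator.fit(X, y, validate_input=flag)  # hypothetical kwarg
            if hasattr(estimator, "transform"):
                X = estimator.transform(X)
        return self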

@ogrisel (Member) commented Nov 30, 2021

> If we go with the fit/predict method parameter approach, then for composite components (i.e. Pipeline, FeatureUnion, VotingClassifier, OneVsRestClassifier, ...) I would expect the validate_input parameter of the container to be passed down to its parts unless explicitly overridden by the parts.
> Perhaps someone can come up with a use-case for not wanting this behavior?

This is more complicated: to avoid expensive double validation, meta-estimators can follow one of two strategies:

  • a) do not validate the data but delegate input validation to the underlying base estimator(s);
  • b) do the validation themselves and then pass the validated data to the underlying base estimator(s) and make them skip any additional validation.

a) is useful because sometimes the base estimators can do smart, estimator-specific input validation (for instance, imagine a pipeline delegating validation of text inputs to a text vectorizer, or dataframe handling to a ColumnTransformer).

Other times, b) is more efficient because the input validation is done only once at the meta-estimator level instead of each time a clone of the base estimator is fit (e.g. at the forest level instead of the individual tree level).

Other times b) is necessary because the meta-estimator itself does some non-trivial data manipulation (e.g. fancy indexing by class labels for one-vs-rest multiclass handling).
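
A sketch of strategy b) with hooks that exist today (check_array plus Lasso's check_input), where the hand-rolled bagging loop below stands in for a real meta-estimator:

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Lasso
from sklearn.utils import check_array

def fit_bagged_lassos(X, y, n_estimators=10, seed=0):
    # Validate once, at the meta-estimator level...
    X = check_array(X, dtype=np.float64, order="F")
    y = check_array(y, dtype=np.float64, ensure_2d=False)
    rng = np.random.RandomState(seed)
    fitted = []
    for _ in range(n_estimators):
        idx = rng.randint(0, X.shape[0], X.shape[0])  # bootstrap sample
        est = clone(Lasso(alpha=0.1))
        # ...then tell each clone to skip its own, now redundant, validation.
        # (Fancy indexing returns a C-ordered copy, so restore Fortran order.)
        est.fit(np.asfortranarray(X[idx]), y[idx], check_input=False)
        fitted.append(est)
    return fitted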

@ogrisel (Member) commented Nov 30, 2021

There is also an impact on feature names:

If b) is adopted and a meta-estimator is fit on a dataframe, then the feature names are extracted by the meta-estimator and a numpy array without feature names is passed to the base estimators. But sometimes the feature names are useful to the base estimator (e.g. to specify which features to treat as categorical in a HistGradientBoosting model).
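
A small illustration of that name-stripping effect (assumes a scikit-learn version where categorical_features accepts column names, i.e. >= 1.2):

import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

df = pd.DataFrame({"color": [0, 1, 2, 1, 0, 2],
                   "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Fit on the dataframe: names survive, so "color" can be resolved.
model = HistGradientBoostingRegressor(categorical_features=["color"])
model.fit(df, y)
print(model.feature_names_in_)  # ['color' 'size']

# If a meta-estimator following b) had converted df to a bare ndarray first,
# the names would be gone and the by-name categorical spec should fail.
model.fit(df.to_numpy(), y)  # expected to raise: no feature names available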

@AhmedThahir

Any updates?
