RFC / API add option to fit/predict without input validation #21804
Comments
At a user level, I would think that it could be more friendly to use the
I don't like global options, at least they should not be the only way to accomplish that.
You might have more experience in putting models into production. I was under the impression that iterating on a subset of data locally with safeguards and then having a global option during deployment would be something super handy. Is that a misconception on my side? On the other hand, even if the config could be set globally, you can still use it as a context manager locally, for specific processing:

clf = RandomForestClassifier().fit(X, y)
with context_config(validation=False):
    clf.predict(X)

I personally think that adding a new parameter will add some cognitive overhead for our users, and this overhead should not be disregarded.
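For reference, a minimal runnable sketch of the closest mechanism that exists today: sklearn.set_config and sklearn.config_context only cover the finiteness check (assume_finite), not full input validation, so the context_config(validation=False) call above remains hypothetical.

```python
# Minimal sketch of the existing global-config mechanism that the hypothetical
# context_config(validation=False) above is modeled on. Today set_config /
# config_context only let you skip the NaN/inf check via assume_finite,
# not the full input validation.
import numpy as np
from sklearn import config_context, set_config
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X, y = rng.rand(200, 5), rng.randint(0, 2, size=200)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

set_config(assume_finite=True)         # global switch, applies everywhere
with config_context(assume_finite=True):
    clf.predict(X)                     # scoped switch, only inside the block
```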
Mmh 🤔 @glemaitre You have a good point, indeed. The context manager helps. My intuition is still to have it as a configurable option in estimators.
I think we can have a parameter and a global config option. By default,
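As a purely hypothetical sketch of that combination (the validate keyword, the "skip_validation" config key, and the reliance on the private _validate_data helper are all assumptions, not existing scikit-learn API), the per-call argument could simply default to the global setting:

```python
# Hypothetical sketch only: a per-call parameter that falls back to a global
# config option. Neither validate= nor a "skip_validation" config key exists
# in scikit-learn; the names are illustrative.
from sklearn import get_config
from sklearn.base import BaseEstimator


class MyEstimator(BaseEstimator):
    def fit(self, X, y, validate=None):
        if validate is None:
            # Per-call argument wins; otherwise use the (assumed) global key.
            validate = not get_config().get("skip_validation", False)
        if validate:
            X, y = self._validate_data(X, y)  # private helper, shown for illustration
        # ... actual fitting logic ...
        return self
```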
If we go with the fit/predict method parameter approach, then for composite components (i.e. Pipeline, FeatureUnion, VotingClassifier, OneVsRestClassifier, ...) I would expect the parameter to be propagated to the nested estimators as well. Perhaps someone can come up with a use-case for not wanting this behavior?
This is more complicated: to avoid expensive double validation, meta-estimators can follow one of two strategies:
a) delegate the input validation to the base estimators and skip it at the meta-estimator level, or
b) validate the input once at the meta-estimator level and skip validation in the base estimators.
a) is useful because sometimes the base estimators can do smart, estimator-specific input validation (for instance, imagine a pipeline delegating the validation of text inputs to the text vectorizers, or dataframe handling to a ColumnTransformer). Other times b) is more efficient, because the input validation is done only once at the meta-estimator level instead of each time a clone of the base estimator is called (e.g. at the forest level instead of the individual tree level). Other times b) is necessary, because the meta-estimator itself does some non-trivial data manipulation (e.g. fancy indexing by class labels for one-vs-rest multiclass handling).
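A hypothetical sketch of strategy b), assuming the base estimator exposes a check_input switch in fit (today only a handful of estimators such as Lasso do):

```python
# Hypothetical sketch of strategy b): validate once at the meta-estimator
# level, then ask the base estimators to skip re-validation. The check_input
# keyword is an assumption for estimators in general; Lasso.fit does accept it.
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import Lasso
from sklearn.utils.validation import check_X_y


class NaiveBagging(BaseEstimator):
    def __init__(self, base_estimator, n_estimators=10):
        self.base_estimator = base_estimator
        self.n_estimators = n_estimators

    def fit(self, X, y):
        X, y = check_X_y(X, y)             # validate once, at the meta level
        rng = np.random.RandomState(0)
        self.estimators_ = []
        for _ in range(self.n_estimators):
            idx = rng.randint(0, X.shape[0], size=X.shape[0])
            est = clone(self.base_estimator)
            # With validation skipped, the caller is responsible for handing
            # over data in the exact dtype/layout the solver expects
            # (float64, Fortran-ordered in Lasso's case).
            est.fit(np.asfortranarray(X[idx]), y[idx], check_input=False)
            self.estimators_.append(est)
        return self


rng = np.random.RandomState(0)
NaiveBagging(Lasso(alpha=0.1)).fit(rng.rand(200, 4), rng.rand(200))
```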
There is also an impact on feature names: if b) is adopted and a meta-estimator is fit on a dataframe, then the feature names are extracted by the meta-estimator and a numpy array without feature names is passed to the base estimators. But sometimes the feature names could be useful to the base estimator (e.g. to specify which features should be treated as categorical variables for a HistGBRT model).
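A small runnable illustration of that feature-name loss: feature_names_in_ is only recorded when an estimator is fit directly on a DataFrame.

```python
# Once a meta-estimator converts the DataFrame to a plain ndarray, the base
# estimator no longer sees the column names: feature_names_in_ is only set
# when fitting directly on a DataFrame.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({"age": [25.0, 32.0, 47.0, 51.0], "income": [1.2, 3.4, 2.1, 5.6]})
y = np.array([0, 1, 0, 1])

print(LogisticRegression().fit(df, y).feature_names_in_)                          # ['age' 'income']
print(hasattr(LogisticRegression().fit(df.to_numpy(), y), "feature_names_in_"))   # False
```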
Any updates?
Goal
Ideally, this issue helps to sort out an API.
Describe the workflow you want to enable
I'd like to be able to switch on/off:
- validation in fit
- validation in fit and predict
- validation in fit and predict (e.g. check_is_fitted)

Something like the sketch below.
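A hypothetical sketch of what such a uniform switch could look like; check_input as a general fit/predict parameter does not exist in scikit-learn today, so the toy estimator below only illustrates the intended call signature:

```python
# Hypothetical API sketch: check_input as a uniform fit/predict switch is the
# proposal, not existing scikit-learn behavior.
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_array, check_is_fitted, check_X_y


class MeanRegressor(RegressorMixin, BaseEstimator):
    def fit(self, X, y, check_input=True):
        if check_input:
            X, y = check_X_y(X, y)
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X, check_input=True):
        if check_input:
            check_is_fitted(self)
            X = check_array(X)
        return np.full(len(X), self.mean_)


rng = np.random.RandomState(0)
X, y = rng.rand(50, 3), rng.rand(50)
reg = MeanRegressor().fit(X, y, check_input=False)   # skip validation in fit
reg.predict(X, check_input=False)                    # skip validation (incl. check_is_fitted)
```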
Note that some estimators like Lasso already support check_input in fit.

The main reason to do so is improved performance.
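For concreteness, the existing switch mentioned above (this part is real API): Lasso.fit accepts check_input, and when it is set to False the caller must already provide data in the dtype/layout the solver expects.

```python
# Existing example of such a switch: Lasso.fit(..., check_input=False).
# With validation skipped, the data must already be float64 and
# Fortran-ordered, which is what the coordinate-descent solver expects.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = np.asfortranarray(rng.rand(1000, 20))
y = rng.rand(1000)

Lasso(alpha=0.01).fit(X, y)                     # default: input is validated
Lasso(alpha=0.01).fit(X, y, check_input=False)  # validation in fit is skipped
```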
Additional context
Related issues and PRs are #16653, #20657, #21578 (in particular #21578 (comment))
Related discussions: #21810