`cross_val_score` crashes with `StackingRegressor` · Issue #24430 · scikit-learn/scikit-learn · GitHub

Closed
GuillemGSubies opened this issue Sep 13, 2022 · 3 comments

@GuillemGSubies
Contributor

Describe the bug

I'm trying to build a simple stacking model and get its cross-validation score, but an error is raised:

NotFittedError: This RandomForestRegressor instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

Steps/Code to Reproduce

from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
X, y = make_regression()
rf = RandomForestRegressor(n_jobs=-1, random_state=42)
rf.fit(X, y)
stack = StackingRegressor([("rf", rf)], cv="prefit")
cross_val_score(estimator=stack, X=X, y=y, scoring="neg_mean_absolute_error", cv=5, n_jobs=-1, error_score="raise")

Expected Results

It should work fine if I am understanding everything right.

Actual Results

NotFittedError: This RandomForestRegressor instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

Full traceback
---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
File "/home/guillem.garcia/miniconda3/envs/skbug/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py", line 436, in _process_worker
  r = call_item()
File "/home/guillem.garcia/miniconda3/envs/skbug/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py", line 288, in __call__
  return self.fn(*self.args, **self.kwargs)
File "/home/guillem.garcia/miniconda3/envs/skbug/lib/python3.10/site-packages/joblib/_parallel_backends.py", line 595, in __call__
  return self.func(*args, **kwargs)
File "/home/guillem.garcia/miniconda3/envs/skbug/lib/python3.10/site-packages/joblib/parallel.py", line 262, in __call__
  return [func(*args, **kwargs)
File "/home/guillem.garcia/miniconda3/envs/skbug/lib/python3.10/site-packages/joblib/parallel.py", line 262, in <listcomp>
  return [func(*args, **kwargs)
File "/home/guillem.garcia/miniconda3/envs/skbug/lib/python3.10/site-packages/sklearn/utils/fixes.py", line 117, in __call__
  return self.function(*args, **kwargs)
File "/home/guillem.garcia/miniconda3/envs/skbug/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
  estimator.fit(X_train, y_train, **fit_params)
File "/home/guillem.garcia/miniconda3/envs/skbug/lib/python3.10/site-packages/sklearn/ensemble/_stacking.py", line 879, in fit
  return super().fit(X, y, sample_weight)
File "/home/guillem.garcia/miniconda3/envs/skbug/lib/python3.10/site-packages/sklearn/ensemble/_stacking.py", line 189, in fit
  check_is_fitted(estimator)
File "/home/guillem.garcia/miniconda3/envs/skbug/lib/python3.10/site-packages/sklearn/utils/validation.py", line 1352, in check_is_fitted
  raise NotFittedError(msg % {"name": type(estimator).__name__})
sklearn.exceptions.NotFittedError: This RandomForestRegressor instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
"""

The above exception was the direct cause of the following exception:

NotFittedError                            Traceback (most recent call last)
Cell In [3], line 8
    6 rf.fit(X, y)
    7 stack = StackingRegressor([("rf", rf)], cv="prefit")
----> 8 cross_val_score(estimator=stack, X=X, y=y, scoring="neg_mean_absolute_error", cv=5, n_jobs=-1, error_score="raise")

File ~/miniconda3/envs/skbug/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:515, in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
  512 # To ensure multimetric format is not supported
  513 scorer = check_scoring(estimator, scoring=scoring)
--> 515 cv_results = cross_validate(
  516     estimator=estimator,
  517     X=X,
  518     y=y,
  519     groups=groups,
  520     scoring={"score": scorer},
  521     cv=cv,
  522     n_jobs=n_jobs,
  523     verbose=verbose,
  524     fit_params=fit_params,
  525     pre_dispatch=pre_dispatch,
  526     error_score=error_score,
  527 )
  528 return cv_results["test_score"]

File ~/miniconda3/envs/skbug/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:266, in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
  263 # We clone the estimator to make sure that all the folds are
  264 # independent, and that it is pickle-able.
  265 parallel = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch)
--> 266 results = parallel(
  267     delayed(_fit_and_score)(
  268         clone(estimator),
  269         X,
  270         y,
  271         scorers,
  272         train,
  273         test,
  274         verbose,
  275         None,
  276         fit_params,
  277         return_train_score=return_train_score,
  278         return_times=True,
  279         return_estimator=return_estimator,
  280         error_score=error_score,
  281     )
  282     for train, test in cv.split(X, y, groups)
  283 )
  285 _warn_or_raise_about_fit_failures(results, error_score)
  287 # For callabe scoring, the return type is only know after calling. If the
  288 # return type is a dictionary, the error scores can now be inserted with
  289 # the correct key.

File ~/miniconda3/envs/skbug/lib/python3.10/site-packages/joblib/parallel.py:1056, in Parallel.__call__(self, iterable)
 1053     self._iterating = False
 1055 with self._backend.retrieval_context():
-> 1056     self.retrieve()
 1057 # Make sure that we get a last message telling us we are done
 1058 elapsed_time = time.time() - self._start_time

File ~/miniconda3/envs/skbug/lib/python3.10/site-packages/joblib/parallel.py:935, in Parallel.retrieve(self)
  933 try:
  934     if getattr(self._backend, 'supports_timeout', False):
--> 935         self._output.extend(job.get(timeout=self.timeout))
  936     else:
  937         self._output.extend(job.get())

File ~/miniconda3/envs/skbug/lib/python3.10/site-packages/joblib/_parallel_backends.py:542, in LokyBackend.wrap_future_result(future, timeout)
  539 """Wrapper for Future.result to implement the same behaviour as
  540 AsyncResults.get from multiprocessing."""
  541 try:
--> 542     return future.result(timeout=timeout)
  543 except CfTimeoutError as e:
  544     raise TimeoutError from e

File ~/miniconda3/envs/skbug/lib/python3.10/concurrent/futures/_base.py:446, in Future.result(self, timeout)
  444     raise CancelledError()
  445 elif self._state == FINISHED:
--> 446     return self.__get_result()
  447 else:
  448     raise TimeoutError()

File ~/miniconda3/envs/skbug/lib/python3.10/concurrent/futures/_base.py:391, in Future.__get_result(self)
  389 if self._exception:
  390     try:
--> 391         raise self._exception
  392     finally:
  393         # Break a reference cycle with the exception in self._exception
  394         self = None

NotFittedError: This RandomForestRegressor instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

Versions

I installed the latest nightly


System:
    python: 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
executable: /home/guillem.garcia/miniconda3/envs/skbug/bin/python
   machine: Linux-5.15.0-47-generic-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.2.dev0
          pip: 22.1.2
   setuptools: 63.4.1
        numpy: 1.24.0.dev0+703.gb2fbc4349
        scipy: 1.10.0.dev0
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/guillem.garcia/miniconda3/envs/skbug/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 8

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/guillem.garcia/miniconda3/envs/skbug/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so
        version: 0.3.20
threading_layer: pthreads
   architecture: Haswell
    num_threads: 8

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/guillem.garcia/miniconda3/envs/skbug/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
        version: 0.3.18
threading_layer: pthreads
   architecture: Haswell
    num_threads: 8

GuillemGSubies added the Bug and Needs Triage labels Sep 13, 2022

@thomasjpfan
Member

Thank you for opening this issue! Overall, this is a limitation of scikit-learn's API. cross_val_score clones the estimator for each fold when performing cross-validation. This means the pre-fitted estimator in StackingRegressor is "reset" and loses its fitted attributes.
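The reset described above can be reproduced in isolation. The following is a minimal sketch, not taken from the thread: it uses LinearRegression and a small synthetic dataset (both chosen here only to keep the example fast) to show that clone copies the constructor parameters but not the learned attributes.

```python
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LinearRegression
from sklearn.utils.validation import check_is_fitted

# Small synthetic problem; the sizes are arbitrary illustration values.
X, y = make_regression(n_samples=50, n_features=5, random_state=0)

fitted = LinearRegression().fit(X, y)
check_is_fitted(fitted)  # passes: coef_ and intercept_ exist

# clone() builds a new, unfitted estimator with the same parameters.
cloned = clone(fitted)
try:
    check_is_fitted(cloned)
    clone_is_fitted = True
except NotFittedError:
    clone_is_fitted = False

print("original has coef_:", hasattr(fitted, "coef_"))
print("clone is fitted:", clone_is_fitted)
```

This is the same check that StackingRegressor.fit runs on each base estimator when cv="prefit", which is why the cloned stack fails inside cross_val_score.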

scikit-learn/enhancement_proposals#67 is the enhancement proposal that can fix this. There are two ways the enhancement proposal can help:

  1. In the short term, the proposal would allow estimators such as StackingRegressor to override what clone means and stop the pre-fitted estimator from being cloned when cv="prefit".
  2. In the long term, we can introduce a freezing API to "freeze" estimators and no longer need a prefit parameter.

thomasjpfan added the Enhancement label and removed the Needs Triage label Sep 13, 2022
@GuillemGSubies
Contributor Author

Thanks for the fast response! What you say makes sense, indeed.

Would changing that clone for a deepcopy work as a workaround?

@glemaitre
Member

Closing since this is a duplicate of #24409

> Would changing that clone for a deepcopy work as a workaround?

It would be super messy, but it might work. I don't know whether you could run into trouble, since you would be fitting estimators with some unexpected internal state.
