CV integration for OOB-scoring · Issue #23382 · scikit-learn/scikit-learn · GitHub
CV integration for OOB-scoring #23382
Open
@multimeric

Describe the workflow you want to enable

Out-of-bag (OOB) scoring provides an estimate of how well a RandomForest generalizes without refitting the model several times, as k-fold cross-validation (CV) demands. Although sklearn provides a mechanism to obtain this estimate, it does not provide a way to integrate it into the existing cross-validation workflows. For example, we might have a GridSearchCV where we want to optimise hyperparameters for the forest while fitting only once per parameter set. In theory, this could be implemented using the OOB error.

As far as I can tell, the two parameters of interest here are cv and scoring, which are inputs to all the CV-related classes and ultimately to cross_val_score(). scoring can be implemented easily enough with a custom scorer, since a scorer has access to the fitted estimator and therefore to the OOB error. The problematic part is the cv argument, which requires that we split the dataset and offers no alternative.
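To make the scoring half concrete, here is a minimal sketch of such a custom scorer. The name `oob_scorer` is hypothetical and not part of sklearn. It assumes the forest was constructed with `oob_score=True`, in which case the fitted estimator exposes `oob_score_` (a score where higher is better, accuracy for classifiers, rather than an error).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def oob_scorer(estimator, X, y=None):
    """Hypothetical scorer with the usual (estimator, X, y) signature.

    X and y are ignored: the value comes from the OOB estimate that the
    forest stored during fit(), which requires oob_score=True.
    """
    return estimator.oob_score_


# Toy data, only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0)
forest.fit(X, y)

# A single fit is enough to obtain the generalization estimate.
oob_estimate = oob_scorer(forest, X, y)
print(f"OOB score: {oob_estimate:.3f}")
```

Because the scorer ignores the data it is handed, whatever test split the cv argument produces has no influence on the returned value.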

Describe your proposed solution

  • We add sklearn.metrics.oob, a scoring function that simply returns the OOB error of the trained classifier.
  • We add sklearn.model_selection.IntegratedCV, a cross-validator that does not split the data at all, i.e. IntegratedCV().split(X) will return X unchanged.

With the combination of these two entities, users will be able to perform OOB-based cross-validation.
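A rough sketch of how the two pieces might fit together with today's API is shown below. Neither `IntegratedCV` nor `oob_scorer` exists in sklearn; both are stand-ins for the proposed additions. One assumption differs from the description above: sklearn splitters yield index arrays rather than data, so this sketch yields every row index as both the train and the test set, and the test set is only a placeholder that the scorer ignores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


class IntegratedCV:
    """Sketch of the proposed splitter: a single 'fold' that trains on every row."""

    def split(self, X, y=None, groups=None):
        indices = np.arange(len(X))
        # The test indices are a placeholder; the OOB scorer never uses them.
        yield indices, indices

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1


def oob_scorer(estimator, X, y=None):
    """Hypothetical scorer: return the OOB score stored on the fitted forest."""
    return estimator.oob_score_


# Toy data, only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0),
    param_grid={"max_depth": [2, 4, None]},
    scoring=oob_scorer,
    cv=IntegratedCV(),
    refit=False,  # skip the extra final fit; we only want the ranking
)
search.fit(X, y)

best_params = search.best_params_
oob_scores = search.cv_results_["mean_test_score"]
print(best_params, oob_scores)
```

With this setup each candidate in the grid is fitted once, so the three `max_depth` values cost three fits in total instead of three times k.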

Describe alternatives you've considered, if relevant

It is possible to apply general cross-validation schemes, such as k-fold, to a RandomForest. This alternative already exists in sklearn today. However, it forgoes the significant (roughly k times) speedup that could be obtained by using the OOB error.

Additional context

This question is notably discussed in these threads:
