[API] A public API for creating and using multiple scorers in the sklearn-ecosystem #28299


Closed
eddiebergman opened this issue Jan 28, 2024 · 7 comments · Fixed by #28360

Comments

@eddiebergman (Contributor) commented Jan 28, 2024

Describe the workflow you want to enable

I would like a stable, public interface for multiple scorers that libraries in the sklearn ecosystem can develop against.

Without this, it is difficult for libraries to provide a consistent API for evaluation with multiple scorers, unless they:

  1. Rely exclusively on cross_validate for evaluation, as it is the only place where user input for multiple metrics can be funneled directly through to sklearn for evaluation.
  2. Implement custom wrapper types.
  3. Refuse to support multiple metrics.

Why developers may prefer a multi-metric API supported by sklearn itself:

  1. Custom evaluation protocols can be developed that evaluate multiple objectives and benefit from sklearn's correctness (e.g. caching, metadata routing and response values).
  2. Custom multi-scoring wrappers do not have to be versioned against the installed version of sklearn. (See alternatives considered.)
  3. Users can rely on the same interface across compliant libraries in the sklearn ecosystem.

Context for suggestion:

In re-developing Auto-Sklearn, we perform hyperparameter optimization, which can include evaluating many metrics. We require custom evaluation protocols that are not trivially satisfied by cross_validate or the related family of sklearn functions. Previously, Auto-Sklearn implemented its own metrics; however, we would like to extend this to any sklearn-compliant scorer. Using a _MultimetricScorer is ideal because of its caching and its handling of the model response values needed to feed each scorer. Ideally we could also access this cache, but that is a secondary concern for now.

I had previous solutions that emulated _MultimetricScorer, but they broke with sklearn 1.3 and 1.4 due to changes in the scorers. I'm unsure how to reliably build a stable multi-metric API against sklearn.

An example use case a library may want to support:

# Custom evaluation class that depends only on the sklearn API.
# Does not need to know anything ahead of time about the scorers.
class CustomEvaluator:
    def __init__(self, ..., scoring: dict[str, str | _Scorer]):
        self.scoring = {
            name: get_scorer(v) if isinstance(v, str) else v
            for name, v in scoring.items()
        }

    def evaluate(self, pipeline_configuration):
        model = something(pipeline_configuration)

        # MAIN API REQUEST
        scorers = public_sklearn_api_to_make_multimetric_scorer(self.scoring)
        scores = scorers(model, X, y)
        ...

# Custom evaluation metric
def user_custom_metric(y_true, y_pred) -> float:
    ...

# Userland: users can rely on compliant libraries accepting the following
# interface for providing multiple scorers.
scorers = {
    "acc": "accuracy",
    "custom": make_scorer(
        user_custom_metric,
        response_method="predict",
        greater_is_better=False,
    ),
}
custom_evaluator = CustomEvaluator(scorers)

Describe your proposed solution

My proposed solution would involve making some variant of _MultimetricScorer public API. Perhaps this could be made accessible through a non-backwards-breaking change to get_scorer:

# Before
def get_scorer(scoring: str) -> _Scorer: ...

# After
@overload
def get_scorer(scoring: str) -> _Scorer: ...

@overload
def get_scorer(scoring: Iterable[str]) -> MultiMetricScorer: ...

def get_scorer(scoring: str | Iterable[str]) -> _Scorer | MultiMetricScorer: ...

This would allow a user to pass in a MultiMetricScorer that I can act upon, or at the very least a list[str] that I can reliably convert to one.

# Example
match scorer:
    case MultiMetricScorer():
        scores: dict[str, float] = scorer(estimator, X, y)
    case list():
        scorers = get_scorer(scorer)
        scores = scorers(estimator, X, y)
    case _:
        score: float = scorer(estimator, X, y)

This might cause consistency issues internally within sklearn, which could be problematic. One additional change that might be required would be to add a new, non-backwards-breaking keyword to check_scoring(..., *, allow_multi_scoring: bool = False).
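For illustration, usage of the proposed keyword might look like the following (allow_multi_scoring is purely hypothetical, named by the suggestion above; it is not an existing sklearn parameter):

from sklearn.metrics import check_scoring

# Hypothetical keyword, shown only to illustrate the proposal above
scorer = check_scoring(estimator, scoring=["accuracy", "f1_macro"], allow_multi_scoring=True)
scores = scorer(estimator, X, y)  # -> {"accuracy": ..., "f1_macro": ...}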


**Issues with this proposal**

  • There is no public Scorer class API, so perhaps this suggestion makes no sense without one. However, even if the _MultimetricScorer class were to remain hidden, a publicly advertised method to construct one with reliable usage semantics would be enough, and both classes could stay private.

Describe alternatives you've considered, if relevant

The easiest solution in most cases is to rely on the private _check_multimetric_scoring and just instantiate a _MultimetricScorer, i.e. relying on private functionality.
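A minimal sketch of that workaround, assuming the private signatures of recent releases (these names are private and have changed between versions):

from sklearn.datasets import load_iris
from sklearn.metrics._scorer import _MultimetricScorer, _check_multimetric_scoring  # private!
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
est = SVC().fit(X, y)

# _check_multimetric_scoring only returns a plain {name: scorer} dict ...
scorers = _check_multimetric_scoring(est, ["accuracy", "balanced_accuracy"])
# ... which then has to be wrapped by hand to get the shared-prediction caching.
multi = _MultimetricScorer(scorers=scorers)
print(multi(est, X, y))  # {"accuracy": ..., "balanced_accuracy": ...}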

Previous solutions relied on the private _MultimetricScorer and the _BaseScorer family and its previous sub-families.

Understandably, these private classes are subject to change; they broke in 1.3 with the changes to metadata routing and in 1.4 with the changes to the _Scorer hierarchy.

I will rely on private functionality if I have to, but it makes developing a library against sklearn quite difficult due to versioning.

If this will not be supported, I will likely go with some wrapper class that is dependent upon the version of scikit-learn in use.

Additional context

Currently, the only way to use multiple scorers for a model is through the interface of cross_validate(scoring=["a", "b", "c"]) or the equivalent argument to permutation_importance.
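For reference, a minimal example of that existing entry point (the multi-metric scorer that cross_validate builds internally is never exposed to the caller):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# scoring accepts a list of scorer names; the results gain one test_<name> entry per metric
results = cross_validate(SVC(), X, y, scoring=["accuracy", "balanced_accuracy"])
print(results["test_accuracy"], results["test_balanced_accuracy"])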

Further Comments

Having access to the cached, transformed predictions after scoring would be useful as well, but I think that lies outside the scope for now.

@eddiebergman added the Needs Triage and New Feature labels Jan 28, 2024
@eddiebergman changed the title [API] A public interface for MutliMetricScorer → [API] A public interface for using multiple metrics Jan 28, 2024
@eddiebergman changed the title [API] A public interface for using multiple metrics → [API] A public interface for using multiple scorers Jan 28, 2024
@eddiebergman changed the title [API] A public interface for using multiple scorers → [API] A public API for creating and using multiple scorers in the sklearn-ecosystem Jan 28, 2024
@adrinjalali added the Needs Decision - API label and removed the Needs Triage label Jan 30, 2024
@adrinjalali (Member)

This is a good proposal. I'm not sure if we have bandwidth to prioritize this right now, but I'd be happy to see this happen.

WDYT @scikit-learn/core-devs

@eddiebergman (Contributor, Author) commented Jan 30, 2024

Let me know if you need help with the bandwidth or if there are any refinements to the API that need to be considered! I am admittedly not familiar with all the nuances that might arise, but I am happy to spend the time to investigate such a proposal further.

@thomasjpfan (Member)

From what I am reading, the quick solution is to have _check_multimetric_scoring be public and actually return a multimetric scorer. @eddiebergman Is this correct?

I recall wanting to make this change ~4 years ago, but ran into an API limitation when using it with GridSearch. Looking over the code now, I think we can make this happen. I'll be happy to look into it.

@jnothman (Member)

I've not read in great detail, but just want to note a few things that come to my mind quickly:

  • The *SearchCV estimators support multiple metrics provided as a list or dict via their scoring parameter.
  • I had once upon a time wanted it to be able to take a callable that returned a dict of {name:score} values, which is not unlike your CustomEvaluator. One of the main benefits of that is the opportunity to share sufficient statistics among several evaluation metrics.
  • I like the ability to build a custom evaluator not just from a list of scorers, but through logic that, given/inferring a task type (e.g. imbalanced binary classification), provides the user with a set of applicable scorers (or they may select among them).

@eddiebergman (Contributor, Author) commented Jan 31, 2024

From what I am reading, the quick solution is to have _check_multimetric_scoring be public and actually return a multimetric scorer. @eddiebergman Is this correct?

I recall wanting to make this change ~4 years ago, but ran into an API limitation when using it with GridSearch. Looking over the code now, I think we can make this happen. I'll be happy to look into it.

@thomasjpfan To be honest, I don't really mind how it's implemented, as long as it's advertised as non-breaking. I'm sure breaking changes will need to happen at some point, but at least they will then be considered something to be documented and advertised.

Some considerations:

  • As a library in the sklearn ecosystem that consumes users' "sklearn stuff", check_multimetric_scoring should be enough, with users providing a dict[str, _Scorer].
  • As a user who needs to evaluate multiple metrics, they will likely not know to use this and will by default get a less efficient implementation, doing something like:
from sklearn.metrics import get_scorer

# Strawman user: "I know about `get_scorer`, I will use that to get
# multiple scorers and evaluate them sequentially"
metrics = {"acc": get_scorer("accuracy"), "bal_acc": get_scorer("balanced_accuracy")}
scores = {metric_name: scorer(estimator, X, y) for metric_name, scorer in metrics.items()}

# Pro user: "I know about `check_multimetric_scoring` from <somewhere>, I can use that for efficient caching"
mm = check_multimetric_scoring(
    {"acc": get_scorer("accuracy"), "bal_acc": get_scorer("balanced_accuracy")},  # boilerplate
    backwards_compatible_default_set_to_default_False=True,
)
scores = mm(estimator, X, y)

I think centralizing how scorers are obtained could benefit both users and the maintenance burden. The primary benefit is that no new methods or concepts need to be introduced for metrics, and everything sits in one place for users as well as for any maintenance considerations.

mm = get_scorer(["accuracy", "balanced_accuracy"])
scores = mm(estimator, X, Y)

@eddiebergman (Contributor, Author)

I did a little more code search on the use of get_scorer to see the impact.

I also found this search rather interesting: I searched for usages of _MultimetricScorer( to see where on GitHub it is called. A very recurring theme is checking the input and then deciding which kind of scorer to build.

https://github.com/search?q=_MultimetricScorer%28+path%3A*.py&type=code&ref=advsearch

@thomasjpfan (Member) commented Feb 4, 2024

@jnothman From my reading of this issue, this is more about making _MultimetricScorer usable outside of cross_validate and GridSearchCV. _MultimetricScorer is useful by itself because of the caching it does.
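For illustration, a rough way to see that caching in action (a sketch using the private class, assuming its current keyword-only constructor): both scorers below need predict, and the multimetric scorer only computes it once.

from sklearn.datasets import load_iris
from sklearn.metrics import get_scorer
from sklearn.metrics._scorer import _MultimetricScorer  # private, subject to change
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

class CountingSVC(SVC):
    """SVC that counts how often predict is called."""
    n_predict_calls = 0

    def predict(self, X):
        CountingSVC.n_predict_calls += 1
        return super().predict(X)

est = CountingSVC().fit(X, y)
scorers = {"acc": get_scorer("accuracy"), "bal_acc": get_scorer("balanced_accuracy")}

for s in scorers.values():              # scoring one-by-one: predict runs twice
    s(est, X, y)
print(CountingSVC.n_predict_calls)      # 2

CountingSVC.n_predict_calls = 0
_MultimetricScorer(scorers=scorers)(est, X, y)  # shared cache: predict runs once
print(CountingSVC.n_predict_calls)      # 1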

I had once upon a time wanted it to be able to take a callable that returned a dict of {name:score} values, which is not unlike your CustomEvaluator.

This is currently possible in cross_validate and GridSearchCV:

from sklearn import svm, datasets
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score, f1_score
from pprint import pprint

iris = datasets.load_iris()

def my_scorer(est, X, y):
    y_pred = est.predict(X)
    return {"accuracy": accuracy_score(y, y_pred), "f1_micro": f1_score(y, y_pred, average="micro")}

svc = svm.SVC()
results = cross_validate(svc, iris.data, iris.target, scoring=my_scorer)
pprint(results)

I like the ability to build a custom evaluator not just from a list of scorers, but through logic that, given/inferring a task type (e.g. imbalanced binary classification),

We have all the tools to build it now. REF: #12385

In any case, I think this is orthogonal to making multi-metric scoring public. I opened #28360 to add multi-metric support to check_scoring. In the context of this issue, it enables the following to get a multi-metric scorer:

multi_scorer = check_scoring(scoring=["r2", "roc_auc", "accuracy"])

# or
multi_scorer = check_scoring(scoring={
    "acc": "accuracy",
    "custom": make_scorer(
        user_custom_metric,
        response_method="predict",
        greater_is_better=False,
    ),
})
