Automatically move `y_true` to the same device and namespace as `y_pred` for metrics · Issue #31274 · scikit-learn/scikit-learn · GitHub

Open
lucyleeow opened this issue Apr 30, 2025 · 3 comments
lucyleeow (Member) commented Apr 30, 2025

This is closely linked to #28668 but separate enough to warrant its own issue (#28668 (comment)). This is mostly a summary of discussions so far. If we are happy with a decision, we can move to updating the documentation.


For classification metrics to support array API, there is a problem in the case where y_pred is not in the same namespace/device as y_true.

y_pred is likely to be the output of predict_proba or decision_function and would be in the same namespace/device as X (if we decide in #28668 that "everything should follow X").
y_true could be an integer array, a NumPy array, or a pandas Series (this is pertinent because y_true may contain string labels).
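
To illustrate why string labels matter here: the array API standard defines no string dtype (and PyTorch tensors cannot hold strings), so string labels would have to be encoded as integers on CPU before `y_true` could follow `y_pred` anywhere. A minimal sketch using only NumPy; the label values are made up for illustration and this is not how scikit-learn necessarily handles it internally:

```python
import numpy as np

# Hypothetical string labels, e.g. as obtained from a pandas Series.
y_true = np.asarray(["spam", "ham", "spam", "eggs"], dtype=object)

# np.unique sorts the distinct labels and, with return_inverse=True,
# also returns for each sample the index of its label in that sorted array.
classes, y_true_codes = np.unique(y_true, return_inverse=True)

# y_true_codes is a plain integer array, which can be converted to
# another namespace/device; `classes` stays on CPU to map codes back.
```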

Motivating use case:

Using e.g., GridSearchCV or cross_validate with a pipeline that moves X to GPU.
Consider a pipeline like below (copied from #28668 (comment)):

pipeline = make_pipeline(
   SomeDataFrameAwareFeatureExtractor(),
   MoveFeaturesToPyTorch(device="cuda"),
   SomeArrayAPICapableClassifier(),
)

Pipelines never transform y, so we are not able to alter y within the pipeline.
We would need to pass a metric to GridSearchCV or cross_validate, and that metric would then receive y_true and y_pred in different namespaces / on different devices.

Thus the motivation to automatically move y_true to the same namespace / device as y_pred, in metrics functions.
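
A rough sketch of what "y_true follows y_pred" could look like. The function name `move_y_true_to` is hypothetical and not a scikit-learn API. It assumes `y_pred` exposes `__array_namespace__` as defined by the array API standard (NumPy >= 2.0 arrays do); libraries that do not, such as PyTorch tensors, would need a compatibility layer like array-api-compat, which is not shown here.

```python
import numpy as np


def move_y_true_to(y_pred, y_true):
    """Return y_true converted to the namespace and device of y_pred.

    Hypothetical helper for illustration only.
    """
    # Arrays following the array API standard expose __array_namespace__.
    # Fall back to NumPy when the attribute is missing (e.g. NumPy 1.x).
    if hasattr(y_pred, "__array_namespace__"):
        xp = y_pred.__array_namespace__()
    else:
        xp = np

    # Only pass `device` when y_pred actually reports one.
    device = getattr(y_pred, "device", None)
    if device is None:
        return xp.asarray(y_true)
    return xp.asarray(y_true, device=device)


# Usage: y_true starts as a plain Python list, y_pred is a NumPy array.
y_pred = np.asarray([0.1, 0.9, 0.8])
y_true_moved = move_y_true_to(y_pred, [0, 1, 1])
```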

(Note another example is discussed in #30439 (comment))

As y_pred is more likely to be on GPU, "y_true follows y_pred" was slightly preferred over "y_pred follows y_true". Computation-wise, CPU and GPU are probably similar for metrics like log-loss, but for metrics that require sorting (e.g., ROC AUC) GPU may be faster? (see #30439 (comment) for more discussion on this point)

Question for my own clarification: is the main motivation usability, so that the user does not have to manually convert y_true? Would a helper function that converts y_true to the correct namespace/device be interesting?

cc @ogrisel @betatim

ogrisel (Member) commented May 15, 2025

+1 for this proposal and +1 for a PR to make that policy explicit in array_api.rst.

ogrisel (Member) commented May 15, 2025

> Question for my own clarification, the main motivation is for usability, so the user does not have to manually convert y_true?
> Would a helper function to help the user convert y_true to the correct namespace/device be interesting?

Not just this. Some pipelines would not be implementable otherwise. For instance, let's consider the following:

from functools import partial

import pandas as pd
import torch
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, TargetEncoder

df = pd.read_parquet("data_file.parquet")
X = df[categorical_feature_names]  # categorical or str/object dtype
y = df[target_column_name]  # continuous value to regress.

pipeline = make_pipeline(
    TargetEncoder(),  # works on categorical X but also needs y, hence CPU for both
    FunctionTransformer(partial(torch.asarray, device="cuda:0")),
    Ridge(),  # only numerical inputs: faster to do it all on GPU, y follows X => y_pred on GPU.
)

cv_results = cross_validate(pipeline, X, y, scoring="r2")

# Or similarly, to grid search the hyperparameters of `TargetEncoder` and `Ridge` jointly.

> Would a helper function to help the user convert y_true to the correct namespace/device be interesting?

I don't think it's possible to write a helper to deal with the above use case: y needs to be on CPU for TargetEncoder, and hence also when it is passed to cross_validate. Note that TargetEncoder is a supervised transformer: it consumes y but does not transform it (our pipelines do not allow transforming y).

Ridge runs much faster when fed numerical X on GPU, and "y follows X" means that y_pred would naturally be on the GPU. The scoring function called internally by cross_validate or GridSearchCV then needs to accept mixed inputs and implement "y_true follows y_pred".
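
As a sketch of what such a scoring function might look like, here is an R² computation that accepts mixed inputs and converts `y_true` to the namespace, device and dtype of `y_pred` before computing. The function name is hypothetical, it is not scikit-learn's `r2_score` implementation, and it assumes `y_pred` exposes `__array_namespace__` (falling back to NumPy otherwise); it is only exercised on NumPy arrays here.

```python
import numpy as np


def r2_y_true_follows_y_pred(y_true, y_pred):
    """Hypothetical R^2 metric implementing 'y_true follows y_pred'."""
    # Look up the array namespace of y_pred, falling back to NumPy.
    if hasattr(y_pred, "__array_namespace__"):
        xp = y_pred.__array_namespace__()
    else:
        xp = np

    # Convert y_true (list, Series, CPU array, ...) to match y_pred.
    device = getattr(y_pred, "device", None)
    kwargs = {} if device is None else {"device": device}
    y_true = xp.asarray(y_true, dtype=y_pred.dtype, **kwargs)

    # Standard R^2 = 1 - SS_res / SS_tot, computed in y_pred's namespace.
    ss_res = xp.sum((y_true - y_pred) ** 2)
    ss_tot = xp.sum((y_true - xp.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Usage: y_true is a plain list, y_pred an array.
perfect = r2_y_true_follows_y_pred([1.0, 2.0, 3.0], np.asarray([1.0, 2.0, 3.0]))
baseline = r2_y_true_follows_y_pred([1.0, 2.0, 3.0], np.asarray([2.0, 2.0, 2.0]))
```

Predicting the mean of `y_true` everywhere gives an R² of 0, and a perfect prediction gives 1, which is a quick sanity check for the conversion step.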

lucyleeow (Member, Author) commented

Thanks @ogrisel I had not considered that.

I didn't attend the meeting but are we mostly in consensus about this? Happy to move forward with doc + implementation if so.
