Clarification of output array type when metrics accept multiclass/multioutput #31286


Open
lucyleeow opened this issue May 1, 2025 · 26 comments

@lucyleeow
Member
lucyleeow commented May 1, 2025

Clarification of how we should handle array output type when a metric outputs several values (i.e. accepts multiclass or multioutput input).

The issue was summarised succinctly in #30439 (comment):

Not sure what should be the output namespace / device in case we output an array, e.g. roc_auc_score with average=None on multiclass problems...

Currently, all regression/classification metrics that support array API and multiclass or multioutput output an array in the same namespace and device as the input (checked the code and verified manually). Summary of these metrics:

Regression metrics

Returns array in same namespace/device:

Classification metrics

Returns array in same namespace/device:

Looking at the metrics code, if we wanted to support a list of scalars, we'd generally have to do extra processing to convert an array (often output of an xp.** function) to a list of scalars.
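For concreteness, a minimal sketch (hypothetical helper, not existing code) of what that extra processing could look like, assuming the metric has already produced an array-API array such as output_errors:

def _as_scalar_list(output_errors):
    # float() on each element is the spec-compliant way to pull a Python
    # scalar out of a (possibly non-NumPy, possibly non-CPU) 1-d array
    return [float(output_errors[i]) for i in range(output_errors.shape[0])]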

Once we arrive at a consensus, we should update the array API documentation and update check_array_api_metric in the tests such that when the output is an array/list, we check that the output type etc. is correct.
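As a rough illustration only (no decision implied), the check could gain an assertion along these lines; get_namespace and device are existing helpers in sklearn.utils._array_api, while the wrapper function below is hypothetical:

from sklearn.utils._array_api import device, get_namespace

def _check_metric_output_follows_input(result, y_pred):
    # Hypothetical assertion for the "same namespace/device as the input"
    # option; the alternative convention would instead assert that `result`
    # is a plain numpy.ndarray.
    xp_result, _ = get_namespace(result)
    xp_input, _ = get_namespace(y_pred)
    assert xp_result is xp_input
    assert device(result) == device(y_pred)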

cc @ogrisel @betatim

@github-actions github-actions bot added the Needs Triage Issue requires triage label May 1, 2025
@adrinjalali
Member

@ogrisel once mentioned this was intentional, since array API support is experimental and we didn't want to confuse people with complicated docstrings. But I think not having the information is very confusing. I'd be in favor of properly documenting them.

I think @StefanieSenger had the same view at some point.

@adrinjalali adrinjalali added Needs Decision Requires decision and removed Needs Triage Issue requires triage labels May 5, 2025
@lucyleeow
Member Author

I wasn't suggesting we should include this in docstrings.

This issue is mainly about the concern expressed in the quoted comment: deciding on what we want and adding a note to the array API doc page.

@adrinjalali
Member

Oh I see.

Since we decided that our metrics should return Python scalars, shouldn't the arrays here also be very simple types?

@lucyleeow
Member Author

Sorry, the arrays should be simple array types? Like a numpy array only?

Haven't looked into pros and cons either way, so not pushing either proposal.

@StefanieSenger
Contributor

That's an open question that is currently blocking progress on implementing array API support. (I had tried to sum up the state of the discussion on one of my blocked PRs here.)

I believe it's less confusing for users if they get the same array type out of a function / pipeline that they put into it, and I also understand this service is part of the promise of array API. It doesn't seem to be very costly to convert back to the input namespace even if numpy was used in part of the process, as @OmarManzoor had checked by benchmarking a specific example.

(Scalar types from metrics could be an acceptable exception, since people are used to getting scalars, but they don't need to be.)

@adrinjalali
Member

When it comes to y, i.e. y_pred, I understand it makes sense for them to be in the same space; however, the dimensionality and nature of the output scores has nothing to do with their input, really. So I'd be more in favor of a simple numpy array, or even a python list.

@lucyleeow
Member Author

Thanks for the reference @StefanieSenger , that is interesting.

I think none of the metrics above required an explicit conversion into the same input namespace and device.

e.g., for mean_absolute_error (and I think the case is similar for the other mean* metrics)

output_errors = _average(
    xp.abs(y_pred - y_true), weights=sample_weight, axis=0, xp=xp
)
if isinstance(multioutput, str):
    if multioutput == "raw_values":
        return output_errors

the output of _average keeps the same namespace and device as the input values (and uses _find_matching_floating_dtype to get the dtype).

AFAICT the classification metrics, precision_recall_fscore_support and family, are all returned in the same namespace and device, as the final output comes from _prf_divide.

I think confusion_matrix may be in the minority, but what we may need is to have a look at the other metrics still to be converted, and make an educated guess as to which will 'automatically' return in the same namespace/device and which will require explicit conversion like confusion_matrix.
Would this be helpful in our decision?

Or would we like a simple list/numpy array, even when it requires explicit conversion?

@StefanieSenger
Contributor

Could you explain what you mean by automatic vs. explicit conversion, @lucyleeow?

My understanding is that we would make the metrics return the type(s) we want. And then I would think, if it's not a float, but an array type, it makes sense to return the same type (and same device) the user has passed in as y_true (or ideally also y_pred), since they might want to keep working with these returned types on the device they are using. (Maybe even return the python floats as a 1-d array from the input namespace instead, so it can stay on gpu if needed.)

@lucyleeow
Member Author

automatic vs. explicit conversion

This is my understanding of how the code works. Looking at the 'easy' example of mean_absolute_error:

output_errors = _average(
    xp.abs(y_pred - y_true), weights=sample_weight, axis=0, xp=xp
)
if isinstance(multioutput, str):
    if multioutput == "raw_values":
        return output_errors

xp.abs takes y_pred and y_true as inputs, and as array API aims for "array type in equals array type out", the result should be in the same namespace / device. AFAICT the _average function tries to do the same, I think these are the relevant lines:

a = xp.asarray(a, device=device_)

and

sum_ = xp.sum(xp.multiply(a, weights), axis=axis)

So the code automatically outputs an array in the same namespace / device as the input, simply by virtue of array API having the "array type in equals array type out" principle.

If you wanted something else, you would have to manually convert it at the end.

I guess the situation with confusion_matrix is that you convert the inputs to a numpy array within the function, which means you have to decide at the end whether to convert back to the input type or keep it as numpy.

I think this is how it works, but I may have missed something.
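For what it's worth, here is a small sketch of what the "convert back" option would amount to for a metric like confusion_matrix that computes on numpy internally; get_namespace_and_device is the existing helper, and the function below is only an illustration:

from sklearn.utils._array_api import get_namespace_and_device

def _return_in_input_namespace(cm_numpy, y_true, y_pred):
    # Move the numpy result back to the namespace and device of the inputs;
    # the alternative is to return cm_numpy as-is.
    xp, _, device_ = get_namespace_and_device(y_true, y_pred)
    return xp.asarray(cm_numpy, device=device_)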

@ogrisel
Member
ogrisel commented May 15, 2025

it makes sense to return the same type (and same device) the user has passed in as y_true (or ideally also y_pred), since they might want to keep working with these returned types on the device they are using.

I would say: "it makes sense to return the same type (and same device) the user has passed in as y_pred" (y_pred instead of y_true) to be in line with #31274: for unthresholded classification metrics (log_loss, roc_auc, ...), y_true can be a numpy array/pandas series with string labels / categorical values. y_pred, on the other hand, can be an array API container with numeric values returned by predict_proba or decision_function.

@betatim
Member
betatim commented May 15, 2025

We discussed this topic in the array API meeting today. My impression was that there were no strong opinions either way. Instead we discussed the question of "what are people going to do with the output of these functions?" - some things we could think of were: printing values (maybe rounded to N digits), plotting values with matplotlib, putting them into a pandas DF to then .describe() them. Something that is unlikely to happen is that they get fed back into the "machinery" that produced them, unlike, say, fitted attributes of estimators. This question/train of thought was going in the direction of "numpy array or array-in-original-namespace, which one is going to be more convenient for the user?"

Another question raised was how much code we'd have to change/add to every metric function to, say, always output numpy arrays.

My personal thoughts: I don't know what the state of "array API support" in matplotlib is. I don't think they are working on supporting the array API, but there has been some thought and work around supporting things that aren't numpy arrays. I think, but this is just speculation, that most users will plot the results of the metrics or otherwise compute simple statistics on them. It seems silly to make every user add code that converts the result of a metric function to a numpy array (if that is what is needed for plotting). Especially because there is (currently) no generic way to take an arbitrary array API array and turn it into a numpy array. But if you can feed a GPU pytorch tensor to matplotlib just like a numpy array, this problem goes away.
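To make that concrete, here is a small hypothetical example of the boilerplate that would be pushed onto each user (torch on CUDA assumed purely for illustration; any other GPU namespace would need its own equivalent):

import matplotlib.pyplot as plt
import torch

# Hypothetical per-class scores, as a metric could return them "in the input
# namespace"; use CUDA if available, otherwise fall back to CPU.
scores = torch.tensor(
    [0.91, 0.84, 0.77], device="cuda" if torch.cuda.is_available() else "cpu"
)

# The extra user-side step: move to host memory and convert to numpy before plotting.
plt.bar(range(scores.shape[0]), scores.cpu().numpy())
plt.show()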

Keeping the namespace and device consistent with the input is less surprising and what other libraries do. Probably also what people expect to happen, given that metrics tend to be simple arithmetic functions of their inputs.

What do people think are typical "next steps" for the return values of metrics? Does anyone have experience with feeding "array API arrays" to matplotlib?

@lucyleeow
Member Author

I also have no strong views either way.

Especially because there is (currently) no generic way to take a arbitrary array API array and turn it into a numpy array

Spitballing - would you consider making _convert_to_numpy public?

@betatim
Member
betatim commented May 16, 2025

Especially because there is (currently) no generic way to take a arbitrary array API array and turn it into a numpy array

Spitballing - would you consider making _convert_to_numpy public?

Not sure. Even if it were public, it feels pretty clunky for a "generic average pydata user".

plt.plot(torch.asarray([1,2,3,4]), torch.asarray([4,3,2,1])) "just works", but the same with cupy or a torch tensor with device="cuda" fails. Makes me wonder why matplotlib didn't implement something like our _convert_to_numpy to use internally.

@StefanieSenger
Contributor

I would expect something like _convert_to_numpy() to exist in numpy.

@OmarManzoor
Contributor
OmarManzoor commented May 16, 2025

I think @ogrisel pointed out that returning metrics in the same namespace and device would be useful if they were used within a complete pipeline.

Otherwise, if it's just printing them, using them in plots or for further descriptive analysis, then I think it doesn't really matter if we return it in numpy or the original namespace. I guess if we return it in the original namespace then users would need an extra step to convert the returned arrays into numpy and move them to the cpu when the libraries they use (matplotlib, pandas etc.) don't really support the returned array or its device.

However, won't returning in the same namespace and device be useful in the case where, for example, someone is training a neural network architecture (maybe using HuggingFace) and most of the computations are being done on the gpu, but when calculating and displaying the metrics the respective arrays first need to be moved to the cpu and possibly converted to numpy?

@adrinjalali
Member

Reading through this thread, here's what I see:

  • we don't have a real common use case in mind where the output staying in the original namespace makes users' lives easier
  • we have existing common use cases where returning numpy, as we do now, makes users' lives easier

Also:

  • we return numpy arrays now, so keeping the current behavior means users not having to change anything in their code
  • when scorers return single values, it's a python float, not even a numpy float, so our scorer outputs are not in the same "kind" as the input anyway

So in the end, it seems what returning the result in the same namespace as the input brings is making us happy that we're "consistent" in some way, but it doesn't seem to bring much value to the user. On the other hand, returning numpy arrays is a much "simpler" kind of output, which to me seems closer to the idea of returning python floats for single values.

That to me indicates returning numpy arrays is winning the argument here.

@betatim
Member
betatim commented May 16, 2025

However won't returning in the same namespace and device be useful in the case where for example someone is training a neural network architecture maybe using HuggingFace and most of the computations are being done on the gpu

could you explain this a bit more? I was thinking about things like RandomSearchCV as a way to train models and "using" the results of the metric. But I think for the random search use-case it doesn't matter what namespace the metric is in.

I'm also tending towards "return numpy" but need to ponder the idea a bit more. I'm also trying to work out where/how to reach out to the matplotlib team to find out if accepting (say) torch tensors on the GPU is something they want to have, or whether they know it is virtually impossible for that to ever land in matplotlib.

(I posted a message on the matplotlib channel on the scientific python discord to ask about this)

@OmarManzoor
Contributor

could you explain this a bit more? I was thinking about things like RandomSearchCV as a way to train models and "using" the results of the metric. But I think for the random search use-case it doesn't matter what namespace the metric is in.

So basically, if you train models using HuggingFace and you want to compute metrics, you need to move the arrays to cpu and convert to numpy before this can be done. Here is the code from the generic trainer class that does this: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L4420-L4439

@ogrisel
Member
ogrisel commented May 16, 2025

I think @ogrisel pointed out that returning metrics in the same namespace and device would be useful if they were used within a complete pipeline.

I wouldn't say useful, but less surprising for people who expect scikit-learn to behave like other array API consuming libraries such as SciPy: in SciPy the general contract for array API capable functions is: namespace in, namespace out.

Our metric functions would not follow this contract. But as Tim said, I don't really see the case where it would be beneficial to stay in the original namespace. The metric values will not be used as inputs for any kind of complex computation but rather be displayed in jupyter cell outputs (after some simple post-processing such as rounding to remove non-significant digits or computing some simple summary/descriptive statistics such as mean, standard deviation, quantiles and co), or plotted with matplotlib and friends.

So basically if you train models using HuggingFace and you want to compute metrics you need to move the arrays to cpu and convert to numpy before this can be done.

So you mean that huggingface's pytorch trainer automatically converts metric values to numpy on CPU, so scikit-learn doing the same would not feel odd in that respect?

@ogrisel
Member
ogrisel commented May 16, 2025

Spitballing - would you consider making _convert_to_numpy public?

_convert_to_numpy is a hack in the sense that it is not namespace agnostic: it only supports torch and cupy (and array_api_strict, mostly for array API compliance testing purposes), while the rest of our array API code supposedly works for any namespace that follows the spec.

EDIT: for the record, here is how it is implemented at the moment:

def _convert_to_numpy(array, xp):
    """Convert X into a NumPy ndarray on the CPU."""
    if _is_xp_namespace(xp, "torch"):
        return array.cpu().numpy()
    elif _is_xp_namespace(xp, "cupy"):  # pragma: nocover
        return array.get()
    elif _is_xp_namespace(xp, "array_api_strict"):
        return numpy.asarray(xp.asarray(array, device=xp.Device("CPU_DEVICE")))
    return numpy.asarray(array)

The clean way to implement this would be to rely on from_dlpack which should allow for any namespace/device to any namespace/device transfers, possibly with memory copies. But this requires the array API compliant libraries to adopt the latest version of dlpack spec for their __dlpack__ method and the latest version of the array API spec for the from_dlpack method of the consuming library.

For instance, with pytorch 2.7.0 (and numpy 2.2.5), this does not work yet:

>>> import array_api_compat.torch as xp
>>> import numpy as np
>>> np.from_dlpack(xp.ones(3, device="mps"), device="cpu")
Traceback (most recent call last):
  Cell In[9], line 1
    np.from_dlpack(xp.ones(3, device="mps"), device="cpu")
TypeError: Tensor.__dlpack__() got an unexpected keyword argument 'dl_device'

This exception is raised by PyTorch.

EDIT: the PyTorch specific problem is being tackled in this set of PRs:

NumPy and array-api-strict are fine:

>>> import array_api_strict as xp
>>> np.from_dlpack(xp.ones(3, device=xp.Device("device1")), device="cpu")
array([1., 1., 1.])

but maybe it's just array-api-strict that is too lax when simulating non-CPU devices, because based on my understanding of the spec, I would have expected any of the following to raise an error:

>>> np.from_dlpack(xp.ones(3, device=xp.Device("device1")), device=None, copy=False)
array([1., 1., 1.])
>>> np.from_dlpack(xp.ones(3, device=xp.Device("device1")), device="cpu", copy=False)
array([1., 1., 1.])

So to conclude, cross-device dlpack-based namespace conversion is too new to be relied upon, hence we have to keep our _convert_to_numpy hack for the time being, as long as libraries do not implement the full dlpack/array API specs needed for np.from_dlpack to work as expected.

@ogrisel
Member
ogrisel commented May 16, 2025

Note: array-api-extra implements a more complete alternative to our _convert_to_numpy hack, named array_api_extra._lib._testing.as_numpy_array:

https://github.com/data-apis/array-api-extra/blob/e58c0cbc6d4793bab91f3efc266fad00e1d45f17/src/array_api_extra/_lib/_testing.py#L93-L114

However, at the time of writing it's not importable from a public module. This is not necessarily a problem for us since we vendor array-api-extra, but we shouldn't make it public either.

@OmarManzoor
Contributor

So you mean that huggingface's trainer pytorch support automatically convert metric values to numpy on CPU so scikit-learn doing the same would not feel odd in that respect?

I think with respect to the general standard it would be better to convert to numpy. What I meant was that currently it seems like we have to perform the computations on the cpu using numpy. Supporting the original arrays might allow metrics to be computed on the gpu.

@ogrisel
Member
ogrisel commented May 16, 2025

What I meant was currently it seems like we have to perform the computations on the cpu using numpy.

This is a different problem: we can compute as much as possible on the GPU (in the input namespace of the metric function, i.e. based on how y_pred is provided) and only convert the result to numpy / CPU at the end, just before returning.

@ogrisel
Member
ogrisel commented May 16, 2025

Given all that has been said above, I think I am ok with having all the metrics that return arrays do an internal conversion to numpy before returning the results. It feels more practical for 99% of the use cases of our users.

Once libraries such as matplotlib have better support for dlpack 1.0 and/or the array API spec, we can consider changing our mind (assuming we do so before array API support in scikit-learn is out of the experimental phase).
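Concretely, the change per metric could look roughly like the following sketch, reusing the private _convert_to_numpy helper quoted above (the wrapper function itself is hypothetical):

from sklearn.utils._array_api import _convert_to_numpy, get_namespace

def _finalize_array_output(output_errors):
    # Hypothetical last step of a metric: the computation above happens in the
    # input namespace (possibly on GPU); only the result is converted to numpy.
    xp, _ = get_namespace(output_errors)
    return _convert_to_numpy(output_errors, xp=xp)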

@OmarManzoor
Contributor

This is a different problem: we can compute as much as possible on the GPU (in the input namespace) whenever possible and only convert the result to numpy / CPU at the end, just before returning.

Right, I think converting the final outcome to numpy seems fine then.

@lucascolley
Contributor

My intuition is telling me that 'array type in == array type out' is the better option here, but I am struggling to come up with a real-world use-case for performing further computation on the metric results. Here's the best I've come up with:

[Image: pseudocode of a Forward Search procedure; step 5 computes per-target errors for each candidate feature set, step 7 selects the next feature set.]

Suppose a user was implementing some novel variation on Forward Search, for some $\mathbf{y}$ with a large number of targets (outputs). They could call something from sklearn.metrics in step 5 to get errors per target for each set of features (an array with the same number of dims as the number of targets), then implement some (potentially complicated) custom function at step 7 which picks a feature set based on some sort of argmin of a weighted average over the error arrays. This custom function could use things from xp or scipy.
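A very rough sketch of what such a step-7 selection could look like while staying in the input namespace (all names below are made up for illustration):

def pick_next_feature_set(per_target_errors, target_weights, xp):
    # per_target_errors: list of arrays, one per candidate feature set, each of
    # shape (n_targets,), e.g. from mean_absolute_error(..., multioutput="raw_values")
    stacked = xp.stack(per_target_errors)  # shape (n_candidates, n_targets)
    weighted = xp.sum(stacked * target_weights, axis=1) / xp.sum(target_weights)
    return int(xp.argmin(weighted))  # index of the best candidate feature set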
