Add array API support to `median_absolute_error` #31406

lucyleeow · 2025-05-21T00:58:02Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Add array API support to median_absolute_error. (Currently the only change made was to add an array API supporting _median function, see below.)

Any other comments?

This is the only metric to use median; however, median is used in a fair number of estimators. I think the first item to address is which median we should use.

The array API spec currently does not support median, so these are our options:

- Write our own median function (that uses np.median when the namespace is numpy). This is included in this PR; the downside is maintenance.
- Use our _weighted_percentile. The downside is that it is slow.
- Push for median inclusion in the array API spec. Admittedly, median is not used much outside of scikit-learn (RFC: array-agnostic quantile data-apis/array-api#795 (comment)), BUT it seems that most (all?) array libraries have an implementation. I would be in favour of pushing for inclusion, less because of usage and more because the implementation of median is well defined (vs. e.g. quantile) and I think other array libraries, including dask, do have an implementation. They may be open to this: RFC: array-agnostic quantile data-apis/array-api#795 (comment)
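For reference, the first option can be sketched with nothing more than sorting, slicing and a mean. This is a minimal illustration, not the `_median` helper from this PR: the name `median_1d` is hypothetical, it only handles 1-D floating-point input, and it assumes the namespace passed as `xp` provides `sort` and `mean` (NumPy is used here).

```python
import numpy as np


def median_1d(x, xp=np):
    """Median of a 1-D floating-point array using only sort, slicing and mean.

    Hypothetical helper for illustration; `xp` is assumed to expose
    `sort` and `mean`.
    """
    n = x.shape[0]
    if n == 0:
        raise ValueError("median of an empty array is undefined")
    x_sorted = xp.sort(x)
    lo = (n - 1) // 2
    hi = n // 2
    # The slice holds one element when n is odd and the two middle
    # elements when n is even, so a single mean covers both cases.
    return float(xp.mean(x_sorted[lo : hi + 1]))


odd_median = median_1d(np.asarray([3.0, 1.0, 2.0]))
even_median = median_1d(np.asarray([4.0, 1.0, 3.0, 2.0]))
```

A full sort is more work than a partition-based selection, which is one reason a native `median` may be faster.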

Here is some benchmarking I did with numpy and cupy arrays. I wanted to increase the size of the arrays tested and include the new scipy quantile (which supports array API but not weights) as a reference, as I think we ultimately want to use it, but I ran out of GPU time in colab 🙃
Maybe I should also have included torch CPU in the mix?

(Randomly generated 1D array)

| Function | NumPy (1e7) | CuPy (1e7) |
|---|---|---|
| sklearn `_median` | 0.182784s | 0.017168s |
| sklearn `_weighted_percentile` | 2.369427s | 0.088325s |
| CuPy `median` | n/a | 0.015946s |
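A comparison of this kind could be reproduced on CPU with a small timing harness along the following lines. This is only a sketch under stated assumptions: `time_call` is a hypothetical helper, the array is smaller than the 1e7 elements used above, and only NumPy functions are timed. GPU libraries such as CuPy launch kernels asynchronously, so timing them would additionally need device synchronization, which is not shown here.

```python
import timeit

import numpy as np


def time_call(func, *args, repeat=3):
    """Best wall-clock time in seconds over `repeat` single calls."""
    return min(timeit.repeat(lambda: func(*args), number=1, repeat=repeat))


rng = np.random.default_rng(0)
x = rng.random(100_000)  # randomly generated 1-D array

timings = {
    "np.median": time_call(np.median, x),
    "np.percentile(50)": time_call(np.percentile, x, 50),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f}s")
```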

github-actions · 2025-05-21T00:59:06Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

_Generated for commit: 15b8d23. Link to the linter CI: here_

lucyleeow · 2025-05-21T01:01:03Z

sklearn/utils/_array_api.py

+
+    # Use mean in both odd and even case to coerce data type,
+    # using out array if needed.
+    rout = xp.mean(X_sorted[indexer], axis=axis)


Technically the spec states that NaNs are propagated (https://data-apis.org/array-api/latest/API_specification/generated/array_api.mean.html#mean) but there is also a note that says:

Array libraries, such as NumPy, PyTorch, and JAX, currently deviate from this specification in their handling of components which are NaN when computing the arithmetic mean.

lucyleeow · 2025-05-21T01:02:33Z

cc @ogrisel @betatim @lesteve @OmarManzoor

betatim · 2025-05-21T13:01:11Z

I think writing our own median is fine. We could also contribute it to array-api-extra?

Is the change to median_absolute_error still to come or is this really all we need to do?

lucyleeow · 2025-05-22T00:40:08Z

Thanks for responding @betatim !

More changes are required. I just wanted to start the conversation about median and thought it would be useful to show, with context, what a re-implementation of it would look like, so I opened this PR instead of an issue.

I like the array-api-extra idea, it opens a conversation re: median at least. I am of the opinion that it would be worthwhile asking for it to be added to the spec as it is implemented already in all the array libraries listed under 'actively considered' here, and e.g., dask's own implementation does stuff with chunking (ref):

This works by automatically chunking the reduced axes to a single chunk if necessary and then calling numpy.median function across the remaining dimensions

and I would rather let dask deal with that than try to implement it in array-api-extra (though I guess the median function in array-api-extra could always just pass it to dask's own median function, but that almost seems to defeat the point).

The only concern I have about pushing for adding it to the spec is that it is not used much in other libraries outside of scikit-learn.

OmarManzoor

I think adding median in our own utils or contributing to array-api-extra are both fine.

betatim · 2025-05-23T11:44:24Z

Maybe we should do both: our own/array api extra implementation (while we wait for the standard) and ask for it to be added to the standard? I somehow dismissed the idea of adding it because it takes a long time :-/

The array-api-extra implementation could even just forward the call to the array library if it implements median
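A forwarding implementation of that kind might look roughly like the sketch below. The function name `median_with_fallback` is hypothetical and this is not the array-api-extra API; the fallback assumes the input can be converted with `np.asarray`, which does not hold for every array type or device, and a library whose `median` follows a different convention would still need special-casing.

```python
from types import SimpleNamespace

import numpy as np


def median_with_fallback(x, xp, axis=None):
    """Forward to the namespace's own median when it has one.

    Hypothetical helper: otherwise compute with NumPy and convert back.
    """
    if hasattr(xp, "median"):
        return xp.median(x, axis=axis)
    # Assumption: `x` is convertible to a NumPy array on the host.
    x_np = np.asarray(x)
    return xp.asarray(np.median(x_np, axis=axis))


x = np.asarray([4.0, 1.0, 3.0, 2.0])
native = median_with_fallback(x, np)

# Stand-in for a strict namespace that exposes no `median`.
no_median_xp = SimpleNamespace(asarray=np.asarray)
fallback = median_with_fallback(x, no_median_xp)
```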

lucyleeow · 2025-05-26T01:26:33Z

I somehow dismissed the idea of adding it because it takes a long time :-/

I totally forgot about this part.

Okay let's move forward with our own implementation and also starting a discussion about adding median to the spec.

sklearn/utils/tests/test_array_api.py

ogrisel · 2025-05-28T10:03:36Z

BTW we need to use _averaged_weighted_percentile instead of _weighted_percentile when sample_weight is not None. This is to ensure "centered" results when calling it on data with an even number of weighted predictions. This would fix the discrepancy with the unweighted case in situations where they should return the same results:

>>> from sklearn.metrics import median_absolute_error
>>> import numpy as np
>>> median_absolute_error(np.zeros(4), np.arange(4))
1.5
>>> median_absolute_error(np.zeros(4), np.arange(4), sample_weight=np.ones(4))
1.0

But this can be addressed in a dedicated PR: it impacts both numpy and other array namespaces.
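The discrepancy can be reproduced with a simplified NumPy model of the two approaches. The functions below are illustrative stand-ins written for this example, not the scikit-learn `_weighted_percentile` and `_averaged_weighted_percentile` implementations: the first returns the smallest value whose cumulative weight reaches the target, and the second averages that with the same computation run on the negated values.

```python
import numpy as np


def lower_weighted_percentile(values, weights, percentile=50.0):
    """Smallest value whose cumulative weight reaches the requested share."""
    order = np.argsort(values)
    sorted_values = values[order]
    cumulative = np.cumsum(weights[order])
    target = percentile / 100.0 * cumulative[-1]
    # First position where the cumulative weight is >= target.
    idx = np.searchsorted(cumulative, target, side="left")
    idx = min(int(idx), sorted_values.shape[0] - 1)
    return float(sorted_values[idx])


def averaged_weighted_percentile(values, weights, percentile=50.0):
    """Average of the lower and upper weighted percentiles."""
    lower = lower_weighted_percentile(values, weights, percentile)
    upper = -lower_weighted_percentile(-values, weights, 100.0 - percentile)
    return (lower + upper) / 2.0


abs_errors = np.abs(np.zeros(4) - np.arange(4))  # [0., 1., 2., 3.]
weights = np.ones(4)
lower = lower_weighted_percentile(abs_errors, weights)  # 1.0
centered = averaged_weighted_percentile(abs_errors, weights)  # 1.5
```

With unit weights the averaged variant agrees with `np.median(abs_errors)`, while the lower variant picks the smaller of the two middle values.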

lucyleeow · 2025-05-28T12:12:04Z

sklearn/metrics/tests/test_common.py

+    if (
+        getattr(metric, "__name__", None) == "median_absolute_error"
+        and array_namespace == "array_api_strict"
+    ):
+        try:
+            import array_api_strict
+        except ImportError:
+            pass
+        else:
+            if device == array_api_strict.Device("device1"):
+                # See https://github.com/data-apis/array-api-strict/issues/134
+                pytest.xfail(
+                    "`_weighted_percentile` is affected by array_api_strict bug when "
+                    "indexing with tuple of arrays on non-'CPU_DEVICE' devices."
+                )


This is not ideal. We have a similar xfail in a _weighted_percentile test:

scikit-learn/sklearn/utils/tests/test_stats.py

Lines 186 to 196 in 398e8fe

if array_namespace == "array_api_strict":
    try:
        import array_api_strict
    except ImportError:
        pass
    else:
        if device == array_api_strict.Device("device1"):
            # See https://github.com/data-apis/array-api-strict/issues/134
            pytest.xfail(
                "array_api_strict has bug when indexing with tuple of arrays "
                "on non-'CPU_DEVICE' devices."

Note that as we add array API support for more regression metrics, we will need to add them to the xfail, as several others use _weighted_percentile.

(Note that this bug has been fixed but we'd need a new release of array-api-strict to see it)

@ev-br - you did mention there could be an array-api-strict release soon though?

sklearn/utils/_array_api.py

OmarManzoor

Thanks for the PR @lucyleeow
A few comments

sklearn/utils/_array_api.py

sklearn/utils/validation.py

sklearn/utils/_array_api.py

lucyleeow · 2025-05-30T02:03:17Z

BTW we need to use _averaged_weighted_percentile instead of _weighted_percentile when sample_weight is not None.

(ref: #31406 (comment))

FYI @ogrisel that will be fixed in #30787, as once median_absolute_error is tested correctly, the difference will show up. ~~Nevermind, test failure due to a different problem. I will amend this in a follow up PR 😬~~ Ignore this part, there were 2 problems, it will be fixed in #30787

sklearn/utils/_array_api.py

lucyleeow · 2025-05-30T06:20:37Z

Thank you @OmarManzoor ! CI is finally green 😅 !

OmarManzoor

Looks good now. Thank you for the PR @lucyleeow

ogrisel

LGTM with a few comments. Thanks @lucyleeow!

sklearn/utils/tests/test_array_api.py

ogrisel · 2025-06-02T09:33:13Z

sklearn/utils/_array_api.py

+    # `median` is not included in the Array API spec, but is implemented in most
+    # array libraries, and all that we support (as of May 2025).


Suggested change

-    # `median` is not included in the Array API spec, but is implemented in most
-    # array libraries, and all that we support (as of May 2025).
+    # XXX: `median` is not included in the array API spec, but is implemented
+    # in most array libraries, and all that we support (as of May 2025).
+    # TODO: consider simplifying this code to use scipy instead once the oldest
+    # supported SciPy version provides `scipy.stats.quantile` with native array API
+    # support (likely scipy 1.6 at the time of writing). Proper benchmarking of
+    # either option with popular array namespaces is required to evaluate the
+    # impact of this choice.

Interesting.
I did think about what we should do in the future. Ralf suggested that maybe quantile should go into array-api-extra (I need to open an RFC issue about this in array-api-extra).

Despite most array libraries having a median (including dask), which may be somewhat faster than scipy.stats.quantile (benchmarking required), would you still be inclined to use scipy.stats.quantile over the median of the native library?
(I don't think median will be in the spec, mostly because quantile is more versatile and torch's median implementation differs from everyone else's.)

I don't know for sure, we would need some proper evaluation of different options. Maybe I can rephrase the suggestion to be less assertive in the comment.

EDIT: done.
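The torch difference mentioned above is about even-length input: `torch.median` is documented to return the lower of the two middle values, whereas `numpy.median` averages them. The sketch below reproduces both conventions with NumPy only, so that it runs without torch installed; `lower_median` is a hypothetical helper written for this illustration.

```python
import numpy as np


def lower_median(x):
    """Lower of the two middle values (the convention torch.median documents)."""
    x_sorted = np.sort(x)
    return float(x_sorted[(x_sorted.shape[0] - 1) // 2])


even = np.asarray([0.0, 1.0, 2.0, 3.0])
odd = np.asarray([0.0, 1.0, 2.0])

lower_even = lower_median(even)  # lower middle value
averaged_even = float(np.median(even))  # mean of the two middle values
lower_odd = lower_median(odd)
averaged_odd = float(np.median(odd))
```

For odd-length input the two conventions coincide, so the difference only shows up when the number of elements is even.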

betatim · 2025-06-03T07:01:06Z

Looks like all the discussion topics were addressed/resolved and the robots are happy -> merging

edit: sorry, Olivier's suggestion hadn't actually been committed (I was tricked by the "outdated" marker). Should I open a new PR to add that?

edit edit: maybe it was actually done.

lucyleeow · 2025-06-04T00:21:12Z

Thanks @betatim !

lucyleeow added 2 commits May 19, 2025 15:36

add median

37e21b0

amend comment

f99397b

github-actions bot added the module:utils label May 21, 2025

lucyleeow commented May 21, 2025

lucyleeow added Array API Needs Decision Requires decision labels May 22, 2025

lucyleeow mentioned this pull request May 22, 2025

Make more of the "tools" of scikit-learn Array API compatible #26024

Open

OmarManzoor reviewed May 22, 2025

lucyleeow added 4 commits May 26, 2025 15:48

add support

3940610

add whats new

b277ac0

use quantile for torch

dba4b7f

add support for helpers

335a91d

betatim reviewed May 27, 2025

lucyleeow added 3 commits May 28, 2025 21:47

fix

e708595

uncomment

274f1c6

add xfail

4c38a53

lucyleeow commented May 28, 2025

lucyleeow added 4 commits May 29, 2025 11:13

add no implemented error

d125a25

fix xfail

3c4ba9c

add comment

27e049b

add back

bb81083

OmarManzoor reviewed May 29, 2025

use numpy for strict

d69e07a

merge main

fe339dd

OmarManzoor reviewed May 29, 2025

lucyleeow added 8 commits May 29, 2025 19:55

fix numpy as array

8c3b86f

use xpx is torch namespace

b71074a

fix

74b55ce

review

94472f6

fix test

a7fe344

fix

166bd8d

fix test

3e80603

namespace check

c55d44f

OmarManzoor reviewed May 29, 2025

lucyleeow added 2 commits May 30, 2025 11:09

fix median

232d1b0

fix med output

150c14c

OmarManzoor reviewed May 30, 2025

Update sklearn/utils/_array_api.py

b1dcc53

OmarManzoor added CUDA CI and removed Needs Decision Requires decision labels May 30, 2025

github-actions bot removed the CUDA CI label May 30, 2025

OmarManzoor approved these changes May 30, 2025

ogrisel approved these changes Jun 2, 2025

lucyleeow added 2 commits June 2, 2025 21:40

review

de7ccbb

review

15b8d23

betatim added the CUDA CI label Jun 3, 2025

github-actions bot removed the CUDA CI label Jun 3, 2025

betatim merged commit 5c21794 into scikit-learn:main Jun 3, 2025
40 checks passed

lucyleeow deleted the aapi_med_abs_er branch June 4, 2025 00:20

