FEA Add array API support for GaussianMixture #30777
Conversation
Regarding the confusing error: from the build log, the error was:
GaussianMixture is ready for a first round of review 🎉!
Thank you for the PR @lesteve and @StefanieSenger
```diff
 elif self.init_params == "random_from_data":
-    resp = np.zeros((n_samples, self.n_components), dtype=X.dtype)
+    resp = xp.zeros(
```
How about initializing this as a NumPy array and only converting it to the xp array after the indexing part is done? This would allow us to remove the slow loop that might otherwise run on the GPU.
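A minimal sketch of that idea, assuming `xp` and `device` are the array namespace and device inferred from `X` (the helper name and exact signature below are illustrative, not the PR's actual code):

```python
import numpy as np


def init_resp_via_numpy(X, n_components, random_state, xp, device):
    """Build one-hot responsibilities on the CPU, then transfer once."""
    n_samples = X.shape[0]
    resp = np.zeros((n_samples, n_components), dtype=np.float64)
    indices = random_state.choice(n_samples, size=n_components, replace=False)
    # vectorized fancy indexing, cheap on the CPU
    resp[indices, np.arange(n_components)] = 1.0
    # single host-to-device copy at the end
    return xp.asarray(resp, device=device)
```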
That's a great idea, I think.
I guess this is an option, but it's not obvious to me which option is preferable:
- doing the indexing on the CPU and then moving the array to the GPU
- creating the array on the GPU and using a for loop for the indexing
It likely depends on the shape of the data as well.
Something I did not think of, though, is that in the NumPy case we are potentially making things slower. But maybe the initialization is rarely the bottleneck; this would need to be looked at in more detail ...
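For contrast, a sketch of the other option under the same assumptions: create the array directly in the `xp` namespace (possibly on a GPU) and fill it with a Python loop, since assignment via multiple integer index arrays is not guaranteed by the array API standard (again illustrative, not the PR's actual code):

```python
def init_resp_on_device(X, n_components, random_state, xp, device):
    """Build one-hot responsibilities directly on the target device."""
    n_samples = X.shape[0]
    resp = xp.zeros((n_samples, n_components), dtype=xp.float64, device=device)
    indices = random_state.choice(n_samples, size=n_components, replace=False)
    for col, row in enumerate(indices):
        # one scalar write per component; on a GPU each write may launch
        # its own kernel, which is what can make this loop slow
        resp[int(row), col] = 1.0
    return resp
```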
I think we can leave the loop as it is.
I did some benchmarking on a Kaggle kernel; here are a few results. These are not averages over multiple iterations, and each time I ran one of these I restarted the kernel, because on subsequent iterations the timings consistently improve.
With n_samples, n_components = 10000, 100:
- Torch initialization took: 0.2210385799407959 sec, 0.25275325775146484 sec
- Conversion from numpy took: 0.18267560005187988 sec, 0.18964171409606934 sec

With n_samples, n_components = 100000, 1000:
- Torch initialization took: 0.25449371337890625 sec, 0.24976301193237305 sec
- Conversion from numpy took: 0.5824990272521973 sec, 0.5989353656768799 sec
I think the complete transfer of large arrays from the CPU to the GPU is costly, and here all we are doing inside the loop is a single assignment with no computation. So overall this seems to be okay.
What do you think, @lesteve?
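For reference, a rough sketch of the kind of timing comparison described above, assuming PyTorch with a CUDA device; this is not the exact Kaggle benchmark code and the numbers will vary across machines:

```python
import time

import numpy as np
import torch

n_samples, n_components = 100_000, 1_000
rng = np.random.default_rng(0)
indices = rng.choice(n_samples, size=n_components, replace=False)

# Option 1: create the array on the GPU and fill it with a loop.
torch.cuda.synchronize()
t0 = time.perf_counter()
resp = torch.zeros((n_samples, n_components), dtype=torch.float64, device="cuda")
for col, row in enumerate(indices):
    resp[int(row), col] = 1.0
torch.cuda.synchronize()
print("Torch initialization took:", time.perf_counter() - t0, "sec")

# Option 2: build the array with NumPy on the CPU, then transfer once.
torch.cuda.synchronize()
t0 = time.perf_counter()
resp_np = np.zeros((n_samples, n_components), dtype=np.float64)
resp_np[indices, np.arange(n_components)] = 1.0
resp_from_np = torch.asarray(resp_np, device="cuda")
torch.cuda.synchronize()
print("Conversion from numpy took:", time.perf_counter() - t0, "sec")
```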
Working on it with @StefanieSenger.
Link to TODO