FIX test_csr_polynomial_expansion_index_overflow on [scipy-dev] by ogrisel · Pull Request #30393 · scikit-learn/scikit-learn · GitHub

FIX test_csr_polynomial_expansion_index_overflow on [scipy-dev] #30393

Conversation

@ogrisel (Member) commented Dec 2, 2024

Fixes #30315, #30377 and partially #30378.

We probably want to backport this to 1.6.X.

@ogrisel added this to the 1.6 milestone Dec 2, 2024
github-actions bot commented Dec 2, 2024

✔️ Linting Passed: all linting checks passed. Generated for commit d949791.

@ogrisel (Member, Author) commented Dec 2, 2024

Note that I do not fully understand what's going on in scipy/scipy#21824. I minimally changed our test so that it would pass with both the old and the new scipy behaviors.

EDIT: fixed the PR number.

@jeremiedbb added the To backport label (PR merged in master that needs a backport to a release branch defined based on the milestone) Dec 2, 2024
@jeremiedbb (Member) commented:

> Note that I do not fully understand what's going on in #21824

The PR number seems wrong

@jeremiedbb (Member) commented:

I don't understand what's different from the previous version. Looking at the diff I feel like we're expecting the same thing as before. Can you expand a bit?

@ogrisel (Member, Author) commented Dec 3, 2024

> I don't understand what's different from the previous version. Looking at the diff I feel like we're expecting the same thing as before. Can you expand a bit?

It's checking for greater than or equal rather than strictly equal. With scipy/scipy#21824, scipy.sparse.hstack can use int64 in cases where it previously used int32. I am not sure whether this is a regression in scipy or not.
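
For illustration, here is a minimal sketch of that kind of relaxed check (hypothetical code, not the actual test diff): accept any index dtype at least as wide as the expected one, so the assertion passes whether scipy keeps int32 or promotes to int64 when hstacking.

import numpy as np
from scipy import sparse

# Hypothetical illustration, not the actual test change: instead of requiring
# the exact expected index dtype, accept any dtype at least as wide, so the
# assertion holds whether scipy downcasts to int32 or preserves int64.
X_trans = sparse.hstack(
    [sparse.csr_array(np.eye(3)), sparse.csr_array(np.eye(3))], format="csr"
)
expected_dtype = np.int32
assert np.iinfo(X_trans.indices.dtype).max >= np.iinfo(expected_dtype).max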

@lesteve (Member) commented Dec 3, 2024

This seems like a reasonable change in scipy dev: if you hstack two scipy csr_arrays with int64 indices, you should get a csr_array with int64 indices. Until a few days ago, when scipy/scipy#21824 was merged, the hstacked csr_array indices would be int32 whenever possible.

My general understanding is that scipy.sparse was trying to use int32 indices as much as possible by downcasting, but scipy 1.11 started to preserve int64 indices instead; see scipy/scipy#18509 (comment) for more details.

With the following snippet:

import numpy as np

import scipy
print(scipy.__version__)
from scipy import sparse

X_ind_32bit = sparse.csr_array([[1, 0, 0], [0, 1, 0], [0, 1, 0]])
X_ind_64bit = X_ind_32bit.copy()
X_ind_64bit.indptr = X_ind_64bit.indptr.astype(np.int64)
X_ind_64bit.indices = X_ind_64bit.indices.astype(np.int64)
print('hstack int32 int64:', sparse.hstack([X_ind_32bit, X_ind_64bit]).indptr.dtype)
print('hstack int64 int64:', sparse.hstack([X_ind_64bit, X_ind_64bit]).indptr.dtype)

Recent scipy-dev version output:

1.15.0.dev0+git20241202.82ee025
hstack int32 int64: int64
hstack int64 int64: int64

scipy latest release 1.14.1 output:

1.14.1
hstack int32 int64: int32
hstack int64 int64: int32

I have to say I haven't managed to figure out how to adapt our test yet. The output indices dtype of PolynomialFeatures.fit_transform depends on its parameters; for example, the test only fails (we get int64 indices instead of the expected 32-bit indices) with:

  • include_bias=False, interaction_only=True, n_degree=2, n_features=65535
  • include_bias=False, interaction_only=True, n_degree=3, n_features=2344

We are kind of testing that we don't really need int64 in this case, since int32 indices would be enough, except that we end up hstacking a csr_array with int64 indices (X itself) and a csr_matrix with int32 indices coming from _create_expansion (Cython code). hstack does not do any downcasting any more, so we get int64 indices.
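
To make this concrete, here is a minimal sketch of the first failing parametrization as I read it (assumed construction, not the exact test code); per the analysis above it should show int64 indices on scipy dev and int32 on scipy 1.14.1.

from scipy import sparse
from sklearn.preprocessing import PolynomialFeatures

# Sketch of the first failing parametrization described above (assumed
# construction, not copied from the test): one non-zero entry in the last
# column, expanded with degree=2, interaction_only=True, include_bias=False.
n_features = 65535
X = sparse.csr_array(([1.0], ([0], [n_features - 1])), shape=(1, n_features))
X_trans = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False
).fit_transform(X)
# Per the analysis above: int64 on scipy dev (hstack no longer downcasts),
# int32 on scipy 1.14.1.
print(X_trans.indices.dtype)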

@lesteve (Member) commented Dec 3, 2024

On top of this (as if this wasn't complicated enough), with hstack on a mix of sparse arrays and sparse matrices, the first item in the list seems to dictate the behaviour, as the snippet below shows:

  • if the first item is a sparse array => 64-bit indices are preserved
  • if the first item is a sparse matrix => indices are downcast to 32-bit

import numpy as np

import scipy
print(scipy.__version__)
from scipy import sparse

X_ind_32bit = sparse.csr_array([[1, 0, 0], [0, 1, 0], [0, 1, 0]])
X_matrix_ind_32bit = sparse.csr_matrix(X_ind_32bit.copy())
X_ind_64bit = X_ind_32bit.copy()
X_ind_64bit.indptr = X_ind_64bit.indptr.astype(np.int64)
X_ind_64bit.indices = X_ind_64bit.indices.astype(np.int64)
X_matrix_ind_64bit = sparse.csr_matrix(X_ind_64bit.copy())
print('array int32 + matrix int64:', sparse.hstack([X_ind_32bit, X_matrix_ind_64bit]).indptr.dtype)
print('array int64 + matrix int32:', sparse.hstack([X_ind_64bit, X_matrix_ind_32bit]).indptr.dtype)
print('matrix int64 + array int32:', sparse.hstack([X_matrix_ind_64bit, X_ind_32bit]).indptr.dtype)
print('matrix int32 + array int64:', sparse.hstack([X_matrix_ind_32bit, X_ind_64bit]).indptr.dtype)

Output on scipy-dev:

1.15.0.dev0+git20241202.82ee025
array int32 + matrix int64: int64
array int64 + matrix int32: int64
matrix int64 + array int32: int32
matrix int32 + array int64: int32

There is an additional complication (maybe a slight variation of the same behaviour, when using Python integers as indices): sparse.csr_array(([1.], ([0], [0]))) uses 64-bit indices whereas sparse.csr_matrix(([1.], ([0], [0]))) uses 32-bit indices.
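
As a runnable version of that observation:

from scipy import sparse

# Runnable version of the observation above: with Python integer indices,
# csr_array keeps 64-bit indices while csr_matrix downcasts to 32-bit.
print(sparse.csr_array(([1.0], ([0], [0]))).indices.dtype)   # int64
print(sparse.csr_matrix(([1.0], ([0], [0]))).indices.dtype)  # int32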

So then you look at the code, and it seems quite tricky to figure out what the output indices dtype will be. The relevant code:

if sparse.issparse(X) and X.format == "csr":
    if self._max_degree > 3:
        return self.transform(X.tocsc()).tocsr()
    to_stack = []
    if self.include_bias:
        to_stack.append(
            sparse.csr_matrix(np.ones(shape=(n_samples, 1), dtype=X.dtype))
        )
    if self._min_degree <= 1 and self._max_degree > 0:
        to_stack.append(X)
    cumulative_size = sum(mat.shape[1] for mat in to_stack)
    for deg in range(max(2, self._min_degree), self._max_degree + 1):
        expanded = _create_expansion(
            X=X,
            interaction_only=self.interaction_only,
            deg=deg,
            n_features=n_features,
            cumulative_size=cumulative_size,
        )

The test failure happens when X is a csr_array with 64-bit indices and include_bias=False: in this case X is the first element in to_stack, so its 64-bit indices are preserved and the output ends up 64-bit where we expect 32-bit, because 32-bit indices would be enough in principle.
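
Here is a small sketch of that ordering effect with a bias-like first column, mimicking what include_bias=True appends in the code above (assumed construction; dtypes as observed in the scipy-dev outputs earlier in this thread):

import numpy as np
from scipy import sparse

# Sketch of the to_stack ordering effect: a freshly built csr_matrix of ones
# (int32 indices), as appended by include_bias=True, versus a csr_array X
# with int64 indices coming first.
X = sparse.csr_array(([1.0], ([0], [2])), shape=(1, 3))
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)
bias = sparse.csr_matrix(np.ones(shape=(1, 1), dtype=X.dtype))

# Dtypes as observed on scipy dev in the outputs above:
print('matrix first:', sparse.hstack([bias, X], format="csr").indices.dtype)  # int32
print('array first:', sparse.hstack([X, bias], format="csr").indices.dtype)   # int64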

@ogrisel (Member, Author) commented Dec 3, 2024

Thanks for your investigation @lesteve. But in the end, do you think we can fix the test the way I did in this PR? Or shall we do something else? Or do you think the surprisingly asymmetric scipy behavior should be reported upstream?

@lesteve (Member) commented Dec 4, 2024

Looks like using int32 indices where possible when creating X in the test makes the test pass. Maybe this is what respects the original test intent the most?

diff --git a/sklearn/preprocessing/tests/test_polynomial.py b/sklearn/preprocessing/tests/test_polynomial.py
index b97500d43e..29546dff30 100644
--- a/sklearn/preprocessing/tests/test_polynomial.py
+++ b/sklearn/preprocessing/tests/test_polynomial.py
@@ -1050,8 +1050,10 @@ def test_csr_polynomial_expansion_index_overflow(
     `scipy.sparse.hstack`.
     """
     data = [1.0]
-    row = [0]
-    col = [n_features - 1]
+    # Use int32 indices when we can
+    indices_dtype = np.int32 if n_features - 1 <= np.iinfo(np.int32).max else np.int64
+    row = np.array([0], dtype=indices_dtype)
+    col = np.array([n_features - 1], dtype=indices_dtype)
 
     # First degree index
     expected_indices = [
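
As a quick sanity check (not part of the diff), the proposed dtype selection picks int32 for both failing parametrizations, since their largest column index fits comfortably in the int32 range:

import numpy as np

# Not part of the diff: check the proposed dtype selection for the two
# failing parametrizations (n_features=65535 and n_features=2344).
for n_features in (65535, 2344):
    indices_dtype = (
        np.int32 if n_features - 1 <= np.iinfo(np.int32).max else np.int64
    )
    print(n_features, indices_dtype)  # np.int32 in both cases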

@ogrisel (Member, Author) commented Dec 4, 2024

Let's do this, then. I will update the PR.

@lesteve (Member) commented Dec 4, 2024

I pushed some commits with the proposed change, let's see what the CI has to say about it.

@lesteve (Member) left a review comment:

CI is happy

@ogrisel (Member, Author) left a review comment:

+1 as well.

@ogrisel merged commit 0e0df36 into scikit-learn:main Dec 4, 2024
31 checks passed
@ogrisel deleted the fix-test_csr_polynomial_expansion_index_overflow branch December 4, 2024 13:23
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Dec 4, 2024
jeremiedbb pushed a commit that referenced this pull request Dec 6, 2024
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
virchan pushed a commit to virchan/scikit-learn that referenced this pull request Dec 9, 2024
Labels
Bug, Build / CI, module:preprocessing, module:test-suite, No Changelog Needed, To backport
Projects
None yet
Development

Successfully merging this pull request may close these issues.

⚠️ CI failed on Linux_Nightly.pylatest_pip_scipy_dev (last failure: Dec 05, 2024) ⚠️
3 participants