FIX test_csr_polynomial_expansion_index_overflow on [scipy-dev] by ogrisel · Pull Request #30393 · scikit-learn/scikit-learn · GitHub

FIX test_csr_polynomial_expansion_index_overflow on [scipy-dev] #30393

Conversation

@ogrisel (Member) commented Dec 2, 2024

Fixes #30315, #30377 and partially #30378.

We probably want to backport this to 1.6.X.

@ogrisel added this to the 1.6 milestone Dec 2, 2024
github-actions bot commented Dec 2, 2024

✔️ Linting Passed: all linting checks passed. Generated for commit d949791.

@ogrisel (Member, Author) commented Dec 2, 2024

Note that I do not fully understand what's going on in scipy/scipy#21824. I minimally changed our test so that it would pass with both the old and the new scipy behaviors.

EDIT: fixed the PR number.

@jeremiedbb added the To backport label (PR merged in master that needs a backport to a release branch defined based on the milestone) Dec 2, 2024
@jeremiedbb (Member) commented:

> Note that I do not fully understand what's going on in #21824

The PR number seems wrong

@jeremiedbb (Member) commented:

I don't understand what's different from the previous version. Looking at the diff I feel like we're expecting the same thing as before. Can you expand a bit?

@ogrisel (Member, Author) commented Dec 3, 2024

> I don't understand what's different from the previous version. Looking at the diff I feel like we're expecting the same thing as before. Can you expand a bit?

It's checking for greater than or equal rather than strictly equal. With scipy/scipy#21824, scipy.sparse.hstack can use int64 in cases where it previously used int32. I am not sure whether this is a regression in scipy or not.
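
For illustration, here is a minimal sketch of that kind of relaxed check (hypothetical code, not the actual test diff): accept any index dtype at least as wide as the expected one, so the assertion passes whether scipy keeps int32 or promotes to int64 when hstacking.

import numpy as np
from scipy import sparse

# Hypothetical illustration, not the actual test change: instead of requiring
# the exact expected index dtype, accept any dtype at least as wide, so the
# assertion holds whether scipy downcasts to int32 or preserves int64.
X_trans = sparse.hstack(
    [sparse.csr_array(np.eye(3)), sparse.csr_array(np.eye(3))], format="csr"
)
expected_dtype = np.int32
assert np.iinfo(X_trans.indices.dtype).max >= np.iinfo(expected_dtype).max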

@lesteve (Member) commented Dec 3, 2024

This seems like a reasonable change in scipy dev: if you hstack two scipy csr_arrays with int64 indices, you should get a csr_array with int64 indices. Until a few days ago, when scipy/scipy#21824 was merged, the hstacked csr_array indices would be int32 whenever possible.

My general understanding is that scipy.sparse was trying to use int32 indices as much as possible by downcasting, but scipy 1.11 started to preserve int64 indices instead; see scipy/scipy#18509 (comment) for more details.

With the following snippet:

import numpy as np

import scipy
print(scipy.__version__)
from scipy import sparse

X_ind_32bit = sparse.csr_array([[1, 0, 0], [0, 1, 0], [0, 1, 0]])
X_ind_64bit = X_ind_32bit.copy()
X_ind_64bit.indptr = X_ind_64bit.indptr.astype(np.int64)
X_ind_64bit.indices = X_ind_64bit.indices.astype(np.int64)
print('hstack int32 int64:', sparse.hstack([X_ind_32bit, X_ind_64bit]).indptr.dtype)
print('hstack int64 int64:', sparse.hstack([X_ind_64bit, X_ind_64bit]).indptr.dtype)

Recent scipy-dev version output:

1.15.0.dev0+git20241202.82ee025
hstack int32 int64: int64
hstack int64 int64: int64

scipy latest release 1.14.1 output:

1.14.1
hstack int32 int64: int32
hstack int64 int64: int32

I have to say I haven't managed to figure out how to adapt our test yet. The output indices dtype of PolynomialFeatures.fit_transform depends on its parameters; for example, the test only fails (we get int64 indices instead of the expected 32-bit indices) with:

  • include_bias=False, interaction_only=True, n_degree=2, n_features=65535
  • include_bias=False, interaction_only=True, n_degree=3, n_features=2344

We are kind of testing that we don't really need int64 in this case, since int32 indices would be enough, except that we end up hstacking a csr_array with int64 indices (X itself) and a csr_matrix with int32 indices coming from _create_expansion (Cython code). hstack does not do any downcasting any more, so we get int64 indices.
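
To make this concrete, here is a minimal sketch of the first failing parametrization as I read it (assumed construction, not the exact test code); per the analysis above it should show int64 indices on scipy dev and int32 on scipy 1.14.1.

from scipy import sparse
from sklearn.preprocessing import PolynomialFeatures

# Sketch of the first failing parametrization described above (assumed
# construction, not copied from the test): one non-zero entry in the last
# column, expanded with degree=2, interaction_only=True, include_bias=False.
n_features = 65535
X = sparse.csr_array(([1.0], ([0], [n_features - 1])), shape=(1, n_features))
X_trans = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False
).fit_transform(X)
# Per the analysis above: int64 on scipy dev (hstack no longer downcasts),
# int32 on scipy 1.14.1.
print(X_trans.indices.dtype)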

@lesteve (Member) commented Dec 3, 2024

On top of this (as if this wasn't complicated enough), with hstack on a mix of sparse arrays and sparse matrices, the first item in the list seems to dictate the behaviour, as the snippet below shows:

  • if the first item is a sparse array => 64-bit indices are preserved
  • if the first item is a sparse matrix => indices are downcast to 32-bit

import numpy as np

import scipy
print(scipy.__version__)
from scipy import sparse

X_ind_32bit = sparse.csr_array([[1, 0, 0], [0, 1, 0], [0, 1, 0]])
X_matrix_ind_32bit = sparse.csr_matrix(X_ind_32bit.copy())
X_ind_64bit = X_ind_32bit.copy()
X_ind_64bit.indptr = X_ind_64bit.indptr.astype(np.int64)
X_ind_64bit.indices = X_ind_64bit.indices.astype(np.int64)
X_matrix_ind_64bit = sparse.csr_matrix(X_ind_64bit.copy())
print('array int32 + matrix int64:', sparse.hstack([X_ind_32bit, X_matrix_ind_64bit]).indptr.dtype)
print('array int64 + matrix int32:', sparse.hstack([X_ind_64bit, X_matrix_ind_32bit]).indptr.dtype)
print('matrix int64 + array int32:', sparse.hstack([X_matrix_ind_64bit, X_ind_32bit]).indptr.dtype)
print('matrix int32 + array int64:', sparse.hstack([X_matrix_ind_32bit, X_ind_64bit]).indptr.dtype)

Output on scipy-dev:

1.15.0.dev0+git20241202.82ee025
array int32 + matrix int64: int64
array int64 + matrix int32: int64
matrix int64 + array int32: int32
matrix int32 + array int64: int32

There is an additional complication (maybe a slight variation of the same behaviour, when using Python integers as indices): sparse.csr_array(([1.], ([0], [0]))) uses 64-bit indices whereas sparse.csr_matrix(([1.], ([0], [0]))) uses 32-bit indices.
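
As a runnable version of that observation:

from scipy import sparse

# Runnable version of the observation above: with Python integer indices,
# csr_array keeps 64-bit indices while csr_matrix downcasts to 32-bit.
print(sparse.csr_array(([1.0], ([0], [0]))).indices.dtype)   # int64
print(sparse.csr_matrix(([1.0], ([0], [0]))).indices.dtype)  # int32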

So then you look at the code, and it seems quite tricky to figure out what the output indices dtype will be. The relevant code:

if sparse.issparse(X) and X.format == "csr":
    if self._max_degree > 3:
        return self.transform(X.tocsc()).tocsr()
    to_stack = []
    if self.include_bias:
        to_stack.append(
            sparse.csr_matrix(np.ones(shape=(n_samples, 1), dtype=X.dtype))
        )
    if self._min_degree <= 1 and self._max_degree > 0:
        to_stack.append(X)
    cumulative_size = sum(mat.shape[1] for mat in to_stack)
    for deg in range(max(2, self._min_degree), self._max_degree + 1):
        expanded = _create_expansion(
            X=X,
            interaction_only=self.interaction_only,
            deg=deg,
            n_features=n_features,
            cumulative_size=cumulative_size,
        )

The test failure happens when X is a csr_array with 64-bit indices and include_bias=False: in this case X is the first element in to_stack, so its 64-bit indices are preserved and the output ends up 64-bit where we expect 32-bit, because 32-bit indices would be enough in principle.
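
Here is a small sketch of that ordering effect with a bias-like first column, mimicking what include_bias=True appends in the code above (assumed construction; dtypes as observed in the scipy-dev outputs earlier in this thread):

import numpy as np
from scipy import sparse

# Sketch of the to_stack ordering effect: a freshly built csr_matrix of ones
# (int32 indices), as appended by include_bias=True, versus a csr_array X
# with int64 indices coming first.
X = sparse.csr_array(([1.0], ([0], [2])), shape=(1, 3))
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)
bias = sparse.csr_matrix(np.ones(shape=(1, 1), dtype=X.dtype))

# Dtypes as observed on scipy dev in the outputs above:
print('matrix first:', sparse.hstack([bias, X], format="csr").indices.dtype)  # int32
print('array first:', sparse.hstack([X, bias], format="csr").indices.dtype)   # int64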

@ogrisel (Member, Author) commented Dec 3, 2024

Thanks for your investigation @lesteve. But in the end, do you think we can fix the test the way I did in this PR? Or shall we do something else? Or do you think the surprisingly asymmetric scipy behavior should be reported upstream?

@lesteve (Member) commented Dec 4, 2024

Looks like using int32 indices where possible when creating X in the test makes the test pass. Maybe this is what respects the original test intent the most?

diff --git a/sklearn/preprocessing/tests/test_polynomial.py b/sklearn/preprocessing/tests/test_polynomial.py
index b97500d43e..29546dff30 100644
--- a/sklearn/preprocessing/tests/test_polynomial.py
+++ b/sklearn/preprocessing/tests/test_polynomial.py
@@ -1050,8 +1050,10 @@ def test_csr_polynomial_expansion_index_overflow(
     `scipy.sparse.hstack`.
     """
     data = [1.0]
-    row = [0]
-    col = [n_features - 1]
+    # Use int32 indices when we can
+    indices_dtype = np.int32 if n_features - 1 <= np.iinfo(np.int32).max else np.int64
+    row = np.array([0], dtype=indices_dtype)
+    col = np.array([n_features - 1], dtype=indices_dtype)
 
     # First degree index
     expected_indices = [
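
As a quick sanity check (not part of the diff), the proposed dtype selection picks int32 for both failing parametrizations, since their largest column index fits comfortably in the int32 range:

import numpy as np

# Not part of the diff: check the proposed dtype selection for the two
# failing parametrizations (n_features=65535 and n_features=2344).
for n_features in (65535, 2344):
    indices_dtype = (
        np.int32 if n_features - 1 <= np.iinfo(np.int32).max else np.int64
    )
    print(n_features, indices_dtype)  # np.int32 in both cases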

@ogrisel (Member, Author) commented Dec 4, 2024

Let's do this, then. I will update the PR.

@lesteve (Member) commented Dec 4, 2024

I pushed some commits with the proposed change, let's see what the CI has to say about it.

@lesteve (Member) left a review comment:

CI is happy

@ogrisel (Member, Author) left a review comment:

+1 as well.

@ogrisel merged commit 0e0df36 into scikit-learn:main Dec 4, 2024
31 checks passed
@ogrisel deleted the fix-test_csr_polynomial_expansion_index_overflow branch December 4, 2024 13:23
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Dec 4, 2024
jeremiedbb pushed a commit that referenced this pull request Dec 6, 2024
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
virchan pushed a commit to virchan/scikit-learn that referenced this pull request Dec 9, 2024
Labels
Bug, Build / CI, module:preprocessing, module:test-suite, No Changelog Needed, To backport
Projects
None yet
Development

Successfully merging this pull request may close these issues.

⚠️ CI failed on Linux_Nightly.pylatest_pip_scipy_dev (last failure: Dec 05, 2024) ⚠️
3 participants