8000 [RFC] Support for int64 indexed SciPy sparse matrices in Cython code · Issue #23653 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[RFC] Support for int64 indexed SciPy sparse matrices in Cython code #23653

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ogrisel opened this issue Jun 16, 2022 · 7 comments
Open

[RFC] Support for int64 indexed SciPy sparse matrices in Cython code #23653

ogrisel opened this issue Jun 16, 2022 · 7 comments
Labels

Comments

@ogrisel
Copy link
Member
ogrisel commented Jun 16, 2022

At the moment we do not have systematic support for very large sparse matrices in our Cython code. That would be useful when the data is passed as a sparse matrix with more than ~2e9 columns or non-zero values.

The purpose of this issue is to link:

  • reference all related issues in scikit-learn.
  • decide if we want to have some uniform support guarantees or not
  • decide if we need centralized Cython tooling (e.g. type declarations, tempita conventions) to add support for such matrices.

Related issues and PRs (feel free to update this list):

For polynomial feature expansion (quite popular request):

Other models with open issues:

Other Cython estimators that could also be updated:

  • neighbors models (k-NN and radius-based models)
  • k-means & variants
  • Feature Hasher / Hashing Vectorizer (sklearn/feature_extraction/_hashing_fast.pyx)

The following PR will introduce a scikit-learn transformer that can output int64 indexed sparse matrices (even if it's input is int32 indexed).

Helpful Python snippet

SciPy decides to use the int32 or int64 dtype depending on the dimensions of the matrix and on the number of stored non-zero elements. Here is a quick way to generate a CSR matrix that requires int64-typed .indices and .indptr attributes:

>>> from scipy.sparse import csr_matrix
>>> import numpy as np
>>>
>>> X = csr_matrix(([1.0], [np.iinfo(np.int32).max + 1], [0, 1]))
>>> X
<1x2147483649 sparse matrix of type '<class 'numpy.float64'>'
        with 1 stored elements in Compressed Sparse Row format>
>>> X.indices
array([2147483648])
>>> X.indices.dtype
dtype('int64')
>>> X.indptr.dtype
dtype('int64')
@ogrisel
Copy link
Member Author
ogrisel commented Jun 16, 2022

decide if we want to have some uniform support guarantees or not

I see no valid reason to reject large sparse matrix support for some estimators and not for others.

For instance, all estimators that support sparse high dim inputs with 10_000 dims could legitimately (attempt to) work with a large number of rows in X_train or X_test in which case they can face this problem quite quickly (e.g. with 200_000 data points).

So we could invest some effort to introduce a common test for int64 indexed sparse matrix support and use a meta-issue to track progress in supporting this consistently throughout the code base.

decide if we need centralized Cython tooling (e.g. type declarations, tempita conventions) to add support for such matrices.

We could at least define a convention for the name of the typedef to be used in Cython code that manipulates such index arrays. For instance:

ctypedef cnp.int32_t SPARSE_INDEX_t

in existing code that only support int32 indices and later upgrade this with tempita or fused types in dedicated PRs.

@Micky774
Copy link
Contributor
Micky774 commented Jun 16, 2022

I like the idea of adopting some kind of templating for this. Considering, e.g. #16803 which I think would be much simpler using templating than fused types imo.

Edit: Actually I think it's fine as-is, templating wasn't needed there.

@ogrisel
Copy link
Member Author
ogrisel commented Jun 16, 2022

#1680 seems unrelated.

@Micky774
Copy link
Contributor

#1680 seems unrelated.

Typo on my part, corrected now -- sorry 😅

@ogrisel
Copy link
Member Author
ogrisel commented Jun 17, 2022

Alright, #16803 is part of the list of linked issues.

@jjerphan
Copy link
Member
jjerphan commented Aug 4, 2022

@ogrisel
Copy link
Member Author
ogrisel commented Mar 29, 2023

Note that #25942 intentionally reverted the SPARSE_INDEX_t type alias in favor of type names with explicit precision levels such as int32_t instead.

For estimators that have Cython code that should support int32_t and int64_t indexed inputs, we can use ad-hoc fused types or use tempita templates if Cython class attributes are needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants
0