[RFC] Support for int64 indexed SciPy sparse matrices in Cython code

At the moment we do not have systematic support for very large sparse matrices in our Cython code. Such support would be useful when the data is passed as a sparse matrix with more than ~2e9 columns or non-zero values, i.e. beyond what int32 indices can address.
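To make the threshold concrete, here is a minimal sketch of when int32 indices stop being sufficient for a CSR matrix. `needs_int64_indices` is a hypothetical helper written for this issue, not an existing scikit-learn or SciPy function, and SciPy's own rule for picking the index dtype may be more conservative than this strict check.

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max  # 2**31 - 1


def needs_int64_indices(n_cols, nnz):
    """Hypothetical helper: True if a CSR matrix with `n_cols` columns and
    `nnz` stored values cannot be addressed with int32 indices.

    Column indices go up to n_cols - 1 and indptr entries go up to nnz.
    """
    return (n_cols - 1) > INT32_MAX or nnz > INT32_MAX
```

Either condition alone is enough: a single stored value in a very wide matrix, or a narrow matrix with a huge number of stored values.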

The purpose of this issue is to:

- reference all related issues in scikit-learn;
- decide whether we want uniform support guarantees;
- decide whether we need centralized Cython tooling (e.g. type declarations, Tempita conventions) to add support for such matrices.

Related issues and PRs (feel free to update this list):

For polynomial feature expansion (quite popular request):

Other models with open issues:

Other Cython estimators that could also be updated:

- neighbors models (k-NN and radius-based models)
  - related issue, not only about this problem: FEA CSR support for all DistanceMetric #23604
- k-means & variants
- Feature Hasher / Hashing Vectorizer (sklearn/feature_extraction/_hashing_fast.pyx)

The following PR will introduce a scikit-learn transformer that can output int64-indexed sparse matrices (even if its input is int32 indexed):

ENH Allow for appropriate dtype us in preprocessing.PolynomialFeatures for sparse matrices #23731
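A rough sketch of why polynomial expansion can need int64 output indices even when the input is comfortably int32 indexed. It assumes a full (not interaction-only) expansion without the bias column, for which the number of output columns is the number of monomials of degree 1 to `degree`. `n_polynomial_outputs` is a hypothetical helper for illustration, not the code used in the PR.

```python
from math import comb

import numpy as np

INT32_MAX = np.iinfo(np.int32).max


def n_polynomial_outputs(n_features, degree):
    """Number of monomials of degree 1..degree in n_features variables.

    Assumption: full expansion, no bias column.
    """
    return comb(n_features + degree, degree) - 1


n_in = 65_536  # input column indices fit easily in int32
n_out = n_polynomial_outputs(n_in, degree=2)

# The output column count no longer fits in int32.
print(n_out > INT32_MAX)
```

So a degree-2 expansion of a matrix with only 65,536 columns already produces column indices outside the int32 range.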

Helpful Python snippet

SciPy decides to use the int32 or int64 dtype depending on the dimensions of the matrix and on the number of stored non-zero elements. Here is a quick way to generate a CSR matrix that requires int64-typed .indices and .indptr attributes:

>>> from scipy.sparse import csr_matrix
>>> import numpy as np
>>>
>>> X = csr_matrix(([1.0], [np.iinfo(np.int32).max + 1], [0, 1]))
>>> X
<1x2147483649 sparse matrix of type '<class 'numpy.float64'>'
        with 1 stored elements in Compressed Sparse Row format>
>>> X.indices
array([2147483648])
>>> X.indices.dtype
dtype('int64')
>>> X.indptr.dtype
dtype('int64')
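Until Cython code handles both index dtypes, one possible stop-gap is a Python-level guard in front of int32-only routines. The sketch below is an assumption about what such a guard could look like; `as_int32_indexed` is a hypothetical name, not an existing scikit-learn utility.

```python
import numpy as np
from scipy.sparse import csr_matrix

INT32_MAX = np.iinfo(np.int32).max


def as_int32_indexed(X):
    """Hypothetical guard for code that only handles int32 indices.

    Returns X unchanged if it is already int32 indexed, a re-indexed copy
    of the index arrays if all values fit in int32, and raises ValueError
    otherwise.
    """
    if X.indices.dtype == np.int32 and X.indptr.dtype == np.int32:
        return X
    max_index = X.indices.max() if X.indices.size else 0
    if max(max_index, X.indptr[-1], max(X.shape)) > INT32_MAX:
        raise ValueError("X requires int64 indices; only int32 is supported.")
    return csr_matrix(
        (X.data, X.indices.astype(np.int32), X.indptr.astype(np.int32)),
        shape=X.shape,
    )
```

Applied to the matrix `X` built above, this guard would raise, since its single column index does not fit in int32.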

[RFC] Support for int64 indexed SciPy sparse matrices in Cython code #23653
