[RFC] Support for int64 indexed SciPy sparse matrices in Cython code #23653

@ogrisel

At the moment we do not have systematic support for very large sparse matrices in our Cython code. Such support would be useful when the data is passed as a sparse matrix with more than ~2e9 columns or non-zero values, that is, more than the int32 maximum of 2**31 - 1.

The purpose of this issue is to:

  • reference all related issues in scikit-learn
  • decide if we want to have some uniform support guarantees or not
  • decide if we need centralized Cython tooling (e.g. type declarations, tempita conventions) to add support for such matrices.

Related issues and PRs (feel free to update this list):

For polynomial feature expansion (quite popular request):

Other models with open issues:

Other Cython estimators that could also be updated:

  • neighbors models (k-NN and radius-based models)
  • k-means & variants
  • Feature Hasher / Hashing Vectorizer (sklearn/feature_extraction/_hashing_fast.pyx)

The following PR will introduce a scikit-learn transformer that can output int64 indexed sparse matrices (even if its input is int32 indexed).

Helpful Python snippet

SciPy decides to use the int32 or int64 dtype depending on the dimensions of the matrix and on the number of stored non-zero elements. Here is a quick way to generate a CSR matrix that requires int64-typed .indices and .indptr attributes:

>>> from scipy.sparse import csr_matrix
>>> import numpy as np
>>>
>>> X = csr_matrix(([1.0], [np.iinfo(np.int32).max + 1], [0, 1]))
>>> X
<1x2147483649 sparse matrix of type '<class 'numpy.float64'>'
        with 1 stored elements in Compressed Sparse Row format>
>>> X.indices
array([2147483648])
>>> X.indices.dtype
dtype('int64')
>>> X.indptr.dtype
dtype('int64')
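
For contrast, here is a minimal sketch of how a Python-level caller might check whether a CSR matrix could still be handled by int32-only code. The helper name `fits_int32_indices` is hypothetical and is not part of scikit-learn or SciPy; it only relies on the `.shape`, `.nnz` and `.indices` attributes used above.

```python
import numpy as np
from scipy.sparse import csr_matrix

INT32_MAX = np.iinfo(np.int32).max  # 2**31 - 1, roughly 2e9


def fits_int32_indices(X):
    """Hypothetical helper: can the CSR index arrays of X be stored as int32?"""
    # indptr holds cumulative counts of stored elements, so its largest
    # value is X.nnz; column indices are at most X.shape[1] - 1.
    return X.nnz <= INT32_MAX and X.shape[1] - 1 <= INT32_MAX


# A small matrix: SciPy picks int32 index arrays.
X_small = csr_matrix(([1.0], [3], [0, 1]))

# Same construction as in the snippet above: one column index past the int32 range.
X_large = csr_matrix(([1.0], [INT32_MAX + 1], [0, 1]))

print(X_small.indices.dtype, fits_int32_indices(X_small))
print(X_large.indices.dtype, fits_int32_indices(X_large))
```

A check like this could be used to raise an informative error before calling Cython code that only accepts int32 indices, but whether to reject, downcast, or natively support such inputs is exactly the question this issue leaves open.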
