Description
At the moment we do not have systematic support for very large sparse matrices in our Cython code. That would be useful when the data is passed as a sparse matrix with more than ~2e9 columns or non-zero values.
The purpose of this issue is to:
- reference all related issues in scikit-learn.
- decide if we want to have some uniform support guarantees or not
- decide if we need centralized Cython tooling (e.g. type declarations, tempita conventions) to add support for such matrices.
Related issues and PRs (feel free to update this list):
For polynomial feature expansion (quite popular request):
- index type np.int32_t causes issue in _csr_polynomial_expansion #16803
- Cython code for PolynomialFeatures should use int64s for indices. #17554
- [WIP] FIX index overflow error in sparse matrix polynomial expansion (bis) #19676
Other models with open issues:
- Support large sparse matrices in SGD* and SequentialDataset #11355
- Support large sparse matrices in SVC/SVR #11356
- Adding accept_large_sparse flag to SGDRegressor #18090
- Upgrade DictVectorizer to use 64 bit indices instead of 32 bit #18403
Other Cython estimators that could also be updated:
- neighbors models (k-NN and radius-based models)
- related issues not just about this problem: FEA CSR support for all DistanceMetric #23604
- k-means & variants
- Feature Hasher / Hashing Vectorizer (sklearn/feature_extraction/_hashing_fast.pyx)
The following PR will introduce a scikit-learn transformer that can output int64-indexed sparse matrices (even if its input is int32-indexed).
Helpful Python snippet
SciPy decides to use the int32 or int64 dtype depending on the dimensions of the matrix and on the number of stored non-zero elements. Here is a quick way to generate a CSR matrix that requires int64-typed .indices and .indptr attributes:
>>> from scipy.sparse import csr_matrix
>>> import numpy as np
>>>
>>> X = csr_matrix(([1.0], [np.iinfo(np.int32).max + 1], [0, 1]))
>>> X
<1x2147483649 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Row format>
>>> X.indices
array([2147483648])
>>> X.indices.dtype
dtype('int64')
>>> X.indptr.dtype
dtype('int64')
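When testing estimators it can also be handy to get int64 index arrays on a small matrix, without allocating anything large. The following is only a sketch: the helper names `uses_int64_indices` and `with_int64_indices` are hypothetical and not part of scikit-learn or SciPy. It assigns the cast arrays as attributes, on the assumption that passing them back through the `csr_matrix` constructor may downcast them to int32 when the values fit.

```python
import numpy as np
from scipy.sparse import csr_matrix

INT32_MAX = np.iinfo(np.int32).max


def uses_int64_indices(X):
    """Return True if either index array of a CSR/CSC matrix is int64.

    Hypothetical helper, for illustration only.
    """
    return X.indices.dtype == np.int64 or X.indptr.dtype == np.int64


def with_int64_indices(X):
    """Return a copy of X whose index arrays are cast to int64.

    Hypothetical helper. The arrays are set as attributes rather than
    passed to the constructor, which may pick int32 again for small inputs.
    """
    X = X.copy()
    X.indices = X.indices.astype(np.int64)
    X.indptr = X.indptr.astype(np.int64)
    return X


# A small matrix gets int32 index arrays by default.
X_small = csr_matrix(np.eye(3))

# Same values, but with int64 index arrays forced on it.
X_small_64 = with_int64_indices(X_small)

# A matrix whose column index does not fit in int32, as in the snippet above.
X_large = csr_matrix(([1.0], [INT32_MAX + 1], [0, 1]))
```

`X_small_64` can then be passed to an estimator to check how its Cython code reacts to int64 indices, while `X_large` exercises the case where the indices genuinely cannot be downcast.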