[RFC] Support for int64 indexed SciPy sparse matrices in Cython code #23653
Comments
I see no valid reason to reject large sparse matrix support for some estimators and not for others. For instance, all estimators that support sparse high dim inputs with 10_000 dims could legitimately (attempt to) work with a large number of rows. So we could invest some effort to introduce a common test for int64 indexed sparse matrix support and use a meta-issue to track progress in supporting this consistently throughout the code base.
We could at least define a convention for the name of the typedef to be used in Cython code that manipulates such index arrays. For instance: ctypedef cnp.int32_t SPARSE_INDEX_t in existing code that only supports int32 indices, and later upgrade this with tempita or fused types in dedicated PRs.
I like the idea of adopting some kind of templating for this. Consider, e.g., #16803, which I think would be much simpler using templating than fused types, imo. Edit: Actually I think it's fine as-is, templating wasn't needed there.
#1680 seems unrelated.
Typo on my part, corrected now -- sorry 😅
Alright, #16803 is part of the list of linked issues.
Note that #25942 intentionally reverted the For estimators that have Cython code that should support
At the moment we do not have systematic support for very large sparse matrices in our Cython code. That would be useful when the data is passed as a sparse matrix with more than ~2e9 columns or non-zero values.
The purpose of this issue is to link:
Related issues and PRs (feel free to update this list):
For polynomial feature expansion (quite popular request):
np.int32_t causes issue in _csr_polynomial_expansion #16803
Other models with open issues:
Other Cython estimators that could also be updated:
DistanceMetric #23604
sklearn/feature_extraction/_hashing_fast.pyx
The following PR will introduce a scikit-learn transformer that can output int64 indexed sparse matrices (even if its input is int32 indexed).
preprocessing.PolynomialFeatures for sparse matrices #23731
Helpful Python snippet
SciPy decides to use the int32 or int64 dtype depending on the dimensions of the matrix and on the number of stored non-zero elements. Here is a quick way to generate a CSR matrix that requires int64-typed .indices and .indptr attributes:
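One possible way to do this, given as a minimal sketch rather than as the snippet from the issue itself: build a tiny CSR matrix whose declared number of columns exceeds the int32 range, so that SciPy has to fall back to int64 for its index arrays. The variable names (INT32_MAX, X_large, X_small) and the specific values are illustrative assumptions.

```python
import numpy as np
import scipy.sparse as sp

# Largest value representable by int32 (2**31 - 1, roughly 2.1e9).
INT32_MAX = np.iinfo(np.int32).max

# Three stored values laid out in CSR form for a 2-row matrix.
data = np.array([1.0, 2.0, 3.0])
indptr = np.array([0, 1, 3], dtype=np.int64)

# The last column index does not fit in int32, and neither does the
# number of columns, so int32 indices cannot represent this matrix.
indices_large = np.array([0, 5, INT32_MAX + 1], dtype=np.int64)
X_large = sp.csr_matrix(
    (data, indices_large, indptr), shape=(2, INT32_MAX + 2)
)

# For contrast: same layout with a small shape. SciPy is expected to
# pick the smaller index dtype here because everything fits.
indices_small = np.array([0, 1, 2], dtype=np.int64)
X_small = sp.csr_matrix((data, indices_small, indptr), shape=(2, 3))

print(X_large.indices.dtype, X_large.indptr.dtype)
print(X_small.indices.dtype, X_small.indptr.dtype)
```

Only three values are stored, so the matrix stays cheap to allocate even though its logical width is larger than 2e9 columns, which makes it convenient as an input for checking whether a Cython code path accepts int64 indices.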