-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Update cython code to support 64 bit indexed sparse inputs #2969
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Except that when we construct sparse matrices, we should still use For this reason, I suggest we do this on an as-needed basis (handling more than 2bln points at once isn't a good idea for most algorithms anyway...) |
Indeed. However one should keep in mind that the 2bln points limit means 2bln non-zero values, not 2bln samples. It's still big but could happen for large text corpus vectorized with the hashing vectorizer for instance. Any both MinibatchKMeans and SGDClassifier (and friends) could deal with such a training set. |
Related SciPy bug: scipy/scipy#3465 |
To give an update on this issue, below is the test status of a few common estimators applied to a CSR array with 64 bit indices. These results were obtained by running A short summary is that,
|
Good to know. Is be happy to see a quick patch to liblinear raise an error.
perhaps adjust the sparse checks in estimator_checks to include a 64bit
index and we'd see this analysis for more than just the common estimators
|
It would be good to have this in the form of an estimator check. |
Actually, this is probably a non-issue, since SGD related estimators are likely to be trained in batches with X = scipy.sparse.rand(10, 1000, format='csr')
X.indices = X.indices.astype('int64')
X.indptr = X.indptr.astype('int64')
print(X.indices.dtype) # -> "int64"
print(X[:5].indices.dtype) # -> "int32" |
I'm sure people use SGD estimators for batch learning, so I don't think we
can get off that easy, unless I've misunderstood something.
…On 3 September 2017 at 20:21, Roman Yurchak ***@***.***> wrote:
1. To make the SGD related estimators work, the
sklearn.utils.seq_dataset.CSRDataset (at least) need to be added
support for 64 bit indices.
Actually, this is probably a non-issue, since SGD related estimators are
likely to be trained in batches with partial_fit anyway; in which case
scipy will downcast indices to 32 bit automatically...
X = scipy.sparse.rand(10, 1000, format='csr')
X.indices = X.indices.astype('int64')
X.indptr = X.indptr.astype('int64')print(X.indices.dtype) # -> "int64"print(X[:5].indices.dtype) # -> "int32"
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#2969 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz6yHc-k06EO0Bd
8000
vmqF3MGCsQK53wdks5sen2jgaJpZM4BprxF>
.
|
I vote we close this and open individual issues for SGD, Libsvm (maybe never fix?) and liblinear (probably never fix) |
In scipy master (to be released as 0.14), scipy sparse matrices can now be indexed with 64 bit integers.
This means that we will probably need to use fused types for
indptr
andindices
arrays whenever we deal with CSC or CSR datastructures in our Cython code base.Edit: by @rth on Nov 2017 to add the status of support for 64 bit CSR indices in different parts of the code, as discussed in #2969 (comment)
The text was updated successfully, but these errors were encountered: