Upgrade DictVectorizer to use 64 bit indices instead of 32 bit #18403
Comments
Worth noting that since SciPy 0.14.0, sparse has had the ability to work with 64-bit index arrays.
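For context, a small sketch (not from the thread) of constructing a CSR matrix whose column indices exceed the int32 range, which SciPy's sparse module supports via 64-bit index arrays:

```python
import numpy as np
import scipy.sparse as sp

# Column indices beyond the int32 range (2**31 - 1) require 64-bit indices.
indptr = np.array([0, 2, 3], dtype=np.int64)
indices = np.array([0, 2**31 + 1, 2**31 + 5], dtype=np.int64)
data = np.ones(3)

X = sp.csr_matrix((data, indices, indptr), shape=(2, 2**31 + 10))
print(X.indices.dtype)  # 64-bit, since the column count exceeds int32 range
```

The index arrays themselves stay tiny here; only the logical shape is huge, which is exactly the regime where 32-bit indices overflow.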
Would this increase the memory used by …
Is this an issue that you run into? Could you please provide a traceback?
It would be 2e9, not 2e6. It's indeed possible to run into this limitation for the number of elements in the array; it should be less of a problem for the number of features/columns. As far as I can tell after a cursory glance, the current limitation is only on the number of columns? Do you have data with more than 2e9 features, then?
You're right about it being 2e9, not 2e6; my apologies. I have edited the original to reflect this. It was an issue I ran into, yes: training a very large model, which I understand is not a typical use case, so I would also understand if this is not a high priority to fix. For me the issue was with the number of columns, and having looked at the code, I agree that it should only be an issue for a large number of features rather than rows, as the only uses of 32-bit ints are in sorting the features and adding the feature_names to the vocab list. Using 64-bit indices would, I believe, only increase the memory usage in the fitting/sorting functions within … The issue was solved for me by monkeypatching …
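The index dtype in question can be inspected on a fitted vectorizer. A minimal illustration (not the reporter's code; on releases affected by this issue the sparse output carries 32-bit indices):

```python
from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer(sparse=True)
X = v.fit_transform([{"a": 1.0, "b": 2.0}, {"b": 3.0, "c": 4.0}])

print(v.feature_names_)   # ['a', 'b', 'c'] -- the sorted vocab list
print(X.indices.dtype)    # int32 on releases affected by this issue
```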
Describe the workflow you want to enable
Currently, in DictVectorizer, the `_transform` function is limited to 32-bit indices, meaning there is a limit of ~2e9 (2^31 − 1) rows/cols in the resulting matrix.
This issue seeks to enable using DictVectorizer on larger datasets by upgrading it to 64-bit indices.
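The exact limit comes from the range of a 32-bit signed integer, which can be checked directly:

```python
import numpy as np

# Largest value representable by a 32-bit signed index:
int32_max = np.iinfo(np.int32).max
print(int32_max)  # 2147483647, i.e. ~2.1e9 rows/cols
```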
Describe your proposed solution
Update the dtypes for the indices in `_dict_vectorizer.py` as follows:

- `np.int32` to `np.int64`
- `np.intc` to `np.int_`
- `array("i")` to `array("l")`

to get the 64-bit signed long instead of the signed int.
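One portability caveat worth checking before settling on these dtypes: `array("l")` and `np.int_` map to the platform C `long`, which is only 4 bytes on Windows, whereas `array("q")` and an explicit `np.int64` are 8 bytes everywhere. A quick sketch comparing the itemsizes:

```python
from array import array
import numpy as np

# "i" is the C int (4 bytes on mainstream platforms); "l" is the C long,
# which is 4 bytes on Windows and usually 8 on 64-bit Unix; "q" is
# guaranteed 8 bytes, as is an explicit np.int64.
print(array("i").itemsize)          # typically 4
print(array("l").itemsize)          # platform-dependent: 4 or 8
print(array("q").itemsize)          # always 8
print(np.dtype(np.int64).itemsize)  # always 8
```

So `array("q")`/`np.int64` may be the safer replacements if the upgrade must hold on Windows as well.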