Upgrade DictVectorizer to use 64 bit indices instead of 32 bit · Issue #18403 · scikit-learn/scikit-learn · GitHub
Upgrade DictVectorizer to use 64 bit indices instead of 32 bit #18403


Open
E-Aho opened this issue Sep 15, 2020 · 4 comments

Comments

E-Aho commented Sep 15, 2020

Describe the workflow you want to enable

Currently, in DictVectorizer, the _transform function is limited to 32-bit indices, meaning the resultant matrix is capped at roughly 2^31 − 1 (~2.1 billion) rows/columns.

This issue seeks to enable using DictVectorizer on larger datasets by increasing this to work with 64 bit values.

Describe your proposed solution

Update the dtypes for the indices in _dict_vectorizer.py as follows:

  • from np.int32 to np.int64
  • from np.intc to np.int_
  • update from array("i") to array("l") (or, more portably, array("q"), since a C long is still 32 bits on 64-bit Windows) to get a 64-bit signed integer instead of a signed int
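As a sketch of the proposed widening (illustrative only, not the actual _dict_vectorizer.py patch), the typecode/dtype change and the index ceiling it lifts look like this:

```python
# Illustration of the proposed 32-bit -> 64-bit index widening.
# Hypothetical snippet, not the real scikit-learn code.
from array import array
import numpy as np

indices_32 = array("i")   # current: C signed int (4 bytes)
indices_64 = array("q")   # proposed: C signed long long (8 bytes);
                          # array("l") is a C long, which is only 32 bits
                          # on 64-bit Windows, so "q" is more portable

print(indices_32.itemsize)     # 4
print(indices_64.itemsize)     # 8

# The corresponding NumPy dtype change, and the ceiling being lifted:
print(np.iinfo(np.int32).max)  # 2147483647, i.e. ~2.1e9
print(np.iinfo(np.int64).max)  # 9223372036854775807
```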
E-Aho commented Sep 15, 2020

Worth noting that since SciPy 0.14.0, sparse has had the ability to work with int64 indices.
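A quick sketch of that SciPy behavior: when a dimension exceeds the int32 range, scipy.sparse keeps int64 index arrays (when indices fit, it downcasts to int32). The shape below is assumed for illustration; only the two stored entries consume memory.

```python
import numpy as np
import scipy.sparse as sp

# A 1 x (2**31 + 10) sparse matrix: the last column index does not fit
# in an int32, so scipy must retain int64 index arrays.
n_cols = 2**31 + 10
data = np.ones(2)
rows = np.zeros(2, dtype=np.int64)
cols = np.array([0, n_cols - 1], dtype=np.int64)

X = sp.coo_matrix((data, (rows, cols)), shape=(1, n_cols))
print(X.col.dtype)              # int64
print(X.tocsr().indices.dtype)  # int64
```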

@thomasjpfan (Member)

Would this increase the memory used by DictVectorizer?

rth (Member) commented Sep 21, 2020

Is this an issue that you run into? Could you please provide a traceback?

> limited to 32 bit indices, meaning that there is a limit of ~2M rows/cols in the resultant matrix.

It would be ~2e9 (2^31 − 1), not 1e6. It's indeed possible to run into this limitation for the number of elements in the array; it should be less of a problem for the number of features/columns. As far as I can tell after a cursory glance, the current limitation is only on the number of columns? Do you have data with more than 2e9 features, then?

E-Aho commented Sep 25, 2020

You're right about it being 2e9, not 2e6, my apologies. I've edited the original to reflect this.

It was an issue I had run into, yeah. I was training a very large model, which I understand is not a typical use case, so I would also understand if this is not a high priority to fix.

For me the issue was the number of columns, and having looked at the code I agree: it should only be an issue for a large number of features rather than rows, since the only uses of 32-bit ints are in sorting the features and adding the feature names to the vocabulary.

Using 64-bit indices would, I believe, only increase memory usage in the fitting/sorting functions within DictVectorizer, and I don't think it would significantly affect overall memory usage, though I'm not deeply familiar with the implementation of DictVectorizer in general.

The issue was solved for me by monkeypatching _dict_vectorizer.py as above, but I'm happy for this issue to be closed if supporting more than 2e9 features isn't a high priority.
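For anyone curious where the 32-bit arrays bite, here is a simplified, pure-Python sketch of the index bookkeeping in _transform (names and structure are illustrative, not the actual scikit-learn code):

```python
from array import array

def mini_transform(dicts):
    """Toy version of DictVectorizer's CSR index bookkeeping."""
    vocab = {}            # feature name -> column index
    indices = array("i")  # 32-bit column indices: the bottleneck;
                          # appending a value >= 2**31 raises OverflowError
    indptr = array("i", [0])
    values = []
    for d in dicts:
        for feat, val in d.items():
            col = vocab.setdefault(feat, len(vocab))  # one column per feature
            indices.append(col)
            values.append(val)
        indptr.append(len(indices))
    return vocab, list(indices), list(indptr), values

vocab, idx, ptr, vals = mini_transform([{"a": 1, "b": 2}, {"b": 3}])
print(vocab)  # {'a': 0, 'b': 1}
print(idx)    # [0, 1, 1]
print(ptr)    # [0, 2, 3]
```

With more than 2**31 − 1 distinct features, the `indices.append(col)` line is where the 32-bit typecode overflows, which matches the observation above that the limit applies to columns rather than rows.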
