Upgrade DictVectorizer to use 64 bit indices instead of 32 bit · Issue #18403 · scikit-learn/scikit-learn · GitHub
Upgrade DictVectorizer to use 64 bit indices instead of 32 bit #18403


Open
E-Aho opened this issue Sep 15, 2020 · 4 comments

Comments

E-Aho commented Sep 15, 2020

Describe the workflow you want to enable

Currently, in DictVectorizer, the _transform function is limited to 32-bit indices, meaning the resultant matrix is capped at roughly 2^31 − 1 (~2.1 billion) rows/columns.

This issue seeks to enable using DictVectorizer on larger datasets by increasing this to work with 64 bit values.

Describe your proposed solution

Update the dtypes for the indices in _dict_vectorizer.py as follows:

  • from np.int32 to np.int64
  • from np.intc to np.int_
  • update from array("i") to array("l") (or, more portably, array("q"), since a C long is still 32 bits on 64-bit Windows) to get a 64-bit signed integer instead of a signed int
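As a sketch of the proposed widening (illustrative only, not the actual _dict_vectorizer.py patch), the typecode/dtype change and the index ceiling it lifts look like this:

```python
# Illustration of the proposed 32-bit -> 64-bit index widening.
# Hypothetical snippet, not the real scikit-learn code.
from array import array
import numpy as np

indices_32 = array("i")   # current: C signed int (4 bytes)
indices_64 = array("q")   # proposed: C signed long long (8 bytes);
                          # array("l") is a C long, which is only 32 bits
                          # on 64-bit Windows, so "q" is more portable

print(indices_32.itemsize)     # 4
print(indices_64.itemsize)     # 8

# The corresponding NumPy dtype change, and the ceiling being lifted:
print(np.iinfo(np.int32).max)  # 2147483647, i.e. ~2.1e9
print(np.iinfo(np.int64).max)  # 9223372036854775807
```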
E-Aho commented Sep 15, 2020

Worth noting that since SciPy 0.14.0, sparse has had the ability to work with int64 indices.
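A quick sketch of that SciPy behavior: when a dimension exceeds the int32 range, scipy.sparse keeps int64 index arrays (when indices fit, it downcasts to int32). The shape below is assumed for illustration; only the two stored entries consume memory.

```python
import numpy as np
import scipy.sparse as sp

# A 1 x (2**31 + 10) sparse matrix: the last column index does not fit
# in an int32, so scipy must retain int64 index arrays.
n_cols = 2**31 + 10
data = np.ones(2)
rows = np.zeros(2, dtype=np.int64)
cols = np.array([0, n_cols - 1], dtype=np.int64)

X = sp.coo_matrix((data, (rows, cols)), shape=(1, n_cols))
print(X.col.dtype)              # int64
print(X.tocsr().indices.dtype)  # int64
```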

@thomasjpfan (Member)

Would this increase the memory used by DictVectorizer?

rth (Member) commented Sep 21, 2020

Is this an issue that you run into? Could you please provide a traceback?

> limited to 32 bit indices, meaning that there is a limit of ~2M rows/cols in the resultant matrix.

It would be ~2e9 (2^31 − 1), not 1e6. It's indeed possible to run into this limitation for the number of elements in the array; it should be less of a problem for the number of features/columns. As far as I can tell after a cursory glance, the current limitation is only on the number of columns? Do you have data with more than 2e9 features, then?

E-Aho commented Sep 25, 2020

You're right about it being 2e9, not 2e6, my apologies. I've edited the original to reflect this.

It was an issue I had run into, yeah. I was training a very large model, which I understand is not a typical use case, so I would also understand if this is not a high priority to fix.

For me the issue was the number of columns, and having looked at the code I agree: it should only be an issue for a large number of features rather than rows, since the only uses of 32-bit ints are in sorting the features and adding the feature names to the vocabulary.

Using 64-bit indices would, I believe, only increase memory usage in the fitting/sorting functions within DictVectorizer, and I don't think it would significantly affect overall memory usage, though I'm not deeply familiar with the implementation of DictVectorizer in general.

The issue was solved for me by monkeypatching _dict_vectorizer.py as above, but I'm happy for this issue to be closed if supporting more than 2e9 features isn't a high priority.
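For anyone curious where the 32-bit arrays bite, here is a simplified, pure-Python sketch of the index bookkeeping in _transform (names and structure are illustrative, not the actual scikit-learn code):

```python
from array import array

def mini_transform(dicts):
    """Toy version of DictVectorizer's CSR index bookkeeping."""
    vocab = {}            # feature name -> column index
    indices = array("i")  # 32-bit column indices: the bottleneck;
                          # appending a value >= 2**31 raises OverflowError
    indptr = array("i", [0])
    values = []
    for d in dicts:
        for feat, val in d.items():
            col = vocab.setdefault(feat, len(vocab))  # one column per feature
            indices.append(col)
            values.append(val)
        indptr.append(len(indices))
    return vocab, list(indices), list(indptr), values

vocab, idx, ptr, vals = mini_transform([{"a": 1, "b": 2}, {"b": 3}])
print(vocab)  # {'a': 0, 'b': 1}
print(idx)    # [0, 1, 1]
print(ptr)    # [0, 2, 3]
```

With more than 2**31 − 1 distinct features, the `indices.append(col)` line is where the 32-bit typecode overflows, which matches the observation above that the limit applies to columns rather than rows.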
