[MRG] fixes an issue w/ large sparse matrix indices in CountVectorizer #11295
Changes from all commits
317a169
242e668
c397087
562b984
8f2db42
6fb6923
74b3736
1bafd00
0c5e34d
@@ -31,6 +31,7 @@
 from .stop_words import ENGLISH_STOP_WORDS
 from ..utils.validation import check_is_fitted, check_array, FLOAT_DTYPES
 from ..utils.fixes import sp_version
+from ..utils import _IS_32BIT


 __all__ = ['HashingVectorizer',
@@ -871,7 +872,7 @@ def _sort_features(self, X, vocabulary):
         Returns a reordered matrix and modifies the vocabulary in place
         """
         sorted_features = sorted(vocabulary.items())
-        map_index = np.empty(len(sorted_features), dtype=np.int32)
+        map_index = np.empty(len(sorted_features), dtype=X.indices.dtype)

Indices should be of dtype

Should we just add a check here that catches an improper conversion to int64 on a 32-bit architecture? Or should we never be mucking with the index dtype at all and just throw when our

scikit-learn/sklearn/feature_extraction/text.py, lines 944 to 955 in 317a169

I don't have enough background here. Could you link to the PR that added 64-bit indexing to scipy? At any rate, it seems to be that

The indices can be either int32 or int64; scipy.sparse doesn't care which (intp size does not matter). However, both indices and indptr need to be of the same dtype to avoid casts on each operation. Of course, when you are constructing things manually, you'll also need to choose a dtype that can hold the items you are going to insert.

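To make the last comment concrete, here is a minimal sketch of choosing one dtype for both index arrays when building CSR data by hand. The helper `pick_index_dtype` and the toy arrays are hypothetical and not part of scikit-learn or scipy; the sketch only assumes that `indices` holds column ids and that the last entry of `indptr` equals the number of stored values.

```python
import numpy as np


def pick_index_dtype(n_cols, nnz):
    """Hypothetical helper: pick int32 or int64 so both index arrays fit."""
    # indices holds column ids (at most n_cols - 1); indptr ends at nnz.
    largest = max(n_cols - 1, nnz)
    if largest > np.iinfo(np.int32).max:
        return np.int64
    return np.int32


# Toy CSR layout: 2 rows, 4 columns, 3 stored values.
dtype = pick_index_dtype(n_cols=4, nnz=3)
indices = np.array([0, 3, 1], dtype=dtype)  # column id of each stored value
indptr = np.array([0, 2, 3], dtype=dtype)   # row boundaries into indices
```

Both arrays are created with the same dtype, so no cast is needed when they are used together.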
         for new_val, (term, old_val) in enumerate(sorted_features):
             vocabulary[term] = new_val
             map_index[old_val] = new_val
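The effect of this hunk can be sketched outside scikit-learn with plain NumPy. The function `remap_indices` below is a hypothetical stand-in for `_sort_features`, and the final `take` call is an assumption about how the mapping is applied, since that line is not shown in the diff. The point is that the remapped array keeps the dtype of the incoming index array instead of being forced to int32.

```python
import numpy as np


def remap_indices(indices, vocabulary):
    """Hypothetical stand-in: sort terms and remap column indices."""
    sorted_features = sorted(vocabulary.items())
    # Follow the dtype of the incoming indices rather than hard-coding int32.
    map_index = np.empty(len(sorted_features), dtype=indices.dtype)
    for new_val, (term, old_val) in enumerate(sorted_features):
        vocabulary[term] = new_val      # the vocabulary is updated in place
        map_index[old_val] = new_val
    # Assumption: the old column ids are looked up in map_index.
    return map_index.take(indices, mode="clip")


vocabulary = {"pear": 0, "apple": 1, "fig": 2}
indices = np.array([0, 1, 2, 1], dtype=np.int64)
remapped = remap_indices(indices, vocabulary)
```

With a hard-coded `np.int32` map, the result of `take` would be int32 even when the input indices are int64, which is the mismatch the change avoids.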
@@ -961,14 +962,12 @@ def _count_vocab(self, raw_documents, fixed_vocab):
                                  " contain stop words")

         if indptr[-1] > 2147483648:  # = 2**31 - 1
-            if sp_version >= (0, 14):
-                indices_dtype = np.int64
-            else:
+            if _IS_32BIT:
                 raise ValueError(('sparse CSR array has {} non-zero '
                                   'elements and requires 64 bit indexing, '
-                                  ' which is unsupported with scipy {}. '
-                                  'Please upgrade to scipy >=0.14')
-                                 .format(indptr[-1], '.'.join(sp_version)))
+                                  'which is unsupported with 32 bit Python.')
+                                 .format(indptr[-1]))
+            indices_dtype = np.int64

         else:
             indices_dtype = np.int32
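The new branch logic can be sketched as a small standalone function. Everything here is illustrative: `select_indices_dtype` is a hypothetical name, the 32-bit check via `struct.calcsize("P")` is an assumption about how a flag like `_IS_32BIT` could be computed, and the limit is taken to be 2**31 - 1 (2147483647), as the code comment states. Note that the literal in the diff, 2147483648, is 2**31.

```python
import struct

INT32_MAX = 2**31 - 1  # 2147483647, the largest value an int32 can hold


def select_indices_dtype(nnz, is_32bit):
    """Hypothetical helper mirroring the branch structure of the hunk."""
    if nnz > INT32_MAX:
        if is_32bit:
            raise ValueError(
                "sparse CSR array has {} non-zero elements and requires "
                "64 bit indexing, which is unsupported with 32 bit Python."
                .format(nnz)
            )
        return "int64"
    return "int32"


# Assumed way to detect a 32-bit interpreter: pointer size in bits.
is_32bit = 8 * struct.calcsize("P") == 32
small = select_indices_dtype(1000, is_32bit)
```

Passing the platform flag as an argument keeps the function easy to exercise for both cases on any machine.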
Unrelated, but: This seems to reorder X in place too.
Indeed it does -- I can update that when we decide what the desired behavior is.
That's intentional, I think, to reduce memory usage, why would it be an issue?
It's an issue only because the documentation doesn't tell me that's going to happen.
Agreed, we should add a note about it but in a separate PR.
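
For readers following the in-place discussion, a minimal NumPy sketch of the difference between reordering in place and returning a copy. Both helpers are hypothetical and are not the scikit-learn implementation; they only illustrate why a caller might be surprised when the documentation does not mention the mutation.

```python
import numpy as np


def reorder_in_place(indices, map_index):
    """Hypothetical: overwrite the caller's array, saving memory."""
    indices[:] = map_index.take(indices, mode="clip")
    return indices


def reorder_copy(indices, map_index):
    """Hypothetical: leave the caller's array untouched."""
    return map_index.take(indices, mode="clip")


map_index = np.array([2, 0, 1])
original = np.array([0, 1, 2])

copied = reorder_copy(original, map_index)         # original is unchanged
unchanged_snapshot = original.copy()
mutated = reorder_in_place(original, map_index)    # original is overwritten
```

The in-place variant avoids holding two index arrays at once, which matches the memory argument made in the thread, at the cost of a side effect on the input.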