[MRG] fixes an issue w/ large sparse matrix indices in CountVectorizer by gvacaliuc · Pull Request #11295 · scikit-learn/scikit-learn

Merged (9 commits, Jan 30, 2019)
8 changes: 8 additions & 0 deletions doc/whats_new/v0.20.rst
@@ -51,6 +51,14 @@ Changelog
  combination with ``handle_unknown='ignore'``.
  :issue:`12881` by `Joris Van den Bossche`_.

:mod:`sklearn.feature_extraction.text`
......................................

- |Fix| Fixed a bug in :class:`feature_extraction.text.CountVectorizer` which
  would result in the sparse feature matrix having conflicting `indptr` and
  `indices` precisions under very large vocabularies. :issue:`11295` by
  :user:`Gabriel Vacaliuc <gvacaliuc>`.

.. _changes_0_20_2:

Version 0.20.2
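For context, the mismatch this changelog entry describes can be sketched on a toy matrix (an illustration only; the real trigger needs more than 2**31 non-zeros, and the snippet just mirrors what the private _sort_features helper used to do):

import numpy as np
import scipy.sparse as sparse

# a CSR matrix whose index arrays have been widened to int64, as
# _count_vocab does for very large vocabularies
X = sparse.csr_matrix((3, 3), dtype=np.int64)
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)

# before the fix, the permutation map was always built as int32 ...
map_index = np.empty(3, dtype=np.int32)

# ... so mapping the column indices through it silently downcast
# X.indices to int32 while X.indptr stayed int64: the "conflicting
# precisions" named above
X.indices = map_index.take(X.indices, mode='clip')
print(X.indices.dtype, X.indptr.dtype)  # int32 int64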
32 changes: 31 additions & 1 deletion sklearn/feature_extraction/tests/test_text.py
@@ -36,7 +36,8 @@
     assert_warns_message, assert_raise_message,
     clean_warning_registry, ignore_warnings,
     SkipTest, assert_raises, assert_no_warnings,
-    fails_if_pypy, assert_allclose_dense_sparse)
+    fails_if_pypy, assert_allclose_dense_sparse,
+    skip_if_32bit)
 from collections import defaultdict
 from functools import partial
 import pickle
@@ -1144,6 +1145,35 @@ def test_vectorizer_stop_words_inconsistent():
['hello world'])


@skip_if_32bit
def test_countvectorizer_sort_features_64bit_sparse_indices():
    """
    Check that CountVectorizer._sort_features preserves the dtype of its
    sparse feature matrix.

    This test is skipped on 32-bit platforms; see
    https://github.com/scikit-learn/scikit-learn/pull/11295 for more details.
    """
    X = sparse.csr_matrix((5, 5), dtype=np.int64)

    # force indices and indptr to int64
    INDICES_DTYPE = np.int64
    X.indices = X.indices.astype(INDICES_DTYPE)
    X.indptr = X.indptr.astype(INDICES_DTYPE)

    vocabulary = {
        "scikit-learn": 0,
        "is": 1,
        "great!": 2
    }

    Xs = CountVectorizer()._sort_features(X, vocabulary)

    assert INDICES_DTYPE == Xs.indices.dtype


@fails_if_pypy
@pytest.mark.parametrize('Estimator',
                         [CountVectorizer, TfidfVectorizer, HashingVectorizer])
13 changes: 6 additions & 7 deletions sklearn/feature_extraction/text.py
@@ -31,6 +31,7 @@
 from .stop_words import ENGLISH_STOP_WORDS
 from ..utils.validation import check_is_fitted, check_array, FLOAT_DTYPES
 from ..utils.fixes import sp_version
+from ..utils import _IS_32BIT
 
 
 __all__ = ['HashingVectorizer',
@@ -871,7 +872,7 @@ def _sort_features(self, X, vocabulary):
         Returns a reordered matrix and modifies the vocabulary in place

@eric-wieser (Aug 10, 2018):
Unrelated, but: this seems to reorder X in place too.

@gvacaliuc (Contributor, Author):
Indeed it does -- I can update that when we decide what the desired behavior is.

(Member):
That's intentional, I think, to reduce memory usage; why would it be an issue?

@eric-wieser:
It's an issue only because the documentation doesn't tell me that's going to happen.

(Member):
> This seems to reorder X in place too.

Agreed, we should add a note about it, but in a separate PR.
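A toy check of the in-place behaviour discussed above (not part of the PR; it assumes _sort_features permutes X.indices through the map and returns the same object, which is how the method reads in this diff):

import numpy as np
import scipy.sparse as sparse
from sklearn.feature_extraction.text import CountVectorizer

X = sparse.csr_matrix(np.eye(2))
vocabulary = {'b': 0, 'a': 1}  # columns deliberately out of order

Xs = CountVectorizer()._sort_features(X, vocabulary)

print(Xs is X)     # True: the same matrix comes back, reordered in place
print(vocabulary)  # {'b': 1, 'a': 0}: the dict is also modified in place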

"""
sorted_features = sorted(vocabulary.items())
map_index = np.empty(len(sorted_features), dtype=np.int32)
map_index = np.empty(len(sorted_features), dtype=X.indices.dtype)

@eric-wieser:
Indices should be of dtype np.intp.

@gvacaliuc (Contributor, Author):
Should we just add a check here that catches an improper conversion to int64 on 32-bit arch? Or should we never be mucking with the index dtype at all, and just throw when our np.intp would overflow?

    if indptr[-1] > 2147483648:  # = 2**31 - 1
        if sp_version >= (0, 14):
            indices_dtype = np.int64
        else:
            raise ValueError(('sparse CSR array has {} non-zero '
                              'elements and requires 64 bit indexing, '
                              ' which is unsupported with scipy {}. '
                              'Please upgrade to scipy >=0.14')
                             .format(indptr[-1], '.'.join(sp_version)))
    else:
        indices_dtype = np.int32

@eric-wieser:
I don't have enough background here -- could you link to the PR that added 64-bit indexing to scipy?

At any rate, it seems that indices_dtype should always be np.intp; if that's not possible, then you should pick between np.int32 (on old scipy) and np.intp (on fixed scipy).

@pv (Contributor, Aug 11, 2018):
The indices can be either int32 or int64; scipy.sparse doesn't care which (intp size does not matter). Both indices and indptr, however, need to be of the same dtype to avoid casts on each operation.

Of course, when you're constructing stuff manually, you'll also need to choose the dtype so that it can hold the items you are going to insert.
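A quick illustration of pv's point (a sketch, not from the PR): scipy.sparse tolerates either index width, but the two index arrays should agree, and the constructor may downcast small inputs on its own, which is why the test above sets the attributes directly:

import numpy as np
import scipy.sparse as sparse

X = sparse.csr_matrix((2, 3))
# widen both index arrays together so they share a dtype and no
# per-operation casting is needed
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)
print(X.indices.dtype, X.indptr.dtype)  # int64 int64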

         for new_val, (term, old_val) in enumerate(sorted_features):
             vocabulary[term] = new_val
             map_index[old_val] = new_val
@@ -961,14 +962,12 @@ def _count_vocab(self, raw_documents, fixed_vocab):
" contain stop words")

if indptr[-1] > 2147483648: # = 2**31 - 1
if sp_version >= (0, 14):
indices_dtype = np.int64
else:
if _IS_32BIT:
raise ValueError(('sparse CSR array has {} non-zero '
'elements and requires 64 bit indexing, '
' which is unsupported with scipy {}. '
'Please upgrade to scipy >=0.14')
.format(indptr[-1], '.'.join(sp_version)))
'which is unsupported with 32 bit Python.')
.format(indptr[-1]))
indices_dtype = np.int64

else:
indices_dtype = np.int32
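Read in isolation, the new branch amounts to the following (a sketch: IS_32BIT approximates sklearn.utils._IS_32BIT with a sys.maxsize check, and pick_indices_dtype is a hypothetical helper, not scikit-learn API):

import sys
import numpy as np

# a 32-bit Python has sys.maxsize == 2**31 - 1, so this approximates
# the platform check imported above
IS_32BIT = sys.maxsize <= 2**31 - 1

def pick_indices_dtype(nnz):
    # more non-zeros than int32 can index -> need 64-bit indices
    if nnz > 2147483648:
        if IS_32BIT:
            raise ValueError('sparse CSR array has {} non-zero elements '
                             'and requires 64 bit indexing, which is '
                             'unsupported with 32 bit Python.'.format(nnz))
        return np.int64
    return np.int32

print(pick_indices_dtype(100))    # <class 'numpy.int32'>
print(pick_indices_dtype(2**32))  # int64 on 64-bit; raises on 32-bit Python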