Error with CountVectorizer OverflowError: signed integer is greater than maximum · Issue #12112 · scikit-learn/scikit-learn · GitHub
Error with CountVectorizer OverflowError: signed integer is greater than maximum #12112

Closed
robguinness opened this issue Sep 19, 2018 · 5 comments

Comments

@robguinness

Description

I am getting an error similar to the one discussed in #9147, but I am using the latest release of scikit-learn (0.19.2), so I'm not sure whether this is a separate issue. I am working with a large corpus. Here is a partial stack trace:

File "core.py", line 38, in <module>
    feature_vectorizer.ingest(data_dir='./data/docs/')
  File "/data/r_and_d/dochier/.venv/lib/python3.4/site-packages/freediscovery/engine/vectorizer.py", line 435, in ingest
    self.transform()
  File "/data/r_and_d/dochier/.venv/lib/python3.4/site-packages/freediscovery/engine/vectorizer.py", line 527, in transform
    res = vect.fit_transform(text_gen)
  File "/data/r_and_d/dochier/.venv/lib/python3.4/site-packages/sklearn/feature_extraction/text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "/data/r_and_d/dochier/.venv/lib/python3.4/site-packages/sklearn/feature_extraction/text.py", line 805, in _count_vocab
    indptr.append(len(j_indices))
OverflowError: signed integer is greater than maximum
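
For what it's worth, the error message at the bottom of the trace comes straight from CPython's array module. In 0.19.x, _count_vocab builds indptr via array.array('i'), i.e. signed 32-bit integers on most platforms, so appending a value above 2**31 - 1 raises exactly this OverflowError. A minimal sketch of the mechanism (no large corpus needed; the typecode detail is taken from the 0.19.x source):

import array

indptr = array.array('i')   # 'i' is a signed 32-bit integer on most platforms
indptr.append(2**31 - 1)    # the int32 maximum: fine
try:
    indptr.append(2**31)    # one past the maximum
except OverflowError as exc:
    print(exc)              # prints: signed integer is greater than maximum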

Steps/Code to Reproduce

This only happens when processing a large number of documents (>100k). I'm currently doing some more testing to figure out at what value of len(j_indices) this error occurs.
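
As a rough way to estimate when a corpus will hit the limit: in the 0.19.x code path, j_indices appears to hold one entry per distinct (document, term) pair, so the total token count across all documents gives an upper bound on len(j_indices), and the danger zone starts around 2**31 - 1 entries. A sketch of that estimate, assuming CountVectorizer's default lowercasing and token pattern:

import re

token_pattern = re.compile(r"(?u)\b\w\w+\b")  # CountVectorizer's default

def token_count_upper_bound(texts):
    # Total token occurrences; an upper bound on distinct (doc, term) pairs.
    return sum(len(token_pattern.findall(t.lower())) for t in texts)

docs = ["The quick brown fox", "jumps over the lazy dog"]
bound = token_count_upper_bound(docs)
print(bound)              # 9 tokens of length >= 2
print(bound > 2**31 - 1)  # False: this tiny corpus is nowhere near the limit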

Versions

Linux-4.4.0-111-generic-x86_64-with-Ubuntu-14.04-trusty
Python 3.4.3 (default, Nov 28 2017, 16:41:13)
[GCC 4.8.4]
NumPy 1.15.0
SciPy 1.1.0
Scikit-Learn 0.19.2

@qinhanmin2014
Member

The fix for this is not included in 0.19.2. Try master/0.20rc1 or wait for the 0.20 release.

@robguinness
Author

Ok, thanks for the clarification!

@robguinness
Author

Just an FYI: I ran the same code using 0.20rc1, and the error went away. I'm running this code on ~2 million documents, so the update seems to help a lot with scaling CountVectorizer to large corpora.

@rth
Member
rth commented Sep 20, 2018

Thanks for the confirmation @robguinness !

You may also be interested in the efforts to parallelize scikit-learn's text vectorizers in dask-ml (cf. dask/dask-ml#5). For now only HashingVectorizer is supported, though. They would also be interested in feedback on real-world usage on large text collections.
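
For context, the reason HashingVectorizer parallelizes well is that it is stateless: there is no vocabulary to fit, so chunks of documents can be transformed independently and the results stacked. A minimal sketch of that idea with plain scikit-learn and joblib (not the dask-ml API; the chunk size, n_features, and n_jobs values are illustrative):

import scipy.sparse as sp
from joblib import Parallel, delayed
from sklearn.feature_extraction.text import HashingVectorizer

vect = HashingVectorizer(n_features=2**20)  # fixed number of hash buckets

docs = ["first document", "second document",
        "third document", "fourth document"]
chunks = [docs[i:i + 2] for i in range(0, len(docs), 2)]

# Each chunk is transformed independently; no shared state between workers.
parts = Parallel(n_jobs=2)(delayed(vect.transform)(chunk) for chunk in chunks)
X = sp.vstack(parts)
print(X.shape)  # (4, 1048576)

The trade-off is that hashing is one-way: there is no vocabulary to invert, and hash collisions can map unrelated terms to the same feature.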

@robguinness
Author

> You may also be interested in the efforts to parallelize scikit-learn's text vectorizers in dask-ml (cf. dask/dask-ml#5). For now only HashingVectorizer is supported, though.

Thanks, @rth. This is very interesting indeed! We regularly process corpora of ~10 million documents, and this part of the process is one of the most time-consuming due to the lack of parallelization. We have plans to work on this too, but this might help to accelerate it.
