Error with CountVectorizer OverflowError: signed integer is greater than maximum #12112
Comments
This is not included in 0.19.2. Try master/0.20rc1 or wait for the 0.20 release.
Ok, thanks for the clarification!
Just an FYI... I ran the same code using 0.20rc1, and the error went away. I'm running this code on ~2 million documents, so the update seems to help a lot with scaling CountVectorizer to large corpora.
Thanks for the confirmation, @robguinness! You may also be interested in the efforts to parallelize scikit-learn's text vectorizers in dask-ml (cf. dask/dask-ml#5). For now only
Thanks, @trh. This is very interesting indeed! We regularly process corpora of ~10 million documents, and this part of the process is one of the most time-consuming due to the lack of parallelization. We have plans to work on this, too, but this might help to accelerate it.
Description
I am getting an error, which is similar to the one discussed in #9147, but I am using the latest release of scikit-learn (0.19.2), so I'm not sure if this is another issue. I am working with a large corpus. Here is a partial stacktrace:
Steps/Code to Reproduce
This only happens when processing a large number of documents (>100k). I'm currently doing some more testing to figure out at what value of len(j_indices) this error occurs.
Versions
Linux-4.4.0-111-generic-x86_64-with-Ubuntu-14.04-trusty
Python 3.4.3 (default, Nov 28 2017, 16:41:13)
[GCC 4.8.4]
NumPy 1.15.0
SciPy 1.1.0
Scikit-Learn 0.19.2
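For readers wondering why the value of len(j_indices) might matter, here is a minimal sketch of how an OverflowError of this kind can arise in plain Python. It assumes, without confirmation from this thread, that a count such as len(j_indices) ends up stored in a buffer of signed C ints. The helper `fits_in_c_int` and the constant `INT32_MAX` are hypothetical names used only for illustration and are not part of scikit-learn.

```python
import array

# Largest value a 32-bit signed C int can hold (assumes the usual 4-byte int).
INT32_MAX = 2**31 - 1


def fits_in_c_int(value):
    """Return True if `value` can be stored in an array of signed C ints.

    Hypothetical helper for illustration only; not scikit-learn code.
    """
    buf = array.array("i")  # typecode "i": signed C int
    try:
        buf.append(value)
    except OverflowError:
        # Raised by the array module when the value exceeds the C int range.
        return False
    return True


# A count at the top of the range, then one just past it.
print(fits_in_c_int(INT32_MAX))
print(fits_in_c_int(INT32_MAX + 1))
```

If that assumption holds, the failure would depend on the total number of token occurrences across the corpus rather than on the number of documents alone, which would be consistent with it appearing only on large inputs.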