Error thrown when running TF-IDF vectorizer on scikit-learn 0.20 #12256
Comments
Thanks for the report. Could you also please post the full error traceback?
@TheZeken, the code is also not reproducible: it's missing some imports, and the definition of It would be best if you could provide a minimal example, including manually or randomly filling a dataframe, which gives the error you get.
The issue seems to be in the stop word validation. Would something like this,

    vect = CountVectorizer(token_pattern=r'\w{1,}',
                           stop_words=english_stop,
                           ngram_range=(1, 2))
    vect.fit_transform(["some text"])

reproduce the error? If it is the case please either provide us with the contents of your In any case as @adrinjalali mentioned please try to remove all non relevant constructs (including
I had not, I admit, anticipated this use of Currently we check each stop word with:
We could just disregard this if
Right, missed that Here is a minimal example,

    from sklearn.feature_extraction.text import CountVectorizer

    data = [{'text': 'some text'}]
    vect = CountVectorizer(preprocessor=lambda x: x['text'],
                           stop_words=['and'])
    vect.fit_transform(data)

To get an error, it is necessary to both explicitly provide stop words and set a custom preprocessor.
That should work, but another thing to consider is that the vectorizer can be a class inherited from CountVectorizer et al. with a custom build_tokenizer method: we actually suggest doing that in the user manual. In which case determining whether it is the original one or not might be a bit tricky. Maybe in tokenize(preprocess('and')), if that fails, skip this check. That would allow us to fix this while still detecting whether the provided stop words have some bad elements that cannot be tokenized.
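The fallback suggested in the comment above (run the consistency check, and skip it if preprocessing or tokenizing a stop word fails) can be sketched in plain Python. This is only an illustration, not scikit-learn's actual implementation: the helper names `inconsistent_stop_words` and `default_tokenize` are hypothetical, and the sketch assumes the check means "every token produced from a stop word should itself be a stop word". A failure on any stop word skips the check, not just a failure on the first one.

```python
import re

# Same pattern as the token_pattern used earlier in the thread.
TOKEN_RE = re.compile(r"\w{1,}")


def default_tokenize(text):
    """Hypothetical stand-in for a vectorizer's tokenizer."""
    return TOKEN_RE.findall(text)


def inconsistent_stop_words(stop_words, preprocess, tokenize):
    """Return tokens derived from stop words that are not stop words.

    Returns None when the check cannot run at all, for example because
    a custom preprocessor expects non-string documents.
    """
    stop_words = set(stop_words)
    inconsistent = set()
    try:
        for word in stop_words:
            for token in tokenize(preprocess(word)):
                if token not in stop_words:
                    inconsistent.add(token)
    except Exception:
        # Assumption: skipping the check is preferable to failing the fit.
        return None
    return sorted(inconsistent)


# Preprocessor from the minimal example: it expects a dict, so calling it
# on the string 'and' raises and the check is skipped.
dict_result = inconsistent_stop_words(
    ["and"], lambda x: x["text"], default_tokenize
)

# Ordinary string preprocessor: the check runs and reports split tokens.
str_result = inconsistent_stop_words(
    ["and", "we've"], str.lower, default_tokenize
)
```

Here `dict_result` is `None` because the check was skipped, while `str_result` lists the pieces of "we've" that the tokenizer produces but the stop word list does not contain.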
but you're not assured that if
passes, all will. Say
Hah, yes, but e.g. with
What's the status on this? Anyone working on it? Do we need someone to pick it up for 0.20.1?
I was considering the idea, somewhere between this week and the next. If someone can work on it earlier they should take it.
Description
I get an error when I run the fit function from TFIDFVectorizer. I didn't have this issue before 0.20, and when I uninstall 0.20 and re-install 0.19.1 the bug disappears.
Steps/Code to Reproduce
Example:
Expected Results
No error is thrown and I got the following results.
Actual Results
The full traceback :
Versions
Windows-10-10.0.14393-SP0
Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
NumPy 1.15.2
SciPy 1.1.0
Scikit-Learn 0.19.1
Is there something that I'm missing? Or could it come from the scikit-learn update?
Thank you ! :)