The default value of lowercase in CountVectorizer is True, so all document content is lowercased by default. However, the entries of a user-supplied vocabulary are not lowercased. As a result, any vocabulary entry that contains uppercase characters will never match the content of the documents.
I think CountVectorizer should either
- lowercase the vocabulary as well when lowercase is True, or
- not allow uppercase characters in the vocabulary when lowercase is True.
Steps/Code to Reproduce
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer


    def test_count_vectorizer():
        voc = ["A", "B", "C"]
        documents = ["A B C"]
        count_model = CountVectorizer(
            ngram_range=(1, 1),
            vocabulary=voc,
        )
        x = count_model.fit_transform(documents).toarray()
        assert np.array_equal(x, [[1, 1, 1]])  # x is [[0, 0, 0]]; should be [[1, 1, 1]]
Expected Results
x is [[1, 1, 1]]
Actual Results
x is [[0, 0, 0]]
I think the principle when we receive a precomputed vocabulary is that we should avoid any processing of the vocabulary, which may be much larger than any particular document. So I'm not convinced that this is an issue for the library: the user needs to just take care.
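For reference, "taking care" on the user side might look like the sketch below. It is only an illustration, not an official recommendation: the vocabulary and document are made up, and multi-character tokens are used on purpose because the default token_pattern only keeps tokens of two or more characters (so single-letter entries such as "A" are dropped regardless of case). Recent scikit-learn versions may also emit a warning for the mismatched case.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative data (not from the original report). Tokens have two or more
# characters so that the default token_pattern does not discard them.
vocabulary = ["Apple", "Banana", "Cherry"]
documents = ["Apple Banana Cherry"]

# Default lowercase=True: documents are lowercased, the vocabulary is not,
# so none of the capitalized entries can match.
mismatched = CountVectorizer(vocabulary=vocabulary)
counts_mismatched = mismatched.fit_transform(documents).toarray()

# Option 1: lowercase the vocabulary so it matches the lowercased documents.
lowered = CountVectorizer(vocabulary=[term.lower() for term in vocabulary])
counts_lowered = lowered.fit_transform(documents).toarray()

# Option 2: keep the vocabulary as is and disable lowercasing of documents.
case_sensitive = CountVectorizer(vocabulary=vocabulary, lowercase=False)
counts_case_sensitive = case_sensitive.fit_transform(documents).toarray()

print(counts_mismatched)      # all zeros
print(counts_lowered)         # every term counted once
print(counts_case_sensitive)  # every term counted once
```

Which option fits depends on whether case carries meaning in the corpus: option 1 merges "Apple" and "apple", option 2 keeps them distinct.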
Hm, I understand where you're coming from. I just feel that this is one of those things that introduces a bug very silently, and it might be worth the extra computation cost of calling .islower() on the vocabulary entries in _validate_vocabulary(self) when lowercase=True.
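A rough sketch of what such a check could look like, written as standalone functions rather than as a patch to scikit-learn. The names find_non_lowercase_terms and warn_on_case_mismatch are hypothetical and do not exist in the library. The sketch compares each term with term.lower() instead of calling .islower(), because .islower() returns False for strings without any cased characters (for example "123"), which would flag harmless entries.

```python
import warnings


def find_non_lowercase_terms(vocabulary):
    """Return the vocabulary terms that lowercasing would change."""
    return [term for term in vocabulary if term != term.lower()]


def warn_on_case_mismatch(vocabulary, lowercase=True):
    """Hypothetical check: warn when lowercase=True but some terms are not lowercase.

    Returns the offending terms so the caller can inspect or fix them.
    """
    if not lowercase:
        # Documents are not lowercased, so mixed-case terms can still match.
        return []
    offending = find_non_lowercase_terms(vocabulary)
    if offending:
        warnings.warn(
            f"{len(offending)} vocabulary term(s) contain uppercase characters "
            f"and cannot match lowercased documents, e.g. {offending[0]!r}",
            UserWarning,
        )
    return offending


# Usage: "Apple" is reported, while "pear" and "123" are left alone.
print(warn_on_case_mismatch(["Apple", "pear", "123"]))
```

The check is a single pass over the vocabulary, so its cost grows linearly with vocabulary size, which is the overhead being weighed in this discussion.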