Describe the bug
The default value of lowercase in CountVectorizer is True, so the content of all documents is lowercased by default. However, the entries of a user-supplied vocabulary are not lowercased. As a result, any vocabulary entry that contains uppercase characters never matches the content of the documents.
I think CountVectorizer should either
- lowercase the vocabulary as well when lowercase is True, or
- not allow uppercase characters in the vocabulary when lowercase is True
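In the meantime, both options can be approximated on the caller's side. The following is only a minimal sketch and does not describe scikit-learn's own behavior: the helper names normalize_vocabulary and check_vocabulary are hypothetical, the example words are made up, and it assumes the default preprocessing, which lowercases documents with str.lower().

```python
from sklearn.feature_extraction.text import CountVectorizer


def normalize_vocabulary(vocabulary):
    """Option 1 (hypothetical helper): lowercase every entry.

    Duplicates created by lowercasing are dropped while keeping order,
    because CountVectorizer rejects a vocabulary with repeated terms.
    """
    return list(dict.fromkeys(term.lower() for term in vocabulary))


def check_vocabulary(vocabulary):
    """Option 2 (hypothetical helper): reject entries that lowercasing would change."""
    bad = [term for term in vocabulary if term != term.lower()]
    if bad:
        raise ValueError(f"vocabulary contains uppercase characters: {bad}")
    return list(vocabulary)


voc = ["Apple", "Banana", "Cherry"]
documents = ["Apple Banana Cherry apple"]

# With the vocabulary lowercased up front, it matches the lowercased documents.
model = CountVectorizer(vocabulary=normalize_vocabulary(voc))
counts = model.fit_transform(documents).toarray()
print(counts)  # [[2 1 1]]
```

Passing lowercase=False to CountVectorizer is a third possibility when case should be preserved on both sides.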
Steps/Code to Reproduce
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def test_count_vectorizer():
    voc = ["A", "B", "C"]
    documents = ["A B C"]
    count_model = CountVectorizer(
        ngram_range=(1, 1),
        vocabulary=voc,
    )
    x = count_model.fit_transform(documents).toarray()
    assert np.array_equal(x, [[1, 1, 1]])  # x is [[0, 0, 0]]; should be [[1, 1, 1]]

Expected Results
x is [[1, 1, 1]].
Actual Results
x is [[0, 0, 0]], so the assertion fails.
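One caveat when running the reproduction above: the default token_pattern of CountVectorizer only keeps tokens of two or more alphanumeric characters, so a single-letter document such as "A B C" yields no tokens whatever its case. The sketch below is a variant with longer, made-up words (an assumption, not part of the original report) that isolates the effect of lowercase.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Multi-character terms, so the default token_pattern does not discard them.
voc = ["Apple", "Banana", "Cherry"]
documents = ["Apple Banana Cherry"]

# Default lowercase=True: documents are lowercased, the vocabulary is not.
default_counts = (
    CountVectorizer(vocabulary=voc).fit_transform(documents).toarray()
)

# lowercase=False: documents keep their case and match the vocabulary.
cased_counts = (
    CountVectorizer(vocabulary=voc, lowercase=False)
    .fit_transform(documents)
    .toarray()
)

print(default_counts)  # [[0 0 0]]
print(cased_counts)    # [[1 1 1]]
```

The only difference between the two calls is the lowercase argument, which shows that the mismatch comes from lowercasing the documents but not the vocabulary.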
Versions
setuptools: 51.0.0
sklearn: 0.23.2
numpy: 1.19.4
scipy: 1.5.4
Cython: 0.29.21
pandas: 1.1.5
matplotlib: 3.3.3
joblib: 1.0.0
threadpoolctl: 2.1.0
Built with OpenMP: True