CountVectorizer does not lowercase() entries in vocabulary when `lowercase` is set to `True` · Issue #19311 · scikit-learn/scikit-learn · GitHub
CountVectorizer does not lowercase() entries in vocabulary when lowercase is set to True #19311
Closed
@philipp-eisen

Description

Describe the bug

The default value of lowercase in CountVectorizer is True, so the content of every document is lowercased by default. However, the entries of a user-supplied vocabulary are not lowercased. If the vocabulary contains uppercase characters, those entries never match the lowercased document content.
I think CountVectorizer should either:

  1. lowercase the vocabulary as well when lowercase is True or
  2. not allow upper case characters in the vocabulary when lowercase is True
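The two options above can be sketched in plain Python. This is only an illustration of the proposed behaviour, not scikit-learn code: the helper name `normalize_vocabulary` and its `strict` flag are hypothetical, and the duplicate check is an assumption about what a sensible implementation would also want to guard against.

```python
from typing import Iterable, List


def normalize_vocabulary(
    vocabulary: Iterable[str],
    lowercase: bool = True,
    strict: bool = False,
) -> List[str]:
    """Hypothetical helper (not part of scikit-learn).

    strict=False sketches option 1: lowercase the vocabulary.
    strict=True sketches option 2: reject uppercase entries.
    """
    terms = list(vocabulary)
    if not lowercase:
        # Documents keep their case, so the vocabulary is left untouched.
        return terms

    # Entries that would change under lowercasing can never match
    # lowercased document content.
    offending = [t for t in terms if t != t.lower()]
    if strict and offending:
        raise ValueError(
            f"vocabulary contains uppercase entries while lowercase=True: {offending}"
        )

    lowered = [t.lower() for t in terms]
    # Assumption: entries such as "Apple" and "apple" collapsing into one
    # term should be reported rather than silently merged.
    if len(set(lowered)) != len(lowered):
        raise ValueError("lowercasing the vocabulary produces duplicate terms")
    return lowered
```

As a workaround, the returned list could be passed as the `vocabulary` argument of CountVectorizer so that its entries agree with the lowercased documents.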

Steps/Code to Reproduce

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def test_count_vectorizer():
    voc = ["A", "B", "C"]
    documents = ["A B C"]

    count_model = CountVectorizer(
        ngram_range=(1, 1),
        vocabulary=voc,
    )
    x = count_model.fit_transform(documents).toarray()
    assert np.array_equal(x, [[1, 1, 1]])  # x is [[0, 0, 0]]; should be [[1, 1, 1]]

Expected Results

x is [[1, 1, 1]].

Actual Results

x is [[0, 0, 0]], so the assertion fails.

Versions

   setuptools: 51.0.0
      sklearn: 0.23.2
        numpy: 1.19.4
        scipy: 1.5.4
       Cython: 0.29.21
       pandas: 1.1.5
   matplotlib: 3.3.3
       joblib: 1.0.0
threadpoolctl: 2.1.0
Built with OpenMP: True
