8000 CountVectorizer does not lowercase() entries in vocabulary when `lowercase` is set to `True` · Issue #19311 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

CountVectorizer does not lowercase() entries in vocabulary when lowercase is set to True #19311

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
philipp-eisen opened this issue Jan 31, 2021 · 5 comments · Fixed by #19401
Closed

Comments

@philipp-eisen
Copy link
philipp-eisen commented Jan 31, 2021

Describe the bug

The default value of for lowercase in CountVectorizer is True. This has the effect that all content of documents is lowercased by default. However, the entries in the vocabulary are not lowercased. So if the vocabulary contains uppercase characters it won't match against the content in the documents.
I think CountVectorizer should either

  1. lowercase the vocabulary as well when lowercase is True or
  2. not allow upper case characters in the vocabulary when lowercase is True

Steps/Code to Reproduce

Expected Results

Actual Results

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def test_count_vectorizer():
    voc = ["A", "B", "C"]
    documents = ["A B C"]

    count_model = CountVectorizer(
        ngram_range=(1, 1),
        vocabulary=voc,
    )
    x = count_model.fit_transform(documents).toarray()
    assert np.array_equal(x, [[1, 1, 1]])  # x is [[0, 0, 0]]; should be [[1, 1, 1]]

Versions

   setuptools: 51.0.0
      sklearn: 0.23.2
        numpy: 1.19.4
        scipy: 1.5.4
       Cython: 0.29.21
       pandas: 1.1.5
   matplotlib: 3.3.3
       joblib: 1.0.0
threadpoolctl: 2.1.0
Built with OpenMP: True
@jnothman
Copy link
Member
jnothman commented Feb 2, 2021

I think the principle when we receive a precomputed vocabulary is that we should avoid any processing of the vocabulary, which may be much larger than any particular document. So I'm not convinced that this is an issue for the library: the user needs to just take care.

@philipp-eisen
Copy link
Author

Hm, I understand where you're coming from. I just feel like this is one of those things that creates a bug in a very silent way and it might be worth the extra computation cost calling .islower() on the entries in the vocabulary in_validate_vocabulary(self) when using lowercase=True.

@jnothman
Copy link
Member
jnothman commented Feb 2, 2021 via email

@philipp-eisen
Copy link
Author

Yeah that's what I meant with the message before. Sorry for not being very clear.

@zitorelova
Copy link
Contributor

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants
0