ENH CountVectorizer does not check for lowercase in vocabulary by zitorelova · Pull Request #19401 · scikit-learn/scikit-learn · GitHub

ENH CountVectorizer does not check for lowercase in vocabulary #19401

Merged: 2 commits, Feb 12, 2021

zitorelova
Contributor

Reference Issues/PRs

Fixes #19311

What does this implement/fix? Explain your changes.

When a CountVectorizer is created with lowercase set to True, only the input documents are converted to lowercase during fitting; a user-supplied vocabulary is left as is. This change checks the vocabulary for uppercase characters and warns the user that entries containing them will not match any documents.
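
To illustrate the mismatch described above, here is a minimal pure-Python sketch. It is not the code from this PR or from scikit-learn: the helper names `check_vocabulary_case` and `matched_terms`, the whitespace tokenizer, and the warning text are all assumptions made for the example.

```python
import warnings


def check_vocabulary_case(vocabulary, lowercase=True):
    """Return vocabulary terms containing uppercase characters.

    Hypothetical helper: emits a UserWarning when lowercase is True
    and at least one term has an uppercase character.
    """
    if not lowercase:
        # Documents are not lowercased, so mixed-case terms can still match.
        return []
    offending = [
        term for term in vocabulary if any(ch.isupper() for ch in term)
    ]
    if offending:
        warnings.warn(
            "Uppercase characters found in vocabulary while lowercase is "
            "True; these entries will not match any documents.",
            UserWarning,
        )
    return offending


def matched_terms(document, vocabulary):
    """Return vocabulary terms found in the lowercased document.

    Uses a naive whitespace tokenizer (an assumption for illustration).
    The document is lowercased, the vocabulary is not, which mirrors
    the behavior the description reports.
    """
    tokens = set(document.lower().split())
    return [term for term in vocabulary if term in tokens]
```

With these helpers, `matched_terms("A Sample document", ["Sample", "document"])` finds only `"document"`, because the token becomes `"sample"` while the vocabulary entry stays `"Sample"`. Calling `check_vocabulary_case(["Sample", "document"])` flags `"Sample"` and warns about it.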

@ogrisel (Member) left a comment:

LGTM, thank you very much for the usability improvement.

@rth (Member) left a comment:

Thanks!

@rth merged commit 769da3d into scikit-learn:main on Feb 12, 2021
@zitorelova changed the title from "Fix: CountVectorizer does not check for lowercase in vocabulary" to "ENH CountVectorizer does not check for lowercase in vocabulary" on Feb 12, 2021
@glemaitre mentioned this pull request on Apr 22, 2021 (12 tasks)
@adrinjalali mentioned this pull request on Sep 3, 2021 (6 tasks)
Development

Successfully merging this pull request may close these issues.

CountVectorizer does not lowercase() entries in vocabulary when lowercase is set to True
3 participants