ENH CountVectorizer does not check for lowercase in vocabulary #19401

zitorelova · 2021-02-08T16:53:49Z

Reference Issues/PRs

Fixes #19311

What does this implement/fix? Explain your changes.

When CountVectorizer's lowercase parameter is True, the input documents are converted to lowercase during fitting, but a user-supplied vocabulary is left as is. This change checks the vocabulary for uppercase characters and warns the user that entries containing uppercase characters will not match any document.
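The check described above can be illustrated with a small standalone sketch. This is not the diff from this PR: the helper name `warn_on_uppercase_vocabulary`, its return value, and the exact warning text are assumptions made for illustration, and the sketch uses only the standard library rather than scikit-learn itself.

```python
import warnings


def warn_on_uppercase_vocabulary(vocabulary, lowercase=True):
    """Hypothetical helper (not scikit-learn API).

    Warn when a fixed vocabulary contains entries with uppercase
    characters while documents would be lowercased, and return
    those entries.
    """
    # Nothing to check if documents are not lowercased or no
    # vocabulary was supplied.
    if not lowercase or vocabulary is None:
        return []

    # An entry is a problem if any of its characters is uppercase:
    # the lowercased documents can never produce that token.
    offending = [
        term for term in vocabulary if any(ch.isupper() for ch in term)
    ]
    if offending:
        warnings.warn(
            "Upper case characters found in vocabulary while "
            "'lowercase' is True. These entries will not match "
            "any documents: %r" % offending,
            UserWarning,
        )
    return offending


# Usage: capture the warning instead of printing it to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = warn_on_uppercase_vocabulary(
        ["Sample", "Upper", "case", "vocabulary"]
    )
```

Here `flagged` holds the two capitalized entries and `caught` holds a single `UserWarning`. Because the sketch iterates over the vocabulary, it accepts a list or the keys of a mapping from terms to indices.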

ogrisel

LGTM, thank you very much for the usability improvement.

rth

Thanks!

zitorelova added 2 commits February 7, 2021 19:35

Check vocab for uppercase chars when lowercase is True

61209a3

Add test to make sure proper warning is thrown

9367641

github-actions bot added the module:feature_extraction label Feb 8, 2021

ogrisel approved these changes Feb 8, 2021

rth approved these changes Feb 12, 2021

rth merged commit 769da3d into scikit-learn:main Feb 12, 2021

zitorelova changed the title ~~Fix: CountVectorizer does not check for lowercase in vocabulary~~ ENH CountVectorizer does not check for lowercase in vocabulary Feb 12, 2021

glemaitre mentioned this pull request Apr 22, 2021

Release 0.24.2 #19954

Merged

12 tasks

adrinjalali mentioned this pull request Sep 3, 2021

Missing what's new (v1.0) #20925

Closed

6 tasks

ogrisel mentioned this pull request Sep 6, 2021

MAINT missing what's new entry for PR-19401 #20955

Merged

sobayed mentioned this pull request Oct 5, 2021

CountVectorizer.transform() much slower since version 1.0 #21242

Closed

SekouDiaoNlp mentioned this pull request Oct 28, 2021

mlconjug3 v3.8.0 Ars-Linguistica/mlconjug3#299

Merged

eddiebergman mentioned this pull request Nov 15, 2022

Update scikit learn 1.2 automl/auto-sklearn#1611

Closed

54 tasks

3 participants
