Description
When using CountVectorizer, accidentally setting max_df to a float greater than 1.0 silently gives a wrong result. According to the documentation, max_df and min_df accept either a float in the range [0.0, 1.0] (a proportion of documents) or an int (an absolute document count).
Steps/Code to Reproduce
Taken from the documentation example:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = CountVectorizer(analyzer='word', min_df=2, max_df=3)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
# result ['document', 'first']
Suppose I accidentally set max_df to a float greater than 1.0. This gives a different result:
vectorizer = CountVectorizer(analyzer='word', min_df=2, max_df=3.0)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
# result ['document', 'first', 'is', 'the', 'this']
Looking at the source,
max_doc_count = (max_df
if isinstance(max_df, numbers.Integral)
else max_df * n_doc)
It seems max_df=3.0 is multiplied by the number of documents (4, in this case), so it gives the same result as setting max_df=12 directly.
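As a sanity check (assuming this reading of the source is right), passing the equivalent integer threshold directly reproduces the same output:
vectorizer = CountVectorizer(analyzer='word', min_df=2, max_df=12)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
# result ['document', 'first', 'is', 'the', 'this'] -- same as max_df=3.0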
Expected Results
A warning, or an error that prevents the user from setting max_df or min_df to a float outside the range [0.0, 1.0]. (Integer values greater than 1 are valid, since they are interpreted as absolute document counts.)
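One possible shape for such a check (a minimal sketch, not the library's actual fix; the helper name _check_df_param is hypothetical):
import numbers

def _check_df_param(value, name):
    # Integers are absolute document counts and may legitimately exceed 1;
    # floats are proportions and must lie in [0.0, 1.0].
    if not isinstance(value, numbers.Integral) and not 0.0 <= value <= 1.0:
        raise ValueError(
            "%s is a float, so it must be in the range [0.0, 1.0]; got %r"
            % (name, value)
        )

_check_df_param(3, "max_df")    # OK: interpreted as an absolute count
_check_df_param(3.0, "max_df")  # raises ValueError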
Versions
Linux-5.11.0-25-generic-x86_64-with-glibc2.10
Python 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0]
NumPy 1.19.5
SciPy 1.6.2
Scikit-Learn 0.24.2
Imbalanced-Learn 0.7.0