Description
When using CountVectorizer, accidentally setting max_df to a float greater than 1.0 silently gives a wrong result. According to the documentation, max_df and min_df accept either a float in the range [0.0, 1.0] (a proportion of documents) or an int (an absolute document count).
Steps/Code to Reproduce
Taken from the documentation example:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = CountVectorizer(analyzer='word', min_df=2, max_df=3)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
# result ['document', 'first']
Suppose I accidentally set max_df to a float greater than 1.0. This gives a different result:
vectorizer = CountVectorizer(analyzer='word', min_df=2, max_df=3.0)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
# result ['document', 'first', 'is', 'the', 'this']
Looking at the source,
max_doc_count = (max_df
if isinstance(max_df, numbers.Integral)
else max_df * n_doc)
It seems max_df=3.0 is multiplied by the number of documents (4, in this case), so it gives the same result as setting max_df=12 directly.
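As a sanity check (assuming this reading of the source is right), passing the equivalent integer threshold directly reproduces the same output:
vectorizer = CountVectorizer(analyzer='word', min_df=2, max_df=12)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
# result ['document', 'first', 'is', 'the', 'this'] -- same as max_df=3.0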
Expected Results
A warning, or an error that prevents the user from setting max_df or min_df to a float outside the range [0.0, 1.0]. (Integer values greater than 1 are valid, since they are interpreted as absolute document counts.)
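One possible shape for such a check (a minimal sketch, not the library's actual fix; the helper name _check_df_param is hypothetical):
import numbers

def _check_df_param(value, name):
    # Integers are absolute document counts and may legitimately exceed 1;
    # floats are proportions and must lie in [0.0, 1.0].
    if not isinstance(value, numbers.Integral) and not 0.0 <= value <= 1.0:
        raise ValueError(
            "%s is a float, so it must be in the range [0.0, 1.0]; got %r"
            % (name, value)
        )

_check_df_param(3, "max_df")    # OK: interpreted as an absolute count
_check_df_param(3.0, "max_df")  # raises ValueError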
Versions
Linux-5.11.0-25-generic-x86_64-with-glibc2.10
Python 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0]
NumPy 1.19.5
SciPy 1.6.2
Scikit-Learn 0.24.2
Imbalanced-Learn 0.7.0