8000 Text vectorizer with unwanted float for max_idf or min_idf · Issue #20746 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content
Text vectorizer with unwanted float for max_idf or min_idf  #20746
Closed
@maziyank

Description

@maziyank

When working using CountVectorizer and accidentaly set max_df with float greater than 1.0 and give me wrong result. According to documentation we can set max_idf or min_idf with float [0,1] or int when using CountVectorizer.

Steps/Code to Reproduce

Taken from example:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer(analyzer='word', min_df=2, max_df=3)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
# result ['document', 'first']

Suppose, I accidentaly set max_df with float greater than 1.0. This give me different result:

vectorizer = CountVectorizer(analyzer='word', min_df=2, max_df=3.0)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
# result ['document', 'first', 'is', 'the', 'this']

Looking at the source,

   max_doc_count = (max_df
                             if isinstance(max_df, numbers.Integral)
                             else max_df * n_doc)

It seems setting max_df=3.0 will be multiplied by number of docs (4, in this case) and give same result as we directly set max_df=12

Expected Results

Warning or a prevention for user if they set max_df or min_df with value that is not in range [0,1]

Versions

Linux-5.11.0-25-generic-x86_64-with-glibc2.10
Python 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0]
NumPy 1.19.5
SciPy 1.6.2
Scikit-Learn 0.24.2
Imbalanced-Learn 0.7.0

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0