CountVectorizer sets self.vocabulary_ in transform #14559

amueller · 2019-08-02T19:36:04Z

Right now CountVectorizer sometimes sets self.vocabulary_ outside of fit. We usually prohibit this, but the common tests haven't reached the vectorizers yet.


jnothman · 2019-08-03T22:11:53Z

Hmmm... Not ideal. Should it just validate and use `vocabulary` if fit was never called? It should probably then have the tag requires_fit=False if vocabulary is provided...
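A minimal sketch of that suggestion, for illustration only. `ToyCountVectorizer` and `_provided_vocabulary` are hypothetical names, not scikit-learn code, and the estimator tag is not modeled here. The idea shown is that `transform` resolves the user-supplied `vocabulary` into a local variable instead of storing it on `self`, so only `fit` ever sets `vocabulary_`.

```python
from collections.abc import Mapping


class ToyCountVectorizer:
    """Hypothetical stand-in, NOT scikit-learn's CountVectorizer."""

    def __init__(self, vocabulary=None):
        self.vocabulary = vocabulary

    def _provided_vocabulary(self):
        # Normalize the constructor parameter to a term -> index dict
        # without storing anything on self.
        vocab = self.vocabulary
        if not isinstance(vocab, Mapping):
            vocab = {term: i for i, term in enumerate(vocab)}
        return dict(vocab)

    def fit(self, raw_documents, y=None):
        # fit is the only place where the fitted attribute is set.
        if self.vocabulary is not None:
            self.vocabulary_ = self._provided_vocabulary()
        else:
            terms = sorted({t for doc in raw_documents for t in doc.split()})
            self.vocabulary_ = {t: i for i, t in enumerate(terms)}
        return self

    def transform(self, raw_documents):
        # Use the fitted vocabulary if present, else the provided one,
        # but keep it in a local variable either way.
        if hasattr(self, "vocabulary_"):
            vocab = self.vocabulary_
        elif self.vocabulary is not None:
            vocab = self._provided_vocabulary()
        else:
            raise RuntimeError("Vocabulary not fitted or provided")
        rows = []
        for doc in raw_documents:
            counts = [0] * len(vocab)
            for token in doc.split():
                if token in vocab:
                    counts[vocab[token]] += 1
            rows.append(counts)
        return rows


vec = ToyCountVectorizer(vocabulary=["apple", "banana"])
X = vec.transform(["apple banana apple"])
print(X, hasattr(vec, "vocabulary_"))
```

In this sketch, calling `transform` on an unfitted instance with a provided vocabulary works and leaves the instance without a `vocabulary_` attribute. The cost is that the vocabulary is re-normalized on every `transform` call until `fit` is called.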

amueller · 2019-08-05T15:46:19Z

yes, both of these sound reasonable to me.

amueller · 2019-11-02T21:09:28Z

@thomasjpfan do you remember someone (you? sprint?) working on this? I have some vague memories.

thomasjpfan · 2019-11-06T22:51:25Z

We may have spoken about this IRL regarding SparseCoder and self.components_?

amueller · 2019-11-06T23:59:58Z

hm ok might have also been me. Either way, seems open?

SultanOrazbayev · 2021-10-23T11:06:37Z

This issue doesn't seem to be relevant anymore?

There are two lines where self.vocabulary_ is set, one is inside .fit_transform (seems OK):

scikit-learn/sklearn/feature_extraction/text.py

Lines 1286 to 1354 in ee5fddc

    
    def fit_transform(self, raw_documents, y=None):
        """Learn the vocabulary dictionary and return document-term matrix.

        This is equivalent to fit followed by transform, but more efficiently
        implemented.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which generates either str, unicode or file objects.

        y : None
            This parameter is ignored.

        Returns
        -------
        X : array of shape (n_samples, n_features)
            Document-term matrix.
        """
        # We intentionally don't call the transform method to make
        # fit_transform overridable without unwanted side effects in
        # TfidfVectorizer.
        if isinstance(raw_documents, str):
            raise ValueError(
                "Iterable over raw text documents expected, string object received."
            )

        self._validate_params()
        self._validate_vocabulary()
        max_df = self.max_df
        min_df = self.min_df
        max_features = self.max_features

        if self.fixed_vocabulary_ and self.lowercase:
            for term in self.vocabulary:
                if any(map(str.isupper, term)):
                    warnings.warn(
                        "Upper case characters found in"
                        " vocabulary while 'lowercase'"
                        " is True. These entries will not"
                        " be matched with any documents"
                    )
                    break

        vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary_)

        if self.binary:
            X.data.fill(1)

        if not self.fixed_vocabulary_:
            n_doc = X.shape[0]
            max_doc_count = (
                max_df if isinstance(max_df, numbers.Integral) else max_df * n_doc
            )
            min_doc_count = (
                min_df if isinstance(min_df, numbers.Integral) else min_df * n_doc
            )
            if max_doc_count < min_doc_count:
                raise ValueError("max_df corresponds to < documents than min_df")
            if max_features is not None:
                X = self._sort_features(X, vocabulary)
            X, self.stop_words_ = self._limit_features(
                X, vocabulary, max_doc_count, min_doc_count, max_features
            )
            if max_features is None:
                X = self._sort_features(X, vocabulary)
            self.vocabulary_ = vocabulary

        return X

and one is inside _validate_vocabulary where it's re-set (which also seems OK):

scikit-learn/sklearn/feature_extraction/text.py

Lines 463 to 491 in ee5fddc

    
    def _validate_vocabulary(self):
        vocabulary = self.vocabulary
        if vocabulary is not None:
            if isinstance(vocabulary, set):
                vocabulary = sorted(vocabulary)
            if not isinstance(vocabulary, Mapping):
                vocab = {}
                for i, t in enumerate(vocabulary):
                    if vocab.setdefault(t, i) != i:
                        msg = "Duplicate term in vocabulary: %r" % t
                        raise ValueError(msg)
                vocabulary = vocab
            else:
                indices = set(vocabulary.values())
                if len(indices) != len(vocabulary):
                    raise ValueError("Vocabulary contains repeated indices.")
                for i in range(len(vocabulary)):
                    if i not in indices:
                        msg = "Vocabulary of size %d doesn't contain index %d." % (
                            len(vocabulary),
                            i,
                        )
                        raise ValueError(msg)
            if not vocabulary:
                raise ValueError("empty vocabulary passed to fit")
            self.fixed_vocabulary_ = True
            self.vocabulary_ = dict(vocabulary)
        else:
            self.fixed_vocabulary_ = False

amueller · 2021-10-23T14:36:11Z

It's set in _count_vocab which is called in transform, right?

SultanOrazbayev · 2021-10-23T15:07:20Z

Hmm, I'm not sure, because it seems to create vocabulary variable but doesn't alter self.vocabulary_... but I'm not sure...
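One way to settle a question like this empirically, rather than by reading the call chain, is to compare the instance attributes before and after the call. The sketch below is only an illustration: `attributes_set_by` and `ToyVectorizer` are hypothetical names, and the toy class merely imitates the pattern under discussion (a helper invoked from `transform` that assigns `vocabulary_`); it is not scikit-learn's implementation.

```python
def attributes_set_by(obj, method_name, *args, **kwargs):
    """Return the names of instance attributes added by calling a method."""
    before = set(vars(obj))
    getattr(obj, method_name)(*args, **kwargs)
    return sorted(set(vars(obj)) - before)


class ToyVectorizer:
    """Hypothetical class whose transform sets state through a helper."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def _validate(self):
        # The assignment is hidden in a helper, so it is easy to miss
        # when only the body of transform is inspected.
        self.vocabulary_ = {t: i for i, t in enumerate(self.vocabulary)}

    def transform(self, docs):
        if not hasattr(self, "vocabulary_"):
            self._validate()
        return [[doc.split().count(t) for t in self.vocabulary_] for doc in docs]


added = attributes_set_by(ToyVectorizer(["a", "b"]), "transform", ["a b a"])
print(added)
```

The same helper could be pointed at an unfitted `CountVectorizer(vocabulary=[...])` with `"transform"` to see which attributes, if any, appear. Note that it only detects newly added attributes, not ones that already existed and were overwritten.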

amueller mentioned this issue Aug 2, 2019

[MRG] simplify check_is_fitted to use any fitted attributes #14545

Merged

thomasjpfan added the API label Oct 26, 2019

amueller added the help wanted label Nov 7, 2019

cmarmo added the module:feature_extraction label Mar 23, 2022

thomasjpfan added this to Quansight's scikit-learn Project Board Apr 22, 2022

thomasjpfan moved this to Todo📬 in Quansight's scikit-learn Project Board Apr 22, 2022
