[MRG+2] Bugfix: Clone-safe vectorizers with custom vocabulary #3552

vene · 2014-08-12T12:49:36Z

I ran into the issue summarized by this gist: basically, vectorizers throw away the vocabulary parameter in grid search. This is because the __init__ method of the vectorizers doesn't obey the rules required to make it clone-safe. I added a minimal test and fixed it, but the fix needed some more changes to the tests: many of them relied on vect.vocabulary_ existing straight after init.

It still seems like the __init__ does a bit more computation than it should with respect to other parameters though.
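For readers unfamiliar with the clone-safety rule, here is a minimal sketch of the failure mode. It does not use scikit-learn: `BaseSketch`, `clone`, `UnsafeVectorizer` and `SafeVectorizer` are hypothetical stand-ins, and the assumption is simply that cloning rebuilds an object from whatever `get_params()` reads back from same-named attributes.

```python
import copy


class BaseSketch:
    """Tiny stand-in for an estimator base class (not scikit-learn's)."""

    _param_names = ("vocabulary",)

    def get_params(self):
        # Read each constructor parameter back from an attribute of the same name.
        return {name: getattr(self, name, None) for name in self._param_names}


def clone(estimator):
    """Simplified clone: rebuild the object from its constructor parameters."""
    params = copy.deepcopy(estimator.get_params())
    return type(estimator)(**params)


class UnsafeVectorizer(BaseSketch):
    def __init__(self, vocabulary=None):
        # Derived state is computed in __init__ and the parameter itself is
        # never stored, so get_params() reports vocabulary=None.
        if vocabulary is not None:
            self.vocabulary_ = dict(vocabulary)


class SafeVectorizer(BaseSketch):
    def __init__(self, vocabulary=None):
        # __init__ only records the argument, untouched.
        self.vocabulary = vocabulary

    def fit(self, raw_documents=None):
        # Derived attributes are deferred to fit (learning a vocabulary
        # from raw_documents is omitted in this sketch).
        if self.vocabulary is not None:
            self.vocabulary_ = dict(self.vocabulary)
        return self


vocab = {"abc": 0, "def": 1}
unsafe_clone = clone(UnsafeVectorizer(vocabulary=vocab))
safe_clone = clone(SafeVectorizer(vocabulary=vocab))

print(hasattr(unsafe_clone, "vocabulary_"))  # the vocabulary was lost in the clone
print(safe_clone.vocabulary)                 # the vocabulary survived the clone
```

Anything that clones before fitting, grid search included, would see the first behaviour with an `__init__` written like `UnsafeVectorizer`.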

coveralls · 2014-08-12T13:01:43Z

Coverage decreased (-0.02%) when pulling be3ac79 on vene:vectclone into c604ac3 on scikit-learn:master.

arjoly · 2014-08-12T13:06:43Z

sklearn/feature_extraction/text.py

+            self.fixed_vocabulary = True
+            self.vocabulary_ = dict(vocabulary)
+        else:
+            self.fixed_vocabulary = False


Should it be self.fixed_vocabulary_?

I'd say yes. This is just the original code moved around, but this is a good opportunity to clean it up a bit.

arjoly · 2014-08-12T13:20:39Z

I would be for an overall clean up. However, I am not familiar with this part of the code.

larsmans · 2014-08-12T18:36:43Z

+1 for this, with or without the further cleanup. In fact, I'd like to merge this right now and leave the cleanup for a future PR.

vene · 2014-08-12T20:23:44Z

Amended to fix the whitespace.

coveralls · 2014-08-12T20:37:59Z

Coverage decreased (-0.02%) when pulling b176075 on vene:vectclone into f38c1ca on scikit-learn:master.

vene · 2014-08-12T21:02:59Z

@arjoly I added your suggested change.
@larsmans could you check whether your +1 still holds?

Aside:
I learned a cool thing doing this: hasattr works by calling getattr and catching an AttributeError. So you can easily wrap attributes that might not exist yet in properties, and the property will act like an undefined attribute as long as the underlying call raises an AttributeError. Pretty convenient, I was afraid I'd have to use descriptors to deprecate this cleanly.

Here's what I mean:


In [4]: vect = CountVectorizer()

In [5]: vect.fixed_vocabulary_
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-dcb6e6c6bc1d> in <module>()
----> 1 vect.fixed_vocabulary_

AttributeError: 'CountVectorizer' object has no attribute 'fixed_vocabulary_'

In [6]: hasattr(vect, 'fixed_vocabulary_')
Out[6]: False

In [7]: hasattr(vect, 'fixed_vocabulary')
/Users/vene/code/scikit-learn/sklearn/utils/__init__.py:93: DeprecationWarning: Function fixed_vocabulary is deprecated; The `fixed_vocabulary` attribute is deprecated and will be removed in 0.18.  Please use `fixed_vocabulary_` instead.
  warnings.warn(msg, category=DeprecationWarning)
Out[7]: False

In [8]: vect.fit(["abc"])
Out[8]:
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [9]: hasattr(vect, 'fixed_vocabulary')
/Users/vene/code/scikit-learn/sklearn/utils/__init__.py:93: DeprecationWarning: Function fixed_vocabulary is deprecated; The `fixed_vocabulary` attribute is deprecated and will be removed in 0.18.  Please use `fixed_vocabulary_` instead.
  warnings.warn(msg, category=DeprecationWarning)
Out[9]: True
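The pattern behind that session can be sketched in a few lines. This is an illustration only, not the code in the PR: `VectorizerSketch` is a hypothetical class, and the warning text is shortened. Note that in Python 3 `hasattr` swallows only `AttributeError`, so the getter must let exactly that exception propagate.

```python
import warnings


class VectorizerSketch:
    def fit(self, raw_documents):
        # The trailing-underscore attribute only exists after fit.
        self.fixed_vocabulary_ = False
        return self

    @property
    def fixed_vocabulary(self):
        warnings.warn(
            "`fixed_vocabulary` is deprecated; use `fixed_vocabulary_` instead.",
            DeprecationWarning,
        )
        # Before fit this lookup raises AttributeError, which propagates out of
        # the property, so hasattr(obj, "fixed_vocabulary") reports False.
        return self.fixed_vocabulary_


vect = VectorizerSketch()
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    before = hasattr(vect, "fixed_vocabulary")
    vect.fit(["abc"])
    after = hasattr(vect, "fixed_vocabulary")

print(before, after)
```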

larsmans · 2014-08-12T21:10:13Z

I've used the same hack in the SVM code. I think I actually put up a message to the effect that we should never do this again because it's not obvious what's going on (I learned the same thing about hasattr only by doing)...

vene · 2014-08-12T21:14:05Z

It's not obvious, but I think in this case this is the most straightforward way to do the deprecation. Can you suggest any alternatives?

coveralls · 2014-08-12T21:17:26Z

Coverage decreased (-0.02%) when pulling 0aa278e on vene:vectclone into f38c1ca on scikit-learn:master.

larsmans · 2014-08-12T21:52:09Z

Nope. Let's do it, +1 still standing.

jnothman · 2014-08-12T22:51:30Z

I learned a cool thing doing this: hasattr works by calling getattr and catching an AttributeError. So you can easily wrap attributes that might not exist yet in properties, and the property will act like an undefined attribute as long as the underlying call raises an AttributeError. Pretty convenient, I was afraid I'd have to use descriptors to deprecate this cleanly.

Apart from the fact that you did use descriptors (in that property constructs one, and descriptors don't provide any facility beyond __get__ to resolve hasattr), I've similarly pointed out that we need this feature to ensure ducktyping works in metaestimators (bug report #1805; patch in #2854 has basically been waiting for review for 14 months).

jnothman · 2014-08-12T22:54:01Z

And +1 for the patch. (I don't know whether there are users who will have tried to catch these errors upon construction, and hence moving that validation to fit breaks backwards compatibility, but I think it highly unlikely given the type of validation.)

mblondel · 2014-08-13T04:14:55Z

I think I actually put up a message to the effect that we should never do this again because it's not obvious what's going on

I think we shouldn't be afraid of using this technique when it's useful. Duck typing is an essential part of our API.

mblondel · 2014-08-13T04:20:43Z

patch in #2854 has basically been waiting for review for 14 months

I think #2854 addresses an important issue, but you need to find a solution that has consensus. @agramfort and @GaelVaroquaux don't like framework code :)

vene · 2014-08-13T05:46:28Z

Apart from the fact that you did use descriptors

Well, you know what I mean :) I mean it's 3 lines of code, I was afraid I'd have to make a "DeprecatedProperty" descriptor class and build an instance of it at the same time that fixed_vocabulary_ gets set.

Implicit things like this are scary. In this case it's good because it ends up doing something expected, it's just that the way in which it gets done is not clear.
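For comparison, the more explicit alternative mentioned above might look roughly like the following. This is a sketch under assumptions: `DeprecatedAlias` and `DescriptorVectorizerSketch` are hypothetical names, and this is neither the code in this PR nor the descriptor from #2854.

```python
import warnings


class DeprecatedAlias:
    """Hypothetical descriptor: forwards an old attribute name to a new one."""

    def __init__(self, old_name, new_name):
        self.old_name = old_name
        self.new_name = new_name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # accessed on the class itself
        warnings.warn(
            "`%s` is deprecated; use `%s` instead." % (self.old_name, self.new_name),
            DeprecationWarning,
        )
        # getattr raises AttributeError while the new attribute is unset,
        # which keeps hasattr() on the old name consistent with the new one.
        return getattr(instance, self.new_name)


class DescriptorVectorizerSketch:
    fixed_vocabulary = DeprecatedAlias("fixed_vocabulary", "fixed_vocabulary_")

    def fit(self, raw_documents):
        self.fixed_vocabulary_ = True
        return self
```

The behaviour is the same as with a plain property; the difference is that the class name documents the intent, at the cost of more code.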

jnothman · 2014-08-13T05:53:54Z

Yes, that's why I actually did make a descriptor in the end at #2854, to make it more intrinsically documenting. And apologies, @mblondel, I'd forgotten that I did get some comments on #2854, but I have no sense of consensus for a solution; as far as I'm concerned, merging something and airbrushing it later is better than leaving ducktyping broken.

vene · 2014-08-13T06:03:14Z

Also, the code we're talking about in this PR will be gone in two releases' time.

@ogrisel, any comments on this PR?

jnothman · 2014-08-13T06:16:10Z

Unless you particularly want to know what @ogrisel thinks, this PR can be merged.

vene · 2014-08-13T08:57:24Z

Merged by rebase. Let's make a note of revisiting the rest of the vectorizer __init__ code.

arjoly · 2014-08-13T09:16:33Z

Merged by rebase. Let's make a note of revisiting the rest of the vectorizer __init__ code.

It would be possible to catch all these API issues by writing a common test that passes random integers and strings to every parameter of __init__ (the parameter list could be obtained using get_params).
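A rough sketch of what such a common test could look like. Everything here is hypothetical: the helper `check_init_stores_params_untouched` and both example classes are made up for illustration, and the parameter list is read with `inspect.signature` instead of `get_params` so that the sketch stays self-contained.

```python
import inspect


def check_init_stores_params_untouched(estimator_class):
    """Hypothetical check: __init__ must accept anything and only store it."""
    signature = inspect.signature(estimator_class.__init__)
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    names = [
        name
        for name, param in signature.parameters.items()
        if name != "self" and param.kind in keyword_kinds
    ]
    for junk in (12345, "not-a-valid-setting"):
        params = {name: junk for name in names}
        # Must not raise: validation belongs in fit, not in __init__.
        estimator = estimator_class(**params)
        for name in names:
            # Each argument should be stored as-is under its own name.
            assert getattr(estimator, name, None) is junk, name


class PassiveEstimator:
    def __init__(self, alpha=1.0, vocabulary=None):
        self.alpha = alpha
        self.vocabulary = vocabulary


class EagerEstimator:
    def __init__(self, vocabulary=None):
        # Converts eagerly, so junk input fails during construction.
        self.vocabulary = dict(vocabulary) if vocabulary is not None else None


check_init_stores_params_untouched(PassiveEstimator)  # passes silently
```

Running the same check on `EagerEstimator` raises, which is the kind of API issue the test is meant to surface.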

jnothman · 2014-08-13T10:06:12Z

Let's make a note of revisiting the rest of the vectorizer __init__ code.

I think it's along the lines of what @arjoly is saying, but we should be looking for bloated __init__ methods everywhere, not just in the vectorizers.

arjoly reviewed Aug 12, 2014

larsmans changed the title ~~[MRG] Bugfix: Clone-safe vectorizers with custom vocabulary~~ [MRG+1] Bugfix: Clone-safe vectorizers with custom vocabulary Aug 12, 2014

FIX set vectorizer vocabulary outside of init

b176075

Deprecate vectorizer fixed_vocabulary attribute

0aa278e

jnothman changed the title ~~[MRG+1] Bugfix: Clone-safe vectorizers with custom vocabulary~~ [MRG+2] Bugfix: Clone-safe vectorizers with custom vocabulary Aug 13, 2014

vene closed this Aug 13, 2014

justhalf mentioned this pull request Nov 10, 2014

CountVectorizer's vocabulary gets copied by clone #3844

Closed
