10000 [MRG+2] Bugfix: Clone-safe vectorizers with custom vocabulary by vene · Pull Request #3552 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG+2] Bugfix: Clone-safe vectorizers with custom vocabulary #3552

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

vene
Copy link
Member
@vene vene commented Aug 12, 2014

I ran into the issue summarized by this gist: Basically vectorizers throw away the vocabulary parameter in grid search. This is because the __init__ function of vectorizers doesn't obey the rules required to make it clone-safe. I added a minimal test and fixed it, but it needed some more changes to the tests: many of the tests were relying to vect.vocabulary_ to exist straight after init.

It still seems like the __init__ does a bit more computation than it should with respect to other parameters though.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.02%) when pulling be3ac79 on vene:vectclone into c604ac3 on scikit-learn:master.

self.fixed_vocabulary = True
self.vocabulary_ = dict(vocabulary)
else:
self.fixed_vocabulary = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be self.fixed_vocabulary_?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say yes. This is just the original code moved around, but this is a good opportunity to clean it up a bit.

@arjoly
Copy link
Member
arjoly commented Aug 12, 2014

I would be for an overall clean up. However, I am not familiar with this part of the code.

@larsmans
Copy link
Member

+1 for this, with or without the further cleanup. In fact, I'd like to merge this right now and leave the cleanup for a future PR.

@larsmans larsmans changed the title [MRG] Bugfix: Clone-safe vectorizers with custom vocabulary [MRG+1] Bugfix: Clone-safe vectorizers with custom vocabulary Aug 12, 2014
@vene
Copy link
Member Author
vene commented Aug 12, 2014

Amended to fix the whitespace.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.02%) when pulling b176075 on vene:vectclone into f38c1ca on scikit-learn:master.

@vene
Copy link
Member Author
vene commented Aug 12, 2014

@arjoly I added your suggested change.
@larsmans could you check whether your +1 still holds?

Aside:
I learned a cool thing doing this: hasattr works by calling getattr and catching an AttributeError. So you can easily wrap attributes that might not exist yet in properties, and the property will act like an undefined attribute as long as the underlying call raises an AttributeError. Pretty convenient, I was afraid I'd have to use descriptors to deprecate this cleanly.

Here's what I mean:


In [4]: vect = CountVectorizer()

In [5]: vect.fixed_vocabulary_
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-dcb6e6c6bc1d> in <module>()
----> 1 vect.fixed_vocabulary_

AttributeError: 'CountVectorizer' object has no attribute 'fixed_vocabulary_'

In [6]: hasattr(vect, 'fixed_vocabulary_')
Out[6]: False

In [7]: hasattr(vect, 'fixed_vocabulary')
/Users/vene/code/scikit-learn/sklearn/utils/__init__.py:93: DeprecationWarning: Function fixed_vocabulary is deprecated; The `fixed_vocabulary` attribute is deprecated and will be removed in 0.18.  Please use `fixed_vocabulary_` instead.
  warnings.warn(msg, category=DeprecationWarning)
Out[7]: False

In [8]: vect.fit(["abc"])
Out[8]:
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [9]: hasattr(vect, 'fixed_vocabulary')
/Users/vene/code/scikit-learn/sklearn/utils/__init__.py:93: DeprecationWarning: Function fixed_vocabulary is deprecated; The `fixed_vocabulary` attribute is deprecated and will be removed in 0.18.  Please use `fixed_vocabulary_` instead.
  warnings.warn(msg, category=DeprecationWarning)
Out[9]: True

@larsmans
Copy link
Member

I've used the same hack in the SVM code. I think I actually put up a message to the effect that we should never do this again because it's not obvious what's going on (I learned the same thing about hasattr only by doing)...

@vene
Copy link
Member Author
vene commented Aug 12, 2014

It's not obvious, but I think in this case this is the most straightforward way to do the deprecation. Can you suggest any alternatives?

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.02%) when pulling 0aa278e on vene:vectclone into f38c1ca on scikit-learn:master.

@larsmans
Copy link
Member

Nope. Let's do it, +1 still standing.

@jnothman
Copy link
Member

I learned a cool thing doing this: hasattr works by calling getattr and catching an AttributeError. So you can easily wrap attributes that might not exist yet in properties, and the property will act like an undefined attribute as long as the underlying call raises an AttributeError. Pretty convenient, I was afraid I'd have to use descriptors to deprecate this cleanly.

Apart from the fact that you did use descriptors (in that property constructs one, and descriptors don't provide any facility beyond __get__ to resolve hasattr), I've similarly pointed out that we need this feature to ensure ducktyping works in metaestimators (bug report #1805; patch in #2854 has basically been waiting for review for 14 months).

@jnothman
Copy link
Member

And +1 for the patch. (I don't know whether there are users who will have tried to catch these errors upon construction, and hence moving that validation to fit breaks backwards compatibility, but I think it highly unlikely given the type of validation.)

@mblondel
Copy link
Member

I think I actually put up a message to the effect that we should never do this again because it's not obvious what's going on

I think we shouldn't be afraid of using this technique when it's useful. Duck typing is an essential part of our API.

@mblondel
Copy link
Member

patch in #2854 has basically been waiting for review for 14 months

I think #2854 addresses an important issue but you need to find a consensual solution. @agramfort and @GaelVaroquaux don't like framework code :)

@vene
Copy link
Member Author
vene commented Aug 13, 2014

Apart from the fact that you did use descriptors

Well, you know what I mean :) I mean it's 3 lines of code, I was afraid I'd have to make a "DeprecatedProperty" descriptor class and build an instance of it at the same time that fixed_vocabulary_ gets set.

Implicit things like this are scary. In this case it's good because it ends up doing something expected, it's just that the way in which it gets done is not clear.

@jnothman
Copy link
Member

Yes, that's why I actually did make a descriptor in the end at #2854, to
make it more intrinsically documenting. And apologies, @mblondel, I'd
forgotten that I did get some comments on #2854, but have no sense of
consensus for a solution, where as far as I'm concerned merging something
and airbrushing it later is better than leaving ducktyping broken.

On 13 August 2014 15:46, Vlad Niculae notifications@github.com wrote:

Apart from the fact that you did use descriptors

Well, you know what I mean :) I mean it's 3 lines of code, I was afraid
I'd have to make a "DeprecatedProperty" descriptor class and build an
instance of it at the same time that fixed_vocabulary_ gets set.

Implicit things like this are scary. In this case it's good because it
ends up doing something expected, it's just that the way in which it gets
done is not clear.


Reply to this email directly or view it on GitHub
#3552 (comment)
.

@vene
Copy link
Member Author
vene commented Aug 13, 2014

Also, the code we're talking about in this PR will be gone in 2 releases time.

@ogrisel, any comments on this PR?

@jnothman jnothman changed the title [MRG+1] Bugfix: Clone-safe vectorizers with custom vocabulary [MRG+2] Bugfix: Clone-safe vectorizers with custom vocabulary Aug 13, 2014
@jnothman
Copy link
Member

Unless you particularly want to know what @ogrisel thinks, this PR can be merged.

@vene
Copy link
Member Author
vene commented Aug 13, 2014

Merged by rebase. Let's make a note of revisiting the rest of the vectorizer __init__ code.

@vene vene closed this Aug 13, 2014
@arjoly
Copy link A7FA
Member
arjoly commented Aug 13, 2014

Merged by rebase. Let's make a note of revisiting the rest of the vectorizer init code.

It would be possible to get all these api issues by writing a common test that put random integer and strings to all parameters in the init (the parameter list could be obtain using get_params).

@jnothman
Copy link
Member

Let's make a note of revisiting the rest of the vectorizer init code.

I think it's along the lines of what @arjoly is saying, but we should be
looking for bloated init not just in vectorizer.

On 13 August 2014 19:16, Arnaud Joly notifications@github.com wrote:

Merged by rebase. Let's make a note of revisiting the rest of the
vectorizer * init* code.

It would be possible to get all these api issues by writing a common test
that put random integer and strings to all parameters in the init (the
parameter list could be obtain using get_params).


Reply to this email directly or view it on GitHub
#3552 (comment)
.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants
0