8000 [MRG+2] Bugfix: Clone-safe vectorizers with custom vocabulary by vene · Pull Request #3552 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG+2] Bugfix: Clone-safe vectorizers with custom vocabulary #3552

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

vene
Copy link
Member
@vene vene commented Aug 12, 2014

I ran into the issue summarized by this gist: Basically vectorizers throw away the vocabulary parameter in grid search. This is because the __init__ function of vectorizers doesn't obey the rules required to make it clone-safe. I added a minimal test and fixed it, but it needed some more changes to the tests: many of the tests were relying to vect.vocabulary_ to exist straight after init.

It still seems like the __init__ does a bit more computation than it should with respect to other parameters though.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.02%) when pulling be3ac79 on vene:vectclone into c604ac3 on scikit-learn:master.

self.fixed_vocabulary = True
self.vocabulary_ = dict(vocabulary)
else:
self.fixed_vocabulary = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be self.fixed_vocabulary_?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say yes. This is just the original code moved around, but this is a good opportunity to clean it up a bit.

@arjoly
Copy link
Member
arjoly commented Aug 12, 2014

I would be for an overall clean up. However, I am not familiar with this part of the code.

@larsmans
Copy link
Member

+1 for this, with or without the further cleanup. In fact, I'd like to merge this right now and leave the cleanup for a future PR.

@larsmans larsmans changed the title [MRG] Bugfix: Clone-safe vectorizers with custom vocabulary [MRG+1] Bugfix: Clone-safe vectorizers with custom vocabulary Aug 12, 2014
@vene
Copy link
Member Author
vene commented Aug 12, 2014

Amended to fix the whitespace.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.02%) when pulling b176075 on vene:vectclone into f38c1ca on scikit-learn:master.

@vene
Copy link
Member Author
vene commented Aug 12, 2014

@arjoly I added your suggested change.
@larsmans could you check whether your +1 still holds?

Aside:
I learned a cool thing doing this: hasattr works by calling getattr and catching an AttributeError. So you can easily wrap attributes that might not exist yet in properties, and the property will act like an undefined attribute as long as the underlying call raises an AttributeError. Pretty convenient, I was afraid I'd have to use descriptors to deprecate this cleanly.

Here's what I mean:


In [4]: vect = CountVectorizer()

In [5]: vect.fixed_vocabulary_
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-dcb6e6c6bc1d> in <module>()
----> 1 vect.fixed_vocabulary_

AttributeError: 'CountVectorizer' object has no attribute 'fixed_vocabulary_'

In [6]: hasattr(vect, 'fixed_vocabulary_')
Out[6]: False

In [7]: hasattr(vect, 'fixed_vocabulary')
/Users/vene/code/scikit-learn/sklearn/utils/__init__.py:93: DeprecationWarning: Function fixed_vocabulary is deprecated; The `fixed_vocabulary` attribute is deprecated and will be removed in 0.18.  Please use `fixed_vocabulary_` instead.
  warnings.warn(msg, category=DeprecationWarning)
Out[7]: False

In [8]: vect.fit(["abc"])
Out[8]:
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [9]: hasattr(vect, 'fixed_vocabulary')
/Users/vene/code/scikit-learn/sklearn/utils/__init__.py:93: DeprecationWarning: Function fixed_vocabulary is deprecated; The `fixed_vocabulary` attribute is deprecated and will be removed in 0.18.  Please use `fixed_vocabulary_` instead.
  warnings.warn(msg, category=DeprecationWarning)
Out[9]: True

@larsmans
Copy link
Member

I've used the same hack in the SVM code. I think I actually put up a message to the effect that we should never do this again because it's not obvious what's going on (I learned the same thing about hasattr only by doing)...

@vene
Copy link
Member Author
vene commented Aug 12, 2014

It's not obvious, but I think in this case this is the most straightforward way to do the deprecation. Can you suggest any alternatives?

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.02%) when pulling 0aa278e on vene:vectclone into f38c1ca on scikit-learn:master.

@larsmans
Copy link
Member

Nope. Let's do it, +1 still standing.

@jnothman
Copy link
Member

I learned a cool thing doing this: hasattr works by calling getattr and catching an AttributeError. So you can easily wrap attributes that might not exist yet in properties, and the property will act like an undefined attribute as long as the underlying call raises an AttributeError. Pretty convenient, I was afraid I'd have to use descriptors to deprecate this cleanly.

Apart from the fact that you did use descriptors (in that property constructs one, and descriptors don't provide any facility beyond __get__ to resolve hasattr), I've similarly pointed out that we need this feature to ensure ducktyping works in metaestimators (bug report #1805; patch in #2854 has basically been waiting for review for 14 months).

@jnothman
Copy link
Member

And +1 for the patch. (I don't know whether there are users who will have tried to catch these errors upon construction, and hence moving that validation to fit breaks backwards compatibility, but I think it highly unlikely given the type of validation.)