-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG+2] Bugfix: Clone-safe vectorizers with custom vocabulary #3552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
self.fixed_vocabulary = True | ||
self.vocabulary_ = dict(vocabulary) | ||
else: | ||
self.fixed_vocabulary = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be self.fixed_vocabulary_?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say yes. This is just the original code moved around, but this is a good opportunity to clean it up a bit.
I would be for an overall clean up. However, I am not familiar with this part of the code. |
+1 for this, with or without the further cleanup. In fact, I'd like to merge this right now and leave the cleanup for a future PR. |
Amended to fix the whitespace. |
@arjoly I added your suggested change. Aside: Here's what I mean:
|
I've used the same hack in the SVM code. I think I actually put up a message to the effect that we should never do this again because it's not obvious what's going on (I learned the same thing about |
It's not obvious, but I think in this case this is the most straightforward way to do the deprecation. Can you suggest any alternatives? |
Nope. Let's do it, +1 still standing. |
Apart from the fact that you did use descriptors (in that |
And +1 for the patch. (I don't know whether there are users who will have tried to catch these errors upon construction, and hence moving that validation to |
I think we shouldn't be afraid of using this technique when it's useful. Duck typing is an essential part of our API. |
I think #2854 addresses an important issue but you need to find a consensual solution. @agramfort and @GaelVaroquaux don't like framework code :) |
Well, you know what I mean :) I mean it's 3 lines of code, I was afraid I'd have to make a "DeprecatedProperty" descriptor class and build an instance of it at the same time that Implicit things like this are scary. In this case it's good because it ends up doing something expected, it's just that the way in which it gets done is not clear. |
Yes, that's why I actually did make a descriptor in the end at #2854, to On 13 August 2014 15:46, Vlad Niculae notifications@github.com wrote:
|
Also, the code we're talking about in this PR will be gone in 2 releases time. @ogrisel, any comments on this PR? |
Unless you particularly want to know what @ogrisel thinks, this PR can be merged. |
Merged by rebase. Let's make a note of revisiting the rest of the vectorizer |
It would be possible to get all these api issues by writing a common test that put random integer and strings to all parameters in the init (the parameter list could be obtain using get_params). |
I think it's along the lines of what @arjoly is saying, but we should be On 13 August 2014 19:16, Arnaud Joly notifications@github.com wrote:
|
I ran into the issue summarized by this gist: Basically vectorizers throw away the
vocabulary
parameter in grid search. This is because the__init__
function of vectorizers doesn't obey the rules required to make it clone-safe. I added a minimal test and fixed it, but it needed some more changes to the tests: many of the tests were relying tovect.vocabulary_
to exist straight after init.It still seems like the
__init__
does a bit more computation than it should with respect to other parameters though.