[MRG + 1] Fix perplexity method by adding _unnormalized_transform method, Issue #7954 #7992
Conversation
Thank you for the fix. Can you check on the failing tests? I think we don't want to add to the public API for this (imho), and I don't think we want to store the state in a private variable. I think my preferred solution would be to add a private method.
@amueller I really like your solution, which is not something I would have come up with on my own. I'll work on implementing your suggestion. Thanks for the advice! FYI, the failing tests are due to the fact that I've changed the API, which, as you've suggested, is not ideal. Specifically, the tests failed because the fit_transform method implemented in TransformerMixin had no way to pass the normalize argument when it called transform.
Ok, let's see how it goes when you implement the suggestion :)
I've implemented an `_unnormalized_transform` method as suggested. One remaining question: what should we do with the `doc_topic_distr` argument of `perplexity`, now that users can no longer obtain a valid unnormalized distribution to pass in?
That's a good question and I don't really have a straight-forward answer. We could deprecate passing the precomputed likelihood in the public interface. It might have been that it was only added there for use in the training algorithm, but I'm not sure. Alternatively, we could add a check that throws a warning if someone passes normalized probabilities to `perplexity`.
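The second option mentioned here could look roughly like the following sketch; the helper name, tolerance, and message are illustrative and not part of scikit-learn:

```python
import warnings
import numpy as np

def _warn_if_normalized(doc_topic_distr, tol=1e-8):
    # Illustrative helper (not scikit-learn code): warn when every row sums
    # to 1, which suggests the caller passed normalized probabilities rather
    # than the raw likelihoods the perplexity bound needs.
    row_sums = np.asarray(doc_topic_distr).sum(axis=1)
    if np.allclose(row_sums, 1.0, atol=tol):
        warnings.warn("doc_topic_distr appears to be normalized; "
                      "perplexity expects unnormalized values.")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    _warn_if_normalized(np.array([[0.2, 0.8], [0.5, 0.5]]))  # warns
    _warn_if_normalized(np.array([[3.0, 7.0], [2.0, 2.0]]))  # silent
print(len(caught))  # 1
```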
I'm inclined to agree with deprecating the optional `doc_topic_distr` argument. How should I proceed with this pull request?
Not sure what you mean. You can update the branch with the appropriate changes.
…ent topic distribution
… use _unnormalized_transform to set distr
So that is ok as a fix, but now the API is a bit weird, as you agreed, so I think you should go ahead and deprecate `doc_topic_distr`.
Sounds good, I'll get to it in the next day or so.
Should this now be MRG?
I wasn't planning to make any more updates unless you or @amueller had further suggestions. So in my biased opinion, it's ready to merge :)
@@ -719,3 +739,21 @@ def perplexity(self, X, doc_topic_distr=None, sub_sampling=False):

        perword_bound = bound / word_cnt

        return np.exp(-1.0 * perword_bound)

    def perplexity(self, X, sub_sampling=False):
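For context, the quantity at the end of this hunk is just exp(-1 × log-likelihood per word); a tiny numpy illustration of that relationship, with made-up numbers:

```python
import numpy as np

# perword_bound = bound / word_cnt; perplexity = exp(-perword_bound),
# matching the diff above. The values here are hypothetical.
bound = -1200.0  # total log-likelihood bound of the corpus
word_cnt = 400   # total word count
perword_bound = bound / word_cnt            # -3.0
perplexity = np.exp(-1.0 * perword_bound)   # exp(3.0)
print(round(float(perplexity), 2))  # 20.09
```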
You need to deprecate the `doc_topic_distr` parameter, you can't just remove it. It should raise a deprecation warning and be ignored, I think (because the results will likely be incorrect). All public API must remain the same between releases unless there was a deprecation before.
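The sentinel-based pattern being requested might be sketched like this (illustrative code, not the actual scikit-learn implementation):

```python
import warnings

def perplexity(X, doc_topic_distr='deprecated', sub_sampling=False):
    # Sketch only: a string sentinel distinguishes "not passed" from any
    # real value, so passing anything (even None) triggers the warning,
    # and the argument is then ignored.
    if doc_topic_distr != 'deprecated':
        warnings.warn("Argument 'doc_topic_distr' is deprecated and is "
                      "ignored.", DeprecationWarning)
    return 0.0  # placeholder for the real computation

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    perplexity([[1, 2]])                        # no warning
    perplexity([[1, 2]], doc_topic_distr=None)  # warns
print(len(caught))  # 1
```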
Ok, makes sense. Sorry for the confusion.
""" | ||
|
||
if doc_topic_distr is not None: | ||
DeprecationWarning("Argument 'doc_topic_distr' is deprecated as " |
Maybe say explicitly that it is ignored.
Please add a regression test that checks that the perplexity is now computed correctly. I think the best way would be to check against the one computed during fit, but I'm not sure. Also, please add an entry to whatsnew, and add a `.. versionchanged::` note to the docstring.
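A rough shape for such a regression test, assuming a recent scikit-learn where the fix is in place (the real test would compare against quantities computed during fit; here we only sanity-check the public method on synthetic data):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
X = rng.randint(1, 5, size=(20, 10))  # small synthetic document-word matrix

lda = LatentDirichletAllocation(n_components=3, max_iter=5, random_state=0)
lda.fit(X)

# After the fix, perplexity on the training data should be a finite,
# positive number rather than the nonsensical values reported in #7954.
p1 = lda.perplexity(X)
assert np.isfinite(p1) and p1 > 0
```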
doc_topic_distr : None or array, shape=(n_samples, n_topics)
    Document topic distribution.
    If it is None, it will be generated by applying transform on X.
you can actually use a `.. deprecated::` note here
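For illustration, a numpydoc-style `.. deprecated::` directive in the parameter description might look like this (sketch only, signature as in the PR):

```python
def perplexity(X, doc_topic_distr='deprecated', sub_sampling=False):
    """Calculate approximate perplexity for data X.

    Parameters
    ----------
    doc_topic_distr : None or array, shape=(n_samples, n_topics)
        Document topic distribution. This argument is ignored.

        .. deprecated:: 0.19

    Returns
    -------
    score : float
        Perplexity score.
    """
```

Sphinx renders the directive as a highlighted deprecation notice next to the parameter, which is where users look first.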
@amueller, we're not going to get someone more familiar with this code to comment, are we?
The fix looks good to me. We need to pass the unnormalized `doc_topic_distr` to the perplexity calculation.
Ideally, we'd also check that the warning is raised when `doc_topic_distr` is passed.
I'll go ahead and write a regression test for the deprecated `doc_topic_distr` argument. IMHO, deprecating the argument is the right call, since users can no longer obtain a valid unnormalized distribution to pass in.
Yeah, deprecate it.
Thanks a lot @garyForeman, and @chyikwei, your continued advice is much appreciated.
I was discussing a similar PR with @jnothman (#8137 (comment)) and it came up that this change doesn't really follow the standard deprecation procedure. I think I understand the justification that the parameter is pretty much useless if the user isn't able to obtain and supply a valid unnormalized distribution, but maybe the functionality should be restored, or at least a loud and clear notice should go in the docs and the whatsnew entry. Sorry to butt in on a topic I know little about; I just wanted to relay that input.
I'm not sure I would say that passing `doc_topic_distr` ever worked as documented after 0.18, since `transform` stopped producing the unnormalized values the calculation needs.

#8137 appears to be focused on tidying up the API, whereas this is a bug fix. It makes sense to maintain the standard deprecation procedure where possible.

As @amueller and I discussed above, the ideas we had for maintaining working functionality of the `doc_topic_distr` argument would have required adding to the public API, which we preferred to avoid.
Sorry for not reviewing carefully. If `doc_topic_distr` is not being passed on to the computation, the docstring should say that it is ignored.
Ok, I can add that in. I'll open a new pull request shortly.
""" | ||
if doc_topic_distr != 'deprecated': | ||
warnings.warn("Argument 'doc_topic_distr' is deprecated and will " | ||
"be ignored as of 0.19. Support for this argument " |
So I'll change this from future to present tense, i.e. "...is deprecated and is being ignored as of 0.19." Does that address the issue?
If there's no way to keep using doc_topic_distr then yes.
X : array-like or sparse matrix, [n_samples, n_features]
    Document word matrix.

doc_topic_distr : None or array, shape=(n_samples, n_topics)
The docstring should probably note clearly that `doc_topic_distr` is currently discarded as well, since that's where users look first.
Well, it doesn't do anything; the docstring should certainly not say that it does.
Fix perplexity method by adding _unnormalized_transform method, Issue scikit-learn#7954 (scikit-learn#7992). Also deprecate doc_topic_distr argument in perplexity method.
Reference Issue
Fixes #7954
What does this implement/fix? Explain your changes.
This fixes the broken perplexity method of the LatentDirichletAllocation class. The method broke in the 0.18 release, when the default behavior of the transform method switched to returning normalized document-topic distributions; the perplexity calculation, however, requires the unnormalized likelihoods rather than probabilities.
To fix the issue, I have added an optional argument, normalize, to the transform method. The default is set to True such that if the argument is not specified, the behavior of the transform method remains unchanged from the version 0.18 release. When the transform method is called from the perplexity method, normalize=False is passed so that the likelihoods rather than the probabilities are returned.
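To illustrate the distinction with a made-up matrix: `gamma` below stands for the unnormalized document-topic values, and `normalize=True` rescales each row into a probability distribution.

```python
import numpy as np

# Hypothetical unnormalized document-topic matrix (what transform would
# return with normalize=False under this proposal).
gamma = np.array([[2.0, 6.0],
                  [1.0, 3.0]])

# With normalize=True, each row is rescaled to sum to 1, which is the
# default transform output since 0.18.
normalized = gamma / gamma.sum(axis=1)[:, np.newaxis]
print(normalized)  # [[0.25 0.75]
                   #  [0.25 0.75]]

# The perplexity bound needs the raw gamma values, hence normalize=False.
```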
Any other comments?
While I believe the fix proposed here is the most elegant, I would understand if we didn't want to add an additional argument to the transform method. If this is the case, I am happy to pivot the implementation to use a "private" class attribute that stores the normalization information.