8000 PPCA algorithm returns inconsistent noise variance depending on truncated vs full solution · Issue #8541 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

PPCA algorithm returns inconsistent noise variance depending on truncated vs full solution #8541

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
polmorenoc opened this issue Mar 6, 2017 · 4 comments · Fixed by #9108
Labels

Comments

@polmorenoc
Copy link

In the full solution the returned value is (pca.fit_full):
self.noise_variance
= explained_variance_[n_components:].mean()

whereas in the truncated case (pca.fit_truncated) you do:
self.noise_variance
= (total_var.sum() - self.explained_variance_.sum())

So in the second case the total unexplained variance is returned, not the average. This is problematic not only because of the inconsistency, but because certain code in sklearn.decomposition.PCA, e.g. the computation of the precision matrix assumes that the self.noise_variance is the average noise variance for it to work as per Tipping and Bishop (http://www.miketipping.com/papers/met-mppca.pdf), which is not the case in the truncated PCA. Let me know if I'm missing something.

@amueller
Copy link
Member
amueller commented Mar 6, 2017

I might just have run into this in #8544. PR welcome.

@polmorenoc
Copy link
Author

Yeah there's a high chance this issue is causing the score method not to work, which also happened to me recently. I believe the fix is really easy though as all that needs to be done in the truncated case is:

self.noise_variance = (total_var.sum() - self.explained_variance_.sum())/(D - n_components)

where D is the full dimensionality and n_components the reduced dimensionality.

@lesteve
Copy link
Member
lesteve commented Mar 8, 2017

I believe the fix is really easy though

@polmorenoc you are more than welcome to open a PR with your proposed changes.

@IainStrachan
Copy link

If you're going to put in a change, I also ran into this problem and I noticed (not that it makes a huge difference) that the computation of the eigenvalues from the singular values is also incorrect (line 473 of pca.py) in that it divides the square of the singular values by n_samples, and it should be (n_samples-1).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants
0