-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG+1] Incorrent implementation of noise_variance_ in PCA._fit_truncated #9108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it should be min(n_features, n_samples) instead of n_features (on line 503 as well)
@wallygauze Thanks. I agree with you and modify my code. For the current issue, it's enough, but scikit-learn seems to handle situations when |
Could you try to write tests to show that it fixes three issues? |
Sorry, I delete the branch without noticing it. Reopening. |
regression test: Regarding the comments, I doubt (I'm not sure) relying on giving the links to the issues in the file is convenient/best practice, especially since they are three of them and going through them could be long. A reliable summary should be included. Regarding the test design, the scores calculated by the different solvers being equal is not what we want - actually, if you test the same code with the iris dataset instead in the failing master, the score method is run without raising an error and yet the 3 scores are equal. The noise_variance formula for '_fit_truncated' being incorrect is not so directly connected with the score values. My understanding is that any lack of clarity could be dangerous in case someone considers changing the dataset in the future for test speed or something similar. I said so in #8544, but get_precision() raises no error in test_arpack_pca_solver and test_pca_randomized_solver, and that is actually not because the iris dataset is an unlikely exception. It suffices that the explained variance ratio of a dataset be imbalanced enough for it to happen. |
Thanks @wallygauze . If we only focus on the issue, we may simply remove the assertions and only calculate the score(the name of the function may be something like test_pca_score_no_error). Since it is a bug which is repeatedly reported, it may be better to mark the issue numbers or add it in what's new. Could you please give me some suggestions about the test? @jnothman |
(My bad, I commented again just before you answered.) |
And I do think calculating the score (or, again, calling get_precision) without any kind of comparison is enough as a test. |
@wallygauze Another reason for me to design this test is to ensure that in extreme situations, the scores are not only calculated but also correctly calculated using different svd_solver. Maybe you are right, so let's wait for the suggestions from the core developers. |
ping @jnothman Could you please have a look at it? Thanks. |
@@ -471,6 +471,26 @@ def test_infer_dim_by_explained_variance(): | |||
assert_equal(pca.n_components_, 2) | |||
|
|||
|
|||
def test_pca_score_equal(): | |||
# ensure that the scores calculated by different svd_solver are equal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this already tested by test_pca_score
? can we now test equality there at a higher number of decimal places?
digits = datasets.load_digits() | ||
X_digits = digits.data | ||
|
||
pca1 = PCA(n_components=30, svd_solver='full') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use a loop over solver_list as the other code around here does
@lesteve Thanks for your great update. |
digits = datasets.load_digits() | ||
X_digits = digits.data | ||
|
||
svd_solvers = ['full', 'arpack', 'randomized'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lesteve How about using the global variable solver_list here to make the test consistent with other tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, do that.
< 628C /div>There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise, +1 for merge when green.
digits = datasets.load_digits() | ||
X_digits = digits.data | ||
|
||
svd_solvers = ['full', 'arpack', 'randomized'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, do that.
|
||
|
||
def test_pca_zero_noise_variance_edge_cases(): | ||
n, p = 100, 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd appreciate a comment explaining what the edge case is.
@jnothman Thanks. I update the comment. CIs are green. |
Thanks! |
@qinhanmin2014 thanks a lot for the fix! |
@lesteve Thanks a lot too for your patient instruction and great update. :) |
Release 0.19.0 * tag '0.19.0': (99 commits) DOC one more version issue in doc skip docstring tests because not useful to users and has some issues deprecation of n_components happened in 0.19 not 0.18 (scikit-learn#9527) sync whatsnew with master so I'm less confused DOC more navigation links DOC a note on data leakage and pipeline (scikit-learn#9510) DOC set release date to Friday DOC Update news and menu for 0.19 release DOC list of contributors to 0.19 DOC Change release date to Thursday DOC Remove some whitespace from what's new Update what's new for recent changes Use base.is_classifier instead instead of isinstance (scikit-learn#9482) Fix safe_indexing with read-only indices (scikit-learn#9507) [MRG+1] add scorer based on explained_variance_score (scikit-learn#9259) fix wrong assert in test_validation (scikit-learn#9480) [MRG+1] FIX Add missing mixins to ClassifierChain (scikit-learn#9473) Bring last code block in line with the image. (scikit-learn#9488) FIX Pass sample_weight as kwargs in VotingClassifier (scikit-learn#9493) FIX Incorrent implementation of noise_variance_ in PCA._fit_truncated (scikit-learn#9108) ...
* releases: (99 commits) DOC one more version issue in doc skip docstring tests because not useful to users and has some issues deprecation of n_components happened in 0.19 not 0.18 (scikit-learn#9527) sync whatsnew with master so I'm less confused DOC more navigation links DOC a note on data leakage and pipeline (scikit-learn#9510) DOC set release date to Friday DOC Update news and menu for 0.19 release DOC list of contributors to 0.19 DOC Change release date to Thursday DOC Remove some whitespace from what's new Update what's new for recent changes Use base.is_classifier instead instead of isinstance (scikit-learn#9482) Fix safe_indexing with read-only indices (scikit-learn#9507) [MRG+1] add scorer based on explained_variance_score (scikit-learn#9259) fix wrong assert in test_validation (scikit-learn#9480) [MRG+1] FIX Add missing mixins to ClassifierChain (scikit-learn#9473) Bring last code block in line with the image. (scikit-learn#9488) FIX Pass sample_weight as kwargs in VotingClassifier (scikit-learn#9493) FIX Incorrent implementation of noise_variance_ in PCA._fit_truncated (scikit-learn#9108) ...
* dfsg: (99 commits) DOC one more version issue in doc skip docstring tests because not useful to users and has some issues deprecation of n_components happened in 0.19 not 0.18 (scikit-learn#9527) sync whatsnew with master so I'm less confused DOC more navigation links DOC a note on data leakage and pipeline (scikit-learn#9510) DOC set release date to Friday DOC Update news and menu for 0.19 release DOC list of contributors to 0.19 DOC Change release date to Thursday DOC Remove some whitespace from what's new Update what's new for recent changes Use base.is_classifier instead instead of isinstance (scikit-learn#9482) Fix safe_indexing with read-only indices (scikit-learn#9507) [MRG+1] add scorer based on explained_variance_score (scikit-learn#9259) fix wrong assert in test_validation (scikit-learn#9480) [MRG+1] FIX Add missing mixins to ClassifierChain (scikit-learn#9473) Bring last code block in line with the image. (scikit-learn#9488) FIX Pass sample_weight as kwargs in VotingClassifier (scikit-learn#9493) FIX Incorrent implementation of noise_variance_ in PCA._fit_truncated (scikit-learn#9108) ...
Reference Issue
Fixes #7568


Fixes #8541
Fixes #8544
What does this implement/fix? Explain your changes.
noise_variance_
is correctly implemented inPCA._fit_full
andIncrementalPCA.partial_fit
, but incorrectly implemented inPCA._fit_truncated
.After the modification, in function
get_precision
(decomposition/base.py), we will not get any 0 inexp_var_diff
(line 72), thus will not get anyinf
inprecision
(line 74), which is the main reason of #8544 .Any other comments?
@amueller @lesteve Thanks for your precious time.