[MRG+1] Incorrent implementation of noise_variance_ in PCA._fit_truncated #9108


Merged · 17 commits into scikit-learn:master from my-feature-2 · Aug 6, 2017

Conversation

@qinhanmin2014 (Member) commented Jun 11, 2017

Reference Issue

Fixes #7568
Fixes #8541
Fixes #8544

What does this implement/fix? Explain your changes.

noise_variance_ is correctly implemented in PCA._fit_full and IncrementalPCA.partial_fit, but incorrectly implemented in PCA._fit_truncated.
After this change, get_precision (in decomposition/base.py) no longer produces any 0 in exp_var_diff (line 72), and therefore no inf in precision (line 74), which was the main cause of #8544.
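
For readers following along, the intended semantics can be sketched as follows: noise_variance_ should be the average variance left unexplained by the retained components, spread over the min(n_features, n_samples) - n_components discarded directions. A minimal sketch of the idea (function and variable names are illustrative, not the exact patch):

```python
import numpy as np

def truncated_noise_variance(X, explained_variance, n_components):
    """Sketch of noise_variance_ for a truncated PCA fit (illustrative only).

    Whatever part of the total variance is not captured by the retained
    components is averaged over the remaining
    min(n_features, n_samples) - n_components directions.
    """
    n_samples, n_features = X.shape
    total_var = np.var(X, axis=0, ddof=1).sum()
    n_remaining = min(n_features, n_samples) - n_components
    if n_remaining == 0:
        # every possible component is retained -> nothing left to call noise
        return 0.0
    return (total_var - explained_variance[:n_components].sum()) / n_remaining
```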

Any other comments?

@amueller @lesteve Thanks for your precious time.

@qinhanmin2014 qinhanmin2014 changed the title [WIP] Incorrent implementation of noise_variance_ in PCA._fit_truncated [MRG] Incorrent implementation of noise_variance_ in PCA._fit_truncated Jun 11, 2017
@wallygauze (Contributor) left a comment

I believe it should be min(n_features, n_samples) instead of n_features (on line 503 as well)

@qinhanmin2014 (Member, Author)

@wallygauze Thanks. I agree with you and have modified my code. For the current issue this is enough, but scikit-learn seems to handle situations where n_samples < n_features strangely (see the example below). Maybe some other modifications are needed.
@amueller @lesteve

@qinhanmin2014 (Member, Author)

@jnothman @amueller Since the pull request fixes 3 issues, could you please take some time to consider it?

@jnothman (Member)

Could you try to write tests to show that it fixes three issues?

@qinhanmin2014 (Member, Author)

@jnothman Thanks. I constructed a regression test based on #7568 and it takes 0.2s on my PC.

@qinhanmin2014 qinhanmin2014 deleted the my-feature-2 branch June 15, 2017 13:46
@qinhanmin2014 qinhanmin2014 restored the my-feature-2 branch June 15, 2017 13:47
@qinhanmin2014 qinhanmin2014 reopened this Jun 15, 2017
@qinhanmin2014 (Member, Author)

Sorry, I deleted the branch without noticing. Reopening.

@wallygauze (Contributor) commented Jun 16, 2017

regression test:
You are right that this code fails with master and not with the fix. However, the test design (and comment) does not reflect its purpose.

Regarding the comments, I doubt (I'm not sure) that relying on links to the issues in the file is convenient/best practice, especially since there are three of them and going through them could take a while. A reliable summary should be included.

Regarding the test design, the scores calculated by the different solvers being equal is not what we want. Actually, if you run the same code on the failing master with the iris dataset instead, the score method runs without raising an error and yet the 3 scores are equal. The noise_variance formula in '_fit_truncated' being incorrect is not so directly connected with the score values.

My understanding is that any lack of clarity could be dangerous if someone considers changing the dataset in the future, for test speed or something similar. I said so in #8544, but get_precision() raises no error in test_arpack_pca_solver and test_pca_randomized_solver, and that is actually not because the iris dataset is an unlikely exception: it suffices that the explained variance ratio of a dataset be imbalanced enough for it to happen.

@qinhanmin2014 (Member, Author)

Thanks @wallygauze. If we only focus on the issue, we may simply remove the assertions and only calculate the score (the function could be named something like test_pca_score_no_error). Since this is a bug that has been repeatedly reported, it may be better to note the issue numbers or add it to what's new. Could you please give me some suggestions about the test? @jnothman

@wallygauze (Contributor)

(My bad, I commented again just before you answered.)
Yes, I see why we'd want to include the issue reference (as long as we include a good summary).

@wallygauze (Contributor) commented Jun 16, 2017

And I do think calculating the score (or, again, calling get_precision) without any kind of comparison is enough as a test.
If it's too ad hoc for a regression test, do consider modifying or extending test_arpack_pca_solver and test_pca_randomized_solver so that they do pick up the bug via get_precision.
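
For illustration only, such a comparison-free smoke test might look roughly like the following (the function name, shapes and spectrum are made up for this sketch, not taken from the test that was eventually merged):

```python
import numpy as np
from sklearn.decomposition import PCA

def test_pca_score_no_error_sketch():
    # Hypothetical smoke test: score() goes through get_precision(), so it is
    # enough to check that it runs and stays finite for every solver.  The
    # spectrum below is deliberately imbalanced, similar in spirit to #7568.
    rng = np.random.RandomState(0)
    X = rng.randn(1000, 10) * np.array([10.0, 5.0] + [0.01] * 8)
    for svd_solver in ['full', 'arpack', 'randomized']:
        pca = PCA(n_components=2, svd_solver=svd_solver, random_state=0)
        pca.fit(X)
        assert np.isfinite(pca.score(X))
```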

@qinhanmin2014 (Member, Author)

@wallygauze Another reason for me to design the test this way is to ensure that, in extreme situations, the scores are not only computed but also computed correctly and consistently across the different svd_solver options. Maybe you are right, so let's wait for suggestions from the core developers.

@qinhanmin2014 (Member, Author) commented Jun 19, 2017

The previous test failure was caused by #7893. I pushed an empty commit to make CI green. Could you please take some time to consider this pull request? Thanks. @jnothman

@qinhanmin2014 (Member, Author)

ping @jnothman Could you please have a look at it? Thanks.

@jnothman (Member) left a comment

Thanks for the pings.

This doesn't appear to test the min(n_features, n_samples) logic.

Are we able to describe the noise_variance_ (and explained_variance_) parameter and its scaling more clearly in the docstring?
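
One possible direction, sketched here rather than quoting any wording that ended up in the docs, is to tie both attributes explicitly to the eigenvalues of the covariance matrix:

```python
# Illustrative attribute documentation (a sketch, not the merged docstring):
#
# explained_variance_ : array, shape (n_components,)
#     The amount of variance explained by each selected component, equal to
#     the n_components largest eigenvalues of the covariance matrix of X.
#
# noise_variance_ : float
#     The estimated noise variance, following the probabilistic PCA model of
#     Tipping and Bishop (1999): the average of the
#     (min(n_features, n_samples) - n_components) smallest eigenvalues of the
#     covariance matrix of X. It is 0.0 when all components are retained.
```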

@@ -471,6 +471,26 @@ def test_infer_dim_by_explained_variance():
assert_equal(pca.n_components_, 2)


def test_pca_score_equal():
# ensure that the scores calculated by different svd_solver are equal
@jnothman (Member):
Isn't this already tested by test_pca_score? Can we now test equality there at a higher number of decimal places?

digits = datasets.load_digits()
X_digits = digits.data

pca1 = PCA(n_components=30, svd_solver='full')
@jnothman (Member):
Please use a loop over solver_list, as the other code around here does.

@qinhanmin2014 (Member, Author)

@lesteve Thanks for your great update.

digits = datasets.load_digits()
X_digits = digits.data

svd_solvers = ['full', 'arpack', 'randomized']
@qinhanmin2014 (Member, Author):
@lesteve How about using the global variable solver_list here to make the test consistent with other tests?

@jnothman (Member):
Sure, do that.


@jnothman (Member) left a comment

Otherwise, +1 for merge when green.


def test_pca_zero_noise_variance_edge_cases():
n, p = 100, 3
@jnothman (Member):
I'd appreciate a comment explaining what the edge case is.
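
For context, the edge case exercised here is presumably the one where the fit retains every possible component, i.e. n_components == min(n_samples, n_features), so that noise_variance_ should come out as exactly 0.0. A hedged sketch of that check (names and shapes are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

def check_zero_noise_variance_sketch():
    # With n_components == n_features (and n_samples > n_features) there are
    # no discarded directions, so nothing is left over to model as noise.
    rng = np.random.RandomState(0)
    n, p = 100, 3
    X = rng.randn(n, p)
    pca = PCA(n_components=p, svd_solver='full').fit(X)
    assert pca.noise_variance_ == 0.0
```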

@qinhanmin2014 (Member, Author)

@jnothman Thanks. I updated the comment. CI is green.

@jnothman (Member) commented Aug 6, 2017

Thanks!

@jnothman jnothman merged commit 3f53d78 into scikit-learn:master Aug 6, 2017
@qinhanmin2014 qinhanmin2014 deleted the my-feature-2 branch August 6, 2017 02:26
jnothman pushed a commit that referenced this pull request Aug 6, 2017
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Aug 6, 2017
@lesteve (Member) commented Aug 7, 2017

@qinhanmin2014 thanks a lot for the fix!

@qinhanmin2014 (Member, Author)

@lesteve Thanks a lot to you too, for your patient guidance and the great update. :)

dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Aug 12, 2017
Release 0.19.0

* tag '0.19.0': (99 commits)
  DOC one more version issue in doc
  skip docstring tests because not useful to users and has some issues
  deprecation of n_components happened in 0.19 not 0.18 (scikit-learn#9527)
  sync whatsnew with master so I'm less confused
  DOC more navigation links
  DOC a note on data leakage and pipeline (scikit-learn#9510)
  DOC set release date to Friday
  DOC Update news and menu for 0.19 release
  DOC list of contributors to 0.19
  DOC Change release date to Thursday
  DOC Remove some whitespace from what's new
  Update what's new for recent changes
  Use base.is_classifier instead instead of isinstance (scikit-learn#9482)
  Fix safe_indexing with read-only indices (scikit-learn#9507)
  [MRG+1] add scorer based on explained_variance_score (scikit-learn#9259)
  fix wrong assert in test_validation (scikit-learn#9480)
  [MRG+1] FIX Add missing mixins to ClassifierChain (scikit-learn#9473)
  Bring last code block in line with the image. (scikit-learn#9488)
  FIX Pass sample_weight as kwargs in VotingClassifier (scikit-learn#9493)
  FIX Incorrent implementation of noise_variance_ in PCA._fit_truncated (scikit-learn#9108)
  ...
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Aug 12, 2017
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Aug 12, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
AishwaryaRK pushed a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017