[WIP] Implement PCA on sparse noncentered data #24415
Conversation
This test is expected to fail at the moment. I will expand test coverage in the future.
Ran into an issue with the …
This is an intermediate commit with a lot of debug print code. All tests are passing though.
I dodged the issue by using the transpose identity. That enables randomized SVD in addition to ARPACK.
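For context, the transpose identity in question is presumably the standard SVD fact that if X = U S V^T then X^T = V S U^T, so an SVD routine can be run on the transposed matrix and the factor roles swapped afterwards. A minimal NumPy sketch (illustrative only, not code from this PR):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))

# SVD of the transpose: X.T == U_t @ np.diag(S_t) @ Vt_t
U_t, S_t, Vt_t = np.linalg.svd(X.T, full_matrices=False)

# Transpose identity: if X = U S Vt then X.T = V S Ut, so swap the factor roles.
U, S, Vt = Vt_t.T, S_t, U_t.T

np.testing.assert_allclose(U @ np.diag(S) @ Vt, X, atol=1e-10)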
I'll need to squash these intermediate commits later.
Is this PR still WIP? What remains to be done?
Here are some suggestions to move it forward (a sketch of such a test follows after the list):
- test with larger data than iris (e.g. a few hundred data points and features);
- use the global_random_seed fixture in the new test (see "Improve tests by using global_random_seed fixture to make them less seed-sensitive" #22827 for more details);
- parametrize the new test to also check with whiten set to True;
- please also check that transforming a batch of random test data points (ideally not from the training set) yields the same result with assert_allclose;
- check that it's possible to call transform on a dense array of points on a model that was trained with sparse data and vice versa;
- document the change in the changelog for 1.2 (we will move it to 1.3 if the PR is not ready to merge by then).
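A minimal sketch of such a test, assuming the sparse-aware PCA this PR adds and the existing global_random_seed fixture; sizes, solver names and tolerances below are illustrative guesses rather than code from the PR:

import pytest
import scipy.sparse as sp
from numpy.testing import assert_allclose
from sklearn.decomposition import PCA

@pytest.mark.parametrize("svd_solver", ["arpack", "randomized"])
@pytest.mark.parametrize("whiten", [False, True])
def test_pca_sparse_vs_dense(global_random_seed, svd_solver, whiten):
    X = sp.random(300, 100, density=0.05, format="csr",
                  random_state=global_random_seed)
    X_test = sp.random(50, 100, density=0.05, format="csr",
                       random_state=global_random_seed + 1)

    # Hypothetical: fitting directly on sparse input is what this PR enables.
    pca = PCA(n_components=5, svd_solver=svd_solver, whiten=whiten,
              random_state=global_random_seed).fit(X)
    pca_dense = PCA(n_components=5, svd_solver="full", whiten=whiten).fit(X.toarray())

    # The spectrum carries no sign ambiguity, so compare it directly
    # (the tolerance is a guess).
    assert_allclose(pca.explained_variance_, pca_dense.explained_variance_, rtol=1e-3)

    # A model fitted on sparse data should transform sparse and dense test data
    # identically.
    assert_allclose(pca.transform(X_test), pca.transform(X_test.toarray()))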
@ogrisel Thank you so much for taking a look and for the suggestions, I will implement those. I was also planning to add support for LOBPCG and PROPACK as sparse SVD methods. That could go in via this PR or as a follow up. When is the merge window closing for 1.2?
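For what it's worth, SciPy's scipy.sparse.linalg.svds already exposes several iterative backends behind a single solver parameter, which is presumably the hook such support would build on; a small standalone sketch (not code from this PR):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

X = sp.random(400, 300, density=0.05, format="csr", random_state=0)

# ARPACK and LOBPCG behind the same interface; recent SciPy versions also
# accept solver="propack".
for solver in ("arpack", "lobpcg"):
    U, S, Vt = svds(X, k=10, solver=solver, random_state=0)
    # svds does not guarantee a descending order of singular values.
    print(solver, np.sort(S)[::-1][:3])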
Soonish I think :) /cc @jeremiedbb
Uh oh. A couple of days?
@ogrisel Let me know if I interpreted the suggestions correctly, I put a TODO list at the top of the PR.
Force-pushed from 76f7f32 to 1dff900.
(Re the force push) I had to drop some unwanted commits that were pulled in from main directly rather than via a merge commit.
@ogrisel Only 2080 out of 16000 tests are passing when testing on 400x300 random sparse matrices of varying densities across the 100 global random seeds.
Command:
Test matrix:
Looking at some of the results manually, the errors are due to 1-2% of elements mismatching; I'll try to gather better statistics on that in particular. Below is a high-level breakdown of the pass rate by parameter.
Plot repro:
SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" OMP_NUM_THREADS=1 pytest -v --tb=no -n `nproc --all` sklearn/decomposition/tests/test_pca.py::test_pca_sparse > test-pca-sparse-all-seeds.log
grep -P 'PASSED|FAILED' test-pca-sparse-all-seeds.log | sed -E -e 's/^.*(FAILED|PASSED).*\[(.*)\]/\2 \1/' -e 's/-/ /g' -e 's/ $//' -e 's/ /,/g' > test-pca-sparse-all-seeds.csv

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(
'test-pca-sparse-all-seeds.csv',
header=None,
names=['seed', 'solver', 'layout', 'ncomp', 'density', 'outcome']
)
# Boolean pass flag derived from the PASSED/FAILED outcome column.
df['pass'] = df.outcome == 'PASSED'
# Pass rate for each value of the given parameter column.
def passrate_by(x):
    return df.groupby(x)['pass'].mean()
fig, axes = plt.subplots(2, 2, figsize=(8, 8), dpi=200)
seed = passrate_by('seed').hist(ax=axes[0][0])
seed.set_title('pass rate by seed (histogram)')
seed.set_xlabel('pass rate')
seed.set_ylabel('seed count')
seed.set_ylim(top=100)
seed.set_xlim(right=1)
solver = passrate_by('solver').plot.bar(ax=axes[0][1])
solver.set_title('pass rate by solver')
solver.set_xlabel('solver')
solver.set_ylabel('pass rate')
density = passrate_by('density').plot.bar(ax=axes[1][0])
density.set_title('pass rate by density')
density.set_xlabel('density')
density.set_ylabel('pass rate')
ncomp = passrate_by('ncomp').plot.bar(ax=axes[1][1])
ncomp.set_title('pass rate by number of components')
ncomp.set_xlabel('# components')
ncomp.set_ylabel('pass rate')
for bp in (solver, density, ncomp):
bp.set_xticklabels(bp.get_xticklabels(), rotation=0)
bp.set_ylim(top=1)
fig.tight_layout()
fig.savefig('test-pca-sparse-pass-rate.png', facecolor='white', transparent=False)
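To gather those per-element mismatch statistics, a small helper along these lines could work (a sketch with placeholder data; names and tolerances are not from the PR):

import numpy as np

def mismatch_stats(a, b, rtol=1e-7, atol=0.0):
    # Count entries that fall outside assert_allclose-style tolerances.
    close = np.isclose(a, b, rtol=rtol, atol=atol)
    n_bad = int((~close).sum())
    return n_bad, n_bad / close.size

# Dummy arrays standing in for, e.g., the transformed outputs of a PCA fitted
# on sparse data vs. the same data densified.
rng = np.random.default_rng(0)
a = rng.standard_normal((400, 5))
b = a + 1e-3 * (rng.random((400, 5)) < 0.02) * rng.standard_normal((400, 5))
print(mismatch_stats(a, b))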
Updates are posted in the linked issue #12794.
Previously it was completely ignored and as a result defaulted to 0.01.
Will fix #12794 when complete.
TODOs
- parametrize the new test with whiten
- use the global_random_seed fixture in the new test
- run SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest sklearn/decomposition/tests/test_pca.py::test_pca_sparse
- check .transform on dense data with a model fitted on sparse data and vice versa