[WIP] Implement PCA on sparse noncentered data by andportnoy · Pull Request #24415 · scikit-learn/scikit-learn · GitHub

[WIP] Implement PCA on sparse noncentered data #24415

Open: andportnoy wants to merge 37 commits into base: main
37 commits
225a848
Add small test of PCA on sparse data
andportnoy Sep 10, 2022
1520c5f
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Sep 24, 2022
d4e7daf
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Sep 25, 2022
bce3d62
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Oct 1, 2022
e4490f9
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Oct 4, 2022
3bf477e
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Oct 5, 2022
f5a30e4
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Oct 8, 2022
460c368
Add support for PCA on sparse matrices using ARPACK + randomized SVD
andportnoy Oct 8, 2022
ead8bf7
Blacken PCA on sparse data code
andportnoy Oct 8, 2022
ef071f3
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Oct 15, 2022
331ba6f
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Oct 19, 2022
d8b5283
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Oct 21, 2022
5fb7aa6
PCA/helpers: remove debug prints from _center_implicitly
andportnoy Oct 22, 2022
b1528cd
PCA/helpers: remove redundant variable from _center_implicitly
andportnoy Oct 22, 2022
9f9b8c8
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Oct 25, 2022
bfd7a0e
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Nov 5, 2022
1dff900
PCA/tests: test PCA on larger random sparse matrix
andportnoy Nov 5, 2022
fa670b4
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Nov 26, 2022
647b735
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Dec 4, 2022
1b9a851
PCA: add LOBPCG support for sparse data
andportnoy Dec 4, 2022
4786e88
CI [all random seeds]
andportnoy Dec 10, 2022
7cbf497
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Dec 10, 2022
480aa5b
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Dec 17, 2022
b2f8d64
PCA/tests: parametrize test_pca_sparse on rtol [all random seeds]
andportnoy Dec 17, 2022
ef18778
CI [azure parallel] [all random seeds]
andportnoy Dec 17, 2022
73dd609
PCA/tests: leave only default rtol value [azure parallel] [all random…
andportnoy Dec 17, 2022
8a66dfd
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Feb 23, 2023
58c7b90
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Feb 24, 2023
0884861
PCA/tests: use density parameter
andportnoy Feb 24, 2023
8569def
PCA/tests: check in directory with debug scripts
andportnoy Feb 24, 2023
b1edffb
PCA/debug: mkdir data and plot directories if necessary
andportnoy Feb 28, 2023
c80f883
PCA/tests: use 300 dpi in plots
andportnoy Feb 28, 2023
d9ca26d
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Feb 28, 2023
600f2f1
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Feb 28, 2023
2c01c70
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy Apr 22, 2023
8d6c90e
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy May 12, 2023
3f52d5c
Merge branch 'main' into pca-on-sparse-noncentered-data
andportnoy May 24, 2023
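
The commit messages above mention a helper called `_center_implicitly` and the ARPACK, LOBPCG, and randomized SVD solvers, but the helper's body is not part of this diff. As a rough illustration only, one common way to run PCA on sparse, noncentered data is to wrap the matrix in a `scipy.sparse.linalg.LinearOperator` that subtracts the column means on the fly, so the dense centered matrix is never built. The name `center_implicitly_sketch` and the toy matrix are hypothetical and are not taken from the PR.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds


def center_implicitly_sketch(X):
    """Hypothetical helper: act like X - column_means without densifying X."""
    n_samples, _ = X.shape
    mean = np.asarray(X.mean(axis=0)).ravel()  # column means, shape (n_features,)

    def matvec(v):
        # (X - 1 mean^T) v = X v - (mean . v) * 1
        v = np.asarray(v).ravel()
        return X @ v - np.full(n_samples, mean @ v)

    def rmatvec(u):
        # (X - 1 mean^T)^T u = X^T u - mean * sum(u)
        u = np.asarray(u).ravel()
        return X.T @ u - mean * u.sum()

    return LinearOperator(
        shape=X.shape, matvec=matvec, rmatvec=rmatvec, dtype=np.float64
    )


# Toy sparse matrix (assumed sizes); compare against explicit dense centering.
X = sp.random(50, 20, density=0.2, format="csr", random_state=0)
_, s, _ = svds(center_implicitly_sketch(X), k=3)  # ARPACK is the default solver

dense = X.toarray()
dense -= dense.mean(axis=0)
s_ref = np.linalg.svd(dense, compute_uv=False)[:3]
```

`svds` returns singular values in ascending order, so sort them before comparing with the dense reference.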
6 changes: 6 additions & 0 deletions pca-sparse-debug/scripts/mismatch-csv.sh
@@ -0,0 +1,6 @@
#!/bin/bash

LOGFILE=$1
paste -d ',' \
<(echo 'seed,rtol,solver,layout,k,density'; grep -Po '(?<=test_pca_sparse\[).+?(?=\])' $LOGFILE | sed -e 's/1e-/1e@/g' -e 's/-/,/g' -e 's/@/-/g') \
<(echo 'bad,total'; grep -Po '(?<=Mismatched elements: )\d+ / \d+' $LOGFILE | sed -e 's/ //g' -e 's/\//,/g')
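
The two `sed` passes in `mismatch-csv.sh` are easy to misread. A rough Python equivalent is sketched below, assuming a test ID shaped like `42-1e-07-arpack-csr-3-0.01` (a made-up example matching the `seed,rtol,solver,layout,k,density` header) and NumPy's `Mismatched elements: 3 / 15` report line. The helper names are hypothetical.

```python
import re

MISMATCH_RE = re.compile(r"Mismatched elements: (\d+) / (\d+)")


def split_test_id(test_id):
    """Split a '-'-separated pytest ID while keeping exponents like 1e-07 intact."""
    protected = test_id.replace("1e-", "1e@")  # hide the exponent's minus sign
    return [field.replace("@", "-") for field in protected.split("-")]


def parse_mismatch(line):
    """Return (bad, total) from a 'Mismatched elements: bad / total' line, else None."""
    match = MISMATCH_RE.search(line)
    return (int(match.group(1)), int(match.group(2))) if match else None


fields = split_test_id("42-1e-07-arpack-csr-3-0.01")
counts = parse_mismatch("Mismatched elements: 3 / 15 (20%)")
```

Without the `1e-` guard, `1e-07` would be split into `1e` and `07`, shifting every later column by one.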
3 changes: 3 additions & 0 deletions pca-sparse-debug/scripts/mismatch-log.sh
@@ -0,0 +1,3 @@
#!/bin/bash

SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" OMP_NUM_THREADS=1 pytest --color=no -n "$(nproc --all)" sklearn/decomposition/tests/test_pca.py::test_pca_sparse
19 changes: 19 additions & 0 deletions pca-sparse-debug/scripts/mismatch-main.sh
@@ -0,0 +1,19 @@
set -euxo pipefail

SCRIPTDIR=$(dirname "$0")
DATADIR=$SCRIPTDIR/../data
PLOTDIR=$SCRIPTDIR/../plots
TIMESTAMP=$(date +"%Y%m%d-%H%M%S")
GITHASH=$(git rev-parse --short HEAD)

BASENAME=pca-sparse-mismatch-$GITHASH-$TIMESTAMP
LOGFILE=$DATADIR/$BASENAME.log
CSVFILE=$DATADIR/$BASENAME.csv
PLOTFILE=$PLOTDIR/$BASENAME.png

mkdir -p "$DATADIR"
mkdir -p "$PLOTDIR"

bash "$SCRIPTDIR/mismatch-log.sh" > "$LOGFILE" || true
bash "$SCRIPTDIR/mismatch-csv.sh" "$LOGFILE" > "$CSVFILE"
python "$SCRIPTDIR/mismatch-plot.py" "$CSVFILE" "$PLOTFILE"
52 changes: 52 additions & 0 deletions pca-sparse-debug/scripts/mismatch-plot.py
@@ -0,0 +1,52 @@
import sys
import pandas as pd
import matplotlib.pyplot as plt

csv = sys.argv[1]
plot = sys.argv[2]

df = pd.read_csv(csv)
df = df[df.solver != 'auto']

df['rate'] = df.bad/df.total

def mismatch_by(x):
    gb = df.groupby(x)
    return gb['rate'].mean()

fig, axes = plt.subplots(2, 2, figsize=(10, 10), dpi=300)

ax=axes[0][0]
seed = mismatch_by('seed').hist(ax=ax)
seed.set_title('mismatch rate by seed (histogram)')
seed.set_xlabel('mismatch rate')
seed.set_ylabel('seed count')
seed.set_ylim(top=100)
seed.set_xlim(right=1)

ax=axes[0][1]
solver = mismatch_by('solver').plot.bar(ax=ax)
solver.set_title('mismatch rate by solver')
solver.set_xlabel('solver')
solver.set_ylabel('mismatch rate')
ax.bar_label(ax.containers[0], fmt="%.3f")

ax=axes[1][0]
density = mismatch_by('density').plot.bar(ax=ax)
density.set_title('mismatch rate by density')
density.set_xlabel('density')
density.set_ylabel('mismatch rate')
ax.bar_label(ax.containers[0], fmt="%.3f")

ax=axes[1][1]
ncomp = mismatch_by('k').plot.bar(ax=ax)
ncomp.set_title('mismatch rate by number of components')
ncomp.set_xlabel('# components')
ncomp.set_ylabel('mismatch rate')
ax.bar_label(ax.containers[0], fmt="%.3f")

for bp in (solver, density, ncomp):
    bp.set_xticklabels(bp.get_xticklabels(), rotation=0)
    bp.set_ylim(top=1)
fig.tight_layout()
fig.savefig(plot, facecolor='white', transparent=False)
40 changes: 40 additions & 0 deletions pca-sparse-debug/scripts/mismatch-tolerance-plot.py
@@ -0,0 +1,40 @@
import sys
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

csv = sys.argv[1]
plot = sys.argv[2]

df = pd.read_csv(csv)
df = df[df.solver != 'auto']

df['rate'] = df.bad/df.total

def mismatch_by(x):
    gb = df.groupby(x)
    return gb['rate'].mean()

gb = mismatch_by(['solver', 'rtol'])
gb = gb.reindex(pd.MultiIndex.from_product(gb.index.levels)).fillna(0)

def formatter(x):
    if x == 1:
        return '1e-00'
    else:
        return format(x, ".0e")
gb.index = gb.index.set_levels(map(formatter, gb.index.levels[1]), level=1)
solvers = gb.index.levels[0]
fig, axes = plt.subplots(1, len(solvers), figsize=(24, 4), dpi=300)
fig.suptitle('Elementwise mismatch rate by solver and relative tolerance', fontsize=20)
for solver, ax in zip(solvers, axes.flat):
    bar = gb[solver].plot.bar(ax=ax)
    bar.set_title(solver, pad=15, fontsize=20)
    bar.set_xlabel('relative tolerance', fontsize=14)
    bar.set_ylabel('mismatch rate', fontsize=14)
    bar.set_ylim(bottom=0, top=1)
    bar.tick_params(axis='both', which='major', labelsize=14)
    ax.bar_label(ax.containers[0], fmt="%.2g")

fig.tight_layout()
fig.savefig(plot, facecolor='white', transparent=False)
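
The `reindex(pd.MultiIndex.from_product(...)).fillna(0)` step in `mismatch-tolerance-plot.py` pads the grouped result so that every solver gets a bar for every tolerance. This matters if the CSV only contains rows for tests that reported mismatches, which is an assumption about the log, not something the diff states. A minimal sketch on made-up numbers:

```python
import pandas as pd

# Toy data (hypothetical): no row exists for ("lobpcg", 1e-3).
df = pd.DataFrame(
    {
        "solver": ["arpack", "arpack", "lobpcg"],
        "rtol": [1e-7, 1e-3, 1e-7],
        "rate": [0.2, 0.0, 0.4],
    }
)

gb = df.groupby(["solver", "rtol"])["rate"].mean()

# Build the full solver x rtol grid and fill the missing combinations with 0.
full_index = pd.MultiIndex.from_product(gb.index.levels)
full = gb.reindex(full_index).fillna(0)
```

After the reindex, `full` has one entry per (solver, rtol) pair, so each subplot shows the same set of tolerance bars.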
5 changes: 5 additions & 0 deletions pca-sparse-debug/scripts/passrate-csv.sh
@@ -0,0 +1,5 @@
#!/bin/bash

LOGFILE=$1
echo 'seed,rtol,solver,format,k,density,outcome'
grep -P 'PASSED|FAILED' $LOGFILE | sed -E -e 's/^.*(FAILED|PASSED).*\[(.*)\]/\2 \1/' -e 's/1e-/1e@/g' -e 's/-/ /g' -e 's/@/-/g' -e 's/ $//' -e 's/ /,/g'
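
The `sed -E` chain in `passrate-csv.sh` pulls the outcome and the bracketed test ID out of each verbose pytest line and emits one CSV row. The sketch below mirrors that logic in Python, reusing the script's first regular expression. The sample line is a hypothetical `pytest -v -n` output line, and `to_csv_row` is an illustrative name. Unlike `sed`, this version discards any text after the closing bracket.

```python
import re

# Same pattern as the first sed expression: outcome first, then the last [...] group.
LINE_RE = re.compile(r"^.*(FAILED|PASSED).*\[(.*)\]")


def to_csv_row(line):
    """Turn one PASSED/FAILED log line into 'seed,rtol,solver,format,k,density,outcome'."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    outcome, test_id = match.group(1), match.group(2)
    # Protect exponents such as 1e-07 before splitting the ID on '-'.
    fields = [f.replace("@", "-") for f in test_id.replace("1e-", "1e@").split("-")]
    return ",".join(fields + [outcome])


sample = (
    "[gw3] [ 12%] FAILED sklearn/decomposition/tests/test_pca.py"
    "::test_pca_sparse[7-1e-07-lobpcg-csc-2-0.1]"
)
row = to_csv_row(sample)
```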
5 changes: 5 additions & 0 deletions pca-sparse-debug/scripts/passrate-log.sh
@@ -0,0 +1,5 @@
#!/bin/bash

# writes pca sparse pass/fail results to stdout

SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" OMP_NUM_THREADS=1 pytest --color=no -v --tb=no -n "$(nproc --all)" sklearn/decomposition/tests/test_pca.py::test_pca_sparse
19 changes: 19 additions & 0 deletions pca-sparse-debug/scripts/passrate-main.sh
@@ -0,0 +1,19 @@
set -euxo pipefail

SCRIPTDIR=$(dirname "$0")
DATADIR=$SCRIPTDIR/../data
PLOTDIR=$SCRIPTDIR/../plots
TIMESTAMP=$(date +"%Y%m%d-%H%M%S")
GITHASH=$(git rev-parse --short HEAD)

BASENAME=pca-sparse-passrate-$GITHASH-$TIMESTAMP
LOGFILE=$DATADIR/$BASENAME.log
CSVFILE=$DATADIR/$BASENAME.csv
PLOTFILE=$PLOTDIR/$BASENAME.png

mkdir -p "$DATADIR"
mkdir -p "$PLOTDIR"

bash "$SCRIPTDIR/passrate-log.sh" > "$LOGFILE" || true
bash "$SCRIPTDIR/passrate-csv.sh" "$LOGFILE" > "$CSVFILE"
python "$SCRIPTDIR/passrate-plot.py" "$CSVFILE" "$PLOTFILE"
55 changes: 55 additions & 0 deletions pca-sparse-debug/scripts/passrate-plot.py
@@ -0,0 +1,55 @@
import sys
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

csv = sys.argv[1]
plot = sys.argv[2]

df = pd.read_csv(csv)
df = df[df.solver != 'auto']

df['pass'] = df.outcome.apply(lambda x: True if x=='PASSED' else False)

def passrate_by(x):
    passes = df.groupby(x)['pass']
    counts = passes.count()
    sums = passes.sum()
    return sums / counts

fig, axes = plt.subplots(2, 2, figsize=(10, 10), dpi=300)

ax=axes[0][0]
seed = passrate_by('seed').hist(ax=ax)
seed.set_title('pass rate by seed (histogram)')
seed.set_xlabel('pass rate')
seed.set_ylabel('seed count')
seed.set_ylim(top=100)
seed.set_xlim(right=1)

ax=axes[0][1]
solver = passrate_by('solver').plot.bar(ax=ax)
solver.set_title('pass rate by solver')
solver.set_xlabel('solver')
solver.set_ylabel('pass rate')
ax.bar_label(ax.containers[0], fmt="%.3f")

ax=axes[1][0]
density = passrate_by('density').plot.bar(ax=ax)
density.set_title('pass rate by density')
density.set_xlabel('density')
density.set_ylabel('pass rate')
ax.bar_label(ax.containers[0], fmt="%.3f")

ax=axes[1][1]
ncomp = passrate_by('k').plot.bar(ax=ax)
ncomp.set_title('pass rate by number of components')
ncomp.set_xlabel('# components')
ncomp.set_ylabel('pass rate')
ax.bar_label(ax.containers[0], fmt="%.3f")

for bp in (solver, density, ncomp):
    bp.set_xticklabels(bp.get_xticklabels(), rotation=0)
    bp.set_ylim(top=1)
fig.tight_layout()
fig.savefig(plot, facecolor='white', transparent=False)
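
`passrate_by` divides the number of passing tests in each group by the group size. On a boolean column this is the same as the group mean, which the small sketch below checks on made-up outcomes (the data and variable names are illustrative only):

```python
import pandas as pd

# Toy outcomes (hypothetical), shaped like the CSV that passrate-csv.sh writes.
df = pd.DataFrame(
    {
        "solver": ["arpack", "arpack", "lobpcg", "lobpcg", "lobpcg"],
        "outcome": ["PASSED", "FAILED", "PASSED", "PASSED", "FAILED"],
    }
)
df["pass"] = df["outcome"] == "PASSED"

passes = df.groupby("solver")["pass"]
rate_ratio = passes.sum() / passes.count()  # what passrate_by computes
rate_mean = passes.mean()                   # equivalent shortcut for booleans
```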
38 changes: 38 additions & 0 deletions pca-sparse-debug/scripts/passrate-tolerance-plot.py
@@ -0,0 +1,38 @@
import sys
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

csv = sys.argv[1]
plot = sys.argv[2]

df = pd.read_csv(csv)
df['pass'] = df.outcome.apply(lambda x: True if x=='PASSED' else False)
df = df[df.solver != 'auto']

def passrate_by(x):
    passes = df.groupby(x)['pass']
    counts = passes.count()
    sums = passes.sum()
    return sums / counts

gb = passrate_by(['solver', 'rtol'])
def formatter(x):
    if x == 1:
        return '1e-00'
    else:
        return format(x, ".0e")
gb.index = gb.index.set_levels(map(formatter, gb.index.levels[1]), level=1)
fig, axes = plt.subplots(1, 4, figsize=(24, 4), dpi=300)
fig.suptitle('Test pass rate by solver and relative tolerance', fontsize=20)
for solver, ax in zip(gb.index.levels[0], axes.flat):
    bar = gb[solver].plot.bar(ax=ax)
    bar.set_title(solver, pad=15, fontsize=20)
    bar.set_xlabel('relative tolerance', fontsize=14)
    bar.set_ylabel('pass rate', fontsize=14)
    bar.set_ylim(bottom=0, top=1)
    bar.tick_params(axis='both', which='major', labelsize=14)
    ax.bar_label(ax.containers[0], fmt="%.2g")

fig.tight_layout()
fig.savefig(plot, facecolor='white', transparent=False)
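
The `formatter` helper in the tolerance plots special-cases `x == 1`. Python's `".0e"` format renders `1.0` as `1e+00`, while smaller tolerances come out as `1e-07` and so on. The special case apparently keeps all tick labels in the same `1e-NN` shape. A standalone sketch, with `format_rtol` as an illustrative name:

```python
def format_rtol(x):
    """Format a tolerance as '1e-NN', mapping 1 to '1e-00' instead of '1e+00'."""
    if x == 1:
        return "1e-00"
    return format(x, ".0e")


labels = [format_rtol(x) for x in (1.0, 1e-3, 1e-7)]
default_for_one = format(1.0, ".0e")  # what the helper avoids
```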
4 changes: 4 additions & 0 deletions pca-sparse-debug/scripts/search-mismatch-stats.sh
@@ -0,0 +1,4 @@
LOGFILE=$1
paste -d ',' \
<(echo 'seed,solver,format,k,density'; grep -Po '(?<=_ test_pca_sparse\[).+?(?=\])' $LOGFILE | sed 's/-/,/g') \
<(echo 'bad,total'; grep -Po '(?<=Mismatched elements: )\d+ / \d+' $LOGFILE | sed -e 's/ //g' -e 's/\//,/g')