ENH: improve FA vs PCA example · Felixhawk/scikit-learn@4f3722c · GitHub
Commit 4f3722c
dengemann committed
ENH: improve FA vs PCA example
1 parent 0e218c6 · commit 4f3722c

1 file changed: +33 −6 lines changed

examples/decomposition/plot_pca_vs_fa_model_selection.py

File mode changed: 100644 → 100755
Lines changed: 33 additions & 6 deletions
@@ -8,30 +8,38 @@

 Probabilistic PCA and Factor Analysis are probabilistic models.
 The consequence is that the likelihood of new data can be used
-for model selection. Here we compare PCA and FA with cross-validation
-on low rank data corrupted with homoscedastic noise (noise variance
+for model selection and covariance estimation.
+Here we compare PCA and FA with cross-validation on low rank data corrupted
+with homoscedastic noise (noise variance
 is the same for each feature) or heteroscedastic noise (noise variance
-is different for each feature).
+is different for each feature). In a second step we compare the model
+likelihood to the likelihoods obtained from shrinkage covariance estimators.

 One can observe that with homoscedastic noise both FA and PCA succeed
 in recovering the size of the low rank subspace. The likelihood with PCA
 is higher than FA in this case. However PCA fails and overestimates
-the rank when heteroscedastic noise is present. The automatic estimation from
+the rank when heteroscedastic noise is present. Under appropriate
+circumstances the low rank models yield a higher likelihood than the
+shrinkage models.
+
+The automatic estimation from
 Automatic Choice of Dimensionality for PCA. NIPS 2000: 598-604
 by Thomas P. Minka is also compared.

 """
 print(__doc__)

 # Authors: Alexandre Gramfort
+#          Denis A. Engemann
 # License: BSD 3 clause

 import numpy as np
 import pylab as pl
 from scipy import linalg

 from sklearn.decomposition import PCA, FactorAnalysis
+from sklearn.covariance import ShrunkCovariance, LedoitWolf
 from sklearn.cross_validation import cross_val_score
+from sklearn.grid_search import GridSearchCV

 ###############################################################################
 # Create the data
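The data-creation section that follows this header sits outside the hunk, so the diff does not show it. For readers of the commit, here is a minimal sketch of how such low rank data with the two noise types can be generated; the sizes, the random seed, and the noise scales below are illustrative assumptions, not taken from this commit:

# Sketch only -- sizes, seed and noise scales are assumed, not from the commit.
import numpy as np
from scipy import linalg

n_samples, n_features, rank = 1000, 50, 10
sigma = 1.
rng = np.random.RandomState(42)

# low rank signal spanning `rank` directions
U, _, _ = linalg.svd(rng.randn(n_features, n_features))
X = np.dot(rng.randn(n_samples, rank), U[:, :rank].T)

# homoscedastic noise: one variance shared by all features
X_homo = X + sigma * rng.randn(n_samples, n_features)

# heteroscedastic noise: a different variance per feature
sigmas = sigma * rng.rand(n_features) + sigma / 2.
X_hetero = X + rng.randn(n_samples, n_features) * sigmas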
@@ -67,7 +75,19 @@ def compute_scores(X):
         fa_scores.append(np.mean(cross_val_score(fa, X)))

     return pca_scores, fa_scores
-
+
+
+def shrunk_cov_score(X):
+    shrinkages = np.logspace(-100, 0, 30)
+    tuned_parameters = [{'shrinkage': shrinkages}]
+    cv = GridSearchCV(ShrunkCovariance(), tuned_parameters)
+    return np.mean(cross_val_score(cv.fit(X).best_estimator_, X, cv=3))
+
+
+def lw_score(X):
+    return np.mean(cross_val_score(LedoitWolf(), X, cv=3))
+
+
 for X, title in [(X_homo, 'Homoscedastic Noise'),
                  (X_hetero, 'Heteroscedastic Noise')]:
     pca_scores, fa_scores = compute_scores(X)
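The hunk only shows the tail of compute_scores. A plausible reconstruction of the whole helper, assuming an `n_components` grid defined earlier in the file (a sketch, not the committed code):

# Sketch of the full helper; `n_components` is assumed to be a range
# defined earlier in the example (not shown in this diff).
def compute_scores(X):
    pca = PCA()
    fa = FactorAnalysis()

    pca_scores, fa_scores = [], []
    for n in n_components:
        pca.n_components = n
        fa.n_components = n
        # cross_val_score averages the estimators' score() over CV folds,
        # i.e. the log-likelihood of held-out data under the fitted model
        pca_scores.append(np.mean(cross_val_score(pca, X)))
        fa_scores.append(np.mean(cross_val_score(fa, X)))

    return pca_scores, fa_scores

The two new helpers plug into the same machinery: ShrunkCovariance and LedoitWolf, like PCA and FactorAnalysis, expose a score method returning the Gaussian log-likelihood of held-out data, so their cross-validated scores are directly comparable with the PCA and FA curves. shrunk_cov_score additionally tunes the shrinkage strength with GridSearchCV before scoring.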
@@ -77,7 +97,7 @@ def compute_scores(X):
     pca = PCA(n_components='mle')
     pca.fit(X)
     n_components_pca_mle = pca.n_components_
-
+
     print("best n_components by PCA CV = %d" % n_components_pca)
     print("best n_components by FactorAnalysis CV = %d" % n_components_fa)
     print("best n_components by PCA MLE = %d" % n_components_pca_mle)
@@ -92,6 +112,13 @@ def compute_scores(X):
                label='FactorAnalysis CV: %d' % n_components_fa, linestyle='--')
     pl.axvline(n_components_pca_mle, color='k',
                label='PCA MLE: %d' % n_components_pca_mle, linestyle='--')
+
+    # compare with other covariance estimators
+    pl.axhline(shrunk_cov_score(X), color='violet',
+               label='Shrunk Covariance MLE', linestyle='--')
+    pl.axhline(lw_score(X), color='orange',
+               label='LedoitWolf MLE', linestyle='--')
+
     pl.xlabel('nb of components')
     pl.ylabel('CV scores')
     pl.legend(loc='lower right')
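Note the design choice in the plot: pl.axvline marks each selected dimensionality on the x-axis, while pl.axhline draws the shrinkage scores as horizontal baselines, since those estimators have no n_components to vary. Any point where the PCA or FA curve rises above a baseline is a model that explains held-out data better than the corresponding shrinkage covariance. A hypothetical numeric readout, reusing names from the loop above, that compares the best cross-validated likelihoods instead of reading them off the plot:

# Hypothetical addition, not part of this commit.
print("FA best CV log-likelihood:    %f" % np.max(fa_scores))
print("PCA best CV log-likelihood:   %f" % np.max(pca_scores))
print("Shrunk covariance likelihood: %f" % shrunk_cov_score(X))
print("Ledoit-Wolf likelihood:       %f" % lw_score(X))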
