
Probabilistic PCA and Factor Analysis are probabilistic models.
The consequence is that the likelihood of new data can be used
- for model selection. Here we compare PCA and FA with cross-validation
- on low rank data corrupted with homoscedastic noise (noise variance
+ for model selection and covariance estimation.
+ Here we compare PCA and FA with cross-validation on low rank data corrupted
+ with homoscedastic noise (noise variance
is the same for each feature) or heteroscedastic noise (noise variance
- is the different for each feature).
+ is different for each feature). In a second step we compare the model
+ likelihood to the likelihoods obtained from shrinkage covariance estimators.

One can observe that with homoscedastic noise both FA and PCA succeed
in recovering the size of the low rank subspace. The likelihood with PCA
is higher than FA in this case. However PCA fails and overestimates
- the rank when heteroscedastic noise is present. The automatic estimation from
+ the rank when heteroscedastic noise is present. Under appropriate
+ circumstances the low rank models are more likely than shrinkage models.
+
+ The automatic estimation from
Automatic Choice of Dimensionality for PCA. NIPS 2000: 598-604
by Thomas P. Minka is also compared.

"""
print(__doc__)

# Authors: Alexandre Gramfort
+ #          Denis A. Engemann
# License: BSD 3 clause

import numpy as np
import pylab as pl
from scipy import linalg

from sklearn.decomposition import PCA, FactorAnalysis
+ from sklearn.covariance import ShrunkCovariance, LedoitWolf
from sklearn.cross_validation import cross_val_score
+ from sklearn.grid_search import GridSearchCV

###############################################################################
# Create the data
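
The diff elides the data creation between this hunk and the next. For
orientation, low rank data with the two noise types can be built along the
following lines; this is a minimal sketch, and the constants (n_samples,
n_features, rank, sigma) are assumptions, not the file's verbatim values:

    import numpy as np
    from scipy import linalg

    n_samples, n_features, rank = 1000, 50, 10
    rng = np.random.RandomState(42)

    # low rank signal: random samples spanned by `rank` orthonormal directions
    U, _, _ = linalg.svd(rng.randn(n_features, n_features))
    X = np.dot(rng.randn(n_samples, rank), U[:, :rank].T)

    # homoscedastic noise: one variance shared by every feature
    sigma = 1.
    X_homo = X + sigma * rng.randn(n_samples, n_features)

    # heteroscedastic noise: a different variance per feature (assumed range)
    sigmas = sigma * rng.rand(n_features) + sigma / 2.
    X_hetero = X + rng.randn(n_samples, n_features) * sigmas
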
@@ -67,7 +75,19 @@ def compute_scores(X):
        fa_scores.append(np.mean(cross_val_score(fa, X)))

    return pca_scores, fa_scores
-
+
+
+ def shrunk_cov_score(X):
+     # grid-search the shrinkage coefficient by cross-validated likelihood
+     shrinkages = np.logspace(-2, 0, 30)
+     tuned_parameters = [{'shrinkage': shrinkages}]
+     cv = GridSearchCV(ShrunkCovariance(), tuned_parameters)
+     return np.mean(cross_val_score(cv.fit(X).best_estimator_, X, cv=3))
+
+
+ def lw_score(X):
+     # Ledoit-Wolf estimates its shrinkage analytically, no grid-search needed
+     return np.mean(cross_val_score(LedoitWolf(), X, cv=3))
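+
+ # Note: PCA, FactorAnalysis, ShrunkCovariance and LedoitWolf all implement
+ # score() as the average log-likelihood of held-out data, which is what
+ # lets cross_val_score compare these different model families on one scale.
+
+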
for X, title in [(X_homo, 'Homoscedastic Noise'),
                 (X_hetero, 'Heteroscedastic Noise')]:
    pca_scores, fa_scores = compute_scores(X)
@@ -77,7 +97,7 @@ def compute_scores(X):
    pca = PCA(n_components='mle')
    pca.fit(X)
    n_components_pca_mle = pca.n_components_
-
+
    print("best n_components by PCA CV = %d" % n_components_pca)
    print("best n_components by FactorAnalysis CV = %d" % n_components_fa)
    print("best n_components by PCA MLE = %d" % n_components_pca_mle)
@@ -92,6 +112,13 @@ def compute_scores(X):
               label='FactorAnalysis CV: %d' % n_components_fa, linestyle='--')
    pl.axvline(n_components_pca_mle, color='k',
               label='PCA MLE: %d' % n_components_pca_mle, linestyle='--')
+
+     # compare with other covariance estimators
+     pl.axhline(shrunk_cov_score(X), color='violet',
+                label='Shrunk Covariance MLE', linestyle='--')
+     pl.axhline(lw_score(X), color='orange',
+                label='LedoitWolf MLE', linestyle='--')
+
    pl.xlabel('nb of components')
    pl.ylabel('CV scores')
    pl.legend(loc='lower right')
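
As a quick sanity check without the figure, the two new helpers can be called
directly; a minimal usage sketch, assuming X_homo from the data-creation step:

    print("shrunk covariance CV score: %.3f" % shrunk_cov_score(X_homo))
    print("Ledoit-Wolf CV score: %.3f" % lw_score(X_homo))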