FIX DOC MNT Big revamp of cross-decomposition module (#17095) · NicolasHug/scikit-learn@8061aac · GitHub


Commit 8061aac

FIX DOC MNT Big revamp of cross-decomposition module (scikit-learn#17095)
1 parent 9007dc2 commit 8061aac

File tree

7 files changed: +1097 −971 lines changed


doc/modules/cross_decomposition.rst

Lines changed: 164 additions & 13 deletions
@@ -6,12 +6,9 @@ Cross decomposition
 
 .. currentmodule:: sklearn.cross_decomposition
 
-The cross decomposition module contains two main families of algorithms: the
-partial least squares (PLS) and the canonical correlation analysis (CCA).
-
-These families of algorithms are useful to find linear relations between two
-multivariate datasets: the ``X`` and ``Y`` arguments of the ``fit`` method
-are 2D arrays.
+The cross decomposition module contains **supervised** estimators for
+dimensionality reduction and regression, belonging to the "Partial Least
+Squares" family.
 
 .. figure:: ../auto_examples/cross_decomposition/images/sphx_glr_plot_compare_cross_decomposition_001.png
    :target: ../auto_examples/cross_decomposition/plot_compare_cross_decomposition.html
@@ -23,19 +20,173 @@ Cross decomposition algorithms find the fundamental relations between two
 matrices (X and Y). They are latent variable approaches to modeling the
 covariance structures in these two spaces. They will try to find the
 multidimensional direction in the X space that explains the maximum
-multidimensional variance direction in the Y space. PLS-regression is
-particularly suited when the matrix of predictors has more variables than
-observations, and when there is multicollinearity among X values. By contrast,
-standard regression will fail in these cases.
+multidimensional variance direction in the Y space. In other words, PLS
+projects both `X` and `Y` into a lower-dimensional subspace such that the
+covariance between `transformed(X)` and `transformed(Y)` is maximal.
+
+PLS draws similarities with `Principal Component Regression
+<https://en.wikipedia.org/wiki/Principal_component_regression>`_ (PCR), where
+the samples are first projected into a lower-dimensional subspace, and the
+targets `y` are predicted using `transformed(X)`. One issue with PCR is that
+the dimensionality reduction is unsupervised, and may lose some important
+variables: PCR would keep the features with the most variance, but it is
+possible that features with small variances are relevant for predicting
+the target. In a way, PLS allows for the same kind of dimensionality
+reduction, but by taking into account the targets `y`. An illustration of
+this fact is given in the following example:
+* :ref:`sphx_glr_auto_examples_cross_decomposition_plot_pcr_vs_pls.py`.
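To make the contrast with PCR concrete, a minimal sketch on synthetic data might look as follows (the toy dataset and the variance split are illustrative assumptions; the example linked above is the authoritative comparison)::

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.RandomState(0)
    X = rng.randn(100, 10) * np.array([0.1] + [1.0] * 9)  # feature 0 has the smallest variance
    y = 10 * X[:, 0] + 0.1 * rng.randn(100)               # ...but it drives the target

    # PCR: unsupervised projection (PCA) followed by linear regression.
    pcr = make_pipeline(PCA(n_components=2), LinearRegression()).fit(X, y)
    # PLS: the projection directions are chosen using the targets y as well.
    pls = PLSRegression(n_components=2).fit(X, y)

    print(pcr.score(X, y), pls.score(X, y))  # PLS is expected to score much higher here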
 
-Classes included in this module are :class:`PLSRegression`
+Apart from CCA, the PLS estimators are particularly suited when the matrix of
+predictors has more variables than observations, and when there is
+multicollinearity among the features. By contrast, standard linear regression
+would fail in these cases unless it is regularized.
+
+Classes included in this module are :class:`PLSRegression`,
 :class:`PLSCanonical`, :class:`CCA` and :class:`PLSSVD`
 
+PLSCanonical
+------------
+
+We here describe the algorithm used in :class:`PLSCanonical`. The other
+estimators use variants of this algorithm, and are detailed below.
+We recommend section [1]_ for more details and comparisons between these
+algorithms. In [1]_, :class:`PLSCanonical` corresponds to "PLSW2A".
+
+Given two centered matrices :math:`X \in \mathbb{R}^{n \times d}` and
+:math:`Y \in \mathbb{R}^{n \times t}`, and a number of components :math:`K`,
+:class:`PLSCanonical` proceeds as follows:
+
+Set :math:`X_1` to :math:`X` and :math:`Y_1` to :math:`Y`. Then, for each
+:math:`k \in [1, K]`:
+
+- a) compute :math:`u_k \in \mathbb{R}^d` and :math:`v_k \in \mathbb{R}^t`,
+  the first left and right singular vectors of the cross-covariance matrix
+  :math:`C = X_k^T Y_k`.
+  :math:`u_k` and :math:`v_k` are called the *weights*.
+  By definition, :math:`u_k` and :math:`v_k` are
+  chosen so that they maximize the covariance between the projected
+  :math:`X_k` and the projected target, that is :math:`\text{Cov}(X_k u_k,
+  Y_k v_k)`.
+- b) Project :math:`X_k` and :math:`Y_k` on the singular vectors to obtain
+  *scores*: :math:`\xi_k = X_k u_k` and :math:`\omega_k = Y_k v_k`
+- c) Regress :math:`X_k` on :math:`\xi_k`, i.e. find a vector :math:`\gamma_k
+  \in \mathbb{R}^d` such that the rank-1 matrix :math:`\xi_k \gamma_k^T`
+  is as close as possible to :math:`X_k`. Do the same on :math:`Y_k` with
+  :math:`\omega_k` to obtain :math:`\delta_k`. The vectors
+  :math:`\gamma_k` and :math:`\delta_k` are called the *loadings*.
+- d) *deflate* :math:`X_k` and :math:`Y_k`, i.e. subtract the rank-1
+  approximations: :math:`X_{k+1} = X_k - \xi_k \gamma_k^T`, and
+  :math:`Y_{k + 1} = Y_k - \omega_k \delta_k^T`.
+
+At the end, we have approximated :math:`X` as a sum of rank-1 matrices:
+:math:`X = \Xi \Gamma^T` where :math:`\Xi \in \mathbb{R}^{n \times K}`
+contains the scores in its columns, and :math:`\Gamma^T \in \mathbb{R}^{K
+\times d}` contains the loadings in its rows. Similarly for :math:`Y`, we
+have :math:`Y = \Omega \Delta^T`.
+
+Note that the scores matrices :math:`\Xi` and :math:`\Omega` correspond to
+the projections of the training data :math:`X` and :math:`Y`, respectively.
+
+Step *a)* may be performed in two ways: either by computing the whole SVD of
+:math:`C` and only retaining the singular vectors with the biggest singular
+values, or by directly computing the singular vectors using the power method
+(cf. section 11.3 in [1]_), which corresponds to the `'nipals'` option of the
+`algorithm` parameter.
+
+
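As a rough illustration of steps *a)* to *d)*, here is a small NumPy sketch of a single iteration on centered toy data (the shapes are illustrative assumptions, and a full SVD is used for step *a)*; the actual estimator also handles scaling, convergence checks, etc.)::

    import numpy as np

    rng = np.random.RandomState(0)
    n, d, t = 50, 5, 3
    Xk = rng.randn(n, d)
    Yk = rng.randn(n, t)
    Xk -= Xk.mean(axis=0)  # the matrices are assumed to be centered
    Yk -= Yk.mean(axis=0)

    # a) weights: first left and right singular vectors of the cross-covariance matrix
    C = Xk.T @ Yk
    U, s, Vt = np.linalg.svd(C)
    u, v = U[:, 0], Vt[0, :]

    # b) scores: project Xk and Yk on the weights
    xi, omega = Xk @ u, Yk @ v

    # c) loadings: least-squares regression of Xk on xi, and of Yk on omega
    gamma = Xk.T @ xi / (xi @ xi)
    delta = Yk.T @ omega / (omega @ omega)

    # d) deflation: subtract the rank-1 approximations
    Xk = Xk - np.outer(xi, gamma)
    Yk = Yk - np.outer(omega, delta)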
+Transforming data
+^^^^^^^^^^^^^^^^^
+
+To transform :math:`X` into :math:`\bar{X}`, we need to find a projection
+matrix :math:`P` such that :math:`\bar{X} = XP`. We know that for the
+training data, :math:`\Xi = XP`, and :math:`X = \Xi \Gamma^T`. Setting
+:math:`P = U(\Gamma^T U)^{-1}` where :math:`U` is the matrix with the
+:math:`u_k` in the columns, we have :math:`XP = X U(\Gamma^T U)^{-1} = \Xi
+(\Gamma^T U) (\Gamma^T U)^{-1} = \Xi` as desired. The rotation matrix
+:math:`P` can be accessed from the `x_rotations_` attribute.
+
+Similarly, :math:`Y` can be transformed using the rotation matrix
+:math:`V(\Delta^T V)^{-1}`, accessed via the `y_rotations_` attribute.
+
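In code, the projection is exposed through `transform`; a minimal sketch on toy data::

    import numpy as np
    from sklearn.cross_decomposition import PLSCanonical

    rng = np.random.RandomState(0)
    X, Y = rng.randn(50, 5), rng.randn(50, 3)

    pls = PLSCanonical(n_components=2).fit(X, Y)
    X_bar = pls.transform(X)            # scores of X, shape (50, 2)
    X_bar, Y_bar = pls.transform(X, Y)  # both X and Y can be transformed
    print(pls.x_rotations_.shape)       # (n_features, n_components) = (5, 2)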
+Predicting the targets Y
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+To predict the targets of some data :math:`X`, we are looking for a
+coefficient matrix :math:`\beta \in \mathbb{R}^{d \times t}` such that
+:math:`Y = X\beta`.
+
+The idea is to try to predict the transformed targets :math:`\Omega` as a
+function of the transformed samples :math:`\Xi`, by computing :math:`\alpha
+\in \mathbb{R}` such that :math:`\Omega = \alpha \Xi`.
+
+Then, we have :math:`Y = \Omega \Delta^T = \alpha \Xi \Delta^T`, and since
+:math:`\Xi` is the transformed training data we have that :math:`Y = X \alpha
+P \Delta^T`, and as a result the coefficient matrix :math:`\beta = \alpha P
+\Delta^T`.
+
+:math:`\beta` can be accessed through the `coef_` attribute.
+
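Prediction is exposed through `predict`; a minimal sketch on toy data (note that the exact shape convention of `coef_` may differ between scikit-learn versions)::

    import numpy as np
    from sklearn.cross_decomposition import PLSCanonical

    rng = np.random.RandomState(0)
    X, Y = rng.randn(50, 5), rng.randn(50, 3)

    pls = PLSCanonical(n_components=2).fit(X, Y)
    Y_pred = pls.predict(X)   # uses the fitted coefficient matrix internally
    print(pls.coef_.shape)    # relates the features to the targets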
+PLSSVD
+------
+
+:class:`PLSSVD` is a simplified version of :class:`PLSCanonical`
+described earlier: instead of iteratively deflating the matrices :math:`X_k`
+and :math:`Y_k`, :class:`PLSSVD` computes the SVD of :math:`C = X^TY`
+only *once*, and stores the `n_components` singular vectors associated with
+the biggest singular values in the matrices `U` and `V`, exposed as the
+`x_weights_` and `y_weights_` attributes. Here, the transformed data is
+simply `transformed(X) = XU` and `transformed(Y) = YV`.
+
+If `n_components == 1`, :class:`PLSSVD` and :class:`PLSCanonical` are
+strictly equivalent.
+
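A minimal usage sketch of :class:`PLSSVD` on toy data, illustrating the equivalence claimed above::

    import numpy as np
    from sklearn.cross_decomposition import PLSSVD, PLSCanonical

    rng = np.random.RandomState(0)
    X, Y = rng.randn(50, 5), rng.randn(50, 3)

    svd = PLSSVD(n_components=1).fit(X, Y)
    X_t, Y_t = svd.transform(X, Y)  # scores; X and Y are centered (and scaled) internally first

    # With a single component, PLSCanonical yields the same scores (compared here
    # up to sign and to the power method's numerical tolerance).
    X_c, Y_c = PLSCanonical(n_components=1).fit(X, Y).transform(X, Y)
    print(np.allclose(np.abs(X_t), np.abs(X_c), atol=1e-4))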
+PLSRegression
+-------------
+
+The :class:`PLSRegression` estimator is similar to
+:class:`PLSCanonical` with `algorithm='nipals'`, with two significant
+differences:
+
+- at step a) in the power method to compute :math:`u_k` and :math:`v_k`,
+  :math:`v_k` is never normalized.
+- at step c), the targets :math:`Y_k` are approximated using the projection
+  of :math:`X_k` (i.e. :math:`\xi_k`) instead of the projection of
+  :math:`Y_k` (i.e. :math:`\omega_k`). In other words, the loadings
+  computation is different. As a result, the deflation in step d) will also
+  be affected.
+
+These two modifications affect the output of `predict` and `transform`,
+which are not the same as for :class:`PLSCanonical`. Also, while the number
+of components is limited by `min(n_samples, n_features, n_targets)` in
+:class:`PLSCanonical`, here the limit is the rank of :math:`X^TX`, i.e.
+`min(n_samples, n_features)`.
+
+:class:`PLSRegression` is also known as PLS1 (single targets) and PLS2
+(multiple targets). Much like :class:`~sklearn.linear_model.Lasso`,
+:class:`PLSRegression` is a form of regularized linear regression where the
+number of components controls the strength of the regularization.
+
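A minimal sketch of :class:`PLSRegression` on toy data with more features than samples and collinear columns (an illustrative setting, not from the original); lowering `n_components` strengthens the regularization::

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.RandomState(0)
    # More features than samples, with nearly collinear columns: a typical PLS setting.
    X = rng.randn(20, 50)
    X[:, 1] = X[:, 0] + 0.01 * rng.randn(20)
    y = X[:, :5].sum(axis=1) + 0.1 * rng.randn(20)

    pls = PLSRegression(n_components=3).fit(X, y)
    y_pred = pls.predict(X)
    print(pls.score(X, y))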
+Canonical Correlation Analysis
+------------------------------
+
+Canonical Correlation Analysis was developed prior to, and independently of,
+PLS. But it turns out that :class:`CCA` is a special case of PLS, and
+corresponds to PLS in "Mode B" in the literature.
+
+:class:`CCA` differs from :class:`PLSCanonical` in the way the weights
+:math:`u_k` and :math:`v_k` are computed in the power method of step a).
+Details can be found in section 10 of [1]_.
+
+Since :class:`CCA` involves the inversion of :math:`X_k^TX_k` and
+:math:`Y_k^TY_k`, this estimator can be unstable if the number of features or
+targets is greater than the number of samples.
+
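A minimal usage sketch of :class:`CCA` on toy data with more samples than features and targets, as suggested above::

    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.RandomState(0)
    X, Y = rng.randn(100, 5), rng.randn(100, 3)

    cca = CCA(n_components=2).fit(X, Y)
    X_c, Y_c = cca.transform(X, Y)  # canonical variates of X and Y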
 
 .. topic:: Reference:
 
-   * JA Wegelin
-     `A survey of Partial Least Squares (PLS) methods, with emphasis on the two-block case <https://www.stat.washington.edu/research/reports/2000/tr371.pdf>`_
+   .. [1] `A survey of Partial Least Squares (PLS) methods, with emphasis on
+      the two-block case
+      <https://www.stat.washington.edu/research/reports/2000/tr371.pdf>`_
+      JA Wegelin
 
 .. topic:: Examples:
 
doc/whats_new/v0.24.rst

Lines changed: 29 additions & 0 deletions
@@ -97,6 +97,35 @@ Changelog
 :class:`covariance.GraphicalLassoCV`. `cv_alphas_` and `grid_scores_` will be
 removed in version 0.26. :pr:`16392` by `Thomas Fan`_.
 
+:mod:`sklearn.cross_decomposition`
+..................................
+
+- |API| The bounds of the `n_components` parameter are now restricted:
+
+  - into `[1, min(n_samples, n_features, n_targets)]`, for
+    :class:`cross_decomposition.PLSSVD`, :class:`cross_decomposition.CCA`,
+    and :class:`cross_decomposition.PLSCanonical`.
+  - into `[1, n_features]` for :class:`cross_decomposition.PLSRegression`.
+
+  An error will be raised in 0.26. :pr:`17095` by `Nicolas Hug`_.
+
+- |API| For :class:`cross_decomposition.PLSSVD`,
+  :class:`cross_decomposition.CCA`, and
+  :class:`cross_decomposition.PLSCanonical`, the `x_scores_` and `y_scores_`
+  attributes were deprecated and will be removed in 0.26. They can be
+  retrieved by calling `transform` on the training data. The `norm_y_weights`
+  attribute will also be removed. :pr:`17095` by `Nicolas Hug`_.
+
+- |Fix| Fixed a bug in :class:`cross_decomposition.PLSSVD` which would
+  sometimes return components in the reversed order of importance.
+  :pr:`17095` by `Nicolas Hug`_.
+
+- |Fix| Fixed a bug in :class:`cross_decomposition.PLSSVD`,
+  :class:`cross_decomposition.CCA`, and
+  :class:`cross_decomposition.PLSCanonical`, which would lead to incorrect
+  predictions for `est.transform(Y)` when the training data is single-target.
+  :pr:`17095` by `Nicolas Hug`_.
+
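A minimal sketch of the suggested replacement for the deprecated attributes (toy data assumed)::

    import numpy as np
    from sklearn.cross_decomposition import PLSCanonical

    rng = np.random.RandomState(0)
    X, Y = rng.randn(50, 5), rng.randn(50, 3)

    pls = PLSCanonical(n_components=2).fit(X, Y)
    # Instead of the deprecated pls.x_scores_ and pls.y_scores_ attributes,
    # retrieve the training scores by transforming the training data:
    x_scores, y_scores = pls.transform(X, Y)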
 :mod:`sklearn.datasets`
 .......................
 
sklearn/cross_decomposition/__init__.py

Lines changed: 1 addition & 2 deletions
@@ -1,4 +1,3 @@
-from ._pls import PLSCanonical, PLSRegression, PLSSVD
-from ._cca import CCA
+from ._pls import PLSCanonical, PLSRegression, PLSSVD, CCA
 
 __all__ = ['PLSCanonical', 'PLSRegression', 'PLSSVD', 'CCA']

sklearn/cross_decomposition/_cca.py

Lines changed: 0 additions & 112 deletions
This file was deleted.
