A Bug at the inverse_transform of the KernelPCA #18902
I found that this line was added in #16655, but I still do not see the point of it. Now I am concerned about the test code below as well. As the inversion using kernel ridge regression is only an approximation, the reconstructed input is not necessarily equivalent to the original input. So I suggest undoing #16655.

For clarification, I will mention some theoretical points. Let X be the input matrix of shape [n_samples, n_features] and X_tr be the transformed input matrix of shape [n_samples, n_components]. X_tr contains the coefficients of the principal components obtained by kernel PCA. Then the proper formula for the prediction of the input X from the transformed input X_tr using kernel ridge regression becomes

X_pred = K(X_tr, X_tr) [K(X_tr, X_tr) + α I]^{-1} X,

where K(·, ·) denotes the kernel matrix evaluated on the transformed inputs. Note that α appears only inside the inverse, i.e. in the dual coefficients computed at fit time, and not in the kernel used at prediction time.
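As a rough illustration of that formula, here is a minimal sketch (with made-up toy arrays standing in for X and X_tr, not code from the library): alpha enters only when the dual coefficients are solved for, and prediction is a plain kernel product.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(30, 4)       # stand-in for the original inputs [n_samples, n_features]
X_tr = rng.randn(30, 2)    # stand-in for the transformed inputs [n_samples, n_components]
alpha = 0.1

# Fit stage: alpha is added to the kernel once, when solving for the dual coefficients.
K_fit = rbf_kernel(X_tr, X_tr)
dual_coef = np.linalg.solve(K_fit + alpha * np.eye(len(X_tr)), X)

# Prediction stage: plain kernel between new and fitted transformed points, no alpha term.
X_tr_new = rng.randn(5, 2)
X_pred = rbf_kernel(X_tr_new, X_tr) @ dual_coef
print(X_pred.shape)  # (5, 4)
```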
Hey @kstoneriv3, good spot! You are right - the kernel being rectangular is a giveaway that something is not right here. I do think the original bug does need a fix though, although the fix may be to add back on the mean somewhere (I seem to remember going down this route first), as previously the inverse transform always returned data with zero mean. For example, on scikit-learn==0.22:

    from sklearn.datasets import make_blobs
    from sklearn.decomposition import KernelPCA
    import matplotlib.pyplot as plt

    X, _ = make_blobs(n_samples=100, centers=[[1, 1]], random_state=0)
    X_train, X_test = X[:90], X[90:]

    kp = KernelPCA(n_components=2, fit_inverse_transform=True)
    kp.fit(X_train)
    X_trans = kp.transform(X_test)
    X_inv = kp.inverse_transform(X_trans)

    plt.scatter(*zip(*X_test))
    plt.scatter(*zip(*X_inv))
    for x_inv, x_test in zip(X_inv, X_test):
        plt.plot([x_inv[0], x_test[0]], [x_inv[1], x_test[1]], 'k--', alpha=0.5)
    plt.show()

I can take a look at this unless you want to have a go?
@lrjball Thank you so much for taking the trouble to provide a clear example! I now see that your fix does reduce the reconstruction error. However, at the moment, I do not see why it works so well. I will think about it for a while and probably get back to you soon.
I haven't figured out the cause of the zero-mean issue yet, but I guess #16655 is not a theoretically proper way to fix it. Theoretically, #16655 added a one-hot vector representation of each training sample, scaled by √α, to the feature space. If we write the original kernel feature of data x_i as φ_i, it is changed to φ^{new}_i = (φ_i, √α e_i), so that

⟨φ^{new}_i, φ^{new}_j⟩ = K(x_i, x_j) + α δ_{ij}.

Using this idea, you could also fix the zero-mean issue by assigning a randomly picked one-hot vector e_i to each row of the transformed input X, as follows:

    def inverse_transform(self, X):
        K = self._get_kernel(X, self.X_transformed_fit_)
        n_X_fit = self.X_transformed_fit_.shape[0]
        n_X = X.shape[0]
        # randomly assign a one-hot vector (scaled by alpha) to each sample
        K[np.arange(n_X), np.random.randint(0, n_X_fit, n_X)] += self.alpha
        return np.dot(K, self.dual_coef_)

Anyway, I will get back to you again when I see where the zero-mean issue is caused.
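To make the augmented-feature picture above concrete, here is a small numeric check (a sketch with a made-up toy feature matrix Phi and toy alpha): stacking a scaled one-hot block next to the features adds alpha to the diagonal of the Gram matrix.

```python
import numpy as np

rng = np.random.RandomState(0)
n, d, alpha = 5, 3, 0.7
Phi = rng.randn(n, d)                                   # toy feature vectors phi_i

K = Phi @ Phi.T                                         # original Gram matrix
Phi_new = np.hstack([Phi, np.sqrt(alpha) * np.eye(n)])  # phi_i^new = (phi_i, sqrt(alpha) e_i)
K_new = Phi_new @ Phi_new.T

print(np.allclose(K_new, K + alpha * np.eye(n)))        # True
```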
Finally, I figured out the cause of this issue. In fact, it is not really a bug. The problem happens only when the linear kernel is used; indeed, with a non-linear kernel such as rbf you do not see this systematic offset. When the linear kernel is used, the model is fitted to shifted data whose mean is zero (this shifting happens when the kernel is centered during fitting), so the reconstructions also come back with zero mean.

So, probably a good way to address this issue would be to add a mean shift when the linear kernel is used:

    def inverse_transform(self, X, mean_shift=True):
        K = self._get_kernel(X, self.X_transformed_fit_)
        X_reconstructed = np.dot(K, self.dual_coef_)
        # Apply mean shift to keep compatibility with `sklearn.decomposition.PCA`
        if self.kernel == "linear" and mean_shift:
            X_reconstructed += self.X_fit_.mean(axis=0, keepdims=True)
        return X_reconstructed

@lrjball Do you agree with this explanation? If you do, I will make a PR.
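For what it's worth, here is a sketch of the proposed mean shift applied outside the class, reusing the blobs example from above. It assumes the scikit-learn 0.22-era behaviour described earlier, where the linear-kernel inverse_transform returns zero-mean reconstructions; on versions where that is not the case the shift would not help.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import KernelPCA

X, _ = make_blobs(n_samples=100, centers=[[1, 1]], random_state=0)
X_train, X_test = X[:90], X[90:]

kp = KernelPCA(n_components=2, fit_inverse_transform=True)  # linear kernel
kp.fit(X_train)
X_inv = kp.inverse_transform(kp.transform(X_test))

# The proposed fix, applied externally: add the training mean back.
X_inv_shifted = X_inv + X_train.mean(axis=0, keepdims=True)

print(np.abs(X_inv - X_test).mean())          # reconstruction error without the shift
print(np.abs(X_inv_shifted - X_test).mean())  # with the shift (smaller under the 0.22 behaviour)
```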
Very interesting. So what is it about the linear kernel that is so special? Is it just that the method is inherently an approximation but can be made more precise in this special case? All of the kernels are zero-meaned in the same way, so without digging into it I am missing why the linear kernel is the only one that broke. Also, worth bearing in mind that
The centering operation for the linear kernel changes the original kernel matrix X X^T to (X - 1μ)(X - 1μ)^T, where X is of shape [n_samples, n_features], μ is the mean row vector of X, and 1 is a column vector with all elements equal to one. Thus, the information about the feature mean μ is lost by centering, because one cannot recover μ from the value of (X - 1μ)(X - 1μ)^T alone. This is also the case for non-linear kernels, but there we lose the information about the mean feature 1/N Σ_n Φ(x_n). In the special case of the linear kernel, this mean feature coincides with the sample mean of X.

So I would say the method is always inexact in that it always loses the information of the feature mean. However, if we used a linear kernel without centering (and kept enough components), the method could be exact.
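Here is a minimal sketch of that argument (assuming NumPy and scikit-learn's KernelCenterer, with a random toy X): adding a constant offset to the data does not change the centered linear kernel, so the feature mean cannot be recovered from it.

```python
import numpy as np
from sklearn.preprocessing import KernelCenterer

rng = np.random.RandomState(0)
X = rng.randn(20, 3)                        # toy data, [n_samples, n_features]
X_shifted = X + np.array([5.0, -2.0, 7.0])  # same data with a different mean

K = KernelCenterer().fit_transform(X @ X.T)                          # (X - 1mu)(X - 1mu)^T
K_shifted = KernelCenterer().fit_transform(X_shifted @ X_shifted.T)

print(np.allclose(K, K_shifted))  # True: the centered kernels coincide
```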
Good point! I totally agree.
@kstoneriv3 makes sense. Well, if you think you have identified the fix in the code, then I think a PR would be a good idea. I'm sure others will be able to give you some good feedback on it after that.
Both in the examples above and in the _kernel_pca.py code, I see expressions of the form K = self._get_kernel(X, self.X_transformed_fit_), i.e. the kernel is evaluated on transformed points (projections of φ(x), for some feature map φ) rather than on the original input vectors. I get that _get_kernel might work even if we pass it vectors in the wrong space, particularly if it only depends trivially on the Euclidean norm of the vectors or their inner product, but if we use a kernel that explicitly depends on the dimension of the input space (like the Mahalanobis distance), then it becomes clear this won't work. Could someone please clarify why it makes sense here to compute the kernel on transformed/projected vectors in feature space?
@Quarlos Hi, I have not had time to check the code and the references, so I may be wrong, but as far as I remember it is just a rough approximation. Kernel PCA does not have a natural analogue of the projection step of linear PCA, and this reconstruction technique does not have an equally solid theoretical background. Here, we are using kernel ridge regression with multi-dimensional outputs, but we could (in theory) use any supervised learning method to learn the mapping from the low-dimensional representation obtained by kernel PCA back to the original representation (i.e. the pre-image). This idea was introduced in the following paper, so it may contain more intuition on how it works.

Bakır, G. H., Weston, J., & Schölkopf, B. (2004). Learning to find pre-images. Advances in Neural Information Processing Systems, 16, 449–456.
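To illustrate the "any supervised learning method" point, here is a sketch that learns an approximate pre-image map with a plain Ridge regressor (an arbitrary choice for illustration, not what scikit-learn actually uses internally):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import Ridge

X, _ = make_blobs(n_samples=100, centers=[[1, 1]], random_state=0)
Z = KernelPCA(n_components=2, kernel="rbf", gamma=0.1).fit_transform(X)

# Learn a map from the KernelPCA representation back to the original inputs.
preimage_model = Ridge(alpha=1.0).fit(Z, X)
X_approx = preimage_model.predict(Z)

print(np.mean((X - X_approx) ** 2))  # reconstruction error of the learned pre-image map
```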
Thank you @kstoneriv3. Yes, the solution detailed in the paper (which is also referenced in the code) is indeed an approximation. It is actually very neat: all it uses is linear algebra, bypassing many of the problems we would have if we did this by gradient descent. As far as I understand the paper, though, it seems to me that the sklearn implementation is doing something different.

The kernel K is computed on transformed points in two places:

1. in _fit_inverse_transform: K = self._get_kernel(X_transformed)
2. in inverse_transform: K = self._get_kernel(X, self.X_transformed_fit_)

where X_transformed is the projected point in the higher-dimensional space (used for learning the map). What is strange to me is how K is computed in both 1 and 2: X_transformed is a point in the higher-dimensional space (and so is X in 2), but the kernel K is not defined there in this way. Let me write an example, using the rbf kernel on a d-dimensional input space:

k(x, y) = exp(-γ ||x - y||^2),

where ||x - y||^2 is the square of the Euclidean norm in R^d. Of course, this still works because φ(x) also has a Euclidean norm, but I don't think it is correct. When the kernel is given in this way, it is only defined on input vectors. Yes, underneath the kernel we have

k(x, y) = ⟨φ(x), φ(y)⟩_H,

but that inner product is an inner product on the RKHS H, and that is not trivial to compute. Based on the paper, I feel the right implementation should be different. Let me know if it makes sense.
I am not sure I fully understood your point, but I think it is useful to point out the following:
For the predefined kernels you will never encounter any issues with using the same kernel. But I was trying a callable kernel of the form

k(x, y) = exp(-(x - y)^T Σ^{-1} (x - y)),

where Σ is a covariance matrix defined on the d-dimensional input space (a Mahalanobis-type kernel), so the kernel explicitly depends on the dimension of the input space. One way to make this work is to have KernelPCA accept two kernels for these two different purposes.
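A sketch of the corner case (the mahalanobis_kernel here is hypothetical, just to illustrate a callable tied to the input dimension): fitting with fit_inverse_transform=True reuses the same callable on the n_components-dimensional transformed points, where the d x d matrix no longer matches.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import KernelPCA

X, _ = make_blobs(n_samples=100, n_features=4, random_state=0)
M_inv = np.linalg.inv(np.cov(X, rowvar=False))  # 4 x 4, tied to the input space

def mahalanobis_kernel(x, y):
    d = x - y
    return np.exp(-d @ M_inv @ d)

kpca = KernelPCA(n_components=2, kernel=mahalanobis_kernel,
                 fit_inverse_transform=True)
try:
    kpca.fit(X)  # _fit_inverse_transform evaluates the kernel on 2-d transformed points
except ValueError as exc:
    print("kernel called on transformed points:", exc)
```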
Ah, yes. Use of a custom kernel with inverse_transform is indeed a corner case the current code fails to cover! |
I have a question about the following line of KernelPCA. The line appears in the reprojection of the sample points in KernelPCA.inverse_transform, but to me it seems unnecessary. Here, kernel ridge regression is used to (approximately) reproject the "transformed samples" (which are the coefficients of the principal components), using precomputed dual coefficients of the kernel ridge regression. Though we need to add alpha to the diagonal elements of the kernel matrix when computing the dual coefficients (as in _fit_inverse_transform), we do not need to add alpha to the kernel in the prediction stage, as far as I understand.

scikit-learn/sklearn/decomposition/_kernel_pca.py, line 362 in 0fb307b

Is this line really necessary? If so, I would appreciate it if someone could explain why it is needed.
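As a sanity check of this point, here is a sketch using sklearn.kernel_ridge.KernelRidge directly (toy data, rbf kernel): at prediction time the plain kernel times the fitted dual coefficients reproduces predict(), with no alpha added.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
Z_train, Y_train = rng.randn(50, 2), rng.randn(50, 3)
Z_test = rng.randn(5, 2)

krr = KernelRidge(alpha=0.1, kernel="rbf", gamma=0.5).fit(Z_train, Y_train)

# Manual prediction: kernel between test and training points, no alpha term.
K_test = rbf_kernel(Z_test, krr.X_fit_, gamma=0.5)
print(np.allclose(K_test @ krr.dual_coef_, krr.predict(Z_test)))  # True
```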