Implement SVD algorithm in MDS #16067
Conversation
Results:
Code:
Thanks for the PR @panpiort8
This also needs a whats_new entry in the right file. You can check the other entries and put one in the right place.
I have not checked the math.
Thanks for the PR @panpiort8, I made a quick pass.
A brief "Method" section in the user guide would be useful, to describe and compare SMACOF and SVD. Otherwise it isn't clear for users why they might prefer one over the other.
Thanks!
Thanks for the reviews @NicolasHug and @adrinjalali. I hope it's done, but you'd better look at the docs ;)
Are there advantages to using smacof if you want to assume Euclidean similarities? If not, should we just make this work with MDS(metric='euclidean') running SVD? Or is that too opaque to the user?
I don't see any advantages of using smacof in that case. However, if we force MDS to use svd_scaler when metric='euclidean', there will be some problems:
To my mind the only thing we can do is to add a third possible value of the 'solver' parameter, namely 'optimal', and set solver='optimal' as the default. But that's even more problematic. What do you think?
My proposal there was to remove the solver parameter and replace it with metric='euclidean', not to have both.
Ok, I misunderstood. I hadn't suspected I was allowed to change the API in a backward-incompatible way. If so, I think it's a great idea and should be discussed. To start, here is how I see it:
Another round, mostly nits, reviewed the math and the tests, this looks good!
sklearn/manifold/_mds.py
Outdated
"Multidimensional Scaling" Chapman Hall | ||
2nd edition, Boca Raton (2001) | ||
|
I'm quite ambivalent about citing a $300+ Springer book.
Can we find an alternative? Else I guess the other ref would be enough.
I found the 'core' paper behind that citation, but it's still paywalled. I've switched to it anyway. Is that ok?
# Signs of columns are dependent on signs of computed eigenvectors
# which are arbitrary and meaningless
assert (np.allclose(mds_clf.embedding_, X_true_1) |
Let's just compare the absolute values
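A minimal sketch of what that could look like; the array values are made-up stand-ins for mds_clf.embedding_ and X_true_1 from the quoted test:

import numpy as np
from numpy.testing import assert_allclose

# Toy stand-ins: the "embedding" differs from the expected result only by the
# sign of one column, which is the indeterminacy the eigenvectors introduce.
embedding = np.array([[1.0, -2.0], [-1.0, 2.0]])
expected = np.array([[-1.0, -2.0], [1.0, 2.0]])

# Comparing absolute values makes the assertion insensitive to those signs.
assert_allclose(np.abs(embedding), np.abs(expected))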
# ``dissimilarities`` is Euclidean iff ``K`` is positive semi-definite.
# For detail see "Multidimensional Scaling" Chapman Hall p 397
try:
    w = _check_psd_eigenvalues(w)
I'm sure @smarie will be happy to hear that _check_psd_eigenvalues is used! ;)
Great!! Thanks for the connection @NicolasHug :)
It would also be useful to have a functional test on various random (or not) data, testing that the SVD solver results in lower stress than SMACOF.
Do you mean a test in test_mds.py?
Gentle ping
I don't really get what you mean (#16067 (comment)) about transforming the input matrix to a Euclidean one
I mean these lines in the MDS class (/sklearn/manifold/_mds.py):

if self.dissimilarity == "precomputed":
    self.dissimilarity_matrix_ = X
elif self.dissimilarity == "euclidean":
    self.dissimilarity_matrix_ = euclidean_distances(X)

I agree that the word "transform" is not the best one here.
The "dissimilarity" matrix must already be symmetric and with all zeroes in the main diagonal in any case. What do you mean by "euclidean"? Take into account that if instead of "precomputed", "euclidean" is used as parameter for the dissimilarity option, it does not mean that all the off-diagonal elements will be zero. They can be zero if the same point is repeated in the dataset and the SMACOF works just fine with zeroes outside the diagonal (those points are fitted to be at "0" distance, zeroes are actually used for the fitting in the "metric" case and ignored in the "non-metric"). So using "euclidean" does not mean you get an Euclidean matrix in the classical sense (without zeroes off the diagonal), which might be confusing. Would SVD fail with zeroes outside the main diagonal? |
Nope, I mean a Euclidean Distance Matrix, and actually I cannot find any other definition of a Euclidean matrix (at least on Google).
It works:

>>> points = np.array([[0, 93],
...                    [0, 93],
...                    [0, 93]])
>>> mds_clf = mds.MDS(metric=True, solver="svd", dissimilarity='euclidean')
>>> mds_clf.fit(points)
>>> print(mds_clf.embedding_)
[[0. 0.]
 [0. 0.]
 [0. 0.]]
That's what I thought was your definition. But, by that Wikipedia definition, the Euclidean Distance Matrix has the distances squared. If we use dissimilarity='euclidean' in the MDS class we actually get a dissimilarity matrix with the distances not squared, so being strict with the Wikipedia definition it would not be the Euclidean Distance Matrix. Is it relevant for the SVD whether they are squared or not?
True, I'll make it more precise in my code.
My implementation of SVD expects the input matrix to be a squared Euclidean Distance Matrix. Maybe we should call it a 'matrix of euclidean distances'. I think that's quite precise.
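For readers following along, here is a minimal, self-contained sketch of the classical-scaling computation being discussed (not the PR's code): it double-centers the squared Euclidean distance matrix and reads the embedding off the top eigenpairs.

import numpy as np
from sklearn.metrics import euclidean_distances

rng = np.random.RandomState(0)
X = rng.rand(10, 3)
n_samples, n_components = X.shape[0], 2

# Squared Euclidean distance matrix, as the classical/SVD solver expects.
D_sq = euclidean_distances(X, squared=True)

# Double centering: B = -0.5 * J @ D_sq @ J with J = I - (1/n) * 1 1^T.
J = np.eye(n_samples) - np.ones((n_samples, n_samples)) / n_samples
B = -0.5 * J @ D_sq @ J

# Coordinates from the largest eigenpairs (eigenvalues are non-negative up to
# numerical noise because D_sq is a true squared Euclidean matrix).
w, V = np.linalg.eigh(B)
w, V = w[::-1][:n_components], V[:, ::-1][:, :n_components]
embedding = V * np.sqrt(np.clip(w, 0, None))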
Then, when using SVD and 'euclidean' instead of 'precomputed', it should do the following, right?:
Then, we basically have two options for the dissimilarity matrix:
I agree maybe it would be better to remove the dissimilarity parameter completely, since it is confusing having it with two different behaviors and a non-compliant definition of the Euclidean matrix. The dissimilarity matrix should be built externally before calling MDS, but maybe it is not convenient for backwards compatibility. Then we would have 3 different solvers:
Then, maybe instead of having 'solver' and 'metric' parameters, we could just have one with three options, but again, adding 'solver' defaulting to SMACOF and keeping 'metric' is definitely better for compatibility...
You're right, but I also would rather remove this parameter completely.
What about adding
I like just adding the 'solver' parameter defaulting to 'SMACOF' and just adding the 'SVD' solver option, checking internally that metric is 'true' if 'solver' is 'SVD' and raising an error if it is 'false'. And also adding that if dissimilarity is 'euclidean' and 'solver' is SVD it does the squared version. This way it is completely backwards compatible and we do not break any existing code using the MDS library.
Have you tested that the solution is the same-ish (apart from speed/accuracy differences) with solver='SVD' and solver='SMACOF' with metric='True'? Both with 'euclidean' dissimilarities. With this tweak, where it is squared in one case and not squared in the other, they should converge to equivalent points (at least for a simple case in which SMACOF is not stuck in a local minimum, and taking into consideration that there will be an arbitrary translation/rotation/mirroring of the points). Maybe this script can be useful to visually see if the points are correct:
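Not the script referred to above, but a hedged sketch of such a check, written without assuming the proposed solver='svd' API exists: it compares scikit-learn's SMACOF result against classical scaling computed by hand, using a Procrustes analysis to remove the arbitrary translation/rotation/mirroring.

import numpy as np
from scipy.spatial import procrustes
from sklearn.manifold import MDS
from sklearn.metrics import euclidean_distances

rng = np.random.RandomState(0)
X = rng.rand(20, 2)                      # intrinsically 2-D data
D = euclidean_distances(X)

# SMACOF, the existing metric MDS solver in scikit-learn.
smacof_emb = MDS(n_components=2, dissimilarity='precomputed',
                 random_state=0).fit_transform(D)

# Classical scaling by hand, standing in for the proposed SVD solver.
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
w, V = np.linalg.eigh(B)
classical_emb = V[:, ::-1][:, :2] * np.sqrt(np.clip(w[::-1][:2], 0, None))

# Procrustes removes translation, rotation, reflection and scale; a small
# disparity means the two configurations are equivalent.
_, _, disparity = procrustes(smacof_emb, classical_emb)
print(disparity)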
I believe leaving An if we want
"Make sure to pass an euclidean matrix, or use " | ||
"dissimilarity='euclidean'.") | ||
|
# Get ``n_compontent`` greatest eigenvalues and corresponding eigenvectors. |
small typo here:
# Get ``n_compontent`` greatest eigenvalues and corresponding eigenvectors.
# Get ``n_components`` greatest eigenvalues and corresponding eigenvectors.
Parameters
----------
dissimilarities : ndarray, shape (n_samples, n_samples)
    Pairwise dissimilarities between the points. Must be euclidean.
Here and in several other parts
Must be euclidean
I may be mistaken, but the term "Euclidean" when applied to metrics/distances does not denote a property; it refers to the euclidean distance metric (the L2 norm of the vector difference).
So what you probably mean here and in other comments is that the dissimilarity must be a metric, that is, it should ensure the 4 conditions below:
- positivity
- symmetry
- distinguishability
- triangular inequality
Ref: M.M. Deza and E. Deza. Encyclopedia of Distances. Vol. 2006. 2009, p. 590. arXiv:0505065
Or simply: https://en.wikipedia.org/wiki/Metric_(mathematics)#Definition
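For illustration, a hypothetical helper (not part of scikit-learn) sketching how those four conditions could be checked numerically on a precomputed dissimilarity matrix:

import numpy as np
from sklearn.metrics import euclidean_distances

def looks_like_a_metric(D, tol=1e-9):
    """Rough numerical check of the four metric axioms for a matrix D.

    Hypothetical helper for illustration only.
    """
    D = np.asarray(D, dtype=float)
    positivity = bool(np.all(D >= -tol))
    symmetry = np.allclose(D, D.T)
    # Distinguishability can only be partially checked from D alone: we just
    # require a zero diagonal (d(x, x) = 0).
    zero_diagonal = np.allclose(np.diag(D), 0.0)
    # Triangle inequality: D[i, k] <= D[i, j] + D[j, k] for all i, j, k.
    triangle = bool(np.all(D[:, None, :] <= D[:, :, None] + D[None, :, :] + tol))
    return positivity and symmetry and zero_diagonal and triangle

# Example: plain Euclidean distances satisfy all four conditions.
X = np.random.RandomState(0).rand(5, 2)
print(looks_like_a_metric(euclidean_distances(X)))   # True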
w, V = linalg.eigh(B, check_finite=False)
# ``dissimilarities`` is Euclidean iff ``B`` is positive semi-definite. |
See above: probably replace "Euclidean" with "a distance metric".
try:
    w = _check_psd_eigenvalues(w)
except ValueError:
    raise ValueError("Dissimilarity matrix must be euclidean. "
See above: I would directly raise the original ValueError stating that the matrix is not PSD, or say that "dissimilarities must be computed using a true distance metric. Make sure to pass a positive semi-definite matrix, or use dissimilarity='euclidean' to use the euclidean distance".
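A sketch of the re-raise pattern being suggested, chaining the original exception so the PSD details are preserved; the message text here is only an example, not the PR's wording.

import numpy as np
from sklearn.utils.validation import _check_psd_eigenvalues

# Dummy eigenvalues with a significantly negative entry, as would come out of
# double-centering a non-Euclidean dissimilarity matrix.
w = np.array([3.0, 1.0, -0.5])

try:
    w = _check_psd_eigenvalues(w)
except ValueError as exc:
    # Chain the original "not PSD" error and add a user-facing hint.
    raise ValueError(
        "Dissimilarities must be computed with a true distance metric so "
        "that the double-centered matrix is positive semi-definite; "
        "consider dissimilarity='euclidean'.") from exc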
# Get ``n_compontent`` greatest eigenvalues and corresponding eigenvectors.
# Eigenvalues should be in descending order by convention.
w = w[::-1][:n_components] |
@NicolasHug remind me to propose a PR once this one is merged, to accelerate this with randomized methods as we did for KernelPCA.
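Not the KernelPCA machinery itself, but a rough illustration of the idea: only the top n_components eigenpairs are needed, so an iterative solver such as ARPACK (scipy.sparse.linalg.eigsh) can replace the full eigendecomposition.

import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.metrics import euclidean_distances

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
n, n_components = X.shape[0], 2

J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ euclidean_distances(X, squared=True) @ J

# Compute only the ``n_components`` largest (algebraic) eigenpairs instead of
# the full spectrum returned by ``linalg.eigh``.
w, V = eigsh(B, k=n_components, which='LA')
order = np.argsort(w)[::-1]
w, V = w[order], V[:, order]
embedding = V * np.sqrt(np.clip(w, 0, None))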
# Double centered matrix
B = -0.5 * np.dot(J, np.dot(dissimilarities ** 2, J))
w, V = linalg.eigh(B, check_finite=False) |
I am not familiar with the reference you used, but this seems strange to me: why is the method called SVD if you use eigs to solve it?
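Part of the answer may simply be naming: the double-centered matrix B is symmetric positive semi-definite, so its SVD and its eigendecomposition coincide (same values, same vectors up to sign), and implementations often use eigh. A small illustration of that equivalence:

import numpy as np

rng = np.random.RandomState(0)
A = rng.rand(5, 5)
B = A @ A.T          # symmetric PSD, like the double-centered matrix in MDS

U, s, _ = np.linalg.svd(B)
w, V = np.linalg.eigh(B)   # eigenvalues in ascending order

# Singular values equal the eigenvalues, and (for distinct eigenvalues) the
# singular vectors match the eigenvectors up to sign.
print(np.allclose(s, w[::-1]))
print(np.allclose(np.abs(U), np.abs(V[:, ::-1])))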
I've made some mess with this PR, but I won't be able to tidy it up in the next two months (at least). If anyone is interested in continuing this PR, feel free to do it. I'd start with figuring out whether svd_solver really requires dissimilarity to be a (squared) Euclidean distance matrix. Wikipedia claims that "Classical MDS assumes Euclidean distances." What does "assumes" mean? That the algorithm won't work properly if the distances come from another metric?
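One way to read "assumes": classical scaling reproduces the dissimilarities exactly only when they are Euclidean-embeddable; for a perfectly valid but non-Euclidean metric, the double-centered matrix is not PSD, which is what the PSD check in this PR would reject. A small illustration (not from the PR):

import numpy as np

# A valid metric that is *not* Euclidean-embeddable: shortest-path distances
# on a star graph with one center (index 0) and three leaves.  No three points
# at distance 1 from a common center can be pairwise at distance 2 in any
# Euclidean space, so no configuration reproduces these distances exactly.
D = np.array([[0., 1., 1., 1.],
              [1., 0., 2., 2.],
              [1., 2., 0., 2.],
              [1., 2., 2., 0.]])

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J

# Schoenberg's criterion: D is Euclidean-embeddable iff B is PSD.  Here the
# smallest eigenvalue is negative (about -0.25), which is exactly what the
# PSD check in the classical/SVD solver would flag.
print(np.linalg.eigvalsh(B).min())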
np.ones(dissimilarities.shape))
# Double centered matrix
B = -0.5 * np.dot(J, np.dot(dissimilarities ** 2, J)) |
B = -0.5 * np.dot(J, np.dot(dissimilarities ** 2, J))
B = -0.5 * np.dot(J, np.dot(dissimilarities, J)) |
We then don't have to differentiate between SVD and SMACOF as suggested here
Continued in #22330. Would love follow-up discussion there from anyone who's interested!
Closing this one as superseded by #22330. Thanks @panpiort8 for your work!
Reference Issues/PRs
Fixes #15272
What does this implement/fix? Explain your changes.
Adds an implementation of Multidimensional Scaling (MDS) that uses the Singular Value Decomposition (SVD) method. SVD is much faster and far more accurate for Euclidean matrices than the SMACOF algorithm (the current implementation of MDS in the sklearn/manifold module).
Any other comments?
I'm putting a simple benchmark script and its results in a comment (not sure if that's best practice).
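The benchmark itself isn't reproduced here, but as a rough, hypothetical sketch of that kind of comparison against the released API, with hand-rolled classical scaling standing in for the proposed SVD solver:

import time

import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import euclidean_distances

rng = np.random.RandomState(42)
X = rng.rand(300, 10)
D = euclidean_distances(X)


def stress(emb, D):
    # Sum of squared distance residuals over ordered pairs, halved so each
    # pair is counted once.
    return ((euclidean_distances(emb) - D) ** 2).sum() / 2


t0 = time.time()
smacof_emb = MDS(n_components=2, dissimilarity='precomputed',
                 random_state=0).fit_transform(D)
t_smacof = time.time() - t0

t0 = time.time()
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
w, V = np.linalg.eigh(B)
classical_emb = V[:, ::-1][:, :2] * np.sqrt(np.clip(w[::-1][:2], 0, None))
t_classical = time.time() - t0

print(f"SMACOF:    {t_smacof:.2f}s  stress={stress(smacof_emb, D):.1f}")
print(f"classical: {t_classical:.2f}s  stress={stress(classical_emb, D):.1f}")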