Implement SVD algorithm in MDS by panpiort8 · Pull Request #16067 · scikit-learn/scikit-learn · GitHub

Implement SVD algorithm in MDS #16067


Closed · wants to merge 4 commits

Conversation

panpiort8
Contributor

Reference Issues/PRs

Fixes #15272

What does this implement/fix? Explain your changes.

Adds an implementation of multidimensional scaling (MDS) that uses the singular value decomposition (SVD) method. For Euclidean distance matrices, SVD is much faster and far more accurate than the SMACOF algorithm (the current MDS implementation in the sklearn/manifold module).

Any other comments?

I'm putting a simple benchmark script and its results in a comment (not sure if that's best practice).
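For reference, here is a minimal sketch of the classical scaling computation behind the SVD method, reconstructed from the double-centering and eigendecomposition snippets quoted later in this review (the function name is illustrative, not the PR's API):

import numpy as np
from scipy import linalg

def classical_mds_sketch(D_squared, n_components=2):
    # Illustrative sketch, not the PR's exact code.
    # Double-center the squared distance matrix: B = -0.5 * J @ D^2 @ J,
    # with J = I - (1/n) * ones((n, n)) the centering matrix.
    n = D_squared.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D_squared @ J

    # B is symmetric (and PSD iff the input is squared Euclidean), so an
    # eigendecomposition recovers the embedding.
    w, V = linalg.eigh(B)

    # Keep the ``n_components`` largest eigenvalues and eigenvectors
    # (``eigh`` returns them in ascending order).
    w = w[::-1][:n_components]
    V = V[:, ::-1][:, :n_components]
    return V * np.sqrt(np.maximum(w, 0))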

@panpiort8
Contributor Author

Results (in the first two tables, dt is the total wall-clock time in seconds over all repetitions; the third table shows SMACOF/SVD ratios):

SVD:
╒═══════════════╤═══════════╤══════════════╤══════════════╤═══════════════════╕
│ NxN (reps)    │        dt │   avg stress │   max stress │   stress variance │
╞═══════════════╪═══════════╪══════════════╪══════════════╪═══════════════════╡
│ 5x5 (100)     │ 0.043299  │  8.14576e-26 │  8.72013e-25 │       1.619e-50   │
├───────────────┼───────────┼──────────────┼──────────────┼───────────────────┤
│ 50x50 (50)    │ 0.0955703 │  1.36424e-14 │  4.54747e-13 │       4.98376e-27 │
├───────────────┼───────────┼──────────────┼──────────────┼───────────────────┤
│ 250x250 (25)  │ 0.630252  │  2.00089e-13 │  1.93268e-12 │       2.77395e-25 │
├───────────────┼───────────┼──────────────┼──────────────┼───────────────────┤
│ 1000x1000 (5) │ 2.69359   │  6.6791e-12  │  7.7307e-12  │       8.87927e-25 │
╘═══════════════╧═══════════╧══════════════╧══════════════╧═══════════════════╛

SMACOF:
╒═══════════════╤═══════════╤══════════════╤══════════════╤═══════════════════╕
│ NxN (reps)    │        dt │   avg stress │   max stress │   stress variance │
╞═══════════════╪═══════════╪══════════════╪══════════════╪═══════════════════╡
│ 5x5 (100)     │   4.49364 │     0.633399 │      6.26971 │          0.692775 │
├───────────────┼───────────┼──────────────┼──────────────┼───────────────────┤
│ 50x50 (50)    │   5.04182 │     3.63208  │      5.42724 │          0.676349 │
├───────────────┼───────────┼──────────────┼──────────────┼───────────────────┤
│ 250x250 (25)  │  18.3058  │    22.3588   │     28.0936  │          7.6922   │
├───────────────┼───────────┼──────────────┼──────────────┼───────────────────┤
│ 1000x1000 (5) │ 107.26    │    94.5252   │    102.227   │         34.2675   │
╘═══════════════╧═══════════╧══════════════╧══════════════╧═══════════════════╛

SMACOF/SVD:
╒═══════════════╤══════════╤══════════════╤══════════════╤═══════════════════╕
│ NxN (reps)    │       dt │   avg stress │   max stress │   stress variance │
╞═══════════════╪══════════╪══════════════╪══════════════╪═══════════════════╡
│ 5x5 (100)     │ 103.782  │  7.77582e+24 │  7.18993e+24 │       4.27902e+49 │
├───────────────┼──────────┼──────────────┼──────────────┼───────────────────┤
│ 50x50 (50)    │  52.7551 │  2.66234e+14 │  1.19346e+13 │       1.3571e+26  │
├───────────────┼──────────┼──────────────┼──────────────┼───────────────────┤
│ 250x250 (25)  │  29.0452 │  1.11744e+14 │  1.45361e+13 │       2.77301e+25 │
├───────────────┼──────────┼──────────────┼──────────────┼───────────────────┤
│ 1000x1000 (5) │  39.8204 │  1.41524e+13 │  1.32236e+13 │       3.85927e+25 │
╘═══════════════╧══════════╧══════════════╧══════════════╧═══════════════════╛

Code:

import numpy as np
from tabulate import tabulate
from sklearn.manifold import _mds as mds
from time import time

def generate_euclidean(size):
    # Random 2-D integer points -> matrix of pairwise Euclidean distances.
    Y = np.random.randint(-100, 100, (size, 2))
    return mds.euclidean_distances(Y)

def bench(func, args):
    # Time ``func`` over all inputs and collect the stress values
    # (second element of each returned tuple).
    res = []
    start = time()
    for arg in args:
        res.append(func(arg)[1])
    dt = time() - start
    return [dt, np.average(res), np.max(res), np.var(res)]

headers = ['NxN (reps)', 'dt', 'avg stress', 'max stress', 'stress variance']
# (matrix size, repetitions) pairs to benchmark.
params = [(5, 100), (50, 50), (250, 25), (1000, 5)]
table_svd = []
table_smacof = []
table_diff = []

for i, (size, reps) in enumerate(params):
    head = '{}x{} ({})'.format(size, size, reps)
    args = [generate_euclidean(size) for j in range(reps)]
    print('benching tuple {} ({}/{})'.format((size, reps), i+1, len(params)))
    stats_svd = bench(mds.svd, args)
    stats_smacof = bench(mds.smacof, args)
    diff = [x/y for x, y in zip(stats_smacof, stats_svd)]

    table_svd.append([head, *stats_svd])
    table_smacof.append([head, *stats_smacof])
    table_diff.append([head, *diff])

print('SVD:')
print(tabulate(table_svd, headers=headers, tablefmt='fancy_grid'))
print('\nSMACOF:')
print(tabulate(table_smacof, headers=headers, tablefmt='fancy_grid'))
print('\nSMACOF/SVD:')
print(tabulate(table_diff, headers=headers, tablefmt='fancy_grid'))

@panpiort8 changed the title from "Svd in mds" to "Implement SVD algorithm in MDS" on Jan 9, 2020
Member
@adrinjalali left a comment

Thanks for the PR @panpiort8

This also needs a whats_new entry in the right file. You can check the other entries and put one in the right place.

I have not checked the math.

Member
@NicolasHug left a comment

Thanks for the PR @panpiort8, made a quick pass.

A brief "Method" section in the user guide would be useful, to describe and compare SMACOF and SVD. Otherwise it isn't clear for users why they might prefer one over the other.

Thanks!

@panpiort8 force-pushed the svd_in_mds branch 3 times, most recently from 48cfc41 to a081cf5 on March 3, 2020 10:10
@panpiort8
Contributor Author
panpiort8 commented Mar 4, 2020

Thanks for the reviews @NicolasHug and @adrinjalali. I hope it's done, but you'd better take a look at the docs ;)

Member
@jnothman left a comment

Are there advantages to using smacof if you want to assume Euclidean similarities? If not, should we just make this work with MDS(metric='euclidean') running SVD? Or is that too opaque to the user?

@panpiort8
Contributor Author
panpiort8 commented Mar 5, 2020

Are there advantages to using smacof if you want to assume Euclidean similarities? If not, should we just make this work with MDS(metric='euclidean') running SVD? Or is that too opaque to the user?

I don't see any advantages of using smacof in that case. However, if we force MDS to use svd_scaler when metric='euclidean', there will be some problems:

  1. The attribute n_iter_ won't be backward compatible, as it'll unexpectedly be set to None. But this attribute was actually undocumented before my previous commit, so maybe we can just remove it?
  2. What if the user sets solver='smacof'? Then we should use smacof, of course. But the user can also set solver='smacof' implicitly by not setting it at all, because solver='smacof' is the default. In my opinion these two cases should be treated the same, thus we shouldn't replace anything when metric='euclidean'.

To my mind, the only thing we can do is add a third possible value for the 'solver' parameter, namely 'optimal', and make it the default. But that's even more problematic.

What do you think?

@jnothman
Member
jnothman commented Mar 5, 2020 via email

@panpiort8
Contributor Author

OK, I misunderstood. I hadn't suspected I was allowed to change the API in a backward-incompatible way. If so, I think it's a great idea and should be discussed. To start, here is how I see it:

  1. we allow metric to have three values: 'euclidean', 'non-euclidean', 'none', and apply the appropriate solver
  2. we remove the dissimilarity parameter. Right now it does an odd thing: if set to 'euclidean', it builds a Euclidean distance matrix from the input. I believe this is confusing. Do we need such a transformation inside MDS? Maybe the user should do it outside? If the answers are 'yes' and 'no', we can:
  3. additionally add a transform_to_euclidean parameter, which tells whether to transform the matrix or not

Member
@NicolasHug left a comment

Another round, mostly nits, reviewed the math and the tests, this looks good!

Comment on lines 302 to 304
"Multidimensional Scaling" Chapman Hall
2nd edition, Boca Raton (2001)

Member

I'm quite ambivalent about citing a $300+ Springer book.

Can we find an alternative? Else I guess the other ref would be enough.

Contributor Author

I found the 'core' paper behind that citation, but it's still paywalled. I've changed to it anyway. Is that OK?


# Signs of columns are dependent on signs of computed eigenvectors
# which are arbitrary and meaningless
assert (np.allclose(mds_clf.embedding_, X_true_1)
Member

Let's just compare the absolute values

# ``dissimilarities`` is Euclidean iff ``K`` is positive semi-definite.
# For detail see "Multidimensional Scaling" Chapman Hall p 397
try:
w = _check_psd_eigenvalues(w)
Member

I'm sure @smarie will be happy to hear that _check_psd_eigenvalues is used! ;)

Contributor

Great!! Thanks for the connection @NicolasHug :)

@NicolasHug
Member

It would also be useful to have a functional test on various random (or not) data testing that the svd solver results in lower stress than SMACOF
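Such a test could look roughly like the sketch below. ``mds.svd_scaler`` follows the naming mentioned elsewhere in this thread; the exact name and return signature on the PR branch may differ, and it is assumed here to return ``(embedding, stress)`` like ``smacof``:

import numpy as np
from sklearn.manifold import _mds as mds

def test_svd_solver_reaches_lower_stress_than_smacof():
    rng = np.random.RandomState(42)
    X = rng.uniform(size=(20, 2))
    dissimilarities = mds.euclidean_distances(X)
    _, stress_smacof = mds.smacof(dissimilarities, random_state=42)
    # ``svd_scaler`` is the PR's solver (name taken from this thread;
    # hypothetical here).
    _, stress_svd = mds.svd_scaler(dissimilarities)
    # The SVD solver handles the Euclidean case exactly,
    # so its stress should be lower.
    assert stress_svd < stress_smacof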

@panpiort8
Contributor Author

It would also be useful to have a functional test on various random (or not) data testing that the svd solver results in lower stress than SMACOF

Do you mean test in test_mds.py?

@panpiort8
Contributor Author

Gentle ping

@jnothman
Member
jnothman commented Apr 2, 2020

I don't really get what you mean (#16067 (comment)) about transforming the input matrix to a Euclidean one

@panpiort8
Contributor Author
panpiort8 commented Apr 6, 2020

I don't really get what you mean (#16067 (comment)) about transforming the input matrix to a Euclidean one

I mean these lines in the MDS class (sklearn/manifold/_mds.py):

if self.dissimilarity == "precomputed":
    self.dissimilarity_matrix_ = X
elif self.dissimilarity == "euclidean":
    self.dissimilarity_matrix_ = euclidean_distances(X)

I agree that the word "transform" is not the best one here.

@ADEscobar
ADEscobar commented Apr 19, 2020

The "dissimilarity" matrix must already be symmetric and with all zeroes in the main diagonal in any case.

What do you mean by "euclidean"?
Does it mean that all the off-diagonal elements must also be positive (non-zero) and that it must satisfy the triangle inequality?

Take into account that if "euclidean" is used instead of "precomputed" as the parameter for the dissimilarity option, it does not guarantee that all the off-diagonal elements will be non-zero. They can be zero if the same point is repeated in the dataset, and SMACOF works just fine with zeroes outside the diagonal (those points are fitted to be at "0" distance; zeroes are actually used for the fitting in the "metric" case and ignored in the "non-metric" case).

So using "euclidean" does not mean you get a Euclidean matrix in the classical sense (without zeroes off the diagonal), which might be confusing.

Would SVD fail with zeroes outside the main diagonal?

@panpiort8
Contributor Author

What do you mean by "euclidean"?
Does it mean that all the off-diagonal elements must also be positive (non-zero) and that it must satisfy the triangle inequality?

Nope, I mean a Euclidean distance matrix, and actually I cannot find any other definition of a Euclidean matrix (at least on Google).

Would SVD fail with zeroes outside the main diagonal?

It works:

>>> points = np.array([[0, 93],
...                    [0, 93],
...                    [0, 93]])
>>> mds_clf = mds.MDS(metric=True, solver="svd", dissimilarity='euclidean')
>>> mds_clf.fit(points)
>>> print(mds_clf.embedding_)
[[0. 0.]
 [0. 0.]
 [0. 0.]]

@ADEscobar
ADEscobar commented Apr 20, 2020

Nope, I mean a Euclidean distance matrix [...] It works: [see the example above]

That's what I thought was your definition.

But, by that Wikipedia definition, the Euclidean distance matrix has the distances squared. If we use dissimilarity='euclidean' in the MDS class, we actually get a dissimilarity matrix with the distances not squared, so strictly by the Wikipedia definition it would not be a Euclidean distance matrix.

Is it relevant for the SVD if they are squared or not?
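For concreteness, sklearn's euclidean_distances helper exposes both forms through its squared flag:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

X = np.array([[0.0, 0.0], [3.0, 4.0]])
print(euclidean_distances(X))                # off-diagonal entries: 5 (plain distances)
print(euclidean_distances(X, squared=True))  # off-diagonal entries: 25 (squared distances)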

@panpiort8
Contributor Author
panpiort8 commented Apr 20, 2020

But, by that Wikipedia definition, the Euclidean distance matrix has the distances squared. If we use dissimilarity='euclidean' in the MDS class, we actually get a dissimilarity matrix with the distances not squared, so strictly by the Wikipedia definition it would not be a Euclidean distance matrix.

True, I'll make it more precise in my code.

Is it relevant for the SVD if they are squared or not?

My implementation of SVD expects the input matrix to be a squared Euclidean distance matrix. Maybe we should call it a 'matrix of Euclidean distances'. I think that's quite precise.

@ADEscobar
ADEscobar commented Apr 20, 2020

My implementation of SVD expects the input matrix to be a squared Euclidean distance matrix.

Then, when using SVD and 'euclidean' instead of 'precomputed', it should do the following, right?:
self.dissimilarity_matrix_ = euclidean_distances(X, squared=True)

Then, we basically have two options for the dissimilarity matrix:

  • 'precomputed': the user takes care of building it externally
  • 'euclidean': it will internally calculate the squared Euclidean matrix if SVD is used as the solver, and the non-squared one if SMACOF is used.

I agree, maybe it would be better to remove the dissimilarity parameter completely, since it is confusing to have it with two different behaviors and a non-compliant definition of the Euclidean matrix. The dissimilarity matrix should be built externally before calling MDS, but maybe that is not convenient for backwards compatibility.

Then we would have 3 different solvers:
SMACOF_metric
SMACOF_nonmetric
SVD

Then, maybe instead of having 'solver' and 'metric' parameters, we could just have one with three options, but again adding 'solver' defaulting to SMACOF and keeping 'metric' is definitely better for compatibility...

@panpiort8
Contributor Author
panpiort8 commented Apr 20, 2020

Then, when using SVD and 'euclidean' instead of 'precomputed', it should do the following, right?:
self.dissimilarity_matrix_ = euclidean_distances(X, squared=True)

You're right, but I would also rather remove this parameter completely.

Then we would have 3 different solvers:
SMACOF_metric
SMACOF_nonmetric
SVD
Then, maybe instead of having 'solver' and 'metric' parameters, we could just have one with three options, but again adding 'solver' defaulting to SMACOF and keeping 'metric' is definitely better for compatibility...

What about adding a euclidean_distances parameter and keeping metric? Then we can remove the solver parameter and choose the solver internally based on the euclidean_distances and metric parameters.

@ADEscobar
ADEscobar commented Apr 20, 2020

What about adding a euclidean_distances parameter and keeping metric? Then we can remove the solver parameter and choose the solver internally based on the euclidean_distances and metric parameters.

I like just adding the 'solver' parameter defaulting to 'SMACOF' with an added 'SVD' option, checking internally that metric is 'true' if 'solver' is 'SVD', and giving an error if it is 'false'.

And also specifying that if dissimilarity is 'euclidean' and 'solver' is SVD, it does the squared version.

This way it is completely backwards compatible and we do not break any existing code using the MDS library.

Have you tested that the solution is the same-ish (apart from speed/accuracy differences) with solver='SVD' and solver='SMACOF' with metric=True? Both with 'euclidean' dissimilarities.

With this tweak, where it is squared in one case and not squared in the other, they should converge to equivalent points (at least for a simple case in which SMACOF is not stuck in a local minimum, and taking into consideration that there will be an arbitrary translation/rotation/mirroring of the points).

Maybe this script can be useful to visually see if the points are correct:
https://scikit-learn.org/stable/auto_examples/manifold/plot_mds.html#sphx-glr-auto-examples-manifold-plot-mds-py
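One way to run that check is sketched below; note that solver='svd' only exists on this PR's branch, and scipy's Procrustes analysis factors out the arbitrary translation/rotation/mirroring:

import numpy as np
from scipy.spatial import procrustes
from sklearn.manifold import MDS

rng = np.random.RandomState(0)
X = rng.uniform(size=(30, 2))

emb_smacof = MDS(dissimilarity='euclidean', metric=True, random_state=0).fit_transform(X)
emb_svd = MDS(dissimilarity='euclidean', solver='svd').fit_transform(X)  # PR branch only

# disparity is the sum of squared differences after optimal alignment;
# it should be small if both solvers recover the same configuration.
_, _, disparity = procrustes(emb_smacof, emb_svd)
print(disparity)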

@panpiort8
Contributor Author
panpiort8 commented Apr 20, 2020

I like just adding the 'solver' parameter defaulting to 'SMACOF' with an added 'SVD' option [...] This way it is completely backwards compatible and we do not break any existing code using the MDS library.

I believe leaving solver in isn't the best idea, because the two solvers require different matrices (squared and non-squared) and the user expects to get proper results just by switching the solver, without manipulating any other parameters. What if we just replace solver with a flag euclidean_distances and set the default to False? It would indicate that the input must be of a particular form and would provide the user with the best available method without making them read about SVD and SMACOF. It would also be backward compatible.

And if we want the dissimilarity parameter to stay, I'd suggest adding a third option, 'euclidean_squared', and throwing an error when 'euclidean' and SVD are both chosen (SVD being chosen implicitly by setting metric=True, euclidean_distances=True).
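A hypothetical sketch of the dispatch implied by this proposal (the parameter names, defaults, and error text are illustrative, not a final API):

def _select_solver(metric, euclidean_distances):
    # Hypothetical internal dispatch: use the exact SVD solver only when the
    # input is declared to be (squared) Euclidean distances and metric MDS is
    # requested; fall back to SMACOF otherwise.
    if euclidean_distances:
        if not metric:
            raise ValueError("euclidean_distances=True requires metric=True")
        return 'svd'
    return 'smacof'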

"Make sure to pass an euclidean matrix, or use "
"dissimilarity='euclidean'.")

# Get ``n_compontent`` greatest eigenvalues and corresponding eigenvectors.
Contributor

small typo here:

Suggested change
# Get ``n_compontent`` greatest eigenvalues and corresponding eigenvectors.
# Get ``n_components`` greatest eigenvalues and corresponding eigenvectors.

Parameters
----------
dissimilarities : ndarray, shape (n_samples, n_samples)
Pairwise dissimilarities between the points. Must be euclidean.
Contributor

Here and in several other parts

Must be euclidean

I may be mistaken but the term "Euclidean" when applied to metrics/distance does not denote a property but instead refers to the euclidean distance metric (the L2 norm of vector difference).

So what you probably mean here and in other comments is that the dissimilarity must be a metric, that is, it should ensure the 4 conditions below:

  • positivity
  • symmetry
  • distinguishability
  • triangular inequality

Ref: M.M. Deza and E. Deza. Encyclopedia of Distances. Vol. 2006. 2009, p. 590. arXiv:0505065

Or simply: https://en.wikipedia.org/wiki/Metric_(mathematics)#Definition
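A small sketch of checking those four conditions on a dissimilarity matrix (an illustrative helper, not part of the PR):

import numpy as np

def is_metric(D, tol=1e-12):
    # Rough check of the four metric axioms for a dissimilarity matrix D.
    D = np.asarray(D)
    positivity = np.all(D >= -tol)
    symmetry = np.allclose(D, D.T)
    distinguishability = np.allclose(np.diag(D), 0.0)
    # Triangle inequality: D[i, k] <= D[i, j] + D[j, k] for all i, j, k.
    triangle = np.all(D[:, None, :] <= D[:, :, None] + D[None, :, :] + tol)
    return bool(positivity and symmetry and distinguishability and triangle)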


w, V = linalg.eigh(B, check_finite=False)

# ``dissimilarities`` is Euclidean iff ``B`` is positive semi-definite.
Contributor

See above: probably replace "Euclidean" with "a distance metric".

try:
w = _check_psd_eigenvalues(w)
except ValueError:
raise ValueError("Dissimilarity matrix must be euclidean. "
Contributor

See above: I would directly raise the original ValueError stating that the matrix is not PSD, or say that "dissimilarities must be computed using a true distance metric; make sure to pass a positive semi-definite matrix, or use dissimilarity='euclidean' to use the Euclidean distance".


# Get ``n_compontent`` greatest eigenvalues and corresponding eigenvectors.
# Eigenvalues should be in descending order by convention.
w = w[::-1][:n_components]
Contributor

@NicolasHug remind me to propose a PR once this one is merged, to accelerate this with randomized methods as we did for KernelPCA.

# Double centered matrix
B = -0.5 * np.dot(J, np.dot(dissimilarities ** 2, J))

w, V = linalg.eigh(B, check_finite=False)
Contributor

I am not familiar with the reference you used, but this seems strange to me: why is the method called SVD if you use eigs to solve it?
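For context, for a symmetric positive semi-definite matrix such as the double-centered B, the SVD and the eigendecomposition coincide (up to signs of the vectors); a quick editorial check, not part of the PR:

import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
A = rng.standard_normal((5, 5))
B = A @ A.T  # symmetric PSD by construction

w = linalg.eigh(B, eigvals_only=True)  # eigenvalues, ascending order
s = linalg.svd(B, compute_uv=False)    # singular values, descending order

# For a symmetric PSD matrix, the singular values equal the eigenvalues.
assert np.allclose(np.sort(s), np.sort(w))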

Contributor Author
@panpiort8 left a comment

I've made a bit of a mess with this PR, and I won't be able to tidy it up in the next two months (at least). If anyone is interested in continuing this PR, feel free to do so. I'd start by figuring out whether the SVD solver really requires the dissimilarity to be a (squared) Euclidean distance matrix. Wikipedia claims that "Classical MDS assumes Euclidean distances." What does "assumes" mean? That the algorithm won't work properly if the distances come from another metric?

np.ones(dissimilarities.shape))

# Double centered matrix
B = -0.5 * np.dot(J, np.dot(dissimilarities ** 2, J))
Contributor Author
@panpiort8 commented May 23, 2020

Suggested change
B = -0.5 * np.dot(J, np.dot(dissimilarities ** 2, J))
B = -0.5 * np.dot(J, np.dot(dissimilarities, J))

We then don't have to differentiate between SVD and SMACOF as suggested here

Base automatically changed from master to main January 22, 2021 10:51
@Micky774
Contributor

Continued in #22330. Would love for follow-up discussion there from anyone that's interested!

@cmarmo added the Superseded (PR has been replaced by a newer PR) label on Feb 3, 2022
@cmarmo
Contributor
cmarmo commented May 10, 2022

Closing this one as superseded by #22330. Thanks @panpiort8 for your work!

@cmarmo closed this on May 10, 2022
Labels
module:manifold · Superseded (PR has been replaced by a newer PR)

Successfully merging this pull request may close these issues:

sklearn MDS vs skbio PCoA

9 participants