Implement SVD algorithm in MDS #16067
@@ -6,6 +6,7 @@
 # License: BSD

 import numpy as np
+from scipy import linalg
 from joblib import Parallel, delayed, effective_n_jobs

 import warnings

@@ -14,6 +15,7 @@
 from ..metrics import euclidean_distances
 from ..utils import check_random_state, check_array, check_symmetric
 from ..isotonic import IsotonicRegression
+from sklearn.utils.validation import _check_psd_eigenvalues


 def _smacof_single(dissimilarities, metric=True, n_components=2, init=None,
@@ -119,7 +121,7 @@ def _smacof_single(dissimilarities, metric=True, n_components=2, init=None,
         if verbose >= 2:
             print('it: %d, stress %s' % (it, stress))
         if old_stress is not None:
-            if(old_stress - stress / dis) < eps:
+            if (old_stress - stress / dis) < eps:
                 if verbose:
                     print('breaking at iteration %d with stress %s' % (it,
                                                                        stress))
@@ -272,6 +274,72 @@ def smacof(dissimilarities, metric=True, n_components=2, init=None, n_init=8,
     return best_pos, best_stress

def svd_scaler(dissimilarities, n_components=2):
    """
    Computes multidimensional scaling using the SVD algorithm.

    Parameters
    ----------
    dissimilarities : ndarray, shape (n_samples, n_samples)
        Pairwise dissimilarities between the points. Must be euclidean.

    n_components : int, optional, default: 2
        Number of dimensions in which to immerse the dissimilarities.

    Returns
    -------
    embedding : ndarray, shape (n_samples, n_components)
        Coordinates of the points in a ``n_components``-space.

    stress : float
        The final value of the stress (sum of squared distance of the
        disparities and the distances for all constrained points).

    References
    ----------
    "An Introduction to MDS" Florian Wickelmaier
    Sound Quality Research Unit, Aalborg University, Denmark (2003)

    "Metric and Euclidean properties of dissimilarity coefficients"
    J. C. Gower and P. Legendre, Journal of Classification 3, 5-48 (1986)
    """

    dissimilarities = check_symmetric(dissimilarities, raise_exception=True)

    n_samples = dissimilarities.shape[0]

    # Centering matrix
    J = np.eye(*dissimilarities.shape) - (1. / n_samples) * (
        np.ones(dissimilarities.shape))

    # Double centered matrix
    B = -0.5 * np.dot(J, np.dot(dissimilarities ** 2, J))
[Review comment] Suggested change: we then don't have to differentiate between SVD and SMACOF as suggested here.
    w, V = linalg.eigh(B, check_finite=False)
[Review comment] I am not familiar with the reference you used, but this seems strange to me: why is the method called SVD if you use eigs to solve it?
    # ``dissimilarities`` is Euclidean iff ``B`` is positive semi-definite.
[Review comment] See above: probably replace Euclidean by a distance metric.
    # See "Metric and Euclidean properties of dissimilarity coefficients"
    # for details
    try:
        w = _check_psd_eigenvalues(w)
[Review comment] I'm sure @smarie will be happy to hear that.

[Review comment] Great!! Thanks for the connection @NicolasHug :)
    except ValueError:
        raise ValueError("Dissimilarity matrix must be euclidean. "
                         "Make sure to pass an euclidean matrix, or use "
                         "dissimilarity='euclidean'.")

[Review comment] See above: I would directly raise the original
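As a standalone illustration of this PSD check (plain NumPy, independent of the scikit-learn helpers used in the PR), double-centering the non-Euclidean matrix from the test suite below produces a negative eigenvalue, which is exactly the condition `_check_psd_eigenvalues` rejects:

```python
import numpy as np

# Non-Euclidean dissimilarity matrix from the tests below:
# d(0, 1) = 12 > d(0, 2) + d(2, 1) = 3 + 2, so the triangle
# inequality fails and the matrix cannot be Euclidean.
D = np.array([[0., 12., 3., 4.],
              [12., 0., 2., 2.],
              [3., 2., 0., 1.],
              [4., 2., 1., 0.]])

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
B = -0.5 * J @ (D ** 2) @ J           # double-centered matrix

# B is positive semi-definite iff D is Euclidean (Gower & Legendre, 1986);
# here B even has a negative diagonal entry, so it cannot be PSD.
w = np.linalg.eigvalsh(B)
assert w.min() < 0
```

This is why the `try/except` around `_check_psd_eigenvalues` fires for dissimilarities that are not distances between real points.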
    # Get ``n_compontent`` greatest eigenvalues and corresponding eigenvectors.

[Review comment] Small typo here (``n_compontent`` should be ``n_components``); suggested change.
    # Eigenvalues should be in descending order by convention.
    w = w[::-1][:n_components]

[Review comment] @NicolasHug remind me to propose a PR once this one is merged, to accelerate this with randomized methods as we did for KernelPCA.

    V = V[:, ::-1][:, :n_components]
    embedding = np.sqrt(w) * V

    dist = euclidean_distances(embedding)
    stress = ((dissimilarities.ravel() - dist.ravel()) ** 2).sum() * 0.5

    return embedding, stress

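The steps in `svd_scaler` above (double-center the squared dissimilarities, eigendecompose, keep the top components) are classical Torgerson scaling and can be sketched in plain NumPy. This is a minimal standalone sketch, not the PR's code; for a genuinely Euclidean input whose intrinsic dimension fits in `n_components`, the embedding reproduces the distances exactly:

```python
import numpy as np
from scipy.spatial.distance import cdist

def classical_mds(D, n_components=2):
    """Classical (Torgerson) MDS on a Euclidean dissimilarity matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered matrix
    # eigh returns eigenvalues in ascending order; keep the largest ones
    w, V = np.linalg.eigh(B)
    w = w[::-1][:n_components]
    V = V[:, ::-1][:, :n_components]
    w = np.clip(w, 0.0, None)             # guard against tiny negative round-off
    return np.sqrt(w) * V

# Points in the plane -> their distance matrix is Euclidean by construction
X = np.array([[0., 0.], [1., 0.], [0., 2.], [3., 1.]])
D = cdist(X, X)
emb = classical_mds(D, n_components=2)

# Pairwise distances are preserved; the embedding itself is only
# determined up to rotation/reflection.
assert np.allclose(cdist(emb, emb), D)
```

The embedding being unique only up to an orthogonal transform is why the tests below compare against sign-flipped reference solutions.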
class MDS(BaseEstimator):
    """Multidimensional scaling

@@ -285,21 +353,29 @@ class MDS(BaseEstimator):
    metric : boolean, optional, default: True
        If ``True``, perform metric MDS; otherwise, perform nonmetric MDS.

        If ``solver=='svd'``, metric must be set to True.

    n_init : int, optional, default: 4
        Number of times the SMACOF algorithm will be run with different
        initializations. The final results will be the best output of the runs,
        determined by the run with the smallest final stress.

        Ignored if ``solver=='svd'``.

    max_iter : int, optional, default: 300
        Maximum number of iterations of the SMACOF algorithm for a single run.

        Ignored if ``solver=='svd'``.

    verbose : int, optional, default: 0
        Level of verbosity.

    eps : float, optional, default: 1e-3
        Relative tolerance with respect to stress at which to declare
        convergence.

        Ignored if ``solver=='svd'``.

    n_jobs : int or None, optional (default=None)
        The number of jobs to use for the computation. If multiple
        initializations are used (``n_init``), each run of the algorithm is

@@ -309,6 +385,8 @@ class MDS(BaseEstimator):
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

        Ignored if ``solver=='svd'``.

    random_state : int, RandomState instance, default=None
        Determines the random number generator used to initialize the centers.
        Pass an int for reproducible results across multiple function calls.

@@ -324,6 +402,11 @@ class MDS(BaseEstimator):
        Pre-computed dissimilarities are passed directly to ``fit`` and
        ``fit_transform``.

    solver : {'smacof', 'svd'}, default='smacof'
        The solver used for solving the MDS problem.

        .. versionadded:: 0.23

    Attributes
    ----------
    embedding_ : array-like, shape (n_samples, n_components)

@@ -333,6 +416,12 @@ class MDS(BaseEstimator):
        The final value of the stress (sum of squared distance of the
        disparities and the distances for all constrained points).

    n_iter_ : int
        The number of iterations of the SMACOF algorithm corresponding
        to the best stress.

        It is set to ``None`` if ``solver=='svd'``.

    Examples
    --------
    >>> from sklearn.datasets import load_digits
@@ -357,12 +446,15 @@ class MDS(BaseEstimator):
    hypothesis" Kruskal, J. Psychometrika, 29, (1964)

    """

    def __init__(self, n_components=2, metric=True, n_init=4,
                 max_iter=300, verbose=0, eps=1e-3, n_jobs=None,
-                 random_state=None, dissimilarity="euclidean"):
+                 random_state=None, dissimilarity="euclidean",
+                 solver="smacof"):
        self.n_components = n_components
        self.dissimilarity = dissimilarity
        self.metric = metric
+        self.solver = solver
        self.n_init = n_init
        self.max_iter = max_iter
        self.eps = eps
@@ -390,6 +482,8 @@ def fit(self, X, y=None, init=None):
            Starting configuration of the embedding to initialize the SMACOF
            algorithm. By default, the algorithm is initialized with a randomly
            chosen array.

            Ignored if ``solver=='svd'``.
        """
        self.fit_transform(X, init=init)
        return self
@@ -410,6 +504,8 @@ def fit_transform(self, X, y=None, init=None):
            Starting configuration of the embedding to initialize the SMACOF
            algorithm. By default, the algorithm is initialized with a randomly
            chosen array.

            Ignored if ``solver=='svd'``.
        """
        X = self._validate_data(X)
        if X.shape[0] == X.shape[1] and self.dissimilarity != "precomputed":
@@ -423,14 +519,26 @@ def fit_transform(self, X, y=None, init=None):
        elif self.dissimilarity == "euclidean":
            self.dissimilarity_matrix_ = euclidean_distances(X)
        else:
-            raise ValueError("Proximity must be 'precomputed' or 'euclidean'."
-                             " Got %s instead" % str(self.dissimilarity))
-
-        self.embedding_, self.stress_, self.n_iter_ = smacof(
-            self.dissimilarity_matrix_, metric=self.metric,
-            n_components=self.n_components, init=init, n_init=self.n_init,
-            n_jobs=self.n_jobs, max_iter=self.max_iter, verbose=self.verbose,
-            eps=self.eps, random_state=self.random_state,
-            return_n_iter=True)
+            raise ValueError(
+                "Dissimilarity matrix must be 'precomputed' or 'euclidean'."
+                " Got %s instead" % str(self.dissimilarity))
+
+        if self.solver == "smacof":
+            self.embedding_, self.stress_, self.n_iter_ = smacof(
+                self.dissimilarity_matrix_, metric=self.metric,
+                n_components=self.n_components, init=init,
+                n_init=self.n_init, n_jobs=self.n_jobs,
+                max_iter=self.max_iter, verbose=self.verbose,
+                eps=self.eps, random_state=self.random_state,
+                return_n_iter=True)
+        elif self.solver == "svd":
+            if not self.metric:
+                raise ValueError("Using SVD requires metric=True")
+            self.embedding_, self.stress_ = svd_scaler(
+                self.dissimilarity_matrix_, n_components=self.n_components)
+            self.n_iter_ = None
+        else:
+            raise ValueError("Solver must be 'smacof' or 'svd'."
+                             " Got %s instead" % str(self.solver))

        return self.embedding_
@@ -55,10 +55,80 @@ def test_smacof_error():
        mds.smacof(sim, init=Z, n_init=1)

def test_svd_error():
    # Non symmetric (dis)similarity matrix:
    sim = np.array([[0, 5, 9, 4],
                    [5, 0, 2, 2],
                    [3, 2, 0, 1],
                    [4, 2, 1, 0]])

    with pytest.raises(ValueError, match="Array must be symmetric"):
        mds.svd_scaler(sim)

    # Non square (dis)similarity matrix:
    sim = np.array([[0, 5, 9, 4],
                    [5, 0, 2, 2],
                    [4, 2, 1, 0]])

    with pytest.raises(ValueError,
                       match="array must be 2-dimensional and square"):
        mds.svd_scaler(sim)

    # Non Euclidean (dis)similarity matrix:
    sim = np.array([[0, 12, 3, 4],
                    [12, 0, 2, 2],
                    [3, 2, 0, 1],
                    [4, 2, 1, 0]])

    with pytest.raises(ValueError,
                       match="Dissimilarity matrix must be euclidean"):
        mds.svd_scaler(sim)


def test_MDS_error():
    # Bad solver name
    sim = np.ones((2, 2))
    mdc_clf = mds.MDS(solver='bad name')
    with pytest.raises(ValueError, match="Solver must be 'smacof' or 'svd'"):
        mdc_clf.fit(sim)

    # SVD with metric=False
    sim = np.ones((2, 2))
    mdc_clf = mds.MDS(metric=False, solver='svd')
    with pytest.raises(ValueError, match="Using SVD requires metric=True"):
        mdc_clf.fit(sim)


def test_MDS():
    sim = np.array([[0, 5, 3, 4],
                    [5, 0, 2, 2],
                    [3, 2, 0, 1],
                    [4, 2, 1, 0]])
    mds_clf = mds.MDS(metric=False, n_jobs=3, dissimilarity="precomputed")
    mds_clf.fit(sim)


def test_MDS_svd():
    # Test svd using example data from "An Introduction to MDS"
    # Florian Wickelmaier, p 11
    sim = np.array([[0, 93, 82, 133],
                    [93, 0, 52, 60],
                    [82, 52, 0, 111],
                    [133, 60, 111, 0]])

    mds_clf = mds.MDS(metric=True, solver="svd", dissimilarity='precomputed')
    mds_clf.fit(sim)
    X_true_1 = np.array([[-62.831, -32.97448],
                         [18.403, 12.02697],
                         [-24.960, 39.71091],
                         [69.388, -18.76340]])
    X_true_2 = np.copy(X_true_1)
    X_true_2[:, 0] *= -1

    # Signs of columns are dependent on signs of computed eigenvectors
    # which are arbitrary and meaningless
    assert (np.allclose(mds_clf.embedding_, X_true_1)
            or np.allclose(mds_clf.embedding_, -X_true_1)
            or np.allclose(mds_clf.embedding_, X_true_2)
            or np.allclose(mds_clf.embedding_, -X_true_2))

[Review comment] Let's just compare the absolute values.
[Review comment] Here and in several other parts: I may be mistaken, but the term "Euclidean", when applied to metrics/distances, does not denote a property but instead refers to the Euclidean distance metric (the L2 norm of a vector difference). So what you probably mean here and in other comments is that the dissimilarity must be a metric, that is, it should satisfy the four conditions below:

1. Non-negativity: d(x, y) >= 0
2. Identity of indiscernibles: d(x, y) = 0 if and only if x = y
3. Symmetry: d(x, y) = d(y, x)
4. Triangle inequality: d(x, z) <= d(x, y) + d(y, z)

Ref: M.M. Deza and E. Deza. Encyclopedia of Distances. Vol. 2006. 2009, p. 590. arXiv:0505065
Or simply: https://en.wikipedia.org/wiki/Metric_(mathematics)#Definition
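The reviewer's four conditions can be made concrete with a small self-contained checker (a hypothetical helper sketched here for illustration, not part of this PR). The symmetric test matrix from `test_MDS` satisfies all four axioms, while the triangle-violating matrix from `test_svd_error` does not:

```python
import numpy as np

def is_metric(D, tol=1e-12):
    """Check the four metric axioms for a dissimilarity matrix D."""
    nonneg = bool(np.all(D >= -tol))               # non-negativity
    identity = bool(np.allclose(np.diag(D), 0.0))  # d(x, x) = 0
    symmetric = bool(np.allclose(D, D.T))          # symmetry
    # Triangle inequality: D[i, k] <= D[i, j] + D[j, k] for all i, j, k
    triangle = bool(np.all(
        D[:, None, :] <= D[:, :, None] + D[None, :, :] + tol))
    return nonneg and identity and symmetric and triangle

good = np.array([[0, 5, 3, 4],
                 [5, 0, 2, 2],
                 [3, 2, 0, 1],
                 [4, 2, 1, 0]], dtype=float)
bad = np.array([[0, 12, 3, 4],
                [12, 0, 2, 2],
                [3, 2, 0, 1],
                [4, 2, 1, 0]], dtype=float)

assert is_metric(good)     # satisfies all four axioms
assert not is_metric(bad)  # d(0, 1) = 12 > d(0, 2) + d(2, 1) = 5
```

Note that being a metric is necessary but not sufficient for Euclideanness, which is why the PR additionally checks that the double-centered matrix is positive semi-definite.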