Implement SVD-based method in MDS class · scikit-learn/scikit-learn@a081cf5 · GitHub
Commit a081cf5

Pan Jan committed

Implement SVD-based method in MDS class

1 parent 72b3041 commit a081cf5

File tree

  doc/modules/manifold.rst
  doc/whats_new/v0.23.rst
  sklearn/manifold/_mds.py
  sklearn/manifold/tests/test_mds.py

4 files changed: +240 -11 lines changed

doc/modules/manifold.rst

Lines changed: 21 additions & 0 deletions
@@ -439,6 +439,9 @@ should then correspond exactly to the distance between point :math:`i` and
 Most commonly, disparities are set to :math:`\hat{d}_{ij} = b S_{ij}`.

+If the metric of :math:`S` is Euclidean, the user can choose a faster and more accurate
+method of computing the results. See :ref:`multidimensional_scaling_method` for details.
+
 Nonmetric MDS
 -------------

@@ -457,6 +460,19 @@ order to avoid that, the disparities :math:`\hat{d}_{ij}` are normalized.
    :align: center
    :scale: 60

+.. _multidimensional_scaling_method:
+
+Method
+------
+
+Metric :class:`MDS` offers two different algorithms (methods) to compute the
+results: SMACOF and an SVD-based one. The SMACOF method (Scaling by MAjorizing
+a COmplicated Function) minimizes the objective function (stress) in an
+iterative manner. The SVD-based method performs a series of transformations
+(including a Singular Value Decomposition) to give the exact result. It is
+thus much faster and more accurate, but also less general: it requires the
+metric of :math:`S` to be Euclidean.
+

 .. topic:: References:

@@ -472,6 +488,11 @@ order to avoid that, the disparities :math:`\hat{d}_{ij}` are normalized.
     <https://link.springer.com/article/10.1007%2FBF02289565>`_
     Kruskal, J. Psychometrika, 29, (1964)

+  * `"An Introduction to MDS"
+    <https://www.researchgate.net/publication/228775338_An_introduction_to_MDS>`_
+    Florian Wickelmaier, Sound Quality Research Unit, Aalborg University, Denmark (2003)
+
 .. _t_sne:

 t-distributed Stochastic Neighbor Embedding (t-SNE)
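
Before moving on to the next file: the ``method`` option documented in the manifold guide above can be exercised with a short script. This is only a sketch, not part of the commit; it assumes a scikit-learn build that contains this change, and the random data and checks are purely illustrative of the claim that the SVD-based method reproduces Euclidean input exactly while SMACOF approximates it iteratively.

import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import euclidean_distances

rng = np.random.RandomState(42)
X = rng.rand(100, 2)                  # data whose metric is Euclidean

mds_svd = MDS(n_components=2, method="svd")
emb_svd = mds_svd.fit_transform(X)

mds_smacof = MDS(n_components=2, method="smacof", random_state=0)
emb_smacof = mds_smacof.fit_transform(X)

# The SVD-based solution reproduces the original distances (up to rounding);
# SMACOF approximates them iteratively.
print(np.allclose(euclidean_distances(emb_svd), euclidean_distances(X)))
print(mds_svd.stress_, mds_smacof.stress_)

The printed stresses give a quick feel for the relative accuracy of the two methods on this data.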

doc/whats_new/v0.23.rst

Lines changed: 10 additions & 0 deletions
@@ -259,6 +259,16 @@ Changelog
   of strictly inferior for maximum of `absgrad` and `tol` in `utils.optimize._newton_cg`.
   :pr:`16266` by :user:`Rushabh Vasani <rushabh-v>`.

+:mod:`sklearn.manifold`
+.......................
+
+- |Feature| Added support for a multidimensional scaling method based on
+  Singular Value Decomposition (SVD) in :class:`manifold.MDS`. The user can
+  choose between the SVD- and SMACOF-based methods via the parameter
+  `method`. The SVD-based method is faster and more accurate, but works
+  only for Euclidean dissimilarity matrices.
+  :pr:`16067` by :user:`Piotr Gaiński <panpiort8>`.
+
 :mod:`sklearn.metrics`
 ......................

sklearn/manifold/_mds.py

Lines changed: 116 additions & 11 deletions
@@ -6,6 +6,7 @@
 # License: BSD

 import numpy as np
+from scipy import linalg
 from joblib import Parallel, delayed, effective_n_jobs

 import warnings
@@ -14,6 +15,7 @@
 from ..metrics import euclidean_distances
 from ..utils import check_random_state, check_array, check_symmetric
 from ..isotonic import IsotonicRegression
+from ..utils.validation import _check_psd_eigenvalues


 def _smacof_single(dissimilarities, metric=True, n_components=2, init=None,
@@ -119,7 +121,7 @@ def _smacof_single(dissimilarities, metric=True, n_components=2, init=None,
         if verbose >= 2:
             print('it: %d, stress %s' % (it, stress))
         if old_stress is not None:
-            if(old_stress - stress / dis) < eps:
+            if (old_stress - stress / dis) < eps:
                 if verbose:
                     print('breaking at iteration %d with stress %s' % (it,
                                                                        stress))
@@ -272,6 +274,71 @@ def smacof(dissimilarities, metric=True, n_components=2, init=None, n_init=8,
     return best_pos, best_stress


+def svd_scaler(dissimilarities, n_components=2):
+    """
+    Computes multidimensional scaling using the SVD algorithm.
+
+    Parameters
+    ----------
+    dissimilarities : ndarray, shape (n_samples, n_samples)
+        Pairwise dissimilarities between the points. Must be Euclidean.
+    n_components : int, optional, default: 2
+        Number of dimensions in which to immerse the dissimilarities.
+
+    Returns
+    -------
+    X : ndarray, shape (n_samples, n_components)
+        Coordinates of the points in a ``n_components``-space.
+
+    stress : float
+        The final value of the stress (sum of squared distance of the
+        disparities and the distances for all constrained points).
+
+    References
+    ----------
+    "An Introduction to MDS" Florian Wickelmaier
+    Sound Quality Research Unit, Aalborg University, Denmark (2003)
+
+    "Multidimensional Scaling" Chapman & Hall
+    2nd edition, Boca Raton (2001)
+
+    """
+
+    dissimilarities = check_symmetric(dissimilarities, raise_exception=True)
+
+    n_samples = dissimilarities.shape[0]
+
+    # Centering matrix
+    H = np.eye(*dissimilarities.shape) - (1. / n_samples) * \
+        np.ones(dissimilarities.shape)
+
+    # Double centered matrix
+    K = -0.5 * np.dot(H, np.dot(dissimilarities ** 2, H))
+
+    w, V = linalg.eigh(K, check_finite=False)
+
+    # ``dissimilarities`` is Euclidean iff ``K`` is positive semi-definite.
+    # For details see "Multidimensional Scaling" Chapman & Hall p. 397
+    try:
+        w = _check_psd_eigenvalues(w)
+    except ValueError:
+        raise ValueError("Dissimilarity matrix must be euclidean. "
+                         "Make sure to pass an euclidean matrix, or use "
+                         "dissimilarity='euclidean'.")
+
+    # Get the ``n_components`` greatest eigenvalues and corresponding
+    # eigenvectors. Eigenvalues should be in descending order by convention.
+    w = w[:-n_components - 1:-1]
+    V = V[:, :-n_components - 1:-1]
+
+    X = np.sqrt(w) * V
+
+    dist = euclidean_distances(X)
+    stress = ((dissimilarities.ravel() - dist.ravel()) ** 2).sum() * 0.5
+
+    return X, stress
+
+
 class MDS(BaseEstimator):
     """Multidimensional scaling
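
A note on the Euclidean check in ``svd_scaler`` above: ``dissimilarities`` is Euclidean iff the double-centered matrix ``K`` is positive semi-definite, because for Euclidean distances ``K`` equals the Gram matrix of the centered coordinates. A small self-contained NumPy sketch of that identity (illustrative only, not part of the diff; the data is made up):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(6, 3)                                   # arbitrary points
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

n = D.shape[0]
H = np.eye(n) - np.ones((n, n)) / n                  # centering matrix, as in svd_scaler
K = -0.5 * H @ (D ** 2) @ H                          # double-centered matrix

Xc = X - X.mean(axis=0)
print(np.allclose(K, Xc @ Xc.T))                     # True: K is the centered Gram matrix
print(np.linalg.eigvalsh(K).min() >= -1e-10)         # True: hence positive semi-definite

For a non-Euclidean dissimilarity matrix, some eigenvalues of ``K`` become significantly negative, which is exactly what ``_check_psd_eigenvalues`` rejects above.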
@@ -285,21 +352,29 @@ class MDS(BaseEstimator):
     metric : boolean, optional, default: True
         If ``True``, perform metric MDS; otherwise, perform nonmetric MDS.

+        If ``method=='svd'``, ``metric`` must be set to ``True``.
+
     n_init : int, optional, default: 4
         Number of times the SMACOF algorithm will be run with different
         initializations. The final results will be the best output of the runs,
         determined by the run with the smallest final stress.

+        Ignored if ``method=='svd'``.
+
     max_iter : int, optional, default: 300
         Maximum number of iterations of the SMACOF algorithm for a single run.

+        Ignored if ``method=='svd'``.
+
     verbose : int, optional, default: 0
         Level of verbosity.

     eps : float, optional, default: 1e-3
         Relative tolerance with respect to stress at which to declare
         convergence.

+        Ignored if ``method=='svd'``.
+
     n_jobs : int or None, optional (default=None)
         The number of jobs to use for the computation. If multiple
         initializations are used (``n_init``), each run of the algorithm is
@@ -309,6 +384,8 @@ class MDS(BaseEstimator):
         ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
         for more details.

+        Ignored if ``method=='svd'``.
+
     random_state : int, RandomState instance, default=None
         Determines the random number generator used to initialize the centers.
         Pass an int for reproducible results across multiple function calls.
@@ -324,6 +401,11 @@ class MDS(BaseEstimator):
         Pre-computed dissimilarities are passed directly to ``fit`` and
         ``fit_transform``.

+    method : {'smacof', 'svd'}, default='smacof'
+        The method used for solving the MDS problem.
+
+        .. versionadded:: 0.23
+
     Attributes
     ----------
     embedding_ : array-like, shape (n_samples, n_components)
@@ -333,6 +415,12 @@ class MDS(BaseEstimator):
         The final value of the stress (sum of squared distance of the
         disparities and the distances for all constrained points).

+    n_iter_ : int
+        The number of iterations of the SMACOF algorithm corresponding
+        to the best stress.
+
+        It is set to ``None`` if ``method=='svd'``.
+
     Examples
     --------
     >>> from sklearn.datasets import load_digits
@@ -357,12 +445,15 @@ class MDS(BaseEstimator):
     hypothesis" Kruskal, J. Psychometrika, 29, (1964)

     """
+
     def __init__(self, n_components=2, metric=True, n_init=4,
                  max_iter=300, verbose=0, eps=1e-3, n_jobs=None,
-                 random_state=None, dissimilarity="euclidean"):
+                 random_state=None, dissimilarity="euclidean",
+                 method="smacof"):
         self.n_components = n_components
         self.dissimilarity = dissimilarity
         self.metric = metric
+        self.method = method
         self.n_init = n_init
         self.max_iter = max_iter
         self.eps = eps
@@ -387,6 +478,7 @@ def fit(self, X, y=None, init=None):
         y : Ignored

         init : ndarray, shape (n_samples,), optional, default: None
+            Ignored if ``method=='svd'``.
             Starting configuration of the embedding to initialize the SMACOF
             algorithm. By default, the algorithm is initialized with a randomly
             chosen array.
@@ -407,6 +499,7 @@ def fit_transform(self, X, y=None, init=None):
         y : Ignored

         init : ndarray, shape (n_samples,), optional, default: None
+            Ignored if ``method=='svd'``.
             Starting configuration of the embedding to initialize the SMACOF
             algorithm. By default, the algorithm is initialized with a randomly
             chosen array.
@@ -423,14 +516,26 @@ def fit_transform(self, X, y=None, init=None):
         elif self.dissimilarity == "euclidean":
             self.dissimilarity_matrix_ = euclidean_distances(X)
         else:
-            raise ValueError("Proximity must be 'precomputed' or 'euclidean'."
-                             " Got %s instead" % str(self.dissimilarity))
-
-        self.embedding_, self.stress_, self.n_iter_ = smacof(
-            self.dissimilarity_matrix_, metric=self.metric,
-            n_components=self.n_components, init=init, n_init=self.n_init,
-            n_jobs=self.n_jobs, max_iter=self.max_iter, verbose=self.verbose,
-            eps=self.eps, random_state=self.random_state,
-            return_n_iter=True)
+            raise ValueError(
+                "Dissimilarity matrix must be 'precomputed' or 'euclidean'."
+                " Got %s instead" % str(self.dissimilarity))
+
+        if self.method == "smacof":
+            self.embedding_, self.stress_, self.n_iter_ = smacof(
+                self.dissimilarity_matrix_, metric=self.metric,
+                n_components=self.n_components, init=init,
+                n_init=self.n_init, n_jobs=self.n_jobs,
+                max_iter=self.max_iter, verbose=self.verbose,
+                eps=self.eps, random_state=self.random_state,
+                return_n_iter=True)
+        elif self.method == "svd":
+            if not self.metric:
+                raise ValueError("Using SVD requires metric=True")
+            self.embedding_, self.stress_ = svd_scaler(
+                self.dissimilarity_matrix_, n_components=self.n_components)
+            self.n_iter_ = None
+        else:
+            raise ValueError("Method must be 'smacof' or 'svd'."
+                             " Got %s instead" % str(self.method))

         return self.embedding_
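
To see the new dispatch in ``fit_transform`` end to end, the snippet below is a hypothetical quick check rather than anything shipped with the commit; it assumes a build that includes this change.

import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import euclidean_distances

# Precomputed Euclidean dissimilarities from random 3-D points.
D = euclidean_distances(np.random.RandomState(1).rand(10, 3))

est = MDS(n_components=3, method="svd", dissimilarity="precomputed").fit(D)
print(est.stress_)   # close to zero: the Euclidean input is recovered exactly
print(est.n_iter_)   # None: SMACOF was never run

# method='svd' is only defined for metric MDS, so this raises a ValueError.
try:
    MDS(method="svd", metric=False).fit(np.random.rand(10, 3))
except ValueError as exc:
    print(exc)       # "Using SVD requires metric=True"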

sklearn/manifold/tests/test_mds.py

Lines changed: 93 additions & 0 deletions
@@ -55,10 +55,103 @@ def test_smacof_error():
     mds.smacof(sim, init=Z, n_init=1)


+def test_svd():
+    # Test svd_scaler using example data from "An Introduction to MDS"
+    # Florian Wickelmaier, p. 11
+    sim = np.array([[0, 93, 82, 133],
+                    [93, 0, 52, 60],
+                    [82, 52, 0, 111],
+                    [133, 60, 111, 0]])
+
+    X, stress = mds.svd_scaler(sim, n_components=2)
+    X_true_1 = np.array([[-62.831, -32.97448],
+                         [18.403, 12.02697],
+                         [-24.960, 39.71091],
+                         [69.388, -18.76340]])
+    X_true_2 = np.copy(X_true_1)
+    X_true_2[:, 0] *= -1
+
+    # Signs of columns depend on the signs of the computed eigenvectors,
+    # which are arbitrary and meaningless
+    assert (np.allclose(X, X_true_1)
+            or np.allclose(X, -X_true_1)
+            or np.allclose(X, X_true_2)
+            or np.allclose(X, -X_true_2))
+
+
+def test_svd_error():
+    # Non-symmetric (dis)similarity matrix:
+    sim = np.array([[0, 5, 9, 4],
+                    [5, 0, 2, 2],
+                    [3, 2, 0, 1],
+                    [4, 2, 1, 0]])
+
+    with pytest.raises(ValueError):
+        mds.svd_scaler(sim)
+
+    # Non-square (dis)similarity matrix:
+    sim = np.array([[0, 5, 9, 4],
+                    [5, 0, 2, 2],
+                    [4, 2, 1, 0]])
+
+    with pytest.raises(ValueError):
+        mds.svd_scaler(sim)
+
+    # Non-Euclidean (dis)similarity matrix:
+    sim = np.array([[0, 12, 3, 4],
+                    [12, 0, 2, 2],
+                    [3, 2, 0, 1],
+                    [4, 2, 1, 0]])
+
+    with pytest.raises(ValueError,
+                       match="Dissimilarity matrix must be euclidean"):
+        mds.svd_scaler(sim)
+
+
+def test_MDS_error():
+    # Bad method name
+    sim = np.ones((2, 2))
+    mdc_clf = mds.MDS(method='bad name')
+    with pytest.raises(ValueError):
+        mdc_clf.fit(sim)
+
+    # SVD with metric=False
+    sim = np.ones((2, 2))
+    mdc_clf = mds.MDS(metric=False, method='svd')
+    with pytest.raises(ValueError):
+        mdc_clf.fit(sim)
+
+
 def test_MDS():
     sim = np.array([[0, 5, 3, 4],
                     [5, 0, 2, 2],
                     [3, 2, 0, 1],
                     [4, 2, 1, 0]])
     mds_clf = mds.MDS(metric=False, n_jobs=3, dissimilarity="precomputed")
     mds_clf.fit(sim)
+
+
+def test_MDS_svd():
+    # Test MDS with method='svd' using example data from "An Introduction
+    # to MDS", Florian Wickelmaier, p. 11
+    sim = np.array([[0, 93, 82, 133],
+                    [93, 0, 52, 60],
+                    [82, 52, 0, 111],
+                    [133, 60, 111, 0]])
+
+    mds_clf = mds.MDS(metric=True, method="svd", dissimilarity='precomputed')
+    mds_clf.fit(sim)
+
+    X_true_1 = np.array([[-62.831, -32.97448],
+                         [18.403, 12.02697],
+                         [-24.960, 39.71091],
+                         [69.388, -18.76340]])
+    X_true_2 = np.copy(X_true_1)
+    X_true_2[:, 0] *= -1
+
+    # Signs of columns depend on the signs of the computed eigenvectors,
+    # which are arbitrary and meaningless
+    assert (np.allclose(mds_clf.embedding_, X_true_1)
+            or np.allclose(mds_clf.embedding_, -X_true_1)
+            or np.allclose(mds_clf.embedding_, X_true_2)
+            or np.allclose(mds_clf.embedding_, -X_true_2))
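
The four-way ``np.allclose`` comparison above exists because classical scaling fixes the axes only up to a per-column sign flip of the eigenvectors. A sign-insensitive way to state the same check, sketched here for illustration only (it imports the private module under the same ``mds`` alias the test file is assumed to use), is to compare the pairwise distances reconstructed from the two embeddings, which are unaffected by sign flips:

import numpy as np
from sklearn.manifold import _mds as mds   # assumed alias, mirroring test_mds.py
from sklearn.metrics import euclidean_distances

sim = np.array([[0, 93, 82, 133],
                [93, 0, 52, 60],
                [82, 52, 0, 111],
                [133, 60, 111, 0]])
X, _ = mds.svd_scaler(sim, n_components=2)

X_true = np.array([[-62.831, -32.97448],
                   [18.403, 12.02697],
                   [-24.960, 39.71091],
                   [69.388, -18.76340]])

# Distances between embedded points are invariant to per-column sign flips,
# so no enumeration of sign combinations is needed.
assert np.allclose(euclidean_distances(X), euclidean_distances(X_true),
                   atol=1e-2)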
