FIX Fix multiple severe bugs in non-metric MDS (#30514) · scikit-learn/scikit-learn@5f91dca

Commit 5f91dca

dkobak and antoinebaker authored
FIX Fix multiple severe bugs in non-metric MDS (#30514)
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
1 parent 692289e commit 5f91dca

File tree: 5 files changed, +249 -122 lines

doc/modules/manifold.rst

Lines changed: 39 additions & 37 deletions
@@ -418,67 +418,65 @@ Multi-dimensional Scaling (MDS)
 representation of the data in which the distances respect well the
 distances in the original high-dimensional space.

-In general, :class:`MDS` is a technique used for analyzing similarity or
-dissimilarity data. It attempts to model similarity or dissimilarity data as
-distances in a geometric space. The data can be ratings of similarity between
+In general, :class:`MDS` is a technique used for analyzing
+dissimilarity data. It attempts to model dissimilarities as
+distances in a Euclidean space. The data can be ratings of dissimilarity between
 objects, interaction frequencies of molecules, or trade indices between
 countries.

 There exist two types of MDS algorithm: metric and non-metric. In
-scikit-learn, the class :class:`MDS` implements both. In Metric MDS, the input
-similarity matrix arises from a metric (and thus respects the triangular
-inequality), the distances between output two points are then set to be as
-close as possible to the similarity or dissimilarity data. In the non-metric
-version, the algorithms will try to preserve the order of the distances, and
+scikit-learn, the class :class:`MDS` implements both. In metric MDS,
+the distances in the embedding space are set as
+close as possible to the dissimilarity data. In the non-metric
+version, the algorithm will try to preserve the order of the distances, and
 hence seek a monotonic relationship between the distances in the embedded
-space and the similarities/dissimilarities.
+space and the input dissimilarities.

 .. figure:: ../auto_examples/manifold/images/sphx_glr_plot_lle_digits_010.png
    :target: ../auto_examples/manifold/plot_lle_digits.html
    :align: center
    :scale: 50


-Let :math:`S` be the similarity matrix, and :math:`X` the coordinates of the
-:math:`n` input points. Disparities :math:`\hat{d}_{ij}` are transformation of
-the similarities chosen in some optimal ways. The objective, called the
-stress, is then defined by :math:`\sum_{i < j} d_{ij}(X) - \hat{d}_{ij}(X)`
+Let :math:`\delta_{ij}` be the dissimilarity matrix between the
+:math:`n` input points (possibly arising as some pairwise distances
+:math:`d_{ij}(X)` between the coordinates :math:`X` of the input points).
+Disparities :math:`\hat{d}_{ij} = f(\delta_{ij})` are some transformation of
+the dissimilarities. The MDS objective, called the raw stress, is then
+defined by :math:`\sum_{i < j} (\hat{d}_{ij} - d_{ij}(Z))^2`,
+where :math:`d_{ij}(Z)` are the pairwise distances between the
+coordinates :math:`Z` of the embedded points.


 .. dropdown:: Metric MDS

-  The simplest metric :class:`MDS` model, called *absolute MDS*, disparities are defined by
-  :math:`\hat{d}_{ij} = S_{ij}`. With absolute MDS, the value :math:`S_{ij}`
-  should then correspond exactly to the distance between point :math:`i` and
-  :math:`j` in the embedding point.
-
-  Most commonly, disparities are set to :math:`\hat{d}_{ij} = b S_{ij}`.
+  In the metric :class:`MDS` model (sometimes also called *absolute MDS*),
+  disparities are simply equal to the input dissimilarities
+  :math:`\hat{d}_{ij} = \delta_{ij}`.

 .. dropdown:: Nonmetric MDS

   Non metric :class:`MDS` focuses on the ordination of the data. If
-  :math:`S_{ij} > S_{jk}`, then the embedding should enforce :math:`d_{ij} <
-  d_{jk}`. For this reason, we discuss it in terms of dissimilarities
-  (:math:`\delta_{ij}`) instead of similarities (:math:`S_{ij}`). Note that
-  dissimilarities can easily be obtained from similarities through a simple
-  transform, e.g. :math:`\delta_{ij}=c_1-c_2 S_{ij}` for some real constants
-  :math:`c_1, c_2`. A simple algorithm to enforce proper ordination is to use a
-  monotonic regression of :math:`d_{ij}` on :math:`\delta_{ij}`, yielding
-  disparities :math:`\hat{d}_{ij}` in the same order as :math:`\delta_{ij}`.
-
-  A trivial solution to this problem is to set all the points on the origin. In
-  order to avoid that, the disparities :math:`\hat{d}_{ij}` are normalized. Note
-  that since we only care about relative ordering, our objective should be
+  :math:`\delta_{ij} > \delta_{kl}`, then the embedding
+  seeks to enforce :math:`d_{ij}(Z) > d_{kl}(Z)`. A simple algorithm
+  to enforce proper ordination is to use an
+  isotonic regression of :math:`d_{ij}(Z)` on :math:`\delta_{ij}`, yielding
+  disparities :math:`\hat{d}_{ij}` that are a monotonic transformation
+  of the dissimilarities :math:`\delta_{ij}` and hence have the same ordering.
+  This is done repeatedly after every step of the optimization algorithm.
+  In order to avoid the trivial solution where all embedding points are
+  overlapping, the disparities :math:`\hat{d}_{ij}` are normalized.
+
+  Note that since we only care about relative ordering, our objective should be
   invariant to simple translation and scaling, however the stress used in metric
-  MDS is sensitive to scaling. To address this, non-metric MDS may use a
-  normalized stress, known as Stress-1 defined as
+  MDS is sensitive to scaling. To address this, non-metric MDS returns
+  normalized stress, also known as Stress-1, defined as

   .. math::
-    \sqrt{\frac{\sum_{i < j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i < j} d_{ij}^2}}.
+    \sqrt{\frac{\sum_{i < j} (\hat{d}_{ij} - d_{ij}(Z))^2}{\sum_{i < j}
+    d_{ij}(Z)^2}}.

-  The use of normalized Stress-1 can be enabled by setting `normalized_stress=True`,
-  however it is only compatible with the non-metric MDS problem and will be ignored
-  in the metric case.
+  Normalized Stress-1 is returned if `normalized_stress=True`.

 .. figure:: ../auto_examples/manifold/images/sphx_glr_plot_mds_001.png
    :target: ../auto_examples/manifold/plot_mds.html
@@ -487,6 +485,10 @@ stress, is then defined by :math:`\sum_{i < j} d_{ij}(X) - \hat{d}_{ij}(X)`
 
 .. rubric:: References
 
+* `"More on Multidimensional Scaling and Unfolding in R: smacof Version 2"
+  <https://www.jstatsoft.org/article/view/v102i10>`_
+  Mair P., Groenen P., de Leeuw J. Journal of Statistical Software (2022)
+
 * `"Modern Multidimensional Scaling - Theory and Applications"
   <https://www.springer.com/fr/book/9780387251509>`_
   Borg, I.; Groenen P. Springer Series in Statistics (1997)
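
To make the updated formulas concrete, here is a minimal sketch of how disparities, raw stress, and Stress-1 can be computed for a given embedding, with disparities obtained by isotonic regression as described above. This illustrates the math only; it is not scikit-learn's internal implementation, and the disparity normalization constant is a simplified choice made just for this sketch.

import numpy as np
from scipy.spatial.distance import pdist
from sklearn.isotonic import IsotonicRegression


def nonmetric_stress(delta, Z):
    """Raw stress and Stress-1 of an embedding Z given dissimilarities delta.

    delta is a condensed vector of pairwise dissimilarities (as returned by
    scipy's pdist) and Z is the (n_samples, n_components) embedding.
    """
    d_Z = pdist(Z)  # pairwise embedding distances d_ij(Z)

    # Disparities: isotonic regression of d_ij(Z) on delta_ij, i.e. a
    # monotonic transformation of the input dissimilarities.
    disparities = IsotonicRegression().fit_transform(delta, d_Z)

    # Normalize the disparities to rule out the trivial solution in which
    # all embedding points overlap (simplified normalization choice).
    disparities *= np.sqrt((d_Z**2).sum() / (disparities**2).sum())

    raw_stress = ((disparities - d_Z) ** 2).sum()
    stress_1 = np.sqrt(raw_stress / (d_Z**2).sum())
    return raw_stress, stress_1

For metric MDS the disparities are simply the dissimilarities themselves, and the same raw-stress formula applies.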
Lines changed: 4 additions & 0 deletions

@@ -0,0 +1,4 @@
+- :class:`manifold.MDS` now correctly handles non-metric MDS. Furthermore,
+  the returned stress value now corresponds to the returned embedding and
+  normalized stress is now allowed for metric MDS.
+  By :user:`Dmitry Kobak <dkobak>`
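
In terms of the public API, the fixed behavior can be exercised as follows; this is a usage sketch on arbitrary random data, with parameter names taken from :class:`manifold.MDS`.

import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import euclidean_distances

rng = np.random.RandomState(0)
D = euclidean_distances(rng.rand(20, 3))  # precomputed dissimilarities

# Non-metric MDS: after this fix, stress_ corresponds to the embedding
# that is actually returned; with normalized_stress=True it holds Stress-1.
nmds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
           normalized_stress=True, n_init=1, random_state=0)
Z = nmds.fit_transform(D)
print(nmds.stress_)

# Per the changelog entry above, normalized stress is now also accepted
# for metric MDS (previously this combination was not supported).
mds = MDS(n_components=2, metric=True, dissimilarity="precomputed",
          normalized_stress=True, random_state=0)
Z_metric = mds.fit_transform(D)
print(mds.stress_)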

examples/manifold/plot_mds.py

Lines changed: 36 additions & 28 deletions
@@ -21,79 +21,87 @@
 from sklearn.decomposition import PCA
 from sklearn.metrics import euclidean_distances

+# Generate the data
 EPSILON = np.finfo(np.float32).eps
 n_samples = 20
-seed = np.random.RandomState(seed=3)
-X_true = seed.randint(0, 20, 2 * n_samples).astype(float)
+rng = np.random.RandomState(seed=3)
+X_true = rng.randint(0, 20, 2 * n_samples).astype(float)
 X_true = X_true.reshape((n_samples, 2))
+
 # Center the data
 X_true -= X_true.mean()

-similarities = euclidean_distances(X_true)
+# Compute pairwise Euclidean distances
+distances = euclidean_distances(X_true)

-# Add noise to the similarities
-noise = np.random.rand(n_samples, n_samples)
+# Add noise to the distances
+noise = rng.rand(n_samples, n_samples)
 noise = noise + noise.T
-noise[np.arange(noise.shape[0]), np.arange(noise.shape[0])] = 0
-similarities += noise
+np.fill_diagonal(noise, 0)
+distances += noise

 mds = manifold.MDS(
     n_components=2,
     max_iter=3000,
     eps=1e-9,
-    random_state=seed,
+    random_state=42,
     dissimilarity="precomputed",
     n_jobs=1,
 )
-pos = mds.fit(similarities).embedding_
+X_mds = mds.fit(distances).embedding_

 nmds = manifold.MDS(
     n_components=2,
     metric=False,
     max_iter=3000,
     eps=1e-12,
     dissimilarity="precomputed",
-    random_state=seed,
+    random_state=42,
     n_jobs=1,
     n_init=1,
 )
-npos = nmds.fit_transform(similarities, init=pos)
+X_nmds = nmds.fit_transform(distances)

 # Rescale the data
-pos *= np.sqrt((X_true**2).sum()) / np.sqrt((pos**2).sum())
-npos *= np.sqrt((X_true**2).sum()) / np.sqrt((npos**2).sum())
+X_mds *= np.sqrt((X_true**2).sum()) / np.sqrt((X_mds**2).sum())
+X_nmds *= np.sqrt((X_true**2).sum()) / np.sqrt((X_nmds**2).sum())

 # Rotate the data
-clf = PCA(n_components=2)
-X_true = clf.fit_transform(X_true)
-
-pos = clf.fit_transform(pos)
-
-npos = clf.fit_transform(npos)
+pca = PCA(n_components=2)
+X_true = pca.fit_transform(X_true)
+X_mds = pca.fit_transform(X_mds)
+X_nmds = pca.fit_transform(X_nmds)
+
+# Align the sign of PCs
+for i in [0, 1]:
+    if np.corrcoef(X_mds[:, i], X_true[:, i])[0, 1] < 0:
+        X_mds[:, i] *= -1
+    if np.corrcoef(X_nmds[:, i], X_true[:, i])[0, 1] < 0:
+        X_nmds[:, i] *= -1

 fig = plt.figure(1)
 ax = plt.axes([0.0, 0.0, 1.0, 1.0])

 s = 100
 plt.scatter(X_true[:, 0], X_true[:, 1], color="navy", s=s, lw=0, label="True Position")
-plt.scatter(pos[:, 0], pos[:, 1], color="turquoise", s=s, lw=0, label="MDS")
-plt.scatter(npos[:, 0], npos[:, 1], color="darkorange", s=s, lw=0, label="NMDS")
+plt.scatter(X_mds[:, 0], X_mds[:, 1], color="turquoise", s=s, lw=0, label="MDS")
+plt.scatter(X_nmds[:, 0], X_nmds[:, 1], color="darkorange", s=s, lw=0, label="NMDS")
 plt.legend(scatterpoints=1, loc="best", shadow=False)

-similarities = similarities.max() / (similarities + EPSILON) * 100
-np.fill_diagonal(similarities, 0)
 # Plot the edges
-start_idx, end_idx = np.where(pos)
+start_idx, end_idx = np.where(X_mds)
 # a sequence of (*line0*, *line1*, *line2*), where::
 #     linen = (x0, y0), (x1, y1), ... (xm, ym)
 segments = [
-    [X_true[i, :], X_true[j, :]] for i in range(len(pos)) for j in range(len(pos))
+    [X_true[i, :], X_true[j, :]] for i in range(len(X_true)) for j in range(len(X_true))
 ]
-values = np.abs(similarities)
+edges = distances.max() / (distances + EPSILON) * 100
+np.fill_diagonal(edges, 0)
+edges = np.abs(edges)
 lc = LineCollection(
-    segments, zorder=0, cmap=plt.cm.Blues, norm=plt.Normalize(0, values.max())
+    segments, zorder=0, cmap=plt.cm.Blues, norm=plt.Normalize(0, edges.max())
 )
-lc.set_array(similarities.flatten())
+lc.set_array(edges.flatten())
 lc.set_linewidths(np.full(len(segments), 0.5))
 ax.add_collection(lc)
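
As a quick follow-up to the example (not part of the diff), the stress of each fitted embedding can be inspected through the estimator's `stress_` attribute, which after this fix matches the embedding that is actually returned:

# Using the `mds` and `nmds` estimators fitted in plot_mds.py above:
print(f"Metric MDS stress: {mds.stress_:.4f}")
print(f"Non-metric MDS stress: {nmds.stress_:.4f}")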