8000 ENH Calculate normed stress (Stress-1) in `manifold.MDS` by Micky774 · Pull Request #22562 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
b34bf99
Update _mds.py
rotheconrad Aug 4, 2020
f58b51e
Merge branch 'main' into mds_normalized
Micky774 Feb 20, 2022
6d43af3
Move stress normalization to end of loop
Micky774 Feb 20, 2022
3879fbe
Updated tests, moved normalization to inner loop, added versionadd
Micky774 Feb 21, 2022
4df87b9
Added changelong entry
Micky774 Feb 21, 2022
cc311ea
Added warning when using normalized stress for metric MDS
Micky774 Feb 25, 2022
a0bb1d2
Merge branch 'main' into mds_normalized
Micky774 Feb 25, 2022
cc782e1
Merge branch 'main' into mds_normalized
Micky774 Feb 26, 2022
47e8b02
Fixed docstring typo
Micky774 Feb 26, 2022
ce527e0
Merge branch 'main' into mds_normalized
Micky774 Mar 25, 2022
ea82cfe
Changed variable name from `normalized`-->`normalized_stress`
Micky774 Mar 25, 2022
3dbd449
Updated `normalized_stress` name in tests
Micky774 Mar 25, 2022
d9eb5de
Merge branch 'main' into mds_normalized
Micky774 Mar 25, 2022
5de18a4
Merge branch 'main' into mds_normalized
Micky774 Apr 22, 2022
e44a789
Update doc/whats_new/v1.1.rst
Micky774 Apr 22, 2022
3211a0d
Merge branch 'main' into mds_normalized
Micky774 May 12, 2022
13d74a1
Merge branch 'main' into mds_normalized
Micky774 May 19, 2022
97115dd
Merge branch 'main' into mds_normalized
Micky774 May 23, 2022
8f71fdd
Merge branch 'main' into mds_normalized
Micky774 May 31, 2022
5a9057c
Implemented reviewer feedback
Micky774 May 31, 2022
066771a
Improved MDS User Guide
Micky774 May 31, 2022
c8dd9c9
Merge branch 'main' into mds_normalized
Micky774 May 31, 2022
535b930
Merge branch 'main' into mds_normalized
glemaitre Jun 1, 2022
8b213eb
Merge branch 'main' into mds_normalized
Micky774 Jun 1, 2022
145ec54
Merge branch 'main' into mds_normalized
glemaitre Jun 2, 2022
ce3893b
Merge branch 'main' into mds_normalized
thomasjpfan Jun 27, 2022
a3a6eb4
Merge branch 'main' into mds_normalized
Micky774 Jun 28, 2022
773bafe
Addressed review comments
Micky774 Jun 28, 2022
4019406
Merge branch 'main' into mds_normalized
Micky774 Jun 29, 2022
577e122
Added parameter constraint
Micky774 Jun 29, 2022
45d59a1
Merge branch 'main' into mds_normalized
Micky774 Jun 30, 2022
ef47a9d
Raised error and modified test
Micky774 Jun 30, 2022
de69da9
Improved docstring
Micky774 Jun 30, 2022
66b531e
Merge branch 'main' into mds_normalized
Micky774 Jun 30, 2022
2262049
Merge branch 'main' into mds_normalized
Micky774 Jul 4, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 21 additions & 7 deletions doc/modules/manifold.rst
Original file line number Diff line number Diff line change
Expand Up @@ -462,14 +462,28 @@ Nonmetric MDS
-------------

Non metric :class:`MDS` focuses on the ordination of the data. If
:math:`S_{ij} < S_{jk}`, then the embedding should enforce :math:`d_{ij} <
d_{jk}`. A simple algorithm to enforce that is to use a monotonic regression
of :math:`d_{ij}` on :math:`S_{ij}`, yielding disparities :math:`\hat{d}_{ij}`
in the same order as :math:`S_{ij}`.
:math:`S_{ij} > S_{jk}`, then the embedding should enforce :math:`d_{ij} <
d_{jk}`. For this reason, we discuss it in terms of dissimilarities
(:math:`\delta_{ij}`) instead of similarities (:math:`S_{ij}`). Note that
dissimilarities can easily be obtained from similarities through a simple
transform, e.g. :math:`\delta_{ij}=c_1-c_2 S_{ij}` for some real constants
:math:`c_1, c_2`. A simple algorithm to enforce proper ordination is to use a
monotonic regression of :math:`d_{ij}` on :math:`\delta_{ij}`, yielding
disparities :math:`\hat{d}_{ij}` in the same order as :math:`\delta_{ij}`.

A trivial solution to this problem is to set all the points on the origin. In
order to avoid that, the disparities :math:`\hat{d}_{ij}` are normalized.
order to avoid that, the disparities :math:`\hat{d}_{ij}` are normalized. Note
that since we only care about relative ordering, our objective should be
invariant to simple translation and scaling, however the stress used in metric
MDS is sensitive to scaling. To address this, non-metric MDS may use a
normalized stress, known as Stress-1 defined as

.. math::
\sqrt{\frac{\sum_{i < j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i < j} d_{ij}^2}}.

The use of normalized Stress-1 can be enabled by setting `normalized_stress=True`,
however it is only compatible with the non-metric MDS problem and will be ignored
in the metric case.

.. figure:: ../auto_examples/manifold/images/sphx_glr_plot_mds_001.png
:target: ../auto_examples/manifold/plot_mds.html
Expand All @@ -484,11 +498,11 @@ order to avoid that, the disparities :math:`\hat{d}_{ij}` are normalized.
Borg, I.; Groenen P. Springer Series in Statistics (1997)

* `"Nonmetric multidimensional scaling: a numerical method"
<https://link.springer.com/article/10.1007%2FBF02289694>`_
<http://cda.psych.uiuc.edu/psychometrika_highly_cited_articles/kruskal_1964b.pdf>`_
Kruskal, J. Psychometrika, 29 (1964)

* `"Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis"
<https://link.springer.com/article/10.1007%2FBF02289565>`_
<http://cda.psych.uiuc.edu/psychometrika_highly_cited_articles/kruskal_1964a.pdf>`_
Kruskal, J. Psychometrika, 29, (1964)

.. _t_sne:
Expand Down
8 changes: 8 additions & 0 deletions doc/whats_new/v1.2.rst
Original file line number Diff line number Diff line change
Expand Up @@ -233,6 +233,14 @@ Changelog
:mod:`sklearn.manifold`
.......................

- |Feature| Adds option to use the normalized stress in `manifold.MDS`. This is
enabled by setting the new `normalize` parameter to `True`.
:pr:`10168` by :user:`Łukasz Borchmann <Borchmann>`,
:pr:`12285` by :user:`Matthias Miltenberger <mattmilten>`,
:pr:`13042` by :user:`Matthieu Parizy <matthieu-pa>`,
:pr:`18094` by :user:`Roth E Conrad <rotheconrad>` and
:pr:`22562` by :user:`Meekail Zain <micky774>`.

- |Enhancement| Adds `eigen_tol` parameter to
:class:`manifold.SpectralEmbedding`. Both :func:`manifold.spectral_embedding`
and :class:`manifold.SpectralEmbedding` now propogate `eigen_tol` to all
Expand Down
93 changes: 74 additions & 19 deletions sklearn/manifold/_mds.py
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@ def _smacof_single(
verbose=0,
eps=1e-3,
random_state=None,
normalized_stress=False,
):
"""Computes multidimensional scaling using SMACOF algorithm.

Expand Down Expand Up @@ -58,13 +59,21 @@ def _smacof_single(

eps : float, default=1e-3
Relative tolerance with respect to stress at which to declare
convergence.
convergence. The value of `eps` should be tuned separately depending
on whether or not `normalized_stress` is being used.

random_state : int, RandomState instance or None, default=None
Determines the random number generator used to initialize the centers.
Pass an int for reproducible results across multiple function calls.
See :term:`Glossary <random_state>`.

normalized_stress : bool, default=False
Whether use and return normed stress value (Stress-1) instead of raw
stress calculated by default. Only supported in non-metric MDS. The
caller must ensure that if `normalized_stress=True` then `metric=False`

.. versionadded:: 1.2

Returns
-------
X : ndarray of shape (n_samples, n_components)
Expand All @@ -73,9 +82,23 @@ def _smacof_single(
stress : float
The final value of the stress (sum of squared distance of the
disparities and the distances for all constrained points).
If `normalized_stress=True`, and `metric=False` returns Stress-1.
A value of 0 indicates "perfect" fit, 0.025 excellent, 0.05 good,
0.1 fair, and 0.2 poor [1]_.

n_iter : int
The number of iterations corresponding to the best stress.

References
----------
.. [1] "Nonmetric multidimensional scaling: a numerical method" Kruskal, J.
Psychometrika, 29 (1964)

.. [2] "Multidimensional scaling by optimizing goodness of fit to a nonmetric
hypothesis" Kruskal, J. Psychometrika, 29, (1964)

.. [3] "Modern Multidimensional Scaling - Theory and Applications" Borg, I.;
Groenen P. Springer Series in Statistics (1997)
"""
dissimilarities = check_symmetric(dissimilarities, raise_exception=True)

Expand Down Expand Up @@ -121,7 +144,8 @@ def _smacof_single(

# Compute stress
stress = ((dis.ravel() - disparities.ravel()) ** 2).sum() / 2

if normalized_stress:
stress = np.sqrt(stress / ((disparities.ravel() ** 2).sum() / 2))
# Update X using the Guttman transform
dis[dis == 0] = 1e-5
ratio = disparities / dis
Expand Down Expand Up @@ -155,6 +179,7 @@ def smacof(
eps=1e-3,
random_state=None,
return_n_iter=False,
normalized_stress=False,
):
"""Compute multidimensional scaling using the SMACOF algorithm.

Expand Down Expand Up @@ -217,7 +242,8 @@ def smacof(

eps : float, default=1e-3
Relative tolerance with respect to stress at which to declare
convergence.
convergence. The value of `eps` should be tuned separately depending
on whether or not `normalized_stress` is being used.

random_state : int, RandomState instance or None, default=None
Determines the random number generator used to initialize the centers.
Expand All @@ -227,6 +253,12 @@ def smacof(
return_n_iter : bool, default=False
Whether or not to return the number of iterations.

normalized_stress : bool, default=False
Whether use and return normed stress value (Stress-1) instead of raw
stress calculated by default. Only supported in non-metric MDS.

.. versionadded:: 1.2

Returns
-------
X : ndarray of shape (n_samples, n_components)
Expand All @@ -235,26 +267,33 @@ def smacof(
stress : float
The final value of the stress (sum of squared distance of the
disparities and the distances for all constrained points).
If `normalized_stress=True`, and `metric=False` returns Stress-1.
A value of 0 indicates "perfect" fit, 0.025 excellent, 0.05 good,
0.1 fair, and 0.2 poor [1]_.

n_iter : int
The number of iterations corresponding to the best stress. Returned
only if ``return_n_iter`` is set to ``True``.

Notes
-----
"Modern Multidimensional Scaling - Theory and Applications" Borg, I.;
Groenen P. Springer Series in Statistics (1997)
References
----------
.. [1] "Nonmetric multidimensional scaling: a numerical method" Kruskal, J.
Psychometrika, 29 (1964)

"Nonmetric multidimensional scaling: a numerical method" Kruskal, J.
Psychometrika, 29 (1964)
.. [2] "Multidimensional scaling by optimizing goodness of fit to a nonmetric
hypothesis" Kruskal, J. Psychometrika, 29, (1964)

"Multidimensional scaling by optimizing goodness of fit to a nonmetric
hypothesis" Kruskal, J. Psychometrika, 29, (1964)
.. [3] "Modern Multidimensional Scaling - Theory and Applications" Borg, I.;
Groenen P. Springer Series in Statistics (1997)
"""

dissimilarities = check_array(dissimilarities)
random_state = check_random_state(random_state)

if normalized_stress and metric:
raise ValueError(
"Normalized stress is not supported for metric MDS. Either set"
" `normalized_stress=False` or use `metric=False`."
)
if hasattr(init, "__array__"):
init = np.asarray(init).copy()
if not n_init == 1:
Expand All @@ -277,6 +316,7 @@ def smacof(
verbose=verbose,
eps=eps,
random_state=random_state,
normalized_stress=normalized_stress,
)
if best_stress is None or stress < best_stress:
best_stress = stress
Expand All @@ -294,6 +334,7 @@ def smacof(
verbose=verbose,
eps=eps,
random_state=seed,
normalized_stress=normalized_stress,
)
for seed in seeds
)
Expand Down Expand Up @@ -335,7 +376,8 @@ class MDS(BaseEstimator):

eps : float, default=1e-3
Relative tolerance with respect to stress at which to declare
convergence.
convergence. The value of `eps` should be tuned separately depending
on whether or not `normalized_stress` is being used.

n_jobs : int, default=None
The number of jobs to use for the computation. If multiple
Expand All @@ -361,6 +403,12 @@ class MDS(BaseEstimator):
Pre-computed dissimilarities are passed directly to ``fit`` and
``fit_transform``.

normalized_stress : bool, default=False
Whether use and return normed stress value (Stress-1) instead of raw
stress calculated by default. Only supported in non-metric MDS.

.. versionadded:: 1.2

Attributes
----------
embedding_ : ndarray of shape (n_samples, n_components)
Expand All @@ -369,6 +417,9 @@ class MDS(BaseEstimator):
stress_ : float
The final value of the stress (sum of squared distance of the
disparities and the distances for all constrained points).
If `normalized_stress=True`, and `metric=False` returns Stress-1.
A value of 0 indicates "perfect" fit, 0.025 excellent, 0.05 good,
0.1 fair, and 0.2 poor [1]_.

dissimilarity_matrix_ : ndarray of shape (n_samples, n_samples)
Pairwise dissimilarities between the points. Symmetric matrix that:
Expand Down Expand Up @@ -405,14 +456,14 @@ class MDS(BaseEstimator):

References
----------
"Modern Multidimensional Scaling - Theory and Applications" Borg, I.;
Groenen P. Springer Series in Statistics (1997)
.. [1] "Nonmetric multidimensional scaling: a numerical method" Kruskal, J.
Psychometrika, 29 (1964)

"Nonmetric multidimensional scaling: a numerical method" Kruskal, J.
Psychometrika, 29 (1964)
.. [2] "Multidimensional scaling by optimizing goodness of fit to a nonmetric
hypothesis" Kruskal, J. Psychometrika, 29, (1964)

"Multidimensional scaling by optimizing goodness of fit to a nonmetric
hypothesis" Kruskal, J. Psychometrika, 29, (1964)
.. [3] "Modern Multidimensional Scaling - Theory and Applications" Borg, I.;
Groenen P. Springer Series in Statistics (1997)

Examples
--------
Expand All @@ -437,6 +488,7 @@ class MDS(BaseEstimator):
"n_jobs": [None, Integral],
"random_state": ["random_state"],
"dissimilarity": [StrOptions({"euclidean", "precomputed"})],
"normalized_stress": ["boolean"],
}

def __init__(
Expand All @@ -451,6 +503,7 @@ def __init__(
n_jobs=None,
random_state=None,
dissimilarity="euclidean",
normalized_stress=False,
):
self.n_components = n_components
self.dissimilarity = dissimilarity
Expand All @@ -461,6 +514,7 @@ def __init__(
self.verbose = verbose
self.n_jobs = n_jobs
self.random_state = random_state
self.normalized_stress = normalized_stress

def _more_tags(self):
return {"pairwise": self.dissimilarity == "precomputed"}
Expand Down Expand Up @@ -544,6 +598,7 @@ def fit_transform(self, X, y=None, init=None):
eps=self.eps,
random_state=self.random_state,
return_n_iter=True,
normalized_stress=self.normalized_stress,
)

return self.embedding_
29 changes: 28 additions & 1 deletion sklearn/manifold/tests/test_mds.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
import numpy as np
from numpy.testing import assert_array_almost_equal
from numpy.testing import assert_array_almost_equal, assert_allclose
import pytest

from sklearn.manifold import _mds as mds
Expand Down Expand Up @@ -42,3 +42,30 @@ def test_MDS():
sim = np.array([[0, 5, 3, 4], [5, 0, 2, 2], [3, 2, 0, 1], [4, 2, 1, 0]])
mds_clf = mds.MDS(metric=False, n_jobs=3, dissimilarity="precomputed")
mds_clf.fit(sim)
97A8

@pytest.mark.parametrize("k", [0.5, 1.5, 2])
def test_normed_stress(k):
"""Test that non-metric MDS normalized stress is scale-invariant."""
sim = np.array([[0, 5, 3, 4], [5, 0, 2, 2], [3, 2, 0, 1], [4, 2, 1, 0]])

X1, stress1 = mds.smacof(
sim, metric=False, normalized_stress=True, max_iter=5, random_state=0
)
X2, stress2 = mds.smacof(
k * sim, metric=False, normalized_stress=True, max_iter=5, random_state=0
)

assert_allclose(stress1, stress2, rtol=1e-5)
assert_allclose(X1, X2, rtol=1e-5)


def test_normalize_metric_warning():
"""
Test that a UserWarning is emitted when using normalized stress with
metric-MDS.
"""
msg = "Normalized stress is not supported"
sim = np.array([[0, 5, 3, 4], [5, 0, 2, 2], [3, 2, 0, 1], [4, 2, 1, 0]])
with pytest.raises(ValueError, match=msg):
mds.smacof(sim, metric=True, normalized_stress=True)
0