8000 ENH Add eigh as a solver in MDS by Micky774 · Pull Request #22330 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

ENH Add eigh as a solver in MDS #22330

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 47 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
47 commits
Select commit Hold shift + click to select a range
27868c6
Implement SVD-based method in MDS class
Mar 3, 2020
cba2700
Make requested changes
Mar 5, 2020
03954cf
Change name 'method' to 'solver'
Mar 5, 2020
3047960
Make another set of requested changes
Mar 7, 2020
ec66a71
Merge branch 'main' into svd_in_mds
Micky774 Jan 29, 2022
c8ec454
Added temporary benchmark notebook
Micky774 Jan 29, 2022
bdc70da
Updated solver selection logic
Micky774 Jan 29, 2022
8573ed3
Updated documentation and improved solver selection
Micky774 Jan 30, 2022
534bd23
Merge branch 'main' into svd_in_mds
Micky774 Jan 30, 2022
4c1b098
Updated docs that included old `default: ...` format
Micky774 Feb 2, 2022
94a7bfa
Updated whats_new
Micky774 Feb 2, 2022
48a003b
Undo `solver='auto'` change
Micky774 Feb 2, 2022
ca60660
Updated solver error message
Micky774 Feb 2, 2022
48b9d21
Corrected solver parameter documentation
Micky774 Feb 2, 2022
f4fa1fe
Merge branch 'main' into svd_in_mds
Micky774 Feb 4, 2022
0b63d80
Merge branch 'main' into svd_in_mds
Micky774 Feb 12, 2022
c8f0eca
Merge branch 'main' into svd_in_mds
Micky774 Feb 14, 2022
6042fa2
Removed depreciated tests
Micky774 Feb 15, 2022
8f3b8d4
Merge branch 'main' into svd_in_mds
Micky774 Feb 16, 2022
874739a
Linting w/ new version of Black
Micky774 Feb 16, 2022
450bcb4
Merge branch 'main' into svd_in_mds
Micky774 Mar 27, 2022
2b6e84c
Apply suggestions from code review
Micky774 Mar 27, 2022
e347dc9
Renamed solver from `svd` to `eigh` and added public reference
Micky774 Mar 27, 2022
092819a
Merge branch 'main' into svd_in_mds
Micky774 Apr 22, 2022
33e7690
Merge branch 'main' into svd_in_mds
Micky774 May 12, 2022
5e33d52
Reconciled changelogs
Micky774 May 30, 2022
e11126e
Merge branch 'main' into svd_in_mds
Micky774 May 30, 2022
d825245
Merge branch 'main' into svd_in_mds
Micky774 May 31, 2022
0e03893
Improved test
Micky774 May 31, 2022
fe23df2
Merge branch 'main' into svd_in_mds
Micky774 Jun 14, 2022
2dd78fa
Improved documentation and testing
Micky774 Jun 14, 2022
0f49a13
Significant improvements to user guide entry
Micky774 Jun 14, 2022
3570a59
Minor text correction
Micky774 Jun 14, 2022
51a684f
Extended documentation
Micky774 Jun 14, 2022
63a9f1c
Linting
Micky774 Jun 14, 2022
316a7e1
Merge branch 'main' into svd_in_mds
Micky774 Jun 20, 2022
ccec4e8
Merge branch 'main' into svd_in_mds
Micky774 Jun 30, 2022
e8eef36
Reconciled param validation changes and streamlined tests
Micky774 Jun 30, 2022
aa074a4
Reverting unnecessary diff
Micky774 Jun 30, 2022
595ca9d
Added clarifying comment to test
Micky774 Jun 30, 2022
12fbbe1
Adopted suggestion from @thisirs
Micky774 Jul 6, 2022
19cac39
Merge branch 'main' into svd_in_mds
Micky774 Jul 6, 2022
696448a
Merge branch 'main' into svd_in_mds
Micky774 Jul 25, 2022
88c0f1f
Merge branch 'main' into svd_in_mds
Micky774 Jul 26, 2022
9ae844f
Merge branch 'main' into svd_in_mds
Micky774 Jul 29, 2022
1b2ae3c
Merge branch 'main' into svd_in_mds
Micky774 Jul 27, 2023
6cd2ad8
Merge branch 'main' into svd_in_mds
Micky774 Jul 27, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
68 changes: 59 additions & 9 deletions doc/modules/manifold.rst
8000
Original file line number Diff line number Diff line change
Expand Up @@ -429,34 +429,75 @@ countries.

There exists two types of MDS algorithm: metric and non metric. In
scikit-learn, the class :class:`MDS` implements both. In Metric MDS, the input
similarity matrix arises from a metric (and thus respects the triangular
dissimilarity matrix arises from a metric (and thus respects the triangular
inequality), the distances between output two points are then set to be as
close as possible to the similarity or dissimilarity data. In the non-metric
version, the algorithms will try to preserve the order of the distances, and
hence seek for a monotonic relationship between the distances in the embedded
space and the similarities/dissimilarities.
close as possible to the dissimilarity data. In the non-metric version, the
algorithms will try to preserve the order of the distances, and hence seek for
a monotonic relationship between the distances in the embedded space and the
dissimilarities.

.. figure:: ../auto_examples/manifold/images/sphx_glr_plot_lle_digits_010.png
:target: ../auto_examples/manifold/plot_lle_digits.html
:align: center
:scale: 50


Let :math:`S` be the similarity matrix, and :math:`X` the coordinates of the
Let :math:`D` be the dissimilarity matrix, and :math:`X` the coordinates of the
:math:`n` input points. Disparities :math:`\hat{d}_{ij}` are transformation of
the similarities chosen in some optimal ways. The objective, called the
the dissimilarities chosen in some optimal ways. The objective, called the
stress, is then defined by :math:`\sum_{i < j} d_{ij}(X) - \hat{d}_{ij}(X)`


Metric MDS
----------

The simplest metric :class:`MDS` model, called *absolute MDS*, disparities are defined by
:math:`\hat{d}_{ij} = S_{ij}`. With absolute MDS, the value :math:`S_{ij}`
:math:`\hat{d}_{ij} = D_{ij}`. With absolute MDS, the value :math:`D_{ij}`
should then correspond exactly to the distance between point :math:`i` and
:math:`j` in the embedding point.

Most commonly, disparities are set to :math:`\hat{d}_{ij} = b S_{ij}`.
Most commonly, disparities are set to :math:`\hat{d}_{ij} = b D_{ij}`.

If the dissimilarity matrix :math:`D` is "Euclidean", user can choose to use
faster and more accurate solver of calculating results. See
:ref:`multidimensional_scaling_solver` for details.

.. _multidimensional_scaling_solver:

Solver
-------------

Metric :class:`MDS` offers two different algorithms (solvers) to calculate
results: SMACOF and SVD-based. The SMACOF solver (Scaling by MAjorizing a
COmplicated Function) minimizes its objective function (stress) in an iterative
manner. The SVD-based solver performs series of transformations (including
Singular Value Decomposition) to give exact result. The SVD-based solver is
thus much faster and more accurate, but also less general - it requires :math:`D`
to be "Euclidean". Specifically, we refer to :math:`D` as "Euclidean" if there
exists a set of points :math:`p_1,\dots,p_n` such that the dissimilarities can
be generated by taking the euclidean distance between such points, i.e.

.. math::
D_{i,j}=\|p_i-p_j\|_2^2

Let the matrix :math:`\Delta` be the matrix of transformed dissimilarities, such
that :math:`\Delta_{i,j}=\frac{-1}{d_{i,j}^2}`. Then define the vector of ones as
:math:`J=[1,\dots,1]^\intercal`. We then say that the matrix :math:`D` is "Euclidean"
iff the matrix

.. math::
L=(I-Js^\intercal)\Delta (I-sJ^\intercal)

is positive semi-definite for some vector :math:`s` such that :math:`s^\intercal J=1` (e.g.
:math:`s=e_i` or :math:`s=\frac{1}{n}J`). Saying that :math:`L` is positive
semi-definite is equivalent to saying that all eigenvalues of :math:`L` are non-negative.

.. note::
While technically :math:`D` is "Euclidean" if there exists *some* vector :math:`s`
satisfying the above equation, it is shown that it is sufficient to test an
arbitrary vector :math:`s` satisfying the above conditions; if there exists
a single valid choice of :math:`s` solving the above, then *every* valid choice
of :math:`s` must solve the above.

Nonmetric MDS
-------------
Expand Down Expand Up @@ -493,6 +534,10 @@ in the metric case.

.. topic:: References:

* `"Metric and Euclidean properties of dissimilarity coefficients"
<https://sci-hub.se/10.1007/bf01896809>`_
Gower, J.C., Legendre, P. Journal of Classification 3, 5-48 (1986)

* `"Modern Multidimensional Scaling - Theory and Applications"
<https://www.springer.com/fr/book/9780387251509>`_
Borg, I.; Groenen P. Springer Series in Statistics (1997)
Expand All @@ -505,6 +550,11 @@ in the metric case.
<http://cda.psych.uiuc.edu/psychometrika_highly_cited_articles/kruskal_1964a.pdf>`_
Kruskal, J. Psychometrika, 29, (1964)

* `"An Introduction to MDS"
<https://www.hongfeili.com/files/paper100/paper4.pdf>`_
Florian Wickelmaier, Sound Quality Research Unit, Aalborg University, Denmark (2003)


.. _t_sne:

t-distributed Stochastic Neighbor Embedding (t-SNE)
Expand Down
10 changes: 10 additions & 0 deletions doc/whats_new/v1.4.rst
Original file line number Diff line number Diff line change
Expand Up @@ -165,6 +165,16 @@ Changelog
which can be used to check whether a given set of parameters would be consumed.
:pr:`26831` by `Adrin Jalali`_.

:mod:`sklearn.manifold`
.......................

- |Feature| :class:`manifold.MDS` now supports a Singular Value Decomposition
(SVD). Users can choose whether to use SVD or SMACOF-based solvers via the
`solver` parameter. The SVD-based solver is faster and more accurate, but only
works for euclidean dissimilarity matrices in the metric-MDS context.
:pr:`16067` by :user:`Piotr Gaiński <panpiort8>` and
:pr:`22330` by :user:`Meekail Zain <micky774>`.

Code and Documentation Contributors
-----------------------------------

Expand Down
137 changes: 112 additions & 25 deletions sklearn/manifold/_mds.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,13 +10,15 @@

import numpy as np
from joblib import effective_n_jobs
from scipy import linalg

from ..base import BaseEstimator, _fit_context
from ..isotonic import IsotonicRegression
from ..metrics import euclidean_distances
from ..utils import check_array, check_random_state, check_symmetric
from ..utils._param_validation import Hidden, Interval, StrOptions, validate_params
from ..utils.parallel import Parallel, delayed
from ..utils.validation import _check_psd_eigenvalues


def _smacof_single(
Expand Down Expand Up @@ -390,6 +392,72 @@ def smacof(
return best_pos, best_stress


def eigh_scaler(dissimilarities, n_components=2):
"""
Computes multidimensional scaling using eigh solver.

Parameters
----------
dissimilarities : ndarray, shape (n_samples, n_samples)
Pairwise dissimilarities between the points. Must be euclidean.
n_components : int, optional, default=2
Number of dimension in which to immerse the dissimilarities.

Returns
----------
embedding : ndarray, shape (n_samples, n_components)
Coordinates of the points in a ``n_components``-space.

stress : float
The final value of the stress (sum of squared distance of the
disparities and the distances for all constrained points).

References
----------
"An Introduction to MDS" Florian Wickelmaier
Sound Quality Research Unit, Aalborg University, Denmark (2003)

"Metric and Euclidean properties of dissimilarity coefficients"
J. C. Gower and P. Legendre, Journal of Classification 3, 5–48 (1986)

"""

dissimilarities = check_symmetric(dissimilarities, raise_exception=True)

# Centering
B = dissimilarities**2
B = B.astype(np.float64)
B -= np.mean(B, axis=0)
B -= np.mean(B, axis=1, keepdims=True)
B *= -0.5

w, V = linalg.eigh(B, check_finite=False)

# `dissimilarities` is Euclidean iff `B` is positive semi-definite.
# See "Metric and Euclidean properties of dissimilarity coefficients"
# for details
try:
w = _check_psd_eigenvalues(w)
except ValueError:
raise ValueError(
"Dissimilarity matrix must be euclidean. "
"Make sure to pass an euclidean matrix, or use "
"dissimilarity='euclidean'."
)

# Get ``n_components`` greatest eigenvalues and corresponding eigenvectors.
# Eigenvalues should be in descending order by convention.
w = w[::-1][:n_components]
V = V[:, ::-1][:, :n_components]

embedding = np.sqrt(w) * V

dist = euclidean_distances(embedding)
stress = ((dissimilarities.ravel() - dist.ravel()) ** 2).sum() * 0.5

return embedding, stress


class MDS(BaseEstimator):
"""Multidimensional scaling.

Expand All @@ -402,33 +470,37 @@ class MDS(BaseEstimator):

metric : bool, default=True
If ``True``, perform metric MDS; otherwise, perform nonmetric MDS.
When ``False`` (i.e. non-metric MDS), dissimilarities with 0 are considered as
missing values.
When ``False`` (i.e. non-metric MDS), dissimilarities with 0 are
considered as missing values. If `solver=='eigh'`, metric must be set
to ``True``.

n_init : int, default=4
Number of times the SMACOF algorithm will be run with different
initializations. The final results will be the best output of the runs,
determined by the run with the smallest final stress.
Ignored if `solver=='eigh'`.

max_iter : int, default=300
max_iter : int, optional, default=300
Maximum number of iterations of the SMACOF algorithm for a single run.
Ignored if `solver=='eigh'`.

verbose : int, default=0
verbose : int, optional, default=0
Level of verbosity.

eps : float, default=1e-3
Relative tolerance with respect to stress at which to declare
convergence. The value of `eps` should be tuned separately depending
on whether or not `normalized_stress` is being used.
on whether or not `normalized_stress` is being used. Ignored if
`solver=='eigh'`.

n_jobs : int, default=None
n_jobs : int or None, optional (default=None)
The number of jobs to use for the computation. If multiple
initializations are used (``n_init``), each run of the algorithm is
computed in parallel.

``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
``-1`` means using all processors. See :term:`Glossary <n_jobs>`
for more details.
for more details. Ignored if `solver=='eigh'`.

random_state : int, RandomState instance or None, default=None
Determines the random number generator used to initialize the centers.
Expand All @@ -445,6 +517,10 @@ class MDS(BaseEstimator):
Pre-computed dissimilarities are passed directly to ``fit`` and
``fit_transform``.

solver : {'smacof', 'eigh'}, default = 'smacof'
The solver used for solving the MDS problem. The `eigh` solver is only
usable when `metric=False` but is often significantly faster.

normalized_stress : bool or "auto" default=False
Whether use and return normed stress value (Stress-1) instead of raw
stress calculated by default. Only supported in non-metric MDS.
Expand Down Expand Up @@ -484,6 +560,7 @@ class MDS(BaseEstimator):

n_iter_ : int
The number of iterations corresponding to the best stress.
It is set to ``None`` if ``solver=='eigh'``.

See Also
--------
Expand Down Expand Up @@ -530,6 +607,7 @@ class MDS(BaseEstimator):
"n_jobs": [None, Integral],
"random_state": ["random_state"],
"dissimilarity": [StrOptions({"euclidean", "precomputed"})],
"solver": [StrOptions({"smacof", "eigh"})],
"normalized_stress": [
"boolean",
StrOptions({"auto"}),
Expand All @@ -549,11 +627,13 @@ def __init__(
n_jobs=None,
random_state=None,
dissimilarity="euclidean",
solver="smacof",
normalized_stress="warn",
):
self.n_components = n_components
self.dissimilarity = dissimilarity
self.metric = metric
self.solver = solver
self.n_init = n_init
self.max_iter = max_iter
self.eps = eps
Expand Down Expand Up @@ -582,7 +662,7 @@ def fit(self, X, y=None, init=None):
init : ndarray of shape (n_samples, n_components), default=None
Starting configuration of the embedding to initialize the SMACOF
algorithm. By default, the algorithm is initialized with a randomly
chosen array.
chosen array. Ignored if ``solver=='eigh'``.

Returns
- E85D ------
Expand Down Expand Up @@ -610,7 +690,7 @@ def fit_transform(self, X, y=None, init=None):
init : ndarray of shape (n_samples, n_components), default=None
Starting configuration of the embedding to initialize the SMACOF
algorithm. By default, the algorithm is initialized with a randomly
chosen array.
chosen array. Ignored if ``solver=='eigh'``.

Returns
-------
Expand All @@ -620,7 +700,7 @@ def fit_transform(self, X, y=None, init=None):
X = self._validate_data(X)
if X.shape[0] == X.shape[1] and self.dissimilarity != "precomputed":
warnings.warn(
"The MDS API has changed. ``fit`` now constructs an"
"The MDS API has changed. ``fit`` now constructs a"
" dissimilarity matrix from data. To use a custom "
"dissimilarity matrix, set "
"``dissimilarity='precomputed'``."
Expand All @@ -631,19 +711,26 @@ def fit_transform(self, X, y=None, init=None):
elif self.dissimilarity == "euclidean":
self.dissimilarity_matrix_ = euclidean_distances(X)

self.embedding_, self.stress_, self.n_iter_ = smacof(
self.dissimilarity_matrix_,
metric=self.metric,
n_components=self.n_components,
init=init,
n_init=self.n_init,
n_jobs=self.n_jobs,
max_iter=self.max_iter,
verbose=self.verbose,
eps=self.eps,
random_state=self.random_state,
return_n_iter=True,
normalized_stress=self.normalized_stress,
)

if self.solver == "smacof":
self.embedding_, self.stress_, self.n_iter_ = smacof(
self.dissimilarity_matrix_,
metric=self.metric,
n_components=self.n_components,
init=init,
n_init=self.n_init,
n_jobs=self.n_jobs,
max_iter=self.max_iter,
verbose=self.verbose,
eps=self.eps,
random_state=self.random_state,
return_n_iter=True,
normalized_stress=self.normalized_stress,
)
elif self.solver == "eigh":
if not self.metric:
raise ValueError("Using the eigh solver requires metric=True")
self.embedding_, self.stress_ = eigh_scaler(
self.dissimilarity_matrix_, n_components=self.n_components
)
self.n_iter_ = None
return self.embedding_
Loading
0