8000 ENH Add eigh as a solver in MDS by Micky774 · Pull Request #22330 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

ENH Add eigh as a solver in MDS #22330

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 47 commits into
base: main
Choose a base branch
from

Conversation

Micky774
Copy link
Contributor
@Micky774 Micky774 commented Jan 29, 2022

Reference Issues/PRs

Fixes #15272
Resolves #16067 (Stalled)

What does this implement/fix? Explain your changes.

PR #16067:
Adds implementation of Multidimensional scaling (MDS), which uses Singular Value Decompostion (SVD) method. SVD works much faster and far more accurate for Euclidean matrixes than SMACOF algorithm (current implementation of MDS in sklearn/manifold module).

This PR:
Addresses review comments and provides further forum for discussion regarding API.

Any other comments?

@Micky774
Copy link
Contributor Author
Micky774 commented Jan 29, 2022

One minor note is that since we default metric=True, I went ahead and changed the default solver to svd since I wager that in most cases where a user is operating MDS in a metric context, the dissimilarity matrix will be Euclidean for the solver.

On that note, re: @smarie 's comment

The term Euclidean, in this context, refers to a property that dissimilarity matrices may have. Specifically:
image
which is from

"Metric and Euclidean properties of dissimilarity coefficients"
J. C. Gower and P. Legendre, Journal of Classification 3, 5–48 (1986)

I do, however, think that the term is confusing at best since it's an overloaded name. Perhaps we ought to consider rephrasing the condition, or clarifying what Euclidean means in this specific context.

Also, regarding why this method uses the eigen-solver, the classical metric MDS strategy directly uses the eigenvalue decomposition.

@Micky774
Copy link
Contributor Author
Micky774 commented Jan 29, 2022

One minor note is that since we default metric=True, I went ahead and changed the default solver to svd since I wager that in most cases where a user is operating MDS in a metric context, the dissimilarity matrix will be Euclidean for the solver.

Actually I changed it to begin with solver='auto' and resolve it to svd or smacof based on whether metric=True or not, resp.

Ignore all of that. I have removed auto and gone back to defaulting to smacof to maintain generality of a default MDS estimator.

@cmarmo
Copy link
Contributor
cmarmo commented Feb 3, 2022

Hi @Micky774, thanks for your follow-up! For reviewers, please notice that the failures are related to #22061 : this PR is ready for review. Thanks!

@Micky774
Copy link
Contributor Author

@NicolasHug @adrinjalali I wonder if either of you would be interested in reviewing this

@Micky774
Copy link
Contributor Author

Okay I'm not sure why it's failing the CI linting test:
image

When I run black locally, I get no issue (indeed it didn't stop me during pre-commit either). When I make the change that the CI black is suggesting, then re-run it, I get this result.
image

Is the CI using a different version of black? Or is there something else I'm missing?

@thomasjpfan
Copy link
Member

We recently update our version of black to the stable versionersino. Installing black==22.1.0 should resolve the issue.

For merge with main and pre-commit should detect the new config and install the new version of black. Be sure the config looks like this:

- repo: https://github.com/psf/black
rev: 22.1.0

Copy link
Member
@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this called the "Classical MDS algorithm"?

Can we run a benchmark on a decently sized dataset to compare the new solver and SMACOF?

@dgolteanu
Copy link

Sorry for an ignorant question, is this PR missing something or any chance it will make it in 1.2? Excitedly following from the sidelines :)

@Micky774
Copy link
Contributor Author

Sorry for an ignorant question, is this PR missing something or any chance it will make it in 1.2? Excitedly following from the sidelines :)

Sorry for the late reply -- it is currently waiting on reviewers, so it is not likely to make it into 1.2 unfortunately.

@github-actions
Copy link
github-actions bot commented Jul 27, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 6cd2ad8. Link to the linter CI: here

@jolespin
Copy link

Has this been integrated into any beta versions to use?

@dkobak
Copy link
Contributor
dkobak commented Oct 23, 2024

While I support this PR in principle, I don't like the suggested API and some of the suggested wording in the docs. This PR implements "classical MDS" (aka "principal coordinates analysis", PCoA). Classical MDS optimizes a different loss function compared to metric MDS. The former optimizes what is called "strain" while the latter optimizes what is called "stress". For the "strain" there is an exact solution in terms of eigendecomposition. For the stress there is not, which is why SMACOF algorithm is used instead. These two things are not approximations of each other, and are not supposed to give the same result.

For that reason, I don't like solver="smacof" and solver="eigh", because these are not two different solvers optimizing the same loss, but two different optimization problems. I think it should be called algorithm or maybe flavor. And the two options I would call algorithm="classical" vs. algorithm="metric". As there is also non-metric MDS, currently available via metric=False, it could be refactored as algorithm="non-metric". Unfortunately this API would need to go through a deprecation cycle...

Alternatively, we could make classical MDS available via classical=True, by default set to classical=False.

@adrinjalali
Copy link
Member

@antoinebaker would you be able to have a look here maybe?

@Micky774
Copy link
Contributor Author
Micky774 commented Nov 4, 2024

I favor the algorithm : {'classical', 'metric', 'non-metric'}, default = 'classical' approach w/ deprecation of metric. I'll begin updating this PR, if folks are okay with this API moving forwards.

@dkobak
Copy link
Contributor
dkobak commented Nov 4, 2024

@Micky774 IMHO the default should be metric, for backwards compatibility (and also because I do first think of non-classical metric MDS when somebody says "MDS").

@dkobak
Copy link
Contributor
dkobak commented Nov 22, 2024

@Micky774 Are you still planning on updating this PR according to what was discussed above, or should somebody take over?

@antoinebaker
Copy link
Contributor
antoinebaker commented Nov 29, 2024

Thanks @Micky774 for the PR!

For the API, I think we should consider implementing a new estimator ClassicalMDS or PCoA.

As correctly pointed out by @dkobak

While I support this PR in principle, I don't like the suggested API and some of the suggested wording in the docs. This PR implements "classical MDS" (aka "principal coordinates analysis", PCoA). Classical MDS optimizes a different loss function compared to metric MDS. The former optimizes what is called "strain" while the latter optimizes what is called "stress". For the "strain" there is an exact solution in terms of eigendecomposition. For the stress there is not, which is why SMACOF algorithm is used instead. These two things are not approximations of each other, and are not supposed to give the same result.

I feel that adding a solver or algorithm parameter has several drawbacks, from the developer and user perspective:

  • the SMACOF and PCoA do not share any logic in the code
  • it will add confusion to the already confusing User Guide section, with too many nested sections and mathematical details (eg the different optimization problems, the different losses)
  • it will add confusion to the MDS parameters and attributes
    • many parameters are SMACOF specific n_init, max_iter, verbose, eps, n_jobs, random_state, normalized_stress and irrelevant for PCoA
    • for PCoA we would like to have a strain_ attribute (that's what is optimized) instead of the (normalized) stress_ attribute for SMACOF
    • the attribute n_iter_ is SMACOF specific
    • the resulting docstring will be convoluted

Keeping MDS and adding PCoA instead will make it clear that we have two different algorithms, tackling different optimization problems, with different behaviors (SMACOF is stochastic and iterative while PCoA is deterministic). In the User Guide we should of course cross reference PCoA and MDS, for instance advice PCoA as a (faster) alternative for metric MDS on dissimilarities computed from euclidian distances. Also it will be easier to implement future developments for the SMACOF/PCoA separately, eg Array API support or sample weight.

What do you think @Micky774, @dkobak and @adrinjalali ?

@dkobak
Copy link
Contributor
dkobak commented Nov 29, 2024

Good points @antoinebaker. I don't have a strong opinion here. I find that many people first think of classical MDS when I say "MDS", so that would be an argument in favor of having classical MDS inside the MDS class. But I do see your point.

Regarding the name of a new class, I think in the ML community more people are familiar with the term "MDS" than with the term "PCoA", but on the other hand PCoA looks better as a class name than ClassicalMDS. On the yet other hand, there may be unnecessary confusion with PCA. So perhaps ClasicalMDS?

@antoinebaker
Copy link
Contributor

ClassicalMDS sounds good, that way users searching for "MDS" (unspecified) will find both classes.

@Micky774
Copy link
Contributor Author

cc @scikit-learn/core-devs I think this is a nice opportunity to clean up our MDS support a bit so I want to get some opinions on the way forwards here

@adrinjalali
Copy link
Member

I'm happy with the suggestions here.

@dkobak
Copy link
Contributor
dkobak commented Apr 2, 2025

Hi @Micky774. Are you still planning to get back to this PR and rework it to implement a new class ClassicalMDS as we discussed? If not, I would maybe go ahead and create a separate PR for that. But of course it would be great if you would adjust this PR.

@adrinjalali
Copy link
Member

@dkobak I think you can go ahead and have a PR superseding this one.

@dkobak
Copy link
Contributor
dkobak commented Apr 23, 2025

@adrinjalali Thanks, I will try to find time for this after my pending PR #31117 gets merged. If you could take a look there, it'd be much appreciated by the way!

@dkobak
Copy link
Contributor
dkobak commented May 6, 2025

@adrinjalali @antoinebaker I took a stab at it at #31322.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sklearn MDS vs skbio PCoA
9 participants
0