-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
MDS implementation has error in smacof algorithm for non-metric case #18933
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I'm also experiencing unexpected behaviour with the |
I am happy to hear that I am not the only one that encountered this problem. @mpdprot Is there a way that you could provide an additional example with the actual result and the expected result? I was struggling a lot with that... |
To be honest I'm not sure whether the non-metric algorithm does what I thought it does, I was trying to recover 3D coordinates in the absence of some information. Using the non-metric MDS SKLearn implementation allows us to use zero values as substitutes for missing information.
These are the recovered coordinates in non-metric mode: https://i.stack.imgur.com/yqadb.png |
Tagging like a bug. I would be happy to have a fix with a non-regression test. |
Thanks for submitting this bug. I agree that the current implementation does not make sense, and does not work either. I just made a PR that fixes it (together with some other things): #30514. |
Awesome, thank you @dkobak ! I am sorry that I could never submit a PR for that but very happy that you did. |
@henrymartin1 If you have a moment to take a look at my PR and comment / express support over there, it could be helpful to make it go forward. Cheers! |
Dear Scikit team,
I really enjoy using the scikit learn library and I am very greatful for your very impactful work. I hope I can contribute with this bug report (if it is true) to this library.
Keep on your good work!
I checked existing issues that concern MDS it is not the same as #16846 but to solve it I would also address #11381
Describe the bug
Everything concerns the MDS implementation in
scikit-learn/sklearn/manifold/_mds.py
. I have copied the relevant snipped of code below and added the line numbers I am referring to. All equations I am referring to are from [1]Short version:
Long version:
I found the following error in the implementation:
In
scikit-learn.sklearn.manifold._mds._smacof_single
in line 103 the disparities get initialized with the distances. In line 104 only the values of the upper triangle matrix get overwritten by the calculated disparities (disparities_flat
). This results in a matrix that has disparities (similar to rank distances) in the upper triangle and euclidean distances in the lower triangle. Therefore the normalization factor in line 106 is calculated based on a matrix that is half distances and half disparitiesLine 106 standardized the disparities to

which corresponds to step 3 and 9 of the
Smacof algorithm with Admissibly Transformed Proximities
[2]. However, from equation 9.1 it seems like this corresponds only to the upper triangle matrix (which explains then(n-1)/2
) in the implementation of line 106, all elements from the matrix are used which results in a wrong standardization factor.Also line 106 the term


(disparities ** 2).sum()
should not be in the square-root. This results from a transformation when we are looking for a normalization factora
that is applied to the disparities such that:We then should calculate
a
as(equations 9.1 + 9.2) from [1]
When the stress is computed in line 110, a difference between the dissimilarities and the disparities is computed. As stated above, the disparities had the dissimilarities in the lower triangle matrix (which I think is confusing) which would normally mean that the all differences in the lower triangle matrix are zero. However, as the disparities got scaled in line 106, the differences in the lower triangle matrix are no longer zero but some (more or less) random values.
There are also some smaller issues that I noticed:
References
[1]: “Modern Multidimensional Scaling - Theory and Applications” Borg, I.; Groenen P. Springer Series in Statistics (1997), Version from 2005
[2]: “Modern Multidimensional Scaling - Theory and Applications” Borg, I.; Groenen P. Springer Series in Statistics (1997), Version from 2005, Page 204 in Chapter 9. Metric and non-metric MDS
Code from scikit MDS implementation (for reference)
Steps/Code to Reproduce
The code and data to calculate the results is in this gist: (https://gist.github.com/henrymartin1/05a2aa5210e52216abd7ce61e6918e3a). The relevant file is
scikit_mds_bugreport.py
I honestly had a hard time to come up with code that proofs that there is a problem. This is what I did:
Expected Results
Actual Results
Result mds scikit
Result mds r
Versions
System:
python: 3.8.6 | packaged by conda-forge | (default, Oct 7 2020, 18:22:52) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\henry.conda\envs\evhomepv\python.exe
machine: Windows-10-10.0.19041-SP0
Python dependencies:
pip: 20.2.4
setuptools: 49.6.0.post20201009
sklearn: 0.23.2
numpy: 1.19.4
scipy: 1.5.3
Cython: None
pandas: 1.1.4
matplotlib: 3.3.2
joblib: 0.17.0
threadpoolctl: 2.1.0
Built with OpenMP: True
The text was updated successfully, but these errors were encountered: