[MRG] Extending MDS to new data #9834
Conversation
…valent to PCA transform. The method introduces errors as new points are projected, compared to a fresh projection of all points. See Bengio, Yoshua, et al. "Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering." Advances in Neural Information Processing Systems 16 (2004): 177-184.
I posted this previously regarding the failing tests: there are some tests failing and I could use guidance on the best way to resolve them. Now that MDS is capable of acting as a transformer, a lot of general-purpose tests are applied. MDS is by default not capable of using
There are many possible ways to fix this:
Are there any suggestions on how to best fix these errors? Thanks
I think that you could use the same approach as in `sklearn/pipeline.py`, lines 399 to 420 at 8de18e6.
If that works, that means that MDS would have transform only when … A random question while I am at it: any particular reason why …
@lesteve Thank you for the pointer; I will give that a try. I'm not sure if the results are better. For me, the appeal is that it is much faster and can be extended, and those properties combined allow for visualizing many more points. We are probably using more memory, but the old version was limited to a few thousand points unless you were willing to wait hours or days. I think the two reasons for not switching the default are changing existing behavior and not having a great understanding of the quantitative differences.
@lesteve I tried the suggested approach. All tests are passing and I have improved code coverage for the MDS class by quite a bit. This is ready for review.
Sorry, there's not a lot of core dev availability lately.
@jnothman no worries, I just wanted to make sure that I wasn't missing a step.
Cursory initial review.
sklearn/manifold/mds.py
Outdated
""" | ||
def __init__(self, n_components=2, metric=True, n_init=4, | ||
max_iter=300, verbose=0, eps=1e-3, n_jobs=1, | ||
random_state=None, dissimilarity="euclidean"): | ||
random_state=None, dissimilarity="euclidean", | ||
extendible=False): |
*"extensible" or "extendable"
Perhaps the correct term is "inductive"
Please document the parameter in the class docstring.
I will switch over to using "inductive".
sklearn/manifold/mds.py
Outdated
""" | ||
|
||
if not self.extendible: | ||
raise ValueError("Method only available if extendible is True.") |
Please test this
sklearn/manifold/tests/test_mds.py
Outdated
    # Test non-parallel case
    mds_clf = mds.MDS(metric=False, n_jobs=1, dissimilarity="precomputed")
    mds_clf.fit(sim)
    mds_clf.fit_transform(sim)
Should you be testing that the model is identical?
That would be a good improvement. I will add that in.
I found there is randomness that cannot be controlled by setting MDS.random_state. This was true before this PR... I suggest opening an issue and solving that in a follow-up PR?
Not yet fixed.
Are you accepting that we hold off on fixing this issue? I looked into it more, and the fixes required are out of scope for this PR. We can either keep these new tests (which check execution but not results) of the different paths through the old/default smacof method, or I can remove them.
Could you please open an issue, as I do not feel expert enough in this to understand where the randomness is coming from myself.
Sure, no problem.
sklearn/manifold/tests/test_mds.py
Outdated
    # test fit_transform under the extendible case
    mds_clf = mds.MDS(dissimilarity="euclidean", extendible=True)
    mds_clf.fit_transform(sim)
Should the output of this be identical to the extendible=False case?
I don't believe it should be the same.
It would not, the inductive=False case uses the SMACOF algorithm to find projection coordinates for the points, which is an iterative minimisation procedure on a loss function ("stress"). The inductive=True case produces the closed form solution of this minimisation, taking the loss function to be the "metric stress" (special case of the general "stress"). The closed form solution involves performing eigendecomposition on the centred dissimilarity matrix.
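For readers following along, that closed-form solution is classical (Torgerson) MDS: double-centre the squared-distance matrix and take the top eigenpairs. A minimal numpy sketch, where the function name and variables are illustrative rather than the PR's actual implementation:

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Embed points given a matrix D of pairwise Euclidean distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)     # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.maximum(eigvals[idx], 0.0))
    return eigvecs[:, idx] * scale           # embedding coordinates

# Four corners of a unit square: the 2-D embedding should preserve distances.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D)
```

Unlike SMACOF there is no iteration and no random initialisation, which is why this path can give reproducible output and support out-of-sample projection.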
@jnothman I have made a full pass on the initial review
sklearn/manifold/tests/test_mds.py
Outdated
    # Test non-parallel case
    mds_clf = mds.MDS(metric=False, n_jobs=1, dissimilarity="precomputed")
    mds_clf.fit(sim)
    mds_clf.fit_transform(sim)
Not yet fixed.
sklearn/manifold/tests/test_mds.py
Outdated
    # test fit and transform for precomputed case
    mds_clf = mds.MDS(dissimilarity="precomputed", inductive=True)
    mds_clf.fit(sim, init=Z)
    mds_clf.transform(sim)
Should this be identical to the result of fit_transform above? assert, please.
I believe it should be... will fix.
sklearn/manifold/tests/test_mds.py
Outdated
    # Testing for extending MDS to new points
    sim2 = np.array([[3, 1, 1, 2],
                     [4, 1, 2, 2]])
    mds_clf.transform(sim2)
what is the expected output?
sklearn/manifold/mds.py
Outdated
    @@ -330,6 +331,12 @@ class MDS(BaseEstimator):
         Pre-computed dissimilarities are passed directly to ``fit`` and
         ``fit_transform``.

    +    inductive : boolean, optional, default: False
Perhaps we should have a `method`, `algorithm`, or `solver` parameter rather than `inductive`. Remind me: do we get the same embedding of the training data, or an embedding with similar properties, either way? Does one or the other have higher computational costs?
It's a different embedding. The new algorithm (which can be applied to out-of-sample points) is faster, but there is an increased memory requirement as we store a matrix and I believe the stress value may not be quite as good.
Then I think a "method"-style parameter would be better than "inductive". It makes clear that you can get different results, with different computational properties, etc.
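A hedged sketch of what such a "method"-style parameter could look like; the parameter name and its values ("smacof", "inductive") are assumptions for illustration, not the final scikit-learn API:

```python
# Toy estimator showing a "method"-style parameter with fit-time
# validation; names here are illustrative, not the PR's actual code.
class MDSSketch:
    _VALID_METHODS = ("smacof", "inductive")

    def __init__(self, n_components=2, method="smacof"):
        self.n_components = n_components
        self.method = method

    def fit(self, X):
        # Validate in fit, per scikit-learn convention, so __init__ only
        # stores parameters and grid search can set them freely.
        if self.method not in self._VALID_METHODS:
            raise ValueError("method must be one of %r, got %r"
                             % (self._VALID_METHODS, self.method))
        return self
```

A string parameter also leaves room to add further solvers later without another boolean flag.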
You don't seem to test the correctness of the algorithm in fit_transform, i.e. that it derives an appropriate embedding and model...
    if self.method == 'inductive':
        self.X_train_ = X
        if self.dissimilarity == 'precomputed':
            D = X
this is never tested
            D = X
        elif self.dissimilarity == 'euclidean':
            D = euclidean_distances(X)
            self.D_XX_ = euclidean_distances(self.X_train_, self.X_train_)
This is a public attribute, by virtue of its name, but it is not documented.
    y : Ignored

        NB: similarity matrix has to be centered; use the
        make_euclidean_similarities function to create it.
please remove extra blank line
    if self.dissimilarity == 'precomputed':
        D_new = X
    elif self.dissimilarity == 'euclidean':
        if not hasattr(self, 'X_train_') \
Please use check_is_fitted
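A sketch of that suggestion: `check_is_fitted` replaces the manual `hasattr`/`None` checks and raises a consistent `NotFittedError`. The class below is a toy stand-in, not the PR's MDS code; only the trailing-underscore attribute name mirrors it.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

class InductiveSketch(BaseEstimator):
    def fit(self, X):
        self.X_train_ = np.asarray(X)
        return self

    def transform(self, X):
        # Raises NotFittedError if fit has not set X_train_, replacing
        # the ad-hoc hasattr / is-None checks with the standard helper.
        check_is_fitted(self, "X_train_")
        return np.asarray(X)
```

Since fitted attributes are only ever set in `fit`, the `is None` branches become unnecessary.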
    if not hasattr(self, 'X_train_') \
            or not hasattr(self, 'D_XX_') \
            or self.X_train_ is None \
            or self.D_XX_ is None:
why would this be the case??
        self.fit(X)
        return self.transform(X)

    def center_similarities(self, D_aX, D_XX):
I think this should be a private method, unless you have a very good reason otherwise.
    # calling transform with inductive=False causes an error
    mds_clf = mds.MDS(dissimilarity="euclidean")
    assert_raises(ValueError, mds_clf.transform, sim)
It might be better to use assert_raises_regexp, so we're sure we're getting caught in the right ValueError...
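To illustrate the point: matching on the message ensures the test catches the intended `ValueError` rather than an unrelated one. `assert_raises_regexp` lived in `sklearn.utils.testing`, which has since been deprecated; `pytest.raises(..., match=...)` is the modern equivalent. The stub and message below mirror the snippet quoted earlier in the review and are not the PR's real code.

```python
import pytest

def transform_stub(X):
    # Stand-in for MDS.transform when the inductive path is disabled.
    raise ValueError("Method only available if extendible is True.")

def test_transform_raises_informative_error():
    # match= is a regex searched against str() of the raised exception,
    # so only the intended ValueError satisfies the test.
    with pytest.raises(ValueError, match="extendible is True"):
        transform_stub(None)
```

A bare `assert_raises(ValueError, ...)` would also pass if, say, input validation raised a different `ValueError` first.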
    sim2 = np.array([[3, 1, 1, 2],
                     [4, 1, 2, 2]])
    result = mds_clf.transform(sim2)
    expected = np.array([[-.705, -.452],
Are these values something you've derived by hand? Or just what the model happens to output?
    result1 = mds_clf.fit_transform(sim)

    # test fit and transform for precomputed case
    mds_clf = mds.MDS(dissimilarity="euclidean", method="inductive")
precomputed case?
@jnothman @jsoutherland Really looking forward to this feature for a current research project. Is it likely to be added in the next couple of months?
@JHibbard I don't think so. The tests have been too difficult to implement to the higher standards now in place for MDS. The algorithm is too stochastic to allow for straightforward testing. Also, previous tests largely checked that MDS does not encounter errors, rather than checking correctness.
FWIW, we recently improved correctness testing in t-SNE substantially. Might be worth a look.
@jsoutherland @jnothman I'd be interested in investing some time in this PR, but I'm not exactly sure where to start. Is it just the failing tests, or mainly the open review? Is there anything I can do to help with this PR?
It could do with stronger tests, but it also needs the comments above to be resolved. Thanks @shukon
@shukon Help would be appreciated; I'd still like to see this get in. It was passing tests at a92d36a. After improving test coverage (mostly around reproducibility), it is failing due to randomness in a dependency here: Also failing the new parallel-computation tests due to #10119
Reference Issue
Continuation of #6222.
What does this implement/fix? Explain your changes.
Extended MDS object to include a transform method for out-of-sample points as described here:
http://papers.nips.cc/paper/2461-out-of-sample-extensions-for-lle-isomap-mds-eigenmaps-and-spectral-clustering.pdf
The SMACOF algorithm is used in the non-extendible (default) case, but not in the extendible case, which instead requires an eigendecomposition of the (dis)similarity matrix.
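The out-of-sample transform step from the linked paper can be sketched in a few lines of numpy: embed new points by centring their squared distances against the training distances and projecting onto the training eigenvectors. Variable names here are illustrative, not the PR's code.

```python
import numpy as np

def sq_dists(A, B):
    """Matrix of squared Euclidean distances between rows of A and B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)

rng = np.random.RandomState(0)
X = rng.rand(20, 3)                        # training points
D2 = sq_dists(X, X)                        # squared training distances

# "fit": eigendecomposition of the double-centred training matrix
n = len(X)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J
w, V = np.linalg.eigh(B)
idx = np.argsort(w)[::-1][:2]              # top 2 components
lam, V = w[idx], V[:, idx]
Y_train = V * np.sqrt(lam)                 # training embedding

# "transform": project new points without re-running the decomposition
X_new = rng.rand(5, 3)
d2 = sq_dists(X_new, X)
K_new = -0.5 * (d2
                - D2.mean(axis=0)[None, :]        # centre over training
                - d2.mean(axis=1, keepdims=True)  # centre over new point
                + D2.mean())                      # grand mean
Y_new = K_new @ V / np.sqrt(lam)
```

Applying the transform to the training points themselves reproduces `Y_train`, which is the consistency property the out-of-sample extension is built on.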
Originally implemented by @webdrone in #6222. It has since been modified to alter the `fit`/`transform` API and speed up subsequent `MDS.transform()` calls.

Any other comments?