[MRG] Extending MDS to new data #9834
jsoutherland wants to merge 19 commits into scikit-learn:main from
Conversation
…valent to PCA transform. The method introduces errors as new points are projected, compared to a new projection of all points. See Bengio, Yoshua, et al. "Out-of-sample extensions for lle, isomap, mds, eigenmaps, and spectral clustering." Advances in Neural Information Processing Systems 16 (2004): 177-184.
I posted this previously regarding the failing tests: there are some tests failing and I could use guidance on the best way to resolve them. Now that MDS is capable of acting as a transformer, a lot of general-purpose tests are applied. MDS is by default not capable of using `transform` on new data. There are many possible ways to fix this.

Are there any suggestions on how to best fix these errors? Thanks
I think that you could use the same approach as in scikit-learn/sklearn/pipeline.py, lines 399 to 420 (at 8de18e6). If that works, that means that MDS would have `transform` only when … A random question while I am at it: any particular reason why …?
@lesteve Thank you for the pointer, I will give that a try. I'm not sure whether the results are better. For me, the appeal is that it is much faster and can be extended; those properties combined allow for visualizing many more points. We are probably using more memory, but the old version was limited to a few thousand points unless you were willing to wait hours or days. I think the two reasons for not switching the default are just changing the behavior and not having a great understanding of the quantitative differences.
@lesteve I tried the suggested approach. All tests are passing and I have improved code coverage for the MDS class by quite a bit. This is ready for review.
Sorry, there's not a lot of core dev availability lately.

@jnothman no worries, I just wanted to make sure that I wasn't missing a step.
sklearn/manifold/mds.py (outdated)

    max_iter=300, verbose=0, eps=1e-3, n_jobs=1,
    random_state=None, dissimilarity="euclidean",
    extendible=False):
*"extensible" or "extendable"
Perhaps the correct term is "inductive"
Please document the parameter in the class docstring.
I will switch over to using "inductive".
sklearn/manifold/mds.py (outdated)

    """
    if not self.extendible:
        raise ValueError("Method only available if extendible is True.")
sklearn/manifold/tests/test_mds.py (outdated)

    # Test non-parallel case
    mds_clf = mds.MDS(metric=False, n_jobs=1, dissimilarity="precomputed")
    mds_clf.fit(sim)
    mds_clf.fit_transform(sim)
Should you be testing that the model is identical?
That would be a good improvement. I will add that in.
I found there is randomness that cannot be controlled by setting MDS.random_state. This was true before this PR... I suggest opening an issue and solving that in a follow-up PR?
Are you OK with holding off on fixing this issue? I looked into it more and the fixes required are out of scope for this PR. We can either keep these new tests (checking execution but not results) for the different paths through the old/default SMACOF method, or I can remove them.
Could you please open an issue as I do not feel expert enough in this to understand where the randomness is coming from myself.
sklearn/manifold/tests/test_mds.py (outdated)

    # test fit_transform under the extendible case
    mds_clf = mds.MDS(dissimilarity="euclidean", extendible=True)
    mds_clf.fit_transform(sim)
Should the output of this be identical to the extendible=False case?
I don't believe it should be the same.
It would not, the inductive=False case uses the SMACOF algorithm to find projection coordinates for the points, which is an iterative minimisation procedure on a loss function ("stress"). The inductive=True case produces the closed form solution of this minimisation, taking the loss function to be the "metric stress" (special case of the general "stress"). The closed form solution involves performing eigendecomposition on the centred dissimilarity matrix.
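The closed-form route described above can be sketched with plain NumPy. This is a minimal illustration of classical (Torgerson) MDS under the assumption of a Euclidean distance matrix, not the PR's actual code; all names here are mine:

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Closed-form MDS: double-center the squared dissimilarities,
    then take the top eigenvectors of the resulting Gram matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                     # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:n_components]     # keep the largest ones
    w, V = w[idx], V[:, idx]
    return V * np.sqrt(np.maximum(w, 0))         # scale eigenvectors to coordinates

# four points on a line; the embedding recovers distances up to sign
X = np.array([[0.0], [1.0], [2.0], [4.0]])
D = np.abs(X - X.T)
Y = classical_mds(D, n_components=1)
```

Because these distances are exactly Euclidean in one dimension, the recovered pairwise distances match the input matrix, which is a handy sanity check that SMACOF (being an iterative local minimiser) does not offer in closed form.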
sklearn/manifold/tests/test_mds.py (outdated)

    # Test non-parallel case
    mds_clf = mds.MDS(metric=False, n_jobs=1, dissimilarity="precomputed")
    mds_clf.fit(sim)
    mds_clf.fit_transform(sim)
Not yet fixed.
sklearn/manifold/tests/test_mds.py (outdated)

    # test fit and transform for precomputed case
    mds_clf = mds.MDS(dissimilarity="precomputed", inductive=True)
    mds_clf.fit(sim, init=Z)
    mds_clf.transform(sim)
Should this be identical to the result of fit_transform above? assert, please.
I believe it should be... will fix.
sklearn/manifold/tests/test_mds.py (outdated)

    # Testing for extending MDS to new points
    sim2 = np.array([[3, 1, 1, 2],
                     [4, 1, 2, 2]])
    mds_clf.transform(sim2)
what is the expected output?
sklearn/manifold/mds.py (outdated)

    Pre-computed dissimilarities are passed directly to ``fit`` and
    ``fit_transform``.

    inductive : boolean, optional, default: False
Perhaps we should have a method or algorithm or solver parameter rather than inductive. Remind me: do we get the same embedding of the training data, or an embedding with similar properties, either way? Does one or the other have higher computational costs?
It's a different embedding. The new algorithm (which can be applied to out-of-sample points) is faster, but there is an increased memory requirement as we store a matrix and I believe the stress value may not be quite as good.
Then I think a "method"-style parameter would be better than "inductive". It makes clear that you can get different results, with different computational properties, etc.
You don't seem to test the correctness of the algorithm in fit_transform, i.e. that it derives an appropriate embedding and model...
    if self.method == 'inductive':
        self.X_train_ = X
        if self.dissimilarity == 'precomputed':
            D = X
this is never tested
        D = X
    elif self.dissimilarity == 'euclidean':
        D = euclidean_distances(X)
    self.D_XX_ = euclidean_distances(self.X_train_, self.X_train_)
this is a public attribute, by virtue of its name. It is not documented.
    y: Ignored
        NB: similarity matrix has to be centered, use the
        make_euclidean_similarities function to create it.
please remove extra blank line
    if self.dissimilarity == 'precomputed':
        D_new = X
    elif self.dissimilarity == 'euclidean':
        if not hasattr(self, 'X_train_') \
Please use check_is_fitted
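For reference, the suggested pattern looks roughly like this. The estimator below is a toy stand-in of my own naming, only meant to show how `check_is_fitted` replaces the manual `hasattr`/`None` checks:

```python
from sklearn.base import BaseEstimator
from sklearn.utils.validation import check_is_fitted

class ToyInductiveMDS(BaseEstimator):
    """Toy stand-in illustrating the check_is_fitted pattern."""

    def fit(self, X, y=None):
        self.X_train_ = X  # trailing underscore marks a fitted attribute
        return self

    def transform(self, X):
        # raises NotFittedError with a standard message if fit() was
        # never called, instead of hand-rolled hasattr/None checks
        check_is_fitted(self, ['X_train_'])
        return X
```

Using the standard helper also gives users the same error message they see from every other scikit-learn estimator.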
    if not hasattr(self, 'X_train_') \
            or not hasattr(self, 'D_XX_') \
            or self.X_train_ is None \
            or self.D_XX_ is None:
why would this be the case??
        self.fit(X)
        return self.transform(X)

    def center_similarities(self, D_aX, D_XX):
I think this should be a private method, unless you have a very good reason otherwise.
    # calling transform with inductive=False causes an error
    mds_clf = mds.MDS(dissimilarity="euclidean")
    assert_raises(ValueError, mds_clf.transform, sim)
It might be better to use assert_raises_regexp so we're sure we're getting caught in the right ValueError...
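A rough sketch of the idea, with a hand-rolled helper standing in for sklearn's `assert_raises_regexp` and a stub for the transform call so it runs standalone (both names here are mine, not the PR's code):

```python
import re

def assert_raises_regexp(exc_type, pattern, func, *args):
    """Tiny stand-in for sklearn.utils.testing.assert_raises_regexp:
    the call must raise exc_type AND its message must match pattern."""
    try:
        func(*args)
    except exc_type as e:
        assert re.search(pattern, str(e)), str(e)
    else:
        raise AssertionError("%s not raised" % exc_type.__name__)

def transform_stub(X):
    # stand-in for MDS.transform in the non-inductive case
    raise ValueError("Method only available if extendible is True.")

# passes only when the expected ValueError message is raised
assert_raises_regexp(ValueError, "extendible is True", transform_stub, None)
```

Matching on the message guards against the test silently passing on an unrelated ValueError raised earlier in the call, e.g. from input validation.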
    sim2 = np.array([[3, 1, 1, 2],
                     [4, 1, 2, 2]])
    result = mds_clf.transform(sim2)
    expected = np.array([[-.705, -.452],
Are these values something you've derived by hand? Or just what the model happens to output?
    result1 = mds_clf.fit_transform(sim)

    # test fit and transform for precomputed case
    mds_clf = mds.MDS(dissimilarity="euclidean", method="inductive")
precomputed case?
@jnothman @jsoutherland Really looking forward to this feature for a current research project. Is it likely to be added in the next couple of months?
@JHibbard I don't think so. The tests have been too difficult to implement to the higher standards now in place for MDS. The algorithm is too stochastic to allow for straightforward testing. Also, previous tests largely checked that MDS does not encounter errors, rather than correctness.
FWIW, we recently improved correctness testing in t-SNE substantially. Might be worth a look.
@jsoutherland @jnothman I'd be interested in investing some time in this PR, but I'm not exactly sure where to start. Is it just the tests failing, or mainly the open review? Is there anything I can do to help with this PR?
It could do with having stronger tests, but also needs the comments above to be resolved. Thanks @shukon
@shukon help would be appreciated. I'd still like to see this get in. It was passing tests at a92d36a. After improving test coverage (mostly based on reproducibility), it is failing due to randomness in a dependency here:

It is also failing the new parallel-computation tests due to #10119.
Reference Issue
Continuation of #6222.
What does this implement/fix? Explain your changes.
Extended the MDS object to include a transform method for out-of-sample points, as described in:
http://papers.nips.cc/paper/2461-out-of-sample-extensions-for-lle-isomap-mds-eigenmaps-and-spectral-clustering.pdf
The SMACOF algorithm is used in the non-extendible (default) case, but not in the extendible case, which instead requires an eigendecomposition of the (dis)similarity matrix.
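To illustrate the out-of-sample idea from the Bengio et al. paper, here is a small NumPy sketch of my own (a simplified illustration, not the PR's implementation): fit closed-form MDS on training points, then embed a new point by centering its squared distances against the training column means and projecting onto the stored eigenvectors.

```python
import numpy as np

def fit_classical_mds(X, n_components=1):
    """Closed-form MDS fit: double-center squared distances, eigendecompose."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared pairwise distances
    n = len(X)
    J = np.eye(n) - 1.0 / n                              # centering matrix
    w, V = np.linalg.eigh(-0.5 * J @ D2 @ J)
    idx = np.argsort(w)[::-1][:n_components]             # largest eigenvalues first
    return {"X": X, "D2": D2, "V": V[:, idx], "w": w[idx]}

def transform_oos(a, m):
    """Embed a new point a using only the stored training-set quantities."""
    d2a = ((m["X"] - a) ** 2).sum(-1)                    # squared distances to training set
    b = -0.5 * (d2a - m["D2"].mean(axis=0))              # center against training column means
    return (m["V"].T @ b) / np.sqrt(m["w"])

X_train = np.array([[0.0], [1.0], [2.0], [4.0]])
model = fit_classical_mds(X_train)
Y_train = model["V"] * np.sqrt(model["w"])               # training embedding
y_new = transform_oos(np.array([3.0]), model)            # out-of-sample embedding
```

For exactly Euclidean 1-D data like this, the out-of-sample point lands at the position consistent with its true distances to the training points, which is the behavior the PR's transform targets.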
Originally implemented by @webdrone in #6222. It has since been modified to alter the fit/transform API and speed up subsequent MDS.transform() calls.

Any other comments?