[MRG] Extending MDS to new data by jsoutherland · Pull Request #9834 · scikit-learn/scikit-learn · GitHub

[MRG] Extending MDS to new data #9834


Open
wants to merge 19 commits into main

Conversation

@jsoutherland jsoutherland commented Sep 26, 2017

Reference Issue

Continuation of #6222.

What does this implement/fix? Explain your changes.

Extended the MDS object to include a transform method for out-of-sample points, as described here:
http://papers.nips.cc/paper/2461-out-of-sample-extensions-for-lle-isomap-mds-eigenmaps-and-spectral-clustering.pdf

The SMACOF algorithm is still used in the non-extendible (default) case, but not in the extendible case, which instead requires an eigendecomposition of the (dis)similarity matrix.
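For readers unfamiliar with the closed-form route, here is a minimal illustrative sketch (not the PR's actual code; all function names here are made up for the example): classical MDS embeds the training points via an eigendecomposition of the double-centered squared dissimilarity matrix, and the Bengio et al. (2004) formula embeds new points from their dissimilarities to the training set.

```python
import numpy as np

def classical_mds_fit(D, n_components=2):
    """Classical (metric) MDS: eigendecompose the double-centered
    squared dissimilarity matrix.  Illustrative sketch only."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # Gram matrix of centered data
    w, V = np.linalg.eigh(B)                   # ascending eigenvalues
    idx = np.argsort(w)[::-1][:n_components]   # keep the largest
    w, V = w[idx], V[:, idx]
    embedding = V * np.sqrt(np.maximum(w, 0))
    return embedding, V, w

def classical_mds_transform(D_new, D_train, V, w):
    """Embed new points from their dissimilarities to the training set
    (Nystroem-style out-of-sample extension, Bengio et al. 2004).
    Assumes the kept eigenvalues ``w`` are strictly positive."""
    d2 = D_new ** 2
    K = -0.5 * (d2
                - d2.mean(axis=1, keepdims=True)   # mean over training pts
                - (D_train ** 2).mean(axis=0)      # training column means
                + (D_train ** 2).mean())           # grand mean
    return K @ (V / np.sqrt(w))
```

Applying the transform to the training dissimilarities reproduces the training embedding exactly, which is the consistency property an out-of-sample `transform` relies on.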

Originally implemented by @webdrone in #6222. It has since been modified to alter the fit/transform API and to speed up subsequent MDS.transform() calls.

Any other comments?

webdrone and others added 6 commits January 24, 2016 21:23
…valent to PCA transform. The method introduces errors as new points are projected, compared to a new projection of all points. See Bengio, Yoshua, et al. "Out-of-sample extensions for lle, isomap, mds, eigenmaps, and spectral clustering." Advances in Neural Information Processing Systems 16 (2004): 177-184.
@jsoutherland jsoutherland changed the title Extending MDS to new data [MRG] Extending MDS to new data Sep 26, 2017
@jsoutherland
Author

I posted this previously regarding the failing tests:

There are some tests failing and I could use guidance on the best way to resolve them. Now that MDS is capable of acting as a transformer, a lot of general-purpose tests are applied. MDS is by default not capable of using transform()... it must first be configured with extendible=True. Because of this, 5 tests fail with similar results:

======================================================================
ERROR: /home/josh/got/jsoutherland/scikit-learn/sklearn/tests/test_common.py.test_non_meta_estimators:check_estimators_dtypes(MDS)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/josh/anaconda2/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/josh/got/jsoutherland/scikit-learn/sklearn/utils/testing.py", line 789, in __call__
    return self.check(*args, **kwargs)
  File "/home/josh/got/jsoutherland/scikit-learn/sklearn/utils/testing.py", line 309, in wrapper
    return fn(*args, **kwargs)
  File "/home/josh/got/jsoutherland/scikit-learn/sklearn/utils/estimator_checks.py", line 865, in check_estimators_dtypes
    getattr(estimator, method)(X_train)
  File "/home/josh/got/jsoutherland/scikit-learn/sklearn/manifold/mds.py", line 475, in transform
    raise ValueError("Method only available if extendible is True.")
ValueError: Method only available if extendible is True.

There are many possible ways to fix this:

  1. Making extendible=True the default is probably not what we want.
  2. We could skip these tests.
  3. We could stop raising a ValueError...
  4. There may be a way to configure MDS to be extendible during the tests.

Are there any suggestions on how to best fix these errors?

Thanks

@jsoutherland jsoutherland mentioned this pull request Sep 27, 2017
@lesteve
Member
lesteve commented Sep 28, 2017

I think that you could use the same approach as in sklearn/pipeline.py with a property:

@property
def transform(self):
    """Apply transforms, and transform with the final estimator

    This also works where final estimator is ``None``: all prior
    transformations are applied.

    Parameters
    ----------
    X : iterable
        Data to transform. Must fulfill input requirements of first step
        of the pipeline.

    Returns
    -------
    Xt : array-like, shape = [n_samples, n_transformed_features]
    """
    # _final_estimator is None or has transform, otherwise attribute error
    # XXX: Handling the None case means we can't use if_delegate_has_method
    if self._final_estimator is not None:
        self._final_estimator.transform
    return self._transform

If that works, that means that MDS would have transform only when extendible=True.
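Concretely, a hypothetical MDS sketch using that pattern could look like the following (names and bodies are placeholders, not this PR's code). The key point is that a property raising AttributeError, rather than a method raising ValueError, makes `hasattr(estimator, 'transform')` return False, so generic transformer checks skip the estimator:

```python
class MDS:
    """Hypothetical sketch of the property-based approach."""

    def __init__(self, extendible=False):
        self.extendible = extendible

    def _transform(self, X):
        raise NotImplementedError  # out-of-sample embedding would go here

    @property
    def transform(self):
        # AttributeError (not ValueError) => hasattr(est, 'transform')
        # is False when the estimator is not extendible.
        if not self.extendible:
            raise AttributeError(
                "transform is only available when extendible=True")
        return self._transform
```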

A random question while I am at it: any particular reason why extendible=True should not be the default? Maybe computation costs? I guess changing previous behaviour is debatable, but if the result is a lot better, then why not...

@jsoutherland
Author

@lesteve Thank you for the pointer - I will give that a try.

I'm not sure the results are better. What I like is that it is much faster and can be extended; those properties combined allow for visualizing many more points. We probably use more memory... but the old version was limited to a few thousand points unless you were willing to wait hours or days.

I think the two reasons for not switching the default are the change in behavior and our limited quantitative understanding of the differences.

@jsoutherland
Author

@lesteve I tried the @property approach but ran into other errors. I was able to get everything passing by using the extendible case when necessary.

All tests are passing and I have improved code coverage for the MDS class by quite a bit.

This is ready for review.

@jsoutherland
Author

@lesteve @NelleV I haven't contributed to scikit-learn before; do you have any tips on how to move this forward? Is there someone we need to ping? Thanks

@jnothman
Member

Sorry, there's not a lot of core dev availability lately.

@jsoutherland
Author

@jnothman no worries, I just wanted to make sure that I wasn't missing a step.

Member
@jnothman jnothman left a comment

Cursory initial review.

"""
def __init__(self, n_components=2, metric=True, n_init=4,
max_iter=300, verbose=0, eps=1e-3, n_jobs=1,
random_state=None, dissimilarity="euclidean"):
random_state=None, dissimilarity="euclidean",
extendible=False):
Member

*"extensible" or "extendable"
Perhaps the correct term is "inductive"

Please document the parameter in the class docstring.

Author

I will switch over to using "inductive".

"""

if not self.extendible:
    raise ValueError("Method only available if extendible is True.")
Member

Please test this

# Test non-parallel case
mds_clf = mds.MDS(metric=False, n_jobs=1, dissimilarity="precomputed")
mds_clf.fit(sim)
mds_clf.fit_transform(sim)
Member

Should you be testing that the model is identical?

Author

That would be a good improvement. I will add that in.

Author

I found there is randomness that cannot be controlled by setting MDS.random_state. This was true before this PR... I suggest opening an issue and solving that in a follow-up PR?

Member

Not yet fixed.

Author

Are you accepting that we hold off on fixing this issue? I looked into it more and the fixes required are out of scope for this PR. We can either keep these new tests (which check execution but not results) of the different paths through the old/default SMACOF method, or I can remove them.

Member

Could you please open an issue as I do not feel expert enough in this to understand where the randomness is coming from myself.

Author

Sure, no problem.


# test fit_transform under the extendible case
mds_clf = mds.MDS(dissimilarity="euclidean", extendible=True)
mds_clf.fit_transform(sim)
Member

Should the output of this be identical to the extendible=False case?

Author

I don't believe it should be the same.

@webdrone webdrone Oct 19, 2017

It would not. The inductive=False case uses the SMACOF algorithm to find projection coordinates for the points, an iterative minimisation procedure on a loss function ("stress"). The inductive=True case produces the closed-form solution of this minimisation, taking the loss function to be the "metric stress" (a special case of the general "stress"). The closed-form solution involves performing an eigendecomposition of the centred dissimilarity matrix.
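To make the distinction concrete, the quantity SMACOF iteratively minimises can be sketched as a small helper (illustrative only, not the PR's implementation):

```python
import numpy as np

def raw_stress(Y, D):
    """Raw (metric) stress: sum over point pairs of the squared
    difference between embedding distances and target dissimilarities."""
    d = np.sqrt(((Y[:, None] - Y[None]) ** 2).sum(-1))
    iu = np.triu_indices_from(D, k=1)   # count each pair once
    return float(((d[iu] - D[iu]) ** 2).sum())
```

A perfect embedding of Euclidean data has zero stress; SMACOF descends towards a local minimum of this quantity, while the inductive path solves its "metric stress" special case in closed form.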

@jsoutherland
Author

@jnothman I have made a full pass on the initial review.

# Test non-parallel case
mds_clf = mds.MDS(metric=False, n_jobs=1, dissimilarity="precomputed")
mds_clf.fit(sim)
mds_clf.fit_transform(sim)
Member

Not yet fixed.

# test fit and transform for precomputed case
mds_clf = mds.MDS(dissimilarity="precomputed", inductive=True)
mds_clf.fit(sim, init=Z)
mds_clf.transform(sim)
Member

Should this be identical to the result of fit_transform above? assert, please.

Author

I believe it should be... will fix.

# Testing for extending MDS to new points
sim2 = np.array([[3, 1, 1, 2],
                 [4, 1, 2, 2]])
mds_clf.transform(sim2)
Member

what is the expected output?

@@ -330,6 +331,12 @@ class MDS(BaseEstimator):
Pre-computed dissimilarities are passed directly to ``fit`` and
``fit_transform``.

inductive : boolean, optional, default: False
Member

Perhaps we should have a method or algorithm or solver parameter rather than inductive. Remind me: do we get the same embedding of the training data, or an embedding with similar properties, either way? Does one or the other have higher computational costs?

Author

It's a different embedding. The new algorithm (which can be applied to out-of-sample points) is faster, but there is an increased memory requirement as we store a matrix and I believe the stress value may not be quite as good.

@jnothman
Member
jnothman commented Nov 6, 2017 via email

Member
@jnothman jnothman left a comment

You don't seem to test the correctness of the algorithm in fit_transform, i.e. that it derives an appropriate embedding and model...

if self.method == 'inductive':
    self.X_train_ = X
    if self.dissimilarity == 'precomputed':
        D = X
Member

this is never tested

    D = X
elif self.dissimilarity == 'euclidean':
    D = euclidean_distances(X)
    self.D_XX_ = euclidean_distances(self.X_train_, self.X_train_)
Member

this is a public attribute, by virtue of its name. It is not documented.

y: Ignored
NB: similarity matrix has to be centered, use the
make_euclidean_similarities function to create it.

Member

please remove extra blank line

if self.dissimilarity == 'precomputed':
    D_new = X
elif self.dissimilarity == 'euclidean':
    if not hasattr(self, 'X_train_') \
Member

Please use check_is_fitted
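For reference, `check_is_fitted` raises `NotFittedError` when the fitted attributes are absent, replacing the manual `hasattr` chain. The real helper lives in `sklearn.utils.validation`; the stand-in below is only a sketch of the behaviour, with made-up class names:

```python
class NotFittedError(ValueError, AttributeError):
    """Mirrors sklearn's NotFittedError (inherits both for compatibility)."""

def check_is_fitted(estimator, attributes):
    """Minimal stand-in for sklearn.utils.validation.check_is_fitted."""
    missing = [a for a in attributes
               if getattr(estimator, a, None) is None]
    if missing:
        raise NotFittedError(
            "This %s instance is not fitted yet (missing: %s)."
            % (type(estimator).__name__, ", ".join(missing)))

class InductiveMDS:
    """Hypothetical estimator using the check instead of hasattr chains."""

    def fit(self, X):
        self.X_train_ = X    # placeholder fitted attributes
        self.D_XX_ = X
        return self

    def transform(self, X):
        check_is_fitted(self, ['X_train_', 'D_XX_'])
        return X
```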

if not hasattr(self, 'X_train_') \
        or not hasattr(self, 'D_XX_') \
        or self.X_train_ is None \
        or self.D_XX_ is None:
Member

why would this be the case??

    self.fit(X)
    return self.transform(X)

def center_similarities(self, D_aX, D_XX):
Member

I think this should be a private method, unless you have a very good reason otherwise.


# calling transform with inductive=False causes an error
mds_clf = mds.MDS(dissimilarity="euclidean")
assert_raises(ValueError, mds_clf.transform, sim)
Member

It might be better to use assert_raises_regexp so we're sure we're getting caught in the right ValueError...
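A sketch of what the stricter check would look like. `assert_raises_regexp` came from `sklearn.utils.testing` in that era; the helper and stub below are hand-rolled stand-ins to show the idea, not the real test code:

```python
import re

def assert_raises_regexp(exc_type, pattern, callable_, *args, **kwargs):
    """The call must raise ``exc_type`` with a message matching ``pattern``."""
    try:
        callable_(*args, **kwargs)
    except exc_type as exc:
        assert re.search(pattern, str(exc)), (
            "message %r does not match %r" % (str(exc), pattern))
    else:
        raise AssertionError("%s was not raised" % exc_type.__name__)

def transform_stub(X):
    # Mimics MDS.transform when the estimator is not extendible.
    raise ValueError("Method only available if extendible is True.")
```

This way the test fails if a ValueError is raised for some unrelated reason, e.g. an input-validation error with a different message.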

sim2 = np.array([[3, 1, 1, 2],
                 [4, 1, 2, 2]])
result = mds_clf.transform(sim2)
expected = np.array([[-.705, -.452],
Member

Are these values something you've derived by hand? Or just what the model happens to output?

result1 = mds_clf.fit_transform(sim)

# test fit and transform for precomputed case
mds_clf = mds.MDS(dissimilarity="euclidean", method="inductive")
Member

precomputed case?

@JHibbard

@jnothman @jsoutherland Really looking forward to this feature for a current research project. Is it likely to be added in the next couple months?

@jsoutherland
Author

@JHibbard I don't think so. The tests have been too difficult to implement to the higher standards now in place for MDS. The algorithm is too stochastic to allow for straightforward testing. Also, the previous tests largely checked that MDS does not raise errors, rather than correctness.

@jnothman
Member
jnothman commented Jan 3, 2018 via email

@shukon
shukon commented May 23, 2018

@jsoutherland @jnothman I'd be interested in investing some time in this PR, but I'm not exactly sure where to start. Is it just the failing tests, or mainly the open review comments? Is there anything I can do to help with this PR?

@jnothman
Member

It could do with stronger tests, but it also needs the comments above to be resolved. Thanks @shukon

@jsoutherland
Author
jsoutherland commented May 24, 2018

@shukon Help would be appreciated. I'd still like to see this get in. It was passing tests at a92d36a. After improving test coverage (mostly around reproducibility) it is failing due to randomness in a dependency here:
https://github.com/jsoutherland/scikit-learn/blob/92ce919e8a55a8c7a4b192fa3d0d3018529f7c52/sklearn/manifold/mds.py#L421

Also failing the new parallel-computation tests due to #10119

Base automatically changed from master to main January 22, 2021 10:49
8 participants