[MRG] Extending MDS to new data by jsoutherland · Pull Request #9834 · scikit-learn/scikit-learn · GitHub

[MRG] Extending MDS to new data #9834


Open
wants to merge 19 commits into main

Conversation

@jsoutherland jsoutherland commented Sep 26, 2017

Reference Issue

Continuation of #6222.

What does this implement/fix? Explain your changes.

Extended the MDS object to include a transform method for out-of-sample points, as described here:
http://papers.nips.cc/paper/2461-out-of-sample-extensions-for-lle-isomap-mds-eigenmaps-and-spectral-clustering.pdf

The SMACOF algorithm is still used in the non-extendible (default) case, but not in the extendible case, which instead requires an eigendecomposition of the (dis)similarity matrix.
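For readers unfamiliar with the closed-form route, here is a minimal illustrative sketch (not the PR's actual code; all function names here are made up for the example): classical MDS embeds the training points via an eigendecomposition of the double-centered squared dissimilarity matrix, and the Bengio et al. (2004) formula embeds new points from their dissimilarities to the training set.

```python
import numpy as np

def classical_mds_fit(D, n_components=2):
    """Classical (metric) MDS: eigendecompose the double-centered
    squared dissimilarity matrix.  Illustrative sketch only."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # Gram matrix of centered data
    w, V = np.linalg.eigh(B)                   # ascending eigenvalues
    idx = np.argsort(w)[::-1][:n_components]   # keep the largest
    w, V = w[idx], V[:, idx]
    embedding = V * np.sqrt(np.maximum(w, 0))
    return embedding, V, w

def classical_mds_transform(D_new, D_train, V, w):
    """Embed new points from their dissimilarities to the training set
    (Nystroem-style out-of-sample extension, Bengio et al. 2004).
    Assumes the kept eigenvalues ``w`` are strictly positive."""
    d2 = D_new ** 2
    K = -0.5 * (d2
                - d2.mean(axis=1, keepdims=True)   # mean over training pts
                - (D_train ** 2).mean(axis=0)      # training column means
                + (D_train ** 2).mean())           # grand mean
    return K @ (V / np.sqrt(w))
```

Applying the transform to the training dissimilarities reproduces the training embedding exactly, which is the consistency property an out-of-sample `transform` relies on.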

Originally implemented by @webdrone in #6222. It has since been modified to alter the fit/transform API and to speed up subsequent MDS.transform() calls.

Any other comments?

webdrone and others added 6 commits January 24, 2016 21:23
…valent to PCA transform. The method introduces errors as new points are projected, compared to a new projection of all points. See Bengio, Yoshua, et al. "Out-of-sample extensions for lle, isomap, mds, eigenmaps, and spectral clustering." Advances in Neural Information Processing Systems 16 (2004): 177-184.
@jsoutherland jsoutherland changed the title Extending MDS to new data [MRG] Extending MDS to new data Sep 26, 2017
@jsoutherland
Author

I posted this previously regarding the failing tests:

There are some tests failing and I could use guidance on the best way to resolve them. Now that MDS is capable of acting as a transformer, a lot of general-purpose tests are applied. MDS is by default not capable of using transform()... it must first be configured with extendible=True. Because of this, 5 tests fail with similar results:

======================================================================
ERROR: /home/josh/got/jsoutherland/scikit-learn/sklearn/tests/test_common.py.test_non_meta_estimators:check_estimators_dtypes(MDS)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/josh/anaconda2/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/josh/got/jsoutherland/scikit-learn/sklearn/utils/testing.py", line 789, in __call__
    return self.check(*args, **kwargs)
  File "/home/josh/got/jsoutherland/scikit-learn/sklearn/utils/testing.py", line 309, in wrapper
    return fn(*args, **kwargs)
  File "/home/josh/got/jsoutherland/scikit-learn/sklearn/utils/estimator_checks.py", line 865, in check_estimators_dtypes
    getattr(estimator, method)(X_train)
  File "/home/josh/got/jsoutherland/scikit-learn/sklearn/manifold/mds.py", line 475, in transform
    raise ValueError("Method only available if extendible is True.")
ValueError: Method only available if extendible is True.

There are many possible ways to fix this:

  1. Making extendible=True the default is probably not what we want.
  2. We could skip these tests.
  3. We could stop raising a ValueError...
  4. There may be a way to configure MDS to be extendible during the tests.

Are there any suggestions on how to best fix these errors?

Thanks

@jsoutherland jsoutherland mentioned this pull request Sep 27, 2017
@lesteve
Member
lesteve commented Sep 28, 2017

I think that you could use the same approach as in sklearn/pipeline.py with a property:

@property
def transform(self):
    """Apply transforms, and transform with the final estimator

    This also works where final estimator is ``None``: all prior
    transformations are applied.

    Parameters
    ----------
    X : iterable
        Data to transform. Must fulfill input requirements of first step
        of the pipeline.

    Returns
    -------
    Xt : array-like, shape = [n_samples, n_transformed_features]
    """
    # _final_estimator is None or has transform, otherwise attribute error
    # XXX: Handling the None case means we can't use if_delegate_has_method
    if self._final_estimator is not None:
        self._final_estimator.transform
    return self._transform

If that works, that means that MDS would have transform only when extendible=True.
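Concretely, a hypothetical MDS sketch using that pattern could look like the following (names and bodies are placeholders, not this PR's code). The key point is that a property raising AttributeError, rather than a method raising ValueError, makes `hasattr(estimator, 'transform')` return False, so generic transformer checks skip the estimator:

```python
class MDS:
    """Hypothetical sketch of the property-based approach."""

    def __init__(self, extendible=False):
        self.extendible = extendible

    def _transform(self, X):
        raise NotImplementedError  # out-of-sample embedding would go here

    @property
    def transform(self):
        # AttributeError (not ValueError) => hasattr(est, 'transform')
        # is False when the estimator is not extendible.
        if not self.extendible:
            raise AttributeError(
                "transform is only available when extendible=True")
        return self._transform
```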

A random question while I am at it: any particular reason why extendible=True should not be the default? Maybe computation costs? I guess changing previous behaviour is debatable, but if the result is a lot better, then why not...

@jsoutherland
Author

@lesteve Thank you for the pointer - I will give that a try.

I'm not sure the results are better. What I like is that it is much faster and can be extended; those properties combined allow for visualizing many more points. We probably use more memory... but the old version was limited to a few thousand points unless you were willing to wait hours or days.

I think the two reasons for not switching the default are the change in behavior and our limited quantitative understanding of the differences.

@jsoutherland
Author

@lesteve I tried the @property approach but ran into other errors. I was able to get everything passing by using the extendible case when necessary.

All tests are passing and I have improved code coverage for the MDS class by quite a bit.

This is ready for review.

@jsoutherland
Author

@lesteve @NelleV I haven't contributed to scikit-learn before; do you have any tips on how to move this forward? Is there someone we need to ping? Thanks

@jnothman
Member

Sorry, there's not a lot of core dev availability lately.

@jsoutherland
Author

@jnothman no worries, I just wanted to make sure that I wasn't missing a step.

Member
@jnothman jnothman left a comment

Cursory initial review.

"""
def __init__(self, n_components=2, metric=True, n_init=4,
max_iter=300, verbose=0, eps=1e-3, n_jobs=1,
random_state=None, dissimilarity="euclidean"):
random_state=None, dissimilarity="euclidean",
extendible=False):
Member

*"extensible" or "extendable"
Perhaps the correct term is "inductive"

Please document the parameter in the class docstring.

Author

I will switch over to using "inductive".

"""

if not self.extendible:
    raise ValueError("Method only available if extendible is True.")
Member

Please test this

# Test non-parallel case
mds_clf = mds.MDS(metric=False, n_jobs=1, dissimilarity="precomputed")
mds_clf.fit(sim)
mds_clf.fit_transform(sim)
Member

Should you be testing that the model is identical?

Author

That would be a good improvement. I will add that in.

Author

I found there is randomness that cannot be controlled by setting MDS.random_state. This was true before this PR... I suggest opening an issue and solving that in a follow-up PR?

Member

Not yet fixed.

Author

Are you accepting that we hold off on fixing this issue? I looked into it more and the fixes required are out of scope for this PR. We can either keep these new tests (which check execution but not results) of the different paths through the old/default SMACOF method, or I can remove them.

Member

Could you please open an issue as I do not feel expert enough in this to understand where the randomness is coming from myself.

Author

Sure, no problem.


# test fit_transform under the extendible case
mds_clf = mds.MDS(dissimilarity="euclidean", extendible=True)
mds_clf.fit_transform(sim)
Member

Should the output of this be identical to the extendible=False case?

Author

I don't believe it should be the same.

@webdrone webdrone Oct 19, 2017

It would not. The inductive=False case uses the SMACOF algorithm to find projection coordinates for the points, an iterative minimisation procedure on a loss function ("stress"). The inductive=True case produces the closed-form solution of this minimisation, taking the loss function to be the "metric stress" (a special case of the general "stress"). The closed-form solution involves performing an eigendecomposition of the centred dissimilarity matrix.
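To make the distinction concrete, the quantity SMACOF iteratively minimises can be sketched as a small helper (illustrative only, not the PR's implementation):

```python
import numpy as np

def raw_stress(Y, D):
    """Raw (metric) stress: sum over point pairs of the squared
    difference between embedding distances and target dissimilarities."""
    d = np.sqrt(((Y[:, None] - Y[None]) ** 2).sum(-1))
    iu = np.triu_indices_from(D, k=1)   # count each pair once
    return float(((d[iu] - D[iu]) ** 2).sum())
```

A perfect embedding of Euclidean data has zero stress; SMACOF descends towards a local minimum of this quantity, while the inductive path solves its "metric stress" special case in closed form.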

@jsoutherland
Author

@jnothman I have made a full pass on the initial review.

# Test non-parallel case
mds_clf = mds.MDS(metric=False, n_jobs=1, dissimilarity="precomputed")
mds_clf.fit(sim)
mds_clf.fit_transform(sim)
Member

Not yet fixed.

# test fit and transform for precomputed case
mds_clf = mds.MDS(dissimilarity="precomputed", inductive=True)
mds_clf.fit(sim, init=Z)
mds_clf.transform(sim)
Member

Should this be identical to the result of fit_transform above? assert, please.

Author

I believe it should be... will fix.

# Testing for extending MDS to new points
sim2 = np.array([[3, 1, 1, 2],
                 [4, 1, 2, 2]])
mds_clf.transform(sim2)
Member

what is the expected output?

@@ -330,6 +331,12 @@ class MDS(BaseEstimator):
Pre-computed dissimilarities are passed directly to ``fit`` and
``fit_transform``.

inductive : boolean, optional, default: False
Member

Perhaps we should have a method or algorithm or solver parameter rather than inductive. Remind me: do we get the same embedding of the training data, or an embedding with similar properties, either way? Does one or the other have higher computational costs?

Author

It's a different embedding. The new algorithm (which can be applied to out-of-sample points) is faster, but there is an increased memory requirement as we store a matrix and I believe the stress value may not be quite as good.

@jnothman
Member
jnothman commented Nov 6, 2017 via email

Member
@jnothman jnothman left a comment

You don't seem to test the correctness of the algorithm in fit_transform, i.e. that it derives an appropriate embedding and model...

if self.method == 'inductive':
    self.X_train_ = X
    if self.dissimilarity == 'precomputed':
        D = X
Member

this is never tested

    D = X
elif self.dissimilarity == 'euclidean':
    D = euclidean_distances(X)
    self.D_XX_ = euclidean_distances(self.X_train_, self.X_train_)
Member

this is a public attribute, by virtue of its name. It is not documented.

y: Ignored
NB: similarity matrix has to be centered, use the
make_euclidean_similarities function to create it.

Member

please remove extra blank line

if self.dissimilarity == 'precomputed':
    D_new = X
elif self.dissimilarity == 'euclidean':
    if not hasattr(self, 'X_train_') \
Member

Please use check_is_fitted
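For reference, `check_is_fitted` raises `NotFittedError` when the fitted attributes are absent, replacing the manual `hasattr` chain. The real helper lives in `sklearn.utils.validation`; the stand-in below is only a sketch of the behaviour, with made-up class names:

```python
class NotFittedError(ValueError, AttributeError):
    """Mirrors sklearn's NotFittedError (inherits both for compatibility)."""

def check_is_fitted(estimator, attributes):
    """Minimal stand-in for sklearn.utils.validation.check_is_fitted."""
    missing = [a for a in attributes
               if getattr(estimator, a, None) is None]
    if missing:
        raise NotFittedError(
            "This %s instance is not fitted yet (missing: %s)."
            % (type(estimator).__name__, ", ".join(missing)))

class InductiveMDS:
    """Hypothetical estimator using the check instead of hasattr chains."""

    def fit(self, X):
        self.X_train_ = X    # placeholder fitted attributes
        self.D_XX_ = X
        return self

    def transform(self, X):
        check_is_fitted(self, ['X_train_', 'D_XX_'])
        return X
```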

if not hasattr(self, 'X_train_') \
        or not hasattr(self, 'D_XX_') \
        or self.X_train_ is None \
        or self.D_XX_ is None:
Member

why would this be the case??

    self.fit(X)
    return self.transform(X)

def center_similarities(self, D_aX, D_XX):
Member

I think this should be a private method, unless you have a very good reason otherwise.


# calling transform with inductive=False causes an error
mds_clf = mds.MDS(dissimilarity="euclidean")
assert_raises(ValueError, mds_clf.transform, sim)
Member

It might be better to use assert_raises_regexp so we're sure we're getting caught in the right ValueError...
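A sketch of what the stricter check would look like. `assert_raises_regexp` came from `sklearn.utils.testing` in that era; the helper and stub below are hand-rolled stand-ins to show the idea, not the real test code:

```python
import re

def assert_raises_regexp(exc_type, pattern, callable_, *args, **kwargs):
    """The call must raise ``exc_type`` with a message matching ``pattern``."""
    try:
        callable_(*args, **kwargs)
    except exc_type as exc:
        assert re.search(pattern, str(exc)), (
            "message %r does not match %r" % (str(exc), pattern))
    else:
        raise AssertionError("%s was not raised" % exc_type.__name__)

def transform_stub(X):
    # Mimics MDS.transform when the estimator is not extendible.
    raise ValueError("Method only available if extendible is True.")
```

This way the test fails if a ValueError is raised for some unrelated reason, e.g. an input-validation error with a different message.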

sim2 = np.array([[3, 1, 1, 2],
                 [4, 1, 2, 2]])
result = mds_clf.transform(sim2)
expected = np.array([[-.705, -.452],
Member

Are these values something you've derived by hand? Or just what the model happens to output?

result1 = mds_clf.fit_transform(sim)

# test fit and transform for precomputed case
mds_clf = mds.MDS(dissimilarity="euclidean", method="inductive")
Member

precomputed case?

@JHibbard

@jnothman @jsoutherland Really looking forward to this feature for a current research project. Is it likely to be added in the next couple months?

@jsoutherland
Author

@JHibbard I don't think so. The tests have been too difficult to implement to the higher standards now in place for MDS. The algorithm is too stochastic to allow for straightforward testing. Also, the previous tests largely checked that MDS does not raise errors, rather than correctness.

@jnothman
Member
jnothman commented Jan 3, 2018 via email

@shukon
shukon commented May 23, 2018

@jsoutherland @jnothman I'd be interested in investing some time in this PR, but I'm not exactly sure where to start. Is it just the failing tests, or mainly the open review comments? Is there anything I can do to help with this PR?

@jnothman
Member

It could do with stronger tests, but it also needs the comments above to be resolved. Thanks @shukon

@jsoutherland
Author
jsoutherland commented May 24, 2018

@shukon Help would be appreciated. I'd still like to see this get in. It was passing tests at a92d36a. After improving test coverage (mostly around reproducibility) it is failing due to randomness in a dependency here:
https://github.com/jsoutherland/scikit-learn/blob/92ce919e8a55a8c7a4b192fa3d0d3018529f7c52/sklearn/manifold/mds.py#L421

Also failing the new parallel-computation tests due to #10119

Base automatically changed from master to main January 22, 2021 10:49
8 participants