8000 [MRG] Fix TSNE init as NumPy array by b-carter · Pull Request #7208 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG] Fix TSNE init as NumPy array #7208

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Aug 24, 2016

Conversation

b-carter
Copy link
Contributor
@b-carter b-carter commented Aug 18, 2016

Reference Issue

Fixes #7204

What does this implement/fix? Explain your changes.

Fixes discrepancy between doc and type checking for the 'init' parameter in TSNE. NumPy array is now a possible value for 'init'. Type check and docstring are updated to reflect this change.

Updates existing test to reflect change in error msg.

Adds new test to validate NumPy array as a possible 'init' value.

Any other comments?

Made one other small readability change by reordering elif options to show string options before NumPy array check.

@b-carter b-carter changed the title [MRG] Fix tsne numpy array init [MRG] Fix TSNE init as NumPy array Aug 18, 2016
@b-carter
Copy link
Contributor Author

I didn't realize the incompatibility of 'basestring' with Python 3. I'll fix it so it's compatible.

@@ -305,11 +305,14 @@ def test_non_square_precomputed_distances():


def test_init_not_available():
# 'init' must be 'pca' or 'random'.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just edit the comment instead of removing it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Initialization of embedding. Possible options are 'random' and 'pca'.
PCA initialization cannot be used with precomputed distances and is
usually more globally stable than random initialization.
init : string or NumPy array, optional (default: "random")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you use the more consistent numpy array instead of NumPy array? Also it would be nice to specify as numpy array of shape [n_samples, n_components].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, fixed for consistency/clarification

@@ -707,7 +710,7 @@ def _fit(self, X, skip_num_points=0):
raise ValueError("n_iter should be at least 200")

if self.metric == "precomputed":
if self.init == 'pca':
if not isinstance(self.init, np.ndarray) and self.init == 'pca':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer:

if isinstance(self.init, string_types) and self.init == 'pca':

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I had this in a previous commit, however one of the builds failed because it couldn't import six.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer:

if isinstance(self.init, string_types) and self.init == 'pca':

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't, that's why! I'll go back to using 'string_types' and get it from there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushed change

assert_raises_regexp(ValueError, m, TSNE, init="not available")


def test_init_ndarray():
tsne = TSNE(init=np.zeros((100, 2)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also do a fit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if the test I put in the comment below looks good. If so, I'll push it in so you can merge.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please just add a tsne.fit here and assess at least the size of the fitted attributes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, do you think it's also worthwhile to check the case of init with ndarray and metric='precomputed'? (one change in this PR involves fixing a FutureWarning in this case)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just added another test, should be good now.

@agramfort
Copy link
Member

besides +1 for merge

@b-carter
Copy link
Contributor Author
b-carter commented Aug 20, 2016

Per my question a few comments up, I had written a test that reads:

def test_fit_init_ndarray():
    # Make sure no warnings are thrown from isinstance checks on numpy array
    tsne = TSNE(init=np.zeros((2, 2)), metric="precomputed")
    tsne.fit(np.array([[0, 0], [0, 0]]))

I'll just push this one. Is this okay?

@agramfort agramfort merged commit 040a766 into scikit-learn:master Aug 24, 2016
@agramfort
Copy link
Member

thanks @b-carter

TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
* fix tsne init numpy array type checking and existing test

* adds test for tsne init numpy array type checking

* reorder TSNE elif statements for string options first for increased readability

* string type checking with python 3 compatibility

* restore removed comment

* avoid use of six module and check string and unicode manually

* use lowercase spelling of numpy for consistency

* parens for better consistency

* fix init as numpy array throwing FutureWarning

* simplify logic for init validity checking

* use string_types for string checking

* improve init TSNE with ndarray test to check fit

* add test for init TSNE with ndarray and metric precomputed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TSNE does not accept init = np.ndarray despite saying it is able too
4 participants
0