10000 FIX Set TSNE's internal PCA to always use numpy as output by betatim · Pull Request #25370 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

FIX Set TSNE's internal PCA to always use numpy as output #25370

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jan 13, 2023

Conversation

betatim
Copy link
Member
@betatim betatim commented Jan 12, 2023

The internal PCA should always output a numpy array as it is results are only used internally.

Reference Issues/PRs

Fixes #25365

This fix should be backported to the v1.2.x branch.

The internal PCA should always output a numpy array as it is results
are only used internally.
@betatim betatim force-pushed the tsne-pandas-output branch from 311f08d to 5f2cb81 Compare January 12, 2023 14:14
@betatim betatim added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Jan 12, 2023
@betatim betatim added this to the 1.2.1 milestone Jan 12, 2023
@glemaitre
Copy link
Member

Also, I would have expected #24932 to have introduced the common test if we are using t-SNE with the default parameter since we don't change the init parameter in _set_checking_parameters.

@glemaitre glemaitre self-requested a review January 12, 2023 14:29
@glemaitre glemaitre changed the title Set TSNE's internal PCA to always use numpy as output FIX Set TSNE's internal PCA to always use numpy as output Jan 12, 2023
@glemaitre
Copy link
Member

I took a deeper look at the code and forcing the output of PCA seems to be wise here:

  • all numerical subsequent computations are based on linear algebra
  • pca is not exposed publicly. Therefore we don't expose an instance that is not affected by the config so we are lucky.

betatim and others added 2 commits January 12, 2023 16:22
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Tom Dupré la Tour <tom.duprelatour.10@gmail.com>
@betatim betatim added the Waiting for Second Reviewer First reviewer is done, need a second one! label Jan 13, 2023
Copy link
Member
@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular case, this solution makes sense, but we have many places where we create a clone of a given sub-estimator and expose the fitted version as a part of the public API.

What should be the behavior of those estimators?

cc @scikit-learn/core-devs

@glemaitre
Copy link
Member

In this particular case, this solution makes sense, but we have many places where we create a clone of a given sub-estimator and expose the fitted version as a part of the public API.

In this case, I would expect to have the estimator follow the global config and therefore we should make sure that the inner code works as much as possible whatever the data container provided.

@ogrisel
Copy link
Member
ogrisel commented Jan 13, 2023

we have many places where we create a clone of a given sub-estimator and expose the fitted version as a part of the public API. What should be the behavior of those estimators?

Do you have specific transformer in mind where this is the case? Is see the Pipeline and the ColumnTransformer but I think the current behavior is fine.

Copy link
Member
@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ogrisel ogrisel enabled auto-merge (squash) January 13, 2023 15:16
@adrinjalali
Copy link
Member

I guess we'll see. I'm thinking third party estimators which fit estimators internally. Let's see.

@ogrisel ogrisel merged commit fbd95f1 into scikit-learn:main Jan 13, 2023
@betatim betatim deleted the tsne-pandas-output branch January 13, 2023 15:46
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
…rn#25370)



Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Tom Dupré la Tour <tom.duprelatour.10@gmail.com>
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
…rn#25370)



Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Tom Dupré la Tour <tom.duprelatour.10@gmail.com>
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 23, 2023
…rn#25370)



Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Tom Dupré la Tour <tom.duprelatour.10@gmail.com>
adrinjalali pushed a commit that referenced this pull request Jan 24, 2023

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Tom Dupré la Tour <tom.duprelatour.10@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:manifold To backport PR merged in master that need a backport to a release branch defined based on the milestone. Waiting for Second Reviewer First reviewer is done, need a second one!
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sklearn.set_config(transform_output="pandas") breaks TSNE embeddings
5 participants
0