FIX Set TSNE's internal PCA to always use numpy as output #25370

betatim · 2023-01-12T14:13:46Z

The internal PCA should always output a numpy array, as its results are only used internally.

Reference Issues/PRs

This fix should be backported to the v1.2.x branch.

The internal PCA should always output a numpy array, as its results are only used internally.

sklearn/manifold/_t_sne.py

glemaitre · 2023-01-12T14:29:17Z

Also, I would have expected the common test introduced in #24932 to exercise this code path when t-SNE runs with its default parameters, since we don't change the init parameter in _set_checking_parameters.

sklearn/manifold/_t_sne.py

sklearn/manifold/tests/test_t_sne.py

sklearn/manifold/_t_sne.py

glemaitre · 2023-01-12T14:47:08Z

I took a deeper look at the code and forcing the output of PCA seems to be wise here:

- all subsequent numerical computations are based on linear algebra
- the PCA instance is not exposed publicly, so we never hand users a fitted estimator that ignores the global config; in that sense we are lucky
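To illustrate the first point, here is a small sketch, assuming pandas is installed, of why array-style code tends to break on a DataFrame. The positional slice `[:, 0]` is used here only as a typical NumPy idiom; it is an assumption for illustration, not a quote from the t-SNE source:

```python
import numpy as np
import pandas as pd

arr = np.arange(6, dtype=float).reshape(3, 2)
# Column names are made up for this example.
df = pd.DataFrame(arr, columns=["pca0", "pca1"])

# Positional slicing works on a NumPy array.
first_col_std = np.std(arr[:, 0])

# The same expression on a DataFrame is treated as a column lookup and fails.
# The exact exception type depends on the pandas version, so it is caught broadly.
try:
    df[:, 0]
    dataframe_slicing_ok = True
except Exception:
    dataframe_slicing_ok = False

print(first_col_std, dataframe_slicing_ok)
```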

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

doc/whats_new/v1.2.rst

Co-authored-by: Tom Dupré la Tour <tom.duprelatour.10@gmail.com>

adrinjalali

In this particular case, this solution makes sense, but we have many places where we create a clone of a given sub-estimator and expose the fitted version as a part of the public API.

What should be the behavior of those estimators?

cc @scikit-learn/core-devs

glemaitre · 2023-01-13T13:59:45Z

> In this particular case, this solution makes sense, but we have many places where we create a clone of a given sub-estimator and expose the fitted version as a part of the public API.

In this case, I would expect to have the estimator follow the global config and therefore we should make sure that the inner code works as much as possible whatever the data container provided.

ogrisel · 2023-01-13T15:15:31Z

> we have many places where we create a clone of a given sub-estimator and expose the fitted version as a part of the public API. What should be the behavior of those estimators?

Do you have a specific transformer in mind where this is the case? I see the Pipeline and the ColumnTransformer, but I think the current behavior is fine there.

ogrisel

LGTM!

adrinjalali · 2023-01-13T15:25:43Z

I guess we'll see. I'm thinking of third-party estimators that fit other estimators internally. Let's see.

github-actions bot added the module:manifold label Jan 12, 2023

Set TSNE's internal PCA to always use numpy as output

5f2cb81

betatim force-pushed the tsne-pandas-output branch from 311f08d to 5f2cb81 Compare January 12, 2023 14:14

betatim added the To backport label Jan 12, 2023

betatim added this to the 1.2.1 milestone Jan 12, 2023