sklearn.set_config(transform_output="pandas") breaks TSNE embeddings · Issue #25365 · scikit-learn/scikit-learn · GitHub

Closed
loftusa opened this issue Jan 11, 2023 · 8 comments · Fixed by #25370

Comments

@loftusa
loftusa commented Jan 11, 2023

Describe the bug

TSNE.fit_transform raises an error when the global config is set to transform_output="pandas".

I tracked down this bug in the sklearn codebase. The issue is here: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/manifold/_t_sne.py#L996

What's happening is that, under the set_output API, X_embedded is a pandas DataFrame with columns named "pca0" and "pca1". So when X_embedded[:, 0] is evaluated, pandas raises an InvalidIndexError (see the traceback below), because a DataFrame has to be indexed positionally with X_embedded.iloc[:, 0].

Possible fix could be changing line 996 to this:
X_embedded = X_embedded / np.std(np.array(X_embedded)[:, 0]) * 1e-4

which I am happy to make a PR to do unless somebody has a cleaner way.
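
The indexing difference described above can be checked in isolation. The following is a minimal sketch, not the scikit-learn code: the DataFrame is a made-up stand-in for the PCA output (only the column names "pca0" and "pca1" come from the report), and the exact exception type raised by the NumPy-style lookup may differ between pandas and Python versions, so it is caught broadly.

```python
import numpy as np
import pandas as pd

# Stand-in for the PCA output under transform_output="pandas".
# The values are random; only the column names mirror the report.
rng = np.random.default_rng(0)
X_embedded = pd.DataFrame(
    rng.standard_normal((35, 2)).astype(np.float32),
    columns=["pca0", "pca1"],
)

# NumPy-style indexing is treated as a column-label lookup and fails.
try:
    X_embedded[:, 0]
    numpy_style_ok = True
except Exception:  # exact exception type may vary across versions
    numpy_style_ok = False

# Positional indexing and explicit conversion both reach the first column.
first_col_iloc = X_embedded.iloc[:, 0].to_numpy()
first_col_array = np.asarray(X_embedded)[:, 0]

# Rescaling as in the proposed fix: the first column ends up with std 1e-4.
rescaled = X_embedded / np.std(first_col_array) * 1e-4
```

Converting with np.asarray keeps the rest of the expression unchanged, which is why the proposed one-line fix is so small.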

Cheers!

Steps/Code to Reproduce

import sklearn
import numpy as np
from sklearn.manifold import TSNE

sklearn.set_config(transform_output="pandas")
arr = np.arange(35*4).reshape(35, 4)
TSNE(n_components=2).fit_transform(arr)

Expected Results

No error is thrown, and a pandas DataFrame with 2 columns is returned

Actual Results

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/pandas/core/indexes/base.py:3803, in Index.get_loc(self, key, method, tolerance)
   3802 try:
-> 3803     return self._engine.get_loc(casted_key)
   3804 except KeyError as err:

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/pandas/_libs/index.pyx:144, in pandas._libs.index.IndexEngine.get_loc()

TypeError: '(slice(None, None, None), 0)' is an invalid key

During handling of the above exception, another exception occurred:

InvalidIndexError                         Traceback (most recent call last)
Cell In[14], line 7
      5 sklearn.set_config(transform_output="pandas")
      6 arr = np.arange(35*4).reshape(35, 4)
----> 7 TSNE(n_components=2).fit_transform(arr)

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/sklearn/manifold/_t_sne.py:1117, in TSNE.fit_transform(self, X, y)
   1115 self._validate_params()
   1116 self._check_params_vs_input(X)
-> 1117 embedding = self._fit(X)
   1118 self.embedding_ = embedding
   1119 return self.embedding_

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/sklearn/manifold/_t_sne.py:996, in TSNE._fit(self, X, skip_num_points)
    993     X_embedded = pca.fit_transform(X).astype(np.float32, copy=False)
    994     # PCA is rescaled so that PC1 has standard deviation 1e-4 which is
    995     # the default value for random initialization. See issue #18018.
--> 996     X_embedded = X_embedded / np.std(X_embedded[:, 0]) * 1e-4
    997 elif self.init == "random":
    998     # The embedding is initialized with iid samples from Gaussians with
    999     # standard deviation 1e-4.
   1000     X_embedded = 1e-4 * random_state.standard_normal(
   1001         size=(n_samples, self.n_components)
   1002     ).astype(np.float32)

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/pandas/core/frame.py:3805, in DataFrame.__getitem__(self, key)
   3803 if self.columns.nlevels > 1:
   3804     return self._getitem_multilevel(key)
-> 3805 indexer = self.columns.get_loc(key)
   3806 if is_integer(indexer):
   3807     indexer = [indexer]

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/pandas/core/indexes/base.py:3810, in Index.get_loc(self, key, method, tolerance)
   3805         raise KeyError(key) from err
   3806     except TypeError:
   3807         # If we have a listlike key, _check_indexing_error will raise
   3808         #  InvalidIndexError. Otherwise we fall through and re-raise
   3809         #  the TypeError.
-> 3810         self._check_indexing_error(key)
   3811         raise
   3813 # GH#42269

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/pandas/core/indexes/base.py:5968, in Index._check_indexing_error(self, key)
   5964 def _check_indexing_error(self, key):
   5965     if not is_scalar(key):
   5966         # if key is not a scalar, directly raise an error (the code below
   5967         # would convert to numpy arrays and raise later any way) - GH29926
-> 5968         raise InvalidIndexError(key)

InvalidIndexError: (slice(None, None, None), 0)

Versions

System:
    python: 3.10.9 (main, Dec 12 2022, 21:10:20) [GCC 9.4.0]
executable: /home/aloftus/.pyenv/versions/3.10.9/bin/python3.10
   machine: Linux-5.4.0-128-generic-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.2.0
          pip: 22.3.1
   setuptools: 65.6.3
        numpy: 1.23.5
        scipy: 1.9.3
       Cython: None
       pandas: 1.5.2
   matplotlib: 3.6.2
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/aloftus/.pyenv/versions/3.10.9/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so
        version: 0.3.20
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 32

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/aloftus/.pyenv/versions/3.10.9/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
        version: 0.3.18
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 32

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/aloftus/.pyenv/versions/3.10.9/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 32

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/aloftus/.pyenv/versions/3.10.9/lib/python3.10/site-packages/torch/lib/libgomp-a34b3233.so.1
        version: None
    num_threads: 16
loftusa added the Bug and Needs Triage labels Jan 11, 2023
@TomDLT
Member
TomDLT commented Jan 12, 2023

Thanks for the detailed bug report, it makes the bug easy to reproduce.

Best fix might be to use .set_output(transform="default") on the PCA estimator, to directly output a numpy array. PR welcome, bonus if you find other instances of this bug!

@betatim
Member
betatim commented Jan 12, 2023

bonus if you find other instances of this bug!

I wonder what a good way would be to find estimators that are used internally in another estimator. Those would be prime candidates for suffering from the same bug as this.

@betatim
Member
betatim commented Jan 12, 2023

The below estimators raise some kind of exception when used like this:

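from sklearn import config_context
from sklearn.datasets import load_iris

X, y = load_iris(as_frame=True, return_X_y=True)

with config_context(transform_output="pandas"):
    est = Estimator()  # Estimator stands for each estimator class tried in turn
    est.fit_transform(X, y)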

List of failures:

  • CCA
  • DictVectorizer
  • FeatureHasher
  • GaussianRandomProjection
  • IsotonicRegression
  • KBinsDiscretizer
  • KNeighborsTransformer
  • KernelCenterer
  • LabelBinarizer
  • LabelEncoder
  • MultiLabelBinarizer
  • OneHotEncoder
  • PLSCanonical
  • PLSSVD
  • RadiusNeighborsTransformer
  • RandomTreesEmbedding
  • SelectKBest
  • SparseRandomProjection
  • TSNE

Might be worth looking through them to see what the problem is for each one. Most of them are probably spurious (at least I didn't spot anything on a quick look through the exceptions that were raised).

@glemaitre
Member

So there is something wrong with the test there:

@pytest.mark.parametrize(
    "estimator", SET_OUTPUT_ESTIMATORS, ids=_get_check_estimator_ids
)
def test_global_output_transform_pandas(estimator):
    name = estimator.__class__.__name__
    if not hasattr(estimator, "set_output"):
        pytest.skip(
            f"Skipping check_global_ouptut_transform_pandas for {name}: Does not"
            " support set_output API yet"
        )
    _set_checking_parameters(estimator)
    with ignore_warnings(category=(FutureWarning)):
        check_global_ouptut_transform_pandas(estimator.__class__.__name__, estimator)

@glemaitre
Member

In the list above, iris would not work for cross-decomposition and would probably fail for the vectorizers as well.

@betatim
Member
betatim commented Jan 12, 2023

I think the list of SET_OUTPUT_ESTIMATORS does not contain TSNE because it uses _tested_estimators("transformer") to populate the list. TSNE only inherits from BaseEstimator. The filtering is based on the right kind of mixin being used. So I think what happens is that TSNE never makes it into these tests.

@glemaitre
Member

I will try to make TSNE inherit from TransformerMixin to check where it fails. I don't see why it should not inherit. I will do it in a separate PR that you open. Depending on how it goes, we can remove the non-regression test and use the common test.

@betatim
Member
betatim commented Jan 13, 2023

I stochastically sampled a few more of the estimators I listed above. None of them looked like they had the same problem as TSNE. So I think we can keep this issue closed and see what comes out of #25374.

4 participants