sklearn.set_config(transform_output="pandas") breaks TSNE embeddings · Issue #25365 · scikit-learn/scikit-learn · GitHub
sklearn.set_config(transform_output="pandas") breaks TSNE embeddings #25365

@loftusa

Describe the bug

TSNE raises an error when the global output configuration is set to pandas with sklearn.set_config(transform_output="pandas").

I tracked down this bug in the sklearn codebase. The issue is here: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/manifold/_t_sne.py#L996

What's happening is that, under the set_output API, the internal PCA step returns a pandas DataFrame for X_embedded, with the columns named "pca0" and "pca1". So when X_embedded[:, 0] is evaluated, we get an InvalidIndexError, because a DataFrame has to be indexed with X_embedded.iloc[:, 0] in this situation.
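The indexing difference can be reproduced without TSNE. The snippet below is a minimal sketch, not scikit-learn code: the DataFrame is a hand-built stand-in for the PCA output, and `first_column_std` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Stand-in for the PCA output under transform_output="pandas";
# the column names mirror the ones mentioned above.
rng = np.random.default_rng(0)
X_embedded = pd.DataFrame(
    rng.normal(size=(35, 2)).astype(np.float32),
    columns=["pca0", "pca1"],
)

# NumPy-style tuple indexing is not valid on a DataFrame.
try:
    X_embedded[:, 0]
    tuple_indexing_failed = False
except Exception:  # the exact exception type may depend on the pandas version
    tuple_indexing_failed = True


def first_column_std(X):
    """Hypothetical helper: std of the first column of an ndarray or DataFrame."""
    return np.std(np.asarray(X)[:, 0])


# Positional indexing via .iloc and conversion via np.asarray agree.
std_via_iloc = np.std(X_embedded.iloc[:, 0].to_numpy())
std_via_asarray = first_column_std(X_embedded)
```

Converting with np.asarray has the advantage of handling both ndarray and DataFrame inputs with the same code path.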

A possible fix is to change line 996 to:
X_embedded = X_embedded / np.std(np.array(X_embedded)[:, 0]) * 1e-4

I am happy to open a PR for this unless somebody has a cleaner way.
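To illustrate the rescaling step in isolation, here is a rough sketch under stated assumptions. `rescale_pca_init` is a hypothetical helper, not part of scikit-learn, and unlike the one-line change above it converts the whole input to an ndarray up front rather than only inside np.std.

```python
import numpy as np
import pandas as pd


def rescale_pca_init(X_embedded, target_std=1e-4):
    """Rescale so the first column has standard deviation `target_std`.

    Hypothetical helper for illustration; accepts an ndarray or a DataFrame.
    """
    # Coerce to a float32 ndarray so NumPy-style indexing is always valid.
    values = np.asarray(X_embedded, dtype=np.float32)
    return values / np.std(values[:, 0]) * target_std


# Stand-in for a PCA output returned as a DataFrame.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.normal(size=(35, 2)), columns=["pca0", "pca1"])

rescaled = rescale_pca_init(frame)
rescaled_std = np.std(rescaled[:, 0])  # approximately 1e-4
```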

Cheers!

Steps/Code to Reproduce

import sklearn
import numpy as np
from sklearn.manifold import TSNE

sklearn.set_config(transform_output="pandas")
arr = np.arange(35*4).reshape(35, 4)
TSNE(n_components=2).fit_transform(arr)

Expected Results

No error is thrown, and a pandas DataFrame with two columns is returned.

Actual Results

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/pandas/core/indexes/base.py:3803, in Index.get_loc(self, key, method, tolerance)
   3802 try:
-> 3803     return self._engine.get_loc(casted_key)
   3804 except KeyError as err:

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/pandas/_libs/index.pyx:144, in pandas._libs.index.IndexEngine.get_loc()

TypeError: '(slice(None, None, None), 0)' is an invalid key

During handling of the above exception, another exception occurred:

InvalidIndexError                         Traceback (most recent call last)
Cell In[14], line 7
      5 sklearn.set_config(transform_output="pandas")
      6 arr = np.arange(35*4).reshape(35, 4)
----> 7 TSNE(n_components=2).fit_transform(arr)

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/sklearn/manifold/_t_sne.py:1117, in TSNE.fit_transform(self, X, y)
   1115 self._validate_params()
   1116 self._check_params_vs_input(X)
-> 1117 embedding = self._fit(X)
   1118 self.embedding_ = embedding
   1119 return self.embedding_

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/sklearn/manifold/_t_sne.py:996, in TSNE._fit(self, X, skip_num_points)
    993     X_embedded = pca.fit_transform(X).astype(np.float32, copy=False)
    994     # PCA is rescaled so that PC1 has standard deviation 1e-4 which is
    995     # the default value for random initialization. See issue #18018.
--> 996     X_embedded = X_embedded / np.std(X_embedded[:, 0]) * 1e-4
    997 elif self.init == "random":
    998     # The embedding is initialized with iid samples from Gaussians with
    999     # standard deviation 1e-4.
   1000     X_embedded = 1e-4 * random_state.standard_normal(
   1001         size=(n_samples, self.n_components)
   1002     ).astype(np.float32)

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/pandas/core/frame.py:3805, in DataFrame.__getitem__(self, key)
   3803 if self.columns.nlevels > 1:
   3804     return self._getitem_multilevel(key)
-> 3805 indexer = self.columns.get_loc(key)
   3806 if is_integer(indexer):
   3807     indexer = [indexer]

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/pandas/core/indexes/base.py:3810, in Index.get_loc(self, key, method, tolerance)
   3805         raise KeyError(key) from err
   3806     except TypeError:
   3807         # If we have a listlike key, _check_indexing_error will raise
   3808         #  InvalidIndexError. Otherwise we fall through and re-raise
   3809         #  the TypeError.
-> 3810         self._check_indexing_error(key)
   3811         raise
   3813 # GH#42269

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/pandas/core/indexes/base.py:5968, in Index._check_indexing_error(self, key)
   5964 def _check_indexing_error(self, key):
   5965     if not is_scalar(key):
   5966         # if key is not a scalar, directly raise an error (the code below
   5967         # would convert to numpy arrays and raise later any way) - GH29926
-> 5968         raise InvalidIndexError(key)

InvalidIndexError: (slice(None, None, None), 0)

Versions

System:
    python: 3.10.9 (main, Dec 12 2022, 21:10:20) [GCC 9.4.0]
executable: /home/aloftus/.pyenv/versions/3.10.9/bin/python3.10
   machine: Linux-5.4.0-128-generic-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.2.0
          pip: 22.3.1
   setuptools: 65.6.3
        numpy: 1.23.5
        scipy: 1.9.3
       Cython: None
       pandas: 1.5.2
   matplotlib: 3.6.2
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/aloftus/.pyenv/versions/3.10.9/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so
        version: 0.3.20
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 32

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/aloftus/.pyenv/versions/3.10.9/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
        version: 0.3.18
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 32

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/aloftus/.pyenv/versions/3.10.9/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 32

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/aloftus/.pyenv/versions/3.10.9/lib/python3.10/site-packages/torch/lib/libgomp-a34b3233.so.1
        version: None
    num_threads: 16
