-
-
Notifications
You must be signed in to change notification settings - Fork 26.4k
Closed
Labels
Milestone
Description
Describe the bug
TSNE doesn't work when the global config is changed to pandas.
I tracked down this bug in the sklearn codebase. The issue is here: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/manifold/_t_sne.py#L996
What's happening is that X_embedded returns a Pandas array under set_output API, with the columns being named "pca0" and "pca1". So when X_embedded[:, 0] is called, we get an IndexError, because you'd have to index with X_embedded.iloc[:, 0] in this situation.
Possible fix could be changing line 996 to this:
X_embedded = X_embedded / np.std(np.array(X_embedded)[:, 0]) * 1e-4
which I am happy to make a PR to do unless somebody has a cleaner way.
Cheers!
Steps/Code to Reproduce
import sklearn
import numpy as np
from sklearn.manifold import TSNE
sklearn.set_config(transform_output="pandas")
arr = np.arange(35*4).reshape(35, 4)
TSNE(n_components=2).fit_transform(arr)Expected Results
No error is thrown, a 2-dimensional pandas array is returned
Actual Results
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/pandas/core/indexes/base.py:3803, in Index.get_loc(self, key, method, tolerance)
3802 try:
-> 3803 return self._engine.get_loc(casted_key)
3804 except KeyError as err:
File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()
File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/pandas/_libs/index.pyx:144, in pandas._libs.index.IndexEngine.get_loc()
TypeError: '(slice(None, None, None), 0)' is an invalid key
During handling of the above exception, another exception occurred:
InvalidIndexError Traceback (most recent call last)
Cell In[14], line 7
5 sklearn.set_config(transform_output="pandas")
6 arr = np.arange(35*4).reshape(35, 4)
----> 7 TSNE(n_components=2).fit_transform(arr)
File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/sklearn/manifold/_t_sne.py:1117, in TSNE.fit_transform(self, X, y)
1115 self._validate_params()
1116 self._check_params_vs_input(X)
-> 1117 embedding = self._fit(X)
1118 self.embedding_ = embedding
1119 return self.embedding_
File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/sklearn/manifold/_t_sne.py:996, in TSNE._fit(self, X, skip_num_points)
993 X_embedded = pca.fit_transform(X).astype(np.float32, copy=False)
994 # PCA is rescaled so that PC1 has standard deviation 1e-4 which is
995 # the default value for random initialization. See issue #18018.
--> 996 X_embedded = X_embedded / np.std(X_embedded[:, 0]) * 1e-4
997 elif self.init == "random":
998 # The embedding is initialized with iid samples from Gaussians with
999 # standard deviation 1e-4.
1000 X_embedded = 1e-4 * random_state.standard_normal(
1001 size=(n_samples, self.n_components)
1002 ).astype(np.float32)
File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/pandas/core/frame.py:3805, in DataFrame.__getitem__(self, key)
3803 if self.columns.nlevels > 1:
3804 return self._getitem_multilevel(key)
-> 3805 indexer = self.columns.get_loc(key)
3806 if is_integer(indexer):
3807 indexer = [indexer]
File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/pandas/core/indexes/base.py:3810, in Index.get_loc(self, key, method, tolerance)
3805 raise KeyError(key) from err
3806 except TypeError:
3807 # If we have a listlike key, _check_indexing_error will raise
3808 # InvalidIndexError. Otherwise we fall through and re-raise
3809 # the TypeError.
-> 3810 self._check_indexing_error(key)
3811 raise
3813 # GH#42269
File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/pandas/core/indexes/base.py:5968, in Index._check_indexing_error(self, key)
5964 def _check_indexing_error(self, key):
5965 if not is_scalar(key):
5966 # if key is not a scalar, directly raise an error (the code below
5967 # would convert to numpy arrays and raise later any way) - GH29926
-> 5968 raise InvalidIndexError(key)
InvalidIndexError: (slice(None, None, None), 0)
Versions
System:
python: 3.10.9 (main, Dec 12 2022, 21:10:20) [GCC 9.4.0]
executable: /home/aloftus/.pyenv/versions/3.10.9/bin/python3.10
machine: Linux-5.4.0-128-generic-x86_64-with-glibc2.31
Python dependencies:
sklearn: 1.2.0
pip: 22.3.1
setuptools: 65.6.3
numpy: 1.23.5
scipy: 1.9.3
Cython: None
pandas: 1.5.2
matplotlib: 3.6.2
joblib: 1.2.0
threadpoolctl: 3.1.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: /home/aloftus/.pyenv/versions/3.10.9/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so
version: 0.3.20
threading_layer: pthreads
architecture: SkylakeX
num_threads: 32
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: /home/aloftus/.pyenv/versions/3.10.9/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
version: 0.3.18
threading_layer: pthreads
architecture: SkylakeX
num_threads: 32
user_api: openmp
internal_api: openmp
prefix: libgomp
filepath: /home/aloftus/.pyenv/versions/3.10.9/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
version: None
num_threads: 32
user_api: openmp
internal_api: openmp
prefix: libgomp
filepath: /home/aloftus/.pyenv/versions/3.10.9/lib/python3.10/site-packages/torch/lib/libgomp-a34b3233.so.1
version: None
num_threads: 16