sklearn.set_config(transform_output="pandas") breaks TSNE embeddings #25365
Comments
Thanks for the detailed bug report, it makes the bug easy to reproduce. Best fix might be to use |
I wonder what a good way would be to find estimators that are used internally in another estimator. Those would be prime candidates for suffering from the same bug as this. |
The estimators below raise some kind of exception when used like this:
    X, y = load_iris(as_frame=True, return_X_y=True)
    with config_context(transform_output="pandas"):
        est = Estimator()
        est.fit_transform(X, y)
List of failures:
Might be worth looking through them to see what the problem is for each one. Most of them are probably spurious (at least I didn't spot anything on a quick look through the exceptions that were raised). |
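A minimal sketch of how such a scan could be automated, for anyone who wants to repeat it. The helper name find_pandas_output_failures and the NeedsArgs class are hypothetical and only serve the illustration; the candidate list here is deliberately tiny, and it assumes a scikit-learn version that supports transform_output.

```python
from sklearn import config_context
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


class NeedsArgs:
    """Hypothetical class that cannot be built without arguments."""

    def __init__(self, required):
        self.required = required


def find_pandas_output_failures(estimator_classes):
    """Return {name: exception} for classes that raise under pandas output."""
    X, y = load_iris(as_frame=True, return_X_y=True)
    failures = {}
    for name, Estimator in estimator_classes:
        try:
            with config_context(transform_output="pandas"):
                est = Estimator()
                # Only estimators exposing fit_transform are exercised.
                if not hasattr(est, "fit_transform"):
                    continue
                est.fit_transform(X, y)
        except Exception as exc:
            # Any exception is recorded, including spurious ones such as
            # a constructor that requires arguments.
            failures[name] = exc
    return failures


# Small hand-picked list; sklearn.utils.all_estimators(type_filter="transformer")
# returns (name, class) pairs in the same shape for a full scan.
candidates = [("PCA", PCA), ("StandardScaler", StandardScaler), ("NeedsArgs", NeedsArgs)]
failures = find_pandas_output_failures(candidates)
for name, exc in failures.items():
    print(f"{name}: {type(exc).__name__}")
```

Catching every exception keeps the scan going, but it also means constructor errors end up in the result, so each entry still has to be inspected by hand.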
So there is something wrong with the test there: scikit-learn/sklearn/tests/test_common.py Lines 623 to 635 in 8b06f6a
|
In the list above, |
I think the list of |
I will try to make |
I stochastically sampled a few more of the estimators I listed above. None of them looked like they had the same problem as TSNE. So I think we can keep this issue closed and see what comes out of #25374. |
Describe the bug
TSNE fails when the global config is set to transform_output="pandas".
I tracked down this bug in the sklearn codebase. The issue is here: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/manifold/_t_sne.py#L996
What's happening is that X_embedded is returned as a pandas DataFrame under the set_output API, with the columns named "pca0" and "pca1". So when X_embedded[:, 0] is called, we get an IndexError, because you'd have to index with X_embedded.iloc[:, 0] in this situation.
A possible fix would be to change line 996 to this:
X_embedded = X_embedded / np.std(np.array(X_embedded)[:, 0]) * 1e-4
which I am happy to make a PR to do unless somebody has a cleaner way.
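To make the indexing problem concrete, here is a small sketch that does not touch scikit-learn at all. It assumes a DataFrame with made-up values and the "pca0"/"pca1" column names standing in for X_embedded; the exact exception type raised by the NumPy-style index may differ between pandas versions, so it is caught generically.

```python
import numpy as np
import pandas as pd

# Stand-in for the embedding TSNE receives when pandas output is active.
# The values are random and only serve the illustration.
rng = np.random.default_rng(0)
X_embedded = pd.DataFrame(rng.normal(size=(5, 2)), columns=["pca0", "pca1"])

# NumPy-style positional indexing is not valid on a DataFrame.
try:
    X_embedded[:, 0]
    numpy_style_ok = True
except Exception as exc:
    numpy_style_ok = False
    print(f"X_embedded[:, 0] raised {type(exc).__name__}")

# Converting to an ndarray first works for both arrays and DataFrames.
first_col_std = np.std(np.array(X_embedded)[:, 0])
rescaled = X_embedded / first_col_std * 1e-4

# The first column now has a standard deviation of about 1e-4.
print(rescaled.to_numpy()[:, 0].std())
```

Note that with this change the rescaled result is still a DataFrame when the input was one; only the standard deviation is computed on a plain array.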
Cheers!
Steps/Code to Reproduce
Expected Results
No error is thrown; a 2-dimensional pandas DataFrame is returned.
Actual Results
Versions