8000 NearestCentroid not handling properly distance metrics other than Manhattan or Euclidean. · Issue #23890 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

NearestCentroid not handling properly distance metrics other than Manhattan or Euclidean. #23890

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Valentin-Laurent opened this issue Jul 12, 2022 · 4 comments

Comments

@Valentin-Laurent
Copy link
Contributor

Describe the bug

Three issues, quite related.

  • Using a metric that is not Manhattan or Euclidean triggers the following warning: UserWarning: Averaging for metrics other than euclidean and manhattan not supported. The average is set to be the mean., but I believe an error message should be thrown instead. Indeed, computing the centroids as the average of the class samples is wrong for (I think) all distances except Euclidean, so it is misleading to actually do it even with a warning. Throwing an error here would fix the 2 following issues.

  • Calling .predict on a NearestCentroid with metric wminkowski, seuclidean or mahalanobis throws an error, because those metrics need params that we can't pass to NearestCentroid. DOC Improve doc of Nearest Neighbors metric param #23806 would fix the documentation, but maybe it should be nice to handle this properly in the code.

  • Using any string as a metric (even a random string like foobar) does not throws an error when fitting, but does when calling .predict. Doing this with another classifier like KNeighborsClassifier throws an error at the fitting step.

If relevant I'd like to work on fixing those issues (new contributor here :)) I noticed a stalled PR (#19262) related to implementing partial_fit to NearestCentroid. Maybe I should start working on it first, as it seems to improve the code quality of _nearest_centroid.py, so it would be an interesting base to start with.

Steps/Code to Reproduce

#Reproduces the second issue mentionned

from sklearn.neighbors import NearestCentroid
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
nearest_centroid = NearestCentroid(metric="seuclidean")
nearest_centroid.fit(X,y)
nearest_centroid.predict(X)

Expected Results

Fitting a NearestCentroid with a metric that is not Manhattan or Euclidean should trigger a proper error message.

Actual Results

Error related to the second issue:
ValueError: The 'V' parameter is required for the seuclidean metric when Y is passed.

Versions

System:
    python: 3.8.6 (default, Mar 29 2021, 18:42:37)  [Clang 12.0.0 (clang-1200.0.32.29)]
executable: /Users/valentinlaurent/.pyenv/versions/3.8.6/envs/lewagon/bin/python3.8
   machine: macOS-10.15.7-x86_64-i386-64bit

Python dependencies:
          pip: 22.1.1
   setuptools: 49.2.1
      sklearn: 1.0.2
        numpy: 1.19.5
        scipy: 1.6.2
       Cython: 0.29.26
       pandas: 1.1.3
   matplotlib: 3.4.1
       joblib: 1.1.0
threadpoolctl: 2.1.0

Built with OpenMP: True
@ogrisel
Copy link
Member
ogrisel commented Jul 13, 2022

I agree with your analysis. Feel free to open a PR that throws a ValueError asap in fit when metric not in ["manhattan", "euclidean"].

This could be achieved cleanly with the help of the new parameter validation/constraints API: #23462.

@ogrisel
Copy link
Member
ogrisel commented Jul 13, 2022

Actually there is already an open PR here: #23874. It would need to be updated accordingly and the change document in the changelog (as a bugfix).

@fkdosilovic
Copy link
Contributor
fkdosilovic commented Dec 29, 2023

It seems that this issue can be closed? @ogrisel

@jeremiedbb
Copy link
Member

Indeed, fixed by #24083. Let's close

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants
0