10000 Specifying 'cosine' as metric in KDTree throws error · Issue #25364 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Specifying 'cosine' as metric in KDTree throws error #25364

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
01-vyom opened this issue Jan 11, 2023 · 6 comments · Fixed by #25482
Closed

Specifying 'cosine' as metric in KDTree throws error #25364

01-vyom opened this issue Jan 11, 2023 · 6 comments · Fixed by #25482

Comments

@01-vyom
Copy link
Contributor
01-vyom commented Jan 11, 2023

Describe the bug

I am trying to implement the KDTree Algorithm with cosine as a distance metric. I first started with scipy's implementation, but it didn't support cosine as a metric. Then, I came across sklearn implementation. The documentation indicates that the supported metric can be found (scipy.spatial.distance) and (distance_metrics). Both of these places show that the cosine metric is supported. Also, the user guide also suggests that cosine is supported as it tries to point out cosine as cosine_distance. However, it still throws an error for me when I try to use that metric.

Steps/Code to Reproduce

import numpy as np
from sklearn.neighbors import KDTree
rng = np.random.RandomState(0)
X = rng.random_sample((10, 3))  # 10 points in 3 dimensions
tree = KDTree(X, leaf_size=2, metric= "cosine")              

Expected Results

No error is thrown

Actual Results

Traceback (most recent call last):
  File "sklearn/metrics/_dist_metrics.pyx", line 270, in sklearn.metrics._dist_metrics.DistanceMetric.get_metric
KeyError: 'cosine'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "sklearn/neighbors/_binary_tree.pxi", line 844, in sklearn.neighbors._kd_tree.BinaryTree.__init__
  File "sklearn/metrics/_dist_metrics.pyx", line 272, in sklearn.metrics._dist_metrics.DistanceMetric.get_metric
ValueError: Unrecognized metric 'cosine'

Versions

System:
    python: 3.8.10 (default, Mar 15 2022, 12:22:08)  [GCC 9.4.0]
executable: /bin/python3
   machine: Linux-5.4.0-132-generic-x86_64-with-glibc2.29

Python dependencies:
      sklearn: 1.2.0
          pip: 22.3.1
   setuptools: 65.6.3
        numpy: 1.23.3
        scipy: 1.9.3
       Cython: 0.29.30
       pandas: 1.4.2
   matplotlib: 3.6.2
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /usr/local/lib/python3.8/dist-packages/numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so
        version: 0.3.20
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 28

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /usr/local/lib/python3.8/dist-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 28

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /usr/local/lib/python3.8/dist-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
        version: 0.3.18
threading_layer: pthreads
   architecture: SkylakeX
    num_threads: 28

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /usr/local/lib/python3.8/dist-packages/torch/lib/libgomp-a34b3233.so.1
        version: None
    num_threads: 14
@01-vyom 01-vyom added Bug Needs Triage Issue requires triage labels Jan 11, 2023
@vitaliset
Copy link
Contributor

Hi @01-vyom! Thanks for the issue.

As far as I can see, sklearn.neighbors.KDTree can only be used with Minkowski distances (with $p \in [1, \infty )$):

from sklearn.neighbors import KDTree
KDTree.valid_metrics
>>> ['euclidean', 'l2', 'minkowski', 'p', 'manhattan', 'cityblock', 'l1', 'chebyshev', 'infinity']

For some reason, the documentation here:

image

is rendered in a way that makes it not obvious to run the code above (because of the lowercase and the underline).

When KDTree (and BallTree) inherit the docs from BinaryTree, the class name is given to BinaryTree, not binary_tree.

DOC_DICT = {'BinaryTree': 'KDTree', 'binary_tree': 'kd_tree'}

metric : string or DistanceMetric object
the distance metric to use for the tree. Default='minkowski'
with p=2 (that is, a euclidean metric). See the documentation
of the DistanceMetric class for a list of available metrics.
{binary_tree}.valid_metrics gives a list of the metrics which
are valid for {BinaryTree}.

So maybe changing here to BinaryTree can help:

{binary_tree}.valid_metrics gives a list of the metrics which

As for the user guide, I agree that, in 1.6.1.2, the text

For a list of available metrics, see the documentation of the DistanceMetric class and the metrics listed in sklearn.metrics.pairwise.PAIRWISE_DISTANCE_FUNCTIONS. Note that the “cosine” metric uses cosine_distances.

makes it confusing to understand that you should not use "cosine" for KDTree and probably should be erased (or moved to another part of the text, such as in 1.6.1.1).

@01-vyom
Copy link
Contributor Author
01-vyom commented Jan 12, 2023

Hi @vitaliset, Thanks for clarifying!
Ok, this makes a lot of sense. Seems like an easy fix for the first one. I could make a PR for the same.

Also, for the second one, I would say that we could move it to 1.6.1.1, but we might still need to say that it is not supported by algorithms the BallTree or the KDTree algorithm when used by the NearestNeighbor algorithm in 1.6.1.1. Or we could erase it completely (this seems like a better option as cosine is too specific for such a general example in section 1.6.1.1. What would you suggest (other suggestions welcome!)?

@vitaliset
Copy link
Contributor

I agree that we could erase the part about the cosine metric, but I like the first phrase, and I think it is important to be somewhere:

For a list of available metrics, see the documentation of the DistanceMetric class and the metrics listed in sklearn.metrics.pairwise.PAIRWISE_DISTANCE_FUNCTIONS.

Let's see what others have to say!

@jjerphan jjerphan removed the Needs Triage Issue requires triage label Jan 12, 2023
@jjerphan
Copy link
Member
jjerphan commented Jan 12, 2023

Thanks for reporting this problem!

The line you quoted is misleading and the last part (i.e. "and the metrics listed in sklearn.metrics.pairwise.PAIRWISE_DISTANCE_FUNCTIONS.") must be removed, yes. Moreover, I think we can add indications in this section of the user guide to mention using {BallTree,KDTree}.valid_metrics and to disambiguate distance metrics and distances.

Is one of you interesting in authoring such improvement to scikit-learn's documentation? 🙂

@01-vyom
Copy link
Contributor Author
01-vyom commented Jan 12, 2023

I am interested in authoring improvements. Should I create 2 separate PRs for the first (KDTree Docs.) as well as the second problem (user guide section)? Or Should I create one single PR?

@jjerphan
Copy link
Member

Both are performable as part of an unique PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants
0