8000 `pairwise_distances` is inconsistent with `scipy.spatial.distance` when using `metric="matching"` · Issue #25532 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

pairwise_distances is inconsistent with scipy.spatial.distance when using metric="matching" #25532

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
magnusbarata opened this issue Feb 2, 2023 · 13 comments · Fixed by #26264
Labels

Comments

@magnusbarata
Copy link
Contributor
magnusbarata commented Feb 2, 2023

Describe the bug

Although the metric matching is already removed from the documentation, pairwise_distances function still allows its usage. When used, the input arrays are converted into boolean. This brings inconsistency with the counterpart function cdist and pdist from scipy.spatial.distance (note that scipy.spatial.distance.matching has been completely removed since v1.10.0). In scipy's cdist and pdist, the metric matching is considered a synonym for hamming, which allows non-boolean inputs.

To address this issue, I can propose 2 solutions:

  1. Disallow matching usage as a metric. This fix will remove matching from metrics allowed on pairwise.py and sklearn.neighbors._base.py.
  2. Allow non-boolean inputs when using matching as a metric. This fix will keep it consistent to scipy's implementation.

Once the solution is decided, I can make a PR for it.

Steps/Code to Reproduce

import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cdist

x = np.array([[1, 0, -1, 1, 0, -1]])
y = np.array([[0, -1, 1, 1, 0, -1]])
print('pairwise_distances: ', pairwise_distances(x, y, metric='matching'))
print('scipy cdist: ', cdist(x, y, metric='matching'))

Expected Results

pairwise_distances:  [[0.5]]
scipy cdist:  [[0.5]]

Actual Results

/usr/local/lib/python3.11/site-packages/sklearn/metrics/pairwise.py:2025: DataConversionWarning: Data was converted to boolean for metric matching
  warnings.warn(msg, DataConversionWarning)
pairwise_distances:  [[0.33333333]]
scipy cdist:  [[0.5]]

Versions

System:
    python: 3.11.1 (main, Jan 23 2023, 21:39:49) [GCC 10.2.1 20210110]
executable: /usr/local/bin/python3
   machine: Linux-5.15.49-linuxkit-aarch64-with-glibc2.31

Python dependencies:
      sklearn: 1.2.1
          pip: 22.3.1
   setuptools: 65.5.1
        numpy: 1.24.1
        scipy: 1.10.0
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /usr/local/lib/python3.11/site-packages/scikit_learn.libs/libgomp-d22c30c5.so.1.0.0
        version: None
    num_threads: 5

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /usr/local/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-cecebdce.3.21.so
        version: 0.3.21
threading_layer: pthreads
   architecture: armv8
    num_threads: 5

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /usr/local/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-dff490c2.3.18.so
        version: 0.3.18
threading_layer: pthreads
   architecture: armv8
    num_threads: 5
@magnusbarata magnusbarata added Bug Needs Triage Issue requires triage labels Feb 2, 2023
@vitaliset
Copy link
Contributor
vitaliset commented Feb 6, 2023

Hello @magnusbarata!

When I Ctrl+F "matching" on the parwise.py file I found out that it was inside the PAIRWISE_BOOLEAN_FUNCTIONS list and decided to run your code for these "boolean" metrics as well (also that is related to the warning you showed so it made sense to take a look).

from sklearn import __version__
print(__version__)
>>> 1.2.0

import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import PAIRWISE_BOOLEAN_FUNCTIONS

X = np.array([[1, 0, -1, 1, 0, -1]])
Y = np.array([[0, -1, 1, 1, 0, -1]])
for metric in PAIRWISE_BOOLEAN_FUNCTIONS:
    if not (pairwise_distances(x, y, metric=metric) == cdist(x, y, metric=metric)):
        print(metric, end=" ")
>>> ...\pairwise.py:2025: DataConversionWarning: Data was converted to boolean for metric ...
>>> warnings.warn(msg, DataConversionWarning)
>>> jaccard matching 

I got the same bug. This time for "jaccard" also! So, maybe, it was something related to this conversion. Indeed it is.

I tried to follow the logic of the pairwise_distances function and ended up this minimal code:

metric = "matching"
kwds = {}

from functools import partial
from sklearn.metrics.pairwise import check_pairwise_arrays, _precompute_metric_params, _parallel_pairwise

dtype = bool if metric in PAIRWISE_BOOLEAN_FUNCTIONS else None

if dtype == bool and (X.dtype != bool or (Y is not None and Y.dtype != bool)):
    msg = "Data was converted to boolean for metric %s" % metric
    print(msg)

X, Y = check_pairwise_arrays(
    X, Y, dtype=dtype
)

# precompute data-derived metric params
params = _precompute_metric_params(X, Y, metric=metric, **kwds)
kwds.update(**params)

func = partial(cdist, metric=metric, **kwds)

_parallel_pairwise(X, Y, func, 1, **kwds)
>>> Data was converted to boolean for metric matching
>>> array([[0.33333333]])

We are getting the same result from the pairwise_distances function, but when I look at the transformed X and Y, we now have this:

print(X, Y, sep="\n")
>>> [[ True False  True  True False  True]]
>>> [[False  True  True  True False  True]]

Where we once had an -1, 1 pair (third component), now we have True, True (they are both non-zero) and they match. metric="jaccard" seems to be affected by this as well, but others PAIRWISE_BOOLEAN_FUNCTIONS' metrics don't. I don't know why without further looking into their logic.

Looking at the git blame, apparently, the transformation to boolean was introduced by #6932 to fix #4523. It made sense in that context because it looked like a binary scenario.

In my opinion, we should not treat this as a bug, mainly because of the warning we already have. If you want to use the "matching" distance for non-boolean arrays, you can try metric="hamming", and it will do what you want:

x = np.array([[1, 0, -1, 1, 0, -1]])
y = np.array([[0, -1, 1, 1, 0, -1]])

print(pairwise_distances(x, y, metric="hamming"))
>>> [[0.5]]

Edit: I wrote my answer without having read your description, just your code. I saw now that you already knew that the problem was in the transformation to bool and that you know about hamming. My bad (but I'll leave it here because of some extra information I found).
My concern with your first solution is to generate regressions such as messing up again the issue that introduced this transformation or removing this metric from NearestNeighbors. The second one looks better, but we already have hamming, maybe we want matching to be the boolean version of it... I don't know heheh The core contributors will be able to evaluate it better! :D

@magnusbarata
Copy link
Contributor Author

Hi @vitaliset! Thank you for reporting your problem.

Where we once had an -1, 1 pair (third component), now we have True, True (they are both non-zero) and they match. metric="jaccard" seems to be affected by this as well, but others PAIRWISE_BOOLEAN_FUNCTIONS' metrics don't. I don't know why without further looking into their logic.

This is because scikit-learn wraps scipy cdist or pdist. For the jaccard metric, your result was expected as the input should be boolean, but this isn't the case with matching. Also, as you had already pointed, in python, values other than 0 is considered trut 8000 hy, so -1 is converted to True.


In my opinion, we should not treat this as a bug, mainly because of the warning we already have...

For the bug that I reported, as the title suggests, I want to emphasize more on the inconsistency with scipy when matching is used as metric. If scikit-learn allows the usage of matching (it wasn't listed as an available metric in the documentation btw), I think it should be consistent with scipy, or at least document this behavior.


My concern with your first solution is to generate regressions...

I already understand the reason why the boolean transformation was introduced, but thank you for making it clear by adding the git blame. This is exactly the reason why I didn't create a PR, but ask for suggestions on this issue first. The second proposed solution is simply to remove matching from PAIRWISE_BOOLEAN_FUNCTIONS.

@ogrisel
Copy link
Member
ogrisel commented Feb 9, 2023

+1 for removing matching to be consistent with scipy.

Shall we do a deprecation cycle for this?

@ogrisel ogrisel removed the Needs Triage Issue requires triage label Feb 9, 2023
@jjerphan
Copy link
Member
jjerphan commented Feb 9, 2023

I think we must stay in sync with SciPy and deprecate metric="matching". In this case we might want to proceed similarly to Kulsinski's deprecation: #25417.

@jjerphan
Copy link
Member
jjerphan commented Feb 13, 2023

@magnusbarata: would you be interested in authoring a PR to fix this issue? 🙂

@magnusbarata
Copy link
Contributor Author

@ogrisel @jjerphan
Hi! Thank you for your comments!

Of course, I will create a PR for this issue! There are several things that I would like to clarify first:

  1. The PR will introduce a deprecation for the metric matching, instead of directly removing it.
  2. Why we should deprecate manhattan? Do you mean matching?
  3. If we should deprecate manhattan, should I do it in the same or separate PR?

@jjerphan
Copy link
Member

Yes, I meant matching (I have edited my comment).

@ramvikrams
Copy link
Contributor

I think we must stay in sync with SciPy and deprecate metric="matching". In this case we might want to proceed similarly to Kulsinski's deprecation: #25417.

means replacing it with cdist ?

@jjerphan
Copy link
Member

I meant deprecating metric="matching" for scikit-learn.

@ramvikrams
Copy link
Contributor

@jjerphan is there any similar PR for Deprecation other than the one mentioned so I can get a idea of how to deprecate it.

@magnusbarata
Copy link
Contributor Author

@jjerphan Sorry, I forgot to send the PR. I will send it by this Sunday.

@jjerphan
Copy link
Member
jjerphan commented Apr 6, 2023

It's fine. You are not on a SLA. 🙂

@magnusbarata
Copy link
Contributor Author

@jjerphan Hello, sorry for taking so long, but I just made the PR. Please kindly review it. 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants
0