-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
pairwise_distances
is inconsistent with scipy.spatial.distance
when using metric="matching"
#25532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Hello @magnusbarata! When I Ctrl+F "matching" on the parwise.py file I found out that it was inside the from sklearn import __version__
print(__version__)
>>> 1.2.0
import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import PAIRWISE_BOOLEAN_FUNCTIONS
X = np.array([[1, 0, -1, 1, 0, -1]])
Y = np.array([[0, -1, 1, 1, 0, -1]]) for metric in PAIRWISE_BOOLEAN_FUNCTIONS:
if not (pairwise_distances(x, y, metric=metric) == cdist(x, y, metric=metric)):
print(metric, end=" ")
>>> ...\pairwise.py:2025: DataConversionWarning: Data was converted to boolean for metric ...
>>> warnings.warn(msg, DataConversionWarning)
>>> jaccard matching I got the same bug. This time for "jaccard" also! So, maybe, it was something related to this conversion. Indeed it is. I tried to follow the logic of the metric = "matching"
kwds = {}
from functools import partial
from sklearn.metrics.pairwise import check_pairwise_arrays, _precompute_metric_params, _parallel_pairwise
dtype = bool if metric in PAIRWISE_BOOLEAN_FUNCTIONS else None
if dtype == bool and (X.dtype != bool or (Y is not None and Y.dtype != bool)):
msg = "Data was converted to boolean for metric %s" % metric
print(msg)
X, Y = check_pairwise_arrays(
X, Y, dtype=dtype
)
# precompute data-derived metric params
params = _precompute_metric_params(X, Y, metric=metric, **kwds)
kwds.update(**params)
func = partial(cdist, metric=metric, **kwds)
_parallel_pairwise(X, Y, func, 1, **kwds)
>>> Data was converted to boolean for metric matching
>>> array([[0.33333333]]) We are getting the same result from the print(X, Y, sep="\n")
>>> [[ True False True True False True]]
>>> [[False True True True False True]] Where we once had an Looking at the git blame, apparently, the transformation to boolean was introduced by #6932 to fix #4523. It made sense in that context because it looked like a binary scenario. In my opinion, we should not treat this as a bug, mainly because of the warning we already have. If you want to use the "matching" distance for non-boolean arrays, you can try x = np.array([[1, 0, -1, 1, 0, -1]])
y = np.array([[0, -1, 1, 1, 0, -1]])
print(pairwise_distances(x, y, metric="hamming"))
>>> [[0.5]] Edit: I wrote my answer without having read your description, just your code. I saw now that you already knew that the problem was in the transformation to bool and that you know about hamming. My bad (but I'll leave it here because of some extra information I found). |
Hi @vitaliset! Thank you for reporting your problem.
This is because scikit-learn wraps scipy
For the bug that I reported, as the title suggests, I want to emphasize more on the inconsistency with scipy when
I already understand the reason why the boolean transformation was introduced, but thank you for making it clear by adding the git blame. This is exactly the reason why I didn't create a PR, but ask for suggestions on this issue first. The second proposed solution is simply to remove |
+1 for removing matching to be consistent with scipy. Shall we do a deprecation cycle for this? |
I think we must stay in sync with SciPy and deprecate |
@magnusbarata: would you be interested in authoring a PR to fix this issue? 🙂 |
@ogrisel @jjerphan Of course, I will create a PR for this issue! There are several things that I would like to clarify first:
|
Yes, I meant |
means replacing it with |
I meant deprecating |
@jjerphan is there any similar PR for Deprecation other than the one mentioned so I can get a idea of how to deprecate it. |
@jjerphan Sorry, I forgot to send the PR. I will send it by this Sunday. |
It's fine. You are not on a SLA. 🙂 |
@jjerphan Hello, sorry for taking so long, but I just made the PR. Please kindly review it. 🙏 |
Uh oh!
There was an error while loading. Please reload this page.
Describe the bug
Although the metric
matching
is already removed from the documentation,pairwise_distances
function still allows its usage. When used, the input arrays are converted into boolean. This brings inconsistency with the counterpart functioncdist
andpdist
fromscipy.spatial.distance
(note thatscipy.spatial.distance.matching
has been completely removed since v1.10.0). In scipy'scdist
andpdist
, the metricmatching
is considered a synonym forhamming
, which allows non-boolean inputs.To address this issue, I can propose 2 solutions:
matching
usage as a metric. This fix will removematching
from metrics allowed onpairwise.py
andsklearn.neighbors._base.py
.matching
as a metric. This fix will keep it consistent to scipy's implementation.Once the solution is decided, I can make a PR for it.
Steps/Code to Reproduce
Expected Results
Actual Results
Versions
The text was updated successfully, but these errors were encountered: