Description
Describe the bug
Although the metric matching has already been removed from the documentation, the pairwise_distances function still accepts it. When it is used, the input arrays are converted to boolean. This is inconsistent with the counterpart functions cdist and pdist from scipy.spatial.distance (note that scipy.spatial.distance.matching has been completely removed since v1.10.0). In scipy's cdist and pdist, the metric matching is treated as a synonym for hamming, which allows non-boolean inputs.
To address this issue, I can propose 2 solutions:
- Disallow matching as a metric. This fix would remove matching from the metrics allowed in pairwise.py and sklearn.neighbors._base.py.
- Allow non-boolean inputs when using matching as a metric. This fix would keep the behavior consistent with scipy's implementation.
Once the solution is decided, I can make a PR for it.
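To make the two options concrete, here is a minimal sketch of how metric resolution could behave under each one. The helper resolve_metric, its policy argument, and the BOOLEAN_METRICS set are hypothetical names used only for illustration; they are not taken from scikit-learn's code, and the real list of boolean metrics may differ.

```python
# Hypothetical, abbreviated set of metrics whose inputs get cast to boolean.
BOOLEAN_METRICS = {"dice", "jaccard", "matching", "yule"}


def resolve_metric(metric, policy):
    """Return (metric_name, convert_to_bool) for a requested metric.

    policy="disallow" sketches option 1: reject "matching" outright.
    policy="alias" sketches option 2: treat "matching" as "hamming"
    and skip the boolean conversion.
    """
    if metric != "matching":
        return metric, metric in BOOLEAN_METRICS
    if policy == "disallow":
        raise ValueError("metric 'matching' is not supported; use 'hamming'")
    if policy == "alias":
        return "hamming", False
    raise ValueError(f"unknown policy: {policy!r}")


print(resolve_metric("matching", "alias"))
print(resolve_metric("jaccard", "alias"))
```

Under the "alias" policy the caller gets hamming back with no boolean conversion, which corresponds to the scipy behavior described above; under "disallow" the call fails early with an explicit error.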
Steps/Code to Reproduce
import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cdist
x = np.array([[1, 0, -1, 1, 0, -1]])
y = np.array([[0, -1, 1, 1, 0, -1]])
print('pairwise_distances: ', pairwise_distances(x, y, metric='matching'))
print('scipy cdist: ', cdist(x, y, metric='matching'))
Expected Results
pairwise_distances: [[0.5]]
scipy cdist: [[0.5]]
Actual Results
/usr/local/lib/python3.11/site-packages/sklearn/metrics/pairwise.py:2025: DataConversionWarning: Data was converted to boolean for metric matching
warnings.warn(msg, DataConversionWarning)
pairwise_distances: [[0.33333333]]
scipy cdist: [[0.5]]
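The gap between the two numbers can be reproduced by hand with NumPy alone. The sketch below assumes the usual Hamming definition (the fraction of positions that differ) and uses a helper, hamming_fraction, that is introduced here only for illustration. It compares the raw arrays with the same arrays cast to boolean, where -1 and 1 both become True.

```python
import numpy as np

x = np.array([1, 0, -1, 1, 0, -1])
y = np.array([0, -1, 1, 1, 0, -1])


def hamming_fraction(a, b):
    """Fraction of positions at which a and b differ."""
    return float(np.mean(a != b))


# Raw values: positions 0, 1 and 2 differ, so 3 of 6 mismatch.
raw = hamming_fraction(x, y)

# After casting to bool, -1 and 1 both map to True, so position 2
# no longer counts as a mismatch and only 2 of 6 differ.
as_bool = hamming_fraction(x.astype(bool), y.astype(bool))

print("raw values:      ", raw)
print("cast to boolean: ", as_bool)
```

The raw comparison gives 0.5, matching the cdist output, while the boolean comparison gives 1/3, matching the value returned by pairwise_distances.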
Versions
System:
python: 3.11.1 (main, Jan 23 2023, 21:39:49) [GCC 10.2.1 20210110]
executable: /usr/local/bin/python3
machine: Linux-5.15.49-linuxkit-aarch64-with-glibc2.31
Python dependencies:
sklearn: 1.2.1
pip: 22.3.1
setuptools: 65.5.1
numpy: 1.24.1
scipy: 1.10.0
Cython: None
pandas: None
matplotlib: None
joblib: 1.2.0
threadpoolctl: 3.1.0
Built with OpenMP: True
threadpoolctl info:
user_api: openmp
internal_api: openmp
prefix: libgomp
filepath: /usr/local/lib/python3.11/site-packages/scikit_learn.libs/libgomp-d22c30c5.so.1.0.0
version: None
num_threads: 5
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: /usr/local/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-cecebdce.3.21.so
version: 0.3.21
threading_layer: pthreads
architecture: armv8
num_threads: 5
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: /usr/local/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-dff490c2.3.18.so
version: 0.3.18
threading_layer: pthreads
architecture: armv8
num_threads: 5