Jaccard distance in trees very different from pairwise_distances jaccard distance. · Issue #4523 · scikit-learn/scikit-learn · GitHub

Jaccard distance in trees very different from pairwise_distances jaccard distance. #4523


Closed
amueller opened this issue Apr 5, 2015 · 22 comments

Comments

@amueller
Member
amueller commented Apr 5, 2015

As observed in #4522, BallTree and pairwise_distances have very different results for metric="jaccard":

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.uniform(size=(6, 5))

nn = NearestNeighbors(metric="jaccard", algorithm='brute').fit(X)
print(nn.kneighbors(X)[0])
nn = NearestNeighbors(metric="jaccard", algorithm='ball_tree').fit(X)
print(nn.kneighbors(X)[0])

[[ 0. 1. 1. 1. 1.]
[ 0. 1. 1. 1. 1.]
[ 0. 1. 1. 1. 1.]
[ 0. 1. 1. 1. 1.]
[ 0. 1. 1. 1. 1.]
[ 0. 1. 1. 1. 1.]]
[[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]]
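The two outputs make sense once the two definitions in play are spelled out. Below is a minimal sketch in plain numpy (`jaccard_extended` and `jaccard_binary` are illustrative names, not scikit-learn functions): on continuous data every coordinate is nonzero and pairwise unequal, so the scipy-style "extended" definition returns 1 everywhere while the cast-to-boolean definition returns 0.

```python
import numpy as np

def jaccard_extended(u, v):
    # scipy-style "extended" Jaccard: positions where u and v disagree,
    # normalized by positions where either is nonzero.
    nonzero = (u != 0) | (v != 0)
    unequal = (u != v) & nonzero
    return unequal.sum() / nonzero.sum()

def jaccard_binary(u, v):
    # BallTree-style Jaccard: the data is cast to boolean first.
    u, v = u != 0, v != 0
    nonzero = u | v
    return (u != v).sum() / nonzero.sum()

rng = np.random.RandomState(0)
u, v = rng.uniform(size=5), rng.uniform(size=5)

# Uniform samples are all nonzero and all unequal, so the extended
# definition gives 1 and the boolean one gives 0.
print(jaccard_extended(u, v))  # 1.0
print(jaccard_binary(u, v))    # 0.0
```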

@jnothman
Member
jnothman commented Apr 6, 2015

Oh, that doesn't look good. Both pairwise_distances and sklearn.neighbors.dist_metrics should be tested against scipy.spatial.distance where possible.


@amueller
Member Author
amueller commented Apr 6, 2015

Yeah. Well, pairwise_distances is calling scipy.spatial.distance here, but dist_metrics does something else.

@amueller
Member Author
amueller commented Apr 6, 2015

See the tests in #4522

@jakevdp
Member
jakevdp commented May 5, 2015

Relevant discussion on the SciPy-dev mailing list: http://mail.scipy.org/pipermail/scipy-dev/2012-December/018129.html

I think the "extended" Jaccard distance used by scipy is not actually a true metric, which is why BallTree casts to bools and uses the true (binary) Jaccard metric. Under the scipy definition, the BallTree search would fail.

@amueller
Member Author
amueller commented May 5, 2015

@jakevdp Do you have a good idea to resolve this? I find the current state pretty catastrophic (different metrics are used depending on the size of the data).
Maybe we should just implement our own boolean brute metric?

@amueller amueller added this to the 0.16.2 milestone May 5, 2015
@jakevdp
Member
jakevdp commented May 5, 2015

Maybe we should just implement our own boolean brute metric?

I think that's probably best. I do believe that the scipy implementation is technically not correct.

@jakevdp
Member
jakevdp commented May 5, 2015

Note that the unit tests in the sklearn.neighbors module explicitly only test boolean metrics with boolean values, which is why this wasn't caught by the tests. It's a simple case of GIGO, so perhaps we should be more explicit in the documentation, and/or raise a warning if the user passes non-boolean data to a boolean metric.
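On 0/1-valued input the scipy definition and the cast-to-boolean variant do agree, which is why tests restricted to boolean values could not catch the discrepancy. A quick check of that agreement, using scipy directly rather than sklearn's dist_metrics:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
X = rng.randint(2, size=(6, 5)).astype(float)  # 0/1-valued floats
X[:, 0] = 1  # avoid all-zero rows, whose Jaccard distance is degenerate (0/0)

# On boolean-valued data, scipy's "extended" Jaccard and the
# cast-to-bool variant give identical results.
d_float = pdist(X, metric="jaccard")
d_bool = pdist(X.astype(bool), metric="jaccard")
print(np.allclose(d_float, d_bool))  # True
```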

@ogrisel
Member
ogrisel commented May 7, 2015

+1 for at least raising a warning, maybe even an exception.

@amueller
Member Author
amueller commented May 7, 2015

If we raise an exception, we will not have to reimplement, but it will break people's code. Reimplementing plus a warning seems like a good idea, maybe?

@jakevdp
Member
jakevdp commented May 7, 2015

Rather than re-implement, we could add a validation check for all the boolean metrics. Basically something like

if np.any((X != 1) & (X != 0)):
    warnings.warn("casting data to boolean for {0} metric".format(metric))
    X = (X != 0).astype(float)

Then the current code should be consistent between scipy and scikit-learn (and is tested as such with our current unit tests).
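Wrapped in a helper, such a check would behave like this on the data from the original report (`check_boolean_input` is a hypothetical name for illustration; the condition warns only when a value other than 0 or 1 is present):

```python
import warnings
import numpy as np

def check_boolean_input(X, metric):
    # Hypothetical helper: warn when a boolean metric receives
    # non-boolean data, then cast to 0/1 floats.
    X = np.asarray(X)
    if np.any((X != 1) & (X != 0)):
        warnings.warn("casting data to boolean for {0} metric".format(metric))
        X = (X != 0).astype(float)
    return X

rng = np.random.RandomState(0)
X = rng.uniform(size=(6, 5))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    Xb = check_boolean_input(X, "jaccard")

print(len(caught))    # 1: the conversion was reported
print(np.unique(Xb))  # [1.]: every uniform sample is nonzero
```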

@amueller
Member Author
amueller commented May 7, 2015

Why would you convert back to float in the end? Because the trees need that?

@jakevdp
Member
jakevdp commented May 7, 2015

Hmm, I can't remember what input type they assume. Let me check

@amueller
Member Author
amueller commented May 7, 2015

OK, let's raise a DataConversionWarning. As the previous behavior was inconsistent, I think the behavior change is "ok".

@jakevdp
Member
jakevdp commented May 7, 2015

So in the BallTree code, all data is converted to float eventually anyway, because of the type consistency required in the tree.

All the unit tests compare the results of float ones and zeros passed to the distance metrics and the corresponding scipy metrics.

@amueller
Member Author
amueller commented May 7, 2015

OK, +1 on basically your snippet then. Do you want to do the PR or should I?

@jakevdp
Member
jakevdp commented May 7, 2015

I can tackle it. All I'm doing today is github stuff anyway 😄

@jakevdp
Member
jakevdp commented May 7, 2015

OK, all of a sudden I'm confused about where this should go. Do we want all pairwise_distance calculations to do this for boolean metrics, or are we just worried about routines with e.g. algorithm='brute' and algorithm='ball_tree'?

I'm worried that there may be use cases for the non-boolean applications of the boolean metrics that I don't know about.

@amueller
Member Author
amueller commented May 7, 2015

I thought you claimed that was a bug ;)
I think we should have consistent behavior everywhere, so it should probably go in pairwise_distances.

@jakevdp
Member
jakevdp commented May 7, 2015

It's a bug if you expect the results to conform to the definition of a metric. But there may be cases where the extended non-metric definition is useful; I just don't know of any.

@amueller
Member Author

@TomDLT if you like you can have a look at this one. The consensus seems to be to cast float data to boolean data when a boolean metric is requested and raise a warning.

@amueller amueller modified the milestones: 0.16.2, 0.17 Sep 8, 2015
@tomMoral
Contributor

We looked into it today. Our understanding of the consensus is:

  • We should add a list PAIRWISE_BOOLEAN_METRIC in pairwise.py and, for those metrics, check in pairwise_distances that the input array is boolean before calling pdist.
  • Otherwise, we raise a warning and cast the array to boolean.

Is that correct? Shouldn't we switch to using the dist_metrics implementation of the metric?
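A minimal sketch of that plan (`PAIRWISE_BOOLEAN_METRIC` is the list name proposed above; its exact contents and the placement in pairwise.py are assumptions, and scipy supports more boolean metrics than listed here):

```python
import warnings
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Assumed contents of the proposed list of boolean metrics.
PAIRWISE_BOOLEAN_METRIC = ["dice", "jaccard", "rogerstanimoto",
                           "russellrao", "sokalsneath", "yule"]

def pairwise_distances_sketch(X, metric):
    # Sketch of the proposed check: for boolean metrics, warn on
    # non-boolean input and cast before delegating to scipy's pdist.
    X = np.asarray(X)
    if metric in PAIRWISE_BOOLEAN_METRIC and X.dtype != bool:
        if np.any((X != 0) & (X != 1)):
            warnings.warn("Data was converted to boolean for metric %s" % metric)
        X = X.astype(bool)
    return squareform(pdist(X, metric=metric))

rng = np.random.RandomState(0)
D = pairwise_distances_sketch(rng.uniform(size=(6, 5)), "jaccard")
print(D.max())  # 0.0: every value is nonzero, so all rows cast to all-True
```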

@amueller
Member Author

@tomMoral Yeah, I think that is right (your summary of the consensus).


5 participants