Jaccard distance in trees very different from pairwise_distances jaccard distance. · Issue #4523 · scikit-learn/scikit-learn · GitHub

Jaccard distance in trees very different from pairwise_distances jaccard distance. #4523


Closed
amueller opened this issue Apr 5, 2015 · 22 comments

Comments

@amueller
Member
amueller commented Apr 5, 2015

As observed in #4522, BallTree and pairwise_distances have very different results for metric="jaccard":

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.uniform(size=(6, 5))

nn = NearestNeighbors(metric="jaccard", algorithm='brute').fit(X)
print(nn.kneighbors(X)[0])
nn = NearestNeighbors(metric="jaccard", algorithm='ball_tree').fit(X)
print(nn.kneighbors(X)[0])

[[ 0. 1. 1. 1. 1.]
[ 0. 1. 1. 1. 1.]
[ 0. 1. 1. 1. 1.]
[ 0. 1. 1. 1. 1.]
[ 0. 1. 1. 1. 1.]
[ 0. 1. 1. 1. 1.]]
[[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]]
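The two outputs make sense once the two definitions in play are spelled out. Below is a minimal sketch in plain numpy (`jaccard_extended` and `jaccard_binary` are illustrative names, not scikit-learn functions): on continuous data every coordinate is nonzero and pairwise unequal, so the scipy-style "extended" definition returns 1 everywhere while the cast-to-boolean definition returns 0.

```python
import numpy as np

def jaccard_extended(u, v):
    # scipy-style "extended" Jaccard: positions where u and v disagree,
    # normalized by positions where either is nonzero.
    nonzero = (u != 0) | (v != 0)
    unequal = (u != v) & nonzero
    return unequal.sum() / nonzero.sum()

def jaccard_binary(u, v):
    # BallTree-style Jaccard: the data is cast to boolean first.
    u, v = u != 0, v != 0
    nonzero = u | v
    return (u != v).sum() / nonzero.sum()

rng = np.random.RandomState(0)
u, v = rng.uniform(size=5), rng.uniform(size=5)

# Uniform samples are all nonzero and all unequal, so the extended
# definition gives 1 and the boolean one gives 0.
print(jaccard_extended(u, v))  # 1.0
print(jaccard_binary(u, v))    # 0.0
```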

@jnothman
Member
jnothman commented Apr 6, 2015

Oh, that doesn't look good. Both pairwise_distances and sklearn.neighbors.dist_metrics should be tested against scipy.spatial.distance where possible.


@amueller
Member Author
amueller commented Apr 6, 2015

Yeah. Well, pairwise_distances is calling scipy.spatial.distance here, but dist_metrics does something else.

@amueller
Member Author
amueller commented Apr 6, 2015

See the tests in #4522

@jakevdp
Member
jakevdp commented May 5, 2015

Relevant discussion on the SciPy-dev mailing list: http://mail.scipy.org/pipermail/scipy-dev/2012-December/018129.html

I think the "extended" Jaccard distance used by scipy is not actually a true metric, which is why BallTree casts to bools and uses the true (binary) Jaccard metric. Under the scipy definition, the BallTree search would fail.

@amueller
Member Author
amueller commented May 5, 2015

@jakevdp Do you have a good idea to resolve this? I find the current state pretty catastrophic (different metrics are used depending on the size of the data).
Maybe we should just implement our own boolean brute metric?

@amueller amueller added this to the 0.16.2 milestone May 5, 2015
@jakevdp
Member
jakevdp commented May 5, 2015

Maybe we should just implement our own boolean brute metric?

I think that's probably best. I do believe that the scipy implementation is technically not correct.

@jakevdp
Member
jakevdp commented May 5, 2015

Note that the unit tests in the sklearn.neighbors module explicitly only test boolean metrics with boolean values, which is why this wasn't caught by the tests. It's a simple case of GIGO, so perhaps we should be more explicit in the documentation, and/or raise a warning if the user passes non-boolean data to a boolean metric.
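On 0/1-valued input the scipy definition and the cast-to-boolean variant do agree, which is why tests restricted to boolean values could not catch the discrepancy. A quick check of that agreement, using scipy directly rather than sklearn's dist_metrics:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
X = rng.randint(2, size=(6, 5)).astype(float)  # 0/1-valued floats
X[:, 0] = 1  # avoid all-zero rows, whose Jaccard distance is degenerate (0/0)

# On boolean-valued data, scipy's "extended" Jaccard and the
# cast-to-bool variant give identical results.
d_float = pdist(X, metric="jaccard")
d_bool = pdist(X.astype(bool), metric="jaccard")
print(np.allclose(d_float, d_bool))  # True
```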

@ogrisel
Member
ogrisel commented May 7, 2015

+1 for at least raising a warning, maybe even an exception.

@amueller
Member Author
amueller commented May 7, 2015

If we raise an exception, we will not have to reimplement, but it will break people's code. Reimplementing plus a warning seems like a good idea, maybe?

@jakevdp
Member
jakevdp commented May 7, 2015

Rather than re-implement, we could add a validation check for all the boolean metrics. Basically something like

if np.any((X != 1) & (X != 0)):
    warnings.warn("casting data to boolean for {0} metric".format(metric))
    X = (X != 0).astype(float)

Then the current code should be consistent between scipy and scikit-learn (and is tested as such with our current unit tests).
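Wrapped in a helper, such a check would behave like this on the data from the original report (`check_boolean_input` is a hypothetical name for illustration; the condition warns only when a value other than 0 or 1 is present):

```python
import warnings
import numpy as np

def check_boolean_input(X, metric):
    # Hypothetical helper: warn when a boolean metric receives
    # non-boolean data, then cast to 0/1 floats.
    X = np.asarray(X)
    if np.any((X != 1) & (X != 0)):
        warnings.warn("casting data to boolean for {0} metric".format(metric))
        X = (X != 0).astype(float)
    return X

rng = np.random.RandomState(0)
X = rng.uniform(size=(6, 5))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    Xb = check_boolean_input(X, "jaccard")

print(len(caught))    # 1: the conversion was reported
print(np.unique(Xb))  # [1.]: every uniform sample is nonzero
```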

@amueller
Member Author
amueller commented May 7, 2015

Why would you convert back to float in the end? Because the trees need that?

@jakevdp
Member
jakevdp commented May 7, 2015

Hmm, I can't remember what input type they assume. Let me check

@amueller
Member Author
amueller commented May 7, 2015

OK, let's raise a DataConversionWarning. As the previous behavior was inconsistent, I think the behavior change is "ok".

@jakevdp
Member
jakevdp commented May 7, 2015

So in the BallTree code, all data is converted to float eventually anyway, because of the type consistency required in the tree.

All the unit tests compare the results of float ones and zeros passed to the distance metrics and the corresponding scipy metrics.

@amueller
Member Author
amueller commented May 7, 2015

OK, +1 on basically your snippet then. Do you want to do the PR or should I?

@jakevdp
Member
jakevdp commented May 7, 2015

I can tackle it. All I'm doing today is github stuff anyway 😄

@jakevdp
Member
jakevdp commented May 7, 2015

OK, all of a sudden I'm confused about where this should go. Do we want all pairwise_distance calculations to do this for boolean metrics, or are we just worried about routines with e.g. algorithm='brute' and algorithm='ball_tree'?

I'm worried that there may be use cases for the non-boolean applications of the boolean metrics that I don't know about.

@amueller
Member Author
amueller commented May 7, 2015

I thought you claimed that was a bug ;)
I think we should have consistent behavior everywhere, so it should probably go in pairwise_distances.

@jakevdp
Member
jakevdp commented May 7, 2015

It's a bug if you expect the results to conform to the definition of a metric. But there may be cases where the extended non-metric definition is useful; I just don't know of any.

@amueller
Member Author

@TomDLT if you like you can have a look at this one. The consensus seems to be to cast float data to boolean data when a boolean metric is requested and raise a warning.

@amueller amueller modified the milestones: 0.16.2, 0.17 Sep 8, 2015
@tomMoral
Contributor

We looked into it today. Our understanding of the consensus is:

  • We should add a list PAIRWISE_BOOLEAN_METRIC in pairwise.py and, for those metrics, check in pairwise_distances that the input array is boolean before calling pdist.
  • Otherwise, we raise a warning and cast the array to boolean.

Is that correct? Shouldn't we switch to using the dist_metrics implementation of the metric?
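A minimal sketch of that plan (`PAIRWISE_BOOLEAN_METRIC` is the list name proposed above; its exact contents and the placement in pairwise.py are assumptions, and scipy supports more boolean metrics than listed here):

```python
import warnings
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Assumed contents of the proposed list of boolean metrics.
PAIRWISE_BOOLEAN_METRIC = ["dice", "jaccard", "rogerstanimoto",
                           "russellrao", "sokalsneath", "yule"]

def pairwise_distances_sketch(X, metric):
    # Sketch of the proposed check: for boolean metrics, warn on
    # non-boolean input and cast before delegating to scipy's pdist.
    X = np.asarray(X)
    if metric in PAIRWISE_BOOLEAN_METRIC and X.dtype != bool:
        if np.any((X != 0) & (X != 1)):
            warnings.warn("Data was converted to boolean for metric %s" % metric)
        X = X.astype(bool)
    return squareform(pdist(X, metric=metric))

rng = np.random.RandomState(0)
D = pairwise_distances_sketch(rng.uniform(size=(6, 5)), "jaccard")
print(D.max())  # 0.0: every value is nonzero, so all rows cast to all-True
```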

@amueller
Member Author

@tomMoral Yeah, I think that is right (your summary of the consensus).


5 participants