-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
Jaccard distance in trees very different from pairwise_distances jaccard distance. #4523
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Oh, that doesn't look good. Both pairwise_distances and On 6 April 2015 at 06:28, Andreas Mueller notifications@github.com wrote:
|
yeah. Well |
See the tests in #4522 |
Relevant discussion on the SciPy-dev mailing list: http://mail.scipy.org/pipermail/scipy-dev/2012-December/018129.html I think the "extended" Jaccard distance used by scipy is not actually a true metric, which is why BallTree casts to bools and uses the true (binary) Jaccard metric. Under the scipy definition, the BallTree search would fail. |
@jakevdp Do you have a good idea to resolve this? I find the current state pretty catastrophic (different metrics are used depending on the size of the data). |
I think that's probably best. I do believe that the scipy implementation is technically not correct. |
Note that the unit tests in the |
+1 for at least raising a warning, maybe even an exception. |
If we raise an exception, we will not have to reimplement. But it will break peoples code. Reimplement +warn seems like a good idea, maybe? |
Rather than re-implement, we could add a validation check for all the boolean metrics. Basically something like if np.any((X != 1) | (X != 0)):
warnings.warn("casting data to boolean for {0} metric".format(metric))
X = (X != 0).astype(float) Then the current code should be consistent between scipy and scikit-learn (and is tested as such with our current unit tests). |
Why would you convert back to float in the end? Because the trees need that? |
Hmm, I can't remember what input type they assume. Let me check |
Ok, lets raise a |
So in the BallTree code, all data is converted to float eventually anyway, because of the type consistency required in the tree. All the unit tests compare the results of float ones and zeros passed to the distance metrics and the corresponding scipy metrics. |
ok +1 on basically your snipplet then. Do you want to do the PR or should I? |
I can tackle it. All I'm doing today is github stuff anyway 😄 |
OK, all of a sudden I'm confused about where this should go. Do we want all I'm worried that there may be use cases for the non-boolean appliactions of the boolean metrics that I don't know about. |
I thought you claimed that was a bug ;) |
It's a bug if you expect the results to conform to the definition of a metric. But there may be cases where the extended non-metric definition is useful; I just don't know of any. |
@TomDLT if you like you can have a look at this one. The consensus seems to be to cast float data to boolean data when a boolean metric is requested and raise a warning. |
We looked into it today. Our understanding of the concensus is:
Is that correct? Shouldn't we switch to using the dist_metrics implementation of the metric? |
@tomMoral yeah I think that is right (your summary of the consensus) |
As observed in #4522, BallTree and
pairwise_distances
have very different results formetric="jaccard"
:The text was updated successfully, but these errors were encountered: