[MRG + 1] support X_norm_squared in euclidean_distances #2459

Merged

Conversation

djsutherland
Contributor

There's a comment in euclidean_distances saying

# should not need X_norm_squared because if you could precompute that as
# well as Y, then you should just pre-compute the output and not even
# call this function.

That's not necessarily true, though. I've run into a situation today where I have a whole bunch of sets, and need to do something based on the distances between each pair of sets. It's helpful to cache the squared norms for each of the sets; if I did that and called it with just Y_norm_squared for each pair, that'd still be recomputing the norms for X all the time. (Of course, I can just do it without the helper function, which is what I'm doing now, but it's nicer to use helpers.)

Another situation is when you happen to already have the squared norms for a set X and then you want euclidean_distances(X). I guessed that maybe euclidean_distances(X, Y_norm_squared=X_norm_sq) would work, but looking at the code that doesn't actually use X_norm_sq. Now euclidean_distances can handle that case too.
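
(For illustration, a minimal sketch of the caching pattern this enables, using the merged keyword arguments; the data and variable names here are made up:)

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.RandomState(0)
# A bunch of sets, each an (n_i, d) array of points.
sets = [rng.rand(20, 5) for _ in range(4)]

# Cache each set's row-wise squared norms once...
sq_norms = [(S ** 2).sum(axis=1) for S in sets]

# ...and reuse them for every pair of sets instead of recomputing.
for X, x_sq in zip(sets, sq_norms):
    for Y, y_sq in zip(sets, sq_norms):
        D = euclidean_distances(X, Y, Y_norm_squared=y_sq,
                                X_norm_squared=x_sq)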

This also adds an extremely simple test that passing X_norm_squared and/or Y_norm_squared gives the same result; previously there was no test that used Y_norm_squared.

As an aside: I have no idea why XX is being computed with X * X and YY with Y ** 2 (which necessitates the annoying copy code when it's sparse); it seems like it should be exactly the same situation, except for the very minor difference of the shape. I left it as-is, though.

@djsutherland
Contributor Author

I don't know why the tests passed locally and failed on the server. Will look into that.

@larsmans
Member
larsmans commented Oct 2, 2013

This is not mergeable anymore after my optimizations in 81950ba. The X*X and Y**2 difference is gone, as the squaring is now offloaded to a helper that calls einsum when possible.
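
(A sketch of the einsum trick being referred to; this is not necessarily the exact helper from that commit:)

import numpy as np

def squared_row_norms(X):
    # einsum sums X[i, j] * X[i, j] over j for each row i, without
    # materializing the intermediate X * X array.
    return np.einsum('ij,ij->i', X, X)

X = np.arange(6, dtype=float).reshape(3, 2)
assert np.allclose(squared_row_norms(X), (X ** 2).sum(axis=1))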

@djsutherland
Contributor Author

Okay, looks good. I still think passing in the square norms of X makes sense; I'll update to take your optimizations into account and track down the test problem sometime soon.

@coveralls

Coverage Status

Coverage remained the same when pulling 8d2978c on dougalsutherland:euclidean_distances into e54c54a on scikit-learn:master.

@djsutherland
Contributor Author

Updated this to use the new version, with basically the same changeset as before. I put the X_norm_squared argument last in the signature to keep backwards compatibility with positional arguments, even though it should probably go earlier. Is that the right thing to do?
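
(To illustrate the compatibility concern, with hypothetical signatures rather than the actual ones from the diff: appending the new parameter at the end leaves existing positional calls intact.)

# Hypothetical "before" signature:
def dist_old(X, Y=None, Y_norm_squared=None, squared=False):
    return Y_norm_squared, squared

# New parameter appended last; an existing positional call such as
# dist_old(X, Y, y_sq, True) keeps the same meaning, whereas inserting
# X_norm_squared earlier would silently shift the arguments.
def dist_new(X, Y=None, Y_norm_squared=None, squared=False,
             X_norm_squared=None):
    return Y_norm_squared, squared

assert dist_old(0, 1, 2, True) == dist_new(0, 1, 2, True)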

@amueller
Member

Not sure why this got stale, looks like a good contrib. Can you rebase?

@djsutherland
Contributor Author

@amueller Okay, rebased and switched to use check_array.

I still kept X_norm_squared as the last argument in case people are calling it with positional arguments, despite the order looking ugly.
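
(For context, check_array is scikit-learn's input-validation helper; a minimal illustration of the kind of call involved, though the exact arguments in the PR may differ:)

import numpy as np
from sklearn.utils import check_array

# Validates dtype and dimensionality; accept_sparse='csr' would also
# let a scipy.sparse CSR matrix through unchanged.
X = check_array(np.array([[0., 1.], [2., 3.]]), accept_sparse='csr')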

@amueller
Member

I'm ok with that ordering. LGTM.

@amueller changed the title from "support X_norm_squared in euclidean_distances" to "[MRG + 1] support X_norm_squared in euclidean_distances" on Aug 27, 2015
X_norm_sq = (X ** 2).sum(axis=1)
Y_norm_sq = (Y ** 2).sum(axis=1)

D1 = euclidean_distances(X, Y)
D2 = euclidean_distances(X, Y, X_norm_squared=X_norm_sq)
Member

These tests don't actually check that the args are being utilised. Do we care? Should we add tests with incorrect norms in order to ensure that there aren't bugs in the use of the norms? Or is code coverage sufficient?

Member

Good point. It's too late in NYC. What we want to test is that it gets faster, right? But that is not nice to test.
Maybe adding a test that the squared norms are actually used would be OK. We didn't have that for Y_norm_squared, though, right?
Maybe just add a test that passes both as zero and see if it just computes -2 times the dot product?

@jnothman
Member

Apart from that nitpick, this LGTM.

@djsutherland
Contributor Author

Added a test that the answer is wrong with the wrong squared norms.

I tried @amueller's suggestion of passing zeros, but the function does maximum(distances, 0), so instead of -2 * X.dot(Y.T) it returned all zeros. So I added a fairly blunt test that halves the squared norms and makes sure that at least one distance changes by more than .01.
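
(A sketch of both behaviours; this is illustrative rather than the PR's exact test code:)

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.RandomState(0)
X, Y = rng.rand(10, 4), rng.rand(7, 4)
X_sq = (X ** 2).sum(axis=1)
Y_sq = (Y ** 2).sum(axis=1)

# With zeros, the raw squared distances are -2 * X.dot(Y.T), which is
# negative for nonnegative data, so maximum(distances, 0) zeroes it all.
D_zeros = euclidean_distances(X, Y, X_norm_squared=np.zeros(10),
                              Y_norm_squared=np.zeros(7))
assert np.all(D_zeros == 0)

# Halving the norms instead: the output must change noticeably, which
# shows the precomputed norms are actually being used.
D_good = euclidean_distances(X, Y)
D_bad = euclidean_distances(X, Y, X_norm_squared=X_sq / 2,
                            Y_norm_squared=Y_sq / 2)
assert np.abs(D_good - D_bad).max() > .01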

@amueller
Member

@jnothman merge?

@GaelVaroquaux
Member

LGTM. Merging. Thanks!

GaelVaroquaux added a commit that referenced this pull request Sep 1, 2015
[MRG + 1] support X_norm_squared in euclidean_distances
@GaelVaroquaux merged commit 4d39cf8 into scikit-learn:master on Sep 1, 2015