-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG + 1] support X_norm_squared in euclidean_distances #2459
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG + 1] support X_norm_squared in euclidean_distances #2459
Conversation
I don't know why the tests passed locally and failed on the server. Will look into that. |
This is not mergeable anymore after my optimizations in 81950ba. The |
Okay, looks good. I still think passing in the square norms of X makes sense; I'll update to take your optimizations into account and track down the test problem sometime soon. |
Updated this to use the new version, with basically the same changeset as before. I put the X_norm_squared argument last in the spec, though it should probably go earlier, to keep backwards compatibility with positional arguments. Is that something that I should do? |
Not sure why this got stale, looks like a good contrib. Can you rebase? |
8d2978c
to
4a1b0cc
Compare
@amueller Okay, rebased and switched to use I still kept |
I'm ok with that ordering. LGTM. |
Y_norm_sq = (Y ** 2).sum(axis=1) | ||
|
||
D1 = euclidean_distances(X, Y) | ||
D2 = euclidean_distances(X, Y, X_norm_squared=X_norm_sq) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests don't actually check that the args are being utilised. Do we care? Should we add tests with incorrect norms in order to ensure that there aren't bugs in the use of the norms? Or is code coverage sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point. It's too late in NYC. What we want to test is that it gets faster, right? But that is not nice to test.
Maybe adding a test that the squares are used would be ok. We didn't have that for Y_norm_squared, though, right?
Maybe just add a test that passes both as zero and see if it just computes -2 times the dot product?
Apart from that nitpick, this LGTM. |
4a1b0cc
to
8d8b434
Compare
Added a test that the answer is wrong with the wrong squared norms. I tried @amueller's suggestion of passing zeros, but the function does |
@jnothman merge? |
LGTM. Merging. Thanks! |
[MRG + 1] support X_norm_squared in euclidean_distances
There's a comment in
euclidean_distances
sayingThat's not necessarily true, though. I've run into a situation today where I have a whole bunch of sets, and need to do something based on the distances between each pair of sets. It's helpful to cache the squared norms for each of the sets; if I did that and called it with just
Y_norm_squared
for each pair, that'd still be recomputing the norms forX
all the time. (Of course, I can just do it without the helper function, which is what I'm doing now, but it's nicer to use helpers....)Another situation is when you happen to already have the squared norms for a set
X
and then you wanteuclidean_distances(X)
. I guessed that maybeeuclidean_distances(X, Y_norm_squared=X_norm_sq)
would work, but looking at the code that doesn't actually useX_norm_sq
. Noweuclidean_distances
can handle that case too.This also adds an extremely simple test that passing
X_norm_squared
and/orY_norm_squared
gives the same result; previously there was no test that usedY_norm_squared.
As an aside: I have no idea why
XX
is being computed withX * X
andYY
withY ** 2
(which necessitates the annoying copy code when it's sparse); it seems like it should be exactly the same situation, except for the very minor difference of the shape. I left it as-is, though.