[WIP] Providing stable implementation for euclidean_distances #10069
Conversation
"Incompatible dimensions for Y and Y_norm_squared") | ||
else: | ||
YY = row_norms(Y, squared=True)[np.newaxis, :] | ||
diff = X.reshape(-1, 1, X.shape[1]) - Y.reshape(1, -1, X.shape[1]) |
We need to support sparse matrices, which this does not
Oh ok I'll fix this
It would be better if you can make the CIs green:
(1) There are various PEP8 errors, see the bottom of https://travis-ci.org/scikit-learn/scikit-learn/jobs/297482321
(2) Please figure out a way to support the sparse case (it seems that scipy doesn't support multi-dimensional sparse arrays, so another way is needed)
(3) There are some relevant tests which need to be removed
Are we really going to deprecate X_norm_squared and Y_norm_squared? They were introduced in #42 for k_means and are still used there, e.g. scikit-learn/sklearn/cluster/k_means_.py, lines 98 to 100 at abb43c1.
A fix for the sparse matrix problem could be to check whether the input is sparse and then convert it back to a numpy array, although that doesn't seem like an elegant way to fix it, right?
I think that Y_norm_squared isn't necessarily required even in KMeans, but I could be wrong.
The questions here are going to be about implementation and thorough benchmarking. API is only a minor issue; don't worry about it for now. It might even be a good idea for now to keep both approaches in there, triggered by a switch. That would make it easy to run benchmarks and identify where one implementation substantially outperforms the other. I suspect that with the new implementation we will need to use chunking to avoid high memory usage. For sparse, yes, we cannot use the same broadcasting technique to avoid memory issues and copying in subtraction, but the asymptotic performance will be the same. Are you sure you're up to all this @ragnerok? Let us know if you need help.
Btw, passing in norms is simply a way to amortize some of the calculation costs. If we can show that this implementation gives similar KMeans performance, that amortization becomes irrelevant.
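For context, that amortization is what the existing Y_norm_squared parameter enables; a minimal illustration (the array shapes here are arbitrary):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.utils.extmath import row_norms

X = np.random.RandomState(0).rand(1000, 50)
Y = np.random.RandomState(1).rand(200, 50)

# Precompute the squared row norms of Y once...
YY = row_norms(Y, squared=True)[np.newaxis, :]

# ...and reuse them across many calls (e.g. one call per k-means iteration).
D = euclidean_distances(X, Y, Y_norm_squared=YY)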
That seems an unacceptable approach, although you can indeed make the CIs green by doing so (I tried it previously when digging into the issue).
Indeed, what I mean above is that deprecating Y_norm_squared might influence the speed of k-means (you would still get the right result). There are already some issues indicating that our k-means implementation is slow (see #9430 also for benchmark reference). I have used grep to check the repo: Y_norm_squared is used in KMeans (in _k_init) and Birch (in _split_node), so we might need to benchmark both of them along with euclidean_distances itself to see how much the speed decreases. Apart from #42, where we introduced Y_norm_squared, also see #2459, where we introduced X_norm_squared; the user proposes some usages of the two parameters there. I still doubt whether we have a better way to solve the problem and keep the two parameters.
@@ -384,6 +384,15 @@ def test_euclidean_distances():
    assert_array_almost_equal(D, [[1., 2.]])

    rng = np.random.RandomState(0)
    # check if it works for float32
    X = rng.rand(1, 3000000).astype(np.float32)
Is there a specific reason for such a complex test? It seems that the current problem can be reproduced with a very simple array, e.g.:
arr1_32 = np.array([555.5, 666.6, 777.7], dtype=np.float32)
arr2_32 = np.array([555.6, 666.7, 777.8], dtype=np.float32)
arr1_64 = np.array([555.5, 666.6, 777.7], dtype=np.float64)
arr2_64 = np.array([555.6, 666.7, 777.8], dtype=np.float64)
euclidean_distances(arr1_32.reshape(1, -1), arr2_32.reshape(1, -1))
# array([[ 0.17319804]], dtype=float32)
euclidean_distances(arr1_64.reshape(1, -1), arr2_64.reshape(1, -1))
# array([[ 0.17320508]])
Well, I thought it would be better to show that it fails with a random array. No particular reason though. I'll change it to a simpler test if it's a problem.
@jnothman yeah, I could use some help dealing with chunking and sparse matrices.
Are you capable of plotting benchmarks? Could I suggest you start by setting up some benchmarks, show what they look like, and then we can work on improving the implementation. Thanks.
Yeah, I think I can do it. I'll make a commit.
Benchmarks as a gist rather than within the PR are fine.
Thanks. It's helpful if you run the benchmark and share the output too, so that we're all on the same page.
Also, you don't need the number of samples and features to be randomized. Rather, you want to test the response of runtime and memory to changes in the number of features and samples, ideally plotting curves for each of: the new implementation, the old implementation, and the old implementation with norms precomputed.
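A rough sketch of what such a timing loop could look like (the call for the new implementation is left as a placeholder comment, since its final name isn't settled; memory would be profiled separately):

import time
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.utils.extmath import row_norms

rng = np.random.RandomState(0)
n_samples = 1000

for n_features in [10, 100, 1000, 10000]:
    X = rng.rand(n_samples, n_features).astype(np.float32)
    Y = rng.rand(n_samples, n_features).astype(np.float32)
    YY = row_norms(Y, squared=True)[np.newaxis, :]

    t0 = time.time()
    euclidean_distances(X, Y)                     # old implementation
    t1 = time.time()
    euclidean_distances(X, Y, Y_norm_squared=YY)  # old, with precomputed norms
    t2 = time.time()
    # a third timing would go here for the new implementation in this PR
    print(n_features, t1 - t0, t2 - t1)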
@ragnerok it would be better if you could also benchmark KMeans and Birch :) Remember to remove the code calculating Y_norm_squared (KMeans in _k_init and Birch in _split_node).
I don't think X_norm_squared and Y_norm_squared should be deprecated. I have a WIP branch to speed up nearest neighbors that makes use of at least one of them; look at this for more details: master...lesteve:improve-kneighbors-brute. When you do neighbors queries it is faster to pre-compute ||X||^2 and then do:
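The snippet that followed was not preserved here; presumably it is the usual quadratic expansion, roughly along these lines for a brute-force neighbour query (dense arrays assumed):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(1000, 50)   # indexed points
Q = rng.rand(10, 50)     # query points

# Precompute ||x||^2 once at fit time...
X_norm_squared = np.einsum('ij,ij->i', X, X)

# ...then each query only needs its own norms and the cross term:
# ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2
sq_dists = (np.einsum('ij,ij->i', Q, Q)[:, np.newaxis]
            - 2 * np.dot(Q, X.T)
            + X_norm_squared[np.newaxis, :])
nearest = np.argmin(sq_dists, axis=1)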
Well, at least for float32 that may be faster but very inaccurate. Perhaps we should only change the implementation for float32. Would that fix most of it?
Hmmm, I guess it would be very nice to have proper benchmarks for computing time and accuracy then. Something I forgot to mention: my WIP branch is in the context of https://github.com/erikbern/ann-benchmarks, which only measures predict time and not fit time, in which case precomputing the norms up front is essentially free.
Ok, so you want benchmarks of computing time, memory and accuracy versus the number of features, comparing the new implementation, the old one, and the old one with precomputed norms, right? Also, could anyone help me with measuring memory usage? Should I use psutil's virtual_memory to find the used memory?
Here are the results I get when using the 2nd gist.
I'd like to see more data points in that plot. Also, please mark the number of samples on your plot (e.g. in its title). Without knowing the number of samples, I don't know whether the slowdown due to the number of features might be caused by large memory allocations, and hence solved by chunking. You can do memory profiling using memory_profiler, I think, to find the maximum memory usage over a function's execution.
By chunking I mean something like #7177, which sets a limit on working memory and performs the computations on successive slices of the array to achieve that.
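Not the actual #7177 code, but as an illustration of the idea, a row-chunked loop under an assumed working-memory budget could look like:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def pairwise_in_chunks(X, Y, working_memory_mib=64):
    # Pick a row chunk size so that each slice of the distance matrix
    # stays under the working-memory budget (float64 output assumed).
    row_bytes = 8 * Y.shape[0]
    chunk = max(1, int(working_memory_mib * 2**20 // row_bytes))
    out = np.empty((X.shape[0], Y.shape[0]))
    for start in range(0, X.shape[0], chunk):
        stop = min(start + chunk, X.shape[0])
        out[start:stop] = euclidean_distances(X[start:stop], Y)
    return out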
You should also be testing with float32, I suppose.
This is going to be tricky to get right. We might need to think a little more.
else:
    YY = row_norms(Y, squared=True)[np.newaxis, :]
diff = X.reshape(-1, 1, X.shape[1]) - Y.reshape(1, -1, X.shape[1])
distances = np.sum(np.power(diff, 2), axis=2)
Can you please try using np.dot to calculate the sum of squares?
If I may suggest an alternate method (the code is quoted in the reply below):
This method is pretty fast in comparison.
Yes, einsum can be faster than dot for this. Good. Let's work with that.
…On 12 November 2017 at 04:26, Osaid Rehman Nasir wrote:
If I may suggest an alternate method -
def dist(X, Y, squared=False):
X, Y = check_pairwise_arrays(X, Y)
diff = X.reshape(-1, 1, X.shape[1]) - Y.reshape(1, -1, X.shape[1])
diff = diff.reshape(-1, X.shape[1])
distances = np.einsum('ij,ij->i', diff, diff)
distances = distances.reshape(-1, Y.shape[0])
return distances if squared else np.sqrt(distances)
This method is pretty fast in comparison.
Obviously I can use sklearn.utils.extmath.row_norms rather than einsum
but it gives a less precise result
However, we may find that we want to cythonize the whole sum-of-squared-distances to avoid any memory overhead…
The only problem with row_norms/einsum is that it's not very accurate for a large number of features (e.g. the test that I've added). Output:
I think inaccuracy at the 7th significant figure is unsurprising in float32. I don't consider this a real problem.
(Note that np.log2(10 ** 7) == 23.25..., while float32 only has a 24-bit significand.)
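For reference, float32's mantissa width is what caps the precision at roughly 7 decimal digits:

import numpy as np

print(np.finfo(np.float32).nmant)  # 23 stored mantissa bits (24 with the implicit bit)
print(np.finfo(np.float32).eps)    # ~1.19e-07, i.e. about 7 significant decimal digits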
Benchmark of einsum?
So should I change the current implementation, also change the test, and commit?
@jnothman so what do you want me to do now?
Well, I think it's clear that we can only apply this to the float32 case. It's tempting even to think about ways we can calculate in float64 and return float32s without ever using excessive memory. Or we can just go back to the situation in v0.18 or so, when we just returned float64. I don't especially like the idea of a solution where the float32 calculation takes much longer... Perhaps @ogrisel will give an opinion here.
I think a temporary fix in which we perform the calculation in float64 (whether or not we then cast back to float32 for the return value) might be a better idea than maintaining two parallel implementations, one with a poor memory profile. We could later potentially use chunked calculation as in #10280, converting one chunk at a time, so that we maintain the overall memory benefits of working in float32... But don't worry about that for now.
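Concretely, the temporary fix described here amounts to something like the following sketch (not the final patch; the wrapper name is only illustrative):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def euclidean_distances_float32(X, Y):
    # Upcast, compute with the stable float64 path, cast the result back.
    D = euclidean_distances(X.astype(np.float64), Y.astype(np.float64))
    return D.astype(np.float32)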
Just to make sure I got it right: all I have to do for now is convert float32 inputs to float64 and return the distance as float64.
Return as float32, perhaps. Yes, I think so. We need to focus on correctness before efficiency, but we need to leave the float64 case as efficient as it was.
Ok, I'll make a commit then.
Hmm... I'd like to see this finished in the next month or so. @ragnerok are you around?
Yeah, I'll work on it now, apologies for the delay.
No worries, and thanks for your work :)
In the current implementation, the performance bottleneck is most likely in the matrix-matrix dot product, which numpy delegates to the underlying BLAS. BTW, I think the current algorithm loses approximately 3 decimal digits of accuracy (a rough experimental estimate) for both float32 and float64 data types. I genuinely wonder how one decides what is or isn't an acceptable loss.
For what it's worth, I've tried several different algorithms to compute the euclidean distance matrix. It turns out the first one was a bad idea: the result is symmetric, but further from the true value.
@Celelibi basically said that the best solution is to do the computation in 64-bit even when the input is 32-bit, but to do so one chunk at a time to avoid excessive memory usage due to copying X and Y (which may have a lot of features, and hence be as costly in storage as the pairwise distances). @Celebili, would you like to offer an implementation? I'm also wondering whether the scipy.spatial implementation is more stable and can be exploited.
Sure. As long as you don't misspell my nickname. ^^
I also looked into it; I looked too quickly at the code of scipy.spatial. BTW, I have some (crowded) plots to support all these claims.
The code performing the block computation should probably look something like this.

bs = 1024
dist = np.zeros((X.shape[0], Y.shape[0]), dtype=(X[0, 0] - Y[0, 0]).dtype)
for i in range(0, X.shape[0], bs):
    for j in range(i, Y.shape[0], bs):
        for k in range(0, X.shape[1], bs):
            Xc = X[i:i+bs, k:k+bs].astype(np.float64)
            Yc = Y[j:j+bs, k:k+bs].astype(np.float64)
            dist[i:i+bs, j:j+bs] += euclidean_distances(Xc, Yc, squared=True)
        if i != j:
            dist[j:j+bs, i:i+bs] = dist[i:i+bs, j:j+bs].T

But the current code would probably need some refactoring to avoid unclear recursive calls. And it turns out reusing…
@Celelibi I think you should make a new PR and I should probably close this one; I hadn't done too much work anyway.
Thanks again for your work, @ragnerok. This is an important but difficult issue to solve.
Yeah, no problem, and thanks for the help.
Reference Issues/PRs
Fixes #9354
What does this implement/fix? Explain your changes.
Provides a more precise result.
Any other comments?
I also tried using row_norms but for some reason it was less precise than this implementation.
Here's what I tried -
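The snippet itself was not preserved here; a hypothetical reconstruction of what a row_norms-based variant of the einsum function quoted above might have looked like (a guess, not the author's actual code):

import numpy as np
from sklearn.utils.extmath import row_norms
from sklearn.metrics.pairwise import check_pairwise_arrays

def dist_with_row_norms(X, Y, squared=False):
    X, Y = check_pairwise_arrays(X, Y)
    # Squared norm of each pairwise difference via row_norms instead of einsum
    diff = X.reshape(-1, 1, X.shape[1]) - Y.reshape(1, -1, X.shape[1])
    distances = row_norms(diff.reshape(-1, X.shape[1]), squared=True)
    distances = distances.reshape(X.shape[0], Y.shape[0])
    return distances if squared else np.sqrt(distances)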