[WIP] Providing stable implementation for euclidean_distances #10069
Conversation
"Incompatible dimensions for Y and Y_norm_squared") | ||
else: | ||
YY = row_norms(Y, squared=True)[np.newaxis, :] | ||
diff = X.reshape(-1, 1, X.shape[1]) - Y.reshape(1, -1, X.shape[1]) |
We need to support sparse matrices, which this does not
Oh ok I'll fix this
It would be better if you can make the CIs green:
(1) There are various PEP8 errors, see the bottom of https://travis-ci.org/scikit-learn/scikit-learn/jobs/297482321
(2) Please figure out a way to support the sparse case (it seems that scipy doesn't support multi-dimensional sparse arrays, so another way is needed)
(3) There are some relevant tests which need to be removed
Are we really going to deprecate X_norm_squared and Y_norm_squared? They were introduced in #42 for k_means and are still used there, e.g. scikit-learn/sklearn/cluster/k_means_.py, lines 98 to 100 at abb43c1.
A fix for the sparse matrix problem could be to check whether the input is sparse and then convert it back to a numpy array, although that doesn't seem like an elegant way to fix it, right?
I think that Y_norm_squared isn't necessarily required even in KMeans, but I could be wrong.
The questions here are going to be about implementation and thorough benchmarking. API is only a minor issue; don't worry about it for now. It might even be a good idea for now to keep both approaches in there, triggered by a switch. That would make it easy to run benchmarks and identify where one implementation substantially outperforms the other. I suspect that with the new implementation we will need to use chunking to avoid high memory usage. For sparse, yes, we cannot use the same broadcasting technique to avoid memory issues and copying in subtraction, but the asymptotic performance will be the same. Are you sure you're up to all this @ragnerok? Let us know if you need help.
Btw, passing in norms is simply a way to amortize some of the calculation costs. If we can show that this implementation gives similar KMeans performance, that amortization becomes irrelevant.
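For context, that amortization is what the existing Y_norm_squared parameter enables; a minimal illustration (the array shapes here are arbitrary):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.utils.extmath import row_norms

X = np.random.RandomState(0).rand(1000, 50)
Y = np.random.RandomState(1).rand(200, 50)

# Precompute the squared row norms of Y once...
YY = row_norms(Y, squared=True)[np.newaxis, :]

# ...and reuse them across many calls (e.g. one call per k-means iteration).
D = euclidean_distances(X, Y, Y_norm_squared=YY)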
That seems an unacceptable approach, although you can indeed make the CIs green by doing so (I tried it previously when digging into the issue).
Indeed, what I mean above is that deprecating Y_norm_squared might influence the speed of k-means (you would still get the right result). There are already some issues indicating that our k-means implementation is slow (see #9430 also for benchmark reference). I have used grep to check the repo: Y_norm_squared is used in KMeans (in _k_init) and Birch (in _split_node), so we might need to benchmark both of them along with euclidean_distances itself to see how much the speed decreases. Apart from #42, where we introduced Y_norm_squared, also see #2459, where we introduced X_norm_squared; the user proposes some usages of the two parameters there. I still doubt whether we have a better way to solve the problem and keep the two parameters.
@@ -384,6 +384,15 @@ def test_euclidean_distances():
    assert_array_almost_equal(D, [[1., 2.]])

    rng = np.random.RandomState(0)
    # check if it works for float32
    X = rng.rand(1, 3000000).astype(np.float32)
Is there a specific reason for such a complex test? It seems that the current problem can be reproduced with a very simple array, e.g.:
arr1_32 = np.array([555.5, 666.6, 777.7], dtype=np.float32)
arr2_32 = np.array([555.6, 666.7, 777.8], dtype=np.float32)
arr1_64 = np.array([555.5, 666.6, 777.7], dtype=np.float64)
arr2_64 = np.array([555.6, 666.7, 777.8], dtype=np.float64)
euclidean_distances(arr1_32.reshape(1, -1), arr2_32.reshape(1, -1))
# array([[ 0.17319804]], dtype=float32)
euclidean_distances(arr1_64.reshape(1, -1), arr2_64.reshape(1, -1))
# array([[ 0.17320508]])
Well, I thought it would be better to show that it fails with a random array. No particular reason though. I'll change it to a simpler test if it's a problem.
@jnothman yeah, I could use some help dealing with chunking and sparse matrices.
Are you capable of plotting benchmarks? Could I suggest you start by setting up some benchmarks, show what they look like, and then we can work on improving the implementation. Thanks.
Yeah, I think I can do it. I'll make a commit.
Benchmarks as a gist rather than within the PR are fine.
Thanks. It's helpful if you run the benchmark and share the output too, so that we're all on the same page.
Also, you don't need the number of samples and features to be randomized. Rather, you want to test the response of runtime and memory to changes in the number of features and samples, ideally plotting curves for each of: the new implementation, the old implementation, and the old implementation with norms precomputed.
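A rough sketch of what such a timing loop could look like (the call for the new implementation is left as a placeholder comment, since its final name isn't settled; memory would be profiled separately):

import time
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.utils.extmath import row_norms

rng = np.random.RandomState(0)
n_samples = 1000

for n_features in [10, 100, 1000, 10000]:
    X = rng.rand(n_samples, n_features).astype(np.float32)
    Y = rng.rand(n_samples, n_features).astype(np.float32)
    YY = row_norms(Y, squared=True)[np.newaxis, :]

    t0 = time.time()
    euclidean_distances(X, Y)                     # old implementation
    t1 = time.time()
    euclidean_distances(X, Y, Y_norm_squared=YY)  # old, with precomputed norms
    t2 = time.time()
    # a third timing would go here for the new implementation in this PR
    print(n_features, t1 - t0, t2 - t1)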
@ragnerok it would be better if you could also benchmark KMeans and Birch :) Remember to remove the code calculating Y_norm_squared (KMeans in _k_init and Birch in _split_node).
I don't think X_norm_squared and Y_norm_squared should be deprecated. I have a WIP branch to speed up nearest neighbors that makes use of at least one of them; look at this for more details: master...lesteve:improve-kneighbors-brute. When you do neighbors queries it is faster to pre-compute ||X||^2 and then do:
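The snippet that followed was not preserved here; presumably it is the usual quadratic expansion, roughly along these lines for a brute-force neighbour query (dense arrays assumed):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(1000, 50)   # indexed points
Q = rng.rand(10, 50)     # query points

# Precompute ||x||^2 once at fit time...
X_norm_squared = np.einsum('ij,ij->i', X, X)

# ...then each query only needs its own norms and the cross term:
# ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2
sq_dists = (np.einsum('ij,ij->i', Q, Q)[:, np.newaxis]
            - 2 * np.dot(Q, X.T)
            + X_norm_squared[np.newaxis, :])
nearest = np.argmin(sq_dists, axis=1)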
Well, at least for float32 that may be faster but very inaccurate. Perhaps we should only change the implementation for float32. Would that fix most of it?
Hmmm, I guess it would be very nice to have proper benchmarks for computing time and accuracy then. Something I forgot to mention: my WIP branch is in the context of https://github.com/erikbern/ann-benchmarks, which only measures predict time and not fit time, in which case precomputing the norms up front is essentially free.
Ok, so you want benchmarks of computing time, memory and accuracy versus the number of features, comparing the new implementation, the old one, and the old one with precomputed norms, right? Also, could anyone help me with measuring memory usage? Should I use psutil's virtual_memory to find the used memory?
Here are the results I get when using the 2nd gist.
I'd like to see more data points in that plot. Also, please mark the number of samples on your plot (e.g. in its title). Without knowing the number of samples, I don't know whether the slowdown due to the number of features might be caused by large memory allocations, and hence solved by chunking. You can do memory profiling using memory_profiler, I think, to find the maximum memory usage over a function's execution.
By chunking I mean something like #7177, which sets a limit on working memory and performs the computations on successive slices of the array to achieve that.
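Not the actual #7177 code, but as an illustration of the idea, a row-chunked loop under an assumed working-memory budget could look like:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def pairwise_in_chunks(X, Y, working_memory_mib=64):
    # Pick a row chunk size so that each slice of the distance matrix
    # stays under the working-memory budget (float64 output assumed).
    row_bytes = 8 * Y.shape[0]
    chunk = max(1, int(working_memory_mib * 2**20 // row_bytes))
    out = np.empty((X.shape[0], Y.shape[0]))
    for start in range(0, X.shape[0], chunk):
        stop = min(start + chunk, X.shape[0])
        out[start:stop] = euclidean_distances(X[start:stop], Y)
    return out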
You should also be testing with float32, I suppose.
This is going to be tricky to get right. We might need to think a little more.
else:
    YY = row_norms(Y, squared=True)[np.newaxis, :]
diff = X.reshape(-1, 1, X.shape[1]) - Y.reshape(1, -1, X.shape[1])
distances = np.sum(np.power(diff, 2), axis=2)
Can you please try using np.dot to calculate the sum of squares?
If I may suggest an alternate method (the code is quoted in the reply below):
This method is pretty fast in comparison.
Yes, einsum can be faster than dot for this. Good. Let's work with that.
…On 12 November 2017 at 04:26, Osaid Rehman Nasir wrote:
If I may suggest an alternate method -
def dist(X, Y, squared=False):
X, Y = check_pairwise_arrays(X, Y)
diff = X.reshape(-1, 1, X.shape[1]) - Y.reshape(1, -1, X.shape[1])
diff = diff.reshape(-1, X.shape[1])
distances = np.einsum('ij,ij->i', diff, diff)
distances = distances.reshape(-1, Y.shape[0])
return distances if squared else np.sqrt(distances)
This method is pretty fast in comparison.
Obviously I can use sklearn.utils.extmath.row_norms rather than einsum
but it gives a less precise result
However, we may find that we want to cythonize the whole sum-of-squared-distances to avoid any memory overhead…
The only problem with row_norms/einsum is that it's not very accurate for a large number of features (e.g. the test that I've added). Output:
I think inaccuracy at the 7th significant figure is unsurprising in float32. I don't consider this a real problem.
(Note that np.log2(10 ** 7) == 23.25..., while float32 only has a 24-bit significand.)
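For reference, float32's mantissa width is what caps the precision at roughly 7 decimal digits:

import numpy as np

print(np.finfo(np.float32).nmant)  # 23 stored mantissa bits (24 with the implicit bit)
print(np.finfo(np.float32).eps)    # ~1.19e-07, i.e. about 7 significant decimal digits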
Benchmark of einsum?
So should I change the current implementation, also change the test, and commit?
@jnothman so what do you want me to do now?
Well, I think it's clear that we can only apply this to the float32 case. It's tempting even to think about ways we can calculate in float64 and return float32s without ever using excessive memory. Or we can just go back to the situation in v0.18 or so, when we just returned float64. I don't especially like the idea of a solution where the float32 calculation takes much longer... Perhaps @ogrisel will give an opinion here.
I think a temporary fix in which we perform the calculation in float64 (whether or not we then cast back to float32 for the return value) might be a better idea than maintaining two parallel implementations, one with a poor memory profile. We could later potentially use chunked calculation as in #10280, converting one chunk at a time, so that we maintain the overall memory benefits of working in float32... But don't worry about that for now.
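Concretely, the temporary fix described here amounts to something like the following sketch (not the final patch; the wrapper name is only illustrative):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def euclidean_distances_float32(X, Y):
    # Upcast, compute with the stable float64 path, cast the result back.
    D = euclidean_distances(X.astype(np.float64), Y.astype(np.float64))
    return D.astype(np.float32)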
Just to make sure I got it right: all I have to do for now is convert float32 inputs to float64 and return the distance as float64.
Return as float32, perhaps. Yes, I think so. We need to focus on correctness before efficiency, but we need to leave the float64 case as efficient as it was.
Ok, I'll make a commit then.
Hmm... I'd like to see this finished in the next month or so. @ragnerok are you around?
Yeah, I'll work on it now, apologies for the delay.
No worries, and thanks for your work :)
In the current implementation, the performance bottleneck is most likely in the matrix-matrix dot product, which numpy delegates to the underlying BLAS. BTW, I think the current algorithm loses approximately 3 decimal digits of accuracy (a rough experimental estimate) for both float32 and float64 data types. I genuinely wonder how one decides what is or isn't an acceptable loss.
For what it's worth, I've tried several different algorithms to compute the euclidean distance matrix. It turns out the first one was a bad idea: the result is symmetric, but further from the true value.
@Celelibi basically said that the best solution is to do the computation in 64-bit even when the input is 32-bit, but to do so one chunk at a time to avoid excessive memory usage due to copying X and Y (which may have a lot of features, and hence be as costly in storage as the pairwise distances). @Celebili, would you like to offer an implementation? I'm also wondering whether the scipy.spatial implementation is more stable and can be exploited.
Sure. As long as you don't misspell my nickname. ^^
I also looked into it; I looked too quickly at the code of scipy.spatial. BTW, I have some (crowded) plots to support all these claims.
The code performing the block computation should probably look something like this.

bs = 1024
dist = np.zeros((X.shape[0], Y.shape[0]), dtype=(X[0, 0] - Y[0, 0]).dtype)
for i in range(0, X.shape[0], bs):
    for j in range(i, Y.shape[0], bs):
        for k in range(0, X.shape[1], bs):
            Xc = X[i:i+bs, k:k+bs].astype(np.float64)
            Yc = Y[j:j+bs, k:k+bs].astype(np.float64)
            dist[i:i+bs, j:j+bs] += euclidean_distances(Xc, Yc, squared=True)
        if i != j:
            dist[j:j+bs, i:i+bs] = dist[i:i+bs, j:j+bs].T

But the current code would probably need some refactoring to avoid unclear recursive calls. And it turns out reusing…
@Celelibi I think you should make a new PR and I should probably close this one; I hadn't done too much work anyway.
Thanks again for your work, @ragnerok. This is an important but difficult issue to solve.
Yeah, no problem, and thanks for the help.
Reference Issues/PRs
Fixes #9354
What does this implement/fix? Explain your changes.
Provides a more precise result.
Any other comments?
I also tried using row_norms but for some reason it was less precise than this implementation.
Here's what I tried -
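The snippet itself was not preserved here; a hypothetical reconstruction of what a row_norms-based variant of the einsum function quoted above might have looked like (a guess, not the author's actual code):

import numpy as np
from sklearn.utils.extmath import row_norms
from sklearn.metrics.pairwise import check_pairwise_arrays

def dist_with_row_norms(X, Y, squared=False):
    X, Y = check_pairwise_arrays(X, Y)
    # Squared norm of each pairwise difference via row_norms instead of einsum
    diff = X.reshape(-1, 1, X.shape[1]) - Y.reshape(1, -1, X.shape[1])
    distances = row_norms(diff.reshape(-1, X.shape[1]), squared=True)
    distances = distances.reshape(X.shape[0], Y.shape[0])
    return distances if squared else np.sqrt(distances)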