8000 [MRG] Fix euclidean_distances numerical instabilities by jeremiedbb · Pull Request #13410 · scikit-learn/scikit-learn · GitHub

[MRG] Fix euclidean_distances numerical instabilities #13410


Closed
wants to merge 35 commits

Conversation

jeremiedbb
Member
@jeremiedbb jeremiedbb commented Mar 7, 2019

Fixes #9354
Should close #11271 eventually
Should close #12128

This PR follows the discussion in #9354, i.e.

  • for n_features <= 16, use the usual exact method to compute distances: d(x,y)² = ||x-y||²

  • for n_features > 16, keep the fast method d(x,y)² = ||x||² - 2x·y + ||y||², but upcast to float64 if necessary (a small sketch of the instability follows the list).
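A minimal NumPy sketch of the instability (illustrative values, not code from this PR): the expanded formula cancels catastrophically in float32 when points are close relative to their magnitude, and upcasting to float64 recovers the exact result.

import numpy as np

# Two float32 points that are close relative to their magnitude.
x = np.array([10000.0, 10000.1], dtype=np.float32)
y = np.array([10000.1, 10000.0], dtype=np.float32)

exact = np.sum((x - y) ** 2)          # ||x - y||², stable
fast32 = x @ x - 2 * (x @ y) + y @ y  # ||x||² - 2x·y + ||y||², cancels in float32

x64, y64 = x.astype(np.float64), y.astype(np.float64)
fast64 = x64 @ x64 - 2 * (x64 @ y64) + y64 @ y64  # same formula after upcast

print(exact, fast32, fast64)  # fast32 is far off; fast64 matches exact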

A couple of remarks:

  • When n_features <= 32 and one of the matrices is sparse while the other is not, the performance is pretty good, but I couldn't get good performance when both are sparse. In that case I densify the smaller array. I think this is fine and won't cause a big increase in memory usage. The reason is as follows:
    Worst case, both arrays have the same number of samples N. Then the ratio between the memory usage of the densified array and that of the distance matrix is (N * n_features) / (N * N) = n_features / N <= 32 / N. So for a reasonable N >~ 1000, the increase is at most ~3%.

  • To achieve good performance in the exact dense/dense case I had to add the -Ofast compiler flag to enable more aggressive compiler optimizations (almost a 2x speedup); a build sketch follows.
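A hedged sketch of how such a flag can be passed to a Cython extension at build time; the module and file names below are illustrative, not the PR's actual layout.

from setuptools import Extension

# Illustrative extension definition; only extra_compile_args matters here.
ext = Extension(
    "sklearn.metrics._pairwise_fast",                # hypothetical module name
    sources=["sklearn/metrics/_pairwise_fast.pyx"],  # hypothetical source path
    extra_compile_args=["-Ofast"],                   # aggressive optimizations
)

Note that -Ofast implies -ffast-math, which relaxes strict IEEE 754 semantics (e.g. assumptions about NaN and Inf), so the speedup is a trade-off rather than a free win.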

Benchmarks:

  • low dimension
    n_samples_X = 100000, n_samples_Y = 100, n_features = 10

    | dtype   | inputs        | cdist            | master           | PR               |
    |---------|---------------|------------------|------------------|------------------|
    | float32 | dense/dense   | 102 ms ± 184 µs  | 74.4 ms ± 303 µs | 99.5 ms ± 173 µs |
    | float32 | sparse/dense  | /                | 96.3 ms ± 809 µs | 101 ms ± 217 µs  |
    | float32 | sparse/sparse | /                | 110 ms ± 5.44 ms | 102 ms ± 804 µs  |
    | float64 | dense/dense   | 98.4 ms ± 518 µs | 122 ms ± 256 µs  | 134 ms ± 1.2 ms  |
    | float64 | sparse/dense  | /                | 165 ms ± 673 µs  | 103 ms ± 204 µs  |
    | float64 | sparse/sparse | /                | 170 ms ± 11.6 ms | 103 ms ± 92.4 µs |
  • high dimension
    n_samples_X = 100000, n_samples_Y = 100, n_features = 1000

    | dtype   | inputs        | cdist            | master           | PR               |
    |---------|---------------|------------------|------------------|------------------|
    | float32 | dense/dense   | 8.22 s ± 19.4 ms | 406 ms ± 6.48 ms | 1.11 s ± 2.86 ms |
    | float32 | sparse/dense  | /                | 433 ms ± 902 µs  | 860 ms ± 26.8 ms |
    | float32 | sparse/sparse | /                | 879 ms ± 2.78 ms | 1.02 s ± 970 µs  |
    | float64 | dense/dense   | 8.08 s ± 230 ms  | 631 ms ± 11.3 ms | 1.01 s ± 12.3 ms |
    | float64 | sparse/dense  | /                | 701 ms ± 2.3 ms  | 758 ms ± 12.4 ms |
    | float64 | sparse/sparse | /                | 932 ms ± 14.5 ms | 995 ms ± 4.59 ms |
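A minimal sketch of how timings like these can be reproduced; the exact benchmark script isn't shown here, so the sizes and setup below are assumptions taken from the description above.

import numpy as np
from timeit import timeit
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.RandomState(0)
X = rng.random_sample((100000, 10)).astype(np.float32)  # low-dimension case
Y = rng.random_sample((100, 10)).astype(np.float32)

# Average over a few runs; cdist has no sparse support, hence the "/" cells.
print("euclidean_distances:", timeit(lambda: euclidean_distances(X, Y), number=10) / 10)
print("cdist:", timeit(lambda: cdist(X, Y), number=10) / 10)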

@jnothman
Member

Thanks for this!! I think our main priority here is numerical stability, while ideally maintaining good performance where we can. Let us know when you want review.

distances = euclidean_distances(X, Y)

assert_allclose(distances, expected, rtol=1e-6)
assert distances.dtype == dtype
Member Author

Tests don't pass with rtol=1e-7 (the default) in float32, probably because scipy upcasts to float64 to compute distances, which can cause differences in the last significant digit.

I think an rtol of 1e-6 is enough for float32. What do you think?
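For context, a quick check of the precision at stake (eps is the relative spacing of adjacent floats):

import numpy as np

print(np.finfo(np.float32).eps)  # ~1.19e-07 -> about 7 significant decimal digits
print(np.finfo(np.float64).eps)  # ~2.22e-16 -> about 16 significant decimal digits

# rtol=1e-7 sits at the very edge of float32 precision, while rtol=1e-6
# tolerates a difference in the last significant digit.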

@jeremiedbb jeremiedbb changed the title [WIP] Fix euclidean_distances numerical unstabilities [MRG] Fix euclidean_distances numerical unstabilities Mar 11, 2019
@jeremiedbb
Member Author

@jnothman @amueller @rth I think it's ready for review.

@jnothman
Member

The ci/circleci: doc-min-dependencies failure may be real? Could you have generated infinite distances, for instance?

@jnothman jnothman changed the title [MRG] Fix euclidean_distances numerical unstabilities [MRG] Fix euclidean_distances numerical instabilities Mar 12, 2019
Member
@jnothman jnothman left a comment

Nice work! Please add a what's new entry.

return np.asarray(D)


cdef floating _euclidean_dense_dense_exact_1d(floating *x,
Member

Is this much better than using a vectorized implementation?

Member Author

Pairwise distances can't be simply vectorized.
The NumPy way would be:

D = ((X[:,:,np.newaxis] - Y.T[np.newaxis,:,:])**2).sum(axis=1)

but it takes a lot more memory and is 10x slower.

Another possibility is to do:

for i in range(n_samples_X):
    for j in range(n_samples_Y):
        D[i, j] = ((X[i] - Y[j])**2).sum()

But calling the NumPy API in a tight C loop like this is catastrophic: it's 1000x slower.

So as far as I know, the implementation I did is the only way to stay close to cdist performance. If you know a way to use a vectorized implementation, I'll take it.
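A minimal check of the broadcast approach's memory cost (array sizes here are arbitrary, chosen only for illustration):

import numpy as np
from scipy.spatial.distance import cdist

n_X, n_Y, n_features = 2000, 500, 10
rng = np.random.RandomState(0)
X = rng.rand(n_X, n_features)
Y = rng.rand(n_Y, n_features)

# The broadcast version materializes an (n_X, n_features, n_Y) temporary,
# n_features times larger than the final distance matrix.
diff = X[:, :, np.newaxis] - Y.T[np.newaxis, :, :]
D = (diff ** 2).sum(axis=1)

print(np.allclose(D, cdist(X, Y, metric='sqeuclidean')))  # True
print(diff.nbytes // 2**20, "MiB temporary vs", D.nbytes // 2**20, "MiB result")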

xe = xs + (chunk_size if i < n_chunks_X - 1 else n_samples_X_rem)

X_chunk = X[xs:xe].astype(np.float64)
XX_chunk = row_norms(X_chunk, squared=True)
Member

You won't use the XX and YY that were passed in?