torch.cdist returns inconsistent result · Issue #37734 · pytorch/pytorch · GitHub
torch.cdist returns inconsistent result #37734


Open
vadimkantorov opened this issue May 3, 2020 · 16 comments
Labels
module: distance functions · module: docs (Related to our documentation, both in docs/ and docblocks) · module: numerical-stability (Problems related to numerical stability of operations) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@vadimkantorov
Contributor
vadimkantorov commented May 3, 2020
import torch

# A and B are the tensors saved in the attached bug.pt (gzipped below)
A, B = map(torch.load('bug.pt').get, ['A', 'B'])
print(A.shape, B.shape) # torch.Size([95, 39]) torch.Size([295, 39])

# full pairwise distance matrix
D = torch.cdist(A, B)
print(D[75, 233]) # tensor(0., grad_fn=<SelectBackward>)

# the same single pair, computed in isolation
D_ = torch.cdist(A[75].unsqueeze(0), B[233].unsqueeze(0)).squeeze()
print(D_) # tensor(0.0004, grad_fn=<SqueezeBackward0>)

Both ways should have the same precision properties. It is very strange that they return different results: 0.0004 is quite a large difference.

bug.pt.gz

cc @jlin27 @mruberry

@vadimkantorov
Contributor Author

The first way also produces a nan gradient, since cdist happens to return a norm of exactly 0, even though the true distance is not exactly zero (as seen in the second way).

@vadimkantorov
Contributor Author
vadimkantorov commented May 4, 2020

I wonder if the cause is a different block-reduction configuration being used; if so, the problem is really nasty: float32 precision is quite high and only 39 elements are being reduced, so the error should not be this large.

@ngimel
Collaborator
ngimel commented May 5, 2020

The issue of nan gradient if cdist is 0 is being addressed in #37337. As for the numeric differences, for the full computation a matrix-multiply-based approach is used, which is much faster but also more susceptible to numerical issues. For the second case, a custom kernel is used. So the difference in results is expected.

} else if (p == 2 && (mode == 1 || (mode == 0 && (r1 > 25 || r2 > 25)))) {
  Tensor dist = (expand_batch_product == 1) ? euclidean_dist_out(x1, x2)
                                            : euclidean_dist_out(tensor1_expanded, tensor2_expanded);
  result = dist.view(output_shape);
} else {
  result = at::empty(output_shape, x1.options());
  cdist_stub(device1, result, tensor1_expanded, tensor2_expanded, p);
}
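
For illustration (this is not the actual PyTorch kernel code), here is a minimal sketch of the two formulas: the mm path effectively expands ||a − b||² = ||a||² + ||b||² − 2·a·b, where the three O(1) terms cancel almost completely for nearby points, while the elementwise kernel squares the already-tiny differences directly. The points below are made-up stand-ins for rows of A and B from bug.pt:

import torch

torch.manual_seed(0)

# hypothetical nearby points on the unit simplex, standing in for rows of A and B
a = torch.rand(39)
a = a / a.sum()
b = a + 1e-5 * torch.rand(39)

# mm-style expansion: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
# each term is O(1) while the true result is O(1e-9), so almost all significant
# bits cancel and the result can even come out slightly negative
sq_mm = a.dot(a) + b.dot(b) - 2 * a.dot(b)

# direct reduction: each term is already tiny, so far less cancellation
sq_direct = ((a - b) ** 2).sum()

print(sq_mm.clamp(min=0).sqrt(), sq_direct.sqrt())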

@ngimel added the module: numerical-stability and triaged labels on May 5, 2020
@vadimkantorov
Contributor Author
vadimkantorov commented May 5, 2020

Oh, I see. It makes sense that there's a difference, but isn't this too much of a difference? All of these inputs are in the 0-1 range, a quite dense region.

@vadimkantorov
Contributor Author
vadimkantorov commented May 5, 2020

Using double brings the difference down to the 1e-12 range. Maybe it's worth noting the precision properties somehow in the docs? I wonder if someone has worked out the precision properties of both methods in a textbook somewhere.
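
For reference, the double-precision comparison above amounts to the following (assuming A and B are loaded from bug.pt exactly as in the original snippet):

# same two computations as in the original snippet, but in float64
D = torch.cdist(A.double(), B.double())
D_ = torch.cdist(A[75].double().unsqueeze(0), B[233].double().unsqueeze(0)).squeeze()
print((D[75, 233] - D_).abs())  # ~1e-12 instead of ~4e-4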

@albanD added the module: docs and module: operators labels on May 5, 2020
@albanD
Collaborator
albanD commented May 5, 2020

Note that running this same code on my mac gives:

albandes-mbp:test albandes$ py foo.py
torch.Size([95, 39]) torch.Size([295, 39])
tensor(0.0003, grad_fn=<SelectBackward>)
tensor(0.0004, grad_fn=<SqueezeBackward0>)

Which is not great but not as bad :D

But I think it is still worth adding a note in the doc stating that the mm version might be less precise.

@ngimel
Collaborator
ngimel commented May 5, 2020

It's also 0.0003 on the GPU, so MKL must be doing something funny.

@albanD
Collaborator
albanD commented May 5, 2020

We have also seen that the number of threads used on CPU can have a large impact on the precision of these functions in worst-case scenarios.

@vadimkantorov
Contributor Author
vadimkantorov commented May 5, 2020

In reality, the underlying precision can be even worse: the result is probably exactly zero because the expanded form came out negative and was then clamped.
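
A tiny standalone illustration of that mechanism (not the actual cdist backward, which #37337 addresses), with torch imported: a slightly negative squared distance clamped to zero makes sqrt return exactly 0 and its backward non-finite.

# sq stands in for the slightly-negative expanded squared distance
sq = torch.tensor(-1e-9, requires_grad=True)
d = sq.clamp(min=0).sqrt()
d.backward()
print(d, sq.grad)  # tensor(0.) and a nan gradient (inf from sqrt'(0) times 0 from clamp)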

@vadimkantorov
Contributor Author

We have also seen that the number of threads used on CPU can have a large impact on the precision of these functions in worst-case scenarios.

There are only 39 values reduced per mm cell here. I hope only one thread per cell is used :)

Could it also be that a different accumulator width is used?

@vadimkantorov
Contributor Author
vadimkantorov commented May 6, 2020

About error analysis: I guess it'd be something similar to https://www.johndcook.com/blog/2008/09/28/theoretical-explanation-for-numerical-results/. That post shows that the matrix-multiply-style method is much worse than the difference method (for standard deviation computation).
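
The analogy from that post in a few lines (made-up numbers, torch imported): the textbook E[x²] − E[x]² formula suffers the same cancellation as the mm expansion, while the difference-based two-pass formula does not.

x = torch.full((10000,), 1000.0) + torch.rand(10000)   # true variance is ~1/12
naive = (x * x).mean() - x.mean() ** 2                  # sum-of-squares style, cancels badly in float32
two_pass = ((x - x.mean()) ** 2).mean()                 # differences first, then square
print(naive, two_pass)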

@albanD
Collaborator
albanD commented May 6, 2020

Could it also be that a different accumulator width is used?

About this, @ngimel: do we document anywhere which accumulators (single vs. double precision) are used for CPU/GPU operators?

@ngimel
Collaborator
ngimel commented May 6, 2020

No (in part because we are not too consistent about it, in part because for calls offloaded to libraries, such as matrix multiplication, it would depend on the underlying library)

@vadimkantorov
Contributor Author

A similar issue in sklearn, also with fingers pointed at MKL: scikit-learn/scikit-learn#11711

@vadimkantorov
Contributor Author
vadimkantorov commented May 6, 2020

I guess actionable "feature requests" are:

  1. Determine if MKL has worse precision than CUDA matrix multiply
  2. Document these quirks (these problems would often appear when computing distances between close points from simplex/unit-norm like in my case)
  3. Does keops have better precision?

@vadimkantorov
Contributor Author

Oh wow, I didn't know cdist supported choosing the computation algorithm: #42479
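
For anyone landing here later: the knob exposed there is the compute_mode argument, so (if I read that PR right) the slower but more accurate elementwise path can be forced with something like the following, assuming A and B from bug.pt as above:

# force the elementwise kernel instead of the matrix-multiply expansion
D_exact = torch.cdist(A, B, compute_mode='donot_use_mm_for_euclid_dist')
print(D_exact[75, 233])  # should agree with the pairwise result from the original snippet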
