torch.cdist returns inconsistent result · Issue #37734 · pytorch/pytorch · GitHub
torch.cdist returns inconsistent result #37734


Open
vadimkantorov opened this issue May 3, 2020 · 16 comments
Labels
module: distance functions · module: docs (Related to our documentation, both in docs/ and docblocks) · module: numerical-stability (Problems related to numerical stability of operations) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@vadimkantorov
Contributor
vadimkantorov commented May 3, 2020
import torch

# A and B are the tensors saved in the attached bug.pt (gzipped below)
A, B = map(torch.load('bug.pt').get, ['A', 'B'])
print(A.shape, B.shape) # torch.Size([95, 39]) torch.Size([295, 39])

# full pairwise distance matrix
D = torch.cdist(A, B)
print(D[75, 233]) # tensor(0., grad_fn=<SelectBackward>)

# the same single pair, computed in isolation
D_ = torch.cdist(A[75].unsqueeze(0), B[233].unsqueeze(0)).squeeze()
print(D_) # tensor(0.0004, grad_fn=<SqueezeBackward0>)

Both ways should have the same precision properties. It is very strange that they return different results: 0.0004 is quite a large difference.

bug.pt.gz

cc @jlin27 @mruberry

@vadimkantorov
Contributor Author

The first way also produces a nan gradient, since cdist happens to return a norm of exactly 0, even though the true distance is not exactly zero (as seen in the second way).

@vadimkantorov
Contributor Author
vadimkantorov commented May 4, 2020

I wonder if the cause is a different block-reduction configuration being used; if so, the problem is really nasty: float32 precision is quite high and only 39 elements are being reduced, so the error should not be this large.

@ngimel
Collaborator
ngimel commented May 5, 2020

The issue of nan gradient if cdist is 0 is being addressed in #37337. As for the numeric differences, for the full computation a matrix-multiply-based approach is used, which is much faster but also more susceptible to numerical issues. For the second case, a custom kernel is used. So the difference in results is expected.

} else if (p == 2 && (mode == 1 || (mode == 0 && (r1 > 25 || r2 > 25)))) {
  Tensor dist = (expand_batch_product == 1) ? euclidean_dist_out(x1, x2)
                                            : euclidean_dist_out(tensor1_expanded, tensor2_expanded);
  result = dist.view(output_shape);
} else {
  result = at::empty(output_shape, x1.options());
  cdist_stub(device1, result, tensor1_expanded, tensor2_expanded, p);
}
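
For illustration (this is not the actual PyTorch kernel code), here is a minimal sketch of the two formulas: the mm path effectively expands ||a − b||² = ||a||² + ||b||² − 2·a·b, where the three O(1) terms cancel almost completely for nearby points, while the elementwise kernel squares the already-tiny differences directly. The points below are made-up stand-ins for rows of A and B from bug.pt:

import torch

torch.manual_seed(0)

# hypothetical nearby points on the unit simplex, standing in for rows of A and B
a = torch.rand(39)
a = a / a.sum()
b = a + 1e-5 * torch.rand(39)

# mm-style expansion: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
# each term is O(1) while the true result is O(1e-9), so almost all significant
# bits cancel and the result can even come out slightly negative
sq_mm = a.dot(a) + b.dot(b) - 2 * a.dot(b)

# direct reduction: each term is already tiny, so far less cancellation
sq_direct = ((a - b) ** 2).sum()

print(sq_mm.clamp(min=0).sqrt(), sq_direct.sqrt())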

@ngimel added the module: numerical-stability and triaged labels on May 5, 2020
@vadimkantorov
Contributor Author
vadimkantorov commented May 5, 2020

Oh, I see. It makes sense that there's a difference, but isn't this too much of a difference? All of these inputs are in the 0-1 range, a quite dense region.

@vadimkantorov
Contributor Author
vadimkantorov commented May 5, 2020

Using double brings the difference down to the 1e-12 range. Maybe it's worth noting the precision properties somehow in the docs? I wonder if someone has worked out the precision properties of both methods in a textbook somewhere.
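
For reference, the double-precision comparison above amounts to the following (assuming A and B are loaded from bug.pt exactly as in the original snippet):

# same two computations as in the original snippet, but in float64
D = torch.cdist(A.double(), B.double())
D_ = torch.cdist(A[75].double().unsqueeze(0), B[233].double().unsqueeze(0)).squeeze()
print((D[75, 233] - D_).abs())  # ~1e-12 instead of ~4e-4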

@albanD added the module: docs and module: operators labels on May 5, 2020
@albanD
Collaborator
albanD commented May 5, 2020

Note that running this same code on my mac gives:

albandes-mbp:test albandes$ py foo.py
torch.Size([95, 39]) torch.Size([295, 39])
tensor(0.0003, grad_fn=<SelectBackward>)
tensor(0.0004, grad_fn=<SqueezeBackward0>)

Which is not great but not as bad :D

But I think it is still worth adding a note in the doc stating that the mm version might be less precise.

@ngimel
Collaborator
ngimel commented May 5, 2020

It's also 0.0003 on the GPU, so MKL must be doing something funny.

@albanD
Collaborator
albanD commented May 5, 2020

We have also seen that the number of threads used on CPU can have a large impact on the precision of these functions in worst-case scenarios.

@vadimkantorov
Contributor Author
vadimkantorov commented May 5, 2020

In reality, the underlying precision can be even worse: the result is probably exactly zero because the expanded form came out negative and was then clamped.
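
A tiny standalone illustration of that mechanism (not the actual cdist backward, which #37337 addresses), with torch imported: a slightly negative squared distance clamped to zero makes sqrt return exactly 0 and its backward non-finite.

# sq stands in for the slightly-negative expanded squared distance
sq = torch.tensor(-1e-9, requires_grad=True)
d = sq.clamp(min=0).sqrt()
d.backward()
print(d, sq.grad)  # tensor(0.) and a nan gradient (inf from sqrt'(0) times 0 from clamp)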

@vadimkantorov
Contributor Author

We have also seen that the number of threads used on CPU can have a large impact on the precision of these functions in worst-case scenarios.

There are only 39 values reduced per mm cell here. I hope only one thread per cell is used :)

Could it also be that a different accumulator width is used?

@vadimkantorov
Contributor Author
vadimkantorov commented May 6, 2020

About error analysis: I guess it'd be something similar to https://www.johndcook.com/blog/2008/09/28/theoretical-explanation-for-numerical-results/. That post shows that the matrix-multiply-style method is much worse than the difference method (for standard deviation computation).
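
The analogy from that post in a few lines (made-up numbers, torch imported): the textbook E[x²] − E[x]² formula suffers the same cancellation as the mm expansion, while the difference-based two-pass formula does not.

x = torch.full((10000,), 1000.0) + torch.rand(10000)   # true variance is ~1/12
naive = (x * x).mean() - x.mean() ** 2                  # sum-of-squares style, cancels badly in float32
two_pass = ((x - x.mean()) ** 2).mean()                 # differences first, then square
print(naive, two_pass)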

@albanD
Collaborator
albanD commented May 6, 2020

Could it also be that a different accumulator width is used?

About this, @ngimel: do we document anywhere which accumulators (single vs. double precision) are used for CPU/GPU operators?

@ngimel
Collaborator
ngimel commented May 6, 2020

No (in part because we are not too consistent about it, in part because for calls offloaded to libraries, such as matrix multiplication, it would depend on the underlying library)

@vadimkantorov
Contributor Author

A similar issue in sklearn, also with fingers pointed at MKL: scikit-learn/scikit-learn#11711

@vadimkantorov
Contributor Author
vadimkantorov commented May 6, 2020

I guess actionable "feature requests" are:

  1. Determine if MKL has worse precision than CUDA matrix multiply
  2. Document these quirks (these problems would often appear when computing distances between close points from simplex/unit-norm like in my case)
  3. Does keops have better precision?

@vadimkantorov
Contributor Author

Oh wow, I didn't know cdist supported choosing the computation algorithm: #42479
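
For anyone landing here later: the knob exposed there is the compute_mode argument, so (if I read that PR right) the slower but more accurate elementwise path can be forced with something like the following, assuming A and B from bug.pt as above:

# force the elementwise kernel instead of the matrix-multiply expansion
D_exact = torch.cdist(A, B, compute_mode='donot_use_mm_for_euclid_dist')
print(D_exact[75, 233])  # should agree with the pairwise result from the original snippet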
