Ensure conj/neg flags are set in destination for CUDA->CPU copies #147231
base: gh/amjames/20/base
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147231
Note: Links to docs will display an error until the docs builds have been completed.
❌ 7 New Failures. As of commit 5610b04 with merge base f2221b2, the following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
test/test_complex.py (Outdated)

```diff
@@ -44,6 +45,20 @@ def test_conj_copy(self, device, dtype):
     x1.copy_(xc1)
     self.assertEqual(x1, torch.tensor([5 - 1j, 2 - 2j], device=device, dtype=dtype))

+    @onlyCUDA
```
Why is this test onlyCUDA? Couldn't we parameterize it across all device backends?
You need a CUDA tensor to be the source of the copy and a CPU tensor to be the destination. The issue is only present on cross-device copies.
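
For context, a minimal repro sketch of the cross-device case under discussion (it assumes a CUDA device is available; the values are illustrative, borrowed from the existing test rather than the linked issue):

```python
import torch

# A lazily-conjugated CUDA source: .conj() only sets the conj bit,
# the underlying data is left untouched.
src = torch.tensor([5 + 1j, 2 + 2j], device="cuda").conj()

# A plain CPU destination whose conj bit is unset.
dst = torch.empty(2, dtype=src.dtype)

# The cross-device copy must materialize the conjugation into dst,
# since dst's flags cannot change; this is the path being fixed.
dst.copy_(src, non_blocking=True)
torch.cuda.synchronize()

print(dst)  # expected: tensor([5.-1.j, 2.-2.j])
```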
@amjames For this specific issue, sure, but I am just pointing out we could expand test coverage to copies from XPU, TPU, etc. Also, a CPU-to-CPU copy should be valid, right?
Or is `copy_` not implemented on all devices? Or is this behavior buggy on other devices? Or worse, does it throw an error depending on the support for non-blocking?
Also one minor nit: `non_blocking` could be parameterized, or at the very least exercised as a subtest.
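
A sketch of how that nit could be addressed inside the existing device-generic test class (hypothetical test name, not the PR's actual code):

```python
import torch

# Fragment of a device-generic TestCase: run both the blocking and
# non-blocking paths of the same cross-device copy as subtests.
def test_conj_copy_cross_device(self, device, dtype):
    for non_blocking in (False, True):
        with self.subTest(non_blocking=non_blocking):
            src = torch.tensor([5 + 1j, 2 + 2j], device=device, dtype=dtype).conj()
            dst = torch.empty(2, dtype=dtype)  # CPU destination
            dst.copy_(src, non_blocking=non_blocking)
            torch.cuda.synchronize()
            self.assertEqual(dst, torch.tensor([5 - 1j, 2 - 2j], dtype=dtype))
```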
Oh wait, it looks like the current test suite only runs on CPU/CUDA anyway, so this is fine.
Actually, this is a pretty perf-sensitive part of the codebase, so I'll defer.
aten/src/ATen/native/cuda/Copy.cu (Outdated)

```diff
@@ -449,10 +449,10 @@ static void copy_kernel_cuda(TensorIterator& iter, bool non_blocking) {
   }

   if (iter.tensor(0).is_conj() != iter.tensor(1).is_conj()) {
-    iter.tensor(0).conj_physical_();
+    iter.tensor(0)._set_conj(iter.tensor(1).is_conj());
```
You cannot change the conj flag on the destination tensor, because the copy is in-place and you are not allowed to change the attributes the tensor has. This is the reason `conj_physical_` has been run here in the first place: why do you think we would do an operation on tensor data if we didn't need to and could just flip a flag?
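
To make the distinction concrete, a small sketch of the lazy conj bit versus physical conjugation (illustrative only; it assumes current PyTorch semantics for `conj`, `resolve_conj`, and `conj_physical_`):

```python
import torch

x = torch.tensor([1 + 1j, 2 + 2j])

lazy = x.conj()                 # only sets the conj bit; storage is shared with x
print(lazy.is_conj())           # True: data not touched yet

materialized = lazy.resolve_conj()  # materializes the conjugate into new storage
print(materialized.is_conj())       # False: data is now physically conjugated

y = torch.tensor([1 + 1j, 2 + 2j])
y.conj_physical_()              # conjugates the data in place, flags untouched
print(y.is_conj())              # False: only the values changed
```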
Alright, I misunderstood what `conj_physical` does. Now, with the correct understanding of that, I don't see any way for us to ensure that the conjugation happens on the destination after the copy without blocking. It will have to be resolved beforehand.

We could push this into the branch for copies requiring a temporary, where the conjugate is resolved into a temporary which is then the source for the non-blocking copy.

> why do you think we would do an operation on tensor data if we didn't need it and could just flip a flag

FWIW, I thought this might be something we don't allow, but there were no test failures, so I thought it could be an oversight.
I thought there were test cases to check this, but apparently not. We should resolve conj/neg similarly to how we resolve dtype via intermediates: the cases with mismatching conj/neg should go through the `requires_temporaries` branch.
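
A rough Python-pseudocode sketch of that direction (not the actual ATen kernel; `raw_memcpy` is a hypothetical stand-in for the contiguous `cudaMemcpyAsync` fast path):

```python
# Python pseudocode only; the real logic lives in copy_kernel_cuda (C++).
def copy_kernel(dst, src, non_blocking):
    flags_match = (dst.is_conj() == src.is_conj()
                   and dst.is_neg() == src.is_neg())
    if dst.dtype == src.dtype and flags_match:
        raw_memcpy(dst, src, non_blocking)  # bytes are directly reusable
        return
    # Temporary branch: dst's flags are fixed by copy_ semantics, so we
    # precompute the physical bytes dst must hold for its logical values
    # to equal src's, then fall back to a plain bitwise copy.
    tmp = src.to(dst.dtype)
    if dst.is_conj():
        tmp = tmp.conj()  # dst stores conjugated bytes; pre-conjugate logically
    if dst.is_neg():
        tmp = tmp.neg()   # likewise if dst carries the negative-view bit
    tmp = tmp.resolve_conj().resolve_neg().contiguous()  # materialize
    raw_memcpy(dst, tmp, non_blocking)
```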
aten/src/ATen/native/cuda/Copy.cu (Outdated)

```diff
-  if (same_dtype && iter.is_contiguous()) {
   // Contiguous same-dtype copies can always use cudaMemcpyAsync
+  bool same_conj_neg = iter.tensor(0).is_conj() == iter.tensor(1).is_conj() && iter.tensor(0).is_neg() == iter.tensor(1).is_neg();
+  if (same_dtype && iter.is_contiguous() && same_conj_neg) {
```
```diff
-  if (same_dtype && iter.is_contiguous() && same_conj_neg) {
+  if (same_dtype && iter.is_contiguous() && (same_conj_neg || !non_blocking)) {
```
aten/src/ATen/native/cuda/Copy.cu (Outdated)

```diff
@@ -340,8 +344,8 @@ static void copy_kernel_cuda(TensorIterator& iter, bool non_blocking) {

   // Enable p2p access between devices. (No-op if it involves the CPU)
   bool p2p_enabled = maybe_enable_p2p_access(dst_device, src_device);

-  if (copy_requires_temporaries(iter, p2p_enabled)) {
+  bool temp_needed = copy_requires_temporaries(iter, p2p_enabled, non_blocking);
```
dead code
aten/src/ATen/native/cuda/Copy.cu (Outdated)

```diff
@@ -355,19 +359,17 @@ static void copy_kernel_cuda(TensorIterator& iter, bool non_blocking) {
   auto conversion_device = non_blocking ? kCUDA : kCPU;
   if (iter.device_type(1) == conversion_device) {
     dst_contig = dst.is_contiguous() ? dst : at::empty_like(dst, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
-    src_contig = iter.tensor(1).to(iter.dtype(0)).expand_as(dst).contiguous();
+    src_contig = iter.tensor(1).to(iter.dtype(0)).expand_as(dst).contiguous().resolve_conj();
```
What if `src.is_conj()` is false and `dst.is_conj()` is true? `resolve_conj` will do nothing, `dst_contig.copy_(src_contig)` will still require temporaries, and you are in an infinite loop.
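
To illustrate the no-op concern, a short sketch of current PyTorch behavior (not code from the PR):

```python
import torch

src = torch.tensor([1 + 1j, 2 + 2j])  # conj bit unset
resolved = src.resolve_conj()

# resolve_conj only materializes when the conj bit is set; here it hands
# back the same storage, so a dst-side conj mismatch is left unresolved.
print(src.is_conj())                          # False
print(resolved.data_ptr() == src.data_ptr())  # True: nothing happened
```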
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale.
Stack from ghstack (oldest at bottom):
Fixes #146286