Ensure conj/neg flags are set in destination for CUDA->CPU copies #147231

amjames · 2025-02-14T22:18:29Z

Stack from ghstack (oldest at bottom):

Fixes #146286

[ghstack-poisoned]

pytorch-bot · 2025-02-14T22:18:34Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147231

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures

As of commit 5610b04 with merge base f2221b2 ():

NEW FAILURES - The following jobs have failed:

pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 1, 3, lf.linux.g4dn.12xlarge.nvidia.gpu) (gh)
distributed/tensor/test_tensor_ops.py::DistTensorOpsTest::test_slice
pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 2, 3, lf.linux.g4dn.12xlarge.nvidia.gpu) (gh)
distributed/checkpoint/fsdp/test_fsdp_dsd.py::TestFullyShardWithDistributedStateDict::test_save_with_fsdp2_tp_and_load_with_tp
pull / linux-focal-cuda12.6-py3.10-gcc11 / test (default, 1, 5, lf.linux.4xlarge.nvidia.gpu) (gh)
test_reductions.py::TestReductionsCUDA::test_all_any_vs_numpy_cuda_bool
pull / linux-focal-cuda12.6-py3.10-gcc11 / test (default, 2, 5, lf.linux.4xlarge.nvidia.gpu) (gh)
test_linalg.py::TestLinalgCUDA::test_addmv_rowmajor_colmajor_incx_incy_lda_cuda_float32
pull / linux-focal-cuda12.6-py3.10-gcc11 / test (default, 3, 5, lf.linux.4xlarge.nvidia.gpu) (gh)
test_binary_ufuncs.py::TestBinaryUfuncsCUDA::test_broadcasting__refs_add_cuda_float32
pull / linux-focal-cuda12.6-py3.10-gcc11-sm89 / test (default, 1, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu) (gh)
test_reductions.py::TestReductionsCUDA::test_all_any_vs_numpy_cuda_bool
pull / linux-focal-cuda12.6-py3.10-gcc11-sm89 / test (default, 3, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu) (gh)
test_linalg.py::TestLinalgCUDA::test_addmv_rowmajor_colmajor_incx_incy_lda_cuda_float32

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Fixes #146286 ghstack-source-id: e4fd7f1 Pull Request resolved: #147231

amjames · 2025-02-14T22:20:22Z

@pytorchbot label "topic: not user facing"

Skylion007 · 2025-02-16T15:44:00Z

test/test_complex.py

@@ -44,6 +45,20 @@ def test_conj_copy(self, device, dtype):
        x1.copy_(xc1)
        self.assertEqual(x1, torch.tensor([5 - 1j, 2 - 2j], device=device, dtype=dtype))

+    @onlyCUDA


Why is this test onlyCUDA? Couldn't we parameterize it all device backends?

You need a cuda tensor to be the source of the copy and a CPU tensor to be the destination. The issue is only present on cross device copies

@amjames For this specific issue sure, but I am just pointing out we could expand test coverage to copy an XPU, TPU, etc. Also a CPU to CPU copy should also be valid, right?

Or is copy_ not implemented on all devices? Or is that this behavior is buggy on other devices? Or worse, throws an error depending on the support for non-blocking.

Also one minor nit: non_blocking could be parameterized. Or at the very least executed as a subtest.

Oh wait, looks like the current test suite only runs on CPU/CUDA anyway so this is fine.

Actually, this is a pretty PERF sense part of the codebase, so I'll defer

ngimel · 2025-02-17T19:05:41Z

aten/src/ATen/native/cuda/Copy.cu

@@ -449,10 +449,10 @@ static void copy_kernel_cuda(TensorIterator& iter, bool non_blocking) {
  }

  if (iter.tensor(0).is_conj() != iter.tensor(1).is_conj()) {
-     iter.tensor(0).conj_physical_();
+     iter.tensor(0)._set_conj(iter.tensor(1).is_conj());


you cannot change conj flag on the destination tensor, because the copy is inplace and you are not allowed to change the attributes that the tensor has. THis is the reason conj_physical has been run here in the first place - why do you think we would do an operation on tensor data if we didn't need it and could just flip a flag?

Alight, misunderstood what conj_physical does. Now with the correct understanding of that I don't see anyway for us to ensure that the conjugation happens on the destination after the copy without blocking. It will have to be resolved beforehand.

We could push this into the branch for copies requiring a temporary, where the conjugate is resolved into a temporary which is then the source for the non blocking copy.

why do you think we would do an operation on tensor data if we didn't need it and could just flip a flag

FWIW I thought this might be something we don't allow, but there were no test failures, so thought it could be an oversight.

I thought there were test cases to check this, but apparently not. We should resolve conj/neg similar to how we resolve dtype via intermediates - the cases with mismatching conj/neg should go through requires_temporaries branch.

[ghstack-poisoned]

Fixes #146286 ghstack-source-id: 0c44ccf Pull Request resolved: #147231

ngimel · 2025-02-18T21:35:24Z

aten/src/ATen/native/cuda/Copy.cu

-  if (same_dtype && iter.is_contiguous()) {
-    // Contiguous same-dtype copies can always use cudaMemcpyAsync
+  bool same_conj_neg = iter.tensor(0).is_conj() == iter.tensor(1).is_conj() && iter.tensor(0).is_neg() == iter.tensor(0).is_neg();
+  if (same_dtype && iter.is_contiguous() && same_conj_neg) {


Suggested change

if (same_dtype && iter.is_contiguous() && same_conj_neg) {

if (same_dtype && iter.is_contiguous() && (same_conj_neg || !non_blocking) {

ngimel · 2025-02-18T21:36:08Z

aten/src/ATen/native/cuda/Copy.cu

@@ -340,8 +344,8 @@ static void copy_kernel_cuda(TensorIterator& iter, bool non_blocking) {

  // Enable p2p access between devices. (No-op if it involves the CPU)
  bool p2p_enabled = maybe_enable_p2p_access(dst_device, src_device);
-
-  if (copy_requires_temporaries(iter, p2p_enabled)) {
+  bool temp_needed = copy_requires_temporaries(iter, p2p_enabled, non_blocking);


ngimel · 2025-02-18T21:41:07Z

aten/src/ATen/native/cuda/Copy.cu

@@ -355,19 +359,17 @@ static void copy_kernel_cuda(TensorIterator& iter, bool non_blocking) {
    auto conversion_device = non_blocking ? kCUDA : kCPU;
    if (iter.device_type(1) == conversion_device) {
      dst_contig = dst.is_contiguous() ? dst : at::empty_like(dst, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
-      src_contig = iter.tensor(1).to(iter.dtype(0)).expand_as(dst).contiguous();
+      src_contig = iter.tensor(1).to(iter.dtype(0)).expand_as(dst).contiguous().resolve_conj();


what if src is_conj is false and dst is_conj is true? resolve_conj will do nothing, dst_contig.copy_(src_contig) will still require temporaries, you are in an infinite loop.

[ghstack-poisoned]

Fixes #146286 ghstack-source-id: e990d36 Pull Request resolved: #147231

[ghstack-poisoned]

Fixes #146286 ghstack-source-id: a23e358 Pull Request resolved: #147231

[ghstack-poisoned]

Fixes #146286 ghstack-source-id: c042321 Pull Request resolved: #147231

[ghstack-poisoned]

Fixes #146286 ghstack-source-id: feb4d9b Pull Request resolved: #147231

github-actions · 2025-05-14T02:12:37Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

Update

9754f7f

[ghstack-poisoned]

amjames requested review from eqy and syed-ahmed as code owners February 14, 2025 22:18

amjames added a commit that referenced this pull request Feb 14, 2025

Ensure conj/neg flags are set in destination for CUDA->CPU copies

7155712

Fixes #146286 ghstack-source-id: e4fd7f1 Pull Request resolved: #147231

pytorch-bot bot added the topic: not user facing topic category label Feb 14, 2025

pytorchbot added the open source label Feb 14, 2025

Skylion007 reviewed Feb 16, 2025

View reviewed changes

amjames requested a review from janeyx99 February 17, 2025 14:41

Skylion007 previously approved these changes Feb 17, 2025

View reviewed changes

Skylion007 requested a review from ngimel February 17, 2025 15:30

ngimel requested changes Feb 17, 2025

View reviewed changes

Update

5b28e99

[ghstack-poisoned]

amjames added a commit that referenced this pull request Feb 18, 2025

Ensure conj is resolved prior to async host->device copy

f7663c5

Fixes #146286 ghstack-source-id: 0c44ccf Pull Request resolved: #147231

ngimel reviewed Feb 18, 2025

View reviewed changes

Update

c257b1c

[ghstack-poisoned]

amjames mentioned this pull request Mar 14, 2025

Ensure conj_physical always does a physical conjugation #149226

Open

amjames added a commit that referenced this pull request Mar 14, 2025

Ensure conj is resolved prior to async host->device copy

751d1b3

Fixes #146286 ghstack-source-id: e990d36 Pull Request resolved: #147231

Update

8f80883

[ghstack-poisoned]

amjames added a commit that referenced this pull request Mar 14, 2025

Ensure conj is resolved prior to async host->device copy

19e99db

Fixes #146286 ghstack-source-id: a23e358 Pull Request resolved: #147231

Update

9dd69eb

[ghstack-poisoned]

amjames added a commit that referenced this pull request Mar 14, 2025

Ensure conj is resolved prior to async host->device copy

7e5c754 A935

Fixes #146286 ghstack-source-id: c042321 Pull Request resolved: #147231

Update

5610b04

[ghstack-poisoned]

amjames added a commit that referenced this pull request Mar 14, 2025

Ensure conj is resolved prior to async host->device copy

ce451dc

Fixes #146286 ghstack-source-id: feb4d9b Pull Request resolved: #147231

github-actions bot added the Stale label May 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure conj/neg flags are set in destination for CUDA->CPU copies #147231

Ensure conj/neg flags are set in destination for CUDA->CPU copies #147231

	if (same_dtype && iter.is_contiguous() && same_conj_neg) {
	if (same_dtype && iter.is_contiguous() && (same_conj_neg \|\| !non_blocking) {

Ensure conj/neg flags are set in destination for CUDA->CPU copies #147231

Are you sure you want to change the base?

Ensure conj/neg flags are set in destination for CUDA->CPU copies #147231

Conversation

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147231

❌ 7 New Failures

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment