Work around MPSGraph issue in backward pass of nn.ReplicationPad1d/2d #152094
Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/152094
Fixes pytorch#135447. When the third-from-last dimension is 2^16 or greater, MPSGraph returns 0 for the pad gradient. To work around this, we break the problematic dimension into chunks, each no greater than 2^16 - 1.

Test case for nn.ReplicationPad1d:

```
shape = [65739, 2, 4]
x_cpu = torch.randn(shape, device='cpu', requires_grad=True)
x_mps = x_cpu.clone().detach().to('mps').requires_grad_(True)
model = torch.nn.ReplicationPad1d((1, 1))
out_cpu = model(x_cpu)
out_mps = model(x_mps)

# backward
g_cpu = torch.randn_like(out_cpu)
g_mps = g_cpu.clone().detach().to('mps').requires_grad_(False)
out_cpu.backward(g_cpu)
out_mps.backward(g_mps)

print(f"{((x_cpu.grad - x_mps.grad.cpu()).abs() > 1e-5).sum() = }")
# Expected Output:
# ((x_cpu.grad - x_mps.grad.cpu()).abs() > 1e-5).sum() = tensor(0)
```

Test case for nn.ReplicationPad2d:

```
shape = [2, 65739, 2, 4]
x_cpu = torch.randn(shape, device='cpu', requires_grad=True)
x_mps = x_cpu.clone().detach().to('mps').requires_grad_(True)
model = torch.nn.ReplicationPad2d((1, 1, 1, 1))
out_cpu = model(x_cpu)
out_mps = model(x_mps)

# backward
g_cpu = torch.randn_like(out_cpu)
g_mps = g_cpu.clone().detach().to('mps').requires_grad_(False)
out_cpu.backward(g_cpu)
out_mps.backward(g_mps)

print(f"{((x_cpu.grad - x_mps.grad.cpu()).abs() > 1e-5).sum() = }")
# Expected Output:
# ((x_cpu.grad - x_mps.grad.cpu()).abs() > 1e-5).sum() = tensor(0)
```

These tests produce the expected output with this workaround.
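The chunking step of the workaround can be sketched in plain Python. This is a hedged illustration only, not the PR's actual Objective-C++ implementation; the helper name `sub_batch_sizes` is hypothetical.

```python
# Hedged sketch of the workaround described above. The problematic
# dimension is partitioned into sub-batches, each no larger than
# 2**16 - 1, the largest size for which MPSGraph is observed to
# compute the pad gradient correctly.
MAX_SUB_BATCH_SIZE = 2**16 - 1  # 65535

def sub_batch_sizes(dim_size: int) -> list[int]:
    """Partition dim_size into chunk sizes, each <= MAX_SUB_BATCH_SIZE."""
    sizes = []
    remaining = dim_size
    while remaining > 0:
        chunk = min(remaining, MAX_SUB_BATCH_SIZE)
        sizes.append(chunk)
        remaining -= chunk
    return sizes

# The failing shape from the 1d test case has 65739 in the problematic
# dimension; it splits into one full chunk plus a remainder.
print(sub_batch_sizes(65739))  # [65535, 204]
```

The pad-gradient op would then run per chunk and the results would be concatenated along the same dimension, which leaves the math unchanged because replication-pad gradients along other dimensions are independent across these sub-batches.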
Please fix lint; it also looks like it fails on exactly the test you are trying to add.
```
// We break the tensor into chunks where the problematic dimension is no greater than 2**16 - 1.
// This is reported in https://github.com/pytorch/pytorch/issues/135447.
// Internal radar for MPSGraph: rdar://149853787.
const int64_t max_sub_batch_size = 65535;
```
Suggested change:

```
- const int64_t max_sub_batch_size = 65535;
+ constexpr auto max_sub_batch_size = 65535;
```
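The suggestion above swaps a runtime `const` for a compile-time constant. A minimal sketch of why this is preferable, assuming the value is only ever used as a fixed limit (the `static_assert` is illustrative, not part of the PR):

```cpp
#include <cstdint>

// constexpr makes the limit usable in constant expressions, so the
// relationship to 2^16 - 1 can be checked at compile time; `auto`
// lets the compiler deduce the (int) type from the literal.
constexpr auto max_sub_batch_size = 65535;

// Compile-time check: the sub-batch limit equals 2^16 - 1.
static_assert(max_sub_batch_size == (1 << 16) - 1,
              "sub-batch limit must be 2^16 - 1");
```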
Thanks @malfet for the comments. I will follow up on these issues.