E539 Fix opcheck to detect stride mismatches on CPU tensors by aorenste · Pull Request #177115 · pytorch/pytorch · GitHub
Fix opcheck to detect stride mismatches on CPU tensors#177115

Draft
aorenste wants to merge 2 commits into main from ptq/149468

Conversation

@aorenste
Contributor
@aorenste aorenste commented Mar 11, 2026

Human Note

torch.library.opcheck silently skipped stride validation for CPU tensors because compare_tensor_meta called check_significant_strides with only_cuda=True (the default), causing the stride check to be bypassed when both tensors are on CPU. This change passes only_cuda=False so that stride mismatches between real and fake tensor implementations are caught on all devices. A new test confirms that opcheck now correctly raises on CPU stride mismatches, and all existing tests continue to pass.

Agent Report

Fix: torch.library.opcheck doesn't check strides for CPU Tensors

Issue: pytorch/pytorch#149468

Summary

torch.library.opcheck fails to detect stride mismatches between real and fake tensor implementations when tensors are on CPU. Custom ops with incorrect register_fake implementations silently pass validation on CPU.

Root Cause

In torch/_prims_common/__init__.py, the function _check_strides_helper has a guard controlled by only_cuda=True (the default):

if (not only_cuda or a.device.type == "cuda" or b.device.type == "cuda") and a.numel() > 0:

When only_cuda=True and both tensors are on CPU, this evaluates to False, skipping stride comparison entirely. This was a workaround for #77553 (CPU elementwise strides being incorrect).
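The effect of the guard can be sketched in isolation. This is a simplified mirror of the predicate above, not the actual helper:

```python
def stride_check_runs(only_cuda: bool, a_dev: str, b_dev: str, numel: int) -> bool:
    # Simplified mirror of the guard in _check_strides_helper: when
    # only_cuda=True and neither tensor is on CUDA, the stride-comparison
    # body never executes.
    return (not only_cuda or a_dev == "cuda" or b_dev == "cuda") and numel > 0

assert not stride_check_runs(True, "cpu", "cpu", 16)   # CPU/CPU: check skipped
assert stride_check_runs(False, "cpu", "cpu", 16)      # with the fix: checked
assert stride_check_runs(True, "cuda", "cpu", 16)      # any CUDA tensor: checked
```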

The call chain through opcheck is:

opcheck → CrossRefFakeMode.__torch_dispatch__
  → _check_fake_real_tensors → compare_tensor_meta
    → check_significant_strides(a, b)  # only_cuda defaults to True
      → _check_strides_helper(only_cuda=True) → skips CPU tensors

Fix

One-line change in compare_tensor_meta (line 188): pass only_cuda=False to check_significant_strides.

This affects compare_tensor_meta callers (opcheck via CrossRefFakeMode, and the dynamo fake_tensor_crossref debug mode). It does not affect direct callers of check_significant_strides (e.g., test_meta.py, test_torchinductor.py).
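Sketched as a diff (the surrounding code and the variable names on the left-hand side are assumptions about the call site; only the added `only_cuda=False` keyword is from this PR):

```diff
-    same_strides, idx = check_significant_strides(a, b)
+    same_strides, idx = check_significant_strides(a, b, only_cuda=False)
```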

Test Results

  • New test test_opcheck_detects_cpu_stride_mismatch added to test/test_custom_ops.py
  • Before fix: the test fails (no exception raised — opcheck silently passes)
  • After fix: the test passes (opcheck correctly raises "Stride mismatch!")
  • All 31 existing TestCustomOpTestingCPU tests pass — no regressions
Repro Script
import torch

@torch.library.custom_op("mylib2::mismatched_strides", mutates_args=())
def mismatched_strides(x: torch.Tensor) -> torch.Tensor:
    return x.clone().permute(2, 0, 1)

@mismatched_strides.register_fake
def _(x):
    c, h, w = x.shape[2], x.shape[0], x.shape[1]
    return x.new_empty(c, h, w)

x = torch.randn(4, 4, 3)  # CPU tensor
result = torch.library.opcheck(mismatched_strides, (x,))
print(result)

Before fix output:

{'test_schema': 'SUCCESS', 'test_autograd_registration': 'SUCCESS', 'test_faketensor': 'SUCCESS', 'test_aot_dispatch_dynamic': 'SUCCESS'}

After fix output:

OpCheckError: opcheck(op, ...): test_faketensor failed with ...
Stride mismatch! Strides are (1, 12, 3) and (16, 4, 1) (mismatched at 0)!
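The mismatched strides in that error can be reproduced with plain stride arithmetic. This is a standalone sketch; `contiguous_strides` is a helper defined here, not a PyTorch API:

```python
def contiguous_strides(shape):
    # Row-major (contiguous) strides for a tensor of the given shape.
    strides, acc = [], 1
    for size in reversed(shape):
        strides.append(acc)
        acc *= size
    return tuple(reversed(strides))

# x.clone() on shape (4, 4, 3) is contiguous: strides (12, 3, 1).
# permute(2, 0, 1) reorders those strides, giving (1, 12, 3) for shape (3, 4, 4).
base = contiguous_strides((4, 4, 3))
real = tuple(base[p] for p in (2, 0, 1))
# The fake impl's new_empty(3, 4, 4) is contiguous: strides (16, 4, 1).
fake = contiguous_strides((3, 4, 4))
assert real == (1, 12, 3)
assert fake == (16, 4, 1)   # mismatch already at dim 0
```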

Issue #77553 Impact Analysis

Issue #77553 reported that CPU elementwise ops can produce tensors where "meaningless" strides (for dimensions with size == 1) differ between real and meta/fake implementations. The only_cuda=True guard in _check_strides_helper was the workaround.

This fix does NOT cause regressions for #77553. The protection is provided by a different mechanism: check_significant_strides uses significant_only=True, which skips stride comparison for any dimension where shape[dim] == 1. This is the exact scenario #77553 was about. The only_cuda guard was an overly broad workaround that is no longer needed for compare_tensor_meta.
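The protection mechanism can be illustrated with a simplified sketch of the `significant_only=True` logic (not the actual PyTorch helper):

```python
def same_significant_strides(shape, strides_a, strides_b):
    # Simplified sketch of significant_only=True: strides on dims of
    # size <= 1 are meaningless and are skipped (the #77553 scenario).
    for dim, size in enumerate(shape):
        if size > 1 and strides_a[dim] != strides_b[dim]:
            return False, dim
    return True, None

# dim 0 has size 1, so its differing stride is ignored (no false positive):
assert same_significant_strides((1, 4), (4, 1), (1, 1)) == (True, None)
# a mismatch on a significant dim is still caught:
assert same_significant_strides((3, 4, 4), (1, 12, 3), (16, 4, 1)) == (False, 0)
```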

Callers that go through check_significant_strides directly (e.g., test_meta.py, test_torchinductor.py) are completely unaffected — they still use the default only_cuda=True.

Verification:

  • 6 targeted unit tests confirming dim-1 stride diffs are correctly ignored with only_cuda=False
  • 7 exhaustive test groups running CrossRefFakeMode on CPU with elementwise, reduction, shape, matmul, indexing, NN module, and dim-1-heavy ops — all pass
  • 85 FakeTensorTest tests, 16 FakeTensorOperatorInvariants tests, 31 TestCustomOpTestingCPU tests — all pass

Fix for onnx.RotaryEmbedding.opset23 fake impl

The stride-checking fix correctly exposed a pre-existing bug in _rotary_embedding_23_fake_impl (torch/onnx/ops/_impl.py). The fake impl returned x.clone() (contiguous), but the real impl permutes 4D input to (B, S, NH, HS), operates (producing contiguous output in that layout), then permutes back to (B, NH, S, HS) — resulting in non-contiguous strides (96, 8, 24, 1) instead of the contiguous (96, 32, 8, 1).

Fix: for 4D inputs, the fake impl now creates a contiguous tensor in (B, S, NH, HS) layout then permutes to (B, NH, S, HS), matching the real impl's stride pattern. For 3D inputs, x.clone() remains correct since the real impl ends with reshape (contiguous).
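The layout mismatch can be checked with stride arithmetic. The sizes below (B=2, NH=3, S=4, HS=8) are hypothetical, chosen to match the strides reported above, and `contiguous_strides` is a local helper, not a PyTorch API:

```python
def contiguous_strides(shape):
    # Row-major (contiguous) strides for a tensor of the given shape.
    strides, acc = [], 1
    for size in reversed(shape):
        strides.append(acc)
        acc *= size
    return tuple(reversed(strides))

# Fake impl before the fix: x.clone() on (B, NH, S, HS) is contiguous:
assert contiguous_strides((2, 3, 4, 8)) == (96, 32, 8, 1)
# Real impl: contiguous in (B, S, NH, HS), then permuted back with
# dims (0, 2, 1, 3), which reorders the strides:
base = contiguous_strides((2, 4, 3, 8))        # (96, 24, 8, 1)
real = tuple(base[p] for p in (0, 2, 1, 3))
assert real == (96, 8, 24, 1)                  # non-contiguous, as reported
```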

Remaining Risks

  • BC-breaking: Downstream custom ops with incorrect fake implementations that previously passed opcheck on CPU will now fail. This is the intended behavior per the issue discussion.
  • CrossRefFakeMode with crossref == "all": If used on all aten ops (not just custom ops), the dynamo CrossRefFakeMode could theoretically see new failures if any aten ops produce stride mismatches on significant (non-dim-1) dimensions for CPU tensors. This is extremely unlikely and would indicate a genuine bug in the aten op's meta implementation. The default for opcheck is crossref == "custom_ops", which skips aten builtins.

Fixes #149468

Repro Script
import torch

@torch.library.custom_op("mylib2::mismatched_strides", mutates_args=())
def mismatched_strides(x: torch.Tensor) -> torch.Tensor:
    return x.clone().permute(2, 0, 1)

@mismatched_strides.register_fake
def _(x):
    c, h, w = x.shape[2], x.shape[0], x.shape[1]
    return x.new_empty(c, h, w)

x = torch.randn(4, 4, 3)  # CPU tensor
result = torch.library.opcheck(mismatched_strides, (x,))
print("CPU opcheck result:", result)
# BUG: all tests pass on CPU despite stride mismatch.
# The fake impl returns contiguous strides (16, 4, 1) for shape (3, 4, 4),
# but the real impl's permute(2, 0, 1) yields strides (1, 12, 3).
Agent Worklog

Run 2

Step 1: Reproduced the bug

Ran the minimized repro script on PyTorch 2.12.0a0+git4bc9d7f. Confirmed that opcheck reports SUCCESS for test_faketensor even though the real op returns strides (1, 12, 3) and the fake impl returns contiguous strides (16, 4, 1). This is a clear stride mismatch that goes undetected on CPU.

Step 2: Traced the call chain

opcheck → fake_check → CrossRefFakeMode.__torch_dispatch__
  → _check_fake_real_tensors → compare_tensor_meta
    → check_significant_strides(a, b, only_cuda=True)  ← DEFAULT
      → _check_strides_helper(only_cuda=True)
        → guard: (not True or "cpu"=="cuda" or "cpu"=="cuda") = False
        → stride check SKIPPED for CPU tensors

Step 3: Identified the root cause

The only_cuda=True default in _check_strides_helper (torch/_prims_common/__init__.py:214) causes stride comparison to be completely skipped for CPU tensors. This was introduced as a workaround for issue #77553 (CPU elementwise strides being incorrect). The workaround is too broad — it disables stride checking for ALL uses, including opcheck, where custom ops should be validated.

Step 4: Assessed fix options

  • Option B (recommended): Pass only_cuda=False in compare_tensor_meta line 187. One-line change. Affects opcheck and CrossRefFakeMode. Does NOT affect test_meta.py or test_torchinductor.py which call check_significant_strides directly.
  • Other options involve more plumbing or broader defaults changes.

Step 5: Verified logic with standalone test

Created a standalone test that simulates _check_strides_helper logic, confirming:

  • With only_cuda=True (current): CPU stride check returns True (match) regardless of actual strides
  • With only_cuda=False (proposed fix): CPU stride check correctly detects mismatch

Run 3

User: I agree with the reproduction and the proposed solution. Implement and verify it.

Your job:

  • write the smallest fix that explains the reproduced failure
  • add or update a focused regression test
  • demonstrate that the test would fail before the fix
  • demonstrate that the same test passes with the fix
  • keep scope tight to the approved diagnosis and plan

Rules:

  • no speculative cleanup or unrelated refactors
  • if the test does not fail before the fix, stop and report that the evidence is insufficient
  • if the fix requires a broader change than planned, stop and ask before expanding scope
  • prefer the narrowest test that captures the reported failure mode

Output:

  • patch summary
  • regression test added or updated
  • evidence that it fails before the fix
  • evidence that it passes after the fix
  • any remaining risks or gaps

Step 1: Wrote regression test

Added test_opcheck_detects_cpu_stride_mismatch to TestCustomOpTesting in test/test_custom_ops.py. The test creates a custom op where the real impl returns a permuted tensor (strides (1, 12, 3)) but the fake impl returns a contiguous tensor (strides (16, 4, 1)), then asserts opcheck raises "Stride mismatch".

Step 2: Verified test fails before fix

Without the fix, the test fails with AssertionError: Exception not raised — confirming the bug.

Step 3: Applied the one-line fix

Changed torch/_prims_common/__init__.py line 188: passed only_cuda=False to check_significant_strides in compare_tensor_meta.

Step 4: Verified test passes after fix

The test passes: opcheck now correctly raises OpCheckError with "Stride mismatch! Strides are (1, 12, 3) and (16, 4, 1) (mismatched at 0)!"

Step 5: Ran regression tests

All 31 TestCustomOpTestingCPU tests pass with no regressions.

Step 6: Generated artifacts

  • fix.diff — two-file patch (1 line in __init__.py, 18 lines test in test_custom_ops.py)
  • report.md — full report with repro, root cause, fix, and test results

Run 4

User: Please verify that the issue described in #77553 isn't negatively affected by this fix.

Step 1: Analyzed impact surface

Mapped all callers of the affected function compare_tensor_meta:

  • Affected (now uses only_cuda=False): _check_fake_real_tensors in fake_utils.py (used by CrossRefFakeMode, which powers opcheck), and test_ops.py/test_fake_tensor.py callers.
  • Not affected (use check_significant_strides directly with default only_cuda=True): test_meta.py, test_torchinductor.py.

Key insight: test_ops.py callers do NOT pass check_strides=True (it defaults to False), so they are unaffected. The test_fake_tensor.py callers that DO pass check_strides=True compare real tensors with their fake counterparts via from_tensor, not via op execution, so stride differences from op execution are not a concern.

Step 2: Verified #77553 protection is still active

Issue #77553 was about CPU ops producing different "meaningless" strides on dimensions of size 1 between real and fake execution. The fix uses check_significant_strides, which internally calls _check_strides_helper(significant_only=True). This skips stride comparison for any dimension where shape[dim] == 1, which is exactly the #77553 scenario. Verified with unit test:

  • torch.empty(1, 4) with strides (4, 1) vs (1, 1): check_significant_strides(only_cuda=False) returns True (match), because the mismatched dim has size 1.
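That unit check can be mirrored with a few lines of standalone stride logic (a simplified sketch of the significant_only=True behavior, not the PyTorch helper itself):

```python
def significant_strides_match(shape, strides_a, strides_b):
    # Dims of size <= 1 carry meaningless strides and are not compared.
    return all(
        sa == sb
        for size, sa, sb in zip(shape, strides_a, strides_b)
        if size > 1
    )

# Strides (4, 1) vs (1, 1) differ only on the size-1 dim: treated as a match.
assert significant_strides_match((1, 4), (4, 1), (1, 1))
# The same stride difference on a size-2 dim is a real mismatch.
assert not significant_strides_match((2, 4), (4, 1), (1, 1))
```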

Step 3: Ran targeted unit tests

Created test_77553_impact.py with 6 focused tests:

  1. test_significant_strides_ignores_dim1 — dim-1 stride diffs correctly ignored ✅
  2. test_significant_strides_catches_real_mismatch — real stride diffs correctly caught ✅
  3. test_compare_tensor_meta_dim1_strides — compare_tensor_meta OK with dim-1 diffs ✅
  4. test_compare_tensor_meta_real_mismatch — compare_tensor_meta catches real mismatches ✅
  5. test_77553_style_elementwise_ops — unsqueeze, view, reshape via FakeTensorMode ✅
  6. test_cross_ref_fake_mode_77553 — basic ops via CrossRefFakeMode

Step 4: Ran exhaustive CrossRefFakeMode tests

Created test_77553_exhaustive.py with 7 test groups exercising CrossRefFakeMode with stride checking on CPU:

  • Elementwise ops (add, mul, relu, sigmoid, tanh, etc.) ✅
  • Reduction ops (sum, mean, max, min, with keepdim=True) ✅
  • Shape ops (unsqueeze, view, reshape, permute, t, expand) ✅
  • Matmul ops (mm, bmm, @) ✅
  • Indexing ops (slice, cat, stack) ✅
  • NN modules (Linear, Conv2d, BatchNorm2d, LayerNorm, Dropout) ✅
  • Dim-1 heavy ops (all shapes with size-1 dimensions) ✅

Step 5: Ran existing PyTorch test suites

  • FakeTensorTest (85 tests): all pass ✅
  • FakeTensorOperatorInvariants (16 tests): all pass ✅
  • TestCustomOpTestingCPU (31 tests): all pass ✅
  • Specific stride-checking tests (test_batch_tensor, test_deepcopy, test_fake_grad_copy, test_randperm, test_full, test_fake_tensor_batch_norm_cpu): all pass ✅

Conclusion

The fix does NOT negatively affect #77553. The significant_only=True parameter in check_significant_strides already provides the correct protection by skipping stride comparison on dimensions with size 1, which is exactly what #77553 was about. The only_cuda guard was an overly broad workaround that is no longer needed for the compare_tensor_meta path.

Run 5

User: test/onnx/ops/test_ops.py::NativeOnnxOpsTest::test_rotary_embedding_opcheck is failing

Step 1: Reproduced the failure

Ran the opcheck on rotary_embedding_23 and confirmed: OpCheckError: Stride mismatch! Strides are (96, 8, 24, 1) and (96, 32, 8, 1) (mismatched at 1)!

Step 2: Root cause

The fake impl _rotary_embedding_23_fake_impl in torch/onnx/ops/_impl.py just does x.clone() (contiguous). The real impl permutes 4D input to (B, S, NH, HS), operates (producing contiguous output in that layout), then permutes back to (B, NH, S, HS) — resulting in non-contiguous strides. Our stride-checking fix correctly exposed this pre-existing bug in the fake impl.

Step 3: Fixed the fake impl

For 4D inputs, the fake impl now creates a contiguous tensor in (B, S, NH, HS) layout then permutes to (B, NH, S, HS), matching the real impl's stride pattern. For 3D inputs, x.clone() remains correct since the real impl ends with reshape (contiguous).

Step 4: Verified

  • 4D opcheck passes with correct strides ✅
  • 3D opcheck passes ✅
  • Regenerated fix.diff

This PR was generated by ptq with human review.

@pytorch-bot
pytorch-bot bot commented Mar 11, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177115

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Pending, 1 Unrelated Failure

As of commit c4b2b83 with merge base 4bc9d7f:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot
pytorch-bot bot commented Mar 11, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@aorenste aorenste changed the title from "Fix opcheck to detect stride mismatches on CPU tensors" to "Fix #149468" Mar 11, 2026
@pytorch-bot pytorch-bot bot added the release notes: onnx torch.onnx related changes that should show up in the release notes label Mar 11, 2026
@aorenste aorenste changed the title from "Fix #149468" to "Fix opcheck to detect stride mismatches on CPU tensors" Mar 11, 2026

Labels

release notes: onnx torch.onnx related changes that should show up in the release notes

Development

Successfully merging this pull request may close these issues.

torch.library.opcheck doesn't check strides for CPU Tensors
