Fix opcheck to detect stride mismatches on CPU tensors#177115
Human Note
torch.library.opcheck silently skipped stride validation for CPU tensors because compare_tensor_meta called check_significant_strides with only_cuda=True (the default), causing the stride check to be bypassed when both tensors are on CPU. This change passes only_cuda=False so that stride mismatches between real and fake tensor implementations are caught on all devices. A new test confirms that opcheck now correctly raises on CPU stride mismatches, and all existing tests continue to pass.
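The bypass the note describes can be reproduced in a few lines (a minimal sketch; assumes `check_significant_strides` is importable from `torch._prims_common` with the `only_cuda` keyword described above):

```python
import torch
from torch._prims_common import check_significant_strides

# Two CPU tensors with the same shape but genuinely different strides.
a = torch.empty_strided((4, 4), (1, 4))  # column-major layout
b = torch.empty(4, 4)                    # contiguous, strides (4, 1)

# Default only_cuda=True: neither tensor is on CUDA, so the comparison
# is skipped entirely and the tensors are reported as matching.
print(check_significant_strides(a, b))                   # (True, None)

# only_cuda=False: the mismatch is detected at dim 0.
print(check_significant_strides(a, b, only_cuda=False))  # (False, 0)
```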
Agent Report
Fix: torch.library.opcheck doesn't check strides for CPU Tensors
Issue: pytorch/pytorch#149468
Summary
`torch.library.opcheck` fails to detect stride mismatches between real and fake tensor implementations when tensors are on CPU. Custom ops with incorrect `register_fake` implementations silently pass validation on CPU.

Root Cause
In `torch/_prims_common/__init__.py`, the function `_check_strides_helper` has a guard controlled by `only_cuda=True` (the default). When `only_cuda=True` and both tensors are on CPU, the guard evaluates to `False`, skipping stride comparison entirely. This was a workaround for #77553 (CPU elementwise strides being incorrect).

The call chain through `opcheck` is: `opcheck` → `CrossRefFakeMode` (`_check_fake_real_tensors` in `fake_utils.py`) → `compare_tensor_meta` → `check_significant_strides` → `_check_strides_helper`.

Fix
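A runnable sketch of what the fixed check amounts to (the `compare_strides` helper below is hypothetical, mirroring the described `compare_tensor_meta` logic; the error-message format follows the "Stride mismatch!" text quoted in this report):

```python
import torch
from torch._prims_common import check_significant_strides

def compare_strides(a, b):
    # Sketch of the stride check inside compare_tensor_meta after the fix:
    # only_cuda=False is the one-line change, so CPU tensors are checked too.
    same_strides, idx = check_significant_strides(a, b, only_cuda=False)
    if not same_strides:
        raise RuntimeError(
            f"Stride mismatch! Strides are {a.stride()} and {b.stride()} (mismatched at {idx})!"
        )

real = torch.empty(5, 3).t()  # e.g. output of a transposing real impl, strides (1, 3)
fake = torch.empty(3, 5)      # contiguous fake-impl output, strides (5, 1)
try:
    compare_strides(real, fake)
except RuntimeError as e:
    print(e)  # Stride mismatch! Strides are (1, 3) and (5, 1) (mismatched at 0)!
```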
One-line change in `compare_tensor_meta` (line 188): pass `only_cuda=False` to `check_significant_strides`.

This affects `compare_tensor_meta` callers (opcheck via `CrossRefFakeMode`, and the dynamo `fake_tensor_crossref` debug mode). It does not affect direct callers of `check_significant_strides` (e.g., `test_meta.py`, `test_torchinductor.py`).

Test Results
- `test_opcheck_detects_cpu_stride_mismatch` added to `test/test_custom_ops.py`
- Before the fix, the test fails (`Exception not raised` — opcheck silently passes)
- After the fix, opcheck raises (`Stride mismatch!`)
- All 31 `TestCustomOpTestingCPU` tests pass — no regressions

Repro Script
Before fix output:
After fix output:
Issue #77553 Impact Analysis
Issue #77553 reported that CPU elementwise ops can produce tensors where "meaningless" strides (for dimensions with size == 1) differ between real and meta/fake implementations. The `only_cuda=True` guard in `_check_strides_helper` was the workaround.

This fix does NOT cause regressions for #77553. The protection is provided by a different mechanism: `check_significant_strides` uses `significant_only=True`, which skips stride comparison for any dimension where `shape[dim] == 1`. This is the exact scenario #77553 was about. The `only_cuda` guard was an overly broad workaround that is no longer needed for `compare_tensor_meta`.

Callers that go through `check_significant_strides` directly (e.g., `test_meta.py`, `test_torchinductor.py`) are completely unaffected — they still use the default `only_cuda=True`.
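A minimal check of this dim-1 protection (a sketch, assuming `check_significant_strides` is importable from `torch._prims_common`):

```python
import torch
from torch._prims_common import check_significant_strides

a = torch.empty(1, 4)                    # strides (4, 1)
b = torch.empty_strided((1, 4), (1, 1))  # strides (1, 1)

# The only mismatched dimension has size 1, so even with only_cuda=False
# the tensors are considered to have matching significant strides.
same, idx = check_significant_strides(a, b, only_cuda=False)
print(same, idx)  # True None
```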
Verification: with `only_cuda=False`, 85 `FakeTensorTest` tests, 16 `FakeTensorOperatorInvariants` tests, and 31 `TestCustomOpTestingCPU` tests all pass.

Fix for `onnx.RotaryEmbedding.opset23` fake impl

The stride-checking fix correctly exposed a pre-existing bug in `_rotary_embedding_23_fake_impl` (`torch/onnx/ops/_impl.py`). The fake impl returned `x.clone()` (contiguous), but the real impl permutes 4D input to `(B, S, NH, HS)`, operates (producing contiguous output in that layout), then permutes back to `(B, NH, S, HS)` — resulting in non-contiguous strides `(96, 8, 24, 1)` instead of the contiguous `(96, 32, 8, 1)`.

Fix: for 4D inputs, the fake impl now creates a contiguous tensor in `(B, S, NH, HS)` layout then permutes to `(B, NH, S, HS)`, matching the real impl's stride pattern. For 3D inputs, `x.clone()` remains correct since the real impl ends with `reshape` (contiguous).

Remaining Risks
`crossref == "all"`: If used on all aten ops (not just custom ops), the dynamo `CrossRefFakeMode` could theoretically see new failures if any aten ops produce stride mismatches on significant (non-dim-1) dimensions for CPU tensors. This is extremely unlikely and would indicate a genuine bug in the aten op's meta implementation. The default for opcheck is `crossref == "custom_ops"`, which skips aten builtins.

Fixes #149468
Agent Worklog
Run 2
Step 1: Reproduced the bug
Ran the minimized repro script on PyTorch 2.12.0a0+git4bc9d7f. Confirmed that `opcheck` reports SUCCESS for `test_faketensor` even though the real op returns strides `(1, 12, 3)` and the fake impl returns contiguous strides `(16, 4, 1)`. This is a clear stride mismatch that goes undetected on CPU.

Step 2: Traced the call chain
Step 3: Identified the root cause
The `only_cuda=True` default in `_check_strides_helper` (`torch/_prims_common/__init__.py:214`) causes stride comparison to be completely skipped for CPU tensors. This was introduced as a workaround for issue #77553 (CPU elementwise strides being incorrect). The workaround is too broad — it disables stride checking for ALL uses, including opcheck, where custom ops should be validated.

Step 4: Assessed fix options
Pass `only_cuda=False` in `compare_tensor_meta` (line 187). One-line change. Affects opcheck and `CrossRefFakeMode`. Does NOT affect `test_meta.py` or `test_torchinductor.py`, which call `check_significant_strides` directly.

Step 5: Verified logic with standalone test
Created a standalone test that simulates `_check_strides_helper` logic, confirming:

- `only_cuda=True` (current): CPU stride check returns `True` (match) regardless of actual strides
- `only_cuda=False` (proposed fix): CPU stride check correctly detects the mismatch

Run 3
Step 1: Wrote regression test
Added `test_opcheck_detects_cpu_stride_mismatch` to `TestCustomOpTesting` in `test/test_custom_ops.py`. The test creates a custom op where the real impl returns a permuted tensor (strides `(1, 12, 3)`) but the fake impl returns a contiguous tensor (strides `(16, 4, 1)`), then asserts opcheck raises "Stride mismatch".

Step 2: Verified test fails before fix
Without the fix, the test fails with `AssertionError: Exception not raised` — confirming the bug.

Step 3: Applied the one-line fix
Changed `torch/_prims_common/__init__.py` line 188: passed `only_cuda=False` to `check_significant_strides` in `compare_tensor_meta`.

Step 4: Verified test passes after fix
The test passes: opcheck now correctly raises `OpCheckError` with "Stride mismatch! Strides are (1, 12, 3) and (16, 4, 1) (mismatched at 0)!"

Step 5: Ran regression tests
All 31 `TestCustomOpTestingCPU` tests pass with no regressions.

Step 6: Generated artifacts
- `fix.diff` — two-file patch (1 line in `__init__.py`, 18-line test in `test_custom_ops.py`)
- `report.md` — full report with repro, root cause, fix, and test results

Run 4
Step 1: Analyzed impact surface
Mapped all callers of the affected function `compare_tensor_meta`:

- Affected (now `only_cuda=False`): `_check_fake_real_tensors` in `fake_utils.py` (used by `CrossRefFakeMode`, which powers opcheck), and `test_ops.py`/`test_fake_tensor.py` callers.
- Unaffected (call `check_significant_strides` directly with default `only_cuda=True`): `test_meta.py`, `test_torchinductor.py`.

Key insight: `test_ops.py` callers do NOT pass `check_strides=True` (it defaults to `False`), so they are unaffected. The `test_fake_tensor.py` callers that DO pass `check_strides=True` compare real tensors with their fake counterparts via `from_tensor`, not via op execution, so stride differences from op execution are not a concern.

Step 2: Verified #77553 protection is still active
Issue #77553 was about CPU ops producing different "meaningless" strides on dimensions of size 1 between real and fake execution. The fix uses `check_significant_strides`, which internally calls `_check_strides_helper(significant_only=True)`. This skips stride comparison for any dimension where `shape[dim] == 1`, which is exactly the #77553 scenario. Verified with a unit test: `torch.empty(1, 4)` with strides `(4, 1)` vs `(1, 1)` → `check_significant_strides(only_cuda=False)` returns `True` (match), because the mismatched dim has size 1.

Step 3: Ran targeted unit tests
Created `test_77553_impact.py` with 6 focused tests:

- `test_significant_strides_ignores_dim1` — dim-1 stride diffs correctly ignored ✅
- `test_significant_strides_catches_real_mismatch` — real stride diffs correctly caught ✅
- `test_compare_tensor_meta_dim1_strides` — `compare_tensor_meta` OK with dim-1 diffs ✅
- `test_compare_tensor_meta_real_mismatch` — `compare_tensor_meta` catches real mismatches ✅
- `test_77553_style_elementwise_ops` — `unsqueeze`, `view`, `reshape` via `FakeTensorMode` ✅
- `test_cross_ref_fake_mode_77553` — basic ops via `CrossRefFakeMode` ✅

Step 4: Ran exhaustive CrossRefFakeMode tests
Created `test_77553_exhaustive.py` with 7 test groups exercising `CrossRefFakeMode` with stride checking on CPU.

Step 5: Ran existing PyTorch test suites
- `FakeTensorTest` (85 tests): all pass ✅
- `FakeTensorOperatorInvariants` (16 tests): all pass ✅
- `TestCustomOpTestingCPU` (31 tests): all pass ✅
- `test_batch_tensor`, `test_deepcopy`, `test_fake_grad_copy`, `test_randperm`, `test_full`, `test_fake_tensor_batch_norm_cpu`: all pass ✅

Conclusion
The fix does NOT negatively affect #77553. The `significant_only=True` parameter in `check_significant_strides` already provides the correct protection by skipping stride comparison on dimensions with size 1, which is exactly what #77553 was about. The `only_cuda` guard was an overly broad workaround that is no longer needed for the `compare_tensor_meta` path.

Run 5
Step 1: Reproduced the failure
Ran opcheck on `rotary_embedding_23` and confirmed: `OpCheckError: Stride mismatch! Strides are (96, 8, 24, 1) and (96, 32, 8, 1) (mismatched at 1)!`

Step 2: Root cause
The fake impl `_rotary_embedding_23_fake_impl` in `torch/onnx/ops/_impl.py` just does `x.clone()` (contiguous). The real impl permutes 4D input to (B, S, NH, HS), operates (producing contiguous output in that layout), then permutes back to (B, NH, S, HS) — resulting in non-contiguous strides. Our stride-checking fix correctly exposed this pre-existing bug in the fake impl.

Step 3: Fixed the fake impl
For 4D inputs, the fake impl now creates a contiguous tensor in (B, S, NH, HS) layout then permutes to (B, NH, S, HS), matching the real impl's stride pattern. For 3D inputs, `x.clone()` remains correct since the real impl ends with `reshape` (contiguous).

Step 4: Verified
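A quick stride check consistent with the fix (shapes are illustrative assumptions chosen to reproduce the strides reported above):

```python
import torch

B, NH, S, HS = 1, 3, 4, 8  # illustrative shapes matching the reported strides

# Old fake impl behavior: x.clone() -> contiguous in (B, NH, S, HS)
old_fake = torch.empty(B, NH, S, HS)

# Corrected fake impl behavior: allocate contiguous in (B, S, NH, HS),
# then permute to (B, NH, S, HS), mirroring the real impl's permute-back.
new_fake = torch.empty(B, S, NH, HS).permute(0, 2, 1, 3)

print(old_fake.stride())  # (96, 32, 8, 1) -- contiguous, the old mismatch
print(new_fake.stride())  # (96, 8, 24, 1) -- matches the real impl's layout
```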
This PR was generated by ptq with human review.