[symbolic shapes] Log SymNode id for provenance #146532
Conversation
@pytorchbot merge

Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 2 jobs have failed; the first few of them are: inductor / unit-test / cuda12.4-py3.10-gcc9-sm86 / test (inductor, 1, 2, linux.g5.4xlarge.nvidia.gpu), inductor / unit-test / cuda12.4-py3.13-gcc9-sm86 / test (inductor, 1, 2, linux.g5.4xlarge.nvidia.gpu). Details for Dev Infra team: Raised by workflow job.
Using a custom logger so that we can store our own buffer to dedup logs that look the same. The schema for deduping is as follows:

```python
if key == "missing_fake_kernel":
    return hash((key, data["op"]))  # Same ops get deduped
elif key == "mismatched_fake_kernel":
    return hash((key, data["op"], data["reason"]))  # Same op and reason for errors get deduped
elif key == "propagate_real_tensors":
    return hash((key, json.dumps(data["stack"])))  # Guards appearing on the same stacktrace get deduped
elif key == "create_unbacked_symbol":
    return hash((key, json.dumps(data["stack"])))  # Unbacked symbols appearing on the same stacktrace get deduped
```

Notably, guards appearing on the same stacktrace get deduped. This is because in some PT2I models a piece of code that creates a new unbacked symint and then hits a data-dependent error (DDE) gets called 800 times, producing 800 new symints and 800 propagate_real_tensor errors that are all the same expression. This is hard to read through, so we deduplicate it. The downside is that if multiple DDEs occur on the same stacktrace, only the first issue is shown.

Pull Request resolved: #146533
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #146532
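For illustration only, here is a minimal sketch of how such a dedup buffer could be wired up; `DedupLogger`, `_dedup_key`, and the logger name are hypothetical, not the actual PyTorch implementation:

```python
import json
import logging

logger = logging.getLogger("export_warnings")  # hypothetical logger name


class DedupLogger:
    """Sketch of a logger wrapper that drops structured logs it has seen before."""

    def __init__(self):
        self._seen = set()  # hashes of entries already emitted

    def _dedup_key(self, key, data):
        # Same hashing schema as described above.
        if key == "missing_fake_kernel":
            return hash((key, data["op"]))
        elif key == "mismatched_fake_kernel":
            return hash((key, data["op"], data["reason"]))
        elif key in ("propagate_real_tensors", "create_unbacked_symbol"):
            return hash((key, json.dumps(data["stack"])))
        return hash((key, json.dumps(data, sort_keys=True)))

    def log(self, key, data):
        h = self._dedup_key(key, data)
        if h in self._seen:
            return  # an equivalent entry was already logged; skip it
        self._seen.add(h)
        logger.warning("%s: %s", key, json.dumps(data))
```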
Added some additional logging so we can also run tlparse on generic export errors.

Pull Request resolved: #146534
Approved by: https://github.com/pianpwk
ghstack dependencies: #146532, #146533
Added a utility function for capturing the user stack and the framework stacktrace.

Pull Request resolved: #146858
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #146532, #146533, #146534
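As a rough illustration of what such a utility might look like, here is a minimal sketch; splitting "user" from "framework" frames by a path prefix is a hypothetical heuristic, not the code from #146858:

```python
import traceback


def capture_stacks(framework_prefix="torch/"):
    """Capture the current stack and split it into user and framework frames.

    Simplified sketch: frames whose filename contains ``framework_prefix``
    are treated as framework frames, everything else as user code.
    """
    frames = traceback.extract_stack()[:-1]  # drop this helper's own frame
    user_stack = [f for f in frames if framework_prefix not in f.filename]
    framework_stack = [f for f in frames if framework_prefix in f.filename]
    return user_stack, framework_stack


if __name__ == "__main__":
    user, framework = capture_stacks()
    print("user frames:", [f"{f.filename}:{f.lineno}" for f in user])
    print("framework frames:", len(framework))
```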
We can use the SymNode id to point us back to how previous expressions were created, and construct this nice tree in tlparse:

<img width="761" alt="image" src="https://github.com/user-attachments/assets/531b03e8-4398-4d0a-bd11-16078256041c" />

Pull Request resolved: #146532
Approved by: https://github.com/bobrenjc93
Pull Request resolved: #146859
Approved by: https://github.com/pianpwk
ghstack dependencies: #146532, #146533, #146534, #146858
Stack from ghstack (oldest at bottom):
We can use the SymNode id to point us back to how previous expressions were created, and construct this nice tree in tlparse:

<img width="761" alt="image" src="https://github.com/user-attachments/assets/531b03e8-4398-4d0a-bd11-16078256041c" />
cc @ezyang @SherlockNoMad @EikanWang @jgong5 @wenzhe-nrv
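To make the provenance idea concrete, here is a hedged sketch of how a per-node id could be used to rebuild a derivation tree from logged records; `SymNodeLog`, its field names, and `render_tree` are made up for illustration and are not the tlparse schema:

```python
from dataclasses import dataclass, field


@dataclass
class SymNodeLog:
    """One logged SymNode creation: its id, the expression it represents,
    and the ids of the nodes it was derived from (illustrative fields only)."""
    node_id: int
    expr: str
    arg_ids: list = field(default_factory=list)


def render_tree(logs, root_id):
    """Print how ``root_id`` was derived, following arg ids back to earlier nodes."""
    by_id = {log.node_id: log for log in logs}

    def walk(node_id, indent):
        node = by_id[node_id]
        print(" " * indent + f"{node.node_id}: {node.expr}")
        for arg in node.arg_ids:
            if arg in by_id:
                walk(arg, indent + 2)

    walk(root_id, 0)


# Example: u0 and u1 are unbacked symbols; "u0 + u1" points back to both,
# so the rendered tree shows where each piece of the expression came from.
logs = [
    SymNodeLog(1, "u0"),
    SymNodeLog(2, "u1"),
    SymNodeLog(3, "u0 + u1", arg_ids=[1, 2]),
]
render_tree(logs, 3)
```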