Incorporate coalesce analysis in codegen #153751

eellison · 2025-05-16T19:13:27Z

Stack from ghstack (oldest at bottom):

This PR uses the coalescing information when generating a tiling. In the previous heuristic, each dependency generated a tiling; we then summed up the score for each generated tiling, preferring any 2D tiling over the default. The new tiling heuristic scores each tiling by its global coalesced memory. This gives both a potentially better tiling (especially for more complicated 3D patterns) and information we can use when generating block sizes.
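As a rough illustration of scoring by coalesced memory, here is a minimal sketch. It is not Inductor's implementation: the `Access` class, `coalesced_bytes`, and `pick_inner_var` are hypothetical names, and the model assumes an access counts as coalesced only when it has unit stride along the variable chosen as the innermost tiled dimension.

```python
from dataclasses import dataclass


@dataclass
class Access:
    """Hypothetical model of one read or write in a kernel."""
    name: str
    strides: dict  # loop variable -> stride in elements
    nbytes: int    # total bytes moved by this access


def coalesced_bytes(accesses, inner_var):
    """Sum the bytes of accesses that are unit-stride along inner_var."""
    return sum(a.nbytes for a in accesses if a.strides.get(inner_var, 0) == 1)


def pick_inner_var(accesses, candidates):
    """Pick the candidate variable that coalesces the most bytes."""
    return max(candidates, key=lambda v: coalesced_bytes(accesses, v))


# Assumed example: a pointwise kernel over a 1024 x 1024 float32 iteration space.
N = 1024 * 1024 * 4
accesses = [
    Access("load a", {"i": 1024, "j": 1}, N),
    Access("load b (transposed)", {"i": 1, "j": 1024}, N),
    Access("store out", {"i": 1024, "j": 1}, N),
]
best = pick_inner_var(accesses, ["i", "j"])
print(best, coalesced_bytes(accesses, best))
```

In this toy model two of the three accesses are unit-stride along `j`, so `j` wins. A per-dependency vote would see the same candidates but would not weigh them by how much memory each one actually coalesces.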

In the Triton heuristics, when generating 3D tiled reductions, we take the same total block size that the 2D reduction would use, then distribute that block size according to whichever block coalesces the most memory.
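The block-size distribution can be pictured with a small sketch. This is an assumption-laden toy, not the code in the Triton heuristics: `split_block` and its `favored` width are hypothetical, and the rule used here (give the better-coalescing dimension up to a warp-sized share, and the rest to the other dimension) is only one plausible way to keep the total unchanged.

```python
def split_block(total, x_score, y_score, favored=32):
    """Split a power-of-two total into (xblock, yblock) with the same product.

    x_score and y_score are the coalesced-memory scores of the two dimensions.
    favored is a hypothetical cap on the share given to the better dimension.
    """
    if total <= 0 or total & (total - 1):
        raise ValueError("total must be a positive power of two")
    if favored <= 0 or favored & (favored - 1):
        raise ValueError("favored must be a positive power of two")
    big = min(total, favored)
    small = total // big
    return (big, small) if x_score >= y_score else (small, big)


# Same total of 64 elements, distributed toward whichever dimension coalesces more.
print(split_block(64, x_score=8, y_score=4))
print(split_block(64, x_score=4, y_score=8))
```

The point of the sketch is the invariant: the product of the two block sizes equals the total the 2D reduction would have used, so only the shape of the block changes.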

The motivating kernel is in #149982, which is a 32-element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward pass per linear layer on a contiguous tensor, and once in the backward pass on a transposed tensor.
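To see why the transposed case is harder, consider which element offsets a 32-element reduction touches. The sketch below is plain Python with an assumed 64 x 64 shape; the helper names are hypothetical and the real kernel is the one in the linked gist.

```python
def row_major_strides(shape):
    """Strides (in elements) of a contiguous row-major tensor."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def group_offsets(strides, row, group=32):
    """Element offsets read when reducing the first `group` entries of a row."""
    return [row * strides[0] + k * strides[1] for k in range(group)]


shape = (64, 64)                       # assumed shape, for illustration only
contiguous = row_major_strides(shape)  # (64, 1)
transposed = contiguous[::-1]          # strides of the transposed view: (1, 64)

print(group_offsets(contiguous, row=0)[:4])  # adjacent elements
print(group_offsets(transposed, row=0)[:4])  # elements 64 apart
```

On the contiguous tensor the reduction reads adjacent elements, while on the transposed view consecutive reduction elements are a full row apart, so a tiling that only vectorizes along the reduction dimension cannot coalesce those loads.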

While the contiguous kernel has coalesced accesses and is performant on main, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. With this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075).

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

[ghstack-poisoned]

pytorch-bot · 2025-05-16T19:13:31Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153751

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

Experiencing "429: Too Many Requests" on downloading actions

❌ 1 New Failure, 2 Unrelated Failures

As of commit bc4cb28 with merge base d91c85b:

NEW FAILURE - The following job has failed:

inductor / cuda12.8-py3.10-gcc9-sm86 / test (inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu) (gh)
doctr_det_predictor

FLAKY - The following job failed but was likely due to flakiness present on trunk:

inductor / linux-jammy-cpu-py3.9-gcc11-inductor / test (dynamic_cpu_inductor_timm, 2, 2, linux.8xlarge.amx) (gh) (detected as infra flaky with no log or failing log classifier)

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / linux-jammy-py3-clang12-executorch / build (gh) (#150261)
Final attempt failed. Child_process exited with error code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

etaf · 2025-05-31T01:18:56Z

test/inductor/test_loop_ordering.py

+        self.assertEqual(out, f(*inps))
+
+    def test_penalized_small_dim(self):
+        x = torch.rand([2000, 1], device="cuda")


Hi, may I suggest replacing the hard-coded "cuda" in this case so that it won't fail on XPU? Thanks.
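One way to act on this suggestion is to pick the device string at runtime instead of hard-coding it. The helper below is a hypothetical sketch, not what the PR ended up doing; in-tree Inductor tests commonly import a `GPU_TYPE` constant from `torch.testing._internal.inductor_utils` for the same purpose. The sketch falls back to "cpu" when PyTorch or an accelerator is unavailable so that it runs anywhere.

```python
def pick_gpu_type():
    """Return "cuda", "xpu", or "cpu" depending on what is available."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.xpu only exists in builds with XPU support, so guard the lookup.
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu"
    return "cpu"


device = pick_gpu_type()
# The test input would then be created as:
#   x = torch.rand([2000, 1], device=device)
print(device)
```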

[ghstack-poisoned]

eellison · 2025-06-03T15:47:26Z

@pytorchbot merge

pytorchmergebot · 2025-06-03T15:49:43Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


pytorchmergebot · 2025-06-03T19:54:37Z

Merge failed

Reason: 1 jobs have failed, first few of them are: inductor / cuda12.8-py3.10-gcc9-sm86 / test (inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team

Raised by workflow job

eellison · 2025-06-04T00:14:55Z

@pytorchbot merge -i

pytorchmergebot · 2025-06-04T00:16:56Z

Merge started

Your change will be merged while ignoring the following 3 checks: pull / linux-jammy-py3-clang12-executorch / build, inductor / linux-jammy-cpu-py3.9-gcc11-inductor / test (dynamic_cpu_inductor_timm, 2, 2, linux.8xlarge.amx), inductor / cuda12.8-py3.10-gcc9-sm86 / test (inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu)


Summary:

This PR uses the coalescing information when generating a tiling. In the previous heuristic, each dependency generated a tiling; we then summed up the score for each generated tiling, preferring any 2D tiling over the default. The new tiling heuristic scores each tiling by its global coalesced memory. This gives both a potentially better tiling (especially for more complicated 3D patterns) and information we can use when generating block sizes.

In the Triton heuristics, when generating 3D tiled reductions, we take the same total block size that the 2D reduction would use, then distribute that block size according to whichever block coalesces the most memory.

The motivating kernel is in pytorch/pytorch#149982, which is a 32-element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward pass per linear layer on a contiguous tensor, and once in the backward pass on a transposed tensor.

While the contiguous kernel has coalesced accesses and is performant on main, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. With this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075).

X-link: pytorch/pytorch#153751
Approved by: https://github.com/jansel
ghstack dependencies: #153723, #153730, #153748
Reviewed By: seemethere
Differential Revision: D75919085
fbshipit-source-id: b2f9cea33b18cc27baf0f4c2d18fc7c3c6bcd492

Update

b36bb8f

[ghstack-poisoned]

This was referenced May 16, 2025

[Tiling rewrite pt1] Normalize reads and writes to common iter space #153723

Closed

Analyze coalesced mem #153730

Closed

Solve for tilings #153748

Closed

pytorch-bot bot added ciflow/inductor module: inductor labels May 16, 2025

eellison added a commit that referenced this pull request May 16, 2025

Incorporate coalesce analysis in codegen

69541db

ghstack-source-id: 56a5c66 Pull Request resolved: #153751

Update

3fbbec9

[ghstack-poisoned]

eellison added a commit that referenced this pull request May 16, 2025

Incorporate coalesce analysis in codegen

564a187

ghstack-source-id: a0985f1 Pull Request resolved: #153751

Update

948321e

[ghstack-poisoned]

eellison added a commit that referenced this pull request May 16, 2025

Incorporate coalesce analysis in codegen

07eced7

ghstack-source-id: b6a6026 Pull Request resolved: #153751

eellison requested review from jansel and blaine-rister and removed request for jansel and blaine-rister May 16, 2025 19:25

Update

4b46873

[ghstack-poisoned]

eellison added a commit that referenced this pull request May 20, 2025

Incorporate coalesce analysis in codegen

a6ad9e5

ghstack-source-id: d83dcf1 Pull Request resolved: #153751

eellison added a commit that referenced this pull request May 20, 2025

Incorporate coalesce analysis in codegen

5df2556

ghstack-source-id: 0beae46 Pull Request resolved: #153751

eellison added a commit that referenced this pull request May 20, 2025

Incorporate coalesce analysis in codegen

a897353

ghstack-source-id: 3f1f852 Pull Request resolved: #153751

eellison mentioned this pull request May 21, 2025

test #154005

Closed

eellison added 2 commits May 20, 2025 19:26

eellison added a commit that referenced this pull request May 21, 2025

Incorporate coalesce analysis in codegen

a0cdb85

ghstack-source-id: 726c377 Pull Request resolved: #153751

eellison added the topic: not user facing topic category label May 21, 2025

eellison added 2 commits May 21, 2025 07:38

eellison added a commit that referenced this pull request May 22, 2025

Incorporate coalesce analysis in codegen

258deef

ghstack-source-id: a208d7c Pull Request resolved: #153751

Update

60178c2

[ghstack-poisoned]

eellison added a commit that referenced this pull request May 30, 2025

Incorporate coalesce analysis in codegen

2232104

ghstack-source-id: 67dac38 Pull Request resolved: #153751

eellison mentioned this pull request May 30, 2025

Turn on new tiling by default #154768

Closed

etaf reviewed May 31, 2025

eellison added 6 commits June 1, 2025 17:35

Update

26172e4

[ghstack-poisoned]

Update

397f2b4

[ghstack-poisoned]

Update

378c8b5

[ghstack-poisoned]

Update

15fcf78

[ghstack-poisoned]

Update

ccefab6

[ghstack-poisoned]

Update

bc4cb28

[ghstack-poisoned]

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 3, 2025

pytorchmergebot added the merging label Jun 3, 2025

pytorchmergebot removed the merging label Jun 3, 2025

pytorchmergebot added the merging label Jun 4, 2025

pytorchmergebot added the Merged label Jun 4, 2025

pytorchmergebot closed this in 40a8770 Jun 4, 2025

pytorchmergebot removed the merging label Jun 4, 2025

github-actions bot deleted the gh/eellison/793/head branch July 4, 2025 02:22

superiwan pushed a commit to superiwan/pytorch that referenced this pull request Jul 14, 2025

Incorporate coalesce analysis in codegen

0453262

ghstack-source-id: 86e5094 Pull Request resolved: pytorch/pytorch#153751
