8000 Inductor Tiling Rewrite by eellison · Pull Request #151958 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

Inductor Tiling Rewrite #151958

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: gh/eellison/785/base
Choose a base branch
from
Open

Conversation

eellison
Copy link
Contributor
@eellison eellison commented Apr 22, 2025

Stack from ghstack (oldest at bottom):

Fix for #149982.

Summary:

This PR does two main thing 8000 s:

  1. Rewrites the tiling heuristics. The previous tiling heuristic would have each dependency generate a tiling. Then, we sum up the score for each generated tiling, preferring any 2d tiling over the default. The new tiling heuristics scores each tiling by its global coalesced memory. This gives both a potentially better tiling (especially for more complicated, 3d patterns) as well as information we can use in generating block sizes.

  2. Analyses memory dependencies for accesses that would be coalesced with additional tiling. The motivating kernel is in request for faster inductor kernels for blockwise reduction across dim1 -> write #149982 which is a 32 element reduction. A smaller version of it is here. We need to run this kernel once in the forward per linear layer on a contiguous tensor, and once in the backward on a transposed tensor.

While the contiguous kernel has coalesced accesses, and is performant on master, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See, this full log from the above repro. Now, with this PR, it is only ~1.15x slower. See the updated log.

We analyse memory addresses that are not coalesced by any iteration variable. For this following dependency:

(((32*n0 + n1)//2048)) + 4096*(ModularIndexing(32*n0 + n1, 1, 2048)) we infer that tiling n0 by 64 makes the first term coalesced.

I'm sure there are still some CI failures to debug..

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @vkuzo

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

[ghstack-poisoned]
Copy link
pytorch-bot bot commented Apr 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151958

Note: Links to docs will display an error until the docs builds have been completed.

❌ 14 New Failures, 3 Cancelled Jobs, 3 Unrelated Failures

As of commit 2747b69 with merge base 5d316ce (image):

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

eellison added a commit that referenced this pull request Apr 22, 2025
ghstack-source-id: 40976f5
Pull Request resolved: #151958
[ghstack-poisoned]
eellison added a commit that referenced this pull request Apr 25, 2025
ghstack-source-id: c85aaec
Pull Request resolved: #151958
[ghstack-poisoned]
eellison added a commit that referenced this pull request Apr 25, 2025
ghstack-source-id: b6ce905
Pull Request resolved: #151958
@eellison eellison added the topic: not user facing topic category label Apr 25, 2025
[ghstack-poisoned]
eellison added a commit that referenced this pull request Apr 28, 2025
ghstack-source-id: 3a843de
Pull Request resolved: #151958
@eellison eellison added the keep-going Don't stop on first failure, keep running tests until the end label Apr 28, 2025
[ghstack-poisoned]
eellison added a commit that referenced this pull request Apr 28, 2025
ghstack-source-id: f9b502f
Pull Request resolved: #151958
[ghstack-poisoned]
eellison added a commit that referenced this pull request Apr 28, 2025
ghstack-source-id: 23bb863
Pull Request resolved: #151958
Fix for #149982. 

Summary:

This PR does two main things:
1. Rewrites the tiling heuristics. The previous tiling heuristic would have each dependency generate a tiling. Then, we sum up the score for each generated tiling, preferring any 2d tiling over the default. The new tiling heuristics scores each tiling by its global coalesced memory. This gives both a potentially better tiling (especially for more complicated, 3d patterns) as well as information we can use in generating block sizes.  

2. Analyses memory dependencies for accesses that would be coalesced with additional tiling. The motivating kernel is in #149982 which is a 32 element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward per linear layer on a contiguous tensor, and once in the backward on a transposed tensor. 

While the contiguous kernel has coalesced accesses, and is performant on master, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See, this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. Now, with this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075). 

We analyse memory addresses that are not coalesced by any iteration variable. For this following dependency:

`(((32*n0 + n1)//2048)) + 4096*(ModularIndexing(32*n0 + n1, 1, 2048))` we infer that tiling `n0` by 64 makes the first term coalesced. 

I'm sure there are still some CI failures to debug..

cc vkuzo 


cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov 

[ghstack-poisoned]
eellison added a commit that referenced this pull request Apr 28, 2025
ghstack-source-id: 4a61b07
Pull Request resolved: #151958
Fix for #149982. 

Summary:

This PR does two main things:
1. Rewrites the tiling heuristics. The previous tiling heuristic would have each dependency generate a tiling. Then, we sum up the score for each generated tiling, preferring any 2d tiling over the default. The new tiling heuristics scores each tiling by its global coalesced memory. This gives both a potentially better tiling (especially for more complicated, 3d patterns) as well as information we can use in generating block sizes.  

2. Analyses memory dependencies for accesses that would be coalesced with additional tiling. The motivating kernel is in #149982 which is a 32 element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward per linear layer on a contiguous tensor, and once in the backward on a transposed tensor. 

While the contiguous kernel has coalesced accesses, and is performant on master, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See, this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. Now, with this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075). 

We analyse memory addresses that are not coalesced by any iteration variable. For this following dependency:

`(((32*n0 + n1)//2048)) + 4096*(ModularIndexing(32*n0 + n1, 1, 2048))` we infer that tiling `n0` by 64 makes the first term coalesced. 

I'm sure there are still some CI failures to debug..

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov vkuzo 


cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov 

[ghstack-poisoned]
eellison added a commit that referenced this pull request Apr 28, 2025
ghstack-source-id: 9cc1907
Pull Request resolved: #151958
[ghstack-poisoned]
eellison added a commit that referenced this pull request May 15, 2025
ghstack-source-id: 2c5598f
Pull Request resolved: #151958
[ghstack-poisoned]
eellison added a commit that referenced this pull request May 15, 2025
ghstack-source-id: dd1cedd
Pull Request resolved: #151958
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/inductor keep-going Don't stop on first failure, keep running tests until the end module: dynamo module: inductor topic: not user facing topic category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant
0