[pytorch][triton] Enabling TMA for flex-attention for supported device types #153662

mandroid6 · 2025-05-15T21:42:52Z

Summary:
Currently flex-attention defaults to USE_TMA=False.

We can enable TMA on devices which support it based on has_triton_tma_device.

Test Plan:

Tritonbench results

buck2 run mode/opt //pytorch/tritonbench:run -- --op flex_attention --use-tma --mod-type all
.
.
.
(B, Hq, M, Hkv, N, D)     |       Mask Type    compiled-latency    compiled-tflops
- USE_TMA =  False
(8, 16, 128, 16, 128, 128) |            noop   0.027936 (±5.61%)            38.5859
(8, 16, 128, 16, 128, 128) |          causal   0.017760 (±3.42%)            60.6946
(8, 16, 128, 16, 128, 128) |             rel   0.028384 (±5.07%)            37.9769
(8, 16, 128, 16, 128, 128) |       head_bias   0.027712 (±4.27%)            38.8978
(8, 16, 128, 16, 128, 128) |           alibi   0.017920 (±3.21%)            60.1527
- USE_TMA = True
(8, 16, 128, 16, 128, 128) |            noop   0.025632 (±5.74%)            42.0543
(8, 16, 128, 16, 128, 128) |          causal   0.015328 (±3.97%)            70.3246
(8, 16, 128, 16, 128, 128) |             rel   0.025824 (±4.96%)            41.7416
(8, 16, 128, 16, 128, 128) |       head_bias   0.025472 (±4.90%)            42.3185
(8, 16, 128, 16, 128, 128) |           alibi   0.015392 (±3.74%)            70.0322
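As a quick sanity check on the numbers above, the per-mask speedup is the ratio of the `USE_TMA = False` latency to the `USE_TMA = True` latency. The sketch below copies the latencies from the table; the dictionary and function names are made up for illustration and are not part of tritonbench. On this single shape the ratios come out at roughly 1.09x to 1.16x. The reported run-to-run variation is about ±3% to ±6%, so the smaller gains sit close to the noise.

```python
# Latencies copied from the Tritonbench table above,
# shape (B, Hq, M, Hkv, N, D) = (8, 16, 128, 16, 128, 128).
# Names below are illustrative only; they are not tritonbench APIs.
latency_no_tma = {
    "noop": 0.027936,
    "causal": 0.017760,
    "rel": 0.028384,
    "head_bias": 0.027712,
    "alibi": 0.017920,
}
latency_tma = {
    "noop": 0.025632,
    "causal": 0.015328,
    "rel": 0.025824,
    "head_bias": 0.025472,
    "alibi": 0.015392,
}


def speedups(baseline, candidate):
    """Return baseline/candidate latency ratio per mask type (>1 means faster)."""
    return {name: baseline[name] / candidate[name] for name in baseline}


tma_speedups = speedups(latency_no_tma, latency_tma)
for mask_type, ratio in tma_speedups.items():
    print(f"{mask_type:>10}: {ratio:.3f}x")
```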

Differential Revision: D74841543

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

pytorch-bot · 2025-05-15T21:42:56Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153662

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

VolumeLimitExceeded Issue for linux.2xlarge and linux.4xlarge

✅ You can merge normally! (1 Unrelated Failure)

As of commit 3317a9e with merge base 9fe2d15:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / cuda12.8-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu, unstable) (gh) (#153987)
MISSING REGRESSION TEST

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2025-05-15T21:43:01Z

This pull request was exported from Phabricator. Differential Revision: D74841543

drisspg · 2025-05-16T01:27:13Z

Can you also do a perf bench with larger sequence lengths? I am curious.

Copying over comments:
Looks good. The only thing is that we don't have CI testing for this, so I am going to run the tests on the devvm.

drisspg · 2025-05-16T16:43:46Z

https://www.internalfb.com/intern/paste/P1813676625/

Some failing tests

…e types (#153662)

Summary:
Currently flex-attention defaults to `USE_TMA=False`.

We can enable TMA on devices which support it based on `has_triton_tma_device`.

Test Plan:
## Unit tests on H100
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:flex_attention
```
https://www.internalfb.com/intern/testinfra/testrun/3096224974340688

# Tritonbench results
```
buck2 run mode/opt //pytorch/tritonbench:run -- --op flex_attention --use-tma --mod-type all
.
.
.
(B, Hq, M, Hkv, N, D)     |       Mask Type    compiled-latency    compiled-tflops
- USE_TMA = False
(8, 16, 128, 16, 128, 128) |            noop   0.027936 (±5.61%)            38.5859
(8, 16, 128, 16, 128, 128) |          causal   0.017760 (±3.42%)            60.6946
(8, 16, 128, 16, 128, 128) |             rel   0.028384 (±5.07%)            37.9769
(8, 16, 128, 16, 128, 128) |       head_bias   0.027712 (±4.27%)            38.8978
(8, 16, 128, 16, 128, 128) |           alibi   0.017920 (±3.21%)            60.1527
- USE_TMA = True
(8, 16, 128, 16, 128, 128) |            noop   0.025632 (±5.74%)            42.0543
(8, 16, 128, 16, 128, 128) |          causal   0.015328 (±3.97%)            70.3246
(8, 16, 128, 16, 128, 128) |             rel   0.025824 (±4.96%)            41.7416
(8, 16, 128, 16, 128, 128) |       head_bias   0.025472 (±4.90%)            42.3185
(8, 16, 128, 16, 128, 128) |           alibi   0.015392 (±3.74%)            70.0322
```

Differential Revision: D74841543

facebook-github-bot · 2025-05-27T17:55:00Z

This pull request was exported from Phabricator. Differential Revision: D74841543

facebook-github-bot · 2025-05-28T02:00:29Z

This pull request was exported from Phabricator. Differential Revision: D74841543

facebook-github-bot · 2025-05-28T03:07:31Z

This pull request was exported from Phabricator. Differential Revision: D74841543

facebook-github-bot · 2025-05-30T17:32:29Z

This pull request was exported from Phabricator. Differential Revision: D74841543

…e types (#153662)

Summary:
Currently flex-attention defaults to `USE_TMA=False`.

We can enable TMA on devices which support it based on `has_triton_tma_device`.

Test Plan:
## Unit tests on H100
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:flex_attention
```
https://www.internalfb.com/intern/testinfra/testrun/7318349664976675

# Tritonbench results
```
buck2 run mode/opt //pytorch/tritonbench:run -- --op flex_attention --use-tma --mod-type all
.
.
.
(B, Hq, M, Hkv, N, D)     |       Mask Type    compiled-latency    compiled-tflops
- USE_TMA = False
(8, 16, 128, 16, 128, 128) |            noop   0.027936 (±5.61%)            38.5859
(8, 16, 128, 16, 128, 128) |          causal   0.017760 (±3.42%)            60.6946
(8, 16, 128, 16, 128, 128) |             rel   0.028384 (±5.07%)            37.9769
(8, 16, 128, 16, 128, 128) |       head_bias   0.027712 (±4.27%)            38.8978
(8, 16, 128, 16, 128, 128) |           alibi   0.017920 (±3.21%)            60.1527
- USE_TMA = True
(8, 16, 128, 16, 128, 128) |            noop   0.025632 (±5.74%)            42.0543
(8, 16, 128, 16, 128, 128) |          causal   0.015328 (±3.97%)            70.3246
(8, 16, 128, 16, 128, 128) |             rel   0.025824 (±4.96%)            41.7416
(8, 16, 128, 16, 128, 128) |       head_bias   0.025472 (±4.90%)            42.3185
(8, 16, 128, 16, 128, 128) |           alibi   0.015392 (±3.74%)            70.0322
```

Differential Revision: D74841543

facebook-github-bot · 2025-06-10T19:49:52Z

This pull request was exported from Phabricator. Differential Revision: D74841543

davidberard98 · 2025-06-12T04:14:11Z

torch/_inductor/kernel/flex_attention.py

@@ -1613,7 +1614,7 @@ def flex_attention(
    original_kernel_options = kernel_options.copy()
    # Default config for warp specialization
    num_consumer_groups, num_buffers_warp_spec = 0, 0
-
+    USE_TMA = has_triton_tma_device()


@mandroid6 once #155771 lands, can you change this to `has_triton_stable_tma_api()`? #155771 switches flex attention to the stable TMA API, whereas `has_triton_tma_device()` only checks whether there is support for any TMA API.
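To make the suggestion concrete, here is a minimal sketch of the gating pattern being discussed. It is not the code in `flex_attention.py`: the helper `resolve_use_tma` and the injected `tma_probe` callable are hypothetical, and the probe is passed in so the snippet runs without a GPU or a PyTorch install. In the real change the probe would be `has_triton_tma_device` (as in the diff above) or `has_triton_stable_tma_api` (as requested here). Letting an explicit user-supplied `USE_TMA` take precedence is an assumption of this sketch, not something the thread confirms.

```python
from typing import Any, Callable, Dict


def resolve_use_tma(
    kernel_options: Dict[str, Any],
    tma_probe: Callable[[], bool],
) -> Dict[str, Any]:
    """Return a copy of kernel_options with USE_TMA filled in.

    Hypothetical helper for illustration only. An explicit USE_TMA from the
    caller is kept as-is (assumption); otherwise the capability probe decides.
    """
    options = dict(kernel_options)  # do not mutate the caller's dict
    if "USE_TMA" not in options:
        options["USE_TMA"] = bool(tma_probe())
    return options


# Stand-ins for the real capability checks named in the thread.
def device_with_stable_tma() -> bool:
    return True


def device_without_stable_tma() -> bool:
    return False


print(resolve_use_tma({}, device_with_stable_tma))
print(resolve_use_tma({}, device_without_stable_tma))
print(resolve_use_tma({"USE_TMA": False}, device_with_stable_tma))
```

Swapping the probe from "any TMA API" to "stable TMA API" then only changes which callable is passed in, which is the one-line change requested above.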

@mandroid6

Triton 3.4 will remove the experimental TMA APIs: triton-lang/triton#6488. Ahead of this, we are **replacing the experimental TMA API usage with the stable TMA API** in flex attention. This means that **flex attention TMA will stop working with Triton 3.2 or Triton 3.3/3.3.1** for now (but it should work for Triton 3.4 in the PyTorch 2.8 release, and Meta-internal triton 3.3.1fb, which have the new TMA API).

This PR does the following:
* replace the experimental TMA APIs with the stable TMA APIs
* remove the workspace args.

Testing: I ran test/inductor/test_flex_attention.py on a H100 with @mandroid6's PR #153662 patched in to turn on TMA [TODO: confirm results once all the local tests pass, but from the first 100 tests I ran locally, all the failing tests were also failing on #153662 alone]

Note: When #153662 lands, turning on TMA support by default, it should be checking specifically for stable TMA API support (commented on PR)

Pull Request resolved: #155771
Approved by: https://github.com/mandroid6, https://github.com/nmacchioni

facebook-github-bot · 2025-06-26T18:43:54Z

This pull request was exported from Phabricator. Differential Revision: D74841543

drisspg · 2025-07-03T23:12:30Z

@mandroid6 ping me when this is re-ready for review

pytorch-bot bot added ciflow/inductor module: inductor labels May 15, 2025

facebook-github-bot added the fb-exported label May 15, 2025

mandroid6 requested a review from drisspg May 15, 2025 21:43

mandroid6 added the topic: not user facing topic category label May 15, 2025

Skylion007 approved these changes May 16, 2025

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 16, 2025

Skylion007 approved these changes May 19, 2025

mandroid6 force-pushed the export-D74841543 branch from 83d37fd to 9972d12 Compare May 27, 2025 17:54

mandroid6 force-pushed the export-D74841543 branch from 9972d12 to dfd5ae0 Compare May 28, 2025 02:00

mandroid6 force-pushed the export-D74841543 branch from dfd5ae0 to 2e14cda Compare May 28, 2025 03:07

mandroid6 force-pushed the export-D74841543 branch from 2e14cda to db2c9b6 Compare May 30, 2025 17:32

mandroid6 force-pushed the export-D74841543 branch from db2c9b6 to fec2acc Compare June 10, 2025 19:46

mandroid6 force-pushed the export-D74841543 branch from fec2acc to f45bd01 Compare June 10, 2025 19:49

davidberard98 mentioned this pull request Jun 12, 2025

[flex attention][triton pin] use new TMA API #155771

Closed

davidberard98 reviewed Jun 12, 2025

mandroid6 force-pushed the export-D74841543 branch from f45bd01 to 3317a9e Compare June 26, 2025 18:43
