[Inductor] Broadcast to range tree shape before block pointer store by blaine-rister · Pull Request #151399 · pytorch/pytorch · GitHub

[Inductor] Broadcast to range tree shape before block pointer store #151399


Closed · blaine-rister wants to merge 5 commits

Conversation

blaine-rister (Contributor) commented Apr 16, 2025

# Feature

This fixes a bug related to block pointer stores. Since Triton's block pointer stores don't support implicit broadcasting, in certain cases we need to generate a `reshape->broadcast->reshape` pattern to ensure that the tensor being stored has the same shape as the block pointer. This happens when the block indexing expression involves strides of 0 or dimensions of 1, both of which we eliminate from the block pointer.

The existing logic missed an important edge case. We may need a broadcast prior to the first `reshape` of this pattern, in case the tensor comes from a load with implicit broadcasting. For example, if the range trees have shape `[YBLOCK, XBLOCK]`, but the load has a shape `[1, XBLOCK]`, we need to broadcast this to `[YBLOCK, XBLOCK]` prior to storing. See the example kernel below, which comes from `expand` -> `clone` with 3D tiling. The load has an implicit broadcast, and the store has a reshape. Thus, we need to insert an explicit broadcast between them.

```
@triton.jit
def triton_poi_fused_clone_0(in_ptr0, out_ptr0, znumel, ynumel, xnumel, ZBLOCK : tl.constexpr, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    znumel = 32
    ynumel = 1
    xnumel = 32
    zoffset = tl.program_id(2) * ZBLOCK
    zindex = zoffset + tl.arange(0, ZBLOCK)[:, None, None]
    zmask = zindex < znumel
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :, None]
    ymask = tl.full([ZBLOCK, YBLOCK, XBLOCK], True, tl.int1)
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[None, None, :]
    xmask = xindex < xnumel
    x1 = xindex
    z0 = zindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[32], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0], eviction_policy='evict_last')[None, None, :]
    tl.store(tl.make_block_ptr(out_ptr0, shape=[32, 32], strides=[32, 1], block_shape=[ZBLOCK, XBLOCK], order=[1, 0], offsets=[zoffset, xoffset]), tl.reshape(tl.broadcast_to(tmp0, [ZBLOCK, YBLOCK, XBLOCK]), [ZBLOCK, XBLOCK]).to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
```
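
For reference, here is a minimal repro sketch of the kind of program that can produce a kernel like this. The exact shapes and the `use_block_ptr` knob are illustrative assumptions, not the precise setup from this PR's test:

```
import torch
import torch._inductor.config as inductor_config

# Assumption: Inductor only emits tl.make_block_ptr when this option is enabled.
inductor_config.triton.use_block_ptr = True

def f(x):
    # expand -> clone: the expanded (stride-0) dimension turns into a load with
    # implicit broadcasting, which then feeds a block pointer store.
    return x.expand(32, 1, 32).clone()

compiled = torch.compile(f)
out = compiled(torch.randn(1, 1, 32, device="cuda"))
```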

The tricky part is that we don't want to emit redundant broadcasts in the store. This PR reworks the logic a bit to make sure we don't emit a second broadcast unless it actually changes the shape.
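
As a rough illustration of that rule (a sketch only, not the actual Inductor codegen), the store-side shape fixup amounts to something like:

```
# Hypothetical sketch of the shape fixup described above; the function name and
# structure are assumptions, not the real Inductor implementation.
def fix_shape_for_store(value_expr, value_shape, range_tree_shape, block_ptr_shape):
    # Broadcast up to the full range tree shape only when it changes the shape,
    # e.g. a load of shape [1, XBLOCK] broadcast to [YBLOCK, XBLOCK].
    if list(value_shape) != list(range_tree_shape):
        value_expr = f"tl.broadcast_to({value_expr}, {range_tree_shape})"
    # Then collapse singleton / stride-0 dims to match the block pointer shape.
    if list(range_tree_shape) != list(block_ptr_shape):
        value_expr = f"tl.reshape({value_expr}, {block_ptr_shape})"
    return value_expr
```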

# Test plan

Added a CI test for this case, which would fail on trunk. Checked that only one broadcast was emitted.
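
For context, a check along these lines can be written with `torch._inductor.utils.run_and_get_code`, counting explicit broadcasts in the generated source. The snippet below reuses the assumed expand/clone repro from above and is a sketch, not the exact test added by this PR:

```
import torch
from torch._inductor.utils import run_and_get_code

def f(x):
    return x.expand(32, 1, 32).clone()

# run_and_get_code returns the compiled result plus the generated source strings.
_, (code,) = run_and_get_code(torch.compile(f), torch.randn(1, 1, 32, device="cuda"))
assert code.count("tl.broadcast_to") == 1  # exactly one explicit broadcast emitted
```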

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

pytorch-bot (bot) commented Apr 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151399

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c9881bb with merge base b0e28f6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

eellison (Contributor) left a comment


Cc @isuruf, this is another case that will be simplified by #149905

blaine-rister marked this pull request as ready for review April 16, 2025 16:17
blaine-rister (Contributor, Author) commented Apr 16, 2025

Cc @isuruf, this is another case that will be simplified by #149905

@eellison @isuruf I'm glad you're working on this feature! We currently have to add broadcasts defensively, since it's hard to know what the actual shape is. The Triton compiler eliminates these no-op broadcasts anyway, but they make the code harder to read. It would be great if the IR tracked the shape more directly.

facebook-github-bot (Contributor) commented:

@blaine-rister has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

pytorch-bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Apr 16, 2025
blaine-rister (Contributor, Author) commented:

@pytorchbot merge

pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
[Inductor] Broadcast to range tree shape before block pointer store (pytorch#151399)

Pull Request resolved: pytorch#151399
Approved by: https://github.com/jansel, https://github.com/eellison