8000 [Inductor] Adjust boundary checking of dimensions using YBLOCK by kundaMwiza · Pull Request #149504 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

[Inductor] Adjust boundary checking of dimensions using YBLOCK #149504

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

kundaMwiza
Copy link
Collaborator
@kundaMwiza kundaMwiza commented Mar 19, 2025

Apply the same logic introduced in #139751 to triton kernels using block ptrs. Here, if ynumel / YBLOCK > max_y_grids, dimensions dependent on YBLOCK need to be boundary checked, even if the block shape in such dimensions is a multiple of an expression in YBLOCK. This is because ynumel / YBLOCK % get_max_y_grids() may not be zero, so redundant programs will be launched that will attempt to read / write OOB.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

Copy link
pytorch-bot bot commented Mar 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149504

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 85c0178 with merge base 6c2c527 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@kundaMwiza
Copy link
Collaborator Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Mar 19, 2025
@kundaMwiza kundaMwiza force-pushed the mwizak/handle-ygrid-overflow-if-block-ptr branch 3 times, most recently from 5553e41 to f65cb36 Compare March 20, 2025 10:59
@bdhirsh bdhirsh requested review from eellison and shunting314 March 24, 2025 14:30
@colesbury colesbury added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Mar 24, 2025
@eellison
Copy link
Contributor

Sorry for delay. cc @blaine-rister who has added a lot of the block ptrs logic, mind taking this one ?

@eellison eellison requested a review from blaine-rister April 15, 2025 19:19
@eellison eellison removed their request for review April 22, 2025 19:25
@blaine-rister
Copy link
Contributor
blaine-rister commented Apr 22, 2025

Nit: Did you mean to leave "Fixes #ISSUE_NUMBER" in the PR description? You can delete that line if there's no issue to reference.

# if numel not a block multiple, must boundary check
# skip triton_cpu very slow test > 1000s
subtest([False, False], decorators=[test_torchinductor.skip_if_triton_cpu])
# TODO: test zdim too
Copy link
Contributor
@blaine-rister blaine-rister Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you meant to add another test here.

# See Note: Constant mask optimisation
# if ynumel / YBLOCK > max_ygrid, then the z dimension is used to handle
# the remaining programs that cannot fit into the y dimension. This means
# its possible that some redundant programs are launched, so even if
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is redundancy the issue here, or is it that we could end up with out-of-bounds accesses?

Copy link
Collaborator Author
@kundaMwiza kundaMwiza Apr 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its the latter, OOB accesses due to additional programs being launched. I've edited the comment.

Comment on lines 450 to 476
boundary_check = []
overflow_grid_check = None
for idx in range(len(self.shape)):
# See Note: Constant mask optimisation
# if ynumel / YBLOCK > max_ygrid, then the z dimension is used to handle
# the remaining programs that cannot fit into the y dimension. This means
# its possible that some redundant programs are launched, so even if
# ynumel divides YBLOCK, boundary checking is required in the relevant dimensions
if (
TritonSymbols.block_sizes[SymT.YBLOCK]
in self.block_shape[idx].free_symbols
):
if overflow_grid_check is None:
y_tree: IterationRangesRoot = next(
t for t in range_trees if t.prefix == "y"
)
overflow_grid_check = (
not y_tree.has_zdim
and not V.graph.sizevars.statically_known_leq(
y_tree.numel, get_max_y_grid()
)
)

if (
not sizevars.statically_known_equals(self.strides[idx], sympy.S.Zero)
and not sizevars.statically_known_multiple_of(
self.shape[idx], self.block_shape[idx]
)
and not sizevars.statically_known_multiple_of(
self.shape[idx], sympy_subs(self.block_shape[idx], block_to_max)
and (
overflow_grid_check
or (
not sizevars.statically_known_multiple_of(
self.shape[idx], self.block_shape[idx]
)
and not sizevars.statically_known_multiple_of(
self.shape[idx],
sympy_subs(self.block_shape[idx], block_to_max),
)
)
Copy link
Contributor
@blaine-rister blaine-rister Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there's some existing logic to compute this. Since this is fairly complex, could we call the same helper here? Something like this should work.

needs_overflow_grid = any(map(self.needs_yz_grid_overflow, self.range_trees))

self._boundary_check = [
            idx
            for idx in range(len(self.shape))
            if (
                not sizevars.statically_known_equals(self.strides[idx], sympy.S.Zero)
                and not sizevars.statically_known_multiple_of(
                    self.shape[idx], self.block_shape[idx]
                )
                and not sizevars.statically_known_multiple_of(
                    self.shape[idx], sympy_subs(self.block_shape[idx], block_to_max)
                )
                and not (
                    V.kernel.no_x_dim
                    and self.block_shape[idx] == TritonSymbols.block_sizes[SymT.XBLOCK]
                )
                and not (
                    needs_overflow_grid
                    and self.block_shape[idx] != TritonSymbols.block_sizes[SymT.XBLOCK]
                )
            )
        ]

in self.block_shape[idx].free_symbols
):
if overflow_grid_check is None:
y_tree: IterationRangesRoot = next(
Copy link
Contributor
@blaine-rister blaine-rister Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this is initialized to None, but then inside the loop, it becomes either True or False. Does the value depend on idx? Or would it be equivalent to compute this outside the loop? See the suggestion below.

Copy link
Contributor
@blaine-rister blaine-rister left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good fix and I think it's pretty close to being approved. A couple things I'd like to see before approving:

  1. Resolving the TODO by adding a test for 3D tiling.
  2. I left a few comments about possibly cleaning up the boundary check logic.

@kundaMwiza kundaMwiza force-pushed the mwizak/handle-ygrid-overflow-if-block-ptr branch from 8b01e19 to 565b66a Compare April 28, 2025 17:17
and not sizevars.statically_known_multiple_of(
self.shape[idx], sympy_subs(self.block_shape[idx], block_to_max)
and (
(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blaine-rister I've updated the code here. If needs_overflow_grid and the current dimension is based on YBLOCK, there is no need to check divisibility of the input shape by the block shape, since even if it is divisible, if excess programs are launched then the OOB case is reached.

@kundaMwiza
Copy link
Collaborator Author

@blaine-rister Thank you for the review. I've updated the PR, can you please have a look?

Copy link
Contributor
@blaine-rister blaine-rister left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the fix and very thorough testing.

@kundaMwiza
Copy link
Collaborator Author

@pytorchbot merge

Copy link
pytorch-bot bot commented May 15, 2025

Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@kundaMwiza kundaMwiza force-pushed the mwizak/handle-ygrid-overflow-if-block-ptr branch from c639a7a to 85c0178 Compare May 21, 2025 20:56
@kundaMwiza
Copy link
Collaborator Author

@blaine-rister I think you need to approve the workflows / merge

@kundaMwiza
Copy link
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 20, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged module: inductor open source topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants
0