-
Notifications
You must be signed in to change notification settings - Fork 25.1k
[Inductor] Adjust boundary checking of dimensions using YBLOCK #149504
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Inductor] Adjust boundary checking of dimensions using YBLOCK #149504
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149504
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure)As of commit 85c0178 with merge base 6c2c527 ( FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
@pytorchbot label "topic: not user facing" |
5553e41
to
f65cb36
Compare
Sorry for delay. cc @blaine-rister who has added a lot of the block ptrs logic, mind taking this one ? |
Nit: Did you mean to leave "Fixes #ISSUE_NUMBER" in the PR description? You can delete that line if there's no issue to reference. |
# if numel not a block multiple, must boundary check | ||
# skip triton_cpu very slow test > 1000s | ||
subtest([False, False], decorators=[test_torchinductor.skip_if_triton_cpu]) | ||
# TODO: test zdim too |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you meant to add another test here.
torch/_inductor/codegen/triton.py
Outdated
# See Note: Constant mask optimisation | ||
# if ynumel / YBLOCK > max_ygrid, then the z dimension is used to handle | ||
# the remaining programs that cannot fit into the y dimension. This means | ||
# its possible that some redundant programs are launched, so even if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is redundancy the issue here, or is it that we could end up with out-of-bounds accesses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its the latter, OOB accesses due to additional programs being launched. I've edited the comment.
torch/_inductor/codegen/triton.py
Outdated
boundary_check = [] | ||
overflow_grid_check = None | ||
for idx in range(len(self.shape)): | ||
# See Note: Constant mask optimisation | ||
# if ynumel / YBLOCK > max_ygrid, then the z dimension is used to handle | ||
# the remaining programs that cannot fit into the y dimension. This means | ||
# its possible that some redundant programs are launched, so even if | ||
# ynumel divides YBLOCK, boundary checking is required in the relevant dimensions | ||
if ( | ||
TritonSymbols.block_sizes[SymT.YBLOCK] | ||
in self.block_shape[idx].free_symbols | ||
): | ||
if overflow_grid_check is None: | ||
y_tree: IterationRangesRoot = next( | ||
t for t in range_trees if t.prefix == "y" | ||
) | ||
overflow_grid_check = ( | ||
not y_tree.has_zdim | ||
and not V.graph.sizevars.statically_known_leq( | ||
y_tree.numel, get_max_y_grid() | ||
) | ||
) | ||
|
||
if ( | ||
not sizevars.statically_known_equals(self.strides[idx], sympy.S.Zero) | ||
and not sizevars.statically_known_multiple_of( | ||
self.shape[idx], self.block_shape[idx] | ||
) | ||
and not sizevars.statically_known_multiple_of( | ||
self.shape[idx], sympy_subs(self.block_shape[idx], block_to_max) | ||
and ( | ||
overflow_grid_check | ||
or ( | ||
not sizevars.statically_known_multiple_of( | ||
self.shape[idx], self.block_shape[idx] | ||
) | ||
and not sizevars.statically_known_multiple_of( | ||
self.shape[idx], | ||
sympy_subs(self.block_shape[idx], block_to_max), | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like there's some existing logic to compute this. Since this is fairly complex, could we call the same helper here? Something like this should work.
needs_overflow_grid = any(map(self.needs_yz_grid_overflow, self.range_trees))
self._boundary_check = [
idx
for idx in range(len(self.shape))
if (
not sizevars.statically_known_equals(self.strides[idx], sympy.S.Zero)
and not sizevars.statically_known_multiple_of(
self.shape[idx], self.block_shape[idx]
)
and not sizevars.statically_known_multiple_of(
self.shape[idx], sympy_subs(self.block_shape[idx], block_to_max)
)
and not (
V.kernel.no_x_dim
and self.block_shape[idx] == TritonSymbols.block_sizes[SymT.XBLOCK]
)
and not (
needs_overflow_grid
and self.block_shape[idx] != TritonSymbols.block_sizes[SymT.XBLOCK]
)
)
]
torch/_inductor/codegen/triton.py
Outdated
in self.block_shape[idx].free_symbols | ||
): | ||
if overflow_grid_check is None: | ||
y_tree: IterationRangesRoot = next( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this is initialized to None
, but then inside the loop, it becomes either True
or False
. Does the value depend on idx
? Or would it be equivalent to compute this outside the loop? See the suggestion below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good fix and I think it's pretty close to being approved. A couple things I'd like to see before approving:
- Resolving the TODO by adding a test for 3D tiling.
- I left a few comments about possibly cleaning up the boundary check logic.
8b01e19
to
565b66a
Compare
and not sizevars.statically_known_multiple_of( | ||
self.shape[idx], sympy_subs(self.block_shape[idx], block_to_max) | ||
and ( | ||
( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@blaine-rister I've updated the code here. If needs_overflow_grid
and the current dimension is based on YBLOCK
, there is no need to check divisibility of the input shape by the block shape, since even if it is divisible, if excess programs are launched then the OOB case is reached.
@blaine-rister Thank you for the review. I've updated the PR, can you please have a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for the fix and very thorough testing.
@pytorchbot merge |
Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra. |
Co-authored-by: blaine-rister <145300525+blaine-rister@users.noreply.github.com>
c639a7a
to
85c0178
Compare
@blaine-rister I think you need to approve the workflows / merge |
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Apply the same logic introduced in #139751 to triton kernels using block ptrs. Here, if ynumel / YBLOCK > max_y_grids, dimensions dependent on YBLOCK need to be boundary checked, even if the block shape in such dimensions is a multiple of an expression in YBLOCK. This is because ynumel / YBLOCK % get_max_y_grids() may not be zero, so redundant programs will be launched that will attempt to read / write OOB.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov