Implement the new tuning API for DeviceScan #7565

griwes wants to merge 28 commits into NVIDIA:main from feature/new-tuning-api/scan

This looks really good already! Great work!
Note, the warpspeed integration is still largely untested; I've added an rtxpro6000 test job to c.parallel, and that will be the primary test for now. I'll lease a machine with a relevant GPU if that fails, or if anything looks clearly wrong to someone's eyes in review. Edit: also seems I messed up some constexprness 😅
```cpp
res.smemInOut,
res.smemNextBlockIdx,
res.smemSumExclusiveCta,
res.smemSumThreadAndWarp);
```
Remark: This seems like a massive duplication of the logic allocResources does. I am extremely worried this will render the codebase brittle and unmaintainable. We should really try to come up with a way to not duplicate so much logic.
This is, in fact, a reduction of the duplication. The only way to avoid duplicating these parts is to entirely drop the use of typed resources and use the raw resources (the path you're highlighting here) as the only code path. I... can do that, but that appears to me to be less desirable than what I have in the PR right now.
The reason it is like this is that the current code is all written in terms of types and their statically known sizes. We do not have that in c.parallel code paths, we need to use runtime values. So it's either this or all code paths use raw resources exclusively.
I see a lot of changes to the setup of the shared memory resources, which worry me. I am almost certain those will introduce changes to the SASS of warpspeed kernels.
@griwes please try to refactor out anything that is not related to the new tuning API and ship it as another PR, so we can reduce the scope of this PR.
The setup of the resources is the same from the perspective of the kernel. The only thing that changes there is the ability to use runtime-sized types with the same logic. If we drop that from this PR, I will need to duplicate a whole bunch of code between the paths for the host and for the device w.r.t. how the resources are set up for the purpose of computing the max number of stages and the dynamic shared memory necessary. Splitting the core logic of allocating the phases is risky, because updating it would then require correctly updating both copies in exactly the same way, and that's harder to maintain. The changes you see there are crucial to actually ensure the logic matches.

The bottom line is that everything here is related to the new tuning API. Cutting out the changes around the resource setup and calculations means cutting out c.parallel, and actually getting c.parallel to be able to use the new toys is a good part of the reasoning for these changes in the first place.
Last remaining real failure is SASS checks in non-scan c.parallel tests on sm120; I'll pull that out of this PR, together with the enablement of the config in CI, and post it separately.
I still have to re-review the dispatch logic and the changes around the kernel, especially the refactoring to compute whether we can fit a single stage into 48KiB SMEM. Otherwise this looks pretty good already!
Ideally, we should not see any SASS changes for SM 75;80;86;90;100 for one of the benchmarks, like cub.bench.scan.sum.base. Can you please diff a SASS dump before and after the PR and confirm this? Thx!
```cpp
// bottleneck. As soon as it produces a new value, it will be consumed by the
// scanStore squad, releasing the stage.
int numSumExclusiveCtaStages = 2;
const auto counts = make_scan_stage_counts(numStages);
```
Suggestion: could use a structured binding
```cpp
{
  static constexpr int num_squads = 5;

  bool valid = false;
```
Remark: we should probably introduce an algorithm enum like in DeviceTransform before all the policies go public. No changes needed for now.
```cpp
#if __cccl_ptx_isa >= 860
template <typename ActivePolicyT>
CUB_RUNTIME_FUNCTION _CCCL_HOST _CCCL_FORCEINLINE cudaError_t __invoke_lookahead_algorithm(ActivePolicyT)
#if _CCCL_CUDACC_AT_LEAST(12, 8)
```
Question: why is this change needed? We should check for the PTX ISA we require IMO.
```cpp
int smem_size = detail::scan::smem_for_stages(
  warpspeed_policy,
  num_stages,
  policy_selector.input_value_size,
  policy_selector.input_value_alignment,
  policy_selector.output_value_size,
  policy_selector.output_value_alignment,
  policy_selector.accum_size,
  policy_selector.accum_alignment);
```
Important: Please retain the compile-time check when possible. It helps a lot with development if we can turn on warpspeed unconditionally and just compile to see if we find any test failures etc.
```diff
-  int smem_size = detail::scan::smem_for_stages(
+  CUB_DETAIL_CONSTEXPR_ISH int smem_size = detail::scan::smem_for_stages(
     warpspeed_policy,
     num_stages,
     policy_selector.input_value_size,
     policy_selector.input_value_alignment,
     policy_selector.output_value_size,
     policy_selector.output_value_alignment,
     policy_selector.accum_size,
     policy_selector.accum_alignment);
+  CUB_DETAIL_STATIC_ISH_ASSERT(smem_size <= detail::max_smem_per_block); // this is ensured by scan_use_warpspeed
```
`smem_for_stages` is constexpr, so I think we just need to pull the `policy_getter` from `__invoke` into `__invoke_lookahead_algorithm` to get a `CUB_DETAIL_CONSTEXPR_ISH auto active_policy = policy_getter();`
```cpp
return dispatch_arch(policy_selector, arch_id, [&](auto policy_getter) {
  return DispatchScan<InputIteratorT,
                      OutputIteratorT,
                      ScanOpT,
                      InitValueT,
                      OffsetT,
                      AccumT,
                      EnforceInclusive,
                      fake_policy,
                      KernelSource,
                      KernelLauncherFactory>{
    d_temp_storage,
    temp_storage_bytes,
    d_in,
    d_out,
    num_items,
    scan_op,
    init_value,
    stream,
    -1 /* ptx_version, not used actually */,
    kernel_source,
    launcher_factory}
    .__invoke(policy_getter, policy_selector);
});
```
Remark: I wonder if it would have been easier to duplicate the logic from DispatchScan into the dispatch function and strip all warpspeed logic from DispatchScan. The warpspeed scan is not on a release branch yet, so it's fine if it's not reachable through DispatchScan.
```cpp
@@ -141,6 +129,13 @@ struct scan_tuning : cub::detail::scan::tuning<scan_tuning<BlockThreads>>
  using MaxPolicy = Policy500;
};

template <class InputValueT, class OutputValueT, class AccumT, class OffsetT, class ScanOpT>
using selector =
  cub::detail::scan::policy_selector_from_hub<policy_hub<InputValueT, OutputValueT, AccumT, OffsetT, ScanOpT>,
                                              InputValueT,
                                              OutputValueT,
                                              AccumT>;
};
```
Suggestion: we can just rewrite scan_tuning to be only a policy selector (only have operator()).
```cpp
struct default_tuning : tuning<default_tuning>
{
  template <typename InputValueT, typename OutputValueT, typename AccumT, typename OffsetT, typename ScanOpT>
  using fn = policy_hub<InputValueT, OutputValueT, AccumT, OffsetT, ScanOpT>;
  using selector = policy_selector_from_types<InputValueT, OutputValueT, AccumT, OffsetT, ScanOpT>;
};
```
and drop this entirely
```cpp
using scan_tuning_t = ::cuda::std::execution::
  __query_result_or_t<TuningEnvT, detail::scan::get_tuning_query_t, detail::scan::default_tuning>;

// Unsigned integer type for global offsets
using offset_t = detail::choose_offset_t<NumItemsT>;

using accum_t =
  ::cuda::std::__accumulator_t<ScanOpT,
                               cub::detail::it_value_t<InputIteratorT>,
                               ::cuda::std::_If<::cuda::std::is_same_v<InitValueT, NullType>,
                                                cub::detail::it_value_t<InputIteratorT>,
                                                typename InitValueT::value_type>>;

using policy_t = typename scan_tuning_t::
  template fn<detail::it_value_t<InputIteratorT>, detail::it_value_t<OutputIteratorT>, accum_t, offset_t, ScanOpT>;
using policy_selector_t = typename scan_tuning_t::template selector<
  detail::it_value_t<InputIteratorT>,
  detail::it_value_t<OutputIteratorT>,
  accum_t,
  offset_t,
  ScanOpT>;
```
and:
```cpp
ScanOpT>;
using default_policy_selector = policy_selector_from_types<InputValueT, OutputValueT, AccumT, OffsetT, ScanOpT>;
using policy_selector_t = ::cuda::std::execution::
  __query_result_or_t<TuningEnvT, detail::scan::get_tuning_query_t, default_policy_selector>;
// Unsigned integer type for global offsets
using offset_t = detail::choose_offset_t<NumItemsT>;
```
I finished another review and I only have minor comments, except for the wish to retain the static assert that one stage fits into SMEM. I am now waiting for confirmation that we don't see SASS changes.
CI Workflow Results: 🟥 Finished in 4h 07m: Pass: 99%/255 | Total: 9d 11h | Max: 3h 48m | Hits: 66%/156146
Description
Resolves #7521
Resolves #7476
Resolves #6821
Ready for review, still planning to do SASS inspection in some crucial places.
Sidenote: this exact type of task seems to fit Codex really, really well.
Checklist