Implement the new tuning API for `DeviceScan` by griwes · Pull Request #7565 · NVIDIA/cccl · GitHub

Implement the new tuning API for DeviceScan #7565

Open

griwes wants to merge 28 commits into NVIDIA:main from griwes:feature/new-tuning-api/scan

Conversation

@griwes
Contributor
@griwes griwes commented Feb 8, 2026

Description

Resolves #7521
Resolves #7476
Resolves #6821

Ready for review; I'm still planning to do SASS inspection in a few crucial places.

Sidenote: this exact type of task seems to fit Codex really, really well.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@griwes griwes requested review from a team as code owners February 8, 2026 05:44
@griwes griwes requested a review from shwina February 8, 2026 05:44
@griwes griwes requested a review from elstehle February 8, 2026 05:44
@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 8, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Feb 8, 2026

Contributor
@bernhardmgruber bernhardmgruber left a comment

This looks really good already! Great work!

@bernhardmgruber
Contributor

@griwes we just merged #6811, which also touches the scan tunings. This will probably create some more work for this PR. Issue #6821 also tracks making the new scan implementation available to CCCL.C. Do you think you can handle this as well?

@bernhardmgruber
Contributor

@griwes I pulled out the delay constructor refactoring in #7668 so I can better stack my refactorings on top, in case this PR takes a bit longer (sorry again for the extra work with warpspeed!).

@griwes
Contributor Author
griwes commented Feb 18, 2026

Note, the warpspeed integration is still largely untested; I've added an rtxpro6000 test job to c.parallel, and that will be the primary test for now. I'll lease a machine with a relevant GPU if that fails, or if anything looks clearly wrong to a reviewer's eye.

Edit: also seems I messed up some constexprness 😅


res.smemInOut,
res.smemNextBlockIdx,
res.smemSumExclusiveCta,
res.smemSumThreadAndWarp);
Contributor

Remark: This seems like a massive duplication of the logic allocResources does. I am extremely worried this will render the codebase brittle and unmaintainable. We should really try to come up with a way to not duplicate so much logic.

Contributor Author

This is, in fact, a reduction of the duplication. The only way to avoid duplicating these parts at all is to entirely drop the use of typed resources and use the raw resources (so the path you're highlighting here) as the only code path. I... can do that, but that appears to me to be less desirable than what I have in the PR right now.

Contributor Author

The reason it is like this is that the current code is written entirely in terms of types and their statically known sizes. We do not have that in the c.parallel code paths; we need to use runtime values. So it's either this, or all code paths use raw resources exclusively.
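To make the trade-off concrete, here is a minimal, illustrative sketch (not the PR's actual code; `align_up`, `append_resource`, and the typed wrapper are hypothetical names) of how a single raw-resource accounting routine, driven by runtime sizes and alignments, can also back the typed path where those values are statically known:

```cpp
#include <cassert>
#include <cstddef>

// Round `offset` up to `align` (align must be a power of two).
constexpr std::size_t align_up(std::size_t offset, std::size_t align)
{
  return (offset + align - 1) & ~(align - 1);
}

// Raw path: size and alignment are plain runtime values (the c.parallel case).
constexpr std::size_t append_resource(std::size_t offset, std::size_t size, std::size_t align)
{
  return align_up(offset, align) + size;
}

// Typed path: a thin wrapper that recovers the raw call from static type info,
// so both paths share the same placement logic.
template <typename T>
constexpr std::size_t append_resource_typed(std::size_t offset)
{
  return append_resource(offset, sizeof(T), alignof(T));
}
```

The point of the sketch is that only the wrapper differs between the two paths; the placement arithmetic lives in one place.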

Contributor
@bernhardmgruber bernhardmgruber left a comment

I see a lot of changes to the setup of the shared memory resources, which worry me. I am almost certain those will introduce changes to the SASS of warpspeed kernels.

Contributor
@bernhardmgruber bernhardmgruber left a comment

@griwes please try to refactor out anything that is not related to the new tuning API and ship it as another PR, so we can reduce the scope of this PR.

@griwes
Contributor Author
griwes commented Feb 25, 2026

The setup of the resources is the same from the perspective of the kernel. The only thing that changes there is the ability to use runtime-sized types with the same logic.

If we drop that from this PR, I will need to duplicate a whole bunch of code between the host and device paths w.r.t. how the resources are set up for computing the max number of stages and the dynamic shared memory needed. Splitting the core logic of allocating the phases is risky, because any update would then have to be applied to both copies in exactly the same way, which is harder to maintain.

The changes you see there are crucial to actually ensure the logic matches.
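As a rough illustration of the computation both sides need to agree on, here is a hedged sketch (all names hypothetical, not the PR's API) of deriving a per-stage shared memory footprint and the maximum stage count that fits a budget, from runtime sizes and alignments:

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t align_up(std::size_t n, std::size_t a)
{
  return (n + a - 1) & ~(a - 1);
}

// Bytes of dynamic shared memory needed for `num_stages` pipeline stages, each
// holding one value buffer and one accumulator slot (illustrative layout only).
constexpr std::size_t smem_for_stages_sketch(
  int num_stages, std::size_t value_size, std::size_t value_align,
  std::size_t accum_size, std::size_t accum_align)
{
  std::size_t total = 0;
  for (int s = 0; s < num_stages; ++s)
  {
    total = align_up(total, value_align) + value_size;
    total = align_up(total, accum_align) + accum_size;
  }
  return total;
}

// Largest stage count whose footprint still fits into `smem_budget` bytes.
constexpr int max_stages_sketch(
  std::size_t smem_budget, std::size_t value_size, std::size_t value_align,
  std::size_t accum_size, std::size_t accum_align)
{
  int stages = 0;
  while (smem_for_stages_sketch(stages + 1, value_size, value_align,
                                accum_size, accum_align) <= smem_budget)
  {
    ++stages;
  }
  return stages;
}
```

If host and device each re-implemented this loop independently, any change to the layout would have to be mirrored exactly, which is the maintenance risk described above.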

@griwes
Contributor Author
griwes commented Feb 25, 2026

The bottom line is that everything here is related to the new tuning API. Cutting out the changes around the resource setup and calculations means cutting out c.parallel, and actually getting c.parallel to be able to use the new toys is a good part of the reasoning for these changes in the first place.


@griwes
Contributor Author
griwes commented Feb 27, 2026

Last remaining real failure is SASS checks in non-scan c.parallel tests on sm120; I'll pull that out of this PR, together with the enablement of the config in CI, and post it separately.

Contributor
@bernhardmgruber bernhardmgruber left a comment

I still have to re-review the dispatch logic and the changes around the kernel, especially the refactoring to compute whether we can fit a single stage into 48KiB SMEM. Otherwise this looks pretty good already!

Ideally, we should not see any SASS changes for SM 75;80;86;90;100 for one of the benchmarks, like cub.bench.scan.sum.base. Can you please diff a SASS dump before and after the PR and confirm this? Thx!

// bottleneck. As soon as it produces a new value, it will be consumed by the
// scanStore squad, releasing the stage.
int numSumExclusiveCtaStages = 2;
const auto counts = make_scan_stage_counts(numStages);
Contributor

Suggestion: could use a structured binding
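A minimal sketch of that suggestion, assuming `make_scan_stage_counts` returns an aggregate of per-squad stage counts (the struct and member names here are hypothetical, mirroring the snippet above):

```cpp
#include <cassert>

// Hypothetical aggregate holding the per-squad stage counts.
struct scan_stage_counts
{
  int num_in_out_stages;
  int num_sum_exclusive_cta_stages;
};

constexpr scan_stage_counts make_scan_stage_counts(int num_stages)
{
  // The exclusive-CTA sum squad only ever keeps two stages in flight.
  return {num_stages, 2};
}

// With a structured binding, the counts get direct names at the call site
// instead of being accessed through a `counts.` prefix:
//   auto [numInOutStages, numSumExclusiveCtaStages] = make_scan_stage_counts(4);
```

The structured binding avoids the intermediate `counts` variable while keeping the members individually named.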

{
static constexpr int num_squads = 5;

bool valid = false;
Contributor

Remark: we should probably introduce an algorithm enum like in DeviceTransform before all the policies go public. No changes needed for now.

#if __cccl_ptx_isa >= 860
template <typename ActivePolicyT>
CUB_RUNTIME_FUNCTION _CCCL_HOST _CCCL_FORCEINLINE cudaError_t __invoke_lookahead_algorithm(ActivePolicyT)
#if _CCCL_CUDACC_AT_LEAST(12, 8)
Contributor

Question: why is this change needed? We should check for the PTX ISA we require IMO.

Comment on lines +503 to +511
int smem_size = detail::scan::smem_for_stages(
warpspeed_policy,
num_stages,
policy_selector.input_value_size,
policy_selector.input_value_alignment,
policy_selector.output_value_size,
policy_selector.output_value_alignment,
policy_selector.accum_size,
policy_selector.accum_alignment);
Contributor

Important: Please retain the compile-time check when possible. It helps a lot with development if we can turn on warpspeed unconditionally and just compile to see if we find any test failures etc.

Suggested change
int smem_size = detail::scan::smem_for_stages(
warpspeed_policy,
num_stages,
policy_selector.input_value_size,
policy_selector.input_value_alignment,
policy_selector.output_value_size,
policy_selector.output_value_alignment,
policy_selector.accum_size,
policy_selector.accum_alignment);
CUB_DETAIL_CONSTEXPR_ISH int smem_size = detail::scan::smem_for_stages(
warpspeed_policy,
num_stages,
policy_selector.input_value_size,
policy_selector.input_value_alignment,
policy_selector.output_value_size,
policy_selector.output_value_alignment,
policy_selector.accum_size,
policy_selector.accum_alignment);
CUB_DETAIL_STATIC_ISH_ASSERT(smem_size <= detail::max_smem_per_block); // this is ensured by scan_use_warpspeed

smem_for_stages is constexpr, so I think we just need to pull the policy_getter from __invoke into __invoke_lookahead_algorithm to get a CUB_DETAIL_CONSTEXPR_ISH auto active_policy = policy_getter();
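The pattern at issue can be sketched in isolation; this is a hedged model, not CUB's actual macros, using an illustrative 48 KiB limit. Because the footprint function is `constexpr`, the same expression can be checked with `static_assert` on the statically tuned path and with a runtime `assert` on the c.parallel path:

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t max_smem_per_block = 48 * 1024; // illustrative limit

// constexpr footprint function, usable in both constant and runtime contexts.
constexpr std::size_t smem_for_stages(int num_stages, std::size_t bytes_per_stage)
{
  return static_cast<std::size_t>(num_stages) * bytes_per_stage;
}

// Static path: the policy is a compile-time constant, so the check is free.
static_assert(smem_for_stages(4, 2048) <= max_smem_per_block,
              "one stage configuration must fit into shared memory");

// Dynamic (c.parallel) path: the same expression, checked at runtime instead.
std::size_t checked_smem(int num_stages, std::size_t bytes_per_stage)
{
  const std::size_t smem_size = smem_for_stages(num_stages, bytes_per_stage);
  assert(smem_size <= max_smem_per_block);
  return smem_size;
}
```

The `CUB_DETAIL_CONSTEXPR_ISH`/`CUB_DETAIL_STATIC_ISH_ASSERT` macros in the suggestion above presumably select between these two flavors depending on whether the policy is statically known.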

Comment on lines +933 to +956
return dispatch_arch(policy_selector, arch_id, [&](auto policy_getter) {
return DispatchScan<InputIteratorT,
OutputIteratorT,
ScanOpT,
InitValueT,
OffsetT,
AccumT,
EnforceInclusive,
fake_policy,
KernelSource,
KernelLauncherFactory>{
d_temp_storage,
temp_storage_bytes,
d_in,
d_out,
num_items,
scan_op,
init_value,
stream,
-1 /* ptx_version, not used actually */,
kernel_source,
launcher_factory}
.__invoke(policy_getter, policy_selector);
});
Contributor

Remark: I wonder if it would have been easier to duplicate the logic from DispatchScan into the dispatch function and strip all warpspeed logic from DispatchScan. The warpspeed scan is not on a release branch yet, so it's fine if it's not reachable through DispatchScan.

Comment on lines 105 to 139
@@ -141,6 +129,13 @@ struct scan_tuning : cub::detail::scan::tuning<scan_tuning<BlockThreads>>

using MaxPolicy = Policy500;
};

template <class InputValueT, class OutputValueT, class AccumT, class OffsetT, class ScanOpT>
using selector =
cub::detail::scan::policy_selector_from_hub<policy_hub<InputValueT, OutputValueT, AccumT, OffsetT, ScanOpT>,
InputValueT,
OutputValueT,
AccumT>;
};
Contributor

Suggestion: we can just rewrite scan_tuning to be only a policy selector (only have operator()).

Comment on lines 53 to 57
struct default_tuning : tuning<default_tuning>
{
template <typename InputValueT, typename OutputValueT, typename AccumT, typename OffsetT, typename ScanOpT>
using fn = policy_hub<InputValueT, OutputValueT, AccumT, OffsetT, ScanOpT>;
using selector = policy_selector_from_types<InputValueT, OutputValueT, AccumT, OffsetT, ScanOpT>;
};
Contributor

and drop this entirely

Comment on lines 127 to +145
using scan_tuning_t = ::cuda::std::execution::
__query_result_or_t<TuningEnvT, detail::scan::get_tuning_query_t, detail::scan::default_tuning>;

// Unsigned integer type for global offsets
using offset_t = detail::choose_offset_t<NumItemsT>;

using accum_t =
::cuda::std::__accumulator_t<ScanOpT,
cub::detail::it_value_t<InputIteratorT>,
::cuda::std::_If<::cuda::std::is_same_v<InitValueT, NullType>,
cub::detail::it_value_t<InputIteratorT>,
typename InitValueT::value_type>>;

using policy_t = typename scan_tuning_t::
template fn<detail::it_value_t<InputIteratorT>, detail::it_value_t<OutputIteratorT>, accum_t, offset_t, ScanOpT>;
using policy_selector_t = typename scan_tuning_t::template selector<
detail::it_value_t<InputIteratorT>,
detail::it_value_t<OutputIteratorT>,
accum_t,
offset_t,
ScanOpT>;
Contributor

and:

Suggested change
ScanOpT>;
using default_policy_selector = policy_selector_from_types<InputValueT, OutputValueT, AccumT, OffsetT, ScanOpT>;
using policy_selector_t = ::cuda::std::execution::
__query_result_or_t<TuningEnvT, detail::scan::get_tuning_query_t, default_policy_selector>;
// Unsigned integer type for global offsets
using offset_t = detail::choose_offset_t<NumItemsT>;
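The "query result or default" lookup in the suggestion can be modeled with plain C++17 detection; this is a hypothetical stand-in for `::cuda::std::execution::__query_result_or_t` (all type names here are illustrative): if the tuning environment answers the query, use its result, otherwise fall back to the default selector.

```cpp
#include <cassert>
#include <type_traits>

struct default_selector
{
  static constexpr int block_threads = 128;
};

// Primary template: the environment does not answer the query.
template <typename Env, typename = void>
struct query_result_or
{
  using type = default_selector;
};

// Specialization: the environment provides a nested `tuning` answer.
template <typename Env>
struct query_result_or<Env, std::void_t<typename Env::tuning>>
{
  using type = typename Env::tuning;
};

struct empty_env
{};

struct custom_selector
{
  static constexpr int block_threads = 256;
};

struct tuned_env
{
  using tuning = custom_selector;
};
```

Under this model, `query_result_or<empty_env>::type` is the default selector, while `query_result_or<tuned_env>::type` is the user-provided one, which matches the fallback behavior the suggested change relies on.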

@bernhardmgruber
Contributor

I finished another review and I only have minor comments, except for the wish to retain the static assert that one stage fits into SMEM. I am now waiting for confirmation that we don't see SASS changes.


@github-actions
Contributor
github-actions bot commented Mar 5, 2026

😬 CI Workflow Results

🟥 Finished in 4h 07m: Pass: 99%/255 | Total: 9d 11h | Max: 3h 48m | Hits: 66%/156146

See results here.


Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

  • Implement the new tuning API for DeviceScan
  • Refactor cccl.c scan to use tuning API
  • Make warpspeed scan work in CCCL.C

3 participants
