Implement the new tuning API for `DeviceScan` by griwes · Pull Request #7565 · NVIDIA/cccl · GitHub

Implement the new tuning API for DeviceScan #7565

Open

griwes wants to merge 28 commits into NVIDIA:main from griwes:feature/new-tuning-api/scan

Conversation

@griwes
Contributor
@griwes griwes commented Feb 8, 2026

Description

Resolves #7521
Resolves #7476
Resolves #6821

Ready for review; I'm still planning to do SASS inspection in a few crucial places.

Sidenote: this exact type of task seems to fit Codex really, really well.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@griwes griwes requested review from a team as code owners February 8, 2026 05:44
@griwes griwes requested a review from shwina February 8, 2026 05:44
@griwes griwes requested a review from elstehle February 8, 2026 05:44
@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 8, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Feb 8, 2026

Contributor
@bernhardmgruber bernhardmgruber left a comment

This looks really good already! Great work!

@bernhardmgruber
Contributor

@griwes we just merged #6811, which also touches the scan tunings. This will probably create some more work for this PR. Issue #6821 also tracks making the new scan implementation available to CCCL.C. Do you think you can handle this as well?

@bernhardmgruber
Contributor

@griwes I pulled out the delay constructor refactoring in #7668 so I can better stack my refactorings on top, in case this PR takes a bit longer (sorry again for the extra work with warpspeed!).

@griwes
Contributor Author
griwes commented Feb 18, 2026

Note, the warpspeed integration is still largely untested; I've added an rtxpro6000 test job to c.parallel, and that will be the primary test for now. I'll lease a machine with a relevant GPU if that fails, or if anything looks clearly wrong to a reviewer's eye.

Edit: also seems I messed up some constexprness 😅


res.smemInOut,
res.smemNextBlockIdx,
res.smemSumExclusiveCta,
res.smemSumThreadAndWarp);
Contributor

Remark: This seems like a massive duplication of the logic allocResources does. I am extremely worried this will render the codebase brittle and unmaintainable. We should really try to come up with a way to not duplicate so much logic.

Contributor Author

This is, in fact, a reduction of the duplication. The only way to avoid duplicating these parts at all is to entirely drop the use of typed resources and use the raw resources (so the path you're highlighting here) as the only code path. I... can do that, but that appears to me to be less desirable than what I have in the PR right now.

Contributor Author

The reason it is like this is that the current code is written entirely in terms of types and their statically known sizes. We do not have that in the c.parallel code paths; we need to use runtime values. So it's either this, or all code paths use raw resources exclusively.
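To make the trade-off concrete, here is a minimal, illustrative sketch (not the PR's actual code; `align_up`, `append_resource`, and the typed wrapper are hypothetical names) of how a single raw-resource accounting routine, driven by runtime sizes and alignments, can also back the typed path where those values are statically known:

```cpp
#include <cassert>
#include <cstddef>

// Round `offset` up to `align` (align must be a power of two).
constexpr std::size_t align_up(std::size_t offset, std::size_t align)
{
  return (offset + align - 1) & ~(align - 1);
}

// Raw path: size and alignment are plain runtime values (the c.parallel case).
constexpr std::size_t append_resource(std::size_t offset, std::size_t size, std::size_t align)
{
  return align_up(offset, align) + size;
}

// Typed path: a thin wrapper that recovers the raw call from static type info,
// so both paths share the same placement logic.
template <typename T>
constexpr std::size_t append_resource_typed(std::size_t offset)
{
  return append_resource(offset, sizeof(T), alignof(T));
}
```

The point of the sketch is that only the wrapper differs between the two paths; the placement arithmetic lives in one place.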

Contributor
@bernhardmgruber bernhardmgruber left a comment

I see a lot of changes to the setup of the shared memory resources, which worry me. I am almost certain those will introduce changes to the SASS of warpspeed kernels.

Contributor
@bernhardmgruber bernhardmgruber left a comment

@griwes please try to refactor out anything that is not related to the new tuning API and ship it as another PR, so we can reduce the scope of this PR.

@griwes
Contributor Author
griwes commented Feb 25, 2026

The setup of the resources is the same from the perspective of the kernel. The only thing that changes there is the ability to use runtime-sized types with the same logic.

If we drop that from this PR, I will need to duplicate a whole bunch of code between the host and device paths w.r.t. how the resources are set up for computing the max number of stages and the dynamic shared memory needed. Splitting the core logic of allocating the phases is risky, because any update would then have to be applied to both copies in exactly the same way, which is harder to maintain.

The changes you see there are crucial to actually ensure the logic matches.
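As a rough illustration of the computation both sides need to agree on, here is a hedged sketch (all names hypothetical, not the PR's API) of deriving a per-stage shared memory footprint and the maximum stage count that fits a budget, from runtime sizes and alignments:

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t align_up(std::size_t n, std::size_t a)
{
  return (n + a - 1) & ~(a - 1);
}

// Bytes of dynamic shared memory needed for `num_stages` pipeline stages, each
// holding one value buffer and one accumulator slot (illustrative layout only).
constexpr std::size_t smem_for_stages_sketch(
  int num_stages, std::size_t value_size, std::size_t value_align,
  std::size_t accum_size, std::size_t accum_align)
{
  std::size_t total = 0;
  for (int s = 0; s < num_stages; ++s)
  {
    total = align_up(total, value_align) + value_size;
    total = align_up(total, accum_align) + accum_size;
  }
  return total;
}

// Largest stage count whose footprint still fits into `smem_budget` bytes.
constexpr int max_stages_sketch(
  std::size_t smem_budget, std::size_t value_size, std::size_t value_align,
  std::size_t accum_size, std::size_t accum_align)
{
  int stages = 0;
  while (smem_for_stages_sketch(stages + 1, value_size, value_align,
                                accum_size, accum_align) <= smem_budget)
  {
    ++stages;
  }
  return stages;
}
```

If host and device each re-implemented this loop independently, any change to the layout would have to be mirrored exactly, which is the maintenance risk described above.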

@griwes
Contributor Author
griwes commented Feb 25, 2026

The bottom line is that everything here is related to the new tuning API. Cutting out the changes around the resource setup and calculations means cutting out c.parallel, and actually getting c.parallel to be able to use the new toys is a good part of the reasoning for these changes in the first place.


@griwes
Contributor Author
griwes commented Feb 27, 2026

Last remaining real failure is SASS checks in non-scan c.parallel tests on sm120; I'll pull that out of this PR, together with the enablement of the config in CI, and post it separately.

Contributor
@bernhardmgruber bernhardmgruber left a comment

I still have to re-review the dispatch logic and the changes around the kernel, especially the refactoring to compute whether we can fit a single stage into 48KiB SMEM. Otherwise this looks pretty good already!

Ideally, we should not see any SASS changes for SM 75;80;86;90;100 for one of the benchmarks, like cub.bench.scan.sum.base. Can you please diff a SASS dump before and after the PR and confirm this? Thx!

// bottleneck. As soon as it produces a new value, it will be consumed by the
// scanStore squad, releasing the stage.
int numSumExclusiveCtaStages = 2;
const auto counts = make_scan_stage_counts(numStages);
Contributor

Suggestion: could use a structured binding
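A minimal sketch of that suggestion, assuming `make_scan_stage_counts` returns an aggregate of per-squad stage counts (the struct and member names here are hypothetical, mirroring the snippet above):

```cpp
#include <cassert>

// Hypothetical aggregate holding the per-squad stage counts.
struct scan_stage_counts
{
  int num_in_out_stages;
  int num_sum_exclusive_cta_stages;
};

constexpr scan_stage_counts make_scan_stage_counts(int num_stages)
{
  // The exclusive-CTA sum squad only ever keeps two stages in flight.
  return {num_stages, 2};
}

// With a structured binding, the counts get direct names at the call site
// instead of being accessed through a `counts.` prefix:
//   auto [numInOutStages, numSumExclusiveCtaStages] = make_scan_stage_counts(4);
```

The structured binding avoids the intermediate `counts` variable while keeping the members individually named.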

{
static constexpr int num_squads = 5;

bool valid = false;
Contributor

Remark: we should probably introduce an algorithm enum like in DeviceTransform before all the policies go public. No changes needed for now.

#if __cccl_ptx_isa >= 860
template <typename ActivePolicyT>
CUB_RUNTIME_FUNCTION _CCCL_HOST _CCCL_FORCEINLINE cudaError_t __invoke_lookahead_algorithm(ActivePolicyT)
#if _CCCL_CUDACC_AT_LEAST(12, 8)
Contributor

Question: why is this change needed? We should check for the PTX ISA we require IMO.

Comment on lines +503 to +511
int smem_size = detail::scan::smem_for_stages(
warpspeed_policy,
num_stages,
policy_selector.input_value_size,
policy_selector.input_value_alignment,
policy_selector.output_value_size,
policy_selector.output_value_alignment,
policy_selector.accum_size,
policy_selector.accum_alignment);
Contributor

Important: Please retain the compile-time check when possible. It helps a lot with development if we can turn on warpspeed unconditionally and just compile to see if we find any test failures etc.

Suggested change
int smem_size = detail::scan::smem_for_stages(
warpspeed_policy,
num_stages,
policy_selector.input_value_size,
policy_selector.input_value_alignment,
policy_selector.output_value_size,
policy_selector.output_value_alignment,
policy_selector.accum_size,
policy_selector.accum_alignment);
CUB_DETAIL_CONSTEXPR_ISH int smem_size = detail::scan::smem_for_stages(
warpspeed_policy,
num_stages,
policy_selector.input_value_size,
policy_selector.input_value_alignment,
policy_selector.output_value_size,
policy_selector.output_value_alignment,
policy_selector.accum_size,
policy_selector.accum_alignment);
CUB_DETAIL_STATIC_ISH_ASSERT(smem_size <= detail::max_smem_per_block); // this is ensured by scan_use_warpspeed

smem_for_stages is constexpr, so I think we just need to pull the policy_getter from __invoke into __invoke_lookahead_algorithm to get a CUB_DETAIL_CONSTEXPR_ISH auto active_policy = policy_getter();
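The pattern at issue can be sketched in isolation; this is a hedged model, not CUB's actual macros, using an illustrative 48 KiB limit. Because the footprint function is `constexpr`, the same expression can be checked with `static_assert` on the statically tuned path and with a runtime `assert` on the c.parallel path:

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t max_smem_per_block = 48 * 1024; // illustrative limit

// constexpr footprint function, usable in both constant and runtime contexts.
constexpr std::size_t smem_for_stages(int num_stages, std::size_t bytes_per_stage)
{
  return static_cast<std::size_t>(num_stages) * bytes_per_stage;
}

// Static path: the policy is a compile-time constant, so the check is free.
static_assert(smem_for_stages(4, 2048) <= max_smem_per_block,
              "one stage configuration must fit into shared memory");

// Dynamic (c.parallel) path: the same expression, checked at runtime instead.
std::size_t checked_smem(int num_stages, std::size_t bytes_per_stage)
{
  const std::size_t smem_size = smem_for_stages(num_stages, bytes_per_stage);
  assert(smem_size <= max_smem_per_block);
  return smem_size;
}
```

The `CUB_DETAIL_CONSTEXPR_ISH`/`CUB_DETAIL_STATIC_ISH_ASSERT` macros in the suggestion above presumably select between these two flavors depending on whether the policy is statically known.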

Comment on lines +933 to +956
return dispatch_arch(policy_selector, arch_id, [&](auto policy_getter) {
return DispatchScan<InputIteratorT,
OutputIteratorT,
ScanOpT,
InitValueT,
OffsetT,
AccumT,
EnforceInclusive,
fake_policy,
KernelSource,
KernelLauncherFactory>{
d_temp_storage,
temp_storage_bytes,
d_in,
d_out,
num_items,
scan_op,
init_value,
stream,
-1 /* ptx_version, not used actually */,
kernel_source,
launcher_factory}
.__invoke(policy_getter, policy_selector);
});
Contributor

Remark: I wonder if it would have been easier to duplicate the logic from DispatchScan into the dispatch function and strip all warpspeed logic from DispatchScan. The warpspeed scan is not on a release branch yet, so it's fine if it's not reachable through DispatchScan.

Comment on lines 105 to 139
@@ -141,6 +129,13 @@ struct scan_tuning : cub::detail::scan::tuning<scan_tuning<BlockThreads>>

using MaxPolicy = Policy500;
};

template <class InputValueT, class OutputValueT, class AccumT, class OffsetT, class ScanOpT>
using selector =
cub::detail::scan::policy_selector_from_hub<policy_hub<InputValueT, OutputValueT, AccumT, OffsetT, ScanOpT>,
InputValueT,
OutputValueT,
AccumT>;
};
Contributor

Suggestion: we can just rewrite scan_tuning to be only a policy selector (only have operator()).

Comment on lines 53 to 57
struct default_tuning : tuning<default_tuning>
{
template <typename InputValueT, typename OutputValueT, typename AccumT, typename OffsetT, typename ScanOpT>
using fn = policy_hub<InputValueT, OutputValueT, AccumT, OffsetT, ScanOpT>;
using selector = policy_selector_from_types<InputValueT, OutputValueT, AccumT, OffsetT, ScanOpT>;
};
Contributor

and drop this entirely

Comment on lines 127 to +145
using scan_tuning_t = ::cuda::std::execution::
__query_result_or_t<TuningEnvT, detail::scan::get_tuning_query_t, detail::scan::default_tuning>;

// Unsigned integer type for global offsets
using offset_t = detail::choose_offset_t<NumItemsT>;

using accum_t =
::cuda::std::__accumulator_t<ScanOpT,
cub::detail::it_value_t<InputIteratorT>,
::cuda::std::_If<::cuda::std::is_same_v<InitValueT, NullType>,
cub::detail::it_value_t<InputIteratorT>,
typename InitValueT::value_type>>;

using policy_t = typename scan_tuning_t::
template fn<detail::it_value_t<InputIteratorT>, detail::it_value_t<OutputIteratorT>, accum_t, offset_t, ScanOpT>;
using policy_selector_t = typename scan_tuning_t::template selector<
detail::it_value_t<InputIteratorT>,
detail::it_value_t<OutputIteratorT>,
accum_t,
offset_t,
ScanOpT>;
Contributor

and:

Suggested change
ScanOpT>;
using default_policy_selector = policy_selector_from_types<InputValueT, OutputValueT, AccumT, OffsetT, ScanOpT>;
using policy_selector_t = ::cuda::std::execution::
__query_result_or_t<TuningEnvT, detail::scan::get_tuning_query_t, default_policy_selector>;
// Unsigned integer type for global offsets
using offset_t = detail::choose_offset_t<NumItemsT>;
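The "query result or default" lookup in the suggestion can be modeled with plain C++17 detection; this is a hypothetical stand-in for `::cuda::std::execution::__query_result_or_t` (all type names here are illustrative): if the tuning environment answers the query, use its result, otherwise fall back to the default selector.

```cpp
#include <cassert>
#include <type_traits>

struct default_selector
{
  static constexpr int block_threads = 128;
};

// Primary template: the environment does not answer the query.
template <typename Env, typename = void>
struct query_result_or
{
  using type = default_selector;
};

// Specialization: the environment provides a nested `tuning` answer.
template <typename Env>
struct query_result_or<Env, std::void_t<typename Env::tuning>>
{
  using type = typename Env::tuning;
};

struct empty_env
{};

struct custom_selector
{
  static constexpr int block_threads = 256;
};

struct tuned_env
{
  using tuning = custom_selector;
};
```

Under this model, `query_result_or<empty_env>::type` is the default selector, while `query_result_or<tuned_env>::type` is the user-provided one, which matches the fallback behavior the suggested change relies on.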

@bernhardmgruber
Contributor

I finished another review and I only have minor comments, except for the wish to retain the static assert that one stage fits into SMEM. I am now waiting for confirmation that we don't see SASS changes.


@github-actions
Contributor
github-actions bot commented Mar 5, 2026

😬 CI Workflow Results

🟥 Finished in 4h 07m: Pass: 99%/255 | Total: 9d 11h | Max: 3h 48m | Hits: 66%/156146

See results here.


Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

  • Implement the new tuning API for DeviceScan
  • Refactor cccl.c scan to use tuning API
  • Make warpspeed scan work in CCCL.C

3 participants
