[CPU] OpGroupNonUniformBallot returns wrong result when kernel uses BuiltIn LocalInvocationId · Issue #901 · intel/compute-runtime · GitHub

@pvelesko

Summary

OpGroupNonUniformBallot returns 0x1 instead of 0xFFFFFFFF on the Intel CPU OpenCL runtime when the same SPIR-V kernel also loads BuiltIn LocalInvocationId (v3ulong). Replacing LocalInvocationId with SubgroupLocalInvocationId (scalar uint) — with no other changes — produces the correct result.

The ballot does not use the builtin value. The mere presence of OpLoad from a LocalInvocationId variable in the same kernel corrupts the ballot result.

The same kernels produce the correct result on Intel GPU runtimes (Arc A770 dGPU, UHD 770 iGPU). Only the CPU runtime is affected.
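For reference, the expected value follows from the ballot semantics: each active invocation whose predicate is true contributes one bit at its subgroup-local index. The following is a minimal Python model of the first result component, assuming a subgroup of 32 invocations. The function name `ballot_x` is hypothetical and not part of the reproducer, and the second call only produces a mask consistent with the observed value; it is not a claim about what the CPU runtime does internally.

```python
def ballot_x(predicates):
    """Model the first 32-bit component of an OpGroupNonUniformBallot result.

    predicates[i] is the predicate of the invocation with subgroup-local ID i.
    None marks an invocation that does not participate. Only lanes 0..31
    land in component x of the 4 x uint result.
    """
    mask = 0
    for lane, pred in enumerate(predicates[:32]):
        if pred:  # non-participating (None) and false lanes contribute a 0 bit
            mask |= 1 << lane
    return mask

# All 32 lanes participate with a true predicate, as in the kernels in this report.
full = ballot_x([True] * 32)
# A mask consistent with the observed wrong value: only lane 0 contributes.
lane0_only = ballot_x([True] + [None] * 31)
print(hex(full), hex(lane0_only))
```

Under this model, 0x1 corresponds to a ballot in which lane 0 is the only participant.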

Environment

  • CPU: 13th Gen Intel Core i9-13900K
  • Runtime: Intel OpenCL CPU runtime, OpenCL 3.0 (Build 0)
  • OS: Ubuntu (Linux 6.11.0-29-generic)

Reproducer

See attached intel-cpu-ballot-bug.zip. Extract and run:

make clean && make run

Requires: OpenCL headers/library, spirv-as (from SPIRV-Tools).

Root cause

Two SPIR-V kernels are identical except for which builtin they use for the thread ID check. Both call OpGroupNonUniformBallot with a true predicate, and both declare OpExecutionMode SubgroupSize 32.

CORRECT — uses SubgroupLocalInvocationId (scalar uint):

; ... (same preamble) ...
OpEntryPoint Kernel %main "test_ballot" %__spirv_BuiltInSubgroupLocalInvocationId
OpExecutionMode %main SubgroupSize 32
OpDecorate %__spirv_BuiltInSubgroupLocalInvocationId BuiltIn SubgroupLocalInvocationId

%main = OpFunction %void None %kernel_ty
 %out = OpFunctionParameter %ptr_cw_ulong
%entry = OpLabel
%ballot = OpGroupNonUniformBallot %v4uint %uint_3 %true    ; <-- ballot(true)
%ball_x = OpCompositeExtract %uint %ballot 0
%ballot_r = OpUConvert %ulong %ball_x
   %lid = OpLoad %uint %__spirv_BuiltInSubgroupLocalInvocationId  ; <-- scalar uint
%is_lid0 = OpIEqual %bool %lid %uint_0
           OpSelectionMerge %merge None
           OpBranchConditional %is_lid0 %then %merge
 %then  = OpLabel
           OpStore %out %ballot_r
           OpBranch %merge
%merge  = OpLabel
           OpReturn
           OpFunctionEnd

Result: ballot = 0xFFFFFFFF

WRONG — uses LocalInvocationId (v3ulong):

OpEntryPoint Kernel %main "test_ballot" %__spirv_BuiltInLocalInvocationId
OpExecutionMode %main SubgroupSize 32
OpDecorate %__spirv_BuiltInLocalInvocationId BuiltIn LocalInvocationId

%main = OpFunction %void None %kernel_ty
 %out = OpFunctionParameter %ptr_cw_ulong
%entry = OpLabel
%ballot = OpGroupNonUniformBallot %v4uint %uint_3 %true    ; <-- ballot(true)
%ball_x = OpCompositeExtract %uint %ballot 0
%ballot_r = OpUConvert %ulong %ball_x
%lid_v = OpLoad %v3ulong %__spirv_BuiltInLocalInvocationId  ; <-- v3ulong
%lid_x = OpCompositeExtract %ulong %lid_v 0
  %lid = OpUConvert %uint %lid_x
%is_lid0 = OpIEqual %bool %lid %uint_0
           OpSelectionMerge %merge None
           OpBranchConditional %is_lid0 %then %merge
 %then  = OpLabel
           OpStore %out %ballot_r
           OpBranch %merge
%merge  = OpLabel
           OpReturn
           OpFunctionEnd

Result: ballot = 0x00000001 (only lane 0's bit set)
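To read a reported mask back as lane indices, a small helper along these lines can be used. This is a sketch: `lanes_set` is a hypothetical name, not part of the attached reproducer, and it assumes the mask is the first 32-bit component of the ballot result.

```python
def lanes_set(mask, subgroup_size=32):
    """Return the subgroup-local indices whose bit is set in a ballot mask."""
    return [lane for lane in range(subgroup_size) if (mask >> lane) & 1]

print(lanes_set(0x00000001))       # lanes set in the wrong result
print(len(lanes_set(0xFFFFFFFF)))  # number of lanes set in the correct result
```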

Expected output

CPU: 13th Gen Intel(R) Core(TM) i9-13900K (Intel(R) OpenCL)
  SubgroupLocalInvocationId (scalar uint)       ballot=0xffffffff  expected=0xffffffff  CORRECT
  LocalInvocationId (v3ulong)                   ballot=0xffffffff  expected=0xffffffff  CORRECT

Actual output

CPU: 13th Gen Intel(R) Core(TM) i9-13900K (Intel(R) OpenCL)
  SubgroupLocalInvocationId (scalar uint)       ballot=0xffffffff  expected=0xffffffff  CORRECT
  LocalInvocationId (v3ulong)                   ballot=0x00000001  expected=0xffffffff  WRONG

Both tests pass on GPU:

GPU: Intel(R) Arc(TM) A770 Graphics (Intel(R) OpenCL Graphics)
  SubgroupLocalInvocationId (scalar uint)       ballot=0xffffffff  expected=0xffffffff  CORRECT
  LocalInvocationId (v3ulong)                   ballot=0xffffffff  expected=0xffffffff  CORRECT

GPU: Intel(R) UHD Graphics 770 (Intel(R) OpenCL Graphics)
  SubgroupLocalInvocationId (scalar uint)       ballot=0xffffffff  expected=0xffffffff  CORRECT
  LocalInvocationId (v3ulong)                   ballot=0xffffffff  expected=0xffffffff  CORRECT
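The output lines above follow a fixed `ballot=... expected=...` layout, so a run can be checked mechanically. The following is a minimal sketch, assuming that layout holds for every test line. The regex and the function name `failing_tests` are illustrative and not part of the reproducer.

```python
import re

# Matches lines such as:
#   "  LocalInvocationId (v3ulong)   ballot=0x00000001  expected=0xffffffff  WRONG"
LINE = re.compile(
    r"^\s*(?P<name>.+?)\s+ballot=(?P<got>0x[0-9a-fA-F]+)"
    r"\s+expected=(?P<exp>0x[0-9a-fA-F]+)"
)

def failing_tests(output):
    """Return names of tests whose ballot value differs from the expected one."""
    failures = []
    for line in output.splitlines():
        m = LINE.match(line)
        if m and int(m["got"], 16) != int(m["exp"], 16):
            failures.append(m["name"])
    return failures

cpu_output = """\
CPU: 13th Gen Intel(R) Core(TM) i9-13900K (Intel(R) OpenCL)
  SubgroupLocalInvocationId (scalar uint)       ballot=0xffffffff  expected=0xffffffff  CORRECT
  LocalInvocationId (v3ulong)                   ballot=0x00000001  expected=0xffffffff  WRONG
"""
print(failing_tests(cpu_output))
```

Comparing the parsed values rather than the trailing CORRECT/WRONG label keeps the check independent of how the reproducer words its verdict.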

Additional observations from bisection

  • Declaring LocalInvocationId without loading it does NOT trigger the bug
  • Loading LocalInvocationId triggers the bug whether the load comes before or after the ballot call
  • Cross-module linking is NOT required (single self-contained module reproduces it)
  • No extra capabilities, entry points, or complex structure needed — just OpGroupNonUniformBallot + OpLoad from LocalInvocationId in the same kernel

intel-cpu-ballot-bug.zip
