Description
Summary
OpGroupNonUniformBallot returns 0x1 instead of 0xFFFFFFFF on the Intel CPU OpenCL runtime when the same SPIR-V kernel also loads BuiltIn LocalInvocationId (v3ulong). Replacing LocalInvocationId with SubgroupLocalInvocationId (scalar uint) — with no other changes — produces the correct result.
The ballot does not use the builtin value. The mere presence of OpLoad from a LocalInvocationId variable in the same kernel corrupts the ballot result.
The same kernels work correctly on Intel GPU runtimes (Arc A770 dGPU, UHD 770 iGPU). The bug occurs only on the CPU runtime.
Environment
- CPU: 13th Gen Intel Core i9-13900K
- Runtime: Intel OpenCL CPU runtime, OpenCL 3.0 (Build 0)
- OS: Ubuntu (Linux 6.11.0-29-generic)
Reproducer
See attached intel-cpu-ballot-bug.zip. Extract and run:
make clean && make run
Requires: OpenCL headers/library, spirv-as (from SPIRV-Tools).
Root cause
Two SPIR-V kernels are identical except for the builtin used to select the invocation that writes the result. Both call OpGroupNonUniformBallot with a constant true predicate and declare OpExecutionMode SubgroupSize 32.
CORRECT — uses SubgroupLocalInvocationId (scalar uint):
; ... (same preamble) ...
OpEntryPoint Kernel %main "test_ballot" %__spirv_BuiltInSubgroupLocalInvocationId
OpExecutionMode %main SubgroupSize 32
OpDecorate %__spirv_BuiltInSubgroupLocalInvocationId BuiltIn SubgroupLocalInvocationId
%main = OpFunction %void None %kernel_ty
%out = OpFunctionParameter %ptr_cw_ulong
%entry = OpLabel
%ballot = OpGroupNonUniformBallot %v4uint %uint_3 %true ; <-- ballot(true)
%ball_x = OpCompositeExtract %uint %ballot 0
%ballot_r = OpUConvert %ulong %ball_x
%lid = OpLoad %uint %__spirv_BuiltInSubgroupLocalInvocationId ; <-- scalar uint
%is_lid0 = OpIEqual %bool %lid %uint_0
OpSelectionMerge %merge None
OpBranchConditional %is_lid0 %then %merge
%then = OpLabel
OpStore %out %ballot_r
OpBranch %merge
%merge = OpLabel
OpReturn
OpFunctionEnd
Result: ballot = 0xFFFFFFFF
WRONG — uses LocalInvocationId (v3ulong):
OpEntryPoint Kernel %main "test_ballot" %__spirv_BuiltInLocalInvocationId
OpExecutionMode %main SubgroupSize 32
OpDecorate %__spirv_BuiltInLocalInvocationId BuiltIn LocalInvocationId
%main = OpFunction %void None %kernel_ty
%out = OpFunctionParameter %ptr_cw_ulong
%entry = OpLabel
%ballot = OpGroupNonUniformBallot %v4uint %uint_3 %true ; <-- ballot(true)
%ball_x = OpCompositeExtract %uint %ballot 0
%ballot_r = OpUConvert %ulong %ball_x
%lid_v = OpLoad %v3ulong %__spirv_BuiltInLocalInvocationId ; <-- v3ulong
%lid_x = OpCompositeExtract %ulong %lid_v 0
%lid = OpUConvert %uint %lid_x
%is_lid0 = OpIEqual %bool %lid %uint_0
OpSelectionMerge %merge None
OpBranchConditional %is_lid0 %then %merge
%then = OpLabel
OpStore %out %ballot_r
OpBranch %merge
%merge = OpLabel
OpReturn
OpFunctionEnd
Result: ballot = 0x00000001 (only lane 0's bit set)
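To make the two results concrete, here is a minimal Python sketch that decodes the first 32-bit ballot word into lane indices. It assumes bit i of component 0 corresponds to subgroup lane i, and the helper names `full_mask` and `lanes_set` are illustrative only (they are not part of the attached reproducer).

```python
from typing import List


def full_mask(subgroup_size: int) -> int:
    """Ballot word expected when every lane votes true (sizes 1..32 only)."""
    if not 1 <= subgroup_size <= 32:
        raise ValueError("only the first 32-bit ballot component is modelled")
    return (1 << subgroup_size) - 1


def lanes_set(ballot: int) -> List[int]:
    """Indices of the lanes whose bit is set in the low 32-bit ballot word."""
    return [lane for lane in range(32) if (ballot >> lane) & 1]


if __name__ == "__main__":
    # Value a subgroup of 32 lanes should report for ballot(true).
    print(hex(full_mask(32)))
    # Lanes visible in the wrong result reported above.
    print(lanes_set(0x00000001))
    # Number of lanes visible in the correct result.
    print(len(lanes_set(0xFFFFFFFF)))
```

Under that assumption, the wrong result decodes to a single voting lane (lane 0), whereas the correct result decodes to all 32 lanes.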
Expected output
CPU: 13th Gen Intel(R) Core(TM) i9-13900K (Intel(R) OpenCL)
SubgroupLocalInvocationId (scalar uint) ballot=0xffffffff expected=0xffffffff CORRECT
LocalInvocationId (v3ulong) ballot=0xffffffff expected=0xffffffff CORRECT
Actual output
CPU: 13th Gen Intel(R) Core(TM) i9-13900K (Intel(R) OpenCL)
SubgroupLocalInvocationId (scalar uint) ballot=0xffffffff expected=0xffffffff CORRECT
LocalInvocationId (v3ulong) ballot=0x00000001 expected=0xffffffff WRONG
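For scripted checks (for example when comparing runtimes), the result lines can be parsed mechanically. The sketch below assumes the `ballot=... expected=... CORRECT|WRONG` line format shown in this report; the function name `failing_tests` is hypothetical and not part of the attached reproducer.

```python
import re
from typing import List, Tuple

# Matches lines such as:
#   LocalInvocationId (v3ulong) ballot=0x00000001 expected=0xffffffff WRONG
RESULT_LINE = re.compile(
    r"^(?P<name>.+?)\s+ballot=(?P<ballot>0x[0-9a-fA-F]+)"
    r"\s+expected=(?P<expected>0x[0-9a-fA-F]+)\s+(?:CORRECT|WRONG)\s*$"
)


def failing_tests(output: str) -> List[Tuple[str, int, int]]:
    """Return (test name, ballot, expected) for each line whose values differ."""
    failures = []
    for line in output.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match is None:
            continue  # device header lines and blanks are skipped
        ballot = int(match.group("ballot"), 16)
        expected = int(match.group("expected"), 16)
        if ballot != expected:
            failures.append((match.group("name"), ballot, expected))
    return failures


if __name__ == "__main__":
    sample = (
        "CPU: 13th Gen Intel(R) Core(TM) i9-13900K (Intel(R) OpenCL)\n"
        "SubgroupLocalInvocationId (scalar uint) ballot=0xffffffff expected=0xffffffff CORRECT\n"
        "LocalInvocationId (v3ulong) ballot=0x00000001 expected=0xffffffff WRONG\n"
    )
    for name, ballot, expected in failing_tests(sample):
        print(f"{name}: got {ballot:#010x}, expected {expected:#010x}")
```

The comparison uses the numeric values rather than the trailing CORRECT/WRONG label, so it does not depend on the label being right.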
Both tests pass on GPU:
GPU: Intel(R) Arc(TM) A770 Graphics (Intel(R) OpenCL Graphics)
SubgroupLocalInvocationId (scalar uint) ballot=0xffffffff expected=0xffffffff CORRECT
LocalInvocationId (v3ulong) ballot=0xffffffff expected=0xffffffff CORRECT
GPU: Intel(R) UHD Graphics 770 (Intel(R) OpenCL Graphics)
SubgroupLocalInvocationId (scalar uint) ballot=0xffffffff expected=0xffffffff CORRECT
LocalInvocationId (v3ulong) ballot=0xffffffff expected=0xffffffff CORRECT
Additional observations from bisection
- Declaring LocalInvocationId without loading it does NOT trigger the bug
- Loading LocalInvocationId triggers the bug whether the load comes before or after the ballot call
- Cross-module linking is NOT required (single self-contained module reproduces it)
- No extra capabilities, entry points, or complex structure needed: just OpGroupNonUniformBallot + OpLoad from LocalInvocationId in the same kernel