I'm working on oneAPI.jl, which provides Julia support for Intel GPUs through Level Zero. Occasionally, users report an opaque ZE_RESULT_ERROR_UNINITIALIZED when we call zeInit during loading of oneAPI.jl. This error is unhelpful, and because initialization itself failed, it is impossible to use the Level Zero APIs to figure out what's actually happening. For example, I've run into:
- users not having a (supported) GPU
- restrictive permissions on /dev/dri
- conflicting library versions being picked up (e.g. a redistributed libze_loader vs. the system libze_tracing_layer)
Apart from the last case, I wouldn't expect the loader to fail to initialize. I would expect it to still allow iterating drivers (why else have this abstraction?) and, ideally, to make it possible to determine why there are no devices. Currently, we typically find this out only after a painstaking remote debugging session using strace or LD_DEBUG.
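For the /dev/dri permissions case, a quick check can sometimes replace the strace session. The following is a minimal sketch, not something oneAPI.jl or Level Zero provides: the function name check_dri_nodes is hypothetical, and it assumes a Linux system where the GPU is exposed as card* and renderD* nodes under /dev/dri.

```python
import os
from pathlib import Path


def check_dri_nodes(dri_dir="/dev/dri"):
    """Report whether the current user can open each DRM node read/write.

    Returns None if dri_dir does not exist (no DRM devices exposed at all),
    otherwise a list of (node name, permission bits, usable) tuples.
    """
    root = Path(dri_dir)
    if not root.is_dir():
        return None
    report = []
    for node in sorted(root.iterdir()):
        # Assumption: only card* and renderD* entries are device nodes of
        # interest; by-path/ and by-id/ contain symlinks to the same nodes.
        if not node.name.startswith(("card", "renderD")):
            continue
        mode = node.stat().st_mode & 0o777
        usable = os.access(node, os.R_OK | os.W_OK)
        report.append((node.name, oct(mode), usable))
    return report
```

A result of None or an empty list points at the "no (supported) GPU" case, while entries whose last field is False point at permissions (for example, the user not being in the group that owns the node).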
Am I missing something in the API here? CUDA, for example, has error codes that indicate at least a little better what may be happening (CUDA_ERROR_NO_DEVICE, CUDA_ERROR_DEVICE_UNAVAILABLE, CUDA_ERROR_DEVICE_NOT_LICENSED, etc.).
Apart from the above API issue, I also have a concrete case where a user's system keeps on throwing ZE_RESULT_ERROR_UNINITIALIZED: JuliaGPU/oneAPI.jl#399. LD_DEBUG reveals that the correct libraries are found, and strace shows that /dev/dri nodes are successfully discovered and opened.
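To rule out the mixed-library case without reading full LD_DEBUG output, one option is to look at which libze_* objects actually ended up mapped into the process. This is only a sketch under assumptions: it relies on the Linux /proc/<pid>/maps format, and the helper names level_zero_libs and origins are hypothetical.

```python
import os


def level_zero_libs(maps_text):
    """Return the sorted, de-duplicated paths of libze_* objects that
    appear in the text of a /proc/<pid>/maps file."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (optional).
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        if "libze_" in os.path.basename(path):
            paths.add(path)
    return sorted(paths)


def origins(paths):
    """Return the distinct directories the given libraries were loaded from."""
    return sorted({os.path.dirname(p) for p in paths})
```

Passing the contents of /proc/<pid>/maps for the Julia process to level_zero_libs and then calling origins on the result gives the set of directories involved. More than one directory is not necessarily an error, but it is a hint that redistributed and system libraries are being combined.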
I've found out about some environment variables to increase logging, but the output isn't very helpful:
❯ ZE_ENABLE_LOADER_DEBUG_TRACE=1 julia ...
ZE_LOADER_DEBUG_TRACE:Loading Driver libze_intel_gpu.so.1
ZE_LOADER_DEBUG_TRACE:Loading Driver libze_intel_vpu.so.1
ZE_LOADER_DEBUG_TRACE:Load Library of libze_intel_vpu.so.1 failed with libze_intel_vpu.so.1: cannot open shared object file: No such file or directory
ZE_LOADER_DEBUG_TRACE:Load Library of libze_tracing_layer.so.1 failed with libze_tracing_layer.so.1: cannot open shared object file: No such file or directory
ZE_LOADER_DEBUG_TRACE:check_drivers(flags=0(ZE_INIT_ALL_DRIVER_TYPES_ENABLED))
ZE_LOADER_DEBUG_TRACE:init driver libze_intel_gpu.so.1 zeInit(0(ZE_INIT_ALL_DRIVER_TYPES_ENABLED)) returning ZE_RESULT_ERROR_UNINITIALIZED
ZE_LOADER_DEBUG_TRACE:Check Drivers Failed on libze_intel_gpu.so.1 , driver will be removed. zeInit failed with ZE_RESULT_ERROR_UNINITIALIZED
❯ NEOReadDebugKeys=1 PrintDebugMessages=1 PrintXeLogs=1 julia ...
...
INFO: System Info query failed!
WARNING: Failed to request OCL Turbo Boost
ZE_LOADER_DEBUG_TRACE:init driver libze_intel_gpu.so.1 zeInit(0(ZE_INIT_ALL_DRIVER_TYPES_ENABLED)) returning ZE_RESULT_ERROR_UNINITIALIZED
ZE_LOADER_DEBUG_TRACE:Check Drivers Failed on libze_intel_gpu.so.1 , driver will be removed. zeInit failed with ZE_RESULT_ERROR_UNINITIALIZED
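When collecting these traces from users, it helps to reduce them to the two facts they actually contain: which libraries failed to load, and what each driver's zeInit returned. A minimal sketch follows; the regular expressions are inferred only from the lines shown above, so treat the format as an assumption rather than a stable loader interface, and summarize_trace as a hypothetical helper.

```python
import re

PREFIX = "ZE_LOADER_DEBUG_TRACE:"
# Patterns inferred from the trace lines quoted in this issue.
LOAD_FAILED = re.compile(r"Load Library of (\S+) failed with (.+)")
INIT_RESULT = re.compile(r"init driver (\S+) zeInit\(.*\) returning (\w+)")


def summarize_trace(text):
    """Split loader trace output into load failures and zeInit results.

    Returns (missing, init): missing maps a library name to the dlopen
    error message, init maps a driver library to its zeInit result code.
    """
    missing = {}
    init = {}
    for line in text.splitlines():
        if not line.startswith(PREFIX):
            continue  # shell prompts, driver messages, other output
        body = line[len(PREFIX):]
        m = LOAD_FAILED.match(body)
        if m:
            missing[m.group(1)] = m.group(2)
            continue
        m = INIT_RESULT.match(body)
        if m:
            init[m.group(1)] = m.group(2)
    return missing, init
```

On the first trace above, this reports libze_intel_vpu.so.1 and libze_tracing_layer.so.1 as not found and libze_intel_gpu.so.1 as returning ZE_RESULT_ERROR_UNINITIALIZED, which makes clear that the GPU driver was loaded and that the failure happened inside its own zeInit.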
Any other suggestions on how to debug this would be much appreciated.