Use Highway with Dynamic Dispatch for some types in Absolute#24384
Mousius wants to merge 2 commits into numpy:main
Demonstrates how this fits together with the user toggles within the existing system.
NPY_NO_EXPORT void
FLOAT_absolute(char **args, npy_intp const *dimensions, npy_intp const *steps, void *NPY_UNUSED(func))
{
    static auto dispatcher = HWY_DYNAMIC_DISPATCH(FLOAT_SuperAbsolute);
Suggested change:
-    static auto dispatcher = HWY_DYNAMIC_DISPATCH(FLOAT_SuperAbsolute);
+    auto dispatcher = HWY_DYNAMIC_DISPATCH(FLOAT_SuperAbsolute);
This looks simple, but I'm afraid some C++ evilness may be involved. Static initialization requires thread safety, so this call will usually be wrapped in a thread guard, which makes it slower than using a local variable if Highway caches its CPUID calls the way NumPy does.
Here is some pseudocode:
static the_deduced_type dispatcher;
static bool dispatcher_once = false
call cx_guard_lock
if not dispatcher_once
    dispatcher = HWY_DYNAMIC_DISPATCH(FLOAT_SuperAbsolute)
    dispatcher_once = true
endif
call cx_guard_unlock
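To make the trade-off above concrete, here is a minimal C++ sketch that contrasts a function-local static with a plain local. It does not use Highway: `kernel_fn`, `choose_kernel`, `g_lookups` and both `abs_with_*` functions are hypothetical stand-ins, and `choose_kernel` only imitates a cached lookup. With a non-constant initializer, a function-local static is typically compiled with a hidden guard check on every call (and locking during the first initialization), whereas the plain local simply repeats the lookup.

```cpp
// Hypothetical stand-ins; none of these names come from Highway or NumPy.
using kernel_fn = float (*)(float);

static float abs_scalar(float x) { return x < 0.0f ? -x : x; }

static int g_lookups = 0;  // counts how often the lookup runs

// Imitates a cheap, already-cached dispatch lookup.
static kernel_fn choose_kernel() {
    ++g_lookups;
    return &abs_scalar;
}

// Variant A: function-local static. The initializer is not a constant
// expression, so it runs once, on the first call, behind a
// compiler-generated thread-safe guard.
float abs_with_static(float x) {
    static kernel_fn fn = choose_kernel();
    return fn(x);
}

// Variant B: plain local. No guard; the lookup runs on every call.
float abs_with_local(float x) {
    kernel_fn fn = choose_kernel();
    return fn(x);
}
```

Calling `abs_with_static` twice runs the lookup once, while calling `abs_with_local` twice runs it twice. Which variant is faster depends on how cheap the lookup is compared with the guard check, which is exactly the question raised above.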
I'm guessing an even better way to do this would be to emit the HWY_DYNAMIC_DISPATCH calls in the generated ufunc code.
// Generated code
abs_functions[4] = HWY_DYNAMIC_DISPATCH(FLOAT_SuperAbsolute);
Which could also generate the wrapper FLOAT_SuperAbsolute to avoid having to write it twice.
Good point about the static. In JPEG XL we avoided this by initializing to a constant and then detecting the CPU on the first call.
DYNAMIC_DISPATCH() boils down to table[clz(globalMask())]. This globalMask is initialized to the constant 1, so that the first clz returns 0, which selects the special function pointer that first calls CPU detection. After that, globalMask holds the actual CPU target bitfield.
One consequence is that it would actually be better to call DYNAMIC_DISPATCH again on each call rather than use a static initializer, because the static would be set to the special/wrapper function (if we haven't called ChosenTarget::Update anywhere else to pre-initialize). So a normal local variable, as Sayed proposes, would be fine.
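The mechanism described above can be sketched in plain C++ as follows. This is an illustration under stated assumptions, not Highway's implementation: every name (`g_table`, `g_mask`, `detect_targets`, `dispatch`, `abs_dispatched`) is hypothetical, the detected target bits are made up, and the sketch picks the lowest set bit so that a mask of 1 maps to slot 0. The exact bit operation Highway uses is not taken from the source.

```cpp
#include <atomic>
#include <cstdint>

using abs_fn = float (*)(float);

static float abs_init(float x);  // slot 0: runs detection, then forwards
static float abs_wide(float x) { return x < 0.0f ? -x : x; }      // stand-in for the best target
static float abs_baseline(float x) { return x < 0.0f ? -x : x; }  // stand-in for the fallback

// Slot 0 is the special detection wrapper; slots 1.. are real kernels.
static abs_fn const g_table[] = {&abs_init, &abs_wide, &abs_baseline};

// Constant-initialized to 1, so no runtime static initializer is needed.
static std::atomic<std::uint32_t> g_mask{1u};
static int g_detections = 0;  // counts detection runs, for illustration only

// Index of the lowest set bit; the mask is assumed to be nonzero.
static unsigned lowest_set_bit(std::uint32_t m) {
    unsigned i = 0;
    while ((m & 1u) == 0u) { m >>= 1; ++i; }
    return i;
}

// Hypothetical detection result: bits 1 and 2 set, bit 0 cleared.
static std::uint32_t detect_targets() { ++g_detections; return 0x6u; }

static abs_fn dispatch() {
    return g_table[lowest_set_bit(g_mask.load(std::memory_order_acquire))];
}

static float abs_init(float x) {
    g_mask.store(detect_targets(), std::memory_order_release);
    return dispatch()(x);  // forward to the kernel that is now selected
}

float abs_dispatched(float x) {
    abs_fn fn = dispatch();  // plain local, looked up again on every call
    return fn(x);
}
```

Before the first call, `dispatch()` returns the wrapper `abs_init`, so a pointer cached at that moment would keep routing through the wrapper. After one call to `abs_dispatched`, `dispatch()` returns `abs_wide` and detection does not run again.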
This demonstrates how Highway can fit into the project, using the Highway dynamic dispatcher, including querying for detected features via the Highway API to mirror the existing NumPy APIs. I've only done a few data types to demonstrate the functionality.
Loads and stores can currently only be contiguous rather than supporting all potential layouts. I'm waiting for the Scatter equivalent of google/highway@c6c09c4; once that is available, supporting the other layouts should be trivial.
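For reference, a contiguity check on the step sizes could look roughly like the sketch below. It is an assumption about how such a restriction might be handled, not the code in this PR: `float_absolute_sketch` is a hypothetical name, `std::ptrdiff_t` stands in for `npy_intp`, a tight scalar loop stands in for the vector kernel, and the strided fallback is my own addition for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <cstring>

// Returns true when the contiguous fast path was taken.
// args[0] is the input buffer, args[1] the output buffer; steps are in bytes.
bool float_absolute_sketch(char **args, const std::ptrdiff_t *dimensions,
                           const std::ptrdiff_t *steps) {
    const char *src = args[0];
    char *dst = args[1];
    const std::ptrdiff_t len = dimensions[0];
    const std::ptrdiff_t elem = static_cast<std::ptrdiff_t>(sizeof(float));

    if (steps[0] == elem && steps[1] == elem) {
        // Contiguous layout: this is where a vector kernel would run.
        const float *in = reinterpret_cast<const float *>(src);
        float *out = reinterpret_cast<float *>(dst);
        for (std::ptrdiff_t i = 0; i < len; ++i) {
            out[i] = std::fabs(in[i]);
        }
        return true;
    }

    // Strided layout: scalar fallback, one element at a time.
    for (std::ptrdiff_t i = 0; i < len; ++i, src += steps[0], dst += steps[1]) {
        float v;
        std::memcpy(&v, src, sizeof v);  // avoids unaligned or aliasing reads
        v = std::fabs(v);
        std::memcpy(dst, &v, sizeof v);
    }
    return false;
}
```

With input and output steps both equal to `sizeof(float)` the fast path is taken; any other step combination goes through the scalar loop until strided loads and stores are available.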