Use Highway with Dynamic Dispatch for some types in Absolute by Mousius · Pull Request #24384 · numpy/numpy · GitHub

Use Highway with Dynamic Dispatch for some types in Absolute #24384

Closed
Mousius wants to merge 2 commits into numpy:main from Mousius:hwy-spike

Mousius (Member) commented Aug 10, 2023

This demonstrates how Highway can fit into the project, using the Highway dynamic dispatcher, including querying for detected features via the Highway API to mirror the existing NumPy APIs. I've only done a few data types to demonstrate the functionality.

Loads and stores are contiguous-only for now rather than supporting all potential layouts. I'm waiting for the Scatter equivalent for google/highway@c6c09c4, after which supporting the remaining layouts should be trivial.

Mousius added 2 commits July 7, 2023 15:07
Demonstrates how this fits together with the user toggles within the
existing system.
Mousius (Member Author) commented Aug 10, 2023

NPY_NO_EXPORT void
FLOAT_absolute(char **args, npy_intp const *dimensions, npy_intp const *steps, void *NPY_UNUSED(func))
{
static auto dispatcher = HWY_DYNAMIC_DISPATCH(FLOAT_SuperAbsolute);
Member

Suggested change
- static auto dispatcher = HWY_DYNAMIC_DISPATCH(FLOAT_SuperAbsolute);
+ auto dispatcher = HWY_DYNAMIC_DISPATCH(FLOAT_SuperAbsolute);

This looks simple, but I'm afraid some C++ evilness may be involved: static initialization has to be thread-safe, so this call will usually be wrapped in a thread guard, which makes it slower than using a local variable if Highway caches CPUID calls the way NumPy does.
Here is some pseudocode:

static the_deduced_type dispatcher;
static bool dispatcher_once = false;
call cx_guard_lock
if not dispatcher_once
   dispatcher = HWY_DYNAMIC_DISPATCH(FLOAT_SuperAbsolute)
   dispatcher_once = true
endif
call cx_guard_unlock
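
To make the trade-off in the pseudocode above concrete, here is a minimal C++ sketch that does not use Highway at all. `LookupAbs` is a hypothetical stand-in for `HWY_DYNAMIC_DISPATCH(FLOAT_SuperAbsolute)`, and `AbsScalar`, `AbsWithStatic`, `AbsWithLocal` and `g_lookups` are names invented for illustration. Since C++11 a function-local static with a dynamic initializer must be initialized in a thread-safe way, so compilers typically emit a guard check on every call; the plain local simply repeats the lookup, which is only attractive under the assumption that the lookup itself is cheap.

```cpp
#include <cmath>
#include <cstddef>

namespace static_sketch {

using AbsFn = void (*)(const float* in, float* out, std::size_t n);

// Hypothetical scalar kernel standing in for whichever SIMD target is chosen.
void AbsScalar(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = std::fabs(in[i]);
}

// Counts lookups so the two variants can be told apart (demo only).
int g_lookups = 0;

// Hypothetical stand-in for HWY_DYNAMIC_DISPATCH(...); assumed to be cheap.
AbsFn LookupAbs() {
    ++g_lookups;
    return &AbsScalar;
}

// Variant A: function-local static. The initializer runs once, but because it
// is not a constant expression the compiler has to guard it for thread safety.
void AbsWithStatic(const float* in, float* out, std::size_t n) {
    static AbsFn fn = LookupAbs();
    fn(in, out, n);
}

// Variant B: plain local. No guard, but the lookup runs on every call.
void AbsWithLocal(const float* in, float* out, std::size_t n) {
    AbsFn fn = LookupAbs();
    fn(in, out, n);
}

}  // namespace static_sketch
```

Calling `AbsWithStatic` twice performs one lookup, while calling `AbsWithLocal` twice performs two. Which variant is faster in practice depends on the real cost of the lookup and of the guard, so this is something to measure rather than assume.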

Mousius (Member Author) commented Aug 13, 2023

I'm guessing an even better way to do this would be to emit the HWY_DYNAMIC_DISPATCH calls in the generated ufunc code:

// Generated code
abs_functions[4] = HWY_DYNAMIC_DISPATCH(FLOAT_SuperAbsolute);

The generator could also emit the FLOAT_SuperAbsolute wrapper, to avoid having to write it twice.
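
As a rough illustration of that idea, the sketch below fills a loop table slot directly with a kernel that already has the inner-loop signature, so no hand-written wrapper is needed. Everything here is hypothetical: `LoopFn` is only modelled loosely on the `FLOAT_absolute` signature quoted earlier (with `std::ptrdiff_t` standing in for `npy_intp`), `FloatAbsKernel` and `FillAbsTable` are invented names, and the plain `&FloatAbsKernel` marks the spot where generated code would place `HWY_DYNAMIC_DISPATCH(...)`. Like the PR, it handles contiguous data only.

```cpp
#include <cmath>
#include <cstddef>

namespace gen_sketch {

// Assumed inner-loop shape; not the actual NumPy typedef.
using LoopFn = void (*)(char** args, const std::ptrdiff_t* dimensions,
                        const std::ptrdiff_t* steps, void* func);

// Hypothetical kernel written directly with the loop signature.
// Contiguous-only: the steps argument is ignored.
void FloatAbsKernel(char** args, const std::ptrdiff_t* dimensions,
                    const std::ptrdiff_t* /*steps*/, void* /*func*/) {
    const float* in = reinterpret_cast<const float*>(args[0]);
    float* out = reinterpret_cast<float*>(args[1]);
    for (std::ptrdiff_t i = 0; i < dimensions[0]; ++i) {
        out[i] = std::fabs(in[i]);
    }
}

// Hypothetical loop table; the slot index 4 mirrors the snippet above.
LoopFn abs_functions[8] = {};

// What the generated code might boil down to. In the real proposal the
// right-hand side would be HWY_DYNAMIC_DISPATCH(FLOAT_SuperAbsolute).
void FillAbsTable() {
    abs_functions[4] = &FloatAbsKernel;
}

}  // namespace gen_sketch
```

After `FillAbsTable()` runs, calling `abs_functions[4]` with an input and output buffer of floats writes the absolute values into the output buffer.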

Contributor

Good point about the static. In JPEG XL we avoided this by initializing to a constant and then detecting the CPU on the first call.
DYNAMIC_DISPATCH() boils down to table[clz(globalMask())]. globalMask is initialized to the constant 1, so the first clz returns 0, which is the index of the special function pointer that first runs CPU detection. After that, globalMask holds the actual CPU target bitfield.
One consequence is that it is actually better to call DYNAMIC_DISPATCH again on each call than to use a static initializer, which would be set to the special wrapper function (unless ChosenTarget::Update has already been called somewhere else to pre-initialize). So a normal local variable, as Sayed proposes, would be fine.
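
The mechanism described above can be mimicked in a few lines of plain C++. This is a sketch under stated assumptions, not Highway's implementation: `g_mask`, `kTable`, `Dispatch`, `AbsDetectThenCall` and `AbsBaseline` are hypothetical names, "detection" just pretends that the target at bit 1 was found, and the index is computed with a trailing-zero count so that a mask of 1 maps to slot 0 (that choice is an assumption of the sketch).

```cpp
#include <atomic>
#include <cstdint>

namespace lazy_sketch {

using AbsFn = int (*)(int);

// Constant-initialized to 1: bit 0 means "CPU not detected yet".
std::atomic<std::uint32_t> g_mask{1u};

// Demo-only counter; not thread-safe.
int g_detect_calls = 0;

// Index of the lowest set bit; v must be nonzero.
unsigned TrailingZeros(std::uint32_t v) {
    unsigned n = 0;
    while ((v & 1u) == 0u) {
        v >>= 1;
        ++n;
    }
    return n;
}

int AbsBaseline(int x) { return x < 0 ? -x : x; }
int AbsDetectThenCall(int x);

// Slot 0 is the special wrapper; slot 1 is the only real target here.
const AbsFn kTable[] = {&AbsDetectThenCall, &AbsBaseline};

// Stand-in for DYNAMIC_DISPATCH(): a load, a bit scan and a table lookup.
AbsFn Dispatch() {
    return kTable[TrailingZeros(g_mask.load(std::memory_order_acquire))];
}

// First call lands here: detect, publish the mask, then dispatch for real.
int AbsDetectThenCall(int x) {
    ++g_detect_calls;
    g_mask.store(1u << 1, std::memory_order_release);  // pretend bit 1 was detected
    return Dispatch()(x);
}

}  // namespace lazy_sketch
```

A pointer obtained from `Dispatch()` before the first call is the wrapper, and keeping it (as a static initializer would) means every later call through it goes via the wrapper again. Calling `Dispatch()` afresh returns `AbsBaseline` directly once detection has run, which is the behaviour a plain local variable gives.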

rgommers added the "component: SIMD" label (Issues in SIMD (fast instruction sets) code or machinery) on Aug 28, 2023
Mousius closed this on Nov 29, 2023