more small-array performance improvements by juliantaylor · Pull Request #4904 · numpy/numpy · GitHub

more small-array performance improvements #4904

Closed
wants to merge 4 commits into from

juliantaylor
Contributor

Some more small-array performance improvements; please see the commits for details.
The changes are somewhat hackish and could probably also be resolved by a more extensive overhaul of the ufunc object. However, they are very simple, and should we decide to do an overhaul, it's good to have better performance in our benchmarks for comparison.

Small reductions (O(100) elements) improve on my machine by ~20% and binary ufuncs by ~10%.

  • add.reduce(ones(100)) from 1.80us to 1.5us
  • add(ones(100), ones(100)) from 0.77us to 0.7us

@@ -4355,7 +4368,8 @@ PyUFunc_FromFuncAndDataAndSignature(PyUFuncGenericFunction *func, void **data,
 }
 ufunc->doc = doc;

-ufunc->op_flags = PyArray_malloc(sizeof(npy_uint32)*ufunc->nargs);
+ufunc->op_flags = PyArray_malloc(sizeof(npy_uint32)*(ufunc->nargs +
+NPY_NTYPES));
Member

Can we do this in a cleaner way? E.g.

typedef struct {
    PyUFuncObject public;
    /* extra fields here */
} PyUFuncObjectPrivate;

and then they can be cast back and forth to each other so long as we're careful to always allocate the full thing? (This does assume that subclassing ufunc is illegal though.)

Or if that's not doable for some reason, at least a few helpers to get/set this table?
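To make the suggestion concrete, here is a minimal sketch of the wrapper-struct idea with get/set helpers. Everything in it is a stand-in chosen for illustration: `PublicUFunc` is a two-field placeholder rather than the real `PyUFuncObject`, and `PrivateUFunc`, `ufunc_new`, `ufunc_get_jump`, `ufunc_set_jump`, `ufunc_free`, and `N_TYPES` are hypothetical names. It relies only on the C guarantee that a pointer to a struct can be converted to a pointer to its first member and back.

```c
#include <stdlib.h>

#define N_TYPES 4  /* hypothetical number of basic types */

/* Placeholder for the public struct; NOT the real PyUFuncObject layout. */
typedef struct {
    int nargs;
    const char *name;
} PublicUFunc;

/* Hypothetical private wrapper: the public part must be the first member. */
typedef struct {
    PublicUFunc pub;
    int jump_table[N_TYPES];  /* extra field hidden from users */
} PrivateUFunc;

/* Always allocate the full private object and hand out the public view. */
PublicUFunc *ufunc_new(int nargs, const char *name)
{
    PrivateUFunc *priv = calloc(1, sizeof(*priv));
    if (priv == NULL) {
        return NULL;
    }
    priv->pub.nargs = nargs;
    priv->pub.name = name;
    for (int i = 0; i < N_TYPES; i++) {
        priv->jump_table[i] = -1;  /* -1 means "no entry" */
    }
    return &priv->pub;
}

/* Helpers so callers never touch the private layout directly. */
int ufunc_get_jump(const PublicUFunc *u, int type_num)
{
    return ((const PrivateUFunc *)u)->jump_table[type_num];
}

void ufunc_set_jump(PublicUFunc *u, int type_num, int loop_index)
{
    ((PrivateUFunc *)u)->jump_table[type_num] = loop_index;
}

void ufunc_free(PublicUFunc *u)
{
    free(u);  /* same address as the enclosing PrivateUFunc */
}
```

The cast is only valid when the object really was allocated as a `PrivateUFunc`, which is why every object in this sketch is created through `ufunc_new`.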

Contributor Author

Hm, thinking about it, this approach does not really help, as we would also need some marker to indicate that the field exists.
Actually, that object has been extended a few times (e.g. in 1.7), so I guess nobody actually relies on its size not growing. So we could just stuff a pointer to a private object into the end.

Contributor Author

Hm, that also won't work, as we might get passed some user copy of the old structure...
It would also break forward compatibility with cython. We don't provide that anyway, but it would be nice to avoid it for something that might turn out to be temporary.

I guess a second independent lookup table based on the code generator would work.

@charris
Member
charris commented Aug 23, 2014

@juliantaylor Are you OK with this in its current state?

@juliantaylor
Contributor Author

not in the current state, it breaks ABI anyway so we might as well increase the structure size (causing more problems for cython forward compat)
we need to do something about the constant ABI breaks, e.g. deprecating usage of the structure.

@charris
Member
charris commented Aug 23, 2014

We could do more to hide the structure, but such things take time as we need to coordinate with cython. Which is to say we should try to plan and get started early ;)

I'm thinking of separating out the numpy ABI into something more like a library. What are your thoughts on that?

@juliantaylor
Contributor Author

could we use the check_return which seems to be unused to flag extension? possibly a special iter flag would work too? or should we just not care about breaking the abi of that structure? we seem to have done so lots of times in the last couple releases.

Check first character of ufunc name before attempting a full string
compare. This improves scalar operations performance slightly.
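As a rough illustration of the first-character check described in this commit message, the sketch below rejects most mismatches with a single byte comparison before falling back to `strcmp`. The function name `name_equals` is hypothetical and is not taken from the NumPy sources.

```c
#include <string.h>

/*
 * Return 1 if `name` equals `target`, else 0.
 * The cheap first-character test runs first; strcmp is only
 * called when the first characters already agree.
 */
int name_equals(const char *name, const char *target)
{
    return name[0] == target[0] && strcmp(name, target) == 0;
}
```

The gain depends on how often names differ in their first character, so this is a micro-optimization that only pays off on very hot paths such as scalar operations.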
The creation and parsing of the type tuple is slower than the array
result path used by binary ufuncs. For the most common reductions add a
fast path skipping the type tuple creation and sending it through the
array result path.
Improves small reduction performance of these types by 5%-10%.

updating unconditionally caused a ~5% performance regression for scalar
operations.
Add a jump table to the first entry of each of the basic types to the
ufunc object and use it to skip over uninteresting inner loops.
For the add signatures this skips 13 loop iterations and improves scalar
performance by 10%.

The jump table is placed behind a dynamically allocated object already
present in the ufunc object so the ABI is preserved.
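The jump-table idea can be sketched as follows, under simplifying assumptions: loops are matched on an exact input type rather than through NumPy's casting rules, there are only four types, and all names (`LoopSig`, `build_jump_table`, `find_loop`, the `T_*` enum) are hypothetical. The `steps` counter exists only to show how many loop entries the search visits.

```c
/* Hypothetical, reduced set of basic types. */
enum { T_BOOL, T_INT, T_FLOAT, T_DOUBLE, N_TYPES };

/* Simplified inner-loop signature: one input type, one output type. */
typedef struct {
    int in_type;
    int out_type;
} LoopSig;

/* Record, for each type, the index of the first loop that takes it. */
void build_jump_table(const LoopSig *loops, int nloops, int *jump)
{
    for (int t = 0; t < N_TYPES; t++) {
        jump[t] = -1;  /* -1 means no loop registered for this type */
    }
    for (int i = 0; i < nloops; i++) {
        int t = loops[i].in_type;
        if (jump[t] < 0) {
            jump[t] = i;
        }
    }
}

/*
 * Linear search for a loop accepting `in_type`, starting at `start`.
 * `*steps` is incremented once per visited entry.
 * Returns the loop index, or -1 if none matches.
 */
int find_loop(const LoopSig *loops, int nloops, int start,
              int in_type, int *steps)
{
    for (int i = start; i < nloops; i++) {
        (*steps)++;
        if (loops[i].in_type == in_type) {
            return i;
        }
    }
    return -1;
}
```

Passing `jump[in_type]` as `start` (when it is non-negative) lets the search begin at the first relevant entry instead of index 0, which is the effect the commit message describes for the add signatures.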
@juliantaylor
Contributor Author

As we are going to be less strict on preserving the ufunc structure, this can probably be considered for merging.

@njsmith
Member
njsmith commented Oct 9, 2015

Maybe it would be best to finish off that NEP first and then figure out what to do here? Is this urgent?

@charris
Member
charris commented Dec 10, 2015

Ping folks who are in the conversation. We should either close this or take it forward.

@charris
Member
charris commented Mar 7, 2016

Ping.

@mattip
Member
mattip commented Feb 14, 2019

Can we close this?

@mattip
Member
mattip commented Apr 26, 2019

Closing. Please reopen if you wish to move it forward.

@mattip mattip closed this Apr 26, 2019
5 participants