MAINT,ENH: Reorganize buffered iteration setup #27883
Conversation
A (maybe not too impressive) example of reduced per-iteration overhead:
In this case, we are only benefitting from figuring out sizes up-front rather than every time. It should be possible to reduce the overhead here a lot more: the old code could not optimize this buffered case, FWIW. Some of the strided benchmarks seem to profit, but I am not exactly sure why... I'll run the full benchmark suite, although I am not too optimistic there will be good examples.
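A casting reduction is one case where buffered iteration kicks in and per-iteration setup overhead shows up. A hypothetical micro-benchmark sketch (array sizes and the helper name are illustrative, not from the PR's benchmarks):

```python
import timeit
import numpy as np

# float32 input reduced into a float64 output forces buffering/casting,
# so the iterator's per-fill setup cost is exercised many times.
a = np.ones((1000, 100), dtype=np.float32)

def casting_reduction():
    return a.sum(axis=0, dtype=np.float64)

elapsed = timeit.timeit(casting_reduction, number=100)
print(f"100 casting reductions: {elapsed:.4f}s")
```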
A few comments; some apply only because looking at the code in a different layout is helpful.
Some comments are too short, but this is ready for review, I think.
There is little point in reviewing nditer_api.c in-line. It shows that I tried to retain some things, but of course the whole point of this exercise is to effectively rewrite the copy to/from buffers.
@@ -828,6 +823,9 @@ NpyIter_IsFirstVisit(NpyIter *iter, int iop)

/*NUMPY_API
 * Whether the iteration could be done with no buffering.
 *
 * Note that the iterator may use buffering to increase the inner loop size
 * even when buffering is not required.
This was always true of course. Previously, it was even always the case unless a reduction prevented it.
npy_intp fact = NIT_ITERINDEX(iter) / NIT_BUFFERDATA(iter)->coresize;
npy_intp offset = NIT_ITERINDEX(iter) - fact * NIT_BUFFERDATA(iter)->coresize;
NIT_BUFFERDATA(iter)->coreoffset = offset;
}
Could be integrated into the loop. The bigger improvement here would be to avoid this function entirely in iternext(), which the code does for unbuffered iterators, but not for buffered ones.
For a next iteration, I think!
NPY_IT_DBG_PRINT1(
    "Iterator: clearing refs of operand %d\n", (int)iop);
npy_intp buf_stride = dtypes[iop]->elsize;
// TODO: transfersize is too large for reductions |
Not a change, not a catastrophe. But this always clears the buffer as if it was completely filled (even if there may only be a single item).
I guess calculating the actual size shouldn't be too hard; we have NBF_STRIDES() and NBF_OUTERSTRIDES() to work with. (If either stride is 0, the core size or the outer size does not matter.)
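The stride-based size alluded to here could be sketched as follows (a hypothetical Python helper with made-up names, not the actual C code):

```python
def buffer_elements_used(core_size, outer_size, core_stride, outer_stride):
    # If either stride is 0, the buffer reuses the same element along that
    # axis, so that axis's length does not enlarge the used region.
    used = 1
    if core_stride != 0:
        used *= core_size
    if outer_stride != 0:
        used *= outer_size
    return used

# A reduce operand with zero core stride only ever touches `outer_size`
# buffer elements, however large the core is.
print(buffer_elements_used(1000, 10, 0, 8))
```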
Let's leave that for a future enhancement
* NOTE: This is technically not true since we may still use
* an outer reduce at this point.
* So it prefers a non-reduce setup, which seems not
* ideal, but OK.
Hmmmm, maybe the if can actually just use op_single_stride_dims[iop] == idim - 1?
That would require assuming the reduce logic below is used (and potentially disabling it when it is not necessary). Probably that just works; it might even be easier to grok.
I'm still trying to grok this bit, but could one just swap the two pieces?
This might work, but I'm not sure it should matter right now (should probably open an issue noting that this all should be tweaked)...
The point is, if we use a "reduce" setup (i.e. two strides), then we can do one more dimension without buffering.
And maybe/probably it works to change this if and move it to the beginning (before the check for identical strides).
(I have to think about it myself.)
Talking with Matti today, so hopefully we can push it through (with some notes in an issue). So a bit of a note to self: changing it here is simple, I think. The difference will be that using_reduce needs to be discovered below by re-checking operands.
Reduce-loops are unfavorable since you need to call the inner loop/iternext more often (but you save on buffer filling). That trade-off may already be covered by preferring larger cores, but I'm not sure.
This whole "finding the best" is very crude and I am sure it can be improved. Just also don't want to overcomplicate it and make it slow.
Let's leave that for a further investigation/PR. Maybe we can open a follow up issue to track more cleanup and optimizations.
* If this operand is a reduction operand and this is not the
* outer_reduce_dim, then we must stop.
*/
if (op_itflags[iop] & NPY_OP_ITFLAG_REDUCE) { |
Could also use ITFLAG_WRITE, which may be a bit clearer, considering that we may disable the flag later.
(We don't need the flag to be set, but right now it gets set because the earlier code does input validation anyway.)
numpy/_core/tests/test_nditer.py
| REDUCE OuterSize: 10
| REDUCE OuterDim: 1
| BUFFER Reduce outersize: 10
| BUFFER OuterDim: 1 |
I suppose the reduce outersize is actually also always valid now. Which would be useful for advancing the iterator.
Should rename/reorganize, but would add churn elsewhere too probably.
This reorganizes the loop to do a larger initial setup rather than figuring things out on every iteration. It often shrinks the buffersize to fit the actual size better. We try to do that smartly, although the logic is different. It also means that "reduce" mode is not necessarily used (although it is likely that before it was also effectively unused often). This is not generally faster, but in some cases it avoids using buffering unnecessarily, and it should be much better at buffer re-use (at least with smaller follow-up fixes). In those cases things will be much faster. (As of writing, the setup time seems slightly slower; that is probably not unexpected, since finding the best size is extra work, and it only amortizes over many iterations.)
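The user-visible behavior being tuned here can be seen from Python without touching the C internals (a small sketch; the tiny buffersize value is deliberately artificial):

```python
import numpy as np

a = np.arange(12, dtype=np.int64).reshape(3, 4)

# Buffered iteration with a deliberately tiny buffer: the iterator copies
# casted chunks, each at most `buffersize` elements, into its buffers.
it = np.nditer(a, flags=['buffered', 'external_loop'],
               op_dtypes=[np.float64], buffersize=5)
total = 0.0
for chunk in it:
    assert chunk.size <= 5  # chunks never exceed the buffer size
    total += chunk.sum()
assert total == a.sum()
```

How the setup picks chunk boundaries (and therefore how often buffers are filled) is exactly what this PR reorganizes.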
The old code just did bad things. But a too large ndim would not have mattered, since the copy would have finished early anyway.
Force-pushed from 6cc45b7 to ac282fc
Also clarifies the comment, since it may be a bit confusing: The outer stride being zero means that there must be nonzero inner strides (otherwise they are all zero and we don't use reduce logic). But the inner stride being zero is not meaningful.
I think this is ready for review.
There is one remaining question: this currently breaks ...
@mattip this code removes that weird thing that ...
FWIW, I have reinstated it and added tests for the ...
Force-pushed from 1352e7a to 92ecbaf
Just comments on the long comment for now. Will try to get going on the rest later...
* The function here finds an outer dimension and its "core" to use that
* works with reductions.
* While iterating, we will fill the buffers making sure that we:
* - Never buffer beyond the outer dimension (optimize chance of re-use). |
What would it mean to "buffer beyond the outer dimension"? For some length N, I can imagine buffering some smaller M, but why ever M>N? I'm probably misreading this!
Added two examples, hopefully they help (if not for this, then hopefully to show how to think about these iterations).
The outer dimension may be iterated in chunks (not fully), so if it is say 100, and we can fit 30 of them in at a time, you reach 90. Then we just buffer 10, not 30 again, because that would increment the dimension that comes next (with a different stride).
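The chunking described here (never buffering past the end of the outer dimension) can be sketched as:

```python
def outer_chunks(outer_size, max_fit):
    # Yield chunk lengths for iterating an outer dimension of `outer_size`
    # when at most `max_fit` outer items fit into the buffer at once.
    done = 0
    while done < outer_size:
        n = min(max_fit, outer_size - done)  # never cross the dimension end
        yield n
        done += n

# 100 items, 30 fit at a time: 30, 30, 30, then only 10 (not 30), so the
# next dimension (with a different stride) is never straddled.
print(list(outer_chunks(100, 30)))
```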
* - Never buffer beyond the outer dimension (optimize chance of re-use).
* - Never buffer beyond the end of the core (all but outer dimension)
*   if the user manually moved the iterator to the middle of the core.
* (Not typical and if, should usually happen once.) |
Garbled sentence! In the larger context, doesn't it just mean that if the iterator is manually set, things have to be recalculated?
Restructured the order and wording to make it clearer hopefully.
You always have to (re-)fill buffers when manually moved (or ranged).
The point is that I decided to just finish a single core (which could be very small). For reductions, you have to do that anyway. Otherwise, it just seems a bit simpler not to worry about what happens if you iterate 1.5 core-sizes (rather than N core-sizes).
* We don't want to figure out the complicated copies every time we fill buffers
* so here we want to find the outer iteration dimension so that:
* - Its core size is <= the maximum buffersize if buffering is needed.
* - Allows for reductions to work (with or without reduce-mode) |
follow grammatical structure of first item:
- Reductions are possible (with or without reduce mode);
- Iteration overhead is minimized. We ...
* - Allows for reductions to work (with or without reduce-mode)
* - Optimizes for iteration overhead. We estimate the total overhead with:
*   `(1 + n_buffers) / size`.
* The `size` is the `min(core_size * outer_dim_size, buffersize)` and |
Should this be:
`(1 + n_buffers) / size`, a guess of how often we fill buffers, with
`size=min(core_size * outer_dim_size, buffersize)`.
But I think this is not very clear; this does not look like a plain number (it has 1/bytes as dimension). I think what one really is trying to minimize is,
`(1+n_buffers) * core_size * outer_dim_size / min(buffer_size, core_size * outer_dim_size)`.
In practice, since the total size is fixed, what you write does the same thing, but maybe it is clearer to write it like I have it? I.e., how about the following for the whole entry:
- Iteration overhead is minimized. For this, we use the number of times buffers
are filled, `(1 + n_buffers) * total_size / min(total_size, buffer_size)`
(where `total_size = core_size * outer_dim_size`).
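The suggested formulation can be written out as a tiny helper (hypothetical names; this only mirrors the formula under discussion, not the C implementation):

```python
def fills_estimate(n_buffers, core_size, outer_dim_size, buffer_size):
    # Estimated number of buffer-fill operations, per the suggested text:
    # (1 + n_buffers) * total_size / min(total_size, buffer_size)
    total_size = core_size * outer_dim_size
    return (1 + n_buffers) * total_size / min(total_size, buffer_size)

# When everything fits into the buffer at once, the estimate is just
# one pass per buffer; halving the buffer doubles the fills.
print(fills_estimate(2, 10, 10, 100))
print(fills_estimate(2, 10, 10, 50))
```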
OK, total_size is not actually core_size * outer_dim_size, since there may be more dimensions (outer here is the first "outer" dimension in a sense).
(I added the "first" in a few places, maybe it helps... It would be nice to find a different name, because "outer" is also a nice name for everything that isn't buffered, but the "outer" dimension here is just the outermost one that is still iterated as part of a buffer.
But I can't think of one right now.)
(completely changed/expanded the formulas, though...)
Force-pushed from f37e35a to 426934d
OK, a few more code comments, though none remotely serious. I cannot say that I really understand the code, so at some level am relying on tests passing. I do think your refactoring makes the code easier to follow (but I haven't really tried to understand the whole...).
NpyIter_BufferData *bufferdata = NIT_BUFFERDATA(iter);

/*
* We check two things here, first whether the core is single strided |
The single strided is a bit confusing; from the code below, I'd call it contiguous. Maybe change that? Otherwise, just add "(C contiguous)" here?
EDIT: or rather, I think this does not have to be contiguous at all, just consistent with the first stride. So, maybe add "(note that the stride of the first axis does not have to equal the element size; if it does, this is equivalent to the operand being C contiguous)".
Rephrased to say "first how many operand dimensions can be iterated using a single stride (all dimensions are consistent),"
Contiguity of course doesn't matter (it only matters for that slightly awkward "contiguous" flag).
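The rephrased check ("how many operand dimensions can be iterated using a single stride") can be illustrated in Python (a sketch over C-order axes with a made-up helper name; the real code works on the iterator's own axis ordering):

```python
import numpy as np

def single_stride_ndim(shape, strides):
    # Count innermost (C-order) dimensions coverable by one stride:
    # dimension d is consistent if strides[d] == strides[d+1] * shape[d+1].
    # Note: the innermost stride need not equal the element size, so this
    # is weaker than C contiguity.
    ndim = len(shape)
    if ndim == 0:
        return 0
    count = 1
    for d in range(ndim - 2, -1, -1):
        if strides[d] == strides[d + 1] * shape[d + 1]:
            count += 1
        else:
            break
    return count

a = np.empty((3, 4, 5))
print(single_stride_ndim(a.shape, a.strides))                  # contiguous
b = a[:, ::2]
print(single_stride_ndim(b.shape, b.strides))                  # sliced
```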
/* For clarity: op_reduce_outer_dim[iop] if set always matches. */
assert(!op_reduce_outer_dim[iop] || op_reduce_outer_dim[iop] == outer_reduce_dim);
}
if (iop != nop) { |
How can this be hit? I don't see a break in the for loop above...
Just forgot to delete this. The first version I had broke immediately on finding a reduce op. But I wanted the info for all operands, so now it's breaking later.
A few more small things, and one possible little buglet. I still do not feel I really understand everything. Best would be if someone else could have a look, but, if not, my sense would be to put this in, and continue in smaller pieces...
reduce_outeraxisdata = NIT_INDEX_AXISDATA(axisdata, reduce_outerdim);
if (itflags & NPY_ITFLAG_REDUCE) {
    npy_intp sizeof_axisdata = NIT_AXISDATA_SIZEOF(itflags, ndim, nop);
outer_axisdata = NIT_INDEX_AXISDATA(axisdata, bufferdata->outerdim); |
Not your problem, but, really, having macros require other variables to be defined is pretty bad form! Especially since here axisdata_incr could have made use of sizeof_axisdata.
/*
* Note that this code requires/uses the fact that there is always one
* axisdata even for ndim == 0 (i.e. no need to worry about it).
*/ |
Again, only if you are making changes, maybe add:
* Also, it uses that the operands have already been broadcast against
* each other (and thus may have zero strides).
Don't feel it fits here (also moved this comment down to best_ndim = 0).
}
}
NIT_BUFFERDATA(iter)->buffersize = best_size;
/* Core size is 0 (unless the user applies a range explicitly). */ |
typo? "Core starts at 0 (unless..."
}
else {
/* If there's no buffering, the strides are always fixed */
/* If there's no buffering, the strides come from the operands. */
memcpy(out_strides, NAD_STRIDES(axisdata0), nop*NPY_SIZEOF_INTP); |
A future cleanup might eliminate any internal calls to this API function and use the iter data directly.
As far as I understood, it looks good. Might be nice to check with valgrind at some point before the next release.
There are some follow-ups, but nothing really jumps out as a possible continuation. Thanks @seberg
Somehow we are seeing some failures in scikit-learn against numpy-dev and a quick bisect seems to indicate this PR is the reason. Suggestions more than welcome on this! See scikit-learn/scikit-learn#30509 for more details.
It is possible; the buffering changed, which may change things mainly for:
(I was planning on adding a release note.)
The issue does happen with float32. We are comparing our own optimized version of pairwise_distances to ... I do realize that it's quite far away from a minimal reproducer ... I could try to spend time on creating one/debugging further, if you think this is useful. scikit-learn/scikit-learn#30509 (comment) has more details just in case.
I am taking a look. Somewhere in the code, there must be a summation along the dimension; do you know where that happens? Of course both directions of change are possible; in this instance, my current suspicion is that you are using a NumPy sum, which is then precise for:
Where ... But parts of the code also look like they go to ... Or in other words/example:
And to complete the reasoning above on 2.1.3:
On main:
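The precision effect under discussion can be reproduced without the C internals: float32 summation accuracy depends on the accumulation order, which buffering/blocking changes (the array size here is arbitrary, and the naive loop is only for comparison):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(100_000).astype(np.float32)

exact = a.astype(np.float64).sum()   # double-precision reference
pairwise = float(a.sum())            # NumPy's blocked/pairwise float32 sum

naive = np.float32(0.0)              # straight left-to-right accumulation
for x in a:
    naive += x

print(abs(pairwise - exact) / exact)
print(abs(float(naive) - exact) / exact)
```

Typically the blocked/pairwise result stays much closer to the double-precision reference; changing the buffer size shifts where the blocking boundaries fall, which can move results by a few ULPs either way.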
Thanks for having a look!
Not very familiar with this code unfortunately, maybe @jjerphan or @jeremiedbb remember more details. See scikit-learn/scikit-learn#30509 (comment) for more context and the original failure on scikit-learn when running against numpy-dev. |
Actually, complete guess, but maybe in ...
This changes the buffered iteration setup to do most of the work initially rather than when actually copying buffers.
It has a few changes:
(To be fair, this is most important with reductions where the buffersize always had to be shrunk down.)
(A possible new optimization: Use "reduce" loop even if not a reduce.)
It still needs:
NPY_ITER_CONTIG.