MAINT,ENH: Reorganize buffered iteration setup #27883


Merged: 23 commits merged into numpy:main on Dec 17, 2024

Conversation

Member
@seberg seberg commented Dec 1, 2024

This changes the buffered iteration setup to do most of the work initially rather than when actually copying buffers.

It has a few changes:

  • As of now, a very slightly higher initial setup overhead.
  • Shrinks the buffersize to fit the dimensions. This means that buffers can be re-used far more often for broadcast operations; see the sketch after this list.
    (To be fair, this is most important with reductions, where the buffersize always had to be shrunk down.)
  • The buffer re-use logic is now passably understandable, I hope, which means we may also be able to optimize it further (e.g. allow re-use for dtypes with references, which currently fails...).
  • Tries to guess whether using buffers is actually worthwhile. An example is:
    a = np.ones((4, 500))[::2]
    %timeit a + a  # 840ns vs. 1.15
    
    so I'm not sure it is important; we could maybe save on it and just grow as much as possible.
    (A possible new optimization: Use "reduce" loop even if not a reduce.)
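As a rough sketch of the shrinking idea (illustrative only; shrink_buffersize and its arguments are hypothetical names, not the NumPy internals): rounding the buffer down to a whole number of cores lets one buffer fill line up with the outer iteration, so it can be re-used.

def shrink_buffersize(buffersize, coresize):
    # Round the buffer down to a whole multiple of the core size so a
    # filled buffer ends exactly on a core boundary and can be replayed.
    assert 0 < coresize <= buffersize
    return (buffersize // coresize) * coresize

print(shrink_buffersize(8192, 1000))  # 8000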

It still needs:

  • Figure out why CI breaks when a local run did not.
  • (largely OK) A bit of cleanup (e.g. unnecessary moving around). The main function is long and a bit messy, I admit, but a second pair of eyes is probably better at suggesting improvements if they are worthwhile.
  • Someone will probably want some good speedup examples. They exist, but I need to nail down the nicest ones.
  • Decide what to do about NPY_ITER_CONTIG.

@seberg seberg marked this pull request as draft December 1, 2024 22:17
@seberg
Member Author
seberg commented Dec 2, 2024

An example, maybe not too impressive, of the per-iteration overhead for reductions:

a = np.ones((100000, 2, 2))

%timeit a.sum(1)  # 3.55ms vs. 3.9ms

In this case, we are only benefitting from figuring out sizes up-front rather than every time. It should be possible to reduce the overhead here a lot more: the old code could not optimize the buffered iternext() for advancing the index (it could have done so for reduce ones, though).
The new code now always knows exactly which dimension to advance and by how much (unless a coreoffset is set, so effectively always). So it could just advance one dimension (carrying into the next dimensions on overflow), as in the sketch below.
Currently, it needs to figure out the full multi-index with divisions for each dimension.
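A minimal Python sketch of the two strategies (hypothetical helpers, not the C internals):

def multi_index_from_flat(flat, shape):
    # Old approach: recompute the full multi-index, one division per dimension.
    idx = []
    for n in reversed(shape):
        idx.append(flat % n)
        flat //= n
    return tuple(reversed(idx))

def advance(idx, shape):
    # New approach: bump the innermost dimension and carry on overflow;
    # no divisions are needed.
    idx = list(idx)
    for i in range(len(shape) - 1, -1, -1):
        idx[i] += 1
        if idx[i] < shape[i]:
            break
        idx[i] = 0  # overflow: reset and carry into the next-outer dimension
    return tuple(idx)

assert advance((0, 1, 1), (3, 2, 2)) == (1, 0, 0)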

FWIW, some of the strided benchmarks seem to profit, but I am not exactly sure why... I'll run the full benchmark suite, although I am not too optimistic there will be good examples.

Member Author
@seberg seberg left a comment

A few comments; some apply because looking at the code in a different layout is helpful.

Some comments are too short, but this is ready for review, I think.

There is little point in reviewing nditer_api.c in-line. It shows that I tried to retain some things, but of course the whole point of this exercise is to effectively rewrite the copying to/from buffers.

@@ -828,6 +823,9 @@ NpyIter_IsFirstVisit(NpyIter *iter, int iop)

/*NUMPY_API
* Whether the iteration could be done with no buffering.
*
* Note that the iterator may use buffering to increase the inner loop size
* even when buffering is not required.
Member Author
This was always true of course. Previously, it was even always the case unless a reduction prevented it.

npy_intp fact = NIT_ITERINDEX(iter) / NIT_BUFFERDATA(iter)->coresize;
npy_intp offset = NIT_ITERINDEX(iter) - fact * NIT_BUFFERDATA(iter)->coresize;
NIT_BUFFERDATA(iter)->coreoffset = offset;
}
Member Author

Could be integrated into the loop. The bigger improvement here would be to avoid this function entirely in iternext(), which the code does for unbuffered iterators but not for buffered ones.

Contributor

For a next iteration, I think!

NPY_IT_DBG_PRINT1(
"Iterator: clearing refs of operand %d\n", (int)iop);
npy_intp buf_stride = dtypes[iop]->elsize;
// TODO: transfersize is too large for reductions
Member Author

Not a change, and not a catastrophe, but this always clears the buffer as if it were completely filled (even if there may be only a single item).
I guess calculating it shouldn't be too hard; we have NBF_STRIDES() and NBF_OUTERSTRIDES() to work with. (If either stride is 0, the core size or the outer size does not matter.)

Member

Let's leave that for a future enhancement

* NOTE: This is technically not true since we may still use
* an outer reduce at this point.
* So it prefers a non-reduce setup, which seems not
* ideal, but OK.
Member Author

Hmmmm, maybe the if can actually just use op_single_stride_dims[iop] == idim - 1?

That would require assuming the reduce logic below is used (and potentially disabling it when it is not necessary). Probably that just works, and may even be easier to grok.

Contributor

I'm still trying to grok this bit, but could one just swap the two pieces?

Member Author

This might work, but I'm not sure it matters right now (I should probably open an issue noting that all of this could be tweaked)...

The point is, if we use a "reduce" setup (i.e. two strides), then we can do one more dimension without buffering.

And maybe/probably it works to change this if and move it to the beginning (before the check for identical strides).
(I have to think about it myself.)

Member Author

I was talking with Matti today, so hopefully we can push this through (with some notes in an issue). A bit of a note to self: changing it here is simple, I think; the difference is that using_reduce needs to be discovered below by re-checking the operands.
Reduce loops are unfavorable since you need to call the inner loop/iternext more often (but you save on buffer filling). That trade-off may already be covered by preferring larger cores, but I'm not sure.

This whole "finding the best" is very crude and I am sure it can be improved. Just also don't want to overcomplicate it and make it slow.

Member

Let's leave that for further investigation/a follow-up PR. Maybe we can open a follow-up issue to track more cleanup and optimizations.

* If this operand is a reduction operand and this is not the
* outer_reduce_dim, then we must stop.
*/
if (op_itflags[iop] & NPY_OP_ITFLAG_REDUCE) {
Member Author

Could also use ITFLAG_WRITE, which may be a bit clearer, considering that we may disable the flag later.

(We don't need the flag to be set, but right now it gets set because the earlier code does input validation anyway.)

| REDUCE OuterSize: 10
| REDUCE OuterDim: 1
| BUFFER Reduce outersize: 10
| BUFFER OuterDim: 1
Member Author

I suppose the reduce outersize is actually also always valid now, which would be useful for advancing the iterator.
Should rename/reorganize, but that would probably add churn elsewhere too.

seberg and others added 10 commits December 2, 2024 18:10
This reorganizes the loop to do a larger initial setup rather than
figuring things out on every iteration.
This often shrinks the buffersize to fit the actual iteration size better.
We try to do that smartly, although the logic is different.

It also means that "reduce" is not necessarily used (although it
is likely that before it was also effectively unused often).

This is not generally faster, but in some cases it will not use
buffering unnecessarily and it should be much better at buffer
re-use (at least with smaller fixes).
In those cases things will be much faster.

(As of writing, the setup time seems slightly slower, that is probably
not unexpected, since finding the best size is extra work and it would
only amortize if there are many iterations.)
The old code just did bad things, but a too-large ndim would not have mattered, since the copy would have finished early anyway.
@seberg seberg force-pushed the reduce-buf-refactor branch from 6cc45b7 to ac282fc Compare December 2, 2024 17:15
Also clarifies the comment, since it may be a bit confusing:
The outer stride being zero means that there must be nonzero
inner strides (otherwise they are all zero and we don't use reduce
logic).
But the inner stride being zero is not meaningful.
@seberg seberg marked this pull request as ready for review December 3, 2024 10:49
@seberg
Member Author
seberg commented Dec 3, 2024

I think this is ready for review.

  • The tests are actually surprisingly not so bad, because I wrote a test for this many years ago when there was a buffer-reuse bug (and einsum covers a lot too).
  • Can probably do some improvements, but it is at the point where suggestions would be the best way to drive those.

There is one remaining question: this currently breaks NPY_ITER_CONTIG, which requests an operand to be contiguous. Re-instating it shouldn't be hard, but a GitHub code search finds exactly 0 occurrences outside of NumPy, and NumPy doesn't actually use it...
I do wonder if we should just raise an error if it is set, saying that we could re-instate it if needed? (A sketch of what the flag requests follows below.)
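For reference, the flag is reachable from Python as the per-operand "contig" flag of np.nditer. A minimal sketch of the behavior being requested (buffering presents the strided operand as contiguous chunks); the exact chunking is up to the iterator:

import numpy as np

a = np.arange(12.0).reshape(3, 4)[:, ::2]  # non-contiguous view
it = np.nditer(a, flags=["external_loop", "buffered"],
               op_flags=[["readonly", "contig"]])
for chunk in it:
    # each chunk is a 1-D buffer with contiguous (itemsize) strides
    print(chunk.strides)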

@seberg
Member Author
seberg commented Dec 3, 2024

@mattip this code removes that weird thing where arr.sum() does multiple (unnecessary) calls to the inner loop. I think that is fine because pairwise summation is lightweight enough that we don't need a limit on the recursion depth?
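For context, a rough sketch of pairwise (cascade) summation; NumPy's real block size and unrolling differ, but the recursion depth only grows as log2(n / block), which is why no explicit depth limit is needed:

def pairwise_sum(x, block=128):
    # Split recursively until the chunk is small, then sum directly.
    n = len(x)
    if n <= block:
        return sum(x)
    half = n // 2
    return pairwise_sum(x[:half], block) + pairwise_sum(x[half:], block)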

@seberg
Member Author
seberg commented Dec 5, 2024

FWIW, I have reinstated the CONTIG flag and added tests for it. But it is still exactly as buggy as it was before, for now...

There was a use-after-free error, not sure yet why (or if it is even related), so saving for posterity. Nvm, I had an iterator debug print, and I suspect pytest may have inspected the operand arrays, which are invalid at the end of the test.

@seberg seberg force-pushed the reduce-buf-refactor branch from 1352e7a to 92ecbaf Compare December 5, 2024 18:39
Contributor
@mhvk mhvk left a comment

Just comments on the long comment for now. Will try to get going on the rest later...

* The function here finds an outer dimension and its "core" to use that
* works with reductions.
* While iterating, we will fill the buffers making sure that we:
* - Never buffer beyond the outer dimension (optimize chance of re-use).
Contributor

What would it mean to "buffer beyond the outer dimension"? For some length N, I can imagine buffering some smaller M, but why ever M>N? I'm probably misreading this!

Member Author

Added two examples; hopefully they help (if not for this, then hopefully to show how to think about these iterations).

The outer dimension may be iterated in chunks (not fully). So if it is, say, 100 and we can fit 30 at a time, you reach 90; then we just buffer 10, not 30 again, because buffering 30 would increment the dimension that comes next (which has a different stride). See the sketch below.
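In sketch form (illustrative helper, not the real code):

def outer_fill_sizes(outer_len, fit):
    # Clip the final fill so a buffer never runs past the outer dimension;
    # running past it would also step a higher dimension (different stride).
    sizes = []
    pos = 0
    while pos < outer_len:
        sizes.append(min(fit, outer_len - pos))
        pos += sizes[-1]
    return sizes

print(outer_fill_sizes(100, 30))  # [30, 30, 30, 10]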

* - Never buffer beyond the outer dimension (optimize chance of re-use).
* - Never buffer beyond the end of the core (all but outer dimension)
* if the user manually moved the iterator to the middle of the core.
* (Not typical and if, should usually happen once.)
Contributor

Garbled sentence! In the larger context, doesn't it just mean that if the iterator is manually set, things have to be recalculated?

Member Author

Restructured the order and wording to hopefully make it clearer.

You always have to (re-)fill buffers when manually moved (or ranged).

The point is that I decided to just finish a single core (which could be very small). For reductions, you have to do that anyway. Otherwise, it just seems a bit simpler not to worry about what happens if you iterate 1.5 core-sizes (rather than N core-sizes).

* We don't want to figure out the complicated copies every time we fill buffers,
* so here we want to find the outer iteration dimension so that:
* - Its core size is <= the maximum buffersize if buffering is needed.
* - Allows for reductions to work (with or without reduce-mode)
Contributor

follow grammatical structure of first item:

- Reductions are possible (with or without reduce mode);
- Iteration overhead is minimized. We ...

* - Allows for reductions to work (with or without reduce-mode)
* - Optimizes for iteration overhead. We estimate the total overhead with:
* `(1 + n_buffers) / size`.
* The `size` is the `min(core_size * outer_dim_size, buffersize)` and
Contributor

Should this be:

`(1 + n_buffers) / size`, a guess of how often we fill buffers, with
`size=min(core_size * outer_dim_size, buffersize)`.

But I think this is not very clear; it does not look like a pure number (it has 1/bytes as its dimension). I think what one is really trying to minimize is,

`(1+n_buffers) * core_size * outer_dim_size / min(buffer_size, core_size * outer_dim_size)`.

In practice, since the total size is fixed, what you write does the same thing, but maybe it is clearer to write like I have it? I.e., how about the following for the whole entry:

- Iteration overhead is minimized. For this, we use the number of times buffers
  are filled, `(1 + n_buffers) * total_size / min(total_size, buffer_size)`
  (where `total_size = core_size * outer_dim_size`).

Member Author

OK, total_size is not actually core_size * outer_dim_size, since there may be more dimensions (outer here is the first "outer" dimension in a sense).

(I added the "first" in a few places, maybe it helps... It would be nice to find a different name because "outer" is also a nice name for everything that isn't buffered, but for the "outer" dimension here it is just the outermost that is still iterated as a part of a buffer.
But, I can't think of one right now.)
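As a rough numeric illustration of the quantity being minimized (hypothetical numbers; total_size here means the full iteration size, per the correction above):

def fills_cost(n_buffers, total_size, buffer_size):
    # Estimated buffer-filling work: number of (re)fills times the
    # number of buffered operands (plus one for the iteration itself).
    return (1 + n_buffers) * total_size / min(total_size, buffer_size)

print(fills_cost(2, 5000, 8192))   # 3.0: one fill covers everything
print(fills_cost(2, 50000, 8192))  # ~18.3: several fills are needed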

Member Author

(completely changed/expanded the formulas, though...)

@seberg seberg force-pushed the reduce-buf-refactor branch from f37e35a to 426934d Compare December 15, 2024 18:13
Contributor
@mhvk mhvk left a comment

OK, a few more code comments, though none remotely serious. I cannot say that I really understand the code, so at some level I am relying on the tests passing. I do think your refactoring makes the code easier to follow (but I haven't really tried to understand the whole...).

NpyIter_BufferData *bufferdata = NIT_BUFFERDATA(iter);

/*
* We check two things here, first whether the core is single strided
Contributor

"Single strided" is a bit confusing; from the code below, I'd call it contiguous. Maybe change that? Otherwise, just add "(C contiguous)" here?

EDIT: or rather, I think this does not have to be contiguous at all, just consistent with the first stride. So, maybe add "(note that the stride of the first axis does not have to equal the element size; if it does, this is equivalent to the operand being C contiguous)".

Member Author

Rephrased to say "first how many operand dimensions can be iterated using a single stride (all dimensions are consistent)".

Contiguity of course doesn't matter (it only matters for that slightly awkward "contiguous" flag); see the example below.
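A concrete example of the distinction: a view can be iterated with one uniform stride over all of its dimensions without being contiguous.

import numpy as np

a = np.arange(24.0).reshape(2, 3, 4)[:, :, ::2]
print(a.strides)                # (96, 32, 16): flattened, every element is
                                # exactly 16 bytes after the previous one
print(a.flags["C_CONTIGUOUS"])  # False: the stride (16) != itemsize (8)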

/* For clarity: op_reduce_outer_dim[iop] if set always matches. */
assert(!op_reduce_outer_dim[iop] || op_reduce_outer_dim[iop] == outer_reduce_dim);
}
if (iop != nop) {
Contributor

How can this be hit? I don't see a break in the for loop above...

Member Author

I just forgot to delete this. The first version broke out immediately on finding a reduce op, but I wanted the info for all operands, so now it breaks later.


Contributor
@mhvk mhvk left a comment

A few more small things, and one possible little buglet. I still do not feel I really understand everything. Best would be if someone else could have a look, but if not, my sense would be to put this in and continue in smaller pieces...

reduce_outeraxisdata = NIT_INDEX_AXISDATA(axisdata, reduce_outerdim);
if (itflags & NPY_ITFLAG_REDUCE) {
npy_intp sizeof_axisdata = NIT_AXISDATA_SIZEOF(itflags, ndim, nop);
outer_axisdata = NIT_INDEX_AXISDATA(axisdata, bufferdata->outerdim);
Contributor

Not your problem, but really, having macros require other variables to be defined is pretty bad form! Especially since axisdata_incr here could have made use of sizeof_axisdata.

/*
* Note that this code requires/uses the fact that there is always one
* axisdata even for ndim == 0 (i.e. no need to worry about it).
*/
Contributor

Again, only if you are making changes, maybe add:

 * Also, it uses that the operands have already been broadcast against
 * each other (and thus may have zero strides).

Member Author

Don't feel it fits here (also moved this comment down to best_ndim = 0).

}
}
NIT_BUFFERDATA(iter)->buffersize = best_size;
/* Core size is 0 (unless the user applies a range explicitly). */
Contributor

type? "Core starts at 0 (unless..."

}
else {
/* If there's no buffering, the strides are always fixed */
/* If there's no buffering, the strides come from the operands. */
memcpy(out_strides, NAD_STRIDES(axisdata0), nop*NPY_SIZEOF_INTP);
Member

A future cleanup might eliminate any internal calls to this API function and use the iter data directly.

Member
@mattip mattip left a comment

As far as I understand it, this looks good. Might be nice to check with valgrind at some point before the next release.

@mattip
Member
mattip commented Dec 17, 2024

There are some follow-ups, but nothing really jumps out as a possible continuation. Thanks @seberg

@mattip mattip merged commit f36fa72 into numpy:main Dec 17, 2024
65 of 66 checks passed
@seberg seberg deleted the reduce-buf-refactor branch December 17, 2024 09:46
@lesteve
Contributor
lesteve commented Dec 19, 2024

Somehow we are seeing some failures in scikit-learn against numpy-dev, and a quick bisect seems to indicate this PR is the reason. Suggestions are more than welcome! See scikit-learn/scikit-learn#30509 for more details.

@seberg
Member Author
seberg commented Dec 19, 2024

It is possible; the buffering changed, which may change things mainly for:

  • float16, in case buffer sizes are smaller, because intermediate results may be cast back to float16 more often.
  • float32 summation is more likely to change because iteration sizes increased. That will change the pair-wise summation logic (I would think slightly for the better).

(I was planning on adding a release note.)

@lesteve
Contributor
lesteve commented Dec 19, 2024
  • float32 summation is more likely to change because iteration sizes increased. That will change the pair-wise summation logic (I would think slightly for the better).

The issue does happen with float32. We are comparing our own optimized version of pairwise_distances to scipy.spatial.cdist on float64 (no, both are float32, my mistake). Before, the test passed with rtol=1e-5; now the relative difference is 2.7e-4, so it looks a bit off. There is also some optimized Cython code involved which I am not very familiar with, so it could be partly due to our Cython code as well; hard to tell at this point.

I do realize that it's quite far away from a minimal reproducer ... I could try to spend time on creating one/debugging further, if you think this is useful.

scikit-learn/scikit-learn#30509 (comment) has more details just in case.

@seberg
Member Author
seberg commented Dec 19, 2024

I am taking a look. Somewhere in the code there must be a summation along the dimension; do you know where that happens?

Of course both directions of change are possible. In this instance, my current suspicion is that you are using a NumPy sum; the change is then approximately:

dim = 1000000
leftover = dim % 8196  # 88 elements

summation_change = diff[:-88].sum() + diff[-88:].sum() - diff.sum()

Where summation_change approximates the difference under the assumption that the precision is always better for the pairwise summation itself (I am not sure this is the case!).

But parts of the code also look like they possibly go to float64 intermediately, and with that I don't see this happening.

Or in other words, as an example:

eps = 1e-4
dim = 1000000

arr = np.array([np.float32(1 + eps) - 1] * dim)  # 1 + eps - 1 isn't eps

# NumPy 2.1.3:
(arr**2).sum() - arr[0]**2*dim
# np.float32(-1.1175871e-08)

# NumPy main:
(arr**2).sum() - arr[0]**2*dim
# np.float32(1.8626451e-09)

And to complete the reasoning above on 2.1.3:

In [18]: (arr[:-88]**2).sum() + (arr[-88:]**2).sum() - (arr**2).sum()
Out[18]: np.float32(0.0)

On main:

In [66]: (arr[:-88]**2).sum() + (arr[-88:]**2).sum() - (arr**2).sum()
Out[66]: np.float32(-9.313226e-10)

@lesteve
Contributor
lesteve commented Dec 19, 2024

Thanks for having a look!

Somewhere in the code, there must be a summation along the dimension, do you know where that happens?

I'm not very familiar with this code unfortunately; maybe @jjerphan or @jeremiedbb remember more details. See scikit-learn/scikit-learn#30509 (comment) for more context and the original failure in scikit-learn when running against numpy-dev.

@lesteve
Contributor
lesteve commented Dec 19, 2024

Somewhere in the code, there must be a summation along the dimension, do you know where that happens?

Actually, a complete guess, but maybe in row_norms, which does the equivalent of np.sqrt((X * X).sum(axis=1)) with an np.einsum?

https://github.com/scikit-learn/scikit-learn/blob/4ad187a7401f939c4d9cd27090c4258b5d810650/sklearn/utils/extmath.py#L47-L82
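For reference, a dense-only sketch of that computation (scikit-learn's actual row_norms also handles sparse input):

import numpy as np

def row_norms_dense(X, squared=False):
    # Equivalent to np.sqrt((X * X).sum(axis=1)), but np.einsum avoids
    # materializing the X * X temporary.
    norms = np.einsum("ij,ij->i", X, X)
    return norms if squared else np.sqrt(norms)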
