DOC: Expand/change iteration explanation based on Marten's review · numpy/numpy@f37e35a · GitHub
1 parent be6749d commit f37e35a

File tree: 1 file changed, +55 −19 lines


numpy/_core/src/multiarray/nditer_constr.c

Lines changed: 55 additions & 19 deletions
@@ -1941,37 +1941,73 @@ operand_different_than_broadcast: {
  * - One outer iteration with stride != 0 and a core of all strides == 0.
  * This setup allows filling the buffer with only the stride != 0 and then
  * doing the double loop.
+ * An example for these two cases is:
+ *     arr = np.ones((100, 10, 10))[::2, :, :]
+ *     arr.sum(-1)
+ *     arr.sum(-2)
+ * Where the slice prevents the iterator from collapsing axes and the
+ * result has stride 0 either along the last or the second to last axis.
+ * In both cases we can buffer 10x10 elements in reduce mode.
+ * (This iteration needs no buffer; add a cast to ensure actual buffering.)
  *
  * Only a writeable (reduce) operand requires this reduce mode because for
- * reading it is OK if the buffer holds duplicates elements.
- * The goal of the reduce mode is that it allows for larger core sizes and
+ * reading it is OK if the buffer holds duplicated elements.
+ * The benefit of the reduce mode is that it allows for larger core sizes and
  * buffers since the zero strides do not allow a single 1-d iteration.
  * If we do use the reduce mode, we can apply it also to read-only operands
  * as an optimization.
  *
- * The functino here finds an outer dimension and it's "core" to use that
- * works with reductions.
+ * The function here finds the first "outer" dimension and its "core" to use
+ * that works with reductions.
  * While iterating, we will fill the buffers making sure that we:
- * - Never buffer beyond the outer dimension (optimize chance of re-use).
- * - Never buffer beyond the end of the core (all but outer dimension)
- *   if the user manually moved the iterator to the middle of the core.
- *   (Not typical and if, should usually happen once.)
+ * - Never buffer beyond the first outer dimension (optimize chance of re-use).
+ * - If the iterator is manually set to an offset into what is part of the
+ *   core (see second example below), then we only fill the buffer to finish
+ *   that one core.  This re-aligns us with the core and is necessary for
+ *   reductions.  (Such manual setting should be rare, or happen exactly once
+ *   for splitting the iteration into worker chunks.)
+ *
+ * And examples for these two constraints:
+ * Given the iteration shape is (100, 10, 10) and the core size 10 with a
+ * buffer size of 60 (due to limits), making dimension 1 the "outer" one,
+ * the first iterations/buffers would then range (excluding end-point):
+ * - (0, 0, 0) -> (0, 6, 0)
+ * - (0, 6, 0) -> (1, 0, 0)  # Buffer only holds 40 of 60 possible elements.
+ * - (1, 0, 0) -> (1, 6, 0)
+ * - ...
+ * If the user limits to a range starting from 75, we use:
+ * - (0, 7, 5) -> (0, 8, 0)  # Only 5 elements to re-align with core.
+ * - (0, 8, 0) -> (1, 0, 0)
+ * - ...  # continue as above
  *
  * This means that the data stored in the buffer always has the same structure
- * (except when manually moved). And we can simplify optimize the buffer
- * filling in some cases and it is easy to decide whether a buffer content
- * may be re-usable (this can happen with broadcasted operands).
+ * (except when manually moved), which allows us to fill the buffer more simply
+ * and optimally in some cases, and makes it easier to determine whether buffer
+ * content is re-usable (e.g., because it represents broadcasted operands).
  *
  * Best buffer and core size
  * -------------------------
- * We don't want to figure out the complicated copies every time we fill buffers
- * so here we want to find a the outer iteration dimension so that:
- * - Its core size is <= the maximum buffersize if buffering is needed.
- * - Allows for reductions to work (with or without reduce-mode)
- * - Optimizes for iteration overhead.  We estimate the total overhead with:
- *   `(1 + n_buffers) / size`.
- *   The `size` is the `min(core_size * outer_dim_size, buffersize)` and
- *   estimates how often we fill buffers (unbounded if not buffering).
+ * To avoid having to figure out what to copy every time we fill buffers,
+ * we here want to find the outer iteration dimension such that:
+ * - Its core size is <= the maximum buffersize if buffering is needed;
+ * - Reductions are possible (with or without reduce mode);
+ * - Iteration overhead is minimized.  We estimate the total overhead with
+ *   the number of "outer" iterations:
+ *
+ *       N_o = full_iterator_size / min(core_size * outer_dim_size, buffersize)
+ *
+ *   This is approximately how often `iternext()` is called when the user
+ *   is using an external loop, and how often we would fill buffers.
+ *   The total overhead is then estimated as:
+ *
+ *       (1 + n_buffers) * N_o
+ *
+ *   Since the full iterator size is a constant, we can estimate the overhead
+ *   as:
+ *
+ *       (1 + n_buffers) / min(core_size * outer_dim_size, buffersize)
+ *
+ *   and, when comparing two options, multiply by the other's divisor/size to
+ *   avoid the division.
  *
  * TODO: Probably should tweak or simplify?  The formula is clearly not
  *       the actual cost (buffers add a constant total cost as well).
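The motivating example in the comment can be reproduced from Python. The sketch below (mine, not part of the commit) shows why the sliced array cannot be collapsed into a single 1-d iteration, and where the zero stride comes from once the reduction result is viewed broadcast against the input:

```python
import numpy as np

# The example from the comment: a slice prevents the iterator from
# collapsing all axes into one contiguous 1-d run.
arr = np.ones((100, 10, 10))[::2, :, :]

# The step along axis 0 is twice shape[1] * strides[1], so axis 0
# cannot be merged with the inner axes.
print(arr.strides)           # e.g. (1600, 80, 8) for float64

# Reduce along the last axis; viewed against the input shape, the
# result has stride 0 along the reduced axis, which is what forces
# the iterator's "reduce mode".
out = arr.sum(-1)
out_view = np.broadcast_to(out[..., np.newaxis], arr.shape)
print(out_view.strides[-1])  # 0
```

With `arr.sum(-2)` instead, the zero stride appears along the second-to-last axis, which is the other case the comment describes.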

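The buffer-fill boundaries from the comment's worked example (shape (100, 10, 10), core size 10, buffer size 60) can be simulated in a few lines. This is a hypothetical model of the rule as the comment states it, not NumPy's actual code; the function name `next_fill` is mine. The rule: fill whole cores, never cross the first "outer" dimension, and if starting mid-core, finish that core first.

```python
def next_fill(flat_index, core_size=10, outer_dim_size=10, buffersize=60):
    """Number of elements to buffer starting at flat_index.

    Hypothetical sketch of the rule described in the nditer comment,
    not NumPy's actual implementation.
    """
    outer_block = core_size * outer_dim_size             # 100 elems per outer step
    pos_in_core = flat_index % core_size
    if pos_in_core:                                      # mid-core: re-align first
        return core_size - pos_in_core
    remaining = outer_block - flat_index % outer_block   # stop at outer boundary
    whole_cores = (buffersize // core_size) * core_size  # only whole cores fit
    return min(remaining, whole_cores)

# Reproduces the ranges listed in the comment (flat indices into (100, 10, 10)):
print(next_fill(0))    # 60: (0, 0, 0) -> (0, 6, 0)
print(next_fill(60))   # 40: (0, 6, 0) -> (1, 0, 0), only 40 of 60 elements
print(next_fill(75))   # 5:  (0, 7, 5) -> (0, 8, 0), re-align with the core
print(next_fill(80))   # 20: (0, 8, 0) -> (1, 0, 0)
```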