- One outer iteration with stride != 0 and a core of all strides == 0.
  This setup allows filling the buffer with only the stride != 0 part and
  then doing the double loop.
  An example for these two cases is:

      arr = np.ones((100, 10, 10))[::2, :, :]
      arr.sum(-1)
      arr.sum(-2)

  where the slice prevents the iterator from collapsing axes and the
  result has stride 0 along either the last or the second-to-last axis.
  In both cases we can buffer 10x10 elements in reduce mode.
  (This iteration needs no buffer; add a cast to ensure actual buffering.)
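A quick way to see this (a sketch, assuming NumPy is available): the slice keeps the view non-contiguous, so the iterator cannot collapse the three axes into a single 1-d loop, while either reduction drops one of the two trailing axes:

```python
import numpy as np

arr = np.ones((100, 10, 10))[::2, :, :]

# The unsliced array is C-contiguous and could be iterated as one 1-d loop;
# the sliced view is not, so its axes cannot all be collapsed.
assert np.ones((100, 10, 10)).flags['C_CONTIGUOUS']
assert not arr.flags['C_CONTIGUOUS']

print(arr.sum(-1).shape)  # (50, 10): reduce along the last axis
print(arr.sum(-2).shape)  # (50, 10): reduce along the second-to-last axis
```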

Only a writeable (reduce) operand requires this reduce mode, because for
reading it is OK if the buffer holds duplicated elements.
The benefit of the reduce mode is that it allows for larger core sizes and
buffers, since the zero strides do not allow a single 1-d iteration.
If we do use the reduce mode, we can apply it also to read-only operands
as an optimization.
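To illustrate why duplicated (stride-0) elements are harmless to read but not to write, here is a small sketch using `np.broadcast_to` (a user-level way to create a stride-0 view, not the iterator code path discussed here):

```python
import numpy as np

# A stride-0 view: every "row" aliases the same three elements.
a = np.broadcast_to(np.arange(3), (4, 3))
assert a.strides[0] == 0          # zero stride along the broadcast axis

# Reading the duplicated elements is fine:
print(a.sum())                    # 12 == (0 + 1 + 2) * 4

# Writing would alias all rows at once, so NumPy marks the view read-only:
assert not a.flags.writeable
```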

The function here finds the first "outer" dimension, and its "core", in a
way that works with reductions.
While iterating, we will fill the buffers making sure that we:
- Never buffer beyond the first outer dimension (to optimize the chance of
  re-use).
- If the iterator is manually set to an offset into what is part of the
  core (see the second example below), then we only fill the buffer to
  finish that one core.  This re-aligns us with the core and is necessary
  for reductions.  (Such manual setting should be rare, or happen exactly
  once to split the iteration into worker chunks.)

Examples for these two constraints:
Given an iteration shape of (100, 10, 10) and a core size of 10 with a
buffer size of 60 (due to limits), dimension 1 becomes the "outer" one.
The first iterations/buffers would then range over (excluding the end-point):
- (0, 0, 0) -> (0, 6, 0)
- (0, 6, 0) -> (1, 0, 0)  # buffer only holds 40 of the 60 possible elements
- (1, 0, 0) -> (1, 6, 0)
- ...
If the user limits the iteration to a range starting at flat index 75, we use:
- (0, 7, 5) -> (0, 8, 0)  # only 5 elements, to re-align with the core
- (0, 8, 0) -> (1, 0, 0)
- ...  # continue as above
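The two walk-throughs above can be reproduced with a short sketch. The helper below is hypothetical (it is not NumPy's actual implementation); it only chunks flat indices by the two rules just stated, fixing the core to the last axis and the "outer" dimension to axis 1:

```python
import itertools
import numpy as np

def buffer_chunks(shape, core_size, buffersize, start=0):
    """Yield (start, stop) flat-index ranges for successive buffer fills.

    Hypothetical helper mirroring the rules above: fill whole cores, at
    most `buffersize` elements per fill, never crossing the boundary of
    the "outer" dimension (axis 1 here).
    """
    total = int(np.prod(shape))
    outer_block = shape[1] * core_size      # elements under one outer index
    pos = start
    if pos % core_size:                     # manually offset mid-core:
        stop = pos - pos % core_size + core_size
        yield pos, stop                     # finish that one core first
        pos = stop
    per_fill = (buffersize // core_size) * core_size
    while pos < total:
        boundary = (pos // outer_block + 1) * outer_block
        stop = min(pos + per_fill, boundary, total)
        yield pos, stop
        pos = stop

print(list(itertools.islice(buffer_chunks((100, 10, 10), 10, 60), 3)))
# [(0, 60), (60, 100), (100, 160)]: (0,0,0)->(0,6,0), (0,6,0)->(1,0,0), ...
print(list(itertools.islice(buffer_chunks((100, 10, 10), 10, 60, 75), 2)))
# [(75, 80), (80, 100)]: (0,7,5)->(0,8,0), (0,8,0)->(1,0,0)
```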

This means that the data stored in the buffer always has the same structure
(except when manually moved), which allows us to fill the buffer more simply
and optimally in some cases, and makes it easier to determine whether buffer
content is re-usable (e.g., because it represents broadcast operands).

Best buffer and core size
-------------------------
To avoid having to figure out what to copy every time we fill buffers, we
want to find the outer iteration dimension such that:
- Its core size is <= the maximum buffersize if buffering is needed;
- Reductions are possible (with or without reduce mode);
- Iteration overhead is minimized.  We estimate the total overhead via the
  number of "outer" iterations:

      N_o = full_iterator_size / min(core_size * outer_dim_size, buffersize)

  This is approximately how often `iternext()` is called when the user is
  using an external loop, and how often we would fill buffers.  The total
  overhead is then estimated as:

      (1 + n_buffers) * N_o

  Since the full iterator size is a constant, we can equivalently compare:

      (1 + n_buffers) / min(core_size * outer_dim_size, buffersize)

  and when comparing two options, multiply each side by the other's
  divisor (size) to avoid the division.
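A sketch of that division-free comparison (hypothetical helper name; the sizes are positive, so cross-multiplying preserves the ordering):

```python
def lower_overhead(n_buffers_a, size_a, n_buffers_b, size_b):
    """Return 'a' if option a has the lower estimated overhead, else 'b'.

    Compares (1 + n_buffers) / size without dividing:
    (1 + n_a) / size_a < (1 + n_b) / size_b
        <=>  (1 + n_a) * size_b < (1 + n_b) * size_a
    """
    if (1 + n_buffers_a) * size_b < (1 + n_buffers_b) * size_a:
        return 'a'
    return 'b'

# 3/60 = 0.05 estimated overhead per element vs. 2/10 = 0.2: option a wins.
print(lower_overhead(2, 60, 1, 10))  # 'a'
```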

TODO: Probably should tweak or simplify?  The formula is clearly not the
      actual cost (buffers add a constant total cost as well).