Multireduce Kernels - Linearizer and Scheduler Changes by 0xtimmy · Pull Request #4208 · tinygrad/tinygrad

Closed
wants to merge 7 commits

Conversation

@0xtimmy (Contributor) commented Apr 18, 2024

Re: an earlier PR, this implements the actual kernel fusion of standard deviation.

It:

  1. allows schedule.py to put two ReduceOps in a kernel if they have the same shape
  2. changes linearizer.py to handle multiple reduceops

An example of a fused kernel:

__kernel void r_64_16_2(__global double* data0, const __global double* data1) {
  __attribute__ ((aligned (16))) __local double temp[16];
  int gidx0 = get_group_id(0); /* 64 */
  int lidx1 = get_local_id(0); /* 16 */
  double acc0 = (double)(0.0);
  int alu0 = (gidx0+(lidx1*128));
  bool alu1 = (lidx1<1);
  for (int ridx0 = 0; ridx0 < 2; ridx0++) {
    double val0 = data1[alu0+(ridx0*64)];
    acc0 = (val0+acc0);
  }
  temp[lidx1] = acc0;
  barrier(CLK_LOCAL_MEM_FENCE);
  if (alu1) {
    double acc1 = (double)(0.0);
    for (int ridx1 = 0; ridx1 < 16; ridx1++) {
      double val1 = temp[ridx1];
      acc1 = (val1+acc1);
    }
    temp[0] = acc1;
  }
  barrier(CLK_LOCAL_MEM_FENCE);
  double val2 = temp[0];
  double acc2 = (double)(0.0);
  for (int ridx2 = 0; ridx2 < 2; ridx2++) {
    double val3 = data1[alu0+(ridx2*64)];
    double alu2 = (val3-(val2*(double)(0.03125)));
    acc2 = ((alu2*alu2)+acc2);
  }
  temp[lidx1] = acc2;
  barrier(CLK_LOCAL_MEM_FENCE);
  if (alu1) {
    double acc3 = (double)(0.0);
    for (int ridx3 = 0; ridx3 < 16; ridx3++) {
      double val4 = temp[ridx3];
      acc3 = (val4+acc3);
    }
    data0[gidx0] = sqrt((acc3*(double)(0.03225806451612903)));
  }
}
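For reference, here is a minimal sketch of the kind of tinygrad expression that could produce a fused kernel like the one above. The shapes and the unbiased-variance form are inferred from the kernel's constants (1/32 for the mean, 1/31 for the variance), so treat them as assumptions rather than the exact test case:

from tinygrad import Tensor

x = Tensor.empty(32, 64)                            # reduced over axis 0, matching data1's strides
mean = x.mean(axis=0, keepdim=True)                 # first reduce (the 1/32 factor)
std = (((x - mean) ** 2).sum(axis=0) / 31).sqrt()   # second reduce (the 1/31 factor), then sqrt
std.realize()                                       # with this PR's scheduler rule, both reduces share one kernel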

@geohot (Collaborator) commented Apr 18, 2024

Can you break this up into Linearizer and Scheduler changes? First Linearizer to support generation of the kernels (with lots of good tests!), then scheduler to actually enable it.

@0xtimmy (Contributor, Author) commented Apr 18, 2024

yeah 100%

@geohot (Collaborator) commented Apr 18, 2024

Does this bring layernorm to a single kernel?

Right now it's three (discussing in #scheduler on discord)

// layernorm_forward
r_256_768(b16, b13)
r_256_768n1(b17, b13, b16)
E_256_768(b18, b13, b16, b17, h_0_ln_1_weight, h_0_ln_1_bias)
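For context, a rough sketch of the layernorm forward being discussed, with shapes taken from the kernel names; the eps value and the exact normalization form are assumptions. Today each commented line below schedules as its own kernel:

from tinygrad import Tensor

x = Tensor.empty(256, 768)
w, b = Tensor.empty(768), Tensor.empty(768)            # h_0_ln_1_weight, h_0_ln_1_bias
mean = x.mean(axis=-1, keepdim=True)                   # r_256_768
var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)    # r_256_768n1
out = ((x - mean) / (var + 1e-5).sqrt()) * w + b       # E_256_768
out.realize()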

@0xtimmy (Contributor, Author) commented Apr 18, 2024

I can get it to fuse into two and could probably get it down to one; it depends mostly on how we adjust the rules around what to fuse.

@Qazalin (Collaborator) commented Apr 19, 2024

Left suggestions with the unit tests in #4220 - in general:

  1. Agreed with geohot, the linearizer changes should be factored out with tests - I already found a weird flakiness in AST ordering, see test_multireduce_multioutput_fancy in #4220
  2. Can we generalize scheduling fusion?

@0xtimmy (Contributor, Author) commented Apr 19, 2024

Thanks for the tests!
From a glance I think I have an idea as to why it's failing; I'll pull the tests and make sure the issue has been fixed.

I would like to try to generalize scheduling fusions; for reduceops my thinking was that any set of consecutive shape transformations from the same shape ought to be fused, e.g. the two SUMs in standard deviation or layernorm.
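A rough, hypothetical sketch of that rule (the names here are illustrative, not the actual schedule.py code):

def can_share_kernel(reduce_a, reduce_b) -> bool:
  # two reduceops are candidates for the same kernel if they reduce from the same
  # pre-reduce shape down to the same reduced shape, like the two SUMs in std/layernorm
  return (reduce_a.input_shape == reduce_b.input_shape
          and reduce_a.output_shape == reduce_b.output_shape)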

There is the issue of control flow divergence, where any operations on the reduced shape won't use all the threads, e.g. the DIV by N in a mean calculation.
And I'm not sure how well the linearizer can handle general fusions (though I think it should).

I started working on a schedule.py change to test this, but it's pretty invasive and will need more tests for sure.

@0xtimmy (Contributor, Author) commented Apr 20, 2024

An issue with getting layernorm to fuse into one kernel is that the output shape != the reduced shape; e.g. if the scheduler allows it to fuse, it will look like this:

__kernel void r_256_16_2(__global double* data0, const __global double* data1) {
  __attribute__ ((aligned (16))) __local double temp[16];
  int gidx0 = get_group_id(0); /* 256 */
  int lidx1 = get_local_id(0); /* 16 */
  double acc0 = (double)(0.0);
  int alu0 = ((gidx0*32)+(lidx1*2));
  bool alu1 = (lidx1<1);
  for (int ridx0 = 0; ridx0 < 2; ridx0++) {
    double val0 = data1[alu0+ridx0];
    acc0 = (val0+acc0);
  }
  temp[lidx1] = acc0;
  barrier(CLK_LOCAL_MEM_FENCE);
  if (alu1) {
    double acc1 = (double)(0.0);
    for (int ridx1 = 0; ridx1 < 16; ridx1++) {
      double val1 = temp[ridx1];
      acc1 = (val1+acc1);
    }
    temp[0] = acc1;
  }
  barrier(CLK_LOCAL_MEM_FENCE);
  double val2 = temp[0];
  double acc2 = (double)(0.0);
  for (int ridx2 = 0; ridx2 < 2; ridx2++) {
    double val3 = data1[alu0+ridx2];
    double alu2 = (val3-(val2*(double)(0.03125)));
    acc2 = ((alu2*alu2)+acc2);
  }
  temp[lidx1] = acc2;
  barrier(CLK_LOCAL_MEM_FENCE);
  if (alu1) {
    double acc3 = (double)(0.0);
    for (int ridx3 = 0; ridx3 < 16; ridx3++) {
      double val4 = temp[ridx3];
      acc3 = (val4+acc3);
    }
    temp[0] = acc3;
  }
  barrier(CLK_LOCAL_MEM_FENCE);
  double val5 = temp[0];
  for (int ridx4 = 0; ridx4 < 2; ridx4++) {
    data0[alu0+ridx4] = (alu2*sqrt(((double)(1.0)/((val5*(double)(0.03125))+(double)(1e-05))))); // <- erroneous line 
  }
}

alu2 is computed in a previous loop, so it can't be accessed in the final parse. I could store or recompute it; both options make some sense for fusion because they don't require any trips to DRAM, but from an occupancy perspective it might make sense to give the final (x - μ)/σ operation its own kernel, plus whatever comes after it.
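To make the dependency concrete, here is a small sketch in tensor terms (not generated code) of why the trailing elementwise op needs a value produced inside an earlier reduce loop; the shapes and eps are illustrative:

from tinygrad import Tensor

x = Tensor.empty(256, 768)
centered = x - x.mean(axis=-1, keepdim=True)              # this is alu2 inside the variance loop
var = (centered * centered).mean(axis=-1, keepdim=True)
out = (centered * (1.0 / (var + 1e-5).sqrt())).realize()  # re-uses `centered` after the reduction
# in a single fused kernel, `centered` must either be kept live across the barrier
# (stored, e.g. to local memory) or recomputed from data1 and the mean in the final loop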

-    # reduce op
+    # reduce ops
+    assert len(self.reduceops) == len(set(self.reduceops)), "All reduceops must be unique"
+    self.reduceops.reverse()
Review comment from a Collaborator:

yeah you need to linearize the deepest reduceop first.

I think .reverse() won't work for multi-output because you could have an AST like:

  0 ━┳ STORE MemBuffer(idx=0, dtype=dtypes.float, st=ShapeTracker(views=(View(shape=(1,), strides=(0,), offset=0, mask=None, contiguous=True),)))
  1  ┗━┳ SUM ((0,), dtypes.float)
  2    ┗━━ LOAD MemBuffer(idx=2, dtype=dtypes.float, st=ShapeTracker(views=(View(shape=(32,), strides=(1,), offset=0, mask=None, contiguous=True),)))
  0 ━┳ STORE MemBuffer(idx=1, dtype=dtypes.float, st=ShapeTracker(views=(View(shape=(1,), strides=(0,), offset=0, mask=None, contiguous=True),)))
  1  ┗━┳ SUM ((0,), dtypes.float)
  2    ┗━┳ SUB 
  3      ┣━━ LOAD MemBuffer(idx=2, dtype=dtypes.float, st=ShapeTracker(views=(View(shape=(32,), strides=(1,), offset=0, mask=None, contiguous=True),)))
  4      ┗━┳ SUM ((0,), dtypes.float)
  5        ┗━━ LOAD MemBuffer(idx=2, dtype=dtypes.float, st=ShapeTracker(views=(View(shape=(32,), strides=(1,), offset=0, mask=None, contiguous=True),)))

Two ideas:

  • order self.reduceops by depth (a rough sketch of this follows below)
  • have render_reduceop recurse (similar to parse_ast)
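A minimal, self-contained sketch of the first idea; Node here stands in for the real LazyOp and is not tinygrad's API:

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Node:
  op: str
  src: Tuple["Node", ...] = ()

def depth_of(target: Node, root: Node, d: int = 0) -> int:
  # depth of `target` under `root`, or -1 if it is not in this subtree
  if root is target: return d
  return max((depth_of(target, s, d + 1) for s in root.src), default=-1)

def order_reduceops(reduceops, asts):
  # linearize the deepest reduceop first, measured across all output ASTs
  return sorted(reduceops, key=lambda r: max(depth_of(r, a) for a in asts), reverse=True)

# mirrors the multi-output AST above: the SUM nested inside SUB sorts first
x = Node("LOAD")
inner = Node("SUM", (x,))
outer = Node("SUM", (Node("SUB", (x, inner)),))
store0, store1 = Node("STORE", (Node("SUM", (x,)),)), Node("STORE", (outer,))
assert order_reduceops([outer, inner], [store0, store1])[0] is inner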

@@ -111,7 +109,7 @@ def _recurse_lb(buf:LazyBuffer, realizes:Dict[LazyBuffer, None], allbufs:Dict[La
   if len(buf.st.views) == 1 and buf.st.views[-1].mask and all_int(buf.base.st.shape) and \
     prod(buf.base.st.shape) >= prod([y-x for x,y in buf.st.views[-1].mask]):
     simple_pads.add(buf.base)
-  else:
+  elif all([x.op not in ReduceOps for x in buf.base.srcs if hasattr(x, "op")]):
Review comment from a Collaborator:

what if buf isn't the only child of x?

x = Tensor.empty(32)
r0 = x.mean(axis=0, keepdim=True)
r1 = (x - r0).sum(axis=0).div(2)
e = r0 + r1
e.realize()

https://tiny-tools-client.vercel.app/?id=7b4dc4f7c0c34dcb94e1d82639cc3180
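One hypothetical way to guard against that case (the helper name and the `children` map are illustrative, not schedule.py's actual variables): only take the new elif path when every ReduceOps source of buf.base has a single consumer, since otherwise r0 above still has to be realized for e anyway:

from tinygrad.ops import ReduceOps

def reduce_srcs_single_child(buf, children) -> bool:
  # `children` is assumed to map a LazyBuffer to the set of buffers that consume it
  return all(len(children[s.base]) <= 1 for s in buf.base.srcs
             if hasattr(s, "op") and s.op in ReduceOps)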


@Qazalin mentioned this pull request Apr 22, 2024

This branch currently is behind tinygrad/master. The line count difference bot is disabled.

@chenyuxyz closed this Dec 25, 2024