ggml-backend: backend-agnostic tensor parallelism #13776
Conversation
Have you considered implementing this as a new backend instead? I think it should result in much simpler code than modifying `ggml_backend_sched`.
Sounds like a good approach, I'll make another prototype.
I'm currently working on support for backend-agnostic tensor parallelism. I've progressed to the point where I have a working prototype (that only works for 2 GPUs and has bad performance). I'm making this PR in order to get early feedback regarding the way I would implement it; input from @slaren in particular would be appreciated. Specifically I would:

- Add functionality to `ggml-backend.cpp` to e.g. check whether a buffer is split, which backends are associated with it if it is, and to retrieve the effective tensor for a given backend (see the first sketch after this list). I think this can be done without any backend-specific code. The input would be multiple backend buffers; when allocating a tensor on the split buffer, this would be translated to allocating slices of the tensor on the underlying backend buffers.
- Change `ggml_backend_sched` to revolve more around splits instead of the nodes from the original graph. Without tensor parallelism there will be effectively no change because the splits just contain all nodes from the original graph in sequential order, so the same results should be achieved by iterating over splits vs. iterating over nodes.
- Use ops such as `GGML_CONCAT` to combine the per-backend results into a tensor that contains the correct data. For this I extended the functionality of `ggml_backend_sched_split::inputs`. Tensors with `GGML_OP_NONE` use the existing code to retrieve data from other backends; tensors with other ops are executed prior to the actual nodes from the split.
- Define logic analogous to the `_supports_op` functions to determine the state of split tensors after some op, given the states of the inputs (see the second sketch after this list). If an op cannot be meaningfully executed in parallel, synchronize the nodes as a fallback. This should ensure that correct results can always be produced, but with bad performance if the correct transformation logic is not defined. For the attention I think the graph should be split by dimension 2; for the FFN part I think it should be dimension 1 -> dimension 0 -> mirrored. In total there would need to be 4 synchronizations per layer.
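To make the first bullet point more concrete, here is a minimal sketch of what the split-buffer bookkeeping and queries could look like. None of these names (`split_slice`, `split_buffer`, `slice_for_backend`) exist in ggml; they are illustrative assumptions only and are independent of the actual `ggml_backend_buffer` interface.

```cpp
// Sketch only: illustrative bookkeeping for a backend-agnostic split buffer.
// All names are hypothetical; they show the intended queries, not real ggml API.
#include <cstdint>
#include <vector>

struct split_slice {
    int     backend_index; // which backend stores this slice
    int64_t row_begin;     // first row of the original tensor held by that backend
    int64_t row_end;       // one past the last row
};

struct split_buffer {
    std::vector<split_slice> slices; // one entry per underlying backend buffer

    // "Is this buffer split?" query from the first bullet point.
    bool is_split() const {
        return slices.size() > 1;
    }

    // "Effective tensor for a given backend": return the slice that backend owns,
    // or nullptr if it holds no part of the tensor.
    const split_slice * slice_for_backend(int backend_index) const {
        for (const split_slice & s : slices) {
            if (s.backend_index == backend_index) {
                return &s;
            }
        }
        return nullptr;
    }
};
```

Allocating a tensor on such a split buffer would then amount to creating one slice per underlying backend buffer, matching the "allocating slices of the tensor on the underlying backend buffers" step described above.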
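And a minimal sketch of the split-state propagation from the last bullet point, including the fallback to synchronization for unhandled cases. The enum and the single mat-mul rule shown here are assumptions for illustration, not the actual rule set:

```cpp
// Sketch only: propagating a per-tensor "split state" through graph ops.
// Enum values and the example rule are assumptions, not existing ggml code.
enum class split_state {
    mirrored, // every backend holds a full copy of the tensor
    dim0,     // sliced along dimension 0
    dim1,     // sliced along dimension 1
    dim2,     // sliced along dimension 2
    sync,     // no parallel rule known -> gather/synchronize as a fallback
};

// Example rule for a mat-mul-like op: weights sliced along dimension 1 combined
// with mirrored activations can stay split (the result is sliced along dimension 0).
// Anything without an explicit rule degrades to sync, which keeps results correct
// at the cost of performance, as described above.
static split_state split_state_after_mul_mat(split_state weights, split_state activations) {
    if (weights == split_state::dim1 && activations == split_state::mirrored) {
        return split_state::dim0;
    }
    if (weights == split_state::mirrored && activations == split_state::mirrored) {
        return split_state::mirrored;
    }
    return split_state::sync;
}
```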
Going forward, since `ggml_backend_sched` is a critical component, I would first make a separate PR to refactor it slightly so that it's easier to assert that no changes are being made for use without tensor parallelism. The approach I have in this PR is to first split the graph and create a vector of sequential splits `splits_no_tp` in which the splits that need tensor parallelism are marked. Then, in a second pass, a vector `splits_tp` is created in which the tensor-parallel splits are duplicated (a sketch of this pass follows below). Only after this are inputs assigned. Finally, the vector `splits_tp` is copied to `ggml_backend_sched::splits`. So in effect I have split the 5th pass over the graph nodes into 2 passes so that I can duplicate the tensor-parallel splits in between. I used vectors because it made the implementation the easiest, but it should be possible to do the same thing with one more allocation like `ggml_backend_sched::splits` that grows dynamically when needed. I assume the reason a vector is not used in the current code for `ggml_backend_sched::splits` is to assert that the memory is never reallocated when repeatedly changing the number of splits.
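A minimal sketch of the second pass described above, assuming a simplified split record; the fields and the `duplicate_tp_splits` helper are hypothetical and only illustrate how `splits_no_tp` could be expanded into `splits_tp`:

```cpp
// Sketch only: expanding splits_no_tp into splits_tp by duplicating the splits
// that were marked as needing tensor parallelism. Types and fields are made up.
#include <vector>

struct sched_split_sketch {
    int  backend_id; // backend that will execute this split
    int  i_start;    // first graph node covered by the split
    int  i_end;      // one past the last graph node
    bool needs_tp;   // marked during the first pass
};

static std::vector<sched_split_sketch> duplicate_tp_splits(
        const std::vector<sched_split_sketch> & splits_no_tp, int n_backends) {
    std::vector<sched_split_sketch> splits_tp;
    for (const sched_split_sketch & split : splits_no_tp) {
        if (!split.needs_tp) {
            splits_tp.push_back(split); // sequential split, copied unchanged
            continue;
        }
        // one copy per backend; inputs (tensor slices) are assigned afterwards
        for (int b = 0; b < n_backends; ++b) {
            sched_split_sketch copy = split;
            copy.backend_id = b;
            splits_tp.push_back(copy);
        }
    }
    return splits_tp;
}
```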
For the main PR the goal would be to get an implementation that is at least as fast as the current CUDA code for `--split-mode row` but does not need code specific to the CUDA backend. This then makes it possible to remove `ggml_cuda_op_mul_mat` without loss of functionality.