RPC sync #193
base: main
Conversation
Add more checks which prevent RPC server from crashing if invalid input is received from client
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Use structs for RPC request/response messages
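As a rough illustration of what this commit implies, each RPC command would use a fixed-size request/response struct rather than hand-packed byte buffers. The sketch below is my assumption of the layout for the GET_ALLOC_SIZE command, inferred from the rpc_msg_get_alloc_size_rsp usage in the diff further down; it is not copied from this PR:

```cpp
#include <cstdint>

// Hypothetical fixed-size message structs for RPC_CMD_GET_ALLOC_SIZE.
// rpc_tensor is the existing serialized tensor descriptor in ggml-rpc.
struct rpc_msg_get_alloc_size_req {
    rpc_tensor tensor;    // tensor descriptor sent from client to server
};

struct rpc_msg_get_alloc_size_rsp {
    uint64_t alloc_size;  // fixed-width on purpose, so the wire size is the same on every platform
};
```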
I never use RPC and have never looked into the RPC code, so I'll have to rely on you for self-review and testing.
There is, see this comment: #180 (comment). This fork has much faster PP speeds, has Deepseek MLA support behind a flag (-mla), this PR should allow RPC to work, and I'm working on porting the "add option to override model tensor buffers" change.
This is something I've done for a while on my Windows builds, because on Windows long is not 8 bytes. On Linux this changes nothing, as both types are 8 bytes there.
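A minimal sanity check of that size difference (not code from this PR, just something you can compile on either OS):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // LLP64 (Windows): long is 4 bytes. LP64 (Linux, macOS): long is 8 bytes.
    // Fixed-width types keep RPC message layouts identical across platforms.
    std::printf("sizeof(long)    = %zu\n", sizeof(long));     // 4 on Windows, 8 on Linux
    std::printf("sizeof(int64_t) = %zu\n", sizeof(int64_t));  // 8 everywhere
    static_assert(sizeof(int64_t) == 8, "int64_t is always 8 bytes");
    return 0;
}
```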
This has been tested and does not currently work. I'm not sure why, as the errors I'm getting don't seem to have been encountered by anyone on llama.cpp.
rpc_msg_get_alloc_size_rsp response;
bool status = send_rpc_cmd(sock, RPC_CMD_GET_ALLOC_SIZE, &request, sizeof(request), &response, sizeof(response));
GGML_ASSERT(status);
The RPC client crashes here, which happens when the RPC server hits an issue.
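For context, one way the client could fail soft here instead of aborting is sketched below. This is not this PR's code: the alloc_size field, the enclosing function's tensor variable, and the ggml_nbytes fallback are my assumptions.

```cpp
// Hypothetical non-fatal variant of the call above: log the failure and fall
// back to a local size computation instead of asserting the whole client away
// when the server drops the connection or rejects the request.
rpc_msg_get_alloc_size_rsp response;
bool status = send_rpc_cmd(sock, RPC_CMD_GET_ALLOC_SIZE, &request, sizeof(request), &response, sizeof(response));
if (!status) {
    fprintf(stderr, "%s: RPC_CMD_GET_ALLOC_SIZE failed, falling back to ggml_nbytes\n", __func__);
    return ggml_nbytes(tensor);
}
return response.alloc_size;
```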
ggml_tensor * tensor = deserialize_tensor(ctx, &request.tensor);

if (tensor == nullptr) {
    GGML_PRINT_DEBUG("Null tensor pointer passed to server get_alloc_size function.\n");
I'm fairly certain this is where the RPC server is crashing, although it doesn't print the message because I never ran with GGML_DEBUG enabled.
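To make the flow concrete, the server-side handler presumably ends up looking something like the sketch below once the guard is in place. The bool return convention, the buft variable, and the response.alloc_size field are my assumptions about the surrounding code, not text from this PR:

```cpp
// Hypothetical shape of the server-side GET_ALLOC_SIZE handler with the new
// guard: reject a malformed tensor instead of dereferencing a null pointer.
ggml_tensor * tensor = deserialize_tensor(ctx, &request.tensor);
if (tensor == nullptr) {
    GGML_PRINT_DEBUG("Null tensor pointer passed to server get_alloc_size function.\n");
    return false;  // the command loop can then report failure to the client instead of crashing
}
response.alloc_size = ggml_backend_buft_get_alloc_size(buft, tensor);
return true;
```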
I just came across another llama.cpp fork called prima.cpp which claims to have improved support for multi-device distributed inference. I haven't tried it, just saw it on reddit today. Might be worth a shot given your GPU is in a different system than your big RAM box.
Thanks for the link, it is interesting. I think it would work for dense models but not as well for MoE because, as far as I can tell, it doesn't handle …
I grabbed all of the changes needed for llama.cpp/pull/11047, which were ggml-org/llama.cpp#9912 and ggml-org/llama.cpp#9040.
This compiles, but has not been tested yet.