RPC sync by saood06 · Pull Request #193 · ikawrakow/ik_llama.cpp · GitHub

RPC sync #193


Draft · saood06 wants to merge 16 commits into main

Conversation

saood06 (Collaborator) commented Feb 8, 2025

I grabbed all of the changes needed for llama.cpp/pull/11047, which were ggml-org/llama.cpp#9912 and ggml-org/llama.cpp#9040.

This compiles, but has not been tested yet.

ikawrakow (Owner) commented
I never use RPC and have never looked into the RPC code, so I'll have to rely on you for self-review and testing.

saood06 (Collaborator, Author) commented Feb 10, 2025

@jukofyork

> I strongly suspect something funky is going on

There is, see this comment: #180 (comment)

This fork has much faster PP speeds and Deepseek MLA support behind a flag (-mla); this PR should allow RPC to work, and I'm working on porting the option to override model tensor buffers.

This is something I've done for a while on my Windows builds, because on Windows long is not 8 bytes. On Linux this changes nothing, as both types are 8 bytes there.
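For context, a minimal sketch of the size difference (the struct names below are hypothetical and not from this PR): 64-bit Windows uses the LLP64 model, where long is 4 bytes, while 64-bit Linux uses LP64, where long is 8 bytes, so anything serialized for RPC should use fixed-width types.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical message layouts, only to illustrate the point: a field typed
// as `long` is 4 bytes on 64-bit Windows (LLP64) but 8 bytes on 64-bit Linux
// (LP64), so the two sides would disagree on the wire format.
struct msg_with_long   { long     value; }; // platform-dependent size
struct msg_fixed_width { uint64_t value; }; // 8 bytes everywhere

int main() {
    std::printf("sizeof(long)     = %zu\n", sizeof(long));      // 4 on Win64, 8 on Linux x86-64
    std::printf("sizeof(uint64_t) = %zu\n", sizeof(uint64_t));  // 8 on both
    std::printf("msg_with_long = %zu, msg_fixed_width = %zu\n",
                sizeof(msg_with_long), sizeof(msg_fixed_width));
    return 0;
}
```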
saood06 (Collaborator, Author) commented Feb 27, 2025

This has now been tested, and it does not currently work. I'm not sure why, as the errors I'm getting don't seem to have been encountered by anyone on llama.cpp.


```cpp
rpc_msg_get_alloc_size_rsp response;
bool status = send_rpc_cmd(sock, RPC_CMD_GET_ALLOC_SIZE, &request, sizeof(request), &response, sizeof(response));
GGML_ASSERT(status);
```
saood06 (Collaborator, Author) commented on this code:

The RPC client crashes here; it happens because the RPC server hits an issue on its side.
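To make the failure chain explicit, here is a minimal sketch (not the real ggml-rpc.cpp code; the stub below only simulates the helper's return value) of why the client aborts at GGML_ASSERT(status): send_rpc_cmd reports false when it cannot read a complete response from the server, for example because the server side bailed out and dropped the connection.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for send_rpc_cmd: the real helper in ggml-rpc.cpp
// writes the command plus request payload to the socket and then waits for
// the response; if the server dies or closes the connection before replying,
// the read fails and the helper returns false.
static bool send_rpc_cmd_stub(bool server_sent_full_response) {
    return server_sent_full_response;
}

int main() {
    // Simulate the server hitting an error (e.g. the null-tensor path in
    // get_alloc_size) and never sending rpc_msg_get_alloc_size_rsp back.
    bool status = send_rpc_cmd_stub(/*server_sent_full_response=*/false);
    std::printf("send_rpc_cmd returned %s\n", status ? "true" : "false");
    assert(status && "client aborts here, mirroring GGML_ASSERT(status)");
    return 0;
}
```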

```cpp
ggml_tensor * tensor = deserialize_tensor(ctx, &request.tensor);

if (tensor == nullptr) {
    GGML_PRINT_DEBUG("Null tensor pointer passed to server get_alloc_size function.\n");
```
saood06 (Collaborator, Author) commented on this code:

I'm fairly certain this is where the RPC server is crashing, although it doesn't print the message because I never ran with GGML_DEBUG enabled.
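For reference, a minimal sketch of why the message is silent unless debug output is enabled; the real macro definition in the ggml sources may differ in detail, but it follows this compile-time gating pattern:

```cpp
#include <cstdio>

// Sketch of ggml's debug-print gating. With GGML_DEBUG left at 0, the
// GGML_PRINT_DEBUG call in get_alloc_size compiles away to nothing, so the
// "Null tensor pointer..." message is never emitted even when that branch
// is taken.
#ifndef GGML_DEBUG
#define GGML_DEBUG 0
#endif

#if (GGML_DEBUG >= 1)
#define GGML_PRINT_DEBUG(...) printf(__VA_ARGS__)
#else
#define GGML_PRINT_DEBUG(...)
#endif

int main() {
    GGML_PRINT_DEBUG("this only appears when compiled with GGML_DEBUG >= 1\n");
    return 0;
}
```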

ubergarm (Contributor) commented

@saood06

I just came across another llama.cpp fork called prima.cpp which claims to have improved support for multi-device distributed inferencing.

I haven't tried it, just saw it on reddit today. Might be worth a shot given your GPU is in a different system than your big RAM box.

saood06 (Collaborator, Author) commented Apr 12, 2025

> @saood06
>
> I just came across another llama.cpp fork called prima.cpp which claims to have improved support for multi-device distributed inferencing.
>
> I haven't tried it, just saw it on reddit today. Might be worth a shot given your GPU is in a different system than your big RAM box.

Thanks for the link, it is interesting. I think it would work for dense models but not as well for MoE, because as far as I can tell it doesn't handle -ot (this commit looks relevant). I'd also need Windows support, which is on the roadmap (though I might see what the issue is by trying to build it on my machine, and see if I can fix it), and the GPU machine would have to run Windows (my big RAM box runs Clear Linux, and I have other servers that run FreeBSD and Proxmox).

saood06 mentioned this pull request on Jun 1, 2025.