SYCL: Implement few same quantized type copy kernels by qnixsynapse · Pull Request #13739 · ggml-org/llama.cpp
Draft · wants to merge 2 commits into master

Conversation

qnixsynapse (Collaborator)

This is for supporting KV cache defragmentation when a quantized KV cache is used. test-backend-ops seems to pass with this change.
Needs further testing before we can merge.

@qnixsynapse qnixsynapse marked this pull request as draft May 24, 2025 06:28
@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels May 24, 2025
@qnixsynapse qnixsynapse marked this pull request as ready for review May 25, 2025 07:13
@Rbiessy (Collaborator) left a comment


This may be a good opportunity to revisit the copy kernel for the case where the source and destination are of the same type and don't require any casting. These could all use the same function that copies a number of bytes without depending on the type itself, which would reduce the number of kernels and simplify the code. This can be done with sycl::queue::memcpy. What do you think?
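For reference, a minimal sketch of how such a byte-wise copy could look (the helper name and the contiguity precondition are assumptions, not actual ggml-sycl code):

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

// Hypothetical helper: when src and dst hold the same quantized type and are
// both contiguous, no per-type kernel is needed -- the whole buffer can be
// copied as raw bytes.
static void cpy_same_type_contiguous(sycl::queue & q,
                                     const void * src, void * dst,
                                     std::size_t nbytes) {
    // sycl::queue::memcpy enqueues an asynchronous byte copy; the caller must
    // synchronize (e.g. q.wait()) before reading dst.
    q.memcpy(dst, src, nbytes);
}
```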

@qnixsynapse (Collaborator, Author)

@Rbiessy I think memcpy only works if the src and dst tensors are not permuted. Please feel free to correct me if I am wrong.

@Rbiessy (Collaborator) commented May 26, 2025

You're right, that wouldn't work if there are permutations. I hadn't thought much about it: what would a permuted tensor with a quantized type look like? Is it, by design, only permuting the blocks and never the values inside a block?

memcpy could be introduced in a separate PR if we find these copy kernels are slow and permutations are not used, I suppose.

@qnixsynapse (Collaborator, Author)

Yeah. Generally, permuted tensors are non-contiguous. If I find time, I will see whether I can use memcpy to copy contiguous quantized tensors (I think it is possible). I am marking this PR as draft for now.
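For illustration, a rough sketch of the dispatch this could lead to (assumed, not this PR's actual code; launch_quantized_block_cpy is a hypothetical placeholder, while ggml_is_contiguous and ggml_nbytes are existing ggml helpers):

```cpp
#include <sycl/sycl.hpp>
#include "ggml.h"

// Forward declaration of the hypothetical type-specific copy kernel launcher
// used for permuted / non-contiguous tensors.
static void launch_quantized_block_cpy(sycl::queue & q,
                                       const ggml_tensor * src, ggml_tensor * dst);

static void cpy_same_quant_type(sycl::queue & q,
                                const ggml_tensor * src, ggml_tensor * dst) {
    if (ggml_is_contiguous(src) && ggml_is_contiguous(dst)) {
        // Same quantized type + contiguous layout: copy the whole buffer as bytes.
        q.memcpy(dst->data, src->data, ggml_nbytes(src));
    } else {
        // Non-contiguous case: walk the blocks using the nb[] byte strides.
        launch_quantized_block_cpy(q, src, dst);
    }
}
```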

@qnixsynapse qnixsynapse marked this pull request as draft May 27, 2025 15:41
Labels: ggml (changes relating to the ggml tensor library for machine learning), SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language)