CUDA: build archs as virtual for GGML_NATIVE=OFF by JohannesGaessler · Pull Request #13135 · ggml-org/llama.cpp

Merged · 1 commit merged into ggml-org:master on May 6, 2025

Conversation

JohannesGaessler (Collaborator)

See ggml-org/ggml#1154.

CMake gives you the option to build CUDA architectures as real, virtual, or both (the default). My understanding is that if a matching real architecture is present at runtime it can be used directly; otherwise, JIT compilation creates the binary code from a suitable virtual architecture. However, the CUDA architectures we define are the lowest possible ones for the features we use, so the compiled real architectures basically never see any use. We may as well skip them to speed up compilation and reduce binary size.
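For reference, this real/virtual split is expressed through `CMAKE_CUDA_ARCHITECTURES` (CMake 3.18+), where each entry can carry a `-real` or `-virtual` suffix and a plain number means both. A minimal sketch, using 8.0 purely as an example rather than the list ggml actually defines:

```cmake
# Illustrative values for CMAKE_CUDA_ARCHITECTURES (CMake >= 3.18).
# The architecture numbers here are examples, not ggml's actual list.

# Real only: embeds binary code (SASS) for sm_80; runs directly on
# matching GPUs but cannot be JIT-recompiled for newer ones.
set(CMAKE_CUDA_ARCHITECTURES "80-real")

# Virtual only: embeds PTX for compute_80; JIT-compiled at runtime for
# whatever GPU is present with compute capability >= 8.0.
set(CMAKE_CUDA_ARCHITECTURES "80-virtual")

# No suffix: embeds both SASS and PTX (the default described above,
# and the largest binary).
set(CMAKE_CUDA_ARCHITECTURES "80")
```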

On my systems, the binary size and total compilation time (CT) of llama.cpp without ccache and with GGML_NATIVE=OFF change as follows:

| Setup | Size of ggml-cuda.so [MiB] | CT Epyc 7742 (64C/128T) [s] | CT Xeon E5-2683 v4 (16C/32T) [s] |
|---|---|---|---|
| master | 114 | 324 | 562 |
| PR | 60 | 222 | 406 |

The difference seems to be particularly noticeable on CPUs with more cores, since towards the end of the build the compilation of the entire program is left waiting on just two CUDA kernels.

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Apr 27, 2025
JohannesGaessler (Collaborator, Author)

I forgot: while working on this I noticed that, due to JIT compilation, the reported performance on the first run can be very low. The warmup run is IIRC always done with a batch size of 2, and I think it doesn't trigger compilation of e.g. the MMQ kernels.

slaren (Member) commented Apr 29, 2025

Focusing only on the binary releases, it's not completely obvious to me that a smaller binary is more important than faster startup on the first run. I would guess that for most people using LLMs, which are typically tens of GiB, downloading 60 MiB more is better than a worse first-run experience. The compile time shouldn't be too important, since this configuration isn't used for development anyway.

JohannesGaessler (Collaborator, Author)

Not that I have done a survey, but I think most of our users are on compute capability 8.6 or 8.9. Note that regardless of this PR, a JIT compilation of virtual architecture 8.0 will be necessary for them. Maybe it would make sense to also add real architectures for those GPUs that we expect to be in frequent use?

slaren (Member) commented Apr 30, 2025

Yes, I agree. Including real archs for 8.6 and 8.9 and keeping the rest virtual should be good for most users.

JohannesGaessler force-pushed the cuda-jit branch 2 times, most recently from 500a9bb to 494d862 on May 6, 2025 20:10
JohannesGaessler (Collaborator, Author) commented May 6, 2025

When adding real architectures 8.6 and 8.9:

| Setup | Size of ggml-cuda.so [MiB] | CT Epyc 7742 (64C/128T) [s] | CT Xeon E5-2683 v4 (16C/32T) [s] |
|---|---|---|---|
| master | 114 | 324 | 562 |
| PR, all virtual | 60 | 222 | 406 |
| PR, mixed | 87 | 302 | 602 |
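For illustration, a mixed list along these lines could look roughly like the following in CMake; this is a sketch of the idea, not necessarily the exact list used by the PR:

```cmake
# Sketch of a mixed architecture list: PTX (virtual) for broad forward
# compatibility, plus binary code (real) for the consumer GPUs expected
# to be most common (compute capabilities 8.6 and 8.9). Illustrative only.
set(CMAKE_CUDA_ARCHITECTURES
    "50-virtual;61-virtual;70-virtual;75-virtual;80-virtual;86-real;89-real")
```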

JohannesGaessler (Collaborator, Author)

@LostRuins since I remember you being concerned with compile times and the size of CUDA binaries: you may want to build only the virtual architectures (if you're not doing so already).
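As a usage sketch, a virtual-only build can be requested at configure time; the flags below are standard llama.cpp/CMake options, but the architecture list is only an illustration, not a recommendation of specific values:

```sh
# Hypothetical configure line for a virtual-only (PTX-only) CUDA build.
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=OFF \
    -DCMAKE_CUDA_ARCHITECTURES="50-virtual;61-virtual;70-virtual;75-virtual;80-virtual"
```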

JohannesGaessler merged commit 141a908 into ggml-org:master on May 6, 2025 (41 of 45 checks passed)
LostRuins (Collaborator)

Thanks! This would be very useful!

Also, some interesting related stuff since you're here @JohannesGaessler - we had a user who owns a K6000, so we tried a cu11 build with set(CMAKE_CUDA_ARCHITECTURES "35;50;61;70;75"), and it seems to work!

We also had a different user some time back try set(CMAKE_CUDA_ARCHITECTURES "37;50;61;70;75") on their K80, and apparently that worked too.

I'd assume if we do 35-virtual it should probably work fine for both of them? I don't really understand how PTX JIT works. I wish it were easier to get these GPUs to test, but finding them on cloud services is practically impossible; I can't even get a P40 nowadays.

Also, it's my (probably wrong) understanding that the difference between 86-real and 86-virtual is only startup time (not performance), which would probably be cached and thus identical in performance from the second run onwards? Sorry for the barrage of questions.

JohannesGaessler (Collaborator, Author) commented May 7, 2025

I'd assume if we do 35-virtual it should probably work fine for both of them?

Yes. -virtual should always be enough as long as it's possible to compile the PTX code (the CUDA equivalent of assembly) to binary code (code that can be run directly on a GPU). My understanding is that on Linux we dynamically link against the CUDA libraries for this; I think for koboldcpp you are statically linking the CUDA toolkit into the executable, but I would assume it would still work.

I don't really understand how PTX JIT works.

The "high-level" CUDA code (C equivalent) is compiled to PTX (assembly equivalent). The PTX code is basically just a stream of instructions from the PTX ISA. To actually use that PTX code it needs to again be compiled and optimized for a specific GPU architecture. With -virtual CMake produces PTX code that is then compiled just-in-time to device code on the machine where the program is run; the PTX code is (with some exceptions) forwards-compatible so you only need PTX code with the minimum compute capability to cover all of the instructions that are being used and it can then be compiled for any compute capability that is at least as high. With the current ggml code it should be possible to compile the code just for 50-virtual and it should run on all GPUs (but with bad performance because none of the new instructions are used).

With -real, CMake produces device code that can be run instantly and without any additional compilation, but that device code cannot be recompiled for a higher architecture.

Without a -virtual or -real suffix, CMake produces both.
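For what it's worth, my understanding of how these variants map to the flags CMake passes to nvcc (using 8.0 as an example; this is a descriptive sketch, not something the PR adds):

```cmake
# How CMake translates each variant into nvcc -gencode flags, to the best
# of my understanding (8.0 used as an example):
#
#   "80-real"    -> -gencode arch=compute_80,code=[sm_80]
#                   (binary code only; runs directly, not recompilable)
#   "80-virtual" -> -gencode arch=compute_80,code=[compute_80]
#                   (PTX only; JIT-compiled on the target machine)
#   "80"         -> -gencode arch=compute_80,code=[sm_80,compute_80]
#                   (both embedded)
set(CMAKE_CUDA_ARCHITECTURES "80-virtual")
```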

Also, it's my (probably wrong) understanding that the difference between 86-real and 86-virtual is only startup time (not performance), which would probably be cached and thus identical in performance from the second run onwards?

The device code is cached, so the compilation only happens on the first run. There would be no benefit to adding 86-virtual because all instructions that ggml uses are already covered by 80-virtual, while 86-virtual is not compatible with A100 GPUs (compute capability 8.0). If, however, you compile the code only for 8.6, then 86-virtual would have the advantage of also being usable for compute capabilities >8.6 (versus 86-real, which can only be used with 8.6).

I can't even get a P40 nowadays.

I currently have a machine with 3 P40s. I only use it for development, and I'm thinking about replacing one of the P40s with a V100 at some point, since I expect them to become more affordable once datacenters start dumping them. I haven't yet decided what to do with the replaced P40; I was thinking I would ask around IRL whether someone wants it, but if not I could also give it to you.

JohannesGaessler (Collaborator, Author)

Actually, I think binary size was a concern for @jart as well, so I'm tagging her too.

LostRuins (Collaborator)

That's a very kind offer, but no thanks, that's not necessary - I primarily use a laptop and already have my RTX 4090 laptop, which more than meets my needs. I don't actually have or use a desktop PC.

I was referring to using cloud services to test builds - previously I used Runpod and especially Vast.AI to test support for older GPUs, but lately the supply of these has been drying up. I was wondering if you knew of any cloud providers that specialize in provisioning VMs with old GPUs.

Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request May 24, 2025
To speed up compilation time and reduce binary size.

Link: ggml-org/llama.cpp#13135
Author: Johannes Gaessler.