CUDA: build archs as virtual for GGML_NATIVE=OFF by JohannesGaessler · Pull Request #13135 · ggml-org/llama.cpp

Merged · 1 commit merged into ggml-org:master on May 6, 2025

Conversation

JohannesGaessler (Collaborator)

See ggml-org/ggml#1154.

CMake gives you the option to build CUDA architectures as real, virtual, or both (the default). My understanding is that if a matching real architecture is present at runtime it can be used directly; otherwise, JIT compilation creates the binary code from a suitable virtual architecture. However, the CUDA architectures we define are the lowest possible ones for the features we use, so the compiled real architectures basically never see any use. We may as well skip them to speed up compilation and reduce binary size.
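For reference, this real/virtual split is expressed through `CMAKE_CUDA_ARCHITECTURES` (CMake 3.18+), where each entry can carry a `-real` or `-virtual` suffix and a plain number means both. A minimal sketch, using 8.0 purely as an example rather than the list ggml actually defines:

```cmake
# Illustrative values for CMAKE_CUDA_ARCHITECTURES (CMake >= 3.18).
# The architecture numbers here are examples, not ggml's actual list.

# Real only: embeds binary code (SASS) for sm_80; runs directly on
# matching GPUs but cannot be JIT-recompiled for newer ones.
set(CMAKE_CUDA_ARCHITECTURES "80-real")

# Virtual only: embeds PTX for compute_80; JIT-compiled at runtime for
# whatever GPU is present with compute capability >= 8.0.
set(CMAKE_CUDA_ARCHITECTURES "80-virtual")

# No suffix: embeds both SASS and PTX (the default described above,
# and the largest binary).
set(CMAKE_CUDA_ARCHITECTURES "80")
```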

On my systems, the binary size and total compilation time (CT) of llama.cpp without ccache and with GGML_NATIVE=OFF change as follows:

| Setup | Size of ggml-cuda.so [MiB] | CT Epyc 7742 (64C/128T) [s] | CT Xeon E5-2683 v4 (16C/32T) [s] |
|---|---|---|---|
| master | 114 | 324 | 562 |
| PR | 60 | 222 | 406 |

The difference seems to be particularly noticeable on CPUs with more cores, since towards the end of the build the compilation of the entire program is left waiting on just two CUDA kernels.

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Apr 27, 2025
JohannesGaessler (Collaborator, Author)

I forgot: while working on this I noticed that, due to JIT compilation, the reported performance on the first run can be very low. The warmup run is IIRC always done with a batch size of 2, and I think it doesn't trigger compilation of e.g. the MMQ kernels.

slaren (Member) commented Apr 29, 2025

Focusing only on the binary releases, it's not completely obvious to me that a smaller binary is more important than faster startup on the first run. I would guess that for most people using LLMs, which are typically tens of GiB, downloading 60 MiB more is better than a worse first-run experience. The compile time shouldn't be too important, since this configuration isn't used for development anyway.

JohannesGaessler (Collaborator, Author)

Not that I have done a survey, but I think most of our users are on compute capability 8.6 or 8.9. Note that regardless of this PR, a JIT compilation of virtual architecture 8.0 will be necessary for them. Maybe it would make sense to also add real architectures for those GPUs that we expect to be in frequent use?

slaren (Member) commented Apr 30, 2025

Yes, I agree. Including real archs for 8.6 and 8.9 and keeping the rest virtual should be good for most users.

JohannesGaessler force-pushed the cuda-jit branch 2 times, most recently from 500a9bb to 494d862 on May 6, 2025 20:10
JohannesGaessler (Collaborator, Author) commented May 6, 2025

When adding real architectures 8.6 and 8.9:

| Setup | Size of ggml-cuda.so [MiB] | CT Epyc 7742 (64C/128T) [s] | CT Xeon E5-2683 v4 (16C/32T) [s] |
|---|---|---|---|
| master | 114 | 324 | 562 |
| PR, all virtual | 60 | 222 | 406 |
| PR, mixed | 87 | 302 | 602 |
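For illustration, a mixed list along these lines could look roughly like the following in CMake; this is a sketch of the idea, not necessarily the exact list used by the PR:

```cmake
# Sketch of a mixed architecture list: PTX (virtual) for broad forward
# compatibility, plus binary code (real) for the consumer GPUs expected
# to be most common (compute capabilities 8.6 and 8.9). Illustrative only.
set(CMAKE_CUDA_ARCHITECTURES
    "50-virtual;61-virtual;70-virtual;75-virtual;80-virtual;86-real;89-real")
```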

JohannesGaessler (Collaborator, Author)

@LostRuins since I remember you being concerned with compile times and the size of CUDA binaries: you may want to build only the virtual architectures (if you're not doing so already).
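As a usage sketch, a virtual-only build can be requested at configure time; the flags below are standard llama.cpp/CMake options, but the architecture list is only an illustration, not a recommendation of specific values:

```sh
# Hypothetical configure line for a virtual-only (PTX-only) CUDA build.
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=OFF \
    -DCMAKE_CUDA_ARCHITECTURES="50-virtual;61-virtual;70-virtual;75-virtual;80-virtual"
```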

JohannesGaessler merged commit 141a908 into ggml-org:master on May 6, 2025 (41 of 45 checks passed)
LostRuins (Collaborator)

Thanks! This would be very useful!

Also, some interesting related stuff since you're here @JohannesGaessler - we had a user who owns a K6000, so we tried a cu11 build with set(CMAKE_CUDA_ARCHITECTURES "35;50;61;70;75"), and it seems to work!

We also had a different user some time back try set(CMAKE_CUDA_ARCHITECTURES "37;50;61;70;75") on their K80, and apparently that worked too.

I'd assume if we do 35-virtual it should probably work fine for both of them? I don't really understand how PTX JIT works. I wish it were easier to get these GPUs to test, but finding them on cloud services is practically impossible; I can't even get a P40 nowadays.

Also, it's my (probably wrong) understanding that the difference between 86-real and 86-virtual is only startup time (not performance), which would probably be cached and thus identical in performance from the second run onwards? Sorry for the barrage of questions.

JohannesGaessler (Collaborator, Author) commented May 7, 2025

I'd assume if we do 35-virtual it should probably work fine for both of them?

Yes. -virtual should always be enough as long as it's possible to compile the PTX code (the CUDA equivalent of assembly) to binary code (code that can be run directly on a GPU). My understanding is that on Linux we dynamically link against the CUDA libraries for this; I think for koboldcpp you are statically linking the CUDA toolkit into the executable, but I would assume it would still work.

I don't really understand how PTX JIT works.

The "high-level" CUDA code (C equivalent) is compiled to PTX (assembly equivalent). The PTX code is basically just a stream of instructions from the PTX ISA. To actually use that PTX code it needs to again be compiled and optimized for a specific GPU architecture. With -virtual CMake produces PTX code that is then compiled just-in-time to device code on the machine where the program is run; the PTX code is (with some exceptions) forwards-compatible so you only need PTX code with the minimum compute capability to cover all of the instructions that are being used and it can then be compiled for any compute capability that is at least as high. With the current ggml code it should be possible to compile the code just for 50-virtual and it should run on all GPUs (but with bad performance because none of the new instructions are used).

With -real, CMake produces device code that can be run instantly and without any additional compilation, but that device code cannot be recompiled for a higher architecture.

Without a -virtual or -real suffix, CMake produces both.
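For what it's worth, my understanding of how these variants map to the flags CMake passes to nvcc (using 8.0 as an example; this is a descriptive sketch, not something the PR adds):

```cmake
# How CMake translates each variant into nvcc -gencode flags, to the best
# of my understanding (8.0 used as an example):
#
#   "80-real"    -> -gencode arch=compute_80,code=[sm_80]
#                   (binary code only; runs directly, not recompilable)
#   "80-virtual" -> -gencode arch=compute_80,code=[compute_80]
#                   (PTX only; JIT-compiled on the target machine)
#   "80"         -> -gencode arch=compute_80,code=[sm_80,compute_80]
#                   (both embedded)
set(CMAKE_CUDA_ARCHITECTURES "80-virtual")
```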

Also, it's my (probably wrong) understanding that the difference between 86-real and 86-virtual is only startup time (not performance), which would probably be cached and thus identical in performance from the second run onwards?

The device code is cached, so the compilation only happens on the first run. There would be no benefit to adding 86-virtual because all instructions that ggml uses are already covered by 80-virtual, while 86-virtual is not compatible with A100 GPUs (compute capability 8.0). If, however, you compile the code only for 8.6, then 86-virtual would have the advantage of also being usable for compute capabilities >8.6 (versus 86-real, which can only be used with 8.6).

I can't even get a P40 nowadays.

I currently have a machine with 3 P40s. I only use it for development, and I'm thinking about replacing one of the P40s with a V100 at some point, since I expect them to become more affordable once datacenters start dumping them. I haven't yet decided what to do with the replaced P40; I was thinking I would ask around IRL whether someone wants it, but if not I could also give it to you.

JohannesGaessler (Collaborator, Author)

Actually, I think binary size was a concern for @jart as well, so I'm tagging her too.

LostRuins (Collaborator)

That's a very kind offer, but no thanks, that's not necessary - I primarily use a laptop and already have my RTX 4090 laptop, which more than meets my needs. I don't actually have or use a desktop PC.

I was referring to using cloud services to test builds - previously I used Runpod and especially Vast.AI to test support for older GPUs, but lately the supply of these has been drying up. I was wondering if you knew of any cloud providers that specialize in provisioning VMs with old GPUs.

Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request May 24, 2025
To speed up compilation time and reduce binary size.

Link: ggml-org/llama.cpp#13135
Author: Johannes Gaessler.