CUDA: build archs as virtual for GGML_NATIVE=OFF #13135
Conversation
I forgot: while working on this I noticed that due to JIT compilation the reported performance on the first run can be very low. The warmup run is IIRC always done with a batch size of 2 and I think it doesn't trigger compilation of e.g. MMQ.
Focusing only on the binary releases, it's not completely obvious to me that a smaller binary is more important than faster startup on the first run. I would guess that for most people using LLMs, which are typically tens of GiB, downloading 60 MiB more is better than a worse first-run experience. The compile time shouldn't be too important, since this path isn't used for development anyway.
Not that I have done a survey, but I think most of our users are on compute capability 8.6 or 8.9. So note that, regardless of this PR, a JIT compilation of virtual architecture 8.0 will be necessary. Maybe it would make sense to also add real architectures for those GPUs that we expect to be in frequent use?
Yes, I agree. Including real archs for 8.6 and 8.9 and keeping the rest virtual should be good for most users.
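For illustration, a rough sketch of what such a configuration could look like at configure time. The concrete architecture list below is an assumption, not the exact change in this PR; `CMAKE_CUDA_ARCHITECTURES` with `-real`/`-virtual` suffixes is standard CMake syntax (CMake ≥ 3.18).

```sh
# Hypothetical configure line: common consumer GPUs (8.6, 8.9) get real (SASS)
# code so they skip JIT on the first run; everything else stays PTX-only (virtual).
cmake -B build \
    -DGGML_CUDA=ON \
    -DGGML_NATIVE=OFF \
    -DCMAKE_CUDA_ARCHITECTURES="60-virtual;70-virtual;75-virtual;80-virtual;86-real;89-real"
```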
Force-pushed from 500a9bb to 494d862
When adding real architectures 8.6 and 8.9:
@LostRuins since I remember you being concerned with compile times and the size of CUDA binaries: you may want to build only the virtual architectures (if you're not doing so already).
Thanks! This would be very useful! Also some interesting related stuff since you're here @JohannesGaessler: we had a user who owns a K6000, so we tried a build with cu11 […]. We also had another different user some time back try […]. I'd assume if we do […]?

Also, it's my (probably wrong) understanding that the difference between 86-real and 86-virtual is only startup time (not performance)? Which would probably be cached and thus identical in performance from the second run onwards?

Sorry for the barrage of questions.
Yes.
The "high-level" CUDA code (C equivalent) is compiled to PTX (assembly equivalent). The PTX code is basically just a stream of instructions from the PTX ISA. To actually use that PTX code it needs to again be compiled and optimized for a specific GPU architecture. With With Without
The device code is cached so the compilation only happens on the first run. There would be no benefit to adding the real architecture other than a faster first run.
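For reference, the JIT results are stored in CUDA's compute cache, which can be relocated or resized through environment variables if the defaults are a problem. The values below are only examples:

```sh
export CUDA_CACHE_PATH="$HOME/.nv/ComputeCache"  # where JIT-compiled device code is kept
export CUDA_CACHE_MAXSIZE=1073741824             # cache size limit in bytes (example: 1 GiB)
# CUDA_CACHE_DISABLE=1 would force a re-JIT on every run, so leave it unset.
```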
I currently have a machine with 3 P40s. I only use it for development and I'm thinking about at some point replacing one of the P40s with a V100, since I expect them to become more affordable once datacenters start dumping them. I haven't yet decided what to do with the replaced P40; I was thinking I would ask around IRL whether someone wants it, but if not I could also give it to you.

Actually, I think binary size was a concern for @jart as well, so I'm tagging her too.
That's a very kind offer, but no thanks, that's not necessary - I use a laptop primarily, and already have my RTX 4090 laptop which more than meets my needs. I don't actually have or use a desktop PC. I was referring to using cloud services to test builds - previously I used Runpod and especially Vast.AI to test support for older GPUs, but lately the supply of these has been drying up. I was wondering if you knew of cloud providers that specialize in provisioning VMs with old GPUs.
To speed up compilation time and reduce binary size. Link: ggml-org/llama.cpp#13135. Author: Johannes Gaessler.
See ggml-org/ggml#1154.
CMake gives you the option to build CUDA architectures as either real, virtual, or both (the default). My understanding is that if at runtime a real architecture is present it can be used directly; otherwise, JIT compilation is used to create the binary code if a suitable virtual architecture is present. However, the CUDA architectures we define are the lowest possible ones for the features that we use, and as a result the compiled real architectures basically never see any use. So we may as well skip them to speed up the compilation process and reduce binary size.
On my systems the binary size and total compilation time of llama.cpp without ccache and with GGML_NATIVE=OFF change as follows:

The difference seems to be particularly noticeable on CPUs with more cores, since the compilation of the entire program spends a long time waiting for only 2 CUDA kernels.
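If you want to verify which architectures actually ended up in a given build, cuobjdump can list the embedded SASS (real) and PTX (virtual) code. The library path below is just an example and depends on your build setup:

```sh
cuobjdump --list-elf build/bin/libggml-cuda.so   # real architectures (SASS/cubin)
cuobjdump --list-ptx build/bin/libggml-cuda.so   # virtual architectures (PTX)
```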