Feature Request: Installable package via winget #8188

Open · 4 tasks done
ngxson opened this issue Jun 28, 2024 · 18 comments

Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments
@ngxson
Collaborator
ngxson commented Jun 28, 2024

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

On macOS/Linux, users can easily install a pre-built version of llama.cpp via brew.

It would be nice to have an equivalent on Windows, via winget.
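For comparison, this is the workflow being requested; the winget package name below is only a placeholder, since no such package exists yet:

```
# macOS/Linux today
brew install llama.cpp

# Desired Windows equivalent (hypothetical package name)
winget install llama.cpp
```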

Motivation

The pre-built binaries are already available via releases: https://github.com/ggerganov/llama.cpp/releases

It would be nice to somehow push them to https://winget.run/

However, I'm not familiar with working on Windows, so I'm creating this issue to discuss further and to look for help from the community.

Possible Implementation

No response

ngxson added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels on Jun 28, 2024
@slaren
Member
slaren commented Jun 28, 2024

The macOS situation is a bit easier to handle because there is only the Metal backend, and we know that it is available on every device. For Windows and Linux we would need different builds for CPU AVX, CPU AVX2, CPU AVX512, CUDA, HIP, SYCL, Vulkan, ... which is not very user friendly, to say the least.

@gabrielgrant
gabrielgrant commented Sep 26, 2024

@slaren there are already builds for all those different compute acceleration options. I agree that choosing which backend to install is rather confusing, but I don't understand why making the existing builds available via winget would make things any more confusing than they already are?

@max-krasnyansky
Collaborator

Ah, I didn't see this issue earlier. I kind of started looking into this already as well.
I wanted to publish winget packages for Windows on ARM64 (Snapdragon X-Elite) and a decently optimized x86-64 build.
For ARM64 we can just enable the CPU backend (ARMv8.7 NEON with MatMul INT8) for now. For x86-64 I was thinking of also publishing a CPU build with up to AVX512 for now.
Basically, at least enable folks to easily install a usable version; if they are looking for the best performance they can install builds from our releases directly.
I don't know how well winget handles package flavors yet; perhaps there is a way to publish multiple packages for the same arch and have it pick one based on the machine details.
If winget itself is not great at this, we can add some wrapper scripts that fetch better versions.

@slaren
Member
slaren commented Sep 26, 2024

I don't think that uploading a dozen different packages to winget is going to be a good experience for users. It's barely tolerable as it is on the GitHub releases, and there we assume that users are somewhat technically competent and able to choose the right version. I would expect a winget package to be easier to use than what we can currently offer.

@gabrielgrant

@slaren gotcha

@max-krasnyansky having a script to algorithmically determine which build is most appropriate would be great (whether for winget or even just for determining which of the builds on GH will run on a given machine)

@max-krasnyansky
Collaborator

Sorry if I wasn't clear. My plan was to publish decent CPU versions to start, so that a simple winget install llama.cpp works.
Users get a usable version with basically zero effort.

Then we can look into having winget "auto-select" a better-optimized version based on the user's machine.
I.e. I'm definitely not suggesting users do something like winget install llama.cpp-cuda-v3.1-avx512 :)

If winget itself can't do that (I need to look a bit deeper into their package metadata format and options), then we can figure out something else. Maybe our own script, etc.
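For illustration, such a wrapper could be as simple as the sketch below; the build flavor names and the selection rules are hypothetical, purely to show the shape of the idea:

```powershell
# Hypothetical build selection -- flavor names are placeholders, not real packages
$arch      = $env:PROCESSOR_ARCHITECTURE                                  # "AMD64" or "ARM64"
$hasNvidia = [bool](Get-Command nvidia-smi -ErrorAction SilentlyContinue) # crude CUDA check

if ($arch -eq "ARM64") { $flavor = "cpu-arm64" }
elseif ($hasNvidia)    { $flavor = "cuda-x64" }
else                   { $flavor = "vulkan-x64" }

Write-Host "Would fetch the '$flavor' build from the llama.cpp releases page"
```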

@AndreasKunar
Contributor

@max-krasnyansky :
Ollama v0.3.12 supports winget install, and it now also works great / natively on my Snapdragon X Elite Surface Laptop 7 on Windows (for ARM). I did not look into the details, but it might be a good starting point (since it builds on top of the llama.cpp base).

@max-krasnyansky
Collaborator

@max-krasnyansky : Ollama v0.3.12 supports winget install, and it now also works great / natively on my Snapdragon X Elite Surface Laptop 7 on Windows (for ARM). I did not look into the details, but it might be a good starting point (since it builds on top of the llama.cpp base).

Yep, I saw that they have a winget package. I thought it was still an x86-64 build though. Will take a look at the latest.

@slaren
Member
slaren commented May 2, 2025

I think we are in a good position now to start releasing winget packages. I have been looking into how this could be done, and it looks fairly simple. Winget can automatically install a zip file to a private directory and add the executables to the path, so we don't really need to do much beyond creating a manifest. An example of how to do this is the cURL package.

Once we create and submit the manifest, updates can be handled with a GitHub action such as this one.

My plan is to start with an x64 package that includes only the Vulkan backend. This should be fairly safe, since Vulkan is likely to work well enough on most systems. Currently, including multiple GPU backends can cause issues when the same devices are supported by more than one backend; once this is resolved, the package will be extended to include other backends. Packages for Arm can also be added later.
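For local testing before submitting, the manifest can be validated and installed straight from disk, roughly like this (the manifest path below is just a placeholder):

```powershell
# One-time (admin): allow installing from local manifest files
winget settings --enable LocalManifestFiles

# Validate the manifest directory against the winget schema
winget validate .\manifests\llama.cpp\

# Test the zip/portable install end to end from the local manifest
winget install --manifest .\manifests\llama.cpp\
```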

@ngxson
Collaborator Author
ngxson commented May 2, 2025

@slaren I've also been thinking about this lately; the name of each archive in the release looks quite confusing even to an experienced user like me (i.e. should I use AVX, AVX2 or AVX512 for my Windows laptop? I don't even know which processor I'm using).

Just asking: could we maybe have a "super" package that decides which package to download? For example, if the system has CUDA, great, download the CUDA build.

@ngxson
Collaborator Author
ngxson commented May 2, 2025

Btw, not sure if this helps, but we could also allow the binary to download the .dll files using the curl functionality in libcommon.

@slaren
Member
slaren commented May 2, 2025

@ngxson This is handled by loading backends dynamically. After #13220 there will be a single release for the CPU build. We can't include every backend in a single package at the moment because of the problem I mentioned, but eventually that will be the goal.

@ericcurtin
Collaborator

The macOS situation is a bit easier to handle because there is only the Metal backend, and we know that it is available on every device. For Windows and Linux we would need different builds for CPU AVX, CPU AVX2, CPU AVX512, CUDA, HIP, SYCL, Vulkan, ... which is not very user friendly, to say the least.

I know this is faaaar from simple, but if llama.cpp solves this and gets these backends to load more dynamically, llama.cpp could end up becoming the AI version of "Mesa" for Linux and Windows.

@ericcurtin
Collaborator

If winget does indeed use just Vulkan for now, it's a good start

@AndreasKunar
Contributor

If winget does indeed use just Vulkan for now, it's a good start

A winget package with the Vulkan backend might be a "universal" option for Windows x64. For Windows arm64 I suggest using only the CPU backend.

I think llama.cpp's advantage is its cutting-edge CPU and GPU optimizations; I'm not sure all of them can be selected at runtime.

@slaren
Member
slaren commented May 25, 2025

For Windows arm64 I suggest using only the CPU backend.

Would there be any advantage to also bundling the OpenCL backend in the arm64 package? Unless the model is offloaded with -ngl, the OpenCL backend wouldn't be active anyway, so it should be fairly safe to include it. Are there any Arm chips other than the Qualcomm Snapdragon X being used in Windows Arm64 devices at the moment?
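For context, whether the GPU backend does any work at all is controlled by the -ngl flag, e.g. (the model path below is a placeholder):

```
# Nothing offloaded: the OpenCL backend stays idle and the CPU path is used
llama-cli -m model.gguf -ngl 0 -p "hello"

# Explicitly offload layers to the GPU backend
llama-cli -m model.gguf -ngl 99 -p "hello"
```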

@AndreasKunar
Contributor

For Windows arm64 I suggest using only the CPU backend.

Would there be any advantage to also bundling the OpenCL backend in the arm64 package? Unless the model is offloaded with -ngl, the OpenCL backend wouldn't be active anyway, so it should be fairly safe to include it. Are there any Arm chips other than the Qualcomm Snapdragon X being used in Windows Arm64 devices at the moment?

Besides bare-metal Windows on Arm on Snapdragon, there are probably a lot of Windows on Arm instances running in VMs on Parallels/UTM/... on Macs; probably not an issue for installing/running llama.cpp in the VM, because it runs much, much faster on the host. Also there might be some Raspberry Pis and similar SBCs running WoA via WoR, but I don't think they would be a target for a winget install. I think the rest of the Arm chips run Linux. So the OpenCL backend could be interesting.

OpenCL currently might be a good idea to save some power / avoid some CPU throttling. I could not really measure its power-saving impact on my Snapdragon X machine, because it does not expose CPU/GPU/NPU power use (unlike Apple, NVIDIA, ...). The Snapdragon X's GPU with OpenCL is currently a bit slower than its very fast CPU with the Q4_0 re-pack and GEMM/GEMV optimizations. As much as I admire and applaud the team's efforts to build the OpenCL backend for the Snapdragon, I think it will mostly matter for future SoCs with faster GPUs.
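(For anyone who wants to reproduce that speed comparison, llama-bench makes it straightforward to measure both paths on the same model; the model path below is a placeholder.)

```
# CPU-only vs. full offload to the OpenCL backend on the same model
llama-bench -m model.gguf -ngl 0
llama-bench -m model.gguf -ngl 99
```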

And I'm not sure how much the Microsoft + Qualcomm/Intel/AMD efforts on ONNX integration into Windows ML / Windows AI Foundry will impact llama.cpp's use on Windows on Arm. ONNX also runs some models on the GPU/NPU, but I have not been able to benchmark this yet. I'm not convinced that Windows ML / Windows AI Foundry, which seems to supersede last year's DirectML, will be very successful; it doesn't cover cross-platform scenarios well enough, e.g. with its CPU-only Mac implementation.

As announced in the Qualcomm keynote at Computex last week, the Docker team that builds Docker Model Runner (DMR) is bringing DMR to Windows on Arm, and to my knowledge they based DMR on llama.cpp. To me, their great idea is, instead of waiting for the various GPU OEMs to provide virtual GPU support for AI in containers, to run llama.cpp with its backends inside the Docker host and to provide a managed/secure OpenAI-compatible API server to the containers. They also treat the .gguf-based models similarly to Docker images, which eases distribution and updating. To me, DMR together with their Docker MCP Catalog and Toolkit is a very, very interesting development (see their Build 2025 session talk for an excellent overview).

@ericcurtin
Collaborator

As announced in the Qualcomm keynote at Computex last week, the Docker team that builds Docker Model Runner (DMR) is bringing DMR to Windows on Arm, and to my knowledge they based DMR on llama.cpp. To me, their great idea is, instead of waiting for the various GPU OEMs to provide virtual GPU support for AI in containers, to run llama.cpp with its backends inside the Docker host and to provide a managed/secure OpenAI-compatible API server to the containers. They also treat the .gguf-based models similarly to Docker images, which eases distribution and updating. To me, DMR together with their Docker MCP Catalog and Toolkit is a very, very interesting development (see their Build 2025 session talk for an excellent overview).

Sounds very similar to a project I know... Sigh...
