Releases · ggml-org/llama.cpp

26 Jul 10:16

11dd5a4

b5996 Latest

CANN: Implement GLU ops (#14884)

Implement REGLU, GEGLU, SWIGLU ops according to #14158
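
For readers unfamiliar with the three gated linear unit (GLU) variants named above, here is a minimal reference sketch in plain Python. It is not the CANN implementation. Three things are assumptions made for illustration: the input row is split into two equal halves, the activation is applied to the first half, and GELU uses the erf form. The helper names (`glu`, `reglu`, `geglu`, `swiglu`) are hypothetical.

```python
import math


def relu(v: float) -> float:
    return max(0.0, v)


def gelu(v: float) -> float:
    # erf-based GELU; a tanh approximation is also common (assumption: erf form)
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def silu(v: float) -> float:
    # SiLU / swish: v * sigmoid(v)
    return v / (1.0 + math.exp(-v))


def glu(row, act):
    """Split `row` into two halves and return act(first half) * second half.

    Which half is activated is an assumption of this sketch.
    """
    if len(row) % 2 != 0:
        raise ValueError("row length must be even")
    half = len(row) // 2
    x, gate = row[:half], row[half:]
    return [act(a) * g for a, g in zip(x, gate)]


def reglu(row):
    return glu(row, relu)


def geglu(row):
    return glu(row, gelu)


def swiglu(row):
    return glu(row, silu)
```

For example, `reglu([1.0, -2.0, 3.0, 4.0])` multiplies `relu(1.0)` by `3.0` and `relu(-2.0)` by `4.0`.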

Assets 15

| Asset | Digest | Size | Timestamp |
|---|---|---|---|
| cudart-llama-bin-win-cuda-12.4-x64.zip | sha256:8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6 | 373 MB | 2025-07-26T10:16:53Z |
| llama-b5996-bin-macos-arm64.zip | sha256:8fa956663276686f24dde3c0e81d2f04e865e92af90d71e890c3cfcea794855a | 10.6 MB | 2025-07-26T10:17:03Z |
| llama-b5996-bin-macos-x64.zip | sha256:3670a942bc23d8cc4e877d8573850fbc47731cbd024357de4f5dd383013f6c81 | 27.2 MB | 2025-07-26T10:17:04Z |
| llama-b5996-bin-ubuntu-vulkan-x64.zip | sha256:f94dad8c29008f68b2ab828cdffd31f4cb77ff930944816878939bae3586ff29 | 20.9 MB | 2025-07-26T10:17:05Z |
| llama-b5996-bin-ubuntu-x64.zip | sha256:bc6a0208486d0dd4a994af7bf3c5800ab594bc4557dbcfa130923339dbf6891e | 12.5 MB | 2025-07-26T10:17:06Z |
| llama-b5996-bin-win-cpu-arm64.zip | sha256:6aa8fc7c36112e0216e08f70a085bedf0982cec51a3fa276b22cc4b29092d10f | 10.9 MB | 2025-07-26T10:17:07Z |
| llama-b5996-bin-win-cpu-x64.zip | sha256:ee23aef0b9e7c488c6e5854299918ae309ea572486294aab7bacd587b82cc197 | 13.7 MB | 2025-07-26T10:17:08Z |
| llama-b5996-bin-win-cuda-12.4-x64.zip | sha256:f579d9969a97e6e803319b4447a2a2125c6efc40214e42e05a36f3c8fc4bc264 | 129 MB | 2025-07-26T10:17:09Z |
| llama-b5996-bin-win-hip-radeon-x64.zip | sha256:4d545bafbfb646262d5f282e8b8a3dd2ebcc4454234c38af86868807bceb5dd1 | 299 MB | 2025-07-26T10:17:13Z |
| llama-b5996-bin-win-opencl-adreno-arm64.zip | sha256:3655ee66adf68bb8d8f4d9690ccf0ca430e9b9a067f444bd26687d454dd8b211 | 11.2 MB | 2025-07-26T10:17:22Z |
| Source code (zip) | | | 2025-07-26T09:56:18Z |
| Source code (tar.gz) | | | 2025-07-26T09:56:18Z |
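
The `sha256:` digests listed with the assets can be used to check a downloaded archive. The following is a minimal sketch using only Python's standard `hashlib`. The function names `sha256_of_file` and `matches_published` are hypothetical helpers, and the sketch assumes the archive has already been downloaded to a local path.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published):
    """Compare a local file against a digest written as 'sha256:<hex>' or '<hex>'."""
    expected = published.split(":", 1)[-1].strip().lower()
    return sha256_of_file(path) == expected
```

With the Ubuntu x64 archive saved locally, `matches_published("llama-b5996-bin-ubuntu-x64.zip", "sha256:bc6a0208486d0dd4a994af7bf3c5800ab594bc4557dbcfa130923339dbf6891e")` should return `True` if the download is intact.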

26 Jul 02:57

github-actions

b5995

9b8f3c6

musa: fix build warnings (unused variable) (#14869)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

Assets 15

25 Jul 17:36

github-actions

b5994

c7f3169

ggml-cpu : disable GGML_NNPA by default due to instability (#14880)

* docs: update s390x document for sentencepiece

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit e086c5e3a7ab3463d8e0906efcfa39352db0a48d)

* docs: update huggingface links + reword

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 8410b085ea8c46e22be38266147a1e94757ef108)

* ggml-cpu: disable ggml-nnpa compile flag by default

fixes #14877

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 412f4c7c88894b8f55846b4719c76892a23cfe09)

* docs: update s390x build docs to reflect nnpa disable

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit c1eeae1d0c2edc74ab9fbeff2707b0d357cf0b4d)

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

Assets 15

25 Jul 17:12

github-actions

b5993

793c0d7

metal: SSM_SCAN performance (#14743)

* feat: Add s_off as a parameter in the args struct

This may not be necessary, but it more closely mirrors the CUDA kernel

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* perf: Parallelize mamba2 SSM_SCAN metal kernel over d_state

This is a first attempt at optimizing the metal kernel. The changes here
are:

- Launch the kernel with a thread group of size d_state
- Use simd groups and shared memory to do the summation for the y
  computation

When tested with G4 tiny preview, this shows roughly a 3x speedup on
prefill and 15% speedup on decode.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
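
The two-stage summation described in this commit (a per-SIMD-group sum, then a sum over the per-group results held in shared memory) can be illustrated with a sequential Python model. This is only a sketch of the reduction pattern, not the Metal kernel. The SIMD width of 32, the function names, and the form of the `y` computation (a dot product of the state with `C` over `d_state`) are assumptions made for illustration.

```python
SIMD_WIDTH = 32  # assumed SIMD-group width for this sketch


def simd_sum(lanes):
    """Model of a SIMD-group reduction: every lane contributes one value."""
    return sum(lanes)


def threadgroup_sum(per_thread_values, simd_width=SIMD_WIDTH):
    """Two-stage reduction over one thread group.

    Stage 1: each SIMD group sums its own lanes and stores the result in a
             shared-memory slot (modelled here as a list).
    Stage 2: a single SIMD group sums the shared-memory slots.
    """
    n = len(per_thread_values)
    if n % simd_width != 0:
        raise ValueError("thread group size must be a multiple of the SIMD width")
    shared = [
        simd_sum(per_thread_values[i:i + simd_width])
        for i in range(0, n, simd_width)
    ]
    if len(shared) > simd_width:
        # Stage 2 is handled by one SIMD group, so the number of groups
        # must not exceed the SIMD width.
        raise ValueError("too many SIMD groups for a single second-stage sum")
    return simd_sum(shared)


def ssm_scan_y(state, c):
    """Assumed form of the y computation: one thread per d_state element."""
    return threadgroup_sum([s * ci for s, ci in zip(state, c)])
```

With `d_state = 128` this model uses four SIMD groups in the first stage and a single four-element sum in the second.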

* fix: Update logic to correctly do the multi-layer parallel sum

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Correctly size the shared memory buffer and assert expected size relationships

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Compute block offsets once rather than once per token

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Use local variable for state recursion

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Use a secondary simd_sum instead of a for loop

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add assertion and comment about relationship between simd size and num simd groups

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parallelize over d_state for mamba-1

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parallel sum in SSM_CONV

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Revert "feat: Parallel sum in SSM_CONV"

After discussion with @compilade, the size of the parallelism here is
not worth the cost in complexity or overhead of the parallel for.

https://github.com/ggml-org/llama.cpp/pull/14743#discussion_r2223395357

This reverts commit 16bc059660c1c59e566628201c0ca2c20c9f4bc3.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Simplify shared memory sizing

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-Authored-By: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

Assets 15

25 Jul 15:40

github-actions

b5992

ce111d3

opencl: add fused `rms_norm_mul` (#14841)

* opencl: add fused `rms_norm` + `mul`

* opencl: improve workgroup size for `rms_norm_mul`
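
As a rough illustration of what fusing `rms_norm` with `mul` means, the sketch below computes the same result two ways in plain Python: as two separate passes, and as a single pass that applies the normalization scale and the element-wise multiplier together. It is not the OpenCL kernel. The `eps` default and the function signatures are assumptions chosen for the example.

```python
import math


def rms_norm(x, eps=1e-6):
    """Scale x so that its root mean square is approximately 1."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]


def mul(a, b):
    """Element-wise product of two equal-length rows."""
    return [p * q for p, q in zip(a, b)]


def rms_norm_mul(x, w, eps=1e-6):
    """Fused variant: normalize and multiply in one pass over x.

    The intermediate normalized row is never materialized.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * wi for v, wi in zip(x, w)]
```

Here `rms_norm_mul(x, w)` should agree with `mul(rms_norm(x), w)` up to floating-point rounding, while avoiding the intermediate row.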

Assets 15

25 Jul 12:19

github-actions

b5990

e2b7621

ggml : remove invalid portPos specifiers from dot files (#14838)

Neither "g" nor "x" is a valid portPos specifier per the official
[Graphviz documentation](https://graphviz.org/docs/attr-types/portPos/):

> If a compass point is used, it must have the form "n","ne","e","se","s","sw","w","nw","c","_".

I verified locally that Graphviz falls back to the default portPos when an
invalid one is specified, so the associated code can be removed.

Assets 15

25 Jul 12:07

github-actions

b5989

c1dbea7

context : restore preemptive sched reset when LLAMA_SET_ROWS=0 (#14870)

ggml-ci

Assets 15

25 Jul 11:24

github-actions

b5988

749e0d2

mtmd : fix 32-bit narrowing issue in export-lora and mtmd clip (#14503)

* [fix] Fix 32-bit narrowing issue in export-lora and mtmd clip

* Update export-lora.cpp

* Update clip.cpp

* Update export-lora.cpp

* format: use space to replace tab

Assets 15

25 Jul 10:33

github-actions

b5987

64bf1c3

rpc : check for null buffers in get/set/copy tensor endpoints (#14868)

Assets 15

25 Jul 08:44

github-actions

b5986

c12bbde

sched : fix multiple evaluations of the same graph with pipeline para…

Assets 15
