llama: move page cache via mbind to prevent cross-NUMA access by vishalc-ibm · Pull Request #13335 · ggml-org/llama.cpp · GitHub

llama: move page cache via mbind to prevent cross-NUMA access #13335


Closed
wants to merge 3 commits

Conversation

@vishalc-ibm commented May 6, 2025

Page cache pages are retained in the memory of a node after running llama-bench bound to that node on multi-node systems, incurring a cross-NUMA memory access penalty for subsequent runs of llama-bench bound to a different node. This commit introduces a best-effort mbind call to move the pages to the target node where llama-bench is executing, ensuring optimal NUMA locality. Additionally, the necessary NUMA headers are included and the build is updated to link against the NUMA library.
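For context, a minimal sketch of what such a best-effort mbind call can look like, assuming the model file is mmap'd at `addr` with length `len` and the binary links against libnuma (`-lnuma`); the helper name is illustrative and this is not the PR's exact code:

```c
#define _GNU_SOURCE
#include <sched.h>   /* sched_getcpu */
#include <stdio.h>   /* perror */
#include <numa.h>    /* numa_available, numa_node_of_cpu, nodemask helpers */
#include <numaif.h>  /* mbind, MPOL_BIND, MPOL_MF_MOVE */

/* Best effort: migrate the already-resident page cache pages backing
 * [addr, addr+len) to the NUMA node of the calling CPU. `addr` must be
 * page-aligned, which mmap guarantees. */
static void move_mapping_to_local_node(void *addr, size_t len) {
    if (numa_available() < 0) {
        return;  /* kernel without NUMA support: nothing to do */
    }
    int node = numa_node_of_cpu(sched_getcpu());
    if (node < 0) {
        return;
    }
    struct bitmask *nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, (unsigned) node);
    /* MPOL_MF_MOVE migrates pages already faulted in; pages mapped by
     * other processes as well stay put unless MPOL_MF_MOVE_ALL is used,
     * which requires CAP_SYS_NICE. */
    if (mbind(addr, len, MPOL_BIND, nodes->maskp, nodes->size + 1,
              MPOL_MF_MOVE) != 0) {
        perror("mbind");  /* best effort: log and continue */
    }
    numa_free_nodemask(nodes);
}
```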

Experiments:

  1. Run llama-bench on node 1 (base)
  2. Run llama-bench on node 0 (regression observed)
  3. Run patched llama-bench on node 0 (throughput same as base)
  • /usr/bin/time -p numactl -N 1 -m 1 $llama-bench -m $models/llama-2-7b-chat.Q8_0.gguf -ngl 0 --prio 0 -b 1 -t 24
| model                          |       size |     params | backend    | threads | n_batch |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         pp512 |          5.39 ± 0.01 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         tg128 |          5.49 ± 0.03 |

build: 35782ae (5014)
real 687.60
user 15653.73
sys 42.67

  • /usr/bin/time -p numactl -N 0 -m 0 $llama-bench -m $models/llama-2-7b-chat.Q8_0.gguf -ngl 0 --prio 0 -b 1 -t 24
| model                          |       size |     params | backend    | threads | n_batch |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         pp512 |          4.60 ± 0.01 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         tg128 |          4.67 ± 0.03 |

build: 35782ae (5014)
real 805.99
user 18187.26
sys 48.93

  • /usr/bin/time -p numactl -N 0 -m 0 $patched-llama-bench -m $models/llama-2-7b-chat.Q8_0.gguf -ngl 0 --prio 0 -b 1 -t 24
| model                          |       size |     params | backend    | threads | n_batch |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         pp512 |          5.35 ± 0.01 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         tg128 |          5.46 ± 0.02 |

build: 35782ae (5014)
real 696.12
user 15735.41
sys 44.08

Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>

@vishalc-ibm (Author) commented

Checks are failing due to the absence of the numa.h header on the test system.

@vishalc-ibm (Author) commented

Submitted another PR #13731

Fixed the build issue by using CMake to detect whether libnuma and its headers are present, and only building with move_pages support when they are.
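For reference, a rough sketch of that kind of configure-time check; the target name `llama` and the `GGML_NUMA_MBIND` definition are hypothetical, not taken from PR #13731:

```cmake
# Detect libnuma at configure time and only then enable the
# page-migration code path.
include(CheckIncludeFile)
check_include_file("numa.h" HAVE_NUMA_H)
find_library(NUMA_LIBRARY numa)

if (HAVE_NUMA_H AND NUMA_LIBRARY)
    target_compile_definitions(llama PRIVATE GGML_NUMA_MBIND)
    target_link_libraries(llama PRIVATE ${NUMA_LIBRARY})
endif()
```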

Closing this PR now.
