llama: move page cache via mbind to prevent cross-NUMA access by vishalc-ibm · Pull Request #13335 · ggml-org/llama.cpp · GitHub

llama: move page cache via mbind to prevent cross-NUMA access #13335


Closed
wants to merge 3 commits

Conversation

@vishalc-ibm commented May 6, 2025

Page cache pages are retained in the memory of a node after running llama-bench bound to that node on multi-node systems, incurring a cross-NUMA memory access penalty for subsequent runs of llama-bench bound to a different node. This commit introduces a best-effort mbind call to move the pages to the target node where llama-bench is executing, ensuring optimal NUMA locality. Additionally, the necessary NUMA headers are included and the build is updated to link against the NUMA library.
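For context, a minimal sketch of what such a best-effort mbind call can look like, assuming the model file is mmap'd at `addr` with length `len` and the binary links against libnuma (`-lnuma`); the helper name is illustrative and this is not the PR's exact code:

```c
#define _GNU_SOURCE
#include <sched.h>   /* sched_getcpu */
#include <stdio.h>   /* perror */
#include <numa.h>    /* numa_available, numa_node_of_cpu, nodemask helpers */
#include <numaif.h>  /* mbind, MPOL_BIND, MPOL_MF_MOVE */

/* Best effort: migrate the already-resident page cache pages backing
 * [addr, addr+len) to the NUMA node of the calling CPU. `addr` must be
 * page-aligned, which mmap guarantees. */
static void move_mapping_to_local_node(void *addr, size_t len) {
    if (numa_available() < 0) {
        return;  /* kernel without NUMA support: nothing to do */
    }
    int node = numa_node_of_cpu(sched_getcpu());
    if (node < 0) {
        return;
    }
    struct bitmask *nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, (unsigned) node);
    /* MPOL_MF_MOVE migrates pages already faulted in; pages mapped by
     * other processes as well stay put unless MPOL_MF_MOVE_ALL is used,
     * which requires CAP_SYS_NICE. */
    if (mbind(addr, len, MPOL_BIND, nodes->maskp, nodes->size + 1,
              MPOL_MF_MOVE) != 0) {
        perror("mbind");  /* best effort: log and continue */
    }
    numa_free_nodemask(nodes);
}
```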

Experiments:

  1. Run llama-bench on node 1 (base)
  2. Run llama-bench on node 0 (regression observed)
  3. Run patched llama-bench on node 0 (throughput same as base)
  • /usr/bin/time -p numactl -N 1 -m 1 $llama-bench -m $models/llama-2-7b-chat.Q8_0.gguf -ngl 0 --prio 0 -b 1 -t 24
| model                          |       size |     params | backend    | threads | n_batch |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         pp512 |          5.39 ± 0.01 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         tg128 |          5.49 ± 0.03 |

build: 35782ae (5014)
real 687.60
user 15653.73
sys 42.67

  • /usr/bin/time -p numactl -N 0 -m 0 $llama-bench -m $models/llama-2-7b-chat.Q8_0.gguf -ngl 0 --prio 0 -b 1 -t 24
| model                          |       size |     params | backend    | threads | n_batch |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         pp512 |          4.60 ± 0.01 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         tg128 |          4.67 ± 0.03 |

build: 35782ae (5014)
real 805.99
user 18187.26
sys 48.93

  • /usr/bin/time -p numactl -N 0 -m 0 $patched-llama-bench -m $models/llama-2-7b-chat.Q8_0.gguf -ngl 0 --prio 0 -b 1 -t 24
| model                          |       size |     params | backend    | threads | n_batch |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         pp512 |          5.35 ± 0.01 |
| llama 7B Q8_0                  |   6.67 GiB |     6.74 B | CPU        |      24 |       1 |         tg128 |          5.46 ± 0.02 |

build: 35782ae (5014)
real 696.12
user 15735.41
sys 44.08

Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>

@vishalc-ibm (Author) commented

Checks are failing due to the absence of the numa.h header on the test system.

@vishalc-ibm (Author) commented

Submitted another PR #13731

Fixed the build issue by using CMake to detect whether libnuma and its headers are present, and only building with move_pages support when they are.
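For reference, a rough sketch of that kind of configure-time check; the target name `llama` and the `GGML_NUMA_MBIND` definition are hypothetical, not taken from PR #13731:

```cmake
# Detect libnuma at configure time and only then enable the
# page-migration code path.
include(CheckIncludeFile)
check_include_file("numa.h" HAVE_NUMA_H)
find_library(NUMA_LIBRARY numa)

if (HAVE_NUMA_H AND NUMA_LIBRARY)
    target_compile_definitions(llama PRIVATE GGML_NUMA_MBIND)
    target_link_libraries(llama PRIVATE ${NUMA_LIBRARY})
endif()
```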

Closing this PR now.
