Contact Details
git@nic.cix.co.uk
What happened?
Improved NUMA support on Windows
When using Llamafile on a NUMA (Non-Uniform Memory Access) Windows system, being able to control which node Llamafile loads a model's memory into is crucial for performance, especially when running multiple models simultaneously. Currently, Llamafile appears to ignore the /NODE option specified in the Windows start command, always filling node 0's memory before spilling over to node 1, regardless of the node specified. Assigning CPU cores from other nodes works fine; the problem is limited to loading the model into the RAM of a specified node.
Current Behavior
Llamafile ignores the /NODE option specified in the Windows start command.
Memory is always filled on node 0 first, then node 1, regardless of the specified node.
This behavior causes performance issues when running multiple models on NUMA systems.
Expected Behavior
Llamafile should respect the /NODE option specified in the Windows start command.
Memory allocation should prioritize the specified node.
This would allow users to effectively distribute model loads across NUMA nodes for optimal performance.
Example
Currently, when using these Windows commands:
start /NODE 0 llamafile.exe ... llm_1.gguf
start /NODE 1 llamafile.exe ... llm_2.gguf
Both instances will load into node 0's memory, which is not expected.
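One way Llamafile could honor the start command's /NODE assignment is to ask Windows which NUMA node the process is actually scheduled on before allocating model memory. A minimal sketch, using the documented GetCurrentProcessorNumberEx and GetNumaProcessorNodeEx APIs (the helper name current_numa_node is hypothetical, and the non-Windows fallback simply reports node 0):

```c
#ifdef _WIN32
#include <windows.h>

/* Return the NUMA node of the CPU this thread is currently running on.
   When the process was launched with "start /NODE n", its threads are
   scheduled on that node's processors, so this reflects the /NODE choice. */
unsigned short current_numa_node(void) {
    PROCESSOR_NUMBER pn;
    USHORT node = 0;
    GetCurrentProcessorNumberEx(&pn);   /* which logical CPU are we on? */
    GetNumaProcessorNodeEx(&pn, &node); /* which NUMA node owns that CPU? */
    return node;
}
#else
/* Portable fallback: treat single-node systems as node 0. */
unsigned short current_numa_node(void) { return 0; }
#endif
```

The returned node number could then be used as the preferred node for all model-weight allocations, so each instance keeps its weights local to the cores it was assigned.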
Impact
This issue significantly impacts performance when running multiple models on NUMA systems. It prevents full utilization of available cores due to the relatively slow interconnect between nodes when memory is not local to the executing cores.
Proposed Solution
Either:
(1) Modify Llamafile to respect the Windows start command's /NODE option, or
(2) Implement a command-line option for Llamafile (e.g., --numa-node) to specify the preferred NUMA node directly.
In either case, ensure that memory allocation prioritizes the specified node before spilling over to other nodes.
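On Windows, option (2) could be implemented with VirtualAllocExNuma, which commits pages with a preferred NUMA node for physical-page placement. A minimal sketch under that assumption (numa_alloc is a hypothetical helper, not an existing Llamafile function; the non-Windows fallback just uses malloc with no placement hint):

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>

/* Allocate `size` bytes, preferring physical pages on `preferred_node`.
   Windows treats the node as a preference, not a hard bind: if the node's
   memory is exhausted, pages come from other nodes instead of failing. */
void *numa_alloc(size_t size, unsigned preferred_node) {
    return VirtualAllocExNuma(GetCurrentProcess(), NULL, size,
                              MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                              (DWORD)preferred_node);
}
#else
#include <stdlib.h>

/* Portable fallback with no NUMA placement hint. */
void *numa_alloc(size_t size, unsigned preferred_node) {
    (void)preferred_node;
    return malloc(size);
}
#endif
```

Routing model-weight allocations through a helper like this, with the node taken from a --numa-node flag (or detected from the /NODE assignment), would keep each instance's weights local to its cores.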
Additional Context
This would be a significant improvement for users running multiple LLM instances on multi-socket Windows workstations or servers, enabling even distribution of workload and full utilization of hardware.
It aligns with Llamafile's goal of providing efficient, flexible LLM deployment options.
Environment
OS: Windows 10 LTSC
Llamafile version: 0.8.9
Hardware: Dual-socket workstation with Intel Xeon processors (Broadwell), 256GB RAM per node
Version
Llamafile version: 0.8.9
What operating system are you seeing the problem on?
No response
Relevant log output
No response