one model loaded multiple times hogging whole available memory #9625
Comments
You can set OLLAMA_MAX_LOADED_MODELS to limit how many models are kept loaded. However, ollama should automatically unload models when space is tight and a model is being loaded. Your screenshot shows four runners, and if you are loading only one model, it could be that the server is crashing and orphaning the runners. The server log should show what's happening.
I assume that setting the OLLAMA_MAX_LOADED_MODELS variable would not really solve the problem. It would only limit loading N models per GPU. With that, should a model not be properly offloaded from the GPU, ollama might just load other models onto the CPU. I will go through the logs and let you know if I find the culprit. Otherwise I might publish the full log output if I'm unable to pinpoint the cause.
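For reference, here is a minimal sketch of capping concurrent model loads through environment variables on a systemd-managed install; the unit name `ollama` and the example values are assumptions, not taken from this thread:

```sh
# Sketch: limit concurrent loads on a systemd-managed install
# (unit name "ollama" and the values below are illustrative)
sudo systemctl edit ollama
# In the override file, add:
#   [Service]
#   Environment="OLLAMA_MAX_LOADED_MODELS=1"   # keep at most one model resident
#   Environment="OLLAMA_NUM_PARALLEL=1"        # fewer parallel slots, less VRAM reserved
sudo systemctl daemon-reload
sudo systemctl restart ollama
```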
Could a closed connection be the main reason for this behavior? If so, is it somehow preventable, so that the model still unloads properly when a connection is closed prematurely for whatever reason? This is basically the recurring theme:
As it seems that connection errors are the problem, and I'm running ollama behind nginx, could it be that I'm missing some configuration? This is the current config for Tabby:
Can it be that websocket support is missing?
Unlikely. The log snippet shows the client disconnecting before the model is ready. ollama will just discard the model load; it wouldn't lead to multiple runners. The nginx logs may show why the ollama client is terminating early, either because of an nginx timeout or an nginx client timeout.
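For anyone debugging similar premature disconnects behind a reverse proxy, this is a hedged sketch of nginx directives that commonly cause or prevent early timeouts while a large model loads; the address and values are illustrative, not taken from this setup:

```nginx
# Illustrative proxy block for ollama; values are examples only
location / {
    proxy_pass http://127.0.0.1:11434;

    # Model loads and long generations can exceed the 60s defaults
    proxy_read_timeout 600s;
    proxy_send_timeout 600s;

    # Stream responses instead of buffering them
    proxy_buffering off;

    # Only needed if the client actually uses websockets;
    # ollama's HTTP API itself does not require an upgrade
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}
```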
Hmm, the runners do seem to break somehow. If you have any ideas of what I could debug next, let me know. As of now I'm out of ideas.
You could just add the server log. |
OK. If you see anything, let me know.
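For completeness, a sketch of how a server log with debug output can be collected on a systemd-based install; the unit name and time window are assumptions:

```sh
# Option 1: run the server in the foreground with debug logging enabled
sudo systemctl stop ollama
OLLAMA_DEBUG=1 ollama serve

# Option 2: read the existing service log around the time the problem occurs
journalctl -u ollama --since "1 hour ago" --no-pager
```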
When it gets stuck, what's the output of |
See the screenshot above. It seems that by the time I took the screenshot, the model was loaded on the CPU. The main suspect is the coder model. This is also indicated in the logs, as it mostly comes up when the connection is closed faster than ollama can load the model.
Well, it's not the server that's creating orphan processes. Over the life of a single server, the model goes from fitting on one GPU, to being split across two GPUs, then having fewer and fewer layers offloaded, num_parallel reduced, and then eventually nothing fits on the GPU and the model is loaded into CPU. And all in the space of 44 seconds.
There may be some race condition in the runner handler that results in the runner staying active even when the handler thinks it's been terminated due to the client disconnecting, but I've been unable to replicate so far. Apart from
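To check whether orphaned runner processes are what is holding the VRAM, something along these lines could help; the process-name pattern is an assumption, since runner process names vary between ollama versions:

```sh
# Which processes are holding GPU memory right now?
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

# Ollama-related processes and their parents; an orphaned runner typically shows PPID 1
ps -eo pid,ppid,etime,rss,cmd | grep -i [o]llama
```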
These would be the options for Tabby and Open WebUI. I'm not setting any custom options when loading the model. Using:
So this time I got a log with debug output from when the model is not unloaded properly. The trouble starts at Mär 11 13:00:16. Is this output sufficient to identify the main cause?
What is the issue?
Currently I'm using Tabby and Open WebUI with a coding model. I have noticed that after some time the GPUs run out of memory, even though only one model is loaded, which should occupy only about 10 GB of RAM.
When this point is reached, requests to ollama never complete and freeze indefinitely. At that point, the only option is to restart the service. It seems that ollama does not clean up loaded models from memory.
How can I debug this? Is there any setting to force only one model to be loaded at a time?
Help is appreciated
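As a starting point for debugging, a sketch of how loaded models can be inspected and explicitly unloaded; the model name below is a placeholder, not the one from this issue:

```sh
# List models currently loaded in memory and whether they sit on GPU or CPU
ollama ps

# Ask the server to unload a specific model right away by setting keep_alive to 0
curl http://localhost:11434/api/generate \
  -d '{"model": "my-coder-model", "keep_alive": 0}'
```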
Relevant log output
OS: Linux
GPU: Nvidia
CPU: Intel
Ollama version: 0.5.7