Description
Describe the bug
When ingress controller instances are OOM Killed, the containers fail (indefinitely) to restart. They appear unable to bind to a port (the logs below show bind() failing on Unix sockets under /var/lib/nginx), and one assumes that's because it was not released by the container that was OOM Killed.
Manual intervention in the form of restarting the Pods is required in order to bring the Ingress Controller instances back online.
For added context only: in our case, we think the OOM Kill was the result of an NGINX config reload, which I believe was triggered by an unrelated deployment auto-scaling and its Endpoints therefore being updated. I've checked and I don't think any new Ingress resources were created at the time, but if they had been, that would be another reason for the reload.
From our end we are taking remedial steps by increasing the memory available to the NGINX IC pods and by decreasing the number of worker processes. But this only reduces the likelihood of the OOM Kill occurring.
I'm raising this issue because it should be possible for the nginx containers to restart without manual intervention if they are subject to an OOM Kill event.
To Reproduce
Steps to reproduce the behavior:
- Deploy NGINX Ingress Controller 3.1.1 to an Azure AKS cluster. Set the memory limit low enough that NGINX will consume more than the Pod's limit when you throw some load its way.
In our case we had 4 instances of NGINX running, each with a request/limit of 500Mi. Using fewer instances and less memory would make it easier to replicate the issue.
- See NGINX pods become 'Not Ready'.
- View logs for any failing Pod.
- See error:
init-ingresscontroller-nginx-ingress '/etc/nginx/./conf.d' -> '/mnt/etc/./conf.d'
init-ingresscontroller-nginx-ingress '/etc/nginx/./fastcgi_params' -> '/mnt/etc/./fastcgi_params'
init-ingresscontroller-nginx-ingress '/etc/nginx/./mime.types' -> '/mnt/etc/./mime.types'
init-ingresscontroller-nginx-ingress '/etc/nginx/./modules' -> '/mnt/etc/./modules'
init-ingresscontroller-nginx-ingress '/etc/nginx/./nginx.conf' -> '/mnt/etc/./nginx.conf'
init-ingresscontroller-nginx-ingress '/etc/nginx/./scgi_params' -> '/mnt/etc/./scgi_params'
init-ingresscontroller-nginx-ingress '/etc/nginx/./secrets' -> '/mnt/etc/./secrets'
init-ingresscontroller-nginx-ingress '/etc/nginx/./stream-conf.d' -> '/mnt/etc/./stream-conf.d'
init-ingresscontroller-nginx-ingress '/etc/nginx/./uwsgi_params' -> '/mnt/etc/./uwsgi_params'
ingresscontroller-nginx-ingress NGINX Ingress Controller Version=3.1.1 Commit=72473392d14cb0971de4b916a8db9bb675a16634 Date=2023-05-04T23:50:20Z DirtyState=false Arch=linux/amd64 Go=go1.20.4
ingresscontroller-nginx-ingress I0918 06:50:31.897078 1 flags.go:294] Starting with flags: ["-nginx-plus=false" "-nginx-reload-timeout=60000" "-enable-app-protect=false" "-enable-app-protect-dos=false" "-nginx-configmaps=ingress/
ingresscontroller-nginx-ingress I0918 06:50:31.922072 1 main.go:234] Kubernetes version: 1.26.3
ingresscontroller-nginx-ingress I0918 06:50:31.935332 1 main.go:380] Using nginx version: nginx/1.23.4
ingresscontroller-nginx-ingress I0918 06:50:32.043153 1 main.go:776] Pod label updated: ingresscontroller-nginx-ingress-6d975bd49d-p6xb8
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [notice] 18#18: try again to bind() after 500ms
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [notice] 18#18: try again to bind() after 500ms
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [notice] 18#18: try again to bind() after 500ms
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [notice] 18#18: try again to bind() after 500ms
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [notice] 18#18: try again to bind() after 500ms
ingresscontroller-nginx-ingress 2023/09/18 06:50:32 [emerg] 18#18: still could not bind()
Stream closed EOF for ingress/ingresscontroller-nginx-ingress-6d975bd49d-p6xb8 (init-ingresscontroller-nginx-ingress)
- Witness that the Pods will be continually restarted but they'll incur the same error over and over. I left one instance running over a weekend during which it went through 700+ restarts and did not come back online.
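The bind() failures in the log are all on Unix socket paths. One possible explanation, which is an assumption and not something confirmed in this report, is that the socket files created by the killed process are still on disk when the container restarts. The sketch below only illustrates that general behaviour on Linux: closing a Unix socket does not delete its file, and a later bind() to the same path fails with "Address already in use" until the file is unlinked. The helper names and the temporary path are hypothetical.

```python
import errno
import os
import socket
import tempfile


def try_bind(path):
    """Bind a Unix stream socket at path; return None on success, else the errno."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
        return None
    except OSError as exc:
        return exc.errno
    finally:
        # close() releases the descriptor but leaves the socket file on disk,
        # much like a process that is killed before it can clean up.
        sock.close()


def demo():
    """Return the results of three bind attempts on the same (hypothetical) path."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "demo.sock")

    first = try_bind(path)   # creates the socket file
    second = try_bind(path)  # the file still exists, so bind() is refused
    os.unlink(path)          # remove the stale file
    third = try_bind(path)   # bind() works again

    os.unlink(path)
    os.rmdir(workdir)
    return first, second, third


if __name__ == "__main__":
    first, second, third = demo()
    print("first bind ok:", first is None)
    print("second bind refused:", second == errno.EADDRINUSE)
    print("third bind ok after unlink:", third is None)
```

If this is what happens here, the retries every 500ms in the log could never succeed, because nothing removes the files between attempts.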
Expected behavior
I expect that when the NGINX Ingress Controller container is restarted after being OOM Killed, it can bind to the port(s) it requires and start up successfully, without a human having to restart the Pod(s).
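As an illustration only of what a self-healing restart could look like, here is a minimal sketch that removes leftover socket files from a directory before a server starts. It assumes the failure comes from stale socket files, which this report does not confirm, and it is not the Ingress Controller's actual code. The function names and the demo directory are hypothetical.

```python
import os
import socket
import stat
import tempfile


def remove_stale_sockets(directory):
    """Unlink every Unix socket file in directory; return the removed names."""
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # lstat() so a symlink is never followed; only real socket files go.
        if stat.S_ISSOCK(os.lstat(path).st_mode):
            os.unlink(path)
            removed.append(name)
    return removed


def demo():
    """Create one leftover socket and one regular file, then clean up sockets."""
    workdir = tempfile.mkdtemp()

    # Simulate a socket file left behind by a process that was killed.
    leftover = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    leftover.bind(os.path.join(workdir, "leftover.sock"))
    leftover.close()

    # A regular file that the cleanup must not touch.
    with open(os.path.join(workdir, "keep.conf"), "w") as handle:
        handle.write("placeholder\n")

    removed = remove_stale_sockets(workdir)
    remaining = sorted(os.listdir(workdir))

    os.unlink(os.path.join(workdir, "keep.conf"))
    os.rmdir(workdir)
    return removed, remaining


if __name__ == "__main__":
    print(demo())
```

A cleanup like this is only safe when no other live process is still listening on those sockets, which is the case at the very start of a freshly restarted container.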
Your environment
- Version of the Ingress Controller - "nginx-ingress:3.1.1"
- Version of Kubernetes - 1.26.3
- Kubernetes platform - Azure AKS
- Using NGINX