-
Notifications
You must be signed in to change notification settings - Fork 18.8k
Fix 'failed to get network during CreateEndpoint' #41011
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Ooops, I didn't notice a modified file is in |
b02c627
to
5531a65
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some thoughts / suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks scary; shouldn't these be removed, or is the code in tryDetachContainerFromClusterNetwork
meant to replace this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the code tryDetachContainerFromClusterNetwork
meant to replace this. If removeNetworks
is in controller's method, then it is asynchronized with container-start, and will cause test case in docker/for-linux#888 (comment) to fail.
Another way may be adding a delay in controller's Start()
, but it is not reliable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of doing this, what are your thoughts on applying a retry logic below in Start
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arkodg I'm afraid we are not able to do retry logic in Start
, nc.adapter.networkAttach(ctx)
in Start
just sends a message to trigger container starting to continue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trying to understand why
moby/daemon/container_operations.go
Line 428 in a23ca16
// Retry network attach again if we failed to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arkodg
See:
moby/daemon/container_operations.go
Lines 748 to 784 in a23ca16
n, config, err := daemon.findAndAttachNetwork(container, idOrName, endpointConfig) | |
if err != nil { | |
return err | |
} | |
if n == nil { | |
return nil | |
} | |
var operIPAM bool | |
if config != nil { | |
if epConfig, ok := config.EndpointsConfig[n.Name()]; ok { | |
if endpointConfig.IPAMConfig == nil || | |
(endpointConfig.IPAMConfig.IPv4Address == "" && | |
endpointConfig.IPAMConfig.IPv6Address == "" && | |
len(endpointConfig.IPAMConfig.LinkLocalIPs) == 0) { | |
operIPAM = true | |
} | |
// copy IPAMConfig and NetworkID from epConfig via AttachNetwork | |
endpointConfig.IPAMConfig = epConfig.IPAMConfig | |
endpointConfig.NetworkID = epConfig.NetworkID | |
} | |
} | |
if err := daemon.updateNetworkConfig(container, n, endpointConfig, updateSettings); err != nil { | |
return err | |
} | |
controller := daemon.netController | |
sb := daemon.getNetworkSandbox(container) | |
createOptions, err := buildCreateEndpointOptions(container, n, endpointConfig, sb, daemon.configStore.DNS) | |
if err != nil { | |
return err | |
} | |
endpointName := strings.TrimPrefix(container.Name, "/") | |
ep, err := n.CreateEndpoint(endpointName, createOptions...) |
The original problem occurs at L784, the retry logic inside L748 does not help.
daemon/container_operations.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we failed to detach the container from the network, should we still try to delete the network? I think this would always fail in that case (as there's still endpoints attached)? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, if DetachNetwork
fails, then there's no point continuing to delete the network. I will modify it.
There is only one possible error in DetachNetwork
: could not find network attachment
in c.attachers[aKey]
, I don't know how this could happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified per your suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking, I still feel we should run daemon.DeleteManagedNetwork
no matter DetachNetwork
fails or not, because this is nearly the final stage of container-stop cleanup and network deletion is not run elsewhere when container runs with attachable overlay network.
a46cec4
to
63f5231
Compare
Fix 'failed to get network during CreateEndpoint' during container starting. 2 scenarios: 1. For swarmkit service containers Change the error type to `libnetwork.ErrNoSuchNetwork`, so `Start()` in `daemon/cluster/executor/container/controller.go` will recreate the network. 2. For 'docker run --network <attachable overlay network>' Add remove-network logic in `tryDetachContainerFromClusterNetwork()` in `daemon/container_operations.go`. Also added test case for scenario 2. Signed-off-by: Xinfeng Liu <xinfeng.liu@gmail.com>
63f5231
to
5363fed
Compare
Further optimize fix potential IP overlap with node LB (method 2). - Refactor test codes - Add testServiceInvalidMount - No need to run removeNetworks() when task container startup failure. - Add removeNetworks() in Prepare() of daemon/cluster/executor/container/controller.go. - Bring over moby#41011 to make a integrated fix. Signed-off-by: Xinfeng Liu <xinfeng.liu@gmail.com>
@xinfengliu what's the status on this one; do we need both the changes in libnetwork and this one, or only "one of them" ? |
@thaJeztah So moby/libnetwork#2554 and this PR is independent from each other. |
assert.NilError(t, err) | ||
|
||
var timeout time.Duration = 20 * time.Second | ||
err = client.ContainerRestart(ctx, c1, &timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this work with a smaller restart value , default in compose and cli is 10s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arkodg Stopping busybox container need at least 10 seconds since busybox does not have signal handler for SIGTERM. So I feel the timeout here needs to be longer than 10 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah how about 11
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid setting timeout to 11s will cause CI flaky.
I don't understand this PR, but i can confirm that it fixes the original issue. |
Any chance that someone might pick up this PR? I think several users are waiting desperately for this fix. |
Further optimize fix potential IP overlap with node LB (method 2). - Refactor test codes - Add testServiceInvalidMount - No need to run removeNetworks() when task container startup failure. - Add removeNetworks() in Prepare() of daemon/cluster/executor/container/controller.go. - Bring over moby#41011 to make a integrated fix. Signed-off-by: Xinfeng Liu <xinfeng.liu@gmail.com>
Fixes docker/for-linux#888
This is a race condition issue. Before starting container, the code ensures the network exists on the node, then during creating endpoint (as part of steps in starting container), the network does not exist anymore then the error
failed to get network during CreateEndpoint
occurs.So far I have known about 2 scenarios causing this problem:
During above short period of starting container, another container is stopped and at that time it is the last container on the network on the node, so the network is removed.
For a sample scenario, see my analysis at failed to get network during CreateEndpoint docker/for-linux#888 (comment)
Restarting a container that was started via
docker run --net <attachable network>
See: failed to get network during CreateEndpoint docker/for-linux#888 (comment)
The reason is that this kind of container has a hiden correspondent swarkit task, after container is stopped, swarmkit will remove this task and call network-deletion to remove the network if it is the last container on this network on the node. The problem is that this network-deletion is asynchronizedly done by a goroutine which is different from container stop/start goroutine. This race condition causes
failed to get network during CreateEndpoint
during container starting.Signed-off-by: Xinfeng Liu xinfeng.liu@gmail.com
- What I did
Fix 'failed to get network during CreateEndpoint' during container starting.
- How I did it
For scenario 1 above:
Change the error type to
libnetwork.ErrNoSuchNetwork
, soStart()
indaemon/cluster/executor/container/controller.go
will recreate the network.For scenario 2 above:
Move network-deletion logic to
tryDetachContainerFromClusterNetwork()
indaemon/container_operations.go
. So we make sure the network deleting is done before containerStop returns.- How to verify it
The issue is not easily recreated deliberately. I tested many times using
docker stack deploy
and 'docker service update --force` randomly, and could recreated the issue occasionaly.From the log of my test, I can verify my fix works as expected.
The container
test_web.1.6uo4tek9dcioirebj7iejl105
ran into the "network not found" error, and the network got recreated.From
docker service ps
output at the same time, the task6uo4tek9dcio
did not fail.Not sure if there's a need to add multiple retries in
createNetworks()
. This issue should not have happen if the PR #40997 (comment) is merged.It can be recreated reliably. I will try adding tests in CI.
- Description for the changelog
Fix 'failed to get network during CreateEndpoint' during container starting.
- A picture of a cute animal (not mandatory but encouraged)