8000 Fix 'failed to get network during CreateEndpoint' by xinfengliu · Pull Request #41011 · moby/moby · GitHub

Conversation

xinfengliu
Contributor
@xinfengliu xinfengliu commented May 22, 2020

Fixes docker/for-linux#888

This is a race condition issue. Before starting a container, the code ensures the network exists on the node, but by the time the endpoint is created (one of the steps in starting the container) the network no longer exists, and the error failed to get network during CreateEndpoint occurs.

So far I know of two scenarios that cause this problem:

  1. During this short window while the container is starting, another container is stopped; if it was the last container on that network on the node, the network is removed.
    For a sample scenario, see my analysis at failed to get network during CreateEndpoint docker/for-linux#888 (comment)

  2. Restarting a container that was started via docker run --net <attachable network>
    See: failed to get network during CreateEndpoint docker/for-linux#888 (comment)
    The reason is that this kind of container has a hidden corresponding swarmkit task. After the container is stopped, swarmkit removes this task and calls network deletion to remove the network if it was the last container on this network on the node. The problem is that this network deletion is done asynchronously by a goroutine separate from the container stop/start goroutine. This race condition causes failed to get network during CreateEndpoint while the container is starting.

Signed-off-by: Xinfeng Liu xinfeng.liu@gmail.com

- What I did
Fix 'failed to get network during CreateEndpoint' during container starting.

- How I did it

  • For scenario 1 above:
    Change the error type to libnetwork.ErrNoSuchNetwork, so Start() in daemon/cluster/executor/container/controller.go will recreate the network.

  • For scenario 2 above:
    Move the network-deletion logic to tryDetachContainerFromClusterNetwork() in daemon/container_operations.go, so the network deletion is guaranteed to complete before containerStop returns.

- How to verify it

  • For scenario 1 above:
    The issue is not easy to reproduce deliberately. I tested many times, running docker stack deploy and docker service update --force randomly, and could reproduce the issue occasionally.

From the log of my test, I can verify my fix works as expected.

The container test_web.1.6uo4tek9dcioirebj7iejl105 ran into the "network not found" error, and the network got recreated.

May 22 04:44:02 dev dockerd[4151]: time="2020-05-22T04:44:02.841844037Z" level=warning msg="Network not found when starting container test_web.1.6uo4tek9dcioirebj7iejl105, will create again" error="network test_mynet not found" module=node/agent/taskmanager node.id=m2hs3p1lvm7ahtb7znft4xw88 service.id=0vig3v0cch757i526gymw9ued task.id=6uo4tek9dcioirebj7iejl105

From docker service ps output at the same time, the task 6uo4tek9dcio did not fail.

Fri 22 May 2020 04:44:02 AM UTC
ID                  NAME                IMAGE                 NODE                DESIRED STATE       CURRENT STATE                     ERROR               PORTS
6uo4tek9dcio        test_web.1          nginx:stable-alpine   dev                 Running             Starting less than a second ago
jj3vjrrcwhzy         \_ test_web.1      nginx:stable-alpine   dev                 Shutdown            Shutdown less than a second ago
3h188v77mjob         \_ test_web.1      nginx:alpine          dev                 Shutdown            Shutdown 1 second ago          
vh4gaf7wt6w7         \_ test_web.1      nginx:stable-alpine   dev                 Shutdown            Shutdown 12 seconds ago        

Fri 22 May 2020 04:44:05 AM UTC
ID                  NAME                IMAGE                 NODE                DESIRED STATE       CURRENT STATE             ERROR               PORTS
6uo4tek9dcio        test_web.1          nginx:stable-alpine   dev                 Running             Running 1 second ago           
jj3vjrrcwhzy         \_ test_web.1      nginx:stable-alpine   dev                 Shutdown            Shutdown 3 seconds ago         
3h188v77mjob         \_ test_web.1      nginx:alpine          dev                 Shutdown            Shutdown 4 seconds ago         
vh4gaf7wt6w7         \_ test_web.1      nginx:stable-alpine   dev                 Shutdown            Shutdown 15 seconds ago        

Not sure if there's a need to add multiple retries in createNetworks(). This issue should not happen once PR #40997 (comment) is merged.

  • For scenario 2 above:
    It can be reproduced reliably. I will try adding tests in CI.

- Description for the changelog

Fix 'failed to get network during CreateEndpoint' during container starting.

- A picture of a cute animal (not mandatory but encouraged)

@xinfengliu
Contributor Author
xinfengliu commented May 22, 2020

Oops, I didn't notice that one of the modified files is in the libnetwork project.
Created a PR for the same purpose: moby/libnetwork#2554

Member
@thaJeztah thaJeztah left a comment

Left some thoughts / suggestions

Member

This looks scary; shouldn't these be removed, or is the code in tryDetachContainerFromClusterNetwork meant to replace this?

Contributor Author

Yes, the code in tryDetachContainerFromClusterNetwork is meant to replace this. If removeNetworks stays in the controller's method, it runs asynchronously with container start, which causes the test case in docker/for-linux#888 (comment) to fail.

Another option would be adding a delay in the controller's Start(), but that would not be reliable.

Contributor

Instead of doing this, what are your thoughts on applying retry logic below in Start?

Contributor Author

@arkodg I'm afraid we can't do retry logic in Start; nc.adapter.networkAttach(ctx) in Start just sends a message to trigger the container start to continue.

Contributor

Trying to understand why

// Retry network attach again if we failed to
won't help now that you have fixed the error codes with moby/libnetwork#2554

Contributor Author

@arkodg
See:

n, config, err := daemon.findAndAttachNetwork(container, idOrName, endpointConfig)
if err != nil {
    return err
}
if n == nil {
    return nil
}
var operIPAM bool
if config != nil {
    if epConfig, ok := config.EndpointsConfig[n.Name()]; ok {
        if endpointConfig.IPAMConfig == nil ||
            (endpointConfig.IPAMConfig.IPv4Address == "" &&
                endpointConfig.IPAMConfig.IPv6Address == "" &&
                len(endpointConfig.IPAMConfig.LinkLocalIPs) == 0) {
            operIPAM = true
        }
        // copy IPAMConfig and NetworkID from epConfig via AttachNetwork
        endpointConfig.IPAMConfig = epConfig.IPAMConfig
        endpointConfig.NetworkID = epConfig.NetworkID
    }
}
if err := daemon.updateNetworkConfig(container, n, endpointConfig, updateSettings); err != nil {
    return err
}
controller := daemon.netController
sb := daemon.getNetworkSandbox(container)
createOptions, err := buildCreateEndpointOptions(container, n, endpointConfig, sb, daemon.configStore.DNS)
if err != nil {
    return err
}
endpointName := strings.TrimPrefix(container.Name, "/")
ep, err := n.CreateEndpoint(endpointName, createOptions...)

The original problem occurs at L784; the retry logic inside L748 does not help.

Comment on lines 946 to 952
Member

If we failed to detach the container from the network, should we still try to delete the network? I think this would always fail in that case (as there's still endpoints attached)? 🤔

Contributor Author

Yes, if DetachNetwork fails, then there's no point continuing to delete the network. I will modify it.

There is only one possible error in DetachNetwork: could not find network attachment in c.attachers[aKey]; I don't know how this could happen.

Contributor Author

Modified per your suggestion.

Contributor Author

After more thought, I still feel we should run daemon.DeleteManagedNetwork whether or not DetachNetwork fails, because this is nearly the final stage of container-stop cleanup, and network deletion is not run elsewhere when a container runs with an attachable overlay network.

@xinfengliu xinfengliu force-pushed the fix-network-not-found branch 2 times, most recently from a46cec4 to 63f5231 Compare June 1, 2020 10:21
Fix 'failed to get network during CreateEndpoint' during container starting.
2 scenarios:

1. For swarmkit service containers
Change the error type to `libnetwork.ErrNoSuchNetwork`, so `Start()` in `daemon/cluster/executor/container/controller.go` will recreate the network.

2. For 'docker run --network <attachable overlay network>'
Add remove-network logic in `tryDetachContainerFromClusterNetwork()` in `daemon/container_operations.go`.

Also added test case for scenario 2.

Signed-off-by: Xinfeng Liu <xinfeng.liu@gmail.com>
@xinfengliu xinfengliu force-pushed the fix-network-not-found branch from 63f5231 to 5363fed Compare June 2, 2020 04:08
xinfengliu added a commit to xinfengliu/moby that referenced this pull request Jun 6, 2020
Further optimize fix potential IP overlap with node LB (method 2).
- Refactor test codes
- Add testServiceInvalidMount
- No need to run removeNetworks() when task container startup fails.
- Add removeNetworks() in Prepare() of daemon/cluster/executor/container/controller.go.
- Bring over moby#41011 to make an integrated fix.

Signed-off-by: Xinfeng Liu <xinfeng.liu@gmail.com>
@thaJeztah
Member

@xinfengliu what's the status on this one; do we need both the changes in libnetwork and this one, or only "one of them" ?

@xinfengliu
Contributor Author

@thaJeztah
moby/libnetwork#2554 fixes "scenario 1" (see the description at the beginning of this PR); the code changes in this PR fix "scenario 2".

So moby/libnetwork#2554 and this PR are independent of each other.

assert.NilError(t, err)

var timeout time.Duration = 20 * time.Second
err = client.ContainerRestart(ctx, c1, &timeout)
Contributor

Does this work with a smaller restart value? The default in compose and the CLI is 10s.

Contributor Author

@arkodg Stopping a busybox container needs at least 10 seconds, since busybox has no signal handler for SIGTERM and docker only falls back to SIGKILL after the timeout. So I feel the timeout here needs to be longer than 10 seconds.

Contributor

ah how about 11 :trollface:

Contributor Author

I'm afraid setting the timeout to 11s would make CI flaky.

@tiborvass
Contributor

I don't understand this PR, but I can confirm that it fixes the original issue.

@xenuser
xenuser commented Aug 4, 2020

Any chance that someone might pick up this PR? I think several users are waiting desperately for this fix.

dperny pushed a commit to dperny/docker that referenced this pull request May 10, 2021
Further optimize fix potential IP overlap with node LB (method 2).
- Refactor test codes
- Add testServiceInvalidMount
- No need to run removeNetworks() when task container startup fails.
- Add removeNetworks() in Prepare() of daemon/cluster/executor/container/controller.go.
- Bring over moby#41011 to make an integrated fix.

Signed-off-by: Xinfeng Liu <xinfeng.liu@gmail.com>
Successfully merging this pull request may close these issues.

failed to get network during CreateEndpoint