8000 Fix for DNS name resolution after performing init with --force-new-cluster by kylewuolle · Pull Request #38626 · moby/moby · GitHub

Conversation

@kylewuolle (Contributor) commented Jan 23, 2019
  • What I did
    This fixes a problem where the agent on a controller is stopped when a node leaves a swarm and is never restarted. I've added
    a flag to the DaemonJoinsCluster method to indicate the case where a force init is being done. When that flag is set, the existing agent is cleaned up
    by setting the cluster provider to nil and waiting for the agent to stop. When the cluster provider is set again afterwards, the agent is set up properly. This PR
    fixes the following issue: Docker swarm overlay networking not working after --force-new-cluster docker/for-linux#495

  • How I did it
    Added a flag indicating that this is a force-new-cluster situation, in which case the agent should be cleaned up before the cluster provider is set.

  • How to verify it

  1. Using the following Dockerfile, build an image called demo on each node:
FROM ubuntu

RUN apt update
RUN apt install dnsutils -y

CMD /bin/bash -c "while true; do nslookup tasks.demo; sleep 2; done"
  2. Execute swarm init on one of the nodes.
  3. Create a network: docker network create --scope swarm --driver overlay --attachable test
  4. Create a service: docker service create --network test --mode global --name demo demo
  5. Verify that the tasks.demo endpoint resolves to two IP addresses: docker service logs demo
  6. Now execute docker swarm init --force-new-cluster on one of the nodes.
  7. Demote and remove the other node; also remove the service and network.
  8. Recreate the service and network on the remaining node.
  9. Have a third node join the remaining node.
  10. Previously, at this point node 3 would resolve tasks.demo to its own container's IP, but tasks.demo would not resolve on the first node. Also, the container on each node could not reach the container on the other node by its IP. With this fix in place, resolution works as expected and tasks.demo resolves to the respective IPs.
  • Description for the changelog
    Fix a problem with DNS resolution after performing a cluster init with the --force-new-cluster option set

  • A picture of a cute animal (not mandatory but encouraged)

…init flag the agent is cleaned up and recreated properly so that agent events are responded to. This was causing some networking issues around DNS resolution after performing a force init on a cluster.

Signed-off-by: Kyle Wuolle <kyle.wuolle@gmail.com>
// When forcing a new cluster, first clean up the existing agent,
// ensuring that a new one will be created and started
if forceNewCluster {
	daemon.setClusterProvider(nil)
}
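The cleanup pattern in the diff above can be sketched as a small, self-contained Go program. Note this is a simplified illustration, not the actual moby/moby code: the agent, daemon, setClusterProvider, and daemonJoinsCluster names here mirror the PR description but are stand-ins, and the real daemon manages far more state.

```go
package main

import (
	"fmt"
	"sync"
)

// agent is a hypothetical stand-in for the networking agent. Closing the
// stopped channel models the agent shutting down.
type agent struct {
	stopped chan struct{}
	once    sync.Once
}

func (a *agent) stop() {
	a.once.Do(func() { close(a.stopped) })
}

type daemon struct {
	mu    sync.Mutex
	agent *agent
}

// setClusterProvider models the behavior described in the PR: passing nil
// stops the current agent and waits for it to terminate; passing a real
// provider starts a fresh agent.
func (d *daemon) setClusterProvider(provider interface{}) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if provider == nil {
		if d.agent != nil {
			d.agent.stop()
			<-d.agent.stopped // wait for the agent to stop
			d.agent = nil
		}
		return
	}
	d.agent = &agent{stopped: make(chan struct{})}
}

// daemonJoinsCluster shows where the forceNewCluster flag hooks in: the old
// agent is torn down first so a new one is created for the new cluster.
func (d *daemon) daemonJoinsCluster(provider interface{}, forceNewCluster bool) {
	if forceNewCluster {
		d.setClusterProvider(nil)
	}
	d.setClusterProvider(provider)
}

func main() {
	d := &daemon{}
	d.daemonJoinsCluster("cluster-A", false)
	old := d.agent
	d.daemonJoinsCluster("cluster-B", true) // force-new-cluster path
	fmt.Println(old != d.agent)             // a fresh agent replaced the old one
}
```

Without the forceNewCluster branch, the stale agent from the old cluster would be left in place and never restarted, which is the root cause of the DNS failures described above.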
Contributor
I still see this as not super clean. If the Agent is stopped, I believe libnetwork need to nullify the previous cluster provider automatically. Is there any other case where the clusterProvider is being reused without being set?

Contributor Author

You're right there's another way that might be better. I've updated moby/libnetwork#2307. Now instead the cluster provider would be set to nil in agentClose.

@thaJeztah
Member

Looks like there's a linting issue;

19:40:46 daemon/daemon.go:1::warning: file is not gofmted with -s (gofmt)
19:40:46 daemon/daemon.go:1::warning: file is not goimported (goimports)
19:40:47 Build step 'Execute shell' marked build as failure

@coolljt0725
Contributor

ping @kylewuolle Jenkins failed

@thaJeztah
Member
thaJeztah commented Apr 1, 2019

@kylewuolle should this one be closed now that moby/libnetwork#2307 was merged (and will be vendored through #38983)?

Note that it was already included in Docker 18.09.4 docker-archive#169

@thaJeztah
Member

oops, it was actually not yet included in 18.09; cherry-picking now

@thaJeztah
Member

ping @kylewuolle is this still needed now that moby/libnetwork#2307 was merged?
/cc @arkodg

@thaJeztah thaJeztah added the kind/bugfix PR's that fix bugs label Oct 10, 2019
@caoyj1991

@thaJeztah @kylewuolle Has the fix been merged into Docker 18.09? And what can I do now?

@kylewuolle kylewuolle closed this Nov 6, 2019
@thaJeztah
Member

@caoyj1991 fix was backported in libnetwork through moby/libnetwork#2354, and included in Docker 18.09.6 through docker-archive#201 (should also be in Docker 19.03 and up)
