8000 Use os.ReadDir instead of filepath.Walk in netns cleanup by thaJeztah · Pull Request #36511 · moby/moby · GitHub
[go: up one dir, main page]

Skip to content

Conversation

thaJeztah
Copy link
Member
@thaJeztah thaJeztah commented Mar 7, 2018
  • Use ioutil.ReadDir() instead of filepath.Walk(), because
    filepath.Walk() does a lot of extra things we don't need
  • Log errors on directory removal (omitting os.IsNotExist)
  • Touch-up error messages for consistency

@thaJeztah
Copy link
Member Author

ping @kolyshkin @cpuguy83 @fcrisciani PTAL; not entirely sure if this is correct, but opened this to discuss 👍

@thaJeztah
Copy link
Member Author

hm, this one is flaky /cc @selansen

13:44:44 --- FAIL: TestServiceWithPredefinedNetwork (11.75s)
13:44:44 	daemon.go:283: [dce2b71fc535a] waiting for daemon to start
13:44:44 	daemon.go:315: [dce2b71fc535a] daemon started
13:44:44 	service_test.go:52: timeout hit after 10s: task count at 1 waiting for 0
13:44:44 	daemon.go:273: [dce2b71fc535a] exiting daemon

@selansen
Copy link
Contributor
selansen commented Mar 7, 2018 via email

@thaJeztah
8000
Copy link
Member Author

@selansen @vdemeester pointed me that that message probably comes from

poll.WaitOn(t, noTasks(client), pollSettings)
(so the "teardown" of the test)

@selansen
Copy link
Contributor
selansen commented Mar 7, 2018 via email

@kolyshkin
Copy link
Contributor

The MNT_FORCE flag comes from commit 99a98cc (PR #25962). The only other use of MNT_FORCE is in libnetwork (moby/libnetwork@8a6b507#diff-50ec864109e0293aa03f973d35485e1dR396), both are by @mrjana (Jana, maybe you can shed some light on why this option is used?)

FS type is nsfs in both cases. My limited testing shows the flag is at least accepted (i.e. no EINVAL or any other error is returned):

root@kd:~/git/linux# docker run -d busybox top
f24191dc7156767505b674041477f10208847b477dc9ef9f4b1395ed46a62aee
root@kd:~/git/linux# grep nsfs /proc/self/mountinfo /proc/self/mounts
/proc/self/mountinfo:576 26 0:3 net:[4026532644] /run/docker/netns/bf03800bf58e rw shared:271 - nsfs nsfs rw
/proc/self/mounts:nsfs /run/docker/netns/bf03800bf58e nsfs rw 0 0
root@kd:~/git/linux# strace -eumount2 umount -f /run/docker/netns/bf03800bf58e 
umount2("/run/docker/netns/bf03800bf58e", MNT_FORCE) = 0
+++ exited with 0 +++
root@kd:~/git/linux# uname -a
Linux kd 4.14.23+ #46 SMP Fri Mar 2 10:31:35 PST 2018 x86_64 x86_64 x86_64 GNU/Linux

From limited reading of linux kernel source code it seems MNT_FORCE is used by any fs which implements umount_begin and is ignored otherwise, and it's working that way since Apr 2008 (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=42faad99658ee). The umount_begin is defined by 9p, ceph, cifs, fuse, nfs, and lustre (as of linux 4.14).

To sum it up:

  1. MNT_FORCE is useless (it does nothing for nsfs) but harmless (it does not cause an error) for this case. I am slightly in favor of removing it (as it gives a bad example).
  2. Yes, both ENOENT and EINVAL should be ignored.

@codecov
Copy link
codecov bot commented Mar 8, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@5a685dc). Click here to learn what that means.
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master   #36511   +/-   ##
=========================================
  Coverage          ?   37.33%           
=========================================
  Files             ?      609           
  Lines             ?    45210           
  Branches          ?        0           
=========================================
  Hits              ?    16878           
  Misses            ?    26044           
  Partials          ?     2288

@thaJeztah
Copy link
Member Author

@kolyshkin thanks for doing the research! I updated the PR, and captioned your information in the commit message 👍

@thaJeztah
Copy link
Member Author

Opened #36547 for the flaky TestServiceWithPredefinedNetwork

@selansen
Copy link
Contributor
selansen commented Mar 9, 2018 via email

@thaJeztah
Copy link
Member Author

All of experimental, janky, powerpc and z are failing on DockerSwarmSuite.TestAPIServiceUpdatePort and DockerSwarmSuite.TestSwarmPublishDuplicatePorts 😞 #36501

@thaJeztah
Copy link
Member Author

Looks like this one has the same issue with tasks not getting to zero;

14:50:59 --- FAIL: TestCreateServiceSecretFileMode (11.61s)
14:50:59 	daemon.go:283: [dc83bb2ad8a0f] waiting for daemon to start
14:50:59 	daemon.go:315: [dc83bb2ad8a0f] daemon started
14:50:59 	create_test.go:199: timeout hit after 10s: task count at 3 waiting for 1
14:50:59 	daemon.go:273: [dc83bb2ad8a0f] exiting daemon

@derek derek bot added the status/failing-ci Indicates that the PR in its current state fails the test suite label Dec 22, 2018
@thaJeztah
Copy link
Member Author
thaJeztah commented Jan 20, 2019

Test is still failing; need to find out what's causing that;

15:10:09 FAIL: docker_cli_swarm_test.go:1534: DockerSwarmSuite.TestSwarmPublishDuplicatePorts
15:10:09 
15:10:09 [dc64601ed788f] waiting for daemon to start
15:10:09 [dc64601ed788f] daemon started
15:10:09 
15:10:09 docker_cli_swarm_test.go:1542:
15:10:09     // make sure task has been deployed.
15:10:09     waitAndAssert(c, defaultReconciliationTimeout, d.CheckActiveContainerCount, checker.Equals, 1)
15:10:09 docker_utils_test.go:441:
15:10:09     c.Assert(v, checker, args...)
15:10:09 ... obtained int = 0
15:10:09 ... expected int = 1
15:10:09 
15:10:09 waited for 30.010791614s (out of 30s)
15:10:09 [dc64601ed788f] exiting daemon

Logs attached of that daemon

docker.log

time="2019-01-17T15:09:39.058206372Z" level=error msg="error reading the kernel parameter net.ipv4.neigh.default.gc_thresh1" error="open /proc/sys/net/ipv4/neigh/default/gc_thresh1: no such file or directory"
time="2019-01-17T15:09:39.058241863Z" level=error msg="error reading the kernel parameter net.ipv4.neigh.default.gc_thresh2" error="open /proc/sys/net/ipv4/neigh/default/gc_thresh2: no such file or directory"
time="2019-01-17T15:09:39.058269339Z" level=error msg="error reading the kernel parameter net.ipv4.neigh.default.gc_thresh3" error="open /proc/sys/net/ipv4/neigh/default/gc_thresh3: no such file or directory"
time="2019-01-17T15:09:39.306699373Z" level=error msg="Failed creating ingress network: network sandbox join failed: subnet sandbox join failed for \"10.255.0.0/16\": error creating vxlan interface: file exists"
time="2019-01-17T15:09:39.316535617Z" level=error msg="1d2702eeab7cb24fc43c6aa5d6182c8b07387027349fdf065b9ae140299113cf cleanup: failed to delete container from containerd: no such container"
time="2019-01-17T15:09:39.650731369Z" level=error msg="fatal task error" error="network sandbox join failed: subnet sandbox join failed for \"10.255.0.0/16\": error creating vxlan interface: file exists" module=node/agent/taskmanager node.id=ybrsk120feonm5ykxmzgoxg4m service.id=kfebrrvqpzlzwylxvai14hxak task.id=p2iv89z8vbo22yh4dm7khccdx
time="2019-01-17T15:09:40.026581150Z" level=error msg="fatal task error" error="network sandbox join failed: subnet sandbox join failed for \"10.255.0.0/16\": error creating vxlan interface: file exists" module=node/agent/taskmanager node.id=ybrsk120feonm5ykxmzgoxg4m service.id=kfebrrvqpzlzwylxvai14hxak task.id=kewrl8i9dew6lwl6f9ii4cu3x
time="2019-01-17T15:09:45.026694799Z" level=error msg="fatal task error" error="network sandbox join failed: subnet sandbox join failed for \"10.255.0.0/16\": error creating vxlan interface: file exists" module=node/agent/taskmanager node.id=ybrsk120feonm5ykxmzgoxg4m service.id=kfebrrvqpzlzwylxvai14hxak task.id=38ccvgtwsqxjvypjxeomm8txq
time="2019-01-17T15:09:50.054566586Z" level=error msg="fatal task error" error="network sandbox join failed: subnet sandbox join failed for \"10.255.0.0/16\": error creating vxlan interface: file exists" module=node/agent/taskmanager node.id=ybrsk120feonm5ykxmzgoxg4m service.id=kfebrrvqpzlzwylxvai14hxak task.id=zs96129xfdfw618pc8rh3578y
time="2019-01-17T15:09:55.046855590Z" level=error msg="fatal task error" error="network sandbox join failed: subnet sandbox join failed for \"10.255.0.0/16\": error creating vxlan interface: file exists" module=node/agent/taskmanager node.id=ybrsk120feonm5ykxmzgoxg4m service.id=kfebrrvqpzlzwylxvai14hxak task.id=r34fs115501b5wsxppzwc3ie9
time="2019-01-17T15:10:00.042987178Z" level=error msg="fatal task error" error="network sandbox join failed: subnet sandbox join failed for \"10.255.0.0/16\": error creating vxlan interface: file exists" module=node/agent/taskmanager node.id=ybrsk120feonm5ykxmzgoxg4m service.id=kfebrrvqpzlzwylxvai14hxak task.id=vwze8zbau6pwm8n11mlrmaqm7
time="2019-01-17T15:10:05.054697503Z" level=error msg="fatal task error" error="network sandbox join failed: subnet sandbox join failed for \"10.255.0.0/16\": error creating vxlan interface: file exists" module=node/agent/taskmanager node.id=ybrsk120feonm5ykxmzgoxg4m service.id=kfebrrvqpzlzwylxvai14hxak task.id=3zn00udup55ekv4t3jbsd9nf2
time="2019-01-17T15:10:09.114187625Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=ybrsk120feonm5ykxmzgoxg4m node.session=4068tn9bq7a8tlh53i3a3cz10
time="2019-01-17T15:10:09.114705809Z" level=error msg="failed to restart task after waiting for previous restart" error="raft: failed to process the request: node is stopped" module=node node.id=ybrsk120feonm5ykxmzgoxg4m
time="2019-01-17T15:10:09.114979485Z" level=error msg="failed to receive changes from store watch API" error="rpc error: code = Unknown desc = context canceled"
time="2019-01-17T15:10:09.115310137Z" level=error msg="failed to retrieve ingress network iapela2qdh1mnu8a1shgsm04j: network iapela2qdh1mnu8a1shgsm04j not found"

This looks to be the error:

error creating vxlan interface: file exists

Test itself looks straightforward at a glance

func (s *DockerSwarmSuite) TestSwarmPublishDuplicatePorts(c *check.C) {
d := s.AddDaemon(c, true, true)
out, err := d.Cmd("service", "create", "--no-resolve-image", "--detach=true", "--publish", "5005:80", "--publish", "5006:80", "--publish", "80", "--publish", "80", "busybox", "top")
c.Assert(err, check.IsNil, check.Commentf("%s", out))
id := strings.TrimSpace(out)
// make sure task has been deployed.
waitAndAssert(c, defaultReconciliationTimeout, d.CheckActiveContainerCount, checker.Equals, 1)
// Total len = 4, with 2 dynamic ports and 2 non-dynamic ports
// Dynamic ports are likely to be 30000 and 30001 but doesn't matter
out, err = d.Cmd("service", "inspect", "--format", "{{.Endpoint.Ports}} len={{len .Endpoint.Ports}}", id)
c.Assert(err, check.IsNil, check.Commentf("%s", out))
c.Assert(out, checker.Contains, "len=4")
c.Assert(out, checker.Contains, "{ tcp 80 5005 ingress}")
c.Assert(out, checker.Contains, "{ tcp 80 5006 ingress}")
}

@thaJeztah
Copy link
Member Author

The error creating vxlan interface: file exists problem was once reported in moby/libnetwork#945 (and moby/libnetwork#562, moby/libnetwork#751 before that)

That case was fixed through moby/libnetwork#1065

@salzig
Copy link
salzig commented Jul 16, 2019

@thaJeztah just as a note, got this today with docker 18.09.7 on Ubuntu 16.04.

"network sandbox join failed: subnet sandbox join failed for "10.0.3.0/24": error creating vxlan interface: file exists"

Copy link
Contributor
@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thaJeztah thaJeztah changed the title Ignore EINVAL when cleaning up network namespaces Use io.ReadDir instead of filepath.Walk in netns cleanup Apr 6, 2020
@thaJeztah thaJeztah removed the status/failing-ci Indicates that the PR in its current state fails the test suite label Apr 6, 2020
@kolyshkin
Copy link
Contributor

This needs to be rebased/retested/merged. @thaJeztah can you please revive it?

@thaJeztah
Copy link
Member Author

Rebased to trigger CI again

@thaJeztah
Copy link
Member Author

Hmm still failing on this one

Stacktrace
=== RUN   TestDockerSwarmSuite/TestSwarmPublishDuplicatePorts
    --- FAIL: TestDockerSwarmSuite/TestSwarmPublishDuplicatePorts (31.39s)
        docker_cli_swarm_test.go:1548: timeout hit after 30s: first check never completed
        poll.go:121: assertion failed: error is not nil: exit status 1

@thaJeztah
Copy link
Member Author

Rebased again, to see if things changed since last year

@thaJeztah thaJeztah force-pushed the ignore-einval branch 2 times, most recently from 7b009d5 to 7f183ec Compare August 26, 2022 18:06
@thaJeztah thaJeztah changed the title Use io.ReadDir instead of filepath.Walk in netns cleanup Use os.ReadDir instead of filepath.Walk in netns cleanup Aug 26, 2022
- Use `os.ReadDir()` instead of `filepath.WalkDir()`, because
  `filepath.WalkDir()` does a lot of extra things we don't need
- Log errors on directory removal (omitting `os.IsNotExist`)
- Touch-up error messages for consistency

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@guoard
Copy link
guoard commented Jun 1, 2025

@thaJeztah PTAL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants
0