daemon: switch to semaphore-gated WaitGroup for startup tasks by cyphar · Pull Request #38301 · moby/moby · GitHub

Conversation

cyphar
Contributor
@cyphar cyphar commented Nov 30, 2018

Many startup tasks have to run for each container, and thus using a
WaitGroup (which doesn't have a limit to the number of parallel tasks)
can result in Docker exceeding the NOFILE limit quite trivially. A more
optimal solution is to have a parallelism limit.

In addition, several startup tasks were not parallelised previously
which resulted in very long startup times. According to my testing, 20K
dead containers resulted in ~6 minute startup times (during which time
Docker is completely unusable).

This patch fixes both issues, and the parallelStartupTimes factor chosen
(128 * NumCPU) is based on my own significant testing of the 20K
container case. This patch (on my machines) reduces the startup time
from 6 minutes to less than a minute (ideally this could be further
reduced by removing the need to scan all dead containers on startup --
but that's beyond the scope of this patchset).

In order to avoid the NOFILE limit problem, we also detect this
on-startup and if NOFILE < 2*128*NumCPU we will reduce the parallelism
factor to avoid hitting NOFILE limits (but also emit a warning since
this is almost certainly a mis-configuration).
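
For readers unfamiliar with the pattern, the following is a minimal, runnable sketch of what "a WaitGroup gated by a semaphore" means here; the parallelStartupJobs name and the 128-per-CPU figure come from the description above, while everything else (the 20K placeholder jobs, the sleep) is illustrative:

	package main

	import (
		"runtime"
		"sync"
		"time"
	)

	func main() {
		// Parallelism factor described above: 128 goroutines per CPU core.
		// (The patch additionally lowers this when RLIMIT_NOFILE is too small;
		// a separate sketch of that idea appears further down the thread.)
		parallelStartupJobs := 128 * runtime.NumCPU()

		containerIDs := make([]string, 20000) // stand-in for on-disk container directories

		var wg sync.WaitGroup
		sem := make(chan struct{}, parallelStartupJobs) // counting semaphore
		for i := range containerIDs {
			wg.Add(1)
			go func(id int) {
				sem <- struct{}{} // block until a slot is free
				defer func() {
					<-sem // release the slot
					wg.Done()
				}()
				// ... per-container startup work (load config, register, restore, ...) ...
				time.Sleep(time.Millisecond) // placeholder for real work
				_ = id
			}(i)
		}
		wg.Wait()
	}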

benchmarks

Benchmarks were done on an 8-core machine.

[image: "cat" by Danny VB]

Signed-off-by: Aleksa Sarai <asarai@suse.de>

@BlaineEXE

I would disagree that there is little benefit to having more start tasks than cores. Goroutines are really lightweight, and having multiple per core allows the scheduler to run tasks while other routines are waiting on IO, and there will be a lot of IO waiting while reading/writing to disk.

I think this could be a good fix if it needs to be released quickly, but it seems much better to me to set the semaphore to limit to some percentage of the process's NOFILE limit.

It also occurs to me that there are a lot of servers with a huge number of cores. I have seen some absurd ones with as many as 1024 cores, and the NOFILE limit on my dev machine is 1024. I would be concerned that it would be possible in some corner cases that limiting to the number of procs might still not prevent the startup process from exceeding the ulimit.
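
To make the "percentage of NOFILE" idea concrete, here is a hedged sketch (not code from this PR; the 50% fraction and the floor of 16 are arbitrary illustration values) of reading the soft RLIMIT_NOFILE on Linux and deriving a parallelism cap from it:

	package main

	import (
		"fmt"
		"syscall"
	)

	// startupLimitFromNOFILE derives a parallelism cap as a fraction of the soft
	// RLIMIT_NOFILE, with a floor so that a tiny ulimit doesn't serialize startup.
	func startupLimitFromNOFILE(fraction float64, floor int) (int, error) {
		var rlim syscall.Rlimit
		if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
			return 0, err
		}
		limit := int(float64(rlim.Cur) * fraction)
		if limit < floor {
			limit = floor
		}
		return limit, nil
	}

	func main() {
		limit, err := startupLimitFromNOFILE(0.5, 16) // 50% of NOFILE, minimum of 16
		if err != nil {
			panic(err)
		}
		fmt.Println("startup parallelism limit:", limit)
	}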

@cyphar
Contributor Author
cyphar commented Dec 1, 2018

My original point about cores was a little flippant -- however, there is a diminishing returns aspect when it comes to goroutines, especially if they are doing very little actual work. Note that "blocking on IO" usually means that the kernel is doing work, and there are many layers of caching that mean us writing to two or three JSON files really is not going to be blocking for any significant period of time. Not to mention that spawning 10K goroutines will result in slower execution because of the significant amount of time spent context-switching (and the number of threads that Go will create each time that one of the 10K goroutines spawns a syscall).

All of that being said, there probably is still a benefit to having (say) 20 goroutines per core or something similar. But it should be noted that the default NOFILE is set to LimitNOFILE=1048576 in contrib/init/systemd/docker.service -- so even if we set it to 50% of NOFILE we're still talking about 512K goroutines at worst.

To be clear, I don't really care what the number is (so long as it's reasonable), but "as many containers as we have on the machine" isn't a good number.

@BlaineEXE

To be clear, I don't really care what the number is (so long as it's reasonable), but "as many containers as we have on the machine" isn't a good number.

Definitely agreed on this point. I can't really suggest an educated number for the right number of routines-per-core. Go conventional wisdom I've read suggests having even a few hundred routines per core might be alright, but the startup calls are probably not very lightweight, which kind of throws that out the window. I think the bottom-line of my concern is that this risks increasing the startup time quite a bit.

I would think that doing some reasonably quick benchmarks to see how long the initialization takes with various routines-per-core values would suggest a practical number beyond which things slow down. I think it makes sense that that practical value would be about the same order of magnitude for most systems, so no need to try to do too much optimization. And for machines with a huge number of containers, this might decrease the overall startup time. =)

IMO, arbitrarily choosing to test startup time by increasing routines-per-core by powers of two seems as reasonable a thing as going by tens or hundreds.

@cyphar
Contributor Author
cyphar commented Dec 1, 2018

Sure, I can do some quick benchmarks (because dead containers are processed on startup this is even simpler).

But (in a slightly snarky tone) it should be noted that the bug reported to me had 10K containers and the startup time was taking several minutes (so long that systemd killed Docker because it thought it was timing out). I'm not sure it'd be easy for us to do much worse than that. 😉

@cyphar
Contributor Author
cyphar commented Dec 4, 2018

After some benchmarking, the actual bottleneck (with 20K dead containers Docker takes >6min to start) is not in the parts that currently use goroutines. It's the parts of restore which don't (most notably the daemon loading code). The goroutine'd sections take up only about 20s of the 6-minute-long operation.

But after some more testing, it does look like there is a difference (though it's not exceptionally significant) between different values of parallelStartupJobs.

@cyphar
Contributor Author
cyphar commented Dec 4, 2018

Here are the results of my benchmarks. This was all done on an 8-core machine (with 5 samples of each multiplier -- average plotted with standard-deviation error bars). Looks like the ideal multiplier is 128.

[graph "performance": startup time for each parallelStartupJobs multiplier, averages with standard-deviation error bars]

(And it should be noted that the current state of Docker is above the top of the graph at ~400 second startup times.)

I set a very high ulimit and there were no errors during execution.

@codecov
codecov bot commented Dec 4, 2018

Codecov Report

Merging #38301 into master will decrease coverage by 0.01%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master   #38301      +/-   ##
==========================================
- Coverage   36.58%   36.56%   -0.02%     
==========================================
  Files         608      608              
  Lines       44988    45036      +48     
==========================================
+ Hits        16460    16469       +9     
- Misses      26249    26290      +41     
+ Partials     2279     2277       -2

@cyphar cyphar force-pushed the waitgroup-limits branch 2 times, most recently from b16d754 to 4ff66ce on December 4, 2018, 17:38
@cpuguy83
Member
cpuguy83 commented Dec 4, 2018

Wow, very nice work. I agree we should limit the number of goroutines here.

daemon/daemon.go Outdated
Member

If you don't mind, I'd, personally, prefer not to add the new package for this limited case, especially since I feel like it's weird to mix the two things.

Contributor Author

Would you prefer that I use WaitGroup and chan struct{} separately, or that I make LimitedWaitGroup a new type just for daemon/daemon.go?

Member

Separately 🙏👼

Contributor Author

Sure, though note that this will change

	loadWg := syncext.NewLimitedWaitGroup(parallelStartupJobs)
	for _, v := range dir {
		loadWg.Add()
		go func(id string) {
			loadWg.Up()
			defer loadWg.Down()
			// ...
		}(v.Name())
	}
	loadWg.Wait()

to

	var loadWg sync.WaitGroup
	loadSem := make(chan struct{}, parallelStartupJobs)
	for _, v := range dir {
		loadWg.Add(1)
		go func(id string) {
			loadSem <- struct{}{}
			defer func() {
				<-loadSem
				loadWg.Done()
			}()
			// ...
		}(v.Name())
	}
	loadWg.Wait()

Which is the main reason I combined them (the latter seems far easier to mess up on copy-paste and would be repeated quite a bit). But sure, I can do the conversion if you prefer.

@cyphar
Contributor Author
cyphar commented Dec 5, 2018

I've updated the PR to remove the LimitedWaitGroup abstraction. Personally I think this version is a bit easier to screw up if we add more code to initialisation (since you need to make sure to do both <-sem and wg.Done()) but if that's what you'd prefer I'm okay with that too.

@cyphar cyphar changed the title from "daemon: switch to Semaphore for startup tasks" to "daemon: switch to semaphore-gated WaitGroup for startup tasks" on Dec 5, 2018
@cyphar
Contributor Author
cyphar commented Dec 7, 2018

There we go.

/cc @cpuguy83

@thaJeztah
Member

@unclejack @tonistiigi @kolyshkin PTAL

@cyphar
Contributor Author
cyphar commented Dec 12, 2018

@cpuguy83 Fixed, PTAL.

Member
@cpuguy83 cpuguy83 left a comment

LGTM assuming CI says it's good.

@cyphar
Contributor Author
cyphar commented Dec 14, 2018

The failures are flaky swarm tests.

/ping @cpuguy83

daemon/daemon.go Outdated
Member

Are you sure there is a benefit to doing this in all of these loops? E.g. registerName/registerLinks does not seem to be long-running.

Contributor Author
@cyphar cyphar Dec 19, 2018

Please look at the benchmarks I've done above. register and registerLinks (reglinks) are two of the longest-running jobs, and parallelising them has reduced the startup time from >6 minutes to less than a minute.

Member

So in your benchmark, you are only testing containers that are already running with live-restore or in a stopped state without any restart logic? Otherwise, I'd expect these parts to be orders of magnitude slower than what the benchmark is testing. I'm not sure this is very representative of the performance of this function for most of our users then.

Anyway, I'm fine with this if it makes your use-case better.

Contributor Author
@cyphar cyphar Dec 19, 2018

I'm testing 20K containers that have

% docker run --name container$i ubuntu:latest true

So no restart logic or live-restore. It's just on-disk state, and it takes 6 minutes to start Docker. I'm sure live-restore will make this slower -- but I think that it's completely unacceptable for Docker to take 6 minutes before the API will respond if you have 20K completely dead containers. One minute is still unacceptable, but fixing that will require some very radical changes to how container handling is done in Docker (namely, things should really be lazy-loaded).

I disagree that the majority of users are using 20K live-restore containers or 20K restart-logic containers. But a missed docker system prune and "for some reason" Docker startup is taking longer than you might expect.

Also, this patch actually fixes a problem with loading -- namely that we are currently exceeding ulimits in some cases (which causes containers to not load properly).

@thaJeztah
Member

@tonistiigi PTAL; is this good to go with @cyphar's latest information?

The commit message on the updated branch reads:

Many startup tasks have to run for each container, and thus using a
WaitGroup (which doesn't have a limit to the number of parallel tasks)
can result in Docker exceeding the NOFILE limit quite trivially. A more
optimal solution is to have a parallelism limit by using a semaphore.

In addition, several startup tasks were not parallelised previously
which resulted in very long startup times. According to my testing, 20K
dead containers resulted in ~6 minute startup times (during which time
Docker is completely unusable).

This patch fixes both issues, and the parallelStartupTimes factor chosen
(128 * NumCPU) is based on my own significant testing of the 20K
container case. This patch (on my machines) reduces the startup time
from 6 minutes to less than a minute (ideally this could be further
reduced by removing the need to scan all dead containers on startup --
but that's beyond the scope of this patchset).

In order to avoid the NOFILE limit problem, we also detect this
on-startup and if NOFILE < 2*128*NumCPU we will reduce the parallelism
factor to avoid hitting NOFILE limits (but also emit a warning since
this is almost certainly a mis-configuration).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
@cyphar
Contributor Author
cyphar commented Dec 21, 2018

I've handled the comment by @tonistiigi about context.Background. I think that's everything done, but obviously I'd wait for his review.
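
The merged daemon code is not reproduced in this thread, but the pattern the discussion converges on (a plain sync.WaitGroup gated by golang.org/x/sync/semaphore, acquired with context.Background() as mentioned above) looks roughly like the sketch below; the restoreAll name and the job body are illustrative, not the actual daemon code:

	package main

	import (
		"context"
		"sync"

		"golang.org/x/sync/semaphore"
	)

	// restoreAll runs one job per container ID, allowing at most parallelLimit
	// jobs to execute concurrently.
	func restoreAll(ids []string, parallelLimit int, job func(id string)) {
		sem := semaphore.NewWeighted(int64(parallelLimit))
		var wg sync.WaitGroup
		for _, id := range ids {
			wg.Add(1)
			go func(id string) {
				defer wg.Done()
				// With context.Background() the Acquire blocks until a slot is
				// free; the error path only matters for cancellable contexts.
				if err := sem.Acquire(context.Background(), 1); err != nil {
					return
				}
				defer sem.Release(1)
				job(id)
			}(id)
		}
		wg.Wait()
	}

	func main() {
		ids := []string{"a", "b", "c"}
		restoreAll(ids, 2, func(id string) {
			// ... per-container restore/registration work would go here ...
		})
	}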

@thaJeztah
Member

Moved this to "merge", thanks @cyphar @tonistiigi !
