daemon: switch to semaphore-gated WaitGroup for startup tasks by cyphar · Pull Request #38301 · moby/moby · GitHub

Conversation

cyphar
Contributor
@cyphar cyphar commented Nov 30, 2018

Many startup tasks have to run for each container, and thus using a
WaitGroup (which doesn't have a limit to the number of parallel tasks)
can result in Docker exceeding the NOFILE limit quite trivially. A more
optimal solution is to have a parallelism limit.

In addition, several startup tasks were not parallelised previously
which resulted in very long startup times. According to my testing, 20K
dead containers resulted in ~6 minute startup times (during which time
Docker is completely unusable).

This patch fixes both issues, and the parallelStartupTimes factor chosen
(128 * NumCPU) is based on my own significant testing of the 20K
container case. This patch (on my machines) reduces the startup time
from 6 minutes to less than a minute (ideally this could be further
reduced by removing the need to scan all dead containers on startup --
but that's beyond the scope of this patchset).

In order to avoid the NOFILE limit problem, we also detect this
on-startup and if NOFILE < 2*128*NumCPU we will reduce the parallelism
factor to avoid hitting NOFILE limits (but also emit a warning since
this is almost certainly a mis-configuration).
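
For readers unfamiliar with the pattern, the following is a minimal, runnable sketch of what "a WaitGroup gated by a semaphore" means here; the parallelStartupJobs name and the 128-per-CPU figure come from the description above, while everything else (the 20K placeholder jobs, the sleep) is illustrative:

	package main

	import (
		"runtime"
		"sync"
		"time"
	)

	func main() {
		// Parallelism factor described above: 128 goroutines per CPU core.
		// (The patch additionally lowers this when RLIMIT_NOFILE is too small;
		// a separate sketch of that idea appears further down the thread.)
		parallelStartupJobs := 128 * runtime.NumCPU()

		containerIDs := make([]string, 20000) // stand-in for on-disk container directories

		var wg sync.WaitGroup
		sem := make(chan struct{}, parallelStartupJobs) // counting semaphore
		for i := range containerIDs {
			wg.Add(1)
			go func(id int) {
				sem <- struct{}{} // block until a slot is free
				defer func() {
					<-sem // release the slot
					wg.Done()
				}()
				// ... per-container startup work (load config, register, restore, ...) ...
				time.Sleep(time.Millisecond) // placeholder for real work
				_ = id
			}(i)
		}
		wg.Wait()
	}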

benchmarks

Benchmarks were done on an 8-core machine.

[image: "cat" by Danny VB]

Signed-off-by: Aleksa Sarai <asarai@suse.de>

@BlaineEXE

I would disagree that there is little benefit to having more start tasks than cores. Goroutines are really lightweight, and having multiple per core allows the scheduler to run tasks while other routines are waiting on IO, and there will be a lot of IO waiting while reading/writing to disk.

I think this could be a good fix if it needs to be released quickly, but it seems much better to me to set the semaphore to limit to some percentage of the process's NOFILE limit.

It also occurs to me that there are a lot of servers with a huge number of cores. I have seen some absurd ones with as many as 1024 cores, and the NOFILE limit on my dev machine is 1024. I would be concerned that it would be possible in some corner cases that limiting to the number of procs might still not prevent the startup process from exceeding the ulimit.
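
To make the "percentage of NOFILE" idea concrete, here is a hedged sketch (not code from this PR; the 50% fraction and the floor of 16 are arbitrary illustration values) of reading the soft RLIMIT_NOFILE on Linux and deriving a parallelism cap from it:

	package main

	import (
		"fmt"
		"syscall"
	)

	// startupLimitFromNOFILE derives a parallelism cap as a fraction of the soft
	// RLIMIT_NOFILE, with a floor so that a tiny ulimit doesn't serialize startup.
	func startupLimitFromNOFILE(fraction float64, floor int) (int, error) {
		var rlim syscall.Rlimit
		if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
			return 0, err
		}
		limit := int(float64(rlim.Cur) * fraction)
		if limit < floor {
			limit = floor
		}
		return limit, nil
	}

	func main() {
		limit, err := startupLimitFromNOFILE(0.5, 16) // 50% of NOFILE, minimum of 16
		if err != nil {
			panic(err)
		}
		fmt.Println("startup parallelism limit:", limit)
	}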

@cyphar
Contributor Author
cyphar commented Dec 1, 2018

My original point about cores was a little flippant -- however, there is a diminishing returns aspect when it comes to goroutines, especially if they are doing very little actual work. Note that "blocking on IO" usually means that the kernel is doing work, and there are many layers of caching that mean us writing to two or three JSON files really is not going to be blocking for any significant period of time. Not to mention that spawning 10K goroutines will result in slower execution because of the significant amount of time spent context-switching (and the number of threads that Go will create each time that one of the 10K goroutines spawns a syscall).

All of that being said, there probably is still a benefit to having (say) 20 goroutines per core or something similar. But it should be noted that the default NOFILE is set to LimitNOFILE=1048576 in contrib/init/systemd/docker.service -- so even if we set it to 50% of NOFILE we're still talking about 512K goroutines at worst.

To be clear, I don't really care what the number is (so long as it's reasonable), but "as many containers as we have on the machine" isn't a good number.

@BlaineEXE

To be clear, I don't really care what the number is (so long as it's reasonable), but "as many containers as we have on the machine" isn't a good number.

Definitely agreed on this point. I can't really suggest an educated number for the right number of routines-per-core. Go conventional wisdom I've read suggests having even a few hundred routines per core might be alright, but the startup calls are probably not very lightweight, which kind of throws that out the window. I think the bottom-line of my concern is that this risks increasing the startup time quite a bit.

I would think that doing some reasonably quick benchmarks to see how long the initialization takes with various routines-per-core values would suggest a practical number beyond which things slow down. I think it makes sense that that practical value would be about the same order of magnitude for most systems, so no need to try to do too much optimization. And for machines with a huge number of containers, this might decrease the overall startup time. =)

IMO, arbitrarily choosing to test startup time by increasing routines-per-core by powers of two seems as reasonable a thing as going by tens or hundreds.

@cyphar
Contributor Author
cyphar commented Dec 1, 2018

Sure, I can do some quick benchmarks (because dead containers are processed on startup this is even simpler).

But (in a slightly snarky tone) it should be noted that the bug reported to me had 10K containers and the startup time was taking several minutes (so long that systemd killed Docker because it thought it was timing out). I'm not sure it'd be easy for us to do much worse than that. 😉

@cyphar
Contributor Author
cyphar commented Dec 4, 2018

After some benchmarking, the actual bottleneck (with 20K dead containers Docker takes >6min to start) is not in the parts that currently use goroutines. It's the parts of restore which don't (most notably the daemon loading code). The goroutine'd sections take up only about 20s of the 6-minute-long operation.

But after some more testing, it does look like there is a difference (though it's not exceptionally significant) between different values of parallelStartupJobs.

@cyphar
Contributor Author
cyphar commented Dec 4, 2018

Here are the results of my benchmarks. This was all done on an 8-core machine (with 5 samples of each multiplier -- average plotted with standard-deviation error bars). Looks like the ideal multiplier is 128.

[graph "performance": startup time for each parallelStartupJobs multiplier, averages with standard-deviation error bars]

(And it should be noted that the current state of Docker is above the top of the graph at ~400 second startup times.)

I set a very high ulimit and there were no errors during execution.

@codecov
codecov bot commented Dec 4, 2018

Codecov Report

Merging #38301 into master will decrease coverage by 0.01%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master   #38301      +/-   ##
==========================================
- Coverage   36.58%   36.56%   -0.02%     
==========================================
  Files         608      608              
  Lines       44988    45036      +48     
==========================================
+ Hits        16460    16469       +9     
- Misses      26249    26290      +41     
+ Partials     2279     2277       -2

@cyphar cyphar force-pushed the waitgroup-limits branch 2 times, most recently from b16d754 to 4ff66ce on December 4, 2018, 17:38
@cpuguy83
Member
cpuguy83 commented Dec 4, 2018

Wow, very nice work. I agree we should limit the number of goroutines here.

daemon/daemon.go Outdated
Member

If you don't mind, I'd, personally, prefer not to add the new package for this limited case, especially since I feel like it's weird to mix the two things.

Contributor Author

Would you prefer that I use WaitGroup and chan struct{} separately, or that I make LimitedWaitGroup a new type just for daemon/daemon.go?

Member

Separately 🙏👼

Contributor Author

Sure, though note that this will change

	loadWg := syncext.NewLimitedWaitGroup(parallelStartupJobs)
	for _, v := range dir {
		loadWg.Add()
		go func(id string) {
			loadWg.Up()
			defer loadWg.Down()
			// ...
		}(v.Name())
	}
	loadWg.Wait()

to

	var loadWg sync.WaitGroup
	loadSem := make(chan struct{}, parallelStartupJobs)
	for _, v := range dir {
		loadWg.Add(1)
		go func(id string) {
			loadSem <- struct{}{}
			defer func() {
				<-loadSem
				loadWg.Done()
			}()
			// ...
		}(v.Name())
	}
	loadWg.Wait()

Which is the main reason I combined them (the latter seems far easier to mess up on copy-paste and would be repeated quite a bit). But sure, I can do the conversion if you prefer.

@cyphar
Contributor Author
cyphar commented Dec 5, 2018

I've updated the PR to remove the LimitedWaitGroup abstraction. Personally I think this version is a bit easier to screw up if we add more code to initialisation (since you need to make sure to do both <-sem and wg.Done()) but if that's what you'd prefer I'm okay with that too.

@cyphar cyphar changed the title from "daemon: switch to Semaphore for startup tasks" to "daemon: switch to semaphore-gated WaitGroup for startup tasks" on Dec 5, 2018
@cyphar
Contributor Author
cyphar commented Dec 7, 2018

There we go.

/cc @cpuguy83

@thaJeztah
Member

@unclejack @tonistiigi @kolyshkin PTAL

@cyphar
Contributor Author
cyphar commented Dec 12, 2018

@cpuguy83 Fixed, PTAL.

Member
@cpuguy83 cpuguy83 left a comment

LGTM assuming CI says it's good.

@cyphar
Contributor Author
cyphar commented Dec 14, 2018

The failures are flaky swarm tests.

/ping @cpuguy83

daemon/daemon.go Outdated
Member

Are you sure there is a benefit to doing this in all of these loops? E.g. registerName/registerLinks does not seem to be long-running.

Contributor Author
@cyphar cyphar Dec 19, 2018

Please look at the benchmarks I've done above. register and registerLinks (reglinks) are two of the longest-running jobs, and parallelising them has reduced the startup time from >6 minutes to less than a minute.

Member

So in your benchmark, you are only testing containers that are already running with live-restore or in a stopped state without any restart logic? Otherwise, I'd expect these parts to be orders of magnitude slower than what the benchmark is testing. I'm not sure this is very representative of the performance of this function for most of our users then.

Anyway, I'm fine with this if it makes your use-case better.

Contributor Author
@cyphar cyphar Dec 19, 2018

I'm testing 20K containers that have

% docker run --name container$i ubuntu:latest true

So no restart logic or live-restore. It's just on-disk state, and it takes 6 minutes to start Docker. I'm sure live-restore will make this slower -- but I think that it's completely unacceptable for Docker to take 6 minutes before the API will respond if you have 20K completely dead containers. One minute is still unacceptable, but fixing that will require some very radical changes to how container handling is done in Docker (namely, things should really be lazy-loaded).

I disagree that the majority of users are using 20K live-restore containers or 20K restart-logic containers. But a missed docker system prune and "for some reason" Docker startup is taking longer than you might expect.

Also, this patch actually fixes a problem with loading -- namely that we are currently exceeding ulimits in some cases (which causes containers to not load properly).

@thaJeztah
Member

@tonistiigi PTAL; is this good to go with @cyphar's latest information?

The commit message on the updated branch reads:

Many startup tasks have to run for each container, and thus using a
WaitGroup (which doesn't have a limit to the number of parallel tasks)
can result in Docker exceeding the NOFILE limit quite trivially. A more
optimal solution is to have a parallelism limit by using a semaphore.

In addition, several startup tasks were not parallelised previously
which resulted in very long startup times. According to my testing, 20K
dead containers resulted in ~6 minute startup times (during which time
Docker is completely unusable).

This patch fixes both issues, and the parallelStartupTimes factor chosen
(128 * NumCPU) is based on my own significant testing of the 20K
container case. This patch (on my machines) reduces the startup time
from 6 minutes to less than a minute (ideally this could be further
reduced by removing the need to scan all dead containers on startup --
but that's beyond the scope of this patchset).

In order to avoid the NOFILE limit problem, we also detect this
on-startup and if NOFILE < 2*128*NumCPU we will reduce the parallelism
factor to avoid hitting NOFILE limits (but also emit a warning since
this is almost certainly a mis-configuration).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
@cyphar
Contributor Author
cyphar commented Dec 21, 2018

I've handled the comment by @tonistiigi about context.Background. I think that's everything done, but obviously I'd wait for his review.
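
The merged daemon code is not reproduced in this thread, but the pattern the discussion converges on (a plain sync.WaitGroup gated by golang.org/x/sync/semaphore, acquired with context.Background() as mentioned above) looks roughly like the sketch below; the restoreAll name and the job body are illustrative, not the actual daemon code:

	package main

	import (
		"context"
		"sync"

		"golang.org/x/sync/semaphore"
	)

	// restoreAll runs one job per container ID, allowing at most parallelLimit
	// jobs to execute concurrently.
	func restoreAll(ids []string, parallelLimit int, job func(id string)) {
		sem := semaphore.NewWeighted(int64(parallelLimit))
		var wg sync.WaitGroup
		for _, id := range ids {
			wg.Add(1)
			go func(id string) {
				defer wg.Done()
				// With context.Background() the Acquire blocks until a slot is
				// free; the error path only matters for cancellable contexts.
				if err := sem.Acquire(context.Background(), 1); err != nil {
					return
				}
				defer sem.Release(1)
				job(id)
			}(id)
		}
		wg.Wait()
	}

	func main() {
		ids := []string{"a", "b", "c"}
		restoreAll(ids, 2, func(id string) {
			// ... per-container restore/registration work would go here ...
		})
	}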

@thaJeztah
Member

Moved this to "merge", thanks @cyphar @tonistiigi !
