daemon: switch to semaphore-gated WaitGroup for startup tasks #38301
Conversation
I would disagree that there is little benefit to having more start tasks than cores. Goroutines are really lightweight, and having multiple per core lets the scheduler run tasks while other routines are waiting on IO, and there will be a lot of IO waiting while reading/writing to disk. I think this could be a good fix if it needs to be released quickly, but it seems much better to me to size the semaphore to some percentage of the process's NOFILE limit.

It also occurs to me that there are servers with a huge number of cores. I have seen some absurd ones with as many as 1024 cores, and the NOFILE limit on my dev machine is 1024. I'm concerned that in some corner cases, limiting to the number of procs might still not prevent the startup process from exceeding the ulimit.
My original point about cores was a little flippant -- however, there is a diminishing-returns aspect to goroutines, especially if they are doing very little actual work. Note that "blocking on IO" usually means the kernel is doing work, and there are enough layers of caching that writing two or three JSON files really is not going to block for any significant period of time. Not to mention that spawning 10K goroutines will result in slower execution because of the significant amount of time spent context-switching (and the number of threads that Go will create whenever one of those 10K goroutines blocks in a syscall).

All of that being said, there probably is still a benefit to having (say) 20 goroutines per core or something similar. But it should be noted that the default

To be clear, I don't really care what the number is (so long as it's reasonable), but "as many containers as we have on the machine" isn't a good number.
Definitely agreed on this point. I can't really suggest an educated value for the right number of routines per core. The Go conventional wisdom I've read suggests that even a few hundred routines per core might be alright, but the startup calls are probably not very lightweight, which kind of throws that out the window.

The bottom line of my concern is that this risks increasing the startup time quite a bit. Some reasonably quick benchmarks of how long initialization takes with various routines-per-core values would suggest a practical number beyond which things slow down. That practical value is probably about the same order of magnitude for most systems, so there's no need to over-optimize. And for machines with a huge number of containers, this might even decrease the overall startup time. =)

IMO, testing startup time while increasing routines-per-core by powers of two seems as reasonable as going by tens or hundreds.
Sure, I can do some quick benchmarks (because dead containers are processed on startup, this is even simpler). But (in a slightly snarky tone) it should be noted that the bug reported to me had 10K containers and startup was taking several minutes -- so long that systemd killed Docker because it thought it was timing out. I'm not sure it'd be easy for us to do much worse than that. 😉
After some benchmarking, the actual bottleneck (with 20K dead containers Docker takes >6min to start) is not in the parts that currently use goroutines -- it's in the parts that are still run serially.

But after some more testing, it does look like there is a difference (though not an exceptionally significant one) between different values of the routines-per-core multiplier.
Here are the results of my benchmarks. This was all done on an 8-core machine, with 5 samples of each multiplier -- average plotted with standard-deviation error bars. It looks like the ideal multiplier is around 128 (the value this patch ends up using). (And it should be noted that the current state of Docker is above the top of the graph, at ~400-second startup times.) I set a very high ulimit and there were no errors during execution.
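A minimal, self-contained sketch of the kind of multiplier sweep discussed above. This is not the benchmark behind the numbers in this thread (those came from timing a real dockerd start against 20K on-disk containers); it only illustrates the shape of such a sweep, with each job doing a little JSON file I/O to stand in for loading container state. All names here are illustrative.

package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"runtime"
	"sync"
	"time"
)

// fakeLoad writes and re-reads a tiny JSON file, very loosely mimicking the
// per-container state loading that happens during daemon startup.
func fakeLoad(dir string, id int) {
	path := filepath.Join(dir, fmt.Sprintf("container-%d.json", id))
	data, _ := json.Marshal(map[string]int{"id": id})
	if err := os.WriteFile(path, data, 0o600); err != nil {
		return
	}
	_, _ = os.ReadFile(path)
}

// run executes `jobs` fake load tasks gated by a semaphore of size
// multiplier*NumCPU and returns the wall-clock time taken.
func run(jobs, multiplier int, dir string) time.Duration {
	sem := make(chan struct{}, multiplier*runtime.NumCPU())
	var wg sync.WaitGroup

	start := time.Now()
	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func(id int) {
			sem <- struct{}{}
			defer func() {
				<-sem
				wg.Done()
			}()
			fakeLoad(dir, id)
		}(i)
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	dir, err := os.MkdirTemp("", "startup-bench")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Sweep the routines-per-core multiplier by powers of two.
	for _, m := range []int{1, 2, 4, 8, 16, 32, 64, 128, 256} {
		fmt.Printf("multiplier %4d: %v\n", m, run(20000, m, dir))
	}
}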
Force-pushed from cafee58 to 5656c8b.
Codecov Report
@@ Coverage Diff @@
## master #38301 +/- ##
==========================================
- Coverage 36.58% 36.56% -0.02%
==========================================
Files 608 608
Lines 44988 45036 +48
==========================================
+ Hits 16460 16469 +9
- Misses 26249 26290 +41
+ Partials 2279 2277 -2
Force-pushed from b16d754 to 4ff66ce.
Wow, very nice work. I agree we should limit the number of goroutines here.
daemon/daemon.go
Outdated
If you don't mind, I'd personally prefer not to add the new package for this limited case, especially since it feels weird to mix the two things.
Would you prefer that I use WaitGroup and chan struct{} separately, or that I make LimitedWaitGroup a new type just for daemon/daemon.go?
Separately 🙏👼
Sure, though note that this will change
loadWg := syncext.NewLimitedWaitGroup(parallelStartupJobs)
for _, v := range dir {
	loadWg.Add() // register the task before spawning its goroutine
	go func(id string) {
		loadWg.Up()         // block until a semaphore slot is free
		defer loadWg.Down() // release the slot and mark the task done
		// ...
	}(v.Name())
}
loadWg.Wait()
to
var loadWg sync.WaitGroup
loadSem := make(chan struct{}, parallelStartupJobs) // buffered channel acting as the semaphore
for _, v := range dir {
	loadWg.Add(1)
	go func(id string) {
		loadSem <- struct{}{} // acquire a slot
		defer func() {
			<-loadSem // release the slot
			loadWg.Done()
		}()
		// ...
	}(v.Name())
}
loadWg.Wait()
That is the main reason I combined them (the latter seems far easier to mess up when copy-pasted, and it would be repeated quite a bit). But sure, I can do the conversion if you prefer.
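For reference, here is a minimal sketch of what such a LimitedWaitGroup helper could look like. It is reconstructed from the snippet above and is an assumption about its semantics, not the actual syncext code from the earlier revision of this PR.

package syncext

import "sync"

// LimitedWaitGroup behaves like a sync.WaitGroup whose tracked goroutines are
// additionally gated by a semaphore, so that at most `limit` of them run at once.
type LimitedWaitGroup struct {
	wg  sync.WaitGroup
	sem chan struct{}
}

func NewLimitedWaitGroup(limit int) *LimitedWaitGroup {
	return &LimitedWaitGroup{sem: make(chan struct{}, limit)}
}

// Add registers a pending task; call it before spawning the goroutine.
func (l *LimitedWaitGroup) Add() { l.wg.Add(1) }

// Up blocks until a semaphore slot is free; call it at the top of the goroutine.
func (l *LimitedWaitGroup) Up() { l.sem <- struct{}{} }

// Down releases the slot and marks the task as done.
func (l *LimitedWaitGroup) Down() {
	<-l.sem
	l.wg.Done()
}

// Wait blocks until every task registered with Add has called Down.
func (l *LimitedWaitGroup) Wait() { l.wg.Wait() }

A similar effect can be had with golang.org/x/sync/semaphore.Weighted instead of a buffered channel; the channel version simply avoids an extra dependency.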
Force-pushed from 4ff66ce to 3c2e83d.
I've updated the PR to remove the new package.
There we go. /cc @cpuguy83
Force-pushed from 3c2e83d to dbf559c.
@cpuguy83 Fixed, PTAL.
LGTM assuming CI says it's good.
The failures are flaky swarm tests. /ping @cpuguy83
daemon/daemon.go
Outdated
Are you sure there is a benefit to doing this in all of these loops? E.g. registerName/registerLinks does not seem to be long-running.
Please look at the benchmarks I've done above. register and registerLinks (reglinks) are two of the longest-running jobs, and parallelising them has reduced the startup time from >6 minutes to less than a minute.
So in your benchmark, you are only testing containers that are either already running with live-restore or in a stopped state without any restart logic? Otherwise I'd expect these parts to be orders of magnitude slower than what the benchmark is testing, so I'm not sure it is very representative of the performance of this function for most of our users.

Anyway, I'm fine with this if it makes your use-case better.
I'm testing 20K containers that were created with something like:
% docker run --name container$i ubuntu:latest true
So no restart logic or live-restore. It's just on-disk state, and it takes 6 minutes to start Docker. I'm sure live-restore will make this slower -- but I think it's completely unacceptable for Docker to take 6 minutes before the API will respond when you have 20K completely dead containers. One minute is still unacceptable, but fixing that will require some very radical changes to how container handling is done in Docker (namely, things should really be lazy-loaded).
I disagree that the majority of users are running 20K live-restore containers or 20K restart-logic containers. But one missed docker system prune later, and "for some reason" Docker startup is taking longer than you might expect.

Also, this patch actually fixes a problem with loading -- namely that we are currently exceeding ulimits in some cases (which causes containers to not load properly).
@tonistiigi PTAL; is this good to go with @cyphar's latest information?
Force-pushed from dbf559c to 5a52917.
I've handled the comment by @tonistiigi.
Moved this to "merge", thanks @cyphar @tonistiigi!
Many startup tasks have to run for each container, and thus using a
WaitGroup (which doesn't have a limit to the number of parallel tasks)
can result in Docker exceeding the NOFILE limit quite trivially. A more
optimal solution is to have a parallelism limit.
In addition, several startup tasks were not parallelised previously
which resulted in very long startup times. According to my testing, 20K
dead containers resulted in ~6 minute startup times (during which time
Docker is completely unusable).
This patch fixes both issues, and the parallelStartupTimes factor chosen
(128 * NumCPU) is based on my own significant testing of the 20K
container case. This patch (on my machines) reduces the startup time
from 6 minutes to less than a minute (ideally this could be further
reduced by removing the need to scan all dead containers on startup --
but that's beyond the scope of this patchset).
In order to avoid the NOFILE limit problem, we also detect this
on-startup and if NOFILE < 2*128*NumCPU we will reduce the parallelism
factor to avoid hitting NOFILE limits (but also emit a warning since
this is almost certainly a mis-configuration).
Benchmarks were done on an 8-core machine.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
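A rough sketch of the NOFILE-aware limit calculation described in the commit message above. The function and constant names are illustrative, not the PR's actual code, and it assumes a Linux host and golang.org/x/sys/unix.

package main

import (
	"log"
	"runtime"

	"golang.org/x/sys/unix"
)

// startupJobsPerCPU is an illustrative stand-in for the 128*NumCPU factor.
const startupJobsPerCPU = 128

// parallelStartupLimit returns how many startup jobs to run in parallel,
// reduced (with a warning) when the NOFILE soft limit cannot safely cover
// roughly 2 * 128 * NumCPU file descriptors.
func parallelStartupLimit() int {
	limit := startupJobsPerCPU * runtime.NumCPU()

	var rlim unix.Rlimit
	if err := unix.Getrlimit(unix.RLIMIT_NOFILE, &rlim); err != nil {
		// Could not read the limit; fall back to the default factor.
		return limit
	}
	if rlim.Cur < uint64(2*limit) {
		log.Printf("warning: NOFILE soft limit (%d) is below 2*128*NumCPU; reducing startup parallelism (this is almost certainly a mis-configuration)", rlim.Cur)
		limit = int(rlim.Cur / 2)
		if limit < 1 {
			limit = 1
		}
	}
	return limit
}

func main() {
	log.Printf("startup parallelism: %d", parallelStartupLimit())
}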