Windows: do not use parallelism on container restart by olljanat · Pull Request #39733 · moby/moby · GitHub

Conversation

olljanat
Contributor

- What I did
#38301 set the container restart/restore task parallelism limit to 128*NumCPU, which is a good limit for Linux containers, especially when they are built correctly by following the one-process-per-container rule.

However, Windows containers are much heavier. For example, the Windows Server 2019 base image mcr.microsoft.com/windows/servercore:ltsc2019 by itself includes ~20 system processes, so restoring generates such a high load on the server that it cannot respond to anything else until the restore is completed.
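
For context, a minimal standalone Go sketch (illustration only, not the moby code itself) of how a weighted semaphore bounds parallel restore work, which is the mechanism the daemon startup path relies on:

package main

import (
	"context"
	"fmt"
	"runtime"
	"sync"
	"time"

	"golang.org/x/sync/semaphore"
)

func main() {
	// Cap how many "restores" may run at once; on Linux the daemon uses 128*NumCPU.
	limit := int64(128 * runtime.NumCPU())
	sem := semaphore.NewWeighted(limit)

	var group sync.WaitGroup
	for i := 0; i < 100; i++ {
		group.Add(1)
		go func(id int) {
			defer group.Done()
			if err := sem.Acquire(context.Background(), 1); err != nil {
				return
			}
			defer sem.Release(1)
			time.Sleep(10 * time.Millisecond) // placeholder for the real container start
			fmt.Println("restored container", id)
		}(i)
	}
	group.Wait()
}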

- How I did it
Disabled restore parallelism on the Windows platform.

- How to verify it
I created 100 containers with a restart policy:

for($i=1;$i -le 100;$i++) {
	docker run -d --restart always --network nat mcr.microsoft.com/windows/servercore:ltsc2019 ping -t 127.0.0.1
	start-sleep -seconds 10
}

On my 4 CPU test machine they take about 8 minutes to restart both with and without this change.
However, there is a big difference in how well the server is able to respond to other commands.

Without this change the CPU load is constantly at 100% and even typing text into Notepad takes a long time:
(screenshot: without_patch_restore)

After this change the server still uses all the CPU it has, but now it still responds to user input.
(screenshot: docker_restart_parallel_1)

- A picture of a cute animal (not mandatory but encouraged)
(image)

@olljanat
Contributor Author

@thaJeztah @ddebroy PTAL. It would be nice to get this into the cherry-pick process, as we have seen Windows container hosts freeze on reboot from time to time, and the only fix for that is to do a manual hard reset on them. AFAIU it happens because they are not able to handle kernel messages, as the last message in the event logs varies.

Member
@thaJeztah thaJeztah left a comment

SGTM (don't know if 1 is the right value to use, but makes sense to me from the description)

ping @jhowardmsft @ddebroy @vikramhh ptal

@olljanat
Contributor Author

I actually also tested with the number of CPUs, but the result was that restore speed is very nearly the same as with 1, while the server is more "stuck" during the restore.

One issue with this is that

if err := daemon.containerStart(c, "", "", true); err != nil {

actually finishes while some processes inside the container are still starting, so the optimal solution would probably be to modify the code to keep track of how many containers are not yet completely started. But that would probably also need changes on the hcsshim side (not sure, I have not investigated).

That is why I ended up proposing this kind of simple solution, which hopefully is good enough for most use cases.

Contributor
@ddebroy ddebroy left a comment

Is it possible to make this a configurable option for the engine rather than hardcoding it? That way, depending on the workloads and environment, administrators can tune the value to exactly what they want. It can default to 1 if that is desirable.

@olljanat
Contributor Author
olljanat commented Aug 13, 2019

@ddebroy sure, it is possible, but I was wondering why it has not been implemented earlier? 🤔

Also, we then need to decide on a name for that option. Any suggestions?

@ddebroy
Contributor
ddebroy commented Aug 13, 2019

Maybe something like --parallel-containers-restore-limit ? cc @thaJeztah

@thaJeztah
Member

@ddebroy if we want to make it configurable, I'd keep that separate. Would an administrator know what value to put there? I think we should have more metrics as to what limit would be realistic (on a beefy machine). If (even on a powerful machine) parallel > 1 doesn't make restoring the containers actually faster, then I wonder if we should make it configurable.

(basically; I'd like to avoid adding bells and whistles that nobody uses)

@thaJeztah
Member

Also, I forgot to write this, but making it configurable would be a new feature, and not something we would backport to current releases; it looks like this patch may be addressing an issue that was introduced in 19.03, so it could be a candidate for backporting.

@olljanat
Contributor Author

Would an administrator know what value to put there? I think we should have more metrics as to what limit would be realistic (on a beefy machine).

I repeated the same 100-container restore test with a couple of different combinations:

4 CPU - default limit

time="2019-08-14T11:22:14.260679800+03:00" level=info msg="Starting up"
...
time="2019-08-14T11:23:11.326345700+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T11:23:11.326345700+03:00") - (get-date "2019-08-14T11:22:14.260679800+03:00")
TotalSeconds      : 57,0656659

4 CPU - limit to 1

time="2019-08-14T11:31:09.337321500+03:00" level=info msg="Starting up"
...
time="2019-08-14T11:32:09.525747200+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T11:32:09.525747200+03:00") - (get-date "2019-08-14T11:31:09.337321500+03:00")
TotalSeconds      : 60,1884257

8 CPU - default limit

time="2019-08-14T11:12:23.265451500+03:00" level=info msg="Starting up"
...
time="2019-08-14T11:12:56.014299000+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T11:12:56.014299000+03:00") - (get-date "2019-08-14T11:12:23.265451500+03:00")
TotalSeconds      : 32,7488475

8 CPU - limit to 1

time="2019-08-14T11:16:39.268269800+03:00" level=info msg="Starting up"
...
time="2019-08-14T11:17:44.633734300+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T11:17:44.633734300+03:00") - (get-date "2019-08-14T11:16:39.268269800+03:00")
TotalSeconds      : 65,3654645

36 CPU - default limit

time="2019-08-14T10:41:41.025922400+03:00" level=info msg="Starting up"
...
time="2019-08-14T10:42:15.217742700+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T10:42:15.217742700+03:00") - (get-date "2019-08-14T10:41:41.025922400+03:00")
TotalSeconds      : 34,1918203

36 CPU - limit to 1

time="2019-08-14T10:57:17.434044600+03:00" level=info msg="Starting up"
...
time="2019-08-14T10:58:27.619699700+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T10:58:27.619699700+03:00") - (get-date "2019-08-14T10:57:17.434044600+03:00")
TotalSeconds      : 70,1856551

So it looks like those bigger machines can actually take a bigger limit, but I can try to see if I can come up with a nice rule for that.

It is also interesting that, even with 4 CPUs, these tests were now a lot faster after I switched to another environment (maybe because of faster hard drives, etc.).

@olljanat olljanat force-pushed the win-restore-no-parallelism branch from f4d41a3 to 64a5ba8 Compare August 14, 2019 13:22
@olljanat
Contributor Author

@thaJeztah @ddebroy I did more testing and ended up changing the logic so that it now uses the value 1 if the server has at most 4 CPUs, and above that the number of CPUs.

Logs of those startup test comparisons are below. As you can see, the more CPUs the server has, the less this change affects startup speed, and we still avoid the hanging which I am trying to fix here. (A sketch of this interim heuristic follows the logs.)

4 CPU - default limit

time="2019-08-14T14:49:42.549725900+03:00" level=info msg="Starting up"
...
time="2019-08-14T14:50:50.048546700+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T14:50:50.048546700+03:00") - (get-date "2019-08-14T14:49:42.549725900+03:00")
TotalSeconds      : 67,4988208

4 CPU - limit to 1

time="2019-08-14T14:25:43.905063900+03:00" level=info msg="Starting up"
...
time="2019-08-14T14:27:14.485654600+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T14:27:14.485654600+03:00") - (get-date "2019-08-14T14:25:43.905063900+03:00")
TotalSeconds      : 90,5805907

5 CPU - default limit

time="2019-08-14T12:22:17.792513300+03:00" level=info msg="Starting up"
...
time="2019-08-14T12:22:53.407527000+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T12:22:53.407527000+03:00") - (get-date "2019-08-14T12:22:17.792513300+03:00")
TotalSeconds      : 35,6150137

5 CPU - limit to 5

time="2019-08-14T15:20:23.314785800+03:00" level=info msg="Starting up"
...
time="2019-08-14T15:21:22.962476000+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T15:21:22.962476000+03:00") - (get-date "2019-08-14T15:20:23.314785800+03:00")
TotalSeconds      : 59,6476902

6 CPU - default limit

time="2019-08-14T13:06:55.115185700+03:00" level=info msg="Starting up"
...
time="2019-08-14T13:07:40.747497400+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T13:07:40.747497400+03:00") - (get-date "2019-08-14T13:06:55.115185700+03:00")
TotalSeconds      : 45,6323117

6 CPU - limit to 6

time="2019-08-14T16:15:29.737757900+03:00" level=info msg="Starting up"
...
time="2019-08-14T16:16:21.564104300+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T16:16:21.564104300+03:00") - (get-date "2019-08-14T16:15:29.737757900+03:00")
TotalSeconds      : 51,8263464

7 CPU - default limit

time="2019-08-14T14:13:20.532105100+03:00" level=info msg="Starting up"
...
time="2019-08-14T14:14:05.378632900+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T14:14:05.378632900+03:00") - (get-date "2019-08-14T14:13:20.532105100+03:00")
TotalSeconds      : 44,8465278

7 CPU - limit to 7

time="2019-08-14T16:00:29.633611200+03:00" level=info msg="Starting up"
...
time="2019-08-14T16:01:17.411028500+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T16:01:17.411028500+03:00") - (get-date "2019-08-14T16:00:29.633611200+03:00")
TotalSeconds      : 47,7774173

16 CPU - default limit

time="2019-08-14T14:43:18.720012600+03:00" level=info msg="Starting up"
...
time="2019-08-14T14:44:00.060481100+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T14:44:00.060481100+03:00") - (get-date "2019-08-14T14:43:18.720012600+03:00")
TotalSeconds      : 41,3404685

16 CPU - limit to 16

time="2019-08-14T14:31:52.016665600+03:00" level=info msg="Starting up"
...
time="2019-08-14T14:32:38.578530700+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T14:32:38.578530700+03:00") - (get-date "2019-08-14T14:31:52.016665600+03:00")
TotalSeconds      : 46,5618651
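
For clarity, a minimal Go sketch of the interim heuristic described above (my paraphrase, not the actual patch; windowsRestoreLimit is a hypothetical name):

package main

import (
	"fmt"
	"runtime"
)

// windowsRestoreLimit illustrates the interim logic: serialize restores on hosts
// with at most 4 CPUs, otherwise allow one parallel restore per CPU.
func windowsRestoreLimit() int {
	if runtime.NumCPU() <= 4 {
		return 1
	}
	return runtime.NumCPU()
}

func main() {
	fmt.Println("restore parallelism:", windowsRestoreLimit())
}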

@coolljt0725
Contributor

ping @jhowardmsft @ddebroy @vikramhh ptal

@vikramhh

@olljanat - thanks for sharing extensive comparison data above. #38301 talks about container numbers that are orders of magnitude more than what you used [20K vs. 100]. Any way to be sure that 100 is indeed representative of typical workloads and not "too small"?

IIUC the main reason for this change is to mitigate the negative impact on user experience. For Windows Server SKUs, that would typically not be an issue. It could be possible to scope this change to desktop only, but at least based on the data you have shared, it looks like it would currently make less than a 10% difference. Are there scenarios where we would care about that perf regression?

LGTM otherwise

@olljanat
Contributor Author

@olljanat - thanks for sharing extensive comparison data above. #38301 talks about container numbers that are orders of magnitude more than what you used [20K vs. 100]. Any way to be sure that 100 is indeed representative of typical workloads and not "too small"?

That is a good question, and I would definitely like to know if someone from Microsoft has tested where those limits actually are.

We are using ~20 containers per node, and based on my experience I would not recommend that anyone use more than 50 containers per node. So 100 is about double that "limit".

IIUC the main reason for this change is to mitigate the negative impact on user experience. For Windows Server SKUs, that would typically not be an issue. It could be possible to scope this change to desktop only, but at least based on the data you have shared, it looks like it would currently make less than a 10% difference. Are there scenarios where we would care about that perf regression?

That is not true. User experience is just the easiest way to demonstrate what happens. The real issue is that under that load the server is not able to handle other processes (e.g. monitoring shows that the server is down, VMware says that the guest tools are not running, RDP connections to the server do not work, etc.). We have seen that happening on production servers every week: after scheduled reboots they get frozen, and the only way to fix them is to power off + power on (randomly on some of the servers).

Btw, I did some more testing with the Moby master version on a 36-core machine. When I restart 200 containers, it actually completes within two minutes and CPU usage does not exceed 40%. For the next two minutes the server just idles. But then this happens:
(screenshot: two_minutes_after_200_containers_restore)
and the server CPU usage stays at 100% for the next ten minutes, and all those issues which I described earlier are visible.

So this PR definitely will not fix the whole issue, but it is more about making the situation a little less bad.

@thaJeztah
Member

ping @jterry75 @ddebroy @vikramhh ptal ^

@jterry75
Contributor
jterry75 commented Sep 3, 2019

In general I would say that MAX(1, FLOOR(runtime.NumCPU() * .8)) is a good number, i.e. 80% of the available system devoted to containers, but not more than that. But that's really a server answer and not a client answer. Also, are your numbers on Windows Server or Windows Client?
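
As a rough illustration, a hedged Go sketch of a Windows-specific adjustParallelLimit along the lines of that formula (the exact shape of the merged patch may differ):

package daemon

import (
	"math"
	"runtime"
)

// adjustParallelLimit caps restore parallelism at roughly 80% of the host CPUs
// (never below 1) so the host keeps headroom for other work; a sketch of the
// approach discussed here, not necessarily the exact merged code.
func adjustParallelLimit(n int, limit int) int {
	winLimit := int(math.Max(1, math.Floor(float64(runtime.NumCPU())*.8)))
	if winLimit < limit {
		return winLimit
	}
	return limit
}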

@olljanat
Contributor Author
olljanat commented Sep 3, 2019

@jterry75 all those tests were done with Windows Server 2019. I will try with that number and update the PR.

@olljanat olljanat force-pushed the win-restore-no-parallelism branch from 64a5ba8 to 2d72c4e Compare September 5, 2019 16:35
@vikramhh
vikramhh commented Sep 5, 2019

LGTM

@olljanat olljanat force-pushed the win-restore-no-parallelism branch from 2d72c4e to b4ad1a5 Compare September 5, 2019 17:36
@olljanat
Contributor Author
olljanat commented Sep 5, 2019

Updated based on @jterry75's suggestion.

Here is also the calculated parallelism value for different CPU counts:

for i := 1; i <= 64; i++ {
   fmt.Printf("CPU count: %v Parallelism: %v\n", i, int(math.Max(1, math.Floor(float64(i)*.8))))
}

CPU count: 1 Parallelism: 1
CPU count: 2 Parallelism: 1
CPU count: 3 Parallelism: 2
CPU count: 4 Parallelism: 3
CPU count: 5 Parallelism: 4
CPU count: 6 Parallelism: 4
CPU count: 7 Parallelism: 5
CPU count: 8 Parallelism: 6
CPU count: 9 Parallelism: 7
CPU count: 10 Parallelism: 8
CPU count: 11 Parallelism: 8
CPU count: 12 Parallelism: 9
CPU count: 13 Parallelism: 10
CPU count: 14 Parallelism: 11
CPU count: 15 Parallelism: 12
CPU count: 16 Parallelism: 12
CPU count: 17 Parallelism: 13
CPU count: 18 Parallelism: 14
CPU count: 19 Parallelism: 15
CPU count: 20 Parallelism: 16
CPU count: 21 Parallelism: 16
CPU count: 22 Parallelism: 17
CPU count: 23 Parallelism: 18
CPU count: 24 Parallelism: 19
CPU count: 25 Parallelism: 20
CPU count: 26 Parallelism: 20
CPU count: 27 Parallelism: 21
CPU count: 28 Parallelism: 22
CPU count: 29 Parallelism: 23
CPU count: 30 Parallelism: 24
CPU count: 31 Parallelism: 24
CPU count: 32 Parallelism: 25
CPU count: 33 Parallelism: 26
CPU count: 34 Parallelism: 27
CPU count: 35 Parallelism: 28
CPU count: 36 Parallelism: 28
CPU count: 37 Parallelism: 29
CPU count: 38 Parallelism: 30
CPU count: 39 Parallelism: 31
CPU count: 40 Parallelism: 32
CPU count: 41 Parallelism: 32
CPU count: 42 Parallelism: 33
CPU count: 43 Parallelism: 34
CPU count: 44 Parallelism: 35
CPU count: 45 Parallelism: 36
CPU count: 46 Parallelism: 36
CPU count: 47 Parallelism: 37
CPU count: 48 Parallelism: 38
CPU count: 49 Parallelism: 39
CPU count: 50 Parallelism: 40
CPU count: 51 Parallelism: 40
CPU count: 52 Parallelism: 41
CPU count: 53 Parallelism: 42
CPU count: 54 Parallelism: 43
CPU count: 55 Parallelism: 44
CPU count: 56 Parallelism: 44
CPU count: 57 Parallelism: 45
CPU count: 58 Parallelism: 46
CPU count: 59 Parallelism: 47
CPU count: 60 Parallelism: 48
CPU count: 61 Parallelism: 48
CPU count: 62 Parallelism: 49
CPU count: 63 Parallelism: 50
CPU count: 64 Parallelism: 51

@coolljt0725
Contributor

ping @olljanat, a unit test failed

@olljanat
Contributor Author

Hmm. Looks like RS1 failed because of a build server issue. Is there some way to re-run just that one?

@thaJeztah
Member

Ah, yes, the new Jenkins does not (yet) have an option to trigger using labels 😞 (also, it's all or nothing, so we can't re-run a single stage because Jenkins declarative pipelines don't support that)

I see tests passed on RS1, so it should be OK to ignore:


[2019-09-10T18:55:19.464Z] OK: 524 passed, 548 skipped
[2019-09-10T18:55:19.464Z] PASS

Everything else is green

@coolljt0725
Contributor

@thaJeztah

Hmm. Looks like RS1 failed because of a build server issue. Is there some way to re-run just that one?

can you re-run it?

Member

We should change this function to return an int64 (it looks like we always convert the value):

moby/daemon/daemon.go

Lines 223 to 227 in 32688a4

parallelLimit := adjustParallelLimit(len(dir), 128*runtime.NumCPU())
// Re-used for all parallel startup jobs.
var group sync.WaitGroup
sem := semaphore.NewWeighted(int64(parallelLimit))

moby/daemon/daemon.go

Lines 576 to 579 in 32688a4

parallelLimit := adjustParallelLimit(len(daemon.List()), 128*runtime.NumCPU())
var group sync.WaitGroup
sem := semaphore.NewWeighted(int64(parallelLimit))

(doesn't have to be done in this PR, but just noticed it)
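
A minimal sketch, assuming the current signature, of what that follow-up could look like (placeholder body; the real per-platform adjustment logic is unchanged):

package daemon

// adjustParallelLimit returning int64 directly, as suggested above, so call sites
// can pass the result straight to semaphore.NewWeighted without a conversion.
// The body here is a placeholder for the existing platform-specific logic.
func adjustParallelLimit(n int, limit int) int64 {
	_ = n // object count, used by the platform-specific implementations
	return int64(limit)
}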

Contributor Author

I suggest that this be done in another PR.

@thaJeztah
Member

@coolljt0725 I can re-run it from Jenkins, but (see my comment above) it actually passed. I also left a comment/question on the code change (so let me wait for @olljanat to reply first) 🤗

@olljanat olljanat force-pushed the win-restore-no-parallelism branch from f976dfe to 464876c Compare September 15, 2019 09:25
thaJeztah
thaJeztah previously approved these changes Sep 16, 2019
Member
@thaJeztah thaJeztah left a comment

LGTM, thanks!

@thaJeztah
Member

oh! I think it's missing an import;

daemon/daemon_windows.go:1::warning: file is not goimported (goimports)

@thaJeztah thaJeztah dismissed their stale review September 16, 2019 07:18

see above

@olljanat olljanat force-pushed the win-restore-no-parallelism branch from 464876c to 7a9ab8d Compare September 16, 2019 16:43
@olljanat
Contributor Author

oh! I think it's missing an import;

daemon/daemon_windows.go:1::warning: file is not goimported (goimports)

😵 ... fixed

@olljanat
Contributor Author

@thaJeztah ping, is this OK now?

Member
@thaJeztah thaJeztah left a comment

LGTM

@olljanat olljanat force-pushed the win-restore-no-parallelism branch from 7a9ab8d to a0b9ba0 Compare November 11, 2019 11:30
Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
@olljanat olljanat force-pushed the win-restore-no-parallelism branch from a0b9ba0 to 447a840 Compare November 11, 2019 13:45
@olljanat
Contributor Author

Fixed a typo and rebased. Now it also passes the tests.

@coolljt0725
Contributor

ping @thaJeztah @tonistiigi @jterry75 PTAL

@jterry75
Contributor

This LGTM (not a maintainer)

@thaJeztah
Member

let's merge 👍

@thaJeztah thaJeztah merged commit c831882 into moby:master Nov 12, 2019
@olljanat
Contributor Author

@thaJeztah nice, and can we get this one into the cherry-pick process too?

@thaJeztah
Member

opened docker-archive#423
