Windows: do not use parallelism on container restart by olljanat · Pull Request #39733 · moby/moby · GitHub

Conversation

olljanat
Contributor

- What I did
#38301 set the container restart/restore task parallelism limit to 128*NumCPU, which is a good limit for Linux containers, especially when they are built correctly by following the one-process-per-container rule.

However, Windows containers are much heavier. For example, the Windows Server 2019 base image mcr.microsoft.com/windows/servercore:ltsc2019 by itself includes ~20 system processes, so restoring generates such a high load on the server that it cannot respond to anything else until the restore is completed.
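
For context, a minimal standalone Go sketch (illustration only, not the moby code itself) of how a weighted semaphore bounds parallel restore work, which is the mechanism the daemon startup path relies on:

package main

import (
	"context"
	"fmt"
	"runtime"
	"sync"
	"time"

	"golang.org/x/sync/semaphore"
)

func main() {
	// Cap how many "restores" may run at once; on Linux the daemon uses 128*NumCPU.
	limit := int64(128 * runtime.NumCPU())
	sem := semaphore.NewWeighted(limit)

	var group sync.WaitGroup
	for i := 0; i < 100; i++ {
		group.Add(1)
		go func(id int) {
			defer group.Done()
			if err := sem.Acquire(context.Background(), 1); err != nil {
				return
			}
			defer sem.Release(1)
			time.Sleep(10 * time.Millisecond) // placeholder for the real container start
			fmt.Println("restored container", id)
		}(i)
	}
	group.Wait()
}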

- How I did it
Disabled restore parallelism on the Windows platform.

- How to verify it
I created 100 containers with a restart policy:

for($i=1;$i -le 100;$i++) {
	docker run -d --restart always --network nat mcr.microsoft.com/windows/servercore:ltsc2019 ping -t 127.0.0.1
	start-sleep -seconds 10
}

On my 4 CPU test machine they take about 8 minutes to restart both with and without this change.
However, there is a big difference in how well the server is able to respond to other commands.

Without this change the CPU load is constantly at 100% and even typing text into Notepad takes a long time:
(screenshot: without_patch_restore)

After this change the server still uses all the CPU it has, but now it still responds to user input.
(screenshot: docker_restart_parallel_1)

- A picture of a cute animal (not mandatory but encouraged)
(image)

@olljanat
Contributor Author

@thaJeztah @ddebroy PTAL. It would be nice to get this into the cherry-pick process, as we have seen Windows container hosts freeze on reboot from time to time, and the only fix for that is to do a manual hard reset on them. AFAIU it happens because they are not able to handle kernel messages, as the last message in the event logs varies.

Member
@thaJeztah thaJeztah left a comment

SGTM (don't know if 1 is the right value to use, but makes sense to me from the description)

ping @jhowardmsft @ddebroy @vikramhh ptal

@olljanat
Contributor Author

I actually also tested with the number of CPUs, but the result was that restore speed is very nearly the same as with 1, while the server is more "stuck" during the restore.

One issue with this is that

if err := daemon.containerStart(c, "", "", true); err != nil {

actually finishes while some processes inside the container are still starting, so the optimal solution would probably be to modify the code to keep track of how many containers are not yet completely started. But that would probably also need changes on the hcsshim side (not sure, I have not investigated).

That is why I ended up proposing this kind of simple solution, which hopefully is good enough for most use cases.

Contributor
@ddebroy ddebroy left a comment

Is it possible to make this a configurable option for the engine rather than hardcoding it? That way, depending on the workloads and environment, administrators can tune the value to exactly what they want. It can default to 1 if that is desirable.

@olljanat
Contributor Author
olljanat commented Aug 13, 2019

@ddebroy sure, it is possible, but I was wondering why it has not been implemented earlier? 🤔

Also, we then need to decide on a name for that option. Any suggestions?

@ddebroy
Contributor
ddebroy commented Aug 13, 2019

Maybe something like --parallel-containers-restore-limit ? cc @thaJeztah

@thaJeztah
Member

@ddebroy if we want to make it configurable, I'd keep that separate. Would an administrator know what value to put there? I think we should have more metrics as to what limit would be realistic (on a beefy machine). If (even on a powerful machine) parallel > 1 doesn't make restoring the containers actually faster, then I wonder if we should make it configurable.

(basically; I'd like to avoid adding bells and whistles that nobody uses)

@thaJeztah
Member

Also, I forgot to write this, but making it configurable would be a new feature, and not something we would backport to current releases; it looks like this patch may be addressing an issue that was introduced in 19.03, so it could be a candidate for backporting.

@olljanat
Contributor Author

Would an administrator know what value to put there? I think we should have more metrics as to what limit would be realistic (on a beefy machine).

I repeated the same 100-container restore test with a couple of different combinations:

4 CPU - default limit

time="2019-08-14T11:22:14.260679800+03:00" level=info msg="Starting up"
...
time="2019-08-14T11:23:11.326345700+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T11:23:11.326345700+03:00") - (get-date "2019-08-14T11:22:14.260679800+03:00")
TotalSeconds      : 57,0656659

4 CPU - limit to 1

time="2019-08-14T11:31:09.337321500+03:00" level=info msg="Starting up"
...
time="2019-08-14T11:32:09.525747200+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T11:32:09.525747200+03:00") - (get-date "2019-08-14T11:31:09.337321500+03:00")
TotalSeconds      : 60,1884257

8 CPU - default limit

time="2019-08-14T11:12:23.265451500+03:00" level=info msg="Starting up"
...
time="2019-08-14T11:12:56.014299000+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T11:12:56.014299000+03:00") - (get-date "2019-08-14T11:12:23.265451500+03:00")
TotalSeconds      : 32,7488475

8 CPU - limit to 1

time="2019-08-14T11:16:39.268269800+03:00" level=info msg="Starting up"
...
time="2019-08-14T11:17:44.633734300+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T11:17:44.633734300+03:00") - (get-date "2019-08-14T11:16:39.268269800+03:00")
TotalSeconds      : 65,3654645

36 CPU - default limit

time="2019-08-14T10:41:41.025922400+03:00" level=info msg="Starting up"
...
time="2019-08-14T10:42:15.217742700+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T10:42:15.217742700+03:00") - (get-date "2019-08-14T10:41:41.025922400+03:00")
TotalSeconds      : 34,1918203

36 CPU - limit to 1

time="2019-08-14T10:57:17.434044600+03:00" level=info msg="Starting up"
...
time="2019-08-14T10:58:27.619699700+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T10:58:27.619699700+03:00") - (get-date "2019-08-14T10:57:17.434044600+03:00")
TotalSeconds      : 70,1856551

So it looks like those bigger machines can actually take a bigger limit, but I can try to see if I can come up with a nice rule for that.

It is also interesting that, even with 4 CPUs, these tests were now a lot faster after I switched to another environment (maybe because of faster hard drives, etc.).

@olljanat olljanat force-pushed the win-restore-no-parallelism branch from f4d41a3 to 64a5ba8 Compare August 14, 2019 13:22
@olljanat
Contributor Author

@thaJeztah @ddebroy I did more testing and ended up changing the logic so that it now uses the value 1 if the server has at most 4 CPUs, and above that the number of CPUs.

Logs of those startup test comparisons are below. As you can see, the more CPUs the server has, the less this change affects startup speed, and we still avoid the hanging which I am trying to fix here. (A sketch of this interim heuristic follows the logs.)

4 CPU - default limit

time="2019-08-14T14:49:42.549725900+03:00" level=info msg="Starting up"
...
time="2019-08-14T14:50:50.048546700+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T14:50:50.048546700+03:00") - (get-date "2019-08-14T14:49:42.549725900+03:00")
TotalSeconds      : 67,4988208

4 CPU - limit to 1

time="2019-08-14T14:25:43.905063900+03:00" level=info msg="Starting up"
...
time="2019-08-14T14:27:14.485654600+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T14:27:14.485654600+03:00") - (get-date "2019-08-14T14:25:43.905063900+03:00")
TotalSeconds      : 90,5805907

5 CPU - default limit

time="2019-08-14T12:22:17.792513300+03:00" level=info msg="Starting up"
...
time="2019-08-14T12:22:53.407527000+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T12:22:53.407527000+03:00") - (get-date "2019-08-14T12:22:17.792513300+03:00")
TotalSeconds      : 35,6150137

5 CPU - limit to 5

time="2019-08-14T15:20:23.314785800+03:00" level=info msg="Starting up"
...
time="2019-08-14T15:21:22.962476000+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T15:21:22.962476000+03:00") - (get-date "2019-08-14T15:20:23.314785800+03:00")
TotalSeconds      : 59,6476902

6 CPU - default limit

time="2019-08-14T13:06:55.115185700+03:00" level=info msg="Starting up"
...
time="2019-08-14T13:07:40.747497400+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T13:07:40.747497400+03:00") - (get-date "2019-08-14T13:06:55.115185700+03:00")
TotalSeconds      : 45,6323117

6 CPU - limit to 6

time="2019-08-14T16:15:29.737757900+03:00" level=info msg="Starting up"
...
time="2019-08-14T16:16:21.564104300+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T16:16:21.564104300+03:00") - (get-date "2019-08-14T16:15:29.737757900+03:00")
TotalSeconds      : 51,8263464

7 CPU - default limit

time="2019-08-14T14:13:20.532105100+03:00" level=info msg="Starting up"
...
time="2019-08-14T14:14:05.378632900+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T14:14:05.378632900+03:00") - (get-date "2019-08-14T14:13:20.532105100+03:00")
TotalSeconds      : 44,8465278

7 CPU - limit to 7

time="2019-08-14T16:00:29.633611200+03:00" level=info msg="Starting up"
...
time="2019-08-14T16:01:17.411028500+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T16:01:17.411028500+03:00") - (get-date "2019-08-14T16:00:29.633611200+03:00")
TotalSeconds      : 47,7774173

16 CPU - default limit

time="2019-08-14T14:43:18.720012600+03:00" level=info msg="Starting up"
...
time="2019-08-14T14:44:00.060481100+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T14:44:00.060481100+03:00") - (get-date "2019-08-14T14:43:18.720012600+03:00")
TotalSeconds      : 41,3404685

16 CPU - limit to 16

time="2019-08-14T14:31:52.016665600+03:00" level=info msg="Starting up"
...
time="2019-08-14T14:32:38.578530700+03:00" level=info msg="API listen on //./pipe/docker_engine"

(get-date "2019-08-14T14:32:38.578530700+03:00") - (get-date "2019-08-14T14:31:52.016665600+03:00")
TotalSeconds      : 46,5618651
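
For clarity, a minimal Go sketch of the interim heuristic described above (my paraphrase, not the actual patch; windowsRestoreLimit is a hypothetical name):

package main

import (
	"fmt"
	"runtime"
)

// windowsRestoreLimit illustrates the interim logic: serialize restores on hosts
// with at most 4 CPUs, otherwise allow one parallel restore per CPU.
func windowsRestoreLimit() int {
	if runtime.NumCPU() <= 4 {
		return 1
	}
	return runtime.NumCPU()
}

func main() {
	fmt.Println("restore parallelism:", windowsRestoreLimit())
}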

@coolljt0725
Contributor

ping @jhowardmsft @ddebroy @vikramhh ptal

@vikramhh

@olljanat - thanks for sharing extensive comparison data above. #38301 talks about container numbers that are orders of magnitude more than what you used [20K vs. 100]. Any way to be sure that 100 is indeed representative of typical workloads and not "too small"?

IIUC the main reason for this change is to mitigate the negative impact on user experience. For Windows Server SKUs, that would typically not be an issue. It could be possible to scope this change to desktop only, but at least based on the data you have shared, it looks like it would currently make less than a 10% difference. Are there scenarios where we would care about that perf regression?

LGTM otherwise

@olljanat
Contributor Author

@olljanat - thanks for sharing extensive comparison data above. #38301 talks about container numbers that are orders of magnitude more than what you used [20K vs. 100]. Any way to be sure that 100 is indeed representative of typical workloads and not "too small"?

That is a good question, and I would definitely like to know if someone from Microsoft has tested where those limits actually are.

We are using ~20 containers per node, and based on my experience I would not recommend that anyone use more than 50 containers per node. So 100 is about double that "limit".

IIUC the main reason for this change is to mitigate the negative impact on user experience. For Windows Server SKUs, that would typically not be an issue. It could be possible to scope this change to desktop only, but at least based on the data you have shared, it looks like it would currently make less than a 10% difference. Are there scenarios where we would care about that perf regression?

That is not true. User experience is just the easiest way to demonstrate what happens. The real issue is that under that load the server is not able to handle other processes (e.g. monitoring shows that the server is down, VMware says that the guest tools are not running, RDP connections to the server do not work, etc.). We have seen that happening on production servers every week: after scheduled reboots they get frozen, and the only way to fix them is to power off + power on (randomly on some of the servers).

Btw, I did some more testing with the Moby master version on a 36-core machine. When I restart 200 containers, it actually completes within two minutes and CPU usage does not exceed 40%. For the next two minutes the server just idles. But then this happens:
(screenshot: two_minutes_after_200_containers_restore)
and the server CPU usage stays at 100% for the next ten minutes, and all those issues which I described earlier are visible.

So this PR definitely will not fix the whole issue, but it is more about making the situation a little less bad.

@thaJeztah
Member

ping @jterry75 @ddebroy @vikramhh ptal ^

@jterry75
Contributor
jterry75 commented Sep 3, 2019

In general I would say that MAX(1, FLOOR(runtime.NumCPU() * .8)) is a good number, i.e. 80% of the available system devoted to containers, but not more than that. But that's really a server answer and not a client answer. Also, are your numbers on Windows Server or Windows Client?
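
As a rough illustration, a hedged Go sketch of a Windows-specific adjustParallelLimit along the lines of that formula (the exact shape of the merged patch may differ):

package daemon

import (
	"math"
	"runtime"
)

// adjustParallelLimit caps restore parallelism at roughly 80% of the host CPUs
// (never below 1) so the host keeps headroom for other work; a sketch of the
// approach discussed here, not necessarily the exact merged code.
func adjustParallelLimit(n int, limit int) int {
	winLimit := int(math.Max(1, math.Floor(float64(runtime.NumCPU())*.8)))
	if winLimit < limit {
		return winLimit
	}
	return limit
}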

@olljanat
Contributor Author
olljanat commented Sep 3, 2019

@jterry75 all those tests were done with Windows Server 2019. I will try with that number and update the PR.

@olljanat olljanat force-pushed the win-restore-no-parallelism branch from 64a5ba8 to 2d72c4e Compare September 5, 2019 16:35
@vikramhh
vikramhh commented Sep 5, 2019

LGTM

@olljanat olljanat force-pushed the win-restore-no-parallelism branch from 2d72c4e to b4ad1a5 Compare September 5, 2019 17:36
@olljanat
Contributor Author
olljanat commented Sep 5, 2019

Updated based on @jterry75's suggestion.

Here is also the calculated parallelism value for different CPU counts:

for i := 1; i <= 64; i++ {
   fmt.Printf("CPU count: %v Parallelism: %v\n", i, int(math.Max(1, math.Floor(float64(i)*.8))))
}

CPU count: 1 Parallelism: 1
CPU count: 2 Parallelism: 1
CPU count: 3 Parallelism: 2
CPU count: 4 Parallelism: 3
CPU count: 5 Parallelism: 4
CPU count: 6 Parallelism: 4
CPU count: 7 Parallelism: 5
CPU count: 8 Parallelism: 6
CPU count: 9 Parallelism: 7
CPU count: 10 Parallelism: 8
CPU count: 11 Parallelism: 8
CPU count: 12 Parallelism: 9
CPU count: 13 Parallelism: 10
CPU count: 14 Parallelism: 11
CPU count: 15 Parallelism: 12
CPU count: 16 Parallelism: 12
CPU count: 17 Parallelism: 13
CPU count: 18 Parallelism: 14
CPU count: 19 Parallelism: 15
CPU count: 20 Parallelism: 16
CPU count: 21 Parallelism: 16
CPU count: 22 Parallelism: 17
CPU count: 23 Parallelism: 18
CPU count: 24 Parallelism: 19
CPU count: 25 Parallelism: 20
CPU count: 26 Parallelism: 20
CPU count: 27 Parallelism: 21
CPU count: 28 Parallelism: 22
CPU count: 29 Parallelism: 23
CPU count: 30 Parallelism: 24
CPU count: 31 Parallelism: 24
CPU count: 32 Parallelism: 25
CPU count: 33 Parallelism: 26
CPU count: 34 Parallelism: 27
CPU count: 35 Parallelism: 28
CPU count: 36 Parallelism: 28
CPU count: 37 Parallelism: 29
CPU count: 38 Parallelism: 30
CPU count: 39 Parallelism: 31
CPU count: 40 Parallelism: 32
CPU count: 41 Parallelism: 32
CPU count: 42 Parallelism: 33
CPU count: 43 Parallelism: 34
CPU count: 44 Parallelism: 35
CPU count: 45 Parallelism: 36
CPU count: 46 Parallelism: 36
CPU count: 47 Parallelism: 37
CPU count: 48 Parallelism: 38
CPU count: 49 Parallelism: 39
CPU count: 50 Parallelism: 40
CPU count: 51 Parallelism: 40
CPU count: 52 Parallelism: 41
CPU count: 53 Parallelism: 42
CPU count: 54 Parallelism: 43
CPU count: 55 Parallelism: 44
CPU count: 56 Parallelism: 44
CPU count: 57 Parallelism: 45
CPU count: 58 Parallelism: 46
CPU count: 59 Parallelism: 47
CPU count: 60 Parallelism: 48
CPU count: 61 Parallelism: 48
CPU count: 62 Parallelism: 49
CPU count: 63 Parallelism: 50
CPU count: 64 Parallelism: 51

@coolljt0725
Contributor

ping @olljanat, a unit test failed

@olljanat
Contributor Author

Hmm. Looks like RS1 failed because of a build server issue. Is there some way to re-run just that one?

@thaJeztah
Member

Ah, yes, the new Jenkins does not (yet) have an option to trigger using labels 😞 (also, it's all or nothing, so we can't re-run a single stage because Jenkins declarative pipelines don't support that)

I see tests passed on RS1, so it should be OK to ignore:


[2019-09-10T18:55:19.464Z] OK: 524 passed, 548 skipped
[2019-09-10T18:55:19.464Z] PASS

Everything else is green

@coolljt0725
Contributor

@thaJeztah

Hmm. Looks like RS1 failed because of a build server issue. Is there some way to re-run just that one?

can you re-run it?

Member

We should change this function to return an int64 (it looks like we always convert the value):

moby/daemon/daemon.go

Lines 223 to 227 in 32688a4

parallelLimit := adjustParallelLimit(len(dir), 128*runtime.NumCPU())
// Re-used for all parallel startup jobs.
var group sync.WaitGroup
sem := semaphore.NewWeighted(int64(parallelLimit))

moby/daemon/daemon.go

Lines 576 to 579 in 32688a4

parallelLimit := adjustParallelLimit(len(daemon.List()), 128*runtime.NumCPU())
var group sync.WaitGroup
sem := semaphore.NewWeighted(int64(parallelLimit))

(doesn't have to be done in this PR, but just noticed it)
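
A minimal sketch, assuming the current signature, of what that follow-up could look like (placeholder body; the real per-platform adjustment logic is unchanged):

package daemon

// adjustParallelLimit returning int64 directly, as suggested above, so call sites
// can pass the result straight to semaphore.NewWeighted without a conversion.
// The body here is a placeholder for the existing platform-specific logic.
func adjustParallelLimit(n int, limit int) int64 {
	_ = n // object count, used by the platform-specific implementations
	return int64(limit)
}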

Contributor Author

I suggest that this be done in another PR.

@thaJeztah
Member

@coolljt0725 I can re-run it from Jenkins, but (see my comment above) it actually passed. I also left a comment/question on the code change (so let me wait for @olljanat to reply first) 🤗

@olljanat olljanat force-pushed the win-restore-no-parallelism branch from f976dfe to 464876c Compare September 15, 2019 09:25
thaJeztah
thaJeztah previously approved these changes Sep 16, 2019
Member
@thaJeztah thaJeztah left a comment

LGTM, thanks!

@thaJeztah
Member

oh! I think it's missing an import;

daemon/daemon_windows.go:1::warning: file is not goimported (goimports)

@thaJeztah thaJeztah dismissed their stale review September 16, 2019 07:18

see above

@olljanat olljanat force-pushed the win-restore-no-parallelism branch from 464876c to 7a9ab8d Compare September 16, 2019 16:43
@olljanat
Contributor Author

oh! I think it's missing an import;

daemon/daemon_windows.go:1::warning: file is not goimported (goimports)

😵 ... fixed

@olljanat
Contributor Author

@thaJeztah ping, is this OK now?

Member
@thaJeztah thaJeztah left a comment

LGTM

@olljanat olljanat force-pushed the win-restore-no-parallelism branch from 7a9ab8d to a0b9ba0 Compare November 11, 2019 11:30
Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
@olljanat olljanat force-pushed the win-restore-no-parallelism branch from a0b9ba0 to 447a840 Compare November 11, 2019 13:45
@olljanat
Contributor Author

Fixed a typo and rebased. Now it also passes the tests.

@coolljt0725
Contributor

ping @thaJeztah @tonistiigi @jterry75 PTAL

@jterry75
Contributor

This LGTM (not a maintainer)

@thaJeztah
Member

let's merge 👍

@thaJeztah thaJeztah merged commit c831882 into moby:master Nov 12, 2019
@olljanat
Contributor Author

@thaJeztah nice, and can we get this one into the cherry-pick process too?

@thaJeztah
Member

opened docker-archive#423
