sched/fair: Rewrite runnable load and utilization average tracking · bsd-unix/linux@9d89c25 · GitHub

Commit 9d89c25

ydu19 authored and Ingo Molnar committed
sched/fair: Rewrite runnable load and utilization average tracking
The idea of runnable load average (let runnable time contribute to weight) was proposed by Paul Turner and Ben Segall, and it is still followed by this rewrite. This rewrite aims to solve the following issues:

1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is updated at the granularity of one entity at a time, which results in the cfs_rq's load average being stale or partially updated: at any time, only one entity is up to date, while all other entities are effectively lagging behind. This is undesirable.

   To illustrate, if we have n runnable entities in the cfs_rq, as time elapses they certainly become outdated:

     t0: cfs_rq { e1_old, e2_old, ..., en_old }

   and when we update:

     t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old }
     t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old }
     ...

   We solve this by combining all runnable entities' load averages together in the cfs_rq's avg, and updating the cfs_rq's avg as a whole. This is based on the fact that, if we regard the update as a function, then:

     w * update(e) = update(w * e), and
     update(e1) + update(e2) = update(e1 + e2), therefore
     w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2)

   By this rewrite, we therefore have an entirely updated cfs_rq at the time we update it:

     t1: update cfs_rq { e1_new, e2_new, ..., en_new }
     t2: update cfs_rq { e1_new, e2_new, ..., en_new }
     ...

2. cfs_rq's load average differs between the top rq->cfs_rq and other task_groups' per-CPU cfs_rqs in whether or not blocked_load_avg contributes to the load.

   The basic idea behind runnable load average (the same holds for utilization) is that the blocked state is taken into account, as opposed to only accounting for the currently runnable state. Therefore, the average should include both the runnable/running and blocked load averages. This rewrite does that.

   In addition, we also combine the runnable/running and blocked averages of all entities into the cfs_rq's average, and update it together at once. This is based on the fact that:

     update(runnable) + update(blocked) = update(runnable + blocked)

   This significantly reduces the code, as we no longer need to separately maintain/update the runnable/running load and the blocked load.

3. How a task_group entity's share is calculated is complex and imprecise.

   We reduce the complexity in this rewrite to allow a very simple rule: the task_group's load_avg is aggregated from its per-CPU cfs_rqs' load_avgs, and a group entity's weight is simply proportional to its own cfs_rq's load_avg / task_group's load_avg. To illustrate, if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then:

     task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg

   and:

     cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share

To sum up, this rewrite is in principle equivalent to the current code, but fixes the issues described above. It turns out to significantly reduce code complexity and hence increase clarity and efficiency. In addition, the new averages are smoother/more continuous (no spurious spikes and valleys) and are updated more consistently and quickly to reflect the load dynamics. As a result, we have less load tracking overhead, better performance, and especially better power efficiency due to more balanced load.
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-3-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
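To make the aggregation argument concrete, here is a small userspace sketch (illustrative only, not kernel code; the decay() and group_share() helpers, the decay factor, and all numeric values are assumptions chosen for the example). It checks the linearity identity update(e1) + update(e2) = update(e1 + e2) that lets the cfs_rq keep a single combined average, and computes a group entity's share as cfs_rqx_avg / task_group_avg * task_group's shares, per point 3 above:

#include <assert.h>
#include <math.h>
#include <stdio.h>

/* Geometric decay over p periods, with y chosen so that y^32 = 0.5
 * (the same half-life the kernel's PELT tables encode). */
static double decay(double sum, unsigned int periods)
{
	const double y = pow(0.5, 1.0 / 32.0);
	return sum * pow(y, periods);
}

/* Group entity share: cfs_rqx_avg / task_group_avg * task_group's shares. */
static double group_share(double cfs_rqx_avg, double tg_avg, double tg_shares)
{
	return cfs_rqx_avg / tg_avg * tg_shares;
}

int main(void)
{
	double e1 = 3000.0, e2 = 1250.0;	/* two entities' load sums (made up) */
	unsigned int p = 7;			/* periods elapsed since the last update */

	/* update(e1) + update(e2) == update(e1 + e2): decaying the combined
	 * cfs_rq sum gives the same result as decaying each entity separately,
	 * so nothing is lost by folding all entities into one cfs_rq average. */
	assert(fabs((decay(e1, p) + decay(e2, p)) - decay(e1 + e2, p)) < 1e-9);

	/* A task group with two per-CPU cfs_rqs and 1024 total shares. */
	double rq1 = 600.0, rq2 = 200.0, shares = 1024.0;
	printf("cfs_rq1 entity share: %.1f\n", group_share(rq1, rq1 + rq2, shares)); /* 768.0 */
	return 0;
}

Link with -lm for pow(); the assertion holds because the decay is a plain multiplication by y^p, which distributes over the sum.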
1 parent cd126af commit 9d89c25

5 files changed: +249 -494 lines changed

include/linux/sched.h

Lines changed: 18 additions & 23 deletions

@@ -1175,29 +1175,24 @@ struct load_weight {
 	u32 inv_weight;
 };
 
+/*
+ * The load_avg/util_avg accumulates an infinite geometric series.
+ * 1) load_avg factors the amount of time that a sched_entity is
+ * runnable on a rq into its weight. For cfs_rq, it is the aggregated
+ * such weights of all runnable and blocked sched_entities.
+ * 2) util_avg factors frequency scaling into the amount of time
+ * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
+ * For cfs_rq, it is the aggregated such times of all runnable and
+ * blocked sched_entities.
+ * The 64 bit load_sum can:
+ * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with
+ * the highest weight (=88761) always runnable, we should not overflow
+ * 2) for entity, support any load.weight always runnable
+ */
 struct sched_avg {
-	u64 last_runnable_update;
-	s64 decay_count;
-	/*
-	 * utilization_avg_contrib describes the amount of time that a
-	 * sched_entity is running on a CPU. It is based on running_avg_sum
-	 * and is scaled in the range [0..SCHED_LOAD_SCALE].
-	 * load_avg_contrib described the amount of time that a sched_entity
-	 * is runnable on a rq. It is based on both runnable_avg_sum and the
-	 * weight of the task.
-	 */
-	unsigned long load_avg_contrib, utilization_avg_contrib;
-	/*
-	 * These sums represent an infinite geometric series and so are bound
-	 * above by 1024/(1-y). Thus we only need a u32 to store them for all
-	 * choices of y < 1-2^(-32)*1024.
-	 * running_avg_sum reflects the time that the sched_entity is
-	 * effectively running on the CPU.
-	 * runnable_avg_sum represents the amount of time a sched_entity is on
-	 * a runqueue which includes the running time that is monitored by
-	 * running_avg_sum.
-	 */
-	u32 runnable_avg_sum, avg_period, running_avg_sum;
+	u64 last_update_time, load_sum;
+	u32 util_sum, period_contrib;
+	unsigned long load_avg, util_avg;
 };
 
 #ifdef CONFIG_SCHEDSTATS
@@ -1263,7 +1258,7 @@ struct sched_entity {
 #endif
 
 #ifdef CONFIG_SMP
-	/* Per-entity load-tracking */
+	/* Per entity load average tracking */
 	struct sched_avg avg;
 #endif
 };
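The overflow bound stated in the new comment can be checked directly: with the saturated value of the 1024-unit decayed series, 47742 (LOAD_AVG_MAX in the scheduler), and the highest task weight, 88761 (a nice -20 task), a 64-bit load_sum accommodates roughly 2^64 / 47742 / 88761 ≈ 4353082796 always-runnable, maximum-weight entities. A quick userspace check of that arithmetic (the constants are re-declared here for illustration, not pulled from kernel headers):

#include <stdint.h>
#include <stdio.h>

#define LOAD_AVG_MAX	47742ULL	/* saturated value of the decayed 1024-unit series */
#define MAX_WEIGHT	88761ULL	/* load.weight of a nice -20 (highest weight) task */

int main(void)
{
	/* Entities that can be always runnable at maximum weight before a
	 * 64-bit load_sum overflows; matches the comment's 4353082796. */
	uint64_t entities = UINT64_MAX / LOAD_AVG_MAX / MAX_WEIGHT;

	printf("%llu\n", (unsigned long long)entities);
	return 0;
}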

kernel/sched/core.c

Lines changed: 0 additions & 3 deletions

@@ -2020,9 +2020,6 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.prev_sum_exec_runtime = 0;
 	p->se.nr_migrations = 0;
 	p->se.vruntime = 0;
-#ifdef CONFIG_SMP
-	p->se.avg.decay_count = 0;
-#endif
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_SCHEDSTATS

kernel/sched/debug.c

Lines changed: 17 additions & 24 deletions

@@ -88,12 +88,8 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 #endif
 	P(se->load.weight);
 #ifdef CONFIG_SMP
-	P(se->avg.runnable_avg_sum);
-	P(se->avg.running_avg_sum);
-	P(se->avg.avg_period);
-	P(se->avg.load_avg_contrib);
-	P(se->avg.utilization_avg_contrib);
-	P(se->avg.decay_count);
+	P(se->avg.load_avg);
+	P(se->avg.util_avg);
 #endif
 #undef PN
 #undef P
@@ -209,21 +205,19 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, " .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, " .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_SMP
-	SEQ_printf(m, " .%-30s: %ld\n", "runnable_load_avg",
-			cfs_rq->runnable_load_avg);
-	SEQ_printf(m, " .%-30s: %ld\n", "blocked_load_avg",
-			cfs_rq->blocked_load_avg);
-	SEQ_printf(m, " .%-30s: %ld\n", "utilization_load_avg",
-			cfs_rq->utilization_load_avg);
+	SEQ_printf(m, " .%-30s: %lu\n", "load_avg",
+			cfs_rq->avg.load_avg);
+	SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
+			cfs_rq->avg.util_avg);
+	SEQ_printf(m, " .%-30s: %ld\n", "removed_load_avg",
+			atomic_long_read(&cfs_rq->removed_load_avg));
+	SEQ_printf(m, " .%-30s: %ld\n", "removed_util_avg",
+			atomic_long_read(&cfs_rq->removed_util_avg));
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	SEQ_printf(m, " .%-30s: %ld\n", "tg_load_contrib",
-			cfs_rq->tg_load_contrib);
-	SEQ_printf(m, " .%-30s: %d\n", "tg_runnable_contrib",
-			cfs_rq->tg_runnable_contrib);
+	SEQ_printf(m, " .%-30s: %lu\n", "tg_load_avg_contrib",
+			cfs_rq->tg_load_avg_contrib);
 	SEQ_printf(m, " .%-30s: %ld\n", "tg_load_avg",
 			atomic_long_read(&cfs_rq->tg->load_avg));
-	SEQ_printf(m, " .%-30s: %d\n", "tg->runnable_avg",
-			atomic_read(&cfs_rq->tg->runnable_avg));
 #endif
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -631,12 +625,11 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 
 	P(se.load.weight);
 #ifdef CONFIG_SMP
-	P(se.avg.runnable_avg_sum);
-	P(se.avg.running_avg_sum);
-	P(se.avg.avg_period);
-	P(se.avg.load_avg_contrib);
-	P(se.avg.utilization_avg_contrib);
-	P(se.avg.decay_count);
+	P(se.avg.load_sum);
+	P(se.avg.util_sum);
+	P(se.avg.load_avg);
+	P(se.avg.util_avg);
+	P(se.avg.last_update_time);
 #endif
 	P(policy);
 	P(prio);
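For reference, the P() helper used in these hunks stringifies the field name and prints its value through SEQ_printf(); the exact format string varies per call site in debug.c, so the following is only a rough userspace approximation of how the new avg fields end up in the scheduler debug dump (the struct layout mirrors the new sched_avg above; the values are made up):

#include <stdio.h>

/* Rough approximation of debug.c's P() helper: stringify the expression
 * and print its value. The kernel uses SEQ_printf() and slightly different
 * format strings per call site; this is only a sketch. */
#define P(F) printf(" .%-30s: %lld\n", #F, (long long)(F))

struct sched_avg {
	unsigned long long last_update_time, load_sum;
	unsigned int util_sum, period_contrib;
	unsigned long load_avg, util_avg;
};

int main(void)
{
	struct sched_avg avg = { .load_avg = 512, .util_avg = 310 };	/* made-up values */

	P(avg.load_avg);	/* prints the field name left-justified, then its value */
	P(avg.util_avg);
	return 0;
}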
