Dynamic power management techniques in multi-core architectures: A survey study
K.M. Attia, M.A. El-Hosseini, H.A. Ali
Computers and Control Systems Engineering Department, Faculty of Engineering, Mansoura University, Mansoura, Egypt
KEYWORDS: Chip multiprocessors; Multi-core; Power management

Abstract: Multi-core processors support all modern electronic devices nowadays. However, power management is one of the most critical issues in the design of today's microprocessors. The goal of power management is to maximize performance within a given power budget. Power management techniques must balance the demanding need for higher performance/throughput against the impact of aggressive power consumption and negative thermal effects. Many techniques have been proposed in this area, and some of them have been implemented, such as the well-known DVFS technique, which is used in nearly all modern microprocessors. This paper explores the concepts of multi-core, trending research areas in the field of multi-core processors, and then concentrates on power management issues in multi-core architectures. The main objective of this paper is to survey and discuss the current power management techniques. Moreover, it proposes a new technique for power management in multi-core processors based on that survey.

© 2015 Faculty of Engineering, Ain Shams University. Production and hosting by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
* Corresponding author. Mobile: +20 1000736160.
E-mail addresses: khaled.m.attia@mans.edu.eg (K.M. Attia), melhosseini@mans.edu.eg (M.A. El-Hosseini), h_arafat_ali@mans.edu.eg (H.A. Ali).
Peer review under responsibility of Ain Shams University. Production and hosting by Elsevier.
http://dx.doi.org/10.1016/j.asej.2015.08.010
2090-4479

1. Introduction

The evolution of multi-core processors led to the evolution of many research areas. Before the appearance of multi-core processors, the speed of microprocessors increased exponentially over time. More speed requires more transistors. Moore [1] observed that the number of transistors doubles approximately every two years. With the rapid increase in speed, the number of transistors in processors increased in a way that can no longer scale according to Moore's law, as an extremely large number of transistors switching at very high frequencies means extremely high power consumption. Also, the need for parallelism increased, and instruction level parallelism [2] was not sufficient to serve the demanding parallel applications. So the concept of multi-core was introduced by Olukotun et al. [3]: design many simple cores on a single chip rather than designing one huge, complex core. Now all modern microprocessor designs are implemented in a multi-core fashion. Multi-core advantages can be summarized as follows:

- A chip multiprocessor consists of simple-to-design cores.
- Simple design leads to more power efficiency.
- High system performance in parallel applications where many threads need to run simultaneously.
Please cite this article in press as: Attia KM et al., Dynamic power management techniques in multi-core architectures: A survey study, Ain Shams Eng J (2015), http://dx.doi.org/10.1016/j.asej.2015.08.010
Introducing multi-core processors aroused many related areas of research. Dividing code into threads, each of which can run independently, is very important to make use of the power of the multi-core approach. However, not all code can be divided in such a manner. That issue was described by Amdahl in [4], which concludes that the maximum speedup is limited by the serial part; this is called the serial bottleneck. Serialized code reduces the performance expected from the processor; it also wastes lots of energy. Also, the parallel portion of the code is not completely parallel, for many reasons such as synchronization overhead, load imbalance and resource contention among cores. The serial bottleneck research led to the evolution of asymmetric multi-core processors [5].

The concept of asymmetric multi-core processors implies that the design would include one large core and many small cores. The serial part of the code is accelerated by moving it to the large core, while the parallel part is executed on the small cores. This accelerates both the serial part, by using the large core, and the parallel part, as it is executed simultaneously on the small cores and the large core to achieve high throughput. Using asymmetric cores can be more energy efficient too. In [5] Mark et al. described how asymmetry can be achieved. They divided it into static and dynamic methods. In static methods, cores may be designed at different frequencies, or a more complex core with a completely different micro-architecture may be designed. In dynamic methods, frequencies can be boosted dynamically on demand, or small cores may be combined to form a dynamic large core, as described in detail in [6]. Other research topics related to multi-core processors that emerged include the following: power management, memory hierarchies in multi-core processors, the design of interconnection networks in multi-core processors, heterogeneous computing in multi-core processors, reliability issues in multi-core processors and parallel programming techniques. In power management, the main objective is to reach the maximum performance of the processors without exceeding a given total power budget for the chip. There has been a lot of research on power management in chip multiprocessors. Here we are going to discuss most of those techniques [7] and some modern works that try to optimize the efficiency of these techniques. In this paper we examine all popular techniques in detail and how they work to minimize performance losses while saving power. We investigate the suitable technique for each case (workloads, available power budget, critical systems) and how to make these techniques even more suitable for their cases.

This paper makes the following contributions:

- Listing almost all the techniques used for power management in multi-core processors, discussing them in terms of advantages and disadvantages (performance loss, power saving, suitable cases) and providing a comparison between them.
- Examining some of the improvements added to each of these techniques to make them even better.
- Proposing a new adaptive control mechanism for power management in asymmetric multi-core processors.
- Suggesting further research to be done in some of the investigated techniques/scenarios.

The rest of this paper is organized as follows: Section 2 introduces the historical improvements in microprocessor design, explains how we have reached the multicore era and mentions the main issues associated with multicore processors. Section 3 focuses on the power management issue, showing the importance of handling such a problem and providing a proper problem formulation. It continues to explain almost all the current techniques used in the power management field in modern processors, showing the advantages and disadvantages of each one and the research done to try to solve each shortcoming. Section 4 proposes a new mechanism for power management in asymmetric multicore processors. Finally, we conclude in Section 5 by reviewing the most important ideas presented in the paper.

2. Background

The performance of microprocessors has increased exponentially over the years. Techniques have been devised to achieve parallelism, starting from pipelining, passing by super-scalar architectures, and finally the chip multiprocessors or multicore processors. Here we shed light on the various levels of parallelism and how successive technologies tried to exploit each level.

2.1. Levels of parallelism

Each of these techniques exploits some level of parallelism; the levels can be listed as follows:

(1) Instruction level parallelism

At this level, architectures make use of independent instructions (the operands of one instruction do not depend on the result of another one) that exist in the instruction stream, executing them concurrently.

(2) Basic block level

A block can be considered a set of instructions that ends with a branch. Modern architectures were able to exploit this level of parallelism among basic blocks with the help of advanced branch predictors.

(3) Loop iterations

Some types of loops work on independent data in each iteration of the loop. So it is possible in these loops to run different iterations concurrently, in superscalar architectures for example.

(4) Tasks

A task signifies an independent function extracted from one application. It can also be called a thread. Software developers have to divide their code into independent threads to make use of this level of parallelism in multiprocessor systems, where each thread can run independently on a dedicated core.

2.2. Advances in processor microarchitecture

Over the years, there have been many attempts to exploit better parallelism, as shown in Fig. 1; advances in architecture can be viewed as follows:
2.2.1. Single-cycle processor

This technique was used in very early microprocessors. The key concept is that the whole instruction is executed at once, in one clock cycle. Whenever an instruction has started to execute, all other instructions in the instruction stream have to wait until it fully finishes its execution. Of course, some instructions take lots of execution/waiting time, which affects the execution of other instructions and degrades overall system performance (see Fig. 2a).

2.2.2. Pipelining

Instead of executing the whole instruction at once, pipelining divides the single-cycle processor into many stages; in each stage, a portion of one instruction is executed concurrently with another portion of another instruction. For example, a three-stage pipelined processor means the single-cycle processor is divided into three stages, let them be, for example, FETCH OPERANDS, DECODE and EXECUTE. Then we can execute three instructions simultaneously. At clock cycle 3, the first instruction will be in the EXECUTE stage, while the second instruction will be in the DECODE stage and the third instruction will be in the FETCH OPERANDS stage. That obviously diminishes the drawback of long wait times for long instructions (see Fig. 2b). Pipelining exploits instruction level parallelism, where multiple instructions can be executed concurrently. On the other hand, pipelining introduces logic overhead in each stage of the pipeline. Also, some data dependency hazards occur when two dependent instructions are executed concurrently. However, many techniques were proposed to overcome such hazards.

Figure 2 Difference between (a) single-cycle processor and (b) pipelined processor.

2.2.3. Deep pipelining

The idea of deep pipelining [8] is to increase the number of pipeline stages significantly. It is obvious from the discussion of the pipelined processor that the more stages you add, the faster execution you get. That is, of course, valid only to a certain extent. Common pipelines have up to 20 stages. The number of stages is greatly limited by many factors such as the existing hazards and the logic overhead. As mentioned for the pipelined processor, many techniques have been devised to overcome the data dependency problem. These techniques include, but are not limited to, forwarding, stalling and register renaming.

2.2.4. Superscalar processor

One of the main bottlenecks in the pipelined processor design is that although many instructions can run in different phases at the same time, the pipeline can only be initiated with one instruction at a time. A superscalar processor is one that contains multiple copies of the whole datapath (including the ALU), which makes it possible to issue as many instructions as the
number of copies allows. Each instruction runs almost independently, as it has its own dedicated datapath. Superscalar processor concepts have always been combined with pipelined processor concepts to produce the pipelined superscalar processor, which was commonly used in the 1990s and early 2000s. The basic operation of a superscalar processor includes fetching and decoding a stream of instructions, branch prediction, figuring out whether there are any dependencies among instructions, and finally the distribution of instructions to different functional units to be issued [9]. It provides great enhancements in the overall performance/throughput of the system. However, not many instructions can run at the same time because of the dependency problem explained for pipelined processors. Moreover, the number of issued instructions is limited. Also, it introduces lots of hardware overhead, meaning larger areas and more power consumption.

2.2.5. OoO (Out-of-Order) processors

OoO processors look ahead across the instruction window to find independent instructions that can be executed immediately. This means instructions are not executed in the order they were written in. Once the operands of an instruction are available, the instruction is executed regardless of the sequence of the program. OoO processors solve the problem of dependencies introduced in the pipelined superscalar processor. However, they introduce additional hardware overhead and energy consumption for speculation.

2.2.6. Chip multiprocessors

Chip multiprocessors or multi-core processors exploit thread level parallelism efficiently. A process is a program currently in execution. Each process consists of one or more threads. For example, a server application would have at least two threads, one for listening to incoming connections and another for outgoing connections. No thread has to wait for the other to finish, as they execute concurrently. In traditional uniprocessor systems, multi-threading is not well utilized. The uniprocessor provides an illusion that threads run concurrently, but in fact a fast switch is done between threads of the same process (which is much faster than switching between processes). Multi-core architectures appeared to extract as much parallelism as possible from the thread level. In a multi-core processor, each thread runs independently on a dedicated core (real parallelism). Hence, great enhancements are made to the overall throughput of the system. However, many issues came up, such as the problem of designing the appropriate memory hierarchy, the data locality problem, the design of interconnection networks, maintaining the reliability and validity of the processor, and power management. In this paper, we discuss the power management issue in multi-core processors and the techniques proposed and used in that area.

3. Power management techniques

Power management has become a major issue in the design of multi-core chips. There are many negative effects that result from increasing power consumption, such as unstable thermal properties of the die, which in turn affect system performance; this makes the power consumption issue sometimes more important than speed. An important observation is that threads running on different cores do not need the same power all the time to execute at high performance. There are some waiting times, due to memory read/write operations for example, during which spending full processing power is unnecessary. So, to achieve a good balance between scalar performance/throughput performance and power, it is essential to dynamically vary the amount of power used for processing according to temporal analysis of the code's needs.

Developed power management techniques can be classified into two main categories: reactive and predictive. In reactive techniques, the technique reacts to performance changes in the workload. In other words, a workload may initially have states that need high performance, and others of I/O waits and low performance. When the state of the workload changes, the technique reacts to that change accordingly. However, there might be some lag between workload phase changes and power adaptation changes, which may lead to states of either inefficient energy consumption or performance degradation. On the other hand, predictive techniques, for example [10], overcome this issue. Those techniques predict phase changes in the workload before they happen, and hence act immediately before a program phase changes. That leads to optimal energy-saving and performance results. However,
there is no workload that can be fully predicted, so reactive techniques are used for the portions that cannot be predicted (which is usually more than 60% of the entire workload). So, reactive techniques are inevitable, and consequently we concentrate in this study on those techniques. Here, we examine some of the dynamic techniques, as shown in Fig. 3, to achieve the best level of power management in multi-core processors. We also discuss some issues related to each of these techniques and how previous research attempted to handle these issues.

The problem formulation can be viewed as follows: all techniques assume there is an on-chip or on-board hardware controller for power management which contains all the hardware and circuitry required for performing its job. The controller is always supported by some firmware and software that give directives for implementing the specific technique or algorithm. Fig. 4 shows a high-level view of the power management process assuming a global on-chip power management controller. The system-level controller directs the global on-chip controller toward a specific power budget. The global on-chip controller monitors power-performance statistics from all cores and accordingly takes the required action. That action depends on the algorithm/technique used (for example, change voltage as in DVFS, or cut off power to specific portions as in power-gating techniques).

Techniques can be evaluated in terms of power efficiency. A common metric for the evaluation of power efficiency is energy per instruction (EPI, in Watt/MIPS or Joule/instruction). Other metrics such as the energy delay product (EDP), which was initially proposed by Horowitz et al. [11], and ED2P are also used in latency-oriented architectures, as they assign a weight to the amount of time needed for an instruction to be processed. Obviously, techniques that achieve lower EPI are more energy-efficient. The main objective of almost all techniques is achieving high Instructions per Cycle (IPC) while maintaining low EPI. That balance is the main concern of almost all the research done on power management in microprocessors.

The power management process can be viewed as a closed-loop feedback control system. The power budget is considered the desired input coming from the system-level control system, and there is an on-chip or on-board controller that adjusts some parameters (such as voltage and frequency) based on the monitoring process (feedback) coming from the individual cores of the chip in a closed loop, and so on. Monitoring power consumption has been a hot research topic for many years. Any power-saving mechanism needs to monitor consumed power to guide its decisions. Mainly, Performance Monitoring Counters (PMCs) are used to obtain power models. Examples of research done on that point include [12–15]. This representation leads us to another point: since power management control systems can be viewed as feedback control systems, they have regions of instability, which in turn requires providing a guarding/security mechanism for power management; that is out of the scope of this paper. Fig. 5 illustrates that concept.
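The closed-loop view described above can be sketched in code. The sketch below is purely illustrative and is our own construction, not part of any surveyed controller: the power budget is the setpoint, per-core power readings (e.g., derived from PMC-based power models) are the feedback, and per-core frequencies are the actuated parameters. The function name, thresholds and fixed frequency step are all assumptions.

```python
def control_step(power_budget_w, core_power_w, core_freqs_mhz,
                 f_min=800, f_max=3000, step=100):
    """One iteration of a hypothetical global on-chip controller:
    compare total measured power (feedback) against the power budget
    (setpoint) and nudge every core's frequency accordingly."""
    total = sum(core_power_w)
    new_freqs = []
    for f in core_freqs_mhz:
        if total > power_budget_w:            # over budget: slow down
            f = max(f_min, f - step)
        elif total < 0.9 * power_budget_w:    # headroom: speed up
            f = min(f_max, f + step)
        new_freqs.append(f)
    return new_freqs
```

A real controller would replace the fixed step with the policy of a specific technique (DVFS levels, power gating decisions, etc.); only the feedback-loop structure is the point here.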
voltage–frequency pairs within a set of discrete, predefined pairs to achieve the required power/performance level. In other words, for heavy parallel workloads, many cores run at a low voltage–frequency pair. However, for scalar workloads, which include a big portion of serial code, it is reasonable to run a few cores and boost their frequency to adapt to the required task. Fig. 6 explains this concept: the DVFS management system for a dual-core processor can be viewed graphically as in the figure.

Figure 6 High-level graphical view of DVFS applied to a dual-core processor.

The system-level controller directs the global on-chip or on-board controller with the desired power budget. The global controller monitors the voltage, frequency and IPC (power usage) of each core. Depending on these parameters, the controller actuates voltage and/or frequency as required. The same concept is applied through all power management techniques, as previously explained; the main difference lies in the algorithm itself. Machine learning algorithms (especially reinforcement learning) have recently been used to perform DVFS [17–21]. Using these techniques has led to even better results on both the performance and energy-saving metrics.

DVFS has not been used only in general purpose applications. It is widely applied in almost all modern processors in embedded systems [22]. Also, it can be used in real-time applications. For example, in [23] DVFS is used along with a checkpointing technique for consumed power reduction in reliability-guaranteed real-time applications. That study proves that, with the use of a backward fault recovery technique, DVFS can achieve the highest system reliability while consuming a minimal amount of energy.

3.1.2. Determination of the suitable voltage–frequency setting

As mentioned, the default ondemand Linux governor chooses voltage–frequency pairs from a set of predefined, discrete values. That is not very power efficient, as the required voltage–frequency pair may not be exactly one of the predefined values. Kamga et al. [24] proposed an approach for precise determination of the required frequency for the current workload. Kamga suggests a method to precisely determine the required frequency based on the high and low threshold frequencies and the number of occurrences of each of them. The method ends up with the required frequency

f_host = (f_high * t_high + f_low * t_low) / (t_high + t_low)    (2)

where f_host is the required frequency, f_high is the high threshold frequency, t_high is the number of occurrences of that frequency, and similarly for the low threshold.

3.1.3. DVFS levels of granularity

DVFS can be applied either per chip or per core. Applying DVFS per core introduces much flexibility, as each core has its own voltage–frequency pair. However, that comes at the expense of a large number of on-chip voltage regulators. On the other hand, applying DVFS at the chip level reduces that expense but limits flexibility, as the same voltage is applied to all cores regardless of the special needs of each individual core. It is extremely difficult to determine a single voltage–frequency setting that satisfies all cores' needs simultaneously. In [25] Kolpe et al. proposed an intermediate technique called "clustered DVFS" which clusters the cores into different DVFS domains and implements DVFS on a per-cluster basis. The algorithm of this approach can be summarized in three main steps: (1) find the optimal voltage/frequency setting for each core individually, (2) find similarities between cores (for example, cores with similar voltage/frequency settings from the first step over a certain number of clock cycles are considered similar) and cluster similar cores together, and finally (3) evaluate the solution by finding the optimal voltage/frequency setting for each cluster and comparing it with the actual setting of the cluster. This approach proved to give significant results compared to per-core DVFS, but it yields diminishing returns as the number of clusters increases.

3.1.4. Time to vary voltage and frequency

Scaling voltage and frequency incurs some latency while the voltage/frequency reaches the desired level. However, frequency scaling is much faster than voltage scaling. Consequently, the processor can end up in dangerous states where the current voltage cannot support the frequency. In these cases, hard faults would occur and cause the CPU to stop operating. Fig. 7 shows the relationship between voltage and frequency during DVFS. A boundary can be drawn to divide the voltage–frequency space into three areas: (1) the area above the boundary, which contains dangerous power states because the voltage cannot support the frequency, (2) the area under the boundary, which is not energy efficient, and (3) the boundary itself, which contains power-safe states. For example, if we want to scale from s0 to s3, the frequency will scale faster, which leads to reaching the dangerous state s2.

The traditional method to overcome this issue was to scale voltage first, stall the running application until the voltage scaling is done, and then scale frequency. It is very obvious that this method introduces lots of latency. Some research has been done to address the latency resulting from applying DVFS. Lai et al. [26] proposed an algorithm that reduces latency by avoiding unnecessary aggressive power state transitions. Also, Lai et al. [27] proposed the Retroactive Frequency Scaling (RFS) technique, which suggests not stalling the execution of the application during voltage scaling, but running it at the previous frequency setting until voltage scaling is done. Although that eliminates much of the latency, it comes at the cost of running at a power-inefficient state during voltage scaling.
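Two of the DVFS building blocks discussed in this section can be sketched in code: the frequency-determination formula of Eq. (2), and the safe ordering of voltage and frequency changes from the voltage–frequency boundary discussion. The sketch is illustrative only; the function names and the callback-style actuation interface are our assumptions, and real DVFS drivers expose different interfaces.

```python
def required_frequency(f_high, t_high, f_low, t_low):
    """Eq. (2) of Kamga et al. [24]: the required frequency is the
    occurrence-weighted average of the high and low threshold
    frequencies."""
    return (f_high * t_high + f_low * t_low) / (t_high + t_low)

def safe_scale(set_voltage, set_frequency, target_v, target_f, current_f):
    """Order the two actuations so the core never runs at a frequency
    the present voltage cannot support, i.e. never crosses above the
    safe boundary of Fig. 7."""
    if target_f > current_f:
        set_voltage(target_v)      # speeding up: raise voltage first
        set_frequency(target_f)
    else:
        set_frequency(target_f)    # slowing down: drop frequency first
        set_voltage(target_v)
```

Note that `safe_scale` encodes the conservative, stalling approach described above; RFS [27] instead keeps the application running at the previous frequency while the voltage settles.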
idea of thread motion is to have small cores, running at two highly parallel workload, it is power efficient to run many
different levels of voltage–frequency levels. When applications cores with little speculation on each core. In scalar workloads,
are executed, the algorithm decides which core has the best it’s advisable to run a few cores with as much speculation as
voltage–frequency setting to execute that application and possible.
moves it to that core instead of changing voltage–frequency
pair for that core which introduced more latency. Thread 3.5.2. Limitations of speculation control
Motion enables applications to migrate to cores with higher Regarding limitations of this technique, it is not very useful in
or lower voltage/frequency settings depending on the current cases of parallel workloads. Parallel workloads do not suffer a
workload of the program. For example, if one application lot from mis-speculated instructions. Also, latency introduced
could benefit from a higher voltage/frequency setting on some by the pipeline degrades performance significantly.
core while the application on that core is stalled for I/O oper-
ation for example, thread motion swaps the two applications 3.6. Core fusion
between the cores.
3.6.1. Basic concept
3.3.2. Limitations of thread motion
Core Fusion [6] is re-configurable chip multiprocessor archi-
The limitation of this technique is that it was proposed for tecture that starts with small simple cores which are able to
simple, homogeneous cores and it’s also limited to dynamically fuse into a larger core to support scalar perfor-
power-constrained multi-core systems. The results show that mance when needed. It neither requires special programming
it provides up to 20% better performance than coarse- effort nor specialized compiler support. Core Fusion can
grained DVFS. accommodate to software diversity and variations of work-
loads. When the workload is extremely parallel, distribute
3.4. Variable size cores the workload among the simple cores. When the workload is
heavily scalar, the simple cores dynamically fuse into a larger,
3.4.1. Basic concept more powerful single core. Full details of hardware implemen-
The basic idea is to design a complex, large core that is able to tation of this architecture can be found in [6]. Many re-
degrade later into a small core [44]. This can be done by dynamically disabling execution units and even pipeline stages. The idea is based on the classic power gating [45] technique. Power gating algorithms typically operate by turning off a resource once it has been idle for a specified number of clock cycles. For highly scalar workloads (low parallelism), a few fully-operating cores should run to sustain scalar performance. When dealing with highly parallel workloads, however, it is more power- and throughput-efficient to run many cores using fewer resources on each core.

3.4.2. Limitations of power-gating
Power-gating (and consequently variable-size cores) has some serious limitations:

3.4.2.1. Mis-prediction. As noted above, power-gating algorithms turn off a resource that has been reported idle for a specified number of clock cycles. The controller may therefore turn off a resource just before the application needs it again, giving negative power savings and degrading performance significantly.

3.4.2.2. Small power savings. While turning off portions of the system does reduce power consumption, the impact is modest: the savings from this technique are small compared to those of techniques such as DVFS.

3.5. Speculation control

3.5.1. Basic concept
Some energy is wasted on mis-speculated instructions, for example instructions fetched after a mis-predicted branch. The results of a mis-speculated instruction are likely to be discarded, but energy has been spent executing it anyway. Speculation reduction suggests that, in cases of low branch-prediction confidence, the processor should throttle speculation so that fewer mis-speculated instructions enter the pipeline.

Several reconfigurable architectures used Core Fusion as the foundation [46–48].

3.6.2. Limitations of core fusion
Limitations of Core Fusion according to [49] include that the fused large core consumes a lot of power and is slower than a traditional out-of-order core because of the additional latencies among the pipeline stages of the fused core. Also, mode switching between the small cores and the fused core comes at the cost of flushing the instruction cache and moving data between caches.

4. Proposed technique

Based on the above discussion, and referring to the comparison provided by Table 1, we propose a technique that we believe will provide the best balance between power reduction and overall performance/throughput. The technique would combine clustered DVFS [25] with Retroactive Frequency Scaling (RFS) [27] in an asymmetric [5] many-type multicore processor [42], scheduling critical-section threads using the scoring mechanism [38]. The power gating technique [44] may also be used in cases of very low CPU utilization. The whole technique will be governed by an adaptive control mechanism which decides, based on many parameters (workload style, current core utilization, available amount of parallelism, current performance, etc.), how to apply the technique efficiently.

For example, when the initial workload is highly parallel, the small cores' frequency will be fixed while the frequency of the large cores will be scaled down, and all cores will be used to execute that parallel code. If the workload contains a lot of sequential code, the large cores will be used at maximum frequency. Our technique is currently subject to further research, and validating it using simulation is our future work.
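The adaptive control mechanism described above can be sketched as a simple decision routine. The parameter names, thresholds, and frequency settings below are our own illustrative assumptions; the paper leaves the concrete policy to future work.

```python
# Sketch of the proposed adaptive controller (assumed thresholds and
# frequency levels; the paper does not fix them).

def decide(parallelism, utilization):
    """Map workload characteristics to per-cluster frequency settings.
    parallelism, utilization: fractions in [0, 1]."""
    if utilization < 0.1:
        # Very low CPU utilization: fall back to power gating [44].
        return {"small": "gated", "large": "gated-except-one"}
    if parallelism > 0.7:
        # Highly parallel workload: keep the small cores at their fixed
        # frequency, scale the large cores down, and use all cores.
        return {"small": "fixed", "large": "scaled-down"}
    # Mostly sequential code: run the large cores at maximum frequency.
    return {"small": "fixed", "large": "max"}

print(decide(parallelism=0.9, utilization=0.6))
# → {'small': 'fixed', 'large': 'scaled-down'}
```

In a real controller this decision would be re-evaluated periodically as the monitored parameters change, which is what makes the mechanism adaptive.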
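Similarly, the idle-timeout policy that power gating [44,45] relies on, and that the proposed technique falls back on at very low utilization, can be simulated in a few lines. The threshold, wake penalty, and activity trace are illustrative assumptions; the wake penalty models the mis-prediction cost discussed in Section 3.4.2.1.

```python
# Sketch of an idle-timeout power-gating policy (illustrative only).
# A unit is gated off after `threshold` consecutive idle cycles; waking
# a gated unit costs `wake_penalty` cycles.

def simulate_power_gating(activity, threshold=4, wake_penalty=2):
    """activity: list of booleans, True = unit used that cycle.
    Returns (cycles_gated, penalty_cycles_paid)."""
    idle_run = 0
    gated = False
    cycles_gated = 0
    penalties = 0
    for used in activity:
        if used:
            if gated:
                penalties += wake_penalty  # woke a gated unit: perf. loss
                gated = False
            idle_run = 0
        else:
            idle_run += 1
            if not gated and idle_run >= threshold:
                gated = True               # turn the unit off
            if gated:
                cycles_gated += 1
    return cycles_gated, penalties

# A bursty trace: the long idle stretch rewards gating, while the short
# gap right before reuse shows the negative case (penalty, little saving).
trace = [True] * 3 + [False] * 10 + [True] * 2 + [False] * 4 + [True]
print(simulate_power_gating(trace))
# → (8, 4)
```

The trade-off the controller must manage is visible in the output: the second idle stretch earns only one gated cycle yet still pays the full wake penalty.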
Please cite this article in press as: Attia KM et al., Dynamic power management techniques in multi-core architectures: A survey study, Ain Shams Eng J (2015), http://
dx.doi.org/10.1016/j.asej.2015.08.010
(Table 1, continued: comparison of the surveyed power management techniques, including Core Fusion and variable-size cores.)

5. Conclusion

In this paper, we explored the concepts of multi-core processors and the research trends in that field, and then focused on the power management issue. We reviewed most of the techniques in use, their advantages and disadvantages, and the research done for each technique to address its problems. Finally, we proposed a new technique that makes use of the gathered results. It is clear from the discussion that there is no absolutely perfect way to manage power in chip multiprocessors; the right choice depends on factors such as how much you can change in the architecture itself and how much performance you can sacrifice. Future work suggests the combination of compatible techniques.

References
[16] Weiser Mark et al. Scheduling for reduced CPU energy. Mobile computing. US: Springer; 1996, p. 449–71.
[17] Khan UA, Rinner B. Online learning of timeout policies for dynamic power management. ACM-TECS 2014;13(4):96.
[18] Das Anup et al. Reinforcement learning-based inter- and intra-application thermal optimization for lifetime improvement of multicore systems. In: 2014 51st ACM/EDAC/IEEE design automation conference (DAC). IEEE; 2014.
[19] Ye Rong, Xu Qiang. Learning-based power management for multicore processors via idle period manipulation. IEEE Trans Comput-Aided Des Integr Circ Syst 2014;33(7):1043–55.
[20] Shen Hao et al. Achieving autonomous power management using reinforcement learning. ACM Trans Des Autom Electr Syst (TODAES) 2013;18(2):24.
[21] Otoom Mwaffaq et al. Scalable and dynamic global power management for multicore chips. In: Proceedings of the 6th workshop on parallel programming and run-time management techniques for many-core architectures. ACM; 2015.
[22] Chao Seong Jin, Yun Seung Hyun, Jeon Jae Wook. A power-saving DVFS algorithm based on operational intensity for embedded systems. IEICE Electr Exp 2015.
[23] Li Zheng, Ren Shangping, Quan Gang. Energy minimization for reliability-guaranteed real-time applications using DVFS and checkpointing techniques. J Syst Architect 2015.
[24] Kamga Christine Mayap. CPU frequency emulation based on DVFS. ACM SIGOPS Operating Syst Rev 2013;47(3):34–41.
[25] Kolpe Tejaswini, Zhai Antonia, Sapatnekar Sachin S. Enabling improved power management in multicore processors through clustered DVFS. In: Design, automation & test in Europe conference & exhibition (DATE), 2011. IEEE; 2011.
[26] Lai Zhiquan et al. Latency-aware dynamic voltage and frequency scaling on many-core architectures for data-intensive applications. In: 2013 international conference on cloud computing and big data (CloudCom-Asia). IEEE; 2013.
[27] Lai Zhiquan, Zhao Baokang, Su Jinshu. Efficient DVFS to prevent hard faults for many-core architectures. Information and communication technology. Berlin, Heidelberg: Springer; 2014, p. 674–9.
[28] Greenhalgh Peter. big.LITTLE processing with ARM Cortex-A15 & Cortex-A7. ARM White Paper; 2011.
[29] Chung Hongsuk, Kang Munsik, Cho Hyun-Duk. Heterogeneous multi-processing solution of Exynos 5 Octa with ARM® big.LITTLE™ technology.
[30] Rajovic Nikola et al. Experiences with mobile processors for energy efficient HPC. In: Proceedings of the conference on design, automation and test in Europe. EDA Consortium; 2013.
[31] Kihm J, Guimbretière FV, Karl J, Manohar R. Using asymmetric cores to reduce power consumption for interactive devices with bi-stable displays. In: Proc 32nd annu ACM conf hum factors comput syst – CHI'14; 2014. p. 1059–62.
[32] Marowka A. Maximizing energy saving of dual-architecture processors using DVFS. J Supercomput 2014;68:1163–83.
[33] Imes Connor, Hoffmann Henry. Minimizing energy under performance constraints on embedded platforms; 2015. p. 12.
[34] Lakshminarayana Nagesh B, Lee Jaekyu, Kim Hyesoon. Age based scheduling for asymmetric multiprocessors. In: Proceedings of the conference on high performance computing networking, storage and analysis. ACM; 2009.
[35] Becchi Michela, Crowley Patrick. Dynamic thread assignment on heterogeneous multiprocessor architectures. In: Proceedings of the 3rd conference on computing frontiers. ACM; 2006.
[36] Srinivasan Sadagopan et al. HeteroScouts: hardware assist for OS scheduling in heterogeneous CMPs. In: Proceedings of the ACM SIGMETRICS joint international conference on measurement and modeling of computer systems. ACM; 2011.
[37] Koufaty David, Reddy Dheeraj, Hahn Scott. Bias scheduling in heterogeneous multi-core architectures. In: Proceedings of the 5th European conference on computer systems. ACM; 2010.
[38] Manakkadu Sheheeda, Dutta Sourav, Botros Nazeih M. Power aware parallel computing on asymmetric multiprocessor. In: 2014 27th IEEE international system-on-chip conference (SOCC). IEEE; 2014.
[39] Petrucci Vinicius et al. Energy-efficient thread assignment optimization for heterogeneous multicore systems. ACM Trans Embed Comput Syst (TECS) 2015;14(1):15.
[40] Wagner Harvey M. An integer linear-programming model for machine scheduling. Naval Res Logist Quart 1959;6(2):131–40.
[41] Somu Muthukaruppan T, Pathania A, Mitra T. Price theory based power management for heterogeneous multi-cores. In: Proc 19th int conf archit support program lang oper syst – ASPLOS'14; 2014. p. 161–76.
[42] Kumar Rakesh et al. A multi-core approach to addressing the energy-complexity problem in microprocessors. In: Workshop on complexity-effective design; 2003.
[43] Rangan Krishna K, Wei Gu-Yeon, Brooks David. Thread motion: fine-grained power management for multi-core systems. ACM SIGARCH Computer Architecture News, vol. 37(3). ACM; 2009.
[44] Efthymiou Aristides, Garside Jim D. Adaptive pipeline depth control for processor power-management. In: Proceedings 2002 IEEE international conference on computer design: VLSI in computers and processors. IEEE; 2002.
[45] Hu Z et al. Microarchitectural techniques for power-gating of execution units. In: Proc int'l symp on low power electronics and design (ISLPED); Aug. 2004.
[46] Boyer M, Tarjan D, Skadron K. Federation: boosting per-thread performance of throughput-oriented manycore architectures. ACM Trans Archit Code Optim (TACO); 2010.
[47] Pricopi M, Mitra T. Bahurupi: a polymorphic heterogeneous multi-core architecture. ACM TACO; January 2012.
[48] Gibson D, Wood DA. ForwardFlow: a scalable core for power-constrained CMPs. In: ISCA; 2010.
[49] Khubaib K et al. MorphCore: an energy-efficient microarchitecture for high performance ILP and high throughput TLP. In: 2012 45th annual IEEE/ACM international symposium on microarchitecture (MICRO). IEEE; 2012.

Khaled M. Attia is a teaching assistant at the Computers and Control Systems Engineering Department, Mansoura University. He received his B.Sc. in 2013 with an overall grade of excellent with honors from Mansoura University. His main research interests include computer architecture and organization, heterogeneous multi-core architectures, power-aware computing and heterogeneous parallel programming.

Mostafa A. El-Hosseini is an Assistant Professor at the Computers Engineering and Control Systems Dept., Faculty of Engineering, Mansoura University, Egypt. He received the B.Sc. from the Electronics Engineering Department, and the M.Sc. and Ph.D. from Computers & Systems Engineering, all from Mansoura University, Egypt. His major research interests are artificial intelligence techniques such as genetic algorithms, neural networks, particle swarm optimization, simulated annealing, and fuzzy logic. He is also interested in the application of AI in machine learning, image processing, access control and optimization. The application of computational intelligence (CI) and soft computing tools in bioinformatics is also one of his interests. He has served as a member of the international program committees of numerous international conferences.
Hesham Arafat Ali is a Prof. in Computer Eng. & Sys. and an assoc. Prof. in Info. Sys. and Computer Eng. He was an assistant prof. at the Univ. of Mansoura, Faculty of Computer Science, from 1997 up to 1999. From January 2000 up to September 2001 he joined the Department of Computer Science, University of Connecticut, as a Visiting Professor. From 2002 to 2004 he was a vice dean for student affairs at the Faculty of Computer Science and Inf., Univ. of Mansoura. He was awarded the Highly Commended Award from the Emerald Literati Club in 2002 for his research on network security. He is a founder member of the IEEE SMC Society Technical Committee on Enterprise Information Systems (EIS). He has many book chapters published by international press and about 150 published papers in international conferences and journals. He has served as a reviewer for many high-quality journals, including the Journal of Engineering, Mansoura University. His interests are in the areas of network security, mobile agents, network management, search engines, pattern recognition, distributed databases, and performance analysis.