0% found this document useful (0 votes)

13 views13 pages

Dynamic - Power - Management - Techniques - in - Multi-Core

Dynamic Power management techniques for multicore architecture in several usage

Uploaded by

vidyasagarbejjanki4

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

13 views13 pages

Dynamic - Power - Management - Techniques - in - Multi-Core

Dynamic Power management techniques for multicore architecture in several usage

Uploaded by

vidyasagarbejjanki4

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 13

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/282505535

Dynamic power management techniques in multi-core architectures: A

survey study

Article in Ain Shams Engineering Journal · October 2015

DOI: 10.1016/j.asej.2015.08.010

CITATIONS READS

44 5,050

3 authors, including:

Khaled Attia Hesham Arafat Ali

Mansoura University Mansoura University
1 PUBLICATION 44 CITATIONS 204 PUBLICATIONS 3,561 CITATIONS

SEE PROFILE SEE PROFILE

All content following this page was uploaded by Khaled Attia on 18 March 2016.

The user has requested enhancement of the downloaded file.

Ain Shams Engineering Journal (2015) xxx, xxx–xxx

Ain Shams University

Ain Shams Engineering Journal

www.elsevier.com/locate/asej
www.sciencedirect.com

Dynamic power management techniques

in multi-core architectures: A survey study
Khaled M. Attia *, Mostafa A. El-Hosseini, Hesham A. Ali

Computers and Control Systems Engineering Department, Faculty of Engineering, Mansoura University, Mansoura, Egypt

Received 10 May 2015; accepted 3 August 2015

KEYWORDS Abstract Multi-core processors support all modern electronic devices nowadays. However, power
Chip multiprocessors; management is one of the most critical issues in the design of today’s microprocessors. The goal of
Multi-core; power management is to maximize performance within a given power budget. Power management
Power management techniques must balance between the demanding needs for higher performance/throughput and the
impact of aggressive power consumption and negative thermal effects. Many techniques have been
proposed in this area, and some of them have been implemented such as the well-known DVFS
technique which is used in nearly all modern microprocessors. This paper explores the concepts
of multi-core, trending research areas in the ﬁeld of multi-core processors and then concentrates
on power management issues in multi-core architectures. The main objective of this paper is to sur-
vey and discuss the current power management techniques. Moreover, it proposes a new technique
for power management in multi-core processors based on that survey.
Ó 2015 Faculty of Engineering, Ain Shams University. Production and hosting by Elsevier B.V. This is an
open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction every two years. With the rapid increase in speed, the number
of transistors in processors increased in a way that it can’t scale
The evolution of multi-core processors led to the evolution of to Moore’s law anymore as an extremely huge number of tran-
many research areas. Before the appearance of multi-core pro- sistors switching at very high frequencies means extremely high
cessors, the speed of microprocessors increased exponentially power consumption. Also, the need for parallelism increased
over time. More speed requires more transistors. Moore [1] and the instruction level parallelism [2] was not sufﬁcient to
observed that the number of transistors doubles approximately provide the demanding parallel applications. So the concept
of multi-core was introduced by Olukotun et al. [3], to design
more simple cores on a single chip rather than designing a huge
* Corresponding author. Mobile: +20 1000736160. complex one. Now all modern microprocessor designs are
E-mail addresses: khaled.m.attia@mans.edu.eg (K.M. Attia), implemented in a multi-core fashion. Multi-core advantages
melhosseini@mans.edu.eg (M.A. El-Hosseini), h_arafat_ali@mans. can be summarized as follows:
edu.eg (H.A. Ali).
Peer review under responsibility of Ain Shams University.
A chip multiprocessor consists of simple-to-design cores
Simple design leads to more power efﬁciency
High system performance in parallel applications where
Production and hosting by Elsevier many threads need to run simultaneously

http://dx.doi.org/10.1016/j.asej.2015.08.010
2090-4479 Ó 2015 Faculty of Engineering, Ain Shams University. Production and hosting by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Please cite this article in press as: Attia KM et al., Dynamic power management techniques in multi-core architectures: A survey study, Ain Shams Eng J (2015), http://
dx.doi.org/10.1016/j.asej.2015.08.010
2 K.M. Attia et al.

Introducing multi-core processors aroused many related design, explains how we have reached the multicore era and
areas of research. Dividing code into threads, each can run mentions the main issues associated with multicore processors.
independently is very important to make use of the power of Section 3 starts to focus on the power management issue,
the multi-core approach. However, not all code can be divided showing the importance of handling such a problem and pro-
in such a manner. That issue was described by Amdhal in [4] viding a proper problem formulation. It continues to explain
which concludes that maximum speedup is limited by the serial almost all the current techniques used in the power manage-
part and that is called the serial bottleneck. Serialized code ment field in modern processors, showing the advantages
reduces performance expected by the processor; it also wastes and disadvantages of each one and the research done to try
lots of energy. Also, the parallel portion of the code is not to solve each shortage. Section 4 proposes a new mechanism
completely parallel because of many reasons such as synchro- for power management in asymmetric multicore processors.
nization overhead, load imbalance and resources contention Finally, we conclude in Section 5 by reviewing the most impor-
among cores. The serial bottleneck research led to the evolu- tant ideas that were presented in the paper.
tion of asymmetric multi-core processors [5].
The concept of asymmetric multi-core processors implies 2. Background
that the design would include one large core and many small
cores. The serial part of the code will be accelerated by moving The performance of microprocessors has increased exponen-
it to the large core and the parallel part is executed on the small tially over years. Techniques have been devised to achieve paral-
cores. This accelerates both the serial part by using the large lelism, starting from pipelining, passing by super-scalar
core and the parallel part as it will be executed simultaneously architectures and finally the chip multiprocessors or multicore
on the small cores and the large core to achieve high through- processors. Here we shed light on the various levels of parallelism
put. Using asymmetric cores can be more energy efficient too. and how consequent technologies tried to exploit each level.
In [5] Mark et al. described how asymmetry can be achieved.
They divided it into static and dynamic methods. For static 2.1. Levels of parallelism
methods, cores may be designed at different frequencies or a
more complex core with completely different micro- Each one of these techniques exploits some levels of paral-
architecture may be designed. In dynamic methods, frequencies lelism which can be listed as follows:
can be boosted dynamically on demand or small cores may be
combined to form a dynamic large core and this is described in (1) Instruction level parallelism
detail in [6]. Other research topics related to multi-core proces-
sors that emerged include the following: power management, In this level, architectures make use of independent instruc-
memory hierarchies in multi-core processors, the design of tions (the operands of one instruction do not depend on the
interconnection networks in multi-core processors, heteroge- result of another one) that exist in the instructions streams
neous computing in multi-core processors, reliability issues in to execute them concurrently.
multi-core processors and parallel programming techniques.
In power management, the main objective is to reach the max- (2) Basic Block level
imum performance of the processors without exceeding a given
total power budget for the chip. There has been lots of research A block can be considered a set of instructions that end
on power management in chip multiprocessors. Here we are with a branch. Modern architectures were able to exploit this
going to discuss most of those techniques [7] and some modern level of parallelism among basic blocks with the help of
works that try to optimize the efficiency of these techniques. In advanced branch predictors
this paper we examine all popular techniques in detail and how
they work to minimize performance losses while saving power. (3) Loop iterations
We investigate the suitable technique for each case (workloads,
power budget available, critical systems) and how to make Some types of loops work on independent data in each iter-
these techniques even more suitable for their cases. ation of the loop. So, it is possible in these loops to run differ-
This paper makes the following contributions: ent iterations concurrently in superscalar architectures for
example.
Listing almost all the used techniques for power manage-
ment in multi-core processors, discussing them in terms of (4) Tasks
advantages and disadvantages (performance loss, power
saving, suitable cases) and providing a comparison between A task signifies an independent function extracted from one
them. application. It can also be called a thread. Software developers
Examining some of the improvements added to each of have to divide their code into independent threads to make use
these techniques to make them even better. of this level of parallelism in multiprocessors systems, where
Proposing a new adaptive control mechanism for power each thread can run independently on a dedicated core.
management in asymmetric multi-core processors.
Suggesting further research to be done in some of the inves- 2.2. Advances in processor microarchitecture
tigated techniques/scenarios.
Over the years, there have been many trials to exploit better
The rest of this paper is organized as follow: Section 2 parallelism as shown in Fig. 1; advances in architecture can
introduces the historical improvements in the microprocessor be viewed as follows:

Figure 1 Advances in microprocessor design over time.

2.2.1. Single-cycle processor hand, pipelining introduces logic overhead in each stage of
This technique was used in very early microprocessors. The the pipeline. Also, some data dependency hazards occur when
key concept is that the whole instruction is executed at once two dependent instructions are executed concurrently. How-
in one clock cycle. Whenever an instruction has started to exe- ever, many techniques were proposed to overcome such
cute, all other instructions in the instruction stream have to hazards.
wait until it fully finishes its execution. Of course, some
instructions take lots of execution/waiting time which affects 2.2.3. Deep pipelining
the execution of other instructions and degrades overall system The idea of deep pipelining [8] is to increase the number of
performance (see Fig. 2a). pipeline stages significantly. It is obvious from the discussion
about the pipelined processor that the more stages you add,
2.2.2. Pipelining the faster execution you get. That is of course valid to a certain
Instead of executing the whole instruction at once, pipelining extent. Common pipelines have up to 20 stages. The number of
divides the single-cycle processor into many stages; in each stages is greatly limited by many factors such as the existing
stage, a portion of the instruction is executed concurrently hazards and the logic overhead. As we mentioned before in
with another portion of another instruction. For example, if the pipelined processor, many techniques has been devised to
we have a three-stage pipelined processor that means the overcome the data dependency problem. These techniques
single-cycle processor is divided into three stages, let them include, but not limited to, forwarding, stalling and register
be, for example, FETCH OPERANDS, DECODE and EXE- renaming.
CUTE. Then, we can execute three instructions simultane-
ously. At clock cycle 3, the first instruction will be in the 2.2.4. Super scalar processor
EXECUTE stage, while the second instruction will be in One of the main bottlenecks in the pipelined processor design
the DECODE stage and the third instruction would be is that however many instructions can run at different phases
in the FETCH OPERANDS stage. That obviously diminishes in the same time, the pipeline can only be initiated with one
the drawback of long wait times in long instructions (see instruction. A superscalar processor is the one that contains
Fig. 2b). It exploits the instruction level parallelism where mul- multiple copies of the whole datapath (including the ALU)
tiple instructions can be executed concurrently. On the other which makes it possible to issue as many instructions as the

Figure 2 Difference between (a) single-cycle processor and (b) pipelined processor.

Figure 3 A tree diagram of the most common power management techniques.

number of the copies is. Each instruction runs almost indepen- multi-core processor, each thread runs independently on a ded-
dently as it has its own dedicated datapath. Superscalar pro- icated core (real parallelism). Hence, great enhancements are
cessors concepts have always been combined with pipelined made to the overall throughput of the system. However, many
processors concepts to introduce the pipelined superscalar pro- issues came up such as the problem of designing the appropri-
cessor which has been commonly used in the 1990s and early ate memory hierarchy, the data locality problem, the design of
2000s. The basic operation of a superscalar processor includes interconnection networks, maintaining the reliability and
fetching and decoding a stream of instructions, branch predic- validity of the processor and power management. In this
tion, figuring out whether there are any dependencies among paper, we are discussing the power management issue in
instructions and finally the distribution of instructions to dif- multi-core processors and the techniques proposed and used
ferent functional units to be issued [9]. It provides great in that area.
enhancements in the overall performance/throughput of the
system. However, not many instructions can run at the same 3. Power management techniques
time because of the dependency problem explained in the pipe-
lined processors. Moreover, the number of issued instructions Power management has become a major issue in the design of
is limited. Also, it introduces lots of hardware overhead mean- multi-core chips. There are many negative effects that result
ing larger areas and more power consumption. from increasing power consumption such as unstable thermal
properties of the die and hence affecting the system perfor-
2.2.5. OoO (Out-of-Order) processors mance which makes power consumption issue sometimes more
OoO processors look ahead across instruction window to find important than speed. An important observation is that
independent instructions that can be executed immediately. threads running on different cores do not need the same power
This means, instructions are not executed in the order they all time to execute at high performance. There are some wait-
were written in. Once the operands of an instruction are avail- ing times due to memory read/write operations for example
able, the instruction is executed regardless of the sequence of which require saving unnecessary processing power. So, to
the program. OoO processors solve the problem of dependen- achieve a good balance between scalar performance/through-
cies introduced in the pipelined superscalar processor. How- put performance and power it is essentially required to dynam-
ever, they introduce additional hardware overhead and ically vary the amount of power used for processing according
energy consumption for speculation. to temporal analysis of the code needs.
Developed power management techniques can be classified
2.2.6. Chip multiprocessors into two main categories: reactive and predictive. In reactive
Chip multiprocessors or multi-core processors exploit thread techniques, the technique reacts to performance changes in
level parallelism efficiently. A process is a program currently the workload. In other words, a workload may initially have
in execution. Each process consists of one or more threads. states that need high performance, others of I/O waits and
For example, a server application would have at least two low performance. When the state of the workload changes,
threads, one for listening to incoming connections and another the technique reacts to that change accordingly. However,
one for outgoing connections. No thread has to wait for the there might be some lag between workload phase changes
other to finish as they execute concurrently. In traditional and power adaptation changes which may lead to states of
uniprocessor systems, multi-threading is not well utilized. either in-efficient energy consumption or performance degra-
The uniprocessor provides an illusion that threads run concur- dation. On the other hand, predictive techniques, for example
rently but in fact a fast switch is done between threads of the [10], overcome this issue. Those techniques predict phase
same process (which is way faster than switching between pro- changes in the workload before they happen, and hence act
cesses). Multi-core architectures appeared to extract as much immediately before a program phase changes. That leads to
parallelism as possible from the thread level parallelism. In a optimal energy-saving and performance results. However,

there is no workload that can be fully predicted, so reactive Techniques can be evaluated in terms of power efficiency. A
techniques are used for portions that cannot be predicted common metric for the evaluation of power efficiency is energy
(which is usually more than 60% of the entire workload). per instruction (EPI in Watt/MIPS or Joule/Instruction).
So, reactive techniques are inevitable to use and consequently Other metrics such as energy delay product (EDP), which
we concentrate in this study on those techniques. Here, we are was initially proposed by Horowitz et al. [11], and ED2P are
examining some of the dynamic techniques as shown in Fig. 3 used also in latency performance architectures as they assign
to achieve the best level of power management in multi-core a weight to the amount of time needed for an instruction to
processors. We also discuss some issues related to each of these be processed. Obviously, techniques that achieve lower EPI
techniques and how previous research attempted to handle are more energy-efficient. The main objective of almost all
these issues. techniques is achieving high Instruction per Cycle (IPC) while
Problem formulation can be viewed as follows: all maintaining low EPI. That balance is the main concern of
techniques assume there is an on-chip or on-board hardware almost all the research done on power management in
controller for power management which contains all the hard- microprocessors.
ware and circuitry required for performing its job. The con- The power management process can be viewed as a feed-
troller is always supported by some firmware and software back closed-loop control system. Power budget is considered
that give directives for implementing the specific technique or the desired input coming from the system-level control system.
algorithm. Fig. 4 shows a high-level view of the power manage- And there is an on-chip or on-board controller that adjusts
ment process assuming a global on-chip power management some parameters (such as voltage and frequency) based on
controller. The system-level controller directs the global on- the monitoring process (feedback) coming from the individual
chip controller toward a specific power budget. The global cores of the chip in a closed-loop and so on. Monitoring power
on-chip controller monitors power-performance statistics from consumption has been a hot research topic for many years. For
all cores and dependently takes the required action. That any powersaving mechanism, it needs to monitor consumed
action depends on the algorithm/technique used (for example, power to guide its decision. Mainly, Performance Monitoring
change voltage as in DVFS, cut-off power of specific portions Counters (PMCs) are used to obtain power models. Examples
as in power-gating techniques). of research done on that point are included [12–15] as exam-
ples. This representation leads us to another point which is
as follows: as power management control systems can be
viewed as feedback control system. That implies that they have
regions of instability which require in turn providing guarding/
security mechanism for power management which is out of the
scope of this paper. Fig. 5 illustrates that concept.

3.1. Dynamic voltage and frequency scaling

3.1.1. Basic concept

The idea of using dynamic voltage and frequency scaling in
power management in microprocessor systems was originally
invented by Weiser et al. in 1996 [16]. The power consumption
is mainly governed by the following equation:
P ¼ CV2 F ð1Þ
where P is the power, C is the switching capacitance, V is the
supplied voltage and F is the working frequency.
It’s very obvious that we can control the amount of con-
sumed power by simply adjusting voltage–frequency pairs.
This method has been widely used to achieve different Energy
per Instruction (EPI) ratios. It has been commercially intro-
Figure 4 High-level view of dynamic power management
duced under many names such as Intel’s SpeedStep
problem.
technology, AMD’s PowerNow! The main idea is to adjust

Figure 5 Feedback closed-loop representation of power management control system.

Please cite this article in press as: Attia KM et al., Dynamic power management techniques in multi-core architectures: A survey study, Ain Shams Eng J (2015), http://
dx.doi.org/10.1016/j.asej.2015.08.010
6 K.M. Attia et al.

voltage–frequency pairs within a set of discrete, predefined fhigh thigh þ ðflow tlow Þ
fhost ¼ ð2Þ
pairs to achieve the required power/performance level. In other thigh þ tlow
words, for heavy parallel workloads, many cores run at low
voltage–frequency pair. However, for scalar workloads which where fhost is the required frequency, fhigh is the high threshold
include a big portion of serial code, it is reasonable to run frequency, thigh is the number of occurrences of that frequency
few cores, and boost their frequency to adapt to the required and similarly for the low threshold.
task. Fig. 6 explains this concept. DVFS management system
for a dual-core processor can be viewed graphically as in 3.1.3. DVFS levels of granularity
figure. DVFS can be applied either per chip or per core. Applying
The system level controller directs the global on-chip or on- DVFS per core introduces much flexibility as each core would
board controller with the desired power budget. The global have its own voltage–frequency pair. However, that incurs at
controller monitors voltage, frequency and IPC (power usage) the expense of a large number of on-chip voltage regulators.
of each core. Depending on these parameters, the controller On the other hand, applying DVFS on the chip level reduces
actuates voltage and/or frequency as required. The same that expense but limits flexibility as the same voltage would
concept is applied through all power management techniques be applied to all cores regardless of the special needs of each
as previously explained, and the main difference lies in the individual core. It is extremely difficult to determine a single
algorithm itself. Machine learning algorithms (especially rein- Voltage–Frequency setting that satisfies all cores needs simul-
forcement learning) have been recently used to perform DVFS taneously. In [25] Kolpe et al. proposed an intermediate
[17–21]. Using these techniques led to even better results on technique called ‘‘clustered DVFS” which clusters the cores
both the performance and energy saving metrics. into different DVFS domains and implements DVFS on a
DVFS has not been only used in general purpose applica- per-cluster basis. The algorithm of this approach can be
tions. It is widely applied to almost all modern processors in summarized in three main steps: (1) find the optimal voltage/
embedded systems [22]. Also, it can be used in real-time frequency setting for each core individually, (2) find similarities
applications. For example in [23], DVFS is used along with a between cores (for example, the cores with similar voltage/fre-
checkpointing technique for consumed power reduction in quency setting from the first step over a certain number of
reliability-guaranteed real-time applications. That study clock cycles are similar) and cluster similar cores together
proves that with the use of backward fault recovery technique, and finally (3) evaluate the solution by finding the optimal
DVFS can achieve highest system reliability while consuming voltage/frequency setting for each cluster and compare it with
minimal amount of energy. the actual setting of the cluster. This approach proved to have
significant results, compared to per-core DVFS but it returns
3.1.2. Determination of the suitable voltage-frequency setting diminishing results when the number of clusters increases.
As mentioned, the default on demand linux governor chooses
voltage–frequency pairs from a set of predefined, discrete val- 3.1.4. Time to vary voltage and frequency
ues. That’s not very power efficient as the required voltage– Scaling voltage and frequency takes some latency to wait for
frequency pair may be not exactly one of the predefined values. the voltage/frequency reach the desired level. However, fre-
Kamga et al. [24] proposed an approach for precise determina- quency scaling is much faster than voltage scaling. Conse-
tion of the required frequency for current workload. Kamga quently, the processor can be in dangerous states where the
suggests a method to precisely determine the required fre- current voltage cannot support the frequency. In these cases,
quency based on the high and low threshold and number of hard faults would occur and cause the CPU to stop operating.
occurrences of each of them. The method ends up with the Fig. 7 shows the relationship between voltage and frequency
required frequency to be during DVFS. A boundary can be drawn to divide the volt-
age–frequency space into three areas: (1) area above the
boundary which contains dangerous power states because
the voltage cannot support the frequency, (2) area under the
boundary is not energy efficient, and (3) the boundary which
contains power-safe states. For example, if we want to scale
from s0 to s3, the frequency will scale faster which leads to
reaching the dangerous state s2.
The traditional method to overcome this issue was to scale
voltage first and stall the running application until the voltage
scaling is done, then scale frequency. It is very obvious that
this method introduces lots of latency. Some research has been
done to address the latency resulted from applying DVFS. Lai
et al. [26] proposed an algorithm that reduces latency by avoid-
ing unnecessary aggressive power states transitions. Also, Lai
et al. [27] proposed the Retroactive Frequency Scaling (RFS)
technique which suggests not stalling the execution of the
application during voltage scaling, but running it at the previ-
ous frequency setting until voltage scaling is done. Although
Figure 6 High-level graphical view of DVFS applied to a dual- that eliminates much of the latency, it comes to the cost of run-
core processor. ning at power inefficient state during voltage scaling.

workloads while the Cortex-A7 support the parallel work-

loads. Later in 2012, ARM announced the Cortex-A53 and
Cortex-A57 that implement the ARMv8-A instruction set
and are compatible to be combined together in a big.LITTLE
architecture. Later in 2014, ARM announced the Cortex-A17
which can be combined with Cortex-A12 in the big.LITTLE
architecture.
Samsung implemented ARM’s big.LITTLE architecture in
its octa-core processor Exynos 5 [29] which consists of 8 cores:
4 small, power-efficient Cortex-A7 cores and 4 big, high-
performance Cortex-A15 cores. Also, NVidia implemented
its own heterogeneous 5-core multiprocessor Tegra3 [30] which
has 4 large, high-performance cores and one small, power-
efficient core. Asymmetric cores have been widely used recently
in most modern mobile and embedded systems [31] to reduce
power consumption of these devices. Using DVFS in asymmet-
ric cores is very popular and can be used in both single-ISA
and multiple-ISA heterogeneous architectures [32]. It is
Figure 7 Relationship between voltage and frequency during expected to see more architectures and implementations of
dynamic scaling. the heterogeneous multi-core architecture as they have proven
usability, accommodating to various, special purpose applica-
tions and of course power efficiency needed for almost all
3.1.5. Limitations of DVFS modern mobile and embedded devices.
We can summarize the limitations of DVFS from the above
discussion as follows: 3.2.2. Critical sections and thread scheduling
There are many issues associated with asymmetric cores. One
3.1.5.1. DVFS domain. Based on the level to which DVFS is important issue is thread scheduling, in other words, the prob-
applied, some issues are represented. If DVFS is applied per- lem of assigning the thread to the proper core that would be
core, that would mean higher cost and less scalability when most power/performance efficient for its workload. Different
applying it to a processor with a large number of cores. When hardware platforms require different scheduling/resource allo-
DVFS is applied per-chip, it would ignore the individual needs cation techniques as they differ in performance/energy trade-
of each core separately. Also, in the clustered solution, it off spaces. So, choosing the wrong technique may give inverse,
would return diminishing results when the number of clusters negative results of power management as demonstrated in [33].
increases to a certain extent. There has been lots of research on this topic, for example,
Lakshminaryana et al. [34] proposed a scheduling scheme
3.1.5.2. Large state transition delay. As we mentioned, voltage which schedules a task with longer remaining time to a faster
change takes a while to complete. That makes the transition core. Also, Becchi et al. [35] used Instruction per Cycle (IPC)
from one power state to another slow and sometimes danger- as a metric for the assignment of threads to cores. Srinivasan
ous as showed in the hard faults issue. Slow transition is a big et al. [36] used performance prediction model to predict appli-
problem as the required setting may also change in the period cation behavior on different cores and hence assign thread to
of transition which makes it useless. the proper core.
Some works used Last-Level Cache (LLC) misses as a met-
3.2. Asymmetric cores ric of scheduling threads as in [37]. LLC miss rate provides an
indication of the intensity of off-chip memory accesses which
3.2.1. Basic concept contribute with a big part to the overall latency and in turns
The concept of designing a heterogeneous, single-ISA multi- can be a good metric for identifying the nature of each core/
core processor [5] in which each core differs in performance thread. Manakkadu et al. [38] proposed a technique for iden-
and power consumption has been very useful in the field of tifying critical sections in threads based on a scoring mecha-
power management. As previously mentioned, two types of nism. Based on the score for each thread, best optimization
cores are designed: large (or complex) cores which is usually decisions can be made. Their study claims that by applying
out-of-order, superscalar, deep-pipelined core to support heav- asymmetric frequency scaling directed by the proposed metric,
ily scalar workloads and small (or simple) core which, on the 28.13% savings in average power consumption were achieved
other side, is usually in-order, scalar, short-pipelined core that with a maximum of 7.1% performance loss. The approach
participates in parallel workloads. used in that paper depends on the IPC for each thread in dif-
One of the most popular commercial implementations of ferent time intervals. It assumes that execution time is divided
asymmetric chip multiprocessor is ARM’s big.LITTLE archi- into equal time intervals. At each interval, each thread has a
tecture [28]. In 2011, ARM announced that in this architec- specific IPC. The score at a specific interval for a thread is
ture, out-of-order superscalar Cortex-A15 cores are found by dividing the thread’s IPC at that period by the sum-
combined with the simple, in-order Cortex-A7 cores as shown mation of all threads IPCs at the same period. The final score
in Fig. 8. They both implement the ARM v7A. The Cortex- of the thread is obtained by summing up all the thread’s scores
A15 support for high-performance, energy-hungry scalar at all intervals. The formula can be written as follows:

Figure 8 ARM big.LITTLE architecture.

X IPCt means it requires modiﬁcations in the kernel itself and also

P i t ð3Þ
t i IPCi
requires the modified version of the kernel to be re-built to
make use of the framework.
where is the IPC of the ith thread at interval t. The thread with
the highest score is the most critical thread among other 3.2.3. Many-type asymmetric cores
threads and hence speeding it up would reduce the overall exe-
In addition to using only two types of cores in asymmetric chip
cution time.
multiprocessor, Kumar et al. [42] proposed single-ISA hetero-
In [39], Petrucci et al. proposed an optimization approach
geneous multi-core architectures that assume a single chip con-
that includes an Integer Linear Programming model [40]
tains different kinds of cores (not only two) based on different
(rather than concentrating on a single performance metric)
power/performance requirements. The study decides which
and a scheme to dynamically solve the problem of thread-to-
core is best for a certain application execution and then ship
core assignment. It uses a regression model to predict at run-
the application to that core. It assumes example architecture
time how thread would perform on different types of cores
with five cores with different architectures and claims that ini-
available and hence map that thread to the core that suits it
tial results show a 3 power reduction in the cost of only 18%
best. Their stated results show an EDP improvement over
performance loss.
other scheduling techniques of 10–40% depending on the
workload. However, that approach works by reallocating a
3.2.4. Limitations of asymmetric cores
thread to another core when the performance profile changes.
This means that a thread may execute on the wrong core for a From the above discussion, it is obvious that asymmetric chip
while before its performance profile changes and hence allow- multiprocessor architectures suffer from some limitations. One
ing many inefficient power states. Also, the action of thread re- of the most important limitations is that the number of large
locating may degrade performance significantly especially in and small core is fixed at design time and cannot be modified.
the case of shared LLC as contention may exist. This reduces the flexibility of accommodating to the software
Moreover, frameworks for dynamic power management in diversity.
such asymmetric architectures were purposed. In [41] one is Another important limitation of asymmetric CMPs is the
presented. The framework proposed is price theory based that overhead and latency introduced during the process of thread
tries to exploit all energy saving opportunities. It employs migration between cores. Along with that latency, another
DVFS, load balancing and thread re-allocation to achieve as data locality issue is presented. Each core has one or more
high energy savings rates as possible. That study designed levels of private cache. When threads are migrated from one
and implemented that framework within Linux OS on ARM core to another, data must also be migrated from one cache
big.LITTLE architecture. We found that framework very to another or must be fetched again from main memory. That
promising as it uses more than a single power saving opportu- introduces another communication/delay overhead and limits
nity and it is distributed and scalable. However, it divides cores performance improvements.
into clusters. Each cluster have a specific V–F setting and all
cores within a cluster are constrained to be symmetric. This 3.3. Thread motion
reduces flexibility and wastes some good power saving oppor-
tunities. Also, the framework takes into consideration the pri- 3.3.1. Basic concept
ority of the task which is assumed to be assigned by the user. Based on [43], this technique was proposed to enhance the
That introduces some level of user overhead. Moreover, the well-known DVFS. It proved that 2 levels of voltage–fre-
framework is implemented within the linux kernel which quency domains are sufficient to improve performance. The

idea of thread motion is to have small cores, running at two highly parallel workload, it is power efficient to run many
different levels of voltage–frequency levels. When applications cores with little speculation on each core. In scalar workloads,
are executed, the algorithm decides which core has the best it’s advisable to run a few cores with as much speculation as
voltage–frequency setting to execute that application and possible.
moves it to that core instead of changing voltage–frequency
pair for that core which introduced more latency. Thread 3.5.2. Limitations of speculation control
Motion enables applications to migrate to cores with higher Regarding limitations of this technique, it is not very useful in
or lower voltage/frequency settings depending on the current cases of parallel workloads. Parallel workloads do not suffer a
workload of the program. For example, if one application lot from mis-speculated instructions. Also, latency introduced
could benefit from a higher voltage/frequency setting on some by the pipeline degrades performance significantly.
core while the application on that core is stalled for I/O oper-
ation for example, thread motion swaps the two applications 3.6. Core fusion
between the cores.
3.6.1. Basic concept
3.3.2. Limitations of thread motion
Core Fusion [6] is re-configurable chip multiprocessor archi-
The limitation of this technique is that it was proposed for tecture that starts with small simple cores which are able to
simple, homogeneous cores and it’s also limited to dynamically fuse into a larger core to support scalar perfor-
power-constrained multi-core systems. The results show that mance when needed. It neither requires special programming
it provides up to 20% better performance than coarse- effort nor specialized compiler support. Core Fusion can
grained DVFS. accommodate to software diversity and variations of work-
loads. When the workload is extremely parallel, distribute
3.4. Variable size cores the workload among the simple cores. When the workload is
heavily scalar, the simple cores dynamically fuse into a larger,
3.4.1. Basic concept more powerful single core. Full details of hardware implemen-
The basic idea is to design a complex, large core that is able to tation of this architecture can be found in [6]. Many re-
degrade later into a small core [44]. This can be done through configurable architectures used Core Fusion as the foundation
dynamically disabling execution units and even pipestages. [46–48].
This idea is based on the classic power gating [45] technique.
Power gating algorithms typically operate by turning off the 3.6.2. Limitations of core fusion
resource if it has been idle for a specified number of clock Limitations of Core Fusion according to [49] include that the
cycles. In cases of high scalar workloads (low parallelism), fused large core consumes lots of power and is slower than a
run a few cores that would fully-operate to support the scalar traditional out-of-order core because there are additional
performance. However, when dealing with highly parallel latencies among the pipeline stages of the fused core. Also,
workloads, it would be power/throughput performance effi- mode switching between small cores and fused core comes at
cient to run many cores using fewer resources on each core. the cost of flushing instruction cache and moving data between
caches.
3.4.2. Limitations of power-gating
It is obvious that power-gating (and consequently variable-size 4. Proposed technique
cores) has some serious limitations:
Based on the above discussion and referring to the comparison
3.4.2.1. Mis-prediction. As we viewed, power-gating algorithms provided by Table 1, we were able to propose a technique we
depend on turning off a resource that is reported idle for a think it will provide best balance between consumed power
specified number of clock cycles. Hence, the controller may reduction and overall performance/throughput. This technique
turn off some resource which was idle just before the applica- would make use of clustered DVFS [25] with Retroactive
tion needs that resource again, giving negative power savings Frequency Scaling (RFS) [27] in Asymmetric [5] Many-Type
and degrading performance significantly. Multicore Processor [42] which schedules critical sections
threads using the scoring mechanism [38]. Power gating tech-
3.4.2.2. Small power savings. While turning off some portions/ nique [44] may also be used in cases of very low CPU utiliza-
resources of the systems saves power consumption, it is not of tion. The technique will be controlled via an adaptive control
that great impact. Power savings resulting from this technique mechanism which decides based on many parameters
are very minimal compared to other techniques such as DVFS. (workload style, current cores utilization, available amount
of parallelism, current performance, e.g.) how to use that
3.5. Speculation control technique efficiently.
For example, when the initial workload is highly-parallel,
3.5.1. Basic concept small cores frequency will be fixed while the frequency of the
Some energy is wasted on mis-speculated instructions for large cores will be scaled down and all cores will be used to
example, instructions after a mis-predicted branch. The results execute that parallel code. If the workload contains lots of
of a mis-speculated instruction are more likely to be discarded sequential code, large cores will be used at maximum fre-
but energy has been wasted anyway to execute that instruction. quency. Our technique is currently subject to further research
Speculation reduction technique suggests that in cases of and validating it using simulation is our future work.

5. Conclusion

– more latencies in the

– cache ﬂushing and data

fused core than a tradi-
can dynamically fuse into a

software

– no extra programming

migration delay during

– excellent accommoda-
execute parallel code and
Having small cores that
In this paper we investigated the concept of multi-core proces-

tional OoO core.

mode switching
sors, research trends in that ﬁeld and focused on the power

– low area cost

effort needed
management issue. We reviewed most of the used techniques,

to
Core Fusion

diversity
large core their advantages and disadvantages and the research done
for each technique to address its problems. Finally, we pro-

tion
posed a new technique that makes use of the gathered results.
It is very clear from the discussion that there is no absolute
perfect way for power management in chip multiprocessor

– latency introduced by the

– not very useful in cases of
parallel workloads as they
do not suffer a lot from mis-
performance and energy terms.
architecture. It depends on whether or not you are open to
improvements to get further
Reduce speculation to save

evaluation of such hybrid,

speculated instructions
changes in the architecture itself, how much you can sacriﬁce
Future work suggests the

compatible techniques/

performance and the amount of workload you expect on your

better results in both
Speculation Control

chip. However, the combination of some techniques with

solutions made to improve those techniques is an excellent
choice to think of. For example, applying thread scoring in
pipeline many-type asymmetric cores seems very promising. Future
work suggests the evaluation of such hybrid, compatible
power

techniques/improvements to get further better results in both

performance and energy terms.
to

– can be used in con-

junction with other

– very low power saving

off may result in nega-

resources to be turned
Turning oﬀ/on resources
as needed to save power

tive power savings

References
when used alone
simple
Variable-size cores

– mis-prediction

[1] Moore Gordon E. Cramming more components onto integrated

techniques
implement

circuits, vol. 12; 1965.

[2] Wall DW. Limits of instruction-level parallelism. WRL Research
– very

Report 93/6. Digital Western Research Laboratory, Palo Alto

(CA); 1993.
[3] Olukotun Kunle et al. The case for a single-chip multiprocessor.
Comparison between different power management techniques in multi-core processors.

– effective results in power

– limited to homogenous
frequency domains, migrating

– fast change between differ-

ACM Sigplan Notices 1996;31(9):2–11.

savings (rely on DVFS)

[4] Amdahl Gene M. Validity of the single processor approach to

Having almost 2 voltage/

threads between cores of

achieving large scale computing capabilities. In: Proceedings of the

CMP architectures

April 18–20, 1967, spring joint computer conference. ACM; 1967.

ent VF settings

[5] Hill Mark D, Marty Michael R. Amdahl’s law in the multicore

diﬀerent domains
Thread Motion

era. IEEE Comput 2008;41(7):33–8.

[6] Ipek Engin et al. Core fusion: accommodating software diversity
in chip multiprocessors. ACM SIGARCH computer architecture
news, vol. 35(2). ACM; 2007.
[7] Grochowski Ed et al. Proceedings IEEE international conference
on computer design: VLSI in computers and processors, 2004.
ICCD 2004. IEEE; 2004.
cores is ﬁxed at design
parallel code execution and

– accommodates well to

– number of small/big

– choosing appropriate
a large core for scalar code

– latency introduced dur-

– data locality issue dur-

[8] Sprangle Eric, Doug Carmean. Increasing processor performance by

ing thread migration

scheduling/mapping
Having small cores for

results

implementing deeper pipelines. In: Proceedings 29th annual inter-

software diversity
Asymmetric Cores

national symposium on computer architecture, 2002. IEEE; 2002.

power savings

ing migration

[9] Smith James E, Sohi Gurindar S. The microarchitecture of

technique

superscalar processors. Proc IEEE 1995;83(12):1609–24.

– effective

[10] Bircher William Lloyd, John Lizy. Predictive power management

execution

time

for multi-core processors. In: Computer architecture. Berlin,

Heidelberg: Springer; 2012.
[11] Horowitz Mark, Indermaur Thomas, Gonzalez Ricardo. Low-
power digital design. In: Low power electronics, 1994. Digest of
– very effective power

– long transition time

savings with minimal

– easy to implement on

– the level of granular-

ity to which DVFS is
applied affects cost
performance/throughput

between power states

frequency according to

technical papers. IEEE; 1994.

and performance
Change voltage and

[12] Bertran Ramon et al. A systematic methodology to generate

decomposable and responsive power models for CMPs. IEEE
performance
degradation

many scales

Trans Comput 2013;62(7):1289–302.

the required

[13] Basmadjian Robert, de Meer Hermann. Evaluating and modeling

power consumption of multi-core processors. In: 2012 third
DVFS

international conference on future energy systems: where energy,

computing and communication meet (e-Energy). IEEE; 2012.
[14] Rethinagiri Santhosh Kumar et al. System-level power estimation
Shortcomings

tool for embedded processor based platforms. In: Proceedings of

Advantages

the 6th workshop on rapid simulation and performance evalua-

Table 1

tion: methods and tools. ACM; 2014.

Idea

[15] Walker Matthew J et al. Run-time power estimation for mobile ad

embedded asymmetric multi-core CPUs; 2015.

[16] Weiser Mark et al. Scheduling for reduced CPU energy. Mobile [38] Manakkadu Sheheeda, Dutta Sourav, Botros Nazeih M. Power aware
computing. US: Springer; 1996, p. 449–71. parallel computing on asymmetric multiprocessor. In: 2014 27th IEEE
[17] Khan UA, Rinner B. Online learning of timeout policies for international system-on-chip conference (SOCC). IEEE; 2014.
dynamic power management. ACM-TECS 2014;13(4):25 96. [39] Petrucci Vinicius et al. Energy-efficient thread assignment opti-
[18] Das Anup et al. Reinforcement learning-based inter-and intra- mization for heterogeneous multicore systems. ACM Trans
application thermal optimization for lifetime improvement of Embed Comput Syst (TECS) 2015;14.1:15.
multicore systems. In: 2014 51st ACM/EDAC/IEEE design [40] Wagner Harvey M. An integer linear-programming model for
automation conference (DAC). IEEE; 2014. machine scheduling. Naval Res Logist Quart 1959;6(2):131–40.
[19] Ye Rong, Xu Qiang. Learning-based power management for [41] Somu Muthukaruppan T, Pathania A, Mitra T. Price theory
multicore processors via idle period manipulation. IEEE Trans based power management for heterogeneous multi-cores. In: Proc
Comput-Aided Des Integr Circ Syst 2014;33(7):1043–55. 19th int conf archit support program lang oper syst – ASPLOS’14;
[20] Shen Hao et al. Achieving autonomous power management using 2014. p. 161–76.
reinforcement learning. ACM Trans Des Autom Electr Syst [42] Kumar Rakesh et al. A multi-core approach to addressing the
(TODAES) 2013;18(2):24. energy-complexity problem in microprocessors. In: Workshop on
[21] Otoom Mwaffaq et al. Scalable and dynamic global power complexity-effective design; 2003.
management for multicore chips. In: Proceedings of the 6th [43] Rangan Krishna K, Wei Gu-Yeon, Brooks David. Thread motion:
workshop on parallel programming and run-time management fine-grained power management for multi-core systems. ACM
techniques for many-core architectures. ACM; 2015. SIGARCH computer architecture news, vol. 37(3). ACM; 2009.
[22] Chao Seong Jin, Yun Seung Hyun, Jeon Jae Wook. A powersav- [44] Efthymiou Aristides, Garside Jim D. Adaptive pipeline depth
ing DVFS algorithm based on operational intensity for embedded control for processor power-management. In: Proceedings 2002
systems. IEICE Electr Exp 2015;0. IEEE international conference on computer design: VLSI in
[23] Li Zheng, Shangping Ren, Gang Quan. Energy minimization for computers and processors, 2002. IEEE; 2002.
reliability-guaranteed real-time applications using DVFS and [45] Hu Z et al. Microarchitectural techniques for power-gating of
checkpointing techniques. J Syst Architect 2015. execution units. In: Proc int’l symp on low power electronics and
[24] Kamga Christine Mayap. CPU frequency emulation based on design, ISLPED, Aug. 2004.
DVFS. ACM SIGOPS Operating Systs Rev 2013;47(3):34–41. [46] Boyer M, Tarjan D, Skadron K. Federation: boosting per-thread
[25] Kolpe Tejaswini, Zhai Antonia, Sapatnekar Sachin S. Enabling performance of throughput-oriented manycore architectures. In:
improved power management in multicore processors through ACM trans archit code optim (TACO); 2010.
clustered DVFS. In: Design, automation & test in europe [47] Pricopi M, Mitra T. Bahurupi: a polymorphic heterogeneous
conference & exhibition (DATE), 2011. IEEE; 2011. multi-core architecture. In: ACM TACO, January 2012.
[26] Lai Zhiquan et al. Latency-aware dynamic voltage and frequency [48] Gibson D, Wood DA. ForwardFlow: a scalable core for power-
scaling on many-core architectures for data-intensive applications. constrained CMPs, in ISCA; 2010.
In: 2013 international conference on cloud computing and big [49] Khubaib K et al. Morphcore: an energy-efficient microarchitec-
data (CloudCom-Asia). IEEE; 2013. ture for high performance ilp and high throughput tlp. In: 2012
[27] Lai Zhiquan, Zhao Baokang, Su Jinshu. Efficient DVFS to 45th annual IEEE/ACM international symposium on microar-
prevent hard faults for many-core architectures. Information and chitecture (MICRO). IEEE; 2012.
communication technology. Berlin, Heidelberg: Springer; 2014, p.
674–9.
[28] Greenhalgh Peter. Big.little processing with arm cortex-a15 & Khaled M. Attia is a teaching assistant at
cortex-a7. ARM White Paper; 2011. Computers and Control Systems Engineering
[29] Chung Hongsuk, Kang Munsik, Cho Hyun-Duk. Heterogeneous Department, Mansoura University. He
multi-processing solution of Exynos 5 Octa with ARMÒ big. received his B.Sc. in 2013 with an overall
LITTLETM Technology. grade of excellent with honors from Man-
[30] Rajovic Nikola et al. Experiences with mobile processors for soura University. His main research interests
energy efficient HPC. In: Proceedings of the conference on design, include Computer Architecture and Organi-
automation and test in Europe. EDA Consortium; 2013. zation, Heterogeneous Multi-Core Architec-
[31] Kihm J, Guimbretière FV, Karl J, Manohar R. Using asymmetric tures, Power-aware computing and
cores to reduce power consumption for interactive devices with bi- Heterogeneous Parallel Programming.
stable displays. In: Proc 32nd annu ACM conf hum factors
comput syst – CHI’14; 2014. p. 1059–62.
[32] Marowka A. Maximizing energy saving of dual-architecture
processors using DVFS. J Supercomput 2014;68:1163–83. Mostafa A. El-Hosseini is an Ass. Professor at
[33] Imes Connor, Hoffmann Henry. Minimizing energy under Computers Engineering and Control systems
performance constraints on embedded platforms; 2015. p. 12. Dept.––Faculty of Engineering––Mansoura
[34] Lakshminarayana Nagesh B, Lee Jaekyu, Kim Hyesoon. Age University, Egypt. He received the B.Sc.
based scheduling for asymmetric multiprocessors. In: Proceedings from the Electronics Engineering Department,
of the conference on high performance computing networking, M.Sc. and Ph.D. from Computers & Systems
storage and analysis. ACM; 2009. Engineering, all from Mansoura University,
[35] Becchi Michela, Crowley Patrick. Dynamic thread assignment on Egypt. His major research interests are Arti-
heterogeneous multiprocessor architectures. In: Proceedings of the ficial Intelligence such as Genetic Algorithms,
3rd conference on computing frontiers. ACM; 2006. Neural Networks, Particle Swarm Optimiza-
[36] Srinivasan Sadagopan et al. HeteroScouts: hardware assist for OS tion, Simulated Annealing, and Fuzzy Logic.
scheduling in heterogeneous CMPs. In: Proceedings of the ACM Also he is interested in the application of AI in Machine Learning,
SIGMETRICS joint international conference on measurement Image Processing, access control and Optimization. The Applications
and modeling of computer systems. ACM; 2011. of Computational Intelligence CI and Soft Computing tool in Bioin-
[37] Koufaty David, Reddy Dheeraj, Hahn Scott. Bias scheduling in formatics is also one of his interests. He served as a member of the
heterogeneous multi-core architectures. In: Proceedings of the 5th international program committees of numerous international
European conference on computer systems. ACM; 2010. conferences.

Hesham Arafat ali is a Prof. in Computer Eng. Club 2002 for his research on network security. He is a founder
& Sys. and an assoc. Prof. in Info. Sys. and member of the IEEE SMC Society Technical Committee on Enterprise
computer Eng. He was assistant prof. at the Information Systems (EIS). He has many book chapters published by
Univ. of Mansoura, Faculty of Computer international press and about 150 published papers in international
Science from 1997 up to 1999. From January (conf. and journal). He has served as a reviewer for many high quality
2000 up to September 2001, he joined as journals, including Journal of Engineering Mansoura University. His
Visiting Professor to the Department of interests are in the areas of network security, mobile agent, Network
Computer Science, University of Connecticut. management, Search engine, pattern recognition, distributed data-
From 2002 to 2004 he was a vice dean for bases, and performance analysis.
student affair the Faculty of Computer Sci-
ence and Inf., Univ. of Mansoura. He was
awarded with the Highly Commended Award from Emerald Literati

Multicore Processor Technology-Advantages and Challenges: Anil Sethi, Himanshu Kushwah
No ratings yet
Multicore Processor Technology-Advantages and Challenges: Anil Sethi, Himanshu Kushwah
3 pages
Seminar Report
50% (4)
Seminar Report
30 pages
What Is A Multicore Processor
No ratings yet
What Is A Multicore Processor
21 pages
A Survey On Parallel Multicore Computing Performan
No ratings yet
A Survey On Parallel Multicore Computing Performan
9 pages
Week 6 - Review On High Performance Energy Efficient Multicore Embedded Computing 1
No ratings yet
Week 6 - Review On High Performance Energy Efficient Multicore Embedded Computing 1
7 pages
Multi-Core Processor PDF
No ratings yet
Multi-Core Processor PDF
6 pages
Note 2
No ratings yet
Note 2
3 pages
Participants
No ratings yet
Participants
8 pages
Many Core Processor Architecture
No ratings yet
Many Core Processor Architecture
36 pages
Energy Efficient Multi-Core Processing: Electronics June 2014
No ratings yet
Energy Efficient Multi-Core Processing: Electronics June 2014
9 pages
Term Paper
No ratings yet
Term Paper
9 pages
Time Critical Multitasking For Multicore Microcontroller Using Xmos® Kit
No ratings yet
Time Critical Multitasking For Multicore Microcontroller Using Xmos® Kit
18 pages
Performance Analysis On Multicore Processors
No ratings yet
Performance Analysis On Multicore Processors
9 pages
20BCE2351 Micro Assignment-02
No ratings yet
20BCE2351 Micro Assignment-02
5 pages
20BCE2351 Micro Assignment-02
No ratings yet
20BCE2351 Micro Assignment-02
5 pages
A Seminar Report 474
No ratings yet
A Seminar Report 474
4 pages
Processors: by Nipun Sharma ID: 1411981520
No ratings yet
Processors: by Nipun Sharma ID: 1411981520
24 pages
Multi-Processor Embedded Systems
No ratings yet
Multi-Processor Embedded Systems
44 pages
Multicore Processor
100% (1)
Multicore Processor
23 pages
Optimized Multi-Threading To Balance Energy and Performance Efficiency
No ratings yet
Optimized Multi-Threading To Balance Energy and Performance Efficiency
6 pages
Winsem2022-23 Cse4001 Eth Vl2022230503160 Reference Material I 15-12-2022 1.4 Multi-Core Processor
No ratings yet
Winsem2022-23 Cse4001 Eth Vl2022230503160 Reference Material I 15-12-2022 1.4 Multi-Core Processor
34 pages
"Multicore Processors": A Seminar Report
No ratings yet
"Multicore Processors": A Seminar Report
11 pages
Multicore
No ratings yet
Multicore
3 pages
Multi-Core Processors: Concepts and Implementations
No ratings yet
Multi-Core Processors: Concepts and Implementations
10 pages
Hyper Threading Technology in Microprocessors
No ratings yet
Hyper Threading Technology in Microprocessors
3 pages
Lecture 1
No ratings yet
Lecture 1
37 pages
Mmi Activity
No ratings yet
Mmi Activity
2 pages
Optimizing Embedded Multicore CPUs
No ratings yet
Optimizing Embedded Multicore CPUs
36 pages
Multi Core System
No ratings yet
Multi Core System
9 pages
Multicore Processors Overview
No ratings yet
Multicore Processors Overview
12 pages
Logo - File 5 PDF
No ratings yet
Logo - File 5 PDF
6 pages
Multithreading, SMT and CMP
No ratings yet
Multithreading, SMT and CMP
7 pages
Chapter 5
No ratings yet
Chapter 5
18 pages
A Performance Analysis For Microprocessor Architec
No ratings yet
A Performance Analysis For Microprocessor Architec
6 pages
Part - B Unit - 5 Multiprocessors and Thread - Level Parallelism
No ratings yet
Part - B Unit - 5 Multiprocessors and Thread - Level Parallelism
20 pages
ACA Microprocessor and Thread Level Parallelism
No ratings yet
ACA Microprocessor and Thread Level Parallelism
41 pages
Regalado CaseStudy
No ratings yet
Regalado CaseStudy
2 pages
Ece 10 - Microprocessor and Microcontroller System and Design (Module 1)
No ratings yet
Ece 10 - Microprocessor and Microcontroller System and Design (Module 1)
20 pages
The Lecture Contains:: Lecture 15: Memory Consistency Models and Case Studies of Multi-Core
No ratings yet
The Lecture Contains:: Lecture 15: Memory Consistency Models and Case Studies of Multi-Core
9 pages
Multicore Processor Overview
No ratings yet
Multicore Processor Overview
19 pages
Ankit - Multi Core Processing
No ratings yet
Ankit - Multi Core Processing
29 pages
A Review of Architectures - Intel Single Core, Intel Dual Core and AMD Dual Core Processors and The Benefits
No ratings yet
A Review of Architectures - Intel Single Core, Intel Dual Core and AMD Dual Core Processors and The Benefits
10 pages
A Review of Architectures - Intel Single Core, Intel Dual Core and AMD Dual Core Processors and The Benefits
No ratings yet
A Review of Architectures - Intel Single Core, Intel Dual Core and AMD Dual Core Processors and The Benefits
10 pages
Multicore Embeddedfinal Revised
No ratings yet
Multicore Embeddedfinal Revised
9 pages
An Efficient Resource Utilization Technique For Scheduling Scientific Workload in Cloud Computing Environment
No ratings yet
An Efficient Resource Utilization Technique For Scheduling Scientific Workload in Cloud Computing Environment
12 pages
SSC Course 6 CPU
No ratings yet
SSC Course 6 CPU
17 pages
Dynamic
No ratings yet
Dynamic
11 pages
Chapter 1 Introduction
No ratings yet
Chapter 1 Introduction
24 pages
Power Aware Architecture
No ratings yet
Power Aware Architecture
46 pages
Multi-Core Processing: Advantages & Challenges
No ratings yet
Multi-Core Processing: Advantages & Challenges
35 pages
Eui Seong 2008
No ratings yet
Eui Seong 2008
13 pages
Many-Core vs. Many-Thread Machines: Stay Away From The Valley
No ratings yet
Many-Core vs. Many-Thread Machines: Stay Away From The Valley
4 pages
Power Management in Embedded Systems
No ratings yet
Power Management in Embedded Systems
26 pages
Ayushagrawal HPC
No ratings yet
Ayushagrawal HPC
17 pages
Multicore Computers
No ratings yet
Multicore Computers
21 pages
BCSE412L - Parallel Computing 03
No ratings yet
BCSE412L - Parallel Computing 03
11 pages
CC Unit 1
No ratings yet
CC Unit 1
24 pages
Term Paperterm Paper
No ratings yet
Term Paperterm Paper
5 pages
Internal and External Factor Analysis Matrix of A Hotel
No ratings yet
Internal and External Factor Analysis Matrix of A Hotel
11 pages
A Genetic Algorithm For Solving The Optimal Power Flow Problem
No ratings yet
A Genetic Algorithm For Solving The Optimal Power Flow Problem
14 pages
CG9 Q4-WS-1 Week1 Updated
No ratings yet
CG9 Q4-WS-1 Week1 Updated
2 pages
Predictive Digital Twin For Wind Energy Systems: A Literature Review
No ratings yet
Predictive Digital Twin For Wind Energy Systems: A Literature Review
36 pages
A15 Bionic Vs Snapdragon 7 Plus Gen 2 - Google Search
No ratings yet
A15 Bionic Vs Snapdragon 7 Plus Gen 2 - Google Search
1 page
How To Calculate Inverter Power Rating and Inverter Battery Backup Time Learning Electrical Engineering
No ratings yet
How To Calculate Inverter Power Rating and Inverter Battery Backup Time Learning Electrical Engineering
6 pages
Wheat Varieties Receival Standards
No ratings yet
Wheat Varieties Receival Standards
1 page
VLSI Design: Performance & Inverter
No ratings yet
VLSI Design: Performance & Inverter
39 pages
Vocational Teacher (Maintenance and Repairs of Automobiles)
No ratings yet
Vocational Teacher (Maintenance and Repairs of Automobiles)
7 pages
Queer Natives in Latin America
No ratings yet
Queer Natives in Latin America
82 pages
Electricity Thesis Statement
100% (3)
Electricity Thesis Statement
6 pages
AISD Paper 5
No ratings yet
AISD Paper 5
16 pages
3 Solution
No ratings yet
3 Solution
2 pages
Math 108b: Spectral Theorem Notes
No ratings yet
Math 108b: Spectral Theorem Notes
6 pages
Christmas Truth: Believe & Receive
No ratings yet
Christmas Truth: Believe & Receive
3 pages
03. 2020년 상하수도R&D 기술동향보고서
No ratings yet
03. 2020년 상하수도R&D 기술동향보고서
190 pages
Electronics Quiz for Students
No ratings yet
Electronics Quiz for Students
2 pages
Ap003 PDF
No ratings yet
Ap003 PDF
20 pages
SLOPES
No ratings yet
SLOPES
25 pages
Kia K2500 Service Manual
100% (4)
Kia K2500 Service Manual
262 pages
Reviwer Module 4
No ratings yet
Reviwer Module 4
2 pages
Fermented Frutis and Vegetables.
No ratings yet
Fermented Frutis and Vegetables.
13 pages
Energyefficient Wireless Sensor Networks 1st Edition Vidushi Sharma Download
100% (2)
Energyefficient Wireless Sensor Networks 1st Edition Vidushi Sharma Download
70 pages
BilcoFlex (R) ENG Brochure
No ratings yet
BilcoFlex (R) ENG Brochure
20 pages
Ultrasonic Sensor Specs & Features
No ratings yet
Ultrasonic Sensor Specs & Features
2 pages
Concrete Anti-Carbonation Protection
No ratings yet
Concrete Anti-Carbonation Protection
4 pages
R&D Expense Calculations Guide
No ratings yet
R&D Expense Calculations Guide
3 pages
3D Wall Panel Catalogue
No ratings yet
3D Wall Panel Catalogue
83 pages
11.2 Circles and Ellipses Conic Sections
No ratings yet
11.2 Circles and Ellipses Conic Sections
24 pages
9TH A-STAR - JEE-ADV (2015-P2) - WAT-06 - QP - EXAM DT - 21-07-2025
No ratings yet
9TH A-STAR - JEE-ADV (2015-P2) - WAT-06 - QP - EXAM DT - 21-07-2025
17 pages

Dynamic - Power - Management - Techniques - in - Multi-Core

Uploaded by

Dynamic - Power - Management - Techniques - in - Multi-Core

Uploaded by

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

Dynamic power management techniques in multi-core architectures: A

Article in Ain Shams Engineering Journal · October 2015

Khaled Attia Hesham Arafat Ali

SEE PROFILE SEE PROFILE

The user has requested enhancement of the downloaded file.

Ain Shams University

Ain Shams Engineering Journal

Dynamic power management techniques

Received 10 May 2015; accepted 3 August 2015

Figure 1 Advances in microprocessor design over time.

Figure 3 A tree diagram of the most common power management techniques.

3.1. Dynamic voltage and frequency scaling

3.1.1. Basic concept

Figure 5 Feedback closed-loop representation of power management control system.

workloads while the Cortex-A7 support the parallel work-

Figure 8 ARM big.LITTLE architecture.

X IPCt  means it requires modiﬁcations in the kernel itself and also

– more latencies in the

– cache ﬂushing and data

migration delay during

tional OoO core.

– low area cost

– latency introduced by the

evaluation of such hybrid,

performance and the amount of workload you expect on your

chip. However, the combination of some techniques with

techniques/improvements to get further better results in both

– can be used in con-

– very low power saving

off may result in nega-

tive power savings

[1] Moore Gordon E. Cramming more components onto integrated

circuits, vol. 12; 1965.

Report 93/6. Digital Western Research Laboratory, Palo Alto

– effective results in power

– fast change between differ-

ACM Sigplan Notices 1996;31(9):2–11.

[4] Amdahl Gene M. Validity of the single processor approach to

threads between cores of

achieving large scale computing capabilities. In: Proceedings of the

April 18–20, 1967, spring joint computer conference. ACM; 1967.

[5] Hill Mark D, Marty Michael R. Amdahl’s law in the multicore

era. IEEE Comput 2008;41(7):33–8.

– latency introduced dur-

– data locality issue dur-

[8] Sprangle Eric, Doug Carmean. Increasing processor performance by

implementing deeper pipelines. In: Proceedings 29th annual inter-

national symposium on computer architecture, 2002. IEEE; 2002.

[9] Smith James E, Sohi Gurindar S. The microarchitecture of

superscalar processors. Proc IEEE 1995;83(12):1609–24.

[10] Bircher William Lloyd, John Lizy. Predictive power management

for multi-core processors. In: Computer architecture. Berlin,

– long transition time

– the level of granular-

between power states

technical papers. IEEE; 1994.

[12] Bertran Ramon et al. A systematic methodology to generate

Trans Comput 2013;62(7):1289–302.

[13] Basmadjian Robert, de Meer Hermann. Evaluating and modeling

international conference on future energy systems: where energy,

tool for embedded processor based platforms. In: Proceedings of

the 6th workshop on rapid simulation and performance evalua-

tion: methods and tools. ACM; 2014.

[15] Walker Matthew J et al. Run-time power estimation for mobile ad

You might also like

X IPCt means it requires modiﬁcations in the kernel itself and also