Dynamic GPU Energy Optimization for Machine Learning Training Workloads
Abstract—GPUs are widely used to accelerate the training of machine learning workloads. As modern machine learning models
become increasingly larger, they require a longer time to train, leading to higher GPU energy consumption. This paper presents
GPOEO, an online GPU energy optimization framework for machine learning training workloads. GPOEO dynamically determines the
optimal energy configuration by employing novel techniques for online measurement, multi-objective prediction modeling, and search
optimization. To characterize the target workload behavior, GPOEO utilizes GPU performance counters. To reduce the performance
counter profiling overhead, it uses an analytical model to detect the training iteration change and only collects performance counter
data when an iteration shift is detected. GPOEO employs multi-objective models based on gradient boosting and a local search
algorithm to find a trade-off between execution time and energy consumption. We evaluate GPOEO by applying it to 71 machine
learning workloads from two AI benchmark suites running on an NVIDIA RTX3080Ti GPU. Compared with the NVIDIA default
scheduling strategy, GPOEO delivers a mean energy saving of 16.2% with a modest average execution time increase of 5.1%.
Index Terms—Dynamic energy optimization, online application iteration detection, multi-objective machine learning, GPU
1 INTRODUCTION
In recent years, deep neural networks (DNNs) have demonstrated breakthrough effectiveness in various tasks that were once deemed impossible [1], [2]. Training an effective DNN requires performing expensive training on a large number of samples. As the model training time increases, the energy consumption of machine learning training workloads also grows. This is a particular concern for high-performance GPUs because they are widely used to train DNN workloads but are less energy-efficient than their CPU counterparts. As a result, there is a critical need to reduce the energy consumption of GPUs when training machine learning workloads. Achieving this goal can reduce the maintenance cost and allow users to train larger models within the same budget.

Most of the existing studies in general GPU energy optimization are offline techniques [3]–[7]. These techniques require profiling applications ahead of time [3]–[5], [7] or analyzing their source code [6] in advance to collect application features. These approaches cannot adapt to changes in program behavior and can incur expensive profiling overhead. Other approaches change the internal working mechanisms of machine learning (ML) models [8], [9], but they are specific to the ML algorithms used and do not generalize to other ML methods.

Recently, efforts have been made to target online GPU power optimization. The work presented in [10] proposes a dynamic power management method for integrated APUs. ODPP [11] is another online dynamic power-performance management framework. ODPP detects the periods of applications to build prediction models. However, their models can only detect coarse-grained phase change patterns.

This work aims to provide a better dynamic GPU energy optimization method for iterative ML training workloads. We present GPOEO, a micro-intrusive¹ GPU online energy optimization framework. GPOEO only requires the user to mark the beginning and the end of the target code region. GPOEO can then automatically collect the relevant information to monitor the program behavior, identify the repeating workload patterns, and adjust the energy optimization scheme accordingly.

GPOEO trades execution time for energy efficiency. To this end, we employ multi-objective optimization models built on XGBoost [12]. As a departure from existing online GPU power optimization schemes, GPOEO leverages the fine-grained runtime information provided by the hardware performance counters to model the system and program behavior and build more accurate decision models. We use the predictive models as a cost function to search for the optimal energy configuration for the current workload pattern. These supporting methods enable us to develop a period detection algorithm based on the fast Fourier transform (FFT) and feature sequence similarity evaluation to identify if a changed training iteration takes place and adjust our optimization accordingly.

We evaluate GPOEO by applying it to 71 machine learning workloads from two AI benchmark suites [13], [16]. We test our approach on an NVIDIA RTX3080Ti GPU. Experimental results show that GPOEO can reduce GPU energy consumption by over 20% (16.2% on average), with most execution time increments less than 7% (5.1% on average) when compared to the NVIDIA default GPU scheduling strategy.

This paper makes the following contributions:

• F. Wang, W. Zhang, S. Lai, and M. Hao are with the School of Cyberspace Science at Harbin Institute of Technology, Harbin 150000, China. E-mail: {wangfarui,wzzhang}@hit.edu.cn
• Z. Wang is with the School of Computing at University of Leeds, United Kingdom. E-mail: z.wang5@leeds.ac.uk

1. Micro-intrusive means we require minor changes to the program source code.
2.2 Motivation

2.2.1 Energy Optimization Potential
The training phase of machine learning usually uses high-performance GPUs as accelerators. GPUs consume a lot of energy and time. To explore the energy-saving potential on GPUs, we run five applications in AIBench [13] and benchmarking-gnns [16] on all combinations of SM and memory clock frequencies and select the best configurations for minimal energy consumption within the slowdown constraint of 5%. Figure 1 shows the oracle results of energy saving, slowdown, and ED2P (Energy × Delay²) saving. The AI_FE, AI_S2T, and SBM_GIN are relatively compute-intensive and save significant energy (14.9%-22.4%). The CLB_MLP and TSP_GatedGCN are relatively memory-intensive and save considerable energy (18.0%-26.4%). These results highlight that both compute-intensive and memory-intensive ML training workloads have the chance to save energy with acceptable slowdown.

2.2.2 Make Energy Optimization Practical
Most existing studies on GPU energy optimization are offline and need domain knowledge and complex offline profiling. Thus, users or researchers can hardly apply these studies in practice. An ideal online GPU energy optimization system should be non-intrusive to applications and transparent to users. Majumdar's work [10] only applies to AMD APUs. Motivated by these drawbacks, we propose an online energy optimization framework called GPOEO. Users only need to insert the Begin and End APIs, and GPOEO automatically optimizes energy online.

2.2.3 Reduce Period Detection Error
Online sampling time series of power and utilization can reflect the periodic pattern. ODPP [11] uses the Fourier transform algorithm to detect periods, namely the execution time of one major iteration. However, according to our experimental results, the period error of ODPP [11] is quite large when the periodicity is not apparent. To solve this shortcoming, we propose a robust period detection algorithm. Figure 2 shows the absolute percentage errors of period detection with ODPP and our algorithm GPOEO under different SM clock frequencies on two ML applications. "Default" means the NVIDIA default scheduling strategy. The period detection errors of ODPP are significant and unstable, while the period detection errors of our GPOEO are less than 5% under all SM clock frequencies.

2.2.4 Using GPU Performance Counters
ODPP [11] uses power, SM utilization, and memory utilization as features to predict energy and execution time. However, these coarse-grained features are inadequate to give accurate predictions for some machine learning applications. We give an example in Figure 3. Applications in each pair have similar average power and utilization under the same reference clock frequency configuration, but their optimal SM clocks for ED2P are different (memory frequency fixed at 9,251 MHz). In light of this observation, we wish to use more fine-grained metrics to capture the subtle interactions between the application and the hardware. To this end, we use hardware performance counter metrics as features to build predictive models to estimate energy
and execution time. Later in the paper, we show that this strategy allows us to build more accurate prediction models, which in turn lead to better optimization decisions. The key challenges here are reducing the overhead of performance counter profiling and minimizing code modifications.

Fig. 3. ML applications with similar coarse-grained features (average power, GPU utilization, and memory utilization) and different optimal SM clocks for ED2P

TABLE 1
Description of Parameters

Notation                Description
$f_{obj}$               objective function
$\widehat{eng}_{SM}$    energy consumption predicted with the SM clock
$\widehat{time}_{SM}$   execution time predicted with the SM clock
$EngMdl_{SM}$           energy prediction model with the SM clock
$TimeMdl_{SM}$          time prediction model with the SM clock
$SMgear_i$              one SM gear represents an SM clock frequency
$w_{SM}$                the input vector containing the SM clock
$\widehat{eng}_{Mem}$   energy consumption predicted with the memory clock
$\widehat{time}_{Mem}$  execution time predicted with the memory clock
$EngMdl_{Mem}$          energy prediction model with the memory clock
$TimeMdl_{Mem}$         time prediction model with the memory clock
$Memgear_i$             one memory gear represents a memory clock frequency
$w_{Mem}$               the input vector containing the memory clock
$Feature$               feature vector measured under the reference clock configuration from the ML applications
3 OVERVIEW OF OUR APPROACH

This section defines the scope of the work and gives an overview of our approach. We formulate GPU energy optimization as two successive problems, first over the SM clock and then over the memory clock:

$$\begin{aligned}
\min\ & F_{SM} = f_{obj}\left(\widehat{eng}_{SM},\ \widehat{time}_{SM}\right) \\
\text{s.t.}\ & \widehat{eng}_{SM} = EngMdl_{SM}(w_{SM}), \\
& \widehat{time}_{SM} = TimeMdl_{SM}(w_{SM}), \\
& w_{SM} = \{SMgear_i,\ Feature\}, \\
& SMgear_i \in \{SMgear_1, \ldots, SMgear_n\}
\end{aligned} \tag{1}$$

$$\begin{aligned}
\min\ & F_{Mem} = f_{obj}\left(\widehat{eng}_{Mem},\ \widehat{time}_{Mem}\right) \\
\text{s.t.}\ & \widehat{eng}_{Mem} = EngMdl_{Mem}(w_{Mem}), \\
& \widehat{time}_{Mem} = TimeMdl_{Mem}(w_{Mem}), \\
& w_{Mem} = \{Memgear_i,\ Feature\}, \\
& Memgear_i \in \{Memgear_1, \ldots, Memgear_p\}
\end{aligned} \tag{2}$$

Table 1 lists the variables used in our problem formulation. We note that our approach can be applied to an arbitrary objective function. We select the energy consumption with an execution time increase constraint as our objective function to explicitly control the performance loss. Like [17], we assume that the search space of SM and memory clocks is convex, and we optimize the SM frequency and the memory frequency in order. We use four prediction models to estimate energy and time with different SM and memory clock frequencies. Then we select the SM and memory clock configuration that can best satisfy the optimization goal.

Fig. 4. Overview of GPOEO: the offline training stage (①, ②) and the online optimization stage (③–⑧)

GPOEO includes two stages, shown in Figure 4. In the offline training stage, we run representative benchmarks on each SM and memory clock frequency (①) to collect performance counter metrics on the reference SM and memory frequencies and energy-time data on all frequencies. Then we train the multi-objective models (②). In the online optimization stage, we measure energy and performance counter metrics in one detected period (③, ④), then predict the optimal SM and memory frequencies (⑤, ⑥). After that, we try frequencies around the predicted optimal configuration and compare the measured energy-time data to search for the actual optimal configuration (⑦). Finally, we set the actual optimal configuration and monitor the energy characteristics (⑧). If the fluctuation of the energy characteristics exceeds the threshold, another optimization process will start.
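As an illustration of how Equation (1) is applied once the models are trained, the sketch below picks the SM gear with the lowest predicted energy among gears that satisfy a 5% slowdown constraint. It is a minimal sketch, not the GPOEO implementation: the model objects, their predict interface, and the relative-value convention are our assumptions.

```python
# Minimal sketch of evaluating Equation (1); not the paper's code.
# eng_mdl_sm / time_mdl_sm stand in for EngMdl_SM / TimeMdl_SM and are
# assumed to predict energy and time relative to the reference clocks.
def best_sm_gear(sm_gears, feature, eng_mdl_sm, time_mdl_sm, max_slowdown=0.05):
    best_gear, best_eng = None, float("inf")
    for gear in sm_gears:
        w_sm = list(feature) + [gear]        # w_SM = {SMgear_i, Feature}
        eng = eng_mdl_sm.predict([w_sm])[0]  # predicted relative energy
        t = time_mdl_sm.predict([w_sm])[0]   # predicted relative execution time
        # f_obj: minimize energy subject to the execution time constraint
        if t <= 1.0 + max_slowdown and eng < best_eng:
            best_gear, best_eng = gear, eng
    return best_gear
```

The memory-clock problem in Equation (2) is evaluated in the same way over the memory gears.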
4 DESIGN AND IMPLEMENTATION

4.1 Robust Period Detection
Iterative ML workloads exhibit periodic behavior across iterations. We treat one period, namely one iteration, as the basic unit for online metric measurement and evaluation. Our approach does not rely on offline code instrumentation, which reduces developer involvement.

4.1.1 Fourier Transform: Get Candidate Periods
To detect the period change, we leverage the Fourier transform. This established methodology is widely used in signal processing to detect the dominant frequency and period. In our scenario, we assume the periodic feature of ML applications, such as GPU power and utilization, can be expressed as a time-domain function $F(t)$. Generally speaking, $F(t)$ is periodic and can be expressed as a linear combination of trigonometric series (known as a Fourier series) in exponential form:

$$F(t) = \sum_{n=-\infty}^{\infty} C_n e^{2\pi j n t / T_{iter}}$$

where $C_n$ are the Fourier coefficients. We can get a series of frequency components via the Fourier transform:

$$\mathcal{F}(\omega) = 2\pi \sum_{n=-\infty}^{\infty} C_n \,\delta\left(\omega - 2\pi n / T_{iter}\right)$$

The spectrum $\mathcal{F}(\omega)$ is a series of impulse functions (i.e., $\delta$) located at $2\pi n / T_{iter}$, with amplitudes proportional to $C_n$. These impulses represent the different frequency components in $F(t)$. The one with the largest amplitude is the major frequency component. The iterative period $T_{iter}$ of the ML application can be calculated as $T_{iter} = 1/f_{major}$, where $f_{major}$ is the frequency of the major frequency component.

Algorithm 1 Period calculation algorithm based on FFT and feature sequence similarity
Input: feature sampling sequence $Smp = \{s_1, \ldots, s_N\}$, sampling interval $T_s$
Output: iterative period $T_{iter}$, error of period $err$
1: $Freq, Ampl = \{freq_i\}, \{ampl_i\} = \mathrm{FFT}(Smp, T_s)$
2: $T = \{T_i\} = \{1/freq_i\}$
3: Find peaks in $Ampl$, getting the set of peaks $AmplPeak = \{amplpeak_m\}$ and the set of corresponding periods $TPeak = \{Tpeak_m\}$
4: $IndexCand = \{idxcand_k\} = \{m \mid amplpeak_m > c_{peak} \times \max(AmplPeak)\}$
5: $TCand = \{Tcand_p\} = \{Tpeak_{idxcand_k}\}$
6: for $Tcand_p \in TCand$ do
7:   $err_p = \mathrm{Algorithm2}(Tcand_p, Smp, T_s)$
8: end for
9: Find $err_{idxmin} = \min(\{err_p\})$
10: $Tcand_{opt} = Tcand_{idxmin}$
11: $N_T = T_s \times (N-1)/Tcand_{opt}$, $T_{low} = Tcand_{opt} \times (1 - 1/(N_T+1))$
12: $T_{up} = Tcand_{opt} \times (1 + 1/(N_T-1))$
13: Construct an arithmetic sequence $TLocal = \{Tlocal_q\}$, where $Tlocal_q = T_{low} + (q-1) \times T_{accu}$, $q = 1, \ldots, (T_{up} - T_{low})/T_{accu}$
14: for $Tlocal_q \in TLocal$ do
15:   $errlocal_q = \mathrm{Algorithm2}(Tlocal_q, Smp, T_s)$
16: end for
17: Find $err = err_{idxmin} = \min(\{errlocal_q\})$
18: $T_{iter} = Tlocal_{idxmin}$
19: return $T_{iter}$, $err$
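The FFT stage (lines 1-5 of Algorithm 1) is straightforward to reproduce. The sketch below is a minimal illustration with numpy and scipy rather than the GPOEO implementation; the function name and the 0.65 default for $c_{peak}$ (the paper gives the range 0.6-0.7) are our choices.

```python
import numpy as np
from scipy.signal import find_peaks

def candidate_periods(smp, ts, c_peak=0.65):
    """FFT-based candidate periods (Algorithm 1, lines 1-5).

    smp: uniformly sampled feature trace (e.g., GPU power);
    ts: sampling interval in seconds.
    """
    smp = np.asarray(smp, dtype=float) - np.mean(smp)  # drop the DC component
    ampl = np.abs(np.fft.rfft(smp))                    # one-sided amplitude spectrum
    freq = np.fft.rfftfreq(len(smp), d=ts)             # frequency of each bin (Hz)
    peaks, _ = find_peaks(ampl)                        # local maxima of the spectrum
    if len(peaks) == 0:
        return []
    strong = peaks[ampl[peaks] > c_peak * ampl[peaks].max()]
    return [1.0 / freq[m] for m in strong if freq[m] > 0]  # candidate periods (s)
```

Each candidate would then be scored with the feature sequence similarity algorithm described next.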
4.1.2 Feature Sequence Similarity: Find Accurate Periods
As mentioned in Section 2.2.3, the period detection results of the Fourier transform may have vast errors. Our period calculation algorithm (Algorithm 1) combines the fast Fourier transform (FFT) and feature sequence similarity (Algorithm 2) to reduce these errors. We find amplitude peaks and the corresponding periods $TPeak$ in the result of the FFT (lines 1-3). Then we select periods $Tpeak_m$ with relatively high peaks as candidate periods $TCand$ (lines 4-5). We set the peak coefficient $c_{peak}$ to 0.6-0.7 empirically. Then we use feature sequence similarity (Algorithm 2) to evaluate and select the best candidate period with minimal error. Finally, we perform a local search around the best candidate period to further improve accuracy.

We propose the feature sequence similarity algorithm (Algorithm 2) to evaluate candidate periods. We divide the feature sampling curve into several sub-curves, and the time duration of each sub-curve is equal to the candidate period (lines 1-4). The more similar the curves are, the closer the candidate period is to the actual period. So we evaluate the similarity of each pair of adjacent sub-curves (lines 5-18).

Each sub-curve is a time series, and calculating the Euclidean distance between corresponding points is a common method to measure the similarity of time series. However, in our scenario, the Euclidean distance is susceptible to high-frequency interference and reports the wrong similarity. Inspired by this, instead of calculating the distance of each pair of corresponding points, we group the sampling data within a series and evaluate the distance between the corresponding groups. Specifically, we use the Gaussian mixture model algorithm to cluster each sub-curve into $NumG$ groups (lines 8-11). Then, for each group, we calculate the relative average amplitudes (lines 12-13). The averaging operation can eliminate the influence of high-frequency interference. Later, we calculate the symmetric mean absolute percentage error (SMAPE) of the current group pair (line 14). Finally, we calculate the weighted average of all SMAPEs as the similarity error (lines 16-17). The weights are the numbers of sampling points in the groups. When all adjacent sub-curves are evaluated, we treat the mean similarity error as the error of $T_{iter}$ (line 19).
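The group-wise comparison at the core of Algorithm 2 (listed below) can be sketched as follows. This is an illustration under our assumptions, not the paper's code: we use scikit-learn's GaussianMixture for the Gauss step, assume the two adjacent sub-curves have equal length, and pick an arbitrary group count.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def smape(a, b):
    """Symmetric mean absolute percentage error of two scalars."""
    return abs(a - b) / ((abs(a) + abs(b)) / 2 + 1e-12)

def subcurve_error(prev, back, num_g=4):
    """Compare two adjacent, equal-length sub-curves group by group
    (Algorithm 2, lines 6-17) instead of point by point."""
    prev, back = np.asarray(prev, float), np.asarray(back, float)
    labels = GaussianMixture(n_components=num_g).fit_predict(prev.reshape(-1, 1))
    errs, weights = [], []
    for g in range(num_g):
        idx = np.where(labels == g)[0]             # sample indices of one group
        if len(idx) == 0:
            continue
        rel_prev = prev[idx].mean() - prev.mean()  # relative average amplitude
        rel_back = back[idx].mean() - back.mean()
        errs.append(smape(rel_prev, rel_back))     # SMAPE of the group pair
        weights.append(len(idx))                   # weight = samples in the group
    return np.average(errs, weights=weights)       # weighted similarity error
```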
Algorithm 2 Feature sequence similarity algorithm
Input: period to be evaluated $T_{iter}$, feature sampling sequence $Smp = \{s_1, \ldots, s_N\}$, sampling interval $T_s$
Output: error of period $Err_T$
1: $Num_T = \lfloor s_N / T_{iter} \rfloor$; $Num_s = \lfloor T_{iter} / T_s \rfloor$
2: for $i \in \{1, \ldots, Num_T\}$ do
3:   $Smp_i = \{s_{(i-1) \times Num_s + 1}, \ldots, s_{i \times Num_s}\} = \{ss_{i,j}\}$ $(j = 1, \ldots, Num_s)$
4: end for
5: for $i \in \{1, \ldots, Num_T - 1\}$ do
6:   $Mean_{prev} = \mathrm{mean}(Smp_i)$
7:   $Mean_{back} = \mathrm{mean}(Smp_{i+1})$
8:   $\{GaGrp_1, \ldots, GaGrp_{NumG}\} = \mathrm{Gauss}(Smp_i, NumG)$, where $GaGrp_j = \{idx_{j,1}, \ldots, idx_{j,Num_j}\}$
9:   for $j \in \{1, \ldots, NumG\}$ do
10:    $Grp_{prev} = \{ss_{i,idx_{j,1}}, \ldots, ss_{i,idx_{j,Num_j}}\}$
11:    $Grp_{back} = \{ss_{i+1,idx_{j,1}}, \ldots, ss_{i+1,idx_{j,Num_j}}\}$
12:    $RelValPrev_j = \mathrm{mean}(Grp_{prev}) - Mean_{prev}$
13:    $RelValBack_j = \mathrm{mean}(Grp_{back}) - Mean_{back}$
14:    $grperr_j = \mathrm{SMAPE}(RelValPrev_j, RelValBack_j)$
15:  end for
16:  $Weight = \{|GaGrp_1|, \ldots, |GaGrp_{NumG}|\}$
17:  $err_i = \mathrm{avg}(\{grperr_1, \ldots, grperr_{NumG}\}, Weight)$
18: end for
19: return $Err_T = \mathrm{mean}(\{err_1, \ldots, err_{Num_T - 1}\})$
Note: Gauss = the Gaussian mixture model clustering algorithm

Algorithm 3 Online robust period detection algorithm framework
Input: feature sampling sequence $Smp = \{s_1, \ldots, s_N\}$, sampling interval $T_s$
Output: iterative period $T_{iter}$, next sampling duration $SmpDur_{next}$
1: $T_{init}, err_{init} = \mathrm{Algorithm1}(Smp, T_s)$
2: $SmpDur = (N-1) \times T_s$
3: if $SmpDur < c_{measure} \times T_{init}$ then
4:   $T_{iter} = T_{init}$
5:   $SmpDur_{next} = c_{measure} \times T_{init} - SmpDur$
6: end if
7: $t_{start} = \max(0,\ SmpDur - (2 + c_{eval} \times step) \times T_{init})$
8: while $(SmpDur - t_{start})/T_{init} \geq c_{measure}$ do
9:   $i_{start} = 1 + \lfloor t_{start}/T_s \rfloor$
10:  $SubSmp = \{s_{i_{start}}, \ldots, s_N\}$
11:  $T_j, err_j = \mathrm{Algorithm1}(SubSmp, T_s)$
12:  $t_{start} = t_{start} + step \times T_{init}$
13: end while
14: $T = \{T_j\}$, $Err = \{err_j\}$
15: Find $\min(Err) = err_k$, then $T_{iter} = T_k$
16: $Diff = |(\max(T) - \min(T))/\mathrm{mean}(T)|$
17: if $Diff < Diff_{threshold}$ then
18:   $SmpDur_{next} = -1$
19: else
20:   $SmpDur_{next} = \lceil SmpDur/\max(T) \rceil \times \max(T) - SmpDur$
21: end if
22: return $T_{iter}$, $SmpDur_{next}$
4.1.3 Online Robust Period Detection Algorithm Framework
To get an accurate $T_{iter}$ dynamically, we design an online robust period detection algorithm framework (Algorithm 3). The framework calculates $T_{iter}$ in a rolling form while sampling the power and utilization of the GPU until $T_{iter}$ is stable. First, we get the initial period $T_{init}$ using Algorithm 1. If the sampling duration is shorter than $c_{measure}$ times $T_{init}$ ($c_{measure}$ is set to 2 empirically), it is too short to do a rolling calculation, so we calculate the required sampling duration, skip the rolling calculation, and return (lines 2-6). If the sampling duration is long enough, we get some updated samples by setting $t_{start}$ (line 7). Samples before $t_{start}$ are ignored because they may be outdated. We set $step = 0.5$ and $c_{eval} = 6.5$ according to our experiments. Then, the rolling period calculation begins. In each iteration, we calculate the period (lines 9-11) and increase $t_{start}$ to exclude more outdated samples for the next iteration (line 12). With the rolling period calculation, we get a sequence of periods with errors (line 14). If an ML application runs in the regular iteration phase, these periods should be similar. If the difference of these periods is less than the threshold, we set $SmpDur_{next} = -1$ to indicate that sampling should stop. Otherwise, we calculate $SmpDur_{next}$ (lines 16-21). The best period, namely $T_{iter}$, is the period with the minimal error (line 15). If $SmpDur_{next} > 0$, we ignore $T_{iter}$, keep sampling, and call Algorithm 3 again after $SmpDur_{next}$. Otherwise, we use $T_{iter}$ for feature measurement (Section 4.2).

Algorithm 4 Adaptive feature measurement
Input: features to be measured $FeatureName = \{name_1, \ldots, name_n\}$, the feature for iterative period detection $Feature_{dect}$
Output: feature data $Feature = \{feature_1, \ldots, feature_n\}$
1: Start measurement for $FeatureName$ and $Feature_{dect}$
2: $SmpDur_{next} = SmpDur_{init}$
3: while $SmpDur_{next} > 0$ do
4:   Delay $SmpDur_{next}$
5:   Collect the sampling sequence of $Feature_{dect}$
6:   $T_{iter}, SmpDur_{next} = \mathrm{Algorithm3}(Feature_{dect}, T_s)$
7: end while
8: Restart measurement for $FeatureName$
9: Delay $T_{iter}$, then stop measurement
10: Collect feature data $Feature$
11: return $Feature$

4.2 Micro-intrusive Online Feature Measurement
Based on the robust period detection, we propose the adaptive feature measurement algorithm (Algorithm 4). The feature measurement causes overhead and extends $T_{iter}$ compared with the normal run of the application. So, we adaptively detect $T_{iter}$ of applications while the measurement is running (lines 1-7). In line 5, we collect $Feature_{dect}$ after a delay of $SmpDur_{next}$. We use the sampling sequence of $Feature_{dect}$ to form the curve for the period detection, so we must sample $Feature_{dect}$ continuously and uniformly. On the NVIDIA GPU platform, the instantaneous power, SM utilization, and memory utilization can meet this requirement. However, the performance counters cannot meet this requirement because their minimum sampling granularity is a CUDA kernel. According to our experimental results, we use the composite feature of power, SM utilization, and memory utilization as $Feature_{dect}$, whose traces show more obvious periodicity. In line 6, we call Algorithm 3 to detect $T_{iter}$ and determine whether $T_{iter}$ is stable. If $SmpDur_{next} > 0$, we redo lines 4-6 with the new $SmpDur_{next}$. Otherwise, we consider $T_{iter}$ stable and restart the feature measurement (line 8). After a delay of $T_{iter}$, we stop the measurement and collect the feature data (lines 8-10).
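As an illustration of the uniform sampling that feeds the period detector, the sketch below polls power and utilization through NVML [14] via the pynvml bindings. The 50 ms interval, the trace length, and the plain sum used to form the composite feature are our assumptions; the paper does not give the exact composition.

```python
import time
import pynvml  # NVML Python bindings (package "nvidia-ml-py")

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
trace, ts = [], 0.05                      # assumed 50 ms sampling interval
for _ in range(2000):                     # assumed sampling window
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)        # .gpu/.memory in %
    trace.append(power_w + util.gpu + util.memory)  # one possible composite feature
    time.sleep(ts)
pynvml.nvmlShutdown()
# `trace` now plays the role of Smp in Algorithms 1-3.
```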
Fig. 5. The period detection errors of GPOEO and ODPP. GPOEO gives lower errors compared to ODPP.
XGBoost [12] is a machine learning model improved from the gradient tree boosting algorithm. Combined with our scenario, we apply XGBoost to build the four prediction models shown in Equations 1 and 2. The model construction processes are similar, so we take the energy consumption model $EngMdl_{SM}$ as an example to introduce these processes. For $EngMdl_{SM}$, XGBoost weights and sums several regression trees and can be defined as follows:

$$\widehat{eng}^{\,i}_{SM_j} = \sum_{k=1}^{K} f_k\left(w^{i}_{SM_j}\right), \quad f_k \in \mathcal{F}$$

where $\widehat{eng}^{\,i}_{SM_j}$ is the predicted energy consumption of the input vector $w^{i}_{SM_j}$, which contains the features of application $i$ and the SM gear $j$. $f_k$ represents the $k$-th regression tree, and $K$ is the number of regression trees. We use the training dataset $EngTr_{SM}$ to train the model $EngMdl_{SM}$. During training, we use the grid search method to tune hyper-parameters such as the minimum loss reduction, the maximum depth of a tree, the minimum sum of instance weight, and the maximum number of nodes to be added.
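A sketch of this training step using the xgboost package and scikit-learn's grid search is shown below. The synthetic data, grid values, and estimator settings are placeholders of ours; the tuned hyper-parameters correspond to the ones named above (gamma: minimum loss reduction; max_depth: maximum tree depth; min_child_weight: minimum sum of instance weight; max_leaves: maximum number of nodes to be added).

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Stand-in for EngTr_SM: each row is a performance-counter feature vector
# (measured at the reference clocks) plus an SM gear; the label is the
# energy measured at that gear. Random data only keeps the sketch runnable.
rng = np.random.default_rng(0)
X_train = rng.random((500, 16))          # 15 counter features + 1 gear column
y_train = rng.random(500)                # relative energy consumption labels

param_grid = {
    "gamma": [0, 0.1, 1.0],              # minimum loss reduction to make a split
    "max_depth": [4, 6, 8],              # maximum depth of a tree
    "min_child_weight": [1, 5, 10],      # minimum sum of instance weight
    "max_leaves": [0, 32, 64],           # maximum number of nodes to be added
}
search = GridSearchCV(
    xgb.XGBRegressor(n_estimators=200, tree_method="hist"),
    param_grid, scoring="neg_mean_squared_error", cv=3)
search.fit(X_train, y_train)
eng_mdl_sm = search.best_estimator_      # plays the role of EngMdl_SM
```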
4.3.4 Online Local Search
To fix minor errors caused by the multi-objective prediction models, we perform a local search around the predicted optimal clock configuration. We use the online detected period and the energy data measured during one period to evaluate different clock frequencies and find the actual optimal clock configuration. We first conduct a local search for the memory clock, because wrong memory clock frequencies (too low) can lead to severe slowdowns. Based on the optimal memory clock frequency, we conduct another local search for the SM clock. According to our experimental results and Schwarzrock's work [17], the relationship between energy optimization objectives and clock gears is generally a convex function. Based on this observation, we use the golden-section search method [23] to accelerate our local search. We first find a gear with a worse objective value on each side of the predicted optimal gear to determine the search interval. Then we follow the classic golden-section search process [23] to attempt different clock gears. Due to possible errors in the measured energy and periods, we fit the attempted points to obtain a convex function. Then we use the convex function to determine the optimal clock gear.
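The golden-section step can be sketched as below, assuming a convex objective over integer gears. `measure` is a hypothetical callback that applies a gear, measures one iteration period, and returns the objective value; the convex curve fitting described above is omitted for brevity.

```python
def golden_section_gear(lo, hi, measure, tol=1):
    """Golden-section search for the gear in [lo, hi] minimizing measure(gear)."""
    phi = (5 ** 0.5 - 1) / 2          # golden-ratio factor, about 0.618
    cache = {}                        # memoize noisy, costly measurements
    def f(gear):
        if gear not in cache:
            cache[gear] = measure(gear)
        return cache[gear]
    a, b = lo, hi
    while b - a > tol:
        c = round(b - phi * (b - a))  # interior probe points
        d = round(a + phi * (b - a))
        if c == d:                    # integer rounding can collide
            d += 1
        if f(c) < f(d):
            b = d                     # the minimum lies in [a, d]
        else:
            a = c                     # the minimum lies in [c, b]
    return min(cache, key=cache.get)  # best gear actually measured
```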
4.3.5 Optimize Aperiodic Applications
The power and utilization traces of some applications are aperiodic, such as applications in the CSL and TUs datasets, the ThunderSVM [19] workload, and the ThunderGBM [20]–[22] workload. We cannot evaluate their execution time or energy consumption with data measured in one period. In this case, we fix the measurement time interval and use the measured mean number of instructions executed per second (IPS) and power to evaluate the execution time and energy consumption. If the number of instructions in the program is $Inst_{sum}$, the execution time can be calculated as $time = Inst_{sum}/IPS$, and the energy consumption can be calculated as $energy = power \times time = Inst_{sum} \times power/IPS$. Then we can evaluate different clock frequency configurations.

Based on these analyses, we apply GPOEO to aperiodic applications. We first measure performance counter metrics in the fixed time interval and predict the optimal clock frequency configuration. In the online local search, we measure and calculate $time_{default}$ and $energy_{default}$ under the NVIDIA default scheduling strategy as the baseline. Then, we follow the golden-section search strategy and collect $time_i$ and $energy_i$ under different clock frequency configurations near the predicted optimal clock frequency configuration. Finally, we compare these $time_i$ and $energy_i$ with the baseline to find the optimal clock frequency configuration.
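The IPS-based estimate reduces to two lines of arithmetic, sketched below with made-up numbers for illustration:

```python
def estimate_time_energy(inst_sum, ips, power_w):
    """inst_sum: total instructions of the program (Inst_sum);
    ips: mean instructions per second in the fixed measurement interval;
    power_w: mean power (W) over the same interval."""
    time_s = inst_sum / ips        # time = Inst_sum / IPS
    energy_j = power_w * time_s    # energy = power * time
    return time_s, energy_j

# Example (made-up numbers): 2e12 instructions at 5e10 IPS and 220 W
# gives 40 s and 8800 J for this candidate clock configuration.
print(estimate_time_energy(2e12, 5e10, 220.0))
```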
5 EVALUATION
We evaluate our GPOEO system in this section. We first introduce the experimental setup. Then, we evaluate the accuracy of the period detection algorithm and the energy-performance prediction models. Finally, we analyze the online energy-saving results on various ML benchmarks.

5.1 Experimental Setup

5.1.1 Hardware and Software Platforms
Our experimental platform is a GPU server equipped with one NVIDIA RTX3080Ti GPU, one AMD 5950X CPU, and 64 GB of memory. The software environment is NVIDIA Driver 470.57 and CUDA 11.3. The RTX3080Ti supports a continuously adjustable SM clock frequency, from 210 MHz to 2,025 MHz, with a step of 15 MHz. We find that some higher frequencies are not practical or stable. Under lower frequencies, applications cannot save energy while suffering severe slowdowns. Therefore, we only consider the frequencies in the middle part, which can run stably and may improve energy efficiency. We treat each SM clock frequency as an SM clock gear, from $SMgear_{16}$ = 450 MHz to $SMgear_{114}$
= 1,920 MHz. The RTX3080Ti supports 5 global memory clock frequencies: $Memgear_0$ = 405 MHz, $Memgear_1$ = 810 MHz, $Memgear_2$ = 5,001 MHz, $Memgear_3$ = 9,251 MHz, and $Memgear_4$ = 9,501 MHz. We select $SMgear_{106}$ = 1,800 MHz and $Memgear_3$ = 9,251 MHz as the reference frequencies, which are used in performance counter profiling.

5.1.2 Benchmarks
We train our machine learning models using data collected from PyTorch Benchmarks [18]. We then test the trained models on AIBench [13] and benchmarking-gnns [16]. In addition to deep neural networks, we also use two classical ML workloads, ThunderSVM [19] and ThunderGBM [20]–[22], in our evaluation.

5.2 Accuracy and Sensitivity of Period Detection
We analyze our period detection algorithm on 34 different ML applications. Figure 5 shows the period detection errors of GPOEO and ODPP [11] using the NVIDIA default scheduling strategy. GPOEO is far more accurate than ODPP. The mean period error of GPOEO is 1.72%, while the mean error of ODPP is 23.16%. The maximum error of GPOEO is 7.2%, and 32 errors are within 5%. For ODPP, nine errors are above 50%, nine errors are between 5% and 13%, and sixteen errors are within 5%. GPOEO is more accurate than ODPP on 29 applications. For the other three applications, the accuracy differences are within 1%.

Figures 6, 7, and 8 show the period detection errors of GPOEO and ODPP under varying SM clock frequencies and a fixed memory clock frequency. GPOEO exhibits stable, low errors (within 5% for most cases). This phenomenon indicates that GPOEO is robust to clock frequency changes. However, ODPP shows poor stability. On CLB_GAT and SBM_3WLGNN, ODPP gains huge errors under three and ten SM clock frequencies, respectively. This instability causes inaccurate prediction models, which ODPP builds online. For TSP_GatedGCN, ODPP gains huge errors (nearly 100%) under all frequencies because it is disturbed by high-frequency noise, only detects the local short period, and ignores the real global period.

Fig. 6. Period detection errors of ODPP and GPOEO on the CLB_GAT application under different SM clock frequencies

Fig. 7. Period detection errors of ODPP and GPOEO on the SBM_3WLGNN application under different SM clock frequencies

Fig. 8. Period detection errors of ODPP and GPOEO on the TSP_GatedGCN application under different SM clock frequencies

Fig. 9. Prediction errors by varying the SM clock (grouped by SM clock ranges): (a) energy consumption prediction errors; (b) execution time prediction errors

Fig. 10. Prediction errors by varying the SM clock (grouped by datasets): (a) energy consumption prediction errors; (b) execution time prediction errors

5.3 Accuracy of Energy and Execution Time Prediction
In this section, we evaluate the four prediction models. We measure the selected features (Table 2) during one iteration of each ML application under the reference SM and memory clock frequencies (1,800 MHz and 9,251 MHz). Then we use these features and different SM and memory clock frequencies as input of our multi-objective models to predict the energy consumption and execution time relative to the reference configuration.
Fig. 13. Energy savings, slowdown, and ED2P saving compared to the NVIDIA default scheduling strategy on AIBench and traditional ML applications
TABLE 3
Online optimization process for SM and memory clock on AIBench
3DFR 3DOR FE I2IC I2IP I2T ICMP IGEN LRK OBJ S2T ST T2T TS
Oracle SM Gear 104 102 109 104 91 94 107 71 88 100 108 39 100 107
Prediction Error (SM gear) -11 5 -13 2 3 -2 -13 22 24 -4 -15 11 5 -5
Search Error (SM gear) -2 -3 -8 1 -2 0 -2 -3 -2 0 -7 4 -2 -1
# of Search Steps (SM) 5 4 4 3 4 3 6 8 9 4 5 5 4 4
Oracle Mem clock (MHz) 9501 9251 9501 9501 9251 9251 9251 405 5001 9251 9251 810 9251 9501
Predicted Mem clock 9501 9501 9251 9501 9501 9501 9501 405 5001 9251 9501 405 9251 9501
Searched Mem clock 9501 9501 9501 9501 9501 9251 9251 405 5001 9251 9251 810 9251 9501
# of Search Steps (Mem) 2 2 3 2 2 2 2 3 3 3 2 3 3 2
medium energy savings (8-14%) with reasonable slowdowns (2-7%) on these seven applications. Two reasons prevent GPOEO from making full use of the energy-saving potential. For AI_FE and AI_S2T, the period detection and feature measurement get inaccurate data due to some abnormal iterations. Therefore, the prediction errors are significant, and the local search cannot eliminate them. For the convenience of experiments, we just train each machine learning model for a few iterations, and only iterations after the optimization process can enjoy the optimal SM gear. However, in real-life scenarios, the model will be trained for many more iterations, for which our approach can give higher energy savings. GPOEO does not perform well on AI_I2IC and AI_T2T. These two applications are computation-intensive with little energy-saving opportunity (1% and 4%). For the non-periodical applications TGBM and TSVM, GPOEO also achieves good results, saving 17.9% and 12.0% energy with slowdowns of 3.5% and 4.7%, respectively. As for the memory clock, the oracle frequencies of thirteen applications are 9,251 MHz or 9,501 MHz. Our method finds the oracle frequencies on eleven applications and finds the nearby 9,501 MHz on the other two applications within two or three search steps. In fact, the energy consumption and execution time are similar under these two memory clock frequencies. GPOEO also finds the low oracle memory frequencies on AI_IGEN, AI_LRK, and AI_ST within three steps. Our method meets the performance loss constraint of 5% on nine applications, while ODPP only achieves the same goal on four applications. For GPOEO, only iterations after the optimization process are guaranteed to meet the performance loss constraint; the previous iterations may violate it.

Compared to ODPP, our approach can improve the energy saving by 9.2%, improve the ED2P saving by 20.4%, and reduce the slowdown by 5.0% on average. GPOEO performs better than ODPP on all sixteen applications. In particular, our approach gains both larger energy savings and lighter slowdowns on eight applications. For AI_3DFR, AI_IGEN, AI_LRK, and AI_S2T, GPOEO achieves more considerable energy savings with similar slowdowns. Our method gains similar energy savings with lighter slowdowns on AI_I2IC and TSVM. ODPP builds two piecewise linear models online to predict energy and time. The accuracy of these two models is highly dependent on the period detection accuracy. The poor period detection accuracy of ODPP causes significant prediction errors, so ODPP shows heavier slowdowns and less energy saving. On the non-periodical applications TGBM and TSVM, ODPP shows much worse results than GPOEO. This phenomenon demonstrates that ODPP cannot handle non-periodical applications, while GPOEO can.
Fig. 14. Energy saving, slowdown, and ED2P saving compared to the NVIDIA default scheduling strategy on benchmarking-gnns applications
As shown in Figure 15, all energy and time overheads are within 4%.

Fig. 15. Energy and time overhead of GPOEO on AIBench

5.4.2 Large Benchmark Suite
We also use the benchmarking-gnns suite to evaluate our method and ODPP [11] extensively. Figure 14 shows the online optimization results of 55 applications in benchmarking-gnns. GPOEO achieves an energy saving of 16.6% and an ED2P saving of 7.8%, with an execution time increase of 5.2% on average. In comparison, ODPP gains an energy saving of 6.1% and an ED2P saving of -4.5%, with an execution time increase of 5.6% on average. GPOEO saves ED2P on 50 applications, while ODPP only gains ED2P savings on 19 applications. Our method meets the performance loss constraint of 5% on 22 applications. ODPP achieves the same goal on 27 applications because it does not profile performance counters and has a lower measurement overhead. With the help of performance counter features, GPOEO saves more energy than ODPP on 48 applications. Applications in the CSL and TU datasets are non-periodical. ODPP cannot handle these applications and only saves 0.73% of energy with a slowdown of 1.32% on average. On the contrary, GPOEO gains an energy saving of 20.0% with a slowdown of 5.2% on average. These phenomena show that GPOEO has better performance and broader applicability than ODPP.

6 RELATED WORK
There is a growing interest in optimizing machine learning workloads for performance and energy efficiency. Many works concentrate on CPU energy optimization [17], [24]–[29]. Some of these prior works [24]–[26] use offline methods to model energy and performance, while others [17], [27], [28] design runtimes to adjust the CPU clock or enforce power capping [29]. These methods rely heavily on runtime CPU-specific performance counter profiling and are not transferable to GPUs, but the prediction models [26], [28], [30] they have used may also apply to GPUs.

Profiling workloads using GPU performance counters during program execution can be expensive. For this reason, most GPU energy optimization works have to rely on offline profiling information [4], [6], [7], [31]. Most offline works [4], [6], [31] focus on modeling and predicting energy and performance under different hardware configurations. Other works study GPU energy and performance from the view of assembly instructions [6], [7]. These works require tedious offline profiling and analysis for each new application or even different inputs. In contrast, our work can automatically optimize new applications online once the prediction models are established and well trained. Although these offline approaches are inconvenient to use, the energy features [4], [31] they have used are instructive for GPU online energy consumption optimization.

A few studies [10], [11] realize GPU online energy consumption optimization. Majumdar et al. [10] exploit low-overhead fine-grained profiling functions supported by the
APU platform. It obtains the stream of energy, execution time, throughput, and performance counter metrics at kernel granularity without worrying about time and energy overhead. Therefore, it can catch energy-saving chances at kernel granularity and save considerable energy with low overhead. Majumdar's work cannot be transplanted to the NVIDIA GPU platform, considering the significant amount of code instrumentation (around each kernel) and the high overhead of continuous profiling. Our approach targets the NVIDIA GPU platform, where continuous runtime profiling is expensive (a slowdown of over 8% and an energy overhead of over 10%); we therefore profile performance counter metrics in only one iteration period to minimize overhead. ODPP [11] proposes an online dynamic power-performance management framework for GPUs using coarse-grained features, which cannot provide enough information to model energy consumption and performance accurately, as mentioned in Section 2.2.4. Besides, the period detection algorithm of ODPP is unstable and error-prone. Due to these two weaknesses, ODPP exhibits poor energy-saving results and can cause severe slowdowns in our evaluation. Our approach exploits hardware performance counter metrics and utilizes a robust period detection algorithm to tackle these two weaknesses, leading to better performance.

Energy and performance optimization built upon machine learning techniques has been demonstrated to be promising in various application domains [8], [9], [32]–[47]. Some works [8], [9], [41], [42], [44], [46] tune hyperparameters, model structure configurations, or data structures to speed up ML applications or save energy. Our method can be used together with these application-level techniques to reduce energy consumption further. CAMEL [43] optimizes the energy of web applications on mobile platforms. It uses the frame rate as the performance index, which is only suitable for web applications. Many studies [32], [33], [35]–[40], [45] concentrate on accelerating applications on heterogeneous platforms. Combining these studies with our method may enable acceleration while saving energy. Based on graph neural networks, Ye et al. [47] propose a novel program analysis and modeling method to provide comprehensive insight into programs. This method may be helpful to build more accurate energy and performance prediction models.

7 CONCLUSION
We have presented GPOEO, a new online GPU energy optimization framework for iterative machine learning (ML) workloads. GPOEO detects iterative periods, measures performance counter features, and predicts the optimal energy configurations automatically. We evaluate GPOEO on 71 ML applications running on an NVIDIA RTX3080Ti GPU. GPOEO achieves a mean energy saving of 16.2% at a modest cost of 5.1% performance loss compared to the NVIDIA default scheduling strategy.

ACKNOWLEDGMENTS
This work was supported in part by the National Key Research and Development Program of China (2020YFB1406902), the Key-Area Research and Development Program of Guangdong Province (2020B0101360001), the Shenzhen Science and Technology Research and Development Foundation (JCYJ20190806143418198), the National Natural Science Foundation of China (NSFC) (61872110 and 61872294), the Fundamental Research Funds for the Central Universities (Grant No. HIT.OCEF.2021007), the Peng Cheng Laboratory Project (PCL2021A02), and the CCF-Huawei fund (CCF-HuaweiHP2021002). Professor Weizhe Zhang is the corresponding author.

REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] M. Z. Alom, T. M. Taha, C. Yakopcic, S. Westberg, P. Sidike, M. S. Nasrin, M. Hasan, B. C. Van Essen, A. A. Awwal, and V. K. Asari, "A state-of-the-art survey on deep learning theory and architectures," Electronics, vol. 8, no. 3, p. 292, 2019.
[3] J. Guerreiro, A. Ilic, N. Roma, and P. Tomas, "Gpgpu power modeling for multi-domain voltage-frequency scaling," in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 789–800.
[4] ——, "Modeling and decoupling the gpu power consumption for cross-domain dvfs," IEEE Transactions on Parallel and Distributed Systems, vol. 30, no. 11, pp. 2494–2506, 2019.
[5] ——, "Dvfs-aware application classification to improve gpgpus energy efficiency," Parallel Computing, vol. 83, pp. 93–117, 2019.
[6] ——, "Gpu static modeling using ptx and deep structured learning," IEEE Access, vol. 7, pp. 159150–159161, 2019.
[7] Y. Arafa, A. ElWazir, A. ElKanishy, Y. Aly, A. Elsayed, A.-H. Badawy, G. Chennupati, S. Eidenbenz, and N. Santhi, "Verified instruction-level energy consumption measurement for nvidia gpus," in Proceedings of the 17th ACM International Conference on Computing Frontiers, 2020, pp. 60–70.
[8] Y. Wang, Q. Wang, and X. Chu, "Energy-efficient inference service of transformer-based deep learning models on gpus," in 2020 International Conferences on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData) and IEEE Congress on Cybermatics (Cybermatics). IEEE, 2020, pp. 323–331.
[9] S. M. Nabavinejad, S. Reda, and M. Ebrahimi, "Batchsizer: Power-performance trade-off for dnn inference," in Proceedings of the 26th Asia and South Pacific Design Automation Conference, 2021, pp. 819–824.
[10] A. Majumdar, L. Piga, I. Paul, J. L. Greathouse, W. Huang, and D. H. Albonesi, "Dynamic gpgpu power management using adaptive model predictive control," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2017, pp. 613–624.
[11] P. Zou, L. Ang, K. Barker, and R. Ge, "Indicator-directed dynamic power management for iterative workloads on gpu-accelerated systems," in 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID). IEEE, 2020, pp. 559–568.
[12] T. Chen and C. Guestrin, "Xgboost: A scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
[13] F. Tang, W. Gao, J. Zhan, C. Lan, X. Wen, L. Wang, C. Luo, Z. Cao, X. Xiong, Z. Jiang et al., "Aibench training: balanced industry-standard ai training benchmarking," in 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2021, pp. 24–35.
[14] NVIDIA Corporation, "Nvml api reference manual," https://docs.nvidia.com/deploy/nvml-api/index.html, August 2021, accessed August 19, 2021.
[15] ——, "Cupti documentation," https://docs.nvidia.com/cupti/Cupti/index.html, August 2021, accessed August 19, 2021.
[16] V. P. Dwivedi, C. K. Joshi, T. Laurent, Y. Bengio, and X. Bresson, "Benchmarking graph neural networks," arXiv preprint arXiv:2003.00982, 2020.
[17] J. Schwarzrock, C. C. de Oliveira, M. Ritt, A. F. Lorenzon, and A. C. S. Beck, "A runtime and non-intrusive approach to optimize edp by tuning threads and cpu frequency for openmp applications," IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 7, pp. 1713–1724, 2020.
[18] pytorch.org, "Pytorch benchmarks," https://github.com/pytorch/benchmark, August 2021, accessed August 19, 2021.
[19] Z. Wen, J. Shi, Q. Li, B. He, and J. Chen, "ThunderSVM: A fast SVM library on GPUs and CPUs," Journal of Machine Learning Research, vol. 19, pp. 797–801, 2018.
[20] Z. Wen, J. Shi, B. He, J. Chen, K. Ramamohanarao, and Q. Li, "Exploiting gpus for efficient gradient boosting decision tree training," IEEE Transactions on Parallel and Distributed Systems, vol. 30, no. 12, pp. 2706–2717, 2019.
[21] Z. Wen, J. Shi, B. He, Q. Li, and J. Chen, "ThunderGBM: Fast GBDTs and random forests on GPUs," Journal of Machine Learning Research, vol. 21, 2020.
[22] Z. Wen, J. Shi, B. He, J. Chen, K. Ramamohanarao, and Q. Li, "Exploiting gpus for efficient gradient boosting decision tree training," IEEE Transactions on Parallel and Distributed Systems, vol. 30, no. 12, pp. 2706–2717, 2019.
[23] H. Walser, The Golden Section. MAA, 2001.
[24] M. Endrei, C. Jin, M. N. Dinh, D. Abramson, H. Poxon, L. DeRose, and B. R. de Supinski, "Statistical and machine learning models for optimizing energy in parallel applications," The International Journal of High Performance Computing Applications, vol. 33, no. 6, pp. 1079–1097, 2019.
[25] S. Ramesh, S. Perarnau, S. Bhalachandra, A. D. Malony, and P. Beckman, "Understanding the impact of dynamic power capping on application progress," in 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2019, pp. 793–804.
[26] M. Hao, W. Zhang, Y. Wang, G. Lu, F. Wang, and A. V. Vasilakos, "Fine-grained powercap allocation for power-constrained systems based on multi-objective machine learning," IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 7, pp. 1789–1801, 2020.
[27] N. Gholkar, F. Mueller, and B. Rountree, "Uncore power scavenger: A runtime for uncore power conservation on hpc systems," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pp. 1–23.
[28] Y. Wang, W. Zhang, M. Hao, and Z. Wang, "Online power management for multi-cores: A reinforcement learning based approach," IEEE Transactions on Parallel and Distributed Systems, 2021.
[29] P. Petoumenos, L. Mukhanov, Z. Wang, H. Leather, and D. S. Nikolopoulos, "Power capping: What works, what does not," in 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS). IEEE, 2015, pp. 525–534.
[30] M. Endrei, C. Jin, M. N. Dinh, D. Abramson, H. Poxon, L. DeRose, and B. R. de Supinski, "Energy efficiency modeling of parallel applications," in SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2018, pp. 212–224.
[31] K. Fan, B. Cosenza, and B. Juurlink, "Predictable gpus frequency scaling for energy and performance," in Proceedings of the 48th International Conference on Parallel Processing, 2019, pp. 1–10.
[32] Z. Wang and M. F. O'Boyle, "Mapping parallelism to multi-cores: a machine learning based approach," in Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009, pp. 75–84.
[33] D. Grewe, Z. Wang, and M. F. O'Boyle, "Portable mapping of data parallel programs to opencl for heterogeneous systems," in Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2013, pp. 1–10.
[34] Z. Wang, D. Grewe, and M. F. O'Boyle, "Automatic and portable mapping of data parallel programs to opencl for gpu-based heterogeneous systems," ACM Transactions on Architecture and Code Optimization (TACO), vol. 11, no. 4, pp. 1–26, 2014.
[35] C. Cummins, P. Petoumenos, Z. Wang, and H. Leather, "End-to-end deep learning of optimization heuristics," in 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 2017, pp. 219–232.
[36] Y. Wen, Z. Wang, and M. F. O'Boyle, "Smart multi-task scheduling for opencl programs on cpu/gpu heterogeneous platforms," in 2014 21st International Conference on High Performance Computing (HiPC). IEEE, 2014, pp. 1–10.
[37] B. Taylor, V. S. Marco, and Z. Wang, "Adaptive optimization for opencl programs on embedded heterogeneous systems," in Proceedings of the 18th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, 2017, pp. 11–20.
[38] Z. Wang and M. O'Boyle, "Machine learning in compiler optimization," Proceedings of the IEEE, vol. 106, no. 11, pp. 1879–1901, 2018.
[39] J. Fang, C. Huang, T. Tang, and Z. Wang, "Parallel programming models for heterogeneous many-cores: a comprehensive survey," CCF Transactions on High Performance Computing, vol. 2, no. 4, pp. 382–400, 2020.
[40] P. Zhang, J. Fang, T. Tang, C. Yang, and Z. Wang, "Auto-tuning streamed applications on intel xeon phi," in 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2018, pp. 515–525.
[41] D. Stamoulis, E. Cai, D.-C. Juan, and D. Marculescu, "Hyperpower: Power- and memory-constrained hyper-parameter optimization for neural networks," in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2018, pp. 19–24.
[42] V. S. Marco, B. Taylor, Z. Wang, and Y. Elkhatib, "Optimizing deep learning inference on embedded systems through adaptive model selection," ACM Transactions on Embedded Computing Systems (TECS), vol. 19, no. 1, pp. 1–28, 2020.
[43] J. Ren, L. Yuan, P. Nurmi, X. Wang, M. Ma, L. Gao, Z. Tang, J. Zheng, and Z. Wang, "Camel: Smart, adaptive energy optimization for mobile web interactions," in IEEE INFOCOM 2020 - IEEE Conference on Computer Communications. IEEE, 2020, pp. 119–128.
[44] A. E. Brownlee, J. Adair, S. O. Haraldsson, and J. Jabbo, "Exploring the accuracy-energy trade-off in machine learning," in 2021 IEEE/ACM International Workshop on Genetic Improvement (GI). IEEE, 2021, pp. 11–18.
[45] P. Zhang, J. Fang, C. Yang, C. Huang, T. Tang, and Z. Wang, "Optimizing streaming parallelism on heterogeneous many-core architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 8, pp. 1878–1896, 2020.
[46] S. Qiu, L. You, and Z. Wang, "Optimizing sparse matrix multiplications for graph neural networks," in Lecture Notes in Computer Science. Springer Verlag, 2021.
[47] G. Ye, Z. Tang, H. Wang, D. Fang, J. Fang, S. Huang, and Z. Wang, "Deep program structure modeling through multi-relational graph-based learning," in Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, 2020, pp. 111–123.