[go: up one dir, main page]

0% found this document useful (0 votes)
132 views13 pages

Dynamic GPU

This document proposes GPOEO, a framework for dynamically optimizing GPU energy usage for machine learning training workloads. GPOEO uses performance counters and an analytical model to detect changes in training iterations. It then employs multi-objective models based on gradient boosting and a local search algorithm to find an optimal tradeoff between execution time and energy consumption. When evaluated on 71 machine learning workloads, GPOEO delivered a mean energy savings of 16.2% with an average execution time increase of only 5.1% compared to the default NVIDIA scheduling strategy.

Uploaded by

xiaoma
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
132 views13 pages

Dynamic GPU

This document proposes GPOEO, a framework for dynamically optimizing GPU energy usage for machine learning training workloads. GPOEO uses performance counters and an analytical model to detect changes in training iterations. It then employs multi-objective models based on gradient boosting and a local search algorithm to find an optimal tradeoff between execution time and energy consumption. When evaluated on 71 machine learning workloads, GPOEO delivered a mean energy savings of 16.2% with an average execution time increase of only 5.1% compared to the default NVIDIA scheduling strategy.

Uploaded by

xiaoma
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

1

Dynamic GPU Energy Optimization for Machine Learning Training


Workloads
Farui Wang, Weizhe Zhang, Shichao Lai, Meng Hao, and Zheng Wang

Abstract—GPUs are widely used to accelerate the training of machine learning workloads. As modern machine learning models
become increasingly larger, they require a longer time to train, leading to higher GPU energy consumption. This paper presents
GPOEO, an online GPU energy optimization framework for machine learning training workloads. GPOEO dynamically determines the
optimal energy configuration by employing novel techniques for online measurement, multi-objective prediction modeling, and search
optimization. To characterize the target workload behavior, GPOEO utilizes GPU performance counters. To reduce the performance
counter profiling overhead, it uses an analytical model to detect the training iteration change and only collects performance counter
data when an iteration shift is detected. GPOEO employs multi-objective models based on gradient boosting and a local search
algorithm to find a trade-off between execution time and energy consumption. We evaluate the GPOEO by applying it to 71 machine
arXiv:2201.01684v1 [cs.DC] 5 Jan 2022

learning workloads from two AI benchmark suites running on an NVIDIA RTX3080Ti GPU. Compared with the NVIDIA default
scheduling strategy, GPOEO delivers a mean energy saving of 16.2% with a modest average execution time increase of 5.1%.

Index Terms—Dynamic energy optimization, online application iteration detection, multi-objective machine learning, GPU

1 I NTRODUCTION
In recent years, deep neural networks (DNNs) have can only detect coarse-grained phase change patterns.
demonstrated breakthrough effectiveness in various tasks This work aims to provide a better dynamic GPU energy
that were once deemed impossible [1], [2]. Training an optimization method for iterative ML training workloads.
effective DNN requires performing expensive training on We present GPOEO, a micro-intrusive 1 GPU online energy
a large number of samples. As the model training time optimization framework. GPOEO only requires the user to
increases, the energy consumption of machine learning mark the beginning and the end of the target code region.
training workloads also grows. This is a particular concern GPOEO can then automatically collect the relevant infor-
for high-performance GPUs because they are widely used mation to monitor the program behavior to identify the re-
to train DNN workloads but are less energy-efficient than peat workload patterns and adjust the energy optimization
their CPU counterparts. As a result, there is a critical need scheme accordingly.
to reduce the energy consumption for GPUs when training GPOEO trades execution time for energy efficiency. To
machine learning workloads. Achieving this goal can reduce this end, we employ multi-objective optimization models
the maintenance cost and allow users to train larger models built on the XGBoost Classifier [12]. As a departure from
within the same budget. existing online GPU power optimization schemes, GPOEO
Most of the existing studies in general GPU energy leverages the fine-grained runtime information provided by
optimization are offline techniques [3]–[7]. These techniques the hardware performance counters to model the system
require profiling applications ahead of time [3]–[5], [7] or and program behavior to build more accurate decision
analyzing their source code [6] in advance to collect ap- models. We use the predictive model as a cost function to
plication features. These approaches cannot adapt to the search for the optimal energy configuration for the current
change of the program behavior and can incur expensive workload pattern. These supporting methods enable us to
profiling overhead. Other approaches change the internal develop a period detection algorithm based on the fast
working mechanisms of machine learning (ML) models [8], Fourier transform (FFT) and feature sequence similarity
[9], but they are specific to the ML algorithms used and do evaluation to identify if a changed training iteration takes
not generalize to other ML methods. place and adjust our optimization accordingly.
Recently, efforts have been made to target online GPU We evaluate GPOEO by applying it to 71 machine learn-
power optimization. The work presented in [10] proposes a ing workloads from three AI benchmark suites [13]. We
dynamic power management method for integrated APUs. test our approach on an NVIDIA RTX3080Ti GPU. Exper-
ODPP [11] is another online dynamic power-performance imental results show that GPOEO can reduce GPU energy
management framework. ODPP detects the periods of appli- consumption by over 20% (16.2% on average), with most
cations to build prediction models. However, their models execution time increments less than 7% (5.1% on average)
when compared to the NVIDIA default GPU scheduling
• F. Wang, W. Zhang, S. Lai, and M. Hao are with the School of Cyberspace strategy.
Science at Harbin Institute of Technology, Harbin 150000, China. This paper makes the following contributions:
E-mail:{wangfarui,wzzhang}@hit.edu.cn
• Z. Wang is with the School of Computing at University of Leeds, United
Kingdom.
1. Micro-intrusive means we require minor changes to the program
E-mail: z.wang5@leeds.ac.uk
source code.
2

• It is the first work to exploit performance counter met- 50% 30%


Energy Saving ED2P Saving
rics to perform online energy prediction and optimiza- 40% Slowdown 18.9 20%
tion for discrete GPUs. By minimizing the instrument 14.7
30% 26.4 9.8
and measurement overhead, our approach improves 22.4 6.8 8.5 10%
20% 14.9 16.8 18.0
the practicability of online GPU profiling. 0%
10% 4.8 5.0 4.8 4.9
• It presents a robust detection algorithm to automati- 4.7
cally detect the iterative period of ML applications from 0% AI_FE AI_S2T CLB_MLP SBM_GIN TSP_GatedGCN -10%
the GPU resource traces (Section 4.1).
• It showcases how an effective GPU online energy opti- Fig. 1. Oracle ED2P saving of ML applications
mization framework can be constructed as a cost model
to quickly identify a good energy configuration.
100% ODPP Period Detection Errors GPOEO Period Detection Errors
Online material: The GPOEO framework is publicly avail- 80%
60%
able at https://github.com/ruixueqingyang/GPOEO. 40%
20%
0%
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 lt
2 BACKGROUND AND M OTIVATION 102 108 114 120 126 132 138 144 150 156 162 168 174 180 186 192Defau
SM Clock Frequencies (MHz)
2.1 GPU configuration and profiling tools
(a) MLC 3WLGNN application
NVIDIA provides the NVML [14] and CUPTI [15] library
for reconfiguring, measuring, and profiling its GPUs. The 80% ODPP Period Detection Errors GPOEO Period Detection Errors
NVML [14] can set Streaming Multiprocessor (SM) clock 60%
40%
frequency, global memory clock frequency (only certain 20%
types of NVIDIA GPUs), and powercap online. It also sup- 0%
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 lt
ports measuring power, SM utilization, and global memory 102 108 114 120 126 132 138 144 150 156 162 168 174 180 186 192Defau
utilization. The CUPTI [15] can profile performance counter. SM Clock Frequencies (MHz)
These tools make GPU online energy optimization possible. (b) SP GCN application
In this work, we adjust the clock frequency of the GPU SM
and global memory to achieve energy saving. Fig. 2. Period detection errors of ODPP and GPOEO under different SM
clock frequencies

2.2 Motivation
2.2.3 Reduce Period Detection Error
2.2.1 Energy Optimization Potential
Online sampling time series of power and utilization can
The training phase of machine learning usually uses high- reflect the periodic pattern. ODPP [11] uses the Fourier
performance GPUs as accelerators. GPUs consume a lot transform algorithm to detect periods, namely the execu-
of energy and time. To explore the energy-saving poten- tion time of one major iteration. However, according to
tial on GPUs, we run five applications in AIBench [13] our experimental results, the period error of ODPP [11] is
and benchmarking-gnns [16] on all combinations of SM quite large when the periodicity is not apparent. To solve
and memory clock frequencies and select the best con- this shortcoming, we propose a robust period detection
figurations for minimal energy consumption within the algorithm. Figure 2 shows the absolute percentage errors
slowdown constraint of 5%. Figure 1 shows the oracle of period detection with ODPP and our algorithm GPOEO
results of energy saving, slowdown, and ED2P (Energy × under different SM clock frequencies on two ML applica-
Dealy 2 ) saving. The AI FE, AI S2T, and SBM GIN are tions. The “Default” means the NVIDIA default scheduling
relatively compute-intensive and save significant energy strategy. The period detection errors of ODPP are pretty
(14.9%-22.4%). The CLB MLP and TSP GatedGCN are rel- significant and unstable, while the period detection errors of
atively memory-intensive and save considerable energy our GPOEO are less than 5% under all SM clock frequencies.
(18.0%-26.4%). These results highlight that both compute-
intensive and memory-intensive ML training workloads 2.2.4 Using GPU Performance Counters
have the chance to save energy with acceptable slowdown.
ODPP [11] uses power, SM utilization, and memory uti-
lization as features to predict energy and execution time.
2.2.2 Make Energy Optimization Practical However, these coarse-grained features are inadequate to
Most existing studies on GPU energy optimization are give accurate predictions for some machine learning ap-
offline and need domain knowledge and complex offline plications. We give an example in Figure 3. Applications
profiling. Thus, users or researchers can hardly apply these in each pair have similar average power and utilization
studies in practice. An ideal online GPU energy optimiza- under the same reference clock frequency configuration, but
tion system should be non-intrusive to applications and their optimal SM clocks for ED2P are different (memory
transparent to users. Majumdar’s work [10] only applies to frequency fixed at 9251MHz). In light of this observation, we
AMD APUs. Motivated by drawbacks, we propose an online wish to use more fine-grained metrics to capture the subtle
energy optimization framework called GPOEO. Users only interactions between the application and the hardware. To
need to insert the Begin and End APIs, and the GPOEO this end, we use hardware performance counter metrics
automatically optimizes energy online. as features to build predictive models to estimate energy
3

350 TABLE 1
300 avgPower: W Best Clock: MHz 2000
avgGPUUtil: (%) 1650 Description of Parameters
1515 1740
250 avgMemUtil: (%) 1365 1125
1500
200 930 990
750 1000
150

168.0

159.1
500 Notation Description
100

103.1

102.6

41.7
37.7
36.6

36.0
95.4

94.9
93.8

32.5
32.3
31.5
92.1

30.1
50 0 fobj objective function

67.6

66.2
7.6
6.8
0.9

0.9
0.0

0.0

0 eng
d SM energy consumption predicted with the SM clock
P t 2T AI_TS
_ML GCN GCN phSage _MoNe P_GCN AI_I
time execution time predicted with the SM clock
CSL ted TU_
dSM
G a r a T S P T S EngM dlSM energy prediction model with the SM clock
_ G
CSL TU_ T imeM dlSM time prediction model with the SM clock
SM geari one SM gear represents an SM clock frequency
Fig. 3. ML applications with similar coarse-grained features and different wSM the input vector containing SM clock
optimal SM clocks for ED2P engdM em energy consumption predicted with the memory
clock
time
d M em execution time predicted with the memory clock
EngM dlM em energy prediction model with the memory clock
and execution time. Later in the paper, we show that this T imeM dlM em time prediction model with the memory clock
strategy allows us to build more accurate prediction models, M emgeari one memory gear represents a memory clock fre-
which in turn lead to better optimization decisions. The key quency
wM em the input vector containing memory clock
challenges here are reducing the overhead of performance Feature feature vector measured under the reference clock
counter profiling and minimizing code modifications. configuration from the ML applications

3 OVERVIEW OF O UR A PPROACH
training data
reference energy

This section defines the scope of the work and gives an configuration performance
② train energy model

overview of our approach. counter metrics


and time model
① run benchmarks
(reference energy

measure metrics  configuration)

energy and time metrics

energy model

3.1 Problem Formulation other energy


(all energy
time model
configurations configurations)

This work aims to find an optimal SM and memory clock offline


frequency configuration to balance the GPU energy con- online
④ measure energy and

sumption and execution time for ML training workloads. ③ period performance counter


⑤ predict

detection metrics in one period
energy and time

Our goal can be formulated as follows: (reference energy configuration) with ML model

change
 
min FSM = fobj eng
d SM , timeSM
d

⑦ actual optimal
⑥ predicted optimal

configuration

monitor energy
configuration through

according to

s.t. eng
d SM = EngM dlSM (wSM ) , features
local search
user requirements

time
dSM = T imeM dlSM (wSM ) ,
(1) no change

wSM = {SM geari , Feature} ,


Fig. 4. Overview of the GPOEO workflow.
SM geari ∈ {SM gear1 , . . . , SM gearm } ,
Feature = {f eature1 , . . . , f eatureq } .

  3.2 Framework Overview


min FM em = fobj eng
d M em , time
d M em

s.t. eng
dM em = EngM dlM em (wM em ) , GPOEO includes two stages, shown in Figure 4. In the
timeM em = T imeM dlM em (wM em ) ,
d (2) offline training stage, we run representative benchmarks
wM em = {M emgeari , Feature} , on each SM and memory clock frequencies ( 1 ) to collect
performance counter metrics on the reference SM and mem-
M emgeari ∈ {M emgear1 , . . . , M emgearp } . ory frequencies and energy-time data on all frequencies.
Table 1 lists the variables used in our problem formula- Then we train multi-objective models ( 2 ). In the online
tion. We note that our approach can be applied to an arbi- optimization stage, we measure energy and performance
trary objective function. We select the energy consumption counter metrics in one detected period ( 3 , 4 ), then
with an execution time increase constraint as our objective predict the optimal SM and memory frequencies ( 5 , 6 ).
function to explicitly control the performance loss. Like After that, we try frequencies around the predicted optimal
[17], we assume that the search space of SM and memory configuration and compare the measured energy-time data
clocks is convex and optimize the SM frequency and mem- to search for the actual optimal configuration ( 7 ). Finally,
ory frequency in order. We use four prediction models to we set the actual optimal configuration and monitor energy
estimate energy and time with different SM and memory characteristics ( 8 ). If the fluctuation of energy characteris-
clock frequencies. Then we select the SM and memory clock tics exceeds the threshold, another optimization process will
configuration that can best satisfy the optimization goal. start.
4

4 D ESIGN AND I MPLEMENT Algorithm 1 Period calculation algorithm based on FFT and
feature sequence similarity
4.1 Robust Period Detection
Input: feature sampling sequence Smp = {s1 , . . . , sN },
Iterative ML workloads exhibit periodic behavior across sampling interval Ts
iterations. We treat one period, namely one iteration, as Output: iterative period Titer , error of period err
basic unit for online metric measurement and evaluation.
Our approach does not rely on offline code instrumentation
1: F req, Ampl = {f reqi }, {ampli } =FFT(Smp, Ts )
to reduce developer involvement.
2: T = {Ti } = {1/f reqi }
3: Find peaks in Ampl, get the set of peaks AmplP eak =
4.1.1 Fourier Transform: Get Candidate Periods
{amplpeakm } and the set of corresponding periods
To detect the period change, we leverage the Fourier trans- T P eak = {T peakm }
form. This established methodology is widely used in signal 4: IndexCand = {idxcandk }
processing to detect the dominant frequency and period. In = {m|amplpeakm > cpeak ×
our scenario, we assume the periodic feature of ML applica- max(AmplP eak)}
tions, such as GPU power and utilization, can be expressed 5: T Cand = {T candp } = {T peakidxcandk }
as a time-domain function F(t). Generally speaking, F(t) 6: for T candp ∈ T Cand do
is periodic and can be expressed as a linear combination of 7: errp = Algorithm2(T candp , Smp, Ts )
trigonometric series (know as Fourier Series) in exponential 8: end for
form as: 9: Find erridxmin = min({errp })

T candopt = T candidxmin
X
F (t) = Cn e2πjnt/Titer 10:
n=−∞
11: NT = Ts × (N − 1)/T candopt ,
Tlow = T candopt × (1 − 1/(NT + 1))
where Cn are Fourier coefficients. 12: Tup = T candopt × (1 + 1/(NT − 1))
We can get a series of frequency components via Fourier 13: Construct an arithmetic sequence T Local = {T localq },
transform: where T localq = Tlow + (q − 1) × Taccu ,

X q = 1, . . . , (Tup − Tup )/Taccu
F(ω) = 2π Cn δ (ω − 2πn/Titer ) 14: for T localq ∈ T Local do
n=−∞ 15: errlocalq =Algorithm2(T localp , Smp, Ts )
the frequency F(ω) is a series of impulse functions (i.e., δ ) 16: end for
located at 2πn/Titer , with amplitude proportional to Cn . 17: Find err = erridxmin = min({errlocalq })
These impulses represent different frequency components 18: Titer = T localidxmin
in F(t). The one with the largest amplitude is the major 19: return Titer , err
frequency component. The iterative period Titer of the ML
application can be calculated as Titer = 1/fmajor , where
fmajor is the frequency of the major frequency component. our scenario, the Euclidean distance is susceptible to high-
frequency interference and reports the wrong similarity.
4.1.2 Feature Sequence Similarity: Find Accurate Periods Inspired by this, instead of calculating the distance of each
pair of corresponding points, we group the sampling data
As mentioned in Section 2.2.3, the period detection results of
within a series and evaluate the distance between the corre-
the Fourier transform may have vast errors. Our period cal-
sponding groups. Specifically, we use the Gaussian mixture
culation algorithm (Algorithm 1) combines the fast Fourier
model algorithm to cluster each sub-curves into N groups
transform (FFT) and feature sequence similarity (Algorithm
(line 8-11). Then, for each group, we calculate the relative
2) to reduce these errors. We find amplitude peaks and
average amplitudes (line 12-13). The average operation can
corresponding periods T P eak in the result of FFT (line
eliminate the influence of high-frequency interference. Later,
1-3). Then we select periods T peakm with relatively high
we calculate the symmetric mean absolute percentage error
peaks as candidate periods T cand (line 4-5). We set the
(SMAPE) of the current group pair (line 14). Finally, we
peak coefficient cpeak to 0.6-0.7 empirically. Then we use
calculate the weighted average of all SMAPEs as similarity
feature sequence similarity (Algorithm 2) to evaluate and
error (line 16-17). The weights are the numbers of sampling
select the best candidate period with minimal error. Finally,
points in groups. When all adjacent sub-curves are evalu-
we perform a local search around the best candidate period
ated, we treat the mean similarity error as the error of Ti ter
to further improve accuracy.
(line 19).
We propose the feature sequence similarity algorithm
(Algorithm 2) to evaluate candidate periods. We divide the
feature sampling curve into several sub-curves, and the time 4.1.3 Online Robust Period Detection Algorithm Frame-
duration of each sub-curve is equal to the candidate period work
(line 1-4). The more similar the curves are, the closer the To get accurate Titer dynamically, we design an online
candidate period is to the actual period. So we evaluate the robust period detection algorithm framework (Algorithm
similarity of each pair of adjacent sub-curves (line 5-18). 3). The framework calculates Titer in a rolling form while
Each sub-curve is a time series, and calculating the Eu- sampling the power and utilization of the GPU until the
clidean distance between corresponding points is a common Titer is stable. First, we get the initial period Tinit , using
method to measure the similarity of time series. However, in Algorithm 2. If sampling duration is short than cmeasure
5

Algorithm 2 Feature sequence similarity algorithm Algorithm 3 Online robust period detection algorithm
Input: Period to be evaluated Titer , feature sampling se- framework
quence Smp = {s1 , . . . , sN }, sampling interval Ts Input:feature sampling sequence Smp = {s1 , . . . , sN }, sam-
Output: error of period ErrT pling interval Ts
1: N umT = bsN /Titer c; N ums = bT /Ts c Output:iterative period Titer , next sampling duration
2: for i ∈ {1, . . . N umT } do SmpDurnext
3: Smpi = {s(i−1)×N ums +1 , . . . , si×N ums } = {ssi,j } 1: Tinit , errinit =Algorithm1(Smp, Ts )
(j = 1, . . . , N ums ) 2: SmpDur = (N − 1) × Ts
4: end for 3: if SmpDur < cmeasure × Tinit then
5: for i ∈ {1, . . . , N umT − 1} do 4: Titer = Tinit
6: M eanprev = mean(Smpi ) 5: SmpDurnext = cmeasure × Tinit − SmpDur
7: M eanback = mean(Smpi+1 ) 6: end if
8: {GaGrp1 , . . . , GaGrpN umG } =Gauss(Smpi , N umG) 7: tstart = max(0, (SmpDur − (2 + ceval × step) × Tinit ))
where GaGrpj = {idxj,1 , . . . , idxj,N umj } 8: while (SmpDur − tstart )/Tinit ≥ cmeasure do
9: for j ∈ {1, . . . , N umG} do 9: istart = 1 + btstart /Ts c
10: Grpprev = {ssi,idxj,1 , . . . , ssi,idxj,N umj } 10: SubSmp = {sistart , . . . , sN }
11: Grpback = {ssi+1,idxj,1 , . . . , ssi+1,idxj,N umj } 11: Tj , errj =Algorithm1(SubSmp, Ts )
12: RelV alP revj = mean(Grpprev ) − M eanprev 12: tstart = tstart + step × Tinit
13: RelV alBackj = mean(Grpback ) − M eanback 13: end while
14: grperrj =SMAPE(RelV alP revj , RelV alBackj ) 14: T = {Tj }, Err = {errj }
15: end for 15: Find min(Err) = errk then Titer = Tk
16: W eight = {|GaGrp1 |, . . . , |GaGrpN umG |} 16: Dif f = abs((max(T ) − min(T ))/mean(T ))
17: erri = avg({grperr1 , . . . , grperrN umG }, W eight) 17: if Dif f < Dif fthreshold then
18: end for 18: SmpDurnext = −1
19: return ErrT = mean({err1 , . . . , errN umT −1 }) 19: else

Note: Gauss = the Gaussian mixture model clustering algorithm 20: SmpDurnext =
dSmpDur/max(T )e × max(T ) − SmpDur
21: end if
22: return Titer , SmpDurnext
times Tinit (cmeasure is set to 2 empirically), it is too short
to do a rolling calculation, so we calculate the required sam-
pling duration, skip rolling calculation, and return (line2- Algorithm 4 Adaptive feature measurement
6). If the sampling duration is long enough, we get some Input: features to be measured F eatureN ame =
updated samples by setting tstart (line 7). Samples before {name1 , . . . , namen }, the feature for iterative period detec-
tstart are ignored because they may be outdated. We set tion Featuredect
step = 0.5 and ceval = 6.5 according to our experiments. Output: feature data Feature = {f eature1 , . . . , f eaturen }
Then, the rolling period calculation begins. In each iteration,
1: Start measurement for F eatureN ame and Featuredect
we calculate the period (line 9-11) and increase tstart to
2: SmpDurnext = SmpDurinit
exclude more outdated samples for the next iteration (line
3: while SmpDurnext > 0 do
12). With the rolling period calculation, we get a sequence of
4: Delay SmpDurnext
periods with errors (line 14). If an ML application runs in the
5: Collect sampling sequence of Featuredect
regular iteration phase, these periods should be similar. If
6: Titer , SmpDurnext =Algorithm3(Featuredect , Ts )
the difference of these periods is less than the threshold, set
7: end while
SmpDurnext = −1 to indicate stop sampling. Otherwise,
8: Restart measurement for F eatureN ame
calculate SmpDurnext (line 16-21). The best period, namely
9: Delay Titer then stop measurement
Titer , is the period with the minimal error (line 15). If
10: Collect feature data Feature
SmpDurnext > 0, we ignore Titer , keep sampling, and call
11: return Feature
Algorithm 3 again after SmpDurnext . Otherwise, we use
Titer for feature measurement (Section 4.2).
power, SM utilization, and memory utilization can meet
4.2 Micro-intrusive Online Feature Measurement this requirement. However, the performance counters can-
Based on the robust period detection, we propose the not meet this requirement because its minimum sampling
adaptive feature measurement algorithm (Algorithm 4). The granularity is a CUDA kernel. According to our experi-
feature measurement causes overhead and extends Titer mental results, we use the composite feature of power, SM
compared with the normal run of the application. So, we utilization, and memory utilization as Featuredect , whose
adaptively detect Titer of applications while the measure- traces show more obvious periodicity. In line 6, we call
ment is running (line 1-7). In line 5, we collect Featuredect Algorithm 3 to detect Titer and determine whether Titer is
after a delay of SmpDurN ext . We use the sampling se- stable. If SmpDurN ext > 0, we redo line 4-6 with the new
quence of Featuredect to form the curve for the period SmpDurN ext . Otherwise, we think Titer is stable and restart
detection, so we must sample Featuredect continuously and the feature measurement (line 8). After a delay of Titer , we
uniformly. On the NVIDIA GPU platform, the instantaneous stop measurement and collect the feature data (line 8-10).
6

We implement micro-intrusive online feature measure- TABLE 2


ment based on Algorithm 4. Users only need to insert the Selected Features
Begin and End APIs of GPOEO at the beginning and end of
their source code, rather than instrument loops. Notation Full name or expression
IPCPct sm inst executed.PctSus
4.3 Training Data Collection and Model Generation l1tex t sectors lookup miss.sum /
L1MissPerInst
sm inst executed.sum
We establish energy consumption and execution time pre- l1tex t sectors lookup miss.sum /
diction models with the boosting machine learning method. L1MissPct (l1tex t sectors lookup miss.sum +
l1tex t sectors lookup hit.sum)
Our work aims to find the optimal SM and memory clock lts t sectors lookup miss.sum /
gears that optimize the objective function. Therefore, these L2MissPerInst
sm inst executed.sum
four models only need to predict the relative energy con- lts t sectors lookup miss.sum /
sumption and execution time relative to the NVIDIA default L2MissPct (lts t sectors lookup miss.sum +
lts t sectors lookup hit.sum)
scheduling strategy rather than the absolute energy and ALUPct sm inst executed pipe alu.PctSus
time [4]. ADUPct sm inst executed pipe adu.PctSus
To train the models defined in Equation 1 and 2, we FP16Pct sm inst executed pipe fp16.PctSus
FMAPct sm inst executed pipe fma.PctSus
define and collect four training data sets: FP64Pct sm inst executed pipe fp64.PctSus
XUPct sm inst executed pipe xu.PctSus
EngT rSM = {EngSMji }, T imeT rSM = {T imeSMji } TNSPct sm inst executed pipe tensor.PctSus
CBUPct sm inst executed pipe cbu.PctSus
EngT rM em = {EngM emij }, T imeT rM em = {T imeM emij } LSUPct sm inst executed pipe lsu.PctSus
TEXPct sm inst executed pipe tex.PctSus
where EngSMji or T imeSMji represents a piece of training UNIPct sm inst executed pipe uniform.PctSus
data collected from application i under the SM clock gear Note: PctSus = sum.pct of peak sustained active
SM gearj and the memory clock controlled by the NVIDIA
default scheduling strategy. EngM emij or T imeM emij is
collected under the memory clock gear M emgearj and According to related work [4], DRAM and L2 cache
i i
the optimal SM clock gear. engdef ault and timedef ault are miss information may help predict energy and time. How-
the default energy consumption and execution time under ever, profiling performance counters of DRAM and FBPA
the NVIDIA default scheduling strategy. Each input vector need several kernel replays which cannot be implemented
wSM ij or wM em ij contains SM gearj or M emgearj and the online. We design several composite metrics as alterna-
feature vector Featurei . For each application, the feature tives, also shown in Table 2. The L1MissPerInst or
vector Featurei is measured under the reference clock con- L2MissPerInst means the number of L1 or L2 cache miss
figuration. per instruction. The L1MissPct or L2MissPct means the
percentage of L1 or L2 cache miss. In conclusion, we use
EngSMji = {wSM ij , engSM ij /engdef
i
ault } metrics listed in Table 2 as features.
T imeSMji = {wSM ij , timeSM ij /timeidef ault }
wSM ij = {SM gearj , Featurei } 4.3.2 Benchmark Selection and Training Data Collection

EngM emij = {wM em ij , engM em ij /engdef


i We use the PyTorch Benchmarks [18], which contain
ault }
over 40 mini ML applications, for training. We select
T imeM emij = {wM em ij , timeM em ij /timeidef ault } the AIBench Training Component Benchmark [13], the
wM em ij = {M emgearj , Featurei } benchmarking-gnns [16], a ThunderSVM [19] workload,
and a ThunderGBM [20]–[22] workload as the testing set.
4.3.1 Feature Selection The benchmarking-gnns is a graph neural networks (GNN)
As mentioned in Section 2.2.4, the high-level features, such benchmarking framework including seven datasets (CLB,
as power, SM utilization, and memory utilization, cannot CSL, SBM, TSP, TU, MLC, and SP) and nine models.
provide enough information for modeling. So we introduce For each application in the training set, we profile fea-
performance counter metrics as input features. tures under the reference clock configuration. We measure
The CUPTI [15] supports profiling over 1100 per- the actual energy (engSM ij ) and time (timeSM ij ) under all
formance counters on NVIDIA high-end GPUs, such as available SM clock gears and the memory clock controlled
RTX2080Ti and RTX3080Ti. Inspired by Arafa’s work by the NVIDIA default scheduling strategy. We also mea-
[7], we treat each Parallel Thread Execution (PTX) in- sure the actual energy (engM em ij ) and time (timeM em ij )
struction as a basic unit of energy consumption and under all available memory clock gears and the optimal
focus on performance counter metrics that reflect the SM clock gear. Feature, energy, and time data construct four
density of different types of PTX instructions. We list training data sets (EngT rSM , T imeT rSM , EngT rM em , and
the selected metrics in Table 2. A metric with prefix T imeT rM em ). We conduct each measurement ten times and
sm__inst_executed_pipe_ represents a type of PTX in- take the average to reduce the training data noise.
struction. The suffix pct_of_peak_sustained_active
means the percentage of actual mean instruction throughput 4.3.3 Prediction model
to the theoretical maximum sustained instruction through- We construct energy and time prediction models with the
put in an activity clock cycle [15]. XGBoost [12], a widely used supervised machine learning
7

120% GPOEO Period Detection Errors ODPP Period Detection Errors


100%
80%

95.5

96.4
83.5
60%

73.5
68.5
68.3
67.2

66.9
40%
51.0

12.7 MLC_GatedGCN

12.7 SP_GraphSage
20%

11.7 SBM_GraphSage

10.9
10.5 SP_GCN
7.2 SBM_GAT

6.7 SP_GatedGCN

6.5 TSP_MoNet
6.1

5.4
5.0 TSP_GCN
4.9
4.6
4.3

3.6 SBM_GIN
3.1 MLC_3WLGNN
3.0

2.7
2.6

2.6 SP_MoNet

2.6 TSP_GAT

2.7 TSP_GraphSage
2.4

2.5
2.3
2.4

2.1
2.3 SBM_RingGNN
2.0

1.9 MLC_GIN
1.8 CLB_GIN

1.8 TSP_GIN
1.8 SBM_MLP
1.5 CLB_GraphSage

1.5 SP_MLP
1.4

1.4 SP_RingGNN
1.5
1.3 CLB_GCN

1.5 SBM_3WLGNN
1.2 MLC_GAT

1.2 TSP_MLP
1.2

1.0 TSP_GatedGCN

0.8
0.8

0.8
0.8 MLC_MoNet

0.7 SBM_MoNet

0.8 SP_GAT
0.7 SBM_GCN
1.0 CLB_MLP

0.0 SBM_GatedGCN

0.4

0.2
0.1
0.3

0.1 SP_3WLGNN
0.0

0.0
0.2
0%
CLB_GAT
CLB_GatedGCN

Fig. 5. The period detection errors of GPOEO and ODPP. GPOEO gives lower errors compared to ODPP.

model improved from the gradient tree boosting algorithm. ThunderSVM [19] workload, and the ThunderGBM [20]–
Combined with our scenario, we apply the XGBoost to build [22] workload. We can not evaluate their execution time
four prediction models shown in Equation 1 and 2. The or energy consumption with data measured in one period.
model construction processes are similar, so we take the In this case, we fix the measurement time interval and the
energy consumption model EngM dlSM as an example to use measured mean number of instructions executed per
introduce these processes. For EngM dlSM , the XGBoost second (IPS) and power to evaluate the execution time and
weights and sums several regression trees and can be de- energy consumption. If the number of instructions in the
fined as follows: program is Instsum , the execution time can be calculated
XK as time = Instsum /IP S , and the energy consumption can
d i =
eng SM j fk (wSM ij ), fk ∈ F be calculated as energy = power ∗ time = Instsum ∗
k=1 power/IP S . Then we can evaluate different clock frequency
d i is the predicted energy consumption of the configurations.
where eng SM j Based on these analyses, we apply GPOEO to aperi-
input vector wSM ij , which contains the features of applica- odic applications. We first measure performance counter
tion i and the SM gear j . fk represents the k th regression metrics in the fixed time interval and predict the optimal
tree, and K is the number of regression trees. We use the clock frequency configuration. In the online local search, we
training datasets EngT rSM to train the model EngM dlSM . measure and calculate timedef ault and energydef ault under
During training, we use the grid search method to tune the the NVIDIA default scheduling strategy as the baseline.
hyper-parameters such as the minimum loss reduction, the Then, we follow the golden-section search strategy and
maximum depth of a tree, the minimum sum of instance collect timei and energyi under different clock frequency
weight, and the maximum number of nodes to be added. configurations near the predicted optimal clock frequency
4.3.4 Online Local Search configuration. Finally, we compare these timei and energyi
with the baseline to find the optimal clock frequency config-
To fix minor errors caused by multi-objective predicted
uration.
models, we perform a local search around the predicted
optimal clock configuration. We use online detected period
and energy data measured during one period to evaluate 5 E VALUATION
different clock frequencies and find the actual optimal clock We evaluate our GPOEO system in this section. We first
configuration. We first conduct a local search for the mem- introduce the experimental step. Then, we evaluate the
ory clock because wrong memory clock frequencies (too accuracy of the period detection algorithm and energy-
low) can lead to severe slowdowns. Based on the optimal performance prediction models. Finally, we analyze the
memory clock frequency, we conduct another local search online energy-saving results on various ML benchmarks.
for the SM clock. According to our experimental results and
Schwarzrock’s work [17], the relationship between energy 5.1 Experimental Setup
optimization objectives and clock gears is generally a convex
function. Based on this observation, we use the golden- 5.1.1 Hardware and Software Platforms
section search method [23] to accelerate our local search. Our experimental platform is a GPU server equipped with
We first find a gear with a worse objective value on each one NVIDIA RTX3080Ti GPU, one AMD 5950X CPU, and 64
side of the predicted optimal gear to determine the search GB memory. The software environment is NVIDIA Driver
interval. Then we follow the classic golden section-search 470.57 and CUDA 11.3. RTX3080Ti supports continuously
process [23] to attempt different clock gears. Due to possible adjustable SM clock frequency, from 210 MHz to 2,025
errors of measured energy and periods, we fit the attempted MHz, and the step is 15 MHz. We find that some higher
points to obtain a convex function. Then we use the convex frequencies are not practical or stable. Under lower frequen-
function to determine the optimal clock gear. cies, applications cannot save energy while suffering severe
slowdowns. Therefore, we only consider the frequencies in
4.3.5 Optimize aperiodic applications the middle part, which can run stably and may improve
‘ The power and utilization traces of some applications are energy efficiency. We treat each SM clock frequency as an
aperiodic, such as applications in CSL and TUs datasets, the SM clock gear, from SM gear16 = 450 MHz to SM gear114
8

70% 140%
60% ODPP Period Detection Errors GPOEO Period Detection Errors 120% ODPP Period Detection Errors GPOEO Period Detection Errors
50% 100%
40% 80%
30% 60%
20% 40%
10% 20%
0% 0%
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 lt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 lt
102 108 114 120 126 132 138 144 150 156 162 168 174 180 186 192Defau 102 108 114 120 126 132 138 144 150 156 162 168 174 180 186 192Defau
SM Clock Frequencies (MHz) SM Clock Frequencies (MHz)

Fig. 6. Period detection errors of ODPP and GPOEO on the CLB GAT Fig. 8. Period detection errors of ODPP and GPOEO on the
application under different SM clock frequencies TSP GatedGCN application under different SM clock frequencies

100% 20%
ODPP Period Detection Errors GPOEO Period Detection Errors 15%
80% 10%
60% 5%
0%
40% -5%
20% -10%
-15%
0% -20%
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 lt 07 65 22 80 37 95 52 10 67 25
102 108 114 120 126 132 138 144 150 156 162 168 174 180 186 192Defau 450-6 607-7 765-9 922-10 1080-121237-131395-151552-171710-181867-20
SM Clock Frequencies (MHz) SM Clock Frequency Ranges (MHz)
(a) Energy consumption prediction errors
Fig. 7. Period detection errors of ODPP and GPOEO on the 40%
SBM 3WLGNN application under different SM clock frequencies 30%
20%
10%
0%
-10%
= 1,920 MHz. RTX3080Ti supports 5 global memory clock -20%
07 65 22 80 37 95 52 10 67 25
frequencies: M emgear0 = 450 MHz, M emgear1 = 810 MHz, 450-6 607-7 765-9 922-10 1080-121237-131395-151552-171710-181867-20
M emgear2 = 5,001 MHz, M emgear3 = 9,251 MHz, and SM Clock Frequency Ranges (MHz)
M emgear4 = 9,501 MHz. We select SM gear106 = 1,800 (b) Execution time prediction errors
MHz and M emgear3 = 9,251 MHz as reference frequencies,
which are used in performance counter profiling. Fig. 9. Prediction errors by varying the SM clock (grouped by different
SM clock ranges)
5.1.2 Benchmarks
We train our machine learning models using data collected 15%
10%
from PyTorch Benchmarks [18]. We then test the trained 5%
0%
models on AIBench [13] and benchmarking-gnns [16]. In -5%
addition to deep neural networks, we also use two classical -10%
-15%
ML workloads, the ThunderSVM [19] and ThunderGBM -20%
COLLAB CSL molecules SBMs superpixels TSP TUs
[20]–[22], in our evaluation. Applications & Data Sets
(a) Energy consumption prediction errors
5.2 Accuracy and Sensitivity of Period Detection 20%
15%
10%
We analyze our period detection algorithm on 34 differ- 5%
ent ML applications. Figure 5 shows the period detection 0%
-5%
errors of GPOEO and ODPP [11] using the NVIDIA de- -10%
-15%
fault scheduling strategy. GPOEO is far more accurate than
COLLAB CSL molecules SBMs superpixels TSP TUs
ODPP. The mean period error of GPOEO is 1.72%, while Applications & Data Sets
the mean error of ODPP is 23.16%. The maximum error of (b) Execution time prediction errors
GPOEO is 7.2%, and 32 errors are within 5%. For ODPP,
nine errors are above 50%, nine errors are among 5-13%, and Fig. 10. Prediction errors by varying the SM clock (grouped by different
sixteen errors are within 5%. GPOEO is more accurate than datasets)
ODPP on 29 applications. For the other three applications,
the accuracy differences are within 1%.
disturbed by high-frequency noise, only detects the local
Figure 6, 7, and 8 show the periods detection errors
short period, and ignores the real global period.
of GPOEO and ODPP under varying SM clock frequen-
cies and fixed memory clock frequency. GPOEO exhibits
stable low errors (within 5% for most cases). This phe- 5.3 Accuracy of Energy and Execution Time Prediction
nomenon indicates that GPOEO is sensitive to clock fre- In this section, we evaluate four prediction models. We
quency changes. However, ODPP shows poor stability. On measure selected features (Table 2) during one iteration of
CLB GAT and SBM 3WLGNN, ODPP gains huge errors each ML application under the reference SM and memory
under three and ten SM clock frequencies, respectively. clock frequency (1800 MHz and 9251MHz). Then we use
This instability causes inaccurate prediction models, which these features and different SM and memory clock frequen-
ODPP builds online. For TSP GatedGCN, ODPP gains huge cies as input of our multi-objective models to predict the
errors (nearly 100%) under all frequencies because it is relative energy consumption and execution time relative
9

10% cation datasets. Each group contains 636 to 954 predicted


5% values depending on the number of applications included
0% in the datasets (six to nine applications). For all application
-5% datasets, most energy and execution time prediction errors
-10% 405 MHz 810 MHz 5001 MHz 9251 MHz 9501 MHz are less than 5%. This phenomenon demonstrates that our
Memory Clock Frequency Ranges (MHz)
SM clock models are applicable to all these applications.
(a) Energy consumption prediction errors For the memory clock models, we collect 550 predic-
20% tion values in total (2 objectives, 55 applications, and 5
10% memory clock frequencies). The mean prediction errors of
0% energy consumption and execution time are 2.72% and
-10%
2.31%, respectively. Figure 11 shows the prediction errors of
-20%
405 MHz 810 MHz 5001 MHz 9251 MHz 9501 MHz energy consumption and execution time separately grouped
Memory Clock Frequency Ranges (MHz) by different memory clock frequencies. For all clock fre-
(b) Execution time prediction errors quencies, most energy and time prediction errors are less
than 5%. Figure 12 shows the prediction errors grouped by
Fig. 11. Prediction errors by varying the memory clock (grouped by different application datasets. For all application datasets,
different memory clock frequencies)
most prediction errors are also within 5%. These phenomena
demonstrate that our memory clock models are applicable
10% to all memory frequencies and all these applications.
5%
0%
5.4 Results of Online Optimization
-5%
-10% COLLAB CSL molecules SBMs superpixels TSP TUs
Our multi-objective models predict the energy consumption
Applications & Data Sets and execution time of one iteration under different SM and
(a) Energy consumption prediction errors memory clock gears. Users can use diverse energy efficiency
20% objectives to select the predicted optimal energy gears and
10% guide local search to find the actual optimal energy gears. In
0% this experiment, we set the objective function to minimize
-10% the energy consumption within the slowdown constraint of
-20% 5%. For each ML training application, we use the energy
COLLAB CSL molecules SBMs superpixels TSP TUs
Applications & Data Sets consumption and execution time under the NVIDIA default
(b) Execution time prediction errors scheduling strategy as the baseline. Then we run these
applications with our GPOEO system to optimize the energy
Fig. 12. Prediction errors by varying the memory clock (grouped by consumption online. We also implement the ODPP [11] as a
different datasets) comparison.

5.4.1 Medium benchmark suite


to the NVIDIA default scheduling strategy. Note that for Figure 13 shows the online optimization results of the
the SM clock models EngM dlSM and T imeM dlSM , we AIBench, the ThunderSVM workload, and the Thun-
let the NVIDIA default scheduling strategy control the derGBM workload. We use energy saving, slowdown (ex-
memory clock and then predict and select the optimal SM ecution time increase), and ED2 P (ED2 P = Energy ×
clock frequency. For the memory models EngM dlM em and T ime2 ) saving to evaluate GPOEO and ODPP [11]. GPOEO
T imeM dlM em , we assume that the SM clock is already set achieves an average energy saving of 14.7% and an average
to the optimal frequency and then predict and select the ED2 P saving of 6.8%, with an average execution time
optimal memory clock frequency. increase of 4.6%. Especially, GPOEO achieves significant
For the SM clock models, we collect 11,660 predic- energy saving (≥ 20%) on four out of sixteen applications.
tion values in total (2 objectives, 55 applications, and On AI I2T, GPOEO achieves 29.5% energy saving and 0.1%
106 SM clock frequencies). The mean prediction errors of slowdown. Table 3 shows the online optimization process
energy consumption and execution time are 3.05% and for SM and memory clocks. On AI I2T, the error of our
2.09%, respectively. Next, we analyze the distributions of multi-objective prediction models is only -2 SM gears. This
errors sorted by SM clock frequency ranges and application error is eliminated by the online local search within five
datasets. Figure 9 shows the prediction errors of energy steps so that the vast energy saving can be achieved. For
consumption and execution time separately grouped by AI 3DFR, AI 3DOR, AI I2IP, and AI TS, GPOEO obtains
different SM clock frequency ranges. Each clock frequency considerable energy savings (18-24%) with reasonable slow-
range contains 550 or 605 prediction values depending on downs (less than 8%). The online optimization processes
the number of energy gears included in the clock range of these applications are similar to AI I2T. The prediction
(ten or eleven gears). For all clock ranges, most energy errors of SM gear are relatively small (within 11 SM gears),
and time prediction errors are less than 5%. Most scattered and the online local search reduces these errors in three to
errors are within 20%. This phenomenon proves that our five steps.
energy prediction model and execution time prediction AI FE, AI ICMP, AI IGEN, AI LRK, AI OBJ, AI S2T,
model are accurate for different SM clock frequencies. Figure and AI ST have medium energy-saving opportunities (10-
10 shows the prediction errors grouped by different appli- 21%) with the slowdown constraint of 5%. GPOEO gains
10

30
ED2P Saving (%) Slowdown (%) Energy Saving (%)
20
10
0
-10 ODPP
GPOEO
-20

20 ODPP
GPOEO
10
0
30
20
10
0
-10 ODPP
-20 GPOEO
-30
AI_3DFR

AI_3DOR

AI_FE

AI_I2IC

AI_I2IP

AI_I2T

AI_ICMP

AI_IGEN

AI_LRK

AI_OBJ

AI_S2T

AI_ST

AI_T2T

AI_TS

TGBM

TSVM
Fig. 13. Energy savings, slowdown, and ED2P saving compared to the NVIDIA default scheduling strategy on AIBench and traditional ML
applications

TABLE 3
Online optimization process for SM and memory clock on AIBench

3DFR 3DOR FE I2IC I2IP I2T ICMP IGEN LRK OBJ S2T ST T2T TS
Oracle SM Gear 104 102 109 104 91 94 107 71 88 100 108 39 100 107
Prediction Error (SM gear) -11 5 -13 2 3 -2 -13 22 24 -4 -15 11 5 -5
Search Error (SM gear) -2 -3 -8 1 -2 0 -2 -3 -2 0 -7 4 -2 -1
# of Search Steps (SM) 5 4 4 3 4 3 6 8 9 4 5 5 4 4
Oracle Mem clock (MHz) 9501 9251 9501 9501 9251 9251 9251 405 5001 9251 9251 810 9251 9501
Predicted Mem clock 9501 9501 9251 9501 9501 9501 9501 405 5001 9251 9501 405 9251 9501
Searched Mem clock 9501 9501 9501 9501 9501 9251 9251 405 5001 9251 9251 810 9251 9501
# of Search Steps (Mem) 2 2 3 2 2 2 2 3 3 3 2 3 3 2

medium energy savings (8-14%) with reasonable slow- nine applications, while ODPP only achieves the same goal
downs (2-7%) on these seven applications. Two reasons on four applications. For GPOEO, only iterations after opti-
cause GPOEO does not make full use of energy-saving po- mization processes are guaranteed to meet the performance
tentials. For AI FE and AI S2T, the period detection and fea- loss constraint, and the previous iterations may result in a
ture measurement get inaccurate data due to some abnormal violation of the performance loss constraint.
iterations. Therefore prediction errors are significant and
local search can not eliminate these errors. For the conve-
nience of experiments, we just train each machine learning
model for a few iterations, and only iterations after opti-
Compared to ODPP, our approach can improve the
mization processes can enjoy the optimal SM gear. However,
energy saving by 9.2%, improve the ED2 P saving by
in real-life scenarios, the model will be trained on many
20.4%, and reduce the slowdown by 5.0% on average.
more iterations, for which our approach can give higher en-
GPOEO performs better than ODPP on all sixteen appli-
ergy saving. GPOEO does not perform well on AI I2IC and
cations. Especially, our approach gains both larger energy
AI T2T. These two applications are computation-intensive
savings and lighter slowdowns on eight applications. For
with little energy-saving opportunities (1% and 4%). For
AI 3DFR, AI GEN, AI LRK, and AI S2T, GPOEO achieves
non-periodical applications TGBM and TSVM, GPOEO also
more considerable energy savings with similar slowdowns.
achieves good results, saving 17.9% and 12.0% energy with
Our method gains similar energy savings with lighter slow-
slowdowns of 3.5% and 4.7%, respectively. As for the mem-
downs on AI I2IC and TSVM. ODPP builds two piecewise
ory clock, the oracle frequencies of thirteen applications
linear models online to predict energy and time. The accu-
are 9,251 MHz or 9,501 MHz. Our method finds oracle
racy of these two models is highly dependent on the period
frequencies on eleven applications and finds nearby 9,501
detection accuracy. Poor period detection accuracy of ODPP
MHz on the other two applications within two or three
causes significant prediction errors, so ODPP shows heav-
search steps. In fact, the energy consumption and execution
ier slowdowns and less energy saving. On non-periodical
time are similar under these two memory clock frequen-
applications TGBM and TSVM, ODPP shows much worse
cies. GPOEO also finds low oracle memory frequencies
results than GPOEO. This phenomenon demonstrates that
on AI IGEN, AI LRK, and AI ST within three steps. Our
ODPP cannot handle non-periodical applications, while
method meets the performance loss constraint of 5% on
GPOEO can.
11

Slowdown (%) Energy Saving (%)


30 ODPP
20 GPOEO
10
0
-10
30 ODPP
GPOEO
20
10
0
30
ED2P Saving (%)

20 ODPP
GPOEO
10
0
-10
-20
CLB_GAT
CLB_GCN
CLB_GIN
CLB_GatedGCN
CLB_GraphSage
CLB_MLP
CSL_3WLGNN
CSL_GAT
CSL_GCN
CSL_GIN
CSL_GatedGCN
CSL_GraphSage
CSL_MLP
CSL_MoNet
CSL_RingGNN
SBM_3WLGNN
SBM_GAT
SBM_GCN
SBM_GIN
SBM_GatedGCN
SBM_GraphSage
SBM_MLP
SBM_MoNet
SBM_RingGNN
TSP_GAT
TSP_GCN
TSP_GIN
TSP_GatedGCN
TSP_GraphSage
TSP_MLP
TSP_MoNet
TU_3WLGNN
TU_GAT
TU_GCN
TU_GIN
TU_GatedGCN
TU_GraphSage
TU_MLP
TU_MoNet
TU_RingGNN
MLC_3WLGNN
MLC_GAT
MLC_GCN
MLC_GIN
MLC_GatedGCN
MLC_GraphSage
MLC_MoNet
MLC_RingGNN
SP_3WLGNN
SP_GAT
SP_GCN
SP_GatedGCN
SP_GraphSage
SP_MoNet
SP_RingGNN
Fig. 14. Energy saving, slowdown, and ED2P saving compared to the NVIDIA default scheduling strategy on benchmarking-gnns applications

7% 5.5 Overhead Evaluation


6% Energy Overhead
5% Performance Overhead To evaluate the overhead of the GPOEO system, we run
4%
3% AIBench applications with the entire GPOEO system except
2% adjusting SM and memory clocks. We also manually set the
1% number of local search steps shown in Table 3. As shown in
0%
AI_3DFR
AI_3DOR
AI_FE
AI_I2IC
AI_I2IP
AI_I2T
AI_ICMP
AI_IGEN
AI_LRK
AI_OBJ
AI_S2T
AI_ST
AI_T2T
AI_TS

Figure 15, all energy and time overheads are within 4%.

6 R ELATED W ORK
Fig. 15. Energy and time overhead of GPOEO on AIBench
There is a growing interest in optimizing machine learning
workloads for performance and energy efficiency. Many
works concentrate on CPU energy optimization [17], [24]–
5.4.2 Large benchmark suite [29]. Some of these prior works [24]–[26] use offline methods
to model energy and performance, while others [17], [27],
[28] design runtime to adjust the CPU clock or enforce
We also use the benchmarking-gnns suite to evaluate power capping [29]. These methods rely heavily on run-
our method and ODPP [11] extensively. Figure 14 shows time CPU-specific performance counter profiling and are
the online optimization results of 55 applications in the not transferable to GPUs, but the prediction models [26],
benchmark-gnns. GPOEO achieves an energy saving of [28], [30] they have used may also apply to GPUs.
16.6% and an ED2 P saving of 7.8%, with an execution Profiling workloads using GPU performance counter
time increase of 5.2% on average. In comparison, ODPP during program execution time can be expensive. For this
gains an energy saving of 6.1% and an ED2 P saving of - reason, most GPU energy optimization works have to rely
4.5%, with an execution time increase of 5.6% on average. on offline profiling information [4], [6], [7], [31]. Most offline
GPOEO saves ED2 P on 50 applications, while ODPP only works [4], [6], [31] focus on modeling and predicting energy
gains ED2 P saving on 19 applications. Our method meets and performance under different hardware configurations.
the performance loss constraint of 5% on 22 applications. Other works study GPU energy and performance from
ODPP achieves the same goal on 27 applications because it the view of assembly instructions [6], [7]. These works
does not profile performance counters and has a lower mea- require tedious offline profiling and analysis for each new
surement overhead. With the help of performance counter application or even different inputs. In contrast, our work
features, GPOEO saves more energy than ODPP on 48 can automatically optimize new applications online once
applications. Applications in CSL and TU datasets are non- the prediction models are established and well trained.
periodical. ODPP can not handle these applications and Although these offline jobs are inconvenient to use, the
only saves 0.73% of energy with a slowdown of 1.32% on energy features [4], [31] they have used are instructive for
average. On the contrary, GPOEO gains an energy saving of GPU online energy consumption optimization.
20.0% with a slowdown of 5.2% on average. These phenom- A few studies [10], [11] realize GPU online energy con-
ena show that GPOEO has better performance and broader sumption optimization. Majumdar et al. [10] exploit low-
applicability than ODPP. overhead fine-grained profiling functions supported by the
12

APU platform. It obtains the stream of energy, execution the Shenzhen Science and Technology Research and Devel-
time, throughput, and performance counter metrics at ker- opment Foundation (JCYJ20190806143418198), the National
nel granularity without worrying about time and energy Natural Science Foundation of China (NSFC) (61872110 and
overhead. Therefore, it can catch energy-saving chances at 61872294), the Fundamental Research Funds for the Central
kernel granularity and save considerable energy with low Universities (Grant No. HIT.OCEF.2021007), the Peng Cheng
overhead. Majumdar’s work cannot be transplanted to the Laboratory Project (PCL2021A02), and the CCF-Huawei
NVIDIA GPU platform, considering the significant amount fund (CCF-HuaweiHP2021002). Professor Weizhe Zhang is
of code instrumentation (around each kernel) and the high the corresponding author.
overhead of continuously profiling. Our approach targets
the NVIDIA GPU platform with high runtime profiling
overhead (with a slowdown of > 8% and energy overhead
R EFERENCES
of > 10%). We profile performance counter metrics in only [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol.
521, no. 7553, pp. 436–444, 2015.
one iteration period to minimize overhead. ODPP [11] pro- [2] M. Z. Alom, T. M. Taha, C. Yakopcic, S. Westberg, P. Sidike, M. S.
poses an online dynamic power-performance management Nasrin, M. Hasan, B. C. Van Essen, A. A. Awwal, and V. K.
framework for GPUs using coarse-grained features which Asari, “A state-of-the-art survey on deep learning theory and
cannot provide enough information to model energy con- architectures,” Electronics, vol. 8, no. 3, p. 292, 2019.
[3] J. Guerreiro, A. Ilic, N. Roma, and P. Tomas, “Gpgpu power mod-
sumption and performance accurately, as mentioned in Sec- eling for multi-domain voltage-frequency scaling,” in 2018 IEEE
tion 2.2.4. Besides, the period detection algorithm of ODPP International Symposium on High Performance Computer Architecture
is unstable and error-prone. Due to the two weaknesses, (HPCA). IEEE, 2018, pp. 789–800.
[4] ——, “Modeling and decoupling the gpu power consumption for
ODPP exhibits poor energy-saving results and can give
cross-domain dvfs,” IEEE Transactions on Parallel and Distributed
severe slowdown in our evaluation. Our approach exploits Systems, vol. 30, no. 11, pp. 2494–2506, 2019.
hardware performance counter metrics and utilizes a robust [5] ——, “Dvfs-aware application classification to improve gpgpus
period detection algorithm to tackle these two weaknesses, energy efficiency,” Parallel Computing, vol. 83, pp. 93–117, 2019.
[6] ——, “Gpu static modeling using ptx and deep structured learn-
leading to better performance. ing,” IEEE Access, vol. 7, pp. 159 150–159 161, 2019.
Energy and performance optimization built upon ma- [7] Y. Arafa, A. ElWazir, A. ElKanishy, Y. Aly, A. Elsayed, A.-H.
chine learning techniques have been demonstrated to be Badawy, G. Chennupati, S. Eidenbenz, and N. Santhi, “Verified
promising in various application domains [8], [9], [32]–[47]. instruction-level energy consumption measurement for nvidia
gpus,” in Proceedings of the 17th ACM International Conference on
Some works [8], [9], [41], [42], [44], [46] tune hyperparam- Computing Frontiers, 2020, pp. 60–70.
eters, model structure configurations, or data structures to [8] Y. Wang, Q. Wang, and X. Chu, “Energy-efficient inference service
speed up ML applications or save energy consumption. Our of transformer-based deep learning models on gpus,” in 2020 In-
ternational Conferences on Internet of Things (iThings) and IEEE Green
method can be used together with these application-level Computing and Communications (GreenCom) and IEEE Cyber, Physical
techniques to reduce energy consumption further. CAMEL and Social Computing (CPSCom) and IEEE Smart Data (SmartData)
[43] optimizes the energy of web applications on mobile and IEEE Congress on Cybermatics (Cybermatics). IEEE, 2020, pp.
platforms. It uses the frame rate as the performance index, 323–331.
[9] S. M. Nabavinejad, S. Reda, and M. Ebrahimi, “Batchsizer: Power-
which is only suitable for web applications. Many studies performance trade-off for dnn inference,” in Proceedings of the 26th
[32], [33], [35]–[40], [45] concentrate on accelerating applica- Asia and South Pacific Design Automation Conference, 2021, pp. 819–
tions on heterogeneous platforms. Combining these studies 824.
[10] A. Majumdar, L. Piga, I. Paul, J. L. Greathouse, W. Huang, and
with our method may enable acceleration while saving D. H. Albonesi, “Dynamic gpgpu power management using adap-
energy. Based on the graph neural network, Ye et al. [47] tive model predictive control,” in 2017 IEEE International Sympo-
propose a novel program analysis and modeling method sium on High Performance Computer Architecture (HPCA). IEEE,
to provide comprehensive insight into the program. This 2017, pp. 613–624.
[11] P. Zou, L. Ang, K. Barker, and R. Ge, “Indicator-directed dynamic
method may be helpful to build more accurate energy and power management for iterative workloads on gpu-accelerated
performance prediction models. systems,” in 2020 20th IEEE/ACM International Symposium on Clus-
ter, Cloud and Internet Computing (CCGRID). IEEE, 2020, pp. 559–
568.
7 C ONCLUSION [12] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting
We have presented GPOEO, a new online GPU energy system,” in Proceedings of the 22nd acm sigkdd international conference
on knowledge discovery and data mining, 2016, pp. 785–794.
optimization framework for iterative machine learning (ML) [13] F. Tang, W. Gao, J. Zhan, C. Lan, X. Wen, L. Wang, C. Luo, Z. Cao,
workloads. GPOEO detects iterative periods, measures per- X. Xiong, Z. Jiang et al., “Aibench training: balanced industry-
formance counter features, and predicts the optimal energy standard ai training benchmarking,” in 2021 IEEE International
Symposium on Performance Analysis of Systems and Software (IS-
configurations automatically. We evaluate GPOEO on 71 PASS). IEEE, 2021, pp. 24–35.
ML applications running on an NVIDIA RTX3080Ti GPU. [14] NVIDIA Corporation, “Nvml api reference manual,” https://
GPOEO achieves a mean energy saving of 16.2% at a modest docs.nvidia.com/deploy/nvml-api/index.html, August 2021, ac-
cost of 5.1% performance loss compared to the NVIDIA cessed August 19, 2021.
[15] ——, “Cupti documentation,” https://docs.nvidia.com/cupti/
default scheduling strategy. Cupti/index.html, August 2021, accessed August 19, 2021.
[16] V. P. Dwivedi, C. K. Joshi, T. Laurent, Y. Bengio, and X. Bres-
son, “Benchmarking graph neural networks,” arXiv preprint
ACKNOWLEDGMENTS arXiv:2003.00982, 2020.
This work was supported in part by the National [17] J. Schwarzrock, C. C. de Oliveira, M. Ritt, A. F. Lorenzon, and
Key Research and Development Program of China A. C. S. Beck, “A runtime and non-intrusive approach to optimize
edp by tuning threads and cpu frequency for openmp applica-
(2020YFB1406902), the Key-Area Research and Develop- tions,” IEEE Transactions on Parallel and Distributed Systems, vol. 32,
ment Program of Guangdong Province (2020B0101360001), no. 7, pp. 1713–1724, 2020.
13

[18] pytorch.org, “Pytorch benchmarks,” https://github.com/ [39] J. Fang, C. Huang, T. Tang, and Z. Wang, “Parallel programming
pytorch/benchmark, August 2021, accessed August 19, 2021. models for heterogeneous many-cores: a comprehensive survey,”
[19] Z. Wen, J. Shi, Q. Li, B. He, and J. Chen, “ThunderSVM: A fast SVM CCF Transactions on High Performance Computing, vol. 2, no. 4, pp.
library on GPUs and CPUs,” Journal of Machine Learning Research, 382–400, 2020.
vol. 19, pp. 797–801, 2018. [40] P. Zhang, J. Fang, T. Tang, C. Yang, and Z. Wang, “Auto-tuning
[20] Z. Wen, J. Shi, B. He, J. Chen, K. Ramamohanarao, and Q. Li, streamed applications on intel xeon phi,” in 2018 IEEE International
“Exploiting gpus for efficient gradient boosting decision tree train- Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2018,
ing,” IEEE Transactions on Parallel and Distributed Systems, vol. 30, pp. 515–525.
no. 12, pp. 2706–2717, 2019. [41] D. Stamoulis, E. Cai, D.-C. Juan, and D. Marculescu, “Hyper-
[21] Z. Wen, J. Shi, B. He, Q. Li, and J. Chen, “ThunderGBM: Fast power: Power-and memory-constrained hyper-parameter opti-
GBDTs and random forests on GPUs,” Journal of Machine Learning mization for neural networks,” in 2018 Design, Automation & Test
Research, vol. 21, 2020. in Europe Conference & Exhibition (DATE). IEEE, 2018, pp. 19–24.
[22] Z. Wen, J. Shi, B. He, J. Chen, K. Ramamohanarao, and Q. Li, [42] V. S. Marco, B. Taylor, Z. Wang, and Y. Elkhatib, “Optimizing
“Exploiting gpus for efficient gradient boosting decision tree train- deep learning inference on embedded systems through adaptive
ing,” IEEE Transactions on Parallel and Distributed Systems, vol. 30, model selection,” ACM Transactions on Embedded Computing Sys-
no. 12, pp. 2706–2717, 2019. tems (TECS), vol. 19, no. 1, pp. 1–28, 2020.
[43] J. Ren, L. Yuan, P. Nurmi, X. Wang, M. Ma, L. Gao, Z. Tang,
[23] H. Walser, The golden section. MAA, 2001.
J. Zheng, and Z. Wang, “Camel: Smart, adaptive energy optimiza-
[24] M. Endrei, C. Jin, M. N. Dinh, D. Abramson, H. Poxon, L. DeRose, tion for mobile web interactions,” in IEEE INFOCOM 2020-IEEE
and B. R. de Supinski, “Statistical and machine learning models Conference on Computer Communications. IEEE, 2020, pp. 119–128.
for optimizing energy in parallel applications,” The International [44] A. E. Brownlee, J. Adair, S. O. Haraldsson, and J. Jabbo, “Ex-
Journal of High Performance Computing Applications, vol. 33, no. 6, ploring the accuracy–energy trade-off in machine learning,” in
pp. 1079–1097, 2019. 2021 IEEE/ACM International Workshop on Genetic Improvement (GI).
[25] S. Ramesh, S. Perarnau, S. Bhalachandra, A. D. Malony, and IEEE, 2021, pp. 11–18.
P. Beckman, “Understanding the impact of dynamic power cap- [45] P. Zhang, J. Fang, C. Yang, C. Huang, T. Tang, and Z. Wang,
ping on application progress,” in 2019 IEEE International Parallel “Optimizing streaming parallelism on heterogeneous many-core
and Distributed Processing Symposium (IPDPS). IEEE, 2019, pp. architectures,” IEEE Transactions on Parallel and Distributed Systems,
793–804. vol. 31, no. 8, pp. 1878–1896, 2020.
[26] M. Hao, W. Zhang, Y. Wang, G. Lu, F. Wang, and A. V. Vasilakos, [46] S. Qiu, L. You, and Z. Wang, “Optimizing sparse matrix multipli-
“Fine-grained powercap allocation for power-constrained systems cations for graph neural networks,” in Lecture Notes in Computer
based on multi-objective machine learning,” IEEE Transactions on Science. Springer Verlag, 2021.
Parallel and Distributed Systems, vol. 32, no. 7, pp. 1789–1801, 2020. [47] G. Ye, Z. Tang, H. Wang, D. Fang, J. Fang, S. Huang, and
[27] N. Gholkar, F. Mueller, and B. Rountree, “Uncore power scav- Z. Wang, “Deep program structure modeling through multi-
enger: A runtime for uncore power conservation on hpc systems,” relational graph-based learning,” in Proceedings of the ACM Interna-
in Proceedings of the International Conference for High Performance tional Conference on Parallel Architectures and Compilation Techniques,
Computing, Networking, Storage and Analysis, 2019, pp. 1–23. 2020, pp. 111–123.
[28] Y. Wang, W. Zhang, M. Hao, and Z. Wang, “Online power manage-
ment for multi-cores: A reinforcement learning based approach,”
IEEE Transactions on Parallel and Distributed Systems, 2021.
[29] P. Petoumenos, L. Mukhanov, Z. Wang, H. Leather, and D. S.
Nikolopoulos, “Power capping: What works, what does not,” in
2015 IEEE 21st International Conference on Parallel and Distributed
Systems (ICPADS). IEEE, 2015, pp. 525–534.
[30] M. Endrei, C. Jin, M. N. Dinh, D. Abramson, H. Poxon, L. DeRose,
and B. R. de Supinski, “Energy efficiency modeling of parallel
applications,” in SC18: International Conference for High Performance
Computing, Networking, Storage and Analysis. IEEE, 2018, pp. 212–
224.
[31] K. Fan, B. Cosenza, and B. Juurlink, “Predictable gpus frequency
scaling for energy and performance,” in Proceedings of the 48th
International Conference on Parallel Processing, 2019, pp. 1–10.
[32] Z. Wang and M. F. O’Boyle, “Mapping parallelism to multi-cores:
a machine learning based approach,” in Proceedings of the 14th
ACM SIGPLAN symposium on Principles and practice of parallel
programming, 2009, pp. 75–84.
[33] D. Grewe, Z. Wang, and M. F. O’Boyle, “Portable mapping of
data parallel programs to opencl for heterogeneous systems,” in
Proceedings of the 2013 IEEE/ACM International Symposium on Code
Generation and Optimization (CGO). IEEE, 2013, pp. 1–10.
[34] Z. Wang, D. Grewe, and M. F. O’boyle, “Automatic and portable
mapping of data parallel programs to opencl for gpu-based het-
erogeneous systems,” ACM Transactions on Architecture and Code
Optimization (TACO), vol. 11, no. 4, pp. 1–26, 2014.
[35] C. Cummins, P. Petoumenos, Z. Wang, and H. Leather, “End-
to-end deep learning of optimization heuristics,” in 2017 26th
International Conference on Parallel Architectures and Compilation
Techniques (PACT). IEEE, 2017, pp. 219–232.
[36] Y. Wen, Z. Wang, and M. F. O’boyle, “Smart multi-task scheduling
for opencl programs on cpu/gpu heterogeneous platforms,” in
2014 21st International conference on high performance computing
(HiPC). IEEE, 2014, pp. 1–10.
[37] B. Taylor, V. S. Marco, and Z. Wang, “Adaptive optimization
for opencl programs on embedded heterogeneous systems,” in
Proceedings of the 18th ACM SIGPLAN/SIGBED Conference on Lan-
guages, Compilers, and Tools for Embedded Systems, 2017, pp. 11–20.
[38] Z. Wang and M. O’Boyle, “Machine learning in compiler optimiza-
tion,” Proceedings of the IEEE, vol. 106, no. 11, pp. 1879–1901, 2018.

You might also like