
CN116227558A - Neural network dynamic exit lightweighting method and system for multiple consecutive inferences - Google Patents

Neural network dynamic exit lightweighting method and system for multiple consecutive inferences

Info

Publication number
CN116227558A
CN116227558A (application CN202310306228.2A)
Authority
CN
China
Prior art keywords: inference, reasoning, layer, exit, frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310306228.2A
Other languages
Chinese (zh)
Inventor
邹桉
马叶涵
沈颖涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiao Tong University
Original Assignee
Shanghai Jiao Tong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiao Tong University filed Critical Shanghai Jiao Tong University
Priority to CN202310306228.2A
Publication of CN116227558A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a neural network dynamic-exit lightweighting method and system for multiple consecutive inferences, comprising the following steps. Step 1: construct a neural-network-based inference model; for each inference within a preset small time range, predict the position at which the network will exit and predict a corresponding computing configuration, the computing configuration comprising frequency and voltage; for the multiple inferences within a preset large time range, calibrate the processor frequency and voltage according to the remaining inference workload and the time constraint. Step 2: execute the neural network under the predicted and calibrated computing configuration, thereby realizing dynamic voltage and frequency scaling. Compared with a classical deep learning network, the method achieves energy savings of up to 63.8%, guarantees that multiple neural network inferences complete within the specified time, and, through early exit, terminates inference ahead of time while still obtaining accurate results, reducing computation and energy costs.

Description

Neural network dynamic-exit lightweighting method and system for multiple consecutive inferences

Technical Field

The present invention relates to the field of neural network technology, and in particular to a neural network dynamic-exit lightweighting method and system for multiple consecutive inferences.

Background Art

Deep learning methods such as convolutional neural networks have achieved great success across a wide range of applications. However, one challenge in deploying deep learning models on resource-constrained systems is their large energy cost. Early exit, a dynamic inference method, adds exit layers to the network so that inference can terminate early while still producing accurate results, saving energy. Current passive decision-making for the energy regulation of early exit cannot adapt to the ongoing inference state, varying inference workloads, or time constraints, let alone guide a sensible configuration of the computing platform during inference to capture the potential energy savings.

Patent document US20210056357A1 discloses a system and method for implementing flexible, input-adaptive deep learning neural networks; patent documents US20210012178A1 and EP3997621A1 disclose systems, methods, and devices for early-exit convolution. Although these documents propose early-exit methods, none of them can predict the early exit point.

Patent document CN114997370A discloses a low-power neural network system based on predicted exit and its implementation method. Although it proposes predicting the early exit point, it can only predict for a single neural network inference to reduce computation and system energy consumption; its prediction accuracy is low, it cannot predict with high accuracy from the first few layers of the network (or even from the first layer), and it cannot predict early exit points for large network models with high accuracy.

Summary of the Invention

In view of the defects in the prior art, the object of the present invention is to provide a neural network dynamic-exit lightweighting method and system for multiple consecutive inferences.

The neural network dynamic-exit lightweighting method for multiple consecutive inferences provided by the present invention comprises:

Step 1: construct a neural-network-based inference model; for each inference within a preset small time range, predict the position at which the network will exit and predict the corresponding computing configuration, the computing configuration comprising frequency and voltage;

Step 2: for the multiple inferences within a preset large time range, calibrate the processor frequency and voltage according to the remaining inference workload and the time constraint;

Step 3: execute the neural network under the predicted and calibrated computing configuration, thereby realizing dynamic voltage and frequency scaling.

Preferably, a coordination period for inference, i.e., the inference task deadline, is set during inference, expressed as:

[Equations defining T_c and T_i appear only as images in the source (Figure BDA0004146935970000021, Figure BDA0004146935970000022).]

where λ denotes the tightness of the task deadline, with 0 < λ ≤ 1 (the smaller λ, the tighter the deadline); t_i denotes the actual completion time of each inference; T_c is the coordination period; M is the number of inference tasks; and T_i denotes the coordination deadline of the i-th inference task.

Preferably, when an inference task is executed, the power consumed during computation is the active power P_active, the sum of the dynamic power P_D, the static power P_S, and the constant power P_C;

the power consumed while the computing platform is idle is the idle power P_idle, the sum of the static power P_S and the constant power P_C;

P_D = C · V² · f

P_S = V · N_tr · I_S

where C is the capacitance of the switching logic gates; V is the processor supply voltage; f is the processor clock frequency; N_tr is the number of logic gates; and I_S is the normalized static current of each logic gate.

The energy consumption of each inference within T_i is:

[Equation shown only as an image in the source (Figure BDA0004146935970000023).]

The network layers before the early exit run at (V_H, f_H); after the network exits at time t_i, the computing platform lowers the voltage and frequency to the lowest level (V_L, f_L) via dynamic voltage and frequency scaling (DVFS). The energy consumption of inference task j_i is therefore expressed as:

E_i = P_active(V_H, f_H) · t_i + P_idle(V_L, f_L) · (T_i − t_i)

where E_i denotes the energy consumption of inference task j_i; V_H is the processor's high supply voltage; f_H is the processor's high clock frequency; and V_L is the processor's low supply voltage.
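The power model above can be sketched directly. The split of E_i into an active phase at (V_H, f_H) until the exit time and an idle phase until the deadline follows the surrounding prose; all numeric values used below are illustrative, not taken from the patent.

```python
def dynamic_power(C, V, f):
    # P_D = C * V^2 * f: dynamic power of the switching logic gates.
    return C * V * V * f

def static_power(V, N_tr, I_S):
    # P_S = V * N_tr * I_S: static power of N_tr gates at supply voltage V.
    return V * N_tr * I_S

def inference_energy(t_exit, T_deadline, C, N_tr, I_S, P_C, V_H, f_H, V_L):
    # Active power (P_D + P_S + P_C) is drawn at (V_H, f_H) until the exit
    # at t_exit; idle power (P_S + P_C) is drawn at V_L until the deadline.
    p_active = dynamic_power(C, V_H, f_H) + static_power(V_H, N_tr, I_S) + P_C
    p_idle = static_power(V_L, N_tr, I_S) + P_C
    return p_active * t_exit + p_idle * (T_deadline - t_exit)
```

Earlier exits (smaller t_exit) shrink the high-power active term, which is where the claimed energy savings come from.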

Preferably, within an inference task j_i, the network exit position is predicted first, establishing the number of remaining layers needed to complete the inference; then, based on the remaining layers and the worst-case execution time T_i until the end of the current inference, a predicted (V, f) is established; based on the amount of inference remaining within T_c, a cross-inference calibrator is enabled through a feedback control strategy, calibrating (V, f) according to the multi-inference progress, the total inference workload, and the time constraint; and based on the prediction and calibration, the processor is updated at runtime to the appropriate computing configuration via DVFS.

Preferably, in inference task j_i, each exit layer comprises a BoF (bag-of-features) pooling layer and an FC layer. Let y_i be the intermediate result of the i-th layer and N_c the number of object classes in the dataset. The BoF pooling layer aggregates the features extracted from y_i: a set of feature vectors called a codebook is used to describe y_i, and the weight of each codebook entry is generated by measuring the similarity between that entry and y_i. Since the number of codebook weights exceeds N_c, the FC layer further serves as a classifier that maps the BoF pooling result to an N_c-dimensional vector, thereby estimating the final output at the exit layer. The classifier's estimate, denoted ŷ_i, is:

[Equation shown only as an image in the source (Figure BDA0004146935970000032).]

where f_w(x, i) is the intermediate result of the i-th layer; x is the initial network input; W_i are the parameters of the i-th exit layer (the exit layer being the BoF pooling layer and the FC layer); and ŷ_i is the classifier's estimated result.

The parameters of the exit layers are trained offline, using the same cross-entropy loss function when training W_i:

ℓ(p, q) = − Σ_i p(i) · log q(i)

where p(i) and q(i) are, respectively, the actual and predicted class distributions for each object, and ℓ(p, q) is the cross-entropy.

When training the exit layers, the training-set data {x_1, x_2, …, x_N} is fed into the model, where N is the number of training samples, and the exit-layer parameters are optimized by batch gradient descent. Let L_total be the number of layers of the network, with target vectors {r_1, r_2, …, r_N} giving the correct results. First, without any exit layers, the model is trained with:

W' = W − (η / N) · Σ_{j=1}^{N} ∂ℓ(f_W(x_j), r_j) / ∂W

where W' is the updated overall network parameter set; η is the learning rate, adjusted if the accuracy does not meet requirements; j indexes the position in the training set; ∂ℓ/∂W denotes the parameter gradient; x_j is the j-th training data vector; and r_j is the j-th target data vector.

The original model W is then fixed, and the model is trained again on top of it to optimize the parameters of the BoF pooling layer and the FC layer:

W_i' = W_i − (η / N) · Σ_{j=1}^{N} ∂ℓ(ŷ_i(x_j), r_j) / ∂W_i

where ŷ_i(x_j) is the i-th exit layer's estimate for input x_j.
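Both training passes apply the same batch-gradient-descent update, which can be sketched as one step; the per-sample gradients are assumed to be computed elsewhere (e.g. by backpropagation through the cross-entropy loss).

```python
import numpy as np

def batch_gd_step(W, sample_grads, eta):
    # W' = W - (eta / N) * sum_j grad_j: the batch update applied first to
    # the whole network and then, with the backbone frozen, to the
    # exit-layer parameters only.
    return W - eta / len(sample_grads) * np.sum(sample_grads, axis=0)
```

Freezing the backbone means only the exit-layer parameter array is passed as `W` in the second pass.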

After the exit layers are trained, the average feature weight μ_i is computed as the parameter for the exit decision:

[Equation shown only as an image in the source (Figure BDA0004146935970000042).]

where k denotes the k-th overall classification target.

During inference on an initial network input x_j, at each early exit layer the weight ratio α_W is the maximum feature weight divided by μ_i and multiplied by a user-specified hyperparameter β; the larger the maximum feature weight, the more confident the classification is considered. If α_W exceeds 1, inference terminates and the result at this early exit layer is taken as the final result:

α_W = β · w_max / μ_i

where w_max is the maximum feature weight at the exit layer.
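The exit test can be sketched as below; the grouping α_W = β · max(weights) / μ_i is an assumption read from the prose ("the maximum feature weight divided by μ_i, multiplied by β").

```python
def should_exit(weights, mu_i, beta):
    # alpha_W = beta * max(weights) / mu_i; exit when alpha_W > 1, i.e.
    # when the strongest class response is confident enough relative to
    # the average feature weight mu_i learned during training.
    alpha_w = beta * max(weights) / mu_i
    return alpha_w > 1.0
```

Smaller β makes exits rarer (higher confidence required); larger β trades accuracy for earlier termination.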

Preferably, the predictor starts predicting from layer L_0 of the neural network. The predictor's input is the execution result vector y_0 of layer L_0 and the corresponding exit-layer execution result vector, denoted here w_0 (the original symbol appears only as an image in the source).

This result is expanded in dimension: by appending zeros at both ends of w_0, the vector is extended from its original N_c dimensions to a padded vector, where N_c is the length of w_0 and also the number of inference-result classes, and K is the number of feature vectors in the feature pooling group. The computation is:

[Equation shown only as an image in the source (Figure BDA00041469359700000411).]

where l denotes the l-th element.

The zero-padded output vector then undergoes a one-dimensional convolution to obtain a new feature-weight vector, with the convolution weights given by a one-dimensional vector h of length K:

w'[l] = Σ_{m=1}^{K} h[m] · w_pad[l + m − 1]

where w_pad denotes the zero-padded vector and w' the new feature-weight vector (the original symbols appear only as images in the source).
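A minimal sketch of one predictor step, assuming an odd kernel length K and symmetric "same"-style zero-padding (the patent's exact padded dimension appears only as an image); the names are illustrative.

```python
import numpy as np

def predict_next_exit_weights(w, h):
    # Zero-pad the exit-layer weight vector w on both ends, then slide the
    # length-K kernel h over it (1-D sliding-window correlation) to estimate
    # the weight vector of the exit layer one position deeper. Assumes odd K
    # so the output keeps the length N_c of w.
    K = len(h)
    pad = K // 2
    w_padded = np.concatenate([np.zeros(pad), np.asarray(w, float), np.zeros(pad)])
    # out[l] = sum_m h[m] * w_padded[l + m]
    return np.array([np.dot(h, w_padded[l:l + K]) for l in range(len(w))])
```

Feeding the output back in as `w` and repeating implements the recursion that yields predictions for exit layers L_0 + 2, L_0 + 3, and so on.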

By recursively replacing the intermediate estimate with the newly predicted vector and repeating the computation above, the prediction for the exit layer at position L_0 + 2 is obtained, and likewise for deeper positions.

Placing the exit-layer prediction at any layer after L_0 yields the PREDICT function, described as follows:

[Equations shown only as images in the source (Figure BDA0004146935970000059, Figure BDA0004146935970000051, Figure BDA0004146935970000052).]

Here the PREDICT output is the minimum predicted exit point after L_0, evaluated from the predicted exit confidence of layer L_0 + ζ; L_0 + ζ is therefore the exit point predicted by the prediction function. If no such ζ can be found, a hyperparameter τ is introduced, i.e., ζ ∈ [1, τ] with τ ≤ L_total − L_0; the upper bound of τ prohibits prediction results beyond the last layer. If no integer in [1, τ] satisfies the condition, ζ = τ is set.
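A sketch of the PREDICT search, assuming the exit condition is "predicted confidence exceeds 1" (matching the α_W > 1 exit test); the confidence values themselves are assumed to come from the convolution-based predictor.

```python
def predict_exit_point(confidences, tau):
    # confidences[z - 1] holds the predicted exit confidence of layer
    # L0 + z. Return the smallest z in [1, tau] whose predicted exit
    # fires; fall back to z = tau when none does, matching the patent's
    # zeta = tau fallback.
    for z in range(1, tau + 1):
        if confidences[z - 1] > 1.0:
            return z
    return tau
```

The τ cap keeps the predicted exit within the network, since τ ≤ L_total − L_0.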

The current prediction result ζ of PREDICT is converted into a mid-level frequency f_{M,i}. For inference task j_i, in the interval from the start of inference until the prediction at L_0, a relatively high frequency f⁰_{M,i} is applied conservatively so that all L_total layers of the model could complete in time without exiting:

[Equation shown only as an image in the source (Figure BDA0004146935970000055).]

where f_H is the computing platform's default maximum frequency, at which the platform can complete inference through all L_total layers without exiting early within T_i; the frequency f⁰_{M,i} is calibrated by Δf_i.

When the prediction is made at L_0, f is adjusted to the mid-level frequency f_{M,i} according to the predicted exit point, and the network runs until it exits before T_i. Given a network with L_total layers and an early prediction ζ, the mid-level frequency is computed as:

[Equation shown only as an image in the source (Figure BDA0004146935970000057).]

where f_{M,i} is the lowest frequency predicted to complete the inference by time T_i.

Preferably, for inference task j_{i+1}, the cross-inference DVFS strategy provides a calibration suggestion Δf_{i+1} after all previous tasks, including j_i, have completed. The calibration is implemented by a discrete incremental proportional-integral (PI) regulator:

Δf_{i+1} = Δf_i + (K_P + K_I · (t_i − t_{i−1})) · e(t_i) − K_P · e(t_{i−1})

where K_P and K_I are the proportional and integral coefficients, respectively; the index i denotes the i-th inference task within the current coordination period; t_i denotes the relative completion time of inference j_i from the start of the coordination period; and t_i − t_{i−1} is the interval between the completions of inference tasks j_i and j_{i−1}.

The PI regulator's input deviation e(t_i) evaluates the inference progress against T_c, based on the total inference workload and the inference execution progress since the start of the coordination period:

[Equation shown only as an image in the source (Figure BDA0004146935970000058).]

where the first term is the reference speed for balancing the inference workload and the second term is the processing speed achieved so far.
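The incremental PI update above can be sketched directly; the deviation values e(t_i) are assumed to be supplied by the progress evaluation, and the numbers used are illustrative.

```python
def pi_calibration(df_prev, e_curr, e_prev, t_curr, t_prev, K_P, K_I):
    # Discrete incremental PI step from the patent:
    # df_{i+1} = df_i + (K_P + K_I*(t_i - t_{i-1})) * e(t_i) - K_P * e(t_{i-1})
    return df_prev + (K_P + K_I * (t_curr - t_prev)) * e_curr - K_P * e_prev
```

A positive deviation (inference running behind the reference speed) raises Δf, nudging later inferences to a higher frequency so the coordination deadline T_c is still met.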

Preferably, the frequency and corresponding voltage are configured through the DVFS governor, and the network is executed according to the small-time-scale prediction and the large-time-scale calibration. At the start of each inference, Δf_i is derived from the cross-inference calibration, and the DVFS governor sets the computing configuration conservatively so that an inference that does not exit early still completes. When the early-exit prediction is made at inference layer L_0, the DVFS governor establishes the appropriate frequency configuration f_i* from f_{M,i}, derived by the intra-inference predictor, and Δf_i, derived by the cross-inference calibration:

f_i* = f_{M,i} + Δf_i

The resulting energy consumption is therefore:

[Equation shown only as an image in the source (Figure BDA0004146935970000064).]

where t_i^{L_0} is the time at which the prediction layer L_0 and the network layers before it complete within j_i, and V_i* is the voltage corresponding to the configured frequency f_i*. Since the choice of f is based on the predicted remaining workload within the inference, the intra-inference prediction allows j_i to complete by T_i; the cross-inference calibration targets completion of the M sequential inferences within T_c.
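A minimal sketch of the governor's frequency choice, assuming the additive combination of prediction and calibration described above, with clamping to the platform's supported range (the clamping itself is an assumption; the patent's exact expression is shown only as an image).

```python
def governor_frequency(f_pred, df_cal, f_low, f_high):
    # Combine the intra-inference prediction f_M,i with the cross-inference
    # calibration df_i, clamped to the platform's supported [f_low, f_high].
    return min(max(f_pred + df_cal, f_low), f_high)
```

The corresponding voltage is then looked up from the platform's supported (V, f) pairs, since DVFS hardware exposes discrete voltage-frequency operating points.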

The neural network dynamic-exit lightweighting system for multiple consecutive inferences provided by the present invention comprises:

Module M1: construct a neural-network-based inference model; for each inference within a preset small time range, predict the position at which the network will exit and predict the corresponding computing configuration, the computing configuration comprising frequency and voltage; for the multiple inferences within a preset large time range, calibrate the processor frequency and voltage according to the remaining inference workload and the time constraint;

Module M2: execute the neural network under the predicted and calibrated computing configuration, thereby realizing dynamic voltage and frequency scaling.

Preferably, a coordination period for inference, i.e., the inference task deadline, is set during inference, expressed as:

[Equations defining T_c and T_i appear only as images in the source (Figure BDA0004146935970000067, Figure BDA0004146935970000068).]

where λ denotes the tightness of the task deadline, with 0 < λ ≤ 1 (the smaller λ, the tighter the deadline); t_i denotes the actual completion time of each inference; T_c is the coordination period; M is the number of inference tasks; and T_i denotes the coordination deadline of the i-th inference task;

when an inference task is executed, the power consumed during computation is the active power P_active, the sum of the dynamic power P_D, the static power P_S, and the constant power P_C;

the power consumed while the computing platform is idle is the idle power P_idle, the sum of the static power P_S and the constant power P_C;

P_D = C · V² · f

P_S = V · N_tr · I_S

where C is the capacitance of the switching logic gates; V is the processor supply voltage; f is the processor clock frequency; N_tr is the number of logic gates; and I_S is the normalized static current of each logic gate.

The energy consumption of each inference within T_i is:

[Equation shown only as an image in the source (Figure BDA0004146935970000071).]

The network layers before the early exit run at (V_H, f_H); after the network exits at time t_i, the computing platform lowers the voltage and frequency to the lowest level (V_L, f_L) via dynamic voltage and frequency scaling (DVFS). The energy consumption of inference task j_i is therefore expressed as:

E_i = P_active(V_H, f_H) · t_i + P_idle(V_L, f_L) · (T_i − t_i)

where E_i denotes the energy consumption of inference task j_i; V_H is the processor's high supply voltage; f_H is the processor's high clock frequency; and V_L is the processor's low supply voltage;

within an inference task j_i, the network exit position is predicted first, establishing the number of remaining layers needed to complete the inference; then, based on the remaining layers and the worst-case execution time T_i until the end of the current inference, a predicted (V, f) is established; based on the amount of inference remaining within T_c, a cross-inference calibrator is enabled through a feedback control strategy, calibrating (V, f) according to the multi-inference progress, the total inference workload, and the time constraint; and based on the prediction and calibration, the processor is updated at runtime to the appropriate computing configuration via DVFS.

Compared with the prior art, the present invention has the following beneficial effects:

The present invention predicts, within each inference, where the network will exit and adjusts the computing configuration (i.e., frequency and voltage) accordingly on the small time scale; at the same time, across the multiple inferences on the large time scale, it provides computing configuration recommendations that account for the remaining inference workload and the time constraint, and executes the neural network under the predicted and calibrated configuration, thereby realizing dynamic voltage and frequency scaling. Compared with a classical deep learning network, the invention achieves energy savings of up to 63.8% while guaranteeing that multiple neural network inferences complete within the specified time; through early exit, inference can terminate ahead of time and still obtain accurate results, reducing computation and energy costs.

Brief Description of the Drawings

Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, made with reference to the accompanying drawings:

FIG. 1 is a structural diagram of the early-exit-based neural network of the present invention.

Detailed Description

The present invention is described in detail below in conjunction with specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but do not limit it in any form. It should be noted that those of ordinary skill in the art can make several changes and improvements without departing from the concept of the present invention, all of which fall within the protection scope of the present invention.

Embodiment 1:

As shown in FIG. 1, for neural network applications that complete multiple neural network inferences within a period of time, the present invention provides a neural network dynamic-exit lightweighting method for multiple consecutive inferences that reduces the computing power and energy consumption of neural network computation by combining the inference state with runtime dynamic voltage and frequency scaling of the processor. It specifically includes the following steps:

Step 1: within each inference on the small time scale, predict where the network will exit and predict the computing configuration (frequency and voltage) accordingly; for the multiple inferences on the large time scale, provide processor frequency and voltage calibration based on the remaining inference workload and the deadline constraint;

Step 2: execute the neural network under the predicted and calibrated computing configuration, thereby realizing dynamic voltage and frequency scaling.

A more detailed description follows.

The coordination period for inference, i.e., the inference task deadline, is expressed as:

[Equations defining T_c and T_i appear only as images in the source (Figure BDA0004146935970000081, Figure BDA0004146935970000082).]

λ denotes the tightness of the task deadline, with 0 < λ ≤ 1 (the smaller λ, the tighter the deadline); t_i denotes the actual completion time of each inference; T_c is the coordination period; M is the number of inference tasks; T_i denotes the coordination deadline of the i-th inference task.

When executing an inference task, the power consumed during computation is the active power Pactive, the sum of the dynamic power PD, the static power PS, and the constant power PC;

The power consumed when the computing platform is idle is the idle power Pidle, the sum of the static power PS and the constant power PC;

PD = C·V²·f

PS = V·Ntr·IS

where C is the capacitance of the switching logic gates; V is the processor supply voltage; f is the processor clock frequency; Ntr is the number of logic gates; IS is the normalized static current of each logic gate;
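The power model above translates directly to code. A minimal sketch follows; the symbol names match the text, and the numeric values used in testing are purely illustrative:

```python
def dynamic_power(c: float, v: float, f: float) -> float:
    """Dynamic power P_D = C * V^2 * f of the switching logic gates."""
    return c * v ** 2 * f

def static_power(v: float, n_tr: int, i_s: float) -> float:
    """Static (leakage) power P_S = V * N_tr * I_S."""
    return v * n_tr * i_s

def active_power(p_d: float, p_s: float, p_c: float) -> float:
    """Active power while computing: P_active = P_D + P_S + P_C."""
    return p_d + p_s + p_c

def idle_power(p_s: float, p_c: float) -> float:
    """Idle power of the platform: P_idle = P_S + P_C."""
    return p_s + p_c
```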

The energy consumption of each inference within Ti is:

Figure BDA0004146935970000091

Since the neural network is deployed with early exits, the network layers before the early exit run at (VH, fH); after the network exits at time ti, the computing platform lowers the voltage and frequency to the lowest level (VL, fL) through DVFS until Ti. The energy consumption of inference task ji is therefore expressed as:

Figure BDA0004146935970000092

Ei denotes the energy consumption; VH denotes the high processor supply voltage; fH denotes the high processor clock frequency; VL denotes the low processor supply voltage;

An appropriate (V, f) is established to match the remaining workload and time constraints; (V, f) is adjusted at runtime based on the prediction of the network exit position and the remaining inferences to be completed before the deadline Tc.

Within an inference task ji, the exit position of the network is first predicted, establishing the number of remaining layers needed to complete the inference. Then, based on the remaining layers and the worst-case execution time Ti of the current inference, an appropriate (V, f) prediction is established to reduce energy cost on the small time scale. Considering the amount of inference remaining within Tc, a cross-inference calibrator enabled by a feedback control strategy calibrates (V, f) according to the multi-inference progress, the total inference workload, and the time constraint, balancing workload, energy, and time cost on the large time scale. Based on the prediction and calibration, the DVFS governor updates the processor to the appropriate computing configuration (i.e., frequency and voltage) at runtime to save energy while meeting the deadline.

In inference task ji, the design of the exit layers is based on existing works, and all exit layers are further modified to share the same topology. Each exit layer contains a BoF (Bag-of-Features) pooling layer and an FC (fully-connected) layer. Let yi be the intermediate result of the i-th layer and Nc the number of object classes in the dataset. The BoF pooling layer aggregates the features extracted from yi. In BoF pooling, a set of feature vectors called codebooks is used to describe yi; the weight of each codebook is generated by measuring the similarity between the codebook and yi. Since the number of codebook weights is usually larger than Nc, the FC layer further acts as a classifier, mapping the result of BoF pooling to
Figure BDA0004146935970000101
thereby estimating the final output at the exit layer.

The function of the exit layer is denoted as:

Figure BDA0004146935970000102

where fw(x, i) is the intermediate result of the i-th layer and x is the initial network input; Wi is the parameter set of the i-th exit layer (BoF and FC);
Figure BDA0004146935970000103
is the estimated result of the classifier.

The parameters in the exit layers need to be trained offline. Since the exit layers work by estimating the final network output, the same cross-entropy loss function is selected when training Wi.

Figure BDA0004146935970000104

where p(i) and q(i) are the actual and predicted class distributions of each object, respectively;
Figure BDA0004146935970000105
is the cross entropy;

When training the exit layers, the training set data {x1, x2, …, xN} is input into the model, where N is the number of training samples, and the exit-layer parameters are optimized by batch gradient descent. Let Ltotal be the number of layers of the network, and let the target vector set be
Figure BDA0004146935970000106
where
Figure BDA0004146935970000107
is the correct result. First, without any exit layers, the model is trained with the following formula:

Figure BDA0004146935970000108

where W' is the updated overall network parameter and η is the learning rate, which can be adjusted if the accuracy does not meet the requirement; j denotes the vector position;
Figure BDA0004146935970000109
denotes the parameter gradient; xj denotes a training data vector element; rj denotes a target data vector element;

Afterwards, the original model W is fixed and the model is trained again on top of the exit layers, optimizing the parameters of the BoF pooling layer and the FC layer; the expression is:

Figure BDA00041469359700001010

After the exit layers are trained, the average feature weight μi is calculated as a parameter for the exit decision; the expression is:

Figure BDA00041469359700001011

k denotes the k-th overall classification target.

During inference on the initial network input xj, at each early-exit layer the weight ratio αW is the maximum feature weight divided by μi and multiplied by the user-specified hyperparameter β. The larger the maximum feature weight, the more confident the classification is considered. Once αW exceeds 1, the inference terminates and the result of this early-exit layer is applied as the final result.

Figure BDA0004146935970000111
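The exit decision can be sketched as follows. The grouping of β is our left-to-right reading of the text ("maximum feature weight divided by μi multiplied by β"), i.e. αW = (max weight / μi) · β; since the exact formula is in an image, this grouping is an assumption.

```python
def should_exit(feature_weights, mu_i: float, beta: float) -> bool:
    """Early-exit decision at one exit layer.

    alpha_w = (max feature weight / mu_i) * beta -- the grouping of beta
    is our reading of the text and therefore an assumption.  Inference
    terminates once alpha_w exceeds 1.
    """
    alpha_w = max(feature_weights) / mu_i * beta
    return alpha_w > 1.0
```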

Assume the predictor starts predicting from the L0-th layer of the neural network. The predictor input is the execution result vector y0 of the L0-th network layer and its corresponding exit-layer execution result vector
Figure BDA0004146935970000112

The zero-padding module expands the dimension of the above result: by appending zeros at both ends of the vector
Figure BDA0004146935970000113
the vector
Figure BDA0004146935970000114
is expanded from
Figure BDA0004146935970000115
dimensions to
Figure BDA0004146935970000116
dimensions, denoted
Figure BDA0004146935970000117
where Nc, the length of the vector
Figure BDA0004146935970000118
is also the number of inference result classes, and K is the number of feature vectors in the feature pooling group. The specific calculation is given by the following formula:

Figure BDA0004146935970000119

l denotes the l-th element.

The output vector of the zero-padding module
Figure BDA00041469359700001110
then undergoes a one-dimensional convolution, producing a new feature weight vector
Figure BDA00041469359700001111
where the convolution weight is a one-dimensional vector h of length K. The specific calculation is given by the following formula:

Figure BDA00041469359700001112

By recursively replacing the intermediate estimate
Figure BDA00041469359700001113
with
Figure BDA00041469359700001114
and repeating the formula, the prediction result of the exit layer at position L0+2 is obtained, denoted
Figure BDA00041469359700001115
Following the above steps, the prediction result of any exit layer after L0 can be computed.
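The zero-pad-then-convolve step can be sketched with NumPy as follows. The padding width and the length-preserving convolution are our assumptions (the exact formulas are in the patent's formula images), and the sketch assumes an odd kernel length K:

```python
import numpy as np

def next_exit_weights(w_exit: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Estimate the next exit layer's feature weights from the current
    exit-layer result (length N_c) with a 1-D convolution of kernel h
    (odd length K).  Length-preserving zero padding is an assumption.
    """
    k = len(h)
    assert k % 2 == 1, "sketch assumes an odd kernel length"
    pad = (k - 1) // 2
    padded = np.pad(w_exit, (pad, pad))          # zero fill both ends
    return np.convolve(padded, h, mode="valid")  # same length as the input
```

Applying the function repeatedly to its own output gives the recursion described above.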

Placing the exit-layer prediction at any layer after L0 yields the PREDICT function, described as follows:

Figure BDA00041469359700001116

where
Figure BDA00041469359700001117
denotes the minimum predicted exit point after L0;

Figure BDA00041469359700001118

where
Figure BDA0004146935970000121
is the predicted exit confidence of layer L0+ζ. Therefore, L0+ζ is the exit point predicted by the PREDICT function. In case no such ζ exists, a hyperparameter τ is further introduced, i.e., ζ ∈ [1, τ], τ ≤ Ltotal − L0. The upper bound τ forbids predictions beyond the last layer. If no integer in [1, τ] satisfies the condition, ζ = τ is set. The time complexity of the prediction is O(Nc × K × τ). The hyperparameters introduced in the prediction, L0, β, and τ, can be tuned by advanced users to balance prediction accuracy and computational cost for different application scenarios.

Frequency prediction: the exit prediction result ζ of PREDICT is converted into an appropriate "mid-level" frequency fM,i. For inference task ji, during the interval from the start of the inference to the prediction at L0, a relatively "high-level"
Figure BDA0004146935970000122
is conservatively applied, such that all Ltotal layers of the model could be completed in time without exiting; the expression is:

Figure BDA0004146935970000123

where fH is the default maximum frequency of the computing platform, at which the platform can complete inference with all Ltotal layers within Ti without an early exit. The frequency
Figure BDA0004146935970000124
is calibrated by Δfi.

When the prediction is made at L0, f is adjusted to the appropriate "mid-level" fM,i according to the predicted exit point, and the network runs until it exits within Ti. Given a network with Ltotal layers and early-exit prediction ζ, the computation frequency is reduced to:

Figure BDA0004146935970000125

where fM,i is the prediction of the lowest frequency at which the inference can still be completed by time Ti.
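The frequency-scaling step can be sketched under a simple linear timing model (layer latency inversely proportional to f). The proportionality model and the parameter names are assumptions; the actual fM,i formula is in the image above.

```python
def mid_level_frequency(f_h: float, zeta: int, t_deadline: float,
                        t_elapsed: float, layer_time_at_fh: float) -> float:
    """Lowest frequency that still finishes the zeta remaining layers by
    the per-inference deadline, assuming layer latency scales as 1/f.

    All parameter names are illustrative; this is a sketch of the idea,
    not the patent's exact formula.
    """
    remaining_time = t_deadline - t_elapsed
    assert remaining_time > 0
    f_needed = f_h * (zeta * layer_time_at_fh) / remaining_time
    return min(f_h, f_needed)
```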

Since our goal is to execute inference tasks in a bursty manner, for inference task ji+1 the cross-inference DVFS strategy provides a calibration suggestion Δfi+1 after completing all previous tasks including ji, taking into account the workload and time constraints on the larger time scale Tc. The calibration is implemented by a discrete incremental proportional-integral (PI) regulator, expressed as:

Δfi+1 = Δfi + (KP + KI(ti − ti−1))·e(ti) − KP·e(ti−1)

where KP and KI are the proportional and integral coefficients, respectively. The index i denotes the i-th inference task within the current coordination cycle; ti denotes the completion time of inference ji relative to the start of the coordination cycle; ti − ti−1 is the interval between the completions of inference tasks ji and ji−1.
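The incremental PI update above translates directly to code:

```python
def pi_calibration(delta_f_i: float, k_p: float, k_i: float,
                   t_i: float, t_prev: float,
                   e_i: float, e_prev: float) -> float:
    """Discrete incremental PI regulator from the text:
    df_{i+1} = df_i + (K_P + K_I*(t_i - t_{i-1})) * e(t_i) - K_P * e(t_{i-1})
    """
    return delta_f_i + (k_p + k_i * (t_i - t_prev)) * e_i - k_p * e_prev
```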

The input deviation e(ti) of the PI regulator evaluates the inference progress over Tc based on the total inference workload and the inference execution progress since the start of the coordination period; it is expressed as:
Figure BDA0004146935970000126
where the first term is the reference speed for balancing the inference workload and the second term is the processing speed achieved so far.
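A plausible reading of e(ti) — reference layers-per-second to finish the total workload within αTc, minus the layers-per-second achieved so far — can be sketched as follows; since the exact terms are in the formula image, this form is an assumption.

```python
def progress_error(total_layers: int, alpha: float, t_c: float,
                   layers_done: int, t_i: float) -> float:
    """e(t_i): reference processing speed minus achieved speed (assumed form)."""
    reference_speed = total_layers / (alpha * t_c)  # layers/s needed for alpha*Tc
    achieved_speed = layers_done / t_i              # layers/s so far
    return reference_speed - achieved_speed
```

A positive error means execution is behind schedule, so the calibrator raises the frequency; a negative error allows it to be lowered.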

In practical applications, a smaller time period αTc is used, 0 < α ≤ 1. The interval between αTc and Tc is a margin reserved by the feedback control to absorb execution-time overshoot. An appropriate choice of α balances energy saving against meeting the deadline. The output of the cross-inference calibrator is the frequency calibration Δfi+1 with which the DVFS governor configures the computing platform for the next inference ji+1.

Finally, the DVFS governor configures the frequency and the corresponding voltage, executing the network according to the predictions and calibrations on the small and large time scales, respectively, to further save energy while meeting the network inference deadlines. The DVFS governor operates (V, f) scaling at a granularity finer than one inference. At the start of each inference, Δfi is derived from the cross-inference calibration, and the DVFS governor sets the computing configuration to
Figure BDA0004146935970000131
to conservatively complete an inference without early exit. When the early-exit prediction is made at layer L0, the DVFS governor establishes an appropriate frequency configuration
Figure BDA0004146935970000132
from fM,i derived by the intra-inference predictor and Δfi derived by the cross-inference calibration; the expression is:

Figure BDA0004146935970000133

Therefore, the resulting energy consumption is:

Figure BDA0004146935970000134

where
Figure BDA0004146935970000137
is the completion time of the prediction layer L0 and the network layers of ji before it, and
Figure BDA0004146935970000135
is the voltage corresponding to a given
Figure BDA0004146935970000136
Since f is chosen based on the remaining workload predicted within the inference, the intra-inference prediction enables ji to be completed by Ti, and the cross-inference calibration targets completing the M sequential inferences by Tc. Since V and f scale computational performance linearly (inversely proportional to inference time), while PD scales cubically and PS linearly with them, the dual-time-scale power management effectively reduces E.

Processor voltage and frequency adjustment can be implemented at the application level through system calls (syscall). According to our observation, the transient time of each (V, f) change is approximately 1 ms to 3 ms. The transient process of changing the processor (V, f) is included in our actual system evaluation.

For each inference at runtime, the computational complexity of the intra-inference exit prediction is O(Nc × K × τ), while the cross-inference calibration and the DVFS governor are both O(1).

We use VGG-19 and ResNet-18 as backbone models on the commonly used CIFAR-10, CIFAR-100, SVHN, and CINIC datasets to evaluate the inference accuracy and timing performance of the present invention. The evaluation is performed on an NVIDIA Jetson TX2 with GPU fH = 1.30050 GHz and fL = 0.11475 GHz. Serial and I/O operations are executed on the CPU, while massively parallel and compute-intensive segments are offloaded to the GPU, managed automatically by the PyTorch library. The energy cost is measured with a Tektronix MDO32 oscilloscope and a TCP2020 current probe at a sampling frequency of 50 kHz. The prediction and calibration parameters of the different benchmarks are listed in Table 1.

Table 1

Figure BDA0004146935970000141

During inference, the frequency is computed according to the prediction of the early exit point. The accuracy of the prediction directly determines the time and energy consumption of the task. Therefore, we test the prediction accuracy on the datasets and models with different L0. According to the prediction and inference accuracy results, L0 = 6 and 7 are set for VGG-19 and ResNet-18, respectively, in the remaining experiments.

The inference accuracy is then evaluated; the results are shown in Table 2. The present invention achieves the same accuracy as other early-exit methods on both models, 1%-3% lower than the classic CNN. In the early-exit prediction, even if a prediction is wrong, the network simply continues to the next prediction instead of being forced to exit, so no additional inference accuracy loss is introduced. The remaining power management settings have no effect on inference accuracy, because the exit-layer prediction of ζ is determined first to achieve high inference accuracy, and the power management is built on top of it.

Table 2

Figure BDA0004146935970000142

We evaluate the timing performance; owing to the effective dual-time-scale DVFS management, the present invention shows more concentrated and shorter execution times than other methods. α balances energy saving against meeting the deadline: α = 0.75 leads to a slightly longer total execution time than α = 0.5, while saving more energy.

Finally, we evaluate the deadline (Tc) satisfaction rate under different values of λ. As a stress test, we compare the timing performance under tighter deadlines, i.e.,
Figure BDA0004146935970000143
When λ = 1, all methods meet the deadline.

To measure the runtime power consumption, the Jetson TX2 board is connected directly to a Keysight E36231A power supply at 19 V. A current probe and an oscilloscope monitor the current fed to the board. Table 3 summarizes the average energy consumption per task. Energy measurements on the real platform show that the present invention saves 63.8% energy compared with the classic deep learning network and 21.5% compared with early exit under the state-of-the-art exit strategy. In most cases, α = 0.75 consumes less energy than α = 0.5.

Table 3

Figure BDA0004146935970000151

The main source of overhead is the early-exit PREDICT. We therefore evaluate the time cost of 1750 early-exit PREDICT invocations for each benchmark. The maximum time cost across all benchmarks is below 300 μs. The box for ResNet-18 is wider than that for VGG-19 because ResNet-18 is expected to exit with a larger ζ. The time cost demonstrates the applicability and scalability of the online power management of the present invention.

Embodiment 2:

The present invention also provides a neural network dynamic-exit lightweight system for multiple continuous inferences. The system can be realized by executing the process steps of the neural network dynamic-exit lightweight method for multiple continuous inferences; that is, those skilled in the art can understand the method as a preferred implementation of the system.

The neural network dynamic-exit lightweight system for multiple continuous inferences provided by the present invention includes: module M1: constructing a neural-network-based inference model; for each inference within a preset small time range, predicting where the network exits and correspondingly predicting the computing configuration, the computing configuration including frequency and voltage; for multiple inferences within a preset large time range, calibrating the processor frequency and voltage based on the remaining inference workload and the time constraint; module M2: executing the neural network according to the predicted and calibrated computing configuration, thereby realizing dynamic voltage and frequency scaling.

During the inference process, the coordination cycle of inference, i.e., the inference task deadline, is set; the expression is:

Figure BDA0004146935970000152

Figure BDA0004146935970000153

where λ denotes the tightness of the task deadline, 0 < λ ≤ 1 (the smaller λ, the tighter the deadline); ti denotes the actual completion time of each inference; Tc is the coordination cycle; M is the number of inference tasks; Ti denotes the coordination deadline of the i-th inference task;

When executing an inference task, the power consumed during computation is the active power Pactive, the sum of the dynamic power PD, the static power PS, and the constant power PC;

The power consumed when the computing platform is idle is the idle power Pidle, the sum of the static power PS and the constant power PC;

PD = C·V²·f

PS = V·Ntr·IS

where C is the capacitance of the switching logic gates; V is the processor supply voltage; f is the processor clock frequency; Ntr is the number of logic gates; IS is the normalized static current of each logic gate;

The energy consumption of each inference within Ti is:

Figure BDA0004146935970000161

The network layers before the early exit run at (VH, fH); after the network exits at time ti, the computing platform lowers the voltage and frequency to the lowest level (VL, fL) through dynamic voltage and frequency scaling (DVFS). The energy consumption of inference task ji is therefore expressed as:

Figure BDA0004146935970000162

Ei denotes the energy consumption of inference task ji; VH denotes the high processor supply voltage; fH denotes the high processor clock frequency; VL denotes the low processor supply voltage;

Within an inference task ji, the network exit position is first predicted, establishing the number of remaining layers needed to complete the inference; then, based on the remaining layers and the worst-case execution time Ti of the current inference, (V, f) is established as a prediction; according to the amount of inference remaining within Tc, a cross-inference calibrator is enabled through a feedback control strategy and (V, f) is calibrated according to the multi-inference progress, the total inference workload, and the time constraint; based on the prediction and calibration, the processor is updated to the appropriate computing configuration at runtime through DVFS.

Those skilled in the art know that, in addition to implementing the system, apparatus, and modules provided by the present invention purely as computer-readable program code, the method steps can be logically programmed so that the system, apparatus, and modules realize the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, the system, apparatus, and modules provided by the present invention can be regarded as hardware components, and the modules included therein for realizing various programs can also be regarded as structures within the hardware components; the modules for realizing various functions can also be regarded both as software programs implementing the method and as structures within the hardware components.

Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the above specific embodiments; those skilled in the art can make various changes or modifications within the scope of the claims without affecting the essence of the present invention. In the absence of conflict, the embodiments of the present application and the features therein can be combined with each other at will.

Claims (10)

1. A neural network dynamic-exit lightweight method for multiple continuous inferences, characterized by comprising:
Step 1: constructing a neural-network-based inference model; for each inference within a preset small time range, predicting where the network exits and correspondingly predicting the computing configuration, the computing configuration including frequency and voltage; for multiple inferences within a preset large time range, calibrating the processor frequency and voltage based on the remaining inference workload and the time constraint;
Step 2: executing the neural network according to the predicted and calibrated computing configuration, thereby realizing dynamic voltage and frequency scaling.
2. The neural network dynamic-exit lightweight method for multiple continuous inferences according to claim 1, characterized in that the coordination cycle of inference, i.e., the inference task deadline, is set during the inference process and expressed as:
Figure FDA0004146935940000011
Figure FDA0004146935940000012
where λ denotes the tightness of the task deadline, 0 < λ ≤ 1 (the smaller λ, the tighter the deadline); ti denotes the actual completion time of each inference; Tc is the coordination cycle; M is the number of inference tasks; Ti denotes the coordination deadline of the i-th inference task.
3. The neural network dynamic-exit lightweight method for multiple continuous inferences according to claim 2, characterized in that, when executing an inference task, the power consumed during computation is the active power Pactive, the sum of the dynamic power PD, the static power PS, and the constant power PC; the power consumed when the computing platform is idle is the idle power Pidle, the sum of the static power PS and the constant power PC;
PD = C·V²·f
PS = V·Ntr·IS
where C is the capacitance of the switching logic gates; V is the processor supply voltage; f is the processor clock frequency; Ntr is the number of logic gates; IS is the normalized static current of each logic gate; the energy consumption of each inference within Ti is:
Figure FDA0004146935940000013
the network layers before the early exit run at (VH, fH); after the network exits at time ti, the computing platform lowers the voltage and frequency to the lowest level (VL, fL) through dynamic voltage and frequency scaling (DVFS); therefore, the energy consumption of inference task ji is expressed as:
(equation image: energy consumption Ei of inference task ji)
where Ei is the energy consumption of inference task ji; VH and fH are the processor's high supply voltage and high clock frequency; and VL is the processor's low supply voltage.
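As a rough check of the power and energy model in this claim, the relations PD = C·V²·f, PS = V·Ntr·IS, Pactive = PD + PS + PC, and Pidle = PS + PC can be sketched in Python. The per-task energy equations survive only as images, so the split of Ei into an active phase until the exit time and an idle phase until the deadline Ti is an assumption based on the surrounding text; all numeric values are illustrative.

```python
def dynamic_power(c, v, f):
    """Dynamic (switching) power: PD = C * V^2 * f."""
    return c * v * v * f

def static_power(v, n_tr, i_s):
    """Static (leakage) power: PS = V * Ntr * IS."""
    return v * n_tr * i_s

def task_energy(t_exit, t_deadline, p_active_high, p_idle_low):
    """Assumed form of Ei: run at (VH, fH) until the early exit at t_exit,
    then idle at (VL, fL) until the per-task deadline Ti."""
    assert 0.0 <= t_exit <= t_deadline
    return p_active_high * t_exit + p_idle_low * (t_deadline - t_exit)

# Illustrative numbers only
p_c = 0.1                                        # constant power PC
p_d = dynamic_power(c=1e-9, v=1.0, f=1e9)        # ~1 W dynamic power
p_s = static_power(v=1.0, n_tr=1e6, i_s=1e-7)    # ~0.1 W static power
p_active = p_d + p_s + p_c                       # Pactive = PD + PS + PC
p_idle = p_s + p_c                               # Pidle = PS + PC
e_i = task_energy(0.5, 1.0, p_active, p_idle)    # exit halfway to deadline
```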
4. The neural-network dynamic-exit lightweighting method for multiple consecutive inferences according to claim 3, wherein, within an inference task ji, the network exit position is first predicted and the number of remaining layers needed to complete the inference is established; (V, f) is then predicted from the remaining layers and the worst-case execution time Ti of the current inference; based on the inference workload remaining within Tc, a cross-inference calibrator is enabled through a feedback control strategy, calibrating (V, f) according to the multi-inference progress, the total inference workload, and the time constraint; and, according to the prediction and calibration, the processor is updated at runtime to the appropriate computing configuration through DVFS.

5. The neural-network dynamic-exit lightweighting method for multiple consecutive inferences according to claim 3, wherein, in inference task ji, the exit layers comprise a BoF pooling layer and an FC layer. Let yi be the intermediate result of the i-th layer and Nc the number of object classes in the dataset. The BoF pooling layer aggregates the features extracted from yi: a set of feature vectors called a codebook is used to describe yi, and the weight of each codebook entry is generated by measuring the similarity between that entry and yi. Since the number of codebook weights is larger than Nc, the FC layer further acts as a classifier that adjusts the BoF pooling result to

(equation image: Nc-dimensional class estimate)

thereby estimating the final output of the exit layer; the classifier's estimate is:

(equation image: classifier estimate)
where fw(x, i) is the intermediate result of the i-th layer; x is the initial network input; Wi are the parameters of the i-th exit layer, the exit layer being the BoF pooling layer and the FC layer; and

(equation image)

is the classifier's estimate;
The parameters of the exit layers are trained offline, and the same cross-entropy loss function is used when training Wi:
H(p, q) = -∑i p(i)·log q(i)
where p(i) and q(i) are the actual and predicted class distributions of each object, respectively, and H(p, q) is the cross-entropy;
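The loss equation itself survives only as an image in the original, but the text identifies it as the cross-entropy between the actual and predicted class distributions, whose standard form can be sketched as:

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p(i) * log(q(i)); p is the actual class
    distribution, q the predicted one (both sum to 1).  eps guards
    against log(0) for illustration."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))

# One-hot target: the closer the prediction is to p, the lower the loss
p = [0.0, 1.0, 0.0]
loss_good = cross_entropy(p, [0.1, 0.8, 0.1])
loss_bad = cross_entropy(p, [0.4, 0.2, 0.4])
```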
When training the exit layers, the training-set data {x1, x2, …, xN} are input to the model, where N is the number of training samples, and the exit-layer parameters are optimized by batch gradient descent. Let Ltotal be the number of layers in the network and let the target vector set be

(equation image: target vector set)

whose elements

(equation image)

are the correct results. First, without any exit layers, the model is trained with the following formula:
(equation image: batch gradient-descent update of W)
where W′ is the updated overall network parameter set; η is the learning rate, which is adjusted if the accuracy does not meet the requirement; j denotes the vector position;

(equation image)

denotes the parameter gradient; xj denotes a training-data vector element; and rj denotes a target-data vector element;
The original model W is then fixed and, on the basis of the exit layers, the model is trained again to optimize the parameters of the BoF pooling layer and the FC layer:
(equation image: exit-layer parameter update)
After the exit layers are trained, the average feature weight μi is computed as the parameter for the exit decision:
(equation image: average feature weight μi)
where k denotes the k-th overall classification target.

During inference on the initial network input xj, at each early-exit layer the weight ratio αW is the maximum feature weight divided by μi and multiplied by a user-specified hyperparameter β; the larger the maximum feature weight, the more confident the classification is considered. If αW is greater than 1, inference terminates and the result of this early-exit layer is applied as the final result:
(equation image: early-exit decision rule)
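The exit rule can be sketched as follows, reading "the maximum feature weight divided by μi multiplied by β" as αW = (max weight / μi)·β; that reading, and the numeric values, are assumptions.

```python
def should_exit(feature_weights, mu_i, beta):
    """Early-exit test at an exit layer: alpha_w = (max weight / mu_i) * beta.
    Exit (return True) when alpha_w > 1, i.e. the classifier is confident
    enough that its result can be taken as final."""
    alpha_w = (max(feature_weights) / mu_i) * beta
    return alpha_w > 1.0

# A confident prediction (one dominant weight) exits; a flat one does not
assert should_exit([0.05, 0.85, 0.10], mu_i=0.33, beta=0.5)
assert not should_exit([0.34, 0.33, 0.33], mu_i=0.33, beta=0.5)
```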
6. The neural-network dynamic-exit lightweighting method for multiple consecutive inferences according to claim 5, wherein the predictor starts predicting from layer L0 of the neural network, the predictor input being the execution-result vector y0 of layer L0 and its corresponding exit-layer execution-result vector

(equation image)
The above result is expanded in dimension: zeros are appended at both ends of the vector

(equation image)

expanding it from

(equation image)

dimensions to

(equation image)

dimensions, the expanded vector being denoted

(equation image)

where Nc is the vector length, which is also the number of inference-result classes, and K is the number of feature vectors in the feature pooling group; the computation is given by:

(equation image: zero-padding formula)
where l denotes the l-th element;

the zero-padded output vector

(equation image)

then undergoes a one-dimensional convolution, yielding a new feature-weight vector

(equation image)

where the convolution weight is a one-dimensional vector h of length K; the computation is given by:

(equation image: one-dimensional convolution formula)
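A sketch of the zero-pad-then-convolve step of the predictor. The exact padded dimension survives only as an equation image, so padding K−1 zeros at each end is an assumption; the length-K kernel h matches the claim.

```python
def zero_pad(v, k):
    """Pad K-1 zeros at each end of the exit-layer result vector
    (the exact pad width is an assumption; the claim's dimension
    expression survives only as an image)."""
    pad = [0.0] * (k - 1)
    return pad + list(v) + pad

def conv1d_valid(v, h):
    """Valid 1-D convolution of v with the length-K kernel h."""
    k = len(h)
    return [sum(h[j] * v[i + j] for j in range(k))
            for i in range(len(v) - k + 1)]

def predict_next_exit(y_hat, h):
    """One predictor step: zero-pad, then convolve to get the new
    feature-weight vector for the next exit layer."""
    return conv1d_valid(zero_pad(y_hat, len(h)), h)

# K = 2 kernel on a length-3 vector (illustrative values)
step1 = predict_next_exit([1.0, 2.0, 3.0], [1.0, 1.0])
```

Feeding `predict_next_exit` its own output implements the recursion described next in the claim, producing estimates for the exit layers at L0+2, L0+3, and so on.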
By recursion, the intermediate estimate

(equation image)

is replaced with

(equation image)

and the formula is applied repeatedly, giving the predicted result of the exit layer at position L0+2, denoted

(equation image)

Placing the exit-layer prediction at any layer after L0 yields the PREDICT function, described as follows:

(equation image: PREDICT function)
where

(equation image)

denotes the minimum predicted exit point after L0;

(equation image: exit-point condition)
where

(equation image)

is the predicted exit confidence of layer L0+ζ; L0+ζ is therefore the exit point predicted by the prediction function. If no such ζ can be found, a hyperparameter τ is introduced, i.e., ζ∈[1,τ] with τ≤Ltotal−L0; the upper bound τ forbids predictions beyond the last layer. If no integer in [1,τ] satisfies the condition, ζ=τ is set.
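The selection of ζ with fallback to τ can be sketched directly from this paragraph; treating "confidence exceeds 1" as the exit condition mirrors the αW > 1 rule of claim 5 and is an assumption, since the condition itself appears only as an image.

```python
def find_zeta(confidences, tau):
    """Return the smallest zeta in [1, tau] whose predicted exit
    confidence passes the (assumed) threshold of 1.0; if none does,
    fall back to zeta = tau as in the claim.
    confidences[z - 1] is the predicted confidence at layer L0 + z."""
    for zeta in range(1, tau + 1):
        if confidences[zeta - 1] > 1.0:
            return zeta
    return tau

assert find_zeta([0.4, 0.9, 1.3, 1.5], tau=4) == 3
assert find_zeta([0.2, 0.3, 0.4], tau=3) == 3   # no layer qualifies
```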
The current prediction result ζ of PREDICT is converted into an intermediate frequency fM,i. For inference task ji, during the interval from the start of inference to the prediction at L0, a relatively high frequency

(equation image)

is conservatively applied so that all Ltotal layers of the model can be completed in time without an early exit:

(equation image)
where fH is the computing platform's default maximum frequency, at which the platform can complete inference with all Ltotal layers and does not exit early because of the deadline Ti; the frequency

(equation image)

is calibrated by Δfi.
When the prediction is made at L0, f is adjusted to the intermediate frequency fM,i according to the predicted exit point, and the network runs until it exits at Ti. Given a network with Ltotal layers and early prediction ζ, the intermediate frequency is computed as:

(equation image: intermediate-frequency formula)

where fM,i is the lowest frequency predicted to complete the inference by time Ti.
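The fM,i formula survives only as an image; a sketch consistent with "the lowest frequency predicted to complete inference by Ti" is a proportional work-over-time model, clamped to the platform's frequency range. The per-layer cycle count and all numeric values are illustrative assumptions.

```python
def intermediate_frequency(layers_left, cycles_per_layer, time_left,
                           f_low, f_high):
    """Lowest frequency that finishes the predicted remaining work by the
    per-task deadline, clamped to [fL, fH].  The proportional work/time
    model is an assumption; the claim's fM,i formula survives only as an
    equation image."""
    assert time_left > 0
    f_needed = layers_left * cycles_per_layer / time_left
    return min(max(f_needed, f_low), f_high)

# 4 predicted layers of 1e6 cycles each, 8 ms left -> 0.5 GHz needed
f = intermediate_frequency(4, 1e6, 8e-3, f_low=2e8, f_high=2e9)
```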
7. The neural-network dynamic-exit lightweighting method for multiple consecutive inferences according to claim 6, wherein, for inference task ji+1, the cross-inference DVFS strategy provides a calibration suggestion Δfi+1 after all previous tasks, including ji, are completed; the calibration is implemented by a discrete incremental proportional-integral (PI) regulator:

Δfi+1 = Δfi + (KP + KI(ti − ti−1))e(ti) − KPe(ti−1)

where KP and KI are the proportional and integral coefficients, respectively; the index i denotes the i-th inference task within the current coordination period; ti denotes the relative completion time of inference ji from the start of the coordination period; and ti − ti−1 is the time interval between the completions of inference tasks ji and ji−1.

The input deviation e(ti) of the PI regulator evaluates the inference progress against Tc, based on the total inference workload and the inference-execution progress since the start of the coordination period:

(equation image: input deviation e(ti))

where the first term is the reference speed for inference-workload balancing and the second term is the processing speed so far.
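The discrete incremental PI update above is given explicitly in the claim and translates directly to code; only the gain and deviation values in the example are illustrative, and e(ti) values are fed in directly since its defining equation appears only as an image.

```python
def pi_calibration_step(df_prev, e_now, e_prev, t_now, t_prev, kp, ki):
    """Discrete incremental PI regulator from the claim:
    df_next = df_prev + (KP + KI*(t_now - t_prev)) * e(t_now) - KP * e(t_prev)
    df_prev is the previous calibration suggestion, e_now/e_prev the
    current and previous progress deviations, t_now/t_prev the task
    completion times relative to the coordination period start."""
    return df_prev + (kp + ki * (t_now - t_prev)) * e_now - kp * e_prev

# One calibration step with illustrative gains and deviations
df_next = pi_calibration_step(0.0, 1.0, 0.5, 2.0, 1.0, kp=0.5, ki=0.1)
```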
8. The neural-network dynamic-exit lightweighting method for multiple consecutive inferences according to claim 7, wherein the frequency and its corresponding voltage are configured by a DVFS governor, and the network is executed according to the small-time-scale prediction and the large-time-scale calibration, respectively. At the start of each inference, Δfi is derived from the cross-inference calibration, and the DVFS governor sets the computing configuration to

(equation image)

to conservatively complete an inference that does not exit early. When the early-exit prediction is made at inference layer L0, the DVFS governor establishes the appropriate frequency configuration

(equation image)

from fM,i, derived by the intra-inference predictor, and Δfi, derived by the cross-inference calibration; the expression is:

(equation image)
The resulting energy consumption is therefore:

(equation image: energy consumption)

where

(equation image)

is the completion time of the prediction layer L0 and the network layers of ji before it, and

(equation image)

is the voltage corresponding to a given

(equation image)

Since f is selected based on the remaining workload predicted within the inference, the intra-inference prediction enables ji to be completed by Ti, while the cross-inference calibration targets completing the M sequential inferences within Tc.
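The governor's frequency expression in claim 8 appears only as an image; combining the intra-inference prediction fM,i additively with the cross-inference calibration Δfi and clamping to [fL, fH] is a plausible sketch under those assumptions, not the claim's exact formula.

```python
def governor_frequency(f_m, delta_f, f_low, f_high):
    """Frequency configuration at the prediction layer L0: combine the
    intra-inference prediction fM,i with the cross-inference calibration
    delta_f.  Additive combination and clamping are assumptions; the
    claim's expression survives only as an equation image."""
    return min(max(f_m + delta_f, f_low), f_high)

# Calibration pushes the predicted frequency up, but never past fH
assert governor_frequency(5e8, 1e8, 2e8, 2e9) == 6e8
assert governor_frequency(1.95e9, 2e8, 2e8, 2e9) == 2e9
```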
9. A neural-network dynamic-exit lightweighting system for multiple consecutive inferences, comprising:

Module M1: builds a neural-network-based inference model and, for each inference within a preset small time scale, predicts where the network will exit and predicts the computing configuration accordingly, the computing configuration comprising frequency and voltage;

Module M2: for multiple inferences within a preset large time scale, calibrates the processor frequency and voltage according to the remaining inference workload and the time constraint;

Module M3: executes the neural network with the computing configuration obtained from the prediction and calibration, thereby achieving dynamic voltage and frequency scaling.

10. The neural-network dynamic-exit lightweighting system for multiple consecutive inferences according to claim 9, wherein a coordination period of inference, i.e., the inference-task deadline, is set during inference:
(equation image: definition of the coordination period Tc)
(equation image: definition of the per-task coordination deadline Ti)
where λ denotes the tightness of the task deadline, 0<λ≤1 (the smaller λ, the tighter the deadline); ti denotes the actual completion time of each inference; Tc is the coordination period; M is the number of inference tasks; and Ti denotes the coordination deadline of the i-th inference task.

When an inference task is executed, the power consumed during computation is the active power Pactive, the sum of the dynamic power PD, the static power PS, and the constant power PC; the power consumed while the computing platform is idle is the idle power Pidle, the sum of the static power PS and the constant power PC:

PD = C·V²·f

PS = V·Ntr·IS

where C is the capacitance of the switched logic gates; V is the processor supply voltage; f is the processor clock frequency; Ntr is the number of logic gates; and IS is the normalized static current of each logic gate. The energy consumption of each inference within Ti is:

(equation image: per-inference energy within Ti)

The network layers before the early exit run at (VH, fH); after the network exits at time ti, the computing platform lowers the voltage and frequency to the lowest level (VL, fL) via DVFS. The energy consumption of inference task ji is therefore:

(equation image: energy consumption Ei of inference task ji)

where Ei is the energy consumption of inference task ji; VH and fH are the processor's high supply voltage and high clock frequency; and VL is the processor's low supply voltage.

Within an inference task ji, the network exit position is first predicted and the number of remaining layers needed to complete the inference is established; (V, f) is then predicted from the remaining layers and the worst-case execution time Ti of the current inference; based on the inference workload remaining within Tc, a cross-inference calibrator is enabled through a feedback control strategy, calibrating (V, f) according to the multi-inference progress, the total inference workload, and the time constraint; and, according to the prediction and calibration, the processor is updated at runtime to the appropriate computing configuration through DVFS.
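The three modules of the claimed system can be tied together in a toy coordination loop: M1 predicts a frequency per inference, M2 applies a PI calibration in the style of claim 7, and M3 stands in for the DVFS governor. The workload model, units (GHz, gigacycles), progress-deviation definition, and gains are all illustrative assumptions.

```python
def run_coordination_period(tasks, t_c, f_low, f_high, kp=0.5, ki=0.1):
    """Sketch of modules M1-M3 over one coordination period Tc.
    tasks: list of (predicted_layers, gigacycles_per_layer, time_budget);
    frequencies are in GHz.  All modeling choices are illustrative."""
    df, e_prev, t_prev, elapsed, log = 0.0, 0.0, 0.0, 0.0, []
    for i, (layers, cpl, budget) in enumerate(tasks, start=1):
        # M1: lowest frequency finishing the predicted work in the budget
        f_pred = layers * cpl / budget
        # M3: calibrated configuration, clamped to the platform range
        f_set = min(max(f_pred + df, f_low), f_high)
        elapsed += layers * cpl / f_set        # task completion time
        # M2: progress deviation (positive when behind schedule) + PI step
        e_now = elapsed / t_c - i / len(tasks)
        df = df + (kp + ki * (elapsed - t_prev)) * e_now - kp * e_prev
        e_prev, t_prev = e_now, elapsed
        log.append(f_set)
    return log, elapsed

# Three identical tasks; the loop slows down when ahead of schedule
log, elapsed = run_coordination_period([(4, 1e-3, 8e-3)] * 3, t_c=0.03,
                                       f_low=0.2, f_high=2.0)
```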
CN202310306228.2A 2023-03-24 2023-03-24 Neural network dynamic exit lightweight method and system for multiple continuous reasoning Pending CN116227558A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310306228.2A CN116227558A (en) 2023-03-24 2023-03-24 Neural network dynamic exit lightweight method and system for multiple continuous reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310306228.2A CN116227558A (en) 2023-03-24 2023-03-24 Neural network dynamic exit lightweight method and system for multiple continuous reasoning

Publications (1)

Publication Number Publication Date
CN116227558A true CN116227558A (en) 2023-06-06

Family

ID=86571361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310306228.2A Pending CN116227558A (en) 2023-03-24 2023-03-24 Neural network dynamic exit lightweight method and system for multiple continuous reasoning

Country Status (1)

Country Link
CN (1) CN116227558A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116894469A (en) * 2023-09-11 2023-10-17 西南林业大学 DNN collaborative reasoning acceleration method, device and medium in end-edge cloud computing environment
CN116894469B (en) * 2023-09-11 2023-12-15 西南林业大学 DNN collaborative reasoning acceleration method, device and medium in end-edge cloud computing environment

Similar Documents

Publication Publication Date Title
Cui et al. Reinforcement learning for optimal primary frequency control: A Lyapunov approach
CN110137942A (en) Multiple Time Scales flexible load rolling scheduling method and system based on Model Predictive Control
CN113657661A (en) Enterprise carbon emission prediction method and device, computer equipment and storage medium
Belgioioso et al. Online feedback equilibrium seeking
CN108321795A (en) Start-stop of generator set configuration method based on depth deterministic policy algorithm and system
US7444272B2 (en) Self-modulation in a model-based automated management framework
CN118466224B (en) Flow control method and system for electric propulsion system
Köhler et al. Real time economic dispatch for power networks: A distributed economic model predictive control approach
CN116227558A (en) Neural network dynamic exit lightweight method and system for multiple continuous reasoning
Carvalho et al. Autonomous power management in mobile devices using dynamic frequency scaling and reinforcement learning for energy minimization
HasanzadeZonuzy et al. Model-based reinforcement learning for infinite-horizon discounted constrained markov decision processes
Chen et al. Performance optimization of machine learning inference under latency and server power constraints
CN114997370B (en) Low-power neural network system based on predictive exit and its implementation method
Ferranti et al. A parallel dual fast gradient method for MPC applications
Kotary et al. Learning constrained optimization with deep augmented lagrangian methods
Chen et al. Quality optimization of adaptive applications via deep reinforcement learning in energy harvesting edge devices
CN115391048A (en) Micro-service instance dynamic horizontal expansion and contraction method and system based on trend prediction
CN114265674B (en) Task planning method and related device based on reinforcement learning under temporal logic constraints
Rostam et al. A hybrid Gaussian process approach to robust economic model predictive control
CN119225390A (en) Control model optimization method, device, equipment, storage medium and product
Kang et al. Power-and time-aware deep learning inference for mobile embedded devices
US11407327B1 (en) Controlling ongoing usage of a battery cell having one or more internal supercapacitors and an internal battery
Maasoumy et al. Comparison of control strategies for energy efficient building HVAC systems
Li et al. EENet: Energy efficient neural networks with run-time power management
Cui et al. Structured neural-pi control for networked systems: Stability and steady-state optimality guarantees

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination