CN113315716B

CN113315716B - Training method and equipment of congestion control model and congestion control method and equipment

Info

Publication number: CN113315716B
Application number: CN202110592772.9A
Authority: CN
Inventors: 周超; 陈艳姣; 夏振厂
Original assignee: Wuhan University WHU; Beijing Dajia Internet Information Technology Co Ltd
Current assignee: Wuhan University WHU; Beijing Dajia Internet Information Technology Co Ltd
Priority date: 2021-05-28
Filing date: 2021-05-28
Publication date: 2023-05-02
Anticipated expiration: 2041-05-28
Also published as: CN113315716A

Abstract

The present disclosure provides a method and device for training a congestion control model, and a congestion control method and device. The congestion control method includes: obtaining current first network state information and the current application's preference for network transmission performance; inputting the obtained first network state information and the preference into the congestion control model to obtain a predicted action to be executed for adjusting the size of the congestion window; and executing the predicted action to reset the congestion window.

Description

Congestion control model training method and device, congestion control method and device

Technical Field

The present disclosure relates generally to the field of communication technologies and, more specifically, to a method and device for training a congestion control model and to a congestion control method and device.

Background

In recent years, many congestion control protocols, including heuristic protocols and learning-based protocols, have been proposed to address network congestion and improve network performance.

The learning-based congestion control protocols PCC and PCC Vivace learn the relationship between rate-control behavior and observed performance online. To avoid the hard-wired mapping between collected states and actions found in traditional TCP variants, they select the best sending rate using online learning techniques that repeatedly adjust the sending rate within a small range to approach better utility-function performance. PCC and PCC Vivace can achieve good performance. Learning-based congestion control protocols learn a congestion control policy by interacting with the environment, and this policy selects appropriate actions to control the sending rate or the congestion window according to the network state. However, learning-based congestion control protocols are driven by pre-designed reward or objective functions that are fixed. When new applications emerge, these protocols cannot meet their performance requirements, so the objective function must be redesigned and a new model retrained.

Summary

Exemplary embodiments of the present disclosure provide a method and device for training a congestion control model, and a congestion control method and device, to solve at least the above problems in the related art, although they are not required to solve any of the above problems.

According to a first aspect of embodiments of the present disclosure, a method for training a congestion control model is provided, including: initializing the communication network environment used in the current training round; inputting the current training round's preference for network transmission performance and the current first network state information into the congestion control model to obtain a predicted action to be executed for adjusting the size of the congestion window; executing the predicted action to reset the congestion window, and controlling the sender to send data packets to the receiver under the currently set congestion window; when the sender receives an ACK message fed back by the receiver, calculating a loss function of the congestion control model according to the action, the first network state information before the action was executed, the first network state information after the action was executed, and the preference; and training the congestion control model by adjusting its model parameters according to the loss function, and determining whether to end the current training round, wherein, when it is determined not to end the current training round, the method returns to the step of inputting the current training round's preference for network transmission performance and the current first network state information into the congestion control model to obtain a predicted action to be executed for adjusting the size of the congestion window.
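The control flow of one training round in the first aspect can be sketched as follows. This is a minimal illustration only: the `env`, `model`, and `should_end` interfaces and all of their method names are hypothetical and do not come from the disclosure, and the loss computation and parameter update are left to the model object.

```python
def run_training_round(env, model, preference, should_end):
    """Run one training round and return the number of actions executed.

    Every interface below is a hypothetical stand-in, not a name from the source:
      env.reset()                       -> initial first network state
      env.step(action)                  -> first network state once the ACK arrives
      model.predict(state, preference)  -> action that resizes the congestion window
      model.loss(action, state, next_state, preference) -> loss value
      model.update(loss)                -> adjusts the model parameters
      should_end(next_state)            -> True when the round should end
    """
    state = env.reset()  # initialize the communication network environment
    steps = 0
    while True:
        # Predict the window-adjusting action from state and preference.
        action = model.predict(state, preference)
        # Reset the congestion window, send packets, wait for the ACK feedback.
        next_state = env.step(action)
        # Loss depends on the action, the states before/after it, and the preference.
        loss = model.loss(action, state, next_state, preference)
        model.update(loss)
        steps += 1
        if should_end(next_state):
            return steps
        # Otherwise return to the prediction step with the new state.
        state = next_state
```

Passing the end-of-round check in as a callable keeps the loop independent of how a round is judged to be finished.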

Optionally, the first network state information includes at least one of the following: the size of the congestion window, the delay, the packet acknowledgment rate, and the sending rate, wherein the delay, the packet acknowledgment rate, and the sending rate are determined based on the ACK messages fed back by the receiver.

Optionally, the preference for network transmission performance includes a degree of preference for at least one of the following: throughput, packet loss rate, and delay.

Optionally, the step of determining whether to end the current training round includes: determining whether to end the current training round according to changes in second network state information.

Optionally, the step of determining whether to end the current training round according to changes in the second network state information includes: when the second network state information after the action is executed satisfies a first preset condition, determining that the action is a winning action; when the second network state information after the action is executed satisfies a second preset condition, determining that the action is a losing action; when the number of consecutive winning actions reaches a first preset number, determining to end the current training round; when the number of consecutive losing actions reaches a second preset number, determining to end the current training round; and when the total number of executed actions reaches a third preset number, determining to end the current training round.

Optionally, the second network state information includes throughput and delay; the first preset condition is that the throughput is within 90%-110% of the bandwidth and the delay ≤ 0.7 × the timeout threshold; the second preset condition is that the throughput is within 50%-70% of the bandwidth and the delay ≥ 0.7 × the timeout threshold.
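The end-of-round rule above can be sketched as a small counter object. The two threshold conditions are taken from the text; the class name, the default values of the three preset numbers, the "tie" label for actions that meet neither condition, and the choice to reset a streak whenever the outcome changes are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RoundTerminator:
    bandwidth: float          # link bandwidth, same unit as throughput
    timeout_threshold: float  # timeout threshold, same unit as delay
    max_wins: int = 5         # "first preset number" (value is hypothetical)
    max_losses: int = 5       # "second preset number" (value is hypothetical)
    max_steps: int = 200      # "third preset number" (value is hypothetical)
    wins: int = 0             # current streak of consecutive winning actions
    losses: int = 0           # current streak of consecutive losing actions
    steps: int = 0            # total actions executed in this round

    def classify(self, throughput: float, delay: float) -> str:
        """Label one action as 'win', 'lose', or 'tie' from the state after it."""
        ratio = throughput / self.bandwidth
        limit = 0.7 * self.timeout_threshold
        if 0.9 <= ratio <= 1.1 and delay <= limit:
            return "win"   # first preset condition
        if 0.5 <= ratio <= 0.7 and delay >= limit:
            return "lose"  # second preset condition
        return "tie"       # neither condition holds (label is an assumption)

    def update(self, throughput: float, delay: float) -> bool:
        """Record one executed action; return True if the round should end."""
        outcome = self.classify(throughput, delay)
        self.steps += 1
        # Assumption: a streak counts consecutive outcomes and resets otherwise.
        self.wins = self.wins + 1 if outcome == "win" else 0
        self.losses = self.losses + 1 if outcome == "lose" else 0
        return (
            self.wins >= self.max_wins
            or self.losses >= self.max_losses
            or self.steps >= self.max_steps
        )
```

For example, with `max_wins=2`, two consecutive calls to `update` with a throughput at 95% of the bandwidth and a delay below the limit end the round on the second call.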

Optionally, the method further includes: initializing the size of the congestion window, wherein the step of initializing the size of the congestion window includes: estimating the bandwidth of the communication network, and determining the initial size of the congestion window based on the estimated bandwidth.

Optionally, the step of estimating the bandwidth of the communication network includes: determining the total number of ACK messages fed back by the receiver for N data packets sent by the sender; and determining the bandwidth of the communication network according to the average value obtained by dividing the total number by N.
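A rough sketch of the window initialization follows. Only the ratio of ACKs to the N probe packets comes from the text. How that average is turned into a bandwidth figure and then into an initial window is not specified here, so the scaling by a probe sending rate (`probe_rate_pps`) and the bandwidth-delay-product rule with `base_rtt_s` are assumptions, and both function names are hypothetical.

```python
def estimate_bandwidth(acks_received: int, packets_sent: int, probe_rate_pps: float) -> float:
    """Estimate bandwidth in packets per second from N probe packets."""
    if packets_sent <= 0:
        raise ValueError("packets_sent (N) must be positive")
    # From the text: average = total number of ACKs divided by N.
    ack_average = acks_received / packets_sent
    # ASSUMPTION: scale the average by the rate at which the probes were sent.
    return ack_average * probe_rate_pps


def initial_cwnd(bandwidth_pps: float, base_rtt_s: float, min_cwnd: int = 1) -> int:
    """Pick an initial congestion window (in packets) from the estimate."""
    # ASSUMPTION: use the bandwidth-delay product as the starting window.
    return max(min_cwnd, round(bandwidth_pps * base_rtt_s))
```

Starting near the estimated capacity rather than from a fixed small window is what allows different networks to begin with different initial windows.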

Optionally, the step of calculating the loss function of the congestion control model when the sender receives an ACK message fed back by the receiver includes: when the sender receives an ACK message fed back by the receiver, calculating the loss function of the congestion control model according to the action, the first network state information before the action was executed, the first network state information after the action was executed, the reward function of the action, and the preference.

Optionally, the reward function of the action is calculated based on the preference and third network state information after the action is executed, wherein the third network state information includes at least one of the following: packet loss rate, throughput, and delay.
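One common way to combine a preference with several performance measurements is a weighted sum of per-objective rewards. The sketch below follows that pattern as an illustration only: the text states the inputs to the reward function but not its form, so the normalization, the signs, the ordering (throughput, packet loss rate, delay), and both function names are assumptions.

```python
def reward_vector(throughput, loss_rate, delay, bandwidth, timeout_threshold):
    """Per-objective rewards, ordered as (throughput, packet loss rate, delay).

    ASSUMPTION: throughput is rewarded relative to the bandwidth, while loss
    and delay are penalized (delay relative to the timeout threshold).
    """
    return (
        throughput / bandwidth,
        -loss_rate,
        -delay / timeout_threshold,
    )


def scalar_reward(preference, rewards):
    """Weight each objective's reward by the preference for that objective."""
    if len(preference) != len(rewards):
        raise ValueError("preference and rewards must have the same length")
    return sum(w * r for w, r in zip(preference, rewards))
```

With a preference of `(1.0, 0.0, 0.0)` only throughput contributes to the reward, whereas `(0.0, 0.0, 1.0)` rewards low delay alone.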

Optionally, the congestion control model is built on a reinforcement learning algorithm, wherein the value function in the reinforcement learning algorithm is a function of the action, the first network state information, and the preference for network transmission performance.

Optionally, the action predicted by the congestion control model is, with probability ∈, an action selected at random from the action set and, with probability 1-∈, the optimal action obtained using the value function.
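The selection rule above is the usual epsilon-greedy scheme and can be sketched as follows. The function name and the representation of the value function as a list of per-action values (already evaluated for the current state and preference) are assumptions for illustration.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Return the index of the chosen action in the action set.

    q_values: value-function outputs, one per action, for the current
              state and preference (hypothetical representation).
    epsilon:  probability of picking a random action.
    """
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be in [0, 1]")
    if rng.random() < epsilon:
        # Explore: pick uniformly from the action set.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest value.
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With `epsilon=0.0` the choice is always the highest-valued action, and with `epsilon=1.0` it is always random.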

Optionally, the loss function of the congestion control model is calculated based on a loss function L^S(θ), which brings the value function closer to the maximum reward function, and an auxiliary loss function L^T(θ).

Optionally, the loss function of the congestion control model is expressed as (1-ε)·L^S(θ)+ε·L^T(θ), where ε is a trade-off index with 0≤ε≤1; the later an action is predicted within a training round, the larger the value of ε used when calculating the loss function of the congestion control model for that action.
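The weighting of the two loss terms can be sketched numerically as follows. The combination formula is the one given above. The text only requires ε to grow for later actions in a round, so the linear schedule in `tradeoff_index` is an assumption, and both function names are hypothetical.

```python
def tradeoff_index(step: int, total_steps: int) -> float:
    """Return epsilon for the action at position `step` in a round.

    ASSUMPTION: a linear ramp from 0 to 1; any non-decreasing schedule
    within [0, 1] would satisfy the stated condition.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(1.0, max(0.0, step / total_steps))


def combined_loss(loss_s: float, loss_t: float, epsilon: float) -> float:
    """Compute (1 - epsilon) * L^S(theta) + epsilon * L^T(theta)."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be in [0, 1]")
    return (1.0 - epsilon) * loss_s + epsilon * loss_t
```

Early in a round the combined loss is dominated by L^S(θ), and the auxiliary term L^T(θ) gains weight as the round progresses.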

Optionally, the objective function of the congestion control model is a composite objective function of the following: the reward function, the value function, the first network state information after the action is executed, the first network state information before the action is executed, the action, the current training round's preference for network transmission performance, and the best preference under the current network environment.

Optionally, the method further includes: when it is determined to end the current training round, determining whether to end the training process of the congestion control model; and when it is determined not to end the training process of the congestion control model, returning to the step of initializing the communication network environment used in the current training round, so as to enter the next training round.

According to a second aspect of embodiments of the present disclosure, a congestion control method is provided, including: obtaining current first network state information and the current application's preference for network transmission performance; inputting the obtained first network state information and the preference into a congestion control model to obtain a predicted action to be executed for adjusting the size of the congestion window; and executing the predicted action to reset the congestion window.
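A single step of the congestion control method can be sketched as below. The three-entry action set, the lower bound of one packet on the window, and the treatment of the trained model as a callable returning an action index are all assumptions for illustration; the text here does not define the action set.

```python
# Hypothetical action set: each entry maps the current window to a new one.
ACTIONS = (
    lambda cwnd: cwnd - 1,  # shrink the congestion window
    lambda cwnd: cwnd,      # keep the congestion window
    lambda cwnd: cwnd + 1,  # grow the congestion window
)


def control_step(model, state, preference, cwnd):
    """Predict an action from state and preference, then reset the window.

    model: callable (state, preference) -> index into ACTIONS; it stands in
           for the trained congestion control model.
    """
    action = model(state, preference)
    new_cwnd = ACTIONS[action](cwnd)
    # ASSUMPTION: never let the window drop below one packet.
    return max(1, new_cwnd)
```

Because the preference is an input to every step, the same trained model can be driven by a different preference for each application without retraining.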

Optionally, the method further includes: initializing the size of the congestion window, wherein the step of initializing the size of the congestion window includes: estimating the bandwidth of the communication network, and determining the initial size of the congestion window based on the estimated bandwidth.

Optionally, the step of estimating the bandwidth of the communication network includes: determining the total number of ACK messages fed back by the receiver for N sent data packets; and determining the bandwidth of the communication network according to the average value obtained by dividing the total number by N.

Optionally, the congestion control model is obtained by training with the training method described above.

According to a third aspect of embodiments of the present disclosure, a device for training a congestion control model is provided, including: an environment initialization unit configured to initialize the communication network environment used in the current training round; a prediction unit configured to input the current training round's preference for network transmission performance and the current first network state information into the congestion control model to obtain a predicted action to be executed for adjusting the size of the congestion window; a congestion window setting unit configured to execute the predicted action to reset the congestion window, and to control the sender to send data packets to the receiver under the currently set congestion window; a loss function calculation unit configured to, when the sender receives an ACK message fed back by the receiver, calculate a loss function of the congestion control model according to the action, the first network state information before the action was executed, the first network state information after the action was executed, and the preference; a training unit configured to train the congestion control model by adjusting its model parameters according to the loss function; and a round end determination unit configured to determine whether to end the current training round, wherein, when it is determined not to end the current training round, the prediction unit inputs the current training round's preference for network transmission performance and the current first network state information into the congestion control model to obtain a predicted action to be executed for adjusting the size of the congestion window.

Optionally, the round end determination unit is configured to determine whether to end the current training round according to changes in second network state information.

Optionally, the round end determination unit is configured to: when the second network state information after the action is executed satisfies a first preset condition, determine that the action is a winning action; when the second network state information after the action is executed satisfies a second preset condition, determine that the action is a losing action; when the number of consecutive winning actions reaches a first preset number, determine to end the current training round; when the number of consecutive losing actions reaches a second preset number, determine to end the current training round; and when the total number of executed actions reaches a third preset number, determine to end the current training round.

Optionally, the device further includes: a window initialization unit configured to initialize the size of the congestion window, wherein the window initialization unit is configured to estimate the bandwidth of the communication network and determine the initial size of the congestion window based on the estimated bandwidth.

Optionally, the window initialization unit is configured to determine the total number of ACK messages fed back by the receiver for N data packets sent by the sender, and to determine the bandwidth of the communication network according to the average value obtained by dividing the total number by N.

Optionally, the loss function calculation unit is configured to, when the sender receives an ACK message fed back by the receiver, calculate the loss function of the congestion control model according to the action, the first network state information before the action was executed, the first network state information after the action was executed, the reward function of the action, and the preference.

Optionally, the device further includes: a training end determination unit configured to, when it is determined to end the current training round, determine whether to end the training process of the congestion control model, wherein, when it is determined not to end the training process of the congestion control model, the environment initialization unit initializes the communication network environment used in the current training round, so as to enter the next training round.

According to a fourth aspect of embodiments of the present disclosure, a congestion control device is provided, including: an acquisition unit configured to obtain current first network state information and the current application's preference for network transmission performance; a prediction unit configured to input the obtained first network state information and the preference into a congestion control model to obtain a predicted action to be executed for adjusting the size of the congestion window; and a congestion window setting unit configured to execute the predicted action to reset the congestion window.

Optionally, the device further includes: a window initialization unit configured to initialize the size of the congestion window, wherein the window initialization unit is configured to estimate the bandwidth of the communication network and determine the initial size of the congestion window based on the estimated bandwidth.

Optionally, the window initialization unit is configured to determine the total number of ACK messages fed back by the receiver for N sent data packets, and to determine the bandwidth of the communication network according to the average value obtained by dividing the total number by N.

Optionally, the congestion control model is obtained by training with the training device described above.

According to a fifth aspect of embodiments of the present disclosure, an electronic device is provided, including: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when run by the at least one processor, cause the at least one processor to execute the method for training a congestion control model described above and/or the congestion control method described above.

According to a sixth aspect of embodiments of the present disclosure, a computer-readable storage medium is provided, wherein instructions in the computer-readable storage medium, when run by at least one processor, cause the at least one processor to execute the method for training a congestion control model described above and/or the congestion control method described above.

According to a seventh aspect of embodiments of the present disclosure, a computer program product is provided, including computer instructions that, when executed by at least one processor, implement the method for training a congestion control model described above and/or the congestion control method described above.

The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects:

The congestion control model according to exemplary embodiments of the present disclosure can select the best congestion control policy according to an application's preference for transmission performance, and can therefore meet the transmission performance requirements of different applications without redesigning the objective function or retraining the model. The congestion control method according to exemplary embodiments of the present disclosure is suitable for congestion control of various types of applications, can trade off throughput, delay, and packet loss, and can meet the transmission performance requirements of different types of applications;

The multi-objective reinforcement learning network of the congestion control model according to exemplary embodiments of the present disclosure can be optimized over the entire preference space of congestion control, so that the trained model can produce an optimal policy for any given preference. This fundamentally changes the design of existing protocols, whose objective or utility functions are fixed, and offers a significant advantage in serving different types of applications;

By setting different initial congestion window values for different network bandwidths, network convergence can be effectively accelerated;

By deciding when to end a training round according to changes in the network environment, a training round can be ended at an appropriate time based on changes in network bandwidth utilization, delay, and throughput, which improves the training efficiency of the model;

In addition, a win-lose-tie method for interrupting training rounds is proposed to improve the training quality of the congestion control model, which solves the problem of spurious training interruptions that arises when multi-objective reinforcement learning is applied to congestion control.

It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.

Brief Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure; they do not unduly limit the present disclosure.

FIG. 1 shows a schematic diagram of an implementation scenario of a congestion control method and device according to an exemplary embodiment of the present disclosure;

FIG. 2 shows a flowchart of a method for training a congestion control model according to an exemplary embodiment of the present disclosure;

FIG. 3 shows a flowchart of a congestion control method according to an exemplary embodiment of the present disclosure;

FIG. 4 shows a schematic diagram of a method for training a congestion control model and a congestion control method according to an exemplary embodiment of the present disclosure;

FIG. 5 shows a structural block diagram of a device for training a congestion control model according to an exemplary embodiment of the present disclosure;

FIG. 6 shows a structural block diagram of a congestion control device according to an exemplary embodiment of the present disclosure;

FIG. 7 shows a structural block diagram of an electronic device according to an exemplary embodiment of the present disclosure.

Detailed Description

To enable those of ordinary skill in the art to better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings.

It should be noted that the terms "first", "second", and the like in the specification, the claims, and the above drawings of the present disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the present disclosure described herein can be implemented in orders other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of devices and methods consistent with some aspects of the present disclosure as detailed in the appended claims.

It should be noted here that "at least one of several items" in the present disclosure covers three parallel cases: "any one of the several items", "a combination of any number of the several items", and "all of the several items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. Likewise, "executing at least one of step 1 and step 2" covers the following three parallel cases: (1) executing step 1; (2) executing step 2; (3) executing step 1 and step 2.

With the rapid development of mobile Internet technology and the growing number of terminals, many terminal devices run different types of applications, including delay-sensitive applications and throughput-sensitive applications. Delay-sensitive applications, such as VoIP or cloud gaming, require transmission delays as low as a few milliseconds and may not benefit from higher bandwidth, whereas throughput-sensitive applications, such as video streaming or file sharing, usually need high bandwidth to perform well. Depending on the type of application and its requirements for network transmission performance (for example, high throughput, low delay, and low packet loss), a congestion control method may need to follow completely different policies. As shown in FIG. 1, if an application is a throughput-sensitive file transfer application, throughput is critical: the application has a high throughput requirement for file transfer and a relatively low delay requirement. If an application is a delay-sensitive real-time streaming application, minimizing delay is critical: it requires low transmission delay to reduce video stalls and relatively low packet loss, while its throughput requirement is relatively low. As the most important protocol of the network transport layer, a computer network congestion control protocol needs to provide high-quality network service for applications with different network performance requirements. That is, the transport layer must adapt not only to changing network conditions but also to different application requirements, so as to meet users' different needs and improve their quality of experience.

本公开考虑到现有基于学习的拥塞控制模型的目标函数是固定的，很难根据不同应用的不同需求进行重新调整并重新进行模型训练的问题，提出了一种多目标拥塞控制方法，训练好的同一拥塞控制模型可以满足不同应用的传输性能要求，该方法利用多目标强化学习与偏好，能够适用于不同类型的应用，具体地，可将应用对网络传输性能的偏好，当前的网络状态信息输入到拥塞控制模型，拥塞控制模型即可针对该应用性能要求给出最佳的调整拥塞窗口大小的动作。This disclosure considers that the objective function of the existing learning-based congestion control model is fixed, and it is difficult to readjust and retrain the model according to different requirements of different applications, and proposes a multi-objective congestion control method, which can be trained well The same congestion control model can meet the transmission performance requirements of different applications. This method uses multi-objective reinforcement learning and preference, and can be applied to different types of applications. Specifically, the application's preference for network transmission performance, current network status information Input to the congestion control model, the congestion control model can give the best action to adjust the size of the congestion window according to the performance requirements of the application.

It should be understood that the congestion control method and/or congestion control device according to the present disclosure may be applied not only to the above scenarios but also to other suitable scenarios, and the present disclosure is not limited in this respect.

Fig. 2 shows a flowchart of a method for training a congestion control model according to an exemplary embodiment of the present disclosure.

Referring to Fig. 2, in step S101, the communication network environment used in the current training round is initialized.

The communication network environment is the environment in which the sending end and the receiving end used in the current training round transmit data.

In addition, as an example, before the current training round starts, the sending end and the receiving end used in the round may also be initialized and made to perform a handshake, so that during the round the sending end is controlled to continuously send data packets to the receiving end in the communication network environment and the receiving end returns ACK messages to the sending end. The current network state can be monitored by analyzing the receiving end's acknowledgment messages for the data packets.

In step S102, the preference for network transmission performance of the current training round and the current first network state information S are input into the congestion control model to obtain a predicted action a to be executed for adjusting the congestion window size.

As an example, the first network state information may include at least one of the following: the congestion window size, the delay Delay, the packet acknowledgment rate ACK_rate, and the sending rate Sending_rate. For example, the delay, the packet acknowledgment rate, and the sending rate may be determined based on ACK messages (that is, acknowledgment messages) fed back by the receiving end. For example, the sending end may be controlled to send data packets to the receiving end under the currently set congestion window size; when the sending end receives an ACK message fed back by the receiving end, the current delay, packet acknowledgment rate, and sending rate may be determined based on that ACK message, thereby obtaining the current first network state information.

As an example, the preference for network transmission performance may include a degree of preference for at least one of the following: throughput, packet loss rate, and delay.

As an example, one sample may be selected from a preference set as the preference for network transmission performance of the current training round. For example, one sample in the preference set may be: a preference of 0.7 for throughput, 0.2 for packet loss rate, and 0.1 for delay; another sample may be: a preference of 0.5 for throughput, 0.1 for packet loss rate, and 0.4 for delay. It should be understood that the preference set may be configured according to the various requirements that different types of applications place on network transmission performance.

In step S103, the predicted action is executed to reset the congestion window, and the sending end is controlled to send data packets to the receiving end under the currently set congestion window.

As an example, the predicted action may be applied to the current congestion window size to obtain the congestion window size to be set, and the congestion window is then set to that size.

As an example, the action predicted by the congestion control model may be one action from an action set. As an example, the action set may be {*0.5, -50, -10.0, +0.0, +10.0, *2.0, +50}. For example, *0.5 means that the current congestion window size multiplied by 0.5 is used as the congestion window size to be set; +10.0 means that the current congestion window size plus 10.0 is used as the congestion window size to be set; and -10.0 means that the current congestion window size minus 10.0 is used as the congestion window size to be set.
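The way an action from the example action set changes the congestion window can be sketched as follows. This is a minimal illustration only: the (operator, operand) encoding, the function name `apply_action`, and the lower bound `min_cwnd` are assumptions of the sketch and are not specified in the disclosure.

```python
# Example action set from the text, encoded as (operator, operand) pairs.
ACTIONS = [("*", 0.5), ("-", 50.0), ("-", 10.0), ("+", 0.0),
           ("+", 10.0), ("*", 2.0), ("+", 50.0)]


def apply_action(cwnd, action, min_cwnd=1.0):
    """Return the congestion window size obtained by applying one action.

    min_cwnd is an assumption of this sketch; the text does not state a floor.
    """
    op, value = action
    if op == "*":
        new_cwnd = cwnd * value      # multiplicative adjustment, e.g. *0.5
    elif op == "+":
        new_cwnd = cwnd + value      # additive increase, e.g. +10.0
    elif op == "-":
        new_cwnd = cwnd - value      # additive decrease, e.g. -10.0
    else:
        raise ValueError(f"unknown operator: {op}")
    return max(new_cwnd, min_cwnd)   # keep the window from dropping below the floor
```

For instance, applying `("*", 0.5)` to a window of 100 yields 50, while applying `("-", 50.0)` to a window of 20 is clipped to the assumed floor.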

In step S104, when the sending end receives the ACK message fed back by the receiving end, the loss function of the congestion control model is calculated according to the action a, the first network state information before the action was executed (that is, the first network state information S that was input into the congestion control model to predict the action a), the first network state information S′ after the action was executed, and the preference for network transmission performance of the current training round.

Here, the first network state information S′ after the action was executed is the first network state information determined based on the ACK message that the receiving end fed back for the data packets sent to it in step S103.

As an example, the loss function of the congestion control model may be calculated according to the action a, the first network state information S before the action was executed, the first network state information S′ after the action was executed, the reward function r of the action, and the preference for network transmission performance of the current training round.

The reward function of an action is the Reward function used to measure the benefit of the action predicted by the congestion control model. As an example, the reward function of the action may be calculated based on the preference for network transmission performance of the current training round and the third network state information after the action was executed. For example, the third network state information may include at least one of the following: packet loss rate, throughput, and delay. It should be understood that the third network state information after the action was executed is the third network state information determined based on the ACK message that the receiving end fed back for the data packets sent to it in step S103.

As an example, the reward function of an action may be determined based on the triplet [L(Throughput(t)), L(Loss_rate(t)), L(Delay(t))]. For example, the reward function of the action may be a weighted sum of these three quantities, where the weight of each quantity is related to the preference for network transmission performance of the current training round. Here, t denotes time, and L(x) denotes an activation function, for example, L(x) = (-10)^-x + 1. Throughput(t) may denote the normalized throughput; for example, the throughput may be normalized by dividing it by the bandwidth. Delay(t) may denote the normalized delay; for example, the delay may be normalized by dividing it by the timeout threshold timeout. Loss_rate(t) may denote the packet loss rate itself, that is, the packet loss rate need not be normalized.
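The reward computation described above can be sketched as follows. Several points are assumptions of this sketch rather than statements of the disclosure: the activation is read as L(x) = 1 - 10^(-x), since the printed form is ambiguous; the weights are passed in directly, because the text only says they are related to the preference and does not fix their values or signs; and the function names are hypothetical.

```python
def activation(x):
    """Assumed reading of the activation L(x): 1 - 10^(-x)."""
    return 1.0 - 10.0 ** (-x)


def reward(throughput, bandwidth, loss_rate, delay, timeout, weights):
    """Weighted sum of the activated triplet (throughput, loss rate, delay).

    weights holds one weight per quantity; how the weights follow from the
    preference is left to the caller (an assumption of this sketch).
    """
    triplet = (
        activation(throughput / bandwidth),  # throughput normalized by the bandwidth
        activation(loss_rate),               # loss rate is used without normalization
        activation(delay / timeout),         # delay normalized by the timeout threshold
    )
    return sum(w * v for w, v in zip(weights, triplet))
```

A caller that wants high loss or high delay to reduce the reward could, for example, pass negative weights for the second and third quantities; that choice is likewise not taken from the text.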

As an example, in step S103, the action predicted by the congestion control model is executed to reset the congestion window, the sending end is controlled to send data packets to the receiving end under the newly set congestion window, and the sending end waits for the acknowledgment messages (ACKs) returned by the receiving end. In step S104, after the ACKs are received, the reward r of this step of the training round and the observed current first network state information S′ (that is, the first network state information after the action was executed) may be obtained by, for example, calculating the RTT and comparing the number of packets sent with the number of acknowledgment messages. As an example, the state S before this step, the action a of this step, the reward r obtained after the action was executed, and the new state S′ that was reached may be saved to the replay buffer in the form of a vector. Accordingly, as an example, the replay buffer may be initialized before each training round starts.
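A replay buffer holding the (S, a, r, S′) vectors mentioned above might look like the following sketch. The class name, the fixed capacity, and uniform random sampling are assumptions of the sketch; the disclosure only states that the transitions are saved and that the buffer may be initialized before each training round.

```python
import random
from collections import deque


class ReplayBuffer:
    """Stores transitions (S, a, r, S') for later training steps."""

    def __init__(self, capacity=10000):
        # Oldest transitions are discarded once the assumed capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        """Save one transition as a tuple."""
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw up to batch_size transitions uniformly at random."""
        size = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), size)

    def __len__(self):
        return len(self.buffer)
```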

In addition, it should be understood that the first network state information after the action of the current step was executed may serve as the first network state information before the action of the next step is executed.

As an example, the congestion control model may be built on the reinforcement learning algorithm DQN. As an example, when the preference for network transmission performance includes degrees of preference for multiple aspects of network transmission performance, the congestion control model is a multi-objective model. As an example, the congestion control model may be a multi-objective reinforcement learning model.

As an example, the value function (Q function) in the reinforcement learning algorithm may be a value function of the action, the first network state information, and the preference for network transmission performance.

As an example, the congestion control model may use an ∈-greedy strategy to sample an action as the predicted action a_t. An action may be sampled using formula (1); specifically, with probability ∈ the sampled action is an action selected at random from the action set A, and with probability 1-∈ it is the optimal action obtained using the Q function,

where A denotes the action set, ω denotes the preference for network transmission performance of the training round, θ denotes the parameters of the Q function, and s_t denotes the current first network state information.
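The ∈-greedy sampling described above can be sketched as follows. The callable `q_function` is hypothetical and is assumed to return one Q value per objective; combining those values with the preference through a dot product is also an assumption of this sketch, since formula (1) is not reproduced in the text.

```python
import random


def select_action(q_function, state, preference, num_actions, epsilon, rng=random):
    """Return the index of the sampled action.

    With probability epsilon a random action is chosen from the action set;
    otherwise the action with the best preference-weighted Q value is chosen.
    """
    if rng.random() < epsilon:
        return rng.randrange(num_actions)  # explore: uniform over the action set

    def scalarized(action_index):
        # Hypothetical Q function: one value per objective for this action.
        q_values = q_function(state, action_index, preference)
        return sum(w * q for w, q in zip(preference, q_values))

    return max(range(num_actions), key=scalarized)  # exploit: best scalarized value
```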

As an example, the congestion control model may be a multi-objective model. To express the congestion control problem mathematically, it is assumed that there are multiple objectives, each of which can be expressed as an objective function, so that the different objective functions m_i(O) are maximized as a whole:

s.t. g_i(O) ≤ 0, i = 1, …, a_g

where m_i(O) denotes the objective function of the i-th objective, i = 1, …, m_f, and g_i(O) denotes the constraint functions of the congestion control problem.

As an example, the objective function of the congestion control model may be a composite objective function of the following: the reward function, the value function, the first network state information after the action is executed, the first network state information before the action is executed, the action, the preference for network transmission performance of the current training round, and the best preference in the current network environment.

As an example, the objective function of the congestion control model may be the composite objective function TQ(s, a, ω), which may be expressed as:

where r() denotes the reward function, γ denotes the weight coefficient, Q() denotes the value function, s′ denotes the first network state information after the action is executed, s denotes the first network state information before the action is executed, a denotes the action, ω denotes the preference for network transmission performance of the current training round, ω′ denotes the best preference in the current network environment, A denotes the action set, and Ω denotes the preference set.

As an example, the loss function of the congestion control model may be calculated based on a loss function L^S(θ), which brings the value function closer to the maximum reward function, and an auxiliary loss function L^T(θ).

Here, the auxiliary loss function is introduced because the optimal frontier contains a large number of discrete solutions, which makes the curve of the loss function non-smooth.

As an example, the loss function of the congestion control model may be expressed as (1-ε)·L^S(θ) + ε·L^T(θ), where ε is a trade-off index with 0 ≤ ε ≤ 1. The later an action is predicted within a training round, the larger the value of ε used when calculating the loss function of the congestion control model for that action.

In other words, in each training round the initial value of ε is 0, and ε gradually grows from 0 to 1 as the number of steps increases, so that the loss function shifts from L^S(θ) toward L^T(θ).
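The blending of the two losses can be sketched as follows. The linear growth of ε with the step index and the helper names are assumptions of this sketch; the text only states that ε starts at 0 in each training round and grows toward 1 as the number of steps increases.

```python
def tradeoff_index(step, total_steps):
    """Return ε for a 0-based step index; linear growth is an assumption."""
    if total_steps <= 1:
        return 0.0  # a single-step round keeps the initial value of ε
    return min(max(step / (total_steps - 1), 0.0), 1.0)


def blended_loss(loss_s, loss_t, epsilon):
    """Compute (1 - ε) · L^S(θ) + ε · L^T(θ) for already evaluated loss values."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0, 1]")
    return (1.0 - epsilon) * loss_s + epsilon * loss_t
```

At the first step of a round the result equals L^S(θ), and at the last step it equals L^T(θ).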

As an example, the loss function L^S(θ) may be expressed as:

As an example, the auxiliary loss function L^T(θ) may be expressed as:

where r denotes the reward function, γ denotes the weight coefficient, θ denotes the model parameters, θ_k denotes the parameters of the model at the k-th step, Q() denotes the value function, s′ denotes the first network state information after the action is executed, s denotes the first network state information before the action is executed, a denotes the action, ω denotes the preference for network transmission performance of the current training round, and ω′ denotes the best preference in the current network environment.

In step S105, the congestion control model is trained by adjusting the model parameters of the congestion control model according to the loss function.

As an example, the parameters θ of the Q function of the congestion control model may be adjusted according to the loss function.

As an example, stochastic gradient descent may be performed on the parameters θ of the Q function using formula (3) to update the Q function of the model, where the gradient term in formula (3) denotes the gradient with respect to the parameters θ.

In step S106, it is determined whether to end the current training round; when it is determined not to end the current training round, the method returns to step S102.

In addition, as an example, the method for training a congestion control model according to an exemplary embodiment of the present disclosure may further include: when it is determined to end the current training round, determining whether to end the training process of the congestion control model; and when it is determined not to end the training process, returning to step S101 to enter, that is, to prepare, the next training round. When it is determined to end the training process of the congestion control model, training of the model is stopped, that is, training of the congestion control model is complete. For example, whether to end the training process may be determined according to the prediction performance of the congestion control model, the total training time, or the like. It should also be understood that the initial communication network environments of different training rounds may be the same or different, and that the preferences for network transmission performance of different training rounds may be the same or different.

As an example, whether to end the current training round may be determined according to how the second network state information changes.

As an example, when the second network state information after an action is executed satisfies a first preset condition, the action is determined to be a winning action; when the second network state information after an action is executed satisfies a second preset condition, the action is determined to be a failed action. When the number of consecutive winning actions Win_Num reaches a first preset number, it is determined to end the current training round; when the number of consecutive failed actions Lose_Num reaches a second preset number, it is determined to end the current training round; and when the total number of executed actions reaches a third preset number, it is determined to end the current training round. For example, the first preset number M may be set to 50, the second preset number L may be set to 50, and the third preset number X may be set to 200.

As an example, the second network state information may include throughput and delay.

As an example, the first preset condition may be that the throughput is within 90%-110% of the bandwidth and the delay ≤ 0.7 × the timeout threshold; the second preset condition may be that the throughput is within 50%-70% of the bandwidth and the delay ≥ 0.7 × the timeout threshold.

The present disclosure recognizes that determining the end of a training round by a fixed number of steps or a fixed time leads to pseudo-interruption states and introduces considerable bias into training, and therefore proposes a method for ending a training round dynamically. Inspired by games that end in a win, a loss, or a draw, the method makes the decision according to changes in the network environment: it ends the round at an appropriate time according to changes in network bandwidth utilization, delay, and throughput, thereby improving the training efficiency of the model. In other words, the present disclosure proposes a way of ending a training round at the most suitable point. A training round is one complete training process of the reinforcement learning algorithm, during which a series of actions is selected in sequence, one at each time step, according to the network state and the congestion control policy. The length of a training round strongly affects whether the best model is learned.

As an example, in each training round the initial values of Win_Num and Lose_Num are both 0. Whenever a predicted action is determined to be a winning action, Win_Num is incremented by 1 and Lose_Num is reset to 0; whenever a predicted action is determined to be a failed action, Lose_Num is incremented by 1 and Win_Num is reset to 0. If the number of consecutive wins satisfies Win_Num ≥ M, the training round is stopped as a win; if the number of consecutive failures satisfies Lose_Num ≥ L, the training round is stopped as a loss; and if the current training round has already run for X steps, that is, Win_Num has not reached M and Lose_Num has not reached L before X steps were taken, the training round ends as a draw.
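The win-lose-draw rule for ending a training round can be sketched as follows, using the example thresholds M = 50, L = 50, and X = 200 and the example preset conditions from the text. The class and method names are hypothetical, and leaving both counters unchanged for an action that is neither winning nor failed is an assumption of this sketch, since the text does not address that case.

```python
class EpisodeTerminator:
    """Tracks consecutive winning and failed actions within one training round."""

    def __init__(self, m=50, l=50, x=200):
        self.m, self.l, self.x = m, l, x
        self.win_num = 0
        self.lose_num = 0
        self.steps = 0

    @staticmethod
    def classify(throughput, delay, bandwidth, timeout):
        """Label an executed action using the example preset conditions."""
        ratio = throughput / bandwidth
        if 0.9 <= ratio <= 1.1 and delay <= 0.7 * timeout:
            return "win"
        if 0.5 <= ratio <= 0.7 and delay >= 0.7 * timeout:
            return "lose"
        return "neutral"  # neither condition is met

    def update(self, outcome):
        """Record one action; return 'win', 'lose', 'draw', or None to continue."""
        self.steps += 1
        if outcome == "win":
            self.win_num += 1
            self.lose_num = 0
        elif outcome == "lose":
            self.lose_num += 1
            self.win_num = 0
        # Assumption: a neutral outcome leaves both counters unchanged.
        if self.win_num >= self.m:
            return "win"
        if self.lose_num >= self.l:
            return "lose"
        if self.steps >= self.x:
            return "draw"
        return None
```

A training loop would call `classify` after each ACK-derived observation and stop the round as soon as `update` returns a value other than `None`.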

In addition, the present disclosure recognizes that the initial congestion window (init-cwnd, that is, the congestion window size set when the sending end starts sending data) has a significant effect on how quickly the model converges, whereas related techniques usually set init-cwnd to a fixed value, so that fast convergence cannot be achieved across different network scenarios whose link capacities differ. Because large differences between links prevent a congestion control method from converging quickly in different network scenarios, the present disclosure proposes setting the initial congestion window dynamically according to the link bandwidth, thereby improving the convergence speed of the congestion control method.

As an example, the method for training a congestion control model according to an exemplary embodiment of the present disclosure may further include initializing the congestion window size, where the step of initializing the congestion window size may include: estimating the bandwidth of the communication network, and determining the initial congestion window size based on the estimated bandwidth.

As an example, the step of estimating the bandwidth of the communication network may include: determining the total number Num_ack of ACK messages fed back by the receiving end for N data packets sent by the sending end; and determining the bandwidth bw_i of the communication network according to the average value Num_ave obtained by dividing the total number Num_ack by N.

As an example, the estimated network bandwidth bw_i corresponding to Num_ave may be found from a predefined bandwidth set Bandwidth by formula (4):

where formula (4) uses a set of characteristic functions of the receiving rate that are defined in formula (5), and (c_{j-1}, c_j) denotes a set of receiving rates,

As an example, the initial congestion window size W_init-cwnd may be determined from the estimated bandwidth bw_i by formula (6), where b is a coefficient; for example, b may be set to 2.5 according to experimental data to achieve a better fitting rate and learning effect,

W_init-cwnd = b * bw_i (6)
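The initialization of the congestion window can be sketched as follows. Formulas (4) and (5) are not reproduced in the text, so the lookup below is only one plausible reading: Num_ave is matched against receiving-rate intervals (c_{j-1}, c_j], and the predefined bandwidth of the matching interval is returned. The tables `boundaries` and `bandwidths` and both function names are hypothetical; only formula (6) and the example value b = 2.5 are taken from the text.

```python
import bisect


def estimate_bandwidth(num_ack, n, boundaries, bandwidths):
    """Pick the predefined bandwidth whose interval contains Num_ave = Num_ack / N.

    boundaries holds the interval edges c_0 < c_1 < ... < c_k and bandwidths
    holds one entry per interval (both are assumptions of this sketch).
    """
    if len(boundaries) != len(bandwidths) + 1:
        raise ValueError("need exactly one more boundary than bandwidth entries")
    num_ave = num_ack / n
    # Index of the interval (c_{j-1}, c_j] containing num_ave, clamped to the table.
    j = bisect.bisect_left(boundaries, num_ave) - 1
    j = min(max(j, 0), len(bandwidths) - 1)
    return bandwidths[j]


def initial_cwnd(bw, b=2.5):
    """Formula (6): W_init-cwnd = b * bw_i."""
    return b * bw
```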

As an example, the congestion window size may be initialized before the current training round starts, so that W_init-cwnd is used when the sending end sends data packets for the first time in the current training round. As another example, the congestion window size may be initialized based on the network state information of the first N steps of the current training round.

It should be understood that if multiple training rounds share the same communication network environment, those training rounds may share the same W_init-cwnd.

Fig. 3 shows a flowchart of a congestion control method according to an exemplary embodiment of the present disclosure.

Referring to Fig. 3, in step S201, the current first network state information and the current application's preference for network transmission performance are obtained.

Here, the application is the application that uses the congestion control method for data transmission.

In step S202, the obtained first network state information and the preference are input into the congestion control model to obtain a predicted action to be executed for adjusting the congestion window size.

In step S203, the predicted action is executed to reset the congestion window. It should be understood that the congestion control method according to an exemplary embodiment of the present disclosure may be executed repeatedly, so as to adjust the congestion window size in real time according to the network state.
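The repeated execution of steps S201 to S203 can be sketched as the following loop. All callables are hypothetical placeholders: `observe_state` stands for obtaining the first network state information, `model` stands for the trained congestion control model and is assumed here to return a function that maps the current window size to the new one, and `set_cwnd` stands for whatever mechanism actually resets the congestion window.

```python
def run_congestion_control(model, observe_state, set_cwnd, preference, cwnd, steps):
    """Repeat steps S201-S203 for a fixed number of steps and return the final window."""
    for _ in range(steps):
        state = observe_state(cwnd)        # S201: obtain the current network state
        adjust = model(state, preference)  # S202: predict the window adjustment
        cwnd = adjust(cwnd)                # apply the predicted action to the window size
        set_cwnd(cwnd)                     # S203: reset the congestion window
    return cwnd
```

In this sketch the application's preference is supplied once and stays fixed for the whole loop; the fixed step count is likewise only for illustration.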

As an example, the first network state information may include at least one of the following: the congestion window size, the delay, the packet acknowledgment rate, and the sending rate. For example, the delay, the packet acknowledgment rate, and the sending rate may be determined based on ACK messages fed back by the receiving end.

As an example, the congestion control model may be built on a reinforcement learning algorithm, where the value function in the reinforcement learning algorithm may be a value function of the action, the first network state information, and the preference for network transmission performance.

As an example, the congestion control model may be obtained by training with the training method described in the above exemplary embodiments.

As an example, the congestion control method according to an exemplary embodiment of the present disclosure may further include initializing the congestion window size, where the step of initializing the congestion window size may include: estimating the bandwidth of the communication network, and determining the initial congestion window size based on the estimated bandwidth. As an example, the congestion window size may be initialized when the application starts transmitting data to the receiving end.

As an example, the step of estimating the bandwidth of the communication network may include: determining the total number of ACK messages fed back by the receiving end for N data packets that were sent; and determining the bandwidth of the communication network according to the average value obtained by dividing the total number by N.

The specific processing in the congestion control method according to an exemplary embodiment of the present disclosure has been described in detail in the above embodiments of the related method for training a congestion control model and is not elaborated here.

Fig. 4 shows a schematic diagram of the method for training a congestion control model and the congestion control method according to an exemplary embodiment of the present disclosure.

As shown in Fig. 4, when the congestion control model is trained, at the start of a training round the sending end may send data to the receiving end using an initial congestion window size set according to the bandwidth, so as to improve the convergence speed of the network. The training data generated by interacting with the network environment may be used to train the congestion control model, which uses a multi-objective reinforcement learning DQN network. In addition, a training round interruption algorithm may be used to solve the pseudo-interruption problem of training rounds. By implementing the congestion control method with the trained congestion control model, an optimal congestion control policy can be generated for any specified preference, so that the requirements of different types of applications can be met.

With the method for training a congestion control model according to an exemplary embodiment of the present disclosure, the performance of different network metrics can be improved according to different preference settings without resetting the objective function or the reward function, so that the transmission performance requirements of different types of applications can be met and the training time and training cost of the congestion control model are reduced.

Moreover, to improve the convergence of the model and solve the problem that current congestion control models converge relatively slowly, an exemplary embodiment of the present disclosure also proposes a method for dynamically initializing the congestion window, in which different initial congestion window sizes are set for different network bandwidths, so that the convergence speed of the congestion control model in different network environments can be improved.

Moreover, to solve the training pseudo-interruption problem that arises when multi-objective reinforcement learning is applied to congestion control, an exemplary embodiment of the present disclosure also proposes a win-lose-draw round interruption algorithm to improve the training quality of the congestion control model, so that multi-objective reinforcement learning can be applied to congestion control and the training efficiency of the algorithm is improved.

In addition, the congestion control method according to an exemplary embodiment of the present disclosure shows good experimental performance. The method achieves a trade-off between high throughput and low delay and, correspondingly, shows excellent congestion control capability in different network environments of 12 Mbps and 50 Mbps. For different cellular network scenarios, by setting different preferences, the congestion control method according to an exemplary embodiment of the present disclosure can meet the transmission performance requirements of different types of applications and achieve an optimal trade-off between different performance metrics; that is, the performance requirements of different types of applications in dynamic network scenarios can be met by setting different preferences.

FIG. 5 shows a structural block diagram of a training device for a congestion control model according to an exemplary embodiment of the present disclosure.

As shown in FIG. 5, the training device 10 for the congestion control model according to an exemplary embodiment of the present disclosure includes: an environment initialization unit 101, a prediction unit 102, a congestion window setting unit 103, a loss function calculation unit 104, a training unit 105, and a round end determination unit 106.

Specifically, the environment initialization unit 101 is configured to initialize the communication network environment used in the current training round.

The prediction unit 102 is configured to input this training round's preference for network transmission performance and the current first network state information into the congestion control model, to obtain the predicted action to be executed for adjusting the size of the congestion window.

The congestion window setting unit 103 is configured to execute the predicted action to reset the congestion window, and to control the sending end to send data packets to the receiving end under the currently set congestion window.

The loss function calculation unit 104 is configured to, when the sending end receives an ACK message fed back by the receiving end, calculate the loss function of the congestion control model according to the action, the first network state information before executing the action, the first network state information after executing the action, and the preference.

The training unit 105 is configured to train the congestion control model by adjusting the model parameters of the congestion control model according to the loss function.

The round end determination unit 106 is configured to determine whether to end this training round, wherein, when it is determined not to end this training round, the prediction unit 102 again inputs this training round's preference for network transmission performance and the current first network state information into the congestion control model to obtain the predicted action to be executed for adjusting the size of the congestion window.

As an example, the first network state information may include at least one of the following items: the size of the congestion window, the delay, the packet acknowledgment rate, and the sending rate, wherein the delay, the packet acknowledgment rate, and the sending rate are determined based on the ACK messages fed back by the receiving end.

As an example, the round end determination unit 106 may be configured to determine whether to end this training round according to changes in the second network state information.

As an example, the round end determination unit 106 may be configured to: determine that the action is a winning action when the second network state information after executing the action satisfies a first preset condition; determine that the action is a failed action when the second network state information after executing the action satisfies a second preset condition; determine to end this training round when the number of consecutive winning actions reaches a first preset number; determine to end this training round when the number of consecutive failed actions reaches a second preset number; and determine to end this training round when the total number of executed actions reaches a third preset number.

As an example, the second network state information may include throughput and delay. The first preset condition is: the throughput is within 90%-110% of the bandwidth, and the delay ≤ 0.7 × the timeout threshold. The second preset condition is: the throughput is within 50%-70% of the bandwidth, and the delay ≥ 0.7 × the timeout threshold.
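The win-lose-tie rule described above can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the class name `RoundTerminator`, the default preset counts, and the choice to treat every action that is neither a win nor a loss as a "tie" that breaks both streaks are assumptions made for the example. Only the 90%-110% / 50%-70% throughput bands and the 0.7 × timeout threshold come from the text.

```python
from dataclasses import dataclass


@dataclass
class RoundTerminator:
    """Tracks win/lose/tie outcomes and decides when a training round ends."""
    bandwidth: float          # link bandwidth, same unit as throughput
    timeout_threshold: float  # timeout threshold, same unit as delay
    max_wins: int = 5         # hypothetical value for the "first preset number"
    max_losses: int = 5       # hypothetical value for the "second preset number"
    max_steps: int = 200      # hypothetical value for the "third preset number"
    wins: int = 0             # current streak of consecutive winning actions
    losses: int = 0           # current streak of consecutive failed actions
    steps: int = 0            # total actions executed in this round

    def classify(self, throughput: float, delay: float) -> str:
        """Label one action as 'win', 'lose', or 'tie' from throughput and delay."""
        ratio = throughput / self.bandwidth
        limit = 0.7 * self.timeout_threshold
        if 0.9 <= ratio <= 1.1 and delay <= limit:
            return "win"   # first preset condition
        if 0.5 <= ratio <= 0.7 and delay >= limit:
            return "lose"  # second preset condition
        return "tie"       # assumption: anything else counts as a tie

    def step(self, throughput: float, delay: float) -> bool:
        """Record one action's outcome; return True if the round should end."""
        outcome = self.classify(throughput, delay)
        self.steps += 1
        # A streak continues only while the same outcome repeats.
        self.wins = self.wins + 1 if outcome == "win" else 0
        self.losses = self.losses + 1 if outcome == "lose" else 0
        return (self.wins >= self.max_wins
                or self.losses >= self.max_losses
                or self.steps >= self.max_steps)
```

In a training loop, `step` would be called once per executed action, after the ACK-derived throughput and delay become available.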

As an example, the training device 10 for the congestion control model may further include a window initialization unit (not shown) configured to initialize the size of the congestion window, wherein the window initialization unit is configured to estimate the bandwidth of the communication network and to determine the initial size of the congestion window based on the estimated bandwidth.

As an example, the window initialization unit may be configured to determine the total number of ACK messages fed back by the receiving end for N data packets sent by the sending end, and to determine the bandwidth of the communication network according to the average obtained by dividing the total number by N.
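The following sketch illustrates one way the window initialization could look. The text only states that the bandwidth is determined from the average obtained by dividing the total ACK count by N; how that average is mapped to a bandwidth, and how the bandwidth is mapped to an initial window, are not specified. The example therefore assumes (hypothetically) that the average scales a known probe sending rate, and that the initial window is the bandwidth-delay product in packets. The function names and parameters are illustrative.

```python
def estimate_bandwidth(total_acks: int, n_sent: int, probe_rate_pps: float) -> float:
    """Estimate bandwidth (packets/s) from ACK feedback for n_sent probe packets."""
    if n_sent <= 0:
        raise ValueError("n_sent must be positive")
    ack_average = total_acks / n_sent  # the average named in the text: total ACKs / N
    # Assumption: the acknowledged fraction of the probe rate approximates the bandwidth.
    return ack_average * probe_rate_pps


def initial_cwnd(bandwidth_pps: float, rtt_s: float, min_cwnd: int = 2) -> int:
    """Pick an initial congestion window (packets) from the estimated bandwidth."""
    # Assumption: use the bandwidth-delay product, floored at a small minimum window.
    return max(min_cwnd, round(bandwidth_pps * rtt_s))
```

For example, 80 ACKs for 100 probe packets sent at 1000 packets/s gives an estimate of about 800 packets/s, which with a 50 ms round-trip time yields an initial window of 40 packets under these assumptions.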

As an example, the loss function calculation unit 104 may be configured to, when the sending end receives an ACK message fed back by the receiving end, calculate the loss function of the congestion control model according to the action, the first network state information before executing the action, the first network state information after executing the action, the reward function of the action, and the preference.

As an example, the reward function of the action may be calculated based on the preference and on third network state information after executing the action, wherein the third network state information includes at least one of the following items: packet loss rate, throughput, and delay.
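A preference-dependent reward of this kind can be sketched as below. The text does not give the reward formula, so the normalization of each metric to the range [0, 1] and the weighted sum with the preference vector are assumptions chosen for illustration; the function names are hypothetical.

```python
from typing import Sequence, Tuple


def reward_vector(throughput: float, bandwidth: float, delay: float,
                  timeout_threshold: float, loss_rate: float) -> Tuple[float, float, float]:
    """Per-objective rewards in [0, 1]: (throughput, delay, loss). Higher is better."""
    r_throughput = min(throughput / bandwidth, 1.0)
    r_delay = 1.0 - min(delay / timeout_threshold, 1.0)
    r_loss = 1.0 - min(max(loss_rate, 0.0), 1.0)
    return (r_throughput, r_delay, r_loss)


def scalar_reward(preference: Sequence[float], rewards: Sequence[float]) -> float:
    """Weight the per-objective rewards by the preference (assumed: a weighted sum)."""
    if len(preference) != len(rewards):
        raise ValueError("preference and rewards must have the same length")
    return sum(w * r for w, r in zip(preference, rewards))
```

With this shape, changing the preference changes how the same network observation is scored, without redefining the per-objective rewards themselves.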

As an example, the congestion control model may be constructed based on a reinforcement learning algorithm, wherein the value function in the reinforcement learning algorithm is a function of the action, the first network state information, and the preference for network transmission performance.

As an example, the action predicted by the congestion control model is, with probability ∈, an action randomly selected from the action set and, with probability 1-∈, the optimal action obtained using the value function.
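The selection rule above can be sketched as follows. The example assumes, hypothetically, that the value function returns one value per objective for each action and that the "optimal action" is the one maximizing the preference-weighted sum of those values; the text does not state how the optimum is computed. The function name and argument layout are illustrative.

```python
import random
from typing import Optional, Sequence


def select_action(q_values: Sequence[Sequence[float]],
                  preference: Sequence[float],
                  epsilon: float,
                  rng: Optional[random.Random] = None) -> int:
    """Return an action index: random with probability epsilon, otherwise greedy.

    q_values[a][k] is the value of action a for objective k (assumed layout).
    """
    rng = rng or random.Random()
    n_actions = len(q_values)
    if rng.random() < epsilon:
        return rng.randrange(n_actions)  # explore: uniform over the action set
    # Exploit: score each action by the preference-weighted sum of its values.
    scores = [sum(w * q for w, q in zip(preference, row)) for row in q_values]
    return max(range(n_actions), key=scores.__getitem__)
```

Because the preference enters the scoring step, the same value estimates can yield different greedy actions for a throughput-oriented and a delay-oriented application.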

As an example, the loss function L^S(θ) can be expressed as:

[equation not reproduced in the source text]

and/or the auxiliary loss function L^T(θ) can be expressed as:

[equation not reproduced in the source text]

where [definitions not reproduced in the source text]

As an example, the objective function of the congestion control model may be a composite objective function TQ(s, a, ω), which is expressed as:

[equation not reproduced in the source text]

where [symbol not reproduced in the source text] denotes the action set, and Ω denotes the preference set.
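For the loss function L^S(θ) and the auxiliary loss function L^T(θ) introduced above, claim 13 states that the loss of the congestion control model is (1-ε)·L^S(θ) + ε·L^T(θ), where the trade-off index ε lies in [0, 1] and grows for actions predicted later in a training round. The sketch below illustrates that combination. The linear growth schedule in `tradeoff_index` is an assumption made for the example, since the source does not specify how ε increases; the two loss values are taken as already-computed numbers.

```python
def tradeoff_index(step: int, max_steps: int) -> float:
    """Trade-off index that grows from 0 to 1 over a round (assumed linear schedule)."""
    if max_steps <= 1:
        return 1.0
    return min(max(step / (max_steps - 1), 0.0), 1.0)


def combined_loss(loss_s: float, loss_t: float, eps: float) -> float:
    """Combine the two losses as (1 - eps) * L^S + eps * L^T."""
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must lie in [0, 1]")
    return (1.0 - eps) * loss_s + eps * loss_t
```

Early in a round the combined loss is dominated by L^S(θ); toward the end of the round the auxiliary loss L^T(θ) carries more weight.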

As an example, the training device 10 for the congestion control model according to an exemplary embodiment of the present disclosure may further include a training end determination unit (not shown) configured to determine, when it is determined to end this training round, whether to end the training process of the congestion control model, wherein, when it is determined not to end the training process of the congestion control model, the environment initialization unit 101 initializes the communication network environment used in the current training round so as to enter the next training round.
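The interplay of the units described above can be summarized as a two-level loop. The skeleton below is a sketch only: every callable passed to `train` is a hypothetical stand-in for the corresponding unit, the training process is assumed to end after a fixed number of rounds, and the per-round preference is drawn at random from the preference set (which, unlike the claimed method, does not by itself guarantee a different preference in every round).

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

State = Tuple[float, ...]
Preference = Tuple[float, ...]


def train(preference_set: Sequence[Preference],
          init_env: Callable[[], State],
          predict: Callable[[State, Preference], int],
          apply_action: Callable[[State, int], State],
          update: Callable[[State, int, State, Preference], None],
          round_done: Callable[[State, int], bool],
          num_rounds: int,
          rng: Optional[random.Random] = None) -> List[int]:
    """Run num_rounds training rounds; return the number of actions per round."""
    rng = rng or random.Random(0)
    steps_per_round = []
    for _ in range(num_rounds):            # training end check: fixed budget (assumption)
        state = init_env()                 # environment initialization (unit 101)
        omega = rng.choice(preference_set) # preference used throughout this round
        steps = 0
        while True:
            action = predict(state, omega)            # prediction (unit 102)
            next_state = apply_action(state, action)  # set cwnd, send, await ACK (unit 103)
            update(state, action, next_state, omega)  # loss and parameter update (units 104, 105)
            state = next_state
            steps += 1
            if round_done(state, steps):              # round end determination (unit 106)
                break
        steps_per_round.append(steps)
    return steps_per_round
```

The `round_done` callable is where a win-lose-tie termination rule would plug in, and `update` is where the loss function and parameter adjustment would be applied.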

FIG. 6 shows a structural block diagram of a congestion control device according to an exemplary embodiment of the present disclosure.

As shown in FIG. 6, the congestion control device 20 according to an exemplary embodiment of the present disclosure includes: an acquisition unit 201, a prediction unit 202, and a congestion window setting unit 203.

Specifically, the acquisition unit 201 is configured to acquire the current first network state information and the current application's preference for network transmission performance.

The prediction unit 202 is configured to input the acquired first network state information and the preference into the congestion control model, to obtain the predicted action to be executed for adjusting the size of the congestion window.

The congestion window setting unit 203 is configured to execute the predicted action to reset the congestion window.
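One control step of the device described above might look like the following sketch. The action set `ACTIONS` (additive changes to the window, in packets), the window bounds, and the `predict` callable standing in for the trained model are all hypothetical; the source does not enumerate the actions.

```python
from typing import Callable, Sequence

# Hypothetical action set: additive changes to the congestion window, in packets.
ACTIONS = (-10, -3, -1, 0, 1, 3, 10)


def control_step(cwnd: int,
                 state: Sequence[float],
                 preference: Sequence[float],
                 predict: Callable[[Sequence[float], Sequence[float]], int],
                 min_cwnd: int = 1,
                 max_cwnd: int = 10000) -> int:
    """Acquire state and preference, predict an action, and return the new window."""
    action_index = predict(state, preference)  # prediction unit 202
    new_cwnd = cwnd + ACTIONS[action_index]    # congestion window setting unit 203
    return max(min_cwnd, min(max_cwnd, new_cwnd))
```

The caller would supply the state gathered by the acquisition unit 201 and the preference of the current application on each invocation.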

As an example, the congestion control device 20 may further include a window initialization unit (not shown) configured to initialize the size of the congestion window, wherein the window initialization unit is configured to estimate the bandwidth of the communication network and to determine the initial size of the congestion window based on the estimated bandwidth.

As an example, the window initialization unit may be configured to determine the total number of ACK messages fed back by the receiving end for N sent data packets, and to determine the bandwidth of the communication network according to the average obtained by dividing the total number by N.

As an example, the congestion control model may be obtained by training with the training device 10 of the above exemplary embodiment.

With regard to the devices in the above embodiments, the specific manner in which each unit performs its operations has been described in detail in the embodiments of the corresponding methods and is not repeated here.

In addition, it should be understood that each unit in the training device 10 for the congestion control model and in the congestion control device 20 according to the exemplary embodiments of the present disclosure may be implemented as a hardware component and/or a software component. Those skilled in the art may implement each unit, for example using a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC), according to the processing defined for that unit.

FIG. 7 shows a structural block diagram of an electronic device according to an exemplary embodiment of the present disclosure. Referring to FIG. 7, the electronic device 30 includes at least one memory 301 and at least one processor 302. The at least one memory 301 stores a set of computer-executable instructions which, when executed by the at least one processor 302, performs the training method of the congestion control model and/or the congestion control method described in the above exemplary embodiments.

As an example, the electronic device 30 may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. Here, the electronic device 30 need not be a single electronic device; it may be any collection of devices or circuits capable of executing the above instructions (or instruction set) individually or jointly. The electronic device 30 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).

In the electronic device 30, the processor 302 may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processor 302 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.

The processor 302 may execute instructions or code stored in the memory 301, and the memory 301 may also store data. Instructions and data may also be sent and received over a network via a network interface device, which may employ any known transmission protocol.

The memory 301 may be integrated with the processor 302, for example with RAM or flash memory arranged within an integrated circuit microprocessor or the like. In addition, the memory 301 may comprise a separate device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory 301 and the processor 302 may be operatively coupled, or may communicate with each other, for example through an I/O port or a network connection, so that the processor 302 can read files stored in the memory.

In addition, the electronic device 30 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, a mouse, or a touch input device). All components of the electronic device 30 may be connected to each other via a bus and/or a network.

According to an exemplary embodiment of the present disclosure, a computer-readable storage medium storing instructions may also be provided, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform the training method of the congestion control model and/or the congestion control method described in the above exemplary embodiments. Examples of the computer-readable storage medium here include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drive (HDD), solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an Extreme Digital (xD) card), magnetic tape, floppy disk, magneto-optical data storage device, optical data storage device, hard disk, solid-state disk, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the above computer-readable storage medium may run in an environment deployed on computer equipment such as a client, a host, a proxy device, or a server. In addition, in one example, the computer program and any associated data, data files, and data structures are distributed over networked computer systems, so that they are stored, accessed, and executed in a distributed manner by one or more processors or computers.

According to an exemplary embodiment of the present disclosure, a computer program product may also be provided, the instructions of which can be executed by at least one processor to perform the training method of the congestion control model and/or the congestion control method described in the above exemplary embodiments.

Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the following claims.

It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A training method for a congestion control model, comprising:

initializing the communication network environment used in the current training round;

selecting a sample from a preference set as the preference for network transmission performance in this training round;

inputting this training round's preference for network transmission performance and the current first network state information into the congestion control model to obtain a predicted action to be executed for adjusting the size of the congestion window;

executing the predicted action to reset the congestion window, and controlling the sending end to send data packets to the receiving end under the currently set congestion window;

when the sending end receives an ACK message fed back by the receiving end, calculating a loss function of the congestion control model according to the action, the first network state information before executing the action, the first network state information after executing the action, a reward function of the action, and the preference, wherein the reward function of the action is calculated based on this training round's preference for network transmission performance and third network state information after the action is executed;

training the congestion control model by adjusting model parameters of the congestion control model according to the loss function, and determining whether to end this training round, wherein, when it is determined not to end this training round, returning to the step of inputting this training round's preference for network transmission performance and the current first network state information into the congestion control model to obtain the predicted action to be executed for adjusting the size of the congestion window;

when it is determined to end this training round, determining whether to end the training process of the congestion control model;

when it is determined not to end the training process of the congestion control model, returning to the steps of initializing the communication network environment used in the current training round and selecting a sample from the preference set as the preference for network transmission performance in this training round, so as to enter the next training round,

wherein different training rounds have different preferences for network transmission performance.

2. The method according to claim 1, wherein the first network status information includes at least one of the following items: the size of the congestion window, delay, packet acknowledgment rate, and sending rate;

Wherein, the delay, the packet acknowledgment rate, and the sending rate are determined based on the ACK message fed back by the receiving end.

3. The method according to claim 1, wherein the preference for network transmission performance includes a degree of preference for at least one of the following items: throughput, packet loss rate, and delay.

4. The method according to claim 1, wherein the step of determining whether to end this training round comprises:

determining, according to changes in the second network state information, whether to end this training round.

5. The method according to claim 4, wherein, according to the variation of the second network state information, the step of determining whether to end this training round comprises:

when the second network state information after executing the action satisfies a first preset condition, determining that the action is a winning action;

when the second network state information after executing the action satisfies a second preset condition, determining that the action is a failed action;

when the number of consecutive winning actions reaches a first preset number, determining to end this training round;

when the number of consecutive failed actions reaches a second preset number, determining to end this training round;

when the total number of executed actions reaches a third preset number, determining to end this training round.

6. The method according to claim 5, wherein the second network state information comprises: throughput and delay;

The first preset condition is: the throughput is 90%-110% of the bandwidth, and the delay is ≤0.7×timeout threshold;

The second preset condition is: the throughput is 50%-70% of the bandwidth, and the delay is greater than or equal to 0.7×timeout threshold.

7. The method according to claim 1, further comprising: initializing the size of the congestion window;

Wherein, the step of initializing the size of the congestion window includes: estimating the bandwidth of the communication network, and determining the initial size of the congestion window based on the estimated bandwidth.

8. The method according to claim 7, wherein the step of estimating the bandwidth of the communication network comprises:

Determine the total number of ACK messages fed back by the receiving end for the N data packets sent by the sending end;

The bandwidth of the communication network is determined according to an average value obtained by dividing the total number by N.

9. The method according to claim 1, wherein the third network status information includes at least one of the following items: packet loss rate, throughput, and delay.

10. The method according to claim 1, wherein the congestion control model is constructed based on a reinforcement learning algorithm;

Wherein, the value function in the reinforcement learning algorithm is a value function about actions, first network state information, and preference for network transmission performance.

11. The method according to claim 10, wherein the action predicted by the congestion control model is, with probability ∈, an action randomly selected from the action set and, with probability 1-∈, the optimal action obtained using the value function.

12. The method according to claim 10, wherein the loss function of the congestion control model is calculated based on: a loss function L^S(θ) for bringing the value function closer to the maximum reward function, and an auxiliary loss function L^T(θ).

13. The method according to claim 12, wherein the loss function of the congestion control model is expressed as: (1-ε) L^S(θ) + ε L^T(θ);

wherein ε is a trade-off index with 0≤ε≤1, and the later an action is predicted within a training round, the greater the value of ε used when calculating the loss function of the congestion control model for that action.

14. The method according to claim 10, wherein the objective function of the congestion control model is a composite objective function of the following items: the reward function, the value function, the first network state information after executing an action, the first network state information before executing the action, the action, this training round's preference for network transmission performance, and the best preference in the current network environment.

15. A congestion control method, comprising:

obtaining the current first network state information and the current application's preference for network transmission performance;

inputting the obtained first network state information and the preference into the congestion control model to obtain a predicted action to be executed for adjusting the size of the congestion window;

executing the predicted action to reset the congestion window,

wherein the congestion control model is applicable to different types of applications, and different types of applications have different preferences for network transmission performance,

wherein the congestion control model is obtained by training using the training method according to any one of claims 1 to 14.

16. The method according to claim 15, wherein the first network status information includes at least one of the following items: the size of the congestion window, delay, packet acknowledgment rate, and transmission rate;

17. The method according to claim 15, wherein the preference for network transmission performance includes a degree of preference for at least one of the following items: throughput, packet loss rate, and delay.

18. The method according to claim 15, further comprising: initializing the size of the congestion window;

19. The method according to claim 18, wherein the step of estimating the bandwidth of the communication network comprises:

Determine the total number of ACK messages fed back by the receiving end for the N data packets sent;

20. The method according to claim 15, wherein the congestion control model is constructed based on a reinforcement learning algorithm;

21. A training device for a congestion control model, comprising:

The environment initialization unit is configured to initialize the communication network environment used in the current training round, and select a sample from the preference set as the preference for network transmission performance in this training round;

The prediction unit is configured to input the current training round's preference for network transmission performance and the current first network state information into the congestion control model, and obtain the predicted action for adjusting the size of the congestion window that needs to be executed;

The congestion window setting unit is configured to perform predicted actions to reset the congestion window, and control the sending end to send data packets to the receiving end under the currently set congestion window;

The loss function calculation unit is configured to, when the sending end receives the ACK message fed back by the receiving end, according to the action, the first network state information before performing the action, the first network state information after performing the action, The reward function of the action and the preference calculate the loss function of the congestion control model, wherein the reward function of the action is based on the preference of the network transmission performance in this training round and the The third network status information is calculated;

a training unit configured to train the congestion control model by adjusting model parameters of the congestion control model according to the loss function;

The round end determination unit is configured to determine whether to end this training round, wherein, when it is determined not to end this training round, the prediction unit will use this training round's preference for network transmission performance and the current first network state information Input to the congestion control model to obtain the predicted action to adjust the size of the congestion window;

a training end determination unit configured to determine, when it is determined to end the current training round, whether to end the training process of the congestion control model, wherein, when it is determined not to end the training process of the congestion control model, the environment initialization unit initializes the communication network environment used in the current training round and selects a sample from the preference set as the preference for network transmission performance in the current training round, so as to enter the next training round,

22. The device according to claim 21, wherein the first network state information includes at least one of the following items: the size of the congestion window, delay, packet acknowledgment rate, and sending rate;

23. The device according to claim 21, wherein the preference for network transmission performance includes a degree of preference for at least one of the following items: throughput, packet loss rate, and delay.

24. The device according to claim 21, wherein the round end determination unit is configured to determine whether to end the current training round according to the change of the second network state information.

25. The device according to claim 24, wherein the round end determination unit is configured to: determine that the action is a winning action when the second network state information after the action is executed satisfies a first preset condition; determine that the action is a failed action when the second network state information after the action is executed satisfies a second preset condition; determine to end the current training round when the number of consecutive winning actions reaches a first preset number; determine to end the current training round when the number of consecutive failed actions reaches a second preset number; and determine to end the current training round when the total number of executed actions reaches a third preset number.
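
By way of illustration only, and not as a limitation of the claim, the round-end determination of claim 25 might be sketched as follows. The class name `RoundEndTracker`, its field names, and the two predicate callables standing in for the first and second preset conditions are hypothetical. Resetting both consecutive counters when an action is neither winning nor failed is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Second network state information, e.g. {"throughput": ..., "delay": ...}
State = Dict[str, float]


@dataclass
class RoundEndTracker:
    is_win: Callable[[State], bool]   # stands in for the first preset condition (assumed)
    is_fail: Callable[[State], bool]  # stands in for the second preset condition (assumed)
    max_wins: int                     # first preset number
    max_fails: int                    # second preset number
    max_steps: int                    # third preset number
    wins: int = 0                     # current run of consecutive winning actions
    fails: int = 0                    # current run of consecutive failed actions
    steps: int = 0                    # total number of executed actions

    def update(self, state_after: State) -> bool:
        """Record one executed action; return True if the round should end."""
        self.steps += 1
        if self.is_win(state_after):
            self.wins += 1
            self.fails = 0
        elif self.is_fail(state_after):
            self.fails += 1
            self.wins = 0
        else:
            # Assumption: a neutral action breaks both consecutive runs.
            self.wins = 0
            self.fails = 0
        return (self.wins >= self.max_wins
                or self.fails >= self.max_fails
                or self.steps >= self.max_steps)
```

In such a sketch, `update` would be called once per executed action, with the second network state information observed after that action.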

26. The device according to claim 25, wherein the second network state information includes: throughput and delay;

27. The device of claim 21, further comprising:

a window initialization unit configured to initialize the size of the congestion window;

wherein the window initialization unit is configured to estimate the bandwidth of the communication network, and to determine the initial size of the congestion window based on the estimated bandwidth.

28. The device according to claim 27, wherein the window initialization unit is configured to determine the total number of ACK messages fed back by the receiving end for N data packets sent by the sending end, and to determine the bandwidth of the communication network from the average value obtained by dividing the total number by N.
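
By way of illustration only, the estimate described in claim 28 might be sketched as follows. The function names are hypothetical. The mapping from the estimate to an initial window size (`scale`, `min_cwnd`) is not specified in the claims and is assumed here purely so that the example is complete.

```python
def estimate_bandwidth(total_acks: int, n_sent: int) -> float:
    """Average obtained by dividing the ACK total by N, following the claim wording."""
    if n_sent <= 0:
        raise ValueError("n_sent must be positive")
    return total_acks / n_sent


def initial_cwnd(bandwidth_estimate: float, scale: float = 10.0, min_cwnd: int = 1) -> int:
    """Hypothetical mapping from the estimate to an initial window, in packets."""
    # Assumption: the window grows linearly with the estimate and never drops below min_cwnd.
    return max(min_cwnd, round(bandwidth_estimate * scale))
```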

29. The device according to claim 21, wherein the third network state information includes at least one of the following items: packet loss rate, throughput, and delay.

30. The device according to claim 21, wherein the congestion control model is constructed based on a reinforcement learning algorithm;

31. The device according to claim 30, wherein the action predicted by the congestion control model is, with probability ϵ, an action randomly selected from the action set and, with probability 1 − ϵ, the optimal action obtained using the value function.
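
By way of illustration only, the selection rule of claim 31 corresponds to what is commonly called epsilon-greedy selection and might be sketched as follows. The function name `select_action` is hypothetical, and `q_values` is assumed to hold the value function's output for each action in the action set, already evaluated for the current first network state information and preference.

```python
import random
from typing import Optional, Sequence


def select_action(q_values: Sequence[float], epsilon: float,
                  rng: Optional[random.Random] = None) -> int:
    """Return the index of the chosen action within the action set."""
    rng = rng or random
    if rng.random() < epsilon:
        # Explore: uniformly random action from the action set.
        return rng.randrange(len(q_values))
    # Exploit: action with the highest estimated value.
    return max(range(len(q_values)), key=lambda i: q_values[i])
```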

32. The device according to claim 30, wherein the loss function of the congestion control model is calculated based on a loss function L^S(θ), which brings the value function closer to the maximum reward function, and an auxiliary loss function L^T(θ).

33. The device according to claim 32, wherein the loss function of the congestion control model is expressed as: (1 − ε) L^S(θ) + ε L^T(θ);
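
By way of illustration only, and reading the expression of claim 33 as a weighted combination of the two losses, the calculation might be sketched as follows. The function name `combined_loss` is hypothetical, and restricting ε to the interval [0, 1] is an assumption not stated in the claim.

```python
def combined_loss(loss_s: float, loss_t: float, eps: float) -> float:
    """Weighted sum (1 - eps) * L^S + eps * L^T of the two loss terms."""
    # Assumption: eps is a weight in [0, 1]; the claim does not state its range.
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps is assumed to lie in [0, 1]")
    return (1.0 - eps) * loss_s + eps * loss_t
```

Under this reading, ε = 0 leaves only L^S(θ) and ε = 1 leaves only the auxiliary loss L^T(θ).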

34. The device according to claim 30, wherein the objective function of the congestion control model is a composite objective function related to the following items: the reward function, the value function, the first network state information after the action is executed, the first network state information before the action is executed, the action, the preference for network transmission performance in the current training round, and the best preference in the current network environment.

35. A congestion control device, comprising:

an acquisition unit configured to acquire the current first network state information and the current application's preference for network transmission performance;

a prediction unit configured to input the acquired first network state information and the preference into a congestion control model, to obtain a predicted action, to be executed, for adjusting the size of the congestion window;

a congestion window setting unit configured to execute the predicted action to reset the congestion window,

wherein the congestion control model is trained by using the training device according to any one of claims 21 to 34.
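
By way of illustration only, one control step of the device of claim 35 might be sketched as follows. The names `control_step`, `Model`, `deltas`, and `min_cwnd` are hypothetical. Representing the action set as additive changes to the window size is an assumption, since the claims do not define the action set here.

```python
from typing import Callable, Sequence

# A trained model is assumed to map (state, preference) to an action index.
Model = Callable[[Sequence[float], Sequence[float]], int]


def control_step(model: Model, state: Sequence[float], preference: Sequence[float],
                 cwnd: int, deltas: Sequence[int], min_cwnd: int = 1) -> int:
    """Predict an action for the given state and preference, and return the new window size."""
    action = model(state, preference)
    # Assumption: each action adds a fixed delta to the window, floored at min_cwnd.
    return max(min_cwnd, cwnd + deltas[action])
```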

36. The device according to claim 35, wherein the first network state information includes at least one of the following items: the size of the congestion window, delay, packet acknowledgment rate, and sending rate;

37. The device according to claim 35, wherein the preference for network transmission performance includes a degree of preference for at least one of the following items: throughput, packet loss rate, and delay.

38. The device of claim 35, further comprising:

39. The device according to claim 38, wherein the window initialization unit is configured to determine the total number of ACK messages fed back by the receiving end for N data packets sent, and to determine the bandwidth of the communication network from the average value obtained by dividing the total number by N.

40. The device according to claim 35, wherein the congestion control model is constructed based on a reinforcement learning algorithm;

41. An electronic device, characterized by comprising:

at least one processor;

at least one memory storing computer-executable instructions,

wherein, when the computer-executable instructions are executed by the at least one processor, the at least one processor is caused to execute the congestion control model training method according to any one of claims 1 to 14 and/or the congestion control method according to any one of claims 15 to 20.

42. A computer-readable storage medium, characterized in that, when the instructions in the computer-readable storage medium are executed by at least one processor, the at least one processor is caused to execute the congestion control model training method according to any one of claims 1 to 14 and/or the congestion control method according to any one of claims 15 to 20.