CN115983438A - Method and device for determining operation strategy of data center terminal air conditioning system

Info

Publication number: CN115983438A
Application number: CN202211571284.0A
Authority: CN
Inventors: 胡潇; 贾庆山; 周翰辰
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2022-12-08
Filing date: 2022-12-08
Publication date: 2023-04-18

Abstract

The invention discloses a method and a device for determining an operation strategy of a data center terminal air-conditioning system. The method comprises: building a temperature field distribution model of a data center computer room; constructing a Markov decision process model of the operation strategy of the data center terminal air-conditioning system; in the temperature field distribution model, training with a reinforcement learning algorithm based on different policy functions and on Markov decision process models with different parameters, so as to generate multiple operation strategies of the data center terminal air-conditioning system and construct a strategy library; according to an ordinal optimization method, evaluating the performance of each operation strategy in the strategy library in the temperature field distribution model and determining a selected set from the strategy library; and applying each operation strategy in the selected set to the real operating environment of the data center computer room and determining the optimal operation strategy in the selected set. The method and the device can accurately determine the optimal operation strategy of the data center terminal air-conditioning system.

Description

Method and device for determining operation strategy of terminal air conditioning system in data center

Technical Field

The invention relates to the technical field of energy-saving optimization of Internet data centers, and in particular to a method and device for determining an operation strategy of a data center terminal air-conditioning system.

Background

This section is intended to provide background or context for the embodiments of the invention recited in the claims. The description here is not admitted to be prior art merely by its inclusion in this section.

Apart from the power drawn by the server IT load, the largest energy consumer in a data center is the cooling infrastructure: roughly 1/3 to 1/2 of total data center power consumption goes to the cooling system. Growing data center energy consumption therefore calls for better thermal management to improve energy efficiency. Cooling-system energy consumption comprises chiller-side consumption and terminal air-conditioner consumption. Chiller-side optimization already has fairly mature techniques (for example, chiller energy optimization based on load forecasting). Terminal air-conditioner optimization, however, depends on the temperature field distribution inside the computer room; simulating that distribution involves complex fluid mechanics and thermodynamic analysis, and the distribution generally changes continuously over time. Minimizing the operating power of the data center terminal air-conditioning system while guaranteeing the thermal safety of the server IT equipment is therefore a key technical challenge.

At present, however, there is no scheme for selecting an operation strategy for data center terminal air conditioners.

Summary of the Invention

An embodiment of the invention provides a method for determining an operation strategy of a data center terminal air-conditioning system, used to accurately determine the optimal operation strategy of the data center terminal air-conditioning system. The method includes:

building a temperature field distribution model of the data center computer room;

constructing a Markov decision process model of the operation strategy of the data center terminal air-conditioning system;

in the temperature field distribution model, training with a reinforcement learning algorithm based on different policy functions and on Markov decision process models with different parameters, generating multiple operation strategies of the data center terminal air-conditioning system, and constructing a strategy library;

according to an ordinal optimization method, evaluating the performance of each operation strategy in the strategy library in the temperature field distribution model, and determining a selected set from the strategy library;

applying each operation strategy in the selected set to the real operating environment of the data center computer room, and determining the optimal operation strategy in the selected set.

An embodiment of the invention also provides a device for determining an operation strategy of a data center terminal air-conditioning system, used to accurately determine the optimal operation strategy of the data center terminal air-conditioning system. The device includes:

a temperature field distribution model building module, configured to build a temperature field distribution model of the data center computer room;

a Markov decision process model construction module, configured to construct a Markov decision process model of the operation strategy of the data center terminal air-conditioning system;

a strategy library construction module, configured to train, in the temperature field distribution model, with a reinforcement learning algorithm based on different policy functions and on Markov decision process models with different parameters, generate multiple operation strategies of the data center terminal air-conditioning system, and construct a strategy library;

a selected set determination module, configured to evaluate, according to an ordinal optimization method, the performance of each operation strategy in the strategy library in the temperature field distribution model, and determine a selected set from the strategy library;

an optimal operation strategy determination module, configured to apply each operation strategy in the selected set to the real operating environment of the data center computer room and determine the optimal operation strategy in the selected set.

An embodiment of the invention also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the above method for determining an operation strategy of a data center terminal air-conditioning system.

An embodiment of the invention also provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, it implements the above method for determining an operation strategy of a data center terminal air-conditioning system.

An embodiment of the invention also provides a computer program product including a computer program. When the computer program is executed by a processor, it implements the above method for determining an operation strategy of a data center terminal air-conditioning system.

In the embodiments of the invention, a temperature field distribution model of the data center computer room is built; a Markov decision process model of the operation strategy of the data center terminal air-conditioning system is constructed; in the temperature field distribution model, a reinforcement learning algorithm is trained on different policy functions and on Markov decision process models with different parameters, generating multiple operation strategies and constructing a strategy library; according to an ordinal optimization method, the performance of each operation strategy in the strategy library is evaluated in the temperature field distribution model and a selected set is determined from the strategy library; and each operation strategy in the selected set is applied to the real operating environment of the data center computer room to determine the optimal operation strategy in the selected set.

In this process, a strategy library containing multiple operation strategies is formed from Markov decision process models with different parameters and from different policy functions. Whereas the traditional two-stage method and the reinforcement learning method each generate only a single operation strategy, this scheme considers several policy functions of different forms and selects among the resulting strategies in a principled way. The strategy finally obtained therefore has a better performance guarantee than a single strategy obtained by traditional methods: in a real data center environment it is better able both to ensure the thermal safety of the server IT equipment and to minimize terminal air-conditioner energy consumption. In the selection stage, unlike the traditional approach of computing the true performance of every operation strategy, ranking them, and choosing the best, this scheme uses ordinal optimization to obtain a selected set. This greatly reduces the number of times strategies from the library must be evaluated in the real operating environment, which further protects the thermal safety of the data center server IT equipment and saves labor, materials, and cost.

Brief Description of the Drawings

To explain the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:

Fig. 1 is a flowchart of a method for determining an operation strategy of a data center terminal air-conditioning system in an embodiment of the invention;

Fig. 2 is a flowchart of building the temperature field distribution model of the data center computer room in an embodiment of the invention;

Fig. 3 is a flowchart of determining the selected set in an embodiment of the invention;

Fig. 4 is a schematic diagram of a device for determining an operation strategy of a data center terminal air-conditioning system in an embodiment of the invention;

Fig. 5 is a schematic diagram of a computer device in an embodiment of the invention.

Detailed Description

To make the purposes, technical solutions, and advantages of the embodiments of the invention clearer, the embodiments are described in further detail below with reference to the drawings. The exemplary embodiments and their descriptions are used to explain the invention and do not limit it.

The inventors found that traditional methods for strategy optimization and energy saving in data center cooling systems, and terminal air-conditioning systems in particular, are mostly based on a two-stage framework. In the first stage, an approximate system model is built by mechanism analysis or by data-driven methods; this model usually involves fluid dynamics, heat transfer, and mechanical principles, and must account for the temperature field distribution in the computer room. In the second stage, the approximate model is used with a strategy optimization algorithm, commonly dynamic programming or model predictive control, to solve for the optimal decision sequence of the air conditioner's controllable variables. However, these methods must first build an approximate model of the computer room temperature field, whose distribution involves specialist knowledge of fluid dynamics and heat transfer. Mechanism-based modeling requires a complex system of partial differential equations, and for the large data centers whose scale has grown in recent years, building such a model is complex, difficult, and error-prone. These traditional model-based optimization algorithms therefore struggle to solve the strategy optimization problem of today's data center terminal air-conditioning systems.

Reinforcement learning learns an optimal operation strategy through continual interaction with the environment and does not require the system dynamics to be known (model-free reinforcement learning in particular). Because the mechanism model of the computer room temperature field is very complex, reinforcement learning may be an effective way to solve the strategy optimization problem of the terminal air-conditioning system, and some literature has already applied it to data center cooling. To prevent losses from server IT equipment overheating, reinforcement learning algorithms usually cannot be trained directly in a real data center, so computational fluid dynamics (CFD) simulation software is still needed to build a simulation model of the terminal air conditioners and the computer room temperature field. Although reinforcement learning avoids mechanism modeling and analysis of the temperature field, mainstream reinforcement learning methods suffer from low sample efficiency, unstable training, and strong sensitivity of the trained strategy's performance to parameters, so the performance of the finally trained operation strategy cannot be guaranteed. Moreover, because training takes place in simulation and the simulation inevitably differs from the real environment, the trained strategy's performance in the real environment is not guaranteed either.

On this basis, an embodiment of the invention proposes a method for selecting an operation strategy of a data center air-conditioning system based on a strategy library and ordinal optimization.

Fig. 1 is a flowchart of a method for determining an operation strategy of a data center terminal air-conditioning system in an embodiment of the invention, including:

Step 101: build a temperature field distribution model of the data center computer room;

Step 102: construct a Markov decision process model of the operation strategy of the data center terminal air-conditioning system;

Step 103: in the temperature field distribution model, train with a reinforcement learning algorithm based on different policy functions and on Markov decision process models with different parameters, generate multiple operation strategies of the data center terminal air-conditioning system, and construct a strategy library;

Step 104: according to an ordinal optimization method, evaluate the performance of each operation strategy in the strategy library in the temperature field distribution model, and determine a selected set from the strategy library;

Step 105: apply each operation strategy in the selected set to the real operating environment of the data center computer room, and determine the optimal operation strategy in the selected set.
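Steps 104 and 105 can be illustrated with a minimal Python sketch. It assumes a simplified rule in which the selected set is the `s` strategies that look best in simulation; the rule the patent actually uses is described with Fig. 3 and may differ. The function name `select_and_verify`, the toy performance numbers, and the noise level are all hypothetical.

```python
import random

def select_and_verify(sim_scores, real_eval, s):
    """Pick the s best-looking strategies in simulation, then test only those for real.

    sim_scores: simulated performance per strategy (higher is better, assumed).
    real_eval:  callable returning the real-environment performance of strategy i.
    """
    # Step 104 (simplified assumption): rank by simulated performance, keep the top s.
    ranked = sorted(range(len(sim_scores)), key=lambda i: sim_scores[i], reverse=True)
    selected = ranked[:s]
    # Step 105: only the selected set is evaluated in the real environment.
    real_scores = {i: real_eval(i) for i in selected}
    best = max(real_scores, key=real_scores.get)
    return selected, best

# Toy data: 20 strategies whose simulated score is a noisy view of the true one.
rng = random.Random(0)
true_perf = [rng.uniform(0.0, 1.0) for _ in range(20)]
sim_scores = [p + rng.gauss(0.0, 0.1) for p in true_perf]
selected, best = select_and_verify(sim_scores, lambda i: true_perf[i], s=5)
```

In this toy run only 5 of the 20 strategies are ever evaluated against the "real" scores, which is the saving the selected set is meant to provide.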

In step 101, the temperature field distribution model of the data center computer room is built. Fig. 2 is a flowchart of building the temperature field distribution model of the data center computer room in an embodiment of the invention, including:

Step 201: using CFD simulation software and the CAD layout drawings of the computer room, model and simulate the spatial structure of the data center computer room and the air-conditioner and IT equipment models, and establish the temperature field distribution model of the data center computer room;

The temperature field of the data center computer room is affected by boundary conditions such as the server IT load and the terminal air-conditioner fan speed, and it varies over time and space. A simple temperature distribution model built with traditional fluid dynamics and heat transfer mechanism analysis can hardly capture how the temperature at each measuring point changes over time and space, and can hardly detect a local hot spot beside a given piece of server IT equipment in time, which leaves the IT equipment at risk of overheating. The embodiment of the invention therefore uses CFD simulation software designed for data centers to simulate the temperature field distribution of the computer room. Based on the CAD layout drawings and the software's extensive component library (air-conditioner components, IT equipment components, and so on), the spatial structure of the computer room (including the spatial layout of the server IT equipment, cold and hot aisles, air conditioners, and temperature sensors, and the structure of the air-conditioning system) and the air-conditioner and IT equipment models are modeled and simulated in detail. The resulting temperature field distribution model describes fairly accurately how the temperature at each measuring point in the computer room changes over time and space.

Although this temperature field distribution model is built from the CAD layout drawings of the computer room, it will generally still differ somewhat from the real temperature field in the actual computer room, so the model needs to be calibrated more carefully.

Step 202: collect real operating environment data in the computer room;

The real operating environment data include historical data such as the readings of the temperature measuring points in the actual computer room, the return air temperature set points, the air-conditioner fan speeds, and the environmental operating conditions.

Step 203: compare the collected real operating environment data with the operating environment data simulated by the temperature field distribution model, and repeatedly calibrate the temperature field distribution model until the degree of matching between the calibrated model's operating environment data and the real operating environment data reaches a preset threshold.

Repeatedly calibrating the temperature field distribution model means fine-tuning, among other things, the spatial positions and operating set parameters of the air conditioners and server IT equipment. The preset threshold is determined by the user according to actual needs so as to achieve a high degree of matching.
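The patent does not define how the degree of matching is computed. As one hypothetical way to make the check in step 203 concrete, the sketch below scores a calibration by the fraction of temperature readings whose simulation error is within a tolerance; the metric, the function names, and the default tolerance and threshold are assumptions for illustration only.

```python
def matching_degree(measured, simulated, tolerance_c=1.0):
    """Fraction of readings whose |measured - simulated| is within tolerance_c (deg C).

    This metric is an assumption; the source only says a matching degree is
    compared against a user-chosen preset threshold.
    """
    if len(measured) != len(simulated) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(1 for m, s in zip(measured, simulated) if abs(m - s) <= tolerance_c)
    return hits / len(measured)

def is_calibrated(measured, simulated, threshold=0.9, tolerance_c=1.0):
    """True once the matching degree reaches the preset threshold."""
    return matching_degree(measured, simulated, tolerance_c) >= threshold

# Toy readings at four temperature measuring points (deg C).
real = [24.0, 25.0, 30.0, 27.0]
sim = [24.5, 25.2, 28.0, 27.9]
degree = matching_degree(real, sim)
```

In practice the loop would adjust the model (equipment positions, set parameters), re-run the simulation, and repeat until `is_calibrated` returns true.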

In step 102, the Markov decision process model of the operation strategy of the data center terminal air-conditioning system is constructed. In one embodiment, the Markov decision process model consists of a state space $S$, an action space $A$, a state transition function $P$, a reward function $R$, and a discount factor $\gamma$, and can be expressed as the five-tuple $\langle S, A, P, R, \gamma \rangle$;

the states of the state space $S$ are selected from the observed variables;

the actions of the action space $A$ are selected from the control variables;

the reward function $R$ is obtained from the energy consumption penalty of the air conditioner and the over-temperature penalty of the server IT equipment;

the state transition function $P$ is obtained from the temperature field distribution model;

At each time $t$, learning is performed and an action $A_t$ is selected based on the state $S_t$ observed from the environment at time $t$; the environment responds to $A_t$, presents a new state $S_{t+1}$, and produces a reward $R_{t+1}$. This reward is the quantity to be maximized over the long term during action selection.

In the above embodiment, $S_{t+1}$ and $R_{t+1}$ depend only on the current state $S_t$ and action $A_t$ (through the state transition function $P$), and not on any earlier states or actions. This is the basic characteristic of states and rewards in a Markov decision process model (the Markov property).
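The interaction described above (observe a state, choose an action, receive the next state and a reward) can be sketched in Python. The environment below is a hypothetical one-sensor toy, not the CFD model of the patent; its dynamics, the 27 degree limit, and the rule-based policy are illustrative assumptions.

```python
import random

class ToyRoomEnv:
    """Hypothetical room with one cold-aisle temperature reading (deg C) as its state."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.temp = 26.0

    def reset(self):
        self.temp = 26.0
        return self.temp

    def step(self, fan_speed):
        # The next state depends only on the current temperature, the action,
        # and fresh noise, so the process is Markov.
        heating = 0.5 + self.rng.uniform(-0.1, 0.1)  # heat from the IT load
        self.temp = self.temp + heating - 1.0 * fan_speed
        # Reward: negative fan effort, plus a fixed penalty if the room is too hot.
        reward = -fan_speed - (10.0 if self.temp > 27.0 else 0.0)
        return self.temp, reward

def rollout(env, policy, steps, gamma):
    """Run the loop S_t -> A_t -> (S_{t+1}, R_{t+1}) and return the discounted return."""
    state = env.reset()
    discounted_return = 0.0
    for t in range(steps):
        action = policy(state)
        state, reward = env.step(action)
        discounted_return += (gamma ** t) * reward
    return discounted_return

# A simple hand-written policy: run the fan harder when the aisle is warm.
policy = lambda temp: 1.0 if temp > 26.0 else 0.2
g = rollout(ToyRoomEnv(seed=0), policy, steps=10, gamma=0.9)
```

A reinforcement learning algorithm would replace the hand-written `policy` with one that is updated to increase this discounted return.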

In the real operating environment of a typical data center terminal air-conditioning system, the observed variables generally include: the readings of the temperature measuring points in the cold/hot aisles and at the air-conditioner supply/return air outlets, the server IT load rate of each cabinet, the outdoor temperature, and the light intensity. The control variables generally include: the air-conditioner supply/return air temperature set points, the air-conditioner fan speed, and so on. At each time $t$, the control variables generally affect observed variables such as the cold/hot aisle and supply/return air temperature readings at the next time $t+1$, whereas observable variables such as the per-cabinet server IT load rate, outdoor temperature, and light intensity are not affected by the control variables and can generally only be forecast from historical data, for example with time-series methods for load forecasting. In general, the system state $s \in S$ can be selected with reference to the above observed variables, and the system action $a \in A$ with reference to the above control variables; the reward function $R$ is designed by combining the energy consumption penalty of the air conditioner with the over-temperature penalty of the server IT equipment; the temperature field distribution model provides the state transition function $P$; and a suitable discount factor $\gamma \in (0, 1)$ is selected, giving the Markov decision process model $\langle S, A, P, R, \gamma \rangle$ of the terminal air-conditioning system. On the basis of this Markov decision process model, a reinforcement learning algorithm is applied and trained in the simulation environment to obtain the optimal operation strategy of the terminal air-conditioning system.
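The text states that the reward combines an energy consumption penalty with an over-temperature penalty but does not give its form. The sketch below shows one hypothetical linear form in Python; the weights, the 27 degree limit, and the parameter names are assumptions, not values from the patent.

```python
def reward(fan_power_kw, rack_temps_c, temp_limit_c=27.0,
           energy_weight=1.0, overtemp_weight=10.0):
    """Negative weighted sum of an energy penalty and an over-temperature penalty.

    All weights and the temperature limit are illustrative assumptions.
    """
    energy_penalty = energy_weight * fan_power_kw
    # Only the excess above the limit is penalized, summed over all racks.
    overtemp_penalty = overtemp_weight * sum(
        max(0.0, t - temp_limit_c) for t in rack_temps_c
    )
    return -(energy_penalty + overtemp_penalty)

safe = reward(2.0, [25.0, 26.0])  # no rack above the limit
hot = reward(2.0, [25.0, 29.0])   # one rack 2 degrees above the limit
```

Changing `overtemp_weight` relative to `energy_weight` shifts the trade-off between thermal safety and energy saving, which is one way different reward functions $R$ can arise.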

The goal is to optimize the operation strategy of the data center terminal air-conditioning system so that terminal air-conditioner energy consumption is minimized while the thermal safety of the IT equipment in the computer room is guaranteed. The temperature field in the computer room (the controlled variable) changes over time and is affected at each moment by parameters such as the terminal air-conditioner return air temperature set point and fan speed (the control variables) and by the server IT load rate. The values of the control variables at each moment must therefore be adjusted sensibly in light of the readings of the temperature measuring points and the IT load rate at that moment, which in essence forms a sequential decision problem. For such problems, one generally defines the state-action space, builds a Markov decision process model, and then trains a policy function with methods such as reinforcement learning.

In the data center terminal air-conditioning environment, the readings of the cold and hot aisle temperature measuring points are continuous, as are the return air temperature set point, the fan speed, and similar values, and the temperature field varies in a complex way over time and space, so this sequential decision problem is quite difficult. Existing function-approximation reinforcement learning methods can train and obtain some operation strategy, but its optimality is hard to guarantee. A reinforcement learning algorithm can therefore be used to generate multiple operation strategies of the data center terminal air-conditioning system in the simulation environment, construct a strategy library Π, and select the optimal strategy from the library in a principled way.

In the operation strategy optimization problem of the data center terminal air-conditioning system, Markov decision process models with many different parameters can often be constructed: different choices of the state S, the action A, the designed reward function R, and the discount factor γ each correspond to a Markov decision process model with different parameters. Reinforcement learning applied to models with different parameters yields different optimal strategies. In the problem addressed by the embodiments of the invention, the performance of an operation strategy trained in the temperature field distribution model can often be evaluated objectively only after the strategy is actually deployed, and only then can one judge which Markov decision process model yields the best-performing operation strategy. A set of Markov decision process models of the terminal air-conditioning system can therefore first be constructed by varying the choice of state S, action A, reward function R, and discount factor γ, and a reinforcement learning algorithm is then applied to each Markov decision process model in the set to train an optimal strategy.
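As an illustration of how such a model set might be enumerated, the Python sketch below takes the Cartesian product of a few candidate choices. The class `MDPConfig`, the variable names, and the candidate values are hypothetical; the patent does not specify how the set is generated.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class MDPConfig:
    """One hypothetical parameterization of the Markov decision process model."""
    state_vars: tuple       # which observed variables form the state
    action_vars: tuple      # which control variables form the action
    overtemp_weight: float  # stands in for the choice of reward function
    gamma: float            # discount factor in (0, 1)

def build_mdp_set(state_options, action_options, overtemp_weights, gammas):
    """Enumerate every combination of the candidate choices."""
    return [
        MDPConfig(s, a, w, g)
        for s, a, w, g in product(state_options, action_options, overtemp_weights, gammas)
    ]

configs = build_mdp_set(
    state_options=[("cold_aisle_temp",), ("cold_aisle_temp", "it_load")],
    action_options=[("fan_speed",), ("fan_speed", "return_air_setpoint")],
    overtemp_weights=[5.0, 10.0],
    gammas=[0.9, 0.99],
)
```

Each configuration would then be handed to a training routine, and every trained strategy added to the strategy library.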

In step 103, in the temperature field distribution model, a reinforcement learning algorithm is used to train on Markov decision process models with different parameters, using different strategy functions respectively, so as to generate multiple operation strategies of the data center terminal air conditioning system and build a strategy library.

In one embodiment, step 103 includes:

determining the strategy functions to be used;

determining the action value function of an operation strategy, where the action value function represents the cumulative discounted reward, obtained by weighting the rewards R, of taking action a in state s when the operation strategy is used, which can be expressed as follows:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s,\ A_t = a\right]$$

for each strategy function, in the temperature field distribution model and within the framework of a reinforcement learning algorithm, alternately updating the action value function and the strategy function until they converge to an optimized operation strategy, and adding the optimized operation strategy to the strategy library as one operation strategy.
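The cumulative discounted reward that the action value function averages can be illustrated with a short sketch. It assumes the standard reinforcement learning definition (the sum of γ^k times the k-th subsequent reward); the helper names `discounted_return` and `monte_carlo_q` and the reward values are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def monte_carlo_q(episodes, gamma):
    """Estimate Q(s, a) as the average discounted return over reward sequences
    that all start from the same (s, a) pair and then follow the same strategy."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)


# Rewards are negative here because they stand for energy and over-temperature penalties.
print(discounted_return([-1.0, -2.0, -4.0], gamma=0.5))  # -1 - 1 - 1 = -3.0
print(monte_carlo_q([[-1.0, -2.0, -4.0], [-1.0, 0.0, 0.0]], gamma=0.5))  # (-3 - 1) / 2 = -2.0
```

In an actor-critic style method, an estimate of this quantity plays the role of the critic, and the strategy function is adjusted in the direction that increases it.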

In one embodiment, the strategy functions include a neural-network strategy function and a linearly weighted basis-function strategy function.

Because neural networks have good representational capacity and generalization properties, a neural network is often used to fit the operation strategy:

$$\pi(s) = f^{NN}(s, \theta)$$

where π(s) is the operation strategy and f^NN(s, θ) is the neural-network strategy function; the input of the neural network is the state s, the output is the action a, and θ denotes the neural network and its training parameters. For the multiple Markov decision process models with different parameters, algorithms such as DDPG, TD3, and SAC can be used to train optimal neural-network strategy functions for the operation of the data center terminal air conditioning system, and the resulting strategies are placed in the strategy library Π.
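A neural-network strategy function of this kind can be sketched as a small deterministic network that maps a state vector to a bounded action. This shows only the forward pass, not DDPG, TD3, or SAC training; the layer sizes, the example state, and the action bounds are assumptions made for illustration.

```python
import numpy as np


def init_params(state_dim, hidden_dim, action_dim, seed=0):
    """Create theta for a one-hidden-layer network (sizes are illustrative)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (hidden_dim, state_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.1, (action_dim, hidden_dim)),
        "b2": np.zeros(action_dim),
    }


def nn_policy(state, theta, action_low, action_high):
    """pi(s) = f_NN(s, theta): map state s to an action inside the given bounds."""
    h = np.tanh(theta["W1"] @ state + theta["b1"])
    u = np.tanh(theta["W2"] @ h + theta["b2"])  # squashed to [-1, 1]
    return action_low + 0.5 * (u + 1.0) * (action_high - action_low)


# Hypothetical state: two outlet temperatures (deg C) and a normalized IT load.
theta = init_params(state_dim=3, hidden_dim=8, action_dim=1)
state = np.array([24.0, 26.5, 0.6])
action = nn_policy(state, theta, np.array([18.0]), np.array([27.0]))
print(action)  # a supply-air setpoint between 18 and 27 (untrained network)
```

Squashing the output with `tanh` and rescaling keeps every action inside the permitted control range regardless of the parameter values, which is a common design choice for continuous-control strategies.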

In addition to the neural-network strategy function, a linearly weighted basis-function strategy function can also be used:

$$\pi(s) = f^{basis}(s, w) = w^{T}\phi(s)$$

where π(s) is the operation strategy, f^basis(s, w) is the linearly weighted basis-function strategy function, w is the weight vector, and φ(s) is the vector of basis functions. The basis functions can be designed from prior knowledge about the safe operation of the data center, combined with the operating characteristics of the terminal air conditioning system. Likewise, for the Markov decision process models with different parameters, training can be carried out within the framework of the AC (actor-critic) reinforcement learning algorithm, continually updating the weight vector, until an optimal linearly weighted basis-function strategy function for the operation of the data center terminal air conditioning system is obtained, and the strategy is placed in the strategy library Π.

The representational capacity of a linearly weighted basis-function strategy may be weaker than that of a neural network, but only the weight vector of the linear combination needs to be updated, so the strategy is more interpretable, the training process is more stable, and convergence is better.
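The linearly weighted form π(s) = w^T φ(s) can be sketched in a few lines. The particular basis functions, the limit `T_MAX`, and the weights below are hypothetical examples of features built from safety-related prior knowledge; the source does not specify them.

```python
import numpy as np

T_MAX = 32.0  # hypothetical upper limit of the outlet temperature, deg C


def phi(state):
    """Hypothetical basis functions computed from server outlet temperatures."""
    temps = np.asarray(state, dtype=float)
    margin = T_MAX - temps.max()  # how far the hottest outlet is from the limit
    return np.array([1.0, margin, temps.mean() - T_MAX])


def basis_policy(state, w):
    """pi(s) = w^T phi(s); only the weight vector w is learned."""
    return float(w @ phi(state))


w = np.array([20.0, 0.5, 0.1])  # placeholder weights, not trained values
setpoint = basis_policy([28.0, 30.0, 29.0], w)
print(round(setpoint, 3))  # 20 + 0.5 * 2 + 0.1 * (-3) = 20.7
```

Because each weight multiplies a named feature, the effect of a learned weight on the action can be read off directly, which is the interpretability advantage mentioned above.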

In step 104, according to the ordinal optimization method, the performance of each operation strategy in the strategy library is evaluated in the temperature field distribution model, and a selection set is determined from the strategy library.

Suppose the obtained strategy library Π of data center terminal air conditioning operation strategies contains N operation strategies. All N strategies are generated in the temperature field distribution model, but because some deviation between the temperature field distribution model and the real operating environment is unavoidable, the performance of the strategies in Π must be evaluated in the real operating environment in order to select the optimal strategy π* from the strategy library Π.

In one embodiment, the formula for evaluating the performance of each operation strategy in the strategy library in the temperature field distribution model is as follows:

where π* is the optimal operation strategy; J(π) is the performance evaluation function of an operation strategy in the operating environment of the data center machine room, expressing the total energy consumption and total over-temperature of the operation strategy over the time horizon T; Δt is the time interval from t to t+1; P_t is the operating power of the terminal air conditioner at time t; the temperature term is the air outlet temperature of the i-th server IT device in the data center machine room at time t; T_max is the upper limit of the allowable cabinet air outlet temperature; and λ is a weight parameter.
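To make the roles of these symbols concrete, the sketch below evaluates one possible performance function built from them. The exact expression of formula (1) is not reproduced in the text, so the form used here, total energy Σ P_t·Δt plus λ times the summed over-temperature excess max(0, T_t^i − T_max), is an assumption, as are the notation T_t^i, the function name `performance_j`, and all numeric values.

```python
import numpy as np


def performance_j(power, outlet_temps, dt, t_max, lam):
    """Assumed form: J = sum_t P_t * dt + lam * sum_t sum_i max(0, T_t^i - t_max).

    power:        shape (T,), air conditioner power at each time step
    outlet_temps: shape (T, n_servers), outlet temperature of each server
    """
    power = np.asarray(power, dtype=float)
    temps = np.asarray(outlet_temps, dtype=float)
    energy = float(np.sum(power * dt))
    overtemp = float(np.sum(np.maximum(temps - t_max, 0.0)))
    return energy + lam * overtemp


# Two time steps, two servers; only one reading exceeds the 32 deg C limit, by 1 deg C.
j_value = performance_j(
    power=[10.0, 12.0],
    outlet_temps=[[30.0, 33.0], [31.0, 32.0]],
    dt=0.5,
    t_max=32.0,
    lam=5.0,
)
print(j_value)  # energy 11.0 + 5.0 * 1.0 = 16.0
```

A smaller value is better under this convention: λ controls how strongly a thermal violation is penalized relative to energy use.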

In addition to formula (1) above, J(π) can also be expressed as:

Subject to the safety constraints of the real operating environment, it is impractical to evaluate J(π) in the real operating environment for every operation strategy in the strategy library Π; only a small, reasonably chosen subset of strategies that are more likely to be optimal can be evaluated there. In the simulation environment of the temperature field distribution model, by contrast, the performance of operation strategies can be evaluated without restriction. In practical engineering, it is often unnecessary to pursue the exact solution of formula (1); a "good enough" solution satisfies the engineering need. That is, rather than finding the operation strategy in Π with the smallest J(π), it suffices that the true performance J(π) of the operation strategy finally obtained ranks among the g smallest in the library.

Given these characteristics, the ordinal optimization method can be used to select among the operation strategies in the strategy library Π. Specifically, the strategy performance evaluation function J(π) of the real operating environment is taken as the detailed model. In the temperature field distribution model, the same formula is used as the strategy performance evaluation function, but because the evaluation in the temperature field distribution model differs from the evaluation in the real operating environment, the performance evaluation obtained in the simulation environment is denoted J′(π) and regarded as the crude model. The class of the ordered performance curve (OPC) and the noise level of the crude model are estimated, and the size g of the "good enough" set G (the strategy selection set) and the alignment level k are given according to user preference; the alignment level k is related to the alignment probability. The size s of the selection set S is determined from the ordinal optimization selection set formula; the performance of all operation strategies in Π is evaluated with the crude model, and the s strategies with the smallest J′(π) form the selection set S. Ordinal optimization theory ensures that, with probability 95%, S contains at least k strategies whose true performance is among the g smallest. Equivalently, the probability that at least k elements (designs) of the selection set S are truly "good enough" is at least 95%.

In summary, the following steps are obtained.

FIG. 3 is a flowchart of determining the selection set in an embodiment of the present invention. In one embodiment, evaluating the performance of each operation strategy in the strategy library in the temperature field distribution model according to the ordinal optimization method, and determining the selection set from the strategy library, includes:

Step 301: take the strategy performance evaluation function J′(π) obtained in the temperature field distribution model as the crude model;

Step 302: estimate the class of the ordered performance curve and the noise level of the crude model;

Step 303: obtain the size g of the strategy selection set G and the alignment level k determined by the user, where the alignment level k is related to the alignment probability;

Step 304: determine the size s of the selection set S according to the ordinal optimization selection set formula, whose parameters include the class of the ordered performance curve, the noise level of the crude model, the size g of the strategy selection set G, and the alignment level k; the ordinal optimization selection set formula is as follows:

$$s = Z(k, g) = e^{Z_1} k^{Z_2} g^{Z_3} + Z_4$$

where Z₁, Z₂, Z₃, and Z₄ are determined by the class of the ordered performance curve and the noise level of the crude model, and are derived from a large body of historical data.

Step 305: compute the crude-model value of every operation strategy in the strategy library, and select the s operation strategies with the smallest crude-model values to form the selection set S, where s is the size of the selection set.
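Steps 304 and 305 can be sketched as follows, assuming the regression form s = e^{Z₁}·k^{Z₂}·g^{Z₃} + Z₄ commonly given in the ordinal optimization literature. The coefficient values passed in below are placeholders for illustration only; in practice they would be looked up for the estimated ordered performance curve class and noise level. The function names and the random crude-model values are likewise hypothetical.

```python
import math
import random


def selection_size(k, g, z1, z2, z3, z4, n_strategies):
    """Selection set size from the assumed regression form, rounded up and
    clipped so that k <= s <= number of strategies in the library."""
    s = math.ceil(math.exp(z1) * (k ** z2) * (g ** z3) + z4)
    return max(k, min(s, n_strategies))


def select_by_crude_model(crude_values, s):
    """Return the indices of the s strategies with the smallest J'(pi)."""
    order = sorted(range(len(crude_values)), key=lambda i: crude_values[i])
    return order[:s]


# Stand-in for J'(pi) of a library of 1000 strategies evaluated in simulation.
random.seed(0)
crude = [random.uniform(50.0, 150.0) for _ in range(1000)]

# Placeholder coefficients; not values taken from the source.
s = selection_size(k=1, g=50, z1=8.1998, z2=1.9164, z3=-2.0250, z4=10.0,
                   n_strategies=len(crude))
selected = select_by_crude_model(crude, s)
print(s, selected[:3])
```

Only the `s` strategies in `selected` go on to be evaluated in the real machine room, which is where the saving in real-environment evaluations comes from.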

In step 105, each operation strategy in the selection set is applied to the real operating environment of the data center machine room, and the optimal operation strategy in the selection set is determined.

In one embodiment, applying each operation strategy in the selection set to the real operating environment of the data center machine room and determining the optimal operation strategy in the selection set includes:

taking the strategy performance evaluation function J(π) of the real operating environment of the data center machine room as the detailed model;

applying each operation strategy in the selection set to the real operating environment of the data center machine room to obtain the detailed-model value of each operation strategy;

taking the operation strategy with the smallest detailed-model value as the final operation strategy of the data center terminal air conditioning system.
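The final choice in step 105 amounts to one real-environment evaluation per strategy in the selection set followed by taking the minimum. The sketch below assumes a callable `evaluate_detailed` that returns J(π) measured in the real machine room; that callable, the strategy names, and the measured values are all hypothetical.

```python
def pick_final_strategy(selection_set, evaluate_detailed):
    """Evaluate each selected strategy once with the detailed model and
    return the strategy with the smallest J(pi) plus all measured values."""
    values = {name: evaluate_detailed(name) for name in selection_set}
    best = min(values, key=values.get)
    return best, values


# Stand-in for real-environment measurements (illustrative numbers only).
measured = {"pi_3": 118.4, "pi_7": 102.9, "pi_12": 110.0}
best, values = pick_final_strategy(list(measured), measured.get)
print(best)  # pi_7
```

Because each real-environment trial carries thermal risk and cost, each strategy is evaluated exactly once here and the results are kept in `values` for later inspection.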

In summary, the method proposed in the embodiments of the present invention has the following beneficial effects:

First, a strategy library containing multiple operation strategies is formed from Markov decision process models with different parameters and different forms of strategy function. Whereas the traditional two-stage method and the reinforcement learning method generate only a single operation strategy, this scheme considers several forms of strategy function together and selects among multiple operation strategies in a principled way. The operation strategy finally obtained therefore offers a stronger performance guarantee than the single strategy produced by traditional methods: it better ensures that, in an actual data center environment, the strategy both maintains the thermal safety of the server IT equipment and reduces the energy consumption of the terminal air conditioners as far as possible.

Second, in selecting the operation strategy, the traditional approach computes the true performance of all operation strategies, sorts them, and picks the best. This scheme instead uses the ordinal optimization method to obtain a selection set, which greatly reduces the number of evaluations of library strategies in the real operating environment, further safeguards the thermal safety of the data center server IT equipment, and saves labor, material, and financial resources.

An embodiment of the present invention also provides a device for determining the operation strategy of the data center terminal air conditioning system. Its principle is similar to that of the method for determining the operation strategy of the data center terminal air conditioning system and is not repeated here.

FIG. 4 is a schematic diagram of the device for determining the operation strategy of the data center terminal air conditioning system in an embodiment of the present invention, which includes:

a temperature field distribution model building module 401, configured to build a temperature field distribution model of the data center machine room;

a Markov decision process model construction module 402, configured to construct a Markov decision process model of the operation strategy of the data center terminal air conditioning system;

a strategy library construction module 403, configured to train, in the temperature field distribution model and with a reinforcement learning algorithm, on Markov decision process models with different parameters, using different strategy functions respectively, so as to generate multiple operation strategies of the data center terminal air conditioning system and build a strategy library;

a selection set determination module 404, configured to evaluate, according to the ordinal optimization method, the performance of each operation strategy in the strategy library in the temperature field distribution model, and to determine a selection set from the strategy library;

an optimal operation strategy determination module 405, configured to apply each operation strategy in the selection set to the real operating environment of the data center machine room and to determine the optimal operation strategy in the selection set.

In one embodiment, the temperature field distribution model building module is specifically configured to:

model and simulate, with CFD simulation software and according to the CAD drawings of the machine room layout, the spatial structure of the data center machine room and the models of its air conditioners and IT equipment, and establish the temperature field distribution model of the data center machine room;

collect real operating environment data in the machine room;

compare the collected real operating environment data with the operating environment data simulated by the temperature field distribution model, and repeatedly calibrate the temperature field distribution model until the degree of match between the operating environment data of the calibrated model and the real operating environment data reaches a preset threshold.

In one embodiment, the Markov decision process model consists of a state space S, an action space A, a state transition function P, a reward function R, and a discount factor γ;

In one embodiment, training in the temperature field distribution model with a reinforcement learning algorithm on Markov decision process models with different parameters, using different strategy functions respectively, to generate multiple operation strategies of the data center terminal air conditioning system and build a strategy library, includes:

determining the action value function of an operation strategy, where the action value function represents the cumulative discounted reward, obtained by weighting the rewards R, of taking action a in state s when the operation strategy is used;

In one embodiment, the selection set determination module is specifically configured to:

take the strategy performance evaluation function J′(π) obtained in the temperature field distribution model as the crude model;

estimate the class of the ordered performance curve and the noise level of the crude model;

obtain the size g of the strategy selection set G and the alignment level k determined by the user;

determine the size s of the selection set S according to the ordinal optimization selection set formula, whose parameters include the class of the ordered performance curve, the noise level of the crude model, the size g of the strategy selection set G, and the alignment level k;

compute the crude-model value of every operation strategy in the strategy library, and select the s operation strategies with the smallest crude-model values to form the selection set S.

In one embodiment, the optimal operation strategy determination module is specifically configured to:

In summary, the beneficial effects of the device proposed in the embodiments of the present invention are as follows:

An embodiment of the present invention also provides a computer device. FIG. 5 is a schematic diagram of the computer device in an embodiment of the present invention. The computer device 500 includes a memory 510, a processor 520, and a computer program 530 stored on the memory 510 and executable on the processor 520; when the processor 520 executes the computer program 530, the above method for determining the operation strategy of the data center terminal air conditioning system is implemented.

Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing equipment produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

The specific embodiments described above further explain the purpose, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for determining an operation strategy of a data center terminal air conditioning system, characterized by comprising the following steps:

building a temperature field distribution model of a data center machine room;

constructing a Markov decision process model of the operation strategy of the data center terminal air conditioning system;

in the temperature field distribution model, training with a reinforcement learning algorithm on Markov decision process models with different parameters, using different strategy functions respectively, to generate multiple operation strategies of the data center terminal air conditioning system and construct a strategy library;

according to an ordinal optimization method, evaluating the performance of each operation strategy in the strategy library in the temperature field distribution model, and determining a selection set from the strategy library;

and applying each operation strategy in the selection set to the real operating environment of the data center machine room, and determining the optimal operation strategy in the selection set.

2. The method of claim 1, wherein building a temperature field distribution model of a data center machine room comprises:

modeling and simulating, with CFD simulation software and according to the CAD drawings of the machine room layout, the spatial structure of the data center machine room and the models of the air conditioners and IT equipment, and establishing the temperature field distribution model of the data center machine room;

collecting real operating environment data in the machine room;

and comparing the collected real operating environment data with the operating environment data simulated by the temperature field distribution model, and repeatedly calibrating the temperature field distribution model, so that the degree of match between the operating environment data of the calibrated temperature field distribution model and the real operating environment data reaches a preset threshold.

3. The method of claim 1, wherein the Markov decision process model consists of a state space S, an action space A, a state transition function P, a reward function R, and a discount factor γ;

the states of the state space S are selected from the observed variables;

the actions of the action space A are selected from the control variables;

the reward function R is obtained from an energy consumption penalty of the air conditioner and an over-temperature penalty of the server IT equipment;

the state transition function P is obtained from the temperature field distribution model;

at each time t, learning is performed on the basis of the state S_t observed from the environment at time t and an action A_t is selected; the environment responds to the action A_t and presents a new state S_{t+1} together with a reward R_{t+1}; and the goal of action selection is to maximize the reward over the long term.

4. The method of claim 3, wherein, in the temperature field distribution model, training with a reinforcement learning algorithm on Markov decision process models with different parameters, using different strategy functions respectively, to generate multiple operation strategies of the data center terminal air conditioning system and construct a strategy library, comprises:

determining the strategy functions to be used;

determining an action value function of the operation strategy, wherein the action value function represents the cumulative discounted reward, obtained by weighting the rewards R, of taking action a in state s when the operation strategy is used;

for each strategy function, in the temperature field distribution model and within the framework of a reinforcement learning algorithm, alternately updating the action value function and the strategy function until they converge to an optimized operation strategy, and adding the optimized operation strategy to the strategy library as one operation strategy.

5. The method of claim 3, wherein the strategy functions comprise a neural-network strategy function and a linearly weighted basis-function strategy function.

6. The method of claim 1, wherein the formula for evaluating the performance of each operation strategy in the strategy library in the temperature field distribution model is as follows:

wherein π* is the optimal operation strategy; J(π) is the performance evaluation function of the operation strategy in the operating environment of the data center machine room, and represents the total energy consumption and total over-temperature of the operation strategy within the time T; Δt is the time interval from t to t+1; P_t is the operating power of the terminal air conditioner at time t; the temperature term is the air outlet temperature of the i-th server IT device of the data center machine room at time t; T_max is the upper limit of the allowable cabinet air outlet temperature; and λ is a weight parameter.

7. The method of claim 6, wherein evaluating the performance of each operation strategy in the strategy library in the temperature field distribution model according to the ordinal optimization method, and determining a selection set from the strategy library, comprises:

taking the strategy performance evaluation function J′(π) obtained in the temperature field distribution model as a crude model;

estimating the class of the ordered performance curve and the noise level of the crude model;

obtaining the size g of a strategy selection set G and the alignment level k determined by a user;

determining the size s of a selection set S according to the ordinal optimization selection set formula, wherein the parameters of the ordinal optimization selection set formula comprise the class of the ordered performance curve, the noise level of the crude model, the size g of the strategy selection set G, and the alignment level k;

and calculating the crude-model values of all operation strategies in the strategy library, and selecting the s operation strategies with the smallest crude-model values to form the selection set S.

8. The method of claim 7, wherein applying each operation strategy in the selection set to the real operating environment of the data center machine room and determining the optimal operation strategy in the selection set comprises:

taking the strategy performance evaluation function J(π) of the real operating environment of the data center machine room as a detailed model;

applying each operation strategy in the selection set to the real operating environment of the data center machine room to obtain the detailed-model value of each operation strategy;

and taking the operation strategy with the smallest detailed-model value as the final operation strategy of the data center terminal air conditioning system.

9. A device for determining an operation strategy of a data center terminal air conditioning system, characterized by comprising:

a temperature field distribution model building module, configured to build a temperature field distribution model of a data center machine room;

a Markov decision process model construction module, configured to construct a Markov decision process model of the operation strategy of the data center terminal air conditioning system;

a strategy library construction module, configured to train, in the temperature field distribution model and with a reinforcement learning algorithm, on Markov decision process models with different parameters, using different strategy functions respectively, to generate multiple operation strategies of the data center terminal air conditioning system and construct a strategy library;

a selection set determination module, configured to evaluate, according to an ordinal optimization method, the performance of each operation strategy in the strategy library in the temperature field distribution model, and to determine a selection set from the strategy library;

and an optimal operation strategy determination module, configured to apply each operation strategy in the selection set to the real operating environment of the data center machine room and to determine the optimal operation strategy in the selection set.

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 8 when executing the computer program.

11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1 to 8.

12. A computer program product, characterized in that the computer program product comprises a computer program which, when being executed by a processor, carries out the method of any one of claims 1 to 8.