
CN118816340A - Energy-saving control method and intelligent agent for general air-conditioning system based on reinforcement transfer learning - Google Patents


Info

Publication number
CN118816340A
CN118816340A (application CN202411296074.4A); granted as CN118816340B
Authority
CN
China
Prior art keywords
value
conditioning system
air
action
agent
Prior art date
Legal status
Granted
Application number
CN202411296074.4A
Other languages
Chinese (zh)
Other versions
CN118816340B (en)
Inventor
黄磊
冯吉星
徐超
周俞杰
秦浩森
Current Assignee
Shanghai Tongcai Zhineng Information Technology Co ltd
Original Assignee
Shanghai Tongcai Zhineng Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Tongcai Zhineng Information Technology Co., Ltd.
Priority to CN202411296074.4A
Publication of CN118816340A
Application granted
Publication of CN118816340B
Status: Active

Classifications

    • F24F11/46 Improving electric energy efficiency or saving
    • F24F11/61 Control or safety arrangements characterised by user interfaces or communication, using timers
    • F24F11/64 Electronic processing using pre-stored data
    • F24F11/77 Controlling air flow rate or air velocity by controlling the speed of ventilators
    • F24F11/85 Controlling the temperature of the supplied air by controlling the supply of heat-exchange fluids to heat-exchangers, using variable-flow pumps
    • F24F11/88 Electrical aspects, e.g. circuits
    • F24F2110/10 Control inputs relating to air properties: temperature

Landscapes

  • Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Combustion & Propulsion (AREA)
  • Mechanical Engineering (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Fluid Mechanics (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Air Conditioning Control Device (AREA)

Abstract

The present invention discloses an energy-saving control method and an intelligent agent for general air-conditioning systems based on reinforcement transfer learning. The method includes: obtaining the operating data of the air-conditioning system and the energy-consumption data of each air-conditioning device at the corresponding timestamps; feeding the operating data and energy-consumption data into the agent for learning and setting the agent's basic parameters; according to the implementation parameters of target air-conditioning system project 1 to be controlled, the agent judges whether a Q table of a project 2 suitable for migration can be found in the existing training library; if so, the state space and action space are normalized and the normalized Q table is processed; the threshold and step-size parameters of the action set of the migrated agent are then set, and control instructions are issued to the target air-conditioning system; if not, the agent is configured directly according to the reinforcement-learning requirements. The invention achieves a high level of energy-saving control without relying on numerous sensor measurements, load forecasting, or system modelling, and can be applied to different air-conditioning systems in a variety of building scenarios.

Description

Energy-saving control method and intelligent agent for a general air-conditioning system based on reinforcement transfer learning

Technical Field

The present invention relates to the field of artificial intelligence, and in particular to an energy-saving control method and an intelligent agent for general air-conditioning systems based on reinforcement transfer learning.

Background Art

In the operating energy consumption of public buildings, the air-conditioning system typically accounts for more than 50% of the total. Although much research has addressed the design and construction of air-conditioning plant rooms, many problems remain in the control, operation, and maintenance of air-conditioning systems.

For example, many buildings still control their air-conditioning systems manually. Because of the low control precision of this approach, the system is often operated redundantly just to meet the indoor temperature target, which wastes electricity.

As another example, some buildings control the air-conditioning system with a PID (proportional-integral-derivative) based building automation system (BAS), but its energy-efficiency gains depend mainly on built-in expert experience. The energy-saving effect of such expert-experience control degrades over the service life of the system, because the expert experience consists of operating-parameter settings for fixed working conditions, whereas the actual operating environment of an air-conditioning system is non-stationary. Expert experience can deliver good performance only under fixed conditions and cannot continuously adapt to changing conditions. For complex conditions in which multiple chillers operate in coordination, expert-experience control may not achieve the optimal result.

Energy-saving control methods that rely on model-based optimization, in turn, require high-precision equipment models. Although such methods can achieve a certain degree of energy-saving optimization, equipment performance degrades after a period of use and the model must be updated to maintain the optimization effect. Moreover, these methods depend on long-term operation and maintenance data and can only be modelled for specific types of building scenarios, such as data centres and shopping malls, so their generality is poor.

Therefore, the building energy-efficiency market urgently needs an energy-saving control method for air-conditioning systems that can be applied to most building scenarios, that maintains the energy-saving rate while offering a wide range of applicability, and that can be readily applied in engineering practice.

Summary of the Invention

One object of the present invention is to provide an energy-saving control method for general air-conditioning systems based on reinforcement transfer learning, so as to solve the technical problem that the prior art cannot provide automatic energy-saving control applicable to air-conditioning systems in most building scenarios; while ensuring the energy-saving rate, the method has a wide range of applicability and can be readily applied in engineering practice.

To solve the above technical problem, in a first aspect an embodiment of the present invention provides an energy-saving control method for a general air-conditioning system based on reinforcement transfer learning, the method comprising:

the agent obtains the operating data of the air-conditioning system and the energy-consumption data of each air-conditioning device at the corresponding timestamps;

the operating data and the energy-consumption data are fed as inputs into the agent for learning, and the basic parameters of the agent are set, the basic parameters including a state set, an action set, the reward function of the environment, and the ε-greedy policy setting;

according to the implementation parameters of target air-conditioning system project 1 to be controlled, the agent judges whether a Q table of a project 2 suitable for migration can be found in the existing training library; if so, the state space and the action space are normalized so that a correspondence is established between Q tables with different state and action spaces, and the normalized Q table is processed to complete the migration, the implementation parameters including but not limited to the air-conditioning system load, equipment types, equipment quantities, and connection modes; if not, the agent is configured directly according to the reinforcement-learning requirements;

the threshold and step-size parameters of the agent's action set are restricted, and the agent issues control instructions to the target air-conditioning system.

Further, the method also includes:

after the agent issues a control instruction to the target air-conditioning system, it synchronously obtains the operating parameters at the current timestamp and calculates the current system energy-efficiency indicator; the system energy-efficiency indicator is used as the reward value and serves as the agent's basis for deciding whether to store the entry in the Q table, the system energy-efficiency indicator including but not limited to one or more of COP, EER, and PUE.

Preferably, the basic parameters of the agent satisfy the following formula:

$Q(s,a) = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, \max_{a'} Q(s',a')$ ;

where S denotes the state set of the agent, which comprises the instantaneous cooling load of the system and the outdoor meteorological parameters; A denotes the action set of the agent, which comprises the operating status of the air-conditioning equipment and its current action set-points; r denotes the reward function of the environment, which is an energy-efficiency indicator of the air-conditioning system, the indicator including but not limited to one or more of COP, EER, and PUE;

p denotes the state-transition probability of the environment, and γ denotes the discount factor;

$\max_{a'} Q(s',a')$ is the maximum Q value obtainable in the next state.

Preferably, during the transfer-learning processing, normalizing the state space and the action space so that a correspondence is established between Q tables with different state and action spaces, and processing the normalized Q table, specifically includes:

the normalization formula is as follows:

$x^{*} = \dfrac{x - \min}{\max - \min}$ ;

where x is the data to be normalized, min and max are respectively the minimum and maximum values of that data, and $s_{\text{base}}$ and $a_{\text{base}}$ (the notation used here) are the reference values for the normalization; the value of each state (s) and each action (a) is divided by the corresponding reference value and scaled into the interval [0, 1], as shown below:

$\tilde{s}_1 = s_1 / s_{\text{base}}, \quad \tilde{a}_1 = a_1 / a_{\text{base}}$ ;

$\tilde{s}_2 = s_2 / s_{\text{base}}, \quad \tilde{a}_2 = a_2 / a_{\text{base}}$ ;

where Q1 is the table of actions of the air-conditioning system of project 1 at each state point, and Q2 is the table of actions of the air-conditioning system of project 2 at each state point;

after normalization, a given Q value in the new Q table, $Q_{\text{new}}(i,j)$, has to be obtained by weighting the Q values in the original Q table whose index values are close to it, according to the following formula:

$Q_{\text{new}}(i,j) = \dfrac{1}{Z} \sum_{m,n} \beta^{\,\lvert i - r_m \rvert}\, \eta^{\,\lvert j - c_n \rvert}\, Q_{\text{old}}(m,n)$ ;

其中,为原Q表归一化后的行索引的值,即将状态空间对应到了行索引上;为原Q表归一化后的列索引的值,即将动作空间对应到了列索引上;in, is the value of the normalized row index of the original Q table, that is, the state space is mapped to the row index; is the value of the normalized column index of the original Q table, that is, the action space is mapped to the column index;

为参数,分别的值∈(0,1),被作为底数,构造以索引之间的距离绝对值为指数的函数,当距离接近或为0时,函数值为1,当距离增加时,函数值快速降低,降低速度与参数的取值相关; and is a parameter, and its value ∈(0,1) is used as the base to construct a function with the absolute value of the distance between the indices as the exponent. When the distance is close to or equal to 0, the function value is 1. When the distance increases, the function value decreases rapidly, and the speed of decrease is related to the value of the parameter.

构成了该点的Q值对所求的新Q值的贡献权重;为归一化系数,用于将所有权重之和放缩为1,归一化系数的计算公式如下式所示: It constitutes the contribution weight of the Q value of the point to the new Q value required; is the normalization coefficient, which is used to scale the sum of all weights to 1. The calculation formula of the normalization coefficient is as follows:

$Z = \sum_{m,n} \beta^{\,\lvert i - r_m \rvert}\, \eta^{\,\lvert j - c_n \rvert}$ ;

When migrating between systems whose energy-efficiency indicators differ, the Q values of the entire Q table must be scaled according to the magnitude of the reward values.

Preferably, restricting the threshold and step-size parameters of the action set of the migrated agent specifically includes (a constraint-enforcement sketch is given after this list):

the control interval shall be no shorter than 10 minutes per adjustment;

the control step size, i.e. each frequency adjustment, shall not exceed 2 Hz per adjustment;

each temperature adjustment shall not exceed 1 °C;

the safety thresholds for the chiller outlet-water temperature, the pump frequencies, and the cooling-tower fan frequency follow the design parameters of the overall air-conditioning system and of the corresponding equipment.
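Below is a minimal Python sketch, not the patented implementation, of how the listed action-set restrictions could be enforced before a command is sent. Only the 10-minute interval, the 2 Hz frequency step, and the 1 °C temperature step come from the text above; the class name, the assumed safe frequency and temperature ranges, and all other details are illustrative.

```python
# Sketch: enforce the action-set constraints before issuing a control command.
import time

class ActionLimiter:
    def __init__(self, min_interval_s=600, max_freq_step_hz=2.0, max_temp_step_c=1.0,
                 freq_bounds=(30.0, 50.0), temp_bounds=(7.0, 13.0)):
        self.min_interval_s = min_interval_s          # 10 min between commands (from the text)
        self.max_freq_step_hz = max_freq_step_hz      # 2 Hz per adjustment (from the text)
        self.max_temp_step_c = max_temp_step_c        # 1 °C per adjustment (from the text)
        self.freq_bounds = freq_bounds                # assumed safe pump/fan frequency range
        self.temp_bounds = temp_bounds                # assumed safe chiller outlet-water range
        self._last_cmd_time = None

    def allowed_now(self):
        """Commands may be issued at most once per min_interval_s."""
        now = time.time()
        if self._last_cmd_time is not None and now - self._last_cmd_time < self.min_interval_s:
            return False
        self._last_cmd_time = now
        return True

    def clamp_frequency(self, current_hz, target_hz):
        """Limit the frequency change per command and keep it inside the safe range."""
        step = max(-self.max_freq_step_hz, min(self.max_freq_step_hz, target_hz - current_hz))
        new_hz = current_hz + step
        return max(self.freq_bounds[0], min(self.freq_bounds[1], new_hz))

    def clamp_temperature(self, current_c, target_c):
        """Limit the temperature change per command and keep it inside the safe range."""
        step = max(-self.max_temp_step_c, min(self.max_temp_step_c, target_c - current_c))
        new_c = current_c + step
        return max(self.temp_bounds[0], min(self.temp_bounds[1], new_c))
```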

Further, the method also includes: setting the hyperparameters of the reinforcement learning, including the learning rate, the discount factor, the initial random coefficient epsilon, and the decay of the random coefficient, and tuning the agent's reward-value setting according to the actual requirements of target air-conditioning system project 1 to be controlled.

Preferably, tuning the agent's reward-value setting according to the actual requirements of target air-conditioning system project 1 to be controlled specifically includes:

using the system COP as the reward value and standardizing it, where the system energy-efficiency ratio COP is calculated from the current operating parameters that the agent obtains synchronously when issuing a control instruction;

multiplying the standardized system COP by a temperature-difference correction coefficient, as in the following formula:

$R = \mathrm{COP}_{\text{norm}} \cdot f\!\left(T_{cwr} - T_{cws}\right)$ ;

where COP is the system energy-efficiency ratio, Tcws is the supply-water temperature of the cooling-water header, and Tcwr is the return-water temperature of the cooling-water header.

In a second aspect, an embodiment of the present invention further provides an intelligent agent for energy-saving control of general air-conditioning systems based on reinforcement transfer learning, the agent comprising:

a data acquisition module, configured to obtain the operating data of the air-conditioning system and the energy-consumption data of each air-conditioning device at the corresponding timestamps;

a reinforcement-learning module, configured to feed the operating data and the energy-consumption data as inputs into the agent for learning and to set the basic parameters of the agent, the basic parameters including a state set, an action set, the reward function of the environment, and the ε-greedy policy setting;

a transfer-learning module, configured to judge, according to the implementation parameters of target air-conditioning system project 1 to be controlled, whether a Q table of a project 2 suitable for migration can be found in the existing training library, and if so, to normalize the state space and the action space so that a correspondence is established between Q tables with different state and action spaces, and to process the normalized Q table to complete the migration, the implementation parameters including but not limited to the air-conditioning system load, equipment types, equipment quantities, and connection modes;

an instruction-issuing module, configured to set the threshold and step-size parameters of the action set of the migrated agent and to issue control instructions to the target air-conditioning system.

In a third aspect, an embodiment of the present invention further provides an intelligent device comprising a processor and a memory, the memory being used to store a computer program, the computer program comprising program instructions, and the processor being configured to call the program instructions to execute the method described above.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program, the computer program comprising program instructions which, when executed by a processor, cause the processor to execute the method described above.

Compared with the prior art, the energy-saving control method for general air-conditioning systems based on reinforcement transfer learning provided by the embodiments of the present invention has at least the following beneficial effects:

the embodiments of the present invention eliminate the cumbersome equipment-modelling process of existing energy-saving optimization approaches and learn directly from the key operating parameters and energy-consumption parameters of the existing air-conditioning system, achieving a high level of energy-saving control without relying on numerous sensor measurements or load forecasting; there is no need for the extensive parameter tuning and model optimization required by modelling, no need to spend on equipment such as high-performance GPUs or TPUs, and the time cost of model design, data processing, and algorithm implementation is reduced.

Brief Description of the Drawings

The preferred embodiments are described below in a clear and readily understandable manner with reference to the accompanying drawings, to further explain the above characteristics, technical features, advantages, and implementations of the present invention.

FIG. 1 is a flow chart of an energy-saving control method for an air-conditioning system based on reinforcement transfer learning according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the structure of the air-conditioning system in an energy-saving control method for an air-conditioning system based on reinforcement transfer learning according to an embodiment of the invention;

FIG. 3 is a schematic diagram of transfer learning in energy-saving control of an air-conditioning system based on reinforcement transfer learning according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the normalization of the migrated Q table in energy-saving control of an air-conditioning system based on reinforcement transfer learning according to an embodiment of the present invention;

FIG. 5 is a chiller performance curve for energy-saving control of an air-conditioning system based on reinforcement transfer learning according to an embodiment of the present invention;

FIG. 6 is a state-distribution diagram of the cooling-tower fan frequency and the cooling-pump frequency according to an embodiment of the present invention;

FIG. 7 is a comparison diagram of cooling-water pump frequency operation according to an embodiment of the present invention.

Detailed Description

To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, specific embodiments of the present invention are described below with reference to the accompanying drawings. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art could derive other drawings, and other embodiments, from them without creative effort.

To keep the drawings concise, each figure only schematically shows the parts relevant to the invention; they do not represent the actual structure of the product. In addition, to keep the drawings concise and easy to understand, where several components in a figure have the same structure or function, only one of them is schematically drawn or labelled. In this document, "one" does not only mean "exactly one"; it can also mean "more than one".

The implementation of the technical solution of the present invention is described in detail below, mainly by way of some specific embodiments.

Through years of research and development the inventors found that, to improve the flexibility of automatic energy-saving control, the control method must be freed from its dependence on equipment models.

Reinforcement learning is a machine-learning method that learns by interacting with the environment and collecting reward values. Compared with model-based optimization, reinforcement learning has low hardware requirements, adapts quickly to environmental changes, and responds quickly. In air-conditioning system control, equipment performance changes with use, and a control strategy based on the equipment's initial parameters may become invalid. Using reinforcement learning for optimized control can reduce the negative impact of such performance drift on the control effect.

However, a well-performing reinforcement-learning agent needs to interact with the environment for a long time before reaching an acceptable state, and many industrial applications do not allow an untrained agent with no optimization capability to learn freely on the real system.

Through further study the inventors found that, on the cooling side of an air-conditioning system, the equipment in different control systems is largely similar, typically comprising chillers, cooling-water pumps, and cooling towers; the main differences lie in the way the equipment is connected and in the power and performance curves. On the chilled-water side, projects can be classified by the type of chiller and the type of chilled-water pump.

In actual operation, the instructions issued to the control system are also similar, including the number of units of each type to run, the variable-frequency set-points of the cooling-tower fans and the pumps, the outlet-water temperature of the chillers, and so on. A reinforcement-learning agent that has finished learning on one air-conditioning system therefore has the potential to be migrated to the control of other air-conditioning systems; that is, through transfer learning, the knowledge or patterns learned in one domain or task are applied to a different but related domain or problem, which addresses the time cost of training agents with reinforcement learning. For example, an agent trained on an air-conditioning system of a given structure can, after the load-related parameters are scaled down by a certain ratio, be applied to a system with the same structure whose load capacity is reduced by the same ratio.

In practice, however, the differences between systems can be more complex, and the agent cannot adapt to the new environment immediately after migration; nevertheless, the learning time needed for the agent to reach a usable standard is reduced. In this way the drawback that reinforcement learning needs time to train is compensated, allowing reinforcement learning to be applied more widely.

To solve the problems that model-based optimization is difficult, costly, and slow to deploy in real engineering projects, because training an energy-saving optimization control model for a building air-conditioning system is expensive, special systems are hard to model, and model building places high demands on the accuracy and number of sensors in the real system, the embodiments of the present invention use reinforcement learning to train an initial agent on a large amount of historical operating data from air-conditioning systems in different building scenarios, and use transfer learning so that the agent can quickly achieve energy-saving optimization in a new project.

The agent's overall optimization logic targets the global energy consumption of the entire air-conditioning system. For example, when the wet-bulb temperature is high, appropriately raising the chiller outlet-water temperature can reduce the excessive workload of the cooling tower and thereby reduce energy consumption; that is, when the cooling tower is operating at low efficiency, raising the outlet-water temperature reduces the dependence on the tower, lightens its load, and avoids the increase in system energy consumption caused by the drop in tower efficiency. Likewise, when the wet-bulb temperature is low, lowering the chiller outlet-water temperature while the cooling effect improves also helps to save energy. These two energy-saving paths are for reference only; the specific strategy must be decided by the agent according to the actual conditions of the project.

As shown in FIG. 1, an energy-saving control method for a general air-conditioning system based on reinforcement transfer learning includes:

S1. The agent obtains the operating data of the air-conditioning system and the energy-consumption data of each air-conditioning device at the corresponding timestamps.

S2. The operating data and the energy-consumption data are fed as inputs into the agent for learning, and the basic parameters of the agent are set, the basic parameters including the environment state set, the action set, the reward function of the environment, and the ε-greedy policy setting.

S3. According to the implementation parameters of target air-conditioning system project 1 to be controlled, the agent judges whether a Q table of a project 2 suitable for migration can be found in the existing training library; if so, the state space and the action space are normalized so that a correspondence is established between Q tables with different state and action spaces, and the normalized Q table is processed to complete the migration, the implementation parameters including but not limited to the air-conditioning system load, equipment types, equipment quantities, and connection modes; if not, the agent is configured directly according to the reinforcement-learning requirements.

S4. The threshold and step-size parameters of the action set of the migrated agent are set, and the agent issues control instructions to the target air-conditioning system.

In the embodiments of the present invention, data acquisition relies on the intelligent sensors already present in the building's air-conditioning system to obtain the operating data of the air-conditioning system and the energy-consumption data of each air-conditioning device at the corresponding timestamps:

The operating data of the air-conditioning system include pipeline operating parameters and equipment operating parameters. The pipeline operating parameters include the supply and return water temperatures of the chilled-water and cooling-water headers, etc.; the equipment operating parameters include the cooling-tower inlet and outlet temperatures, the chiller chilled-water and cooling-water inlet and outlet temperatures, the chiller current percentage (load ratio), the chilled-water and cooling-water pump frequencies, the cooling-tower fan frequency, etc.

The key operating and control parameters collected by the sensors include at least: the supply and return water temperatures of the chilled-water header, the flow rate of the chilled-water header, the cooling-tower inlet and outlet temperatures with their timestamps, the outdoor meteorological parameters (air temperature and humidity), and the ambient temperatures at the air-conditioning terminals.

In addition, intelligent data acquisition from the building's existing air-conditioning system is required. Depending on the equipment, besides the operating status of each device, key operating data must also be collected, including: the chiller evaporator inlet and outlet water temperatures, condenser inlet and outlet water temperatures, and load ratio; the cooling-tower fan frequency and cooling-water flow; and the pump frequency, flow rate, head, and so on. The nameplate parameters of each air-conditioning device must also be provided.

In this embodiment of the invention, the following real-time readings of the air-conditioning system also need to be fed to the agent so that it can carry out the energy-saving optimization control task (a data-structure sketch follows this list), including:

the outdoor temperature (°C) and outdoor humidity;

the system load (kW) and the operating status of each air-conditioning device;

the key parameters of each air-conditioning device (such as the inlet and outlet water temperatures and load ratio of the chiller evaporator and condenser, the cooling-tower fan frequency, the pump frequencies, etc.);

the energy-consumption parameters of each device (active power and cumulative electricity consumption).
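As a data-structure sketch, one timestamped reading passed to the agent could be represented as follows; the field names, types, and units are illustrative assumptions based on the list above, not definitions from the patent.

```python
# Sketch: one real-time reading of the air-conditioning system fed to the agent.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class SystemReading:
    timestamp: str                      # ISO-8601 timestamp of the sample
    outdoor_temp_c: float               # outdoor air temperature (°C)
    outdoor_rh: float                   # outdoor relative humidity (0-1)
    system_load_kw: float               # instantaneous cooling load (kW)
    device_status: Dict[str, bool] = field(default_factory=dict)      # on/off status per device
    device_params: Dict[str, float] = field(default_factory=dict)     # e.g. chiller in/out water temps, pump/fan frequencies
    device_power_kw: Dict[str, float] = field(default_factory=dict)   # active power per device
    device_energy_kwh: Dict[str, float] = field(default_factory=dict) # cumulative electricity per device
```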

In the embodiments of the present invention, each air-conditioning device in the existing building must also have corresponding sub-metered electricity data, i.e. the collected active power and cumulative electricity consumption of each device; apart from the chillers, at least one electricity meter per equipment type is required, and each chiller must have its own meter.

In the embodiments of the present invention, the data-acquisition protocol can support general protocols such as BACnet, Modbus-TCP, and OPC-UA, and also supports acquisition through a platform or the cloud via an HTTP-based API.
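The following sketch illustrates one of the acquisition paths mentioned above, periodic polling over an HTTP API; the endpoint URL and the JSON layout are assumptions, and BACnet, Modbus-TCP, or OPC-UA acquisition would use the corresponding protocol client instead.

```python
# Sketch: timestamped data collection via a hypothetical HTTP API.
import time
import requests

API_URL = "http://edge-server.local/api/hvac/readings"   # hypothetical endpoint

def poll_readings(interval_s=60):
    """Yield one reading per polling interval; the payload layout is assumed."""
    while True:
        resp = requests.get(API_URL, timeout=10)
        resp.raise_for_status()
        sample = resp.json()   # expected: timestamp, operating data, per-device energy data
        yield sample
        time.sleep(interval_s)
```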

For example, an existing project has the air-conditioning system shown in FIG. 2 of the specification, consisting of 2 centrifugal chillers of the same size, 3 chilled-water pumps of the same size, 3 cooling-water pumps of the same size, and 2 cooling towers of the same size; the cooling-tower fans and all pumps have variable-frequency drives, and the system is a fully parallel configuration. The current operating strategy is a fixed outlet-water temperature of 8 °C, cooling-tower fans and pumps running at a fixed 45 Hz, and chillers staged up or down according to the load.

The chiller parameters are: cooling-water supply/return temperature 37/30 °C, rated cooling-water flow 820 m³/h, chilled-water supply/return temperature 7/12 °C, rated chilled-water flow 980 m³/h, rated COP 5.5, rated cooling capacity 5651 kW, rated power 1027.5 kW.

The cooling-tower parameters are: rated inlet/outlet temperature 37/30 °C, rated cooling-water flow 520 m³/h, rated air flow 400000 m³/h, rated fan frequency 50 Hz, rated fan power 20 kW.

The cooling-water pump parameters are: rated flow 800 m³/h, rated frequency 50 Hz, rated power 35 kW, rated head 20 m.

The chilled-water pump parameters are: rated flow 800 m³/h, rated frequency 50 Hz, rated power 35 kW, rated head 20 m.
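For reference, the nameplate parameters listed above can be collected into a single structure that the agent and the transfer-learning step can consult; the dictionary layout and key names are an illustrative choice, while the numeric values are those stated for the example project.

```python
# Sketch: nameplate parameters of the example project collected in one structure.
NAMEPLATE = {
    "chiller": {
        "count": 2, "rated_cop": 5.5, "rated_cooling_kw": 5651, "rated_power_kw": 1027.5,
        "cooling_water_temp_c": (37, 30), "rated_cooling_flow_m3h": 820,
        "chilled_water_temp_c": (7, 12), "rated_chilled_flow_m3h": 980,
    },
    "cooling_tower": {
        "count": 2, "inlet_outlet_temp_c": (37, 30), "rated_water_flow_m3h": 520,
        "rated_air_flow_m3h": 400000, "fan_rated_hz": 50, "fan_rated_kw": 20,
    },
    "cooling_pump": {"count": 3, "rated_flow_m3h": 800, "rated_hz": 50, "rated_kw": 35, "rated_head_m": 20},
    "chilled_pump": {"count": 3, "rated_flow_m3h": 800, "rated_hz": 50, "rated_kw": 35, "rated_head_m": 20},
}
```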

In step S1, an edge server loaded with the algorithm is installed and connected to the project's existing control system via the BACnet protocol, from which the operating data of the air-conditioning system and the energy-consumption data of each air-conditioning device can be obtained.

In step S2, the parameters related to reinforcement learning are set; the environment state can be composed of two dimensions, the load carried by the system and the wet-bulb temperature.

For a complex air-conditioning system with multiple devices, the embodiment of the present invention assigns a separate agent to each action. For a chiller group, for example, modifying or setting the outlet-water temperature and staging chillers up or down requires two agents that learn separately, and the state space of each is the two-dimensional space formed by the load and the outdoor meteorological parameters.

The maximum wet-bulb temperature is set to 32 °C and the maximum load ratio to 1. At each time step a random process generates a pair of random factors (denoted k₁ and k₂ here) falling in the intervals [0.75, 1] and [0.3, 1] respectively; each factor is multiplied by the corresponding parameter maximum, and the results are taken as the state-parameter values for that time step.
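A minimal sketch of this state sampling is shown below; which interval is paired with which parameter is not stated explicitly in the text, so the pairing here, like the variable names, is an assumption.

```python
# Sketch: random generation of the (wet-bulb temperature, load ratio) state per time step.
import random

T_WB_MAX = 32.0   # maximum wet-bulb temperature (°C), from the text
LOAD_MAX = 1.0    # maximum load ratio, from the text

def sample_state():
    k1 = random.uniform(0.75, 1.0)   # assumed to scale the wet-bulb temperature
    k2 = random.uniform(0.3, 1.0)    # assumed to scale the load ratio
    return k1 * T_WB_MAX, k2 * LOAD_MAX
```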

The Q-value update rule is set as follows:

$Q(s,a) \leftarrow Q(s,a) + \alpha \bigl[\, R + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\bigr]$ ;

where the reward value R is the COP of the air-conditioning system; it may also be an energy-efficiency indicator such as EER or PUE, and the reward may be one of these values or a combination of several of them.

In reinforcement learning, the agent organizes all Q values into a Q table, which stores the Q value Q(s, a) of each state-action pair.
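A minimal tabular Q-learning sketch consistent with the update rule above is given below, assuming discretised state and action indices; the table shape and the values of α and γ are placeholders, not values prescribed by the text.

```python
# Sketch: Q table and one Q-learning update step.
import numpy as np

n_states, n_actions = 64, 16
Q = np.zeros((n_states, n_actions))   # all Q values start at 0

def update_q(s, a, reward, s_next, alpha=0.1, gamma=0.1):
    """One Q-learning step; the reward is the measured system efficiency (e.g. COP)."""
    td_target = reward + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```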

Depending on the building function or scenario in which the air-conditioning system operates and on the type or model of the chillers, the reward value can be set to an indicator such as EER or PUE instead of COP. The indicators are calculated as follows:

$\mathrm{COP} = Q_{0,\text{shaft}} / W$ ;

$\mathrm{EER} = Q_{0,\text{motor}} / W$ ;

$\mathrm{PUE} = E_{\text{total}} / E_{\text{IT}}$ ;

In the above formulas, COP and EER are both performance parameters of the air-conditioning system: $Q_{0,\text{shaft}}$ is the cooling capacity referred to the shaft work of the chiller, $Q_{0,\text{motor}}$ is the cooling capacity referred to the input power of the chiller motor, and W is the total energy consumption of all the equipment of the air-conditioning system. For the chiller, the motor input power is not fully converted into shaft work; part of it is lost to friction and other effects. The EER indicator is commonly used to express the cooling capacity per unit of motor input power for hermetic compressors, and the choice of indicator must take the chiller type and model into account. PUE is an indicator that specifically reflects data-centre energy consumption, where $E_{\text{total}}$ is the energy consumption of the entire data centre and $E_{\text{IT}}$ is the total energy consumption of the IT equipment in the data centre.
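As simple helper functions, the three indicators could be computed as follows; the function names and argument choices are illustrative.

```python
# Sketch: energy-efficiency indicators usable as reward signals.
def system_cop(cooling_kw, total_power_kw):
    """System COP: cooling output over total electrical input of all AC equipment."""
    return cooling_kw / total_power_kw

def system_eer(cooling_kw, motor_input_kw):
    """EER: cooling output over compressor-motor input power (hermetic compressors)."""
    return cooling_kw / motor_input_kw

def pue(total_energy_kwh, it_energy_kwh):
    """PUE: total data-centre energy over IT-equipment energy."""
    return total_energy_kwh / it_energy_kwh
```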

At the start of learning, all Q values in the Q table are 0.

If the agent's policy were simply to choose the action with the highest Q value in the current state, then as soon as the Q values start to be updated, the Q value of the first action taken would become positive, that action would become the one with the highest Q value, and the agent would immediately keep pursuing it without trying other actions.

The ε-greedy policy is therefore used: when the agent is in state s, it chooses the action that maximizes Q(s, a) with probability 1-ε and chooses a random action with probability ε. Here ε is set to 0.1.
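A minimal sketch of the ε-greedy selection described above, reusing a Q table like the one in the earlier sketch, with ε = 0.1 as stated and everything else illustrative:

```python
# Sketch: ε-greedy action selection over a tabular Q function.
import random
import numpy as np

def select_action(Q, s, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(Q.shape[1])   # explore: random action
    return int(np.argmax(Q[s]))               # exploit: best known action
```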

As shown in FIG. 3, the agent mainly uses the MDP (Markov Decision Process) formulation to evaluate the influence of state-action pairs on the reward value; the MDP relation is:

$Q_{\pi}(s,a) = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) \sum_{a'} \pi(a' \mid s')\, Q_{\pi}(s',a')$ ; (1)

In formula (1), S denotes the set of environment states collected by the agent, A denotes the agent's action set, R denotes the reward function of the environment, P denotes the state-transition probability of the environment, and γ denotes the discount factor; $r(s,a)$ denotes the immediate reward obtained by taking action a in state s, $p(s' \mid s,a)$ denotes the probability of taking action a in state s and transitioning to the next state s', and $\pi(a \mid s)$ denotes the probability of selecting action a in state s.

In formula (1), $Q_{\pi}(s,a)$ denotes the cumulative expected reward of the state-action pair $(s,a)$ under policy π; the action-value function is usually used to evaluate how good the policy π is.

In the embodiments of the present invention, obtaining the optimal policy $\pi^{*}$ that maximizes the cumulative reward, i.e. the policy that gives the highest overall system energy-efficiency ratio COP, is the ultimate goal of the reinforcement learning, and formula (1) can be rewritten as:

$Q_{*}(s,a) = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, \max_{a'} Q_{*}(s',a')$ ; (2)

In formula (2), $Q_{*}(s,a)$ represents the expected reward of the policy $\pi^{*}$ that the reinforcement learning finds, under the current state, to make the air-conditioning system operate most efficiently; here the immediate reward function $r(s,a)$ is set to the instantaneous system energy-efficiency indicator of the air-conditioning system when action a is taken at operating state point s; the operating state point s of the air-conditioning system is formed jointly by the outdoor meteorological parameters and the current load demand of the system; action a is the instruction executed by the air-conditioning system after the agent issues the policy.

The reward function is an energy-efficiency indicator of the air-conditioning system, the indicator including but not limited to one or more of COP, EER, and PUE.

Taking into account the influence of the learning rate and the discount factor on the overall learning effect, the Q-value update rule is given by formula (3):

$Q(s,a) \leftarrow Q(s,a) + \alpha \bigl[\, R + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\bigr]$ ; (3)

In formula (3), α is the learning rate, which determines how much is learned from the new Q value when the Q value is updated: a learning rate of 0 means the Q value is not updated at all, and a learning rate of 1 means the old Q value is completely replaced by the new one.

R is the reward value, i.e. the reward the agent obtains from the environment for this action; in the embodiments of the present invention the reward value is always the COP of the air-conditioning system. In general, training aims to maximize the reward obtained, so the reward function is usually defined in a form strongly tied to the goal to be achieved.

γ is the discount factor, and $\max_{a'} Q(s',a')$ is the maximum Q value obtainable in the next state after this action. Their product represents the importance of possible future rewards to the learning: a discount factor of 0 means long-term rewards have no influence, and 1 means long-term rewards matter more. In reinforcement learning, the agent organizes all Q values into a Q table, which stores the Q value Q(s, a) of each state-action pair.

While the agent explores the Q table, the ε-greedy policy is introduced to balance exploration and exploitation: the agent selects a random action with probability ε (exploration), or selects the action currently considered best with probability 1-ε (exploitation).

For a complex system such as HVAC, to keep the system stable and avoid phenomena harmful to the equipment such as chiller surge, exploration should not be too aggressive, and ε is generally set within the interval (0.05, 0.35).

Further, the method also includes:

setting the hyperparameters of the reinforcement learning, including the learning rate, the discount factor, the initial random coefficient epsilon, and the decay of the random coefficient, and tuning the agent's reward-value setting according to the actual requirements of target air-conditioning system project 1 to be controlled.

So that the Q-learning agent can improve its energy-saving control optimization more quickly, before the technical solution of the invention is formally applied to a building air-conditioning system the hyperparameters need to be set in advance according to the characteristics of the HVAC system. The hyperparameters to be set here are the learning rate learning_rate, the discount factor (γ in formula (3)), the initial random coefficient epsilon, and the random-coefficient decay decay.

In the embodiments of the present invention, the agent is already able to control the air-conditioning system normally thanks to its earlier training. To ensure that the agent's learning does not affect the comfort and stability of the indoor environment temperature, the learning rate should not be set too high and can be placed in the interval (0.1, 0.2). The discount factor represents the value of future rewards to the current state; given the non-stationary nature of the building environment, the current state has little influence on future rewards, so the discount factor can be placed in the interval (0.05, 0.1). The initial random coefficient epsilon represents the agent's initial exploration rate; when learning on the source system, the initial random coefficient is taken as 1 to encourage exploration.

When learning on the migrated system, to make better use of the transferred knowledge, the initial random coefficient is taken in the interval (0.4, 0.6) to satisfy different degrees of migration. The random-coefficient decay decay represents how quickly the random coefficient decreases as learning proceeds; once the random coefficient decays to 0 the agent no longer explores and the learning process stops, so to prolong the learning process the decay should be no higher than 0.01.

In step S3, suppose that the agent's historical training library contains a system whose equipment models are all the same as those in the air-conditioning system scenario to be implemented in this embodiment, but which has four chillers, seven cooling towers, four cooling-water pumps, and four chilled-water pumps, connected with chillers and pumps first in series and then in parallel. After the reinforcement-learning parameters are set, the agent finds from the operating data that, under part-load conditions, the number of devices running in the library system matches that of the system in this embodiment, so transfer learning can be performed.

In the transfer of reinforcement learning, the core idea is to apply what the agent learned in project 1 to project 2. During the transfer-learning processing, in order to establish a correspondence between Q tables with different state and action spaces, the state and action ranges of both must first be normalized and compressed into the interval [0, 1]. The reference values used for normalization are specified according to the characteristics of the specific case. Linear normalization is used here, as given by formula (4):

$x^{*} = \dfrac{x - \min}{\max - \min}$ ; (4)

In formula (4), x is the data to be normalized, and min and max are respectively the minimum and maximum values in that data set, taken as the reference values for the normalization.

As shown in FIG. 4, in the embodiment of the present invention the value of each state (s) and each action (a) is divided by the corresponding reference value ($s_{\text{base}}$ and $a_{\text{base}}$, the notation used here), scaling them into the interval [0, 1], as given by formulas (5) and (6):

$\tilde{s}_1 = s_1 / s_{\text{base}}, \quad \tilde{a}_1 = a_1 / a_{\text{base}}$ ; (5)

$\tilde{s}_2 = s_2 / s_{\text{base}}, \quad \tilde{a}_2 = a_2 / a_{\text{base}}$ ; (6)

式(5)和(6)中,Q1是项目1中的空调系统各状态点下的动作,Q2是项目2中的空调系统各状态点下的动作。In equations (5) and (6), Q1 is the action of the air conditioning system at each state point in project 1, and Q2 is the action of the air conditioning system at each state point in project 2.

如图4所示，在归一化之前，Q1和Q2两个表的数据之间是无法相互映射的；由于Q表的行与列是离散的，因此在归一化后，各行与各列依然无法一一对应，新Q表中的某一个Q值就需要通过原Q表中索引值与其相近的若干Q值加权得到，具体公式如下所示：As shown in FIG. 4, before normalization the data in tables Q1 and Q2 cannot be mapped onto each other. Because the rows and columns of a Q table are discrete, the rows and columns still do not correspond one-to-one even after normalization, so a given Q value in the new Q table has to be obtained as a weighted combination of the Q values in the original table whose index values are close to it. The specific formula is as follows:

Q_new(i*, j*) = (1/C) · Σ_m Σ_n α^|i* − s_m*| · β^|j* − a_n*| · Q_old(m, n) ; (7)

式(7)中，s_m*为原Q表归一化后的行索引的值，此处将状态空间对应到了行索引上；a_n*为原Q表归一化后的列索引的值，此处将动作空间对应到了列索引上；i*和j*为新Q表中待求Q值的归一化行、列索引；α和β为参数，其值∈(0,1)，将其作为底数，构造以索引之间的距离绝对值为指数的函数，当距离为0或接近0时，函数值为1或接近1，当距离增加时，函数值快速降低，降低速度与参数的取值相关。In formula (7), s_m* is the normalized row index of the original Q table (the state space is mapped onto the row index) and a_n* is the normalized column index of the original Q table (the action space is mapped onto the column index); i* and j* are the normalized row and column indices of the entry being computed in the new Q table; α and β are parameters whose values lie in (0,1) and are used as bases of functions whose exponents are the absolute distances between indices: when the distance is 0 or close to 0 the function value is 1 or close to 1, and as the distance grows the function value drops quickly, at a rate that depends on the parameter value.

乘积α^|i* − s_m*|·β^|j* − a_n*|构成了原Q表中该点的Q值对所求新Q值的贡献权重；C为归一化系数，用于将所有权重之和缩放为1。归一化系数的计算公式如下式所示：The product α^|i* − s_m*| · β^|j* − a_n*| constitutes the contribution weight of that point's Q value in the original table to the new Q value being computed; C is the normalization coefficient used to scale the sum of all weights to 1. The normalization coefficient is computed as follows:

C = Σ_m Σ_n α^|i* − s_m*| · β^|j* − a_n*| ; (8)
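Under the reconstruction of formulas (7) and (8) given above, the distance-weighted mapping between the two normalized Q tables could be sketched as follows in Python; the default values of α and β and the array names are illustrative assumptions.

```python
import numpy as np

def transfer_q_table(q_old, s_old, a_old, s_new, a_new, alpha=0.1, beta=0.1):
    """Build a Q table on the new normalized grid (s_new x a_new) from q_old,
    weighting each old entry by alpha**|row distance| * beta**|column distance|."""
    q_new = np.zeros((len(s_new), len(a_new)))
    for i, s in enumerate(s_new):
        for j, a in enumerate(a_new):
            w = np.outer(alpha ** np.abs(s - s_old), beta ** np.abs(a - a_old))
            q_new[i, j] = np.sum(w * q_old) / np.sum(w)   # formulas (7) and (8)
    return q_new

# s_old/a_old and s_new/a_new are the normalized axes produced by formulas (5) and (6).
```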

计算完成后,还需要考虑原任务和新任务奖励值的大小差距问题。如果原任务的奖励值较大,则其训练完成后的Q表中的Q值也较大。将其迁移到新任务中后,新任务较小的奖励值与迁移来的Q值进行对比会导致智能体的选择出现偏差。因此在对应能效指标不同的系统进行迁移时,需要对整个Q表的Q值按照奖励值的大小进行缩放。After the calculation is completed, the difference in the size of the reward value between the original task and the new task needs to be considered. If the reward value of the original task is large, the Q value in the Q table after training is also large. After migrating it to the new task, the smaller reward value of the new task is compared with the migrated Q value, which will cause deviations in the selection of the intelligent agent. Therefore, when migrating systems with different energy efficiency indicators, the Q value of the entire Q table needs to be scaled according to the size of the reward value.
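As a sketch of the reward-scale adjustment just described (the simple ratio used here is an assumption; the embodiment only states that the Q values are scaled according to the magnitude of the reward values):

```python
def rescale_q(q_new, reward_scale_old, reward_scale_new):
    """Rescale transferred Q values so they are commensurate with the new task's rewards.
    The two scales could be, for example, typical reward magnitudes observed in each task."""
    return q_new * (reward_scale_new / reward_scale_old)
```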

迁移完成后,对项目中的强化学习Agent进行设置,如下表所示:After the migration is complete, set up the reinforcement learning agent in the project as shown in the following table:

上表中，所列的四个界限值分别为负荷与湿球温度的上、下限。In the above table, the four listed bound values are the upper and lower limits of the load and of the wet-bulb temperature, respectively.

完成迁移后,设置部署在该项目强化学习的超参数。学习率设置为0.1、折扣系数设置为0.1、初始随机系数设置为0.4,随机系数衰减量设置为0.0001。将本发明所述方法正式应用于该项目。After the migration is completed, the hyperparameters of the reinforcement learning deployed in the project are set. The learning rate is set to 0.1, the discount factor is set to 0.1, the initial random coefficient is set to 0.4, and the random coefficient attenuation is set to 0.0001. The method described in the present invention is formally applied to the project.

进一步地,对完成迁移的智能体的动作集合的阈值和步长参数进行设置具体包括:Furthermore, setting the threshold and step size parameters of the action set of the agent that completes the migration specifically includes:

控制的时间间隔不得低于10min/次;The control time interval shall not be less than 10 minutes per time;

控制步长上，频率的调整步长不得高于2Hz/次；As for the control step size, the frequency adjustment step shall not exceed 2 Hz per adjustment;

温度的调整间隔不得高于1℃/次;The temperature adjustment interval shall not exceed 1℃/time;

冷机出水温度、水泵频率、冷却塔风机频率的安全阈值参考整个空调系统的设计参数及对应设备的设计参数。The safety thresholds of chiller outlet water temperature, water pump frequency, and cooling tower fan frequency refer to the design parameters of the entire air-conditioning system and the design parameters of the corresponding equipment.

除此之外,使用超过5年的设备还需考虑其性能衰减。In addition, performance degradation must be considered for equipment that has been used for more than 5 years.

为应对空调系统运行特点，满足复杂系统在非稳态工况下的节能优化控制，需对智能体各超参数进行设定。强化学习Q-learning要求动作空间与状态空间均离散，因此，在设置智能体状态空间与动作空间时，需要将原本连续的动作空间进行离散处理。上述输入的调节参数，如时间戳、外界温湿度、各设备的关键参数（如温度、频率等）均为连续值。在本发明实施例中，考虑到空调系统的稳定性，控制时间间隔不得低于10min/次；在控制步长上，频率的调整步长不得高于2Hz/次，温度的调整步长不得高于1℃/次。To cope with the operating characteristics of the air-conditioning system and to achieve energy-saving optimal control of a complex system under non-steady-state conditions, the hyperparameters of the agent are set accordingly. Q-learning requires both the action space and the state space to be discrete, so when setting up the agent's state and action spaces the originally continuous action space needs to be discretized. The input adjustment parameters mentioned above, such as the timestamp, the outdoor temperature and humidity, and the key parameters of each device (such as temperatures and frequencies), are all continuous values. In the embodiment of the present invention, considering the stability of the air-conditioning system, the control time interval shall not be less than 10 min per action; as for the control step sizes, the frequency adjustment step shall not exceed 2 Hz per adjustment and the temperature adjustment step shall not exceed 1 °C per adjustment.
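A minimal Python sketch of this discretization and of limiting each control move to the stated step sizes and safety thresholds; the specific grids and bounds are assumptions made for illustration.

```python
import numpy as np

# Assumed discrete action grids respecting the step-size limits above.
PUMP_FREQ_GRID = np.arange(30.0, 50.0 + 1e-9, 2.0)   # Hz, step <= 2 Hz
CHW_TEMP_GRID  = np.arange(7.0, 12.0 + 1e-9, 1.0)    # degrees C, step <= 1 degree C

def to_grid(value, grid):
    """Snap a continuous measurement onto the nearest discrete grid point."""
    return grid[int(np.argmin(np.abs(grid - value)))]

def clip_action(current, proposed, max_step, low, high):
    """Limit one control move to the allowed step size and safety thresholds."""
    stepped = current + np.clip(proposed - current, -max_step, max_step)
    return float(np.clip(stepped, low, high))

# Example: pump frequency currently 40 Hz, agent proposes 45 Hz -> limited to 42 Hz.
new_freq = clip_action(40.0, 45.0, max_step=2.0, low=30.0, high=50.0)
```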

进一步地,在该发明设计方案正式应用于建筑空调系统之前,在具体的实施条件下,本发明实施例需要根据现有空调运行情况对智能体的奖励值设定进行调试,Q-learning的学习目标是使系统COP(系统能效比)最大,因此,系统COP应作为奖励主体。为了使奖励更平滑,应对系统COP进行标准化。Furthermore, before the invention design is formally applied to the building air conditioning system, under specific implementation conditions, the embodiment of the invention needs to debug the reward value setting of the intelligent agent according to the existing air conditioning operation conditions. The learning goal of Q-learning is to maximize the system COP (system energy efficiency ratio), so the system COP should be used as the reward subject. In order to make the reward smoother, the system COP should be standardized.

另外,由于模型计算时假设换热是充分的,为了防止智能体因片面降低冷却水流量而导致冷却水供回水温差过高,最终奖励还需乘上温差修正系数,具体公式如下所示:In addition, since the model calculation assumes that the heat exchange is sufficient, in order to prevent the intelligent agent from unilaterally reducing the cooling water flow rate, which will cause the cooling water supply and return water temperature difference to be too high, the final reward needs to be multiplied by the temperature difference correction coefficient. The specific formula is as follows:

r = COP_std · k(Tcwr − Tcws) ; (9)

式(9)中，COP为系统能效比，COP_std为其标准化后的值，k(·)为温差修正系数，Tcws为冷却水总管供水温度，Tcwr为冷却水总管回水温度。In formula (9), COP is the system energy efficiency ratio and COP_std is its standardized value, k(·) is the temperature-difference correction coefficient, Tcws is the supply-water temperature of the cooling-water header, and Tcwr is the return-water temperature of the cooling-water header.
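One possible way to compute such a reward is sketched below in Python. The COP standardization bounds and the shape of the temperature-difference correction k(·) are assumptions; the embodiment only states that the standardized COP is multiplied by a correction coefficient that penalizes an excessive cooling-water temperature difference.

```python
def reward(cop, t_cwr, t_cws, cop_min=3.0, cop_max=7.0, dt_design=5.0):
    """Standardized-COP reward with an assumed temperature-difference penalty."""
    cop_std = (cop - cop_min) / (cop_max - cop_min)   # standardize COP to roughly [0, 1]
    dt = t_cwr - t_cws                                # cooling-water temperature rise
    correction = dt_design / dt if dt > dt_design else 1.0   # penalize excessive rise
    return cop_std * correction
```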

当智能体向空调系统下发控制指令后,智能体会同步获取当前时间戳下的运行参数,从而计算出当前系统能效比。When the intelligent agent sends a control instruction to the air-conditioning system, it will synchronously obtain the operating parameters at the current timestamp to calculate the energy efficiency ratio of the current system.

进一步的,所述方法还包括,所述智能体向目标空调系统下发控制指令后,同步获取当前时间戳的运行参数,并计算出当前的系统能效比,所述系统能效比COP作为奖励值,被智能体视为是否存入Q表的判断依据。Furthermore, the method also includes that after the intelligent agent sends a control instruction to the target air-conditioning system, it synchronously obtains the operating parameters of the current timestamp and calculates the current system energy efficiency ratio. The system energy efficiency ratio COP is used as a reward value and is regarded by the intelligent agent as a basis for determining whether to store it in the Q table.

系统能效比作为奖励值，被Agent视为是否存入Q表的判断依据。当Q表的数据量超出一定限额后，Agent将根据奖励值高低对整个Q表进行优化：若此时出现一个奖励值更高的Q值，而Q表内数据已达到限额，则Q表内奖励值最低的Q值会被Agent从Q表中移除，再将该更高的Q值加入，供Agent学习。The system energy efficiency ratio serves as the reward value and is used by the Agent as the basis for deciding whether an entry is stored in the Q table. Once the Q table exceeds a certain size limit, the Agent optimizes the whole table according to the reward values: if a Q value with a higher reward appears while the table is already full, the Q value with the lowest reward is removed from the table and the higher-reward entry is added for the Agent to learn from.
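A small Python sketch of such a bounded, reward-ranked store; the container layout and the size limit are illustrative assumptions.

```python
import heapq
import itertools

class BoundedQStore:
    """Keep at most `limit` entries, evicting the one with the lowest reward."""
    def __init__(self, limit=10000):
        self.limit = limit
        self.heap = []                      # min-heap keyed on reward
        self._tie = itertools.count()       # tie-breaker so entries never compare directly

    def maybe_store(self, reward, state, action, q_value):
        entry = (reward, next(self._tie), state, action, q_value)
        if len(self.heap) < self.limit:
            heapq.heappush(self.heap, entry)
        elif reward > self.heap[0][0]:      # better than the current worst entry
            heapq.heapreplace(self.heap, entry)
```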

经过大量实践数据的积累，本发明实施例对作为算法运行载体的服务器性能也提出了要求，算法在其上运行需满足以下条件：设备静态点位配置能力≥4000个、数据保存周期≥900天、指标关联分析能力≥100条/秒、对温感和空调系统数据的采集频率≥1次/分钟、对电表数据的采集频率≥1次/分钟。Based on a large amount of field data, the embodiment of the present invention also places requirements on the performance of the server that hosts the algorithm. To run the algorithm, the server must satisfy the following: static device point configuration capacity ≥ 4000 points, data retention period ≥ 900 days, indicator correlation analysis capability ≥ 100 items per second, acquisition frequency of temperature-sensor and air-conditioning system data ≥ 1 reading per minute, and acquisition frequency of electricity-meter data ≥ 1 reading per minute.

在现有系统中，冷机在额定水温下于60%部分负荷工况达到最高COP，因此，冷机开启台数按下式计算：In the existing system, the chiller reaches its highest COP at 60% part load under rated water temperature. The number of chillers to switch on is therefore calculated as follows:

N = int( Q / (0.6 × Q0) ) ;

其中，N为冷机开启台数；Q为系统所承担的总冷负荷，kW；Q0为单台冷机的额定冷负荷，kW；int代表向下取整。由于该冷机的COP-plr曲线在极值点左侧更加陡峭，如附图5所示，因此，使用向下取整而非四舍五入计算冷机台数可以使工况点都落在极值点右侧，从而使平均COP更大。该公式是按项目实际需求设置的，在其他项目中可能会用到另外几种确定冷机开启台数的算法。Here, N is the number of chillers to switch on; Q is the total cooling load borne by the system, in kW; Q0 is the rated cooling load of a single chiller, in kW; and int denotes rounding down (floor). Because the COP-plr curve of this chiller is steeper on the left side of the extreme point, as shown in Figure 5, calculating the number of chillers with rounding down rather than rounding to the nearest integer keeps the operating points on the right side of the extreme point and thus raises the average COP. This formula is set according to the actual needs of the project; other projects may use other algorithms for determining the number of chillers to switch on.
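Under the reconstruction of the staging formula above, the rule can be written as a short Python helper; the variable names and the guard for very low loads are assumptions.

```python
import math

def chillers_to_run(total_load_kw, rated_load_kw, best_plr=0.6):
    """Floor-based staging so each running chiller stays at or above its best part-load ratio."""
    n = math.floor(total_load_kw / (best_plr * rated_load_kw))
    return max(1, n)   # assumed guard: keep at least one chiller when there is any load
```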

此时,随着系统负荷的增加,当负荷尚未增加到需要新开冷机时,最优冷却塔风机频率随着负荷的升高而升高;当新开一台冷机时,最优冷却塔风机频率忽然降低,随后延续之前的规律。湿球温度对最优冷却塔风机频率的影响则不明显。最优冷却水泵频率整体上随着系统负荷的升高而降低,但在湿球温度较高或较低时,都存在高频率与低频率交替出现的区域,如附图6所示。At this time, as the system load increases, when the load has not yet increased to the point where a new chiller needs to be opened, the optimal cooling tower fan frequency increases with the increase in load; when a new chiller is opened, the optimal cooling tower fan frequency suddenly decreases, and then continues the previous pattern. The effect of wet-bulb temperature on the optimal cooling tower fan frequency is not obvious. The optimal cooling water pump frequency decreases as the system load increases overall, but when the wet-bulb temperature is high or low, there are areas where high and low frequencies alternate, as shown in Figure 6.

可以看出,最优冷却塔风机频率随着每台冷机承担的负荷增加而增加,但在每台冷机承担的负荷较高时同时出现了高频与低频的区域。这是由于该区域同时包含了冷机台数达到上限后系统负荷继续升高的工况点和冷机台数发生突变的工况点。最优冷却泵频率随着湿球温度和每台冷机承担的负荷的升高而降低,如附图6所示。It can be seen that the optimal cooling tower fan frequency increases with the increase of the load on each chiller, but when the load on each chiller is high, high-frequency and low-frequency areas appear at the same time. This is because this area includes both the operating point where the system load continues to increase after the number of chillers reaches the upper limit and the operating point where the number of chillers changes suddenly. The optimal cooling pump frequency decreases with the increase of wet-bulb temperature and the load on each chiller, as shown in Figure 6.

冷却水泵与冷却塔在不同状态点下不再是定频运行，而是处于不同的工作状态。这里以冷却水泵的运行频率为例进行展示，如附图7所示：7月2日（0702）是本发明未部署时冷却水泵的运行状态，定频40Hz运行，其中CwPump_#1、CwPump_#2分别是1号冷却水泵和2号冷却水泵，Freq_FB是频率反馈值；7月3日是本发明部署后冷却水泵的运行状态，Agent更多地从整体能耗的角度出发调整冷却水泵的频率，以实现总体能耗最低。The cooling-water pumps and cooling towers no longer run at fixed frequency at the various state points, but operate in different working states. The operating frequency of the cooling-water pumps is taken here as an example, as shown in Figure 7. July 2 (0702) shows the pumps before the present invention was deployed, running at a fixed 40 Hz, where CwPump_#1 and CwPump_#2 are cooling-water pumps No. 1 and No. 2 and Freq_FB is the frequency feedback value; July 3 shows the pumps after deployment, where the Agent adjusts the pump frequencies mainly from the perspective of overall energy consumption so as to minimize the total energy use.

经过测试,本发明部署后的一个月,空调系统的COP从去年的4.73提升到了今年5.31,且该月今年与去年的室外平均湿球温度差值仅为0.18℃,可以忽略不计。该系统能效提升率达到12.13%,本发明成功帮助该建筑实现空调系统的节能优化运行。After testing, one month after the deployment of the present invention, the COP of the air conditioning system increased from 4.73 last year to 5.31 this year, and the difference in the average outdoor wet-bulb temperature between this year and last year was only 0.18°C, which is negligible. The energy efficiency improvement rate of the system reached 12.13%, and the present invention successfully helped the building achieve energy-saving and optimized operation of the air conditioning system.

本发明实施例还提供了一种基于强化迁移学习的通用空调系统节能控制的智能体,包括:The embodiment of the present invention further provides an intelligent agent for energy-saving control of a general air-conditioning system based on reinforcement transfer learning, comprising:

数据采集模块,用于获取对应时间戳下的空调系统的运行数据及各空调设备的能耗数据;The data acquisition module is used to obtain the operation data of the air-conditioning system and the energy consumption data of each air-conditioning device at the corresponding timestamp;

强化学习模块，用于将所述运行数据和所述能耗数据作为输入量，输入至智能体内进行学习，并对智能体的基本参数进行设置，所述基本参数包括状态集合、动作集合、环境的奖励函数、ε-greedy策略设定值；A reinforcement learning module, used to input the operation data and the energy consumption data into the agent for learning, and to set the basic parameters of the agent, wherein the basic parameters include a state set, an action set, a reward function of the environment, and an ε-greedy strategy setting value;

迁移学习模块,用于根据待控制的目标空调系统项目1的实施参数,判断能否从现有训练库找出用于迁移的项目2的Q表,如有,则对状态空间与动作空间进行归一化以使不同状态与动作空间之间的Q表之间建立起关联,并对归一化的Q表进行处理,完成迁移,其中,所述实施参数包括但不限于空调系统负荷、设备类型、设备数量和连接方式;A transfer learning module is used to determine whether a Q table for project 2 to be transferred can be found from an existing training library according to the implementation parameters of the target air-conditioning system project 1 to be controlled. If so, the state space and the action space are normalized to establish a correlation between the Q tables of different state and action spaces, and the normalized Q table is processed to complete the transfer, wherein the implementation parameters include but are not limited to the air-conditioning system load, equipment type, equipment quantity and connection mode;

指令下发模块,用于对完成迁移的智能体的动作集合的阈值和步长参数进行限制,向目标空调系统下发控制指令。The instruction issuing module is used to limit the threshold and step parameters of the action set of the intelligent agent that has completed the migration, and issue control instructions to the target air-conditioning system.
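A minimal Python sketch, under assumed interfaces, of how these four modules might be composed into one agent; the class, method and attribute names are illustrative and not prescribed by the embodiment.

```python
import random

class EnergySavingAgent:
    """Composes data acquisition, reinforcement learning, transfer learning and command dispatch."""

    def __init__(self, data_source, plant, q_table, actions, epsilon=0.5):
        self.data_source = data_source   # data acquisition module (reads operating/energy data)
        self.plant = plant               # instruction dispatch target (accepts control commands)
        self.q_table = q_table           # dict: state -> list of Q values, possibly transferred
        self.actions = actions           # discrete action list (e.g. pump frequency set-points)
        self.epsilon = epsilon

    def choose_action(self, state):
        """Epsilon-greedy selection over the Q-table row for this state."""
        row = self.q_table.get(state, [0.0] * len(self.actions))
        if random.random() < self.epsilon:
            return random.randrange(len(self.actions))
        return max(range(len(self.actions)), key=row.__getitem__)

    def step(self):
        state = self.data_source.read_state()    # current timestamp's operating data (assumed API)
        idx = self.choose_action(state)
        self.plant.send(self.actions[idx])        # dispatch the control instruction (assumed API)
```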

本发明实施例还提供了一种智能设备,包括处理器和存储器,所述存储器用于存储计算机程序,所述计算机程序包括程序指令,所述处理器被配置用于调用所述程序指令,执行前述的方法。An embodiment of the present invention further provides an intelligent device, including a processor and a memory, wherein the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the aforementioned method.

本发明实施例还提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序包括程序指令,所述程序指令当被处理器执行时使所述处理器执行如前所述的方法。An embodiment of the present invention further provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, wherein the computer program includes program instructions, and when the program instructions are executed by a processor, the processor executes the method as described above.

本发明实施例与现有的空调系统优化控制方法或现有的空调系统节能优化运维方法相比具有以下技术特点:Compared with the existing air conditioning system optimization control method or the existing air conditioning system energy-saving optimization operation and maintenance method, the embodiment of the present invention has the following technical features:

本发明实施例省去了现有节能优化手段繁琐的设备建模过程,直接对现有空调系统的关键运行参数及能耗参数进行学习,不依赖众多的传感器数据以及负荷预测就可以达到较高的节能水平;无需像建模那样进行大量的参数调整和模型优化工作,无需花费高性能GPU或TPU等的设备成本,减少进行模型设计、数据处理、算法实现的时间成本。The embodiment of the present invention eliminates the cumbersome equipment modeling process of existing energy-saving optimization means, directly learns the key operating parameters and energy consumption parameters of the existing air-conditioning system, and can achieve a higher energy-saving level without relying on a large number of sensor data and load forecasts; there is no need to perform a large amount of parameter adjustment and model optimization work like modeling, and there is no need to spend equipment costs such as high-performance GPUs or TPUs, thereby reducing the time cost of model design, data processing, and algorithm implementation.

本发明实施例比起现有节能控制方案具有更高的可实现性和灵活程度。在暖通空调学科中，虽然基于传热学、工程热力学、流体力学等基础学科的发展，其机理公式是科学且经过实验验证的，但在实际工程中，仍有大量关于直觉、经验、认知等无法用形式化方法表示的特性，这意味着仅依赖传统基于相似性原理及形式化知识的建模仿真方法，难以真正表达复杂系统的深层次规律。Compared with existing energy-saving control schemes, the embodiments of the present invention offer higher feasibility and flexibility. In the HVAC field, although the mechanism formulas built on fundamental disciplines such as heat transfer, engineering thermodynamics and fluid mechanics are scientific and experimentally verified, actual engineering still involves a large amount of intuition, experience and cognition that cannot be expressed with formal methods. This means that modeling and simulation methods based purely on similarity principles and formalized knowledge can hardly capture the deep-level regularities of a complex system.

而本发明实施例通过强化学习技术对实际运行数据进行深入分析，以识别和量化那些难以用传统方法捕捉的非线性动态特性和运行模式。同时，通过前期大量项目数据的训练，Agent会自发地为学习状态设立基于暖通空调机理的预设，类似于内置了一个简易的机理模型。这种无模型优化的策略不仅保证了控制效果的准确性和可解释性，还通过数据驱动的方法增强了模型的适应性和预测能力。同时，强化学习具有较高的自适应能力，能够根据环境的变化和反馈信号优化决策策略；强化学习内置的智能体可以根据当前的环境参数与系统负荷，通过评价函数计算得出最佳的系统运行状态，从而进行实时且最优化的参数调整。The embodiment of the present invention, by contrast, uses reinforcement learning to analyze actual operating data in depth, identifying and quantifying nonlinear dynamic characteristics and operating patterns that are difficult to capture with traditional methods. At the same time, through training on a large amount of project data in the early stage, the Agent spontaneously forms priors for the learning states based on HVAC mechanisms, which is similar to having a simple mechanism model built in. This model-free optimization strategy not only ensures the accuracy and interpretability of the control results, but also enhances adaptability and predictive ability through a data-driven approach. Reinforcement learning is also highly adaptive and can optimize its decision strategy according to environmental changes and feedback signals: based on the current environmental parameters and the system load, the built-in agent computes the optimal system operating state through the evaluation function and performs real-time, optimized parameter adjustment.

另外,本发明实施例比起现有的PID优化控制方法,具有更高的节能上限和运行健康度水平。现在被广泛使用的基于PID的优化控制方法,大多依赖于事先构建好的规则库和模糊推理机制,根据输入变量的模糊值计算出对应的输出变量的模糊值,并将其转化为具体的控制命令。这种优化并没有对PID控制的逻辑基础进行改良,只是内置了大量的专家经验。它将系统负荷分划了不同的扇区,控制系统会根据系统运行时负荷所处的区间,依靠专家经验对系统的设定值进行固定的变化,并不会根据实时的负荷数据进行计算,从而改变设定值。这种方法在设计阶段就需要手动编写规则库,对于复杂、非线性的控制问题,模糊规则难以穷尽,并且只能依靠提前写好的专家经验来调整暖通空调系统参数设定。In addition, the embodiments of the present invention have a higher energy-saving upper limit and operating health level than the existing PID optimization control method. Most of the PID-based optimization control methods that are widely used now rely on a pre-built rule base and fuzzy reasoning mechanism to calculate the fuzzy value of the corresponding output variable based on the fuzzy value of the input variable and convert it into a specific control command. This optimization does not improve the logical basis of PID control, but only builds in a large amount of expert experience. It divides the system load into different sectors. The control system will make fixed changes to the system's set value based on the load interval when the system is running, relying on expert experience, and will not calculate based on real-time load data to change the set value. This method requires manual writing of the rule base in the design stage. For complex and nonlinear control problems, fuzzy rules are difficult to exhaust, and the HVAC system parameter settings can only be adjusted based on pre-written expert experience.

而本发明实施例依靠对系统不间断运行的训练,可以处理各种复杂的系统动态,因为它能够从实时运行的大量的数据中学习到最佳策略。在遇到极端天气或系统设备故障时,Agent可以根据运行情况做出改变,避免设备出现严重损坏,且在系统允许的情况下不影响正常生产活动。同时,设备的性能波动会从运行的数据中表现出来并被Agent捕捉,从而最大程度减轻设备性能衰减对整体节能优化的影响。The embodiment of the present invention can handle various complex system dynamics by training the system to run continuously, because it can learn the best strategy from a large amount of data running in real time. When encountering extreme weather or system equipment failure, the Agent can make changes according to the operating conditions to avoid serious damage to the equipment and not affect normal production activities if the system allows. At the same time, the performance fluctuations of the equipment will be reflected in the operating data and captured by the Agent, thereby minimizing the impact of equipment performance degradation on overall energy-saving optimization.

需要说明的是,在本文中将强化学习与迁移学习分开进行描述,仅是为了考虑阅读者的体验,能够正确领会本发明的内容,不代表在本发明中,这两种算法完全独立运行的。事实上,这两种算法会根据项目实际情况或分别运行,或一起运行,或在一个架构下分模块化运行,这几种情况本质上仍是本发明内容中所描述的方法。此外,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should be noted that the description of reinforcement learning and transfer learning separately in this article is only for the reader's experience and to correctly understand the content of the present invention. It does not mean that in the present invention, the two algorithms run completely independently. In fact, the two algorithms will run separately, together, or modularly under an architecture according to the actual situation of the project. These situations are essentially still the methods described in the content of the present invention. In addition, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that the process, method, article or device including a series of elements includes not only those elements, but also other elements that are not explicitly listed, or also includes elements inherent to such process, method, article or device. In the absence of more restrictions, the elements defined by the sentence "including one..." do not exclude the existence of other identical elements in the process, method, article or device including the elements.

应当注意的是，在本文的实施方式中所揭露的装置、单元和方法，也可以通过其他的方式实现。以上所描述的装置实施方式仅仅是示意性的，例如，附图中的流程图和框图显示了根据本文的多个实施方式的装置、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上，流程图或框图中的每个方框可以代表一个模块、程序段或代码的一部分，所述模块、程序段或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意，在有些作为替换的实现方式中，方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如，两个连续的方框实际上可以基本并行地执行，它们有时也可以按相反的顺序执行，这依所涉及的功能而定。也要注意的是，框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合，可以用执行规定的功能或动作的专用的基于硬件的系统来实现，或者可以用专用硬件与计算机指令的组合来实现。It should be noted that the devices, units and methods disclosed in the embodiments herein may also be implemented in other ways. The device embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings show the possible architectures, functions and operations of devices, methods and computer program products according to multiple embodiments herein. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations the functions marked in the blocks may occur in an order different from that marked in the drawings; for example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should further be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks therein, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本申请所必须的。It should be noted that, for the aforementioned method embodiments, for the sake of simplicity, they are all expressed as a series of action combinations, but those skilled in the art should be aware that the present application is not limited by the described order of actions, because according to the present application, certain steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also be aware that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.

在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。In the above embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail in a certain embodiment, reference can be made to the relevant descriptions of other embodiments.

在本申请所提供的几个实施例中,应该理解到,所揭露的装置,可通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性或其它的形式。In the several embodiments provided in the present application, it should be understood that the disclosed device can be implemented in other ways. For example, the device embodiments described above are only schematic, such as the division of the units, which is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed. Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be through some interfaces, and the indirect coupling or communication connection of the device or unit can be electrical or other forms.

所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

另外，在本申请各个实施例中的各功能单元可以集成在一个处理单元中，也可以是各个单元单独物理存在，也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现，也可以采用软件程序模块的形式实现。In addition, the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software program module.

所述集成的单元如果以软件程序模块的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储器中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储器中,包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储器包括:U盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application is essentially or the part that contributes to the prior art or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a memory, including a number of instructions to enable a computer device (which can be a personal computer, server or network device, etc.) to perform all or part of the steps of the method described in each embodiment of the present application. The aforementioned memory includes: U disk, read-only memory (ROM), random access memory (RAM), mobile hard disk, disk or optical disk and other media that can store program codes.

本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成,该程序可以存储于一计算机可读存储器中,存储器可以包括:闪存盘、只读存储器、随机存取器、磁盘或光盘等。A person of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by instructing related hardware through a program, and the program can be stored in a computer-readable memory, which can include: a flash drive, a read-only memory, a random access memory, a magnetic disk or an optical disk, etc.

以上对本申请实施例进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。The embodiments of the present application are introduced in detail above. Specific examples are used in this article to illustrate the principles and implementation methods of the present application. The description of the above embodiments is only used to help understand the method and core idea of the present application. At the same time, for general technical personnel in this field, according to the idea of the present application, there will be changes in the specific implementation method and application scope. In summary, the content of this specification should not be understood as a limitation on the present application.

应当说明的是,上述实施例均可根据需要自由组合。以上仅是本发明的优选实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本发明原理的前提下,还可以作出若干改进和润饰,这些改进和润饰也应视为本发明的保护范围。It should be noted that the above embodiments can be freely combined as needed. The above are only preferred embodiments of the present invention. It should be pointed out that for ordinary technicians in this technical field, several improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as the protection scope of the present invention.

Claims (8)

1.一种基于强化迁移学习的通用空调系统节能控制方法，其特征在于，所述方法包括：1. A general air-conditioning system energy-saving control method based on reinforcement transfer learning, characterized in that the method comprises:

智能体获取对应时间戳下的空调系统的运行数据及各空调设备的能耗数据；the agent obtaining the operating data of the air-conditioning system and the energy consumption data of each air-conditioning device at the corresponding timestamp;

将所述运行数据和所述能耗数据作为输入量，输入至智能体内进行学习，并对智能体的基本参数进行设置，所述基本参数包括状态集合、动作集合、环境的奖励函数、ε-greedy策略设定值，其中，所述智能体的基本参数满足下式：inputting the operating data and the energy consumption data into the agent for learning, and setting the basic parameters of the agent, the basic parameters including a state set, an action set, a reward function of the environment and an ε-greedy strategy setting value, wherein the basic parameters of the agent satisfy the following formula:

Q(s, a) = r + γ · Σ_{s'} p(s'|s, a) · max_{a'} Q(s', a') ;

其中，S代表智能体的状态集合，所述状态集合包括系统瞬时冷负荷与室外气象参数；A表示智能体的动作集合，所述动作集合包括空调设备的运行状态以及其当前动作设定参数；r表示环境的奖励函数，所述奖励函数为空调系统能效指标，所述系统能效指标包括但不限于COP、EER或PUE中的一个或多个；p代表环境的状态转移概率，γ表示折扣因子；max_{a'} Q(s', a')为下一个状态下能获得的最大Q值；wherein S represents the state set of the agent, the state set including the instantaneous cooling load of the system and the outdoor meteorological parameters; A represents the action set of the agent, the action set including the operating states of the air-conditioning equipment and their current action setting parameters; r represents the reward function of the environment, the reward function being an energy efficiency index of the air-conditioning system, the system energy efficiency index including but not limited to one or more of COP, EER or PUE; p represents the state transition probability of the environment, and γ represents the discount factor; max_{a'} Q(s', a') is the maximum Q value obtainable in the next state;

根据待控制的目标空调系统项目1的实施参数，智能体判断能否从现有训练库找出用于迁移的项目2的Q表，如有，则对状态空间与动作空间进行归一化以使不同状态与动作空间之间的Q表之间建立起关联，并对归一化的Q表进行处理，完成迁移，其中，所述实施参数包括但不限于空调系统负荷、设备类型、设备数量和连接方式，如无，则直接根据强化学习需求对智能体进行设置；according to the implementation parameters of the target air-conditioning system project 1 to be controlled, the agent determining whether a Q table of project 2 usable for transfer can be found in the existing training library; if so, normalizing the state space and the action space so that a correspondence is established between the Q tables of different state and action spaces, and processing the normalized Q table to complete the transfer, wherein the implementation parameters include but are not limited to the air-conditioning system load, the equipment types, the equipment quantities and the connection mode; if not, setting the agent directly according to the reinforcement learning requirements;

其中，所述对状态空间与动作空间进行归一化以使不同状态与动作空间之间的Q表之间建立起关联，并对归一化的Q表进行处理具体包括：wherein normalizing the state space and the action space so that a correspondence is established between the Q tables of different state and action spaces, and processing the normalized Q table, specifically comprises:

归一化公式如下所示：the normalization formula being:

x* = (x − min) / (max − min) ;

其中，x为需要进行归一化的对象数据，min和max分别为所述对象数据中的最小值和最大值，它们作为归一化处理时的基准值；将每个状态点s和动作点a的值除以相应的基准值s_base和a_base，并缩放到[0,1]的区间内，如下式所示：wherein x is the data to be normalized, min and max are respectively the minimum and maximum values of the data and serve as the base values for the normalization; the value of each state point s and each action point a is divided by the corresponding base value s_base or a_base and scaled into the interval [0,1], as shown below:

s1* = s1 / s1_base ,  s2* = s2 / s2_base ;
a1* = a1 / a1_base ,  a2* = a2 / a2_base ;

其中，Q1是项目1中空调系统在各状态点下的动作价值（Q）表，Q2是项目2中空调系统在各状态点下的动作价值（Q）表；wherein Q1 is the state-action (Q) table of the air-conditioning system in project 1 and Q2 is that of project 2;

在归一化后，新Q表中某一个Q值通过原Q表中与其索引值相近的Q值加权得到，具体公式如下所示：after normalization, a Q value in the new Q table is obtained by weighting the Q values in the original Q table whose index values are close to it, according to:

Q_new(i*, j*) = (1/C) · Σ_m Σ_n α^|i* − s_m*| · β^|j* − a_n*| · Q_old(m, n) ;

其中，s_m*为原Q表归一化后的行索引的值，即将状态空间对应到了行索引上；a_n*为原Q表归一化后的列索引的值，即将动作空间对应到了列索引上；i*和j*为新Q表中待求Q值的归一化行、列索引；α和β为参数，其值∈(0,1)，被作为底数，构造以索引之间的距离绝对值为指数的函数，当距离为0时，函数值为1，当距离增加时，函数值快速降低，降低速度与参数的取值相关；wherein s_m* is the normalized row index of the original Q table, i.e. the state space is mapped onto the row index; a_n* is the normalized column index of the original Q table, i.e. the action space is mapped onto the column index; i* and j* are the normalized row and column indices of the entry being computed in the new Q table; α and β are parameters whose values lie in (0,1) and are used as bases to construct functions whose exponents are the absolute distances between indices, such that the function value is 1 when the distance is 0 and drops quickly as the distance grows, at a rate depending on the parameter value;

乘积α^|i* − s_m*|·β^|j* − a_n*|构成了该点的Q值对所求的新Q值的贡献权重；C为归一化系数，用于将所有权重之和放缩为1，归一化系数的计算公式如下式所示：the product α^|i* − s_m*| · β^|j* − a_n*| constitutes the contribution weight of that point's Q value to the new Q value being computed; C is the normalization coefficient used to scale the sum of all weights to 1 and is computed as:

C = Σ_m Σ_n α^|i* − s_m*| · β^|j* − a_n*| ;

在对应能效指标不同的系统进行迁移时，对整个Q表的Q值按照奖励值的大小进行缩放；when transferring between systems whose energy efficiency indices differ, scaling the Q values of the whole Q table according to the magnitude of the reward values;

对完成迁移的智能体的动作集合的阈值和步长参数进行设置，所述智能体向目标空调系统下发控制指令。setting the thresholds and step-size parameters of the action set of the agent that has completed the transfer, and the agent issuing control instructions to the target air-conditioning system.

2.如权利要求1所述的基于强化迁移学习的通用空调系统节能控制方法，其特征在于，所述方法还包括：所述智能体向目标空调系统下发控制指令后，同步获取当前时间戳的运行参数，并计算出当前的系统能效指标，所述系统能效指标作为奖励值，被智能体视为是否存入Q表的判断依据，其中，所述系统能效指标包括但不限于COP、EER或PUE中的一个或多个。2. The method according to claim 1, characterized in that the method further comprises: after the agent issues a control instruction to the target air-conditioning system, synchronously obtaining the operating parameters of the current timestamp and calculating the current system energy efficiency index, the system energy efficiency index serving as the reward value and being used by the agent as the basis for deciding whether an entry is stored in the Q table, wherein the system energy efficiency index includes but is not limited to one or more of COP, EER or PUE.

3.如权利要求1所述的基于强化迁移学习的通用空调系统节能控制方法，其特征在于，所述对完成迁移的智能体的动作集合的阈值和步长参数进行设置具体包括：控制的时间间隔不得低于10min/次；控制步长上，频率的调整步长不得高于2Hz/次；温度的调整步长不得高于1℃/次；冷机出水温度、水泵频率、冷却塔风机频率的安全阈值参考整个空调系统的设计参数及对应设备的设计参数。3. The method according to claim 1, characterized in that setting the thresholds and step-size parameters of the action set of the agent that has completed the transfer specifically comprises: the control time interval shall not be less than 10 min per action; the frequency adjustment step shall not exceed 2 Hz per adjustment; the temperature adjustment step shall not exceed 1 °C per adjustment; and the safety thresholds of the chiller outlet water temperature, the pump frequencies and the cooling-tower fan frequencies follow the design parameters of the whole air-conditioning system and of the corresponding equipment.

4.如权利要求1所述的基于强化迁移学习的通用空调系统节能控制方法，其特征在于，所述方法还包括：设置强化学习的超参数，包括学习率、折扣系数、初始随机系数epsilon、随机系数衰减量，并根据待控制的目标空调系统项目的实际需求，对智能体的奖励值设定进行调试。4. The method according to claim 1, characterized in that the method further comprises: setting hyperparameters of the reinforcement learning, including the learning rate, the discount coefficient, the initial random coefficient epsilon and the random coefficient decay amount, and debugging the reward value setting of the agent according to the actual needs of the target air-conditioning system project to be controlled.

5.如权利要求4所述的基于强化迁移学习的通用空调系统节能控制方法，其特征在于，所述根据待控制的目标空调系统项目的实际需求，对智能体的奖励值设定进行调试具体包括：将系统能效比COP作为奖励值，并对系统能效比COP进行标准化，其中，系统能效比COP为智能体在下发控制指令时同步获取当前运行参数，并根据当前运行参数计算得出；将标准化后的系统COP乘上温差修正系数，具体公式如下所示：5. The method according to claim 4, characterized in that debugging the reward value setting of the agent according to the actual needs of the target air-conditioning system project specifically comprises: taking the system energy efficiency ratio COP as the reward value and standardizing it, wherein the COP is calculated from the current operating parameters that the agent obtains synchronously when issuing a control instruction; and multiplying the standardized system COP by a temperature-difference correction coefficient, according to:

r = COP_std · k(Tcwr − Tcws) ;

其中，COP为系统能效比，COP_std为其标准化后的值，k(·)为温差修正系数，Tcws为冷却水总管供水温度，Tcwr为冷却水总管回水温度。wherein COP is the system energy efficiency ratio, COP_std is its standardized value, k(·) is the temperature-difference correction coefficient, Tcws is the supply-water temperature of the cooling-water header, and Tcwr is the return-water temperature of the cooling-water header.

6.一种基于强化迁移学习的通用空调系统节能控制的智能体，其特征在于，所述智能体包括：6. An agent for energy-saving control of a general air-conditioning system based on reinforcement transfer learning, characterized in that the agent comprises:

数据采集模块，用于获取对应时间戳下的空调系统的运行数据及各空调设备的能耗数据；a data acquisition module, configured to obtain the operating data of the air-conditioning system and the energy consumption data of each air-conditioning device at the corresponding timestamp;

强化学习模块，用于将所述运行数据和所述能耗数据作为输入量，输入至智能体内进行学习，并对智能体的基本参数进行设置，所述基本参数包括状态集合、动作集合、环境的奖励函数、ε-greedy策略设定值，其中，所述智能体的基本参数满足下式：a reinforcement learning module, configured to feed the operating data and the energy consumption data into the agent as inputs for learning and to set the basic parameters of the agent, the basic parameters including a state set, an action set, a reward function of the environment and an ε-greedy strategy setting value, wherein the basic parameters of the agent satisfy the following formula:

Q(s, a) = r + γ · Σ_{s'} p(s'|s, a) · max_{a'} Q(s', a') ;

其中，S代表智能体的状态集合，所述状态集合包括系统瞬时冷负荷与室外气象参数；A表示智能体的动作集合，所述动作集合包括空调设备的运行状态以及其当前动作设定参数；r表示环境的奖励函数，所述奖励函数为空调系统能效指标，所述系统能效指标包括但不限于COP、EER或PUE中的一个或多个；p代表环境的状态转移概率，γ表示折扣因子；max_{a'} Q(s', a')为下一个状态下能获得的最大Q值；wherein S represents the state set of the agent, the state set including the instantaneous cooling load of the system and the outdoor meteorological parameters; A represents the action set of the agent, the action set including the operating states of the air-conditioning equipment and their current action setting parameters; r represents the reward function of the environment, the reward function being an energy efficiency index of the air-conditioning system, the system energy efficiency index including but not limited to one or more of COP, EER or PUE; p represents the state transition probability of the environment, and γ represents the discount factor; max_{a'} Q(s', a') is the maximum Q value obtainable in the next state;

迁移学习模块，用于根据待控制的目标空调系统项目1的实施参数，判断能否从现有训练库找出用于迁移的项目2的Q表，如有，则对状态空间与动作空间进行归一化以使不同状态与动作空间之间的Q表之间建立起关联，并对归一化的Q表进行处理，完成迁移，其中，所述实施参数包括但不限于空调系统负荷、设备类型、设备数量和连接方式；a transfer learning module, configured to determine, according to the implementation parameters of the target air-conditioning system project 1 to be controlled, whether a Q table of project 2 usable for transfer can be found in the existing training library, and if so, to normalize the state space and the action space so that a correspondence is established between the Q tables of different state and action spaces, and to process the normalized Q table to complete the transfer, wherein the implementation parameters include but are not limited to the air-conditioning system load, the equipment types, the equipment quantities and the connection mode;

其中，对状态空间与动作空间进行归一化以使不同状态与动作空间之间的Q表之间建立起关联，并对归一化的Q表进行处理具体包括：wherein normalizing the state space and the action space so that a correspondence is established between the Q tables of different state and action spaces, and processing the normalized Q table, specifically comprises:

归一化公式如下所示：the normalization formula being:

x* = (x − min) / (max − min) ;

其中，x为需要进行归一化的对象数据，min和max分别为所述对象数据中的最小值和最大值，它们作为归一化处理时的基准值；将每个状态点s和动作点a的值除以相应的基准值s_base和a_base，并缩放到[0,1]的区间内，如下式所示：wherein x is the data to be normalized, min and max are respectively the minimum and maximum values of the data and serve as the base values for the normalization; the value of each state point s and each action point a is divided by the corresponding base value s_base or a_base and scaled into the interval [0,1], as shown below:

s1* = s1 / s1_base ,  s2* = s2 / s2_base ;
a1* = a1 / a1_base ,  a2* = a2 / a2_base ;

其中，Q1是项目1中空调系统在各状态点下的动作价值（Q）表，Q2是项目2中空调系统在各状态点下的动作价值（Q）表；wherein Q1 is the state-action (Q) table of the air-conditioning system in project 1 and Q2 is that of project 2;

在归一化后，新Q表中某一个Q值通过原Q表中与其索引值相近的Q值加权得到，具体公式如下所示：after normalization, a Q value in the new Q table is obtained by weighting the Q values in the original Q table whose index values are close to it, according to:

Q_new(i*, j*) = (1/C) · Σ_m Σ_n α^|i* − s_m*| · β^|j* − a_n*| · Q_old(m, n) ;

其中，s_m*为原Q表归一化后的行索引的值，即将状态空间对应到了行索引上；a_n*为原Q表归一化后的列索引的值，即将动作空间对应到了列索引上；i*和j*为新Q表中待求Q值的归一化行、列索引；α和β为参数，其值∈(0,1)，被作为底数，构造以索引之间的距离绝对值为指数的函数，当距离为0时，函数值为1，当距离增加时，函数值快速降低，降低速度与参数的取值相关；wherein s_m* is the normalized row index of the original Q table, i.e. the state space is mapped onto the row index; a_n* is the normalized column index of the original Q table, i.e. the action space is mapped onto the column index; i* and j* are the normalized row and column indices of the entry being computed in the new Q table; α and β are parameters whose values lie in (0,1) and are used as bases to construct functions whose exponents are the absolute distances between indices, such that the function value is 1 when the distance is 0 and drops quickly as the distance grows, at a rate depending on the parameter value;

乘积α^|i* − s_m*|·β^|j* − a_n*|构成了该点的Q值对所求的新Q值的贡献权重；C为归一化系数，用于将所有权重之和放缩为1，归一化系数的计算公式如下式所示：the product α^|i* − s_m*| · β^|j* − a_n*| constitutes the contribution weight of that point's Q value to the new Q value being computed; C is the normalization coefficient used to scale the sum of all weights to 1 and is computed as:

C = Σ_m Σ_n α^|i* − s_m*| · β^|j* − a_n*| ;

在对应能效指标不同的系统进行迁移时，对整个Q表的Q值按照奖励值的大小进行缩放；when transferring between systems whose energy efficiency indices differ, the Q values of the whole Q table are scaled according to the magnitude of the reward values;

指令下发模块，用于对完成迁移的智能体的动作集合的阈值和步长参数进行限制，向目标空调系统下发控制指令。an instruction issuing module, configured to limit the thresholds and step-size parameters of the action set of the agent that has completed the transfer and to issue control instructions to the target air-conditioning system.

7.一种智能设备，其特征在于，包括处理器和存储器，所述存储器用于存储计算机程序，所述计算机程序包括程序指令，所述处理器被配置用于调用所述程序指令，执行如权利要求1~5任一项所述的方法。7. An intelligent device, characterized by comprising a processor and a memory, the memory being used to store a computer program comprising program instructions, the processor being configured to call the program instructions to execute the method according to any one of claims 1 to 5.

8.一种计算机可读存储介质，其特征在于，所述计算机可读存储介质存储有计算机程序，所述计算机程序包括程序指令，所述程序指令当被处理器执行时使所述处理器执行如权利要求1~5任一项所述的方法。8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to execute the method according to any one of claims 1 to 5.
CN202411296074.4A 2024-09-18 2024-09-18 Energy-saving control method and intelligent agent for general air-conditioning system based on reinforcement transfer learning Active CN118816340B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411296074.4A CN118816340B (en) 2024-09-18 2024-09-18 Energy-saving control method and intelligent agent for general air-conditioning system based on reinforcement transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411296074.4A CN118816340B (en) 2024-09-18 2024-09-18 Energy-saving control method and intelligent agent for general air-conditioning system based on reinforcement transfer learning

Publications (2)

Publication Number Publication Date
CN118816340A true CN118816340A (en) 2024-10-22
CN118816340B CN118816340B (en) 2025-01-28

Family

ID=93074879

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411296074.4A Active CN118816340B (en) 2024-09-18 2024-09-18 Energy-saving control method and intelligent agent for general air-conditioning system based on reinforcement transfer learning

Country Status (1)

Country Link
CN (1) CN118816340B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220154960A1 (en) * 2019-10-23 2022-05-19 Mitsubishi Electric Corporation Air-conditioning control device, air-conditioning system, air-conditioning control method, and non-transitory computer readable recording medium
CN110836518A (en) * 2019-11-12 2020-02-25 上海建科建筑节能技术股份有限公司 System basic knowledge based global optimization control method for self-learning air conditioning system
US20210303998A1 (en) * 2020-03-28 2021-09-30 Tata Consultancy Services Limited Multi-chiller scheduling using reinforcement learning with transfer learning for power consumption prediction
CN113551373A (en) * 2021-07-19 2021-10-26 江苏中堃数据技术有限公司 Data center air conditioner energy-saving control method based on federal reinforcement learning
CN114279042A (en) * 2021-12-27 2022-04-05 苏州科技大学 Central air conditioner control method based on multi-agent deep reinforcement learning
CN116306239A (en) * 2023-02-01 2023-06-23 上海碳索能源服务股份有限公司 Air conditioning system cooling tower energy consumption optimization method based on transfer reinforcement learning

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119250161A (en) * 2024-12-05 2025-01-03 西北工业大学 A deep reinforcement learning strategy transfer method based on fuzzy inference tree
CN119333953A (en) * 2024-12-20 2025-01-21 南京深度智控科技有限公司 Optimal control method for cooling system of water-cooled refrigeration station based on reinforcement learning
CN119960529A (en) * 2025-04-09 2025-05-09 泸州市南方过滤设备有限公司 Sensor-based digital intelligent control system for oil filter

Also Published As

Publication number Publication date
CN118816340B (en) 2025-01-28

Similar Documents

Publication Publication Date Title
CN118816340A (en) Energy-saving control method and intelligent agent for general air-conditioning system based on reinforcement transfer learning
WO2023030522A1 (en) Data center air conditioning system diagnosis method and apparatus
CN113326651B (en) Refrigerating station load and energy efficiency ratio dynamic modeling method based on T-S fuzzy model
CN114963414A (en) Air conditioning system intelligent regulation and control device based on AI data analysis
CN118423810B (en) Air conditioner energy consumption optimization method, device, equipment, storage medium and program product
CN116266253A (en) Optimization control method, system and computer readable storage medium for air conditioner parameters
Xie et al. Development of a group control strategy based on multi-step load forecasting and its application in hybrid ground source heat pump
CN114543274A (en) Temperature and humidity optimization control method and system for building central air conditioner
CN118816279A (en) Air source heat pump zero-electricity heating control method, device, equipment and storage medium
CN118765083A (en) A method and system for adaptive control of a refrigeration room
CN114543273B (en) Self-adaptive deep learning optimization energy-saving control algorithm for central air conditioner cooling system
CN111787764A (en) Energy consumption optimization method and device for multi-split refrigerating unit, electronic equipment and storage medium
CN119835919A (en) Control method and system of refrigeration equipment for communication machine room
CN116562111A (en) Data center energy-saving method, device, system and storage medium
CN118935681A (en) Air conditioning system optimization control method and device based on load point table and genetic algorithm
CN118836523A (en) Central air conditioner purifying method and system based on deep reinforcement learning
CN118735033A (en) A two-stage refrigeration room operation strategy prediction and optimization method
CN118259595A (en) Optimization control method of variable air volume air conditioning system based on fuzzy control and model predictive control
CN103528294B (en) A kind of efficiency processing method of refrigeration system and system
CN114781274B (en) Comprehensive energy system control optimization method and system for simulation and decision alternate learning
CN116202190A (en) Ventilation air conditioner refrigeration control method, system, storage medium and computing equipment
CN115877714B (en) Control method and device for refrigerating system, electronic equipment and storage medium
Jiang et al. Intelligent Building Indoor Temperature Control Based on Deep Reinforcement Learning
CN117537450B (en) Method, device, equipment and medium for controlling operation feedback of central air-conditioning refrigeration system
CN117968232B (en) A frequency conversion control system driven by a compressor

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant