
CN116358114B - Air conditioner temperature control method based on deep reinforcement learning - Google Patents

Air conditioner temperature control method based on deep reinforcement learning

Info

Publication number
CN116358114B
CN116358114B (application CN202310519295.2A)
Authority
CN
China
Prior art keywords
value
air conditioner
temperature
network
air conditioning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310519295.2A
Other languages
Chinese (zh)
Other versions
CN116358114A (en)
Inventor
卫祎欢
楼涛
刘睿捷
杨成
王健
吕施霖
于淼
邹凯
薛伟
龚正
谢锡飞
包哲静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Comprehensive Services Branch of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
Zhejiang University ZJU
Comprehensive Services Branch of State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU, Comprehensive Services Branch of State Grid Zhejiang Electric Power Co Ltd filed Critical Zhejiang University ZJU
Priority to CN202310519295.2A priority Critical patent/CN116358114B/en
Publication of CN116358114A publication Critical patent/CN116358114A/en
Application granted granted Critical
Publication of CN116358114B publication Critical patent/CN116358114B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • F: MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F24: HEATING; RANGES; VENTILATING
    • F24F: AIR-CONDITIONING; AIR-HUMIDIFICATION; VENTILATION; USE OF AIR CURRENTS FOR SCREENING
    • F24F11/00: Control or safety arrangements
    • F24F11/30: Control or safety arrangements for purposes related to the operation of the system, e.g. for safety or monitoring
    • F24F11/46: Improving electric energy efficiency or saving
    • F: MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F24: HEATING; RANGES; VENTILATING
    • F24F: AIR-CONDITIONING; AIR-HUMIDIFICATION; VENTILATION; USE OF AIR CURRENTS FOR SCREENING
    • F24F11/00: Control or safety arrangements
    • F24F11/50: Control or safety arrangements characterised by user interfaces or communication
    • F24F11/56: Remote control
    • F: MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F24: HEATING; RANGES; VENTILATING
    • F24F: AIR-CONDITIONING; AIR-HUMIDIFICATION; VENTILATION; USE OF AIR CURRENTS FOR SCREENING
    • F24F11/00: Control or safety arrangements
    • F24F11/50: Control or safety arrangements characterised by user interfaces or communication
    • F24F11/61: Control or safety arrangements characterised by user interfaces or communication using timers
    • F: MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F24: HEATING; RANGES; VENTILATING
    • F24F: AIR-CONDITIONING; AIR-HUMIDIFICATION; VENTILATION; USE OF AIR CURRENTS FOR SCREENING
    • F24F11/00: Control or safety arrangements
    • F24F11/62: Control or safety arrangements characterised by the type of control or by internal processing, e.g. using fuzzy logic, adaptive control or estimation of values
    • F24F11/63: Electronic processing
    • F24F11/64: Electronic processing using pre-stored data
    • F: MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F24: HEATING; RANGES; VENTILATING
    • F24F: AIR-CONDITIONING; AIR-HUMIDIFICATION; VENTILATION; USE OF AIR CURRENTS FOR SCREENING
    • F24F11/00: Control or safety arrangements
    • F24F11/70: Control systems characterised by their outputs; Constructional details thereof
    • F: MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F24: HEATING; RANGES; VENTILATING
    • F24F: AIR-CONDITIONING; AIR-HUMIDIFICATION; VENTILATION; USE OF AIR CURRENTS FOR SCREENING
    • F24F11/00: Control or safety arrangements
    • F24F11/88: Electrical aspects, e.g. circuits
    • F: MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F24: HEATING; RANGES; VENTILATING
    • F24F: AIR-CONDITIONING; AIR-HUMIDIFICATION; VENTILATION; USE OF AIR CURRENTS FOR SCREENING
    • F24F2110/00: Control inputs relating to air properties
    • F24F2110/10: Temperature
    • F: MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F24: HEATING; RANGES; VENTILATING
    • F24F: AIR-CONDITIONING; AIR-HUMIDIFICATION; VENTILATION; USE OF AIR CURRENTS FOR SCREENING
    • F24F2110/00: Control inputs relating to air properties
    • F24F2110/10: Temperature
    • F24F2110/12: Temperature of the outside air
    • F: MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F24: HEATING; RANGES; VENTILATING
    • F24F: AIR-CONDITIONING; AIR-HUMIDIFICATION; VENTILATION; USE OF AIR CURRENTS FOR SCREENING
    • F24F2110/00: Control inputs relating to air properties
    • F24F2110/20: Humidity
    • F: MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F24: HEATING; RANGES; VENTILATING
    • F24F: AIR-CONDITIONING; AIR-HUMIDIFICATION; VENTILATION; USE OF AIR CURRENTS FOR SCREENING
    • F24F2110/00: Control inputs relating to air properties
    • F24F2110/30: Velocity
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02B: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO BUILDINGS, e.g. HOUSING, HOUSE APPLIANCES OR RELATED END-USER APPLICATIONS
    • Y02B30/00: Energy efficient heating, ventilation or air conditioning [HVAC]
    • Y02B30/70: Efficient control or regulation technologies, e.g. for control of refrigerant flow, motor or heating

Landscapes

  • Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Combustion & Propulsion (AREA)
  • Mechanical Engineering (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Air Conditioning Control Device (AREA)

Abstract

The present invention discloses an air-conditioner temperature control method based on deep reinforcement learning. First, an air-conditioning energy-consumption optimization model is established: the objective function for temperature control is set, with the constraints that the air-conditioning temperature satisfies the PMV-PPD human-comfort condition and the air-conditioner operating limits. The setting of the air-conditioning temperature is formulated as a Markov decision process; the state space and action space of the optimization model are determined, and the reward function and state-action value function are derived from the state space, action space and constraints, yielding the optimal policy of the optimization model. A Q-value neural network is trained with the TD3 algorithm, and the trained network is deployed in the air-conditioning control system to adjust the set temperature in real time. Compared with conventional PMV-based comfort control, the method accounts for the whole day's air-conditioning energy cost and achieves more accurate, more energy-efficient temperature control while guaranteeing human comfort.

Description

An air conditioning temperature control method based on deep reinforcement learning

Technical Field

The present invention belongs to the field of air-conditioning temperature control, and in particular relates to an air-conditioning temperature control method based on deep reinforcement learning.

Background Art

In recent years, with the rapid development of technology and rising living standards, air conditioners have entered millions of households, providing comfortable environments and becoming indispensable. However, air conditioning also consumes enormous amounts of energy: more than about 50% of the energy used in residential and office buildings goes to air-conditioning cooling and heating, posing a serious energy-consumption problem.

Many researchers have already turned their attention to air-conditioning energy saving; for example, a "26℃" button is built into air-conditioner remote controls to encourage energy-saving settings. Some manufacturers have gone further and added a PMV button that uses the temperature, humidity and air-speed readings collected by the air-conditioner's sensors to compute a suitable indoor set temperature from the PMV index, automatically adjusting the air-conditioner temperature.

However, most existing algorithms adjust the temperature based on the humidity, temperature and air speed at a single moment and cannot minimize air-conditioning energy consumption over a whole operating period. In fact, air-conditioning energy consumption depends not only on the current set temperature but also on the previous set temperature and the outdoor temperature. A single-moment calculation therefore cannot account for the whole day's air-conditioning energy cost when choosing the best set temperature for the current moment. Moreover, because the PMV index is complex to compute, optimizing air-conditioning energy consumption is a difficult nonlinear problem that traditional optimization algorithms struggle to solve, which motivates data-driven methods such as deep reinforcement learning for temperature control.

Summary of the Invention

Because existing air-conditioning temperature control methods cannot account for minimizing the whole day's air-conditioning energy consumption, the present invention proposes an air-conditioning temperature control method based on deep reinforcement learning, comprising the following steps:

Step 1: Establish an air-conditioning energy-consumption optimization model: set the objective function of the temperature control to minimize the air-conditioning energy consumption over all usage periods, subject to the constraints that the air-conditioning temperature satisfies the PMV-PPD human-comfort condition and the air-conditioner operating limits;

Step 2: Design the reinforcement-learning framework for air-conditioning energy consumption: formulate the setting of the air-conditioning temperature as a Markov decision process, determine the state space and action space of the optimization model, and derive the reward function and the state-action value function from the state space, action space and constraints, thereby obtaining the optimal policy of the optimization model;

Step 3: Train a Q-value neural network with the TD3 algorithm, based on the state-action value function and on historical temperature, humidity and air-speed sensor data and outdoor temperature data;

Step 4: Deploy the trained Q-value neural network in the air-conditioning control system to adjust the set temperature in real time.

Furthermore, step 1 comprises the following steps:

Step 1.1: Establish the objective function of the system, namely that the total energy consumption of the air conditioner over the usage period is minimized:

min W = Σ_{t=1}^{m} P_t · Δt

where W is the user's total energy consumption, the optimization time window contains m time steps, P_t is the air-conditioning power at time t, Δt is the interval between two consecutive control actions, and T_t^{in} and T_t^{out} are the indoor set temperature and the outdoor temperature at time t, respectively. The control drives the indoor temperature to the set temperature; α and β are the heat-transfer parameter and the thermal efficiency, respectively.

Step 1.2: Establish the constraints of the temperature control: the indoor temperature must satisfy the PMV-PPD comfort condition, and the air-conditioning power and indoor temperature are bounded.

PMV is computed as:

PMV = (0.303·e^{-0.036M} + 0.028)·{(M - w) - 3.05×10^{-3}·[5733 - 6.99(M - w) - P_a] - 0.42·[(M - w) - 58.15] - 1.73×10^{-5}·M·(5876 - P_a) - 0.0014·M·(34 - t_a) - 3.96×10^{-8}·f_cl·[(t_cl + 273)^4 - (t_r + 273)^4] - f_cl·h_c·(t_cl - t_a)}

where M is the human metabolic rate in W/m², generally 60 W/m² at work; w is the external mechanical work of the body in W/m², generally 0; t_a is the air temperature in ℃; f_cl is the clothing area factor, taken as 1.1; t_r is the mean radiant temperature of the room in ℃; t_cl is the outer-surface temperature of the clothing in ℃; h_c is the convective heat-transfer coefficient in W/(m²·K); and P_a is the partial pressure of water vapour around the body. t_cl is obtained by solving the implicit relation:

t_cl = 35.7 - 0.028(M - w) - I_cl·{3.96×10^{-8}·f_cl·[(t_cl + 273)^4 - (t_r + 273)^4] + f_cl·h_c·(t_cl - t_a)}

where φ is the relative humidity, v is the indoor air speed in m/s, and I_cl is the thermal resistance of the clothing in m²·K/W; h_c and P_a are computed from v and φ, respectively.

PPD is computed as:

PPD = 100 - 95·exp[-(0.03353·PMV^4 + 0.2179·PMV^2)]

Under comfortable conditions PPD must stay below 10%:

PPD ≤ 10%

The air-conditioning power cannot exceed the rated power:

0 ≤ P_t ≤ P_max

where P_max is the rated power of the air conditioner.

The remote-control set temperature takes discrete values within a fixed range, giving the constraint:

T_min^{in} ≤ T_t^{in} ≤ T_max^{in}

where T_max^{in} and T_min^{in} are the upper and lower bounds of the set temperature.

Furthermore, step 2 comprises the following steps:

Step 2.1: Determine the state space S = [S_1 S_2 ... S_t ... S_m]. In the air-conditioning temperature-control system, the observed variables the agent obtains from the environment are the indoor set temperature T^{in}, the outdoor temperature T^{out}, the indoor humidity φ, the indoor air speed v and the time index t; the state at time t is expressed as:

S_t = [T_t^{in}, T_t^{out}, φ_t, v_t, t]

Step 2.2: Determine the action space A = [A_1 A_2 ... A_t ... A_m]. In this system the agent's action is the indoor set-temperature value, namely:

A = T^{in}

Step 2.3: Set the reward function R_t. The reward represents the immediate return the environment feeds back to the agent when it takes a specified action in a given state; to minimize the energy consumption of the air conditioner over the whole scheduling period, the reward is set to:

R_t = -W_t - ξ

where W_t is the air-conditioning energy consumption at time t and ξ is the penalty factor; the penalty factor is 0 when the constraints are satisfied and a positive constant otherwise.

Step 2.4: Set the state-action function Q^π(S, A), which characterizes the quality of a policy π, i.e. the cumulative return of the reward function under π:

Q^π(S, A) = E_π[ Σ_{k=0}^{∞} γ^k · R_{t+k} | S_t = S, A_t = A ]

where the agent's policy π is the mapping from states S to actions A, and γ is a discount factor taking values in [0, 1].

The optimal policy π* maximizes the state-action function Q^π(S, A), i.e. the cumulative return of the reward function:

π* = arg max_π Q^π(S, A).

Furthermore, the TD3 algorithm comprises Q-value networks and a policy network. The policy network maps states S to actions A, while the Q-value networks quantitatively evaluate the state-action value function. TD3 uses two Q-value networks and one policy network, together with two target Q-value networks and one target policy network. The two Q-value networks reduce overestimation of the state-action value function; they share the same structure but have different parameters. The target policy network and target Q-value networks have the same structures as the policy network and Q-value networks but different parameters, and the target-network parameters are updated only infrequently, to reduce errors during learning. A data set is generated from historical temperature, humidity and indoor air-speed data, and the deep neural networks of the air-conditioning control system are trained on it to obtain the optimal state-action value mapping.

Furthermore, the specific steps of the TD3-based air-conditioning temperature control method are as follows:

Step 3.1: Initialize the Q-value networks, the policy network, the target Q-value networks, the target policy network parameters and the experience replay buffer parameters;

Step 3.2: For each time step of each episode, perform the following:

Step 3.2.1: Obtain the current indoor and outdoor environment state S_t and compute the action A_t with the policy network;

Step 3.2.2: Introduce random noise n into the action to obtain the random action Ã_t = A_t + n;

Step 3.2.3: Execute the random action, obtaining the reward function value R_t and the next state S_{t+1};

Step 3.2.4: Store (S_t, A_t, R_t, S_{t+1}) in the experience replay pool;

Step 3.2.5: Randomly sample a mini-batch of experience from the replay pool;

Step 3.2.6: For each sample, compute the next-moment action A_{t+1} with the target policy network;

Step 3.2.7: Introduce random noise n into the next-moment action to obtain the random action Ã_{t+1} = A_{t+1} + n;

Step 3.2.8: Compute the target Q-value function Q_t^{Target} from the minimum of the two target Q-value networks and the Bellman equation:

Q_t^{Target} = R_t + γ · min_{i=1,2} Q'_i(S_{t+1}, Ã_{t+1})

where γ is a discount factor taking values in [0, 1] and Q'_i is the value of the i-th target Q-value network;

Step 3.2.9: Compute the Q-value network loss function as the mean-squared error between the target Q-value function and the current Q-value function computed by the Q-value networks, and update the Q-value networks according to the gradient of this loss with respect to the Q-value networks; compute the policy-network loss function from the product of the total reward value and the policy probability, and update the policy network according to the gradient of this loss with respect to the policy network; then soft-update the parameters of the target Q-value networks and the target policy network;

Step 3.3: Output the trained Q-value network.

Furthermore, step 4 comprises the following steps:

Deploy the trained neural-network program on the control terminal and set a timer on the terminal. At fixed intervals, use the sensors to obtain the indoor temperature and humidity, the outdoor temperature and the indoor air speed in real time, feed the state variables into the Q-value neural network to obtain the output temperature-control value, and communicate with the air-conditioning system to control its temperature in real time.

The beneficial effects of the present invention are:

The deep-reinforcement-learning temperature control method lets the air-conditioning system continuously optimize and learn from the environment and its feedback, achieving unobtrusive temperature control that is smarter and more efficient than traditional manual control.

The method is built on the PPD human-comfort evaluation index, guaranteeing that the final set temperature meets user needs and keeps users comfortable.

Compared with a traditional PMV-based air-conditioning comfort control system, the method accounts for the whole day's energy cost, making the temperature control more accurate and more energy-efficient.

Brief Description of the Drawings

FIG. 1 is the basic framework of deep reinforcement learning according to an embodiment of the present invention;

FIG. 2 is the model of the air-conditioning temperature control method based on the deep reinforcement learning algorithm according to an embodiment of the present invention.

Detailed Description

To help those of ordinary skill in the art understand and implement this invention, the invention is described in further detail below with reference to the accompanying drawings and embodiments.

It should be understood that the examples described here serve only to illustrate and explain the invention and do not limit it.

The present invention applies deep reinforcement learning to air-conditioning temperature control. It models air-conditioning energy consumption with the PPD human-comfort index, trains a TD3-based Q-value neural network on historical temperature, humidity, air-speed and similar data, and then deploys the trained network on a control terminal. At fixed intervals the terminal reads the environment state in real time, uses the trained network to output the next set temperature from the previous state and the current environment state, and controls the air-conditioner temperature through an infrared module.

Compared with an ordinary PMV-based human-comfort temperature-adjustment algorithm, the deep-reinforcement-learning method adopted by the invention accounts for the total energy consumption over the whole operating period, characterizes the energy consumption more completely, and controls the air conditioner more accurately and efficiently.

The present invention provides an air-conditioning temperature control method based on deep reinforcement learning, applied in the field of indoor air-conditioning temperature control. The procedure is as follows:

Step 1: Establish the air-conditioning energy-consumption optimization model. Set the objective function to minimize the air-conditioning energy consumption over all usage periods, subject to the constraints that the temperature satisfies the PMV-PPD human-comfort condition and the operating limits of the air conditioner. Specifically:

1.1: Establish the objective function of the system, namely that the total energy consumption of the air conditioner over the usage period is minimized:

min W = Σ_{t=1}^{m} P_t · Δt

where W is the user's total energy consumption, P_t is the air-conditioning power at time t, the operating horizon contains m time steps, Δt is the interval between two consecutive control actions, and T_t^{in} and T_t^{out} are the indoor set temperature and the outdoor temperature at time t. The control drives the indoor temperature to the set temperature; α and β are the heat-transfer parameter and the thermal efficiency, which depend on factors such as room size and insulating materials.
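As a concrete illustration, the following Python sketch evaluates this objective for a candidate set-temperature schedule. The specific `power()` model shown (heat gain proportional to the indoor-outdoor temperature difference via α, divided by the thermal efficiency β) and the parameter values are assumptions for illustration only; the text itself defines only the aggregate objective W = Σ P_t Δt.

```python
# Minimal sketch of the energy objective W = sum_t P_t * dt.
# The power model below is an ASSUMED illustrative form; the text only
# specifies that P_t depends on the set and outdoor temperatures through
# the parameters alpha and beta.

ALPHA = 0.12     # heat-transfer parameter (kW/K), assumed value
BETA = 3.0       # thermal efficiency, assumed value
DT_HOURS = 0.25  # 15-minute control interval

def power(t_in: float, t_out: float) -> float:
    """Assumed instantaneous air-conditioning power in kW."""
    return ALPHA * abs(t_in - t_out) / BETA

def total_energy(t_in_schedule: list[float], t_out_series: list[float]) -> float:
    """Objective W = sum over the m control steps of P_t * dt (kWh)."""
    return sum(power(ti, to) * DT_HOURS
               for ti, to in zip(t_in_schedule, t_out_series))

# Example: a constant 26 C setting over four steps of outdoor readings.
print(total_energy([26.0] * 4, [33.5, 34.0, 34.5, 33.0]))
```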

1.2: Establish the constraints of the temperature control: the indoor temperature must satisfy the PMV-PPD comfort condition, and the air-conditioning power and indoor temperature have upper and lower limits.

PMV is the most comprehensive method available to date for evaluating indoor comfort. Starting from the basic equation of human thermal balance and the psychophysiological scale of subjective thermal sensation, it predicts how comfortable people feel by accounting for air temperature, humidity, radiation, air speed, clothing and other factors. It is usually expressed as a value from -3 to +3 and used to guide the control of indoor temperature, humidity, air speed and so on. Lower values mean people feel colder, higher values mean they feel hotter, and a value of 0 means complete comfort. The PMV method is widely used in offices, factories, hospitals, schools, shops and other indoor spaces. PMV is computed as:

PMV = (0.303·e^{-0.036M} + 0.028)·{(M - w) - 3.05×10^{-3}·[5733 - 6.99(M - w) - P_a] - 0.42·[(M - w) - 58.15] - 1.73×10^{-5}·M·(5876 - P_a) - 0.0014·M·(34 - t_a) - 3.96×10^{-8}·f_cl·[(t_cl + 273)^4 - (t_r + 273)^4] - f_cl·h_c·(t_cl - t_a)}

where M is the human metabolic rate in W/m², generally 60 W/m² at work; w is the external mechanical work of the body in W/m², generally 0; t_a is the air temperature, which can be taken as the ambient temperature, in ℃; f_cl is the clothing area factor, taken as 1.1; t_r is the mean radiant temperature of the room in ℃; t_cl is the outer-surface temperature of the clothing in ℃; h_c is the convective heat-transfer coefficient in W/(m²·K); and P_a is the partial pressure of water vapour around the body. t_cl is obtained by solving the implicit relation:

t_cl = 35.7 - 0.028(M - w) - I_cl·{3.96×10^{-8}·f_cl·[(t_cl + 273)^4 - (t_r + 273)^4] + f_cl·h_c·(t_cl - t_a)}

where φ is the relative humidity, v is the indoor air speed in m/s, and I_cl is the thermal resistance of the clothing in m²·K/W; h_c and P_a are computed from v and φ, respectively.

PPD is a supplementary index to PMV and describes human comfort together with it: whereas PMV reflects the average rating of a large group of people, PPD gives the percentage of people who would feel uncomfortable in a given environment. PPD is computed as:

PPD = 100 - 95·exp[-(0.03353·PMV^4 + 0.2179·PMV^2)]

According to the specification "Ventilation and Air Quality for Indoor Environments of Buildings", the PPD under comfortable conditions should stay below 10%:

PPD ≤ 10%

The air-conditioning power cannot exceed the rated power:

0 ≤ P_t ≤ P_max

where P_max is the rated power of the air conditioner.

The remote-control set temperature takes discrete values within a fixed range, giving the constraint:

T_min^{in} ≤ T_t^{in} ≤ T_max^{in}

where T_max^{in} and T_min^{in} are the upper and lower bounds of the set temperature.
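To make the comfort constraint concrete, the sketch below evaluates PMV and PPD for a candidate indoor state, following the formulas above. The fixed-point iteration for t_cl and the expressions used for h_c (the larger of 2.38·|t_cl - t_a|^0.25 and 12.1·√v) and for P_a (relative humidity times an approximate saturation vapour pressure) follow common PMV practice and are assumptions here, since the text does not reproduce those two formulas.

```python
import math

def pmv_ppd(t_a, rh, v, M=60.0, w=0.0, I_cl=0.11, f_cl=1.1, t_r=None):
    """PMV/PPD per the formulas in the text.

    t_a: air temperature (C); rh: relative humidity in [0, 1];
    v: indoor air speed (m/s). The h_c and P_a expressions below follow
    common PMV practice (an assumption; the text omits their formulas).
    """
    if t_r is None:
        t_r = t_a  # assume mean radiant temperature equals air temperature
    # Assumed saturation-pressure approximation (Pa) for water vapour.
    p_s = 610.78 * math.exp(17.27 * t_a / (t_a + 237.3))
    p_a = rh * p_s

    # Solve the implicit t_cl relation by damped fixed-point iteration.
    t_cl = t_a
    for _ in range(100):
        h_c = max(2.38 * abs(t_cl - t_a) ** 0.25, 12.1 * math.sqrt(v))
        rad = 3.96e-8 * f_cl * ((t_cl + 273) ** 4 - (t_r + 273) ** 4)
        t_cl_new = 35.7 - 0.028 * (M - w) - I_cl * (rad + f_cl * h_c * (t_cl - t_a))
        if abs(t_cl_new - t_cl) < 1e-4:
            break
        t_cl = 0.5 * (t_cl + t_cl_new)  # damping keeps the iteration stable

    h_c = max(2.38 * abs(t_cl - t_a) ** 0.25, 12.1 * math.sqrt(v))
    pmv = (0.303 * math.exp(-0.036 * M) + 0.028) * (
        (M - w)
        - 3.05e-3 * (5733 - 6.99 * (M - w) - p_a)
        - 0.42 * ((M - w) - 58.15)
        - 1.73e-5 * M * (5876 - p_a)
        - 0.0014 * M * (34 - t_a)
        - 3.96e-8 * f_cl * ((t_cl + 273) ** 4 - (t_r + 273) ** 4)
        - f_cl * h_c * (t_cl - t_a)
    )
    ppd = 100 - 95 * math.exp(-(0.03353 * pmv ** 4 + 0.2179 * pmv ** 2))
    return pmv, ppd

pmv, ppd = pmv_ppd(t_a=26.0, rh=0.55, v=0.15)
print(f"PMV={pmv:.2f}, PPD={ppd:.1f}%  comfortable: {ppd <= 10.0}")
```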

Step 2: Design the reinforcement-learning framework for air-conditioning energy consumption. Formulate the setting of the air-conditioning temperature as a Markov decision process, and determine the model's state space, action space, reward function and state-action value function.

Specifically:

2.1: Determine the state space S = [S_1 S_2 ... S_t ... S_m]. In the air-conditioning temperature-control system, the observed variables the agent obtains from the environment are the indoor set temperature T^{in}, the outdoor temperature T^{out}, the indoor humidity φ, the indoor air speed v and the time index t. The state at time t is expressed as:

S_t = [T_t^{in}, T_t^{out}, φ_t, v_t, t]

2.2: Determine the action space A = [A_1 A_2 ... A_t ... A_m]. In this system the agent's action is the set-temperature value, namely:

A = T^{in}

2.3: Set the reward function R_t. The reward represents the immediate return the environment feeds back to the agent when it takes a specified action in a given state. To minimize the energy consumption of the air conditioner over the whole scheduling period, the reward is set to:

R_t = -W_t - ξ

where W_t is the air-conditioning energy consumption at time t and ξ is the penalty factor; the penalty factor is 0 when the constraints are satisfied and a positive constant otherwise.

2.4: Set the state-action function Q^π(S, A), which characterizes the quality of a policy π, i.e. the cumulative return of the reward function under π:

Q^π(S, A) = E_π[ Σ_{k=0}^{∞} γ^k · R_{t+k} | S_t = S, A_t = A ]

where the agent's policy π is the mapping from states S to actions A, and γ is the discount factor, taking values in [0, 1]. The optimal policy π* maximizes the state-action function Q^π(S, A), i.e. the cumulative return of the reward function:

π* = arg max_π Q^π(S, A)
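A minimal sketch of this MDP's reward follows, building on the illustrative `power()` energy model and the `pmv_ppd()` helper sketched above; the penalty constant and the set-temperature bounds are assumed values:

```python
# Reward R_t = -W_t - xi: negative energy use, minus a penalty when the
# PMV-PPD comfort constraint or the set-temperature bounds are violated.
XI = 5.0                   # penalty constant, assumed value
T_MIN, T_MAX = 20.0, 30.0  # assumed set-temperature bounds (C)

def reward(t_in, t_out, rh, v):
    w_t = power(t_in, t_out) * DT_HOURS      # energy used this step (kWh)
    _, ppd = pmv_ppd(t_a=t_in, rh=rh, v=v)   # comfort at the new setting
    violated = ppd > 10.0 or not (T_MIN <= t_in <= T_MAX)
    return -w_t - (XI if violated else 0.0)
```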

Step 3: Perform Q-value neural-network training based on the TD3 algorithm, using historical temperature, humidity and air-speed sensor data and outdoor temperature data:

Deep reinforcement learning rests on the Markov decision process: the next state depends only on the current state and action. Its purpose is to train an agent to obtain the optimal state-action policy and thereby the maximum return. The basic framework of deep reinforcement learning, shown in FIG. 1, centres on the interaction between the agent and the environment. The deep neural network inside the agent produces a policy from the state given by the environment; the deep-reinforcement-learning algorithm updates the network's parameters from the rewards and states given by the environment and the actions given by the network, training the network toward the state-action policy with the greatest cumulative reward. The environment executes the action issued by the agent and returns the new state and reward to the agent. Through repeated iteration, the deep neural network converges to the best way of producing the optimal state-action policy.

The TD3 algorithm comprises Q-value networks and a policy network. The policy network maps states S to actions A, while the Q-value networks quantitatively evaluate the state-action value function. TD3 uses two Q-value networks and one policy network, together with two target Q-value networks and one target policy network. The two Q-value networks reduce overestimation of the state-action value function; they share the same structure but have different parameters. The target policy network and target Q-value networks have the same structures as the policy network and Q-value networks but different parameters, and the target-network parameters are updated only infrequently, to reduce errors during learning. A data set is generated from historical temperature, humidity and indoor air-speed data, and the deep neural networks of the air-conditioning control system are trained on it to obtain the optimal state-action value mapping. The specific steps of the TD3-based air-conditioning temperature control method are as follows:

Step 3.1: Initialize the Q-value networks, the policy network, the target Q-value networks, the target policy network parameters and the experience replay buffer parameters;

Step 3.2: For each time step of each episode, perform the following:

Step 3.2.1: Obtain the current indoor and outdoor environment state S_t and compute the action A_t with the policy network;

Step 3.2.2: Introduce random noise n into the action to obtain the random action Ã_t = A_t + n;

Step 3.2.3: Execute the random action, obtaining the reward function value R_t and the next state S_{t+1};

Step 3.2.4: Store (S_t, A_t, R_t, S_{t+1}) in the experience replay pool;

Step 3.2.5: Randomly sample a mini-batch of experience from the replay pool;

Step 3.2.6: For each sample, compute the next-moment action A_{t+1} with the target policy network;

Step 3.2.7: Introduce random noise n into the next-moment action to obtain the random action Ã_{t+1} = A_{t+1} + n;

Step 3.2.8: Compute the target Q-value function Q_t^{Target} from the minimum of the two target Q-value networks and the Bellman equation:

Q_t^{Target} = R_t + γ · min_{i=1,2} Q'_i(S_{t+1}, Ã_{t+1})

where γ is a discount factor taking values in [0, 1] and Q'_i is the value of the i-th target Q-value network;

Step 3.2.9: Compute the Q-value network loss function as the mean-squared error between the target Q-value function and the current Q-value function computed by the Q-value networks, and update the Q-value networks according to the gradient of this loss with respect to the Q-value networks; compute the policy-network loss function from the product of the total reward value and the policy probability, and update the policy network according to the gradient of this loss with respect to the policy network; then soft-update the parameters of the target Q-value networks and the target policy network;

Step 3.3: Output the trained Q-value network.
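For readers implementing step 3, the following PyTorch sketch shows one TD3 update (steps 3.2.5 through 3.2.9) under assumed network sizes and hyperparameters; the patent does not specify architectures, noise scales or learning rates, and the standard deterministic-policy-gradient actor loss, -Q1(s, π(s)), is used here in place of the text's reward-times-probability description.

```python
import torch
import torch.nn as nn

STATE_DIM, ACT_DIM = 5, 1    # [T_in, T_out, humidity, wind, t]; action = set temp
GAMMA, TAU, NOISE, CLIP = 0.99, 0.005, 0.2, 0.5   # assumed hyperparameters

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, out))

actor, actor_tgt = mlp(STATE_DIM, ACT_DIM), mlp(STATE_DIM, ACT_DIM)
q1, q2 = mlp(STATE_DIM + ACT_DIM, 1), mlp(STATE_DIM + ACT_DIM, 1)
q1_tgt, q2_tgt = mlp(STATE_DIM + ACT_DIM, 1), mlp(STATE_DIM + ACT_DIM, 1)
for tgt, src in [(actor_tgt, actor), (q1_tgt, q1), (q2_tgt, q2)]:
    tgt.load_state_dict(src.state_dict())   # targets start as copies (step 3.1)
q_opt = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=3e-4)
pi_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

def td3_update(s, a, r, s_next, step):
    """One update from a sampled mini-batch; r and outputs are [batch, 1]."""
    with torch.no_grad():
        noise = (torch.randn_like(a) * NOISE).clamp(-CLIP, CLIP)   # step 3.2.7
        a_next = actor_tgt(s_next) + noise                         # step 3.2.6
        q_min = torch.min(q1_tgt(torch.cat([s_next, a_next], 1)),
                          q2_tgt(torch.cat([s_next, a_next], 1)))
        q_target = r + GAMMA * q_min                               # step 3.2.8
    sa = torch.cat([s, a], 1)
    q_loss = nn.functional.mse_loss(q1(sa), q_target) + \
             nn.functional.mse_loss(q2(sa), q_target)              # step 3.2.9
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()
    if step % 2 == 0:  # delayed policy and target updates, as in standard TD3
        pi_loss = -q1(torch.cat([s, actor(s)], 1)).mean()
        pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
        for tgt, src in [(actor_tgt, actor), (q1_tgt, q1), (q2_tgt, q2)]:
            for p_t, p in zip(tgt.parameters(), src.parameters()):
                p_t.data.mul_(1 - TAU).add_(TAU * p.data)          # soft update
```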

Step 4: Deploy the trained Q-value neural network in the air-conditioning control system and adjust the set temperature in real time. Specifically: deploy the trained neural-network program on the control terminal and set a timer on the terminal; at fixed intervals, use the sensors to obtain the indoor temperature and humidity, the outdoor temperature, the indoor air speed and other information in real time, feed the state variables into the Q-value neural network to obtain the output temperature-control value, and communicate with the air-conditioning system to control its temperature in real time.
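A minimal sketch of the control step such a terminal would run on each timer tick follows, assuming hypothetical helpers `read_sensors()` and `send_ir_setpoint()` for the sensor bus and the infrared module, and assumed set-temperature bounds:

```python
import torch

def control_step(actor: torch.nn.Module) -> float:
    """One timed control step: read state, query the policy, set the AC."""
    # read_sensors() is a hypothetical helper returning the observed state:
    # indoor temp, outdoor temp, relative humidity, air speed, time index.
    t_in, t_out, rh, v, t_idx = read_sensors()
    state = torch.tensor([[t_in, t_out, rh, v, t_idx]], dtype=torch.float32)
    with torch.no_grad():
        setpoint = actor(state).item()  # policy output: next set temperature
    # Clamp and round to the remote control's discrete range (assumed bounds).
    setpoint = round(min(max(setpoint, 20.0), 30.0))
    send_ir_setpoint(setpoint)  # hypothetical driver for the infrared module
    return setpoint
```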

The effect of the invention is further described below with a concrete example:

Four rooms in a company's office building were selected for deploying the air-conditioning energy-saving algorithm, with the operating period set to working hours, 9 a.m. to 5 p.m. Air-speed sensors and indoor temperature-humidity sensors were installed in the four rooms, and the collected data were transmitted over a ZigBee wireless network to the control terminal, which obtains the current real-time outdoor temperature from third-party weather software. The neural network was deployed on the control terminal, and a crontab entry made the system perform temperature control every 15 minutes: the indoor temperature and humidity, outdoor temperature, indoor air speed and other state variables were fed in to obtain the temperature-control value, and the control command was sent to the air conditioner through an infrared transmitter module, realizing air-conditioning temperature control based on deep reinforcement learning. In practice, compared with manual control under the same climatic conditions, the proposed temperature-control program saved about 6.6% of energy.
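For reference, a crontab schedule of the kind described, firing every 15 minutes over the 9-to-5 operating window, might look like this (the script path is a hypothetical placeholder):

```
# Run the control script every 15 minutes from 9:00 through 16:45,
# plus a final run at 17:00 (illustrative; path is hypothetical).
*/15 9-16 * * * /usr/bin/python3 /opt/hvac/control_step.py
0 17 * * * /usr/bin/python3 /opt/hvac/control_step.py
```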

It should be understood that the description of the embodiments above is relatively detailed and should not therefore be taken as limiting the scope of patent protection of the invention. Under the teaching of the invention, a person of ordinary skill in the art may make substitutions or modifications without departing from the scope protected by the claims of the invention, all of which fall within the protection of the invention; the scope of protection sought shall be defined by the appended claims.

Claims (5)

1. An air-conditioner temperature control method based on deep reinforcement learning, characterized by comprising the following steps:
Step 1: establishing an air-conditioning energy-consumption optimization model: setting the objective function of the air-conditioner temperature control to be the lowest air-conditioning energy consumption over all usage periods, wherein the constraint conditions are that the air-conditioning temperature satisfies the PMV-PPD human-comfort condition and the air-conditioner operating constraints;
Step 2: designing the reinforcement-learning framework for air-conditioning energy consumption: expressing the setting of the air-conditioning temperature as a Markov decision process, determining the state space and the action space of the optimization model, and determining the reward function and the state-action value function from the state space, the action space and the constraint conditions, thereby obtaining the optimal policy of the optimization model; specifically comprising the following steps:
Step 2.1: determining the state space S = [S_1 S_2 ... S_t ... S_m]; in the air-conditioning temperature-control system, the observed variables the agent acquires from the environment comprise the indoor set temperature T^{in}, the outdoor temperature T^{out}, the indoor humidity φ, the indoor air speed v and the time index t; the state at time t is expressed as:
S_t = [T_t^{in}, T_t^{out}, φ_t, v_t, t]
Step 2.2: determining the action space A = [A_1 A_2 ... A_t ... A_m], wherein the action space A of the agent is the indoor set-temperature value, namely:
A = T^{in}
Step 2.3: setting the reward function R_t, the reward function representing the immediate return the environment feeds back to the agent when the agent takes the specified action in a given state; to minimize the energy consumption of the air conditioner over the whole scheduling period, the reward function is set to:
R_t = -W_t - ξ
wherein W_t represents the air-conditioning energy consumption at time t and ξ is the penalty factor; the penalty factor is 0 when the constraint conditions are satisfied and a positive constant otherwise;
Step 2.4: setting the state-action function Q^π(S, A), which characterizes the quality of the policy π, namely the cumulative return of the reward function under the policy π:
Q^π(S, A) = E_π[ Σ_{k=0}^{∞} γ^k · R_{t+k} | S_t = S, A_t = A ]
wherein the policy π of the agent is the mapping from states S to actions A, and γ is the discount factor taking values in [0, 1];
the optimal policy π* maximizes the state-action function Q^π(S, A), namely the cumulative return of the reward function:
π* = arg max_π Q^π(S, A);
Step 3: based on the state-action value function and on historical temperature, humidity and air-speed sensor data and outdoor temperature data, performing Q-value neural-network training based on the TD3 algorithm;
Step 4: deploying the trained Q-value neural network in the air-conditioning control system and adjusting the air-conditioner temperature in real time.
2. The air-conditioner temperature control method based on deep reinforcement learning according to claim 1, wherein step 1 comprises the steps of:
Step 1.1: establishing the objective function of the system, namely that the total energy consumption of the air conditioner over the usage period is the lowest:
min W = Σ_{t=1}^{m} P_t · Δt
wherein W represents the total energy consumption of the user, the optimization time window contains m time steps in total, P_t represents the air-conditioning power at time t, Δt is the time interval between two adjacent control actions, and T_t^{in} and T_t^{out} respectively represent the indoor set temperature and the outdoor temperature of the user at time t; the control makes the indoor temperature consistent with the set temperature, and α and β are the heat-transfer parameter and the thermal efficiency, respectively;
Step 1.2: establishing the constraint conditions of the temperature control, namely that the indoor temperature must satisfy the PMV-PPD comfort condition and that the air-conditioning power and the indoor temperature have upper and lower limits:
PMV is calculated as:
PMV = (0.303·e^{-0.036M} + 0.028)·{(M - w) - 3.05×10^{-3}·[5733 - 6.99(M - w) - P_a] - 0.42·[(M - w) - 58.15] - 1.73×10^{-5}·M·(5876 - P_a) - 0.0014·M·(34 - t_a) - 3.96×10^{-8}·f_cl·[(t_cl + 273)^4 - (t_r + 273)^4] - f_cl·h_c·(t_cl - t_a)}
wherein M represents the human metabolic rate in W/m², generally 60 W/m² at work; w is the external mechanical work of the body in W/m², generally 0; t_a is the air temperature in ℃; f_cl is the clothing area factor, taken as 1.1; t_r is the mean radiant temperature of the room in ℃; t_cl is the outer-surface temperature of the clothing in ℃; h_c is the convective heat-transfer coefficient in W/(m²·K); P_a is the partial pressure of water vapour around the body; t_cl is obtained from the implicit relation:
t_cl = 35.7 - 0.028(M - w) - I_cl·{3.96×10^{-8}·f_cl·[(t_cl + 273)^4 - (t_r + 273)^4] + f_cl·h_c·(t_cl - t_a)}
wherein φ is the relative humidity, v is the indoor air speed in m/s, and I_cl is the thermal resistance of the clothing in m²·K/W; h_c and P_a are computed from v and φ, respectively;
PPD is calculated as:
PPD = 100 - 95·exp[-(0.03353·PMV^4 + 0.2179·PMV^2)]
the PPD under comfort conditions is below 10%:
PPD ≤ 10%
the power of the air conditioner cannot exceed its rated power:
0 ≤ P_t ≤ P_max
wherein P_max represents the rated power of the air conditioner;
the set temperature of the remote control is a discrete value within a certain range, with the constraint:
T_min^{in} ≤ T_t^{in} ≤ T_max^{in}
wherein T_max^{in} and T_min^{in} represent the upper and lower bounds of the air-conditioner set temperature.
3. The air-conditioner temperature control method based on deep reinforcement learning according to claim 1, wherein the TD3 algorithm comprises Q-value networks and a policy network; the policy network realizes the mapping from the state S to the action A, and the Q-value networks realize the quantitative evaluation of the state-action value function; the TD3 algorithm comprises two Q-value networks and one policy network, and simultaneously two target Q-value networks and one target policy network; the two Q-value networks are used for reducing overestimation of the state-action value function, and have the same structure but different parameters; the structures of the target policy network and the target Q-value networks are the same as those of the policy network and the Q-value networks, but the parameters are different, and the parameters of the target networks are not updated frequently, so as to reduce errors in the learning process; a data set is generated from historical data of given temperature, humidity and indoor air speed, and the deep neural network of the air-conditioning control system is trained on it, thereby obtaining the optimal state-action value mapping.
4. The air-conditioner temperature control method based on deep reinforcement learning according to claim 3, wherein the specific steps of the TD3-based air-conditioner temperature control method are as follows:
Step 3.1: initializing the Q-value networks, the policy network, the target Q-value networks, the target policy network parameters and the experience replay buffer parameters;
Step 3.2: performing the following steps for each time step of each episode:
Step 3.2.1: acquiring the current indoor and outdoor environment state S_t, and obtaining the action A_t through the policy network;
Step 3.2.2: introducing random noise n into the action to obtain the random action Ã_t = A_t + n;
Step 3.2.3: executing the random action, and obtaining the reward function value R_t and the next state S_{t+1};
Step 3.2.4: storing (S_t, A_t, R_t, S_{t+1}) in the experience replay pool;
Step 3.2.5: randomly extracting a small batch of experience samples from the experience replay pool;
Step 3.2.6: obtaining the next-moment action A_{t+1} through the target policy network based on the experience samples;
Step 3.2.7: introducing random noise n into the next-moment action to obtain the random action Ã_{t+1} = A_{t+1} + n;
Step 3.2.8: obtaining the target Q-value function Q_t^{Target} through the minimum of the two target Q-value networks and the Bellman equation:
Q_t^{Target} = R_t + γ · min_{i=1,2} Q'_i(S_{t+1}, Ã_{t+1})
wherein γ is a discount factor taking values in [0, 1], and Q'_i is the value of the i-th target Q-value network;
Step 3.2.9: calculating the Q-value network loss function as the mean-squared error between the target Q-value function and the current Q-value function calculated by the Q-value networks, and updating the Q-value networks according to the gradient of the loss with respect to the Q-value networks; calculating the policy-network loss function from the product of the total reward value and the policy probability, and updating the policy network according to the gradient of the loss with respect to the policy network; and updating the soft parameters of the target Q-value networks and the target policy network;
Step 3.3: outputting the trained Q-value network.
5. The air-conditioner temperature control method based on deep reinforcement learning according to claim 1, wherein step 4 comprises the steps of:
deploying the trained neural-network program on a control terminal and setting a timer on the control terminal; at identical time intervals, acquiring the indoor temperature and humidity, the outdoor temperature and the indoor air-speed information in real time using sensors; inputting the state variables into the Q-value neural network to obtain the output air-conditioner temperature-control value; and communicating with the air-conditioning system to control its temperature in real time.
CN202310519295.2A 2023-05-06 2023-05-06 Air conditioner temperature control method based on deep reinforcement learning Active CN116358114B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310519295.2A CN116358114B (en) 2023-05-06 2023-05-06 Air conditioner temperature control method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310519295.2A CN116358114B (en) 2023-05-06 2023-05-06 Air conditioner temperature control method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN116358114A CN116358114A (en) 2023-06-30
CN116358114B true CN116358114B (en) 2024-08-20

Family

ID=86934798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310519295.2A Active CN116358114B (en) 2023-05-06 2023-05-06 Air conditioner temperature control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116358114B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117901618B (en) * 2024-03-19 2024-05-14 成都赛力斯科技有限公司 Control method and device of vehicle-mounted air conditioner, electronic equipment and readable storage medium
CN118066651B (en) * 2024-04-25 2024-06-25 上海迪捷通数字科技有限公司 Intelligent park air conditioning method and system based on artificial intelligence
CN118444555A (en) * 2024-04-29 2024-08-06 北京红立方医疗设备有限公司 Control method, device and medium of multifunctional rescue tent
CN118859769A (en) * 2024-06-21 2024-10-29 上海碳趣科技有限公司 A method for energy-saving intelligent control of water-cooled air conditioners in computer rooms based on deep reinforcement learning
CN118906355A (en) * 2024-07-27 2024-11-08 江苏金橡塑新材料有限公司 Polyethylene wax forming device and control method thereof
CN118623441B (en) * 2024-08-13 2024-10-22 浙江恒隆智慧科技集团有限公司 Air conditioning system comprehensive energy consumption optimization method based on DDQN of prediction and experience playback
CN119085089B (en) * 2024-11-06 2025-02-18 航天科工广信智能技术有限公司 Air conditioner control strategy generation method and system
CN119512289B (en) * 2025-01-21 2025-05-16 台州市一鸣机械股份有限公司 An intelligent temperature and humidity control method and system for low-temperature fresh-keeping metal granary
CN119536413B (en) * 2025-01-23 2025-05-06 国科大杭州高等研究院 A large surface source blackbody parameter-free temperature control method, device and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113983646A (en) * 2021-09-28 2022-01-28 国网浙江省电力有限公司营销服务中心 Air conditioning interactive terminal energy consumption prediction method based on generative adversarial network and air conditioning
CN114370698A (en) * 2022-03-22 2022-04-19 青岛理工大学 Indoor thermal environment learning efficiency improvement optimization control method based on reinforcement learning

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5598200B2 (en) * 2010-09-16 2014-10-01 ソニー株式会社 Data processing apparatus, data processing method, and program
US10417565B1 (en) * 2018-06-30 2019-09-17 Carbon Lighthouse, Inc. System, method, and computer program for modeling and predicting energy consumption in a building
CN111765604B (en) * 2019-04-01 2021-10-08 珠海格力电器股份有限公司 Control method and device of air conditioner
KR102661364B1 (en) * 2019-07-05 2024-04-25 엘지전자 주식회사 Method for air conditioning and air conditioner based on thermal comfort
WO2021006406A1 (en) * 2019-07-11 2021-01-14 엘지전자 주식회사 Artificial intelligence-based air conditioner
CN110458443B (en) * 2019-08-07 2022-08-16 南京邮电大学 Smart home energy management method and system based on deep reinforcement learning
CN112577159B (en) * 2020-12-10 2022-02-22 广东省科学院智能制造研究所 Air conditioner energy-saving intelligent control method and device based on human body thermal comfort
CN113177594B (en) * 2021-04-29 2022-06-17 浙江大学 Air conditioner fault diagnosis method based on Bayesian optimization PCA-limit random tree
CN115103562B (en) * 2022-05-27 2025-01-07 内蒙古工业大学 Distributed intelligent control method for data center air conditioning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113983646A (en) * 2021-09-28 2022-01-28 国网浙江省电力有限公司营销服务中心 Air conditioning interactive terminal energy consumption prediction method based on generative adversarial network and air conditioning
CN114370698A (en) * 2022-03-22 2022-04-19 青岛理工大学 Indoor thermal environment learning efficiency improvement optimization control method based on reinforcement learning

Also Published As

Publication number Publication date
CN116358114A (en) 2023-06-30

Similar Documents

Publication Publication Date Title
CN116358114B (en) Air conditioner temperature control method based on deep reinforcement learning
Li et al. Experimental study of an indoor temperature fuzzy control method for thermal comfort and energy saving using wristband device
CN103398451B (en) Based on the multidimensional comfort level indoor environmental condition control method and system of study user behavior
CN111336669A (en) Indoor air conditioner ventilation system based on model predictive control
CN201318769Y (en) Air-conditioning energy-saving control system based on projected average thermal sensation index
CN106642513A (en) Intelligent energy-saving environment regulation and control system and method
CN108413588A (en) A kind of personalized air-conditioner control system and method based on thermal imaging and BP neural network
CN114017904B (en) Operation control method and device for building HVAC system
CN106403162A (en) Local heat comfort control method, local heat comfort controller and control system
CN115388520A (en) Air conditioner control method, air conditioner control device, air conditioner and storage medium
Zhao et al. Thermal sensation and occupancy-based cooperative control method for multi-zone VAV air-conditioning systems
CN106765980A (en) Building interior environment temperature intelligence control system and its method
CN109631265A (en) Large-scale public space comfort level intelligent regulating system
CN113757852B (en) Control method and control system of multi-connected air-conditioning units based on digital twin technology
CN112923530B (en) Intelligent temperature control method and system based on human thermal sensation
CN114153256A (en) Shared space environment integrated monitoring system based on user feedback
TWI699637B (en) System for controlling environmental comfort using deep reinforcement learning and method thereof
CN117606133A (en) Reinforcement learning method and device for weighing personalized thermal comfort and HVAC energy consumption
CN110986315B (en) Dynamic regulation method of indoor temperature based on central air conditioning system
JPH04131643A (en) Total environment creation system
CN116123699B (en) Air conditioner control method and system based on human body thermal comfort prediction
CN114234370B (en) Multi-split air conditioner control method and device and multi-split air conditioner
Chen et al. Approach of establishing a fuzzy multi-criteria decision model for building occupant multi-behaviors
CN117450642A (en) Temperature and humidity self-adaptive air conditioner adjusting system and method based on fuzzy PID
CN112923529B (en) Heat-experience-based heating and air-conditioning temperature control method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant