CN106097733A

CN106097733A - A traffic signal optimization control method based on policy iteration and clustering

Info

Publication number: CN106097733A
Application number: CN201610696748.9A
Authority: CN
Inventors: 王冬青; 张震; 董心壮; 丁军航; 宋婷婷
Original assignee: Qingdao University
Current assignee: Qingdao University
Priority date: 2016-08-22
Filing date: 2016-08-22
Publication date: 2016-11-09
Anticipated expiration: 2036-08-22
Also published as: CN106097733B

Abstract

The invention proposes a traffic signal optimization control method based on policy iteration and clustering, in the field of intelligent optimization technology. The method comprises: step 1, selecting the control scheme and defining the traffic state, control action, immediate reward and Q value; step 2, running the traffic lights under actuated control and recording, at each sampling instant, the traffic state, the control action and the number of vehicles leaving the stop line; step 3, preprocessing the traffic states and then applying k-means clustering to them; step 4, optimizing the policy by policy iteration in the intersection unit, and saving the optimized policy and the centroids obtained in step 3 in the traffic signal controller; step 5, replacing actuated control with the control policy obtained in step 4: at the start of each sampling period, the traffic signal controller receives the traffic state collected by the intersection unit, looks up the control policy using the discrete state of the corresponding centroid, and sends the resulting control action to the intersection unit for execution.

Description

A Traffic Signal Optimal Control Method Based on Policy Iteration and Clustering

Technical field

The invention relates to the technical field of intelligent optimization.

Background

Optimal control of traffic signals is an important part of urban traffic management and control systems. The quality of a signal control policy directly affects the transport efficiency of the whole road network and the travel experience of road users. Various intelligent optimal control methods have therefore been proposed and applied to the optimization of traffic signal control policies.

Dynamic programming is a method for solving for an optimal control policy and includes two approaches, value iteration and policy iteration. The traffic state, the phase and the immediate reward are sampled, and the samples are then used to further optimize the control policy, so the approach is well suited to optimal traffic signal control. When policy iteration is applied to traffic signal control, continuous variables such as the vehicle queue length must be discretized. The traditional discretization divides the whole state space uniformly, whereas the states that actually occur gather in only some regions of it. Partitioning those regions with k-means clustering therefore gives higher discretization accuracy for the same number of discrete states, which improves the optimization result.

Summary of the invention

The purpose of the invention is to discretize the traffic state with k-means clustering in order to improve the optimization achieved by policy iteration and thus better optimize the control policy of the traffic lights at an intersection. The ultimate goal is to increase the number of vehicles passing through the intersection per unit time and to reduce the number of stops and the average delay caused by waiting at red lights.

The invention first controls the intersection signals by actuated control. At short, fixed unit time intervals, the intersection unit records the vehicle queue lengths of the current phase and the next phase, the number of vehicles leaving the stop line, and the control action of the traffic signal controller. Once enough samples have been collected, the intersection unit applies k-means clustering to the queue lengths in the samples to obtain discrete traffic states. The policy is then optimized by policy iteration, and the optimized policy is saved in the traffic signal controller. At each unit time interval thereafter, the intersection unit sends the detected queue lengths of the current phase and the next phase to the traffic signal controller, which selects a suitable phase action from the queue lengths and the stored optimized policy for the intersection unit to execute.

The invention proposes a traffic signal optimization control method based on policy iteration and clustering, comprising the following steps:

Step 1: select fixed phase sequence control as the signal control scheme to be optimized; define the traffic state as the vehicle queue lengths of the current phase and the next phase; define the control action as keeping the current phase or switching to the next phase; define the immediate reward as a variable related to the number of vehicles leaving the stop line within a single sampling period; define a state-action pair as the data vector formed by a discrete traffic state and a control action; define the Q value of each state-action pair as the expected cumulative reward obtained after taking the control action in the corresponding discrete traffic state; and define the control policy as the control action to be executed in each discrete traffic state;

Step 2: the intersection unit sets the control policy of the traffic signal controller to actuated control, with the minimum and maximum green times set to positive integer multiples of the sampling period and the unit green extension equal to the sampling period; the intersection unit samples and records the traffic state, the executed phase action and the number of vehicles leaving the stop line, as follows: at each sampling instant it records the traffic state, the control action and the number of vehicles leaving the stop line in each sampling period;

Step 3: after collecting the specified number of samples, the intersection unit discretizes the traffic states in the samples, as follows: the sampled traffic states are first normalized, traffic states are then filtered out according to a preset spacing threshold, and k-means clustering is applied; the resulting centroids are numbered, each centroid corresponding to one discrete traffic state, and every traffic state in the normalized samples is represented by the number of its nearest centroid, which gives the corresponding discrete traffic state;

Step 4: the intersection unit optimizes the policy by policy iteration and saves the optimized policy and the centroids obtained in step 3 in the traffic signal controller;

Step 5: the intersection unit sets the control policy of the traffic signal controller to the policy obtained in step 4 and sets the decision period equal to the sampling period; at each decision instant, the traffic signal controller receives the traffic state detected by the intersection unit, normalizes it, computes the distance from the normalized traffic state to every centroid, finds the nearest centroid, looks up the control policy for the discrete traffic state of that centroid, and sends the resulting control action to the intersection unit for execution.

Advantages of the invention over the prior art:

Before policy iteration can be used to optimize a traffic signal control policy, the traffic state must be discretized: the continuous state space formed by the vehicle queue lengths of two phases is converted into a discrete state space, and the accuracy of this discretization affects the result of policy iteration. In each typical time period, the actual traffic states are not spread over the whole state space but are concentrated in certain regions. The discrete traffic states produced by k-means clustering cover only the regions where traffic states actually occur, whereas traditional discretization also covers regions that contain no actual traffic states. Compared with the traditional method, k-means clustering therefore achieves higher discretization accuracy with the same number of discrete traffic states, which improves the result of policy iteration.

Description of the drawings

Figure 1 is a schematic diagram of traffic signal control at an urban road intersection.

Figure 2 is a flowchart of a traffic signal optimization control method based on policy iteration and clustering.

Reference numerals: 1 to 24, the first to twenty-fourth geomagnetic vehicle detectors, respectively; 25 to 36, lane one to lane twelve, respectively.

Detailed description

To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings.

Two geomagnetic vehicle detectors are installed in each lane. One is placed immediately upstream of the stop line and counts the vehicles passing the stop line; the other is placed 120 meters upstream of the stop line and counts the vehicles passing that cross-section. From these two detectors, the number of vehicles between the stop line and the cross-section 120 meters upstream can be computed for the lane at any time and converted into a vehicle queue length. As shown in Figure 1, the first geomagnetic vehicle detector 1 and the second geomagnetic vehicle detector 2 measure the queue length of lane one 25; detectors 3 and 4 measure that of lane two 26; detectors 5 and 6, lane three 27; detectors 7 and 8, lane four 28; detectors 9 and 10, lane five 29; detectors 11 and 12, lane six 30; detectors 13 and 14, lane seven 31; detectors 15 and 16, lane eight 32; detectors 17 and 18, lane nine 33; detectors 19 and 20, lane ten 34; detectors 21 and 22, lane eleven 35; and detectors 23 and 24, lane twelve 36.
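The counting scheme above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the class name `LaneCounter`, the per-interval count interface and the 6-meter average vehicle spacing used to convert a count into a length are all assumptions, since the text does not state the conversion.

```python
from dataclasses import dataclass

DETECTION_ZONE_M = 120.0   # distance between the two detectors (from the text)
ASSUMED_SPACING_M = 6.0    # assumption: road length occupied by one queued vehicle


@dataclass
class LaneCounter:
    """Tracks the vehicles between the upstream detector and the stop line."""

    vehicles_in_zone: int = 0

    def update(self, entered_upstream: int, left_stop_line: int) -> int:
        """Apply the counts both detectors reported since the last update."""
        self.vehicles_in_zone += entered_upstream - left_stop_line
        # Guard against detector miscounts driving the total below zero.
        self.vehicles_in_zone = max(self.vehicles_in_zone, 0)
        return self.vehicles_in_zone

    def queue_length_m(self) -> float:
        """Convert the vehicle count to a length, capped at the detection zone."""
        return min(self.vehicles_in_zone * ASSUMED_SPACING_M, DETECTION_ZONE_M)
```

The cap at 120 meters reflects that vehicles beyond the upstream detector are not observable with this layout.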

The intersection unit receives the information sent by all twenty-four geomagnetic vehicle detectors, from the first geomagnetic vehicle detector 1 to the twenty-fourth geomagnetic vehicle detector 24, and forwards it to the traffic signal controller. Every 10 seconds, the traffic signal controller decides the control action from the received traffic state and the control policy set by the intersection unit.

The flowchart in Figure 2 of the traffic signal optimization control method based on policy iteration and clustering comprises the following steps:

Step 1: select the signal control scheme and define the traffic state, control action, immediate reward and Q value:

The signal control scheme to be optimized is a fixed phase sequence scheme. It is described below for the case of four symmetric phases, but the invention is limited neither to four phases nor to symmetric phases. Phase 1: vehicles in lane one 25 and lane four 28 may go straight or turn right, and vehicles in lane two 26 and lane five 29 may go straight. Phase 2: vehicles in lane three 27 and lane six 30 may turn left. Phase 3: vehicles in lane seven 31 and lane ten 34 may go straight or turn right, and vehicles in lane eight 32 and lane eleven 35 may go straight. Phase 4: vehicles in lane nine 33 and lane twelve 36 may turn left. At any moment the traffic signal is in exactly one of the four phases, and the phases are executed in order. Although the phase order is fixed, the green duration of each phase need not be. The control action is defined as keeping the current phase or switching to the next phase. If the current phase is phase 1, then after 10 seconds the traffic signal controller must decide on a control action: keep phase 1 or switch to phase 2. If phase 2 is selected, another decision is required after 10 seconds: keep phase 2 or switch to phase 3. If phase 3 is selected, another decision is required after 10 seconds: keep phase 3 or switch to phase 4. If phase 4 is selected, another decision is required after 10 seconds: keep phase 4 or switch to phase 1, and so on in a cycle. The minimum green time of every phase is defined as 10 seconds and the maximum green time as 60 seconds.
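The fixed phase order and the two-valued control action described above can be sketched as a small state machine. The names `KEEP`, `SWITCH`, `next_phase`, `apply_action` and `rollout` are illustrative assumptions, and the minimum and maximum green constraints are deliberately left out of this sketch.

```python
KEEP, SWITCH = 0, 1   # keep the current phase / switch to the next phase
NUM_PHASES = 4        # the four-phase example from the text


def next_phase(phase: int) -> int:
    """Phase that follows `phase` in the fixed cycle 1 -> 2 -> 3 -> 4 -> 1."""
    return phase % NUM_PHASES + 1


def apply_action(phase: int, action: int) -> int:
    """Phase in force during the next 10-second decision period."""
    return next_phase(phase) if action == SWITCH else phase


def rollout(start_phase: int, actions: list[int]) -> list[int]:
    """Phases visited when `actions` are applied one decision period at a time."""
    phases, phase = [], start_phase
    for action in actions:
        phase = apply_action(phase, action)
        phases.append(phase)
    return phases
```

For example, `rollout(1, [KEEP, SWITCH, SWITCH])` walks from phase 1 through phase 2 to phase 3, which shows that only the green durations, not the order, depend on the controller's choices.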

The vehicle queue length of a phase is defined as the maximum of the queue lengths of all lanes of that phase. The queue length of phase 1 is the maximum of the queue lengths of lane one 25, lane two 26, lane four 28 and lane five 29; that of phase 2 is the maximum over lane three 27 and lane six 30; that of phase 3 is the maximum over lane seven 31, lane eight 32, lane ten 34 and lane eleven 35; and that of phase 4 is the maximum over lane nine 33 and lane twelve 36.

The traffic state is defined as the vehicle queue lengths of the current phase and the next phase. For example, if the current phase is phase 1, the current traffic state is the data vector formed by the two variables, the queue length of phase 1 and the queue length of phase 2.
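A short sketch of these two definitions follows. Lanes are indexed 1 to 12 for lane one to lane twelve; the dictionary layout and the function names are assumptions made for illustration.

```python
# Lanes served by each phase in the four-phase example from the text.
PHASE_LANES = {
    1: (1, 2, 4, 5),
    2: (3, 6),
    3: (7, 8, 10, 11),
    4: (9, 12),
}


def phase_queue(lane_queues: dict[int, float], phase: int) -> float:
    """Queue length of a phase: the maximum over the lanes it serves."""
    return max(lane_queues[lane] for lane in PHASE_LANES[phase])


def traffic_state(lane_queues: dict[int, float], current_phase: int) -> tuple[float, float]:
    """Traffic state: queue lengths of the current phase and the next phase."""
    following = current_phase % len(PHASE_LANES) + 1
    return (phase_queue(lane_queues, current_phase),
            phase_queue(lane_queues, following))
```

Because the state only ever contains two queue lengths, its dimension stays at two regardless of how many phases the intersection has.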

The start of a sampling period is defined as the instant at which the traffic signal controller decides the control action; the sampling period has the same duration as the decision period, 10 seconds. The immediate reward is defined as a dimensionless quantity related to the number of vehicles leaving the stop line within a single sampling period; it expresses the immediate benefit of taking a control action in a traffic state. A state-action pair is defined as the data vector formed by a discrete traffic state and a control action. The Q value of a state-action pair is defined as the expected cumulative reward obtained after taking the control action in the corresponding discrete traffic state, that is, the expectation of the sum of the immediate rewards obtained over several sampling periods after the action is taken; the Q value expresses the long-term benefit of taking the control action in the discrete traffic state. The control policy is defined as the control action to be taken in a given discrete traffic state.

The immediate reward r is calculated as follows:

In the formula above, n_p denotes the number of vehicles passing the stop line within one sampling period, and the constants 6.5, 4.5 and -1.0 keep the immediate reward r within [-1, 1]. The traffic signal controller computes n_p from the traffic states sent by the intersection unit at two consecutive instants and then obtains the immediate reward r from the formula above.

The Q value of a state-action pair is defined as follows:

$Q(s,a) = E\left(\sum_{k=1}^{T} \gamma^{k-1} r(s,a)\right)$

Here s denotes a discrete traffic state, a denotes the control action executed in traffic state s, Q(s,a) denotes the Q value of the state-action pair s-a, E denotes expectation, r(s,a) denotes the immediate reward obtained by executing control action a in state s, and γ is the discount factor, a real number between 0 and 1. The index k denotes the k-th sampling period after traffic state s is encountered: the period that follows encountering state s and executing control action a corresponds to k = 1. T means that sampling stops at the T-th sampling period after traffic state s is encountered, so that the cumulative reward uses the immediate rewards of T sampling periods only.

Step 2: sample the traffic state, the executed control action and the number of vehicles leaving the stop line.
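To make the definition concrete, the sketch below evaluates the discounted sum for given reward sequences and averages it over several sequences as a crude stand-in for the expectation. It only illustrates the definition: the method itself obtains Q values by policy iteration, and the function names and example numbers here are assumptions.

```python
def discounted_return(rewards: list[float], gamma: float) -> float:
    """Sum of gamma**(k-1) * r_k over k = 1..T for one reward sequence."""
    # enumerate() starts at 0, which matches the exponent k - 1 for k = 1..T.
    return sum(gamma ** exponent * r for exponent, r in enumerate(rewards))


def monte_carlo_q(reward_sequences: list[list[float]], gamma: float) -> float:
    """Average discounted return over sequences that all start from the same (s, a)."""
    returns = [discounted_return(seq, gamma) for seq in reward_sequences]
    return sum(returns) / len(returns)


# Three sampling periods with reward 1.0 each and gamma = 0.5:
# 1.0 + 0.5 * 1.0 + 0.25 * 1.0 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

A smaller γ weights the first sampling period more heavily, so the Q value then tracks the immediate reward more closely.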

Sampling is carried out for some time during a specified typical period, such as the morning or evening peak. During the sampling stage, the intersection unit sets the control policy of the traffic signal controller to actuated control. The minimum and maximum green times are set to positive integer multiples of the sampling period, and the unit green extension equals the sampling period: the minimum green time of each phase is 10 seconds, the maximum green time is 60 seconds and the unit green extension is 10 seconds. Every second, the phase is decided as follows: if the green time of the current phase is less than 10 seconds, the current phase is kept; if it is greater than or equal to 60 seconds, the signal switches to the next phase; if it is between 10 and 60 seconds, the green time is extended by 10 seconds when a vehicle is arriving on the current phase, and otherwise the signal switches directly to the next phase. Every 10 seconds, the intersection unit detects and stores the following information as one sample: the vehicle queue lengths of the current phase and the next phase, the executed control action, and the number of vehicles leaving the stop line in each sampling period. The number of samples to be collected is set to 9000.
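The actuated control rule above can be sketched as a single decision function. The function name, the boolean `vehicle_approaching` flag and the action constants are assumptions; `KEEP` stands for the 10-second extension, and the bookkeeping of the extension timer is omitted.

```python
MIN_GREEN_S = 10   # minimum green time from the text
MAX_GREEN_S = 60   # maximum green time from the text
KEEP, SWITCH = 0, 1


def actuated_action(green_elapsed_s: int, vehicle_approaching: bool) -> int:
    """Decide whether to keep the current phase under actuated control."""
    if green_elapsed_s < MIN_GREEN_S:
        return KEEP                      # minimum green not yet served
    if green_elapsed_s >= MAX_GREEN_S:
        return SWITCH                    # maximum green reached
    # Between the limits: extend only while vehicles keep arriving.
    return KEEP if vehicle_approaching else SWITCH
```

Running this rule during the sampling stage yields samples for both actions, which the later estimation of transition probabilities and rewards relies on.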

Step 3: after collecting 9000 samples, the intersection unit discretizes the traffic states in the samples. Each sample is arranged as a data vector (l, a, l', r), where l denotes the traffic state at a sampling instant, a denotes the control action executed in traffic state l, l' denotes the traffic state at the sampling instant that follows l, and r denotes the immediate reward obtained during the sampling period in which the traffic state moves from l to l'. The reward r can be computed from the number of vehicles leaving the stop line in each sampling period, as recorded in the raw samples, using the formula for the immediate reward r in step 1.

The traffic states in the samples are preprocessed: they are first normalized, and traffic states are then filtered out according to a preset spacing threshold. Euclidean distance is used as the distance and the threshold is set to 0.1. A normalized traffic state is first selected at random from the samples and added to an empty data set, called the traffic state data set. The remaining traffic states in the samples are then added to the data set according to the following rule: if the distance from a traffic state in the samples to every traffic state already in the traffic state data set is greater than 0.1, the traffic state is added to the traffic state data set; otherwise it is not added.

k-means clustering is applied to the traffic states in the traffic state data set. A cluster is defined as a set of similar traffic states, and each cluster corresponds to one discrete traffic state. A centroid is defined as the centroid of all traffic states contained in a cluster. The number of centroids is set to 30, and clustering then proceeds as follows:
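A minimal sketch of this preprocessing follows. It implements the insertion rule exactly as stated (a state is kept only if it lies more than 0.1 from every state already kept). The normalization itself is not specified in the text, so dividing each queue length by the 120-meter detection zone is an assumption, as are the function names and the fixed random seed.

```python
import math
import random

MAX_QUEUE_M = 120.0   # assumption: normalize by the length of the detection zone
THRESHOLD = 0.1       # spacing threshold from the text


def normalize(state: tuple[float, float]) -> tuple[float, ...]:
    """Scale both queue lengths into [0, 1]."""
    return tuple(q / MAX_QUEUE_M for q in state)


def thin_states(states, threshold=THRESHOLD, seed=0):
    """Build the traffic state data set from normalized states."""
    rng = random.Random(seed)
    first = rng.randrange(len(states))       # random starting state
    kept = [states[first]]
    for i, state in enumerate(states):
        # Keep a state only if it is farther than the threshold from all kept states.
        if i != first and all(math.dist(state, k) > threshold for k in kept):
            kept.append(state)
    return kept
```

The effect is to thin out dense regions so that no two retained states are closer than the threshold, which reduces the number of points passed to k-means.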

Step a: randomly select 30 different traffic states from the traffic state data set as the initial centroids;

Step b: compute the distance from every traffic state to every centroid and assign each traffic state to its nearest centroid, forming 30 clusters;

Step c: recompute the centroid of each cluster;

Step d: compute the change of each centroid, that is, the distance between its previous and new positions; if the centroids of all clusters no longer change, k-means clustering ends; otherwise return to step b.
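Steps a to d can be sketched in plain Python as below. This is an illustrative implementation under stated assumptions: states are tuples, an empty cluster keeps its previous centroid (the text does not say what to do in that case), and the `seed` and `max_iter` parameters are added only for reproducibility and safety.

```python
import math
import random


def kmeans(points, k=30, seed=0, max_iter=1000):
    """Cluster tuple-valued points and return the list of k centroids."""
    rng = random.Random(seed)
    # Step a: pick k distinct points as the initial centroids.
    centroids = rng.sample(sorted(set(points)), k)
    for _ in range(max_iter):
        # Step b: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step c: recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]     # assumption: empty cluster stays put
            for i, cluster in enumerate(clusters)
        ]
        # Step d: stop once no centroid has moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids
```

The index of a centroid in the returned list can serve as the number of the corresponding discrete traffic state. Like any k-means run, the result depends on the random initial centroids.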

After k-means clustering ends, l and l' in each sample (l, a, l', r) are each assigned to their nearest centroid, that is, converted into the discrete traffic states s and s' respectively, and the sample is rearranged as the data vector (s, a, s', r).

Step 4: initialize an arbitrary traffic signal control policy in the intersection unit, optimize it by policy iteration, and save the optimized policy and the centroids obtained in step 3 in the traffic signal controller;

In the single-intersection traffic signal control optimization problem there are 30 discrete traffic states, and in each discrete traffic state there are two control actions: a_1 denotes keeping the current phase and a_2 denotes switching to the next phase. The policy is optimized in the intersection unit by policy iteration, as follows:
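The conversion from (l, a, l', r) to (s, a, s', r) amounts to a nearest-centroid lookup, sketched below. Discrete states are numbered from 0 here for convenience, and the function names are assumptions.

```python
import math


def nearest_centroid(state, centroids) -> int:
    """Index (0-based) of the centroid closest to a normalized traffic state."""
    return min(range(len(centroids)), key=lambda i: math.dist(state, centroids[i]))


def discretize_samples(samples, centroids):
    """Turn (l, a, l', r) samples into (s, a, s', r) samples."""
    return [
        (nearest_centroid(l, centroids), a, nearest_centroid(l_next, centroids), r)
        for l, a, l_next, r in samples
    ]
```

The same `nearest_centroid` lookup is what the traffic signal controller needs at run time to map a freshly measured, normalized state to a discrete state before consulting the stored policy.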

Step a: set the iteration count to 1, initialize the Q values and the control policy, and compute the state transition matrix and the immediate reward matrices. The Q value of every state-action pair is initialized to zero and stored in the matrix Q. The immediate reward matrices R_1 and R_2 are estimated from the samples (s, a, s', r); R_1 and R_2 store the expected immediate rewards obtained after executing control actions a_1 and a_2 respectively. With i = 1, 2, ..., 30, j = 1, 2, ..., 30 and k = 1, 2, the matrices Q, R_1 and R_2 are defined as follows:

$Q = \left[\begin{matrix} Q(s_1,a_1) \\ Q(s_1,a_2) \\ Q(s_2,a_1) \\ Q(s_2,a_2) \\ \vdots \\ Q(s_{30},a_1) \\ Q(s_{30},a_2) \end{matrix}\right], \quad R_1 = \left[\begin{matrix} r(s_1,a_1,s_1) & r(s_1,a_1,s_2) & \cdots & r(s_1,a_1,s_{30}) \\ r(s_2,a_1,s_1) & r(s_2,a_1,s_2) & \cdots & r(s_2,a_1,s_{30}) \\ \vdots & \vdots & \ddots & \vdots \\ r(s_{30},a_1,s_1) & r(s_{30},a_1,s_2) & \cdots & r(s_{30},a_1,s_{30}) \end{matrix}\right],$

$R_2 = \left[\begin{matrix} r(s_1,a_2,s_1) & r(s_1,a_2,s_2) & \cdots & r(s_1,a_2,s_{30}) \\ r(s_2,a_2,s_1) & r(s_2,a_2,s_2) & \cdots & r(s_2,a_2,s_{30}) \\ \vdots & \vdots & \ddots & \vdots \\ r(s_{30},a_2,s_1) & r(s_{30},a_2,s_2) & \cdots & r(s_{30},a_2,s_{30}) \end{matrix}\right].$

Here Q(s_i, a_k) denotes the Q value of the state-action pair s_i-a_k, and r(s_i, a_k, s_j) denotes the immediate reward obtained when, in discrete traffic state s_i, control action a_k is executed and the system moves to discrete traffic state s_j. A control policy is initialized arbitrarily and stored in the matrix Π, defined as follows:

$\Pi = \begin{bmatrix} \pi(s_1, a_1) & \pi(s_1, a_2) & 0 & 0 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \pi(s_2, a_1) & \pi(s_2, a_2) & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & 0 & 0 & \pi(s_3, a_1) & \pi(s_3, a_2) & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & 0 & 0 & \cdots & \pi(s_{30}, a_1) & \pi(s_{30}, a_2) \end{bmatrix}.$

Here, π(s_i, a_k) denotes the probability of executing action a_k in discrete state s_i, and the elements in each row of Π sum to 1. The state transition matrix P is estimated from the samples (s, a, s', r) and is defined as follows:

$P = \begin{bmatrix} p(s_1 \mid s_1, a_1) & p(s_2 \mid s_1, a_1) & \cdots & p(s_{30} \mid s_1, a_1) \\ p(s_1 \mid s_1, a_2) & p(s_2 \mid s_1, a_2) & \cdots & p(s_{30} \mid s_1, a_2) \\ p(s_1 \mid s_2, a_1) & p(s_2 \mid s_2, a_1) & \cdots & p(s_{30} \mid s_2, a_1) \\ p(s_1 \mid s_2, a_2) & p(s_2 \mid s_2, a_2) & \cdots & p(s_{30} \mid s_2, a_2) \\ \vdots & \vdots & \ddots & \vdots \\ p(s_1 \mid s_{30}, a_1) & p(s_2 \mid s_{30}, a_1) & \cdots & p(s_{30} \mid s_{30}, a_1) \\ p(s_1 \mid s_{30}, a_2) & p(s_2 \mid s_{30}, a_2) & \cdots & p(s_{30} \mid s_{30}, a_2) \end{bmatrix}$

Here, the matrix element p(s_j | s_i, a_k) is a conditional probability: the probability that, after control action a_k is executed in discrete traffic state s_i, the system is in discrete traffic state s_j at the next sampling instant. The direct reward matrix R can be computed from the elements of R₁, R₂ and P. R is defined as follows:

$R = \begin{bmatrix} r(s_1, a_1) \\ r(s_1, a_2) \\ r(s_2, a_1) \\ r(s_2, a_2) \\ \vdots \\ r(s_{30}, a_1) \\ r(s_{30}, a_2) \end{bmatrix}.$

Here, r(s_i, a_k) denotes the expected direct reward obtained after executing control action a_k in discrete traffic state s_i. It is computed as follows:

$r(s_i, a_k) = \sum_{j=1}^{30} r(s_i, a_k, s_j)\, p(s_j \mid s_i, a_k).$
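The estimation of P and R from the recorded samples can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `estimate_model`, the zero-based state and action indices, the use of sample means for r(s_i, a_k, s_j), and the treatment of state-action pairs that never occur in the samples (uniform transition row, zero reward) are all assumptions that the text does not specify.

```python
def estimate_model(samples, n_states, n_actions):
    """Estimate P and R from (s, a, s_next, r) samples.

    Row s * n_actions + a of P approximates p(. | s, a), matching the row
    order of the matrix P; R uses the same row order.
    """
    n_rows = n_states * n_actions
    counts = [[0] * n_states for _ in range(n_rows)]
    reward_sums = [[0.0] * n_states for _ in range(n_rows)]
    for s, a, s_next, r in samples:
        row = s * n_actions + a
        counts[row][s_next] += 1
        reward_sums[row][s_next] += r

    P, R = [], []
    for row in range(n_rows):
        total = sum(counts[row])
        if total == 0:
            # Assumption: an unobserved pair gets a uniform row and zero reward.
            P.append([1.0 / n_states] * n_states)
            R.append(0.0)
            continue
        p_row = [c / total for c in counts[row]]
        # r(s, a, s_j): mean observed reward for that transition (0 if unseen).
        r_row = [reward_sums[row][j] / counts[row][j] if counts[row][j] else 0.0
                 for j in range(n_states)]
        P.append(p_row)
        # r(s, a) = sum_j r(s, a, s_j) * p(s_j | s, a)
        R.append(sum(r * p for r, p in zip(r_row, p_row)))
    return P, R


# Toy data with 2 states and 2 actions (the embodiment uses 30 states).
samples = [(0, 0, 0, 1.0), (0, 0, 1, 3.0), (0, 1, 1, 2.0),
           (1, 0, 0, 4.0), (1, 0, 0, 6.0)]
P, R = estimate_model(samples, n_states=2, n_actions=2)
```

In the toy data, the pair (state 1, action 1) never occurs, so its row falls back to the assumed default.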

Step b: update the Q values by updating the matrix Q according to the following formula:

$Q = (I - \gamma P \Pi)^{-1} R$

Here, I denotes the identity matrix, γ is the discount factor, set to 0.95, and (·)^{-1} denotes matrix inversion;
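The formula in step b can be read as the matrix form of the Bellman expectation equation for the current strategy. A short derivation using only the matrices defined above (with 30 discrete states and 2 actions, P is 60×30, Π is 30×60, and Q and R are 60×1):

$Q(s_i, a_k) = r(s_i, a_k) + \gamma \sum_{j=1}^{30} p(s_j \mid s_i, a_k) \sum_{l=1}^{2} \pi(s_j, a_l)\, Q(s_j, a_l)$

$Q = R + \gamma P \Pi Q \;\Longrightarrow\; (I - \gamma P \Pi)\, Q = R \;\Longrightarrow\; Q = (I - \gamma P \Pi)^{-1} R$

Because the rows of P are probability distributions and each row of Π sums to 1, the 60×60 product PΠ is row-stochastic, so for γ = 0.95 < 1 the matrix I − γPΠ is invertible.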

Step c: update the control strategy according to the Q values by updating the elements of the matrix Π according to the following formula:

$\pi(s_i, a_k) = \begin{cases} 1, & a_k = \underset{a \in \{a_1, a_2\}}{\arg\max}\; Q(s_i, a) \\ 0, & a_k \neq \underset{a \in \{a_1, a_2\}}{\arg\max}\; Q(s_i, a) \end{cases}$

Step d: if the iteration count is 1, save the matrix Π to a matrix Π' of the same dimensions, increment the iteration count by 1, and return to step b; otherwise, compute the 2-norm of the difference between Π and Π':

$D = \| \Pi - \Pi' \|_2$

If D equals 0, policy iteration ends. If D is not 0, save the matrix Π to the matrix Π', increment the iteration count by 1, and return to step b.
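Steps b to d can be sketched as a loop. The sketch below makes several assumptions that are not in the text: the names `solve` and `policy_iteration` are hypothetical, the strategy is held as a list of action indices rather than as the matrix Π, the linear system (I − γPΠ)Q = R is solved by a small Gaussian elimination instead of forming the inverse explicitly, and the stopping test compares successive strategies directly, which corresponds to D = 0. Unlike step d, it also compares against the initial strategy on the first pass.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def policy_iteration(P, R, n_states, n_actions, gamma=0.95):
    """Return (Q, policy); policy[s] is the greedy action index in state s."""
    n = n_states * n_actions
    policy = [0] * n_states  # arbitrary initial deterministic strategy
    while True:
        # Step b: build A = I - gamma * P * Pi. For a deterministic strategy,
        # (P Pi)[r][j * n_actions + policy[j]] = P[r][j]; other entries are 0.
        A = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
        for r in range(n):
            for j in range(n_states):
                A[r][j * n_actions + policy[j]] -= gamma * P[r][j]
        Q = solve(A, R)
        # Step c: greedy update (ties go to the lowest action index).
        new_policy = [max(range(n_actions), key=lambda a: Q[s * n_actions + a])
                      for s in range(n_states)]
        # Step d: stop when the strategy no longer changes.
        if new_policy == policy:
            return Q, policy
        policy = new_policy


# Toy model: 2 states, 2 actions; rows ordered (s0,a0), (s0,a1), (s1,a0), (s1,a1).
P = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
R = [0.0, 1.0, 2.0, 0.0]
Q, policy = policy_iteration(P, R, n_states=2, n_actions=2)
```

In the toy model, staying in state 1 pays the highest reward, so the loop settles on moving from state 0 to state 1 and then staying there.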

After policy iteration ends, the resulting control strategy is stored in the matrix Q; the matrix Q and the centroids obtained in step 3 are saved in the traffic signal controller;

Step 5: the intersection machine sets the control strategy of the traffic signal controller to the control strategy obtained in step 4. Every 10 seconds, the traffic signal controller receives the traffic state detected by the intersection machine, normalizes it, computes the distance from the normalized traffic state to each centroid, and finds the index of the nearest centroid, which is the index i of the discrete traffic state s_i. It then selects the control action a^* according to the following formula:

$a^* = \underset{a \in \{a_1, a_2\}}{\arg\max}\; Q(s_i, a)$

The traffic signal controller sends the control action a^* to the intersection machine for execution: if a^* is a₁, the current phase is maintained; if a^* is a₂, the signal switches to the next phase.
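The decision made in step 5 at each 10-second instant can be sketched as follows. The function name `choose_action`, the parameter `q_max`, and the normalization (dividing each queue length by `q_max`) are assumptions for illustration only; the actual normalization is whatever was applied to the samples in step 3. Q is taken as a flat list in the row order of the matrix Q.

```python
import math


def choose_action(state, centroids, Q, q_max, n_actions=2):
    """Map a raw traffic state to (discrete state index i, action index).

    state: (queue length of current phase, queue length of next phase).
    Action index 0 stands for a1 (keep phase), 1 for a2 (switch phase).
    """
    # Assumed normalization: scale each queue length by a maximum length.
    x = [v / q_max for v in state]
    # Index of the nearest centroid gives the discrete traffic state.
    i = min(range(len(centroids)), key=lambda c: math.dist(x, centroids[c]))
    # Greedy action with respect to the stored Q values.
    a = max(range(n_actions), key=lambda k: Q[i * n_actions + k])
    return i, a


# Toy data: 3 centroids instead of the 30 used in the embodiment.
centroids = [[0.1, 0.1], [0.8, 0.2], [0.2, 0.9]]
Q = [1.0, 0.5, 2.0, 1.0, 0.3, 0.7]
long_next_queue = choose_action((4, 18), centroids, Q, q_max=20)
long_current_queue = choose_action((16, 4), centroids, Q, q_max=20)
```

With these toy values, a long queue on the next phase selects the switch action, while a long queue on the current phase keeps the current phase.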

Claims

1. A traffic signal optimization control method based on policy iteration and clustering, comprising the following steps:

Step 1. Select fixed phase sequence control as the signal control scheme to be optimized; define the traffic state as the vehicle queue lengths of the current phase and the next phase; define the control action as either maintaining the current phase or switching to the next phase; define the direct reward as a variable related to the number of vehicles leaving the stop line in a single sampling period; define a state-action pair as a data vector composed of a discrete traffic state and a control action; define the Q value of each state-action pair as the expected cumulative reward obtained after the control action is executed in the corresponding discrete traffic state; and define the control strategy as the control action that should be executed in each discrete traffic state;

Step 2. The intersection machine sets the control strategy of the traffic signal controller to induction control. The minimum green time and the maximum green time are set to positive integer multiples of the sampling period, and the unit green extension time equals the sampling period. The traffic state, the control action and the number of vehicles leaving the stop line are sampled and recorded as follows: at each sampling instant, record the traffic state, the control action and the number of vehicles leaving the stop line in that sampling period;

Step 3. After the intersection machine has collected a specified number of samples, it discretizes the traffic states in the samples as follows: first normalize the sampled traffic states and remove the traffic states whose distance exceeds the preset threshold; then perform k-means clustering and number the resulting centroids, each centroid corresponding to one discrete traffic state; finally, represent each traffic state in the normalized samples by the number of its nearest centroid to obtain the corresponding discrete traffic state;

Step 4. The intersection machine uses policy iteration to optimize the control strategy, and saves the optimized strategy and the centroids obtained in step 3 in the traffic signal controller;

Step 5. The intersection machine sets the control strategy of the traffic signal controller to the control strategy obtained in step 4 and sets the decision period equal to the sampling period. At each decision instant, the traffic signal controller receives the traffic state detected by the intersection machine, normalizes it, computes the distance from the normalized traffic state to each centroid, finds the nearest centroid, looks up the control strategy for the discrete traffic state corresponding to that centroid, obtains the control action, and sends it to the intersection machine for execution.
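The k-means discretization of step 3 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `kmeans` is hypothetical, the initial centroids are simply the first k points (the claim does not specify an initialization), and an empty cluster keeps its previous centroid. The description uses 30 discrete traffic states; the toy call below uses k = 2.

```python
import math


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; returns (centroids, labels)."""
    # Assumption: take the first k points as initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point gets the number of its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep as is
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Normalized (current-phase queue, next-phase queue) pairs, toy values.
points = [(0.0, 0.0), (1.0, 1.0), (0.0, 0.2), (1.0, 0.8)]
centroids, labels = kmeans(points, k=2)
```

The returned labels are the discrete traffic state numbers used to index the rows of P, R and Q.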

2. The method as claimed in claim 1, wherein, after normalization and before k-means clustering, the traffic states whose distance exceeds a preset threshold are removed as follows: first, randomly select one traffic state and add it to an empty data set; then add the remaining traffic states in the normalized samples to the data set according to the following rule: if the Euclidean distances between a traffic state in the normalized samples and all the traffic states already in the data set are greater than the preset threshold, the traffic state is added to the data set; otherwise it is not added.
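The selection rule of claim 2 can be sketched as follows. The function name `thin_states` and the `seed` parameter are hypothetical, and the result depends on which state is drawn first and on the order in which the remaining states are visited. As the rule is worded, a state is kept only if it lies farther than the threshold from every state already kept.

```python
import math
import random


def thin_states(states, threshold, seed=0):
    """Keep a state only if it is farther than `threshold` from all kept states."""
    rng = random.Random(seed)
    remaining = list(states)
    # Start the data set with one randomly selected traffic state.
    kept = [remaining.pop(rng.randrange(len(remaining)))]
    for x in remaining:
        if all(math.dist(x, y) > threshold for y in kept):
            kept.append(x)
    return kept


# Normalized toy states: two near-duplicate pairs and one isolated state.
states = [(0.0, 0.0), (0.01, 0.0), (0.5, 0.5), (0.5, 0.51), (1.0, 1.0)]
kept = thin_states(states, threshold=0.1)
```

For this toy input, one state from each near-duplicate pair survives together with the isolated state, whichever state is drawn first.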