CN110837888A - Traffic missing data completion method based on a bidirectional recurrent neural network

Info

Publication number: CN110837888A
Application number: CN201911106967.7A
Authority: CN
Inventors: 申彦明; 徐文权; 齐恒; 尹宝才
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2019-11-13
Filing date: 2019-11-13
Publication date: 2020-02-25

Abstract

The invention provides a traffic missing data completion method based on a bidirectional recurrent neural network, and belongs to the field of traffic. First, the method exploits the time-series characteristics of the data and considers the influence of the data both before and after the completion time point on the current time point, which improves data utilization and completion accuracy. Second, it adds the influence of external features and of adjacent sensor data on the current sensor data to the completion model, which further improves completion accuracy. The method improves completion accuracy both when the data missing rate is low and when it is high.

Description

A Traffic Missing Data Completion Method Based on a Bidirectional Recurrent Neural Network

Technical Field

The invention belongs to the field of traffic, and in particular relates to a traffic missing data completion method based on a bidirectional recurrent neural network.

Background

Traffic flow data from road loop detectors exhibit periodicity, temporal ordering and trends. At present, methods for completing traffic flow data rely mainly on this temporal ordering.

Time-series-based completion takes the data from a period before the current missing point and passes them through a neural network to complete the missing point. For example, to complete the traffic flow at 16:00 today, the data from 8:00 to 15:00 on the same day are used as input, and a recurrent neural network produces the value at the next time point, 16:00. This history-based approach makes good use of the temporal ordering of the data and gives relatively good results, but it has a limitation. When a special event occurs, the current missing point may itself be preceded by a run of missing points; a power outage, for example, causes a continuous segment of data to be lost. When the last missing point of such a segment is completed, the input data are severely incomplete, and the completion quality is very poor.

Neural networks were originally inspired by biological nervous systems and arose as an attempt to simulate them; they consist of a large number of interconnected nodes (neurons). A neural network adjusts its weights in response to its inputs, improving the behavior of the system and automatically learning a model that solves the problem. LSTM (long short-term memory network) is a special form of RNN (recurrent neural network) that mitigates the vanishing-gradient and exploding-gradient problems in training multi-layer neural networks and can handle sequences with long-term dependencies. An LSTM can capture the time-series characteristics of traffic flow data, and using an LSTM model can effectively improve completion accuracy.

An LSTM network consists of LSTM units, each of which comprises a cell, an input gate, an output gate and a forget gate.

Forget gate: decides how much information from the output state of the previous unit is discarded. The formula is as follows:

$$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)$$

where $f_t$ is the output of the forget gate, $x_t$ is the input sequence, $h_{t-1}$ is the output of the previous unit, $\sigma_g$ denotes the sigmoid function, $W_f$ is the weight matrix for the input, $U_f$ is the weight matrix for the output of the previous unit, and $b_f$ is the bias vector.

Input gate: decides how much new information is added to the cell state and updates the cell state $c$. The formulas are as follows:

$$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)$$

$$c_t = f_t \circ c_{t-1} + i_t \circ \sigma_c(W_c x_t + U_c h_{t-1} + b_c)$$

where $c_t$ is the cell state of the current unit, $\sigma_g$ and $\sigma_c$ denote sigmoid functions, $\circ$ denotes the element-wise product, $W_i$ is the weight matrix for the input, $U_i$ is the weight matrix for the output of the previous unit, $b_i$ is the bias vector, $f_t$ is the output of the forget gate, $c_{t-1}$ is the cell state of the previous unit, $W_c$ is the weight matrix for the input, $U_c$ is the weight matrix for the output of the previous unit, and $b_c$ is the bias vector.

Output gate: produces the output based on the current cell state.

$$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)$$

$$h_t = o_t \circ \sigma_h(c_t)$$

where $h_t$ is the output of the current unit, $\sigma_g$ and $\sigma_h$ denote sigmoid functions, $\circ$ denotes the element-wise product, $W_o$ is the weight matrix for the input, $U_o$ is the weight matrix for the output of the previous unit, and $b_o$ is the bias vector.
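As an illustration of how the three gates interact, the following is a minimal sketch of a single LSTM step in plain Python. It is not the patent's implementation: the helper names (`sigmoid`, `affine`, `lstm_step`) and the `params` layout are assumptions made for this example, and the sketch uses `tanh` for the cell and output activations, as in the commonly used LSTM formulation, whereas the description above refers to them as sigmoid functions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, U, b, x, h):
    """Compute W x + U h + b with list-of-lists matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params maps each of 'f', 'i', 'o', 'c' to a (W, U, b) tuple."""
    pre = {k: affine(*params[k], x, h_prev) for k in "fioc"}
    f = [sigmoid(z) for z in pre["f"]]          # forget gate
    i = [sigmoid(z) for z in pre["i"]]          # input gate
    o = [sigmoid(z) for z in pre["o"]]          # output gate
    c_tilde = [math.tanh(z) for z in pre["c"]]  # candidate cell state (tanh is an assumption)
    # New cell state: keep part of the old state, add part of the candidate.
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    # New output: gated view of the squashed cell state.
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the candidate state to 0, so the new cell state is simply half of the previous one; this makes a convenient sanity check.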

Summary of the Invention

The invention proposes a traffic missing data completion method based on a bidirectional recurrent neural network. It is a deep learning completion method that exploits the temporal, periodic and spatial characteristics of the data, and aims to improve the completion accuracy of road traffic flow data.

Technical solution of the invention:

A traffic missing data completion method based on a bidirectional recurrent neural network, with the following steps:

Step 1: preprocess the traffic flow data.

The preprocessing comprises time-granularity division and data standardization.

Step 2: apply random data point loss to the preprocessed data to construct a data set with missing points, then record the positions of the missing points for use as verification values, so that the completion performance of the method can be verified.
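The random loss processing in Step 2 can be sketched as follows. This is an illustrative example only: the function name `drop_random_points`, the use of `None` as the missing marker and the fixed seed are assumptions, not details given in the text.

```python
import random

def drop_random_points(series, missing_rate, seed=0):
    """Mark a fraction of points as missing and keep their true values for verification."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    n_missing = round(len(series) * missing_rate)
    missing_idx = sorted(rng.sample(range(len(series)), n_missing))
    # Positions and true values of the removed points (the verification values).
    held_out = {i: series[i] for i in missing_idx}
    # None marks a missing point in the data set handed to the model.
    masked = [None if i in held_out else v for i, v in enumerate(series)]
    return masked, held_out
```

For example, `drop_random_points(flows, 0.2)` removes 20% of the points in a hypothetical list `flows`, which corresponds to the low missing rate discussed in the description.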

At the same time, a time-dimension influence decay matrix is constructed. Missing data often occur in consecutive runs; for example, damage to a sensor's power supply causes data loss for some time afterwards. As the gap grows, the influence of historical data on the missing points becomes smaller and smaller, which affects completion accuracy, so the decay of this influence over time must be recorded. The time-dimension influence decay matrix is defined as follows:

[formula not reproduced in this text]

where $n_t$ denotes the current time step, and the remaining term [symbol not reproduced in this text] is defined as follows:

[formula not reproduced in this text]
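Because the defining formulas are not available in this text, the following sketch shows only one plausible reading of such a decay record, following a convention used in some recurrent imputation models: for each time step, store the time elapsed since the last observed value. The function name `time_gaps` and the rule itself are assumptions and may differ from the definition in the original publication.

```python
def time_gaps(timestamps, observed):
    """Elapsed time since the last observed value, per time step (assumed definition)."""
    gaps = [0.0]  # by convention, the first step has no gap
    for t in range(1, len(timestamps)):
        step = timestamps[t] - timestamps[t - 1]
        # If the previous point was observed, the gap restarts; otherwise it accumulates.
        gaps.append(step if observed[t - 1] else step + gaps[t - 1])
    return gaps
```

With 5-minute timestamps `[0, 5, 10, 15, 20]` and the second and third points missing, the gaps grow across the missing run and reset once an observed point has been passed.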

Step 3: divide the traffic flow data, after loss processing, into a training set, a validation set and a test set. In each set, the modules use the following types of data:

Data used by the forward time series deep learning module: [symbol not reproduced in this text]

Data used by the reverse time series deep learning module: [symbol not reproduced in this text]

External feature data used by the external feature module: $F_n$

Periodic sequence data used by the periodic feature module: [symbol not reproduced in this text]

Here $n$ denotes the current time step, $t$ denotes the number of steps in the time series, and $p$ denotes the number of steps in the periodic sequence. $S$ denotes the traffic flow data, and $T$ denotes the reverse sequence of $S$ in the time dimension. $s_i$ denotes the traffic flow data at time $n$. The remaining symbols, which are not reproduced in this text, denote respectively the traffic flow data at the same time of day $i$ days before time $n$, the set of traffic flow data for the $t$ time steps preceding time $n$, and the set of traffic flow data at the same time of day on the $p$ days preceding the day of time $n$. $F_n$ denotes the external features at time $n$, including holidays, location area, weather and temperature.
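The three kinds of sequence input described in Step 3 can be illustrated with simple list slicing. This is a sketch under assumptions: the series is taken to be a flat list at a fixed time granularity, `steps_per_day` is a hypothetical parameter giving the number of samples per day, and the reverse window is taken to be the `t` steps after time `n` in reversed order.

```python
def build_windows(series, n, t, p, steps_per_day):
    """Return (forward, backward, periodic) inputs for completing position n."""
    if n - t < 0 or n + t >= len(series) or n - p * steps_per_day < 0:
        raise ValueError("not enough data around position n")
    forward = series[n - t:n]                  # the t steps before n, oldest first
    backward = series[n + 1:n + 1 + t][::-1]   # the t steps after n, latest first
    # Same time of day on each of the previous p days.
    periodic = [series[n - i * steps_per_day] for i in range(1, p + 1)]
    return forward, backward, periodic
```

At a 5-minute granularity `steps_per_day` would be 288, so the periodic window reaches back 288 samples per day.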

Step 4: build the completion model. The completion model comprises a forward time series deep learning module, a reverse time series deep learning module, a periodic feature module and an external feature module. The structure and training mechanism of each module are as follows:

(1) Forward time series deep learning module: a model that combines a linear regression network with a multi-layer long short-term memory (LSTM) network. A single linear regression layer adds information about the temporal continuity of the current missing point, which helps with long runs of missing data and improves completion accuracy.

Implementation details of the forward module: the time-dimension decay matrix is first fed into the linear regression network; the output of the linear regression network and the forward time series data are then fed into the LSTM network. For the input value $x_t$ at the current time step, if the data point is not missing it is input directly; if it is missing, the hidden state of the previous time step is used as the input for the current time step. After the inputs have been processed, the deep learning network is trained, and the final output of the forward module is obtained through repeated iterative updates.

(2) Reverse time series deep learning module: identical in network structure to the forward module; the difference is that its input is the input of the forward module reversed in the time dimension.

(3) Periodic feature module: a module consisting of a three-layer fully connected network. By extracting features from the periodic data, it captures how the traffic flow of the same sensor in the same time period varies across the historical data, and outputs the extracted features. Implementation details: the periodic sequence data are fed into the fully connected layers; after three fully connected layers, the temporal features of the periodic data are extracted and output.

(4) External feature module: consists of two parts. The first part handles holiday and weather features and is a single feature encoding layer. Implementation details: the external feature data are fed into the feature encoding layer and converted into vector form, and the resulting vector is merged with the outputs of the other three modules.

The second part handles spatial features. To take information across the road space into account, all sensors on the road segment are fed into the second part simultaneously; the hidden states of the other sensors at the same time step as the current sensor's missing point are taken as input, weights are computed through a Softmax network, and the resulting output is fed into the forward and reverse time series deep learning modules.

Finally, the outputs of the four modules are merged into a one-dimensional vector and passed through one fully connected layer to obtain the final completion result.
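The final fusion step, merging the module outputs into one vector and applying one fully connected layer, can be sketched as below. The function name `fuse_and_predict` and the single scalar output are assumptions for illustration; in practice the weights would be learned during training.

```python
def fuse_and_predict(module_outputs, weights, bias):
    """Concatenate module outputs and apply one linear (fully connected) layer."""
    # Flatten the per-module feature vectors into a single one-dimensional vector.
    fused = [v for out in module_outputs for v in out]
    if len(fused) != len(weights):
        raise ValueError("weight vector must match the fused feature length")
    return sum(w * v for w, v in zip(weights, fused)) + bias
```

Here `module_outputs` would hold, in order, the outputs of the forward, reverse, periodic and external feature modules.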

Step 5: use the training set to pre-train the pre-training part of the forward and reverse time series deep learning modules. This optimizes the parameters of the time series deep learning model in advance and helps avoid converging to a local optimum during joint training.

Step 6: use the training set and the validation set to train the four modules built in Step 4 jointly:

The preprocessed data are fed into the corresponding modules, and all modules are trained together. After each training pass, the loss between the completed values and the true traffic flow values is computed, and the model parameters are trained toward the target. Based on the model's performance on the training and validation sets, the hyperparameters are tuned repeatedly to improve completion accuracy while limiting overfitting.

The input data comprise: the forward time series data (traffic flow data for the preceding $t_1$ hours), the reverse time series data (traffic flow data for the following $t_2$ hours), the periodic sequence data (traffic flow data at the same time of day on the preceding $t_3$ days), the time-dimension influence decay matrix, the external feature data $F_n$ (holiday, area, weather and temperature features at time $n$), and the true value of the traffic flow data (the traffic flow data at the current time step). Apart from $F_n$, the symbols for these inputs are not reproduced in this text.

One iteration yields the traffic flow data after one completion pass. These data are used as the input to the next iteration. Although the previously missing points now hold completed values, their labels still mark them as missing, so later iterations continue to complete them. Because values relatively close to the truth are already present, they act as prior knowledge, which can speed up model convergence and improve completion accuracy.
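The iteration scheme described above, in which observed points are kept and missing points are overwritten with the latest completed values, can be sketched as follows. The names `refine` and `neighbour_mean` are hypothetical; `neighbour_mean` is only a toy stand-in for the trained model, and the initial fill value of 0.0 is an assumption.

```python
def refine(observed, complete_fn, n_iterations):
    """Repeatedly complete a series; None marks a missing point in `observed`."""
    current = [0.0 if v is None else v for v in observed]  # assumed initial fill
    for _ in range(n_iterations):
        predictions = complete_fn(current)
        # Observed points stay fixed; missing points take the newest completion.
        current = [p if v is None else v for v, p in zip(observed, predictions)]
    return current

def neighbour_mean(xs):
    """Toy completion function: average of the left and right neighbours."""
    last = len(xs) - 1
    return [(xs[max(i - 1, 0)] + xs[min(i + 1, last)]) / 2 for i in range(len(xs))]
```

Calling `refine([1.0, None, 3.0], neighbour_mean, 2)` fills the middle point from its neighbours while leaving the two observed values untouched.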

Step 7: apply the model trained in Step 6 to the test set to complete the traffic flow data.

The input data are: the forward time series data, the reverse time series data, the periodic sequence data, the time-dimension influence decay matrix, the external feature data, and the true value of the traffic flow data. The symbols for these inputs are not reproduced in this text.

The model from Step 6 produces the completed values for the missing traffic flow data. These are compared with the verification values obtained from the loss processing in Step 2 to verify the completion performance of the model.

In Step 1, the preprocessing proceeds as follows:

(1) Time-granularity division: all traffic flow data are aggregated at a granularity of k minutes, giving the traffic flow for each k-minute interval.

(2) Data standardization: the traffic flow data are standardized using their minimum and maximum values, with the following formula:

$$x^* = \frac{x - x_{min}}{x_{max} - x_{min}} \, (max - min) + min$$

where $x$ is the original value, $x_{min}$ is the minimum of the original values, $x_{max}$ is the maximum of the original values, $max$ is the upper bound of the normalized range, $min$ is the lower bound of the normalized range, $[min, max]$ is the normalized interval, and $x^*$ is the standardized result.
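Both preprocessing operations are straightforward to sketch. The example below assumes per-minute counts as raw input and the standard min-max scaling to a target interval; the function names `aggregate` and `min_max_scale` are chosen for illustration only.

```python
def aggregate(counts_per_minute, k):
    """Sum consecutive groups of k per-minute counts; an incomplete last group is dropped."""
    return [
        sum(counts_per_minute[i:i + k])
        for i in range(0, len(counts_per_minute) - k + 1, k)
    ]

def min_max_scale(values, lower=0.0, upper=1.0):
    """Scale values linearly so that the minimum maps to `lower` and the maximum to `upper`."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("cannot scale a constant series")
    return [(x - x_min) / (x_max - x_min) * (upper - lower) + lower for x in values]
```

The minimum and maximum used for scaling need to be kept if completed values are later mapped back to the original units.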

In Step 4, the part that accounts for road spatial information (Softmax processing) works as follows. Let the hidden states of all sensors at the current time step be $h = \langle h_1, h_2, h_3, \ldots, h_i, \ldots, h_t \rangle$, where $h_i$ is the hidden state of the $i$-th sensor at the current time step. A weight is then computed for each $h_i$, giving the new hidden state $h'_i$ of the current sensor.

[formula not reproduced in this text]

After Softmax processing, all the weights sum to 1. Here $l$ denotes the number of sensors, and $h_{ij}$ denotes the hidden state of the $j$-th sensor at time $i$.
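The weighting formula itself is not available in this text, so the sketch below only illustrates the general idea of Softmax weighting over sensor hidden states. The scoring rule, a dot product between the current sensor's hidden state and each sensor's hidden state, is an assumption, as are the function names `softmax` and `spatial_context`.

```python
import math

def softmax(scores):
    """Numerically stable softmax; the returned weights sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def spatial_context(hidden_states, current):
    """Weight all sensors' hidden states relative to sensor `current` (assumed scoring)."""
    query = hidden_states[current]
    # Assumed score: dot product between the current sensor's state and each sensor's state.
    scores = [sum(q * k for q, k in zip(query, h)) for h in hidden_states]
    weights = softmax(scores)
    # New hidden state: weighted sum of all sensors' hidden states.
    mixed = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(len(query))
    ]
    return weights, mixed
```

Whatever scoring rule is used, the Softmax guarantees that the weights are positive and sum to 1, as stated above.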

In Step 6, the MAE between the completed data obtained in each iteration and the true traffic flow values is computed (the original text describes this quantity as a mean square error), and the Adam method is used to minimize it.

[formula not reproduced in this text]

where $x'_i$ denotes the true sensor value at time $i$, and $x_i$ denotes the completed sensor value at time $i$.
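Since the text uses the abbreviation MAE while describing a mean square error, and the formula is not available here, the sketch below computes both the mean absolute error and the mean squared error over the artificially removed points. The function name `masked_errors` is an assumption, and the Adam optimization itself is not shown.

```python
def masked_errors(truth, completed, missing_idx):
    """Return (mean absolute error, mean squared error) over the missing positions only."""
    diffs = [completed[i] - truth[i] for i in missing_idx]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    mse = sum(d * d for d in diffs) / len(diffs)
    return mae, mse
```

Restricting the error to the missing positions keeps observed points, which are passed through unchanged, from diluting the score.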

Beneficial effects of the invention: the invention differs from existing methods in three respects. First, it improves how the temporal ordering of the data is used. Earlier methods usually consider only the influence of historical data on the current time point, but in traffic flow completion the data at later time points also carry information about the current time point. The invention considers the forward and the reverse time series together, which improves completion accuracy. Second, it takes into account the influence of external features, such as holidays, and of neighboring sensors on the traffic flow data and adds them to the completion model, which improves completion accuracy and the completion of unusual values. Finally, it accounts for the decay over time of the influence of historical data across runs of missing data, which further improves completion accuracy. The method improves completion accuracy for traffic flow data with a low missing rate and also achieves good completion results when the missing rate is high.

Description of Drawings

Figure 1 is a structural diagram of the completion model of the invention.

Figure 2 compares the completion results with the true values at a low data missing rate of 20%.

Figure 3 compares the completion results with the true values at a high data missing rate of 50%.

Detailed Description

The technical solution of the invention is further described below with reference to a specific embodiment and the accompanying drawings.

Step 1: preprocess the traffic flow data.

(1) Time-granularity division: all traffic flow data are aggregated at a granularity of 5 minutes, giving the traffic flow for each 5-minute interval.

(2) Data standardization: the traffic flow data are standardized using their minimum and maximum values, with the following formula:

$$x^* = \frac{x - x_{min}}{x_{max} - x_{min}} \, (max - min) + min$$

Step 2: apply random data point loss to the preprocessed data. Using random numbers, a chosen proportion of the data (set according to the requirements of the experiment) is labeled as missing and serves as the missing points. The values of these points are then recorded as ground truth for verifying the final completion performance of the model.

At the same time, the time-dimension influence decay matrix is built. Missing data often occur in consecutive runs; for example, a power outage may prevent a sensor from collecting data for several hours. As the gap grows, the influence of historical data on the missing points becomes smaller and smaller, which affects completion accuracy, so the decay of this influence over time must be recorded. The time-dimension influence decay matrix is defined as follows:

[formula not reproduced in this text]

where $n_t$ denotes the current time step, and the remaining term [symbol not reproduced in this text] is defined as follows:

[formula not reproduced in this text]

Step 3: divide the preprocessed traffic flow data into a training set, a validation set and a test set in the ratio 8:1:1. In each set, the modules use the following types of data:

Data used by the forward time series deep learning module: [symbol not reproduced in this text]

Data used by the reverse time series deep learning module: [symbol not reproduced in this text]

External feature data used by the external feature module: $F_n$

Periodic sequence data used by the periodic feature module: [symbol not reproduced in this text]

Here $n$ denotes the current time step, $t$ denotes the number of steps in the time series, and $p$ denotes the number of steps in the periodic sequence. $S$ denotes the traffic flow data, and $T$ denotes the reverse sequence of $S$ in the time dimension. $s_i$ denotes the traffic flow data at time $n$. The remaining symbols, which are not reproduced in this text, denote respectively the traffic flow data at the same time of day $i$ days before time $n$, the set of traffic flow data for the $t$ time steps preceding time $n$, and the set of traffic flow data at the same time of day on the $p$ days preceding the day of time $n$. $F_n$ denotes the external features at time $n$, including holidays, location area, weather and temperature.
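The 8:1:1 division used in this embodiment can be sketched as follows. The text does not say whether the split is chronological or random; this example assumes a chronological split, which is a common choice for time series, and the function name `split_811` is hypothetical.

```python
def split_811(samples):
    """Split samples in order into training, validation and test sets (about 8:1:1)."""
    n = len(samples)
    n_train = n * 8 // 10
    n_val = n // 10
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]  # the remainder, so no sample is lost to rounding
    return train, val, test
```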

Step 4: build the completion model. The completion model comprises a forward time series deep learning module, a reverse time series deep learning module, a periodic feature module and an external feature module. The structure and training mechanism of each module are as follows:

(1) Forward time series deep learning module: a model that combines a linear regression network with a multi-layer long short-term memory (LSTM) network. A single linear regression layer adds information about the temporal continuity of the current missing point, which helps with long runs of missing data and improves completion accuracy.

(2) Reverse time series deep learning module: identical in network structure to the forward module; the difference is that its input is the input of the forward module reversed in the time dimension.

(3) Periodic feature module: a module consisting of a three-layer fully connected network. By extracting features from the periodic data, it captures how the traffic flow of the same sensor in the same time period varies across the historical data, and outputs the extracted features. Implementation details: the periodic sequence data are fed into the fully connected layers; after three fully connected layers, the temporal features of the periodic data are extracted and output.

(4) External feature module: a single feature encoding layer. Implementation details: the external feature data are fed into the feature encoding layer. External features described in words, such as weather and holidays, are converted by assigning levels; for example, a holiday is represented by 1 and a non-holiday by 0. The external feature data are thereby converted into vector form, and the resulting vector is passed to the next step.
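The level-based encoding of external features can be sketched as below. Only the holiday encoding (1 for a holiday, 0 otherwise) is stated in the text; the weather categories and their levels in `WEATHER_LEVELS`, the `region_id` argument and the function name `encode_external` are hypothetical choices for this example.

```python
# Hypothetical weather levels; the text only states that features are graded into levels.
WEATHER_LEVELS = {"sunny": 0, "cloudy": 1, "rain": 2, "snow": 3}

def encode_external(is_holiday, weather, temperature_c, region_id):
    """Encode the external features at one time step as a numeric vector."""
    return [
        1.0 if is_holiday else 0.0,      # holiday flag, as described in the text
        float(WEATHER_LEVELS[weather]),  # graded weather level (assumed mapping)
        float(temperature_c),            # air temperature
        float(region_id),                # location area identifier (assumed numeric)
    ]
```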

To take information across the road space into account, a spatial feature learning module is also added. All sensors on the road segment are fed into the model simultaneously; the hidden states of the other sensors at the same time step as the current sensor's missing point are taken as input, weights are computed through a Softmax network, and the resulting output is fed into the forward and reverse time series modules.

Finally, the outputs of all modules are merged into a one-dimensional vector and passed through one fully connected layer to obtain the final completion result.

Step 5: use the training set to pre-train the pre-training part of the time series deep learning model. This optimizes the parameters of the time series deep learning model in advance and helps avoid converging to a local optimum during joint training.

Step 6: use the training set and the validation set to train the four modules built in Step 4 jointly (points with missing data are replaced by their completed values; points without missing data keep their original values):

The preprocessed data are fed into the corresponding modules, and all modules are trained together. After each training pass, the loss between the completed values and the true traffic flow values is computed, and the model parameters are trained toward the target. Based on the model's performance on the training and validation sets, the hyperparameters are tuned repeatedly to improve completion accuracy while limiting overfitting. During training, the MAE (described in the original text as a mean square error) between the completed data obtained in each iteration and the true traffic flow values is computed, and the Adam method is used to minimize it.

The input data comprise: the forward time series data (traffic flow data for the preceding $t_1$ hours), the reverse time series data (traffic flow data for the following $t_2$ hours), the time-dimension influence decay matrix, the periodic sequence data (traffic flow data at the same time of day on the preceding $t_3$ days), the external feature data $F_n$ (holiday, area, weather and temperature features at time $n$), and the true value of the traffic flow data (the traffic flow data at the current time step). Apart from $F_n$, the symbols for these inputs are not reproduced in this text.

The input data are: the forward time series data, the reverse time series data, the periodic sequence data, the external feature data, the true value of the traffic flow data, and the time-dimension influence decay matrix. The symbols for these inputs are not reproduced in this text.

Figure 2 compares the completion results with the true values at a data missing rate of 20%. The error (MAE) between the model's completion results and the true traffic flow values is 29.18. (The first 100 missing points are shown in the figure.)

Figure 3 compares the completion results with the true values at a data missing rate of 50%. The error (MAE) between the model's completion results and the true traffic flow values is 31.94. (The first 100 missing points are shown in the figure.)

Claims

1. A traffic missing data completion method based on a bidirectional recurrent neural network, characterized in that the steps are as follows:

Step 1: preprocess the traffic flow data;

the preprocessing comprises time-granularity division and data standardization;

Step 2: apply random data point loss to the preprocessed data to construct a data set with missing points, then record the positions of the missing points as verification values; at the same time, construct the time-dimension influence decay matrix:

[formula not reproduced in this text]

where $n_t$ denotes the current time step, and the remaining term [symbol not reproduced in this text] is defined as follows:

[formula not reproduced in this text]

Step 3: divide the traffic flow data, after loss processing, into a training set, a validation set and a test set; in each set, the modules use the following types of data:

data used by the forward time series deep learning module: [symbol not reproduced in this text];

data used by the reverse time series deep learning module: [symbol not reproduced in this text];

external feature data used by the external feature module: $F_n$;

periodic sequence data used by the periodic feature module: [symbol not reproduced in this text];

where $n$ denotes the current time step, $t$ denotes the number of steps in the time series, and $p$ denotes the number of steps in the periodic sequence; $S$ denotes the traffic flow data, and $T$ denotes the reverse sequence of $S$ in the time dimension; $s_i$ denotes the traffic flow data at time $n$; the remaining symbols, which are not reproduced in this text, denote respectively the traffic flow data at the same time of day $i$ days before time $n$, the set of traffic flow data for the $t$ time steps preceding time $n$, and the set of traffic flow data at the same time of day on the $p$ days preceding the day of time $n$; $F_n$ denotes the external features at time $n$, including holidays, location area, weather and temperature;

The fourth step is to build a completion model. The completion model includes a forward time series deep learning module, a reverse time series deep learning module, a periodic feature module and an external feature module. The structure and training mechanism of each module are as follows:

(1) Forward time series deep learning module: a combination of a linear regression network and a multi-layer long short-term memory (LSTM) network. A layer of linear regression network adds the temporal continuity information of the current missing points, which improves completion accuracy when long sequences are missing;

Implementation details of the forward sequence deep learning module: first, the time-dimension influence decay matrix is input into the linear regression network, and the output of the linear regression network is then combined with the forward time series data and input into the LSTM network. In the LSTM network, the value x_t at the current moment is input: if the data point is not missing, it is input directly; if the data point is missing, the hidden state of the previous moment is used as the input at the current moment. The deep learning network is then trained, and the final output of the forward sequence deep learning module is obtained through continued iterative updates;

(2) Reverse time series deep learning module: the network structure is the same as that of the forward sequence deep learning module; the difference is that the input of the forward time series deep learning module is reversed in the time dimension and used as this module's input;

(3) Periodic feature learning module: a module composed of a three-layer fully connected network. By extracting features from the periodic data, it captures how the traffic flow of the same sensor in the same time period varies across the historical data, and then outputs the extracted features; implementation details: the periodic sequence data is passed through three fully connected layers, which extract the time series features of the periodic data, and the result is then output;

(4) External feature module: this module consists of two parts. The first part handles holiday and weather features and is a feature coding layer; implementation details: the external feature data is input into the feature coding layer and converted into vector form, and the resulting vector is then combined with the outputs of the above three modules;

The second part handles spatial features. All sensors on the road section are input into the second part at the same time, and the hidden states of the other sensors at the moment of the current sensor's missing point are used as input. The weights are calculated through a Softmax network to obtain the output, which is then input into the forward and reverse time series deep learning modules;

Finally, the outputs of the above four modules are combined into a one-dimensional vector, and the final completion result is obtained through a layer of fully connected network;
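The rule for missing inputs in the forward and reverse modules can be illustrated with a small NumPy sketch. Several simplifications are assumptions made for brevity: a single tanh recurrent cell stands in for the multi-layer LSTM, a linear readout `w_out` turns the previous hidden state into a scalar estimate, and the two directions are fused by a plain average instead of the fully connected layer described above. All function and parameter names are hypothetical.

```python
import numpy as np

def recurrent_pass(x, observed, params):
    """Run a tanh recurrent cell (stand-in for the LSTM) over the series x.

    At an observed step the real value is fed in; at a missing step an
    estimate read out from the previous hidden state is fed in instead.
    """
    w_in, w_h, w_out = params["w_in"], params["w_h"], params["w_out"]
    h = np.zeros(w_h.shape[0])
    estimates = np.empty(len(x))
    for k in range(len(x)):
        estimates[k] = w_out @ h                  # estimate from previous state
        x_in = x[k] if observed[k] else estimates[k]
        h = np.tanh(w_in * x_in + w_h @ h)        # update the hidden state
    return estimates, h

def bidirectional_estimate(x, observed, fwd_params, bwd_params):
    """Average forward and time-reversed estimates (assumed fusion rule)."""
    fwd, _ = recurrent_pass(x, observed, fwd_params)
    bwd, _ = recurrent_pass(x[::-1], observed[::-1], bwd_params)
    return 0.5 * (fwd + bwd[::-1])

def init_params(hidden, seed):
    """Small random weights; a trained model would supply these instead."""
    rng = np.random.default_rng(seed)
    return {"w_in": rng.normal(0, 0.1, hidden),
            "w_h": rng.normal(0, 0.1, (hidden, hidden)),
            "w_out": rng.normal(0, 0.1, hidden)}
```

Because a missing step never reads `x[k]`, whatever placeholder is stored at a missing position has no effect on the result; only the observed values and the hidden state carry information forward.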

The fifth step is to use the training set data to pre-train the pre-training part of the forward time series deep learning module and the reverse time series deep learning module, so that the parameters of the time series deep learning model are optimized in advance and are not driven into a local optimum during the overall training;

The sixth step is to use the training set data and the validation set data to perform overall training on the four modules established in the fourth step:

The preprocessed data are input into their corresponding modules, and all modules are trained together, with the true value of the traffic flow data as the target value; based on the model's performance on the training set and the validation set, the hyperparameters of the model are tuned repeatedly to improve completion accuracy while reducing overfitting;

The input data includes:

Forward time series data: traffic flow data for the preceding t₁ hours

Reverse time series data: traffic flow data for the following t₂ hours

Periodic series data: traffic flow data at the same time on the preceding t₃ days

Time-dimension influence decay matrix:

External feature data: the holiday, area, weather and temperature external feature data F_n at the nth time;

The true value of the traffic flow data: the traffic flow data at the current moment

After one iteration, the traffic flow data produced by one completion pass is obtained, and this data is used as the input of the next iteration. Although the missing points now hold completed values, their labels still mark them as missing, so the subsequent iterations continue to target these points for completion;
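The iteration scheme described above can be sketched as follows. Here `model` is a hypothetical callable that returns an estimate for every position of a filled series, and `neighbour_mean` is only a toy stand-in for the trained completion model; the initial placeholder value of 0.0 is likewise an assumption.

```python
import numpy as np

def iterative_completion(values, missing, model, iterations):
    """Re-run `model` several times, overwriting only the flagged points."""
    current = np.where(missing, 0.0, values)      # initial placeholder fill
    for _ in range(iterations):
        estimate = model(current)
        # Observed points keep their real value; missing points stay flagged
        # and are overwritten again on every pass.
        current = np.where(missing, estimate, values)
    return current

def neighbour_mean(series):
    """Toy stand-in for the trained model: mean of the two neighbours."""
    padded = np.pad(series, 1, mode="edge")
    return 0.5 * (padded[:-2] + padded[2:])
```

The `missing` flags never change between passes, which is what keeps the previously completed points as targets in every later iteration.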

The seventh step is to use the test set to complete the traffic flow data with the model trained in the sixth step;

The input data are the forward time series data, the reverse time series data, the periodic sequence data, the time-dimension influence decay matrix, the external feature data, and the true value of the traffic flow data.

The completed values of the missing traffic flow data are obtained through the model from the sixth step and are compared with the verification values recorded after the loss processing in the second step, in order to verify the completion performance of the model.
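The loss processing of the second step and the verification of the seventh step can be sketched together in NumPy. The function names are hypothetical, the data is assumed to be a one-dimensional series with a missing rate greater than zero, and the error is computed as the mean absolute deviation at the dropped positions, which is one reading of the MAE named in the claims.

```python
import numpy as np

def drop_random_points(data, missing_rate, seed=0):
    """Hide a random fraction of points; return masked data and positions."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n_missing = int(round(missing_rate * data.size))
    positions = rng.choice(data.size, size=n_missing, replace=False)
    masked = data.copy()
    masked[positions] = np.nan                    # NaN marks a missing point
    return masked, np.sort(positions)

def verify(completed, original, positions):
    """Mean absolute deviation at the dropped positions only."""
    completed = np.asarray(completed, dtype=float)
    original = np.asarray(original, dtype=float)
    return float(np.mean(np.abs(completed[positions] - original[positions])))
```

Restricting the comparison to the recorded positions means the observed points, which the model is given directly, do not dilute the reported error.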

2. The traffic missing data completion method based on a bidirectional recurrent neural network according to claim 1, characterized in that, in the first step, the specific preprocessing process is:

(1) Time granularity division: all traffic flow data are processed into traffic flow data per k minutes, using a time granularity of k minutes;

(2) Data standardization: the traffic flow data are standardized using the minimum and maximum values, with the following formula:

$$x^* = \frac{x - x_{min}}{x_{max} - x_{min}} \, (max - min) + min$$

Among them, x represents the original value, x_min represents the minimum of the original values, x_max represents the maximum of the original values, max is the upper limit of normalization, min is the lower limit of normalization, [min, max] is the normalized interval, and x^* is the normalized result.
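The two preprocessing operations of this claim can be sketched in NumPy, assuming the usual min-max form that maps x_min to the lower limit and x_max to the upper limit. Summing raw counts within each k-minute window is an assumption, since the claim does not say how the values within a window are combined; `aggregate` and `min_max_scale` are hypothetical names.

```python
import numpy as np

def aggregate(counts, raw_minutes, k):
    """Sum raw counts recorded every `raw_minutes` into k-minute totals."""
    if k % raw_minutes != 0:
        raise ValueError("k must be a multiple of the raw sampling interval")
    group = k // raw_minutes
    counts = np.asarray(counts, dtype=float)
    usable = (len(counts) // group) * group       # drop an incomplete tail
    return counts[:usable].reshape(-1, group).sum(axis=1)

def min_max_scale(x, lower=0.0, upper=1.0):
    """Map x linearly so that x.min() -> lower and x.max() -> upper."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        raise ValueError("constant series cannot be min-max scaled")
    return (x - x_min) / (x_max - x_min) * (upper - lower) + lower
```

For example, seven 5-minute counts aggregated with k = 15 yield two 15-minute totals, and the leftover seventh reading is discarded.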

3. The traffic missing data completion method based on a bidirectional recurrent neural network according to claim 1 or 2, characterized in that, in the fourth step, the specific process of handling spatial features is: let the hidden states of all sensors at the current moment be h = <h₁, h₂, h₃, ..., h_i, ..., h_t>, where h_i is the hidden state of the i-th sensor at the current moment; a weight is then calculated for each h_i to obtain the new hidden state h′_i of the current sensor;

Among them, l represents the number of sensors, and h_ij represents the hidden state of the j-th sensor at moment i.
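A Softmax weighting over sensor hidden states can be sketched as below. The scoring rule is an assumption: each sensor is scored by the dot product of its hidden state with that of the current sensor, whereas the claim only states that weights are obtained through a Softmax network. The names `softmax` and `spatial_hidden_state` are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()

def spatial_hidden_state(hidden_states, current):
    """Weighted sum of all sensors' hidden states for sensor `current`.

    hidden_states has shape (l, d): one d-dimensional state per sensor.
    Scores are dot products with the current sensor's state (assumed rule).
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    scores = hidden_states @ hidden_states[current]
    weights = softmax(scores)
    return weights @ hidden_states, weights
```

The weights are positive and sum to one, so the new hidden state is a convex combination of the sensors' states, dominated by the sensors whose states score highest.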

4. The traffic missing data completion method based on a bidirectional recurrent neural network according to claim 1 or 2, characterized in that, in the sixth step, the mean square error (MAE) between the completed data obtained in each iteration and the true value of the traffic flow data is calculated, and the Adam method is used to minimize the MAE;

Among them, x′_i represents the true value of the sensor at the i-th moment, and x_i represents the completed value of the sensor at the i-th moment.
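As an illustration of the symbols defined above, and assuming that MAE is read as the mean absolute error over m missing points (m is a symbol introduced here, not taken from the claim), the metric and a small worked case would be:

```latex
\mathrm{MAE} = \frac{1}{m} \sum_{i=1}^{m} \left| x'_i - x_i \right|
\qquad
\text{e.g. } x' = (100,\ 120,\ 90),\quad x = (110,\ 115,\ 100)
\;\Rightarrow\;
\mathrm{MAE} = \frac{10 + 5 + 10}{3} \approx 8.33
```

Under the literal reading of the text as a mean square error, the absolute value would be replaced by a square, giving (100 + 25 + 100) / 3 = 75 for the same three points.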

5. The traffic missing data completion method based on a bidirectional recurrent neural network according to claim 3, characterized in that, in the sixth step, the mean square error (MAE) between the completed data obtained in each iteration and the true value of the traffic flow data is calculated, and the Adam method is used to minimize the MAE;