CN117292346A - Vehicle running risk early warning method for driver and vehicle state integrated sensing - Google Patents
- Publication number
- CN117292346A CN117292346A CN202311284729.1A CN202311284729A CN117292346A CN 117292346 A CN117292346 A CN 117292346A CN 202311284729 A CN202311284729 A CN 202311284729A CN 117292346 A CN117292346 A CN 117292346A
- Authority
- CN
- China
- Prior art keywords
- vehicle
- driver
- channel
- time
- driving
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/588—Recognition of the road, e.g. of lane markings; Recognition of the vehicle driving pattern in relation to the road
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0635—Risk analysis of enterprise or organisation activities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a vehicle driving risk early warning method based on integrated perception of driver and vehicle state. The method comprises: training a driving behavior recognition model on a driving-state data set to obtain a trained driving behavior recognition model; training a vehicle trajectory recognition model on a lane-offset data set to obtain a trained vehicle trajectory recognition model; feeding the collected driver images into the trained driving behavior recognition model and outputting the driver's driving behavior recognition result; feeding the collected vehicle driving-state images into the trained vehicle trajectory recognition model and outputting the lane-offset recognition result; and judging whether the driving behavior recognition result and/or the lane-offset recognition result meets the warning condition. If so, the driver is reminded to correct the driving behavior and drive in the correct lane; if not, no action is taken. The invention can perceive the driver's driving state and the vehicle's trajectory synchronously in real time, which enhances the reliability of the driving warning.
Description
Technical field
The invention relates to the field of vehicle driving warning, and in particular to a vehicle driving risk early warning method based on integrated perception of driver and vehicle state.
Background art
In recent years, collision accidents have remained the main type of road transportation safety accident, accounting for 70.5% of all accidents and 68.4% of fatalities, which exposes shortcomings in the prevention and control of road transportation vehicle collisions. Before a collision occurs, driver fatigue or distraction together with lane departure or an excessively short following distance is present in 40-50% of accidents. Vehicle driving risk warning is therefore of great significance for improving road safety and traffic management.
Intelligent assistance is increasingly applied in the field of vehicle driving warning. However, current vehicle driving risk warning technology produces false alarms or misses potential dangers, and it cannot effectively recognize complex traffic situations and risk factors, which greatly reduces the reliability of risk warnings. A vehicle driving risk early warning method based on integrated perception of driver and vehicle state is therefore needed to solve these problems.
Summary of the invention
In view of this, the purpose of the present invention is to overcome the shortcomings of the prior art and provide a vehicle driving risk early warning method based on integrated perception of driver and vehicle state, which can perceive the driver's driving state and the vehicle's driving trajectory synchronously in real time, enhance the reliability of driving warnings, and raise the level of safe vehicle operation.

The vehicle driving risk early warning method for integrated perception of driver and vehicle state of the present invention comprises the following steps:

building a driver driving-state data set, and training a driving behavior recognition model on the driving-state data set to obtain a trained driving behavior recognition model;

building a lane-offset data set, and training a vehicle trajectory recognition model on the lane-offset data set to obtain a trained vehicle trajectory recognition model;

inputting the collected driver images into the trained driving behavior recognition model and outputting the driver's driving behavior recognition result;

inputting the collected vehicle driving-state images into the trained vehicle trajectory recognition model and outputting the lane-offset recognition result;

judging whether the driving behavior recognition result and/or the lane-offset recognition result meets the warning condition: if so, reminding the driver to correct the driving behavior and drive in the correct lane; if not, taking no action.
Further, the driving behavior recognition model comprises a policy network and a two-dimensional convolutional neural network; the policy network comprises a feature extractor and a long short-term memory module; a hybrid attention mechanism module is embedded in the backbone of the two-dimensional convolutional neural network; the hybrid attention mechanism module comprises a spatio-temporal excitation sub-module, a channel excitation sub-module and a motion excitation sub-module;

the spatio-temporal excitation sub-module uses a single-channel three-dimensional convolution to represent spatio-temporal features;

the channel excitation sub-module adaptively recalibrates channel-wise feature responses based on the interdependence between channels;

the motion excitation sub-module computes temporal differences at the feature level so as to excite motion-sensitive channels.
Further, the policy network adaptively selects different frame scales to make driving behavior recognition efficient, including:

at a time step t < T_0, resizing the frame I_t to the lowest resolution and sending it to the feature extractor, where T_0 is a set time period and I_t is the driver-state image frame at time t;

the long short-term memory module updating the hidden state using the extracted features and the previous states, and producing an output;

given the hidden state, estimating the policy distribution, sampling the action a_t at time t, and applying a Gumbel-Softmax operation;

when a_t < L, resizing the frame to the spatial resolution 3×H_{a_t}×W_{a_t} and forwarding it to the corresponding backbone network to obtain a frame-level prediction, where L is the number of state-image resolutions, H_{a_t} is the height of the image for action a_t at time t, and W_{a_t} is the width of the image for action a_t at time t;

when a_t ≥ L, the backbone network skipping the current frame in prediction and the policy network skipping the subsequent (F_{a_t-L} - 1) frames, where F_{a_t-L} is the corresponding entry of the skip sequence used when a_t ≥ L.
Further, the spatio-temporal excitation sub-module uses a single-channel three-dimensional convolution to represent spatio-temporal features, which specifically comprises:

for a given input image X ∈ R^(N×T×C×H×W), averaging the input tensor over the channel axis to obtain the global spatio-temporal tensor F ∈ R^(N×T×1×H×W); then reshaping F to F* ∈ R^(N×1×T×H×W) and feeding it to a three-dimensional convolutional layer K with kernel size 3×3×3 to obtain the convolved feature; finally, reshaping the convolved feature to F_o ∈ R^(N×T×1×H×W) and feeding it to a Sigmoid activation to obtain the spatio-temporal mask M ∈ R^(N×T×1×H×W), the final output being Y: Y = X + X⊙M;

where ⊙ denotes element-wise multiplication of the spatio-temporal mask M with all channels of the input X, T denotes the number of segments into which the video corresponding to the image is divided, N denotes the batch size over the T segments, C denotes the number of image channels, H denotes the image height, and W denotes the image width.
Further, the channel excitation sub-module adaptively recalibrates channel-wise feature responses based on the interdependence between channels, which specifically comprises:

for a given input image X ∈ R^(N×T×C×H×W), first obtaining the global spatial information F ∈ R^(N×T×C×1×1) of the input elements by averaging the input; compressing the number of channels of F by a ratio r to obtain F_r = K_1 * F, where K_1 is a 1×1 two-dimensional convolutional layer;

then reshaping F_r so that temporal reasoning can be enabled, and processing the reshaped tensor with a one-dimensional convolutional layer K_2 of kernel size 3 to obtain the intermediate feature;

reshaping the intermediate feature back and decompressing it with a 1×1 two-dimensional convolutional layer K_3 to obtain F_o = K_3 * F_temp, which is fed to a Sigmoid activation to obtain the channel mask M, where F_o ∈ R^(N×T×C×1×1) and M ∈ R^(N×T×C×1×1);

the final output is Y: Y = X + X⊙M.
Further, the motion excitation sub-module computes temporal differences at the feature level so as to excite motion-sensitive channels, which specifically comprises:

for a given input image X ∈ R^(N×T×C×H×W), using a 1×1 two-dimensional convolutional layer to compress the number of channels by a ratio r to obtain F_r, and using a 1×1 two-dimensional convolutional layer to decompress F_r;

modeling the motion features to obtain F_m = K * F_r[:, t+1, :, :, :] - F_r[:, t, :, :, :];

where K is a 3×3 two-dimensional convolutional layer, F_r[:, t+1, :, :, :] denotes the compressed feature map at time t+1, and F_r[:, t, :, :, :] denotes the compressed feature map at time t;

concatenating the motion features along the temporal dimension and padding the last element with 0, i.e.:

F_m = [F_m(1), ..., F_m(t-1), 0], where F_m(t-1) is the (t-1)-th motion representation;

then averaging F_m to obtain the global spatial information of the input elements.
Further, the vehicle trajectory recognition model comprises an improved Deeplabv3+ network;

the improved Deeplabv3+ network takes Deeplabv3+ as the basic framework, replaces the Xception backbone of Deeplabv3+ with the lightweight MobileNetv2 network, adds a channel attention mechanism module, and replaces the ASPP structure in the Deeplabv3+ network with Dense-ASPP;

the channel attention mechanism module is used to place attention on the channels of the feature map.
Further, the method also comprises:

training a vehicle distance recognition model on a preceding-vehicle distance data set to obtain a trained vehicle distance recognition model; inputting the collected preceding-vehicle distance images into the trained vehicle distance recognition model and outputting the preceding-vehicle distance recognition result; and judging whether the preceding-vehicle distance recognition result is less than a distance threshold: if so, reminding the driver to correct the driving behavior; if not, taking no action.
Further, the vehicle distance recognition model comprises an improved YOLOv5 network;

the improved YOLOv5 network takes YOLOv5 as the basic framework, replaces the convolution operations of YOLOv5 with the Ghost Module from GhostNet, and introduces the Coordinate Attention mechanism to embed position information into the channels.
Further, if the preceding vehicle is directly in front of the ego vehicle, the distance d between the ego vehicle and the preceding vehicle is determined according to the following formula:

where h is the vertical distance between the camera mounted on the ego vehicle and the preceding vehicle; θ is the camera pitch angle; the intersection of the optical axis of the camera lens with the image plane is O(x, y); the focal length is f; the imaging point of the bottom-center point of the preceding vehicle on the image plane is D(u, v); and α is the angle between the line from the bottom-center point of the preceding vehicle to the camera and the optical axis of the lens;

if the preceding vehicle is ahead of the ego vehicle to one side, the distance D between the ego vehicle and the preceding vehicle is determined according to the following formula:

where γ is the yaw angle of the preceding vehicle.
The beneficial effects of the present invention are as follows: the disclosed vehicle driving risk early warning method for integrated perception of driver and vehicle state combines in-vehicle-video-based detection of driver distraction and fatigue with detection of lane departure and the distance to the preceding vehicle, forming an integrated detection. It perceives the driver's driving state and the vehicle's trajectory synchronously in real time, performs real-time risk assessment of the vehicle's driving state, issues early warnings of risks, and reminds the driver by voice to correct the driving behavior and return to a safe driving state as soon as possible, thereby improving the level of safe vehicle operation.
Description of the drawings
The present invention is further described below with reference to the accompanying drawings and embodiments:
Figure 1 is a schematic diagram of the vehicle driving risk early warning process of the present invention;

Figure 2 is a schematic diagram of video key-frame extraction according to the present invention;

Figure 3 is a flow chart of driver distraction and fatigue driving behavior recognition according to the present invention;

Figure 4 is a schematic diagram of the SCM module architecture within ResNet-50 according to the present invention;

Figure 5 is a schematic diagram of the working principle of the spatio-temporal excitation sub-module of the present invention;

Figure 6 is a schematic diagram of the working principle of the channel excitation sub-module of the present invention;

Figure 7 is a schematic diagram of the working principle of the motion excitation sub-module of the present invention;

Figure 8 is a schematic diagram of the improved Deeplabv3+ network structure of the present invention;

Figure 9 is a schematic diagram of a lane-offset frame image of the present invention;

Figure 10 is a schematic diagram of the vehicle distance measurement process of the present invention;

Figure 11 is a schematic diagram of the improved YOLOv5 network structure of the present invention;

Figure 12 is a schematic diagram of the pitch-angle-based vehicle distance measurement principle of the present invention;

Figure 13 is a schematic diagram of the vehicle distance measurement principle based on pitch angle and yaw angle of the present invention.
Detailed description of embodiments
The present invention is further described below with reference to the accompanying drawings, as shown in the figures:
The vehicle driving risk early warning method for integrated perception of driver and vehicle state of the present invention comprises the following steps:

building a driver driving-state data set, and training a driving behavior recognition model on the driving-state data set to obtain a trained driving behavior recognition model;

building a lane-offset data set, and training a vehicle trajectory recognition model on the lane-offset data set to obtain a trained vehicle trajectory recognition model;

inputting the collected driver images into the trained driving behavior recognition model and outputting the driver's driving behavior recognition result;

inputting the collected vehicle driving-state images into the trained vehicle trajectory recognition model and outputting the lane-offset recognition result;

judging whether the driving behavior recognition result and/or the lane-offset recognition result meets the warning condition: if so, reminding the driver to correct the driving behavior and drive in the correct lane; if not, taking no action.
As shown in Figure 1, the present invention obtains the driver's real-time driving behavior sequence and the vehicle's real-time driving-state sequence through a vehicle-mounted two-way camera. To obtain video frames that meet the input requirements of the models, the frames are preprocessed by cropping, scaling and similar operations and then fed into the trained driving behavior recognition model for recognition. When the driving behavior recognition model recognizes fatigued or distracted driving, the vehicle trajectory recognition model recognizes a lane offset, or the vehicle distance recognition model recognizes that the distance to the preceding vehicle is too short, driver behavior and vehicle trajectory need to be considered together to assess the driving safety risk.
Further, the recognized driving behavior and vehicle state can be quantified continuously to achieve a more accurate and timely warning. For example, if distracted driving is recognized in the driver images for 2 consecutive seconds, or if distracted driving is recognized for 1 consecutive second while a lane offset is simultaneously detected in the vehicle driving-state images, the current vehicle is deemed to be at driving risk. An early warning is then issued and a voice prompt reminds the driver to correct the driving behavior and return to a safe driving state as soon as possible, thereby improving driving safety.
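The continuous quantification just described can be pictured as a small sliding-window rule. The Python sketch below is illustrative only: the frame rate, class and function names and the window lengths are assumptions for demonstration, not the patent's exact implementation.

```python
from collections import deque

class RiskWarningRule:
    """Sliding-window version of the two example rules described above (illustrative)."""

    def __init__(self, fps=20, distract_secs=2.0, combined_secs=1.0):
        self.distract_window = deque(maxlen=int(fps * distract_secs))  # 2 s of driver-state flags
        self.combined_window = deque(maxlen=int(fps * combined_secs))  # 1 s of joint flags

    def update(self, distracted_or_fatigued: bool, lane_offset: bool) -> bool:
        """Feed one synchronized frame pair; return True when a warning should be issued."""
        self.distract_window.append(distracted_or_fatigued)
        self.combined_window.append(distracted_or_fatigued and lane_offset)

        full_2s = len(self.distract_window) == self.distract_window.maxlen
        full_1s = len(self.combined_window) == self.combined_window.maxlen

        # Rule 1: distraction/fatigue recognized in every frame for 2 consecutive seconds.
        rule_1 = full_2s and all(self.distract_window)
        # Rule 2: distraction/fatigue for 1 s while a lane offset is detected at the same time.
        rule_2 = full_1s and all(self.combined_window)
        return rule_1 or rule_2


# Example: issue a voice prompt when either rule fires.
rule = RiskWarningRule(fps=20)
for behaviour_flag, lane_flag in [(True, True)] * 25:
    if rule.update(behaviour_flag, lane_flag):
        print("Warning: please correct your driving behaviour and keep to the lane.")
        break
```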
In this embodiment, driving behavior is a continuous action. Compared with methods that recognize from a single image, taking a video frame sequence as input and recognizing the driver's driving state from the temporal, spatial and motion characteristics of the input data yields better recognition accuracy. The present invention therefore performs driver fatigue and distracted driving behavior detection based on adaptive frame resolution.

Because of the large amount of redundancy coming from static scenes or very low frame quality (blur, low-light conditions, etc.), processing every frame of a video is usually unnecessary and inefficient. Therefore, while a policy network is used to adaptively select the frame resolution in a unified framework, a frame-skipping mechanism is also designed to skip frames (i.e., set the resolution to zero) when appropriate, further improving the efficiency of action recognition. Meanwhile, a two-dimensional convolutional neural network (CNN) cannot capture long-term temporal relationships, whereas three-dimensional CNN processing would incur a large computational cost. Therefore, after the input video has been processed by the policy network, a hybrid attention mechanism module embedded in a two-dimensional CNN backbone is used.
The driving behavior recognition model comprises a policy network and a two-dimensional convolutional neural network. The policy network comprises a feature extractor and a long short-term memory module. A hybrid attention mechanism module is embedded in the backbone of the two-dimensional convolutional neural network and comprises a spatio-temporal excitation sub-module (STE), a channel excitation sub-module (CE) and a motion excitation sub-module (ME). The spatio-temporal excitation sub-module uses a single-channel three-dimensional convolution to represent spatio-temporal features; the channel excitation sub-module adaptively recalibrates channel-wise feature responses based on the interdependence between channels; the motion excitation sub-module computes temporal differences at the feature level so as to excite motion-sensitive channels.

The present invention builds a driving-state data set covering normal driving, distracted driving and fatigued driving from public data sets such as YawDD, and divides it into training, validation and test sets in a ratio of 6:2:2, so as to train and validate the driving behavior recognition model and then package and integrate it into the system.
In this embodiment, the policy network adaptively selects different frame scales to make driving behavior recognition efficient. A series of resolutions is listed in descending order, where S_0 = (H_0, W_0) denotes the original (and highest) frame resolution and S_{L-1} = (H_{L-1}, W_{L-1}) is the lowest resolution. The frame at time t in the l-th scale is denoted accordingly. Frame skipping is a special case of "selecting resolution S_∞". A skip sequence is defined in ascending order, and the i-th skipping operation means skipping the current frame and the subsequent (F_i - 1) frames in prediction. The choices of resolution and skipping form the action space Ω.

The policy network comprises a lightweight feature extractor φ(·; θ_φ) and a long short-term memory (LSTM) module.

At a time step t < T_0, the frame I_t is resized to the lowest resolution and sent to the feature extractor:

where T_0 is a set time period, I_t is the driver-state image frame at time t, f_t is the feature vector, and θ_φ denotes the learnable parameters.
The LSTM updates the hidden state h_t using the extracted features and the previous states, and outputs o_t:

[h_t, o_t] = LSTM(f_t, h_{t-1}, o_{t-1}, θ_LSTM)    (2)
Given the hidden state, the policy network estimates the policy distribution and samples the action a_t ∈ Ω = {0, 1, ..., L+M-1} through a Gumbel-Softmax operation:

a_t ~ GUMBEL(h_t, θ_G)    (3)
If a_t < L, the frame is resized to the spatial resolution 3×H_{a_t}×W_{a_t} and forwarded to the corresponding backbone network to obtain a frame-level prediction:

where the symbols above denote the resized frame and the predicted value, respectively; L is the number of state-image resolutions, H_{a_t} is the height of the image for action a_t at time t, and W_{a_t} is the width of the image for action a_t at time t.

When a_t ≥ L, the backbone network skips the current frame in prediction, and the policy network skips the subsequent (F_{a_t-L} - 1) frames, which are the video frames skipped when a_t ≥ L.
In addition, to save computation, a shared policy network can be used to generate both the lowest-resolution policy and the corresponding prediction (with φ′ being a feature vector).
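As a concrete illustration of equations (1) to (3), the following PyTorch sketch shows a minimal policy network with a lightweight extractor, an LSTM cell and Gumbel-Softmax action sampling over L resolutions plus M skip operations. The layer sizes and the action-space split are assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Minimal sketch of the policy network: low-resolution feature extractor + LSTM cell
    + Gumbel-Softmax action sampling over L resolutions and M skip operations."""

    def __init__(self, num_resolutions=4, num_skips=3, feat_dim=128, hidden_dim=256):
        super().__init__()
        self.extractor = nn.Sequential(                 # lightweight phi(.; theta_phi)
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTMCell(feat_dim, hidden_dim)   # updates hidden state h_t (eq. 2)
        self.head = nn.Linear(hidden_dim, num_resolutions + num_skips)  # logits over Omega

    def forward(self, frame_lowres, state=None):
        f_t = self.extractor(frame_lowres)              # eq. (1): features of the lowest-resolution frame
        h_t, c_t = self.lstm(f_t, state)
        logits = self.head(h_t)
        # eq. (3): differentiable sampling of the action a_t via Gumbel-Softmax
        one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)
        a_t = one_hot.argmax(dim=-1)                    # a_t < L: resize and predict; a_t >= L: skip frames
        return a_t, one_hot, (h_t, c_t)

# Usage on a batch of two low-resolution frames (84x84):
policy = PolicyNet()
action, one_hot, state = policy(torch.randn(2, 3, 84, 84))
```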
In this embodiment, in order to obtain more accurate prediction results, an SCM module is added to the backbone network. The SCM module consists of three sub-modules: the spatio-temporal excitation sub-module (STE), the channel excitation sub-module (CE) and the motion excitation sub-module (ME).

STE excites spatio-temporal information by means of 3D convolution. Unlike a conventional three-dimensional convolution, this module averages over all channels to obtain global spatio-temporal features, which significantly reduces the computation of the three-dimensional convolution; the output of STE contains global spatio-temporal information. CE is used to activate channel correlations with respect to temporal information, and its output contains channel correlations from a temporal perspective. ME reflects the effectiveness of reasoning about action motion in videos: it models the differences between adjacent frames at the feature level and is then combined with the modules introduced above to reason about the rich information contained in the video.

All tensors outside the SCM module are 4D, i.e., (N (batch size) × T (number of segments), C (channels), H (height), W (width)). The input 4D tensor is reshaped into a 5D tensor (N, T, C, H, W) before being fed into the SCM module, so that operations can be performed on specific dimensions inside the SCM module. The 5D output tensor is then reshaped back to 4D before being fed to the next 2D convolution block. In this way, the output of the SCM module can perceive information from the spatio-temporal perspective, channel correlation and motion.

Figure 4 shows the ResNet-50-SCM architecture, in which an SCM module is inserted at the beginning of each residual block. The figure gives the size of the output feature map of each ResNet-50 layer (CLS denotes the number of classes and T the number of segments). First, the input video is divided evenly into T segments, and then one frame is randomly sampled from each segment of the video processed by the policy network.
STE uses three-dimensional convolution to model spatio-temporal information efficiently. In this stage, STE generates a spatio-temporal mask M ∈ R^(N×T×1×H×W), which is multiplied element-wise with the input X ∈ R^(N×T×C×H×W) over all channels.

As shown in Figure 5, given an image input X ∈ R^(N×T×C×H×W), the input tensor is averaged over the channel axis to obtain the global spatio-temporal tensor F ∈ R^(N×T×1×H×W). F is then reshaped to F* ∈ R^(N×1×T×H×W) and fed to a three-dimensional convolutional layer K with kernel size 3×3×3. The formula is:

Finally, the output of K is reshaped to F_o ∈ R^(N×T×1×H×W) and fed to a Sigmoid activation to obtain the spatio-temporal mask M ∈ R^(N×T×1×H×W). This can be expressed as:

M = δ(F_o)    (6)

The final output is:

Y = X + X⊙M    (7)

where ⊙ denotes element-wise multiplication of the spatio-temporal mask M with all channels of the input X.

T denotes the number of segments into which the video corresponding to the image is divided; N denotes the batch size over the T segments; C denotes the number of image channels; H denotes the image height; W denotes the image width.
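A minimal PyTorch sketch of the STE computation in equations (5) to (7) is given below, assuming the usual (N·T, C, H, W) layout of a 2D backbone; the segment count and layer names are illustrative.

```python
import torch
import torch.nn as nn

class STE(nn.Module):
    """Sketch of the spatio-temporal excitation sub-module (eqs. (5)-(7)):
    channel average -> 3x3x3 3D conv over (T, H, W) -> sigmoid mask -> residual re-weighting."""

    def __init__(self, n_segments):
        super().__init__()
        self.n_segments = n_segments
        self.conv3d = nn.Conv3d(1, 1, kernel_size=3, padding=1, bias=False)  # single-channel K
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                 # x: (N*T, C, H, W)
        nt, c, h, w = x.shape
        n = nt // self.n_segments
        x5 = x.view(n, self.n_segments, c, h, w)          # (N, T, C, H, W)
        f = x5.mean(dim=2, keepdim=True)                  # F:  (N, T, 1, H, W), channel average
        f = f.permute(0, 2, 1, 3, 4)                      # F*: (N, 1, T, H, W) for the 3D conv
        f = self.conv3d(f)                                # eq. (5)
        f = f.permute(0, 2, 1, 3, 4)                      # F_o back to (N, T, 1, H, W)
        m = self.sigmoid(f)                               # eq. (6): spatio-temporal mask M
        y = x5 + x5 * m                                   # eq. (7): Y = X + X (.) M
        return y.view(nt, c, h, w)

# Usage: out = STE(n_segments=8)(torch.randn(2 * 8, 64, 56, 56))
```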
The design of CE is similar to the STE block, as shown in Figure 6.

Given an input X ∈ R^(N×T×C×H×W), the global spatial information of the input elements is first obtained by averaging the input, which can be expressed as:

where F ∈ R^(N×T×C×1×1). The number of channels of F is compressed by the ratio r (the channel compression rate), which can be written as:

F_r = K_1 * F    (9)

where K_1 is a 1×1 two-dimensional convolutional layer.

F_r is then reshaped so that temporal reasoning can be enabled. A one-dimensional convolutional layer K_2 with kernel size 3 is used to process the reshaped tensor as:

The result is then reshaped back and decompressed by a 1×1 two-dimensional convolutional layer K_3, and fed to a Sigmoid activation. These are the last two steps for obtaining the channel mask M and can be formulated respectively as:

F_o = K_3 * F_temp    (11)

M = δ(F_o)    (12)

where F_o ∈ R^(N×T×C×1×1) and M ∈ R^(N×T×C×1×1). Finally, using the newly generated mask, the output of CE is formulated in the same way as equation (7).
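A corresponding sketch of the CE sub-module, following equations (8) to (12), is given below; the compression ratio r and the layer names are assumptions.

```python
import torch
import torch.nn as nn

class CE(nn.Module):
    """Sketch of the channel excitation sub-module (eqs. (8)-(12)): spatial average ->
    1x1 channel compression -> temporal 1D conv -> 1x1 decompression -> sigmoid channel mask."""

    def __init__(self, channels, n_segments, r=16):
        super().__init__()
        self.n_segments = n_segments
        mid = max(channels // r, 1)
        self.squeeze = nn.Conv2d(channels, mid, kernel_size=1, bias=False)         # K1 (eq. 9)
        self.temporal = nn.Conv1d(mid, mid, kernel_size=3, padding=1, bias=False)  # K2 (eq. 10)
        self.expand = nn.Conv2d(mid, channels, kernel_size=1, bias=False)          # K3 (eq. 11)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                        # x: (N*T, C, H, W)
        nt, c, h, w = x.shape
        n, t = nt // self.n_segments, self.n_segments
        f = x.mean(dim=(2, 3), keepdim=True)     # eq. (8): global spatial average, (N*T, C, 1, 1)
        fr = self.squeeze(f)                     # (N*T, C/r, 1, 1)
        fr = fr.view(n, t, -1).permute(0, 2, 1)  # (N, C/r, T) so the 1D conv runs over time
        ft = self.temporal(fr)                   # eq. (10)
        ft = ft.permute(0, 2, 1).reshape(nt, -1, 1, 1)
        m = self.sigmoid(self.expand(ft))        # eqs. (11)-(12): channel mask M
        return x + x * m                         # same residual form as eq. (7)

# Usage: out = CE(channels=64, n_segments=8)(torch.randn(2 * 8, 64, 56, 56))
```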
ME is used in parallel with the two modules STE and CE mentioned above. As shown in Figure 7, the motion information is modeled from adjacent frames.

The same compression and decompression strategy as in the CE sub-module is used, with two 1×1 two-dimensional convolutional layers; see equations (9) and (11), respectively. Given the features produced by the compression operation, the motion features are modeled by an analogous operation and can be expressed as:

F_m = K * F_r[:, t+1, :, :, :] - F_r[:, t, :, :, :]    (13)

where K is a 3×3 two-dimensional convolutional layer, F_r[:, t+1, :, :, :] denotes the compressed feature map at time t+1, and F_r[:, t, :, :, :] denotes the feature map at time t;

the motion features are concatenated along the temporal dimension, with the last element padded with 0, giving F_m = [F_m(1), ..., F_m(t-1), 0], where F_m(t-1) is the (t-1)-th motion representation.

The result is then processed by the same spatial average pooling as in equation (8), that is, F_m is averaged to obtain the global spatial information of the input elements.
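The ME sub-module of equation (13), followed by the zero padding and spatial averaging described above, can be sketched as follows; the residual output form is assumed to mirror equation (7).

```python
import torch
import torch.nn as nn

class ME(nn.Module):
    """Sketch of the motion excitation sub-module: feature-level difference between adjacent
    frames (eq. 13) -> zero padding in time -> spatial average -> decompression -> sigmoid mask."""

    def __init__(self, channels, n_segments, r=16):
        super().__init__()
        self.n_segments = n_segments
        mid = max(channels // r, 1)
        self.squeeze = nn.Conv2d(channels, mid, kernel_size=1, bias=False)
        self.transform = nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False)  # K in eq. (13)
        self.expand = nn.Conv2d(mid, channels, kernel_size=1, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                           # x: (N*T, C, H, W)
        nt, c, h, w = x.shape
        n, t = nt // self.n_segments, self.n_segments
        fr = self.squeeze(x).view(n, t, -1, h, w)   # compressed features, (N, T, C/r, H, W)
        nxt = self.transform(fr[:, 1:].reshape(-1, fr.size(2), h, w)).view(n, t - 1, -1, h, w)
        fm = nxt - fr[:, :-1]                       # eq. (13): K*F_r(t+1) - F_r(t)
        fm = torch.cat([fm, torch.zeros_like(fm[:, :1])], dim=1)   # pad the last position with 0
        fm = fm.mean(dim=(3, 4), keepdim=True)      # global spatial average
        m = self.sigmoid(self.expand(fm.reshape(nt, -1, 1, 1)))    # motion mask M
        return x + x * m

# Usage: out = ME(channels=64, n_segments=8)(torch.randn(2 * 8, 64, 56, 56))
```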
In this embodiment, the vehicle trajectory recognition model comprises an improved Deeplabv3+ network;

the improved Deeplabv3+ network takes Deeplabv3+ as the basic framework, replaces the Xception backbone of Deeplabv3+ with the lightweight MobileNetv2 network, adds a channel attention mechanism module, and replaces the ASPP structure in the Deeplabv3+ network with Dense-ASPP;

the channel attention mechanism module is used to place attention on the channels of the feature map.
First, the video collected by the driving recorder is split into frames and screenshots, and the captured pictures are combined with the public data sets TuSimple and CULane to form the lane-offset data set used for lane line detection. Second, these data sets are preprocessed, for example by enhancing the illumination of darker images, and then manually annotated so as to teach the model the deep features of lane lines. Finally, the annotated data set is fed into the vehicle trajectory recognition model to start training; the training progress is monitored through the loss function, and if the training effect is poor, the obtained parameters are back-propagated through the network and training continues until an ideal model for lane line detection is obtained.

The present invention takes Deeplabv3+ as the basic framework and builds a lightweight network model that extracts features more effectively. To address the incomplete feature extraction of the original network, a dense structure is added. An attention mechanism is also added to increase the attention paid to important features, which benefits model training and the accuracy of real-time detection.

As shown in Figure 8, the Xception backbone of Deeplabv3+ is replaced with the lightweight MobileNetv2 network, which has few parameters and runs fast, making it very suitable for real-time detection scenarios. To improve the performance of the modified model, the present invention adds an SE (Squeeze-and-Excitation) module, i.e., the channel attention mechanism module. The SE module mainly places attention on the channels of the feature map and automatically learns the importance of different channels.
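A minimal sketch of the SE (Squeeze-and-Excitation) block that plays the role of the channel attention mechanism module is shown below; the reduction ratio is illustrative.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal Squeeze-and-Excitation block: global average pooling -> bottleneck MLP ->
    sigmoid channel weights that rescale the feature-map channels."""

    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, max(channels // r, 1)), nn.ReLU(inplace=True),
            nn.Linear(max(channels // r, 1), channels), nn.Sigmoid(),
        )

    def forward(self, x):                            # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))                       # squeeze: per-channel global average
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)   # excitation: channel importance weights
        return x * w                                 # re-weight feature-map channels

# Usage: out = SEBlock(320)(torch.randn(1, 320, 32, 32))
```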
To address the problem that local features become harder to capture as the dilation rate increases, Dense Atrous Spatial Pyramid Pooling (Dense-ASPP) is introduced, and Dense-ASPP replaces the ASPP structure of the original network.
Based on the lane line detection described above, as shown in Figure 9, assume that the pixel coordinate of the vehicle front in the image is (180, 0); D_L and D_R denote the distances from the vehicle center to the left and right lane lines, and (L_x, L_y) and (R_x, R_y) denote the pixel coordinates of the left and right lane lines in the image. The distance from the vehicle center to the left lane line is then D_L = 180 - L_x, and similarly D_R = R_x - 180.

Collecting 20 frames normally takes 0.5 seconds, so whether the vehicle has drifted can be judged from the continuous changes of D_L and D_R over 20 frames. An array L stores the 20 values of D_L, and an array R stores the 20 values of D_R.

Further, the left and right lateral velocities V_L and V_R at this time are given by equations (14) and (15):
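Equations (14) and (15) are only referenced above, so the sketch below simply estimates the lateral velocities as the net change of D_L and D_R over the 0.5-second window; the patent's exact formulas may differ.

```python
def lane_offsets(left_x, right_x, center_x=180):
    """Per-frame distances from the vehicle centre to the left/right lane lines (pixels),
    following D_L = 180 - L_x and D_R = R_x - 180 from the description."""
    return center_x - left_x, right_x - center_x

def lateral_velocities(dl_window, dr_window, window_secs=0.5):
    """Illustrative estimate of V_L, V_R over a 20-frame (0.5 s) window, in pixels per second,
    taken as the net change of the offsets over the window."""
    v_l = (dl_window[-1] - dl_window[0]) / window_secs
    v_r = (dr_window[-1] - dr_window[0]) / window_secs
    return v_l, v_r

# Example: a vehicle drifting towards the left lane line over 20 frames.
L = [lane_offsets(60 + i, 300 + i)[0] for i in range(20)]   # D_L shrinking...
R = [lane_offsets(60 + i, 300 + i)[1] for i in range(20)]   # ...while D_R grows
print(lateral_velocities(L, R))   # (-38.0, 38.0) pixels per second
```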
This embodiment further comprises:

training a vehicle distance recognition model on a preceding-vehicle distance data set to obtain a trained vehicle distance recognition model; inputting the collected preceding-vehicle distance images into the trained vehicle distance recognition model and outputting the preceding-vehicle distance recognition result; and judging whether the preceding-vehicle distance recognition result is less than a distance threshold: if so, reminding the driver to correct the driving behavior; if not, taking no action.
The present invention selects monocular vision for vehicle distance recognition and analysis, and completes ranging mainly by combining the monocular ranging model with the detected target box. Monocular visual ranging needs only one camera and some coordinate-system conversion, and has the advantages of low computation and low cost.

The present invention uses the driving recorder on the ego vehicle as the main data-collection device, and the vehicle is driven on busy urban roads, expressways, highways and other scenes to collect a real-time preceding-vehicle distance data set.
As shown in Figure 11, the vehicle distance recognition model takes YOLOv5 as the basic framework, makes corresponding modifications to the backbone network and integrates an attention mechanism, so as to meet the requirements of lightweight, high-precision, multi-scale vehicle detection, vehicle detection under complex environmental conditions, and real-time operation.

In the YOLOv5 backbone, the CSP structure uses many convolution (Conv) operations, and the computational cost of convolution grows with the number of layers. To solve this problem, the present invention replaces the convolution operations of YOLOv5 with the Ghost Module from GhostNet, which increases the speed of the model and reduces the number of parameters without affecting accuracy. The Coordinate Attention (CA) mechanism is introduced to embed position information into the channels, overcoming the shortcoming of channel-attention mechanisms that ignore position information.
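The Ghost Module replacement mentioned above can be sketched as follows; the ratio and kernel sizes are the common GhostNet defaults and are not necessarily the settings used in the patent.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Sketch of the GhostNet Ghost Module: a primary convolution produces a few intrinsic
    feature maps, a cheap depthwise convolution generates the remaining 'ghost' maps,
    and the two are concatenated."""

    def __init__(self, in_ch, out_ch, kernel_size=1, ratio=2, dw_size=3, stride=1):
        super().__init__()
        init_ch = out_ch // ratio                     # intrinsic feature maps
        cheap_ch = out_ch - init_ch                   # ghost feature maps
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel_size, stride, kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(                   # depthwise conv: cheap maps from intrinsic maps
            nn.Conv2d(init_ch, cheap_ch, dw_size, 1, dw_size // 2, groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)   # (N, out_ch, H, W)

# Usage: out = GhostModule(64, 128)(torch.randn(1, 64, 56, 56))
```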
After the vehicle in front of the ego vehicle has been detected, the pixel values of the corresponding target are extracted, and ranging is modeled based on the camera pitch angle and the yaw angle of the preceding vehicle, as shown in Figure 12.

In the figure, the dashed line below the upper-left corner is the optical axis of the lens; its intersection with the image plane is O(x, y) (image-plane coordinate system) and the focal length is f. The dashed line above the upper-left corner is the line from the detected bottom-center point of the vehicle to the camera; its imaging point on the image plane is D(u, v) (pixel coordinate system), and its angle with the optical axis is α. With the camera mounted on the ego vehicle, the horizontal distance from the camera to the vehicle, i.e., the distance between the ego vehicle and the preceding vehicle, is:

where h is the vertical distance between the camera and the preceding vehicle, and θ is the camera pitch angle (the angle between the optical axis of the camera lens and the horizontal plane).
Since in real scenes the preceding vehicle does not always travel directly in front of the ego vehicle and may also be on the left or right side, the case where the preceding vehicle is ahead of the ego vehicle to one side is handled as shown in Figure 13.

β is the horizontal angle between the camera extrinsic orientation and the optical axis, and γ is the yaw angle of the preceding vehicle. B′(B_x, B_y) is the position, in the pixel coordinate system, of the landing point B of the yaw trajectory of the bottom center of the preceding vehicle, and the pixel center point is O′(u, v). The ranging based on the camera pitch angle and the yaw angle of the front-side vehicle is as follows:

In equation (17), D is the distance between the ego vehicle and the preceding vehicle, and θ and γ denote the camera pitch angle and the yaw angle of the preceding vehicle, respectively.
Since the pitch angle and yaw angle are difficult to obtain directly, a method for obtaining the transformation angles in real time can be designed: lane lines are essentially parallel, so if the lane lines are photographed, the two lane lines eventually converge at one point, called the vanishing point. The vanishing point is used to calculate the yaw angle and pitch angle of the camera.

First, a Gabor filter is used to compute the texture orientation of each pixel in the captured image; then the confidence of these pixels determines whether they take part in voting; finally, a fast local voting method is used to determine the position of the vanishing point.

The functional expression of the Gabor filter is as follows:
where ω and the accompanying parameter denote the scale and orientation, and x, y denote the pixel coordinate position.
Gabor filtering yields texture responses in 36 orientations for each pixel, but it cannot be guaranteed that the texture in every orientation is needed. A confidence measure is therefore introduced: when it exceeds a set threshold, the texture in that orientation is taken as a voting point. The confidence of the pixel value d(x, y) at a point in the image can be defined by the following formula:

where r_i(d) denotes the response value in the i-th orientation at that pixel, the local maximum response values occur between r_5(d) and r_15(d), and the remaining responses rarely appear. The threshold is set as:

t = 0.4(max C(d) - min C(d))    (20)

When the confidence is greater than t, the pixel is a voting point. The final road vanishing point is selected according to the voting points and their confidences.
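The confidence of equation (19) is given as a figure, so the sketch below takes the confidence map as an input and only implements the adaptive threshold of equation (20) and the selection of voting points.

```python
import numpy as np

def select_voting_points(confidence):
    """Sketch of the thresholding in equation (20): pixels whose confidence exceeds
    t = 0.4 * (max C - min C) are kept as voting points for the vanishing point.
    The confidence map (built from the 36 Gabor responses) is assumed to be provided."""
    c = np.asarray(confidence, dtype=float)
    t = 0.4 * (c.max() - c.min())             # adaptive threshold of eq. (20)
    return np.argwhere(c > t)                 # (row, col) indices of the voting points

# Example on a toy 4x4 confidence map:
demo = np.array([[0.1, 0.2, 0.9, 0.1],
                 [0.1, 0.8, 0.7, 0.1],
                 [0.1, 0.1, 0.2, 0.1],
                 [0.1, 0.1, 0.1, 0.1]])
print(select_voting_points(demo))             # the three high-confidence pixels
```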
The principle of the voting algorithm is to select a candidate vanishing point H, take H as the center of a circle whose radius is one third of the image size, and take half of this circle as the candidate region. The specific voting formula is as follows:

In the formula, α is the angle between the texture orientation of a voting point P and the line segment PH, and d(P, H) denotes the distance from point P to the circle center. The formula treats every pixel in the image as a candidate point, and finally the point with the highest voting score is the vanishing point.
After the vanishing point is obtained, the pitch angle and yaw angle are respectively:

where R is the camera image rotation matrix, R_xz denotes the rotation matrix about the x-axis and z-axis, and R_yz denotes the rotation matrix about the y-axis and z-axis.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and are not limiting. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention may be modified or equivalently substituted without departing from the spirit and scope of the technical solution of the present invention, and all such modifications and substitutions shall fall within the scope of the claims of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311284729.1A CN117292346A (en) | 2023-10-07 | 2023-10-07 | Vehicle running risk early warning method for driver and vehicle state integrated sensing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311284729.1A CN117292346A (en) | 2023-10-07 | 2023-10-07 | Vehicle running risk early warning method for driver and vehicle state integrated sensing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117292346A true CN117292346A (en) | 2023-12-26 |
Family
ID=89238683
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311284729.1A Pending CN117292346A (en) | 2023-10-07 | 2023-10-07 | Vehicle running risk early warning method for driver and vehicle state integrated sensing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117292346A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117636270A (en) * | 2024-01-23 | 2024-03-01 | 南京理工大学 | Method and equipment for identifying vehicle lane-stealing events based on monocular camera |
- 2023-10-07 CN CN202311284729.1A patent/CN117292346A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117636270A (en) * | 2024-01-23 | 2024-03-01 | 南京理工大学 | Method and equipment for identifying vehicle lane-stealing events based on monocular camera |
CN117636270B (en) * | 2024-01-23 | 2024-04-09 | 南京理工大学 | Vehicle robbery event identification method and device based on monocular camera |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20250013925A1 (en) | Automatic generation of ground truth data for training or retraining machine learning models | |
Zhang et al. | ISSAFE: Improving semantic segmentation in accidents by fusing event-based data | |
CN111292366B (en) | Visual driving ranging algorithm based on deep learning and edge calculation | |
US20210216798A1 (en) | Using captured video data to identify pose of a vehicle | |
US11024042B2 (en) | Moving object detection apparatus and moving object detection method | |
CN107886043A (en) | The vehicle front-viewing vehicle and pedestrian anti-collision early warning system and method for visually-perceptible | |
CN112509032A (en) | Design method of front sensing module based on automobile distributed sensing platform | |
CN113807298B (en) | Pedestrian crossing intention prediction method and device, electronic equipment and readable storage medium | |
CN116311154A (en) | A Vehicle Detection and Recognition Method Based on YOLOv5 Model Optimization | |
CN117111055A (en) | Vehicle state sensing method based on thunder fusion | |
CN117292346A (en) | Vehicle running risk early warning method for driver and vehicle state integrated sensing | |
Aditya et al. | Collision detection: An improved deep learning approach using SENet and ResNext | |
CN116721251A (en) | A 3D target detection method that fuses dense segmentation of lidar point clouds and images | |
CN115512323A (en) | A method for predicting vehicle trajectories outside the field of view of autonomous driving based on deep learning | |
CN113569803A (en) | Multi-mode data fusion lane target detection method and system based on multi-scale convolution | |
CN115588188A (en) | Locomotive, vehicle-mounted terminal and driver behavior identification method | |
CN111062311B (en) | Pedestrian gesture recognition and interaction method based on depth-level separable convolution network | |
CN116311140B (en) | Method, apparatus and storage medium for detecting lane lines | |
CN117115690A (en) | Unmanned aerial vehicle traffic target detection method and system based on deep learning and shallow feature enhancement | |
CN117237612A (en) | A target detection method in complex road scenes based on YOLOX model | |
CN113610900B (en) | Method and device for predicting scale change of vehicle tail sequence and computer equipment | |
CN116259024A (en) | A road target detection method based on front fusion of camera and millimeter wave radar | |
CN117203678A (en) | Target detection method and device | |
US12125215B1 (en) | Stereo vision system and method for small-object detection and tracking in real time | |
CN118015564B (en) | A multi-task-aware traffic control network method for high-speed roads |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |