WO2021088300A1 - Rgb-d multi-mode fusion personnel detection method based on asymmetric double-stream network - Google Patents
RGB-D multi-mode fusion personnel detection method based on asymmetric double-stream network
- Publication number
- WO2021088300A1 (PCT/CN2020/080991, CN2020080991W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- rgb
- depth
- image
- feature
- prediction
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/30—Noise filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/56—Extraction of image or video features relating to colour
Definitions
- The invention belongs to the field of computer vision and image processing, and in particular relates to an RGB-D multi-modal fusion personnel detection method based on an asymmetric dual-stream network.
- At present there are two main approaches to personnel detection: detection based on RGB images and detection based on multi-modal image fusion.
- The RGB-image-based method detects persons using only RGB images; typical variants are RGB face-based detection and RGB whole-body-based detection.
- The RGB face-based method extracts a general feature representation of the face by calibrating facial key points and encoding facial features in the RGB image alone, and trains a face detection model with machine learning or deep learning; the trained model then selects and locates the face region of each person in a test image with a predicted bounding rectangle, thereby achieving person detection.
- The RGB whole-body-based method differs from face detection: it extracts, from the RGB image alone, the image region containing the whole body of a person or the main recognizable body parts for feature representation, trains a whole-body person detection model, and uses the model's predicted bounding rectangles to select and locate each person's whole-body region, thereby achieving person detection.
- However, this approach is limited by the scene and by imaging resolution. Owing to the optical imaging principle of visible-light cameras, the captured RGB color images are poorly robust to changes in illumination; in low-illumination scenes such as night, rain, snow, and fog, the images captured in real time appear dark or blend into a similar background.
- Foreground persons and background that cannot be clearly distinguished in the image strongly hinder the training convergence of the detection model and reduce detection accuracy.
- Moreover, when multiple people are detected in a scene, occlusion between people and objects, or between people themselves, commonly occurs. A visible-light camera cannot obtain the depth or thermal-radiation information of objects or people in the scene, so its two-dimensional images cannot effectively highlight the key cues, such as the edge contours and texture of occluded targets, needed to resolve occlusion; occluded persons may even be submerged in similar background information, causing a significant drop in the precision and recall of person detection.
- The person detection method based on multi-modal image fusion differs from RGB-based detection in that its input consists of images of the same scene from different image sources, such as RGB images, depth images, and infrared thermal images; each source is captured by a different camera and has its own characteristics.
- Multi-modal fusion detection mainly exploits the cross-fusion of images of different modalities to achieve feature enhancement and complementary association.
- Infrared thermal images and depth images are more robust to illumination changes than RGB color images and can be imaged stably under low-illumination conditions such as night; moreover, because the imaging principles of infrared thermal cameras and depth cameras differ from those of visible-light cameras, both can better capture auxiliary cues such as the edge contours of partially occluded persons, which alleviates the partial-occlusion problem to some extent.
- Deep learning methods are now widely used to realize the feature fusion and association modeling of multi-modal information; the trained models are more robust for person detection under multi-constraint, multi-scene conditions (such as low illumination at night, severe occlusion, and long-distance shooting).
- However, for multi-modal fusion, existing methods mostly rely on traditional hand-crafted multi-modal feature fusion, or on RGB-T or RGB-D (color + thermal infrared, color + depth) dual-stream neural networks with simple fusion schemes such as stacked four-channel input, single-scale fusion, or weighted decision fusion.
- Traditional hand-crafted multi-modal fusion requires the manual design and extraction of multi-modal features, which depends on subjective experience, is time-consuming and laborious, and cannot achieve end-to-end person detection.
- A simple dual-stream multi-modal fusion strategy cannot fully and effectively exploit the fine-grained color and texture information of the color image and the edge and depth semantics provided by the depth image to realize correlation and complementarity between the multi-modal data; over-fitting may even occur because of excessive model complexity, so that the precision and recall of person detection fall rather than rise.
- RGB-T person detection is further limited in practice by the high cost of infrared thermal imaging cameras.
- A representative prior-art technique is the invention "A pedestrian detection and identity recognition method and system based on RGBD" (Application No. 201710272095), which provides an RGBD-based pedestrian detection and identity recognition method.
- That method inputs RGB and depth images, preprocesses them, and converts the color channels; it then constructs multi-channel RGBD features: the horizontal and vertical gradients of the RGB image are computed to build an RGB histogram-of-oriented-gradients (HOG) feature, and the horizontal gradient, vertical gradient, and depth normal-vector direction of the depth image are used to build a depth HOG feature. The scale corresponding to each pixel of the depth image is computed and quantized to obtain a scale list; the AdaBoost algorithm is used to train a pedestrian detection classifier on the multi-channel features; finally, the classifier searches the scale space given by the scale list to obtain bounding rectangles containing pedestrians, completing pedestrian detection.
- However, this method must manually extract the traditional gradient-direction-histogram features of the RGBD images, which is time-consuming, labor-intensive, and storage-heavy, and it cannot achieve end-to-end pedestrian detection; the HOG feature is relatively simple, making it difficult to extract discriminative features from the RGB and depth images for pedestrian detection; and its simple fusion of RGB and depth features makes it hard to fully exploit the fine-grained color and texture information of the RGB image and the edge and depth semantics of the depth image to realize correlation and complementarity between the multi-modal data, which greatly limits improvements in pedestrian detection accuracy.
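As an illustration only, the hand-crafted pipeline of this prior art can be sketched roughly as follows. The sketch assumes scikit-image's `hog` and scikit-learn's `AdaBoostClassifier` as convenient stand-ins; the per-window extraction, the depth normal-vector channel, and the scale-list search of the cited application are omitted, and all function names are hypothetical.

```python
# Minimal sketch of a hand-crafted RGB-D HOG + AdaBoost pipeline. scikit-image /
# scikit-learn are illustrative stand-ins; window extraction and the
# depth-scale search described in the cited application are omitted.
import numpy as np
from skimage.feature import hog
from sklearn.ensemble import AdaBoostClassifier

def rgbd_hog_features(rgb_window, depth_window):
    """Concatenate HOG descriptors of an RGB window and its depth window."""
    f_rgb = hog(rgb_window, orientations=9, pixels_per_cell=(8, 8),
                cells_per_block=(2, 2), channel_axis=-1)
    f_depth = hog(depth_window, orientations=9, pixels_per_cell=(8, 8),
                  cells_per_block=(2, 2))
    return np.concatenate([f_rgb, f_depth])

def train_pedestrian_classifier(rgb_windows, depth_windows, labels):
    """labels: 1 for pedestrian windows, 0 for background windows."""
    features = np.stack([rgbd_hog_features(r, d)
                         for r, d in zip(rgb_windows, depth_windows)])
    clf = AdaBoostClassifier(n_estimators=200)
    clf.fit(features, labels)
    return clf
```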
- Aiming at these defects in the prior art, the present invention provides an RGBD multi-modal fusion personnel detection method based on an asymmetric dual-stream network; it is not limited to person detection and can also be applied to tasks such as object detection and vehicle detection.
- A representative diagram of the RGBD multi-modal fusion personnel detection method based on an asymmetric dual-stream network provided by the present invention is shown in Figure 1. The method comprises RGBD image acquisition, depth image preprocessing, RGB feature extraction and Depth feature extraction, RGB multi-scale fusion and Depth multi-scale fusion, multi-modal feature channel reweighting, and multi-scale personnel prediction. The specific function of each step is as follows.
- S1 RGBD image acquisition: the original RGB image and depth image (hereinafter, Depth image) are obtained with a camera capable of capturing RGB and depth images simultaneously, and the RGB and Depth images are matched and grouped. Each group consists of one RGB image and the Depth image captured in the same scene, and the grouped and matched RGB and Depth images are output.
- The original RGB and Depth images may also be obtained from a public RGBD data set.
- S2 depth image preprocessing: the grouped and matched Depth image from S1 is taken as input; part of its noise is removed, its holes are filled, the single-channel Depth image is re-encoded into a three-channel image, and the three channel values are re-normalized to 0-255; the encoded and normalized Depth image is output.
- S3 RGB feature extraction and Depth feature extraction: the original RGB image from S1 is input to the RGB network stream of the asymmetric dual-stream network model for down-sampling feature extraction, which outputs the high-, medium-, and low-resolution RGB feature maps, denoted RGB_FP_H, RGB_FP_M, and RGB_FP_L, representing the low-level color-texture, mid-level edge-contour, and high-level semantic features of the RGB image; the encoded and normalized Depth image is input to the Depth network stream for down-sampling feature extraction, which outputs the high-, medium-, and low-resolution Depth feature maps, denoted D_FP_H, D_FP_M, and D_FP_L, representing the low-level color-texture, mid-level edge-contour, and high-level semantic features of the Depth image.
- If the RGB network stream and the Depth network stream had a symmetric structure, that is, exactly the same architecture, the simpler features of the Depth image would have to pass through an unnecessarily deep network: the Depth features would vanish because the network is too deep, and the additional parameters would increase the risk of over-fitting.
- For these reasons, an asymmetric dual-stream convolutional neural network model is designed to extract the features of the RGB image and the Depth image.
- Figures 2-1 to 2-4 show one specific embodiment of the asymmetric dual-stream convolutional neural network model designed by this method, but the model is not limited to the structures shown in Figures 2-1 to 2-4.
- DarkNet-53 in Figure 2-1 and MiniDepth-30 in Figure 2-2 represent the RGB network stream and the Depth network stream, respectively; their network structures are asymmetric.
- S4 RGB multi-scale fusion and Depth multi-scale fusion: the RGB feature maps RGB_FP_H, RGB_FP_M, and RGB_FP_L are input to RGB multi-scale fusion. RGB_FP_L is first expanded by an up-sampling layer to the same size as RGB_FP_M and then channel-merged with RGB_FP_M, realizing the complementary fusion of the deep high-level semantic features of the RGB network with the mid-level edge-contour features of its middle layers; the result is output as the new feature map RGB_FP_M. This new RGB_FP_M is then up-sampled to the same size as RGB_FP_H and channel-merged with RGB_FP_H, realizing the complementary fusion of the deep high-level semantic features, the mid-level edge-contour features, and the shallow low-level color-texture features; the result is output as the new feature map RGB_FP_H. The Depth feature maps D_FP_H, D_FP_M, and D_FP_L are input to Depth multi-scale fusion, which performs the same operations.
- The final output of RGB multi-scale fusion is the original input RGB_FP_L together with the new channel-merged feature maps RGB_FP_M and RGB_FP_H; the output of Depth multi-scale fusion is the original input D_FP_L together with the new channel-merged feature maps D_FP_M and D_FP_H.
- S5 multi-modal feature channel reweighting: the RGB feature maps RGB_FP_L, RGB_FP_M, and RGB_FP_H from RGB multi-scale fusion and the Depth feature maps D_FP_L, D_FP_M, and D_FP_H from Depth multi-scale fusion are grouped by resolution and fed into the channel-reweighting structure of the corresponding resolution, realizing a more effective multi-modal fusion of RGB and Depth features and improving detection robustness in a variety of restricted scenes.
- Taking the reweighting of RGB_FP_L and D_FP_L as an example: the two maps are first channel-merged, and the merged feature map is denoted Concat_L; the channel reweighting module (hereinafter RW_Module) then linearly weights the feature channels of Concat_L, assigning a weight to each channel, and the channel-reweighted feature map is output as RW_L.
- The channel reweighting of RGB_FP_M with D_FP_M and of RGB_FP_H with D_FP_H is performed in the same way as that of RGB_FP_L and D_FP_L.
- The multi-modal feature channel reweighting finally outputs the channel-reweighted low-, medium-, and high-resolution feature maps, denoted RW_L, RW_M, and RW_H, respectively.
- S6 multi-scale personnel prediction: the channel-reweighted feature maps RW_L, RW_M, and RW_H from S5 are fed into the corresponding prediction branches for classification and bounding-box coordinate regression, yielding prediction results for larger, medium, and smaller persons. Because the feature-map resolutions differ, the receptive field of each prediction point also differs: each prediction point on RW_L has a large receptive field and is used to predict larger targets in the image; each point on RW_M has a medium receptive field and predicts medium targets; each point on RW_H has a small receptive field and predicts smaller targets.
- The prediction results of the three scales are aggregated, and the non-maximum suppression (NMS) algorithm [1] is applied to remove overlapping target boxes; the finally retained person detection results are output, namely each person's class confidence score C_i and predicted bounding rectangle, i = 1, 2, ..., N.
- Here i denotes the person's ID number, N is the total number of person detections retained in the current image, and the four coordinates of each rectangle are the horizontal and vertical coordinates of its upper-left corner and the horizontal and vertical coordinates of its lower-right corner.
- The present invention addresses the problem that a traditional symmetric RGBD dual-stream network (RGB network stream + Depth network stream) is prone to losing depth features because the Depth network is too deep.
- To this end, the present invention designs an asymmetric RGBD dual-stream convolutional neural network model in which the Depth network stream is obtained by effectively pruning the RGB network stream; this reduces the number of parameters, lowers the risk of over-fitting, and improves detection accuracy.
- The RGB network stream and the Depth network stream are used to extract the high-, medium-, and low-resolution feature maps of the RGB and depth images (hereinafter Depth images), representing their low-level color-texture, mid-level edge-contour, and high-level semantic features.
- Secondly, a multi-scale fusion structure is designed for each of the RGB and Depth network streams so that the high-level semantic features in the low-resolution feature maps complement the mid-level edge-contour and low-level color-texture features in the medium- and high-resolution feature maps, achieving multi-scale information complementarity. A multi-modal feature channel reweighting structure is then constructed: the RGB and Depth feature maps are merged and each merged feature channel is assigned a learned weight, so that the model automatically learns the contribution of each channel, performs feature selection, and removes redundancy, thereby realizing multi-modal fusion of the RGB and Depth features at the high, medium, and low resolutions.
- Finally, the multi-modal features are used for person classification and bounding-box regression, which improves the accuracy of person detection while maintaining real-time performance and enhances the robustness of detection under low illumination at night and under occlusion of persons.
- Fig. 1 is a representative diagram of the RGBD multi-modal fusion personnel detection method based on an asymmetric dual-stream network provided by the present invention.
- Figure 2-1 is a structure diagram of the RGB network stream (DarkNet-53).
- Figure 2-2 is a structure diagram of the Depth network stream (MiniDepth-30).
- Figure 2-3 is a general structure diagram of a convolution block.
- Figure 2-4 is a general structure diagram of a residual convolution block.
- Fig. 3 is a flowchart of the RGBD multi-modal fusion personnel detection method based on an asymmetric dual-stream network provided by an embodiment of the present invention.
- Figure 4 is a general structure diagram of the channel reweighting module provided by an embodiment of the present invention.
- Fig. 5 is a flowchart of the NMS algorithm provided by an embodiment of the present invention.
- S1: Use a camera capable of simultaneously capturing RGB and depth images to obtain the original RGB image and depth image, match and group the images, and output the grouped and matched RGB and Depth images.
- Step S110: Obtain the original RGB image with a camera capable of simultaneously capturing RGB and depth images; the original RGB images may also be obtained from a public RGBD data set.
- Step S120: Synchronously acquire the Depth image matching the RGB image from step S110, and group the RGB and Depth images. Each group consists of an RGB image and the Depth image captured in the same scene; the grouped and matched RGB and Depth images are output.
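A minimal sketch of how matched RGB/Depth pairs might be grouped when they come from a public RGBD data set. The directory layout, the PNG file format, and the matching-by-filename-stem convention are assumptions made only for illustration; they are not part of the method.

```python
# Sketch of step S1 when the images come from a public RGB-D data set.
# Directory layout and file-naming convention are assumptions.
from pathlib import Path

def group_rgbd_pairs(rgb_dir, depth_dir):
    """Return (rgb_path, depth_path) pairs matched by filename stem."""
    depth_by_stem = {p.stem: p for p in Path(depth_dir).glob("*.png")}
    pairs = []
    for rgb_path in sorted(Path(rgb_dir).glob("*.png")):
        depth_path = depth_by_stem.get(rgb_path.stem)
        if depth_path is not None:  # keep only frames with a matching Depth image
            pairs.append((rgb_path, depth_path))
    return pairs
```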
- S2: The original Depth image obtained from step S120 is used as input. Part of the noise in the Depth image is first removed, its holes are then filled, and finally the single-channel Depth image is re-encoded into a three-channel image whose channel values are re-normalized to 0-255; the encoded and normalized Depth image is output.
- In this embodiment, a 5x5 Gaussian filter is used to remove noise.
- The image repair algorithm proposed in [2] is used for hole repair: the local normal vectors and occlusion boundaries of the Depth image are extracted, and global optimization is then applied to fill the holes of the Depth image.
- Depth image coding adopts HHA coding [3]; the three channels are the horizontal disparity, the height above the ground, and the angle of the local surface normal vector.
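The preprocessing chain of S2 can be sketched as follows, assuming OpenCV. `cv2.inpaint` and the simple channel replication below are deliberately simplified stand-ins for the cited hole-repair algorithm [2] and the full HHA encoding [3]; the function is illustrative, not the patented implementation.

```python
# Minimal sketch of the S2 chain: denoise -> fill holes -> 3-channel encode ->
# normalize to 0-255. cv2.inpaint and the channel replication are stand-ins for
# the global-optimization repair [2] and HHA encoding [3] used in the embodiment.
import cv2
import numpy as np

def preprocess_depth(depth_raw: np.ndarray) -> np.ndarray:
    """depth_raw: single-channel depth map (uint16 or float, 0 = missing)."""
    depth = depth_raw.astype(np.float32)

    # 1) Remove part of the noise with a 5x5 Gaussian filter.
    depth = cv2.GaussianBlur(depth, (5, 5), 0)

    # 2) Fill holes (zero-valued pixels). Simple inpainting over the missing-value
    #    mask stands in for the cited repair algorithm [2].
    hole_mask = (depth_raw == 0).astype(np.uint8)
    depth_u8 = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    depth_filled = cv2.inpaint(depth_u8, hole_mask, 3, cv2.INPAINT_NS)

    # 3) Re-encode the single channel into three channels and normalize to 0-255.
    #    A real implementation would compute HHA (horizontal disparity, height
    #    above ground, angle of the surface normal); replication is a placeholder.
    three_channel = cv2.merge([depth_filled, depth_filled, depth_filled])
    return cv2.normalize(three_channel, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```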
- S3: The RGB network stream of the asymmetric dual-stream network model adopts DarkNet-53 [4]; its network structure is shown in Figure 2-1.
- The network contains 52 convolutional layers. Layers L1-L10 extract the general features of the RGB image and output RGB_FP_C; layers L11-L27 extract the low-level color-texture features and output RGB_FP_H; layers L28-L44 extract the mid-level edge-contour features and output RGB_FP_M; layers L45-L52 extract the high-level semantic features and output RGB_FP_L.
- The DarkNet-53 model used in this embodiment is only one specific embodiment of the RGB network stream of the asymmetric dual-stream network, which is not limited to DarkNet-53; the following discussion uses DarkNet-53 only as an example.
- Step S310: Obtain the original RGB image from S110 and extract the general features of the RGB image through layers L1-L10 of the DarkNet-53 network, down-sampling the image resolution by a factor of K; output the RGB general feature map RGB_FP_C, whose size becomes 1/K of the original input size. In this embodiment the value of K is 8.
- Layers L1-L10 can be divided into three sub-sampling stages, L1-L2, L3-L5, and L6-L10; each stage down-samples the resolution of its input by a factor of 2.
- The first sub-sampling stage consists of a standard convolution block with stride 1 (denoted Conv0) and a pooling convolution block with stride 2 (denoted Conv0_pool). The general structure of a convolution block is shown in Figure 2-3: it comprises a standard image convolution layer, a batch normalization layer, and a Leaky ReLU activation layer.
- The second sub-sampling stage consists of one residual convolution block (denoted Residual_Block_1) and one pooling convolution block (denoted Conv1_pool). The general structure of a residual convolution block is shown in Figure 2-4: it comprises a 1x1xM standard convolution block, a 3x3xN standard convolution block, and an Add module that passes the identity mapping of the input to the output, where M is the number of input feature channels and N is the number of output feature channels.
- The third sub-sampling stage consists of two residual convolution blocks (denoted Residual_Block_2_1 to 2_2) and one pooling convolution block (denoted Conv2_pool).
- The values of M and N for each block are shown in layers L1-L10 of Figure 2-1.
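A PyTorch sketch of the two building blocks referenced above (Figures 2-3 and 2-4). The Leaky ReLU slope, the padding, and the rule that the residual block's output width must match its input (so that the identity Add is valid) are assumptions not fixed by the text.

```python
# PyTorch sketch of the convolution block (Figure 2-3) and residual block (Figure 2-4).
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Standard convolution block: Conv2d -> BatchNorm -> LeakyReLU (Figure 2-3)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResidualBlock(nn.Module):
    """Residual block: 1x1xM conv, 3x3xN conv, plus identity Add (Figure 2-4).
    For the Add to be valid, N must equal the number of input channels."""
    def __init__(self, channels, m):
        super().__init__()
        self.reduce = ConvBlock(channels, m, kernel_size=1)   # 1x1xM
        self.expand = ConvBlock(m, channels, kernel_size=3)   # 3x3xN, N == channels

    def forward(self, x):
        return x + self.expand(self.reduce(x))

# A pooling convolution block (e.g. Conv0_pool) is simply a stride-2 ConvBlock.
```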
- Step S320: Obtain RGB_FP_C from S310 and extract the low-level color-texture features of the RGB image through layers L11-L27 of the DarkNet-53 network, down-sampling the resolution by a further factor of K; output the RGB high-resolution feature map RGB_FP_H.
- Layers L11-L27 consist of eight residual convolution blocks (denoted Residual_Block_3_1 to 3_8) and one pooling convolution block (Conv3_pool). The value of K is 2, and the values of M and N are shown in layers L11-L27 of Figure 2-1.
- Step S330: Obtain RGB_FP_H from S320 and extract the mid-level edge-contour features of the RGB image through layers L28-L44 of the DarkNet-53 network, down-sampling the resolution by a further factor of K; output the RGB medium-resolution feature map RGB_FP_M.
- Layers L28-L44 consist of eight residual convolution blocks (denoted Residual_Block_4_1 to 4_8) and one pooling convolution block (Conv4_pool). The value of K is 2, and the values of M and N are shown in layers L28-L44 of Figure 2-1.
- Step S340: Obtain RGB_FP_M from S330 and extract the high-level semantic features of the RGB image through layers L45-L52 of the DarkNet-53 network, down-sampling the resolution by a further factor of K; output the RGB low-resolution feature map RGB_FP_L.
- Layers L45-L52 consist of four residual convolution blocks (denoted Residual_Block_5_1 to 5_4). The value of K is 2, and the values of M and N are shown in layers L45-L52 of Figure 2-1.
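For reference, the stage layout of the RGB stream described in steps S310-S340 can be summarized as a small configuration table; the channel widths (M, N) are left to Figure 2-1, and the stage names below are only descriptive.

```python
# Stage layout of the RGB stream (steps S310-S340): layer range, number of
# residual blocks, downsampling factor K, and the feature map it outputs.
DARKNET53_RGB_STAGES = [
    ("L1-L10",  3, 8, "RGB_FP_C"),  # stem: Residual_Block_1 + Residual_Block_2_1..2_2
    ("L11-L27", 8, 2, "RGB_FP_H"),  # Residual_Block_3_1..3_8 + Conv3_pool
    ("L28-L44", 8, 2, "RGB_FP_M"),  # Residual_Block_4_1..4_8 + Conv4_pool
    ("L45-L52", 4, 2, "RGB_FP_L"),  # Residual_Block_5_1..5_4 (no pooling block)
]

def cumulative_downsampling(stages):
    """Product of the per-stage K factors stated in the text."""
    factor = 1
    for _, _, k, _ in stages:
        factor *= k
    return factor
```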
- S3': Obtain the encoded and normalized Depth image from S2 and use the Depth network stream of the asymmetric dual-stream network model to extract the general, low-level, mid-level, and high-level features of the Depth image at different network levels; output the corresponding general feature map and the high-, medium-, and low-resolution Depth feature maps, denoted D_FP_C, D_FP_H, D_FP_M, and D_FP_L, and input D_FP_H, D_FP_M, and D_FP_L to S4'.
- The Depth network stream of the asymmetric dual-stream network model is obtained by pruning the RGB network stream DarkNet-53 and is hereinafter referred to as MiniDepth-30.
- The MiniDepth-30 network can extract semantic features such as the edge contours of the depth image more effectively and clearly, while reducing the number of network parameters and preventing over-fitting.
- The network structure of MiniDepth-30 is shown in Figure 2-2.
- The network contains 30 convolutional layers in total. Layers L1-L10 extract the general features of the Depth image and output D_FP_C; layers L11-L17 extract the low-level color-texture features and output D_FP_H; layers L18-L24 extract the mid-level edge-contour features and output D_FP_M; layers L25-L30 extract the high-level semantic features and output D_FP_L.
- The MiniDepth-30 model used in this embodiment is only one specific embodiment of the Depth network stream of the asymmetric dual-stream network, which is not limited to MiniDepth-30.
- Step S310': Obtain the encoded and normalized Depth image from S2 and extract the general features of the Depth image through layers L1-L10 of the MiniDepth-30 network, down-sampling the image resolution by a factor of K; output the general Depth feature map D_FP_C, whose size becomes 1/K of the original input size.
- The L1-L10 network layers of MiniDepth-30 have the same structure as the L1-L10 layers of DarkNet-53 in step S310, and the value of K is 8.
- Step S320': Obtain D_FP_C from step S310' and extract the low-level color-texture features of the Depth image through layers L11-L17 of the MiniDepth-30 network, down-sampling the resolution by a further factor of K; output the Depth high-resolution feature map D_FP_H.
- Layers L11-L17 consist of three residual convolution blocks (denoted Residual_Block_D_3_1 to 3_3) and one pooling convolution block (Conv3_D_pool). The value of K is 2, and the values of M and N are shown in layers L11-L17 of Figure 2-2.
- Step S330': Obtain D_FP_H from step S320' and extract the mid-level edge-contour features of the Depth image through layers L18-L24 of the MiniDepth-30 network, down-sampling the resolution by a further factor of K; output the Depth medium-resolution feature map D_FP_M.
- Layers L18-L24 consist of three residual convolution blocks (denoted Residual_Block_D_4_1 to 4_3) and one pooling convolution block (Conv4_D_pool). The value of K is 2, and the values of M and N are shown in layers L18-L24 of Figure 2-2.
- Step S340': Obtain D_FP_M from step S330' and extract the high-level semantic features of the Depth image through layers L25-L30 of the MiniDepth-30 network, down-sampling the resolution by a further factor of K; output the Depth low-resolution feature map D_FP_L.
- Layers L25-L30 consist of three residual convolution blocks (denoted Residual_Block_D_5_1 to 5_3). The value of K is 2, and the values of M and N are shown in layers L25-L30 of Figure 2-2.
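The pruning relationship between the two streams, as stated in the text, can be checked with a short tally: each residual block contributes two convolution layers and each standard or pooling convolution block contributes one, giving 52 layers for DarkNet-53 and 30 for MiniDepth-30. The grouping below is a reading of the layer ranges given above, not an official table.

```python
# Residual-block counts per stage: the Depth stream (MiniDepth-30) is a pruned
# copy of the RGB stream (DarkNet-53). The stem (L1-L10) keeps the same layout;
# the three later stages shrink from 8 / 8 / 4 residual blocks to 3 / 3 / 3.
RESIDUAL_BLOCKS = {
    #  stage           (DarkNet-53, MiniDepth-30)
    "stem":              (1 + 2,      1 + 2),   # Residual_Block_1 + Residual_Block_2_1..2_2
    "high-res stage":    (8,          3),       # -> RGB_FP_H / D_FP_H
    "mid-res stage":     (8,          3),       # -> RGB_FP_M / D_FP_M
    "low-res stage":     (4,          3),       # -> RGB_FP_L / D_FP_L
}

def conv_layer_count(residual_blocks, pooling_convs=5, stem_convs=1):
    """Each residual block holds 2 conv layers; pooling and stem convs hold 1 each."""
    return 2 * sum(residual_blocks) + pooling_convs + stem_convs

print(conv_layer_count([n for n, _ in RESIDUAL_BLOCKS.values()]))  # 52 (DarkNet-53)
print(conv_layer_count([n for _, n in RESIDUAL_BLOCKS.values()]))  # 30 (MiniDepth-30)
```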
- S4: Obtain RGB_FP_H, RGB_FP_M, and RGB_FP_L from S3, use up-sampling to expand the feature-map sizes, merge the feature channels of RGB feature maps of the same resolution to achieve feature fusion, and output the fused feature maps RGB_FP_H, RGB_FP_M, and RGB_FP_L to S5.
- Step S410: The RGB_FP_L obtained in step S340 is up-sampled by a factor of M and then channel-merged with the RGB_FP_M obtained in step S330, realizing the complementary fusion of the deep high-level semantic features of the RGB network with the mid-level edge-contour features of its middle layers; the fused result is output as the new feature map RGB_FP_M.
- Channel merging is performed as follows: if the number of channels of RGB_FP_L is C1 and the number of channels of RGB_FP_M is C2, the two are concatenated to give C3 = C1 + C2 channels, where C3 is the number of channels of the new fused feature map RGB_FP_M. In this embodiment the value of M is 2, and the values of C1, C2, and C3 are 256, 512, and 768, respectively.
- Step S420: The new fused feature map RGB_FP_M from step S410 is up-sampled and then channel-merged with the RGB_FP_H obtained in step S320, realizing the complementary fusion of the deep high-level semantic features, the mid-level edge-contour features, and the shallow low-level color-texture features of the RGB network; the fused result is output as the new feature map RGB_FP_H.
- Channel merging is performed in the same way: the C1 channels of RGB_FP_M and the C2 channels of RGB_FP_H are concatenated to give C3 = C1 + C2 channels for the new fused feature map RGB_FP_H. The value of M is 2, and the values of C1, C2, and C3 are 128, 256, and 384, respectively.
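A PyTorch sketch of the fusion step (up-sample by 2, then concatenate along the channel dimension), which is reused unchanged for the Depth stream in S4'. Nearest-neighbour interpolation and the example spatial sizes are assumptions; the channel counts follow the embodiment.

```python
# PyTorch sketch of steps S410/S420: 2x up-sampling followed by channel
# concatenation. Nearest-neighbour interpolation is an assumption (the text only
# specifies "up-sampling"); the same function serves the Depth stream (S4').
import torch
import torch.nn.functional as F

def fuse_up(low_res, high_res):
    """Upsample the lower-resolution map by 2x and concatenate along channels."""
    up = F.interpolate(low_res, scale_factor=2, mode="nearest")
    return torch.cat([up, high_res], dim=1)

# Example with the channel counts given in the embodiment (spatial sizes assumed):
rgb_fp_l = torch.randn(1, 256, 13, 13)                    # C1 = 256
rgb_fp_m = torch.randn(1, 512, 26, 26)                    # C2 = 512
new_m = fuse_up(rgb_fp_l, rgb_fp_m)                       # C3 = 768 channels

rgb_fp_h = torch.randn(1, 256, 52, 52)                    # C2 = 256
new_h = fuse_up(torch.randn(1, 128, 26, 26), rgb_fp_h)    # C1 = 128 -> C3 = 384
```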
- S4': Obtain D_FP_H, D_FP_M, and D_FP_L from S3', use up-sampling to expand the feature-map sizes, merge the feature channels of Depth feature maps of the same resolution to achieve feature fusion, and output the fused feature maps D_FP_H, D_FP_M, and D_FP_L to S5.
- Step S410': The D_FP_L obtained in step S340' is up-sampled by a factor of M and then channel-merged with the D_FP_M obtained in step S330', realizing the complementary fusion of the deep high-level semantic features of the Depth network with the mid-level edge-contour features of its middle layers; the fused result is output as the new feature map D_FP_M.
- Channel merging is performed as in S410: the C1 channels of D_FP_L and the C2 channels of D_FP_M are concatenated to give C3 = C1 + C2 channels for the new fused feature map D_FP_M. The value of M is 2, and the values of C1, C2, and C3 are 256, 512, and 768, respectively.
- Step S420': The new fused feature map D_FP_M from step S410' is up-sampled and then channel-merged with the D_FP_H obtained in step S320', realizing the complementary fusion of the deep high-level semantic features, the mid-level edge-contour features, and the shallow low-level color-texture features of the Depth network; the fused result is output as the new feature map D_FP_H.
- Channel merging is performed in the same way: the C1 channels of D_FP_M and the C2 channels of D_FP_H are concatenated to give C3 = C1 + C2 channels for the new fused feature map D_FP_H. The value of M is 2, and the values of C1, C2, and C3 are 128, 256, and 384, respectively.
- S5: Obtain the new fused feature maps RGB_FP_H, RGB_FP_M, and RGB_FP_L from S4 and the new fused feature maps D_FP_H, D_FP_M, and D_FP_L from S4', and merge the feature channels of the maps of equal resolution; the channel-merged feature maps are denoted Concat_L, Concat_M, and Concat_H. The channel reweighting module (hereinafter RW_Module) is then applied to linearly weight Concat_L, Concat_M, and Concat_H, and the channel-reweighted low-, medium-, and high-resolution feature maps are output, denoted RW_L, RW_M, and RW_H, respectively.
- Step S510: Obtain RGB_FP_L from S4 and D_FP_L from S4'. Their feature channels are first merged to obtain Concat_L, realizing the complementary fusion of the deep multi-modal RGB and Depth information; the channel reweighting module RW_Module then linearly weights Concat_L, assigning a weight to each feature channel, and the channel-reweighted feature map RW_L is output.
- Taking the channel reweighting of RGB_FP_L and D_FP_L as an example, the general structure of the channel reweighting module provided in this embodiment is shown in Figure 4. Let the number of channels of RGB_FP_L be C1 and the number of channels of D_FP_L be C2, so that Concat_L has C3 = C1 + C2 channels.
- Concat_L passes in turn through a 1x1 average-pooling layer (pooling each channel to a single value), a standard convolution layer consisting of C3/s 1x1 convolution kernels (where s is the reduction step size), a standard convolution layer consisting of C3 1x1 convolution kernels, and a Sigmoid layer, yielding C3 weight values in the range 0-1; the C3 weights are then multiplied with the C3 feature channels of Concat_L, so that each channel is assigned a weight, and the weighted C3-channel feature map, namely RW_L, is output.
- In this embodiment, the values of C1, C2, and C3 are 1024, 1024, and 2048, respectively, and the value of the reduction step size s is 16.
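A PyTorch sketch of the channel reweighting module RW_Module of Figure 4, which behaves like a squeeze-and-excitation style gate over the concatenated RGB and Depth channels. Whether an activation sits between the two 1x1 convolutions is not specified in the text, so none is used here; the channel counts follow the embodiment.

```python
# PyTorch sketch of RW_Module (Figure 4): average-pool to 1x1, a 1x1 conv with
# C3/s kernels, a 1x1 conv with C3 kernels, a Sigmoid, then channel-wise
# multiplication with the concatenated input.
import torch
import torch.nn as nn

class RWModule(nn.Module):
    def __init__(self, c3: int, s: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # C3 x 1 x 1 descriptor
        self.fc = nn.Sequential(
            nn.Conv2d(c3, c3 // s, kernel_size=1),     # C3/s 1x1 kernels
            nn.Conv2d(c3 // s, c3, kernel_size=1),     # C3 1x1 kernels
            nn.Sigmoid(),                              # C3 weights in (0, 1)
        )

    def forward(self, rgb_fp, d_fp):
        concat = torch.cat([rgb_fp, d_fp], dim=1)      # Concat_*: C3 = C1 + C2
        weights = self.fc(self.pool(concat))           # one weight per channel
        return concat * weights                        # broadcast over H x W

# The three scales use the channel counts of the embodiment:
rw_l = RWModule(c3=2048, s=16)   # RGB_FP_L (1024) + D_FP_L (1024)
rw_m = RWModule(c3=1024, s=16)   # RGB_FP_M (512)  + D_FP_M (512)
rw_h = RWModule(c3=512,  s=16)   # RGB_FP_H (256)  + D_FP_H (256)
```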
- Step S520: Obtain RGB_FP_M from step S410 and D_FP_M from step S410'. Their feature channels are first merged to obtain Concat_M, realizing the complementary fusion of RGB and Depth multi-modal information in the middle layers of the network; the channel reweighting module RW_Module then linearly weights Concat_M, assigning a weight to each feature channel, and the channel-reweighted feature map RW_M is output.
- The channel reweighting of RGB_FP_M and D_FP_M is performed in the same way as that of RGB_FP_L and D_FP_L in step S510, where the values of C1, C2, and C3 are 512, 512, and 1024, respectively, and the reduction step size s is 16.
- Step S530: Obtain RGB_FP_H from step S420 and D_FP_H from step S420'. Their feature channels are first merged to obtain Concat_H, realizing the complementary fusion of RGB and Depth multi-modal information in the shallow layers of the network; the channel reweighting module RW_Module then linearly weights Concat_H, assigning a weight to each feature channel, and the channel-reweighted feature map RW_H is output.
- The channel reweighting of RGB_FP_H and D_FP_H is performed in the same way as that of RGB_FP_L and D_FP_L in step S510, where the values of C1, C2, and C3 are 256, 256, and 512, respectively, and the reduction step size s is 16.
- S6: Obtain the channel-reweighted feature maps RW_L, RW_M, and RW_H from S5 and perform classification and bounding-box coordinate regression on each, obtaining prediction results for persons of larger, medium, and smaller sizes. The prediction results of the three scales are aggregated, and non-maximum suppression (NMS) is applied to remove overlapping target boxes; the finally retained person detection results are output, namely each person's class confidence score C_i and predicted bounding rectangle, i = 1, 2, ..., N, where i is the person's ID number, N is the total number of person detections retained in the current image, and the four box coordinates are the horizontal and vertical coordinates of the rectangle's upper-left corner and the horizontal and vertical coordinates of its lower-right corner.
- Step S610: Obtain the channel-reweighted low-resolution feature map RW_L from step S510 and pass it to the SoftMax classification layer and the coordinate regression layer; output, for the low-resolution feature map, the class confidence scores of predicted larger-size persons and the upper-left and lower-right corner coordinates of their rectangular bounding boxes. The subscript L denotes a prediction result under the low-resolution feature map.
- Step S620: Obtain the channel-reweighted medium-resolution feature map RW_M from step S520 and pass it to the SoftMax classification layer and the coordinate regression layer; output, for the medium-resolution feature map, the class confidence scores of predicted medium-size persons and the upper-left and lower-right corner coordinates of their rectangular bounding boxes. The subscript M denotes a prediction result under the medium-resolution feature map.
- Step S630: Obtain the channel-reweighted high-resolution feature map RW_H from step S530 and pass it to the SoftMax classification layer and the coordinate regression layer; output, for the high-resolution feature map, the class confidence scores of predicted smaller-size persons and the upper-left and lower-right corner coordinates of their rectangular bounding boxes. The subscript H denotes a prediction result under the high-resolution feature map.
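A minimal sketch of one prediction branch (steps S610-S630). The patent only names a SoftMax classification layer and a coordinate regression layer; the single 1x1 convolution head, the two-class setting, and the absence of anchor boxes below are simplifying assumptions.

```python
# Minimal sketch of one prediction branch: at every location of a channel-
# reweighted feature map, predict class scores (SoftMax) and four box coordinates.
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, in_channels: int, num_classes: int = 2):
        super().__init__()
        self.num_classes = num_classes
        # per location: num_classes class logits + 4 box coordinates (x1, y1, x2, y2)
        self.head = nn.Conv2d(in_channels, num_classes + 4, kernel_size=1)

    def forward(self, feat):
        out = self.head(feat)                                  # B x (C+4) x H x W
        cls_scores = torch.softmax(out[:, : self.num_classes], dim=1)
        boxes = out[:, self.num_classes:]                      # regression outputs
        return cls_scores, boxes

# One head per scale, applied to RW_L, RW_M, and RW_H respectively:
head_l, head_m, head_h = PredictionHead(2048), PredictionHead(1024), PredictionHead(512)
```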
- Step S640: Obtain from steps S610, S620, and S630 the class confidence scores and the upper-left and lower-right bounding-box coordinates of persons of larger, medium, and smaller sizes, and apply the NMS algorithm to remove overlapping boxes. The flow chart of the NMS algorithm is shown in Figure 5.
- Step S640-1: Aggregate the prediction results of the three scales obtained from steps S610, S620, and S630, filter the prediction boxes with a confidence threshold, keep the boxes whose class confidence scores are greater than the threshold, and add them to the prediction list. In this embodiment the confidence threshold is set to 0.3.
- Step S640-2: Sort the unprocessed prediction boxes in the prediction list from step S640-1 in descending order of confidence score, and output the sorted prediction list.
- Step S640-3: From the sorted prediction list of step S640-2, select the box with the maximum confidence score as the current reference box, add its class confidence score and box coordinates to the final result list, remove the reference box from the prediction list, and compute the intersection-over-union (IoU) between every remaining prediction box and the current reference box.
- Step S640-4: For the prediction list and the IoU values obtained in step S640-3, if the IoU of a box with the reference box is greater than the preset NMS threshold, the box is considered a duplicate of the reference target and is removed from the prediction list; otherwise the box is kept. Output the filtered prediction list.
- Step S640-5: For the filtered prediction list from step S640-4, if all boxes in the list have been processed (the list is empty), the algorithm ends and the final result list is returned; otherwise, if unprocessed boxes remain, return to step S640-2 and repeat.
- Step S640-6: When no unprocessed prediction box remains in the prediction list after step S640-5, output the final result list as the finally retained person detection results.
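The NMS procedure of steps S640-1 to S640-6 can be sketched directly from the description. The confidence threshold 0.3 comes from the embodiment; the IoU (NMS) threshold value is not given in the text, so 0.5 below is a placeholder.

```python
# Minimal sketch of the NMS procedure of steps S640-1 to S640-6.
# Boxes are (x1, y1, x2, y2, score).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(predictions, conf_thresh=0.3, nms_thresh=0.5):
    # S640-1: keep only boxes whose confidence exceeds the threshold.
    pred_list = [p for p in predictions if p[4] > conf_thresh]
    results = []
    while pred_list:                                    # S640-5/6: loop until empty
        # S640-2: sort remaining boxes by descending confidence.
        pred_list.sort(key=lambda p: p[4], reverse=True)
        # S640-3: take the highest-confidence box as the reference.
        ref = pred_list.pop(0)
        results.append(ref)
        # S640-4: drop boxes that overlap the reference above the NMS threshold.
        pred_list = [p for p in pred_list if iou(p, ref) <= nms_thresh]
    return results
```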
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
The present invention relates to the field of computer vision and image processing, and discloses an RGB-D multi-mode fusion personnel detection method based on an asymmetric double-stream network. The method comprises the steps of RGBD image collection, depth image preprocessing, RGB feature extraction and Depth feature extraction, RGB multi-scale fusion and Depth multi-scale fusion, multi-mode feature channel reweighting, and multi-scale personnel prediction. An asymmetric RGBD double-stream convolutional neural network model is designed to solve the problem that traditional symmetric RGBD double-stream networks are prone to causing depth feature loss. Multi-scale fusion structures are designed for the two streams of the RGBD double-stream network so that multi-scale information complementation is achieved. A multi-mode reweighting structure is constructed: RGB and Depth feature maps are combined, and a weight is assigned to each combined feature channel so that the model automatically learns each channel's contribution. Personnel classification and bounding-box regression are performed using the multi-mode features, improving the accuracy of personnel detection while ensuring real-time performance and enhancing the robustness of detection under low illumination at night and under occlusion of persons.
Description
In recent years, smart homes, smart buildings, and intelligent security have developed rapidly, and the wide application of video extraction and analysis technology has become a key driving force of this progress; within it, person detection and counting have gradually become a popular research topic in image and video analysis and artificial intelligence. In smart homes, detecting indoor occupants makes it possible to locate people, record their behavior and habits, and further adjust smart devices such as indoor lighting and air conditioning, providing a more comfortable and intelligent home environment. In smart buildings, person detection can be applied to service robots for precise obstacle avoidance and office document delivery, and, according to the location and density of indoor personnel, the comfort of office areas can be adjusted automatically to improve office efficiency. In intelligent security, person detection in surveillance video can be used for identity verification, effectively responding to illegal intrusion by strangers, and for tracking suspicious persons and analyzing abnormal behavior, providing core video-information support for an intelligent security system.
There is one representative prior-art technique.
(1) Title of invention: A pedestrian detection and identity recognition method and system based on RGBD (Application No. 201710272095).
本发明提供了一种基于RGBD的行人检测和身份识别方法,方法包括:输 入RGB和深度图像,并对图像进行预处理,转换颜色通道;然后构建RGB和深度图像的多通道特征,具体的,首先计算RGB图像的水平梯度和垂直梯度构建RGB梯度方向直方图特征,以及深度图像的水平梯度、垂直梯度和深度法向量方向,构建深度图像的梯度方向直方图,作为RGBD的多通道特征;计算深度图像每个像素点对应的尺度,对尺度进行量化,获取尺度列表;根据多通道特征,采用Adaboost算法训练行人检测分类器;采用检测分类器,搜索尺度列表对应的尺度空间,得到包含行人信息的外接矩形框,完成行人检测The present invention provides a pedestrian detection and identity recognition method based on RGBD. The method includes: inputting RGB and depth images, preprocessing the images, and converting color channels; and then constructing multi-channel features of RGB and depth images. Specifically, First calculate the horizontal gradient and vertical gradient of the RGB image to construct the RGB gradient direction histogram feature, as well as the horizontal gradient, vertical gradient and depth normal vector direction of the depth image, construct the gradient direction histogram of the depth image, as the multi-channel feature of RGBD; calculate; The scale corresponding to each pixel of the depth image is quantified to obtain the scale list; according to the multi-channel features, the Adaboost algorithm is used to train the pedestrian detection classifier; the detection classifier is used to search the scale space corresponding to the scale list to obtain pedestrian information Circumscribed rectangular frame to complete pedestrian detection
但此方法需要手工提取传统的RGBD图像的梯度方向直方图作为图像特征,耗时费力且占用较大存储空间,无法端到端的实现行人检测;梯度方向直方图特征较为简单,难以提取RGB和深度图像中具有辨识力的特征进行行人检测;该方法采用RGB和深度图像特征的简单融合,难以充分有效地挖掘利用RGB图像的色彩、纹理等细粒度信息和深度图像提供的边缘、深度等语义信息,实现多模态数据之间的关联互补,在提升行人检测的精确度方面具有很大的局限性。However, this method needs to manually extract the gradient direction histogram of the traditional RGBD image as the image feature, which is time-consuming and labor-intensive and takes up a large storage space. It cannot achieve end-to-end pedestrian detection; the gradient direction histogram feature is relatively simple, and it is difficult to extract RGB and depth. Pedestrian detection is performed on the distinguishing features in the image; this method uses the simple fusion of RGB and depth image features, and it is difficult to fully and effectively mine the fine-grained information such as color and texture of the RGB image and the semantic information such as edge and depth provided by the depth image. , To realize the correlation and complementation between multi-modal data, which has great limitations in improving the accuracy of pedestrian detection.
发明内容Summary of the invention
针对现有技术中的缺陷,本发明提供了一种基于非对称双流网络的RGBD多模态融合人员检测方法,但不限于人员检测,也可以应用于目标检测、车辆检测等任务。Aiming at the defects in the prior art, the present invention provides a RGBD multi-modal fusion personnel detection method based on an asymmetric dual-stream network, but it is not limited to personnel detection, and can also be applied to tasks such as target detection and vehicle detection.
本发明提供的一种基于非对称双流网络的RGBD多模态融合人员检测方法代表图如图1所示,包含RGBD图像采集,深度图像预处理,RGB特征提取和Depth特征提取,RGB多尺度融合和Depth多尺度融合,多模态特征通道重加权以及多尺度人员预测,各步骤的具体功能如下:The representative diagram of a RGBD multi-modal fusion personnel detection method based on an asymmetric dual-stream network provided by the present invention is shown in Figure 1, including RGBD image acquisition, depth image preprocessing, RGB feature extraction and Depth feature extraction, and RGB multi-scale fusion Multi-scale fusion with Depth, multi-modal feature channel weighting, and multi-scale personnel prediction, the specific functions of each step are as follows:
S1 RGBD图像采集;S1 RGBD image collection;
利用具有同时拍摄RGB图像和深度图像功能的相机获取原始RGB图像和深度图像(以下简称为Depth图像),并对RGB和Depth图像进行匹配分组,每组图像由一张RGB图像和同场景下捕获的Depth图像组成,输出分组匹配后的RGB和Depth图像。原始RGB图像和Depth图像也可以从公开RGBD数据集获取。The original RGB image and depth image (hereinafter referred to as Depth image) are obtained by a camera with the function of shooting RGB image and depth image at the same time, and the RGB and Depth images are matched and grouped. Each group of images consists of an RGB image and the same scene. The captured Depth image is composed, and the grouped and matched RGB and Depth images are output. The original RGB image and Depth image can also be obtained from the public RGBD data set.
S2深度图像预处理;S2 depth image preprocessing;
从S1的RGBD图像采集获取分组匹配后的Depth图像,首先消除Depth图像的部分噪声,然后进行空洞填充,最后将单通道Depth图像重新编码为三个通道图像,并将三个通道的图像数值重新规范化到0-255,输出编码规范化后的Depth图像。Obtain the grouped and matched Depth image from the RGBD image of S1. First, remove part of the noise of the Depth image, then fill in the holes, and finally re-encode the single-channel Depth image into three-channel images, and re-encode the image values of the three channels Normalize to 0-255, and output the Depth image after encoding normalization.
S3 RGB特征提取和Depth特征提取;S3 RGB feature extraction and Depth feature extraction;
从所述S1的RGBD图像采集获取原始RGB图像,输入到RGB特征提取(非对称双流网络模型的RGB网络流),进行下采样特征提取,输出RGB图像的高、中、低分辨率特征图,分别记为RGB_FP_H、RGB_FP_M、RGB_FP_L,代表RGB图像的低级色彩纹理、中级边缘轮廓和高级语义特征表示;从深度图像预处理获取编码规范化后的Depth图像,输入到Depth特征提取(非对称双流网络模型的Depth网络流),进行下采样特征提取,输出Depth图像的高、中、低分辨率特征图,分别记为D_FP_H、D_FP_M、D_FP_L,代表Depth图像的低级色彩纹理、中级边缘轮廓和高级语义特征表示。RGB网络流和Depth网络流是对称结构的,即RGB网络流和Depth网络流的结构完全相同。但Depth图像所包含的特征相对于RGB图像更简单,当采用与RGB网络相同深度的卷积网络结构提取Depth特征时,会由于网络传递过深而导致Depth特征消失,同时网络参数增加了过拟合的风险。基于上述原因,设计非对称双流卷积神经网络模型提取RGB图像和Depth图像特征。图2-1至图2-4为本方法设计的非对称双流卷积神经网络模型的一种具体实施例结构,但不限于图2-1至图2-4所示的结构。图2-1所述DarkNet-53和图2-2所述MiniDepth-30分别代表RGB网络流和Depth网络流,二者的网络结构具有非对称的特性。Obtain the original RGB image from the RGBD image collection of the S1, input it to the RGB feature extraction (RGB network stream of the asymmetric dual-stream network model), perform down-sampling feature extraction, and output the high, medium, and low resolution feature maps of the RGB image, Denoted as RGB_FP_H, RGB_FP_M, RGB_FP_L, which represent the low-level color texture, intermediate edge contour, and high-level semantic feature representation of RGB images; the Depth image after encoding and normalization is obtained from the depth image preprocessing, and input to the Depth feature extraction (asymmetric dual-stream network model) Depth network stream), perform down-sampling feature extraction, and output the high, medium, and low resolution feature maps of the Depth image, which are respectively denoted as D_FP_H, D_FP_M, and D_FP_L, representing the low-level color texture, intermediate edge contour and high-level semantic features of the Depth image Said. The RGB network stream and the Depth network stream have a symmetrical structure, that is, the structure of the RGB network stream and the Depth network stream are exactly the same. However, the features contained in the Depth image are simpler than the RGB image. When the Depth feature is extracted using the convolutional network structure with the same depth as the RGB network, the Depth feature will disappear due to the network transmission being too deep, and the network parameters will increase overly. The risk of cooperation. Based on the above reasons, an asymmetric dual-stream convolutional neural network model is designed to extract the features of RGB image and Depth image. Figures 2-1 to 2-4 are a specific embodiment structure of the asymmetric dual-stream convolutional neural network model designed by the method, but are not limited to the structures shown in Figures 2-1 to 2-4. The DarkNet-53 described in Figure 2-1 and the MiniDepth-30 described in Figure 2-2 respectively represent the RGB network stream and the Depth network stream, and their network structures are asymmetrical.
S4 RGB多尺度融合和Depth多尺度融合;S4 RGB multi-scale fusion and Depth multi-scale fusion;
从RGB特征提取获取RGB特征图RGB_FP_H、RGB_FP_M、RGB_FP_L输入到RGB多尺度融合,首先将获取的RGB_FP_L通过上采样层拓展到与RGB_FP_M相同尺寸,然后与RGB_FP_M进行通道合并,实现RGB网络深层的高级语义特征与中间层的中级边缘轮廓特征的互补融合,输出通道合并后的新特征图RGB_FP_M;然后对输出通道合并后的新特征图RGB_FP_M,通过上采样层拓展到与RGB_FP_H相同尺寸,与RGB_FP_H进行通道合并,实现RGB网络深层的高级语义特征、中间层的中级边缘轮廓特征以及浅层的低级色彩纹理特征的互补融合,输出通道合并后的新特征图RGB_FP_H;从Depth特征提取获 取Depth特征图D_FP_H、D_FP_M、D_FP_L输入到Depth多尺度融合,与RGB多尺度融合执行同样的操作。最终Depth多尺度融合的输出为原始输入RGB_FP_L、通道合并后的新特征图RGB_FP_M和RGB_FP_H;Depth多尺度融合的输出为原始输入D_FP_L、通道合并后的新特征图D_FP_M和D_FP_H。S5多模态特征通道重加权;From RGB feature extraction to obtain RGB feature maps RGB_FP_H, RGB_FP_M, RGB_FP_L input to the RGB multi-scale fusion, first expand the obtained RGB_FP_L to the same size as RGB_FP_M through the upsampling layer, and then merge the channels with RGB_FP_M to realize the advanced semantics of the RGB network deep layer The complementary fusion of features and the mid-level edge contour features of the middle layer, the output channel merged new feature map RGB_FP_M; then the new feature map RGB_FP_M after the output channel merged is expanded to the same size as RGB_FP_H through the upsampling layer, and the channel is channeled with RGB_FP_H Merge, realize the complementary fusion of the deep high-level semantic features of the RGB network, the middle-level edge contour features of the middle layer, and the shallow low-level color texture features, and output the new feature map RGB_FP_H after channel merging; obtain the Depth feature map D_FP_H, from the Depth feature extraction, D_FP_M and D_FP_L are input to Depth multi-scale fusion, and perform the same operation as RGB multi-scale fusion. The final output of Depth multi-scale fusion is the original input RGB_FP_L, the new feature maps RGB_FP_M and RGB_FP_H after channel merging; the output of Depth multi-scale fusion is the original input D_FP_L, the new feature maps D_FP_M and D_FP_H after channel merging. S5 multi-modal feature channel re-weighting;
The RGB feature maps RGB_FP_L, RGB_FP_M, and RGB_FP_H from RGB multi-scale fusion and the Depth feature maps D_FP_L, D_FP_M, and D_FP_H from Depth multi-scale fusion are grouped by resolution and fed into the channel re-weighting structure of the corresponding resolution in the multi-modal feature channel re-weighting stage. This achieves a more effective multi-modal fusion of RGB and Depth features and improves detection robustness in a variety of restricted scenes. Taking the re-weighting of RGB_FP_L and D_FP_L as an example: RGB_FP_L is obtained from RGB multi-scale fusion and D_FP_L from Depth multi-scale fusion; the two are first channel-merged, and the merged feature map is denoted Concat_L. A channel re-weighting module (hereinafter RW_Module) then linearly weights the feature channels of Concat_L, assigning a weight to each channel, and the re-weighted feature map is denoted RW_L. The channel re-weighting of RGB_FP_M with D_FP_M and of RGB_FP_H with D_FP_H is performed in the same way as for RGB_FP_L and D_FP_L. The multi-modal feature channel re-weighting stage finally outputs the re-weighted low-, medium-, and high-resolution feature maps, denoted RW_L, RW_M, and RW_H, respectively.
S6: multi-scale personnel prediction;
The re-weighted feature maps RW_L, RW_M, and RW_H obtained from the multi-modal feature channel re-weighting of S5 are fed into the corresponding prediction branches of the multi-scale personnel prediction stage for classification and bounding-box coordinate regression, yielding prediction results for large, medium, and small persons. Because the feature-map resolutions differ, the receptive field of each prediction point also differs: each prediction point on RW_L has a large receptive field and is used to predict larger targets in the image; each prediction point on RW_M has a medium receptive field and is used to predict medium targets; each prediction point on RW_H has a small receptive field and is used to predict smaller targets. The prediction results of these three scales are aggregated, and the non-maximum suppression (hereinafter NMS) algorithm [1] is used to remove overlapping target boxes, outputting the finally retained personnel detection results, namely the class confidence score C_i of each person and the predicted rectangular box (x1_i, y1_i, x2_i, y2_i), i = 1, 2, ..., N. In this embodiment, i is the ID number of a person and N is the total number of personnel detection results retained in the current image; x1_i, y1_i, x2_i, and y2_i denote the top-left abscissa, top-left ordinate, bottom-right abscissa, and bottom-right ordinate of each rectangular box containing a person.
Compared with the prior art: to address the problem that a traditional symmetric RGB-D dual-stream network (RGB stream + Depth stream) easily loses depth features because the Depth network is too deep, the present invention designs an asymmetric RGB-D dual-stream convolutional neural network model in which the Depth stream is obtained by effective model pruning of the RGB stream; while reducing the number of parameters, this lowers the risk of overfitting and improves detection accuracy. The RGB stream and the Depth stream extract the high-, medium-, and low-resolution feature maps of the RGB image and the depth image (hereinafter the Depth image), respectively, representing their low-level color-texture, mid-level edge-contour, and high-level semantic features. Next, a multi-scale fusion structure is designed for each stream, so that the high-level semantic features contained in the low-resolution feature maps and the mid-level edge-contour and low-level color-texture features contained in the medium- and high-resolution feature maps complement one another across scales. A multi-modal feature channel weighting structure is then constructed: the RGB and Depth feature maps are merged and each merged feature channel is assigned a learned weight, so that the model automatically learns the contribution of each channel, performing feature selection and redundancy removal, and thereby realizing multi-modal fusion of RGB and Depth features at the corresponding high, medium, and low resolutions. Finally, the multi-modal features are used for person classification and bounding-box regression, which improves the accuracy of personnel detection while maintaining real-time performance and enhances robustness to low illumination at night and to person occlusion.
Figure 1 is a representative diagram of the RGB-D multi-modal fusion personnel detection method based on an asymmetric dual-stream network provided by the present invention.
Figure 2-1 shows the structure of an RGB stream (DarkNet-53); Figure 2-2 shows the structure of a Depth stream (MiniDepth-30); Figure 2-3 shows the general structure of a convolution block; Figure 2-4 shows the general structure of a residual convolution block.
Figure 3 is a flowchart of the RGB-D multi-modal fusion personnel detection method based on an asymmetric dual-stream network provided by an embodiment of the present invention.
Figure 4 is a general structure diagram of the channel re-weighting module provided by an embodiment of the present invention.
Figure 5 is a flowchart of the NMS algorithm provided by an embodiment of the present invention.
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present invention. The present invention is described in detail below through specific embodiments. A schematic diagram of the method provided by an embodiment of the present invention is shown in Figure 3, and it includes the following steps:
S1: Acquire the original RGB image and depth image with a camera capable of capturing RGB images and depth images simultaneously, match and group the images, and output the grouped and matched RGB and Depth images.
Step S110: Acquire the original RGB image with a camera capable of capturing RGB images and depth images simultaneously; the original RGB image can also be obtained from a public RGBD dataset.
Step S120: Synchronously acquire the Depth image matched with the RGB image from step S110, and group the RGB and Depth images, each group consisting of one RGB image and the depth image captured in the same scene; output the grouped and matched Depth images.
S2: For the grouped and matched Depth image obtained in step S120, perform denoising, hole repair, and encoding normalization, and output the encoded and preprocessed Depth image.
The original depth image obtained from step S120 is taken as input. Part of the noise in the Depth image is first removed, holes are then filled, and finally the single-channel Depth image is re-encoded as a three-channel image whose channel values are re-normalized to 0-255; the encoded and normalized Depth image is output. In this embodiment, a 5x5 Gaussian filter is used for denoising; hole repair uses the image completion algorithm proposed in [2], which extracts local normal vectors and occlusion boundaries from the Depth image and then applies global optimization to fill the holes; the Depth image is encoded with HHA encoding [3] (horizontal disparity, height above ground, and the angle of the pixel's surface normal), whose three channels are the horizontal disparity, the height above the ground, and the angle of the surface normal vector.
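As a rough illustration of S2 only (not the cited completion algorithm), the sketch below denoises with the 5x5 Gaussian filter, fills holes with OpenCV inpainting as a simple stand-in for the global-optimization method of [2], and leaves the actual HHA encoding [3] to an external routine; the function name and the use of OpenCV/NumPy are assumptions.

```python
import cv2
import numpy as np

def preprocess_depth(depth_mm: np.ndarray) -> np.ndarray:
    """Sketch of S2: denoise, fill holes, and renormalize a single-channel depth map.

    Hole filling uses OpenCV inpainting as a simple stand-in for the
    global-optimization completion cited in the description; the HHA encoding
    is assumed to be handled elsewhere, so the three channels here are only a
    placeholder stack of the repaired depth map.
    """
    # Mark missing measurements before any filtering.
    hole_mask = (depth_mm <= 0).astype(np.uint8)

    # 1) Suppress sensor noise with the 5x5 Gaussian filter from the text.
    depth = cv2.GaussianBlur(depth_mm.astype(np.float32), (5, 5), 0)

    # 2) Fill holes on an 8-bit view of the depth map.
    depth_8u = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    filled_8u = cv2.inpaint(depth_8u, hole_mask, 3, cv2.INPAINT_NS)

    # 3) Stack to three channels (placeholder for the HHA channels).
    three_channel = cv2.merge([filled_8u, filled_8u, filled_8u])

    # 4) Re-normalize the result to the 0-255 range.
    return cv2.normalize(three_channel, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```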
S3: The original RGB image obtained from step S110 is processed by the RGB stream of the asymmetric dual-stream network model, which extracts the general, low-level, mid-level, and high-level features of the RGB image at different network levels and outputs the corresponding general feature map and the RGB feature maps at high, medium, and low resolutions, denoted RGB_FP_C, RGB_FP_H, RGB_FP_M, and RGB_FP_L; RGB_FP_H, RGB_FP_M, and RGB_FP_L are input to S4. In this embodiment, the RGB stream of the asymmetric dual-stream network model adopts DarkNet-53 [4], whose network structure is shown in Figure 2-1. The network contains 52 convolutional layers: layers L1-L10 extract the general features of the RGB image and output RGB_FP_C; layers L11-L27 extract the low-level color-texture features and output RGB_FP_H; layers L28-L44 extract the mid-level edge-contour features and output RGB_FP_M; layers L45-L52 extract the high-level semantic features and output RGB_FP_L. Note that the DarkNet-53 model used in this embodiment is only one specific embodiment of the RGB stream of the asymmetric dual-stream network and the method is not limited to it; DarkNet-53 is used below only as an example for describing the method.
Step S310: The original RGB image from S110 passes through layers L1-L10 of the DarkNet-53 network, which extract the general features of the RGB image and down-sample the image resolution by a factor of K, outputting the RGB general feature map RGB_FP_C whose size is 1/K of the original input size. In this embodiment K is 8. Layers L1-L10 can be divided into three sub-sampling stages, L1-L2, L3-L5, and L6-L10, each of which halves the resolution of the input from the previous stage. The first sub-sampling stage consists of one standard convolution block with stride 1 (denoted Conv0) and one pooling convolution block with stride 2 (denoted Conv0_pool); the general structure of a convolution block, shown in Figure 2-3, comprises a standard image convolution layer, a batch normalization layer, and a Leaky-ReLU activation layer. The second sub-sampling stage consists of one residual convolution block (denoted Residual_Block_1) and one pooling convolution block (denoted Conv1_pool); the general structure of a residual convolution block, shown in Figure 2-4, comprises a 1x1xM standard convolution block, a 3x3xN standard convolution block, and an Add module that passes the identity mapping of the input to the output, where M is the number of input feature channels and N the number of output feature channels; here M and N each take the value 32. The third sub-sampling stage consists of two residual convolution blocks (denoted Residual_Block_2_1-2_2) and one pooling convolution block (denoted Conv2_pool). In this embodiment K is 8, and the values of M and N are given for layers L1-L10 in Figure 2-1.
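The convolution block of Figure 2-3 and the residual convolution block of Figure 2-4 can be sketched in PyTorch as follows; the kernel sizes, padding, and LeakyReLU slope are assumptions beyond what the text states.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Standard convolution block (Figure 2-3): Conv + BatchNorm + LeakyReLU."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1, inplace=True)  # slope is an assumption

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResidualBlock(nn.Module):
    """Residual convolution block (Figure 2-4): a 1x1 convolution block, a 3x3
    convolution block, and an Add module that sums the identity input."""
    def __init__(self, channels, hidden):
        super().__init__()
        self.conv1 = ConvBlock(channels, hidden, kernel_size=1)
        self.conv2 = ConvBlock(hidden, channels, kernel_size=3)

    def forward(self, x):
        return x + self.conv2(self.conv1(x))  # identity shortcut + residual path
```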
Step S320: RGB_FP_C obtained from S310 passes through layers L11-L27 of the DarkNet-53 network, which extract the low-level color-texture features of the RGB image and down-sample the resolution by a further factor of K, outputting the RGB high-resolution feature map RGB_FP_H. In this embodiment, L11-L27 consist of eight of the residual convolution blocks (denoted Residual_Block_3_1-3_8) and one of the pooling convolution blocks (Conv3_pool); K is 2, and the values of M and N are given for layers L11-L27 in Figure 2-1.
Step S330: RGB_FP_H obtained from S320 passes through layers L28-L44 of the DarkNet-53 network, which extract the mid-level edge-contour features of the RGB image and down-sample the resolution by a further factor of K, outputting the RGB medium-resolution feature map RGB_FP_M. In this embodiment, L28-L44 consist of eight of the residual convolution blocks (denoted Residual_Block_4_1-4_8) and one of the pooling convolution blocks (Conv4_pool); K is 2, and the values of M and N are given for layers L28-L44 in Figure 2-1.
Step S340: RGB_FP_M obtained from S330 passes through layers L45-L52 of the DarkNet-53 network, which extract the high-level semantic features of the RGB image and down-sample the resolution by a further factor of K, outputting the RGB low-resolution feature map RGB_FP_L. In this embodiment, L45-L52 consist of four of the residual convolution blocks (denoted Residual_Block_5_1-5_4); K is 2, and the values of M and N are given for layers L45-L52 in Figure 2-1.
S3': The encoded and normalized Depth image obtained from S2 is processed by the Depth stream of the asymmetric dual-stream network model, which extracts the general, low-level, mid-level, and high-level features of the Depth image at different network levels and outputs the corresponding general feature map and the Depth feature maps at high, medium, and low resolutions, denoted D_FP_C, D_FP_H, D_FP_M, and D_FP_L; D_FP_H, D_FP_M, and D_FP_L are input to S4'. In this embodiment, the Depth stream of the asymmetric dual-stream network model is obtained by pruning the RGB stream DarkNet-53 and is hereinafter referred to as MiniDepth-30. The MiniDepth-30 network extracts semantic features such as the edge contours of the depth image more effectively and more clearly, while reducing the number of network parameters and preventing overfitting. The network structure of MiniDepth-30 is shown in Figure 2-2. The network contains 30 convolutional layers: layers L1-L10 extract the general features of the Depth image and output D_FP_C; layers L11-L17 extract the low-level color-texture features and output D_FP_H; layers L18-L24 extract the mid-level edge-contour features and output D_FP_M; layers L25-L30 extract the high-level semantic features and output D_FP_L. Note that the MiniDepth-30 model used in this embodiment is only one specific embodiment of the Depth stream of the asymmetric dual-stream network and the method is not limited to it; MiniDepth-30 is used below only as an example for describing the method.
Step S310': The encoded and normalized Depth image obtained from S2 passes through layers L1-L10 of the MiniDepth-30 network, which extract the general features of the Depth image and down-sample the image resolution by a factor of K, outputting the Depth general feature map D_FP_C whose size is 1/K of the original input size. In this embodiment, layers L1-L10 of MiniDepth-30 have the same structure as layers L1-L10 of DarkNet-53 in step S310, and K is 8.
Step S320': D_FP_C obtained from step S310' passes through layers L11-L17 of the MiniDepth-30 network, which extract the low-level color-texture features of the Depth image and down-sample the resolution by a further factor of K, outputting the Depth high-resolution feature map D_FP_H. In this embodiment, L11-L17 consist of three of the residual convolution blocks (denoted Residual_Block_D_3_1-3_3) and one of the pooling convolution blocks (Conv3_D_pool); K is 2, and the values of M and N are given for layers L11-L17 in Figure 2-2.
Step S330': D_FP_H obtained from step S320' passes through layers L18-L24 of the MiniDepth-30 network, which extract the mid-level edge-contour features of the Depth image and down-sample the resolution by a further factor of K, outputting the Depth medium-resolution feature map D_FP_M. In this embodiment, L18-L24 consist of three of the residual convolution blocks (denoted Residual_Block_D_4_1-4_3) and one of the pooling convolution blocks (Conv4_D_pool); K is 2, and the values of M and N are given for layers L18-L24 in Figure 2-2.
Step S340': D_FP_M obtained from step S330' passes through layers L25-L30 of the MiniDepth-30 network, which extract the high-level semantic features of the Depth image and down-sample the resolution by a further factor of K, outputting the Depth low-resolution feature map D_FP_L. In this embodiment, L25-L30 consist of three of the residual convolution blocks (denoted Residual_Block_D_5_1-5_3); K is 2, and the values of M and N are given for layers L25-L30 in Figure 2-2.
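Building on the ConvBlock/ResidualBlock sketch above, the asymmetry between the two streams can be illustrated with a configurable stage layout; the channel widths and the stride-8/16/32 placement of FP_H/FP_M/FP_L are assumptions that follow common DarkNet-53 usage rather than the exact layer tables of Figures 2-1 and 2-2.

```python
import torch.nn as nn

def make_stage(in_ch, out_ch, num_residual):
    """One down-sampling stage: a stride-2 pooling convolution followed by
    `num_residual` residual convolution blocks."""
    layers = [ConvBlock(in_ch, out_ch, stride=2)]
    layers += [ResidualBlock(out_ch, out_ch // 2) for _ in range(num_residual)]
    return nn.Sequential(*layers)

class Stream(nn.Module):
    """Backbone stream returning (FP_H, FP_M, FP_L).

    `blocks` gives the residual-block count of the three deep stages:
    roughly (8, 8, 4) for a DarkNet-53-like RGB stream and (3, 3, 3) for the
    pruned MiniDepth-30-like Depth stream described in the text."""
    def __init__(self, blocks):
        super().__init__()
        self.stem = nn.Sequential(            # ~L1-L10: general features, stride 8
            ConvBlock(3, 32, stride=1),
            make_stage(32, 64, 1),
            make_stage(64, 128, 2),
            make_stage(128, 256, 0),
        )
        self.stage_h = nn.Sequential(*[ResidualBlock(256, 128) for _ in range(blocks[0])])
        self.stage_m = make_stage(256, 512, blocks[1])
        self.stage_l = make_stage(512, 1024, blocks[2])

    def forward(self, x):
        fp_c = self.stem(x)              # general feature map (FP_C)
        fp_h = self.stage_h(fp_c)        # high-resolution features
        fp_m = self.stage_m(fp_h)        # medium-resolution features
        fp_l = self.stage_l(fp_m)        # low-resolution features
        return fp_h, fp_m, fp_l

rgb_stream = Stream(blocks=(8, 8, 4))    # deeper stream for RGB
depth_stream = Stream(blocks=(3, 3, 3))  # shallower, pruned stream for Depth
```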
S4: RGB_FP_H, RGB_FP_M, and RGB_FP_L obtained from S3 are fused by expanding feature-map sizes through up-sampling and merging the feature channels of RGB feature maps with the same resolution; the fused feature maps RGB_FP_H, RGB_FP_M, and RGB_FP_L are output to S5.
Step S410: RGB_FP_L obtained in step S340 is up-sampled by a factor of M and channel-merged with RGB_FP_M obtained in step S330, realizing the complementary fusion of the high-level semantic features from the deep layers of the RGB stream with the mid-level edge-contour features from the intermediate layers, and the fused new feature map RGB_FP_M is output. Channel merging works as follows: RGB_FP_L has C1 channels and RGB_FP_M has C2 channels; merging them gives C3 = C1 + C2 channels, where C3 is the channel count of the new feature map RGB_FP_M after fusion. In this embodiment M is 2, and C1, C2, and C3 are 256, 512, and 768, respectively.
Step S420: The fused new feature map RGB_FP_M obtained in step S410 is up-sampled by a factor of M and channel-merged with RGB_FP_H obtained in step S320, realizing the complementary fusion of the deep high-level semantic features, the intermediate mid-level edge-contour features, and the shallow low-level color-texture features of the RGB stream, and the fused new feature map RGB_FP_H is output. Channel merging works as follows: RGB_FP_M has C1 channels and RGB_FP_H has C2 channels; merging them gives C3 = C1 + C2 channels, where C3 is the channel count of the new feature map RGB_FP_H after fusion. In this embodiment M is 2, and C1, C2, and C3 are 128, 256, and 384, respectively.
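A minimal sketch of the top-down fusion of steps S410/S420 is given below (steps S410'/S420' for the Depth stream have the same form); PyTorch is assumed. The embodiment's channel counts (256 + 512 = 768, then 128 + 256 = 384) suggest channel-reduction convolutions between the two fusion steps, which this sketch omits.

```python
import torch
import torch.nn.functional as F

def multiscale_fuse(fp_h, fp_m, fp_l):
    """S4 sketch: up-sample the coarser map by 2x and concatenate channels.

    fp_h, fp_m, fp_l are the high/medium/low-resolution maps of one stream.
    Returns (new_fp_h, new_fp_m, fp_l); the low-resolution map is passed
    through unchanged, as in the description.
    """
    # Low -> medium: concatenate along the channel dimension (C3 = C1 + C2).
    up_l = F.interpolate(fp_l, scale_factor=2, mode="nearest")
    new_fp_m = torch.cat([up_l, fp_m], dim=1)

    # Medium -> high: repeat the same up-sample-and-merge operation.
    up_m = F.interpolate(new_fp_m, scale_factor=2, mode="nearest")
    new_fp_h = torch.cat([up_m, fp_h], dim=1)
    return new_fp_h, new_fp_m, fp_l
```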
S4': D_FP_H, D_FP_M, and D_FP_L obtained from S3' are fused by expanding feature-map sizes through up-sampling and merging the feature channels of Depth feature maps with the same resolution; the fused feature maps D_FP_H, D_FP_M, and D_FP_L are output to S5.
Step S410': D_FP_L obtained in step S340' is up-sampled by a factor of M and channel-merged with D_FP_M obtained in step S330', realizing the complementary fusion of the high-level semantic features from the deep layers of the Depth stream with the mid-level edge-contour features from the intermediate layers, and the fused new feature map D_FP_M is output. Channel merging works as follows: D_FP_L has C1 channels and D_FP_M has C2 channels; merging them gives C3 = C1 + C2 channels, where C3 is the channel count of the new feature map D_FP_M after fusion. In this embodiment M is 2, and C1, C2, and C3 are 256, 512, and 768, respectively.
Step S420': The fused new feature map D_FP_M obtained in step S410' is up-sampled by a factor of M and channel-merged with D_FP_H obtained in step S320', realizing the complementary fusion of the deep high-level semantic features, the intermediate mid-level edge-contour features, and the shallow low-level color-texture features of the Depth stream, and the fused new feature map D_FP_H is output. Channel merging works as follows: D_FP_M has C1 channels and D_FP_H has C2 channels; merging them gives C3 = C1 + C2 channels, where C3 is the channel count of the new feature map D_FP_H after fusion. In this embodiment M is 2, and C1, C2, and C3 are 128, 256, and 384, respectively.
S5: The fused new feature maps RGB_FP_H, RGB_FP_M, and RGB_FP_L obtained from S4 and the fused new feature maps D_FP_H, D_FP_M, and D_FP_L obtained from S4' are channel-merged at the corresponding equal resolutions, giving the merged feature maps Concat_H, Concat_M, and Concat_L. The channel re-weighting module (hereinafter RW_Module) is then applied to linearly weight Concat_L, Concat_M, and Concat_H, and the re-weighted high-, medium-, and low-resolution feature maps are output, denoted RW_H, RW_M, and RW_L, respectively.
Step S510: RGB_FP_L and D_FP_L are obtained from S4 and S4'. Their feature channels are first merged to obtain Concat_L, realizing the complementary fusion of the deep multi-modal information of RGB and Depth; the channel re-weighting module RW_Module is then applied to weight Concat_L linearly, assigning a weight to each feature channel and outputting the re-weighted feature map RW_L. Taking the channel re-weighting of RGB_FP_L and D_FP_L as an example, the general structure of the channel re-weighting module provided in this embodiment is shown in Figure 4. Specifically, RGB_FP_L has C1 channels, D_FP_L has C2 channels, and the merged feature map Concat_L has C3 channels, where C3 = C1 + C2. Concat_L then passes in sequence through an average-pooling (Ave-Pooling) layer that pools each channel to 1x1, a standard convolution layer of C3/s 1x1 kernels (where s is the reduction factor), a standard convolution layer of C3 1x1 kernels, and a Sigmoid layer, yielding C3 weight values in the range 0-1. Finally, the C3 weight values are multiplied with the C3 feature channels of Concat_L, assigning a weight to each feature channel and outputting the C3 re-weighted feature channels, i.e., RW_L. In this embodiment, C1, C2, and C3 are 1024, 1024, and 2048, and the reduction factor s is 16.
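A sketch of the channel re-weighting module of Figure 4, written as an SE-style block, is shown below; whether an activation sits between the two 1x1 convolutions is not stated in the text, so the ReLU here is an assumption.

```python
import torch
import torch.nn as nn

class RWModule(nn.Module):
    """Channel re-weighting module sketch: global average pooling, two 1x1
    convolutions (C3 -> C3/s -> C3), a Sigmoid, and channel-wise scaling."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # 1x1 average pooling
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),  # C3/s 1x1 kernels
            nn.ReLU(inplace=True),                          # assumed activation
            nn.Conv2d(channels // reduction, channels, 1),  # C3 1x1 kernels
            nn.Sigmoid(),                                   # weights in (0, 1)
        )

    def forward(self, rgb_fp, d_fp):
        concat = torch.cat([rgb_fp, d_fp], dim=1)  # Concat_*: C3 = C1 + C2 channels
        weights = self.fc(self.pool(concat))       # one weight per feature channel
        return concat * weights                    # re-weighted feature map RW_*

# Example for the low-resolution pair of the embodiment (C1 = C2 = 1024, s = 16):
# rw_l = RWModule(channels=2048, reduction=16)(rgb_fp_l, d_fp_l)
```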
Step S520: RGB_FP_M obtained from step S410 and D_FP_M obtained from step S410' are first channel-merged to obtain Concat_M, realizing the complementary fusion of the multi-modal information of RGB and Depth in the intermediate layers of the network; the channel re-weighting module RW_Module is then applied to weight Concat_M linearly, assigning a weight to each feature channel and outputting the re-weighted feature map RW_M. In this embodiment, the channel re-weighting of RGB_FP_M and D_FP_M is performed in the same way as that of RGB_FP_L and D_FP_L in step S510, with C1, C2, and C3 equal to 512, 512, and 1024 and the reduction factor s equal to 16.
Step S530: RGB_FP_H obtained from step S420 and D_FP_H obtained from step S420' are first channel-merged to obtain Concat_H, realizing the complementary fusion of the multi-modal information of RGB and Depth in the shallow layers of the network; the channel re-weighting module RW_Module is then applied to weight Concat_H linearly, assigning a weight to each feature channel and outputting the re-weighted feature map RW_H. In this embodiment, the channel re-weighting of RGB_FP_H and D_FP_H is performed in the same way as that of RGB_FP_L and D_FP_L in step S510, with C1, C2, and C3 equal to 256, 256, and 512 and the reduction factor s equal to 16.
S6: The re-weighted feature maps RW_L, RW_M, and RW_H obtained from S5 undergo classification and bounding-box coordinate regression, yielding prediction results for large, medium, and small persons. The prediction results of these three scales are aggregated, and the non-maximum suppression (hereinafter NMS) algorithm is used to remove overlapping target boxes, outputting the finally retained personnel detection results, namely the class confidence score C_i of each person and the predicted rectangular box (x1_i, y1_i, x2_i, y2_i), i = 1, 2, ..., N. In this embodiment, i is the ID number of a person and N is the total number of personnel detection results retained in the current image; x1_i, y1_i, x2_i, and y2_i denote the top-left abscissa, top-left ordinate, bottom-right abscissa, and bottom-right ordinate of each rectangular box containing a person.
Step S610: The re-weighted low-resolution feature map RW_L obtained from step S510 is passed to the SoftMax classification layer and the coordinate regression layer, which output, on the low-resolution feature map, the class confidence scores C_L for larger persons and the top-left and bottom-right coordinates (x1_L, y1_L, x2_L, y2_L) of their rectangular boxes, where the subscript L denotes a prediction made on the low-resolution feature map.
Step S620: The re-weighted medium-resolution feature map RW_M obtained from step S520 is passed to the SoftMax classification layer and the coordinate regression layer, which output, on the medium-resolution feature map, the class confidence scores C_M for medium-sized persons and the top-left and bottom-right coordinates (x1_M, y1_M, x2_M, y2_M) of their rectangular boxes, where the subscript M denotes a prediction made on the medium-resolution feature map.
Step S630: The re-weighted high-resolution feature map RW_H obtained from step S530 is passed to the SoftMax classification layer and the coordinate regression layer, which output, on the high-resolution feature map, the class confidence scores C_H for smaller persons and the top-left and bottom-right coordinates (x1_H, y1_H, x2_H, y2_H) of their rectangular boxes, where the subscript H denotes a prediction made on the high-resolution feature map.
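For illustration only, one such prediction branch can be sketched as follows; the use of a single 1x1 convolution producing one confidence score and four box coordinates per location is an assumption, since the description only names a SoftMax classification layer and a coordinate regression layer.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Per-scale prediction branch sketch: for every location of a re-weighted
    feature map (RW_L, RW_M, or RW_H) it outputs a person confidence score and
    a box given by top-left / bottom-right coordinates."""
    def __init__(self, in_channels):
        super().__init__()
        self.pred = nn.Conv2d(in_channels, 5, kernel_size=1)  # [conf, x1, y1, x2, y2]

    def forward(self, rw_map):
        out = self.pred(rw_map)
        conf = torch.sigmoid(out[:, :1])   # class confidence score per location
        boxes = out[:, 1:]                 # raw box coordinates, decoded downstream
        return conf, boxes
```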
Step S640: The class confidence scores (C_L, C_M, C_H) and the top-left/bottom-right box coordinates of the large, medium, and small persons obtained in steps S610, S620, and S630 are aggregated across the three scales; the NMS algorithm is then applied to remove overlapping target boxes, and the finally retained personnel detection results are output, namely the class confidence score C_i of each person and the predicted rectangular box (x1_i, y1_i, x2_i, y2_i), i = 1, 2, ..., N. The flowchart of the NMS algorithm is shown in Figure 5.
The steps of the NMS algorithm are as follows:
Step S640-1: The class confidence scores and the top-left/bottom-right box coordinates of the large, medium, and small persons obtained in steps S610, S620, and S630 are aggregated across the three scales; the prediction boxes are filtered with a confidence threshold, and the boxes whose class confidence score exceeds the threshold are retained and added to the prediction list. In this embodiment the confidence threshold is set to 0.3.
Step S640-2: In the prediction list obtained from step S640-1, the unprocessed prediction boxes are sorted in descending order of confidence score, and the sorted prediction list is output.
Step S640-3: From the descending-order prediction list obtained in step S640-2, the box with the highest confidence score is selected as the current reference box; its class confidence score and box coordinates are added to the final result list, the reference box is removed from the prediction list, and the intersection-over-union (IoU) between every remaining prediction box and the current reference box is computed.
Step S640-4: Given the prediction list and the IoU values between all of its boxes and the reference box from step S640-3, any box whose IoU with the reference box exceeds the preset NMS threshold is regarded as a duplicate of the reference target and removed from the prediction list; otherwise the box is kept. The filtered prediction list is output.
Step S640-5: Given the filtered prediction list from step S640-4, if all boxes in the prediction list have been processed (i.e., the list is empty), the algorithm ends and the final result list is returned; otherwise, if unprocessed boxes remain in the current prediction list, the procedure returns to step S640-2 and the algorithm is repeated.
Step S640-6: Following step S640-5, when no unprocessed prediction box remains in the prediction list, the final result list is output as the finally retained personnel detection results.
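The procedure of steps S640-1 to S640-6 can be sketched as follows; the confidence threshold 0.3 comes from the embodiment, while the NMS (IoU) threshold is not specified in the text, so the value 0.5 below is only a placeholder.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, boxes given as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, conf_thresh=0.3, nms_thresh=0.5):
    """Confidence filtering followed by greedy suppression.

    `boxes` is an (N, 4) array of [x1, y1, x2, y2]; `scores` holds the
    confidence scores pooled from the three prediction scales.
    Returns the indices of the finally retained detections.
    """
    idxs = np.where(scores > conf_thresh)[0]       # S640-1: confidence filter
    order = idxs[np.argsort(-scores[idxs])]        # S640-2: sort descending
    result = []
    while order.size > 0:
        base = order[0]                            # S640-3: current reference box
        result.append(base)
        rest = order[1:]
        overlaps = iou(boxes[base], boxes[rest])   # S640-3: IoU with remaining boxes
        order = rest[overlaps <= nms_thresh]       # S640-4: drop duplicate targets
    return result                                  # S640-5/6: final result list
```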
References:
[1] Neubeck A, Van Gool L. Efficient Non-Maximum Suppression. International Conference on Pattern Recognition, 2006.
[2] Zhang Y, Funkhouser T. Deep Depth Completion of a Single RGB-D Image. 2018.
[3] Gupta S, Girshick R, Arbeláez P, et al. Learning Rich Features from RGB-D Images for Object Detection and Segmentation. 2014.
[4] Redmon J, Farhadi A. YOLOv3: An Incremental Improvement. 2018.
Claims (10)
- 1. An RGB-D multi-modal fusion personnel detection method based on an asymmetric dual-stream network, characterized by comprising: RGBD image acquisition, depth image preprocessing, RGB feature extraction and Depth feature extraction, RGB multi-scale fusion and Depth multi-scale fusion, multi-modal feature channel re-weighting, and multi-scale personnel prediction.
- 2. The RGB-D multi-modal fusion personnel detection method based on an asymmetric dual-stream network according to claim 1, characterized in that, in S1, RGBD image acquisition: the original RGB image and Depth image are acquired with a camera capable of capturing RGB images and depth images simultaneously, and the RGB and Depth images are matched and grouped, each group consisting of one RGB image and the Depth image captured in the same scene; the grouped and matched RGB and Depth images are output; the original RGB image and Depth image can also be obtained from a public RGBD dataset.
- 3. The RGB-D multi-modal fusion personnel detection method based on an asymmetric dual-stream network according to claim 2, characterized in that, in S2, depth image preprocessing: the grouped and matched Depth image is obtained from the RGBD image acquisition of S1; part of the noise of the Depth image is first removed, holes are then filled, and the single-channel Depth image is finally re-encoded into a three-channel image whose channel values are re-normalized to 0-255; the encoded and normalized Depth image is output.
- 4. The RGB-D multi-modal fusion personnel detection method based on an asymmetric dual-stream network according to claim 3, characterized in that, in S3, RGB feature extraction and Depth feature extraction: the original RGB image obtained from the RGBD image acquisition of S1 is input to RGB feature extraction, which performs down-sampling feature extraction and outputs the high-, medium-, and low-resolution feature maps of the RGB image, denoted RGB_FP_H, RGB_FP_M, and RGB_FP_L, representing the low-level color-texture, mid-level edge-contour, and high-level semantic features of the RGB image; the encoded and normalized Depth image obtained from the depth image preprocessing is input to Depth feature extraction, which performs down-sampling feature extraction and outputs the high-, medium-, and low-resolution feature maps of the Depth image, denoted D_FP_H, D_FP_M, and D_FP_L, representing the low-level color-texture, mid-level edge-contour, and high-level semantic features of the Depth image; whereas a symmetric dual-stream design would give the RGB stream and the Depth stream identical structures, an asymmetric dual-stream convolutional neural network model is designed to extract the RGB image and Depth image features; DarkNet-53 and MiniDepth-30 represent the RGB stream and the Depth stream, respectively, and the network structures of DarkNet-53 and MiniDepth-30 are asymmetric.
- 5. The RGB-D multi-modal fusion personnel detection method based on an asymmetric dual-stream network according to claim 4, characterized in that, in S4, RGB multi-scale fusion and Depth multi-scale fusion: the RGB feature maps RGB_FP_H, RGB_FP_M, and RGB_FP_L obtained from RGB feature extraction are input to RGB multi-scale fusion; RGB_FP_L is first expanded to the same size as RGB_FP_M through an up-sampling layer and then channel-merged with RGB_FP_M, realizing the complementary fusion of the high-level semantic features from the deep layers of the RGB stream with the mid-level edge-contour features from the intermediate layers, and the channel-merged new feature map RGB_FP_M is output; the new feature map RGB_FP_M is then expanded to the same size as RGB_FP_H through an up-sampling layer and channel-merged with RGB_FP_H, realizing the complementary fusion of the deep high-level semantic features, the intermediate mid-level edge-contour features, and the shallow low-level color-texture features, and the channel-merged new feature map RGB_FP_H is output; the Depth feature maps D_FP_H, D_FP_M, and D_FP_L obtained from Depth feature extraction are input to Depth multi-scale fusion, which performs the same operations as RGB multi-scale fusion; the final output of RGB multi-scale fusion is the original input RGB_FP_L together with the channel-merged new feature maps RGB_FP_M and RGB_FP_H, and the output of Depth multi-scale fusion is the original input D_FP_L together with the channel-merged new feature maps D_FP_M and D_FP_H.
- 6. The RGB-D multi-modal fusion personnel detection method based on an asymmetric dual-stream network according to claim 5, characterized in that, in S5, multi-modal feature channel re-weighting: the RGB feature maps RGB_FP_L, RGB_FP_M, and RGB_FP_H obtained from RGB multi-scale fusion and the Depth feature maps D_FP_L, D_FP_M, and D_FP_H obtained from Depth multi-scale fusion are grouped by resolution and input to the channel re-weighting structure of the corresponding resolution in the multi-modal feature channel re-weighting stage, realizing a more effective multi-modal fusion of RGB and Depth features and improving detection robustness in a variety of restricted scenes; taking the re-weighting of the RGB_FP_L and D_FP_L channels as an example, RGB_FP_L is obtained from RGB multi-scale fusion and D_FP_L from Depth multi-scale fusion, the two are first channel-merged and the merged feature map is denoted Concat_L, and a channel re-weighting module, abbreviated RW_Module, then linearly weights the feature channels of Concat_L, assigning a weight to each feature channel, with the re-weighted feature map denoted RW_L; the channel re-weighting of RGB_FP_M with D_FP_M and of RGB_FP_H with D_FP_H is performed in the same way as for RGB_FP_L and D_FP_L; the multi-modal feature channel re-weighting stage finally outputs the re-weighted low-, medium-, and high-resolution feature maps, denoted RW_L, RW_M, and RW_H, respectively.
- 7. The RGB-D multi-modal fusion personnel detection method based on an asymmetric dual-stream network according to claim 6, characterized in that, in S6, multi-scale personnel prediction: the re-weighted feature maps RW_L, RW_M, and RW_H obtained from the multi-modal feature channel re-weighting of S5 are input to the corresponding prediction branches of the multi-scale personnel prediction stage for classification and bounding-box coordinate regression, yielding prediction results for large, medium, and small persons; because the feature-map resolutions differ, the receptive field of each prediction point on the feature maps also differs: each prediction point on RW_L has a large receptive field and is used to predict larger targets in the image, each prediction point on RW_M has a medium receptive field and is used to predict medium targets, and each prediction point on RW_H has a small receptive field and is used to predict smaller targets; the prediction results of the three scales are aggregated, and the non-maximum suppression algorithm is used to remove overlapping target boxes, outputting the finally retained personnel detection results, namely the class confidence score C_i of each person and the predicted rectangular box (x1_i, y1_i, x2_i, y2_i), i = 1, 2, ..., N, where i is the ID number of a person, N is the total number of personnel detection results retained in the current image, and x1_i, y1_i, x2_i, and y2_i denote the top-left abscissa, top-left ordinate, bottom-right abscissa, and bottom-right ordinate of each rectangular box containing a person.
- 8. The RGB-D multi-modal fusion personnel detection method based on an asymmetric dual-stream network according to claim 2, characterized in that: in step S110, the original RGB image is acquired with a camera capable of capturing RGB images and depth images simultaneously, and the original RGB image can also be obtained from a public RGBD dataset; in step S120, the Depth image matched with the RGB image is acquired synchronously from step S110, and the RGB and Depth images are grouped, each group consisting of one RGB image and the depth image captured in the same scene, with the grouped and matched Depth image being output.
- 9. The RGB-D multi-modal fusion personnel detection method based on an asymmetric dual-stream network according to claim 2, characterized in that: the class confidence scores and the top-left/bottom-right box coordinates of the large, medium, and small persons are obtained and the prediction results of the three scales are aggregated; the NMS algorithm is then applied to remove overlapping target boxes, and the finally retained personnel detection results are output, namely the class confidence score C_i of each person and the predicted rectangular box (x1_i, y1_i, x2_i, y2_i), i = 1, 2, ..., N.
- 10. The RGB-D multi-modal fusion personnel detection method based on an asymmetric dual-stream network according to claim 9, characterized in that the steps of the NMS algorithm are as follows:
Step S640-1: the class confidence scores and the top-left/bottom-right box coordinates of the large, medium, and small persons are obtained, the prediction results of the three scales are aggregated, the prediction boxes are filtered with a confidence threshold, and the boxes whose class confidence score exceeds the confidence threshold are retained and added to the prediction list; the confidence threshold is set to 0.3;
Step S640-2: in the prediction list obtained from step S640-1, the unprocessed prediction boxes are sorted in descending order of confidence score, and the sorted prediction list is output;
Step S640-3: from the descending-order prediction list obtained in step S640-2, the box with the highest confidence score is selected as the current reference box, its class confidence score and box coordinates are added to the final result list, the reference box is removed from the prediction list, and the intersection-over-union (IoU) between every remaining prediction box and the current reference box is computed;
Step S640-4: given the prediction list and the IoU values between all of its boxes and the reference box from step S640-3, any box whose IoU exceeds the preset NMS threshold is regarded as a duplicate of the reference target and removed from the prediction list, otherwise the box is kept, and the filtered prediction list is output;
Step S640-5: given the filtered prediction list from step S640-4, if all boxes in the prediction list have been processed, i.e., the list is empty, the algorithm ends and the final result list is returned; otherwise, if unprocessed boxes remain in the current prediction list, the procedure returns to step S640-2 and the algorithm is repeated;
Step S640-6: following step S640-5, when no unprocessed prediction box remains in the prediction list, the final result list is output as the finally retained personnel detection results.
CN115937791A (en) * | 2023-01-10 | 2023-04-07 | 华南农业大学 | Poultry counting method and device suitable for multiple breeding modes |
CN115984672A (en) * | 2023-03-17 | 2023-04-18 | 成都纵横自动化技术股份有限公司 | Method and device for detecting small target in high-definition image based on deep learning |
CN116206133A (en) * | 2023-04-25 | 2023-06-02 | 山东科技大学 | RGB-D significance target detection method |
CN116311077A (en) * | 2023-04-10 | 2023-06-23 | 东北大学 | Pedestrian detection method and device based on multispectral fusion of saliency map |
CN116343308A (en) * | 2023-04-04 | 2023-06-27 | 湖南交通工程学院 | Fused face image detection method, device, equipment and storage medium |
CN116519106A (en) * | 2023-06-30 | 2023-08-01 | 中国农业大学 | A method, device, storage medium and equipment for measuring the body weight of live pigs |
CN116715560A (en) * | 2023-08-10 | 2023-09-08 | 吉林隆源农业服务有限公司 | Intelligent preparation method and system of controlled release fertilizer |
CN116758117A (en) * | 2023-06-28 | 2023-09-15 | 云南大学 | Target tracking method and system under visible light and infrared images |
CN116823908A (en) * | 2023-06-26 | 2023-09-29 | 北京邮电大学 | A monocular image depth estimation method based on multi-scale feature correlation enhancement |
CN117237343A (en) * | 2023-11-13 | 2023-12-15 | 安徽大学 | Semi-supervised RGB-D image mirror detection method, storage medium and computer equipment |
CN117350926A (en) * | 2023-12-04 | 2024-01-05 | 北京航空航天大学合肥创新研究院 | Multi-mode data enhancement method based on target weight |
CN117392572A (en) * | 2023-12-11 | 2024-01-12 | 四川能投发展股份有限公司 | Transmission tower bird nest detection method based on unmanned aerial vehicle inspection |
CN117475182A (en) * | 2023-09-13 | 2024-01-30 | 江南大学 | Stereo matching method based on multi-feature aggregation |
CN117635953A (en) * | 2024-01-26 | 2024-03-01 | 泉州装备制造研究所 | A real-time semantic segmentation method for power systems based on multi-modal drone aerial photography |
CN118172615A (en) * | 2024-05-14 | 2024-06-11 | 山西新泰富安新材有限公司 | Method for reducing burn rate of heating furnace |
CN118553002A (en) * | 2024-07-29 | 2024-08-27 | 浙江幸福轨道交通运营管理有限公司 | Face recognition system and method based on cloud platform four-layer architecture AFC system |
CN118982488A (en) * | 2024-07-19 | 2024-11-19 | 南京审计大学 | A multi-scale low-light image enhancement method based on full-resolution semantic guidance |
CN119049091A (en) * | 2024-10-30 | 2024-11-29 | 杭州电子科技大学 | Human body identifier identification method based on dynamic detection reliability update |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111767882B (en) * | 2020-07-06 | 2024-07-19 | 江南大学 | Multi-mode pedestrian detection method based on improved YOLO model |
CN111968058B (en) * | 2020-08-25 | 2023-08-04 | 北京交通大学 | A Noise Reduction Method for Low Dose CT Image |
CN111986240A (en) * | 2020-09-01 | 2020-11-24 | 交通运输部水运科学研究所 | Drowning person detection method and system based on visible light and thermal imaging data fusion |
CN112434654B (en) * | 2020-12-07 | 2022-09-13 | 安徽大学 | Cross-modal pedestrian re-identification method based on symmetric convolutional neural network |
CN113221659B (en) * | 2021-04-13 | 2022-12-23 | 天津大学 | A dual-light vehicle detection method and device based on uncertain perception network |
CN113240631B (en) * | 2021-04-22 | 2023-12-12 | 北京中科慧眼科技有限公司 | Road surface detection method and system based on RGB-D fusion information and intelligent terminal |
CN113360712B (en) * | 2021-05-21 | 2022-12-06 | 北京百度网讯科技有限公司 | Video representation generation method and device and electronic equipment |
CN113536978B (en) * | 2021-06-28 | 2023-08-18 | 杭州电子科技大学 | A Saliency-Based Camouflaged Target Detection Method |
CN113887332B (en) * | 2021-09-13 | 2024-04-05 | 华南理工大学 | Skin operation safety monitoring method based on multi-mode fusion |
CN113902903B (en) * | 2021-09-30 | 2024-08-02 | 北京工业大学 | Downsampling-based double-attention multi-scale fusion method |
CN113887425B (en) * | 2021-09-30 | 2024-04-12 | 北京工业大学 | Lightweight object detection method and system for low-computation-force computing device |
CN114581838B (en) * | 2022-04-26 | 2022-08-26 | 阿里巴巴达摩院(杭州)科技有限公司 | Image processing method, device and cloud device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140107842A1 (en) * | 2012-10-16 | 2014-04-17 | Electronics And Telecommunications Research Institute | Human-tracking method and robot apparatus for performing the same |
CN107045630A (en) * | 2017-04-24 | 2017-08-15 | 杭州司兰木科技有限公司 | A kind of pedestrian detection and personal identification method and system based on RGBD |
CN108734210A (en) * | 2018-05-17 | 2018-11-02 | 浙江工业大学 | A kind of method for checking object based on cross-module state multi-scale feature fusion |
CN109543697A (en) * | 2018-11-16 | 2019-03-29 | 西北工业大学 | A kind of RGBD images steganalysis method based on deep learning |
CN109598301A (en) * | 2018-11-30 | 2019-04-09 | 腾讯科技(深圳)有限公司 | Detection zone minimizing technology, device, terminal and storage medium |
WO2019162241A1 (en) * | 2018-02-21 | 2019-08-29 | Robert Bosch Gmbh | Real-time object detection using depth sensors |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105956532B (en) * | 2016-04-25 | 2019-05-21 | 大连理工大学 | A kind of traffic scene classification method based on multiple dimensioned convolutional neural networks |
CN110309747B (en) * | 2019-06-21 | 2022-09-16 | 大连理工大学 | Support quick degree of depth pedestrian detection model of multiscale |
2019
- 2019-11-09 CN CN201911090619.5A patent/CN110956094B/en active Active
2020
- 2020-03-25 WO PCT/CN2020/080991 patent/WO2021088300A1/en active Application Filing
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140107842A1 (en) * | 2012-10-16 | 2014-04-17 | Electronics And Telecommunications Research Institute | Human-tracking method and robot apparatus for performing the same |
CN107045630A (en) * | 2017-04-24 | 2017-08-15 | 杭州司兰木科技有限公司 | A kind of pedestrian detection and personal identification method and system based on RGBD |
WO2019162241A1 (en) * | 2018-02-21 | 2019-08-29 | Robert Bosch Gmbh | Real-time object detection using depth sensors |
CN108734210A (en) * | 2018-05-17 | 2018-11-02 | 浙江工业大学 | A kind of method for checking object based on cross-module state multi-scale feature fusion |
CN109543697A (en) * | 2018-11-16 | 2019-03-29 | 西北工业大学 | A kind of RGBD images steganalysis method based on deep learning |
CN109598301A (en) * | 2018-11-30 | 2019-04-09 | 腾讯科技(深圳)有限公司 | Detection zone minimizing technology, device, terminal and storage medium |
Cited By (81)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113468954A (en) * | 2021-05-20 | 2021-10-01 | 西安电子科技大学 | Face counterfeiting detection method based on local area features under multiple channels |
CN113468954B (en) * | 2021-05-20 | 2023-04-18 | 西安电子科技大学 | Face counterfeiting detection method based on local area features under multiple channels |
CN113313688A (en) * | 2021-05-28 | 2021-08-27 | 武汉乾峯智能科技有限公司 | Energetic material medicine barrel identification method and system, electronic equipment and storage medium |
CN113313688B (en) * | 2021-05-28 | 2022-08-05 | 武汉乾峯智能科技有限公司 | A method, system, electronic device and storage medium for identifying an energetic material medicine barrel |
CN113362224A (en) * | 2021-05-31 | 2021-09-07 | 维沃移动通信有限公司 | Image processing method and device, electronic equipment and readable storage medium |
CN113298094B (en) * | 2021-06-10 | 2022-11-04 | 安徽大学 | An RGB-T Salient Object Detection Method Based on Modality Correlation and Dual Perceptual Decoder |
CN113298094A (en) * | 2021-06-10 | 2021-08-24 | 安徽大学 | RGB-T significance target detection method based on modal association and double-perception decoder |
CN113538615B (en) * | 2021-06-29 | 2024-01-09 | 中国海洋大学 | Remote sensing image coloring method based on double-flow generator depth convolution countermeasure generation network |
CN113538615A (en) * | 2021-06-29 | 2021-10-22 | 中国海洋大学 | Remote sensing image colorization method based on two-stream generator deep convolutional adversarial generative network |
CN113361466A (en) * | 2021-06-30 | 2021-09-07 | 江南大学 | Multi-modal cross-directed learning-based multi-spectral target detection method |
CN113361466B (en) * | 2021-06-30 | 2024-03-12 | 江南大学 | Multispectral target detection method based on multi-mode cross guidance learning |
CN113486781A (en) * | 2021-07-02 | 2021-10-08 | 国网电力科学研究院有限公司 | Electric power inspection method and device based on deep learning model |
CN113486781B (en) * | 2021-07-02 | 2023-10-24 | 国网电力科学研究院有限公司 | Electric power inspection method and device based on deep learning model |
CN113537326A (en) * | 2021-07-06 | 2021-10-22 | 安徽大学 | A method for salient object detection in RGB-D images |
CN113569723A (en) * | 2021-07-27 | 2021-10-29 | 北京京东尚科信息技术有限公司 | Face detection method and device, electronic equipment and storage medium |
CN113658134A (en) * | 2021-08-13 | 2021-11-16 | 安徽大学 | A Multimodal Alignment Calibration Method for Salient Object Detection in RGB-D Images |
CN113657521B (en) * | 2021-08-23 | 2023-09-19 | 天津大学 | A way to separate two mutually exclusive components in an image |
CN113657521A (en) * | 2021-08-23 | 2021-11-16 | 天津大学 | Method for separating two mutually exclusive components in image |
CN113848234A (en) * | 2021-09-16 | 2021-12-28 | 南京航空航天大学 | Method for detecting aviation composite material based on multi-mode information |
CN113989245A (en) * | 2021-10-28 | 2022-01-28 | 杭州中科睿鉴科技有限公司 | Multi-view multi-scale image tampering detection method |
CN113989245B (en) * | 2021-10-28 | 2023-01-24 | 杭州中科睿鉴科技有限公司 | Multi-view multi-scale image tampering detection method |
CN114037938A (en) * | 2021-11-09 | 2022-02-11 | 桂林电子科技大学 | A low-light target detection method based on NFL-Net |
CN114037938B (en) * | 2021-11-09 | 2024-03-26 | 桂林电子科技大学 | NFL-Net-based low-illumination target detection method |
CN113902783B (en) * | 2021-11-19 | 2024-04-30 | 东北大学 | A salient object detection system and method integrating three-modal images |
CN113902783A (en) * | 2021-11-19 | 2022-01-07 | 东北大学 | Three-modal image fused saliency target detection system and method |
CN114202646A (en) * | 2021-11-26 | 2022-03-18 | 深圳市朗驰欣创科技股份有限公司 | Infrared image smoking detection method and system based on deep learning |
CN114119965A (en) * | 2021-11-30 | 2022-03-01 | 齐鲁工业大学 | A road target detection method and system |
CN114170174A (en) * | 2021-12-02 | 2022-03-11 | 沈阳工业大学 | CLANet steel rail surface defect detection system and method based on RGB-D image |
CN114170174B (en) * | 2021-12-02 | 2024-01-23 | 沈阳工业大学 | CLANet steel rail surface defect detection system and method based on RGB-D image |
CN114202663A (en) * | 2021-12-03 | 2022-03-18 | 大连理工大学宁波研究院 | A saliency detection method based on color image and depth image |
CN114372986A (en) * | 2021-12-30 | 2022-04-19 | 深圳大学 | Image Semantic Segmentation Method and Device Based on Attention-Guided Multimodal Feature Fusion |
CN114372986B (en) * | 2021-12-30 | 2024-05-24 | 深圳大学 | Image semantic segmentation method and device for attention-guided multi-modal feature fusion |
CN114359228A (en) * | 2022-01-06 | 2022-04-15 | 深圳思谋信息科技有限公司 | Object surface defect detection method and device, computer equipment and storage medium |
CN114049508A (en) * | 2022-01-12 | 2022-02-15 | 成都无糖信息技术有限公司 | Fraud website identification method and system based on picture clustering and manual research and judgment |
CN114049508B (en) * | 2022-01-12 | 2022-04-01 | 成都无糖信息技术有限公司 | Fraud website identification method and system based on picture clustering and manual research and judgment |
CN114445442B (en) * | 2022-01-28 | 2022-12-02 | 杭州电子科技大学 | Multispectral image semantic segmentation method based on asymmetric cross fusion |
CN114445442A (en) * | 2022-01-28 | 2022-05-06 | 杭州电子科技大学 | Multispectral image semantic segmentation method based on asymmetric cross fusion |
CN114219807A (en) * | 2022-02-22 | 2022-03-22 | 成都爱迦飞诗特科技有限公司 | Mammary gland ultrasonic examination image grading method, device, equipment and storage medium |
CN114708295B (en) * | 2022-04-02 | 2024-04-16 | 华南理工大学 | A logistics package separation method based on Transformer |
CN114708295A (en) * | 2022-04-02 | 2022-07-05 | 华南理工大学 | Logistics package separation method based on Transformer |
CN114998826A (en) * | 2022-05-12 | 2022-09-02 | 西北工业大学 | Crowd detection method under dense scene |
CN114663436A (en) * | 2022-05-25 | 2022-06-24 | 南京航空航天大学 | Cross-scale defect detection method based on deep learning |
CN115100409B (en) * | 2022-06-30 | 2024-04-26 | 温州大学 | A video portrait segmentation algorithm based on Siamese network |
CN115100409A (en) * | 2022-06-30 | 2022-09-23 | 温州大学 | A Video Portrait Segmentation Algorithm Based on Siamese Network |
CN114821488A (en) * | 2022-06-30 | 2022-07-29 | 华东交通大学 | Crowd counting method and system based on multi-modal network and computer equipment |
CN115909182A (en) * | 2022-08-09 | 2023-04-04 | 哈尔滨市科佳通用机电股份有限公司 | An Image Recognition Method for EMU Brake Pad Wear Faults |
CN115909182B (en) * | 2022-08-09 | 2023-08-08 | 哈尔滨市科佳通用机电股份有限公司 | Method for identifying abrasion fault image of brake pad of motor train unit |
CN115273154B (en) * | 2022-09-26 | 2023-01-17 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Thermal infrared pedestrian detection method and system based on edge reconstruction and storage medium |
CN115273154A (en) * | 2022-09-26 | 2022-11-01 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Thermal infrared pedestrian detection method and system based on edge reconstruction and storage medium |
CN115731473B (en) * | 2022-10-28 | 2024-05-31 | 南开大学 | Remote sensing image analysis method for farmland plant abnormal change |
CN115731473A (en) * | 2022-10-28 | 2023-03-03 | 南开大学 | Remote sensing image analysis method for abnormal changes of farmland plants |
CN115641507A (en) * | 2022-11-07 | 2023-01-24 | 哈尔滨工业大学 | Remote sensing image small-scale surface target detection method based on self-adaptive multi-level fusion |
CN115937791A (en) * | 2023-01-10 | 2023-04-07 | 华南农业大学 | Poultry counting method and device suitable for multiple breeding modes |
CN115984672A (en) * | 2023-03-17 | 2023-04-18 | 成都纵横自动化技术股份有限公司 | Method and device for detecting small target in high-definition image based on deep learning |
CN116343308B (en) * | 2023-04-04 | 2024-02-09 | 湖南交通工程学院 | Fused face image detection method, device, equipment and storage medium |
CN116343308A (en) * | 2023-04-04 | 2023-06-27 | 湖南交通工程学院 | Fused face image detection method, device, equipment and storage medium |
CN116311077B (en) * | 2023-04-10 | 2023-11-07 | 东北大学 | Pedestrian detection method and device based on multispectral fusion of saliency map |
CN116311077A (en) * | 2023-04-10 | 2023-06-23 | 东北大学 | Pedestrian detection method and device based on multispectral fusion of saliency map |
CN116206133A (en) * | 2023-04-25 | 2023-06-02 | 山东科技大学 | RGB-D significance target detection method |
CN116206133B (en) * | 2023-04-25 | 2023-09-05 | 山东科技大学 | A RGB-D salient object detection method |
CN116823908A (en) * | 2023-06-26 | 2023-09-29 | 北京邮电大学 | A monocular image depth estimation method based on multi-scale feature correlation enhancement |
CN116758117A (en) * | 2023-06-28 | 2023-09-15 | 云南大学 | Target tracking method and system under visible light and infrared images |
CN116758117B (en) * | 2023-06-28 | 2024-02-09 | 云南大学 | Target tracking method and system under visible light and infrared images |
CN116519106B (en) * | 2023-06-30 | 2023-09-15 | 中国农业大学 | Method, device, storage medium and equipment for determining weight of live pigs |
CN116519106A (en) * | 2023-06-30 | 2023-08-01 | 中国农业大学 | A method, device, storage medium and equipment for measuring the body weight of live pigs |
CN116715560A (en) * | 2023-08-10 | 2023-09-08 | 吉林隆源农业服务有限公司 | Intelligent preparation method and system of controlled release fertilizer |
CN116715560B (en) * | 2023-08-10 | 2023-11-14 | 吉林隆源农业服务有限公司 | Intelligent preparation method and system of controlled release fertilizer |
CN117475182B (en) * | 2023-09-13 | 2024-06-04 | 江南大学 | Stereo matching method based on multi-feature aggregation |
CN117475182A (en) * | 2023-09-13 | 2024-01-30 | 江南大学 | Stereo matching method based on multi-feature aggregation |
CN117237343B (en) * | 2023-11-13 | 2024-01-30 | 安徽大学 | Semi-supervised RGB-D image mirror detection method, storage media and computer equipment |
CN117237343A (en) * | 2023-11-13 | 2023-12-15 | 安徽大学 | Semi-supervised RGB-D image mirror detection method, storage medium and computer equipment |
CN117350926A (en) * | 2023-12-04 | 2024-01-05 | 北京航空航天大学合肥创新研究院 | Multi-mode data enhancement method based on target weight |
CN117350926B (en) * | 2023-12-04 | 2024-02-13 | 北京航空航天大学合肥创新研究院 | Multi-mode data enhancement method based on target weight |
CN117392572A (en) * | 2023-12-11 | 2024-01-12 | 四川能投发展股份有限公司 | Transmission tower bird nest detection method based on unmanned aerial vehicle inspection |
CN117392572B (en) * | 2023-12-11 | 2024-02-27 | 四川能投发展股份有限公司 | Transmission tower bird nest detection method based on unmanned aerial vehicle inspection |
CN117635953B (en) * | 2024-01-26 | 2024-04-26 | 泉州装备制造研究所 | Multi-mode unmanned aerial vehicle aerial photography-based real-time semantic segmentation method for power system |
CN117635953A (en) * | 2024-01-26 | 2024-03-01 | 泉州装备制造研究所 | A real-time semantic segmentation method for power systems based on multi-modal drone aerial photography |
CN118172615A (en) * | 2024-05-14 | 2024-06-11 | 山西新泰富安新材有限公司 | Method for reducing burn rate of heating furnace |
CN118982488A (en) * | 2024-07-19 | 2024-11-19 | 南京审计大学 | A multi-scale low-light image enhancement method based on full-resolution semantic guidance |
CN118553002A (en) * | 2024-07-29 | 2024-08-27 | 浙江幸福轨道交通运营管理有限公司 | Face recognition system and method based on cloud platform four-layer architecture AFC system |
CN119049091A (en) * | 2024-10-30 | 2024-11-29 | 杭州电子科技大学 | Human body identifier identification method based on dynamic detection reliability update |
Also Published As
Publication number | Publication date |
---|---|
CN110956094B (en) | 2023-12-01 |
CN110956094A (en) | 2020-04-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021088300A1 (en) | Rgb-d multi-mode fusion personnel detection method based on asymmetric double-stream network | |
CN109819208B (en) | Intensive population security monitoring management method based on artificial intelligence dynamic monitoring | |
CN107253485B (en) | Foreign matter invades detection method and foreign matter invades detection device | |
CN112288008B (en) | Mosaic multispectral image disguised target detection method based on deep learning | |
AU2006252252B2 (en) | Image processing method and apparatus | |
CN111931684A (en) | A weak and small target detection method based on discriminative features of video satellite data | |
CN110363140A (en) | A real-time recognition method of human action based on infrared images | |
CN107622258A (en) | A Fast Pedestrian Detection Method Combining Static Low-level Features and Motion Information | |
CN110309781A (en) | Remote sensing recognition method for house damage based on multi-scale spectral texture adaptive fusion | |
Zin et al. | Fusion of infrared and visible images for robust person detection | |
CN103049751A (en) | Improved weighting region matching high-altitude video pedestrian recognizing method | |
CN106295636A (en) | Passageway for fire apparatus based on multiple features fusion cascade classifier vehicle checking method | |
CN117152443B (en) | Image instance segmentation method and system based on semantic lead guidance | |
CN112926506A (en) | Non-controlled face detection method and system based on convolutional neural network | |
CN112084928B (en) | Road traffic accident detection method based on visual attention mechanism and ConvLSTM network | |
CN114119586A (en) | Intelligent detection method for aircraft skin defects based on machine vision | |
CN111582074A (en) | Monitoring video leaf occlusion detection method based on scene depth information perception | |
CN105513053A (en) | Background modeling method for video analysis | |
CN114648714A (en) | YOLO-based workshop normative behavior monitoring method | |
CN114519819A (en) | Remote sensing image target detection method based on global context awareness | |
CN113177439B (en) | Pedestrian crossing road guardrail detection method | |
Zhu et al. | Towards automatic wild animal detection in low quality camera-trap images using two-channeled perceiving residual pyramid networks | |
CN115376202A (en) | Deep learning-based method for recognizing passenger behaviors in elevator car | |
CN111274964A (en) | Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle | |
CN107045630B (en) | RGBD-based pedestrian detection and identity recognition method and system |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20884981; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE
| 122 | Ep: pct application non-entry in european phase | Ref document number: 20884981; Country of ref document: EP; Kind code of ref document: A1