
CN108734210B - An object detection method based on cross-modal multi-scale feature fusion


Info

Publication number
CN108734210B
CN108734210B (application CN201810474925.8A)
Authority
CN
China
Prior art keywords
network model
rgb
fusion
training
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810474925.8A
Other languages
Chinese (zh)
Other versions
CN108734210A (en)
Inventor
刘盛
尹科杰
刘儒瑜
陈一彬
沈康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201810474925.8A priority Critical patent/CN108734210B/en
Publication of CN108734210A publication Critical patent/CN108734210A/en
Application granted granted Critical
Publication of CN108734210B publication Critical patent/CN108734210B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/56Extraction of image or video features relating to colour

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an object detection method based on cross-modal multi-scale feature fusion, in which a depth map detection network model is initialized with the network parameters of an RGB detection network model; the feature extraction weights of a fusion network model are then initialized from the resulting RGB detection network model and depth map detection network model, and the fusion network model that fuses multi-scale cross-modal features is finally obtained by training. The method does not depend on a large labeled depth image dataset, can fuse depth image and RGB image features across modalities, and completes object recognition, localization, and detection accurately, in real time, and efficiently. The fusion network model designed by the invention can reach real-time detection speed using only a consumer-grade graphics card and a CPU as hardware.

Description

Object detection method based on cross-modal multi-scale feature fusion
Technical Field
The invention relates to the technical field of image recognition, and in particular to an object detection method based on cross-modal multi-scale feature fusion, which can complete the tasks of detecting, localizing, and accurately recognizing objects in a color-depth image (RGB-D image, containing both color and depth information).
Background
Faster, more accurate, and more generalizable object detection methods are a constant and urgent need in industry. RGB images are severely affected in some special environments; for example, motion or glare can degrade the image data, and detection using RGB image features alone often cannot reach the desired accuracy. It is therefore necessary to use information from different sensors, such as depth information, to improve object detection performance.
Since convolutional neural networks were first applied to object recognition and detection tasks, most high-accuracy object detection methods have been implemented on top of them. These networks can learn generic feature representations of objects from large-scale labeled RGB image datasets. If depth map data is to be used to improve detection accuracy, a generic depth feature representation of the object must be extracted. However, no large-scale, fully labeled depth image dataset with a sufficient number of classes exists in industry, so a generic feature representation of depth information cannot be obtained directly.
On the other hand, existing fused-feature detection methods are limited in speed and often require long computation on a high-performance GPU to produce a result, so they cannot meet the hard real-time requirements of industrial systems.
Disclosure of Invention
The invention aims to provide an RGB-D image detection method based on cross-modal multi-scale feature fusion: a real-time, high-precision fusion model is designed that uses the multi-modal features of an object to detect it accurately, completing the tasks of detecting, localizing, and accurately recognizing objects in the image.
In order to achieve the purpose, the technical scheme of the invention is as follows:
an object detection method based on cross-modal multi-scale feature fusion comprises the following steps:
training a pre-training model with RGB images from a first dataset in which object classes are labeled, and initializing a single-modality RGB detection network model from the pre-training model;
training the RGB detection network model with RGB images from a second dataset in which object classes and positions are labeled;
initializing a single-modality depth map detection network model from the trained RGB detection network model;
training the depth map detection network model with the depth images in the second dataset that correspond to the RGB images and carry the same class and position labels;
initializing a fusion network model and performing multi-scale feature fusion based on the trained RGB detection network model and the depth map detection network model;
training the fusion network model with paired RGB and depth images labeled with object classes and positions;
and detecting objects in a color-depth image with the trained fusion network model.
Further, initializing the single-modality depth map detection network model based on the trained RGB detection network model includes:
copying the network parameters of the RGB detection network model as the network parameters of the depth map detection network model.
Further, initializing the fusion network model and performing multi-scale feature fusion based on the trained RGB detection network model and depth map detection network model includes:
copying the network parameters of the RGB detection network model and the depth map detection network model as the weights of the two feature extraction parts in the fusion network model;
combining the multi-scale features extracted by the two feature extraction parts with a plurality of fusion layers.
Further, the fusion network model adopts Multibox Loss as the loss function during training.
Further, training the RGB detection network model, training the depth map detection network model, and training the fusion network model each further include:
performing data augmentation on the input data.
Further, training the fusion network model further includes:
freezing the weights of the feature extraction parts.
The invention provides an object detection method based on cross-modal multi-scale feature fusion that combines RGB and depth features to improve detection performance: the depth map detection network model is initialized with the network parameters of the RGB detection network model; the feature extraction weights of the fusion network model are then initialized from the resulting RGB detection network model and depth map detection network model, and the fusion network model that fuses multi-scale cross-modal features is finally obtained by training. The method does not depend on a large labeled depth image dataset, can fuse depth image and RGB image features across modalities, and completes object recognition, localization, and detection accurately, in real time, and efficiently. The fusion network model designed by the invention can reach real-time detection speed with only a consumer-grade graphics card and a CPU as hardware, for example a GTX1080 graphics card and an Intel 7700K CPU.
Drawings
FIG. 1 is a flowchart of an object detection method based on cross-modal multi-scale feature fusion according to the present invention;
FIG. 2 is a schematic structural diagram of the fusion network model.
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the drawings and examples, which should not be construed as limiting the present invention.
The general idea of the invention is as follows: the method does not depend on a large labeled depth image dataset, can fuse depth image and RGB image features across modalities, and completes object recognition, localization, and detection accurately, in real time, and efficiently. Training produces a fusion model that accepts cross-modal RGB and depth image input and obtains the positions and classes of multiple objects in real time. The solution requires a cross-modal feature transfer: a depth map network is initialized with the RGB model parameters and trained to obtain a depth map model; the feature extraction parts of the fusion network proposed by the invention are then initialized from the resulting RGB model and depth map model, and finally a network model fusing multi-scale cross-modal features is obtained by training. The multi-scale cross-modal fusion network with high real-time performance and detection accuracy designed by the invention is the core element of the solution.
As shown in fig. 1, an object detection method based on cross-modal multi-scale feature fusion includes:
training a pre-training model with RGB images from a first dataset in which object classes are labeled, and initializing a single-modality RGB detection network model from the pre-training model;
training the RGB detection network model with RGB images from a second dataset in which object classes and positions are labeled;
initializing a single-modality depth map detection network model from the trained RGB detection network model;
training the depth map detection network model with the depth images in the second dataset that correspond to the RGB images and carry the same class and position labels;
initializing a fusion network model and performing multi-scale feature fusion based on the trained RGB detection network model and the depth map detection network model;
training the fusion network model with paired RGB and depth images labeled with object classes and positions;
and detecting objects in a color-depth image with the trained fusion network model.
The steps of the above method are described in detail below. The model training of this embodiment includes three stages: the first stage trains the RGB detection network model, the second stage trains the depth map detection network model by cross-modal supervised transfer training, and the third stage trains the fusion network model based on the trained RGB detection network model and depth map detection network model.
Since convolutional neural networks were first applied to object recognition and detection tasks, most high-accuracy object detection methods have been implemented on top of them. These networks can learn generic feature representations of objects from large-scale labeled RGB image datasets. The technical scheme improves object detection accuracy using depth map data, which requires extracting a generic depth feature representation of the object. However, no large-scale, fully labeled depth image dataset with a sufficient number of classes exists in industry, so a generic feature representation of depth information cannot be obtained directly. In this embodiment, the single-modality RGB detection network model is trained first, and the depth map detection network model is then trained by cross-modal supervised transfer, so that the depth map detection network model can be obtained using only a small-scale dataset.
The first stage: first, the pre-training model is trained with RGB images from the first dataset in which object classes are labeled, and the single-modality RGB detection network model is initialized from the pre-training model; the RGB detection network model is then trained with the RGB images from the second dataset in which object classes and positions are labeled.
Pre-trained models obtained from labeled large-scale RGB image datasets are already mature; for example, this technical scheme can directly adopt a VGG16 model pre-trained on the ImageNet dataset. The pre-training model is trained on a labeled large-scale RGB image dataset (also referred to as the first dataset), in which the object classes in the RGB images are typically labeled.
After the pre-training model is selected, the RGB detection network model is initialized from it, that is, the neural network parameters of the pre-training model are copied into the RGB detection network model. The RGB detection network model is then fine-tuned with the RGB images of a small-scale dataset (also called the second dataset); the classes and positions of the objects to be detected must be labeled in advance in these RGB images. The small-scale dataset also contains depth images corresponding to the RGB images, in which the classes and positions of the objects to be detected are likewise labeled.
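As an illustration only, the following sketch shows how this first stage could look in code, assuming a PyTorch/torchvision implementation (the patent does not name a framework); `RGBDetector`, its head layers, and `num_classes=20` are illustrative placeholders rather than the exact SSD-VGG16 structure.

```python
import torch
import torchvision

# Load a VGG16 backbone pre-trained on ImageNet (the "first dataset").
vgg16 = torchvision.models.vgg16(weights="IMAGENET1K_V1")

class RGBDetector(torch.nn.Module):
    """Single-modality detector sketch; a real SSD-VGG16 has extra layers and multi-scale heads."""
    def __init__(self, backbone, num_classes):
        super().__init__()
        self.features = backbone                     # copied pre-trained feature extractor
        self.loc_head = torch.nn.Conv2d(512, 4, kernel_size=3, padding=1)
        self.cls_head = torch.nn.Conv2d(512, num_classes, kernel_size=3, padding=1)

    def forward(self, x):
        f = self.features(x)                         # shared convolutional features
        return self.loc_head(f), self.cls_head(f)    # box regression and class scores

rgb_model = RGBDetector(vgg16.features, num_classes=20)
# Fine-tuning on the second (small-scale, fully labeled) dataset follows here.
```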
The second stage: in this embodiment, the single-modality depth map detection network model is initialized from the trained RGB detection network model, and is then trained with the depth images in the second dataset that correspond to the RGB images and carry the same class and position labels.
In this embodiment, the RGB detection network model and the depth map detection network model are both single-modality models, and both are represented layer by layer with a neural network. The RGB image modality is represented as

$$\Phi = \{\varphi_i\}_{i=1}^{\#l},$$

where $\varphi_i$ is the $i$-th layer feature representation learned from the large-scale labeled dataset, $\#l$ is the number of layers of the neural network, and the parameters of the neural network are denoted by $W_{\varphi}^{[1\ldots\#l]}$.

The depth image modality is represented as

$$\Psi = \{\psi_i\}_{i=1}^{\#u},$$

where $\psi_i$ is the $i$-th layer feature representation, $\#u$ is the number of layers of the neural network, and likewise $W_{\psi}^{[1\ldots\#u]}$ denotes the parameters of the layered neural network representation.
In this embodiment, initializing the single-modality depth map detection network model from the trained RGB detection network model means copying the network parameters $W_{\varphi}^{[1\ldots\#l]}$ of the RGB detection network model as the network parameters of the depth map detection network model. The depth map detection network model is then fine-tuned with the depth image part of the small-scale dataset; its trained network parameters are denoted $W_{\psi}^{[1\ldots\#u]}$. The resulting depth map detection network model can identify the object classes and positions in the depth map.
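Continuing the illustrative sketch above (same hypothetical objects, not the patent's own code), the cross-modal initialization amounts to copying the RGB network parameters into an identically structured depth network before fine-tuning on the depth images:

```python
import copy

# Build a depth detector with the same architecture and copy the RGB network's parameters
# into it (W_phi used to initialize W_psi), then fine-tune on HHA-encoded depth images.
depth_model = RGBDetector(copy.deepcopy(vgg16.features), num_classes=20)
depth_model.load_state_dict(rgb_model.state_dict())
```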
A cross-modal Supervision Transfer method is adopted: the neural network of the depth modality is initialized from the neural network representation of the RGB modality, and this cross-modal transfer has been verified on the convolution-pooling layers. Assume that a large paired, but unlabeled, dataset $P_{l,u}$ covering the two modalities is available. The feature representations $\varphi_{\#l}$ and $\psi_{\#u}$ are matched on the paired RGB image $I_l$ and depth image $I_u$ ($\varphi_{\#l}$ is the partial representation of the RGB network; $I_l$ and $I_u$ are the RGB and depth images in the dataset, respectively), so that a rich representation of the depth map can be learned from the RGB network. A transformation function $t$ is used to bring the two representations to the same dimension, and a loss function $f$ (which may take any functional form) is defined on the above networks; the parameters of the depth map network can then be obtained by training:

$$\min_{W_{\psi}^{[1\ldots\#u]},\, t}\ \sum_{(I_l, I_u) \in P_{l,u}} f\big(\varphi_{\#l}(I_l),\; t(\psi_{\#u}(I_u))\big)$$
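A minimal sketch of this objective, assuming PyTorch and choosing $f$ as a mean squared error and $t$ as a 1×1 convolution (the patent leaves $f$ and $t$ general), could look as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupervisionTransferLoss(nn.Module):
    """f(phi(I_l), t(psi(I_u))) with f = mean squared error and t = a 1x1 convolution
    that maps the depth feature channels onto the RGB feature channels."""
    def __init__(self, depth_channels, rgb_channels):
        super().__init__()
        self.t = nn.Conv2d(depth_channels, rgb_channels, kernel_size=1)   # transformation t

    def forward(self, phi_rgb, psi_depth):
        # phi_rgb: features of the (frozen) RGB network on I_l;
        # psi_depth: features of the depth network on the paired I_u.
        return F.mse_loss(self.t(psi_depth), phi_rgb)

# During training, only the depth network and t are updated; the RGB features act as the
# supervisory signal on paired but unlabeled RGB-D images.
```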
the single-mode detection network of the invention comprises an element _ wise-sum layer, a permate layer, a flatten layer and a private layer besides the conventional convolution pooling and full connection layer. First, the element _ wise-sum layer sums the feature maps, which can be regarded as a summation operation of the multidimensional matrix. Second, the permate layer changes the order of the data dimensions, which is the same as multiplying by an identity matrix that has undergone row swapping. The flatten layer then merges the multidimensional matrices into one dimension. Finally, the priorbox layer is used to process the bounding box and has no effect on the rich features of the image. All these layers can be unified into a conversion function s, and the loss function of the network can be described as follows:
Figure BDA0001664185870000065
based on the loss function, the cross-modal model transfer of the single-modal network can be realized.
It should be noted that the small-scale dataset contains RGB images labeled with object classes and positions, together with corresponding depth maps labeled with the same classes and positions. When the small-scale dataset is used to fine-tune the RGB detection network model, its RGB images are used; when it is used to fine-tune the depth map detection network model, its depth maps are used, represented in HHA format (HHA encoding has three channels: horizontal disparity, height above ground, and angle with gravity), whose dimensions are consistent with the RGB images.
In this embodiment, when the VGG16 model is used as the pre-training model, the RGB detection network model and the depth map detection network model share the same network structure and can take the form of SSD-VGG16, i.e., an SSD network that uses VGG16 as its feature extraction part. These networks are not fixed, however, and other network forms can be used; for example, with ResNet as the pre-training model, the corresponding RGB detection network model and depth map detection network model can take the form of SSD-ResNet.
The third stage: based on the trained RGB detection network model and depth map detection network model, the fusion network model is initialized and multi-scale feature fusion is performed; the fusion network model is then trained with paired RGB and depth images labeled with object classes and positions.
In this embodiment, the trained network parameters of the RGB detection network model and the depth map detection network model are used to initialize the weights of the feature extraction parts in the fusion network model, that is, those network parameters are copied as the weights of the feature extraction parts of the fusion network. The input data for fine-tuning training are then paired RGB and depth images, which may take the form of the second dataset. As shown in fig. 2, taking SSD-VGG16 as the feature extraction part as an example, the fusion network copies all parameters of the SSD-VGG16 modules obtained from the RGB detection network and the depth map detection network as the weights of its two feature extraction parts (the RGB feature extraction part and the depth map feature extraction part). The fusion network thus takes an RGB image and a depth image as input and, after the two feature extraction parts process them separately, obtains multi-level generic RGB features and generic depth image features (multi-scale features).
After the two feature extraction parts (the RGB feature extraction part and the depth map feature extraction part) have each extracted the multi-scale features of their modality, the fusion network model fuses features from different scales with a multi-layer structure (a plurality of fusion layers combine the multi-scale features extracted by the two feature extraction parts); these features come from convolution-pooling layers with more pronounced semantic features, corresponding to higher layers of the network structure. There are two feasible fusion points in the fusion network architecture: one is at the lower layers of the network, before the feature extraction part; the other is after the feature extraction layers, at higher layers of the network. Lower network layers carry more spatial features, while higher layers carry more semantic features. If two objects to be detected are the same object, their high-level generic feature representations are close, but their low-level representations may differ greatly. The invention therefore chooses high-level rather than low-level fusion, and experiments also show that high-level fusion achieves better results. In the architecture of the fusion network, the invention fuses features from several specific network layers rather than from a single layer, namely the multi-scale features of several convolution-pooling layers with more pronounced semantic features. As shown in fig. 2, taking SSD-VGG16 as the feature extraction part as an example, the fusion layers can fuse the features of the conv4-3, fc7, conv6_2, conv7_2, conv8_2 and conv9_2 layers to obtain the fused features of the upper network layers. Note that fig. 2 shows only conv4-3 and fc7; the fusion of the other layers is analogous and is omitted from the figure.
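For illustration, a sketch of the element-wise-sum fusion of the multi-scale features can be as simple as the following; the layer names follow the SSD-VGG16 example above, and the tensors of the two branches are assumed to have matching shapes:

```python
# Layer names follow the SSD-VGG16 feature extractor described above.
FUSED_LAYERS = ["conv4_3", "fc7", "conv6_2", "conv7_2", "conv8_2", "conv9_2"]

def fuse_multiscale(rgb_feats, depth_feats):
    """Element-wise-sum fusion of multi-scale features from the RGB and depth branches.
    Both arguments map layer names to feature tensors of identical shape."""
    return {name: rgb_feats[name] + depth_feats[name] for name in FUSED_LAYERS}
```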
Experimental results show that replacing the element-wise-sum layers with concatenation layers gives worse results, so the cross-modal features of the RGB and depth maps are fused with a corresponding number of element-wise-sum merging layers. The fused features are used to predict object classes and positions: once the features are obtained, regression predictions are produced by convolution layers with two 3×3 convolution kernels, where the first kernel predicts the position (1×4 dimensions) and the other predicts the object class (1×the number of classes to be predicted). Finally, the resulting predictions are filtered by non-maximum suppression (NMS) to obtain the final result. The fusion network adopts Multibox Loss as its loss function during training.
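A hedged sketch of such a prediction head and the final NMS step, assuming PyTorch/torchvision with illustrative parameter names (`num_priors` and `iou_threshold=0.45` are not specified by the patent):

```python
import torch
import torch.nn as nn
from torchvision.ops import nms

class PredictionHead(nn.Module):
    """Two 3x3 convolutions per fused scale: one regresses box offsets (4 values per
    prior box), the other predicts class scores (num_classes values per prior box)."""
    def __init__(self, in_channels, num_priors, num_classes):
        super().__init__()
        self.loc = nn.Conv2d(in_channels, num_priors * 4, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(in_channels, num_priors * num_classes, kernel_size=3, padding=1)

    def forward(self, fused_feat):
        return self.loc(fused_feat), self.cls(fused_feat)

# After decoding the prior boxes and gathering per-class scores (omitted), the detections
# are filtered with non-maximum suppression, e.g.:
# keep = nms(boxes, scores, iou_threshold=0.45)
```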
In order to make better use of the input data, this embodiment applies data augmentation to the input data, for example rotation, mirroring, and cropping, to introduce spatial diversity into the images, so that the trained model is more robust.
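As an illustration, paired augmentation can be implemented by drawing the random parameters once and applying them to both modalities; the following NumPy sketch (mirroring and cropping only, with an assumed 90% crop ratio) is not part of the patented scheme:

```python
import numpy as np

def paired_augment(rgb, depth, rng=None):
    """Apply the same randomly drawn mirror/crop to an RGB image and its paired depth
    (HHA) image; rotation is analogous. Bounding-box labels must be adjusted the same
    way (omitted here). Inputs are HxWxC NumPy arrays."""
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:                               # shared horizontal mirror
        rgb, depth = rgb[:, ::-1], depth[:, ::-1]
    h, w = rgb.shape[:2]
    ch, cw = int(0.9 * h), int(0.9 * w)                  # crop to 90% of each side
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return (rgb[top:top + ch, left:left + cw],
            depth[top:top + ch, left:left + cw])
```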
When the fusion network model is trained, the weights of the feature extraction parts are frozen and only the fusion part is trained, that is, the learning rate of the feature extraction parts is set below a threshold (0 or a very small value, such as 10e-8), so that the training process concentrates on the fusion part without changing the weights of the RGB and depth feature extraction parts too much. By freezing the module weights copied from the RGB and depth models and training only the fusion part, the fine-tuning of the fusion network is completed. The number of training iterations is generally more than forty thousand, and the base learning rate is set to about 0.001. The fusion part is the part of the fusion network model other than the feature extraction parts.
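A minimal sketch of this freezing strategy, assuming PyTorch and an illustrative `fusion_model` with `rgb_features`/`depth_features` submodules (names not taken from the patent):

```python
import torch

# fusion_model with .rgb_features / .depth_features submodules is an illustrative name;
# the patent only specifies that the copied feature extractors are frozen.
for branch in (fusion_model.rgb_features, fusion_model.depth_features):
    for p in branch.parameters():
        p.requires_grad = False                     # freeze the copied extractor weights

optimizer = torch.optim.SGD(
    (p for p in fusion_model.parameters() if p.requires_grad),   # fusion part and heads only
    lr=0.001, momentum=0.9)                         # base learning rate ~0.001, as stated above
# Train for on the order of 40k+ iterations, updating only the fusion layers and heads.
```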
The technical scheme of the invention combines RGB and depth features to improve detection performance. Before the features are combined, the RGB and depth images are converted into generic feature representations by the two feature extraction parts of the fusion network; each feature extraction part consists of several convolution-pooling layers, one for RGB and one for depth images, and their weights are obtained by initializing and training the two single-modality models, the RGB detection network model and the depth map detection network model. The two single-modality networks are individually fine-tuned before the fusion network is trained and use the same architecture. This embodiment can therefore obtain a generic feature representation of depth images without a large depth-annotated dataset. In addition, during training of the fusion network, the input data of the two modalities must have the same dimensions, and the data augmentation operations applied to the two input images (RGB and depth) must also be identical.
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art can make various corresponding changes and modifications according to the present invention without departing from the spirit and the essence of the present invention, but these corresponding changes and modifications should fall within the protection scope of the appended claims.

Claims (6)

1. An object detection method based on cross-modal multi-scale feature fusion, characterized in that the method comprises the following steps:
training a pre-training model with a first dataset, wherein the first dataset comprises RGB images in which object classes are labeled, and initializing a single-modality RGB detection network model from the pre-training model;
training the RGB detection network model with the RGB images in a second dataset, wherein the second dataset comprises RGB images in which object classes and positions are labeled, together with corresponding depth images;
initializing a single-modality depth map detection network model from the trained RGB detection network model;
training the depth map detection network model with the depth images in the second dataset;
initializing a fusion network model and performing multi-scale feature fusion based on the trained RGB detection network model and the depth map detection network model;
training the fusion network model with paired RGB and depth images labeled with object classes and positions;
and detecting objects in a color-depth image with the trained fusion network model.
2. The method for object detection based on cross-modal multi-scale feature fusion according to claim 1, wherein initializing a single-modal depth map detection network model based on the trained RGB detection network model comprises:
copying the network parameters of the RGB detection network model as the network parameters of the depth map detection network model.
3. The method for detecting an object based on cross-modal multi-scale feature fusion of claim 1, wherein initializing a fusion network model and performing multi-scale feature fusion based on the trained RGB detection network model and the depth map detection network model comprises:
copying network parameters of the RGB detection network model and the depth map detection network model as weights of two feature extraction parts in the fusion network model;
the multi-scale features extracted by the two feature extraction parts are combined by a plurality of fusion layers.
4. The method for object detection based on cross-modal multi-scale feature fusion of claim 3, wherein the fusion network model adopts Multibox Loss as a Loss function during training.
5. The method for object detection based on cross-modal multi-scale feature fusion of claim 1, wherein training the RGB detection network model, training the depth map detection network model, and training the fusion network model each further comprise:
performing data augmentation on the input data.
6. The method for detecting an object based on cross-modal multi-scale feature fusion according to claim 3, wherein when training the fusion network model, the method further comprises:
the weights of the feature extraction sections are frozen.
CN201810474925.8A 2018-05-17 2018-05-17 An object detection method based on cross-modal multi-scale feature fusion Active CN108734210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810474925.8A CN108734210B (en) 2018-05-17 2018-05-17 An object detection method based on cross-modal multi-scale feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810474925.8A CN108734210B (en) 2018-05-17 2018-05-17 An object detection method based on cross-modal multi-scale feature fusion

Publications (2)

Publication Number Publication Date
CN108734210A CN108734210A (en) 2018-11-02
CN108734210B true CN108734210B (en) 2021-10-15

Family

ID=63938564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810474925.8A Active CN108734210B (en) 2018-05-17 2018-05-17 An object detection method based on cross-modal multi-scale feature fusion

Country Status (1)

Country Link
CN (1) CN108734210B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070072A (en) * 2019-05-05 2019-07-30 厦门美图之家科技有限公司 A method of generating object detection model
CN110334708A (en) * 2019-07-03 2019-10-15 中国科学院自动化研究所 Method, system and device for automatic calibration of differences in cross-modal target detection
CN110334769A (en) * 2019-07-09 2019-10-15 北京华捷艾米科技有限公司 Target identification method and device
CN110852350B (en) * 2019-10-21 2022-09-09 北京航空航天大学 Pulmonary nodule benign and malignant classification method and system based on multi-scale migration learning
CN110956094B (en) * 2019-11-09 2023-12-01 北京工业大学 RGB-D multi-mode fusion personnel detection method based on asymmetric double-flow network
CN113033258B (en) * 2019-12-24 2024-05-28 百度国际科技(深圳)有限公司 Image feature extraction method, device, equipment and storage medium
CN111242238B (en) * 2020-01-21 2023-12-26 北京交通大学 RGB-D image saliency target acquisition method
CN111540343B (en) * 2020-03-17 2021-02-05 北京捷通华声科技股份有限公司 Corpus identification method and apparatus
CN111723649B (en) * 2020-05-08 2022-08-12 天津大学 A short video event detection method based on semantic decomposition
CN112183619A (en) * 2020-09-27 2021-01-05 南京三眼精灵信息技术有限公司 Digital model fusion method and device
CN113077491B (en) * 2021-04-02 2023-05-02 安徽大学 RGBT target tracking method based on cross-modal sharing and specific representation form
CN114519377A (en) * 2021-12-14 2022-05-20 中煤科工集团信息技术有限公司 Cross-modal coal gangue sorting method and device
CN114581838B (en) * 2022-04-26 2022-08-26 阿里巴巴达摩院(杭州)科技有限公司 Image processing method, device and cloud device
CN115965846B (en) * 2023-01-18 2025-04-01 重庆邮电大学 Curtain wall frame real-time detection method and device based on frame-aware cross-modal fusion network

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102800079B (en) * 2012-08-03 2015-01-28 西安电子科技大学 Multimode image fusion method based on SCDPT transformation and amplitude-phase combination thereof
EP2910187B1 (en) * 2014-02-24 2018-04-11 Université de Strasbourg (Etablissement Public National à Caractère Scientifique, Culturel et Professionnel) Automatic multimodal real-time tracking of a moving marker for image plane alignment inside a MRI scanner
CN106981059A (en) * 2017-03-30 2017-07-25 中国矿业大学 With reference to PCNN and the two-dimensional empirical mode decomposition image interfusion method of compressed sensing
CN107066583B (en) * 2017-04-14 2018-05-25 华侨大学 A kind of picture and text cross-module state sensibility classification method based on the fusion of compact bilinearity
CN107463952B (en) * 2017-07-21 2020-04-03 清华大学 An object material classification method based on multimodal fusion deep learning
CN107403201A (en) * 2017-08-11 2017-11-28 强深智能医疗科技(昆山)有限公司 Tumour radiotherapy target area and jeopardize that organ is intelligent, automation delineation method

Also Published As

Publication number Publication date
CN108734210A (en) 2018-11-02

Similar Documents

Publication Publication Date Title
CN108734210B (en) An object detection method based on cross-modal multi-scale feature fusion
CN113056743B (en) Training a neural network for vehicle re-identification
CN111797893B (en) Neural network training method, image classification system and related equipment
CN107424159B (en) Image semantic segmentation method based on super-pixel edge and full convolution network
CN108288088B (en) A scene text detection method based on end-to-end fully convolutional neural network
Liu et al. 3D Point cloud analysis
CN111027576B (en) Co-saliency detection method based on co-saliency generative adversarial network
CN110929080B (en) An Optical Remote Sensing Image Retrieval Method Based on Attention and Generative Adversarial Networks
JP2017062781A (en) Similarity-based detection of prominent objects using deep cnn pooling layers as features
CN106815323B (en) Cross-domain visual retrieval method based on significance detection
CN110555420B (en) Fusion model network and method based on pedestrian regional feature extraction and re-identification
Yu et al. A content-adaptively sparse reconstruction method for abnormal events detection with low-rank property
CN116563840B (en) Scene text detection and recognition method based on weak supervision cross-mode contrast learning
CN114359709A (en) Target detection method and device for remote sensing image
Guo et al. UDTIRI: An online open-source intelligent road inspection benchmark suite
CN117475228A (en) A 3D point cloud classification and segmentation method based on dual-domain feature learning
CN114972947B (en) Depth scene text detection method and device based on fuzzy semantic modeling
Qi et al. Unstructured road detection via combining the model‐based and feature‐based methods
CN110852102B (en) Chinese part-of-speech tagging method and device, storage medium and electronic equipment
CN118351144A (en) Target tracking method, target tracking system, storage medium and electronic equipment
Geng et al. SANet: A novel segmented attention mechanism and multi-level information fusion network for 6D object pose estimation
CN117197577A (en) Adversarial training method for target detection model based on contrastive learning
CN114913330B (en) Point cloud component segmentation method and device, electronic equipment and storage medium
Yang et al. UAV Landmark Detection Based on Convolutional Neural Network
Farsi et al. Improving Deep Learning-based Saliency Detection Using Channel Attention Module

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant