Cross-modal pedestrian re-identification method based on auxiliary modality enhancement and multi-scale feature fusion
Technical Field
The invention relates to the technical field of computer vision image retrieval, and in particular to a cross-modal pedestrian re-identification method based on auxiliary modality enhancement and multi-scale feature fusion.
Background
Pedestrian re-identification aims to retrieve pedestrians of a specific identity from a large candidate set, where the candidate images come from different cameras and differ in shooting angle, background, illumination, and so on. In recent years, research on pedestrian re-identification in the visible-light modality has flourished, and its performance has steadily improved. However, conventional methods have limited use scenarios because a visible-light camera cannot capture clear images at night. To achieve all-weather monitoring, modern surveillance systems often incorporate infrared cameras to obtain pedestrian images in dark environments. Because of the significant modality difference between infrared and visible-light images, traditional visible-light pedestrian re-identification methods cannot effectively match the two modalities. Thus, visible-infrared cross-modal pedestrian re-identification technology has emerged to address this challenge.
The challenge of visible-infrared cross-modal pedestrian re-identification lies in the large cross-modal difference between the two kinds of images. Existing approaches typically reduce modality differences from two perspectives: the image level and the feature level. Image-level methods typically use a generative adversarial network (GAN) to convert the original visible and infrared images into images of the same or a similar style, reducing the style gap between them. However, the generated images are of poor quality and prone to additional noise, which does little to improve model performance. Feature-level approaches aim to map the features of the two modalities into a common space to obtain modality-shared features. However, this approach sacrifices some modality-specific discriminative information, which is detrimental to model performance.
Disclosure of Invention
The invention aims to: provide a cross-modal pedestrian re-identification method based on auxiliary modality enhancement and multi-scale feature fusion that solves the problems of the prior art, effectively bridges the modality difference, and fully mines pedestrian identity information.
The technical scheme is as follows: the invention discloses a cross-modal pedestrian re-identification method based on auxiliary modality enhancement and multi-scale feature fusion, comprising the following steps:
(1) Acquire the original images and divide them into a training set, a validation set, and a test set; preprocess the visible-light and infrared images in the training set;
(2) Use ResNet as the backbone network and add an auxiliary modality enhancement module;
(3) Feed the features output in step (2) into ResNet for further feature extraction and fusion, and compute the cross-modal instance aggregation loss; a multi-scale feature fusion module is added after the third and fourth residual blocks of ResNet;
(4) Apply global average pooling and batch normalization to the final ResNet output features, and compute the local semantic consistency loss.
Further, step (1) specifically comprises: acquiring pedestrian images and identity labels from the existing datasets SYSU-MM01 and RegDB and dividing them into a training set, a validation set, and a test set; applying horizontal flipping and random erasing as preprocessing operations to the training images and resizing them to 288 x 144 pixels; all images are then normalized using the channel mean and standard deviation.
Further, step (2) specifically comprises: first, performing random channel combination on the visible-light images in the training set to obtain auxiliary-modality images; inputting the images of the three modalities into the ResNet network; and then enhancing the auxiliary-modality image representation using an attention-weighted fusion strategy.
Further, step (3) comprises the following steps:
(31) Feed the features output in step (2) into a shallow network formed by the first and second residual blocks of ResNet to continue extracting features;
(32) Apply global average pooling and batch normalization to the features output by the shallow network, then compute the cross-modal instance aggregation loss;
(33) Input the features of the three modalities output by the second residual block of ResNet into a modality-shared branch composed of the third and fourth residual blocks of ResNet, where a multi-scale feature fusion module is added after each of the third and fourth residual blocks.
Further, step (32) specifically comprises: let the shallow feature maps output by the second residual block be $F_v$, $F_r$, and $F_a$ for the visible, infrared, and auxiliary modalities. After global average pooling and batch normalization, the mean of the cross-modal paired sample feature differences is calculated, and the cross-modal instance aggregation loss penalizes it, as follows:

$$D_{mn} = \frac{1}{N}\sum_{i=1}^{N}\left(f_i^m - f_i^n\right)$$

$$\mathcal{L}_{cia} = \sum_{(m,n)} \left\| D_{mn} \right\|_2$$

where $N$ is the number of paired samples in a training batch, $f_i^m$ and $f_i^n$ denote the features of the $i$-th sample in modality $m$ and modality $n$, respectively, and $D_{mn}$ represents the mean of the difference between the two features; the loss sums over the modality pairs.
Further, in step (33), the multi-scale feature fusion module comprises two branches of identical structure. It takes as inputs the low-level features $X_l \in \mathbb{R}^{h\times w\times c}$ output by the previous residual block and the high-level features $X_h$ output by the current residual block, where $h$, $w$, and $c$ denote the height, width, and number of channels of the features, respectively.

Each branch uses dilated convolutions to obtain the multi-scale low-level features $\tilde{X}_l$ and high-level features $\tilde{X}_h$, which are then weighted and fused, and the fused feature $X_{fused}$ is input to the next stage, as follows:

$$X_{fused} = \alpha \tilde{X}_l + (1-\alpha)\tilde{X}_h$$

where $\alpha$ is a learnable parameter controlling the fusion ratio of the low-level and high-level features.
Further, in step (4), the local semantic consistency loss $\mathcal{L}_{lsc}$ is computed on the final output features after global average pooling and batch normalization, where its two weighting coefficients are hyperparameters.
The invention also relates to a cross-modal pedestrian re-identification system based on auxiliary modality enhancement and multi-scale feature fusion, comprising:
A preprocessing module: for acquiring the original images, dividing them into a training set, a validation set, and a test set, and preprocessing the visible-light and infrared images in the training set;
An auxiliary modality enhancement module: added to a backbone network that uses ResNet;
A multi-scale feature fusion module: for feeding the features output by the auxiliary modality enhancement module into ResNet for feature extraction and fusion and computing the cross-modal instance aggregation loss, where a multi-scale feature fusion module is added after the third and fourth residual blocks of ResNet;
A local semantic consistency module: for applying global average pooling and batch normalization to the final ResNet output features and computing the local semantic consistency loss.
The device of the invention comprises a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements any one of the above cross-modal pedestrian re-identification methods based on auxiliary modality enhancement and multi-scale feature fusion.
The storage medium of the invention stores a computer program designed to implement, when run, any one of the above cross-modal pedestrian re-identification methods based on auxiliary modality enhancement and multi-scale feature fusion.
The beneficial effects are that: compared with the prior art, the invention has the notable advantages that, by adding the auxiliary modality enhancement mechanism and the multi-scale feature fusion module to ResNet, the modality difference between visible light and infrared can be effectively reduced, more modality-shared identity information can be learned, identity information from different receptive fields can be captured, and the pedestrian identity features can be fully mined. The cross-modal instance aggregation loss and the local semantic consistency loss impose a dual constraint on the shallow and deep features, further enhancing the discriminability and robustness of the features.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a network structure diagram of the cross-modal pedestrian re-identification framework based on auxiliary modality enhancement and multi-scale feature fusion of the present invention;
FIG. 3 is a network structure diagram of the auxiliary modality enhancement module in the cross-modal pedestrian re-identification framework of the present invention;
FIG. 4 is a network structure diagram of the multi-scale feature fusion module in the cross-modal pedestrian re-identification framework of the present invention;
FIG. 5 is a schematic diagram of the loss functions in the dual feature-space constraint of the cross-modal pedestrian re-identification framework of the present invention;
FIG. 6 is a training flow chart of the neural network model of the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings.
As shown in FIGS. 1-6, an embodiment of the present invention provides a cross-modal pedestrian re-identification method based on auxiliary modality enhancement and multi-scale feature fusion, comprising the following steps:
(1) Acquire the original images and divide them into a training set, a validation set, and a test set; preprocess the visible-light and infrared images in the training set. Specifically: acquire pedestrian images and identity labels from the existing datasets SYSU-MM01 and RegDB and divide them into a training set, a validation set, and a test set; apply horizontal flipping and random erasing as preprocessing operations to the training images and resize them to 288 x 144 pixels; then normalize all images using the channel mean and standard deviation.
(2) Use ResNet as the backbone network and add an auxiliary modality enhancement module. Specifically: first, perform random channel combination on the visible-light images in the training set to obtain auxiliary-modality images; input the images of the three modalities into the ResNet network; and then enhance the auxiliary-modality image representation using an attention-weighted fusion strategy. The attention-weighted fusion strategy proceeds as follows: the similarity between the auxiliary-modality image and the visible-light image and the similarity between the auxiliary-modality image and the infrared image are computed separately, and the two similarities are then used to enhance the auxiliary-modality image representation. First, 1 x 1 convolutions transform the features of the three modalities $F_a$, $F_v$, and $F_r$ into compact features:

$$Q = W_q F_a,\qquad K_v = W_v F_v,\qquad K_r = W_r F_r$$

where $W_q$, $W_v$, and $W_r$ denote the parameters of the three 1 x 1 convolutions, respectively.

The attention maps between the auxiliary features and the visible-light and infrared features are then computed using Softmax:

$$A_v = \mathrm{Softmax}\!\left(\frac{Q K_v^{\top}}{\sqrt{d}}\right),\qquad A_r = \mathrm{Softmax}\!\left(\frac{Q K_r^{\top}}{\sqrt{d}}\right)$$

where $d$ represents the channel dimension of $Q$. The two attention maps are weighted and fused to obtain the enhanced attention map:

$$A = \beta_1 A_v + \beta_2 A_r$$

where $\beta_1$ and $\beta_2$ are learnable parameters representing the fusion weights of the two attention maps.

Finally, the auxiliary modality feature map is enhanced using the fused attention map and a residual connection:

$$\hat{F}_a = W(A F_a) + F_a$$

where $W$ is the learnable parameter of a fully connected layer comprising a 1 x 1 convolution and batch normalization, and $\hat{F}_a$ denotes the enhanced auxiliary modality feature map.
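The attention-weighted fusion strategy above can be sketched as a PyTorch module. This is an illustrative reading of the text under stated assumptions: the query is taken from the auxiliary modality and the keys from the visible and infrared modalities, attention is computed over spatial positions, and all module and parameter names (`q`, `kv`, `kr`, `beta`, ...) are invented for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxModalityEnhance(nn.Module):
    """Sketch of the auxiliary modality enhancement module (step 2)."""

    def __init__(self, channels):
        super().__init__()
        # 1x1 convolutions produce compact features for the three modalities
        self.q = nn.Conv2d(channels, channels, 1)   # query from auxiliary F_a
        self.kv = nn.Conv2d(channels, channels, 1)  # key from visible F_v
        self.kr = nn.Conv2d(channels, channels, 1)  # key from infrared F_r
        # learnable fusion weights beta_1, beta_2 for the two attention maps
        self.beta = nn.Parameter(torch.tensor([0.5, 0.5]))
        # "fully connected layer" W: 1x1 convolution + batch normalization
        self.out = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                 nn.BatchNorm2d(channels))

    def forward(self, xa, xv, xr):
        b, c, h, w = xa.shape
        q = self.q(xa).flatten(2)            # (b, c, hw)
        kv = self.kv(xv).flatten(2)
        kr = self.kr(xr).flatten(2)
        d = q.size(1)                        # channel dimension of Q
        # attention maps between auxiliary and each original modality
        attn_v = F.softmax(q.transpose(1, 2) @ kv / d ** 0.5, dim=-1)
        attn_r = F.softmax(q.transpose(1, 2) @ kr / d ** 0.5, dim=-1)
        attn = self.beta[0] * attn_v + self.beta[1] * attn_r
        # apply the fused attention map to F_a, then residual-connect
        enhanced = (xa.flatten(2) @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.out(enhanced) + xa
```

The residual connection at the end preserves the original auxiliary representation so the attention acts as an enhancement rather than a replacement.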
(3) Feed the features output in step (2) into ResNet for further feature extraction and fusion, and compute the cross-modal instance aggregation loss; a multi-scale feature fusion module is added after the third and fourth residual blocks of ResNet. Specifically:
(31) Feed the features output in step (2) into a shallow network formed by the first and second residual blocks of ResNet to continue extracting features;
(32) Apply global average pooling and batch normalization to the features output by the shallow network, then compute the cross-modal instance aggregation loss. Specifically: let the shallow feature maps output by the second residual block be $F_v$, $F_r$, and $F_a$. After global average pooling and batch normalization, the mean of the cross-modal paired sample feature differences is calculated, and the loss penalizes it, as follows:

$$D_{mn} = \frac{1}{N}\sum_{i=1}^{N}\left(f_i^m - f_i^n\right)$$

$$\mathcal{L}_{cia} = \sum_{(m,n)} \left\| D_{mn} \right\|_2$$

where $N$ is the number of paired samples in a training batch, $f_i^m$ and $f_i^n$ denote the features of the $i$-th sample in modality $m$ and modality $n$, respectively, and $D_{mn}$ represents the mean of the difference between the two features.
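The cross-modal instance aggregation loss for one modality pair can be sketched as follows, assuming PyTorch. The exact reduction is a plausible reading of the text (mean of the per-instance cross-modal feature differences, then its norm), not a verbatim reproduction of the patented formula.

```python
import torch

def cross_modal_instance_loss(feat_m, feat_n):
    """Sketch of the per-pair cross-modal instance aggregation term.

    feat_m, feat_n: (N, C) pooled and batch-normalized features of the
    i-th paired samples in modalities m and n.
    """
    # mean over the batch of the per-instance feature differences;
    # penalizing its norm pulls the two modality distributions together
    diff = (feat_m - feat_n).mean(dim=0)   # (C,) mean cross-modal difference
    return diff.norm(p=2)
```

In the full loss this term would be summed over the modality pairs (visible-infrared, visible-auxiliary, infrared-auxiliary).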
(33) Input the features of the three modalities output by the second residual block of ResNet into a modality-shared branch composed of the third and fourth residual blocks of ResNet, where a multi-scale feature fusion module is added after each of the third and fourth residual blocks. The multi-scale feature fusion module comprises two branches of identical structure: it takes as inputs the low-level features $X_l \in \mathbb{R}^{h\times w\times c}$ output by the previous residual block and the high-level features $X_h$ output by the current residual block, where $h$, $w$, and $c$ denote the height, width, and number of channels of the features, respectively.
Each branch uses dilated convolutions to obtain the multi-scale low-level features $\tilde{X}_l$ and high-level features $\tilde{X}_h$, which are then weighted and fused, and the fused feature $X_{fused}$ is input to the next stage, as follows:

$$X_{fused} = \alpha \tilde{X}_l + (1-\alpha)\tilde{X}_h$$

where $\alpha$ is a learnable parameter controlling the fusion ratio of the low-level and high-level features.
The multi-scale low-level features $\tilde{X}_l$ and high-level features $\tilde{X}_h$ are computed as follows: two fully connected layers are designed to obtain features of different receptive fields using 3 x 3 convolutions with different dilation rates, one with dilation rate 1 and one with dilation rate 2:

$$U = \mathcal{F}_1(X_h),\qquad V = \mathcal{F}_2(X_h)$$

where $U$ and $V$ represent features of different receptive fields, and $\mathcal{F}_1$ and $\mathcal{F}_2$ represent the fully connected layers, each consisting of a dilated convolution, a batch normalization layer, and a ReLU activation function.
The features of different scales obtained by the two branches are then fused by element-wise addition, and global average pooling yields the global feature information $s$. A fully connected layer compresses the channel dimension of the feature from $c$ to $c/r$ to obtain a more compact feature $Z$; to balance performance and complexity, the reduction ratio $r$ is set to 16. The process is expressed as follows:

$$s = \mathrm{GAP}(U + V),\qquad Z = \mathcal{F}_{fc}(s)$$
To enable adaptive selection of features at different scales, the feature maps of the different receptive-field branches are given different attention weights derived from the compact feature $Z$:

$$a = \mathrm{Softmax}(W_a Z),\qquad b = \mathrm{Softmax}(W_b Z)$$

where $W_a$ and $W_b$ restore the channel dimension of the compact feature from $c/r$ to $c$, and $a$ and $b$ represent the attention weights of $U$ and $V$. Finally, weighted fusion with the attention weights yields the deep multi-scale features:

$$\tilde{X}_h = a \cdot U + b \cdot V$$
Similarly, after the resolution of the low-level features $X_l$ is reduced by a 1 x 1 convolution, the multi-scale low-level features $\tilde{X}_l$ are obtained by the same steps.
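One branch of the multi-scale feature fusion module described above can be sketched in PyTorch. It follows the selective-kernel style the text describes: two dilated 3 x 3 branches, element-wise summation, global average pooling, channel compression with $r = 16$, and softmax attention across the two branches. The class and attribute names are illustrative, and the per-channel softmax across branches is an assumption about how the attention weights $a$ and $b$ are normalized.

```python
import torch
import torch.nn as nn

class MultiScaleBranch(nn.Module):
    """Sketch of one branch of the multi-scale feature fusion module."""

    def __init__(self, channels, r=16):
        super().__init__()
        def fc_layer(dilation):
            # "fully connected layer": dilated conv + batch norm + ReLU
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=dilation,
                          dilation=dilation, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True))
        self.f1 = fc_layer(1)                # dilation rate 1
        self.f2 = fc_layer(2)                # dilation rate 2
        mid = max(channels // r, 4)          # compressed channel dim c/r
        self.squeeze = nn.Sequential(nn.Linear(channels, mid),
                                     nn.ReLU(inplace=True))
        self.excite = nn.Linear(mid, 2 * channels)  # per-branch logits

    def forward(self, x):
        u, v = self.f1(x), self.f2(x)        # different receptive fields
        s = (u + v).mean(dim=(2, 3))         # global average pooling -> (B, C)
        z = self.squeeze(s)                  # compact feature Z
        logits = self.excite(z).view(-1, 2, u.size(1))
        ab = torch.softmax(logits, dim=1)    # attention weights a, b
        a = ab[:, 0].unsqueeze(-1).unsqueeze(-1)
        b = ab[:, 1].unsqueeze(-1).unsqueeze(-1)
        return a * u + b * v                 # weighted multi-scale feature
```

With `padding=dilation` both dilated convolutions preserve the spatial resolution, so the two branch outputs can be summed and fused directly.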
(4) Apply global average pooling and batch normalization to the final ResNet output features, and compute the local semantic consistency loss. The local semantic consistency loss $\mathcal{L}_{lsc}$ is computed on the pooled and normalized features, where its two weighting coefficients are hyperparameters.
The invention achieves good performance on the two mainstream cross-modal pedestrian re-identification datasets SYSU-MM01 and RegDB; the comparative experimental results are shown in Table 1.
Table 1 Comparison of the accuracy of the proposed method with other cross-modal pedestrian re-identification methods
The embodiment of the invention also provides a cross-modal pedestrian re-identification system based on auxiliary modality enhancement and multi-scale feature fusion, comprising:
A preprocessing module: for acquiring the original images, dividing them into a training set, a validation set, and a test set, and preprocessing the visible-light and infrared images in the training set;
An auxiliary modality enhancement module: added to a backbone network that uses ResNet;
A multi-scale feature fusion module: for feeding the features output by the auxiliary modality enhancement module into ResNet for feature extraction and fusion and computing the cross-modal instance aggregation loss, where a multi-scale feature fusion module is added after the third and fourth residual blocks of ResNet;
A local semantic consistency module: for applying global average pooling and batch normalization to the final ResNet output features and computing the local semantic consistency loss.
The embodiment of the invention also provides a device comprising a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements any one of the above cross-modal pedestrian re-identification methods based on auxiliary modality enhancement and multi-scale feature fusion.
The embodiment of the invention also provides a storage medium storing a computer program designed to implement, when run, any one of the above cross-modal pedestrian re-identification methods based on auxiliary modality enhancement and multi-scale feature fusion.