Cross-view gait recognition method and system based on feature fusion
Technical Field
The invention belongs to the technical field of computer vision, and relates to a method and a system for recognizing gait in video data across viewing angles, in particular to a method and a system for recognizing gait in video data across viewing angles based on feature fusion.
Background Art
In recent years, large numbers of surveillance cameras have been deployed in public places, and large-scale video surveillance systems have been built to safeguard public safety. Cross-camera pedestrian search is an important means of video investigation. Although pedestrian re-identification can achieve good performance on standard data sets, it depends too heavily on the appearance of pedestrians, such as the color of the clothes they wear, so such retrieval techniques fail when a suspect deliberately disguises his or her identity by changing clothes. Gait, as a behavioral biometric, captures both the appearance of an individual and the dynamics of walking; compared with other biometrics, the gait of a pedestrian can be captured at a distance and is difficult to imitate or disguise. Gait recognition therefore has great application prospects and practical value in fields such as intelligent surveillance and urban security.
Existing gait recognition research generally falls into two categories: template-based methods and sequence-based methods. The recognition process of the former mainly comprises two steps, template generation and feature matching: the human body contour of each frame in a video is usually obtained by background subtraction, the ordered contour images are aggregated into a frame-level gait template, a gait representation is extracted by a machine learning method, and finally the similarity between different template features is measured by a distance metric and classified. The latter first generates a contour map of each frame in the same way, and then directly takes the series of contour maps as input to extract features of the gait sequence. The former has the advantage of simplicity, but easily loses temporal information, is sensitive to changes in real scenes, and has poor stability; the latter must preserve the frame order of the sequence, which limits the flexibility of recognition and incurs a high computational cost.
In real scenes, the viewing angle of a pedestrian often changes: when pedestrians enter the camera area from different directions, the captured postures differ and so do the extracted gait features, which poses a challenge to gait recognition.
Disclosure of Invention
In order to solve the above problems, the invention provides a cross-perspective gait recognition method and system based on feature fusion. Gait information of different granularities is extracted by a multi-scale feature fusion module; a multi-branch learning scheme is then adopted, in which a global branch extracts gait contour information and a local branch extracts gait detail information, and the features of the different branches are gradually fused as the network deepens, so that complementary information is extracted. Finally, in the feature mapping stage, a generalized average pooling layer replaces the traditional spatial pooling layer in the temporal aggregator to obtain a more salient feature representation.
The technical scheme adopted by the method of the invention is as follows: a cross-perspective gait recognition method based on feature fusion, comprising the following steps:
step 1: constructing a cross-perspective gait recognition model based on feature fusion;
the model comprises a multi-scale feature fusion module, a global feature extraction module, a local feature extraction module, a global and local feature fusion module and a feature mapping module;
the multi-scale feature fusion module is formed by splicing three parallel convolution branches in the channel dimension, each convolution branch comprising a convolution layer and a pooling layer; the three parallel convolution layers are a 1 × 1 convolution layer, a 3 × 3 convolution layer and a 5 × 5 convolution layer, respectively;
the global feature extraction module comprises two stages of feature extraction; the first stage is composed of two 3 × 3 convolutional layers and a 2 × 2 max pooling layer, where the first convolutional layer transforms the feature from 96 channels to 128 channels and the second convolutional layer keeps the number of channels unchanged; the second stage is composed of two 3 × 3 convolutional layers, where the first layer transforms the feature from 128 channels to 256 channels and the second layer keeps the number of channels unchanged;
the local feature extraction module comprises an upper branch and a lower branch; in the upper branch, the features first pass through two parallel 3 × 3 convolutional layers that convert the feature channels from 96 to 128, the two resulting parallel features are then spliced in the height dimension, and the result passes through one 3 × 3 convolutional layer that converts the channels from 128 to 256; the lower branch comprises two stages of feature extraction: the first stage is composed of two focusing convolution layers with a parameter of 4 and a 2 × 2 max pooling layer, where the first focusing convolution layer converts the features from 96 channels to 128 channels and the second keeps the number of channels unchanged, and the second stage is composed of two focusing convolution layers with a parameter of 8, where the first converts the features from 128 channels to 256 channels and the second keeps the number of channels unchanged;
the global and local feature fusion module performs three fusions of global and local features: the first fuses the local features extracted by the first stage of the lower branch of the local feature extraction module into the features extracted by the first stage of the global feature extraction module, the second fuses the local features extracted by the second stage of the lower branch of the local feature extraction module into the features extracted by the second stage of the global feature extraction module, and the third fuses the features output by the second stage of the global feature extraction module into the upper branch of the local feature extraction module;
the feature mapping module comprises a horizontal feature mapping module and a temporal enhancement module; the horizontal feature mapping module is composed of a one-dimensional global average pooling layer and a one-dimensional global max pooling layer, and the temporal enhancement module introduces a generalized average pooling layer whose behavior lies between average pooling and max pooling;
step 2: aiming at a pedestrian image of a sequence to be detected, extracting gait feature information with different granularities by using a multi-scale feature fusion module to obtain a feature map;
step 3: extracting complete gait feature information from the feature map obtained in step 2 by using the global feature extraction module, the local feature extraction module and the global and local feature fusion module;
step 4: mapping the complete gait feature information obtained in step 3 to a high-dimensional space by using the feature mapping module;
step 5: calculating the similarity between different features through the Euclidean distance to finally obtain the pedestrian identity of the sequence to be detected.
The technical scheme adopted by the system of the invention is as follows: a cross-perspective gait recognition system based on feature fusion comprises the following modules:
the module 1 is used for constructing a cross-perspective gait recognition model based on feature fusion;
the model comprises a multi-scale feature fusion module, a global feature extraction module, a local feature extraction module, a global and local feature fusion module and a feature mapping module;
the multi-scale feature fusion module is formed by splicing three parallel convolution branches in the channel dimension, each convolution branch comprising a convolution layer and a pooling layer; the three parallel convolution layers are a 1 × 1 convolution layer, a 3 × 3 convolution layer and a 5 × 5 convolution layer, respectively;
the global feature extraction module comprises two stages of feature extraction; the first stage is composed of two 3 × 3 convolutional layers and a 2 × 2 max pooling layer, where the first convolutional layer transforms the feature from 96 channels to 128 channels and the second convolutional layer keeps the number of channels unchanged; the second stage is composed of two 3 × 3 convolutional layers, where the first layer transforms the feature from 128 channels to 256 channels and the second layer keeps the number of channels unchanged;
the local feature extraction module comprises an upper branch and a lower branch; in the upper branch, the features first pass through two parallel 3 × 3 convolutional layers that convert the feature channels from 96 to 128, the two resulting parallel features are then spliced in the height dimension, and the result passes through one 3 × 3 convolutional layer that converts the channels from 128 to 256; the lower branch comprises two stages of feature extraction: the first stage is composed of two focusing convolution layers with a parameter of 4 and a 2 × 2 max pooling layer, where the first focusing convolution layer converts the features from 96 channels to 128 channels and the second keeps the number of channels unchanged, and the second stage is composed of two focusing convolution layers with a parameter of 8, where the first converts the features from 128 channels to 256 channels and the second keeps the number of channels unchanged;
the global and local feature fusion module performs three fusions of global and local features: the first fuses the local features extracted by the first stage of the lower branch of the local feature extraction module into the features extracted by the first stage of the global feature extraction module, the second fuses the local features extracted by the second stage of the lower branch of the local feature extraction module into the features extracted by the second stage of the global feature extraction module, and the third fuses the features output by the second stage of the global feature extraction module into the upper branch of the local feature extraction module;
the feature mapping module comprises a horizontal feature mapping module and a temporal enhancement module; the horizontal feature mapping module is composed of a one-dimensional global average pooling layer and a one-dimensional global max pooling layer, and the temporal enhancement module introduces a generalized average pooling layer whose behavior lies between average pooling and max pooling;
the module 2 is used for extracting gait feature information with different granularities by using a multi-scale feature fusion module aiming at a pedestrian image of a sequence to be detected to obtain a feature map;
a module 3, which is used for extracting complete gait feature information from the feature map obtained in the module 2 by using a global feature extraction module, a local feature extraction module and a global and local feature fusion module;
a module 4, configured to map the complete gait feature information obtained in the module 3 to a high-dimensional space by using a feature mapping module;
and the module 5 is used for calculating the similarity between different features through the Euclidean distance to finally obtain the pedestrian identity of the sequence to be detected.
Compared with existing cross-perspective gait recognition schemes, the invention can extract more complete global and local feature information from a pedestrian gait sequence and improve gait recognition accuracy.
Drawings
FIG. 1: structure diagram of the cross-perspective gait recognition model based on feature fusion according to an embodiment of the invention.
Detailed Description
In order to facilitate understanding and implementation of the present invention by those of ordinary skill in the art, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It is to be understood that the embodiments described herein are only for illustration and explanation and are not to be construed as limiting the present invention.
Referring to fig. 1, the cross-perspective gait recognition method based on feature fusion provided by the invention comprises the following steps:
step 1: constructing a cross-perspective gait recognition model based on feature fusion;
in this embodiment, the model includes a multi-scale feature fusion module, a global feature extraction module, a local feature extraction module, a global and local feature fusion module, and a feature mapping module;
in this embodiment, the multi-scale feature fusion module is formed by splicing three parallel convolution branches in the channel dimension, each convolution branch comprising a convolution layer and a pooling layer; the three parallel convolution layers are a 1 × 1 convolution layer, a 3 × 3 convolution layer and a 5 × 5 convolution layer, respectively (a sketch of this module is given after the module descriptions below);
in this embodiment, the global feature extraction module comprises two stages of feature extraction; the first stage is composed of two 3 × 3 convolutional layers and a 2 × 2 max pooling layer, where the first convolutional layer transforms the feature from 96 channels to 128 channels and the second convolutional layer keeps the number of channels unchanged; the second stage is composed of two 3 × 3 convolutional layers, where the first layer transforms the feature from 128 channels to 256 channels and the second layer keeps the number of channels unchanged;
in this embodiment, the local feature extraction module comprises an upper branch and a lower branch; in the upper branch, the features first pass through two parallel 3 × 3 convolutional layers that convert the feature channels from 96 to 128, the two resulting parallel features are then spliced in the height dimension, and the result passes through one 3 × 3 convolutional layer that converts the channels from 128 to 256; the lower branch comprises two stages of feature extraction: the first stage is composed of two focusing convolution layers with a parameter of 4 and a 2 × 2 max pooling layer, where the first focusing convolution layer converts the features from 96 channels to 128 channels and the second keeps the number of channels unchanged, and the second stage is composed of two focusing convolution layers with a parameter of 8, where the first converts the features from 128 channels to 256 channels and the second keeps the number of channels unchanged (a sketch of this focusing convolution is likewise given after the module descriptions);
in this embodiment, the global and local feature fusion module performs three fusions of global and local features: the first fuses the local features extracted by the first stage of the lower branch of the local feature extraction module into the features extracted by the first stage of the global feature extraction module, the second fuses the local features extracted by the second stage of the lower branch of the local feature extraction module into the features extracted by the second stage of the global feature extraction module, and the third fuses the features output by the second stage of the global feature extraction module into the upper branch of the local feature extraction module;
in this embodiment, the feature mapping module comprises a horizontal feature mapping module and a temporal enhancement module; the horizontal feature mapping module is composed of a one-dimensional global average pooling layer and a one-dimensional global max pooling layer, and the temporal enhancement module introduces a special generalized average pooling layer with a learnable parameter, whose behavior lies between average pooling and max pooling;
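For illustration, the following is a minimal PyTorch sketch of the multi-scale feature fusion module described above: three parallel branches with 1 × 1, 3 × 3 and 5 × 5 convolutions, each followed by a pooling layer, whose outputs are spliced in the channel dimension. The per-branch channel width of 32 (giving the 96-channel feature consumed by the later modules), the 2 × 2 max pooling and the 64 × 44 input size are assumptions made for the sketch rather than details fixed by the embodiment.

```python
import torch
import torch.nn as nn


class MultiScaleFusion(nn.Module):
    """Three parallel convolution branches (1x1, 3x3, 5x5), each followed by a
    pooling layer, spliced in the channel dimension."""

    def __init__(self, in_channels=1, branch_channels=32):
        super().__init__()

        def branch(kernel_size):
            return nn.Sequential(
                nn.Conv2d(in_channels, branch_channels, kernel_size,
                          padding=kernel_size // 2),
                nn.MaxPool2d(2),  # per-branch pooling (type and stride assumed)
            )

        self.branch1 = branch(1)
        self.branch3 = branch(3)
        self.branch5 = branch(5)

    def forward(self, x):
        # x: (N, 1, 64, 44) silhouette frames
        return torch.cat(
            [self.branch1(x), self.branch3(x), self.branch5(x)], dim=1
        )  # (N, 96, 32, 22): 3 x 32 channels spliced together


if __name__ == "__main__":
    msf = MultiScaleFusion()
    print(msf(torch.randn(8, 1, 64, 44)).shape)  # torch.Size([8, 96, 32, 22])
```

The embodiment does not spell out the internal structure of the focusing convolution layer. One common reading, assumed in the sketch below, is that the parameter (4 or 8) is the number of horizontal strips into which the feature map is split, with the same convolution applied to each strip, in the manner of the focal convolutions used in global-local gait networks; the class name FocalConv2d and the tensor sizes are illustrative only.

```python
import torch
import torch.nn as nn


class FocalConv2d(nn.Module):
    """Splits the feature map into `parts` horizontal strips and applies the
    same convolution to every strip (an assumed reading of the focusing
    convolution layer with a parameter of 4 or 8)."""

    def __init__(self, in_channels, out_channels, kernel_size=3, parts=4):
        super().__init__()
        self.parts = parts
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):
        # x: (N, C, H, W); split along the height dimension into `parts` strips
        strips = torch.chunk(x, self.parts, dim=2)
        # shared convolution on each strip, then re-assemble along the height
        return torch.cat([self.conv(s) for s in strips], dim=2)


if __name__ == "__main__":
    layer = FocalConv2d(96, 128, parts=4)           # first stage of the lower branch
    print(layer(torch.randn(8, 96, 32, 22)).shape)  # torch.Size([8, 128, 32, 22])
```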
step 2: aiming at a pedestrian image of a sequence to be detected, extracting gait feature information with different granularities by using a multi-scale feature fusion module to obtain a feature map;
in this embodiment, the specific implementation of step 2 includes the following substeps:
step 2.1: determining the center point of each gait silhouette image of the pedestrian, aligning the images according to the center point, and cropping the image edges to obtain a gait image sequence with a size of 64 × 44 pixels;
step 2.2: inputting the gait sequence into three different parallel branches, namely 1 × 1 convolution, 3 × 3 convolution and 5 × 5 convolution operations; after passing through the different branches, the image sequence yields feature maps of different granularities;
step 2.3: splicing the obtained feature maps of different granularities in the channel dimension;
assume that the input is
$X \in \mathbb{R}^{c \times h \times w}$, where $c$ denotes the number of channels and $h$ and $w$ denote the height and width of each frame image. The feature spliced in the channel dimension is then

$$Y = \mathrm{cat}_C\big(F_{1\times 1}(X),\ F_{3\times 3}(X),\ F_{5\times 5}(X)\big)$$

where $F_{1\times 1}(X)$, $F_{3\times 3}(X)$ and $F_{5\times 5}(X)$ denote two-dimensional convolution operations with kernel sizes of 1, 3 and 5, respectively, and $\mathrm{cat}_C$ denotes the splicing operation in the channel dimension.
Step 2.4: the spliced feature map is passed through a convolution layer and a pooling layer, so that the features of different receptive fields at the same level are transmitted to the next layer.
step 3: extracting complete gait feature information from the feature map obtained in step 2 by using the global feature extraction module, the local feature extraction module and the global and local feature fusion module;
in this embodiment, the specific implementation of step 3 includes the following substeps:
step 3.1: extracting global features of the sequence from the feature map obtained in the step 2 by using a common convolutional layer and a pooling layer;
assume that the input is
$X \in \mathbb{R}^{c \times h \times w}$, where $c$ denotes the number of channels and $h$ and $w$ denote the height and width of each frame image; the global feature is then

$$Y_{\mathrm{global}} = F_{3\times 3}(X)$$

where $F_{3\times 3}$ denotes a two-dimensional convolution operation with a kernel size of 3.
step 3.2: dividing the feature map obtained in step 2 into several equal blocks along the horizontal direction, mapping each block through the same convolution kernel, and splicing the obtained feature maps in the horizontal dimension to obtain the complete local features;
assume that the input is
$X \in \mathbb{R}^{c \times h \times w}$, where $c$ denotes the number of channels and $h$ and $w$ denote the height and width of each frame image. The input feature is divided into $n$ parts along the horizontal direction, yielding the local feature maps $X_1, X_2, \ldots, X_n$, where $X_i$ denotes the $i$-th local feature map in the horizontal direction; the complete local feature output is then

$$Y_{\mathrm{local}} = \mathrm{cat}_H\big(F_{3\times 3}(X_1),\ F_{3\times 3}(X_2),\ \ldots,\ F_{3\times 3}(X_n)\big)$$

where $F_{3\times 3}$ denotes a two-dimensional convolution operation with a kernel size of 3, and $\mathrm{cat}_H$ denotes the splicing operation in the horizontal dimension.
Step 3.3: fusing the obtained global features and local features in the channel dimension, as illustrated in the sketch below.
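Following the formulas above, the sketch below gives a minimal PyTorch rendering of step 3: the global branch applies a 3 × 3 convolution to the whole feature map, the local branch divides the map into n horizontal blocks, applies the same 3 × 3 convolution to each block and re-assembles them in the horizontal dimension (cat_H), and the two results are fused in the channel dimension (cat_C). The channel widths, the value n = 4 and the omission of the pooling layers are simplifications made for illustration.

```python
import torch
import torch.nn as nn

conv_global = nn.Conv2d(96, 128, 3, padding=1)  # F_3x3 for the global branch
conv_local = nn.Conv2d(96, 128, 3, padding=1)   # shared F_3x3 for the local blocks


def step3(x, n_parts=4):
    # x: (N, 96, H, W) feature map produced by step 2
    y_global = conv_global(x)                                    # (N, 128, H, W)
    blocks = torch.chunk(x, n_parts, dim=2)                      # X_1 ... X_n
    y_local = torch.cat([conv_local(b) for b in blocks], dim=2)  # cat_H
    return torch.cat([y_global, y_local], dim=1)                 # cat_C fusion


if __name__ == "__main__":
    print(step3(torch.randn(8, 96, 32, 22)).shape)  # torch.Size([8, 256, 32, 22])
```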
step 4: mapping the complete gait feature information obtained in step 3 to a high-dimensional space by using the feature mapping module;
in this embodiment, the specific implementation of step 4 includes the following sub-steps:
step 4.1: obtaining a micromotion feature vector sequence by one-dimensional generalized average pooling;
in the present embodiment, it is assumed that the input characteristics of this process are
$Y \in \mathbb{R}^{n \times c \times h}$, where $n$ denotes the number of frames in the sequence, $c$ denotes the number of channels, and $h$ denotes the height of each frame feature map; the output feature $Y_{\mathrm{out}}$ is then

$$Y_{\mathrm{out}} = \left(\frac{1}{h}\sum_{i=1}^{h} Y_i^{\,p}\right)^{1/p}$$

where $Y_i$ denotes the slice of $Y$ at the $i$-th position of the pooled dimension, and $p$ is a learnable parameter: when $p = 1$ the operation is equivalent to average pooling, and as $p \to \infty$ it approaches max pooling (a code sketch of this pooling is given after step 4.3).
Step 4.2: a channel attention mechanism is introduced to weight the feature vector of each moment, and a one-dimensional convolution layer is adopted to weight the micro motion component;
step 4.3: the resulting features are mapped to a high-dimensional space using an independent fully connected layer.
step 5: calculating the similarity between different features through the Euclidean distance to finally obtain the pedestrian identity of the sequence to be detected.
In the training process, gait silhouette sequences of the same pedestrian at different viewing angles are input into the network in turn; by combining data from multiple viewing angles, the model learns the gait features of the same pedestrian at different viewing angles, the network is assigned appropriate weights through back propagation, and the trained network model can map gait features at different viewing angles into a unified discriminative subspace. In the model testing process, the pedestrian gait features at different viewing angles are mapped into the same discriminative subspace, the similarity between different features is calculated through the Euclidean distance, and the identity of the pedestrian of the sequence to be tested is finally obtained.
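As an illustration of the test-time matching described above, the following minimal sketch represents gallery and probe sequences by embeddings in the same subspace (random tensors here), computes pairwise Euclidean distances, and assigns each probe sequence the identity of its nearest gallery sequence; the tensor shapes, the nearest-neighbour assignment and all names are assumptions made for illustration.

```python
import torch

# Hypothetical embeddings: 100 gallery sequences and 5 probe sequences,
# each represented by 16 part features of dimension 512.
gallery = torch.randn(100, 16, 512)
gallery_ids = torch.randint(0, 50, (100,))  # pedestrian identity of each gallery sequence
probe = torch.randn(5, 16, 512)             # sequences to be identified

# Flatten the part dimension and compute pairwise Euclidean distances.
dists = torch.cdist(probe.flatten(1), gallery.flatten(1))  # (5, 100)

# Each probe sequence takes the identity of its nearest gallery sequence.
pred_ids = gallery_ids[dists.argmin(dim=1)]
print(pred_ids)
```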
It should be understood that parts of the specification not set forth in detail are well within the prior art.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.