Disclosure of Invention
Aiming at the defects of the prior art, the aim of the present disclosure is to provide a multi-target tracking method based on dual-branch feature enhancement and multi-level track association, which can solve the problem of target tracking failure in complex scenes involving target occlusion, blurring and the like, so as to improve the tracking performance for a plurality of targets to be tracked in such scenes.
In order to achieve the above object, the present disclosure provides the following technical solutions:
a multi-target tracking method based on dual-branch feature enhancement and multi-level track association comprises the following steps:
s100: acquiring an input image containing a plurality of targets to be tracked;
s200: constructing a multi-target tracking model and training to obtain a trained multi-target tracking model;
the multi-target tracking model is used for relieving excessive competition between the detection and tracking tasks by utilizing a dual-branch feature learning network, and for obtaining a more accurate offset vector by introducing an association matrix (AM) for prediction, so as to reduce the number of identity switches of the targets to be tracked in the input image;
s300: and inputting the input image into a trained multi-target tracking model to realize simultaneous tracking of a plurality of targets to be tracked in the input image.
Preferably, in step S200, the multi-target tracking model is trained by the following method:
s201: acquiring a data set, and dividing the data set into a training set and a testing set;
s202: setting training parameters, training the model by using a training set, and finishing the model training when the training reaches the set number of rounds;
s203: and testing the trained model by using the test set, wherein in the test process the multi-target tracking accuracy and the IDF1 score are used as evaluation indexes; when the tracking accuracy reaches 66.1% and the IDF1 score reaches 64.2%, the model test is passed.
Preferably, in step S203, the multi-target tracking accuracy is expressed as follows:

MOTA = 1 − (FN + FP + IDS) / GT

wherein FN represents the number of false negatives, FP represents the number of false positives, IDS represents the number of identity switches, and GT is the ground truth, representing the number of marked targets in the scene.
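As a quick sanity check, the MOTA definition (one minus the ratio of the three error counts to the ground-truth count) can be computed directly; the counts below are illustrative only, not taken from the disclosure:

```python
def mota(fn, fp, ids, gt):
    """Multi-Object Tracking Accuracy: 1 minus the ratio of the three
    error sources (false negatives, false positives, identity switches)
    to the total number of ground-truth targets."""
    return 1.0 - (fn + fp + ids) / gt

# Illustrative counts only (not from the disclosure):
score = mota(fn=100, fp=50, ids=10, gt=1000)  # 1 - 160/1000 = 0.84
```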
Preferably, in step S203, the IDF1 score is expressed as:

IDF1 = 2 × IDTP / (2 × IDTP + IDFP + IDFN)

wherein IDTP is the number of true positive IDs, representing the detection targets correctly assigned in the whole video; IDFN is the number of false negative IDs, representing the detection targets missed in assignment in the whole video; IDFP is the number of false positive IDs, representing the detection targets falsely assigned in the whole video.
The present disclosure also provides a multi-target tracking device based on dual-branch feature enhancement and multi-level track association, comprising:
the acquisition module is used for acquiring an input image containing a plurality of targets to be tracked;
the model construction and training module is used for constructing a multi-target tracking model and training the multi-target tracking model to obtain a trained multi-target tracking model;
the multi-target tracking model is used for relieving excessive competition between the detection and tracking tasks by utilizing a dual-branch feature learning network, and for obtaining a more accurate offset vector by introducing an association matrix (AM) for prediction, so as to reduce the number of identity switches of the targets to be tracked in the input image;
and the tracking module is used for inputting the input image into the trained multi-target tracking model so as to realize simultaneous tracking of a plurality of targets to be tracked in the input image.
Preferably, the model building and training module includes:
a data dividing sub-module, used for dividing the data set for model training into a training set and a test set;
the training sub-module is used for training the model by utilizing the training set;
and the test sub-module is used for testing the trained model by using the test set.
The present disclosure also provides a computer storage medium storing computer-executable instructions for performing a method as described in any one of the preceding claims.
The present disclosure also provides an electronic device, including:
a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein,
the processor, when executing the program, implements a method as described in any of the preceding.
Compared with the prior art, the beneficial effects that this disclosure brought are:
1. according to the method, a dual-branch feature learning network is adopted to learn the task-specific and shared features of the detection and tracking tasks, so that excessive competition between the two tasks is relieved, and sufficient target feature information can be extracted;
2. the method introduces an association matrix, and can reduce the number of identity switches by predicting the offset vector with more temporal information;
3. the method adopts a multi-level track association strategy that associates high-score and low-score detection boxes with tracks in different matching modes, so that the number of missing tracks can be reduced;
4. based on the improvements in the above three aspects, the present disclosure can improve the tracking performance for multiple targets in complex scenes.
Detailed Description
Specific embodiments of the present disclosure will be described in detail below with reference to fig. 1 to 7 (b). While specific embodiments of the disclosure are shown in the drawings, it should be understood that the disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It should be noted that certain terms are used throughout the description and claims to refer to particular components. Those of skill in the art will understand that different names may be used for the same component. The specification and claims do not distinguish components by differences in name, but by differences in function. As used throughout the specification and claims, the terms "include" and "comprise" are used in an open-ended fashion and should thus be interpreted to mean "including, but not limited to". The description hereinafter sets forth preferred embodiments for carrying out the present disclosure, but is not intended to limit the scope of the disclosure, which is defined by the appended claims.
For the purposes of promoting an understanding of the embodiments of the disclosure, reference will now be made to the embodiments illustrated in the drawings and to specific examples, without intending to limit the embodiments of the disclosure.
In one embodiment, as shown in fig. 1, the present disclosure provides a multi-target tracking method based on dual-branch feature enhancement and multi-level track association, comprising the steps of:
s100: acquiring an input image containing a plurality of targets to be tracked;
s200: constructing a multi-target tracking model and training;
s300: and inputting the input image into the multi-target tracking model to realize simultaneous tracking of a plurality of targets to be tracked in the input image.
In another embodiment, as shown in fig. 2, the multi-target tracking model includes an input layer, a feature extraction layer, a feature enhancement layer, parallel detection and tracking layers, an association layer, and an output layer. The present embodiment describes each of these in detail below.
1. Feature extraction layer:
the feature extraction layer adopts DLASeg (DLASeg is a segmented network added with deformation convolution (Deformable Convolution) on the basis of DLA (Deep Layer Aggregation)) as a main network, and the input image generates basic features through the main network(H F Representing the height, W, of the input image after 4 times downsampling F Representing the width of the input image after 4 times downsampling), where H F =H/4,W F =W/4。
2. Feature enhancement layer:
the feature enhancement layer learns features for the detection layer to perform detection tasks and for the tracking layer to perform tracking tasks by adopting a Dual-branch feature learning network (DFL, dual-branch Feature Learning) so as to relieve the excessive competition problem of the two tasks and extract sufficient target feature information. The DFL network mainly realizes feature enhancement by learning the specificity and relativity of two tasks, and the structure is shown in fig. 3:
the DFL network comprises two branches, which first employ different pooling functions (the first branch employing average pooling and the second branch employing maximum pooling) for resolution reduction; then, the two branches respectively generate a characteristic diagram A of the respective branches through a convolution combination (3×3 convolution+instance Norm (a normalization method) +leak ReLU (activation function)) without sharing parameters 1 And A 2 And carrying out interactive calculation between the feature graphs, and obtaining an output result with the initial feature graphs through operation (matrix multiplication and addition). Specifically, the DFL network first obtains the shared feature from the backbone networkIn order to reduce the calculation amount brought by the matrix operation of the feature map, the shared feature is first required to be +.>Pooling operation is carried out, and different pooling modes are needed for different tasks, wherein the characteristics obtained by average pooling (Avgpool) are more sensitive to background information, and can be used for learning detection characteristics of a detection layer; the features obtained by maximum pooling (Maxpool) are more sensitive to texture information and can be used for learning tracking features of a tracking layer. Shared features->After pooling, two features containing local information are obtained, namely the detection feature +.>And tracking feature->Next, f 1 、f 2 Encoding by 3 x 3 convolutional layers, respectively, to generate feature map a for detection and tracking 1 And A 2 And remodelled (using Reshape function) to a size of c×h' F W′ F Of (2) two-dimensional tensor M 1 And M 2 . Then, for M 1 And M 2 And its corresponding transpose tensor->And->Matrix multiplication (Matrix Multiplication) and normalization by softmax function are performed to calculate a task-specific response map s for each task k ∈R C×C The calculation method is as follows:
S_k^{i,j} = exp(M_k^i · M_k^j) / Σ_{l=1}^{C} exp(M_k^i · M_k^l), k ∈ {1, 2}

wherein · denotes the dot product of two vectors; M_k^i, M_k^j and M_k^l respectively denote rows i, j and l of M_k; S_k^{i,j}, the value at position (i, j) of S_k, represents the correlation between the i-th channel and the j-th channel in the feature map; and C denotes the number of feature channels, with C = 64.
Next, matrix multiplications between M_1 and M_2^T and between M_2 and M_1^T are performed, where T denotes the transpose operation, to learn the correlation between the different tasks, and the results are normalized to obtain the cross-task correlation response maps R_k ∈ R^{C×C}, as follows:

R_k^{i,j} = exp(M_k^i · M_h^j) / Σ_{l=1}^{C} exp(M_k^i · M_h^l), (k, h) ∈ {(1, 2), (2, 1)}

wherein R_1^{i,j} represents the correlation of the i-th channel of task 1 to the j-th channel of task 2, R_2^{i,j} represents the correlation of the i-th channel of task 2 to the j-th channel of task 1, and (k, h) denotes the tensor combination, e.g., (1, 2) denotes the tensors M_1 and M_2. The larger the value of R_k^{i,j}, the more strongly the characteristic information of that channel is jointly attended to by the two tasks.
Finally, the task-specific response map and the correlation response map are fused through a trainable parameter λ_k to obtain the feature enhancement response map W_k ∈ R^{C×C}, calculated as follows:

W_k = λ_k × S_k + (1 − λ_k) × R_k, k ∈ {1, 2}

wherein λ_k is a trainable parameter: λ_1 is the trainable parameter of response map W_1 and λ_2 is the trainable parameter of response map W_2.
The enhancement response maps W_1 and W_2 corresponding to the two tasks are matrix-multiplied with the reshaped input feature to obtain the enhanced feature of each task; each enhanced feature is then reshaped into a three-dimensional tensor of the same shape as the input feature f and fused with f to prevent information loss. Two features are finally obtained as the inputs of the detection branch and the tracking branch respectively.
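The enhancement steps above (pooling, reshaping, the S_k, R_k and W_k response maps, and the residual fusion with f) can be sketched in NumPy. This is a minimal sketch under stated assumptions: the learned 3×3 convolution + Instance Norm + Leaky ReLU encoders are omitted and the pooled features stand in for A_1 and A_2, while λ_1 = λ_2 = 0.5 stands in for the trainable fusion weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dfl_enhance(f, lam1=0.5, lam2=0.5):
    """Dual-branch feature enhancement on a shared feature f of shape
    (C, H, W) with H, W even. Branch 1 (detection) uses 2x2 average
    pooling, branch 2 (tracking) uses 2x2 max pooling; the learned
    conv encoders of the DFL network are omitted in this sketch."""
    C, H, W = f.shape
    blocks = f.reshape(C, H // 2, 2, W // 2, 2)
    f1 = blocks.mean(axis=(2, 4))                  # detection feature (avg pool)
    f2 = blocks.max(axis=(2, 4))                   # tracking feature (max pool)
    M1, M2 = f1.reshape(C, -1), f2.reshape(C, -1)  # C x (H'W') tensors
    S1 = softmax(M1 @ M1.T)                        # task-specific maps, C x C
    S2 = softmax(M2 @ M2.T)
    R1 = softmax(M1 @ M2.T)                        # cross-task maps, C x C
    R2 = softmax(M2 @ M1.T)
    W1 = lam1 * S1 + (1 - lam1) * R1               # fused enhancement maps
    W2 = lam2 * S2 + (1 - lam2) * R2
    g = f.reshape(C, -1)
    det = (W1 @ g).reshape(C, H, W) + f            # channel re-weighting + residual
    trk = (W2 @ g).reshape(C, H, W) + f
    return det, trk

rng = np.random.default_rng(0)
det, trk = dfl_enhance(rng.random((8, 4, 4)))
```

The residual addition at the end mirrors the fusion with f described above, which prevents information loss in the enhanced features.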
3. Detection layer and tracking layer:
detection ofThe layers include three output heads, each head consisting of two convolutional layers and a Relu activation function located in both convolutional layers. Wherein the first convolution layer is a 3 x 3 convolution for increasing the number of characteristic channels from 64 to 256; the second convolution layer is a 1 x 1 convolution for reducing the number of characteristic channels from 256 to 1 or 2 (center point: H) F ×W F X 1; center point offset: h F ×W F X 2; width and height: h F ×W F ×2)。
The re-identification network (ReID) in the tracking layer consists of four convolutional layers. The first two layers form a depthwise separable convolution, namely a channel-by-channel (depthwise) convolution followed by a point-by-point 1×1 convolution, which raises the number of feature channels from 64 to 128 and is followed by BatchNorm2d normalization and a ReLU activation function. The third layer is a 3×3 convolution with the number of channels unchanged, again followed by BatchNorm2d normalization and a ReLU activation function. The fourth layer is a 1×1 convolution with the number of channels unchanged.
In order to utilize more temporal information, the present embodiment also introduces an AM matrix in the tracking layer. The AM matrix uses the extracted features to construct a similarity relation between two frames, so that a more accurate offset vector can be obtained by prediction and the number of identity switches of the targets to be tracked in the input image can be reduced. The AM matrix A is obtained by matrix multiplication of e_t (the features extracted by the ReID network from frame t) with the transpose of e_{t−1}, and represents the similarity between the images I_t and I_{t−1}, calculated as follows:

A = e_t · e_{t−1}^T

wherein T is the transpose symbol, and A_{i,j,m,n} represents the feature similarity between the target point (i, j) of the t-th frame image and the target point (m, n) of the (t−1)-th frame image. For the center point (i, j) of target x in the t-th frame, the corresponding two-dimensional association matrix A_{i,j} ∈ R^{H_F×W_F} can be obtained from the matrix A, representing the feature similarity between the target x and all points on the (t−1)-th frame image.
Next, the offset vector is determined through A_{i,j}: maximum pooling in the horizontal and vertical directions is applied to A_{i,j}, with pooling kernels of H_F×1 and 1×W_F respectively, to obtain the matrices for the two directions, which are then normalized by the softmax function to obtain two vectors V^x_{i,j} ∈ R^{W_F} and V^y_{i,j} ∈ R^{H_F}, respectively representing the probabilities of the target appearing at each horizontal position and each vertical position on the (t−1)-th frame. Offset templates for the two directions are defined according to the resolution of the output image, and the offset values for the target actually appearing at other positions are calculated as follows:
X_{i,j,n} = (n − j) × s, 1 ≤ n ≤ W_F
Y_{i,j,m} = (m − i) × s, 1 ≤ m ≤ H_F
wherein s denotes the downsampling magnification and is set to 4; X_{i,j,n} and Y_{i,j,m} respectively denote the offsets if the target appears at horizontal position n and vertical position m on the (t−1)-th frame. The final tracking offset is obtained as the dot product between the offset values and the probabilities of the target actually appearing at the corresponding positions:

O^x_{i,j} = X_{i,j} · V^x_{i,j}, O^y_{i,j} = Y_{i,j} · V^y_{i,j}

The offsets in the horizontal and vertical positions are learned through two channels respectively, and the final tracking offset O_{i,j} = (O^x_{i,j}, O^y_{i,j}) is used for the subsequent track association.
Most existing trackers filter out a detection box when its score is below a threshold. However, these low-score detection boxes may merely result from occlusion and the like, so simply filtering them out easily causes track loss. For this problem, the present embodiment introduces a multi-level track association strategy (MTA, Multi-level Trajectory Association) in the association layer to reduce track loss and thereby further improve the tracking performance in complex scenes. As shown in fig. 4, the MTA strategy divides the detection boxes into high-score boxes and low-score boxes according to a set threshold: the high-score boxes can provide accurate target feature information, so as to realize long-term association of targets, while the low-score boxes can be used to recover missing tracks. The two kinds of boxes are associated with the tracks in different matching modes.
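The box splitting at the heart of the MTA strategy can be sketched as follows; the 0.4 and 0.2 thresholds follow the confidence ranges reported later in the ablation study:

```python
def split_detections(dets, high_thr=0.4, low_thr=0.2):
    """Split detection boxes into high- and low-score sets as in the
    MTA strategy; boxes below low_thr are discarded outright. The
    thresholds follow the eta >= 0.4 and 0.2 <= eta < 0.4 ranges
    used in the ablation experiments."""
    high = [d for d in dets if d["score"] >= high_thr]
    low = [d for d in dets if low_thr <= d["score"] < high_thr]
    return high, low

high, low = split_detections([{"score": 0.9}, {"score": 0.3}, {"score": 0.1}])
# high -> [{"score": 0.9}], low -> [{"score": 0.3}]; the 0.1 box is dropped
```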
When the tracks are associated, the high-score detection boxes are first matched with the tracks of the previous frame by a simple greedy algorithm based on the offset vector, producing unmatched detection boxes, successfully matched tracks and unmatched tracks. Then, secondary matching is carried out between the unmatched detection boxes and the tracks of the previous frame by computing the cosine similarity between their features. If the similarity is lower than a threshold, a new track is created, so that re-association after target occlusion can be realized and the number of identity switches reduced. If the similarity is higher than the threshold, the match succeeds, the detection is added to the track, and the feature information f_i^t of the i-th track in the t-th frame is updated by the following formula:
f_i^t = ε × f̂_i^t + (1 − ε) × f_i^{t−1}

wherein f̂_i^t is the current frame image feature extracted by the ReID network, and ε is the weight of the current frame feature. An unmatched track may be caused by filtering out an occluded low-score detection box, so the low-score detection boxes and the unmatched tracks undergo secondary matching to recover the tracks lost in scenes such as occlusion.
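The track-feature update after a successful match amounts to an exponential moving average of the ReID features; ε = 0.9 below is an assumed value, as the disclosure does not fix the weight:

```python
import numpy as np

def update_track_feature(f_prev, f_cur, eps=0.9):
    """EMA update of a track's appearance feature after a successful
    match: f_i^t = eps * f_hat_i^t + (1 - eps) * f_i^{t-1}, with eps
    weighting the current-frame ReID feature. eps = 0.9 is an
    illustrative value, not specified by the disclosure."""
    return eps * np.asarray(f_cur) + (1.0 - eps) * np.asarray(f_prev)

updated = update_track_feature([0.0, 0.0], [1.0, 1.0])  # -> [0.9, 0.9]
```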
In another embodiment, the present disclosure trains the model using the MOT17 dataset, which consists of 7 video sequences for training and 7 sequences for testing. MOT17 provides bounding boxes generated by three different object detectors, namely DPM, Faster R-CNN and SDP.
In this embodiment, the first half of each video in the MOT17 dataset is used as the training set and the second half of each video as the test set. The model is specifically trained by the following method:
training the model with the training set while setting the training parameters: the batch size is set to 32, the initial learning rate to 1.25×10^{−4}, and the number of training epochs to 70. When the model has been trained for 70 epochs, training is complete;
and testing the trained model with the test set, adopting the multi-object tracking accuracy (MOTA) and the IDF1 score as evaluation indexes. MOTA reflects the overall performance of the tracker and is measured by evaluating three error sources, namely false negatives (FN), false positives (FP) and the number of identity switches (IDS), calculated as follows:

MOTA = 1 − (FN + FP + IDS) / GT

wherein GT is the Ground Truth, representing the number of marked targets in the scene.
The IDF1 score represents the association performance of the tracker, namely the ratio of correctly identified detections to the average of the number of ground-truth detections and the number of computed detections, calculated as follows:

IDF1 = 2 × IDTP / (2 × IDTP + IDFP + IDFN)
wherein IDTP is true positive ID, which indicates the number of correctly allocated detection targets in the whole video; the IDFN is a false negative ID, and represents the number of missed allocation of the detection targets in the whole video; IDFP is a false positive ID indicating the number of false assignments of detection targets in the entire video.
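The IDF1 definition reduces to a one-line helper; the counts below are illustrative only, not results from the disclosure:

```python
def idf1(idtp, idfp, idfn):
    """IDF1: ratio of correctly identified detections to the average
    of ground-truth and computed detection counts."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)

# Illustrative counts only:
score = idf1(idtp=800, idfp=100, idfn=150)  # 1600 / 1850
```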
As stated above, when the tracking accuracy reaches 66.1% and the IDF1 score reaches 64.2%, the model test is passed.
Next, the effectiveness of the model described in the present disclosure will be described in detail in connection with fig. 5 (a) to 7 (b) and tables 1 and 2.
First, the effectiveness of each module in the model is verified through ablation experiments, with the results shown in table 1:
TABLE 1
In table 1, the first row of data is the result of an ablation experiment using the baseline algorithm CenterTrack.
The second row of data is the ablation result after introducing the AM matrix. Compared with the first row, using the AM matrix for offset prediction improves the MOTA value by 1.0% and the IDF1 score by 4.4%, and reduces IDS from 528 to 369. The baseline algorithm CenterTrack predicts the offset vector from the center point of the current frame to the center point of the previous frame by regression learning, and such an offset vector does not make full use of temporal information. The AM matrix is formed from the similarity relation between adjacent frames and contains more temporal information, so the predicted offset vector is more accurate. Therefore, performing track association with the offset vector predicted by the AM matrix can greatly reduce the number of identity switches and improve the association capability (IDF1).
The third row of data is the ablation result after adopting the DFL network. Compared with the second row, MOTA is improved by 0.3% and IDF1 by 0.5%, improving the overall performance of the model. This is mainly because the two branches of the DFL network enhance the input shared feature separately to obtain the detection feature and the tracking feature, thereby alleviating the competition between detection and tracking.
The fourth row of data is the ablation result after adopting the MTA strategy on the basis of the AM matrix. Compared with the second row, MOTA is improved by 0.8% while the IDF1 score decreases by 0.8%; that is, the overall performance of the tracker improves but its association capability drops slightly. In the experiment, the detection boxes are divided by confidence η: high-score boxes satisfy η ≥ 0.4, and low-score boxes satisfy 0.2 ≤ η < 0.4. Since the MTA strategy retains a portion of the low-score boxes, keeping them increases the false positives (FP) and therefore lowers the IDF1 score. Compared with the first row, MT increases by 3.2% after adopting the MTA strategy, indicating a substantial decrease in missing tracks. MT denotes the ratio of targets for which at least 80% of the track is correctly tracked during video tracking.
The last row of data is the ablation result after adding the three modules AM, DFL and MTA simultaneously. Compared with the first row, MOTA and IDF1 are improved by 2.1% and 4.3% respectively, and IDS is reduced from 528 to 333, which proves that the model of the present disclosure can effectively improve the performance of multi-target tracking in complex scenes.
Furthermore, video scenes in the MOT17 test set are selected for qualitative analysis of the model of the present disclosure, comparing its tracking effect with the baseline algorithm CenterTrack. Figs. 5 (a) and 5 (b) respectively show partial visualization results of the baseline algorithm CenterTrack and of the model of the present disclosure on the MOT17-04 video sequence; from left to right the columns show before, during and after occlusion, and the lower right corner gives the frame number. In a long-range scene, targets are easily affected by partial occlusion; as can be seen from fig. 5 (a), targets No. 103 and No. 128 at the arrows undergo identity switches after being occluded and are assigned new track identifiers. As shown in fig. 5 (b), after introducing the DFL network, the AM matrix and the MTA strategy, the model maintains the original target IDs in the occlusion scene, improving the tracking performance of the multi-target tracker.
Figs. 6 (a) and 6 (b) respectively show the visualization results of the baseline algorithm CenterTrack and of the model of the present disclosure on MOT17-09. In a close-range scene, targets are susceptible to severe occlusion. As shown in fig. 6 (a), target No. 21 at the arrow is assigned a new ID, 29, after undergoing complete occlusion; as shown in fig. 6 (b), the model of the present disclosure realizes re-association after occlusion: when the target reappears after complete occlusion, cosine-distance matching is performed between the appearance feature extracted by the ReID network and the track, so as to cope with heavy-occlusion scenes and improve the tracking performance of the tracker.
Figs. 7 (a) and 7 (b) respectively show the visualization results of the baseline algorithm CenterTrack and of the model of the present disclosure on MOT17-11. In some scenarios, the detection box score of a target decreases as the occlusion level increases, and low-score detection boxes are typically filtered out by the detector, causing track loss. As shown in fig. 7 (a), the target under the baseline algorithm CenterTrack is not detected due to occlusion and cannot be tracked; as shown in fig. 7 (b), the MTA strategy of the model of the present disclosure retains part of the low-score detection boxes, so the target can be tracked continuously, reducing the track-missing phenomenon and improving the tracking performance of the tracker.
To further verify the effectiveness of the model of the present disclosure, 6 advanced MOT algorithms, namely CTracker, JDE, CenterTrack, QuasiDense, TransTrack and MOTR, are selected and compared on the MOT17 and MOT20 datasets, with the results shown in table 2.
TABLE 2
As can be seen from Table 2, the MOTA and IDF1 indexes of the model of the present disclosure reach 68.2% and 68.5% respectively on the MOT17 dataset, and 52.7% and 48.2% respectively on the MOT20 dataset, achieving the best tracking results among the 7 compared algorithms.
On MOT20, compared with the baseline algorithm CenterTrack, the model of the present disclosure increases the MOTA index by 1.4% and the IDF1 index by 7.9%; IDS decreases from 7731 to 3043, FP increases from 10080 to 13403, and FN decreases from 281757 to 274419. The results show that the model can effectively address insufficient target feature extraction, identity switching and track missing in dense scenes.
In another embodiment, the present disclosure further provides a multi-target tracking apparatus based on dual branch feature enhancement and multi-level trajectory correlation, the apparatus comprising:
the acquisition module is used for acquiring an input image containing a plurality of targets to be tracked;
the model construction and training module is used for constructing a multi-target tracking model and training the multi-target tracking model to obtain a trained multi-target tracking model;
the multi-target tracking model is used for relieving excessive competition between the detection and tracking tasks by utilizing a dual-branch feature learning network, and for obtaining a more accurate offset vector by introducing an association matrix (AM) for prediction, so as to reduce the number of identity switches of the targets to be tracked in the input image;
the tracking module is used for inputting the input image into the trained multi-target tracking model so as to realize simultaneous tracking of a plurality of targets to be tracked in the input image.
In another embodiment, the model building and training module comprises:
a data dividing sub-module, used for dividing the data set for model training into a training set and a test set;
the training sub-module is used for training the model by utilizing the training set;
and the test sub-module is used for testing the trained model by using the test set.
In another embodiment, the present disclosure also provides a computer storage medium storing computer-executable instructions for performing a method as set forth in any one of the preceding claims.
In another embodiment, the present disclosure further provides an electronic device, including:
a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein,
the processor, when executing the program, implements a method as described in any of the preceding.
The applicant of the present disclosure has described the embodiments of the present disclosure in detail with reference to the accompanying drawings of the specification, but it should be understood by those skilled in the art that the above embodiments are merely preferred examples of the present disclosure, which is not limited to the specific embodiments described above. The detailed description is intended to help the reader better understand the spirit of the disclosure, not to limit its scope; any modification or variation based on the spirit of the disclosure is intended to be included within the scope of the disclosure.