Background
The task of vehicle tracking is to predict the size and position of a vehicle in subsequent frames of a video sequence, given its size and position in an initial frame. Tracking based on correlation filtering has attracted considerable attention because of its real-time performance. In this approach, the filtering template is updated from the tracking result of the previous frame, and a response map is obtained by correlating the filtering template with the features extracted from the current frame; the position of the maximum response point on the response map is taken as the position of the vehicle target. To cope with appearance changes of the target during tracking, various feature descriptors have been designed, such as HOG features and SIFT features. With the rapid development of deep learning in the fields of target detection, image classification and image segmentation, applying a deep neural network as a feature extractor to vehicle tracking has become a recent trend.
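As a rough illustration of the correlation-filtering idea described above (a minimal sketch with illustrative sizes, not the method of the invention): the template is correlated with the search-region features in the Fourier domain, and the peak of the resulting response map gives the target position.

```python
import numpy as np

def response_map(template, search):
    # Cross-correlation via the FFT: conj(F(template)) * F(search);
    # the inverse transform gives the response map.
    return np.fft.ifft2(np.conj(np.fft.fft2(template))
                        * np.fft.fft2(search)).real

search = np.zeros((32, 32))
search[9:12, 13:16] = 1.0        # target pattern at rows 9-11, cols 13-15
template = np.zeros((32, 32))
template[0:3, 0:3] = 1.0         # same pattern anchored at the origin
R = response_map(template, search)
peak = np.unravel_index(np.argmax(R), R.shape)  # maximum response point
```

Here the peak lands at (9, 13), the circular shift that aligns the template with the planted target pattern.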
The patent document "A road vehicle tracking method based on multi-feature fusion" (patent application number: 201910793516.9, publication number: CN 110517291 A), filed by Nanjing University of Posts and Telecommunications, discloses a road vehicle tracking method based on multi-feature space fusion. First, a video is read and divided into image frames, the area where the vehicle target is located is selected, the input image frame is converted from the RGB color space to the HSV color space, and a color histogram is taken as the color feature; horizontal, vertical and diagonal edge features are computed by constructing an integral image to obtain Haar-like shape features. Then a target model and a candidate model are established in each of the vertical edge, horizontal edge, diagonal edge and color feature spaces, the similarity between the two models is measured with the Bhattacharyya coefficient, and the position of the candidate model most similar to the target model in the current frame is found iteratively by the mean-shift algorithm. Finally, the four candidate target positions found in the color, horizontal edge, vertical edge and diagonal edge feature spaces are fused by weighting to obtain the final position of the target. The disadvantage of this method is that, because it uses Haar-like shape features to describe the appearance of the vehicle, under illumination change, mutual occlusion between vehicles, or vehicle motion blur, the Haar-like features easily mistake an interfering object similar to the vehicle target for the target itself, and tracking fails.
In real-time tracking of vehicles under actual road conditions, mutual occlusion between vehicles is very common, so the robustness of this method cannot meet the requirements of vehicle tracking on real roads.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a vehicle tracking method based on target feature sensitivity and deep learning, which solves the problem of tracking failure caused by occlusion, illumination change and the like during vehicle tracking.
The idea for realizing the purpose of the invention is as follows: construct and train a discriminant connected network, extract features through a trained public network model, select the filters that are most sensitive to the vehicle target, and track the vehicle target using the discriminant connected network and the selected sensitive filters.
The method comprises the following specific steps:
step 1, constructing a discriminant connected network:
two identical sub-networks are built, each with a five-layer structure arranged in sequence: first convolution layer → first downsampling layer → second convolution layer → second downsampling layer → third convolution layer; the numbers of convolution kernels of the first, second and third convolution layers are set to 16, 32 and 1 in sequence, and the kernel sizes to 3 × 3, 3 × 3 and 1 × 1 in sequence; the filter sizes of the first and second downsampling layers are set to 2 × 2;
the two sub-networks are arranged in parallel, one above the other, and then connected to a cross-correlation layer XCorr to form the discriminant connected network; the loss function of the discriminant connected network is set to a contrastive loss function;
step 2, generating a training set:
randomly collecting at least 1000 pictures from a continuous video, wherein each picture comprises at least one target and marks the target; cutting the marked target in the picture into a 127 x 127 picture, and randomly cutting the background in the picture into a 127 x 127 picture;
combining the cut target picture and the cut background picture into a picture pair at random, wherein each picture pair at least comprises one target picture; if two pictures in the picture pair are the same target, setting the label of the picture pair to be 1; if the two pictures in the image pair are two different target pictures or a target picture and a background picture, setting the label of the picture pair to be 0; all the picture pairs and the labels thereof form a training set;
step 3, training a discriminant connected network:
inputting the training set into the discriminant connected network, and iteratively updating the network weights with the Adam optimization algorithm until the contrastive loss function converges, obtaining the trained discriminant connected network;
step 4, calculating a filtering template:
first, in the first frame of the tracking video, a rectangular frame is drawn tightly around the tracked vehicle target, and all pixel points within the rectangular frame are extracted to form the real target picture; all pixel points within a rectangular frame centered on the same center point, with the width and height each expanded by a factor of two, form the initial filtering sample picture;
second, initial filtering labels corresponding one-to-one to the pixel points of the initial filtering sample picture are generated with the filtering label generation formula, and the initial filtering labels of all pixel points form the label picture;
third, the initial filtering sample picture is input into a trained public network model, which outputs two-dimensional sub-feature matrices equal in number to the filters of the last layer of the model; elements at the same positions in all two-dimensional sub-feature matrices are summed to obtain the two-dimensional deep feature matrix of the initial filtering sample picture;
fourth, the filtering template is generated with the filtering template calculation formula from the label picture and the two-dimensional deep feature matrix of the initial filtering sample picture;
step 5, determining a sensitive filter combination:
first, a correlation filtering operation is performed between each two-dimensional sub-feature matrix of the initial filtering sample picture and the filtering template, obtaining response maps equal in number to the filters;
second, the response point values within each response map are compared to determine the maximum response point of each response map;
third, the distance between the maximum response point of each response map and the center point of the label picture is calculated, and the filters corresponding to the 100 smallest distance values form the sensitive filter combination;
step 6, setting the first frame of the tracking video as the current frame;
step 7, locating the tracked vehicle target in the next frame image of the current frame;
step 8, generating a target picture to be evaluated:
in the next frame of the current frame, taking the located position as the center, all pixel points in an area of the same size as the real target picture generated in the first step of step 4 are extracted to form the target picture to be evaluated;
step 9, inputting the real target picture and the target picture to be evaluated into the discriminant connected network trained in the step 3, judging whether the output of the discriminant connected network is 1, if so, setting the next frame of the current frame as the current frame, and then executing the step 11; otherwise, the tracking is regarded as failed, and the step 10 is executed;
step 10, repositioning the tracking target:
inputting the next frame of the current frame into a general-purpose detector, which outputs the position of the vehicle target to be tracked; taking the output position as the position of the tracked vehicle target in the next frame of the current frame, setting the next frame of the current frame as the current frame, and then executing step 11;
step 11, judging whether the current frame is the last frame of the tracking video, if so, executing step 12, otherwise, executing step 7;
step 12, ending the vehicle tracking process.
Compared with the prior art, the invention has the following advantages:
First, the invention selects the filters that are most sensitive to the vehicle target, so the features of the tracked vehicle target can still be extracted accurately when similar interference occurs. This overcomes the prior-art problem that, under illumination change, mutual occlusion between vehicles, or vehicle motion blur, an interfering object similar to the vehicle target is easily judged to be the target, and gives the method low computational cost and strong robustness.
Second, by constructing and training the discriminant connected network, the method can evaluate the tracking result and relocate the vehicle target after tracking fails. This overcomes the prior-art problem that tracking is difficult to continue once it has failed, and gives the method high tracking accuracy.
Detailed Description
The technical solution and effects of the present invention will be further described in detail with reference to the accompanying drawings.
The specific implementation steps of the present invention are further described in detail with reference to fig. 1.
Step 1, constructing a discriminant connected network.
Two identical sub-networks are built; each sub-network has five layers, arranged from left to right as: first convolution layer → first downsampling layer → second convolution layer → second downsampling layer → third convolution layer. The numbers of convolution kernels of the first, second and third convolution layers are set to 16, 32 and 1 in sequence, and the kernel sizes to 3 × 3, 3 × 3 and 1 × 1 in sequence; the filter sizes of the first and second downsampling layers are set to 2 × 2.
The two sub-networks are arranged in parallel, one above the other, and then connected to a cross-correlation layer XCorr to form the discriminant connected network; the loss function of the discriminant connected network is set to a contrastive loss function.
The discriminant connected network constructed by the present invention is further described with reference to fig. 2.
The upper and lower rows in fig. 2 represent the two sub-networks; referring to fig. 2, the layers of each sub-network are, from left to right, the first convolution layer, the first downsampling layer, the second convolution layer, the second downsampling layer and the third convolution layer, and the two parallel sub-networks are connected to the cross-correlation layer XCorr.
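Under the stated layer configuration, the spatial size of each sub-network's output can be traced. The following sketch assumes "valid" convolutions with stride 1 and non-overlapping 2 × 2 downsampling, since strides and padding are not specified above; for the 127 × 127 training crops of step 2 it yields a 30 × 30 feature map per branch.

```python
def out_size(n, k, stride=1):
    # Spatial size after a 'valid' convolution or pooling window.
    return (n - k) // stride + 1

def subnetwork_trace(n=127):
    n = out_size(n, 3)      # first convolution layer, 3 x 3
    n = out_size(n, 2, 2)   # first downsampling layer, 2 x 2
    n = out_size(n, 3)      # second convolution layer, 3 x 3
    n = out_size(n, 2, 2)   # second downsampling layer, 2 x 2
    n = out_size(n, 1)      # third convolution layer, 1 x 1
    return n
```

Under these assumptions the trace is 127 → 125 → 62 → 60 → 30 → 30.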
Step 2, generating a training set.
Randomly collecting at least 1000 pictures from a continuous video, wherein each picture comprises at least one target and marks the target; the labeled objects in the picture are cut out into 127 × 127 pictures, and the background in the pictures is randomly cut out into 127 × 127 pictures.
The cropped target pictures and background pictures are randomly combined into picture pairs, each pair containing at least one target picture. If the two pictures in a pair show the same target, the label of the pair is set to 1; if they show two different targets, or a target and a background, the label is set to 0. All picture pairs and their labels form the training set.
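The pairing rule of step 2 can be sketched as follows; the crop identifiers here are hypothetical, and `None` marks a background crop.

```python
def pair_label(id_a, id_b):
    # 1 if both crops show the same target; 0 for different targets
    # or for a target paired with a background crop.
    return 1 if id_a is not None and id_a == id_b else 0

# Hypothetical crops: (name, target id); None marks a background crop.
crops = [("imgA", "car3"), ("imgB", "car3"), ("imgC", "car7"), ("imgD", None)]

def make_pairs(crops):
    pairs = []
    for i in range(len(crops)):
        for j in range(i + 1, len(crops)):
            if crops[i][1] is None and crops[j][1] is None:
                continue  # each pair must contain at least one target picture
            pairs.append((crops[i][0], crops[j][0],
                          pair_label(crops[i][1], crops[j][1])))
    return pairs

pairs = make_pairs(crops)
```

With these four crops, only the pair (imgA, imgB) receives label 1.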
Step 3, training the discriminant connected network.
The training set is input into the discriminant connected network, and the network weights are iteratively updated with the Adam optimization algorithm until the contrastive loss function converges, yielding the trained discriminant connected network.
Step 4, calculating the filtering template.
First, in the first frame of the tracking video, a rectangular frame is drawn tightly around the tracked vehicle target, and all pixel points within the rectangular frame are extracted to form the real target picture; all pixel points within a rectangular frame centered on the same center point, with the width and height each expanded by a factor of two, form the initial filtering sample picture.
Second, initial filtering labels corresponding one-to-one to the pixel points of the initial filtering sample picture are generated with the following filtering label generation formula, and the initial filtering labels of all pixel points form the label picture:

g(x, y) = (1 / (2πσ²)) · exp(−[(x − x_c)² + (y − y_c)²] / (2σ²))

where g(x, y) represents the initial filtering label corresponding to the pixel point at (x, y) in the initial filtering sample picture, π represents the circumference ratio, σ represents a control parameter with a value of 0.5, exp represents the exponential operation with the natural constant e as base, x_c represents the abscissa of the center pixel point of the initial filtering sample picture, and y_c represents the ordinate of the center pixel point of the initial filtering sample picture.
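The label generation can be sketched directly in code (the 11 × 11 picture size is illustrative):

```python
import math

def gaussian_label(w, h, sigma=0.5):
    # Label picture: a 2-D Gaussian peaked at the centre pixel (x_c, y_c).
    xc, yc = w // 2, h // 2
    return [[(1.0 / (2 * math.pi * sigma ** 2))
             * math.exp(-((x - xc) ** 2 + (y - yc) ** 2) / (2 * sigma ** 2))
             for x in range(w)] for y in range(h)]

g = gaussian_label(11, 11)
```

The maximum label value sits at the centre pixel and decays with distance from it, matching the white centre dot of fig. 3.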
The label picture generated by the present invention will be further described with reference to fig. 3.
The size of fig. 3 is the same as the size of the initial filtered sample picture, and the white dots in the center of fig. 3 represent the locations of tracked vehicle targets in the initial filtered sample picture.
Third, the initial filtering sample picture is input into a trained public network model, which outputs two-dimensional sub-feature matrices equal in number to the filters of the last layer of the model; elements at the same positions in all two-dimensional sub-feature matrices are summed to obtain the two-dimensional deep feature matrix of the initial filtering sample picture.
Fourth, the filtering template is generated with the following filtering template calculation formula from the label picture and the two-dimensional deep feature matrix of the initial filtering sample picture:

h = F⁻¹( (F(g) ⊙ F(f)*) / (F(f) ⊙ F(f)*) )

where F(·) represents the Fourier transform operation and F⁻¹(·) its inverse, h represents the filtering template, * represents the conjugate operation, ⊙ represents element-wise multiplication (and the fraction element-wise division), g represents the label picture, and f represents the two-dimensional deep feature matrix of the initial filtering sample picture.
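A minimal sketch of the template computation in the Fourier domain; the small constant `lam` is an assumed regularizer added only to avoid division by zero and is not part of the formula above.

```python
import numpy as np

def filter_template(f, g, lam=1e-4):
    # F(h) = F(g) . conj(F(f)) / (F(f) . conj(F(f)) + lam), element-wise;
    # lam is an assumed small regularizer to avoid division by zero.
    F_f, F_g = np.fft.fft2(f), np.fft.fft2(g)
    H = (F_g * np.conj(F_f)) / (F_f * np.conj(F_f) + lam)
    return np.fft.ifft2(H).real

# Sanity check: if f is a unit impulse, F(f) is all ones and h ~ g.
f = np.zeros((8, 8)); f[0, 0] = 1.0
g = np.arange(64, dtype=float).reshape(8, 8)
h = filter_template(f, g)
```

For the impulse feature above, the recovered template equals g scaled by 1/(1 + lam), as the closed form predicts.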
Step 5, determining the sensitive filter combination.
First, a correlation filtering operation is performed between each two-dimensional sub-feature matrix of the initial filtering sample picture and the filtering template, obtaining response maps equal in number to the filters.
Second, the response point values within each response map are compared to determine the maximum response point of each response map.
Third, the distance between the maximum response point of each response map and the center point of the label picture is calculated, and the filters corresponding to the 100 smallest distance values form the sensitive filter combination.
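The selection procedure of step 5 can be sketched as follows; the three toy feature channels are shifted copies of the template, so their response peaks fall at known distances from the assumed center (0, 0).

```python
import math
import numpy as np

def sensitive_filters(sub_features, template, center, k):
    # Rank filters by the distance between their response-map peak and
    # the label-picture centre; keep the k smallest distances.
    dists = []
    for idx, feat in enumerate(sub_features):
        R = np.fft.ifft2(np.conj(np.fft.fft2(template))
                         * np.fft.fft2(feat)).real
        py, px = np.unravel_index(np.argmax(R), R.shape)
        dists.append((math.hypot(py - center[0], px - center[1]), idx))
    dists.sort()
    return [idx for _, idx in dists[:k]]

t = np.zeros((16, 16)); t[2:5, 2:5] = 1.0
feats = [t,                                          # peak at shift (0, 0)
         np.roll(t, 3, axis=0),                      # peak at shift (3, 0)
         np.roll(np.roll(t, 5, axis=0), 5, axis=1)]  # peak at shift (5, 5)
selection = sensitive_filters(feats, t, (0, 0), 2)
```

In the invention k is 100 and the features come from the public network model; here k = 2 keeps the two channels whose peaks lie closest to the center.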
Step 6, the first frame of the tracking video is set as the current frame.
Step 7, the tracked vehicle target is located in the next frame image of the current frame.
First, the position and size of the tracked vehicle target in the current frame of the tracking video are read, and the search area is obtained by taking the center point of the vehicle target as the center and expanding the width and height by a factor of two each.
Second, all pixel points within the search area are extracted from the next frame image of the current frame to form a search area picture; the search area picture is input into the public network model, and the sensitive sub-features extracted by each filter in the sensitive filter combination determined in step 5 are summed to obtain the sensitive features of the search area picture.
Third, a correlation filtering operation is performed between the sensitive features and the filtering template to obtain a sensitive response map.
Fourth, the response point values in the sensitive response map are compared to determine the maximum response point, and the position of the maximum response point is taken as the position of the tracked vehicle target in the next frame image.
Step 8, generating the target picture to be evaluated.
In the next frame of the current frame, taking the located position as the center, all pixel points in an area of the same size as the real target picture generated in the first step of step 4 are extracted to form the target picture to be evaluated.
Step 9, inputting the real target picture and the target picture to be evaluated into the discriminant connected network trained in the step 3, judging whether the output of the discriminant connected network is 1, if so, setting the next frame of the current frame as the current frame, and then executing the step 11; otherwise, the tracking is considered to be failed, and step 10 is executed.
Step 10, relocating the tracked target.
The next frame of the current frame is input into a general-purpose detector, which outputs the position of the vehicle target to be tracked; the output position is taken as the position of the tracked vehicle target in the next frame of the current frame, the next frame of the current frame is set as the current frame, and step 11 is executed.
Step 11, it is judged whether the current frame is the last frame of the tracking video; if so, step 12 is executed, otherwise step 7 is executed.
Step 12, the vehicle tracking process ends.
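The control flow of steps 6 through 12 can be sketched as a loop; `locate`, `evaluate` and `redetect` are hypothetical stand-ins for the correlation-filter localization (step 7), the discriminant-network check (step 9) and the detector-based relocation (step 10).

```python
def track_video(frames, locate, evaluate, redetect):
    # Steps 6-12: locate the target in the next frame, verify the result,
    # fall back to re-detection on failure, then advance the current frame.
    positions, cur = [], 0
    while cur < len(frames) - 1:             # step 11: stop at the last frame
        pos = locate(frames[cur + 1])        # step 7: correlation tracking
        if not evaluate(pos):                # step 9: discriminant-network check
            pos = redetect(frames[cur + 1])  # step 10: detector relocation
        positions.append(pos)
        cur += 1                             # advance the current frame
    return positions                         # step 12: tracking finished

# Stub run: 'evaluate' rejects position 2, which is then re-detected as -2.
trajectory = track_video([0, 1, 2, 3],
                         locate=lambda frame: frame,
                         evaluate=lambda pos: pos != 2,
                         redetect=lambda frame: -frame)
```

The stubs only exercise the control flow; in the invention they would be the correlation filter, the trained discriminant connected network and the detector.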
The effect of the present invention will be further described with reference to simulation experiments.
1. Simulation conditions are as follows:
the simulation of the invention is carried out on an Ubuntu14.04 system with a CPU of Intel (R) core (TM) i8, a main frequency of 3.5GHz and a memory of 128G by using MATLAB R2014 software and a MatConvnet deep learning toolkit.
2. Simulation content and result analysis:
vehicle tracking in simulation experiment data is simulated by using the method and three methods (a nuclear correlation filtering algorithm is abbreviated as KCF, a full convolution connected network algorithm for tracking is abbreviated as Sim _ FC, and a hierarchical convolution characteristic for tracking is abbreviated as HCFT) in the prior art respectively.
In the simulation experiment, three prior arts are adopted:
the prior art kernel Correlation filter algorithm KCF refers to a target Tracking algorithm, KCF algorithm for short, proposed by Henriques et al in "High-Speed Tracking with Kernelized Correlation Filters [ J ]. IEEE Transactions on Pattern Analysis & Machine integration, 37(3): 583-.
The prior-art fully-convolutional Siamese network algorithm for tracking, Siam_FC, refers to the real-time target tracking algorithm proposed by Bertinetto L et al. in "Bertinetto L, Valmadre J, Henriques J F, et al. Fully-Convolutional Siamese Networks for Object Tracking [J]. 2016", Siam_FC algorithm for short.
The prior-art hierarchical convolutional features algorithm for tracking, HCFT, refers to the target tracking algorithm proposed by Ma C et al. in "Ma C, Huang J B, Yang X, et al. Hierarchical Convolutional Features for Visual Tracking [C]. 2015", HCFT algorithm for short.
The simulation experiment data used by the invention are the common tracking databases OTB and TColor-128, where the OTB database contains 100 video sequences and TColor-128 contains 128 video sequences. The tracking results of the four methods are evaluated with two indices, distance precision (DP) and overlap success rate (OP). The DP and OP of all videos in the two databases are calculated, and the average distance precision and average overlap success rate on the OTB and TColor-128 databases are listed in Table 1 and Table 2:
The effect of the present invention is further described below with reference to Tables 1 and 2.
TABLE 1 Comparison of distance precision and overlap success rate on the OTB database
TABLE 2 Comparison of distance precision and overlap success rate on the TColor-128 database
As can be seen from Tables 1 and 2, the present invention achieves better results in both distance precision and overlap success rate on the OTB100 and TColor-128 databases, and a better tracking effect overall. This is mainly because the sensitive filter combination yields features that describe the tracked vehicle target more accurately, and the tracked target is relocated after tracking fails, resulting in more accurate and more robust tracking.
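For reference, the two evaluation indices can be computed as sketched below, assuming the standard OTB conventions (a 20-pixel center-error threshold for DP and a 0.5 IoU threshold for OP, since the exact formulas are not reproduced here); boxes are (x, y, w, h) tuples.

```python
import math

def distance_precision(pred_centers, gt_centers, thresh=20.0):
    # Fraction of frames whose predicted centre is within `thresh`
    # pixels of the ground-truth centre.
    hits = sum(1 for (px, py), (gx, gy) in zip(pred_centers, gt_centers)
               if math.hypot(px - gx, py - gy) <= thresh)
    return hits / len(pred_centers)

def iou(a, b):
    # Intersection over union of two (x, y, w, h) boxes.
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)

def overlap_precision(pred_boxes, gt_boxes, thresh=0.5):
    # Fraction of frames whose predicted box overlaps the ground-truth
    # box with IoU above `thresh`.
    hits = sum(1 for a, b in zip(pred_boxes, gt_boxes) if iou(a, b) > thresh)
    return hits / len(pred_boxes)
```

Averaging these per-video scores over a database gives the entries of Tables 1 and 2.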