CN109671102A

CN109671102A - A composite target tracking method based on a deep feature fusion convolutional neural network

Info

Publication number: CN109671102A
Application number: CN201811467752.3A
Authority: CN
Inventors: 王天江; 冯平; 赵志强; 罗逸豪; 冯琪
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2018-12-03
Filing date: 2018-12-03
Publication date: 2019-04-23
Anticipated expiration: 2038-12-03
Also published as: CN109671102B

Abstract

The invention discloses a composite target tracking method based on a channel feature fusion convolutional neural network, belonging to the field of computer vision. On the one hand, a convolutional neural network suited to tracking is constructed to extract deep features as the appearance representation. On the other hand, a long-term classification prediction sub-network module and a regression prediction sub-network module are constructed before tracking and trained with samples collected from the initial target information. During tracking, the long-term classification prediction sub-network module classifies all candidate blocks, and, according to the probabilities that they belong to the foreground class, the method adaptively combines the long-term and short-term classification prediction sub-network modules, the regression prediction sub-network module, and the multi-template matching module to track the target. The method of the invention has strong robustness and high accuracy.

Description

A composite target tracking method based on a deep feature fusion convolutional neural network

Technical field

The present invention relates to the field of computer vision, and in particular to a composite target tracking method based on a deep feature fusion convolutional neural network. The method of the invention can effectively improve the success rate and accuracy of video object tracking in complex scenes.

Background technique

In modern society, informatization is advancing ever faster, and a large number of video capture devices are present in people's work and daily life. These devices record and store massive amounts of video data. On the one hand, analyzing and processing these data manually is becoming extremely difficult, and may even be considered infeasible. On the other hand, practical applications place many different kinds of demands on these video data, chiefly video security monitoring, intelligent traffic management, intelligent human-computer interaction systems, target motion analysis, and autonomous driving of motor vehicles. Video object tracking is particularly important in the concrete applications of video analysis, video understanding, and video interaction; it is a basic technology on which such high-level video tasks rely.

Video object tracking is a very active research topic in computer vision, but it is also very challenging because of a series of interfering factors that may be present in a scene, such as illumination variation, scale variation, pose variation, and target occlusion.

Video object tracking refers to the following: after video data is acquired with a video capture device, one or more objects in the video are selected as tracking targets, and the initial center position and scale of each target region are given. A suitably designed target tracking method then predicts the center position and scale of the target in subsequent video frames, thereby tracking the target continuously.

A large number of applications in people's work and daily life need to be built on video-based target tracking. Completing target tracking automatically with computer vision techniques can free people from large numbers of tedious, inefficient tasks and provide an important basis for their analysis and decision-making. In complex real-world scenes, however, many different interfering factors frequently appear, which makes video-based target tracking very difficult.

Therefore, it is necessary to develop a new method or system for video-based target tracking that achieves strong robustness and high accuracy.

Summary of the invention

In view of the above defects or improvement needs of the prior art, the present invention provides a composite target tracking method based on a deep feature fusion convolutional neural network. Its purpose is to extract deep features of the target, construct multiple different processing modes, and fully combine the advantages of generative models, discriminative models, long-term tracking, and short-term tracking, so as to achieve robust and accurate target tracking. This in turn provides a good basis for video analysis, video understanding, and video interaction, and good technical support for vision applications typified by video security monitoring, intelligent traffic control, target motion analysis, human-computer interaction systems, and autonomous driving.

To achieve the above object, the present invention provides a composite target tracking method based on a deep feature fusion convolutional neural network, also referred to as a single visual target tracking method for complex scenes. The method includes the following steps:

(1) Modify the VGG-M network model and add a channel feature fusion convolutional layer. The convolutional part of the network serves as a shared deep feature extraction sub-network, and the remaining part serves as a sequence-specific deep feature classification sub-network; the two are connected to construct a channel feature fusion convolutional neural network model;

In the method of the present application, the modified network has fewer convolutional layers and fully connected layers than the VGG-M network, and the final deep feature classification sub-network retains only one fully connected layer. In existing network models, the last convolutional layer outputs a large number of feature channels, and the data in each feature channel are in fact sparse. The method of the present application adds a channel feature fusion convolutional layer before the fully connected layer, so that essentially the same amount of feature information is contained in a lower data dimension, which helps speed up the similarity computation module in the generative model. Before this channel feature fusion convolutional layer is added, the convolutional layers output 512 feature channels; after it is added, 32 fused feature channels are obtained.
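The reduction from 512 to 32 channels can be illustrated with a minimal sketch, assuming the channel feature fusion layer is an ordinary 1*1 convolution (the kernel size 1*1*512*32 stated in the embodiment suggests this). The helper name `fuse_channels`, the 3*3 spatial size, and the random weights are placeholders for illustration only; in the method the weights would be learned.

```python
import random

def fuse_channels(feature_map, weights):
    """Apply a 1*1 convolution: mix channels independently at each position.

    feature_map: H rows, each with W positions, each a list of C_in values.
    weights:     C_out rows, each a list of C_in mixing coefficients.
    """
    return [
        [
            [sum(w * v for w, v in zip(row_w, pixel)) for row_w in weights]
            for pixel in row
        ]
        for row in feature_map
    ]

random.seed(0)
C_IN, C_OUT, H, W = 512, 32, 3, 3  # spatial size 3*3 is an assumption
# Placeholder values standing in for real features and learned weights.
features = [[[random.random() for _ in range(C_IN)] for _ in range(W)] for _ in range(H)]
weights = [[random.gauss(0.0, 0.01) for _ in range(C_IN)] for _ in range(C_OUT)]

fused = fuse_channels(features, weights)
print(len(fused), len(fused[0]), len(fused[0][0]))  # 3 3 32
```

Because the mixing is applied per position, the spatial layout is unchanged and only the channel dimension shrinks, which is what makes later similarity computations cheaper.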

(2) Collect video sequences annotated with target position and scale information, and, for each video sequence, collect foreground-class and background-class samples according to the annotated target information to form the training set of the network model;

Some scholars and research institutions have released public video object tracking data sets. Several data sets containing different challenge factors are selected, including VOT-2013, VOT-2014, and VOT-2015, and duplicate videos among them are removed. From each selected video sequence, a portion of the video frames is chosen at random. Then, for each chosen frame of each sequence, a large number of sample image sub-blocks are generated from the annotated target position and scale by sampling with Gaussian functions over the coordinates of the target center point and the height and width of the target. The images in these sub-block regions are cropped and normalized. Foreground and background classes are defined according to the overlap ratio between each sub-block region and the ground-truth target block region, and the samples of the two classes are retained in a certain proportion, thereby forming the training sample set of the network model.

(3) Organize the training samples into batches by sequence, and train the network model iteratively, cycling through the sequences one by one, until the set number of cycles is completed or the preset accuracy threshold is reached;

Because of the processing speed of deep neural networks, samples are organized into batches during training of the network model. Training the network iteratively in a sequence-cycling manner means that, in each cycle, the shared feature extraction sub-network and each sequence-specific feature classification sub-network are trained in turn on the batch of samples from the sequence corresponding to that sequence-specific feature classification sub-network. A certain number of cycles can first be set and the convergence of the network's classification performance observed. If the convergence requirement is not met, the cycle limit is increased; conversely, to avoid overfitting of the deep network, the number of iterations should be reduced appropriately.

(4) For a new video sequence, construct a new corresponding sequence-specific feature classification sub-network module and a sequence-specific regression prediction sub-network module, and connect them to the shared deep feature extraction sub-network to form the target tracking network model for the new sequence;

Specifically, the video sequences in the training sample set contain various interfering factors such as illumination variation, pose variation, target rotation, scale variation, motion blur, and target occlusion. Therefore, once the network model has been sufficiently trained on these samples, the shared feature extraction sub-network can extract highly robust deep fused features.

The tracked target differs from one video sequence to another, and the target tracked in one sequence may be background in another video, or even a distractor similar to the target. Therefore, to track the target in a new video sequence, a completely new sequence-specific deep feature classification sub-network needs to be constructed and connected to the trained shared feature extraction sub-network to form the classification prediction network model used during tracking. In addition, the method of the present invention also uses a regression prediction network module; likewise, a sequence-specific deep feature regression prediction sub-network module is constructed for the new video sequence.

(5) Collect initial foreground-class and background-class samples using the position and scale of the target in the first frame of the new sequence, and train the newly constructed feature classification sub-network with these samples; the regression prediction sub-network module is trained using only the positive samples. Extract the deep feature of the initial target with the shared deep feature extraction sub-network, and use the extracted feature as the initial feature template of the target;

The previous step constructs the feature classification prediction sub-network module and the feature regression prediction sub-network module needed for tracking in the new video sequence. Foreground-class and background-class samples need to be collected in the initial frame according to the initial target information of the new sequence. The classification sub-network is trained with all of the samples to obtain the long-term classification prediction sub-network module, and the regression prediction sub-network module is trained with the foreground-class samples. The output of the final convolutional layer for the initial target region is processed as a feature and saved as the initial target feature template.

(6) Several different target feature templates are used during target tracking. The set of history target feature templates is initially empty, and the previous-frame target feature template is initialized to the initial target feature template;

The method of the present invention is a composite target tracking method that uses a generative module based on a multi-template matching strategy. The initial frame and the frame preceding the current frame contain, respectively, the initial information about the target and the information obtained by the most recent tracking step. In addition, during earlier tracking there may have been significant changes in appearance features that could recur in subsequent tracking. Based on this information, an initial target feature template, a previous-frame target feature template, and a set of history target feature templates are constructed. Before target tracking begins, the history template set is empty and the previous-frame target feature template is set to the initial target feature template.
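The template bookkeeping described above can be sketched as follows. The text does not state which similarity measure the multi-template matching module uses or how the per-template scores are merged, so this sketch assumes cosine similarity between feature vectors and takes the best match over all templates. The class `TemplateBank` and its method names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def cosine_similarity(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

class TemplateBank:
    """Holds the initial, previous-frame, and history target feature templates."""

    def __init__(self, initial_feature):
        self.initial = list(initial_feature)
        self.previous = list(initial_feature)  # starts as the initial template
        self.history = []                      # history set starts empty

    def score(self, candidate_feature):
        # Assumption: a candidate is scored by its best match over all templates.
        templates = [self.initial, self.previous] + self.history
        return max(cosine_similarity(candidate_feature, t) for t in templates)

bank = TemplateBank([1.0, 0.0, 0.0])
bank.history.append([0.0, 1.0, 0.0])  # an appearance seen earlier in tracking
print(round(bank.score([0.0, 2.0, 0.0]), 3))  # 1.0
```

In this toy example the candidate no longer resembles the initial or previous-frame template, but it still scores highly because it matches a history template, which is the recurrence case the paragraph describes.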

(7) Generate candidate regions of the target using the latest target position and scale information, extract the deep features of these regions with the shared feature extraction sub-network, and compute the class probabilities that they belong to the foreground class and the background class;

Under normal conditions the motion of the target has a certain regularity, and the change in the target's position and scale in a new frame is likely to follow a Gaussian distribution. Therefore, candidate target regions can be generated with Gaussian functions from the target position and scale of the previous frame. The shared feature extraction sub-network of the network model then extracts features from these candidate target regions, and the classification prediction sub-network computes the class probabilities that they belong to the foreground class and the background class.

(8) Judge the degree of change in target appearance from the probabilities that the deep features of all candidate blocks belong to the foreground class. These probability values are compared with a set threshold, and the comparison result serves as a condition, namely whether the probabilities that all candidate blocks belong to the foreground class are greater than the set threshold;

(9) When the probabilities that all candidate blocks belong to the foreground class are greater than the set threshold, the condition of the previous step holds. This indicates that the appearance of the target has changed little and that the target is likely to be correctly identified by the long-term classification prediction sub-network module. In this case the long-term classification prediction sub-network module and the regression prediction network module are combined to compute a comprehensive predicted value;

Conversely, when the condition does not hold, the appearance of the target may have changed considerably. In this case a new short-term classification prediction sub-network module is constructed, and the long-term and short-term classification prediction sub-network modules are combined with the multi-template matching module to compute a comprehensive predicted value;

Because the long-term classification prediction sub-network module is updated only with high-confidence samples collected during tracking, when the appearance of the target changes substantially this module may tend to classify all candidate blocks as background, so that the probabilities that all candidate blocks belong to the foreground class are less than the set threshold. Relying only on the long-term classification prediction sub-network module at this point easily leads to severe tracking drift. A new short-term classification prediction sub-network module is therefore constructed and combined with the long-term classification prediction sub-network module and the generative multi-template matching module to compute the comprehensive predicted value. Conversely, if the probability that some candidate blocks belong to the foreground class is greater than the set threshold, the appearance of the target has not changed greatly, and it is only necessary to combine the long-term classification prediction sub-network module with the regression prediction sub-network module to compute the comprehensive predicted value.
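The switching rule of steps (8) and (9) can be sketched as a small selection function. Step (9) states the condition over all candidate blocks, while the explanation above also speaks of some candidate blocks exceeding the threshold, so the sketch exposes a hypothetical `require_all` flag rather than fixing one reading. The threshold value 0.5 and the module names are assumptions for illustration.

```python
def select_modules(foreground_probs, threshold=0.5, require_all=True):
    """Return the names of the modules combined for the current frame.

    foreground_probs: long-term classifier outputs, one per candidate block.
    threshold:        assumed value; the text only says "a set threshold".
    require_all:      True follows the wording of step (9); False tests
                      whether at least one candidate exceeds the threshold.
    """
    check = all if require_all else any
    appearance_stable = check(p > threshold for p in foreground_probs)
    if appearance_stable:
        return ("long_term_classifier", "regressor")
    return ("long_term_classifier", "short_term_classifier", "multi_template_matcher")

print(select_modules([0.9, 0.8, 0.7]))
print(select_modules([0.2, 0.1, 0.3]))
```

How the selected modules' outputs are merged into the comprehensive predicted value is not specified in this passage, so the sketch stops at choosing which modules participate.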

(10) Take the candidate block with the highest predicted value as the target tracking result for the current frame, and update the previous-frame target feature template to the feature of the new target block. Collect samples according to the new target position and scale information and add them to the sample set used to update the short-term classification prediction sub-network module. Analyze the probabilities that all candidate blocks belong to the foreground class to determine whether to add the samples to the sample set of the long-term classification prediction sub-network module, and whether to generate a new history target feature template and update the network;

After the comprehensive predicted values of all candidate blocks have been computed by the combination of modules, the block with the largest predicted value is selected as the target in the current frame. The previous-frame target feature used in the multi-template strategy is then replaced with the deep feature of the new target block, and samples are collected according to the new target position and scale information and added to the sample set used to update the short-term classification prediction network module.

The result of classifying the features of all candidate blocks with the long-term classification prediction sub-network module reflects fairly objectively how much the target appearance has changed. If none of the candidate blocks has a high probability of belonging to the foreground class, the confidence of the current tracking result can be considered low as well. This indicates that the appearance features of the target have changed noticeably. In this case the long-term classification prediction sub-network module is updated with the samples in the high-confidence sample set collected during tracking, and the deep feature of the new target block is added to the set of history target feature templates;

Otherwise, the collected samples are added to the sample set of the long-term classification prediction sub-network module.
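The per-frame bookkeeping of step (10) can be summarized in a short sketch. The names `TrackerState` and `update_state` are hypothetical, the retraining of the long-term module is represented only by a counter, and the decision whether the appearance has changed is passed in as a boolean rather than recomputed here.

```python
from dataclasses import dataclass, field

@dataclass
class TrackerState:
    previous_template: list
    history_templates: list = field(default_factory=list)
    short_term_samples: list = field(default_factory=list)
    long_term_samples: list = field(default_factory=list)
    long_term_updates: int = 0  # stand-in for retraining the long-term module

def update_state(state, target_feature, new_samples, appearance_changed):
    """Apply the step (10) updates after the current frame's target is chosen."""
    # The previous-frame template always becomes the new target block feature.
    state.previous_template = target_feature
    # Samples around the new target always feed the short-term sample set.
    state.short_term_samples.extend(new_samples)
    if appearance_changed:
        # Low confidence: update the long-term module from its stored
        # high-confidence samples and remember the new appearance.
        state.long_term_updates += 1
        state.history_templates.append(target_feature)
    else:
        # High confidence: the new samples join the long-term sample set.
        state.long_term_samples.extend(new_samples)
    return state

state = TrackerState(previous_template=[1.0, 0.0])
update_state(state, [0.9, 0.1], ["s1", "s2"], appearance_changed=False)
update_state(state, [0.2, 0.8], ["s3"], appearance_changed=True)
print(state.long_term_samples, state.history_templates, state.long_term_updates)
```

Keeping the low-confidence samples out of the long-term set is what lets the long-term module stay anchored to reliable appearances while the short-term set adapts quickly.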

(11) Judge whether tracking has ended; if it has not, go to step (7), so that steps (7) to (11) are executed in a loop.

In general, compared with the prior art, the above technical solution conceived by the present invention can achieve the following beneficial effects:

The method of the present invention includes the following. A two-class deep feature fusion convolutional neural network model is constructed; the network model includes a shared feature extraction sub-network and sequence-specific feature classification sub-networks in one-to-one correspondence with the tracking sequences. Video sequences are selected from annotated public video tracking data sets to construct the training set, foreground-class and background-class samples are collected from them, and the network model is trained iteratively on the collected sequence samples in a sequence-by-sequence round-robin manner. When the target in a new video sequence is tracked, the parameters of the feature extraction sub-network are kept fixed, and a new sequence-specific feature classification sub-network and a sequence-specific regression prediction sub-network module are constructed for the new sequence. Initial foreground-class and background-class samples of the sequence are collected according to the target position and scale information in the first frame of the new sequence, and these samples are used to train the newly constructed sequence-specific feature classification sub-network and regression prediction sub-network module. During target tracking, candidate blocks are generated according to the latest target position and scale information, and the latest network is used to extract their features and classify them. When the probabilities that all candidate blocks belong to the foreground class are greater than a set classification threshold, the long-term classification prediction sub-network module and the regression prediction sub-network module are combined for prediction, and samples collected according to the new target state information are saved to the sample set of the long-term classification prediction sub-network module. Otherwise, a short-term sequence-specific classification sub-network module is constructed and trained, the long-term and short-term classification prediction sub-network modules and the multi-template matching module are combined for prediction, the long-term classification prediction sub-network module is updated with the samples collected during tracking, and the deep feature of the new target region is added to the set of history target feature templates. Samples collected according to the new target state information are saved to the sample set of the short-term classification network module. The candidate block with the highest predicted value is taken as the new target tracking result. The present invention extracts features with a deep feature fusion convolutional neural network and proposes a composite target tracking method based on the fused features. The design is simple, and the composite target tracking model that combines the long-term and short-term classification prediction sub-network modules with the multi-template matching module can effectively improve the prediction accuracy of target tracking.

Brief description of the drawings

Fig. 1 is a schematic framework diagram of the principle of a composite target tracking method based on a channel feature fusion convolutional neural network in an embodiment of the present invention.

Fig. 2 is a schematic diagram of the network structure of the channel feature fusion convolutional neural network in an embodiment of the present invention.

Specific embodiment

In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. In addition, the technical features involved in the embodiments of the present invention described below can be combined with one another as long as they do not conflict.

The main object of the present invention is to provide, for the problem of tracking a single visual target in complex scenes, a composite tracking method based on a channel feature fusion convolutional neural network. The method extracts deep features that are robust to target rotation, illumination variation, pose variation, target occlusion, and the like, constructs multiple different processing modules, and fully combines the advantages of generative models, discriminative models, long-term tracking, and short-term tracking, so as to achieve robust and accurate target tracking. This in turn provides a good basis for video analysis, video understanding, and video interaction, and good technical support for vision applications typified by video security monitoring, intelligent traffic control, target motion analysis, human-computer interaction systems, and autonomous driving.

The main idea of the present invention is to propose a composite target tracking method based on a channel feature fusion convolutional neural network. On the one hand, a new channel feature weighted convolutional layer is added to the network structure, and a convolutional neural network suited to target tracking is constructed to extract deep features as the appearance representation, so that features that were originally sparse but had many channels carry essentially the same amount of information in a lower feature dimension, which helps speed up the similarity computation. On the other hand, a long-term classification prediction sub-network module and a regression prediction sub-network module are constructed before tracking, and samples collected from the initial target information are used to train them. During tracking, the long-term classification prediction sub-network module classifies all candidate blocks, and the long-term and short-term classification prediction sub-network modules, the regression prediction sub-network module, and the multi-template matching module are adaptively combined for tracking according to the probabilities that the candidate blocks belong to the foreground class. Likewise, the reconstruction of the short-term classification prediction sub-network module, the update of the classification prediction sub-network modules, the update of the regression prediction sub-network module, the collection of samples during tracking, and the generation of history target feature templates are all carried out adaptively according to the probabilities that all candidate blocks belong to the foreground class.

Fig. 1 is a schematic framework diagram of a composite target tracking method based on a channel feature fusion convolutional neural network in an embodiment of the present invention. The method mainly includes the following steps:

(1) Modify the number of layers of the classical classification network VGG-M and the convolution kernel size of each convolutional layer, and add a new deep feature fusion convolutional layer. The convolutional part before the fully connected layer serves as the feature extraction sub-network shared by all sequences. For each tracked sequence, a sequence-specific feature classification sub-network containing one fully connected layer and a function layer is constructed. The two sub-networks are connected to form the deep feature fusion convolutional neural network model;

What is analyzed during tracking is certain blocks within each frame, which are small. The VGG-M network is therefore modified so that the normalized input image received by the network has size 107*107*3, and the number of convolutional layers is reduced to 3. The convolution kernel sizes of the layers are 7*7*3*96, 5*5*96*256, and 3*3*256*512 respectively, where the first two parameters of each kernel size give the spatial size of the kernel and the last two give the number of feature channels before and after the convolution. The first two convolutional layers use a convolution stride of 2*2, and the third uses a stride of 1*1. Between these convolutional layers there are also ReLU layers, normalization layers, and pooling layers; the pooling layers use a 3*3 pooling window with a stride of 2*2. The third convolutional layer is followed by a ReLU layer, after which a channel feature fusion convolutional layer and another ReLU layer are added. The kernel size of the channel feature fusion convolutional layer is 1*1*512*32 and its convolution stride is 1*1. Together these layers form the feature extraction sub-network shared across sequences. Connected after the shared feature sub-network are a fully connected layer with a convolution kernel of size 3*3*512*2 and a function layer, which together form a sequence-specific feature classification sub-network. The method of the present application also uses a sequence-specific regression prediction sub-network module whose structure is similar to that of the feature classification sub-network, except that its convolution kernel has size 3*3*512*1 and its function layer uses a logistic function rather than a softmax function.
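The spatial sizes implied by these kernel sizes and strides can be checked with a short calculation. This is a sketch under two assumptions that the text does not state: no layer uses padding, and a pooling layer follows each of the first two convolutional layers but not the third.

```python
def out_size(size, kernel, stride):
    """Output width of an unpadded convolution or pooling layer."""
    return (size - kernel) // stride + 1

# (name, kernel width, stride); the layer order is an assumption, see above.
layers = [
    ("conv1 7*7, stride 2", 7, 2),
    ("pool1 3*3, stride 2", 3, 2),
    ("conv2 5*5, stride 2", 5, 2),
    ("pool2 3*3, stride 2", 3, 2),
    ("conv3 3*3, stride 1", 3, 1),
    ("fusion 1*1, stride 1", 1, 1),
]

size = 107  # normalized input width stated in the text
trace = [size]
for name, kernel, stride in layers:
    size = out_size(size, kernel, stride)
    trace.append(size)

print(trace)  # [107, 51, 25, 11, 5, 3, 3]
```

Under these assumptions the shared sub-network ends in a 3*3 feature map, which agrees with the 3*3 spatial extent of the fully connected kernel given above.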

(2) To train a network model for the tracking problem, collect tracking videos annotated with target position and scale information, and for each video sequence collect foreground-class and background-class samples using the target state information to form the training set of the network model;

Annotated video sequences covering scenes with different challenge factors are selected for sampling the training samples of the network model. In each video sequence, 8 frames are chosen at random. On these frames, foreground-class and background-class samples are defined according to the annotated position and scale of the target, and 50 foreground-class and 200 background-class samples are collected. Foreground-class and background-class samples are defined by the ratio of the overlap area between the sample region and the annotated ground-truth target region. Two thresholds are set, 0.7 and 0.5: if the overlap ratio is greater than or equal to 0.7, the sample is defined as a foreground-class sample; if the overlap ratio is less than 0.5, the sample is defined as a background-class sample.
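The labeling rule can be sketched as follows, assuming the overlap ratio is the usual intersection over union (IoU) of the two boxes; the text only says "ratio of area overlap". The function names and the `(x, y, w, h)` box layout are illustrative choices.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h),
    where (x, y) is the top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def label_sample(sample_box, target_box, fg_thresh=0.7, bg_thresh=0.5):
    """Label a sample by its overlap with the annotated target region."""
    overlap = iou(sample_box, target_box)
    if overlap >= fg_thresh:
        return "foreground"
    if overlap < bg_thresh:
        return "background"
    return None  # between the thresholds: assigned to neither class

target = (0, 0, 100, 100)
print(label_sample((5, 5, 100, 100), target))   # foreground
print(label_sample((60, 0, 100, 100), target))  # background
print(label_sample((20, 0, 100, 100), target))  # None
```

Samples whose overlap falls between the two thresholds are left unlabeled here, which keeps ambiguous regions out of both classes.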

(3) Organize the collected training samples into batches by sequence, and use them to train the network model iteratively in sequence-by-sequence cycles until the set number of cycles is reached or the error rate of the network falls below the preset threshold;

The initial network model is trained for 150 cycles over the video sequences in the training set; this process mainly learns the convolution kernel parameters of the shared feature extraction sub-network. In each training cycle, 32 foreground-class samples and 96 background-class samples are selected at random from all the samples of each sequence in the training set to form the sample batch used for that sequence in that iteration.
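The sequence-by-sequence batching can be sketched as a generator that visits every sequence once per cycle and draws 32 foreground and 96 background samples for it. The function names, the sampling without replacement, and the toy sample pools are assumptions; the actual gradient step on the shared sub-network and the matching sequence-specific branch is omitted.

```python
import random

def make_batch(pos_samples, neg_samples, rng, n_pos=32, n_neg=96):
    """Draw one batch for a single sequence (sampling without replacement)."""
    return rng.sample(pos_samples, n_pos), rng.sample(neg_samples, n_neg)

def training_schedule(sequences, n_cycles=150, seed=0):
    """Yield (cycle, sequence_name, positives, negatives) in round-robin order.

    sequences maps a sequence name to its (foreground, background) sample pools.
    """
    rng = random.Random(seed)
    for cycle in range(n_cycles):
        for name, (pos, neg) in sequences.items():
            batch_pos, batch_neg = make_batch(pos, neg, rng)
            yield cycle, name, batch_pos, batch_neg

# Toy pools of integer sample ids; pool sizes are placeholders.
toy = {
    "seq_a": (list(range(100)), list(range(300))),
    "seq_b": (list(range(100)), list(range(300))),
}
steps = list(training_schedule(toy, n_cycles=2))
print(len(steps), [s[1] for s in steps])
```

Each yielded step is where a training loop would select the classification branch belonging to `sequence_name` and update it together with the shared layers.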

(4) For a new video sequence, construct a new corresponding sequence-specific feature classification sub-network and a sequence-specific regression prediction sub-network module, and connect them to the shared deep feature extraction sub-network to form the network model used for target tracking in the new sequence;

(5) Collect initial foreground-class and background-class samples using the position and scale of the target in the first frame of the new sequence, and train the newly constructed feature classification sub-network with these samples; the regression prediction sub-network module is trained using only the positive samples. Extract the deep feature of the initial target with the shared deep feature extraction sub-network, and use the extracted feature as the initial feature template of the target;

Using the position and scale of the target in the initial frame of the sequence, 500 foreground-class samples and 5000 background-class samples are collected. As before, 32 foreground-class samples and 96 background-class samples are randomly selected from them each time as one sample batch fed to the network, and 20 training iterations are performed to learn the parameters of the newly constructed sequence-specific fully connected layers. Then the deep features of the initial target block extracted by the shared feature extraction sub-network are vectorized and normalized, and the result is used as the initial target feature template.

(6) Several different target feature templates are used during tracking: the set of historical target feature templates is initialized as empty, and the previous-frame target feature template is initialized to the initial target feature template.

(7) Candidate regions of the target are generated from the latest target position and scale; the network model extracts the deep features of these regions and computes the class probabilities that they belong to the foreground class and the background class;

Based on the latest position and scale of the target, 256 candidate sample blocks are generated using Gaussian functions of the center-point coordinates and of the width and height scales. Their deep features are extracted, and the latest long-term classification prediction sub-network module classifies the features of these candidate blocks.
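Candidate generation can be sketched as follows. This is a minimal illustration: the text fixes only the number of candidates (256) and the use of Gaussian functions, so the standard deviations `pos_sigma` and `scale_sigma`, the `(cx, cy, w, h)` box layout and the function name `generate_candidates` are all assumptions.

```python
import random


def generate_candidates(box, n=256, pos_sigma=0.1, scale_sigma=0.05, seed=None):
    """Sample n candidate boxes around `box` = (cx, cy, w, h).

    The center coordinates and the width/height are each perturbed with
    Gaussian noise. pos_sigma and scale_sigma are illustrative values,
    not taken from the source text.
    """
    rng = random.Random(seed)
    cx, cy, w, h = box
    candidates = []
    for _ in range(n):
        new_cx = rng.gauss(cx, pos_sigma * w)
        new_cy = rng.gauss(cy, pos_sigma * h)
        # Clamp to at least one pixel so every candidate is a valid box.
        new_w = max(1.0, rng.gauss(w, scale_sigma * w))
        new_h = max(1.0, rng.gauss(h, scale_sigma * h))
        candidates.append((new_cx, new_cy, new_w, new_h))
    return candidates
```

Scaling the noise by the current width and height keeps the spread of candidates proportional to the target size, which is a design choice of this sketch rather than something the text specifies.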

(8) The degree of change in the target's appearance is judged from the probabilities that the deep features of all candidate blocks belong to the foreground class. These probability values are compared with a set threshold, and the comparison result serves as the decision condition, namely whether the foreground-class probabilities of all candidate blocks are greater than the set threshold;

The threshold used in the decision condition is set to 0.55. A value above this threshold indicates that a block is more likely to belong to the foreground class than to the background class; a value below it indicates the opposite.

(9) When the decision condition of the previous step holds, that is, when the foreground class is more likely than the background class, the target's appearance has changed little and the long-term classification prediction sub-network module is likely to recognize it correctly. In this case the long-term classification prediction sub-network module and the regression prediction sub-network module are combined to compute the comprehensive predicted value;

When the decision condition does not hold, that is, when the foreground class is less likely than the background class, the target's appearance may have changed substantially. In this case a new short-term classification prediction sub-network module is constructed, and the long-term and short-term classification prediction sub-network modules are combined with the multi-template matching module to compute the comprehensive predicted value;
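The branch selection in steps (8) and (9) can be sketched as follows. This is a minimal illustration of the decision condition as stated (all candidate probabilities above 0.55); the function names and the returned branch labels are hypothetical.

```python
def appearance_is_stable(fg_probs, threshold=0.55):
    """True when every candidate's foreground probability exceeds the threshold."""
    return all(p > threshold for p in fg_probs)


def choose_branch(fg_probs, threshold=0.55):
    """Pick which modules are combined for the current frame."""
    if appearance_is_stable(fg_probs, threshold):
        # Little appearance change: long-term classifier plus regression module.
        return "long_term_plus_regression"
    # Possible large change: long/short-term classifiers plus template matching.
    return "long_short_term_plus_templates"
```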

When the decision condition of the previous step holds, the blocks whose foreground-class probability is greater than 0.5 are selected from the candidate blocks, and the long-term classification prediction sub-network module is combined with the regression prediction sub-network module. The former contributes the foreground-class probability with a weight fixed at 1. The output of the regression prediction sub-network module directly represents the probability that a block is the target; its weight is set to the average of the 5 highest foreground-class probabilities among the selected blocks (or the average of all values if fewer than 5 blocks meet the condition). Conversely, when the condition does not hold, the top 50 blocks in descending order of foreground-class probability are selected from the candidate blocks, and a new short-term classification prediction sub-network module is constructed and trained on the samples from the three most recent frames. It is then used to compute the foreground-class probabilities of the selected blocks, and the probabilities computed by the long-term and short-term classification prediction network modules are combined by weighting. The weight of the long-term module is set to 1; the weight of the short-term classification prediction sub-network module is determined by the proportion of foreground-class samples, taken from frames tracked with higher confidence, that the short-term module classifies correctly. Next, the EMD (Earth Mover's Distance) is used to compute the similarities between the deep features of the selected blocks and the three kinds of target feature templates used in the method, and these similarities are weighted to obtain a comprehensive matching value. The weights of the three similarities are set according to the following formula:

ω_f=C1 (1)

Here ω_f, ω_l and ω_h denote the weights applied to the similarities with the initial target feature template, the previous-frame target feature template and the historical target feature template set, respectively, when the three similarities are summed; p^*(t-1) denotes the foreground-class probability that the long-term classification prediction sub-network module computed for the previous frame's tracking result; C1, C2, α and β are four constants, set to 2, 0.2, 0.5 and 0.01 respectively. Finally, the probability prediction value and the matching value are combined by weighting to compute the comprehensive predicted value, with weights of 0.7 and 0.3 respectively.
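The two score combinations can be sketched as follows. This is a minimal illustration that assumes each combination is a plain weighted sum, which the text does not state explicitly. The matching values and the short-term weight are taken as inputs because their computation (EMD similarities and the template weights) is not covered here; `stable_scores` and `changed_scores` are hypothetical names.

```python
def stable_scores(cls_probs, reg_probs, select_thresh=0.5, top_k=5):
    """Stable-appearance branch: long-term classifier plus regression module.

    Returns {candidate_index: score} for candidates whose foreground
    probability exceeds select_thresh. A weighted sum is assumed.
    """
    selected = [i for i, p in enumerate(cls_probs) if p > select_thresh]
    if not selected:
        return {}
    # Regression weight: mean of the (up to) 5 highest foreground probabilities.
    top = sorted((cls_probs[i] for i in selected), reverse=True)[:top_k]
    reg_weight = sum(top) / len(top)
    return {i: 1.0 * cls_probs[i] + reg_weight * reg_probs[i] for i in selected}


def changed_scores(long_probs, short_probs, short_weight, match_values,
                   top_n=50, w_prob=0.7, w_match=0.3):
    """Changed-appearance branch: classifiers plus multi-template matching.

    short_probs and match_values are indexed like long_probs here; in the
    method they only need to be computed for the selected candidates.
    """
    order = sorted(range(len(long_probs)), key=lambda i: long_probs[i], reverse=True)
    scores = {}
    for i in order[:top_n]:
        prob = 1.0 * long_probs[i] + short_weight * short_probs[i]
        scores[i] = w_prob * prob + w_match * match_values[i]
    return scores
```

In either branch the tracked block would be the index with the largest score, for example `max(scores, key=scores.get)`.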

(10) The candidate block with the highest predicted value is taken as the tracking result for the current frame, and the previous-frame target feature template is updated to the features of the new target block. Samples are collected according to the new target position and scale and added to the sample set used to update the short-term classification prediction sub-network module. The foreground-class probabilities of all candidate blocks are analyzed to determine whether these samples are also added to the sample set of the long-term classification prediction sub-network module, and whether the historical target feature templates and the network model are updated;

The block with the highest comprehensive predicted value is taken as the tracking result for the current frame, and the previous-frame target feature template is replaced with the deep features of that block. Foreground-class and background-class samples are then collected using the position and scale of the latest target block and saved in the sample set used to train the short-term classification prediction sub-network module once it is constructed. For the current frame, if the classification results of the long-term classification prediction sub-network module contain candidate blocks whose foreground-class probability is greater than or equal to 0.6, the collected samples are also added to the sample set used for subsequent updates of the long-term classification prediction sub-network module. Otherwise, the appearance of the new target block is considered to have changed noticeably. In that case the long-term classification prediction sub-network module is updated using, from the sample set collected for its updates, the foreground-class samples of the most recent 20 frames (or all recent frames if fewer than 20) and the background-class samples of the most recent 50 frames (or all recent frames if fewer than 50). At the same time, the processed deep features of the latest target block are saved to the historical target feature template set as a new historical target feature template.
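The update decision can be sketched as follows. This is a minimal illustration that only decides what to do and which stored samples to use; the per-frame sample lists, the returned dictionary keys and the function name `plan_update` are assumptions made for the example.

```python
def plan_update(fg_probs, fg_by_frame, bg_by_frame,
                confident_thresh=0.6, fg_window=20, bg_window=50):
    """Decide how the long-term module and the history templates are updated.

    fg_probs: long-term foreground probabilities of the current candidates.
    fg_by_frame / bg_by_frame: stored samples, one list per past frame,
    oldest first.
    """
    if any(p >= confident_thresh for p in fg_probs):
        # Confident frame: keep its samples for later long-term updates.
        return {"store_for_long_term": True, "retrain_long_term": False,
                "add_history_template": False}
    # Noticeable appearance change: retrain on recent stored samples.
    # Slicing with [-n:] returns every frame when fewer than n are stored.
    fg = [s for frame in fg_by_frame[-fg_window:] for s in frame]
    bg = [s for frame in bg_by_frame[-bg_window:] for s in frame]
    return {"store_for_long_term": False, "retrain_long_term": True,
            "add_history_template": True, "fg_samples": fg, "bg_samples": bg}
```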

(11) Judge whether tracking has ended; if not, repeat steps (7) to (11).

Fig. 2 is a schematic diagram of the network structure of the channel feature fusion convolutional neural network in the embodiment of the present invention. In the figure, Conv denotes a convolutional layer and the number after it indicates which convolutional layer it is; K:n*n in the brackets below denotes the kernel size, and s:l denotes the stride of the operation. Likewise, pooling denotes a pooling layer, ReLu denotes a rectified linear unit layer, and normalize denotes a normalization layer. Feature fusion and full connection denote the feature fusion layer and the fully connected layer respectively; both are special forms of the convolutional layer.

Taking tests on multiple video sequences as an example, two evaluation metrics for target tracking are introduced below, and the tracking results obtained with the tracking method proposed in the present invention are presented.

There are two main metrics for evaluating tracking accuracy. The first uses the Euclidean distance to measure the error between the center position of the tracking result and the center position of the target's ground truth, and is called the center location error (Center Location Error, CLE); a smaller Euclidean distance between the centers means a smaller error and more accurate tracking. The second measures the ratio by which the region of the tracking result overlaps the region of the target's ground truth, and is called the overlap ratio (Overlap Ratio, OR); a higher overlap ratio means the prediction coincides more closely with the ground truth and the tracking result is more accurate. The accuracy over an entire video sequence is evaluated by comparing the average of the per-frame results. Suppose that, for a given frame, the horizontal and vertical coordinates of the center of the predicted result and its region are denoted (x_p, y_p) and R_p, and those of the corresponding ground-truth target are denoted (x_g, y_g) and R_g; the two metrics are then calculated as follows:
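A minimal sketch of the two metrics follows. It assumes boxes are given as (x, y, w, h) with (x, y) the top-left corner, and it assumes OR is the intersection area of R_p and R_g divided by their union area, which is the usual convention but is not confirmed by the text here. The function names are hypothetical.

```python
import math


def center_location_error(pred_box, gt_box):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    px, py = pred_box[0] + pred_box[2] / 2, pred_box[1] + pred_box[3] / 2
    gx, gy = gt_box[0] + gt_box[2] / 2, gt_box[1] + gt_box[3] / 2
    return math.hypot(px - gx, py - gy)


def overlap_ratio(pred_box, gt_box):
    """Intersection over union of two (x, y, w, h) boxes (assumed definition)."""
    px, py, pw, ph = pred_box
    gx, gy, gw, gh = gt_box
    iw = max(0.0, min(px + pw, gx + gw) - max(px, gx))
    ih = max(0.0, min(py + ph, gy + gh) - max(py, gy))
    inter = iw * ih
    union = pw * ph + gw * gh - inter
    return inter / union if union > 0 else 0.0


def evaluate_sequence(pred_boxes, gt_boxes):
    """Return (mean CLE, mean OR) over all frames of one sequence."""
    n = len(gt_boxes)
    pairs = list(zip(pred_boxes, gt_boxes))
    mean_cle = sum(center_location_error(p, g) for p, g in pairs) / n
    mean_or = sum(overlap_ratio(p, g) for p, g in pairs) / n
    return mean_cle, mean_or
```

Averaging the per-frame values mirrors the sequence-level evaluation described above.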

The first row of the table lists the names of the different video sequences. For CLE, a smaller value means a more accurate center position; for OR, a larger value means a higher degree of overlap. As the table shows, the method of the present invention achieves tracking with small center deviation and high overlap on the above video sequences.

The present invention makes full use of advanced image processing and pattern recognition techniques proposed in the field of computer vision, and effectively accomplishes video target tracking in complex scenes.

Those skilled in the art will readily understand that the foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention; any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A comprehensive target tracking method based on a deep feature fusion convolutional neural network, characterized in that it comprises the following steps:

(1) modifying the VGG-M network model by adding a channel feature fusion convolutional layer, using the convolutional part as the shared deep feature extraction sub-network and the remaining part as the sequence-specific deep feature classification sub-network, and connecting the two to construct a channel feature fusion convolutional neural network model;

(2) collecting video sequences annotated with target position and scale information, and, for each video sequence, collecting foreground-class and background-class samples according to the annotated target information to form the training set of the convolutional neural network model;

(3) organizing the training samples into batches by sequence and training the convolutional neural network model iteratively, sequence by sequence, until the set number of cycles is completed or a preset accuracy threshold is reached;

(4) for a new video sequence, constructing a new sequence-specific feature classification sub-network module and a sequence-specific regression prediction sub-network module, and connecting the two network modules to the shared deep feature extraction sub-network to form the target tracking network model for the new sequence;

(5) collecting initial foreground-class and background-class samples using the position and scale of the target in the first frame of the new video sequence, training the newly constructed sequence-specific feature classification sub-network module with both kinds of samples, training the sequence-specific regression prediction sub-network module with the positive samples only, extracting the deep features of the initial target with the shared deep feature extraction sub-network, and using the extracted features as the initial target feature template,

wherein the output of the last convolutional layer for the initial target region is processed as the feature, and the processed result is saved as the initial target feature template;

(6) using several different target feature templates during tracking, wherein the set of historical target feature templates is initialized as empty and the previous-frame target feature template is initialized to the initial target feature template;

(7) generating candidate regions of the target from the latest target position and scale, extracting the deep features of the candidate regions with the shared deep feature extraction sub-network, and computing the class probabilities that the deep features belong to the foreground class and the background class respectively;

(8) judging the degree of change in the target's appearance from the probabilities that the deep features of all candidate regions belong to the foreground class, and comparing these probability values with a set threshold, the comparison result serving as a condition, namely whether the foreground-class probabilities of all candidate regions are greater than the set threshold;

(9) when the foreground-class probabilities of all candidate regions are greater than the set threshold, the target's appearance has changed little and the long-term classification prediction sub-network module is likely to recognize it correctly; in this case the long-term classification prediction sub-network module and the regression prediction network module are combined to compute the comprehensive predicted value;

otherwise, the target's appearance has changed substantially; in this case a new short-term classification prediction sub-network module is constructed, and the long-term and short-term classification prediction sub-network modules are combined with the multi-template matching module to compute the comprehensive predicted value;

(10) taking the candidate block with the highest predicted value as the tracking result for the current frame, updating the previous-frame target feature template to the features of the new target block, collecting samples according to the new target position and scale, adding the samples to the sample set used to update the short-term classification prediction sub-network module, and analyzing the foreground-class probabilities of all candidate regions to determine whether to add the candidate regions to the sample set of the long-term classification prediction sub-network module and whether to generate a new historical target feature template and update the network;

after the comprehensive predicted values of all candidate regions have been computed, the region with the largest predicted value is selected as the target of the current frame; the previous-frame target feature used in the multi-template strategy is then replaced with the deep features of the new target region, samples are collected according to the new target position and scale, and the samples are added to the sample set used to update the short-term classification prediction sub-network module;

the long-term classification prediction sub-network module classifies the features of all candidate regions, and the results reflect relatively objectively the degree of change in the target's appearance,

if the foreground-class probabilities of all candidate regions are low, the confidence of the current tracking result is judged to be low as well, which indicates that the target's appearance features have changed noticeably; in this case the long-term classification prediction sub-network module is updated using the samples in the higher-confidence sample set collected during tracking, and the deep features of the new target region are added to the historical target feature template set;

(11) judging whether tracking has ended,

if it has not ended, returning to step (7) and executing steps (7) to (11) in sequence.

2. The comprehensive target tracking method based on a deep feature fusion convolutional neural network according to claim 1, characterized in that, in step (3),

owing to the processing speed of deep neural networks, the samples are organized in batches during network model training, and the network is trained iteratively in a sequence-by-sequence loop; specifically, in each cycle the shared feature extraction sub-network and each sequence-specific feature classification sub-network are trained, one sequence at a time, on the batch of samples from the sequence corresponding to that sequence-specific feature classification sub-network,

a certain number of cycles may be set first to observe the convergence of the network's classification performance; if the convergence requirement is not met, the cycle-count threshold is increased; otherwise, the number of iterations is reduced to avoid overfitting of the deep network.

3. The comprehensive target tracking method based on a deep feature fusion convolutional neural network according to claim 2, characterized in that, in step (4),

for target tracking in a new video sequence, a completely new sequence-specific deep feature classification sub-network is constructed and connected to the trained shared feature extraction sub-network to form the classification prediction network model used during tracking,

and a sequence-specific deep feature regression prediction sub-network module is likewise constructed for the new video sequence.

4. The comprehensive target tracking method based on a deep feature fusion convolutional neural network according to claim 3, characterized in that, in step (6),

an initial target feature template, a previous-frame target feature template and a historical target feature template set are constructed respectively; before target tracking starts, the historical feature template set is set to empty and the previous-frame target feature template is set to the initial target feature template.

5. The comprehensive target tracking method based on a deep feature fusion convolutional neural network according to claim 4, characterized in that, in step (7),

candidate target regions are generated using Gaussian functions, the shared feature extraction sub-network of the network model then extracts deep features from the generated candidate target regions, and the classification prediction sub-network model further computes the class probabilities that they belong to the foreground class and the background class.

6. The comprehensive target tracking method based on a deep feature fusion convolutional neural network according to claim 5, characterized in that, in step (9),

when the target's appearance has changed substantially, a new short-term classification prediction sub-network module is constructed and combined with the long-term classification prediction sub-network module and the generative multi-template matching module to compute the comprehensive predicted value,

otherwise, the long-term classification prediction sub-network module is combined with the regression prediction sub-network module to compute the comprehensive predicted value.