Background
With the development of video acquisition and image processing technologies, locating and detecting people in video group scenes has become a popular research direction in the field of computer vision. It has wide application value and is also the basis of higher-level computer vision problems such as dense crowd monitoring and social semantic analysis.
The task of locating and detecting group persons in video is not difficult for the human eye, which mainly distinguishes the positions of target persons by perceiving different color blocks. For a computer, however, an RGB matrix must be processed, the regions where the group persons are located must be segmented from the scene, and the influence of the background region on localization and detection must be reduced.
The development of video group person localization and detection algorithms has gone through several technological leaps, namely bounding-box regression, the rise of deep neural networks, the development of multi-reference windows, hard-sample mining and focusing, and multi-scale multi-port detection. According to the algorithm core, these algorithms can be divided into two types: localization and detection algorithms based on traditional hand-crafted features, and localization and detection algorithms based on deep learning. Before 2013, the localization and detection of people in videos or images was mainly based on traditional hand-crafted features. Limited by feature description and computing capacity, computer vision researchers designed diversified detection algorithms to compensate as far as possible for the deficiency of hand-designed features in image feature expression capacity, and used elaborate computing methods to accelerate detection models and reduce space-time consumption. Several representative hand-crafted feature detection algorithms appeared in this period: the Viola-Jones detector, the HOG detector, and the deformable parts model detector.
With the rise of deep neural networks, detection models based on deep learning have overcome the limited feature description of traditional hand-crafted feature detection algorithms: the representation of features, containing thousands of parameters, is learned automatically from big data, and new effective feature representations can be obtained quickly through training for new application scenarios. Detection models based on deep learning mainly follow two directions: region-nomination-based and end-to-end. A region-nomination-based detection model selects from the image to be detected a large number of region candidate boxes that may contain the target, extracts the features of each candidate box to obtain a feature vector, classifies the feature vectors to obtain category information, and finally performs position regression to obtain the corresponding coordinate information. End-to-end detection abandons candidate-box extraction; feature extraction, candidate-box regression and classification are completed directly in a convolutional network.
Group person behaviors are integrated and diverse, being the set of behavioral interactions between people and between people and the environment. Mutual occlusion between people, or between people and objects, therefore occurs easily while group behaviors take place, and interference from factors such as illumination changes arises during video imaging. Because of these interference factors, existing detection models based on deep learning cannot accurately locate person positions during detection, and may even miss persons altogether.
Disclosure of Invention
The purpose of the invention is as follows: in a group person scene, since multiple persons exist at the same time, each person needs to be accurately characterized in order to locate and detect the group effectively. Existing detection models based on deep learning generally adopt single-level top-level video features as the detection basis; although top-level video features contain rich video semantics, the regressed person positions are rough. In recent years, detection models using multi-level fused video features have also been proposed; these models fuse in bottom-level video features to improve detection accuracy, but use only a one-way fusion structure during feature fusion, so each level of the feature map contains only the feature information of the current and higher levels, the mapping results of all levels cannot be reflected, and the detection results cannot be optimized. To overcome the defects of the prior art, the invention provides a video group person localization and detection method based on multi-dimensional fusion features, which extracts multi-level video features and fuses them through a bidirectional processing channel to form multi-dimensional fusion features. It can effectively utilize the feature information of all levels to obtain rich video semantic information, thereby describing person features in the video more comprehensively, while performing multi-task prediction operations in parallel, effectively improving the speed of group person localization detection with good accuracy and practicality.
The technical scheme is as follows: in order to achieve the purpose, the technical scheme provided by the invention is as follows:
a video group person positioning detection method based on multi-dimensional fusion features comprises the steps (1) to (8) which are sequentially executed:
(1) inputting a video serving as a training sample, wherein the types and positions of the objects in the video are known; carrying out size normalization on the video frame by frame and uniformly scaling each video frame to H × W, where H represents the height of the video frame and W represents its width;
(2) performing feature extraction on the video processed in step (1) frame by frame using the Inception V3 model to obtain the image features of each level of the video, forming a multi-level video feature map F′ = {F′_i | i = 1, 2, …, numF}, where F′_i denotes the i-th layer image feature, numF denotes the total number of layers of extracted video image features, F′_1 denotes the bottom-layer image feature, and F′_numF denotes the top-layer image feature;
(3) performing a feature fusion operation on the extracted multi-level video feature map F′, comprising the following steps (3-1) to (3-3), executed in sequence:
(3-1) adding a fusion channel from F′_numF to F′_1, and performing feature fusion on the multi-level video feature map F′ from the top-level feature downward to obtain a top-down video feature map F^top-down; the feature fusion method is: starting from the top-layer image feature F′_numF, traverse each layer of image features F′_i downward, and sequentially perform on F′_i a convolution operation with convolution kernel conv_1 and stride stride_1 and an upSample_1-times upsampling operation to obtain the i-th layer top-down feature F_i^top-down, finally obtaining F^top-down = {F_i^top-down | i = 1, 2, …, numF};
(3-2) adding a fusion channel from F_1^top-down to F_numF^top-down, and performing feature fusion on F^top-down from the bottom-layer feature upward to obtain a bottom-up video feature map F^bottom-up, where F_i^bottom-up denotes the i-th layer image feature of the bottom-up video feature map F^bottom-up; the feature fusion method comprises the following steps:
a. initializing i = 1;
b. computing the i-th layer fusion result: performing on F_i^bottom-up a convolution operation with convolution kernel conv_2 and stride stride_2, and fusing the obtained result with the corresponding layer of F^top-down to compute the next bottom-up feature;
c. updating i = i + 1;
d. executing steps b to c in a loop until i > numF, obtaining F^bottom-up = {F_i^bottom-up | i = 1, 2, …, numF};
(3-3) performing on each layer image feature F_i^bottom-up of the bottom-up video feature map F^bottom-up a convolution operation with convolution kernel conv_3 and stride stride_3, denoting the obtained result as F_i; all the obtained F_i form the multi-dimensional fusion feature map F, where F = {F_i | i = 1, 2, …, numF};
(4) inputting the multi-dimensional fusion feature map F into the region candidate network and outputting K detection targets, obtaining a target position set Box = {Box_j | j = 1, 2, …, K} and a corresponding person probability set Person = {Person_j | j = 1, 2, …, K}, where Box_j denotes the position of the j-th detection target and Person_j denotes the probability that the j-th detection target is a person, Person_j ∈ [0, 1]; the larger the value of Person_j, the higher the possibility that the detection target is a person;
(5) classifying the detection targets according to Person: setting the true classes of the K detection targets as PPerson = {PPerson_j | j = 1, 2, …, K}, and calculating the group person class loss function Loss_cls, where PPerson_j represents the true class of the j-th detection target and takes the value 0 or 1, PPerson_j = 0 indicating that the detection target is not a person and PPerson_j = 1 indicating that the detection target is a person;
(6) regressing the target positions according to Box and Person: setting the true positions of the K detection targets as BBox = {BBox_j | j = 1, 2, …, K}, and calculating the group person position loss function Loss_loc, where BBox_j represents the true position of the j-th detection target;
(7) calculating the group person localization detection loss value Loss by the formula Loss = Loss_cls + λ·Loss_loc. If Loss ≤ Loss_max, the region candidate network has been trained; output the region candidate network parameters and execute step (8). If Loss > Loss_max, update the parameters W of each layer of the region candidate network by stochastic gradient descent, i.e. W ← W − α·∂Loss/∂W, and then return to step (4) to perform person detection again. Loss_max is the preset maximum loss value of group person localization detection, λ is the balance factor between the position regression and person classification tasks, α is the learning rate of the stochastic gradient descent method, and ∂Loss/∂W denotes the partial derivative of the group person localization detection loss function with respect to the layer parameters;
(8) acquiring the video to be detected, and sequentially performing normalization, feature extraction and feature fusion on it to obtain its multi-dimensional fusion feature map F_new; inputting F_new into the region candidate network trained in step (7) to obtain the group person localization detection result in the new video.
Further, in step (1), H = 720 and W = 1280.
Further, in step (2), numF = 4.
Further, in step (3), conv_1 = 1, stride_1 = 1, upSample_1 = 2, conv_2 = 3, stride_2 = 2, conv_3 = 1, stride_3 = 1.
Further, in step (4), K = 12; in step (7), Loss_max = 0.5, λ = 1, α = 0.0001.
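For illustration only, the parameter values given in the above clauses can be collected into a single configuration object. The following Python sketch merely restates the disclosed settings; the class name DetectionConfig and the field names are illustrative choices, not identifiers from the disclosure.

```python
from dataclasses import dataclass

@dataclass
class DetectionConfig:
    # step (1): frame size after normalization
    H: int = 720
    W: int = 1280
    # step (2): number of feature levels extracted by Inception V3
    numF: int = 4
    # step (3): convolution / upsampling settings of the fusion channels
    conv1: int = 1        # kernel of the top-down convolution
    stride1: int = 1
    upSample1: int = 2    # upsampling factor of the top-down channel
    conv2: int = 3        # kernel of the bottom-up convolution
    stride2: int = 2
    conv3: int = 1        # kernel of the final anti-aliasing convolution
    stride3: int = 1
    # step (4): number of detection targets output by the region candidate network
    K: int = 12
    # step (7): training hyper-parameters
    loss_max: float = 0.5   # preset maximum loss value Loss_max
    lam: float = 1.0        # balance factor lambda between localization and classification
    alpha: float = 0.0001   # learning rate of stochastic gradient descent

cfg = DetectionConfig()
```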
Advantageous effects: compared with the prior art, the invention adopting the above technical scheme has the following technical effects:
the method extracts multi-level video description of the video, performs bidirectional feature processing, fuses multi-level video feature maps to obtain multi-dimensional fusion features, captures video candidate targets, and performs position regression and category classification on the candidate targets in parallel to complete video group figure positioning detection. The invention obtains rich video semantic information by fusing multi-level features, simultaneously performs multi-task prediction operation, effectively improves the speed of group figure positioning detection, and has good accuracy and implementation, particularly:
(1) the invention establishes a bidirectional feature processing channel from top to bottom and from bottom to top, fully excavates the semantic information of the video and improves the utilization rate of the hierarchical features.
(2) The method integrates the multi-dimensional video characteristics, organically combines the bottom-layer characteristics with accurate positions and the top-layer characteristics with rich semantics, and can better improve the detection accuracy.
(3) The invention processes a plurality of prediction tasks in parallel and sets the task balance factor, which is beneficial to establishing the most suitable detection model according to the scene characteristics.
Detailed Description
The technical scheme of the invention is further explained in detail below with reference to the drawings and specific embodiments:
example 1: fig. 1 is a flowchart of a method for detecting person positioning in a video group based on multi-dimensional fusion features according to this embodiment, which specifically includes the following steps:
Firstly, preprocessing: inputting a video serving as a training sample, wherein the types and positions of the objects in the video are known; carrying out size normalization on the video frame by frame and uniformly scaling each video frame to H × W, where H represents the height of the video frame and W represents its width. This preprocessing step is advantageous for subsequent detection; in this embodiment, H = 720 and W = 1280.
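As an illustration of this preprocessing step, the following Python sketch reads a video frame by frame with OpenCV and rescales every frame to H × W = 720 × 1280. The use of OpenCV and the function name normalize_video are assumptions made for the example only.

```python
import cv2

def normalize_video(path, H=720, W=1280):
    """Read a video frame by frame and rescale each frame to H x W."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # cv2.resize expects the target size as (width, height)
        frames.append(cv2.resize(frame, (W, H)))
    cap.release()
    return frames
```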
Secondly, feature extraction: performing feature extraction on the video processed in step (1) frame by frame using the Inception V3 model to obtain the image features of each level of the video, forming a multi-level video feature map F′ = {F′_i | i = 1, 2, …, numF}, where F′_i denotes the i-th layer image feature, numF denotes the total number of layers of extracted video image features, F′_1 denotes the bottom-layer image feature, and F′_numF denotes the top-layer image feature; in this embodiment, numF = 4.
The position information of targets in the bottom-layer features is accurate and detailed localization data can be regressed from them, but they represent less semantic information, are large in data volume, and require a large amount of space-time consumption during operation and processing. The top-level features contain rich semantics, but after multi-layer processing the target positions are rough, the regressed target positions are not fine, and misjudgment is easily caused in a group person scene. The features of each layer therefore have their own advantages and disadvantages; in order to extract accurate group person localization information in a group person scene, the Inception V3 model is used to extract image features at multiple levels of the video to form a multi-level feature map. The Inception V3 model is used in this step because this feature extraction model not only performs well but also has strong computational performance, which facilitates subsequent processing.
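A minimal sketch of multi-level feature extraction with Inception V3 is given below, using PyTorch/torchvision forward hooks. Which four Inception blocks serve as the levels F′_1 … F′_4 is an assumption made for illustration; the embodiment only states that numF = 4 levels are taken from the Inception V3 model.

```python
import torch
import torchvision

def build_feature_extractor():
    """Return a function that extracts numF = 4 intermediate Inception V3 feature maps.

    The mapping of F'_1 ... F'_4 to the blocks below is an assumption of this
    sketch (bottom to top: Mixed_5d, Mixed_6a, Mixed_6e, Mixed_7c).
    """
    backbone = torchvision.models.inception_v3(weights=None, aux_logits=True)
    backbone.eval()
    level_names = ["Mixed_5d", "Mixed_6a", "Mixed_6e", "Mixed_7c"]
    features = {}

    def make_hook(name):
        def hook(module, inputs, output):
            features[name] = output
        return hook

    for name in level_names:
        getattr(backbone, name).register_forward_hook(make_hook(name))

    def extract(frame_batch):
        """frame_batch: float tensor of shape (N, 3, H, W), e.g. H = 720, W = 1280."""
        features.clear()
        with torch.no_grad():
            backbone(frame_batch)  # outputs discarded; the hooks fill `features`
        return [features[name] for name in level_names]  # F'_1 (bottom) ... F'_4 (top)

    return extract
```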
Thirdly, feature fusion: performing a feature fusion operation on the extracted multi-level video feature map F′, comprising the following steps (3-1) to (3-3), executed in sequence:
(3-1) adding a fusion channel from F′_numF to F′_1, and performing feature fusion on the multi-level video feature map F′ from the top-level feature downward to obtain a top-down video feature map F^top-down; the feature fusion method is: starting from the top-layer image feature F′_numF, traverse each layer of image features F′_i downward, and sequentially perform on F′_i a convolution operation with convolution kernel conv_1 and stride stride_1 and an upSample_1-times upsampling operation to obtain the i-th layer top-down feature F_i^top-down, finally obtaining F^top-down = {F_i^top-down | i = 1, 2, …, numF};
(3-2) adding a fusion channel from F_1^top-down to F_numF^top-down, and performing feature fusion on F^top-down from the bottom-layer feature upward to obtain a bottom-up video feature map F^bottom-up, where F_i^bottom-up denotes the i-th layer image feature of the bottom-up video feature map F^bottom-up; the feature fusion method comprises the following steps:
a. initializing i = 1;
b. computing the i-th layer fusion result: performing on F_i^bottom-up a convolution operation with convolution kernel conv_2 and stride stride_2, and fusing the obtained result with the corresponding layer of F^top-down to compute the next bottom-up feature;
c. updating i = i + 1;
d. executing steps b to c in a loop until i > numF, obtaining F^bottom-up = {F_i^bottom-up | i = 1, 2, …, numF};
(3-3) performing on each layer image feature F_i^bottom-up of the bottom-up video feature map F^bottom-up a convolution operation with convolution kernel conv_3 and stride stride_3, denoting the obtained result as F_i; all the obtained F_i form the multi-dimensional fusion feature map F, where F = {F_i | i = 1, 2, …, numF}.
In step three, conv_1 = 1, stride_1 = 1, upSample_1 = 2, conv_2 = 3, stride_2 = 2, conv_3 = 1, stride_3 = 1.
The fusion of multi-layer features is not carried out arbitrarily: first, whether the sizes of the hierarchical features are consistent must be considered; second, the rationality of the hierarchical fusion must be considered, so that the detection effect is not reduced after fusion. The invention redesigns the existing feature fusion method. Each layer of the top-down structure contains the feature information of the current layer and the higher layers, and the optimal size of each layer can be adopted directly for detection. In order to reflect the mapping results of all the layer features and achieve the best detection effect, a bottom-up channel is additionally added, which connects the top-down processing results in reverse and makes more effective use of the bottom-layer position information. Finally, a convolution operation is applied to each fusion result to eliminate the aliasing effect of upsampling.
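The bidirectional fusion described above can be sketched in PyTorch as follows. The sketch assumes that every level of F′ has already been projected to a common channel width, uses element-wise addition as the fusion rule, and resizes with nearest-neighbour interpolation to align spatial sizes; the disclosure specifies only the kernels conv_1 to conv_3, the strides and the upSample_1 factor, so these additional choices are assumptions of the sketch. The final 1 × 1 convolution corresponds to the anti-aliasing convolution applied to each fusion result.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalFusion(nn.Module):
    """Top-down + bottom-up fusion producing the multi-dimensional feature map F.

    Assumes every input level already has `channels` channels; element-wise
    addition is used as the fusion rule for illustration.
    """

    def __init__(self, channels=256, numF=4):
        super().__init__()
        # step (3-1): conv_1 = 1, stride_1 = 1, upSample_1 = 2
        self.td_conv = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=1, stride=1) for _ in range(numF)])
        # step (3-2): conv_2 = 3, stride_2 = 2 on the bottom-up path
        self.bu_conv = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1) for _ in range(numF - 1)])
        # step (3-3): conv_3 = 1, stride_3 = 1 removes the aliasing of upsampling
        self.out_conv = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=1, stride=1) for _ in range(numF)])

    def forward(self, feats):                     # feats = [F'_1 (bottom), ..., F'_numF (top)]
        numF = len(feats)
        # (3-1) top-down channel: traverse from the top layer down to layer 1
        td = [None] * numF
        td[-1] = self.td_conv[-1](feats[-1])
        for i in range(numF - 2, -1, -1):
            up = F.interpolate(td[i + 1], size=feats[i].shape[-2:], mode="nearest")  # ~2x upsampling
            td[i] = self.td_conv[i](feats[i]) + up
        # (3-2) bottom-up channel: reconnect the top-down results in reverse order
        bu = [td[0]]
        for i in range(numF - 1):
            down = self.bu_conv[i](bu[-1])
            down = F.interpolate(down, size=td[i + 1].shape[-2:], mode="nearest")    # align spatial size
            bu.append(down + td[i + 1])
        # (3-3) final 1x1 convolution on every level -> multi-dimensional fusion map F
        return [conv(x) for conv, x in zip(self.out_conv, bu)]
```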
Fourthly, training a regional candidate network:
the regional candidate network is a commonly used target detection network, and the main functional module is as shown in fig. 2, and firstly, k rectangular windows are generated for each pixel point of the sliding window to meet the requirements of targets with different sizes, and then the position information of each rectangular window and the corresponding image characteristics are input into the network, and the operation of a classification layer and a regression layer is respectively performed for each rectangular window. The classification layer mainly judges the probability of the figure existing in the current rectangular window, and the parameters comprise figure weight parameters WPAnd background interference parameter WE. The regression layer mainly obtains coordinate information of a current rectangular window in an original size image, and the parameters comprise rectangular window coordinates and a width and height offset weight parameter Wx、Wy、WhAnd Ww. And sharing the setting and adjustment of all parameters in the whole process of training the regional candidate network.
The training process of the regional candidate network is as follows:
(4-1) inputting the multi-dimensional fusion feature map F into the region candidate network and outputting K detection targets, where K = 12, thereby obtaining a target position set Box = {Box_j | j = 1, 2, …, 12} and a corresponding person probability set Person = {Person_j | j = 1, 2, …, 12}, where Box_j denotes the position of the j-th detection target and Person_j denotes the probability that the j-th detection target is a person, Person_j ∈ [0, 1]; the larger the value of Person_j, the higher the possibility that the detection target is a person;
(4-2) classifying the detection targets according to Person: setting the true classes of the 12 detection targets as PPerson = {PPerson_j | j = 1, 2, …, 12}, and calculating the group person class loss function Loss_cls, where PPerson_j represents the true class of the j-th detection target and takes the value 0 or 1, PPerson_j = 0 indicating that the detection target is not a person and PPerson_j = 1 indicating that the detection target is a person;
(4-3) regressing the target positions according to Box and Person: setting the true positions of the 12 detection targets as BBox = {BBox_j | j = 1, 2, …, 12}, and calculating the group person position loss function Loss_loc, where BBox_j represents the true position of the j-th detection target;
(4-4) calculating the group person localization detection loss value Loss by the formula Loss = Loss_cls + λ·Loss_loc. If Loss ≤ Loss_max, the region candidate network has been trained; output the region candidate network parameters and execute step (8). If Loss > Loss_max, update the parameters W of each layer of the region candidate network by stochastic gradient descent, i.e. W ← W − α·∂Loss/∂W, and then return to step (4) to perform person detection again. Loss_max is the preset maximum loss value of group person localization detection, λ is the balance factor between the position regression and person classification tasks, α is the learning rate of the stochastic gradient descent method, and ∂Loss/∂W denotes the partial derivative of the group person localization detection loss function with respect to the layer parameters. In this embodiment, Loss_max = 0.5, λ = 1, α = 0.0001.
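The training criterion of steps (4-2) to (4-4) can be sketched as follows. Since the exact forms of Loss_cls and Loss_loc are not reproduced in this text, binary cross-entropy and smooth L1 are used here as commonly adopted stand-ins, and the parameter update is plain stochastic gradient descent with learning rate α; the function names and the model interface (returning the Person probabilities and Box positions) are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def group_person_loss(person_prob, pperson, box, bbox, lam=1.0):
    """Loss = Loss_cls + lambda * Loss_loc (surrogate forms, see lead-in).

    person_prob: predicted probabilities Person_j, shape (K,)
    pperson:     true classes PPerson_j in {0, 1}, shape (K,)
    box, bbox:   predicted and true positions, shape (K, 4)
    """
    loss_cls = F.binary_cross_entropy(person_prob, pperson.float())
    person_mask = pperson.bool()
    if person_mask.any():                          # regress only targets that really are persons
        loss_loc = F.smooth_l1_loss(box[person_mask], bbox[person_mask])
    else:
        loss_loc = box.sum() * 0.0
    return loss_cls + lam * loss_loc

def train_region_candidate_network(model, batches, loss_max=0.5, lam=1.0, alpha=1e-4):
    """Repeat steps (4-1) to (4-4) until Loss <= Loss_max (simplified loop)."""
    opt = torch.optim.SGD(model.parameters(), lr=alpha)
    for fused_features, pperson, bbox in batches:
        person_prob, box = model(fused_features)   # step (4-1): K candidate targets
        loss = group_person_loss(person_prob, pperson, box, bbox, lam)
        if loss.item() <= loss_max:                # step (4-4): training finished
            break
        opt.zero_grad()
        loss.backward()                            # partial derivatives of the detection loss
        opt.step()                                 # W <- W - alpha * dLoss/dW
    return model
```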
Fifthly, detecting the video to be detected by adopting the trained regional candidate network:
Acquiring the video to be detected, and sequentially performing normalization, feature extraction and feature fusion on it to obtain its multi-dimensional fusion feature map F_new; inputting F_new into the region candidate network trained in step (7) to obtain the group person localization detection result in the new video. The region candidate network is used for target detection, and, considering that group person scenes contain many people and complex tasks, the position regression and category classification operations are performed in parallel, thereby improving detection efficiency. In the classification process, since the detection target is definitely a person, the classification is binary, into person and non-person, which reduces the time wasted on detecting other classes, merges the real classification results, and improves classification accuracy. In the position regression process, in order to simplify the calculation, only the target positions of the person category are regressed, refining the regression task. During overall training, a task balance factor is added and the optimal task proportion is adjusted according to the scene type, completing the localization detection of persons in the video group.
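For completeness, the detection stage on a new video can be strung together from the illustrative sketches above; all class and function names are the hypothetical ones introduced earlier, not identifiers from the embodiment, and the 1 × 1 lateral convolutions mapping the Inception levels to the fusion width are an added assumption.

```python
import torch

frames = normalize_video("group_scene.mp4")            # step one: 720 x 1280 frames (hypothetical path)
extract = build_feature_extractor()                     # step two: Inception V3 levels
fusion = BidirectionalFusion(channels=256, numF=4)      # step three: bidirectional fusion
head = RegionCandidateHead(channels=256)                # region candidate network (trained weights assumed loaded)

laterals = None
for frame in frames:
    x = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    raw_levels = extract(x)
    if laterals is None:
        # 1x1 convolutions mapping each Inception level to the fusion width (an added assumption)
        laterals = torch.nn.ModuleList(
            [torch.nn.Conv2d(f.shape[1], 256, kernel_size=1) for f in raw_levels])
    levels = [lat(f) for lat, f in zip(laterals, raw_levels)]
    fused = fusion(levels)                              # multi-dimensional fusion feature map F_new
    person_prob, offsets = head(fused[0])               # person probabilities and position offsets
```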
Sixthly, simulation experiment:
In the performance test of the method, the currently common target detection methods Fast-RCNN, FPN and Mask-RCNN are selected as comparison methods, and the evaluation criterion is the detection accuracy under different IoU thresholds and different target sizes. IoU denotes the intersection-over-union of the detection result and the real result, IoU ∈ [0, 1]; the higher the IoU value, the closer the detection result is to the real result. During the test, IoU ≥ 0.5 is denoted as AP_50 and IoU ≥ 0.75 as AP_75. In the evaluation, the target size is divided into three categories, small, medium and large, denoted AP_S, AP_M and AP_L respectively. FIG. 3 shows a comparison of the detection accuracy of the present invention and of the comparison methods Fast-RCNN, FPN and Mask-RCNN. From the experimental results it can be found that, compared with the Fast-RCNN method using only single-level top-level features, the three methods using multi-level fused features obtain higher detection accuracy, which indicates that multi-level fused features have stronger feature expression capability than single-level top-level features. FPN and Mask-RCNN use only a one-way structure for fusion in the feature processing, whereas the present invention uses a bidirectional processing channel to obtain a more accurate detection effect; the experimental results also show that the method obtains better detection accuracy for different IoU thresholds and target sizes.
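The IoU criterion used in this evaluation can be made concrete with a small helper; the corner-coordinate box format (x1, y1, x2, y2) is an assumption of the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# a detection counts toward AP_50 when iou(pred, gt) >= 0.5 and toward AP_75 when >= 0.75
```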
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.