Background
With the development of video acquisition and image processing technologies, locating and detecting people in video group scenes has become a popular research direction in the field of computer vision. It has wide application value and is also the basis of higher-level computer vision problems such as dense crowd monitoring and social semantic analysis.
The task of locating and detecting group persons in video is not difficult for the human eye, which mainly distinguishes the positions of target persons by perceiving different color blocks. For a computer, however, an RGB matrix must be processed, the regions where the group persons are located must be segmented from the scene, and the influence of the background region on localization and detection must be reduced.
The development of video group person localization and detection algorithms has gone through several technological leaps, namely bounding-box regression, the rise of deep neural networks, the development of multi-reference windows, hard-sample mining and focusing, and multi-scale multi-port detection. According to the algorithm core, these algorithms can be divided into two types: localization and detection algorithms based on traditional hand-crafted features, and localization and detection algorithms based on deep learning. Before 2013, the localization and detection of people in videos or images was mainly based on traditional hand-crafted features. Limited by feature description and computing capacity, computer vision researchers designed diversified detection algorithms to compensate as far as possible for the deficiency of hand-designed features in image feature expression capacity, and used elaborate computing methods to accelerate detection models and reduce space-time consumption. Several representative hand-crafted feature detection algorithms appeared in this period: the Viola-Jones detector, the HOG detector, and the deformable parts model detector.
With the rise of deep neural networks, detection models based on deep learning have overcome the limited feature description of traditional hand-crafted feature detection algorithms: the representation of features, containing thousands of parameters, is learned automatically from big data, and new effective feature representations can be obtained quickly through training for new application scenarios. Detection models based on deep learning mainly follow two directions: region-nomination-based and end-to-end. A region-nomination-based detection model selects from the image to be detected a large number of region candidate boxes that may contain the target, extracts the features of each candidate box to obtain a feature vector, classifies the feature vectors to obtain category information, and finally performs position regression to obtain the corresponding coordinate information. End-to-end detection abandons candidate-box extraction; feature extraction, candidate-box regression and classification are completed directly in a convolutional network.
Group person behaviors are integrated and diverse, being the set of behavioral interactions between people and between people and the environment. Mutual occlusion between people, or between people and objects, therefore occurs easily while group behaviors take place, and interference from factors such as illumination changes arises during video imaging. Because of these interference factors, existing detection models based on deep learning cannot accurately locate person positions during detection, and may even miss persons altogether.
Disclosure of Invention
The purpose of the invention is as follows: in a group person scene, since multiple persons exist at the same time, each person needs to be accurately characterized in order to locate and detect the group effectively. Existing detection models based on deep learning generally adopt single-level top-level video features as the detection basis; although top-level video features contain rich video semantics, the regressed person positions are rough. In recent years, detection models using multi-level fused video features have also been proposed; these models fuse in bottom-level video features to improve detection accuracy, but use only a one-way fusion structure during feature fusion, so each level of the feature map contains only the feature information of the current and higher levels, the mapping results of all levels cannot be reflected, and the detection results cannot be optimized. To overcome the defects of the prior art, the invention provides a video group person localization and detection method based on multi-dimensional fusion features, which extracts multi-level video features and fuses them through a bidirectional processing channel to form multi-dimensional fusion features. It can effectively utilize the feature information of all levels to obtain rich video semantic information, thereby describing person features in the video more comprehensively, while performing multi-task prediction operations in parallel, effectively improving the speed of group person localization detection with good accuracy and practicality.
The technical scheme is as follows: in order to achieve the purpose, the technical scheme provided by the invention is as follows:
a video group person positioning detection method based on multi-dimensional fusion features comprises the steps (1) to (8) which are sequentially executed:
(1) inputting a video serving as a training sample, wherein the types and positions of the objects in the video are known; carrying out size normalization on the video frame by frame and uniformly scaling each video frame to H × W, where H represents the height of the video frame and W represents its width;
(2) performing feature extraction on the video processed in step (1) frame by frame using the Inception V3 model to obtain the image features of each level of the video, forming a multi-level video feature map F′ = {F′_i | i = 1, 2, …, numF}, where F′_i denotes the i-th layer image feature, numF denotes the total number of layers of extracted video image features, F′_1 denotes the bottom-layer image feature, and F′_numF denotes the top-layer image feature;
(3) performing a feature fusion operation on the extracted multi-level video feature map F′, comprising the following steps (3-1) to (3-3), executed in sequence:
(3-1) adding a fusion channel from F′_numF to F′_1, and performing feature fusion on the multi-level video feature map F′ from the top-level feature downward to obtain a top-down video feature map F^top-down; the feature fusion method is: starting from the top-layer image feature F′_numF, traverse each layer of image features F′_i downward, and sequentially perform on F′_i a convolution operation with convolution kernel conv_1 and stride stride_1 and an upSample_1-times upsampling operation to obtain the i-th layer top-down feature F_i^top-down, finally obtaining F^top-down = {F_i^top-down | i = 1, 2, …, numF};
(3-2) adding a fusion channel from F_1^top-down to F_numF^top-down, and performing feature fusion on F^top-down from the bottom-layer feature upward to obtain a bottom-up video feature map F^bottom-up, where F_i^bottom-up denotes the i-th layer image feature of the bottom-up video feature map F^bottom-up; the feature fusion method comprises the following steps:
a. initializing i = 1;
b. computing the i-th layer fusion result: performing on F_i^bottom-up a convolution operation with convolution kernel conv_2 and stride stride_2, and fusing the obtained result with the corresponding layer of F^top-down to compute the next bottom-up feature;
c. updating i = i + 1;
d. executing steps b to c in a loop until i > numF, obtaining F^bottom-up = {F_i^bottom-up | i = 1, 2, …, numF};
(3-3) performing on each layer image feature F_i^bottom-up of the bottom-up video feature map F^bottom-up a convolution operation with convolution kernel conv_3 and stride stride_3, denoting the obtained result as F_i; all the obtained F_i form the multi-dimensional fusion feature map F, where F = {F_i | i = 1, 2, …, numF};
(4) inputting the multi-dimensional fusion feature map F into the region candidate network and outputting K detection targets, obtaining a target position set Box = {Box_j | j = 1, 2, …, K} and a corresponding person probability set Person = {Person_j | j = 1, 2, …, K}, where Box_j denotes the position of the j-th detection target and Person_j denotes the probability that the j-th detection target is a person, Person_j ∈ [0, 1]; the larger the value of Person_j, the higher the possibility that the detection target is a person;
(5) classifying the detection targets according to Person: setting the true classes of the K detection targets as PPerson = {PPerson_j | j = 1, 2, …, K}, and calculating the group person class loss function Loss_cls, where PPerson_j represents the true class of the j-th detection target and takes the value 0 or 1, PPerson_j = 0 indicating that the detection target is not a person and PPerson_j = 1 indicating that the detection target is a person;
(6) regressing the target positions according to Box and Person: setting the true positions of the K detection targets as BBox = {BBox_j | j = 1, 2, …, K}, and calculating the group person position loss function Loss_loc, where BBox_j represents the true position of the j-th detection target;
(7) calculating the group person localization detection loss value Loss by the formula Loss = Loss_cls + λ·Loss_loc. If Loss ≤ Loss_max, the region candidate network has been trained; output the region candidate network parameters and execute step (8). If Loss > Loss_max, update the parameters W of each layer of the region candidate network by stochastic gradient descent, i.e. W ← W − α·∂Loss/∂W, and then return to step (4) to perform person detection again. Loss_max is the preset maximum loss value of group person localization detection, λ is the balance factor between the position regression and person classification tasks, α is the learning rate of the stochastic gradient descent method, and ∂Loss/∂W denotes the partial derivative of the group person localization detection loss function with respect to the layer parameters;
(8) acquiring the video to be detected, and sequentially performing normalization, feature extraction and feature fusion on it to obtain its multi-dimensional fusion feature map F_new; inputting F_new into the region candidate network trained in step (7) to obtain the group person localization detection result in the new video.
Further, in step (1), H = 720 and W = 1280.
Further, in step (2), numF = 4.
Further, in step (3), conv_1 = 1, stride_1 = 1, upSample_1 = 2, conv_2 = 3, stride_2 = 2, conv_3 = 1, stride_3 = 1.
Further, in step (4), K = 12; in step (7), Loss_max = 0.5, λ = 1, α = 0.0001.
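For illustration only, the parameter values given in the above clauses can be collected into a single configuration object. The following Python sketch merely restates the disclosed settings; the class name DetectionConfig and the field names are illustrative choices, not identifiers from the disclosure.

```python
from dataclasses import dataclass

@dataclass
class DetectionConfig:
    # step (1): frame size after normalization
    H: int = 720
    W: int = 1280
    # step (2): number of feature levels extracted by Inception V3
    numF: int = 4
    # step (3): convolution / upsampling settings of the fusion channels
    conv1: int = 1        # kernel of the top-down convolution
    stride1: int = 1
    upSample1: int = 2    # upsampling factor of the top-down channel
    conv2: int = 3        # kernel of the bottom-up convolution
    stride2: int = 2
    conv3: int = 1        # kernel of the final anti-aliasing convolution
    stride3: int = 1
    # step (4): number of detection targets output by the region candidate network
    K: int = 12
    # step (7): training hyper-parameters
    loss_max: float = 0.5   # preset maximum loss value Loss_max
    lam: float = 1.0        # balance factor lambda between localization and classification
    alpha: float = 0.0001   # learning rate of stochastic gradient descent

cfg = DetectionConfig()
```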
Advantageous effects: compared with the prior art, the invention adopting the above technical scheme has the following technical effects:
the method extracts multi-level video description of the video, performs bidirectional feature processing, fuses multi-level video feature maps to obtain multi-dimensional fusion features, captures video candidate targets, and performs position regression and category classification on the candidate targets in parallel to complete video group figure positioning detection. The invention obtains rich video semantic information by fusing multi-level features, simultaneously performs multi-task prediction operation, effectively improves the speed of group figure positioning detection, and has good accuracy and implementation, particularly:
(1) the invention establishes a bidirectional feature processing channel from top to bottom and from bottom to top, fully excavates the semantic information of the video and improves the utilization rate of the hierarchical features.
(2) The method integrates the multi-dimensional video characteristics, organically combines the bottom-layer characteristics with accurate positions and the top-layer characteristics with rich semantics, and can better improve the detection accuracy.
(3) The invention processes a plurality of prediction tasks in parallel and sets the task balance factor, which is beneficial to establishing the most suitable detection model according to the scene characteristics.
Detailed Description
The technical scheme of the invention is further explained in detail below with reference to the drawings and specific embodiments:
example 1: fig. 1 is a flowchart of a method for detecting person positioning in a video group based on multi-dimensional fusion features according to this embodiment, which specifically includes the following steps:
Firstly, preprocessing: inputting a video serving as a training sample, wherein the types and positions of the objects in the video are known; carrying out size normalization on the video frame by frame and uniformly scaling each video frame to H × W, where H represents the height of the video frame and W represents its width. This preprocessing step is advantageous for subsequent detection; in this embodiment, H = 720 and W = 1280.
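As an illustration of this preprocessing step, the following Python sketch reads a video frame by frame with OpenCV and rescales every frame to H × W = 720 × 1280. The use of OpenCV and the function name normalize_video are assumptions made for the example only.

```python
import cv2

def normalize_video(path, H=720, W=1280):
    """Read a video frame by frame and rescale each frame to H x W."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # cv2.resize expects the target size as (width, height)
        frames.append(cv2.resize(frame, (W, H)))
    cap.release()
    return frames
```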
Secondly, feature extraction: performing feature extraction on the video processed in step (1) frame by frame using the Inception V3 model to obtain the image features of each level of the video, forming a multi-level video feature map F′ = {F′_i | i = 1, 2, …, numF}, where F′_i denotes the i-th layer image feature, numF denotes the total number of layers of extracted video image features, F′_1 denotes the bottom-layer image feature, and F′_numF denotes the top-layer image feature; in this embodiment, numF = 4.
The position information of targets in the bottom-layer features is accurate and detailed localization data can be regressed from them, but they represent less semantic information, are large in data volume, and require a large amount of space-time consumption during operation and processing. The top-level features contain rich semantics, but after multi-layer processing the target positions are rough, the regressed target positions are not fine, and misjudgment is easily caused in a group person scene. The features of each layer therefore have their own advantages and disadvantages; in order to extract accurate group person localization information in a group person scene, the Inception V3 model is used to extract image features at multiple levels of the video to form a multi-level feature map. The Inception V3 model is used in this step because this feature extraction model not only performs well but also has strong computational performance, which facilitates subsequent processing.
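A minimal sketch of multi-level feature extraction with Inception V3 is given below, using PyTorch/torchvision forward hooks. Which four Inception blocks serve as the levels F′_1 … F′_4 is an assumption made for illustration; the embodiment only states that numF = 4 levels are taken from the Inception V3 model.

```python
import torch
import torchvision

def build_feature_extractor():
    """Return a function that extracts numF = 4 intermediate Inception V3 feature maps.

    The mapping of F'_1 ... F'_4 to the blocks below is an assumption of this
    sketch (bottom to top: Mixed_5d, Mixed_6a, Mixed_6e, Mixed_7c).
    """
    backbone = torchvision.models.inception_v3(weights=None, aux_logits=True)
    backbone.eval()
    level_names = ["Mixed_5d", "Mixed_6a", "Mixed_6e", "Mixed_7c"]
    features = {}

    def make_hook(name):
        def hook(module, inputs, output):
            features[name] = output
        return hook

    for name in level_names:
        getattr(backbone, name).register_forward_hook(make_hook(name))

    def extract(frame_batch):
        """frame_batch: float tensor of shape (N, 3, H, W), e.g. H = 720, W = 1280."""
        features.clear()
        with torch.no_grad():
            backbone(frame_batch)  # outputs discarded; the hooks fill `features`
        return [features[name] for name in level_names]  # F'_1 (bottom) ... F'_4 (top)

    return extract
```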
Thirdly, feature fusion: performing a feature fusion operation on the extracted multi-level video feature map F′, comprising the following steps (3-1) to (3-3), executed in sequence:
(3-1) adding a fusion channel from F′_numF to F′_1, and performing feature fusion on the multi-level video feature map F′ from the top-level feature downward to obtain a top-down video feature map F^top-down; the feature fusion method is: starting from the top-layer image feature F′_numF, traverse each layer of image features F′_i downward, and sequentially perform on F′_i a convolution operation with convolution kernel conv_1 and stride stride_1 and an upSample_1-times upsampling operation to obtain the i-th layer top-down feature F_i^top-down, finally obtaining F^top-down = {F_i^top-down | i = 1, 2, …, numF};
(3-2) adding a fusion channel from F_1^top-down to F_numF^top-down, and performing feature fusion on F^top-down from the bottom-layer feature upward to obtain a bottom-up video feature map F^bottom-up, where F_i^bottom-up denotes the i-th layer image feature of the bottom-up video feature map F^bottom-up; the feature fusion method comprises the following steps:
a. initializing i = 1;
b. computing the i-th layer fusion result: performing on F_i^bottom-up a convolution operation with convolution kernel conv_2 and stride stride_2, and fusing the obtained result with the corresponding layer of F^top-down to compute the next bottom-up feature;
c. updating i = i + 1;
d. executing steps b to c in a loop until i > numF, obtaining F^bottom-up = {F_i^bottom-up | i = 1, 2, …, numF};
(3-3) performing on each layer image feature F_i^bottom-up of the bottom-up video feature map F^bottom-up a convolution operation with convolution kernel conv_3 and stride stride_3, denoting the obtained result as F_i; all the obtained F_i form the multi-dimensional fusion feature map F, where F = {F_i | i = 1, 2, …, numF}.
In step three, conv_1 = 1, stride_1 = 1, upSample_1 = 2, conv_2 = 3, stride_2 = 2, conv_3 = 1, stride_3 = 1.
The fusion of multi-layer features is not carried out arbitrarily: first, whether the sizes of the hierarchical features are consistent must be considered; second, the rationality of the hierarchical fusion must be considered, so that the detection effect is not reduced after fusion. The invention redesigns the existing feature fusion method. Each layer of the top-down structure contains the feature information of the current layer and the higher layers, and the optimal size of each layer can be adopted directly for detection. In order to reflect the mapping results of all the layer features and achieve the best detection effect, a bottom-up channel is additionally added, which connects the top-down processing results in reverse and makes more effective use of the bottom-layer position information. Finally, a convolution operation is applied to each fusion result to eliminate the aliasing effect of upsampling.
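The bidirectional fusion described above can be sketched in PyTorch as follows. The sketch assumes that every level of F′ has already been projected to a common channel width, uses element-wise addition as the fusion rule, and resizes with nearest-neighbour interpolation to align spatial sizes; the disclosure specifies only the kernels conv_1 to conv_3, the strides and the upSample_1 factor, so these additional choices are assumptions of the sketch. The final 1 × 1 convolution corresponds to the anti-aliasing convolution applied to each fusion result.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalFusion(nn.Module):
    """Top-down + bottom-up fusion producing the multi-dimensional feature map F.

    Assumes every input level already has `channels` channels; element-wise
    addition is used as the fusion rule for illustration.
    """

    def __init__(self, channels=256, numF=4):
        super().__init__()
        # step (3-1): conv_1 = 1, stride_1 = 1, upSample_1 = 2
        self.td_conv = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=1, stride=1) for _ in range(numF)])
        # step (3-2): conv_2 = 3, stride_2 = 2 on the bottom-up path
        self.bu_conv = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1) for _ in range(numF - 1)])
        # step (3-3): conv_3 = 1, stride_3 = 1 removes the aliasing of upsampling
        self.out_conv = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=1, stride=1) for _ in range(numF)])

    def forward(self, feats):                     # feats = [F'_1 (bottom), ..., F'_numF (top)]
        numF = len(feats)
        # (3-1) top-down channel: traverse from the top layer down to layer 1
        td = [None] * numF
        td[-1] = self.td_conv[-1](feats[-1])
        for i in range(numF - 2, -1, -1):
            up = F.interpolate(td[i + 1], size=feats[i].shape[-2:], mode="nearest")  # ~2x upsampling
            td[i] = self.td_conv[i](feats[i]) + up
        # (3-2) bottom-up channel: reconnect the top-down results in reverse order
        bu = [td[0]]
        for i in range(numF - 1):
            down = self.bu_conv[i](bu[-1])
            down = F.interpolate(down, size=td[i + 1].shape[-2:], mode="nearest")    # align spatial size
            bu.append(down + td[i + 1])
        # (3-3) final 1x1 convolution on every level -> multi-dimensional fusion map F
        return [conv(x) for conv, x in zip(self.out_conv, bu)]
```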
Fourthly, training a regional candidate network:
the regional candidate network is a commonly used target detection network, and the main functional module is as shown in fig. 2, and firstly, k rectangular windows are generated for each pixel point of the sliding window to meet the requirements of targets with different sizes, and then the position information of each rectangular window and the corresponding image characteristics are input into the network, and the operation of a classification layer and a regression layer is respectively performed for each rectangular window. The classification layer mainly judges the probability of the figure existing in the current rectangular window, and the parameters comprise figure weight parameters WPAnd background interference parameter WE. The regression layer mainly obtains coordinate information of a current rectangular window in an original size image, and the parameters comprise rectangular window coordinates and a width and height offset weight parameter Wx、Wy、WhAnd Ww. And sharing the setting and adjustment of all parameters in the whole process of training the regional candidate network.
The training process of the regional candidate network is as follows:
(4-1) inputting the multi-dimensional fusion feature map F into the region candidate network and outputting K detection targets, where K = 12, thereby obtaining a target position set Box = {Box_j | j = 1, 2, …, 12} and a corresponding person probability set Person = {Person_j | j = 1, 2, …, 12}, where Box_j denotes the position of the j-th detection target and Person_j denotes the probability that the j-th detection target is a person, Person_j ∈ [0, 1]; the larger the value of Person_j, the higher the possibility that the detection target is a person;
(4-2) classifying the detection targets according to Person: setting the true classes of the 12 detection targets as PPerson = {PPerson_j | j = 1, 2, …, 12}, and calculating the group person class loss function Loss_cls, where PPerson_j represents the true class of the j-th detection target and takes the value 0 or 1, PPerson_j = 0 indicating that the detection target is not a person and PPerson_j = 1 indicating that the detection target is a person;
(4-3) regressing the target positions according to Box and Person: setting the true positions of the 12 detection targets as BBox = {BBox_j | j = 1, 2, …, 12}, and calculating the group person position loss function Loss_loc, where BBox_j represents the true position of the j-th detection target;
(4-4) calculating the group person localization detection loss value Loss by the formula Loss = Loss_cls + λ·Loss_loc. If Loss ≤ Loss_max, the region candidate network has been trained; output the region candidate network parameters and execute step (8). If Loss > Loss_max, update the parameters W of each layer of the region candidate network by stochastic gradient descent, i.e. W ← W − α·∂Loss/∂W, and then return to step (4) to perform person detection again. Loss_max is the preset maximum loss value of group person localization detection, λ is the balance factor between the position regression and person classification tasks, α is the learning rate of the stochastic gradient descent method, and ∂Loss/∂W denotes the partial derivative of the group person localization detection loss function with respect to the layer parameters. In this embodiment, Loss_max = 0.5, λ = 1, α = 0.0001.
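The training criterion of steps (4-2) to (4-4) can be sketched as follows. Since the exact forms of Loss_cls and Loss_loc are not reproduced in this text, binary cross-entropy and smooth L1 are used here as commonly adopted stand-ins, and the parameter update is plain stochastic gradient descent with learning rate α; the function names and the model interface (returning the Person probabilities and Box positions) are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def group_person_loss(person_prob, pperson, box, bbox, lam=1.0):
    """Loss = Loss_cls + lambda * Loss_loc (surrogate forms, see lead-in).

    person_prob: predicted probabilities Person_j, shape (K,)
    pperson:     true classes PPerson_j in {0, 1}, shape (K,)
    box, bbox:   predicted and true positions, shape (K, 4)
    """
    loss_cls = F.binary_cross_entropy(person_prob, pperson.float())
    person_mask = pperson.bool()
    if person_mask.any():                          # regress only targets that really are persons
        loss_loc = F.smooth_l1_loss(box[person_mask], bbox[person_mask])
    else:
        loss_loc = box.sum() * 0.0
    return loss_cls + lam * loss_loc

def train_region_candidate_network(model, batches, loss_max=0.5, lam=1.0, alpha=1e-4):
    """Repeat steps (4-1) to (4-4) until Loss <= Loss_max (simplified loop)."""
    opt = torch.optim.SGD(model.parameters(), lr=alpha)
    for fused_features, pperson, bbox in batches:
        person_prob, box = model(fused_features)   # step (4-1): K candidate targets
        loss = group_person_loss(person_prob, pperson, box, bbox, lam)
        if loss.item() <= loss_max:                # step (4-4): training finished
            break
        opt.zero_grad()
        loss.backward()                            # partial derivatives of the detection loss
        opt.step()                                 # W <- W - alpha * dLoss/dW
    return model
```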
Fifthly, detecting the video to be detected by adopting the trained regional candidate network:
Acquiring the video to be detected, and sequentially performing normalization, feature extraction and feature fusion on it to obtain its multi-dimensional fusion feature map F_new; inputting F_new into the region candidate network trained in step (7) to obtain the group person localization detection result in the new video. The region candidate network is used for target detection, and, considering that group person scenes contain many people and complex tasks, the position regression and category classification operations are performed in parallel, thereby improving detection efficiency. In the classification process, since the detection target is definitely a person, the classification is binary, into person and non-person, which reduces the time wasted on detecting other classes, merges the real classification results, and improves classification accuracy. In the position regression process, in order to simplify the calculation, only the target positions of the person category are regressed, refining the regression task. During overall training, a task balance factor is added and the optimal task proportion is adjusted according to the scene type, completing the localization detection of persons in the video group.
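For completeness, the detection stage on a new video can be strung together from the illustrative sketches above; all class and function names are the hypothetical ones introduced earlier, not identifiers from the embodiment, and the 1 × 1 lateral convolutions mapping the Inception levels to the fusion width are an added assumption.

```python
import torch

frames = normalize_video("group_scene.mp4")            # step one: 720 x 1280 frames (hypothetical path)
extract = build_feature_extractor()                     # step two: Inception V3 levels
fusion = BidirectionalFusion(channels=256, numF=4)      # step three: bidirectional fusion
head = RegionCandidateHead(channels=256)                # region candidate network (trained weights assumed loaded)

laterals = None
for frame in frames:
    x = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    raw_levels = extract(x)
    if laterals is None:
        # 1x1 convolutions mapping each Inception level to the fusion width (an added assumption)
        laterals = torch.nn.ModuleList(
            [torch.nn.Conv2d(f.shape[1], 256, kernel_size=1) for f in raw_levels])
    levels = [lat(f) for lat, f in zip(laterals, raw_levels)]
    fused = fusion(levels)                              # multi-dimensional fusion feature map F_new
    person_prob, offsets = head(fused[0])               # person probabilities and position offsets
```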
Sixthly, simulation experiment:
In the performance test of the method, the currently common target detection methods Fast-RCNN, FPN and Mask-RCNN are selected as comparison methods, and the evaluation criterion is the detection accuracy under different IoU thresholds and different target sizes. IoU denotes the intersection-over-union of the detection result and the real result, IoU ∈ [0, 1]; the higher the IoU value, the closer the detection result is to the real result. During the test, IoU ≥ 0.5 is denoted as AP_50 and IoU ≥ 0.75 as AP_75. In the evaluation, the target size is divided into three categories, small, medium and large, denoted AP_S, AP_M and AP_L respectively. FIG. 3 shows a comparison of the detection accuracy of the present invention and of the comparison methods Fast-RCNN, FPN and Mask-RCNN. From the experimental results it can be found that, compared with the Fast-RCNN method using only single-level top-level features, the three methods using multi-level fused features obtain higher detection accuracy, which indicates that multi-level fused features have stronger feature expression capability than single-level top-level features. FPN and Mask-RCNN use only a one-way structure for fusion in the feature processing, whereas the present invention uses a bidirectional processing channel to obtain a more accurate detection effect; the experimental results also show that the method obtains better detection accuracy for different IoU thresholds and target sizes.
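The IoU criterion used in this evaluation can be made concrete with a small helper; the corner-coordinate box format (x1, y1, x2, y2) is an assumption of the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# a detection counts toward AP_50 when iou(pred, gt) >= 0.5 and toward AP_75 when >= 0.75
```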
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.