[go: up one dir, main page]

CN109903339B - A video group person location detection method based on multi-dimensional fusion features

Info

Publication number
CN109903339B
CN109903339B
Authority
CN
China
Prior art keywords
video
feature
person
detection
loss
Prior art date
Legal status
Active
Application number
CN201910235608.5A
Other languages
Chinese (zh)
Other versions
CN109903339A (en)
Inventor
陈志
掌静
岳文静
周传
陈璐
刘玲
任杰
周松颖
江婧
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201910235608.5A priority Critical patent/CN109903339B/en
Publication of CN109903339A publication Critical patent/CN109903339A/en
Application granted granted Critical
Publication of CN109903339B publication Critical patent/CN109903339B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a video group person location detection method based on multi-dimensional fusion features. The invention first extracts multi-level video feature maps and establishes top-down and bottom-up bidirectional feature processing channels to fully mine the semantic information of the video, then fuses the multi-level video feature maps to obtain multi-dimensional fusion features and captures video candidate targets, and finally processes the candidate target position regression and category classification in parallel to complete video group person location detection. The invention obtains rich video semantic information by fusing multi-level features and simultaneously performs multi-task prediction operations, thereby effectively improving the speed of group person location detection with good accuracy and practicability.

Description

Video group person location detection method based on multi-dimensional fusion features
Technical Field
The invention relates to the interdisciplinary field of computer vision and pattern recognition, and in particular to a video group person location detection method based on multi-dimensional fusion features.
Background
With the development of video acquisition and image processing technologies, locating and detecting people in video groups has become a popular research direction in computer vision. It has wide application value and is also the basis of higher-level computer vision problems such as dense crowd monitoring and social semantic analysis.
The task of locating people in a video group is not difficult for the human eye, which classifies and localizes target persons mainly by perceiving different color blocks. A computer, however, must process an RGB matrix, segment the regions occupied by the group persons from the scene, and reduce the influence of the background region on localization and detection.
The development of video group person localization detection algorithms has gone through several technological leaps: bounding-box regression, the rise of deep neural networks, the development of multi-reference windows, hard-sample mining and focusing, and multi-scale multi-port detection. The algorithms can be divided into two classes according to their core: localization detection based on traditional hand-crafted features, and localization detection based on deep learning. Before 2013, person localization detection in videos or images was mainly based on traditional hand-crafted features and was limited by feature description and computing capacity; computer vision researchers designed diverse detection algorithms to compensate for the limited expressive power of hand-designed features, and used elaborate computation methods to accelerate detection models and reduce space-time consumption. Several representative hand-crafted feature detectors appeared in this period: the Viola-Jones detector, the HOG detector, and the deformable part model detector.
With the rise of deep neural networks, detection models based on deep learning overcome the limited feature description of traditional hand-crafted feature detection algorithms: the feature representation, containing thousands of parameters, is learned automatically from big data, and new effective feature representations can be obtained quickly through training for new application scenes. Deep-learning-based detection models fall mainly into two directions, region-nomination-based and end-to-end. A region-nomination-based detection model selects a large number of region candidate boxes that may contain the target from the image to be detected, extracts the features of each candidate box to obtain a feature vector, classifies the feature vectors to obtain category information, and finally performs position regression to obtain the corresponding coordinate information. End-to-end detection abandons candidate-box extraction and completes feature extraction, candidate-box regression and classification directly within a convolutional network.
Group person behaviors are integrated and diverse, being the set of behavioral interactions between people and between people and the environment. During group behaviors, mutual occlusion between people, or between people and objects, occurs easily, and interference factors such as illumination changes arise during video imaging. Because of these interference factors, existing deep-learning-based detection models cannot accurately localize person positions during detection, and may even miss persons entirely.
Disclosure of Invention
The purpose of the invention is as follows: in a group person scene, multiple persons exist at the same time, so each person must be accurately characterized in order to effectively localize and detect the group. Existing deep-learning-based detection models generally adopt single-level top-level video features as the detection basis; although the top-level video features contain rich video semantics, the regressed person positions are coarse. In recent years, some detection models using multi-level fused video features have been proposed; these models fuse in bottom-level video features to improve detection accuracy, but they use only a one-way fusion structure, so each level of the feature map contains only the feature information of the current and higher levels, the mapping results of all levels cannot be reflected, and the detection result cannot be optimized. To overcome these defects of the prior art, the invention provides a video group person location detection method based on multi-dimensional fusion features: it extracts multi-level video features and fuses them through a bidirectional processing channel to form multi-dimensional fusion features, so that the feature information of all levels is used effectively and rich video semantic information is obtained, describing the person features in the video more comprehensively. At the same time, multi-task prediction operations are performed in parallel, effectively improving the speed of group person location detection with good accuracy and practicability.
The technical scheme is as follows: to achieve the above purpose, the invention provides the following technical scheme:
a video group person positioning detection method based on multi-dimensional fusion features comprises the steps (1) to (8) which are sequentially executed:
(1) inputting a video serving as a training sample, in which the types and positions of the objects are known, performing size normalization on the video frame by frame, and uniformly scaling each video frame to a size of H × W, where H denotes the frame height and W denotes the frame width;
(2) performing feature extraction on the video processed in step (1) frame by frame using the Inception V3 model to obtain the image features of each level of the video, forming a multi-level video feature map F' = {F_i' | i = 1, 2, …, numF}, where F_i' denotes the i-th layer image feature, numF denotes the total number of layers of extracted video image features, F_1' denotes the bottom-layer image feature, and F'_numF denotes the top-layer image feature;
(3) performing a feature fusion operation on the extracted multi-level video feature map F', comprising steps (3-1) to (3-3) executed in sequence:
(3-1) adding a fusion channel from F'_numF to F_1', and performing feature fusion on the multi-level video feature map F' from the top-level feature downward to obtain a top-down video feature map F^top-down; the feature fusion method is: starting from the top-layer image feature F'_numF, traverse each layer of image features F_i' downward, and for F_i' successively perform a convolution operation with kernel conv_1 and stride stride_1 and an upSample_1-times upsampling operation to obtain F_i^top-down, finally obtaining F^top-down = {F_i^top-down | i = 1, 2, …, numF};
(3-2) adding a fusion channel from F_1^top-down to F_numF^top-down, and performing feature fusion on F^top-down from the bottom-level feature upward to obtain a bottom-up video feature map F^bottom-up = {F_i^bottom-up | i = 1, 2, …, numF}, where F_i^bottom-up denotes the i-th layer image feature of the bottom-up video feature map F^bottom-up; the feature fusion method is:
a. initialize i = 1;
b. compute F_i^bottom-up = F_i^top-down; perform a convolution operation with kernel conv_2 and stride stride_2 on F_i^bottom-up to obtain an intermediate result, and (for i < numF) fuse this result into the next-level feature F_{i+1}^top-down;
c. update i = i + 1;
d. execute steps b to c in a loop until i > numF, obtaining:
F^bottom-up = {F_i^bottom-up | i = 1, 2, …, numF};
(3-3) performing a convolution operation with kernel conv_3 and stride stride_3 on each layer image feature F_i^bottom-up of the bottom-up video feature map F^bottom-up, denoting the result as F_i; all the obtained F_i form the multi-dimensional fusion feature map F = {F_i | i = 1, 2, …, numF};
(4) inputting the multi-dimensional fusion feature map F into the region candidate network and outputting K detection targets, obtaining a target position set Box = {Box_j | j = 1, 2, …, K} and a corresponding person probability set Person = {Person_j | j = 1, 2, …, K}, where Box_j denotes the position of the j-th detection target and Person_j denotes the probability that the j-th detection target is a person, Person_j ∈ [0, 1]; the larger the value of Person_j, the more likely the detection target is a person;
(5) classifying the detection targets according to Person, setting the true categories of the K detection targets as PPerson = {PPerson_j | j = 1, 2, …, K}, and calculating the group person category loss function Loss_cls from Person_j and PPerson_j, where PPerson_j denotes the true category of the j-th detection target, PPerson_j takes the value 0 or 1, PPerson_j = 0 indicates that the detection target is not a person, and PPerson_j = 1 indicates that the detection target is a person;
(6) regressing the target positions according to Box and Person, setting the true positions of the K detection targets as BBox = {BBox_j | j = 1, 2, …, K}, and calculating the group person position loss function Loss_loc from Box_j and BBox_j, where BBox_j denotes the true position of the j-th detection target;
(7) calculating the group person localization detection loss value Loss, where Loss = Loss_cls + λ·Loss_loc. If Loss ≤ Loss_max, the region candidate network has been trained; output the region candidate network parameters and execute step (8). If Loss > Loss_max, update each layer parameter W of the region candidate network as W ← W − α·∂Loss/∂W, then return to step (4) and perform person detection again. Loss_max is a preset maximum loss value of group person localization detection, λ is a balance factor between the position regression and person classification tasks, α is the learning rate of stochastic gradient descent, and ∂Loss/∂W denotes the partial derivative of the group person localization detection loss function with respect to the layer parameters;
(8) re-acquiring the video to be detected, and performing normalization, feature extraction and feature fusion on it in sequence to obtain the multi-dimensional fusion feature map F_new of the video to be detected; inputting F_new into the region candidate network trained in step (7) to obtain the group person localization detection result in the new video.
Further, in step (1), H is 720 and W is 1280.
Further, in the step (2), numF is 4.
Further, in step (3), conv_1 = 1, stride_1 = 1, upSample_1 = 2, conv_2 = 3, stride_2 = 2, conv_3 = 1, stride_3 = 1.
Further, in step (4), K = 12; in step (7), Loss_max = 0.5, λ = 1, α = 0.0001.
Advantageous effects: compared with the prior art, the invention adopting the above technical scheme has the following technical effects:
The method extracts a multi-level video description, performs bidirectional feature processing, fuses the multi-level video feature maps to obtain multi-dimensional fusion features, captures video candidate targets, and performs position regression and category classification on the candidate targets in parallel to complete video group person localization detection. The invention obtains rich video semantic information by fusing multi-level features and simultaneously performs multi-task prediction operations, effectively improving the speed of group person localization detection with good accuracy and practicability. Specifically:
(1) The invention establishes a bidirectional feature processing channel, top-down and bottom-up, which fully mines the semantic information of the video and improves the utilization of the hierarchical features.
(2) The method fuses multi-dimensional video features, organically combining the bottom-level features with accurate positions and the top-level features with rich semantics, which further improves detection accuracy.
(3) The invention processes multiple prediction tasks in parallel and sets a task balance factor, which helps establish the most suitable detection model according to the scene characteristics.
Drawings
FIG. 1 is a flow chart of a video group person positioning detection method based on multi-dimensional fusion features;
FIG. 2 is a block diagram of a regional candidate network in accordance with the present invention;
FIG. 3 is a comparison of the detection accuracy of different methods.
Detailed Description
The technical scheme of the invention is further explained in detail by combining the drawings and the specific embodiments:
Example 1: FIG. 1 is a flowchart of the video group person location detection method based on multi-dimensional fusion features of this embodiment, which specifically includes the following steps:
firstly, preprocessing: inputting a video serving as a training sample, wherein the type and the position of an object in the video are known, carrying out size normalization processing on the video frame by frame, and uniformly scaling the size of each frame of video frame into H multiplied by W, wherein H represents the height of the video frame, and W represents the width of the video frame; this step corresponds to preprocessing, which is advantageous for subsequent detection, in this embodiment, H is 720 and W is 1280.
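As an illustration of this preprocessing step, the following sketch resizes every frame of an input video to H × W = 720 × 1280; the use of OpenCV for frame decoding is an assumption, since the patent does not name any particular library.

```python
import cv2  # assumption: OpenCV is used for frame decoding; the patent names no library

H, W = 720, 1280  # target frame height and width in this embodiment

def normalize_video(path):
    """Read a video file and rescale every frame to H x W."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (W, H)))  # cv2.resize expects (width, height)
    cap.release()
    return frames
```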
Secondly, feature extraction: perform feature extraction on the video processed in step one frame by frame using the Inception V3 model to obtain the image features of each level of the video, forming a multi-level video feature map F' = {F_i' | i = 1, 2, …, numF}, where F_i' denotes the i-th layer image feature, numF denotes the total number of layers of extracted video image features, F_1' denotes the bottom-layer image feature, and F'_numF denotes the top-layer image feature; in this embodiment, numF = 4.
Bottom-level features carry accurate target position information and can regress detailed localization data, but they express less semantic information and are large in data volume, so processing them consumes a large amount of space and time. Top-level features contain rich semantics, but after multiple layers of processing the target positions become coarse, the regressed localization is not fine, and misjudgment easily occurs in a group person scene. Each level of features has its own advantages and disadvantages; to extract accurate group person localization information in a group person scene, the Inception V3 model is used to extract image features at multiple levels of the video to form a multi-level feature map. Inception V3 is used in this step because the model not only performs well in feature extraction but also has strong computational performance, which facilitates subsequent processing.
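A minimal sketch of this extraction step is given below, assuming a PyTorch/torchvision Inception V3 backbone; the four intermediate stages chosen here for F'_1 … F'_4 are an assumption, since the patent only fixes numF = 4 and does not name the Inception V3 layers used.

```python
import torch
from torchvision.models import inception_v3

# Assumed mapping of the four feature levels F'_1 ... F'_4 to Inception V3 stages
# (bottom level = high resolution / weak semantics, top level = low resolution / rich semantics).
LEVELS = {"Conv2d_4a_3x3": "F1", "Mixed_5d": "F2", "Mixed_6e": "F3", "Mixed_7c": "F4"}

backbone = inception_v3(weights="IMAGENET1K_V1").eval()
features = {}

def save_output(name):
    def hook(module, inputs, output):
        features[name] = output
    return hook

for layer_name, level in LEVELS.items():
    getattr(backbone, layer_name).register_forward_hook(save_output(level))

frame = torch.randn(1, 3, 720, 1280)  # one normalized video frame
with torch.no_grad():
    backbone(frame)                    # the hooks fill `features` with the multi-level map F'
for level, fmap in features.items():
    print(level, tuple(fmap.shape))
```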
Thirdly, feature fusion: perform a feature fusion operation on the extracted multi-level video feature map F', comprising steps (3-1) to (3-3) executed in sequence:
(3-1) add a fusion channel from F'_numF to F_1', and perform feature fusion on the multi-level video feature map F' from the top-level feature downward to obtain a top-down video feature map F^top-down. The feature fusion method is: starting from the top-layer image feature F'_numF, traverse each layer of image features F_i' downward, and for F_i' successively perform a convolution operation with kernel conv_1 and stride stride_1 and an upSample_1-times upsampling operation to obtain F_i^top-down, finally obtaining F^top-down = {F_i^top-down | i = 1, 2, …, numF};
(3-2) add a fusion channel from F_1^top-down to F_numF^top-down, and perform feature fusion on F^top-down from the bottom-level feature upward to obtain a bottom-up video feature map F^bottom-up = {F_i^bottom-up | i = 1, 2, …, numF}, where F_i^bottom-up denotes the i-th layer image feature of the bottom-up video feature map F^bottom-up. The feature fusion method is:
a. initialize i = 1;
b. compute F_i^bottom-up = F_i^top-down; perform a convolution operation with kernel conv_2 and stride stride_2 on F_i^bottom-up to obtain an intermediate result, and (for i < numF) fuse this result into the next-level feature F_{i+1}^top-down;
c. update i = i + 1;
d. execute steps b to c in a loop until i > numF, obtaining:
F^bottom-up = {F_i^bottom-up | i = 1, 2, …, numF};
(3-3) perform a convolution operation with kernel conv_3 and stride stride_3 on each layer image feature F_i^bottom-up of the bottom-up video feature map F^bottom-up, denoting the result as F_i; all the obtained F_i form the multi-dimensional fusion feature map F = {F_i | i = 1, 2, …, numF}.
In step three, conv_1 = 1, stride_1 = 1, upSample_1 = 2, conv_2 = 3, stride_2 = 2, conv_3 = 1, stride_3 = 1.
The fusion of multi-level features is not done arbitrarily: first, whether the sizes of the hierarchical features are consistent must be considered, and second, the rationality of fusing the hierarchical features must be considered, so that the detection effect is not reduced after fusion. The invention redesigns the existing feature fusion method. Each layer of the top-down structure contains the feature information of the current layer and of the higher layers, so the most suitable size of each layer can be used directly for detection. In order to reflect the mapping results of all hierarchical features and achieve the best detection effect, a bottom-up channel is additionally introduced to reversely connect the top-down processing results, so that the bottom-level position information is used more effectively. Finally, a convolution operation is applied to each fusion result to eliminate the aliasing effect of upsampling.
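The following sketch illustrates this bidirectional fusion with the kernel and stride settings of this embodiment (conv_1 = 1, stride_1 = 1, upSample_1 = 2, conv_2 = 3, stride_2 = 2, conv_3 = 1, stride_3 = 1); the common channel width of 256 and the use of element-wise addition to combine adjacent levels are assumptions beyond the textual description of steps (3-1) to (3-3).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalFusion(nn.Module):
    """Sketch of the top-down + bottom-up feature fusion; the channel width and the
    element-wise addition used to combine adjacent levels are assumptions."""

    def __init__(self, in_channels, channels=256):
        super().__init__()
        # conv_1 (1x1, stride 1): lateral convolutions feeding the top-down channel
        self.lateral = nn.ModuleList(nn.Conv2d(c, channels, 1) for c in in_channels)
        # conv_2 (3x3, stride 2): downsampling convolutions of the bottom-up channel
        self.down = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1) for _ in in_channels[:-1])
        # conv_3 (1x1, stride 1): final convolution producing each fused level F_i
        self.out = nn.ModuleList(nn.Conv2d(channels, channels, 1) for _ in in_channels)

    def forward(self, feats):                       # feats = [F'_1, ..., F'_numF]
        # top-down channel: from F'_numF to F'_1 with (roughly 2x) upsampling
        td = [None] * len(feats)
        td[-1] = self.lateral[-1](feats[-1])
        for i in range(len(feats) - 2, -1, -1):
            up = F.interpolate(td[i + 1], size=feats[i].shape[-2:], mode="nearest")
            td[i] = self.lateral[i](feats[i]) + up
        # bottom-up channel: reconnect F_1^top-down back up to the top level
        bu = [None] * len(feats)
        bu[0] = td[0]
        for i in range(len(feats) - 1):
            dn = self.down[i](bu[i])
            dn = F.interpolate(dn, size=td[i + 1].shape[-2:], mode="nearest")
            bu[i + 1] = td[i + 1] + dn
        # final convolution removes upsampling aliasing and yields F = {F_1, ..., F_numF}
        return [self.out[i](x) for i, x in enumerate(bu)]
```

With the four Inception V3 stages assumed in the earlier sketch, in_channels would be approximately [192, 288, 768, 2048]; the exact channel counts depend on the stages actually chosen.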
Fourthly, training a regional candidate network:
the regional candidate network is a commonly used target detection network, and the main functional module is as shown in fig. 2, and firstly, k rectangular windows are generated for each pixel point of the sliding window to meet the requirements of targets with different sizes, and then the position information of each rectangular window and the corresponding image characteristics are input into the network, and the operation of a classification layer and a regression layer is respectively performed for each rectangular window. The classification layer mainly judges the probability of the figure existing in the current rectangular window, and the parameters comprise figure weight parameters WPAnd background interference parameter WE. The regression layer mainly obtains coordinate information of a current rectangular window in an original size image, and the parameters comprise rectangular window coordinates and a width and height offset weight parameter Wx、Wy、WhAnd Ww. And sharing the setting and adjustment of all parameters in the whole process of training the regional candidate network.
The training process of the regional candidate network is as follows:
(4-1) input the multi-dimensional fusion feature map F into the region candidate network and output K detection targets, where K = 12, obtaining a target position set Box = {Box_j | j = 1, 2, …, 12} and a corresponding person probability set Person = {Person_j | j = 1, 2, …, 12}, where Box_j denotes the position of the j-th detection target and Person_j denotes the probability that the j-th detection target is a person, Person_j ∈ [0, 1]; the larger the value of Person_j, the more likely the detection target is a person;
(4-2) classify the detection targets according to Person, set the true categories of the 12 detection targets as PPerson = {PPerson_j | j = 1, 2, …, 12}, and calculate the group person category loss function Loss_cls from Person_j and PPerson_j, where PPerson_j denotes the true category of the j-th detection target, PPerson_j takes the value 0 or 1, PPerson_j = 0 indicates that the detection target is not a person, and PPerson_j = 1 indicates that the detection target is a person;
(4-3) regress the target positions according to Box and Person, set the true positions of the 12 detection targets as BBox = {BBox_j | j = 1, 2, …, 12}, and calculate the group person position loss function Loss_loc from Box_j and BBox_j, where BBox_j denotes the true position of the j-th detection target;
(4-4) calculate the group person localization detection loss value Loss, where Loss = Loss_cls + λ·Loss_loc. If Loss ≤ Loss_max, the region candidate network has been trained; output the region candidate network parameters and execute step (8). If Loss > Loss_max, update each layer parameter W of the region candidate network as W ← W − α·∂Loss/∂W, then return to step (4) and perform person detection again. Loss_max is the preset maximum loss value of group person localization detection, λ is the balance factor between the position regression and person classification tasks, α is the learning rate of stochastic gradient descent, and ∂Loss/∂W denotes the partial derivative of the group person localization detection loss function with respect to the layer parameters. In this embodiment, Loss_max = 0.5, λ = 1, α = 0.0001.
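One training iteration under this scheme might look like the following sketch. The concrete forms of the two losses (binary cross-entropy for the person/non-person classification and smooth-L1 for the position regression) are assumptions, while Loss = Loss_cls + λ·Loss_loc, the stopping threshold Loss_max = 0.5, λ = 1 and the SGD learning rate α = 0.0001 follow this embodiment.

```python
import torch
import torch.nn.functional as F

LAMBDA, ALPHA, LOSS_MAX = 1.0, 1e-4, 0.5  # λ, learning rate α and Loss_max of this embodiment

def train_step(network, optimizer, fused_map, true_cls, true_box):
    """One iteration of region candidate network training.
    true_cls holds PPerson_j in {0, 1}; true_box holds BBox_j for the K detection targets.
    `network` is assumed to return per-target person probabilities and box predictions."""
    person_prob, box_pred = network(fused_map)
    loss_cls = F.binary_cross_entropy(person_prob, true_cls)  # assumed form of Loss_cls
    loss_loc = F.smooth_l1_loss(box_pred, true_box)           # assumed form of Loss_loc
    loss = loss_cls + LAMBDA * loss_loc                       # Loss = Loss_cls + λ Loss_loc
    if loss.item() > LOSS_MAX:                                # keep training until Loss <= Loss_max
        optimizer.zero_grad()
        loss.backward()                                       # ∂Loss/∂W for every layer
        optimizer.step()                                      # SGD: W <- W - α ∂Loss/∂W
    return loss.item()

# optimizer = torch.optim.SGD(network.parameters(), lr=ALPHA)
```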
Fifthly, detecting the video to be detected by adopting the trained regional candidate network:
Re-acquire the video to be detected, and perform normalization, feature extraction and feature fusion on it in sequence to obtain the multi-dimensional fusion feature map F_new of the video to be detected; input F_new into the region candidate network trained in step (7) to obtain the group person localization detection result in the new video. Target detection is performed with the region candidate network; considering that group person scenes contain many people and complex tasks, the position regression and category classification operations are performed in parallel, which improves detection efficiency. In the classification process, because the detection target is definitely a person, the classification is reduced to two classes, person and non-person, which avoids the time wasted on detecting other classes and, combined with the true classification results, improves classification accuracy. In the position regression process, to simplify the computation, only the target positions of the person class are regressed, refining the regression task. During the overall training, a task balance factor is added and the optimal task proportion is adjusted according to the scene type, completing the video group person localization detection.
Sixthly, experimental simulation:
In the performance test of the method, the currently common target detection methods Faster-RCNN, FPN and Mask-RCNN are selected as comparison methods, and the evaluation criterion is the detection accuracy under different IoU thresholds and different target sizes. IoU denotes the intersection-over-union of the detection result and the ground truth, IoU ∈ [0, 1]; the higher the IoU value, the closer the detection result is to the ground truth. In the test, the accuracy at IoU ≥ 0.5 is recorded as AP_50 and that at IoU ≥ 0.75 as AP_75. In the evaluation, the target sizes are divided into three categories, namely small, medium and large, recorded as AP_S, AP_M and AP_L respectively. FIG. 3 shows a comparison of the detection accuracy of the present invention and the comparison methods Faster-RCNN, FPN and Mask-RCNN. The experimental results show that, compared with the Faster-RCNN method that uses only single-level top-level features, the three methods using multi-level fusion features achieve higher detection accuracy, indicating that multi-level fusion features have stronger expressive power than single-level top-level features. FPN and Mask-RCNN use only a one-way structure for fusion during feature processing, whereas the bidirectional processing channel obtains a more accurate detection effect; the experimental results also show that the present method achieves better detection accuracy for different IoU thresholds and target sizes.
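For reference, the IoU measure used in this evaluation can be computed as in the following sketch; the (x1, y1, x2, y2) corner representation of the boxes is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts toward AP_50 when iou(detection, ground_truth) >= 0.5
# and toward AP_75 when the overlap reaches 0.75.
```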
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (5)

1. A video group person location detection method based on multi-dimensional fusion features, characterized by comprising steps (1) to (8) executed in sequence:
(1) inputting a video as a training sample, in which the types and positions of the objects are known, performing size normalization on the video frame by frame, and uniformly scaling each video frame to a size of H × W, where H denotes the frame height and W denotes the frame width;
(2) performing feature extraction on the video processed in step (1) frame by frame using the Inception V3 model to obtain the image features of each level of the video, forming a multi-level video feature map F' = {F_i' | i = 1, 2, …, numF}, where F_i' denotes the i-th layer image feature, numF denotes the total number of layers of extracted video image features, F_1' denotes the bottom-layer image feature, and F'_numF denotes the top-layer image feature;
(3) performing a feature fusion operation on the extracted multi-level video feature map F', comprising steps (3-1) to (3-3) executed in sequence:
(3-1) adding a fusion channel from F'_numF to F_1', and performing feature fusion on the multi-level video feature map F' from the top-level feature downward to obtain a top-down video feature map F^top-down; the feature fusion method is: starting from the top-layer image feature F'_numF, traverse each layer of image features F_i' downward, and for F_i' successively perform a convolution operation with kernel conv_1 and stride stride_1 and an upSample_1-times upsampling operation to obtain F_i^top-down, finally obtaining F^top-down = {F_i^top-down | i = 1, 2, …, numF};
(3-2) adding a fusion channel from F_1^top-down to F_numF^top-down, and performing feature fusion on F^top-down from the bottom-level feature upward to obtain a bottom-up video feature map F^bottom-up = {F_i^bottom-up | i = 1, 2, …, numF}, where F_i^bottom-up denotes the i-th layer image feature of F^bottom-up; the feature fusion method is:
a. initialize i = 1;
b. compute F_i^bottom-up = F_i^top-down; perform a convolution operation with kernel conv_2 and stride stride_2 on F_i^bottom-up to obtain an intermediate result, and (for i < numF) fuse this result into the next-level feature F_{i+1}^top-down;
c. update i = i + 1;
d. execute steps b to c in a loop until i > numF; after the loop ends, obtain F^bottom-up = {F_i^bottom-up | i = 1, 2, …, numF};
(3-3) performing a convolution operation with kernel conv_3 and stride stride_3 on each layer image feature F_i^bottom-up of the bottom-up video feature map F^bottom-up, denoting the result as F_i; all the obtained F_i form the multi-dimensional fusion feature map F = {F_i | i = 1, 2, …, numF};
(4) inputting the multi-dimensional fusion feature map F into the region candidate network and outputting K detection targets, obtaining a target position set Box = {Box_j | j = 1, 2, …, K} and a corresponding person probability set Person = {Person_j | j = 1, 2, …, K}, where Box_j denotes the position of the j-th detection target and Person_j denotes the probability that the j-th detection target is a person, Person_j ∈ [0, 1]; the larger the value of Person_j, the more likely the detection target is a person;
(5) classifying the detection targets according to Person, setting the true categories of the K detection targets as PPerson = {PPerson_j | j = 1, 2, …, K}, and calculating the group person category loss function Loss_cls from Person_j and PPerson_j, where PPerson_j denotes the true category of the j-th detection target, PPerson_j takes the value 0 or 1, PPerson_j = 0 indicates that the detection target is not a person, and PPerson_j = 1 indicates that the detection target is a person;
(6) regressing the target positions according to Box and Person, setting the true positions of the K detection targets as BBox = {BBox_j | j = 1, 2, …, K}, and calculating the group person position loss function Loss_loc from Box_j and BBox_j, where BBox_j denotes the true position of the j-th detection target;
(7) calculating the group person localization detection loss value Loss, where Loss = Loss_cls + λ·Loss_loc; if Loss ≤ Loss_max, the region candidate network has been trained, output the region candidate network parameters and execute step (8); if Loss > Loss_max, update each layer parameter W of the region candidate network as W ← W − α·∂Loss/∂W, then return to step (4) and perform person detection again; Loss_max is a preset maximum loss value of group person localization detection, λ is a balance factor between the position regression and person classification tasks, α is the learning rate of stochastic gradient descent, and ∂Loss/∂W denotes the partial derivative of the group person localization detection loss function;
(8) re-acquiring the video to be detected, performing normalization, feature extraction and feature fusion on it in sequence to obtain the multi-dimensional fusion feature map F_new of the video to be detected, and inputting F_new into the region candidate network trained in step (7) to obtain the group person localization detection result in the video to be detected.
2. The video group person location detection method based on multi-dimensional fusion features according to claim 1, wherein in step (1), H = 720 and W = 1280.
3. The video group person location detection method based on multi-dimensional fusion features according to claim 1, wherein in step (2), numF = 4.
4. The video group person location detection method based on multi-dimensional fusion features according to claim 1, wherein in step (3), conv_1 = 1, stride_1 = 1, upSample_1 = 2, conv_2 = 3, stride_2 = 2, conv_3 = 1, stride_3 = 1.
5. The video group person location detection method based on multi-dimensional fusion features according to claim 1, wherein in step (4), K = 12; and in step (7), Loss_max = 0.5, λ = 1, α = 0.0001.
CN201910235608.5A 2019-03-26 2019-03-26 A video group person location detection method based on multi-dimensional fusion features Active CN109903339B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910235608.5A CN109903339B (en) 2019-03-26 2019-03-26 A video group person location detection method based on multi-dimensional fusion features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910235608.5A CN109903339B (en) 2019-03-26 2019-03-26 A video group person location detection method based on multi-dimensional fusion features

Publications (2)

Publication Number Publication Date
CN109903339A CN109903339A (en) 2019-06-18
CN109903339B true CN109903339B (en) 2021-03-05

Family

ID=66953909

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910235608.5A Active CN109903339B (en) 2019-03-26 2019-03-26 A video group person location detection method based on multi-dimensional fusion features

Country Status (1)

Country Link
CN (1) CN109903339B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110675391B (en) * 2019-09-27 2022-11-18 联想(北京)有限公司 Image processing method, apparatus, computing device, and medium
CN111488834B (en) * 2020-04-13 2023-07-04 河南理工大学 A crowd counting method based on multi-level feature fusion
CN111491180B (en) * 2020-06-24 2021-07-09 腾讯科技(深圳)有限公司 Method and device for determining key frame
CN113610056B (en) * 2021-08-31 2024-06-07 的卢技术有限公司 Obstacle detection method, obstacle detection device, electronic equipment and storage medium
CN114255384A (en) * 2021-12-14 2022-03-29 广东博智林机器人有限公司 A method, device, electronic device and storage medium for detecting number of people
CN114494999B (en) * 2022-01-18 2022-11-15 西南交通大学 Double-branch combined target intensive prediction method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341471A (en) * 2017-07-04 2017-11-10 南京邮电大学 A kind of Human bodys' response method based on Bilayer condition random field
CN108399435A (en) * 2018-03-21 2018-08-14 南京邮电大学 A kind of video classification methods based on sound feature
CN108846446A (en) * 2018-07-04 2018-11-20 国家新闻出版广电总局广播科学研究院 The object detection method of full convolutional network is merged based on multipath dense feature
CN108898078A (en) * 2018-06-15 2018-11-27 上海理工大学 A kind of traffic sign real-time detection recognition methods of multiple dimensioned deconvolution neural network
CN109472298A (en) * 2018-10-19 2019-03-15 天津大学 Deep Bidirectional Feature Pyramid Augmentation Network for Small-Scale Object Detection
CN109508686A (en) * 2018-11-26 2019-03-22 南京邮电大学 A kind of Human bodys' response method based on the study of stratification proper subspace

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8989442B2 (en) * 2013-04-12 2015-03-24 Toyota Motor Engineering & Manufacturing North America, Inc. Robust feature fusion for multi-view object tracking
CN108229319A (en) * 2017-11-29 2018-06-29 南京大学 The ship video detecting method merged based on frame difference with convolutional neural networks
CN108038867A (en) * 2017-12-22 2018-05-15 湖南源信光电科技股份有限公司 Fire defector and localization method based on multiple features fusion and stereoscopic vision

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Person Re-Identification Based on Multi-Level and Multi-Feature Fusion; Tan Feigang et al.; 2017 International Conference on Smart City and Systems Engineering (ICSCSE); 2017-12-01; pp. 184-187 *
Research on tracking algorithm based on convolutional neural network feature sharing and target detection; Li He; China Masters' Theses Full-text Database; 2018-07-15; pp. I138-1347 *

Also Published As

Publication number Publication date
CN109903339A (en) 2019-06-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant