CN110263712B

CN110263712B - A Coarse and Fine Pedestrian Detection Method Based on Region Candidates

Info

Publication number: CN110263712B
Application number: CN201910535870.1A
Authority: CN
Inventors: 宋晓宁; 周少康; 孙俊
Original assignee: Jiangnan University
Current assignee: Uniform Entropy Technology Wuxi Co ltd
Priority date: 2019-06-20
Filing date: 2019-06-20
Publication date: 2021-02-23
Anticipated expiration: 2039-06-20
Also published as: CN110263712A

Abstract

The invention discloses a coarse and fine pedestrian detection method based on region candidates. The method includes a coarse detection stage comprising the following steps: performing coarse detection on the pictures of the coarse training samples using the local irrelevant channel feature method; screening out the label target frames that are missed on the coarse training samples; performing cluster analysis on the missed label target frames to set the scale and aspect ratio; training the region candidate network with that scale and aspect ratio; and inputting the picture into the trained region candidate network and fusing the detection results it outputs with the original candidate frames to obtain the coarse detection result. Beneficial effects of the invention: in the coarse detection stage, the real target results are analyzed by clustering, the region candidate network is trained in a targeted manner, and its detection results are fused with the original candidate frames, which yields a higher recall rate and significantly reduces the missed detection rate of the detection results.

Description

Coarse and fine pedestrian detection method based on region candidates

Technical Field

The invention relates to the technical field of pedestrian detection, and in particular to a pedestrian detection method that combines a coarse and fine expression strategy with a region candidate network.

Background

In recent years, pedestrian detection has received particular attention in computer vision research as a key technology for autonomous driving and intelligent surveillance. Pedestrian detection aims to find the pedestrians present in an image or video and, where pedestrians are present, to mark their size accurately, generally with a rectangular box; it is a classic problem in object detection. Because the human body is highly flexible and takes on many postures and shapes, its appearance is strongly affected by clothing, posture, and viewing angle, as well as by factors such as occlusion and illumination. Stable and efficient detection is therefore very difficult to guarantee in practice, and pedestrian detection remains a classic and challenging problem in computer vision research.

Although existing techniques can accurately extract the contour and some texture features of a pedestrian target, their computational complexity is high, or they do not consider missed detections. The coarse and fine expression strategy only filters out redundant false detection windows; pedestrian targets that were not detected in the first place are never passed on for classification and judgment, so targets missed by the original method remain missed.

Disclosure of Invention

This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and the title of the invention of this application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.

The present invention has been made in view of the above-mentioned conventional problems.

Therefore, the technical problem solved by the invention is to address missed detections and differences in target scale in pedestrian detection, effectively reducing the missed detection rate of the pedestrian detection method.

To solve the above technical problem, the invention provides the following technical solution: a coarse and fine pedestrian detection method based on region candidates, comprising a coarse detection stage, wherein the coarse detection stage comprises the following steps: performing coarse detection on the pictures of the coarse training samples using the local irrelevant channel feature method; screening out the label target frames that are missed on the coarse training samples; performing cluster analysis on the missed label target frames and setting the scale and aspect ratio; training a region candidate network using the scale and aspect ratio; and inputting the picture into the trained region candidate network and fusing the detection result it outputs with the coarse detection result output by the classifier trained with the local irrelevant channel feature method, to obtain the coarse detection result.

As a preferred aspect of the coarse and fine pedestrian detection method based on region candidates according to the present invention: generating the coarse detection result comprises the following steps: extracting different feature channels from the pedestrian images of the coarse training samples; extracting features and training the classifier using the local irrelevant channel feature method; and performing coarse detection on the image with the trained classifier to generate the coarse detection result.

As a preferred aspect of the coarse and fine pedestrian detection method based on region candidates according to the present invention: the region candidate network selects candidate regions on the feature map generated by the convolution operation; it receives the picture to be detected as input and outputs a series of rectangular target candidate frames, each with a corresponding score value.

As a preferred aspect of the coarse and fine pedestrian detection method based on region candidates according to the present invention: the 13 convolutional layers of a VGG-16 network are selected as the network for the convolution operation; features of the input picture are extracted by the convolutional layers, a small network slides over the resulting feature map to obtain windows of different scales and proportions, and each window is mapped to a lower-dimensional feature, here a 512-dimensional feature; the generated windows are then classified and regressed by two fully connected layers.

As a preferred aspect of the coarse and fine pedestrian detection method based on region candidates according to the present invention: the loss function for training the region candidate network is defined as follows:

where $i$ denotes the index of a target candidate frame, $p_i$ is the probability that the $i$-th target candidate frame is a pedestrian target, $p_i^*$ is 1 when the $i$-th target candidate frame is identified as a target and 0 otherwise, $t_i$ represents the predicted coordinates, and $t_i^*$ represents the coordinates of the real target.
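
The symbols defined above match the multi-task loss commonly used to train region proposal networks. Assuming the method follows that standard form, the loss can be written as below. The normalizers $N_{cls}$ and $N_{reg}$, the balancing weight $\lambda$, and the component losses $L_{cls}$ (classification) and $L_{reg}$ (frame regression) are assumptions that do not appear in the text.

```latex
L(\{p_i\}, \{t_i\}) =
  \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)
  + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)
```

In this form the factor $p_i^*$ switches the regression term off for negative samples, so only candidate frames labeled as pedestrians contribute to the coordinate loss.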

As a preferred aspect of the coarse and fine pedestrian detection method based on region candidates according to the present invention: in the training of the region candidate network, a sample that has the largest intersection-over-union with a label target frame, or whose overlap with a real target frame is greater than 0.7, is taken as a positive sample, and a sample whose overlap with the real target frames is less than 0.3 is taken as a negative sample.

As a preferred aspect of the coarse and fine pedestrian detection method based on region candidates according to the present invention: the method further comprises a fine detection stage, which comprises the following steps: extracting different feature channels from the pedestrian images of the coarse training samples; further extracting color self-similarity features and convolution channel features from the feature channels; and performing fusion training with the color self-similarity features and the convolution channel features to obtain three classifiers and output the corresponding detection results.

As a preferred aspect of the coarse and fine pedestrian detection method based on region candidates according to the present invention: the fine detection stage further comprises the following steps: taking the pedestrian and background pictures generated from the feature channels extracted in the coarse detection stage as fine training samples for the fine detection stage; training a VGG-16 network as a two-classifier on the fine training samples; and combining the two classifiers with the three classifiers to serve jointly as the classification detector of the fine detection stage.

As a preferred aspect of the coarse and fine pedestrian detection method based on region candidates according to the present invention: the method further comprises a detection stage, in which candidate target frames are obtained from the test sample image by the local irrelevant channel feature method and the region candidate network, and are input into the classifier of the fine detection stage for accurate classification to obtain the pedestrian target frame detection result.

As a preferred aspect of the coarse and fine pedestrian detection method based on region candidates according to the present invention: in the VGG-16 network, the last two pooling layers are replaced with dilated (hole) convolution layers with a stride of 2 to perform the down-sampling operation, so that the feature map size is reduced while the receptive field is increased.

The invention has the following beneficial effects. First, in the coarse detection stage, the real target results are analyzed by clustering, the region candidate network is trained in a targeted manner, and its detection results are fused with the original candidate frames, which yields a higher recall rate and significantly reduces the missed detection rate of the detection results. Second, in the fine classification stage, the VGG-16 network is improved by replacing part of the pooling layers with dilated convolution, which improves the feature extraction capability of the network; an Adaboost classifier is then trained to judge the detection results accurately.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort. Wherein:

Fig. 1 is a schematic flow diagram of the detection framework of the coarse and fine pedestrian detection method based on region candidates according to the first embodiment of the present invention;

Fig. 2 is a schematic structural diagram of the region candidate network of the coarse and fine pedestrian detection method based on region candidates according to the first embodiment of the present invention;

Fig. 3 is a comparison of the detection results on TUD-Brussels according to the third embodiment of the present invention;

Fig. 4 is a comparison of the detection results on Caltech according to the third embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, specific embodiments accompanied with figures are described in detail below, and it is apparent that the described embodiments are a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present invention, shall fall within the protection scope of the present invention.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.

Furthermore, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.

The present invention is described in detail with reference to the drawings. For convenience of illustration, cross-sectional views showing device structure may be partially enlarged rather than drawn to a general scale; the drawings are only examples and should not be construed as limiting the scope of the present invention. In addition, the three dimensions of length, width, and depth should be included in actual fabrication.

Meanwhile, in the description of the present invention, it should be noted that the terms "upper, lower, inner and outer" and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation and operate, and thus, cannot be construed as limiting the present invention. Furthermore, the terms first, second, or third are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

The terms "mounted, connected and connected" in the present invention are to be understood broadly, unless otherwise explicitly specified or limited, for example: can be fixedly connected, detachably connected or integrally connected; they may be mechanically, electrically, or directly connected, or indirectly connected through intervening media, or may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.

Example 1

Referring to the schematic diagram of Fig. 1, the coarse and fine pedestrian detection method based on region candidates in this embodiment includes a coarse detection stage and a fine detection stage and is used for detecting pedestrians. As computer vision research has deepened, many classic and effective pedestrian detection methods have been proposed. For example, one classic prior-art method combines the histogram of oriented gradients (HOG) with a support vector machine; HOG features accurately capture the contour and some texture features of a pedestrian target, but the computational complexity of the method is high. Another method combines the local binary pattern with HOG features and uses the properties of the local binary pattern operator together with the HOG features to describe pedestrian feature information accurately. The aggregated channel features method integrates multiple channels, making combined use of color features, gradient magnitude, and gradient orientation histogram features, and effectively improves detection results. The local irrelevant channel features method extends the aggregated channel features method by locally decorrelating each channel to obtain more representative features, which further improves detection results.

The LDCF method, i.e. the local irrelevant channel feature method (known in the English-language literature as locally decorrelated channel features), generates many detection windows, but many of them are false detections: background regions that contain no pedestrian target, but whose extracted features resemble those of pedestrians, are wrongly detected as pedestrians. Pedestrian detection methods using a coarse and fine expression strategy have been proposed to solve this problem. Such a method is a further improvement of the LDCF method: different feature channels are extracted from the pedestrian image, the LDCF method is applied to extract features and train a classifier, and the trained classifier performs coarse detection on the image to generate a coarse detection result.

An improved color self-similarity feature (NCSSF) and a simplified convolution channel feature (SCCF) are then extracted from the same feature channels. The improved color self-similarity feature captures the color similarity of the pedestrian and effectively distinguishes the pedestrian target from the background, while the simplified convolution channel feature mainly describes robust intrinsic features inside the target and makes the features less correlated and more discriminative. Finally, the two features are fused and used to train an Adaboost classifier whose weak classifiers are decision trees.

During detection, the image to be detected is first processed by the LDCF method, which generates detection windows that act as candidate regions. The trained Adaboost classifier then examines these windows and eliminates the false detections of the LDCF method, so that the final result is more accurate. Compared with the original LDCF method, the coarse and fine expression strategy improves performance markedly: the miss rate on the Caltech pedestrian detection data set is reduced by 12.9% compared with the original, and the miss rate on the TUD-Brussels data set is reduced by 2.6%, which demonstrates the effectiveness of the coarse and fine expression strategy.

However, analysis shows that although the coarse and fine expression strategy effectively suppresses the excessive false detection windows generated by the LDCF method, it does not consider the missed detections of LDCF. The coarse and fine expression only filters out redundant false detection windows; pedestrian targets that the original LDCF method did not detect are never presented for classification and judgment, so the missed targets remain missed.

This embodiment therefore provides a coarse and fine pedestrian detection method based on region candidates to solve the missed detection problem. In theory, a coarse and fine expression structure should achieve as high a recall rate as possible in the coarse detection stage. Analysis shows, however, that many pedestrian targets are still not detected in the coarse detection stage, and because the subsequent fine classification stage only removes earlier false detections, it cannot solve the missed detection problem of the system.

The coarse and fine pedestrian detection method based on region candidates provided by this embodiment clusters the targets missed by the LDCF method to obtain suitable scales and a suitable aspect ratio, improves the region candidate network (RPN) accordingly, and trains it on the missed targets. During detection, the RPN results are added to the candidates passed to the fine classification stage, and an improved VGG-16 network filters the false detection windows more effectively, so that the detection results are more accurate. Specifically, the method comprises a coarse detection stage and a fine detection stage, wherein the coarse detection stage comprises the following steps:

performing coarse detection on the pictures of the coarse training samples using the local irrelevant channel feature method;

screening out the label target frames that are missed on the coarse training samples, the missed frames being found by comparing the sample labels with the detection results;

performing cluster analysis on the missed label target frames and setting suitable scales and an aspect ratio. Suitable scales here means 9 scales that approximately summarize all the labels. The cluster analysis uses k-means: the height and width of each label are input, k is set to 9, and the label target frames are divided into 9 clusters by scale, so that the scales within a cluster are as similar as possible and the scales of different clusters differ as much as possible. The aspect ratio is set directly to 0.41 based on experience and the characteristics of pedestrians;

training the region candidate network using the scales and the aspect ratio;

inputting the picture into the trained region candidate network and fusing the detection result it outputs with the coarse detection result output by the classifier trained with the local irrelevant channel feature method, to obtain the coarse detection result.
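
The screening and clustering steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `[x1, y1, x2, y2]` frame layout, and the matching threshold `match_iou=0.5` used to decide that a label was detected are all assumptions, since the text only says that labels are compared with detection results. The k-means is a plain NumPy version run on label width and height with k = 9.

```python
import numpy as np

def iou_matrix(a, b):
    """Pairwise IoU between frames a (N, 4) and b (M, 4) in [x1, y1, x2, y2] form."""
    a = np.asarray(a, dtype=float).reshape(-1, 4)
    b = np.asarray(b, dtype=float).reshape(-1, 4)
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    union = area_a[:, None] + area_b[None, :] - inter
    return inter / np.maximum(union, 1e-12)

def missed_labels(labels, detections, match_iou=0.5):
    """Label frames that no coarse detection overlaps by at least match_iou (assumed threshold)."""
    labels = np.asarray(labels, dtype=float).reshape(-1, 4)
    detections = np.asarray(detections, dtype=float).reshape(-1, 4)
    if len(detections) == 0:
        return labels
    best = iou_matrix(labels, detections).max(axis=1)
    return labels[best < match_iou]

def kmeans_sizes(boxes, k=9, iters=50, seed=0):
    """Plain k-means on (width, height); returns k cluster centres sorted by height."""
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    wh = np.stack([boxes[:, 2] - boxes[:, 0], boxes[:, 3] - boxes[:, 1]], axis=1)
    rng = np.random.default_rng(seed)
    # Initialise centres from k distinct samples (requires at least k frames).
    centres = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        dist = np.linalg.norm(wh[:, None, :] - centres[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = wh[assign == j]
            if len(members):  # keep the old centre if a cluster becomes empty
                centres[j] = members.mean(axis=0)
    return centres[np.argsort(centres[:, 1])]
```

A typical use would be `kmeans_sizes(missed_labels(labels, coarse_detections), k=9)`, after which the cluster heights serve as the 9 scales and the width is derived from the fixed aspect ratio of 0.41.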

Further, the coarse detection stage further includes a step of generating a coarse detection result:

extracting different feature channels from the pedestrian images of the coarse training samples;

extracting features and training a classifier using the local irrelevant channel feature method;

performing coarse detection on the image with the trained classifier to generate the coarse detection result.

The region candidate network selects candidate regions on the feature map generated by the convolution operation, which reduces computation time without sacrificing detection accuracy. The convolution operation here is feature map extraction by an unmodified VGG-16 convolutional neural network; it is called a convolution operation because the work of a convolutional neural network is done mainly by its convolutional layers. Because candidate regions are selected not on the original image but on the smaller feature map produced by the VGG-16 convolutions, the amount of computation is reduced. And because VGG-16 extracts representative image features, the feature map is smaller but retains most of the information, so accuracy is not lost.

In this embodiment, the region candidate network receives the picture to be detected as input and outputs a series of rectangular target candidate frames, each with a corresponding score value. The score value indicates how likely the image inside the candidate frame is a pedestrian: the larger the score, the higher the probability that the frame contains a pedestrian.

Further, the convolution operation in this embodiment includes the following steps:

The 13 convolutional layers of the VGG-16 network are selected as the network for the convolution operation. The region candidate network uses the feature extraction of the original, unmodified VGG-16 network, whereas the modified VGG-16 described later is used for feature extraction in the fine detection stage.

Features of the input picture are extracted by the convolutional layers. A small network then slides over the resulting feature map to obtain windows of different scales and proportions, and each window is mapped to a lower-dimensional feature, here a 512-dimensional feature. The small network is the rear part of the RPN, namely the 3 x 3 convolutional layer and the two fully connected layers.

Finally, the generated windows are classified and regressed by the two fully connected layers. These are the most basic fully connected layers of a neural network; each can also be viewed as a convolutional layer whose kernel is the same size as its input, that is, every pixel is given a weight that the network learns.

The loss function for the regional candidate network training is defined as follows:

In the training of the region candidate network, a sample that has the largest intersection-over-union (IoU, or degree of overlap) with a real target frame, or whose IoU with a real target frame is greater than 0.7, is taken as a positive sample, and a sample whose IoU with the real target frames is less than 0.3 is taken as a negative sample. The real target frame is the label target frame mentioned above: a manually annotated rectangular frame that contains a pedestrian. The thresholds 0.7 and 0.3 are chosen because they benefit the result, so that the trained framework detects better.
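
The positive and negative sample rule can be illustrated with a short sketch. The function names and the `[x1, y1, x2, y2]` frame layout are assumptions, and marking the remaining samples with -1 (ignored during training) is a common convention that the text does not state.

```python
import numpy as np

def iou_matrix(a, b):
    """Pairwise IoU between frames a (N, 4) and b (M, 4) in [x1, y1, x2, y2] form."""
    a = np.asarray(a, dtype=float).reshape(-1, 4)
    b = np.asarray(b, dtype=float).reshape(-1, 4)
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    union = area_a[:, None] + area_b[None, :] - inter
    return inter / np.maximum(union, 1e-12)

def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return one label per candidate: 1 positive, 0 negative, -1 ignored (assumed convention)."""
    iou = iou_matrix(anchors, gt_boxes)          # shape (num_anchors, num_gt)
    labels = np.full(iou.shape[0], -1, dtype=int)
    best_per_anchor = iou.max(axis=1)
    labels[best_per_anchor < neg_thr] = 0        # overlap below 0.3: negative
    labels[best_per_anchor > pos_thr] = 1        # overlap above 0.7: positive
    # The candidate with the largest IoU for each real frame is also positive.
    best_per_gt = iou.max(axis=0)
    for g in range(iou.shape[1]):
        if best_per_gt[g] > 0:
            labels[iou[:, g] == best_per_gt[g]] = 1
    return labels
```

The second rule matters when no candidate reaches 0.7 for a given pedestrian: without it, that pedestrian would contribute no positive sample at all.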

Referring to the schematic diagram of Fig. 2, which shows the structure of the region candidate network, it should be further explained that the region candidate network, i.e. the RPN, is a network structure used by R. Girshick to replace the selective search method in the R-CNN model in order to improve that model's detection accuracy and speed. The RPN finds the regions of a picture where a target may exist. This work was originally done by a sliding window method, which generates too many regions that contain no useful category; as a result, the classification and position regression tasks for the detected targets involve numerous parameters, take too long, and are often difficult to converge accurately.

The RPN hands this task to a deep network: given a training set with real target frame labels, the network learns through training to obtain the approximate position of a target from the feature map and map it back to the original image, which makes target classification and frame regression more efficient and more accurate. In pedestrian detection scenes, pedestrians generally occupy a small part of the image and the background occupies a large part; detecting every window in the image would certainly increase training time and make the RPN hard to converge. Selecting candidate regions on the feature map generated by the convolution operation reduces computation time without sacrificing detection accuracy. The RPN receives a picture as input and outputs a series of rectangular target candidate frames, referred to as anchors, each with a corresponding score value. The backbone consists of the 13 convolutional layers of a VGG-16 network, which extract the features of the input picture. A small network then slides over the resulting feature map to obtain windows of different scales and proportions, and each window is mapped to a lower-dimensional feature; because VGG-16 is chosen as the feature extraction network, this feature has 512 dimensions. Finally, the generated windows are classified and regressed by two fully connected layers.

Because the RPN, as part of Faster R-CNN, was originally designed for multi-class object detection, it uses 9 prior candidate frames in total, built from 3 scales (64, 128, and 256) and 3 aspect ratios (1:1, 1:2, and 2:1). These scales and aspect ratios are preset for many kinds of targets and span a wide range, so they differ too much from the scales and aspect ratio of pedestrians. Applied to pedestrian detection, they generate too many candidate regions of useless scale and ratio, with many target candidate frames that are too large or too small; this increases the computation of the subsequent fine classification stage and lowers the detection efficiency of the whole framework. Using the RPN directly for pedestrian detection can also lead to inaccurate pedestrian positions and large errors in the pedestrian target frames.

To address these problems, this embodiment statistically analyzes the target label frames of the training set and uses clustering (the k-means algorithm) to obtain prior candidate frame parameters with more suitable scales and aspect ratio, which replace the parameters of the original RPN. This generates candidate regions suited to pedestrian targets, yields predictions better matched to pedestrians, and improves the accuracy of the pedestrian target positions.
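
As a rough illustration of how clustered scales and a single aspect ratio could replace the original 3 x 3 prior frames, the sketch below places one prior frame per scale at every feature-map cell. Several points are assumptions rather than details from the text: 0.41 is read as width divided by height, the feature stride of 16 pixels is the usual value for the VGG-16 convolutional layers, and the heights passed in are placeholder values, not results from the patent.

```python
import numpy as np

def make_anchors(feat_h, feat_w, heights, aspect_ratio=0.41, stride=16):
    """Prior frames [x1, y1, x2, y2] centred on each feature-map cell, one per height."""
    heights = np.asarray(heights, dtype=float)
    widths = heights * aspect_ratio               # assumed: aspect ratio = width / height
    # Cell centres mapped back to image coordinates.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)                  # both of shape (feat_h, feat_w)
    centres = np.stack([cx.ravel(), cy.ravel()], axis=1)   # (H*W, 2)
    half = np.stack([widths, heights], axis=1) / 2.0        # (K, 2)
    top_left = centres[:, None, :] - half[None, :, :]
    bottom_right = centres[:, None, :] + half[None, :, :]
    return np.concatenate([top_left, bottom_right], axis=2).reshape(-1, 4)

# Placeholder heights standing in for the 9 clustered scales.
anchors = make_anchors(feat_h=30, feat_w=40,
                       heights=[30, 40, 50, 60, 75, 90, 110, 140, 180])
```

With 9 heights and one aspect ratio, each cell still carries 9 prior frames, the same count as the original configuration, but all of them have pedestrian-like proportions.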

Example 2

Referring again to Fig. 1, this embodiment describes the fine detection stage of the coarse and fine pedestrian detection method based on region candidates, which comprises the following steps:

extracting different feature channels from the pedestrian images of the coarse training samples;

further extracting color self-similarity features and convolution channel features from the feature channels;

performing fusion training with the color self-similarity features and the convolution channel features to obtain three classifiers and output the corresponding detection results.

Further, the fine detection stage comprises the following steps:

The pedestrian and background pictures generated from the feature channels extracted in the coarse detection stage are used as the fine training samples of the fine detection stage. The pedestrian pictures are cropped directly from the feature channels of the training samples according to the labels, and the background pictures are cropped at random from regions of the feature channels that contain no pedestrian. These pictures are not a result of coarse-stage training; they are reused here because the same cropping is already performed during coarse-stage training. Each feature channel can be understood as a differently styled version of the picture, and the position of a pedestrian is the same in all 10 feature channels, so the pedestrian is cropped from the 10 feature channels directly according to the labels of the original picture. This embodiment uses 10 feature channels in total: 1 locally normalized gradient magnitude channel, 3 color channels (LUV), and 6 gradient orientation histogram channels.

A VGG-16 network is trained as a two-classifier on the fine training samples.

The two classifiers and the three classifiers are combined to serve jointly as the classification detector of the fine detection stage.

This embodiment finally includes a detection stage, in which the test sample image is passed through two region candidate structures to obtain candidate target frames, and the candidate target frames are input into the classifier of the fine detection stage for accurate classification to obtain the pedestrian target frame detection result. Here, the two region candidate structures are the original LDCF detection method and the RPN method; both are called region candidate structures because both serve to produce candidate target frames for the subsequent fine detection.

In this embodiment, the VGG-16 network replaces its last two pooling layers with dilated (atrous) convolution layers with a stride of 2 to perform the downsampling, so that the receptive field grows while the feature map shrinks, giving better feature extraction. Each of the two replacement layers is a convolution layer with a 3 x 3 kernel, a stride of 2, a dilation rate of 2, and a padding of 2. The stride of 2 reproduces the downsampling of the pooling layer it replaces. Because a pooling layer has no learnable parameters, it loses spatial hierarchy information; the convolution layer is substituted to make up for this defect. The dilation rate of 2 moderately enlarges the receptive field of each point and moderately strengthens the global information carried by the features. The receptive field is the size of the region of the original image that corresponds to each pixel of the convolved feature map.
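As a quick check of the layer parameters above, the following sketch compares the output size of a pooling layer with that of the replacement convolution. It assumes the replaced layers are the standard VGG-16 2 x 2 max-pooling layers with stride 2 and that output sizes are rounded down; the helper names are hypothetical.

```python
def conv_out_size(n, kernel, stride, padding=0, dilation=1):
    """Output length along one axis of a convolution or pooling layer (floor mode)."""
    effective = dilation * (kernel - 1) + 1  # a 3 x 3 kernel with dilation 2 spans 5 pixels
    return (n + 2 * padding - effective) // stride + 1

def pooled(n):
    # Assumed baseline: 2 x 2 max pooling with stride 2.
    return conv_out_size(n, kernel=2, stride=2)

def dilated(n):
    # Replacement layer from the text: kernel 3, stride 2, dilation 2, padding 2.
    return conv_out_size(n, kernel=3, stride=2, padding=2, dilation=2)

for n in (28, 14, 7):
    print(n, pooled(n), dilated(n))
```

Under these assumptions both layers halve an even-sized feature map (28 to 14, 14 to 7), while for an odd size such as 7 the dilated convolution keeps one more row and column (4 instead of 3).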

It should further be explained that, in the fine detection stage of the framework, the VGG-16 network is fine-tuned to extract features from the candidate windows generated by coarse detection, and a fine AdaBoost classifier, which is the binary classifier of this embodiment, is then trained on those features. The classifier consists of 4096 decision trees of depth 5. In the detection stage, the trained classifier further filters out the false detections of the coarse detection stage, and finally a non-maximum suppression algorithm suppresses overlapping detections to obtain the final result.
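The final suppression step can be sketched as a greedy non-maximum suppression. This is a generic illustration rather than the patent's exact procedure: the (x1, y1, x2, y2) box format, the function names, and the IoU threshold of 0.5 are assumptions, since the text does not specify them.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(detections, iou_threshold=0.5):
    """Greedy NMS over (box, score) pairs: keep a box only if it does not
    overlap an already kept, higher-scoring box by more than the threshold."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, score))
    return kept

detections = [
    ((10, 10, 50, 90), 0.9),
    ((12, 12, 52, 92), 0.8),    # near-duplicate of the first box
    ((100, 20, 140, 100), 0.7),
]
final = nms(detections)
```

In this toy input the second box overlaps the first with an IoU of about 0.86 and is suppressed, so two detections remain.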

With the arrival of the big data era and the emergence of high-performance computing systems, convolutional networks have recently achieved great success in recognizing and classifying large-scale images and videos, outperforming traditional feature extraction methods. The VGG-16 network consists of 13 convolution layers and 3 fully connected layers, with the 13 convolution layers separated by 5 max-pooling layers; here only the part of the network before the fully connected layers is used to extract features. Pooling is a downsampling process whose parameters cannot be learned during network training, so it inevitably loses internal data structure and spatial hierarchy information. To overcome this defect, dilated convolution layers with a stride of 2 replace the last two pooling layers for downsampling, which reduces the size of the feature map, enlarges the receptive field, and yields better feature extraction.
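The effect on the receptive field can be estimated with the usual layer-by-layer recurrence. The sketch below assumes the standard VGG-16 layout (blocks of 2, 2, 3, 3, 3 convolutions with 3 x 3 kernels, each block followed by a 2 x 2 stride-2 pooling layer) and ignores border effects; the function and constant names are hypothetical.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride, dilation).
    Returns (theoretical receptive field, cumulative stride) of the last layer."""
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        effective = dilation * (kernel - 1) + 1
        rf += (effective - 1) * jump  # growth is scaled by the stride accumulated so far
        jump *= stride
    return rf, jump

CONV = (3, 1, 1)     # 3 x 3 convolution, stride 1
POOL = (2, 2, 1)     # 2 x 2 max pooling, stride 2
DILATED = (3, 2, 2)  # replacement layer: 3 x 3, stride 2, dilation 2

def vgg16_features(last_two):
    """13 convolution layers in 5 blocks; the last two downsampling layers are configurable."""
    blocks = [2, 2, 3, 3, 3]
    down = [POOL, POOL, POOL, last_two, last_two]
    layers = []
    for n_conv, d in zip(blocks, down):
        layers += [CONV] * n_conv + [d]
    return layers

print(receptive_field(vgg16_features(POOL)))     # (212, 32)
print(receptive_field(vgg16_features(DILATED)))  # (284, 32)
```

Under these assumptions the cumulative stride stays at 32 while the theoretical receptive field of the last layer grows from 212 to 284 pixels, consistent with the statement that the feature map shrinks as before while the receptive field increases.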

In the training stage, the improved overall pedestrian detection framework uses the input images to train the LDCF classifier and the improved RPN network separately. The pedestrian and background pictures generated in the coarse detection stage are used as training samples to train a VGG-16 network, in which dilated convolutions replace pooling layers, as a binary classification detector; this detector is combined with the original classifier (the three-class classifier) trained on the improved self-similarity features and convolution channel features to form the classification detector of the fine detection stage. In the detection stage, the picture is passed through the two region candidate structures to obtain candidate target frames, which are input into the classifier of the fine detection stage for more accurate classification, finally yielding the pedestrian target frame result.

Example 3

This embodiment verifies the region-candidate-based coarse and fine pedestrian detection method experimentally. The Caltech pedestrian dataset and TUD-Brussels are commonly used pedestrian detection datasets on which most pedestrian detection algorithms evaluate their performance. The Caltech pedestrian detection dataset is a video released by the California Institute of Technology, about 10 hours long, captured with a vehicle-mounted camera at a resolution of 640x480 and 30 frames per second. About 250,000 frames are annotated, 350,000 pedestrians are marked with rectangular frames, and occlusion is annotated as well. The dataset is divided into sets 00-10, of which sets 00-05 serve as the training set and sets 06-10 as the test set.

In the experiment of this embodiment, 32077 training pictures are generated from sets 00 to 05 by taking one picture every 4 frames, and 4024 test pictures are obtained from sets 06 to 10 by taking one picture every 30 frames. The TUD-Brussels dataset consists of image pairs captured by vehicle-mounted cameras; the pairs supply motion information for evaluating its effect on pedestrian detection, which is not used here. Its training set has 1092 image pairs as positive samples and 192 pedestrian-free image pairs (some captured with a handheld camera) as negative samples, with 1776 pedestrian targets in total. Its test set has 508 image pairs at a resolution of 640x480, the same as the Caltech dataset. The different methods are compared by log-average miss rate, using the evaluation algorithm proposed by Piotr Dollár in 2012; the lower the log-average miss rate, the better.
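For orientation, the log-average miss rate is commonly computed as the geometric mean of the miss rate sampled at nine false-positives-per-image (FPPI) values evenly spaced in log space between 10^-2 and 10^0. The sketch below follows that common definition and is not the official evaluation code; the function name and the handling of curves with no operating point below a reference FPPI are assumptions.

```python
import math

def log_average_miss_rate(fppi, miss_rate):
    """fppi: FPPI values in ascending order; miss_rate: matching miss rates.
    Returns the geometric mean of the miss rate at nine reference FPPI values."""
    refs = [10 ** (-2 + 0.25 * k) for k in range(9)]  # 1e-2 ... 1e0
    samples = []
    for ref in refs:
        eligible = [m for f, m in zip(fppi, miss_rate) if f <= ref]
        # Assumption: if the curve has no point at or below ref, count a miss rate of 1.0.
        samples.append(eligible[-1] if eligible else 1.0)
    # Clamp to avoid log(0) for a perfect detector.
    return math.exp(sum(math.log(max(m, 1e-10)) for m in samples) / len(samples))

# Hypothetical curve: miss rate 0.8 from FPPI 0.005 onward, dropping to 0.2 at FPPI 0.9.
lamr = log_average_miss_rate([0.005, 0.9], [0.8, 0.2])
```

Averaging in log space keeps a single very low or very high sample from dominating the summary figure, which is why the percentages quoted for the different methods are comparable across curves of different shapes.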

Comparison of experimental results:

To demonstrate the effectiveness of the method of this embodiment, it was compared experimentally with other methods on the TUD-Brussels dataset. The results are shown in FIG. 3, which plots the log miss rate curves of the methods. The most classical traditional feature extraction method (HOG) has a log-average miss rate of 78% on TUD-Brussels, 32 percentage points higher than the 46% of the method of this embodiment; the ConvNet framework achieves 69% and LDCF achieves 52%. In addition, MF+Motion+2Ped, which adds motion information to improve model accuracy, has a miss rate 5 percentage points higher than the method of this embodiment, which uses no motion information, and 1 percentage point higher than the original method (labeled "original"). This demonstrates that the method effectively reduces the miss rate of pedestrian detection.

In FIG. 3, the curves at their rightmost ends are, from top to bottom: HOG, MF+Motion+2Ped, ConvNet, LDCF, ours, and original. In FIG. 4, the curves at their leftmost ends are, from top to bottom: HOG, SA-FastRCNN, FasterRCNN+ATT, MS-CNN, original, ours, and RPN+BF (the original and ours curves are too close to distinguish at the left end, but at the far end the original curve lies above the ours curve).

It should be noted that the above embodiments only illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all such modifications and substitutions are covered by the claims of the present invention.

Claims

1. A coarse and fine pedestrian detection method based on region candidates, characterized in that it comprises a coarse detection stage and a fine detection stage, wherein the coarse detection stage comprises the following steps:

performing coarse detection on the images to be detected of the coarse training samples using the locally decorrelated channel features (LDCF) method;

screening out the label target frames that are missed on the coarse training samples;

performing cluster analysis on the missed label target frames, and setting the label scales and aspect ratios;

training a region candidate network using the scales and aspect ratios;

fusing the detection result output by the trained region candidate network for the input image with the coarse detection result output by the classifier trained with the LDCF method, to obtain the coarse detection result;

and the fine detection stage comprises the following steps:

extracting different feature channels from the pedestrian images of the coarse training samples;

further extracting color self-similarity features and convolution channel features from the feature channels;

performing fusion training with the color self-similarity features and the convolution channel features to obtain a three-class classifier, and outputting the corresponding detection result.

2. The coarse and fine pedestrian detection method based on region candidates according to claim 1, wherein the generation of the coarse detection result comprises the following steps:

extracting different feature channels from the pedestrian images of the coarse training samples;

applying the locally decorrelated channel features (LDCF) method to extract features and train a classifier;

performing coarse detection on the image with the trained classifier to generate the coarse detection result.

3. The coarse and fine pedestrian detection method based on region candidates according to claim 1 or 2, wherein the region candidate network receives an image to be detected as input, selects candidate regions on the feature map generated by the convolution operation, and outputs a series of rectangular target candidate frames with corresponding scores.

4. The coarse and fine pedestrian detection method based on region candidates according to claim 3, characterized in that it comprises the following steps:

selecting the 13 convolution layers of the VGG-16 network as the network for the convolution operation;

extracting features from the input image with the convolution layers, using a small network to obtain sliding windows of different scales and aspect ratios on the resulting feature map, and mapping the windows to lower-dimensional features, including 512-dimensional features;

classifying and regressing the generated windows through two fully connected layers.

5. The coarse and fine pedestrian detection method based on region candidates according to claim 4, characterized in that the loss function for training the region candidate network is defined as follows:

where i is the index of a target candidate frame; p_i is the probability that the i-th target candidate frame is a pedestrian target; p_i^* is 1 when the i-th target candidate frame is labeled as a target and 0 otherwise; t_i represents the predicted coordinates; and t_i^* represents the coordinates of the real target.

6. The coarse and fine pedestrian detection method based on region candidates according to claim 4 or 5, wherein, during training of the region candidate network, samples that have the highest intersection-over-union with a label target frame, or whose overlap with a real target frame is greater than 0.7, are regarded as positive samples, while samples whose overlap with the real target frames is less than 0.3 are regarded as negative samples.

7. The coarse and fine pedestrian detection method based on region candidates according to claim 6, wherein the fine detection stage further comprises the following steps:

using the pedestrian and background images generated from the feature channels extracted in the coarse detection stage as the fine training samples of the fine detection stage;

training a VGG-16 network on the fine training samples, and then training a fine AdaBoost classifier as a binary classifier;

combining the binary classifier and the three-class classifier as the classification detector of the fine detection stage.

8. The coarse and fine pedestrian detection method based on region candidates according to claim 7, further comprising a detection stage, wherein the detection stage obtains candidate target frames from the test sample image through the locally decorrelated channel features (LDCF) method and the region candidate network, and inputs the candidate target frames into the classification detector of the fine detection stage for accurate classification to obtain the pedestrian target frame detection result.

9. The coarse and fine pedestrian detection method based on region candidates according to claim 7 or 8, wherein the VGG-16 network trained on the fine training samples uses dilated convolution layers with a stride of 2 to replace its last two pooling layers for downsampling, reducing the size of the feature map while increasing the receptive field.