Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. To this end, the invention provides a face multitask detection method, system, device and storage medium based on deep learning.
The technical solution adopted by the invention is as follows:
in one aspect, an embodiment of the present invention includes a face multitask detection method based on deep learning, including:
acquiring an original face image;
carrying out normalization processing on the original face image to obtain a first image;
inputting the first image into a super-resolution neural network model for processing to obtain a second image;
and inputting the second image into a face multi-task detection model based on deep learning for processing to obtain a face frame coordinate and a face key point coordinate.
Further, the normalizing the original face image to obtain a first image specifically comprises:
performing normalization processing on the original face image by uniformly mapping the gray values 0-255 onto 0-1 to obtain a first image.
Further, the super-resolution neural network model comprises:
a feature extraction module for extracting image features from the first image;
and a nonlinear mapping module for performing nonlinear mapping on the extracted image features to obtain the second image, the second image containing high-resolution image information.
Further, the step of inputting the second image into a face multitask detection model based on deep learning for processing specifically includes:
inputting the second image into a backbone network for processing, and acquiring a plurality of levels of feature maps;
after the feature maps of all levels are processed by a 1×1 convolution kernel, combining the features of high-level and low-level semantics layer by layer to obtain a plurality of feature maps containing high-level semantic information;
inputting all the feature maps containing high-level semantic information into a region proposal network for processing to obtain coarse candidate frames of the face target;
mapping the face target coarse candidate frames to a first position, the first position being the feature-map position output after the first image is processed by the backbone network; and unifying all the face target coarse candidate frames into two dimensionalities by the RoI Align method and dividing them between a first branch line and a second branch line for processing;
the first branch line screens and corrects the face target coarse candidate frame to obtain a face frame coordinate;
and the second branch line processes the face target coarse candidate frame to obtain face key point coordinates.
Further, unifying all the face target coarse candidate frames into two dimensionalities by the RoI Align method and dividing them between a first branch line and a second branch line for processing is specifically:
unifying all the face target coarse candidate frames into images with the dimensionality of 7x7x256 by the RoI Align method and routing them to the first branch line for processing;
unifying all the face target coarse candidate frames into images with the dimensionality of 14x14x256 by the RoI Align method and routing them to the second branch line for processing.
Further, the step of screening and correcting the coarse candidate frame of the face target by the first branch line to obtain the face frame coordinates specifically includes:
flattening the features of the image with the dimensionality of 7x7x256 to one dimension through two fully connected layers;
fitting the classification and the position offset of the face target coarse candidate frame with two fully connected layers, respectively;
and correcting the face target coarse candidate frame according to its classification and position offset to obtain the face frame coordinates.
Further, the step of processing the face target coarse candidate frame by the second branch line to obtain face key point coordinates specifically includes:
processing the image with the dimensionality of 14x14x256 through 4 convolutional layers;
and fitting through a fully connected layer to obtain the face key point coordinates.
On the other hand, an embodiment of the invention also provides a face multitask detection system based on deep learning, comprising:
the acquisition module is used for acquiring an original face image;
the normalization processing module is used for performing normalization processing on the original face image to obtain a first image;
the super-resolution module is used for inputting the first image into a super-resolution neural network model for processing to obtain a second image;
and the face multitask detection module is used for inputting the second image into a face multitask detection model based on deep learning for processing to obtain face frame coordinates and face key point coordinates.
On the other hand, the embodiment of the invention also comprises a face multitask detection device based on deep learning, which comprises:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the detection method.
In another aspect, an embodiment of the present invention further includes a computer-readable storage medium on which a processor-executable program is stored, the processor-executable program, when executed by a processor, implementing the detection method.
The invention has the beneficial effects that:
(1) With the super-resolution neural network model, the method enhances feature information while keeping the feature-map size unchanged and improves detection performance on small-target faces, so that small faces are more easily detected; compared with enlarging the input image, the increase in computing resources is very small;
(2) With the face multitask detection model based on deep learning, the invention improves the detection of faces of different sizes, making the detection results more accurate, while the size range of the face image does not change during feature enhancement in the detection process.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
In the description of the present invention, it should be understood that the orientation or positional relationship referred to in the description of the orientation, such as the upper, lower, front, rear, left, right, etc., is based on the orientation or positional relationship shown in the drawings, and is only for convenience of description and simplification of description, and does not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention.
In the description of the present invention, "several" means one or more and "a plurality" means two or more; "greater than", "less than", "exceeding", etc. are understood as excluding the stated number, while "above", "below", "within", etc. are understood as including it. Where "first" and "second" are used, they serve only to distinguish technical features and are not to be understood as indicating or implying relative importance, the number of the technical features indicated, or their precedence.
In the description of the present invention, unless otherwise explicitly limited, terms such as arrangement, installation, connection and the like should be understood in a broad sense, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention in combination with the specific contents of the technical solutions.
The embodiments of the present application will be further explained with reference to the drawings.
Referring to fig. 1, an embodiment of the present invention provides a face multitask detection method based on deep learning, including but not limited to the following steps:
s1, acquiring an original face image;
s2, carrying out normalization processing on the original face image to obtain a first image;
s3, inputting the first image into a super-resolution neural network model for processing to obtain a second image;
and S4, inputting the second image into a face multi-task detection model based on deep learning for processing to obtain face frame coordinates and face key point coordinates.
Regarding step S2, that is, performing normalization processing on the original face image to obtain a first image, specifically:
s201, performing normalization processing on the original face image by adopting a method of uniformly mapping gray values of 0-255 to 0-1 to obtain a first image.
In this embodiment, the operation of step S2 adopts a method of uniformly mapping gray-scale values of 0 to 255 to between 0 and 1, which is different from the mean standard deviation normalization used in the general detection task, and the purpose of the method is to adapt to the learning of the super-resolution neural network model.
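As an illustrative sketch only (not the claimed implementation; the function name is hypothetical), the uniform 0-255 to 0-1 mapping of step S2 can be written with numpy as follows:

```python
import numpy as np

def normalize_image(img: np.ndarray) -> np.ndarray:
    """Map 8-bit gray values 0-255 uniformly onto [0, 1].

    Unlike mean/std normalization, this keeps the value range fixed,
    which (per the embodiment) suits the super-resolution network.
    """
    return img.astype(np.float32) / 255.0

# Example: a tiny 2x2 8-bit image becomes the "first image"
first_image = normalize_image(np.array([[0, 128], [64, 255]], dtype=np.uint8))
```

The same mapping applies per channel for color inputs.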
The super-resolution neural network model in step S3 includes:
a feature extraction module for extracting image features from the first image;
and a nonlinear mapping module for performing nonlinear mapping on the extracted image features to obtain the second image, the second image containing high-resolution image information.
In this embodiment, the super-resolution neural network model adopts a lightweight network comprising four convolutional layers and an upsampling layer. The upsampling layer is used only to reconstruct a high-resolution image and does not participate in the inference process of the backbone network. Processing by the four convolutional layers yields a second image containing high-resolution information; this second image is a feature image with the same resolution as the input original image. The second image is then input into the deep-learning-based face multitask detection model for further processing. The super-resolution neural network model therefore aims to enhance feature information while keeping the feature-map size unchanged, which is particularly helpful for detecting targets that occupy only a small range of pixels.
In this embodiment, the training process of the super-resolution neural network model includes the following processing processes:
(1) processing the first image with bicubic-interpolation downsampling to obtain a low-resolution image, and extracting image features from the low-resolution image through the feature extraction module;
(2) mapping the extracted image features, through the nonlinear mapping module, to a feature image of the same resolution that represents higher-resolution image information, this same-resolution feature image being the second image;
(3) performing learnable upsampling on the second image through an image reconstruction module to obtain the high-resolution image restored by the model.
in the embodiment, in the training process of the super-resolution neural network model, the low-resolution image is processed by the super-resolution module, a high-resolution image which is predicted and restored by the model can be output, and the difference between the high resolution and the first image is compared and calculated so as to supervise the network training process; and in the application process, the image reconstruction module does not participate in the work.
Optionally, step S4, that is, the step of inputting the second image into the depth learning-based face multitask detection model for processing specifically includes:
s401, inputting the second image into a backbone network for processing, and obtaining a plurality of levels of feature maps;
S402, after the feature maps of all levels are processed by a 1×1 convolution kernel, combining the features of high-level and low-level semantics layer by layer to obtain a plurality of feature maps containing high-level semantic information;
S403, inputting all the feature maps containing high-level semantic information into a region proposal network for processing to obtain coarse candidate frames of the face target;
S404, mapping the face target coarse candidate frames to a first position, the first position being the feature-map position output after the first image is processed by the backbone network; and unifying all the face target coarse candidate frames into two dimensionalities by the RoI Align method and dividing them between a first branch line and a second branch line for processing;
s405, screening and correcting the face target coarse candidate frame by the first branch line to obtain a face frame coordinate;
and S406, the second branch line processes the face target coarse candidate frame to obtain face key point coordinates.
In step S404, unifying all the face target coarse candidate frames into two dimensionalities by the RoI Align method and dividing them between a first branch line and a second branch line is specifically:
S404-1, unifying all the face target coarse candidate frames into images with the dimensionality of 7x7x256 by the RoI Align method and routing them to the first branch line for processing;
S404-2, unifying all the face target coarse candidate frames into images with the dimensionality of 14x14x256 by the RoI Align method and routing them to the second branch line for processing.
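The unification step can be sketched as follows. This is an illustrative stand-in, not the claimed method: a crude average pooling over spatial bins replaces RoI Align (which uses bilinear sampling at fractional positions); the function and variable names are hypothetical.

```python
import numpy as np

def roi_pool(feat: np.ndarray, box, out_size: int) -> np.ndarray:
    """Unify one candidate box on an (H, W, C) feature map into a fixed
    (out_size, out_size, C) tensor by averaging each spatial bin.
    A rough stand-in for RoI Align."""
    x0, y0, x1, y1 = box
    c = feat.shape[2]
    out = np.zeros((out_size, out_size, c), dtype=feat.dtype)
    xs = np.linspace(x0, x1, out_size + 1)
    ys = np.linspace(y0, y1, out_size + 1)
    for i in range(out_size):
        for j in range(out_size):
            # ensure each bin covers at least one pixel
            patch = feat[int(ys[i]):max(int(ys[i + 1]), int(ys[i]) + 1),
                         int(xs[j]):max(int(xs[j + 1]), int(xs[j]) + 1)]
            out[i, j] = patch.mean(axis=(0, 1))
    return out

feat = np.random.default_rng(1).random((56, 56, 256))
box = (8, 8, 36, 36)                  # x0, y0, x1, y1 on the feature map
branch1_in = roi_pool(feat, box, 7)   # 7x7x256 for the first branch line
branch2_in = roi_pool(feat, box, 14)  # 14x14x256 for the second branch line
```

Whatever the box size, both branches receive tensors of fixed dimensionality, which is what allows the subsequent fully connected layers to operate.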
In step S405, that is, the step of screening and correcting the coarse candidate frame of the face target by the first branch line to obtain the face frame coordinates specifically includes:
S405-1, flattening the features of the image with the dimensionality of 7x7x256 to one dimension through two fully connected layers;
S405-2, fitting the classification and the position offset of the face target coarse candidate frame with two fully connected layers, respectively;
S405-3, correcting the face target coarse candidate frame according to its classification and position offset to obtain the face frame coordinates.
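The correction in S405-3 can be sketched as below. The text does not fix the exact offset parameterization, so the common (dx, dy, dw, dh) form used by two-stage detectors is assumed here; the function name is hypothetical.

```python
import math

def apply_offsets(box, deltas):
    """Correct a coarse candidate box [x, y, w, h] with predicted offsets
    (dx, dy, dw, dh), under the assumed common parameterization:
      cx' = cx + dx*w,  cy' = cy + dy*h,  w' = w*exp(dw),  h' = h*exp(dh)
    """
    x, y, w, h = box
    dx, dy, dw, dh = deltas
    cx, cy = x + w / 2, y + h / 2          # box center
    cx, cy = cx + dx * w, cy + dy * h      # shift center
    w, h = w * math.exp(dw), h * math.exp(dh)  # rescale size
    return [cx - w / 2, cy - h / 2, w, h]

# a small rightward shift of a 20x20 candidate box
refined = apply_offsets([10, 10, 20, 20], [0.1, 0.0, 0.0, 0.0])
```

The classification score from S405-2 would, in parallel, be thresholded to screen out background candidates before the correction is kept.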
In step S406, that is, the step of processing the coarse candidate frame of the face target by the second branch line to obtain the coordinates of the face key point specifically includes:
S406-1, processing the image with the dimensionality of 14x14x256 through 4 convolutional layers;
S406-2, fitting through a fully connected layer to obtain the face key point coordinates.
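Assuming, as stated later in the embodiment, that the 10 fitted values are 5 landmark positions normalized to the candidate frame, the decoding back to absolute coordinates can be sketched as follows (function name hypothetical):

```python
def decode_landmarks(box, values):
    """Map 10 fitted values (5 landmarks, assumed normalized to the
    candidate box [x, y, w, h]) back to absolute (x, y) coordinates."""
    x, y, w, h = box
    return [(x + values[2 * i] * w, y + values[2 * i + 1] * h)
            for i in range(5)]

# 5 normalized (x, y) pairs inside a 50x50 box at (100, 100)
pts = decode_landmarks([100, 100, 50, 50],
                       [0.3, 0.4, 0.7, 0.4, 0.5, 0.55,
                        0.35, 0.75, 0.65, 0.75])
```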
In this embodiment, the face multitask detection model based on deep learning obtains a feature map with highly abstracted information through convolutional layers, batch-normalization layers, activation functions and pooling layers alternating according to a certain rule, and constructs a multi-scale feature pyramid so that faces of various scales can be detected accurately. Meanwhile, the adopted two-stage network architecture comprises two parts: selecting candidate frames and generating accurate detection results; the latter part is divided into two parallel branches, through which the face frame coordinates and the face key point coordinates are obtained respectively.
Specifically, referring to fig. 2, firstly, the information-enhanced feature map obtained through super-resolution neural network model processing is input to the backbone network to obtain feature maps at multiple levels; since the layers become gradually deeper and the feature maps are downsampled step by step, the degree of semantic abstraction increases while the feature-map size decreases. Through the M5-M2 layers, the features of high-level and low-level semantics are combined layer by layer to obtain feature maps of several sizes containing high-level semantic information (such as P2-P6 in fig. 2), forming a feature pyramid and improving the detection of faces of different sizes;
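One top-down merge step of the feature pyramid described above can be sketched as follows. This is a hedged illustration with hypothetical names: nearest-neighbor upsampling stands in for whatever upsampling the model uses, a per-pixel matrix multiply implements the 1×1 lateral convolution, and the weights are random placeholders.

```python
import numpy as np

def conv1x1(feat, w):
    """1x1 convolution on an (H, W, C_in) map == per-pixel matmul to C_out."""
    return feat @ w

def upsample2x(feat):
    """Nearest-neighbor 2x upsampling of an (H, W, C) map."""
    return feat.repeat(2, axis=0).repeat(2, axis=1)

def merge_level(higher, lower, w_lateral):
    """One top-down step: upsample the coarser, more semantic level and
    add the 1x1-projected finer level (high + low semantics combined)."""
    return upsample2x(higher) + conv1x1(lower, w_lateral)

rng = np.random.default_rng(2)
c5 = rng.random((7, 7, 256))     # coarse level, high-level semantics
c4 = rng.random((14, 14, 512))   # finer level, low-level semantics
w = rng.random((512, 256))       # lateral 1x1 conv weights (placeholder)
p4 = merge_level(c5, c4, w)      # merged 14x14x256 pyramid level
```

Repeating this from the deepest level downward yields the P2-P6 maps of the pyramid.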
secondly, all the obtained feature maps (P2-P6) are input into the RPN (Region Proposal Network); anchor frames with side lengths of 4, 5.04 and 6.35 pixels and an aspect ratio of 1:1 are laid centered at each pixel point, and fitting is performed through two branch lines: one classifies whether an anchor frame belongs to a face or the background (two softmax scores), and the other regresses the position offset between the anchor frame and the true-value target frame (the offsets of the top-left corner's horizontal and vertical coordinates and of the length and width, four values). Coarse candidate frames of the face target are thereby obtained, and in this process a first group of loss functions is generated:
in the formula, p
iRepresenting the classification probability, t, of a human face object
iThe position offset of the object representing the face,
represents p
iA corresponding true value is set for the value of,
represents t
iA corresponding true value; n is a radical of
cls、N
regThe number of classification and regression targets in one batch respectively; l is
clsRepresenting the loss function of the classification, and adopting the cross entropy loss of the two classifications; l is
regLoss function, representing position regression, was lost with smooth L1.
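The anchor layout described for the RPN step (square anchors of side 4, 5.04 and 6.35 pixels centered at every pixel) can be sketched as follows; the function name and (cx, cy, w, h) box representation are illustrative assumptions.

```python
import numpy as np

def make_anchors(h, w, stride=1, sides=(4.0, 5.04, 6.35)):
    """Lay square (1:1 aspect ratio) anchor boxes of the given side
    lengths, one set centered at every pixel of an h x w feature map.
    Returns an (N, 4) array of boxes as (cx, cy, side, side)."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    centers = np.stack([xs.ravel(), ys.ravel()], axis=1) * stride
    anchors = []
    for s in sides:
        for cx, cy in centers:
            anchors.append((cx, cy, s, s))
    return np.array(anchors)

anchors = make_anchors(4, 4)  # tiny 4x4 map -> 4*4*3 = 48 anchors
```

Each anchor would then receive a face/background score and a four-value offset from the two RPN branch lines.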
Then, the output face target coarse candidate frames are mapped to the corresponding positions of the information-enhanced feature map obtained by the super-resolution neural network model, and all the face target coarse candidate frames are unified into two forms, with dimensionalities of 7x7x256 and 14x14x256, by the RoI Align method and divided between two branch lines for processing. As shown in fig. 2, one branch line abstracts the features to one dimension through two fully connected layers and then fits the classification and position offset of the target frame with two further fully connected layers, thereby finely correcting the target frame; the other branch line is processed through 4 convolutional layers, and the normalized positions of the 5 landmarks in the corresponding target frame are fitted with a fully connected layer (10 values in total). In this process, a second set of loss functions is generated:
the second set of loss functions is substantially the same as the first set of loss functions and, with the addition of landmark supervision, L
lmThe loss of smooth L1, L is adopted
iAnd
respectively as predicted value and true value of landmark; lambda [ alpha ]
1、λ
2、λ
3Are all weights.
Therefore, the total loss function of the face multitask detection model based on deep learning is:
$$\mathrm{Loss}_D=\mathrm{Loss}_1+\alpha\,\mathrm{Loss}_2$$
where $\alpha$ is a weight, $\mathrm{Loss}_1$ is the first set of loss functions, and $\mathrm{Loss}_2$ is the second set of loss functions.
Referring to fig. 3, because a super-resolution neural network model is introduced, the embodiment of the present invention further proposes a matched training strategy to improve the detection performance of the deep-learning-based face multitask detection model for targets occupying a small number of pixels. As shown in fig. 3, the Detector in the figure corresponds to the structure shown in fig. 2, and the Detection Loss corresponds to $\mathrm{Loss}_D$.
The super-resolution neural network model serves as a module for enhancing original image information and needs to be supervised by a high-resolution image corresponding to the input image; therefore, the original image is downsampled to generate training pairs, and the loss function of the super-resolution neural network model is:
$$\mathrm{Loss}_{SR}=\frac{1}{W\,H\,C}\sum_i\left(y_i-y_i^*\right)^2$$
In the formula, $y_i$ denotes the SR image (the super-resolution restored image), $y_i^*$ denotes the HR image (the high-resolution image), and $W$, $H$, $C$ are respectively the width, height and number of channels of the image.
In order to keep the performance of the deep-learning-based face multitask detection model on the original image scale from degrading while increasing its detection performance for small targets, each batch in the training process is divided into two groups: one group contains original images (ori_img), which are processed by the main line only; the other group contains images downsampled by a factor of 4 (de_img), which provide samples with high-resolution counterparts for the corresponding super-resolution training and are processed by the main line and the branch line together. Finally, the loss values of the two groups of data are summed with certain weights to supervise the whole network. The loss value of the original images (ori_img), processed by the main line only, is: $L_{ori\_img}=\mathrm{Loss}_D$; the loss value of the 4-times-downsampled images (de_img), processed by both the main line and the branch line, is: $L_{de\_img}=\mathrm{Loss}_D+\beta\,\mathrm{Loss}_{SR}$; and the weighted sum of the two is: $\mathrm{Loss}_{total}=L_{ori\_img}+\gamma\,L_{de\_img}$; where $\beta$ and $\gamma$ are the related weights, $L_{ori\_img}$ and $L_{de\_img}$ are the respective loss values of the two groups, and $\mathrm{Loss}_{total}$ is the total loss value used to finally adjust the parameters of the deep-learning-based face multitask detection model.
Referring to fig. 4, fig. 4 is a schematic diagram illustrating specific settings of various parameters in the network architecture shown in fig. 2.
The face multi-task detection method based on deep learning provided by the embodiment of the invention has the following technical effects:
(1) In the embodiment of the invention, the super-resolution neural network model enhances feature information while keeping the feature-map size unchanged and improves detection performance on small-target faces, so that small faces are more easily detected; compared with enlarging the input image, the increase in computing resources is very small;
(2) In the embodiment of the invention, the face multitask detection model based on deep learning improves the detection of faces of different sizes, making the detection results more accurate, and ensures that the size range of the face image does not change during feature enhancement in the detection process.
On the other hand, the embodiment of the invention also provides a face multitask detection system based on deep learning, which comprises the following steps:
the acquisition module is used for acquiring an original face image;
the normalization processing module is used for performing normalization processing on the original face image to obtain a first image;
the super-resolution module is used for inputting the first image into a super-resolution neural network model for processing to obtain a second image;
and the face multitask detection module is used for inputting the second image into a face multitask detection model based on deep learning for processing to obtain face frame coordinates and face key point coordinates.
Referring to fig. 5, an embodiment of the present invention further provides a face multitask detection device 200 based on deep learning, which specifically includes:
at least one processor 210;
at least one memory 220 for storing at least one program;
wherein the at least one program, when executed by the at least one processor 210, causes the at least one processor 210 to implement the method shown in fig. 1.
The memory 220, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs and non-transitory computer-executable programs. The memory 220 may include high speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 220 may optionally include remote memory located remotely from processor 210, and such remote memory may be connected to processor 210 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
It will be understood that the device structure shown in fig. 5 is not intended to be limiting of device 200, and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components.
In the apparatus 200 shown in fig. 5, the processor 210 may retrieve the program stored in the memory 220 and execute, but is not limited to, the steps of the embodiment shown in fig. 1.
The above-described embodiments of the apparatus 200 are merely illustrative, and the units illustrated as separate components may or may not be physically separate, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purposes of the embodiments.
Embodiments of the present invention also provide a computer-readable storage medium, which stores a program executable by a processor, and the program executable by the processor is used for implementing the method shown in fig. 1 when being executed by the processor.
The embodiment of the application also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and executed by the processor to cause the computer device to perform the method illustrated in fig. 1.
It will be understood that all or some of the steps and systems of the methods disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, as known to those skilled in the art, communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.