Three-channel feature fusion face recognition method
Technical Field
The invention belongs to the technical field of face recognition, and discloses a three-channel feature fusion face recognition method.
Background
With the rapid development of society, people demand automatic identity recognition in almost every aspect of daily life. Current identification technologies mainly comprise password authentication, fingerprint identification, face identification, iris identification, gait identification and the like. Because face recognition has the advantages of being contactless, secure, convenient and fast, it has gradually been accepted by the public.
Especially in recent years, computing speed has improved greatly as computer hardware has been continuously updated. Convolutional neural networks, after several waves of rise and decline, have attracted renewed attention, and more and more people have devoted themselves to researching and improving face recognition algorithms. Traditional face recognition algorithms test recognition accuracy under rigid conditions and can obtain a good recognition effect; under non-rigid conditions, recognition accuracy drops greatly owing to illumination, face pose changes, occlusion and algorithm defects. The multi-task convolutional neural network (MTCNN) that has appeared in recent years can accurately detect the position of a face and mark its key points, fully preparing for the extraction of face features; the LBP operator can effectively reduce the influence of illumination changes on the input face image. How to effectively extract face features, however, is crucial to improving face recognition accuracy. Methods that only preprocess the input image without changing the network structure cannot effectively improve recognition accuracy; the two-channel model changes the structure of the neural network model, and although the improved model can effectively classify input facial expressions, its classification accuracy is greatly affected by non-rigid factors of the input images such as illumination changes and occlusion; combining the traditional LBP operator with a Deep Belief Network (DBN) as the feature extraction module, where the LBP-processed image is input into the DBN, achieves high recognition accuracy on different data sets, but the LBP-processed image loses most of the global feature information of the original image and cannot effectively embody the overall features of the input image.
Disclosure of Invention
The invention aims to provide a three-channel feature fusion face recognition method, which solves the problem that the traditional face recognition method in the prior art cannot extract all-round features of a face, and improves the face recognition accuracy.
The technical scheme adopted by the invention is as follows:
a three-channel feature fusion face recognition method comprises the following specific steps:
step 1, collecting different face images to form a data set; preprocessing each face image in the data set to obtain a preprocessed image, wherein the preprocessed image is the face image which is subjected to face correction after irrelevant background information is removed; forming all the preprocessed images into a preprocessed image set;
step 2, establishing a BP neural network model based on three-channel feature fusion, wherein the BP neural network model based on the three-channel feature fusion comprises three parallel feature extraction channels which are respectively a coarse sampling channel, an LBP channel and a fine sampling channel;
step 3, training a BP neural network model based on three-channel feature fusion by utilizing a preprocessed image set;
step 4, inputting an image to be recognized, performing feature similarity comparison on the image to be recognized and images in a training set by using a trained BP neural network model based on three-channel feature fusion, and outputting the image with the highest similarity and the similarity of the image;
and 5, setting a threshold, comparing the similarity output in the step 4 with the threshold, further judging whether the image output in the step 4 and the image to be identified are the same person, and outputting a result.
The present invention is also characterized in that,
the preprocessing operation in step 1 comprises the following specific steps:
step 1.1, inputting a face image;
step 1.2, cropping the face in the face image and removing redundant information such as the background to obtain a background-free face image;
and step 1.3, carrying out eye key point marking on the face image without the background, connecting the two eye key points, setting an included angle between a connecting line of the two eye key points and the horizontal direction as a, and rotating the face image anticlockwise by the angle a to obtain a preprocessed image.
In step 1.3, the key point coordinates are located by the Euclidean distance shown in formula (1):

Li = ||ŷi − yi||₂²  (1)

wherein Li represents the Euclidean distance used to locate the key points, ŷi represents the predicted face key point position, and yi represents the real face key point position;

the face key points are determined as in formula (2):

Y = min Σ(i=1 to N) bi·Li  (2)

a smaller Y indicates a smaller error between the predicted key point position ŷi and the true key point position yi, and the position with the smallest Y value is taken as the marked key point; wherein Y represents the position information of the final key point, N represents the number of training samples, and bi represents the sample label.
The angle a can be expressed as formula (3):

a = arctan((y2 − y1) / (x2 − x1))  (3)

wherein (x1, y1) is key point No. 1, the coordinate of the left-eye center, and (x2, y2) is key point No. 2, the coordinate of the right-eye center.
The BP neural network model based on three-channel feature fusion comprises three parallel feature extraction channels, the three feature extraction channels are respectively a coarse sampling channel, an LBP channel and a fine sampling channel, the output ends of the coarse sampling channel, the LBP channel and the fine sampling channel are all connected with a hidden layer, and the output end of the hidden layer is sequentially connected with a dimensionality reduction layer, a first full-connection layer, a second full-connection layer and a loss function layer.
The coarse sampling channel consists of three convolution layers and three pooling layers, and sequentially comprises the following steps: a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer, a third convolution layer and a third pooling layer;
the first convolution layer, the second convolution layer and the third convolution layer all adopt convolution kernels with a size of 5 × 5;
the first pooling layer, the second pooling layer and the third pooling layer all employ max pooling (Max_pooling); the pooling size is 2 × 2, the pooling step size is 2, and the padding mode is set to SAME.
The LBP channel feature extraction method comprises the following steps:
step 2.1, dividing the preprocessed image into a plurality of sub-images with the same size;
step 2.2, for each sub-image, converting the information of each pixel point into a pixel brightness value; let the brightness value of the pixel at the middle position be gc and the brightness values of the eight surrounding neighboring pixels be gi (i = 0, 1, …, 7); the brightness information is binarized using formula (8) and formula (9):

Bi = s(gi − gc), i = 0, 1, …, 7  (8)

s(x) = 1 if x ≥ 0; s(x) = 0 if x < 0  (9)

wherein x represents the difference gi − gc, Bi represents the binarized value of the i-th neighboring pixel, and s(x) is the resulting pixel value;
step 2.3, the binarized value on the right side of the pixel at the middle position is taken as an initial position, the obtained binarized value is written into an eight-bit binary number in a counterclockwise rotation mode, the binary number is converted into a decimal number, and the decimal number is the LBP value corresponding to the brightness of the pixel at the central point; and performing the operation on each pixel point in the input face image to finally obtain the local characteristic value of each pixel point of the input image.
The fine sampling channel consists of three convolution layers and three pooling layers, and sequentially comprises the following steps: a fourth convolution layer, a fourth pooling layer, a fifth convolution layer, a fifth pooling layer, a sixth convolution layer and a sixth pooling layer;
the fourth convolution layer, the fifth convolution layer and the sixth convolution layer are each formed by stacking 1 × 3 and 3 × 1 convolution kernels;
the fourth pooling layer, the fifth pooling layer and the sixth pooling layer all employ max pooling (Max_pooling); the pooling size is 2 × 2, the pooling step size is 2, and the padding mode is set to SAME.
The dimensionality reduction layer adopts a PCA dimensionality reduction operation to convert the fused features into one-dimensional feature vector information, with the following specific steps:
step 3.1, performing feature fusion on the face feature information of the three channels and converting it into matrix form; let the fused feature image be the matrix X of size n × m;
step 3.2, carrying out zero-mean processing on the matrix X and solving the covariance matrix H = (1/m)XXᵀ;
step 3.3, solving the eigenvalue of the matrix H and calculating the eigenvector corresponding to the eigenvalue;
step 3.4, taking the eigenvectors corresponding to the k largest eigenvalues and arranging them by rows into a matrix Q;
and step 3.5, obtaining the one-dimensional feature vector after dimension reduction by U = QX.
The loss function layer adopts the triplet loss function as the loss function, with the formula:

L = Σ(i=1 to N) [ ||f(ai) − f(pi)||₂² − ||f(ai) − f(ni)||₂² + α ]₊

In the formula, a represents a sample arbitrarily selected from the training set; p represents a randomly selected sample of the same class as a, called the positive sample; n represents a randomly selected sample of a different class, called the negative sample; α is the margin enforced between the positive-pair and negative-pair distances. For each sample in the triplet, a parameter-sharing network is trained to obtain the feature expressions of the three samples a, p and n, recorded as f(a), f(p) and f(n) respectively.
Through learning on the training set, the triplet loss minimizes the intra-class distance between the (a, p) features and maximizes the inter-class distance between the (a, n) features. The effect of the '+' sign in the formula is: if the value inside [·] is greater than 0, that value is taken as the loss; if it is less than or equal to 0, the loss is 0.
The step 5 specifically comprises the following steps: and setting a threshold, if the similarity of the image output in the step 4 is greater than the threshold, determining that the image to be recognized and the human image in the image output in the step 4 are the same person, otherwise, determining that the training set does not have the same human face information as the image to be recognized, wherein the setting range of the threshold is 0.73-0.78.
The invention has the advantages that
(1) Firstly, MTCNN is used to perform face detection and face key point marking on the input image and to remove irrelevant background information; on this basis, a rotation correction method is used to correct the face pose, which can improve the recognition accuracy.
(2) The three channels are used for face feature extraction, so that local features, edge features and contour features of a face and fine features of face organs can be obtained, and the global feature information of the face image can be accurately reflected.
(3) Since acquiring features of the input image with three feature extraction channels leads to a feature-dimension explosion that is unfavorable to subsequent operations, Principal Component Analysis (PCA) is used to reduce the dimension of the extracted features.
(4) The invention selects the triplet loss as the loss function and finally trains features with cohesion. The triplet loss function can distinguish same-class from different-class features, reducing the intra-class distance as much as possible while enlarging the inter-class distance. Cohesion plays an important role in face recognition, and a very good model can be trained with a small amount of data.
Drawings
FIG. 1 is a structural diagram of the BP neural network model based on three-channel feature fusion in the three-channel feature fusion face recognition method of the present invention;
FIG. 2 is a flow chart of a preprocessing of a three-channel feature fusion face recognition method of the present invention;
FIG. 3 is an LBP mapping chart in the three-channel feature fusion face recognition method of the present invention;
FIG. 4 is a graph of features extracted using different methods; wherein (a) is an original graph, (b) is a characteristic graph extracted by a Gabor method, (c) is a characteristic graph extracted by a Haar method, (d) is a characteristic graph extracted by an LBP method, and (e) is a characteristic graph extracted by the method;
FIG. 5 is another feature map extracted using a different method; wherein (a) is an original graph, (b) is a characteristic graph extracted by a Gabor method, (c) is a characteristic graph extracted by a Haar method, (d) is a characteristic graph extracted by an LBP method, and (e) is a characteristic graph extracted by the method;
fig. 6 shows the numbers of features extracted from the Lena picture by different methods, wherein (a) is the feature map extracted by the Gabor method, (b) is the feature map extracted by the Haar method, (c) is the feature map extracted by the LBP method, and (d) is the feature map extracted by the method of the present invention.
In the figure, 1, a first convolutional layer, 2, a first pooling layer, 3, a second convolutional layer, 4, a second pooling layer, 5, a third convolutional layer, 6, a third pooling layer, 7, a fourth convolutional layer, 8, a fourth pooling layer, 9, a fifth convolutional layer, 10, a fifth pooling layer, 11, a sixth convolutional layer, 12, a sixth pooling layer, 13, feature fusion, 14, a dimensionality reduction layer, 15, a first full-connection layer, 16, a second full-connection layer, 17, a loss function layer, 18, a fine sampling channel, 19, an LBP channel, and 20, a coarse sampling channel.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention discloses a three-channel feature fusion face recognition method, which comprises the following specific steps:
step 1, collecting different face images to form a data set; preprocessing each face image in the data set to obtain a preprocessed image, wherein the preprocessed image is the face image which is subjected to face correction after irrelevant background information is removed; forming all the preprocessed images into a preprocessed image set;
step 2, establishing a BP neural network model based on three-channel feature fusion, as shown in FIG. 1, wherein the BP neural network model based on three-channel feature fusion comprises three parallel feature extraction channels, and the three feature extraction channels are respectively a rough sampling channel 20, an LBP channel 19 and a fine sampling channel 18;
step 3, training a BP neural network model based on three-channel feature fusion by utilizing a preprocessed image set;
step 4, inputting an image to be recognized, performing feature similarity comparison on the image to be recognized and images in a training set by using a trained BP neural network model based on three-channel feature fusion, and outputting the image with the highest similarity and the similarity of the image;
and 5, setting a threshold, comparing the similarity output in the step 4 with the threshold, further judging whether the image output in the step 4 and the image to be identified are the same person, and outputting a result.
In step 1, as shown in fig. 2, the preprocessing operation specifically comprises the following steps:
step 1.1, inputting a face image;
step 1.2, cropping the face in the face image and removing redundant information such as the background to obtain a background-free face image;
and step 1.3, carrying out eye key point marking on the face image without the background, connecting the two eye key points, setting an included angle between a connecting line of the two eye key points and the horizontal direction as a, and rotating the face image anticlockwise by the angle a to obtain a preprocessed image.
In step 1.3, the key point coordinates are located by the Euclidean distance shown in formula (1):

Li = ||ŷi − yi||₂²  (1)

wherein Li represents the Euclidean distance used to locate the key points, ŷi represents the predicted face key point position, and yi represents the real face key point position;

the face key points are determined as in formula (2):

Y = min Σ(i=1 to N) bi·Li  (2)

a smaller Y indicates a smaller error between the predicted key point position ŷi and the true key point position yi, and the position with the smallest Y value is taken as the marked key point; wherein Y represents the position information of the final key point, N represents the number of training samples, and bi represents the sample label.
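As a quick numerical illustration of formulas (1) and (2), the following NumPy sketch (illustrative only; the batch size, number of key points and label values are hypothetical) computes the per-sample Euclidean error Li and the objective Y:

```python
import numpy as np

# Hypothetical predicted and real positions of 5 face key points for
# N = 3 training samples; shape (N, 5, 2) holds (x, y) coordinates.
y_pred = np.random.rand(3, 5, 2)
y_true = np.random.rand(3, 5, 2)
b = np.array([1.0, 1.0, 0.0])  # sample labels b_i

# Formula (1): L_i = ||y_hat_i - y_i||_2^2, the squared Euclidean distance
# between the predicted and real key point positions of sample i.
L = np.sum((y_pred - y_true) ** 2, axis=(1, 2))

# Formula (2): Y = sum_i b_i * L_i; a smaller Y means a smaller key point error.
Y = np.sum(b * L)
print(L, Y)
```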
The angle a can be expressed as formula (3):

a = arctan((y2 − y1) / (x2 − x1))  (3)

wherein (x1, y1) is key point No. 1, the coordinate of the left-eye center, and (x2, y2) is key point No. 2, the coordinate of the right-eye center.
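A minimal OpenCV sketch of the rotation correction of step 1.3 and formula (3), assuming the two eye centers have already been located (for example by MTCNN); the file path and eye coordinates are hypothetical:

```python
import math
import cv2

img = cv2.imread("face.jpg")   # background-free face image (hypothetical path)
x1, y1 = 62.0, 80.0            # key point No. 1: left-eye center (assumed)
x2, y2 = 118.0, 92.0           # key point No. 2: right-eye center (assumed)

# Formula (3): angle a between the eye connecting line and the horizontal.
a = math.degrees(math.atan2(y2 - y1, x2 - x1))

# Rotate counterclockwise by a around the midpoint between the eyes so that
# the eye connecting line becomes horizontal (OpenCV treats positive angles
# as counterclockwise).
center = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
M = cv2.getRotationMatrix2D(center, a, 1.0)
aligned = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))
cv2.imwrite("face_aligned.jpg", aligned)
```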
The BP neural network model based on three-channel feature fusion comprises three parallel feature extraction channels, the three feature extraction channels are respectively a coarse sampling channel 20, an LBP channel 19 and a fine sampling channel 18, the output ends of the coarse sampling channel 20, the LBP channel 19 and the fine sampling channel 18 are all connected with a hidden layer 13, and the output end of the hidden layer 13 is sequentially connected with a dimensionality reduction layer 14, a first full connection layer 15, a second full connection layer 16 and a loss function layer 17.
Wherein the rough sampling channel 20 is used for collecting rough edge and contour features in the preprocessed image;
the LBP channel 19 is used for acquiring a local characteristic map of a local input image of the preprocessed image;
wherein the fine sampling channel 18 is used for acquiring the fine features of the face organs in the preprocessed image;
the hidden layer 13 is used for fusing and converting the features extracted by the three sampling channels into a matrix form.
The rough sampling channel 20 is composed of three convolutional layers and three pooling layers, and sequentially comprises: a first convolutional layer 1, a first pooling layer 2, a second convolutional layer 3, a second pooling layer 4, a third convolutional layer 5, and a third pooling layer 6;
the first convolution layer 1, the second convolution layer 3 and the third convolution layer 5 all adopt convolution kernels with a size of 5 × 5;
the first pooling layer 2, the second pooling layer 4 and the third pooling layer 6 all employ max pooling (Max_pooling); the pooling size is 2 × 2, the pooling step size is 2, and the padding mode is set to SAME.
The formula for extracting features of an image by using the coarse sampling channel 20 is as follows:
let the input image size be H1 × H2, the convolution kernel size be F1 × F2, and the output image size be W1 × W2. The convolutional layer formula is shown in formula (4):

W1 × W2 = (H1 − F1 + 1) × (H2 − F2 + 1)  (4)

The picture output by the convolution layer, of size W1 × W2, is then input into a pooling layer with a pooling size of F3 × F4 and a pooling step size of S. When padding is in Valid mode, the image size after the pooling layer is calculated as in formula (5):

W1′ × W2′ = ((W1 − F3)/S + 1) × ((W2 − F4)/S + 1)  (5)

When padding is in Same mode, the image size after the pooling layer is calculated as in formula (6), where Fi (i = 3, 4) is the pooling size and P is the size of the padding around the image:

W1′ × W2′ = ((W1 − F3 + 2P)/S + 1) × ((W2 − F4 + 2P)/S + 1)  (6)
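The following helper functions, a sketch of formulas (4)–(6) rather than part of the patent, trace concrete sizes through one convolution/pooling block; the 64 × 64 input size is taken from the experiment described later:

```python
def conv_valid(h, k):
    # Formula (4): output size of a stride-1 convolution without padding.
    return h - k + 1

def pool_valid(w, f, s):
    # Formula (5): output size of a pooling layer in Valid mode.
    return (w - f) // s + 1

def pool_padded(w, f, s, p):
    # Formula (6): output size of a pooling layer with padding P around the image.
    return (w - f + 2 * p) // s + 1

w = conv_valid(64, 5)           # 64x64 input, 5x5 kernel -> 60
print(pool_valid(w, 2, 2))      # 2x2 pooling, stride 2, Valid -> 30
print(pool_padded(w, 2, 2, 1))  # 2x2 pooling, stride 2, P = 1  -> 31
```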
as shown in fig. 3, the LBP channel performs the feature extraction method as follows:
step 2.1, dividing the preprocessed image into a plurality of sub-images with the same size;
step 2.2, for each sub-image, converting the information of each pixel point into a pixel brightness value; let the brightness value of the pixel at the middle position be gc and the brightness values of the eight surrounding neighboring pixels be gi (i = 0, 1, …, 7); the brightness information is binarized using formula (8) and formula (9):

Bi = s(gi − gc), i = 0, 1, …, 7  (8)

s(x) = 1 if x ≥ 0; s(x) = 0 if x < 0  (9)

wherein x represents the difference gi − gc, Bi represents the binarized value of the i-th neighboring pixel, and s(x) is the resulting pixel value;
step 2.3, the binarized value on the right side of the pixel at the middle position is taken as an initial position, the obtained binarized value is written into an eight-bit binary number in a counterclockwise rotation mode, the binary number is converted into a decimal number, and the decimal number is the LBP value corresponding to the brightness of the pixel at the central point; and performing the operation on each pixel point in the input face image to finally obtain the local characteristic value of each pixel point of the input image.
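A compact NumPy sketch of the basic 3 × 3 LBP operator of steps 2.1–2.3; the counterclockwise neighbor ordering starting to the right of the center follows the description above, and the random test image is hypothetical:

```python
import numpy as np

def lbp_image(gray):
    # Threshold the 8 neighbors of every pixel against the center pixel
    # (formulas (8) and (9)) and pack the bits B_i into a decimal LBP value.
    g = gray.astype(np.int32)
    h, w = g.shape
    out = np.zeros((h - 2, w - 2), dtype=np.int32)
    # Neighbor offsets (dy, dx), counterclockwise from the right of the center.
    offsets = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
               (0, -1), (1, -1), (1, 0), (1, 1)]
    center = g[1:-1, 1:-1]
    for i, (dy, dx) in enumerate(offsets):
        neighbor = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        out += (((neighbor - center) >= 0).astype(np.int32)) << i  # B_i = s(g_i - g_c)
    return out.astype(np.uint8)

# Example: LBP map of a random 8-bit image patch.
print(lbp_image(np.random.randint(0, 256, (6, 6), dtype=np.uint8)))
```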
The fine sampling channel 18 consists of three convolution layers and three pooling layers, which are in sequence: a fourth convolution layer 7, a fourth pooling layer 8, a fifth convolution layer 9, a fifth pooling layer 10, a sixth convolution layer 11 and a sixth pooling layer 12;
the fourth convolution layer 7, the fifth convolution layer 9 and the sixth convolution layer 11 are each formed by stacking 1 × 3 and 3 × 1 convolution kernels;
the fourth pooling layer 8, the fifth pooling layer 10 and the sixth pooling layer 12 all employ max pooling (Max_pooling); the pooling size is 2 × 2, the pooling step size is 2, and the padding mode is set to SAME.
The feature extraction using the fine sampling channel 18 is the same as the feature extraction using the coarse sampling channel 20.
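For concreteness, the following Keras sketch assembles the two convolutional channels as described above (5 × 5 kernels in the coarse sampling channel 20; stacked 1 × 3 and 3 × 1 kernels in the fine sampling channel 18) and fuses them with a precomputed LBP-channel map. The filter counts, input size and embedding width are assumptions, not values specified by the invention, and the patent's PCA dimensionality reduction is replaced here by a dense projection purely to keep the sketch runnable:

```python
from tensorflow.keras import Model, layers

def coarse_channel(x):
    # Three 5x5 convolutions, each followed by 2x2 max pooling, stride 2, SAME.
    for filters in (32, 64, 128):  # filter counts are assumptions
        x = layers.Conv2D(filters, (5, 5), activation="relu", padding="same")(x)
        x = layers.MaxPooling2D((2, 2), strides=2, padding="same")(x)
    return x

def fine_channel(x):
    # Three blocks of stacked 1x3 and 3x1 convolutions, each followed by pooling.
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, (1, 3), activation="relu", padding="same")(x)
        x = layers.Conv2D(filters, (3, 1), activation="relu", padding="same")(x)
        x = layers.MaxPooling2D((2, 2), strides=2, padding="same")(x)
    return x

face = layers.Input(shape=(64, 64, 1), name="face")  # preprocessed image
lbp = layers.Input(shape=(64, 64, 1), name="lbp")    # LBP map of the same image

# Feature fusion of the three channels, then two fully connected layers.
fused = layers.Concatenate()([
    layers.Flatten()(coarse_channel(face)),
    layers.Flatten()(fine_channel(face)),
    layers.Flatten()(lbp),
])
x = layers.Dense(256, activation="relu")(fused)     # stand-in for PCA + FC1
embedding = layers.Dense(128, name="embedding")(x)  # FC2 feature output

model = Model(inputs=[face, lbp], outputs=embedding)
model.summary()
```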
The dimensionality reduction layer 14 adopts a PCA dimensionality reduction operation to convert the fused features into one-dimensional feature vector information, with the following specific steps:
step 3.1, performing feature fusion on the face feature information of the three channels and converting it into matrix form; let the fused feature image be the matrix X of size n × m;
step 3.2, carrying out zero-mean processing on the matrix X and solving the covariance matrix H = (1/m)XXᵀ;
step 3.3, solving the eigenvalue of the matrix H and calculating the eigenvector corresponding to the eigenvalue;
step 3.4, taking the eigenvectors corresponding to the k largest eigenvalues and arranging them by rows into a matrix Q;
and step 3.5, obtaining the one-dimensional feature vector after dimension reduction by U = QX.
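A NumPy sketch of steps 3.1–3.5, assuming the fused feature image is an n × m matrix X whose rows are zero-meaned before forming H = (1/m)XXᵀ; the matrix sizes below are hypothetical:

```python
import numpy as np

def pca_reduce(X, k):
    # Steps 3.2-3.5: zero-mean X, covariance H, top-k eigenvectors Q, U = QX.
    n, m = X.shape
    X = X - X.mean(axis=1, keepdims=True)   # step 3.2: zero-mean processing
    H = (X @ X.T) / m                       # covariance matrix H (n x n)
    eigvals, eigvecs = np.linalg.eigh(H)    # step 3.3: eigenvalues/eigenvectors
    idx = np.argsort(eigvals)[::-1][:k]     # step 3.4: k largest eigenvalues
    Q = eigvecs[:, idx].T                   # eigenvectors arranged by rows
    return Q @ X                            # step 3.5: U = QX (k x m)

X = np.random.rand(16, 8)   # hypothetical fused feature matrix, n = 16, m = 8
U = pca_reduce(X, k=4)
print(U.shape)              # (4, 8)
```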
The loss function layer 17 adopts the triplet loss function as the loss function, with the formula:

L = Σ(i=1 to N) [ ||f(ai) − f(pi)||₂² − ||f(ai) − f(ni)||₂² + α ]₊

In the formula, a represents a sample arbitrarily selected from the training set; p represents a randomly selected sample of the same class as a, called the positive sample; n represents a randomly selected sample of a different class, called the negative sample; α is the margin enforced between the positive-pair and negative-pair distances. For each sample in the triplet, a parameter-sharing network is trained to obtain the feature expressions of the three samples a, p and n, recorded as f(a), f(p) and f(n) respectively.
Through learning on the training set, the triplet loss minimizes the intra-class distance between the (a, p) features and maximizes the inter-class distance between the (a, n) features. The effect of the '+' sign in the formula is: if the value inside [·] is greater than 0, that value is taken as the loss; if it is less than or equal to 0, the loss is 0.
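A NumPy sketch of the triplet loss over a batch of triplets; the embedding dimension, batch size and margin value α are assumptions:

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, margin=0.2):
    # [ d(a,p) - d(a,n) + margin ]_+ summed over the batch: positive values
    # are kept as the loss, non-positive values contribute 0.
    d_pos = np.sum((f_a - f_p) ** 2, axis=1)  # intra-class distance (a, p)
    d_neg = np.sum((f_a - f_n) ** 2, axis=1)  # inter-class distance (a, n)
    return np.sum(np.maximum(d_pos - d_neg + margin, 0.0))

# Hypothetical 128-dimensional feature expressions f(a), f(p), f(n)
# produced by the parameter-sharing network for 4 triplets.
f_a, f_p, f_n = (np.random.rand(4, 128) for _ in range(3))
print(triplet_loss(f_a, f_p, f_n))
```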
Specifically, step 5 is to set a threshold, if the similarity of the image output in step 4 is greater than the threshold, the image to be recognized and the portrait in the image output in step 4 are determined to be the same person, otherwise, the training set is determined to have no facial information the same as that of the image to be recognized, and the set range of the threshold is 0.73-0.78.
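An illustrative sketch of steps 4 and 5, assuming the trained model yields feature vectors and that cosine similarity is used as the similarity measure (the patent does not name the measure, so this is an assumption); the threshold is taken from the 0.73–0.78 range given above:

```python
import numpy as np

def recognize(query, gallery, names, threshold=0.75):
    # Step 4: compare the query features with every training-set image and
    # find the most similar one; step 5: accept the match only if its
    # similarity exceeds the threshold, otherwise report no match.
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity to each gallery image
    best = int(np.argmax(sims))
    if sims[best] > threshold:
        return names[best], float(sims[best])   # same person
    return None, float(sims[best])    # no matching face in the training set

gallery = np.random.rand(5, 128)      # hypothetical training-set features
query = gallery[2] + 0.01 * np.random.rand(128)
print(recognize(query, gallery, ["p0", "p1", "p2", "p3", "p4"]))
```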
In order to verify the invention, face recognition simulation experiments were carried out on the ORL face database with five algorithms (CNN, LBP + DBN, rough learning + fine learning, HOG + SVM and the algorithm of the invention). The feature maps extracted by Gabor, Haar, LBP and the three-channel fusion method are compared, as shown in figs. 4 and 5. The Gabor and LBP algorithms can only extract local features of the face image; the Haar algorithm can only extract contour feature information of the face; the method of the invention fuses the extracted local and global features of the face into an output containing both, which can effectively improve the recognition accuracy. Fig. 6 shows the numbers of features extracted from the Lena picture by the four methods, in which the abscissa and ordinate represent the numbers of features extracted along the length and width of the input image respectively, and the height represents the feature intensity. Comparing the four feature extraction graphs in fig. 6 shows that the image features after three-channel feature fusion contain more feature information and higher feature intensity than those obtained by the other methods, thereby effectively improving the recognition accuracy.
The ORL face database contains 40 subjects of different ages, genders and races, with 10 pictures per person for a total of 400 pictures. The original picture size is 92 × 112, and the pictures were collected under different expressions, poses and illumination. In the experiment the original pictures were resized to 64 × 64; 200 pictures were selected as training samples and the remaining 200 as test samples.
TABLE 1 comparison of recognition time consumption and accuracy in ORL database
As can be seen from table 1, which compares the performance of the algorithm proposed by the invention with the comparison algorithms on the ORL database, the proposed algorithm is almost indistinguishable from the comparison algorithms in terms of face recognition time consumption: its average time consumption is only 0.0135 seconds higher than that of the four comparison algorithms (CNN, LBP + DBN, rough learning + fine learning and HOG + SVM). In recognition accuracy, however, there is a significant improvement: the average accuracy is 5.38% higher.