CN1156248C - Method for detecting moving human face

Info

Publication number: CN1156248C
Application number: CNB011204281A
Authority: CN (China)
Original language: Chinese (zh)
Other versions: CN1325662A
Inventors: 徐光祐 (Xu Guangyou), 彭振云 (Peng Zhenyun)
Assignee (original and current): Tsinghua University
Application filed by Tsinghua University
Legal status: Expired - Fee Related

Landscapes

  • Image Analysis (AREA)

Abstract

The present invention relates to a method for detecting facial features in moving images. Face images are captured to form a training set; principal component analysis, the Hough transform, and related operations are applied so that the positions and sizes of candidate eyes match those of the eyes in the training-set images; the candidates are projected into a characteristic-eye (eigen-eye) subspace, and the candidate pair with the smallest error between the original eyes and their projection is taken as the detection result. The exact positions of the mouth corners, nostrils, and nose tip are then obtained by integral projection. Compared with existing methods, detection speed is improved by a factor of 225 and accuracy by 1.27 percentage points.

Description

Method for detecting facial features in moving images
Technical field:
The invention relates to a method for detecting facial features in moving images, and belongs to the technical field of computer vision.
Background art:
Existing face feature detection methods operate on still images. In "Robust detection of facial features by generalized symmetry" (Proceedings of the 11th International Conference on Pattern Recognition, 1992, pp. 117-120), D. Reisfeld and Y. Yeshurun proposed a typical still-image face feature detection method. Its principle is as follows: based on the local and global symmetry of the human face, a complex measure of symmetry (the symmetry magnitude) is defined; this measure is then computed for each edge point in the image by iterating an energy function, and the points with maximum symmetry are taken as feature points. The method can detect the pupils and mouth corners in a face with an accuracy of about 95%, and the detection time is about 3 minutes per image. Its main disadvantages are: (1) because prior knowledge of the human face is not fully exploited, the computation cost is large and the detection speed is low, making the method unsuitable for real-time applications such as visual communication and contactless computer operation; (2) since only the information in a single still image is used, the search result can be neither verified nor corrected; (3) only the feature points of a single image can be retrieved, so the method cannot be applied to moving images.
Summary of the invention:
The aim of the invention is to provide a face feature detection method for moving images that quickly and accurately locates the two pupils, two mouth corners, two nostrils, and the nose tip of a face in a moving image, thereby overcoming the low speed and low accuracy of still-image face detection methods. The detected results can be used in applications such as face recognition, visual communication, image coding, and contactless computer operation.
The invention provides a method for detecting facial features in moving images, comprising the following steps:
1. Capture 300 face images of different sexes, ages, poses, and illumination to form a training set, and geometrically calibrate the eyes in the training-set images by a homogeneous transformation so that the sizes and positions of the eyes are exactly consistent across images.
2. Perform principal component analysis on the calibrated eyes in the training-set images to obtain a set of feature vectors, called characteristic eyes (eigen-eyes), which span a characteristic eye subspace.
3. For a face image of a test subject, first obtain several candidate eyes by the Hough transform; geometrically calibrate each pair of candidate eyes with a homogeneous transformation so that their positions and sizes match those of the eyes in the training-set images; then project the candidates into the characteristic eye subspace, and take the candidate pair with the smallest error between the original eyes and their projection as the detection result.
4. Once the eye positions are determined by the above steps, estimate the mouth position from the structural characteristics of the face and obtain the exact positions of the mouth corners by integral projection; then estimate the nose position from the mouth and eye positions and accurately locate the nostrils and nose tip by integral projection.
5. If false detection or missed detection occurs, estimate the positions of the eyes, nose, and mouth in the current frame from the feature points in the previous frame according to motion smoothness and planar motion constraints.
The face detection method of the invention was tested on 50 image sequences with different poses, illumination, image sizes, sexes, ages, and backgrounds. The correct detection rate is 96.27%, and the average detection time is 40 seconds per sequence (each sequence contains 50 frames). Compared with the existing method, detection speed is improved by a factor of 225 and accuracy by 1.27 percentage points.
The invention can detect facial features in moving images in real time with an accuracy of 96.27%, and can be used in the following application fields. (1) Face recognition. Face recognition methods fall into two broad categories, image-based and feature-based. For the former, the feature points obtained by this method can be used to calibrate pose and guide image matching; for the latter, the facial features can be used directly as recognition criteria. (2) Visual communication. The biggest challenge in visual communication is resolving the conflict between channel bandwidth and the large amount of data to be transmitted. With this method, the sending end only needs to transmit a few key-frame images; for non-key frames it detects and transmits only the feature points, from which the receiving end restores the non-key frames. In this way, the required transmission bandwidth can be reduced by several orders of magnitude. (3) Moving image coding. Content-based coding and retrieval methods are becoming the new moving-picture compression standards (e.g., MPEG-4 and MPEG-7). Facial features are important image content, and this method can serve as an effective implementation of and supplement to such coding methods. (4) Contactless computer operation. In many situations, such as a disabled person operating a computer or nuclear reactor control, the user cannot operate the computer with a keyboard or mouse; the computer can instead be controlled by tracking the gaze point of the eyes. The method of the invention detects the facial feature points in real time, and the on-screen position the pupils are fixating is obtained from a three-dimensional geometric model and a calibrated camera model, so that the computer can respond accordingly.
Description of the drawings:
Fig. 1 shows the definition of the mouth region.
Fig. 2 shows the definition of the nose region.
Fig. 3 shows the feature-point distances used in the motion smoothness constraint.
Specific embodiments:
1. Geometric calibration
Capture 100 to 300 face images of different sexes, ages, poses, and illumination to form a training set. Through a homogeneous transformation, geometrically calibrate the eyes in the training-set images so that the sizes and positions of the eyes are exactly consistent across images. In the detection step below, the same geometric calibration is applied to the eyes in the face image under test, so that the relative positions of the two pupils are the same in the training images and the test image.
Assume the original image is I(x, y) and the two pupil positions are known to be $E_L(x_L, y_L)$ and $E_R(x_R, y_R)$, with the line joining the pupils making an angle $\theta$ with the horizontal axis. The image I(x, y) is transformed into I'(x', y') by the homogeneous transformation of Eq. (1), so that the pupils move to $E_{L0}(x_{L0}, y_{L0})$ and $E_{R0}(x_{R0}, y_{R0})$. $E_{L0}$ and $E_{R0}$ are fixed pupil positions with $y_{L0} = y_{R0}$, i.e. the pupil line is parallel to the horizontal axis.
$$[x', y', 1]^T = S\,T\,R\,[x, y, 1]^T \qquad (1)$$
where R, T, and S are a rotation, a translation, and a scale transformation, respectively:
$$R = \begin{bmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (2)$$
$$T = \begin{bmatrix} 1 & 0 & x_{L0} - x_L \\ 0 & 1 & y_{L0} - y_L \\ 0 & 0 & 1 \end{bmatrix} \qquad (3)$$
$$S = \begin{bmatrix} s & 0 & (1-s)\,x_{L0} \\ 0 & s & (1-s)\,y_{L0} \\ 0 & 0 & 1 \end{bmatrix}, \qquad s = \frac{d(E_{L0}, E_{R0})}{d(E_L, E_R)} \qquad (4)$$
where $d(\cdot,\cdot)$ denotes the distance between two points.
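As a concrete illustration of the calibration in Eqs. (1)-(4), the composite transform can be sketched in Python with NumPy. The function name, the exact composition of the translation and scale, and the sample pupil coordinates are illustrative assumptions, not taken verbatim from the patent:

```python
import numpy as np

def calibration_matrix(eL, eR, eL0, eR0):
    """Compose a rotate-translate-scale homogeneous transform that maps
    detected pupils eL, eR onto fixed positions eL0, eR0 (sketch of
    Eqs. 1-4; names and composition details are illustrative)."""
    xL, yL = eL
    xR, yR = eR
    theta = np.arctan2(yR - yL, xR - xL)          # angle of the pupil line
    c, s = np.cos(theta), np.sin(theta)
    # rotation: make the pupil line horizontal
    R = np.array([[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]])
    rL = R @ np.array([xL, yL, 1.0])              # left pupil after rotation
    # translation: move the rotated left pupil to its fixed position
    T = np.array([[1.0, 0.0, eL0[0] - rL[0]],
                  [0.0, 1.0, eL0[1] - rL[1]],
                  [0.0, 0.0, 1.0]])
    # scale about the fixed left pupil so pupil distances match
    k = np.hypot(eR0[0] - eL0[0], eR0[1] - eL0[1]) / np.hypot(xR - xL, yR - yL)
    S = np.array([[k, 0.0, (1 - k) * eL0[0]],
                  [0.0, k, (1 - k) * eL0[1]],
                  [0.0, 0.0, 1.0]])
    return S @ T @ R

# sample pupils on a tilted face, mapped to fixed positions (20,20), (60,20)
M = calibration_matrix((40.0, 60.0), (70.0, 100.0), (20.0, 20.0), (60.0, 20.0))
left = M @ np.array([40.0, 60.0, 1.0])
right = M @ np.array([70.0, 100.0, 1.0])
```

After the transform, both pupils land exactly on the fixed positions, so every calibrated eye region has the same geometry.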
2. Acquisition of the characteristic eye subspace
Perform principal component analysis on the calibrated eyes in the training-set images to obtain a set of feature vectors, called characteristic eyes, which span the characteristic eye subspace.
Assume that after calibration the eye region has size w × h = n pixels, so each eye image can be represented as an n-dimensional vector $i \in R^n$. Let the training set be $\{i_1, i_2, \ldots, i_m\}$, $i_k \in R^n$, $k = 1, 2, \ldots, m$.
First, the average image (i.e. average eye) of the training set is found:
$$\mu = \frac{1}{m}\sum_{k=1}^{m} i_k, \qquad \mu \in R^n \qquad (5)$$
Then, the covariance matrix of the training-set samples is calculated:
$$R = \frac{1}{m}\sum_{k=1}^{m}(i_k - \mu)(i_k - \mu)^T = AA^T, \qquad R \in R^{n \times n} \qquad (6)$$
where
$$A = [\,i_1 - \mu,\; i_2 - \mu,\; \ldots,\; i_m - \mu\,], \qquad A \in R^{n \times m} \qquad (7)$$
according to the Singular Value Decomposition (SVD) theorem, it is possible to pass through the matrix ATA∈Rm×mObtaining AA from the set of orthogonal feature vectorsT∈Rn×nOrthogonal feature vector set (u) of1,u2,...,ur). Will (u)1,u2,...,ur) Quadrature obtained after normalizationThe set of feature vectors is still represented as (u)1,u2,...,ur) This is exactly the eigenvector of the covariance matrix R of the training set.
In actual use, only the set of feature vectors (u) for which the following expression holds is taken1,u2,...,u1),
<math> <mrow> <munderover> <mi>&Sigma;</mi> <mrow> <mi>i</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>I</mi> </munderover> <mo>|</mo> <msub> <mi>&lambda;</mi> <mi>i</mi> </msub> <mo>|</mo> <mo>&GreaterEqual;</mo> <mn>0.95</mn> <munderover> <mi>&Sigma;</mi> <mrow> <mi>i</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>r</mi> </munderover> <mo>|</mo> <msub> <mi>&lambda;</mi> <mi>i</mi> </msub> <mo>|</mo> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>8</mn> <mo>)</mo> </mrow> </mrow> </math>
In algebraic sense, the covariance matrix R of the training set completely expresses all the information of the training set, and R can be used (u)1,u2,...,u1) Complete representation, therefore, if the selected training set includes human eye images in all cases, it can be considered as being composed of (u)1,u2,...,u1) The formed subspace can fully describe the human eye. That is, any human eye can use (u)1,u2,...,u1) Is expressed in linear combinations. We call (u)1,u2,...,u1) For characterizing the eye, it is called as (u)1,u2,...,u1) The constructed subspace is the characteristic eye subspace.
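The eigen-eye construction of Eqs. (5)-(8), including the small-matrix trick of diagonalizing $A^T A$ instead of $AA^T$, can be sketched as follows. The function name and the random stand-in "eye images" are illustrative assumptions:

```python
import numpy as np

def eigen_eyes(images, energy=0.95):
    """Sketch of Eqs. (5)-(8): mean eye, covariance via the m x m matrix
    A^T A, and truncation to the eigenvectors carrying 95% of the energy.
    `images` is an (m, n) array of flattened calibrated eye regions."""
    m, n = images.shape
    mu = images.mean(axis=0)                   # average eye, Eq. (5)
    A = (images - mu).T                        # A in R^{n x m}, Eq. (7)
    lam, v = np.linalg.eigh(A.T @ A)           # eigenpairs of the m x m matrix
    order = np.argsort(lam)[::-1]              # sort eigenvalues descending
    lam, v = lam[order], v[:, order]
    ratio = np.cumsum(np.abs(lam)) / np.sum(np.abs(lam))
    l = int(np.searchsorted(ratio, energy) + 1)  # smallest l meeting Eq. (8)
    U = A @ v[:, :l]                           # lift to eigenvectors of A A^T
    U /= np.linalg.norm(U, axis=0)             # orthonormal eigen-eyes
    return mu, U

rng = np.random.default_rng(0)
imgs = rng.random((10, 50))                    # 10 stand-in eyes of 50 pixels
mu, U = eigen_eyes(imgs)
```

The columns of `U` are the characteristic eyes; any calibrated eye region can then be approximated by its coordinates in this basis.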
Suppose an input image of size w × h is represented as $p \in R^n$. Projecting it into the characteristic eye subspace means writing
$$p = \sum_{i=1}^{l} c_i u_i = U\,(c_1, c_2, \ldots, c_l)^T \qquad (9)$$
Since the columns of U are orthonormal,
$$(c_1, c_2, \ldots, c_l)^T = U^T p \qquad (10)$$
Thus we obtain the projection of p in the characteristic eye subspace:
$$p' = \sum_{i=1}^{l} c_i u_i$$
The difference between p and p' is described by their correlation $\delta(p, p')$:
$$\delta(p, p') = \frac{E(pp') - E(p)\,E(p')}{\sigma(p)\,\sigma(p')} \qquad (11)$$
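The projection of Eqs. (9)-(10) and the correlation score of Eq. (11) can be sketched together. The orthonormal stand-in basis and the function name are illustrative, not from the patent:

```python
import numpy as np

def reconstruction_similarity(p, U):
    """Project candidate eye p onto the eigen-eye basis U (Eq. 10),
    rebuild it (Eq. 9), and score p vs. its reconstruction with the
    normalised correlation delta(p, p') of Eq. (11)."""
    c = U.T @ p                                # subspace coordinates, Eq. (10)
    p_rec = U @ c                              # reconstruction p'
    a, b = p - p.mean(), p_rec - p_rec.mean()  # centred copies
    delta = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return p_rec, delta

# stand-in orthonormal basis; an "eye" lying in the subspace scores delta ~ 1
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((30, 5)))
p_in = U @ np.array([3.0, -1.0, 2.0, 0.5, 1.0])
rec, delta = reconstruction_similarity(p_in, U)
```

A true eye reconstructs almost perfectly and yields $\delta$ near 1, while a non-eye region loses energy outside the subspace and scores lower.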
3. Eye detection
First, several candidate eyes are obtained from the face image of the test subject. Each pair of candidate eyes is geometrically calibrated with a homogeneous transformation so that their positions and sizes match those of the eyes in the training-set images, and then projected into the characteristic eye subspace. Finally, the candidate pair with the smallest error between the original eyes and their projection is taken as the detection result.
Specifically, k candidate pupils $C_1, C_2, \ldots, C_k$ are obtained by the Hough transform, and a complete graph G is constructed with $C_1, C_2, \ldots, C_k$ as nodes. For the edge between $C_i$ and $C_j$, a benefit function B(i, j) is defined as follows:
$$B(i,j) = \big(k_1\,\delta(p_{ij}, p'_{ij}) + k_2\,\gamma(p_{ij}, p'_{ij})\big)\cdot D(i,j)\cdot A(i,j) \qquad (12)$$
where $k_1, k_2 \in [0, 1]$, $k_1 + k_2 = 1.0$; $p_{ij}$ is the eye region cut from the image with $C_i$ as the left pupil and $C_j$ as the right pupil; $p'_{ij}$ is the projection of $p_{ij}$ into the characteristic eye subspace; $\gamma(p_{ij}, p'_{ij})$ is a similarity and symmetry measure; $\delta(p_{ij}, p'_{ij})$ describes the authenticity of the eye (Eq. 11); and D(i, j) and A(i, j) are constraints on the inter-ocular distance and angle.
The pupil pair $(C_l, C_r)$ satisfying the following condition is taken as the correct pupil positions:
$$B(l,r) = \max_{i,j=1,2,\ldots,k} B(i,j) \;\ge\; k_1\,\delta_0 + k_2\,\gamma_0 \qquad (13)$$
where $\gamma_0$ is the similarity and symmetry threshold and $\delta_0$ is the eye-authenticity threshold. If no B(l, r) satisfies Eq. (13), the binarization threshold is increased and the detection is adaptively repeated.
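The pair-selection rule of Eqs. (12)-(13) can be sketched as an exhaustive search over candidate pairs. The scoring callables, weights, and thresholds below are stand-ins for the measures the patent defines, chosen only to make the sketch self-contained:

```python
import itertools

def best_pupil_pair(candidates, delta, gamma, D, A,
                    k1=0.5, k2=0.5, delta0=0.6, gamma0=0.6):
    """Score every ordered pair of candidate pupils with the benefit
    B(i,j) of Eq. (12) and keep the best pair if it clears the
    threshold k1*delta0 + k2*gamma0 of Eq. (13)."""
    best_score, best_pair = float("-inf"), None
    for i, j in itertools.permutations(range(len(candidates)), 2):
        b = (k1 * delta(i, j) + k2 * gamma(i, j)) * D(i, j) * A(i, j)
        if b > best_score:
            best_score, best_pair = b, (i, j)
    if best_score >= k1 * delta0 + k2 * gamma0:
        return best_pair, best_score
    return None, best_score   # caller may relax the binarisation threshold

# three stand-in candidates; the pair (0, 2) is made to score highest
pair, score = best_pupil_pair(
    candidates=[(10, 40), (55, 41), (52, 40)],
    delta=lambda i, j: 0.9 if (i, j) == (0, 2) else 0.3,
    gamma=lambda i, j: 0.8 if (i, j) == (0, 2) else 0.2,
    D=lambda i, j: 1.0,
    A=lambda i, j: 1.0)
```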
4. Mouth and nose detection
(1) Mouth corner detection
First, the mouth region is estimated from the pupil positions using anthropometric data. As shown in Fig. 1, if the two pupils are $C_l$ and $C_r$, the mouth region can be roughly estimated as the parallelogram ABCD. Horizontal and vertical integral projections are computed inside ABCD as follows:
$$H(y) = \sum_{x=AD(y)}^{BC(y)} I(x, y) \qquad (14)$$
$$V(x) = \sum_{y=AB(x)}^{DC(x)} I(x, y) \qquad (15)$$
where y = AB(x) and y = DC(x) are the line equations of AB and DC, and x = AD(y) and x = BC(y) are the line equations of AD and BC. H(y) is computed from the original image, while V(x) combines the vertical gradient map with the original image.
The valley point of histogram H(y) gives the vertical position of the mouth corners, and the two valley points of histogram V(x) on either side of its median give the horizontal positions of the mouth corners; the two mouth corners are thereby located.
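The integral projections of Eqs. (14)-(15) and the valley search can be sketched on a synthetic example. Note the simplifications: the patent sums over a parallelogram and mixes in a gradient map for V(x), while this sketch assumes an axis-aligned rectangle, raw intensities, and the middle column as the "median" split:

```python
import numpy as np

def mouth_corners(region):
    """Toy version of Eqs. (14)-(15): horizontal and vertical integral
    projections of a rectangular mouth region, with the corner
    coordinates read off the projection valleys."""
    H = region.sum(axis=1)                    # H(y), Eq. (14)
    V = region.sum(axis=0)                    # V(x), Eq. (15)
    y = int(np.argmin(H))                     # valley row: the mouth line
    mid = len(V) // 2                         # stand-in for the median split
    xl = int(np.argmin(V[:mid]))              # valley left of the split
    xr = mid + int(np.argmin(V[mid:]))        # valley right of the split
    return (xl, y), (xr, y)

# bright patch (9) with a darker mouth line (3) and darkest corners (0)
img = np.full((7, 11), 9.0)
img[3, 2:9] = 3.0
img[3, 2] = img[3, 8] = 0.0
corner_l, corner_r = mouth_corners(img)
```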
(2) Nostril and nose tip detection
The nostrils are detected as follows:
1) Roughly estimate the nose region from the mouth region (Fig. 2);
2) Obtain the baseline y = $y_n$ of the nose by integral projection;
3) The two nostrils $N_1(x_{n1}, y_n)$ and $N_2(x_{n2}, y_n)$ are the points lying on the baseline y = $y_n$ that satisfy:
$$S(x_{n1}) = \min_{x \in [x_3,\, x_m]} S(x) \qquad (16)$$
$$S(x_{n2}) = \min_{x \in [x_m,\, x_4]} S(x) \qquad (17)$$
where
$$S(x) = \sum_{(x,y) \in \mathrm{Circle}(x,\, y_n,\, r_n)} I'(x, y) \qquad (18)$$
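The circle-sum criterion of Eqs. (16)-(18) can be sketched as a search for the darkest circular window along the nose baseline. The image, radius, and search range below are illustrative stand-ins:

```python
import numpy as np

def nostril_x(image, y_n, r_n, x_range):
    """Sketch of Eqs. (16)-(18): slide a circle of radius r_n along the
    baseline y = y_n and return the x whose circle-summed intensity
    S(x) is smallest, i.e. the darkest spot (a nostril)."""
    dy, dx = np.mgrid[-r_n:r_n + 1, -r_n:r_n + 1]
    mask = dy * dy + dx * dx <= r_n * r_n     # pixels inside the circle
    def S(x):
        patch = image[y_n - r_n:y_n + r_n + 1, x - r_n:x + r_n + 1]
        return patch[mask].sum()
    return min(x_range, key=S)

# bright face patch with one dark 3x3 blob standing in for a nostril
img = np.full((20, 20), 9.0)
img[9:12, 6:9] = 0.0
x_found = nostril_x(img, y_n=10, r_n=2, x_range=range(3, 17))
```

Running this once per interval, $[x_3, x_m]$ and $[x_m, x_4]$, locates the two nostrils.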
5. Verification and correction of facial features in moving images
Features are detected in every frame of the moving image with the method above. If false detection or missed detection occurs, the feature-point positions in the current frame are estimated from the feature points in the previous frame according to motion smoothness and planar motion constraints. The specific procedure is as follows:
1) Starting from frame 1, detect features frame by frame in the manner described above until the variation between the features of 3 successive frames is less than a given threshold. These 3 frames are called reference frames, and their features are considered correct.
2) Given a reference frame, the feature detection steps for its neighboring frame (the target frame) are:
(1) Estimate the feature regions of the target frame from the reference-frame features.
(2) Detect the features of the target frame within the estimated regions, using the method described above.
(3) Verify the detection result with the motion smoothness constraint.
The principle of the smoothness constraint is that between two adjacent frames (the reference frame and the target frame), the head movement amplitude and the change in distance between camera and face are small, so the facial feature points should change little. As shown in Fig. 3, the variation of five distances between feature points across two adjacent images should be below a threshold; otherwise the detection is considered false.
(4) If the detected features do not satisfy the motion smoothness constraint, estimate the facial features of the target frame with a planar motion model.
The two pupils, two mouth corners, and two nostrils of a human face can be considered to lie approximately on a plane, which should satisfy a planar rigid-motion constraint between the two frames. Let $x = (x_1, x_2)$ be a feature point in the reference frame; the corresponding feature point $x' = (x'_1, x'_2)$ in the target frame can be estimated by Eqs. (19) and (20):
$$x'_1 = \frac{a_1 x_1 + a_2 x_2 + a_3}{a_7 x_1 + a_8 x_2 + 1} \qquad (19)$$
$$x'_2 = \frac{a_4 x_1 + a_5 x_2 + a_6}{a_7 x_1 + a_8 x_2 + 1} \qquad (20)$$
where $a_1, \ldots, a_8$ are the planar motion parameters. If 4 corresponding feature points in the reference frame and the target frame are known, $a_1, \ldots, a_8$ can be solved from Eq. (21), each point pair contributing two equations:
$$\begin{bmatrix} x_1 & x_2 & 1 & 0 & 0 & 0 & -x_1 x'_1 & -x_2 x'_1 \\ 0 & 0 & 0 & x_1 & x_2 & 1 & -x_1 x'_2 & -x_2 x'_2 \end{bmatrix} A = \begin{bmatrix} x'_1 \\ x'_2 \end{bmatrix} \qquad (21)$$
$$A = [\,a_1\; a_2\; a_3\; a_4\; a_5\; a_6\; a_7\; a_8\,]^T \qquad (22)$$
Following steps 1-5, 6 corresponding feature points (two pupils, two mouth corners, and two nostrils) are available between two adjacent frames. Choosing 4 of the 6 points gives $\binom{6}{4} = 15$ combinations, and Eq. (21) yields 15 sets of plane parameters $A_1, \ldots, A_{15}$. For each combination, the 6 feature points in the target frame can be estimated with Eqs. (19) and (20). The optimal plane parameters $A_{opt}$ are obtained as:
$$A_{opt} = \{A_i \mid \min_i \mathrm{Err}(A_i)\}, \qquad i = 1, \ldots, 15 \qquad (23)$$
where $\mathrm{Err}(A_i)$ is the estimation error:
$$\mathrm{Err}(A_i) = \max_j \big|\, x_j(A_i) - x^0_j \,\big|, \qquad j = 1, \ldots, 6 \qquad (24)$$
in which $x_j(A_i)$ is the j-th feature point estimated with $A_i$ and $x^0_j$ is the detected feature point.
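Solving Eq. (21) for the eight plane-motion parameters and predicting points with Eqs. (19)-(20) can be sketched as follows; the translation-only correspondences are an illustrative check, not patent data:

```python
import numpy as np

def plane_motion(src, dst):
    """Solve Eq. (21) for a1..a8 from four point correspondences
    (each pair gives two rows), then predict points via Eqs. (19)-(20)."""
    rows, rhs = [], []
    for (x1, x2), (y1, y2) in zip(src, dst):
        rows.append([x1, x2, 1, 0, 0, 0, -x1 * y1, -x2 * y1])
        rows.append([0, 0, 0, x1, x2, 1, -x1 * y2, -x2 * y2])
        rhs += [y1, y2]
    a = np.linalg.solve(np.array(rows, float), np.array(rhs, float))
    def predict(pt):
        x1, x2 = pt
        w = a[6] * x1 + a[7] * x2 + 1.0       # common denominator
        return ((a[0] * x1 + a[1] * x2 + a[2]) / w,
                (a[3] * x1 + a[4] * x2 + a[5]) / w)
    return a, predict

# pure translation by (5, -2): a trivial plane motion used as a sanity check
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
dst = [(x + 5.0, y - 2.0) for x, y in src]
a, predict = plane_motion(src, dst)
```

In the verification step, each of the 15 four-point combinations would produce one such parameter set, and Eqs. (23)-(24) pick the one whose predictions best match the detected points.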

Claims (1)

1. A method for detecting facial features in moving images, characterized by comprising the following steps:
(1) capturing 300 face images of different sexes, ages, poses, and illumination to form a training set, and geometrically calibrating the eyes in the training-set images by a homogeneous transformation so that the sizes and positions of the eyes are exactly consistent across images;
(2) performing principal component analysis on the calibrated eyes in the training-set images to obtain a set of feature vectors, called characteristic eyes, which span a characteristic eye subspace;
(3) for a face image of a test subject, first obtaining several candidate eyes by the Hough transform, geometrically calibrating each pair of candidate eyes with a homogeneous transformation so that their positions and sizes match those of the eyes in the training-set images, then projecting the candidates into the characteristic eye subspace, and finally taking the candidate pair with the smallest error between the original eyes and their projection as the detection result;
(4) after the eye positions are determined by the above steps, estimating the mouth position from the structural characteristics of the face and obtaining the exact positions of the mouth corners by integral projection, then estimating the nose position from the mouth and eye positions and accurately locating the nostrils and nose tip by integral projection;
(5) if false detection or missed detection occurs, estimating the positions of the eyes, nose, and mouth in the current frame from the feature points in the previous frame according to motion smoothness and planar motion constraints.
CNB011204281A 2001-07-13 2001-07-13 Method for detecting moving human face Expired - Fee Related CN1156248C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB011204281A CN1156248C (en) 2001-07-13 2001-07-13 Method for detecting moving human face


Publications (2)

Publication Number Publication Date
CN1325662A CN1325662A (en) 2001-12-12
CN1156248C true CN1156248C (en) 2004-07-07

Family

ID=4664123

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB011204281A Expired - Fee Related CN1156248C (en) 2001-07-13 2001-07-13 Method for detecting moving human face

Country Status (1)

Country Link
CN (1) CN1156248C (en)


Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7936902B2 (en) * 2004-11-12 2011-05-03 Omron Corporation Face feature point detection apparatus and feature point detection apparatus
JP2007094906A (en) * 2005-09-29 2007-04-12 Toshiba Corp Feature point detection apparatus and method
CN100347721C (en) * 2006-06-29 2007-11-07 南京大学 Face setting method based on structured light
JP5228307B2 (en) * 2006-10-16 2013-07-03 ソニー株式会社 Display device and display method
CN101169827B (en) * 2007-12-03 2010-06-02 北京中星微电子有限公司 Method and device for tracking characteristic point of image
JP4539729B2 (en) * 2008-02-15 2010-09-08 ソニー株式会社 Image processing apparatus, camera apparatus, image processing method, and program
CN101339606B (en) * 2008-08-14 2011-10-12 北京中星微电子有限公司 Human face critical organ contour characteristic points positioning and tracking method and device
CN101360246B (en) * 2008-09-09 2010-06-02 西南交通大学 Video error concealment method combined with 3D face model
CN102043966B (en) * 2010-12-07 2012-11-28 浙江大学 Face recognition method based on combination of partial principal component analysis (PCA) and attitude estimation
CN102163240A (en) * 2011-05-20 2011-08-24 苏州两江科技有限公司 Method for constructing human face characteristic image index database based on MPEG-7 (Motion Picture Experts Group-7) standard
CN107506682A (en) * 2016-06-14 2017-12-22 掌赢信息科技(上海)有限公司 A kind of man face characteristic point positioning method and electronic equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8294776B2 (en) 2006-09-27 2012-10-23 Sony Corporation Imaging apparatus and imaging method
US9179057B2 (en) 2006-09-27 2015-11-03 Sony Corporation Imaging apparatus and imaging method that acquire environment information and information of a scene being recorded

Also Published As

Publication number Publication date
CN1325662A (en) 2001-12-12

Similar Documents

Publication Publication Date Title
Ploumpis et al. Combining 3d morphable models: A large scale face-and-head model
CN108549873B (en) Three-dimensional face recognition method and three-dimensional face recognition system
Chattopadhyay et al. SURDS: Self-supervised attention-guided reconstruction and dual triplet loss for writer independent offline signature verification
US6580810B1 (en) Method of image processing using three facial feature points in three-dimensional head motion tracking
CN100565583C (en) Face feature point detection device, feature point detection device
US7512255B2 (en) Multi-modal face recognition
JP4238542B2 (en) Face orientation estimation apparatus, face orientation estimation method, and face orientation estimation program
CN1156248C (en) Method for detecting moving human face
CN105487665B (en) A kind of intelligent Mobile Service robot control method based on head pose identification
US20060245639A1 (en) Method and system for constructing a 3D representation of a face from a 2D representation
US20160314345A1 (en) System and method for identifying faces in unconstrained media
Martin et al. On the design and evaluation of robust head pose for visual user interfaces: Algorithms, databases, and comparisons
CN1794265A (en) Method and device for distinguishing face expression based on video frequency
CN1781123A (en) System and method for tracking a global shape of an object in motion
CN101964064A (en) Human face comparison method
CN102654903A (en) Face comparison method
CN110603570B (en) Object recognition method, device, system, and program
CN101968846A (en) Face tracking method
CN106600626A (en) Three-dimensional human body movement capturing method and system
WO2022042203A1 (en) Human body key point detection method and apparatus
CN1561499A (en) Head motion estimation from four feature points
CN106096517A (en) A kind of face identification method based on low-rank matrix Yu eigenface
WO2015165227A1 (en) Human face recognition method
CN117095033A (en) A multi-modal point cloud registration method based on image and geometric information guidance
CN119137632A (en) Probabilistic Keypoint Regression with Uncertainty

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
C19 Lapse of patent right due to non-payment of the annual fee
CF01 Termination of patent right due to non-payment of annual fee