CN104506852B - Objective quality assessment method for video conference coding - Google Patents
Objective quality assessment method for video conference coding Download PDF
- Publication number
- CN104506852B (application CN201410826849A)
- Authority
- CN
- China
- Prior art keywords
- face
- eye
- mouth
- area
- nose
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention discloses an objective quality assessment method for video conference coding, comprising a training part and an evaluation part. The training part includes: Step 1, extracting the face and facial regions; Step 2, obtaining the degree of attention received by each pixel; Step 3, calibrating and normalizing the face region; Step 4, obtaining a Gaussian mixture model. The evaluation part includes: Step 1, for a set of videos, automatically extracting the number of pixels in the background, face, left-eye, right-eye, mouth and nose regions; Step 2, calibrating and normalizing the face region; Step 3, obtaining the weight map; Step 4, computing the peak signal-to-noise ratio based on the Gaussian mixture model and evaluating the image quality of the encoded video conference output. The invention avoids the shortcoming of traditional methods that ignore video content: by assigning more weight to the face in the video image, it improves the accuracy of image quality assessment and better reflects the results of subjective quality assessment.
Description
Technical Field
The invention relates to an objective quality assessment method for video conference coding, and belongs to the technical field of perceptual visual quality assessment of video conference coding.
Background
A visual quality metric is essential when evaluating the efficiency of different video coding schemes. Visual quality assessment for perceptual video coding falls into two categories: subjective assessment and objective assessment. Since humans are the ultimate recipients of video, subjective visual quality assessment is the most accurate and reliable way to evaluate video coding, but its low efficiency and high cost have driven the development of objective visual quality metrics. The purpose of an objective metric is to correlate well with subjective visual quality so that visual quality can be measured accurately. The most widely used objective metrics include peak signal-to-noise ratio (PSNR), structural similarity (SSIM), visual signal-to-noise ratio (VSNR), video quality metrics (VQM), and MOtion-based Video Integrity Evaluation (MOVIE).
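For reference, the following is a minimal sketch of the conventional PSNR mentioned above, assuming 8-bit frames stored as NumPy arrays; the weighted variant proposed later in this document is built on the same structure.

```python
import numpy as np

def psnr(ref, dist, bit_depth=8):
    """Conventional PSNR (in dB) between a reference frame and a distorted frame."""
    ref = ref.astype(np.float64)
    dist = dist.astype(np.float64)
    mse = np.mean((ref - dist) ** 2)          # unweighted mean squared error
    if mse == 0:
        return float("inf")                   # identical frames
    peak = (2 ** bit_depth) - 1               # 255 for 8-bit video
    return 10.0 * np.log10(peak ** 2 / mse)
```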
Perceptual video coding for video conferencing has been studied extensively, because the face is a region of interest (ROI) in video conferencing. However, no objective visual quality assessment method has been developed specifically for video conferencing.
Summary of the Invention
The purpose of the present invention is to overcome the shortcomings of existing objective video quality assessment methods by providing an objective metric for video conference coding that aims to improve the correlation with the viewer's subjective perceived quality.
An objective quality assessment method for video conference coding comprises a training part and an evaluation part.
The training part includes the following steps:
Step 1: extract the face and facial regions;
Step 2: run an eye-tracking experiment to obtain the coordinates of the subjects' fixation points on each video frame, and from them the degree of attention received by each pixel;
Step 3: calibrate and normalize the face region;
Step 4: obtain a Gaussian mixture model.
The evaluation part includes the following steps:
Step 1: for a set of videos, repeat Step 1 of the training part to automatically extract the number of pixels in the background, face, left-eye, right-eye, mouth and nose regions;
Step 2: repeat Step 3 of the training part to calibrate and normalize the face region;
Step 3: based on the Gaussian mixture model obtained during training, compute the weights of the right eye, left eye, mouth, nose, remaining face area and background, as well as the Gaussian distribution weights around these regions, to obtain the weight map;
Step 4: on the basis of the weight map, compute the peak signal-to-noise ratio based on the Gaussian mixture model and evaluate the image quality of the encoded video conference output.
The advantages of the present invention are:
(1) As an image quality assessment method for encoded video conferencing, the invention avoids the shortcoming of traditional methods that ignore video content; by assigning more weight to the face in the video image, it improves the accuracy of image quality assessment and better reflects the results of subjective quality assessment.
(2) Building on the extraction of facial regions (such as the nose and mouth), the invention assigns greater weight to key regions of the face, which matches the trend of ever-increasing resolution and display size in current and future video conferencing systems.
(3) By introducing eye-tracking experimental data and combining it with statistical learning tools, the invention uncovers regularities of human visual attention during video conferencing and applies them to the quality assessment of encoded video conference images, greatly improving the correlation with subjective quality assessment.
Brief Description of the Drawings
Figure 1 is a flowchart of the method of the present invention;
Figure 2: automatic facial feature alignment algorithm;
Figure 3: automatic extraction of key facial regions;
Figure 4: calibration and normalization method;
Figure 5: construction of the weight map;
Figure 6: schematic of the GMM-PSNR computation.
Detailed Description
The present invention is described in further detail below with reference to the accompanying drawings and embodiments.
The present invention adopts a real-time automatic facial feature alignment method to track the key feature points of the face. After face detection, a point distribution model (PDM) of the key features is fitted on each video frame by combining local detection (texture information) and global optimization (facial structure). In the present invention, a 66-point PDM is used to extract the face and its contour. The 66-point PDM samples the key points of the face and the facial features well, so these points can be connected to accurately extract the contours and regions of the face and facial features. Finally, the face and its key regions are extracted according to their contours.
Experiments on videos of conversational scenes show that the face attracts the vast majority of an observer's attention. Therefore, the unequal importance of the background, the face and the facial features is quantified according to the observer's attention, which improves the accuracy of objective quality assessment for video conferencing. To obtain these unequal importance values, several eye-tracking experiments were carried out on conference-related videos.
In the experiments, an eye tracker recorded the gaze points falling on the video frames while observers watched the videos. Gaze points represent the observer's focus of attention, so the eye-tracking results can be used to build a subjective attention model. After the eye-tracking experiment, the number of gaze points belonging to the right eye, left eye, mouth, nose, remaining face area and background was recorded. Based on the number of gaze points falling in the different regions, a new quantity, eye fixation points per pixel (EFP/P), is introduced to reflect the pixel-level attention received by each region.
Once the results of the eye-tracking experiments are obtained, they are used to train a GMM that produces an importance weight map for each video frame; GMM-PSNR can then be computed by incorporating the corresponding weight map. Before training the GMM, the gaze points obtained in the previous section are calibrated and normalized as a preprocessing step. The GMM is then trained on the calibrated and normalized gaze points with the expectation-maximization (EM) algorithm, running EM iterations until convergence. Given the parameters of the resulting GMM, the weight map can be computed to build the objective metric GMM-PSNR.
The present invention is an objective quality assessment method for video conference coding. The workflow is shown in Figure 1 and consists of a training part and an evaluation part.
The training part includes the following steps:
Step 1: extract the face and facial regions.
An automatic facial feature alignment algorithm is used to automatically extract the number of pixels in the background, face, left-eye, right-eye, mouth and nose regions of a given video conference sequence.
Specifically: first, the key points of the face region in each frame of the video conference sequence are obtained by the automatic facial feature alignment algorithm; second, mean shift is used to locally search the extracted face region for the key points of the left-eye, right-eye, mouth and nose regions, and these key points are matched against the point distribution model (PDM) in the database to optimize them; third, the optimized key points of the face, left eye, right eye, mouth and nose are obtained for each frame, 66 key points in total, as shown in Figure 2; fourth, the key points of the face, left eye, right eye, mouth and nose are connected to obtain the corresponding contours, as shown in Figure 3; fifth, the number of pixels inside the face, left-eye, right-eye, mouth and nose regions is counted, and the number of background pixels is obtained by subtracting the number of face pixels from the total number of image pixels, completing the automatic extraction of the key facial regions.
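The following sketch illustrates region extraction of this kind. It is not the patent's 66-point PDM tracker: it substitutes dlib's commonly distributed 68-point landmark model (the file name `shape_predictor_68_face_landmarks.dat` and the landmark index ranges follow the usual 68-point annotation) and approximates the face contour by the convex hull of all landmarks.

```python
import cv2
import dlib
import numpy as np

REGIONS = {                      # index ranges in the common 68-point annotation
    "right_eye": range(36, 42),
    "left_eye":  range(42, 48),
    "nose":      range(27, 36),
    "mouth":     range(48, 60),  # outer lip contour
}

def region_pixel_counts(frame_bgr, predictor_path="shape_predictor_68_face_landmarks.dat"):
    """Count the pixels in the background, face and facial-feature regions of one frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor(predictor_path)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)], dtype=np.int32)

    h, w = gray.shape
    masks = {}
    # Face region: convex hull of all landmarks stands in for the tracked face contour.
    face_mask = np.zeros((h, w), dtype=np.uint8)
    cv2.fillConvexPoly(face_mask, cv2.convexHull(pts), 255)
    masks["face"] = face_mask
    for name, idx in REGIONS.items():
        m = np.zeros((h, w), dtype=np.uint8)
        cv2.fillPoly(m, [pts[list(idx)]], 255)   # fill the connected contour of the region
        masks[name] = m

    counts = {name: int(np.count_nonzero(m)) for name, m in masks.items()}
    counts["background"] = h * w - counts["face"]  # background = frame minus face
    return counts, masks
```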
The point distribution model is trained on a set of standard test images using the mean-shift technique.
The key points of the face, left-eye, right-eye, mouth and nose regions can be extracted from different face images.
Step 2: run an eye-tracking experiment to obtain the coordinates of the subjects' fixation points on each video frame, and from them the degree of attention received by each pixel.
The degree of attention of a single region (left eye, right eye, mouth, nose, remaining face area, background) is defined as the number of eye fixation points divided by the number of pixels in that region (efp/p):

c_r = f_r / p_r, c_l = f_l / p_l, c_m = f_m / p_m, c_n = f_n / p_n, c_o = f_o / p_o, c_b = f_b / p_b

where c_r, c_l, c_m, c_n, c_o, c_b denote the per-pixel attention of the right eye, left eye, mouth, nose, remaining face area and background, respectively; f_r, f_l, f_m, f_n, f_o, f_b denote the number of the subjects' eye fixation points falling on the right eye, left eye, mouth, nose, remaining face area and background in the eye-tracking experiment; and p_r, p_l, p_m, p_n, p_o, p_b denote the number of pixels in the right eye, left eye, mouth, nose, remaining face area and background, respectively.
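As a direct transcription of the efp/p definition above, the per-region attention values can be computed as below; the region names and counts in the commented example are hypothetical placeholders.

```python
def efp_per_pixel(fixation_counts, pixel_counts):
    """Eye fixation points per pixel (efp/p) for each region: c = f / p."""
    return {region: fixation_counts[region] / pixel_counts[region]
            for region in fixation_counts if pixel_counts.get(region, 0) > 0}

# Hypothetical example:
# efp_per_pixel({"right_eye": 310, "background": 95},
#               {"right_eye": 800, "background": 250000})
# -> {"right_eye": 0.3875, "background": 0.00038}
```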
Step 3: calibrate and normalize the face region.
Calibration removes the uncertainty caused by the face appearing at different positions in the image, and normalization makes the invention applicable when the number of face pixels differs between video conference frames.
The specific procedure is as follows:
As shown in Figure 4(a), one frame is selected at random and the leftmost of its face-region key points is taken as the calibration origin B. For every other frame, the leftmost face-region key point A is found, the coordinate transformation between A and B is computed, and the fixation points of that frame are transformed accordingly, completing the calibration.
As shown in Figure 4(b), one frame is selected at random and the horizontal extent of the subject's right eye (the distance between the points on the right and left sides of the right eye among the 66 points) is taken as the normalization unit; the fixation points of the other frames are then normalized by this unit.
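A minimal sketch of the calibration and normalization described for Figure 4, assuming per-frame landmark and fixation arrays; the reference origin and eye-width unit follow the text, while details such as applying the shift before the scaling are assumptions.

```python
import numpy as np

def calibrate_and_normalize(fixations, landmarks, ref_origin, eye_width):
    """
    fixations:  (N, 2) array of (x, y) gaze points for one frame
    landmarks:  (66, 2) array of face key points for the same frame
    ref_origin: leftmost face key point (B) of the randomly chosen reference frame
    eye_width:  horizontal extent of the right eye in the reference frame (pixels)
    """
    fixations = np.asarray(fixations, dtype=np.float64)
    origin = landmarks[np.argmin(landmarks[:, 0])]             # leftmost key point (A) of this frame
    shifted = fixations - origin + np.asarray(ref_origin, dtype=np.float64)  # align A with B
    return shifted / float(eye_width)                          # one unit = reference right-eye width
```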
Step 4: obtain the Gaussian mixture model.
Assuming that the eye fixation points follow a Gaussian mixture model, on the basis of the normalized and calibrated eye-tracking data the model is written as a linear superposition of Gaussian components:

p(x*) = Σ_{k=1}^{K} π_k N(x* | μ_k, Σ_k)

where N(x* | μ_k, Σ_k) denotes a Gaussian component; π_k, μ_k and Σ_k are the mixing coefficient, mean and covariance of the k-th Gaussian component; x* denotes a two-dimensional calibrated and normalized eye fixation point; and K is the number of Gaussian components in the GMM. Since the nose receives far fewer fixation points than the eyes and the mouth, the number of Gaussian components K is set to 3 here, corresponding to the right eye, the left eye and the mouth. Meanwhile, μ_k is set to the normalized centroid of each facial feature.
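A sketch of this training step using scikit-learn's EM implementation. Note one deviation: the text fixes μ_k to the normalized facial-feature centroids, whereas `GaussianMixture` only uses `means_init` as an initialization and lets EM update the means, so this is an approximation of the described procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_fixation_gmm(points, feature_centroids):
    """
    Fit a 3-component GMM (right eye, left eye, mouth) to calibrated and
    normalized fixation points with the EM algorithm.
    points:            (N, 2) fixation coordinates
    feature_centroids: (3, 2) normalized centroids of the three facial features
    """
    gmm = GaussianMixture(n_components=3, covariance_type="full",
                          means_init=np.asarray(feature_centroids), max_iter=200)
    gmm.fit(np.asarray(points, dtype=np.float64))
    # gmm.weights_, gmm.means_, gmm.covariances_ correspond to pi_k, mu_k, Sigma_k
    return gmm
```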
In an offline stage, the above steps are applied to a set of training videos; by designing the eye-tracking experiment and analyzing its data, the Gaussian mixture model used to assess the objective quality of the video conferencing system is obtained.
(2) The evaluation part includes the following steps:
Step 1: as in Step 1 of the training part, automatically extract the background, face, left-eye, right-eye, mouth and nose regions.
Step 2: as in Step 3 of the training part, calibrate and normalize the face region of the video. See Figure 4 for details.
Step 3: based on the Gaussian mixture model obtained during training, compute the weights of the right eye, left eye, mouth, nose, remaining face area and background, as well as the Gaussian distribution weights around these regions. See Figure 5 for details.
Figure 5 illustrates how the weight map is constructed. In this embodiment, the weight map quantifies the importance of each pixel of the face and the background in the video conferencing system. The input is one frame of a video conference. First, the face and its key regions are automatically extracted as in Figure 3. Second, the key points in the video are calibrated and normalized as in Figure 4. Finally, using the parameters obtained from the GMM training in Steps 2 and 4 of the training part in Figure 1, the weight of each pixel is computed according to the region it belongs to (background, face, left eye, right eye, nose or mouth), and the weight map of the video conference image is output; the weight values set the importance of each image pixel in the quality assessment.
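The exact per-pixel weighting formula used by the patent is not reproduced in this text, so the following is only one plausible stand-in: evaluate the trained GMM density at every pixel position (after mapping it into the calibrated, normalized coordinate frame) and rescale so the average weight is 1.

```python
import numpy as np

def gmm_weight_map(gmm, height, width, to_normalized):
    """
    Illustrative weight map: GMM density per pixel, rescaled to mean 1.
    to_normalized: caller-supplied function mapping an (N, 2) array of pixel
                   coordinates into the calibrated, normalized frame of the GMM.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    norm_coords = to_normalized(coords)                  # per-frame calibration/normalization
    density = np.exp(gmm.score_samples(norm_coords))     # GMM density at each pixel
    wmap = density.reshape(height, width)
    return wmap / wmap.mean()                            # average weight of 1
```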
The present invention is not limited to this particular way of setting the pixel weights.
Step 4: on the basis of the weight map, compute the peak signal-to-noise ratio based on the Gaussian mixture model (GMM-PSNR) to evaluate the image quality of the encoded video conference output. See Figure 6 for details.
Figure 6 illustrates the GMM-PSNR computation. In this embodiment, the computation outputs a GMM-PSNR value that measures the image quality of the encoded video conference output. First, as in traditional metrics such as PSNR, the residual between the original video image and the image under evaluation is obtained by computing their squared error. Next, the squared error is weighted by the weight map to obtain GMM-MSE. Finally, GMM-PSNR is computed by taking the logarithm. The specific computation and its formula are given in the description of Figure 1. The present invention is not limited to improving traditional PSNR; other metrics (such as structural similarity, SSIM) can also be improved by multiplying them with the weights in the weight map.
The specific formulas are as follows:

GMM-MSE = (1 / (M × N)) Σ_x w_x (I'_x − I_x)²
GMM-PSNR = 10 log₁₀( (2ⁿ − 1)² / GMM-MSE )

where I'_x and I_x are the values of pixel x in the processed and original video frames, respectively; w_x is the weight assigned to pixel x by the weight map; M and N are the numbers of pixels along the vertical and horizontal directions, respectively; and n (= 8) is the bit depth.
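A sketch of the weighted metric, following the GMM-MSE and GMM-PSNR formulas reconstructed above; the original frame, processed frame and weight map are assumed to be NumPy arrays of the same shape.

```python
import numpy as np

def gmm_psnr(ref, dist, weight_map, bit_depth=8):
    """GMM-PSNR (dB): PSNR computed from the weight-map-weighted MSE."""
    ref = ref.astype(np.float64)
    dist = dist.astype(np.float64)
    gmm_mse = np.mean(weight_map * (dist - ref) ** 2)   # GMM-MSE
    if gmm_mse == 0:
        return float("inf")
    peak = (2 ** bit_depth) - 1
    return 10.0 * np.log10(peak ** 2 / gmm_mse)
```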
Finally, the present invention outputs the Gaussian-mixture-model-based peak signal-to-noise ratio (GMM-PSNR) of the encoded video conference output, which measures the degradation of image quality caused by video coding. Like the traditional peak signal-to-noise ratio (PSNR), GMM-PSNR is measured in dB. However, because viewers pay different amounts of attention to different regions of the image, GMM-PSNR assigns different weights to face regions of differing importance in the video conferencing system, which greatly improves its correlation with subjective quality assessment.
The present invention provides a more effective way to evaluate the quality of video transmission in video conferencing. Tests show that, compared with traditional objective video assessment methods such as VQM, MOVIE and PSNR, GMM-PSNR significantly improves the correlation with subjective measures such as MOS and DMOS, demonstrating that GMM-PSNR is a more effective objective metric for video conference coding. This benefits the video processing, compression and communication of video conferencing: it can monitor the performance of a video system and provide feedback for adjusting codec or channel parameters so that video quality stays within an acceptable range. The quality assessment criterion can also be used in the design, evaluation and optimization of codec performance, as well as in designing and optimizing digital video systems that conform to the human visual model.
The present invention relates to an objective quality assessment method for video sequences, used for perceptual visual quality assessment of video conference coding. It employs eye-tracking experiments and real-time techniques for extracting the face and facial features. In the experiments, the importance of the background, the face and the facial feature regions is determined from the observer's attention to each part. Using the gaze points collected by the eye tracker and assuming that they follow a Gaussian mixture distribution, an importance weight map is generated that captures the observer's attention to each region of the conference video. Based on this weight map, each pixel of a video frame can be assigned a different weight, thereby improving existing objective video quality assessment methods. More specifically, the present invention relates to perceptual video quality assessment of video conference coding built on top of existing video quality assessment methods.
Claims (3)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410826849.4A (CN104506852B) | 2014-12-25 | 2014-12-25 | Objective quality assessment method for video conference coding |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN104506852A | 2015-04-08 |
| CN104506852B | 2016-08-24 |
Family
ID=52948564
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201410826849.4A | Objective quality assessment method for video conference coding | 2014-12-25 | 2014-12-25 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN104506852B (en) |
Legal Events

| Date | Code | Title |
|---|---|---|
| | C06 | Publication |
| | PB01 | Publication |
| | C10 | Entry into substantive examination |
| | SE01 | Entry into force of request for substantive examination |
| | C14 | Grant of patent or utility model |
| | GR01 | Patent grant |