
CN115147889A - Method, apparatus, device and storage medium for processing a face pose in a video - Google Patents


Info

Publication number: CN115147889A (application CN202110351849.3A)
Authority: CN (China)
Prior art keywords: face, target object, feature point, point set, video
Legal status: Granted; active (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN115147889B (granted publication)
Inventor: 陈法圣
Current and original assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority: CN202110351849.3A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007 Scaling based on interpolation, e.g. bilinear interpolation
    • G06T7/00 Image analysis
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10016 Video; Image sequence
    • G06T2207/30196 Human being; Person
    • G06T2207/30201 Face
    • G06T2207/30241 Trajectory
    • G06T2207/30244 Camera pose


Abstract

The application provides a method, an apparatus, a device, and a computer-readable storage medium for processing a face pose in a video. The method comprises the following steps: performing face feature point detection on a plurality of video frames of a target video that contain a target object, to obtain a first feature point set comprising two-dimensional coordinates of the face feature points; performing three-dimensional reconstruction of the target object based on the first feature point set, to obtain the face pose of the target object and the three-dimensional coordinates of the face feature points of the target object; projecting the three-dimensional coordinates of the face feature points onto the camera imaging plane, to obtain a second feature point set comprising two-dimensional coordinates of the face feature points; and comparing the second feature point set with the first feature point set, and, when the error between the first feature point set and the second feature point set satisfies an error condition, taking the face pose obtained by the three-dimensional reconstruction as the target face pose of the target object in the target video. The method and apparatus improve the stability and accuracy of determining the target face pose.

Description

Method, apparatus, device and storage medium for processing a face pose in a video

Technical Field

The present application relates to the field of artificial intelligence, and in particular to a method, apparatus, device, and computer-readable storage medium for processing a face pose in a video.

Background

Artificial intelligence (AI) comprises theories, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence: to perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.

Computer vision (CV) is one of the artificial intelligence technologies. It uses cameras and computers, in place of human eyes, to recognize, track, and measure targets, and further performs graphics processing so that the result is an image better suited for observation by human eyes or for transmission to an instrument for inspection. Computer vision includes video processing.

In video processing, recognizing the face pose in the video frames of a video is an important technique. In the related art, the face pose in a video frame is usually estimated by a trained deep learning model; although this approach can identify the position and orientation of a face in a video frame, its accuracy and stability are poor.

SUMMARY OF THE INVENTION

The embodiments of the present application provide a method, apparatus, device, and computer-readable storage medium for processing a face pose in a video, which can improve the stability and accuracy of determining a target face pose.

The technical solutions of the embodiments of the present application are implemented as follows:

An embodiment of the present application provides a method for processing a face pose in a video, including:

performing face feature point detection on a plurality of video frames of a target video that contain a target object, to obtain a first feature point set comprising two-dimensional coordinates of the face feature points;

performing three-dimensional reconstruction of the target object based on the first feature point set, to obtain the face pose of the target object and the three-dimensional coordinates of the face feature points of the target object;

projecting the three-dimensional coordinates of the face feature points, to obtain a second feature point set comprising two-dimensional coordinates of the face feature points; and

comparing the second feature point set with the first feature point set, and, when the error between the first feature point set and the second feature point set satisfies an error condition, taking the face pose obtained by the three-dimensional reconstruction as the target face pose of the target object in the target video.
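The projection-and-compare step above can be sketched in NumPy. This is an illustrative reconstruction, not the patent's implementation: the function names, the toy intrinsics, and the 2.0-pixel error threshold are all hypothetical.

```python
import numpy as np

def project_points(points_3d, R, t, K):
    """Project 3D face feature points onto the camera imaging plane."""
    cam = points_3d @ R.T + t           # world -> camera coordinates (pose R, t)
    uv = cam @ K.T                      # camera -> homogeneous image coordinates
    return uv[:, :2] / uv[:, 2:3]       # perspective divide -> pixel coordinates

def pose_is_acceptable(detected_2d, reconstructed_3d, R, t, K, max_mean_error=2.0):
    """Compare the reprojected (second) point set against the detected (first) one."""
    reprojected = project_points(reconstructed_3d, R, t, K)
    mean_error = np.linalg.norm(reprojected - detected_2d, axis=1).mean()
    return mean_error <= max_mean_error

# Toy check: a pose that exactly generated the detections must be accepted.
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
R, t = np.eye(3), np.array([0.0, 0.0, 5.0])
pts3d = np.array([[0.0, 0, 0], [0.1, 0, 0], [0, 0.1, 0.02]])
pts2d = project_points(pts3d, R, t, K)
assert pose_is_acceptable(pts2d, pts3d, R, t, K)
```

The same `project_points` helper stands in for the later step of converting camera-coordinate-system points to image-coordinate-system points with the target camera intrinsic parameters.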

An embodiment of the present application provides an apparatus for processing a face pose in a video, including:

a detection module, configured to perform face feature point detection on a plurality of video frames of a target video that contain a target object, to obtain a first feature point set comprising two-dimensional coordinates of the face feature points;

a reconstruction module, configured to perform three-dimensional reconstruction of the target object based on the first feature point set, to obtain the face pose of the target object and the three-dimensional coordinates of the face feature points of the target object;

a projection module, configured to project the three-dimensional coordinates of the face feature points, to obtain a second feature point set comprising two-dimensional coordinates of the face feature points; and

a comparison module, configured to compare the second feature point set with the first feature point set, and, when the error between the first feature point set and the second feature point set satisfies an error condition, take the face pose obtained by the three-dimensional reconstruction as the target face pose of the target object in the target video.

In the above solution, the detection module is further configured to crop the face region of the target object from the plurality of video frames of the target video that contain the target object, to obtain a plurality of face region images;

perform face feature point detection on each face region image, to obtain the two-dimensional coordinates of the face feature points in each face region image;

perform a coordinate transformation on the two-dimensional coordinates of the face feature points in each face region image, to obtain the two-dimensional coordinates of the face feature points in each video frame; and

form the first feature point set from the two-dimensional coordinates of the face feature points in each video frame.
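The coordinate transformation from crop coordinates back to full-frame coordinates can be illustrated with a minimal sketch. The function name and the `scale` parameter (undoing a resize applied to the crop before detection) are assumptions for illustration:

```python
import numpy as np

def crop_to_frame(points_in_crop, crop_x, crop_y, scale=1.0):
    """Map 2D feature points detected inside a face-region crop back to
    full-video-frame coordinates. `scale` undoes any uniform resizing
    applied to the crop before detection."""
    pts = np.asarray(points_in_crop, dtype=float)
    return pts / scale + np.array([crop_x, crop_y], dtype=float)

# A point at (10, 20) in a crop taken at frame offset (100, 50),
# where the crop was enlarged 2x before detection:
frame_pts = crop_to_frame([[10.0, 20.0]], crop_x=100, crop_y=50, scale=2.0)
assert np.allclose(frame_pts, [[105.0, 60.0]])
```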

In the above solution, the detection module is further configured to perform face re-identification on the target video, to obtain the face track of the target object;

determine, according to the face track of the target object, the plurality of video frames of the target video that contain the target object, as well as the face position of the target object in each such video frame; and

perform face feature point detection on the face located at that face position in each video frame.

In the above solution, the detection module is further configured to perform face re-identification on the target video, determine each video frame in which the target object is recognized as a video frame containing the target object, and determine the face position of the target object in that video frame;

for a first video frame and a second video frame containing the target object, when the target object is recognized in none of at least one third video frame lying between the first video frame and the second video frame, and the number of such third video frames is less than a first number threshold, obtain the first face position of the target object in the first video frame and the second face position of the target object in the second video frame;

when the distance between the first face position and the second face position is less than a distance threshold, determine that the third video frames contain the target object; and

perform interpolation based on the first face position and the second face position, to obtain the face position of the target object in each of the at least one third video frame, so as to generate the face track of the target object.
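One simple way to realize this interpolation over a short detection gap is linear interpolation between the two bracketing face positions; a sketch (function name and the linear scheme are assumptions, since the patent text does not fix the interpolation method):

```python
import numpy as np

def interpolate_gap(pos_first, pos_second, n_missing):
    """Linearly interpolate the face position for n_missing frames lying
    between a first and a second frame where the face was detected."""
    p1 = np.asarray(pos_first, dtype=float)
    p2 = np.asarray(pos_second, dtype=float)
    # Fractions 1/(n+1) ... n/(n+1) place the points evenly inside the gap.
    fractions = np.arange(1, n_missing + 1) / (n_missing + 1)
    return p1 + fractions[:, None] * (p2 - p1)

# Two missing frames between detections at (0, 0) and (30, 60):
gap = interpolate_gap([0, 0], [30, 60], n_missing=2)
assert np.allclose(gap, [[10, 20], [20, 40]])
```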

In the above solution, the detection module is further configured to perform face re-identification on a plurality of video frames of the target video, determine each video frame in which the target object is recognized as a video frame containing the target object, and determine the face position of the target object in that video frame; and

for a first video frame and a second video frame containing the target object, when the target object is recognized in none of at least one third video frame lying between the first video frame and the second video frame, and the number of such third video frames reaches a second number threshold, generate two separate face tracks of the target object based on the first face position of the target object in the first video frame and the second face position of the target object in the second video frame.

In the above solution, the reconstruction module is further configured to perform face pose estimation based on the first feature point set, to obtain the face pose of the target object; and

based on the face pose of the target object, perform three-dimensional reconstruction of the target object by triangulation, to determine the three-dimensional coordinates of the face feature points of the target object.

In the above solution, the reconstruction module is further configured to obtain a fundamental matrix based on the positional relationships of each face feature point in the first feature point set across the plurality of video frames;

normalize the fundamental matrix to obtain an essential matrix; and

perform singular value decomposition on the essential matrix to obtain the face pose of the target object.
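The fundamental-to-essential normalization and the SVD-based pose recovery can be sketched with NumPy alone. This is a minimal illustration under assumed toy values, not the patent's implementation; the cheirality test that picks the correct one of the four candidate poses (two rotations, translation up to sign) is omitted.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x, so that skew(v) @ u == np.cross(v, u)."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def pose_candidates_from_essential(E):
    """Decompose an essential matrix via SVD into the two candidate
    rotations and the translation direction (classic +/-t ambiguity)."""
    U, _, Vt = np.linalg.svd(E)
    W = np.array([[0.0, -1, 0], [1, 0, 0], [0, 0, 1]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    # A valid rotation has det = +1; flip sign if the SVD gave a reflection.
    if np.linalg.det(R1) < 0: R1 = -R1
    if np.linalg.det(R2) < 0: R2 = -R2
    t = U[:, 2]                 # translation known only up to scale and sign
    return R1, R2, t

# Round trip with a known ground truth: F -> E = K^T F K -> SVD -> pose.
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
angle = 0.1
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
t_true = np.array([0.1, 0.2, 1.0])
E_true = skew(t_true) @ R_true
F = np.linalg.inv(K).T @ E_true @ np.linalg.inv(K)  # fundamental consistent with E
E = K.T @ F @ K                                     # "normalization": E = K^T F K
R1, R2, t = pose_candidates_from_essential(E)
assert min(np.abs(R1 - R_true).max(), np.abs(R2 - R_true).max()) < 1e-6
```

The sign fix via the determinant mirrors what standard implementations of essential-matrix decomposition do; the true rotation is guaranteed to appear among the two candidates only when E is exact, as in this toy example.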

In the above solution, the reconstruction module is further configured to obtain target camera intrinsic parameters for the target video;

perform three-dimensional reconstruction of the target object based on the target camera intrinsic parameters and the first feature point set;

obtain camera intrinsic parameters of the target video in the course of the three-dimensional reconstruction of the target object;

when the error between the first feature point set and the second feature point set does not satisfy the error condition, update the target camera intrinsic parameters based on the obtained camera intrinsic parameters; and

perform three-dimensional reconstruction of the target object based on the updated target camera intrinsic parameters and the first feature point set.

In the above solution, the reconstruction module is further configured to obtain a fundamental matrix and an essential matrix in the course of the three-dimensional reconstruction of the target object; and

obtain the camera intrinsic parameters of the target video according to the conversion relationship between the fundamental matrix and the essential matrix.

In the above solution, the reconstruction module is further configured to, when the error between the first feature point set and the second feature point set does not satisfy the error condition and each video frame contains a plurality of face feature points, remove the two-dimensional coordinates of at least one face feature point from the first feature point set, to obtain a third feature point set;

perform three-dimensional reconstruction of the target object based on the third feature point set, to obtain the three-dimensional coordinates of the face feature points of the target object;

project the three-dimensional coordinates obtained by the three-dimensional reconstruction based on the third feature point set onto the camera imaging plane, to obtain a fourth feature point set comprising two-dimensional coordinates of the face feature points; and

update the first feature point set based on the third feature point set and the fourth feature point set.
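The removal step that produces the third feature point set can be sketched as dropping the points with the largest reprojection error. The selection heuristic (a fixed drop fraction) and all names here are assumptions; the patent does not prescribe which points to remove:

```python
import numpy as np

def prune_worst_points(first_set, errors, drop_fraction=0.1):
    """When the error condition fails, return the indices of the feature
    points to keep, discarding those with the largest reprojection error."""
    n_drop = max(1, int(len(first_set) * drop_fraction))
    keep = np.argsort(errors)[:-n_drop]   # indices of the best-fitting points
    return np.sort(keep)

# Five detected points; the last one reprojects badly and is dropped.
first_set = np.arange(5)
errors = np.array([0.5, 0.7, 0.4, 0.6, 9.0])
keep = prune_worst_points(first_set, errors, drop_fraction=0.2)
assert 4 not in keep and len(keep) == 4
```

Reconstruction would then be re-run on the kept points, and the accept/reject comparison repeated, mirroring the iterative refinement described above.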

In the above solution, the projection module is further configured to rotate and translate the three-dimensional coordinates of the face feature points with respect to the camera imaging plane according to the face pose, to obtain the coordinates of the face feature points in the camera coordinate system;

obtain the target camera intrinsic parameters corresponding to the target video; and

convert the coordinates in the camera coordinate system into coordinates in the image coordinate system according to the target camera intrinsic parameters, to obtain the second feature point set comprising two-dimensional coordinates of the face feature points.

In the above solution, the apparatus further comprises a processing module, configured to generate, based on information to be displayed, a picture corresponding to the information to be displayed;

obtain the original two-dimensional coordinates of a plurality of vertices of the picture, as well as the three-dimensional coordinates of the plurality of vertices of the picture;

project, based on the target face pose, the three-dimensional coordinates of the plurality of vertices of the picture onto the camera imaging plane, to obtain two-dimensional coordinates of the plurality of vertices of the picture;

determine a perspective transformation matrix according to the mapping between the original two-dimensional coordinates and the two-dimensional coordinates obtained by projection;

warp the picture using the perspective transformation matrix, to obtain a transformed target picture; and

superimpose the target picture onto the video frame, to obtain a video frame to which the information to be displayed has been added.
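Determining the perspective transformation matrix from the four corner correspondences is a standard linear (DLT) solve; a NumPy-only sketch with illustrative corner values follows. The picture warp and overlay themselves are omitted; in practice library routines such as OpenCV's `getPerspectiveTransform` and `warpPerspective` perform this solve and the warp.

```python
import numpy as np

def perspective_from_points(src, dst):
    """Solve the 3x3 perspective (homography) matrix mapping four source
    picture corners to their projected positions, via a linear DLT solve
    with the normalization h33 = 1."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, dtype=float), np.array(b, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)

def apply_perspective(H, pt):
    """Apply a homography to one 2D point (homogeneous multiply + divide)."""
    p = H @ np.array([pt[0], pt[1], 1.0])
    return p[:2] / p[2]

# The matrix must reproduce the corner mapping it was fitted to.
src = [(0, 0), (100, 0), (100, 40), (0, 40)]   # original picture corners
dst = [(10, 5), (95, 12), (90, 48), (8, 44)]   # corners after projection
H = perspective_from_points(src, dst)
assert all(np.allclose(apply_perspective(H, s), d, atol=1e-6)
           for s, d in zip(src, dst))
```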

An embodiment of the present application provides a computer device, including:

a memory, configured to store executable instructions; and

a processor, configured to implement, when executing the executable instructions stored in the memory, the method for processing a face pose in a video provided by the embodiments of the present application.

An embodiment of the present application provides a computer-readable storage medium storing executable instructions that, when executed by a processor, cause the processor to implement the method for processing a face pose in a video provided by the embodiments of the present application.

The embodiments of the present application have the following beneficial effects:

With the above embodiments, face feature point detection is performed on a plurality of video frames of a target video that contain a target object, yielding a first feature point set comprising two-dimensional coordinates of the face feature points; three-dimensional reconstruction of the target object is performed based on the first feature point set, yielding the face pose of the target object and the three-dimensional coordinates of its face feature points; the three-dimensional coordinates of the face feature points are projected, yielding a second feature point set comprising two-dimensional coordinates of the face feature points; and the second feature point set is compared with the first feature point set, and, when the error between the first feature point set and the second feature point set satisfies an error condition, the face pose obtained by the three-dimensional reconstruction is taken as the target face pose of the target object in the target video. In this way, by obtaining the second feature point set through projection and comparing it with the first feature point set, the three-dimensional reconstruction result is verified, which improves the stability and accuracy of determining the target face pose.

Description of the Drawings

FIG. 1 is a schematic diagram of an optional architecture of a system for processing a face pose in a video provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of a method for processing a face pose in a video provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of a screening process for face feature points provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of a text picture in the face coordinate system provided by an embodiment of the present application;

FIGS. 5A-5B are schematic diagrams of playback interfaces provided by embodiments of the present application;

FIG. 6 is a schematic flowchart of a method for processing a face pose in a video provided by an embodiment of the present application;

FIG. 7 is a schematic diagram of a three-dimensional reconstruction process provided by an embodiment of the present application;

FIG. 8 is a schematic diagram of a process of adding information to be displayed to a video frame provided by an embodiment of the present application;

FIG. 9 is a schematic structural diagram of an apparatus for processing a face pose in a video provided by an embodiment of the present application;

FIG. 10 is a schematic structural diagram of a computer device provided by an embodiment of the present application.

Detailed Description

To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limiting the present application; all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present application.

In the following description, reference is made to "some embodiments", which describe a subset of all possible embodiments. It should be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with one another where no conflict arises.

In the following description, the terms "first", "second", and "third" merely distinguish similar objects and do not denote a particular ordering of objects. It should be understood that, where permitted, the specific order or sequence denoted by "first", "second", and "third" may be interchanged, so that the embodiments of the present application described here can be implemented in an order other than that illustrated or described here.

Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field to which the present application belongs. The terms used herein are only for the purpose of describing the embodiments of the present application and are not intended to limit the present application.

Before the embodiments of the present application are described in further detail, the nouns and terms involved in the embodiments are explained; they are subject to the following interpretations.

1) Face pose: the position (three-dimensional coordinates) and orientation of a face relative to the camera coordinate system in a picture captured by the camera.

2) Camera intrinsic parameters: camera internal parameters recorded in matrix form, including the horizontal and vertical focal lengths of the camera and the position of the optical center on the sensor plane. They can be expressed as:

    K = | fx  0   cx |
        | 0   fy  cy |
        | 0   0   1  |

where fx is the focal length parameter in the horizontal direction, fy is the focal length parameter in the vertical direction, and (cx, cy) are the position coordinates of the optical center on the camera sensor plane.

3) Camera extrinsic parameters: parameters that record the rotation and translation of the camera relative to the world coordinate system. Generally, a matrix records the camera's rotation and a vector records its translation. When the face coordinate system is taken as the world coordinate system, the camera extrinsic parameters are the face pose.

4) Fundamental matrix: F is a 3×3 matrix that expresses the correspondence between the image points of a stereo pair.

5) Essential matrix: the matrix relationship, in a three-dimensional coordinate system, that connects two corresponding points.
参见图1,图1是本申请实施例提供的视频中人脸姿态的处理系统的一个 可选的架构示意图,为实现支撑一个示例性应用,终端(示例性示出了终端400-1 和终端400-2)通过网络300连接服务器200,网络300可以是广域网或者局域 网,又或者是二者的组合。这里不对服务器及终端的数量做限制。在实际实施 时,终端上设置有客户端,如视频客户端,浏览器客户端,信息流客户端,教 育客户端等,以用于视频的上传及播放。Referring to FIG. 1, FIG. 1 is a schematic diagram of an optional architecture of a system for processing a face gesture in a video provided by an embodiment of the present application. In order to support an exemplary application, a terminal (exemplarily shows terminal 400-1 and a terminal) 400-2) Connect the server 200 through the network 300, and the network 300 may be a wide area network or a local area network, or a combination of the two. The number of servers and terminals is not limited here. In actual implementation, the terminal is provided with a client, such as a video client, a browser client, an information flow client, an education client, etc., for uploading and playing videos.

终端,用于接收到针对目标视频的上传指令,向服务器发送目标视频;The terminal is used to receive the upload instruction for the target video and send the target video to the server;

服务器200,用于对目标视频中包含目标对象的多个视频帧进行人脸特征 点检测,得到包括人脸特征点的二维坐标的第一特征点集;基于第一特征点集, 对目标对象进行三维重建,得到目标对象的人脸姿态、及目标对象的人脸特征 点的三维坐标;将人脸特征点的三维坐标投影至相机成像面,得到包括人脸特 征点的二维坐标的第二特征点集;将第二特征点集与第一特征点集进行比较, 当第一特征点集与第二特征点集之间的误差满足误差条件时,将三维重建得到 的人脸姿态作为目标视频中目标对象的目标人脸姿态;The server 200 is configured to perform face feature point detection on multiple video frames including the target object in the target video, and obtain a first feature point set including two-dimensional coordinates of the face feature points; based on the first feature point set, the target The object is reconstructed in three dimensions to obtain the face pose of the target object and the three-dimensional coordinates of the face feature points of the target object; the three-dimensional coordinates of the face feature points are projected to the imaging surface of the camera, and the two-dimensional coordinates including the face feature points are obtained. the second feature point set; compare the second feature point set with the first feature point set, when the error between the first feature point set and the second feature point set satisfies the error condition, the face pose obtained by the three-dimensional reconstruction as the target face pose of the target object in the target video;

终端,还用于接收到针对目标视频的播放指令,向服务器发送目标视频、 待展示信息及目标对象的目标人脸姿态的获取请求;The terminal is further configured to receive a playback instruction for the target video, and send an acquisition request for the target video, the information to be displayed, and the target face posture of the target object to the server;

服务器200,用于下发目标视频、待展示信息及目标对象的目标人脸姿态 至终端;The server 200 is used for delivering the target video, the information to be displayed and the target face posture of the target object to the terminal;

终端,用于在播放界面中包含目标视频;在播放目标视频的过程中,当目 标视频的视频画面中包含目标对象、且存在与目标对象相关联的待展示信息时, 在与目标对象的人脸相关联的展示区域,展示待展示信息;在展示待展示信息 的过程中,确定目标对象的目标人脸姿态发生变化时,伴随目标人脸姿态的变 化,调整待展示信息的展示姿态。The terminal is used to include the target video in the playback interface; in the process of playing the target video, when the target object is included in the video picture of the target video and there is information to be displayed associated with the target object, the person who is associated with the target object The display area associated with the face displays the information to be displayed; in the process of displaying the information to be displayed, when it is determined that the target face posture of the target object changes, the display posture of the information to be displayed is adjusted along with the change of the target face posture.

In some embodiments, the server 200 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, a content delivery network (CDN), and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smartphone, tablet computer, notebook computer, desktop computer, smart speaker, smart watch, TV terminal, or in-vehicle terminal. The terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in the embodiments of this application.

Based on the above description of the system for processing face poses in video according to the embodiments of this application, the method for processing face poses in video provided by the embodiments of this application is described below. Referring to FIG. 2, FIG. 2 is a schematic flowchart of the method for processing face poses in video provided by an embodiment of this application. In some embodiments, the method may be implemented by a terminal or a server alone, or by a terminal and a server in cooperation. Taking implementation by the server alone as an example, the method includes:

Step 201: The server performs face feature point detection on a plurality of video frames of the target video that contain the target object, obtaining a first feature point set comprising two-dimensional coordinates of the face feature points.

In practical implementation, the server acquires the target video and decodes it to obtain the video frames it contains, selects from them the frames that contain the target object, and performs face feature point detection on those frames to obtain the two-dimensional coordinates of the face feature points. Here, the first feature point set is composed of a plurality of face feature points, each represented by its two-dimensional coordinates.

In practical application, face feature point detection is performed on each video frame containing the target object, yielding the two-dimensional coordinates of the face feature points in that frame; in this way, the two-dimensional coordinates of the face feature points across the multiple video frames are obtained, and together they constitute the first feature point set.

For example, a 68-point feature point detection algorithm may be applied to the face, in which case the two-dimensional coordinates of 68 face feature points are obtained for each video frame; for instance, a face attention network (FAN, Face Attention Network) may be used to detect the feature points of the face.

In some embodiments, the terminal may perform face feature point detection on the video frames of the target video that contain the target object as follows: crop the face region of the target object from each of those frames to obtain a plurality of face region images; perform face feature point detection on each face region image to obtain the two-dimensional coordinates of the face feature points within it; apply a coordinate transformation to those coordinates to obtain the two-dimensional coordinates of the face feature points in each video frame; and assemble the per-frame two-dimensional coordinates into the first feature point set.

In practical application, to improve the detection accuracy of the face feature points, the face region image corresponding to the face region of the target object may first be cropped from the video frame, and face feature point detection is then performed on the cropped image. The two-dimensional coordinates obtained this way are expressed in the coordinate system of the face region image, so they must be transformed into the coordinate system of the video frame.

In practical implementation, the position of the face region image within the video frame is obtained in order to determine the transformation between the coordinate system of the face region image and that of the video frame. For example, if the top-left vertex of the face region image lies at (20, 20) in the video frame, then the two-dimensional coordinates of each face feature point in the face region image are transformed by adding 20 to both the abscissa and the ordinate, yielding the two-dimensional coordinates of the face feature points in the video frame.
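The transformation described above is a pure translation by the crop's top-left corner. A minimal sketch (the function and variable names are illustrative, not taken from the embodiment):

```python
import numpy as np

def crop_to_frame_coords(landmarks_crop, crop_origin):
    # Translate 2D landmarks from face-crop coordinates to full-frame
    # coordinates by adding the crop's top-left position in the frame.
    return np.asarray(landmarks_crop, dtype=float) + np.asarray(crop_origin, dtype=float)

# The example from the text: the crop's top-left vertex sits at (20, 20).
pts = crop_to_frame_coords([[100.0, 150.0], [130.0, 148.0]], (20.0, 20.0))
# pts -> [[120. 170.] [150. 168.]]
```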

In some embodiments, before performing face feature point detection on the video frames of the target video that contain the target object, face re-identification may be performed on the target video to obtain the face track of the target object. Accordingly, the terminal may perform face feature point detection as follows: according to the face track of the target object, determine the video frames of the target video that contain the target object and the face position of the target object in each of those frames; then perform face feature point detection on the face located at that face position in each frame.

Here, a video frame may contain multiple objects; that is, a frame containing the target object may also contain other objects. It is therefore necessary to determine which object is the target object, so that feature point detection is applied to the face of the target object.

In practical implementation, the face track of the target object can be obtained through face re-identification; the face track identifies the face position of the target object in each video frame. Once the face track is obtained, it is possible to determine which frames contain the target object and where its face is in each of those frames. Performing face feature point detection at the corresponding face position avoids mistakenly treating another object as the target object and detecting that object's face feature points instead.

In practical application, the server may perform face detection on every video frame of the target video to obtain all the faces in each frame; then, based on the detected faces, crop the face regions to obtain face region images and re-identify each face region image to determine the object it corresponds to; and finally determine the face track of the target object from the positions of its face region images in the corresponding video frames. The position of a face region image in its video frame is the face position of the target object in that frame.

In some embodiments, face re-identification may be performed on the target video to obtain the face track of the target object as follows: perform face re-identification on the target video, take the frames in which the target object is recognized as the frames containing the target object, and determine the face position of the target object in each of them. For a first video frame and a second video frame that both contain the target object, when the target object is not recognized in any of at least one third video frame lying between the first and second video frames, and the number of third video frames is less than a first number threshold, obtain the first face position of the target object in the first video frame and the second face position of the target object in the second video frame; when the distance between the first face position and the second face position is less than a distance threshold, determine that the third video frames contain the target object, and interpolate between the first face position and the second face position to obtain the face position of the target object in each third video frame, thereby generating the face track of the target object.


In practical application, when determining the face position of the target object from the positions of its face region images in the corresponding video frames, it may happen that the target object is not recognized in some M video frames (the third video frames) but is recognized in the frames immediately before and after them (the first video frame and the second video frame). If the distance between the face positions of the target object in those two frames is less than the distance threshold, the positions are close enough, and interpolation can be used to fill in the face positions of the intermediate frames (the third video frames). Here, M is a natural number smaller than the first number threshold; for example, the first number threshold may be 3.

Illustratively, linear interpolation may be used to obtain the face position of the target object in the third video frames: a linear function is constructed from the first face position and the second face position, and the face position of the target object in each third video frame is determined from that function. For example, if the first face position is (200, 200), the second face position is (260, 260), and two third video frames lie between the first and second video frames, then the face position of the target object in the first of them (the frame immediately after the first video frame) is (220, 220) and in the second (two frames after the first video frame) is (240, 240).
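The linear interpolation above can be sketched as follows (a hypothetical helper, not the embodiment's own code):

```python
import numpy as np

def interpolate_face_positions(pos_first, pos_second, num_gap_frames):
    # Linearly interpolate face positions for the gap frames lying between
    # the first video frame (pos_first) and the second video frame
    # (pos_second), excluding the two endpoint frames themselves.
    a = np.asarray(pos_first, dtype=float)
    b = np.asarray(pos_second, dtype=float)
    return [tuple(a + (i + 1) / (num_gap_frames + 1) * (b - a))
            for i in range(num_gap_frames)]

# The example from the text: (200, 200) -> (260, 260) with two gap frames.
filled = interpolate_face_positions((200, 200), (260, 260), 2)
# filled -> [(220.0, 220.0), (240.0, 240.0)]
```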

Here, since the target object is not detected in the third video frames, the face pose of the target object in those frames can likewise be determined by interpolation: obtain the face pose of the target object in the first video frame and in the second video frame, and then determine the face pose of the target object in the third video frames by interpolating between them.

In practical application, if the target object is not detected in some M video frames (the third video frames) but is recognized in the frames immediately before and after them (the first video frame and the second video frame), and the distance between the face positions of the target object in those two frames reaches the distance threshold, the positions are not close enough. The video picture is then considered to have cut, and the face track is split into two segments at the moment of the cut: the face position in the first video frame becomes the end of the preceding face track, and the face position in the second video frame becomes the start of the next face track.

In some embodiments, face re-identification may be performed on the target video to obtain the face track of the target object as follows: perform face re-identification on the video frames of the target video, take the frames in which the target object is recognized as the frames containing the target object, and determine the face position of the target object in each of them. For a first video frame and a second video frame that both contain the target object, when the target object is not recognized in any of at least one third video frame lying between the first and second video frames, and the number of third video frames reaches a second number threshold, generate at least two face track segments of the target object based on the first face position of the target object in the first video frame and the second face position of the target object in the second video frame.

In practical application, if the target object is not recognized in some N video frames (the third video frames) but is recognized in the frames immediately before and after them (the first video frame and the second video frame), N is compared with the second number threshold. If N reaches the second number threshold, the video picture is considered to have cut, and the face track is split into two segments at the moment of the cut: the face position in the first video frame becomes the end of the preceding face track, and the face position in the second video frame becomes the start of the next face track. Here, N is a positive integer, and the second number threshold is greater than or equal to the first number threshold.
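The gap-handling rules of the two embodiments above (interpolate a short gap whose endpoint positions are close, split the track otherwise) can be summarized in one decision function; the default threshold values here are illustrative, not fixed by the embodiments:

```python
def gap_action(num_gap_frames, endpoint_distance,
               first_num_threshold=3, second_num_threshold=3,
               distance_threshold=50.0):
    # Decide how to treat a run of frames in which the target face was not
    # re-identified, bounded by two frames in which it was recognized.
    if num_gap_frames >= second_num_threshold:
        return "split_track"      # a scene cut is assumed
    if (num_gap_frames < first_num_threshold
            and endpoint_distance < distance_threshold):
        return "interpolate"      # fill positions by linear interpolation
    return "split_track"

# Two missing frames with endpoints 30 px apart -> fill by interpolation.
action = gap_action(2, 30.0)
# action -> "interpolate"
```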

In some embodiments, the target object is recognized in the first, second, and third video frames, where the first video frame is the frame immediately preceding the third video frame and the second video frame is the frame immediately following it. When the face position of the target object in the third video frame jumps relative to the first and second video frames, that is, the distance between the face positions of the target object in the first and third video frames exceeds the distance threshold and the distance between the face positions in the second and third video frames also exceeds the distance threshold (the distance threshold may be set, for example, to 5% of the image height), detection of the target object in the third video frame is considered to have failed, and the correct face data for that frame are obtained by interpolating between the face positions in the first video frame and the second video frame.

Step 202: Based on the first feature point set, perform three-dimensional reconstruction of the target object to obtain the face pose of the target object and the three-dimensional coordinates of the face feature points of the target object.

In practical implementation, the first feature point set contains the two-dimensional coordinates of the face feature points in multiple video frames, and the target object is reconstructed in three dimensions from the positional relationships of each face feature point across the different frames. For example, when the face feature points include the left corner of the left eye, the right corner of the right eye, the left mouth corner, the right mouth corner, the nose tip, and the chin, the two-dimensional coordinates of the left corner of the left eye in each video frame are obtained, the positional relationship of the left corner of the left eye across frames is determined, and three-dimensional reconstruction of the target object is performed on the basis of that relationship. The operation described here for the left corner of the left eye is carried out for every face feature point, yielding the face pose of the target object and the three-dimensional coordinates of its face feature points.

In some embodiments, the three-dimensional reconstruction of the target object based on the first feature point set may proceed as follows: perform face pose estimation based on the first feature point set to obtain the face pose of the target object; then, based on the face pose of the target object, perform three-dimensional reconstruction of the target object through triangulation to determine the three-dimensional coordinates of its face feature points.

In practical implementation, face pose estimation is first performed on the target object according to the positional relationships of the face feature points of the first feature point set across the multiple video frames, yielding the face pose of the target object; then, based on that face pose, the target object is reconstructed in three dimensions through triangulation, yielding the three-dimensional coordinates of its face feature points.

Here, the triangulation is explained. When the face coordinate system of the target object is taken as the world coordinate system, a change in the face pose of the target object is equivalent to observing the face of the target object from different positions. Given the two-dimensional coordinates of the face as observed from those different positions, the triangular relationships of the perspective projection model are used to reconstruct the target object in three dimensions.
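One standard way to realize this triangulation is the direct linear transformation (DLT): each observed 2D point contributes two linear constraints on the unknown 3D point, and the solution is the SVD null vector of the stacked system. A sketch, assuming the 3×4 projection matrices of the two observations are known (the names are illustrative):

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    # DLT triangulation: recover the 3D point X with P1 @ X ~ x1 and
    # P2 @ X ~ x2 (equality up to scale), via the SVD null vector.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    X = np.linalg.svd(A)[2][-1]
    return X[:3] / X[3]

# Two views: identity intrinsics; the second camera is shifted along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.2, -0.1, 4.0])
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]
X_rec = triangulate_point(P1, P2, x1, x2)  # ~ [0.2, -0.1, 4.0]
```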

In some embodiments, the server may perform face pose estimation on the first feature point set as follows: obtain the fundamental matrix based on the positional relationships of the face feature points of the first feature point set across the multiple video frames; normalize the fundamental matrix to obtain the essential matrix; and perform singular value decomposition on the essential matrix to obtain the face pose of the target object.

In practical implementation, the fundamental matrix is a 3×3 matrix of rank 2 with 7 degrees of freedom. It can be computed from the epipolar constraint

x^T F x' = 0

where F is the fundamental matrix and x and x' are the two-dimensional coordinates of a pair of matching points in their respective video frames, for example the two-dimensional coordinates of the same face feature point in two adjacent video frames, such as the two-dimensional coordinates x of the chin in the first video frame and the two-dimensional coordinates x' of the chin in the second video frame.

Here, when computing the fundamental matrix, using the two-dimensional coordinates of multiple face feature points makes the resulting fundamental matrix more accurate.

After the fundamental matrix is obtained, since the relationship between the fundamental matrix and the essential matrix is F = K^(-T) E K'^(-1), where F is the fundamental matrix, K and K' are the camera intrinsic matrices of the two views (here the same camera), and E is the essential matrix, normalization reduces the camera intrinsics to the identity matrix, at which point the fundamental matrix is the essential matrix. On this basis, the essential matrix is obtained by normalizing the fundamental matrix.

After the essential matrix is obtained, singular value decomposition is performed on it. The decomposition yields four combinations of translation vector and rotation angles, of which only one is correct. On this basis, the three-dimensional coordinates corresponding to the face feature points are determined through triangulation and substituted into each combination of translation vector and rotation angles to obtain the three-dimensional coordinates in the corresponding camera coordinate system; only one combination yields positive depth values, and the correct translation vector and rotation angles are thereby selected, giving the face pose of the target object.
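The four-candidate decomposition described above follows the standard SVD recipe E = U diag(1, 1, 0) V^T, R ∈ {U W V^T, U W^T V^T}, t = ±u3. A sketch with illustrative names; the cheirality check that picks the single correct pair is the positive-depth test described in the text:

```python
import numpy as np

def decompose_essential(E):
    # Split an essential matrix into its four candidate (R, t) pairs.
    # Triangulating a point with each pair and keeping the one that yields
    # positive depth in both views selects the true pose.
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:
        U = -U                      # enforce proper rotations (det = +1)
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

# Build E = [t]_x R from a known pose; the pose is among the candidates.
R_true = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t_true = np.array([1.0, 0.0, 0.0])
tx = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]])
candidates = decompose_essential(tx @ R_true)
```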

Step 203: Project the three-dimensional coordinates of the face feature points to obtain a second feature point set comprising two-dimensional coordinates of the face feature points.

In practical implementation, the face feature points are projected onto the camera imaging plane according to the face pose of the target object. Since the face pose of the target object differs from frame to frame, projecting the three-dimensional coordinates of the face feature points yields the two-dimensional coordinates of the face feature points corresponding to each video frame. For example, when each face is represented by 68 face feature points, the two-dimensional coordinates corresponding to each video frame are those of the 68 projected face feature points.

In some embodiments, the projection of the three-dimensional coordinates of the face feature points may be performed as follows: according to the face pose, rotate and translate the three-dimensional coordinates of the face feature points with respect to the camera imaging plane to obtain the coordinates of the face feature points in the camera coordinate system; obtain the target camera intrinsic parameters corresponding to the target video; and, according to the target camera intrinsic parameters, convert the coordinates in the camera coordinate system into coordinates in the image coordinate system, obtaining a second feature point set comprising the two-dimensional coordinates of the face feature points.

Here, the face pose is represented by the translation vector of the face relative to the camera and three rotation angles (pitch, yaw, and roll). In practical implementation, the three-dimensional coordinates of the face feature points are translated according to the translation vector and rotated according to the three rotation angles, yielding the coordinates of the face feature points in the camera coordinate system; then the product of the target camera intrinsic matrix and the coordinates of the face feature points in the camera coordinate system is taken as the coordinates of the face feature points in the image coordinate system, thereby projecting the three-dimensional coordinates of the face feature points.
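The projection step can be sketched as follows, assuming the pose has already been converted into a rotation matrix R and translation vector t (the pitch/yaw/roll-to-matrix conversion is omitted; the names are illustrative):

```python
import numpy as np

def project_points(pts_3d, R, t, K):
    # Rotate/translate the 3D face feature points into the camera coordinate
    # system, apply the intrinsic matrix K, and divide by depth to obtain
    # image-plane coordinates.
    pts_cam = (R @ np.asarray(pts_3d, dtype=float).T).T + t
    uvw = (K @ pts_cam.T).T
    return uvw[:, :2] / uvw[:, 2:3]

# A point on the optical axis projects to the principal point (320, 240).
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
uv = project_points([[0.0, 0.0, 2.0]], np.eye(3), np.zeros(3), K)
# uv -> [[320. 240.]]
```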

Step 204: Compare the second feature point set with the first feature point set, and, when the error between the first feature point set and the second feature point set satisfies the error condition, take the face pose obtained by the three-dimensional reconstruction as the target face pose of the target object in the target video.

In practical implementation, the mean squared error between the second feature point set and the first feature point set can be obtained; when that mean squared error is less than the error threshold, the error between the first feature point set and the second feature point set is determined to satisfy the error condition; otherwise, it is determined not to satisfy the error condition.
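The acceptance test above can be sketched as a mean-squared-error check between the detected (first) and reprojected (second) feature point sets; the threshold value here is illustrative:

```python
import numpy as np

def pose_error_satisfied(first_set, second_set, error_threshold):
    # Mean squared reprojection error over all landmarks; the reconstructed
    # pose is accepted when the error stays below the threshold.
    diff = np.asarray(first_set, dtype=float) - np.asarray(second_set, dtype=float)
    mse = float(np.mean(np.sum(diff * diff, axis=-1)))
    return mse, mse < error_threshold

mse, ok = pose_error_satisfied([[100, 100], [200, 200]],
                               [[101, 100], [200, 202]],
                               error_threshold=5.0)
# mse -> 2.5, ok -> True
```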

In some embodiments, the three-dimensional reconstruction of the target object based on the first feature point set may proceed as follows: obtain the target camera intrinsic parameters of the target video, and perform three-dimensional reconstruction of the target object based on the target camera intrinsic parameters and the first feature point set. The method further includes: during the three-dimensional reconstruction of the target object, obtaining camera intrinsic parameters of the target video; and, when the error between the first feature point set and the second feature point set does not satisfy the error condition, updating the target camera intrinsic parameters based on the obtained camera intrinsic parameters, so that three-dimensional reconstruction of the target object is performed based on the updated target camera intrinsic parameters and the first feature point set.

这里，首次获取的目标视频的目标相机内参数为初始相机内参数，初始相机内参数可以人工给定，或基于相机内参估计算法得到。这里，初始相机内参数为<焦距设置为50mm><光心位置设置到成像面中心位置>，当人工给出初始相机内参数时，可以根据图像的高度进行设定。当第一特征点集与第二特征点集之间的误差未满足误差条件时，如第二特征点集与第一特征点集之间的均方误差达到误差阈值时，说明目标相机内参数可能不够准确，此时根据三维重建过程中获取的相机内参数，更新目标相机内参数。Here, the target camera intrinsic parameters of the target video acquired for the first time are the initial camera intrinsic parameters, which can be given manually or obtained by a camera intrinsic estimation algorithm. Here, the initial intrinsic parameters are <focal length set to 50mm> <optical center set to the center of the imaging plane>; when given manually, they can be set according to the height of the image. When the error between the first feature point set and the second feature point set does not satisfy the error condition, for example when the mean square error between the second feature point set and the first feature point set reaches the error threshold, the target camera intrinsic parameters may not be accurate enough; in this case, the target camera intrinsic parameters are updated according to the camera intrinsic parameters obtained during the three-dimensional reconstruction.

在实际实施时，当获取目标相机内参数后，可以根据目标相机内参数，确定各视频帧中人脸特征点在相应相机坐标系中的三维坐标；然后，基于p^T E p′ = 0，计算本质矩阵，其中，E为本质矩阵，p和p′是视频帧中的一对匹配点在相机坐标系中的三维坐标，例如，相邻两个视频帧中同一人脸特征点在相应相机坐标系中的三维坐标，如第一视频帧中下巴在相机坐标系中的三维坐标为p，第二视频帧中下巴在相机坐标系中的三维坐标为p′；在获取本质矩阵后，对该本质矩阵进行奇异值分解，得到目标对象的人脸姿态。In actual implementation, after the target camera intrinsic parameters are obtained, the three-dimensional coordinates of the facial feature points in each video frame in the corresponding camera coordinate system can be determined from those parameters; then the essential matrix is computed from the constraint p^T E p′ = 0, where E is the essential matrix and p and p′ are the coordinates, in the camera coordinate system, of a pair of matching points, for example the same facial feature point in two adjacent video frames, such as the chin, whose coordinates are p in the first video frame and p′ in the second. After the essential matrix is obtained, singular value decomposition is performed on it to obtain the face pose of the target object.
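对本质矩阵做奇异值分解以恢复姿态的步骤，可用如下numpy草图示意（采用经典双视图几何结论；实现细节为本文假设，实际中通常还需在多个候选解中用正深度检验消歧）：The SVD-based pose recovery from the essential matrix can be sketched with numpy as follows (using the classic two-view geometry result; the details are assumptions of this sketch, and a real implementation would still disambiguate among the candidate solutions with a positive-depth check):

```python
import numpy as np

def skew(t):
    # Cross-product (skew-symmetric) matrix of a 3-vector t.
    return np.array([[0., -t[2], t[1]],
                     [t[2], 0., -t[0]],
                     [-t[1], t[0], 0.]])

def decompose_essential(E):
    # SVD of the essential matrix; returns one of the two candidate
    # rotations and the translation direction (scale is unrecoverable).
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U @ Vt) < 0:   # enforce a proper rotation, det = +1
        Vt = -Vt
    W = np.array([[0., -1., 0.],
                  [1.,  0., 0.],
                  [0.,  0., 1.]])
    R = U @ W @ Vt
    t = U[:, 2]
    return R, t

# Build an essential matrix E = [t]_x R from a known relative pose
angle = 0.1
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.],
                   [np.sin(angle),  np.cos(angle), 0.],
                   [0., 0., 1.]])
t_true = np.array([1., 0., 0.])
E = skew(t_true) @ R_true
R_est, t_est = decompose_essential(E)
```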

这里,可以通过迭代的方式对目标相机内参数进行更新,直至第一特征点 集与第二特征点集之间的误差满足误差条件。Here, the internal parameters of the target camera can be updated in an iterative manner until the error between the first feature point set and the second feature point set satisfies the error condition.

在一些实施例中，可以通过以下方式在对目标对象进行三维重建的过程中，获取目标视频的相机内参数：在对目标对象进行三维重建的过程中，获取基础矩阵及本质矩阵；根据基础矩阵与本质矩阵之间的转换关系，获取目标视频的相机内参数。In some embodiments, the camera intrinsic parameters of the target video may be acquired during the three-dimensional reconstruction of the target object as follows: during the reconstruction, the fundamental matrix and the essential matrix are acquired, and the camera intrinsic parameters of the target video are obtained according to the conversion relationship between the fundamental matrix and the essential matrix.

在实际实施时，在三维重建的过程中，会获取到基础矩阵和本质矩阵，二者满足F = K^(-T) E K^(-1)（两帧来自同一相机，相机内参数矩阵相同），其中，F为基础矩阵，K为相机内参数，E为本质矩阵，如此，可以计算得到相机内参数K。In actual implementation, the fundamental matrix and the essential matrix are obtained during the three-dimensional reconstruction and satisfy F = K^(-T) E K^(-1) (both frames come from the same camera, so the intrinsic matrices are identical), where F is the fundamental matrix, K is the camera intrinsic matrix, and E is the essential matrix; from this relation, the camera intrinsic parameters K can be computed.
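基础矩阵与本质矩阵之间的转换关系可用如下numpy片段示意（两帧共用同一内参矩阵K；K与E的取值仅为示例）：The conversion between the fundamental and essential matrices can be sketched with numpy as follows (both frames share the same intrinsic matrix K; the values of K and E are illustrative):

```python
import numpy as np

def fundamental_from_essential(E, K):
    # F = K^(-T) E K^(-1) when both frames share the same intrinsics K.
    Kinv = np.linalg.inv(K)
    return Kinv.T @ E @ Kinv

def essential_from_fundamental(F, K):
    # The inverse relation: E = K^T F K.
    return K.T @ F @ K

K = np.array([[800., 0., 320.],   # illustrative intrinsics: focal length,
              [0., 800., 240.],   # optical centre at the image centre
              [0., 0., 1.]])
E = np.array([[0., -1., 0.],
              [1., 0., -1.],
              [0., 1., 0.]])
F = fundamental_from_essential(E, K)
```

由该关系即可在已知F与E时反解出K中的未知量。Given F and E, the unknowns in K can be solved from this relation.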

在一些实施例中，终端还可以当第一特征点集与第二特征点集之间的误差未满足误差条件、且每个视频帧中人脸特征点的数量为多个时，从第一特征点集中剔除至少一个人脸特征点的二维坐标，得到第三特征点集；基于第三特征点集，对目标对象进行三维重建，得到目标对象的人脸特征点的三维坐标；将基于第三特征点集进行三维重建所得到的三维坐标投影至相机成像面，得到包括人脸特征点的二维坐标的第四特征点集；基于第三特征点集与第四特征点集，更新第一特征点集。In some embodiments, when the error between the first feature point set and the second feature point set does not satisfy the error condition and each video frame contains multiple facial feature points, the terminal may further remove the two-dimensional coordinates of at least one facial feature point from the first feature point set to obtain a third feature point set; perform three-dimensional reconstruction of the target object based on the third feature point set to obtain the three-dimensional coordinates of its facial feature points; project those three-dimensional coordinates onto the camera imaging plane to obtain a fourth feature point set including two-dimensional coordinates of the facial feature points; and update the first feature point set based on the third and fourth feature point sets.

在实际实施时，在首次进行三维重建时，采用所有检测到的人脸特征点的二维坐标作为第一特征点集，若第一特征点集与第二特征点集之间的误差未满足误差条件，可以对人脸特征点进行筛选，以剔除降低三维重建准确性的人脸特征点。例如，由于人存在说话的行为，导致上嘴唇对应的人脸特征点相对于其他人脸特征点的位置不固定，基于此，可以将人脸特征点中上嘴唇对应的人脸特征点剔除，并在剔除后，对剔除结果进行验证，也即，基于该第三特征点集，对目标对象进行三维重建，得到目标对象的人脸特征点的三维坐标；将基于该第三特征点集进行三维重建得到的三维坐标投影至相机成像面，得到第四特征点集；确定该第三特征点集与第四特征点集间的误差，当该误差满足某一误差条件时，用该第三特征点集更新第一特征点集。In actual implementation, when three-dimensional reconstruction is performed for the first time, the two-dimensional coordinates of all detected facial feature points are used as the first feature point set. If the error between the first feature point set and the second feature point set does not satisfy the error condition, the facial feature points can be screened to remove those that reduce the accuracy of the reconstruction. For example, because people speak, the position of the upper-lip feature points relative to the other facial feature points is not fixed; on this basis, the upper-lip feature points can be removed. After removal, the result is verified: based on the resulting third feature point set, three-dimensional reconstruction of the target object is performed to obtain the three-dimensional coordinates of the facial feature points; those coordinates are projected onto the camera imaging plane to obtain a fourth feature point set; the error between the third and fourth feature point sets is determined, and when that error satisfies a certain error condition, the first feature point set is updated with the third feature point set.

例如,对目标对象进行特征点检测得到68个人脸特征点的二维坐标,那么, 第一特征点集包含68个人脸特征点的二维坐标;从第一特征点集中剔除嘴唇对 应的人脸特征点的二维坐标,由于嘴唇对应的人脸特征点数量为16个,那么, 剩余52个人脸特征点的二维坐标,将这52个人脸特征点的二维坐标组成第三 特征点集;然后基于第三特征点集,对目标对象进行三维重建,得到这52个人 脸特征点的三维坐标;将这52个人脸特征点的三维坐标投影至相机成像面,得 到第四特征点集,这里的第四特征点集包含投影得到的52个人脸特征点的二维 坐标;第三特征点集中的二维坐标与第四特征点集中的二维坐标进行比较,得 到第三特征点集与第四特征点集间的误差,当该误差满足误差条件时,用该第 三特征点集更新第一特征点集。For example, the two-dimensional coordinates of 68 face feature points are obtained by performing feature point detection on the target object, then, the first feature point set contains the two-dimensional coordinates of 68 face feature points; the face corresponding to the lips is eliminated from the first feature point set. The two-dimensional coordinates of the feature points, since the number of face feature points corresponding to the lips is 16, then, the two-dimensional coordinates of the remaining 52 face feature points, the two-dimensional coordinates of these 52 face feature points form the third feature point set Then, based on the third feature point set, three-dimensional reconstruction is performed on the target object to obtain the three-dimensional coordinates of the 52 facial feature points; the three-dimensional coordinates of the 52 facial feature points are projected to the camera imaging surface to obtain the fourth feature point set, The fourth feature point set here includes the two-dimensional coordinates of the 52 face feature points obtained by projection; the two-dimensional coordinates in the third feature point set are compared with the two-dimensional coordinates in the fourth feature point set, and the third feature point set and The error between the fourth feature point set, when the error satisfies the error condition, the first feature point set is updated with the third feature point set.
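剔除嘴唇特征点得到第三特征点集的过程可示意如下（68点中哪些索引属于嘴唇取决于具体的标注规范，此处的16个索引仅为示意）：Removing the lip landmarks to obtain the third feature point set can be sketched as follows (which of the 68 indices belong to the lips depends on the specific annotation scheme; the 16 indices here are illustrative only):

```python
import numpy as np

def remove_landmarks(first_set, drop_indices):
    # Remove the 2D coordinates of the selected landmarks from the first
    # feature point set, yielding the third feature point set.
    mask = np.ones(len(first_set), dtype=bool)
    mask[list(drop_indices)] = False
    return first_set[mask]

first_set = np.zeros((68, 2))   # 68 detected landmark coordinates
lip_indices = range(48, 64)     # hypothetical 16 lip landmark indices
third_set = remove_landmarks(first_set, lip_indices)
```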

在一些实施例中，可以获取多个第三特征点集及相应的第四特征点集，通过比较多个第三特征点集与相应第四特征点集之间的误差，来对第一特征点集进行更新。作为示例，图3是本申请实施例提供的人脸特征点的筛选过程示意图，参见图3，对于每个人脸特征点，从第一特征点集中剔除该人脸特征点的二维坐标，得到相应的第三特征点集，如从第一特征点集中剔除人脸特征点1的二维坐标，得到相应的第三特征点集，即第三特征点集1；当人脸特征点的数量为N个时，可以得到N个第三特征点集。在得到N个第三特征点集后，对于每个第三特征点集，基于该第三特征点集，对目标对象进行三维重建，得到目标对象的人脸特征点的三维坐标；将基于该第三特征点集进行三维重建得到的三维坐标投影至相机成像面，得到包括人脸特征点的二维坐标的第四特征点集；在得到N个第三特征点集对应的第四特征点集后，基于各第三特征点集与相应的第四特征点集间的误差，选取相应误差最小的第三特征点集更新第一特征点集。In some embodiments, multiple third feature point sets and corresponding fourth feature point sets may be obtained, and the first feature point set is updated by comparing the errors between them. As an example, FIG. 3 is a schematic diagram of the screening process of facial feature points provided by an embodiment of the present application. Referring to FIG. 3, for each facial feature point, its two-dimensional coordinates are removed from the first feature point set to obtain a corresponding third feature point set; for example, removing the two-dimensional coordinates of facial feature point 1 yields third feature point set 1. When there are N facial feature points, N third feature point sets are obtained. For each third feature point set, three-dimensional reconstruction of the target object is performed based on it to obtain the three-dimensional coordinates of the facial feature points, and those coordinates are projected onto the camera imaging plane to obtain a fourth feature point set including two-dimensional coordinates of the facial feature points. After the fourth feature point sets corresponding to the N third feature point sets are obtained, based on the error between each third feature point set and its corresponding fourth feature point set, the third feature point set with the smallest error is selected to update the first feature point set.

这里,可以通过迭代的方式对人脸特征点进行筛选,直至对第一特征点集 进行更新后,得到的误差满足误差条件,或者人脸特征点的数量低于数量阈值。Here, the face feature points can be screened in an iterative manner, until after the first feature point set is updated, the obtained error satisfies the error condition, or the number of face feature points is lower than the number threshold.
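上述迭代筛选流程可以概括为如下Python草图，其中reconstruct_and_project代表"三维重建并投影回成像面"这一步，具体实现（如SFM）被抽象为一个可调用对象，示例中用一个模拟函数代替：The iterative screening procedure above can be summarised in the following Python sketch, where reconstruct_and_project stands for the "reconstruct in 3D and reproject to the imaging plane" step; the concrete implementation (e.g. SFM) is abstracted as a callable, replaced here by a mock function:

```python
import numpy as np

def mse(a, b):
    return float(np.mean(np.sum((a - b) ** 2, axis=1)))

def screen_landmarks(first_set, reconstruct_and_project,
                     err_threshold, min_points):
    # Iteratively drop the single landmark whose removal gives the lowest
    # reprojection error, until the error condition holds or too few
    # landmarks remain (the number threshold in the text).
    current = np.asarray(first_set, float)
    while True:
        if mse(current, reconstruct_and_project(current)) < err_threshold:
            return current
        if len(current) <= min_points:
            return current
        candidates = []
        for i in range(len(current)):          # leave-one-out candidates
            third = np.delete(current, i, axis=0)
            fourth = reconstruct_and_project(third)
            candidates.append((mse(third, fourth), third))
        current = min(candidates, key=lambda c: c[0])[1]

# Mock reconstruction step: landmark 0 is "badly reconstructed"
base = np.arange(10.0).reshape(5, 2)
def mock_project(pts):
    out = pts.copy()
    out[np.all(pts == base[0], axis=1)] += 5.0
    return out

screened = screen_landmarks(base, mock_project, 1.0, 3)
```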

在一些实施例中,在得到目标人脸姿态之后,服务器还可以基于人脸轨迹, 对目标人脸姿态进行滤波处理,得到无抖动的目标人脸姿态。In some embodiments, after obtaining the target face pose, the server may further perform filtering processing on the target face pose based on the face trajectory to obtain a jitter-free target face pose.

在实际实施时，对于每段人脸轨迹，获取该人脸轨迹对应的目标人脸姿态序列；这里，目标人脸姿态序列中的人脸姿态包括6个维度的数据，分别对该目标人脸姿态序列中的各维度的数据进行滤波处理，以滤除抖动。其中，可以采用扩展卡尔曼滤波对各维度的数据进行滤波，使用人脸轨迹开始时间点的值作为扩展卡尔曼滤波的初始值，保证每段轨迹开始时刻，滤波的结果正常可用。In actual implementation, for each face track, the target face pose sequence corresponding to the track is obtained. Here, each face pose in the sequence includes data in 6 dimensions, and the data of each dimension is filtered separately to remove jitter. An extended Kalman filter can be used for each dimension, with the value at the track's start time used as the filter's initial value, ensuring that the filtering result is usable from the start of each track.
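逐维滤波的思路可用如下简化片段示意（文中使用的是扩展卡尔曼滤波，此处以恒值状态模型的标量线性卡尔曼滤波代替，参数q、r为假设值）：The per-dimension filtering idea can be sketched with the simplified snippet below (the text uses an extended Kalman filter; a scalar linear Kalman filter with a constant-state model stands in for it here, and the parameters q and r are assumed values):

```python
import numpy as np

def kalman_smooth_1d(seq, q=1e-3, r=1e-1):
    # Minimal scalar Kalman filter with a constant-state model. The first
    # value of the track initialises the state, as described in the text.
    x, p = float(seq[0]), 1.0
    out = [x]
    for z in seq[1:]:
        p = p + q                  # predict
        k = p / (p + r)            # Kalman gain
        x = x + k * (z - x)        # update with the new measurement
        p = (1.0 - k) * p
        out.append(x)
    return np.array(out)

rng = np.random.default_rng(0)
pose_seq = rng.standard_normal((100, 6)) * 0.1 + 1.0  # 6 pose dimensions
smoothed = np.stack([kalman_smooth_1d(pose_seq[:, d]) for d in range(6)],
                    axis=1)
```

每个维度独立滤波，互不影响；初始值取自轨迹起点，故每段轨迹开始时刻即有可用结果。Each dimension is filtered independently; since the initial value comes from the track's start, a usable result exists from the first frame of each track.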

在一些实施例中,终端还可以基于待展示信息,生成对应待展示信息的图 片;获取图片的多个顶点的原始二维坐标、及图片的多个顶点的三维坐标;基 于目标人脸姿态,将图片的多个顶点的三维坐标投影到相机成像面上,得到图 片的多个顶点的二维坐标;根据原始二维坐标和投影得到的二维坐标之间的映 射关系,确定透视变换矩阵;通过透视变换矩阵,对图片进行仿射变换,得到 变换后的目标图片;将目标图片叠加到视频帧中,得到添加有待展示信息的视 频帧。In some embodiments, the terminal may also generate a picture corresponding to the information to be displayed based on the information to be displayed; obtain the original two-dimensional coordinates of multiple vertices of the image and the three-dimensional coordinates of multiple vertices of the image; based on the target face posture, Projecting the three-dimensional coordinates of the multiple vertices of the picture onto the imaging surface of the camera to obtain the two-dimensional coordinates of the multiple vertices of the picture; determining the perspective transformation matrix according to the mapping relationship between the original two-dimensional coordinates and the projected two-dimensional coordinates; Through the perspective transformation matrix, affine transformation is performed on the picture to obtain the transformed target picture; the target picture is superimposed on the video frame to obtain the video frame to which the information to be displayed is added.

在实际实施时，可以基于确定的目标人脸姿态，来展示待展示信息，以在展示待展示信息的过程中，当目标对象的目标人脸姿态发生变化时，伴随目标人脸姿态的变化，调整待展示信息的展示姿态。这里，待展示信息可以是文本信息、图像信息等，这里不对待展示信息的具体形式进行限定。In actual implementation, the information to be displayed can be presented based on the determined target face pose, so that during presentation, when the target face pose of the target object changes, the display pose of the information is adjusted along with that change. Here, the information to be displayed may be text information, image information, etc.; its specific form is not limited.

在实际应用中,当待展示信息为图像信息时,直接将该待展示信息作为图 片;当待展示信息为文本信息时,绘制文本图片,以得到对应待展示信息的图 片。这里,在绘制文本图片时,可选自动换行,即字符长度达到给定阈值后, 自动添加换行符号,文字带有透明通道,非文字区域为透明色。In practical applications, when the information to be displayed is image information, the information to be displayed is directly used as a picture; when the information to be displayed is text information, a text picture is drawn to obtain a picture corresponding to the information to be displayed. Here, when drawing a text picture, automatic line wrapping is optional, that is, after the character length reaches a given threshold, a newline symbol is automatically added, the text has a transparent channel, and the non-text area is transparent.
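"字符长度达到给定阈值后自动添加换行符号"这一步可简单示意为：The automatic line-wrapping step ("insert a newline once the character count reaches a given threshold") can be sketched simply as:

```python
def wrap_text(text, max_chars):
    # Insert a newline character each time the per-line character count
    # reaches the given threshold.
    lines = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    return "\n".join(lines)
```

透明通道与文字绘制本身则由具体的图像绘制库完成，此处不再展开。The transparent channel and the text rendering itself are handled by the concrete drawing library and are not expanded here.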

在实际应用中，多个顶点指的是至少两个顶点，这里，可以根据图片的形状、尺寸及展示位置，确定图片的原始二维坐标，例如，当图片为矩形，其长度为W，宽度为H，文本图片距离目标对象的鼻子的高度为D时，文本图片的4个顶点为左上角(0,0)、左下角(0,H-1)、右下角(W-1,H-1)、右上角(W-1,0)。In practical applications, multiple vertices means at least two vertices. Here, the original two-dimensional coordinates of the picture can be determined according to its shape, size and display position. For example, when the picture is a rectangle of length W and width H, and the text picture is at height D above the nose of the target object, its four vertices are the upper left corner (0, 0), the lower left corner (0, H-1), the lower right corner (W-1, H-1) and the upper right corner (W-1, 0).

以及，这里的图片的多个顶点的三维坐标，指的是图片的多个顶点在人脸坐标系下的三维坐标，可以根据图片的尺寸、以及图片的展示区域与目标对象的人脸之间的位置关系，确定多个顶点的三维坐标，如获取图片的长度和宽度，基于图片的长度和宽度、以及图片的展示区域与目标对象的人脸之间的位置关系，确定图片的多个顶点在人脸坐标系下的三维坐标。Also, the three-dimensional coordinates of the multiple vertices of the picture here refer to their coordinates in the face coordinate system, which can be determined from the size of the picture and the positional relationship between the picture's display region and the face of the target object; for example, the length and width of the picture are obtained, and based on them and on that positional relationship, the three-dimensional coordinates of the vertices in the face coordinate system are determined.

作为示例，以在目标对象的头顶展示图片为例，图4是本申请实施例提供的文本图片在人脸坐标系下的示意图，参见图4，当图片为矩形，其长度为W，宽度为H，图片距离目标对象的鼻子的高度为D时，确定图片的4个顶点在人脸坐标系下的坐标为左上角(-W/2,-D-H,0)、左下角(-W/2,-D,0)、右下角(W/2,-D,0)、右上角(W/2,-D-H,0)。As an example, taking a picture displayed above the head of the target object, FIG. 4 is a schematic diagram of a text picture in the face coordinate system provided by an embodiment of the present application. Referring to FIG. 4, when the picture is a rectangle of length W and width H, at height D above the nose of the target object, the coordinates of its four vertices in the face coordinate system are the upper left corner (-W/2, -D-H, 0), the lower left corner (-W/2, -D, 0), the lower right corner (W/2, -D, 0) and the upper right corner (W/2, -D-H, 0).
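图4所示的4个顶点坐标可由图片尺寸W、H与高度D直接算出（函数名为示意）：The four vertex coordinates shown in FIG. 4 follow directly from the picture size W, H and the height D (the function name is illustrative):

```python
def picture_corners_face_frame(W, H, D):
    # Corners of a W-by-H picture hanging D above the nose, centred
    # horizontally, expressed in the face coordinate system (z = 0 plane).
    return {
        "top_left":     (-W / 2, -D - H, 0.0),
        "bottom_left":  (-W / 2, -D,     0.0),
        "bottom_right": ( W / 2, -D,     0.0),
        "top_right":    ( W / 2, -D - H, 0.0),
    }

corners = picture_corners_face_frame(4.0, 2.0, 1.0)
```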

在获取图片的多个顶点的原始二维坐标、及图片的多个顶点的三维坐标后，基于目标人脸姿态，将图片的多个顶点的三维坐标投影到相机成像面上，得到图片的多个顶点的二维坐标；将多个顶点的二维坐标记为点组合A，将多个顶点的原始二维坐标记为点组合B，计算点组合B变换到点组合A的透视变换矩阵M；然后通过透视变换矩阵，对图片进行仿射变换，也即将图片中每个点的坐标与透视变换矩阵相乘，得到该点在目标图片中的坐标，进而得到目标图片。在得到目标图片后，将该目标图片叠加到视频帧中。当终端接收到针对目标视频的播放指令后，基于添加有待展示信息的视频帧进行渲染，使得渲染得到的待展示信息的展示姿态与目标对象的人脸姿态相匹配。After obtaining the original two-dimensional coordinates and the three-dimensional coordinates of the multiple vertices of the picture, the three-dimensional coordinates are projected onto the camera imaging plane based on the target face pose to obtain the projected two-dimensional coordinates of the vertices. The projected two-dimensional coordinates are recorded as point combination A and the original two-dimensional coordinates as point combination B, and the perspective transformation matrix M mapping point combination B to point combination A is computed. The picture is then warped through the perspective transformation matrix, that is, the coordinates of each point in the picture are multiplied by the matrix to obtain that point's coordinates in the target picture, yielding the target picture, which is then superimposed onto the video frame. After the terminal receives a playback instruction for the target video, it renders based on the video frame to which the information to be displayed has been added, so that the display pose of the rendered information matches the face pose of the target object.
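由点组合B到点组合A的透视变换矩阵M，可用直接线性变换（DLT）从4对对应点求解并逐点应用；实际实现中常直接调用图像库（如OpenCV的getPerspectiveTransform/warpPerspective），此处用numpy示意：The perspective transformation matrix M from point combination B to point combination A can be solved from the 4 point correspondences by direct linear transform (DLT) and applied point by point; real implementations often call an image library directly (e.g. OpenCV's getPerspectiveTransform/warpPerspective), and numpy is used here for illustration:

```python
import numpy as np

def perspective_matrix(src, dst):
    # Solve the 3x3 perspective (homography) matrix M mapping the four
    # source corners to the four destination corners via DLT.
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.array(A, dtype=float))
    M = Vt[-1].reshape(3, 3)
    return M / M[2, 2]

def apply_perspective(M, pt):
    # Multiply a point's homogeneous coordinates by M, then dehomogenise.
    x, y, w = M @ np.array([pt[0], pt[1], 1.0])
    return (x / w, y / w)

src = [(0., 0.), (0., 99.), (99., 99.), (99., 0.)]      # point combination B
dst = [(0., 0.), (0., 198.), (198., 198.), (198., 0.)]  # point combination A
M = perspective_matrix(src, dst)
```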

参见图5A-5B，图5A-5B是本申请实施例提供的播放界面示意图，待展示信息501的展示姿态与目标对象502的人脸姿态一致，都是向左旋转；待展示信息503的展示姿态与目标对象504的人脸姿态一致，都是向左旋转；并且，由于目标对象504的人脸相对于目标对象502的人脸向左旋转的角度更大，相应的，待展示信息503相对于待展示信息501向左旋转的角度更大。Referring to FIGS. 5A-5B, which are schematic diagrams of a playback interface provided by an embodiment of the present application, the display pose of the information to be displayed 501 is consistent with the face pose of the target object 502, both rotated to the left; likewise, the display pose of the information to be displayed 503 is consistent with the face pose of the target object 504, both rotated to the left. Moreover, since the face of the target object 504 is rotated further to the left than that of the target object 502, the information to be displayed 503 is correspondingly rotated further to the left than the information to be displayed 501.

应用上述实施例，通过对目标视频中包含目标对象的多个视频帧进行人脸特征点检测，得到包括人脸特征点的二维坐标的第一特征点集；基于第一特征点集，对目标对象进行三维重建，得到目标对象的人脸姿态、及目标对象的人脸特征点的三维坐标；对人脸特征点的三维坐标进行投影，得到包括人脸特征点的二维坐标的第二特征点集；将第二特征点集与第一特征点集进行比较，当第一特征点集与第二特征点集之间的误差满足误差条件时，将三维重建得到的人脸姿态作为目标视频中目标对象的目标人脸姿态；如此，通过投影得到第二特征点集，并将第二特征点集与第一特征点集进行比较，实现了对三维重建结果的校验，提高了目标人脸姿态确定的稳定性和准确性。Applying the above embodiment, facial feature point detection is performed on multiple video frames of the target video that contain the target object to obtain a first feature point set including two-dimensional coordinates of facial feature points; three-dimensional reconstruction of the target object is performed based on the first feature point set to obtain the face pose of the target object and the three-dimensional coordinates of its facial feature points; the three-dimensional coordinates are projected to obtain a second feature point set including two-dimensional coordinates of the facial feature points; the second feature point set is compared with the first feature point set, and when the error between them satisfies the error condition, the face pose obtained by the three-dimensional reconstruction is taken as the target face pose of the target object in the target video. In this way, obtaining the second feature point set by projection and comparing it with the first verifies the three-dimensional reconstruction result and improves the stability and accuracy of determining the target face pose.

下面,将说明本申请实施例在一个实际的应用场景中的示例性应用。Below, an exemplary application of the embodiments of the present application in a practical application scenario will be described.

图6是本申请实施例提供的视频中人脸姿态的处理方法的流程示意图，参见图6，在一些实施例中，该视频中人脸姿态的处理方法可由终端或服务器单独实施，或由终端和服务器协同实施，以服务器单独实施为例，本申请实施例提供的视频中人脸姿态的处理方法包括：FIG. 6 is a schematic flowchart of a method for processing a face pose in a video provided by an embodiment of the present application. Referring to FIG. 6, in some embodiments, the method may be implemented by a terminal or a server alone, or by a terminal and a server in cooperation. Taking implementation by the server alone as an example, the method includes:

步骤601:服务器接收到用户上传的目标视频。Step 601: The server receives the target video uploaded by the user.

步骤602:服务器通过人脸检测算法,对目标视频中各视频帧进行人脸检 测。Step 602: The server performs face detection on each video frame in the target video through a face detection algorithm.

对于每个视频帧，对视频帧进行人脸检测，检测出视频帧中的所有人脸，这里可以采用多任务卷积神经网络（MTCNN，Multi-task convolutional neural network）对视频帧进行人脸检测。For each video frame, face detection is performed to detect all faces in the frame; here, a multi-task convolutional neural network (MTCNN) can be used to perform the face detection.

步骤603:服务器对各视频帧中的人脸进行重识别。Step 603: The server re-identifies the human face in each video frame.

这里,可以采用明星识别算法,对人脸进行重识别,得到人脸轨迹。可以 理解的是,这里的人脸轨迹是与对象相对应的,也即当目标视频包含多个对象 时,可以获取各个对象的人脸轨迹。Here, the star recognition algorithm can be used to re-identify the face to obtain the face trajectory. It can be understood that the face trajectory here corresponds to the object, that is, when the target video contains multiple objects, the face trajectory of each object can be obtained.

在实际实施时,通过对各视频帧中的人脸进行人脸重识别,确定各人脸所 属对象,根据同一对象在连续视频帧中的位置,来得到人脸轨迹。In actual implementation, face re-identification is performed on the faces in each video frame to determine the object to which each face belongs, and the face trajectory is obtained according to the position of the same object in consecutive video frames.

在实际应用中，若某M个视频帧中没有检测到某一对象的人脸，但M个视频帧前后的两个视频帧中检测到了该对象的人脸，且前后的两个视频帧中，该对象的人脸位置（人脸框的位置）足够接近，则采用插值的方式，补充中间帧的人脸数据（人脸位置、人脸姿态等）。其中，M为自然数，如M可以取2。In practical applications, if the face of an object is not detected in some M video frames, but it is detected in the two video frames immediately before and after those M frames, and in those two frames the face position (the position of the face box) is close enough, interpolation is used to supplement the face data (face position, face pose, etc.) of the intermediate frames, where M is a natural number, for example M = 2.

若某一视频帧检测出某一对象的人脸，但相对于该视频帧的前后两个视频帧的人脸位置发生跳变（相邻两个视频帧中人脸位置间的距离超过距离阈值，如图像高度的5%），则认为该视频帧人脸检测失败，通过前后帧人脸的数据，采用插值的方式，得到该视频帧中正确的人脸数据。If the face of an object is detected in a video frame but its position jumps relative to the two neighbouring frames (the distance between face positions in adjacent frames exceeds a distance threshold, for example 5% of the image height), face detection in that frame is considered to have failed, and the correct face data for the frame is obtained by interpolating from the face data of the neighbouring frames.
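对漏检帧用插值补充人脸数据的做法可示意如下（线性插值；None表示该帧未检测到人脸，max_gap对应文中的M）：Filling missed detections by interpolation can be sketched as follows (linear interpolation; None marks a frame with no detection, and max_gap corresponds to M in the text):

```python
import numpy as np

def fill_short_gaps(track, max_gap=2):
    # Linearly interpolate missing face data (None entries) when the gap
    # spans at most max_gap frames; longer gaps are left untouched, since
    # the text treats those as a shot cut rather than a missed detection.
    track = list(track)
    n = len(track)
    i = 0
    while i < n:
        if track[i] is None:
            j = i
            while j < n and track[j] is None:
                j += 1
            gap = j - i
            if i > 0 and j < n and gap <= max_gap:
                a = np.asarray(track[i - 1], float)
                b = np.asarray(track[j], float)
                for k in range(gap):
                    t = (k + 1) / (gap + 1)
                    track[i + k] = ((1 - t) * a + t * b).tolist()
            i = j
        else:
            i += 1
    return track

filled = fill_short_gaps([[0.0, 0.0], None, None, [3.0, 3.0]])
```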

若连续N个视频帧中均无某一对象的人脸，但N个视频帧前后的两个视频帧中检测到了该对象的人脸，则认为视频画面发生了切换，将人脸轨迹按照切换的时刻分为2段，这里的N为自然数，且N大于M，如M为2，N为3。If there is no face of an object in N consecutive video frames, but the face of the object is detected in the two video frames before and after these N frames, it is considered that a shot cut has occurred, and the face track is split into 2 segments at the moment of the cut, where N is a natural number greater than M, for example M is 2 and N is 3.

步骤604:服务器对人脸特征点进行检测。Step 604: The server detects the facial feature points.

这里，基于人脸轨迹，对人脸进行特征点检测，可以使用68点的特征点检测算法来对人脸进行特征点检测，如采用人脸注意力机制网络（FAN，Face Attention Network）对人脸特征点进行检测。Here, based on the face track, feature point detection is performed on the face. A 68-point feature point detection algorithm can be used, for example a Face Attention Network (FAN), to detect the facial feature points.

在实际实施时，基于检测出的人脸，截取各视频帧对应的人脸区域图像，并根据人脸轨迹，获取与各人脸轨迹相匹配的人脸区域图像序列，这里，对于人脸区域图像序列中的每个人脸区域图像，对该人脸区域图像进行人脸特征点检测，确定各人脸特征点在人脸区域图像中的坐标；然后基于人脸区域图像在视频帧中的位置，对人脸特征点的坐标进行转换，以确定各人脸特征点在视频帧中的坐标。In actual implementation, based on the detected faces, the face region image corresponding to each video frame is cropped, and a sequence of face region images matching each face track is obtained according to the face tracks. For each face region image in the sequence, facial feature point detection is performed to determine the coordinates of each feature point within the face region image; the coordinates are then converted based on the position of the face region image within the video frame, to determine each feature point's coordinates in the video frame.
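从人脸区域图像坐标到视频帧坐标的转换，只需加上人脸区域在帧中的左上角偏移：Converting landmark coordinates from the face-crop image back to the full video frame only requires adding the crop's top-left offset within the frame:

```python
def crop_to_frame_coords(landmarks, crop_left, crop_top):
    # Landmarks are (x, y) pairs in crop coordinates; the crop's top-left
    # corner sits at (crop_left, crop_top) in the video frame.
    return [(x + crop_left, y + crop_top) for (x, y) in landmarks]

frame_pts = crop_to_frame_coords([(5, 7), (0, 0)], 100, 50)
```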

步骤605:获取初始相机内参数。Step 605: Obtain initial in-camera parameters.

这里,获取一个用于后续算法初始化的初始相机内参数,该初始相机内参 数可以人工给定,或基于相机内参估计算法得到。这里,初始相机内参数为< 焦距设置为50mm><光心位置设置到成像面中心位置>,当使用人工给出初始相 机内参数时,可以根据图像的高度进行设定。Here, an initial in-camera parameter for the initialization of the subsequent algorithm is obtained, and the initial in-camera parameter can be manually given or obtained based on the estimation algorithm of the camera's internal parameter. Here, the initial in-camera parameters are <focal length is set to 50mm> <optical center position is set to the center of the imaging plane>, when the initial in-camera parameters are given manually, they can be set according to the height of the image.

步骤606:基于初始相机内参数和人脸特征点进行三维重建。Step 606: Perform three-dimensional reconstruction based on the initial in-camera parameters and facial feature points.

这里，采用单目相机三维重建算法进行三维重建，如通过运动恢复结构（SFM，Structure From Motion）算法，获取人脸姿态、精确的相机内参数及人脸特征点的三维坐标。Here, a monocular-camera three-dimensional reconstruction algorithm is used, for example a Structure From Motion (SFM) algorithm, to obtain the face pose, accurate camera intrinsic parameters and the three-dimensional coordinates of the facial feature points.

下面对三维重建过程进行具体说明。图7是本申请实施例提供的三维重建流程示意图，参见图7，本申请实施例提供的三维重建流程包括：The three-dimensional reconstruction process is specifically described below. FIG. 7 is a schematic diagram of the three-dimensional reconstruction process provided by an embodiment of the present application. Referring to FIG. 7, the process includes:

步骤701:获取每个视频帧中的人脸坐标。Step 701: Acquire face coordinates in each video frame.

步骤702:对人脸特征点进行筛选,得到第一特征点集。Step 702: Screen the face feature points to obtain a first feature point set.

这里，可以通过迭代的方式对人脸特征点进行筛选，直至对第一特征点集进行更新后，得到的误差满足误差条件，或者人脸特征点的数量低于数量阈值。首次保留所有的人脸特征点，在迭代过程中，采用如图3所示方法对人脸特征点进行筛选，参见图3，对于每个人脸特征点，从第一特征点集中剔除该人脸特征点的二维坐标，得到相应的第三特征点集，如从第一特征点集中剔除人脸特征点1的二维坐标，得到相应的第三特征点集，即第三特征点集1；当人脸特征点的数量为N个时，可以得到N个第三特征点集。在得到N个第三特征点集后，对于每个第三特征点集，将该第三特征点集输入SFM模块，对目标对象进行三维重建，得到目标对象的人脸特征点的三维坐标；将基于该第三特征点集进行三维重建得到的三维坐标投影至相机成像面，得到包括人脸特征点的二维坐标的第四特征点集；在得到N个第三特征点集对应的第四特征点集后，基于各第三特征点集与相应的第四特征点集间的误差，选取相应误差最小的第三特征点集更新第一特征点集。Here, the facial feature points can be screened iteratively until, after the first feature point set is updated, the resulting error satisfies the error condition, or the number of facial feature points falls below the number threshold. All facial feature points are kept the first time. During iteration, the method shown in FIG. 3 is used: for each facial feature point, its two-dimensional coordinates are removed from the first feature point set to obtain a corresponding third feature point set; for example, removing the two-dimensional coordinates of facial feature point 1 yields third feature point set 1. When there are N facial feature points, N third feature point sets are obtained. Each third feature point set is then input into the SFM module to perform three-dimensional reconstruction of the target object, obtaining the three-dimensional coordinates of the facial feature points; the reconstructed three-dimensional coordinates are projected onto the camera imaging plane to obtain a fourth feature point set including two-dimensional coordinates of the facial feature points. After the fourth feature point sets corresponding to the N third feature point sets are obtained, based on the error between each third feature point set and its corresponding fourth feature point set, the third feature point set with the smallest error is selected to update the first feature point set.

步骤703:获取目标相机内参数。Step 703: Acquire internal parameters of the target camera.

这里,首次获取的目标相机内参数为初始相机内参数,该初始相机内参数 可以是人为设定的,在执行相机内参数的更新操作后,该目标相机内参数为更 新得到的相机内参数。Here, the target camera internal parameters acquired for the first time are the initial camera internal parameters, and the initial camera internal parameters can be set manually. After the update operation of the camera internal parameters is performed, the target camera internal parameters are the updated camera internal parameters.

步骤704:将目标相机内参数和第一特征点集输入SFM模块，输出人脸姿态、相机内参数及人脸特征点的三维坐标。Step 704: Input the target camera intrinsic parameters and the first feature point set into the SFM module, and output the face pose, the camera intrinsic parameters, and the three-dimensional coordinates of the facial feature points.

这里，通过SFM模块，基于第一特征点集中各人脸特征点在多个视频帧中的位置关系，对目标对象进行三维重建，以得到人脸姿态、相机内参数及人脸特征点的三维坐标。Here, through the SFM module, the target object is reconstructed in three dimensions based on the positional relationship of each facial feature point of the first feature point set across multiple video frames, so as to obtain the face pose, the camera intrinsic parameters, and the three-dimensional coordinates of the facial feature points.

步骤705:根据人脸姿态和相机内参数,将人脸特征点的三维坐标投影到 相机成像面上,得到第二特征点集。Step 705: Project the three-dimensional coordinates of the facial feature points onto the imaging surface of the camera according to the facial posture and the internal parameters of the camera to obtain a second feature point set.

步骤706:获取第一特征点集与第二特征点集间的均方误差。Step 706: Obtain the mean square error between the first feature point set and the second feature point set.

步骤707:判断均方误差是否小于误差阈值,若是,执行步骤708;否则执 行步骤709及步骤702。Step 707: Determine whether the mean square error is less than the error threshold, if so, go to Step 708; otherwise, go to Step 709 and Step 702.
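Steps 706 and 707 amount to a reprojection-error test; a minimal sketch (the threshold value here is illustrative, not taken from this embodiment):

```python
import numpy as np

def reprojection_mse(first_set, second_set):
    """Mean square error between matched 2D feature point sets (step 706)."""
    diff = np.asarray(first_set, float) - np.asarray(second_set, float)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

# Step 707: accept the reconstruction only when the error is below a threshold.
ERROR_THRESHOLD = 2.0  # pixels^2; assumed value for illustration
a = [[10.0, 20.0], [30.0, 40.0]]  # first feature point set (detected)
b = [[10.0, 21.0], [31.0, 40.0]]  # second feature point set (reprojected)
converged = reprojection_mse(a, b) < ERROR_THRESHOLD
```

When `converged` is false, the flow updates the camera intrinsics (step 709) and repeats from step 702.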

步骤708:输出得到的人脸姿态、相机内参数及人脸特征点的三维坐标。Step 708 : Output the obtained face pose, internal parameters of the camera, and three-dimensional coordinates of the face feature points.

步骤709:采用得到的相机内参数更新目标相机内参数。Step 709: Update the target camera in-camera parameters using the obtained in-camera parameters.

这里,输出得到的人脸姿态后,还可以基于人脸轨迹,对人脸姿态进行滤 波处理,得到无抖动的人脸姿态。Here, after the obtained face pose is output, the face pose can also be filtered based on the face trajectory to obtain a face pose without jitter.

对于每段人脸轨迹,获取该人脸轨迹对应的人脸姿态序列;这里,人脸姿态序列中的人脸姿态包括6个维度的数据,分别对该人脸姿态序列中的各维度的数据进行滤波处理,以滤除抖动。其中,可以采用扩展卡尔曼滤波对各维度的数据进行滤波,使用人脸轨迹开始时间点的值作为扩展卡尔曼滤波的初始值,保证每段轨迹开始时刻,滤波的结果正常可用。For each face track, the face pose sequence corresponding to the track is obtained. Here, a face pose in the sequence consists of data in 6 dimensions, and the data of each dimension of the face pose sequence is filtered separately to remove jitter. An extended Kalman filter can be used to filter the data of each dimension, with the value at the start time of the face track used as the initial value of the extended Kalman filter, which ensures that the filtered result is usable from the start of each track.
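The per-dimension smoothing of a face pose track can be sketched as follows; for brevity this sketch uses a scalar linear Kalman filter per dimension rather than a full extended Kalman filter, and the noise parameters are assumed values:

```python
import numpy as np

def smooth_pose_track(poses, q=1e-3, r=1e-1):
    """Filter each of the 6 pose dimensions independently, seeding each
    filter with the value at the start of the track (simplified linear
    Kalman filter; q = process noise, r = measurement noise).

    poses: (T, 6) array -- rotation (3) + translation (3) per frame.
    """
    poses = np.asarray(poses, float)
    out = np.empty_like(poses)
    for d in range(poses.shape[1]):
        x = poses[0, d]        # initial value = start of the track
        p = 1.0                # initial estimate variance
        out[0, d] = x
        for t in range(1, poses.shape[0]):
            p += q                         # predict
            k = p / (p + r)                # Kalman gain
            x += k * (poses[t, d] - x)     # update toward the measurement
            p *= (1.0 - k)
            out[t, d] = x
    return out
```

A constant pose passes through unchanged, while high-frequency jitter is attenuated.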

在得到无抖动的人脸姿态后,按照视频帧的顺序,把每个视频帧检测出的人脸位置、人脸姿态、人脸ID(标识对象)、视频帧序号、视频ID、镜头焦距等的信息,存储到人脸姿态数据库中,供后续下发终端使用。After the jitter-free face pose is obtained, information such as the detected face position, face pose, face ID (identifying the object), video frame number, video ID, and lens focal length of each video frame is stored, in video-frame order, in the face pose database for subsequent delivery to terminals.

这里,在得到人脸姿态后,可以实现将待展示信息与人脸进行绑定,以在 展示待展示信息时,使待展示信息的展示姿态随对象的人脸姿态的变化,而同 步变化。Here, after obtaining the facial posture, the information to be displayed can be bound with the human face, so that when the information to be displayed is displayed, the presentation posture of the information to be displayed changes synchronously with the change of the facial posture of the object.

这里,以待展示信息为弹幕信息为例,参见图5A,在目标对象502的头顶区域展示与该目标对象相关联的弹幕信息(待展示信息)501,其中,目标对象的人脸姿态与弹幕的展示姿态相匹配,在视频的播放过程中,随着目标对象的人脸姿态的变化,弹幕的展示姿态随之变化。Here, taking bullet-screen comments as the information to be displayed as an example, referring to FIG. 5A, bullet-screen information 501 (the information to be displayed) associated with the target object 502 is displayed in the area above the head of the target object 502, where the display posture of the bullet screen matches the face pose of the target object. During playback of the video, as the face pose of the target object changes, the display posture of the bullet screen changes accordingly.

其中,目标对象与弹幕之间的关联关系,可以是通过将弹幕与视频画面中 的对象进行匹配确定的,也可以是用户在发表弹幕时,由用户关联的。The association between the target object and the bullet screen may be determined by matching the bullet screen with the object in the video screen, or it may be associated by the user when the user publishes the bullet screen.

以待展示信息为字幕为例,参见图5B,在目标对象504的头顶区域展示与该目标对象当前对话内容对应的字幕503,其中,目标对象的人脸姿态与字幕的展示姿态相匹配,在视频的播放过程中,随着目标对象的人脸姿态的变化,字幕的展示姿态随之变化。Taking subtitles as the information to be displayed as an example, referring to FIG. 5B, a subtitle 503 corresponding to the current dialogue of the target object 504 is displayed in the area above the head of the target object, where the face pose of the target object matches the display posture of the subtitle. During playback of the video, as the face pose of the target object changes, the display posture of the subtitle changes accordingly.

下面对在视频帧中添加待展示信息的具体过程进行说明,在视频帧中添加待展示信息的具体过程可以由终端执行,也可以由服务器执行,还可以由服务器和终端共同执行,以终端执行为例,图8是本申请实施例提供的在视频帧中添加待展示信息过程示意图,参见图8,当待展示信息为文本时,终端在视频帧中添加待展示信息的过程包括:The specific process of adding the information to be displayed to a video frame is described below. This process can be performed by the terminal, by the server, or jointly by the server and the terminal. Taking execution by the terminal as an example, FIG. 8 is a schematic diagram of the process of adding the information to be displayed to a video frame provided by an embodiment of the present application. Referring to FIG. 8, when the information to be displayed is text, the process by which the terminal adds the information to be displayed to a video frame includes:

步骤801:绘制文本图片。Step 801: Draw a text picture.

按照待展示信息的文本内容,绘制文本图片;其中,绘制时可选自动换行, 即字符长度达到给定阈值后,自动添加换行符号,文字带有透明通道,非文字 区域为透明色。Draw text pictures according to the text content of the information to be displayed; among them, automatic line wrapping is optional during drawing, that is, after the character length reaches a given threshold, a line break symbol is automatically added, the text has a transparent channel, and the non-text area is transparent.

步骤802:确定文本图片在人脸坐标系下的坐标。Step 802: Determine the coordinates of the text picture in the face coordinate system.

这里,以绘制在目标对象的头顶为例,在人脸坐标系中,文本图片到鼻子的高度D(y轴方向)固定,文本图片的高度H固定,按照文本图片的宽高比r,计算文本图片在人脸坐标系下的宽度W,其中W=Hr。例如,D可以设置为125,H可以设置为40。Here, taking drawing above the top of the target object's head as an example: in the face coordinate system, the height D from the text picture to the nose (along the y-axis) is fixed and the height H of the text picture is fixed; the width W of the text picture in the face coordinate system is computed from the aspect ratio r of the text picture as W = Hr. For example, D can be set to 125 and H can be set to 40.

相应的,可以确定文本图片的四个顶点在人脸坐标系下的坐标,即左上角(-W/2,-D-H,0)、左下角(-W/2,-D,0)、右下角(W/2,-D,0)、右上角(W/2,-D-H,0)。Correspondingly, the coordinates of the four vertices of the text picture in the face coordinate system can be determined as: upper-left (-W/2, -D-H, 0), lower-left (-W/2, -D, 0), lower-right (W/2, -D, 0), and upper-right (W/2, -D-H, 0).
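The vertex computation of step 802 can be sketched directly; the default D and H follow the example values above, and the function name is illustrative:

```python
def text_quad_vertices(w_px, h_px, D=125.0, H=40.0):
    """Vertices of the text picture in the face coordinate system.
    w_px, h_px: pixel size of the rendered text image; D is the fixed
    distance above the nose and H the fixed quad height (example values)."""
    r = w_px / h_px          # aspect ratio of the text image
    W = H * r                # quad width in face coordinates, W = Hr
    return [(-W / 2, -D - H, 0.0),   # top-left
            (-W / 2, -D, 0.0),       # bottom-left
            ( W / 2, -D, 0.0),       # bottom-right
            ( W / 2, -D - H, 0.0)]   # top-right
```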

步骤803:基于人脸姿态,采用点云投射的方法,将文本图片在人脸坐标 系下的坐标,投影到相机成像面上。Step 803: Based on the posture of the face, using the method of point cloud projection, project the coordinates of the text picture in the face coordinate system onto the imaging surface of the camera.

在实际实施时,采用点云投射的方法,将文本图片的4个顶点在人脸坐标系下的坐标,投影到相机成像面上,得到文本图片的4个顶点在视频帧中的坐标,这4个顶点在视频帧中的坐标组成点组合A。In actual implementation, the point cloud projection method is used to project the coordinates of the 4 vertices of the text picture in the face coordinate system onto the camera imaging plane, obtaining the coordinates of the 4 vertices of the text picture in the video frame; the coordinates of these 4 vertices in the video frame form point set A.

在实际应用中,基于人脸姿态,计算文本图片在相机坐标系下的三维坐标; 然后将文本图片在相机坐标下的三维坐标,投影到相机成像面上。In practical applications, the three-dimensional coordinates of the text image in the camera coordinate system are calculated based on the face pose; then the three-dimensional coordinates of the text image in the camera coordinate system are projected onto the camera imaging surface.
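The point projection used in step 803 can be sketched as a standard pinhole projection; `R` and `t` denote the face pose (rotation and translation into the camera frame) and `K` the camera intrinsic matrix (names are assumptions for illustration):

```python
import numpy as np

def project_points(points3d, R, t, K):
    """Project face-coordinate points onto the camera imaging plane:
    rotate/translate into the camera frame with the face pose (R, t),
    then apply the camera intrinsics K and the perspective divide."""
    pts = np.asarray(points3d, float)
    cam = pts @ R.T + t                  # face coords -> camera coords
    uvw = cam @ K.T                      # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]      # perspective divide -> pixels
```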

步骤804:计算将文本图片变换到相机成像面上的变换矩阵。Step 804: Calculate the transformation matrix for transforming the text image to the camera imaging surface.

在实际实施时,文本图片的4个顶点的原始二维坐标为左上角(0,0)、左下角(0,h-1)、右下角(w-1,h-1)、右上角(w-1,0),将这4个顶点的原始二维坐标组成点组合B。这里,计算点组合B变换到点组合A的变换矩阵M。In actual implementation, the original two-dimensional coordinates of the 4 vertices of the text picture are: upper-left (0, 0), lower-left (0, h-1), lower-right (w-1, h-1), and upper-right (w-1, 0); these original two-dimensional coordinates form point set B. Here, the transformation matrix M that maps point set B to point set A is computed.
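Mapping four point correspondences from point set B to point set A is a perspective (homography) estimation problem; a minimal direct-linear-transform sketch, not the embodiment's own implementation:

```python
import numpy as np

def homography_from_points(src, dst):
    """Solve the 3x3 perspective transform M mapping the four original
    text-image corners (point set B) to their projected positions in the
    video frame (point set A), via the direct linear transform (DLT)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # The homography is the null vector of A (last right singular vector).
    _, _, vt = np.linalg.svd(np.asarray(A, float))
    M = vt[-1].reshape(3, 3)
    return M / M[2, 2]
```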

步骤805:使用变换矩阵,对文本图片进行仿射变换,得到变换后的文本 图片。Step 805: Using the transformation matrix, perform affine transformation on the text picture to obtain the transformed text picture.

步骤806:将变换后的文本图片叠加到视频帧中,得到添加有待展示信息 的视频帧。Step 806: The transformed text picture is superimposed on the video frame to obtain the video frame to which the information to be displayed is added.

应用上述实施例,能够得到更为精准的相机内参数,保证后续待展示信息添加时的效果,提高了人脸姿态确定的准确性;并且,在确定人脸姿态的同时,还能够获取人脸特征点的三维坐标,以得到人脸的三维形状。By applying the above embodiment, more accurate camera intrinsic parameters can be obtained, which guarantees the effect of subsequently adding the information to be displayed and improves the accuracy of face pose determination; moreover, while the face pose is determined, the three-dimensional coordinates of the face feature points can also be obtained, yielding the three-dimensional shape of the face.

下面继续说明本申请实施例提供的视频中人脸姿态的处理装置。参见图9,图9是本申请实施例提供的视频中人脸姿态的处理装置的结构示意图,本申请实施例提供的视频中人脸姿态的处理装置包括:The following continues to describe the apparatus for processing face poses in a video provided by the embodiments of the present application. Referring to FIG. 9, FIG. 9 is a schematic structural diagram of the apparatus for processing face poses in a video provided by an embodiment of the present application; the apparatus includes:

检测模块901,用于对目标视频中包含目标对象的多个视频帧进行人脸特征点检测,得到包括人脸特征点的二维坐标的第一特征点集;A detection module 901, configured to perform face feature point detection on multiple video frames containing the target object in the target video, to obtain a first feature point set including the two-dimensional coordinates of face feature points;

重建模块902,用于基于所述第一特征点集,对所述目标对象进行三维重建,得到所述目标对象的人脸姿态、及所述目标对象的人脸特征点的三维坐标;A reconstruction module 902, configured to perform three-dimensional reconstruction on the target object based on the first feature point set, to obtain the face pose of the target object and the three-dimensional coordinates of the face feature points of the target object;

投影模块903,用于对所述人脸特征点的三维坐标进行投影,得到包括人脸特征点的二维坐标的第二特征点集;A projection module 903, configured to project the three-dimensional coordinates of the face feature points, to obtain a second feature point set including the two-dimensional coordinates of the face feature points;

比较模块904,用于将所述第二特征点集与所述第一特征点集进行比较, 当所述第一特征点集与所述第二特征点集之间的误差未满足误差条件时,将所 述三维重建得到的所述人脸姿态作为所述目标视频中所述目标对象的目标人脸 姿态。A comparison module 904, configured to compare the second feature point set with the first feature point set, when the error between the first feature point set and the second feature point set does not satisfy an error condition , using the face pose obtained by the three-dimensional reconstruction as the target face pose of the target object in the target video.

在一些实施例中,所述检测模块901,还用于从目标视频中包含目标对象的多个视频帧中,截取目标对象的人脸区域得到多张人脸区域图像;In some embodiments, the detection module 901 is further configured to crop the face region of the target object from the multiple video frames containing the target object in the target video, to obtain multiple face region images;

分别对各人脸区域图像进行人脸特征点检测,得到各人脸区域图像中人脸 特征点的二维坐标;Perform face feature point detection on each face region image respectively, and obtain the two-dimensional coordinates of the face feature points in each face region image;

对人脸区域图像中人脸特征点的二维坐标进行坐标变换,得到各视频帧中 人脸特征点的二维坐标;Perform coordinate transformation on the two-dimensional coordinates of the face feature points in the face area image to obtain the two-dimensional coordinates of the face feature points in each video frame;

将各视频帧中人脸特征点的二维坐标组成第一特征点集。The two-dimensional coordinates of the face feature points in each video frame form a first feature point set.

在一些实施例中,所述检测模块901,还用于对所述目标视频进行人脸重 识别,得到目标对象的人脸轨迹;In some embodiments, the detection module 901 is also used to carry out face re-identification to the target video to obtain the face track of the target object;

所述对目标视频中包含目标对象的多个视频帧进行人脸特征点检测,包括:The described detection of facial feature points on multiple video frames containing the target object in the target video, including:

根据所述目标对象的人脸轨迹,确定所述目标视频中包含所述目标对象的多个视频帧、以及所述视频帧中目标对象的人脸位置;Determining, according to the face track of the target object, the multiple video frames containing the target object in the target video and the face position of the target object in each video frame;

对所述视频帧中位于所述人脸位置的人脸进行人脸特征点检测。Perform face feature point detection on the face located at the face position in the video frame.

在一些实施例中,所述检测模块901,还用于对目标视频进行人脸重识别,将识别到目标对象的视频帧确定为包含目标对象的视频帧,并确定所述视频帧中目标对象的人脸位置;In some embodiments, the detection module 901 is further configured to perform face re-identification on the target video, determine video frames in which the target object is recognized as video frames containing the target object, and determine the face position of the target object in those video frames;

对于包含目标对象的第一视频帧和第二视频帧,当在第一视频帧与第二视频帧之间的至少一个第三视频帧中均未识别到目标对象、且所述第三视频帧的数量小于第一数量阈值时,获取所述第一视频帧中目标对象的第一人脸位置、以及所述第二视频帧中目标对象的第二人脸位置;For a first video frame and a second video frame containing the target object, when the target object is not recognized in at least one third video frame between the first video frame and the second video frame, and the number of such third video frames is less than a first quantity threshold, obtain the first face position of the target object in the first video frame and the second face position of the target object in the second video frame;

当所述第一人脸位置与所述第二人脸位置之间的距离小于距离阈值时,确 定所述第三视频帧包含目标对象,并When the distance between the first face position and the second face position is less than a distance threshold, determine that the third video frame contains a target object, and

根据所述第一人脸位置及所述第二人脸位置,进行插值处理,得到至少一 个所述第三视频帧中目标对象的人脸位置,以生成目标对象的人脸轨迹。According to the position of the first face and the position of the second face, interpolation processing is performed to obtain the face position of the target object in at least one of the third video frames, so as to generate the face track of the target object.
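The interpolation across undetected third video frames can be sketched with simple linear interpolation (the embodiment does not specify the interpolation method, so linear is an assumption):

```python
def interpolate_face_positions(pos1, pos2, num_missing):
    """Linearly interpolate the face position across the third video
    frames where detection failed, given the first face position (pos1)
    and the second face position (pos2)."""
    (x1, y1), (x2, y2) = pos1, pos2
    n = num_missing + 1
    return [(x1 + (x2 - x1) * k / n, y1 + (y2 - y1) * k / n)
            for k in range(1, n)]
```

The interpolated positions fill the gap in the face track so that a single continuous track is generated.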

在一些实施例中,所述检测模块901,还用于对目标视频中多个视频帧进行人脸重识别,将识别到目标对象的视频帧确定为包含目标对象的视频帧,并确定所述视频帧中目标对象的人脸位置;In some embodiments, the detection module 901 is further configured to perform face re-identification on multiple video frames in the target video, determine video frames in which the target object is recognized as video frames containing the target object, and determine the face position of the target object in those video frames;

对于包含目标对象的第一视频帧和第二视频帧,当在第一视频帧与第二视频帧之间的至少一个第三视频帧中均未识别到目标对象、且所述第三视频帧的数量达到第二数量阈值时,基于所述第一视频帧中目标对象的第一人脸位置、以及所述第二视频帧中目标对象的第二人脸位置,生成目标对象的至少两段人脸轨迹。For a first video frame and a second video frame containing the target object, when the target object is not recognized in at least one third video frame between the first video frame and the second video frame, and the number of such third video frames reaches a second quantity threshold, generate at least two face tracks of the target object based on the first face position of the target object in the first video frame and the second face position of the target object in the second video frame.

在一些实施例中,所述重建模块902,还用于基于所述第一特征点集进行 人脸姿态估计,得到所述目标对象的人脸姿态;In some embodiments, the reconstruction module 902 is further configured to perform face pose estimation based on the first feature point set to obtain the face pose of the target object;

基于所述目标对象的人脸姿态,通过三角测量处理对所述目标对象进行三 维重建,确定所述目标对象的人脸特征点的三维坐标。Based on the facial posture of the target object, three-dimensional reconstruction is performed on the target object through triangulation processing, and the three-dimensional coordinates of the facial feature points of the target object are determined.

在一些实施例中,所述重建模块902,还用于基于所述第一特征点集中各 人脸特征点在多个所述视频帧中的位置关系,获取基础矩阵;In some embodiments, the reconstruction module 902 is further configured to obtain a fundamental matrix based on the positional relationship of each face feature point in the first feature point set in a plurality of the video frames;

对所述基础矩阵进行归一化处理,得到本质矩阵;Normalize the fundamental matrix to obtain an essential matrix;

对所述本质矩阵进行奇异值分解,得到所述目标对象的人脸姿态。Perform singular value decomposition on the essential matrix to obtain the face pose of the target object.
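The singular value decomposition step can be sketched as the standard recovery of a rotation/translation candidate from an essential matrix; a complete implementation would choose among the four candidates using a cheirality (point-in-front-of-camera) check, which is omitted here:

```python
import numpy as np

def pose_from_essential(E):
    """Recover one rotation/translation candidate from the essential
    matrix by singular value decomposition (one of the four standard
    candidates; disambiguation is omitted in this sketch)."""
    U, _, Vt = np.linalg.svd(E)
    # Enforce a proper rotation (determinant = +1).
    if np.linalg.det(U @ Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    R = U @ W @ Vt        # rotation part of the pose
    t = U[:, 2]           # translation direction (up to scale)
    return R, t
```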

在一些实施例中,所述重建模块902,还用于获取目标视频的目标相机内 参数;In some embodiments, the reconstruction module 902 is further configured to obtain the target in-camera parameters of the target video;

基于所述目标相机内参数和所述第一特征点集,对所述目标对象进行三维 重建;Based on the internal parameters of the target camera and the first feature point set, three-dimensional reconstruction is performed on the target object;

在对所述目标对象进行三维重建的过程中,获取所述目标视频的相机内参 数;In the process of carrying out three-dimensional reconstruction to the target object, obtain the camera internal parameters of the target video;

当所述第一特征点集与所述第二特征点集之间的误差满足误差条件时,基 于获取的所述相机内参数,更新所述目标相机内参数,以When the error between the first feature point set and the second feature point set satisfies the error condition, based on the acquired in-camera parameters, the target in-camera parameters are updated to

基于更新得到的所述目标相机内参数和所述第一特征点集,对所述目标对 象进行三维重建。Based on the updated internal parameters of the target camera and the first feature point set, three-dimensional reconstruction of the target object is performed.

在一些实施例中,所述重建模块902,还用于在对所述目标对象进行三维 重建的过程中,获取基础矩阵及本质矩阵;In some embodiments, the reconstruction module 902 is further configured to obtain a fundamental matrix and an essential matrix in the process of three-dimensional reconstruction of the target object;

根据基础矩阵与所述本质矩阵之间的转换关系,获取所述目标视频的相机 内参数。According to the conversion relationship between the fundamental matrix and the essential matrix, the in-camera parameters of the target video are acquired.

在一些实施例中,所述重建模块902,还用于当所述第一特征点集与所述第二特征点集之间的误差未满足误差条件、且每个所述视频帧中人脸特征点的数量为多个时,对于多个人脸特征点中的每个目标人脸特征点,从所述第一特征点集中剔除至少一个人脸特征点的二维坐标,得到第三特征点集;In some embodiments, the reconstruction module 902 is further configured to: when the error between the first feature point set and the second feature point set does not satisfy the error condition and each video frame contains multiple face feature points, for each target face feature point among the multiple face feature points, remove the two-dimensional coordinates of at least one face feature point from the first feature point set to obtain a third feature point set;

基于所述第三特征点集,对所述目标对象进行三维重建,得到所述目标对 象的人脸特征点的三维坐标;Based on the third feature point set, three-dimensional reconstruction is carried out to the target object, and the three-dimensional coordinates of the facial feature points of the target object are obtained;

将基于第三特征点集进行三维重建所得到的三维坐标投影至相机成像面, 得到与所述第三特征点集对应的包括人脸特征点的二维坐标的第四特征点集;Projecting the three-dimensional coordinates obtained by performing the three-dimensional reconstruction based on the third feature point set to the imaging plane of the camera to obtain a fourth feature point set corresponding to the third feature point set and including the two-dimensional coordinates of the face feature points;

基于所述第三特征点集与所述第四特征点集,更新所述第一特征点集。The first feature point set is updated based on the third feature point set and the fourth feature point set.

在一些实施例中,所述投影模块903,还用于根据所述人脸姿态,对所述人脸特征点的三维坐标针对相机成像面进行旋转和平移,得到所述人脸特征点在相机坐标系下的坐标;In some embodiments, the projection module 903 is further configured to rotate and translate the three-dimensional coordinates of the face feature points relative to the camera imaging plane according to the face pose, to obtain the coordinates of the face feature points in the camera coordinate system;

获取对应所述目标视频的目标相机内参数;Obtain the target camera internal parameters corresponding to the target video;

根据所述目标相机内参数,将所述相机坐标系下的坐标转换至图像坐标系 下的坐标,得到包括人脸特征点的二维坐标的第二特征点集。According to the internal parameters of the target camera, the coordinates under the camera coordinate system are converted to the coordinates under the image coordinate system to obtain a second feature point set including the two-dimensional coordinates of the facial feature points.

在一些实施例中,所述装置还包括:处理模块,用于基于待展示信息,生 成对应所述待展示信息的图片;In some embodiments, the apparatus further comprises: a processing module for generating a picture corresponding to the information to be displayed based on the information to be displayed;

获取所述图片的多个顶点的原始二维坐标、及所述图片的多个顶点的三维 坐标;Obtain the original two-dimensional coordinates of the multiple vertices of the picture and the three-dimensional coordinates of the multiple vertices of the picture;

基于所述目标人脸姿态,将所述图片的多个顶点的三维坐标投影到相机成 像面上,得到所述图片的多个顶点的二维坐标;Based on the gesture of the target face, the three-dimensional coordinates of the multiple vertices of the picture are projected onto the camera imaging surface to obtain the two-dimensional coordinates of the multiple vertices of the picture;

根据所述原始二维坐标和投影得到的所述二维坐标之间的映射关系,确定 透视变换矩阵;Determine the perspective transformation matrix according to the mapping relationship between the original two-dimensional coordinates and the two-dimensional coordinates obtained by projection;

通过透视变换矩阵,对所述图片进行仿射变换,得到变换后的目标图片;Through the perspective transformation matrix, affine transformation is performed on the picture to obtain the transformed target picture;

将所述目标图片叠加到视频帧中,得到添加有所述待展示信息的视频帧。The target picture is superimposed on the video frame to obtain the video frame to which the information to be displayed is added.

应用上述实施例,通过投影得到第二特征点集,并将第二特征点集与第一 特征点集进行比较,实现了对三维重建结果的校验,提高了目标人脸姿态确定 的稳定性和准确性。By applying the above embodiment, the second feature point set is obtained by projection, and the second feature point set is compared with the first feature point set, so as to realize the verification of the three-dimensional reconstruction result and improve the stability of the determination of the target face pose and accuracy.

本申请实施例还提供一种计算机设备,该计算机设备可以为终端或服务器, 参见图10,图10是本申请实施例提供的计算机设备的结构示意图,本申请实 施例提供的计算机设备包括:An embodiment of the present application also provides a computer device, which can be a terminal or a server. Referring to FIG. 10, FIG. 10 is a schematic structural diagram of the computer device provided by the embodiment of the present application. The computer device provided by the embodiment of the present application includes:

存储器550,用于存储可执行指令;memory 550 for storing executable instructions;

处理器510,用于执行所述存储器中存储的可执行指令时,实现本申请实 施例提供的信息展示方法。The processor 510 is configured to implement the information display method provided by the embodiments of the present application when executing the executable instructions stored in the memory.

这里,处理器510可以是一种集成电路芯片,具有信号的处理能力,例如 通用处理器、数字信号处理器(DSP,Digital Signal Processor),或者其他可编 程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,其中,通用处理 器可以是微处理器或者任何常规的处理器等。Here, the processor 510 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), or other programmable logic devices, discrete gate or transistor logic devices, Discrete hardware components, etc., where a general purpose processor may be a microprocessor or any conventional processor or the like.

存储器550可以是可移除的,不可移除的或其组合。示例性的硬件设备包 括固态存储器,硬盘驱动器,光盘驱动器等。存储器450可选地包括在物理位 置上远离处理器510的一个或多个存储设备。Memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 450 optionally includes one or more storage devices that are physically remote from processor 510.

存储器550包括易失性存储器或非易失性存储器,也可包括易失性和非易失性存储器两者。非易失性存储器可以是只读存储器(ROM,Read Only Memory),易失性存储器可以是随机存取存储器(RAM,Random Access Memory)。本申请实施例描述的存储器550旨在包括任意适合类型的存储器。The memory 550 includes volatile memory or non-volatile memory, and may also include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM, Read Only Memory), and the volatile memory may be a random access memory (RAM, Random Access Memory). The memory 550 described in the embodiments of the present application is intended to include any suitable type of memory.

在一些实施例中还可包括至少一个网络接口520和用户接口530。计算机 设备500中的各个组件通过总线系统540耦合在一起。可理解,总线系统540 用于实现这些组件之间的连接通信。总线系统540除包括数据总线之外,还包 括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图10中将 各种总线都标为总线系统540。At least one network interface 520 and user interface 530 may also be included in some embodiments. The various components in computer device 500 are coupled together by bus system 540. It can be understood that the bus system 540 is used to implement the connection communication between these components. In addition to the data bus, the bus system 540 also includes a power bus, a control bus, and a status signal bus. For clarity, however, the various buses are labeled as bus system 540 in Figure 10.

本申请实施例提供了一种计算机程序产品或计算机程序,该计算机程序产 品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。 计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该 计算机指令,使得该计算机设备执行本申请实施例上述的视频中人脸姿态的处 理方法。Embodiments of the present application provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium. The processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the above-mentioned method for processing the facial gesture in the video in the embodiment of the present application.

本申请实施例提供一种存储有可执行指令的计算机可读存储介质,其中存储有可执行指令,当可执行指令被处理器执行时,将引起处理器执行本申请实施例提供的视频中人脸姿态的处理方法。The embodiments of the present application provide a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to execute the method for processing face poses in a video provided by the embodiments of the present application.

在一些实施例中,计算机可读存储介质可以是FRAM、ROM、PROM、EPROM、EEPROM、闪存、磁表面存储器、光盘、或CD-ROM等存储器;也可以是包括上述存储器之一或任意组合的各种设备。In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or a CD-ROM; it may also be any of various devices including one or any combination of the foregoing memories.

在一些实施例中,可执行指令可以采用程序、软件、软件模块、脚本或代 码的形式,按任意形式的编程语言(包括编译或解释语言,或者声明性或过程 性语言)来编写,并且其可按任意形式部署,包括被部署为独立的程序或者被 部署为模块、组件、子例程或者适合在计算环境中使用的其它单元。In some embodiments, executable instructions may take the form of programs, software, software modules, scripts, or code, written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and which Deployment may be in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

作为示例,可执行指令可以但不一定对应于文件系统中的文件,可以被存储在保存其它程序或数据的文件的一部分中,例如,存储在超文本标记语言(HTML,Hyper Text Markup Language)文档中的一个或多个脚本中,存储在专用于所讨论的程序的单个文件中,或者,存储在多个协同文件(例如,存储一个或多个模块、子程序或代码部分的文件)中。As an example, executable instructions may, but do not necessarily, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple cooperating files (e.g., files that store one or more modules, subroutines, or code sections).

作为示例,可执行指令可被部署为在一个计算机设备上执行,或者在位于一个地点的多个计算机设备上执行,又或者,在分布在多个地点且通过通信网络互连的多个计算机设备上执行。As an example, executable instructions may be deployed to be executed on one computer device, on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network.

以上所述,仅为本申请的实施例而已,并非用于限定本申请的保护范围。 凡在本申请的精神和范围之内所作的任何修改、等同替换和改进等,均包含在 本申请的保护范围之内。The above descriptions are merely examples of the present application, and are not intended to limit the protection scope of the present application. Any modifications, equivalent replacements and improvements made within the spirit and scope of this application are all included within the protection scope of this application.

Claims (15)

1.一种视频中人脸姿态的处理方法,其特征在于,包括:1. a processing method of human face posture in a video, is characterized in that, comprises: 对目标视频中包含目标对象的多个视频帧进行人脸特征点检测,得到包括人脸特征点的二维坐标的第一特征点集;Performing facial feature point detection on multiple video frames containing the target object in the target video to obtain a first feature point set including the two-dimensional coordinates of the facial feature points; 基于所述第一特征点集,对所述目标对象进行三维重建,得到所述目标对象的人脸姿态、及所述目标对象的人脸特征点的三维坐标;Based on the first feature point set, three-dimensional reconstruction is performed on the target object to obtain the face pose of the target object and the three-dimensional coordinates of the face feature points of the target object; 对所述人脸特征点的三维坐标进行投影,得到包括人脸特征点的二维坐标的第二特征点集;Projecting the three-dimensional coordinates of the face feature points to obtain a second feature point set including the two-dimensional coordinates of the face feature points; 将所述第二特征点集与所述第一特征点集进行比较,当所述第一特征点集与所述第二特征点集之间的误差满足误差条件时,将所述三维重建得到的所述人脸姿态作为所述目标视频中所述目标对象的目标人脸姿态。Comparing the second feature point set with the first feature point set, when the error between the first feature point set and the second feature point set satisfies the error condition, obtain the three-dimensional reconstruction The face pose of the target video is taken as the target face pose of the target object in the target video. 2.如权利要求1所述的方法,其特征在于,所述对目标视频中包含目标对象的多个视频帧进行人脸特征点检测,得到包括人脸特征点的二维坐标的第一特征点集,包括:2. 
method as claimed in claim 1 is characterized in that, described in the target video, comprises the multiple video frame of target object and carries out facial feature point detection, obtains the first feature that comprises the two-dimensional coordinates of facial feature point point set, including: 从目标视频中包含目标对象的多个视频帧中,截取目标对象的人脸区域得到多张人脸区域图像;From multiple video frames containing the target object in the target video, intercept the face region of the target object to obtain multiple face region images; 分别对各人脸区域图像进行人脸特征点检测,得到各人脸区域图像中人脸特征点的二维坐标;Perform face feature point detection on each face region image respectively, and obtain the two-dimensional coordinates of the face feature points in each face region image; 对人脸区域图像中人脸特征点的二维坐标进行坐标变换,得到各视频帧中人脸特征点的二维坐标;Perform coordinate transformation on the two-dimensional coordinates of the face feature points in the face area image to obtain the two-dimensional coordinates of the face feature points in each video frame; 将各视频帧中人脸特征点的二维坐标组成第一特征点集。The two-dimensional coordinates of the face feature points in each video frame form a first feature point set. 3.如权利要求1所述的方法,其特征在于,所述对目标视频包含目标对象的多个视频帧进行人脸特征点检测之前,还包括:3. 
The method according to claim 1, wherein before the multiple video frames of the target video including the target object are subjected to facial feature point detection, the method further comprises: 对所述目标视频进行人脸重识别,得到目标对象的人脸轨迹;Perform face re-identification on the target video to obtain the face trajectory of the target object; 所述对目标视频中包含目标对象的多个视频帧进行人脸特征点检测,包括:The described detection of facial feature points on multiple video frames containing the target object in the target video, including: 根据所述目标对象的人脸轨迹,确定所述目标视频中包含所述目标对象的多个视频帧、以及所述视频帧中目标对象的人脸位置;According to the face trajectory of the target object, determine a plurality of video frames including the target object in the target video and the face position of the target object in the video frame; 对所述视频帧中位于所述人脸位置的人脸进行人脸特征点检测。Perform face feature point detection on the face located at the face position in the video frame. 4.如权利要求3所述的方法,其特征在于,所述对所述目标视频进行人脸重识别,得到目标对象的人脸轨迹,包括:4. method as claimed in claim 3, is characterized in that, described carrying out face re-identification to described target video, obtains the face track of target object, comprising: 对目标视频进行人脸重识别,将识别到目标对象的视频帧确定为包含目标对象的视频帧,并确定所述视频帧中目标对象的人脸位置;Carry out face re-identification to the target video, identify the video frame of the target object as a video frame containing the target object, and determine the face position of the target object in the video frame; 对于包含目标对象的第一视频帧和第二视频帧,当在第一视频帧与第二视频帧之间的至少一个第三视频帧中均未识别到目标对象、且所述第三视频帧的数量小于第一数量阈值时,获取所述第一视频帧中目标对象的第一人脸位置、以及所述第二视频帧中目标对象的第二人脸位置;For the first video frame and the second video frame containing the target object, when the target object is not identified in at least one third video frame between the first video frame and the second video frame, and the third video frame When the quantity is less than the first quantity threshold, obtain the first face position of the target object in the first video frame and the second face position of the target object in the second video frame; 
when the distance between the first face position and the second face position is less than a distance threshold, determining that the third video frames contain the target object, and performing interpolation according to the first face position and the second face position to obtain the face position of the target object in the at least one third video frame, so as to generate the face trajectory of the target object.

5. The method according to claim 3, wherein performing face re-identification on the target video to obtain the face trajectory of the target object comprises: performing face re-identification on multiple video frames of the target video, determining each video frame in which the target object is identified as a video frame containing the target object, and determining the face position of the target object in that video frame; and for a first video frame and a second video frame that contain the target object, when the target object is identified in none of the at least one third video frame between the first video frame and the second video frame, and the number of third video frames reaches a second number threshold, generating at least two face trajectory segments of the target object based on a first face position of the target object in the first video frame and a second face position of the target object in the second video frame.
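Claim 4's interpolation across the undetected "third" frames can be as simple as a linear blend between the two bounding positions; the patent does not specify the interpolation scheme, so the sketch below assumes linear (function and parameter names are illustrative):

```python
def interpolate_positions(pos_a, pos_b, num_gap_frames):
    """Fill in face positions for frames where re-identification
    missed the target, as in claim 4's gap filling.

    pos_a, pos_b:   (x, y) face positions in the bounding frames
    num_gap_frames: number of third video frames between them
    """
    xa, ya = pos_a
    xb, yb = pos_b
    out = []
    for i in range(1, num_gap_frames + 1):
        t = i / (num_gap_frames + 1)  # fraction of the way from a to b
        out.append((xa + t * (xb - xa), ya + t * (yb - ya)))
    return out

# Three undetected frames between positions (100, 100) and (140, 120):
gap = interpolate_positions((100.0, 100.0), (140.0, 120.0), 3)
```

Claim 5's complementary case, where the gap is too long to bridge, would instead close the current trajectory segment and start a new one.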
6. The method according to claim 1, wherein performing three-dimensional reconstruction on the target object based on the first feature point set, to obtain the face pose of the target object and the three-dimensional coordinates of the facial feature points of the target object, comprises: performing face pose estimation based on the first feature point set, to obtain the face pose of the target object; and performing three-dimensional reconstruction on the target object through triangulation based on the face pose of the target object, to determine the three-dimensional coordinates of the facial feature points of the target object.

7. The method according to claim 6, wherein performing face pose estimation based on the first feature point set to obtain the face pose of the target object comprises: obtaining a fundamental matrix based on the positional relationship of each facial feature point of the first feature point set across the multiple video frames; normalizing the fundamental matrix to obtain an essential matrix; and performing singular value decomposition on the essential matrix to obtain the face pose of the target object.
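Claim 7's pipeline (fundamental matrix, then essential matrix, then SVD) ends with the textbook essential-matrix decomposition. A sketch of that last step, with numpy standing in for whatever linear-algebra backend an implementation would use (the function names are illustrative, and the claim does not state how the two rotation candidates are disambiguated):

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x, used here only to build a
    synthetic essential matrix E = [t]_x R for the sanity check."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def pose_from_essential(E):
    """Singular value decomposition of the essential matrix, as in
    claim 7's last step. Returns the two candidate rotations and the
    translation direction (known only up to scale and sign); the
    physically valid combination is the one that places triangulated
    points in front of both cameras."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:   # force proper rotations (det = +1)
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    return U @ W @ Vt, U @ W.T @ Vt, U[:, 2]

# Sanity check: build E from a known rotation about z and a known
# translation, then confirm one candidate matches the true rotation.
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, 0.2, 0.1])
E = skew(t_true) @ R_true
R1, R2, t_dir = pose_from_essential(E)
```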
8. The method according to claim 1, wherein performing three-dimensional reconstruction on the target object based on the first feature point set comprises: obtaining target camera intrinsic parameters of the target video; and performing three-dimensional reconstruction on the target object based on the target camera intrinsic parameters and the first feature point set; and wherein the method further comprises: obtaining camera intrinsic parameters of the target video during the three-dimensional reconstruction of the target object; and when the error between the first feature point set and the second feature point set does not satisfy the error condition, updating the target camera intrinsic parameters based on the obtained camera intrinsic parameters, so as to perform three-dimensional reconstruction on the target object based on the updated target camera intrinsic parameters and the first feature point set.

9. The method according to claim 8, wherein obtaining the camera intrinsic parameters of the target video during the three-dimensional reconstruction of the target object comprises: obtaining a fundamental matrix and an essential matrix during the three-dimensional reconstruction of the target object; and obtaining the camera intrinsic parameters of the target video according to the conversion relationship between the fundamental matrix and the essential matrix.
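The "conversion relationship" in claim 9 is, in the standard pinhole model, E = KᵀFK for an intrinsic matrix K (equivalently F = K⁻ᵀEK⁻¹). The sketch below only verifies the round trip for an illustrative K and F; actually recovering K from this relation is a self-calibration problem the claim does not spell out:

```python
import numpy as np

# Standard relation between the fundamental matrix F and the
# essential matrix E for a camera with intrinsic matrix K:
#   E = K^T F K        (fundamental -> essential)
#   F = K^-T E K^-1    (essential -> fundamental)

K = np.array([[800.0,   0.0, 320.0],    # illustrative intrinsics:
              [  0.0, 800.0, 240.0],    # focal length 800 px,
              [  0.0,   0.0,   1.0]])   # principal point (320, 240)

F = np.array([[ 0.0,  -1e-5,  3e-3],    # illustrative fundamental
              [ 2e-5,  0.0,  -4e-3],    # matrix (values arbitrary)
              [-2e-3,  5e-3,  1.0 ]])

E = K.T @ F @ K                         # claim 9's conversion
K_inv = np.linalg.inv(K)
F_back = K_inv.T @ E @ K_inv            # and back again
```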
10. The method according to claim 1, further comprising: when the error between the first feature point set and the second feature point set does not satisfy the error condition and each of the video frames contains multiple facial feature points, removing the two-dimensional coordinates of at least one facial feature point from the first feature point set, to obtain a third feature point set; performing three-dimensional reconstruction on the target object based on the third feature point set, to obtain three-dimensional coordinates of the facial feature points of the target object; projecting the three-dimensional coordinates obtained by the three-dimensional reconstruction based on the third feature point set onto the camera imaging plane, to obtain a fourth feature point set comprising two-dimensional coordinates of facial feature points; and updating the first feature point set based on the third feature point set and the fourth feature point set.
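Claim 10 prunes feature points when the reprojection error is still too large. One plausible criterion, assumed here for illustration (the patent does not name it), is to drop the point with the largest reprojection error:

```python
import math

def prune_worst_point(detected, reprojected, error_threshold):
    """Sketch of claim 10's pruning step: if the mean reprojection
    error exceeds the threshold, drop the feature point with the
    largest error and return the reduced ("third") point set.

    detected, reprojected: dicts mapping point id -> (x, y)
    """
    errors = {
        pid: math.dist(detected[pid], reprojected[pid])
        for pid in detected
    }
    mean_error = sum(errors.values()) / len(errors)
    if mean_error <= error_threshold:
        return detected, mean_error   # error condition already met
    worst = max(errors, key=errors.get)
    third_set = {pid: p for pid, p in detected.items() if pid != worst}
    return third_set, mean_error

# Point 2 reprojects far from its detection, so it gets removed:
detected = {0: (10.0, 10.0), 1: (20.0, 10.0), 2: (30.0, 10.0)}
reproj   = {0: (10.1, 10.0), 1: (20.0, 10.2), 2: (38.0, 10.0)}
third, err = prune_worst_point(detected, reproj, error_threshold=0.5)
```

The claim then re-reconstructs from the third set and re-projects to form the fourth set before updating the first set.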
11. The method according to claim 1, wherein projecting the three-dimensional coordinates of the facial feature points, to obtain the second feature point set comprising two-dimensional coordinates of the facial feature points, comprises: rotating and translating the three-dimensional coordinates of the facial feature points with respect to the camera imaging plane according to the face pose, to obtain coordinates of the facial feature points in the camera coordinate system; obtaining target camera intrinsic parameters corresponding to the target video; and converting the coordinates in the camera coordinate system into coordinates in the image coordinate system according to the target camera intrinsic parameters, to obtain the second feature point set comprising the two-dimensional coordinates of the facial feature points.
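Claim 11's two-stage projection (pose rotation/translation into camera coordinates, then intrinsics into image coordinates) is the standard pinhole projection. A minimal numpy sketch, assuming a rotation matrix R and translation vector t represent the face pose (the patent does not fix the pose parameterization):

```python
import numpy as np

def project_points(points_3d, R, t, K):
    """Rotate/translate world-space feature points into the camera
    coordinate system with the pose (R, t), then map them to image
    coordinates with the intrinsic matrix K, as in claim 11."""
    pts = np.asarray(points_3d, dtype=float)
    cam = pts @ R.T + t              # world -> camera coordinates
    uvw = cam @ K.T                  # apply camera intrinsics
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide -> pixels

K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                    # identity pose for the sanity check
t = np.array([0.0, 0.0, 4.0])    # face 4 units in front of the camera
uv = project_points([[0.0, 0.0, 0.0], [0.4, 0.0, 0.0]], R, t, K)
```

Here the origin lands on the principal point (320, 240) and the offset point shifts right by f·x/z = 800·0.4/4 = 80 pixels.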
12. The method according to claim 1, further comprising: generating, based on information to be displayed, a picture corresponding to the information to be displayed; obtaining original two-dimensional coordinates of multiple vertices of the picture and three-dimensional coordinates of the multiple vertices of the picture; projecting the three-dimensional coordinates of the multiple vertices of the picture onto the camera imaging plane based on the target face pose, to obtain two-dimensional coordinates of the multiple vertices of the picture; determining a perspective transformation matrix according to the mapping relationship between the original two-dimensional coordinates and the two-dimensional coordinates obtained by the projection; performing an affine transformation on the picture through the perspective transformation matrix, to obtain a transformed target picture; and superimposing the target picture onto a video frame, to obtain a video frame to which the information to be displayed is added.
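With four vertex correspondences, claim 12's perspective transformation matrix is the standard homography solve (what, e.g., OpenCV's `getPerspectiveTransform` computes; whether the patent's implementation uses that routine is an assumption). A numpy sketch that fixes H[2,2] = 1 and solves the resulting 8x8 linear system:

```python
import numpy as np

def perspective_matrix(src, dst):
    """Solve for the 3x3 perspective (homography) matrix H mapping
    four source points to four destination points, as claim 12 does
    between the picture's original vertices and their projections."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.asarray(A, float), np.asarray(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def apply_h(H, pt):
    """Apply H to a 2D point in homogeneous coordinates."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return (x / w, y / w)

# Map a 100x100 information picture onto a skewed quadrilateral
# (as its corners would project under some face pose):
src = [(0, 0), (100, 0), (100, 100), (0, 100)]
dst = [(10, 20), (120, 30), (110, 140), (5, 120)]
H = perspective_matrix(src, dst)
```

Warping every pixel of the picture with H (rather than just the corners) would produce the transformed target picture to superimpose on the frame.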
13. An apparatus for processing a face pose in a video, comprising: a detection module, configured to perform facial feature point detection on multiple video frames of a target video that contain a target object, to obtain a first feature point set comprising two-dimensional coordinates of facial feature points; a reconstruction module, configured to perform three-dimensional reconstruction on the target object based on the first feature point set, to obtain a face pose of the target object and three-dimensional coordinates of the facial feature points of the target object; a projection module, configured to project the three-dimensional coordinates of the facial feature points, to obtain a second feature point set comprising two-dimensional coordinates of the facial feature points; and a comparison module, configured to compare the second feature point set with the first feature point set and, when an error between the first feature point set and the second feature point set satisfies an error condition, take the face pose obtained by the three-dimensional reconstruction as a target face pose of the target object in the target video.

14. A computer device, comprising: a memory configured to store executable instructions; and a processor configured to implement, when executing the executable instructions stored in the memory, the method for processing a face pose in a video according to any one of claims 1 to 12.
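Claim 13's four modules mirror a detect/reconstruct/project/compare loop. A control-flow sketch with placeholder callables standing in for the modules (nothing here is from the patent beyond the module ordering):

```python
def estimate_target_pose(detect, reconstruct, project, compare):
    """Control-flow sketch of the four modules in claim 13:
    detection -> reconstruction -> projection -> comparison.
    Each argument is a stand-in callable for one module."""
    first_set = detect()                      # detection module
    pose, points_3d = reconstruct(first_set)  # reconstruction module
    second_set = project(points_3d, pose)     # projection module
    if compare(first_set, second_set):        # comparison module
        return pose  # error condition met: this is the target pose
    return None      # otherwise claims 8/10 would refine and retry

# Trivial stand-ins that make the loop succeed immediately:
pose = estimate_target_pose(
    detect=lambda: [(1.0, 2.0)],
    reconstruct=lambda pts: ("R|t", [(1.0, 2.0, 3.0)]),
    project=lambda pts3d, p: [(1.0, 2.0)],
    compare=lambda a, b: a == b,
)
```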
15. A computer-readable storage medium storing executable instructions which, when executed by a processor, implement the method for processing a face pose in a video according to any one of claims 1 to 12.
CN202110351849.3A 2021-03-31 2021-03-31 Method, device, equipment and storage medium for processing face posture in video Active CN115147889B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110351849.3A CN115147889B (en) 2021-03-31 2021-03-31 Method, device, equipment and storage medium for processing face posture in video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110351849.3A CN115147889B (en) 2021-03-31 2021-03-31 Method, device, equipment and storage medium for processing face posture in video

Publications (2)

Publication Number Publication Date
CN115147889A true CN115147889A (en) 2022-10-04
CN115147889B CN115147889B (en) 2025-03-28

Family

ID=83405348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110351849.3A Active CN115147889B (en) 2021-03-31 2021-03-31 Method, device, equipment and storage medium for processing face posture in video

Country Status (1)

Country Link
CN (1) CN115147889B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157372A (en) * 2016-07-25 2016-11-23 深圳市唯特视科技有限公司 A kind of 3D face grid reconstruction method based on video image
CN106934827A (en) * 2015-12-31 2017-07-07 杭州华为数字技术有限公司 The method for reconstructing and device of three-dimensional scenic
US20170316598A1 (en) * 2015-05-22 2017-11-02 Tencent Technology (Shenzhen) Company Limited 3d human face reconstruction method, apparatus and server
CN107679497A (en) * 2017-10-11 2018-02-09 齐鲁工业大学 Video face textures effect processing method and generation system
CN109583373A (en) * 2018-11-29 2019-04-05 成都索贝数码科技股份有限公司 A kind of pedestrian identifies implementation method again
CN110097035A (en) * 2019-05-15 2019-08-06 成都电科智达科技有限公司 A kind of facial feature points detection method based on 3D human face rebuilding
CN110609920A (en) * 2019-08-05 2019-12-24 华中科技大学 Method and system for mixed pedestrian search in video surveillance scene
CN110660102A (en) * 2019-06-17 2020-01-07 腾讯科技(深圳)有限公司 Speaker recognition method, device and system based on artificial intelligence
CN111815768A (en) * 2020-09-14 2020-10-23 腾讯科技(深圳)有限公司 Three-dimensional face reconstruction method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HYUN SOO PARK et al.: "3D Trajectory Reconstruction under Perspective Projection", International Journal of Computer Vision, 18 February 2015 (2015-02-18), pages 115-, XP035552466, DOI: 10.1007/s11263-015-0804-2 *
CAI LIN et al.: "High-Precision 3D Face Reconstruction Based on Multiple Views" (基于多视角的高精度三维人脸重建), Journal of Computer-Aided Design & Computer Graphics (计算机辅助设计与图形学学报), 17 December 2019 (2019-12-17), pages 305-314 *

Also Published As

Publication number Publication date
CN115147889B (en) 2025-03-28

Similar Documents

Publication Publication Date Title
EP3992919B1 (en) Three-dimensional facial model generation method and apparatus, device, and medium
CN108229284B (en) Sight tracking and training method and device, system, electronic equipment and storage medium
US11176355B2 (en) Facial image processing method and apparatus, electronic device and computer readable storage medium
CN108958473A (en) Eyeball tracking method, electronic device and non-transitory computer readable recording medium
US12200273B2 (en) Live streaming picture processing method and apparatus based on video chat live streaming, and electronic device
CN113709544B (en) Video playing method, device, equipment and computer readable storage medium
WO2021004257A1 (en) Line-of-sight detection method and apparatus, video processing method and apparatus, and device and storage medium
US8854376B1 (en) Generating animation from actor performance
WO2021134178A1 (en) Video stream processing method, apparatus and device, and medium
CN107656619A (en) A kind of intelligent projecting method, system and intelligent terminal
JP7064257B2 (en) Image depth determination method and creature recognition method, circuit, device, storage medium
US20230342973A1 (en) Image processing method and apparatus, device, storage medium, and computer program product
US11138743B2 (en) Method and apparatus for a synchronous motion of a human body model
CN110418185B (en) Positioning method and system for anchor point in augmented reality video picture
US20240062495A1 (en) Deformable neural radiance field for editing facial pose and facial expression in neural 3d scenes
US11908236B2 (en) Illumination detection method and apparatus for face image, and device and storage medium
US20220207917A1 (en) Facial expression image processing method and apparatus, and electronic device
CN112017212A (en) Training and tracking method and system of face key point tracking model
CN114358112A (en) Video fusion method, computer program product, client and storage medium
WO2025002194A1 (en) Scene reconstruction method and apparatus, and storage medium and electronic device
WO2024104144A1 (en) Image synthesis method and apparatus, storage medium, and electrical device
CN110069996A (en) Headwork recognition methods, device and electronic equipment
CN115008454B (en) Robot online hand-eye calibration method based on multi-frame pseudo tag data enhancement
CN114373044A (en) Method, device, computing equipment and storage medium for generating three-dimensional face model
CN111507143A (en) Expression image effect generation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40075278

Country of ref document: HK

GR01 Patent grant