CN110515463B - A 3D model embedding method based on monocular vision in gesture interactive video scene - Google Patents

Info

Publication number
CN110515463B
CN110515463B
Authority
CN
China
Prior art keywords
gesture
model
depth
mask
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910805546.7A
Other languages
Chinese (zh)
Other versions
CN110515463A (en)
Inventor
胡斌
程啸
钱程
张静怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Fanzai Geographic Information Industry Research Institute Co ltd
Nanjing Normal University
Original Assignee
Nanjing Fanzai Geographic Information Industry Research Institute Co ltd
Nanjing Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Fanzai Geographic Information Industry Research Institute Co ltd, Nanjing Normal University filed Critical Nanjing Fanzai Geographic Information Industry Research Institute Co ltd
Priority to CN201910805546.7A priority Critical patent/CN110515463B/en
Publication of CN110515463A publication Critical patent/CN110515463A/en
Application granted granted Critical
Publication of CN110515463B publication Critical patent/CN110515463B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/005 General purpose rendering architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T 7/33 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 7/55 Depth or shape recovery from multiple images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2203/00 Indexing scheme relating to G06F3/00 - G06F3/048
    • G06F 2203/01 Indexing scheme relating to G06F3/01
    • G06F 2203/012 Walk-in-place systems for allowing a user to walk in a virtual environment while constraining him to a given position in the physical environment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computer Graphics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a 3D model embedding method based on monocular vision for gesture-interactive video scenes, belonging to the field of computer graphics. The method comprises the steps of monocular scene depth reconstruction, fine gesture extraction, 3D model rendering, occlusion judgment, and gesture redrawing. By combining monocular depth recovery, gesture extraction, 3D model rendering, occlusion judgment and gesture redrawing, the method handles the occlusion relationship between the gesture and the model effectively and achieves a seamless embedding effect. It resolves the occlusion problems that arise during gesture interaction without relying on expensive depth-sensing equipment, meeting the needs of practical applications.

Description

A 3D model embedding method based on monocular vision in a gesture-interactive video scene

Technical Field

The invention belongs to the field of computer graphics and relates to a 3D model embedding method based on monocular vision in a gesture-interactive video scene.

Background Art

Augmented reality (AR) blends the virtual with the real and offers strong expressiveness and good interactivity. By embedding virtual information into images of a real scene, augmented reality presents the user with an environment whose sensory effect is realistic. As a natural mode of human-computer interaction, gesture interaction is used more and more widely in augmented reality and virtual reality.

Occlusion consistency is a basic principle that an immersive augmented reality environment must satisfy: the elements of the augmented reality system must maintain correct occlusion relationships with one another. A video image is the projection of a three-dimensional scene onto a two-dimensional plane, and recovering the spatial occlusion relationship between the video scene and the embedded model is a prerequisite for occlusion consistency. Most current augmented reality applications, however, ignore occlusion handling and draw the video merely as the background of the virtual model. In gesture-interactive augmented reality applications, gesture interaction is highly dynamic and the occlusion relationship between the gesture and the embedded model is not fixed, so drawing the video first and the embedded model afterwards easily produces visual confusion. Although the depth of the video scene can be obtained with a depth sensor, or by stereo vision with two cameras, and the occlusion relationships of the scene reconstructed, such approaches suffer from problems of equipment cost or portability.

Summary of the Invention

The present invention proposes a 3D model embedding method based on monocular vision in a gesture-interactive video scene. It combines monocular depth recovery, precise gesture extraction and 3D model embedding, handles the occlusion relationship between the gesture and the model effectively, and achieves a seamless embedding effect.

The technical solution of the present invention is a 3D model embedding method based on monocular vision in a gesture-interactive video scene, comprising the following steps:

Step 1: Monocular scene depth reconstruction:

Obtain scene depth maps with a depth sensor or by binocular vision, then register them and transform them into the camera coordinate system to form the training set for the monocular depth recovery model. Use this training set to perform transfer learning on a monocular depth recovery network, obtaining a monocular depth recovery model suited to the scene. The scene depth map of the current frame is then obtained directly from the monocular depth recovery model.

Step 2: Fine gesture extraction:

2.1) Gesture detection and localization: perform gesture detection and localization on the current frame to obtain the gesture bounding box.

2.2) Rough gesture mask generation: extract the foreground with a Gaussian mixture model, then initialize an RGBA image with the same resolution as the current frame as the gesture mask. Set the RGB value of every pixel in the mask to the RGB value of the corresponding pixel of the current frame, set the Alpha value of pixels lying both in the foreground region and inside the gesture bounding box to 1, and set the Alpha value of all other pixels to 0, producing a rough gesture mask.

2.3) Fine gesture mask generation: use a seed-fill algorithm to detect the connected components of pixels with Alpha value 0 inside the gesture bounding box of the rough mask, and count the pixels of each component. If the pixel count is below a threshold, the component is regarded as noise and the Alpha value of all its pixels is set to 1. This removes small-area noise and yields the fine gesture mask.

Step 3: 3D model rendering:

3.1) Using a marker-template-based camera tracking method, extract and match features between the current video frame and the marker image to obtain the camera pose for the current frame, i.e. the rotation matrix and translation matrix from the current camera coordinate system to the three-dimensional world coordinate system.

3.2) First draw the current frame as the window background; then, using the camera intrinsics and the rotation and translation matrices obtained in 3.1), transform the 3D model into screen space and render it onto the window background, while obtaining and saving the depth map as the model depth map.

Step 4: Occlusion judgment and gesture redrawing:

4.1) Occlusion judgment: sample the model depth map and the scene depth map at the coordinates of pixels whose Alpha value is 1 in the gesture mask to obtain the model depth and the gesture depth respectively. If the gesture depth is less than the model depth, the gesture is considered to occlude the virtual 3D model at that position and the Alpha value of the corresponding pixel in the gesture mask is kept at 1; otherwise the Alpha value is set to 0.

4.2) Gesture redrawing: for pixels whose Alpha value is 1, overwrite the window background value with their RGB values; for pixels whose Alpha value is 0, keep the window background value.

The beneficial effect of the present invention is that the proposed 3D model embedding method based on monocular vision in a gesture-interactive video scene effectively handles the occlusion produced during gesture interaction without relying on expensive depth-sensing equipment, and thus meets the needs of practical applications.

Brief Description of the Drawings

Fig. 1 is the overall technical roadmap of the present invention;

Fig. 2 is a schematic diagram of monocular scene depth reconstruction according to the present invention;

Fig. 3 is a schematic diagram of fine gesture extraction according to the present invention;

Fig. 4 is a flow chart of occlusion judgment and gesture redrawing according to the present invention;

Fig. 5 is a schematic diagram of the 3D model embedding effect of the present invention.

Detailed Description of the Embodiments

The present invention is described further below with reference to the accompanying drawings and specific embodiments.

The present invention is a 3D model embedding method based on monocular vision in a gesture-interactive video scene, comprising the following steps:

Step 1: Monocular scene depth reconstruction:

Obtain scene depth maps with a depth sensor or by binocular vision, then register them and transform them into the camera coordinate system to form the training set for the monocular depth recovery model. Use this training set to perform transfer learning on a monocular depth recovery network, obtaining a monocular depth recovery model suited to the scene. The scene depth map of the current frame is then obtained directly from the monocular depth recovery model.

Step 2: Fine gesture extraction:

2.1) Gesture detection and localization: perform gesture detection and localization on the current frame to obtain the gesture bounding box.

2.2) Rough gesture mask generation: extract the foreground with a Gaussian mixture model, then initialize an RGBA image with the same resolution as the current frame as the gesture mask. Set the RGB value of every pixel in the mask to the RGB value of the corresponding pixel of the current frame, set the Alpha value of pixels lying both in the foreground region and inside the gesture bounding box to 1, and set the Alpha value of all other pixels to 0, producing a rough gesture mask.

2.3) Fine gesture mask generation: use a seed-fill algorithm to detect the connected components of pixels with Alpha value 0 inside the gesture bounding box of the rough mask, and count the pixels of each component. If the pixel count is below a threshold, the component is regarded as noise and the Alpha value of all its pixels is set to 1. This removes small-area noise and yields the fine gesture mask.

Step 3: 3D model rendering:

3.1) Using a marker-template-based camera tracking method, extract and match features between the current video frame and the marker image to obtain the camera pose for the current frame, i.e. the rotation matrix and translation matrix from the current camera coordinate system to the three-dimensional world coordinate system.

3.2) First draw the current frame as the window background; then, using the camera intrinsics and the rotation and translation matrices obtained in 3.1), transform the 3D model into screen space and render it onto the window background, while obtaining and saving the depth map as the model depth map.

Step 4: Occlusion judgment and gesture redrawing:

4.1) Occlusion judgment: sample the model depth map and the scene depth map at the coordinates of pixels whose Alpha value is 1 in the gesture mask to obtain the model depth and the gesture depth respectively. If the gesture depth is less than the model depth, the gesture is considered to occlude the virtual 3D model at that position and the Alpha value of the corresponding pixel in the gesture mask is kept at 1; otherwise the Alpha value is set to 0.

4.2) Gesture redrawing: for pixels whose Alpha value is 1, overwrite the window background value with their RGB values; for pixels whose Alpha value is 0, keep the window background value.

Specifically, the method includes the following steps:

Step 1: Monocular scene depth reconstruction:

1.1) Use a depth sensor such as Kinect, or binocular vision, to acquire enough samples to form a training set;

1.2) Use a deep learning model such as FCRN and perform transfer learning with the above training set, obtaining a monocular depth recovery model that meets the needs of the gesture interaction scene;

1.3) Recover the scene depth of the current frame with the trained model.
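
As an illustration of step 1.3), the sketch below runs a fine-tuned monocular depth network on one frame. The checkpoint name fcrn_finetuned_on_scene.pth, the use of PyTorch, and the normalization are assumptions; the patent only requires that a trained model map an RGB frame to a dense scene depth map.

    import cv2
    import numpy as np
    import torch

    # Hypothetical checkpoint produced by the transfer-learning step 1.2).
    model = torch.load("fcrn_finetuned_on_scene.pth", map_location="cpu")
    model.eval()

    def estimate_scene_depth(frame_bgr):
        """Return a dense depth map (H x W) for one video frame."""
        rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
        tensor = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)   # 1 x 3 x H x W
        with torch.no_grad():
            depth = model(tensor).squeeze().numpy()                    # h x w
        # Resize the network output back to the frame resolution for per-pixel lookup.
        return cv2.resize(depth, (frame_bgr.shape[1], frame_bgr.shape[0]))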

Step 2: Fine gesture extraction:

2.1) Gesture detection and localization: use a deep learning model such as YOLO to detect and localize the gesture in the current frame, obtaining the gesture category c and the gesture bounding box R.
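
A minimal detection sketch for step 2.1) follows. The ultralytics YOLO binding and the weights file gesture_yolo.pt are assumptions; any detector that returns a category c and a box R would serve equally well.

    from ultralytics import YOLO

    detector = YOLO("gesture_yolo.pt")   # hypothetical gesture-trained weights

    def detect_gesture(frame_bgr):
        """Return (category c, box R = (x0, y0, x1, y1)) for the top detection, or None."""
        result = detector(frame_bgr, verbose=False)[0]
        if len(result.boxes) == 0:
            return None
        box = result.boxes[0]
        x0, y0, x1, y1 = map(int, box.xyxy[0].tolist())
        return int(box.cls[0]), (x0, y0, x1, y1)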

2.2) Rough gesture mask generation: extract the foreground with a Gaussian mixture model, then initialize an RGBA image with the same resolution as the current frame as the gesture mask. Set the RGB value of every pixel in the mask to the RGB value of the corresponding pixel of the current frame, set the Alpha value of pixels lying both in the foreground region and inside the gesture bounding box to 1, and set the Alpha value of all other pixels to 0, producing the rough gesture mask TD.
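
The rough-mask step can be sketched as below, with OpenCV's MOG2 background subtractor standing in for the Gaussian mixture model; the history length and the 127 threshold are illustrative assumptions.

    import cv2
    import numpy as np

    bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)

    def rough_gesture_mask(frame_bgr, box):
        """RGBA mask TD: RGB copied from the frame, Alpha = 1 inside foreground AND box."""
        h, w = frame_bgr.shape[:2]
        fg = bg_subtractor.apply(frame_bgr)                    # 0/255 foreground map
        mask = np.zeros((h, w, 4), dtype=np.uint8)
        mask[..., :3] = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        x0, y0, x1, y1 = box
        inside = np.zeros((h, w), dtype=bool)
        inside[y0:y1, x0:x1] = True
        mask[..., 3] = (inside & (fg > 127)).astype(np.uint8)  # Alpha 1 only in foreground within box
        return mask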

2.3) Fine gesture mask generation: use a seed-fill algorithm to detect the connected components of pixels with Alpha value 0 inside the gesture bounding box of the rough mask TD, and count the pixels of each component. If the pixel count is below a threshold, the component is regarded as noise and the Alpha value of all its pixels is set to 1. This removes small-area noise and yields the denoised, fine gesture mask.
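
For step 2.3), connected-component statistics give the same effect as the seed-fill described above; the area threshold min_area is an assumption.

    import cv2
    import numpy as np

    def refine_gesture_mask(mask_rgba, box, min_area=100):
        """Flip small Alpha = 0 holes inside the gesture box back to Alpha = 1."""
        x0, y0, x1, y1 = box
        alpha_roi = mask_rgba[y0:y1, x0:x1, 3]
        holes = (alpha_roi == 0).astype(np.uint8)
        n, labels, stats, _ = cv2.connectedComponentsWithStats(holes, connectivity=4)
        for i in range(1, n):                                  # label 0 is the non-hole region
            if stats[i, cv2.CC_STAT_AREA] < min_area:
                alpha_roi[labels == i] = 1                     # small hole is noise -> gesture
        mask_rgba[y0:y1, x0:x1, 3] = alpha_roi
        return mask_rgba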

Step 3: 3D model rendering:

3.1) Match the video frame captured by the current camera against the marker template preset in the system; the marker template may be a natural marker or an artificial marker. The matching uses SIFT or other feature points and descriptors, and an accurate matching point set is obtained after mismatched points are filtered out by forward-and-backward brute-force matching;

3.2) Compute the homography matrix H from the two matched point sets obtained in the first step. Once the homography between the template image and the current frame is known, transform the four corner points (x, y) of the template image by this homography to obtain their corresponding positions (x', y') in the current frame;
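
Steps 3.1) and 3.2) can be sketched together with OpenCV: SIFT features, a cross-checked brute-force matcher as the forward/backward filter, then a RANSAC homography that carries the template corners into the current frame. The RANSAC reprojection threshold of 5.0 pixels is an assumption.

    import cv2
    import numpy as np

    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)      # forward/backward consistency

    def locate_marker(template_gray, frame_gray):
        """Return (H, frame_corners) or None if the marker cannot be matched."""
        kt, dt = sift.detectAndCompute(template_gray, None)
        kf, df = sift.detectAndCompute(frame_gray, None)
        matches = matcher.match(dt, df)
        if len(matches) < 4:
            return None
        src = np.float32([kt[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        dst = np.float32([kf[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        h, w = template_gray.shape
        corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
        return H, cv2.perspectiveTransform(corners, H)          # corner positions (x', y')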

3.3) Take the three-dimensional world coordinates of the four template corners to be the template image coordinates with a z coordinate of 0, i.e. (x, y, 0). These three-dimensional coordinates (x, y, 0), together with the coordinates (x', y') of the corresponding four corners in the current frame, form a typical PnP problem; solving it gives the camera pose in the current state, i.e. the rotation matrix R and translation matrix T between the current camera coordinate system and the three-dimensional world coordinate system;
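
A sketch of the PnP step 3.3): the four template corners with z = 0, paired with their matched image positions, are passed to cv2.solvePnP; K and the distortion coefficients come from the checkerboard calibration mentioned in 3.4).

    import cv2
    import numpy as np

    def estimate_pose(frame_corners, template_size, K, dist=None):
        """Return (R, T): rotation matrix and translation vector from world to camera."""
        w, h = template_size
        object_pts = np.float32([[0, 0, 0], [w, 0, 0], [w, h, 0], [0, h, 0]])   # (x, y, 0)
        image_pts = frame_corners.reshape(-1, 2).astype(np.float32)             # (x', y')
        ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, dist)
        if not ok:
            return None
        R, _ = cv2.Rodrigues(rvec)       # rotation vector -> 3 x 3 rotation matrix R
        return R, tvec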

3.4) Obtain the camera intrinsic matrix K in advance by checkerboard calibration. Use K, R and T to transform between coordinate systems, transform the three-dimensional coordinates of the virtual three-dimensional object into the screen coordinate system, and display the current frame as the background for overlay, while using a three-dimensional graphics engine, for example the glReadPixels function in OpenGL, to obtain and save the depth map as the model depth map.
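
The model depth map of step 3.4) only requires reading back the depth buffer after the model has been rendered; a PyOpenGL sketch follows. The binding and the final reshape are assumptions (some bindings already return a shaped array).

    import numpy as np
    from OpenGL.GL import GL_DEPTH_COMPONENT, GL_FLOAT, glReadPixels

    def read_model_depth(width, height):
        """Return the current depth buffer as an (H x W) float32 array, top row first."""
        raw = glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT)
        depth = np.asarray(raw, dtype=np.float32).reshape(height, width)
        return np.flipud(depth)          # OpenGL stores the bottom row first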

Step 4: Occlusion judgment and gesture redrawing:

4.1) Sample the scene depth map at the coordinates of pixels whose Alpha value is 1 in the gesture mask to obtain the gesture depth D_A;

4.2) Sample the model depth map at the coordinates of pixels whose Alpha value is 1 in the gesture mask to obtain the model depth D_3D;

4.3) For each such pixel, if D_A < D_3D the gesture is considered to occlude the virtual 3D model, and the Alpha value of the corresponding pixel in the gesture mask is kept at 1; otherwise the Alpha value is set to 0;
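
Steps 4.1) to 4.3) amount to one vectorized comparison per masked pixel; a sketch follows, assuming the scene depth map and the model depth map have been brought to the same resolution and depth convention.

    import numpy as np

    def apply_occlusion_test(mask_rgba, scene_depth, model_depth):
        """Keep Alpha = 1 only where the gesture depth D_A is in front of the model depth D_3D."""
        candidate = mask_rgba[..., 3] == 1              # pixels to test
        d_a = scene_depth                               # gesture depth, sampled per pixel
        d_3d = model_depth                              # rendered model depth
        occludes = candidate & (d_a < d_3d)
        mask_rgba[..., 3] = occludes.astype(np.uint8)   # 1 where the gesture hides the model
        return mask_rgba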

4.4) For pixels whose Alpha value is 1 in the gesture mask, overwrite the window background value with their RGB values; for pixels whose Alpha value is 0, keep the window background value. Alternatively, a three-dimensional rendering engine can be used, for example the glDrawPixels function in OpenGL, to draw the gesture mask directly onto the screen as a transparent overlay; when this function draws, pixels whose Alpha value is 0 in the gesture mask keep the background color.
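
The software alternative to glDrawPixels mentioned above reduces to copying the masked RGB values over the already composited window image; a minimal sketch under that assumption:

    import numpy as np

    def redraw_gesture(window_rgb, mask_rgba):
        """Overwrite the window background with the gesture pixels that occlude the model."""
        out = window_rgb.copy()
        front = mask_rgba[..., 3] == 1                 # the part that must be redrawn
        out[front] = mask_rgba[..., :3][front]
        return out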

The invention discloses a 3D model embedding method based on monocular vision in a gesture-interactive video scene: occlusion between the gesture and the model is judged through monocular scene depth reconstruction, and the 3D model is then embedded seamlessly. The overall technical route is shown in Figure 1.

Step 1, monocular scene depth reconstruction:

Reconstruct the scene depth of the current frame to generate the scene depth map, as shown in Figure 2; this provides the basis for judging the occlusion relationship;

Step 2, fine gesture extraction:

First detect and localize the gesture in the current frame to obtain the position and extent of the gesture bounding box, then extract the foreground and take the intersection of the foreground and the bounding box as the gesture region, and finally denoise it to obtain the fine gesture mask, as shown in Figure 3;

Step 3, 3D model rendering:

First draw the video as the background, then render the 3D model onto the background, and finally redraw the part of the video where the gesture occludes the 3D model;

Step 4, occlusion judgment and gesture redrawing:

For the fine gesture mask extracted in step 2, judge occlusion using the scene depth map produced in step 1 and the model depth map obtained in step 3; the detailed flow is shown in Figure 4. Then redraw the occluding gesture, producing the seamless 3D model embedding effect shown in Figure 5.

The 3D model embedding method based on monocular vision in a gesture-interactive video scene proposed by the present invention effectively handles the occlusion produced during gesture interaction without relying on expensive depth-sensing equipment, and meets the needs of practical applications.

Claims (1)

1. A 3D model embedding method based on monocular vision in a gesture-interactive video scene, characterized in that it comprises the following steps:

Step 1: monocular scene depth reconstruction: obtain scene depth maps with a depth sensor or by binocular vision, register them and transform them into the camera coordinate system to form the training set of the monocular depth recovery model; use this training set to perform transfer learning on a monocular depth recovery network to obtain a monocular depth recovery model suited to the scene; obtain the scene depth map of the current frame directly with the monocular depth recovery model; wherein obtaining the scene depth map means recovering the scene depth directly from the image captured by a single camera;

Step 2: fine gesture extraction:
2.1) gesture detection and localization: perform gesture detection and localization on the current frame to obtain the gesture bounding box;
2.2) rough gesture mask generation: extract the foreground with a Gaussian mixture model, then initialize an RGBA image with the same resolution as the current frame as the gesture mask; set the RGB value of every pixel in the mask to the RGB value of the corresponding pixel of the current frame, set the Alpha value of pixels lying both in the foreground region and inside the gesture bounding box to 1, and set the Alpha value of all other pixels to 0, producing a rough gesture mask;
2.3) fine gesture mask generation: use a seed-fill algorithm to detect the connected components of pixels with Alpha value 0 inside the gesture bounding box of the rough mask, and count the pixels of each component; if the pixel count is below a threshold, the component is regarded as noise and the Alpha value of all its pixels is set to 1; this removes small-area noise and yields the fine gesture mask;

Step 3: 3D model rendering:
3.1) using a marker-template-based camera tracking method, extract and match features between the current video frame and the marker image to obtain the camera pose for the current frame, i.e. the rotation matrix and translation matrix from the current camera coordinate system to the three-dimensional world coordinate system;
3.2) first draw the current frame as the window background; then, using the camera intrinsics and the rotation and translation matrices obtained in 3.1), transform the 3D model into screen space and render it onto the window background, while obtaining and saving the depth map as the model depth map;
wherein the model depth map is obtained by reading the depth buffer of the three-dimensional rendering engine;
embedding the 3D model into the scene comprises two video drawing passes and one 3D model rendering pass, drawn in the following order: first the video is drawn as the background, then the 3D model is rendered onto the background, and finally the part of the video where the gesture occludes the 3D model is redrawn;

Step 4: occlusion judgment and gesture redrawing:
4.1) occlusion judgment: sample the model depth map and the scene depth map at the coordinates of pixels whose Alpha value is 1 in the gesture mask to obtain the model depth and the gesture depth respectively; if the gesture depth is less than the model depth, the gesture is considered to occlude the virtual 3D model at that position and the Alpha value of the corresponding pixel in the gesture mask is kept at 1; otherwise the Alpha value is set to 0;
4.2) gesture redrawing: for pixels whose Alpha value is 1 in the gesture mask, overwrite the window background value with their RGB values; for pixels whose Alpha value is 0, keep the window background value;
wherein, in the gesture mask used, all pixels whose Alpha value is 1 constitute the portion that must be redrawn because the gesture occludes the 3D model.
CN201910805546.7A 2019-08-29 2019-08-29 A 3D model embedding method based on monocular vision in gesture interactive video scene Active CN110515463B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910805546.7A CN110515463B (en) 2019-08-29 2019-08-29 A 3D model embedding method based on monocular vision in gesture interactive video scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910805546.7A CN110515463B (en) 2019-08-29 2019-08-29 A 3D model embedding method based on monocular vision in gesture interactive video scene

Publications (2)

Publication Number Publication Date
CN110515463A CN110515463A (en) 2019-11-29
CN110515463B true CN110515463B (en) 2023-02-28

Family

ID=68628776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910805546.7A Active CN110515463B (en) 2019-08-29 2019-08-29 A 3D model embedding method based on monocular vision in gesture interactive video scene

Country Status (1)

Country Link
CN (1) CN110515463B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160308B (en) * 2021-04-08 2025-04-08 北京鼎联网络科技有限公司 Image processing method and device, electronic equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106056663A (en) * 2016-05-19 2016-10-26 京东方科技集团股份有限公司 Rendering method for enhancing reality scene, processing module and reality enhancement glasses
CN106803286A (en) * 2017-01-17 2017-06-06 湖南优象科技有限公司 Mutual occlusion real-time processing method based on multi-view image
CN107357427A (en) * 2017-07-03 2017-11-17 南京江南博睿高新技术研究院有限公司 A kind of gesture identification control method for virtual reality device
CN108932734A (en) * 2018-05-23 2018-12-04 浙江商汤科技开发有限公司 Depth recovery method and device, the computer equipment of monocular image
CN110072046A (en) * 2018-08-24 2019-07-30 北京微播视界科技有限公司 Image composition method and device

Also Published As

Publication number Publication date
CN110515463A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
CN109003325B (en) Three-dimensional reconstruction method, medium, device and computing equipment
CN110650368B (en) Video processing method, apparatus and electronic device
CN109685913B (en) Augmented Reality Realization Method Based on Computer Vision Positioning
CN107610041B (en) Video portrait matting method and system based on 3D somatosensory camera
Zollmann et al. Image-based ghostings for single layer occlusions in augmented reality
CN109660783B (en) Virtual reality parallax correction
JP7162750B2 (en) Image processing device, image processing method, and program
EP3533218B1 (en) Simulating depth of field
WO2016029939A1 (en) Method and system for determining at least one image feature in at least one image
Mori et al. Efficient use of textured 3D model for pre-observation-based diminished reality
US11620730B2 (en) Method for merging multiple images and post-processing of panorama
WO2018040982A1 (en) Real time image superposition method and device for enhancing reality
CN101702233A (en) Three-dimensional positioning method based on three-point collinear markers in video frames
Sharma et al. A flexible architecture for multi-view 3DTV based on uncalibrated cameras
CN105869115B (en) A depth image super-resolution method based on kinect2.0
CN112613123A (en) AR three-dimensional registration method and device for aircraft pipeline
CN106780757B (en) A method of augmented reality
CN112365516B (en) A method for virtual and real occlusion processing in augmented reality
CN110515463B (en) A 3D model embedding method based on monocular vision in gesture interactive video scene
Battisti et al. Seamless bare-hand interaction in mixed reality
CN114581297A (en) Image processing method and device for panoramic image
CN115797602A (en) Method and device for adding AR explanation based on object positioning
Xiao et al. 3d object transfer between non-overlapping videos
Yuan et al. 18.2: Depth sensing and augmented reality technologies for mobile 3D platforms
CN118657936B (en) Target detection method, target detection device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant