Article

Integrating Depth-Based and Deep Learning Techniques for Real-Time Video Matting without Green Screens

Department of Computer Science & Information Engineering, National Dong Hwa University, Hualien 974301, Taiwan
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3182; https://doi.org/10.3390/electronics13163182
Submission received: 11 July 2024 / Revised: 5 August 2024 / Accepted: 9 August 2024 / Published: 12 August 2024
(This article belongs to the Special Issue Applications of Artificial Intelligence in Computer Vision)

Abstract

Virtual production, a filmmaking technique that seamlessly merges virtual and real cinematography, has revolutionized the film and television industry. However, traditional virtual production requires the setup of green screens, which can be both costly and cumbersome. We have developed a green screen-free virtual production system that incorporates a 3D tracker for camera tracking, enabling the compositing of virtual and real-world images from a moving camera with varying perspectives. To address the core issue of video matting in virtual production, we introduce a novel Boundary-Selective Fusion (BSF) technique that combines the alpha mattes generated by deep learning-based and depth-based approaches, leveraging their complementary strengths. Experimental results demonstrate that this combined alpha matte is more accurate and robust than those produced by either method alone. Overall, the proposed BSF technique is competitive with state-of-the-art video matting methods, particularly in scenarios involving humans holding objects or other complex settings. The proposed system enables real-time previewing of composite footage during filmmaking, reducing the costs associated with green screen setups and simplifying the compositing process of virtual and real images.

1. Introduction

Virtual Production (VP) is a film and television production approach that utilizes computer-generated content, allowing real-time visualization and control of the digital environment [1]. It integrates technologies such as green screens, motion capture, real-time visualization, and LED walls, offering numerous advantages including real-time compositing, enhanced actor immersion, reduced post-production costs, and greater flexibility in set design.
Video matting plays a crucial role in virtual production by accurately identifying foreground objects in an image. In film and television production, green screens (or blue screens) are commonly used for video matting. However, setting up a green screen can be costly and challenging due to lighting issues, wrinkles, and shadows.
This paper introduces a multi-modal video matting method for virtual production in non-green-screen environments. We explore two green screen-free background matting methods: the deep learning-based approach and the depth-based approach. The deep learning approach can automatically learn high-level features and reduce reliance on low-level features such as color and brightness. On the other hand, the depth-based approach provides valuable depth information (z-axis) that can be utilized in complex backgrounds for video matting. We analyze the advantages and disadvantages of both video matting methods and propose a novel Boundary-Selective Fusion (BSF) technique to combine the alpha mattes generated by the deep learning-based method and the depth-based method. This fusion technique enhances both the stability and accuracy of video matting without green screens.
The system concept map of the proposed virtual production system with video matting is presented in Figure 1. The contributions of this paper are twofold. First, we develop a virtual production system that incorporates a 3D tracker for camera tracking, enabling the compositing of virtual and real-world images from a moving camera with varying perspectives. The technical details of the green screen-free system are provided to eliminate the setup costs and challenges associated with traditional green screens. Second, we address the core issue of video matting within the virtual production system by proposing a novel Boundary-Selective Fusion (BSF) technique. This method combines alpha mattes generated by deep learning-based and depth-based methods, leveraging the strengths of both approaches to improve accuracy and robustness.

2. Related Work

2.1. Virtual Production

Virtual production encompasses a range of computer-aided production and visualization techniques that enable the integration of virtual and real-world elements in the film and television industry. In virtual production, creators can manipulate time, location, and lighting within a studio environment, capture scenes that do not exist in the real world, and preview post-production effects early in the production process. Unlike the irreversible nature of traditional filming, virtual production offers greater flexibility and control over the shooting process.
De Gaspari et al. [2] propose a virtual studio system that utilizes chroma keying for background matting, optimizes the results using a blurring filter, and employs the Kinect depth camera for collision detection to determine the final output. This system can also be integrated with an external RGB camera to enhance image quality. De Goussencourt et al. [3] employ the Unity game engine for real-time previewing, calibrating an RGB camera with an RGB-D camera. Depth information is then projected onto the real-world scene captured by the RGB camera to address occlusion issues. Nakatani et al. [4] utilize the deep learning-based video matting method, called DeepKeying, to generate alpha mattes and employ a depth camera for occlusion determination. Aguilar et al. [5] combine RGB and chroma key cameras, introducing a layer rendering feature for occlusion handling. The final image is generated based on the manually defined layer display order, and remote monitoring is supported. Chiu [6] proposes a virtual production system with camera tracking capabilities, enabling the mapping of real camera movements to the virtual world.
In addition, virtual production relies on video deraining to remove unwanted artifacts and super-resolution to enhance video quality. Jiang et al. [7] propose a progressive coupled network (PCNet) to effectively separate rain streaks while preserving rain-free details. In subsequent work, Jiang et al. [8] decompose rain streaks into multiple rain layers, estimating each layer individually throughout the network stages. Xiao et al. [9] devise a Residual Token Selective Group (RTSG) to dynamically select top-k keys based on score ranking for each query. They develop a Multi-scale Feed-forward Layer (MFL) and a Global Context Attention (GCA), forming their final Top-k Token Selective Transformer (TTST) for progressive representation. Additionally, Jiang et al. [10] present a Dual-Path Deep Fusion Network (DPDFN) for face image super-resolution without requiring additional face priors. These techniques collectively ensure that the final visual output meets the high standards required for immersive and seamless digital environments in contemporary virtual production.

2.2. Deep Learning-Based Video Matting

Video matting is the process of estimating the foreground objects in an image or video by determining the opacity (ranging from 0 to 1) of each pixel. This process produces an accurate alpha matte, allowing for the separation of the foreground from the background.
Forte et al. propose FBA [11], which uses a trimap as an auxiliary input to simultaneously predict the foreground, background, and alpha matte. Lin et al. propose BMV2 [12], which uses a background image as an auxiliary input and utilizes a real-time matting method for high-resolution images. Ke et al. propose MODNet [13], which is a lightweight matting objective decomposition network that does not require any auxiliary input. It introduces an efficient Atrous Spatial Pyramid Pooling (e-ASPP) and a Self-supervised Objective Consistency (SOC) strategy for matting. Lin et al. propose RVM [14], which does not require auxiliary input and can achieve real-time high-resolution image matting. They rely on a recurrent architecture to utilize the temporal information in the video, resulting in significant improvements in matting for dynamic images. Li et al. propose VMFormer [15], which uses a transformer-based end-to-end image matting method that can learn from a given image input sequence and predict the alpha matte for each frame. Subsequently, VideoMatt [16] employs spatial attention to enhance the model’s balance between accuracy and speed of alpha matte prediction. Peng et al. [17] propose an RGB-D human image matting neural network model that uses a depth map as an auxiliary input and experimentally demonstrates that depth maps are helpful for image matting.
Huynh et al. introduce Masked Guided Gradual Human Instance Matting (MaGGIe) [18], which progressively predicts alpha mattes for each human instance, balancing computational cost, precision, and consistency. By utilizing transformer attention and sparse convolution, MaGGIe simultaneously outputs all instance mattes while maintaining constant inference costs. Chen et al. propose ControlMatting [19], a robust, real-time, high-resolution matting method that achieves state-of-the-art results. Unlike existing frame-by-frame methods, ControlMatting employs a unified architecture with a controllable generation model to preserve overall semantic information. Its independent recurrent architecture leverages temporal information to enhance temporal coherence and matting quality.
The Segment Anything Model (SAM) [20], introduced by Kirillov et al., utilizes a Vision Transformer (ViT) [21] for image encoding and employs a flexible prompting mechanism, enabling interactive and versatile segmentation by generating masks based on user-provided inputs such as boxes or points. Its primary strength lies in its adaptability and immediate response to various user prompts, making it a powerful tool for real-time image segmentation. Extending this approach, the Matting Anything Model (MAM) [22], proposed by Li et al., builds upon the SAM framework by incorporating a Mask-to-Matte module, refining the binary masks produced by SAM into detailed alpha mattes. This integration allows MAM to address a broader range of image matting tasks with a unified model that requires fewer parameters than specialized matting models. MAM’s iterative refinement process and multi-scale prediction capability enhance its performance, achieving results comparable to specialized models and maintaining efficiency. While SAM excels in providing a flexible and interactive solution for segmentation, MAM extends this flexibility to matting, offering a comprehensive solution that simplifies user intervention and improves the precision of matting outputs. Huang et al. [23] review several deep learning-based matting algorithms in detail and introduce commonly used image matting evaluation metrics. Table 1 presents a comparative analysis of ten deep learning-based video matting methods across various dimensions.

2.3. Depth-Based Video Matting

While a standard RGB camera converts real-world 3D space images into 2D plane images with only XY coordinates, a depth camera can measure the distance between the camera and the target subject, which is the Z-axis depth information. Depth cameras have a wide range of applications, including 3D modeling, gesture recognition, human-computer interaction, virtual reality, and video matting.
Lu et al. [24] propose a method that automatically generates trimaps and uses Bayesian matting to compute the alpha channel, with depth images, color images, and trimaps as inputs. He et al. [25] use a hard segmentation algorithm to iteratively refine the scene depth map and calculate the foreground and background masks, automatically generating trimaps. Then, soft segmentation is applied to calculate the alpha matte. Zeng et al. [26] use a Directional Joint Bilateral Filter to fill in the depth map and then construct a hierarchical level framework to integrate color and depth information to segment the images and generate trimaps. Liu et al. [27] adopt the confidence map to determine the weights of color and depth images and use these weights to assist in image segmentation, generating better trimaps and improving matting results. Zhao et al. [28] detect shadows in RGB-D images and use temporal matting to improve the consistency of foreground extraction, while achieving real-time performance.
Li et al. propose DART [29], an adaptation of conventional RGB-based matting algorithms that incorporates depth information. The model’s output is refined using Bayesian inference with a background depth prior, producing a trimap that is then fed into another matting algorithm, ViTMatte [30], for more precise alpha mattes. Table 2 presents a comparative analysis of six depth-based video matting methods across various dimensions.
Both deep learning-based and depth-based matting methods have distinct strengths and weaknesses. Deep learning-based approaches are prone to foreground misclassification, while depth-based approaches often produce inaccurate edges. To address these limitations, we integrate deep learning-based and depth-based techniques using a novel Boundary-Selective Fusion (BSF) method to enhance accuracy and robustness. This innovative approach extends virtual production beyond virtual studios, making it feasible in real-world environments without the need for green screens.

3. Methods

The system architecture is shown in Figure 2. Before launching the application, preprocessing steps are required to set up the virtual scene, HTC VIVE base station, tracker, and camera. Upon launching the application, images are acquired using both the RGB and depth cameras. Video matting is then performed using both the deep learning-based and the depth-based matting methods, followed by the proposed Boundary-Selective Fusion (BSF) technique. The matted foreground images are then synchronized with the virtual scene. The coordinate information of the real-world camera is obtained using a camera coordinate callback triggered by the HTC VIVE tracker attached to the camera. This positional data is then applied to the virtual camera to achieve real-time compositing of the real foreground with the virtual background, accurately accounting for camera movements.

3.1. Preprocessing and Setup

A variety of virtual backgrounds for virtual production can be designed in the Unity game engine or created with 3D modeling software such as Maya or SketchUp. For instance, we utilize a sci-fi office environment [31] as the virtual scene, enabling adjustments to the viewing angle and lighting. The camera is positioned above the origin of the virtual scene, so the background environment captured by the camera can be adjusted by manually manipulating the position and rotation of the scene relative to the origin.
Camera tracking plays a crucial role in virtual production by enabling the seamless integration of real-world camera movements with virtual scenes. It involves tracking the camera’s six Degrees of Freedom (DoF) in real space: position (X, Y, Z) and orientation (pitch, yaw, roll). This synchronization between the real and virtual cameras allows for real-time compositing of real-world footage and virtual elements. Camera tracking systems can be broadly categorized into two types: non-optical and optical systems. Non-optical systems employ mechanical or electromagnetic sensors to track and acquire position and orientation information. Optical systems, on the other hand, rely on real-time vision-based detection of marker points to locate and track the camera. In this paper, we utilize the HTC VIVE tracker for optical camera tracking. This tracking system requires the setup of HTC base stations, known as lighthouses, which emit infrared signals. These signals are received by a set of sensors on the VIVE tracker to determine its coordinates and orientations.
For image acquisition, the camera is connected via an HDMI cable, and its output is captured by the computer using an HDMI video capture card. The flow of video and tracking data is illustrated in Figure 3. To track the camera’s position in the real world, the VIVE tracker is attached to the camera and the VIVE dongle is connected to the computer. Figure 4 illustrates the experimental environment of our virtual production system.

3.2. Principle of Video Matting

Video matting is the process of estimating the foreground objects in an image or video. It involves calculating the opacity (ranging from 0 to 1) of each pixel in the input image to obtain an accurate alpha matte of the corresponding image, thereby separating the foreground from the background. Mathematically, the image matting problem can be expressed as follows:
$$I_i = \alpha_i F_i + (1 - \alpha_i) B_i, \quad \alpha_i \in [0, 1],$$
where:
  • $I_i$ represents the composite image color of pixel i;
  • $\alpha_i$ represents the alpha value (opacity) of pixel i, ranging from 0 (fully transparent) to 1 (fully opaque);
  • $F_i$ represents the foreground color of pixel i;
  • $B_i$ represents the background color of pixel i.
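This compositing equation can be evaluated directly per pixel; the following minimal NumPy sketch (a generic illustration, not code from the proposed system) blends a foreground and a background image with a floating-point alpha matte.

```python
import numpy as np

def composite(foreground, background, alpha):
    """Per-pixel compositing I = alpha * F + (1 - alpha) * B.
    foreground, background: float arrays of shape (H, W, 3) in [0, 1].
    alpha: float array of shape (H, W) in [0, 1]."""
    a = alpha[..., None]  # broadcast the matte over the color channels
    return a * foreground + (1.0 - a) * background
```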
Traditional image matting methods primarily rely on color and other low-level features to generate the alpha channel. However, these techniques struggle with complex backgrounds, especially when the foreground object has a similar color or is semi-transparent, making accurate matting difficult. To overcome these limitations, the deep learning-based matting and the depth-based matting methods have been introduced. These methods do not solely depend on color but utilize deep learning mechanisms or depth information to predict alpha mattes.
The proposed Boundary-Selective Fusion (BSF) method combines alpha mattes generated by deep learning-based and depth-based matting techniques on a pixel-by-pixel basis. For the deep learning-based matting module, we utilize the RVM model [14] to predict alpha mattes from RGB images due to its speed and robustness. For the depth-based module, we employ the Intel RealSense D435i camera to capture depth information and apply spatial, temporal, and hole-filling filters to refine the depth map. Trimaps are generated through morphological erosion and dilation, and the Shared Matting technique [32] is used to compute alpha mattes. The deep learning module, depth-based module, and the proposed BSF are explained in Section 3.3, Section 3.4, and Section 3.5, respectively.

3.3. Deep Learning-Based Video Matting Module

This module aims to use deep learning networks to extract the foreground from RGB images captured by a regular camera for virtual production applications. We employ the Robust Video Matting (RVM) model [14] to predict alpha values for the RGB images captured by the camera. The MobileNetv3 version of RVM model is used to process the RGB images captured by the camera to generate the final alpha matte. After calculating the alpha matte, both the original image and the alpha matte are imported into Unity for compositing the foreground subject with the virtual scene.
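For reference, a minimal sketch of running the MobileNetV3 variant of RVM on a single frame is given below. The torch.hub entry point and the recurrent-state handling follow the usage documented in the public RVM repository; the wrapper function and the downsample ratio are our own assumptions and should be verified against the installed release.

```python
import torch

# Load the MobileNetV3 variant of RVM via torch.hub (entry point as documented
# in the RVM repository; verify against the installed release).
model = torch.hub.load("PeterL1n/RobustVideoMatting", "mobilenetv3").eval()
if torch.cuda.is_available():
    model = model.cuda()

rec = [None] * 4  # recurrent states carried across frames

@torch.no_grad()
def predict_alpha(frame):
    """frame: (1, 3, H, W) RGB tensor in [0, 1], on the same device as the model.
    Returns a (1, 1, H, W) alpha matte."""
    global rec
    fgr, pha, *rec = model(frame, *rec, downsample_ratio=0.25)
    return pha
```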

3.4. Depth-Based Video Matting Module

This module aims to use color images with depth information from a depth camera to extract the foreground using image processing techniques for virtual production applications. The depth-based video matting process is shown in Figure 5.
After acquiring depth and color frames using the Pyrealsense2 SDK for the Intel RealSense D435i depth camera, it is necessary to align the depth and color frames because they originate from different sensors with distinct viewpoints. To address the missing regions often present in depth images, we first apply a Decimation Filter to simplify the depth scene. Next, the depth map is converted to a disparity signal, and a Spatial Filter is applied to enhance the smoothness of the depth map. Subsequently, a Temporal Filter is used to increase the persistence of the depth information. The disparity signal is then converted back to the depth map. Finally, a Hole Filling Filter is applied to refine the depth map.
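The capture and post-processing chain described above can be sketched with the pyrealsense2 bindings as follows; the stream settings and default filter parameters are illustrative, not the exact configuration used in the paper.

```python
import pyrealsense2 as rs

# Start aligned depth and color streams on the RealSense D435i (example settings).
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 1280, 720, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)
pipeline.start(config)
align = rs.align(rs.stream.color)  # align depth to the color viewpoint

# Decimation -> depth to disparity -> spatial -> temporal -> back to depth -> hole filling.
filters = [
    rs.decimation_filter(),
    rs.disparity_transform(True),
    rs.spatial_filter(),
    rs.temporal_filter(),
    rs.disparity_transform(False),
    rs.hole_filling_filter(),
]

frames = align.process(pipeline.wait_for_frames())
depth_frame = frames.get_depth_frame()
color_frame = frames.get_color_frame()
for f in filters:  # note: decimation reduces the depth resolution
    depth_frame = f.process(depth_frame)
```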
Using the refined depth map, we employ the minimum method [33] to determine the threshold between the foreground and background. Once the foreground range is established, morphological filters such as erosion and dilation are applied to create the image’s trimap. The original RGB image and the trimap are then analyzed to calculate the alpha matte. To reduce computational costs, we use Shared Matting [32], which selectively shares background and foreground samples to achieve real-time alpha matte computation.
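A simplified sketch of the trimap construction is shown below. It assumes the foreground threshold has already been estimated (e.g., with the minimum method [33]) and leaves the final alpha computation to a Shared Matting [32] implementation; the kernel size is an illustrative choice.

```python
import cv2
import numpy as np

def depth_trimap(depth_mm, threshold_mm, kernel_size=15):
    """Build a trimap from a refined depth map (in millimeters): pixels nearer
    than the threshold are foreground; the band between the eroded and dilated
    foreground masks becomes the unknown region passed to the matting step."""
    fg = ((depth_mm > 0) & (depth_mm < threshold_mm)).astype(np.uint8) * 255
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    sure_fg = cv2.erode(fg, kernel)
    maybe_fg = cv2.dilate(fg, kernel)
    trimap = np.zeros_like(fg)       # background = 0
    trimap[maybe_fg > 0] = 128       # unknown band = 128
    trimap[sure_fg > 0] = 255        # sure foreground = 255
    return trimap
```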

3.5. Boundary-Selective Fusion (BSF)

Both the deep learning-based and the depth-based matting methods have their own strengths and weaknesses. The deep learning-based approaches are susceptible to foreground misclassification, while the depth-based approaches can produce inaccurate edges. To address these limitations, we propose a novel Boundary-Selective Fusion (BSF) that combines the output mattes of both approaches to improve accuracy and robustness.
The proposed Boundary-Selective Fusion method combines the alpha mattes generated by the depth-based and the deep learning-based matting techniques on a pixel-by-pixel basis. The flowchart of the proposed Boundary-Selective Fusion is shown in Figure 6a and can be roughly divided into four parts: boundary intersection, maximum union, edge correction, and largest connected component selection (a code sketch of these steps follows the list).
  • Boundary Intersection: Morphological erosion and dilation are applied and then subtracted to obtain a distinct boundary region of the alpha matte predicted by each individual matting module. The intersection of these two boundaries is considered the final boundary region;
  • Maximum Union: Since the depth-based matting typically has higher spatial integrity but less accurate edges, the maximum values of the alpha mattes predicted by both methods are combined through a pixel-wise Max operation to create a union matte;
  • Edge Correction: The boundary regions of the depth-based alpha matte are prone to noise and may be less reliable. Therefore, these boundary regions are replaced with the corresponding regions from the deep learning-based alpha matte, resulting in a more accurate combined alpha matte;
  • Largest Connected Component Selection: Perform connected component analysis and retain only the largest connected component in the foreground. All other connected components are discarded to minimize background noise in the final alpha matte.
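The following NumPy/OpenCV sketch mirrors the four steps above. The kernel size, the 0-255 alpha representation, and the binarization threshold are illustrative choices rather than the exact parameters of our implementation.

```python
import cv2
import numpy as np

def boundary_selective_fusion(alpha_dl, alpha_depth, ksize=9):
    """alpha_dl, alpha_depth: uint8 alpha mattes (0-255) from the deep
    learning-based and depth-based modules. Returns the fused matte."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (ksize, ksize))

    def boundary(a):
        # Boundary region = dilation minus erosion of the binarized matte.
        mask = (a > 127).astype(np.uint8)
        return cv2.dilate(mask, kernel) - cv2.erode(mask, kernel)

    # 1. Boundary intersection of the two individual boundary regions.
    band = (boundary(alpha_dl) & boundary(alpha_depth)).astype(bool)

    # 2. Maximum union keeps the solid foreground of the depth-based matte.
    fused = np.maximum(alpha_dl, alpha_depth)

    # 3. Edge correction: inside the boundary band, trust the deep
    #    learning-based matte, whose edges are more reliable.
    fused[band] = alpha_dl[band]

    # 4. Keep only the largest connected foreground component.
    num, labels, stats, _ = cv2.connectedComponentsWithStats((fused > 127).astype(np.uint8))
    if num > 1:
        largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
        fused[labels != largest] = 0
    return fused
```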
The essence of the proposed Boundary-Selective Fusion method is to retain the solid foreground from the depth-based matte while preserving the boundary details from the deep learning-based matte. For qualitative comparison, the alpha mattes generated by the deep learning-based method, the depth-based method, the proposed Boundary-Selective Fusion method, as well as the ground truth are demonstrated in Figure 6b–e.

3.6. Video Transmission, Compositing, and Camera Tracking

To seamlessly integrate Python-scripted matted images with alpha channels into the Unity game engine for real-time visualization, we utilize NDI [34] for image transmission. This process involves an NDI sender in the Python script and an NDI receiver within the Unity project. Once each frame and its corresponding alpha matte are generated, the NDI signals are transmitted. The NDI receiver in Unity continuously updates the image during runtime. Upon receiving the image, Unity’s UI Canvas is used to display the real-world human foreground. The Screen Space-Camera rendering mode employs a camera as a reference, positioning the UI plane (in this case, the matted image) at a specified distance in front of the camera. Objects in the scene that are closer to the camera than the UI plane will occlude the matted image, effectively handling occlusion as demonstrated in Figure 7.
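Only the frame packing is sketched below, since the sender call itself depends on the particular NDI binding in use; combining the color image with its alpha matte into a single BGRA buffer is the generic part of the hand-off.

```python
import numpy as np

def pack_bgra(bgr_frame, alpha):
    """Combine an 8-bit BGR frame (H, W, 3) and an 8-bit alpha matte (H, W)
    into a BGRA buffer suitable for an NDI video frame."""
    bgra = np.dstack([bgr_frame, alpha])
    return np.ascontiguousarray(bgra)  # NDI senders typically expect contiguous memory

# The resulting buffer is then handed to the NDI sender of whichever Python
# binding is used; call names vary between bindings, so none are shown here.
```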
We employ VIVE trackers in combination with VIVE base stations for camera tracking. The SteamVR SDK provides basic procedures for tracking object position and rotation, but we have added two additional corrections for rotation and alignment. Firstly, an Initial Rotation Correction compensates for the 3D angles due to the varied installation of the trackers. Secondly, a Horizontal Alignment Correction adjusts for any slight misalignments between the tracker’s mounting position and the camera’s horizontal axis.
In addition to the rotation and alignment corrections, a buffer size parameter has been introduced to allow users to configure the delay of tracker pose data. This adjustment is necessary because the image must be processed by the matting algorithm and transmitted over the network before it can be displayed in Unity.
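The corrections and the pose delay buffer are implemented inside Unity; the Python sketch below only illustrates the idea, with the rotation offset, buffer length, and quaternion composition order treated as user-supplied assumptions.

```python
from collections import deque
from scipy.spatial.transform import Rotation as R

class TrackerPoseFilter:
    """Applies a fixed rotation offset (initial rotation correction) and delays
    poses by a configurable number of frames so that camera motion stays in
    sync with the matted video that arrives later over the network."""

    def __init__(self, mount_offset_euler_deg, buffer_size):
        self.offset = R.from_euler("xyz", mount_offset_euler_deg, degrees=True)
        self.buffer = deque(maxlen=buffer_size)

    def push(self, position_xyz, rotation_quat_xyzw):
        # Composition order depends on how the tracker is mounted; adjust as needed.
        corrected = R.from_quat(rotation_quat_xyzw) * self.offset
        self.buffer.append((position_xyz, corrected.as_quat()))

    def delayed_pose(self):
        # Oldest buffered pose, i.e., the pose delayed by up to buffer_size frames.
        return self.buffer[0] if self.buffer else None
```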

4. Results and Discussion

To evaluate the performance of the matting methods, we conduct three thorough experiments: a speed test, an accuracy test, and a robustness test. In addition to the three previously mentioned matting methods (deep learning-based [14], depth-based, and the proposed BSF), we also include the state-of-the-art Matting Anything Model (MAM) [22] for both quantitative and qualitative comparisons. We use the original code from the official GitHub repository of MAM to ensure a fair comparison.

4.1. Speed Evaluation

To assess the computational efficiency of the four matting methods (deep learning-based [14], depth-based, proposed BSF, and MAM [22]) at different image resolutions, we utilize a testbed Windows PC equipped with an Intel Core i9-13900K CPU and an NVIDIA GeForce RTX 4070 GPU. The frame rates in FPS (Frames Per Second) at three different resolutions for each matting method are shown in Table 3. The depth-based matting lacks a down-sampling setting and cannot practically be executed on the GPU. To ensure a fair comparison, the experiment is therefore separated into five basic categories: the deep learning-based (GPU + Down-sampling), the deep learning-based (CPU + Down-sampling), the deep learning-based (CPU), the depth-based (CPU), and the proposed BSF method.
MAM [22] integrates the Segment Anything Model (SAM) [20] as a guidance module and includes an additional Mask-to-Matte (M2M) module to refine the mask output into the alpha matte. To represent the three available variations of the SAM model, ViT-Base with 91 million parameters, ViT-Large with 308 million parameters, and ViT-Huge with 636 million parameters, three additional categories have been added to the right side of Table 3. The M2M module is relatively lightweight, containing only 2.7 million parameters.
The deep learning-based method (GPU + Down-sampling) appears to be the most efficient matting approach across all resolutions, while the deep learning-based (CPU + Down-sampling) method offers a balance between speed and resolution. The deep learning-based (CPU) and the depth-based (CPU) methods exhibit slower processing as resolution increases. As for the proposed BSF method, it needs to undergo both matting methods; thus, the frame rate is slightly slower than the depth-based (CPU) method.
The three variations of MAM [22] differ by their model sizes. Generally, ViT-Huge (636 million parameters) achieves higher accuracy but operates at a slower speed, whereas ViT-Base (91 million parameters) sacrifices a small amount of accuracy for faster performance. Before processing by MAM, each image is resized so that its longer side is 1024 pixels and its shorter side is padded to 1024 pixels. The proposed BSF algorithm avoids floating-point computations entirely. Instead, all alpha values are scaled to a range of 0 to 255 to enable efficient integer computations. In summary, the proposed BSF, operating at a resolution of 1280 × 720, runs faster than all three variations of MAM.
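As an illustration of this fixed-point style (our example, not the authors’ code), alpha blending with mattes scaled to 0-255 can be carried out entirely in integer arithmetic:

```python
import numpy as np

def composite_uint8(fg, bg, alpha):
    """Blend uint8 images with a uint8 alpha matte (0 = background, 255 = foreground)
    using integer arithmetic only."""
    a = alpha.astype(np.uint32)[..., None]  # widen to avoid intermediate overflow
    out = (a * fg + (255 - a) * bg) // 255
    return out.astype(np.uint8)
```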

4.2. Accuracy Evaluation

To assess the accuracy of the four matting methods (deep learning-based [14], depth-based, proposed BSF, and MAM [22]), we employ metrics that independently process each pixel, thus capturing only the mean per-frame difference between the evaluated and ground-truth sequences. Simple metrics such as Mean Absolute Difference (MAD) and Mean Squared Error (MSE) have been widely used to compare test images with ground truth. Let $N$ denote the total number of pixels, $\alpha_p$ denote the alpha value of pixel $p$ in the video matting under consideration, and $\alpha_p^{GT}$ denote the ground-truth alpha value; the MAD and MSE are defined as follows:
$$\mathrm{MAD} = \frac{1}{N} \sum_{p} \left| \alpha_p - \alpha_p^{GT} \right|$$
$$\mathrm{MSE} = \frac{1}{N} \sum_{p} \left( \alpha_p - \alpha_p^{GT} \right)^2$$
Despite their simplicity and popularity, the MAD and MSE do not always correlate with the visual quality perceived by human observers. To address this issue, Rhemann et al. [35] conducted a user study of visual inspection and suggested two perceptual error measures: Gradient Error and Connectivity Error. The Gradient Error, calculated as the difference between the normalized gradients of the predicted alpha matte ($\nabla\alpha$) and the ground-truth alpha matte ($\nabla\alpha^{GT}$), evaluates edge and detail preservation, which is crucial for high-quality matting. The Connectivity Error assesses the structural integrity and coherence of the foreground regions by measuring the connectivity of pixel $p$ to the source region $\Omega$, defined as the largest connected region where both the alpha matte and its ground truth are completely opaque. The Gradient Error and Connectivity Error [35] are defined as follows:
$$\mathrm{Gradient\ Error} = \sum_{p} \left( \nabla\alpha_p - \nabla\alpha_p^{GT} \right)^2$$
$$\mathrm{Connectivity\ Error} = \sum_{p} \left| \varphi(\alpha_p, \Omega) - \varphi(\alpha_p^{GT}, \Omega) \right|$$
where $\varphi(\cdot, \Omega)$ denotes the degree of connectivity of a pixel to the source region $\Omega$.
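A straightforward NumPy/OpenCV sketch of the per-frame error computation is given below; the gradient error here uses a Sobel approximation of the gradients rather than the Gaussian-derivative filters of [35], and the connectivity error is omitted because it requires the source-region machinery defined in that benchmark.

```python
import cv2
import numpy as np

def matting_errors(alpha, alpha_gt):
    """alpha, alpha_gt: float64 arrays in [0, 1] of identical shape (H, W).
    Returns MAD, MSE, and a Sobel-based approximation of the gradient error."""
    diff = alpha - alpha_gt
    mad = np.abs(diff).mean()
    mse = (diff ** 2).mean()

    def grad(a):
        gx = cv2.Sobel(a, cv2.CV_64F, 1, 0, ksize=3)
        gy = cv2.Sobel(a, cv2.CV_64F, 0, 1, ksize=3)
        return np.dstack([gx, gy])

    grad_err = np.sum((grad(alpha) - grad(alpha_gt)) ** 2)
    return mad, mse, grad_err
```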
Collectively, these metrics ensure a comprehensive evaluation of the matting method’s performance, balancing accuracy, edge preservation, and structural integrity. For a quantitative comparison of accuracy, MAM [22] with the largest model size (ViT-Huge with 636 million parameters) is chosen due to its superior accuracy. Two scenarios are considered in this experiment: human alone and human holding an object.

4.2.1. Accuracy Experiment of Human Alone

To evaluate the accuracy of four matting methods (the deep learning-based [14], depth-based, BSF, and MAM [22]) for a barehanded human subject, this experiment uses the HDM-2K dataset [17] as input, which contains images primarily focused on a human subject. The results are shown in Table 4.
For images with simple backgrounds, the deep learning-based matting outperforms the depth-based matting in terms of accuracy. As for complex backgrounds, the accuracy of the deep learning-based matting degrades due to the influence of background clutter on foreground extraction. On the other hand, the depth-based matting, which relies on depth information for foreground determination, is less affected by complex backgrounds. MAM [22] performs well in simple backgrounds, but its error increases significantly in complex backgrounds. Overall, the proposed Boundary-Selective Fusion method, which leverages the strengths of two matting techniques, exhibits superior performance in complex backgrounds.
In low illumination scenarios, only simple backgrounds are evaluated due to the limited number of test images with complex backgrounds. The error comparison is shown in Table 5. Generally, the proposed Boundary-Selective Fusion method outperforms either the deep learning-based matting or the depth-based matting under low illumination. However, MAM [22] demonstrates the best overall results in low illumination scenarios, likely due to its attention mechanisms, which enable the model to focus on relevant parts of the image.

4.2.2. Accuracy Experiment of Human Holding an Object

We use the HDM-2K dataset [17] to evaluate the accuracy of the four matting methods (deep learning-based [14], depth-based, proposed BSF, and MAM [22]) in scenarios with humans holding an object. The matting error comparison for humans holding a variety of objects is summarized in Table 6. The deep learning-based method tends to misclassify parts of objects when the foreground is a human holding an object, leading to higher matting errors. This is because deep learning-based matting models are typically trained on images of humans alone, making them prone to errors when other objects are present. On the other hand, the depth-based matting relies on depth information to distinguish between foreground and background, making it less prone to errors whether the foreground person is holding an object or not. However, it exhibits higher gradient errors compared to the deep learning-based matting. The proposed Boundary-Selective Fusion, which combines the alpha mattes from both methods, addresses these limitations by exploiting their complementary strengths and yields the most accurate results. The qualitative comparisons are demonstrated in Table 7 using human-holding-object examples from the HDM-2K dataset [17]. It is noted that MAM generates numerous background outliers, leading to increased matting errors in scenarios where a human is holding an object. This issue arises from the attention mechanisms, which may misclassify parts of the background as foreground, particularly when there are intricate overlaps or proximity between the object and background elements.

4.3. Robustness Evaluation

RGB-D video datasets with ground-truth alpha channels for matting evaluation are unavailable. Therefore, to evaluate the stability of the four matting methods (deep learning-based [14], depth-based, proposed BSF, and MAM [22]), we generate our own set of RGB-D videos with ground-truth alpha mattes. The data generation process begins by capturing the RGB image and depth map of the subject with an Intel RealSense D435i camera in front of a green screen. The ground truth is created using chroma keying in the video compositing software Adobe After Effects 2023. Next, we composite the foreground image with a background image to produce the final RGB image. Note that the background depth is assumed to be infinite during this process, since the final background will be a different image.
To assess the stability of four matting methods (deep learning-based [14], depth-based, proposed BSF, and MAM [22]), the following metrics are employed: Mean Absolute Difference (MAD) and dtSSD [36]. The MAD values are plotted over time to visualize the consistency of the matting approaches. The dtSSD value, on the other hand, measures temporal coherence by capturing unexpected alpha temporal changes and ignoring temporally coherent errors. Suppose N denotes the product of the number of frames and the number of pixels; the dtSSD is defined as follows:
$$\mathrm{dtSSD} = \sqrt{ \frac{1}{N} \sum_{t} \sum_{p} \left( \frac{d\alpha_{p,t}}{dt} - \frac{d\alpha_{p,t}^{GT}}{dt} \right)^{2} }$$
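A minimal NumPy sketch of dtSSD over an alpha sequence is shown below, assuming forward frame differences as the approximation of the temporal derivative.

```python
import numpy as np

def dtssd(alpha_seq, alpha_gt_seq):
    """alpha_seq, alpha_gt_seq: float arrays of shape (T, H, W) in [0, 1]."""
    d_alpha = np.diff(alpha_seq, axis=0)  # frame-to-frame alpha changes
    d_gt = np.diff(alpha_gt_seq, axis=0)
    n = d_alpha.size                      # (#frames - 1) * #pixels
    return np.sqrt(np.sum((d_alpha - d_gt) ** 2) / n)
```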

4.3.1. Static and Dynamic Background Video

Table 8 reports the MAD and dtSSD errors for video matting with static and dynamic backgrounds, and Figure 8 plots the MAD of the alpha channel over time for the four matting methods (deep learning-based [14], depth-based, BSF, and MAM [22]). The deep learning-based matting exhibits lower error fluctuations, indicating superior consistency and stability. This can be attributed to its recurrent neural network (RNN) architecture, which effectively incorporates temporal information into the alpha matte prediction. In contrast, the depth-based matting still produces relatively coarse edges, leading to lower stability in static background scenarios.
Concerning the temporal progression of MAD error for video matting in dynamic background, a notable observation is the sudden increase in MAD error for the deep learning-based matting over a series of frames. This surge can be attributed to the challenges of handling dynamic background changes, which can introduce complexity and inconsistencies in the alpha matte predictions. In contrast, the depth-based matting maintains its MAD error below a low threshold throughout the sequence, demonstrating greater resilience to dynamic background variations. As shown in Figure 8, MAM [22] produces numerous background outliers, leading to increased MAD and dtSSD errors for both static and dynamic video matting. On the other hand, the proposed BSF provides competitive robustness for both static and dynamic video matting.

4.3.2. Human Holding Objects Video

Considering common scenarios involving individuals holding objects in foreground matting, we conduct stability evaluations on videos featuring moving people holding an object. Table 9 summarizes the average MAD and dtSSD Error for four matting methods (deep learning-based [14], depth-based, proposed BSF, and MAM [22]) in human-holding-object scenarios. Figure 9 further illustrates the MAD error over time for each method.
The experimental results demonstrate that the proposed Boundary-Selective Fusion (BSF), which integrates deep learning-based matting and depth-based matting, outperforms each method individually when processing videos with dynamic foreground objects, such as a person holding an object. This superiority arises from the inherent limitations of each individual method in handling complex foreground dynamics. As illustrated in Figure 9, the proposed BSF is more accurate and robust compared to MAM [22], exhibiting lower MAD and dtSSD errors in scenarios where a human is holding and moving an object.
The deep learning-based matting, while adept at learning intricate patterns from color images, may occasionally misclassify certain foreground elements, such as objects held by a moving person. This can lead to inconsistencies in the alpha mattes, affecting overall stability. The depth-based matting, on the other hand, consistently identifies the entire foreground, including both the person and the object, based on depth cues. However, its inherent limitations in edge refinement can result in larger error variations along the human’s and object’s boundaries.
By combining the strengths of both methods, the proposed Boundary-Selective Fusion approach addresses their individual shortcomings and achieves enhanced stability. It incorporates the complete foreground information from the depth-based matting and the smooth object boundaries from the deep learning-based matting, resulting in more accurate and stable alpha mattes.

4.4. Tracking and Compositing in the Virtual Production System

The proposed virtual production system enables the creation of immersive virtual environments that surpass the limitations of physical sets. To facilitate shooting from a moving camera with varying perspectives, the system is designed with camera tracking capabilities, allowing the accurate mapping of real camera movements to the virtual world.
We conduct experiments to evaluate the quality of virtual camera tracking by attaching a VIVE tracker to a real-world camera and observing whether the virtual camera correspondingly follows the real camera’s movements. The experiment starts from the initial position shown in Figure 10. Table 10 and Table 11 summarize the real camera’s movements, virtual camera coordinates, and the final composite images.
In these camera tracking experiments, it can be observed that the virtual camera’s environment moves in sync with the real camera’s movements. This synchronization holds for a wide range of camera movements, including forward and backward motion, lateral motion, rotational motion, tilting motion, and rolling motion.
Figure 11 shows some video compositing results of the virtual production system; it seamlessly integrates real-world foreground elements with a sci-fi office background and an outdoor bridge background, showcasing its ability to create diverse and captivating virtual settings.
Figure 12 illustrates the occlusion effect achieved by incorporating foreground depth. The scene features a real-world person and a virtual object (a red dragon). As the virtual object moves closer to the camera than the real-world person, it occludes the person. Conversely, when the real-world person is closer to the camera than the virtual object, the person occludes the virtual object. It is evident that our virtual production effectively handles occlusion in dynamic scenes.
To further enhance the user experience, the virtual production system incorporates a user interface (UI) feature that facilitates convenient file naming and recording controls. As shown in Figure 13, this feature streamlines the recording process and provides users with better control over their captured footage.

5. Conclusions

In the context of video matting for virtual production, both deep learning-based matting and depth-based matting methods provide a viable alternative to traditional green screen setups, eliminating the need for physical green screens and simplifying the production process. We propose a Boundary-Selective Fusion (BSF) algorithm that combines the alpha mattes from both methods, leveraging their complementary strengths. Experimental results demonstrate that the proposed BSF significantly enhances both accuracy and robustness compared to each individual method. By integrating these techniques with a camera tracking system, filmmakers can achieve real-time compositing of real-world elements and virtual environments from a moving camera with varying perspectives. This approach streamlines the virtual production process and eliminates the need for green screens.
The proposed BSF exhibits competitive overall performance relative to state-of-the-art methods, especially in scenarios involving humans holding objects or other complex settings. However, the matting accuracy under low illumination conditions is inferior to that of MAM [22]. The effectiveness of the proposed BSF is contingent upon the performance of both deep learning-based matting and depth-based matting modules. Concerning the speed of video matting, the current bottleneck is the depth-based matting module, which cannot be accelerated on the GPU. Notably, each module operates independently and can be replaced with more advanced matting methods as they become available.
Future work on the video matting method will focus on several key areas. Firstly, enhancing the algorithm’s efficiency to support real-time applications by optimizing computational complexity and leveraging hardware acceleration. Secondly, improving robustness in diverse lighting conditions and dynamic scenes through advanced training techniques and data augmentation. Thirdly, integrating adaptive learning mechanisms to refine matting results in real-time based on feedback. Finally, expanding the dataset to cover a broader range of scenarios and objects to further enhance versatility and accuracy.

Author Contributions

Conceptualization, P.-C.S. and M.-T.Y.; methodology, P.-C.S. and M.-T.Y.; software, P.-C.S.; validation, P.-C.S. and M.-T.Y.; formal analysis, P.-C.S. and M.-T.Y.; investigation, P.-C.S.; resources, M.-T.Y.; data curation, P.-C.S.; writing—original draft preparation, P.-C.S.; writing—review and editing, M.-T.Y.; visualization, P.-C.S.; supervision, M.-T.Y.; project administration, M.-T.Y.; funding acquisition, M.-T.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by the Ministry of Education (MOE) and the National Science and Technology Council (NSTC) in Taiwan.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Swords, J.; Willment, N. The emergence of virtual production—A research agenda. Converg. Int. J. Res. New Media Technol. 2024; ahead of print. [Google Scholar] [CrossRef]
  2. De Gaspari, T.; Sementille, A.C.; Vielmas, D.Z.; Aguilar, I.A.; Marar, J.F. ARSTUDIO: A Virtual Studio System with Augmented Reality Features. In Proceedings of the 13th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and Its Applications in Industry, Shenzhen, China, 30 November 2014; pp. 17–25. [Google Scholar]
  3. de Goussencourt, T.; Bertolino, P. Using the Unity® Game Engine as a Platform for Advanced Real Time Cinema Image Processing. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 4146–4149. [Google Scholar]
  4. Nakatani, A.; Shinohara, T.; Miyaki, K. Live 6DoF Video Production with Stereo Camera. In Proceedings of the SIGGRAPH Asia 2019 XR, Brisbane, QLD, Australia, 17 November 2019; pp. 23–24. [Google Scholar]
  5. Aguilar, I.A.; Sementille, A.C.; Sanches, S.R.R. ARStudio: A Low-Cost Virtual Studio Based on Augmented Reality for Video Production. Multimed. Tools Appl. 2019, 78, 33899–33920. [Google Scholar] [CrossRef]
  6. Chiu, P.-C. Augmented Reality Virtual Production System. Master’s Thesis, National Taipei University of Technology, Taipei, Taiwan, 2022. Available online: https://hdl.handle.net/11296/gjzp5r (accessed on 10 July 2024).
  7. Jiang, K.; Wang, Z.; Yi, P.; Chen, C.; Wang, Z.; Wang, X.; Jiang, J.; Lin, C. Rain-free and residue hand-in-hand: A progressive coupled network for real-time image deraining. IEEE Trans. Image Process. 2021, 30, 7404–7418. [Google Scholar] [CrossRef] [PubMed]
  8. Jiang, K.; Wang, Z.; Yi, P.; Chen, C.; Han, Z.; Lu, T.; Huang, B.; Jiang, J. Decomposition makes better rain removal: An improved attention-guided deraining network. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3981–3995. [Google Scholar] [CrossRef]
  9. Xiao, Y.; Yuan, Q.; Jiang, K.; He, J.; Lin, C.; Zhang, L. TTST: A top-k token selective transformer for remote sensing image super-resolution. IEEE Trans. Image Process. 2024, 33, 738–752. [Google Scholar] [CrossRef] [PubMed]
  10. Jiang, K.; Wang, Z.; Yi, P.; Lu, T.; Jiang, J.; Xiong, Z. Dual-path deep fusion network for face image hallucination. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 378–391. [Google Scholar] [CrossRef] [PubMed]
  11. Forte, M.; Pitié, F. F, B, Alpha Matting. arXiv 2020, arXiv:2003.07711. Available online: http://arxiv.org/abs/2003.07711 (accessed on 19 February 2024). [Google Scholar]
  12. Lin, S.; Ryabtsev, A.; Sengupta, S.; Curless, B.; Seitz, S.; Kemelmacher-Shlizerman, I. Real-Time High-Resolution Background Matting. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  13. Ke, Z.; Sun, J.; Li, K.; Yan, Q.; Lau, R.W.H. MODNet: Real-Time Trimap-Free Portrait Matting via Objective Decomposition. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2022. [Google Scholar]
  14. Lin, S.; Yang, L.; Saleemi, I.; Sengupta, S. Robust High-Resolution Video Matting with Temporal Guidance. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022. [Google Scholar]
  15. Li, J.; Goel, V.; Ohanyan, M.; Navasardyan, S.; Wei, Y.; Shi, H. VMFormer: End-to-End Video Matting with Transformer. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024. [Google Scholar]
  16. Li, J.; Ohanyan, M.; Goel, V.; Navasardyan, S.; Wei, Y.; Shi, H. VideoMatt: A Simple Baseline for Accessible Real-Time Video Matting. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 20–22 June 2023; pp. 2177–2186. [Google Scholar]
  17. Peng, B.; Zhang, M.; Lei, J.; Fu, H.; Shen, H.; Huang, Q. RGB-D Human Matting: A Real-World Benchmark Dataset and a Baseline Method. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4041–4053. [Google Scholar] [CrossRef]
  18. Huynh, C.; Oh, S.; Shrivastava, A.; Lee, J. MaGGIe: Mask guided gradual human instance matting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  19. Chen, A.; Huang, H.; Zhu, Y.; Xue, J. Real-time multi-person video synthesis with controllable prior-guided matting. Sensors 2024, 24, 2795. [Google Scholar] [CrossRef] [PubMed]
  20. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.; Lo, W.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023. [Google Scholar]
  21. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
  22. Li, J.; Jain, J.; Shi, H. Matting anything. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  23. Huang, L.; Liu, X.; Wang, X.; Li, J.; Tan, B. Deep learning methods in image matting: A survey. Appl. Sci. 2023, 13, 6512. [Google Scholar] [CrossRef]
  24. Lu, T.; Li, S. Image Matting with Color and Depth Information. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), Tsukuba, Japan, 11–15 November 2012. [Google Scholar]
  25. He, B.; Wang, G.; Zhang, C. Iterative Transductive Learning for Automatic Image Segmentation and Matting with RGB-D Data. J. Vis. Commun. Image Represent. 2014, 25, 1031–1043. [Google Scholar] [CrossRef]
  26. Zeng, W.; Liu, J. A Hierarchical Level Set Approach for RGBD Image Matting. In MultiMedia Modeling; Kompatsiaris, I., Huet, B., Mezaris, V., Gurrin, C., Cheng, W.-H., Vrochidis, S., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2019; Volume 11295, pp. 628–639. ISBN 978-3-030-05709-1. [Google Scholar]
  27. Liu, J.; Zeng, W.; Yang, B. RGBD Image Matting Using Depth Assisted Active Contours. In Proceedings of the 2018 International Conference on Audio, Language and Image Processing (ICALIP), Shanghai, China, 16–17 July 2018; pp. 385–392. [Google Scholar]
  28. Zhao, M.; Fu, C.; Cai, J.; Cham, T. Real-Time and Temporal-Coherent Foreground Extraction with Commodity RGBD Camera. IEEE J. Sel. Top. Signal Process. 2015, 9, 449–461. [Google Scholar] [CrossRef]
  29. Li, H.; Li, G.; Li, B.; Lin, W.; Cheng, Y. DART: Depth-enhanced accurate and real-time background matting. arXiv 2024, arXiv:2402.15820. [Google Scholar]
  30. Yao, J.; Wang, X.; Yang, S.; Wang, B. ViTMatte: Boosting image matting with pretrained plain vision transformers. Inf. Fusion 2024, 103, 102091. [Google Scholar] [CrossRef]
  31. Free Sci-Fi Office Pack|3D Sci-Fi|Unity Asset Store. Available online: https://assetstore.unity.com/packages/3d/environments/sci-fi/free-sci-fi-office-pack-195067 (accessed on 17 March 2024).
  32. Gastal, E.S.L.; Oliveira, M.M. Shared Sampling for Real-Time Alpha Matting. Comput. Graph. Forum 2010, 29, 575–584. [Google Scholar] [CrossRef]
  33. Glasbey, C.A. An Analysis of Histogram-Based Thresholding Algorithms. CVGIP Graph. Models Image Process. 1993, 55, 532–537. [Google Scholar] [CrossRef]
  34. NDI—Removing the Limits of Video Connectivity. Available online: https://ndi.video/ (accessed on 4 July 2024).
  35. Rhemann, C.; Rother, C.; Wang, J.; Gelautz, M.; Kohli, P.; Rott, P. A Perceptually Motivated Online Benchmark for Image Matting. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009. [Google Scholar]
  36. Erofeev, M.; Gitman, Y.; Vatolin, D.; Fedorov, A.; Wang, J. Perceptually Motivated Benchmark for Video Matting. In Proceedings of the British Machine Vision Conference 2015, Swansea, UK, 7–10 September 2015; British Machine Vision Association: Swansea, UK, 2015; pp. 99.1–99.12. [Google Scholar]
Figure 1. Concept map of the proposed virtual production system with moving camera tracking in varying perspectives.
Figure 1. Concept map of the proposed virtual production system with moving camera tracking in varying perspectives.
Electronics 13 03182 g001
Figure 2. Architecture of the proposed virtual production system.
Figure 2. Architecture of the proposed virtual production system.
Electronics 13 03182 g002
Figure 3. Camera video and tracking data flow diagram.
Figure 3. Camera video and tracking data flow diagram.
Electronics 13 03182 g003
Figure 4. Experimental environment of the proposed virtual production system including VIVE base station placement, camera tracking range, and foreground character active area.
Figure 4. Experimental environment of the proposed virtual production system including VIVE base station placement, camera tracking range, and foreground character active area.
Electronics 13 03182 g004
Figure 5. Depth-based video matting module.
Figure 5. Depth-based video matting module.
Electronics 13 03182 g005
Figure 6. Flowchart of the proposed Boundary-Selective Fusion (BSF) and comparison of four alpha mattes. (a) Flowchart; (b) Alpha matte generated by deep learning-based module; (c) Alpha matte generated by depth-based module; (d) Combined alpha matte using the proposed BSF; (e) Ground truth. The source images are taken from the HDM-2K dataset [17].
Figure 6. Flowchart of the proposed Boundary-Selective Fusion (BSF) and comparison of four alpha mattes. (a) Flowchart; (b) Alpha matte generated by deep learning-based module; (c) Alpha matte generated by depth-based module; (d) Combined alpha matte using the proposed BSF; (e) Ground truth. The source images are taken from the HDM-2K dataset [17].
Electronics 13 03182 g006
Figure 7. Occlusion handling via UI plane in Unity. (a) Image rendered in Unity scene; (b) Image properly rendered with occlusion considered.
Figure 8. MAD error over time of video matting in static background.
Figure 9. MAD error over time of video matting in dynamic foreground with humans holding an object.
Figure 10. Initial camera position and composite image. (a) Initial virtual camera position (Unity coordinates); (b) Initial composite image with real-world human and virtual background.
Figure 11. Some video compositing results of the virtual production system. (a) Sci-fi office background; (b) Outdoor background; (c–f) A sequence of shots in which the time of day is varied.
Figure 12. Occlusion effect of the virtual production system. (a) The virtual object is closer to the camera than the real-world human; (b) The virtual object occludes the real-world human; (c) The real-world human is closer to the camera than the virtual object; (d) The real-world human occludes the virtual object.
Figure 13. User interface of the virtual production system.
Table 1. Comparison of deep learning-based video matting methods. The "automatic" field indicates if the system operates without human intervention. The "real-time" field specifies whether the system can achieve real-time performance, as detailed in the original paper.

| Method | Input | Network | Dataset | Target | Automatic | Real-Time |
|---|---|---|---|---|---|---|
| FBA [11] | RGB + Trimap | One-stage CNN | Composition-1K | Anything | No | No |
| BMv2 [12] | RGB + BG | One-stage CNN + Refine | VideoMatte240K and PhotoMatte13K/85 | Human | No | Yes |
| MODNet [13] | RGB | Parallel two-stream CNN | SPD and PPM-100 | Human | Yes | Yes |
| RVM [14] | RGB | RNN | VideoMatte240K, ImageMatte, DVM, YouTubeVIS 2021, COCO and SPD | Human | Yes | Yes |
| VMFormer [15] | RGB | Transformer | ImageMatte, VideoMatte240K, BG20K and DVM | Human | Yes | Yes |
| VideoMatt [16] | RGB | CNN + Attention Mechanism | VideoMatte240, BG20K and DVM | Human | Yes | Yes |
| RGB-D Human Matting [17] | RGB-D | CNN | HDM-2K | Human | Yes | Yes |
| MaGGIe [18] | RGB | Transformer attention | Synthesized training data from several existing sources | Human | Yes | Yes |
| ControlMatting [19] | RGB | FasterNet | Adobe Image Matting, VideoMatte240K | Human | Yes | Yes |
| MAM [22] | RGB | SAM [20] + Vision Transformer [21] | Adobe Image Matting, Distinctions-646, AM2K, Human-2K, RefMatte | Anything | Yes | Yes |
Table 2. Comparison of depth-based video matting methods. The "automatic" field indicates if the system operates without human intervention. The "real-time" field specifies whether the system can achieve real-time performance, as detailed in the original paper.

| Method | Depth Source | Depth Processing | Trimap Processing | Matting Method | Automatic | Real-Time |
|---|---|---|---|---|---|---|
| Lu et al. [24] | Kinect | Region growing and bilateral filter | Variational level set based method | Bayesian matting | No | No |
| He et al. [25] | Kinect | Iteratively perform depth refinement and bi-layer classification | Iteratively perform depth refinement and bi-layer classification | Iterative transductive learning | Yes | Yes |
| Zheng et al. [26] | NJU2000 database | Directional joint bilateral filter | Hierarchical level set framework | Bayesian matting | Yes | No |
| Zhao et al. [28] | Kinect and PrimeSense 3D sensor | Shadow detection and adaptive temporal hole-filling | Adaptive background mixture with shadow detection | Closed-form temporal matting | Yes | Yes |
| DART [29] | JXNU-RGBD dataset | Bayesian manner | Bayesian inference | ViTMatte [30] | Yes | Yes |
| Ours | Intel RealSense D435i | Spatial and temporal hole-filling | Minimum method | Shared matting + proposed BSF | Yes | Yes |
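The "spatial and temporal hole-filling" listed for our depth processing can be pictured with the hedged sketch below, which assumes invalid RealSense depth pixels are encoded as zero and fills them first from the previous frame and then from a local median; the window size and fill order are illustrative choices, not the exact procedure of the paper.

```python
# Sketch of spatial + temporal hole-filling for a depth stream in which invalid
# pixels are encoded as 0. Window size and fill order are illustrative.
from typing import Optional

import numpy as np
from scipy.ndimage import median_filter

def fill_depth_holes(depth: np.ndarray, prev_depth: Optional[np.ndarray] = None) -> np.ndarray:
    filled = depth.astype(np.float32)
    holes = filled == 0
    if prev_depth is not None:
        # Temporal fill: reuse the previous frame's value where the pixel is invalid now.
        filled[holes] = prev_depth.astype(np.float32)[holes]
        holes = filled == 0
    # Spatial fill: replace remaining holes with a local median of the neighbourhood.
    filled[holes] = median_filter(filled, size=5)[holes]
    return filled

if __name__ == "__main__":
    frame = np.random.randint(0, 3000, size=(480, 640), dtype=np.uint16)
    print(fill_depth_holes(frame, prev_depth=frame).shape)  # (480, 640)
```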
Table 3. Comparison of video matting speed in frames per second (FPS).

| Resolution | Learning-Based (GPU + Down-Sample ¹) | Learning-Based (CPU + Down-Sample ¹) | Learning-Based (CPU) ² | Depth-Based (CPU) | Proposed BSF ¹,³ | MAM [22] ViT-Base (91M Parameters) | MAM [22] ViT-Large (308M Parameters) | MAM [22] ViT-Huge (636M Parameters) |
|---|---|---|---|---|---|---|---|---|
| 640 × 480 | 113.392 | 13.978 | 12.543 | 15.812 | 13.574 | 5.138 | 2.623 | 1.628 |
| 1280 × 720 | 102.944 | 13.502 | 4.567 | 8.185 | 5.158 | – | – | – |
| 1920 × 1080 | 82.503 | 14.704 | 1.839 | 4.221 | 2.924 | – | – | – |

¹ The down-sample ratio for the deep learning-based method is set to 1 when testing 480-resolution video, and to 0.5 and 0.25 when testing 720 and 1080 video, respectively. ² Note that setting the down-sample ratio to 1 regardless of resolution deviates from the official documentation and may affect matting results in general usage. ³ Uses the GPU for the deep learning-based method and the CPU for the depth-based method.
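Footnote 1 refers to the down-sample ratio exposed by the learning-based matting model. The sketch below illustrates how such a per-resolution ratio might be passed at inference time, assuming the torch.hub entry point of the RVM reference implementation (PeterL1n/RobustVideoMatting); the exact call signature should be verified against that repository.

```python
# Hedged sketch: applying the per-resolution down-sample ratios of footnote 1
# when running a recurrent matting model. Assumes RVM's torch.hub interface.
import torch

model = torch.hub.load("PeterL1n/RobustVideoMatting", "mobilenetv3").eval()

# Ratios from footnote 1: full resolution at 480p, 0.5 at 720p, 0.25 at 1080p.
ratios = {(640, 480): 1.0, (1280, 720): 0.5, (1920, 1080): 0.25}

rec = [None] * 4                           # recurrent states carried between frames
frame = torch.rand(1, 3, 720, 1280)        # dummy RGB frame with values in [0, 1]
with torch.no_grad():
    fgr, pha, *rec = model(frame, *rec, downsample_ratio=ratios[(1280, 720)])
print(pha.shape)                           # predicted alpha matte, (1, 1, 720, 1280)
```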
Table 4. Comparison of matting error of a barehanded person in normal illumination. MAD, MSE, and Connectivity are scaled by 10³, 10³, and 10⁻³, respectively. (The best performance in each metric is marked in bold font.)

| Error | Learning-Based [14] (Simple Background) | Depth-Based (Simple Background) | Proposed BSF (Simple Background) | MAM [22] (Simple Background) | Learning-Based [14] (Complex Background) | Depth-Based (Complex Background) | Proposed BSF (Complex Background) | MAM [22] (Complex Background) |
|---|---|---|---|---|---|---|---|---|
| MAD | **3.535** | 5.767 | 4.972 | 4.299 | 8.465 | 6.428 | **3.927** | 20.384 |
| MSE | 1.051 | 1.26 | 1.151 | **1.049** | 4.826 | 4.019 | **2.214** | 12.355 |
| Gradient | **3.466** | 3.839 | 3.697 | 3.586 | 17.15 | 16.887 | 16.467 | **15.487** |
| Conn | 3.807 | 4.419 | 3.872 | **3.778** | 35.977 | 27.99 | **16.726** | 84.188 |
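For reference, the MAD and MSE entries in Tables 4–6 can be reproduced with a few lines of NumPy, assuming alpha mattes with values in [0, 1] and that "scaled by 10³" means the raw error is multiplied by 10³; the Gradient and Connectivity metrics follow the benchmark of Rhemann et al. [35] and are omitted from this sketch.

```python
# Minimal sketch of the scaled MAD and MSE error metrics, assuming alpha
# mattes in [0, 1] and a x1000 reporting scale as stated in the captions.
import numpy as np

def mad_x1000(pred: np.ndarray, gt: np.ndarray) -> float:
    return float(np.abs(pred - gt).mean() * 1e3)

def mse_x1000(pred: np.ndarray, gt: np.ndarray) -> float:
    return float(np.square(pred - gt).mean() * 1e3)

if __name__ == "__main__":
    gt = np.random.rand(480, 640)
    pred = np.clip(gt + np.random.normal(0.0, 0.01, gt.shape), 0.0, 1.0)
    print(mad_x1000(pred, gt), mse_x1000(pred, gt))
```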
Table 5. Comparison of matting error of a person in low illumination. MAD, MSE, and Connectivity are scaled by 10³, 10³, and 10⁻³, respectively. (The best performance in each metric is marked in bold font.)

| Error | Learning-Based [14] | Depth-Based | Proposed BSF | MAM [22] |
|---|---|---|---|---|
| MAD | 4.242 | 5.164 | 4.085 | **2.364** |
| MSE | 1.937 | 2.273 | 1.768 | **0.673** |
| Gradient | 10.447 | 13.441 | 12.220 | **9.156** |
| Conn | 13.327 | 14.704 | 11.235 | **5.183** |
Table 6. Comparison of matting error of a human holding an object. MAD, MSE, and Connectivity are scaled by 10³, 10³, and 10⁻³, respectively. (The best performance in each metric is marked in bold font.)

| Error | Learning-Based [14] | Depth-Based | Proposed BSF | MAM [22] |
|---|---|---|---|---|
| MAD | 14.770 | 10.924 | **6.229** | 26.944 |
| MSE | 11.433 | 8.087 | **4.222** | 15.299 |
| Gradient | 11.018 | 16.095 | 11.314 | **10.463** |
| Conn | 46.647 | 34.625 | **19.163** | 75.684 |
Table 7. Qualitative demonstration of each matting method in scenarios with humans holding a variety of objects. MAD is scaled by 10³ for better readability. (Each row of the original table shows the input image, the alpha mattes of the four methods, and the ground-truth alpha matte; the images are omitted here, so only the MAD error of each method is listed per example.)

| Example | Learning-Based Alpha Matte [9] | Depth-Based Alpha Matte | BSF Alpha Matte | MAM [31] Alpha Matte |
|---|---|---|---|---|
| 1 (MAD error) | 25.5330 | 10.8968 | 6.4486 | 5.2741 |
| 2 (MAD error) | 22.1378 | 29.4624 | 3.2251 | 7.3851 |
| 3 (MAD error) | 12.4419 | 8.1649 | 7.1510 | 42.5277 |
| 4 (MAD error) | 4.2003 | 4.1350 | 3.3013 | 33.9496 |
Table 8. Comparison of video matting stability in static and dynamic background. Both MAD and dtSSD are scaled by 10³. (The best performance in each metric is marked in bold font.)

| Error | Learning-Based [14] (Static Background) | Depth-Based (Static Background) | Proposed BSF (Static Background) | MAM [22] (Static Background) | Learning-Based [14] (Dynamic Background) | Depth-Based (Dynamic Background) | Proposed BSF (Dynamic Background) | MAM [22] (Dynamic Background) |
|---|---|---|---|---|---|---|---|---|
| MAD | **1.543** | 1.749 | 1.705 | 4.135 | 5.043 | **3.288** | 5.044 | 7.020 |
| dtSSD | 4.623 | 5.666 | **4.608** | 4.648 | 12.411 | **10.261** | 12.400 | 14.274 |
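The dtSSD values in Tables 8 and 9 measure temporal stability. The sketch below follows one common formulation from the video matting benchmark of Erofeev et al. [36] (the per-frame RMS difference between the temporal derivatives of the predicted and ground-truth mattes, scaled by 10³ to match the tables); treat the exact normalization as an assumption.

```python
# Sketch of a dtSSD-style temporal stability metric for alpha matte sequences
# of shape (T, H, W) with values in [0, 1]. Normalization is an assumption.
import numpy as np

def dtssd_x1000(pred: np.ndarray, gt: np.ndarray) -> float:
    d_pred = np.diff(pred, axis=0)                 # frame-to-frame change, prediction
    d_gt = np.diff(gt, axis=0)                     # frame-to-frame change, ground truth
    per_frame = np.sqrt(np.square(d_pred - d_gt).mean(axis=(1, 2)))
    return float(per_frame.mean() * 1e3)

if __name__ == "__main__":
    gt = np.random.rand(10, 120, 160)
    pred = np.clip(gt + np.random.normal(0.0, 0.01, gt.shape), 0.0, 1.0)
    print(dtssd_x1000(pred, gt))
```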
Table 9. Comparison of video matting stability in dynamic foreground with humans holding objects. Both MAD and dtSSD are scaled by 10³. (The best performance in each metric is marked in bold font.)

| Error | Learning-Based [14] | Depth-Based | Proposed BSF | MAM [22] |
|---|---|---|---|---|
| MAD | 21.911 | 5.416 | **4.502** | 10.732 |
| dtSSD | 63.020 | 49.370 | **40.345** | 47.980 |
Table 10. Camera tracking experiments: Move and Pan. For each realistic camera movement (Move Forward, Move Backward, Move Right, Move Left, Pan Right, Pan Left), the original table shows the corresponding virtual camera pose in Unity coordinates and the resulting composite image; the images are omitted here.
Table 11. Camera tracking experiments: Tilt and Roll. For each realistic camera movement (Tilt Up, Tilt Down, Roll Right, Roll Left), the original table shows the corresponding virtual camera pose in Unity coordinates and the resulting composite image; the images are omitted here.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

