1 Introduction

Non-line-of-sight (NLOS) reconstruction recovers objects hidden behind corners or obstacles by analysing scattered light, and is useful in many applications, such as medical imaging, autonomous driving, and disaster relief.

Numerous NLOS reconstruction methods are based on active detection strategies [1,2,3,4]. However, these techniques often rely on expensive external light sources, and data collection is time-consuming. In contrast, passive NLOS reconstruction uses the weak light emitted by hidden objects without a controllable light source, which simplifies the hardware and makes the system less conspicuous. However, passive NLOS reconstruction faces significant challenges in complex scenarios, characterized by varying parameter settings of the reconstruction system, varying ambient light conditions, and complex hidden objects, because such environments cause significant attenuation, scattering, and shadowing of the object-related information.

Numerous traditional passive reconstruction methods [5,6,7,8,9,10,11] build an explicit forward model and then develop an effective algorithm to solve the corresponding inverse problem. For instance, Saunders et al. [5] established an affine forward model and presented a spatial-differencing method to produce reconstructions. These methods struggle when the ambient light intensity exceeds the useful signal of the hidden object. To improve robustness to ambient light, traditional methods focus on removing it, for example by recovering motion [12], linear modelling [13], or constructing an optimized preconditioning matrix [6]. However, traditional reconstruction methods require prior knowledge of the reconstruction system (e.g. the position and shape of the occluder and the position of the camera) and are only applicable to a fixed setup, which hinders their practical application when the shape and position of components such as the occluder and camera change. Instead of cancelling ambient light, we aim to introduce attention mechanisms that enhance object-related information and thereby improve recovery under different settings.

Recently, NLOS reconstruction methods based on deep learning have shown remarkable visual gains over traditional methods, owing to their powerful feature extraction ability. Existing deep learning-based passive reconstruction methods can be roughly divided into two categories: hybrid methods combined with physical models [14, 15] and end-to-end learning methods [16,17,18,19,20,21,22,23,24,25,26,27,28,29]. The former can perform effectively even with limited or no data. However, they have certain limitations, such as requiring solid domain-specific knowledge and prior knowledge of the reconstruction system. For instance, Mu et al. [14] integrated highly domain-specific knowledge, including physics priors related to wave propagation and volume rendering, into a neural network to enhance the representation capability of a conditional neural scene. Wu et al. [15] introduced a physics-informed untrained deep decoder network (UNN), which effectively improved reconstruction quality without data acquisition, particularly under ambient light. Nevertheless, it still relies on prior knowledge of the reconstruction system for computations such as the light transport matrix. The latter end-to-end methods, which do not require prior knowledge of the reconstruction system, can be broadly categorized into two types: convolutional neural networks (CNNs) [16,17,18,19,20,21,22,23,24,25,29] and generative models [26,27,28]. Among these, CNNs, especially U-Net and its variants, are the most widely used owing to their strong performance in tasks such as image segmentation and image restoration. For example, Chen et al. [22] directly used a U-Net to restore occluded objects at distances beyond 50 m at a reconstruction rate of 14 FPS. Wang et al. [29] introduced the SPIR-Net network to simultaneously retrieve the image and position of NLOS objects from a single-shot speckle pattern. However, these methods are less effective because they ignore the fact that complex scenes attenuate the object-related signal and treat the useful information of hidden objects on an equal footing with noisy information, such as ambient light, which hinders high-quality recovery. Furthermore, most existing deep learning-based reconstruction datasets focus on simple scenes. Only a recent work [19] created an NLOS-Passive dataset targeting complex scenes, but its complexity is still limited: for example, it does not consider the practical situation in which ambient lighting cannot be controlled, which changes the ambient light pattern on the secondary surface.

In this paper, we focus on overcoming the above shortcomings. We first introduce the attention mechanism into passive NLOS reconstruction tasks to automatically capture the useful information of hidden objects in the measured images. In particular, we propose an attention-based encoder–decoder (AED) network without a skip connection scheme to improve the quality of passive NLOS reconstruction in complex scenes. Specifically, we introduce an attention in attention (A2B) module [30], which can adaptively prune attention layers, into the proposed network to strengthen the useful information of hidden objects. The A2B module comprises an attention branch, a non-attention branch, and an additional attention dropout module. The attention branch enables the network to pay more attention to the useful information of hidden objects. The non-attention branch learns the information neglected by the attention branch. The attention dropout module generates dynamic weights for the attention and non-attention branches, making full use of the information from both branches and thus enhancing the useful information of hidden objects. Furthermore, we build an automated acquisition system, as shown in Fig. 1, and construct several datasets of complex scenes, including fixed and mixed setups, complex hidden objects in a dark environment, and varying ambient light conditions. Our proposed AED network performs well on these complex datasets: it achieves a PSNR above 23 dB and an SSIM above 0.9 in uncalibrated setups in a dark environment, improves on the U-Net network in both PSNR and SSIM on more sophisticated datasets, and remains above 17.9 dB in PSNR and above 0.8 in SSIM on the MNIST dataset under varying ambient light conditions. The system also exhibits good generalizability, achieving a PSNR of 10.92 dB and an SSIM of 0.4997 even in scenarios where different numbers of people walk around the system and cast shadows on the secondary surface.

Fig. 1

Passive NLOS reconstruction in complex scenes. The black and red arrows refer to the light path and data flow, respectively

2 Proposed method

In this section, we first describe the passive NLOS reconstruction problem that we consider. Then, we give a detailed description of the attention-based encoder–decoder (AED) network for passive NLOS reconstruction in complex scenes.

Fig. 2

Overview of our proposed attention-based encoder–decoder (AED) network for NLOS reconstruction in complex scenes. The attention branch in the A2B module is used to enhance object-related information

2.1 Problem formulation

Figure 1 shows a classic passive NLOS reconstruction scene. The hidden object, also called the original image, is displayed on the monitor screen, and the occluder partially blocks the emitted light, producing a penumbra on the secondary surface. The penumbra captured by the camera is called the measured image. The goal of NLOS reconstruction is to recover the hidden object from the measured image. For a point light source at position s on the monitor screen, the measured irradiance intensity at a point d on the secondary surface can be expressed as:

$$\begin{aligned} y(d) = \int _{s \in S} {x(s)A(s,d,{p_0})\text {d}s} + b(d), \end{aligned}$$
(1)

where x(s) is the radiosity of the monitor screen at point s; b(d) is the noise contribution, such as system modelling errors and background noise; the integration over all screen pixels S gives the combined contribution of the monitor screen at d; and \(A(s,d,{p_0})\) describes the optical transport from the point light source s to the point d with the occluder positioned at \({p_0}\):

$$\begin{aligned} A(s,d,{p_0}) = \mu (s,d)\frac{1}{{\left\| s - d \right\| {^2}}}G(s,d)V(s,d,{p_0}) \end{aligned}$$

Here, \(\mu (s,d)=\cos ^{18}(\angle (s,d))\) is the radiometric model of the monitor screen with respect to the viewing angle; \(\left\| \cdot \right\| \) denotes the Euclidean norm; and G is the Lambertian bidirectional reflectance distribution function (BRDF):

$$\begin{aligned} G(s,d) =\cos [\angle (d-s,n_{s})]\cos [\angle (s-d,n_{d})] \end{aligned}$$

where \(n_{s}\) and \(n_{d}\) are the surface normals of the monitor screen and the secondary surface, respectively, and \(V(s,d,{p_0})\) is a Boolean-valued visibility function that takes the value 1 when the light path from s to d is not obstructed by the occluder and 0 otherwise. Equation (1) can be discretized as:

$$\begin{aligned} \textbf{y} = A(p_0)\textbf{x} + \textbf{b} \end{aligned}$$
(2)

where \(\textbf{y}\) and \(\textbf{x}\) are the vectorized measured image and vectorized original image, respectively. \(A(p_{0})\) is the light transport matrix; \(\textbf{b}\) represents the noise term. NLOS reconstruction aims to learn an inverse mapping that reconstructs the hidden object \(\textbf{x}\) from the measured image \(\textbf{y}\), which can be formulated as:

$$\begin{aligned} \textbf{x}=A^{-1}\textbf{y}. \end{aligned}$$
(3)

where the form of \(A^{-1}\) depends on the reconstruction method. For example, if the least-squares method is used, \(A^{-1} = (A^{T}(p_0)A(p_0))^{-1}A^{T}(p_0)\).
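
To make the discretized model concrete, the following is a minimal NumPy sketch of Eq. (2) and the least-squares inversion in Eq. (3). The toy geometry, the interval occluder test, and the single cosine term standing in for \(\mu (s,d)G(s,d)\) are illustrative assumptions rather than the calibration of our actual system.

```python
import numpy as np

def build_transport(scene_pts, wall_pts, occ_z, occ_xmin, occ_xmax):
    """Toy light-transport matrix A(p0) for a 1-D screen and a 1-D wall.

    scene_pts : (S, 2) points s on the monitor screen
    wall_pts  : (D, 2) points d on the secondary surface
    The occluder is the interval [occ_xmin, occ_xmax] at height occ_z.
    """
    A = np.zeros((len(wall_pts), len(scene_pts)))
    for j, d in enumerate(wall_pts):
        for i, s in enumerate(scene_pts):
            r = d - s
            dist2 = float(np.dot(r, r))
            # visibility V(s, d, p0): does the ray s -> d cross the occluder interval?
            t = (occ_z - s[1]) / (d[1] - s[1])
            x_hit = s[0] + t * (d[0] - s[0])
            visible = not (0.0 < t < 1.0 and occ_xmin <= x_hit <= occ_xmax)
            # single cosine falloff standing in for mu(s, d) * G(s, d)
            cos_term = abs(r[1]) / np.sqrt(dist2)
            A[j, i] = visible * cos_term / dist2
    return A

# screen at y = 0, wall at y = 1, occluder at height y = 0.5 (all values illustrative)
scene_pts = np.stack([np.linspace(-0.5, 0.5, 32), np.zeros(32)], axis=1)
wall_pts = np.stack([np.linspace(-0.5, 0.5, 64), np.ones(64)], axis=1)
A = build_transport(scene_pts, wall_pts, 0.5, 0.0, 0.3)

x = np.random.rand(32)                      # hidden scene radiosity
y = A @ x + 0.01 * np.random.randn(64)      # Eq. (2): measurement with noise b
x_ls = np.linalg.pinv(A) @ y                # least-squares estimate, Eq. (3)
```

In practice the penumbra is a 2-D image and \(A(p_0)\) must be recomputed whenever the occluder or camera moves, which is exactly the prior knowledge that the learning-based approach below avoids.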

2.2 Network architecture of AED model

Commonly used reconstruction methods, such as U-Net and its variants, are less effective because they ignore the fact that complex scenes attenuate the intensity of object-related information and treat the useful information from hidden objects on an equal footing with noisy signals, such as ambient light. To address this issue, we first introduce attention mechanisms, represented by the attention branch, into NLOS reconstruction tasks to enhance object-related information. Furthermore, considering that attention mechanisms might neglect some effective information in certain layers, we incorporate a learnable attention dropout module to dynamically adjust the weights between the two branches, ensuring that both the attention and non-attention mechanisms are utilized effectively to enhance the object-related useful information.

Our proposed AED reconstruction model is a typical CNN-based encoder–decoder framework, which takes the measured images y as inputs and outputs the reconstructed original images x. The network structure is illustrated in Fig. 2. The input measured images and the output original images are resized to \(128\times 128\) pixels. The blue part acts as the encoder and comprises a convolutional layer and four stacked encoding blocks. The convolutional layer, with a filter size of \(1\times 1\), is used to change the number of channels. The four stacked encoding blocks generate the high-level feature representations; each consists of an attention in attention (A2B) module followed by a pooling layer for downsampling. After the encoding process, the deep features \(c_{5}\) are extracted from the measured image. This process can be described as follows:

$$\begin{aligned} c_{5}=f_{E_{4}}(f_{E_{3}}(f_{E_{2}}(f_{E_{1}}(f_\textrm{chg}(y))))), \end{aligned}$$
(4)

where \(f_\textrm{chg}(\cdot )\) is a convolutional layer with a filter size of \(1\times 1\), and \(f_{E_{i}}\) (\(i = 1, 2, 3, 4\)) denotes the i-th encoding block.

The green part acts as the decoder and comprises four stacked decoding blocks and a convolutional layer. Each decoding block contains an A2B module followed by a deconvolutional layer that expands the size of the feature maps. The decoding process restores the reconstructed image from the deep features \(c_{5}\). Similarly, the decoder process can be represented as:

$$\begin{aligned} x=f_\textrm{chg}(f_{D_{4}}(f_{D_{3}}(f_{D_{2}}(f_{D_{1}}(c_{5}))))), \end{aligned}$$
(5)

where \(f_{D_{i}}\) (\(i= 1, 2, 3, 4\)) denotes the i-th decoding block, \(f_\textrm{chg}(\cdot )\) represents a convolutional layer with a \(1\times 1\) kernel size, and x denotes the final reconstructed image. In general, our proposed AED model \(f_{a}(\cdot )\), which is designed to learn the inverse mapping \(A^{-1}\), can be written in the following form:

$$\begin{aligned} x=f_{a}(y). \end{aligned}$$
(6)
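
As a reference for how Eqs. (4)–(6) compose, the following PyTorch sketch mirrors the encoder–decoder structure without skip connections. The channel widths are illustrative, and the `a2b` argument is a placeholder for the attention in attention module of Sect. 2.3 (a sketch of that module is given after Sect. 2.3).

```python
import torch
import torch.nn as nn

class AED(nn.Module):
    """Encoder-decoder without skip connections, mirroring Eqs. (4)-(6)."""
    def __init__(self, a2b, channels=(16, 32, 64, 128, 256)):
        super().__init__()
        c = channels
        self.f_chg_in = nn.Conv2d(1, c[0], kernel_size=1)        # 1x1 conv changes the channel count
        self.encoders = nn.ModuleList([                          # f_{E_i}: A2B module + downsampling
            nn.Sequential(a2b(c[i], c[i + 1]), nn.MaxPool2d(2)) for i in range(4)])
        self.decoders = nn.ModuleList([                          # f_{D_i}: A2B module + upsampling
            nn.Sequential(a2b(c[4 - i], c[3 - i]),
                          nn.ConvTranspose2d(c[3 - i], c[3 - i], 2, stride=2)) for i in range(4)])
        self.f_chg_out = nn.Conv2d(c[0], 1, kernel_size=1)

    def forward(self, y):                  # y: measured image, shape (B, 1, 128, 128)
        c5 = self.f_chg_in(y)
        for enc in self.encoders:          # Eq. (4): extract the deep features c5
            c5 = enc(c5)
        x = c5
        for dec in self.decoders:          # Eq. (5): restore the image from c5
            x = dec(x)
        return self.f_chg_out(x)           # Eq. (6): x = f_a(y)
```

Because no encoder activations are forwarded to the decoder, the reconstruction relies entirely on the deep features \(c_{5}\); the skip-connection ablation in Sect. 3.2.1 examines exactly this design choice.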

Given a training dataset \({\{y_{i},x_{i}\}}_{i=1}^{N}\), the objective of training the proposed AED network is to minimize the mean squared error (MSE) loss [17]:

$$\begin{aligned} L(\theta )=\frac{1}{N}\sum \limits _{i = 1}^N {\left\| {x_i} - {f_a}({y_i};\theta ) \right\| _2^2} \end{aligned}$$
(7)

where \(\left\| \cdot \right\| _2\) denotes the \(L_{2}\) norm and \(\theta \) denotes all trainable network parameters.
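
As a small sketch, the loss in Eq. (7) can be transcribed directly; note that PyTorch's built-in `nn.MSELoss` averages over all pixels instead of summing the squared error per image, so the two differ only by a constant scale factor.

```python
import torch

def aed_loss(x_hat, x):
    """Eq. (7): batch mean of per-image squared L2 reconstruction errors."""
    return ((x_hat - x) ** 2).flatten(start_dim=1).sum(dim=1).mean()

# example: a batch of eight 128x128 reconstructions and ground-truth images
loss = aed_loss(torch.rand(8, 1, 128, 128), torch.rand(8, 1, 128, 128))
```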

2.3 Attention in attention module

As shown in Fig. 3, the A2B module comprises an attention branch, a non-attention branch, and an additional attention dropout module. The non-attention branch is an improved bottleneck layer [17], as shown in Fig. 4, which is used to learn the information ignored by the attention branch. The bottleneck layer contains two \(3\times 3\) convolution layers, a \(7\times 7\) convolution layer, and three nonlinear activation functions followed by a batch normalization layer. Next, we detail the attention branch and the attention dropout module; a code sketch of the complete A2B block is given after the list below.

Fig. 3

The architecture of the attention in attention (A2B) module. \(\bigoplus \) denotes the weighted sum of the two branches, and \(\bigotimes \) denotes the multiplication operation

Fig. 4

The architecture of the bottleneck layer [17]. \(\bigoplus \) represents the summation operation

  1. The attention branch: The attention branch is proposed to enable the network to pay more attention to the useful information of hidden objects. It is divided into two sub-branches: a mask branch and a trunk branch. Here, the trunk branch is an improved bottleneck layer, and we use \(T(x_{n-1})\) to denote its output for the input \(x_{n-1}\). The mask branch mainly uses pixel channel-spatial attention [31], which employs a convolutional layer with a \(1\times 1\) kernel followed by a sigmoid function to yield a mask \(M(x_{n-1})\) of the same size that softly weights the features \(T(x_{n-1})\). The output of the attention branch H is:

    $$\begin{aligned} H_{i,j}(x_{n-1}) = M_{i,j}(x_{n-1}) \times T_{i,j}(x_{n-1}) \end{aligned}$$
    (8)

    where i and j index the spatial positions and the channels, respectively, and \(\times \) denotes element-wise multiplication. The mask branch can serve as a gradient update filter during backpropagation. Given the input feature \(x_{n-1}\), the gradient of the masked output with respect to the trunk branch parameters is:

    $$\begin{aligned} \frac{\partial M(x_{n-1},\theta _{m})T(x_{n-1}, \theta _{t})}{\partial \theta _{t}}= M(x_{n-1},\theta _{m})\frac{\partial T(x_{n-1}, \theta _{t})}{\partial \theta _{t}} \end{aligned}$$
    (9)

    where \(\theta _{m}\) and \(\theta _{t}\) are the mask branch and the trunk branch parameters, respectively. This property enables the AED model to learn more useful information.

  2. Attention dropout module: For a given input \(x_{n-1}\), we adopt two fully connected layers (\(W_{1}\) and \(W_{2}\)) with a ReLU activation after a global average pooling layer \(W_\textrm{avg}\), followed by a softmax function, to adaptively assign weights \(P_{1}\) and \(P_{2}\) to the attention and non-attention branches, respectively. Specifically, the dynamic weights \(P(x_{n-1})=(P_{1},P_{2})\) are computed as follows:

    $$\begin{aligned} P(x_{n-1})=\frac{e^{W_{2}(\textrm{ReLU}(W_{1}(W_\textrm{avg}(x_{n-1}))))}}{\sum e^{W_{2}(\textrm{ReLU}(W_{1}(W_\textrm{avg}(x_{n-1}))))}} \end{aligned}$$
    (10)

    The formula for obtaining the enhanced feature map \(x_{n}\) from the input feature map \(x_{n-1}\) can be written as follows:

    $$\begin{aligned} x_{n}=f_{1\times 1}(x_{n-1}^\textrm{att}\cdot P_{1}+x_{n-1}^{\text {non-att}}\cdot P_{2}), \end{aligned}$$
    (11)

    where \(x_{n-1}^{\text {non-att}}\) and \(x_{n-1}^\textrm{att}\) represent the outputs of the non-attention and attention branches, respectively, and \(f_{1\times 1}\) is a convolutional layer with a filter size of \(1\times 1\). Here, we apply a sum-to-one constraint to bound the dynamic weights, that is, \(P_{1}+P_{2}=1\).
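
Putting the two branches and the attention dropout module together, the A2B block of Fig. 3 could be implemented roughly as follows. The bottleneck layout, the reduction ratio in the dropout branch, and the ordering of convolution, batch normalization, and activation are simplifying assumptions; they abbreviate the improved bottleneck layer of Fig. 4 and the pixel channel-spatial attention of [31].

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Simplified stand-in for the improved bottleneck layer of Fig. 4."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 7, padding=3), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.skip = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.body(x) + self.skip(x)          # residual summation (the circled plus in Fig. 4)

class A2B(nn.Module):
    """Attention in attention block: attention branch, non-attention branch, attention dropout."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.trunk = Bottleneck(in_ch, out_ch)                                # T(x): trunk sub-branch
        self.mask = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), nn.Sigmoid())  # M(x): soft mask, Eq. (8)
        self.non_att = Bottleneck(in_ch, out_ch)          # branch for information the mask suppresses
        self.dropout = nn.Sequential(                     # attention dropout: dynamic weights, Eq. (10)
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, max(in_ch // 2, 2)), nn.ReLU(inplace=True),
            nn.Linear(max(in_ch // 2, 2), 2), nn.Softmax(dim=1))
        self.fuse = nn.Conv2d(out_ch, out_ch, 1)          # f_{1x1} in Eq. (11)

    def forward(self, x):
        att = self.mask(x) * self.trunk(x)                # Eq. (8): masked trunk features
        non_att = self.non_att(x)
        p = self.dropout(x)                               # p[:, 0] + p[:, 1] = 1 via the softmax
        p1 = p[:, 0].view(-1, 1, 1, 1)
        p2 = p[:, 1].view(-1, 1, 1, 1)
        return self.fuse(att * p1 + non_att * p2)         # Eq. (11): weighted fusion
```

With this class, the encoder–decoder skeleton of Sect. 2.2 can be instantiated as `AED(a2b=A2B)`.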

3 Experiments

3.1 Datasets and training details

  1. Datasets: The success of deep learning methods for the NLOS reconstruction problem strongly depends on the availability of suitable datasets, that is, pairs of original images and corresponding measured images. Specifically, the original images x from six benchmark datasets, MNIST [32], Human posture [33], Handgesture-1 [34], Handgesture-2 [35], Fashion-MNIST [36], and Cifar10 [37], are displayed sequentially on the monitor screen of the NLOS reconstruction system, as shown in Fig. 1, and the corresponding measured images are collected. Of these datasets, the MNIST dataset is popular because of its simple distribution: each image contains a single digit from 0 to 9 on a pure black background in a grayscale colour space. In this study, we used this dataset to investigate the robustness of the proposed AED model to the parameter settings of the reconstruction system and to varying ambient light conditions. Specifically, in the former case, we first placed the NLOS reconstruction system in a dark environment and then collected data in both fixed and mixed NLOS setups. In the latter case, keeping the parameter settings of the reconstruction system unchanged, we collected data under varying ambient lighting conditions. The remaining five datasets were used to evaluate the reconstruction performance of the AED model for complex hidden objects. The backgrounds and colour spaces of the Human posture and Fashion-MNIST datasets are similar to those of the MNIST dataset, but their objects are more complex. Compared to the MNIST dataset, the gesture objects of Handgesture-1 are more complex, its background is not monochromatic and does not change with the image, and its colour space is RGB. In contrast to Handgesture-1, Handgesture-2 has more object types, a background colour close to that of the object, and a background that varies with the image. The Cifar10 dataset contains real objects in the physical world with different scales and features, and its background and colour space are more complex than those of all the other datasets. A more detailed description of the six datasets can be found in the paragraph "Results on more sophisticated datasets" in the Supplementary Information of [38]. We used the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) as the evaluation metrics.

  2. Training details: All the experiments were implemented in PyTorch 1.7. We used a stochastic gradient descent (SGD) optimizer [39] with a momentum of 0.9 to train our model for 200 epochs, with eight images per minibatch. We set the initial learning rate to 0.001 with a linear warmup schedule [40] and a step decay schedule. In addition, we adopted data augmentation, such as adding Gaussian noise and cropping, to avoid overfitting (a configuration sketch follows this list).
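
Under this schedule, the optimizer and learning-rate configuration could be set up roughly as below. The warmup length, the step-decay milestones, the augmentation strength, and the dummy data loader are assumptions not fixed by the text, and `AED`, `A2B`, and `aed_loss` refer to the sketches in Sect. 2.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader, TensorDataset

model = AED(a2b=A2B)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

warmup_epochs, total_epochs = 5, 200                    # warmup length is an assumption
def lr_lambda(epoch):
    if epoch < warmup_epochs:                           # linear warmup
        return (epoch + 1) / warmup_epochs
    return 0.1 ** (epoch // 100)                        # illustrative step decay
scheduler = LambdaLR(optimizer, lr_lambda)

# stand-in loader: (measured, original) image pairs, minibatch size 8
pairs = TensorDataset(torch.rand(64, 1, 128, 128), torch.rand(64, 1, 128, 128))
train_loader = DataLoader(pairs, batch_size=8, shuffle=True)

for epoch in range(total_epochs):
    for y, x in train_loader:
        y = y + 0.01 * torch.randn_like(y)              # Gaussian-noise augmentation (illustrative strength)
        optimizer.zero_grad()
        loss = aed_loss(model(y), x)                    # Eq. (7)
        loss.backward()
        optimizer.step()
    scheduler.step()
```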

Table 1 Quantitative results (PSNR(dB)/SSIM) of several reconstruction algorithms on the MNIST dataset in the fixed or mixed setup

3.2 Experimental results

3.2.1 Dark environment

  1. To evaluate our AED method on the MNIST dataset and compare it with the conventional method [5] and the U-Net method [17], we first collected NLOS data with fixed parameter settings; that is, the positions of the occluder, monitor, and camera, as well as the shape of the occluder, remained unchanged during the acquisition process (a more detailed description of the fixed setup is given in [38]). This case is referred to as the fixed setup. We then fed the collected data into the traditional, U-Net, and AED methods to produce the corresponding reconstructions. Table 1 lists the quantitative results on the 10,000 test images. As can be seen, our method achieves a PSNR of 25.242 dB and an SSIM of 0.9424, which are higher than those of the traditional method (10.813 dB and 0.0772) and U-Net (24.913 dB and 0.8886). This indicates the effectiveness of the AED method for NLOS reconstruction. We further compared our AED with U-Net and the traditional method in terms of computational cost, including the number of parameters and the inference time for a single image. The results are presented in Table 2. Owing to the A2B module, the number of parameters of our AED increases by 6.22 M compared with U-Net. The traditional method requires an additional 2 min for image reconstruction, whereas the inference time of the AED is equivalent to that of U-Net and is under 0.2 s. This indicates that our model is effective while maintaining a speed advantage.

  2. In real applications, the parameter settings of the reconstruction system can vary; in other words, the positions of the occluder, monitor, and camera, as well as the shape of the occluder, may change. This case is referred to as the mixed setup (refer to the detailed description of the mixed setup in [38]). To analyse the robustness of the AED method to parameter settings, we collected NLOS data that varied over a range of settings to train the model. The traditional method constructs a forward transport model based on prior knowledge of the reconstruction system, such as a rectangular occluder, whereas in the mixed setup this prior knowledge changes, for example the shape of the occluder changes from rectangular to circular or cup-shaped. Therefore, the traditional method cannot operate in the mixed setup. As presented in Table 1, our AED method achieves the best performance (23.624 dB/0.9239) in the mixed setup and outperforms the U-Net method (23.218 dB/0.8568) in both PSNR and SSIM. This result demonstrates that the proposed AED method is robust to the parameter settings of the NLOS system.

Table 2 The number of parameters and inference time for a single image across different methods
Table 3 The PSNR and SSIM results of several reconstruction algorithms on five complex datasets in a fixed setup. AED-W and AED-O denote the network with and without skip connections from the encoder to the decoder, respectively
Fig. 5

Examples of different methods on five complex datasets. a Measured images. b Original images. c and d are the reconstructions for the U-Net method and AED method, respectively. All data are collected in a fixed setup

Table 4 The weights of the attention branch and the non-attention branch for the bending pose case
  3. To verify the generalizability of our method, we selected Human posture, Handgesture-1, Handgesture-2, Fashion-MNIST, and Cifar10 as example datasets and conducted experiments in a fixed setup. As listed in Table 3, our AED method shows improvements in both PSNR and SSIM over the mainstream U-Net method on these more complex datasets, demonstrating that the attention mechanism helps the network learn more useful information about hidden objects and achieve better reconstruction results. Although the numerical improvement is not large, our method performs better than U-Net visually, as shown in Fig. 5. Specifically, for Human posture, we produced a sharper appearance than U-Net. For Handgesture-1, we successfully recovered the details of the thumb and wrist, whereas U-Net could not. Our method also recovered finger shapes better on the Handgesture-2 dataset, where the colours of the background and object are close, and restored the shape of the hidden object on the Fashion-MNIST dataset, whereas U-Net failed in both cases. Compared with the results of U-Net on Cifar10, our method restored the colour information to a great extent and identified hidden objects. These results demonstrate the superior generalizability of our method to complex hidden objects.

As shown in Table 4, we observe that the weights of the attention branch and non-attention branch vary significantly at different blocks, reflecting differences in their ability to extract object-related useful information. For instance, in Block1, Block2, and Block7, the attention branch’s weights are higher than those of the non-attention branch, indicating its superior information extraction ability. Conversely, in Block5, Block6, and Block8, the increased weights of the non-attention branch suggest that it plays a valuable role in capturing information that is ignored by the attention branch. By fully utilizing the information from both the attention and non-attention branches, the network can preserve more important object-related useful information.

Fig. 6

The attention heat maps of the bending pose in the Human posture case. Block i denotes the i-th attention in attention block in the AED model. AFM and AAM denote the averaged feature map and the averaged attention map, respectively. The first, third, fourth, and fifth rows represent the averaged feature maps of the input, attention branch, non-attention branch, and output of the attention in attention module, respectively. The second row shows the averaged attention map of the attention branch, in which a brighter colour indicates a greater attention weight. The red, white, and blue areas indicate positive, negative, and zero values, respectively

  4. Considering that the data distributions of the input and output of the network are different, we also verified the effect of the skip connection scheme from the encoder to the decoder. The experimental results in Table 3 show that the AED method without the skip connection scheme achieves results comparable to those of the variant with this scheme; this demonstrates that using the scheme does not lead to superior reconstruction performance.

3.2.2 Varying ambient light environment

In real-life situations, ambient light cannot be controlled. Therefore, we evaluated the robustness of the AED method under varying ambient light conditions. Specifically, we studied the effect on reconstruction performance when people walk near the passive NLOS reconstruction system and cast a penumbra on the secondary surface. We placed the NLOS reconstruction system shown in Fig. 7a in an exhibition hall with an intense ambient light of 56.7 Lux/Fc. Under this lighting condition, the PSNR and SSIM of the AED method were 17.689 dB and 0.6894, respectively. Subsequently, we added different numbers of opaque plates, each 1.7 m in height, to the experimental setup (Fig. 7b–c) to simulate the penumbras cast on the secondary surface when people are in the vicinity of the system. As shown in Table 5, the PSNR and SSIM of the AED model trained in the setting without a plate (denoted AED-0) both decreased when people walked near the system, demonstrating that the AED model is not robust when trained under a fixed lighting condition. For improved robustness, we trained a new AED model under mixed lighting conditions, referred to as AED-M. As presented in Table 5, the proposed AED-M model remains above 17.9 dB in PSNR and 0.8 in SSIM when people walk near the system, showing that it is robust to varying ambient light conditions. The robustness of our AED model against changes in ambient light can be attributed to two key factors. The first is the diversity of our training dataset, which comprises a substantial collection of images captured under varying ambient light conditions; this diversity allows the model to learn about varying ambient lighting autonomously, making it robust in real-world scenarios. The second is the model architecture: our A2B module helps the network focus more on object-related information and significantly reduces its attention to ambient lighting, which improves the method's ability to adapt to changes in lighting conditions.

Table 5 Summary of the PSNR and the SSIM of AED models trained under varying ambient lighting conditions
Fig. 7

Passive NLOS reconstruction system in different ambient lighting conditions. (Reproduced with permission from Ref. [38], copyright \(\copyright \) Communications Physics, 2021.) a Zero, b one, and c two opaque plates placed around the reconstruction system to cast shadows on the secondary surface

We also conducted an experiment in an exhibition hall (Fig. 8) to evaluate the robustness of the AED-M model under real ambient lighting that varies owing to the movement of nearby people. We developed a script to automatically display 1000 test images from the MNIST dataset on the monitor screen and to reconstruct the NLOS object as different numbers of people walk around the system. The proposed AED-M model achieves a PSNR of 10.92 dB and an SSIM of 0.4997. These results suggest that our model generalizes well.

Fig. 8

Different numbers of people walk around the system to test the AED-M model

4 Conclusions

In this study, we propose an attention-based encoder–decoder network to boost reconstruction quality in complex scenes. To avoid treating object-related information and noise equally, we use the attention in attention (A2B) module to help the network focus on object-related useful information. We also find that the skip connection scheme does not contribute meaningfully to improving restoration quality. In addition, we create several datasets of complex scenes to evaluate the performance of the AED method. The experimental results demonstrate that the proposed AED method achieves good recovery quality on our constructed datasets. In the future, we plan to combine visible-light information with other types of electromagnetic waves to further improve passive NLOS reconstruction in complex scenes.