1 Introduction

Non-line-of-sight (NLOS) reconstruction recovers objects hidden behind corners or obstacles by analysing scattered light, and is useful in many applications, such as medical imaging, autonomous driving, and disaster relief.

Numerous NLOS reconstruction methods are based on active detection strategies [1,2,3,4]. However, these techniques often rely on expensive external light sources, and data collection is time-consuming. In contrast, passive NLOS reconstruction uses the weak light emitted by hidden objects without a controllable light source, which simplifies the hardware and makes the system less conspicuous. However, passive NLOS reconstruction faces significant challenges in complex scenarios, characterized by varying parameter settings of the reconstruction system, varying ambient light conditions, and complex hidden objects, because such environments cause significant attenuation, scattering, and shadowing of the object-related information.

Numerous traditional passive reconstruction methods [5,6,7,8,9,10,11] build an explicit forward model and then develop an effective algorithm to solve the corresponding inverse problem. For instance, Saunders et al. [5] established an affine forward model and presented a spatial-differencing method to produce reconstructions. These methods struggle when the ambient light intensity exceeds the useful signal of the hidden object. To improve robustness to ambient light, traditional methods focus on removing it, for example by recovering motion [12], linear modelling [13], or constructing an optimized preconditioning matrix [6]. However, traditional reconstruction methods require prior knowledge of the reconstruction system (e.g. the position and shape of the occluder and the position of the camera) and are only applicable to a fixed setup, which hinders their practical application when the shape and position of components such as the occluder and camera change. Instead of cancelling ambient light, we aim to introduce attention mechanisms that enhance object-related information and thereby improve recovery under different settings.

Recently, NLOS reconstruction methods based on deep learning have shown remarkable visual gains over traditional methods, owing to their powerful feature extraction ability. Existing deep learning-based passive reconstruction methods can be roughly divided into two categories: hybrid methods combined with physical models [14, 15] and end-to-end learning methods [16,17,18,19,20,21,22,23,24,25,26,27,28,29]. The former can perform effectively even with limited or no data. However, they have certain limitations, such as requiring solid domain-specific knowledge and prior knowledge of the reconstruction system. For instance, Mu et al. [14] integrated highly domain-specific knowledge, including physics priors related to wave propagation and volume rendering, into a neural network to enhance the representation capability of a conditional neural scene. Wu et al. [15] introduced a physics-informed untrained deep decoder network (UNN), which effectively improved reconstruction quality without data acquisition, particularly under ambient light. Nevertheless, it still relies on prior knowledge of the reconstruction system for computations such as the light transport matrix. The latter end-to-end methods, which do not require prior knowledge of the reconstruction system, can be broadly categorized into two types: convolutional neural networks (CNNs) [16,17,18,19,20,21,22,23,24,25,29] and generative models [26,27,28]. Among these, CNNs, especially U-Net and its variants, are the most widely used owing to their strong performance in tasks such as image segmentation and image restoration. For example, Chen et al. [22] directly used a U-Net to restore occluded objects at distances beyond 50 m at a reconstruction rate of 14 FPS. Wang et al. [29] introduced the SPIR-Net network to simultaneously retrieve the image and position of NLOS objects from a single-shot speckle pattern. However, these methods are less effective because they ignore the fact that complex scenes attenuate the object-related signal and treat the useful information of hidden objects on an equal footing with noisy information, such as ambient light, which hinders high-quality recovery. Furthermore, most existing deep learning-based reconstruction datasets focus on simple scenes. Only a recent work [19] created an NLOS-Passive dataset targeting complex scenes, but its complexity is still limited: for example, it does not consider the practical situation in which ambient lighting cannot be controlled, which changes the ambient light pattern on the secondary surface.

In this paper, we focus on overcoming the above shortcomings. We first introduce the attention mechanism into passive NLOS reconstruction tasks to automatically capture the useful information of hidden objects in the measured images. In particular, we propose an attention-based encoder–decoder (AED) network without a skip connection scheme to improve the quality of passive NLOS reconstruction in complex scenes. Specifically, we introduce an attention in attention (A2B) module [30], which can adaptively prune attention layers, into the proposed network to strengthen the useful information of hidden objects. The A2B module comprises an attention branch, a non-attention branch, and an additional attention dropout module. The attention branch enables the network to pay more attention to the useful information of hidden objects. The non-attention branch learns the information neglected by the attention branch. The attention dropout module generates dynamic weights for the attention and non-attention branches, making full use of the information from both branches and thus enhancing the useful information of hidden objects. Furthermore, we build an automated acquisition system, as shown in Fig. 1, and construct several datasets of complex scenes, including fixed and mixed setups, complex hidden objects in a dark environment, and varying ambient light conditions. Our proposed AED network performs well on these complex datasets: it achieves a PSNR above 23 dB and an SSIM above 0.9 in uncalibrated setups in a dark environment, improves on the U-Net network in both PSNR and SSIM on more sophisticated datasets, and remains above 17.9 dB in PSNR and above 0.8 in SSIM on the MNIST dataset under varying ambient light conditions. The system also exhibits good generalizability, achieving a PSNR of 10.92 dB and an SSIM of 0.4997 even in scenarios where different numbers of people walk around the system and cast shadows on the secondary surface.

Fig. 1

Passive NLOS reconstruction in complex scenes. The black and red arrows refer to the light path and data flow, respectively

2 Proposed method

In this section, we first describe the passive NLOS reconstruction problem that we consider. Then, we give a detailed description of the attention-based encoder–decoder (AED) network for passive NLOS reconstruction in complex scenes.

Fig. 2

Overview of our proposed attention-based encoder–decoder (AED) network for NLOS reconstruction in complex scenes. The attention branch in the A2B module is used to enhance object-related information

2.1 Problem formulation

Figure 1 shows a classic passive NLOS reconstruction scene. The hidden object, also called the original image, is displayed on the monitor screen, and the occluder partially blocks the emitted light, producing a penumbra on the secondary surface. The penumbra captured by the camera is called the measured image. The goal of NLOS reconstruction is to recover the hidden object from the measured image. For a point light source at position s on the monitor screen, the measured irradiance intensity at a point d on the secondary surface can be expressed as:

$$\begin{aligned} y(d) = \int _{s \in S} {x(s)A(s,d,{p_0})\text {d}s} + b(d), \end{aligned}$$
(1)

where x(s) is the radiosity of the monitor screen at point s; b(d) is the noise contribution, such as system modelling errors and background noise; the integration over all screen pixels S gives the combined contribution of the monitor screen at d; and \(A(s,d,{p_0})\) describes the optical transport from the point light source s to the point d with the occluder positioned at \({p_0}\):

$$\begin{aligned} A(s,d,{p_0}) = \mu (s,d)\frac{1}{{\left\| s - d \right\| {^2}}}G(s,d)V(s,d,{p_0}) \end{aligned}$$

Here, \(\mu (s,d)=\cos ^{18}(\angle (s,d))\) is the radiometric model of the monitor screen with respect to the viewing angle; \(\left\| \cdot \right\| \) denotes the Euclidean norm; and G is the Lambertian bidirectional reflectance distribution function (BRDF):

$$\begin{aligned} G(s,d) =\cos [\angle (d-s,n_{s})]\cos [\angle (s-d,n_{d})] \end{aligned}$$

where \(n_{s}\) and \(n_{d}\) are the surface normals of the monitor screen and the secondary surface, respectively, and \(V(s,d,{p_0})\) is a Boolean-valued visibility function that takes the value 1 when the light path from s to d is not obstructed by the occluder and 0 otherwise. Equation (1) can be discretized as:

$$\begin{aligned} \textbf{y} = A(p_0)\textbf{x} + \textbf{b} \end{aligned}$$
(2)

where \(\textbf{y}\) and \(\textbf{x}\) are the vectorized measured image and vectorized original image, respectively. \(A(p_{0})\) is the light transport matrix; \(\textbf{b}\) represents the noise term. NLOS reconstruction aims to learn an inverse mapping that reconstructs the hidden object \(\textbf{x}\) from the measured image \(\textbf{y}\), which can be formulated as:

$$\begin{aligned} \textbf{x}=A^{-1}\textbf{y}. \end{aligned}$$
(3)

where the form of \(A^{-1}\) depends on the reconstruction method. For example, if the least-squares method is used, \(A^{-1} = (A^{T}(p_0)A(p_0))^{-1}A^{T}(p_0)\).
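
To make the discretized model concrete, the following is a minimal NumPy sketch of Eq. (2) and the least-squares inversion in Eq. (3). The toy geometry, the interval occluder test, and the single cosine term standing in for \(\mu (s,d)G(s,d)\) are illustrative assumptions rather than the calibration of our actual system.

```python
import numpy as np

def build_transport(scene_pts, wall_pts, occ_z, occ_xmin, occ_xmax):
    """Toy light-transport matrix A(p0) for a 1-D screen and a 1-D wall.

    scene_pts : (S, 2) points s on the monitor screen
    wall_pts  : (D, 2) points d on the secondary surface
    The occluder is the interval [occ_xmin, occ_xmax] at height occ_z.
    """
    A = np.zeros((len(wall_pts), len(scene_pts)))
    for j, d in enumerate(wall_pts):
        for i, s in enumerate(scene_pts):
            r = d - s
            dist2 = float(np.dot(r, r))
            # visibility V(s, d, p0): does the ray s -> d cross the occluder interval?
            t = (occ_z - s[1]) / (d[1] - s[1])
            x_hit = s[0] + t * (d[0] - s[0])
            visible = not (0.0 < t < 1.0 and occ_xmin <= x_hit <= occ_xmax)
            # single cosine falloff standing in for mu(s, d) * G(s, d)
            cos_term = abs(r[1]) / np.sqrt(dist2)
            A[j, i] = visible * cos_term / dist2
    return A

# screen at y = 0, wall at y = 1, occluder at height y = 0.5 (all values illustrative)
scene_pts = np.stack([np.linspace(-0.5, 0.5, 32), np.zeros(32)], axis=1)
wall_pts = np.stack([np.linspace(-0.5, 0.5, 64), np.ones(64)], axis=1)
A = build_transport(scene_pts, wall_pts, 0.5, 0.0, 0.3)

x = np.random.rand(32)                      # hidden scene radiosity
y = A @ x + 0.01 * np.random.randn(64)      # Eq. (2): measurement with noise b
x_ls = np.linalg.pinv(A) @ y                # least-squares estimate, Eq. (3)
```

In practice the penumbra is a 2-D image and \(A(p_0)\) must be recomputed whenever the occluder or camera moves, which is exactly the prior knowledge that the learning-based approach below avoids.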

2.2 Network architecture of AED model

Commonly used reconstruction methods, such as U-Net and its variants, are less effective because they ignore the fact that complex scenes attenuate the intensity of object-related information and treat the useful information from hidden objects on an equal footing with noisy signals, such as ambient light. To address this issue, we first introduce attention mechanisms, represented by the attention branch, into NLOS reconstruction tasks to enhance object-related information. Furthermore, considering that attention mechanisms might neglect some effective information in certain layers, we incorporate a learnable attention dropout module to dynamically adjust the weights between the two branches, ensuring that both the attention and non-attention mechanisms are utilized effectively to enhance the object-related useful information.

Our proposed AED reconstruction model is a typical CNN-based encoder–decoder framework, which takes the measured images y as inputs and outputs the reconstructed original images x. The network structure is illustrated in Fig. 2. The input measured images and the output original images are resized to \(128\times 128\) pixels. The blue part acts as the encoder and comprises a convolutional layer and four stacked encoding blocks. The convolutional layer, with a filter size of \(1\times 1\), is used to change the number of channels. The four stacked encoding blocks generate the high-level feature representations; each consists of an attention in attention (A2B) module followed by a pooling layer for downsampling. After the encoding process, the deep features \(c_{5}\) are extracted from the measured image. This process can be described as follows:

$$\begin{aligned} c_{5}=f_{E_{4}}(f_{E_{3}}(f_{E_{2}}(f_{E_{1}}(f_\textrm{chg}(y))))), \end{aligned}$$
(4)

where \(f_\textrm{chg}(\cdot )\) is a convolutional layer with a filter size of \(1\times 1\), and \(f_{E_{i}}\) (\(i = 1, 2, 3, 4\)) denotes the i-th encoding block.

The green part acts as the decoder and comprises four stacked decoding blocks and a convolutional layer. Each decoding block contains an A2B module followed by a deconvolutional layer that expands the size of the feature maps. The decoding process restores the reconstructed image from the deep features \(c_{5}\). Similarly, the decoder process can be represented as:

$$\begin{aligned} x=f_\textrm{chg}(f_{D_{4}}(f_{D_{3}}(f_{D_{2}}(f_{D_{1}}(c_{5}))))), \end{aligned}$$
(5)

where \(f_{D_{i}}\) (\(i= 1, 2, 3, 4\)) denotes the i-th decoding block, \(f_\textrm{chg}(\cdot )\) represents a convolutional layer with a \(1\times 1\) kernel size, and x denotes the final reconstructed image. In general, our proposed AED model \(f_{a}(\cdot )\), which is designed to learn the inverse mapping \(A^{-1}\), can be written in the following form:

$$\begin{aligned} x=f_{a}(y). \end{aligned}$$
(6)
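
As a reference for how Eqs. (4)–(6) compose, the following PyTorch sketch mirrors the encoder–decoder structure without skip connections. The channel widths are illustrative, and the `a2b` argument is a placeholder for the attention in attention module of Sect. 2.3 (a sketch of that module is given after Sect. 2.3).

```python
import torch
import torch.nn as nn

class AED(nn.Module):
    """Encoder-decoder without skip connections, mirroring Eqs. (4)-(6)."""
    def __init__(self, a2b, channels=(16, 32, 64, 128, 256)):
        super().__init__()
        c = channels
        self.f_chg_in = nn.Conv2d(1, c[0], kernel_size=1)        # 1x1 conv changes the channel count
        self.encoders = nn.ModuleList([                          # f_{E_i}: A2B module + downsampling
            nn.Sequential(a2b(c[i], c[i + 1]), nn.MaxPool2d(2)) for i in range(4)])
        self.decoders = nn.ModuleList([                          # f_{D_i}: A2B module + upsampling
            nn.Sequential(a2b(c[4 - i], c[3 - i]),
                          nn.ConvTranspose2d(c[3 - i], c[3 - i], 2, stride=2)) for i in range(4)])
        self.f_chg_out = nn.Conv2d(c[0], 1, kernel_size=1)

    def forward(self, y):                  # y: measured image, shape (B, 1, 128, 128)
        c5 = self.f_chg_in(y)
        for enc in self.encoders:          # Eq. (4): extract the deep features c5
            c5 = enc(c5)
        x = c5
        for dec in self.decoders:          # Eq. (5): restore the image from c5
            x = dec(x)
        return self.f_chg_out(x)           # Eq. (6): x = f_a(y)
```

Because no encoder activations are forwarded to the decoder, the reconstruction relies entirely on the deep features \(c_{5}\); the skip-connection ablation in Sect. 3.2.1 examines exactly this design choice.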

Given a training dataset \({\{y_{i},x_{i}\}}_{i=1}^{N}\), the objective of training the proposed AED network is to minimize the mean squared error (MSE) loss [17]:

$$\begin{aligned} L(\theta )=\frac{1}{N}\sum \limits _{i = 1}^N {\left\| {x_i} - {f_a}({y_i};\theta ) \right\| _2^2} \end{aligned}$$
(7)

where \(\left\| \cdot \right\| _2\) denotes the \(L_{2}\) norm and \(\theta \) denotes all trainable network parameters.
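
As a small sketch, the loss in Eq. (7) can be transcribed directly; note that PyTorch's built-in `nn.MSELoss` averages over all pixels instead of summing the squared error per image, so the two differ only by a constant scale factor.

```python
import torch

def aed_loss(x_hat, x):
    """Eq. (7): batch mean of per-image squared L2 reconstruction errors."""
    return ((x_hat - x) ** 2).flatten(start_dim=1).sum(dim=1).mean()

# example: a batch of eight 128x128 reconstructions and ground-truth images
loss = aed_loss(torch.rand(8, 1, 128, 128), torch.rand(8, 1, 128, 128))
```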

2.3 Attention in attention module

As shown in Fig. 3, the A2B module comprises an attention branch, a non-attention branch, and an additional attention dropout module. The non-attention branch is an improved bottleneck layer [17], as shown in Fig. 4, which is used to learn the information ignored by the attention branch. The bottleneck layer contains two \(3\times 3\) convolution layers, a \(7\times 7\) convolution layer, and three nonlinear activation functions followed by a batch normalization layer. Next, we detail the attention branch and the attention dropout module; a code sketch of the complete A2B block is given after the list below.

Fig. 3

The architecture of the attention in attention (A2B) module. \(\bigoplus \) denotes the weighted sum of the two branches, and \(\bigotimes \) denotes the multiplication operation

Fig. 4

The architecture of the bottleneck layer [17]. \(\bigoplus \) represents the summation operation

  1. The attention branch: The attention branch is proposed to enable the network to pay more attention to the useful information of hidden objects. It is divided into two sub-branches: a mask branch and a trunk branch. Here, the trunk branch is an improved bottleneck layer, and we use \(T(x_{n-1})\) to denote its output for the input \(x_{n-1}\). The mask branch mainly uses pixel channel-spatial attention [31], which employs a convolutional layer with a \(1\times 1\) kernel followed by a sigmoid function to yield a mask \(M(x_{n-1})\) of the same size that softly weights the features \(T(x_{n-1})\). The output of the attention branch H is:

    $$\begin{aligned} H_{i,j}(x_{n-1}) = M_{i,j}(x_{n-1}) \times T_{i,j}(x_{n-1}) \end{aligned}$$
    (8)

    where i and j index the spatial positions and the channels, respectively, and \(\times \) denotes element-wise multiplication. The mask branch can serve as a gradient update filter during backpropagation. Given the input feature \(x_{n-1}\), the gradient of the masked output with respect to the trunk branch parameters is:

    $$\begin{aligned} \frac{\partial M(x_{n-1},\theta _{m})T(x_{n-1}, \theta _{t})}{\partial \theta _{t}}= M(x_{n-1},\theta _{m})\frac{\partial T(x_{n-1}, \theta _{t})}{\partial \theta _{t}} \end{aligned}$$
    (9)

    where \(\theta _{m}\) and \(\theta _{t}\) are the mask branch and the trunk branch parameters, respectively. This property enables the AED model to learn more useful information.

  2. Attention dropout module: For a given input \(x_{n-1}\), we adopt two fully connected layers (\(W_{1}\) and \(W_{2}\)) with a ReLU activation after a global average pooling layer \(W_\textrm{avg}\), followed by a softmax function, to adaptively assign weights \(P_{1}\) and \(P_{2}\) to the attention and non-attention branches, respectively. Specifically, the dynamic weights \(P(x_{n-1})=(P_{1},P_{2})\) are computed as follows:

    $$\begin{aligned} P(x_{n-1})=\frac{e^{W_{2}(\textrm{ReLU}(W_{1}(W_\textrm{avg}(x_{n-1}))))}}{\sum e^{W_{2}(\textrm{ReLU}(W_{1}(W_\textrm{avg}(x_{n-1}))))}} \end{aligned}$$
    (10)

    The formula for obtaining the enhanced feature map \(x_{n}\) from the input feature map \(x_{n-1}\) can be written as follows:

    $$\begin{aligned} x_{n}=f_{1\times 1}(x_{n-1}^\textrm{att}\cdot P_{1}+x_{n-1}^{\text {non-att}}\cdot P_{2}), \end{aligned}$$
    (11)

    where \(x_{n-1}^{\text {non-att}}\) and \(x_{n-1}^\textrm{att}\) represent the outputs of the non-attention and attention branches, respectively, and \(f_{1\times 1}\) is a convolutional layer with a filter size of \(1\times 1\). Here, we apply a sum-to-one constraint to bound the dynamic weights, that is, \(P_{1}+P_{2}=1\).
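
Putting the two branches and the attention dropout module together, the A2B block of Fig. 3 could be implemented roughly as follows. The bottleneck layout, the reduction ratio in the dropout branch, and the ordering of convolution, batch normalization, and activation are simplifying assumptions; they abbreviate the improved bottleneck layer of Fig. 4 and the pixel channel-spatial attention of [31].

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Simplified stand-in for the improved bottleneck layer of Fig. 4."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 7, padding=3), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.skip = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.body(x) + self.skip(x)          # residual summation (the circled plus in Fig. 4)

class A2B(nn.Module):
    """Attention in attention block: attention branch, non-attention branch, attention dropout."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.trunk = Bottleneck(in_ch, out_ch)                                # T(x): trunk sub-branch
        self.mask = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), nn.Sigmoid())  # M(x): soft mask, Eq. (8)
        self.non_att = Bottleneck(in_ch, out_ch)          # branch for information the mask suppresses
        self.dropout = nn.Sequential(                     # attention dropout: dynamic weights, Eq. (10)
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, max(in_ch // 2, 2)), nn.ReLU(inplace=True),
            nn.Linear(max(in_ch // 2, 2), 2), nn.Softmax(dim=1))
        self.fuse = nn.Conv2d(out_ch, out_ch, 1)          # f_{1x1} in Eq. (11)

    def forward(self, x):
        att = self.mask(x) * self.trunk(x)                # Eq. (8): masked trunk features
        non_att = self.non_att(x)
        p = self.dropout(x)                               # p[:, 0] + p[:, 1] = 1 via the softmax
        p1 = p[:, 0].view(-1, 1, 1, 1)
        p2 = p[:, 1].view(-1, 1, 1, 1)
        return self.fuse(att * p1 + non_att * p2)         # Eq. (11): weighted fusion
```

With this class, the encoder–decoder skeleton of Sect. 2.2 can be instantiated as `AED(a2b=A2B)`.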

3 Experiments

3.1 Datasets and training details

  1. Datasets: The success of deep learning methods for the NLOS reconstruction problem strongly depends on the availability of suitable datasets, that is, pairs of original images and corresponding measured images. Specifically, the original images x from six benchmark datasets, MNIST [32], Human posture [33], Handgesture-1 [34], Handgesture-2 [35], Fashion-MNIST [36], and Cifar10 [37], are displayed sequentially on the monitor screen of the NLOS reconstruction system, as shown in Fig. 1, and the corresponding measured images are collected. Of these datasets, the MNIST dataset is popular because of its simple distribution: each image contains a single digit from 0 to 9 on a pure black background in a grayscale colour space. In this study, we used this dataset to investigate the robustness of the proposed AED model to the parameter settings of the reconstruction system and to varying ambient light conditions. Specifically, in the former case, we first placed the NLOS reconstruction system in a dark environment and then collected data in both fixed and mixed NLOS setups. In the latter case, keeping the parameter settings of the reconstruction system unchanged, we collected data under varying ambient lighting conditions. The remaining five datasets were used to evaluate the reconstruction performance of the AED model for complex hidden objects. The backgrounds and colour spaces of the Human posture and Fashion-MNIST datasets are similar to those of the MNIST dataset, but their objects are more complex. Compared to the MNIST dataset, the gesture objects of Handgesture-1 are more complex, its background is not monochromatic and does not change with the image, and its colour space is RGB. In contrast to Handgesture-1, Handgesture-2 has more object types, a background colour close to that of the object, and a background that varies with the image. The Cifar10 dataset contains real objects in the physical world with different scales and features, and its background and colour space are more complex than those of all the other datasets. A more detailed description of the six datasets can be found in the paragraph "Results on more sophisticated datasets" in the Supplementary Information of [38]. We used the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) as the evaluation metrics.

  2. Training details: All the experiments were implemented in PyTorch 1.7. We used a stochastic gradient descent (SGD) optimizer [39] with a momentum of 0.9 to train our model for 200 epochs, with eight images per minibatch. We set the initial learning rate to 0.001 with a linear warmup schedule [40] and a step decay schedule. In addition, we adopted data augmentation, such as adding Gaussian noise and cropping, to avoid overfitting (a configuration sketch follows this list).
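
Under this schedule, the optimizer and learning-rate configuration could be set up roughly as below. The warmup length, the step-decay milestones, the augmentation strength, and the dummy data loader are assumptions not fixed by the text, and `AED`, `A2B`, and `aed_loss` refer to the sketches in Sect. 2.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader, TensorDataset

model = AED(a2b=A2B)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

warmup_epochs, total_epochs = 5, 200                    # warmup length is an assumption
def lr_lambda(epoch):
    if epoch < warmup_epochs:                           # linear warmup
        return (epoch + 1) / warmup_epochs
    return 0.1 ** (epoch // 100)                        # illustrative step decay
scheduler = LambdaLR(optimizer, lr_lambda)

# stand-in loader: (measured, original) image pairs, minibatch size 8
pairs = TensorDataset(torch.rand(64, 1, 128, 128), torch.rand(64, 1, 128, 128))
train_loader = DataLoader(pairs, batch_size=8, shuffle=True)

for epoch in range(total_epochs):
    for y, x in train_loader:
        y = y + 0.01 * torch.randn_like(y)              # Gaussian-noise augmentation (illustrative strength)
        optimizer.zero_grad()
        loss = aed_loss(model(y), x)                    # Eq. (7)
        loss.backward()
        optimizer.step()
    scheduler.step()
```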

Table 1 Quantitative results (PSNR(dB)/SSIM) of several reconstruction algorithms on the MNIST dataset in the fixed or mixed setup

3.2 Experimental results

3.2.1 Dark environment

  1. To evaluate our AED method on the MNIST dataset and compare it with the conventional method [5] and the U-Net method [17], we first collected NLOS data with fixed parameter settings; that is, the positions of the occluder, monitor, and camera, as well as the shape of the occluder, remained unchanged during the acquisition process (a more detailed description of the fixed setup is given in [38]). This case is referred to as the fixed setup. We then fed the collected data into the traditional, U-Net, and AED methods to produce the corresponding reconstructions. Table 1 lists the quantitative results on the 10,000 test images. As can be seen, our method achieves a PSNR of 25.242 dB and an SSIM of 0.9424, which are higher than those of the traditional method (10.813 dB and 0.0772) and U-Net (24.913 dB and 0.8886). This indicates the effectiveness of the AED method for NLOS reconstruction. We further compared our AED with U-Net and the traditional method in terms of computational cost, including the number of parameters and the inference time for a single image. The results are presented in Table 2. Owing to the A2B module, the number of parameters of our AED increases by 6.22 M compared with U-Net. The traditional method requires an additional 2 min for image reconstruction, whereas the inference time of the AED is equivalent to that of U-Net and is under 0.2 s. This indicates that our model is effective while maintaining a speed advantage.

  2. In real applications, the parameter settings of the reconstruction system can vary; in other words, the positions of the occluder, monitor, and camera, as well as the shape of the occluder, may change. This case is referred to as the mixed setup (refer to the detailed description of the mixed setup in [38]). To analyse the robustness of the AED method to parameter settings, we collected NLOS data that varied over a range of settings to train the model. The traditional method constructs a forward transport model based on prior knowledge of the reconstruction system, such as a rectangular occluder, whereas in the mixed setup this prior knowledge changes, for example the shape of the occluder changes from rectangular to circular or cup-shaped. Therefore, the traditional method cannot operate in the mixed setup. As presented in Table 1, our AED method achieves the best performance (23.624 dB/0.9239) in the mixed setup and outperforms the U-Net method (23.218 dB/0.8568) in both PSNR and SSIM. This result demonstrates that the proposed AED method is robust to the parameter settings of the NLOS system.

Table 2 The number of parameters and inference time for a single image across different methods
Table 3 The PSNR and SSIM results of several reconstruction algorithms on five complex datasets in a fixed setup. AED-W and AED-O denote the network with and without skip connections from the encoder to the decoder, respectively
Fig. 5

Examples of different methods on five complex datasets. a Measured images. b Original images. c and d are the reconstructions for the U-Net method and AED method, respectively. All data are collected in a fixed setup

Table 4 The weights of the attention branch and the non-attention branch for the bending pose case
  3. To verify the generalizability of our method, we selected Human posture, Handgesture-1, Handgesture-2, Fashion-MNIST, and Cifar10 as example datasets and conducted experiments in a fixed setup. As listed in Table 3, our AED method shows improvements in both PSNR and SSIM over the mainstream U-Net method on these more complex datasets, demonstrating that the attention mechanism helps the network learn more useful information about hidden objects and achieve better reconstruction results. Although the numerical improvement is not large, our method performs better than U-Net visually, as shown in Fig. 5. Specifically, for Human posture, we produced a sharper appearance than U-Net. For Handgesture-1, we successfully recovered the details of the thumb and wrist, whereas U-Net could not. Our method also recovered finger shapes better on the Handgesture-2 dataset, where the colours of the background and object are close, and restored the shape of the hidden object on the Fashion-MNIST dataset, whereas U-Net failed in both cases. Compared with the results of U-Net on Cifar10, our method restored the colour information to a great extent and identified hidden objects. These results demonstrate the superior generalizability of our method to complex hidden objects.

As shown in Table 4, we observe that the weights of the attention branch and non-attention branch vary significantly at different blocks, reflecting differences in their ability to extract object-related useful information. For instance, in Block1, Block2, and Block7, the attention branch’s weights are higher than those of the non-attention branch, indicating its superior information extraction ability. Conversely, in Block5, Block6, and Block8, the increased weights of the non-attention branch suggest that it plays a valuable role in capturing information that is ignored by the attention branch. By fully utilizing the information from both the attention and non-attention branches, the network can preserve more important object-related useful information.

Fig. 6

The attention heat maps of the bending pose in the Human posture case. Block i denotes the i-th attention in attention block in the AED model. AFM and AAM denote the averaged feature map and the averaged attention map, respectively. The first, third, fourth, and fifth rows represent the averaged feature maps of the input, attention branch, non-attention branch, and output of the attention in attention module, respectively. The second row shows the averaged attention map of the attention branch, in which a brighter colour indicates a greater attention weight. The red, white, and blue areas indicate positive, negative, and zero values, respectively

  4. Considering that the data distributions of the input and output of the network are different, we also verified the effect of the skip connection scheme from the encoder to the decoder. The experimental results in Table 3 show that the AED method without the skip connection scheme achieves results comparable to those of the variant with this scheme; this demonstrates that using the scheme does not lead to superior reconstruction performance.

3.2.2 Varying ambient light environment

In real-life situations, ambient light cannot be controlled. Therefore, we evaluated the robustness of the AED method under varying ambient light conditions. Specifically, we studied the effect on reconstruction performance when people walk near the passive NLOS reconstruction system and cast a penumbra on the secondary surface. We placed the NLOS reconstruction system shown in Fig. 7a in an exhibition hall with an intense ambient light of 56.7 Lux/Fc. Under this lighting condition, the PSNR and SSIM of the AED method were 17.689 dB and 0.6894, respectively. Subsequently, we added different numbers of opaque plates, each 1.7 m in height, to the experimental setup (Fig. 7b–c) to simulate the penumbras cast on the secondary surface when people are in the vicinity of the system. As shown in Table 5, the PSNR and SSIM of the AED model trained in the setting without a plate (denoted AED-0) both decreased when people walked near the system, demonstrating that the AED model is not robust when trained under a fixed lighting condition. For improved robustness, we trained a new AED model under mixed lighting conditions, referred to as AED-M. As presented in Table 5, the proposed AED-M model remains above 17.9 dB in PSNR and 0.8 in SSIM when people walk near the system, showing that it is robust to varying ambient light conditions. The robustness of our AED model against changes in ambient light can be attributed to two key factors. The first is the diversity of our training dataset, which comprises a substantial collection of images captured under varying ambient light conditions; this diversity allows the model to learn about varying ambient lighting autonomously, making it robust in real-world scenarios. The second is the model architecture: our A2B module helps the network focus more on object-related information and significantly reduces its attention to ambient lighting, which improves the method's ability to adapt to changes in lighting conditions.

Table 5 Summary of the PSNR and the SSIM of AED models trained under varying ambient lighting conditions
Fig. 7

Passive NLOS reconstruction system in different ambient lighting conditions. (Reproduced with permission from Ref. [38], copyright \(\copyright \) Communications Physics, 2021.) a Zero, b one, and c two opaque plates placed around the reconstruction system to cast shadows on the secondary surface

We also conducted an experiment in an exhibition hall (Fig. 8) to evaluate the robustness of the AED-M model under real ambient lighting that varies owing to the movement of nearby people. We developed a script to automatically display 1000 test images from the MNIST dataset on the monitor screen and to reconstruct the NLOS object as different numbers of people walk around the system. The proposed AED-M model achieves a PSNR of 10.92 dB and an SSIM of 0.4997. These results suggest that our model generalizes well.

Fig. 8

Different numbers of people walk around the system to test the AED-M model

4 Conclusions

In this study, we propose an attention-based encoder–decoder network to boost reconstruction quality in complex scenes. To avoid treating object-related information and noise equally, we use the attention in attention (A2B) module to help the network focus on object-related useful information. We also find that the skip connection scheme does not contribute meaningfully to improving restoration quality. In addition, we create several datasets of complex scenes to evaluate the performance of the AED method. The experimental results demonstrate that the proposed AED method achieves good recovery quality on our constructed datasets. In the future, we plan to combine visible-light information with other types of electromagnetic waves to further improve passive NLOS reconstruction in complex scenes.