RaFE: Generative Radiance Fields Restoration
1 Introduction
Recently, Neural Radiance Fields (NeRFs) [3, 4, 7, 12, 30, 31, 37, 40] have achieved
great success in novel view synthesis and 3D reconstruction. However, most
NeRF methods are designed based on well-captured images from multiple views
with calibrated camera parameters. In real-world applications of NeRF, the data
capture or transmission process often introduces various forms of image degrada-
tions, such as noise generated during photography in low-light conditions [28,32]
and blur caused by camera motion [26, 43], or JPEG compression and down-
sampling during transmission [2,42]. Simply restoring degraded images frame-by-
frame can result in inconsistencies of geometry and appearance across different
viewpoints. Directly reconstructing 3D models over these per-frame restoration
results can easily induce inferior quality since current NeRF methods heavily
rely on pixel-wise independent ray optimization with local computations, which
are highly vulnerable to noise and other degradation.
[Fig. 1: Comparisons of the degraded image set, NeRF, and our results for low resolution (×4 and ×8), mixed degradation, real blur, and noise.]
To address this issue, we propose RaFE, a generic radiance field restoration framework. We regard the per-frame restored images as renderings from multiple distinct high-quality NeRF models with varied geometry and appearance. In this case, we abandon the commonly-used pixel-wise reconstruction
objective and propose to leverage the generative adversarial networks (GANs) to
model the distribution of these different high-quality NeRF models, which could effectively capture the inherent variability in the ill-posed inverse problem, allowing for better accommodation of the inconsistencies present across different views.
Specifically, our pipeline consists of two main stages. In the first stage, based
on the type of degradation, we can employ the corresponding off-the-shelf image
restoration methods [17, 25, 33, 34, 45, 48] to obtain a set of high-quality multi-
view images. In practice, we prefer choosing restoration methods which have
strong capabilities to recover high-quality and realistic texture details. In the
second stage, we train a 3D generative model based on these restored multi-view
images. Drawing inspirations from recent 3D generation works [6, 9, 36, 39], we
construct a convolutional neural network (CNN) to generate tri-plane features,
which are subsequently sampled and decoded into density and colors using MLP
networks for NeRF rendering. Here, instead of generating single-level tri-plane
features as previous works did, we decompose the tri-planes into two levels. The
coarse-level tri-planes are constructed directly from low-quality images and re-
main fixed during training, representing the coarse structure of the modeled 3D
distribution. Simultaneously, we train a generator to output the diverse fine-level
tri-plane features, which act as residuals to be added to the coarse-level features
for NeRF rendering. By learning only the residual representations instead of the entire tri-planes, we simplify the modeling of restoration variations: the generator only needs to capture the details, while the coarse structure is provided by the coarse-level tri-planes, which markedly improves rendering quality in more complex regions. To train the generator, we adopt
an adversarial loss defined on NeRF rendered 2D images to encourage them to be
indistinguishable from the high-quality restored images. We also incorporate a perceptual loss between the rendered images and the restored images to impose structural constraints. Additionally, we propose patch sampling strategies to stabilize the generator training procedure. Once the generator has been trained, we can generate restored radiance fields with high-quality renderings and a certain level of diversity by sampling different codes in the latent space.
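To make the two-level design concrete, the following minimal PyTorch sketch shows how fixed coarse tri-planes and generated fine-level residuals could be combined and queried; the tensor shapes, the sample_triplanes helper, and the fine_generator module are illustrative assumptions rather than the exact implementation.

import torch
import torch.nn.functional as F

def sample_triplanes(triplanes, xyz):
    # triplanes: (3, C, H, W) feature planes (XY, XZ, YZ); xyz: (N, 3) points in [-1, 1]^3.
    coords = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]   # project each point onto the three planes
    feats = 0.0
    for plane, uv in zip(triplanes, coords):
        grid = uv.view(1, -1, 1, 2)                             # grid_sample expects (B, H_out, W_out, 2)
        sampled = F.grid_sample(plane[None], grid, mode="bilinear", align_corners=True)
        feats = feats + sampled.view(plane.shape[0], -1).t()    # accumulate (N, C) features
    return feats

def query_features(coarse_triplanes, fine_generator, z, xyz):
    # The coarse tri-planes stay fixed; the generator only outputs fine-level residuals.
    residual = fine_generator(z)                                # same shape as coarse_triplanes
    return sample_triplanes(coarse_triplanes + residual, xyz)   # later decoded by MLPs into density and color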
We conducted extensive experiments to validate the effectiveness of our
method, both qualitatively and quantitatively. The experimental results show-
case the superiority of our approach in various restoration tasks, such as super-
resolution (upper row of Figure 1), camera motion blur (a real-world case at the
right part of the lower row of Figure 1) and the restoration of mixed degradation
consisting of noise, blur, and compression (left part of lower row of Figure 1).
Our method not only generates images with richer and enhanced texture details
but also achieves significant improvements in geometric refinement, as demon-
strated by the mesh visualization in Figure 1. To summarize, our contributions are:
- We propose RaFE, a generic NeRF restoration framework that couples off-the-shelf 2D image restoration with generative radiance field modeling and applies to various types of degradation.
- We introduce a two-level tri-plane representation in which a fixed coarse level captures the rough structure and a GAN-generated fine-level residual models the diverse high-quality details.
- Extensive experiments on super-resolution, deblurring, denoising, and mixed degradation demonstrate the superiority of our method in both appearance and geometry.
2 Related Works
2.1 2D image restoration
Image restoration is a long-standing problem in the low-level vision domain, and significant progress has been achieved in specific tasks including image super-resolution, deblurring, denoising, and blind restoration. Previously, reconstruction-based methods [11, 21, 23, 51, 53] showed success in these tasks. However, these reconstruction-based methods struggle to generate abundant high-quality details. Subsequently, generative restoration methods [10, 17, 25, 33, 34, 38, 41, 45, 48], particularly those based on diffusion models, have shown a great capability to produce high-quality details. DeepFloyd [34] proposes a super-resolution model, which concatenates the low-resolution input with random noise at the pixel level as a condition to guide the generation of high-resolution images. For blind restoration, DiffBIR [25] designs a degradation pipeline to simulate real-world degradation and utilizes a pre-trained diffusion model to generate photorealistic images. For camera motion blur, HiDiff [10] recovers exquisite images by using diffusion to generate features with abundant detailed information.
2.2 NeRF restoration
Several approaches refine NeRF results with 2D restoration models, but such refinement only happens on rendered views. None of the existing approaches can restore NeRF in 3D space directly with flexible forms of degradation. NVSR [2] achieves 3D geometry refinement by upsampling the tri-plane representation, but its training process requires tremendous amounts of 3D data, which are extremely hard to obtain in practice. By contrast, our method can handle more flexible forms of degradation and restore 3D geometry and appearance with only an image set of an object or scene, making 3D restoration more practical in real-world applications.
3 Method
In this section, we elaborate the details of RaFE. We first introduce how to refine the degraded views using pretrained 2D restoration models to capture the high-quality appearance distribution in Sec. 3.1. Then, we describe our generative restoration framework, including the neural representation, generator architecture, and optimization, in Sec. 3.2. The training strategy is introduced in Sec. 3.3. The overall pipeline is shown in Fig. 2.
[Fig. 2: Overview of RaFE. A coarse NeRF fitted to the low-quality image set provides fixed coarse tri-planes. A mapping network and a generator produce residual fine tri-planes, which are added to the coarse ones; the summed features, together with the view direction d (SH encoding), are decoded by shared MLPs into density and color for volume rendering. Rendered patches are compared against patches cropped from the 2D-restored high-quality images through adversarial (real/fake) and LPIPS losses, modeling the distribution p_r(ϕ|I_h).]
NeRF Rendering. Following NeRF [30], the color of a camera ray r is obtained by volume rendering over N points sampled along the ray:
C(\boldsymbol{r}) = \sum_{i=1}^{N} T_i \big(1 - \exp(-\sigma_i \delta_i)\big) \boldsymbol{c}_{i}, \quad T_i = \exp\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big), \qquad (1)
where c_i and σ_i represent the RGB color and density of the i-th point along the ray, respectively, N is the number of sampled points, and δ_i is the distance between adjacent samples.
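For concreteness, Eq. (1) can be implemented with a cumulative product over sample opacities, as in the following minimal PyTorch sketch (tensor names and shapes are illustrative):

import torch

def volume_render(sigma, rgb, deltas):
    # sigma: (R, N) densities, rgb: (R, N, 3) colors, deltas: (R, N) inter-sample distances.
    alpha = 1.0 - torch.exp(-sigma * deltas)                        # per-sample opacity
    # T_i = exp(-sum_{j<i} sigma_j * delta_j), computed as a cumulative product of (1 - alpha_j).
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = trans * alpha                                         # T_i (1 - exp(-sigma_i * delta_i))
    return (weights.unsqueeze(-1) * rgb).sum(dim=-2)                # C(r), shape (R, 3)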
NeRF Generator. Given restored multi-view images, we regard them as ren-
derings from several diverse high-quality NeRF models with varied geometry and
appearance. The distribution formed by these distinct NeRF models, formulated
as pr (ϕ|Ih ), can be modeled by a generator. With the hybrid explicit-implicit
tri-plane representation, we introduce a StyleGAN2 [15]-like CNN-based genera-
tor, which receives a latent code w mapped from a random code z to generate the fine-level residual tri-plane features. Treating the restored images as real samples and the rendered images as fake ones, the generator is trained with an adversarial objective [13]:
\mathcal{L}_{adv} = \mathbb{E}_{I_h}\big[\log D(I_h)\big] + \mathbb{E}_{\boldsymbol{z}, \boldsymbol{\theta}}\big[\log\big(1 - D(G(\boldsymbol{z}, \boldsymbol{\theta}))\big)\big], \qquad (2)
where z, θ represent random code and view point, respectively, and Ih is the
restored high quality image.
However, we observed that relying solely on a GAN loss for training can lead
to significant geometric mismatches between the restored images and rendered
views. We argue that although the GAN loss helps align the distribution of
2D renderings, it still lacks geometry-level constraints. Therefore, we also incor-
porate a perceptual loss that encourages the rendered images to resemble the
geometry of the per-frame restorations:
\mathcal{L}_{geometry} = \mathrm{LPIPS}\big(I_{h}, G(\boldsymbol{z}, \boldsymbol{\theta})\big), \qquad (3)
where LPIPS(·, ·) refers to the learned perceptual image patch similarity proposed in [49], I_h is the restored high-quality image paired with view point θ, and G(z, θ) is the image rendered from the same training view θ and a randomly sampled latent code z. We also supervise the coarse NeRF branch with the input
RGB degraded images:
\mathcal{L}_{rec} = \mathbb{E}_{\boldsymbol{\theta} \sim p_{\theta}}\big[\| G_{c}(\boldsymbol{\theta}) - I_{l}^{\theta}\|^2\big], \qquad (4)
where p_θ indicates the view point distribution, and I_l^θ is the low-quality image
corresponding to view point θ. Overall, the complete training objective is:
\mathcal{L} = \lambda_{geometry}\mathcal{L}_{geometry} + \lambda_{adv}\mathcal{L}_{adv} + \lambda_{rec}\mathcal{L}_{rec}, \qquad (5)
where λgeometry , λadv , λrec are trade-off parameters.
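A single generator update combining the perceptual, adversarial, and reconstruction terms of Eq. (5) can be sketched as follows; here G, G_c, D, and lpips_fn stand for the generator, the coarse NeRF branch, the patch discriminator, and an LPIPS module, and the softplus form of the adversarial term is an assumption borrowed from common StyleGAN2 training code rather than a detail stated in the paper.

import torch.nn.functional as F

def generator_loss(G, G_c, D, lpips_fn, z, theta, I_h, I_l,
                   lam_geo=0.5, lam_adv=1.0, lam_rec=1.0):
    I_fake = G(z, theta)                                   # rendering from the generated radiance field
    loss_geo = lpips_fn(I_fake, I_h).mean()                # Eq. (3): stay close to the per-frame restoration
    loss_adv = F.softplus(-D(I_fake)).mean()               # non-saturating generator GAN loss (assumed form)
    loss_rec = F.mse_loss(G_c(theta), I_l)                 # Eq. (4): coarse branch fits the degraded input
    return lam_geo * loss_geo + lam_adv * loss_adv + lam_rec * loss_rec   # Eq. (5)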
4 Experiments
4.1 Setup
Datasets. We evaluate our model on the NeRF-Synthetic benchmark dataset [30],
which contains 8 synthetic objects with images taken from different viewpoints
uniformly distributed on the hemisphere. Following the original setting, we use 200 viewpoints for generating high-quality training data and 200 viewpoints for test-
ing. Further, to demonstrate the generalization ability of our method, we also
evaluate our method on the complex real-world LLFF scenes [29], which consist of 8 scenes captured with roughly forward-facing images. We also demonstrate the
superior performance of RaFE on real-world blur [26] and noise [28] data.
Evaluation Metrics. Following the common practice of 3D reconstruction, we first evaluate each method with two standard image quality metrics: peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [46]. According to our observations, however, these metrics do not reflect the real 3D restoration performance. Due to the generative nature of RaFE, the recovered radiance fields are of high quality but may not faithfully follow the "ground-truth" 3D model, since inverting a degraded signal is a highly ill-posed problem. Moreover, we also found that baselines with better PSNR and SSIM scores still exhibit a degraded appearance with overly smooth texture details, as shown in Figure 3b. Hence, a better choice is to leverage a perceptual metric, the learned perceptual image patch similarity (LPIPS) [49], which computes the mean squared error (MSE) between normalized features from all layers of a pre-trained VGG [35] encoder and is deemed to better correlate with human perception. Besides, we also leverage the latest no-reference image quality assessment metrics, LIQE [50] and MANIQA [47], to demonstrate the superior rendering quality of our method.
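For reference, the reference-based metrics can be computed with publicly available tools, as in the sketch below (the lpips package is assumed; LIQE and MANIQA are omitted since they require their respective released models):

import torch
import lpips

lpips_vgg = lpips.LPIPS(net="vgg")          # perceptual distance on pre-trained VGG features

def psnr(pred, gt):
    # pred, gt: float tensors in [0, 1], shape (B, 3, H, W).
    mse = torch.mean((pred - gt) ** 2)
    return -10.0 * torch.log10(mse)

def lpips_distance(pred, gt):
    # The lpips module expects inputs scaled to [-1, 1].
    return lpips_vgg(pred * 2.0 - 1.0, gt * 2.0 - 1.0).mean()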
Implementation Details. We implement all experiments in PyTorch. For the 2D generator and discriminator, we adopt the convolutional generator and discriminator used in StyleGAN2 [16]. In all experiments, we use the Adam optimizer for all modules in our pipeline, with hyperparameters β1 = 0, β2 = 0.99 and a learning rate of 2 × 10−3 for both the generator and the discriminator. For the loss weights, we use λ_geometry = 0.5, λ_adv = 1.0, and λ_rec = 1.0 for almost all experiments. We evaluate the RaFE framework on four different 3D restoration tasks: super-resolution, deblurring, denoising, and mixed degradation.
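In code, this optimization setup roughly amounts to the following sketch (generator and discriminator are placeholders for the actual modules):

import torch

def build_optimizers(generator, discriminator):
    # Adam with beta1 = 0, beta2 = 0.99 and learning rate 2e-3 for both networks, as stated above.
    g_opt = torch.optim.Adam(generator.parameters(), lr=2e-3, betas=(0.0, 0.99))
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-3, betas=(0.0, 0.99))
    return g_opt, d_opt

# Default loss weights used in almost all experiments.
LOSS_WEIGHTS = {"geometry": 0.5, "adv": 1.0, "rec": 1.0}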
4.2 Results
Baseline Methods. For general tasks, we build a baseline that first restores the degraded images and then uses the restored high-quality images to reconstruct a NeRF directly, denoted as NeRF-Perframe. We also use the 2D restoration model SwinIR [24] to perform per-view refinement on the renderings of a NeRF trained on degraded images, denoted as NeRF-SwinIR. Note that we do not evaluate
NeRF-SwinIR for the deblur task since there are no corresponding checkpoints.
To more thoroughly test the effectiveness of our method, we also select some
task-specific competitors. For the super-resolution task, we choose NeRF-SR [42] and Neural Volume Super-Resolution (NVSR) [1] as baselines. For mixed degradation, since there is no existing method tailored for it, we choose NeRFLiX [52], which tries to solve NeRF-like degradation by training a 2D refinement model on degraded images constructed by a degradation simulator for typical NeRF-style artifacts; we consider this the most relevant method. For the deblurring task, we compare with two state-of-the-art methods, Deblur-NeRF [26] and BAD-NeRF [44]. Deblur-NeRF designs a learnable blur kernel and applies it to rays to simulate the degrading process, and BAD-NeRF
directly models the camera trajectories to solve motion blur. For the denoising
task, we compare with NAN [32], which uses a noise-aware encoder to aggregate
the feature of multi-view images for restoration.
Quantitative Results. We conduct extensive quantitative comparisons with various baselines across different restoration tasks: Tab. 1a for super-resolution, Tab. 2b and Tab. 2a for deblurring, Tab. 2c for denoising, and Tab. 1b for mixed degradation. As analyzed before, our method often falls slightly behind other baselines on reconstruction metrics such as PSNR and SSIM; however, these metrics only measure local pixel-aligned similarity between the rendered novel views and the ground-truth images and are less indicative, since uncertainties naturally exist in the generative procedure. Taking the super-resolution results on the Blender data as an example (Tab. 1a), the simplest baseline NeRF-Perframe already achieves the best reconstruction metrics, but as shown in Figure 3a, its visual quality is vastly inferior to ours. Through the error map, we found that the misalignment between the generated 3D model and the input 3D causes the drop in PSNR and SSIM. By contrast, on the perceptual metric LPIPS and the no-reference metrics LIQE and MANIQA, which more effectively reflect the restoration performance, our method consistently achieves better results than the other baselines, demonstrating its clear advantages.
For mixed degradation tasks, the best results for LPIPS metrics are achieved
by NeRFLiX w. ref [52]. This is because NeRFLiX can see two high-quality ground-truth images from the nearest two viewpoints at inference time, which leads to information leakage. However, such high-quality ground-truth information is not accessible in our setting or in any real-world case. After eliminating the impact
of ground truth (NeRFLiX w/o. ref) by replacing the reference ground truth
images with degraded images, our method performs better than NeRFLiX.
Table 2: Quantitative comparisons for deblurring and denoising: (a) deblurring, (b) deblurring (consistent blur), (c) denoising with Gain 8. The best result without using reference is highlighted. Our method performs the best on perceptual metrics.
[Qualitative comparison panels. Super-resolution: Input, NeRF-SR, NVSR, NeRF-SwinIR, NeRF-Perframe, Ours. Denoising (Gain 8): Input, NAN, NeRF-SwinIR, NeRF-Perframe, Ours. Mixed degradation: Input, NeRF-SwinIR, NeRFLiX w. ref, NeRFLiX w/o ref, NeRF-Perframe, Ours.]
[Fig. 5b: Restoration diversity of 2D restoration models, measured by the diversity score: SRFormer 0.161, DiffBIR 0.252 (higher is more diverse).]
To validate that our method also works well in real-world settings, we test RaFE on the real-noise and real-blur datasets proposed in [28] and [26]. As shown in Fig. 6a and Fig. 6b, benefiting from the powerful 2D restoration models, NeRF-Perframe could effectively remove degradations such as noise or blur. Nonetheless, simply averaging the view-inconsistent 2D restored frames results in a very smooth 3D reconstruction. By contrast, by modeling the 3D space with a generative model, the 3D model sampled from our method achieves significantly better rendering quality with realistic and degradation-free texture details, demonstrating its superiority in real-world 3D restoration.
4.4 Discussion
Effects of different restoration models. To investigate the influence of differ-
ent 2D restoration models on our method, we tested two additional off-the-shelf
restoration models for the super-resolution task, including diffusion-based Diff-
BIR [25] and the non-diffusion-based SRFormer [53]. As shown in Fig. 5b, the diffusion-based DiffBIR exhibits larger restoration diversity than SRFormer, as measured by the diversity score. When the restored images exhibit diversity, direct reconstruction inevitably leads to blurriness of varying degrees due to the resulting multi-view inconsistency. By contrast, by modeling the distribution of the potential high-quality NeRFs, our method successfully accommodates these inconsistencies and consistently achieves better performance than NeRF-Perframe, demonstrating the strong generalization capability of RaFE to different 2D restoration models.
Fig. 7: Effectiveness of the tri-plane generator. Left: the image rendered by NeRF without the generator and images rendered by the generative NeRF under different random codes z (z0, z1, z2), compared with GT. Right: numerical metrics evaluating the efficacy of the generator (w/o generator → full model): LPIPS↓ 0.091 → 0.080, LIQE↑ 3.266 → 4.224, MANIQA↑ 0.278 → 0.447. The results show the effectiveness of using the generator to model the distribution.
Effects of the generator. In this ablation study, we examine the influence of
the generator by comparing with the baseline that directly optimizes the NeRF
parameters using GAN loss and LPIPS mentioned above on the Lego dataset.
As we can observe in Fig. 7, the image rendered by generative NeRF exhibits
varied fine-textured details under different random codes z. Once the generator is removed, the rendered images exhibit a blurry and smooth appearance, showing the importance of using the generator to model the distribution, which is also demonstrated by the numerical metrics in the right part of Fig. 7.
Effects of geometry loss & GAN loss. In this experiment, we examine the
influence of geometry and GAN losses on the performance by training the model
on the drums dataset, shown in Fig. 8. We conduct this experiment by setting
the weights of the unused losses to 0. As can be seen, removing the geometry loss results in severe geometry mismatches (e.g., the edges of the drums exhibit distortion), which is also demonstrated by the drop in PSNR. Meanwhile, using only the perceptual loss makes the rendered image very smooth with fewer details (e.g., the light and reflection), resulting in a significant decline in perceptual metrics. RaFE trained with both objectives achieves the best performance.
[Fig. 8: Ablation of the geometry loss and GAN loss on the drums scene.]
Geometry loss:  ✓      ✗      ✓
GAN loss:       ✗      ✓      ✓
PSNR↑:          22.18  18.93  20.72
SSIM↑:          0.861  0.826  0.858
LPIPS↓:         0.190  0.098  0.077
LIQE↑:          1.029  4.715  4.939
MANIQA↑:        0.157  0.528  0.553
5 Conclusions
This paper proposes a novel generic NeRF restoration method that applies to
various types of degradations, such as low resolution, blurriness, noise, and mixed
degradation. The proposed method leverages the off-the-shelf image restoration
methods to restore the multi-view input images individually. To tackle the ge-
ometric and appearance inconsistencies presented in multi-view images due to
individual restoration, we propose to train a GAN for NeRF generation, where
a two-level tri-plane structure is adopted. The coarse-level tri-plane pre-trained on low-quality images remains fixed, while the fine-level residual tri-plane is produced by the generator to capture the diverse high-quality details of geometry and appearance. Extensive experiments demonstrate the effectiveness of the proposed method across various restoration tasks.
References
1. Bahat, Y., Michaeli, T.: Explorable super resolution. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2716–
2725 (2020) 4, 10
2. Bahat, Y., Zhang, Y., Sommerhoff, H., Kolb, A., Heide, F.: Neural volume super-
resolution. arXiv preprint arXiv:2212.04666 (2022) 2, 4, 5
3. Barron, J.T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., Srini-
vasan, P.P.: Mip-nerf: A multiscale representation for anti-aliasing neural radiance
fields (2021) 1
4. Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.: Mip-nerf
360: Unbounded anti-aliased neural radiance fields. CVPR (2022) 1
5. Beyer, L., Zhai, X., Kolesnikov, A.: Big vision. https://github.com/google-
research/big_vision (2022) 5
6. Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., De Mello, S., Gallo,
O., Guibas, L.J., Tremblay, J., Khamis, S., et al.: Efficient geometry-aware 3d
generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 16123–16133 (2022) 3, 6
7. Chen, A., Xu, Z., Geiger, A., Yu, J., Su, H.: Tensorf: Tensorial radiance fields. In:
European Conference on Computer Vision (ECCV) (2022) 1
8. Chen, W.T., Yifan, W., Kuo, S.Y., Wetzstein, G.: Dehazenerf: Multiple image haze
removal and 3d shape reconstruction using neural radiance fields. arXiv preprint
arXiv:2303.11364 (2023) 4
9. Chen, X., Deng, Y., Wang, B.: Mimic3d: Thriving 3d-aware gans via 3d-to-2d im-
itation. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision (ICCV) (2023) 3, 6
10. Chen, Z., Zhang, Y., Ding, L., Bin, X., Gu, J., Kong, L., Yuan, X.: Hierarchical
integration diffusion model for realistic image deblurring. In: NeurIPS (2023) 4, 9
11. Chen, Z., Zhang, Y., Gu, J., Kong, L., Yang, X., Yu, F.: Dual aggregation trans-
former for image super-resolution. In: ICCV (2023) 4
12. Fridovich-Keil, S., Meanti, G., Warburg, F.R., Recht, B., Kanazawa, A.: K-planes:
Explicit radiance fields in space, time, and appearance. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12479–
12488 (2023) 1
13. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,
Courville, A., Bengio, Y.: Generative adversarial nets. Advances in neural infor-
mation processing systems 27 (2014) 7
14. Han, Y., Yu, T., Yu, X., Wang, Y., Dai, Q.: Super-nerf: View-consistent detail
generation for nerf super-resolution. arXiv preprint arXiv:2304.13518 (2023) 4
15. Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J., Aila,
T.: Alias-free generative adversarial networks. In: Proc. NeurIPS (2021) 6
16. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing
and improving the image quality of StyleGAN. In: Proc. CVPR (2020) 9
17. Kawar, B., Elad, M., Ermon, S., Song, J.: Denoising diffusion restoration models.
In: Advances in Neural Information Processing Systems (2022) 3, 4
18. Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for
real-time radiance field rendering. ACM Transactions on Graphics 42(4) (July
2023), https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/ 15
19. Lee, D., Lee, M., Shin, C., Lee, S.: Dp-nerf: Deblurred neural radiance field with
physical scene priors. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. pp. 12386–12396 (2023) 2, 4
20. Lee, D., Oh, J., Rim, J., Cho, S., Lee, K.M.: Exblurf: Efficient radiance fields for
extreme motion blurred images. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision. pp. 17639–17648 (2023) 2, 4
21. Li, H., Zhang, Z., Jiang, T., Luo, P., Feng, H., Xu, Z.: Real-world deep local mo-
tion deblurring. In: Proceedings of the AAAI Conference on Artificial Intelligence.
vol. 37, pp. 1314–1322 (2023) 4
22. Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training
for unified vision-language understanding and generation. In: ICML (2022) 5
23. Li, J., Zhang, Z., Liu, X., Feng, C., Wang, X., Lei, L., Zuo, W.: Spatially adap-
tive self-supervised learning for real-world image denoising. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
pp. 9914–9924 (June 2023) 4
24. Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: Swinir: Image
restoration using swin transformer. arXiv preprint arXiv:2108.10257 (2021) 10
25. Lin, X., He, J., Chen, Z., Lyu, Z., Fei, B., Dai, B., Ouyang, W., Qiao, Y., Dong,
C.: Diffbir: Towards blind image restoration with generative diffusion prior. arXiv
preprint arXiv:2308.15070 (2023) 3, 4, 5, 9, 13
26. Ma, L., Li, X., Liao, J., Zhang, Q., Wang, X., Wang, J., Sander, P.V.: Deblur-
nerf: Neural radiance fields from blurry images. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. pp. 12861–12870 (2022)
2, 4, 8, 9, 10, 13
27. Mildenhall, B., Barron, J.T., Chen, J., Sharlet, D., Ng, R., Carroll, R.: Burst
denoising with kernel prediction networks. In: Proceedings of the IEEE conference
on computer vision and pattern recognition. pp. 2502–2510 (2018) 9
28. Mildenhall, B., Hedman, P., Martin-Brualla, R., Srinivasan, P.P., Barron, J.T.:
NeRF in the dark: High dynamic range view synthesis from noisy raw images.
CVPR (2022) 1, 8, 13
29. Mildenhall, B., Srinivasan, P.P., Ortiz-Cayon, R., Kalantari, N.K., Ramamoorthi,
R., Ng, R., Kar, A.: Local light field fusion: Practical view synthesis with prescrip-
tive sampling guidelines. ACM Transactions on Graphics (TOG) (2019) 8
30. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng,
R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Commu-
nications of the ACM 65(1), 99–106 (2021) 1, 2, 4, 8
31. Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives
with a multiresolution hash encoding. ACM Trans. Graph. 41(4), 102:1–102:15
(Jul 2022). https://doi.org/10.1145/3528223.3530127, https://doi.org/10.
1145/3528223.3530127 1
32. Pearl, N., Treibitz, T., Korman, S.: Nan: Noise-aware nerfs for burst-denoising. In:
CVPR (2022) 1, 4, 9, 10
33. Saharia, C., Chan, W., Chang, H., Lee, C., Ho, J., Salimans, T., Fleet, D., Norouzi,
M.: Palette: Image-to-image diffusion models. In: ACM SIGGRAPH 2022 Confer-
ence Proceedings. pp. 1–10 (2022) 3, 4
34. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour,
K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-
to-image diffusion models with deep language understanding. Advances in Neural
Information Processing Systems 35, 36479–36494 (2022) 3, 4, 5, 9
35. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014) 9
36. Skorokhodov, I., Tulyakov, S., Wang, Y., Wonka, P.: Epigraf: Rethinking training
of 3d gans. Advances in Neural Information Processing Systems 35, 24487–24501
(2022) 3, 6
37. Sun, C., Sun, M., Chen, H.: Direct voxel grid optimization: Super-fast convergence
for radiance fields reconstruction. In: CVPR (2022) 1
38. Tian, K., Jiang, Y., Yuan, Z., Bingyue, P., Wang, L.: Visual autoregressive
modeling: Scalable image generation via next-scale prediction. arXiv preprint
arXiv:2404.02905 (2024) 4
39. Wan, Z., Paschalidou, D., Huang, I., Liu, H., Shen, B., Xiang, X., Liao, J., Guibas,
L.: Cad: Photorealistic 3d generation via adversarial distillation. arXiv preprint
arXiv:2312.06663 (2023) 3
40. Wan, Z., Richardt, C., Božič, A., Li, C., Rengarajan, V., Nam, S., Xiang, X., Li,
T., Zhu, B., Ranjan, R., Liao, J.: Learning neural duplex radiance fields for real-
time view synthesis. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR). pp. 8307–8316 (June 2023) 1
41. Wan, Z., Zhang, B., Chen, D., Zhang, P., Chen, D., Liao, J., Wen, F.: Bringing
old photos back to life. In: proceedings of the IEEE/CVF conference on computer
vision and pattern recognition. pp. 2747–2757 (2020) 4
42. Wang, C., Wu, X., Guo, Y.C., Zhang, S.H., Tai, Y.W., Hu, S.M.: Nerf-sr: High
quality neural radiance fields using supersampling. In: Proceedings of the 30th
ACM International Conference on Multimedia. pp. 6445–6454 (2022) 2, 4, 10
43. Wang, P., Zhao, L., Ma, R., Liu, P.: BAD-NeRF: Bundle Adjusted Deblur Neu-
ral Radiance Fields. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR). pp. 4170–4179 (June 2023) 2, 4
44. Wang, P., Zhao, L., Ma, R., Liu, P.: Bad-nerf: Bundle adjusted deblur neural
radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition. pp. 4170–4179 (2023) 10
45. Wang, Y., Yu, J., Zhang, J.: Zero-shot image restoration using denoising diffusion
null-space model. The Eleventh International Conference on Learning Representa-
tions (2023) 3, 4
46. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment:
from error visibility to structural similarity. IEEE transactions on image processing
13(4), 600–612 (2004) 8
47. Yang, S., Wu, T., Shi, S., Lao, S., Gong, Y., Cao, M., Wang, J., Yang, Y.: Maniqa:
Multi-dimension attention network for no-reference image quality assessment. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition. pp. 1191–1200 (2022) 9
48. Zhang, K., Liang, J., Van Gool, L., Timofte, R.: Designing a practical degradation
model for deep blind image super-resolution. In: IEEE International Conference
on Computer Vision. pp. 4791–4800 (2021) 3, 4
49. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable
effectiveness of deep features as a perceptual metric. In: CVPR (2018) 7, 9
50. Zhang, W., Zhai, G., Wei, Y., Yang, X., Ma, K.: Blind image quality assessment
via vision-language correspondence: A multitask learning perspective. In: IEEE
Conference on Computer Vision and Pattern Recognition. pp. 14071–14081 (2023)
9
51. Zhang, W., Li, X., Chen, X., Qiao, Y., Wu, X.M., Dong, C.: Seal: A frame-
work for systematic evaluation of real-world super-resolution. arXiv preprint
arXiv:2309.03020 (2023) 4
52. Zhou, K., Li, W., Wang, Y., Hu, T., Jiang, N., Han, X., Lu, J.: Nerflix: High-quality
neural view synthesis by learning a degradation-driven inter-viewpoint mixer. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition. pp. 12363–12374 (2023) 2, 4, 10
53. Zhou, Y., Li, Z., Guo, C.L., Bai, S., Cheng, M.M., Hou, Q.: Srformer: Permuted
self-attention for single image super-resolution. arXiv preprint arXiv:2303.09735
(2023) 4, 13
RaFE: Generative Radiance Fields Restoration
Supplementary Material
Zhongkai Wu1, Ziyu Wan2, Jing Zhang1, Jing Liao2, and Dong Xu3
1 College of Software, Beihang University, China
2 City University of Hong Kong, China
3 The University of Hong Kong, China
ZhongkaiWu@buaa.edu.cn
RaFE.github.io
1 Overview
The input images and corresponding viewpoints are randomly selected during the training process. We randomly sample the random code z from a normal distribution, and we use a discriminator learning rate of 0.002 and a generator learning rate of 0.0025. In the early stage of training, we blur the images to stabilize the training process and gradually reduce the blur kernel size to zero. Furthermore, we use a density regularization which minimizes the density differences between adjacent sampled points.
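A minimal sketch of such a regularizer, assuming the densities of consecutive samples along each ray are stored in a (rays, samples) tensor:

def density_smoothness(sigma):
    # sigma: (R, N) densities of N consecutive samples along each of R rays.
    # Penalize differences between neighbouring samples to discourage floating density artifacts.
    return (sigma[:, 1:] - sigma[:, :-1]).abs().mean()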
For the NeRF-Synthetic benchmark dataset (Blender), we sample 192 points
for each ray, with 128 stratified sampling points and 48 importance sampling
points. We assume that the blender object is in a [−1.5, 1.5]3 cube and set the
near and far planes of the ray to 2 and 6, respectively. For training, we restore 10K high-quality images from viewpoints randomly selected among the 200 training views, and we set the batch size to 32 and the minibatch standard deviation group size to 4 for optimizing the generator.
For the forward-facing datasets, we sample 192 points for each ray, with 128
stratified sampling points and 48 importance sampling points. We use normalized
device coordinates (NDC) and set the near plane to 0 and the far plane to 1. The
derivation of NDC can be found in NeRF [?]. The size of the dataset, the batch
size, and the minibatch standard deviation group size are the same as those in
the Blender dataset.
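For readability, the per-dataset ray-sampling settings listed above can be collected in a small configuration; the values are copied from the text and the dictionary keys are only illustrative.

BLENDER_CONFIG = {
    "samples_per_ray": 192, "stratified_samples": 128, "importance_samples": 48,
    "scene_cube": 1.5,              # object assumed inside the [-1.5, 1.5]^3 cube
    "near": 2.0, "far": 6.0,
    "restored_images": 10_000, "batch_size": 32, "minibatch_std_group": 4,
}

LLFF_CONFIG = {
    "samples_per_ray": 192, "stratified_samples": 128, "importance_samples": 48,
    "use_ndc": True,                # normalized device coordinates
    "near": 0.0, "far": 1.0,
    "restored_images": 10_000, "batch_size": 32, "minibatch_std_group": 4,
}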
The coarse NeRF branch is optimized with the following objective:
\mathcal{L}_{coarse} = \lambda_{rec}\mathcal{L}_{rec} + \lambda_{tv}\mathcal{L}_{tv} + \lambda_{dis}\mathcal{L}_{dis}, \qquad (1)
where λrec , λtv and λdis denote the trade-off parameters. Specifically, we use
λrec = 1, λtv = 0.01 and λdis = 0.001 in our experiments.
4 More Discussions
Effects of residual coarse NeRF. In this ablation study, we show that the residual coarse NeRF helps the model be aware of the geometry and render highly detailed images, as shown in Fig. 1. We directly drop the coarse NeRF and train the generator from scratch with the GAN loss and the LPIPS
loss, forcing the generator to learn the coarse structure and details jointly. We
can observe that with the addition of coarse NeRF, RaFE could better model
the structural information (like the Lego baseplate) and transparent material
(like the drumhead).
Fig. 1: Ablation study of residual coarse NeRF. Our full pipeline renders the
images with better quality, demonstrating the great effectiveness of the residual coarse
NeRF.
Effects of view direction. In this ablation study, we remove the view direction condition and let the RGB value be decoded directly by the first MLP jointly with the density. As
shown in Fig. 2, objects with non-Lambertian surfaces exhibit significant light
reflections. In contrast, without view direction information, the object in the
figure appears to have no reflections.
Fig. 2: Ablation of view direction. Reflections can be observed when using view
direction conditioning.
Fig. 3: Ablation of the patch sampling strategy. The training process becomes unstable without the patch sampling strategy, causing severe artifacts.
Even for the NeRF-like degradation, our method still demonstrates clearly improved performance.
In this section, we describe the details of the diversity score used to evaluate the
diversity of the generated image set. In particular, we follow the computational
methods described in [?]. We first calculate the minimal LPIPS score for each
image with other images in the image set. Then we average the per-image scores
to get the overall score for the whole image set. A higher diversity score indicates greater diversity of the image set. Algorithm 1 summarizes the computation.
Algorithm 1: Diversity score of an image set I.

def diversity_score(images, lpips_fn):
    # For each image, find the minimal LPIPS distance to any other image
    # in the set, then average these per-image minima over the whole set.
    s_sum = 0.0
    for i, img_s in enumerate(images):
        s_min = float("inf")
        for j, img_d in enumerate(images):
            if i != j:
                s_min = min(s_min, lpips_fn(img_s, img_d))
        s_sum += s_min
    return s_sum / len(images)