Article

RepECN: Making ConvNets Better Again for Efficient Image Super-Resolution

1 School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510275, China
2 School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China
* Author to whom correspondence should be addressed.
Sensors 2023, 23(23), 9575; https://doi.org/10.3390/s23239575
Submission received: 9 October 2023 / Revised: 22 November 2023 / Accepted: 30 November 2023 / Published: 2 December 2023

Abstract

Traditional Convolutional Neural Network (ConvNet, CNN)-based image super-resolution (SR) methods have low computation costs, making them well suited to real-world scenarios. However, they suffer from inferior performance. On the contrary, Vision Transformer (ViT)-based SR methods have recently achieved impressive performance, but they often suffer from high computation costs and model storage overhead, making it hard for them to meet the requirements of practical application scenarios. In practical scenarios, an SR model should reconstruct an image with high quality and fast inference. To handle this issue, we propose a novel CNN-based Efficient Residual ConvNet enhanced with structural Re-parameterization (RepECN) for a better trade-off between performance and efficiency. A stage-to-block hierarchical architecture design paradigm inspired by ViT is utilized to keep state-of-the-art performance, while efficiency is ensured by abandoning the time-consuming Multi-Head Self-Attention (MHSA) and by re-designing the block-level modules based on CNN. Specifically, RepECN consists of three structural modules: a shallow feature extraction module, a deep feature extraction module, and an image reconstruction module. The deep feature extraction module comprises multiple ConvNet Stages (CNS), each containing 6 Re-Parameterization ConvNet Blocks (RepCNB), a head layer, and a residual connection. The RepCNB utilizes larger kernel convolutions rather than MHSA to enhance the capability of learning long-range dependencies. In the image reconstruction module, an upsampling module consisting of nearest-neighbor interpolation and pixel attention is deployed to reduce parameters and maintain reconstruction performance, while bicubic interpolation on another branch allows the backbone network to focus on learning high-frequency information. The extensive experimental results on multiple public benchmarks show that our RepECN can achieve 2.5∼5× faster inference than the state-of-the-art ViT-based SR model with better or competitive super-resolving performance, indicating that our RepECN can reconstruct high-quality images with fast inference.

1. Introduction

Single Image Super-Resolution (SISR), which aims to reconstruct a high-resolution (HR) image from a low-resolution (LR) image, is an ill-posed problem without a unique solution. As an efficient data-driven technology, deep learning-based SISR methods have shown promising results and achieved better quantitative and qualitative performance than traditional methods. These super-resolution (SR) models can be divided into three categories: convolutional neural network-based SR methods [1,2], Transformer-based SR methods [3,4], and generative adversarial network-based SR methods [5,6].
However, deep learning-based methods require significant computation costs and storage resources to provide high reconstruction accuracy, hindering them from being deployed in resource-limited platforms or scenarios, such as live streaming [7], phone imaging [8], etc. Therefore, an SR model with high super-resolving performance and fast inference is urgently required to meet the requirements of resource-limited scenarios.
Lightweight SR models have recently been proposed, but they still face challenges in making a better trade-off between inference speed and reconstruction performance. Transformer-based methods, such as SwinIR [4], ESRT [9], and LBNet [10], have shown better performance than CNN-based lightweight models, like ESRN [11], LBFN [12], and ShuffleMixer [13]. However, the multi-head self-attention and encoder–decoder designs overlook the actual inference latency caused by the large memory access cost (MAC) and the limited parallelism of the network structure. Our statistical experiments demonstrate that Transformer-based methods suffer from high latencies even with small parameter sizes, as illustrated in Figure 1. In contrast, CNN-based methods with simple structures infer much faster than other designs but suffer from lower reconstruction performance. Thus, ConvNets are often adopted to build efficient and lightweight models for improving inference speed. SR-LUT [14] and SPLUT [15] can reconstruct images faster at the expense of severe performance degradation. Wu et al. [16] explored a compiler-aware SR neural architecture search (NAS) framework to achieve real-time inference on GPU/DSP platforms for mobile devices. However, such work faces difficulties when deploying or directly transferring pre-trained models to different hardware platforms with varying instruction architectures. With these considerations, RepSR [17] aims to improve the performance of VGG-like [18] CNN-based models but still has a low performance ceiling.
To make a better trade-off between reconstruction performance and inference latency for practical scenarios, we propose a pure CNN-based Efficient Residual ConvNet with structural Re-parameterization (RepECN). The architecture follows the stage-to-block hierarchical design of ViT-based models to offer both fast inference and high-quality image reconstruction. RepECN has three key structural components: a shallow feature extraction module, a deep feature extraction module, and an image reconstruction module. The deep feature extraction module comprises several ConvNet Stages (CNS), each containing six Re-Parameterization ConvNet Blocks (RepCNB), a head layer, and a residual connection. By employing the Transformer-like stage-to-block design, this module learns channel and spatial information through different convolution structures, enabling faster processing while maintaining similar parameter numbers and performance compared to Transformer-based models. In addition, we propose a novel image reconstruction module based on nearest-neighbor interpolation and pixel attention to save parameters and maintain reconstruction performance. The extensive experimental results show that our RepECN can achieve 2.5∼5× faster inference than the state-of-the-art ViT-based SR model with better or competitive super-resolving performance, indicating that our RepECN achieves a better trade-off between super-resolution quality and inference latency for resource-limited scenarios.
In summary, the main contributions of this paper are as follows:
  • We propose RepECN, an efficient and high-accuracy SR model that offers fast inference and high-quality image reconstruction using a Transformer-like stage-to-block design paradigm.
  • To further improve performance, we employ a large kernel Conv module inspired by ConvNeXt and an Asymmetric Re-Parameterization technique, which we show performs better than symmetric square Re-Parameterization techniques.
  • To save parameters and maintain reconstruction performance, we propose a novel image reconstruction module based on nearest-neighbor interpolation and pixel attention.
  • Extensive experimental results show that our RepECN can achieve 2.5∼5× faster inference than the state-of-the-art ViT-based SR model with better or competitive super-resolving performance.

2. Related Work

2.1. CNN-Based Efficient SR

FSRCNN [2] uses upsampling at the end of the model and optimizes the width and depth of the convolutional layers of the pioneering model SRCNN [1]. However, its performance is no longer competitive. Inspired by residual learning, VDSR [23] and EDSR [24] were proposed to allow deeper networks and avoid gradient vanishing and degradation problems. Later, a series of SR methods that increase the depth and width of the network (e.g., RCAN [25], RDN [26]) achieved state-of-the-art (SOTA) performance. However, huge Multiply-Accumulates (MACs) and parameter counts limit their deployment on hardware-limited platforms. To solve this problem, some SR methods [19,20,21,27] focus on improving efficiency. IDN [19] and IMDN [20] use a channel-splitting strategy to reduce computational complexity at the cost of redundant parameters. Luo et al. [21] combine residual blocks with the proposed lattice block and introduce LatticeNet for fast and accurate SR. MIPN [27] polymerizes multi-scale image features extracted by convolutions with different kernel sizes. The MAI 2021 Challenge [28] brought several extremely lightweight models [29,30] with real-time inference latency. However, most are optimized for specific mobile NPU platforms, and their SR performance is insufficient. Wu et al. [16] use a neural architecture search (NAS) framework with adaptive SR blocks to find an appropriate model for real-time SR inference. However, it needs to retrain the model when the environment changes, so it cannot be used on new devices directly. Unlike these methods, which mainly focus on efficiency, we aim at the trade-off between latency and accuracy.

2.2. Transformer-Based Efficient SR

Dosovitskiy et al. [31] first applied a vision transformer to image recognition. Since then, high-accuracy Transformer-based image SR methods have become popular. IPT [3] uses a vanilla Vision Transformer (ViT) pre-trained on the ImageNet dataset. SwinIR [4] brings the Swin Transformer [32] to image restoration tasks and achieves state-of-the-art performance. However, fewer parameters and MACs do not necessarily lead to lower inference latency, because other factors, such as memory access cost and degree of parallelism, also affect latency. Transformer-based methods suffer from time-consuming and memory-intensive operations, including quadratic-complexity Multi-Head Self-Attention (MHSA) and inefficient, non-parallelized window partitioning. Therefore, some works focus on designing lightweight Transformer-based methods [10,33,34]. A2N [33] achieves a lightweight design by studying the effectiveness of the attention mechanism. LBNet [10] uses a hybrid network of CNN and Transformer to build an efficient model. SMN [34] simplifies MHSA by separating spatial modulation from channel aggregation, making the long-range interaction lightweight. However, there is still potential for improvement in terms of accuracy.

2.3. Large Kernel ConvNet

After the introduction of VGG [18], large kernel ConvNets lost popularity due to the higher number of parameters and MACs they require, which is not appropriate for lightweight model designs. However, large kernel convolutions have regained importance with the development of novel efficient techniques and structures such as Transformers and MLPs. ConvMixer [35], ConvNeXt [36], and RepLKNet [37] utilize large kernel depth-wise convolutions to redesign ConvNets and achieve performance competitive with Transformers. In addition, LKASR [38] explores the use of large kernels in lightweight models for the image SR task. However, there is still potential for improvement in SR performance. In this paper, we explore the combination of large kernel convolution and structural re-parameterization to further improve performance without additional computational cost at the inference phase.

2.4. Structural Re-Parameterization

Structural Re-parameterization [39,40,41] equivalently converts model structures by transforming parameters between training and inference time. These techniques enhance off-the-shelf models without modifying the inference-time CNN architecture. Specifically, Ding et al. [39] improve performance without any inference-time cost by using the Asymmetric Convolutional Block (ACB). ACB uses 1D asymmetric convolutions to strengthen the square convolution kernels within a single convolution block. It also uses batch normalization (BN) [42] at training time to reduce overfitting and accelerate training on high-level vision tasks. In addition, Ding et al. [40] design a more complex version (DBB) that utilizes symmetric square kernels in the branches during training. DBB performs better on high-level tasks but worse on SR tasks than ACB. RepSR [17] and RMBN [43] use variants of DBB on VGG-like CNNs for SR. However, the SR quality of RepSR is much lower than that of Transformer-based models, and RepSR introduces artifacts when BN is used in a VGG-like SR model. This paper explores the use of asymmetric structural re-parameterization with BN on large kernel convolutions for image SR.

3. Methods

In this section, we first outline the architecture of the proposed Efficient Residual ConvNet with structural Re-parameterization (RepECN) and then introduce the ConvNet Stages (CNS), Re-Parameterization ConvNet Blocks (RepCNB), and the lightweight upsampling module.

3.1. Network Architecture

We leverage the high-performance, Transformer-like stage-to-block design paradigm and the low computation cost of a pure convolution structure to build an efficient and high-accuracy network for image super-resolution. As shown in Figure 2, RepECN mainly consists of three modules: shallow feature extraction, deep feature extraction, and high-quality image reconstruction. Models of different sizes share the same structure and differ only in the number of CNS and backbone channels. The architecture can also be expected to perform well on other image restoration tasks.

3.1.1. Shallow and Deep Feature Extraction

Given a low-resolution (LR) image input $I_{LR} \in \mathbb{R}^{H \times W \times C_{in}}$ ($H$, $W$, and $C_{in}$ denote the LR image height, width, and number of input channels, respectively), we use $A_{SF}(\cdot)$ to denote an ACB with a $3 \times 3$ kernel. The corresponding shallow feature $O_0 \in \mathbb{R}^{H \times W \times C}$ is extracted as
$$O_0 = A_{SF}(I_{LR}), \tag{1}$$
where $C$ is the number of output feature channels. Such an ACB enhances the standard square-kernel convolution layer, so it provides a simpler and more effective way to map the low-dimensional image space to a high-dimensional feature space than conventional shallow feature extraction. In the next module, we extract the deep feature $O_{DF} \in \mathbb{R}^{H \times W \times C}$ from $O_0$ as
$$O_{DF} = F_{DF}(O_0), \tag{2}$$
where $F_{DF}(\cdot)$ denotes the entire deep feature extraction module, which consists of $K$ ConvNet Stages (CNS), a LayerNorm (LN), and an ACB. More specifically, expanding Equation (2), the intermediate outputs $\{O_1, O_2, \ldots, O_K\}$ of the CNS and the final output $O_F$ of the entire feature extraction module are calculated stage by stage as
$$O_i = F_{CNS_i}(O_{i-1}), \quad i = 1, 2, \ldots, K, \qquad O_{DF} = A_{DF}(\mathrm{LN}(O_K)), \qquad O_F = O_{DF} + O_0, \tag{3}$$
where $F_{CNS_i}$ is the $i$-th CNS and $A_{DF}$ is an ACB with a $3 \times 3$ kernel at the end of the module. Such an ACB brings an inductive bias into the depth-wise ConvNet-based network, which helps aggregate shallow and deep features. Meanwhile, the long skip connection aggregates the shallow and deep features, carrying the low-frequency information directly to the next module.
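To make Equations (1)–(3) concrete, the following PyTorch-style sketch wires together the shallow feature, the $K$ ConvNet Stages, the final LayerNorm and ACB, and the long skip connection. The class name DeepFeatureExtraction, the GroupNorm stand-in for the LayerNorm, and the plain 3 × 3 convolutions used in place of ACBs and full CNSs are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class DeepFeatureExtraction(nn.Module):
    """Sketch of Eqs. (1)-(3): shallow feature, K CNS stages, LN + ACB tail, long skip."""
    def __init__(self, in_channels=3, channels=60, num_stages=5):
        super().__init__()
        # A_SF: shallow feature extraction (an ACB behaves as a plain 3x3 conv at inference)
        self.shallow = nn.Conv2d(in_channels, channels, 3, padding=1)
        # K ConvNet Stages; a real CNS holds six RepCNBs, a LayerNorm, and a head ACB (Section 3.2)
        self.stages = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.GELU())
             for _ in range(num_stages)]
        )
        self.norm = nn.GroupNorm(1, channels)  # simple normalization stand-in for the LayerNorm
        self.tail = nn.Conv2d(channels, channels, 3, padding=1)  # A_DF

    def forward(self, lr_image):
        o0 = self.shallow(lr_image)         # Eq. (1): O_0 = A_SF(I_LR)
        x = o0
        for stage in self.stages:           # Eq. (3): O_i = F_CNS_i(O_{i-1})
            x = stage(x)
        o_df = self.tail(self.norm(x))      # O_DF = A_DF(LN(O_K))
        return o_df + o0                    # O_F = O_DF + O_0 (long skip connection)
```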

3.1.2. Image Reconstruction

The input LR image contains the most primitive information, which should guide the reconstruction output. Additionally, bicubic interpolation can upsample the LR image directly while maintaining the original information. Considering this, we reconstruct the super-resolution (SR) image $I_{SR}$ as
$$I_{SR} = U_F(O_F) + U_{LR}(I_{LR}), \tag{4}$$
where $U_F(\cdot)$ and $U_{LR}(\cdot)$ denote the upsampling of the extracted feature and the bicubic interpolation of the LR image, respectively. The benefit of this aggregation is that the backbone network can focus on learning the high-frequency information needed to refine the conventional upsampling of the LR image into a high-quality SR image. The upsampling of the extracted feature is implemented by nearest-neighbor interpolation, ACBs, and pixel attention (PA), as described in Section 3.3.
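The two-branch reconstruction in Equation (4) can be summarized by a short sketch. The function name reconstruct and the upsampler argument (standing in for the learned module of Section 3.3) are hypothetical; only the aggregation of the learned branch and the bicubic skip follows the paper.

```python
import torch
import torch.nn.functional as F

def reconstruct(o_f, lr_image, upsampler, scale=4):
    """Sketch of Eq. (4): I_SR = U_F(O_F) + U_LR(I_LR)."""
    # Learned branch: maps the deep feature O_F to the HR residual image.
    sr_residual = upsampler(o_f)
    # Skip branch: bicubic interpolation keeps the low-frequency content, so the backbone
    # only needs to learn the high-frequency correction.
    sr_base = F.interpolate(lr_image, scale_factor=scale, mode="bicubic", align_corners=False)
    return sr_residual + sr_base
```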

3.1.3. Loss Function

The parameters of our network are optimized with the smooth $L_1$ loss
$$\mathcal{L} = \begin{cases} 0.5 \times \lVert I_{SR} - I_{HR} \rVert_2^2, & \text{if } \lVert I_{SR} - I_{HR} \rVert_1 < 1 \\ \lVert I_{SR} - I_{HR} \rVert_1 - 0.5, & \text{otherwise,} \end{cases} \tag{5}$$
where $I_{HR}$ denotes the corresponding ground-truth HR image, and $I_{SR}$ is the output of RepECN with $I_{LR}$ as the input. The smooth $L_1$ loss converges faster than the naive $L_1$ pixel loss.
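Equation (5) is the standard smooth $L_1$ (Huber-style) loss with a threshold of 1. The sketch below shows an element-wise form and, under that assumption, how it coincides with PyTorch's built-in nn.SmoothL1Loss with beta = 1.0; it is an illustration rather than the authors' training code.

```python
import torch
import torch.nn as nn

# PyTorch's built-in smooth L1 loss matches Eq. (5) when beta = 1.0.
criterion = nn.SmoothL1Loss(beta=1.0)

def smooth_l1(sr, hr):
    """Element-wise form of Eq. (5), averaged over all pixels (for illustration)."""
    diff = (sr - hr).abs()
    loss = torch.where(diff < 1, 0.5 * diff ** 2, diff - 0.5)
    return loss.mean()

sr = torch.rand(1, 3, 64, 64)
hr = torch.rand(1, 3, 64, 64)
assert torch.allclose(criterion(sr, hr), smooth_l1(sr, hr), atol=1e-6)
```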

3.2. ConvNet Stages

The ConvNet Stage (CNS) is a residual block consisting of six Re-Parameterization ConvNet Blocks (RepCNBs), a LayerNorm, and an ACB, as shown in Figure 2a. Each CNS in Equation (3) takes a feature as its input. For the $i$-th CNS, we write the input $O_{i-1}$ as $O_{i,0}$ for convenience. Inside the CNS, we obtain intermediate outputs $\{O_{i,1}, O_{i,2}, \ldots, O_{i,L}\}$ by $L$ RepCNBs as
$$O_{i,j} = F_{RepCNB_{i,j}}(O_{i,j-1}), \quad j = 1, 2, \ldots, L, \tag{6}$$
where $F_{RepCNB_{i,j}}(\cdot)$ denotes the $j$-th RepCNB. Then, an ACB (the head layer) is added before the residual connection. The total output of the $i$-th CNS is formulated as
$$O_i = F_{ACB_i}(\mathrm{LN}(O_{i,L})) + O_{i,0}, \tag{7}$$
where $F_{ACB_i}(\cdot)$ is the ACB at the end of the $i$-th CNS. The ACB can be treated as a standard convolution, while the RepCNB consists of depth-wise and point-wise convolutions. The standard convolution, with a small and spatially invariant filter, brings a different view of the features, which benefits translation equivariance. In addition, the residual connection aggregates different hierarchies of features, allowing the block to fit more complex feature mappings.
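A minimal sketch of one CNS follows Equations (6) and (7): $L$ RepCNBs, a channel-wise LayerNorm, a head ACB (represented here by its inference-time 3 × 3 convolution), and the stage-level residual connection. The placeholder block_fn and the ChannelLayerNorm helper are assumptions for illustration; a full RepCNB (Section 3.2.1) would be plugged in as block_fn.

```python
import torch
import torch.nn as nn

class ChannelLayerNorm(nn.Module):
    """LayerNorm over the channel dimension of NCHW tensors."""
    def __init__(self, channels, eps=1e-6):
        super().__init__()
        self.norm = nn.LayerNorm(channels, eps=eps)
    def forward(self, x):
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

class CNS(nn.Module):
    """Sketch of one ConvNet Stage: L RepCNBs -> LayerNorm -> head ACB -> residual (Eqs. (6)-(7))."""
    def __init__(self, channels, num_blocks=6, block_fn=None):
        super().__init__()
        if block_fn is None:
            # Minimal placeholder block; replace with a RepCNB in the full model.
            block_fn = lambda c: nn.Sequential(
                nn.Conv2d(c, c, 3, padding=1, groups=c), nn.GELU(), nn.Conv2d(c, c, 1))
        self.blocks = nn.Sequential(*[block_fn(channels) for _ in range(num_blocks)])
        self.norm = ChannelLayerNorm(channels)
        self.head = nn.Conv2d(channels, channels, 3, padding=1)  # head ACB at inference time

    def forward(self, x):
        identity = x                                  # O_{i,0}
        x = self.blocks(x)                            # Eq. (6): O_{i,j} = RepCNB_{i,j}(O_{i,j-1})
        return self.head(self.norm(x)) + identity    # Eq. (7)
```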

3.2.1. Re-Parameterization ConvNet Blocks

The Re-Parameterization ConvNet Blocks (RepCNB) are based on a residual block inspired by the ConvNeXt [36]. The main difference is that we use ACB to enhance the square convolution kernel inside RepCNB. As shown in Figure 2b, given an input with x channels, a RepCNB first uses a depth-wise ACB with a 7 × 7 kernel to extract a feature with the x channels. A layer normalization (LN) layer is added behind it. Then, two point-wise convolutional layers are added to learn features across the channel before the residual connection, with GELU non-linearity between them. The first point-wise layer accepts the output of LN with an x channel as the input and obtains a feature with 4 x channels. The corresponding second point-wise layer takes the feature above as input and obtains the final output with x channels.
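A RepCNB can be sketched as a ConvNeXt-style block in PyTorch. Here the 7 × 7 depth-wise ACB is replaced by its inference-time equivalent (a plain 7 × 7 depth-wise convolution), and the point-wise convolutions are written as Linear layers on channel-last tensors, which is a common implementation choice rather than a detail stated in the paper.

```python
import torch
import torch.nn as nn

class RepCNB(nn.Module):
    """Sketch of a Re-Parameterization ConvNet Block (Section 3.2.1).

    At training time the 7x7 depth-wise conv would be an ACB (square kernel plus
    horizontal/vertical 1D branches); at inference it collapses to this single conv.
    """
    def __init__(self, channels):
        super().__init__()
        self.dwconv = nn.Conv2d(channels, channels, kernel_size=7, padding=3, groups=channels)
        self.norm = nn.LayerNorm(channels, eps=1e-6)      # applied over the channel dim (NHWC)
        self.pwconv1 = nn.Linear(channels, 4 * channels)  # point-wise conv, x -> 4x channels
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * channels, channels)  # point-wise conv, 4x -> x channels

    def forward(self, x):
        identity = x
        x = self.dwconv(x)                 # depth-wise 7x7 (ACB at training time)
        x = x.permute(0, 2, 3, 1)          # NCHW -> NHWC for LayerNorm / Linear layers
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)          # back to NCHW
        return x + identity                # residual connection
```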

3.2.2. Asymmetric Convolutional Block

An Asymmetric Convolutional Block (ACB) is a block that uses the structural re-parameterization technique [39]; it is identical to a standard convolution at inference time but differs at training time. Figure 3 compares a standard convolution (Conv) and an ACB with a kernel size of $3 \times 3$. The ACB or Conv takes a feature $I_{ACB}$ as the input. At training time, the ACB uses three bias-free convolutional layers $\{F_{conv_1}, F_{conv_2}, F_{conv_3}\}$ with kernel sizes of $3 \times 3$, $1 \times 3$, and $3 \times 1$, respectively. After batch normalization (BN) of each convolutional output, the ACB obtains the output $O_{ACB}$ by merging the three branches with element-wise summation as
$$O_{ACB} = \sum_{c=1}^{3} \left( \left( F_{conv_c}(I_{ACB}) - \mu_c \right) \frac{\gamma_c}{\sigma_c} + \beta_c \right), \tag{8}$$
where $\mu_c$, $\sigma_c$, $\gamma_c$, and $\beta_c$ denote the channel-wise mean, standard deviation, learned scaling factor, and bias term of the $c$-th branch, respectively, while $\sum_{c=1}^{3}$ denotes element-wise summation over the three branches. At inference time, the ACB first merges the channel-wise BN into each Conv kernel by BN fusion and then merges the three Convs by branch fusion as
$$O_{ACB} = \sum_{c=1}^{3} \left( I_{ACB} \ast \left( \frac{\gamma_c}{\sigma_c} K_c \right) - \frac{\mu_c \gamma_c}{\sigma_c} + \beta_c \right) = I_{ACB} \ast \sum_{c=1}^{3} \frac{\gamma_c}{\sigma_c} K_c + \sum_{c=1}^{3} \left( \beta_c - \frac{\mu_c \gamma_c}{\sigma_c} \right),$$
$$K_{inf} = \sum_{c=1}^{3} \frac{\gamma_c}{\sigma_c} K_c, \qquad b_{inf} = \sum_{c=1}^{3} \left( \beta_c - \frac{\mu_c \gamma_c}{\sigma_c} \right), \tag{9}$$
where $K_c$ denotes the kernel of the bias-free convolutional layer $F_{conv_c}$ and $\ast$ denotes convolution. The ACB is finally converted to a standard convolutional layer with kernel $K_{inf}$ and bias $b_{inf}$.
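The conversion in Equations (8) and (9) can be illustrated with a small training-time ACB and a fuse() routine that folds each branch's BN into its kernel and zero-pads the 1 × 3 and 3 × 1 kernels into a single 3 × 3 convolution. The class and method names are hypothetical sketches; the arithmetic follows the equations, and the final check only holds in eval mode, where BN uses its running statistics.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACB(nn.Module):
    """Training-time ACB: parallel 3x3, 1x3, and 3x1 conv branches, each followed by BN."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.square = nn.Conv2d(in_ch, out_ch, (3, 3), padding=(1, 1), bias=False)
        self.hor = nn.Conv2d(in_ch, out_ch, (1, 3), padding=(0, 1), bias=False)
        self.ver = nn.Conv2d(in_ch, out_ch, (3, 1), padding=(1, 0), bias=False)
        self.bn_square = nn.BatchNorm2d(out_ch)
        self.bn_hor = nn.BatchNorm2d(out_ch)
        self.bn_ver = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        # Eq. (8): element-wise sum of the three BN-normalized branches.
        return (self.bn_square(self.square(x))
                + self.bn_hor(self.hor(x))
                + self.bn_ver(self.ver(x)))

    @torch.no_grad()
    def fuse(self):
        """Merge BN into each branch and sum the branches (Eq. (9)) into one 3x3 conv."""
        k_sum, b_sum = 0.0, 0.0
        for conv, bn, pad in ((self.square, self.bn_square, (0, 0, 0, 0)),
                              (self.hor, self.bn_hor, (0, 0, 1, 1)),   # pad 1x3 -> 3x3 (rows)
                              (self.ver, self.bn_ver, (1, 1, 0, 0))):  # pad 3x1 -> 3x3 (cols)
            scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)    # gamma_c / sigma_c
            k = F.pad(conv.weight, pad) * scale.reshape(-1, 1, 1, 1)   # contribution to K_inf
            b = bn.bias - bn.running_mean * scale                      # beta_c - mu_c gamma_c / sigma_c
            k_sum, b_sum = k_sum + k, b_sum + b
        fused = nn.Conv2d(self.square.in_channels, self.square.out_channels, 3, padding=1)
        fused.weight.copy_(k_sum)
        fused.bias.copy_(b_sum)
        return fused

# Quick equivalence check (eval mode, so BN uses running statistics).
acb = ACB(8, 8).eval()
x = torch.rand(1, 8, 16, 16)
assert torch.allclose(acb(x), acb.fuse()(x), atol=1e-5)
```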

3.3. Lightweight Upsampling Module

As shown in Figure 4, we choose nearest-neighbor interpolation to upsample the input feature, followed by an ACB. Compared with sub-pixel convolution such as pixel shuffle, this choice reduces the parameter count without performance degradation. We first use an upsampling operation to transfer the feature $O_F$ from the feature extraction module in Equation (3). The upsampling operation consists of several pairs of nearest-neighbor interpolation and ACB. Each pair upsamples by a scale factor of 2 or 3, so the whole module supports scale factors of $2^N$ or 3; arbitrary scale factors could be supported by adjusting the interpolation scale factor. Then, inspired by PAN [44], we employ a pixel attention (PA) layer and an ACB to reconstruct the SR feature. The PA enhances the reconstruction and improves the SR quality. Finally, a second ACB layer generates the output $U_F(O_F)$ of the upsampling module in Equation (4).
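One possible sketch of the lightweight upsampling module is shown below: pairs of nearest-neighbor interpolation and a convolution (standing in for an ACB), followed by pixel attention and two further convolutions. The PixelAttention and LightweightUpsampler names, the exact channel choices, and the way the scale factor is decomposed into 2× and 3× steps are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAttention(nn.Module):
    """Pixel attention as in PAN: a 1x1 conv + sigmoid produces a per-pixel, per-channel gate."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 1)
    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))

class LightweightUpsampler(nn.Module):
    """Sketch of Section 3.3: (nearest-neighbor, conv) pairs, then PA and two convs.

    The convs represent ACBs at inference time. Scale factors 2, 3, and 4 (= 2x2) are covered
    by stacking pairs; the loop below decomposes the scale into factors of 2 or 3.
    """
    def __init__(self, channels, out_channels=3, scale=4):
        super().__init__()
        convs, factors, s = [], [], scale
        while s > 1:
            factor = 3 if s % 3 == 0 else 2
            convs.append(nn.Conv2d(channels, channels, 3, padding=1))
            factors.append(factor)
            s //= factor
        self.convs = nn.ModuleList(convs)
        self.factors = factors
        self.pa = PixelAttention(channels)
        self.conv_hr = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_out = nn.Conv2d(channels, out_channels, 3, padding=1)

    def forward(self, x):
        for factor, conv in zip(self.factors, self.convs):
            x = conv(F.interpolate(x, scale_factor=factor, mode="nearest"))
        x = self.conv_hr(self.pa(x))   # pixel attention followed by an ACB
        return self.conv_out(x)        # second ACB produces U_F(O_F)
```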

4. Experiments

This section uses several commonly used benchmark datasets to compare the proposed network with effective and state-of-the-art SISR models. In addition, some ablation studies are used to analyze the rationality of our proposed modules.

4.1. Experimental Settings

4.1.1. Datasets and Indicators

We train the proposed network on the DIV2K dataset [45] and validate it on the Set5 [46] dataset. The 800 training and 100 validation image pairs in DIV2K are used as the training dataset. The indicators for evaluating SISR performance are the peak signal-to-noise ratio (PSNR) [47] and structural similarity index (SSIM) [48] on the benchmark datasets Set5, Set14 [49], B100 [50], Urban100 [51], and Manga109 [52]. We use MATLAB to calculate them on the Y channel of the YCbCr space converted from the RGB image.
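For reference, a sketch of this evaluation convention is given below, assuming the ITU-R BT.601 coefficients used by MATLAB's rgb2ycbcr for the RGB-to-Y conversion and a border crop equal to the scale factor; the authors' exact evaluation script may differ in these details.

```python
import numpy as np

def rgb_to_y(img):
    """Luma (Y) channel of an 8-bit RGB image (H x W x 3, values in 0-255), BT.601 coefficients."""
    img = img.astype(np.float64)
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]) / 255.0

def psnr_y(sr, hr, crop_border=2):
    """PSNR on the Y channel, cropping `crop_border` pixels (often set to the scale factor)."""
    y_sr, y_hr = rgb_to_y(sr), rgb_to_y(hr)
    if crop_border > 0:
        y_sr = y_sr[crop_border:-crop_border, crop_border:-crop_border]
        y_hr = y_hr[crop_border:-crop_border, crop_border:-crop_border]
    mse = np.mean((y_sr - y_hr) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
```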

4.1.2. Training Details

We group the efficient models into three size levels according to the parameter number: the parameter numbers of the tiny, small, and base sizes are smaller than 100 K, 500 K, and 1500 K, respectively. The training hyperparameters of our RepECN-T (tiny), RepECN-S (small), and RepECN (base) models are listed in Table 1. The RepCNB and channel columns denote the number of RepCNBs in each CNS and the number of channels of each intermediate feature, while the patch column denotes the size of the RGB patches cropped from LR images as input. The total training epochs of RepECN-T, RepECN-S, and RepECN are set to 3000, 2000, and 1500, respectively. Each minibatch comprises 32 patches for all three models. The learning rate is set to $2 \times 10^{-4}$ and halved at $[\frac{1}{2}, \frac{4}{5}, \frac{9}{10}, \frac{19}{20}]$ of the total epochs.
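This learning-rate schedule can be expressed with a standard milestone scheduler, as in the sketch below. The Adam optimizer shown here is an assumption (the optimizer is not specified in this section), while the initial rate, the halving factor, and the milestone fractions follow the text above.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # stand-in for RepECN
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

# Halve the learning rate at 1/2, 4/5, 9/10, and 19/20 of the total training epochs.
total_epochs = 1500                            # RepECN (base); 3000 / 2000 for the T / S variants
milestones = [int(total_epochs * f) for f in (1 / 2, 4 / 5, 9 / 10, 19 / 20)]
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=0.5)
# scheduler.step() would then be called once per training epoch.
```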
The inference latency on the CPU and GPU platforms is measured for generating a 720P SR image ($1280 \times 720$) on an Intel Xeon Gold 5118 CPU (12 cores, 2.30 GHz, 6 data-loading threads) and an Nvidia Titan V GPU (12 GB of HBM2 memory and 5120 CUDA cores), respectively. Each latency is averaged over 50 runs. The multiply-accumulates (MACs) are also measured for generating a 720P SR image ($1280 \times 720$).
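The following sketch shows how such latency numbers could be measured, assuming a few warm-up iterations and CUDA synchronization around the timed region; measure_latency and its defaults (other than the 50 runs) are illustrative, not the authors' benchmarking script.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, scale, device, runs=50, warmup=10):
    """Average wall-clock latency (seconds) for producing a 1280x720 SR output.

    Assumes the LR input is (1280/scale) x (720/scale) and that GPU work is synchronized
    before readings are taken; warm-up iterations are excluded from the average.
    """
    model = model.to(device).eval()
    x = torch.rand(1, 3, 720 // scale, 1280 // scale, device=device)
    for _ in range(warmup):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs
```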

4.2. Experimental Results

Performance and Latency Comparison

To show the effectiveness of our RepECN fairly, we chose state-of-the-art Transformer-based models with similar parameter numbers that are trained on the same DIV2K dataset. Table 2 shows quantitative comparisons between the proposed RepECN and the state-of-the-art Transformer-based models SwinIR [4], ESRT [9], and LBNet [10]. Among models with fewer than 1500 K parameters, RepECN achieves the best or second-best performance on five benchmark datasets for three standard scale factors with much lower latency. Specifically, compared with the state-of-the-art SwinIR-S at similar PSNR/SSIM, RepECN needs only one-fifth of the latency for scale factor 2 on the GPU platform. Notably, LBNet and ESRT cannot run inference for scale factor 2 on our GPU platform because of memory limitations.
To show the high SR quality of our RepECN structure, we chose current CNN-based models of different parameter sizes. Notably, the training dataset of ShuffleMixer and LAPAR is DF2K (a merged dataset of DIV2K [45] and Flickr2K [53]), which contains many more image pairs. Table 3 shows quantitative comparisons between the proposed RepECN and the CNN-based models SRCNN [1], FSRCNN [2], ShuffleMixer [13], IDN [19], IMDN [20], LatticeNet [21], LapSRN [22], EDSR [24], DRRN [54], and LAPAR [55]. Our RepECN family achieves state-of-the-art performance at all tiny, small, and base sizes. Specifically, RepECN-T (fewer than 100 K parameters) outperforms ShuffleMixer-Tiny by 0.45 dB on Urban100 ($2\times$). RepECN-S (fewer than 500 K parameters) outperforms ShuffleMixer by 0.41 dB on Urban100 ($2\times$) with a similar parameter number, and it also outperforms LatticeNet by 0.06 dB on Urban100 ($2\times$) with about half the parameters. This demonstrates that our ConvNet design outperforms previous designs. In conclusion, our model achieves state-of-the-art performance with a better trade-off between inference speed and performance.
To evaluate our RepECN qualitatively, we show visual comparisons in Figure 5, including the three sizes of RepECN and corresponding-size state-of-the-art models for scale factor 4 SISR on benchmark images. All three sizes of RepECN restore higher-frequency detailed textures and alleviate blurring artifacts, producing more visually pleasing images. In contrast, most other models produce incorrect textures with blurry artifacts. Furthermore, we evaluate our model on real LR images from a historical dataset [22], as shown in Figure 6. RepECN generates smoother details with a clearer structure than other models, indicating the high effectiveness of the proposed RepECN.

4.3. Ablation Study and Analysis

For the ablation study, we train RepECN family models on DIV2K [45] for 1000 epochs for $2\times$ SISR in Section 4.3.1, Section 4.3.3 and Section 4.3.4, progressively adding useful elements to construct RepECN-T. Then, we train RepECN with the CNS, RepCNB, channel, and epoch settings of 4, 6, 60, and 1500, respectively, for $2\times$ SISR as the baseline model and modify the first three hyperparameters individually in Section 4.3.5. In addition, we train FSRCNN variants on DIV2K [45] for 3000 epochs for $2\times$ SISR in Section 4.3.2. In all sections, the performance comparison uses PSNR on the benchmark dataset Set5 [46].

4.3.1. Impact of Normalization in CNS and ACB

To explore the effect of layer normalization (LN) in each CNS, we remove the head layer in the CNS and the batch normalization (BN) inside the ACB to isolate their effects. Table 4 first shows that LN is necessary for better performance, as the SR quality of RepECN-T-A is lower than that of RepECN-T-B and RepECN-T-C. The table further shows that placing the LayerNorm before the residual connection in the CNS improves the PSNR compared with placing it after the residual connection.
In addition, we compare using batch normalization (BN) inside the ACB. The no-BN variant RepECN-T-C skips normalization and learns a bias for each convolutional layer in the ACB during training. When switching to inference, the weights and biases of the three convolutional layers in the ACB are merged into the single convolutional layer used as the inference-time ACB. Table 4 shows that the normalization inside the ACB is important, as RepECN-T-D improves the PSNR by 0.01 dB. Apart from that, training with normalization in the ACB does not converge when the residual connection from the LR input to the output is removed while using pixel shuffle upsampling.

4.3.2. Impact of Structural Re-Parameterization

To demonstrate the effectiveness of structural re-parameterization for image super-resolution (SR), we trained multiple variants of FSRCNN, a model with ample room for improvement. We first replace the upsampling module of FSRCNN with our proposed lightweight upsampling module, obtaining the variant FSRCNN-N, which improves the PSNR by 0.31 dB. Then, we apply the symmetric square-kernel structural re-parameterization technique DBB [40] to each ConvNet layer in FSRCNN-N, a similar but more complex technique than that used in RepSR [17]. FSRCNN-N-DBB improves the SR performance by a further 0.16 dB in PSNR. Finally, we replace the DBB with the asymmetric-kernel structural re-parameterization technique ACB. FSRCNN-N-ACB further improves the SR quality by 0.09 dB in PSNR. In conclusion, structural re-parameterization improves the performance of CNN-based SR models, and the asymmetric-kernel technique is better than the symmetric square one.

4.3.3. Impact of the Head Layer in CNS

The effect of using a head layer (the last ACB before the residual connection) in the CNS is shown in Table 4. The base version RepECN-T uses one $3 \times 3$ ACB as the head layer, which improves the PSNR by 0.04 dB. Furthermore, the table shows that one $3 \times 3$ ACB is better than three $3 \times 3$ ACBs (where the channel number of the second layer is one-fourth of the input and output channel number): RepECN-T-E saves a few parameters (5 K) with a 0.02 dB PSNR degradation compared with RepECN-T. To achieve higher performance, we finally choose one $3 \times 3$ ACB as the head layer in the CNS.

4.3.4. Impact of Nearest-Neighbor Interpolation with Pixel Attention in Upsampling Module

Table 4 and Table 5 show the performance improvement of the proposed upsampling module (Section 3.3) with pixel attention (PA). In Table 4, the pixel shuffle of the variant RepECN-T-G is the same as the image reconstruction module of SwinIR [4], and the variant RepECN-T-F removes the PA block from the proposed upsampling module. The table shows that nearest-neighbor interpolation saves parameters while improving performance, and that the PA is necessary, as it improves the PSNR by 0.02 dB. Table 5 shows that the proposed upsampling module significantly improves the performance of FSRCNN, with a 0.31 dB PSNR gain.

4.3.5. Impact of CNS, RepCNB, and Channel Numbers

The effects of the number of CNSs, the number of RepCNBs in each CNS, and the number of channels of each layer are shown in Figure 7. We observe that performance is positively correlated with these three hyper-parameters. In addition, as the settings increase, the performance growth tends to flatten out; there is therefore a trade-off between performance and model size. To achieve high performance and fast inference, we choose the point with the maximum change in slope as the setting. In particular, the RepCNB number of each CNS is fixed to 6, as the performance is more sensitive to reducing it than to the other hyper-parameters.

5. Conclusions and Future Works

In this paper, we propose a pure CNN-based SR model, the Efficient Residual ConvNet with structural Re-parameterization (RepECN), with fast inference and high quality. The model contains three modules: shallow feature extraction, deep feature extraction, and image reconstruction. We borrow the stage-to-block hierarchical design of ViT-based models to keep SOTA performance using ConvNets. Specifically, we propose the ConvNet Stage (CNS) for deep feature extraction. Each CNS comprises six Re-Parameterization ConvNet Blocks (RepCNBs), a basic ACB, a LayerNorm, and a residual connection. We also introduce a lightweight upsampling module containing nearest-neighbor interpolation and pixel attention, which saves parameters without performance degradation. We evaluate the proposed RepECN at different sizes on commonly used benchmark datasets. Experiments show that RepECN achieves state-of-the-art performance while providing much faster inference than Transformer-based models and much better performance than CNN-based models.

Author Contributions

Conceptualization, Q.C.; methodology, Q.C. and J.Q.; software, Q.C.; validation, Q.C. and J.Q.; formal analysis, Q.C. and J.Q.; investigation, Q.C. and J.Q.; resources, Q.C., J.Q. and W.W.; data curation, Q.C. and J.Q.; writing—original draft preparation, Q.C.; writing—review and editing, Q.C., J.Q. and W.W.; visualization, Q.C. and J.Q.; supervision, J.Q. and W.W.; project administration, Q.C.; funding acquisition, J.Q. and W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partly funded by the National Natural Science Foundation of China (NSFC) under Grant No. 62206314 and Grant No. U1711264, GuangDong Basic and Applied Basic Research Foundation under Grant No. 2022A1515011835, and China Postdoctoral Science Foundation funded project under Grant No. 2021M703687.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article. The data presented in this study are available at https://github.com/qpchen/RepECN (accessed on 28 November 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a Deep Convolutional Network for Image Super-Resolution. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 184–199. [Google Scholar] [CrossRef]
  2. Dong, C.; Loy, C.C.; Tang, X. Accelerating the Super-Resolution Convolutional Neural Network. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 391–407. [Google Scholar] [CrossRef]
  3. Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-Trained Image Processing Transformer. In Proceedings of the Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12294–12305. [Google Scholar] [CrossRef]
  4. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the International Conference on Computer Vision Workshops, Montreal, QC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar] [CrossRef]
  5. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 105–114. [Google Scholar] [CrossRef]
  6. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Loy, C.C. ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks. In Proceedings of the European Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 63–79. [Google Scholar] [CrossRef]
  7. Dong, C.; Wen, W.; Xu, T.; Yang, X. Joint Optimization of Data-Center Selection and Video-Streaming Distribution for Crowdsourced Live Streaming in a Geo-Distributed Cloud Platform. IEEE Trans. Netw. Serv. Manag. 2019, 16, 729–742. [Google Scholar] [CrossRef]
  8. Morikawa, C.; Kobayashi, M.; Satoh, M.; Kuroda, Y.; Inomata, T.; Matsuo, H.; Miura, T.; Hilaga, M. Image and Video Processing on Mobile Devices: A Survey. Vis. Comput. 2021, 37, 2931–2949. [Google Scholar] [CrossRef] [PubMed]
  9. Lu, Z.; Li, J.; Liu, H.; Huang, C.; Zhang, L.; Zeng, T. Transformer for Single Image Super-Resolution. In Proceedings of the Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 457–466. [Google Scholar]
  10. Gao, G.; Wang, Z.; Li, J.; Li, W.; Yu, Y.; Zeng, T. Lightweight Bimodal Network for Single-Image Super-Resolution via Symmetric CNN and Recursive Transformer. In Proceedings of the International Joint Conference on Artificial Intelligence, Vienna, Austria, 23–29 July 2022; Volume 2, pp. 913–919. [Google Scholar] [CrossRef]
  11. Song, D.; Xu, C.; Jia, X.; Chen, Y.; Xu, C.; Wang, Y. Efficient Residual Dense Block Search for Image Super-Resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12007–12014. [Google Scholar] [CrossRef]
  12. Wang, B.; Yan, B.; Liu, C.; Hwangbo, R.; Jeon, G.; Yang, X. Lightweight Bidirectional Feedback Network for Image Super-Resolution. Comput. Electr. Eng. 2022, 102, 108254. [Google Scholar] [CrossRef]
  13. Sun, L.; Pan, J.; Tang, J. ShuffleMixer: An Efficient ConvNet for Image Super-Resolution. In Proceedings of the NeurIPS, Virtual, 12–16 December 2022; Volume 35, pp. 17314–17326. [Google Scholar]
  14. Jo, Y.; Joo Kim, S. Practical Single-Image Super-Resolution Using Look-Up Table. In Proceedings of the Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 691–700. [Google Scholar] [CrossRef]
  15. Ma, C.; Zhang, J.; Zhou, J.; Lu, J. Learning Series-Parallel Lookup Tables for Efficient Image Super-Resolution. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 305–321. [Google Scholar] [CrossRef]
  16. Wu, Y.; Gong, Y.; Zhao, P.; Li, Y.; Zhan, Z.; Niu, W.; Tang, H.; Qin, M.; Ren, B.; Wang, Y. Compiler-Aware Neural Architecture Search for On-Mobile Real-time Super-Resolution. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 92–111. [Google Scholar] [CrossRef]
  17. Wang, X.; Dong, C.; Shan, Y. RepSR: Training Efficient VGG-style Super-Resolution Networks with Structural Re-Parameterization and Batch Normalization. In Proceedings of the ACM International Conference on Multimedia, Lisboa, Portugal, 10 October 2022; pp. 2556–2564. [Google Scholar] [CrossRef]
  18. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
  19. Hui, Z.; Wang, X.; Gao, X. Fast and Accurate Single Image Super-Resolution via Information Distillation Network. In Proceedings of the Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 723–731. [Google Scholar] [CrossRef]
  20. Hui, Z.; Gao, X.; Yang, Y.; Wang, X. Lightweight Image Super-Resolution with Information Multi-distillation Network. In Proceedings of the ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2024–2032. [Google Scholar] [CrossRef]
  21. Luo, X.; Xie, Y.; Zhang, Y.; Qu, Y.; Li, C.; Fu, Y. LatticeNet: Towards Lightweight Image Super-Resolution with Lattice Block. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 272–289. [Google Scholar] [CrossRef]
  22. Lai, W.S.; Huang, J.B.; Ahuja, N.; Yang, M.H. Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution. In Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5835–5843. [Google Scholar] [CrossRef]
  23. Kim, J.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar] [CrossRef]
  24. Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 1132–1140. [Google Scholar] [CrossRef]
  25. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 294–310. [Google Scholar] [CrossRef]
  26. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual Dense Network for Image Restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 2480–2495. [Google Scholar] [CrossRef] [PubMed]
  27. Lu, T.; Wang, Y.; Wang, J.; Liu, W.; Zhang, Y. Single Image Super-Resolution via Multi-Scale Information Polymerization Network. IEEE Signal Process. Lett. 2021, 28, 1305–1309. [Google Scholar] [CrossRef]
  28. Ignatov, A.; Timofte, R.; Denna, M.; Younes, A.; Lek, A.; Ayazoglu, M.; Liu, J.; Du, Z.; Guo, J.; Zhou, X.; et al. Real-Time Quantized Image Super-Resolution on Mobile NPUs, Mobile AI 2021 Challenge: Report. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Nashville, TN, USA, 19–25 June 2021; pp. 2525–2534. [Google Scholar] [CrossRef]
  29. Ayazoglu, M. Extremely Lightweight Quantization Robust Real-Time Single-Image Super Resolution for Mobile Devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Nashville, TN, USA, 19–25 June 2021; pp. 2472–2479. [Google Scholar] [CrossRef]
  30. Du, Z.; Liu, J.; Tang, J.; Wu, G. Anchor-Based Plain Net for Mobile Image Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Virtual, 19–25 June 2021; pp. 2494–2502. [Google Scholar] [CrossRef]
  31. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  32. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  33. Chen, H.; Gu, J.; Zhang, Z. Attention in Attention Network for Image Super-Resolution. arXiv 2021, arXiv:2104.09497. [Google Scholar]
  34. Wu, Z.; Li, J.; Huang, D. Separable Modulation Network for Efficient Image Super-Resolution. In Proceedings of the ACM International Conference on Multimedia, Vancouver, BC, Canada, 7–10 June 2023; pp. 8086–8094. [Google Scholar] [CrossRef]
  35. Trockman, A.; Kolter, J.Z. Patches Are All You Need? In Proceedings of the International Conference on Learning Representations, Virtual, 25 April 2022. [Google Scholar] [CrossRef]
  36. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11966–11976. [Google Scholar] [CrossRef]
  37. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling Up Your Kernels to 31×31: Revisiting Large Kernel Design in CNNs. In Proceedings of the Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11953–11965. [Google Scholar] [CrossRef]
  38. Feng, H.; Wang, L.; Li, Y.; Du, A. LKASR: Large Kernel Attention for Lightweight Image Super-Resolution. Knowl.-Based Syst. 2022, 252, 109376. [Google Scholar] [CrossRef]
  39. Ding, X.; Guo, Y.; Ding, G.; Han, J. ACNet: Strengthening the Kernel Skeletons for Powerful CNN via Asymmetric Convolution Blocks. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1911–1920. [Google Scholar] [CrossRef]
  40. Ding, X.; Zhang, X.; Han, J.; Ding, G. Diverse Branch Block: Building a Convolution as an Inception-like Unit. In Proceedings of the Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10881–10890. [Google Scholar] [CrossRef]
  41. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets Great Again. In Proceedings of the Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13728–13737. [Google Scholar] [CrossRef]
  42. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  43. Shen, Y.; Zheng, W.; Huang, F.; Wu, J.; Chen, L. Reparameterizable Multibranch Bottleneck Network for Lightweight Image Super-Resolution. Sensors 2023, 23, 3963. [Google Scholar] [CrossRef] [PubMed]
  44. Zhao, H.; Kong, X.; He, J.; Qiao, Y.; Dong, C. Efficient Image Super-Resolution Using Pixel Attention. In Proceedings of the European Conference on Computer Vision Workshops, Glasgow, UK, 23–28 August 2020; pp. 56–72. [Google Scholar] [CrossRef]
  45. Agustsson, E.; Timofte, R. NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 1122–1131. [Google Scholar] [CrossRef]
  46. Bevilacqua, M.; Roumy, A.; Guillemot, C.; Morel, M.l.A. Low-Complexity Single-Image Super-Resolution Based on Nonnegative Neighbor Embedding. In Proceedings of the British Machine Vision Conference, Surrey, UK, 3–7 September 2012; pp. 135.1–135.10. [Google Scholar] [CrossRef]
  47. Wang, Z.; Bovik, A.C. Mean Squared Error: Love It or Leave It? A New Look at Signal Fidelity Measures. IEEE Signal Process. Mag. 2009, 26, 98–117. [Google Scholar] [CrossRef]
  48. Wang, Z.; Bovik, A.; Sheikh, H.; Simoncelli, E. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  49. Yang, J.; Wright, J.; Huang, T.S.; Ma, Y. Image Super-Resolution Via Sparse Representation. IEEE Trans. Image Process. 2010, 19, 2861–2873. [Google Scholar] [CrossRef]
  50. Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A Database of Human Segmented Natural Images and Its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics. In Proceedings of the IEEE International Conference on Computer Vision, Kauai, HI, USA, 8–14 December 2001; Volume 2, pp. 416–423. [Google Scholar] [CrossRef]
  51. Huang, J.B.; Singh, A.; Ahuja, N. Single Image Super-Resolution from Transformed Self-Exemplars. In Proceedings of the Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5197–5206. [Google Scholar] [CrossRef]
  52. Matsui, Y.; Ito, K.; Aramaki, Y.; Fujimoto, A.; Ogawa, T.; Yamasaki, T.; Aizawa, K. Sketch-Based Manga Retrieval Using Manga109 Dataset. Multimed. Tools Appl. 2017, 76, 21811–21838. [Google Scholar] [CrossRef]
  53. Timofte, R.; Agustsson, E.; Gool, L.V.; Yang, M.H.; Zhang, L. NTIRE 2017 Challenge on Single Image Super-Resolution: Methods and Results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 1110–1121. [Google Scholar] [CrossRef]
  54. Tai, Y.; Yang, J.; Liu, X. Image Super-Resolution via Deep Recursive Residual Network. In Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2790–2798. [Google Scholar] [CrossRef]
  55. Li, W.; Zhou, K.; Qi, L.; Jiang, N.; Lu, J.; Jia, J. LAPAR: Linearly-Assembled Pixel-Adaptive Regression Network for Single Image Super-resolution and Beyond. In Proceedings of the NeurIPS, Virtual, 6–12 December 2020; Volume 33, pp. 20343–20355. [Google Scholar]
  56. Ahn, N.; Kang, B.; Sohn, K.A. Fast, Accurate, and Lightweight Super-Resolution with Cascading Residual Network. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 256–272. [Google Scholar] [CrossRef]
Figure 1. Qualitative trade-off comparison between the performance and the latency of SR models (e.g., SwinIR [4], ESRT [9], ShuffleMixer [13], IDN [19], IMDN [20], LatticeNet [21], LapSRN [22]) on the Manga109 ( 3 × ) benchmark dataset. The color normalized mapping represents the model’s parameter number, and the circle’s area represents the Multiply-Accumulates (MACs) of a model. Our proposed models are marked in the red label and line. The comparison results show the superiority of our method.
Figure 2. The architecture of the Efficient Residual ConvNet with structural Re-parameterization (RepECN).
Figure 3. The comparison between Asymmetric Convolutional Block (ACB) and standard Convolution.
Figure 4. Illustration of the proposed upsampling module.
Figure 5. Visual qualitative comparison of the efficient state-of-the-art models (e.g., SwinIR-S [4], ESRT [9], LBNet [10], IMDN [20], LatticeNet [21], EDSR-baseline [24]) on Set14 [49] and Urban100 [51] benchmark datasets for 4 × single image super-resolution (SISR). Zoom in for the best view.
Figure 6. Visual qualitative comparisons on a real-world historical image dataset for 3 × SR. The proposed RepECN generates a cleaner view than other methods (e.g., LBNet [10], LatticeNet [21], EDSR [24], CARN [56]) with fewer artifacts.
Figure 7. Ablation study on different number settings of the RepECN structure. The illustrations are tested on Set5 [46] for 2 × SISR.
Table 1. Hyperparameter settings of different-sized RepECN.
| Model | CNS | RepCNB | Channel | Patch | Epoch |
|---|---|---|---|---|---|
| RepECN-T | 2 | 6 | 24 | 64 × 64 | 3000 |
| RepECN-S | 3 | 6 | 42 | 64 × 64 | 2000 |
| RepECN | 5 | 6 | 60 | 48 × 48 | 1500 |
Table 2. The performances of PSNR (dB) and SSIMs on standard benchmark datasets for our RepECN models trained on DIV2K compared with Vision Transformer-based models. The best and second-best SR performances are marked in red and blue, respectively. Blanked entries denote unavailable.
| Methods | Scale | #Params | #MACs | Latency GPU (ms) | Latency CPU (s) | Set5 PSNR/SSIM | Set14 PSNR/SSIM | BSD100 PSNR/SSIM | Urban100 PSNR/SSIM | Manga109 PSNR/SSIM |
|---|---|---|---|---|---|---|---|---|---|---|
| LBNet-T | ×2 | 407K | 22.0 G | - | 241.29 | 37.95/0.9602 | 33.53/0.9168 | 32.07/0.8983 | 31.91/0.9253 | 38.59/0.9768 |
| ESRT | ×2 | 677K | 161.8 G | - | 55.00 | 38.03/0.9600 | 33.75/0.9184 | 32.25/0.9001 | 32.58/0.9318 | 39.12/0.9774 |
| LBNet | ×2 | 731K | 153.2 G | - | 314.27 | 38.05/0.9607 | 33.65/0.9177 | 32.16/0.8994 | 32.30/0.9291 | 38.88/0.9775 |
| RepECN-S (Ours) | ×2 | 411K | 117.5 G | 145.2 | 2.96 | 38.10/0.9607 | 33.68/0.9187 | 32.24/0.9004 | 32.30/0.9301 | 38.76/0.9773 |
| SwinIR-S | ×2 | 878K | 195.6 G | 1074.3 | 13.61 | 38.14/0.9611 | 33.86/0.9206 | 32.31/0.9012 | 32.76/0.9340 | 39.12/0.9783 |
| RepECN (Ours) | ×2 | 1262K | 336.5 G | 242.6 | 6.66 | 38.20/0.9612 | 33.85/0.9199 | 32.32/0.9013 | 32.68/0.9337 | 39.11/0.9777 |
| LBNet-T | ×3 | 407K | 22.0 G | 1551.5 | 49.80 | 34.33/0.9264 | 30.25/0.8402 | 29.05/0.8042 | 28.06/0.8485 | 33.48/0.9433 |
| ESRT | ×3 | 770K | 82.1 G | 372.0 | 12.62 | 34.42/0.9268 | 30.43/0.8433 | 29.15/0.8063 | 28.46/0.8574 | 33.95/0.9455 |
| LBNet | ×3 | 736K | 68.4 G | 2099.6 | 65.25 | 34.47/0.9277 | 30.38/0.8417 | 29.13/0.8061 | 28.42/0.8559 | 33.82/0.9460 |
| RepECN-S (Ours) | ×3 | 411K | 69.9 G | 70.3 | 1.38 | 34.47/0.9277 | 30.41/0.8439 | 29.15/0.8064 | 28.30/0.8551 | 33.72/0.9456 |
| SwinIR-S | ×3 | 886K | 87.2 G | 323.8 | 5.10 | 34.62/0.9289 | 30.54/0.8463 | 29.20/0.8082 | 28.66/0.8624 | 33.98/0.9478 |
| RepECN (Ours) | ×3 | 1262K | 185.1 G | 111.4 | 2.82 | 34.67/0.9291 | 30.48/0.8459 | 29.25/0.8089 | 28.65/0.8628 | 34.09/0.9482 |
| LBNet-T | ×4 | 410K | 12.6 G | 567.5 | 18.29 | 32.08/0.8933 | 28.54/0.7802 | 27.54/0.7358 | 26.00/0.7819 | 30.37/0.9059 |
| ESRT | ×4 | 751K | 58.6 G | 135.7 | 4.92 | 32.19/0.8947 | 28.69/0.7833 | 27.69/0.7379 | 26.39/0.7962 | 30.75/0.9100 |
| LBNet | ×4 | 742K | 38.9 G | 714.6 | 21.83 | 32.29/0.8960 | 28.68/0.7832 | 27.62/0.7382 | 26.27/0.7906 | 30.76/0.9111 |
| RepECN-S (Ours) | ×4 | 427K | 57 G | 45.7 | 1.03 | 32.32/0.8964 | 28.69/0.7833 | 27.62/0.7375 | 26.19/0.7889 | 30.54/0.9099 |
| SwinIR-S | ×4 | 897K | 49.6 G | 176.1 | 2.97 | 32.44/0.8976 | 28.77/0.7858 | 27.69/0.7406 | 26.47/0.7980 | 30.92/0.9151 |
| RepECN (Ours) | ×4 | 1295K | 140 G | 72.0 | 1.98 | 32.48/0.8985 | 28.76/0.7856 | 27.67/0.7395 | 26.45/0.7971 | 30.92/0.9139 |
Table 3. The performances of PSNR (dB) and SSIMs on standard benchmark datasets for CNN-based models. The best and second-best SR performances are marked in red and blue, respectively. Blanked entries denote unavailable.
| Methods | Scale | Dataset | #Params | #MACs | Set5 PSNR/SSIM | Set14 PSNR/SSIM | BSD100 PSNR/SSIM | Urban100 PSNR/SSIM | Manga109 PSNR/SSIM |
|---|---|---|---|---|---|---|---|---|---|
| Bicubic | ×2 | - | - | - | 33.66/0.9299 | 30.24/0.8688 | 29.56/0.8431 | 26.88/0.8403 | 30.80/0.9339 |
| SRCNN | ×2 | T91 | 69K | 63.7G | 36.66/0.9542 | 32.45/0.9067 | 31.36/0.8879 | 29.50/0.8946 | 35.60/0.9663 |
| FSRCNN | ×2 | T91 | 25K | 15.1G | 37.00/0.9558 | 32.63/0.9088 | 31.53/0.8920 | 29.88/0.9020 | 36.67/0.9710 |
| ShuffleMixer-Tiny | ×2 | DIV2K+Flickr2K | 108K | 25G | 37.85/0.9600 | 33.33/0.9153 | 31.99/0.8972 | 31.22/0.9183 | 38.25/0.9761 |
| RepECN-T (Ours) | ×2 | DIV2K | 104K | 31.6G | 37.90/0.9601 | 33.41/0.9164 | 32.09/0.8984 | 31.67/0.9239 | 38.30/0.9763 |
| LapSRN | ×2 | DIV2K | 435K | 146.0G | 37.52/0.9591 | 32.99/0.9124 | 31.80/0.8952 | 30.41/0.9103 | 37.27/0.9740 |
| DRRN | ×2 | DIV2K | 298K | 6.8T | 37.74/0.9591 | 33.23/0.9136 | 32.05/0.8973 | 31.23/0.9188 | 37.88/0.9749 |
| IDN | ×2 | DIV2K | 553K | 174.1G | 37.83/0.9600 | 33.30/0.9148 | 32.08/0.8985 | 31.27/0.9196 | 38.01/0.9749 |
| EDSR-baseline | ×2 | DIV2K | 1370K | 316.2G | 37.99/0.9604 | 33.57/0.9175 | 32.16/0.8994 | 31.98/0.9272 | 38.54/0.9769 |
| IMDN | ×2 | DIV2K | 694K | 158.8G | 38.00/0.9605 | 33.63/0.9177 | 32.19/0.8996 | 32.17/0.9283 | 38.88/0.9774 |
| LAPAR-A | ×2 | DIV2K+Flickr2K | 548K | 171.0G | 38.01/0.9605 | 33.62/0.9183 | 32.19/0.8999 | 32.10/0.9283 | 38.67/0.9772 |
| ShuffleMixer | ×2 | DIV2K+Flickr2K | 394K | 91G | 38.01/0.9606 | 33.63/0.9180 | 32.17/0.8995 | 31.89/0.9257 | 38.83/0.9774 |
| LatticeNet | ×2 | DIV2K | 756K | 169.5G | 38.06/0.9607 | 33.70/0.9187 | 32.19/0.8999 | 32.24/0.9288 | 38.93/0.9774 |
| RepECN-S (Ours) | ×2 | DIV2K | 411K | 117.5G | 38.10/0.9607 | 33.68/0.9187 | 32.24/0.9004 | 32.30/0.9301 | 38.76/0.9773 |
| RepECN (Ours) | ×2 | DIV2K | 1262K | 336.5G | 38.20/0.9612 | 33.85/0.9199 | 32.32/0.9013 | 32.68/0.9337 | 39.11/0.9777 |
| Bicubic | ×3 | - | - | - | 30.39/0.8682 | 27.55/0.7742 | 27.21/0.7385 | 24.46/0.7349 | 26.95/0.8556 |
| SRCNN | ×3 | T91 | 69K | 63.7G | 32.75/0.9090 | 29.30/0.8215 | 28.41/0.7863 | 26.24/0.7989 | 30.48/0.9117 |
| FSRCNN | ×3 | T91 | 25K | 13.6G | 33.18/0.9140 | 29.37/0.8240 | 28.53/0.7910 | 26.43/0.8080 | 31.10/0.9210 |
| ShuffleMixer-Tiny | ×3 | DIV2K+Flickr2K | 114K | 12G | 34.07/0.9250 | 30.14/0.8382 | 28.94/0.8009 | 27.54/0.8373 | 33.03/0.9400 |
| RepECN-T (Ours) | ×3 | DIV2K | 104K | 19.9G | 34.20/0.9259 | 30.25/0.8405 | 29.03/0.8031 | 27.86/0.8453 | 33.13/0.9419 |
| LapSRN | ×3 | DIV2K | 435K | 98.6G | 33.81/0.9220 | 29.79/0.8325 | 28.82/0.7980 | 27.07/0.8275 | 32.21/0.9350 |
| DRRN | ×3 | DIV2K | 298K | 6.8T | 34.03/0.9244 | 29.96/0.8349 | 28.95/0.8004 | 27.53/0.8378 | 32.71/0.9379 |
| IDN | ×3 | DIV2K | 553K | 105.6G | 34.11/0.9253 | 29.99/0.8354 | 28.95/0.8013 | 27.42/0.8359 | 32.71/0.9381 |
| EDSR-baseline | ×3 | DIV2K | 1555K | 160.1G | 34.37/0.9270 | 30.28/0.8417 | 29.09/0.8052 | 28.15/0.8527 | 33.45/0.9439 |
| IMDN | ×3 | DIV2K | 703K | 71.5G | 34.36/0.9270 | 30.32/0.8417 | 29.09/0.8046 | 28.17/0.8519 | 33.61/0.9445 |
| LAPAR-A | ×3 | DIV2K+Flickr2K | 544K | 114.0G | 34.36/0.9267 | 30.34/0.8421 | 29.11/0.8054 | 28.15/0.8523 | 33.51/0.9441 |
| ShuffleMixer | ×3 | DIV2K+Flickr2K | 415K | 43G | 34.40/0.9272 | 30.37/0.8423 | 29.12/0.8051 | 28.08/0.8498 | 33.69/0.9448 |
| LatticeNet | ×3 | DIV2K | 765K | 76.3G | 34.40/0.9272 | 30.32/0.8416 | 29.09/0.8047 | 28.19/0.8511 | 33.63/0.9442 |
| RepECN-S (Ours) | ×3 | DIV2K | 411K | 69.9G | 34.47/0.9277 | 30.41/0.8439 | 29.15/0.8064 | 28.30/0.8551 | 33.72/0.9456 |
| RepECN (Ours) | ×3 | DIV2K | 1262K | 185.1G | 34.67/0.9291 | 30.48/0.8459 | 29.25/0.8089 | 28.65/0.8628 | 34.09/0.9482 |
| Bicubic | ×4 | - | - | - | 28.42/0.8104 | 26.00/0.7027 | 25.96/0.6675 | 23.14/0.6577 | 24.89/0.7866 |
| SRCNN | ×4 | T91 | 69K | 63.7G | 30.48/0.8628 | 27.50/0.7513 | 26.90/0.7101 | 24.52/0.7221 | 27.58/0.8555 |
| FSRCNN | ×4 | T91 | 25K | 13.6G | 30.72/0.8660 | 27.61/0.7550 | 26.98/0.7150 | 24.62/0.7280 | 27.90/0.8610 |
| ShuffleMixer-Tiny | ×4 | DIV2K+Flickr2K | 113K | 8G | 31.88/0.8912 | 28.46/0.7779 | 27.45/0.7313 | 25.66/0.7690 | 29.96/0.9006 |
| RepECN-T (Ours) | ×4 | DIV2K | 110K | 17.1G | 32.05/0.8930 | 28.52/0.7791 | 27.52/0.7335 | 25.84/0.7772 | 30.09/0.9038 |
| LapSRN | ×4 | DIV2K | 870K | 182.4G | 31.54/0.8852 | 28.09/0.7700 | 27.32/0.7275 | 25.21/0.7562 | 29.09/0.8900 |
| DRRN | ×4 | DIV2K | 298K | 6.8T | 31.68/0.8888 | 28.21/0.7720 | 27.38/0.7284 | 25.44/0.7638 | 29.45/0.8946 |
| IDN | ×4 | DIV2K | 553K | 81.9G | 31.82/0.8903 | 28.25/0.7730 | 27.41/0.7297 | 25.41/0.7632 | 29.41/0.8942 |
| EDSR-baseline | ×4 | DIV2K | 1518K | 114.2G | 32.09/0.8938 | 28.58/0.7813 | 27.57/0.7357 | 26.04/0.7849 | 30.35/0.9067 |
| IMDN | ×4 | DIV2K | 715K | 40.9G | 32.21/0.8948 | 28.58/0.7811 | 27.56/0.7353 | 26.04/0.7838 | 30.45/0.9075 |
| LAPAR-A | ×4 | DIV2K+Flickr2K | 659K | 94.0G | 32.15/0.8944 | 28.61/0.7818 | 27.61/0.7366 | 26.14/0.7871 | 30.42/0.9074 |
| ShuffleMixer | ×4 | DIV2K+Flickr2K | 411K | 28G | 32.21/0.8953 | 28.66/0.7827 | 27.61/0.7366 | 26.08/0.7835 | 30.65/0.9093 |
| LatticeNet | ×4 | DIV2K | 777K | 43.6G | 32.18/0.8943 | 28.61/0.7812 | 27.56/0.7353 | 26.13/0.7843 | 30.54/0.9075 |
| RepECN-S (Ours) | ×4 | DIV2K | 427K | 57G | 32.32/0.8964 | 28.69/0.7833 | 27.62/0.7375 | 26.19/0.7889 | 30.54/0.9099 |
| RepECN (Ours) | ×4 | DIV2K | 1295K | 140G | 32.48/0.8985 | 28.76/0.7856 | 27.67/0.7395 | 26.45/0.7971 | 30.92/0.9139 |
Table 4. Ablation study on the several designs of RepECN, including layer normalization in CNS, batch normalization in ACB, head layer in CNS, and upsampling design. The best SR performances are marked in red.
| Design Name | LayerNorm | BN in ACB | Head in CNS | Upsampling | Params | PSNR |
|---|---|---|---|---|---|---|
| RepECN-T-A |  |  |  | Nearest (PA) | 75K | 37.78 |
| RepECN-T-B | After Connect |  |  |  | 75K | 37.80 |
| RepECN-T-C | Before Connect |  |  |  | 75K | 37.81 |
| RepECN-T-D | Before Connect | ✓ |  |  | 75K | 37.82 |
| RepECN-T-E |  |  | Three 3 × 3 ACB |  | 99K | 37.84 |
| RepECN-T |  |  | One 3 × 3 ACB | Nearest (PA) | 104K | 37.86 |
| RepECN-T-F |  |  | One 3 × 3 ACB | Nearest (no PA) | 103K | 37.84 |
| RepECN-T-G |  |  |  | Pixel Shuffle | 114K | 37.83 |
Table 5. Ablation study on the structural re-parameterization and upsampling design for the simple 3 × 3 ConvNet model FSRCNN to prove the effectiveness. The best SR performances are marked in red.
| Design Name | Upsampling | Re-Parameterization | PSNR |
|---|---|---|---|
| FSRCNN | Deconvolution |  | 37.00 |
| FSRCNN-N | Nearest (PA) |  | 37.31 |
| FSRCNN-N-DBB | Nearest (PA) | DBB | 37.47 |
| FSRCNN-N-ACB | Nearest (PA) | ACB | 37.56 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
