Super-Resolution Study

Hitchhiker’s Guide to Super-Resolution: Introduction and Recent Advances


2 Setting and Terminology

2.1 Problem Definition: Super-Resolution

Super-Resolution (SR) refers to methods that reconstruct High-Resolution (HR) images from one or more Low-Resolution (LR) images.

2.1.1 Single Image Super-Resolution (SISR)

Low-Resolution (LR) image

\[\mathbf{x} \in \mathbb{R}^{\bar{w} \times \bar{h} \times c}\]

High-Resolution (HR) image

\[\mathbf{y} \in \mathbb{R}^{w \times h \times c}\]

with \(\bar{w} \leq w\) and \(\bar{h} \leq h\)

The number of pixels of an image:

\[N_x = w \cdot h \cdot c\]

The set of all valid positions in 𝐱:

\[\Omega_x = \{(i,j,k) \in \mathbb{N}^3_1 \mid i \leq h,\ j \leq w,\ k \leq c\}\]

A scaling factor:

\[s \in ℕ\]

It holds that \(h = s \cdot \bar{h}\) and \(w = s \cdot \bar{w}\)

The inherent relationship between the two entities LR (𝐱) and HR (𝐲):

\[\mathcal{D}: \mathbb{R}^{w \times h \times c} \to \mathbb{R}^{\bar{w} \times \bar{h} \times c}, \qquad \mathbf{x} = \mathcal{D}(\mathbf{y};\delta)\]

in which δ are parameters of 𝒟 that contain, for example, the scaling factor s and other elements like blur type.

2.2 Evaluation: Image Quality Assessment (IQA)

2.2.1 Mean Opinion Score (MOS)

Human viewers rate images with quality scores, typically 1 (bad) to 5 (good).

MOS is the arithmetic mean of all ratings. Although reliable, collecting human ratings is time-consuming and cumbersome, especially for large datasets.

2.2.2 Peak Signal-to-Noise Ratio (PSNR)

It is the ratio between the maximum possible pixel value L (255 for 8-bit representations) and the Mean Squared Error (MSE) between the reference image and its approximation. Given the approximation \(\hat{\mathbf{y}}\) and the ground-truth \(\mathbf{y}\), PSNR is a logarithmic quantity on the decibel scale [dB]:

\[\mathrm{PSNR}\left(\mathbf{y},\widehat{\mathbf{y}}\right)=10\cdot\log_{10}\frac{L^2}{\frac{1}{N_{\mathbf{y}}}\sum_{p\in\Omega_{\mathbf{y}}}\left[\mathbf{y}_{p}-\widehat{\mathbf{y}}_{p}\right]^2}\]

It focuses on pixel-level differences instead of human visual perception, which is more sensitive to structures. As a result, it correlates poorly with subjectively perceived quality.
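
As a minimal illustration of the formula above (not part of the original report), the following NumPy sketch computes PSNR for 8-bit images with L = 255; the function name is my own.

```python
import numpy as np

def psnr(y: np.ndarray, y_hat: np.ndarray, L: float = 255.0) -> float:
    """PSNR in dB between a ground-truth image y and its approximation y_hat."""
    y = y.astype(np.float64)
    y_hat = y_hat.astype(np.float64)
    mse = np.mean((y - y_hat) ** 2)   # mean over all positions in Omega_y
    if mse == 0:
        return float("inf")           # identical images
    return 10.0 * np.log10(L ** 2 / mse)
```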

2.2.3 Structural Similarity Index (SSIM)

The Structural Similarity Index (SSIM) depends on three relatively independent quantities: luminance, contrast, and structure.

SSIM estimates for an image 𝐲 the luminance \(μ_𝐲\) as the mean of the intensity and the contrast \(σ_𝐲\) as its standard deviation:

\[\mu_{\mathbf{y}}=\frac{1}{N_{\mathbf{y}}}\sum_{p\in\Omega_{\mathbf{y}}}\mathbf{y}_{p}\] \[\sigma_{\mathbf{y}}=\left(\frac{1}{N_{\mathbf{y}}-1}\sum_{p\in\Omega_{\mathbf{y}}}\left[\mathbf{y}_{p}-\mu_{\mathbf{y}}\right]^{2}\right)^{\frac{1}{2}}\]

A similarity comparison function S:

\[S\left(x,y,c\right)=\frac{2\cdot x\cdot y+c}{x^2+y^2+c},\]

where x and y are the compared scalar variables, and \(c = (k \cdot L)^2, 0 < k \ll 1\) is a constant to avoid instability.

Given a ground-truth image 𝐲 and its approximation \(\hat{y}\), the comparisons on luminance (\(𝒞_l\)) and contrast (\(𝒞_c\)) are

\[\mathcal{C}_l\left(\mathbf{y},\mathbf{\hat{y}}\right)=S\left(\mu_\mathbf{y},\mu_\mathbf{\hat{y}},c_1\right)\mathrm{~and~}\mathcal{C}_c\left(\mathbf{y},\mathbf{\hat{y}}\right)=S\left(\sigma_\mathbf{y},\sigma_\mathbf{\hat{y}},c_2\right)\]

where \(c_1, c_2 > 0\). The empirical covariance

\[\sigma_{\mathbf{y},\mathbf{\hat{y}}}=\frac{1}{N_{\mathbf{y}}-1}\sum_{p\in\Omega_{\mathbf{y}}}\left(\mathbf{y}_{p}-\mu_{\mathbf{y}}\right)\cdot\left(\mathbf{\hat{y}}_{p}-\mu_{\mathbf{\hat{y}}}\right),\]

determines the structure comparison (\(𝒞_s\)), expressed as the correlation coefficient between 𝐲 and \(\hat{y}\):

\[\mathcal{C}_s\left(\mathbf{y},\widehat{\mathbf{y}}\right)=\frac{\sigma_{\mathbf{y},\hat{\mathbf{y}}}+c_3}{\sigma_{\mathbf{y}}\cdot\sigma_{\hat{\mathbf{y}}}+c_3},\]

where \(c_3 > 0\). Finally, the SSIM is defined as:

\[\mathrm{SSIM}\left(\mathbf{y},\mathbf{\hat{y}}\right)=\left[\mathcal{C}_{l}\left(\mathbf{y},\mathbf{\hat{y}}\right)\right]^{\alpha}\cdot\left[\mathcal{C}_{c}\left(\mathbf{y},\mathbf{\hat{y}}\right)\right]^{\beta}\cdot\left[\mathcal{C}_{s}\left(\mathbf{y},\mathbf{\hat{y}}\right)\right]^{\gamma}\]

where \(α>0\), \(β>0\), and \(γ>0\) are adjustable control parameters that weight the relative importance of the three components.
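
The sketch below evaluates the global (single-window) form of SSIM exactly as written above, with α = β = γ = 1 and the usual constants \(c_1 = (k_1 L)^2\), \(c_2 = (k_2 L)^2\); the choice \(c_3 = c_2/2\) is a common convention, not something fixed by the report. Practical implementations compute these statistics over local windows and average the results.

```python
import numpy as np

def ssim_global(y: np.ndarray, y_hat: np.ndarray, L: float = 255.0,
                k1: float = 0.01, k2: float = 0.03) -> float:
    """Single-window SSIM with alpha = beta = gamma = 1 (global statistics)."""
    y, y_hat = y.astype(np.float64), y_hat.astype(np.float64)
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    c3 = c2 / 2.0  # common choice; any small positive constant works

    mu_y, mu_h = y.mean(), y_hat.mean()
    sig_y, sig_h = y.std(ddof=1), y_hat.std(ddof=1)
    cov = ((y - mu_y) * (y_hat - mu_h)).sum() / (y.size - 1)

    c_l = (2 * mu_y * mu_h + c1) / (mu_y ** 2 + mu_h ** 2 + c1)      # luminance comparison
    c_c = (2 * sig_y * sig_h + c2) / (sig_y ** 2 + sig_h ** 2 + c2)  # contrast comparison
    c_s = (cov + c3) / (sig_y * sig_h + c3)                          # structure comparison
    return c_l * c_c * c_s
```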

2.2.4 Learning-based Perceptual Quality (LPQ)

In essence, LPQ tries to approximate a variety of subjective ratings by applying DL methods.

A significant drawback of LPQ is the limited availability of annotated samples.

2.2.5 Task-based Evaluation (TBE)

One can focus on task-oriented features.

2.2.6 Evaluation with defined Features

One example is the Gradient Magnitude Similarity Deviation (GMSD), which uses the pixel-wise Gradient Magnitude Similarity (GMS).

An alternative is the Feature Similarity (FSIM) Index. It also uses gradient magnitudes, but combines them with Phase Congruency (PC), a local structure measurement, as feature points.

2.2.7 Multi-Scale Evaluation

In practice, SR models usually super-resolve to different scaling factors, known as Multi-Scaling (MS). Thus, evaluating metrics should address this scenario.

2.3 Datasets and Challenges

Two of the most famous challenges are the New Trends in Image Restoration and Enhancement (NTIRE) challenge, and the Perceptual Image Restoration and Manipulation (PIRM) challenge.

2.4 Color Spaces

Research exploring other color spaces for DL-based SR methods is nearly nonexistent, which presents an exciting research gap.

3 Learning Objectives

3.1 Regression-based Objectives

3.1.1 Pixel Loss

The first one is the Mean Absolute Error (MAE), or L1-loss:

\[\mathcal{L}_{\mathrm{L1}}\left(\mathbf{y},\widehat{\mathbf{y}}\right)=\frac{1}{N_{\mathbf{y}}}\sum_{p\in\Omega_{\mathbf{y}}}\left|\mathbf{y}_{p}-\widehat{\mathbf{y}}_{p}\right|\]

It takes the absolute differences between every pixel of both images and returns the mean value.

The second well-known pixel loss function is the Mean Squared Error (MSE), or L2-loss. It weights high-value differences higher than low-value differences due to an additional square operation:

\[\mathcal{L}_{\mathrm{L2}}\left(\mathbf{y},\mathbf{\hat{y}}\right)=\frac{1}{N_{\mathbf{y}}}\sum_{p\in\Omega_{\mathbf{y}}}\left|\mathbf{y}_{p}-\mathbf{\hat{y}}_{p}\right|^{2}\]
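
As a side note (not from the report), both pixel losses map directly onto standard PyTorch calls; the tensors below are toy placeholders.

```python
import torch
import torch.nn.functional as F

y_hat = torch.rand(1, 3, 128, 128)   # model prediction (toy tensor)
y     = torch.rand(1, 3, 128, 128)   # ground-truth HR image (toy tensor)

l1 = F.l1_loss(y_hat, y)    # Mean Absolute Error (L1-loss)
l2 = F.mse_loss(y_hat, y)   # Mean Squared Error (L2-loss)
```
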
3.1.2 Uncertainty-Driven Loss

The uncertainty-driven loss is an adaptively weighted loss for SISR that prioritizes texture and edge pixels, which are visually more significant than pixels in smooth regions. Thus, the adaptive weighted loss treats every pixel unequally.

3.1.3 Content Loss

Instead of using the difference between the approximated and the ground-truth image, one can transform both entities further into a more discriminative domain.

In more detail, the feature extractor is pre-trained on another task, e.g., image classification or segmentation. During the training of the actual SR model on the difference of feature maps, the parameters of the feature extractor remain fixed. Thus, the goal of the SR model is not to generate pixel-perfect estimations. Instead, it produces images whose features are close to the features of the target.
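
A minimal sketch of such a content (perceptual) loss with a frozen, pre-trained feature extractor; the choice of VGG19 and of the cut-off layer is illustrative, not prescribed by the report.

```python
import torch
import torch.nn as nn
from torchvision import models

class ContentLoss(nn.Module):
    """L1 distance between feature maps of a frozen, pre-trained extractor."""
    def __init__(self, cut: int = 35):   # cut-off layer chosen for illustration only
        super().__init__()
        # torchvision >= 0.13; older versions use models.vgg19(pretrained=True).features
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features
        self.extractor = nn.Sequential(*list(vgg.children())[:cut]).eval()
        for p in self.extractor.parameters():
            p.requires_grad_(False)       # the feature extractor stays fixed during SR training

    def forward(self, y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # In practice both inputs are first normalized to ImageNet statistics.
        return nn.functional.l1_loss(self.extractor(y_hat), self.extractor(y))
```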

3.2 Generative Adversarial Networks

The core idea is to use two distinct networks: a generator G and a discriminator D. The generator network learns to produce samples close to a given dataset and to fool the discriminator.

3.2.1 Total Variation Loss

One way to regularize GANs is to use a Total Variation (TV) denoising technique known from image processing.

\[\mathrm{TV}(\mathbf{y})=\frac{1}{N_{\mathbf{y}}}\sum_{i,j,k}\sqrt{\underbrace{\left(\mathbf{y}_{i+1,j,k}-\mathbf{y}_{i,j,k}\right)^2}_{\text{diff. first axis}}+\underbrace{\left(\mathbf{y}_{i,j+1,k}-\mathbf{y}_{i,j,k}\right)^2}_{\text{diff. second axis}}}\]
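
A direct PyTorch translation of the TV term above (my own helper; the small epsilon is added only to keep the gradient finite at zero):

```python
import torch

def total_variation(y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """TV(y) for a batch of images shaped (B, C, H, W)."""
    dh = y[:, :, 1:, :-1] - y[:, :, :-1, :-1]   # difference along the first spatial axis
    dw = y[:, :, :-1, 1:] - y[:, :, :-1, :-1]   # difference along the second spatial axis
    return torch.sqrt(dh ** 2 + dw ** 2 + eps).mean()   # mean approximates 1/N_y * sum
```
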
3.2.2 Texture Loss

Texture synthesis with parametric texture models has a long history, with the goal of transferring a global texture onto other images.

3.3 Denoising Diffusion Probabilistic Models

Denoising Diffusion Probabilistic Models (DDPMs) formulate a Markov chain that gradually turns an image into a noise distribution, and learn the reverse process that gradually turns noise back into an image.

4 Upsampling

4.1 Interpolation-based Upsampling

Many DL-based SR models use image interpolation methods because of their simplicity. The best-known methods are nearest-neighbor, bilinear, and bicubic interpolation.

4.2 Learning-based Upsampling

4.2.1 Transposed Convolution

Transposed convolution expands the spatial size of a given feature map and subsequently applies a convolution operation.

4.2.2 Sub-Pixel Layer

Introduced with ESPCN, it uses a convolution layer to extract a deep feature map and rearranges it to return an upsampled output.
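
A minimal sketch of the sub-pixel idea: a convolution produces \(s^2 \cdot c\) channels, and PixelShuffle rearranges them into an s-times larger feature map. The layer sizes here are illustrative, not taken from ESPCN.

```python
import torch
import torch.nn as nn

s, c = 2, 64   # scaling factor and channel width, chosen for illustration
subpixel_up = nn.Sequential(
    nn.Conv2d(c, c * s * s, kernel_size=3, padding=1),   # deep feature map with s^2 * c channels
    nn.PixelShuffle(s),                                   # rearrange channels into an s-times larger map
)

x = torch.rand(1, c, 32, 32)
print(subpixel_up(x).shape)   # torch.Size([1, 64, 64, 64])
```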

4.2.3 Decomposed Upsampling

An extension to the above approaches is decomposed transposed convolution. Using 1D convolutions instead of 2D convolutions reduces the number of operations and parameters per component from \(k^2\) to \(2\cdot k\).
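
A sketch contrasting a full 2D transposed convolution with a decomposed variant built from two 1D transposed convolutions; the concrete kernel size, scale, and channel width are illustrative assumptions.

```python
import torch.nn as nn

k, s, c = 5, 2, 64   # kernel size, scale, channels (illustrative)

# Full 2D transposed convolution: k * k weights per input/output channel pair.
full = nn.ConvTranspose2d(c, c, kernel_size=k, stride=s,
                          padding=k // 2, output_padding=s - 1)

# Decomposed variant: two 1D transposed convolutions, k + k weights per pair.
decomposed = nn.Sequential(
    nn.ConvTranspose2d(c, c, kernel_size=(k, 1), stride=(s, 1),
                       padding=(k // 2, 0), output_padding=(s - 1, 0)),
    nn.ConvTranspose2d(c, c, kernel_size=(1, k), stride=(1, s),
                       padding=(0, k // 2), output_padding=(0, s - 1)),
)

n_full = sum(p.numel() for p in full.parameters())
n_dec = sum(p.numel() for p in decomposed.parameters())
print(n_full, n_dec)   # the decomposed variant uses considerably fewer weights
```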

4.2.4 Attention-based Upsampling

Another alternative to transposed convolution is attention-based upsampling [69]. It follows the definition of attention-based convolution (or scaled dot product attention) and replaces the 1x1 convolutions with upsampling methods.

4.2.5 Upsampling with Look-Up Tables

Before generating the LUT, a small-scale SR model is trained to upscale small patches of a LR image to target HR patches. Subsequently, the LUT is created by saving the results of the trained SR model applied on a uniformly distributed input space. It reduces the upsampling runtime to the time necessary for memory access while achieving better quality than bicubic interpolation. On the other hand, it requires additional training to create the LUT.

4.2.6 Flexible Upsampling

In order to overcome the limitation of a fixed scaling factor, a meta-upscale module was proposed [41]. It predicts a set of filters for each position in a feature map that is later applied to a location in a lower-resolution feature map.

5 Attention Mechanisms for SR

5.1 Channel-Attention

Feature maps generated by CNNs are not equally important. Therefore, important channels should be weighted higher than less informative ones, which is the goal of channel attention. It focuses on “which” channels carry the crucial details.
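
A squeeze-and-excitation style channel attention block as a minimal sketch (the report does not prescribe this exact variant):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Reweights channels with a gate computed from globally pooled statistics."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                         # one descriptor per channel
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                    # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)   # emphasize "which" channels carry crucial details
```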

5.2 Spatial-Attention

In contrast to channel attention, spatial attention focuses on “where” the input feature maps carry important details, which requires extracting global information from the input.
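
A CBAM-style spatial attention sketch, again only one of several possible designs: channel-wise average and max maps are stacked and turned into a per-position gate.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Reweights spatial positions ("where") with a single-channel gate map."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)    # channel-wise average map
        mx, _ = x.max(dim=1, keepdim=True)   # channel-wise max map
        gate = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * gate
```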

5.3 Mixed Attention

Since both attention types can be applied easily, merging them into one framework is natural. Thus, the model focuses on “which” (channel) is essential and “where” (spatially) to extract the most valuable features. This combines the benefits of both approaches and introduces an exciting field of research, especially in SR. One potential future direction would be to introduce attention mechanisms incorporating both concerns in one module.

6 Additional Learning Strategies

6.1 Curriculum Learning

Curriculum learning follows the idea of training a model under easy conditions and gradually involving more complexity [84], i.e., additional scaling sizes.

6.2 Enhanced Predictions

Instead of predicting from a single input, one can apply data augmentation techniques like rotation and flipping at test time, super-resolve every augmented version, invert the transformations on the outputs, and fuse them into the final prediction, as sketched below.
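
A minimal sketch of this test-time self-ensemble, assuming `sr_model` is any trained SR network operating on (B, C, H, W) tensors:

```python
import torch

def enhanced_prediction(sr_model, x: torch.Tensor) -> torch.Tensor:
    """Average SR predictions over the 8 flip/rotation variants of the input."""
    outputs = []
    for flip in (False, True):
        xf = torch.flip(x, dims=[-1]) if flip else x
        for k in range(4):                               # rotations by 0/90/180/270 degrees
            y = sr_model(torch.rot90(xf, k, dims=[-2, -1]))
            y = torch.rot90(y, -k, dims=[-2, -1])        # undo the rotation on the output
            outputs.append(torch.flip(y, dims=[-1]) if flip else y)
    return torch.stack(outputs).mean(dim=0)
```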

6.3 Learned Degradation

The Content Adaptive Resampler (CAR) introduced a resampler for downscaling. It predicts kernels to produce downscaled images according to its HR input. Next, a SR model takes the LR image and predicts the SR image. Thus, it simultaneously learns the degradation mapping and upsampling task.

6.4 Network Fusion

Network fusion runs several SR models in parallel, applies a fusion layer to all of their outputs, and finally predicts the SR image used for the learning objective.

6.5 Multi-Task Learning

For example, one can assign a label to each image and use multiple datasets for training. Next, a SR model can learn to reconstruct the SR image and predict its category (e.g., natural or manga image).

6.6 Normalization Techniques

Slight changes in the input distribution cause many issues because layers need to continuously adapt to new distributions. This is known as covariate shift and can be alleviated with Batch Normalization (BatchNorm).

7 SR Models

8 Unsupervised Super-Resolution

8.1 Weakly-Supervised

Weakly-supervised methods use unpaired LR and HR images.

The first generator takes a LR image and super-resolves it. The output of the first generator constitutes a SR image.

The second generator takes the prediction of the first generator and performs the inverse mapping. The result of the second generator is optimized via content loss with the original input, the LR image.

8.2 Zero-Shot

Zero-shot or one-shot learning is associated with training on objects and testing on entirely different objects from a different class that was never observed.

8.3 Deep Image Prior

It uses a CNN to predict the LR image when downsampled, given some random noise instead of an actual image. Therefore, it follows the strategy of ZSSR by using only the LR image. However, it fixes the input to random noise and applies a fixed downsampling method to the prediction.


Residual Local Feature Network for Efficient Super-Resolution

2.1 Efficient Image Super-Resolution

SRCNN applied deep learning to the SISR task for the first time. It has three layers and uses bicubic interpolation to upscale the image before the network, causing unnecessary computational cost.

To address this issue, FSRCNN employed a deconvolution layer as the upsampling layer and upscaled the image at the end of the network.

DRCN introduced a deep recursive convolutional network to reduce the number of parameters.

LapSRN proposed the Laplacian pyramid super-resolution block to reconstruct the sub-band residuals of HR images.

CARN proposed an efficient cascading residual network with group convolution, which obtains comparable results against computationally expensive models.

IMDN proposed a lightweight information multi-distillation network by constructing the cascaded information multi-distillation blocks, which extracts hierarchical features step-by-step with the information distillation mechanism (IDM).

RFDN refined the architecture of IMDN and proposed the residual feature distillation network, which replaced IDM with feature distillation connections.

ECBSR proposed an edge-oriented convolutional block based on the reparameterization technique [10], which can improve the learning ability of the model without increasing the inference time.

2.2 Training Strategy for PSNR-oriented SISR

These SR networks are usually trained with the ADAM optimizer and a standard L1 loss for hundreds of epochs. To improve the robustness of training, they usually adopt a smaller learning rate and patch size.

Recent works on image recognition [3] and optical flow estimation [41] have demonstrated that advanced training strategies can enable older network architectures to match or surpass the performance of novel architectures.

RFDN [31] demonstrated that both fine-tuning the network with L2 loss and initializing a 4x SR model with a pretrained 2x model can effectively improve PSNR.

RRCAN [30] revisited the popular RCAN model and demonstrated that increasing training iterations clearly improves the model performance.

3 Method

3.1 Network Architecture

graph LR
  A[Input: Low-Resolution Image] --> B[Conv 3×3<br>Shallow Feature Extraction]
  B --> C1[RLFB Block 1]
  C1 --> C2[RLFB Block 2]
  C2 --> C3[RLFB Block 3]
  C3 --> D[Conv 3×3<br>Post RLFB]

  B -.-> E[Skip Connection]
  D --> F[Add: Long Residual]
  E --> F

  F --> G[Conv 3×3<br>Upsample Preparation]
  G --> H[PixelShuffle<br>Upsampling]
  H --> I[Output: High-Resolution Image]

3.2 Revisiting the Contrastive Loss

The contrastive loss is defined as:

3.3 Warm-Start Strategy

For large scale factors such as 3 or 4 in the SR task, some previous works use the 2x model as a pre-trained network instead of training from scratch.
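
A sketch of such a warm start: copy every parameter of a trained ×2 checkpoint whose name and shape also exist in the ×4 model, and leave the remaining (typically upsampler) weights at their fresh initialization. The names `model_x4` and `ckpt_path_x2` are placeholders; RLFN's exact procedure may differ in detail.

```python
import torch

def warm_start(model_x4, ckpt_path_x2: str):
    """Initialize a x4 SR model from a pretrained x2 checkpoint where shapes match."""
    state_x2 = torch.load(ckpt_path_x2, map_location="cpu")
    state_x4 = model_x4.state_dict()
    transferable = {k: v for k, v in state_x2.items()
                    if k in state_x4 and v.shape == state_x4[k].shape}
    state_x4.update(transferable)
    model_x4.load_state_dict(state_x4)   # upsampler weights with new shapes stay freshly initialized
    return model_x4
```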

Efficient Long-Range Attention Network for Image Super-resolution


2.1 CNN-based SR methods

CNN-based methods have demonstrated impressive performance in the SR task.

To build more effective models for SR, the recently developed methods tend to employ deeper and more complicated architectures as well as the attention techniques.

2.2 Transformer-based SR methods

The breakthrough of transformer networks in natural language processing (NLP) inspired the use of self-attention (SA) in computer vision tasks. The SA mechanism in transformers can effectively model the dependency across data, and it has achieved impressive results on several high-level vision tasks, such as image classification, object detection, and segmentation.

3 Methodology

3.1 Overall Pipeline of ELAN

graph LR
    A[Input: Low-Resolution Image] --> B[Conv 3×3<br>Shallow Feature Extraction]
    B --> C1[ELAB 1]
    C1 --> C2[ELAB 2]
    C2 --> C3[ELAB 3]
    C3 --> C4[ELAB 4]
    C4 --> C5[ELAB 5]
    C5 --> D[Conv 3×3<br>Reconstruction]
    D --> E[PixelShuffle]
    E --> F[Output: High-Resolution Image]

graph LR
    X[Input Feature] --> A1[ShiftConv]
    A1 --> A2[ReLU]
    A2 --> A3[ShiftConv]
    A3 --> A4[GMSA<br>Group-wise Self-Attention]
    A4 --> Y[Output Feature]

graph LR
    G1[Input] --> Split[Split into Groups]
    Split --> Shift[Shift-Convolution per Group]
    Shift --> ASA1[ASA Block]
    Shift --> ASA2[...]
    ASA1 --> Merge[Concatenate & Merge]
    Merge --> Out[Output]

3.2 Efficient Long-range Attention Block (ELAB)

graph TD
    A[Input] --> Theta_i[Theta_i]
    A --> G_i[G_i]

    Theta_i --> R1[Reshape]
    G_i --> R2[Reshape]

    R1 --> M1[Multiplication]
    R2 --> M1

    M1 --> R3[Reshape]

    Theta_next[Theta_i_plus_n] --> R4[Reshape]
    G_next[G_i_plus_n] --> R5[Reshape]

    R4 --> M2[Multiplication]
    R5 --> M2

    M2 --> R6[Reshape]

    R3 --> FinalOut1[Output Part 1]
    R6 --> FinalOut2[Output Part 2]

graph LR
    A[Input] --> Theta_i[Shared Theta_i]
    A --> G_i[G_i]

    Theta_i --> R1[Reshape]
    G_i --> R2[Reshape]

    R1 --> M1[Multiplication]
    R2 --> M1

    M1 --> R3[Reshape]

    G_next[G_i_plus_n] --> R4[Reshape]
    R1 --> M2[Multiplication Shared]
    R4 --> M2

    M2 --> R5[Reshape]

    R3 --> FinalOut1[Output Part 1]
    R5 --> FinalOut2[Output Part 2]

ShuffleMixer: An Efficient ConvNet for Image Super-Resolution


Proposed Method

We aim to develop an efficient large-kernel CNN model for the SISR task. To meet the efficiency goal, we introduce key designs to the feature mixing block employed to encode information efficiently.

3.1 ShuffleMixer Architecture

graph LR
    A[Input Image] --> B[Feature Extraction]
    B --> C[FMB 1 - Feature Mixing Block]
    C --> D[FMB 2 - Feature Mixing Block]
    D --> E[... more FMBs]
    E --> F[Upsampler - Bilinear Upsampling]
    F --> G[Conv 3x3 - Reconstruction]
    G --> H[Output Image]

    %% Optional skip connection
    A -. Skip Connection .-> H

graph TD
    Z0[Input Z0] --> L1[Shuffle Mixer Layer 1]
    L1 --> L2[Shuffle Mixer Layer 2]
    L2 --> FMBConv[Conv Layer - FMBConv]
    FMBConv --> Add[Residual Add with Z0]
    Add --> Z_out[Output]

    SM_Input[SM Layer Input] --> P1[Channel Projection 1x1]
    P1 --> DWC[Depthwise Convolution]
    DWC --> P2[Channel Projection 1x1]
    P2 --> SM_Output[SM Layer Output]

    Z0_main[Input Feature Z0] --> LN[LayerNorm]
    LN --> Split[Channel Splitting]
    Split --> MLP[Point-wise MLP Branch]
    Split --> ID[Identity Branch]
    MLP --> Merge1
    ID --> Merge1
    Merge1[Channel Shuffle] --> SM2[Next Shuffle Mixer Layer]

    X[Input Tensor] --> C1[Conv 1x1]
    C1 --> Act[SiLU Activation]
    Act --> C2[Conv 1x1]
    C2 --> Out_MLP[Output Tensor]

    X2[Input Tensor] --> F1[Conv 3x3]
    F1 --> Act2[SiLU Activation]
    Act2 --> F2[Conv 1x1]
    F2 --> Out_MB[Output Tensor]

Omni Aggregation Networks for Lightweight SR


3 Methodology

3.1 Attention Mechanisms in Super-Resolution

Two attention paradigms are widely adopted in SR to assist in analyzing and aggregating comprehensive patterns.

Spatial Attention. Spatial attention can be regarded as an anisotropic selection process. Spatial self-attention and spatial gate are predominantly applied.

Channel Attention. There are two categories of channel attention, i.e., scalar-based and covariance-based, proposed to perform channel recalibration or transmit patterns among channels.

3.2 Omni Self-Attention Block

To mine all the correlations hidden in the latent variables, we propose a novel self-attention paradigm called Omni Self-Attention (OSA) block. Unlike existing self-attention paradigms (e.g., spatial self-attention [51, 5, 37]) that only indulge in unidimensional processing, OSA establishes the spatial and channel context simultaneously.

The proposed OSA calculates the score matrices corresponding to the spatial and channel directions through sequential matrix operations and rotation:

\[\begin{aligned} & Q^{s}=X\cdot W_{q},\quad K^{s}=X\cdot W_{k},\quad V^{s}=X\cdot W_{v}, \\ & Y_{s}=\mathcal{A}^{s}(Q^{s},K^{s},V^{s})=\mathrm{SoftMax}(Q^{s}K^{sT})\cdot V^{s}, \\ & Q^{c}=\mathcal{R}(Q^{\prime}),\quad K^{c}=\mathcal{R}(K^{\prime}),\quad V^{c}=\mathcal{R}(V^{\prime}), \\ & Y_{c}=\mathcal{A}^{c}(Q^{c},K^{c},V^{c})=\mathrm{SoftMax}(K^{c}Q^{cT})\cdot V^{c}, \\ & Y_{\mathrm{OSA}}=\mathcal{R}^{-1}\left(Y_{c}\right), \end{aligned}\]
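
The equations can be read as a spatial self-attention over the N = H·W tokens followed by a channel-direction attention obtained by "rotating" (transposing) the token and channel axes. The sketch below follows that reading; the excerpt does not spell out how Q', K', V' are produced, so here they are simply taken as the spatial-branch projections, and a scaling factor is added for numerical stability even though the equations omit it.

```python
import torch
import torch.nn.functional as F

def omni_self_attention(x, wq, wk, wv):
    """x: (B, N, C) tokens with N = H*W; wq/wk/wv: (C, C) projection matrices."""
    q, k, v = x @ wq, x @ wk, x @ wv                              # Q^s, K^s, V^s
    d = q.shape[-1] ** 0.5                                        # scaling, added for stability

    # Spatial branch: an N x N score matrix over token positions.
    y_s = F.softmax(q @ k.transpose(-2, -1) / d, dim=-1) @ v      # (B, N, C)

    # "Rotation" R: swap token and channel axes so attention acts on channels.
    qc, kc, vc = (t.transpose(-2, -1) for t in (q, k, v))         # (B, C, N)
    y_c = F.softmax(kc @ qc.transpose(-2, -1) / d, dim=-1) @ vc   # C x C scores, (B, C, N)

    return y_s, y_c.transpose(-2, -1)                             # R^{-1}: back to (B, N, C)
```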

3.3 Omni-Scale Aggregation Group

OSAG mainly consists of three stages: local, meso and global aggregations.

3.4 Network Architecture

Omni-SR consists of three parts, i.e., shallow feature extraction, deep feature extraction, and image reconstruction.

Omni-Scale Aggregation Group (OSAG).

Each OSAG contains a local convolution block (LCB), a meso-OSA block, a global-OSA block, and an ESA block.

Optimization Objective

Following prior works, we train the model by minimizing a standard \(L_1\) loss between the model prediction \(\hat{I}_{HR}\) and the HR label \(I_{HR}\) as follows:

\[\mathcal{L}=\|I_{HR}-\widehat{I}_{HR}\|_{1}.\]

Equivalent Transformation and Dual Stream Network Construction for Mobile Image Super-Resolution


Pre-Trained Image Processing Transformer

3 Image Processing Transformer

3.1 IPT architecture

flowchart LR
  %% Multi-head section
  subgraph MultiHead
    DenoisingHead
    DerainingHead
    X2UpHead
    X4UpHead
  end

  %% Transformer body section
  subgraph TransformerBody
    TransformerEncoder
    TransformerDecoder
  end

  %% Multi-tail section
  subgraph MultiTail
    DenoisingTail
    DerainingTail
    X2UpTail
    X4UpTail
  end

  %% Connections
  DenoisingHead --> TransformerEncoder
  DerainingHead --> TransformerEncoder
  X2UpHead      --> TransformerEncoder
  X4UpHead      --> TransformerEncoder

  TransformerEncoder --> TransformerDecoder

  TransformerDecoder --> DenoisingTail
  TransformerDecoder --> DerainingTail
  TransformerDecoder --> X2UpTail
  TransformerDecoder --> X4UpTail

3.2 Pre-training on ImageNet

On ImageNet, each high-resolution image is degraded in multiple ways, such as downsampling (super-resolution), adding noise (denoising), and synthesizing rain streaks (deraining), generating more than 10 million degraded-clean training pairs.

Self-supervised contrastive learning: in addition to the standard L1/L2 reconstruction loss, a patch-level contrastive loss is added. It encourages patches at the same position under different inputs (e.g., noisy or low-resolution) to be close in feature space, improving transferability across tasks.

4 Experiments

4.1 Super-resolution

4.2 Denoising

4.3 Deraining

4.4 Generalization Ability

After training, only the head and tail for a given task need to be replaced; the shared Transformer body is kept fixed or fine-tuned, which allows fast convergence on small, task-specific datasets.

For example, ×2/×3/×4 super-resolution, denoising at different noise levels, and deraining at different rain intensities can all be handled by efficiently fine-tuning the same pre-trained model.

SwinIR: Image Restoration Using Swin Transformer

3 Method

3.1 Network Architecture

SwinIR consists of three modules: shallow feature extraction, deep feature extraction and high-quality (HQ) image reconstruction modules.

  1. Shallow Feature Extraction
    • A single convolution layer extracts low-frequency information from the input image (passed directly to the reconstruction module) and preserves the original details.
  2. Deep Feature Extraction
    • Built by stacking \(M\) Residual Swin Transformer Blocks (RSTB).
    • Each RSTB contains \(L\) Swin Transformer Layers (STL), which use window-based self-attention and the shifted-window mechanism to effectively model local and cross-window dependencies.
    • A convolution layer at the end of each block enhances the representation, and a residual connection adds the block input directly to its output for feature aggregation and gradient propagation.
  3. High-Quality Image Reconstruction
    • For super-resolution, upsampling is done with PixelShuffle or transposed convolution; for denoising and compression-artifact removal, the output keeps the original resolution.
    • A final convolution layer produces the restored image, which is added to the shallow features to compensate for low-frequency information.

Transformer for Single Image Super-Resolution

3 Efficient Super-Resolution Transformer

Efficient Super-Resolution Transformer (ESRT) mainly consists of four parts: shallow feature extraction, Lightweight CNN Backbone (LCB), Lightweight Transformer Backbone (LTB), and image reconstruction.

3.1 Lightweight CNN Backbone (LCB)

High Preserving Block (HPB)

In the HPB, we creatively propose the High-frequency Filtering Module (HFM) and the Adaptive Residual Feature Block (ARFB).

3.2 High-frequency Filtering Module (HFM)

The schematic diagram of the proposed HFM module.

3.2.1 Adaptive Residual Feature Block (ARFB)

When the depth of the model grows, the residual architecture can mitigate the gradient vanishing problem and augment the representation capacity of the model.

ARFB contains two Residual Units (RUs) and two convolutional layers. To save memory and the number of parameters, RU is made up of two modules: Reduction and Expansion.

The complete architecture of the proposed ARFB

3.3 Lightweight Transformer Backbone (LTB)

LTB is composed of specially designed Efficient Transformers (ETs), which can capture the long-term dependence of similar local regions in the image at a low computational cost.

3.3.1 Pre- and Post-processing for ET
3.3.2 Efficient Transformer (ET)

A hybrid of transformer and CNN for efficient single image super-resolution via multi-level distillation

3 Methodology

3.1 Network architecture

Our proposed network consists of three parts: 1) shallow feature extraction module, 2) deep feature distillation module, and 3) high-resolution reconstruction module.

3.1.1 Shallow feature extraction
3.1.2 Deep feature distillation

The deep feature distillation module is our model’s main part, consisting of a stack of Transformer-CNN feature distillation blocks (TCFDB). Besides, we made a few changes to the feed-forward network (i.e., MLP) of the Swin Transformer and built the enhanced Swin Transformer layer (ESTL). The TCFDB consists of ESTLs and convolutional layers. Each TCFDB has two paths in which features are propagated. One path is for the feature distillation operation, and the other is for the gradual feature refinement operation. The two outputs are then concatenated and fused using a 1 × 1 convolutional layer.

3.1.3 Super-resolution reconstruction module

3.2 Transformer-CNN feature distillation block

The proposed feature distillation block (TCFDB) consists of hybrid networks of CNN and enhanced Swin Transformer layers (ESTL), convolutional layers, and an enhanced spatial attention (ESA) block.

3.2.1 Feature distillation and refinement pipeline

The TCFDB contains two pipelines, i.e., the feature distillation pipeline and the refinement pipeline. The intermediate features are fed into the two pipelines in parallel. The 1 × 1 convolutional layers on the left are responsible for distilling features and reducing channels with few parameters. On the right are cascaded ESTLs, which can attend to spatial context and gradually refine features to attain more discriminative information.

3.2.2 Enhanced Swin Transformer layer (ESTL)
3.2.2.1 The standard transformer layer
3.2.2.2 Swin transformer layer
3.2.2.3 Tensor reshape operation
3.2.2.4 Convolutional feed-forward network (CFF)
3.2.3 Enhanced spatial attention block (ESA)

3.3 Image reconstruction module

3.4 Loss function

3.4.1 L1 pixel loss
3.4.2 Contrastive loss

Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration

Table of Reports

Related Search tag in Google Scholar:

"image super-resolution" AND ("lightweight" OR "efficient") AND ("dynamic" OR "adaptive") after:2021
| Title (with Year) | Tags | Summary |
| --- | --- | --- |
| Lightweight Image Super-Resolution with Adaptive Weighted Learning Network (IMDN) (2019) | Lightweight, Information Distillation, Channel Attention | IMDN introduces an Information Multi-distillation Network that employs a series of distillation blocks to extract and fuse hierarchical features efficiently. The network utilizes contrast-aware channel attention mechanisms to enhance feature representation, achieving a balance between performance and computational efficiency. |
| Exploring Sparsity in Image Super-Resolution for Efficient Inference (SMSR) (2021) | Sparse Mask, Dynamic Inference, Efficient | SMSR introduces a Sparse Mask Super-Resolution network that learns spatial and channel-wise masks to identify and skip redundant computations during inference. While it enhances efficiency by pruning unnecessary operations, the binary nature of the masks limits fine-grained enhancements in complex regions. |
| Dynamic Residual Self-Attention Network for Lightweight Single Image Super-Resolution (DRSAN) (2021) | Dynamic, Self-Attention, Lightweight | DRSAN proposes a dynamic residual self-attention mechanism that adaptively adjusts residual connections based on input content. This approach allows the network to modulate feature strength dynamically; however, the overall network structure remains static, lacking dynamic routing capabilities for different content complexities. |
| Hitchhiker’s Guide to Super-Resolution: Introduction and Recent Advances (2023) | Survey, Overview | Comprehensive survey of deep learning-based SR, covering recent advancements like transformer-based models and diffusion approaches, and discussing remaining challenges (e.g., flexible upsampling, new loss functions, better metrics). |
| Residual Local Feature Network for Efficient SR (RLFN) (2022) | Lightweight, Efficient | Proposes a compact network using only three conv layers for local feature learning, simplifying feature aggregation for a better speed–quality trade-off. With an improved contrastive loss and multi-stage warm-start training, RLFN outperforms prior efficient models in runtime while maintaining high PSNR/SSIM. |
| Efficient Long-Range Attention Network (ELAN) (2022) | Efficient, Attention, Transformer | Introduces an efficient transformer-inspired SR model that reduces self-attention cost. Uses shift convolution to extract local structure and a group-wise multi-scale self-attention (GMSA) module to capture long-range dependencies. An efficient long-range attention block (ELAB), built by combining shift-conv and GMSA with shared attention, achieves transformer-level accuracy with significantly fewer parameters. |
| ShuffleMixer: An Efficient ConvNet for SR (2022) | Lightweight, Efficient, ConvNet | Proposes a lightweight SR network with large depth-wise convolutions and channel split-shuffle operations to efficiently enlarge the receptive field. Augments large kernels with fused MobileNet-like convolution blocks to preserve fine details. Experimental results show ShuffleMixer is ~3× smaller (in FLOPs and params) than previous efficient models (e.g., CARN) while achieving competitive performance. |
| Omni-SR: Omni Aggregation Networks for Lightweight SR (2023) | Lightweight, Transformer, Multi-scale | Transformer-based framework targeting <1M parameters. Introduces an Omni Self-Attention (OSA) block that densely connects spatial and channel attention, simultaneously modeling correlations across spatial and channel dimensions. Also proposes a multi-scale interaction scheme to expand the effective receptive field in shallow models. Omni-SR achieved record-high performance for lightweight SR. |
| ETDS: Equivalent Transformation & Dual-Stream Network for Mobile SR (2023) | Mobile, Dual-Branch, Efficient | Targets real-time SR on mobile devices. Proposes an Equivalent Transformation (ET) technique to convert heavy layers into mobile-friendly ops (reparameterizing into standard conv+ReLU). Builds a dual-stream network to mitigate the extra parameters from ET and enhance feature extraction. |
| Pre-Trained Image Processing Transformer (IPT) (2021) | Transformer, Pre-training, Multi-task | The first model to apply a Transformer to super-resolution. It uses a ViT-style architecture and multi-task pre-training on large-scale ImageNet data, and is then fine-tuned for SR. IPT has up to 115M parameters, but with the support of pre-training it outperforms previous CNN methods, achieving state-of-the-art PSNR/SSIM at the time. |
| SwinIR: Image Restoration Using Swin Transformer (2021) | Transformer, Local Attention, Hierarchical | An efficient image restoration model based on the Swin Transformer. Using window self-attention and a hierarchical structure, it surpasses the performance of IPT without large-scale pre-training. |
| ESRT: Efficient Super-Resolution Transformer (CVPRW 2022) | Hybrid, Lightweight, Transformer+CNN | A lightweight model with a CNN+Transformer hybrid architecture. It uses a lightweight CNN backbone to dynamically adjust the feature map size to reduce computation, and combines a series of efficient Transformer modules (EMHA attention) to extract long-range dependencies. ESRT significantly reduces GPU memory usage while maintaining competitive performance (about 4× less memory than the original ViT-style model). |
| A hybrid of transformer and CNN for efficient single image super-resolution via multi-level distillation | Hybrid, CNN+Transformer, Distillation | A hybrid network built by cascading Transformer-CNN feature distillation blocks. Each block combines the global modeling capability of the Transformer with the local fine-grained representation of the CNN, distilling and strengthening features step by step. This design exploits global dependencies while retaining local details, achieving reconstruction accuracy comparable to large models with lower computational effort. |
| SAFMN: Spatially-Adaptive Feature Modulation for Efficient SR (2023) | | |
| DLGSANet: Dynamic Local/Global Self-Attention Network (2023) | | |
| AsConvSR: Fast SR Network with Assembled Convolutions (2023) | | |
| WBSR: Weight-Balancing for Efficient SR (2024) | | |
| PCSR: Pixel-Level Classification for SR (2024) | | |
| LAMNet: Linear Adaptive Mixer Network for SR (2023) | | |
| FIWHN: Feature Interaction Weighted Hybrid Network (2024) | | |
| TCSR: Transformer-CNN for Lightweight SR (2023) | | |
| HFFN: High-Frequency Focused Network (2023) | | |
| RAMiT: Reciprocal Attention Mixing Transformer (2023) | | |
| SCNet: Fully 1×1 Convolutional SR Network (2023) | | |
| QSRNet: Quick SR Network for Mobile (2023) | | |
| DITN: Deployment-Friendly Transformer for SR (2023) | | |
| Dynamic Graph SR via Message Passing & Conv Mixer (2025) | | |
| BSRN: Blueprint Separable Residual Network (2022) | | |
| ECBSR: Edge-Oriented Convolution Block for Real-Time SR (2022) | | |
| ARM: Any-Time Super-Resolution Method (2022) | | |
| ADSRNet: Adaptive Heterogeneous CNN for SR (2024) | | |
| LDEN: Learnable Detail Enhancement Network (2025) | | |
| Swin2SR: SwinV2 Transformer for Compressed SR (2023) | | |
| ESRT: Efficient SR Transformer (2022) | | |

Possible Architecture

py -3.11 train2.py --train_dir Data/DIV2K_train_HR --val_dir Data/DIV2K_valid_HR --scale 2 --epochs 100 --batch_size 16 --patch_size 128

Overall Architecture

flowchart TD
    A[Input LR Image] --> B[Conv 3×3, 3→C channels<br/>+ ReLU]
    B --> C[Shallow Features F₀]
    C --> D1[RDB Block 1]
    D1 --> D2[RDB Block 2]
    D2 --> D3[…]
    D3 --> Dn[RDB Block N]
    Dn --> E[Concat all block outputs]
    E --> F[Conv 1×1 Fusion → C channels]
    F --> G[ESA: Spatial Attention]
    G --> H[Add skip connection<br/>F + F₀]
    H --> I[Conv 3×3 → 3×scale² channels]
    I --> J[PixelShuffle ×scale]
    J --> K[Output HR Image]
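
A minimal PyTorch sketch of the flowchart above; the RDB and ESA blocks are reduced to simple placeholders (a small conv stack and a one-conv spatial gate), so this is a structural illustration of the candidate architecture, not a tuned model.

```python
import torch
import torch.nn as nn

class SimpleRDB(nn.Module):
    """Placeholder residual block standing in for a full residual dense block."""
    def __init__(self, c: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class SimpleESA(nn.Module):
    """Placeholder for the ESA module: a single conv producing a spatial gate map."""
    def __init__(self, c: int):
        super().__init__()
        self.conv = nn.Conv2d(c, 1, 1)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))

class SRNet(nn.Module):
    def __init__(self, c: int = 64, n_blocks: int = 4, scale: int = 2):
        super().__init__()
        self.shallow = nn.Sequential(nn.Conv2d(3, c, 3, padding=1), nn.ReLU(inplace=True))
        self.blocks = nn.ModuleList(SimpleRDB(c) for _ in range(n_blocks))
        self.fuse = nn.Conv2d(c * n_blocks, c, 1)            # 1x1 fusion of all block outputs
        self.esa = SimpleESA(c)
        self.upsample = nn.Sequential(
            nn.Conv2d(c, 3 * scale * scale, 3, padding=1),   # 3 * scale^2 output channels
            nn.PixelShuffle(scale),                          # rearrange to the HR resolution
        )

    def forward(self, x):
        f0 = self.shallow(x)
        feats, f = [], f0
        for block in self.blocks:
            f = block(f)
            feats.append(f)                                  # keep every block output for fusion
        fused = self.esa(self.fuse(torch.cat(feats, dim=1)))
        return self.upsample(fused + f0)                     # long skip connection before upsampling

# SRNet(scale=2)(torch.rand(1, 3, 64, 64)).shape == torch.Size([1, 3, 128, 128])
```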

Residual Distillation Block

flowchart TD
    X[Input Features X] --> P1[Conv 3×3 → C channels<br/>+ ReLU] 
    P1 --> P2[Intermediate Feature 1]
    
    P2 --> P3[Conv 3×3 → C channels<br/>+ ReLU]
    P3 --> P4[Intermediate Feature 2]
    
    P4 --> P5[Conv 3×3 → C channels<br/>+ ReLU]
    P5 --> P6[Intermediate Feature 3]
    
    subgraph FA[Fuse & Attention]
        P2 --> Q1
        P4 --> Q1
        P6 --> Q1
        Q1[Concat F1,F2,F3] --> R[Conv 1×1 → C channels]
        R --> S[CALayer: Channel Attention]
    end
    
    S --> T[Add Residual S + X]
    T --> Y[Output Features]

This post is licensed under CC BY 4.0 by the author.