First published on Saturday, Apr 19, 2025 and last modified on Saturday, Apr 19, 2025 by François Chaplais.

@misc{NTIRE-2025-Challenge-on-Short-form-UGC-Video-Quality-Assessment-and-Enhancement-Methods-and-Results,
title={NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results},
author={Xin Li and  Kun Yuan and  Bingchen Li and  Fengbin Guan and  Yizhen Shao and  Zihao Yu and  Xijun Wang and  Yiting Lu and  Wei Luo and  Suhang Yao and  Ming Sun and  Chao Zhou and  Zhibo Chen and  Radu Timofte and  Yabin Zhang and  Ao-Xiang Zhang and  Tianwu Zhi and  Jianzhao Liu and  Yang Li and  Jingwen Xu and  Yiting Liao and  Yushen Zuo and  Mingyang Wu and  Renjie Li and  Shengyun Zhong and  Zhengzhong Tu and  Yufan Liu and  Xiangguang Chen and  Zuowei Cao and  Minhao Tang and  Shan Liu and  Kexin Zhang and  Jingfen Xie and  Yan Wang and  Kai Chen and  Shijie Zhao and  Yunchen Zhang and  Xiangkai Xu and  Hong Gao and  Ji Shi and  Yiming Bao and  Xiugang Dong and  Xiangsheng Zhou and  Yaofeng Tu and  Ying Liang and  Yiwen Wang and  Xinning Chai and  Yuxuan Zhang and  Zhengxue Cheng and  Yingsheng Qin and  Yucai Yang and  Rong Xie and  Li Song and  Wei Sun and  Kang Fu and  Linhan Cao and  Dandan Zhu and  Kaiwei Zhang and  Yucheng Zhu and  Zicheng Zhang and  Menghan Hu and  Xiongkuo Min and  Guangtao Zhai and  Zhi Jin and  Jiawei Wu and  Wei Wang and  Wenjian Zhang and  Yuhai Lan and  Gaoxiong Yi and  Hengyuan Na and  Wang Luo and  Di Wu and  MingYin Bai and  Jiawang Du and  Zilong Lu and  Zhenyu Jiang and  Hui Zeng and  Ziguan Cui and  Zongliang Gan and  Guijin Tang and  Xinglin Xie and  Kehuan Song and  Xiaoqiang Lu and  Licheng Jiao and  Fang Liu and  Xu Liu and  Puhua Chen and  Ha Thu Nguyen and  Katrien De Moor and  Seyed Ali Amirshahi and  Mohamed-Chaker Larabi and  Qi Tang and  Linfeng He and  Zhiyong Gao and  Zixuan Gao and  Guohua Zhang and  Zhiye Huang and  Yi Deng and  Qingmiao Jiang and  Lu Chen and  Yi Yang and  Xi Liao and  Nourine Mohammed Nadir and  Yuxuan Jiang and  Qiang Zhu and  Siyue Teng and  Fan Zhang and  Shuyuan Zhu and  Bing Zeng and  David Bull and  Meiqin Liu and  Chao Yao and Yao Zhao},
year={2025},
month={Apr},
url={https://latex2web.app/documents/display/NTIRE-2025-Challenge-on-Short-form-UGC-Video-Quality-Assessment-and-Enhancement-Methods-and-Results},
}

NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results

Xin Li Kun Yuan Bingchen Li Fengbin Guan Yizhen Shao Zihao Yu Xijun Wang Yiting Lu Wei Luo Suhang Yao Ming Sun Chao Zhou Zhibo Chen Radu Timofte Yabin Zhang Ao-Xiang Zhang Tianwu Zhi Jianzhao Liu Yang Li Jingwen Xu Yiting Liao Yushen Zuo Mingyang Wu Renjie Li Shengyun Zhong Zhengzhong Tu Yufan Liu Xiangguang Chen Zuowei Cao Minhao Tang Shan Liu Kexin Zhang Jingfen Xie Yan Wang Kai Chen Shijie Zhao Yunchen Zhang Xiangkai Xu Hong Gao Ji Shi Yiming Bao Xiugang Dong Xiangsheng Zhou Yaofeng Tu Ying Liang Yiwen Wang Xinning Chai Yuxuan Zhang Zhengxue Cheng Yingsheng Qin Yucai Yang Rong Xie Li Song Wei Sun Kang Fu Linhan Cao Dandan Zhu Kaiwei Zhang Yucheng Zhu Zicheng Zhang Menghan Hu Xiongkuo Min Guangtao Zhai Zhi Jin Jiawei Wu Wei Wang Wenjian Zhang Yuhai Lan Gaoxiong Yi Hengyuan Na Wang Luo Di Wu MingYin Bai Jiawang Du Zilong Lu Zhenyu Jiang Hui Zeng Ziguan Cui Zongliang Gan Guijin Tang Xinglin Xie Kehuan Song Xiaoqiang Lu Licheng Jiao Fang Liu Xu Liu Puhua Chen Ha Thu Nguyen Katrien De Moor Seyed Ali Amirshahi Mohamed-Chaker Larabi Qi Tang Linfeng He Zhiyong Gao Zixuan Gao Guohua Zhang Zhiye Huang Yi Deng Qingmiao Jiang Lu Chen Yi Yang Xi Liao Nourine Mohammed Nadir Yuxuan Jiang Qiang Zhu Siyue Teng Fan Zhang Shuyuan Zhu Bing Zeng David Bull Meiqin Liu Chao Yao Yao Zhao

Abstract

This paper presents a review of the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating the reliance on model ensembles, redundant weights, and other computationally expensive components seen in previous IQA/VQA competitions. Track 2 introduces a new short-form UGC dataset tailored for single-image super-resolution, i.e., the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, which are split into training, validation, and test sets using a ratio of 8:1:1. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. This challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, significantly contributing to the progress of short-form UGC VQA and image super-resolution. The project is publicly available at https://github.com/lixinustc/KVQE-Challenge-CVPR-NTIRE2025.

1 Introduction

Recently, short-form user-generated content (S-UGC) platforms such as Kwai and TikTok have emerged as mainstream streaming platforms for information sharing and dissemination. Unlike traditional UGC or professionally generated content (PGC), short-form UGC videos offer advantages including a mobile-friendly browsing mode, high user engagement, and abundant content creation [1, 2, 3]. However, since S-UGC content creation requires neither professional acquisition devices (a mobile phone suffices) nor experienced users, the source S-UGC videos are prone to sub-optimal subjective quality. Meanwhile, the complicated video processing techniques in the S-UGC platform, such as preprocessing, transcoding, and enhancement, further lead to unexpected quality changes. It is therefore essential to develop powerful S-UGC video quality assessment and enhancement methods to promote the development of short-form UGC platforms.

Numerous studies have been devoted to video/image quality assessment (VQA) [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] and image super-resolution [15, 16, 17, 18, 19, 20, 21, 22, 23]. Based on the availability of reference information, UGC VQA methods can be categorized into full-reference [24, 25], no-reference [26, 10, 27, 28], and reduced-reference [29, 30] approaches. With the rapid advancement of large language models (LLMs) and large multimodal models (LMMs), recent works [31, 32, 33] take the first step in exploring their reasoning and understanding capabilities to enhance the interactivity and explainability of VQA frameworks. In parallel, the evolution of model architectures has significantly improved the performance of image super-resolution. Existing approaches can be roughly classified into four categories: CNN-based [34, 35, 36], Transformer-based [37, 38, 39, 40, 41], Mamba-based [42, 43], and MLP-based methods [44, 45]. Among them, Transformer-based models excel at capturing global contextual dependencies, while Mamba-based methods offer linear-time sequence processing and possess learned state space dynamics, enabling competitive or superior performance with significantly lower computational overhead. However, S-UGC image super-resolution and efficient S-UGC VQA remain underexplored.

To advance the development of short-form UGC (User-Generated Content) platforms, we organized the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement, in collaboration with the NTIRE 2025 Workshops. This challenge aims to establish a practical and comprehensive benchmark for evaluating and enhancing the quality of short-form UGC content. We welcome the collaborative efforts of all participants to push the boundaries of short-form video quality. The challenge consists of two tracks: (i) Efficient Short-form UGC Video Quality Assessment. This track introduces an innovative evaluation methodology that combines coarse-grained quality scoring with fine-grained rankings for difficult samples. The VQA models must adhere to a computational limit of 120 GFLOPs; (ii) Diffusion-based Image Super-Resolution for Short-form UGC Images in the Wild. This track focuses on enhancing the subjective quality of S-UGC images, which utilizes a combination of user studies and no-reference quality metrics to better reflect subjective quality.

This challenge is one of the NTIRE 2025 Workshop (https://www.cvlai.net/ntire/2025/) associated challenges on: ambient lighting normalization [46], reflection removal in the wild [47], shadow removal [48], event-based image deblurring [49], image denoising [50], XGC quality assessment [51], UGC video enhancement [52], night photography rendering [53], image super-resolution (x4) [54], real-world face restoration [55], efficient super-resolution [56], HR depth estimation [57], efficient burst HDR and restoration [58], cross-domain few-shot object detection [59], short-form UGC video quality assessment and enhancement [60, 3], text to image generation model quality assessment [61], day and night raindrop removal for dual-focused images [62], video quality assessment for video conferencing [63], low light image enhancement [64], light field super-resolution [65], restore any image model (RAIM) in the wild [66], raw restoration and super-resolution [67] and raw reconstruction from RGB on smartphones [68].

2 Challenge

Track1: Efficient Short-form UGC Video Quality Assessment

The first track is Efficient Short-form UGC Video Quality Assessment, and the second track is Diffusion-based Image Super-resolution for Short-form UGC Images in the Wild. The first track uses KVQ, i.e., the large-scale Kaleidoscope short Video database for Quality assessment, for training and evaluation. The KVQ database comprises 600 user-uploaded short videos and 3,600 videos processed through diverse practical processing workflows. Moreover, it contains nine primary content scenarios found on practical short-form video platforms, including landscape, crowd, person, food, portrait, computer graphic (termed CG), caption, and stage, covering almost all existing creation modes and scenarios, and the ratio of each content category follows practical online statistics. The quality score of each short-form video and the partial ranked score are annotated by professional researchers in image processing.

Track2: Diffusion-based Image Super-resolution for Short-form UGC Images in the Wild

The second track comprises 1,800 synthetic paired images, produced with a simulation strategy from the real-world Kwai platform, and 1,900 real-world in-the-wild images for which only the low-quality versions are available. The contents are from the same source as the KVQ dataset. The purpose is to improve the perceptual quality of images in the wild while maintaining generalization capability. Participants are encouraged to use diffusion models, although other methods are also welcome. Fig. 1 shows processing results as a reference.

Figure 1. Enhancement results by Kwai-LPM (Large Processing Model), which is a diffusion-based SR method.

3 Challenge Results

The challenge results are presented in Table 1. We report the performances of teams that submitted their fact sheets. The teams with top performances including SharpMind, ZQE, ZX-AIE-Vector, ECNU-SJTU VQA Team, and TenVQA achieved excellent results in both PLCC and SROCC, with all values exceeding 0.91. Notably, the top three teams also maintained strong Rank1 and Rank2 scores, highlighting their models’ robustness under diverse evaluation metrics. The first-place team, SharpMind, attained the highest final score of 0.922, with relatively low computational complexity of 47.39 GFLOPs and 33.01M parameters. ZQE and ZX-AIE-Vector follow closely with scores of 0.916 and 0.912 respectively, while operating at moderate computational budgets. Interestingly, GoldenChef and ZQE exhibit relatively large model sizes (over 150M parameters), whereas DAIQAM achieved competitive accuracy (0.844) with only 66.36 GFLOPs and 27.98M parameters, demonstrating a good balance between performance and efficiency. We also observe that several teams have achieved superior results while maintaining computational constraints below 120 GFLOPs, demonstrating the community’s continued advancement in building efficient and accurate VQA models.
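Table 1 reports SROCC and PLCC for each team. As a reminder of what these two correlations measure, the following is a minimal pure-Python sketch. It is not the challenge's evaluation script, and it does not reproduce the final-score weighting or the Rank1/Rank2 metrics, which are not specified here. The function names and the toy `mos`/`pred` values are illustrative assumptions.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank-order correlation (SROCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data (hypothetical): predictions are monotonic in the labels but not linear.
mos = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.0, 4.0, 9.0, 16.0, 25.0]
srocc = spearman(mos, pred)
plcc = pearson(mos, pred)
```

On the toy data the ordering is perfectly preserved, so SROCC is 1 up to rounding, whereas PLCC is lower because the relationship is not linear. This is why the two metrics are usually reported together.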

For the second track, we focus on the subjective comparison among different teams. After the testing stage, the top six teams were shortlisted based on objective performance metrics as candidates for the user study. We then invited five professional image processing experts to assess the perceptual quality and realism of the super-resolved S-UGC images. Each expert spent approximately eight hours meticulously comparing image details and selecting the most visually convincing and realistic result among the six teams. The winning rates of the user study on both the synthetic and wild datasets are presented in the “User Study” column of Table 2. Notably, TACO_SR and RealismDiff demonstrated consistent performance in terms of subjective quality on the wild and synthetic datasets. SRLab performed strongly on the synthetic dataset, whereas SYSU-FVL-Team excelled in the wild scenario. Additionally, SYSU-FVL-Team attained the best objective performance, while BrainyBots Team ranked third in objective quality. Interestingly, the results highlight a noticeable inconsistency between subjective preferences and objective metrics, suggesting that current perceptual metrics may not reliably reflect perceptual quality in generative model-based S-UGC image super-resolution. The subjective comparison between the top six teams can be found in Figs. 2, 3, 4 and 5.

Table 1 Result of Track 1: Efficient KVQ.
Rank	Team name	Team leader	Final Score	SROCC	PLCC	Rank1	Rank2	GFLOPs	Params (M)
1	SharpMind	Yabin Zhang	0.922	0.934	0.933	0.788	0.846	47.39G	33.01
2	ZQE	Yufan Liu	0.916	0.930	0.933	0.732	0.817	95.75G	150.11
3*	ZX-AIE-Vector	Yunchen Zhang	0.912	0.926	0.927	0.775	0.787	100.3G	98.76
3	ECNU-SJTU VQA Team	Wei Sun	0.910	0.926	0.924	0.736	0.817	114.83G	37.90
5	TenVQA	Yuhai Lan	0.900	0.914	0.915	0.745	0.775	118G	28.00
6	GoldenChef	MingYin Bai	0.871	0.881	0.886	0.693	0.817	119G	154.76
7	DAIQAM	Ha Thu Nguyen	0.844	0.856	0.855	0.667	0.811	66.36G	27.98
8	57VQA	Zhiye Huang	0.264	0.220	0.239	0.541	0.598	–	–
9	Nourayn	Nourine Mohammed Nadir	0.127	0.070	0.085	0.455	0.686	–	–

Table 2 Result of Track 2: KwaiSR.
Team name	Team leader	User Study	Score (objective)	PSNR	SSIM	LPIPS	MUSIQ	ManIQA	CLIPIQA	Ranking (User Study)	Ranking (Objective)
TACO_SR	Yushen Zuo	0.2775/0.3529	51.1168	27.5625	0.7877	0.2232	65.7152	0.4489	0.6849	1	6
RealismDiff	Kexin Zhang	0.2640/0.2834	51.6289	28.3801	0.7979	0.2150	68.7706	0.4600	0.5943	2	5
SRlab	Ying Liang	0.2492/0.0932	53.4643	27.4647	0.7890	0.2121	71.1964	0.5532	0.7579	3	2
SYSU-FVL-Team	Zhi Jin	0.0733/0.1540	54.1403	28.2410	0.8069	0.2366	70.5872	0.5451	0.7687	4	1
NetLab	Hengyuan Na	0.0751/0.0947	52.5092	27.5176	0.7862	0.2105	68.3565	0.4911	0.74885	5	4
BrainyBots Team	Xinglin Xie	0.0609/0.0219	53.4161	27.9241	0.7836	0.2373	64.6089	0.6071	0.7497	6	3
BP-SR	Qi Tang	-	46.4382	27.2520	0.7744	0.2257	58.9467	0.3387	0.4418	-	7
NVDTOFCUC	Qingmiao Jiang	-	44.8025	22.3412	0.6761	0.3589	69.7200	0.5080	0.7238	-	8
BVIVSR	Yuxuan Jiang	-	44.1724	27.6412	0.7808	0.3857	51.8542	0.3146	0.4249	-	9

4 Teams and Methods of Track 1

4.1 SharpMind

Figure 6. The overall framework of Team SharpMind.

This team employs several powerful backbone networks to extract features associated with the human vision system. Subsequently, they train a comprehensive video quality assessment model, which is then utilized as a teacher model to label a series of User-Generated Content (UGC) video datasets. Leveraging these labeled datasets, they train a lightweight student network. As a result, the small-scale model can also effectively discern video quality.

Their method consists of two stages: (i) in the first stage, a comprehensive teacher network is trained on video features extracted through a series of powerful backbone networks; (ii) in the second stage, a set of closed-source UGC videos is annotated with pseudo labels by the teacher network and then used to train a lightweight model. The two stages are described in detail below.

In the first stage, they extract keyframes of the video following the RQ-VQA [6] strategy. Then, they extract features of these keyframes from three aspects: spatial, temporal, and spatiotemporal. Specifically, they extract the motion features of the video through SlowFast, FAST-VQA features, LIQE features, and DeQA [69] features. In addition, considering that UGC videos are uploaded by ordinary users, these videos often exhibit eye-related characteristics such as edge masking that need to be taken into account. Therefore, they refer to HVS-5M [70] to further extract edge and content features of the keyframes. Based on the above features, they fully retain the information of all dimensions of the video. To better preserve the learning of quality-aware features by some learnable parameters, they incorporate an additional Swin-B network. By integrating these features, they train a powerful teacher network. Specifically, they concatenate all of the above features and obtain the final video score through two MLP layers. Due to the integration of a series of features, the teacher model trained in the first stage possesses strong capabilities in identifying the quality of UGC videos. They then use it to annotate a large set of closed-source UGC videos (nearly 30,000), providing pseudo labels for training the student model in the second stage.

In the second stage, they utilize a lightweight student network to fit the labels annotated by the teacher network. Their expectation is that the student model can acquire the proficiency to discriminate the quality of UGC videos. Specifically, they first use the keyframe extraction method of RQ-VQA to convert all videos into frame-level representations. Subsequently, the score of each frame is annotated as the score of the corresponding video. After that, they randomly crop a patch with a resolution of \( 224\times224\) from these frames and feed it into a Swin-T network. Considering the importance of multi-scale features in quality assessment tasks, they obtain the features of each layer of the Swin-T and concatenate them together. Finally, they compute the video quality score through two fully connected layers.
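The data preparation for the student can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's code: the helper names `frame_level_samples` and `random_crop_box`, the dictionary layout, and the uniform choice of crop position are all hypothetical. Only the per-frame label inheritance and the \( 224\times224\) patch size come from the description above.

```python
import random

def frame_level_samples(video_scores, keyframes):
    """Give every keyframe the pseudo label of the video it was extracted from.

    video_scores: {video_id: teacher score}; keyframes: {video_id: [frame ids]}.
    Both layouts are assumptions made for this sketch.
    """
    samples = []
    for video_id, frames in keyframes.items():
        for frame_id in frames:
            samples.append((video_id, frame_id, video_scores[video_id]))
    return samples

def random_crop_box(height, width, size=224, rng=random):
    """Return (top, left, bottom, right) of a random size x size patch."""
    if height < size or width < size:
        raise ValueError("frame is smaller than the crop size")
    top = rng.randint(0, height - size)    # randint is inclusive on both ends
    left = rng.randint(0, width - size)
    return top, left, top + size, left + size

# Hypothetical example: two videos with teacher scores and their keyframe ids.
samples = frame_level_samples({"a": 3.5, "b": 1.0}, {"a": [0, 10], "b": [5]})
box = random_crop_box(720, 1280, rng=random.Random(0))
```

Each cropped patch would then be passed to the Swin-T student together with its inherited score. The network itself is omitted here.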

Training details They employ several powerful backbone networks to extract features associated with the human vision system. Subsequently, they train a comprehensive video quality assessment model, which is utilized as a teacher model to label a series of closed-source User-Generated Content (UGC) video datasets. Leveraging these labeled datasets, they train a lightweight network. As a result, the small-scale model can also effectively discern video quality. Differentiable SRCC and PLCC losses are adopted in both training phases.

In the first stage, the keyframes are randomly and centrally cropped to a resolution of \( 384\times384\) and then fed into the Swin-B network. The Adam optimizer with an initial learning rate of \( 1 \times 10^{-5}\) and a batch size of 6 is used to train the proposed teacher model on one A100 GPU.

In the second stage, the Adam optimizer with an initial learning rate of \( 1 \times 10^{-3}\) and a batch size of 64 is used to train the proposed student model on one A100 GPU.

4.2 ZQE

Figure 7. The overall framework of Team ZQE.

This team extracts semantic, local texture, and global features from videos through three branches, aggregates them via an attention mechanism, and finally maps them into a quality score. Their model is a hybrid model built upon and improved from DOVER [71] and KSVQE [2].

First, they pretrain DOVER on the KVQ dataset, freeze the aesthetic branch weights, and discard the original VQA head. They then integrate the attention-based fusion modules proposed in KSVQE between each stage of the technical branch. Next, through extensive ablation experiments, they select VideoMamba-middle [72] as their video semantic extractor, as it is an effective and lightweight video/image classification model. Specifically, they freeze VideoMamba’s pretrained weights obtained from ImageNet-1k [73] and insert eight trainable MLP modules into the backbone (one after every four SSM layers) to facilitate fine-tuning for the VQA task. Additionally, to address the issue of large irrelevant regions commonly found in short-form videos, they not only incorporate the quality-aware region selection module from KSVQE but also introduce a novel slicing preprocessing step. Specifically, they horizontally divide each video into multiple slices of equal height, compute the texture density of each slice using the Laplacian operator, and discard slices whose texture density falls below a threshold determined empirically from their experiments on the KVQ dataset. This preprocessing approach significantly improves the performance of their model.
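The slicing step can be illustrated with a small pure-Python sketch. Several details are assumptions rather than the team's implementation: texture density is taken here as the mean absolute response of a 4-neighbour Laplacian, frames are grayscale lists of rows, and the slice count and threshold are placeholder values (the team reports choosing the threshold empirically on KVQ).

```python
def laplacian_density(gray):
    """Mean absolute 4-neighbour Laplacian response over the interior pixels."""
    h, w = len(gray), len(gray[0])
    if h < 3 or w < 3:
        return 0.0
    total = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1]
                   + gray[y][x + 1] - 4 * gray[y][x])
            total += abs(lap)
    return total / ((h - 2) * (w - 2))

def keep_textured_slices(gray, num_slices, threshold):
    """Cut a frame into horizontal slices of equal height and keep the textured ones.

    Returns a list of (slice index, rows); rows beyond num_slices * slice height
    are ignored in this sketch.
    """
    slice_h = len(gray) // num_slices
    kept = []
    for i in range(num_slices):
        rows = gray[i * slice_h:(i + 1) * slice_h]
        if laplacian_density(rows) >= threshold:
            kept.append((i, rows))
    return kept

# Hypothetical 8x8 frame: a flat top half (e.g. a blank border) above a checkerboard.
flat = [[0] * 8 for _ in range(4)]
checker = [[255 * ((x + y) % 2) for x in range(8)] for y in range(4)]
frame = flat + checker
kept = keep_textured_slices(frame, num_slices=2, threshold=1.0)
```

In the toy frame the flat half has zero Laplacian response and is discarded, while the checkerboard half is kept.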

Their model is pretrained end-to-end on a combined dataset consisting of the KVQ training set and their private UGC dataset. Finally, they fine-tune the best-performing pretrained model (selected based on validation performance on the KVQ validation set) using the KVQ training set to obtain the optimal model.

Training details They first pretrain DOVER on the KVQ dataset and freeze the weights of the aesthetic branch. The technical branch is initialized using the official weights of KSVQE. The model is pretrained end-to-end on a mixed dataset containing the KVQ training set and their private UGC dataset, and is then fine-tuned on the KVQ training set. All training stages are evaluated on the KVQ validation set.

Testing details The test video is directly fed into the model for inference to obtain the predicted quality score. It is important to note that under the current Torch/CUDA version, they observe that multithreading and cuBLAS may introduce non-determinism in the results. Specifically, the output score may vary from the 7th or 8th decimal place onward with some probability. They ensure that no randomness is artificially introduced at any stage of inference. This does not affect the overall score on the Codalab leaderboard (rounded to 4 decimal places: 0.9158).

4.3 ZX-AIE-Vector

Figure 8. The overall framework of Team ZX-AIE-Vector.

This team proposes a lightweight and efficient VQA framework to address the limitations of high computational cost and suboptimal performance in many existing Video Quality Assessment methods. The total computational complexity of their framework is constrained to approximately 100 GFLOPs. Specifically, their approach adopts a widely used dual-stream architecture that separately extracts spatial-temporal and temporal features. For spatial-temporal modeling, they design two lightweight modules: a low-resolution branch to capture coarse global context, and a high-resolution branch to extract fine-grained local details, which are subsequently fused. Meanwhile, a lightweight SlowFast [74] module is employed to extract multi-scale temporal features.

To further enhance the model’s representation capability, they introduce a Multi-Scale Parallel Mamba (MSPM) layer. This module leverages the global modeling ability and linear complexity of Mamba [75] to perform deep feature extraction, improving long-range dependency modeling. Finally, spatial-temporal and temporal features are fused to form a comprehensive video representation. Experiments on the S-UGC competition dataset KVQ [2] demonstrate that their method achieves competitive performance while maintaining high efficiency, validating the effectiveness of the proposed lightweight dual-stream architecture and the MSPM module.

Training details The training pipeline consists of three main stages: pretraining, pseudo-labeling, and two-phase fine-tuning. They first pretrain their lightweight model on the large-scale LSVQ dataset [76] for 10 epochs to learn generalizable video representations, strictly following the original training-validation split.

Next, they fine-tune a large-scale model on the KVQ dataset [2] (merging the official training and validation sets), and use it to generate pseudo-labels for the test set. These pseudo-labels are then merged with the original KVQ training and validation data to construct an augmented dataset.

They conduct the first-phase fine-tuning of their lightweight model using this augmented set via 10-fold cross-validation, training each fold for 30 epochs. The average prediction across all folds is used as a refined pseudo-label for each test sample. In the second-phase fine-tuning, they use the full KVQ training and validation sets along with the refined pseudo-labeled test data to retrain their lightweight model for an additional 30 epochs. This two-stage fine-tuning strategy enables them to fully exploit the unlabeled test data and improve generalization.
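The fold-averaging step can be sketched as below. This is an illustrative outline only: the contiguous fold assignment and the function names are assumptions, and model training is omitted. `fold_predictions` stands in for the test-set outputs of the per-fold models.

```python
def kfold_indices(n_samples, k=10):
    """Partition range(n_samples) into k contiguous validation folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def refine_pseudo_labels(fold_predictions):
    """Average the predictions of all fold models for each test sample."""
    k = len(fold_predictions)
    n = len(fold_predictions[0])
    return [sum(preds[i] for preds in fold_predictions) / k for i in range(n)]

# Hypothetical example: 25 training samples, and two fold models scoring two test videos.
folds = kfold_indices(25, k=10)
refined = refine_pseudo_labels([[1.0, 2.0], [3.0, 4.0]])
```

Note that pseudo-labelling the test set relies on access to the unlabeled test videos during training, which is what the description above refers to as exploiting the unlabeled test data.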

They implement their method using PyTorch and train it on 8 NVIDIA V100 GPUs. The Adam optimizer is used with an initial learning rate of \( 3\times10^{-5}\) , combined with a Cosine Annealing learning rate scheduler that gradually decreases the learning rate to a minimum of \( 1\times10^{-6}\) . The entire training process takes approximately 9 hours.
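For reference, a cosine annealing schedule between the two reported learning rates can be written in a few lines. The sketch below uses the standard cosine formula. Whether the team steps the schedule per epoch or per iteration, and over how many steps, is not stated, so `total_steps` is an assumption.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max=3e-5, lr_min=1e-6):
    """Learning rate at `step` when annealing from lr_max down to lr_min."""
    # Clamp progress to [0, 1] so steps past the end stay at lr_min.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Hypothetical 30-step schedule (for example, one step per epoch).
schedule = [cosine_annealing_lr(s, 30) for s in range(31)]
```

The schedule starts at \( 3\times10^{-5}\), passes through the midpoint \( 1.55\times10^{-5}\) halfway, and ends at \( 1\times10^{-6}\).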

Testing details During testing, they evaluate their final lightweight model on the KVQ test set. For each video, they uniformly sample 8 frames, consistent with the training pipeline. The input resolution during testing is kept the same as in training to ensure consistency in feature extraction. The model directly predicts a quality score for each video. To ensure reproducibility, they fix the random seed to 8 during all testing procedures.

4.4 ECNU-SJTU VQA Team

Figure 9. The overall framework of Team ECNU-SJTU.

This team proposes an efficient video quality assessment (VQA) model named E-VQA [77], designed to achieve high performance while maintaining low computational complexity. Inspired by previous successful efforts in developing efficient VQA models such as SimpleVQA [4], FAST-VQA [5], and MinimalisticVQA [78], they empirically explore a combination of best practices from these models along with techniques from other efficient deep neural networks (DNNs) to develop E-VQA. By combining task-specific and general optimizations, their method balances accuracy and efficiency for practical deployment.

They begin with the lightweight MinimalisticVQA VIII model [78], replacing its backbone from Swin Transformer-Base to Swin Transformer-Tiny (Swin-T) [79], and evaluate the optimal frame count and resolution under the competition’s computational constraint of 120 GFLOPs. Next, they incrementally integrate motion features (extracted via X3D [80]) and fragment frame features (extracted using 2D/3D FAST-VQA variants [5]) into the baseline to assess performance gains.

Their experiments reveal that a dual Swin-T architecture with shared weights achieves optimal results. One branch aligns with MinimalisticVQA VIII, processing frames resized to \( 384\times384\) , while the other employs a 2D FAST-VQA variant operating on fragment frames composed of \( 12 \times 12\) image patches with a resolution of \( 32 \times 32\) . For feature extraction, they uniformly sample 4 frames from 8-second video clips.
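The fragment input can be pictured as a \( 12 \times 12\) grid of \( 32 \times 32\) patches, which tile into a \( 384\times384\) image. The sketch below only computes where the patches are taken from. It assumes, in the spirit of FAST-VQA-style grid sampling, that the frame is divided into equal cells and one patch is drawn at a random position inside each cell. The function name and the sampling details are assumptions, not the team's code.

```python
import random

def fragment_boxes(height, width, grid=12, patch=32, rng=random):
    """Pick one patch x patch box at a random position inside each grid cell.

    Returns grid * grid boxes (top, left, bottom, right) in row-major order.
    """
    cell_h, cell_w = height // grid, width // grid
    if cell_h < patch or cell_w < patch:
        raise ValueError("frame too small for the requested grid and patch size")
    boxes = []
    for gy in range(grid):
        for gx in range(grid):
            top = gy * cell_h + rng.randint(0, cell_h - patch)
            left = gx * cell_w + rng.randint(0, cell_w - patch)
            boxes.append((top, left, top + patch, left + patch))
    return boxes

# Hypothetical 1080x1920 frame; the 144 patches tile into a 384x384 fragment frame.
boxes = fragment_boxes(1080, 1920, rng=random.Random(0))
fragment_side = 12 * 32
```

Because patches are copied at native resolution rather than resized, local distortions are preserved, which complements the resized \( 384\times384\) branch.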

To enhance generalization, they adopt an offline knowledge distillation strategy using RQ-VQA [6]—the winning solution of the NTIRE 2024 VQA Challenge—as the teacher model. They curate a pretraining dataset of 52,000 videos, consisting of 40,000 web-sourced videos and 12,000 synthetically compressed samples. Specifically, they generate 2,000 originals from the web-sourced videos and compress them using H.264 at 6 levels [2]. RQ-VQA is used to generate pseudo-labels for all videos. Finally, they pretrain E-VQA on this dataset using fidelity loss [81] and RMSE loss, and fine-tune it on the KVQ training set using PLCC loss. The source code and dataset will be publicly released to ensure reproducibility. The final test score achieved is 0.91083.

Training details They initialize the Swin-T backbone using weights from MinimalisticVQA [78]. For training, they sample one keyframe every two seconds (i.e., 0.5 fps). In the semantic perception branch, these keyframes are further resized to a resolution of \( 384\times384\) . In the distortion measurement branch, fragments are extracted from \( 12\times12\) partitions, with each fragment having a resolution of \( 32\times32\) . They train the proposed model using 4 NVIDIA RTX 3090 GPUs. The training consists of 3 epochs on the collected dataset and 30 epochs on the KVQ dataset [2], with a batch size of 6. The learning rate is set to \( 1 \times 10^{-5}\) , and the optimizer used is Adam. The total training time is approximately 3 hours for the pretraining stage and 4 hours for fine-tuning on the KVQ dataset. No additional training strategies or specific efficiency optimizations are applied.

Testing details During testing, the procedure for obtaining resized frames and fragments is kept exactly the same as in the training stage. The model uses the same input structure to ensure consistency in evaluation.

4.5 TenVQA

This team proposes a robust baseline for video quality assessment (VQA) using a single convolutional neural network (CNN). The core of their approach is based on the ConvNeXtV2-Tiny architecture, which is pre-trained on ImageNet and selected for its favorable balance between computational efficiency and classification performance.

They implement a two-stage training strategy with a differentiable global rank loss, namely RaMBO loss, to improve rank-aware quality prediction. For data processing, videos are resized to 720p while maintaining aspect ratio. From each video, four equidistant frames are sampled and \( 576 \times 576\) patches are cropped. During training, they apply data augmentations including random cropping, horizontal flipping, and brightness/contrast jittering to improve model robustness.

Training details The training consists of two stages. In Stage 1, the model is jointly trained on the KVQ dataset and a private dataset using a combination of L2 loss and PLCC loss. In Stage 2, they fine-tune the model on the KVQ dataset using PLCC loss and a differentiable SRCC loss, specifically the RaMBO loss. During training, four equidistant frames are randomly sampled from each video, and \( 576 \times 576\) random crops are applied. The training is implemented in PyTorch using Distributed Data Parallel (DDP) on 4 NVIDIA A100 GPUs. The AdamW optimizer is used, with a learning rate of \( 3 \times 10^{-5}\) for Stage 1 and \( 1 \times 10^{-5}\) for Stage 2. The total training time is approximately 4 hours for Stage 1 and 0.5 hours for Stage 2. Automatic Mixed Precision (AMP) is used to improve training efficiency, and Exponential Moving Average (EMA) is enabled for more stable optimization.

Testing details During testing, the model samples four equidistant frames from each video and applies a center crop of resolution \( 576 \times 576\) . The inference process follows the same preprocessing strategy as in training to ensure consistency.
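The test-time sampling described above can be sketched as follows. This is a minimal illustration, not the team's code: the function names are hypothetical, and taking the centre of each of four equal-length segments is only one possible reading of "equidistant".

```python
def equidistant_indices(num_frames, num_samples=4):
    """Return `num_samples` frame indices spread evenly over the video.

    Assumption: each index is the centre of one of `num_samples`
    equal-length segments; the report does not specify the exact rule.
    """
    if num_frames <= 0:
        raise ValueError("video has no frames")
    return [min(num_frames - 1, int((i + 0.5) * num_frames / num_samples))
            for i in range(num_samples)]


def center_crop_box(height, width, crop=576):
    """Return (top, left, bottom, right) of a centred crop, clipped to the frame."""
    crop_h, crop_w = min(crop, height), min(crop, width)
    top, left = (height - crop_h) // 2, (width - crop_w) // 2
    return top, left, top + crop_h, left + crop_w
```

For a 720p frame, `center_crop_box(720, 1280)` selects the central \( 576 \times 576\) region.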

4.6 GoldenChef

Figure 10. The overall framework of Team GoldenChef.

This team proposes a lightweight multi-feature fusion model for short-form UGC video quality assessment (LMFVQA), which integrates several advanced components to capture diverse quality-related features of videos. The model incorporates the visual component of the CLIP model with a ViT-32 backbone to extract semantic features, a ResNet50-3D-based network to extract temporal motion features (similar to the Fast pathway of SlowFast), a Swin Transformer-Tiny (Swin-T) to capture spatial global features, and a DenseNet121 to extract local texture features from video frames.

To achieve comprehensive feature integration, they employ a cross-attention fusion module to model the interaction between semantic and motion features, as well as between global and local spatial structures. This enhances the robustness and expressiveness of the representation for effective and efficient VQA. The fused features are subsequently processed through a multilayer perceptron (MLP) with two fully connected layers to derive the final video quality score. Notably, within the DenseNet121 branch dedicated to texture feature extraction, they introduce a channel attention mechanism (SE) to dynamically recalibrate channel weights and refine feature representations before the cross-attention fusion.

Training details They extract two keyframes per second from each video (i.e., 2 fps) as input to the model. These keyframes are resized to a resolution of \( 224 \times 224\) and fed into the ViT-32-based semantic branch, the DenseNet121-based texture branch, and the ResNet50-3D-based motion branch to extract semantic, texture, and motion features, respectively. Furthermore, the frame difference between keyframes is utilized to guide a grid-based sampling process.

In detail, each keyframe is divided into a uniform \( 7 \times 7\) grid. Frame difference information is used to sample one \( 32 \times 32\) patch at the original resolution within each grid cell. These sampled patches are then stitched together according to their original spatial locations to construct a \( 224 \times 224\) frame-difference sampling map. This map is subsequently used as input to the Swin Transformer-T for spatial feature extraction.
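A rough sketch of this grid-based sampling is given below. The report does not state how the frame difference selects a patch inside each cell, so the rule used here (keep the patch position with the largest summed absolute difference) is an assumption, as are the function names and the `stride` parameter.

```python
def sample_patch_origins(diff, grid=7, patch=32, stride=8):
    """Pick one patch origin per grid cell, guided by a frame-difference map.

    `diff` is a 2-D list (H x W) of absolute differences between keyframes.
    Returns a grid x grid nested list of (top, left) patch origins.
    """
    height, width = len(diff), len(diff[0])
    cell_h, cell_w = height // grid, width // grid
    if cell_h < patch or cell_w < patch:
        raise ValueError("each grid cell must be at least patch-sized")
    origins = []
    for gy in range(grid):
        row = []
        for gx in range(grid):
            y0, x0 = gy * cell_h, gx * cell_w
            best, best_score = None, -1.0
            # Assumed rule: slide the patch inside the cell and keep the
            # position with the largest summed frame difference.
            for top in range(y0, y0 + cell_h - patch + 1, stride):
                for left in range(x0, x0 + cell_w - patch + 1, stride):
                    score = sum(sum(diff[r][left:left + patch])
                                for r in range(top, top + patch))
                    if score > best_score:
                        best, best_score = (top, left), score
            row.append(best)
        origins.append(row)
    return origins


def stitch(frame, origins, patch=32):
    """Assemble the sampled patches, in grid order, into one square map."""
    out = []
    for row in origins:
        for dy in range(patch):
            line = []
            for top, left in row:
                line.extend(frame[top + dy][left:left + patch])
            out.append(line)
    return out
```

With the default `grid=7` and `patch=32`, the stitched map has side length \( 7 \times 32 = 224\), matching the input size stated above.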

The model is trained on a single NVIDIA RTX 3090 GPU with a batch size of 4 over 30 epochs. The learning rate is set to \( 1 \times 10^{-5}\) , and the optimizer used is Adam. The total training time is approximately 2.51 hours. They apply multiscale feature fusion and exponential decay of the learning rate as efficiency optimization strategies.

Testing details They train the model on the KVQ training set, select the best-performing checkpoint based on the KVQ validation set, and evaluate the final performance on the KVQ test set. The same preprocessing and input sampling strategies used during training are applied during inference to ensure consistency.

4.7 DAIQAM

Figure 11. The overall framework of Team DAIQAM.

This team uses ResNet-50 [82] to extract spatial features from patches of two versions of keyframes: original and downsampled. To leverage the effectiveness of image distortion manifold learning introduced in ARNIQA, they adopt the ResNet-50 encoder from that model as the initialization for their feature extractor. After extracting features from each frame, a projector module is applied to reduce the feature dimensionality. The features from all frames are then concatenated and passed through a multilayer perceptron (MLP) for quality regression.

Training details During training, they sample one keyframe for each one-second segment of the video. Since most videos in the training set are 8 seconds long, the total number of keyframes is typically 8. In cases where a video is shorter than 8 seconds, replicate padding is applied, duplicating the last available keyframe to maintain a consistent frame count.

They find that this sampling technique outperforms uniform frame sampling, improving the validation accuracy by 3.2%. A downsampled version of each keyframe is created by resizing the frame while maintaining its aspect ratio, setting the shorter side to 224 pixels. Both the original and downsampled keyframes are then randomly cropped to a resolution of \( 224 \times 224\) during training.
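The per-second keyframe sampling with replicate padding can be illustrated with the short sketch below. The function name is hypothetical, and taking the first frame of each one-second segment is an assumption; the report only says that one keyframe is sampled per segment.

```python
def keyframe_indices(num_frames, fps, target=8):
    """One keyframe index per one-second segment, padded to `target` entries."""
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    step = max(1, round(fps))
    # Assumption: use the first frame of each one-second segment.
    indices = list(range(0, num_frames, step))[:target]
    # Replicate padding: repeat the last available keyframe for short videos.
    indices += [indices[-1]] * (target - len(indices))
    return indices
```

A 5-second clip at 30 fps thus yields five distinct keyframes followed by three copies of the last one.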

The model is trained using the Adam optimizer with an initial learning rate of \( 1 \times 10^{-5}\) and a batch size of 8 on an NVIDIA RTX 4090 GPU. The learning rate is decayed by a factor of 10 after every 10 epochs, and the total number of training epochs is 20.
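As a worked illustration of this schedule, the sketch below computes the learning rate per epoch. It assumes zero-indexed epochs and a hypothetical function name; in PyTorch the same behaviour would normally come from `torch.optim.lr_scheduler.StepLR` with `step_size=10` and `gamma=0.1`.

```python
def step_lr(epoch, base_lr=1e-5, step=10, gamma=0.1):
    """Learning rate at a zero-indexed epoch under step decay."""
    return base_lr * gamma ** (epoch // step)


# Epochs 0-9 run at 1e-5 and epochs 10-19 at 1e-6 over the 20 training epochs.
schedule = [step_lr(e) for e in range(20)]
```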

Testing details During testing, videos are processed in the same way as in the training procedure. However, instead of random cropping, a center crop of \( 224 \times 224\) is applied to the keyframes to ensure consistent evaluation.

4.8 57VQA

This team submits a project adapted from the StarVQA model, making the necessary modifications for it to work on the specified competition dataset. The core architecture and methodology are inherited from the original StarVQA implementation without major structural changes.

Training details They use the SlowFast framework for model training, with the Adam optimizer and an NVIDIA RTX 4090 GPU. The learning rate is provided in the code and not explicitly stated in the report. The total training time is approximately 3 hours. Their training strategy is direct training on the competition dataset without additional pretraining or fine-tuning stages.

4.9 Nourayn

Figure 12. The overall framework of Team Nourayn.

This team presents a two-stage deep learning solution for the NTIRE2025 Video Quality Assessment Challenge. Their model combines spatial feature extraction with a pre-trained ResNet-50 backbone and Faster-RNN, alongside temporal modeling using a Bidirectional LSTM (BiLSTM). The overall design aims to predict the Mean Opinion Score (MOS) for video quality by capturing both spatial and temporal information from video sequences.

Training details They train the BiLSTM component using features extracted from the spatial encoder. A composite loss function is employed to guide the training process. To improve generalization, the model is trained on a combination of the official training and validation datasets.

Testing details The trained model is evaluated on the test dataset. They compute standard performance metrics including SROCC, PLCC, KROCC, and RMSE to assess the accuracy and robustness of the predicted video quality scores.
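For reference, the four metrics named above can be computed as in the following standard-library sketch. It is not the team's evaluation code: the function names are illustrative, KROCC is computed here as Kendall's tau-a (no tie correction), and no nonlinear mapping is applied before PLCC or RMSE.

```python
import math


def _ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def plcc(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def srocc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return plcc(_ranks(x), _ranks(y))


def krocc(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)


def rmse(x, y):
    """Root mean squared error between predictions and MOS."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))
```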

5 Teams and Methods of Track 2

5.1 TACO_SR

This team develops a two-stage super-resolution method, including an image super-resolution phase and an image fusion phase, as shown in Figure 13. Their method consists of three key components designed to generate a high-quality image: (1) PiSASR [83], (2) Detail Extractor, and (3) NAFusion (inspired by NAFSSR [84]). They optimize the NAFusion module on the KwaiSR dataset, while keeping the pre-trained PiSASR fixed.

In the first phase, they develop their method based on PiSASR [83]. They generate candidate images under two different settings: (1) \( \lambda_{\text{pix}} = 1.0, \lambda_{\text{sem}} = 0.0\) : This configuration prioritizes high fidelity, producing an image denoted as \( I_{\text{psnr}}\) ; and (2) \( \lambda_{\text{pix}} = 1.0, \lambda_{\text{sem}} = 1.0\) : This setting enhances perceptual quality, resulting in an image referred to as \( I_{\text{per}}\) . They observe that \( I_{\text{psnr}}\) achieves superior fidelity (PSNR, SSIM) but lacks perceptual quality (NIQE, MANIQA, CLIPIQA). Conversely, \( I_{\text{per}}\) excels in perceptual quality but exhibits lower fidelity. They exploit the strengths of both \( I_{\text{psnr}}\) and \( I_{\text{per}}\) with an image fusion model. Specifically, they first combine \( I_{\text{psnr}}\) and \( I_{\text{per}}\) into \( I_{\text{init}}\) with the Detail Extractor, which uses a high-pass filter to extract the high-frequency component of \( I_{\text{per}}\) and adds it to \( I_{\text{psnr}}\) .
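A minimal sketch of such a Detail Extractor is shown below on single-channel images stored as nested lists. The report does not specify the high-pass filter, so the choice made here (the image minus a box-blurred copy of itself) and the function names are assumptions.

```python
def box_blur(img, radius=1):
    """Low-pass filter: mean over a (2r+1) x (2r+1) window with clamped edges."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def detail_extractor(i_psnr, i_per, radius=1):
    """I_init = I_psnr + highpass(I_per), where highpass(I) = I - blur(I)."""
    low = box_blur(i_per, radius)
    return [[p + (q - l) for p, q, l in zip(row_p, row_q, row_l)]
            for row_p, row_q, row_l in zip(i_psnr, i_per, low)]
```

When \( I_{\text{per}}\) is flat, its high-frequency component is zero and the output equals \( I_{\text{psnr}}\) ; textured regions of \( I_{\text{per}}\) add detail on top of the high-fidelity image.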

In the second phase, they construct NAFusion using \( I_{\text{psnr}}\) , \( I_{\text{per}}\) , and \( I_{\text{init}}\) . Inspired by NAFSSR [84], they design the NAFusion module with NAFBlock and SCAM. Finally, they add \( I_{\text{init}}\) to the output of the main branch of NAFusion to generate \( I_{\text{HQ}}\) .

Figure 13. The overall framework of PiNAFusion-SR, proposed by Team TACO_SR.

Training Details First, they generate \( I_{\text{psnr}}\) and \( I_{\text{per}}\) from low-quality images in the provided training dataset using PiSASR under two settings (\( I_{\text{psnr}}\) : \( \lambda_{\text{pix}} = 1.0, \lambda_{\text{sem}} = 0.0\) ; \( I_{\text{per}}\) : \( \lambda_{\text{pix}} = 1.0, \lambda_{\text{sem}} = 1.0\) ). Then, they crop \( 512 \times 512\) image patches from \( I_{\text{psnr}}\) , \( I_{\text{per}}\) , and \( I_{\text{GT}}\) with an overlap of 128. They train NAFusion \( f\) using the data triplet \( (I_{\text{psnr}}, I_{\text{per}}, I_{\text{GT}})\) with a loss function \( L_{\text{total}}\) :

\[ \begin{equation} L_{\text{total}} = \lambda_1 L_{\text{pix}} + \lambda_2 L_{\text{ssim}} + \lambda_3 L_{\text{lpips}}, \end{equation} \]

(1)

where

\[ \begin{align*} L_{\text{pix}}&= L_1(f(I_{\text{psnr}}, I_{\text{per}}), I_{\text{GT}}) \text{ is L1 loss}, \\ L_{\text{ssim}} &= L_{\text{SSIM}}(f(I_{\text{psnr}}, I_{\text{per}}), I_{\text{GT}}) \text{ is SSIM loss}, \\ L_{\text{lpips}} &= L_{\text{LPIPS}}(f(I_{\text{psnr}}, I_{\text{per}}), I_{\text{GT}}) \text{ is LPIPS loss}. \end{align*} \]

They set the batch size to 8 and the learning rate to \( 1 \times 10^{-5}\) . They train NAFusion for 5 epochs, with \( \lambda_1 = 10.0, \lambda_2 = 0.5, \lambda_3 = 1.0\) .
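The overlapping patch extraction used to build the training triplets can be sketched as a sliding window with stride \( 512 - 128 = 384\). The function names are hypothetical, and shifting the last window back so that it ends at the image border is an assumed way of handling edges, which the report does not describe.

```python
def crop_origins(length, patch=512, overlap=128):
    """Start offsets of overlapping patches along one image axis."""
    if length < patch:
        raise ValueError("image is smaller than the patch size")
    stride = patch - overlap
    starts = list(range(0, length - patch + 1, stride))
    # Assumed edge handling: add a final patch aligned with the border.
    if starts[-1] + patch < length:
        starts.append(length - patch)
    return starts


def crop_boxes(height, width, patch=512, overlap=128):
    """All (top, left, bottom, right) patch boxes for one image."""
    return [(top, left, top + patch, left + patch)
            for top in crop_origins(height, patch, overlap)
            for left in crop_origins(width, patch, overlap)]
```

The same boxes would be applied to \( I_{\text{psnr}}\) , \( I_{\text{per}}\) , and \( I_{\text{GT}}\) so that each triplet stays spatially aligned.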

Testing Details They set the scale factor to 4 and 1 when running inference with PiNAFusion-SR on images from the “synthetic” and “wild” subfolders in the validation and test datasets. The scaling strategy is determined based on the resolution of the input image. Specifically, if the larger dimension (height or width) of the input image is below 500 pixels, they apply a scale factor of 4. Otherwise, they use a scale factor of 1. They perform inference on both the validation and test datasets using the same settings.
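The resolution-based rule above amounts to a one-line decision, sketched here with a hypothetical function name:

```python
def select_scale(height, width, threshold=500):
    """Return 4 when the larger image side is below `threshold` pixels, else 1."""
    return 4 if max(height, width) < threshold else 1
```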

5.2 RealismDiff

This team proposes a two-stage diffusion-based image super-resolution method consisting of the PreCleaner (a lightweight CNN) and SUPIR [85]. Specifically, as shown in Figure 14, they construct a dynamic pipeline that assigns different processing paths according to an image quality assessment. For images with weak degradation, they use a light restoration module (or a resize function) to preserve high-frequency information, together with fewer diffusion steps; for heavily distorted images, they use more extensive settings (a larger CNN and a diffusion model with more steps and a higher CFG). In addition, they collect a training dataset of high-resolution images from DIV2K [86], Flickr2K [87], FFHQ [88], and Laion5B [89].

Training Details They train the proposed method using 8 A100 GPUs and adopt the AdamW optimizer with default parameters. During training, they crop images into patches of 512\( \times\) 512 pixels and set the batch size to 32. They first pre-train the PreCleaner for 1000 iterations with an initial learning rate of \( 5\times 10^{-5}\) . They then fine-tune the entire pipeline in an end-to-end manner for 40,000 iterations. During the fine-tuning stage, the learning rate for the ControlNet is initialized to \( 5\times 10^{-6}\) . The learning rates are updated using the cosine annealing scheme.

Figure 14. The framework proposed by Team RealismDiff.

5.3 SRlab

As shown in Figure 15, the method of this team [90] is based on the diffusion framework. They employ a VAE to encode the input low-quality images and obtain their corresponding latent representations. These latents are then processed by a Denoising U-Net, which iteratively refines them through multiple denoising steps. To ensure the generated images maintain high fidelity with the low-resolution (LR) inputs, they incorporate the ControlNet architecture, which allows for precise control over the generation process. Furthermore, recognizing that the official synthetic and wild test sets exhibit varying degrees of degradation and require different super-resolution scaling factors, they enhance their approach by utilizing the Segment Anything Model 2 [91] (SAM2). SAM2 is employed to extract rich semantic embeddings from these images, providing additional contextual information that aids in the denoising process. The extracted latents, enriched with semantic embeddings, are subsequently fed into the Denoising U-Net for T steps of iterative refinement. During training, they optimize their model by minimizing the denoising objective:

\[ \begin{equation} \mathcal{L} = \mathbb{E}_{X_0,X_{lr},t,c,c_{\text{sem}},\epsilon} \left\|\epsilon - \epsilon_{\theta}(X_t, X_{lr}, t, c, c_{\text{sem}})\right\|^2, \end{equation} \]

(2)

where \( X_{lr}\) represents the low-resolution (LR) latent, \( c\) denotes the tag prompt, and \( c_{\text{sem}}\) is the semantic embedding. The noise estimation network \( \epsilon_{\theta}\) is responsible for predicting the noise \( \epsilon \sim \mathcal{N}(0, I)\) .

Figure 15. Overall Pipeline of the solution of Team SRlab.

Training Details To enhance the model’s performance in short-form UGC scenarios, they constructed a new training dataset by combining the synthetic training set provided by the competition with the LSDIR dataset [92]. First, they processed the high-resolution images from the LSDIR training set by applying a \( 4\times\) downsampling and degradation with a 50% probability for each of two degradation modes to simulate the image degradation scenarios in the synthetic and wild datasets, respectively. Second, they refined the synthetic training set (1440 pairs of gt and lr images) by cropping and overlapping its high-resolution (1080×1920) images to generate \( 512\times 512\) sub-images, which served as ground truth (GT) images. These GT images were then 4× downsampled and degraded to create the corresponding low-resolution images. Finally, they merged the processed LSDIR training set with the synthetic training set to form the final dataset for training.

They trained the model on a synthesized 512\( \times\) 512 dataset for 90,000 steps using an NVIDIA RTX 3090 GPU and the Adam optimizer with a learning rate of \( 5\times 10^{-5}\) . The original Stable Diffusion parameters were frozen, and only the ControlNet component and semantic embedding transformer module were trained. The implementation was built on PyTorch, with mixed precision training (FP16) and gradient accumulation to optimize efficiency.

Testing Details During the testing phase, they analyzed the impact of three key parameters on the experimental results: “start point”, “guidance scale” (gs), and positive/negative prompts.

First, they evaluated the model’s performance on both the synthetic and wild datasets by setting “start point” to either “noise” or “lr” while keeping “gs = 5.5” without adding additional prompts. Next, with “gs = 5.5”, they introduced positive prompts (“ultra-detailed”, “ultra-realistic”) and negative prompts (“distorted”, “deformed”) separately, comparing the model’s overall performance in each case. Finally, they examined the model’s final scores across different “gs” values. Based on the results, they selected the best-performing parameter combination for the final test settings.

5.4 SYSU-FVL-Team

This team develops a diffusion-based framework. They leverage the diffusion model to predict the residual between the LQ latent feature and the HQ latent feature rather than directly predicting the HQ latent feature itself. This residual learning formulation helps the model focus on learning the desired high-frequency information from the HQ latent features, and it can also accelerate the convergence of model training [93]. The predicted HQ latent feature is then decoded into a high-quality image by the decoder. To adjust the preference between pixel-level and semantic-level quality, they add a dual LoRA module that controls pixel-wise and semantic-wise quality on top of the pretrained diffusion model SD21, similar to PiSASR [83].

Following PiSASR, they introduce a pair of pixel and semantic guidance factors, denoted by \( \lambda_{\text{pix}}\) and \( \lambda_{\text{sem}}\) , to control the SR results as follows:

\[ \begin{equation} \epsilon_\theta(z_L) = \lambda_{\text{pix}}\epsilon_{\theta_{\text{pix}}}(z_L) + \lambda_{\text{sem}}(\epsilon_{\theta_{\text{PiSA}}}(z_L) - \epsilon_{\theta_{\text{pix}}}(z_L)), \end{equation} \]

(3)

where \( \epsilon_{\theta_{\text{pix}}}(z_L)\) is the output with only pixel-level LoRA, and \( \epsilon_{\theta_{\text{PiSA}}}(z_L)\) is the output with both pixel and semantic level enhancement. When processing synthetic images, \( \lambda_{\text{pix}}\) and \( \lambda_{\text{sem}}\) are set as 1.0 and 0.5 to process wild images, separately.
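Equation (3) is a simple linear combination of the two network outputs. The sketch below applies it element-wise to flat lists standing in for the noise predictions; the function and argument names are illustrative only.

```python
def guided_noise(eps_pix, eps_pisa, lam_pix=1.0, lam_sem=1.0):
    """Combine pixel-only and pixel+semantic outputs following Eq. (3)."""
    return [lam_pix * p + lam_sem * (s - p)
            for p, s in zip(eps_pix, eps_pisa)]
```

With \( \lambda_{\text{pix}} = 1\) , setting \( \lambda_{\text{sem}} = 0\) returns the pixel-only output, \( \lambda_{\text{sem}} = 1\) returns the output with both LoRA modules, and intermediate values interpolate between the two.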

Training Details This team utilized an RTX 3090 GPU for model training. The model was optimized using the Adam optimizer with a learning rate of \( 5\times 10^{-5}\) , and trained on the released KwaiSR dataset. The training process spanned 96 hours, during which they fine-tuned the model for 100,000 iterations. Both synthetic and wild datasets were processed with a consistent batch size of 4.

5.5 NetLab

This team addresses two key challenges in diffusion-based super-resolution, i.e., inference efficiency and generalization ability. The pipeline of their method is shown in Figure 16. For the first challenge, they design a compact architecture inspired by the Diffusion Transformer (DiT) [94] but optimized for high-resolution inputs. To handle the mismatched resolutions between input images and latents, they introduce a lightweight \( 8\times\) downscaling encoder using convolutions. To mitigate DiT’s quadratic complexity, they simplify the transformer structure with ReLU activations and a Q(KV) operation [95], while enhancing local detail handling through a hybrid convolutional-linear feed-forward network. For the second challenge, they refine the original RealESRGAN [19] degradation pipeline by adjusting downscaling probabilities (0.4 for the final stage and 0.8 for the initial stage) and incorporating more recent compression formats (WEBP, HEIF, AVIF) to avoid unrealistic artifacts demonstrated in the KwaiSR dataset. They also simulate real-world text overlay degradation by injecting synthetic text components into the degradation pipeline before applying compression. They adopt ResShift’s efficient Markov chain framework [96] as the diffusion scheduler, achieving significant quality improvements in both synthetic and real-world scenarios.

Figure 16. The overall pipeline of the method proposed by Team NetLab. (a) General pipeline for inference on an image. (b) Proposed degradation process to match the real-world production of short-form UGC video. (c) Model structure details.

Training Details The training process of their method is conducted on PyTorch using NVIDIA RTX 4090 GPUs. It consists of two distinct phases, each with specific configurations.

In the first phase, they train their model on a composite dataset including WIDER [97], LSDIR [92], DF2K [86], OST [98] and the first 10k images in FFHQ [99]. For WIDER, they keep only images with CLIPIQA \( >\) 0.6, MANIQA \( >\) 0.4 and MUSIQ \( >\) 60. As a result, 1768 images are selected for training the model. For the loss function in this phase, MSE loss is used in the latent space. They employ the AdamW optimizer and a cosine annealing learning rate schedule. This schedule includes a warm-up period of 20,000 iterations, during which the learning rate ramps up to a peak value of \( 1\times 10^{-4}\) , and then decays to \( 1\times 10^{-5}\) . With a global batch size of 64, the model is optimized for 400k iterations. This whole phase takes about 3 days on 8 GPUs.
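The first-phase schedule can be sketched as below. The report gives only the warm-up length, the peak, and the final value, so the linear shape of the warm-up and the function name are assumptions.

```python
import math


def lr_at(step, total=400_000, warmup=20_000, peak=1e-4, floor=1e-5):
    """Learning rate at a zero-indexed iteration: warm-up, then cosine decay."""
    if step < warmup:
        # Assumed linear ramp from near zero up to the peak value.
        return peak * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    # Cosine annealing from `peak` down to `floor`.
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```

The same function would describe the second phase with `warmup=2_000`, `peak=3e-5`, `floor=1e-5`, and `total=16_000`.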

In the second phase, they extend the training dataset to include the high-resolution (HR) images from the KwaiSR dataset (synthetic portion only; the LR portion and the wild part are discarded for ease of augmentation), in addition to the datasets used in the first phase. These images are repeated six times per epoch so that the model better fits the distribution of UGC content. Additionally, about 6k high-quality images selected by IQA metrics from the WIDER dataset are also used. As in the first phase, they use the AdamW optimizer and a learning rate schedule with a 2,000-iteration warm-up. The learning rate in this phase is annealed from a peak of \( 3\times 10^{-5}\) down to a minimum of \( 1\times 10^{-5}\) . In the latent space, an L2 loss is applied. Simultaneously, in the image space, they incorporate L1, L2, SSIM, and LPIPS losses with respective weights of 1, 0.3, 0.3, 0.1, and 0.05. The batch size is set to 1 per GPU with 8\( \times\) gradient accumulation to save VRAM. This phase lasts for 16,000 iterations and takes 0.75 days on 7 GPUs. They adopt mixed-precision training with fp16 for efficiency. They randomly crop each image to a size of 512 and use a modified version of the RealESRGAN degradation process to generate low-resolution inputs.

Testing Details During testing, they first interpolate images to 1080p and encode them into the latent space with the VAE. The noise and the original image (as the condition) are fed to the model for 15-step sampling, following the same process as ResShift [96]. For wild content, images are downscaled and then upscaled back to the original shape by a factor of 2\( \times\) before being used as the condition.

5.6 BrainyBots Team

This team develops the method after a comprehensive analysis of existing super-resolution approaches. They propose a hybrid approach combining RealESRGAN [19] and SinSR [100] by leveraging their complementary strengths. RealESRGAN excels at removing real-world distortions, whereas SinSR specializes in achieving super-resolution in a single step. Consequently, their method performs super-resolution by sequentially processing distorted images with SinSR followed by RealESRGAN.

5.7 NVDTOFCUC

This team proposes a two-stage, training-free, diffusion-based super-resolution method built on the pre-trained SeeSR [101], as shown in Figure 17. In the first stage, they adopt a zigzag sampling method [102] to accelerate the denoising process of SeeSR. The denoising trajectory alternates between deterministic forward steps and stochastic backward jumps. They dynamically skip denoising steps based on gradient magnitude thresholds, preventing restored images from being oversmoothed in high-frequency regions. In the second stage, they adopt the standard DDPM sampling strategy (with 30 denoising steps) to refine the super-resolved image with multi-scale feature fusion. The denoising is performed on a single A100-SXM4-80GB GPU.

Figure 17. The framework of ZigZagSeeSR, proposed by Team NVDTOFCUC.

5.8 BVIVSR

This team proposes to build the method based on the state-of-the-art super-resolution model MambaIRv2 [103] and the continuous super-resolution approach HIIF [104], denoted as HIMambaSR. As depicted in Figure 18, they adopt the MambaIRv2-B model as the latent encoder \( E_{\varphi}\) without its upsampling modules. These latent features are subsequently processed by HIIF [104] as the latent decoder \( D_{\rho}\) to generate restored images. Specifically, \( E_{\varphi}\) consists of a sequence of Attentive State Space Groups (ASSG), with each ASSG incorporating multiple Attentive State Space Blocks (ASSBs). Within each ASSB, a progressive local-to-global modeling strategy is employed. Notably, Window Multi-Head Self-Attention (MHSA) is used to capture local interactions, while the Attentive State Space Model (ASSM) models global dependencies. Each block follows a “Norm \( \rightarrow\) Token Mixer \( \rightarrow\) Norm \( \rightarrow\) FFN” structure, and incorporates two residual connections with learnable scaling. This encoder \( E_{\varphi}\) is responsible for extracting deep latent features from the input low-resolution image. \( D_{\rho}\) consists of a multi-scale hierarchical encoding module, multiple multi-head linear attention blocks, and MLPs. Its hierarchical positional encoding captures the local implicit image function across different scales. By progressively injecting these encodings into the network, features at each scale are effectively propagated and shared among neighboring sampling points. This enhances the ability of the network to exploit spatial correlations and reconstruct high-frequency details.

Figure 18. The framework proposed by Team BVIVSR.

Training Details They adopt the original configurations of MambaIRv2-B and HIIF as the model settings. They utilize a combination of DIV2K [86], 1000 high-resolution images from BVI-AOM [105], Flickr2K [87] and 5000 images from LSDIR [92] as the training dataset. For evaluation, they follow common practice in the continuous super-resolution task [104, 106] and employ the DIV2K validation set (containing 100 images) [86]. The maximum learning rate is set to \( 4\times 10^{-4}\) . The learning rate follows a cosine annealing schedule, gradually decreasing after an initial warm-up phase of 50 epochs. L1 loss and the Adam [107] optimizer are adopted to optimize their model during training. Training and testing are implemented in PyTorch on 4 NVIDIA A100 GPUs. Their model is trained for 1000 epochs with a batch size of 48.

Acknowledgments

This work was partially supported by NSFC under Grant 623B2098 and the China Postdoctoral Science Foundation-Anhui Joint Support Program under Grant Number 2024T017AH. We thank Kuaishou for sponsoring this challenge. This work was also partially supported by the Humboldt Foundation. We thank the NTIRE 2025 sponsors: Kuaishou, ByteDance, Meituan, and University of Wurzburg (Computer Vision Lab).

Appendix

A Teams and Affiliations of Track 1

NTIRE 2025 Organizers

Title: NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement-Track 1

Members: Xin Li¹ (Email), Kun Yuan² (Email), Fengbin Guan¹, Zihao Yu¹, Yiting Lu¹, Wei Luo¹, Ming Sun², Chao Zhou², Zhibo Chen¹, and Radu Timofte³

Affiliations:

¹ University of Science and Technology of China

² KuaiShou Technology

³ Computer Vision Lab, University of Wurzburg, Germany

SharpMind

Title: Distillation-based Video Quality Assessment: Aligning with Human Eye Characteristics for Enhanced Precision

Members: Yabin Zhang (Email), Ao-Xiang Zhang, Tianwu Zhi, Jianzhao Liu, Yang Li, Jingwen Xu, and Yiting Liao

Affiliations:

Bytedance Inc.

ZQE

Title: ZQE (Z-Tech Video Quality Evaluator)

Members: Yufan Liu (Email), Xiangguang Chen, Zuowei Cao, Minhao Tang, and Shan Liu

Affiliations:

Tencent Online Video

ZX-AIE-Vector

Title: Mamba-in-Mamba-Out: A Lightweight Video Quality Assessment Network with Hybrid Mamba-Attention Design

Members: Yunchen Zhang (Email), Xiangkai Xu, Hong Gao, Ji Shi, Yiming Bao, Xiugang Dong, Xiangsheng Zhou, Yaofeng Tu

Affiliations:

ZTE Corporation

ECNU-SJTU VQA Team

Title: Towards Good Practices for Efficient Video Quality Assessment

Members: Wei Sun (Email), Kang Fu, Linhan Cao, Dandan Zhu, Kaiwei Zhang, Yucheng Zhu, Zicheng Zhang, Menghan Hu, Xiongkuo Min and Guangtao Zhai

Affiliations:

East China Normal University, Shanghai Jiao Tong University

TenVQA

Title: Strong Baseline Strategies for Video Quality Assessment Tasks

Members: Yuhai Lan (Email), Gaoxiong Yi

Affiliations:

Tencent

GoldenChef

Title: Lightweight Multi-Feature Cross Attention Fusion Model for Short-form UGC Video Quality Assessment

Members: MingYin Bai (Email), Jiawang Du, Zilong Lu, Zhenyu Jiang, Hui Zeng, Ziguan Cui, Zongliang Gan, Guijin Tang

Affiliations:

College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications

DAIQAM

Title: Short-form Video Quality Assessment: a simple approach

Members: Ha Thu Nguyen (Email), Katrien De Moor, Seyed Ali Amirshahi, Mohamed-Chaker Larabi

Affiliations:

Norwegian University of Science and Technology; Université de Poitiers, CNRS, XLIM, France

57VQA

Title:

Members: Zhiye Huang (Email), Yi Deng

Affiliations:

Beijing University of Posts and Telecommunications

Nourayn

Title: No Title

Members: Nourine Mohammed Nadir (Email)

Affiliations:

None

B Teams and Affiliations of Track 2

NTIRE 2025 Organizers

Title: NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement-Track 2

Members: Xin Li¹ (Email), Kun Yuan² (Email), Bingchen Li¹, YiZhen Shao², Xijun Wang¹, Suhang Yao¹, Ming Sun², Chao Zhou², Zhibo Chen¹, and Radu Timofte³

Affiliations:

¹ University of Science and Technology of China

² KuaiShou Technology

³ Computer Vision Lab, University of Wurzburg, Germany

TACO_SR

Title: PiNAFusion-SR

Members: Yushen Zuo¹ (Email), Mingyang Wu², Renjie Li², Shengyun Zhong³, Zhengzhong Tu²

Affiliations:

¹ The Hong Kong Polytechnic University

² Texas A&M University

³ Northeastern University

RealismDiff

Title: No Title

Members: Kexin Zhang (Email), Jingfen Xie, Yan Wang, Kai Chen, Shijie Zhao

Affiliations: Bytedance Inc.

SRlab

Title: No Title

Members: Ying Liang¹ (Email), Yiwen Wang¹, Xinning Chai¹, Yuxuan Zhang¹, Zhengxue Cheng¹, Yingsheng Qin², Yucai Yang², Rong Xie¹, Li Song¹

Affiliations:

¹Shanghai Jiao Tong University

²Transsion, China

SYSU-FVL-Team

Title: Pixel-level and Semantic-level Adjustable Super-resolution.

Members: Zhi Jin (Email), Jiawei Wu, Wei Wang, Wenjian Zhang

Affiliations:

Shenzhen Campus of Sun Yat-sen University

NetLab

Title: Make Small Model Diffuse Well to Higher Resolution

Members: Hengyuan Na (Email), Wang Luo, Di Wu

Affiliations:

Sun Yat-sen University

BrainyBots Team

Title: No Title

Members: Xinglin Xie (Email), Kehuan Song (Email), Xiaoqiang Lu (Email), Licheng Jiao (Email), Fang Liu (Email), Xu Liu (Email), Puhua Chen (Email)

Affiliations:

XiDian University

BP-SR

Title: No title

Members: Qi Tang (Email), Linfeng He, Zhiyong Gao, Zixuan Gao, Guohua Zhang, Meiqin Liu, Chao Yao, Yao Zhao

Affiliations: Beijing Jiaotong University

NVDTOFCUC

Title: ZigZagSeeSR: Semantic-Driven Super-Resolution Diffusion Sampling

Members: Qingmiao Jiang (Email), Lu Chen, Yi Yang, Xi Liao

Affiliations:

School of Information and Communication Engineering, Communication University of China

BVIVSR

Title: No Title

Members: Yuxuan Jiang¹ (Email), Qiang Zhu²,¹, Siyue Teng¹, Fan Zhang¹, Shuyuan Zhu², Bing Zeng², and David Bull¹

Affiliations:

¹ University of Bristol

² University of Electronic Science and Technology of China

References

[1] Xin Li, Kun Yuan, Yajing Pei, Yiting Lu, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, Wei Sun, Haoning Wu, et al. Ntire 2024 challenge on short-form ugc video quality assessment: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6415–6431, 2024.

[2] Yiting Lu, Xin Li, Yajing Pei, Kun Yuan, Qizhi Xie, Yunpeng Qu, Ming Sun, Chao Zhou, and Zhibo Chen. Kvq: Kwai video quality assessment for short-form videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25963–25973, 2024.

[3] Xin Li, Xijun Wang, Bingchen Li, Kun Yuan, Yizhen Shao, Suhang Yao, Ming Sun, Chao Zhou, Radu Timofte, and Zhibo Chen. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[4] Wei Sun, Xiongkuo Min, Wei Lu, and Guangtao Zhai. A deep learning based no-reference quality assessment model for ugc videos. In Proceedings of the 30th ACM International Conference on Multimedia, pages 856–865, 2022.

[5] Haoning Wu, Chaofeng Chen, Jingwen Hou, Liang Liao, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Fast-vqa: Efficient end-to-end video quality assessment with fragment sampling. In European conference on computer vision, pages 538–554. Springer, 2022.

[6] Wei Sun, Haoning Wu, Zicheng Zhang, Jun Jia, Zhichao Zhang, Linhan Cao, Qiubo Chen, Xiongkuo Min, Weisi Lin, and Guangtao Zhai. Enhancing blind video quality assessment with rich quality-aware features. arXiv preprint arXiv:2405.08745, 2024.

[7] Haiqiang Wang, Gary Li, Shan Liu, and C.-C. Jay Kuo. Icme 2021 ugc-vqa challenge. Available: http://ugcvqa.com/.

[8] Qizhi Xie, Kun Yuan, Yunpeng Qu, Mingda Wu, Ming Sun, Chao Zhou, and Jihong Zhu. QPT-V2: masked image modeling advances visual scoring. In ACM Multimedia, pages 2709–2718. ACM, 2024.

[9] Weixia Zhang, Guangtao Zhai, Ying Wei, Xiaokang Yang, and Kede Ma. Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14071–14081, 2023.

[10] Zhengzhong Tu, Yilin Wang, Neil Birkbeck, Balu Adsumilli, and Alan C Bovik. UGC-VQA: Benchmarking blind video quality assessment for user generated content. IEEE Transactions on Image Processing, 30: 4449–4464, 2021.

[11] Zihao Yu, Fengbin Guan, Yiting Lu, Xin Li, and Zhibo Chen. Sf-iqa: Quality and similarity integration for ai generated image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6692–6701, 2024.

[12] Kai Zhao, Kun Yuan, Ming Sun, Mading Li, and Xing Wen. Quality-aware pre-trained models for blind image quality assessment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22302–22313, 2023.

[13] Hongbo Liu, Mingda Wu, Kun Yuan, Ming Sun, Yansong Tang, Chuanchuan Zheng, Xing Wen, and Xiu Li. Ada-dqa: Adaptive diverse quality-aware feature acquisition for video quality assessment. In ACM Multimedia, pages 6695–6704. ACM, 2023.

[14] Yunpeng Qu, Kun Yuan, Qizhi Xie, Ming Sun, Chao Zhou, and Jian Wang. KVQ: boosting video quality assessment via saliency-guided local perception. CoRR, abs/2503.10259, 2025.

[15] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1833–1844, 2021.

[16] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 12009–12019, 2022.

[17] Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao, and Chao Dong. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22367–22377, 2023.

[18] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 136–144, 2017.

[19] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In ICCV, pages 1905–1914, 2021.

[20] Bingchen Li, Xin Li, Hanxin Zhu, Yeying Jin, Ruoyu Feng, Zhizheng Zhang, and Zhibo Chen. Sed: Semantic-aware discriminator for image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 25784–25795, 2024.

[21] Ren Yang, Radu Timofte, Xin Li, Qi Zhang, Lin Zhang, Fanglong Liu, Dongliang He, Fu Li, He Zheng, Weihang Yuan, et al. Aim 2022 challenge on super-resolution of compressed image and video: Dataset, methods and results. In European Conference on Computer Vision, pages 174–202. Springer, 2022.

[22] Xin Li, Xin Jin, Jun Fu, Xiaoyuan Yu, Bei Tong, and Zhibo Chen. A Close Look at Few-shot Real Image Super-resolution from the Distortion Relation Perspective. arXiv preprint arXiv:2111.13078, 2021.

[23] Yunpeng Qu, Kun Yuan, Kai Zhao, Qizhi Xie, Jinhua Hao, Ming Sun, and Chao Zhou. XPSR: cross-modal priors for diffusion-based image super-resolution. In ECCV (11), pages 285–303. Springer, 2024.

[24] Wei Sun, Tao Wang, Xiongkuo Min, Fuwang Yi, and Guangtao Zhai. Deep learning based full-reference and no-reference quality assessment models for compressed ugc videos. In 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pages 1–6. IEEE, 2021.

[25] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. workshops, pages 586–595, 2018.

[26] Wei Sun, Xiongkuo Min, Wei Lu, and Guangtao Zhai. A deep learning based no-reference quality assessment model for ugc videos. In Proceedings of the 30th ACM International Conference on Multimedia, pages 856–865, 2022.

[27] Fengbin Guan, Zihao Yu, Yiting Lu, Xin Li, and Zhibo Chen. InternVQA: Advancing Compressed Video Quality Assessment with Distilling Large Foundation Model. arXiv preprint arXiv:2502.19026, 2025.

[28] Zihao Yu, Fengbin Guan, Yiting Lu, Xin Li, and Zhibo Chen. Video quality assessment based on swin transformerv2 and coarse to fine strategy. arXiv preprint arXiv:2401.08522, 2024.

[29] Rajiv Soundararajan and Alan C Bovik. Video quality assessment by reduced reference spatio-temporal entropic differencing. IEEE Transactions on Circuits and Systems for Video Technology, 23 (4): 684–694, 2012.

[30] Lin Ma, Songnan Li, and King Ngi Ngan. Reduced-reference video quality assessment of compressed video sequences. IEEE Transactions on circuits and systems for video technology, 22 (10): 1441–1456, 2012.

[31] Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, et al. Q-align: Teaching lmms for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090, 2023.

[32] Yiting Lu, Xin Li, Haoning Wu, Bingchen Li, Weisi Lin, and Zhibo Chen. Q-Adapt: Adapting LMM for Visual Quality Assessment with Progressive Instruction Tuning. arXiv preprint arXiv:2504.01655, 2025.

[33] Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Kaixin Xu, Chunyi Li, Jingwen Hou, Guangtao Zhai, et al. Q-instruct: Improving low-level visual abilities for multi-modality foundation models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 25490–25500, 2024.

[34] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. workshops, pages 136–144, 2017.

[35] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. workshops, pages 1646–1654, 2016.

[36] Xin Li, Bingchen Li, Xin Jin, Cuiling Lan, and Zhibo Chen. Learning distortion invariant representation for image restoration from a causality perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1714–1724, 2023.

[37] Zhisheng Lu, Juncheng Li, Hong Liu, Chaoyan Huang, Linlin Zhang, and Tieyong Zeng. Transformer for single image super-resolution. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 457–466, 2022.

[38] Zhisheng Lu, Hong Liu, Juncheng Li, and Linlin Zhang. Efficient transformer for single image super-resolution. arXiv:2108.11084, 2021.

[39] X Chen, X Wang, J Zhou, and C Dong. Activating More Pixels in Image Super-Resolution Transformer. arXiv:2205.04437.

[40] Dafeng Zhang, Feiyu Huang, Shizhuo Liu, Xiaobing Wang, and Zhezhu Jin. SwinFIR: Revisiting the SWINIR with fast Fourier convolution and improved training for image super-resolution. arXiv:2208.11247, 2022.

[41] Bingchen Li, Xin Li, Yiting Lu, Sen Liu, Ruoyu Feng, and Zhibo Chen. Hst: Hierarchical swin transformer for compressed image super-resolution. In European conference on computer vision, pages 651–668. Springer, 2022.

[42] Hang Guo, Jinmin Li, Tao Dai, Zhihao Ouyang, Xudong Ren, and Shu-Tao Xia. Mambair: A simple baseline for image restoration with state-space model. In European conference on computer vision, pages 222–241. Springer, 2024.

[43] Yulin Ren, Xin Li, Mengxi Guo, Bingchen Li, Shijie Zhao, and Zhibo Chen. MambaCSR: Dual-Interleaved Scanning for Compressed Image Super-Resolution With SSMs. arXiv preprint arXiv:2408.11758, 2024.

[44] Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. Maxim: Multi-axis mlp for image processing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5769–5780, 2022.

[45] Xin Li, Bingchen Li, Yeying Jin, Cuiling Lan, Hanxin Zhu, Yulin Ren, and Zhibo Chen. UCIP: A universal framework for compressed image super-resolution using dynamic prompt. In European Conference on Computer Vision, pages 107–125. Springer, 2024.

[46] Florin-Alexandru Vasluianu, Tim Seizinger, Zhuyun Zhou, Zongwei Wu, Radu Timofte, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[47] Kangning Yang, Jie Cai, Ling Ouyang, Florin-Alexandru Vasluianu, Radu Timofte, Jiaming Ding, Huiming Sun, Lan Fu, Jinlong Li, Chiu Man Ho, Zibo Meng, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[48] Florin-Alexandru Vasluianu, Tim Seizinger, Zhuyun Zhou, Cailian Chen, Zongwei Wu, Radu Timofte, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[49] Lei Sun, Andrea Alfarano, Peiqi Duan, Shaolin Su, Kaiwei Wang, Boxin Shi, Radu Timofte, Danda Pani Paudel, Luc Van Gool, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[50] Lei Sun, Hang Guo, Bin Ren, Luc Van Gool, Radu Timofte, Yawei Li, et al. The Tenth NTIRE 2025 Image Denoising Challenge Report. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[51] Xiaohong Liu, Xiongkuo Min, Qiang Hu, Xiaoyun Zhang, Jie Guo, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[52] Nickolay Safonov, Alexey Bryntsev, Andrey Moskalenko, Dmitry Kulikov, Dmitriy Vatolin, Radu Timofte, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[53] Egor Ershov, Sergey Korchagin, Alexei Khalin, Artyom Panshin, Arseniy Terekhin, Ekaterina Zaychenkova, Georgiy Lobarev, Vsevolod Plokhotnyuk, Denis Abramov, Elisey Zhdanov, Sofia Dorogova, Yasin Mamedov, Nikola Banic, Georgii Perevozchikov, Radu Timofte, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[54] Zheng Chen, Kai Liu, Jue Gong, Jingkai Wang, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[55] Zheng Chen, Jingkai Wang, Kai Liu, Jue Gong, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[56] Bin Ren, Hang Guo, Lei Sun, Zongwei Wu, Radu Timofte, Yawei Li, et al. The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[57] Pierluigi Zama Ramirez, Fabio Tosi, Luigi Di Stefano, Radu Timofte, Alex Costanzino, Matteo Poggi, Samuele Salti, Stefano Mattoccia, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[58] Sangmin Lee, Eunpil Park, Angel Canelo, Hyunhee Park, Youngjo Kim, Hyungju Chun, Xin Jin, Chongyi Li, Chun-Le Guo, Radu Timofte, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[59] Yuqian Fu, Xingyu Qiu, Bin Ren, Yanwei Fu, Radu Timofte, Nicu Sebe, Ming-Hsuan Yang, Luc Van Gool, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[60] Xin Li, Kun Yuan, Bingchen Li, Fengbin Guan, Yizhen Shao, Zihao Yu, Xijun Wang, Yiting Lu, Wei Luo, Suhang Yao, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[61] Shuhao Han, Haotian Fan, Fangyuan Kong, Wenjie Liao, Chunle Guo, Chongyi Li, Radu Timofte, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[62] Xin Li, Yeying Jin, Xin Jin, Zongwei Wu, Bingchen Li, Yufei Wang, Wenhan Yang, Yu Li, Zhibo Chen, Bihan Wen, Robby Tan, Radu Timofte, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[63] Varun Jain, Zongwei Wu, Quan Zou, Louis Florentin, Henrik Turbell, Sandeep Siddhartha, Radu Timofte, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[64] Xiaoning Liu, Zongwei Wu, Florin-Alexandru Vasluianu, Hailong Yan, Bin Ren, Yulun Zhang, Shuhang Gu, Le Zhang, Ce Zhu, Radu Timofte, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[65] Yingqian Wang, Zhengyu Liang, Fengyuan Zhang, Lvli Tian, Longguang Wang, Juncheng Li, Jungang Yang, Radu Timofte, Yulan Guo, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[66] Jie Liang, Radu Timofte, Qiaosi Yi, Zhengqiang Zhang, Shuaizheng Liu, Lingchen Sun, Rongyuan Wu, Xindong Zhang, Hui Zeng, Lei Zhang, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[67] Marcos Conde, Radu Timofte, et al. NTIRE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[68] Marcos Conde, Radu Timofte, et al. RAW Image Reconstruction from RGB on Smartphones. NTIRE 2025 Challenge Report. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[69] Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, and Chao Dong. Teaching large language models to regress accurate image quality scores using score distribution. arXiv preprint arXiv:2501.11561, 2025.

[70] Ao-Xiang Zhang, Yuan-Gen Wang, Weixuan Tang, Leida Li, and Sam Kwong. A spatial–temporal video quality assessment method via comprehensive hvs simulation. IEEE Transactions on Cybernetics, 54 (8): 4749–4762, 2023.

[71] Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20144–20154, 2023.

[72] Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. Videomamba: State space model for efficient video understanding. In European Conference on Computer Vision, pages 237–255. Springer, 2024.

[73] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.

[74] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211, 2019.

[75] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

[76] Zhenqiang Ying, Maniratnam Mandal, Deepti Ghadiyaram, and Alan Bovik. Patch-vq: 'patching up' the video quality problem. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14019–14029, 2021.

[77] Wei Sun, Kang Fu, Linhan Cao, Dandan Zhu, Kaiwei Zhang, Yucheng Zhu, Zicheng Zhang, Menghan Hu, Xiongkuo Min, and Guangtao Zhai. An empirical study for efficient video quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

[78] Wei Sun, Wen Wen, Xiongkuo Min, Long Lan, Guangtao Zhai, and Kede Ma. Analysis of video quality datasets via design of minimalistic video quality models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

[79] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.

[80] Christoph Feichtenhofer. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 203–213, 2020.

[81] Ming-Feng Tsai, Tie-Yan Liu, Tao Qin, Hsin-Hsi Chen, and Wei-Ying Ma. Frank: a ranking method with fidelity loss. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 383–390, 2007.

[82] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[83] Lingchen Sun, Rongyuan Wu, Zhiyuan Ma, Shuaizheng Liu, Qiaosi Yi, and Lei Zhang. Pixel-level and semantic-level adjustable super-resolution: A dual-lora approach. arXiv preprint arXiv:2412.03017, 2024.

[84] Xiaojie Chu, Liangyu Chen, and Wenqing Yu. Nafssr: Stereo image super-resolution using nafnet. In CVPR, pages 1239–1248, 2022.

[85] Fanghua Yu, Jinjin Gu, Zheyuan Li, Jinfan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, and Chao Dong. Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild. In CVPR, pages 25669–25680, 2024.

[86] Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge on single image super-resolution: Dataset and study. In CVPR workshops, pages 126–135, 2017.

[87] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In CVPR workshops, pages 136–144, 2017.

[88] Vahid Kazemi and Josephine Sullivan. One millisecond face alignment with an ensemble of regression trees. In CVPR, pages 1867–1874, 2014.

[89] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 35: 25278–25294, 2022.

[90] Yiwen Wang, Ying Liang, Yuxuan Zhang, Xinning Chai, Zhengxue Cheng, Yinsheng Qin, Yucai Yang, Rong Xie, and Li Song. Enhanced Semantic Extraction and Guidance for UGC Image Super Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.

[91] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.

[92] Yawei Li, Kai Zhang, Jingyun Liang, Jiezhang Cao, Ce Liu, Rui Gong, Yulun Zhang, Hao Tang, Yun Liu, Denis Demandolx, et al. Lsdir: A large scale dataset for image restoration. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 1775–1787, 2023.

[93] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[94] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4195–4205, 2023.

[95] Jiahao Wang, Ning Kang, Lewei Yao, Mengzhao Chen, Chengyue Wu, Songyang Zhang, Shuchen Xue, Yong Liu, Taiqiang Wu, Xihui Liu, et al. LiT: Delving into a Simplified Linear Diffusion Transformer for Image Generation. arXiv preprint arXiv:2501.12976, 2025.

[96] Zongsheng Yue, Jianyi Wang, and Chen Change Loy. Efficient diffusion model for image restoration by residual shifting. PAMI, 2024.

[97] Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. Wider face: A face detection benchmark. In CVPR, pages 5525–5533, 2016.

[98] Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Recovering realistic texture in image super-resolution by deep spatial feature transform. In CVPR, pages 606–615, 2018.

[99] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, pages 4401–4410, 2019.

[100] Yufei Wang, Wenhan Yang, Xinyuan Chen, Yaohui Wang, Lanqing Guo, Lap-Pui Chau, Ziwei Liu, Yu Qiao, Alex C Kot, and Bihan Wen. Sinsr: diffusion-based image super-resolution in a single step. In CVPR, pages 25796–25805, 2024.

[101] Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, and Lei Zhang. Seesr: Towards semantics-aware real-world image super-resolution. In CVPR, pages 25456–25467, 2024.

[102] Lichen Bai, Shitong Shao, Zikai Zhou, Zipeng Qi, Zhiqiang Xu, Haoyi Xiong, and Zeke Xie. Zigzag Diffusion Sampling: The Path to Success Is Zigzag. arXiv preprint arXiv:2412.10891, 2024.

[103] Hang Guo, Yong Guo, Yaohua Zha, Yulun Zhang, Wenbo Li, Tao Dai, Shu-Tao Xia, and Yawei Li. MambaIRv2: Attentive State Space Restoration. arXiv preprint arXiv:2411.15269, 2024.

[104] Yuxuan Jiang, Ho Man Kwan, Tianhao Peng, Ge Gao, Fan Zhang, Xiaoqing Zhu, Joel Sole, and David Bull. HIIF: Hierarchical Encoding based Implicit Image Function for Continuous Super-resolution. arXiv preprint arXiv:2412.03748, 2024.

[105] Jakub Nawała, Yuxuan Jiang, Fan Zhang, Xiaoqing Zhu, Joel Sole, and David Bull. BVI-AOM: A New Training Dataset for Deep Video Compression Optimization. In VCIP, pages 1–5. IEEE, 2024.

[106] Yuxuan Jiang, Chengxi Zeng, Siyue Teng, Fan Zhang, Xiaoqing Zhu, Joel Sole, and David Bull. C2D-ISR: Optimizing Attention-based Image Super-resolution from Continuous to Discrete Scales. arXiv preprint arXiv:2503.13740, 2025.

[107] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
