TC-Light: Temporally Consistent Relighting for Dynamic Long Videos

Yang Liu 1,2, Chuanchen Luo 3, Zimo Tang 6, Yingyan Li 1,2, Yuran Yang 5,
Yuanyong Ning 5, Lue Fan 1,2, Junran Peng 4 ♣, Zhaoxiang Zhang 1,2 ♣
1 NLPR, MAIS, Institute of Automation, Chinese Academy of Sciences  2 University of Chinese Academy of Sciences  3 Shandong University  4 University of Science and Technology Beijing  5 Tencent  6 Huazhong University of Science and Technology
♣ Corresponding Author

Abstract

Editing illumination in long videos with complex dynamics has significant value in various downstream tasks, including visual content creation and manipulation, as well as scaling up data for embodied AI through sim2real and real2real transfer. Nevertheless, existing video relighting techniques are predominantly limited to portrait videos or are bottlenecked by temporal consistency and computational efficiency. In this paper, we propose TC-Light, a novel paradigm characterized by a two-stage post-optimization mechanism. Starting from a video preliminarily relit by an inflated video relighting model, the first stage optimizes an appearance embedding to align global illumination. The second stage then optimizes the proposed canonical video representation, the Unique Video Tensor (UVT), to align fine-grained texture and lighting. To comprehensively evaluate performance, we also establish a long and highly dynamic video benchmark. Extensive experiments show that our method produces physically plausible relighting results with superior temporal coherence at low computation cost.

Method

TC-Light overview. Given the source video and a text prompt p, the model tokenizes the input latents in the xy plane and the yt plane separately, and the predicted noises are combined for denoising. The output then undergoes two-stage optimization: the first stage aligns exposure by optimizing an appearance embedding, and the second stage aligns detailed texture and illumination by optimizing the Unique Video Tensor, a compressed representation of the video, as sketched below. Please refer to the paper for more details.
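To make the two-stage post-optimization concrete, here is a minimal PyTorch-style sketch, not the released implementation. The function names (stage1_align_exposure, stage2_optimize_uvt), the per-frame gain/offset standing in for the appearance embedding, the flow_index correspondence map, and all loss weights and step counts are illustrative assumptions; the actual objectives in the paper rely on optical-flow-based alignment and a more elaborate UVT construction.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of TC-Light's two-stage post-optimization.
# `relit` is the preliminarily relit video from the inflated relighting
# model; both it and `aligned` are (T, C, H, W) tensors in [0, 1].

def stage1_align_exposure(relit, steps=200, lr=1e-2):
    """Stage 1: optimize a per-frame appearance embedding (here a simple
    per-frame gain/offset, an assumption for illustration) so that the
    global illumination/exposure is consistent across frames."""
    T = relit.shape[0]
    gain = torch.ones(T, 1, 1, 1, requires_grad=True)
    bias = torch.zeros(T, 1, 1, 1, requires_grad=True)
    opt = torch.optim.Adam([gain, bias], lr=lr)
    for _ in range(steps):
        adjusted = relit * gain + bias
        # Temporal smoothness plus fidelity to the relit output; the real
        # objective uses flow-warped correspondences rather than adjacent frames.
        loss = F.l1_loss(adjusted[1:], adjusted[:-1]) + \
               0.1 * F.l1_loss(adjusted, relit)
        opt.zero_grad(); loss.backward(); opt.step()
    return (relit * gain + bias).detach().clamp(0, 1)

def stage2_optimize_uvt(aligned, flow_index, steps=500, lr=5e-3):
    """Stage 2: optimize a compressed canonical representation (Unique
    Video Tensor). Here it is mocked as a table of unique colors indexed
    by `flow_index`, a (T, H, W) LongTensor mapping every pixel to its
    canonical entry (e.g., derived from optical-flow correspondences)."""
    T, C, H, W = aligned.shape
    n_unique = int(flow_index.max()) + 1
    uvt = torch.rand(n_unique, C, requires_grad=True)
    opt = torch.optim.Adam([uvt], lr=lr)
    target = aligned.permute(0, 2, 3, 1).reshape(-1, C)
    idx = flow_index.reshape(-1)
    for _ in range(steps):
        recon = uvt[idx]                 # decode the compressed video
        loss = F.l1_loss(recon, target)  # align texture and lighting
        opt.zero_grad(); loss.backward(); opt.step()
    return uvt[idx].detach().reshape(T, H, W, C).permute(0, 3, 1, 2)
```

Because the UVT stores one entry per canonical pixel rather than one per frame-and-pixel, the second stage updates far fewer parameters than a per-frame refinement would, which is consistent with the low computation cost reported above.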

Quantitative Comparison


Quantitative comparison results. "OOM" means the method is unable to finish the task due to an out-of-memory error. For a fair comparison, the base models of VidToMe and Slicedit are replaced with IC-Light. Ours-light applies the post-optimization to VidToMe, while Ours-full further introduces decayed multi-axis denoising. Experiments are conducted on a 40GB A100 GPU. The best and second-best results for each metric are highlighted in red and blue, respectively. Please refer to the paper for more details.

Qualitative Comparisons

Related Work

[1] Alhaija H A, Alvarez J, Bala M, et al. Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control. arXiv preprint, 2025.

[2] Zhang L, Rao A, Agrawala M. Scaling In-the-Wild Training for Diffusion-based Illumination Harmonization and Editing by Imposing Consistent Light Transport (IC-Light). The Thirteenth International Conference on Learning Representations (ICLR), 2025.

[3] Li X, Ma C, Yang X, et al. VidToMe: Video Token Merging for Zero-Shot Video Editing. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[4] Cohen N, Kulikov V, Kleiner M, et al. Slicedit: Zero-Shot Video Editing with Text-to-Image Diffusion Models Using Spatio-Temporal Slices. Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.