Editing illumination in long videos with complex dynamics has significant value for downstream tasks, including visual content creation and manipulation, as well as data scaling for embodied AI through sim2real and real2real transfer. Nevertheless, existing video relighting techniques are predominantly limited to portrait videos or hit bottlenecks in temporal consistency and computational efficiency. In this paper, we propose TC-Light, a novel paradigm characterized by a two-stage post-optimization mechanism. Starting from a video preliminarily relit by an inflated video relighting model, the first stage optimizes an appearance embedding to align global illumination. The second stage then optimizes the proposed canonical video representation, the Unique Video Tensor (UVT), to align fine-grained texture and lighting. To comprehensively evaluate performance, we also establish a long and highly dynamic video benchmark. Extensive experiments show that our method produces physically plausible relighting results with superior temporal coherence at low computational cost.
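As a rough illustration of this pipeline, the sketch below outlines the two optimization stages in PyTorch. It is a hypothetical simplification, not the paper's implementation: the per-channel affine appearance correction, the shared-canonical-plus-residual stand-in for UVT, and all loss terms and weights are assumptions made for exposition.

```python
# Hypothetical sketch of the two-stage post-optimization, assuming a PyTorch-style
# interface. The appearance model and the simplified canonical representation below
# are illustrative placeholders rather than TC-Light's actual formulation.
import torch

def post_optimize(relit, steps1=200, steps2=500, lr=1e-2):
    """relit: (T, C, H, W) float tensor in [0, 1], the preliminarily relit video."""
    T, C, H, W = relit.shape

    # ---- Stage 1: appearance embedding for global illumination alignment ----
    # Assumed form: a per-frame, per-channel affine correction (gain + bias).
    gain = torch.ones(T, C, 1, 1, requires_grad=True)
    bias = torch.zeros(T, C, 1, 1, requires_grad=True)
    opt1 = torch.optim.Adam([gain, bias], lr=lr)
    for _ in range(steps1):
        aligned = relit * gain + bias
        frame_stats = aligned.mean(dim=(2, 3))  # (T, C) global color statistics per frame
        # Pull each frame's statistics toward the sequence mean, with a fidelity term
        # that keeps the corrected video close to the input relit video.
        loss1 = ((frame_stats - frame_stats.mean(0, keepdim=True)) ** 2).mean() \
                + 0.1 * ((aligned - relit) ** 2).mean()
        opt1.zero_grad(); loss1.backward(); opt1.step()
    stage1 = (relit * gain + bias).detach().clamp(0, 1)

    # ---- Stage 2: canonical video representation for fine-grained alignment ----
    # Assumed form: one shared canonical frame plus small per-frame residuals; the
    # real UVT is a more compact canonical tensor exploiting pixel correspondences.
    canonical = stage1.mean(dim=0, keepdim=True).clone().requires_grad_(True)
    residual = torch.zeros(T, C, H, W, requires_grad=True)
    opt2 = torch.optim.Adam([canonical, residual], lr=lr)
    for _ in range(steps2):
        recon = canonical + residual
        # Match the stage-1 appearance while penalizing residuals for temporal coherence.
        loss2 = ((recon - stage1) ** 2).mean() + 1e-2 * residual.abs().mean()
        opt2.zero_grad(); loss2.backward(); opt2.step()
    return (canonical + residual).detach().clamp(0, 1)
```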
Quantitative comparison results. "OOM" indicates that the method cannot finish the task due to an out-of-memory error. For a fair comparison, the base models of VidToMe [3] and Slicedit [4] are replaced with IC-Light [2]. Ours-light applies post-optimization to VidToMe, while Ours-full further introduces decayed multi-axis denoising. Experiments are conducted on a 40 GB A100 GPU. The best and second-best results for each metric are highlighted in red and blue, respectively. Please refer to the paper for more details.
[1] Alhaija H A, Alvarez J, Bala M, et al. Cosmos-transfer1: Conditional world generation with adaptive multimodal control. arXiv preprint, 2025.
[2] Zhang L, Rao A, Agrawala M. Scaling In-the-Wild Training for Diffusion-based Illumination Harmonization and Editing by Imposing Consistent Light Transport. The Thirteenth International Conference on Learning Representations, 2025.
[3] Li X, Ma C, Yang X, et al. VidToMe: Video Token Merging for Zero-Shot Video Editing. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[4] Yu Z, Sattler T, Geiger A. Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices. Proceedings of the 41st International Conference on Machine Learning, 2024.