TC-Light: Temporally Consistent Relighting for Dynamic Long Videos

Yang Liu 1,2, Chuanchen Luo 3, Zimo Tang 6, Yingyan Li 1,2, Yuran Yang 5,
Yuanyong Ning 5, Lue Fan 1,2, Junran Peng 4 ♣, Zhaoxiang Zhang 1,2 ♣
1 NLPR, MAIS, Institute of Automation, Chinese Academy of Sciences  2 University of Chinese Academy of Sciences  3 Shandong University  4 University of Science and Technology Beijing  5 Tencent  6 Huazhong University of Science and Technology
♣ Corresponding Author

Abstract

Editing illumination in long videos with complex dynamics has significant value in various downstream tasks, including visual content creation and manipulation, as well as scaling up data for embodied AI through sim2real and real2real transfer. Nevertheless, existing video relighting techniques are predominantly limited to portrait videos or are bottlenecked by temporal consistency and computational efficiency. In this paper, we propose TC-Light, a novel paradigm characterized by a two-stage post-optimization mechanism. Starting from a video preliminarily relit by an inflated video relighting model, the first stage optimizes an appearance embedding to align global illumination. The second stage then optimizes the proposed canonical video representation, the Unique Video Tensor (UVT), to align fine-grained texture and lighting. To comprehensively evaluate performance, we also establish a long and highly dynamic video benchmark. Extensive experiments show that our method produces physically plausible relighting results with superior temporal coherence at low computation cost.

Method

TC-Light overview. Given the source video and a text prompt p, the model tokenizes the input latents in the xy plane and the yt plane separately, and the predicted noises are combined for denoising. The output then undergoes two-stage optimization: the first stage aligns exposure by optimizing an appearance embedding, and the second stage aligns detailed texture and illumination by optimizing the Unique Video Tensor, a compressed representation of the video, as sketched below. Please refer to the paper for more details.
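To make the two-stage post-optimization concrete, here is a minimal PyTorch-style sketch, not the released implementation. The function names (stage1_align_exposure, stage2_optimize_uvt), the per-frame gain/offset standing in for the appearance embedding, the flow_index correspondence map, and all loss weights and step counts are illustrative assumptions; the actual objectives in the paper rely on optical-flow-based alignment and a more elaborate UVT construction.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of TC-Light's two-stage post-optimization.
# `relit` is the preliminarily relit video from the inflated relighting
# model; both it and `aligned` are (T, C, H, W) tensors in [0, 1].

def stage1_align_exposure(relit, steps=200, lr=1e-2):
    """Stage 1: optimize a per-frame appearance embedding (here a simple
    per-frame gain/offset, an assumption for illustration) so that the
    global illumination/exposure is consistent across frames."""
    T = relit.shape[0]
    gain = torch.ones(T, 1, 1, 1, requires_grad=True)
    bias = torch.zeros(T, 1, 1, 1, requires_grad=True)
    opt = torch.optim.Adam([gain, bias], lr=lr)
    for _ in range(steps):
        adjusted = relit * gain + bias
        # Temporal smoothness plus fidelity to the relit output; the real
        # objective uses flow-warped correspondences rather than adjacent frames.
        loss = F.l1_loss(adjusted[1:], adjusted[:-1]) + \
               0.1 * F.l1_loss(adjusted, relit)
        opt.zero_grad(); loss.backward(); opt.step()
    return (relit * gain + bias).detach().clamp(0, 1)

def stage2_optimize_uvt(aligned, flow_index, steps=500, lr=5e-3):
    """Stage 2: optimize a compressed canonical representation (Unique
    Video Tensor). Here it is mocked as a table of unique colors indexed
    by `flow_index`, a (T, H, W) LongTensor mapping every pixel to its
    canonical entry (e.g., derived from optical-flow correspondences)."""
    T, C, H, W = aligned.shape
    n_unique = int(flow_index.max()) + 1
    uvt = torch.rand(n_unique, C, requires_grad=True)
    opt = torch.optim.Adam([uvt], lr=lr)
    target = aligned.permute(0, 2, 3, 1).reshape(-1, C)
    idx = flow_index.reshape(-1)
    for _ in range(steps):
        recon = uvt[idx]                 # decode the compressed video
        loss = F.l1_loss(recon, target)  # align texture and lighting
        opt.zero_grad(); loss.backward(); opt.step()
    return uvt[idx].detach().reshape(T, H, W, C).permute(0, 3, 1, 2)
```

Because the UVT stores one entry per canonical pixel rather than one per frame-and-pixel, the second stage updates far fewer parameters than a per-frame refinement would, which is consistent with the low computation cost reported above.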

Quantitative Comparison


Quantitative comparison results. "OOM" means the method is unable to finish the task due to an out-of-memory error. For a fair comparison, the base models of VidToMe and Slicedit are replaced with IC-Light. Ours-light applies the post-optimization to VidToMe, while Ours-full further introduces decayed multi-axis denoising. Experiments are conducted on a 40GB A100 GPU. The best and second-best results for each metric are highlighted in red and blue, respectively. Please refer to the paper for more details.

Qualitative Comparisons

Related Work

[1] Alhaija H A, Alvarez J, Bala M, et al. Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control. arXiv preprint, 2025.

[2] Zhang L, Rao A, Agrawala M. Scaling In-the-Wild Training for Diffusion-based Illumination Harmonization and Editing by Imposing Consistent Light Transport (IC-Light). The Thirteenth International Conference on Learning Representations (ICLR), 2025.

[3] Li X, Ma C, Yang X, et al. VidToMe: Video Token Merging for Zero-Shot Video Editing. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[4] Cohen N, Kulikov V, Kleiner M, et al. Slicedit: Zero-Shot Video Editing with Text-to-Image Diffusion Models Using Spatio-Temporal Slices. Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.