VGGT-X

When VGGT Meets Dense Novel View Synthesis

Yang Liu 1,2,6, Chuanchen Luo 4,6, Zimo Tang 3,
Junran Peng 5,6 ♣, Zhaoxiang Zhang 1,2 ♣
1 NLPR, MAIS, Institute of Automation, Chinese Academy of Sciences  2 University of Chinese Academy of Sciences  3 Huazhong University of Science and Technology  4 Shandong University  5 University of Science and Technology Beijing  6 Linketic
♣ Corresponding Authors
Teaser

TL;DR: We present VGGT-X to explore what happens when a 3D Foundation Model is applied to dense Novel View Synthesis (NVS). It incorporates a memory-efficient VGGT implementation, an adaptive global alignment for VGGT output enhancement, and robust 3DGS training practices. VGGT-- here denotes our memory-efficient VGGT variant, obtained by eliminating redundant intermediate features.
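
For intuition, the memory savings mostly come from generic inference practices: no autograd graph, mixed precision, moving head outputs off the GPU immediately, and freeing shared features as soon as they are consumed. The sketch below is a minimal illustration under these assumptions; predict_poses_and_points, backbone, camera_head, and point_head are hypothetical placeholders rather than the actual VGGT-X API, and the real implementation differs in detail.

import torch

@torch.no_grad()  # inference only: no autograd graph, so activations are freed eagerly
def predict_poses_and_points(frames, backbone, camera_head, point_head, head_chunk=32):
    # frames: (N, 3, H, W) tensor on the CPU; backbone / camera_head / point_head
    # are hypothetical stand-ins for the model parts, not the real VGGT modules.
    with torch.autocast("cuda", dtype=torch.bfloat16):    # mixed precision shrinks activation memory
        feats = backbone(frames.cuda())                    # cross-frame attention sees all frames together
    poses = camera_head(feats).float().cpu()               # push outputs off the GPU immediately

    points = []
    for i in range(0, feats.shape[0], head_chunk):         # the dense point head is per-frame, so chunk it
        points.append(point_head(feats[i:i + head_chunk]).float().cpu())
    del feats                                              # drop the shared features before stacking results
    torch.cuda.empty_cache()
    return poses, torch.cat(points)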

Abstract

We study the problem of applying 3D Foundation Models (3DFMs) to dense Novel View Synthesis (NVS). Despite significant progress in NVS powered by NeRF and 3DGS, current approaches remain reliant on accurate 3D attributes (e.g., camera poses and point clouds) acquired from Structure-from-Motion (SfM), which is often slow and fragile in low-texture or low-overlap captures. Recent 3DFMs showcase orders-of-magnitude speedups over the traditional pipeline and great potential for online NVS. However, most of their validation and conclusions are confined to sparse-view settings. Our study reveals that naively scaling 3DFMs to dense views encounters two fundamental barriers: a dramatically increasing VRAM burden and imperfect outputs that degrade initialization-sensitive 3D training. To address these barriers, we introduce VGGT-X, which incorporates a memory-efficient VGGT implementation that scales to 1,000+ images, an adaptive global alignment for VGGT output enhancement, and robust 3DGS training practices. Extensive experiments show that these measures substantially close the fidelity gap with COLMAP-initialized pipelines, achieving state-of-the-art results in dense COLMAP-free NVS and pose estimation. Additionally, we analyze the causes of the remaining gap with COLMAP-initialized rendering, providing insights for the future development of 3D foundation models and dense NVS.

Method

VGGT-X takes dense multi-view images as input. It first uses our memory-efficient VGGT implementation, which we name VGGT--, to losslessly predict key 3D attributes. A fast global alignment module then refines the predicted camera poses and point clouds. Finally, a robust joint pose and 3DGS training pipeline produces high-fidelity novel view synthesis.
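
To give a rough sense of what a global alignment stage can look like, the sketch below jointly optimizes a per-view SE(3) correction and scale so that camera-frame point maps, transformed to world coordinates, agree with a shared world-frame point map under confidence weighting. It is a minimal sketch under assumed inputs and names (global_alignment, cam_points, conf, world_points); the paper's adaptive global alignment is not claimed to be exactly this formulation.

import torch

def global_alignment(cam_points, conf, init_R, init_t, world_points, iters=300, lr=1e-2):
    # cam_points:    (V, N, 3) point maps in each camera's frame
    # conf:          (V, N)    per-point confidences
    # init_R/init_t: (V, 3, 3) / (V, 3) initial camera-to-world rotations / translations
    # world_points:  (V, N, 3) world-frame points for the same pixels
    V = cam_points.shape[0]
    rot_delta = torch.zeros(V, 3, requires_grad=True)   # axis-angle pose correction
    t_delta = torch.zeros(V, 3, requires_grad=True)     # translation correction
    log_s = torch.zeros(V, 1, requires_grad=True)       # per-view log-scale correction
    opt = torch.optim.Adam([rot_delta, t_delta, log_s], lr=lr)
    zeros = torch.zeros(V)

    for _ in range(iters):
        opt.zero_grad()
        # skew-symmetric matrix of the axis-angle update, exponentiated to a rotation
        K = torch.stack([
            torch.stack([zeros, -rot_delta[:, 2], rot_delta[:, 1]], dim=-1),
            torch.stack([rot_delta[:, 2], zeros, -rot_delta[:, 0]], dim=-1),
            torch.stack([-rot_delta[:, 1], rot_delta[:, 0], zeros], dim=-1),
        ], dim=-2)
        R = torch.matrix_exp(K) @ init_R
        t = init_t + t_delta
        # map camera-frame points to world coordinates with the corrected poses and scales
        pred = log_s.exp()[..., None] * torch.einsum('vij,vnj->vni', R, cam_points) + t[:, None]
        loss = (conf * (pred - world_points).norm(dim=-1)).mean()   # confidence-weighted consistency
        loss.backward()
        opt.step()
    return R.detach(), t.detach(), log_s.exp().detach()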

Qualitative Results

Rendering

Trajectories

Quantitative Results

Rendering

VGGT-- here denotes our memory-efficient VGGT implementation. GA here denotes Global Alignment. The ablation and bad-case analysis in our paper also show a performance discrepancy between the training and test sets, indicating an overfitting problem and a dilemma in joint pose and 3DGS optimization.
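
To make the joint pose and 3DGS optimization point concrete, the following sketch shows a single training step in which a small learnable SE(3) correction to a noisy camera pose is optimized together with the Gaussian parameters. The renderer call, the photometric L1 loss, and the split into two optimizers are assumptions chosen for illustration, not the paper's exact training recipe.

import torch

def joint_pose_gs_step(render, gaussians, gt_image, R_init, t_init,
                       rot_delta, t_delta, gs_opt, pose_opt):
    # render:              hypothetical differentiable Gaussian rasterizer
    # gaussians:           learnable 3DGS parameters held by gs_opt
    # rot_delta, t_delta:  learnable (3,) pose corrections held by pose_opt
    # apply a small SE(3) correction to the initial (possibly noisy) pose
    zero = torch.zeros(())
    K = torch.stack([
        torch.stack([zero, -rot_delta[2], rot_delta[1]]),
        torch.stack([rot_delta[2], zero, -rot_delta[0]]),
        torch.stack([-rot_delta[1], rot_delta[0], zero]),
    ])
    R = torch.matrix_exp(K) @ R_init
    t = t_init + t_delta

    image = render(gaussians, R, t)             # differentiable w.r.t. both pose and Gaussians
    loss = (image - gt_image).abs().mean()      # photometric L1; the actual loss may differ

    gs_opt.zero_grad()
    pose_opt.zero_grad()
    loss.backward()
    gs_opt.step()                               # Gaussians use the usual 3DGS learning rates
    pose_opt.step()                             # pose corrections typically use a much smaller LR
    return loss.item()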

Pose Estimation

VGGT-- here denotes our memory-efficient VGGT implementation. GA here denotes Global Alignment. Here, we rectified the order-dependence problem of AUC@30, enabling a more stable and objective evaluation. Note that no single 3DFM demonstrates consistent superiority across all datasets, indicating that generalization remains an unresolved challenge.
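
One plausible way to remove the order dependence, sketched below as an assumption rather than the paper's exact protocol, is to score relative rotation errors over all camera pairs instead of against a single reference frame and then average recall over thresholds up to 30 degrees.

import numpy as np

def pairwise_rotation_auc_at_30(R_pred, R_gt, thresholds=np.arange(1, 31)):
    # R_pred, R_gt: (N, 3, 3) predicted / ground-truth camera rotations.
    # Rotation part only; a translation-direction error can be combined analogously.
    errs = []
    n = R_pred.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            rel_pred = R_pred[i].T @ R_pred[j]        # relative rotation is reference-free
            rel_gt = R_gt[i].T @ R_gt[j]
            cos = (np.trace(rel_pred.T @ rel_gt) - 1.0) / 2.0
            errs.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    errs = np.asarray(errs)
    # AUC@30 approximated as mean recall over 1..30 degree thresholds
    return float(np.mean([(errs <= t).mean() for t in thresholds]))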

BibTeX

@misc{liu2025vggtxvggtmeetsdense,
      title={VGGT-X: When VGGT Meets Dense Novel View Synthesis}, 
      author={Yang Liu and Chuanchen Luo and Zimo Tang and Junran Peng and Zhaoxiang Zhang},
      year={2025},
      eprint={2509.25191},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.25191}, 
}
