We present WinT3R, a feed-forward reconstruction model capable of online prediction of precise camera poses and high-quality point maps. Previous methods suffer from a trade-off between reconstruction quality and real-time performance. To address this, we first introduce a sliding window mechanism that ensures sufficient information exchange among frames within the window, thereby improving the quality of geometric predictions without heavy computation. In addition, we leverage a compact representation of cameras and maintain a global camera token pool, which enhances the reliability of camera pose estimation without sacrificing efficiency. These designs enable WinT3R to achieve state-of-the-art performance in online reconstruction quality, camera pose estimation, and reconstruction speed, as validated by extensive experiments on diverse datasets.
Given an image stream, WinT3R processes input images in a sliding-window manner, where adjacent windows overlap by half of the window size. Our model generates extremely compact camera tokens during online reconstruction to serve as global information for historical frames. This enables the reconstructions of subsequent windows to leverage these global cues for more accurate results.
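The windowing scheme described above can be sketched as follows. This is only an illustration of half-overlapping windows over frame indices, not the released implementation: the function name `sliding_windows`, the requirement that the window size be even, and the handling of a shorter final window are all assumptions made for the example.

```python
from typing import Iterator, List


def sliding_windows(num_frames: int, window_size: int) -> Iterator[List[int]]:
    """Yield frame indices for windows that overlap by half the window size.

    Assumptions (not from the paper): window_size is even, and the last
    window is simply truncated if the stream ends early.
    """
    if window_size < 2 or window_size % 2 != 0:
        raise ValueError("window_size must be an even integer >= 2")
    stride = window_size // 2  # adjacent windows share half of their frames
    start = 0
    while start < num_frames:
        yield list(range(start, min(start + window_size, num_frames)))
        if start + window_size >= num_frames:
            break  # this window already reaches the end of the stream
        start += stride


if __name__ == "__main__":
    for window in sliding_windows(num_frames=8, window_size=4):
        print(window)
```

With a window size of 4, each new window reuses the last two frames of the previous one, so every frame except those at the stream boundaries is seen in two windows.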
We detail the reconstruction process within a single window. All images first pass through a frame-wise ViT encoder, which outputs image tokens. Camera tokens are then appended to these image tokens, and all tokens within the window are fed jointly into a decoder, where they interact with state tokens. Finally, the image tokens output by the decoder are sent to a lightweight convolutional head to predict local point maps. In parallel, the window's camera tokens, together with those already in the camera token pool, are fed into a camera head to predict camera parameters, and the window's camera tokens are added to the camera token pool at the same time.
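The bookkeeping around the camera token pool can be sketched in a few lines. This is a toy illustration, not the model's code, and it rests on several assumptions: one camera token per frame, a pool keyed by frame index so that frames shared by overlapping windows are stored once, and a camera head that returns one output per input token. The names `CameraTokenPool`, `run_camera_head`, and `toy_camera_head` are hypothetical, and the toy head is a stand-in for the learned network.

```python
from typing import Callable, Dict, List, Sequence

Token = List[float]


class CameraTokenPool:
    """Stores one compact camera token per frame (assumed layout)."""

    def __init__(self) -> None:
        self._tokens: Dict[int, Token] = {}

    def add(self, frame_id: int, token: Sequence[float]) -> None:
        # Re-adding a frame from an overlapping window overwrites its token.
        self._tokens[frame_id] = list(token)

    def items(self):
        return sorted(self._tokens.items())

    def __len__(self) -> int:
        return len(self._tokens)


def run_camera_head(
    window_tokens: Dict[int, Token],
    pool: CameraTokenPool,
    camera_head: Callable[[List[Token]], List[float]],
) -> Dict[int, float]:
    """Feed the window's camera tokens plus the pooled ones to the head."""
    context = dict(pool.items())
    context.update(window_tokens)  # current window takes precedence
    frame_ids = sorted(context)
    outputs = camera_head([context[i] for i in frame_ids])
    for frame_id, token in window_tokens.items():
        pool.add(frame_id, token)  # make these tokens visible to later windows
    return dict(zip(frame_ids, outputs))


def toy_camera_head(tokens: List[Token]) -> List[float]:
    # Placeholder for the learned camera head: one scalar per token.
    return [sum(t) / len(t) for t in tokens]


if __name__ == "__main__":
    pool = CameraTokenPool()
    print(run_camera_head({0: [1.0, 3.0], 1: [2.0, 4.0]}, pool, toy_camera_head))
    print(run_camera_head({1: [2.0, 6.0], 2: [0.0, 2.0]}, pool, toy_camera_head))
```

Because each token is small, the pool can grow with the stream while the camera head still sees every earlier frame, which is the role the text assigns to the camera tokens as global information.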
Qualitative comparison of 3D reconstruction. Compared with other online methods, WinT3R achieves higher reconstruction accuracy while also reconstructing faster.
Qualitative comparison of in-the-wild multi-view 3D reconstruction. We demonstrate reconstruction results on in-the-wild sequences across indoor, outdoor, and object-level scenes. Our method consistently achieves the most photorealistic reconstruction results.
@misc{li2025wint3rwindowbasedstreamingreconstruction,
title={WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool},
author={Zizun Li and Jianjun Zhou and Yifan Wang and Haoyu Guo and Wenzheng Chang and Yang Zhou and Haoyi Zhu and Junyi Chen and Chunhua Shen and Tong He},
year={2025},
eprint={2509.05296},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.05296},
}