WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool

Zizun Li¹,², Jianjun Zhou²,³,⁴, Yifan Wang², Haoyu Guo², Wenzheng Chang², Yang Zhou², Haoyi Zhu¹,², Junyi Chen², Chunhua Shen⁴, Tong He²,³,†

¹University of Science and Technology of China, ²Shanghai AI Lab, ³SII, ⁴Zhejiang University


Abstract

We present WinT3R, a feed-forward reconstruction model that predicts precise camera poses and high-quality point maps online. Previous methods suffer from a trade-off between reconstruction quality and real-time performance. To address this, we first introduce a sliding-window mechanism that ensures sufficient information exchange among the frames within a window, improving the quality of geometric predictions without heavy computation. In addition, we use a compact representation of cameras and maintain a global camera token pool, which makes camera pose estimation more reliable without sacrificing efficiency. These designs enable WinT3R to achieve state-of-the-art online reconstruction quality, camera pose estimation, and reconstruction speed, as validated by extensive experiments on diverse datasets.

Method

Given an image stream, WinT3R processes the input images in a sliding-window manner, with adjacent windows overlapping by half the window size. During online reconstruction, the model generates extremely compact camera tokens that serve as global information about past frames. Subsequent windows can then draw on these global cues to produce more accurate reconstructions.
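The half-overlap windowing described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name `stream_windows`, the requirement of an even window size, and the handling of a short trailing window are all assumptions made for the example.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def stream_windows(frames: Iterable[T], window_size: int) -> Iterator[List[T]]:
    """Yield windows from a frame stream; adjacent windows share half their frames.

    Hypothetical helper for illustration only.
    """
    if window_size < 2 or window_size % 2 != 0:
        raise ValueError("this sketch assumes an even window size >= 2")
    stride = window_size // 2
    buffer: Deque[T] = deque()
    unseen = 0  # frames in the buffer that no emitted window has covered yet
    for frame in frames:
        buffer.append(frame)
        unseen += 1
        if len(buffer) == window_size:
            yield list(buffer)
            # Keep the newer half so the next window overlaps by window_size / 2.
            for _ in range(stride):
                buffer.popleft()
            unseen = 0
    # Assumption: leftover frames at the end of the stream form a shorter window.
    if unseen:
        yield list(buffer)


if __name__ == "__main__":
    for window in stream_windows(range(8), window_size=4):
        print(window)
```

With eight frames and a window size of 4, the sketch yields the windows `[0, 1, 2, 3]`, `[2, 3, 4, 5]`, and `[4, 5, 6, 7]`, so every frame after the first half-window is seen twice.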

We now detail the reconstruction process within a single window. All images first pass through a frame-wise ViT encoder, which outputs image tokens, and camera tokens are appended to these image tokens. All tokens in the window are then fed jointly into a decoder, where they interact with state tokens. Finally, the image tokens output by the decoder are sent to a lightweight convolutional head that predicts local point maps. Meanwhile, the camera tokens of the window, together with those already in the camera token pool, are fed into a camera head that predicts camera parameters; the window's camera tokens are also added to the pool.
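The bookkeeping around the camera token pool might look roughly like the sketch below. It covers only the data flow of the last step (camera head reads window tokens plus pool tokens, then the window tokens join the pool). The names `CameraTokenPool`, `process_window`, and `dummy_head` are hypothetical, tokens are plain lists of floats rather than learned features, and keying the pool by frame index so that frames shared by overlapping windows are stored once is an assumption, not something the text above specifies.

```python
from typing import Callable, Dict, List, Sequence

Token = List[float]  # stand-in for a compact learned camera token
CameraHead = Callable[[List[Token], List[Token]], list]


class CameraTokenPool:
    """Global store of camera tokens from all frames processed so far."""

    def __init__(self) -> None:
        self._tokens: Dict[int, Token] = {}

    def __len__(self) -> int:
        return len(self._tokens)

    def context(self) -> List[Token]:
        # Return the stored tokens in frame order.
        return [self._tokens[i] for i in sorted(self._tokens)]

    def add(self, frame_ids: Sequence[int], tokens: Sequence[Token]) -> None:
        # Assumption: a frame seen again in an overlapping window overwrites
        # its earlier token instead of being stored twice.
        for frame_id, token in zip(frame_ids, tokens):
            self._tokens[frame_id] = token


def process_window(
    frame_ids: Sequence[int],
    camera_tokens: Sequence[Token],
    pool: CameraTokenPool,
    camera_head: CameraHead,
) -> list:
    """Predict camera parameters for one window, then update the pool."""
    # The camera head sees this window's tokens together with the pool's tokens.
    predictions = camera_head(list(camera_tokens), pool.context())
    pool.add(frame_ids, camera_tokens)
    return predictions


def dummy_head(window_tokens: List[Token], pool_tokens: List[Token]) -> list:
    # Placeholder head: reports how much global context each frame could use.
    return [len(pool_tokens)] * len(window_tokens)


if __name__ == "__main__":
    pool = CameraTokenPool()
    for ids in ([0, 1, 2, 3], [2, 3, 4, 5]):
        tokens = [[float(i)] for i in ids]
        print(process_window(ids, tokens, pool, dummy_head), len(pool))
```

Because each frame contributes only one small token, the pool grows slowly with sequence length, which is consistent with the stated goal of using global context without sacrificing efficiency.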

Qualitative Visualization

[Interactive viewer: colored point clouds of the 3D reconstructions.]

Qualitative Comparison

Qualitative comparison of 3D reconstruction. Compared with other online methods, WinT3R achieves higher reconstruction accuracy while also reconstructing faster.

Qualitative comparison of in-the-wild multi-view 3D reconstruction. We demonstrate reconstruction results on in-the-wild sequences across indoor, outdoor, and object-level scenes. Our method consistently achieves the most photorealistic reconstruction results.

BibTeX

@misc{li2025wint3rwindowbasedstreamingreconstruction,
      title={WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool}, 
      author={Zizun Li and Jianjun Zhou and Yifan Wang and Haoyu Guo and Wenzheng Chang and Yang Zhou and Haoyi Zhu and Junyi Chen and Chunhua Shen and Tong He},
      year={2025},
      eprint={2509.05296},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.05296}, 
}