UniviewVLA: A Unified Multiview Vision-Language-Action Model
with World Modeling

Tao Xu1,2, Runhao Zhang2, Zhijian Huang4, Jiayi Guan4, Jiaxin Wang2, Yifan Ding2, Yong-Lu Li2,3, Long Chen4, Guang Chen1,2,†, Jinghui Lu4,†
1Tongji University, 2Shanghai Innovation Institute, 3Shanghai Jiao Tong University, 4Xiaomi EV
†Corresponding Authors

UniviewVLA pipeline.

UniviewVLA represents language instructions, multiview observations, and actions as discrete tokens that a single unified Transformer predicts autoregressively. The pipeline consists of two training stages followed by dynamic inference.

(1) Multiview world model post-training. UniviewVLA takes language instructions together with the standard agent-view and wrist-view observations as input, and autoregressively generates future multiview images that capture both cross-view information and world evolution.

(2) Action fine-tuning. UniviewVLA first predicts compact motion-informative tokens, which avoids the high latency of generating full auxiliary-view tokens, and then predicts FAST action tokens.

(3) Dynamic inference. In the figure, transparent cameras denote generated auxiliary views, and the green camera denotes the selected auxiliary view. Rather than using a fixed viewpoint, UniviewVLA periodically re-selects the best auxiliary view for action prediction as inference progresses.
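The dynamic-inference step can be read as picking, at regular intervals, the generated view under which the model is least uncertain about its actions. The sketch below is illustrative only. The source does not state the selection rule, so it ASSUMES that the most action-informative view is the one whose action-token distributions have the lowest mean Shannon entropy. The names `mean_action_entropy`, `select_view`, and `should_reselect`, the logits layout, and the re-selection period are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one action-token position."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_action_entropy(step_logits: Sequence[Sequence[float]]) -> float:
    """Mean Shannon entropy (in nats) across action-token positions."""
    entropies = []
    for logits in step_logits:
        probs = softmax(logits)
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0.0))
    return sum(entropies) / len(entropies)


def select_view(logits_by_view: Dict[str, Sequence[Sequence[float]]]) -> str:
    """Return the view name with the lowest mean action entropy.

    ASSUMPTION: "most action-informative" is modeled as "lowest entropy";
    the actual criterion used by UniviewVLA is not given in the source.
    """
    return min(
        logits_by_view,
        key=lambda name: mean_action_entropy(logits_by_view[name]),
    )


def should_reselect(step: int, period: int) -> bool:
    """Periodic re-selection: True every `period` steps (hypothetical)."""
    return step % period == 0


if __name__ == "__main__":
    # Toy action-token logits conditioned on two generated auxiliary views.
    candidates = {
        "view_a": [[4.0, 0.0, 0.0], [3.0, 0.0, 0.0]],  # peaked distributions
        "view_b": [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]],  # nearly uniform
    }
    for step in range(6):
        if should_reselect(step, period=3):
            print(step, select_view(candidates))
```

In this toy input `view_a` is selected, because its distributions are more peaked and therefore have lower entropy. Because the rule only reads the model's own output distribution, it needs no additional training, which is consistent with the method being described as training-free.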

Abstract

Occlusion remains a bottleneck in robot manipulation. Existing solutions either deploy additional physical cameras, which requires identical camera setups at training and inference time, or rely on explicit 3D reconstruction with high computational cost. Moreover, both approaches rely on standard agent-view and wrist-view observations and fail to capture occlusion information and future scene evolution. To this end, we propose UniviewVLA, a unified multiview Vision-Language-Action model with world modeling, which infers multiview scene evolution for action prediction from only the standard two-camera observations. We demonstrate that, by leveraging future multiview images generated by the world model, UniviewVLA reveals occluded cues and models future scene evolution, improving action prediction without extra hardware or explicit reconstruction. In addition, to accelerate inference while preserving prediction accuracy, UniviewVLA introduces Motion-Informative Token Compression, which compresses each generated view from 625 to 16 tokens and reduces per-view latency from 6–7 s to 0.2–0.3 s. UniviewVLA also proposes training-free Action-Entropy View Selection, which dynamically identifies the most action-informative view at different inference stages. Extensive experiments show that UniviewVLA achieves 95.8% on LIBERO and 4.60 on CALVIN ABCD→D, both standard occlusion-free benchmarks. On customized occlusion-focused tasks, it improves the success rate from 40.0% to 73.3% and the average real-robot success rate by 33.4 points, demonstrating stronger performance under occlusion without sacrificing performance on standard occlusion-free benchmarks.
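The efficiency and accuracy figures quoted in the abstract can be turned into ratios with a few lines of arithmetic. The sketch below uses only the numbers stated above; the helper names are hypothetical, and pairing the ends of the two latency ranges to get a lower and upper bound on the speedup is an assumption about how to read the ranges.

```python
from typing import Tuple


def compression_ratio(full_tokens: int, compressed_tokens: int) -> float:
    """How many times fewer tokens are generated per auxiliary view."""
    return full_tokens / compressed_tokens


def speedup_range(
    before_s: Tuple[float, float], after_s: Tuple[float, float]
) -> Tuple[float, float]:
    """Lower and upper bound on speedup from two (min, max) latency ranges.

    ASSUMPTION: the bounds pair the fastest 'before' with the slowest
    'after' (lower bound) and the slowest 'before' with the fastest
    'after' (upper bound).
    """
    lower = before_s[0] / after_s[1]
    upper = before_s[1] / after_s[0]
    return lower, upper


def absolute_gain(before_pct: float, after_pct: float) -> float:
    """Success-rate improvement in percentage points."""
    return after_pct - before_pct


if __name__ == "__main__":
    ratio = compression_ratio(625, 16)
    low, high = speedup_range((6.0, 7.0), (0.2, 0.3))
    gain = absolute_gain(40.0, 73.3)
    print(f"tokens per view: {ratio:.1f}x fewer")
    print(f"per-view latency: {low:.0f}x to {high:.0f}x faster")
    print(f"occlusion-focused simulation tasks: +{gain:.1f} points")
```

Under this reading, each generated view uses about 39 times fewer tokens and is produced roughly 20 to 35 times faster, while the simulated occlusion-focused success rate rises by about 33.3 points.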

Customized Occlusion-Focused Simulation Tasks

To further evaluate the importance of multiview information, we construct six customized occlusion-focused tasks that hide action-critical cues from the standard viewpoints. Specifically, we follow the LIBERO BDDL format to design occluded manipulation scenes and collect multiview demonstrations with a hardware-in-the-loop teleoperation setup based on a 3D SpaceMouse. Each task contains 60 demonstrations. The table below lists the corresponding language instructions, and the videos show UniviewVLA rollouts.

ID  Task           Language Instruction
1   Scene1 Bowl    open the bottom drawer of the cabinet and put the bowl in it
2   Scene2 Bowl    put the black bowl on the left plate
3   Scene4 Wine    pick up the wine bottle at the back and put it on the wine rack
4   Scene4 Drawer  put the black bowl at the left in the bottom drawer of the cabinet and close it
5   Scene7 Mug     put the yellow and white mug on the plate
6   Scene8 Moka    turn off the stove and put the moka pot on top of the cabinet

Task 1: Scene1 Bowl

Task 2: Scene2 Bowl

Task 3: Scene4 Wine

Task 4: Scene4 Drawer

Task 5: Scene7 Mug

Task 6: Scene8 Moka

Six customized occlusion-focused tasks. Each task hides action-critical state cues from the default agent-view camera while preserving the same two deployed physical observations.

Occlusion-Focused Real-Robot Tasks

For real-robot evaluation, we design right-arm-only occlusion tasks on a Mobile ALOHA dual-arm platform. The unused left-wrist camera is repurposed as an additional side-view camera by placing it at a task-specific viewpoint using an extension cable. We collect 100 demonstrations for each task. The real-robot multiview setup is shown below, and the videos show UniviewVLA rollouts.

Oreo-to-Plate

Occluded-Doll Move

Real-robot multiview setup. The unused left-wrist camera is repurposed as a side-view camera via an extension cable, providing additional occlusion-revealing observations.

Standard Occlusion-Free Benchmark Tasks

For standard simulation benchmarks, we obtain multiview supervision by replaying recorded demonstrations with additional workspace cameras in the simulator. LIBERO uses 60°, 120°, 240°, 300°, and overhead views, as shown below. Since the rear tabletop side in CALVIN is mostly redundant or weakly informative, we use more widely spaced views at 30°, 216°, 288°, and overhead.

LIBERO multiview camera setup.
CALVIN multiview camera setup.
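To make the camera layouts concrete, the sketch below places the additional workspace cameras on a ring around the scene at the azimuth angles listed above, plus one overhead camera. Only the angles come from the source. The 0° reference direction, the counterclockwise convention, the ring radius, the camera heights, and the workspace center are all ASSUMPTIONS chosen for illustration, and the function names are hypothetical rather than part of any simulator API.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]

# Azimuth angles stated in the text (the overhead view is handled separately).
LIBERO_AZIMUTHS_DEG = (60, 120, 240, 300)
CALVIN_AZIMUTHS_DEG = (30, 216, 288)


def ring_camera_positions(
    azimuths_deg: Sequence[float],
    center: Vec3 = (0.0, 0.0, 0.0),
    radius: float = 1.0,   # hypothetical ring radius in metres
    height: float = 0.6,   # hypothetical camera height above the center
) -> Dict[float, Vec3]:
    """Camera positions on a horizontal ring around the workspace center.

    ASSUMPTION: azimuth is measured counterclockwise about the vertical
    axis, starting from the +x direction.
    """
    cx, cy, cz = center
    positions = {}
    for az in azimuths_deg:
        theta = math.radians(az)
        positions[az] = (
            cx + radius * math.cos(theta),
            cy + radius * math.sin(theta),
            cz + height,
        )
    return positions


def overhead_position(center: Vec3 = (0.0, 0.0, 0.0), height: float = 1.5) -> Vec3:
    """Camera directly above the workspace center (hypothetical height)."""
    cx, cy, cz = center
    return (cx, cy, cz + height)


def angular_gaps(azimuths_deg: Sequence[float]) -> List[float]:
    """Angular gap from each camera to the next one around the ring."""
    ordered = sorted(a % 360 for a in azimuths_deg)
    count = len(ordered)
    return [(ordered[(i + 1) % count] - ordered[i]) % 360 for i in range(count)]


if __name__ == "__main__":
    for name, azimuths in (("LIBERO", LIBERO_AZIMUTHS_DEG), ("CALVIN", CALVIN_AZIMUTHS_DEG)):
        print(name, "gaps (deg):", angular_gaps(azimuths))
        for az, pos in ring_camera_positions(azimuths).items():
            print(f"  {az:>3} deg -> ({pos[0]:+.3f}, {pos[1]:+.3f}, {pos[2]:+.3f})")
        print("  overhead ->", overhead_position())
```

The `angular_gaps` helper makes the difference between the two layouts explicit: the LIBERO ring alternates 60° and 120° gaps, while the CALVIN ring has gaps of 186°, 72°, and 102°.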