UniviewVLA: A Unified Multiview Vision-Language-Action Model with World Modeling

Tao Xu¹,², Runhao Zhang², Zhijian Huang⁴, Jiayi Guan⁴, Jiaxin Wang², Yifan Ding², Yong-Lu Li²,³, Long Chen⁴, Guang Chen¹,²,†, Jinghui Lu⁴,†

¹Tongji University, ²Shanghai Innovation Institute, ³Shanghai Jiao Tong University, ⁴Xiaomi EV

†Corresponding authors

Abstract

Occlusion remains a bottleneck in robot manipulation. Existing solutions either deploy additional physical cameras, which requires the same camera setup at training and inference, or rely on explicit 3D reconstruction at high computational cost. Moreover, both approaches build on standard agent-view and wrist-view observations and fail to capture occluded information or future scene evolution. To this end, we propose UniviewVLA, a unified multiview Vision-Language-Action model with world modeling, which infers multiview scene evolution for action prediction from only the standard two-camera observations. We demonstrate that the future multiview observations generated by the world model reveal occluded cues and model future scene evolution, improving action prediction without extra hardware or explicit reconstruction. In addition, to accelerate inference while preserving prediction accuracy, UniviewVLA introduces Motion-Informative Token Compression, which compresses each generated view from 625 to 16 tokens and reduces per-view latency from 6–7 s to 0.2–0.3 s. UniviewVLA also introduces a training-free Action-Entropy View Selection, which dynamically identifies the most action-informative view at different inference stages. Extensive experiments show that UniviewVLA achieves 95.8% on LIBERO and 4.60 on CALVIN ABCD→D, both standard occlusion-free benchmarks. On customized occlusion-focused tasks, it improves the success rate from 40.0% to 73.3% and the average real-robot success rate by 33.4 points, demonstrating stronger performance under occlusion without sacrificing performance on standard occlusion-free benchmarks.
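To make the idea of entropy-based view selection concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the policy emits discrete action-token logits for each candidate generated view and that the view with the lowest mean action entropy counts as the most action-informative; both the criterion and all names (`select_view`, `view_60`, `view_overhead`, the example logits) are hypothetical illustrations.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one action-token distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability bins contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_action_entropy(token_logits: Sequence[Sequence[float]]) -> float:
    """Average entropy across the action tokens predicted for one view."""
    return sum(entropy(softmax(row)) for row in token_logits) / len(token_logits)


def select_view(logits_by_view: Dict[str, Sequence[Sequence[float]]]) -> str:
    """Pick the view with the most peaked action distribution.

    ASSUMPTION: lowest mean entropy is used as the selection criterion;
    the source only states that selection is training-free and entropy-based.
    """
    return min(logits_by_view, key=lambda v: mean_action_entropy(logits_by_view[v]))


# Hypothetical logits: two action tokens with three bins each, per candidate view.
example = {
    "view_60": [[0.1, 0.0, 0.2], [0.0, 0.1, 0.0]],        # near-uniform
    "view_overhead": [[5.0, 0.0, 0.0], [0.0, 6.0, 0.0]],  # sharply peaked
}
chosen = select_view(example)
```

Because the sketch only inspects output distributions, it needs no extra training, which matches the training-free property stated above; how the actual method computes and compares entropies at different inference stages is not specified here.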

Customized Occlusion-Focused Simulation Tasks

To further evaluate the importance of multiview information, we construct six customized occlusion-focused tasks that hide action-critical cues from the standard viewpoints. Specifically, we follow the LIBERO BDDL format to design occluded manipulation scenes and collect multiview demonstrations with a hardware-in-the-loop teleoperation setup based on a 3D SpaceMouse. Each task contains 60 demonstrations. The table below lists the corresponding language instructions, and the videos show UniviewVLA rollouts.

ID	Task	Language Instruction
1	Scene1 Bowl	open the bottom drawer of the cabinet and put the bowl in it
2	Scene2 Bowl	put the black bowl on the left plate
3	Scene4 Wine	pick up the wine bottle at the back and put it on the wine rack
4	Scene4 Drawer	put the black bowl at the left in the bottom drawer of the cabinet and close it
5	Scene7 Mug	put the yellow and white mug on the plate
6	Scene8 Moka	turn off the stove and put the moka pot on top of the cabinet

Task 1: Scene1 Bowl

Task 2: Scene2 Bowl

Task 3: Scene4 Wine

Task 4: Scene4 Drawer

Task 5: Scene7 Mug

Task 6: Scene8 Moka

Six customized occlusion-focused tasks. Each task hides action-critical state cues from the default agent-view camera while preserving the same two deployed physical observations.

Occlusion-Focused Real-Robot Tasks

For real-robot evaluation, we design right-arm-only occlusion tasks on a Mobile ALOHA dual-arm platform. The unused left-wrist camera is repurposed as an additional side-view camera by placing it at a task-specific viewpoint using an extension cable. We collect 100 demonstrations for each task. The real-robot multiview setup is shown below, and the videos show UniviewVLA rollouts.

Oreo-to-Plate

Occluded-Doll Move

Real-robot multiview. The unused left-wrist camera is repurposed as a side-view camera via an extension cable, providing additional occlusion-revealing observations.

Standard Occlusion-Free Benchmark Tasks

For standard simulation benchmarks, we obtain multiview supervision by replaying recorded demonstrations with additional workspace cameras in the simulator. LIBERO uses 60°, 120°, 240°, 300°, and overhead views, as shown below. Because views from the rear side of the CALVIN tabletop are mostly redundant or weakly informative, we use more widely spaced views at 30°, 216°, 288°, and overhead.

LIBERO multiview.

CALVIN multiview.
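The azimuth lists above can be turned into camera placements with a few lines of geometry. This is a sketch under stated assumptions: the angles come from the text, but the ring radius, camera height, overhead height, axis convention (azimuth measured counter-clockwise in the horizontal x-y plane, z up), and the helper names `ring_cameras` and `look_at_direction` are hypothetical and not taken from the benchmarks or the authors' setup.

```python
import math
from typing import Dict, Sequence, Tuple

Vec3 = Tuple[float, float, float]

# Azimuths in degrees as listed in the text; the overhead view is added separately.
LIBERO_AZIMUTHS = [60, 120, 240, 300]
CALVIN_AZIMUTHS = [30, 216, 288]


def ring_cameras(
    azimuths_deg: Sequence[float],
    center: Vec3,
    radius: float,           # ASSUMED horizontal distance from the workspace center
    height: float,           # ASSUMED height of the side cameras above the center
    overhead_height: float,  # ASSUMED height of the overhead camera
) -> Dict[str, Vec3]:
    """Place side cameras on a horizontal ring plus one camera directly above."""
    cx, cy, cz = center
    cams: Dict[str, Vec3] = {}
    for az in azimuths_deg:
        theta = math.radians(az)
        cams[f"view_{az}"] = (
            cx + radius * math.cos(theta),
            cy + radius * math.sin(theta),
            cz + height,
        )
    cams["overhead"] = (cx, cy, cz + overhead_height)
    return cams


def look_at_direction(camera: Vec3, target: Vec3) -> Vec3:
    """Unit vector pointing from the camera position toward the target."""
    d = tuple(t - c for c, t in zip(camera, target))
    norm = math.sqrt(sum(x * x for x in d))
    return (d[0] / norm, d[1] / norm, d[2] / norm)


center = (0.0, 0.0, 0.0)
libero_cams = ring_cameras(LIBERO_AZIMUTHS, center, radius=1.0, height=0.5, overhead_height=1.5)
calvin_cams = ring_cameras(CALVIN_AZIMUTHS, center, radius=1.0, height=0.5, overhead_height=1.5)
directions = {name: look_at_direction(pos, center) for name, pos in libero_cams.items()}
```

Each returned position, together with its look-at direction, would still need to be converted into whatever camera pose format the simulator expects before replaying demonstrations.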
