ATHENA: Accelerated Multi-Task Heterogeneous Influence Functions
for Robot Data Curation

Tao Xu1,2, Jiaxin Wang2,3, Runhao Zhang2, Jiayi Guan1, Xianchao Zeng2, Weixi Song2, Xinyu Zhou2, Zhetao Chen2, Guang Chen1,2, Yong-Lu Li2,4,†
1Tongji University, 2Shanghai Innovation Institute, 3Xi'an Jiaotong University, 4Shanghai Jiao Tong University
†Corresponding Author
ATHENA pipeline. The training dataset and closed-loop rollouts from VLA evaluation are fed into an efficient multitask influence computation module, which scores and ranks demonstrations by importance to guide high-quality data curation for VLA fine-tuning.

Abstract

In robot imitation learning, influence functions provide a principled approach to quantify each demonstration's effect on robot task outcomes, yet scaling them to billion-parameter Vision-Language-Action (VLA) models is limited by computational and multitask bottlenecks. To this end, we propose ATHENA, an influence function framework tailored for multitask VLA data curation at billion-parameter scale. Concretely, it leverages the Kronecker structure of linear-layer gradients to reduce projection cost, and approximates dense Hessian inversion with a rank-$r$ Random Truncated Approximation, achieving about a 313.4x speedup in influence computation. Furthermore, ATHENA formulates global and local interactive influence to balance data curation across 50 jointly trained tasks. Extensive evaluations on RoboTwin 2.0 and real-robot deployment, covering 9.34 and 6.90 hours of demonstrations, respectively, show that ATHENA matches or exceeds full-data joint fine-tuning using only 50% of demonstrations in simulation and 66.7% of data across six real-robot tasks. Overall, ATHENA demonstrates its effectiveness for data curation in billion-parameter multitask VLA fine-tuning.
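The abstract does not spell out how the Kronecker structure is used, so the following is only a minimal sketch of the general identity that such methods typically rely on, not ATHENA's actual implementation. For a linear layer y = Wx, the per-sample gradient with respect to W is the outer product of the output gradient and the input, so a Kronecker-structured projection can be applied to the two small factors separately instead of to the full flattened gradient. All names, sizes, and the choice of Gaussian projection matrices (`P_out`, `P_in`, `project_dense`, `project_factored`) are illustrative assumptions.

```python
import numpy as np


def project_dense(grad_W, P_out, P_in):
    """Reference path: materialize the full Kronecker projector and apply it.

    Cost grows with (k_out * k_in) * (d_out * d_in), which is what makes
    this impractical for large layers.
    """
    P = np.kron(P_out, P_in)          # shape (k_out * k_in, d_out * d_in)
    return P @ grad_W.reshape(-1)     # row-major flattening of the gradient


def project_factored(delta, x, P_out, P_in):
    """Factored path: project each gradient factor, then take the outer product.

    Uses (P_out ⊗ P_in)(delta ⊗ x) = (P_out delta) ⊗ (P_in x), so the full
    gradient and the full projector are never formed.
    """
    return np.outer(P_out @ delta, P_in @ x).reshape(-1)


rng = np.random.default_rng(0)
d_in, d_out = 64, 48                  # layer widths (illustrative)
k_in, k_out = 8, 6                    # projected widths (illustrative)

x = rng.standard_normal(d_in)         # layer input for one sample
delta = rng.standard_normal(d_out)    # dLoss/d(output) for the same sample
grad_W = np.outer(delta, x)           # per-sample gradient, shape (d_out, d_in)

# Hypothetical random projection factors (an assumption, not from the paper).
P_out = rng.standard_normal((k_out, d_out)) / np.sqrt(k_out)
P_in = rng.standard_normal((k_in, d_in)) / np.sqrt(k_in)

dense = project_dense(grad_W, P_out, P_in)
factored = project_factored(delta, x, P_out, P_in)
print("max abs difference:", np.max(np.abs(dense - factored)))
```

In this sketch the factored path needs only two small matrix-vector products and one small outer product per sample, while the dense path needs a projector whose size scales with the full weight matrix. The speedup reported in the abstract also depends on the low-rank Hessian approximation, which is not shown here.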

Video

Real Robot ALOHA Demonstration

We evaluate ATHENA on six real-robot ALOHA tasks spanning three difficulty levels (simple, medium, and challenging). The videos below compare the performance of ATHENA, Oracle, and the Full-Data baseline. The percentages in parentheses indicate the proportion of demonstration data retained for fine-tuning. Notably, both ATHENA and Oracle use exactly 66.7% of the original data.

Task 1: Pick Fruits (simple)

ATHENA (66.7%, Ours)

Oracle (66.7%)

Full-Data (100%)


Task 2: Wipe Board (simple)

ATHENA (66.7%, Ours)

Oracle (66.7%)

Full-Data (100%)


Task 3: Stack Bowls (medium)

ATHENA (66.7%, Ours)

Oracle (66.7%)

Full-Data (100%)


Task 4: Box Return (medium)

ATHENA (66.7%, Ours)

Oracle (66.7%)

Full-Data (100%)


Task 5: Seal Stamping (challenging)

ATHENA (66.7%, Ours)

Oracle (66.7%)

Full-Data (100%)


Task 6: Shelf Retrieval (challenging)

ATHENA (66.7%, Ours)

Oracle (66.7%)

Full-Data (100%)