<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Proceedings of Machine Learning Research</title>
    <description>Proceedings of The 9th Conference on Robot Learning
  Held in Seoul, Korea on 27-30 September 2025

Published as Volume 305 by the Proceedings of Machine Learning Research on 07 October 2025.

Volume Edited by:
  Joseph Lim
  Shuran Song
  Hae-Won Park

Series Editors:
  Neil D. Lawrence
</description>
    <link>https://proceedings.mlr.press/v305/</link>
    <atom:link href="https://proceedings.mlr.press/v305/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Tue, 14 Oct 2025 10:21:48 +0000</pubDate>
    <lastBuildDate>Tue, 14 Oct 2025 10:21:48 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>GraspQP: Differentiable Optimization of Force Closure for Diverse and Robust Dexterous Grasping</title>
        <description>Dexterous robotic hands enable versatile interactions through the flexibility and adaptability of a multi-finger setup, allowing for a wide range of task-specific grasp configurations in diverse environments. However, access to diverse and high-quality grasp data is essential to fully exploit the capabilities of dexterous hands, be it to train grasp prediction models from point clouds, train manipulation policies, or to support high-level task planning with a broader range of action options. Existing approaches for dataset generation rely on sampling-based algorithms or simplified force-closure analysis, which tend to converge to power grasps and often exhibit limited diversity. In this work, we propose a method to synthesize large-scale, diverse, and physically feasible grasps that additionally go beyond simple power grasps to more refined manipulation, such as pinches or tri-finger precision grasps. We introduce a rigorous differentiable energy formulation of force closure, implicitly defined through a Quadratic Program (QP). In addition, we present an adjusted optimization method (MALA*) that improves performance by dynamically rejecting gradient steps based on the global sample distribution. We extensively evaluate our approach and demonstrate significant improvements in both grasp diversity and the stability of final grasp predictions. Finally, we provide a new, large-scale grasp dataset for the 5,700 objects from DexGraspNet, consisting of five different grippers and three different grasp types.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zurbrugg25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zurbrugg25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Embrace Contacts: humanoid shadowing with full body ground contacts</title>
        <description>Previous humanoid robot research works treat the robot as a bipedal mobile manipulation platform, where only the feet and hands contact the environment. However, we humans use all body parts to interact with the world, e.g., we sit in chairs, get up from the ground, or roll on the floor. Contacting the environment using body parts other than feet and hands brings significant challenges in both model-predictive control and reinforcement learning-based methods: an unpredictable contact sequence makes it almost impossible for model-predictive control to plan ahead in real time; the success of sim-to-real reinforcement learning for humanoids heavily depends on the acceleration of the rigid-body physical simulator and the simplification of collision detection. On the other hand, the scarcity of humanoid data with extreme torso movement makes all other components non-trivial to design, such as dataset distribution, motion commands, and task rewards. To address these challenges, we propose a general humanoid motion framework that takes discrete motion commands and controls the robot’s motor actions in real time. Using a GPU-accelerated simulator, we train a humanoid whole-body control policy that follows the high-level motion command in the real world in real time, even with stochastic contacts, extremely large robot base rotations, and not-so-feasible motion commands.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhuang25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhuang25b.html</guid>
        
        
      </item>
    
      <item>
        <title>MirrorDuo: Reflection-Consistent Visuomotor Learning from Mirrored Demonstration Pairs</title>
        <description>Image-based behaviour cloning leverages demonstrations captured from ubiquitous RGB cameras, enabling impressive visuomotor performance. However, it remains constrained by the cost of collecting sufficiently diverse demonstrations, especially for generalizing across workspace variations. We propose MirrorDuo, a mirroring-based formulation that operates on image, proprioception, and full 6-DoF end-effector action tuples, generating a mirrored counterpart for each original demonstration, effectively achieving “collect one, get one for free.&quot; It can be applied as a data augmentation strategy for existing learning pipelines, such as standard behaviour cloning or diffusion policy, or as a structural prior for reflection-equivariant policy networks. By leveraging the overlap between the original and mirrored domains, MirrorDuo achieves significantly improved performance under the same data budget when demonstrations are evenly distributed across both sides of the workspace. When demonstrations are confined to one side, MirrorDuo enables efficient skill transfer to the mirrored workspace with zero or as few as 5 demonstrations in the target arrangement.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhuang25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhuang25a.html</guid>
        
        
      </item>
    
      <item>
        <title>LaVA-Man: Learning Visual Action Representations for Robot Manipulation</title>
        <description>Visual-textual understanding is essential for language-guided robot manipulation. Recent works leverage pre-trained vision-language models to measure the similarity between encoded visual observations and textual instructions, and then train a model to map this similarity to robot actions. However, this two-step approach limits the model’s ability to capture the relationship between visual observations and textual instructions, leading to reduced precision in manipulation tasks. We propose to learn visual-textual associations through a self-supervised pretext task: reconstructing a masked goal image conditioned on an input image and textual instructions. This formulation allows the model to learn visual-action representations without robot action supervision. The learned representations can then be fine-tuned for manipulation tasks with only a few demonstrations. We also introduce the Omni-Object Pick-and-Place dataset, which consists of annotated robot tabletop manipulation episodes, including 180 object classes and 3,200 instances with corresponding textual instructions. This dataset enables the model to acquire diverse object priors and allows for a more comprehensive evaluation of its generalisation capability across object instances. Experimental results on five benchmarks, including both simulated and real-robot validations, demonstrate that our method outperforms prior art.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhu25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhu25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Versatile Loco-Manipulation through Flexible Interlimb Coordination</title>
        <description>The ability to flexibly leverage limbs for loco-manipulation is essential for enabling autonomous robots to operate in unstructured environments. Yet, prior work on loco-manipulation is often constrained to specific tasks or predetermined limb configurations. In this work, we present Reinforcement Learning for Interlimb Coordination (ReLIC), an approach that enables versatile loco-manipulation through flexible interlimb coordination. The key to our approach is an adaptive controller that seamlessly bridges the execution of manipulation motions and the generation of stable gaits based on task demands. Through the interplay between two controller modules, ReLIC dynamically assigns each limb for manipulation or locomotion and robustly coordinates them to achieve task success. Using efficient reinforcement learning in simulation, ReLIC learns to perform stable gaits in accordance with the manipulation goals in the real world. To solve diverse and complex tasks, we further propose to interface the learned controller with different types of task specifications, including target trajectories, contact points, and natural language instructions. Evaluated on 12 real-world tasks that require diverse and complex coordination patterns, ReLIC demonstrates its versatility and robustness by achieving a success rate of 78.9% on average.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhu25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhu25a.html</guid>
        
        
      </item>
    
      <item>
        <title>AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World</title>
        <description>Scalable and reproducible policy evaluation has been a long-standing challenge in robot learning: evaluations are critical to assess progress and build better policies, but evaluation in the real world, especially at a scale that would provide statistically reliable results, is costly in terms of human time and hard to obtain. Evaluation of increasingly generalist robot policies requires an increasingly diverse repertoire of evaluation environments, making the evaluation bottleneck even more pronounced. To make real-world evaluation of robotic policies more practical, we propose AutoEval, a system to autonomously evaluate generalist robot policies around the clock with minimal human intervention. Users interact with AutoEval by submitting evaluation jobs to the AutoEval queue, much like how software jobs are submitted with a cluster scheduling system, and AutoEval will schedule the policies for evaluation within a framework supplying automatic success detection and automatic scene resets. We show that AutoEval can nearly fully eliminate human involvement in the evaluation process, permitting around-the-clock evaluations, and the evaluation results correspond closely to ground truth evaluations conducted by hand. To facilitate the evaluation of generalist policies in the robotics community, we provide public access to multiple AutoEval scenes in the popular BridgeData robot setup with WidowX robot arms. In the future, we hope that AutoEval scenes can be set up across institutions to form a diverse and distributed evaluation network.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhou25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhou25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Learning a Unified Policy for Position and Force Control in Legged Loco-Manipulation</title>
        <description>Robotic loco-manipulation tasks often involve contact-rich interactions with the environment, requiring the joint modeling of contact force and robot position. However, recent visuomotor policies often focus solely on position or force control, overlooking their integration. In this work, we propose a unified policy for legged robots that jointly models force and position control, learned without reliance on force sensors. By simulating diverse combinations of active position and force commands alongside external disturbance forces, we use reinforcement learning to learn a policy that estimates forces from the robot’s historical states and compensates for them through position and velocity adjustments. Such a policy enables a wide range of manipulation behaviors under varying combinations of force and position inputs, including position tracking, force application, force tracking, and compliant robot behaviors. Additionally, we demonstrate that the learned policy enhances trajectory-based imitation learning pipelines by incorporating essential contact information through its force estimation module, achieving approximately 39.5% higher success rates across four challenging contact-rich manipulation tasks compared to position-control policies. Extensive experiments on both a quadrupedal mobile manipulation platform and a humanoid validate the versatility and robustness of the proposed policy across diverse scenarios.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhi25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhi25a.html</guid>
        
        
      </item>
    
      <item>
        <title>FLARE: Robot Learning with Implicit World Modeling</title>
        <description>We introduce **F**uture **LA**tent **R**epresentation Alignm**E**nt (**FLARE**), a novel framework that integrates predictive world modeling into robot policy learning. By aligning features from a diffusion transformer with latent embeddings of future observations, **FLARE** enables a diffusion transformer policy to anticipate latent representations of future observations, allowing it to reason about long-term consequences while generating actions. Remarkably lightweight, **FLARE** requires only minimal architectural modifications—adding a few tokens to standard vision-language-action (VLA) models—yet delivers substantial performance gains. Across two challenging multitask simulation imitation learning benchmarks spanning single-arm and humanoid tabletop manipulation, **FLARE** achieves state-of-the-art performance, outperforming prior policy learning baselines by up to 26%. Moreover, **FLARE** unlocks the ability to co-train with human egocentric video demonstrations lacking action labels, significantly boosting policy generalization to a novel object with unseen geometry with as few as 1 robot demonstration. Our results establish **FLARE** as a general and scalable approach for combining implicit world modeling with high-frequency robotic control.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zheng25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zheng25a.html</guid>
        
        
      </item>
    
      <item>
        <title>FastUMI: A Scalable and Hardware-Independent Universal Manipulation Interface with Dataset</title>
        <description>Real-world manipulation datasets for robotic arms remain scarce due to the high costs, rigid hardware dependencies, and complex setup procedures associated with existing data collection methods. We introduce FastUMI, a redesigned Universal Manipulation Interface (UMI) that addresses these challenges, enabling low-cost, scalable, and rapid deployment across heterogeneous platforms. FastUMI achieves this through: (i) hardware decoupling via extensive mechanical reengineering, which removes dependence on specialized robotic components while preserving a consistent visual perspective; (ii) replacement of complex visual–inertial odometry with a commercial off-the-shelf tracker, simplifying the software stack without compromising pose estimation accuracy; and (iii) the provision of an integrated ecosystem that streamlines data acquisition, automates quality control, and ensures compatibility with both standard and enhanced imitation-learning pipelines. To facilitate further research, we release an open-access dataset comprising over 15,000 real-world demonstrations spanning 24 tasks, constituting one of the most extensive UMI-like resources to date. Empirical evaluations show that FastUMI supports rapid deployment, reduces operational overhead, and delivers robust performance across diverse manipulation scenarios, advancing scalable data-driven robotic learning.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhaxizhuoma25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhaxizhuoma25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Robot Learning from Any Images</title>
        <description>We introduce RoLA, a framework that transforms any in-the-wild image into an interactive, physics-enabled robotic environment. Unlike previous methods, RoLA operates directly on a single image without requiring additional hardware or digital assets. Our framework democratizes robotic data generation by producing massive visuomotor robotic demonstrations within minutes from a wide range of image sources, including camera captures, robotic datasets, and Internet images. At its core, our approach combines a novel method for single-view physical scene recovery with an efficient visual blending strategy for photorealistic data collection. We demonstrate RoLA’s versatility across applications like scalable robotic data generation and augmentation, robot learning from internet images, and single-image real-to-sim-to-real systems for manipulators and humanoids. Video results are available at our project page: https://rola-2025.github.io/.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhao25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhao25c.html</guid>
        
        
      </item>
    
      <item>
        <title>AnyPlace: Learning Generalizable Object Placement for Robot Manipulation</title>
        <description>Object placement in robotic tasks is inherently challenging due to the diversity of object geometries and placement configurations. We address this with AnyPlace, a two-stage method trained entirely on synthetic data, capable of predicting a wide range of feasible placement poses for real-world tasks. Our key insight is that by leveraging a Vision-Language Model (VLM) to identify approximate placement locations, we can focus only on the relevant regions for precise local placement, which enables us to train the low-level placement-pose-prediction model to capture multimodal placements efficiently. For training, we generate a fully synthetic dataset comprising 13 categories of randomly generated objects in 5370 different placement poses across three configurations (insertion, stacking, hanging) and train local placement-prediction models. We extensively evaluate our method in high-fidelity simulation and show that it consistently outperforms baseline approaches across all three tasks in terms of success rate, coverage of placement modes, and precision. In real-world experiments, our method achieves an average success and coverage rate of 76% across three tasks, where most baseline methods fail completely. We further validate the generalization of our approach on 16 real-world placement tasks, demonstrating that models trained purely on synthetic data can be directly transferred to the real world in a zero-shot setting. More at: https://anyplace-pnp.github.io.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhao25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhao25b.html</guid>
        
        
      </item>
    
      <item>
        <title>ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation</title>
        <description>Vision-Language Models (VLMs) have revolutionized artificial intelligence and robotics due to their commonsense reasoning capabilities. In robotic manipulation, VLMs are used primarily as high-level planners, but recent work has also studied their lower-level reasoning ability, which refers to making decisions about precise robot movements. However, the community currently lacks a clear and common benchmark that can evaluate how well VLMs can aid low-level reasoning in robotics. Consequently, we propose a novel benchmark, ManipBench, to evaluate the low-level robot manipulation reasoning capabilities of VLMs across various dimensions, including how well they understand object-object interactions and deformable object manipulation. We extensively test 35 common and state-of-the-art VLM families on our benchmark, including variants to test different model sizes. The performance of VLMs significantly varies across tasks, and there is a strong correlation between this performance and trends in our real-world manipulation tasks. Our results also show that there remains a significant gap between these models and human-level understanding.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhao25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhao25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning</title>
        <description>In this paper, we tackle the problem of learning to play 3v3 multi-drone volleyball, a new embodied competitive task that requires both high-level strategic coordination and low-level agile control. The task is turn-based, multi-agent, and physically grounded, posing significant challenges due to its long-horizon dependencies, tight inter-agent coupling, and the underactuated dynamics of quadrotors. To address this, we propose Hierarchical Co-Self-Play (HCSP), a hierarchical reinforcement learning framework that separates centralized high-level strategic decision-making from decentralized low-level motion control. We design a three-stage population-based training pipeline to enable both strategy and skill to emerge from scratch without expert demonstrations: (I) training diverse low-level skills, (II) learning high-level strategy via self-play with fixed low-level controllers, and (III) joint fine-tuning through co-self-play. Experiments show that HCSP achieves superior performance, outperforming non-hierarchical self-play and rule-based hierarchical baselines with an average 82.9% win rate and a 71.5% win rate against the two-stage variant. Moreover, co-self-play leads to emergent team behaviors such as role switching and coordinated formations, demonstrating the effectiveness of our hierarchical design and training scheme.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhang25n.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhang25n.html</guid>
        
        
      </item>
    
      <item>
        <title>Motion Blender Gaussian Splatting for Dynamic Reconstruction</title>
        <description>Gaussian splatting has emerged as a powerful tool for high-fidelity reconstruction of dynamic scenes. However, existing methods primarily rely on implicit motion representations, such as encoding motions into neural networks or per-Gaussian parameters, which makes it difficult to further manipulate the reconstructed motions. This lack of explicit controllability limits existing methods to replaying recorded motions only, which hinders a wider application. To address this, we propose Motion Blender Gaussian Splatting (MB-GS), a novel framework that uses a motion graph as an explicit and sparse motion representation. The motion of graph links is propagated to individual Gaussians via dual quaternion skinning, with learnable weight painting functions determining the influence of each link. The motion graphs and 3D Gaussians are jointly optimized from input videos via differentiable rendering. Experiments show that MB-GS achieves state-of-the-art performance on the iPhone dataset while being competitive on HyperNeRF. Additionally, we demonstrate the application potential of our method in animating novel object motions, synthesizing robot demonstrations through motion editing, and predicting robot actions through visual planning.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhang25m.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhang25m.html</guid>
        
        
      </item>
    
      <item>
        <title>KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation</title>
        <description>Collecting demonstrations enriched with fine-grained tactile information is critical for dexterous manipulation, particularly in contact-rich tasks that require precise force control and physical interaction. While prior works primarily focus on teleoperation or video-based retargeting, they often suffer from kinematic mismatches and the absence of real-time tactile feedback, hindering the acquisition of high-fidelity tactile data. To mitigate this issue, we propose KineDex, a hand-over-hand kinesthetic teaching paradigm in which the operator’s motion is directly transferred to the dexterous hand, enabling the collection of physically grounded demonstrations enriched with accurate tactile feedback. To resolve occlusions caused by the human hand, we apply an inpainting technique to preprocess the visual observations. Based on these demonstrations, we then train a visuomotor policy using tactile-augmented inputs and implement force control during deployment for precise contact-rich manipulation. We evaluate KineDex on a suite of challenging contact-rich manipulation tasks, including particularly difficult scenarios such as squeezing toothpaste onto a toothbrush, which require precise multi-finger coordination and stable force regulation. Across these tasks, KineDex achieves an average success rate of 74.4%, representing a 57.7% improvement over the variant without force control. Comparative experiments with teleoperation and user studies further validate the advantages of KineDex in data collection efficiency and operability. Specifically, KineDex collects data over twice as fast as teleoperation across two tasks of varying difficulty, while maintaining a near-100% success rate, compared to under 50% for teleoperation.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhang25l.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhang25l.html</guid>
        
        
      </item>
    
      <item>
        <title>Elucidating the Design Space of Torque-aware Vision-Language-Action Models</title>
        <description>Many robotic manipulation tasks require sensing and responding to force signals such as torque to assess whether the task has been successfully completed and to enable closed-loop control. However, current Vision-Language-Action (VLA) models lack the ability to integrate such subtle physical feedback. In this work, we explore Torque-aware VLA models, aiming to bridge this gap by systematically studying the design space for incorporating torque signals into existing VLA architectures. We identify and evaluate several strategies, leading to three key findings. First, introducing torque adapters into the decoder consistently outperforms inserting them into the encoder.  This is because torque signals align more closely with the decoder’s input, and the decoder is more sensitive to variations in input. Second, torque history proves to be a critical signal. We find that the most effective way to incorporate it is by summarizing the entire history into a single token, as this preserves the original input pattern of the decoder. Third, inspired by joint prediction and planning paradigms in autonomous driving, we propose predicting torque as an auxiliary output, which further improves performance. This strategy encourages the model to build a physically grounded internal representation of interaction dynamics. Extensive quantitative and qualitative experiments across contact-rich manipulation benchmarks validate our findings. Code, models, and datasets will be released.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhang25k.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhang25k.html</guid>
        
        
      </item>
    
      <item>
        <title>Motion Priors Reimagined: Adapting Flat-Terrain Skills for Complex Quadruped Mobility</title>
        <description>Reinforcement learning (RL)-based legged locomotion controllers often require meticulous reward tuning to track velocities or goal positions while preserving smooth motion on various terrains.  Motion imitation methods via RL using demonstration data reduce reward engineering but fail to generalize to novel environments.  We address this by proposing a hierarchical RL framework in which a low-level policy is first pre-trained to imitate animal motions on flat ground, thereby establishing motion priors. A subsequent high-level, goal-conditioned policy then builds on these priors, learning residual corrections that enable perceptive locomotion, local obstacle avoidance, and goal-directed navigation across diverse and rugged terrains. Simulation experiments illustrate the effectiveness of learned residuals in adapting to progressively challenging uneven terrains while still preserving the locomotion characteristics provided by the motion priors. Furthermore, our results demonstrate improvements in motion regularization over baseline models trained without motion priors under similar reward setups.  Real-world experiments with an ANYmal-D quadruped robot confirm our policy’s capability to generalize animal-like locomotion skills to complex terrains, demonstrating smooth and efficient locomotion and local navigation performance amidst challenging terrains with obstacles.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhang25j.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhang25j.html</guid>
        
        
      </item>
    
      <item>
        <title>RoboChemist: Long-Horizon and Safety-Compliant Robotic Chemical Experimentation</title>
        <description>Robotic chemists promise to both liberate human experts from repetitive tasks and accelerate scientific discovery, yet remain in their infancy. Chemical experiments involve long-horizon procedures over hazardous and deformable substances, where success requires not only task completion but also strict compliance with experimental norms. To address these challenges, we propose RoboChemist, a dual-loop framework that integrates Vision-Language Models (VLMs) with Vision-Language-Action (VLA) models. Unlike prior VLM-based systems (e.g., VoxPoser, ReKep) that rely on depth perception and struggle with transparent labware, and existing VLA systems (e.g., RDT, $\pi_0$) that lack semantic-level feedback for complex tasks, our method leverages a VLM to serve as (1) a planner to decompose tasks into primitive actions, (2) a visual prompt generator to guide VLA models, and (3) a monitor to assess task success and regulatory compliance. Notably, we introduce a VLA interface that accepts image-based visual targets from the VLM, enabling precise, goal-conditioned control. Our system successfully executes both primitive actions and complete multi-step chemistry protocols. Results show significant improvements in both success rate and compliance rate over state-of-the-art VLM and VLA baselines, while also demonstrating strong generalization to objects and tasks. Code, data, and models will be released.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhang25i.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhang25i.html</guid>
        
        
      </item>
    
      <item>
        <title>Robust Dexterous Grasping of General Objects</title>
        <description>The ability to robustly grasp a variety of objects is essential for dexterous robots. In this paper, we present a framework for zero-shot dynamic dexterous grasping using single-view visual inputs, designed to be resilient to various disturbances. Our approach utilizes a hand-centric object shape representation based on dynamic distance vectors between finger joints and object surfaces. This representation captures the local shape around potential contact regions rather than focusing on detailed global object geometry, thereby enhancing generalization to shape variations and uncertainties. To address perception limitations, we integrate a privileged teacher policy with a mixed curriculum learning approach, allowing the student policy to effectively distill grasping capabilities and explore for adaptation to disturbances. Trained in simulation, our method achieves success rates of 97.0% across 247,786 simulated objects and 94.6% across 512 real objects, demonstrating remarkable generalization. Quantitative and qualitative results validate the robustness of our policy against various disturbances.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhang25h.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhang25h.html</guid>
        
        
      </item>
    
      <item>
        <title>Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-top Manipulation</title>
        <description>Robotic manipulation in unstructured environments requires systems that can generalize across diverse tasks while maintaining robust and reliable performance. We introduce GVF-TAPE, a closed-loop framework that combines generative visual foresight with task-agnostic pose estimation to enable scalable robotic manipulation. GVF-TAPE employs a generative video model to predict future RGB-D frames from a single RGB side-view image and a task description, offering visual plans that guide robot actions. A decoupled pose estimation model then extracts end-effector poses from the predicted frames, translating them into executable commands via low-level controllers. By iteratively integrating video foresight and pose estimation in a closed loop, GVF-TAPE achieves real-time, adaptive manipulation across a broad range of tasks. Extensive experiments in both simulation and real-world settings demonstrate that our approach reduces reliance on task-specific action data and generalizes effectively, providing a practical and scalable solution for intelligent robotic systems.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhang25g.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhang25g.html</guid>
        
        
      </item>
    
      <item>
        <title>AgentWorld: An Interactive Simulation Platform for Scene Construction and Mobile Robotic Manipulation</title>
        <description>We introduce AgentWorld, an interactive simulation platform for developing household mobile manipulation capabilities. Our platform combines automated scene construction that encompasses layout generation, semantic asset placement, visual material configuration, and physics simulation, with a dual-mode teleoperation system supporting both wheeled bases and humanoid locomotion policies for data collection. The resulting AgentWorld Dataset captures diverse tasks ranging from primitive actions (pick-and-place, push-pull, etc.) to multistage activities (serve drinks, heat up food, etc.) across living rooms, bedrooms, and kitchens. Through extensive benchmarking of imitation learning methods including behavior cloning, action chunking transformers, diffusion policies, and vision-language-action models, we demonstrate the dataset’s effectiveness for sim-to-real transfer. The integrated system provides a comprehensive solution for scalable robotic skill acquisition in complex home environments, bridging the gap between simulation-based training and real-world deployment.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhang25f.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhang25f.html</guid>
        
        
      </item>
    
      <item>
        <title>ATK: Automatic Task-driven Keypoint Selection for Robust Policy Learning</title>
        <description>Learning visuomotor policies through imitation learning often suffers from perceptual challenges, where visual differences between training and evaluation environments degrade policy performance. Policies relying on state estimates such as 6D pose require task-specific tracking and are difficult to scale, while raw sensor-based policies may lack robustness to small visual disturbances. In this work, we leverage 2D keypoints — spatially consistent features in the image frame — as a state representation for robust policy learning, and apply it to both sim-to-real transfer and real-world imitation learning. However, the choice of which keypoints to use can vary across objects and tasks. We propose a novel method, ATK, to automatically select keypoints in a task-driven manner, such that the chosen keypoints are predictive of optimal behavior for the given task. Our approach optimizes for a minimal set of task-relevant keypoints that preserve policy performance and robustness. We distill expert data (either from an expert policy in simulation or a human expert) into a policy that operates on RGB images while tracking the selected keypoints. By leveraging pre-trained visual modules, our system effectively tracks keypoints and transfers policies to the real-world evaluation scenario, even under perceptual challenges such as transparent objects, fine-grained manipulation, or widely varying scene appearance. We validate our approach on various robotic tasks, demonstrating that these minimal keypoint representations improve robustness to visual disturbances and environmental variations.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhang25e.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhang25e.html</guid>
        
        
      </item>
    
      <item>
        <title>Sample-Efficient Online Control Policy Learning with Real-Time Recursive Model Updates</title>
        <description>Data-driven control methods need to be sample-efficient and lightweight, especially when data acquisition and computational resources are limited—such as during learning on hardware. Most modern data-driven methods require large datasets and struggle with real-time updates of models, limiting their performance in dynamic environments. Koopman theory formally represents nonlinear systems as linear models over observables, and Koopman representations can be determined from data in an optimization-friendly setting with potentially rapid model updates. In this paper, we present a highly sample-efficient, Koopman-based learning pipeline: Recursive Koopman Learning (RKL). We identify sufficient conditions for model convergence and provide formal algorithmic analysis supporting our claim that RKL is lightweight and fast, with complexity independent of dataset size. We validate our method on a simulated planar two-link arm and a hybrid nonlinear hardware system with soft actuators, showing that real-time recursive Koopman model updates improve the sample efficiency and stability of data-driven controller synthesis—requiring only &lt;10% of the data compared to benchmarks. The high-performance C++ codebase will be open-sourced.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhang25d.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhang25d.html</guid>
        
        
      </item>
    
      <item>
        <title>Robot Operating Home Appliances by Reading User Manuals</title>
        <description>Operating home appliances, among the most common tools in every household, is a critical capability for assistive home robots. This paper presents ApBot, a robot system that operates novel household appliances by “reading” their user manuals. ApBot faces multiple challenges: it must (i) infer goal-conditioned partial policies from unstructured, textual descriptions in a user manual document, (ii) ground the policies to the appliance in the physical world, and (iii) execute the policies reliably over potentially many steps, despite compounding errors. To tackle these challenges, ApBot constructs a structured, symbolic model of an appliance from its manual, with the help of a large vision-language model (VLM). It grounds the symbolic actions visually to control panel elements. Finally, ApBot closes the loop by updating the model based on visual feedback. Our experiments show that across a wide range of simulated and real-world appliances, ApBot achieves consistent and statistically significant improvements in task success rate, compared with state-of-the-art large VLMs used directly as control policies. These results suggest that a structured internal representation plays an important role in robust robot operation of home appliances, especially complex ones.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhang25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhang25c.html</guid>
        
        
      </item>
    
      <item>
        <title>HuB: Learning Extreme Humanoid Balance</title>
        <description>The human body demonstrates exceptional motor capabilities—such as standing steadily on one foot or performing a high kick with the leg raised over 1.5 meters—both requiring precise balance control. While recent research on humanoid control has leveraged reinforcement learning to track human motions for skill acquisition, applying this paradigm to balance-intensive tasks remains challenging. In this work, we identify three key obstacles: instability from reference motion errors, learning difficulties due to morphological mismatch, and the sim-to-real gap caused by sensor noise and unmodeled dynamics. To address these challenges, we propose $\textbf{HuB}$ ($\textbf{Hu}$manoid $\textbf{B}$alance), a unified framework that integrates $\textit{reference motion refinement}$, $\textit{balance-aware policy learning}$, and $\textit{sim-to-real robustness training}$, with each component targeting a specific challenge. We validate our approach on the Unitree G1 humanoid robot across challenging quasi-static balance tasks, including extreme single-legged poses such as $\texttt{Swallow Balance}$ and $\texttt{Bruce Lee’s Kick}$. Our policy remains stable even under strong physical disturbances—such as a forceful soccer strike—while baseline methods consistently fail to complete these tasks.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhang25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhang25b.html</guid>
        
        
      </item>
    
      <item>
        <title>ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations</title>
        <description>We introduce ReWiND, a framework for learning robot manipulation tasks solely from language instructions without per-task demonstrations. Standard reinforcement learning (RL) and imitation learning methods require expert supervision through human-designed reward functions or demonstrations for every new task. In contrast, ReWiND starts from a small demonstration dataset to learn: (1) a data-efficient, language-conditioned reward function that labels the dataset with rewards, and (2) a language-conditioned policy pre-trained with offline RL using these rewards. Given an unseen task variation, ReWiND fine-tunes the pre-trained policy using the learned reward function, requiring minimal online interaction. We show that ReWiND’s reward model generalizes effectively to unseen tasks, outperforming baselines by up to 2.4X in reward generalization and policy alignment metrics. Finally, we demonstrate that ReWiND enables sample-efficient adaptation to new tasks in both simulation and on a real bimanual manipulation platform, taking a step towards scalable, real-world robot learning.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zhang25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zhang25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Decentralized Aerial Manipulation of a Cable-Suspended Load Using Multi-Agent Reinforcement Learning</title>
        <description>This paper presents the first decentralized method to enable real-world 6-DoF manipulation of a cable-suspended load using a team of Micro-Aerial Vehicles (MAVs). Our method leverages multi-agent reinforcement learning (MARL) to train an outer-loop control policy for each MAV. Unlike state-of-the-art controllers that utilize a centralized scheme, our policy does not require global states, inter-MAV communications, nor neighboring MAV information. Instead, agents communicate implicitly through load pose observations alone, which enables high scalability and flexibility. It also significantly reduces computing costs during inference time, enabling onboard deployment of the policy. In addition, we introduce a new action space design for the MAVs using linear acceleration and body rates. This choice, combined with a robust low-level controller, enables reliable sim-to-real transfer despite significant uncertainties caused by cable tension during dynamic 3D motion. We validate our method in various real-world experiments, including full-pose control under load model uncertainties, showing setpoint tracking performance comparable to the state-of-the-art centralized method. We also demonstrate cooperation amongst agents with heterogeneous control policies, and robustness to the complete in-flight loss of one MAV. Videos of experiments:  https://github.com/anonymousCoRL/MDCM_CoRL2025</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/zeng25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/zeng25a.html</guid>
        
        
      </item>
    
      <item>
        <title>TWIST: Teleoperated Whole-Body Imitation System</title>
        <description>Teleoperating humanoid robots in a whole-body manner marks a fundamental step toward developing general-purpose robotic intelligence, with human motion providing an ideal interface for controlling all degrees of freedom. Yet, most current humanoid teleoperation systems fall short of enabling coordinated whole-body behavior, typically limiting themselves to isolated locomotion or manipulation tasks. We present the Teleoperated Whole-Body Imitation System (TWIST), a system for humanoid teleoperation through whole-body motion imitation. We first generate reference motion clips by retargeting human motion capture data to the humanoid robot. We then develop a robust, adaptive, and responsive whole-body controller using a combination of reinforcement learning and behavior cloning (RL+BC). Through systematic analysis, we demonstrate how incorporating privileged future motion frames and real-world motion capture (MoCap) data improves tracking accuracy. TWIST enables real-world humanoid robots to achieve unprecedented, versatile, and coordinated whole-body motor skills—spanning whole-body manipulation, legged manipulation, locomotion, and expressive movement—using a single unified neural network controller.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/ze25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/ze25a.html</guid>
        
        
      </item>
    
      <item>
        <title>BEVCalib: LiDAR-Camera Calibration via Geometry-Guided Bird’s-Eye View Representation</title>
        <description>Accurate LiDAR-camera calibration is the foundation of accurate multimodal fusion environmental perception for autonomous driving and robotic systems. Traditional calibration methods require extensive data collection in controlled environments and cannot compensate for the transformation changes during the vehicle/robot movement. In this paper, we propose the first model that uses bird’s-eye view (BEV) features to perform LiDAR-camera calibration from raw data, termed BEVCalib. To achieve this, we extract camera BEV features and LiDAR BEV features separately and fuse them into a shared BEV feature space. To fully utilize the geometry information from the BEV feature, we introduce a novel feature selector to choose the most important feature in the transformation decoder, which reduces memory consumption and enables efficient training. Extensive evaluations on various datasets demonstrate that BEVCalib establishes a new state of the art, improving the best open-source baseline by two orders of magnitude on KITTI, Nuscenes, and our dynamic extrinsic dataset, and outperforming the best baseline in the literature by 72% on the KITTI dataset and 69% on the Nuscenes dataset. All source code and checkpoints will be released.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/yuan25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/yuan25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Real2Render2Real: Scaling Robot Data Without Dynamics Simulation or Robot Hardware</title>
        <description>Scaling robot learning requires vast and diverse datasets. Yet the prevailing data collection paradigm—human teleoperation—remains costly and constrained by manual effort and physical robot access. We introduce Real2Render2Real (R2R2R), a novel approach for generating robot training data without relying on object dynamics simulation or teleoperation of robot hardware. The input is a smartphone-captured scan of one or more objects and a single video of a human demonstration. R2R2R renders thousands of high visual fidelity robot-agnostic demonstrations by reconstructing detailed 3D object geometry and appearance, and tracking 6-DoF object motion. R2R2R uses 3D Gaussian Splatting (3DGS) to enable flexible asset generation and trajectory synthesis for both rigid and articulated objects, converting these representations to meshes to maintain compatibility with scalable rendering engines like IsaacLab but with collision-modeling turned off. Robot demonstration data generated by R2R2R integrates directly with models that operate on robot proprioceptive states and image observations, such as vision-language-action models (VLA) and imitation learning policies. Physical experiments suggest that models trained on R2R2R data from a single human demonstration can match the performance of models trained on 150 human teleoperation demonstrations.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/yu25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/yu25a.html</guid>
        
        
      </item>
    
      <item>
        <title>KineSoft: Learning Proprioceptive Manipulation Policies with Soft Robot Hands</title>
        <description>Underactuated soft robot hands offer inherent safety and adaptability advantages over rigid systems, but developing dexterous manipulation skills remains challenging. While imitation learning shows promise for complex manipulation tasks, traditional approaches struggle with soft systems due to demonstration collection challenges and ineffective state representations. We present KineSoft, a framework enabling direct kinesthetic teaching of soft robotic hands by leveraging their natural compliance as a skill teaching advantage rather than only as a control challenge. KineSoft makes two key contributions: (1) an internal strain sensing array providing occlusion-free proprioceptive shape estimation, and (2) a shape-based imitation learning framework that uses proprioceptive feedback with a low-level shape-conditioned controller to ground diffusion-based policies. This enables human demonstrators to physically guide the robot while the system learns to associate proprioceptive patterns with successful manipulation strategies. We validate KineSoft through physical experiments, demonstrating superior shape estimation accuracy compared to baseline methods, precise shape-trajectory tracking, and higher task success rates compared to baseline imitation learning approaches. KineSoft’s results demonstrate that embracing the inherent properties of soft robots leads to intuitive and robust dexterous manipulation capabilities.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/yoo25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/yoo25a.html</guid>
        
        
      </item>
    
      <item>
        <title>WoMAP: World Models For Embodied Open-Vocabulary Object Localization</title>
        <description>Active object localization remains a critical challenge for robots, requiring efficient exploration of partially observable environments. However, state-of-the-art robot policies either struggle to generalize beyond demonstration datasets (e.g., imitation learning methods) or fail to generate physically grounded actions (e.g., VLMs). To address these limitations, we introduce WoMAP (World Models for Active Perception): a recipe for training open-vocabulary object localization policies that: (i) uses a Gaussian Splatting-based real-to-sim-to-real pipeline for scalable data generation without the need for expert demonstrations, (ii) distills dense reward signals from open-vocabulary object detectors, and (iii) leverages a latent world model for dynamics and reward prediction to ground high-level action proposals at inference time. Rigorous simulation and hardware experiments demonstrate WoMAP’s superior performance in a wide range of zero-shot object localization tasks, with a 63% success rate compared to a 10% success rate for a VLM baseline, and only a 10-20% drop in performance when directly transferring from sim to real.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/yin25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/yin25b.html</guid>
        
        
      </item>
    
      <item>
        <title>GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation</title>
        <description>In this paper, we propose a training-free framework for vision-and-language navigation (VLN). Existing zero-shot VLN methods are mainly designed for discrete environments or involve unsupervised training in continuous simulator environments, which makes it challenging to generalize and deploy them in real-world scenarios. To achieve a training-free framework in continuous environments, our framework formulates navigation guidance as graph constraint optimization by decomposing instructions into explicit spatial constraints. The constraint-driven paradigm decodes spatial semantics through constraint solving, enabling zero-shot adaptation to unseen environments. Specifically, we construct a spatial constraint library covering all types of spatial relationships mentioned in VLN instructions. The human instruction is decomposed into a directed acyclic graph, with waypoint nodes, object nodes, and edges, which are used as queries to retrieve the library to build the graph constraints. The graph constraint optimization is solved by the constraint solver to determine the positions of waypoints, obtaining the robot’s navigation path and final goal. To handle cases of no solution or multiple solutions, we construct a navigation tree and a backtracking mechanism. Extensive experiments on standard benchmarks demonstrate significant improvements in success rate and navigation efficiency compared to state-of-the-art zero-shot VLN methods. We further conduct real-world experiments to show our framework can effectively generalize to new environments and instruction sets, paving the way for more robust and autonomous navigation frameworks.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/yin25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/yin25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Co-Design of Soft Gripper with Neural Physics</title>
        <description>For robot manipulation, both the controller and end-effector design are crucial. Compared with rigid grippers, soft grippers are more generalizable by deforming to different geometries, but designing such a gripper and finding its grasp pose remains challenging. In this paper, we propose a co-design framework that generates an optimized soft gripper’s block-wise stiffness distribution and its grasping pose, using a neural physics model trained in simulation. We adopt a uniform-pressure tendon model, then generate a diverse dataset by randomizing both gripper pose and design parameters. A neural network is trained to approximate this forward simulation, yielding a fast, differentiable surrogate. We embed that surrogate in an end-to-end optimization loop to recover the ideal stiffness configuration and best grasp pose. Finally, we 3D-print the optimized grippers of various stiffness by changing the printing infills and parameters. We demonstrate that our co-designed grippers significantly outperform baseline designs in terms of force closure and success rate.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/yi25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/yi25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Towards Generalizable Safety in Crowd Navigation via Conformal Uncertainty Handling</title>
        <description>Mobile robots navigating in crowds trained using reinforcement learning are known to suffer performance degradation when faced with out-of-distribution scenarios. We propose that by properly accounting for the uncertainties of pedestrians, a robot can learn safe navigation policies that are robust to distribution shifts. Our method augments agent observations with prediction uncertainty estimates generated by adaptive conformal inference, and it uses these estimates to guide the agent’s behavior through constrained reinforcement learning. The system helps regulate the agent’s actions and enables it to adapt to distribution shifts. In the in-distribution setting, our approach achieves a 96.93% success rate, which is over 8.80% higher than the previous state-of-the-art baselines with over 3.72 times fewer collisions and 2.43 times fewer intrusions into ground-truth human future trajectories. In three out-of-distribution scenarios, our method shows much stronger robustness when facing distribution shifts in velocity variations, policy changes, and transitions from individual to group dynamics. We deploy our method on a real robot, and experiments show that the robot makes safe and robust decisions when interacting with both sparse and dense crowds.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/yao25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/yao25a.html</guid>
        
        
      </item>
    
      <item>
        <title>FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection</title>
        <description>In order to navigate complex traffic environments, self-driving vehicles must recognize many semantic classes pertaining to vulnerable road users or traffic control devices.  However, many safety-critical objects (e.g., construction worker) appear infrequently in nominal traffic conditions, leading to a severe shortage of training examples from driving data alone.  Recent vision foundation models, which are trained on a large corpus of data, can serve as a good source of external prior knowledge to improve generalization. We propose FOMO-3D, the first 3D detector to leverage vision foundation models for long-tailed 3D detection. Specifically, FOMO-3D exploits rich semantic and depth priors from OWLv2 and Metric3Dv2 within a two-stage detection paradigm that first generates proposals with a LiDAR-based branch and a novel camera-based branch, and refines them with attention especially to image features from OWL. Evaluations on real-world driving data show that using rich priors from vision foundation models with careful multimodal fusion designs leads to large gains for long-tailed 3D detection.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/yang25e.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/yang25e.html</guid>
        
        
      </item>
    
      <item>
        <title>Deep Reactive Policy: Learning Reactive Manipulator Motion Planning for Dynamic Environments</title>
        <description>Generating collision-free motion in dynamic, partially observable environments is a fundamental challenge for robotic manipulators. Classical motion planners can compute globally optimal trajectories but require full environment knowledge and are typically too slow for dynamic scenes. Neural motion policies offer a promising alternative by operating in closed-loop directly on raw sensory inputs but often struggle to generalize in complex or dynamic settings. We propose Deep Reactive Policy (DRP), a visuo-motor neural motion policy designed for reactive motion generation in diverse dynamic environments, operating directly on point cloud sensory input. At its core is IMPACT, a transformer-based neural motion policy pretrained on 10 million generated expert trajectories across diverse simulation scenarios. We further improve IMPACT’s static obstacle avoidance through iterative student-teacher finetuning. We additionally enhance the policy’s dynamic obstacle avoidance at inference time using DCP-RMP, a locally reactive goal-proposal module. We evaluate DRP on challenging tasks featuring cluttered scenes, dynamic moving obstacles, and goal obstructions. DRP achieves strong generalization, outperforming prior classical and neural methods in success rate across both simulated and real-world settings. We will release the dataset, simulation environments, and trained models upon acceptance. Refer to supplementary material for videos.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/yang25d.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/yang25d.html</guid>
        
        
      </item>
    
      <item>
        <title>Extracting Visual Plans from Unlabeled Videos via Symbolic Guidance</title>
        <description>Visual planning, by offering a sequence of intermediate visual subgoals to a goal-conditioned low-level policy, achieves promising performance on long-horizon manipulation tasks. To obtain the subgoals, existing methods typically resort to video generation models but suffer from model hallucination and computational cost. We present Vis2Plan, an efficient, explainable and white-box visual planning framework powered by symbolic guidance. From raw, unlabeled play data, Vis2Plan harnesses vision foundation models to automatically extract a compact set of task symbols, which allows building a high-level symbolic transition graph for multi-goal, multi-stage planning. At test time, given a desired task goal, our planner conducts planning at the symbolic level and assembles a sequence of physically consistent intermediate sub-goal images grounded by the underlying symbolic representation. Our Vis2Plan outperforms strong diffusion video generation-based visual planners by delivering 53% higher aggregate success rate while generating visual plans 35$\times$ faster. The results indicate that Vis2Plan is able to generate physically consistent image goals while offering fully inspectable reasoning steps.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/yang25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/yang25c.html</guid>
        
        
      </item>
    
      <item>
        <title>Mobi-$\pi$: Mobilizing Your Robot Learning Policy</title>
        <description>Learned visuomotor policies are capable of performing increasingly complex manipulation tasks. However, most of these policies are trained on data collected from limited robot positions and camera viewpoints. This leads to poor generalization to novel robot positions, which limits the use of these policies on mobile platforms, especially for precise tasks like pressing buttons or turning faucets. In this work, we formulate the &quot;policy mobilization&quot; problem: find a mobile robot base pose in a novel environment that is in distribution with respect to a manipulation policy trained on a limited set of camera viewpoints. Compared to retraining the policy itself to be more robust to unseen robot base pose initializations, policy mobilization decouples navigation from manipulation and thus does not require additional demonstrations. At the same time, our formulation remains compatible with any approach that improves manipulation policy robustness. To study policy mobilization, we introduce the Mobi-$\pi$ framework, which includes: (1) metrics that quantify the difficulty of mobilizing a given policy, (2) a suite of simulated mobile manipulation tasks based on RoboCasa to evaluate policy mobilization, (3) visualization tools for analysis, and (4) several baseline methods. We also propose a novel approach that bridges navigation and manipulation by optimizing the robot’s base pose to align with an in-distribution base pose for a learned policy. Our approach utilizes a 3D Gaussian Splatting model for novel viewpoint synthesis, a score function to evaluate pose suitability, as well as sampling-based optimization to identify optimal robot poses. We show that our approach on average outperforms the best baseline by 7.65$\times$ in simulation and 2.38$\times$ in the real world, demonstrating its effectiveness for policy mobilization.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/yang25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/yang25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Multi-Loco: Unifying Multi-Embodiment Legged Locomotion via Reinforcement Learning Augmented Diffusion</title>
        <description>Generalizing locomotion policies across diverse legged robots with varying morphologies is a key challenge due to differences in observation/action dimensions and system dynamics. In this work, we propose \textit{Multi-Loco}, a novel unified framework combining a morphology-agnostic generative diffusion model with a lightweight residual policy optimized via reinforcement learning (RL). The diffusion model captures morphology-invariant locomotion patterns from diverse cross-embodiment datasets, improving generalization and robustness. The residual policy is shared across all embodiments and refines the actions generated by the diffusion model, enhancing task-aware performance and robustness for real-world deployment. We evaluate our method on a rich library of four legged robots in both simulation and real-world experiments. Compared to a standard RL framework with PPO, our approach, which replaces the Gaussian policy with a diffusion model and residual term, achieves a 10.35% average return improvement, with gains up to 13.57% in wheeled-biped locomotion tasks. These results highlight the benefits of cross-embodiment data and composite generative architectures in learning robust, generalized locomotion skills.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/yang25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/yang25a.html</guid>
        
        
      </item>
    
      <item>
        <title>ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training</title>
        <description>Generative models based on flow matching offer significant potential for learning robot policies, particularly in generating high-dimensional, dexterous behaviors that are conditioned on diverse observations. In this work, we introduce ManiFlow, an advanced flow matching model specifically designed to support dexterous manipulation tasks. ManiFlow improves over flow matching both in the learning procedure and in the model architecture, resulting in better robustness and efficacy. It consistently exhibits strong generalization capabilities, outperforming existing state-of-the-art robot learning methods on a wide range of benchmarks. We also demonstrate the powerful capabilities of ManiFlow in solving complex bimanual dexterous manipulation challenges.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/yan25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/yan25a.html</guid>
        
        
      </item>
    
      <item>
        <title>D-Cubed: Latent Diffusion Trajectory Optimisation for Dexterous Deformable Manipulation</title>
        <description>Mastering deformable object manipulation often necessitates the use of anthropomorphic, high-degree-of-freedom robot hands capable of precise, contact-rich control. However, current trajectory optimisation methods often struggle in these settings due to the large search space and the sparse task information available from shape-matching cost functions, particularly when contact is absent. In this work, we propose D-Cubed, a novel trajectory optimisation method using a latent diffusion model (LDM) trained on a task-agnostic play dataset to solve dexterous deformable object manipulation tasks. D-Cubed learns a skill-latent space that encodes short-horizon actions from a play dataset using a VAE and trains an LDM to compose the skill latents into a skill trajectory, representing a long-horizon action trajectory. To optimise a trajectory for a target task, we introduce a novel gradient-free guided sampling method that employs the Cross-Entropy method within the reverse diffusion process. In particular, D-Cubed samples a small number of noisy skill trajectories using the LDM for exploration and evaluates the trajectories in simulation. Then D-Cubed selects the trajectory with the lowest cost for the subsequent reverse process. This effectively explores promising solution areas and optimises the sampled trajectories towards a target task throughout the reverse diffusion process. Through empirical evaluation on a published benchmark of dexterous deformable object manipulation tasks, we demonstrate that D-Cubed outperforms traditional trajectory optimisation and competitive baseline approaches by a significant margin.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/yamada25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/yamada25b.html</guid>
        
        
      </item>
    
      <item>
        <title>COMBO-Grasp: Learning Constraint-Based Manipulation for Bimanual Occluded Grasping</title>
        <description>This paper addresses the challenge of occluded robot grasping, i.e. grasping in situations where the desired grasp poses are kinematically infeasible due to environmental constraints such as surface collisions. Existing RL methods struggle with task complexity, and collecting expert demonstrations is often impractical. Instead, inspired by human bimanual manipulation strategies, where two hands coordinate to stabilise and reorient objects, we focus on a bimanual robotic setup to tackle this challenge. In particular, we introduce Constraint-based Manipulation for Bimanual Occluded Grasping (COMBO-Grasp), an approach which leverages two coordinated policies: a constraint policy trained using self-supervised datasets to generate stabilising poses and a grasping policy trained using RL that reorients and grasps the target object. A key contribution lies in value function-guided policy coordination, where gradients from a jointly trained value function refine the constraint policy during RL training to improve bimanual coordination and task performance. Lastly, COMBO-Grasp employs teacher-student policy distillation to effectively deploy vision-based policies in real-world environments. Experiments show that COMBO-Grasp significantly outperforms baselines and generalises to unseen objects in both simulation and real environments.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/yamada25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/yamada25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Neural Robot Dynamics</title>
        <description>Accurate and efficient simulation of modern robots remains challenging due to their high degrees of freedom and intricate mechanisms. Neural simulators have emerged as a promising alternative to traditional analytical simulators, capable of efficiently predicting complex dynamics and adapting to real-world data; however, existing neural simulators typically require application-specific training and fail to generalize to novel tasks and/or environments, primarily due to inadequate representations of the global state. In this work, we address the problem of learning generalizable neural simulators for robots that are structured as articulated rigid bodies. We propose *NeRD* (Neural Robot Dynamics), learned robot-specific dynamics models for predicting future states for articulated rigid bodies under contact constraints. *NeRD* uniquely replaces the low-level dynamics and contact solvers in an analytical simulator and employs a robot-centric and spatially-invariant simulation state representation. We integrate the learned *NeRD* models as an interchangeable backend solver within a state-of-the-art robotics simulator. We conduct extensive experiments to show that the *NeRD* simulators are stable and accurate over a thousand simulation steps; generalize across tasks and environment configurations; enable policy learning exclusively in a neural engine; and, unlike most classical simulators, can be fine-tuned from real-world data to bridge the gap between simulation and reality.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/xu25g.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/xu25g.html</guid>
        
        
      </item>
    
      <item>
        <title>VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision</title>
        <description>Human drivers rely on commonsense reasoning to navigate diverse and dynamic real-world scenarios. Existing end-to-end (E2E) autonomous driving (AD) models are typically optimized to mimic driving patterns observed in data, without capturing the underlying reasoning processes.  This limitation constrains their ability to handle challenging driving scenarios. To close this gap, we propose VLM-AD, a method that leverages vision-language models (VLMs) as teachers to enhance training by providing additional supervision that incorporates unstructured reasoning information and structured action labels. Such supervision enhances the model’s ability to learn richer feature representations that capture the rationale behind driving patterns. Importantly, our method does not require a VLM during inference, making it practical for real-time deployment. When integrated with state-of-the-art methods, VLM-AD achieves significant improvements in planning accuracy and reduced collision rates on the nuScenes dataset. It further improves route completion and driving scores under closed-loop evaluation, demonstrating its effectiveness in long-horizon, interactive driving scenarios and its potential for safe and reliable real-world deployment.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/xu25f.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/xu25f.html</guid>
        
        
      </item>
    
      <item>
        <title>exUMI: Extensible Robot Teaching System with Action-aware Task-agnostic Tactile Representation</title>
        <description>Tactile-aware robot learning faces critical challenges in data collection and representation due to data scarcity and sparsity, and the absence of force feedback in existing systems. To address these limitations, we introduce a tactile robot learning system with both hardware and algorithm innovations. We present exUMI, an extensible data collection device that enhances the vanilla UMI with robust proprioception (via AR MoCap and rotary encoder), modular visuo-tactile sensing, and automated calibration, achieving 100% data usability. Building on an efficient collection of over 1M tactile frames, we propose Tactile Prediction Pretraining (TPP), a representation learning framework based on action-aware temporal tactile prediction, capturing contact dynamics and mitigating tactile sparsity. Real-world experiments show that TPP outperforms traditional tactile imitation learning. Our work bridges the gap between human tactile intuition and robot learning through co-designed hardware and algorithms, offering open-source resources to advance contact-rich manipulation research.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/xu25e.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/xu25e.html</guid>
        
        
      </item>
    
      <item>
        <title>Dexplore: Scalable Neural Control for Dexterous Manipulation from Reference Scoped Exploration</title>
        <description>Hand–object motion-capture (MoCap) repositories provide abundant, contact-rich human demonstrations for scaling dexterous manipulation on robots. Yet demonstration inaccuracy and embodiment gaps between human and robot hands challenge direct policy learning. Existing pipelines adopt a three-stage workflow: retargeting, tracking, and residual correction. This multi-step process may not fully utilize demonstrations and can introduce compound errors. We introduce Reference-Scoped Exploration (RSE), a unified, single-loop optimization that integrates retargeting and tracking to train a scalable robot control policy directly from MoCap. Instead of treating demonstrations as strict ground truth, we view them as soft guidance. From raw demonstrations, we construct adaptive spatial scopes (time-varying termination boundaries), and reinforcement learning encourages the policy to stay within these envelopes while minimizing control effort. This holistic approach preserves demonstration intent, lets robot-specific strategies emerge, boosts robustness to noise, and scales effortlessly with large-scale demonstrations. We distill the scaled tracking policy into a vision-based, skill-conditioned generative control policy. This distilled policy captures diverse manipulation skills within a rich latent representation, enabling generalization across various objects and real-world robotic manipulation.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/xu25d.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/xu25d.html</guid>
        
        
      </item>
    
      <item>
        <title>FACET: Force-Adaptive Control via Impedance Reference Tracking for Legged Robots</title>
        <description>Reinforcement learning (RL) has made significant strides in legged robot control, enabling locomotion across diverse terrains and complex loco-manipulation capabilities. However, the commonly used position or velocity tracking-based objectives are agnostic to forces experienced by the robot, leading to stiff and potentially dangerous behaviors and poor control during forceful interactions. To address this limitation, we present Force-Adaptive Control via Impedance Reference Tracking (FACET). Inspired by impedance control, we use RL to train a control policy to imitate a virtual mass-spring-damper system, allowing fine-grained control under external forces by manipulating the virtual spring. In simulation, we demonstrate that our quadruped robot achieves improved robustness to large impulses (up to 200 Ns) and exhibits controllable compliance, achieving an 80% reduction in collision impulse. The policy is deployed to a physical robot, demonstrating both compliant behavior, such as initiation/cessation of movement with a fingertip, and the ability to pull payloads up to 10 kg. Further extension to a legged loco-manipulator and a humanoid shows the applicability of our method to more complex settings to enable whole-body compliance control.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/xu25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/xu25c.html</guid>
        
        
      </item>
    
      <item>
        <title>DexUMI: Using Human Hand as the Universal Manipulation Interface for Dexterous Manipulation</title>
        <description>We present DexUMI - a data collection and policy learning framework that uses the human hand as the natural interface to transfer dexterous manipulation skills to various robot hands. DexUMI incorporates hardware and software adaptations to minimize the embodiment gap between the human hand and various robot hands. The hardware adaptation bridges the kinematics gap with a wearable hand exoskeleton. It allows direct haptic feedback in manipulation data collection and adapts human motion to feasible robot hand motion. Our software adaptation bridges the visual gap by replacing the human hand in video data with high-fidelity robot hand inpainting. We demonstrate DexUMI’s capabilities through comprehensive real-world experiments on two different dexterous robot hand hardware platforms, achieving an average task success rate of 86%.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/xu25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/xu25b.html</guid>
        
        
      </item>
    
      <item>
        <title>“Stack It Up!”: 3D Stable Structure Generation from 2D Hand-drawn Sketch</title>
        <description>Imagine a child sketching the Eiffel Tower and asking a robot to bring it to life. Today’s robot manipulation systems can’t act on such sketches directly—they require precise 3D block poses as goals, which in turn demand structural analysis and expert tools like CAD. We present *StackItUp*, a system that enables non-experts to specify complex 3D structures using only 2D front-view hand-drawn sketches. *StackItUp* introduces an abstract relation graph to bridge the gap between rough sketches and accurate 3D block arrangements, capturing the symbolic geometric relations (e.g., *left-of*) and stability patterns (e.g.,*two-pillar-bridge*) while discarding noisy metric details from sketches. It then grounds this graph to 3D poses using compositional diffusion models and iteratively updates it by predicting hidden internal and rear supports—critical for stability but absent from the sketch. Evaluated on sketches of iconic landmarks and modern house designs, *StackItUp* consistently produces stable, multilevel 3D structures and outperforms all baselines in both stability and visual resemblance.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/xu25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/xu25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Vision in Action: Learning Active Perception from Human Demonstrations</title>
        <description>We present Vision in Action (ViA), an active perception system for bimanual robot manipulation. ViA learns task-relevant active perceptual strategies (e.g., searching, tracking, and focusing) directly from human demonstrations. On the hardware side, ViA employs a simple yet effective 6-DoF robotic neck to enable flexible, human-like head movements. To capture human active perception strategies, we design a VR-based teleoperation interface that creates a shared observation space between the robot and the human operator. To mitigate VR motion sickness caused by latency in the robot’s physical movements, the interface uses an intermediate 3D scene representation, enabling real-time view rendering on the operator side while asynchronously updating the scene with the robot’s latest observations. Together, these design elements enable the learning of robust visuomotor policies for three complex, multi-stage bimanual manipulation tasks involving visual occlusions, significantly outperforming baseline systems.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/xiong25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/xiong25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation</title>
        <description>Generalist robot policies trained on large-scale datasets such as Open X-Embodiment (OXE) demonstrate strong performance across a wide range of tasks. However, they often struggle to generalize beyond the distribution of their training data. In this paper, we investigate the underlying cause of this limited generalization capability.  We identify shortcut learning—the reliance on task-irrelevant features—as a key impediment to generalization. Through comprehensive theoretical and empirical analysis, we uncover two primary contributors to shortcut learning: (1) limited diversity within individual sub-datasets, and (2) significant distributional disparities across sub-datasets, leading to dataset fragmentation. These issues arise from the inherent structure of large-scale datasets like OXE, which are typically composed of multiple sub-datasets collected independently across varied environments and embodiments. Our findings provide critical insights into dataset collection strategies that can reduce shortcut learning and enhance the generalization ability of generalist robot policies. Moreover, in scenarios where acquiring new large-scale data is impractical, we demonstrate that carefully selected robotic data augmentation strategies can effectively reduce shortcut learning in existing offline datasets, thereby improving generalization capabilities of generalist robot policies, e.g., $\pi_0$ in the SIMPLER Environment.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/xing25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/xing25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Data Retrieval with Importance Weights for Few-Shot Imitation Learning</title>
        <description>While large-scale robot datasets have propelled recent progress in imitation learning, learning from smaller task-specific datasets remains critical for deployment in new environments and unseen tasks. One such approach to few-shot imitation learning is retrieval-based imitation learning, which extracts relevant samples from large, widely available prior datasets to augment a limited demonstration dataset. To determine the relevant data from prior datasets, retrieval-based approaches most commonly calculate a prior data point’s minimum distance to a point in the target dataset in latent space. While retrieval-based methods have shown success using this metric for data selection, we demonstrate its equivalence to the limit of a Gaussian kernel density estimate (KDE) of the target data distribution. This reveals two shortcomings of the retrieval rule used in prior work. First, it relies on high-variance nearest neighbor estimates that are susceptible to noise. Second, it does not account for the distribution of prior data when retrieving data. To address these issues, we introduce Importance Weighted Retrieval (IWR), which estimates importance weights, or the ratio between the target and prior data distributions for retrieval, using Gaussian KDEs. By considering the probability ratio, IWR overcomes the bias of previous selection rules, and by using reasonable modeling parameters, IWR effectively smooths estimates using all data points. Across both simulation environments and real-world evaluations on the Bridge dataset, we find that our method, IWR, consistently improves the performance of existing retrieval-based methods, despite requiring only minor modifications.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/xie25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/xie25a.html</guid>
        
        
      </item>
    
      <item>
        <title>UniTac2Pose: A Unified Approach Learned in Simulation for Category-level Visuotactile In-hand Pose Estimation</title>
        <description>Accurate estimation of the in-hand pose of an object based on its CAD model is crucial in both industrial applications and everyday tasks—ranging from positioning workpieces and assembling components to seamlessly inserting devices like USB connectors. While existing methods often rely on regression, feature matching, or registration techniques, achieving high precision and generalizability to unseen CAD models remains a significant challenge. In this paper, we propose a novel three-stage framework for in-hand pose estimation. The first stage involves sampling and pre-ranking pose candidates, followed by iterative refinement of these candidates in the second stage. In the final stage, post-ranking is applied to identify the most likely pose candidates. These stages are governed by a unified energy-based diffusion model, which is trained solely on simulated data. This energy model simultaneously generates gradients to refine pose estimates and produces an energy scalar that quantifies the quality of the pose estimates. Additionally, inspired by the computer vision domain, we incorporate a render-compare architecture within the energy-based score network to significantly enhance sim-to-real performance, as demonstrated by our ablation studies. Extensive experimental evaluations show that our method outperforms conventional baselines based on regression, matching, and registration techniques, while also exhibiting strong generalization to previously unseen CAD models. Moreover, our approach integrates tactile object pose estimation, pose tracking, and uncertainty estimation into a unified system, enabling robust performance across a variety of real-world conditions.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/wu25d.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/wu25d.html</guid>
        
        
      </item>
    
      <item>
        <title>MoTo: A Zero-shot Plug-in Interaction-aware Navigation for General Mobile Manipulation</title>
        <description>Mobile manipulation is a fundamental challenge for robotics in assisting humans with diverse tasks and environments in everyday life. Conventional mobile manipulation approaches often struggle to generalize across different tasks and environments due to the lack of large-scale training. However, recent advances in manipulation foundation models demonstrate impressive generalization capability on a wide range of fixed-base manipulation tasks, which are still limited to a fixed setting. Therefore, we devise a plug-in module named MoTo, which can be combined with any off-the-shelf manipulation foundation model to empower it with mobile manipulation ability. Specifically, we propose an interaction-aware navigation policy to generate agent docking points for generalized mobile manipulation. To enable zero-shot ability, we propose an interaction keypoints framework via vision-language models (VLM) under multi-view consistency for both the target object and the robotic arm following instructions, where fixed-base manipulation foundation models can be employed. We further propose motion planning objectives for the mobile base and robot arm, which minimize the distance between the two keypoints and maintain the physical feasibility of trajectories. In this way, MoTo guides the agent to move to the docking points where fixed-base manipulation can be successfully performed, and leverages VLM generation and trajectory optimization to achieve mobile manipulation in a zero-shot manner, without any requirement for mobile manipulation expert data. Extensive experimental results on OVMM and in the real world demonstrate that MoTo achieves success rates 2.68% and 16.67% higher than the state-of-the-art mobile manipulation methods, respectively, without requiring additional training data.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/wu25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/wu25c.html</guid>
        
        
      </item>
    
      <item>
        <title>QuaDreamer: Controllable Panoramic Video Generation for Quadruped Robots</title>
        <description>Panoramic cameras, capturing comprehensive 360-degree environmental data, are suitable for quadruped robots in surrounding perception and interaction with complex environments. However, the scarcity of high-quality panoramic training data — caused by inherent kinematic constraints and complex sensor calibration challenges — fundamentally limits the development of robust perception systems tailored to these embodied platforms. To address this issue, we propose QuaDreamer—the first panoramic data generation engine specifically designed for quadruped robots. QuaDreamer focuses on mimicking the motion paradigm of quadruped robots to generate highly controllable, realistic panoramic videos, providing a data source for downstream tasks. Specifically, to effectively capture the unique vertical vibration characteristics exhibited during quadruped locomotion, we introduce Vertical Jitter Encoding (VJE). VJE extracts controllable vertical signals through frequency-domain feature filtering and provides high-quality prompts. To facilitate high-quality panoramic video generation under jitter signal control, we propose a Scene-Object Controller (SOC) that effectively manages object motion and boosts background jitter control through the attention mechanism. To address panoramic distortions in wide-FoV video generation, we propose the Panoramic Enhancer (PE) – a dual-stream architecture that synergizes frequency-texture refinement for local detail enhancement with spatial-structure correction for global geometric consistency. We further demonstrate that the generated video sequences can serve as training data for the quadruped robot’s panoramic visual perception model, enhancing the performance of multi-object tracking in 360-degree scenes. The source code and model weights will be publicly available.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/wu25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/wu25b.html</guid>
        
        
      </item>
    
      <item>
        <title>SAVOR: Skill Affordance Learning from Visuo-Haptic Perception for Robot-Assisted Bite Acquisition</title>
        <description>Robot-assisted feeding requires reliable bite acquisition, a challenging task due to the complex interactions between utensils and food with diverse physical properties. These interactions are further complicated by the temporal variability of food properties—for example, steak becomes firm as it cools even during a meal. To address this, we propose SAVOR, a novel approach for learning skill affordances for bite acquisition—how suitable a manipulation skill (e.g., skewering, scooping) is for a given utensil-food interaction. In our formulation, skill affordances arise from the combination of tool affordances (what a utensil can do) and food affordances (what the food allows). Tool affordances are learned offline through calibration, where different utensils interact with a variety of foods to model their functional capabilities. Food affordances are characterized by physical properties such as softness, moisture, and viscosity, initially inferred through commonsense reasoning using a visually-conditioned language model and then dynamically refined through online multi-modal visuo-haptic perception using SAVOR-Net during interaction. Our method integrates these offline and online estimates to predict skill affordances in real time, enabling the robot to select the most appropriate skill for each food item. Evaluated on 20 single-item foods and 10 in-the-wild meals, our approach improves bite acquisition success by 13% over state-of-the-art (SOTA) category-based methods (e.g., using a skewer for fruits). These results highlight the importance of modeling interaction-driven skill affordances for generalizable and effective robot-assisted bite acquisition.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/wu25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/wu25a.html</guid>
        
        
      </item>
    
      <item>
        <title>DexSkin: High-Coverage Conformable Robotic Skin for Learning Contact-Rich Manipulation</title>
        <description>Human skin provides a rich tactile sensing stream, localizing intentional and unintentional contact events over a large and contoured region. Replicating these tactile sensing capabilities for dexterous robotic manipulation systems remains a longstanding challenge. In this work, we take a step towards this goal by introducing DexSkin. DexSkin is a soft, conformable capacitive electronic skin that enables sensitive, localized, and calibratable tactile sensing, and can be tailored to varying geometries. We demonstrate its efficacy for learning downstream robotic manipulation by sensorizing a pair of parallel jaw gripper fingers, providing tactile coverage across almost the entire finger surfaces. We empirically evaluate DexSkin’s capabilities in learning challenging manipulation tasks that require sensing coverage across the entire surface of the fingers, such as reorienting objects in hand and wrapping elastic bands around boxes, in a learning-from-demonstration framework. We then show that, critically for data-driven approaches, DexSkin can be calibrated to enable model transfer across sensor instances, and demonstrate its applicability to online reinforcement learning on real robots. Our results highlight DexSkin’s suitability and practicality for learning real-world, contact-rich manipulation.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/wistreich25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/wistreich25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning</title>
        <description>Imitation Learning can train robots to perform complex and diverse manipulation tasks, but learned policies are brittle to observations outside the training distribution. 3D scene representations that incorporate observations from calibrated RGBD cameras have been proposed as a way to mitigate this, but in our evaluations with unseen embodiments and camera viewpoints they show only modest improvement. To address these challenges, we propose Adapt3R, a general-purpose 3D observation encoder which synthesizes data from calibrated RGBD cameras into a vector that can be used as conditioning for arbitrary IL algorithms. The key idea is to use a pretrained 2D backbone to extract semantic information, using 3D only as a medium to localize this information with respect to the end-effector. We show across 93 simulated and 6 real tasks that when trained end-to-end with a variety of IL algorithms, Adapt3R maintains these algorithms’ learning capacity while enabling zero-shot transfer to novel embodiments and camera poses. For more results, visit https://adapt3r-robot.github.io.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/wilcox25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/wilcox25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Articulated Object Estimation in the Wild</title>
        <description>Understanding the 3D motion of articulated objects is essential in robotic scene understanding, mobile manipulation, and motion planning. Prior methods for articulation estimation have primarily focused on controlled settings, assuming either fixed camera viewpoints or direct observations of various object states, which tend to fail in more realistic, unconstrained environments. In contrast, humans effortlessly infer articulation modes by watching others manipulate objects. Inspired by this, we introduce ArtiPoint, a novel estimation framework capable of inferring articulated object models under dynamic camera motion and partial observability. By combining deep point tracking with a factor graph optimization framework, ArtiPoint robustly estimates articulated part trajectories and articulation axes directly from raw RGB-D videos. To foster future research in this domain, we introduce Arti4D, the first ego-centric in-the-wild dataset capturing articulated object interactions at a scene level, accompanied by articulation labels and ground truth camera poses. We benchmark ArtiPoint against a range of classical and modern deep learning baselines, demonstrating its superior performance on Arti4D. We make our code and Arti4D publicly available at redacted-for-review.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/werby25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/werby25a.html</guid>
        
        
      </item>
    
      <item>
        <title>DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control</title>
        <description>Enabling robots to perform diverse tasks across varied environments is a central challenge in robot learning. While vision-language-action (VLA) models have shown promise for generalizable robot skills, realizing their full potential requires addressing limitations in action representation and efficient training. Current VLA models often focus on scaling the vision-language model (VLM) component, while the action space representation remains a critical bottleneck. This paper introduces DexVLA, a novel framework designed to enhance the efficiency and generalization capabilities of VLAs for complex, long-horizon tasks across diverse robot embodiments. DexVLA features a novel diffusion-based action expert, scaled to one billion parameters, designed for cross-embodiment learning. A novel embodiment curriculum learning strategy facilitates efficient training: (1) pre-training the diffusion expert on cross-embodiment data, (2) aligning the VLA model to specific embodiments, and (3) post-training for rapid adaptation to new tasks.  We conduct comprehensive experiments across multiple embodiments, including single-arm, bimanual, and dexterous hand, demonstrating DexVLA’s adaptability to challenging tasks without task-specific adaptation, its ability to learn dexterous skills on novel embodiments with limited data, and its capacity to complete complex, long-horizon tasks using only direct language prompting, such as laundry folding. In all settings, our method demonstrates superior performance compared to state-of-the-art models like OpenVLA and $\pi_{0}$.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/wen25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/wen25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Constrained Style Learning from Imperfect Demonstrations under Task Optimality</title>
        <description>Learning from demonstration has proven effective in robotics for acquiring natural behaviors, such as stylistic motions and lifelike agility, particularly when explicitly defining style-oriented reward functions is challenging. Synthesizing stylistic motions for real-world tasks usually requires balancing task performance and imitation quality. Existing methods generally depend on expert demonstrations closely aligned with task objectives. However, practical demonstrations are often incomplete or unrealistic, causing current methods to boost style at the expense of task performance. To address this issue, we propose formulating the problem as a constrained Markov Decision Process (CMDP). Our approach integrates a style-imitation objective with constraints to maintain near-optimal task performance. We introduce an adaptively adjustable Lagrangian multiplier to guide the agent to imitate demonstrations selectively, capturing stylistic nuances without compromising task performance. We validate our approach across multiple robotic platforms and tasks, demonstrating both robust task performance and high-fidelity style learning. On ANYmal-D hardware, we show a 14.5% drop in mechanical energy and a more agile gait pattern, showcasing real-world benefits.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/wen25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/wen25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Ensuring Force Safety in Vision-Guided Robotic Manipulation via Implicit Tactile Calibration</title>
        <description>In unstructured environments, robotic manipulation tasks involving objects with constrained motion trajectories—such as door opening—often experience discrepancies between the robot’s vision-guided end-effector trajectory and the object’s constrained motion path.  Such discrepancies generate unintended harmful forces, which, if exacerbated, may lead to task failure and potential damage to the manipulated objects or the robot itself. To address this issue, this paper introduces a novel diffusion framework, termed SafeDiff. Unlike conventional methods that sequentially fuse visual and tactile data to predict future robot states, our approach generates a prospective state sequence based on the current robot state and visual context observations, using real-time force feedback as a calibration signal.  This implicitly adjusts the robot’s state within the state space, enhancing operational success rates and significantly reducing harmful forces during manipulation, thus ensuring manipulation force safety. Additionally, we develop a large-scale simulation dataset named SafeDoorManip50k, offering extensive multimodal data to train and evaluate the proposed method. Extensive experiments show that our visual-tactile model substantially mitigates the risk of harmful forces in the door opening task, across both simulated and real-world settings.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/wei25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/wei25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Beyond Constant Parameters: Hyper Prediction Models and HyperMPC</title>
        <description>Model Predictive Control (MPC) is among the most widely adopted and reliable methods for robot control, relying critically on an accurate dynamics model. However, existing dynamics models used in gradient-based MPC are limited by computational complexity and state representation. To address this limitation, we propose the Hyper Prediction Model (HyperPM), a novel approach in which we project the unmodeled dynamics onto a time-dependent dynamics model. This time-dependency is captured through time-varying model parameters, whose evolution over the MPC prediction horizon is learned using a neural network. Such a formulation preserves the computational efficiency and robustness of the base model while equipping it with the capacity to anticipate previously unmodeled phenomena. We evaluated the proposed approach on several challenging systems, including real-world F1TENTH autonomous racing, and demonstrated that it significantly reduces long-horizon prediction errors. Moreover, when integrated within the MPC framework (HyperMPC), our method consistently outperforms existing state-of-the-art techniques.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/wegrzynowski25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/wegrzynowski25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Do LLM Modules Generalize? A Study on Motion Generation for Autonomous Driving</title>
        <description>Recent breakthroughs in large language models (LLMs) have not only advanced natural language processing but also inspired their application in domains with structurally similar problems—most notably, autonomous driving motion generation. Both domains involve autoregressive sequence modeling, token-based representations, and context-aware decision making, making the transfer of LLM components a natural and increasingly common practice. However, despite promising early attempts, a systematic understanding of which LLM modules are truly transferable remains lacking. In this paper, we present a comprehensive evaluation of five key LLM modules—tokenizer design, positional embedding, pre-training paradigms, post-training strategies, and test-time computation—within the context of motion generation for autonomous driving. Through extensive experiments on the Waymo Sim Agents benchmark, we demonstrate that, when appropriately adapted, these modules can significantly improve performance for autonomous driving motion generation. In addition, we identify which techniques can be effectively transferred, analyze the potential reasons for the failure of others, and discuss the specific adaptations needed for autonomous driving scenarios. We evaluate our method on the Sim Agents task and achieve competitive results.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/wang25g.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/wang25g.html</guid>
        
        
      </item>
    
      <item>
        <title>TrackVLA: Embodied Visual Tracking in the Wild</title>
        <description>Embodied visual tracking is a fundamental skill in Embodied AI, enabling an agent to follow a specific target in dynamic environments using only egocentric vision. This task is inherently challenging as it requires both accurate target recognition and effective trajectory planning under conditions of severe occlusion and high scene dynamics. Existing approaches typically address this challenge through a modular separation of recognition and planning. In this work, we propose TrackVLA, a Vision-Language-Action (VLA) model that learns the synergy between object recognition and trajectory planning. Leveraging a shared LLM backbone, we employ a language modeling head for recognition and an anchor-based diffusion model for trajectory planning. To train TrackVLA, we construct an Embodied Visual Tracking Benchmark (EVT-Bench) and collect recognition samples spanning diverse difficulty levels, resulting in a dataset of 1.7 million samples. Through extensive experiments in both synthetic and real-world environments, TrackVLA demonstrates SOTA performance and strong generalizability. It significantly outperforms existing methods on public benchmarks in a zero-shot manner while remaining robust to high dynamics and occlusion in real-world scenarios at 10 FPS inference speed.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/wang25f.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/wang25f.html</guid>
        
        
      </item>
    
      <item>
        <title>CoRI: Communication of Robot Intent for Physical Human-Robot Interaction</title>
        <description>Clear communication of robot intent fosters transparency and interpretability in physical human-robot interaction (pHRI), particularly during assistive tasks involving direct human-robot contact. We introduce CoRI, a pipeline that automatically generates natural language communication of a robot’s upcoming actions directly from its motion plan and visual perception. Our pipeline first processes the robot’s image view to identify human poses and key environmental features. It then encodes the planned 3D spatial trajectory (including velocity and force) onto this view, visually grounding the path and its dynamics. CoRI queries a vision-language model with this visual representation to interpret the planned action within the visual context before generating concise, user-directed statements, without relying on task-specific information. Results from a user study involving robot-assisted feeding, bathing, and shaving tasks across two different robots indicate that CoRI leads to a statistically significant improvement in communication clarity over a baseline communication strategy. Specifically, CoRI effectively conveys not only the robot’s high-level intentions but also crucial details about its motion and any collaborative user action needed.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/wang25e.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/wang25e.html</guid>
        
        
      </item>
    
      <item>
        <title>CogniPlan: Uncertainty-Guided Path Planning with Conditional Generative Layout Prediction</title>
        <description>Path planning in unknown environments is a crucial yet inherently challenging capability for mobile robots, which primarily encompasses two coupled tasks: autonomous exploration and point-goal navigation. In both cases, the robot must perceive the environment, update its belief, and accurately estimate potential information gain on-the-fly to guide planning. In this work, we propose CogniPlan, a novel path planning framework that leverages multiple plausible layouts predicted by a conditional generative inpainting model, mirroring how humans rely on cognitive maps during navigation. These predictions, based on the partially observed map and a set of layout conditioning vectors, enable our planner to reason effectively under uncertainty. We demonstrate strong synergy between generative image-based layout prediction and graph-attention-based path planning, allowing CogniPlan to combine the scalability of graph representations with the fidelity and predictiveness of occupancy maps, yielding notable performance gains in both exploration and navigation. We extensively evaluate CogniPlan on two datasets (hundreds of maps and realistic floor plans), consistently outperforming state-of-the-art planners. We further deploy it in a high-fidelity simulator and on hardware, showcasing its high-quality path planning and real-world applicability.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/wang25d.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/wang25d.html</guid>
        
        
      </item>
    
      <item>
        <title>TopoCut: Learning Multi-Step Cutting with Spectral Rewards and Discrete Diffusion Policies</title>
        <description>Robotic manipulation tasks involving cutting deformable objects remain challenging due to complex topological behaviors, difficulties in perceiving dense object states, and the lack of efficient evaluation methods for cutting outcomes. In this paper, we introduce TopoCut, a comprehensive benchmark for multi-step robotic cutting tasks that integrates a cutting environment and generalized policy learning. TopoCut is built upon three core components: (1) a high-fidelity simulation environment based on a particle-based elastoplastic solver with compliant von Mises constitutive models, augmented by a novel damage-driven topology discovery mechanism for accurate tracking of multiple cutting pieces; (2) a comprehensive reward design that combines this topology discovery with a pose-invariant spectral reward model based on Laplace–Beltrami eigenanalysis, enabling consistent and robust assessment of cutting quality; and (3) an integrated policy learning pipeline, where a dynamics-informed perception module predicts topological evolution and produces particle-wise, topology-aware embeddings to support PDDP—Particle-based Score-Entropy Discrete Diffusion Policy—for goal-conditioned policy learning. Extensive experiments demonstrate that TopoCut enables trajectory generation, scalable learning, precise evaluation, and strong generalization across diverse object geometries, scales, poses, and cutting goals.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/wang25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/wang25c.html</guid>
        
        
      </item>
    
      <item>
        <title>Omni-Perception: Omnidirectional Collision Avoidance of Legged Robots in Dynamic Environments</title>
        <description>Agile locomotion in complex 3D environments requires robust spatial awareness to safely avoid diverse obstacles such as aerial clutter, uneven terrain, and dynamic agents. Depth-based perception approaches often struggle with sensor noise, lighting variability, computational overhead from intermediate representations (e.g., elevation maps), and difficulties with non-planar obstacles, limiting performance in unstructured environments. In contrast, direct integration of LiDAR sensing into end-to-end learning for legged locomotion remains underexplored. We propose Omni-Perception, an end-to-end locomotion policy that achieves 3D spatial awareness and omnidirectional collision avoidance by directly processing raw LiDAR point clouds. At its core is PD-RiskNet (Proximal-Distal Risk-Aware Hierarchical Network), a novel perception module that interprets spatio-temporal LiDAR data for environmental risk assessment. To facilitate efficient policy learning, we develop a high-fidelity LiDAR simulation toolkit with realistic noise modeling and fast raycasting, compatible with platforms such as Isaac Gym, Genesis, and MuJoCo, enabling scalable training and effective sim-to-real transfer. Learning reactive control policies directly from raw LiDAR data enables the robot to navigate complex environments with static and dynamic obstacles more robustly than approaches relying on intermediate maps or limited sensing. We validate Omni-Perception through real-world experiments and extensive simulation, demonstrating strong omnidirectional avoidance capabilities and superior locomotion performance in highly dynamic environments. We will open-source our code and models.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/wang25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/wang25b.html</guid>
        
        
      </item>
    
      <item>
        <title>The Sound of Simulation: Learning Multimodal Sim-to-Real Robot Policies with Generative Audio</title>
        <description>Robots must integrate multiple sensory modalities to act effectively in the real world. Yet, learning such multimodal policies at scale remains challenging. Simulation offers a viable solution, but while vision has benefited from high-fidelity simulators, other modalities (e.g. sound) can be notoriously difficult to simulate. As a result, sim-to-real transfer has succeeded primarily in vision-based tasks, with multimodal transfer still largely unrealized. In this work, we tackle these challenges by introducing MultiGen, a framework that integrates large-scale generative models into traditional physics simulators, enabling multisensory simulation. We showcase our framework on the dynamic task of robot pouring, which inherently relies on multimodal feedback. By synthesizing realistic audio conditioned on simulation video, our method enables training on rich audiovisual trajectories—without any real robot data. We demonstrate effective zero-shot transfer to real-world pouring with novel containers and liquids, highlighting the potential of generative modeling to both simulate hard-to-model modalities and close the multimodal sim-to-real gap.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/wang25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/wang25a.html</guid>
        
        
      </item>
    
      <item>
        <title>LodeStar: Long-horizon Dexterity via Synthetic Data Augmentation from Human Demonstrations</title>
        <description>Developing robotic systems capable of robustly executing long-horizon manipulation tasks with human-level dexterity is challenging, as such tasks require both physical dexterity and seamless sequencing of manipulation skills while robustly handling environment variations. While imitation learning offers a promising approach, acquiring comprehensive datasets is resource-intensive. In this work, we propose a learning framework and system LodeStar that automatically decomposes task demonstrations into semantically meaningful skills using off-the-shelf foundation models, and generates diverse synthetic demonstration datasets from a few human demos through reinforcement learning. These sim-augmented datasets enable robust skill training, with a Skill Routing Transformer (SRT) policy effectively chaining the learned skills together to execute complex long-horizon manipulation tasks. Experimental evaluations on three challenging real-world long-horizon dexterous manipulation tasks demonstrate that our approach significantly improves task performance and robustness compared to previous baselines. Videos are available at lodestar-robot.github.io.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/wan25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/wan25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Steering Your Diffusion Policy with Latent Space Reinforcement Learning</title>
        <description>Robotic control policies learned from human demonstrations have achieved impressive results in many real-world applications. However, in scenarios where initial performance is not satisfactory, as is often the case in novel open-world settings, such behavioral cloning (BC)-learned policies typically require collecting additional human demonstrations to further improve their behavior—an expensive and time-consuming process. In contrast, reinforcement learning (RL) holds the promise of enabling autonomous online policy improvement, but often falls short of achieving this due to the large number of samples it typically requires. In this work we take steps towards enabling fast autonomous adaptation of BC-trained policies via efficient real-world RL. Focusing in particular on diffusion policies—a state-of-the-art BC methodology—we propose *diffusion steering via reinforcement learning* (DSRL): adapting the BC policy by running RL over its latent-noise space. We show that DSRL is highly sample efficient, requires only black-box access to the BC policy, and enables effective real-world autonomous policy improvement. Furthermore, DSRL avoids many of the challenges associated with finetuning diffusion policies, obviating the need to modify the weights of the base policy at all. We demonstrate DSRL on simulated benchmarks, real-world robotic tasks, and for adapting pretrained generalist policies, illustrating its sample efficiency and effective performance at real-world policy improvement.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/wagenmaker25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/wagenmaker25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Learning Long-Context Diffusion Policies via Past-Token Prediction</title>
        <description>Reasoning over long sequences of observations and actions is essential for many robotic tasks.  Yet, learning effective long-context policies from demonstrations remains challenging.  As context length increases, training becomes increasingly expensive due to rising memory demands, and policy performance often degrades as a result of spurious correlations. Recent methods typically sidestep these issues by truncating context length, discarding historical information that may be critical for subsequent decisions. In this paper, we propose an alternative approach that explicitly regularizes the retention of past information. We first revisit the copycat problem in imitation learning and identify an opposite challenge in recent diffusion policies: rather than over-relying on prior actions, they often fail to capture essential dependencies between past and future actions. To address this, we introduce Past-Token Prediction (PTP), an auxiliary task in which the policy learns to predict past action tokens alongside future ones. This regularization significantly improves temporal modeling in the policy head, with minimal reliance on visual representations. Building on this observation, we further introduce a multistage training strategy: pre-train the visual encoder with short contexts, and fine-tune the policy head using cached long-context embeddings.  This strategy preserves the benefits of PTP while greatly reducing memory and computational overhead. Finally, we extend PTP into a self-verification mechanism at test time, enabling the policy to score and select candidates consistent with past actions during inference. Experiments across four real-world and six simulated tasks demonstrate that our proposed method improves the performance of long-context diffusion policies by 3$\times$ and accelerates policy training by more than 10$\times$. Videos are available at https://ptp-robot.github.io.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/villasevil25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/villasevil25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Multi-critic Learning for Whole-body End-effector Twist Tracking</title>
        <description>Learning whole-body control for locomotion and arm motions in a single policy poses challenges, as the two tasks have conflicting goals. For instance, efficient locomotion typically favors a horizontal base orientation, while end-effector tracking may benefit from base tilting to extend reachability. Additionally, current Reinforcement Learning (RL) approaches using a pose-based task specification lack the ability to directly control the end-effector velocity, making smoothly executing trajectories very challenging. To address these limitations, we propose an RL-based framework that allows for dynamic, velocity-aware whole-body end-effector control. Our method introduces a multi-critic actor architecture that decouples the reward signals for locomotion and manipulation, simplifying reward tuning and allowing the policy to resolve task conflicts more effectively. Furthermore, we design a twist-based end-effector task formulation that can track both discrete poses and motion trajectories. We validate our approach through a set of simulation and hardware experiments using a quadruped robot equipped with a robotic arm. The resulting controller can simultaneously walk and move its end-effector and shows emergent whole-body behaviors, where the base assists the arm in extending the workspace, despite a lack of explicit formulations.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/vijayan25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/vijayan25a.html</guid>
        
        
      </item>
    
      <item>
        <title>From Space to Time: Enabling Adaptive Safety with Learned Value Functions via Disturbance Recasting</title>
        <description>Safe operation is essential for autonomous systems in safety-critical environments such as urban air mobility.  Value function-based safety filters provide formal guarantees on safety, wrapping learned or planning-based controllers with a layer of protection.  Recent approaches leverage offline learned value functions to scale these safety filters to high-dimensional systems.  Yet these methods assume detailed prior knowledge of all possible sources of model mismatch, in the form of disturbances, in the environment – information that is typically unavailable in real-world settings.  Even in well-mapped environments like urban canyons or industrial sites, drones encounter complex, spatially-varying disturbances arising from payload-drone interaction, turbulent airflow, and other environmental factors.  We introduce Space2Time, which enables safe and adaptive deployment of offline-learned safety filters under unknown, spatially-varying disturbances.  The key idea is to reparameterize spatial disturbances as a time-varying formulation, allowing the use of temporally varying precomputed value functions during online operation.  We validate Space2Time through extensive simulations on diverse quadcopter models and real-world hardware experiments, demonstrating significantly improved safety performance over worst-case and naive baselines.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/tonkens25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/tonkens25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Diffusion Dynamics Models with Generative State Estimation for Cloth Manipulation</title>
        <description>Cloth manipulation is challenging due to its highly complex dynamics, near-infinite degrees of freedom, and frequent self-occlusions, which complicate both state estimation and dynamics modeling. Inspired by recent advances in generative models, we hypothesize that these expressive models can effectively capture intricate cloth configurations and deformation patterns from data. Therefore, we propose a diffusion-based generative approach for both perception and dynamics modeling. Specifically, we formulate state estimation as reconstructing full cloth states from partial observations and dynamics modeling as predicting future states given the current state and robot actions. Leveraging a transformer-based diffusion model, our method achieves accurate state reconstruction and reduces long-horizon dynamics prediction errors by an order of magnitude compared to prior approaches. We integrate our dynamics models with model-predictive control and show that our framework enables effective cloth folding on real robotic systems, demonstrating the potential of generative models for deformable object manipulation under partial observability and complex dynamics.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/tian25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/tian25c.html</guid>
        
        
      </item>
    
      <item>
        <title>O$^3$Afford: One-Shot 3D Object-to-Object Affordance Grounding for Generalizable Robotic Manipulation</title>
        <description>Grounding object affordance is fundamental to robotic manipulation as it establishes the critical link between perception and action among interacting objects. However, prior works predominantly focus on predicting single-object affordance, overlooking the fact that most real-world interactions involve relationships between pairs of objects. In this work, we address the challenge of object-to-object affordance grounding under limited data. Inspired by recent advances in few-shot learning with 2D vision foundation models, we propose a novel one-shot 3D object-to-object affordance learning approach for robotic manipulation. Semantic features from vision foundation models combined with point cloud representation for geometric understanding enable our one-shot learning pipeline to generalize effectively to novel objects and categories. We further integrate our 3D affordance representation with large language models (LLMs) for optimization-based motion planning, significantly enhancing LLMs’ capability to comprehend and reason about object interactions when generating task-specific constraint functions. Our experiments on 3D object-to-object affordance grounding and robotic manipulation demonstrate that our O$^3$Afford significantly outperforms existing baselines in terms of both accuracy and generalization capability.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/tian25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/tian25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Fabrica: Dual-Arm Assembly of General Multi-Part Objects via Integrated Planning and Learning</title>
        <description>Multi-part assembly poses significant challenges for robotic systems to execute long-horizon, contact-rich manipulation with generalization across complex geometries. We present a dual-arm robotic system capable of end-to-end planning and control for autonomous assembly of general multi-part objects. For planning over long horizons, we develop hierarchies of precedence, sequence, grasp, and motion planning with automated fixture generation, enabling general multi-step assembly on any dual-arm robots. The planner is made efficient through a parallelizable design and is optimized for downstream control stability. For contact-rich assembly steps, we propose a lightweight reinforcement learning framework that trains generalist policies across object geometries, assembly directions, and grasp poses, guided by equivariance and residual actions obtained from the plan. These policies transfer zero-shot to the real world and achieve 80% success rates. For systematic evaluation, we propose a benchmark suite of multi-part assemblies resembling industrial and daily objects across diverse categories and geometries. By integrating efficient global planning and robust local control, we demonstrate the first system to achieve complete and generalizable real-world multi-part assembly without domain knowledge or human demonstrations.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/tian25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/tian25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Agreement Volatility: A Second-Order Metric for Uncertainty Quantification in Surgical Robot Learning</title>
        <description>Autonomous surgical robots are a promising solution to the increasing demand for surgery amid a shortage of surgeons. Recent work has proposed learning-based approaches for the autonomous manipulation of soft tissue. However, due to variability in aspects such as tissue geometries and stiffnesses, these methods do not always perform well, especially in out-of-distribution settings. To address this challenge, we propose a novel second-order metric for uncertainty quantification, agreement volatility, that enables successful and efficient collaborative handoffs between a human operator and a robot during soft-tissue manipulation by allowing the robot to know when to cede control to human operators and when to resume autonomous operation. We validate our approach using the daVinci Research Kit (dVRK) surgical robot to perform risk-aware physical soft-tissue manipulation. Our experimental results demonstrate that our proposed agreement volatility metric improves system success rates and leads to a 10% lower reliance on human interventions compared to a variance-only baseline.  We further demonstrate the usefulness of our agreement volatility metric as a spatial uncertainty map over geometric point cloud data, enabling uncertainty attribution which provides insight into regions of the input causing uncertainty.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/thompson25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/thompson25a.html</guid>
        
        
      </item>
    
      <item>
        <title>CLAMP: Crowdsourcing a LArge-scale in-the-wild haptic dataset with an open-source device for Multimodal robot Perception</title>
        <description>Robust robot manipulation in unstructured environments often requires understanding object properties that extend beyond geometry, such as material or compliance—properties that can be challenging to infer using vision alone. Multimodal haptic sensing provides a promising avenue for inferring such properties, yet progress has been constrained by the lack of large, diverse, and realistic haptic datasets. In this work, we introduce the CLAMP device, a low-cost (&lt; $200) sensorized reacher-grabber designed to collect large-scale, in-the-wild multimodal haptic data from non-expert users in everyday settings. We deployed 16 CLAMP devices to 41 participants, resulting in the CLAMP dataset, the largest open-source multimodal haptic dataset to date, comprising 12.3 million datapoints across 5357 household objects. Using this dataset, we train a haptic encoder that can infer material and compliance object properties from multimodal haptic data. We leverage this encoder to create the CLAMP model, a visuo-haptic perception model for material recognition that generalizes to novel objects and three robot embodiments with minimal finetuning. We also demonstrate the effectiveness of our model in three real-world robot manipulation tasks: sorting recyclable and non-recyclable waste, retrieving objects from a cluttered bag, and distinguishing overripe from ripe bananas. Our results show that large-scale, in-the-wild haptic data collection can unlock new capabilities for generalizable robot manipulation.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/thakkar25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/thakkar25a.html</guid>
        
        
      </item>
    
      <item>
        <title>MimicFunc: Imitating Tool Manipulation from a Single Human Video via Functional Correspondence</title>
        <description>Imitating tool manipulation from human videos offers an intuitive approach to teaching robots, while also providing a promising and scalable alternative to labor-intensive teleoperation data collection for visuomotor policy learning. While humans can mimic tool manipulation behavior by observing others perform a task just once and effortlessly transfer the skill to diverse tools for functionally equivalent tasks, current robots struggle to achieve this level of generalization. A key challenge lies in establishing function-level correspondences, considering the significant geometric variations among functionally similar tools, referred to as intra-function variations. To address this challenge, we propose MimicFunc, a framework that establishes functional correspondences with a function frame, a function-centric local coordinate frame constructed with 3D functional keypoints, for imitating tool manipulation skills. Experiments demonstrate that MimicFunc effectively enables the robot to generalize the skill from a single RGB-D human video to manipulating novel tools for functionally equivalent tasks. Furthermore, leveraging MimicFunc’s one-shot generalization capability, the generated rollouts can be used to train visuomotor policies without requiring labor-intensive teleoperation data collection for novel objects.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/tang25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/tang25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Search-TTA: A Multi-Modal Test-Time Adaptation Framework for Visual Search in the Wild</title>
        <description>To perform autonomous visual search for environmental monitoring, a robot may leverage satellite imagery as a prior map. This can help inform coarse, high level search and exploration strategies, even when such images lack sufficient resolution to allow fine-grained, explicit visual recognition of targets. However, there are some challenges to overcome with using satellite images to direct visual search. For one, targets that are unseen in satellite images are underrepresented (compared to real life) in most existing datasets, and thus vision models trained on these datasets fail to reason effectively based on indirect visual cues. Furthermore, approaches which leverage large Vision Language Models (VLMs) for generalization may yield inaccurate outputs due to hallucination, leading to inefficient search. To address these challenges, we introduce Search-TTA, a multimodal test-time adaptation framework that can accept text and/or image input. First, we pretrain a remote sensing image encoder to align with CLIP’s visual encoder to output probability distributions of target presence used for visual search. Second, our framework dynamically refines CLIP’s predictions during search using a test-time adaptation mechanism. Through a feedback loop inspired by Spatial Poisson Point Processes, gradient updates (weighted by uncertainty) are used to correct (potentially inaccurate) predictions and improve search performance. To validate Search-TTA’s performance, we curate a visual search dataset based on internet-scale ecological data. We find that Search-TTA improves planner performance by up to 9.7%, particularly in cases with poor initial CLIP predictions. It also achieves comparable performance to state-of-the-art VLMs. Finally, we deploy Search-TTA on a real UAV via hardware-in-the-loop testing, by simulating its operation within a large-scale simulation that provides onboard sensing.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/tan25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/tan25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Reactive In-Air Clothing Manipulation with Confidence-Aware Dense Correspondence and Visuotactile Affordance</title>
        <description>Manipulating clothing is challenging due to its complex, variable configurations and frequent self-occlusion. While prior systems often rely on flattening garments, humans routinely identify keypoints in highly crumpled and suspended states. We present a novel, task-agnostic, visuotactile framework that operates directly on crumpled clothing—including in-air configurations that have not been addressed before. Our approach combines global visual perception with local tactile feedback to enable robust, reactive manipulation. We train dense visual descriptors on a custom simulated dataset using a distributional loss that captures cloth symmetries and generates correspondence confidence estimates. These estimates guide a reactive state machine that dynamically selects between folding strategies based on perceptual uncertainty. In parallel, we train a visuotactile grasp affordance network using high-resolution tactile feedback to supervise grasp success. The same tactile classifier is used during execution for real-time grasp validation. Together, these components enable a reactive, task-agnostic framework for in-air garment manipulation, including folding and hanging tasks. Moreover, our dense descriptors serve as a versatile intermediate representation for other planning modalities, such as extracting grasp targets from human video demonstrations, paving the way for more generalizable and scalable garment manipulation.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/sunil25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/sunil25a.html</guid>
        
        
      </item>
    
      <item>
        <title>HyperTASR: Hypernetwork-Driven Task-Aware Scene Representations for Robust Manipulation</title>
        <description>Effective policy learning for robotic manipulation requires scene representations that selectively capture task-relevant environmental features. Current approaches typically employ task-agnostic representation extraction, failing to emulate the dynamic perceptual adaptation observed in human cognition. We present HyperTASR, a hypernetwork-driven framework that modulates scene representations based on both task objectives and the execution phase. Our architecture dynamically generates representation transformation parameters conditioned on task specifications and progression state, enabling representations to evolve contextually throughout task execution. This approach maintains architectural compatibility with existing policy learning frameworks while fundamentally reconfiguring how visual features are processed. Unlike methods that simply concatenate or fuse task embeddings with task-agnostic representations, HyperTASR establishes computational separation between task-contextual and state-dependent processing paths, enhancing learning efficiency and representational quality. Comprehensive evaluations in both simulation and real-world environments demonstrate substantial performance improvements across different representation paradigms. Most notably, HyperTASR elevates success rates by over 27% when applied to GNFactor and achieves unprecedented single-view performance exceeding 80% success with 3D Diffuser Actor. Through ablation studies and attention visualization, we confirm that our approach selectively prioritizes task-relevant scene information, closely mirroring human adaptive perception during manipulation tasks.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/sun25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/sun25c.html</guid>
        
        
      </item>
    
      <item>
        <title>ARCH: Hierarchical Hybrid Learning for Long-Horizon Contact-Rich Robotic Assembly</title>
        <description>Generalizable long-horizon robotic assembly requires reasoning at multiple levels of abstraction. While end-to-end imitation learning (IL) is a promising approach, it typically requires large amounts of expert demonstration data and often struggles to achieve the high precision demanded by assembly tasks. Reinforcement learning (RL) approaches, on the other hand, have shown some success in high-precision assembly, but suffer from sample inefficiency, which limits their effectiveness in long-horizon tasks. To address these challenges, we propose a hierarchical modular approach, named Adaptive Robotic Compositional Hierarchy (ARCH), which enables long-horizon, high-precision robotic assembly in contact-rich settings. ARCH employs a hierarchical planning framework, including a low-level primitive library of parameterized skills and a high-level policy. The low-level primitive library includes essential skills for assembly tasks, such as grasping and inserting. These primitives consist of both RL and model-based controllers. The high-level policy, learned via IL from a handful of demonstrations, without the need for teleoperation, selects the appropriate primitive skills and instantiates them with input parameters. We extensively evaluate our approach in simulation and on a real robotic manipulation platform. We show that ARCH generalizes well to unseen objects and outperforms baseline methods in terms of success rate and data efficiency.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/sun25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/sun25b.html</guid>
        
        
      </item>
    
      <item>
        <title>FlashBack: Consistency Model-Accelerated Shared Autonomy</title>
        <description>Shared autonomy is an enabling technology that provides users with control authority over robots that would otherwise be difficult if not impossible to directly control. Yet, standard methods make assumptions that limit their adoption in practice—for example, prior knowledge of the user’s goals or the objective (i.e., reward) function that they wish to optimize, knowledge of the user’s policy, or query-level access to the user during training. Diffusion-based approaches to shared autonomy do not make such assumptions and instead only require access to demonstrations of desired behaviors, while allowing the user to maintain control authority. However, these advantages have come at the expense of high computational complexity, which has made real-time shared autonomy all but impossible. To overcome this limitation, we propose Consistency Shared Autonomy (CSA), a shared autonomy framework that employs a consistency model-based formulation of diffusion. Key to CSA is that it employs the distilled probability flow of ordinary differential equations (PF ODE) to generate high-fidelity samples in a single step. This results in inference speeds significantly faster than what is possible with previous diffusion-based approaches to shared autonomy, enabling real-time assistance in complex domains with only a single function evaluation. Further, by intervening on flawed actions at intermediate states of the PF ODE, CSA enables varying levels of assistance. We evaluate CSA on a variety of challenging simulated and real-world robot control problems, demonstrating significant improvements over state-of-the-art methods both in terms of task performance and computational efficiency.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/sun25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/sun25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Toward Real-World Cooperative and Competitive Soccer with Quadrupedal Robot Teams</title>
        <description>Achieving coordinated teamwork among legged robots requires both fine-grained locomotion control and long-horizon strategic decision-making. Robot soccer offers a compelling testbed for this challenge, combining dynamic, competitive, and multi-agent interactions. In this work, we present a hierarchical multi-agent reinforcement learning (MARL) framework that enables fully autonomous and decentralized quadruped robot soccer. First, a set of highly dynamic low-level skills is trained for legged locomotion and ball manipulation, such as walking, dribbling, and kicking. On top of these, a high-level strategic planning policy is trained with Multi-Agent Proximal Policy Optimization (MAPPO) via Fictitious Self-Play (FSP). This learning framework allows agents to adapt to diverse opponent strategies and gives rise to sophisticated team behaviors, including coordinated passing, interception, and dynamic role allocation. With an extensive ablation study, the proposed learning method shows significant advantages in the cooperative and competitive multi-agent soccer game. We deploy the learned policies to real quadruped robots relying solely on onboard proprioception and decentralized localization, with the resulting system supporting autonomous robot-robot and robot-human soccer matches on indoor and outdoor soccer courts.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/su25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/su25a.html</guid>
        
        
      </item>
    
      <item>
        <title>SDS – See it, Do it, Sorted: Quadruped Skill Synthesis from Single Video Demonstration</title>
        <description>Imagine a robot learning locomotion skills from any single video, without labels or reward engineering. We introduce SDS (&quot;See it. Do it. Sorted.&quot;), an automated pipeline for skill acquisition from unstructured video demonstrations. Using GPT-4o, SDS applies novel prompting techniques, in the form of spatio-temporal grid-based visual encoding (Gv) and structured input decomposition (SUS). These produce executable reward functions (RF) from raw input videos. The RFs are used to train PPO policies and are optimized through closed-loop evolution, using training footage and performance metrics as self-supervised signals. SDS allows quadrupeds (e.g., Unitree Go1) to learn four gaits—trot, bound, pace, and hop—achieving 100% gait matching fidelity, Dynamic Time Warping (DTW) distance in the order of 10^-6, and stable locomotion with zero failures, both in simulation and the real world. SDS generalizes to morphologically different quadrupeds (e.g., ANYmal) and outperforms prior work in data efficiency, training time, and engineering effort. Our code is open-source under: https://sdsreview.github.io/SDS_ANONYM/</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/stamatopoulou25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/stamatopoulou25a.html</guid>
        
        
      </item>
    
      <item>
        <title>RICL:  Adding In-Context Adaptability to Pre-Trained Vision-Language-Action Models</title>
        <description>Multi-task “vision-language-action” (VLA) models have recently demonstrated increasing promise as generalist foundation models for robotics, achieving non-trivial performance out of the box on new tasks in new environments. However, for such models to be truly useful, an end user must have easy means to teach them to improve. For language and vision models, the emergent ability to perform in-context learning (ICL) has proven to be a versatile and highly useful interface to easily teach new tasks with no parameter finetuning. Unfortunately, VLAs pre-trained with imitation learning objectives do not naturally acquire ICL abilities. In this paper, we demonstrate that, with the right finetuning recipe and a small robot demonstration dataset, it is possible to inject in-context adaptability post hoc into such a VLA. After retraining for in-context learning (RICL), our system permits an end user to provide a small number (10-20) of demonstrations for a new task. RICL then fetches the most relevant portions of those demonstrations into the VLA context to exploit ICL, performing the new task and boosting task performance. We apply RICL to inject ICL into the $\pi_0$-FAST VLA, and show that it permits large in-context improvements for a variety of new manipulation tasks with only 20 demonstrations per task, without any parameter updates. When parameter updates on the target task demonstrations are possible, RICL finetuning further boosts performance. We release code and model weights for RICL-$\pi_0$-FAST alongside the paper to enable, for the first time, a simple in-context learning interface for new manipulation tasks.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/sridhar25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/sridhar25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Generalist Robot Manipulation beyond Action Labeled Data</title>
        <description>Recent advances in generalist robot manipulation leverage pre-trained Vision–Language Models (VLMs) and large-scale robot demonstrations to tackle diverse tasks in a zero-shot manner. A key challenge remains: scaling high-quality, action-labeled robot demonstration data, which existing methods rely on for robustness and generalization. To address this, we propose a method that benefits from videos without action labels—featuring humans and/or robots in action—enhancing open-vocabulary performance and enabling data-efficient learning of new tasks. Our method extracts dense, dynamic 3D point clouds at the hand or gripper location and uses a proposed 3D dynamics predictor for self-supervision. This predictor is then tuned to an action predictor using a smaller labeled dataset for action alignment. We show that our method not only learns from unlabeled human and robot demonstrations—improving downstream generalist robot policies—but also enables robots to learn new tasks without action labels (i.e., out-of-action generalization) in both real-world and simulated settings.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/spiridonov25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/spiridonov25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Fast Flow-based Visuomotor Policies via Conditional Optimal Transport Couplings</title>
        <description>Diffusion and flow matching policies have recently demonstrated remarkable performance in robotic applications by accurately capturing multimodal robot trajectory distributions. However, their computationally expensive inference, due to the numerical integration of an ODE or SDE, limits their applicability as real-time controllers for robots. We introduce a methodology that utilizes conditional Optimal Transport couplings between noise and samples to enforce straight solutions in the flow ODE for robot action generation tasks. We show that naively coupling noise and samples fails in conditional tasks and propose incorporating condition variables into the coupling process to improve few-step performance. The proposed few-step policy achieves a 4% higher success rate with a 10$\times$ speed-up compared to Diffusion Policy on a diverse set of simulation tasks. Moreover, it produces high-quality and diverse action trajectories within 1-2 steps on a set of real-world robot tasks. Our method also retains the same training complexity as Diffusion Policy and vanilla Flow Matching, in contrast to distillation-based approaches.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/sochopoulos25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/sochopoulos25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Sampling-based System Identification with Active Exploration for Legged Sim2Real Learning</title>
        <description>Sim-to-real discrepancies hinder learning-based policies from achieving high-precision tasks in the real world. While Domain Randomization (DR) is commonly used to bridge this gap, it often relies on heuristics and can lead to overly conservative policies with degraded performance when not properly tuned. System Identification (Sys-ID) offers a targeted approach, but standard techniques rely on differentiable dynamics and/or direct torque measurement, assumptions that rarely hold for contact-rich legged systems. To this end, we present SPI-Active (Sampling-based Parameter Identification with Active Exploration), a two-stage framework that estimates physical parameters of legged robots to minimize the sim-to-real gap. SPI-Active robustly identifies key physical parameters through massive parallel sampling, minimizing state prediction errors between simulated and real-world trajectories. To further improve the informativeness of collected data, we introduce an active exploration strategy that maximizes the Fisher Information of the collected real-world trajectories by optimizing the input commands of an exploration policy. This targeted exploration leads to accurate identification and better generalization across diverse tasks. Experimental results demonstrate that SPI-Active enables precise sim-to-real transfer of learned policies to the real world, outperforming baselines by 42-63% in various locomotion tasks. Videos at the anonymous website https://anonymous-spi-active.github.io/</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/sobanbabu25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/sobanbabu25a.html</guid>
        
        
      </item>
    
      <item>
        <title>SIREN: Semantic, Initialization-Free Registration of Multi-Robot Gaussian Splatting Maps</title>
        <description>We present SIREN for registration of multi-robot Gaussian Splatting (GSplat) maps, with zero access to camera poses, images, and inter-map transforms for initialization or fusion of local submaps. To realize these capabilities, SIREN harnesses the versatility and robustness of semantics in three critical ways to derive a rigorous registration pipeline for multi-robot GSplat maps. First, SIREN utilizes semantics to identify feature-rich regions of the local maps where the registration problem is better posed, eliminating the need for any initialization, which is generally required in prior work. Second, SIREN identifies candidate correspondences between Gaussians in the local maps using robust semantic features, constituting the foundation for robust geometric optimization, coarsely aligning 3D Gaussian primitives extracted from the local maps. Third, this key step enables subsequent photometric refinement of the transformation between the submaps, where SIREN leverages novel-view synthesis in GSplat maps along with a semantics-based image filter to compute a high-accuracy non-rigid transformation for the generation of a high-fidelity fused map. We demonstrate the superior performance of SIREN compared to competing baselines across a range of real-world datasets, and in particular, across the most widely used robot hardware platforms, including a manipulator, drone, and quadruped. In fact, in the most challenging scenes, where accurate feature matching is extremely difficult, SIREN achieves about 90x smaller rotation errors, 300x smaller translation errors, and 44x smaller scale errors, compared to competing methods. We will release the code and provide a link to the project page after the review process.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/shorinwa25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/shorinwa25a.html</guid>
        
        
      </item>
    
      <item>
        <title>ToddlerBot: Open-Source ML-Compatible Humanoid Platform for Loco-Manipulation</title>
        <description>Learning-based robotics research driven by data demands a new approach to robot hardware design—one that serves as both a platform for policy execution and a tool for embodied data collection. We introduce ToddlerBot, a low-cost, open-source humanoid robot platform designed for robotics and AI research. ToddlerBot enables seamless acquisition of high-quality simulation and real-world data. The plug-and-play zero-point calibration and transferable motor system identification ensure a high-fidelity digital twin and zero-shot sim-to-real policy transfer. A user-friendly teleoperation interface streamlines real-world data collection from human demonstrations. With its data collection ability and anthropomorphic design, ToddlerBot is ideal for whole-body loco-manipulation research. Additionally, ToddlerBot’s compact size (0.56 m, 3.4 kg) ensures safe operation in real-world environments. Reproducibility is achieved with an entirely 3D-printed, open-source design and off-the-shelf components, keeping the total cost under 6,000 USD. This allows assembly and maintenance with basic technical expertise, as validated by successful independent replications of the system. We demonstrate ToddlerBot’s capabilities through arm span, payload, endurance tests, loco-manipulation tasks, and a collaborative long-horizon scenario where two robots tidy a toy session together. By advancing ML-compatibility, capability, and reproducibility, ToddlerBot provides a robust and scalable platform for policy learning and execution in robotics research.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/shi25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/shi25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Meta-Optimization and Program Search using Language Models for Task and Motion Planning</title>
        <description>Intelligent interaction with the real world requires robotic agents to jointly reason over high-level plans and low-level controls. This requirement is formalized in the task and motion planning (TAMP) problem, in which symbolic planning and continuous trajectory generation must be solved in a coordinated manner. Recently, foundation model-based approaches to TAMP have presented impressive results, including fast planning times and the execution of natural language instructions. Yet, the optimal interface between high-level plan and low-level motion generation remains to be found: prior approaches are limited by either too much abstraction (e.g., chaining simplified skill primitives) or a lack thereof (e.g., direct joint angle prediction). Our method introduces a novel technique employing a form of meta-optimization to address these shortcomings by: (i) using program search over trajectory optimization problems as an interface between foundation model and robot controllers, and (ii) leveraging a zero-order method to optimize numerical values in the foundation model output. Results on challenging object manipulation and drawing tasks confirm that our proposed method improves over prior TAMP approaches.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/shcherba25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/shcherba25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Self-supervised perception for tactile skin covered dexterous hands</title>
        <description>We present PercepSkin, a pre-trained encoder for magnetic skin sensors distributed across the fingertips, phalanges, and palm of a dexterous robot hand. Magnetic tactile skins offer a flexible form factor for hand-wide coverage with fast response times, in contrast to vision-based tactile sensors that are restricted to the fingertips and limited by bandwidth. Full-hand tactile perception is crucial for robot dexterity. However, a lack of general-purpose models, together with the challenges of interpreting magnetic flux and calibrating the sensors, has limited the adoption of these sensors. PercepSkin, given a history of kinematic and tactile sensing across a hand, outputs a latent tactile embedding that can be used in any downstream task. The encoder is self-supervised via self-distillation on a variety of unlabeled hand-object interactions using an Allegro hand sensorized with Xela uSkin. In experiments across several benchmark tasks, from state estimation to policy learning, we find that pretrained PercepSkin representations are sample-efficient for learning downstream tasks and improve task performance by over 41% compared to prior work and over 56% compared to end-to-end learning.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/sharma25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/sharma25a.html</guid>
        
        
      </item>
    
      <item>
        <title>From Real World to Logic and Back: Learning Generalizable Relational Concepts For Long Horizon Robot Planning</title>
        <description>Humans efficiently generalize from limited demonstrations, but robots still struggle to transfer learned knowledge to complex, unseen tasks with longer horizons and increased complexity.  We propose the first known method enabling robots to autonomously invent relational concepts directly from small sets of unannotated, unsegmented demonstrations. The learned symbolic concepts are grounded into logic-based world models, facilitating efficient zero-shot generalization to significantly more complex tasks. Empirical results demonstrate that our approach achieves performance comparable to hand-crafted models, successfully scaling execution horizons and handling up to 18 times more objects than seen in training, providing the first autonomous framework for learning transferable symbolic abstractions from raw robot trajectories.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/shah25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/shah25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Generating Robot Constitutions &amp; Benchmarks for Semantic Safety</title>
        <description>Large vision and language models are being increasingly deployed on real robots, leading to an immediate need for ensuring robot safety under AI control. In this paper, we develop the ASIMOV Benchmark — a collection of large-scale semantic safety datasets grounded in real-world visual scenes and human injury reports from hospitals (500k situations, 3M instructions). We propose a scalable recipe for data generation leveraging text and image generation techniques to synthesize safety-relevant scenarios. As a second contribution, we develop a framework to automatically generate robot constitutions from real-world data to steer a robot’s behavior using Constitutional AI mechanisms. We report a top alignment rate of 84.3% on the ASIMOV Benchmark using generated constitutions, outperforming no-constitution baselines and human-written constitutions. We argue that human interpretability and modifiability of constitutions inferred from data make them an ideal medium for behavior governance of AI-controlled robots.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/sermanet25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/sermanet25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Uncertainty-aware Latent Safety Filters for Avoiding Out-of-Distribution Failures</title>
        <description>Recent advances in generative world models have enabled classical safe control methods, such as Hamilton-Jacobi (HJ) reachability, to generalize to complex robotic systems operating directly from high-dimensional sensor observations. However, obtaining comprehensive coverage of all safety-critical scenarios during world model training is extremely challenging. As a result, latent safety filters built on top of these models may miss novel hazards and even fail to prevent known ones, overconfidently misclassifying risky out-of-distribution (OOD) situations as safe. To address this, we introduce an uncertainty-aware latent safety filter that proactively steers robots away from both known and unseen failures. Our key idea is to use the world model’s epistemic uncertainty as a proxy for identifying unseen potential hazards. We propose a principled method to detect OOD world model predictions by calibrating an uncertainty threshold via conformal prediction. By performing reachability analysis in an augmented state space—spanning both the latent representation and the epistemic uncertainty—we synthesize a latent safety filter that can reliably safeguard arbitrary policies from both known and unseen safety hazards. In simulation and hardware experiments on vision-based control tasks with a Franka manipulator, we show that our uncertainty-aware safety filter preemptively detects potential unsafe scenarios and reliably proposes safe, in-distribution actions.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/seo25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/seo25a.html</guid>
        
        
      </item>
    
      <item>
        <title>HALO : Human Preference Aligned Offline Reward Learning for Robot Navigation</title>
        <description>In this paper, we introduce HALO, a novel Offline Reward Learning algorithm that quantifies human intuition in navigation into a vision-based reward function for robot navigation. HALO learns a reward model from offline data, leveraging expert trajectories collected from mobile robots. During training, actions are randomly sampled from the action space around the expert action and ranked using a Boltzmann probability distribution that combines their distance to the expert action with human preference scores derived from intuitive navigation queries based on the corresponding egocentric camera feed. These scores establish preference rankings, enabling the training of a novel reward model based on Plackett-Luce loss, which allows for preference-driven navigation. To demonstrate the effectiveness of HALO, we deploy its reward model in two downstream applications: (i) an offline learned policy trained directly on the HALO-derived rewards, and (ii) a model-predictive-control (MPC) based planner that incorporates the HALO reward as an additional cost term. This showcases the versatility of HALO across both learning-based and classical navigation frameworks. Our real-world deployments on a Clearpath Husky across multiple scenarios demonstrate that policies trained with HALO achieve improved performance over state-of-the-art methods in terms of success rate and normalized trajectory length while maintaining lower Fréchet distance with the human expert trajectories.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/seneviratne25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/seneviratne25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Learning Deployable Locomotion Control via Differentiable Simulation</title>
        <description>Differentiable simulators promise to improve sample efficiency in robot learning by providing analytic gradients of the system dynamics. Yet, their application to contact-rich tasks like locomotion is complicated by the inherently non-smooth nature of contact, impeding effective gradient-based optimization. Existing works thus often rely on soft contact models that provide smooth gradients but lack physical accuracy, constraining results to simulation. To address this limitation, we propose a differentiable contact model designed to provide informative gradients while maintaining high physical fidelity. We demonstrate the efficacy of our approach by training a quadrupedal locomotion policy within our differentiable simulator leveraging analytic gradients and successfully transferring the learned policy zero-shot to the real world. To the best of our knowledge, this represents the first successful sim-to-real transfer of a legged locomotion policy learned entirely within a differentiable simulator, establishing the feasibility of using differentiable simulation for real-world locomotion control.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/schwarke25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/schwarke25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Long Range Navigator (LRN): Extending robot planning horizons beyond metric maps</title>
        <description>A robot navigating an outdoor environment with no prior knowledge of the space must rely on its local sensing, which is in the form of a local metric map or local policy with some fixed horizon. A limited planning horizon can often result in myopic decisions leading the robot off course or worse, into very difficult terrain. In this work, we make a key observation that long range navigation only necessitates identifying good frontier directions for planning instead of full map knowledge. To address this, we introduce Long Range Navigator (LRN), which learns to predict ‘affordable’ frontier directions from high-dimensional camera images. LRN is trained entirely on unlabeled egocentric videos, making it scalable and adaptable. In off-road tests on Spot and a large vehicle, LRN reduces human interventions and improves decision speed when integrated into existing navigation stacks.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/schmittle25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/schmittle25a.html</guid>
        
        
      </item>
    
      <item>
        <title>GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering</title>
        <description>In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment in order to answer a situated question with confidence. This remains a challenging problem in robotics, due to the difficulties in obtaining useful semantic representations, updating these representations online, and leveraging prior world knowledge for efficient exploration and planning. Aiming to address these limitations, we propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task relevant images as multi-modal memory for grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantic-guided exploration. We evaluate GraphEQA in simulation on two benchmark datasets, HM-EQA and OpenEQA, and demonstrate that it outperforms key baselines by completing EQA tasks with higher success rates and fewer planning steps, and further demonstrate GraphEQA in two separate real world environments. Videos and code are available at https://grapheqa.github.io.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/saxena25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/saxena25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Planning from Point Clouds over Continuous Actions for Multi-object Rearrangement</title>
        <description>Multi-object rearrangement is a challenging task that requires robots to reason about a physical 3D scene and the effects of a sequence of actions. While traditional task planning methods are shown to be effective for long-horizon manipulation, they require discretizing the continuous state and action space into symbolic descriptions of objects, object relationships, and actions. Our proposed method is instead able to take in a partially-observed point cloud observation of an initial scene and plan to a goal-satisfying configuration, without needing to discretize the set of actions or object relationships. To enable this, we formulate the planning problem as an A* search over the space of possible point cloud rearrangements. We sample point cloud transformations from a learned, domain-specific prior and then search for a sequence of such point cloud transformations that leads from the initial state to a goal. We evaluate our method in terms of task planning success and task execution success on a real-world, multi-step table bussing environment and a simulation block stacking environment. We experimentally demonstrate that our method produces successful plans and outperforms a policy-learning approach; we also perform ablations that show the importance of search in our approach.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/saha25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/saha25a.html</guid>
        
        
      </item>
    
      <item>
        <title>KDPE: A Kernel Density Estimation Strategy for Diffusion Policy Trajectory Selection</title>
        <description>Learning robot policies that capture multimodality in the training data has been a long-standing open challenge for behavior cloning. Recent approaches tackle the problem by modeling the conditional action distribution with generative models. One of these approaches is Diffusion Policy, which relies on a diffusion model to denoise random points into robot action trajectories. While achieving state-of-the-art performance, it has two main drawbacks that may lead the robot out of the data distribution during policy execution. First, the stochasticity of the denoising process can strongly affect the quality of the generated action trajectories. Second, being a supervised learning approach, it can learn data outliers from the dataset used for training. Recent work focuses on mitigating these limitations by combining Diffusion Policy either with large-scale training or with classical behavior cloning algorithms. Instead, we propose KDPE, a Kernel Density Estimation-based strategy that filters out potentially harmful trajectories output by Diffusion Policy while keeping a low test-time computational overhead. For Kernel Density Estimation, we propose a manifold-aware kernel to model a probability density function for actions composed of end-effector Cartesian position, orientation, and gripper state. KDPE overall achieves better performance than Diffusion Policy on simulated single-arm RoboMimic and MimicGen tasks, and on three real robot experiments: PickPlush, a tabletop grasping task, CubeSort, a multimodal pick and place task, and CoffeeMaking, a task that requires long-horizon capabilities and precise execution. The code will be released upon acceptance and additional material is provided on our anonymized project page: https://kdpe-robotics.github.io.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/rosasco25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/rosasco25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Cross-Sensor Touch Generation</title>
        <description>Today’s visuo-tactile sensors come in many shapes and sizes, making it challenging to develop general-purpose tactile representations. This is because most models are tied to a specific sensor design. To address this challenge, we propose two approaches to cross-sensor image generation. The first is an end-to-end method that leverages paired data (Touch2Touch). The second method builds an intermediate depth representation and does not require paired data (T2D2: Touch-to-Depth-to-Touch). Both methods enable the use of sensor-specific models across multiple sensors via the cross-sensor touch generation process. Together, these models offer flexible solutions for sensor translation, depending on data availability and application needs. We demonstrate their effectiveness on downstream tasks such as cup stacking and tool insertion, where models originally designed for one sensor are successfully transferred to another using in-hand pose estimation.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/rodriguez25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/rodriguez25a.html</guid>
        
        
      </item>
    
      <item>
        <title>ZipMPC: Compressed Context-Dependent MPC Cost via Imitation Learning</title>
        <description>The computational burden of model predictive control (MPC) limits its application on real-time systems, such as robots, and often requires the use of short prediction horizons. This not only affects the control performance, but also increases the difficulty of designing MPC cost functions that reflect the desired long-term objective. This paper proposes ZipMPC, a method that imitates a long-horizon MPC behaviour by learning a compressed and context-dependent cost function for a short-horizon MPC. It improves performance over alternative methods, such as approximate explicit MPC and automatic cost parameter tuning, in particular in terms of i) optimizing the long-term objective; ii) maintaining computational costs comparable to a short-horizon MPC; iii) ensuring constraint satisfaction; and iv) generalizing control behaviour to environments not observed during training. For this purpose, ZipMPC leverages the concept of differentiable MPC with neural networks to propagate gradients of the imitation loss through the MPC optimization. We validate our proposed method in simulation and real-world experiments on autonomous racing. ZipMPC consistently completes laps faster than selected baselines, achieving lap times close to the long-horizon MPC baseline. In challenging scenarios where the short-horizon MPC baseline fails to complete a lap, ZipMPC is able to do so. In particular, these performance gains are also observed on tracks unseen during training.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/rickenbach25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/rickenbach25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Unsupervised Skill Discovery as Exploration for Learning Agile Locomotion</title>
        <description>Exploration is crucial for legged robots to learn agile locomotion behaviors capable of overcoming diverse obstacles.  For example, a robot may need to try different contact patterns and momentum profiles to successfully jump over an obstacle—but encouraging such diverse exploration is inherently challenging. As a result, training these behaviors often relies on additional techniques such as extensive reward engineering, expert demonstrations, or curriculum learning. However, these approaches limit generalizability, especially when prior knowledge or demonstration data is unavailable. In this work, we propose using unsupervised skill discovery as a skill-level exploration strategy to significantly reduce human engineering effort. Our learning framework enables the agent to autonomously discover diverse skills to overcome complex obstacles. To dynamically regulate the degree of exploration throughout training, we introduce a bi-level optimization process that learns a parameter to balance two distinct reward signals. We demonstrate that our method enables quadrupedal robots to acquire highly agile behaviors—including crawling, climbing, leaping, and complex maneuvers such as jumping off vertical walls. Finally, we successfully deploy the learned policy on real hardware, validating its transferability to the real world.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/rho25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/rho25a.html</guid>
        
        
      </item>
    
      <item>
        <title>FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Flow Models</title>
        <description>Developing efficient Vision-Language-Action (VLA) policies is crucial for practical robotics deployment, yet current approaches face prohibitive computational costs and resource requirements. Existing diffusion-based VLA policies require multi-billion-parameter models and massive datasets to achieve strong performance. We tackle this efficiency challenge with two contributions: intermediate-modality fusion, which reallocates capacity to the diffusion head by pruning up to 50% of LLM layers, and action-specific Global-AdaLN conditioning, which cuts parameters by 20% through modular adaptation. We integrate these advances into a novel 950M-parameter VLA called FLOWER. Pretrained in just 200 H100 GPU hours, FLOWER delivers a 25.9% improvement over state-of-the-art baselines across 190 tasks spanning ten simulation and real-world benchmarks and demonstrates robustness across diverse robotic embodiments. All code, pretrained weights, and training recipes are publicly released to democratize efficient VLA development.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/reuss25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/reuss25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Distilling On-device Language Models for Robot Planning with Minimal Human Intervention</title>
        <description>Large language models (LLMs) provide robots with powerful contextual reasoning abilities and a natural human interface. Yet, current LLM-enabled robots typically depend on cloud-hosted models, limiting their usability in environments with unreliable communication infrastructure, such as outdoor or industrial settings. We present PRISM, a framework for distilling small language model (SLM)-enabled robot planners that run on-device with minimal human supervision. Starting from an existing LLM-enabled planner, PRISM automatically synthesizes diverse tasks and environments, elicits plans from the LLM, and uses this synthetic dataset to distill a compact SLM as a drop-in replacement for the source model. We apply PRISM to three LLM-enabled planners for mapping and exploration, manipulation, and household assistance, and we demonstrate that PRISM improves the performance of Llama-3.2-3B from 10-20% of GPT-4o’s performance to over 93%, using only synthetic data. We further demonstrate that the distilled planners generalize across heterogeneous robotic platforms (ground and aerial) and diverse environments (indoor and outdoor). We release all software, trained models, and datasets to promote reproducibility and follow-up work.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/ravichandran25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/ravichandran25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Lucid-XR: An Extended-Reality Data Engine for Robotic Manipulation</title>
        <description>We introduce Lucid-XR, a generative data engine for creating diverse and realistic-looking data to train real-world robot systems. At the core of Lucid-XR is vuer, a web-based physics simulation environment that runs directly on the XR headset, enabling internet-scale access to immersive, latency-free virtual interactions without requiring specialized equipment. The complete system integrates on-device physics simulation with on-device human-to-robot pose retargeting, which are further amplified by a physics-guided video generation pipeline that can be commanded with natural-language specifications. We demonstrate zero-shot sim-to-real transfer of robot visual policies, trained entirely on Lucid-XR’s synthetic data, across bimanual and dexterous manipulation tasks that involve flexible materials, adhesive interaction between particles, and rigid body contact.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/ravan25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/ravan25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Learning from 10 Demos: Generalisable and Sample-Efficient Policy Learning with Oriented Affordance Frames</title>
        <description>Imitation learning has unlocked the potential for robots to exhibit highly dexterous behaviours. However, it still struggles with long-horizon, multi-object tasks due to poor sample efficiency and limited generalisation. Existing methods require a substantial number of demonstrations to cover possible task variations, making them costly and often impractical for real-world deployment. We address this challenge by introducing \emph{oriented affordance frames}, a structured representation for state and action spaces that improves spatial and intra-category generalisation and enables policies to be learned efficiently from only 10 demonstrations. More importantly, we show how this abstraction allows for compositional generalisation of independently trained sub-policies to solve long-horizon, multi-object tasks. To seamlessly transition between sub-policies, we introduce the notion of self-progress prediction, which we directly derive from the duration of the training demonstrations. We validate our method across three real-world tasks, each requiring multi-step, multi-object interactions. Despite the small dataset, our policies generalise robustly to unseen object appearances, geometries, and spatial arrangements, achieving high success rates without reliance on exhaustive training data. Video demonstrations can be found on our anonymised project page: https://affordance-policy.github.io/.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/rana25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/rana25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Articulate AnyMesh: Open-vocabulary 3D Articulated Objects Modeling</title>
        <description>3D articulated object modeling has long been a challenging problem, since it requires capturing both accurate surface geometries and semantically meaningful, spatially precise structures, parts, and joints. Existing methods heavily depend on training data from a limited set of handcrafted articulated object categories (e.g., cabinets and drawers), which restricts their ability to model a wide range of articulated objects in an open-vocabulary context. To address these limitations, we propose Articulate AnyMesh, an automated framework that is able to convert any rigid 3D mesh into its articulated counterpart in an open-vocabulary manner. Given a 3D mesh, our framework utilizes advanced Vision-Language Models and visual prompting techniques to extract semantic information, allowing for both the segmentation of object parts and the construction of functional joints. Our experiments show that Articulate AnyMesh can generate large-scale, high-quality 3D articulated objects, including tools, toys, mechanical devices, and vehicles, significantly expanding the coverage of existing 3D articulated object datasets. Additionally, we show that these generated assets can facilitate the acquisition of new articulated object manipulation skills in simulation, which can then be transferred to a real robotic system.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/qiu25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/qiu25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Humanoid Policy ~ Human Policy</title>
        <description>Training manipulation policies for humanoid robots with diverse data enhances their robustness and generalization across tasks and platforms. However, learning solely from robot demonstrations is labor-intensive, requiring expensive tele-operated data collection, which is difficult to scale. This paper investigates a more scalable data source, egocentric human demonstrations, to serve as cross-embodiment training data for robot learning. We mitigate the embodiment gap between humanoids and humans from both the data and modeling perspectives. We collect an egocentric task-oriented dataset that is directly aligned with humanoid manipulation demonstrations. We then train a human-humanoid behavior policy, which we term Human Action Transformer (HAT). The state-action space of HAT is unified for both humans and humanoid robots and can be differentiably retargeted to robot actions. Co-trained with smaller-scale robot data, HAT directly models humanoid robots and humans as different embodiments without additional supervision. We show that human data improves both the generalization and robustness of HAT with significantly better data collection efficiency.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/qiu25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/qiu25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Belief-Conditioned One-Step Diffusion: Real-Time Trajectory Planning with Just-Enough Sensing</title>
        <description>Robots equipped with rich sensor suites can localize reliably in partially-observable environments, but powering every sensor continuously is wasteful and often infeasible. Belief-space planners address this by propagating pose-belief covariance through analytic models and switching sensors heuristically, a brittle, runtime-expensive approach. Data-driven approaches, including diffusion models, learn multi-modal trajectories from demonstrations, but presuppose an accurate, always-on state estimate. We address the largely open problem: for a given task in a mapped environment, which *minimal sensor subset* must be active at each location to maintain state uncertainty *just low enough* to complete the task? Our key insight is that when a diffusion planner is explicitly conditioned on a pose-belief raster and a sensor mask, the spread of its denoising trajectories yields a calibrated, differentiable proxy for the expected localization error. Building on this insight, we present Belief-Conditioned One-Step Diffusion (B-COD), the first planner that, in a 10 ms forward pass, returns a short-horizon trajectory, per-waypoint aleatoric variances, and a proxy for localization error, eliminating external covariance rollouts. We show that this single proxy suffices for a soft actor-critic to choose sensors online, optimising energy while bounding pose-covariance growth. We deploy B-COD in real-time marine trials on an unmanned surface vehicle and show that it reduces sensing energy consumption while matching the goal-reach performance of an always-on baseline.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/puthumanaillam25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/puthumanaillam25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Steerable Scene Generation with Post Training and Inference-Time Search</title>
        <description>Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/pfaff25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/pfaff25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Non-conflicting Energy Minimization in Reinforcement Learning based Robot Control</title>
        <description>Efficient robot locomotion often requires balancing task performance with energy expenditure. A common approach in reinforcement learning (RL) is to penalize energy use directly in the reward function. This requires carefully weighting the reward terms to avoid undesirable trade-offs where energy minimization harms task success or vice versa. In this work, we propose a hyperparameter-free gradient optimization method to minimize energy without conflicting with task performance. Inspired by recent works in multitask learning, our method applies policy gradient projection between task and energy objectives to promote non-conflicting updates. We evaluate this technique on standard locomotion benchmarks of DM-Control and HumanoidBench and demonstrate a reduction of 64% in energy usage while maintaining comparable task performance. Further, we conduct experiments on a Unitree GO2 quadruped showcasing Sim2Real transfer of energy-efficient policies. Our method is easy to implement in standard RL pipelines with minimal code changes, and offers a principled alternative to reward shaping for energy-efficient control policies.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/peri25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/peri25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Dynamics-Compliant Trajectory Diffusion for Super-Nominal Payload Manipulation</title>
        <description>Nominal payload ratings for articulated robots are typically derived from worst-case configurations, resulting in uniform payload constraints across the entire workspace. This conservative approach severely underutilizes the robot’s inherent capabilities: our analysis demonstrates that manipulators can safely handle payloads well above nominal capacity across broad regions of their workspace while staying within joint angle, velocity, acceleration, and torque limits. To address this gap between assumed and actual capability, we propose a novel trajectory generation approach using denoising diffusion models that explicitly incorporates payload constraints into the planning process. Unlike traditional sampling-based methods that rely on inefficient trial-and-error, optimization-based methods that are prohibitively slow, or kinodynamic planners that struggle with problem dimensionality, our approach generates dynamically feasible joint-space trajectories in constant time that can be directly executed on physical hardware without post-processing. Experimental validation on a 7 DoF Franka Emika Panda robot demonstrates that up to 67.6% of the workspace remains accessible even with payloads exceeding 3 times the nominal capacity. This expanded operational envelope highlights the importance of a more nuanced consideration of payload dynamics in motion planning algorithms.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/pasricha25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/pasricha25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Diffusion-Guided Multi-Arm Motion Planning</title>
        <description>Multi-arm motion planning is fundamental for enabling arms to complete collaborative tasks in shared spaces, but current methods struggle with scalability due to exponential state-space growth and reliance on large training datasets for learned models. Inspired by Multi-Agent Path Finding (MAPF), which decomposes planning into single-agent problems coupled with collision resolution, we propose a novel diffusion-guided multi-arm planner (DG-MAP) that enhances the scalability of learning-based models while reducing their reliance on massive multi-arm datasets. Recognizing that collisions are primarily pairwise, we train two conditional diffusion models: one to generate feasible single-arm trajectories, and a second to model the dual-arm dynamics required for effective pairwise collision resolution. By integrating these specialized generative models within a MAPF-inspired structured decomposition, our planner efficiently scales to a larger number of arms. Evaluations against alternative learning-based methods across various team sizes demonstrate our method’s effectiveness and practical applicability. Code and data will be made publicly available. View video demonstrations in our supplementary material.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/parimi25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/parimi25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Cost-aware Discovery of Contextual Failures using Bayesian Active Learning</title>
        <description>Ensuring the robustness of robotic systems is crucial for their deployment in safety-critical domains. Failure discovery, or falsification, is a widely used approach for evaluating robustness, with recent advancements focusing on improving sample efficiency and generalization through probabilistic sampling techniques and learning-theoretic approaches. However, existing methods typically rely on explicitly defined analytical cost functions to characterize failures, often overlooking the underlying causes and diversity of discovered failure scenarios. In this work, we propose a novel failure discovery framework that integrates contextual reasoning in the falsification process, specifically tailored for high evaluation-cost applications. Our method incorporates expert-in-the-loop feedback to construct a probabilistic surrogate model of failures using Bayesian inference. This model is iteratively refined and leveraged to guide an active learning strategy that prioritizes the discovery of diverse failure cases. We empirically validate our approach across a range of tasks for high-cost contextual falsification in robotic manipulation and autonomous driving.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/parashar25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/parashar25a.html</guid>
        
        
      </item>
    
      <item>
        <title>One Demo is Worth a Thousand Trajectories: Action-View Augmentation for Visuomotor Policies</title>
        <description>Visuomotor policies for manipulation have demonstrated remarkable potential in modeling complex robotic behaviors, yet minor alterations in the robot’s initial configuration and unseen obstacles easily lead to out-of-distribution observations. Without extensive data collection effort, these result in catastrophic execution failures. In this work, we introduce an effective data augmentation framework that generates visually realistic fisheye image sequences and corresponding physically feasible action trajectories from real-world eye-in-hand demonstrations, captured with a portable parallel gripper with a single fisheye camera. We introduce a novel Gaussian Splatting formulation, adapted to wide FoV fisheye cameras, to reconstruct and edit the 3D scene with unseen objects. We utilize trajectory optimization to generate smooth, collision-free, view-rendering-friendly action trajectories and render visual observations from corresponding novel views. Comprehensive experiments in simulation and the real world show that our augmentation framework improves the success rate for various manipulation tasks in both the same scene and the augmented scene with obstacles requiring collision avoidance.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/pan25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/pan25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Imitation Learning Based on Disentangled Representation Learning of Behavioral Characteristics</title>
        <description>In the field of robot learning, it is becoming possible to coordinate robot actions through language instructions. On the other hand, adjusting actions based on human instructions remains difficult, because such instructions are often qualitative and there is not always a one-to-one correspondence between behaviors and instructions. In this paper, we propose a motion generation model that can adjust actions in response to qualitative human instructions during task execution. The core of the proposed method is a learning architecture that maps qualitative human instructions to actions. Specifically, the demonstration is divided into short action sequences, and labels reflecting human qualitative senses are assigned to these sequences to realize learning that links human qualitative instructions and robot actions. In evaluation experiments, we verified the effectiveness of the method in two tasks: a pick-and-place task and a wiping task. Experimental results showed that the proposed method is able to generate motions in response to human qualitative instructions during task execution, whereas the conventional method generates trajectories all at once, making it impossible to adjust motions during task execution.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/oishi25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/oishi25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Latent Adaptive Planner for Dynamic Manipulation</title>
        <description>This paper presents Latent Adaptive Planner (LAP), a novel approach for dynamic nonprehensile manipulation tasks that formulates planning as latent space inference, effectively learned from human demonstration videos.  Our method addresses key challenges in visuomotor policy learning through a principled variational replanning framework that maintains temporal consistency while efficiently adapting to environmental changes. LAP employs Bayesian updating in latent space to incrementally refine plans as new observations become available, striking an optimal balance between computational efficiency and real-time adaptability.  We bridge the embodiment gap between humans and robots through model-based proportional mapping that regenerates accurate kinematic-dynamic joint states and object positions from human demonstrations.  Experimental evaluations across multiple complex manipulation benchmarks demonstrate that LAP achieves state-of-the-art performance, outperforming existing approaches in success rate, trajectory smoothness, and energy efficiency, particularly in dynamic adaptation scenarios. Our approach enables robots to perform complex interactions with human-like adaptability while providing an expandable framework applicable to diverse robotic platforms using the same human demonstration videos.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/noh25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/noh25a.html</guid>
        
        
      </item>
    
      <item>
        <title>SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation</title>
        <description>Robot navigation in dynamic, human-centered environments requires socially-compliant decisions grounded in robust scene understanding, including spatiotemporal awareness and the ability to interpret human intentions. Recent Vision-Language Models (VLMs) exhibit promising capabilities, such as object recognition, common-sense reasoning, and contextual understanding, that align with the nuanced requirements of social robot navigation. However, it remains unclear whether VLMs can reliably perform the complex spatiotemporal reasoning and intent inference needed for safe and socially compliant robot navigation. In this paper, we introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a Visual Question Answering (VQA) dataset and benchmark designed to evaluate VLMs for scene understanding in real-world social robot navigation scenarios. SocialNav-SUB provides a unified framework for evaluating VLMs against human and rule-based baselines across VQA tasks requiring spatial, spatiotemporal, and social reasoning in social robot navigation. Through experiments with state-of-the-art VLMs, we find that while the best-performing VLM achieves an encouraging probability of agreeing with human answers, it still underperforms a simpler rule-based approach and human consensus, indicating critical gaps in social scene understanding of current VLMs. Our benchmark sets the stage for further research on foundation models for social robot navigation, offering a framework to explore how VLMs can be tailored to meet real-world social robot navigation needs. We will open-source the code and release the benchmark.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/munje25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/munje25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Uncertainty-Aware Scene Understanding via Efficient Sampling-Free Confidence Estimation</title>
        <description>Reliable scene understanding requires not only accurate predictions but also well-calibrated confidence estimates, especially in safety-critical domains like autonomous driving. In this context, semantic segmentation of LiDAR points supports real-time 3D scene understanding, where reliable uncertainty estimates help identify potentially erroneous predictions. While most existing calibration approaches focus on modeling epistemic uncertainty, they often overlook aleatoric uncertainty arising from measurement inaccuracies, which is especially prevalent in LiDAR data and essential for real-world deployment. In this work, we introduce a sampling-free approach for estimating well-calibrated confidence values by explicitly modeling aleatoric uncertainty in semantic segmentation, achieving alignment with true classification accuracy and reducing inference time compared to sampling-based methods. Evaluated on the real-world SemanticKITTI benchmark, our approach achieves 1.70% and 1.33% Adaptive Calibration Error (ACE) in semantic segmentation of LiDAR data using RangeViT and SalsaNext models, and is more than one order of magnitude faster than the comparable baseline. Furthermore, reliability diagrams reveal that our method produces underconfident rather than overconfident predictions, an advantageous property in safety-critical systems.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/miandashti25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/miandashti25a.html</guid>
        
        
      </item>
    
      <item>
        <title>In-Context Iterative Policy Improvement for Dynamic Manipulation</title>
        <description>Attention-based architectures trained on internet-scale language data have demonstrated state-of-the-art reasoning ability for various language-based tasks, such as logic problems and textual reasoning. Additionally, these Large Language Models (LLMs) have exhibited the ability to perform few-shot prediction via in-context learning, in which input-output examples provided in the prompt are generalized to new inputs. This ability furthermore extends beyond standard language tasks, enabling few-shot learning for general patterns. In this work, we consider the application of in-context learning with pre-trained language models for dynamic manipulation. Dynamic manipulation introduces several crucial challenges, including increased dimensionality, complex dynamics, and partial observability. To address this, we take an iterative approach, and formulate our in-context learning problem to predict adjustments to a parametric policy based on previous interactions. We show across several tasks in simulation and on a physical robot that utilizing in-context learning outperforms alternative methods in the low data regime.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/merwe25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/merwe25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Sequence Modeling for Time-Optimal Quadrotor Trajectory Optimization with Sampling-based Robustness Analysis</title>
        <description>Time-optimal trajectories drive quadrotors to their dynamic limits, but computing such trajectories involves solving non-convex problems via iterative nonlinear optimization, making them prohibitively costly for real-time applications. In this work, we investigate learning-based models that imitate a model-based time-optimal trajectory planner to accelerate trajectory generation. Given a dataset of collision-free geometric paths, we show that modeling architectures can effectively learn the patterns underlying time-optimal trajectories. We introduce a quantitative framework to analyze local analytic properties of the learned models and link them to the Backward Reachable Tube of the geometric tracking controller. To enhance robustness, we propose a data augmentation scheme that applies random perturbations to the input paths. Compared to classical planners, our method achieves substantial speedups, and we validate its real-time feasibility on a hardware quadrotor platform. Experiments demonstrate that the learned models generalize to previously unseen path lengths.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/mao25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/mao25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Learning Long-Horizon Robot Manipulation Skills via Privileged Action</title>
        <description>Long-horizon contact-rich tasks are challenging to learn with reinforcement learning, due to ineffective exploration of high-dimensional state spaces with sparse rewards. The learning process often gets stuck in a local optimum and demands task-specific reward fine-tuning for complex scenarios. In this work, we propose a structured framework that leverages privileged actions with curriculum learning, enabling the policy to efficiently acquire long-horizon skills without relying on extensive reward engineering or reference trajectories. Specifically, we use privileged actions in simulation with a general training procedure that would be infeasible to implement in real-world scenarios. These privileges include relaxed constraints and virtual forces that enhance interaction and exploration with objects. Our results successfully achieve complex multi-stage long-horizon tasks that naturally combine non-prehensile manipulation with grasping to lift objects from non-graspable poses. We demonstrate generality by maintaining a parsimonious reward structure and showing convergence to diverse and robust behaviors across various environments. Our approach outperforms state-of-the-art methods in these tasks, converging to solutions where others fail.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/mao25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/mao25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Predictive Red Teaming: Breaking Policies Without Breaking Robots</title>
        <description>Visuomotor policies trained via imitation learning are capable of performing challenging manipulation tasks, but are often extremely brittle to lighting, visual distractors, and object locations. These vulnerabilities can depend unpredictably on the specifics of training, and are challenging to expose without time-consuming and expensive hardware evaluations. We propose the problem of predictive red teaming: discovering vulnerabilities of a policy with respect to environmental factors, and predicting the corresponding performance degradation without hardware evaluations in off-nominal scenarios. In order to achieve this, we develop RoboART: an automated red teaming (ART) pipeline that (1) modifies nominal observations using generative image editing to vary different environmental factors, and (2) predicts performance under each variation using a policy-specific anomaly detector executed on edited observations. Experiments across 500+ hardware trials in twelve off-nominal conditions for visuomotor diffusion policies demonstrate that RoboART predicts performance degradation with high accuracy (less than 0.19 average difference between predicted and real success rates). We also demonstrate how predictive red teaming enables targeted data collection: fine-tuning with data collected under conditions predicted to be adverse boosts baseline performance by 2–7x.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/majumdar25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/majumdar25a.html</guid>
        
        
      </item>
    
      <item>
        <title>PrioriTouch: Adapting to User Contact Preferences for Whole-Arm Physical Human-Robot Interaction</title>
        <description>Many robot caregiving tasks, such as bathing, dressing, and transferring, require a robot arm to make contact with a human body at multiple points rather than solely at the end effector. However, varied human touch preferences can lead to unsafe or uncomfortable multi-contact interactions. To address this, we introduce PrioriTouch, a framework integrating a novel contextual bandit algorithm with hierarchical operational space control to learn user contact preferences and translate them into low-level pose and force control policies. PrioriTouch minimizes user discomfort by initially gathering real-world feedback and subsequently refining the policy using simulation-in-the-loop, thus avoiding unsafe user experimentation. Guided by insights from a user study on physical assistance preferences, we rigorously evaluate PrioriTouch in extensive simulation and real-world experiments, demonstrating effective adaptation to user contact preferences, maintained task performance, and enhanced safety and comfort.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/madan25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/madan25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Bipedal Balance Control with Whole-body Musculoskeletal Standing and Falling Simulations</title>
        <description>Balance control is important for human and bipedal robotic systems. While dynamic balance during locomotion has received considerable attention, quantitative understanding of static balance and falling remains limited. This work presents a hierarchical control pipeline for simulating human balance via a comprehensive whole-body musculoskeletal system. We identified spatiotemporal dynamics of balancing during stable standing, revealed the impact of muscle injury on balancing behavior, and generated fall contact patterns that aligned with clinical data. Furthermore, our simulated hip exoskeleton assistance demonstrated improvement in balance maintenance and reduced muscle effort under perturbation. This work offers unique muscle-level insights into human balance dynamics that are challenging to capture experimentally. It could provide a foundation for developing targeted interventions for individuals with balance impairments and support the advancement of humanoid robotic systems.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/ma25d.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/ma25d.html</guid>
        
        
      </item>
    
      <item>
        <title>CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion</title>
        <description>Diffusion Policy (DP) enables robots to learn complex behaviors by imitating expert demonstrations through action diffusion. However, in practical applications, hardware limitations often degrade data quality, while real-time constraints restrict model inference to instantaneous state and scene observations. These limitations seriously reduce the efficacy of learning from expert demonstrations, resulting in failures in object localization, grasp planning, and long-horizon task execution. To address these challenges, we propose Causal Diffusion Policy (CDP), a novel transformer-based diffusion model that enhances action prediction by conditioning on historical action sequences, thereby enabling more coherent and context-aware visuomotor policy learning. In practice, to further mitigate the computational cost associated with autoregressive inference, a caching mechanism is also introduced to store attention key-value pairs from previous timesteps, substantially reducing redundant computations during execution. Extensive experiments in both simulated and real-world environments, spanning diverse 2D and 3D manipulation tasks, demonstrate that CDP uniquely leverages historical action sequences to achieve significantly higher accuracy than existing methods. Moreover, even when faced with degraded observation quality, CDP maintains remarkable precision by reasoning through temporal continuity, which highlights its practical robustness for robotic control under realistic, imperfect conditions.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/ma25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/ma25c.html</guid>
        
        
      </item>
    
      <item>
        <title>GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation</title>
        <description>Learning manipulation skills from human demonstration videos offers a promising path toward generalizable and interpretable robotic intelligence—particularly through the lens of *actionable affordances*.  However, transferring such knowledge remains challenging due to:  1) a lack of large-scale datasets with precise affordance annotations, and 2) insufficient exploration of affordances in diverse manipulation contexts. To address these gaps, we introduce **HOVA-500K**, a large-scale, affordance-annotated dataset comprising 500,000 images across 1,726 object categories and 675 actions. We also release a standardized benchmarking suite for multi-modal affordance reasoning. Built upon HOVA-500K, we present **GLOVER++**, a *global-to-local* affordance training framework that effectively transfers actionable affordance knowledge from human demonstrations to downstream open-vocabulary reasoning tasks. GLOVER++ achieves state-of-the-art results on the HOVA-500K benchmark and demonstrates strong generalization across diverse downstream robotic manipulation tasks.  By explicitly modeling actionable affordances, GLOVER++ facilitates robust transfer across scenes, modalities, and tasks. We hope that HOVA-500K and the GLOVER++ framework will serve as valuable resources for bridging the gap between human demonstrations and robotic manipulation capabilities. We will release our dataset, code and models.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/ma25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/ma25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Constraint-Aware Diffusion Guidance for Robotics: Real-Time Obstacle Avoidance for Autonomous Racing</title>
        <description>Diffusion models hold great potential in robotics due to their ability to capture complex, high-dimensional data distributions. However, their lack of constraint-awareness limits their deployment in safety-critical applications. We propose Constraint-Aware Diffusion Guidance (CoDiG), a data-efficient and general-purpose framework that integrates barrier functions into the denoising process, guiding diffusion sampling toward constraint-satisfying outputs. CoDiG enables constraint satisfaction even with limited training data and generalizes across tasks. We evaluate our framework in the challenging setting of miniature autonomous racing, where real-time obstacle avoidance is essential. Real-world experiments show that CoDiG generates safe outputs efficiently under dynamic conditions, highlighting its potential for broader robotic applications.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/ma25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/ma25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Leveraging Correlation Across Test Platforms for Variance-Reduced Metric Estimation</title>
        <description>Learning-based robotic systems demand rigorous validation to assure reliable performance, but extensive real-world testing is often prohibitively expensive and, even when conducted, may still yield insufficient data for high-confidence guarantees. In this work, we introduce a general estimation framework that leverages *paired* data across test platforms, e.g., paired simulation and real-world observations, to achieve better estimates of real-world metrics via the method of control variates. By incorporating cheap and abundant auxiliary measurements (for example, simulator outputs) as control variates for costly real-world samples, our method provably reduces the variance of Monte Carlo estimates and thus requires significantly fewer real-world samples to attain a specified confidence bound on the mean performance. We provide theoretical analysis characterizing the variance and sample-efficiency improvement, and demonstrate empirically in autonomous driving and quadruped robotics settings that our approach achieves high-probability bounds with markedly reduced sample complexity. Our technique can lower the real-world testing burden for validating the performance of the stack, thereby enabling more efficient and cost-effective experimental evaluation of robotic systems.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/luo25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/luo25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration</title>
        <description>Teaching robots dexterous manipulation skills often requires collecting hundreds of demonstrations using wearables or teleoperation, a process that is challenging to scale. Videos of human-object interactions are easier to collect and scale, but leveraging them directly for robot learning is difficult due to the lack of explicit action labels and human-robot embodiment differences. We propose Human2Sim2Robot, a novel real-to-sim-to-real framework for training dexterous manipulation policies using only one RGB-D video of a human demonstrating a task. Our method utilizes reinforcement learning (RL) in simulation to cross the embodiment gap without relying on wearables, teleoperation, or large-scale data collection. From the video, we extract: (1) the object pose trajectory to define an object-centric, embodiment-agnostic reward, and (2) the pre-manipulation hand pose to initialize and guide exploration during RL training. These components enable effective policy learning without any task-specific reward tuning. In the single human demo regime, Human2Sim2Robot outperforms object-aware replay by over 55% and imitation learning by over 68% on grasping, non-prehensile manipulation, and multi-step tasks. Website: https://human2sim2robot.github.io</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/lum25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/lum25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Few-Shot Neuro-Symbolic Imitation Learning for Long-Horizon Planning and Acting</title>
        <description>Imitation learning enables intelligent systems to acquire complex behaviors with minimal supervision. However, existing methods often focus on short-horizon skills, require large datasets, and struggle to solve long-horizon tasks or generalize across task variations and distribution shifts. We propose a novel neuro-symbolic framework that jointly learns continuous control policies and symbolic domain abstractions from a few skill demonstrations. Our method abstracts high-level task structures into a graph, discovers symbolic rules via an Answer Set Programming solver, and trains low-level controllers using diffusion policy imitation learning. A high-level oracle filters task-relevant information to focus each controller on a minimal observation and action space. Our graph-based neuro-symbolic framework enables capturing complex state transitions, including non-spatial and temporal relations, that data-driven learning or clustering techniques often fail to discover in limited demonstration datasets. We validate our approach in six domains that involve four robotic arms, Stacking, Kitchen, Assembly, and Towers of Hanoi environments, and a distinct Automated Forklift domain with two environments. The results demonstrate high data efficiency with as few as five skill demonstrations, strong zero- and few-shot generalizations, and interpretable decision making. Our code is publicly available.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/lorang25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/lorang25a.html</guid>
        
        
      </item>
    
      <item>
        <title>PicoPose: Progressive Pixel-to-Pixel Correspondence Learning for Novel Object Pose Estimation</title>
        <description>RGB-based novel object pose estimation is critical for rapid deployment in robotic applications, yet zero-shot generalization remains a key challenge. In this paper, we introduce PicoPose, a novel framework designed to tackle this task using a three-stage pixel-to-pixel correspondence learning process. Firstly, PicoPose matches features from the RGB observation with those from rendered object templates, identifying the best-matched template and establishing coarse correspondences. Secondly, PicoPose smooths the correspondences by globally regressing a 2D affine transformation, including in-plane rotation, scale, and 2D translation, from the coarse correspondence map. Thirdly, PicoPose applies the affine transformation to the feature map of the best-matched template and learns correspondence offsets within local regions to achieve fine-grained correspondences. By progressively refining the correspondences, PicoPose significantly improves the accuracy of object poses computed via PnP/RANSAC. PicoPose achieves state-of-the-art performance on the seven core datasets of the BOP benchmark, demonstrating exceptional generalization to novel objects. Our code and models will be made publicly available.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/liu25g.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/liu25g.html</guid>
        
        
      </item>
    
      <item>
        <title>D-CODA: Diffusion for Coordinated Dual-Arm Data Augmentation</title>
        <description>Learning bimanual manipulation is challenging due to its high dimensionality and tight coordination required between two arms. Eye-in-hand imitation learning, which uses wrist-mounted cameras, simplifies perception by focusing on task-relevant views. However, collecting diverse demonstrations remains costly, motivating the need for scalable data augmentation. While prior work has explored visual augmentation in single-arm settings, extending these approaches to bimanual manipulation requires generating viewpoint-consistent observations across both arms and producing corresponding action labels that are both valid and feasible. In this work, we propose Diffusion for COordinated Dual-arm Data Augmentation (D-CODA), a method for offline data augmentation tailored to eye-in-hand bimanual imitation learning that trains a diffusion model to synthesize novel, viewpoint-consistent wrist-camera images for both arms while simultaneously generating joint-space action labels. It employs constrained optimization to ensure that augmented states involving gripper-to-object contacts adhere to constraints suitable for bimanual coordination. We evaluate D-CODA on 5 simulated and 3 real-world tasks. Our results across 2250 simulation trials and 180 real-world trials demonstrate that it outperforms baselines and ablations, showing its potential for scalable data augmentation in eye-in-hand bimanual manipulation. Our anonymous website is at: https://dcodaaug.github.io/D-CODA/.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/liu25f.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/liu25f.html</guid>
        
        
      </item>
    
      <item>
        <title>ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving</title>
        <description>Due to their powerful vision-language reasoning and generalization abilities, multimodal large language models (MLLMs) have garnered significant attention in the field of end-to-end (E2E) autonomous driving. However, their application to closed-loop systems remains underexplored, and current MLLM-based methods have not shown clear superiority to mainstream E2E imitation learning approaches. In this work, we propose ReasonPlan, a novel MLLM fine-tuning framework designed for closed-loop driving through holistic reasoning with a self-supervised Next Scene Prediction task and supervised Decision Chain-of-Thought process. This dual mechanism encourages the model to align visual representations with actionable driving context, while promoting interpretable and causally grounded decision making. We curate a planning-oriented decision reasoning dataset, namely PDR, comprising 210k diverse and high-quality samples. Our method outperforms the mainstream E2E imitation learning method by a large margin of 19% L2 and 16.1 driving score on the Bench2Drive benchmark. Furthermore, ReasonPlan demonstrates strong zero-shot generalization on the unseen DOS benchmark, highlighting its adaptability in handling zero-shot corner cases.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/liu25e.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/liu25e.html</guid>
        
        
      </item>
    
      <item>
        <title>CASPER: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models</title>
        <description>Assistive teleoperation, where control is shared between a human and a robot, enables efficient and intuitive human-robot collaboration in diverse and unstructured environments. A central challenge in real-world assistive teleoperation is for the robot to infer a wide range of human intentions from user control inputs and to assist users with correct actions. Existing methods are either confined to simple, predefined scenarios or restricted to task-specific data distributions at training, limiting their support for real-world assistance. We introduce Casper, an assistive teleoperation system that leverages commonsense knowledge embedded in pre-trained visual language models (VLMs) for real-time intent inference and flexible skill execution. Casper incorporates an open-world perception module for a generalized understanding of novel objects and scenes, a VLM-powered intent inference mechanism that leverages commonsense reasoning to interpret snippets of teleoperated user input, and a skill library that expands the scope of prior assistive teleoperation systems to support diverse, long-horizon mobile manipulation tasks. Extensive empirical evaluation, including human studies and system ablations, demonstrates that Casper improves task performance, reduces human cognitive load, and achieves higher user satisfaction than direct teleoperation and assistive teleoperation baselines. More information is available at https://casper-corl25.github.io/</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/liu25d.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/liu25d.html</guid>
        
        
      </item>
    
      <item>
        <title>FetchBot: Learning Generalizable Object Fetching in Cluttered Scenes via Zero-Shot Sim2Real</title>
        <description>Generalizable object fetching in cluttered scenes remains a fundamental and application-critical challenge in embodied AI. Closely packed objects cause inevitable occlusions, making safe action generation particularly difficult. Under such partial observability, effective policies must not only generalize across diverse objects and layouts but also reason about occlusion to avoid collisions. However, collecting large-scale real-world data for this task remains prohibitively expensive, leaving this problem largely unsolved. In this paper, we introduce FetchBot, a sim-to-real framework for this challenge. We first curate a large-scale synthetic dataset featuring 1M diverse scenes and 500k representative demonstrations. Based on this dataset, FetchBot employs a depth-conditioned method for action generation, which leverages structural cues to enable robust obstacle-aware action planning. However, depth is perfect in simulation but noisy in real-world environments. To address this sim-to-real gap, FetchBot predicts depth from RGB inputs using a foundation model and integrates local occupancy prediction as a co-training task, providing a generalizable latent representation for sim-to-real transfer. Extensive experiments in simulation and real-world environments demonstrate FetchBot’s strong zero-shot sim-to-real transfer, effective clutter handling, and adaptability to novel scenarios. In cluttered environments, it achieves an average success rate of 89.95%, significantly outperforming prior methods. Moreover, FetchBot demonstrates excellent robustness in challenging cases, such as fetching transparent, reflective, and irregular objects, highlighting its practical value.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/liu25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/liu25c.html</guid>
        
        
      </item>
    
      <item>
        <title>ImMimic: Cross-Domain Imitation from Human Videos via Mapping and Interpolation</title>
        <description>Learning robot manipulation from abundant human videos offers a scalable alternative to costly robot-specific data collection. However, domain gaps across visual, morphological, and physical aspects hinder direct imitation. To effectively bridge the domain gap, we propose ImMimic, an embodiment-agnostic co-training framework that leverages both human videos and a small amount of teleoperated robot demonstrations. ImMimic uses Dynamic Time Warping (DTW) with either action- or visual-based mapping to map retargeted human hand poses to robot joints, followed by MixUp interpolation between paired human and robot trajectories. Our key insights are (1) retargeted human hand trajectories provide informative action labels, and (2) interpolation over the mapped data creates intermediate domains that facilitate smooth domain adaptation during co-training. Evaluations on four real-world manipulation tasks (Pick and Place, Push, Hammer, Flip) across four robotic embodiments (Robotiq, Fin Ray, Allegro, Ability) show that ImMimic improves task success rates and execution smoothness, highlighting its efficacy to bridge the domain gap for robust robot manipulation. The project website can be found at https://sites.google.com/view/immimic.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/liu25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/liu25b.html</guid>
        
        
      </item>
    
      <item>
        <title>LocoFormer: Generalist Locomotion via Long-context Adaptation</title>
        <description>Humans and animals exhibit flexible locomotion strategies, such as learning to walk within minutes, and efficient adaptation to changes in morphology. In contrast, modern locomotion controllers are manually tuned for specific embodiments. In this paper, we present LocoFormer, a generalist policy that can control previously unseen legged and wheeled robots, even without precise knowledge of their kinematics. LocoFormer is able to adapt to changes in morphology and dynamics at test time. We find that two key choices enable adaptation. First, we train with massive-scale RL on procedurally generated robots with aggressive domain randomization. Second, in contrast to previous policies that are myopic with short context lengths, we extend context by orders of magnitude to span episode boundaries. We deploy the same LocoFormer to varied robots, and show robust control even with large disturbances such as weight and motor failures. In extreme scenarios, we see emergent adaptation across episodes: LocoFormer learns from falls in early episodes to improve control strategies in later ones. We believe this simple yet general recipe can be used to train foundation models for other robotic skills in the future. Videos at generalist-locomotion.github.io.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/liu25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/liu25a.html</guid>
        
        
      </item>
    
      <item>
        <title>TypeTele: Releasing Dexterity in Teleoperation by Dexterous Manipulation Types</title>
        <description>Dexterous teleoperation plays a crucial role in robotic manipulation for real-world data collection and remote robot control. Previous dexterous teleoperation mostly relies on hand retargeting to closely mimic human hand postures. However, these approaches may fail to fully leverage the inherent dexterity of dexterous hands, which can execute unique actions through their structural advantages compared to human hands. To address this limitation, we propose TypeTele, a type-guided dexterous teleoperation system, which enables dexterous hands to perform actions that are not constrained by human motion patterns. This is achieved by introducing dexterous manipulation types into the teleoperation system, allowing operators to employ appropriate types to complete specific tasks. To support this system, we build an extensible dexterous manipulation type library to cover comprehensive dexterous postures used in manipulation tasks. During teleoperation, we employ an MLLM (Multi-modality Large Language Model)-assisted type retrieval module to identify the most suitable manipulation type based on the specific task and operator commands. Extensive experiments of real-world teleoperation and imitation learning demonstrate that incorporating manipulation types takes full advantage of the dexterous robot’s ability to perform diverse and complex tasks, yielding higher success rates.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/lin25d.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/lin25d.html</guid>
        
        
      </item>
    
      <item>
        <title>Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids</title>
        <description>Learning generalizable robot manipulation policies, especially for complex multi-fingered humanoids, remains a significant challenge. Existing approaches primarily rely on extensive data collection and imitation learning, which are expensive, labor-intensive, and difficult to scale. Sim-to-real reinforcement learning (RL) offers a promising alternative, but has mostly succeeded in simpler state-based or single-hand setups. How to effectively extend this to vision-based, contact-rich bimanual manipulation tasks remains an open question. In this paper, we introduce a practical sim-to-real RL recipe that trains a humanoid robot to perform three challenging dexterous manipulation tasks: grasp-and-reach, box lift and bimanual handover. Our method features an automated real-to-sim tuning module, a generalized reward formulation based on contact and object goals, a divide-and-conquer policy distillation framework, and a hybrid object representation strategy with modality-specific augmentation. We demonstrate high success rates on unseen objects and robust, adaptive policy behaviors – highlighting that vision-based dexterous manipulation via sim-to-real RL is not only viable, but also scalable and broadly applicable to real-world humanoid manipulation tasks.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/lin25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/lin25c.html</guid>
        
        
      </item>
    
      <item>
        <title>Constraint-Preserving Data Generation for One-Shot Visuomotor Policy Generalization</title>
        <description>Large-scale demonstration data has powered key breakthroughs in robot manipulation, but collecting that data remains costly and time-consuming. To this end, we present Constraint-Preserving Data Generation (CP-Gen), a method that uses a single expert trajectory to generate robot demonstrations containing novel object geometries and poses. These generated demonstrations are used to train closed-loop visuomotor policies that transfer zero-shot to the real world. Similar to prior data-generation work focused on pose variations, CP-Gen first decomposes expert demonstrations into free-space motions and robot skills. Unlike prior work, we achieve geometry-aware data generation by formulating robot skills as keypoint-trajectory constraints: keypoints on the robot or grasped object must track a reference trajectory defined relative to a task-relevant object. To generate a new demonstration, CP-Gen samples pose and geometry transforms for each task-relevant object, then applies these transforms to the object and its associated keypoints or keypoint trajectories. We optimize robot joint configurations so that the keypoints on the robot or grasped object track the transformed keypoint trajectory, and then motion plan a collision-free path to the first optimized joint configuration. Using demonstrations generated by CP-Gen, we train visuomotor policies that generalize across variations in object geometries and poses. Experiments on 16 simulation tasks and four real-world tasks, featuring multi-stage, non-prehensile and tight-tolerance manipulation, show that policies trained using our method achieve an average success rate of 77%, outperforming the best baseline which achieves an average success rate of 50%.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/lin25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/lin25b.html</guid>
        
        
      </item>
    
      <item>
        <title>LocoTouch: Learning Dynamic Quadrupedal Transport with Tactile Sensing</title>
        <description>Quadrupedal robots have demonstrated remarkable agility and robustness in traversing complex terrains. However, they struggle with dynamic object interactions, where contact must be precisely sensed and controlled. To bridge this gap, we present LocoTouch, a system that equips quadrupedal robots with tactile sensing to address a particularly challenging task in this category: long-distance transport of unsecured cylindrical objects, which typically requires custom mounting or fastening mechanisms to maintain stability. For efficient large-area tactile sensing, we design a high-density distributed tactile sensor that covers the entire back of the robot. To effectively leverage tactile feedback for robot control, we develop a simulation environment with high-fidelity tactile signals, and train tactile-aware transport policies using a two-stage learning pipeline. Furthermore, we design a novel reward function to promote robust, symmetric, and frequency-adaptive locomotion gaits. After training in simulation, LocoTouch transfers zero-shot to the real world, reliably transporting a wide range of unsecured cylindrical objects with diverse sizes, weights, and surface properties. Moreover, it remains robust over long distances, on uneven terrain, and under severe perturbations.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/lin25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/lin25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Hold My Beer: Learning Gentle Humanoid Locomotion and End-Effector Stabilization Control</title>
        <description>Can your humanoid walk up and hand you a full cup of beer—without spilling a drop? While humanoids are increasingly featured in flashy demos—dancing, delivering packages, traversing rough terrain—fine-grained control during locomotion remains a significant challenge. In particular, stabilizing a filled end-effector (EE) while walking is far from solved, due to a fundamental mismatch in task dynamics: locomotion demands slow-timescale, robust control, whereas EE stabilization requires rapid, high-precision corrections. To address this, we propose SoFTA, a Slow-Fast Two-Agent framework that decouples upper-body and lower-body control into separate agents operating at different frequencies and with distinct rewards. This temporal and objective separation mitigates policy interference and objective conflict, enabling coordinated whole-body behavior. SoFTA executes upper-body actions at 100 Hz for precise EE control and lower-body actions at 50 Hz for robust gait. It reduces EE acceleration by 2–5x compared to baselines and performs 2–3x closer to human-level stability, enabling delicate tasks such as carrying nearly full cups, capturing steady video during locomotion, and disturbance rejection with EE stability.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/li25i.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/li25i.html</guid>
        
        
      </item>
    
      <item>
        <title>CLONE: Closed-Loop Whole-Body Humanoid Teleoperation for Long-Horizon Tasks</title>
        <description>Humanoid robot teleoperation plays a vital role in demonstrating and collecting data for complex interactions. Current methods suffer from two key limitations: (1) restricted controllability due to decoupled upper- and lower-body control, and (2) severe drift caused by open-loop execution. These issues prevent humanoid robots from performing coordinated whole-body motions required for long-horizon loco-manipulation tasks. We introduce CLONE, a whole-body teleoperation system that overcomes these challenges through three key contributions: (1) a Mixture-of-Experts (MoE) whole-body control policy that enables complex coordinated movements, such as “picking up an object from the ground” and “placing it in a distant bin”; (2) a closed-loop error correction mechanism using LiDAR odometry, reducing translational drift to 12cm over 8.9-meter trajectories; and (3) a systematic data augmentation strategy that ensures robust performance under diverse, previously unseen operator poses. In extensive experiments, CLONE demonstrates robust performance across diverse scenarios while maintaining stable whole-body control. These capabilities significantly advance humanoid robotics by enabling the collection of long-horizon interaction data and establishing a foundation for more sophisticated humanoid-environment interaction in both research and practical applications.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/li25h.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/li25h.html</guid>
        
        
      </item>
    
      <item>
        <title>3DS-VLA: A 3D Spatial-Aware Vision Language Action Model for Robust Multi-Task Manipulation</title>
        <description>Recently, 2D vision-language-action (VLA) models have made significant strides in multi-task manipulation. However, these models struggle to reason about 3D spatial relationships from 2D image inputs. Although an increasing number of 3D approaches explicitly integrate 3D information, they encounter challenges such as limited availability of large-scale 3D datasets and loss of spatial information during input processing. Meanwhile, existing policies typically focus on the perception-to-action learning paradigm, lacking an explicit understanding of the spatial and temporal relationships between the robot and its environment. To address this, we propose 3DS-VLA, which enhances pretrained 2D vision-language models (VLMs) with comprehensive 3D awareness, enabling the prediction of robust end-effector poses. Specifically, we enable a 2D vision encoder to encode both 2D images and 3D spatial observations by introducing a 2D-to-3D positional alignment mechanism. This allows 3DS-VLA to leverage the large-scale pre-trained knowledge of the VLM for effective reasoning in complex 3D robotic environments. Furthermore, to better understand the spatiotemporal relationship between 3D observations and robot behavior, we guide the model to learn the introduced sequential 3D spatial constraints, which define affordance-relevant 3D keypoints on objects, ensuring robust interactions. Experiments in simulated and real-world environments demonstrate that 3DS-VLA outperforms previous state-of-the-art policies and showcase its generalization capabilities across multi-task, multi-embodiment, and diverse environmental settings.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/li25g.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/li25g.html</guid>
        
        
      </item>
    
      <item>
        <title>KoopMotion: Learning Almost Divergence Free Koopman Flow Fields for Motion Planning</title>
        <description>In this work, we propose a novel flow field-based motion planning method that drives a robot from any initial state to a desired reference trajectory such that it converges to the trajectory’s end point.  Despite the demonstrated efficacy of Koopman operator theory for modeling dynamical systems, Koopman-based models do not inherently enforce convergence to desired trajectories or to specified goals, a requirement when learning from demonstrations (LfD).  We present KoopMotion, which represents motion flow fields as dynamical systems, parameterized by Koopman Operators, and leverages the divergence properties of the learnt flow fields to obtain smooth motion fields that converge to a desired reference trajectory when the robot is placed away from the desired trajectory, and track the trajectory until the end point. To demonstrate the effectiveness of our approach, we show evaluations of KoopMotion on the LASA human handwriting dataset, including spectral analysis. We also perform experiments on a physical robot, verifying KoopMotion on a miniature autonomous surface vehicle operating in a non-static fluid flow environment. Our approach is highly sample efficient in both space and time, requiring only 3% of the LASA dataset to generate dense motion plans. Additionally, KoopMotion provides a significant improvement over baselines when comparing metrics that measure spatial and temporal dynamics modeling efficacy.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/li25f.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/li25f.html</guid>
        
        
      </item>
    
      <item>
        <title>Learn from What We HAVE: History-Aware VErifier that Reasons about Past Interactions Online</title>
        <description>We introduce a novel History-Aware VErifier (HAVE) to disambiguate uncertain scenarios online by leveraging past interactions. Robots frequently encounter visually ambiguous objects whose manipulation outcomes remain uncertain until physically interacted with. While generative models alone could theoretically adapt to such ambiguity, in practice they obtain suboptimal performance in ambiguous cases, even when conditioned on action history.  To address this, we propose explicitly decoupling action generation from verification: we use an unconditional diffusion-based generator to propose multiple candidate actions and employ our history-aware verifier to select the most promising action by reasoning about past interactions. Through theoretical analysis, we demonstrate that employing a verifier significantly improves expected action quality. Empirical evaluations and analysis across multiple simulated and real-world environments including articulated objects, multi-modal doors, and uneven object pick-up confirm the effectiveness of our method and improvements over baselines.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/li25e.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/li25e.html</guid>
        
        
      </item>
    
      <item>
        <title>Morphologically Symmetric Reinforcement Learning for Ambidextrous Bimanual Manipulation</title>
        <description>Humans naturally exhibit bilateral symmetry in their gross manipulation skills, effortlessly mirroring simple actions between left and right hands. Bimanual robots—which also feature bilateral symmetry—should similarly exploit this property to perform tasks with either hand. Unlike humans, who often favor a dominant hand for fine dexterous skills, robots should ideally execute ambidextrous manipulation with equal proficiency. To this end, we introduce SYMDEX (SYMmetric DEXterity), a reinforcement learning framework for ambidextrous bi-manipulation that leverages the robot’s inherent bilateral symmetry as an inductive bias. SYMDEX decomposes complex bimanual manipulation tasks into per-hand subtasks and trains dedicated policies for each. By exploiting bilateral symmetry via equivariant neural networks, experience from one arm is inherently leveraged by the opposite arm. We then distill the subtask policies into a global ambidextrous policy that is independent of the hand-task assignment. We evaluate SYMDEX on six challenging simulated manipulation tasks and demonstrate successful real-world deployment on two of them. Our approach outperforms baselines on more complex, asymmetric tasks, where the left and right hands perform different roles. We further demonstrate SYMDEX’s scalability by extending it to a four-arm manipulation setup, where our symmetry-aware policies enable effective multi-arm collaboration and coordination. Our results highlight how structural symmetry as inductive bias in policy learning enhances sample efficiency, robustness, and generalization across diverse dexterous manipulation tasks.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/li25d.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/li25d.html</guid>
        
        
      </item>
    
      <item>
        <title>ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models</title>
        <description>Learning real-world robotic manipulation is challenging, particularly when limited demonstrations are available. Existing methods for few-shot manipulation often rely on simulation-augmented data or pre-built modules like grasping and pose estimation, which struggle with sim-to-real gaps and lack extensibility. While large-scale imitation pre-training shows promise, adapting these general-purpose policies to specific tasks in data-scarce settings remains unexplored. To achieve this, we propose ControlVLA, a novel framework that bridges pre-trained VLA models with object-centric representations via a ControlNet-style architecture for efficient fine-tuning. Specifically, to introduce object-centric conditions without overwriting prior knowledge, ControlVLA zero-initializes a set of projection layers, allowing them to gradually adapt the pre-trained manipulation policies. In real-world experiments across 6 diverse tasks, including pouring cubes and folding clothes, our method achieves a 76.7% success rate while requiring only 10-20 demonstrations — a significant improvement over traditional approaches that require more than 100 demonstrations to achieve comparable success.  Additional experiments highlight ControlVLA’s extensibility to long-horizon tasks and robustness to unseen objects and backgrounds.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/li25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/li25c.html</guid>
        
        
      </item>
    
      <item>
        <title>ActLoc: Learning to Localize on the Move via Active Viewpoint Selection</title>
        <description>Reliable localization is critical for robot navigation, yet many existing systems assume that all viewpoints along a trajectory are equally informative. In practice, localization becomes unreliable when the robot observes unmapped, ambiguous, or uninformative regions. To address this, we present ActLoc, an active viewpoint-aware planning framework for enhancing localization accuracy in general robot navigation tasks. At the core of ActLoc is an attention-based model trained at scale for viewpoint selection. This model encodes a metric map of the scene, along with camera poses used during map construction, and estimates localization accuracy over camera pitch and yaw directions at arbitrary 3D waypoints in space. This per-point accuracy distribution is integrated into the path planning process, allowing the robot to actively choose camera orientations that maximize localization robustness while respecting task and motion constraints. ActLoc achieves state-of-the-art performance in the single-viewpoint selection task, and generalizes effectively to full-trajectory planning. It provides a modular enhancement to a wide range of navigation and inspection tasks in structured environments.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/li25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/li25b.html</guid>
        
        
      </item>
    
      <item>
        <title>AT-Drone: Benchmarking Adaptive Teaming in Multi-Drone Pursuit</title>
        <description>Adaptive teaming—the capability of agents to effectively collaborate with unfamiliar teammates without prior coordination—is widely explored in virtual video games but overlooked in real-world multi-robot contexts. Yet, such adaptive collaboration is crucial for real-world applications, including border surveillance, search-and-rescue, and counter-terrorism operations. To address this gap, we introduce AT-Drone, the first dedicated benchmark explicitly designed to facilitate comprehensive training and evaluation of adaptive teaming strategies in multi-drone pursuit scenarios. AT-Drone makes the following key contributions: (1) An adaptable simulation environment configurator that enables intuitive and rapid setup of adaptive teaming multi-drone pursuit tasks, including four predefined pursuit environments. (2) A streamlined real-world deployment pipeline that seamlessly translates simulation insights into practical drone evaluations using edge devices (such as Jetson Orin Nano) and Crazyflie drones. (3) A novel algorithm zoo integrated with a distributed training framework, featuring diverse algorithms tailored, for the first time, to multi-pursuer and multi-evader drone pursuit tasks. (4) Standardized evaluation protocols with newly designed unseen drone zoos for rigorously assessing the performance of adaptive teaming. Comprehensive experimental evaluations across four progressively challenging multi-drone pursuit scenarios confirm AT-Drone’s effectiveness in advancing adaptive teaming research. Real-world drone experiments further validate its practical feasibility and utility for realistic robotic operations. Videos, code and weights are available at https://sites.google.com/view/at-drone.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/li25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/li25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Phantom: Training Robots Without Robots Using Only Human Videos</title>
        <description>Training general-purpose robots requires learning from large and diverse data sources. Current approaches rely heavily on teleoperated demonstrations which are difficult to scale. We present a scalable framework for training manipulation policies directly from human video demonstrations, requiring no robot data. Our method converts human demonstrations into robot-compatible observation-action pairs using hand pose estimation and visual data editing. We inpaint the human arm and overlay a rendered robot to align the visual domains. This enables zero-shot deployment on real hardware without any fine-tuning. We demonstrate strong success rates—up to 92%—on a range of tasks including deformable object manipulation, multi-object sweeping, and insertion. Our approach generalizes to novel environments and supports closed-loop execution. By demonstrating that effective policies can be trained using only human videos, our method broadens the path to scalable robot learning. Videos are available at https://phantom-training-robots.github.io.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/lepert25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/lepert25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Imagine, Verify, Execute: Memory-guided Agentic Exploration with Vision-Language Models</title>
        <description>Exploration is key for general-purpose robotic learning, particularly in open-ended environments where explicit guidance or task-specific feedback is limited. Vision-language models (VLMs), which can reason about object semantics, spatial relations, and potential outcomes, offer a promising foundation for guiding exploratory behavior by generating high-level goals or transitions. However, their outputs lack grounding, making it difficult to determine whether imagined transitions are physically feasible or informative. To bridge the gap between imagination and execution, we present IVE (Imagine, Verify, Execute), an agentic exploration framework inspired by human curiosity.  Human exploration often emerges from the drive to discover novel scene configurations and to understand the environment. Inspired by this, IVE leverages VLMs to abstract RGB-D observations into semantic scene graphs, imagine novel scenes, predict their physical plausibility, and generate executable skill sequences through action tools. We evaluate IVE in both simulated and real-world tabletop environments. The results show that IVE produces more diverse and meaningful exploration than RL baselines. The collected data facilitates learning downstream tasks that closely match those of policies trained on human-collected demonstrations.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/lee25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/lee25b.html</guid>
        
        
      </item>
    
      <item>
        <title>CLASS: Contrastive Learning via Action Sequence Supervision for Robot Manipulation</title>
        <description>Recent advances in Behavior Cloning (BC) have led to strong performance in robotic manipulation, driven by expressive models, sequence modeling of actions, and large-scale demonstration data. However, BC faces significant challenges when applied to heterogeneous datasets, such as visual shift with different camera poses or object appearances, where performance degrades despite the benefits of learning at scale. This stems from BC’s tendency to overfit individual demonstrations rather than capture shared structure, limiting generalization. To address this, we introduce Contrastive Learning via Action Sequence Supervision (CLASS), a method for learning behavioral representations from demonstrations using supervised contrastive learning. CLASS leverages weak supervision from similar action sequences identified via Dynamic Time Warping (DTW) and optimizes a soft InfoNCE loss with similarity-weighted positive pairs. We evaluate CLASS on 5 simulation benchmarks and 3 real-world tasks to achieve competitive results using retrieval-based control with representations only. Most notably, for downstream policy learning under significant visual shifts, CLASS achieves an average success rate of 70% with Diffusion Policy, while all other baseline methods fail to perform competitively.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/lee25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/lee25a.html</guid>
        
        
      </item>
    
      <item>
        <title>RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models</title>
        <description>Vision-Language-Action (VLA) models, pre-trained on large-scale imitation learning datasets, have demonstrated remarkable capabilities in visuomotor control. However, these models exhibit diverse failure modes in unstructured real-world environments, limiting the widespread adoption of VLAs in robotics. Efforts to enhance the robustness and generalization of VLAs have gradually shifted from the pre-training to the post-training phase. Yet, the potential of scaling test-time compute remains underexplored. In this paper, we investigate test-time scaling for robotics through the lens of sampling and verification. We first demonstrate that the relationship between action error and the number of generated samples follows an exponentiated power law across a range of VLAs, indicating the existence of inference-time scaling laws. Building on this insight, we propose a synthetic data generation pipeline for training a Vision-Language Model (VLM)-based action verifier, and show that scaling the synthetic dataset consistently improves verification and downstream accuracy. We then introduce RoboMonkey, a test-time scaling framework for VLAs. At deployment, RoboMonkey samples a small set of actions from a VLA, applies Gaussian perturbations and majority voting to construct an action proposal distribution, and then uses the VLM-based verifier to select the optimal action. Through extensive evaluations across simulated and real-world environments, we show that pairing existing VLAs with RoboMonkey yields significant performance gains, achieving a 25% absolute improvement on out-of-distribution tasks and 8% higher average success rate on in-distribution tasks. Additionally, when adapting to new robot setups, we show that fine-tuning both VLAs and action verifiers yields a 7% performance increase compared to fine-tuning VLAs alone.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/kwok25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/kwok25a.html</guid>
        
        
      </item>
    
      <item>
        <title>COLLAGE: Adaptive Fusion-based Retrieval for Augmented Policy Learning</title>
        <description>In this work, we study the problem of data retrieval for few-shot imitation learning: select data from a large dataset to train a performant policy for a specific task, given only a few target demonstrations. Prior methods retrieve data using a single-feature distance heuristic, assuming that the best demonstrations are those that most closely resemble the target examples in visual, semantic, or motion space. However, this approach captures only a subset of the relevant information and is prone to introducing detrimental demonstrations, e.g., retrieving data from unrelated tasks due to similar scene layouts, or selecting similar motions from tasks with divergent goals. We present COLLAGE, a method for COLLective data AGgrEgation in few-shot imitation learning that uses an adaptive late fusion mechanism to guide the selection of relevant demonstrations based on a task-specific combination of multiple cues. COLLAGE follows a simple, but flexible and efficient data aggregation recipe: it assigns weights to subsets of the dataset that are pre-selected using a single feature (e.g., appearance, shape, or language similarity), based on their task relevance, measured by how well a policy trained on each subset predicts actions in the few target demonstrations. These weights are then used during policy training to perform importance sampling over the aggregated dataset, sampling data more densely or sparsely, according to their estimated relevance. This weighted aggregation strategy is general and feature-agnostic, allowing COLLAGE to combine and leverage any number of subsets selected by any retrieval heuristic or method, and to identify which subset provides the most benefit for the target task. In extensive experiments, COLLAGE outperforms state-of-the-art retrieval and multi-task learning approaches, achieving a 5.1% improvement over the best baseline in simulation across 10 tasks, and a 16.6% improvement in the real world across 6 tasks.
For our real world experiments, we include data selection from the large-scale, real-world DROID dataset, significantly improving few-shot imitation policy training. More information at: https://collagecorl25.github.io/</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/kumar25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/kumar25a.html</guid>
        
        
      </item>
    
      <item>
        <title>EndoVLA: Dual-Phase Vision-Language-Action for Precise Autonomous Tracking in Endoscopy</title>
        <description>In endoscopic procedures, autonomous tracking of abnormal regions and following of circumferential cutting markers can significantly reduce the cognitive burden on endoscopists. However, conventional model-based pipelines are fragile—each component (e.g., detection, motion planning) requires manual tuning and struggles to incorporate high-level endoscopic intent, resulting in poor generalization across variable scenes. Vision–Language–Action (VLA) models, which integrate visual perception, language grounding, and motion planning within an end-to-end framework, offer a promising alternative that can semantically adapt to surgeon prompts without the need for manual recalibration. Despite their potential, applying VLA models to robotic endoscopy presents unique challenges due to the inherently complex and dynamic anatomical environments of the gastrointestinal (GI) tract. To this end, we introduce EndoVLA, designed specifically for continuum robots in GI interventions. Given endoscopic images and surgeon-issued tracking prompts, EndoVLA performs three core tasks: (1) polyp tracking, (2) delineation and following of abnormal mucosal regions, and (3) adherence to predefined circular markers during circumferential cutting. To address the unique challenges posed by data scarcity and domain shifts, we propose a dual-phase strategy, with supervised fine-tuning on our EndoVLA-Motion dataset and reinforcement fine-tuning using task-aware rewards. Our approach significantly enhances tracking performance in endoscopy and improves zero-shot generalization of tracking to general scenes and more challenging sequential tasks.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/kit25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/kit25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Disentangled Multi-Context Meta-Learning: Unlocking Robust and Generalized Task Learning</title>
        <description>In meta-learning and its downstream tasks, many methods use implicit adaptation to represent task-specific variations. However, implicit approaches hinder interpretability and make it difficult to understand which task factors drive performance. In this work, we introduce a disentangled multi-context meta-learning framework that explicitly learns separate context vectors for different aspects that define a task. By decoupling these factors, our approach improves both robustness, through deeper task understanding, and generalization, by enabling context vector sharing across tasks with the same context. We evaluate our approach in two domains. First, on a sinusoidal regression benchmark, our model outperforms baselines on out-of-distribution tasks and generalizes to unseen sine functions by sharing context vectors associated with shared amplitudes or phase shifts. Second, in a quadruped locomotion task, we disentangle the robot-specific properties and the characteristics of the terrain in the robot dynamics model. Using these context vectors in reinforcement learning, the learned policy demonstrates improved robustness under out-of-distribution conditions, compared to a model using a single unified context. Furthermore, by effectively sharing context, our model enables successful sim-to-real policy transfer to challenging terrains with out-of-distribution robot-specific properties using only real data from flat terrain, which is not achievable with single-task adaptation.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/kim25e.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/kim25e.html</guid>
        
        
      </item>
    
      <item>
        <title>UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations</title>
        <description>Mimicry is a fundamental learning mechanism in humans, enabling individuals to learn new tasks by observing and imitating experts. However, applying this ability to robots presents significant challenges due to the inherent differences between human and robot embodiments in both their visual appearance and physical capabilities. While previous methods bridge this gap using cross-embodiment datasets with shared scenes and tasks, collecting such aligned data between humans and robots at scale is not trivial. In this paper, we propose UniSkill, a novel framework that learns embodiment-agnostic skill representations from large-scale cross-embodiment video data without any labels, enabling skills extracted from human video prompts to effectively transfer to robot policies trained only on robot data. Our experiments in both simulation and real-world environments show that our cross-embodiment skills successfully guide robots in selecting appropriate actions, even with unseen video prompts.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/kim25d.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/kim25d.html</guid>
        
        
      </item>
    
      <item>
        <title>CARE: Enhancing Safety of Visual Navigation through Collision Avoidance via Repulsive Estimation</title>
        <description>We propose CARE (Collision Avoidance via Repulsive Estimation) for improving the robustness of learning-based visual navigation methods. Recently, visual navigation models, particularly foundation models, have demonstrated promising performance by generating viable trajectories using only RGB images. However, these policies can generalize poorly to environments containing out-of-distribution (OOD) scenes characterized by unseen objects or different camera setups (e.g., variations in field of view, camera pose, or focal length). Without fine-tuning, such models could produce trajectories that lead to collisions, necessitating substantial efforts in data collection and additional training. To address this limitation, we introduce CARE, an attachable module that enhances the safety of visual navigation without requiring additional range sensors or fine-tuning of pretrained models. CARE can be integrated seamlessly into any RGB-based navigation model that generates local robot trajectories. It dynamically adjusts trajectories produced by a pretrained model using repulsive force vectors computed from depth images estimated directly from RGB inputs.  We evaluate CARE by integrating it with state-of-the-art visual navigation models across diverse robot platforms. Real-world experiments show that CARE significantly reduces collisions (up to 100%) without compromising navigation performance in goal-conditioned navigation, and further improves collision-free travel distance (up to 10.7$\times$) in exploration tasks.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/kim25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/kim25c.html</guid>
        
        
      </item>
    
      <item>
        <title>BranchOut: Capturing Realistic Multimodality in Autonomous Driving Decisions</title>
        <description>Modeling the nuanced, multimodal nature of human driving remains a core challenge for autonomous systems, as existing methods often fail to capture the diversity of plausible behaviors in complex real-world scenarios. In this work, we introduce a novel benchmark and end-to-end planner for modeling realistic multimodality in autonomous driving decisions.  We propose a Gaussian Mixture Model (GMM)-based diffusion model designed to explicitly capture human-like, multimodal driving decisions in diverse contexts. Our model achieves state-of-the-art performance on current benchmarks, but reveals weaknesses in standard evaluation practices, which rely on single ground-truth trajectories or coarse closed-loop metrics while often penalizing diverse yet plausible alternatives. To address this limitation, we further develop a human-in-the-loop simulation benchmark that enables finer-grained evaluations and measures multimodal realism in challenging driving settings. Our code, models, and benchmark data will be released to promote more accurate and human-aware evaluation of autonomous driving models.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/kim25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/kim25b.html</guid>
        
        
      </item>
    
      <item>
        <title>ScrewSplat: An End-to-End Method for Articulated Object Recognition</title>
        <description>Articulated object recognition – the task of identifying both the geometry and kinematic joints of objects with movable parts – is essential for enabling robots to interact with everyday objects such as doors and laptops. However, existing approaches often rely on strong assumptions, such as a known number of articulated parts; require additional inputs, such as depth images; or involve complex intermediate steps that can introduce potential errors – limiting their practicality in real-world settings. In this paper, we introduce **ScrewSplat**, a simple end-to-end method that operates solely on RGB observations. Our approach begins by randomly initializing screw axes, which are then iteratively optimized to recover the object’s underlying kinematic structure. By integrating with Gaussian Splatting, we simultaneously reconstruct the 3D geometry and segment the object into rigid, movable parts. We demonstrate that our method achieves state-of-the-art recognition accuracy across a diverse set of articulated objects, and further enables zero-shot, text-guided manipulation using the recovered kinematic model.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/kim25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/kim25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Eye, Robot: Learning to Look to Act with a BC-RL Perception-Action Loop</title>
        <description>Humans do not passively observe the visual world—we actively look in order to act. Motivated by this principle, we introduce EyeRobot, a robotic system with gaze behavior that emerges from the need to complete real-world tasks. We develop a mechanical eyeball that can freely rotate to observe its surroundings and train a gaze policy to control it using reinforcement learning. We accomplish this by introducing a BC-RL loop trained using teleoperated demonstrations recorded with a 360 camera. The resulting video enables a simulation environment that supports rendering arbitrary eyeball viewpoints, allowing reinforcement learning of gaze behavior. The hand (BC) agent is trained from rendered eye observations, and the eye (RL) agent is rewarded when the hand produces correct actions. In this way, hand-eye coordination emerges as the eye looks towards regions which allow the hand to complete the task. We evaluate EyeRobot on five large workspace manipulation tasks and compare performance to two common camera setups: wrist and external cameras. Our experiments suggest EyeRobot exhibits hand-eye coordination that effectively facilitates behaviors such as visual search and target switching, enabling manipulation across large workspaces.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/kerr25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/kerr25a.html</guid>
        
        
      </item>
    
      <item>
        <title>GENNAV: Polygon Mask Generation for Generalized Referring Navigable Regions</title>
        <description>We focus on the task of identifying the location of target regions from a natural language instruction and a front camera image captured by a mobile platform. This task is challenging because it requires both existence prediction and segmentation mask generation, particularly for stuff-type target regions with ambiguous boundaries. Existing methods often underperform in handling stuff-type target regions, in addition to absent or multiple targets. To overcome these limitations, we propose GENNAV, which predicts target existence and generates segmentation masks for multiple stuff-type target regions.  To evaluate GENNAV, we constructed a novel benchmark called GRiN-Drive, which includes three distinct types of samples: no-target, single-target, and multi-target. GENNAV achieved superior performance over baseline methods on standard evaluation metrics. Furthermore, we conducted real-world experiments with four automobiles operated in five geographically distinct urban areas to validate its zero-shot transfer performance. In these experiments, GENNAV outperformed baseline methods and demonstrated its robustness across diverse real-world environments.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/katsumata25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/katsumata25a.html</guid>
        
        
      </item>
    
      <item>
        <title>OPAL: Visibility-aware LiDAR-to-OpenStreetMap Place Recognition via Adaptive Radial Fusion</title>
        <description>LiDAR place recognition is a critical capability for autonomous navigation and cross-modal localization in large-scale outdoor environments. Existing approaches predominantly depend on pre-built 3D dense maps or aerial imagery, which impose significant storage overhead and lack real-time adaptability. In this paper, we propose OPAL, a novel network for LiDAR place recognition that leverages OpenStreetMap (OSM) as a lightweight and up-to-date prior. Our key innovation lies in bridging the domain disparity between sparse LiDAR scans and structured OSM data through two carefully designed components. First, a cross-modal visibility mask identifies maximal observable regions from both modalities to guide feature learning. Second, an adaptive radial fusion module dynamically consolidates radial features into discriminative global descriptors. Extensive experiments on the KITTI and KITTI-360 datasets demonstrate OPAL’s superiority, achieving 15.98% higher recall at the 1 m threshold for top-1 retrieved matches, along with 12x faster inference speed compared to the state-of-the-art approach. Code and datasets will be publicly available.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/kang25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/kang25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Learning Impact-Rich Rotational Maneuvers via Centroidal Velocity Rewards and Sim-to-Real Techniques: A One-Leg Hopper Flip Case Study</title>
        <description>Dynamic rotational maneuvers, such as front flips, inherently involve large angular momentum generation and intense impact forces, presenting major challenges for reinforcement learning and sim-to-real transfer. In this work, we propose a general framework for learning and deploying impact-rich, rotation-intensive behaviors through centroidal velocity-based rewards and actuator-aware sim-to-real techniques. We identify that conventional link-level reward formulations fail to induce true whole-body rotation and introduce a centroidal angular velocity reward that accurately captures system-wide rotational dynamics. To bridge the sim-to-real gap under extreme conditions, we model motor operating regions (MOR) and apply transmission load regularization to ensure realistic torque commands and mechanical robustness. Using the one-leg hopper front flip as a representative case study, we demonstrate the first successful hardware realization of a full front flip. Our results highlight that incorporating centroidal dynamics and actuator constraints is critical for reliably executing highly dynamic motions.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/kang25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/kang25a.html</guid>
        
        
      </item>
    
      <item>
        <title>NeuralSVCD for Efficient Swept Volume Collision Detection</title>
        <description>Robot manipulation in unstructured environments requires efficient and reliable Swept Volume Collision Detection (SVCD) for safe motion planning. Traditional methods check for collisions only at discrete points along a trajectory and can miss collisions between these points, whereas SVCD continuously checks for collisions along the entire trajectory. Existing SVCD methods typically face a trade-off between efficiency and accuracy, limiting practical use. In this paper, we introduce NeuralSVCD, a novel neural encoder-decoder architecture tailored to overcome this trade-off. Our approach leverages shape locality and temporal locality through distributed geometric representations and temporal optimization. This enhances computational efficiency without sacrificing accuracy. Comprehensive experiments show that NeuralSVCD consistently outperforms existing state-of-the-art SVCD methods in terms of both collision detection accuracy and computational efficiency, demonstrating its robust applicability across diverse robotic manipulation scenarios. Code and videos are available at https://neuralsvcd.github.io/.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/jung25e.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/jung25e.html</guid>
        
        
      </item>
    
      <item>
        <title>Joint Model-based Model-free Diffusion for Planning with Constraints</title>
        <description>Model-free diffusion planners have shown great promise for robot motion planning, but practical robotic systems often require combining them with model-based optimization modules to enforce constraints, such as safety. Naïvely integrating these modules presents compatibility challenges when diffusion’s multi-modal outputs behave adversarially to optimization-based modules. To address this, we introduce Joint Model-based Model-free Diffusion (JM2D), a novel generative modeling framework. JM2D formulates module integration as a joint sampling problem to maximize compatibility via an interaction potential, without additional training. Using importance sampling, JM2D guides module outputs based only on evaluations of the interaction potential, thus handling non-differentiable objectives commonly arising from non-convex optimization modules. We evaluate JM2D by applying it to align diffusion planners with safety modules on offline RL and robot manipulation. JM2D significantly improves task performance compared to conventional safety filters without sacrificing safety. Further, we show that conditional generation is a special case of JM2D and elucidate key design choices by comparing with SOTA gradient-based and projection-based diffusion planners. More details at: \url{https://sites.google.com/view/joint-mbmf-diffusion}</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/jung25d.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/jung25d.html</guid>
        
        
      </item>
    
      <item>
        <title>ImLPR: Image-based LiDAR Place Recognition using Vision Foundation Models</title>
        <description>LiDAR Place Recognition (LPR) is a key component in robotic localization, enabling robots to align current scans with prior maps of their environment. While Visual Place Recognition (VPR) has embraced Vision Foundation Models (VFMs) to enhance descriptor robustness, LPR has relied on task-specific models with limited use of pre-trained foundation-level knowledge. This is due to the lack of 3D foundation models and the challenges of using VFMs with LiDAR point clouds. To tackle this, we introduce ImLPR, a novel pipeline that employs a pre-trained DINOv2 VFM to generate rich descriptors for LPR. To our knowledge, ImLPR is the first method to leverage a VFM to support LPR. ImLPR converts raw point clouds into Range Image Views (RIV) to leverage the VFM in the LiDAR domain. It employs MultiConv adapters and Patch-InfoNCE loss for effective feature learning. We validate ImLPR using public datasets where it outperforms state-of-the-art (SOTA) methods in intra-session and inter-session LPR with top Recall@1 and F1 scores across various LiDARs. We also demonstrate that RIV outperforms Bird’s-Eye-View (BEV) as a representation choice for adapting LiDAR for VFM. We release ImLPR as open source for the robotics community.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/jung25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/jung25c.html</guid>
        
        
      </item>
    
      <item>
        <title>Uncertainty-aware Accurate Elevation Modeling for Off-road Navigation via Neural Processes</title>
        <description>Terrain elevation modeling for off-road navigation aims to accurately estimate changes in terrain geometry in real-time and quantify the corresponding uncertainties. Having precise estimations and uncertainties plays a crucial role in planning and control algorithms to explore safe and reliable maneuver strategies. However, existing approaches, such as Gaussian Processes (GPs) and neural network-based methods, often fail to meet these needs. They either cannot run in real time due to high computational demands, underestimate sharp geometry changes, or sacrifice elevation accuracy when trained with uncertainties. Recently, Neural Processes (NPs) have emerged as a promising approach that integrates the Bayesian uncertainty estimation of GPs with the efficiency and flexibility of neural networks. Inspired by NPs, we propose an effective NP-based method that precisely estimates sharp elevation changes and quantifies the corresponding predictive uncertainty without losing elevation accuracy. Our method leverages semantic features from LiDAR and camera sensors to improve interpolation and extrapolation accuracy in unobserved regions. Also, we introduce a local ball-query attention mechanism that reduces the computational complexity of global attention by 17% while preserving crucial local and spatial information. We evaluate our method on off-road datasets with distinctive geometric features, collected from trails, deserts, and hills. Our results demonstrate superior performance over baselines and showcase the potential of neural processes for effective and expressive terrain modeling in complex off-road environments.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/jung25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/jung25b.html</guid>
        
        
      </item>
    
      <item>
        <title>$\texttt{SPIN}$: distilling $\texttt{Skill-RRT}$ for long-horizon prehensile and non-prehensile manipulation</title>
        <description>Current robots struggle with long-horizon manipulation tasks requiring sequences of prehensile and non-prehensile skills, contact-rich interactions, and long-term reasoning. We present $\texttt{SPIN}$ ($\textbf{S}$kill $\textbf{P}$lanning to $\textbf{IN}$ference), a framework that distills a computationally intensive planning algorithm into a policy via imitation learning. We propose $\texttt{Skill-RRT}$, an extension of RRT that incorporates skill applicability checks and intermediate object pose sampling for solving such long-horizon problems. To chain independently trained skills, we introduce $\textit{connectors}$, goal-conditioned policies trained to minimize object disturbance during transitions. High-quality demonstrations are generated with $\texttt{Skill-RRT}$ and distilled through noise-based replay in order to reduce online computation time. The resulting policy, trained entirely in simulation, transfers zero-shot to the real world and achieves over 80% success across three challenging long-horizon manipulation tasks and outperforms state-of-the-art hierarchical RL and planning methods.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/jung25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/jung25a.html</guid>
        
        
      </item>
    
      <item>
        <title>UnPose: Uncertainty-Guided Diffusion Priors for Zero-Shot Pose Estimation</title>
        <description>Estimating the 6D pose of novel objects is a fundamental yet challenging problem in robotics, often relying on access to object CAD models.  However, acquiring such models can be costly and impractical.  Recent approaches aim to bypass this requirement by leveraging strong priors from foundation models to reconstruct objects from single or multi-view images, but typically require additional training or produce hallucinated geometry. To this end, we propose $\textit{UnPose}$, a novel framework for zero-shot, model-free 6D object pose estimation and reconstruction that exploits 3D priors and uncertainty estimates from a pre-trained diffusion model.  Specifically, starting from a single-view RGB-D frame, $\textit{UnPose}$ uses a multi-view diffusion model to estimate an initial 3D model using a 3D Gaussian Splatting (3DGS) representation, along with pixel-wise epistemic uncertainty estimates. As additional observations become available, we incrementally refine the 3DGS model by fusing new views guided by the diffusion model’s uncertainty, thereby continuously improving the pose estimation accuracy and 3D reconstruction quality.  To ensure global consistency, the diffusion prior-generated views and subsequent observations are further integrated in a pose graph and jointly optimized into a coherent 3DGS field. Extensive experiments demonstrate that $\textit{UnPose}$ significantly outperforms existing approaches in both 6D pose estimation accuracy and 3D reconstruction quality. We further showcase its practical applicability in real-world robotic manipulation tasks.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/jiang25d.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/jiang25d.html</guid>
        
        
      </item>
    
      <item>
        <title>IRIS: An Immersive Robot Interaction System</title>
        <description>This paper introduces IRIS, an Immersive Robot Interaction System leveraging Extended Reality (XR). Existing XR-based systems enable efficient data collection but are often challenging to reproduce and reuse due to their specificity to particular robots, objects, simulators, and environments. IRIS addresses these issues by supporting immersive interaction and data collection across diverse simulators and real-world scenarios. It visualizes arbitrary rigid and deformable objects and robots from simulation, and integrates real-time sensor-generated point clouds for real-world applications. Additionally, IRIS enhances collaborative capabilities by enabling multiple users to simultaneously interact within the same virtual scene. Extensive experiments demonstrate that IRIS offers efficient and intuitive data collection in both simulated and real-world settings.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/jiang25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/jiang25c.html</guid>
        
        
      </item>
    
      <item>
        <title>BEHAVIOR Robot Suite: Streamlining Real-World Whole-Body Manipulation for Everyday Household Activities</title>
        <description>Real-world household tasks present significant challenges for mobile manipulation robots. An analysis of existing robotics benchmarks reveals that successful task performance hinges on three key whole-body control capabilities: bimanual coordination, stable and precise navigation, and extensive end-effector reachability. Achieving these capabilities requires careful hardware design, but the resulting system complexity further complicates visuomotor policy learning. To address these challenges, we introduce the BEHAVIOR Robot Suite (BRS), a comprehensive framework for whole-body manipulation in diverse household tasks. Built on a bimanual, wheeled robot with a 4-DoF torso, BRS integrates a cost-effective whole-body teleoperation interface for data collection and a novel algorithm for learning whole-body visuomotor policies. We evaluate BRS on five challenging household tasks that not only emphasize the three core capabilities but also introduce additional complexities, such as long-range navigation, interaction with articulated and deformable objects, and manipulation in confined spaces. We believe that BRS’s integrated robotic embodiment, data collection interface, and learning framework mark a significant step toward enabling real-world whole-body manipulation for everyday household tasks. BRS is open-sourced at https://behavior-robot-suite.github.io/.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/jiang25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/jiang25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Streaming Flow Policy: Simplifying diffusion/flow-matching policies by treating action trajectories as flow trajectories</title>
        <description>Recent advances in diffusion$/$flow-matching policies have enabled imitation learning of complex, multi-modal action trajectories. However, they are computationally expensive because they sample a *trajectory of trajectories*—a diffusion$/$flow trajectory of action trajectories. They discard intermediate action trajectories, and must wait for the sampling process to complete before any actions can be executed on the robot. We simplify diffusion$/$flow policies by *treating action trajectories as flow trajectories*. Instead of starting from pure noise, our algorithm samples from a narrow Gaussian around the last action. Then, it incrementally integrates a velocity field learned via flow matching to produce a sequence of actions that constitute a *single* trajectory. This enables actions to be streamed to the robot on-the-fly *during* the flow sampling process, and is well-suited for receding horizon policy execution. Despite streaming, our method retains the ability to model multi-modal behavior. We train flows that *stabilize* around demonstration trajectories to reduce distribution shift and improve imitation learning performance. Streaming flow policy outperforms prior methods while enabling faster policy execution and tighter sensorimotor loops for learning-based robot control.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/jiang25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/jiang25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Rapid Mismatch Estimation via Neural Network Informed Variational Inference</title>
        <description>With robots increasingly operating in human-centric environments, ensuring soft and safe physical interactions, whether with humans, surroundings, or other machines, is essential. While compliant hardware can facilitate such interactions, this work focuses on impedance controllers that allow torque-controlled robots to safely and passively respond to contact while accurately executing tasks. From inverse dynamics to quadratic programming based controllers, the effectiveness of these methods relies on accurate dynamics models of the robot and the object it manipulates. Any model mismatch results in task failures and unsafe behaviors. Thus, we introduce Rapid Mismatch Estimation (RME), an adaptive, controller-agnostic, probabilistic framework that estimates end-effector dynamics mismatches online, without relying on external force-torque sensors. From the robot’s proprioceptive feedback, a Neural Network Model Mismatch Estimator generates a prior for a Variational Inference solver, which rapidly converges to the unknown parameters while quantifying uncertainty. With a real 7-DoF manipulator driven by a state-of-the-art passive impedance controller, RME adapts to sudden changes in mass and center of mass at the end-effector in $\sim400$ ms, in static and dynamic settings. We demonstrate RME in a collaborative scenario where a human attaches an unknown basket to the robot’s end-effector and dynamically adds/removes heavy items, showcasing fast and safe adaptation to changing dynamics during physical interaction without any external sensory system.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/jaszczuk25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/jaszczuk25a.html</guid>
        
        
      </item>
    
      <item>
        <title>DreamGen: Unlocking Generalization in Robot Learning through Video World Models</title>
        <description>In this work, we unlock new capabilities in robot learning from neural trajectories, synthetic robot data generated from video world models. Our proposed recipe is simple but powerful: we take the most recent state-of-the-art video generative models (world models), adapt them to the target robot embodiment, and generate new, synthetic robot data of the same task or even new behaviors. Since these video world models only generate videos, we explore two techniques for obtaining robot actions: extracting latent actions from a general-purpose latent action model and getting predicted actions from an inverse-dynamics model (IDM), giving flexibility across diverse scenarios. Our proposed approach unlocks behavior and environment generalization, allowing a humanoid robot to perform 20+ new behaviors in unseen environments while only collecting teleoperation data for pick and place in a single environment. By introducing a new world modeling benchmark, we demonstrate that stronger video world models directly correlate with improved downstream robot policy performance. This establishes a new scaling dimension beyond simply collecting additional teleoperation data, changing how we approach robot learning.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/jang25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/jang25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Enabling Long(er) Horizon Imitation for Manipulation Tasks by Modeling Subgoal Transitions</title>
        <description>Imitation-based policy training for long-horizon manipulation tasks involving multi-step object interactions is often susceptible to compounding action errors. Contemporary methods discover semantic subgoals embedded within the overall task, decomposing the overall task into tractable shorter-horizon goal-conditioned policy learning. However, policy deployment requires iteratively estimating $\textit{which}$ subgoal is being pursued and $\textit{when}$ it is achieved. We observe the brittleness of conventional $\textit{heuristic}$-based approaches (ad hoc threshold based), particularly for long-horizon imitation, since pursuing an incorrect subgoal can lead the robot policy to experience out-of-distribution states. In this work, we introduce two policy architectures for modeling subgoal transitions within a policy learning loop for long-horizon tasks. The first model autoregressively predicts the likelihood of the next subgoal transition, while the second uses cross-attention (via a transformer-based architecture) and implicitly models smooth and continuous transitions. We evaluate our models on $25$ simulated tasks on Franka Kitchen, $6$ real-world table-top tasks and $18$ simulated tasks on a new corpus (Franka-Long Horizon Tasks (LHT)) focused on tasks with rich object interactions over long episode lengths. Experimental results show significant improvements in learning efficacy, task success rates and generalization to out-of-distribution settings, extending horizon lengths for imitating manipulation tasks $\textit{from long to long(er)}$.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/jain25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/jain25b.html</guid>
        
        
      </item>
    
      <item>
        <title>JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes</title>
        <description>Multi-agent reinforcement learning (MARL) has emerged as a promising solution for learning complex and scalable coordination behaviors in multi-robot systems. However, established MARL platforms (e.g., SMAC and MPE) lack robotics relevance and hardware deployment, leaving multi-robot learning researchers to develop bespoke environments and hardware testbeds dedicated to the development and evaluation of their individual contributions. The Multi-Agent RL Benchmark and Learning Environment for the Robotarium (MARBLER) is an exciting recent step in providing a standardized robotics-relevant platform for MARL, by bridging the Robotarium testbed with existing MARL software infrastructure. However, MARBLER lacks support for parallelization and GPU/TPU execution, making the platform prohibitively slow compared to modern MARL environments and hindering adoption. We contribute JaxRobotarium, a Jax-powered end-to-end simulation, learning, deployment, and benchmarking platform for the Robotarium. JaxRobotarium enables rapid training and deployment of multi-robot reinforcement learning (MRRL) policies with realistic robot dynamics and safety constraints, supporting both parallelization and hardware acceleration. Our generalizable learning interface provides an easy-to-use integration with SOTA MARL libraries (e.g., JaxMARL). In addition, JaxRobotarium includes eight standardized coordination scenarios, including four novel scenarios that bring established MARL benchmark tasks (e.g., RWARE and Level-Based Foraging) to a realistic robotics setting. We demonstrate that JaxRobotarium retains high simulation fidelity while achieving dramatic speedups over baseline (20x in training and 150x in simulation), and provides an open-access sim-to-real evaluation pipeline through the Robotarium testbed, accelerating and democratizing access to multi-robot learning research and evaluation.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/jain25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/jain25a.html</guid>
        
        
      </item>
    
      <item>
        <title>CaRL: Learning Scalable Planning Policies with Simple Rewards</title>
        <description>We investigate reinforcement learning (RL) for privileged planning in autonomous driving. State-of-the-art approaches for this task are rule-based, but these methods do not scale to the long tail. RL, on the other hand, is scalable and does not suffer from compounding errors like imitation learning. Contemporary RL approaches for driving use complex shaped rewards that sum multiple individual rewards, e.g., progress, position, or orientation rewards. We show that PPO fails to optimize a popular version of these rewards when the mini-batch size is increased, which limits the scalability of these approaches. Instead, we propose a new reward design based primarily on optimizing a single intuitive reward term: route completion. Infractions are penalized by terminating the episode or multiplicatively reducing route completion. We find that PPO scales well with higher mini-batch sizes when trained with our simple reward, even improving performance. Training with large mini-batch sizes enables efficient scaling via distributed data parallelism. We scale PPO to 300M samples in CARLA and 500M samples in nuPlan with a single 8-GPU node. The resulting model achieves 64 DS on the CARLA longest6 v2 benchmark, outperforming other RL methods with more complex rewards by a large margin. Requiring only minimal adaptations from its use in CARLA, the same method is the best learning-based approach on nuPlan. It scores 91.3 in non-reactive and 90.6 in reactive traffic on the Val14 benchmark while being an order of magnitude faster than prior work.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/jaeger25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/jaeger25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Fail2Progress: Learning from Real-World Robot Failures with Stein Variational Inference</title>
        <description>Skill effect models for long-horizon manipulation tasks are prone to failures in conditions not covered by training data distributions. Therefore, enabling robots to reason about and learn from failures is necessary. We investigate the problem of efficiently generating a dataset targeted to observed failures. After fine-tuning a skill effect model on this dataset, we evaluate the extent to which the model can recover from failures and minimize future failures. We propose Fail2Progress, an approach that leverages Stein variational inference to generate multiple simulation environments in parallel, enabling efficient data sample generation similar to observed failures. Our method is capable of handling several challenging mobile manipulation tasks, including transporting multiple objects, organizing a constrained shelf, and tabletop organization. Through large-scale simulation and real-world experiments, we demonstrate that our approach excels at learning from failures across different numbers of objects. Furthermore, we show that Fail2Progress outperforms several baselines.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/huang25d.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/huang25d.html</guid>
        
        
      </item>
    
      <item>
        <title>ParticleFormer: A 3D Point Cloud World Model for Multi-Object, Multi-Material Robotic Manipulation</title>
        <description>3D world models (i.e., learning-based 3D dynamics models) offer a promising approach to generalizable robotic manipulation by capturing the underlying physics of environment evolution conditioned on robot actions. However, existing 3D world models are primarily limited to single-material dynamics using a particle-based Graph Neural Network model, and often require time-consuming 3D scene reconstruction to obtain 3D particle tracks for training. In this work, we present ParticleFormer, a Transformer-based point cloud world model trained with a hybrid point cloud reconstruction loss, supervising both global and local dynamics features in multi-material, multi-object robot interactions. ParticleFormer captures fine-grained multi-object interactions between rigid, deformable, and flexible materials, trained directly from real-world robot perception data without an elaborate scene reconstruction. We demonstrate the model’s effectiveness both in 3D scene forecasting tasks and in downstream manipulation tasks using a Model Predictive Control (MPC) policy. In addition, we extend existing dynamics learning benchmarks to include diverse multi-material, multi-object interaction scenarios. We validate our method on six simulation and three real-world experiments, where it consistently outperforms leading baselines by achieving superior dynamics prediction accuracy and lower rollout error in downstream visuomotor tasks. Experimental videos are available at https://particleformer.github.io/.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/huang25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/huang25c.html</guid>
        
        
      </item>
    
      <item>
        <title>VT-Refine: Learning Bimanual Assembly with Visuo-Tactile Feedback via Simulation Fine-Tuning</title>
        <description>Humans excel at bimanual assembly tasks by adapting to rich tactile feedback—a capability that remains difficult to replicate in robots through behavioral cloning alone, due to the suboptimality and limited diversity of human demonstrations. In this work, we present VT-Refine, a visuo-tactile policy learning framework that combines real-world demonstrations, high-fidelity tactile simulation, and reinforcement learning to tackle precise, contact-rich bimanual assembly. We begin by training a diffusion policy on a small set of demonstrations using synchronized visual and tactile inputs. This policy is then transferred to a simulated digital twin equipped with simulated tactile sensors and further refined via large-scale reinforcement learning to enhance robustness and generalization. To enable accurate sim-to-real transfer, we leverage high-resolution piezoresistive tactile sensors that provide normal force signals and can be realistically modeled in parallel using GPU-accelerated simulation. Experimental results show that VT-Refine improves assembly performance in both simulation and the real world by increasing data diversity and enabling more effective policy fine-tuning. Our project page is available at https://vt-refine.github.io/.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/huang25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/huang25b.html</guid>
        
        
      </item>
    
      <item>
        <title>LaDi-WM: A Latent Diffusion-Based World Model for Predictive Manipulation</title>
        <description>Predictive manipulation has recently gained considerable attention in the Embodied AI community due to its potential to improve robot policy performance by leveraging predicted states. However, generating accurate future visual states of robot-object interactions from world models remains a well-known challenge, particularly in achieving high-quality pixel-level representations. To this end, we propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling. Specifically, LaDi-WM leverages the well-established latent space aligned with pre-trained Visual Foundation Models (VFMs), which comprises both geometric features (DINO-based) and semantic features (CLIP-based). We find that predicting the evolution of the latent space is easier to learn and more generalizable than directly predicting pixel-level images. Building on LaDi-WM, we design a diffusion policy that iteratively refines output actions by incorporating forecasted states, thereby generating more consistent and accurate results. Extensive experiments on both synthetic and real-world benchmarks demonstrate that LaDi-WM significantly enhances policy performance, by 27.9% on the LIBERO-LONG benchmark and by 20% in the real-world scenario. Furthermore, our world model and policies achieve impressive generalizability in real-world experiments. The full source code will be publicly available.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/huang25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/huang25a.html</guid>
        
        
      </item>
    
      <item>
        <title>See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation</title>
        <description>We present See, Point, Fly (SPF), a training-free aerial vision-and-language navigation (AVLN) framework built atop vision-language models (VLMs). SPF is capable of navigating to any goal based on any type of free-form instruction in any kind of environment. In contrast to existing VLM-based approaches that treat action prediction as a text generation task, our key insight is to consider action prediction for AVLN as a 2D spatial grounding task. SPF harnesses VLMs to decompose vague language instructions into iterative annotation of 2D waypoints on the input image. Along with the predicted traveling distance, SPF transforms predicted 2D waypoints into 3D displacement vectors as action commands for UAVs. Moreover, SPF also adaptively adjusts the traveling distance to facilitate more efficient navigation. Notably, SPF performs navigation in a closed-loop control manner, enabling UAVs to follow dynamic targets in dynamic environments. SPF sets a new state of the art on the DRL simulation benchmark, outperforming the previous best method by an absolute margin of 63%. In extensive real-world evaluations, SPF outperforms strong baselines by a large margin. We also conduct comprehensive ablation studies to highlight the effectiveness of our design choices. Lastly, SPF shows remarkable generalization to different VLMs.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/hu25e.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/hu25e.html</guid>
        
        
      </item>
    
      <item>
        <title>Granular loco-manipulation: Repositioning rocks through strategic sand avalanche</title>
        <description>Legged robots have the potential to leverage obstacles to climb steep sand slopes. However, efficiently repositioning these obstacles to desired locations is challenging. Here we present DiffusiveGRAIN, a learning-based method that enables a multi-legged robot to strategically induce localized sand avalanches during locomotion and indirectly manipulate obstacles. We conducted 375 trials, systematically varying obstacle spacing, robot orientation, and leg actions in 75 of them. Results show that the movements of closely-spaced obstacles exhibit significant interference, requiring joint modeling. In addition, different multi-leg excavation actions could cause distinct robot state changes, necessitating integrated planning of manipulation and locomotion. To address these challenges, DiffusiveGRAIN includes a diffusion-based environment predictor to capture multi-obstacle movements under granular flow interferences and a robot state predictor to estimate changes in robot state from multi-leg action patterns. Deployment experiments (90 trials) demonstrate that by integrating the environment and robot state predictors, the robot can autonomously plan its movements based on loco-manipulation goals, successfully shifting closely located rocks to desired locations in over 65% of trials. Our study showcases the potential for a locomoting robot to strategically manipulate obstacles to achieve improved mobility on challenging terrains.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/hu25d.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/hu25d.html</guid>
        
        
      </item>
    
      <item>
        <title>ComposableNav: Instruction-Following Navigation in Dynamic Environments via Composable Diffusion</title>
        <description>This paper considers the problem of enabling robots to navigate dynamic environments while following instructions.  The challenge lies in the combinatorial nature of instruction specifications: each instruction can include multiple specifications, and the number of possible specification combinations grows exponentially as the robot’s skill set expands. For example, “overtake the pedestrian while staying on the right side of the road” consists of two specifications: *&quot;overtake the pedestrian&quot;* and *&quot;walk on the right side of the road.&quot;* To tackle this challenge, we propose ComposableNav, based on the intuition that following an instruction involves independently satisfying its constituent specifications, each corresponding to a distinct motion primitive.  Using diffusion models, ComposableNav learns each primitive separately, then composes them in parallel at deployment time to satisfy novel combinations of specifications unseen in training.  Additionally, to avoid the onerous need for demonstrations of individual motion primitives, we propose a two-stage training procedure: (1) supervised pre-training to learn a base diffusion model for dynamic navigation, and (2) reinforcement learning fine-tuning that molds the base model into different motion primitives. Through simulation and real-world experiments, we show that ComposableNav enables robots to follow instructions by generating trajectories that satisfy diverse and unseen combinations of specifications, significantly outperforming both non-compositional VLM-based policies and costmap composing baselines.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/hu25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/hu25c.html</guid>
        
        
      </item>
    
      <item>
        <title>SLAC: Simulation-Pretrained Latent Action Space for Whole-Body Real-World RL</title>
        <description>Building capable household and industrial robots requires mastering the control of versatile, high-degree-of-freedom (DoF) systems such as mobile manipulators. While reinforcement learning (RL) holds promise for autonomously acquiring robot control policies, scaling it to high-DoF embodiments remains challenging. Direct RL in the real world demands both safe exploration and high sample efficiency, which are difficult to achieve in practice. Sim-to-real RL, on the other hand, is often brittle due to the reality gap. This paper introduces SLAC, a method that renders real-world RL feasible for complex embodiments by leveraging a low-fidelity simulator to pretrain a task-agnostic latent action space. SLAC trains this latent action space via a customized unsupervised skill discovery method designed to promote temporal abstraction, disentanglement, and safety, thereby facilitating efficient downstream learning. Once a latent action space is learned, SLAC uses it as the action interface for a novel off-policy RL algorithm to autonomously learn downstream tasks through real-world interactions. We evaluate SLAC against existing methods on a suite of bimanual mobile manipulation tasks, where it achieves state-of-the-art performance. Notably, SLAC learns contact-rich whole-body tasks in under an hour of real-world interactions, without relying on any demonstrations or hand-crafted behavior priors.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/hu25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/hu25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Robot Trains Robot: Automatic Real-World Policy Adaptation and Learning for Humanoids</title>
        <description>Simulation-based reinforcement learning (RL) has significantly advanced humanoid locomotion tasks, yet direct real-world RL from scratch or starting from pretrained policies remains rare, limiting the full potential of humanoid robots. Real-world training, despite being crucial for overcoming the sim-to-real gap, faces substantial challenges related to safety, reward design, and learning efficiency. To address these limitations, we propose Robot-Trains-Robot (RTR), a novel framework where a robotic arm teacher actively supports and guides a humanoid student robot. The RTR system provides protection, scheduling, rewards, perturbations, failure detection, and automatic resets, enabling efficient long-term real-world training with minimal human intervention. Furthermore, we propose a novel RL pipeline that facilitates and stabilizes sim-to-real transfer by optimizing a single dynamics-encoded latent variable in the real world. We validate our method through two challenging real-world humanoid tasks: fine-tuning a walking policy for precise speed tracking and learning a humanoid swing-up task from scratch, illustrating the promising capabilities of real-world humanoid learning realized by RTR-style systems.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/hu25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/hu25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation</title>
        <description>We present TacX, the first multisensory touch representations across four tactile modalities: image, audio, motion, and pressure. Trained on 1M contact-rich interactions collected with the Digit 360 sensor, TacX captures complementary touch signals at diverse temporal and spatial scales. By leveraging self-supervised learning, TacX fuses these modalities into a unified representation that captures physical properties useful for downstream robot manipulation tasks. We study how to effectively integrate real-world touch representations for both imitation learning and tactile adaptation of sim-trained policies, showing that TacX boosts policy success rates by 63% over an end-to-end model using tactile images and improves robustness by 90% in recovering object states from touch. Finally, we benchmark TacX’s ability to infer physical properties, such as object-action identification, material-quantity estimation, and force estimation. TacX improves accuracy in characterizing physical properties by 48% compared to end-to-end approaches, demonstrating the advantages of multisensory pretraining for capturing features essential for dexterous manipulation.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/higuera25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/higuera25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Latent Theory of Mind: A Decentralized Diffusion Architecture for Cooperative Manipulation</title>
        <description>We present Latent Theory of Mind (LatentToM), a decentralized diffusion policy architecture for collaborative robot manipulation. Our policy allows multiple manipulators with their own perception and computation to collaborate with each other towards a common task goal with or without explicit communication. Our key innovation lies in allowing each agent to maintain two latent representations: an ego embedding specific to the robot, and a consensus embedding trained to be common to both robots, despite their different sensor streams and poses. We further let each robot train a decoder to infer the other robot’s ego embedding from their consensus embedding, akin to &quot;theory of mind&quot; in latent space. Training occurs centrally, with all the policies’ consensus encoders supervised by a loss inspired by sheaf theory, a mathematical theory for clustering data on a topological manifold. Specifically, we introduce a first-order cohomology loss to enforce sheaf-consistent alignment of the consensus embeddings. To preserve the expressiveness of the consensus embedding, we further propose structural constraints based on theory of mind and a directional consensus mechanism. Execution can be fully distributed, requiring no explicit communication between policies; in this case, information is exchanged implicitly through each robot’s sensor stream by observing the actions of the other robots and their effects on the scene. Alternatively, execution can leverage direct communication to share the robots’ consensus embeddings, where the embeddings are shared once during each inference step and are aligned using the sheaf Laplacian. While we tested our method using two manipulators, our approach can naturally be extended to an arbitrary number of agents.
In our hardware experiments, LatentToM outperforms a naive decentralized diffusion baseline, and shows comparable performance to a state-of-the-art centralized diffusion policy for bi-manual manipulation. Additionally, we show that LatentToM is naturally robust to temporary robot failure or delays, while a centralized policy may fail.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/he25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/he25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Train-Once Plan-Anywhere Kinodynamic Motion Planning via Diffusion Trees</title>
        <description>Kinodynamic motion planning is concerned with computing collision-free trajectories while abiding by the robot’s dynamic constraints. This critical problem is often tackled using sampling-based planners (SBPs) that explore the robot’s high-dimensional state space by constructing a search tree via action propagations. Although SBPs can offer global guarantees on completeness and solution quality, their performance is often hindered by slow exploration due to uninformed action sampling. Learning-based approaches can yield significantly faster runtimes, yet they fail to generalize to out-of-distribution (OOD) scenarios and lack critical guarantees, e.g., safety, thus limiting their deployment on physical robots. We present Diffusion Tree (DiTree): a provably generalizable framework leveraging diffusion policies (DPs) as informed samplers to efficiently guide state-space search within SBPs. DiTree combines DPs’ ability to model complex distributions of expert trajectories, conditioned on local observations, with the completeness of SBPs, to yield provably safe solutions within a few action propagation iterations for complex dynamical systems. We demonstrate DiTree’s power with an implementation combining the popular RRT planner with a DP action sampler trained on a single environment. In comprehensive evaluations on OOD scenarios, DiTree has comparable runtimes to a standalone DP (4x faster than classical SBPs), while improving the success rate over DPs and SBPs (on average).</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/hassidof25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/hassidof25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Mechanistic Interpretability for Steering Vision-Language-Action Models</title>
        <description>Vision-Language-Action (VLA) models are a promising path to realizing generalist embodied agents that can quickly adapt to new tasks, modalities, and environments. However, methods for interpreting and steering VLAs fall far short of classical robotics pipelines, which are grounded in explicit models of kinematics, dynamics, and control. This lack of mechanistic insight is a central challenge for deploying learned policies in real-world robotics, where robustness and explainability are critical. Motivated by advances in mechanistic interpretability for large language models, we introduce the first framework for interpreting and steering VLAs via their internal representations, enabling direct intervention in model behavior at inference time. We project feedforward activations within transformer layers onto the token embedding basis, identifying sparse semantic directions, such as speed and direction, that are causally linked to action selection. Leveraging these findings, we introduce a general-purpose activation steering method that modulates behavior in real time, without fine-tuning, reward signals, or environment interaction. We evaluate this method on two recent open-source VLAs, Pi0 and OpenVLA, and demonstrate zero-shot behavioral control in simulation (LIBERO) and on a physical robot (UR5). This work demonstrates that interpretable components of embodied VLAs can be systematically harnessed for control—establishing a new paradigm for transparent and steerable foundation models in robotics.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/haon25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/haon25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Force-Modulated Visual Policy for Robot-Assisted Dressing with Arm Motions</title>
        <description>Robot-assisted dressing has the potential to significantly improve the lives of individuals with mobility impairments. To ensure an effective and comfortable dressing experience, the robot must be able to handle challenging deformable garments, apply appropriate forces, and adapt to limb movements throughout the dressing process. Prior work often makes simplifying assumptions—such as static human limbs during dressing—which limits real-world applicability. In this work, we develop a robot-assisted dressing system capable of handling partial observations with visual occlusions, as well as robustly adapting to arm motions during the dressing process. Given a policy trained in simulation with partial observations, we propose a method to fine-tune it in the real world using a small amount of data and multi-modal feedback from vision and force sensing, to further improve the policy’s adaptability to arm motions and enhance safety. We evaluate our method in simulation with simplified articulated human meshes and in a real-world human study with 12 participants across 264 dressing trials. Our policy successfully dresses two long-sleeve everyday garments onto the participants while being adaptive to various kinds of arm motions, and greatly outperforms prior baselines in terms of task completion and user feedback.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/hao25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/hao25b.html</guid>
        
        
      </item>
    
      <item>
        <title>CHD: Coupled Hierarchical Diffusion for Long-Horizon Tasks</title>
        <description>Diffusion-based planners have shown strong performance in short-horizon tasks but often fail in complex, long-horizon settings. We trace the failure to loose coupling between high-level (HL) sub-goal selection and low-level (LL) trajectory generation, which leads to incoherent plans and degraded performance. We propose Coupled Hierarchical Diffusion (CHD), a framework that models HL sub-goals and LL trajectories jointly within a unified diffusion process. A shared classifier passes LL feedback upstream so that sub-goals self-correct while sampling proceeds. This tight HL–LL coupling improves trajectory coherence and enables scalable long-horizon diffusion planning. Experiments across maze navigation, tabletop manipulation, and household environments show that CHD consistently outperforms both flat and hierarchical diffusion baselines.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/hao25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/hao25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Wheeled Lab: Modern Sim2Real for Low-cost, Open-source Wheeled Robotics</title>
        <description>Simulation has been pivotal in recent robotics milestones and is poised to play a prominent role in the field’s future. However, recent robotic advances often rely on expensive and high-maintenance platforms, limiting access to broader robotics audiences. This work introduces Wheeled Lab, a framework for integrating the low-cost, open-source wheeled platforms that are already widely established in education and research with Isaac Lab, an open-source, widely adopted, and rapidly growing simulation framework for robotics research. Wheeled Lab thus introduces modern Sim2Real techniques, such as domain randomization, sensor simulation, and end-to-end learning, to new user communities. To kickstart educational uses, we demonstrate three state-of-the-art policies for small-scale RC cars: controlled drifting, elevation traversal, and visual navigation, each trained in simulation and deployed zero-shot via reinforcement learning. By bridging the gap between advanced Sim2Real methods and affordable, available robotics, Wheeled Lab aims to democratize access to cutting-edge tools, fostering innovation and education in a broader robotics context. The full stack, from hardware to software, is low cost and open-source.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/han25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/han25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Point Policy: Unifying Observations and Actions with Key Points for Robot Manipulation</title>
        <description>Building robotic agents capable of operating across diverse environments and object types remains a significant challenge, often requiring extensive data collection. This is particularly restrictive in robotics, where each data point must be physically executed in the real world. Consequently, there is a critical need for alternative data sources for robotics and frameworks that enable learning from such data. In this work, we present Point Policy, a new method for learning robot policies exclusively from offline human demonstration videos without any teleoperation data. Point Policy leverages state-of-the-art vision models and policy architectures to translate human hand poses into robot poses while capturing object states through semantically meaningful key points. This approach yields a morphology-agnostic representation that facilitates effective policy learning. Through experiments on a diverse set of real-world tasks, we demonstrate that Point Policy significantly outperforms prior methods for policy learning from human videos, performing well not only within the training distribution but also generalizing to novel object instances and cluttered environments. Videos of the robot are best viewed at anon-point-policy.github.io.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/haldar25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/haldar25a.html</guid>
        
        
      </item>
    
      <item>
        <title>DEQ-MPC : Deep Equilibrium Model Predictive Control</title>
        <description>Incorporating task-specific priors within a policy or network architecture is crucial for enhancing safety and improving representation and generalization in robotic control problems. Differentiable Model Predictive Control (MPC) layers have proven effective for embedding these priors, such as constraints and cost functions, directly within the architecture, enabling end-to-end training. However, current methods often treat the solver and the neural network as separate, independent entities, leading to suboptimal integration. In this work, we propose a novel approach that co-develops the solver and architecture, unifying the optimization solver and network inference problems. Specifically, we formulate this as a \textit{joint fixed-point problem} over the coupled network outputs and necessary conditions of the optimization problem. We solve this problem in an iterative manner where we alternate between network forward passes and optimization iterations. Through extensive ablations in various robotic control tasks, we demonstrate that our approach results in richer representations and more stable training, while naturally accommodating warm starting, a key requirement for MPC.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/gurumurthy25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/gurumurthy25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Adapting by Analogy: OOD Generalization of Visuomotor Policies via Functional Correspondence</title>
        <description>End-to-end visuomotor policies trained using behavior cloning have shown a remarkable ability to generate complex, multi-modal low-level robot behaviors. However, at deployment time, these policies still struggle to act reliably when faced with out-of-distribution (OOD) visuals induced by objects, backgrounds, or environment changes. Prior works in interactive imitation learning solicit corrective expert demonstrations under the OOD conditions—but this can be costly and inefficient. We observe that task success under OOD conditions does not always warrant novel robot behaviors. In-distribution (ID) behaviors can directly be transferred to OOD conditions that share functional similarities with ID conditions. For example, behaviors trained to interact with in-distribution (ID) pens can apply to interacting with a visually-OOD pencil. The key challenge lies in disambiguating which ID observations functionally correspond to the OOD observation for the task at hand. We propose that an expert can provide this OOD-to-ID functional correspondence. Thus, instead of collecting new demonstrations and re-training at every OOD encounter, our method: (1) detects the need for feedback by checking if current observations are OOD and the most similar training observations show divergent behaviors (2) solicits functional correspondence feedback to disambiguate between those behaviors, and (3) intervenes on the OOD observations with the functionally corresponding ID observations to perform deployment-time generalization. We validate our method across diverse real-world robotic manipulation tasks with a Franka Panda robotic manipulator. Our results show that test-time functional correspondences can improve the generalization of a vision-based diffusion policy to OOD objects and environment conditions with low feedback.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/gupta25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/gupta25a.html</guid>
        
        
      </item>
    
      <item>
        <title>DemoSpeedup: Accelerating Visuomotor Policies via Entropy-Guided Demonstration Acceleration</title>
        <description>Imitation learning has shown great promise in robotic manipulation, but the policy’s execution is often unsatisfactorily slow due to the typically tardy demonstrations collected by human operators. In this work, we present DemoSpeedup, a self-supervised method to accelerate visuomotor policy execution via entropy-guided demonstration acceleration. DemoSpeedup starts by training an arbitrary generative policy (e.g., ACT or Diffusion Policy) on normal-speed demonstrations, which serves as a per-frame action entropy estimator. The key insight is that frames with lower action entropy estimates call for more consistent policy behaviors, which often indicate the demands for higher-precision operations. In contrast, frames with higher entropy estimates correspond to more casual sections, and therefore can be more safely accelerated. Thus, we segment the original demonstrations according to the estimated entropy, and accelerate them by down-sampling at rates that increase with the entropy values. Trained with the sped-up demonstrations, the resulting policies execute up to 3 times faster while maintaining the task completion performance. Interestingly, these policies could even achieve higher success rates than those trained with normal-speed demonstrations, due to the benefits of reduced decision-making horizons.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/guo25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/guo25a.html</guid>
        
        
      </item>
    
      <item>
        <title>From Tabula Rasa to Emergent Abilities: Discovering Robot Skills via Real-World Unsupervised Quality-Diversity</title>
        <description>Autonomous skill discovery aims to enable robots to acquire diverse behaviors without explicit supervision. Learning such behaviors directly on physical hardware remains challenging due to safety and data efficiency constraints. Existing methods, including Quality-Diversity Actor-Critic (QDAC), require manually defined skill spaces and carefully tuned heuristics, limiting real-world applicability. We propose Unsupervised Real-world Skill Acquisition (URSA), an extension of QDAC that enables robots to autonomously discover and master diverse, high-performing skills directly in the real world. We demonstrate that URSA successfully discovers diverse locomotion skills on a Unitree A1 quadruped in both simulation and the real world. Our approach supports both heuristic-driven skill discovery and fully unsupervised settings. We also show that the learned skill repertoire can be reused for downstream tasks such as real-world damage adaptation, where URSA outperforms all baselines in 5 out of 9 simulated and 3 out of 5 real-world damage scenarios. Our results establish a new framework for real-world robot learning that enables continuous skill discovery with limited human intervention, representing a significant step toward more autonomous and adaptable robotic systems.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/grillotti25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/grillotti25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Estimating Value of Assistance for Online POMDP Robotic Agents</title>
        <description>Robotic agents operating in dynamic, partially observable environments often benefit from teammate assistance. We address the challenge of determining when and how to assist in multi-robot systems where agents can modify the physical environment, such as moving obstacles that block perception or manipulation. For robots using online POMDP planning, evaluating assistance impacts requires computationally intensive policy evaluation, making real-time decisions difficult. We formulate Value of Assistance (VOA) for POMDP agents and develop efficient heuristics that approximate VOA without requiring complete policy evaluation. Our empirical evaluation on both a standard POMDP benchmark and a collaborative manipulation task demonstrates that our Full Information heuristic enables real-time assistance decisions while maintaining sufficient accuracy for effective helping action selection.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/goshen25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/goshen25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Geometric Red-Teaming for Robotic Manipulation</title>
        <description>Standard evaluation protocols in robotic manipulation typically assess policy performance over curated, in-distribution test sets, offering limited insight into how systems fail under plausible variation.  We introduce a red-teaming framework that probes robustness through object-centric geometric perturbations, automatically generating CrashShapes—structurally valid, user-constrained mesh deformations that trigger catastrophic failures in pre-trained manipulation policies.  The method integrates a Jacobian field–based deformation model with a gradient-free, simulator-in-the-loop optimization strategy. Across insertion, articulation, and grasping tasks, our approach consistently discovers deformations that collapse policy performance, revealing brittle failure modes missed by static benchmarks.  By combining task-level policy rollouts with constraint-aware shape exploration, we aim to build a general purpose framework for structured, object-centric robustness evaluation in robotic manipulation. We additionally show that fine-tuning on individual CrashShapes, a process we refer to as blue-teaming, improves task success by up to 60 percentage points on those shapes, while preserving performance on the original object, demonstrating the utility of red-teamed geometries for targeted policy refinement. Finally, we validate both red-teaming and blue-teaming results with a real robotic arm, observing that simulated CrashShapes reduce task success from 90% to as low as 22.5%, and that blue-teaming recovers performance to up to 90% on the corresponding real-world geometry—closely matching simulation outcomes.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/goel25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/goel25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering</title>
        <description>As robots become increasingly capable of operating over extended periods—spanning days, weeks, and even months—they are expected to accumulate knowledge of their environments and leverage this experience to assist humans more effectively. This paper studies the problem of Long-term Active Embodied Question Answering (LA-EQA), a new task in which a robot must both recall past experiences and actively explore its environment to answer complex, temporally-grounded questions. Unlike traditional EQA settings, which typically focus either on understanding the present environment alone or on recalling a single past observation, LA-EQA challenges an agent to reason over past, present, and possible future states, deciding when to explore, when to consult its memory, and when to stop gathering observations and provide a final answer. Standard EQA approaches based on large models struggle in this setting due to limited context windows, absence of persistent memory, and an inability to combine memory recall with active exploration. To address this, we propose a structured memory system for robots, inspired by the mind palace method from cognitive science. Our method encodes episodic experiences as scene-graph-based world instances, forming a reasoning and planning algorithm that enables targeted memory retrieval and guided navigation. To balance the exploration-recall trade-off, we introduce a value-of-information-based stopping criterion that determines when the agent has gathered sufficient information. We evaluate our method on real-world experiments and introduce a new benchmark that spans popular simulation environments and actual industrial sites. Our approach significantly outperforms state-of-the-art baselines, yielding substantial gains in both answer accuracy and exploration efficiency.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/ginting25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/ginting25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Off Policy Lyapunov Stability in Reinforcement Learning</title>
        <description>Traditional reinforcement learning lacks the ability to provide stability guarantees. More recent algorithms learn Lyapunov functions alongside the control policies to ensure stable learning. However, the current self-learned Lyapunov functions are sample inefficient due to their on-policy nature. This paper introduces a method for learning Lyapunov functions off-policy and incorporates the proposed off-policy Lyapunov function into the Soft Actor Critic and Proximal Policy Optimization algorithms to provide them with a data efficient stability certificate. Simulations of an inverted pendulum and a quadrotor illustrate the improved performance of the two algorithms when endowed with the proposed off-policy Lyapunov function.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/gill25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/gill25a.html</guid>
        
        
      </item>
    
      <item>
        <title>One View, Many Worlds: Single-Image to 3D object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation</title>
        <description>Estimating the 6D pose of arbitrary objects from a single reference image is a critical yet challenging task in robotics, especially considering the long-tail distribution of real-world instances. While category-level and model-based approaches have achieved notable progress, they remain limited in generalizing to unseen objects under one-shot settings. In this work, we propose a novel pipeline for fast and accurate one-shot 6D pose and scale estimation. Leveraging recent advances in single-view 3D generation, we first build high-fidelity textured meshes without requiring known object poses. To resolve scale ambiguity, we introduce a coarse-to-fine alignment module that estimates both object size and initial pose by matching 2D-3D features with depth information. We then generate a diversified set of plausible 3D models using text-guided generative augmentation and render them with Blender to synthesize large-scale, domain-randomized training data for pose estimation. This synthetic data bridges the domain gap and enables robust fine-tuning of pose estimators. Our method achieves state-of-the-art results on several 6D pose benchmarks, and we further validate its effectiveness on a newly collected in-the-wild dataset. Finally, we integrate our system with a dexterous hand, demonstrating its robustness in real-world robotic grasping tasks. All code, data, and models will be released to foster future research.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/geng25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/geng25a.html</guid>
        
        
      </item>
    
      <item>
        <title>ObjectReact: Learning Object-Relative Control for Visual Navigation</title>
        <description>Visual navigation using only a single camera and a topological map has recently become an appealing alternative to methods that require additional sensors and 3D maps. This is typically achieved through an &quot;image-relative&quot; approach to estimating control from a given pair of current observation and subgoal image. However, image-level representations of the world have limitations because images are strictly tied to the agent’s pose and embodiment. In contrast, objects, being a property of the map, offer an embodiment- and trajectory-invariant world representation. In this work, we present a new paradigm of learning &quot;object-relative&quot; control that exhibits several desirable characteristics: a) new routes can be traversed without strictly requiring to imitate prior experience, b) the control prediction problem can be decoupled from solving the image matching problem, and c) high invariance can be achieved in cross-embodiment deployment for variations across both training-testing and mapping-execution settings. We propose a topometric map representation in the form of a &quot;relative&quot; 3D scene graph, which is used to obtain more informative object-level global path planning costs. We train a local controller, dubbed &quot;ObjectReact&quot;, conditioned directly on a high-level &quot;WayObject Costmap&quot; representation that eliminates the need for an explicit RGB input. We demonstrate the advantages of learning object-relative control over its image-relative counterpart across sensor height variations and multiple navigation tasks that challenge the underlying spatial understanding capability, e.g., navigating a map trajectory in the reverse direction. We further show that our sim-only policy is able to generalize well to real-world indoor environments.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/garg25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/garg25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Real-Time Out-of-Distribution Failure Prevention via Multi-Modal Reasoning</title>
        <description>Foundation models can provide robust high-level reasoning on appropriate safety interventions in hazardous scenarios beyond a robot’s training data, i.e. out-of-distribution (OOD) failures. However, due to the high inference latency of Large Vision and Language Models, current methods rely on manually defined intervention policies to enact fallbacks, thereby lacking the ability to plan generalizable, semantically safe motions. To overcome these challenges we present FORTRESS, a framework that generates and reasons about semantically safe fallback strategies in real time to prevent OOD failures. At a low frequency in nominal operations, FORTRESS uses multi-modal reasoners to identify goals and anticipate failure modes. When a runtime monitor triggers a fallback response, FORTRESS rapidly synthesizes plans to fallback goals while inferring and avoiding semantically unsafe regions in real time. By bridging open-world, multi-modal reasoning with dynamics-aware planning, we eliminate the need for hard-coded fallbacks and human safety interventions. FORTRESS outperforms on-the-fly prompting of slow reasoning models in safety classification accuracy on synthetic benchmarks and real-world ANYmal robot data, and further improves system safety and planning success in simulation and on quadrotor hardware for urban navigation. Website and code can be found at https://submfort.github.io/fortress/.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/ganai25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/ganai25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Contrastive Forward Prediction Reinforcement Learning for Adaptive Fault-Tolerant Legged Robots</title>
        <description>In complex environments, adaptive and fault-tolerant capabilities are essential for legged robot locomotion. To address this challenge, this study proposes a reinforcement learning framework that integrates contrastive learning with forward prediction to achieve fault-tolerant locomotion for legged robots. This framework constructs a forward prediction model with contrastive learning, incorporating a comparator and a forward model. The forward model predicts the robot’s subsequent state, and the comparator compares these predictions with actual states to generate critical prediction errors. These errors are systematically integrated into the controller, facilitating the continuous adjustment and refinement of control signals. Experiments on quadruped robots across different terrains and various joint damage scenarios have verified the effectiveness of our method, especially the functions of the comparator and the forward model. Furthermore, robots can adapt to locked joints without prior training, demonstrating zero-shot transfer capability. Finally, the proposed method demonstrates universal applicability to both quadruped and hexapod robots, highlighting its potential for broader applications in legged robotics.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/fu25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/fu25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Capability-Aware Shared Hypernetworks for Flexible Heterogeneous Multi-Robot Coordination</title>
        <description>Recent advances have enabled heterogeneous multi-robot teams to learn complex and effective coordination skills.  However, existing neural architectures that support heterogeneous teaming tend to force a trade-off between expressivity and efficiency. Shared-parameter designs prioritize sample efficiency by enabling a single network to be shared across all or a pre-specified subset of robots (via input augmentations), but tend to limit behavioral diversity. In contrast, recent designs employ a separate policy for each robot, enabling greater diversity and expressivity at the cost of efficiency and generalization. Our key insight is that such tradeoffs can be avoided by viewing these design choices as ends of a broad spectrum. Inspired by recent work in transfer and meta learning, and building on prior work in multi-robot task allocation, we propose Capability-Aware Shared Hypernetworks (CASH), a *soft weight sharing* architecture that uses hypernetworks to efficiently learn a *flexible* shared policy that dynamically adapts to each robot post-training. By explicitly encoding the impact of robot capabilities (e.g., speed and payload) on collective behavior, CASH enables *zero-shot generalization* to unseen robots or team compositions. Our experiments involve multiple heterogeneous tasks, three learning paradigms (imitation learning, value-based, and policy-gradient RL), and SOTA multi-robot simulation (JaxMARL) and hardware (Robotarium) platforms. Across all conditions, we find that CASH generates appropriately-diverse behaviors and consistently outperforms baseline architectures in terms of performance and sample efficiency during both training and zero-shot generalization, all with 60%-80% fewer learnable parameters.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/fu25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/fu25a.html</guid>
        
        
      </item>
    
      <item>
        <title>SimShear: Sim-to-Real Shear-based Tactile Servoing</title>
        <description>We present SimShear: a sim-to-real pipeline for tactile control that allows use of shear information without explicitly modeling shear dynamics in simulation. Shear, which arises from lateral movements across contact surfaces, is critical for tasks involving dynamic object interactions but is challenging to simulate. We introduce shPix2pix: a shear-conditioned U-Net GAN that transforms shear-free simulated tactile images, together with a vector encoding shear information, into realistic equivalents that include shear deformations, and show this outperforms baseline pix2pix methods for simulating tactile images and pose/shear prediction. This is applied to two control tasks using a pair of low-cost desktop robotic arms equipped with a vision-based tactile sensor: first, a tactile tracking task, where a follower arm tracks a surface moved by the leader arm; second, a collaborative co-lift task, where both arms jointly hold an object while the leader arm moves along a prescribed trajectory. Our method maintains contact errors within 1-2 mm across varied trajectories where shear sensing is essential for task performance. This work validates the use of sim-to-real shear modeling with rigid-body simulators, opening new possibilities for simulation in tactile robotics.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/freud25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/freud25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Text2Touch: Tactile In-Hand Manipulation with LLM-Designed Reward Functions</title>
        <description>Large language models (LLMs) are beginning to automate reward design for dexterous manipulation. However, no prior work has considered tactile sensing, which is known to be critical for human-like dexterity. We present Text2Touch, bringing LLM-crafted rewards to the challenging task of multi-axis in-hand object rotation with real-world vision-based tactile sensing in palm-up and palm-down configurations. Our prompt engineering strategy scales to over 70 environment variables, and sim-to-real distillation enables successful policy transfer to a tactile-enabled fully actuated four-fingered dexterous robot hand. Text2Touch significantly outperforms a carefully tuned human-engineered baseline, demonstrating superior rotation speed and stability while relying on reward functions that are an order of magnitude shorter and simpler. These results illustrate how LLM-designed rewards can significantly reduce the time from concept to deployable dexterous tactile skills, supporting more rapid and scalable multimodal robot learning.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/field25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/field25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation</title>
        <description>Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities, the ability to reason about the physical world, and reactively choose appropriate motor skills. Vision-language models (VLMs) pretrained on Internet data could in principle offer a framework for tackling such problems. However, in their current form, VLMs lack both the nuanced understanding of intricate physics required for robotic manipulation and the ability to reason over long horizons to address error compounding issues. In this paper, we introduce a novel test-time computation framework that enhances VLMs’ physical reasoning capabilities for multi-stage manipulation tasks. At its core, our approach iteratively improves a pretrained VLM with a “reflection” mechanism - it uses a generative model to imagine future world states, leverages these predictions to guide action selection, and critically reflects on potential suboptimalities to refine its reasoning. Experimental results demonstrate that our method significantly outperforms several state-of-the-art commercial VLMs as well as other post-training approaches such as Monte Carlo Tree Search (MCTS). Videos are available at https://corl2025-reflectvlm.github.io.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/feng25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/feng25b.html</guid>
        
        
      </item>
    
      <item>
        <title>FFHFlow: Diverse and Uncertainty-Aware Dexterous Grasp Generation via Flow Variational Inference</title>
        <description>Synthesizing diverse, uncertainty-aware grasps for multi-fingered hands from partial observations remains a critical challenge in robot learning. Prior generative methods struggle to model the intricate grasp distribution of dexterous hands and often fail to reason about shape uncertainty inherent in partial point clouds, leading to unreliable or overly conservative grasps. We propose FFHFlow, a flow-based variational framework that generates diverse, robust multi-finger grasps while explicitly quantifying perceptual uncertainty in the partial point clouds. Our approach leverages a normalizing flow-based deep latent variable model to learn a hierarchical grasp manifold, overcoming the mode collapse and rigid prior limitations of conditional Variational Autoencoders (cVAEs). By exploiting the invertibility and exact likelihoods of flows, FFHFlow introspects shape uncertainty in partial observations and identifies novel object structures, enabling risk-aware grasp synthesis. To further enhance reliability, we integrate a discriminative grasp evaluator with the flow likelihoods, formulating an uncertainty-aware ranking strategy that prioritizes grasps robust to shape ambiguity. Extensive experiments in simulation and real-world setups demonstrate that FFHFlow outperforms state-of-the-art baselines (including diffusion models) in grasp diversity and success rate, while achieving run-time efficient sampling. We also showcase its practical value in cluttered and confined environments, where diversity-driven sampling excels by mitigating collisions.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/feng25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/feng25a.html</guid>
        
        
      </item>
    
      <item>
        <title>$Door(s)$: Junction State Estimation for Efficient Exploration in Reinforcement Learning</title>
        <description>Exploration is one of the important bottlenecks for efficient learning in reinforcement learning, especially in the presence of sparse rewards. One way to traverse the environment faster is by passing through junctions, or metaphorical doors, in the state space. We propose a novel heuristic, $Door(s)$, focused on such narrow passages that serve as pathways to a large number of other states. Our approach works by estimating the state occupancy distribution and allows computation of its entropy, which forms the basis for our measure. Its computation is more sample-efficient compared to other similar methods and robustly works over longer horizons. Our results highlight the detection of dead-end states, show increased exploration efficiency, and demonstrate that $Door(s)$ encodes specific behaviors useful for downstream learning of various robotic manipulation tasks.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/fele25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/fele25a.html</guid>
        
        
      </item>
    
      <item>
        <title>AirExo-2: Scaling up Generalizable Robotic Imitation Learning with Low-Cost Exoskeletons</title>
        <description>Scaling up robotic imitation learning for real-world applications requires efficient and scalable demonstration collection methods. While teleoperation is effective, it depends on costly and inflexible robot platforms. In-the-wild demonstrations offer a promising alternative, but existing collection devices have key limitations: handheld setups offer limited observational coverage, and whole-body systems often require fine-tuning with robot data due to domain gaps. To address these challenges, we present AirExo-2, a low-cost exoskeleton system for large-scale in-the-wild data collection, along with visual adaptors that transform collected data into pseudo-robot demonstrations suitable for policy learning. We further introduce RISE-2, a generalizable imitation learning policy that fuses 3D spatial and 2D semantic perception for robust manipulations. Experiments show that RISE-2 outperforms prior state-of-the-art methods on both in-domain and generalization evaluations. Trained solely on adapted in-the-wild data produced by AirExo-2, RISE-2 achieves comparable performance to policies trained with teleoperated data, highlighting the effectiveness and potential of AirExo-2 for scalable and generalizable imitation learning.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/fang25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/fang25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation</title>
        <description>Vision-Language-Action (VLA) models have become a cornerstone in robotic policy learning, leveraging large-scale multimodal data for robust and scalable control. However, existing VLA frameworks primarily address short-horizon tasks, and their effectiveness on long-horizon, multi-step robotic manipulation remains limited due to challenges in skill chaining and subtask dependencies. In this work, we introduce Long-VLA, the first end-to-end VLA model specifically designed for long-horizon robotic tasks. Our approach features a novel phase-aware input masking strategy that adaptively segments each subtask into moving and interaction phases, enabling the model to focus on phase-relevant sensory cues and enhancing subtask compatibility. This unified strategy preserves the scalability and data efficiency of VLA training, and our architecture-agnostic module can be seamlessly integrated into existing VLA models. We further propose the L-CALVIN benchmark to systematically evaluate long-horizon manipulation. Extensive experiments on both simulated and real-world tasks demonstrate that Long-VLA significantly outperforms prior state-of-the-art methods, establishing a new baseline for long-horizon robotic control.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/fan25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/fan25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Learning Smooth State-Dependent Traversability from Dense Point Clouds</title>
        <description>A key open challenge in off-road autonomy is that the traversability of terrain often depends on the vehicle’s state. In particular, some obstacles are only traversable from some orientations. However, learning this interaction by encoding the angle of approach as a model input demands a large and diverse training dataset and is computationally inefficient during planning due to repeated model inference. To address these challenges, we present SPARTA, a method for estimating approach-angle-conditioned traversability from point clouds. Specifically, we impose geometric structure into our network by outputting a smooth analytical function over the 1-Sphere that predicts a risk distribution for any angle of approach with minimal overhead and can be reused for subsequent queries. The function is composed of Fourier basis functions, which have important advantages for generalization due to their periodic nature and smoothness. We demonstrate SPARTA both in a high-fidelity simulation platform, where our model achieves a 91% success rate crossing a 40m boulder field (compared to 73% for the baseline), and on hardware, illustrating the generalization ability of the model to real-world settings.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/dong25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/dong25a.html</guid>
        
        
      </item>
    
      <item>
        <title>TReF-6: Inferring Task-Relevant Frames from a Single Demonstration for One-Shot Skill Generalization</title>
        <description>Robots often struggle to generalize from a single demonstration due to the lack of a transferable and interpretable spatial representation. In this work, we introduce TReF-6, a method that infers a simplified, abstracted 6DoF Task-Relevant Frame from a single trajectory. Our approach identifies an influence point purely from the trajectory geometry to define the origin for a local frame, which serves as a reference for parameterizing a Dynamic Movement Primitive (DMP). This influence point captures the task’s spatial structure, extending the standard DMP formulation beyond start-goal imitation. The inferred frame is semantically grounded via a vision-language model and localized in novel scenes by Grounded-SAM, enabling functionally consistent skill generalization. We validate TReF-6 in simulation and demonstrate robustness to trajectory noise. We further deploy an end-to-end pipeline on real-world manipulation tasks, showing that TReF-6 supports one-shot imitation learning that preserves task intent across diverse object configurations.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/ding25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/ding25a.html</guid>
        
        
      </item>
    
      <item>
        <title>GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation</title>
        <description>We present GraspMolmo, a generalizable open-vocabulary task-oriented grasping (TOG) model. GraspMolmo predicts semantically appropriate, stable grasps conditioned on a natural language instruction and a single RGB-D frame. For instance, given &quot;pour me some tea&quot;, GraspMolmo selects a grasp on a teapot handle rather than its body. Unlike prior TOG methods, which are limited by small datasets, simplistic language, and uncluttered scenes, GraspMolmo learns from a large-scale synthetic dataset of 379k samples featuring cluttered environments and diverse, realistic task descriptions. We fine-tune the Molmo visual-language model on this data, enabling GraspMolmo to generalize to novel open-vocabulary instructions and objects. In challenging real-world evaluations, GraspMolmo achieves state-of-the-art results, with a 70% prediction success on complex tasks, compared to the 35% achieved by the next best alternative.  GraspMolmo also successfully demonstrates the ability to predict semantically correct bimanual grasps zero-shot. We release our synthetic dataset, code, model, and benchmarks to accelerate research in task-semantic robotic manipulation.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/deshpande25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/deshpande25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Residual Neural Terminal Constraint for MPC-based Collision Avoidance in Dynamic Environments</title>
        <description>In this paper, we propose a hybrid MPC local planner that uses a learning-based approximation of a time-varying safe set, derived from local observations and applied as the MPC terminal constraint. This set can be represented as a zero-superlevel set of the value function computed via Hamilton-Jacobi (HJ) reachability analysis, which is infeasible to compute in real time. We exploit the property that the HJ value function can be expressed as the difference of the corresponding signed distance function (SDF) and a non-negative residual function. The residual component is modeled as a neural network with non-negative output and subtracted from the computed SDF, resulting in a real-time value function estimate that is at least as safe as the SDF by design. Additionally, we parametrize the neural residual by a hypernetwork to improve real-time performance and generalization properties. The proposed method is compared with three state-of-the-art methods in simulations and hardware experiments, achieving up to 30% higher success rates compared to the best baseline while requiring a similar computational effort and producing high-quality (low travel-time) solutions.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/derajic25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/derajic25a.html</guid>
        
        
      </item>
    
      <item>
        <title>SafeBimanual: Diffusion-based trajectory optimization for safe bimanual manipulation</title>
        <description>Bimanual manipulation has been widely applied in household services and manufacturing, enabling the completion of complex tasks that require coordination. Recent diffusion-based policy learning approaches have achieved promising performance in modeling action distributions for bimanual manipulation. However, they ignore the physical safety constraints of bimanual manipulation, which leads to dangerous behaviors that damage robots and objects. To this end, we propose a test-time trajectory optimization framework named SafeBimanual for any pre-trained diffusion-based bimanual manipulation policy, which imposes safety constraints on bimanual actions to avoid dangerous robot behaviors and improve the success rate. Specifically, we design diverse cost functions for safety constraints in different dual-arm cooperation patterns, including avoidance of tearing objects and of collisions between arms and objects, which optimize the manipulator trajectories via guided sampling of the diffusion denoising process. Moreover, we employ a vision-language model (VLM) to schedule the cost functions by specifying keypoints and their pairwise relationships, so that the optimal safety constraint is dynamically generated throughout the bimanual manipulation process. SafeBimanual demonstrates superiority on 8 simulated tasks in RoboTwin with an 11.1% increase in success rate and an 18.9% reduction in unsafe interactions over state-of-the-art diffusion-based methods. Extensive experiments on 4 real-world tasks further verify its practical value by improving the success rate by 32.5%.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/deng25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/deng25c.html</guid>
        
        
      </item>
    
      <item>
        <title>Subteaming and Adaptive Formation Control for Coordinated Multi-Robot Navigation</title>
        <description>Coordinated multi-robot navigation is essential for robots to operate as a team in diverse environments.  During navigation, robot teams usually need to maintain specific formations, such as circular formations to protect human teammates at the center.  However, in complex scenarios such as narrow corridors, rigidly preserving predefined formations can become infeasible.  Therefore, robot teams must be capable of dynamically splitting into smaller subteams and adaptively controlling the subteams to navigate through such scenarios while preserving formations. To enable this capability, we introduce a novel method for SubTeaming and Adaptive Formation (STAF), which is built upon a unified hierarchical learning framework: (1) high-level deep graph cut for team splitting, (2) intermediate-level graph learning for facilitating coordinated navigation among subteams,  and (3) low-level policy learning for controlling individual mobile robots to reach their goal positions while avoiding collisions. To evaluate STAF, we conducted extensive experiments in both indoor and outdoor environments using robotics simulations and physical robot teams. Experimental results show that STAF enables the novel capability for subteaming and adaptive formation control, and achieves promising performance in coordinated multi-robot navigation through challenging scenarios. More details are available on the project website: https://anonymous188.github.io/STAF/.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/deng25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/deng25b.html</guid>
        
        
      </item>
    
      <item>
        <title>GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data</title>
        <description>Embodied foundation models are gaining increasing attention for their zero-shot generalization, scalability, and adaptability to new tasks through few-shot post-training. However, existing models rely heavily on real-world data, which is costly and labor-intensive to collect. Synthetic data offers a cost-effective alternative, yet its potential remains largely underexplored. To bridge this gap, we explore the feasibility of training Vision-Language-Action (VLA) models entirely with large-scale synthetic action data. We curate SynGrasp-1B, a billion-frame robotic grasping dataset generated in simulation with photorealistic rendering and extensive domain randomization. Building on this, we present GraspVLA, a VLA model pretrained on large-scale synthetic action data as a foundational model for grasping tasks. GraspVLA integrates autoregressive perception tasks and flow-matching-based action generation into a unified Chain-of-Thought process, enabling joint training on synthetic action data and Internet semantics data. This design helps mitigate sim-to-real gaps and facilitates the transfer of learned actions to a broader range of Internet-covered objects, achieving open-vocabulary generalization in grasping. Extensive evaluations across real-world and simulation benchmarks demonstrate GraspVLA’s advanced zero-shot generalizability and few-shot adaptability to specific human preferences. We will release the SynGrasp-1B dataset and pre-trained weights to benefit the community.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/deng25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/deng25a.html</guid>
        
        
      </item>
    
      <item>
        <title>X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real</title>
        <description>Human videos offer a scalable way to train robot manipulation policies, but lack the action labels needed by standard imitation learning algorithms. Existing cross-embodiment approaches try to map human motion to robot actions, but often fail when the embodiments differ significantly. We propose X-Sim, a real-to-sim-to-real framework that uses object motion as a dense and transferable signal for learning robot policies. X-Sim starts by reconstructing a photorealistic simulation from an RGBD human video and tracking object trajectories to define object-centric rewards. These rewards are used to train a reinforcement learning (RL) policy in simulation. The learned policy is then distilled into an image-conditioned diffusion policy using synthetic rollouts rendered with varied viewpoints and lighting. To transfer to the real world, X-Sim introduces an online domain adaptation technique that aligns real and simulated observations during deployment. Importantly, X-Sim does not require any robot teleoperation data. We evaluate it across 5 manipulation tasks in 2 environments and show that it: (1) improves task progress by 30% on average over hand-tracking and sim-to-real baselines, (2) matches behavior cloning with 10x less data collection, and (3) generalizes to new camera viewpoints and test-time changes.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/dan25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/dan25a.html</guid>
        
        
      </item>
    
      <item>
        <title>AimBot: A Simple Auxiliary Visual Cue to Enhance Spatial Awareness of Visuomotor Policies</title>
        <description>In this paper, we propose AimBot, a lightweight visual augmentation technique that provides explicit spatial cues to improve visuomotor policy learning in robotic manipulation. AimBot overlays shooting lines and scope reticles onto multi-view RGB images, offering auxiliary visual guidance that encodes the end-effector’s state. The overlays are computed from depth images, camera extrinsics, and the current end-effector pose, explicitly conveying spatial relationships between the gripper and objects in the scene. AimBot incurs minimal computational overhead (less than 1 ms) and requires no changes to model architectures, as it simply replaces original RGB images with augmented counterparts. Despite its simplicity, our results show that AimBot consistently improves the performance of various visuomotor policies in both simulation and real-world settings, highlighting the benefits of spatially grounded visual feedback. More videos can be found at https://aimbot-reticle.github.io/</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/dai25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/dai25a.html</guid>
        
        
      </item>
    
      <item>
        <title>LLM-Guided Probabilistic Program Induction for POMDP Model Estimation</title>
        <description>Partially Observable Markov Decision Processes (POMDPs) model decision making under uncertainty. While there are many approaches to approximately solving POMDPs, we aim to address the problem of learning such models. In particular, we are interested in a subclass of POMDPs wherein the components of the model, including the observation function, reward function, transition function, and initial state distribution function, can be modeled as low-complexity probabilistic graphical models in the form of a short probabilistic program. Our strategy to learn these programs uses an LLM as a prior, generating candidate probabilistic programs that are then tested against the empirical distribution and adjusted through feedback. We experiment on a number of classical toy POMDP problems, simulated MiniGrid domains, and two real mobile-base robotics search domains involving partial observability. Our results show that using an LLM to guide in the construction of a low-complexity POMDP model can be more effective than tabular POMDP learning, behavior cloning, or direct LLM planning.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/curtis25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/curtis25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Action-Free Reasoning for Policy Generalization</title>
        <description>End-to-end imitation learning offers a promising approach for training robot policies. However, generalizing to new settings—such as unseen scenes, tasks, and object instances—remains a significant challenge. Although large-scale robot demonstration datasets have shown potential for inducing generalization, they are resource-intensive to scale. In contrast, human video data is abundant and diverse, presenting an attractive alternative. Yet, these human-video datasets lack action labels, complicating their use in imitation learning. Existing methods attempt to extract grounded action representations (e.g., hand poses), but resulting policies struggle to bridge the embodiment gap between human and robot actions. We propose an alternative approach: leveraging language-based reasoning from human videos—essential for guiding robot actions—to train generalizable robot policies. Building on recent advances in reasoning-based policy architectures, we introduce Reasoning through Action-free Data (RAD). RAD learns from both robot demonstration data (with reasoning and action labels) and action-free human video data (with only reasoning labels). The robot data teaches the model to map reasoning to low-level actions, while the action-free data enhances reasoning capabilities. Additionally, we release a new dataset of 3,377 human-hand demonstrations compatible with the Bridge V2 benchmark. This dataset includes chain-of-thought reasoning annotations and hand-tracking data, and is aimed at facilitating future research on reasoning-driven robot learning. Our experiments demonstrate that RAD enables effective transfer across the embodiment gap, allowing robots to perform tasks seen only in action-free data. Furthermore, scaling up action-free reasoning data significantly improves policy performance and generalization to novel tasks. These results highlight the promise of reasoning-driven learning from action-free datasets for advancing generalizable robot control. See website with videos: https://rad-generalization-s.github.io/</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/clark25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/clark25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Multimodal Fused Learning for Solving the Generalized Traveling Salesman Problem in Robotic Task Planning</title>
        <description>Effective and efficient task planning is essential for mobile robots, especially in applications like warehouse retrieval and environmental monitoring. These tasks often involve selecting one location from each of several target clusters, forming a Generalized Traveling Salesman Problem (GTSP) that remains challenging to solve both accurately and efficiently. To address this, we propose a Multimodal Fused Learning (MMFL) framework that leverages both graph and image-based representations to capture complementary aspects of the problem, and learns a policy capable of generating high-quality task planning schemes in real time. Specifically, we first introduce a coordinate-based image builder that transforms GTSP instances into spatially informative representations. We then design an adaptive resolution scaling strategy to enhance adaptability across different problem scales, and develop a multimodal fusion module with dedicated bottlenecks that enables effective integration of geometric and spatial features. Extensive experiments show that our MMFL approach significantly outperforms state-of-the-art methods across various GTSP instances while maintaining the computational efficiency required for real-time robotic applications. Physical robot tests further validate its practical effectiveness in real-world scenarios.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/cheng25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/cheng25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Hand-Eye Autonomous Delivery: Learning Humanoid Navigation, Locomotion and Reaching</title>
        <description>We propose Hand-Eye Autonomous Delivery (HEAD), a framework that learns navigation, locomotion, and reaching skills for humanoids directly from human motion and vision perception data. We take a modular approach in which the high-level planner commands the target positions and orientations of the humanoid’s hands and eyes, which a low-level policy delivers by controlling the whole-body movements. Specifically, the low-level whole-body controller learns to track the three points (eyes, left hand, and right hand) from existing large-scale human motion capture data, while the high-level policy learns from human data collected with Aria glasses. Our modular approach decouples ego-centric vision perception from physical actions, promoting efficient learning and scalability to novel scenes. We evaluate our method both in simulation and in the real world, demonstrating the humanoid’s capability to navigate and reach in complex environments designed for humans.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/chen25e.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/chen25e.html</guid>
        
        
      </item>
    
      <item>
        <title>Tool-as-Interface: Learning Robot Policies from Observing Human Tool Use</title>
        <description>Tool use is essential for enabling robots to perform complex real-world tasks, but learning such skills requires extensive datasets. While teleoperation is widely used, it is slow, delay-sensitive, and poorly suited for dynamic tasks. In contrast, human videos provide a natural way to collect data without specialized hardware, though they pose challenges for robot learning due to viewpoint variations and embodiment gaps. To address these challenges, we propose a framework that transfers tool-use knowledge from humans to robots. To improve the policy’s robustness to viewpoint variations, we use two RGB cameras to reconstruct 3D scenes and apply Gaussian splatting for novel view synthesis. We reduce the embodiment gap using segmented observations and tool-centric, task-space actions to achieve embodiment-invariant visuomotor policy learning. Our method achieves a 71% improvement in task success and a 77% reduction in data collection time compared to diffusion policies trained on teleoperation with equivalent time budgets. Our method also reduces data collection time by 41% compared with the state-of-the-art data collection interface.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/chen25d.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/chen25d.html</guid>
        
        
      </item>
    
      <item>
        <title>MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention</title>
        <description>Aligning robot behavior with human preferences is crucial for deploying embodied AI agents in human-centered environments. A promising solution is interactive imitation learning from human intervention, where a human expert observes the policy’s execution and provides interventions as feedback. However, existing methods often fail to utilize the prior policy efficiently to facilitate learning, thus hindering sample efficiency. In this work, we introduce Maximum-Entropy Residual-Q Inverse Reinforcement Learning (MEReQ), designed for sample-efficient alignment from human intervention. Instead of inferring the complete human behavior characteristics, MEReQ infers a residual reward function that captures the discrepancy between the human expert’s and the prior policy’s underlying reward functions. It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function. Extensive evaluations on simulated and real-world tasks demonstrate that MEReQ achieves sample-efficient policy alignment from human intervention compared to other baselines.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/chen25c.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/chen25c.html</guid>
        
        
      </item>
    
      <item>
        <title>ClutterDexGrasp: A Sim-to-Real System for General Dexterous Grasping in Cluttered Scenes</title>
        <description>Dexterous grasping in cluttered scenes presents significant challenges due to diverse object geometries, occlusions, and potential collisions. Existing methods primarily focus on single-object grasping or grasp-pose prediction without interaction, which are insufficient for complex, cluttered scenes. Recent vision-language-action models offer a potential solution but require extensive real-world demonstrations, making them costly and difficult to scale. To address these limitations, we revisit the sim-to-real transfer pipeline and develop key techniques that enable zero-shot deployment in reality while maintaining robust generalization. We propose ClutterDexGrasp, a two-stage teacher-student framework for closed-loop target-oriented dexterous grasping in cluttered scenes. The framework features a teacher policy trained in simulation using clutter density curriculum learning, incorporating both a novel geometry- and spatially-embedded scene representation and a comprehensive safety curriculum, enabling general, dynamic, and safe grasping behaviors. Through imitation learning, we distill the teacher’s knowledge into a student 3D diffusion policy (DP3) that operates on partial point cloud observations. To the best of our knowledge, this represents the first zero-shot sim-to-real closed-loop system for target-oriented dexterous grasping in cluttered scenes, demonstrating robust performance across diverse objects and layouts.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/chen25b.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/chen25b.html</guid>
        
        
      </item>
    
      <item>
        <title>Training Strategies for Efficient Embodied Reasoning</title>
        <description>Robot chain-of-thought reasoning (CoT) – wherein a model predicts helpful intermediate representations before choosing actions – provides an effective method for improving the generalization and performance of robot policies, especially vision-language-action models (VLAs). While such approaches have been shown to improve performance and generalization, they suffer from core limitations, like needing specialized robot reasoning data and slow inference speeds. To design new robot reasoning approaches that address these issues, a more complete characterization of why reasoning helps policy performance is critical. We hypothesize several mechanisms by which robot reasoning improves policies – (1) better representation learning, (2) improved learning curricularization, and (3) increased expressivity – then devise simple variants of robot CoT reasoning to isolate and test each one. We find that learning to generate reasonings does lead to better VLA representations, while attending to the reasonings aids in actually leveraging these features for improved action prediction. Our results provide us with a better understanding of why CoT reasoning helps VLAs, which we use to introduce two simple and lightweight alternative recipes for robot reasoning. Our proposed approaches achieve significant performance gains over non-reasoning policies, state-of-the-art results on the LIBERO-90 benchmark, and a 3x inference speedup compared to standard robot reasoning.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/chen25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/chen25a.html</guid>
        
        
      </item>
    
      <item>
        <title>DiWA: Diffusion Policy Adaptation with World Models</title>
        <description>Fine-tuning diffusion policies with reinforcement learning (RL) presents significant challenges. The long denoising sequence for each action prediction impedes effective reward propagation. Additionally, standard RL methods require millions of physical interaction steps, making fine-tuning even more challenging. Prior work models the denoising steps in diffusion policies as a Markov Decision Process to adapt to RL policy updates, but its heavy reliance on environment interactions still leads to inefficiency. To bridge this gap, we introduce DiWA, a novel framework that leverages a world model for fine-tuning diffusion-based robotic skills entirely offline with reinforcement learning. Unlike model-free approaches that require millions of environment interactions to fine-tune a repertoire of robot skills, DiWA achieves effective adaptation using a world model trained once on a few hundred thousand offline play interactions. This results in dramatically improved sample efficiency, making the approach significantly more practical and safer for real-world robot learning. On the challenging CALVIN benchmark, DiWA improves performance across eight tasks using only offline adaptation, while requiring orders of magnitude fewer physical interactions than model-free baselines. To our knowledge, this is the first demonstration of fine-tuning diffusion policies for real-world robotic skills using an offline world model. We make the code publicly available at _redacted-for-review_.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/chandra25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/chandra25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Human-like Navigation in a World Built for Humans</title>
        <description>When navigating in a man-made environment they haven’t visited before—like an office building—humans employ behaviors such as reading signs and asking others for directions. These behaviors help humans reach their destinations efficiently by reducing the need to search through large areas. Existing robot navigation systems lack the ability to execute such behaviors and are thus highly inefficient at navigating within large environments. We present ReasonNav, a modular navigation system which integrates these human-like navigation skills by leveraging the reasoning capabilities of a vision-language model (VLM). We design compact input and output abstractions based on navigation landmarks, allowing the VLM to focus on language understanding and reasoning. We evaluate ReasonNav on real and simulated navigation tasks and show that the agent successfully employs higher-order reasoning to navigate efficiently in large, complex buildings.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/chandaka25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/chandaka25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Improving Efficiency of Sampling-based Motion Planning via Message-Passing Monte Carlo</title>
        <description>Sampling-based motion planning methods, while effective in high-dimensional spaces, often suffer from inefficiencies due to irregular sampling distributions, leading to suboptimal exploration of the configuration space. In this paper, we propose an approach that enhances the efficiency of these methods by utilizing low-discrepancy distributions generated through Message-Passing Monte Carlo (MPMC). MPMC leverages Graph Neural Networks (GNNs) to generate point sets that uniformly cover the space, with uniformity assessed using the $\mathcal{L}_p$-discrepancy measure, which quantifies the irregularity of sample distributions. By improving the uniformity of the point sets, our approach significantly reduces computational overhead and the number of samples required for solving motion planning problems. Experimental results demonstrate that our method outperforms traditional sampling techniques in terms of planning efficiency.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/chahine25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/chahine25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Divide, Discover, Deploy: Factorized Skill Learning with Symmetry and Style Priors</title>
        <description>Unsupervised Skill Discovery (USD) allows agents to autonomously learn diverse behaviors without task-specific rewards. While recent USD methods have shown promise, their application to real-world robotics remains underexplored. In this paper, we propose a modular USD framework to address the challenges in safety, interpretability, and deployability of the learned skills. Our approach factorizes the state space to learn disentangled skill representations and assigns different skill discovery algorithms to each factor based on the desired intrinsic reward function. To encourage structured morphology-aware skills, we introduce symmetry-based inductive biases tailored to individual factors. We also incorporate a style factor and regularization penalties to promote safe and robust behaviors. We evaluate our framework in simulation using a quadrupedal robot and demonstrate zero-shot transfer of the learned skills to real hardware. Our results show that factorization and symmetry lead to the discovery of structured, human-interpretable behaviors, while the style factor and penalties enhance safety and diversity. Additionally, we show that the learned skills can be used for downstream tasks and perform on par with oracle policies trained with hand-crafted rewards. To facilitate future research, we will release our code upon publication.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/cathomen25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/cathomen25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Self-supervised Learning Of Visual Pose Estimation Without Pose Labels By Classifying LED States</title>
        <description>We introduce a model for monocular RGB relative pose estimation of a ground robot that trains from scratch without pose labels or prior knowledge about the robot’s shape or appearance. At training time, we assume: (i) a robot fitted with multiple LEDs, whose states are independent and known at each frame; (ii) knowledge of the approximate viewing direction of each LED; and (iii) availability of a calibration image with a known target distance, to address the ambiguity of monocular depth estimation. Training data is collected by a pair of robots moving randomly without needing external infrastructure or human supervision. Our model trains on the task of predicting from an image the state of each LED on the robot. In doing so, it learns to predict the position of the robot in the image, its distance, and its relative bearing. At inference time, the state of the LEDs is unknown, can be arbitrary, and does not affect the pose estimation performance. Quantitative experiments indicate that our approach: is competitive with SoA approaches that require supervision from pose labels or a CAD model of the robot; generalizes to different domains; and handles multi-robot pose estimation.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/carlotti25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/carlotti25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Pseudo-Simulation for Autonomous Driving</title>
        <description>Existing evaluation paradigms for Autonomous Vehicles (AVs) face critical limitations. Real-world evaluation is often challenging due to safety concerns and a lack of reproducibility, whereas closed-loop simulation can face insufficient realism or high computational costs. Open-loop evaluation, while being efficient and data-driven, relies on metrics that generally overlook compounding errors. In this paper, we propose pseudo-simulation, a novel paradigm that addresses these limitations. Pseudo-simulation operates on real datasets, similar to open-loop evaluation, but augments them with synthetic observations generated prior to evaluation using 3D Gaussian Splatting. Our key idea is to approximate potential future states the AV might encounter by generating a diverse set of observations that vary in position, heading, and speed. Our method then assigns a higher importance to synthetic observations that best match the AV’s likely behavior using a novel proximity-based weighting scheme. This enables evaluating error recovery and the mitigation of causal confusion, as in closed-loop benchmarks, without requiring sequential interactive simulation. We show that pseudo-simulation is better correlated with closed-loop simulations ($R^2=0.8$) than the best existing open-loop approach ($R^2=0.7$). We also establish a public leaderboard for the community to benchmark new methodologies with pseudo-simulation.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/cao25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/cao25a.html</guid>
        
        
      </item>
    
      <item>
        <title>RobotxR1: Enabling Embodied Robotic Intelligence on Large Language Models through Closed-Loop Reinforcement Learning</title>
        <description>Future robotic systems operating in real-world environments require on-board embodied intelligence without continuous cloud connection, balancing capabilities with constraints on computational power and memory. This work presents an extension of the R1-zero approach, which enables the usage of small parameter-count Large Language Models (LLMs) in the robotic domain. The R1-Zero approach was originally developed to enable mathematical reasoning in LLMs using static datasets. We extend it to the robotics domain through integration with a closed-loop Reinforcement Learning (RL) framework. This extension allows reasoning in Embodied Artificial Intelligence (EmbodiedAI) settings without relying solely on distillation of large models through Supervised Fine-Tuning (SFT). We show that small-scale LLMs can achieve effective reasoning performance by learning through closed-loop interaction with their environment, which enables tasks that previously required significantly larger models. A performance gain of 20.2 percentage points over the SFT-based baseline is observed with a Qwen2.5-1.5B model. Using the proposed training procedure, Qwen2.5-3B achieves a 63.3% control adaptability score, surpassing the 58.5% obtained by the much larger, cloud-bound GPT-4o. These results highlight that practical, on-board deployment of small LLMs is not only feasible but can outperform larger models when trained through environmental interaction, underscoring the importance of an interactive, embodied learning framework for robotic EmbodiedAI — one grounded in practical experience rather than static supervision.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/boyle25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/boyle25a.html</guid>
        
        
      </item>
    
      <item>
        <title>$\pi_0.5$: a Vision-Language-Action Model with Open-World Generalization</title>
        <description>In order for robots to be useful, they must perform practically relevant tasks in the real world, outside of the lab. While vision-language-action (VLA) models have demonstrated impressive results for end-to-end robot control, it remains an open question how far such models can generalize in the wild. We describe $\pi_{0.5}$, a new model based on $\pi_0$ that uses co-training on heterogeneous tasks to enable broad generalization. $\pi_{0.5}$ uses data from multiple robots, high-level semantic prediction, web data, and other sources to enable broadly generalizable real-world robotic manipulation. Our system uses a combination of co-training and hybrid multi-modal examples that combine image observations, language commands, object detections, semantic subtask prediction, and low-level actions. Our experiments show that this kind of knowledge transfer is essential for effective generalization, and we demonstrate for the first time that an end-to-end learning-enabled robotic system can perform long-horizon and dexterous manipulation skills, such as cleaning a kitchen or bedroom, in entirely new homes.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/black25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/black25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation</title>
        <description>How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection which is expensive, we show how we can leverage video generation models trained on easily available web data, for enabling generalization. \textit{Our approach Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video.} To train the policy, we use an order of magnitude less robot interaction data compared to what the video prediction model was trained on. Gen2Act doesn’t require fine-tuning the video model at all and we directly use a pre-trained model for generating human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/bharadhwaj25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/bharadhwaj25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Focusing on What Matters: Object-Agent-centric Tokenization for Vision Language Action models</title>
        <description>Vision-Language-Action (VLA) models offer a pivotal approach to learning robotic manipulation at scale by repurposing large pre-trained Vision-Language-Models (VLM) to output robotic actions. However, adapting VLMs for robotic domains comes with an unnecessarily high computational cost, which we attribute to the tokenization scheme of visual inputs. In this work, we aim to enable efficient VLA training by proposing Oat-VLA, an Object-Agent-centric Tokenization for VLAs. Building on the insights of object-centric representation learning, our method introduces an inductive bias towards scene objects and the agent’s own visual information. As a result, we find that Oat-VLA can drastically reduce the number of visual tokens to just a few without sacrificing performance. We reveal that Oat-VLA converges at least twice as fast as OpenVLA on the LIBERO suite, and outperforms OpenVLA in diverse real-world pick and place tasks.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/bendikas25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/bendikas25a.html</guid>
        
        
      </item>
    
      <item>
        <title>RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies</title>
        <description>Comprehensive, unbiased, and comparable evaluation of modern generalist policies is uniquely challenging: existing approaches for robot benchmarking typically rely on heavy standardization, either by specifying fixed evaluation tasks and environments, or by hosting centralized &quot;robot challenges&quot;, and do not readily scale to evaluating generalist policies across a broad range of tasks and environments. In this work, we propose RoboArena, a new approach for scalable evaluation of generalist robot policies in the real world. Instead of standardizing evaluations around fixed tasks, environments, or locations, we propose to crowd-source evaluations across a distributed network of evaluators. Importantly, evaluators can freely choose the tasks and environments they evaluate on, enabling easy scaling of diversity, but they are required to perform double-blind evaluations over pairs of policies. Then, by aggregating preference feedback from pairwise comparisons across diverse tasks and environments, we can derive a ranking of policies. We instantiate our approach across a network of evaluators at seven academic institutions using the DROID robot platform. Through more than 600 pairwise real-robot evaluation episodes across seven generalist policies, we demonstrate that our crowd-sourced approach can more accurately rank the performance of existing generalist policies than conventional, centralized evaluation approaches, while being more scalable, resilient, and trustworthy. We open our evaluation network to the community and hope that it can enable more accessible comparisons of generalist robot policies.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/atreya25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/atreya25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Pointing3D: A Benchmark for 3D Object Referral via Pointing Gestures</title>
        <description>Pointing gestures provide a natural and efficient way to communicate spatial information in human-machine interaction, yet their potential for 3D object referral remains largely under-explored. To fill this gap, we introduce the task of pointing-based 3D segmentation. In this task, given an image of a person pointing at an object and the 3D point cloud of the environment, the goal is to predict the 3D segmentation mask of the referred object. To enable the standardized evaluation of this task, we introduce POINTR3D, a curated dataset of over 65,000 frames captured with three cameras across four indoor scenes, featuring diverse pointing scenarios. Each frame is annotated with the information of the active hand, the corresponding object ID, and the 3D segmentation mask of the object. To showcase the application of the proposed dataset, we further introduce Pointing3D, a transformer-based architecture that predicts the pointing direction from RGB images and uses this prediction as a prompt to segment the referred object in the 3D point cloud. Experimental results show that Pointing3D outperforms other strong baselines we introduce and lays the groundwork for future research. The dataset, source code, and evaluation tools will be made publicly available to support further research in this area, enabling a natural human-machine interaction.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/arslanoglu25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/arslanoglu25a.html</guid>
        
        
      </item>
    
      <item>
        <title>SAIL: Faster-than-Demonstration Execution of Imitation Learning Policies</title>
        <description>Offline Imitation Learning (IL) methods such as Behavior Cloning are effective at acquiring complex robotic manipulation skills.  However, existing IL-trained policies are confined to executing the task at the same speed as shown in demonstration data. This limits the task throughput of a robotic system, a critical requirement for applications such as industrial automation. In this paper, we introduce and formalize the novel problem of enabling faster-than-demonstration execution of visuomotor policies and identify fundamental challenges in robot dynamics and state-action distribution shifts. We instantiate the key insights as SAIL (Speed Adaptation for Imitation Learning), a full-stack system integrating four tightly-connected components: (1) a consistency-preserving action inference algorithm for smooth motion at high speed, (2) high-fidelity tracking of controller-invariant motion targets, (3) adaptive speed modulation that dynamically adjusts execution speed based on motion complexity, and (4) action scheduling to handle real-world system latencies.  Experiments on 12 tasks across simulation and two real, distinct robot platforms show that SAIL achieves up to a {4$\times$ speedup} over demonstration speed in simulation and up to {3.2$\times$ speedup} in the real world. Additional detail is available at https://sail-robot.github.io</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/arachchige25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/arachchige25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Poke and Strike: Learning Task-Informed Exploration Policies</title>
        <description>In many dynamic robotic tasks, such as striking pucks into a goal outside the reachable workspace, the robot must first identify the relevant physical properties of the object for successful task execution, as it is unable to recover from failure or retry without human intervention. To address this challenge, we propose a task-informed exploration approach, based on reinforcement learning, that trains an exploration policy using rewards automatically generated from the sensitivity of a privileged task policy to errors in estimated properties. We also introduce an uncertainty-based mechanism to determine when to transition from exploration to task execution, ensuring sufficient property estimation accuracy with minimal exploration time. Our method achieves a 90% success rate on the striking task with an average exploration time under 1.2 seconds—significantly outperforming baselines that achieve at most 40% success or require inefficient querying and retraining in a simulator at test time. Additionally, we demonstrate that our task-informed rewards capture the relative importance of physical properties in both the striking task and the classical CartPole example. Finally, we validate our approach by demonstrating its ability to identify object properties and adjust task execution in a physical setup using the KUKA iiwa robot arm.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/aoyama25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/aoyama25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Efficient Evaluation of Multi-Task Robot Policies With Active Experiment Selection</title>
        <description>Evaluating learned robot control policies to determine their performance costs the experimenter time and effort. As robots become more capable in accomplishing diverse tasks, evaluating across all these tasks becomes more difficult as it is impractical to test every policy on every task multiple times. Rather than considering the average performance of a policy on a task, we consider the distribution of performance over time. In a multi-task policy evaluation setting, we actively model the distribution of robot performance across multiple tasks and policies as we sequentially execute experiments. We show that natural language is a useful prior in modeling relationships between tasks because they often share similarities that can reveal potential relationships in policy behavior. We leverage this formulation to reduce experimenter effort by using a cost-aware information gain heuristic to efficiently select informative trials. We conduct experiments on existing evaluation data from real robots and simulations and find a 50% reduction in estimates of the mean performance given a fixed cost budget. We encourage the use of our surrogate model as a scalable approach to track progress in evaluation.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/anwar25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/anwar25a.html</guid>
        
        
      </item>
    
      <item>
        <title>ReCoDe: Reinforcement Learning-based Dynamic Constraint Design for Multi-Agent Coordination</title>
        <description>Constraint-based optimization is a cornerstone of robotics, enabling the design of controllers that reliably encode task and safety requirements such as collision avoidance or formation adherence. However, handcrafted constraints can fail in multi-agent settings that demand complex coordination. We introduce ReCoDe—Reinforcement-based Constraint Design—a decentralized, hybrid framework that merges the reliability of optimization-based controllers with the adaptability of multi-agent reinforcement learning. Rather than discarding expert controllers, ReCoDe improves them by learning additional, dynamic constraints that capture subtler behaviors, for example, by constraining agent movements to prevent congestion in cluttered scenarios. Through local communication, agents collectively constrain their allowed actions to coordinate more effectively under changing conditions. In this work, we focus on applications of ReCoDe to multi-agent navigation tasks requiring intricate, context-based movements and consensus, where we show that it outperforms purely handcrafted controllers, other hybrid approaches, and standard MARL baselines. We give empirical (real robot) and theoretical evidence that retaining a user-defined controller, even when it is imperfect, is more efficient than learning from scratch, especially because ReCoDe can dynamically change the degree to which it relies on this controller.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/amir25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/amir25a.html</guid>
        
        
      </item>
    
      <item>
        <title>First Order Model-Based RL through Decoupled Backpropagation</title>
        <description>There is growing interest in reinforcement learning (RL) methods that leverage the simulator’s derivatives to improve learning efficiency. While early gradient-based approaches have demonstrated superior performance compared to derivative-free methods, accessing simulator gradients is often impractical due to their implementation cost or unavailability. Model-based RL (MBRL) can approximate these gradients via learned dynamics models, but the solver efficiency suffers from compounding prediction errors during training rollouts, which can degrade policy performance. We propose an approach that decouples trajectory generation from gradient computation: trajectories are unrolled using a simulator, while gradients are computed via backpropagation through a learned differentiable model of the simulator. This hybrid design enables efficient and consistent first-order policy optimization, even when simulator gradients are unavailable, as well as learning a critic from simulation rollouts, which is more accurate. Our method achieves the sample efficiency and speed of specialized optimizers such as SHAC, while maintaining the generality of standard approaches like PPO and avoiding ill behaviors observed in other first-order MBRL methods. We empirically validate our algorithm on benchmark control tasks and demonstrate its effectiveness on a real Go2 quadruped robot, across both quadrupedal and bipedal locomotion tasks.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/amigo25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/amigo25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation</title>
        <description>Vision is well-known for its use in manipulation, especially via visual servoing. To make it robust, multiple cameras are needed to expand the field of view, which is computationally challenging. Merging multiple views and using Q-learning allows the design of more effective representations and optimization of sample efficiency. Such a solution, however, might be expensive to deploy. To mitigate this, we introduce a merge and disentanglement (MAD) algorithm that efficiently merges views to increase sample efficiency while augmenting with single-view features to allow lightweight deployment and ensure robust policies. We demonstrate the efficiency and robustness of our approach using Meta-World and ManiSkill3.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/almuzairee25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/almuzairee25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Visual Imitation Enables Contextual Humanoid Control</title>
        <description>How can we teach humanoids to climb staircases and sit on chairs using the surrounding environment context? Arguably the simplest way is to _just show them_—casually capture a human motion video and feed it to humanoids. We introduce **VideoMimic**, a real-to-sim-to-real pipeline that mines everyday videos, jointly reconstructs the humans and the environment, and produces whole-body control policies for humanoid robots that perform the corresponding skills. We demonstrate the results of our pipeline on real humanoid robots, showing robust, repeatable contextual control such as staircase ascents and descents, sitting and standing from chairs and benches, as well as other dynamic whole-body skills all from a single policy, conditioned on the environment and global root commands. We hope our data and approach help enable a scalable path towards teaching humanoids to operate in diverse real-world environments.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/allshire25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/allshire25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Distributed Upload and Active Labeling for Resource-Constrained Fleet Learning</title>
        <description>In multi-robot systems, fleets are often deployed to collect data that improves the performance of machine learning models for downstream perception and planning. However, real-world robotic deployments generate vast amounts of data across diverse conditions, while only a small portion can be transmitted or labeled due to limited bandwidth, constrained onboard storage, and high annotation costs. To address these challenges, we propose Distributed Upload and Active Labeling (DUAL), a decentralized, two-stage data collection framework for resource-constrained robotic fleets. In the first stage, each robot independently selects a subset of its local observations to upload under storage and communication constraints. In the second stage, the cloud selects a subset of uploaded data to label, subject to a global annotation budget. We evaluate DUAL on classification tasks spanning multiple sensing modalities, as well as on RoadNet—a real-world dataset we collected from vehicle-mounted cameras for time and weather classification. We further validate our approach in a physical experiment using a Franka Emika Panda robot arm, where it learns to move a red cube to a green bowl. Finally, we test DUAL on trajectory prediction using the nuScenes autonomous driving dataset to assess generalization to complex prediction tasks. Across all settings, DUAL consistently outperforms state-of-the-art baselines, achieving up to 31.1% gain in classification accuracy and a 13% improvement in real-world robotics task completion rates.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/akcin25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/akcin25a.html</guid>
        
        
      </item>
    
      <item>
        <title>Towards Embodiment Scaling Laws in Robot Locomotion</title>
        <description>Developing generalist agents that operate across diverse tasks, environments, and robot embodiments is a grand challenge in robotics and artificial intelligence. While substantial progress has been made in cross-task and cross-environment generalization, achieving broad generalization to novel embodiments remains elusive. In this work, we study embodiment scaling laws — the hypothesis that increasing the number of training embodiments improves generalization to unseen ones. To explore this, we procedurally generate a dataset of $\sim$1,000 varied robot embodiments, spanning humanoids, quadrupeds, and hexapods, and train embodiment-specific reinforcement learning experts for legged locomotion. We then distill these experts into a single generalist policy capable of handling diverse observation and action spaces. Our large-scale study reveals that generalization performance improves with the number of training embodiments. Notably, a policy trained on the full dataset transfers zero-shot to diverse unseen embodiments in both simulation and real-world evaluations. These results provide preliminary empirical evidence for embodiment scaling laws and suggest that scaling up embodiment quantity may serve as a foundation for building generalist robot agents.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/ai25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/ai25a.html</guid>
        
        
      </item>
    
      <item>
        <title>CUPID: Curating Data your Robot Loves with Influence Functions</title>
        <description>In robot imitation learning, policy performance is tightly coupled with the quality and composition of the demonstration data. Yet, developing a precise understanding of how individual demonstrations contribute to downstream outcomes—such as closed-loop task success or failure—remains a persistent challenge. Inspired by the theory of influence functions, we propose CUPID. Given a set of evaluation rollouts, CUPID estimates the influence of a training demonstration on the policy’s expected return. This enables ranking and selection of demonstrations according to their impact on the policy’s closed-loop performance. We use our estimator to curate data by 1) filtering out training demonstrations that harmed the policy’s performance and 2) subselecting newly collected trajectories that will most help improve the policy. Extensive simulated and hardware experiments show that our approach consistently identifies which data drives test-time performance. For example, training with less than 33% of curated data can result in state-of-the-art diffusion policies on the simulated Robomimic benchmark, and we observe similar improvements in hardware experiments. Furthermore, our hardware experiments show that our influence-based estimator can identify robust strategies under distribution shift, isolate spurious correlations, and even enhance post-training of generalist policies.</description>
        <pubDate>Tue, 07 Oct 2025 00:00:00 +0000</pubDate>
        <link>https://proceedings.mlr.press/v305/agia25a.html</link>
        <guid isPermaLink="true">https://proceedings.mlr.press/v305/agia25a.html</guid>
        
        
      </item>
    
  </channel>
</rss>
