Last updated on May 7, 2026. This conference program is tentative and subject to change.
Technical Program for Wednesday June 3, 2026
| WeI1I Interactive Session, Hall C |
| Interactive Session 3 |
| 09:00-10:30, Paper WeI1I.1 |
| Robust Nonprehensile Object Transportation with Uncertain Inertial Parameters |
| Heins, Adam | University of Toronto |
| Schoellig, Angela P. | TU Munich |
Keywords: Mobile Manipulation, Whole-Body Motion Planning and Control
Abstract: We consider the nonprehensile object transportation task known as the waiter's problem---in which a robot must move an object on a tray from one location to another---when the transported object has uncertain inertial parameters. In contrast to existing approaches that completely ignore uncertainty in the inertia matrix or consider only small parameter errors, we are interested in pushing the limits of the amount of inertial parameter uncertainty that can be handled. We first show how constraints that are robust to inertial parameter uncertainty can be incorporated into an optimization-based motion planning framework to transport objects while moving quickly. Next, we develop necessary conditions for the inertial parameters to be realizable on a bounding shape based on moment relaxations, allowing us to verify whether a trajectory will violate the constraints for any realizable inertial parameters. Finally, we demonstrate our approach on a mobile manipulator in simulations and real hardware experiments: our proposed robust constraints consistently succeed in transporting a 56 cm tall object with substantial inertial parameter uncertainty in the real world, while the baseline approaches drop the object during transport.
| 09:00-10:30, Paper WeI1I.2 |
| Inspection Planning under Execution Uncertainty |
| Alpert, Shmuel David | Technion - Israel Institute of Technology |
| Solovey, Kiril | Technion--Israel Institute of Technology |
| Klein, Itzik | University of Haifa |
| Salzman, Oren | Technion |
Keywords: Inspection planning under uncertainty, Motion and Path Planning, Aerial Systems: Applications, Collision Avoidance
Abstract: Autonomous inspection tasks require path-planning algorithms to efficiently gather observations from points of interest (POIs). However, localization errors in urban environments introduce execution uncertainty, posing challenges to successfully completing such tasks. Existing inspection-planning algorithms do not explicitly address this uncertainty, which can hinder their performance. To overcome this, we introduce IRIS-Under-Uncertainty (IRIS-U²), an inspection-planning algorithm that provides statistical assurances regarding coverage, path length, and collision probability. Our approach builds upon IRIS, our deterministic, highly efficient, and provably asymptotically-optimal inspection-planning framework. This extension adapts IRIS to uncertain settings using a refined search procedure that estimates POI coverage probabilities through Monte Carlo (MC) sampling. We demonstrate IRIS-U² through a case study on bridge inspections, achieving improved expected coverage, reduced collision probability, and increasingly precise statistical guarantees as MC samples grow. Additionally, we explore bounded suboptimal solutions to reduce computation time while preserving statistical assurances.
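The Monte Carlo coverage estimate at the heart of this approach is easy to picture: roll out the planned path many times under sampled localization error and count how often each POI is observed. Below is a minimal NumPy sketch of that step, assuming Gaussian waypoint noise and a circular sensor footprint; all names and parameters are illustrative, not taken from the paper.

```python
import numpy as np

def coverage_probability(waypoints, pois, sensor_range, sigma,
                         n_samples=1000, rng=None):
    """Estimate per-POI coverage probability under Gaussian execution noise.

    Generic sketch (not the IRIS-U^2 implementation): each sample perturbs
    every waypoint by zero-mean localization error and checks which points
    of interest (POIs) fall within sensor range of the noisy path.
    """
    rng = rng or np.random.default_rng(0)
    waypoints = np.asarray(waypoints, float)   # (T, 2) planned path
    pois = np.asarray(pois, float)             # (M, 2) points of interest
    hits = np.zeros(len(pois))
    for _ in range(n_samples):
        noisy = waypoints + rng.normal(0.0, sigma, waypoints.shape)
        # distance from every POI to the closest executed waypoint
        d = np.linalg.norm(pois[:, None, :] - noisy[None, :, :], axis=-1).min(axis=1)
        hits += (d <= sensor_range)
    return hits / n_samples  # empirical coverage probability per POI

probs = coverage_probability([[0, 0], [1, 0], [2, 0]], [[1, 0.4], [2, 1.5]],
                             sensor_range=0.5, sigma=0.1)
print(probs)  # per-POI coverage probabilities, roughly [0.8, 0.0]
```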
| 09:00-10:30, Paper WeI1I.3 |
| Drive in Corridors: Enhancing the Safety of End-To-End Autonomous Driving Via Corridor Learning and Planning |
| Zhang, Zhiwei | Fudan University |
| Yang, Ruichen | Fudan University |
| Wu, Ke | Fudan University |
| Xu, Zijun | Fudan University |
| Liu, Jingchu | Horizon Robotics |
| Mu, Lisen | Horizon Robotics |
| Gan, Zhongxue | Fudan University |
| Ding, Wenchao | Fudan University |
Keywords: Integrated Planning and Learning, Collision Avoidance, Vision-Based Navigation
Abstract: Safety remains one of the most critical challenges in autonomous driving systems. In recent years, end-to-end driving has shown great promise in advancing vehicle autonomy in a scalable manner. However, existing approaches often face safety risks due to the lack of explicit behavior constraints. To address this issue, we uncover a new paradigm by introducing the corridor as the intermediate representation. Widely adopted in robotics planning, corridors represent spatio-temporal obstacle-free zones for the vehicle to traverse. To ensure accurate corridor prediction in diverse traffic scenarios, we develop a comprehensive learning pipeline including data annotation, architecture refinement and loss formulation. The predicted corridor is further integrated as a constraint in a trajectory optimization process. By extending the differentiability of the optimization, we enable the optimized trajectory to be seamlessly trained within the end-to-end learning framework, improving both safety and interpretability. Experimental results on the nuScenes dataset demonstrate state-of-the-art performance of our approach, showing a 66.7% reduction in collisions with agents and a 46.5% reduction with curbs, significantly enhancing the safety of end-to-end driving. Additionally, incorporating the corridor contributes to higher success rates in closed-loop evaluations.
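To make the corridor constraint concrete, the sketch below scores a trajectory against a per-timestep corridor modeled as axis-aligned boxes. This is an assumed simplification for intuition only; the paper uses general spatio-temporal corridors inside a differentiable trajectory optimizer.

```python
import numpy as np

def corridor_violation(traj, lo, hi):
    """Sum of per-timestep distances by which a trajectory leaves its corridor.

    Illustrative sketch: the corridor is one axis-aligned box [lo_t, hi_t]
    per timestep; a real system would use general convex polytopes and a
    differentiable penalty inside the optimizer.
    """
    below = np.maximum(lo - traj, 0.0)   # violation on the lower faces
    above = np.maximum(traj - hi, 0.0)   # violation on the upper faces
    return float(np.sum(below + above))

traj = np.array([[0.0, 0.1], [0.5, 0.3], [1.2, 0.4]])  # (T, 2) xy positions
lo   = np.array([[-0.5, -0.5], [0.0, 0.0], [0.5, 0.0]])
hi   = np.array([[ 0.5,  0.5], [1.0, 0.5], [1.0, 0.5]])
print(corridor_violation(traj, lo, hi))  # 0.2: last point exceeds x <= 1.0
```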
| 09:00-10:30, Paper WeI1I.4 |
| Tactile Elastography |
| Xiang, Yichen | Southeast University |
| Zhu, Lifeng | Southeast University |
| Song, Aiguo | Southeast University |
| Zhang, Jessica | Carnegie Mellon University |
Keywords: Force and Tactile Sensing, Soft Sensors and Actuators, Object Detection, Segmentation and Categorization, Vision-based Tactile Perception
Abstract: Elasticity is one of the representative parameters that reflect the mechanical properties of soft materials. Detecting the underlying elasticity distribution, a task known as elastography, is a key step for understanding and interacting with objects. Existing solutions for capturing the interior elasticity distribution typically rely on expensive apparatus. In this work, the dense tactile signal captured by a high-resolution vision-based tactile sensor is introduced as a new modality for reconstructing 3D elasticity distribution. We propose a model-based method, which exploits the tactile maps from active pressing trials for the elastography task. The interior elasticity distribution for non-rigid objects is reconstructed from an inverse physics model. We analyze the credibility of the estimated elasticity distribution obtained from our method. Varying design factors are also discussed. We evaluate our method on a set of synthesized 3D models and physical models in robot-assisted scenes. Various experimental results have been gathered, demonstrating the efficacy of our approach in perceiving elasticity distribution.
| 09:00-10:30, Paper WeI1I.5 |
| Neural Internal Model Control: Learning a Robust Control Policy Via Predictive Error Feedback |
| Gao, Feng | Tsinghua University |
| Yu, Chao | Tsinghua University |
| Wang, Yu | Tsinghua University |
| Wu, Yi | Tsinghua University |
Keywords: Reinforcement Learning, Sensorimotor Learning, Machine Learning for Robot Control
Abstract: Accurate motion control in the face of disturbances within complex environments remains a major challenge in robotics. Classical model-based approaches often struggle with nonlinearities and unstructured disturbances, while reinforcement learning (RL)-based methods can be fragile when encountering unseen scenarios. In this paper, we propose a novel framework, Neural Internal Model Control (NeuralIMC), which integrates model-based control with RL-based control to enhance robustness. Our framework streamlines the predictive model by applying Newton-Euler equations for rigid-body dynamics, eliminating the need to capture complex high-dimensional nonlinearities. This internal model combines model-free RL algorithms with predictive error feedback. Such a design enables a closed-loop control structure to enhance the robustness and generalizability of the control system. We demonstrate the effectiveness of our framework on both quadrotors and quadrupedal robots, achieving superior performance compared to state-of-the-art methods. Furthermore, real-world deployment on a quadrotor with rope-suspended payloads highlights the framework’s robustness in sim-to-real transfer. Our code is released at https://github.com/thu-uav/NeuralIMC.
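The predictive-error-feedback idea can be illustrated with a toy internal model: predict the next state with nominal dynamics, then feed the discrepancy with the measured state back to the policy. The sketch below uses a 1-D point mass with unmodeled drag standing in for the disturbance; the actual framework uses full Newton-Euler rigid-body dynamics, so everything here is an assumption for illustration.

```python
import numpy as np

def internal_model_step(v, u, m, dt):
    """Nominal prediction (a 1-D point mass here, not the paper's model)."""
    return v + (u / m) * dt

# Minimal sketch of predictive error feedback under an assumed toy plant.
m, dt, drag = 1.0, 0.02, 0.8
v_true = 0.0
for k in range(5):
    u = 1.0                                            # action from the (RL) policy
    v_pred = internal_model_step(v_true, u, m, dt)     # internal model rollout
    v_true = v_true + ((u - drag * v_true) / m) * dt   # "real" plant with unmodeled drag
    pred_err = v_true - v_pred                         # residual the policy would consume
    print(f"step {k}: predictive error = {pred_err:+.5f}")
```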
| 09:00-10:30, Paper WeI1I.6 |
| BnB-Based Robust PnP Pose Estimation Method for Outliers |
| Long, Chenrong | Beihang University |
| Hu, Qinglei | Beihang University |
| Jiang, Cuicui | Beihang University |
| Li, Dongyu | Beihang University |
| Ouyang, Zhenchao | Beihang University |
| 09:00-10:30, Paper WeI1I.7 |
| MonoTher-Depth: Enhancing Thermal Depth Estimation Via Confidence-Aware Distillation |
| Zuo, Xingxing | MBZUAI |
| Ranganathan, Nikhil | California Institute of Technology |
| Lee, Connor | California Institute of Technology |
| Gkioxari, Georgia | Facebook AI Research |
| Chung, Soon-Jo | Caltech |
Keywords: Range Sensing, Deep Learning for Visual Perception
Abstract: Monocular depth estimation (MDE) from thermal images is a crucial technology for robotic systems operating in challenging conditions such as fog, smoke, and low light. The limited availability of labeled thermal data constrains the generalization capabilities of thermal MDE models compared to foundational RGB MDE models, which benefit from datasets of millions of images across diverse scenarios. To address this challenge, we introduce a novel pipeline that enhances thermal MDE through knowledge distillation from a versatile RGB MDE model. Our approach features a confidence-aware distillation method that utilizes the predicted confidence of the RGB MDE to selectively strengthen the thermal MDE model, capitalizing on the strengths of the RGB model while mitigating its weaknesses. Our method significantly improves the accuracy of the thermal MDE, independent of the availability of labeled depth supervision, and greatly expands its applicability to new scenarios. In our experiments on new scenarios without labeled depth, the proposed confidence-aware distillation method reduces the absolute relative error of thermal MDE by 22.88% compared to the baseline without distillation.
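A confidence-aware distillation loss can be as simple as weighting the per-pixel student-teacher discrepancy by the teacher's predicted confidence. The sketch below shows one plausible weighted-L1 form; the exact weighting used by MonoTher-Depth may differ, so treat this as an assumption.

```python
import numpy as np

def confidence_aware_distill_loss(d_thermal, d_rgb, conf):
    """Confidence-weighted L1 distillation loss (illustrative sketch).

    `conf` in [0, 1] is the RGB teacher's per-pixel confidence; low-confidence
    teacher pixels contribute less to the thermal student's loss.
    """
    w = np.clip(conf, 0.0, 1.0)
    return float(np.sum(w * np.abs(d_thermal - d_rgb)) / (np.sum(w) + 1e-8))

d_t = np.array([[1.0, 2.0], [3.0, 4.0]])   # student (thermal) depth
d_r = np.array([[1.1, 2.5], [3.0, 9.0]])   # teacher (RGB) pseudo-depth
c   = np.array([[0.9, 0.8], [1.0, 0.1]])   # teacher confidence; 0.1 downweights the outlier
print(confidence_aware_distill_loss(d_t, d_r, c))
```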
| 09:00-10:30, Paper WeI1I.8 |
| Semantically-Driven Deep Reinforcement Learning for Inspection Path Planning |
| Malczyk, Grzegorz | NTNU - Norwegian University of Science and Technology |
| Kulkarni, Mihir | NTNU: Norwegian University of Science and Technology |
| Alexis, Kostas | NTNU - Norwegian University of Science and Technology |
Keywords: Aerial Systems: Perception and Autonomy, Aerial Systems: Applications, Reinforcement Learning
Abstract: This letter introduces a novel semantics-aware inspection planning policy derived through deep reinforcement learning. Reflecting the fact that within autonomous informative path planning missions in unknown environments, it is often only a sparse set of objects of interest that need to be inspected, the method contributes an end-to-end policy that simultaneously performs semantic object visual inspection combined with collision-free navigation. Assuming access only to the instantaneous depth map, the associated segmentation image, the ego-centric local occupancy, and the history of past positions in the robot’s neighborhood, the method demonstrates robust generalizability and successful crossing of the sim2real gap. Beyond simulations and extensive comparison studies, the approach is verified in experimental evaluations onboard a flying robot deployed in novel environments with previously unseen semantics and overall geometric configurations.
| 09:00-10:30, Paper WeI1I.9 |
| Stable Object Placement Planning from Contact Point Robustness |
| Nadeau, Philippe | University of Toronto |
| Kelly, Jonathan | University of Toronto |
Keywords: Object Placement, Manipulation Planning, Assembly, Task Planning
Abstract: We introduce a planner designed to guide robot manipulators in stably placing objects within complex scenes. Our proposed method reverses the traditional approach to object placement: our planner selects contact points first and then determines a placement pose that solicits the selected points. This is instead of sampling poses, identifying contact points, and evaluating pose quality. Our algorithm facilitates stability-aware object placement planning, imposing no restrictions on object shape, convexity, or mass density homogeneity, while avoiding combinatorial computational complexity. Our proposed stability heuristic enables our planner to find a solution about 20 times faster when compared to the same algorithm not making use of the heuristic and eight times faster than a state-of-the-art method using the traditional sample-and-evaluate approach. The proposed planner is also more successful in finding stable placements than the five other benchmarked algorithms. Derived from first principles and validated in ten real robot experiments, our approach provides a general and scalable solution to the problem of rigid object placement planning.
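For intuition, the classic flat-ground stability condition that any placement must ultimately satisfy is that the center of mass projects inside the support polygon of the contact points. A minimal SciPy sketch of that test follows (the `margin` parameter is hypothetical; the paper's planner reasons about contact-point robustness rather than running this sample-and-test check).

```python
import numpy as np
from scipy.spatial import Delaunay

def statically_stable(contact_xy, com_xy, margin=0.0):
    """Flat-ground stability test: the gravity projection of the center of
    mass must lie inside the support polygon of the contact points.

    Illustrative only; `margin` optionally shrinks the polygon toward its
    centroid for a conservative check.
    """
    pts = np.asarray(contact_xy, float)
    if margin > 0.0:
        centroid = pts.mean(axis=0)
        pts = centroid + (1.0 - margin) * (pts - centroid)
    q = np.atleast_2d(np.asarray(com_xy, float))
    return bool(Delaunay(pts).find_simplex(q)[0] >= 0)

contacts = [[0, 0], [0.1, 0], [0.1, 0.1], [0, 0.1]]  # square footprint
print(statically_stable(contacts, [0.05, 0.05]))     # True
print(statically_stable(contacts, [0.20, 0.05]))     # False: COM outside polygon
```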
| 09:00-10:30, Paper WeI1I.10 |
| A Versatile Neural Network Configuration Space Planning and Control Strategy for Modular Soft Robot Arms |
| Chen, Zixi | Scuola Superiore Sant'Anna |
| Guan, Qinghua | EPFL |
| Hughes, Josie | EPFL |
| Menciassi, Arianna | Scuola Superiore Sant'Anna - SSSA |
| Stefanini, Cesare | MBZUAI |
Keywords: Soft Robot Applications, Optimization and Optimal Control, Deep Learning in Robotics and Automation, Soft Robot Control
Abstract: Modular soft robot arms (MSRAs) are composed of multiple modules connected in a sequence, and they can bend at different angles in various directions. This capability allows MSRAs to perform more intricate tasks than single-module robots. However, the modular structure also introduces challenges for accurate planning and control. Nonlinearity and hysteresis complicate the physical model, while the modular structure and increased DOFs further lead to cumulative errors along the sequence. To address these challenges, we propose a versatile configuration space planning and control strategy for MSRAs, named S2C2A (State to Configuration to Action). Our approach formulates an optimization problem, S2C (State to Configuration planning), which integrates various loss functions and a forward model based on biLSTM to generate configuration trajectories based on target states. A configuration controller C2A (Configuration to Action control) based on biLSTM is implemented to follow the planned configuration trajectories, leveraging only inaccurate internal sensing feedback. We validate our strategy using a cable-driven MSRA, demonstrating its ability to perform diverse offline tasks such as position and orientation control and obstacle avoidance. Furthermore, our strategy endows the MSRA with online interaction capability with targets and obstacles. Future work will address remaining MSRA challenges, such as developing more accurate physical models.
| 09:00-10:30, Paper WeI1I.11 |
| Sensor Model Identification Via Simultaneous Model Selection and State Variable Determination |
| Brommer, Christian | University of Klagenfurt |
| Fornasier, Alessandro | Hexagon Robotics |
| Steinbrener, Jan | University of Klagenfurt |
| Weiss, Stephan | University of Klagenfurt |
Keywords: Calibration and Identification, Sensor Fusion, Localization, Aerial Systems: Applications
Abstract: We present a method for the unattended gray-box identification of sensor models commonly used by localization algorithms in the field of robotics. The objective is to determine the most likely sensor model for a time series of unknown measurement data, given an extendable catalog of predefined sensor models. Sensor model definitions may require states for rigid-body calibrations and dedicated reference frames to replicate a measurement based on the robot's localization state. A health metric is introduced, which verifies the outcome of the selection process in order to detect false positives and facilitate reliable decision-making. In a second stage, an initial guess for identified calibration states is generated, and the necessity of sensor world reference frames is evaluated. The identified sensor model with its parameter information is then used to parameterize and initialize a state estimation application, thus ensuring a more accurate and robust integration of new sensor elements. This method is helpful for inexperienced users who want to identify the source and type of a measurement, sensor calibrations, or sensor reference frames. It will also be important in the field of modular multi-agent scenarios and modularized robotic platforms that are augmented by sensor modalities during runtime. Overall, this work aims to provide a simplified integration of sensor modalities to downstream applications and circumvent common pitfalls in the usage and development of localization approaches.
| 09:00-10:30, Paper WeI1I.12 |
| FusionGS-SLAM: Multiple Sensors Fusion for Simultaneous Localization and Real-Time Photorealistic Mapping |
| Phan, Thanh-Danh | Chungbuk National University |
| Kim, Gon-Woo | Chungbuk National University |
Keywords: Mapping, SLAM, Localization
Abstract: This work presents FusionGS-SLAM, a robust framework for simultaneous localization and real-time photorealistic mapping leveraging the power of sensor fusion techniques. To achieve this, the proposed method employs a tightly-coupled technique to effectively combine multiple factors from improved subsystems, thereby generating robust odometry for downstream tasks. Moreover, a dense 3D Gaussian map is constructed by leveraging geometric information across sensor modalities, with real-time mapping strategies designed to enhance robustness and rendering quality in large-scale and challenging environments. Experimental evaluation on various challenging scenes, including public and self-collected datasets, showcases superior performance compared to current state-of-the-art 3DGS SLAM methods.
| 09:00-10:30, Paper WeI1I.13 |
| Design of a Variable Stiffness Quasi-Direct Drive Cable-Actuated Tensegrity Robot |
| Mi, Jonathan | University of Michigan, Ann Arbor |
| Tong, Wenzhe | University of Michigan, Ann Arbor |
| Ma, Yilin | University of Michigan, Ann Arbor |
| Huang, Xiaonan | University of Michigan |
Keywords: Soft Robot Materials and Design, Compliant Joints and Mechanisms, Flexible Robotics
Abstract: Tensegrity robots excel in tasks requiring extreme levels of deformability and robustness. However, there are challenges in state estimation and payload versatility due to their high number of degrees of freedom and unconventional shape. This paper introduces a modular three-bar tensegrity robot featuring a customizable payload design. Our tensegrity robot employs a novel Quasi-Direct Drive (QDD) cable actuator with low-stretch polymer cables to achieve accurate proprioception without needing external force or torque sensors. The design allows for on-the-fly stiffness tuning for better environment and payload adaptability. In this paper, we present the robot’s design, fabrication, assembly, and experimental results. Experimental data demonstrate highly accurate cable length estimation (<1% error relative to bar length) and variable stiffness control of the cable actuator up to 7 times the minimum stiffness required for self-support. The shape morphing and stiffness tuning capabilities are leveraged in two realistic demonstrations. The presented tensegrity robot is a platform for future advancements in autonomous operation and open-source module design. Open source design files are available at (Redacted URL).
| 09:00-10:30, Paper WeI1I.14 |
| HIPPo: Harnessing Image-To-3D Priors for Model-Free Zero-Shot 6D Pose Estimation |
| Liu, Yibo | York University |
| Jiang, Zhaodong | University of Toronto |
| Xu, Binbin | Huawei Noah's Ark Lab |
| Wu, Guile | Huawei Noah's Ark Lab |
| Ren, Yuan | Noah's Ark Lab, Huawei Technologies Canada Inc |
| Cao, Tongtong | Noah's Ark Lab, Huawei Technologies |
| Liu, Bingbing | Huawei Technologies |
| Yang, Rui Heng | Huawei Technologies Canada |
| Rasouli, Amir | Huawei Technologies Canada |
| Shan, Jinjun | York University |
Keywords: AI-Based Methods, Computer Vision for Automation, AI-Enabled Robotics
Abstract: This work focuses on the problem of 6D pose estimation for novel objects when a reference 3D model or posed reference images are not available. While existing methods can estimate the precise 6D pose of objects, they heavily rely on curated CAD models or reference images, the preparation of which is a time-consuming and labor-intensive process. Moreover, in real-world scenarios, 3D models or reference images may not be available in advance and instant robot reaction is desired. In this work, we propose a novel framework named HIPPo, which eliminates the need for curated CAD models and reference images by harnessing image-to-3D priors from Diffusion Models, enabling model-free zero-shot 6D pose estimation. Specifically, we construct HIPPo Dreamer, a rapid image-to-mesh model built on a multiview Diffusion Model and a 3D reconstruction foundation model. Our HIPPo Dreamer can generate a 3D mesh of any unseen object from a single glance in just a few seconds. Then, as more observations are acquired, we propose to continuously refine the diffusion prior mesh model by joint optimization of object geometry and appearance. This is achieved by a measurement-guided scheme that gradually replaces the plausible diffusion priors with more reliable online observations. Consequently, HIPPo can instantly estimate and track the 6D pose of a novel object and maintain a complete mesh for immediate robotic applications. Thorough experiments on various benchmarks show that HIPPo outperforms state-of-the-art methods in 6D object pose estimation when prior reference images are limited.
| 09:00-10:30, Paper WeI1I.15 |
| Online Planning for Multi-UAV Pursuit-Evasion in Unknown Environments Using Deep Reinforcement Learning |
| Chen, Jiayu | Tsinghua University |
| Yu, Chao | Tsinghua University |
| Li, Guosheng | Tsinghua University |
| Tang, Wenhao | Tsinghua University |
| Ji, Shilong | Tsinghua University |
| Yang, Xinyi | Tsinghua University |
| Xu, Botian | Tsinghua University |
| Yang, Huazhong | Tsinghua University |
| Wang, Yu | Tsinghua University |
Keywords: Reinforcement Learning, Cooperating Robots, Multi-Robot Systems
Abstract: Multi-UAV pursuit-evasion, where pursuers aim to capture evaders, poses a key challenge for UAV swarm intelligence. Multi-agent reinforcement learning (MARL) has demonstrated potential in modeling cooperative behaviors, but most RL-based approaches remain constrained to simplified simulations with limited dynamics or fixed scenarios. Previous attempts to deploy RL policies in real-world pursuit-evasion are largely restricted to two-dimensional scenarios, such as ground vehicles or UAVs at fixed altitudes. In this paper, we propose a novel MARL-based algorithm that learns online planning for multi-UAV pursuit-evasion in unknown environments (OPEN). OPEN introduces an evader prediction-enhanced network to tackle partial observability in cooperative policy learning. Additionally, OPEN proposes an adaptive environment generator within MARL training, enabling higher exploration efficiency and better policy generalization across diverse scenarios. Simulations show our method significantly outperforms all baselines in challenging scenarios, generalizing to unseen scenarios with a 100% capture rate. Finally, after integrating calibrated dynamics models of UAVs into training, we derive a feasible policy via a two-stage reward refinement and deploy the policy on real quadrotors in a zero-shot manner. To our knowledge, this is the first work to derive and deploy an RL-based policy using collective thrust and body rates control commands for multi-UAV pursuit-evasion in unknown environments. The open-source code and videos are available at https://sites.google.com/view/pursuit-evasion-rl.
| 09:00-10:30, Paper WeI1I.16 |
| CVF-DLO: Cross-Visual-Field Branched Deformable Linear Objects Route Estimation |
| Yu, Chang | Tsinghua University |
| Wang, Jianjian | Tsinghua University |
| Feng, Pingfa | Tsinghua University |
| Yu, Dingwen | Tsinghua University |
| Zhang, Jianfu | Tsinghua University |
Keywords: Perception for Grasping and Manipulation, Object Detection, Segmentation and Categorization, Computer Vision for Manufacturing
Abstract: The perception of deformable linear objects (DLOs) poses significant challenges in robotic manipulation. Crossovers, mergings, and bifurcations of multiple DLOs complicate the identification of individual DLO physical instances. Furthermore, DLOs are often too large to be captured by a single camera, requiring the stitching of multiple overlapping views. This paper presents CVF-DLO, a cross-visual-field route estimation framework for branched DLOs (BDLOs), such as wire harnesses, laid along physical surfaces, based on images from multiple viewpoints and pose-aware cameras. CVF-DLO is applicable to various perception tasks involving DLO-like structures, such as verifying connection accuracy and route consistency in cables and pipes. We propose a DLO instance segmentation method that demonstrates superior performance in handling crossings and bifurcations. The extracted DLO paths are projected onto the designed cable-laying surfaces using the camera pose and scene model. Finally, DLO routes are retrieved by searching within the spatial path domain formed by intersecting visual fields. To validate our method on wiring harnesses and intersections, we use two public DLO datasets and introduce a new BDLO dataset to benchmark against state-of-the-art DLO instance segmentation methods. Additionally, we present a cabin wiring harness dataset to evaluate the performance of the cross-visual-field route estimation. We have released all our source code and datasets (with ground truth) at https://github.com/ForNe-tech/CVF-DLO.
| 09:00-10:30, Paper WeI1I.17 |
| SICNav-Diffusion: Safe and Interactive Crowd Navigation with Diffusion Trajectory Predictions |
| Samavi, Sepehr | University of Toronto |
| Lem, Anthony Jia-Hao | University of Toronto |
| Sato, Fumiaki | CyberAgent, Inc |
| Chen, Sirui | University of Toronto |
| Gu, Qiao | University of Toronto |
| Yano, Keijiro | Konica Minolta |
| Schoellig, Angela P. | TU Munich |
| Shkurti, Florian | University of Toronto |
Keywords: Autonomous Vehicle Navigation, Collision Avoidance, Planning under Uncertainty
Abstract: To navigate crowds without collisions, robots must interact with humans by forecasting their future motion and reacting accordingly. While learning-based prediction models have shown success in generating likely human trajectory predictions, integrating these stochastic models into a robot controller presents several challenges. The controller needs to account for interactive coupling between planned robot motion and human predictions while ensuring both predictions and robot actions are safe (i.e. collision-free). To address these challenges, we present a receding horizon crowd navigation method for single-robot multi-human environments. We first propose a diffusion model to generate joint trajectory predictions for all humans in the scene. We then incorporate these multi-modal predictions into a SICNav Bilevel MPC problem that simultaneously solves for a robot plan (upper-level) and acts as a safety filter to refine the predictions for non-collision (lower-level). Combining planning and prediction refinement into one bilevel problem ensures that the robot plan and human predictions are coupled. We validate the open-loop trajectory prediction performance of our diffusion model on the commonly used ETH/UCY benchmark and evaluate the closed-loop performance of our robot navigation method in simulation and extensive real-robot experiments demonstrating safe, efficient, and reactive robot motion.
| 09:00-10:30, Paper WeI1I.18 |
| A Distributed Multi-Modal Sensing Approach for Human Activity Recognition in Real-Time Human-Robot Collaboration |
| Belcamino, Valerio | Fusion AI Labs S.r.l |
| Le Dinh, Minh Nhat | The University of Danang–University of Science and Technology |
| Luu, Quan | Purdue University |
| Carfì, Alessandro | Fusion AI Labs S.r.l |
| Ho, Van | Japan Advanced Institute of Science and Technology |
| Mastrogiovanni, Fulvio | University of Genoa |
Keywords: Multi-Modal Perception for HRI, Physical Human-Robot Interaction
Abstract: Human activity recognition (HAR) is fundamental in human-robot collaboration (HRC), enabling robots to respond and dynamically adapt to human intentions. This paper introduces a HAR system combining a modular data glove equipped with Inertial Measurement Units and a vision-based tactile sensor to capture hand activities in contact with a robot. We tested our activity recognition approach under different conditions, including offline classification of segmented sequences, real-time classification under static conditions, and a realistic HRC scenario. The experimental results show a high accuracy for all the tasks, suggesting that multiple collaborative settings could benefit from this multi-modal approach.
| 09:00-10:30, Paper WeI1I.19 |
| NEUSIS: A Compositional Neuro-Symbolic Framework for Autonomous Perception, Reasoning, and Planning in Complex UAV Search Missions |
| Cai, Zhixi | Monash University |
| Rojas Cardenas, Cristian | Monash University |
| Leo, Kevin | Monash University |
| Zhang, Chenyuan | Monash University |
| Backman, Kal | Monash University |
| Li, Hanbing | Monash University |
| Li, Boying | Monash University |
| Ghorbanali, Mahsa | Monash University |
| Datta, Stavya | Monash University |
| Qu, Lizhen | Monash University |
| Gutierrez, Julian | University of Sussex |
| Ignatiev, Alexey | Monash University |
| Li, Yuan-Fang | Monash University |
| Vered, Mor | Monash University |
| Stuckey, Peter | Monash University |
| Garcia de la Banda, Maria | Monash University |
| Rezatofighi, Hamid | Monash University |
Keywords: Aerial Systems: Perception and Autonomy, Deep Learning for Visual Perception
Abstract: This paper addresses the problem of autonomous UAV search missions, where a UAV must locate specific Entities of Interest (EOIs) within a time limit, based on brief descriptions in large, hazard-prone environments with keep-out zones. The UAV must perceive, reason, and make decisions with limited and uncertain information. We propose NEUSIS, a compositional neuro-symbolic system designed for effective UAV search and navigation in realistic scenarios. NEUSIS integrates neuro-symbolic visual perception, reasoning, and grounding (GRiD) to process raw sensory inputs, maintains a probabilistic world model for environment representation, and uses a hierarchical planning component (SNaC) for efficient path planning. Experimental results from simulated urban search missions using AirSim and Unreal Engine show that NEUSIS outperforms state-of-the-art baselines for both perception and planning. These results demonstrate the effectiveness of our compositional neuro-symbolic approach in handling complex scenarios, making it a promising solution for autonomous UAV systems in search missions.
| 09:00-10:30, Paper WeI1I.20 |
| Pair-VPR: Place-Aware Pre-Training and Contrastive Pair Classification for Visual Place Recognition with Vision Transformers |
| Hausler, Stephen | CSIRO |
| Moghadam, Peyman | CSIRO |
Keywords: Deep Learning for Visual Perception, Recognition, Localization
Abstract: In this work we propose a novel joint training method for Visual Place Recognition (VPR), which simultaneously learns a global descriptor and a pair classifier for re-ranking. The pair classifier can predict whether a given pair of images are from the same place or not. The network only comprises Vision Transformer components for both the encoder and the pair classifier, and both components are trained using their respective class tokens. In existing VPR methods, typically the network is initialized using pre-trained weights from a generic image dataset such as ImageNet. In this work we propose an alternative pre-training strategy, by using Siamese Masked Image Modelling as a pre-training task. We propose a Place-aware image sampling procedure from a collection of large VPR datasets for pre-training our model, to learn visual features tuned specifically for VPR. By re-using the Masked Image Modelling encoder and decoder weights in the second stage of training, Pair-VPR can achieve state-of-the-art VPR performance across five benchmark datasets with a ViT-B encoder, along with further improvements in localization recall with larger encoders.
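The two-stage retrieve-then-re-rank pattern can be sketched in a few lines: cosine retrieval with global descriptors, then re-ordering the shortlist by a pair-classifier score. The callable standing in for the classifier below is hypothetical; the real model is a ViT head scoring image pairs via class tokens.

```python
import numpy as np

def retrieve_and_rerank(query_desc, db_descs, pair_score, k=5):
    """Two-stage VPR sketch: global-descriptor retrieval, then pair re-ranking.

    `pair_score(i)` is a stand-in for the pair classifier, which in Pair-VPR
    scores whether the (query, candidate i) images show the same place.
    """
    sims = db_descs @ query_desc / (
        np.linalg.norm(db_descs, axis=1) * np.linalg.norm(query_desc) + 1e-8)
    top_k = np.argsort(-sims)[:k]                 # stage 1: cosine retrieval
    scores = np.array([pair_score(i) for i in top_k])
    return top_k[np.argsort(-scores)]             # stage 2: classifier re-rank

rng = np.random.default_rng(0)
db = rng.normal(size=(100, 128))                  # database descriptors
q = db[42] + 0.1 * rng.normal(size=128)           # query near database entry 42
print(retrieve_and_rerank(q, db, pair_score=lambda i: 1.0 if i == 42 else 0.0))
```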
| 09:00-10:30, Paper WeI1I.21 |
| ASMA: An Adaptive Safety Margin Algorithm for Vision-Language Drone Navigation Via Scene-Aware Control Barrier Functions |
| Sanyal, Sourav | Purdue University |
| Roy, Kaushik | Purdue University |
Keywords: Sensor-based Control, Collision Avoidance, AI-Enabled Robotics
Abstract: In the rapidly evolving field of vision–language navigation (VLN), ensuring safety for physical agents remains an open challenge. For a human-in-the-loop language-operated drone to navigate safely, it must understand natural language commands, perceive the environment, and simultaneously avoid hazards in real time. Control Barrier Functions (CBFs) are formal methods that enforce safe operating conditions. Model Predictive Control (MPC) is an optimization framework that plans a sequence of future actions over a prediction horizon, ensuring smooth trajectory tracking while obeying constraints. In this work, we consider a VLN-operated drone platform and enhance its safety by formulating a novel scene-aware CBF that leverages ego-centric observations from a camera with both Red-Green-Blue and Depth (RGB-D) channels. A CBF-less baseline system uses a Vision–Language Encoder with cross-modal attention to convert commands into an ordered sequence of landmarks. An object detection model identifies and verifies these landmarks in the captured images to generate a planned path. To further enhance safety, an Adaptive Safety Margin Algorithm (ASMA) is proposed. ASMA tracks moving objects and performs scene-aware CBF evaluation on-the-fly, which serves as an additional constraint within the MPC framework. By continuously identifying potentially risky observations, the system predicts unsafe conditions in real time and proactively adjusts its control actions to maintain safe navigation throughout the trajectory. Deployed on a Parrot Bebop2 quadrotor in the Gazebo environment using the Robot Operating System (ROS), ASMA achieves a 64%–67% increase in success rates with only a slight increase (1.4%–5.8%) in trajectory lengths compared to the CBF-less baseline.
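The standard discrete-time CBF condition underlying such safety margins is h(x_{k+1}) >= (1 - gamma) h(x_k). Below is a brute-force 1-D sketch of minimally correcting a nominal velocity to satisfy it; ASMA instead embeds the scene-aware CBF as a constraint inside an MPC problem, so all names and numbers here are purely illustrative.

```python
import numpy as np

def cbf_filter_1d(pos, v_nom, obs, margin, gamma=0.5, dt=0.1):
    """Minimally modify a nominal velocity so the discrete-time CBF condition
    h(x_{k+1}) >= (1 - gamma) * h(x_k) holds, with h(x) = (x - obs)^2 - margin^2.

    Brute-force grid search over velocity corrections, for intuition only.
    """
    h = (pos - obs) ** 2 - margin ** 2
    candidates = v_nom + np.linspace(-2.0, 2.0, 81)   # velocity corrections
    ok = [(v, abs(v - v_nom)) for v in candidates
          if ((pos + v * dt) - obs) ** 2 - margin ** 2 >= (1 - gamma) * h]
    return min(ok, key=lambda t: t[1])[0] if ok else 0.0

# Nominal command heads straight at an obstacle; the filter brakes it.
print(cbf_filter_1d(pos=0.0, v_nom=1.0, obs=0.6, margin=0.5))  # ~0.45
```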
| 09:00-10:30, Paper WeI1I.22 |
| A 3-Degrees-Of-Freedom Lightweight Flexible Twisted String Actuators (TSAs)-Based Exoskeleton for Wrist Rehabilitation |
| Dragusanu, Mihai | University of Siena |
| Guinet, Nicolas | University of Siena |
| Suthar, Bhivraj | IIT Jodhpur |
| Lisini Baldi, Tommaso | University of Siena |
| Prattichizzo, Domenico | University of Siena |
| Malvezzi, Monica | University of Siena |
Keywords: Prosthetics and Exoskeletons, Rehabilitation Robotics, Soft Robot Materials and Design
Abstract: This paper introduces a lightweight, three-degrees-of-freedom exoskeleton for wrist rehabilitation powered by Twisted String Actuators (TSAs), specifically designed to support flexion/extension, radial/ulnar deviation, and pronation/supination movements. Leveraging the high power-to-weight ratio of the TSA actuation system, the exoskeleton ensures effective, comfortable, and personalized rehabilitation exercises. The device comprises five TSAs arranged in a tendon-driven configuration, enabling precise control and adaptability to various user anatomies. Experimental evaluations were conducted on a prototype, demonstrating the device’s ability to accurately replicate wrist movements guided by a physiotherapist, achieving low tracking errors (RMSE of 1°). The exoskeleton effectively achieves the desired wrist range of motion—115° for flexion/extension, 70° for radial/ulnar deviation, and 150° for pronation/supination—with torque capabilities suitable for rehabilitation purposes (0.35 Nm for flexion/extension and radial/ulnar deviation, and 0.06 Nm for pronation/supination). These preliminary results validate the exoskeleton as a promising solution, offering improved comfort, flexibility, and effectiveness compared to traditional rehabilitation devices.
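The actuation principle is worth making concrete: in the standard twisted-string-actuator kinematic model, twisting a string of untwisted length L and radius r by motor angle theta shortens it to sqrt(L^2 - (theta r)^2), which is what gives TSAs their high power-to-weight transmission. A sketch with assumed dimensions (not the paper's; the five-TSA tendon routing adds geometry on top of this generic model):

```python
import numpy as np

def tsa_contraction(theta, L, r):
    """Standard TSA kinematics: contraction of a string of untwisted length L
    and radius r after the motor twists it by theta radians (theta*r < L)."""
    theta = np.asarray(theta, float)
    return L - np.sqrt(L ** 2 - (theta * r) ** 2)

L, r = 0.20, 0.0005              # 20 cm string, 0.5 mm radius (assumed values)
for turns in (10, 20, 30):       # motor revolutions
    theta = 2 * np.pi * turns
    print(f"{turns:2d} turns -> {1000 * tsa_contraction(theta, L, r):.1f} mm stroke")
```

Note the nonlinearity: the stroke per turn grows with the twist angle, which is one reason TSA-driven devices need careful kinematic modeling for precise tracking.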
| 09:00-10:30, Paper WeI1I.23 |
| DISCO: Language-Guided Manipulation with Diffusion Policies and Constrained Inpainting |
| Hao, Ce | National University of Singapore |
| Lin, Kelvin | National University of Singapore |
| Xue, Zhiwei | National University of Singapore |
| Luo, Siyuan | National University of Singapore |
| Soh, Harold | National University of Singapore |
Keywords: Imitation Learning, Machine Learning for Robot Control
Abstract: Diffusion policies have demonstrated strong performance in generative modeling, making them promising for robotic manipulation guided by natural language instructions. However, generalizing language-conditioned diffusion policies to open-vocabulary instructions in everyday scenarios remains challenging due to the scarcity and cost of robot demonstration datasets. To address this, we propose DISCO, a framework that leverages off-the-shelf vision-language models (VLMs) to bridge natural language understanding with high-performance diffusion policies. DISCO translates linguistic task descriptions into actionable 3D keyframes using VLMs, which then guide the diffusion process through constrained inpainting. However, enforcing strict adherence to these keyframes can degrade performance when the VLM-generated keyframes are inaccurate. To mitigate this, we introduce an inpainting optimization strategy that balances keyframe adherence with learned motion priors from training data. Experimental results in both simulated and real-world settings demonstrate that DISCO outperforms conventional fine-tuned language-conditioned policies, achieving superior generalization in zero-shot, open-vocabulary manipulation tasks.
| 09:00-10:30, Paper WeI1I.24 |
| Learning Autonomous and Safe Quadruped Traversal of Complex Terrains Using Multi-Layer Elevation Maps |
| Chen, Yeke | The University of Hong Kong |
| Ma, Ji | The University of Hong Kong |
| Luo, Zeren | The University of Hong Kong |
| Han, Yimin | The University of Hong Kong |
| Dong, Yinzhao | The University of Hong Kong |
| Xu, Bowen | The University of Hong Kong |
| Lu, Peng | The University of Hong Kong |
Keywords: Machine Learning for Robot Control, Legged Robots
Abstract: Legged robots hold great promise for agile and flexible mobility across diverse and unstructured terrains, inspired by the remarkable adaptability of bipeds and quadrupeds in nature. However, achieving robust autonomous locomotion in cluttered and complex environments remains a significant challenge. In this work, we present a hierarchical control framework for quadrupedal robots that enables safe and autonomous traversal of cluttered terrains. Central to our approach is a novel multi-layer elevation map representation, which is generalized enough to capture a wide range of terrains. To further improve policy generalization and maneuverability, we incorporate terrain augmentation, knowledge distillation, and carefully designed reward functions. Extensive simulation experiments demonstrate that each component contributes to improved policy generalization, and that our terrain representation is more efficient and informative than existing alternatives. By training a terrain compressor in simulation, we successfully deploy our system on a low-cost quadrupedal robot in real-world environments, showcasing the practicality and robustness of our approach.
| 09:00-10:30, Paper WeI1I.25 |
| Enhancing Robustness of Locomotion Policy for Quadrupedal Robot with Deep Disturbance Observer |
| Muhamad, Fikih | Seoul National University of Science and Technology |
| Kusuma, Anak Agung Krisna Ananda | Seoul National University of Science and Technology |
| Park, Jae-Han | Korea Institute of Industrial Technology |
| Kim, Jung Su | Seoul National University of Science and Technology |
Keywords: Reinforcement Learning, Legged Robots, Humanoid and Bipedal Locomotion
Abstract: This letter proposes a control framework to enhance the robustness of a locomotion policy against uncertainties by integrating it with a deep disturbance observer (DOB) network and a deep state estimator network. The deep DOB approximates the inverse model of a quadrupedal robot. The locomotion policy is trained to produce optimal actions, with the deep DOB estimating the overall uncertainties of the robot and the deep state estimator estimating the body's linear velocities. All networks are trained under nominal conditions in Isaac Gym. Subsequently, all the trained networks are transferred to Gazebo and a real robot using ROS 2 to validate their robustness under uncertain conditions without additional tuning. Furthermore, validation results show that the proposed control framework outperforms the baseline method in velocity tracking, achieving the lowest estimation errors. This emphasizes the effectiveness of the proposed control framework in improving the robustness of the locomotion policy. Videos of the Isaac Gym and Gazebo simulations and the real-robot experiment are available at the project page: bit.ly/3CF3OTQ.
| 09:00-10:30, Paper WeI1I.26 |
| Unified Map Prior Encoder for Mapping and Planning |
| Zhang, Zongzheng | Tsinghua University |
| Zou, Sizhe | Beijing Jiaotong University |
| Zheng, Guantian | Huazhong University of Science and Technology |
| Zhu, Zhenxin | Beihang University |
| Gao, Yu | Bosch (China) Investment Ltd |
| Chi, Guoxuan | Tsinghua University |
| Wang, Shuo | Bosch China |
| Heng, Yuwen | Bosch Corporate Research |
| Sun, Zhigang | Desay SV |
| Wang, Yiru | Bosch |
| Sun, Hao | National University of Singapore |
| Ma, Chao | Shanghai Jiao Tong University |
| Li, Zhen | Shenzhen Research Institute of Big Data |
| Jiang, Anqing | Robert Bosch |
| Zhao, Hao | Tsinghua University |
Keywords: Reactive and Sensor-Based Planning, Semantic Scene Understanding, Intelligent Transportation Systems
Abstract: Online mapping and end-to-end (E2E) planning in autonomous driving are still largely sensor-centric, leaving rich map priors—HD/SD vector maps, rasterized SD maps, and satellite imagery—underused due to heterogeneity, pose drift, and inconsistent availability at test time. We present UMPE, a Unified Map Prior Encoder that can ingest any subset of four priors and fuse them with BEV features for both mapping and planning. UMPE has two branches. The vector encoder pre-aligns HD/SD polylines with a frame-wise SE(2) correction, encodes points via multi-frequency sinusoidal features, and produces polyline tokens with confidence scores. BEV queries then apply cross-attention with confidence bias, followed by normalized channel-wise gating to avoid length imbalance and to softly down-weight uncertain sources. The raster encoder shares a ResNet-18 backbone conditioned by FiLM (scaling/shift at every stage), performs SE(2) micro-alignment, and injects priors through zero-initialized residual fusion so the network starts from a do-no-harm baseline and learns to add only useful prior evidence. A vector-then-raster fusion order reflects the inductive bias of “geometry first, appearance second.” On nuScenes mapping, UMPE lifts MapTRv2 from 61.5 → 67.4 mAP (+5.9) and MapQR from 66.4 → 71.7 mAP (+5.3). On Argoverse2, UMPE adds +4.1 mAP over strong baselines. UMPE is compositional: when trained with all priors, it outperforms single-prior models even when only one prior is available at test time, demonstrating powerset robustness. For E2E planning (VAD backbone, nuScenes), UMPE reduces trajectory error from 0.72 → 0.42 m L2 (avg. −0.30 m) and collision rate from 0.22% → 0.12% (−0.10%), surpassing recent prior-injection methods. These results show that a unified, alignment-aware treatment of heterogeneous map priors yields better mapping and better planning.
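Two of the ingredients, FiLM conditioning and zero-initialized residual fusion, are easy to illustrate. The NumPy sketch below (with made-up shapes, not the paper's architecture) shows why zero-initialization yields a "do-no-harm" starting point: the fused features initially equal the unconditioned ones, so the prior can only help once training moves the residual weights away from zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def film(x, gamma, beta):
    """Feature-wise linear modulation: per-channel scale and shift driven by a
    conditioning signal (here, a map-prior embedding)."""
    return gamma[None, :] * x + beta[None, :]

x = rng.normal(size=(64, 32))          # (spatial, channels) BEV-like features
prior_embed = rng.normal(size=(16,))   # embedding of the map prior

W_g = rng.normal(size=(16, 32)) * 0.01
W_b = rng.normal(size=(16, 32)) * 0.01
gamma = 1.0 + prior_embed @ W_g        # scales near 1
beta = prior_embed @ W_b               # shifts near 0

W_res = np.zeros((32, 32))             # zero-init residual projection
fused = x + film(x, gamma, beta) @ W_res
print(np.allclose(fused, x))           # True until W_res is trained away from zero
```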
| 09:00-10:30, Paper WeI1I.27 |
| Motion Manifold Flow Primitives for Task-Conditioned Trajectory Generation under Complex Task-Motion Dependencies |
| Lee, Yonghyeon | Massachusetts Institute of Technology |
| Lee, Byeongho | Samsung Electronics |
| Kim, Seungyeon | Seoul National University |
| Park, Frank | Seoul National University |
Keywords: Learning from Demonstration, Imitation Learning, Deep Learning Methods
Abstract: Effective movement primitives should be capable of encoding and generating a rich repertoire of trajectories -- typically collected from human demonstrations -- conditioned on task-defining parameters such as vision or language inputs. While recent methods based on the motion manifold hypothesis, which assumes that a set of trajectories lies on a lower-dimensional nonlinear subspace, address challenges such as limited dataset size and the high dimensionality of trajectory data, they often struggle to capture complex task-motion dependencies, i.e., when motion distributions shift drastically with task variations. To address this, we introduce Motion Manifold Flow Primitives (MMFP), a framework that decouples the training of the motion manifold from task-conditioned distributions. Specifically, we employ flow matching models, state-of-the-art conditional deep generative models, to learn task-conditioned distributions in the latent coordinate space of the learned motion manifold. Experiments are conducted on language-guided trajectory generation tasks, where many-to-many text-motion correspondences introduce complex task-motion dependencies, highlighting MMFP's superiority over existing methods.
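The flow-matching objective in the latent space can be sketched with the standard linear-interpolation (rectified-flow) construction: sample a point on the straight path between noise and data, and regress the model's velocity onto the path's constant velocity. The toy velocity field and shapes below are stand-ins for MMFP's actual networks.

```python
import numpy as np

def flow_matching_loss(v_theta, z0, z1, task_emb, rng):
    """Conditional flow-matching loss on the motion manifold's latent
    coordinates (linear path, as in rectified flow).

    `v_theta(z_t, t, c)` is the learned velocity field; all shapes are toy.
    """
    t = rng.uniform(size=(z0.shape[0], 1))
    z_t = (1.0 - t) * z0 + t * z1            # sample along the straight path
    target_v = z1 - z0                       # its constant velocity
    pred = v_theta(z_t, t, task_emb)
    return float(np.mean((pred - target_v) ** 2))

rng = np.random.default_rng(0)
z0 = rng.normal(size=(8, 4))                 # latent noise samples
z1 = rng.normal(size=(8, 4))                 # latent codes of demo trajectories
c = rng.normal(size=(8, 16))                 # e.g. text-derived task embedding
print(flow_matching_loss(lambda z, t, c: np.zeros_like(z), z0, z1, c, rng))
```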
| 09:00-10:30, Paper WeI1I.28 |
| Design of a Hydraulic-Driven Adaptive Gripper with a Novel Actuation Mechanism |
| Jeong, Yonghwan | University of Science and Technology (UST) |
| Kim, Jungyeong | Korea Institute of Industrial Technology (KITECH) |
| Han, SangChul | Korea Institute of Industrial Technology |
| Yoon, Sungwoon | Sungkyunkwan University |
| Lee, Sungho | Sungkyunkwan University |
| Park, Sangshin | Korea Institute of Industrial Technology |
| Kim, Jin Tak | KITECH (Korea Institute of Industrial Technology) |
| Kim, Jinhyeon | Korea Institute of Industrial Technology (KITECH) |
| Cho, Jungsan | KITECH (Korea Institute of Industrial Technology) |
Keywords: Grippers and Other End-Effectors, Mechanism Design, Hydraulic/Pneumatic Actuators
Abstract: This paper presents a novel hydraulic-driven two-finger robotic gripper designed to handle objects of various shapes and sizes. To meet the demands of field robotics and heavy industrial environments, a self-adaptive finger mechanism was integrated with hydraulic actuation. However, this integration leads to increased structural volume, as hydraulics produce linear motion and require additional hydraulic components. Additionally, precise force control becomes challenging, as harsh environments limit the use of other sensing devices for fine control. These issues are addressed by employing an offset slider-crank mechanism, which efficiently converts linear motion into rotational motion. Additionally, a newly designed double-acting bi-piston cylinder allows both fingers to operate using a single cylinder, reducing the number of hydraulic components. To enable pressure-based force control, kinematic and static analyses of the mechanism were conducted. A prototype was developed, analyzed, and experimentally validated for its grasping performance. It demonstrated high performance in lifting heavy objects, such as an 18 kg tire, and delicately handling fragile items like eggs and paper cups.
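The kinematics of an offset slider-crank are standard and show how a linear hydraulic stroke maps to a rotation. A sketch with assumed dimensions follows; the paper derives the full kinematic and static analysis needed for pressure-based force control, so this is only the textbook relation.

```python
import numpy as np

def slider_position(theta, r, l, e):
    """Offset slider-crank kinematics: slider position as a function of crank
    angle theta, for crank radius r, connecting-rod length l, and offset e."""
    return r * np.cos(theta) + np.sqrt(l ** 2 - (r * np.sin(theta) - e) ** 2)

r, l, e = 0.02, 0.06, 0.005           # metres (illustrative values)
for deg in (0, 45, 90, 135):
    x = slider_position(np.deg2rad(deg), r, l, e)
    print(f"crank {deg:3d} deg -> slider at {1000 * x:.2f} mm")
```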
| 09:00-10:30, Paper WeI1I.29 |
| A Perception-Based Architecture for Autonomous Convoying in GNSS-Denied Areas (I) |
| Bienemann, Alexander | Universität Der Bundeswehr München |
| Beer, Lukas | University of the Bundeswehr, München |
| Reich, Andreas | Universität Der Bundeswehr München |
| Steinecker, Thomas | Universität Der Bundeswehr München |
| Forkel, Bianca | Universität Der Bundeswehr München |
| Backhaus, Anton | University of the Bundeswehr Munich |
| Berthold, Philipp | Bundeswehr University Munich |
| González González, Juan David | University of the Bundeswehr Munich |
| Mortimer, Peter | Universität Der Bundeswehr München |
| Luettel, Thorsten | University of the Bundeswehr Munich |
| Maehlisch, Mirko | University of the Bundeswehr Munich |
Keywords: Field Robots
Abstract: In this article, we present a perception-based full-stack system for autonomous vehicle following that does not rely on accurate global localization or map data. Our architecture consists of modules for vehicle communication, localization, object tracking, waypoint management, static environment modeling, trajectory planning, and control, all of which are covered in the article. To test our system, we conducted several practical experiments in various scenarios on our two autonomous vehicles. Those experiments include the handling of static and dynamic obstacles, driving on- and off-road under different light and weather conditions with distances between the vehicles ranging from 5 m to 100 m and with speeds of up to 20 m/s. Furthermore, we showcased our system’s performance during the 12th European Land Robot Trial 2024, where our institute participated in the convoying scenario. The tests from the trial and our own experiments showed satisfactory results. Our system achieves high path-following accuracy and is able to cope with various challenging scenarios.
| 09:00-10:30, Paper WeI1I.30 |
| Dynamic Movement Primitives with Control Barrier Functions for Constrained Trajectory Planning |
| Vesentini, Federico | University of Verona |
| Meli, Daniele | University of Verona |
| Sansonetto, Nicola | University of Verona |
| Di Persio, Luca | University of Verona |
| Muradore, Riccardo | University of Verona |
Keywords: Imitation Learning, Constrained Motion Planning, Robot Safety
Abstract: Dynamic Movement Primitives (DMPs) form a robust framework for trajectory generation based on imitation learning, aiming to closely replicate the shape of reference trajectories from demonstrations. DMPs have been extensively employed for trajectory planning in robotic systems. However, they cannot guarantee satisfaction of complex nonlinear constraints, which is essential at the control level. On the other hand, Control Barrier Functions (CBFs) are used to modulate the input of control-affine dynamic systems subject to state-dependent constraints, guaranteeing that the system remains within predefined safe sets while converging towards target states. This letter proposes Constrained Movement Primitives (CMPs), a novel framework that integrates DMPs with CBFs to generate safe-by-construction trajectories subject to nonlinear constraints. We represent DMPs in control-affine form and combine them with the closed-form input provided by CBFs, overcoming the limitations of existing iterative optimisation methods for constrained DMPs. We demonstrate that CBFs preserve the goal convergence guarantees of DMPs. Moreover, we validate our approach in simulation and on a real mobile robot subject to nonlinear kinodynamic constraints, concerning maximum Cartesian velocity, obstacle avoidance, and maximum centrifugal acceleration to avoid slipping on curved trajectories.
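In one dimension the construction reduces to clamping the DMP's velocity by a closed-form CBF bound: for the safe set {y <= y_max} with barrier h(y) = y_max - y, enforcing dh/dt >= -alpha_h h caps the velocity at alpha_h (y_max - y). A toy sketch of this idea follows (forcing term omitted, explicit Euler integration, and the goal deliberately placed outside the safe set only to make the clamp visible; none of this reproduces the paper's exact formulation).

```python
import numpy as np

def dmp_cbf_rollout(y0, g, y_max, T=1.0, n=200,
                    alpha_z=25.0, beta_z=6.25, alpha_h=10.0):
    """Roll out a 1-D DMP whose velocity passes through a closed-form CBF clamp.

    Safe set {y <= y_max}, barrier h(y) = y_max - y; the condition
    dh/dt >= -alpha_h * h yields the cap ydot <= alpha_h * (y_max - y).
    """
    dt = T / n
    y, z = y0, 0.0
    traj = [y]
    for _ in range(n):
        zdot = alpha_z * (beta_z * (g - y) - z)   # DMP transformation system
        ydot = min(z, alpha_h * (y_max - y))      # closed-form CBF velocity clamp
        y, z = y + ydot * dt, z + zdot * dt
        traj.append(y)
    return np.array(traj)

traj = dmp_cbf_rollout(y0=0.0, g=1.0, y_max=0.8)
print(traj.max() <= 0.8 + 1e-9, round(traj[-1], 3))  # stays safe, settles near 0.8
```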
| 09:00-10:30, Paper WeI1I.31 |
| COSMO-Bench: A Benchmark for Collaborative SLAM Optimization |
| McGann, Daniel | Carnegie Mellon University |
| Potokar, Easton | Carnegie Mellon University |
| Kaess, Michael | Carnegie Mellon University |
Keywords: Multi-Robot SLAM, SLAM, Multi-Robot Systems
Abstract: Recent years have seen a focus on research into distributed optimization algorithms for multi-robot Collaborative Simultaneous Localization and Mapping (C-SLAM). Research in this domain, however, is made difficult by a lack of standard benchmark datasets. Such datasets have been used to great effect in the field of single-robot SLAM, and researchers focused on multi-robot problems would benefit greatly from dedicated benchmark datasets. To address this gap, we design and release the Collaborative Open-Source Multi-robot Optimization Benchmark (COSMO-Bench) -- a suite of 24 datasets derived from a baseline C-SLAM front-end and real-world LiDAR data.
| 09:00-10:30, Paper WeI1I.32 |
| B*: Efficient and Optimal Base Placement for Fixed-Base Manipulators |
| Zhao, Zihang | Peking University |
| Cui, Leiyao | University of Chinese Academy of Sciences |
| Xie, Sirui | Institute for Artificial Intelligence, Peking University |
| Zhang, Saiyao | University of Chinese Academy of Sciences |
| Han, Zhi | Shenyang Institute of Automation, Chinese Academy of Sciences |
| Ruan, Lecheng | Peking University |
| Zhu, Yixin | Peking University |
Keywords: Industrial Robots, Factory Automation, Intelligent and Flexible Manufacturing
Abstract: B* is a novel optimization framework that addresses a critical challenge in fixed-base manipulator robotics: optimal base placement. Current methods rely on pre-computed kinematics databases generated through sampling to search for solutions. However, they face an inherent trade-off between solution optimality and computational efficiency when determining sampling resolution. To address these limitations, B* unifies multiple objectives without database dependence. The framework employs a two-layer hierarchical approach. The outer layer systematically manages terminal constraints through progressive tightening, particularly for base mobility, enabling feasible initialization and broad solution exploration. The inner layer addresses non-convexities in each outer-layer subproblem through sequential local linearization, converting the original problem into tractable sequential linear programming (SLP). Testing across multiple robot platforms demonstrates B*'s effectiveness. The framework achieves solution optimality five orders of magnitude better than sampling-based approaches while maintaining perfect success rates and reduced computational overhead. Operating directly in configuration space, B* enables simultaneous path planning with customizable optimization criteria. B* serves as a crucial initialization tool that bridges the gap between theoretical motion planning and practical deployment, where feasible trajectory existence is fundamental.
| 09:00-10:30, Paper WeI1I.33 |
| Enhancing Reusability of Learned Skills for Robot Manipulation Via Gaze Information and Motion Bottlenecks |
| Takizawa, Ryo | The University of Tokyo |
| Karino, Izumi | The University of Tokyo |
| Nakagawa, Koki | The University of Tokyo |
| Ohmura, Yoshiyuki | The University of Tokyo |
| Kuniyoshi, Yasuo | The University of Tokyo |
Keywords: Imitation Learning, Perception for Grasping and Manipulation, Dual Arm Manipulation
Abstract: Autonomous agents capable of diverse object manipulations should be able to acquire a wide range of manipulation skills with high reusability. Although advances in deep learning have made it increasingly feasible to replicate the dexterity of human teleoperation in robots, generalizing these acquired skills to previously unseen scenarios remains a significant challenge. In this study, we propose a novel algorithm, Gaze-based Bottleneck-aware Robot Manipulation (GazeBot), which enables high reusability of learned motions without sacrificing dexterity or reactivity. By leveraging gaze information and motion bottlenecks, both crucial features for object manipulation, GazeBot achieves high success rates compared with state-of-the-art imitation learning methods, particularly when the object positions and end-effector poses differ from those in the provided demonstrations. Furthermore, the training process of GazeBot is entirely data-driven once a demonstration dataset with gaze data is provided.
|
| |
| 09:00-10:30, Paper WeI1I.34 | Add to My Program |
| SCDCE-3D: Soft-Weighted Covariance and Dual-Branch Channel Enhancement for 3D Place Recognition in Complex Orchard Environments |
|
| Tan, Yuping | Xi'an University of Technology |
| Zhao, Chunjiang | Beijing Research Center of Intelligent Equipment for Agriculture |
| Zhao, Qin | Xi'an University of Technology |
| Hei, Xinhong | Xi'an University of Technology |
| Song, Xiaogang | Xi'an University of Technology |
Keywords: Deep Learning for Visual Perception, Localization, Recognition
Abstract: Recent progress in 3D place recognition has delivered strong results in urban and indoor scenarios, but orchards remain largely unexplored. In these environments, unreliable or absent GNSS signals necessitate LiDAR-based place recognition for robust long-term localization, yet challenges such as ill-defined geometry, semi-transparent foliage, and severe inter-/intra-row overlaps cause high structural ambiguity. To address these challenges, we propose SCDCE-3D, a novel framework that integrates soft-weighted covariance representation with dual-branch channel enhancement. The soft-weighted covariance module adaptively down-weights noisy or overlapping points using a sigmoid-based weighting strategy, enabling robust second-order statistical representation that suppresses cross-row interference. In parallel, a dual-branch backbone extracts complementary global and local features, which drive a dynamic channel enhancement mechanism to emphasize discriminative feature channels while suppressing redundancy. Furthermore, multi-level triplet learning is applied not only to the final descriptor but also to intermediate statistical features, reinforcing robustness against structural ambiguity. Experiments on orchard-based LiDAR datasets demonstrate that SCDCE-3D significantly outperforms state-of-the-art methods in both recall and robustness, offering a reliable solution for long-term 3D place recognition in agricultural robotics. Code is available at https://github.com/typist2001/SCDCE-3D.
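The core statistical idea, soft-weighted second-order pooling, can be sketched compactly; the snippet below computes a covariance descriptor with sigmoid point weights. The per-point reliability score is a placeholder input, and the paper's actual weighting cues may differ.

```python
# Soft-weighted covariance: noisy or overlapping points are down-weighted
# via a sigmoid before forming second-order statistics.
import numpy as np

def soft_weighted_covariance(points, reliability, alpha=1.0):
    """points: (N, D) array; reliability: (N,) raw scores, higher = cleaner."""
    w = 1.0 / (1.0 + np.exp(-alpha * reliability))   # sigmoid soft weights
    w = w / w.sum()                                  # normalize to sum to 1
    mu = w @ points                                  # weighted mean, (D,)
    centered = points - mu
    # Weighted covariance: sum_i w_i (x_i - mu)(x_i - mu)^T
    return centered.T @ (centered * w[:, None])

pts = np.random.randn(1000, 3)
rel = np.random.randn(1000)          # e.g., negative for cross-row points
print(soft_weighted_covariance(pts, rel).shape)   # (3, 3)
```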
|
| |
| 09:00-10:30, Paper WeI1I.35 | Add to My Program |
| UVDtact: UV Marker-Embedded Fingertip-Like Vision-Based Tactile Sensor for Shape Reconstruction and Force Estimation |
|
| Kim, Woojong | KAIST |
| Kim, Won Dong | Samsung Electronics Co., Ltd |
| Park, Hyunkyu | Samsung Advanced Institute of Technology |
| Lee, Joonho | Korea Institute of Machinery & Materials (KIMM) |
| Kim, Jeong-Jung | Korea Institute of Machinery and Materials (KIMM) |
| Kim, Jung | KAIST |
Keywords: Perception for Grasping and Manipulation, Force and Tactile Sensing, Soft Sensors and Actuators
Abstract: Vision-based tactile sensors are highly promising for enabling robots to perform dexterous, contact-rich manipulation tasks by providing high-resolution tactile data. Recent studies have attempted to implement shape reconstruction and force estimation capabilities for sensors with omnidirectional sensing surfaces and a compact form factor. However, achieving a small diameter comparable to that of a human fingertip remains challenging, and integrating multiple functionalities within the fingertip form factor is similarly difficult. In this study, we present UVDtact, a vision-based tactile sensor with a fingertip-like form factor that incorporates a switchable translucent elastomer. The proposed switchable translucent elastomer, which integrates ultraviolet (UV) ink and a translucent elastomer, decouples the tactile images used for shape reconstruction and force estimation. The decoupled tactile images keep shape reconstruction unaffected by the UV markers, which are made visible only when needed, thereby enabling effective force estimation. For shape reconstruction, we leverage the darkening effect of the translucent elastomer in response to tactile stimuli and introduce a calibration method that utilizes this effect in an all-around curved sensor configuration. Furthermore, we validate that embedding UV markers enhances tactile features, improving force estimation performance while preserving the quality of tactile images used for shape reconstruction. By integrating various tactile sensing capabilities into a compact, fingertip-like design, UVDtact contributes to developing robotic systems with human-like dexterity.
|
| |
| 09:00-10:30, Paper WeI1I.36 | Add to My Program |
| A Multi-Level Similarity Approach for Single-View Object Grasping: Matching, Planning, and Fine-Tuning |
|
| Chen, Hao | The University of Osaka |
| Kiyokawa, Takuya | The University of Osaka |
| Hu, Zhengtao | Shanghai University |
| Wan, Weiwei | The University of Osaka |
| Harada, Kensuke | The University of Osaka |
Keywords: Grasping, Dexterous Manipulation, Computer Vision for Automation, Similarity Matching
Abstract: Grasping unknown objects from a single view has remained a challenging topic in robotics due to the uncertainty of partial observation. Recent advances in large-scale models have led to benchmark solutions such as GraspNet-1Billion. However, such learning-based approaches still face a critical limitation in robustness due to their sensitivity to sensing noise and environmental changes. To address this bottleneck in achieving highly generalized grasping, we abandon the traditional learning framework and introduce a new perspective: similarity matching, where similar known objects are utilized to guide the grasping of unknown target objects. We propose a method that robustly achieves unknown-object grasping from a single viewpoint through three key steps: 1) Leverage the visual features of the observed object to perform similarity matching with an existing database containing various object models, identifying potential candidates with high similarity; 2) Use the candidate models with pre-existing grasping knowledge to plan imitative grasps for the unknown target object; 3) Optimize the grasp quality through a local fine-tuning process. To address the uncertainty caused by partial and noisy observation, we propose a multi-level similarity matching framework that integrates semantic, geometric, and dimensional features for comprehensive evaluation. In particular, we introduce a novel point cloud geometric descriptor, the C-FPFH descriptor, which facilitates accurate similarity assessment between partial point clouds of observed objects and complete point clouds of database models. In addition, we incorporate large language models, introduce the semi-oriented bounding box, and develop a novel point cloud registration approach based on plane detection to enhance matching accuracy under single-view conditions.
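To make the multi-level idea concrete, the sketch below ranks database models by a weighted blend of semantic, geometric, and dimensional similarity. The feature vectors, dimensions, and weights are illustrative placeholders; the paper's actual pipeline relies on C-FPFH descriptors and additional matching machinery.

```python
# Rank database models by a weighted multi-level similarity score.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def multi_level_score(query, model, w=(0.4, 0.4, 0.2)):
    sem = cosine(query["semantic"], model["semantic"])       # semantic embedding
    geo = cosine(query["descriptor"], model["descriptor"])   # geometric descriptor
    dim = min(query["size"], model["size"]) / max(query["size"], model["size"])
    return w[0] * sem + w[1] * geo + w[2] * dim

db = [{"semantic": np.random.randn(64), "descriptor": np.random.randn(33),
       "size": s} for s in (0.08, 0.12, 0.25)]
q = {"semantic": np.random.randn(64), "descriptor": np.random.randn(33),
     "size": 0.1}
best = max(db, key=lambda m: multi_level_score(q, m))   # top candidate model
print(multi_level_score(q, best))
```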
|
| |
| 09:00-10:30, Paper WeI1I.37 | Add to My Program |
| SIGMA: An Agent-Based Modeling UAV Swarm Simulator for Swarm Intelligence Algorithms (I) |
|
| Zhang, Sheng | Beijing Institute of Technology |
| Li, Juan | Beijing Institute of Technology |
| Liu, Chang | BIT |
| Fu, Lei | Beijing Institute of Technology |
| Bai, Zehao | Beijing Institute of Technology |
| Li, Jie | Beijing Institute of Technology |
Keywords: Simulation and Animation, Agent-Based Systems, Swarm Robotics
Abstract: Swarm intelligence for uncrewed aerial vehicles (UAVs) significantly improves the success rate of executing intricate tasks using “distributed platforms and aggregated effects”. However, high experimental costs and safety risks constrain its development. This paper introduces SIGMA (Swarm Intelligence Generic simulator for Multi-UAVs), a high-fidelity distributed UAV swarm simulator for swarm intelligence algorithms. As an agent-based modeling simulator (ABMS), SIGMA has three key innovations: First, an automatic model tuning method improves aircraft dynamics fidelity. Second, a bidirectional discrete-event simulation (BiDES) architecture resolves the time alignment challenges in distributed systems. Third, a multiagent learning toolbox ensures algorithm compatibility via an episodic training structure and a memory replay mechanism. The fidelity and scalability of the simulator are verified through quantitative simulations and experiments, and several successful applications demonstrate its practicality.
|
| |
| 09:00-10:30, Paper WeI1I.38 | Add to My Program |
| A Single-Stage Spectrum-Domain Network for Trajectory Prediction |
|
| Xia, Beihao | Huazhong University of Science and Technology |
| Peng, Qinmu | Huazhong University of Science and Technology |
| You, Xinge | Huazhong University of Science and Technology |
Keywords: Intelligent Transportation Systems
Abstract: Trajectory prediction is a fundamental yet challenging task in intelligent systems. Existing methods are mainly categorized as single-stage time-domain, two-stage time-domain, or two-stage spectrum-domain approaches, while single-stage spectrum-domain methods have been relatively underexplored. In the frequency domain, low-frequency components reflect global motion trends, while high-frequency components capture fine-grained local variations. Most existing spectrum-domain approaches process these components independently, overlooking their intrinsic complementarity. Inspired by the success of bilinear models in explicitly capturing cross-factor interactions, we propose S^{3}-Net, a single-stage spectrum-domain trajectory prediction network with a bilinear fusion module that integrates low- and high-frequency dynamics. This design yields richer spectral representations and enables accurate, socially compliant, and multimodal predictions. Experiments on the ETH-UCY and Stanford Drone Datasets demonstrate that S^{3}-Net achieves up to 16.8%/15.1% ADE/FDE reduction over spectrum-domain baselines while maintaining a compact model size and low inference latency, making it suitable for real-time scenarios.
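As an illustration of the central mechanism, the sketch below splits a trajectory's spectrum into low- and high-frequency bands and fuses them with a bilinear layer so cross-band interactions are modeled explicitly. The band cutoff, tensor shapes, and module layout are assumptions for illustration, not the S^{3}-Net architecture.

```python
# Bilinear fusion of low- and high-frequency trajectory spectra.
import torch
import torch.nn as nn

class BilinearSpectrumFusion(nn.Module):
    def __init__(self, t_len, d_model, cutoff):
        super().__init__()
        self.cutoff = cutoff
        n_freq = t_len // 2 + 1                      # rfft output length
        self.fuse = nn.Bilinear(2 * n_freq, 2 * n_freq, d_model)

    def forward(self, traj):
        """traj: (B, T, 2) xy trajectory; returns (B, 2, d_model)."""
        spec = torch.fft.rfft(traj, dim=1)           # (B, F, 2) complex
        low, high = spec.clone(), spec.clone()
        low[:, self.cutoff:] = 0                     # keep global motion trend
        high[:, :self.cutoff] = 0                    # keep local variations

        def flat(z):                                 # real/imag -> feature dim
            return torch.cat([z.real, z.imag], dim=1).transpose(1, 2)

        return self.fuse(flat(low), flat(high))      # cross-band interactions

m = BilinearSpectrumFusion(t_len=20, d_model=64, cutoff=3)
print(m(torch.randn(8, 20, 2)).shape)   # torch.Size([8, 2, 64])
```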
|
| |
| 09:00-10:30, Paper WeI1I.39 | Add to My Program |
| Energy-Shaping Controller for Time-Invariant Multiple Contacts |
|
| Hasan, Ehtisham ul | Free University of Bozen Bolzano |
| Peer, Angelika | Free University of Bolzano |
Keywords: Physical Human-Robot Interaction, Safety in HRI
Abstract: While in the past industrial robots were strictly separated from humans, today robots serve humans in a variety of industrial applications that also involve close or even physical human-robot interaction. Safety is therefore of utmost importance, and the control system must be designed to ensure stable and safe operation. In this context, safety has mainly been addressed for single interaction points. In this article, we present an energy shaping controller that is capable of ensuring safety even in the case of multiple human contact points that may occur when co-manipulating an object. The presented approach is tested and validated in experiments. Results indicate that for the studied co-manipulation task involving time-invariant multiple human contacts, a safe interaction can be achieved.
|
| |
| 09:00-10:30, Paper WeI1I.40 | Add to My Program |
| A Blockchain Framework for Equitable and Secure Task Allocation in Robot Swarms |
|
| Zhao, Hanqing | Université Laval |
| Pacheco, Alexandre | Université Libre De Bruxelles |
| Beltrame, Giovanni | Ecole Polytechnique De Montreal |
| Liu, Xue | McGill University |
| Dorigo, Marco | Université Libre De Bruxelles |
| Dudek, Gregory | McGill University |
Keywords: Swarm Robotics, Failure Detection and Recovery, Robust/Adaptive Control
Abstract: Recent studies demonstrate the potential of blockchain to enable robots in a swarm to achieve secure consensus about the environment, particularly when robots are homogeneous and perform identical tasks. Typically, robots receive rewards for their contributions to consensus achievement, but no studies have yet targeted heterogeneous swarms, in which the robots have distinct physical capabilities suited to different tasks. We present a novel framework that leverages domain knowledge to decompose the swarm mission into a hierarchy of tasks within smart contracts. This allows the robots to reach a consensus about both the environment and the action plan, allocating tasks among robots with diverse capabilities to improve their performance while maintaining security against faults and malicious behaviors. We refer to this concept as equitable and secure task allocation. Validated in Simultaneous Localization and Mapping missions, our approach not only achieves equitable task allocation among robots with varying capabilities, improving mapping accuracy and efficiency, but also shows resilience against malicious attacks.
|
| |
| 09:00-10:30, Paper WeI1I.41 | Add to My Program |
| AI-Driven Landing Zone Detection Module for Vertical Take-Off and Landing Vehicles Using Projection-Based LiDAR-Navigation Pipelines (I) |
|
| Herath, Nirasha | Memorial University of Newfoundland |
| De Silva, Oscar | Memorial University of Newfoundland |
| Mann, George K. I. | Memorial University of Newfoundland |
| Jayasiri, Awantha | National Research Council |
Keywords: Autonomous Vehicle Navigation, Semantic Scene Understanding, Deep Learning Methods
Abstract: This paper introduces an artificial intelligence-based landing zone detection module (LZDM) for vertical take-off and landing (VTOL) navigation. It employs a projection-based point cloud semantic segmentation (PCSS) convolutional neural network model combined with point cloud accumulation and a range image generation module. The proposed method addresses the limitations of existing projection-based PCSS methods, which often struggle with the low-resolution, non-repetitive-scan raw light detection and ranging (LiDAR) data commonly found in aerial datasets. The proposed LZDM was developed using three sets of aerial datasets collected from a DJI M600 hexacopter drone, a DJI M300 RTK quadrotor, and a Bell412 helicopter. The results were evaluated using both qualitative and quantitative metrics, demonstrating its robustness and effectiveness. In terms of quantitative results, the proposed method achieved mean intersection over union and accuracy values greater than 0.93 and 98 percent, respectively, across all three datasets, highlighting its accuracy in identifying safe landing zones (LZs). To assess the real-time feasibility of the proposed LZDM, it was deployed on a reconfigurable hardware-accelerated module. This setup achieved processing rates higher than 10 Hz for all three datasets and a throughput of over 5 million pts/s on dedicated Jetson AGX Xavier hardware combined with the PyTorch TensorRT optimization module.
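The projection front-end common to such PCSS pipelines can be sketched in a few lines: each LiDAR return is mapped to a pixel by its azimuth and elevation, producing a dense range image for the CNN. The image size and vertical field of view below are generic assumptions, not the paper's sensor configuration.

```python
# Spherical projection of a raw LiDAR scan into a range image.
import numpy as np

def to_range_image(points, h=64, w=1024, fov_up=15.0, fov_down=-25.0):
    """points: (N, 3). Returns an (h, w) range image (0 where no return)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8
    yaw = np.arctan2(y, x)                       # azimuth in [-pi, pi]
    pitch = np.arcsin(z / r)                     # elevation
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    u = 0.5 * (1.0 - yaw / np.pi) * w            # column from azimuth
    v = (fov_up - pitch) / (fov_up - fov_down) * h
    u = np.clip(u.astype(int), 0, w - 1)
    v = np.clip(v.astype(int), 0, h - 1)
    img = np.zeros((h, w), dtype=np.float32)
    order = np.argsort(-r)                       # nearer returns overwrite farther
    img[v[order], u[order]] = r[order]
    return img

print(to_range_image(np.random.randn(10000, 3) * 10).shape)   # (64, 1024)
```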
|
| |
| 09:00-10:30, Paper WeI1I.42 | Add to My Program |
| Follow Everything: Goal-Aware Adaptation and Graph-Based Planner for Arbitrary Leader Following |
|
| Zhang, Qianyi | Nankai University |
| Ma, Shijian | University of Macau |
| Liu, Boyi | Hong Kong University of Science and Technology |
| Liu, Jingtai | Nankai University |
| Jiao, Jianhao | University College London; the Hong Kong Polytechnic University |
| Kanoulas, Dimitrios | University College London |
Keywords: Motion and Path Planning, Human-Aware Motion Planning
Abstract: Enabling robots to robustly follow leaders supports tasks such as carrying supplies or guiding customers. While existing methods often fail to generalize to arbitrary leaders, and struggle when the leader temporarily leaves the robot’s field of view, this work presents a unified framework to address both challenges. First, a segmentation model replaces traditional category-specific detection models, allowing the leader to be of any shape or type. To improve robustness, a distance frame buffer is designed to store high-confidence leader embeddings across distance intervals, accounting for the unique characteristics of leader-following tasks. Second, a goal-aware adaptation mechanism is designed to govern robot planning states based on the leader's visibility and motion, complemented by a graph-based planner that generates candidate trajectories for each state, ensuring efficient following with obstacle avoidance. Simulations and real-world experiments with a legged robot follower and diverse leaders in indoor and outdoor settings demonstrate the highest follow success rate of 96.9%, the lowest visual loss of 10.7%, the lowest collision rate of 1.8%, and the shortest leader-follower distance of 2.0 m. Visit follow-everything.github.io for more details.
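The distance frame buffer concept lends itself to a small sketch: one high-confidence leader embedding is retained per distance interval, so the follower can re-identify the leader at whatever range it reappears. The interval edges, confidence threshold, and cosine matching are illustrative assumptions.

```python
# One reference embedding per distance interval for leader re-identification.
import numpy as np

class DistanceFrameBuffer:
    def __init__(self, edges=(0.5, 1.5, 3.0, 6.0, 12.0), conf_thresh=0.8):
        self.edges = edges
        self.conf_thresh = conf_thresh
        self.slots = {}                  # interval index -> embedding

    def _interval(self, distance):
        return int(np.searchsorted(self.edges, distance))

    def update(self, embedding, distance, confidence):
        # Only store embeddings we are confident actually show the leader.
        if confidence >= self.conf_thresh:
            self.slots[self._interval(distance)] = np.asarray(embedding)

    def match(self, embedding):
        """Best cosine similarity against all stored references."""
        if not self.slots:
            return 0.0
        e = np.asarray(embedding)
        return max(
            float(e @ ref / (np.linalg.norm(e) * np.linalg.norm(ref) + 1e-8))
            for ref in self.slots.values()
        )

buf = DistanceFrameBuffer()
buf.update(np.random.randn(256), distance=2.0, confidence=0.9)
print(buf.match(np.random.randn(256)))
```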
|
| |
| 09:00-10:30, Paper WeI1I.43 | Add to My Program |
| SA-TP2: A Safety-Aware Trajectory Prediction and Planning Model for Autonomous Driving |
|
| Liao, Haicheng | University of Macau |
| Li, Zhenning | University of Macau |
| Zhu, Kaiqun | University of Macau |
| Li, Keqiang | Tsinghua University |
| Xu, Chengzhong | University of Macau |
| |
| 09:00-10:30, Paper WeI1I.44 | Add to My Program |
| Adaptive Capacity Allocation for Vision Language Action Fine-Tuning |
|
| Kim, Donghoon | Seoul National University |
| Bae, Minji | Seoul National University |
| Nam, Unghui | Seoul National University |
| Kim, Gyeonghun | Seoul National University |
| Lee, Suyun | Seoul National University |
| Shim, Kyuhong | Sungkyunkwan University |
| Shim, Byonghyo | Seoul National University |
Keywords: Deep Learning Methods, Transfer Learning, Machine Learning for Robot Control
Abstract: Vision language action models (VLAs) are increasingly used for Physical AI, but deploying a pre-trained VLA model to unseen environments, embodiments, or tasks still requires adaptation. Parameter-efficient fine-tuning (PEFT), especially LoRA, is common for VLA policies, yet the exposed capacity knob, the rank, does not transfer uniformly: robotics transfer exhibits a higher and task-varying intrinsic rank than language fine-tuning. Small ranks suffice for LLMs (e.g., r ∈ {4, 8}), while spectral analyses indicate VLAs may require much larger ranks (e.g., r ≈ 128) or near–full rank, a mismatch that worsens in multi-task settings. We present LoRA-SP (Select–Prune), a rank-adaptive fine-tuning method that replaces fixed-rank updates with input- and layer-wise capacity. LoRA-SP uses an SVD-style parameterization with a small router whose nonnegative scores act as singular values over a shared vector bank. The active set is chosen by an energy target on the cumulative squared scores E(k) ≥ η, providing a direct link to approximation error via our spectral analysis. During training, η concentrates energy on a few directions and teaches the router to rely on fewer vectors while preserving accuracy. This yields compact adapters that reduce cross-task interference and improve generalization. On four real-robot manipulation tasks collected on an unseen AgileX PiPER arm, across two VLA backbones (π0 and SmolVLA), LoRA-SP matches or exceeds full fine-tuning with far fewer trainable parameters, and improves multi-task success by up to 31.6% over standard LoRA while remaining robust to rank choice.
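A rough sketch of the rank-adaptive mechanism: an SVD-style adapter whose router produces nonnegative scores acting as singular values, with the active rank chosen as the smallest k whose cumulative squared-score energy reaches η. Module shapes, the router form, and the masking scheme are assumptions for illustration, not the published LoRA-SP method.

```python
# SVD-style adapter with an energy-gated active rank, E(k) >= eta.
import torch
import torch.nn as nn

class EnergyGatedAdapter(nn.Module):
    def __init__(self, d_in, d_out, max_rank=128, eta=0.95):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d_out, max_rank) * 0.02)
        self.V = nn.Parameter(torch.randn(max_rank, d_in) * 0.02)
        self.router = nn.Linear(d_in, max_rank)   # scores act as singular values
        self.eta = eta

    def forward(self, x):
        """x: (B, d_in) -> low-rank update (B, d_out)."""
        s = torch.relu(self.router(x.mean(dim=0)))        # nonnegative scores
        vals, idx = torch.sort(s, descending=True)
        energy = torch.cumsum(vals**2, dim=0) / (vals.pow(2).sum() + 1e-8)
        k = int((energy < self.eta).sum().item()) + 1     # smallest k: E(k) >= eta
        mask = torch.zeros_like(s)
        mask[idx[:k]] = 1.0                               # keep top-k directions
        return ((x @ self.V.T) * (s * mask)) @ self.U.T

ada = EnergyGatedAdapter(d_in=512, d_out=512)
print(ada(torch.randn(4, 512)).shape)   # torch.Size([4, 512])
```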
|
| |
| 09:00-10:30, Paper WeI1I.45 | Add to My Program |
| Learning Multiple Initial Solutions to Optimization Problems |
|
| Sharony, Elad | Technion |
| Yang, Heng | Harvard University |
| Che, Tong | NVIDIA |
| Pavone, Marco | Stanford University |
| Mannor, Shie | Technion |
| Karkus, Peter | NVIDIA |
Keywords: Optimization and Optimal Control, Imitation Learning, Autonomous Vehicle Navigation
Abstract: Sequentially solving similar optimization problems under strict runtime constraints is essential for many applications, such as robot control, autonomous driving, and portfolio management. The performance of local optimization methods in these settings is sensitive to the initial solution: poor initialization can lead to slow convergence or suboptimal solutions. To address this challenge, we propose learning to predict multiple diverse initial solutions given parameters that define the problem instance. We introduce two strategies for utilizing multiple initial solutions: (i) a single-optimizer approach, where the most promising initial solution is chosen using a selection function, and (ii) a multiple-optimizers approach, where several optimizers, potentially run in parallel, are each initialized with a different solution, with the best solution chosen afterward. Notably, by including a default initialization among the predicted ones, the cost of the final output is guaranteed to be equal to or lower than with the default initialization. We validate our method on three optimal control benchmark tasks: cart-pole, reacher, and autonomous driving, using different optimizers: DDP, MPPI, and iLQR. We find significant and consistent improvement with our method across all evaluation settings and demonstrate that it efficiently scales with the number of initial solutions required.
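The no-worse-than-default guarantee follows directly from the selection rule, as this toy sketch shows: run a local optimizer from every predicted initialization plus the default, then keep the lowest-cost result. The objective and optimizer below are generic stand-ins.

```python
# Best-of-many local optimization with a default-initialization fallback.
import numpy as np
from scipy.optimize import minimize

def cost(x):                               # toy multimodal objective
    return np.sum(x**2) + 2.0 * np.sin(3.0 * x[0])

def solve_with_inits(predicted_inits, default_init):
    candidates = list(predicted_inits) + [default_init]   # default included
    results = [minimize(cost, x0, method="BFGS") for x0 in candidates]
    return min(results, key=lambda r: r.fun)   # best local optimum found

preds = [np.array([1.0, -0.5]), np.array([-2.0, 0.3])]
best = solve_with_inits(preds, default_init=np.zeros(2))
print(best.x, best.fun)
```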
|
| |
| 09:00-10:30, Paper WeI1I.46 | Add to My Program |
| Passive Adaptive Object Prehension, Retention, and Release with a Mechanically Intelligent Gripper |
|
| Chan, Ming Chun | HKUST |
| Li, Jiayun | The Hong Kong University of Science and Technology |
| Wu, Ziyao | The Hong Kong University of Science and Technology (HKUST) |
| Wang, Nan | The Hong Kong University of Science and Technology, Hong Kong Center for Construction Robotics |
| Scharff, Rob B.N. | The Hong Kong University of Science and Technology |
|
|
| |
| 09:00-10:30, Paper WeI1I.47 | Add to My Program |
| Efficient Minimal Solvers for Visual-Inertial Relative Pose Estimation in Multi-Camera Systems |
|
| Li, Tao | Naval Aviation University |
| Yu, Zhenbao | National University of Defense Technology |
| Guan, Banglei | National University of Defense Technology |
| Han, Jianli | Naval Aviation University |
| Lv, Weimin | Naval Aviation University |
Keywords: Visual-Inertial SLAM, Vision-Based Navigation, Omnidirectional Vision
Abstract: Estimating the relative poses of multi-camera systems is a fundamental problem in computer vision, with critical applications in autonomous vehicles, mobile devices, and unmanned aerial vehicles (UAVs). However, existing solutions often suffer from high computational complexity or rely on an excessive number of point correspondences, limiting their real-world applicability. To address these limitations, we propose two efficient minimal solvers for estimating the relative poses of multi-camera systems using a novel parameterization. The first solver leverages the vertical direction prior provided by Inertial Measurement Units (IMUs), while the second utilizes the rotation axis direction prior from IMUs. Our methods require only four point correspondences and reduce the problem of multi-camera relative pose estimation to solving a univariate 6th-degree polynomial—a significant improvement over existing approaches, which typically involve 8th-degree polynomials. This reduction in computational complexity and correspondence requirements makes our solvers particularly effective when integrated into RANSAC frameworks, demonstrating strong potential for visual odometry applications. Through rigorous evaluations on synthetic data and the KITTI benchmark, our methods achieved superior computational efficiency and competitive accuracy compared to state-of-the-art algorithms.
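The final algebraic step of such minimal solvers reduces to finding the real roots of a univariate degree-6 polynomial, as in the sketch below. The coefficients here are arbitrary placeholders; in the actual solver they would come from the four point correspondences and the IMU direction prior.

```python
# Real roots of a degree-6 polynomial, the core numeric step of the solver.
import numpy as np

coeffs = np.array([1.0, -0.4, 2.1, 0.3, -1.7, 0.5, -0.2])  # placeholder values
roots = np.roots(coeffs)
real_roots = roots[np.abs(roots.imag) < 1e-9].real
# Each real root would be back-substituted to recover a candidate relative
# pose, then scored inside a RANSAC loop.
print(real_roots)
```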
|
| |
| 09:00-10:30, Paper WeI1I.48 | Add to My Program |
| SurfAAV: Design and Implementation of a Novel Multimodal Surfing Aquatic-Aerial Vehicle |
|
| Liu, Kun | National University of Defense Technology |
| Xiao, Junhao | National University of Defense Technology |
| Lin, Hao | National University of Defense Technology |
| Cao, Yue | National University of Defense Technology |
| Peng, Hui | Central South University |
| Huang, Kaihong | National University of Defense Technology |
| Lu, Huimin | National University of Defense Technology |
Keywords: Field Robots, Mechanism Design, Aerial Systems: Applications
Abstract: Despite significant advancements in the research of aquatic-aerial robots, existing configurations struggle to efficiently perform underwater, surface, and aerial movement. In this paper, we propose a novel multimodal surfing aquatic-aerial vehicle, SurfAAV, which efficiently integrates underwater navigation, surface gliding, and aerial flying capabilities. Thanks to the design of the novel differential thrust vectoring hydrofoil, SurfAAV can achieve efficient surface gliding and underwater navigation without the need for a buoyancy adjustment system. This design provides flexible operational capabilities for both surface and underwater tasks, enabling the robot to quickly carry out underwater monitoring activities. Additionally, when it is necessary to reach another water body, SurfAAV can switch to aerial mode through a gliding takeoff, flying to the target water area to perform corresponding tasks. The main contribution of this letter lies in proposing a new solution for underwater, surface, and aerial movement, designing a novel hybrid prototype concept, developing the required control laws, and validating the robot's ability to successfully perform surface gliding and gliding takeoff. SurfAAV achieves a maximum surface gliding speed of 7.96 m/s and a maximum underwater speed of 3.1 m/s, both exceeding the corresponding speeds of existing aquatic-aerial vehicles.
|
| |
| 09:00-10:30, Paper WeI1I.49 | Add to My Program |
| Flexible Trajectory Planning for Autonomous Vehicles Via Environmental Assessment in Extreme Scenarios |
|
| Li, Xiang | Harbin Institute of Technology, Shenzhen |
| Lin, Ke | Harbin Institute of Technology, Shenzhen |
| Yang, Xiaoqing | Harbin Institute of Technology, Shenzhen |
| Yan, Kejian | Harbin Institute of Technology, Shenzhen |
| Li, Yanjie | Harbin Institute of Technology, Shenzhen |
| Lou, Yunjiang | Harbin Institute of Technology, Shenzhen |
Keywords: Motion and Path Planning, Optimization and Optimal Control, Wheeled Robots
Abstract: Trajectory planning is a core task in autonomous driving. However, in diverse extreme scenarios characterized by unstructured obstacles, there is a lack of solutions that provide efficient computation, safety, and scene generalization capabilities. To address this issue, we propose a two-stage spatio-temporal joint trajectory planning method based on environmental assessment. In the first stage, we introduce the EAHybrid A* algorithm, which generates high-quality initial trajectories by evaluating environmental complexity, thereby significantly improving computational efficiency. The second stage formulates the trajectory planning problem as an optimal control problem, utilizing environmental assessment for joint spatio-temporal optimization, ensuring kinematic feasibility and obstacle avoidance. Experiments demonstrate that our method achieves higher success rates and planning speeds in extreme scenarios compared to state-of-the-art planning methods. Moreover, we have deployed and validated this approach in the CARLA simulator and on real vehicles, proving its effectiveness and robustness in handling extreme environments.
|
| |
| 09:00-10:30, Paper WeI1I.50 | Add to My Program |
| RENet: Fault-Tolerant Motion Control for Quadruped Robots Via Redundant Estimator Networks under Visual Collapse |
|
| Zhang, Yueqi | Fudan University |
| Qian, Quancheng | Fudan University |
| Hou, Taixian | Fudan University |
| Zhai, Peng | Fudan University |
| Wei, Xiaoyi | Fudan University |
| Hu, Kangmai | Fudan University |
| Yi, Jiafu | Hainan University |
| Zhang, Lihua | Fudan University |
Keywords: Legged Robots, Reinforcement Learning
Abstract: Vision-based locomotion in outdoor environments presents significant challenges for quadruped robots. Accurate environmental prediction and effective handling of depth sensor noise during real-world deployment remain difficult, severely restricting the outdoor applications of such algorithms. To address these deployment challenges in vision-based motion control, this letter proposes the Redundant Estimator Network (RENet) framework. The framework employs a dual-estimator architecture that ensures robust motion performance while maintaining deployment stability during onboard vision failures. Through online estimator adaptation, our method enables seamless transitions between estimation modules when handling visual perception uncertainties. Experimental validation on a real robot demonstrates the framework's effectiveness in complex outdoor environments, showing particular advantages in scenarios with degraded visual perception. This framework demonstrates its potential as a practical solution for reliable robotic deployment in challenging field conditions. Project website: https://RENet-Loco.github.io/
|
| |
| 09:00-10:30, Paper WeI1I.51 | Add to My Program |
| Compact Robotic Gripper with Tandem Actuation for Selective Apple Harvesting |
|
| Velasquez-Lopez, Alejandro | Oregon State University |
| Grimm, Cindy | Oregon State University |
| Davidson, Joseph | Oregon State University |
Keywords: Field Robots, Agricultural Automation, Grippers and Other End-Effectors
Abstract: One of the primary reasons robotic apple harvesting is a challenging manipulation problem is the cluttered tree canopy. An effective harvesting gripper should i) be compact to minimize collisions with the canopy; ii) offer a compliant grasp to prevent bruising; and iii) hold the fruit securely to counteract forces during picking. Much of the prior work has used single-mode grippers (suction or fingers), which are often compliant but have low grasp strength (suction), or have a strong grasp but a large form factor (fingers). We present a compact robotic gripper that combines the benefits of both. It first uses an array of soft suction cups to gently attach to the fruit, then deploys three telescoping fingers that sweep away obstacles and pivot inward to secure the grasp. We analyze the finger design for its ability to sweep clutter and maintain a tight grasp, and we measure grasp strength across suction-only, fingers-only, and combined (tandem) actuation modes. Tandem mode consistently provides a grasp that can counter typically observed fruit detachment forces. Using an apple proxy, we test the gripper’s performance in cluttered scenarios, achieving over 96% pick success with an ideal controller. Finally, we validate the gripper in a commercial apple orchard, achieving an 81% pick success rate.
|
| |
| 09:00-10:30, Paper WeI1I.52 | Add to My Program |
| Collaborative Exploration with a Marsupial Ground-Aerial Robot Team through Task-Driven Map Compression |
|
| Zacharia, Angelos | NTNU - Norwegian University of Science and Technology |
| Dharmadhikari, Mihir Rahul | NTNU - Norwegian University of Science and Technology |
| Alexis, Kostas | NTNU - Norwegian University of Science and Technology |
Keywords: Cooperating Robots, Motion and Path Planning
Abstract: Efficient exploration of unknown environments is crucial for autonomous robots, especially in confined and large-scale scenarios with limited communication. To address this challenge, we propose a collaborative exploration framework for a marsupial ground-aerial robot team that leverages the complementary capabilities of both platforms. The framework employs a graph-based path planning algorithm to guide exploration and deploy the aerial robot in areas where its expected gain significantly exceeds that of the ground robot, such as large open spaces or regions inaccessible to the ground platform, thereby maximizing coverage and efficiency. To facilitate large-scale spatial information sharing, we introduce a bandwidth-efficient, task-driven map compression strategy. This method enables each robot to reconstruct resolution-specific volumetric maps while preserving exploration-critical details, even at high compression rates. By selectively compressing and sharing key data, communication overhead is minimized, ensuring effective map integration for collaborative path planning. Simulation and real-world experiments validate the proposed approach, demonstrating its effectiveness in improving exploration efficiency while significantly reducing data transmission.
|
| |
| 09:00-10:30, Paper WeI1I.53 | Add to My Program |
| Grasp Like Humans: Learning Generalizable Multi-Fingered Grasping from Human Proprioceptive Sensorimotor Integration |
|
| Guo, Ce | National University of Defense Technology |
| Chen, Xieyuanli | The Chinese University of Hong Kong |
| Zeng, Zhiwen | National University of Defense Technology |
| Guo, Zirui | National University of Defense Technology |
| Li, YiHong | National University of Defense Technology |
| Xiao, Haoran | National University of Defense Technology |
| Hu, Dewen | National University of Defense Technology |
| Lu, Huimin | National University of Defense Technology |
Keywords: Learning from Demonstration, Force and Tactile Sensing, Grasping, Multifingered Hands
Abstract: Tactile and kinesthetic perceptions are crucial for human dexterous manipulation, enabling reliable grasping of objects via proprioceptive sensorimotor integration. For robotic hands, even though acquiring such tactile and kinesthetic feedback is feasible, establishing a direct mapping from this sensory feedback to motor actions remains challenging. In this article, we propose a novel glove-mediated tactile–kinematic perception–prediction framework for transferring grasp skills from intuitive, natural human operation to robotic execution via imitation learning, and its effectiveness is validated through generalized grasping tasks, including those involving deformable objects. First, we integrate a data glove to capture tactile and kinesthetic data at the joint level. The glove is adaptable for both human and robotic hands, allowing data collection from natural human hand demonstrations across different scenarios. It ensures consistency in the raw data format, enabling evaluation of grasping for both human and robotic hands. Second, we establish a unified representation of multimodal inputs based on graph structures with polar coordinates. We explicitly integrate the morphological differences into the designed representation, enhancing the compatibility across different demonstrators and robotic hands. Furthermore, we introduce the tactile–kinesthetic spatio-temporal graph networks, which leverage multidimensional subgraph convolutions and attention-based long short-term memory (LSTM) layers to extract spatio-temporal features from graph inputs to predict node-based states for each hand joint. These predictions are then mapped to final commands through a force-position hybrid mapping. Comparative experiments and ablation studies demonstrate that our approach surpasses other methods in grasp success rate, finger coordination, contact force management, and both grasp and computational efficiency, achieving results most akin to human grasping.
|
| |
| 09:00-10:30, Paper WeI1I.54 | Add to My Program |
| Detection of EMU Components Based on Optical Flow Attention Prior and Multi-Modal RGBD RTDETR |
|
| Cong, Mingjun | Huazhong University of Science and Technology |
| Peng, Gang | Huazhong University of Science and Technology |
| Tang, Yongchang | School of Artificial Intelligence and Automation, Huazhong University of Science and Technology |
| Song, Chaowei | Huazhong University of Science and Technology |
| Wang, Chaoze | Huazhong University of Science and Technology |
|
|
| |
| 09:00-10:30, Paper WeI1I.55 | Add to My Program |
| Proactive Risk-Aware Trajectory Planning for Autonomous Driving in Unstructured Environments Via Reinforcement Learning with Adaptive Reward Design |
|
| Du, Jiawei | Peking University |
| Qu, Weiming | Peking University |
| Yuan, Shenghai | Nanyang Technological University |
| Wang, Jia | Peking University |
| Bai, Qifei | Hangzhou Dianzi University |
| Li, Chengguang | Skyfend |
| Wu, Xihong | Peking University |
| Luo, Dingsheng | Peking University |
Keywords: Autonomous Vehicle Navigation, Intelligent and Flexible Manufacturing, Planning, Scheduling and Coordination
Abstract: Trajectory planning for autonomous driving in dynamic unstructured traffic remains a fundamental challenge. Existing methods are often reactive, i.e., they only respond to observed situations without explicitly anticipating future risks. Moreover, most reinforcement learning based approaches rely on manually crafted reward functions, which limits their adaptability and generalization across complex driving scenarios. In this paper, we propose a novel RL-based trajectory planning framework that integrates proactive obstacle avoidance and adaptive reward learning. Specifically, our planner predicts the future trajectories of surrounding traffic participants as well as potential ghost-probe risk zones, and proactively avoids these high-risk regions during planning. In addition, we introduce a large-model agent that dynamically adjusts the reward signals according to evolving traffic contexts, enabling more adaptive and robust policy learning compared with fixed reward designs. To evaluate our method, we build a high-fidelity simulation environment based on the Peking University campus, which provides realistic unstructured traffic scenarios. Extensive experiments demonstrate that our method significantly improves safety, efficiency, and generalization over state-of-the-art baselines, particularly in scenarios with occlusions and unpredictable behaviors. We plan to open-source our code and simulation environment for the benefit of the community.
|
| |
| 09:00-10:30, Paper WeI1I.56 | Add to My Program |
| Learning Forward Looking Adaptation to Dynamic Payloads for Quadruped Locomotion Via Physics-Informed Neural Networks |
|
| Youngquist, Oscar | University of Massachusetts Amherst |
| Zhang, Hao | University of Massachusetts Amherst |
Keywords: Deep Learning Methods, Legged Robots, Machine Learning for Robot Control
Abstract: Payload-adaptive locomotion is an essential capability for quadruped robots operating in real-world scenarios, particularly when tasked with transporting dynamic payloads. Existing approaches face fundamental limitations: reactive adaptation strategies respond too slowly to sudden payload changes, while learning-based methods often yield physically inconsistent models of robot dynamics that generalize poorly to novel states. To address these key challenges, we introduce Forward-Looking Adaptation to Dynamic Payloads (FLAP), a novel approach that learns to proactively compensate for discrepancies between expected and actual locomotion behavior induced by dynamic payloads. FLAP combines two critical components: (1) a physics-informed neural network (PINN) that predicts anticipated joint states while enforcing physical consistency through dynamics based loss functions, and (2) a composite adaptive control law that rapidly generates anticipatory joint torque compensations based on the PINN’s predictions. Through unifying structured dynamics modeling with real-time anticipatory control, our method enables generalizable and physically consistent adaptation to dynamic payloads. Experimental results demonstrate that FLAP achieves robust locomotion under diverse payload conditions on physical quadruped robots in real-world environments.
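A generic physics-informed loss of the kind described above can be sketched as a data term plus a physics residual; here the residual enforces trapezoidal consistency between predicted joint positions and velocities. The network size, state layout, and residual form are stand-ins for the paper's dynamics-based losses.

```python
# PINN-style training loss: data fit plus a physics-consistency residual.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(24, 128), nn.Tanh(), nn.Linear(128, 24))

def pinn_loss(state, target_next, dt=0.01, lam=0.1):
    """state, target_next: (B, 24) = [q (12 joints), qdot (12 joints)]."""
    pred = net(state)                      # predicted next [q, qdot]
    data_term = nn.functional.mse_loss(pred, target_next)
    q, qdot = state[:, :12], state[:, 12:]
    q_next, qdot_next = pred[:, :12], pred[:, 12:]
    # Physics residual: predicted positions must integrate the velocities
    # (trapezoidal rule), penalizing physically inconsistent predictions.
    residual = q_next - (q + 0.5 * dt * (qdot + qdot_next))
    return data_term + lam * residual.pow(2).mean()

loss = pinn_loss(torch.randn(32, 24), torch.randn(32, 24))
loss.backward()
print(float(loss))
```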
|
| |
| 09:00-10:30, Paper WeI1I.57 | Add to My Program |
| ROPA: Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation |
|
| Chen, Jason | University of Southern California |
| Liu, I-Chun Arthur | University of Southern California |
| Sukhatme, Gaurav | University of Southern California |
| Seita, Daniel | University of Southern California |
Keywords: Data Sets for Robot Learning, Imitation Learning, Bimanual Manipulation
Abstract: Training robust bimanual manipulation policies via imitation learning requires demonstration data with broad coverage over robot poses, contacts, and scene contexts. However, collecting diverse and precise real-world demonstrations is costly and time-consuming, which hinders scalability. Prior works have addressed this with data augmentation, typically for either eye-in-hand (wrist camera) setups with RGB inputs or for generating novel images without paired actions, leaving augmentation for eye-to-hand (third-person) RGB-D training with new action labels less explored. In this paper, we propose Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation (ROPA), an offline imitation learning data augmentation method that fine-tunes Stable Diffusion to synthesize third-person RGB and RGB-D observations of novel robot poses with up to four camera viewpoints. Our approach simultaneously generates corresponding joint-space action labels while employing constrained optimization to enforce physical consistency through appropriate gripper-to-object contact constraints in bimanual scenarios. We evaluate our method on 5 simulated and 3 real-world tasks. Our results across 2625 simulation trials and 300 real-world trials demonstrate that ROPA outperforms baselines and ablations, showing its potential for scalable RGB and RGB-D data augmentation in eye-to-hand bimanual manipulation. Our project website is available at: https://ropaaug.github.io/.
|
| |
| 09:00-10:30, Paper WeI1I.58 | Add to My Program |
| Semi-SMD: Semi-Supervised Metric Depth Estimation Via Surrounding Cameras for Autonomous Driving |
|
| Xie, Yusen | The Hong Kong University of Science and Technology (Guangzhou) |
| Huang, Zhenmin | The Hong Kong University of Science and Technology |
| Shen, Shaojie | Hong Kong University of Science and Technology |
| Ma, Jun | The Hong Kong University of Science and Technology |
Keywords: Deep Learning for Visual Perception, Visual Learning, Visual Tracking
Abstract: In this paper, we introduce Semi-SMD, a novel metric depth estimation framework tailored for surround-view camera setups in autonomous driving. In this work, the input data consists of adjacent surrounding frames and camera parameters. We propose a unified spatial-temporal-semantic fusion module to construct the visual fused features. Cross-attention components for surrounding cameras and adjacent frames are utilized to focus on metric scale information refinement and temporal feature matching. Building on this, we propose a pose estimation framework using surrounding cameras, their corresponding estimated depths, and extrinsic parameters, which effectively addresses the scale ambiguity in multi-camera setups. Moreover, a semantic world model and a monocular depth estimation world model are integrated to supervise the depth estimation, improving its quality. We evaluate our algorithm on the DDAD and nuScenes datasets, and the results demonstrate that our method achieves state-of-the-art performance in terms of surround-camera depth estimation quality. The source code is available on GitHub.
|
| |
| 09:00-10:30, Paper WeI1I.59 | Add to My Program |
| VLM-E2E: Enhancing End-To-End Autonomous Driving with Multimodal Driver Attention Fusion |
|
| Liu, Pei | The Hong Kong University of Science and Technology (Guangzhou) |
| Liu, Haipeng | Shanghai Li Auto Co., Ltd |
| Liu, Haichao | Nanyang Technological University |
| Liu, Xin | HKUST(GZ) |
| Ni, Jinxin | Xiamen University |
| Ma, Jun | The Hong Kong University of Science and Technology |
Keywords: Sensor Fusion, Deep Learning for Visual Perception, Computer Vision for Automation
Abstract: Human drivers adeptly navigate complex scenarios by utilizing rich attentional semantics, but current autonomous systems struggle to replicate this ability, as they often lose critical semantic information when converting 2D observations into 3D space; this loss hinders their effective deployment in dynamic and complex environments. Leveraging the superior scene understanding and reasoning abilities of Vision-Language Models (VLMs), we propose VLM-E2E, a novel framework that uses VLMs to enhance training by providing attentional cues. Our method integrates textual representations into Bird's-Eye-View (BEV) features for semantic supervision, which enables the model to learn richer feature representations that explicitly capture the driver's attentional semantics. By focusing on attentional semantics, VLM-E2E better aligns with human-like driving behavior, which is critical for navigating dynamic and complex environments. Furthermore, we introduce a BEV-Text learnable weighted fusion strategy to address the issue of modality importance imbalance in fusing multimodal information. This approach dynamically balances the contributions of BEV and text features, ensuring that the complementary information from visual and textual modalities is effectively utilized. By explicitly addressing the imbalance in multimodal fusion, our method facilitates a more holistic and robust representation of driving environments. We evaluate VLM-E2E on the nuScenes dataset and achieve significant improvements in perception, prediction, and planning over the baseline end-to-end model, showcasing the effectiveness of our attention-enhanced BEV representation in enabling more accurate and reliable autonomous driving tasks.
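One way to realize a learnable weighted fusion of BEV and text features is a gated convex combination, sketched below; the gate is input-conditioned so modality weights adapt per sample. The dimensions and gate design are illustrative assumptions, not the paper's module.

```python
# Learnable, input-conditioned weighted fusion of BEV and text features.
import torch
import torch.nn as nn

class WeightedBEVTextFusion(nn.Module):
    def __init__(self, bev_dim, text_dim, d_model):
        super().__init__()
        self.bev_proj = nn.Linear(bev_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        self.gate = nn.Linear(2 * d_model, 2)    # one logit per modality

    def forward(self, bev_feat, text_feat):
        """bev_feat: (B, bev_dim); text_feat: (B, text_dim) -> (B, d_model)."""
        b, t = self.bev_proj(bev_feat), self.text_proj(text_feat)
        w = torch.softmax(self.gate(torch.cat([b, t], dim=-1)), dim=-1)
        # Convex combination keeps the fused feature scale stable and lets
        # neither modality dominate by construction.
        return w[:, :1] * b + w[:, 1:] * t

fuse = WeightedBEVTextFusion(bev_dim=256, text_dim=512, d_model=256)
print(fuse(torch.randn(4, 256), torch.randn(4, 512)).shape)   # (4, 256)
```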
|
| |
| 09:00-10:30, Paper WeI1I.60 | Add to My Program |
| Cubemap-Based LiDAR-Inertial Odometry with Intensity Assistance |
|
| Liu, Yang | Panasonic Advanced Technology Development Co., Ltd |
| Yamamoto, Kazushige | Panasonic Advanced Technology Development Co., Ltd |
| Matsui, Atsushi | Panasonic Advanced Technology Development Co., Ltd |
| Takahashi, Saburo | Panasonic Advanced Technology Development Co., Ltd |
| Abe, Toshihisa | Panasonic Advanced Technology Development Co., Ltd |
Keywords: SLAM, Localization, Mapping
Abstract: We present CUBE-LIO, a LiDAR-inertial odometry framework that leverages direct photometric constraints from LiDAR intensity to improve robustness in geometrically degenerate environments. At its core is an efficient cubemap projection that maps LiDAR intensity onto six cube faces, eliminating pole singularities and severe polar distortion. This yields a more uniform overall distortion while avoiding the costly trigonometric operations typical of equirectangular mappings. Building on this representation, we introduce a semi-dense feature selection and direct optimization strategy based on intensity gradient magnitude. This strategy improves resilience to intensity noise and variations induced by range and incidence angle. Photometric constraints are jointly optimized with geometric measurements in a tightly coupled LIO pipeline. CUBE-LIO is sensor-agnostic and supports both spinning and solid-state LiDARs. Experiments on multiple public benchmarks demonstrate state-of-the-art accuracy and real-time performance, with particularly pronounced gains in scenes where the geometric structure is sparse or weak.
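The trigonometry-free property of the cubemap projection is easy to see in code: the face is the dominant axis of the ray, and the in-face coordinates are component ratios. The face ordering below is an arbitrary convention, and the sketch omits intensity binning.

```python
# Cubemap projection of ray directions without trigonometric functions.
import numpy as np

def cubemap_project(dirs):
    """dirs: (N, 3) nonzero ray directions -> (face in 0..5, u, v in [-1, 1])."""
    idx = np.arange(len(dirs))
    major = np.argmax(np.abs(dirs), axis=1)        # dominant axis: 0=x, 1=y, 2=z
    m = dirs[idx, major]                           # signed dominant component
    face = major * 2 + (m < 0)                     # order: +x, -x, +y, -y, +z, -z
    # The two remaining components, divided by the dominant one, give (u, v):
    # pure ratios, so no arctan/arcsin is ever evaluated.
    others = np.stack([np.delete(d, j) for d, j in zip(dirs, major)])
    u, v = (others / m[:, None]).T
    return face, u, v

rays = np.random.randn(5, 3)
print(cubemap_project(rays))
```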
|
| |
| 09:00-10:30, Paper WeI1I.61 | Add to My Program |
| The Temporal Trap: Entanglement in Pre-Trained Visual Representations for Visuomotor Policy Learning |
|
| Tsagkas, Nikolaos | University of Edinburgh |
| Sochopoulos, Andreas | The University of Edinburgh |
| Danier, Duolikun | University of Edinburgh |
| Lu, Chris Xiaoxuan | University College London |
| Mac Aodha, Oisin | University of Edinburgh |
Keywords: Visual Learning, Imitation Learning, Learning from Demonstration
Abstract: The integration of pre-trained visual representations (PVRs) has significantly advanced visuomotor policy learning. However, effectively leveraging these models remains a challenge. We identify temporal entanglement as a critical, inherent issue when using these time-invariant models in sequential decision-making tasks. This entanglement arises because PVRs, optimised for static image understanding, struggle to represent the temporal dependencies crucial for visuomotor control. In this work, we quantify the impact of temporal entanglement, demonstrating a strong correlation between a policy's success rate and the ability of its latent space to capture task-progression cues. Based on these insights, we propose a simple, yet effective disentanglement baseline designed to mitigate temporal entanglement. Our empirical results show that traditional methods aimed at enriching features with temporal components are insufficient on their own, highlighting the necessity of explicitly addressing temporal disentanglement for robust visuomotor policy learning.
|
| |
| 09:00-10:30, Paper WeI1I.62 | Add to My Program |
| Replicating Painting Strokes: Shape Aware Dynamic Motion Primitives for Robotic Manipulation |
|
| Vuletic, Jelena | University of Zagreb, Faculty of Electrical Engineering and Computing |
| Drobac, Pero | University of Zagreb |
| Maric, Bruno | University of Zagreb, Faculty of Electrical Engineering and Computing |
| Orsag, Matko | University of Zagreb, Faculty of Electrical Engineering and Computing |
|
|
| |
| 09:00-10:30, Paper WeI1I.63 | Add to My Program |
| AeroScene: Progressive Scene Synthesis for Aerial Robotics |
|
| Vu Huu, Nghia | AIOZ |
| Do, Tuong | AIOZ |
| Tran, Viet-Dzung | Hanoi University of Science and Technology |
| Nguyen, Binh | AIOZ |
| Nguyen, Hoan | University of Information Technology |
| Tjiputra, Erman | AIOZ |
| Tran, Quang | AIOZ |
| Nguyen, Hai-Nguyen (Hann) | RMIT University |
| Nguyen, Anh | University of Liverpool |
Keywords: Deep Learning Methods, Aerial Systems: Applications, Semantic Scene Understanding
Abstract: Generative models have shown substantial impact across multiple domains, yet their potential for scene synthesis remains underexplored in robotics. This gap is especially evident in drone simulators, where simulation environments still rely heavily on manual efforts, which are time-consuming to create and difficult to scale. In this work, we introduce AeroScene, a hierarchical diffusion model for progressive 3D scene synthesis. Our approach leverages hierarchy-aware tokenization and multi-branch feature extraction to reason across both global layouts and local details, ensuring physical plausibility and semantic consistency. This makes AeroScene particularly suited for generating realistic scenes for aerial robotics tasks such as navigation, landing, and perching. We demonstrate its effectiveness through extensive experiments on our newly collected dataset and a public benchmark, showing that AeroScene significantly outperforms prior methods. Furthermore, we use AeroScene to generate a large-scale dataset of over 1,000 physics-ready, high-fidelity 3D scenes that can be directly integrated into NVIDIA Isaac Sim. Finally, we illustrate the utility of these generated environments on downstream drone navigation tasks.
|
| |
| 09:00-10:30, Paper WeI1I.64 | Add to My Program |
| IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories Via Vision-Language Models |
|
| Ling, Yiyang | University of Southern California |
| Owalekar, Karan | University of Southern California |
| Adesanya, Oluwatobiloba | University of Southern California |
| Bıyık, Erdem | University of Southern California |
| Seita, Daniel | University of Southern California |
Keywords: Manipulation Planning, Semantic Scene Understanding, AI-Enabled Robotics
Abstract: Motion planning involves determining a sequence of robot configurations to reach a desired pose, subject to movement and safety constraints. Traditional motion planning finds collision-free paths, but this is overly restrictive in clutter, where it may not be possible for a robot to accomplish a task without contact. In addition, contacts range from relatively benign (e.g. brushing a soft pillow) to more dangerous (e.g. toppling a glass vase), making it difficult to characterize which may be acceptable. In this paper, we propose IMPACT, a novel motion planning framework that uses Vision-Language Models (VLMs) to infer environment semantics, identifying which parts of the environment can best tolerate contact based on object properties and locations. Our approach uses Stochastic Belief Propagation to generate an anisotropic cost map that encodes directional push safety. We pair this map with a novel contact-aware A* planner to find stable, contact-rich paths. We perform experiments using 20 simulation and 10 real-world scenes and assess using task success rate, object displacements, and feedback from human evaluators. Our results over 3200 simulation and 200 real-world trials suggest that IMPACT enables efficient contact-rich motion planning in cluttered settings while outperforming alternative methods and ablations. Supplementary material is available at https://impact-planning.github.io/.
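The planning step can be sketched as A* with a direction-dependent contact penalty: each move adds path length plus a cost read from an anisotropic map indexed by cell and push direction. The grid world, cost values, and weighting are toy stand-ins for the VLM-derived cost map.

```python
# Grid A* where each move incurs a directional contact cost.
import heapq
import numpy as np

MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right

def contact_aware_astar(cost_map, start, goal, w_contact=5.0):
    """cost_map: (H, W, 4) contact cost per cell and push direction."""
    h, wdt, _ = cost_map.shape
    heur = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # admissible
    open_set = [(heur(start), 0.0, start)]
    best = {start: 0.0}
    while open_set:
        _, g, cur = heapq.heappop(open_set)
        if cur == goal:
            return g
        for d, (dr, dc) in enumerate(MOVES):
            nxt = (cur[0] + dr, cur[1] + dc)
            if not (0 <= nxt[0] < h and 0 <= nxt[1] < wdt):
                continue
            # Unit step cost plus the anisotropic contact penalty.
            ng = g + 1.0 + w_contact * cost_map[nxt[0], nxt[1], d]
            if ng < best.get(nxt, np.inf):
                best[nxt] = ng
                heapq.heappush(open_set, (ng + heur(nxt), ng, nxt))
    return np.inf

cm = np.random.rand(20, 20, 4) * 0.2
print(contact_aware_astar(cm, (0, 0), (19, 19)))
```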
|
| |
| 09:00-10:30, Paper WeI1I.65 | Add to My Program |
| TeTRA-VPR: A Ternary Transformer Approach for Compact Visual Place Recognition |
|
| Grainge, Oliver Edward | University of Southampton |
| Milford, Michael J | Queensland University of Technology |
| Bodala, Indu | University of Southampton |
| Ramchurn, Sarvapali | University of Southampton |
| Ehsan, Shoaib | University of Essex |
Keywords: Localization, Recognition, Deep Learning for Visual Perception
Abstract: Visual Place Recognition (VPR) localizes a query image by matching it against a database of geo-tagged reference images, making it essential for navigation and mapping in robotics. Although Vision Transformer (ViT) solutions deliver high accuracy, their large models often exceed the memory and compute budgets of resource-constrained platforms such as drones and mobile robots. To address this issue, we propose TeTRA, a ternary transformer approach that progressively quantizes the ViT backbone to 2-bit precision and binarizes its final embedding layer, offering substantial reductions in model size and latency. A carefully designed progressive distillation strategy preserves the representational power of a full-precision teacher, allowing TeTRA to retain or even surpass the accuracy of uncompressed convolutional counterparts, despite using fewer resources. Experiments on standard VPR benchmarks demonstrate that TeTRA reduces memory consumption by up to 69% compared to efficient baselines, while lowering inference latency by 35%, with either no loss or a slight improvement in recall@1. These gains enable high-accuracy VPR on power-constrained, memory-limited robotic platforms, making TeTRA an appealing solution for real-world deployment.
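To illustrate the 2-bit weight idea, the snippet below applies a ternary-weight-network style quantizer: a threshold proportional to the mean absolute weight selects {-1, 0, +1}, and a per-tensor scale minimizes reconstruction error. TeTRA's actual quantizer and progressive distillation schedule may differ.

```python
# Ternary weight quantization (TWN-style heuristic threshold and scale).
import torch

def ternarize(w):
    delta = 0.7 * w.abs().mean()              # common TWN-style threshold
    t = torch.zeros_like(w)
    t[w > delta] = 1.0
    t[w < -delta] = -1.0
    # Optimal scale for a fixed ternary pattern: mean |w| over the nonzeros.
    nz = t != 0
    alpha = w[nz].abs().mean() if nz.any() else w.new_tensor(0.0)
    return alpha * t, t, alpha

w = torch.randn(256, 256)
w_q, pattern, alpha = ternarize(w)
print(pattern.unique(), float(alpha),
      float((w - w_q).pow(2).mean()))         # residual quantization error
```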
|
| |
| 09:00-10:30, Paper WeI1I.66 | Add to My Program |
| SplatSDF: Boosting SDF-NeRF Via Architecture-Level Fusion with Gaussian Splats |
|
| Li, Runfa | University of California - San Diego |
| George, Daniel | University of California - San Diego |
| Suzuki, Keito | University of California, San Diego |
| Du, Bang | University of California San Diego |
| Lee, Ki Myung Brian | University of California San Diego |
| Atanasov, Nikolay | University of California, San Diego |
| Nguyen, Truong | University of California, San Diego |
Keywords: Visual Learning
Abstract: Signed distance-radiance field (SDF-NeRF) is a promising environment representation that offers both photorealistic rendering and geometric reasoning such as proximity queries for collision avoidance. However, the slow training speed and convergence of SDF-NeRF hinder their use in practical robotic systems. We propose SplatSDF, a novel SDF-NeRF architecture that accelerates convergence using 3D Gaussian splats (3DGS), which can be quickly pre-trained. Unlike prior approaches that introduce a consistency loss between separate 3DGS and SDF-NeRF models, SplatSDF directly fuses 3DGS at an architectural level by consuming it as an input to SDF-NeRF during training. This is achieved using a novel sparse 3DGS fusion strategy that injects neural embeddings of 3DGS into SDF-NeRF around the object surface, while also permitting inference without 3DGS for minimal operation. Experimental results show that SplatSDF converges to the same geometric accuracy three times faster than the best baseline, and outperforms state-of-the-art SDF-NeRF methods in terms of chamfer distance and peak signal-to-noise ratio, unlike consistency loss-based approaches that in fact provide limited gains. We also present computational techniques that accelerate gradient and Hessian steps by a factor of three. We expect these improvements will contribute to deploying SDF-NeRF on practical systems.
|
| |
| 09:00-10:30, Paper WeI1I.67 | Add to My Program |
| CAIMAN: Causal Action Influence Detection for Sample-Efficient Loco-Manipulation |
|
| Yuan, Yuanchen | ETH Zürich |
| Cheng, Jin | ETH Zürich |
| Armengol Urpí, Nuria | ETH Zürich |
| Coros, Stelian | ETH Zürich |
Keywords: Legged Robots, Mobile Manipulation, Reinforcement Learning
Abstract: Enabling legged robots to perform non-prehensile loco-manipulation is crucial for enhancing their versatility. However, learning behaviors such as whole-body object pushing often necessitates sophisticated planning strategies or extensive task-specific reward shaping. In this work, we present CAIMAN, a practical reinforcement learning framework that encourages the agent to gain control over other entities in the environment. CAIMAN leverages causal action influence as an intrinsic motivation objective, allowing legged robots to efficiently acquire object pushing skills even under sparse task rewards. We employ a hierarchical control strategy, combining a low-level locomotion module with a high-level policy that generates task-relevant velocity commands and is trained to maximize the intrinsic reward. To estimate causal action influence, we learn the dynamics of the environment by integrating a kinematic prior with data collected during training. We empirically demonstrate CAIMAN’s superior sample efficiency and adaptability to diverse scenarios in simulation, as well as its successful transfer to real-world systems without further fine-tuning. A video demo is available at https://www.youtube.com/watch?v=dNyvT04Cqaw.
|
| |
| 09:00-10:30, Paper WeI1I.68 | Add to My Program |
| Global Tensor Motion Planning |
|
| Le, An Thai | Vin University |
| Pompetzki, Kay | Intelligent Autonomous Systems Group, Technical University Darmstadt |
| Mueller Carvalho, Joao Andre | Technische Universitaet Darmstadt |
| Watson, Joe | TU Darmstadt |
| Urain, Julen | TU Darmstadt |
| Biess, Armin | Ben-Gurion University of the Negev |
| Chalvatzaki, Georgia | Technische Universität Darmstadt |
| Peters, Jan | Technische Universität Darmstadt |
Keywords: Motion and Path Planning, Manipulation Planning
Abstract: Batch planning is increasingly necessary to quickly produce diverse, high-quality motion plans for downstream learning applications, such as distillation and imitation learning. This paper presents Global Tensor Motion Planning (GTMP)---a sampling-based motion planning algorithm comprising only tensor operations. We introduce a novel discretization structure represented as a random multipartite graph, enabling efficient vectorized sampling, collision checking, and search. We provide a theoretical investigation showing that GTMP exhibits probabilistic completeness while supporting modern GPU/TPU acceleration. Additionally, by incorporating smooth structures into the multipartite graph, GTMP directly plans smooth splines without requiring gradient-based optimization. Experiments on lidar-scanned occupancy maps and the MotionBenchMarker dataset demonstrate GTMP's computational efficiency in batch planning compared to baselines, underscoring GTMP's potential as a robust, scalable planner for diverse applications and large-scale robot learning tasks.
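The toy planner below conveys the tensorized flavor of GTMP: layers of random samples form a multipartite graph, edge feasibility and cost are evaluated as batched array operations, and the best path is recovered by dynamic programming over layers. All names and the disc-obstacle world are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gtmp_sketch(start, goal, is_free, n_layers=8, n_nodes=64, n_checks=16, rng=None):
    """Toy tensorized planner over a random multipartite graph."""
    rng = rng or np.random.default_rng(0)
    dim = len(start)
    layers = [np.asarray(start)[None]]                       # layer 0: start
    layers += [rng.uniform(0, 1, (n_nodes, dim)) for _ in range(n_layers)]
    layers += [np.asarray(goal)[None]]                       # final layer: goal

    def edge_ok(a, b):
        # Batched straight-line collision check over all edges at once.
        ts = np.linspace(0, 1, n_checks)[:, None, None, None]
        pts = a[None, :, None, :] * (1 - ts) + b[None, None, :, :] * ts
        return is_free(pts.reshape(-1, dim)).reshape(n_checks, len(a), len(b)).all(0)

    cost, parents = np.zeros(1), []                          # cost-to-come, backpointers
    for a, b in zip(layers[:-1], layers[1:]):
        w = np.linalg.norm(a[:, None] - b[None, :], axis=-1) # all layer-to-layer edges
        w[~edge_ok(a, b)] = np.inf
        total = cost[:, None] + w
        parents.append(total.argmin(0))
        cost = total.min(0)                                  # dynamic-programming step
    return cost[0], parents

free = lambda p: np.linalg.norm(p - 0.5, axis=-1) > 0.2      # disc obstacle at center
c, _ = gtmp_sketch([0.1, 0.1], [0.9, 0.9], free)
print("path cost:", c)
```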
|
| |
| 09:00-10:30, Paper WeI1I.69 | Add to My Program |
| EZREAL: Enhancing Zero-Shot Outdoor Robot Navigation Toward Distant Targets under Varying Visibility |
|
| Zeng, Tianle | Southern University of Science and Technology |
| Peng, Jianwei | Southern University of Science and Technology |
| Ye, Hanjing | Southern University of Science and Technology |
| Chen, Guangcheng | Southern University of Science and Technology |
| Luo, Senzi | Southern University of Science and Technology |
| Zhang, Hong | Southern University of Science and Technology |
Keywords: Vision-Based Navigation, Semantic Scene Understanding, Planning under Uncertainty
Abstract: Zero-shot object navigation (ZSON) in large-scale outdoor environments faces many challenges; we specifically address a coupled one: long-range targets that reduce to tiny projections and intermittent visibility due to partial or complete occlusion. We present a unified, lightweight closed-loop system built on an aligned multi-scale image tile hierarchy. Through hierarchical target–saliency fusion, it summarizes localized semantic contrast into a stable coarse-layer regional saliency that provides the target direction and indicates target visibility. This regional saliency supports visibility-aware heading maintenance through keyframe memory, saliency-weighted fusion of historical headings, and active search during temporary invisibility. The system avoids whole-image rescaling, enables deterministic bottom-up aggregation, supports zero-shot navigation, and runs efficiently on a mobile robot. Across simulation and real-world outdoor trials, the system detects semantic targets beyond 150 m, maintains a correct heading through visibility changes with 82.6% probability, and improves overall task success by 17.5% compared with the SOTA methods, demonstrating robust ZSON toward distant and intermittently observable targets.
|
| |
| 09:00-10:30, Paper WeI1I.70 | Add to My Program |
| Agile Hauler Curriculum: Learning High-Speed Locomotion for Robots under Demanding Payloads |
|
| Zhou, Yawen | University of Chinese Academy of Sciences |
| Tang, Haopeng | Hohai University; Nanjing Institute of Software Technology; University of Chinese Academy of Sciences, Nanjing |
| Fu, Huiqiao | Nanjing University |
| Li, Peng | Institute of Software, Chinese Academy of Sciences |
|
|
| |
| 09:00-10:30, Paper WeI1I.71 | Add to My Program |
| R2BC: Multi-Agent Imitation Learning from Single-Agent Demonstrations |
|
| Mattson, Connor | University of Utah |
| Raveendra, Varun | University of Utah |
| Novoseller, Ellen | DEVCOM Army Research Laboratory |
| Waytowich, Nicholas | US Army Research Laboratory |
| Lawhern, Vernon | US Army Research Laboratory |
| Brown, Daniel | University of Utah |
Keywords: Learning from Demonstration, Multi-Robot Systems, Imitation Learning
Abstract: Imitation Learning (IL) is a natural way for humans to teach robots, particularly when high-quality demonstrations are easy to obtain. While IL has been widely applied to single-robot settings, relatively few studies have addressed the extension of these methods to multi-agent systems, especially in settings where a single human must provide demonstrations to a team of collaborating robots. In this paper, we introduce and study Round-Robin Behavior Cloning (R2BC), a method that enables a single human operator to effectively train multi-robot systems through sequential, single-agent demonstrations. Our approach allows the human to teleoperate one agent at a time and incrementally teach multi-agent behavior to the entire system, without requiring demonstrations in the joint multi-agent action space. We show that R2BC methods match—and in some cases surpass—the performance of an oracle behavior cloning approach trained on privileged synchronized demonstrations across four multi-agent simulated tasks. Finally, we deploy R2BC on two physical robot tasks trained using real human demonstrations.
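A minimal sketch of the round-robin data-collection loop follows. DummyEnv, teleop, and fit are hypothetical stand-ins for the environment, the human interface, and a behavior-cloning trainer; the code is our reading of the procedure described above, not the authors' release.

```python
import numpy as np

class RoundRobinBC:
    """Sketch of round-robin behavior cloning: the human teleoperates one
    agent at a time while the others run their current policies; each pass
    grows per-agent datasets used for ordinary single-agent BC."""

    def __init__(self, n_agents, policies):
        self.n_agents = n_agents
        self.policies = policies                        # one policy per agent
        self.datasets = [[] for _ in range(n_agents)]

    def collect_episode(self, env, human_agent, teleop):
        obs, done = env.reset(), False
        while not done:
            actions = [self.policies[i](obs[i]) for i in range(self.n_agents)]
            actions[human_agent] = teleop(obs[human_agent])  # human drives one agent
            self.datasets[human_agent].append((obs[human_agent], actions[human_agent]))
            obs, done = env.step(actions)

    def round(self, env, teleop, fit):
        for i in range(self.n_agents):                  # round-robin over agents
            self.collect_episode(env, human_agent=i, teleop=teleop)
            self.policies[i] = fit(self.datasets[i])    # standard BC update

# Smoke test with dummy stand-ins (hypothetical interfaces).
class DummyEnv:
    def reset(self): self.t = 0; return [np.zeros(2)] * 2
    def step(self, actions): self.t += 1; return [np.zeros(2)] * 2, self.t >= 5

learner = RoundRobinBC(2, [lambda o: np.zeros(1)] * 2)
learner.round(DummyEnv(), teleop=lambda o: np.ones(1),
              fit=lambda data: (lambda o: np.mean([a for _, a in data], axis=0)))
```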
|
| |
| 09:00-10:30, Paper WeI1I.72 | Add to My Program |
| An Automatic LiDAR-Camera Extrinsic Calibration Method for Sparse Point Clouds Using Boundary Features |
|
| Gu, Tiancheng | Zhejiang University of Technology |
| Wang, Minqian | Zhejiang University of Technology |
| Weng, Libo | Zhejiang University of Technology |
| Lei, Yanjing | Zhejiang University of Technology |
| Gao, Fei | Zhejiang University of Technology |
Keywords: Calibration and Identification
Abstract: Extrinsic calibration for LiDAR and camera using sparse point clouds can significantly reduce cost and improve efficiency. However, most target-based methods are designed for dense point clouds and are less effective in sparse scenarios, while targetless methods primarily rely on environmental features. To address this limitation, a LiDAR–camera extrinsic calibration method for sparse point clouds is proposed in this paper. First, the proposed method extracts the complete checkerboard image via line-segment direction clustering and midpoint-to-normal projection. Second, a constructed theoretical checkerboard boundary point cloud is aligned to the scanned boundary point cloud using a proposed dimension-reduced, global-search and local-refinement (DGL) method. Third, coarse calibration is derived from the centroids of the checkerboard in images and aligned point clouds, followed by refinement through joint optimization of reprojection error and normal consistency error. Finally, experiments on the simulated dataset yield translation and rotation errors below 0.015 m and 0.3°, respectively. On a self-collected dataset, the method achieves an mIoU of 90.9% between the checkerboard region reprojected from point clouds and its image counterpart, outperforming state-of-the-art methods under sparse point cloud conditions.
|
| |
| 09:00-10:30, Paper WeI1I.73 | Add to My Program |
| Policy Diversification through Representation Distinguishability Regularization for Multi-Actor Deep Reinforcement Learning |
|
| Xu, Meng | City University of Hong Kong |
| Chen, Xinhong | City University of Hong Kong |
| Wang, Shuguang | City University of Hong Kong |
| Zhao, Guanyi | City University of Hong Kong |
| Wang, Jianping | City University of Hong Kong |
Keywords: Reinforcement Learning, Representation Learning, Machine Learning for Robot Control
Abstract: Deep reinforcement learning (DRL) has been widely applied to various applications, but improving exploration remains a key challenge. Recently, multi-actor DRL has emerged as a promising approach that enhances exploration by simultaneously deploying multiple actors for learning. In these methods, actor diversity helps the actors discover better policies. However, existing multi-actor DRL methods still lack effective techniques to promote actor diversity, leading to homogeneous, redundant actors and suboptimal policies. To address this, we propose a generic solution that can be seamlessly integrated into existing multi-actor DRL methods to promote actor diversity, thereby enabling better policy learning. Specifically, we decompose each actor into a representation module and a decision-making module, where the representation module receives the environment state and outputs a representation vector for the decision module to generate actions. We then compute the difference between each actor’s representation vector and those of all other actors as an additional loss, referred to as representation distinguishability regularization, and train the actor alongside its original loss to promote actor diversity. We demonstrate that our method effectively improves the performance of nine state-of-the-art (SOTA) multi-actor DRL methods across eight benchmark tasks, in terms of return.
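The regularizer itself is simple to sketch: given one representation vector per actor, penalize their pairwise similarity so that training pushes them apart. The PyTorch snippet below is an illustrative form of such a loss; the exact distance and weighting used in the paper may differ.

```python
import torch

def distinguishability_loss(reps: torch.Tensor) -> torch.Tensor:
    """Representation-distinguishability regularizer (a sketch of the idea).

    reps: (K, D) tensor, one representation vector per actor, already
    averaged over the state batch."""
    k = reps.shape[0]
    dists = torch.cdist(reps, reps, p=2)               # (K, K) pairwise L2
    off_diag = dists[~torch.eye(k, dtype=torch.bool)]  # drop self-distances
    return -off_diag.mean()  # minimizing this maximizes pairwise distance

# Typical use: total_loss = original_actor_loss + beta * distinguishability_loss(reps)
reps = torch.randn(4, 128, requires_grad=True)
loss = distinguishability_loss(reps)
loss.backward()
```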
|
| |
| 09:00-10:30, Paper WeI1I.74 | Add to My Program |
| Synthetic vs. Real Training Data for Visual Navigation |
|
| Suomela, Lauri Aleksanteri | Tampere University |
| Kuruppu Arachchige, Sasanka | Tampere University |
| Torres, German F. | Tampere University |
| Edelman, Harry | Turku University of Applied Sciences |
| Kamarainen, Joni-Kristian | Tampere University |
Keywords: Vision-Based Navigation, Imitation Learning, Data Sets for Robot Learning
Abstract: This paper investigates how the performance of visual navigation policies trained in simulation compares to policies trained with real-world data. Performance degradation of simulator-trained policies is often significant when they are evaluated in the real world. However, despite this well-known sim-to-real gap, we demonstrate that simulator-trained policies can match the performance of their real-world-trained counterparts. Central to our approach is a navigation policy architecture that bridges the sim-to-real appearance gap by leveraging pretrained visual representations and runs in real time on robot hardware. Evaluations on a wheeled mobile robot show that the proposed policy, when trained in simulation, outperforms its real-world-trained version by 31 points and the prior state-of-the-art methods by 50 points in navigation success rate. Policy generalization is verified by deploying the same model onboard a drone. Our results highlight the importance of diverse image encoder pretraining for sim-to-real generalization, and identify on-policy learning as a key advantage of simulated training over training with real data. Code, model checkpoints and multimedia materials are available at https://lasuomela.github.io/faint/.
|
| |
| 09:00-10:30, Paper WeI1I.75 | Add to My Program |
| RoboEye: Enhancing 2D Robotic Object Identification with Selective 3D Geometric Keypoint Matching |
|
| Zhang, Xingwu | Hunan University |
| Li, Guanxuan | Hunan University |
| Zhang, Zhuocheng | Hunan University |
| Long, Zijun | Hunan University |
Keywords: Object Detection, Segmentation and Categorization, Foundations of Automation, Deep Learning Methods
Abstract: The rapidly growing number of product categories in large-scale e-commerce makes accurate object identification for automated packing in warehouses substantially more difficult. As the catalog grows, intra-class variability and a long tail of rare or visually similar items increase, and—when combined with diverse packaging, cluttered containers, frequent occlusion, and large viewpoint changes—these factors amplify discrepancies between query and reference images, causing sharp performance drops for methods that rely solely on 2D appearance features. Thus, we propose RoboEye, a two-stage identification framework that dynamically augments 2D semantic features with domain-adapted 3D reasoning and lightweight adapters to bridge training–deployment gaps. In the first stage, we train a large vision model to extract 2D features for generating candidate rankings. A lightweight 3D-feature-awareness module then estimates 3D feature quality and predicts whether 3D re-ranking is necessary, preventing performance degradation and avoiding unnecessary computation. When invoked, the second stage uses our robot 3D retrieval transformer, comprising a 3D feature extractor that produces geometry-aware dense features and a keypoint-based matcher that computes keypoint-correspondence confidences between query and reference images instead of conventional cosine-similarity scoring. Experiments show that RoboEye improves Recall@1 by 7.1% over the prior state of the art (RoboLLM). Moreover, RoboEye operates using only RGB images, avoiding reliance on explicit 3D inputs and reducing deployment costs. The code used in this paper is publicly available at: https://github.com/longkukuhi/RoboEye.
|
| |
| 09:00-10:30, Paper WeI1I.76 | Add to My Program |
| Reactive Whole-Body Control of Mobile Manipulators for Dynamic Target Tracking Via Adaptive-Predictive Visual Servoing |
|
| Monguzzi, Andrea | Leonardo, Innovation Hub |
| Alfonso, Giuseppe | Leonardo, Innovation Hub |
| Kashiri, Navvab | Leonardo Labs |
Keywords: Motion Control, Mobile Manipulation, Visual Servoing
Abstract: This paper addresses the challenging problem of enabling a mobile manipulator with an eye-in-hand camera to track dynamic targets with time-varying positions and orientations in an unbounded workspace. Specifically, we propose an optimization-based whole-body control framework for dynamic target tracking. The framework enables the mobile manipulator to maintain the target within the camera’s field of view while reaching the desired pose, by dynamically regulating the priorities of the optimization constraints and objectives according to the task execution state. Moreover, we present an adaptive-predictive position-based visual servoing strategy to generate the Cartesian references sent to the controller. To enhance the tracking performance, we introduce (1) adaptive gains to avoid abrupt motions and the resulting vibrations while preserving final precision; (2) dynamic addition of a feedforward term incorporating a velocity estimate of the target using a Kalman Filter. The proposed approach is validated on a real robotic setup, as compared to a state-of-the-art approach, demonstrating superior performance in dynamic target tracking.
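The feedforward term rests on a standard construction: a constant-velocity Kalman filter that estimates target velocity from position measurements. The sketch below shows that generic filter; the noise parameters are placeholder assumptions, not the authors' tuning.

```python
import numpy as np

class ConstantVelocityKF:
    """Constant-velocity Kalman filter per axis: state x = [p, v],
    only position p is measured. The velocity estimate feeds the
    feedforward term of the visual-servoing law."""

    def __init__(self, dt, q=1e-2, r=1e-3):
        self.F = np.array([[1, dt], [0, 1]])
        self.H = np.array([[1, 0]])
        self.Q = q * np.array([[dt**3 / 3, dt**2 / 2], [dt**2 / 2, dt]])
        self.R = np.array([[r]])
        self.x = np.zeros((2, 1))
        self.P = np.eye(2)

    def step(self, z):
        # Predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update with position measurement z
        y = np.array([[z]]) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return self.x[1, 0]  # velocity estimate -> feedforward term

kf = ConstantVelocityKF(dt=0.02)
for t in range(100):
    v_hat = kf.step(z=0.5 * t * 0.02)  # target moving at 0.5 m/s
print("estimated velocity:", v_hat)
```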
|
| |
| 09:00-10:30, Paper WeI1I.77 | Add to My Program |
| InsSo3D: Inertial Navigation System and 3D Sonar SLAM for Turbid Environment Inspection |
|
| Archieri, Simon | Heriot-Watt University |
| Cinar, Ahmet Fatih | Frontier Robotics |
| Pan, Shu | Heriot-Watt University |
| Scharff Willners, Jonatan | Frontier Robotics |
| Grimaldi, Michele | Heriot-Watt University |
| Carlucho, Ignacio | Heriot-Watt University |
| Petillot, Yvan R. | Heriot-Watt University |
Keywords: Marine Robotics, SLAM, Range Sensing
Abstract: This paper presents InsSo3D, an accurate and efficient method for large-scale 3D Simultaneous Localisation and Mapping (SLAM) using a 3D Sonar and an Inertial Navigation System (INS). Unlike traditional sonar, which produces 2D images containing range and azimuth information but lacks elevation information, 3D Sonar produces a 3D point cloud, which therefore does not suffer from elevation ambiguity. We introduce a robust and modern SLAM framework adapted to the 3D Sonar data using INS as prior, detecting loop closure and performing pose graph optimisation. We evaluated InsSo3D's performance inside a test tank with access to ground-truth data and in an outdoor flooded quarry. Comparisons to reference trajectories and maps obtained from an underwater motion tracking system and visual Structure From Motion (SFM) demonstrate that InsSo3D efficiently corrects odometry drift. The average trajectory error is below 21 cm during a 50-minute-long mission, producing a map of 10 m by 20 m with a 9 cm average reconstruction error, enabling safe inspection of natural or artificial underwater structures even in murky water conditions.
|
| |
| 09:00-10:30, Paper WeI1I.78 | Add to My Program |
| Increasing the Stiffness of Tendon-Driven Continuum Robots Via Multi-Constraints |
|
| Du, Zhenting | King's College London |
| Chen, Xingyu | King's College London |
| Fischer, Nikola | King's College London |
| Bai, Weibang | ShanghaiTech University |
| Yuan, Quan | ShanghaiTech University |
| Zebian, Bassel | Department of Neurosurgery, Kings College Hospital NHS Trust, London, UK |
| Booth, Thomas | King's College London |
| Neji, Radhouene | Research Department of Surgical & Interventional Engineering, School of Biomedical Engineering & Imaging Sciences, King's Coll |
| Bergeles, Christos | King's College London |
|
|
| |
| 09:00-10:30, Paper WeI1I.79 | Add to My Program |
| CropNeRF: A Neural Radiance Field-Based Framework for Crop Counting |
|
| Muzaddid, Md Ahmed Al | University of Texas at Arlington |
| Beksi, William J. | The University of Texas at Arlington |
Keywords: Agricultural Automation, Computer Vision for Automation, Object Detection, Segmentation and Categorization
Abstract: Rigorous crop counting is crucial for effective agricultural management and informed intervention strategies. However, in outdoor field environments, partial occlusions combined with inherent ambiguity in distinguishing clustered crops from individual viewpoints pose an immense challenge for image-based segmentation methods. To address these problems, we introduce a novel crop counting framework designed for exact enumeration via 3D instance segmentation. Our approach utilizes 2D images captured from multiple viewpoints and associates independent instance masks for neural radiance field (NeRF) view synthesis. We introduce crop visibility and mask consistency scores, which are incorporated alongside 3D information from a NeRF model. This results in an effective segmentation of crop instances in 3D and highly accurate crop counts. Furthermore, our method eliminates the dependence on crop-specific parameter tuning. We validate our framework on three agricultural datasets consisting of cotton bolls, apples, and pears, and demonstrate consistent counting performance despite major variations in crop color, shape, and size. A comparative analysis against the state of the art highlights superior performance on crop counting tasks. Lastly, we contribute a cotton plant dataset to advance further research on this topic.
|
| |
| 09:00-10:30, Paper WeI1I.80 | Add to My Program |
| Language-Guided Attribute Alignment and Semantic Consistency for Zero-Shot Domain Adaptation |
|
| Pan, Junhong | Nanjing University of Science and Technology |
| Jiang, Chenyi | Nanjing University of Science and Technology |
| Li, Minxian | Nanjing University of Science and Technology |
| Zhang, Haofeng | Nanjing University of Science and Technology |
Keywords: Semantic Scene Understanding, Autonomous Vehicle Navigation, Vision-Based Navigation
Abstract: In cross-domain visual understanding tasks, models often achieve strong performance on the source domain but suffer severe degradation when applied to target domains with substantial distribution shifts. This challenge is particularly prominent under the zero-shot domain adaptation setting, where adaptation must be achieved without access to target-domain samples and instead relies on language guidance to bridge the gap. However, existing approaches typically depend on fixed class names or handcrafted prompt templates, which fail to capture fine-grained semantic attributes present in the target domain. Moreover, the insufficient alignment between visual and linguistic modalities further constrains the transferability of semantic knowledge. To address these issues, we propose an attribute-driven cross-modal feature modulation framework, termed Language-guided Attribute alignment and Semantic Consistency (LASC). On the semantic side, we introduce an attribute-driven prompt generation module that dynamically combines category information with domain-relevant attributes to construct adaptive text prompts, which are aligned with visual features through cross-modal attention for enhanced semantic stability. Furthermore, we incorporate a semantic consistency constraint, where a memory bank enforces intra-class compactness and inter-class separation, ensuring robust discriminability across domains. Extensive experiments demonstrate that our approach achieves significant improvements over state-of-the-art baselines on multiple cross-domain benchmarks, and maintains strong adaptation ability without requiring any target-domain data.
|
| |
| 09:00-10:30, Paper WeI1I.81 | Add to My Program |
| CommCP: Efficient Multi-Agent Coordination Via LLM-Based Communication with Conformal Prediction |
|
| Zhang, Xiaopan | University of California - Riverside |
| Wang, Zejin | University of California, Riverside |
| Li, Zhixu | University of California, Riverside |
| Yao, Jianpeng | University of California, Riverside |
| Li, Jiachen | University of California, Riverside |
Keywords: Multi-Robot Systems, Cooperating Robots, Autonomous Agents
Abstract: To complete assignments provided by humans in natural language, robots must interpret commands, generate and answer relevant questions for scene understanding, and manipulate target objects. Real-world deployments often require multiple heterogeneous robots with different manipulation capabilities to handle different assignments cooperatively. Beyond the need for specialized manipulation skills, effective information gathering is important in completing these assignments. To address this component of the problem, we formalize the information-gathering process in a fully cooperative setting as an underexplored multi-agent multi-task Embodied Question Answering (MM-EQA) problem, which is a novel extension of canonical Embodied Question Answering (EQA), where effective communication is crucial for coordinating efforts without redundancy. To this end, we propose CommCP, a novel LLM-based decentralized communication framework designed for MM-EQA. Our framework employs conformal prediction to calibrate the generated messages, thereby minimizing receiver distractions and enhancing communication reliability. To evaluate our framework, we introduce an MM-EQA benchmark featuring diverse, photo-realistic household scenarios with embodied questions. Experimental results demonstrate that CommCP significantly enhances the task success rate and exploration efficiency over baselines. The experiment videos, code, and dataset are available on our project website: https://comm-cp.github.io.
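For context, the calibration step can be grounded in ordinary split conformal prediction, sketched below: a quantile of held-out nonconformity scores yields prediction sets with a coverage guarantee. CommCP's actual scoring of LLM-generated messages is more involved; this is the generic recipe.

```python
import numpy as np

def conformal_prediction_set(cal_scores, test_probs, alpha=0.1):
    """Split conformal prediction (generic recipe).

    cal_scores: nonconformity scores s_i = 1 - p(true label) on a held-out
    calibration set. test_probs: model confidences over candidate answers for
    a new query. Returns the candidate set that contains the true answer with
    probability >= 1 - alpha."""
    n = len(cal_scores)
    # Finite-sample-corrected quantile of the calibration scores.
    q = np.quantile(cal_scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    return np.where(1 - test_probs <= q)[0]  # keep sufficiently confident candidates

cal = 1 - np.random.beta(8, 2, size=500)     # toy calibration scores
probs = np.array([0.70, 0.20, 0.06, 0.04])   # candidate message confidences
print("prediction set:", conformal_prediction_set(cal, probs))
```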
|
| |
| 09:00-10:30, Paper WeI1I.82 | Add to My Program |
| DIPP: A Diffusion-Based Potential Planner for Synergistic Navigation and Mapping |
|
| Zhang, Yiqing | Beijing Institute of Technology |
| Wang, Tao | Beijing Institute of Technology |
| Pan, Miaoxin | Beijing Institute of Technology |
| Yang, Yi | Beijing Institute of Technology |
| Fu, Mengyin | Beijing Institute of Technology |
Keywords: AI-Based Methods, Semantic Scene Understanding, Autonomous Agents
Abstract: Object-Goal Navigation (ObjectNav) requires an embodied agent to search for and reach a target object category in previously unseen environments using only onboard egocentric observations, which is a fundamental capability for long-horizon autonomous robots. Current Object-Goal Navigation methods typically discard environmental knowledge after each episode, limiting their ability to operate autonomously over long horizons. To overcome this limitation, we introduce DIPP, a diffusion-based potential planner that unifies navigation and mapping. DIPP generates two complementary potential fields: a navigation potential that directs the agent toward the target and a topological potential that captures the environment’s structural skeleton. The topological potential serves a dual purpose: it acts as an implicit structural prior for waypoint selection when fused directly with the navigation potential and, more importantly, enables the incremental construction of a persistent, explicit topological graph. This graph enables a hierarchical policy to select strategic, long-horizon waypoints, elevating planning from a tactical search to a strategic decision. We evaluate DIPP in the Habitat simulator on the Gibson dataset. Results show that DIPP achieves strong performance on standard ObjectNav metrics (SR, SPL) while constructing structurally accurate maps, evidenced by a high Node Recall score. Furthermore, leveraging the explicit persistent graph for hierarchical planning significantly boosts navigation performance. These findings demonstrate the effectiveness of DIPP in enabling embodied agents to build and exploit persistent spatial knowledge for long-term operation in unseen environments.
|
| |
| 09:00-10:30, Paper WeI1I.83 | Add to My Program |
| Wind-Aware Aerial Deployment and Control Strategy for Precision Landing of Single-Actuator Autorotating Wing |
|
| Win, Shane Kyi Hla | Singapore University of Technology & Design |
| Win, Luke Soe Thura | Singapore University of Technology & Design |
| Sufiyan, Danial | Singapore University of Technology & Design |
| Foong, Shaohui | Singapore University of Technology and Design |
Keywords: Aerial Systems: Applications, Aerial Systems: Mechanics and Control
Abstract: The Samara Autorotating Wing (SAW) is a bioinspired autorotating glider capable of both controlled autorotation and diving modes. This work presents control and deployment strategies that enable precision landing of the platform from low altitude. The proposed control approach leverages cyclic control and a dive maneuver to improve landing accuracy. The deployment strategy is developed by updating parameters in a simulated model to reflect real-world performance under varying wind conditions, and then using the model to predict feasible release regions for specified wind direction, speed, and altitude. A total of 56 deployments were conducted from 60 m altitude in both low-wind (< 5 m/s) and high-wind (> 5 m/s) conditions representative of the local climate. The platform achieved landings within 10 m of the target in 89% of low-wind trials and 57% of high-wind trials. These results highlight the potential of the SAW platform for applications requiring high-precision remote sensor deployment.
|
| |
| 09:00-10:30, Paper WeI1I.84 | Add to My Program |
| Throw Maneuver: Exact Trajectories for Invariant Target Hitting in Robotic Throwing |
|
| Cheng, Sheng | University of Illinois Urbana-Champaign |
| Han, Ziyin | University of Illinois at Urbana Champaign |
| Wang, Rong | University of Illinois Urbana-Champaign |
| Hovakimyan, Naira | University of Illinois Urbana-Champaign |
Keywords: Aerial Systems: Applications, Dynamics, Field Robots
Abstract: Robots can throw objects to distant targets using gravity, with applications ranging from material transport to firefighting. Existing approaches typically adopt a singleton throw formulation, where the carrier must reach a specific position–velocity configuration at the moment of throw. This reliance on a single throw point makes target-hitting highly sensitive to release delays. To address this limitation, we introduce the throw maneuver: a carrier trajectory that guarantees target hitting for objects released at any time along the trajectory. By differentiating the governing projectile equations, we derive the throw maneuver in its exact representation as ordinary differential equations, with analytical solutions available in special cases. Simulation results verify its invariant target-hit property and show that throw maneuvers achieve longer available throw times and ranges without missing the target compared with a strong baseline throw method. Outdoor quadrotor experiments further demonstrate the throw maneuver's improved accuracy and precision under realistic flight conditions compared with several baseline throw methods.
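The construction can be made concrete with textbook ballistics: for a drag-free point mass, the hit condition below must hold at every admissible release time, and differentiating it in time yields the maneuver ODE. This is our reconstruction from the abstract; the paper's exact equations may differ.

```latex
% Singleton hit condition for a drag-free point mass released at time t,
% with remaining flight time \tau(t) and target position p_{\mathrm{target}}:
\[
  p(t) + \dot p(t)\,\tau(t) + \tfrac{1}{2}\,\mathbf{g}\,\tau(t)^{2}
  = p_{\mathrm{target}} \qquad \forall\, t \in [0, T].
\]
% Differentiating in t gives the ODE the carrier must satisfy along a
% throw maneuver, coupling acceleration and the flight-time rate:
\[
  \dot p(t) + \ddot p(t)\,\tau(t)
  + \bigl(\dot p(t) + \mathbf{g}\,\tau(t)\bigr)\,\dot\tau(t) = 0 .
\]
```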
|
| |
| 09:00-10:30, Paper WeI1I.85 | Add to My Program |
| Monitoring Autonomous Persistent Surveillance Missions Using Invariance |
|
| Nenchev, Vladislav | University of the Bundeswehr Munich |
| Sotiriadis, Prodromos | Universität Der Bundeswehr München |
Keywords: Hybrid Logical/Dynamical Planning and Verification, Failure Detection and Recovery, Environment Monitoring and Management
Abstract: This paper studies runtime monitoring for persistent surveillance by autonomous robots when the autonomy stack is a black box. The environment is partitioned into finitely many parts, each carrying an uncertainty state that decreases when observed and increases otherwise. We model the closed loop as a state-dependent hybrid system with linear parameter varying dynamics and design a monitor based on an invariant computed offline. As this invariant is typically hard to obtain for large to-be-surveyed spaces, we propose a compositional monitor obtained by decentralized computation of low-dimensional invariant sets for each uncertainty region, and checking their conjunction online. Under common independence assumptions, the compositional monitor is sound and complete with respect to the full-system invariant. The approach is applied in a case study with a real robot persistently monitoring a labyrinth, emphasizing its applicability in practice.
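The compositional check is easy to picture: each region carries its own offline-computed invariant, and the online monitor is just their conjunction. Below is a minimal sketch using interval invariants, the simplest instance; the paper computes richer low-dimensional sets offline.

```python
import numpy as np

def make_region_monitor(u_max):
    """Per-region invariant check: the region's uncertainty must stay below
    u_max (an interval invariant; a stand-in for the offline-computed sets)."""
    return lambda u: u <= u_max

def compositional_monitor(region_monitors, uncertainties):
    """The runtime monitor is the conjunction of the per-region checks."""
    return all(m(u) for m, u in zip(region_monitors, uncertainties))

monitors = [make_region_monitor(u) for u in (1.0, 0.8, 1.2)]
print(compositional_monitor(monitors, np.array([0.3, 0.9, 0.5])))  # False: region 1 violated
```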
|
| |
| 09:00-10:30, Paper WeI1I.86 | Add to My Program |
| 3D-ADAM: A Dataset for 3D Anomaly Detection in Additive Manufacturing |
|
| McHard, Paul Matthew | University of Glasgow |
| Audonnet, Florent P. | University of Glasgow |
| Summerell, Oliver | University of Glasgow |
| Andraos, Sebastian | HAL Robotics Ltd |
| Henderson, Paul | University of Glasgow |
| Aragon-Camarasa, Gerardo | University of Glasgow |
Keywords: Computer Vision for Manufacturing, Additive Manufacturing, Data Sets for Robotic Vision
Abstract: Surface defects are a primary source of yield loss in manufacturing, yet existing anomaly detection methods often fail in real-world deployment due to limited and unrepresentative datasets. To overcome this, we introduce 3D-ADAM, a 3D Anomaly Detection in Additive Manufacturing dataset and the first large-scale, industry-relevant dataset for RGB+3D surface defect detection in additive manufacturing. 3D-ADAM comprises 14,120 high-resolution scans of 217 unique parts, captured with four industrial depth sensors, and includes 27,346 annotated defects across 12 categories along with 27,346 annotations of machine element features in 16 classes. 3D-ADAM is captured in a real industrial environment and as such reflects real production conditions, including variations in part placement, sensor positioning, lighting, and partial occlusion. Benchmarking state-of-the-art models demonstrates that 3D-ADAM presents substantial challenges beyond existing datasets. Validation through expert labelling surveys with industry partners further confirms its industrial relevance. By providing this benchmark, 3D-ADAM establishes a foundation for advancing robust 3D anomaly detection capable of meeting manufacturing demands. We provide our dataset for accessibility at: https://huggingface.co/datasets/pmchard/3D-ADAM
|
| |
| 09:00-10:30, Paper WeI1I.87 | Add to My Program |
| Toward Embodiment Equivariant Vision-Language-Action Policy |
|
| Chen, Anzhe | Zhejiang University |
| Yang, Yifei | Zhejiang University |
| Zhu, Zhenjie | Zhejiang University |
| Xu, Kechun | Zhejiang University |
| Zhou, Zhongxiang | Zhejiang University |
| Xiong, Rong | Zhejiang University |
| Wang, Yue | Zhejiang University |
Keywords: Service Robotics, Domestic Robotics, Imitation Learning
Abstract: Vision-language-action policies learn manipulation skills across tasks, environments and embodiments through large-scale pre-training. However, their ability to generalize to novel robot configurations remains limited. Most approaches emphasize model size, dataset scale and diversity while paying less attention to the design of action spaces. This leads to the configuration generalization problem, which requires costly adaptation. We address this challenge by formulating cross-embodiment pre-training as designing policies equivariant to embodiment configuration transformations. Building on this principle, we propose a framework that (i) establishes an embodiment equivariance theory for action space and policy design, (ii) introduces an action decoder that enforces configuration equivariance, and (iii) incorporates a geometry-aware network architecture to enhance embodiment-agnostic spatial reasoning. Extensive experiments in both simulation and real-world settings demonstrate that our approach improves pre-training effectiveness and enables efficient fine-tuning on novel robot embodiments. Our code is available at the anonymous repository: https://github.com/hhcaz/e2vla
|
| |
| 09:00-10:30, Paper WeI1I.88 | Add to My Program |
| EdgeGrasp: Enhancing Edge Perception for 7-DoF Grasping Pose Estimation in Cluttered Scenes |
|
| Qiu, Junning | Xi'an Jiaotong University |
| Wang, Fei | Xi'an Jiaotong University |
| Guo, Yu | School of Software Engineering, Xi’an Jiaotong University |
| Ling, Yonggen | Tencent |
| Lu, Minglei | Tencent |
Keywords: Deep Learning Methods, Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation
Abstract: Estimating 7-DoF grasping poses (6-DoF with gripper width) in cluttered scenes is a critical challenge for robotic manipulation. In such environments, object edges often contain many promising grasp candidates, but relying solely on an incomplete single-view point cloud to infer them is difficult. While neural networks excel at learning edge features from RGB images, simply combining these with point clouds often fails to generalize to novel scenes. To address these challenges, we propose EdgeGrasp, which enhances edge perception by letting each modality contribute the edge information source it is best suited for, improving grasping performance. The internal edge features are extracted through voxel-based sparse 3D convolution on the aggregated point cloud from the edge interior, ensuring a rich geometric representation while mitigating incompleteness at the edge. For external edges and junctions, a vision foundation model is employed to extract local zero-shot semantic features, capturing fine-grained details and improving cross-object generalization. Finally, edge spatial attention fuses these features into edge-enhanced features by encoding edge distance for estimating 7-DoF grasping poses. Experimental results demonstrate our method's effectiveness, achieving state-of-the-art performance on the Graspnet-1Billion benchmark. Real-world robotic experiments further validate its practical applicability.
|
| |
| 09:00-10:30, Paper WeI1I.89 | Add to My Program |
| Seeing the Bigger Picture: 3D Latent Mapping for Mobile Manipulation Policy Learning |
|
| Kim, Sunghwan | University of California, San Diego |
| Chung, Woojeh | University of California, San Diego |
| Dai, Zhirui | University of California, San Diego |
| Bhatt, Dwait | University of California, San Diego |
| Shukla, Arth | University of California, San Diego |
| Su, Hao | University of California, San Diego |
| Tian, Yulun | University of Michigan |
| Atanasov, Nikolay | University of California, San Diego |
Keywords: Perception for Grasping and Manipulation, Semantic Scene Understanding, Deep Learning in Grasping and Manipulation
Abstract: In this paper, we demonstrate that mobile manipulation policies utilizing a 3D latent map achieve stronger spatial and temporal reasoning than policies relying solely on images. We introduce Seeing the Bigger Picture (SBP), an end-to-end policy learning approach that operates directly on a 3D map of latent features. In SBP, the map extends perception beyond the robot's current field of view and aggregates observations over long horizons. Our mapping approach incrementally fuses multiview observations into a grid of scene-specific latent features. A pre-trained, scene-agnostic decoder reconstructs target embeddings from these features and enables online optimization of the map features during task execution. A policy, trainable with behavior cloning or reinforcement learning, treats the latent map as a state variable and uses global context from the map obtained via a 3D feature aggregator. We evaluate SBP on scene-level mobile manipulation and sequential tabletop manipulation tasks. Our experiments demonstrate that SBP (i) reasons globally over the scene, (ii) leverages the map as long-horizon memory, and (iii) outperforms image-based policies in both in-distribution and novel scenes, e.g., improving the success rate by 15% for the sequential manipulation task.
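A toy version of the incremental fusion step is sketched below: each grid cell keeps a running mean of the latent features back-projected into it. Feature extraction, the pre-trained decoder, and online map optimization are omitted, and all names are illustrative assumptions.

```python
import numpy as np

class LatentGridMap:
    """Incremental fusion of per-view latent features into a 3D grid.
    Each cell maintains a running mean of all features observed there."""

    def __init__(self, shape=(32, 32, 32), dim=64):
        self.feat = np.zeros(shape + (dim,), dtype=np.float32)
        self.count = np.zeros(shape, dtype=np.int32)

    def fuse(self, voxel_idx, features):
        """voxel_idx: (N, 3) integer cell indices hit by the current view;
        features: (N, dim) latent features back-projected to those cells."""
        for (i, j, k), f in zip(voxel_idx, features):
            c = self.count[i, j, k]
            self.feat[i, j, k] = (self.feat[i, j, k] * c + f) / (c + 1)
            self.count[i, j, k] = c + 1

m = LatentGridMap()
idx = np.array([[1, 2, 3], [1, 2, 3]])
m.fuse(idx, np.ones((2, 64), dtype=np.float32))
print(m.count[1, 2, 3], m.feat[1, 2, 3, 0])  # 2 1.0
```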
|
| |
| 09:00-10:30, Paper WeI1I.90 | Add to My Program |
| TEGA: A Tactile-Enhanced Grasping Assistant for Assistive Robotics Via Sensor Fusion and Closed-Loop Haptic Feedback |
|
| You, Hengxu | University of Florida |
| Zhou, Tianyu | University of Florida |
| Xu, Fang | University of Florida |
| Smith, Kaleb | NVIDIA Corporation |
| Du, Jing | University of Florida |
Keywords: Haptics and Haptic Interfaces, Telerobotics and Teleoperation
Abstract: Recent advances in teleoperation have enabled sophisticated manipulation of dexterous robotic hands, with most systems concentrating on guiding finger positions to achieve desired grasp configurations. However, while accurate finger positioning is essential, it often overlooks the equally critical task of grasp force modulation—vital for handling objects of diverse hardness, texture, and shape. This limitation poses a significant challenge for users, especially individuals with upper-limb disabilities who lack natural tactile feedback and rely on indirect cues to infer appropriate force levels. To address this gap, we propose a novel teleoperation framework that integrates EMG-based force control, AR-based pose tracking, and visuo-tactile sensing to enable precise and intuitive force adjustment. A wearable haptic vest delivers real-time tactile feedback, allowing users to dynamically refine grasp force during manipulation. User studies confirm that our dual-loop control system substantially improves grasp stability and task success, underscoring its potential for assistive robotic applications.
|
| |
| 09:00-10:30, Paper WeI1I.91 | Add to My Program |
| D-GVIO: A Buffer-Driven and Efficient Decentralized GNSS-Visual-Inertial State Estimator for Multi-Agent Systems |
|
| Luo, Yarong | Wuhan University |
| Lu, Wentao | Wuhan University |
| Guo, Chi | Wuhan University |
| Li, Ming | Wuhan University |
Keywords: Localization, Sensor Fusion, Multi-Robot Systems
Abstract: Cooperative localization is essential for swarm applications like collaborative exploration and search-and-rescue missions. However, maintaining real-time capability, robustness, and computational efficiency on resource-constrained platforms presents significant challenges. To address these challenges, we propose D-GVIO, a buffer-driven and fully decentralized GNSS-Visual-Inertial Odometry (GVIO) framework that leverages a novel buffering strategy to support efficient and robust distributed state estimation. The proposed framework is characterized by four core mechanisms. Firstly, through covariance segmentation, covariance intersection and buffering strategy, we modularize propagation and update steps in distributed state estimation, significantly reducing computational and communication burdens. Secondly, the left-invariant extended Kalman filter (L-IEKF) is adopted for information fusion, which exhibits superior state estimation performance over the traditional extended Kalman filter (EKF) since its state transition matrix is independent of the system state. Thirdly, a buffer-based re-propagation strategy is employed to handle delayed measurements efficiently and accurately by leveraging the L-IEKF, eliminating the need for costly re-computation. Finally, an adaptive buffer-driven outlier detection method is proposed to dynamically cull GNSS outliers, enhancing robustness in GNSS-challenged environments.
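The decentralized update builds on covariance intersection, whose standard form is sketched below: fuse two estimates with unknown cross-correlation via a convex combination of information matrices, choosing the weight to minimize the fused trace. This is generic CI, not D-GVIO's segmented variant.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def covariance_intersection(x1, P1, x2, P2):
    """Covariance intersection of two estimates with unknown cross-correlation:

        P^{-1} = w P1^{-1} + (1 - w) P2^{-1},  w chosen to minimize trace(P).
    """
    I1, I2 = np.linalg.inv(P1), np.linalg.inv(P2)

    def fused(w):
        P = np.linalg.inv(w * I1 + (1 - w) * I2)
        return P, P @ (w * I1 @ x1 + (1 - w) * I2 @ x2)

    w = minimize_scalar(lambda w: np.trace(fused(w)[0]),
                        bounds=(1e-6, 1 - 1e-6), method="bounded").x
    return fused(w)

P, x = covariance_intersection(np.array([0.0, 0.0]), np.eye(2),
                               np.array([1.0, 1.0]), 4 * np.eye(2))
print("fused estimate:", x, "trace:", np.trace(P))
```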
|
| |
| 09:00-10:30, Paper WeI1I.92 | Add to My Program |
| Point2Act: Efficient 3D Distillation of Multimodal LLMs for Zero-Shot Context-Aware Grasping |
|
| Kim, Sang Min | Seoul National University |
| Heo, Hyeongjun | Seoul National University |
| Kim, Junho | Seoul National University |
| Lee, Yonghyeon | MIT |
| Kim, Young Min | Seoul National University |
Keywords: Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation, Deep Learning for Visual Perception
Abstract: We propose Point2Act, which directly retrieves the 3D action point relevant to a contextually described task, leveraging Multimodal Large Language Models (MLLMs). Foundation models have opened the possibility for generalist robots that can perform a zero-shot task following natural language descriptions within an unseen environment. While the semantics from large-scale image and language datasets provide contextual understanding in 2D images, existing methods that leverage foundation models for 3D reconstruction struggle to accurately interpret complex compositional queries and require extensive computation. Our proposed 3D relevancy fields bypass the high-dimensional features, instead efficiently imbuing lightweight 2D point-level guidance tailored to the task-specific action. The multi-view aggregation effectively compensates for misalignments caused by geometric ambiguities, such as occlusion, or semantic uncertainties inherent in the language descriptions. The output region is highly localized, leveraging fine-grained 3D spatial context to directly identify an explicit position for a physical action in the on-the-fly reconstruction of the scene. Our full-stack pipeline, which includes capturing, MLLM querying, 3D reconstruction, and grasp pose extraction, generates spatially grounded responses in 16.5 seconds, facilitating practical manipulation tasks.
|
| |
| 09:00-10:30, Paper WeI1I.93 | Add to My Program |
| OTAS: Open-Vocabulary Token Alignment for Outdoor Segmentation |
|
| Schwaiger, Simon | FH Technikum Wien |
| Thalhammer, Stefan | TU Wien |
| Wöber, Wilfried | UAS Technikum Wien, University of Natural Resources and Life Sciences Vienna |
| Steinbauer-Wagner, Gerald | Graz University of Technology |
|
|
| |
| 09:00-10:30, Paper WeI1I.94 | Add to My Program |
| Multi-View Control for Robust 3D Gaussian Splatting |
|
| Mao, YuNong | Inner Mongolia University |
| Zhang, Zhibin | Inner Mongolia University |
| Shi, Yufu | Inner Mongolia University |
Keywords: Computational Geometry, Visual Learning
Abstract: 3D Gaussian Splatting (3DGS) has recently demonstrated impressive capabilities in real-time novel view synthesis. However, the performance of 3DGS tends to degrade significantly when the quality of the initial point cloud is poor. Although subsequent research has successfully addressed the initialization issue by using suboptimal point clouds to train the 3D Gaussian model, certain challenges still remain in practical applications. Specifically, these methods lack an effective pruning strategy to thoroughly eliminate suboptimal points (defined as erroneous points in this paper). The excessive accumulation of these erroneous points leads to overfitting in specific viewpoints, thereby affecting the visual appearance and geometric accuracy in novel view synthesis. To address these challenges, we propose a novel 3DGS optimization method named MVC-GS, which introduces two key innovative contributions. First, based on multi-view geometric constraints, we use image rendering errors as a guiding criterion for optimization. By performing point calibration in the target region, we effectively mitigate the impact of erroneous Gaussian points. Subsequently, we introduce a multi-view Gaussian attribute optimization method that further enhances the precision of the 3D Gaussian attribute representation, while avoiding overfitting to the training views. We conducted comprehensive visualization analysis across multiple scenes in various datasets. Extensive experiments on public datasets show that the proposed method achieves state-of-the-art performance across diverse scenes.
|
| |
| 09:00-10:30, Paper WeI1I.95 | Add to My Program |
| Efficient UAV Exploration with Hybrid Global–Local Strategy and Adaptive Yaw Planning |
|
| Xue, Yangyang | Xidian University |
| Liu, Xiaotao | Xidian University |
| Zhou, Shaojian | Xidian University |
| Ruan, Jingtai | Xidian University |
| Huang, Ting | Xidian University |
Keywords: Aerial Systems: Applications, Aerial Systems: Perception and Autonomy, Motion and Path Planning
Abstract: Autonomous exploration in complex environments is frequently hindered by inefficient back-and-forth movements and repetitive revisits to previously explored areas. To address these drawbacks, we propose a two-mode hybrid dynamic exploration strategy that detects isolated frontier clusters and adaptively switches between two modes: global exploration mode (GEM) and local clearance mode (LCM). The GEM generates sequences for frontier exploration access, while the LCM employs a flight-time greedy approach to select and clear isolated clusters, thereby avoiding redundant visits. In addition, to achieve adaptive yaw planning, the proposed exploration strategy generates a reference yaw sequence based on the frontiers near the path trajectory. The reference yaw sequence is then used to perform yaw optimization, with non-uniform B-spline time adjustments ensuring feasible yaw trajectories, fully leveraging the UAV's maneuverability and perception capabilities, and providing a plug-and-play solution for exploration research. Extensive simulations compared to state-of-the-art methods demonstrate that our approach significantly reduces both exploration time and distance, with real-world experiments confirming its practical effectiveness.
|
| |
| 09:00-10:30, Paper WeI1I.96 | Add to My Program |
| MotionTrans: Human VR Data Enable Motion-Level Learning for Robotic Manipulation Policies |
|
| Yuan, Chengbo | Tsinghua University |
| Zhou, Rui | Wuhan University |
| Liu, Mengzhen | Peking University |
| Hu, Yingdong | Tsinghua University |
| Wang, Shengjie | Tsinghua University |
| Yi, Li | Tsinghua University |
| Wen, Chuan | Shanghai Jiao Tong University |
| Zhang, Shanghang | Peking University |
| Gao, Yang | Tsinghua University |
Keywords: Transfer Learning, Learning from Demonstration, Data Sets for Robot Learning
Abstract: Scaling real robot data is a key bottleneck in imitation learning, leading to the use of auxiliary data for policy training. While other aspects of robotic manipulation such as image or language understanding may be learned from internet-based datasets, acquiring motion knowledge remains challenging. Human data, with its rich diversity of manipulation behaviors, offers a valuable resource for this purpose. While previous works show that using human data can bring benefits, such as improving robustness and training efficiency, it remains unclear whether it can realize its greatest advantage: enabling robot policies to directly learn new motions for task completion. In this paper, we systematically explore this potential through multi-task human-robot cotraining. We introduce MotionTrans, a framework that includes a data collection system, a human data transformation pipeline, and a weighted cotraining strategy. By cotraining 30 human-robot tasks simultaneously, we directly transfer motions of 13 tasks from human data to deployable end-to-end robot policies. Notably, 9 tasks achieve non-trivial success rates in a zero-shot manner. MotionTrans also significantly enhances pretraining-finetuning performance (+40% success rate). These findings unlock the potential of motion-level learning from human data, offering insights into its effective use for training robotic manipulation policies. All data, code, and model weights will be open-sourced.
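The weighted cotraining strategy can be pictured as a batch sampler that mixes the two data sources at a fixed ratio, as in the sketch below; the ratio and any schedule are placeholder assumptions rather than the paper's values.

```python
import numpy as np

def cotraining_batch(human_data, robot_data, batch_size=32, robot_weight=0.7, rng=None):
    """Weighted human-robot cotraining sampler (a sketch of the strategy).

    Each batch mixes the two sources at a fixed ratio, so robot data anchors
    the action distribution while human data supplies motion diversity."""
    rng = rng or np.random.default_rng(0)
    n_robot = int(round(batch_size * robot_weight))
    batch = [robot_data[i] for i in rng.integers(len(robot_data), size=n_robot)]
    batch += [human_data[i] for i in rng.integers(len(human_data), size=batch_size - n_robot)]
    rng.shuffle(batch)
    return batch

robot = [("robot_obs", "robot_act")] * 100
human = [("human_obs", "human_act")] * 1000
b = cotraining_batch(human, robot)
print(sum(1 for o, _ in b if o == "robot_obs"), "robot samples of", len(b))
```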
|
| |
| 09:00-10:30, Paper WeI1I.97 | Add to My Program |
| Identification and Reduction of Fat Shadows in Obese Patients Using Robotic Echocardiography |
|
| Hashimoto, Tomoshige | Waseda University |
| Tsukamoto, Soma | Waseda University |
| Shida, Yuuki | Waseda University |
| Iwata, Hiroyasu | Waseda University |
Keywords: Medical Robots and Systems, Object Detection, Segmentation and Categorization, Visual Servoing
Abstract: The present study proposes a method to identify fat and lung shadows and reduce fat shadows in obese patients to obtain clear ultrasound (US) images. The shadow identification method focuses on the difference between the luminance value of the shadow and the direction of image loss caused by the degree of reflection and absorption of US waves in fat and lungs, thereby registering the degree of image blurring caused by each shadow. The method for reducing fat shadows was based on preliminary experiments to derive the appropriate probe pressing pressure for the patient’s BMI, enabling the acquisition of US images with fewer fat shadows while relieving the patient from feeling pressure. Verification tests were conducted using the proposed method. With regard to the shadow identification method, it was possible to detect fat shadows and lung shadows with F-measures of 92.7% and 96.8%, respectively. The introduction of the shadow reduction method increased the number and range of mitral valve detections. These results underline the usefulness of the proposed method.
|
| |
| 09:00-10:30, Paper WeI1I.98 | Add to My Program |
| Neuro-Robot Interaction in Robot-Assisted Surgery Using EEG and Self-Supervised Graph Transformer |
|
| Das Chakladar, Debashis | Lulea University of Technology |
| Simistira Liwicki, Foteini | Lulea University of Technology |
| Saini, Rajkumar | Lulea University of Technology |
Keywords: Neurorobotics, AI-Enabled Robotics, Brain-Machine Interfaces
Abstract: Robot-Assisted Surgery (RAS) represents a major frontier in the robotics community, blending precision automation with human skill in high-stakes clinical environments. Evaluating surgeon performance in RAS is critical for training and certification, yet current methods rely heavily on video analysis or subjective manual scoring. This study presents a neuro-robotic interaction framework that uses Electroencephalography (EEG)-derived brain connectivity features to classify surgeons’ skill levels during RAS tasks. The high dimensionality of EEG data imposes substantial computational cost. Therefore, we first apply Harris Hawks Optimization (HHO) to select an optimal EEG-channel subset, reducing computational cost. Then, functional connectivity feature metrics are extracted from the reduced EEG channel set and used to construct brain graphs, which serve as input to a Self-Supervised Graph Transformer (SSGT). The SSGT model is pre-trained via masked edge reconstruction to capture structural dependencies and finetuned for downstream skill-level classification. The proposed SSGT model achieves a classification accuracy of 96.60%, significantly outperforming both traditional machine learning and deep learning baselines. The label-efficient, structurally aware design of SSGT enables scalable and real-time assessment of surgical proficiency. This framework provides a foundation for intelligent robotic tutoring systems and generalizes to broader cognitive monitoring tasks in high-stakes human-robot interaction domains using EEG.
|
| |
| 09:00-10:30, Paper WeI1I.99 | Add to My Program |
| Graph-Based Multi-Agent Reinforcement Learning for Scalable UAV Formation Control and Target Tracking |
|
| Wang, Haowen | Peking University |
| Zhang, Shuting | Peking University |
| Li, Guangchen | Peking University |
Keywords: Swarm Robotics, Reinforcement Learning, Distributed Robot Systems
Abstract: This paper presents a graph-based multi-agent reinforcement learning framework for scalable UAV formation control and target tracking. The framework introduces a conflict-aware graph representation that aggregates neighborhood information through attention-based message passing, enabling each UAV to reason about both local interactions and global formation geometry. To generate agile and stable maneuvers, a hierarchical policy is designed that first selects motion primitives from a structured library and then refines them with continuous trajectory adjustments, ensuring smooth and dynamically feasible flight in cluttered environments. Extensive simulations and real-world experiments validate the proposed approach, demonstrating accurate target tracking, stable formation maintenance, and robust adaptation across varying swarm sizes and obstacle densities. In particular, policies trained on smaller swarms generalize effectively to larger ones without retraining, highlighting its scalability and practicality. The demonstration video is available on the project website: https://swift520.github.io/Formation-Tracking/.
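The message-passing step named above reduces, in its simplest form, to single-head attention over neighbor features, sketched below; the paper's conflict-aware graph representation and hierarchical policy sit on top of such a primitive, and all names here are illustrative.

```python
import numpy as np

def attention_aggregate(h_self, h_neighbors, Wq, Wk, Wv):
    """Single-head attention over neighbor features (bare-bones sketch).

    h_self: (D,), h_neighbors: (N, D); W*: (D, D) learned projections."""
    q = Wq @ h_self                    # query from the ego UAV
    k = h_neighbors @ Wk.T             # keys from neighbors, (N, D)
    v = h_neighbors @ Wv.T             # values from neighbors, (N, D)
    logits = k @ q / np.sqrt(len(q))
    w = np.exp(logits - logits.max())
    w /= w.sum()                       # softmax attention weights
    return w @ v                       # aggregated neighborhood message

D, N = 8, 5
rng = np.random.default_rng(0)
msg = attention_aggregate(rng.normal(size=D), rng.normal(size=(N, D)),
                          *(rng.normal(size=(D, D)) for _ in range(3)))
print(msg.shape)  # (8,)
```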
|
| |
| 09:00-10:30, Paper WeI1I.100 | Add to My Program |
| Learning Robust Control Policies for Inverted Pose on Miniature Blimp Robots |
|
| Yang, Yuanlin | The Hong Kong University of Science and Technology |
| Hong, Lin | Harbin Institute of Technology |
| Zhang, Fumin | Hong Kong University of Science and Technology |
Keywords: Reinforcement Learning, Machine Learning for Robot Control, Aerial Systems: Mechanics and Control
Abstract: The ability to achieve and maintain inverted poses is essential for unlocking the full agility of miniature blimp robots (MBRs). However, developing reliable inverted control strategies for MBRs remains challenging due to their complex and underactuated dynamics. To address this challenge, we propose a novel framework that enables robust control policy learning for inverted pose on MBRs. The proposed framework consists of three core stages. First, a high-fidelity three-dimensional (3D) simulation environment is constructed and calibrated using real-world MBR motion data. Second, a robust inverted control policy is trained in simulation using a modified Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm combined with a domain randomization strategy. Third, a mapping layer is designed to bridge the sim-to-real gap and facilitate real-world deployment of the learned policy. Comprehensive evaluations in the simulation environment demonstrate that the learned policy achieves a higher success rate compared to the energy-shaping controller. Furthermore, experimental results confirm that the learned policy with a mapping layer enables an MBR to achieve and maintain a fully inverted pose in real-world settings.
|
| |
| 09:00-10:30, Paper WeI1I.101 | Add to My Program |
| Distributed Virtual Model Control for Scalable Human-Robot Collaboration in Shared Workspace |
|
| Zhang, Yi | University of Cambridge |
| Faris, Omar | University of Cambridge |
| Sirithunge, Chapa | University of Cambridge |
| Chu, Kai-Fung | University of Cambridge |
| Iida, Fumiya | University of Cambridge |
| Forni, Fulvio | University of Cambridge |
Keywords: Human-Robot Collaboration, Human-Aware Motion Planning, Motion Control
Abstract: We present a decentralized, agent-agnostic, and safety-aware control framework for human–robot collaboration based on Virtual Model Control (VMC). In our approach, both humans and robots are embedded in the same workspace shaped by virtual components, where motion results from interaction with virtual springs and dampers rather than explicit trajectory planning. A decentralized, force-based stall detector identifies deadlocks, which are resolved through negotiation. This reduces the probability of robots getting stuck in the block placement task from up to 61.2% to zero in our experiments. The framework scales without structural changes thanks to the distributed implementation: in experiments we demonstrate safe collaboration with up to two robots and two humans, and in simulation up to four robots, maintaining inter-agent separation at around 20 cm. Results show that the method shapes robot behavior intuitively by adjusting control parameters and achieves deadlock-free operation across team sizes in all tested scenarios.
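The VMC building block is a virtual spring-damper, and motion emerges from summing the forces of all attached components, as in the sketch below (gains and anchor points are arbitrary placeholders, not the authors' parameters).

```python
import numpy as np

def virtual_component_force(x, v, x_anchor, k=50.0, c=10.0):
    """Force from one virtual spring-damper attaching an agent at position x
    with velocity v to an anchor point in the shared workspace:

        F = -k (x - x_anchor) - c v
    """
    return -k * (np.asarray(x) - np.asarray(x_anchor)) - c * np.asarray(v)

# Motion emerges from summing the forces of all attached virtual components:
F = (virtual_component_force([0.2, 0.0], [0.0, 0.1], [0.5, 0.0])             # goal spring
     + virtual_component_force([0.2, 0.0], [0.0, 0.1], [0.2, 0.3], k=20.0))  # formation spring
print("commanded force:", F)
```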
|
| |
| 09:00-10:30, Paper WeI1I.102 | Add to My Program |
| Minimal Intervention Shared Control with Guaranteed Safety under Non-Convex Constraints |
|
| Chaubey, Shivam | Aalto University |
| Verdoja, Francesco | Aalto University |
| Deka, Shankar | Aalto University |
| Kyrki, Ville | Aalto University |
Keywords: Safety in HRI, Human-Robot Collaboration, Human-Aware Motion Planning
Abstract: Shared control combines human intention with autonomous decision-making. At the low level, the primary goal is to maintain safety regardless of the user’s input to the system. However, existing shared control methods—based on, e.g., Model Predictive Control, Control Barrier Functions, or learning-based control—often face challenges with feasibility, scalability, and mixed constraints. To address these challenges, we propose a Constraint-Aware Assistive Controller that computes control actions online while ensuring recursive feasibility, strict constraint satisfaction, and minimal deviation from the user’s intent. It also accommodates a structured class of non-convex constraints common in real-world settings. We leverage Robust Controlled Invariant Sets for recursive feasibility and a Mixed-Integer Quadratic Programming formulation to handle non-convex constraints. We validate the approach through a large-scale user study with 66 participants—one of the most extensive in shared control research—using a simulated environment to assess task load, trust, and perceived control, in addition to performance. The results show consistent improvements across all these aspects without compromising safety and user intent. Additionally, a real-world experiment on a robotic manipulator demonstrates the framework’s applicability under bounded disturbances, ensuring safety and collision-free operation.
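
The minimal-intervention principle has a closed form in the simplest convex case. The sketch below projects a user command onto a single affine safety constraint; it is only a stand-in for the paper's far richer MIQP formulation with robust controlled invariant sets and non-convex constraints, and all numbers are invented.

    import numpy as np

    def minimal_intervention(u_user, a, b):
        """Project the user's command onto the half-space {u : a.u <= b}.

        Closed-form solution of  min ||u - u_user||^2  s.t.  a.u <= b:
        the command is modified only when it would violate the
        constraint, which is the minimal-deviation idea in miniature.
        """
        viol = a @ u_user - b
        if viol <= 0.0:
            return u_user                      # safe: pass through unchanged
        return u_user - viol * a / (a @ a)     # minimal correction

    u = np.array([1.0, 0.5])           # joystick command (hypothetical)
    a, b = np.array([1.0, 0.0]), 0.6   # keep forward speed below 0.6
    print(minimal_intervention(u, a, b))   # -> [0.6, 0.5]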
|
| |
| 09:00-10:30, Paper WeI1I.103 | Add to My Program |
| KAN We Flow? Advancing Robotic Manipulation with 3D Flow Matching Via KAN & RWKV |
|
| Chen, Zhihao | Beijing University of Posts and Telecommunications |
| Ge, Yiyuan | South China University of Technology |
| Wang, Ziyang | Aston University |
Keywords: AI-Enabled Robotics, Learning from Experience, Embodied Cognitive Science
Abstract: Diffusion-based visuomotor policies excel at modeling action distributions but are inference-inefficient, since recursively denoising from noise to policy requires many steps and heavy UNet backbones, which hinders deployment on resource-constrained robots. Flow matching alleviates the sampling burden by learning a one-step vector field, yet prior implementations still inherit large UNet-style architectures. In this work, we present KAN-We-Flow, a flow-matching policy that draws on recent advances in Receptance Weighted Key Value (RWKV) and Kolmogorov-Arnold Networks (KAN) from vision to build a lightweight and highly expressive backbone for 3D manipulation. Concretely, we introduce an RWKV-KAN block: an RWKV first performs efficient sequence/spatial mixing to propagate task context, and a subsequent GroupKAN layer applies learnable spline-based, groupwise functional mappings to perform feature-wise nonlinear calibration of the action mapping on RWKV outputs. Moreover, we introduce an Action Consistency Regularization (ACR), a lightweight auxiliary loss that enforces alignment between predicted action trajectories and expert demonstrations via Euler extrapolation, providing additional supervision to stabilize training and improve policy precision. Without resorting to large UNets, our design reduces parameters by 86.8%, maintains fast runtime, and achieves state-of-the-art success rates on Adroit, Meta-World, and DexArt benchmarks.
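
A toy sketch of the two objectives the abstract names: the conditional flow-matching regression onto the straight-line velocity target, and an Euler-extrapolation consistency penalty in the spirit of ACR. The linear "vector field" is a stand-in for the RWKV-KAN backbone; nothing here reflects the paper's actual architecture.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(2, 3))   # toy linear vector field

    def v_theta(x, t):
        """Stand-in for the learned field v(x, t); the paper uses an
        RWKV-KAN network here."""
        return W @ np.concatenate([x, [t]])

    def flow_matching_loss(x0, x1):
        """Conditional flow matching: regress v(x_t, t) onto x1 - x0
        along the straight path x_t = (1 - t) x0 + t x1."""
        t = rng.uniform()
        xt = (1 - t) * x0 + t * x1
        return np.sum((v_theta(xt, t) - (x1 - x0)) ** 2)

    def euler_consistency(x0, x1, n_steps=4):
        """ACR-style check: integrate the field with Euler steps from
        x0 and penalize the distance to the expert endpoint x1."""
        x, dt = x0.copy(), 1.0 / n_steps
        for k in range(n_steps):
            x = x + dt * v_theta(x, k * dt)
        return np.sum((x - x1) ** 2)

    x0, x1 = rng.normal(size=2), rng.normal(size=2)  # noise -> action
    print(flow_matching_loss(x0, x1), euler_consistency(x0, x1))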
|
| |
| 09:00-10:30, Paper WeI1I.104 | Add to My Program |
| SHOPPER: Practical Insights on Grasp Strategies for Mobile Manipulation in the Wild |
|
| Huang, Isabella | Toyota Research Institute |
| Cheng, Richard | California Institute of Technology |
| Kim, Sangwoon | Massachusetts Institute of Technology |
| Kruse, Daniel | Rensselaer Polytechnic Institute |
| Chen, Carolyn | Toyota Research Institute |
| Kaul, Lukas | Toyota Research Institute |
| Hancock, JC Anne | Toyota Research Institute |
| Harikumar, Shanmuga Perumal | Toyota Research Institute |
| Tjersland, Mark | Toyota Research Institute |
| Borders, James | Toyota Research Institute |
| Helmick, Daniel | Toyota Research Institute |
Keywords: Mobile Manipulation, Grasping
Abstract: Mobile manipulation systems have advanced significantly in recent years. However, substantial gaps remain that prevent state-of-the-art platforms from achieving widespread real-world deployment, particularly in reliably grasping items in unstructured environments. To help bridge this gap, we develop SHOPPER, a mobile manipulation robot platform designed to push the boundaries of reliable and generalizable grasp strategies. We develop these grasp strategies and deploy them in a real-world grocery store---an exceptionally challenging setting chosen for its vast diversity of manipulable items, fixtures, and layouts. In this work, we present our detailed approach to designing general grasp strategies towards picking any item in a real grocery store. Additionally, we provide an in-depth analysis of our latest real-world field test, discussing key findings related to fundamental failure modes over hundreds of distinct pick attempts. Through our detailed analysis, we aim to offer valuable practical insights and identify key grasping challenges, which can guide the robotics community towards pressing open problems in the field. Lastly, we provide a dataset of 1200+ grasp attempts in unseen grocery stores.
|
| |
| 09:00-10:30, Paper WeI1I.105 | Add to My Program |
| EIT–Pneumatic Hybrid Robotic Skin for Practical and Accurate Force Map Reconstruction |
|
| Cho, Junhwi | KAIST |
| Bae, Sunggyu | DGIST |
| Ma, Junghyeon | DGIST |
| Lee, Hyosang | Eindhoven University of Technology |
| Kim, Jung | KAIST |
| Park, Kyungseo | DGIST |
Keywords: Force and Tactile Sensing, Physical Human-Robot Interaction, Touch in HRI
Abstract: We present a hybrid robotic skin that combines electrical impedance tomography (EIT) with pneumatic tactile sensing to improve force reconstruction capability. The developed robotic skin is fabricated entirely by 3D printing and spray coating, making it affordable and easy to build. A Tikhonov-regularized inverse reconstruction, paired with per-pad pneumatic calibration, enables accurate large-area tactile sensing with a simple measurement scheme. For validation, we conducted load-cell indentation experiments; the results showed consistent force reconstruction across locations within a pad. Compared with an EIT-only baseline, sensitivity non-uniformity was also reduced, with the coefficient of variation decreasing from 0.31 to 0.14, indicating that the proposed approach addresses a longstanding limitation of EIT. We further demonstrated chest-mounted integration on a humanoid robot and found that the pneumatic signals remained reliable across diverse contact scenarios, including multiple simultaneous contacts on the same sensing pad. These results indicate a practical path toward accurate, scalable whole-body tactile sensing in real robotic systems.
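
The Tikhonov-regularized inverse step mentioned above has a standard closed form, s = (JᵀJ + λ²I)⁻¹Jᵀδv. The sketch below recovers a conductivity-change vector from simulated boundary-voltage changes; the sensitivity matrix, dimensions, and noise level are synthetic.

    import numpy as np

    def tikhonov_reconstruct(J, dv, lam=1e-2):
        """Solve min ||J s - dv||^2 + lam^2 ||s||^2 for the conductivity
        change s, given the EIT sensitivity matrix J and the measured
        boundary-voltage changes dv."""
        n = J.shape[1]
        return np.linalg.solve(J.T @ J + lam**2 * np.eye(n), J.T @ dv)

    rng = np.random.default_rng(1)
    J = rng.normal(size=(32, 64))          # hypothetical sensitivity matrix
    s_true = np.zeros(64); s_true[10] = 1.0
    dv = J @ s_true + 0.01 * rng.normal(size=32)
    s_hat = tikhonov_reconstruct(J, dv)
    print(int(np.argmax(np.abs(s_hat))))   # recovers the touched element (10)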
|
| |
| 09:00-10:30, Paper WeI1I.106 | Add to My Program |
| Human-Interpretable Uncertainty Explanations for Point Cloud Registration |
|
| Gaus, Johannes Albert | University of Tuebingen |
| Schneider, Loris | Karlsruhe Institute of Technology |
| Shi, Yitian | Karlsruhe Institute of Technology |
| Lee, Jongseok | German Aerospace Center |
| Rayyes, Rania | Karlsruhe Institute for Technology (KIT) |
| Triebel, Rudolph | German Aerospace Center (DLR) |
Keywords: RGB-D Perception, Probabilistic Inference, Probability and Statistical Methods
Abstract: In this paper, we address the point cloud registration problem, where well-known methods like ICP fail under uncertainty arising from sensor noise, pose-estimation errors, and partial overlap due to occlusion. We develop a novel approach, Gaussian Process Concept Attribution (GP-CA), which not only quantifies registration uncertainty but also explains it by attributing uncertainty to well-known sources of errors in registration problems. Our approach leverages active learning to discover new uncertainty sources in the wild by querying informative instances. We validate GP-CA on three publicly available datasets and in our real-world robot experiment. Extensive ablations substantiate our design choices. Our approach outperforms other state-of-the-art methods in runtime, sample efficiency (through active learning), and accuracy. Our real-world experiment clearly demonstrates its applicability. Our video also demonstrates that GP-CA enables effective failure-recovery behaviors, yielding more robust robotic perception.
|
| |
| 09:00-10:30, Paper WeI1I.107 | Add to My Program |
| Detection of Jamming and Low Harvesting Height in Automated Cabbage Harvesting |
|
| Asano, Masaki | University of Tokyo |
| Fukao, Takanori | University of Tokyo |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation, Field Robots
Abstract: Agricultural labor shortages have increased the demand for automation in farming. In cabbage harvesting, automated harvesters rely on a side-mounted camera for detection to control harvesting height, but occlusion from outer leaves can cause errors and lead to failures. This paper presents a robust detection and control framework that integrates YOLO-based cabbage detection, trajectory tracking, LSTM-based motion classification, and LiDAR point cloud analysis. The system functions as a fail-safe while also providing redundancy, enabling recovery when side-mounted camera detection fails, and addresses two critical failure modes: cabbage jamming during extraction and low harvesting height. Temporal motion features are classified by an LSTM, while LiDAR-based trajectory analysis of the cabbage head point cloud centroid identifies low harvesting height. When both jamming and low harvesting height are detected, the system issues a raising command to the harvester. Experiments on real-world data demonstrated 95.3% accuracy in jamming detection and 95% in low harvesting height detection. Field experiments confirmed real-time operation at 10 Hz and effective prevention of severe blockages, achieving an overall control accuracy of 97.0%. These results demonstrate the feasibility of the proposed method for robust automated cabbage harvesting.
|
| |
| 09:00-10:30, Paper WeI1I.108 | Add to My Program |
| A Contrastive Few-Shot RGB-D Traversability Segmentation Framework for Indoor Robotic Navigation |
|
| An, Qiyuan | University of Texas at Arlington |
| Dang, Tuan | University of Arkansas |
| Makedon, Fillia | University of Texas at Arlington |
Keywords: Object Detection, Segmentation and Categorization, Big Data in Robotics and Automation, RGB-D Perception
Abstract: Indoor traversability segmentation aims to identify safe, navigable free space for autonomous agents, which is critical for robotic navigation. Pure vision-based models often fail to detect thin obstacles, such as chair legs, which can pose serious safety risks. We propose a multi-modal segmentation framework that leverages RGB images and sparse 1D laser depth information to capture geometric interactions and improve the detection of challenging obstacles. To reduce the reliance on large labeled datasets, we adopt the few-shot segmentation (FSS) paradigm, enabling the model to generalize from limited annotated examples. Traditional FSS methods focus solely on positive prototypes, often leading to overfitting to the support set and poor generalization. To address this, we introduce a negative contrastive learning (NCL) branch that leverages negative prototypes (obstacles) to refine free-space predictions. Additionally, we design a two-stage attention depth module to align 1D depth vectors with RGB images both horizontally and vertically. Extensive experiments on our custom-collected indoor RGB-D traversability dataset demonstrate that our method outperforms state-of-the-art FSS and RGB-D segmentation baselines, achieving up to 9% higher mIoU under both 1-shot and 5-shot settings. These results highlight the effectiveness of leveraging negative prototypes and sparse depth for robust and efficient traversability segmentation.
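
A small sketch of how negative (obstacle) prototypes can refine a free-space score, InfoNCE-style: a pixel feature must be close to the positive prototype and far from every negative one. The feature dimension, temperature, and prototypes are arbitrary stand-ins for the paper's learned quantities.

    import numpy as np

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    def ncl_score(feat, pos_proto, neg_protos, tau=0.1):
        """Contrastive free-space score: high when the feature matches
        the positive (free-space) prototype and mismatches all negative
        (obstacle) prototypes."""
        pos = np.exp(cosine(feat, pos_proto) / tau)
        neg = sum(np.exp(cosine(feat, p) / tau) for p in neg_protos)
        return pos / (pos + neg)

    rng = np.random.default_rng(2)
    free_proto = rng.normal(size=16)
    chair_leg_proto, wall_proto = rng.normal(size=16), rng.normal(size=16)
    pixel_feat = free_proto + 0.1 * rng.normal(size=16)   # near free space
    print(ncl_score(pixel_feat, free_proto, [chair_leg_proto, wall_proto]))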
|
| |
| 09:00-10:30, Paper WeI1I.109 | Add to My Program |
| Touch-Based Object Localisation with Spatially-Aware Belief Entropy Estimation |
|
| Brudermüller, Lara | University of Oxford |
| Jankowski, Julius | Idiap Research Institute and EPFL |
| Toussaint, Marc | TU Berlin, Robotics Institute Germany |
| Hawes, Nick | University of Oxford |
Keywords: Manipulation Planning, Dexterous Manipulation, Planning under Uncertainty
Abstract: Robust robotic manipulation in the real world requires coping with incomplete or unreliable sensory input. While vision provides rich information, it often fails in the presence of occlusions, clutter, or poor lighting. In such cases, touch offers a robust alternative, enabling object localisation through contact alone. We present a touch-only global localisation method that operates in continuous state space with a particle belief. Sparse contact/no-contact signals are turned into informative likelihoods via a proximity-aware measurement model, and contact-aware resampling mitigates particle starvation. An information-gathering controller selects actions that maximise expected information gain using a non-parametric entropy estimator sensitive to both observation updates and dynamics. On real hardware, the system reliably localises and then grasps from broad, multi-modal initial beliefs with mode separations up to 0.4 m, far beyond the narrow uncertainty ranges assumed in related work. Information-aware localisation actions speed up belief convergence and boost grasp success, and ablations in simulation confirm the benefits of the measurement and resampling components.
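
A compact sketch of the belief machinery the abstract describes: a particle belief over object pose, a proximity-aware likelihood that makes no-contact readings informative, resampling triggered by effective sample size, and a simple weighted-histogram entropy as a stand-in for the paper's non-parametric estimator. All constants are illustrative.

    import numpy as np

    rng = np.random.default_rng(3)
    particles = rng.uniform(-0.5, 0.5, size=(2000, 2))   # pose belief (x, y)
    weights = np.full(len(particles), 1.0 / len(particles))

    def contact_likelihood(p, probe, contact, sigma=0.03):
        """Proximity-aware model: a no-contact reading is also
        informative, down-weighting hypotheses near the probe."""
        d = np.linalg.norm(p - probe, axis=-1)
        p_contact = np.exp(-0.5 * (d / sigma) ** 2)
        return p_contact if contact else 1.0 - p_contact

    def update(particles, weights, probe, contact):
        w = weights * contact_likelihood(particles, probe, contact)
        w /= w.sum()
        # Resample when the effective sample size collapses.
        if 1.0 / np.sum(w ** 2) < 0.5 * len(w):
            idx = rng.choice(len(w), size=len(w), p=w)
            particles = particles[idx] + rng.normal(scale=0.005,
                                                    size=particles.shape)
            w = np.full(len(w), 1.0 / len(w))
        return particles, w

    def histogram_entropy(particles, weights, bins=20):
        """Simple belief entropy via a weighted 2D histogram."""
        h, _, _ = np.histogram2d(particles[:, 0], particles[:, 1],
                                 bins=bins, weights=weights)
        p = h[h > 0] / h.sum()
        return -np.sum(p * np.log(p))

    particles, weights = update(particles, weights,
                                np.array([0.1, 0.0]), contact=False)
    print(histogram_entropy(particles, weights))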
|
| |
| 09:00-10:30, Paper WeI1I.110 | Add to My Program |
| HMSim: A Hierarchical Multi-Agent Simulator for Autonomous Vehicles |
|
| Liu, Haolan | University of California San Diego |
| Zhao, Jishen | UC San Diego |
| Zhang, Liangjun | GM |
Keywords: Simulation and Animation, Motion and Path Planning, Autonomous Vehicle Navigation
Abstract: This paper addresses the challenge of developing a realistic urban-driving simulator to accurately model agent behaviors, a crucial component for self-driving car development. Most previous simulators focus on the plausibility of sensor data synthesis, whereas the plausibility of driving behaviors is poorly explored. To tackle this problem, we propose a hierarchical architecture, which comprises (i) a high-level intention simulation summarizing driving scenarios and (ii) a low-level policy trained by reinforcement learning algorithms to refine plans. Unlike existing simulators, our approach captures diverse behaviors, even sub-optimal ones, vital for robust policy training and evaluation. We also highlight the importance of interactive simulations over static scenarios for realistic policy development. Extensive experiments demonstrate that our approach significantly improves long-term behavior prediction and closed-loop simulation, enhancing the realism and diversity of urban-driving simulations. The videos of this work are available on our project page: https://sites.google.com/ucsd.edu/h-sim/home.
|
| |
| 09:00-10:30, Paper WeI1I.111 | Add to My Program |
| ASAR: ε-Optimal Graph Search for Minimum Expected-Detection-Time Paths with Path Budget Constraints for Search and Rescue (SAR) |
|
| Mugford, Eric | Queen's University at Kingston |
| Gammell, Jonathan | Queen's University |
Keywords: Motion and Path Planning, Search and Rescue Robots, Planning under Uncertainty
Abstract: In Search and Rescue (SAR), searches are conducted to find missing persons and/or objects given uncertain information, imperfect observers, and large search areas. In many scenarios, such as Maritime SAR, expected survival times are short and optimal search could increase the likelihood of success. Given its probabilistic nature, this optimization problem is complex for nontrivial instances. Stochastic optimization methods search large problems by nondeterministically sampling the space to reduce the effective size of the problem. This has been used in SAR planning to search otherwise intractably large problems, but the stochastic nature provides no formal guarantees on the quality of solutions found in finite time. This paper instead presents ASAR, an ε-optimal search algorithm for SAR planning. It calculates a heuristic to bound the search space and uses graph-search methods to find solutions that are formally guaranteed to be within a user-specified factor, ε, of the optimal solution. It finds better solutions faster than existing optimization approaches in operational simulations. It is also demonstrated with a real-world field trial on Lake Ontario, Canada, where it was used to locate a drifting manikin in only 150 s.
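
The ε-suboptimality guarantee can be illustrated with a weighted best-first search: with an admissible heuristic h, expanding nodes by g + ε·h bounds the returned cost by ε times the optimum. The toy grid below only stands in for ASAR's SAR-specific expected-detection-time objective and path-budget constraints.

    import heapq

    def eps_optimal_search(start, goal, neighbors, h, eps=1.2):
        """Weighted-A*-style search: the returned path cost is within a
        factor eps of optimal when h is admissible."""
        frontier = [(eps * h(start), 0.0, start, [start])]
        best_g = {start: 0.0}
        while frontier:
            _, g, node, path = heapq.heappop(frontier)
            if node == goal:
                return g, path
            for nxt, cost in neighbors(node):
                g2 = g + cost
                if g2 < best_g.get(nxt, float("inf")):
                    best_g[nxt] = g2
                    heapq.heappush(frontier,
                                   (g2 + eps * h(nxt), g2, nxt, path + [nxt]))
        return None

    # Toy grid: 4-connected moves on a 5x5 lattice, Manhattan heuristic.
    def neighbors(p):
        x, y = p
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if 0 <= x + dx < 5 and 0 <= y + dy < 5:
                yield (x + dx, y + dy), 1.0

    print(eps_optimal_search((0, 0), (4, 4), neighbors,
                             h=lambda p: abs(4 - p[0]) + abs(4 - p[1])))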
|
| |
| 09:00-10:30, Paper WeI1I.112 | Add to My Program |
| Adaptive Diffusion Constrained Sampling for Bimanual Robot Manipulation |
|
| Tong, Haolei | Technical University of Darmstadt |
| Zhang, Yuezhe | Technical University of Darmstadt |
| Lueth, Sophie C. | Technical University of Darmstadt |
| Chalvatzaki, Georgia | Technische Universität Darmstadt |
Keywords: Bimanual Manipulation, Constrained Motion Planning, Deep Learning Methods
Abstract: Coordinated multi-arm manipulation requires satisfying multiple simultaneous geometric constraints across high-dimensional configuration spaces, which poses a significant challenge for traditional planning and control methods. In this work, we propose Adaptive Diffusion Constrained Sampling (ADCS), a generative framework that flexibly integrates both equality (e.g., relative and absolute pose constraints) and structured inequality constraints (e.g., proximity to object surfaces) into an energy-based diffusion model. Equality constraints are modeled using dedicated energy networks trained on pose differences in the Lie algebra space, while inequality constraints are represented via Signed Distance Functions (SDFs) and encoded into learned constraint embeddings, allowing the model to reason about complex spatial regions. A key innovation of our method is a Transformer-based architecture that learns to weigh constraint-specific energy functions at inference time, enabling flexible and context-aware constraint integration. Moreover, we adopt a two-stage batch-wise sampling strategy that improves precision and sample diversity by combining Langevin dynamics with resampling and density-aware re-weighting. Experimental results on dual-arm manipulation tasks show that ADCS significantly improves sample diversity and generalization in settings demanding precise coordination and adaptive constraint handling.
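
A toy version of the sampling loop: two hand-written constraint energies (an equality and an inequality term, standing in for the paper's learned energy networks and SDF-based embeddings) are summed and sampled with unadjusted Langevin dynamics. Step size, temperature, and the energies themselves are invented.

    import numpy as np

    rng = np.random.default_rng(4)

    def energy_grad(q):
        """Gradient of a summed energy on a 2D 'configuration': an
        equality term (stay on the line y = x) plus an inequality term
        (stay inside the unit disk)."""
        g = np.array([-2 * (q[1] - q[0]), 2 * (q[1] - q[0])])
        r = np.linalg.norm(q)
        if r > 1.0:                       # hinge penalty outside the disk
            g += 2 * (r - 1.0) * q / r
        return g

    def langevin(q, steps=500, step=0.01, temp=0.01):
        """Unadjusted Langevin dynamics on the summed energy landscape."""
        for _ in range(steps):
            q = q - step * energy_grad(q) \
                  + np.sqrt(2 * step * temp) * rng.normal(size=2)
        return q

    samples = np.array([langevin(rng.uniform(-2, 2, size=2))
                        for _ in range(8)])
    print(np.round(samples, 2))   # near the segment y = x, inside the disk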
|
| |
| 09:00-10:30, Paper WeI1I.113 | Add to My Program |
| Active Tactile Exploration for Rigid Body Pose and Shape Estimation |
|
| Gordon, Ethan Kroll | University of Pennsylvania |
| Baraki, Bruke | University of Pennsylvania |
| Bui, Hien | University of Pennsylvania |
| Posa, Michael | University of Pennsylvania |
Keywords: Contact Modeling, Force and Tactile Sensing, Incremental Learning
Abstract: General robot manipulation requires the handling of previously unseen objects. Learning a physically accurate model at test time can provide significant benefits in data efficiency, predictability, and reuse between tasks. Tactile sensing can complement vision with its robustness to occlusion, but its temporal sparsity necessitates careful online exploration to maintain data efficiency. Direct contact can also cause an unrestrained object to move, requiring both shape and location estimation. In this work, we propose a learning and exploration framework that uses only tactile data to simultaneously determine the shape and location of rigid objects with minimal robot motion. We build on recent advances in contact-rich system identification to formulate a loss function that penalizes physical constraint violation without introducing the numerical stiffness inherent in rigid-body contact. Optimizing this loss, we can learn cuboid and convex polyhedral geometries with less than 10 s of randomly collected data after first contact. Our exploration scheme seeks to maximize Expected Information Gain and results in significantly faster learning in both simulated and real-robot experiments. More information can be found at: https://dairlab.github.io/activetactile
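
Expected information gain for a binary contact/no-contact probe reduces to prior entropy minus the expected posterior entropy. The sketch below scores two candidate probes against a discrete four-hypothesis belief; the likelihood table is invented, and the paper's continuous formulation is richer than this.

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    def expected_info_gain(belief, p_contact_given_h):
        """EIG of one probe: H(prior) - E_outcome[H(posterior)]."""
        p_c = belief @ p_contact_given_h
        eig = entropy(belief)
        for outcome_p, lik in ((p_c, p_contact_given_h),
                               (1 - p_c, 1 - p_contact_given_h)):
            if outcome_p > 0:
                post = belief * lik
                post /= post.sum()
                eig -= outcome_p * entropy(post)
        return eig

    # Discrete belief over 4 shape/pose hypotheses; each row of L gives
    # P(contact | hypothesis) for one candidate probe (made-up numbers).
    belief = np.array([0.25, 0.25, 0.25, 0.25])
    L = np.array([[0.9, 0.9, 0.1, 0.1],    # probe A splits hypotheses well
                  [0.5, 0.5, 0.5, 0.5]])   # probe B is uninformative
    gains = [expected_info_gain(belief, lik) for lik in L]
    print(gains, "-> choose probe", int(np.argmax(gains)))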
|
| |
| 09:00-10:30, Paper WeI1I.114 | Add to My Program |
| Unified Humanoid Fall-Safety Policy from a Few Demonstrations |
|
| Xu, Zhengjie | University of Michigan, Ann Arbor |
| Li, Ye | University of Michigan, Ann Arbor |
| Lin, Kwan-Yee | University of Michigan, Ann Arbor |
| Yu, Stella | University of Michigan, Ann Arbor |
Keywords: Natural Machine Motion, Whole-Body Motion Planning and Control, Learning from Demonstration
Abstract: Falling is an inherent risk of humanoid mobility. Maintaining stability is therefore a primary safety focus in robot control and learning, yet no existing approach fully averts loss of balance. When instability does occur, prior work addresses only isolated aspects of falling: avoiding falls, choreographing a controlled descent, or standing up afterward. Consequently, humanoid robots lack integrated strategies for impact mitigation and prompt recovery when real falls defy these scripts. We aim to go beyond keeping balance to make the entire fall-and-recovery process safe and autonomous: Prevent falls when possible, reduce impact when unavoidable, and stand up when fallen. By fusing sparse human demonstrations with reinforcement learning and a diffusion-based memory of safe reactions, we learn whole-body behaviors that unify fall prevention, impact mitigation, and rapid recovery in a single policy. Experiments in simulation and on a Unitree G1 demonstrate robust sim-to-real transfer, lower impact forces, and consistently fast recovery across diverse disturbances, pointing toward safer, more resilient humanoids in real environments. Videos are available at https://firm2025.github.io.
|
| |
| 09:00-10:30, Paper WeI1I.115 | Add to My Program |
| Quadrature Oscillation System for Coordinated Motion in Crawling Origami Robot |
|
| Liu, Sean | UCLA |
| Mehta, Ankur | UCLA |
| Yan, Wenzhong | UC Davis |
Keywords: Soft Robot Materials and Design, Soft Robot Applications, Mechanism Design
Abstract: Origami-inspired robots offer rapid, accessible design and manufacture with diverse functionalities. In particular, origami robots without conventional electronics have the unique advantage of functioning in extreme environments such as ones with high radiation or large magnetic fields. However, the absence of sophisticated control systems limits these robots to simple autonomous behaviors. In our previous studies, we developed a printable, electronics-free, and self-sustained oscillator that generates simple complementary square-wave signals. This study presents a quadrature oscillation system capable of generating four square-wave signals a quarter-cycle out of phase, enabling four distinct states. Such control signals are important in various engineering and robotics applications, such as orchestrating limb movements in bio-inspired robots. We demonstrate the practicality and value of this oscillation system by designing and constructing an origami crawling robot that utilizes the quadrature oscillator to achieve coordinated locomotion. Together, the oscillator and robot illustrate the potential for more complex control and functions in origami robotics, paving the way for more electronics-free, rapid-design origami robots with advanced autonomous behaviors.
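
The quadrature signal itself is easy to state in code: four square waves, each delayed by a quarter cycle, yield four distinct states per period for sequencing limb motion. This sketch captures only the signal logic, not the printable pneumatic oscillator that generates it.

    import numpy as np

    def quadrature_square_waves(t, period=1.0):
        """Four square waves, each shifted a quarter cycle, giving four
        distinct joint states over one period."""
        phase = (t / period) % 1.0
        return np.array([((phase + k / 4.0) % 1.0) < 0.5 for k in range(4)],
                        dtype=int)

    for t in np.arange(0.0, 1.0, 0.125):
        print(f"t={t:.3f}  states={quadrature_square_waves(t)}")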
|
| |
| 09:00-10:30, Paper WeI1I.116 | Add to My Program |
| Energy-Aware Informative Path Planning for Heterogeneous Multi-Robot Systems |
|
| Munir, Aiman | Pennsylvania State University |
| Dutta, Ayan | University of North Florida |
| Parasuraman, Ramviyas | University of Georgia |
Keywords: Multi-Robot Systems, Environment Monitoring and Management, Energy and Environment-Aware Automation
Abstract: Effective energy management is essential for maximizing information gathering tasks with networked mobile robots, particularly for large-scale, energy-intensive tasks such as agricultural monitoring and wildfire mapping. This paper presents a novel framework that integrates robots’ energy profiles with confidence bounds of their assigned regions to optimize sampling targets. Designed for persistent, long-term deployments, the framework employs Gaussian Process Regression (GPR) to maximize data acquisition and accurately reconstruct unknown spatial distributions (e.g., algae outbreaks or humidity maps). The method enables seamless transitions among exploration (mapping uncertain regions at high energy), exploitation (refining maps at moderate energy levels), and recharging (navigating to charging stations at low energy), thereby achieving energy-balanced informative path planning. Experiments demonstrate the effectiveness of the approach against state-of-the-art methods in generating energy-efficient and distinct paths for heterogeneous robots, delivering up to 32% energy savings while maintaining high reconstruction accuracy. Hardware experiments closely matched the performance in simulation.
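
A schematic of the three-mode behavior the abstract describes, with the battery thresholds and the uncertainty/travel-cost trade-off invented for illustration; a real implementation would take the standard deviations from the GPR posterior rather than a random stand-in.

    import numpy as np

    def choose_target(battery, candidates, gp_std, charger, robot_pos,
                      hi=0.6, lo=0.25):
        """Energy-aware target selection: explore the most uncertain
        point at high charge, exploit (refine near the robot) at
        moderate charge, and head to the charger when low."""
        if battery < lo:
            return charger, "recharge"
        if battery > hi:
            return candidates[int(np.argmax(gp_std))], "explore"
        # Exploit: trade off posterior uncertainty against travel cost.
        dist = np.linalg.norm(candidates - robot_pos, axis=1)
        return candidates[int(np.argmax(gp_std - 0.5 * dist))], "exploit"

    rng = np.random.default_rng(6)
    cands = rng.uniform(0, 10, size=(50, 2))
    std = rng.uniform(0, 1, size=50)   # stand-in for GPR posterior std
    print(choose_target(0.8, cands, std, np.zeros(2), np.array([5.0, 5.0])))
    print(choose_target(0.1, cands, std, np.zeros(2), np.array([5.0, 5.0])))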
|
| |
| 09:00-10:30, Paper WeI1I.117 | Add to My Program |
| Good Weights: Proactive, Adaptive Dead Reckoning Fusion for Continuous and Robust Visual SLAM |
|
| Du, Yanwei | Georgia Institute of Technology |
| Peng, Jing-Chen | Georgia Institute of Technology |
| Vela, Patricio | Georgia Institute of Technology |
Keywords: SLAM, Sensor Fusion, Wheeled Robots
Abstract: Given that Visual SLAM relies on appearance cues for localization and scene understanding, texture-less or visually degraded environments (e.g., plain walls or low lighting) lead to poor pose estimation and track loss. However, robots are typically equipped with sensors that provide some form of dead reckoning odometry with reasonable short-time performance but unreliable long-time performance. The Good Weights (GW) algorithm described here provides a framework to adaptively integrate dead reckoning (DR) with passive visual SLAM for continuous and accurate frame-level pose estimation. Importantly, it describes how all modules in a comprehensive SLAM system must be modified to incorporate DR into its design. Adaptive weighting increases DR influence when visual tracking is unreliable and reduces it when visual feature information is strong, maintaining pose track without overreliance on DR. Good Weights yields a practical solution for mobile navigation that improves visual SLAM performance and robustness. Experiments on collected datasets and in real-world deployment demonstrate the benefits of Good Weights.
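
The adaptive-weighting idea can be sketched as a convex blend of the visual and dead-reckoning pose estimates, with the weight driven by a tracking-quality signal such as the feature inlier count. The thresholds below are hypothetical, and the full method modifies every module of the SLAM system, not just one fusion step.

    import numpy as np

    def fuse_pose(p_visual, p_dr, n_inliers, n_min=15, n_good=80):
        """Adaptive weighting: rely on visual SLAM when feature tracking
        is strong (many inliers) and shift weight to dead reckoning as
        it degrades. The inlier thresholds are illustrative."""
        w_vis = np.clip((n_inliers - n_min) / (n_good - n_min), 0.0, 1.0)
        return w_vis * p_visual + (1.0 - w_vis) * p_dr, w_vis

    p_vis = np.array([1.02, 0.48, 0.09])   # visual pose estimate (x, y, yaw)
    p_dr = np.array([1.00, 0.50, 0.10])    # wheel-odometry prediction
    for inliers in (120, 40, 5):           # textured -> degraded -> blank wall
        pose, w = fuse_pose(p_vis, p_dr, inliers)
        print(f"inliers={inliers:3d}  w_visual={w:.2f}  pose={np.round(pose, 3)}")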
|
| |
| 09:00-10:30, Paper WeI1I.118 | Add to My Program |
| A Type of Actuator with Large Deformation and Load Capacity: Design and Modeling |
|
| Meng, Lingzhe | Harbin Engineering University |
| Wang, Xinyu | Harbin Engineering University |
| Ying, Zhuhong | Harbin Engineering University |
| Ding, Mingxuan | Harbin Engineering University |
| Jia, Peng | Harbin Engineering University |
| Yun, Feihong | Harbin Engineering University |
| Wang, Gang | Harbin Engineering University |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Robot Materials and Design, Soft Sensors and Actuators
Abstract: Flexible actuators have garnered extensive attention due to their flexibility and versatility. However, they still exhibit significant limitations in load capacity and structural stiffness. We have developed a multifunctional rigid-flexible coupled actuator with large deformation and high load capacity. We first investigated the structural design and material selection of the actuator. When establishing the mechanical model, we found that conventional methods could not solve it and that geometric nonlinearity could not be neglected. Therefore, we proposed a rigid-flexible coupled multibody dynamics modeling method suitable for large deformations and conducted static experiments to obtain a nonlinear torque–rotation angle curve. Finally, we compared the simulation results with the dynamic experimental results, demonstrating the effectiveness and accuracy of the proposed method.
|
| |
| 09:00-10:30, Paper WeI1I.119 | Add to My Program |
| WaveComm: Lightweight Communication for Collaborative Perception Via Wavelet Feature Distillation |
|
| Bao, Erdemt | Huazhong University of Science and Technology |
| Yang, Jin | Xi'an Jiaotong University |
Keywords: Intelligent Transportation Systems, Deep Learning for Visual Perception, Computer Vision for Transportation
Abstract: In multi-agent collaborative sensing systems, substantial communication overhead from information exchange significantly limits scalability and real-time performance, especially in bandwidth-constrained environments. This often results in degraded performance and reduced reliability. To address this challenge, we propose WaveComm, a wavelet-based communication framework that drastically reduces transmission loads while preserving sensing performance in low-bandwidth scenarios. The core innovation of WaveComm lies in decomposing feature maps using the Discrete Wavelet Transform (DWT), transmitting only compact low-frequency components to minimize communication overhead. High-frequency details are omitted, and their effects are reconstructed at the receiver side using a lightweight generator. A Multi-Scale Distillation (MSD) Loss is employed to optimize the reconstruction quality across pixel, structural, semantic, and distributional levels. Experiments on the OPV2V and DAIR-V2X datasets for LiDAR-based and camera-based perception tasks demonstrate that WaveComm maintains state-of-the-art performance even when the communication volume is reduced by 86.3% and 87.0%, respectively. Compared to existing approaches, WaveComm achieves competitive improvements in both communication efficiency and perception accuracy. Ablation studies further validate the effectiveness of its key components.
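
The core bandwidth saving can be reproduced with a one-level Haar transform: transmit only the LL (approximation) band, a 4x reduction per level, and reconstruct at the receiver with the detail bands zeroed. WaveComm instead regenerates the missing high-frequency content with a learned generator; this sketch shows only the transform bookkeeping on a synthetic feature map.

    import numpy as np

    def haar_low_band(x):
        """One-level 2D Haar analysis keeping only the LL band
        (x must have even height and width)."""
        a = x[0::2, 0::2]; b = x[0::2, 1::2]
        c = x[1::2, 0::2]; d = x[1::2, 1::2]
        return (a + b + c + d) / 2.0       # orthonormal Haar LL band

    def reconstruct_from_low(ll):
        """Receiver-side synthesis with high-frequency bands zeroed;
        the paper hallucinates them with a lightweight generator."""
        up = np.repeat(np.repeat(ll, 2, axis=0), 2, axis=1)
        return up / 2.0

    rng = np.random.default_rng(7)
    feat = rng.normal(size=(8, 8))         # stand-in for a feature map
    ll = haar_low_band(feat)               # transmit 16 values, not 64
    rec = reconstruct_from_low(ll)
    print(ll.shape, np.abs(rec - feat).mean())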
|
| |
| 09:00-10:30, Paper WeI1I.120 | Add to My Program |
| Safe Model Predictive Diffusion with Shielding |
|
| Kim, Taekyung | University of Michigan, Ann Arbor |
| Majd, Keyvan | Toyota Motor North America R&D |
| Okamoto, Hideki | Toyota Motor North America R&D |
| Hoxha, Bardh | Toyota Motor North America R&D |
| Panagou, Dimitra | University of Michigan, Ann Arbor |
| Fainekos, Georgios | Toyota Motor North America R&D |
Keywords: Motion and Path Planning, Nonholonomic Motion Planning, Constrained Motion Planning
Abstract: Generating safe, kinodynamically feasible, and optimal trajectories for complex robotic systems is a central challenge in robotics. This paper presents Safe Model Predictive Diffusion (Safe MPD), a training-free diffusion planner that unifies a model-based diffusion framework with a safety shield to generate trajectories that are both kinodynamically feasible and safe by construction. By enforcing feasibility and safety on all samples during the denoising process, our method avoids the common pitfalls of post-processing corrections, such as computational intractability and loss of feasibility. We validate our approach on challenging non-convex planning problems, including kinematic and acceleration-controlled tractor-trailer systems. The results show that it substantially outperforms existing safety strategies in success rate and safety, while achieving sub-second computation times.
|
| |
| 09:00-10:30, Paper WeI1I.121 | Add to My Program |
| AssemMate: Graph-Based LLM for Robotic Assembly Assistance |
|
| Zheng, Qi | Tsinghua University |
| Zhang, Chaoran | Tsinghua University |
| Liang, Zijian | Tsinghua University |
| Lin, Ente | Tsinghua University |
| Cui, Shubo | Guangzhou Fuwei Intelligent Technology Co., Ltd |
| Xie, Qinghongbing | Tsinghua University |
| Xu, Zhaobo | Tsinghua University |
| Zeng, Long | Tsinghua University |
Keywords: Human-Centered Robotics, AI-Enabled Robotics, Assembly
Abstract: Large Language Model (LLM)-based robotic assembly assistance has gained significant research attention. It requires the injection of domain-specific knowledge to guide the assembly process through natural language interaction with humans. Despite some progress, existing methods represent knowledge in the form of natural language text. Due to the long context and redundant content, they struggle to meet the robots' requirements for real-time and precise reasoning. In order to bridge this gap, we present a novel graph-based LLM, denoted as AssemMate, which consists of two stages: graph-based question answering and vision-enhanced grasp execution. The first stage enables natural language question answering on a knowledge graph, supporting human-robot interaction and assembly task planning for specific products. The second stage then uses the previously generated plan as a target, senses stacked scenes, and executes grasping to assist with assembly. Specifically, a self-supervised Graph Convolutional Network (GCN) encodes knowledge graph entities and relations into a latent space and aligns them with the LLM's representation, enabling the LLM to understand graph information. In addition, a vision-enhanced strategy is employed to address stacked scenes in grasping. Through training and evaluation, AssemMate outperforms existing methods, achieving 6.4% higher accuracy, 3 times faster inference, and 28 times shorter context length, while demonstrating strong generalization ability on random graphs. Our approach further demonstrates its superiority through robotic grasping experiments in both simulated and real-world settings. More details can be found on the project page https://github.com/cristina304/AssemMate.git
|
| |
| 09:00-10:30, Paper WeI1I.122 | Add to My Program |
| ReThinkNav: Zero-Shot Vision-And-Language Navigation with Open-Source LLMs Via Contextual Reasoning and Loop Recovery |
|
| Li, Aolin | Wuhan University |
| Yan, Yixian | Wuhan University |
| Luo, Hongkun | Wuhan University |
| Zhan, Jiao | Wuhan University |
| Guo, Chi | Wuhan University |
Keywords: Vision-Based Navigation
Abstract: Zero-shot Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural language instructions and navigate without task-specific training. Prior works have demonstrated the potential of open-source large language models (LLMs) in zero-shot VLN-CE, yet two major limitations remain: (1) difficulty in accurately following instructions, and (2) susceptibility to loops in spatially confined or semantically similar regions. In this work, we introduce ReThinkNav, a framework designed to further advance open-source LLMs in zero-shot VLN-CE. ReThinkNav integrates contextual reasoning for enhanced instruction comprehension and progress estimation, enabling the LLM to accurately infer both the appropriate action and its rationale. In addition, a Loop Detection and Recovery (LDR) module detects loops and adjusts decisions accordingly. Experiments on the R2R-CE benchmark demonstrate excellent zero-shot performance, while real-world validation on the Unitree G1 humanoid robot confirms its practical applicability. The code is available at https://github.com/damonds27/ReThinkNav.
|
| |
| 09:00-10:30, Paper WeI1I.123 | Add to My Program |
| Design, Mapping, and Contact Anticipation with 3D-Printed Whole-Body Tactile and Proximity Sensors |
|
| Kohlbrenner, Carson | University of Colorado Boulder |
| Soukhovei, Anna | University of Colorado Boulder |
| Escobedo, Caleb | University of Colorado - Boulder |
| Nechyporenko, Nataliya | University of Colorado Boulder |
| Roncone, Alessandro | University of Colorado Boulder |
Keywords: Touch in HRI, Bioinspired Robot Learning, Multi-Modal Perception for HRI
Abstract: Robots operating in dynamic and shared environments benefit from anticipating contact before it occurs. We present GenTact-Prox, a fully 3D-printed artificial skin that integrates tactile and proximity sensing for contact detection and anticipation. The artificial skin platform is modular in design, procedurally generated to fit any robot morphology, and can cover the whole body of a robot. The skin achieved detection ranges of up to 18 cm during evaluation. To characterize how robots perceive nearby space through this skin, we introduce a data-driven framework for mapping the Perisensory Space---the body-centric volume of space around the robot where sensors provide actionable information for contact anticipation. We demonstrate this approach on a Franka Research 3 robot equipped with five GenTact-Prox units, enabling online object-aware operation and contact prediction.
|
| |
| 09:00-10:30, Paper WeI1I.124 | Add to My Program |
| Placeit! A Framework for Learning Robot Object Placement Skills |
|
| Ferrad, Amina | Sorbonne Université |
| Huber, Johann | ISIR, Sorbonne Université |
| Hélénon, François | Sorbonne Université |
| Gleyze, Julien | Institut Des Systèmes Intelligents Et De Robotique |
| Khoramshahi, Mahdi | Sorbonne University |
| Doncieux, Stéphane | Sorbonne University |
Keywords: Manipulation Planning, Evolutionary Robotics, Data Sets for Robot Learning
Abstract: Robotics research has made significant strides in learning, yet mastering basic skills like object placement remains a fundamental challenge. A key bottleneck is the acquisition of large-scale, high-quality data, which is often a manual and laborious process. Inspired by Graspit!, a foundational work that used simulation to automatically generate dexterous grasp poses, we introduce Placeit!, an evolutionary-computation framework for generating valid placement positions for rigid objects. Placeit! is highly versatile, supporting tasks from placing objects on tables to stacking and inserting them. Our experiments show that by leveraging quality-diversity optimization, Placeit! significantly outperforms state-of-the-art methods across all scenarios for generating diverse valid poses. A pick&place pipeline built on our framework achieved a 90% success rate over 120 real-world deployments. This work positions Placeit! as a powerful tool for open-environment pick-and-place tasks and as a valuable engine for generating the data needed to train simulation-based foundation models in robotics.
|
| |
| 09:00-10:30, Paper WeI1I.125 | Add to My Program |
| Learning Conservative Neural Control Barrier Functions from Demonstrations |
|
| Tabbara, Ihab | Washington University in St. Louis |
| Sibai, Hussein | Washington University in St. Louis |
Keywords: Robot Safety, Collision Avoidance, Reinforcement Learning
Abstract: Safety filters, particularly those based on control barrier functions, have gained increased interest as effective tools for safe control of dynamical systems. Existing correct-by-construction synthesis algorithms for such filters, however, suffer from the curse of dimensionality. Deep learning approaches have been proposed in recent years to address this challenge. In this paper, we add to this set of approaches an algorithm for training neural control barrier functions from offline datasets. Such functions can be used to design constraints for quadratic programs that are then used as safety filters. Our algorithm trains these functions so that the system is not only prevented from reaching unsafe states, but is also disincentivized from reaching out-of-distribution ones, at which they would be less reliable. It is inspired by Conservative Q-learning, an offline reinforcement learning algorithm. We call its outputs Conservative Control Barrier Functions (CCBFs). Our empirical results demonstrate that CCBFs outperform existing methods in maintaining safety while minimally affecting task performance.
|
| |
| 09:00-10:30, Paper WeI1I.126 | Add to My Program |
| Mixed-Initiative Dialog for Human-Robot Collaborative Manipulation |
|
| Yu, Albert | UT Austin |
| Li, Chengshu | Stanford University |
| Macesanu, Luca | New York University |
| Balaji, Arnav | UT Austin |
| Ray, Ruchira | University of Edinburgh |
| Mooney, Raymond | UT Austin |
| Martín-Martín, Roberto | UT Austin |
Keywords: Human-Robot Collaboration, Natural Dialog for HRI, Machine Learning for Robot Control
Abstract: Effective robotic systems for long-horizon human-robot collaboration must adapt to a wide range of human partners, whose physical behavior, willingness to assist, and understanding of the robot's capabilities may change over time. This demands a tightly coupled communication loop that grants both agents the flexibility to propose, accept, or decline requests as they coordinate toward completing the task effectively. We propose MICoBot, a system that enables the human and robot, both using natural language, to take initiative in formulating, accepting, or rejecting proposals on who can best complete different steps of a task. To handle diverse, task-directed dialog, and find successful collaborative strategies that minimize human effort, MICoBot makes decisions at three levels: (1) a meta-planner considers human dialog to formulate and code a high-level collaboration strategy, (2) a planner optimally allocates the remaining steps to either agent based on the robot's capabilities (measured by a simulation-pretrained affordance model) and the estimated human's willingness to help, and (3) an action executor decides the low-level actions to perform or words to say to the human. In physical robot trials with 18 unique human participants, MICoBot significantly improves task success and user experience over a pure LLM baseline and standard agent allocation models.
|
| |
| 09:00-10:30, Paper WeI1I.127 | Add to My Program |
| Hybrid Model-Learning Decoupled Control for Tendon-Driven Multi-Segment Continuum Robotic Bronchoscope |
|
| Zhang, Ming-Yang | Institute of Automation, Chinese Academy of Sciences |
| Li, Zhen | Institute of Automation, Chinese Academy of Sciences |
| Ye, Qiang | Institute of Automation, Chinese Academy of Sciences |
| Fu, Pan | Institute of Automation, Chinese Academy of Sciences |
| Zhai, Yu-Peng | College of Integrated Circuits, Taiyuan University of Technology |
| Ren, Han | Institute of Automation, Chinese Academy of Sciences |
| Deng, Yawen | Institute of Automation, Chinese Academy of Sciences |
| Guo, Chao | Peking Union Medical College Hospital |
| He, Wenhao | Institute of Automation, Chinese Academy of Sciences |
| Bian, Gui-Bin | Institute of Automation, Chinese Academy of Sciences |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Flexible Robotics, Motion Control
Abstract: Flexible tendon-driven multi-segment robotic bronchoscopes can reach peripheral lung regions for minimally invasive diagnosis and therapy. However, long tendon transmissions introduce friction, elasticity, and backlash, which couple the motion of adjacent segments and reduce operational accuracy and safety. This paper proposes a hybrid model-learning decoupled control framework for a two-segment bronchoscope that explicitly cancels distal-to-proximal coupling while compensating transmission disturbances. The method learns online a pose-dependent coupling map from synchronized encoder and electromagnetic measurements and uses it for feedforward cancellation in the proximal channel. In addition, an adaptive disturbance compensation module estimates per-tendon compliance and backlash to correct stretch and dead-zone effects. A two-segment tendon-driven robotic bronchoscope platform demonstrated a substantial reduction in proximal drift during distal actuation. At a 90° distal bend, the mean proximal coupling angle was 5.84°. Compared with the most commonly used piecewise constant curvature model baseline, the proposed controller achieved stronger motion decoupling, reducing the coupling rate by 86.47%, thereby enabling more precise bronchoscopic manipulation in anatomically constrained environments.
|
| |
| 09:00-10:30, Paper WeI1I.128 | Add to My Program |
| Causality-Based Parametric Control Barrier Function for Safe Multi-Vehicle Interaction |
|
| Lyu, Yiwei | Texas A&M University |
| Chang, Caleb | University of Washington |
| Dolan, John M. | Carnegie Mellon University |
Keywords: Intelligent Transportation Systems, Robot Safety
Abstract: Safe control has been widely studied in various safety-critical applications, for instance, autonomous driving. In order to ensure the autonomous vehicle does not collide with other vehicles, it is essential to obtain an accurate expectation of surrounding vehicles' behavior and react adaptively. Instead of assuming fully cooperative and homogeneous vehicles using the same safety-critical controllers, recent works have been exploring different data-driven approaches to model the neighboring vehicles' underlying controllers with observed data. However, existing works either suffer from 1) the inter-vehicle influence during the multi-vehicle interaction, which makes it hard to determine the causality of surrounding vehicles' behavior in controller modeling, or 2) being dominated by the worst-case analysis, which may lead to overly conservative behavior. In this paper, we extend the prior work on Parametric-Control Barrier Function (Parametric-CBF) to multi-robot interactions with embedded causality inference to explicitly reason over the inter-vehicle influence. Given the learned Causality-based Parametric-CBF, we present an adaptive safety-critical controller that allows the ego vehicle to safely react to surrounding vehicles with the learned expectation. We demonstrate that by leveraging the motion flexibility among multi-vehicle systems, task efficiency can be greatly improved in various interaction-intensive scenarios.
|
| |
| 09:00-10:30, Paper WeI1I.129 | Add to My Program |
| SAGE: Scene Graph-Aware Guidance and Execution for Long-Horizon Manipulation Tasks |
|
| Li, Jialiang | Shanghai Jiao Tong University |
| Wu, Wenzheng | ShanghaiTech University |
| Zhang, Gaojing | University of Sussex |
| Han, Yifan | Chinese Academy of Sciences |
| Lian, Wenzhao | Shanghai Jiao Tong University |
Keywords: Integrated Planning and Learning, Semantic Scene Understanding, Visual Learning
Abstract: Successfully solving long-horizon manipulation tasks remains a fundamental challenge. These tasks involve extended action sequences and complex object interactions, presenting a critical gap between high-level symbolic planning and low-level continuous control. To bridge this gap, two essential capabilities are required: robust long-horizon task planning and effective goal-conditioned manipulation. Existing task planning methods, including traditional and LLM-based approaches, often exhibit limited generalization or sparse semantic reasoning. Meanwhile, image-conditioned control methods struggle to adapt to unseen tasks. To tackle these problems, we propose SAGE, a novel framework for Scene Graph-Aware Guidance and Execution in Long-Horizon Manipulation Tasks. SAGE utilizes semantic scene graphs as a structural representation for scene states. A structural scene graph enables bridging task-level semantic reasoning and pixel-level visuo-motor control. This also facilitates the controllable synthesis of accurate, novel sub-goal images. SAGE consists of two key components: (1) a scene graph-based task planner that uses VLMs and LLMs to parse the environment and reason about physically-grounded scene state transition sequences, and (2) a decoupled structural image editing pipeline that controllably converts each target sub-goal graph into a corresponding image through image inpainting and composition. Extensive experiments have demonstrated that SAGE achieves state-of-the-art performance on distinct long-horizon tasks.
|
| |
| 09:00-10:30, Paper WeI1I.130 | Add to My Program |
| Design of a Novel Loosely Coupled Parallel Structural Upper Body Exoskeleton |
|
| Tian, Xinhui | Central South University |
| Zhou, Xin | Central South University |
| Xie, Bin | Central South University |
Keywords: Prosthetics and Exoskeletons, Mechanism Design, Kinematics
Abstract: Exoskeletons, as wearable human–robot collaborative devices, can effectively reduce muscle fatigue caused by prolonged material handling and overhead tasks. However, most existing active exoskeletons adopt tightly coupled serial structures, which generally suffer from insufficient wearing comfort, limited muscle coverage, and restricted workspace. To address these issues, this paper presents a novel loosely coupled, parallel upper-body exoskeleton (6.9 kg). The proposed exoskeleton is connected only at the waist and elbow, providing assistance not only to the small muscle groups of the arms and shoulders but also to the larger muscle groups of the waist, back, and chest. Moreover, heavy components of the exoskeleton (approximately 78% of the total mass), such as the actuators, are located near the wearer’s waist, which places the center of mass close to the human center of mass, improving comfort and control reliability. To validate the feasibility of the design, kinematic models of both the exoskeleton and the human upper body were established. Analysis showed that the end-effector workspace of the exoskeleton exceeds that of the human elbow. Prototype experiments were conducted, allowing the wearer to perform arbitrary postures without constraining spinal motion. This indicates that the exoskeleton holds potential in work assistance scenarios such as long-term heavy lifting and overhead work.
|
| |
| 09:00-10:30, Paper WeI1I.131 | Add to My Program |
| Imagine2Act: Leveraging Object-Action Motion Consistency from Imagined Goals for Robotic Manipulation |
|
| Heng, Liang | Peking University |
| Xu, Jiadong | University of Chinese Academy of Sciences |
| Wang, Yiwen | Peking University |
| Li, Xiaoqi | Peking University |
| Cai, Muhe | Peking University |
| Shen, Yan | Peking University |
| Zhu, Juan | AgiBot |
| Ren, Guanghui | Agibot |
| Dong, Hao | Peking University |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation
Abstract: Relational object rearrangement (ROR) tasks require a robot to manipulate objects with precise semantic and geometric reasoning. Existing approaches either rely on pre-collected demonstrations that struggle to capture complex geometric constraints, or generate goal-state observations to capture semantic and geometric knowledge but fail to explicitly couple object transformation with action prediction, leading to errors from generative noise. To address these limitations, we propose Imagine2Act, a 3D imitation-learning framework that incorporates semantic and geometric constraints of objects into policy learning to tackle high-precision manipulation tasks. We first generate imagined goal images conditioned on language instructions and reconstruct corresponding 3D point clouds to provide robust semantic and geometric priors. These imagined goal point clouds serve as additional inputs to the policy model, while an object–action consistency strategy with soft pose supervision explicitly aligns predicted action motion with object transformation. This design enables Imagine2Act to reason about object relational goals and achieve accurate, high-precision manipulation across diverse tasks. Experiments in both simulation and the real world demonstrate that Imagine2Act outperforms previous state-of-the-art policies.
|
| |
| 09:00-10:30, Paper WeI1I.132 | Add to My Program |
| SC-VLMaps: Depth-Free Visual–Language Mapping Via Scene Coordinate Regression |
|
| Istighfarin, Nanda Febri | Jeonbuk National University |
| Choi, Baehoon | Bstar Robotics Co., Ltd |
| Jo, HyungGi | Jeonbuk National University |
Keywords: Mapping, Semantic Scene Understanding, Object Detection, Segmentation and Categorization
Abstract: The ability to connect visual observations with human language is increasingly valuable for embodied agents in tasks such as navigation and semantic mapping. The existing visual–language maps (VLMaps) approach enables this connection but typically depends on depth images to project semantic features into 3D space, which limits scalability due to sensor cost and deployment constraints. In this work, we introduce SC-VLMaps, a depth-free visual–language mapping framework that constructs semantic maps using only monocular RGB input. SC-VLMaps leverages a scene coordinate regression (SCR) network to predict dense 3D coordinates from images, bypassing the need for depth supervision and enabling implicit geometry reconstruction. The predicted coordinates are fused into a voxel grid and augmented with language-aligned features from a frozen visual–language encoder, producing maps that are both geometrically coherent and semantically enriched. By employing a multi-scene training strategy, SC-VLMaps generalizes from indoor datasets (7Scenes) to challenging outdoor benchmarks (Cambridge Landmarks). Experiments show that SC-VLMaps achieves denser, more compact maps with stronger semantic alignment than VLMaps, while requiring only monocular RGB images.
|
| |
| 09:00-10:30, Paper WeI1I.133 | Add to My Program |
| PDLNet: Learning Point Cloud Distortion for Unsupervised Cross-Domain Point Cloud Segmentation in Adverse Weather |
|
| Dong, Shuhua | Nanjing University of Science and Technology |
| Li, Minxian | Nanjing University of Science and Technology |
| Zhang, Haofeng | Nanjing University of Science and Technology |
Keywords: Robotics and Automation in Construction, Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception
Abstract: Existing point cloud semantic segmentation models are usually trained and evaluated using data collected under clear weather conditions. Under adverse weather conditions such as rain, snow, and fog, point clouds are usually distorted and significant degradation of existing model performance occurs. Many domain adaptation methods try to address this issue by simulating adverse weather or using data augmentation techniques during training. However, they cannot accurately model the actual distortion in the target domain. By analyzing the visualization and statistical information of the target domain data and referring to existing studies, we categorize the distortion of point cloud data into position distortion, intensity distortion, and quantity distortion. To address these distortions of the target domain data, we propose a Point Distortion Learning Network (PDLNet) that integrates the Point Distortion Learning (PDL) module to learn the feature distortion of target domain data due to adverse weather. Moreover, we also integrate the Cross-domain Feature Association (CFA) module to help the model learn domain-invariant feature representations and improve the model's adaptability to the target domain. In addition, PDLNet introduces the Point Semantic Knowledge Distillation (PSKD) module, which ensures that only the target domain data is used efficiently in the inference phase while preserving the learned cross-domain knowledge. To further improve model performance, we also iteratively optimize the model by introducing a curriculum learning module. Our approach establishes a new state of the art by achieving 40.6% mIoU and 27.7% mIoU on the SemanticKITTI-to-SemanticSTF and SynLiDAR-to-SemanticSTF benchmarks, respectively. Source code will be released at https://github.com/JerryD233/PDLNet.
|
| |
| 09:00-10:30, Paper WeI1I.134 | Add to My Program |
| Robust Online Residual Refinement Via Koopman-Guided Dynamics Modeling |
|
| Gong, Zhefei | Westlake University |
| Lyu, Shangke | Nanjing University |
| Ding, Pengxiang | Westlake University |
| Xiao, Wei | Westlake University |
| Wang, Donglin | Westlake University |
Keywords: Reinforcement Learning, Imitation Learning, Machine Learning for Robot Control
Abstract: Imitation learning (IL) enables efficient skill acquisition from demonstrations but often struggles with long-horizon tasks and high-precision control due to compounding errors. Residual policy learning offers a promising, model-agnostic solution by refining a base policy through closed-loop corrections. However, existing approaches primarily focus on local corrections to the base policy, lacking a global understanding of state evolution, which limits robustness and generalization to unseen scenarios. To address this, we propose incorporating global dynamics modeling to guide residual policy updates. Specifically, we leverage Koopman operator theory to impose linear time-invariant structure in a learned latent space, enabling reliable state transitions and improved extrapolation for long-horizon prediction and unseen environments. We introduce KORR (Koopman-guided Online Residual Refinement), a simple yet effective framework that conditions residual corrections on Koopman-predicted latent states, enabling globally informed and stable action refinement. We evaluate KORR on long-horizon, fine-grained robotic furniture assembly tasks under various perturbations. Results demonstrate consistent gains in performance, robustness, and generalization over strong baselines. Our findings further highlight the potential of Koopman-based modeling to bridge modern learning methods with classical control theory.
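
An EDMD-style sketch of the Koopman idea: lift the state with fixed observables, fit a linear operator K by least squares, and roll the latent state forward as z_{t+1} = K z_t. The toy system and dictionary are invented, and KORR learns the latent space rather than hand-picking it; this shows only the prediction machinery that would condition the residual policy.

    import numpy as np

    rng = np.random.default_rng(8)

    def lift(x):
        """Hand-picked observables lifting the state into a latent
        space where the dynamics are (approximately) linear."""
        return np.array([x[0], x[1], x[0] * x[1], x[0] ** 2, 1.0])

    def f(x):
        """Toy nonlinear system to identify."""
        return np.array([0.9 * x[0], 0.8 * x[1] + 0.1 * x[0] ** 2])

    X = rng.uniform(-1, 1, size=(200, 2))
    Z = np.array([lift(x) for x in X])
    Zp = np.array([lift(f(x)) for x in X])

    # Least-squares (EDMD-style) fit of the linear operator: z' ~ K z.
    K = np.linalg.lstsq(Z, Zp, rcond=None)[0].T

    # Long-horizon rollout entirely in the latent space.
    z = lift(np.array([0.5, -0.3]))
    for _ in range(10):
        z = K @ z
    print(z[:2])   # predicted state after 10 steps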
|
| |
| 09:00-10:30, Paper WeI1I.135 | Add to My Program |
| UniUncer: Unified Dynamic–Static Uncertainty for End-To-End Driving |
|
| Gao, Yu | Bosch (China) Investment Ltd |
| Wang, Jijun | AIR |
| Zhang, Zongzheng | Tsinghua University |
| Jiang, Anqing | Robert Bosch |
| Wang, Yiru | Bosch |
| Heng, Yuwen | Bosch Corporate Research |
| Wang, Shuo | Bosch China |
| Sun, Hao | National University of Singapore |
| Hu, Zhangfeng | Rensselaer Polytechnic Institute |
| Zhao, Hao | Tsinghua University |
Keywords: Autonomous Agents, AI-Enabled Robotics
Abstract: End-to-end (E2E) driving has become a cornerstone of both industry deployment and academic research, offering a single learnable pipeline that maps multi-sensor inputs to actions while avoiding hand-engineered modules. However, the reliability of such pipelines strongly depends on how well they handle uncertainty: sensors are noisy, semantics can be ambiguous, and interaction with other road users is inherently stochastic. Uncertainty also appears in multiple forms: classification vs. localization, and, crucially, in both static map elements and dynamic agents. Existing E2E approaches model only static-map uncertainty, leaving planning vulnerable to overconfident and unreliable inputs. We present UniUncer, the first lightweight, unified uncertainty framework that jointly estimates and uses uncertainty for both static and dynamic scene elements inside an E2E planner. Concretely: (1) we convert deterministic heads to probabilistic Laplace regressors that output per-vertex location and scale for vectorized static and dynamic entities; (2) we introduce an uncertainty-fusion module that encodes these parameters and injects them into object/map queries to form uncertainty-aware queries; and (3) we design an uncertainty-aware gate that adaptively modulates reliance on historical inputs (ego status or temporal perception queries) based on current uncertainty levels. The design adds minimal overhead and drops throughput by only ~0.5 FPS while remaining plug-and-play for common E2E backbones. On nuScenes (open-loop), UniUncer reduces average L2 trajectory error by 7%. On NavsimV2 (pseudo closed-loop), it improves overall EPDMS by 10.8% and achieves notable Stage-2 gains in challenging, interaction-heavy scenes. Ablations confirm that dynamic-agent uncertainty and the uncertainty-aware gate are both necessary.
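The conversion from deterministic heads to probabilistic Laplace regressors admits a compact sketch: a head that outputs per-vertex location and scale, trained with the Laplace negative log-likelihood. Dimensions, names, and wiring below are my assumptions, not the paper's code.

```python
# Minimal sketch of a probabilistic Laplace regression head.
import torch
import torch.nn as nn

class LaplaceHead(nn.Module):
    def __init__(self, dim: int, n_vertices: int):
        super().__init__()
        self.loc = nn.Linear(dim, n_vertices * 2)        # (x, y) location per vertex
        self.log_scale = nn.Linear(dim, n_vertices * 2)  # log b per coordinate

    def forward(self, feats: torch.Tensor):
        return self.loc(feats), self.log_scale(feats)

def laplace_nll(loc, log_scale, target):
    # Laplace NLL: log b + |target - loc| / b  (the constant log 2 is dropped)
    b = log_scale.exp()
    return (log_scale + (target - loc).abs() / b).mean()

head = LaplaceHead(dim=256, n_vertices=20)
feats = torch.randn(8, 256)                              # per-query features
loc, log_b = head(feats)
loss = laplace_nll(loc, log_b, torch.randn_like(loc))
loss.backward()                                          # log_b then encodes per-vertex uncertainty
```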
|
| |
| 09:00-10:30, Paper WeI1I.136 | Add to My Program |
| Drone Air Traffic Control: Tracking a Set of Moving Objects with Minimal Power |
|
| Loi, Chek-Manh | Technische Universität Braunschweig |
| Perk, Michael | Technische Universität Braunschweig |
| Hoffmann, Malte | Technische Universität Braunschweig |
| Fekete, Sándor | Technische Universität Braunschweig |
Keywords: Surveillance Robotic Systems, Aerial Systems: Applications, Swarm Robotics
Abstract: A common sensing problem is to use a set of stationary tracking locations to monitor a collection of moving devices: Given n objects that need to be tracked, each following its own trajectory, and m stationary traffic control stations, each with a sensing region of adjustable range, how should we adjust the individual sensor ranges in order to optimize energy consumption? We provide both negative theoretical and positive practical results for this important and natural challenge. On the theoretical side, we show that even if all objects move at constant speed along straight lines, no polynomial-time algorithm can guarantee optimal coverage for a given starting solution. On the practical side, we present an algorithm based on geometric insights that is able to find optimal solutions for the min-max variant of the problem, which aims at minimizing peak power consumption. Runtimes for instances with 500 moving objects and 25 stations are on the order of seconds for scenarios that take minutes to play out in the real world, demonstrating the real-time capability of our methods.
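As a back-of-the-envelope reading of the min-max objective (my simplification, not the paper's algorithm): if power grows monotonically with sensing range, then at each instant assigning every object to its nearest station minimizes the peak required range, so the peak over a trajectory is the largest object-to-nearest-station distance over time.

```python
# Toy min-max peak-range computation for two stations and two moving objects.
import numpy as np

stations = np.array([[0.0, 0.0], [10.0, 0.0]])           # station positions in metres

def trajectory(t):                                        # objects on straight-line paths
    return np.array([[t, 1.0], [10.0 - t, -2.0]])

def peak_range(times):
    peak = 0.0
    for t in times:
        objs = trajectory(t)
        # pairwise object-to-station distances, shape (objects, stations)
        d = np.linalg.norm(objs[:, None, :] - stations[None, :, :], axis=-1)
        peak = max(peak, d.min(axis=1).max())             # worst object's nearest station
    return peak

print(peak_range(np.linspace(0.0, 10.0, 101)))            # required peak sensing range
```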
|
| |
| 09:00-10:30, Paper WeI1I.137 | Add to My Program |
| Full-Scale Autonomous Highway Inspection with Quadruped Robot: Multi-Level Locomotion Learning in Complex Environments |
|
| Ma, Chenxiang | Southeast University |
| Xu, Chengcheng | Southeast University |
| Wang, Feng | Southeast University |
Keywords: Automation Technologies for Smart Cities, Intelligent Transportation Systems, Reinforcement Learning
Abstract: This paper proposes an innovative approach to full-scale autonomous highway inspection in complex environments using a quadruped robot, enhancing the adaptability and coverage of inspection tasks. Considering adaptive locomotion control as the foundation of autonomous inspection, a multi-level locomotion learning framework based on reinforcement learning is developed, comprising primitive, skill, and inspection levels. The primitive-level control policy, built upon a Vector Quantized Variational Autoencoder, is trained through imitation learning from existing open-source robot locomotion models, achieving discrete embedding and reuse of foundational locomotion knowledge. At the skill level, to support learning of diverse inspection skills, a parametric modular scenario modeling method for the highway environment is proposed. Each skill-level control network is trained in its corresponding modular scenario while reusing the primitive-level control network. The inspection-level control network is established through multi-skill distillation from the trained skill control networks. Combined with a coverage path generator, automatic inspection can be completed. In a simulated complex highway environment, the inspection robot demonstrates diverse inspection skills, successfully inspecting a 14,400 m² area in 0.4 h at a speed of 2.37 m/s. Coverage and hazard detection rates both reach 100%. Compared to existing highway inspection approaches, the proposed framework with a quadruped robot enables efficient, stable, and full-scale autonomous inspection in complex highway environments and provides general deployment capability for intelligent inspection systems.
|
| |
| 09:00-10:30, Paper WeI1I.138 | Add to My Program |
| PolyMap: A Perceptive Locomotion Framework for Humanoid Robot Stair Climbing |
|
| Li, Bingquan | Guangdong University of Technology |
| Wang, Ning | Nanjing University of Aeronautics and Astronautics |
| He, Zhicheng | Harbin Institute of Technology |
| Wu, Yucong | Southern University of Science and Technology |
| Zhang, Tianwei | The University of Tokyo |
Keywords: Humanoid Robot Systems, Humanoid and Bipedal Locomotion, Visual-Inertial SLAM
Abstract: Recently, bipedal robot walking technology has developed significantly, but mainly for blind walking schemes. To emulate human walking, robots need to step accurately onto the positions they see in unknown spaces. In this paper, we present PolyMap, a perception-based locomotion planning framework for humanoid robots to climb stairs. Our core idea is to build a real-time polygonal staircase plane semantic map, followed by footstep planning using these polygonal plane segments. Plane segmentation and visual odometry are performed by multi-sensor fusion (LiDAR, RGB-D camera, and IMUs). The proposed framework is deployed on an NVIDIA Orin, producing whole-body motion plans at 20-30 Hz. Both indoor and outdoor real-scene experiments indicate that our method is efficient and robust for humanoid robot stair climbing.
|
| |
| 09:00-10:30, Paper WeI1I.139 | Add to My Program |
| Real-Time Trajectory Optimization for Continuum Robots in Human–Robot Interaction Using Vision-Based Target Pose Estimation |
|
| Tang, Duo | The University of Hong Kong |
| Peng, Rui | The University of Hong Kong |
| Deng, Ping | The University of Hong Kong |
| Cao, Xiao | The University of Hong Kong |
| Lu, Peng | The University of Hong Kong |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Robot Applications
Abstract: Continuum robots possess intrinsic compliance, high flexibility, and continuously deformable structures, making them well-suited for safe human–robot interaction (HRI). However, their continuous backbone and high degrees of freedom pose significant challenges for real-time trajectory generation: motions must satisfy curvature constraints while adapting to uncertain and rapidly changing human inputs. Existing methods can generate smooth and feasible paths, but many are computationally intensive, neglect curvature continuity or mechanical constraints, or lack adaptability to dynamic environments. As a result, producing smooth, feasible, and responsive trajectories for continuum robots in interactive scenarios remains challenging. To address this, we propose a real-time trajectory optimization framework that integrates temporally filtered, vision-based human intention signals with curvature-constrained planning. Human hand motions are converted into stable reference signals, which guide a sliding-window sequential quadratic programming (SQP) optimizer. The planner continuously generates smooth and feasible trajectories that adapt in real time to evolving inputs. Simulations and hardware experiments demonstrate accurate tracking, robustness to noise, and timely adaptation, highlighting the framework’s potential to enable safe and natural human–continuum robot collaboration in real-world applications.
|
| |
| 09:00-10:30, Paper WeI1I.140 | Add to My Program |
| GeoGS-SLAM: Online Monocular Reconstruction Using Gaussian Splatting with Geometric Priors |
|
| Gao, Ruilan | Zhejiang University |
| Jin, Letian | Zhejiang University |
| Zhang, Yu | Zhejiang University |
Keywords: SLAM, Mapping, Localization
Abstract: SLAM methods based on 3D Gaussian Splatting (3DGS) have demonstrated impressive tracking and mapping performance, but typically require additional geometric information from external depth sensors. Meanwhile, recent SLAM systems that leverage geometric priors from pre-trained feed-forward models enable real-time dense reconstruction, yet often discard original RGB information during optimization, thus degrading overall reconstruction quality. We present GeoGS-SLAM, an online monocular dense reconstruction system that combines the 3DGS-based map representation with learned geometric priors. Given uncalibrated RGB input, we first employ a feed-forward visual geometry model to predict camera and scene priors. The Gaussian scene map is then expanded by directly sampling Gaussian primitives from both RGB input and geometric priors. Camera poses and the scene map are jointly optimized through a coarse-to-fine strategy that minimizes both photometric and geometric losses. To ensure global consistency, we further incorporate online loop closure detection and pose graph optimization. Extensive experiments across indoor and outdoor benchmarks demonstrate that GeoGS-SLAM achieves superior rendering quality and tracking accuracy compared to state-of-the-art methods while maintaining online real-time performance. Project page: https://rlgao.github.io/geogs_slam.
|
| |
| 09:00-10:30, Paper WeI1I.141 | Add to My Program |
| Python Bindings for a Large C++ Robotics Library: The Case of OMPL |
|
| Guo, Weihang | Rice University |
| Tyrovouzis, Theodoros | Rice University |
| Kavraki, Lydia | Rice University |
Keywords: Methods and Tools for Robot System Design
Abstract: Python bindings are a critical bridge between high-performance C++ libraries and the flexibility of Python, enabling rapid prototyping, reproducible experiments, and integration with simulation and learning frameworks. Yet, generating bindings for large codebases is a tedious process that creates a heavy burden for a small group of maintainers. In this work, we investigate the use of Large Language Models (LLMs) to assist in generating Nanobind wrappers, with human experts kept in the loop. Our workflow mirrors the structure of the C++ codebase, scaffolds empty wrapper files, and employs LLMs to fill in binding definitions. Experts then review and refine the generated code to ensure correctness, compatibility, and performance. Through a case study on a large C++ motion planning library, we document common failure modes, including mismanaged shared pointers, overloads, and trampolines, and show how in-context examples and careful prompt design improve reliability. Experiments demonstrate that the resulting bindings achieve runtime performance comparable to legacy solutions. Beyond this case study, our results provide general lessons for applying LLMs to binding generation in large-scale C++ projects.
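For context, the payoff of such bindings is Python-side planning code like the hypothetical sketch below, written in the style of OMPL's existing Python interface; the exact module layout and call signatures depend on the generated wrappers, so treat every name here as an assumption.

```python
# Hypothetical usage of generated OMPL bindings (module layout assumed to
# mirror the C++ namespaces ompl::base / ompl::geometric).
from ompl import base as ob
from ompl import geometric as og

space = ob.RealVectorStateSpace(2)          # plan in a 2D real vector space
bounds = ob.RealVectorBounds(2)
bounds.setLow(-1.0)
bounds.setHigh(1.0)
space.setBounds(bounds)

ss = og.SimpleSetup(space)
# trivial validity checker for the sketch: every state is collision-free
ss.setStateValidityChecker(ob.StateValidityCheckerFn(lambda s: True))

start, goal = ob.State(space), ob.State(space)
start[0], start[1] = -0.5, -0.5
goal[0], goal[1] = 0.5, 0.5
ss.setStartAndGoalStates(start, goal)

if ss.solve(1.0):                           # one-second planning budget
    print(ss.getSolutionPath())
```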
|
| |
| 09:00-10:30, Paper WeI1I.142 | Add to My Program |
| Rethinking the Practicality of Vision-Language-Action Model: A Comprehensive Benchmark and an Improved Baseline |
|
| Song, Wenxuan | The Hong Kong University of Science and Technology (Guangzhou) |
| Chen, Jiayi | Hong Kong University of Science and Technology (Guangzhou) |
| Sun, Xiaoquan | Huazhong University of Science and Technology |
| Lei, Huashuo | The Hong Kong University of Science and Technology (Guangzhou) |
| Qin, Yikai | The Hong Kong University of Science and Technology (Guangzhou) |
| Zhao, Wei | Westlake University |
| Ding, Pengxiang | Westlake University |
| Zhao, Han | Westlake University |
| Wang, Tongxin | The Hong Kong University of Science and Technology (Guangzhou) |
| Hou, Pengxu | The Hong Kong University of Science and Technology (GuangZhou) |
| Zhong, Zhide | The Hong Kong University of Science and Technology (Guangzhou) |
| Yan, Haodong | The Hong Kong University of Science and Technology (Guangzhou) |
| Wang, Donglin | Westlake University |
| Ma, Jun | The Hong Kong University of Science and Technology |
| Li, Haoang | Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Perception-Action Coupling, Learning from Demonstration, Imitation Learning
Abstract: Vision-Language-Action (VLA) models have emerged as generalist robotic agents. However, existing VLAs are hindered by excessive parameter scales, prohibitive pre-training requirements, and limited applicability to diverse embodiments. To improve the practicality of VLAs, we propose a comprehensive benchmark and an improved baseline. First, we propose CEBench, a new benchmark spanning diverse embodiments in both simulation and the real world with consideration of domain randomization. We collect 14.4k simulated trajectories and 1.6k real-world expert-curated trajectories to support training on CEBench. Second, using CEBench as our testbed, we study three critical aspects of VLAs' practicality and offer several key findings. Informed by these findings, we introduce LightVLA, a lightweight yet powerful VLA designed for practical deployment on consumer-grade GPUs. Architecturally, it integrates a compact VLM backbone with multi-view perception, proprioceptive tokenization, and action chunking. To eliminate reliance on costly pre-training, LightVLA adopts a two-stage training paradigm including post-training and fine-tuning. Furthermore, LightVLA extends the action space to unify navigation and manipulation. Experiments across embodiments demonstrate the generalization and versatility of LightVLA, while real-world mobile manipulation experiments establish it as the first end-to-end VLA model for mobile manipulation. We will open-source all datasets, code, and checkpoints upon acceptance to foster reproducibility and future research.
|
| |
| 09:00-10:30, Paper WeI1I.143 | Add to My Program |
| Language Enabled Hierarchical Scene Graphs for Precision Agriculture Autonomy |
|
| Mukuddem, Adam | University of Cape Town |
| Speed-Andrews, John Adam | University of Cape Town |
| Amayo, Paul | University of Cape Town |
Keywords: Robotics and Automation in Agriculture and Forestry, Field Robots
Abstract: The focus on human-robot collaboration has emerged as a pivotal area in the advancement of precision agricultural systems. This strategy exploits the distinct strengths of both humans and robots while minimising the exertion of each. A central aim within human-robot collaboration is to create robotic systems that are capable of understanding instructions given in natural language. Agricultural settings, especially those with structured rows of crops, are characteristically uniform, presenting difficulties in accurately grounding instructions and navigating the space. In this paper, we establish a systematic method for robotic platforms operating within agricultural settings to recognize natural language directives and autonomously traverse toward specified targets, gathering data en route. We advance the 3D Scene graph model introduced in Osiris [3], adapting it to support autonomy through a Visual Teach and Repeat paradigm, which does not rely on an expansive navigation stack. Additionally, we exploit large language models to correctly ground instructions within the newly constructed 3D scene graph representation, thus enabling natural language directives to be relayed to robotic systems in agricultural contexts. The system’s ability to interpret and execute natural language commands is confirmed through validation and evaluation in a practical agricultural scenario via a ground robot.
|
| |
| 09:00-10:30, Paper WeI1I.144 | Add to My Program |
| TaSA: Two-Phased Deep Predictive Learning of Tactile Sensory Attenuation for Improving In-Grasp Manipulation |
|
| Ponnivalavan, Pranav | Waseda University |
| Funabashi, Satoshi | Waseda University |
| Schmitz, Alexander | Waseda University |
| Ogata, Tetsuya | Waseda University |
| Sugano, Shigeki | Waseda University |
Keywords: In-Hand Manipulation, Deep Learning in Grasping and Manipulation, Force and Tactile Sensing
Abstract: Humans can achieve diverse in-hand manipulations, such as object pinching and tool use, which often involve simultaneous contact between the object and multiple fingers. This is still an open issue for robotic hands because such dexterous manipulation requires distinguishing between tactile sensations generated by their self-contact and those arising from external contact. Otherwise, contacts and collisions can damage the object or the robot. Indeed, most approaches sidestep self-contact altogether, constraining motion to avoid it or discarding self-generated tactile information during contact. While this reduces complexity, it also limits generalization to real-world scenarios where self-contact is inevitable. Humans overcome this challenge through self-touch perception, using predictive mechanisms that anticipate the tactile consequences of their own motion, through a principle called sensory attenuation, whereby the nervous system attenuates predictable self-touch signals, allowing novel object stimuli to stand out as relevant. Building on this, we introduce TaSA, a two-phased deep predictive learning framework. In the first phase, TaSA explicitly learns self-touch dynamics, modeling how a robot's own actions generate tactile feedback. In the second phase, this learned model is incorporated into motion learning to emphasize object contact signals during manipulation. We evaluate TaSA on a set of insertion tasks, which demand fine tactile discrimination: inserting a pencil lead into a mechanical pencil, inserting coins into a slot, and fixing a paper clip onto a sheet of paper, with various orientations, positions, and sizes. Across all tasks, policies trained with TaSA achieve significantly higher success rates than baseline methods, demonstrating that structured tactile perception with self-touch based on sensory attenuation is critical for dexterous robotic manipulation.
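The sensory-attenuation idea lends itself to a minimal sketch: subtract a learned self-touch prediction from the raw tactile reading so that external contact stands out. All names and shapes below are illustrative assumptions, not the TaSA implementation.

```python
# Minimal sketch of tactile sensory attenuation via a learned self-touch model.
import torch
import torch.nn as nn

# phase-1 model: predicts the tactile consequences of the hand's own commands
self_touch_model = nn.Sequential(nn.Linear(14, 64), nn.ReLU(), nn.Linear(64, 128))

def attenuate(tactile: torch.Tensor, joint_cmd: torch.Tensor) -> torch.Tensor:
    """tactile: (B, 128) raw taxel readings; joint_cmd: (B, 14) hand commands."""
    predicted_self = self_touch_model(joint_cmd)     # frozen during phase-2 policy learning
    return tactile - predicted_self                  # residual highlights novel (object) contact

residual = attenuate(torch.randn(4, 128), torch.randn(4, 14))
```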
|
| |
| 09:00-10:30, Paper WeI1I.145 | Add to My Program |
| SVP: Improving Vision-Language-Action Models with Dual Stochastic Visual Prompting |
|
| Zhong, Zhide | The Hong Kong University of Science and Technology (Guangzhou) |
| Yan, Haodong | The Hong Kong University of Science and Technology (Guangzhou) |
| Zhang, Tianran | The Hong Kong University of Science and Technology(Guangzhou) |
| Wang, Lujia | Hong Kong University of Science and Technology (Guangzhou) |
| Wu, Jin | University of Science and Technology Beijing |
| Ma, Jun | The Hong Kong University of Science and Technology |
| Zheng, Xinhu | The HongKong University of Science and Technology (Guangzhou) |
| Li, Haoang | Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Imitation Learning, Representation Learning, Machine Learning for Robot Control
Abstract: Vision-Language-Action (VLA) models, such as OpenVLA, hold the promise of generalist robots, yet their performance is often impaired by distracted attention, which we identify as a manifestation of shortcut learning. We posit that the solution lies not in architectural modifications, but in a new training paradigm centered on visual prompts that provide explicit visual guidance to the model. We introduce Dual Stochastic Visual Prompting (SVP) as a concrete realization of this paradigm. SVP functions as a training-only "visual scaffold", a non-invasive mechanism that requires no architectural modifications. Our work demonstrates that this data-centric training paradigm is a highly effective strategy for mitigating distracted attention, enabling the learning of more robust and capable policies without architectural overhead. SVP yields substantial gains on the challenging LIBERO benchmark and in real-robot experiments. It improves the absolute success rate of the standard OpenVLA by 8.2% on long-horizon tasks and enhances the performance of the highly optimized OpenVLA-OFT. These improvements are validated on a real robot, where our model consistently outperforms baselines across a variety of manipulation tasks.
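One illustrative reading of a training-only visual prompt (the concrete cue used by SVP may differ) is to overlay an explicit marker on the task-relevant region with some probability during training and never at inference:

```python
# Minimal sketch of a stochastic, training-only visual prompt overlay.
import numpy as np

def apply_svp(image: np.ndarray, target_xy, p: float = 0.5, radius: int = 6):
    """image: HxWx3 uint8; target_xy: (x, y) pixel of the task-relevant object."""
    if np.random.rand() > p:
        return image                              # stochastic: often train unprompted
    out = image.copy()
    y, x = np.ogrid[:image.shape[0], :image.shape[1]]
    mask = (x - target_xy[0]) ** 2 + (y - target_xy[1]) ** 2 <= radius ** 2
    out[mask] = (255, 0, 0)                       # red disc as the explicit visual cue
    return out

img = np.zeros((224, 224, 3), dtype=np.uint8)
prompted = apply_svp(img, target_xy=(100, 120))   # applied only during training
```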
|
| |
| 09:00-10:30, Paper WeI1I.146 | Add to My Program |
| Cross-Distill: Multi-Manifold and Viewpoint-Decoupled Distillation for Cross-View Geo-Localization |
|
| Gao, Jiaxu | Northeastern University |
| Zhao, Shuying | Northeastern University |
| Zhang, Yunzhou | Northeastern University |
| Zhou, Hongyu | Northeastern University |
| Qi, Man | Northeastern University |
| Shen, Jiabo | Northeastern University |
| Zhang, Yu | Northeastern University |
Keywords: Localization, Deep Learning for Visual Perception, Aerial Systems: Perception and Autonomy
Abstract: Cross-View Geo-Localization (CVGL) localizes a query image via retrieval from georeferenced satellite imagery, yet severe viewpoint variation remains a central challenge. Recent advances often rely on heavy backbones or add-on modules that achieve high accuracy but are impractical on resource-constrained UAVs. To balance accuracy and efficiency, we introduce Cross-Distill, a knowledge-distillation framework for CVGL. Cross-Distill performs Cross-Similarity Ranking Distillation by constructing a teacher-student interaction matrix to enforce ranking consistency and enhance discrimination. Building on this, it introduces Viewpoint Decoupling, which partitions ranking relations into intra-view, intra-to-cross-view, and cross-to-cross-view, enabling precise modeling of cross-view dependencies and improving class compactness and separability. Cross-Distill further employs Multi-Manifold Feature Distillation that jointly enforces angular consistency on the spherical manifold, preserves local distances in Euclidean space, and leverages hyperbolic distance as a negatively curved metric to strengthen teacher–student alignment. Experiments on University-1652 and SUES-200 show that the distilled student achieves significant gains with low complexity (31.43M parameters, 13.09 GFLOPs), and an inference time of only 62.02 ms per image on an RK3588. For instance, on University-1652 UAV→SAT retrieval, R@1 improves from 75.97% to 94.43% and AP from 79.24% to 95.33%.
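The cross-similarity ranking distillation component can be sketched as a row-wise KL divergence between softened teacher and student query-gallery similarity distributions; the temperature and exact loss form below are assumptions on my part.

```python
# Minimal sketch of cross-similarity ranking distillation for retrieval.
import torch
import torch.nn.functional as F

def ranking_distill(s_q, s_g, t_q, t_g, tau: float = 0.05):
    """q/g: L2-normalized query and gallery embeddings for student (s) / teacher (t)."""
    sims_s = s_q @ s_g.T / tau                        # student query-gallery similarities
    sims_t = t_q @ t_g.T / tau                        # teacher query-gallery similarities
    # align the student's ranking distribution with the teacher's, row by row
    return F.kl_div(sims_s.log_softmax(-1), sims_t.softmax(-1), reduction="batchmean")

s_q = F.normalize(torch.randn(8, 128), dim=-1)
s_g = F.normalize(torch.randn(32, 128), dim=-1)
t_q = F.normalize(torch.randn(8, 512), dim=-1)        # teacher may use a wider embedding
t_g = F.normalize(torch.randn(32, 512), dim=-1)
loss = ranking_distill(s_q, s_g, t_q, t_g)
```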
|
| |
| 09:00-10:30, Paper WeI1I.147 | Add to My Program |
| Global End-Effector Pose Control of an Underactuated Aerial Manipulator Via Reinforcement Learning |
|
| Deshmukh, Shlok | Delft University of Technology |
| Alonso-Mora, Javier | Delft University of Technology |
| Sun, Sihao | Delft University of Technology |
Keywords: Aerial Systems: Applications, Aerial Systems: Mechanics and Control, Reinforcement Learning
Abstract: Aerial manipulators, which combine robotic arms with multi-rotor drones, face strict constraints on arm weight and mechanical complexity. In this work, we study a lightweight 2-degree-of-freedom (DoF) arm mounted on a quadrotor via a differential mechanism, capable of full six-DoF end-effector pose control. While the minimal design enables simplicity and reduced payload, it also introduces challenges such as underactuation and sensitivity to external disturbances. To address these, we employ reinforcement learning, training a Proximal Policy Optimization (PPO) agent in simulation to generate feedforward commands for quadrotor acceleration and body rates, along with joint angle targets. These commands are tracked by an incremental nonlinear dynamic inversion (INDI) attitude controller and a PID joint controller, respectively. Flight experiments demonstrate centimeter-level position accuracy and degree-level orientation precision, with robust performance under external force disturbances—including manipulation of heavy loads and pushing tasks. The results highlight the potential of learning-based control strategies for enabling contact-rich aerial manipulation using simple, lightweight platforms. Videos of the experiment and the method are summarized in https://youtu.be/bWLTPqKcCOA.
|
| |
| 09:00-10:30, Paper WeI1I.148 | Add to My Program |
| Exploiting Vulnerabilities: Universal Adversarial Attacks on Vision-Language-Action Models in Robotics |
|
| Yang, Songhua | Wuhan University |
| Liu, Ziyu | WuHan University |
| Liu, Yuanwei | Wuhan University |
| Li, Xuetao | Wuhan University |
| Fei, Xuanye | Wuhan University |
| Huang, He | Wuhan University |
| Wang, Zheng | Wuhan University |
| Li, Miao | Wuhan University |
Keywords: Robot Safety, Bimanual Manipulation, AI-Enabled Robotics
Abstract: In recent years, Vision-Language-Action (VLA) models have revolutionized robotic manipulation by seamlessly integrating visual perception, language understanding, and action generation in an end-to-end learning framework. However, because these models are designed to interact directly with the physical world and with humans, their safety is critical: even small vulnerabilities can lead to catastrophic failures. In this work, we propose a universal adversarial object, a sphere with an optimized surface texture that significantly reduces task success rates when placed within a robot's field of view. Specifically, our method introduces a multi-level attack framework that jointly disrupts trajectory planning, task execution, and action control. We validate our approach in both simulated and real-world robotic environments. Experimental results show that the adversarial object, across two representative VLA models (Pi0 and RDT
|
| |
| 09:00-10:30, Paper WeI1I.149 | Add to My Program |
| Scalable Multi-Objective Robot Reinforcement Learning through Gradient Conflict Resolution |
|
| Munn, Humphrey James Lee | University of Queensland |
| Tidd, Brendan | CSIRO |
| Bohm, Peter | The University of Queensland |
| Gallagher, Marcus | University of Queensland |
| Howard, David | CSIRO |
Keywords: Reinforcement Learning, Machine Learning for Robot Control, Whole-Body Motion Planning and Control
Abstract: Reinforcement Learning (RL) robot controllers usually aggregate many task objectives into one scalar reward. While large-scale proximal policy optimisation (PPO) has enabled impressive results such as robust real-world robot locomotion, many tasks still require careful reward tuning and remain brittle to local optima. Tuning cost and sub-optimality grow with the number of objectives, limiting scalability. Modelling reward vectors and their trade-offs can address these issues; however, multi-objective methods remain underused in RL for robotics because of computational cost and optimisation difficulty. In this work, we study gradient conflicts that arise when multiple task objectives are combined into a scalar reward. In particular, we explicitly address the conflict between task-based rewards and terms that regularise the policy towards realistic behaviour. We propose GCR-PPO, a lightweight modification to PPO that decomposes actor updates into objective-wise gradients using a multi-headed critic and resolves conflicts according to objective priority. We evaluate GCR-PPO on IsaacLab manipulation and locomotion benchmarks and two additional tasks modified to include many objectives. GCR-PPO demonstrates superior scalability compared to massively-parallel PPO (p = 0.04) without significant computational overhead. Across tasks, GCR-PPO improves performance over large-scale PPO by an average of 9.5% (Symmetric Percentage Change), with larger gains on tasks exhibiting higher gradient conflict. Code is available at: https://github.com/humphreymunn/GCR-PPO.
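A minimal sketch of priority-ordered gradient conflict resolution, in the spirit of the description above (the concrete projection rule and ordering are my assumptions): when a lower-priority objective's gradient conflicts with a higher-priority one, the conflicting component is projected out before summing.

```python
# Toy priority-ordered gradient conflict resolution.
import torch

def resolve(grads):
    """grads: flattened per-objective gradients, ordered high -> low priority."""
    adjusted = [grads[0].clone()]
    for i in range(1, len(grads)):
        g = grads[i].clone()
        for h in grads[:i]:                          # every higher-priority gradient
            dot = torch.dot(g, h)
            if dot < 0:                              # conflict: remove the component along h
                g -= dot / (h.norm() ** 2 + 1e-12) * h
        adjusted.append(g)
    return torch.stack(adjusted).sum(dim=0)          # combined update direction

g_task = torch.tensor([1.0, 0.0])                    # task reward gradient
g_reg = torch.tensor([-1.0, 1.0])                    # regularizer, conflicts with the task
print(resolve([g_task, g_reg]))                      # -> tensor([1., 1.])
```

In the toy example, the regularization gradient keeps only its component that does not oppose the task gradient, which is the behavior the priority ordering is meant to guarantee.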
|
| |
| 09:00-10:30, Paper WeI1I.150 | Add to My Program |
| Tool-Grasp: A 6-DoF Functional Grasping Framework for General-Purpose Hand Tools |
|
| Lei, Hongliang | Huazhong University of Science and Technology |
| Huang, Jian | Huazhong University of Science and Technology |
| Li, Andong | Huazhong University of Science and Technology |
| Wang, Haoyuan | Huazhong University of Science and Technology |
| Liu, Chen | Huazhong University of Science and Technology |
| Luo, Wei | China Ship Development &design Center |
| Xiang, Jiuyao | The Hong Kong University of Science and Technology |
Keywords: Grasping, RGB-D Perception, Data Sets for Robot Learning
Abstract: Detecting functional grasp poses for tool operation is critical for robots in complex real-world tasks, yet existing methods lack this capability. Key challenges are: 1) Scarce realworld datasets with fine-grained functional labels and task-valid grasp annotations, as their construction requires domain knowledge (making annotation labor-intensive/subjective) and linking poses to tool usage (beyond stability checks); 2) Difficulty in fine-grained functional segmentation, where minimal sub-region differences are overwhelmed by global cues/noise, with 3D model-dependent methods impractical in unstructured settings; 3) Poor 6-DoF grasp alignment with functional regions due to high morphological heterogeneity, as existing methods either fail to balance stability and functional constraints (high-score grasps outside regions) or are limited to low degrees of freedom. To address these, we build the Tool-Grasp Dataset (20 tool categories, 50 scenes, 12,600 RGB-D images, 250M+ 6-DoF annotations) with fine-grained functional labels. We propose ToolGrasp, a two-stage 6-DoF framework: Stage 1’s Mask-Guided Grasp Region Segmentation Network (MG-GRSN) leverages tool-specific semantics to output precise functional masks, mitigating intra-tool variability; Stage 2’s Quality-Aware MultiModal Grasp Pose Detection Network (QAM-GPDN) uses these masks to constrain predictions, fusing RGB-D features with a quality module to select aligned poses. Experiments show MGGRSN outperforms baselines by 3.5% (seen) and 5.2% (unseen) in mIoU; QAM-GPDN boosts functional pose AP by 2.89% (seen) and 3.76% (unseen). Real-robot experiments validate real-world effectiveness.
|
| |
| 09:00-10:30, Paper WeI1I.151 | Add to My Program |
| User-Tailored Learning to Forecast Walking Modes for Exosuits |
|
| Abbate, Gabriele | Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA), USI-SUPSI |
| Tricomi, Enrica | Technical University of Munich |
| Gierden, Nathalie | Technical University of Munich |
| Giusti, Alessandro | IDSIA USI-SUPSI |
| Masia, Lorenzo | Technische Universität München (TUM) |
| Paolillo, Antonio | IDSIA USI-SUPSI |
|
|
| |
| 09:00-10:30, Paper WeI1I.152 | Add to My Program |
| Commonsense-Guided Object Graph Reasoning with Policy Regularization for Object Goal Navigation |
|
| Meng, Yiyue | Wuhan University |
| Li, Aolin | Wuhan University |
| Zhan, Jiao | Wuhan University |
| Li, Shenxin | Wuhan University |
| Guo, Chi | Wuhan University |
Keywords: Vision-Based Navigation, Reinforcement Learning, Representation Learning
Abstract: Object goal navigation aims to guide an agent to find a specific target object in an unseen environment using only first-person visual observations. It requires the agent to enhance scene understanding and train a robust navigation policy. To address this, we propose two complementary techniques, commonsense-guided object graph reasoning (COGR) and policy regularization (PR). Specifically, COGR improves the agent's scene understanding by integrating object relationships, including category proximity and spatial correlation. It extracts co-occurrence embeddings of the target object from a large language model (LLM) as commonsense knowledge to guide object graph reasoning, enabling the agent to reason beyond visual co-occurrence observed in training environments. PR is a knowledge distillation-inspired regularization mechanism, where a commonsense-free model is used to regularize the navigation policy of the commonsense-guided model. We propose PR to mitigate potential performance degradation caused by knowledge bias from the LLM, enabling the training of a more robust navigation policy. Experiments in the AI2-Thor and RoboThor environments demonstrate the effectiveness and efficiency of our proposed method, and real-world deployment further validates its transferability.
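The policy-regularization term can be read as a KL penalty pulling the commonsense-guided policy toward a commonsense-free reference; the weighting and detach choices below are assumptions, not the authors' code.

```python
# Minimal sketch of a KD-inspired policy regularization term.
import torch
import torch.nn.functional as F

def pr_loss(logits_guided, logits_free, beta: float = 0.1):
    """Both inputs are action logits for the same observation batch."""
    p_free = logits_free.detach().softmax(dim=-1)        # frozen commonsense-free reference
    return beta * F.kl_div(logits_guided.log_softmax(-1), p_free,
                           reduction="batchmean")

loss = pr_loss(torch.randn(32, 6), torch.randn(32, 6))   # e.g., 6 discrete nav actions
```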
|
| |
| 09:00-10:30, Paper WeI1I.153 | Add to My Program |
| MoE-DP: An MoE-Enhanced Diffusion Policy for Robust Long-Horizon Robotic Manipulation with Skill Decomposition and Failure Recovery |
|
| Cheng, Baiye | Huazhong University of Science and Technology |
| Liang, Tianhai | Tsinghua University |
| Huang, Suning | Stanford University |
| Shao, Maanping | Tsinghua University |
| Zhang, Feihong | Tsinghua University |
| Xu, Botian | Tsinghua University |
| Xue, Zhengrong | Tsinghua University |
| Xu, Huazhe | Tsinghua University |
Keywords: Imitation Learning, Learning from Demonstration, Deep Learning in Grasping and Manipulation
Abstract: Diffusion policies have emerged as a powerful framework for robotic visuomotor control, yet they often lack the robustness to recover from subtask failures in long-horizon, multi-stage tasks and their learned representations of observations are often difficult to interpret. In this work, we propose the Mixture of Experts-Enhanced Diffusion Policy (MoE-DP), where the core idea is to insert a Mixture of Experts (MoE) layer between the visual encoder and the diffusion model. This layer decomposes the policy's knowledge into a set of specialized experts, which are dynamically activated to handle different phases of a task. We demonstrate through extensive experiments that MoE-DP exhibits a strong capability to recover from disturbances, significantly outperforming standard baselines in robustness. On a suite of 6 long-horizon simulation tasks, this leads to a 36% average relative improvement in success rate under disturbed conditions. This enhanced robustness is further validated in the real world, where MoE-DP also shows significant performance gains. We further show that MoE-DP learns an interpretable skill decomposition, where distinct experts correspond to semantic task primitives (e.g., approaching, grasping). This learned structure can be leveraged for inference-time control, allowing for the rearrangement of subtasks without any re-training.
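A minimal sketch of the MoE bottleneck between the visual encoder and the diffusion head (soft routing, dimensions, and expert shape are my assumptions): the routing weights are what make the learned skill decomposition inspectable.

```python
# Minimal sketch of a mixture-of-experts conditioning layer.
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, dim: int, n_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor):
        weights = self.gate(x).softmax(dim=-1)               # (B, E) routing weights
        outs = torch.stack([e(x) for e in self.experts], 1)  # (B, E, D)
        y = (weights.unsqueeze(-1) * outs).sum(dim=1)        # soft mixture of experts
        return y, weights                                     # weights expose the active "skill"

moe = MoELayer(dim=128)
obs_feat = torch.randn(16, 128)                               # visual encoder output
cond, routing = moe(obs_feat)                                 # cond feeds the diffusion model
print(routing.argmax(dim=-1))                                 # per-sample dominant expert
```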
|
| |
| 09:00-10:30, Paper WeI1I.154 | Add to My Program |
| Interpretable Multimodal Gesture Recognition for Drone and Mobile Robot Teleoperation Via Log-Likelihood Ratio Fusion |
|
| Baek, Seungyeol | Korea University |
| Singh, Jaspreet | Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau (RPTU) |
| Ray, Lala Shakti Swarup | DFKI kaiserslautern |
| Bello, Hymalai | German Research Center for Artificial Intelligence (DFKI) |
| Lukowicz, Paul | DFKI Kaiserslautern |
| Suh, Sungho | Korea University |
|
|
| |
| 09:00-10:30, Paper WeI1I.155 | Add to My Program |
| UMBRELLA: Uncertainty-Aware Multi-Robot Reactive Coordination under Dynamic Temporal Logic Tasks |
|
| Zhao, Qisheng | Peking University |
| Guo, Meng | Peking University |
| Du, Hengxuan | Peking University |
| Lindemann, Lars | University of Southern California |
| Li, Zhongkui | Peking University |
Keywords: Multi-Robot Systems, Planning under Uncertainty, Task and Motion Planning
Abstract: Multi-robot systems can be extremely efficient for accomplishing team-wise tasks by acting concurrently and collaboratively. However, most existing methods either assume static task features or simply replan when environmental changes occur. This paper addresses the challenging problem of coordinating multi-robot systems for collaborative tasks involving dynamic and moving targets. We explicitly model the uncertainty in target motion prediction via Conformal Prediction (CP), while respecting the spatial-temporal constraints specified by Linear Temporal Logic (LTL). The proposed framework (UMBRELLA) combines the Monte Carlo Tree Search (MCTS) over partial plans with uncertainty-aware rollouts, and introduces a CP-based metric to guide and accelerate the search. The objective is to minimize the Conditional Value at Risk (CVaR) of the average makespan. For tasks released online, a receding-horizon planning scheme dynamically adjusts the assignments based on updated task specifications and motion predictions. Spatial and temporal constraints among the tasks are always ensured, and only partial synchronization is required for the collaborative tasks during online execution. Extensive large-scale simulations and hardware experiments demonstrate significant reductions in both the average makespan and its variance by 23% and 71%, compared with static baselines.
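The conformal-prediction ingredient follows the standard split-CP recipe, sketched below with synthetic numbers (not the paper's implementation): calibrate a radius so the target's true position lies in the predicted ball with probability at least 1 - alpha.

```python
# Standard split conformal prediction for target-motion uncertainty.
import numpy as np

rng = np.random.default_rng(1)
n_cal, alpha = 200, 0.1
pred = rng.uniform(0, 10, size=(n_cal, 2))                   # predicted positions
true = pred + rng.normal(scale=0.5, size=(n_cal, 2))         # observed positions
scores = np.linalg.norm(true - pred, axis=1)                 # nonconformity scores

# finite-sample-corrected quantile level: ceil((n+1)(1-alpha)) / n
level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
q = np.quantile(scores, level, method="higher")
print(f"ball of radius {q:.2f} m covers the target w.p. >= {1 - alpha:.0%}")
```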
|
| |
| 09:00-10:30, Paper WeI1I.156 | Add to My Program |
| Responsibility and Engagement - Evaluating Interactions in Social Robot Navigation |
|
| Probst, Malte | Honda Research Institute Europe GmbH |
| Wenzel, Raphael | Honda Research Institute Europe GmbH |
| Dasi, Monica | Honda Research Institute Europe GmbH |
Keywords: Performance Evaluation and Benchmarking, Human-Aware Motion Planning, Motion and Path Planning
Abstract: In Social Robot Navigation (SRN), the availability of meaningful metrics is crucial for evaluating trajectories from human-robot interactions. In the SRN context, such interactions often relate to resolving conflicts between two or more agents. Correspondingly, the shares that individual agents contribute to resolving such conflicts are important. This paper builds on recent work, which proposed a Responsibility metric capturing such shares. We extend this framework in two directions: First, we model the conflict buildup phase by introducing a time normalization. Second, we propose the related Engagement metric, which captures how the agents' actions intensify a conflict. In a comprehensive series of simulated scenarios with dyadic, group and crowd interactions, we show that the metrics carry meaningful information about the cooperative resolution of conflicts in interactions. They can be used to assess behavior quality and foresightedness. We extensively discuss applicability, design choices and limitations of the proposed metrics.
|
| |
| 09:00-10:30, Paper WeI1I.157 | Add to My Program |
| A Collision-Free Sway Damping Model Predictive Controller for Safe and Reactive Forestry Crane Navigation |
|
| Ecker, Marc-Philip | TU Wien, Austrian Institute of Technology |
| Froehlich, Christoph | Austrian Institute of Technology |
| Huemer, Johannes | AIT Austrian Institute of Technology GmbH |
| Gruber, David | AIT Austrian Institute of Technology GmbH |
| Bischof, Bernhard | Austrian Institute of Technology |
| Glück, Tobias | AIT Austrian Institute of Technology GmbH |
| Kemmetmueller, Wolfgang | TU Wien |
Keywords: Robotics and Automation in Agriculture and Forestry
Abstract: Forestry cranes operate in dynamic, unstructured outdoor environments where simultaneous collision avoidance and payload sway control are critical for safe navigation. Existing approaches address these challenges separately, either focusing on sway damping with predefined collision-free paths or performing collision avoidance only at the global planning level. We present the first collision-free, sway-damping model predictive controller (MPC) for a forestry crane that unifies both objectives in a single control framework. Our approach integrates LiDAR-based environment mapping directly into the MPC using online Euclidean distance fields (EDF), enabling real-time environmental adaptation. The controller simultaneously enforces collision constraints while damping payload sway, allowing it to (i) replan upon quasi-static environmental changes, (ii) maintain collision-free operation under disturbances, and (iii) provide safe stopping when no bypass exists. Experimental validation on a real forestry crane demonstrates effective sway damping and successful obstacle avoidance.
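A minimal sketch of how a Euclidean distance field can feed collision constraints in an MPC (the grid, resolution, and penalty form are illustrative; scipy's distance transform stands in for the online EDF update from LiDAR):

```python
# Toy EDF-based clearance check for predicted crane-link points.
import numpy as np
from scipy.ndimage import distance_transform_edt

occ = np.zeros((100, 100), dtype=bool)          # 10 m x 10 m map, 0.1 m cells
occ[40:60, 40:60] = True                        # an obstacle block
edf = distance_transform_edt(~occ) * 0.1        # metres to the nearest obstacle

def clearance(p_xy: np.ndarray) -> float:
    i, j = (p_xy / 0.1).astype(int)             # nearest-cell lookup (no interpolation)
    return edf[i, j]

d_safe = 0.5
pred_points = np.array([[1.0, 1.0], [3.9, 4.5]])   # predicted link positions over the horizon
violations = [max(0.0, d_safe - clearance(p)) for p in pred_points]
print(violations)                                   # penalized (or hard-constrained) in the MPC
```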
|
| |
| 09:00-10:30, Paper WeI1I.158 | Add to My Program |
| SSR-ZSON: Zero-Shot Object Navigation Via Spatial-Semantic Relations within a Hierarchical Exploration Framework |
|
| Meng, Xiangyi | Beijing Institute of Technology |
| Li, Delun | Beijing Institute of Technology |
| Mao, Zihao | Beijing Institute of Technology |
| Yang, Yi | Beijing Institute of Technology |
| Song, Wenjie | Beijing Institute of Technology |
Keywords: Search and Rescue Robots, Planning under Uncertainty, Autonomous Vehicle Navigation
Abstract: Zero-shot object navigation in unknown environments presents significant challenges, mainly due to two key limitations: insufficient semantic guidance leads to inefficient exploration, while limited spatial memory resulting from environmental structure causes entrapment in local regions. To address these issues, we propose SSR-ZSON, a spatial-semantic relation-based zero-shot object navigation method built on the TARE hierarchical exploration framework, integrating a viewpoint generation strategy balancing spatial coverage and semantic density with an LLM-based global guidance mechanism. The performance improvement of the proposed method is due to two key innovations. First, the viewpoint generation strategy prioritizes areas of high semantic density within traversable sub-regions to maximize spatial coverage and minimize invalid exploration. Second, coupled with an LLM-based global guidance mechanism, it assesses semantic associations to direct navigation toward high-value spaces, preventing local entrapment and ensuring efficient exploration. Deployed on hybrid Habitat-Gazebo simulations and physical platforms, SSR-ZSON achieves real-time operation and superior performance. On Matterport3D and Habitat-Matterport3D datasets, it improves the Success Rate (SR) by 18.5% and 11.2%, and the Success weighted by Path Length (SPL) by 0.181 and 0.140, respectively, over state-of-the-art methods.
|
| |
| 09:00-10:30, Paper WeI1I.159 | Add to My Program |
| Bi-Manual Joint Camera Calibration and Scene Representation |
|
| Tang, Haozhan | Carnegie Mellon University |
| Zhang, Tianyi | Carnegie Mellon University |
| Johnson-Roberson, Matthew | Carnegie Mellon University |
| Zhi, Weiming | The University of Sydney; Vanderbilt University |
Keywords: Perception for Grasping and Manipulation
Abstract: Robot manipulation, especially bimanual manipulation, often requires setting up multiple cameras on multiple robot manipulators. Before robot manipulators can generate motion or even build representations of their environments, the cameras rigidly mounted to the robot need to be calibrated. Camera calibration is a cumbersome process involving collecting a set of images, with each capturing a pre-determined marker. In this work, we introduce the Bi-Manual Joint Calibration and Representation Framework (Bi-JCR). Bi-JCR enables multiple robot manipulators, each with cameras mounted, to circumvent taking images of calibration markers. By leveraging 3D foundation models for dense, marker-free multi-view correspondence, Bi-JCR jointly estimates: (i) the extrinsic transformation from each camera to its end-effector, (ii) the inter-arm relative poses between manipulators, and (iii) a unified, scale-consistent 3D representation of the shared workspace, all from the same captured RGB image sets. The representation, jointly constructed from images captured by cameras on both manipulators, lives in a common coordinate frame and supports collision checking and semantic segmentation to facilitate downstream bimanual coordination tasks. We empirically evaluate the robustness of Bi-JCR on a variety of tabletop environments, and demonstrate its applicability on a variety of downstream tasks.
|
| |
| 09:00-10:30, Paper WeI1I.160 | Add to My Program |
| Reinforcement Fine-Tuning of Flow-Matching Policies for Vision-Language-Action Models |
|
| Lv, Mingyang | Institute of Automation,Chinese Academy of Sciences |
| Sun, Yinqian | Institute of Automation, Chinese Academy of Science |
| Lin, Erliang | Institute of Automation Chinese Academy of Sciences |
| Li, Huangrui | University of Chinese Academy of Sciences |
| Chen, Ruolin | CASIA |
| Zhao, Feifei | Institute of Automation,Chinese Academy of Sciences |
| Zeng, Yi | Institute of Automation, Chinese Academy of Sciences |
Keywords: Learning from Demonstration, Imitation Learning, Reinforcement Learning
Abstract: Vision-Language-Action (VLA) models such as OpenVLA, Octo, and π0 have shown strong generalization by leveraging large-scale demonstrations, yet their performance is still fundamentally constrained by the quality and coverage of supervised data. Reinforcement learning (RL) therefore provides a promising path for further improving and fine-tuning VLAs through online interaction. However, conventional policy gradient methods are computationally infeasible in the context of flow-matching based models due to the intractability of the importance sampling process, which requires explicit computation of policy ratios. To overcome this limitation, we propose the Flow Policy Optimization (FPO) algorithm, which reformulates importance sampling by leveraging per-sample changes in the conditional flow-matching objective. Furthermore, FPO achieves stable and scalable online reinforcement fine-tuning of the π0 model by integrating structure-aware credit assignment to enhance gradient efficiency, clipped surrogate objectives to stabilize optimization, multi-step latent exploration to encourage diverse policy updates, and a Q-ensemble mechanism to provide robust value estimation. We evaluate FPO on the LIBERO benchmark and the ALOHA simulation task against supervised, preference-aligned, diffusion-based, autoregressive online RL, and π0-FAST baselines, observing consistent improvements over the imitation prior and strong alternatives with stable learning under sparse rewards. In addition, ablation studies and analyses of the latent space dynamics further highlight the contributions of individual components within FPO, validating the effectiveness of the proposed computational modules and illustrating the stable convergence of the conditional flow-matching objective during online reinforcement learning.
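For reference, the conditional flow-matching objective that FPO builds on is easy to sketch; the FPO ratio construction itself is only indicated in a trailing comment, since the abstract describes it at a high level and the details below are assumptions.

```python
# Standard linear-path conditional flow matching (CFM) with a per-sample loss.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

def cfm_loss(model, actions):
    x0 = torch.randn_like(actions)                 # noise sample
    t = torch.rand(actions.shape[0], 1)            # random time along the path
    x_t = (1 - t) * x0 + t * actions               # linear probability path
    target_v = actions - x0                        # velocity of that path
    return ((model(x_t, t) - target_v) ** 2).sum(dim=-1)   # per-sample loss

model = VelocityNet(dim=7)                          # e.g., a 7-DoF action slice
per_sample = cfm_loss(model, torch.randn(32, 7))
# FPO-style idea: use the change of `per_sample` between old and new policies
# as a surrogate for the intractable likelihood ratio in PPO-style updates.
```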
|
| |
| 09:00-10:30, Paper WeI1I.161 | Add to My Program |
| DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios |
|
| Artham, Sainithin | IIIT Hyderabad |
| Gangisetty, Shankar | IIIT Hyderabad |
| Dasgupta, Avijit | IIIT Hyderabad |
| Jawahar, C.V. | IIIT, Hyderabad |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, Intelligent Transportation Systems
Abstract: Comprehensive situational awareness is essential for autonomous vehicles operating in safety-critical environments, as it enables the identification and mitigation of potential risks. Although recent Multimodal Large Language Models (MLLMs) have shown promise on general vision–language tasks, our findings indicate that zero-shot MLLMs still underperform compared to domain-specific methods in fine-grained, spatially grounded risk assessment. To address this gap, we propose DriveSafe, a framework for risk-aware scene understanding that leverages structured natural language descriptions. Specifically, our method first generates spatially grounded captions enriched with multimodal context—including motion, spatial, and depth cues. These captions are then used for downstream risk assessment, explicitly identifying hazardous objects, their locations, and the unsafe behaviors they imply, followed by actionable safety suggestions. To further improve performance, we employ caption–risk pairings to fine-tune a lightweight adapter module, efficiently injecting domain-specific knowledge into the base LLM. By conditioning risk assessment on explicit language-based scene representations, DriveSafe achieves significant gains over both zero-shot MLLMs and prior domain-specific baselines. Exhaustive experiments on the DRAMA benchmark demonstrate state-of-the-art performance, while ablation studies validate the effectiveness of our key design choices. Project page: https://cvit.iiit.ac.in/research/projects/cvit-projects/drivesafe.
|
| |
| 09:00-10:30, Paper WeI1I.162 | Add to My Program |
| LangEditor: Natural Language-Driven 4D Editing for Improved Controllability of Dynamic Driving Scenes |
|
| Liang, Xiaoyu | Peking University |
| Wang, Linhui | Peking University Shenzhen Graduate School |
| Li, Chunlam | Peking University |
| Lin, Junhong | Peking University |
| Gao, Wei | Peking University Shenzhen Graduate School |
Keywords: Computer Vision for Transportation, Visual Learning, Computer Vision for Manufacturing
Abstract: Diverse and realistic data are essential for developing reliable autonomous driving (AD) systems, yet collecting and annotating large-scale real-world driving datasets is costly and time-consuming. Recent advances in synthetic scene generation and editing have enabled the creation of diverse driving scenarios. However, fully synthetic scenes often lack real-world grounding, while existing editing approaches are either limited to pure video manipulation or involve cumbersome manual operations. To solve this, we present LangEditor, a natural language-driven 4D editing framework for dynamic driving scenes. LangEditor automatically grounds free-form language instructions to target vehicles and their editable attributes, generating physically plausible trajectories consistent with scene semantics. To ensure spatiotemporal coherence and visual fidelity, we propose a joint refinement strategy that integrates a Dynamic Illumination-Aware Shadow Module for lighting consistency across space-time, and an Appearance Refinement module for synthesizing high-quality textures and material properties. Extensive experiments on realistic driving datasets demonstrate that LangEditor enables intuitive, fine-grained, and photorealistic 4D scene manipulation, outperforming existing baselines in both editing quality and controllability. Our approach bridges the gap between realistic scene editing and user-friendly controllability, offering a powerful tool for data augmentation and simulation in AD research.
|
| |
| 09:00-10:30, Paper WeI1I.163 | Add to My Program |
| Shape Sensing and Tip Tracking Via Reciprocating Magnet in the Soft Continuum Robot |
|
| Xiang, Pingyu | Zhejiang University |
| Zhao, Zexi | Zhejiang University |
| Zhang, Hongye | Zhejiang University |
| Wang, Yue | Zhejiang University |
| Xiong, Rong | Zhejiang University |
| Lu, Haojian | Zhejiang University |
Keywords: Micro/Nano Robots, Automation at Micro-Nano Scales, Biologically-Inspired Robots
Abstract: Soft continuum robots, owing to their inherently compliant trunks and shape manipulability, have been widely deployed in complex scenarios requiring safe human-robot interaction. However, their nonlinear deformations and hyper-redundant degrees of freedom pose substantial challenges for full-body shape sensing and closed-loop control of the end effector. A low-cost yet accurate feedback solution is thus highly desirable. To address this, we present a hydraulic-driven reciprocating magnet strategy, integrated with magnetic localization, to enable both shape sensing and tip pose estimation of soft continuum robots, thereby facilitating precise closed-loop control. The proposed approach time-multiplexes a single magnet under different operational phases to fulfill two functions: full-body shape reconstruction and tip pose tracking. We validate the effectiveness of the reciprocating magnet system on a pneumatic manipulator prototype with two active degrees of freedom. Experimental results show that the magnet can travel through the guide channel at a maximum speed of 6.5 cm/s, achieving average errors of less than 2 mm in position (1.1% of the robot’s length) and 3° in orientation for shape sensing and tip pose estimation. Using this sensing strategy, we demonstrate simple closed-loop control on the soft continuum robot. Owing to its simplicity, low cost, and high precision, the proposed method holds promise as a practical alternative for state feedback in soft continuum robots.
|
| |
| 09:00-10:30, Paper WeI1I.164 | Add to My Program |
| One Prompt, Many Rooms: A Force-Directed Approach to 3D Scene Generation for Robotics Simulation |
|
| May, Christopher | Friedrich-Alexander-Universität Erlangen-Nürnberg |
| Hetzner, Peter | Friedrich-Alexander-Universität Erlangen-Nürnberg |
| Kalenberg, Matthias | Friedrich-Alexander Universität Erlangen-Nürnberg |
| Sajith Nambiar, Ashwin | Friedrich-Alexander-Universität Erlangen-Nürnberg |
| Franke, Jörg | Friedrich-Alexander-Universität Erlangen-Nürnberg |
| Reitelshöfer, Sebastian | Friedrich-Alexander-Universität Erlangen-Nürnberg |
Keywords: Simulation and Animation, Data Sets for Robot Learning, Software Tools for Benchmarking and Reproducibility
Abstract: Training generalizable robotic agents requires large datasets of diverse, physically consistent 3D scenes, yet their generation remains a critical bottleneck. Current text-to-scene methods are inefficient for this task; generating diverse layouts requires repeated, expensive sampling that fails to maintain the semantic consistency required for robust policy learning. In this paper, we address this with our “one-plan-to-many-layouts” method. A Large Language Model (LLM) generates a single declarative plan, which a force-directed physics simulation then realizes into multiple layouts that share semantics but differ in geometry. We validate our method by transfer to photorealistic 3D reconstructions of real environments (Replica) within simulation, where a navigation agent trained on our scenes attains a Success Rate of 0.84. These results establish our pipeline as a scalable method for producing the controlled, diverse data required for embodied AI training.
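The force-directed realization step can be sketched generically (object names, relations, and force constants below are illustrative, not the paper's): repulsion separates overlapping objects while springs keep semantically linked ones together, and different random seeds yield different layouts under the same plan.

```python
# Generic force-directed layout relaxation for a single declarative plan.
import numpy as np

rng = np.random.default_rng(2)
pos = rng.uniform(0, 5, size=(6, 2))     # 6 objects in a 5 m x 5 m room
pairs = [(0, 1), (2, 3)]                 # "near" relations from the (assumed) LLM plan

for _ in range(500):
    forces = np.zeros_like(pos)
    for i in range(len(pos)):            # pairwise repulsion below 0.8 m separation
        d = pos[i] - pos
        dist = np.linalg.norm(d, axis=1, keepdims=True) + 1e-9
        forces[i] += (d / dist * np.clip(0.8 - dist, 0, None)).sum(axis=0)
    for i, j in pairs:                   # spring attraction for planned relations
        d = pos[j] - pos[i]
        forces[i] += 0.1 * d
        forces[j] -= 0.1 * d
    pos = np.clip(pos + 0.5 * forces, 0, 5)   # step and stay inside the room

print(pos.round(2))                      # one sampled layout; rerun with a new seed for more
```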
|
| |
| 09:00-10:30, Paper WeI1I.165 | Add to My Program |
| Bi-Hap: A Bi-Directional Learning-Based Control and Momentum-Based Haptic Feedback System for Dexterous In-Hand Telemanipulation |
|
| Wang, Haoyang | Oklahoma State University |
| Guo, Haoran | Oklahoma State University |
| Li, Zhengxiong | University of Colorado Denver |
| Tao, Lingfeng | Kennesaw State University |
Keywords: Telerobotics and Teleoperation, Haptics and Haptic Interfaces, Dexterous Manipulation
Abstract: Dexterous in-hand telemanipulation demands precise control and realistic haptic feedback to achieve stable and intuitive human–robot interaction. Existing systems often emphasize isolated control policies or unidirectional force feedback, limiting performance in tasks that require coordinated bidirectional information flow. In this work, we introduce Bi-Hap, a bi-directional learning-based control and momentum-based haptic feedback system for real-time, in-hand telemanipulation. On the control side, Bi-Hap leverages an inertial measurement unit to capture operator motion and drives a deep reinforcement learning policy that enables robust and adaptive manipulation of objects with fine rotational dexterity. On the feedback side, a compact, palm-sized momentum-actuated mechanism delivers torque and vibration cues directly to the operator, augmented by an error-adaptive strategy that modulates feedback intensity based on task states. When integrated, this closed-loop design establishes an immersive bidirectional control–feedback framework. Experimental results show that Bi-Hap achieves low feedback latency (<0.03 s), high torque fidelity (RMSE <0.01 N·m), and significantly improved telemanipulation performance by elevating manipulation accuracy, responsiveness, and operator situational awareness in diverse task settings.
|
| |
| 09:00-10:30, Paper WeI1I.166 | Add to My Program |
| DensePercept-NCSSD: Vision Mamba towards Real-Time Dense Visual Perception with Non-Causal State Space Duality |
|
| Anand, Tushar | BITS Hyderabad |
| Sinha, Advik | Birla Institute of Technology and Science Pilani, Hyderabad |
| Das, Abhijit | BITS Pilani |
Keywords: Deep Learning for Visual Perception, Visual Learning
Abstract: In this work, we propose an accurate and real-time optical flow and disparity estimation model by fusing pairwise input images in the proposed non-causal selective state space for dense perception tasks. We propose a non-causal Mamba-block-based model that is fast, efficient, and well suited to the constraints of real-time applications. Our proposed model reduces inference times while maintaining high accuracy and low GPU usage for optical flow and disparity map generation. The results and analysis justify that our proposed model can be used for unified, real-time, and accurate 3D dense perception estimation tasks. The code, along with the models, can be found at https://github.com/vimstereo/DensePerceptNCSSD.
|
| |
| 09:00-10:30, Paper WeI1I.167 | Add to My Program |
| Instance-Guided Unsupervised Domain Adaptation for Robotic Semantic Segmentation |
|
| Antonazzi, Michele | KTH Royal Institute of Technology |
| Signorelli, Lorenzo | University of Milan |
| Luperto, Matteo | Università Degli Studi Di Milano |
| Basilico, Nicola | University of Milan |
Keywords: Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization
Abstract: Semantic segmentation networks, which are essential for robotic perception, often suffer from performance degradation when the visual distribution of the deployment environment differs from that of the source dataset on which they were trained. Unsupervised Domain Adaptation (UDA) addresses this challenge by adapting the network to the robot’s target environment without external supervision, leveraging the large amounts of data a robot might naturally collect during long-term operation. In such settings, UDA methods can exploit multi-view consistency across the environment’s map to fine-tune the model in an unsupervised fashion and mitigate domain shift. However, these approaches remain sensitive to cross-view instance-level inconsistencies. In this work, we propose a method that starts from a volumetric 3D map to generate multi-view consistent pseudo-labels. We then refine these labels using the zero-shot instance segmentation capabilities of a foundation model, enforcing instance-level coherence. The refined annotations serve as supervision for self-supervised fine-tuning, enabling the robot to adapt its perception system at deployment time. Experiments on real-world data demonstrate that our approach consistently improves performance over state-of-the-art UDA baselines based on multi-view consistency, without requiring any ground-truth labels in the target domain.
|
| |
| 09:00-10:30, Paper WeI1I.168 | Add to My Program |
| H-Zero: Cross-Humanoid Locomotion Pretraining Enables Few-Shot Novel Embodiment Transfer |
|
| Lin, Yunfeng | Shanghai Jiao Tong University |
| Liu, Minghuan | Shanghai Jiao Tong University |
| Xue, Yufei | Shanghai Jiao Tong University |
| Zhou, Ming | Shanghai Jiao Tong University |
| Yu, Yong | Shanghai Jiao Tong University |
| Pang, Jiangmiao | Shanghai AI Laboratory |
| Zhang, Weinan | Shanghai Jiao Tong University |
Keywords: Humanoid and Bipedal Locomotion, Transfer Learning, Reinforcement Learning
Abstract: The rapid advancement of humanoid robotics has intensified the need for robust and adaptable controllers to enable stable and efficient locomotion across diverse platforms. However, developing such controllers remains a significant challenge because existing solutions are tailored to specific robot designs, requiring extensive tuning of reward functions, physical parameters, and training hyperparameters for each embodiment. To address this challenge, we introduce H-Zero, a cross-humanoid locomotion pretraining pipeline that learns a generalizable humanoid base policy. We show that pretraining on a limited set of embodiments enables zero-shot and few-shot transfer to novel humanoid robots with minimal fine-tuning. Evaluations show that the pretrained policy maintains up to 81% of the full episode duration on unseen robots in simulation while enabling few-shot transfer to unseen humanoids and upright quadrupeds within 30 minutes of fine-tuning.
|
| |
| 09:00-10:30, Paper WeI1I.169 | Add to My Program |
| TranTac: Leveraging Transient Tactile Signals for Contact-Rich Robotic Manipulation |
|
| Wu, Yinghao | Harbin Institute of Technology, Shenzhen |
| Hou, Shuhong | Harbin Institute of Technology, Shenzhen |
| Zheng, Haowen | Harbin Institute of Technology, Shenzhen |
| Li, Yichen | Tsinghua Shenzhen International Graduate School |
| Lu, Weiyi | Harbin Institute of Technology |
| Zhou, Xun | Harbin Institute of Technology, Shenzhen |
| Shao, Yitian | Harbin Institute of Technology, Shenzhen |
Keywords: Force and Tactile Sensing, Assembly, Bioinspired Robot Learning
Abstract: Robotic manipulation tasks such as inserting a key into a lock or plugging a USB device into a port can fail when visual perception is insufficient to detect misalignment. In these situations, touch sensing is crucial for the robot to monitor the task's states and make precise, timely adjustments. Current touch sensing solutions are either too insensitive to detect subtle changes or demand excessive sensor data. Here, we introduce TranTac, a data-efficient and low-cost tactile sensing and control framework that integrates a single contact-sensitive 6-axis inertial measurement unit within the elastomeric tips of a robotic gripper for completing fine insertion tasks. Our customized sensing system can detect dynamic translational and torsional deformations at the micrometer scale, enabling the tracking of visually imperceptible pose changes of the grasped object. By leveraging transformer-based encoders and diffusion policy, TranTac can imitate human insertion behaviors using transient tactile cues detected at the gripper's tip during insertion processes. These cues enable the robot to dynamically control and correct the 6-DoF pose of the grasped object. When combined with vision, TranTac achieves an average success rate of 79% on object grasping and insertion tasks, outperforming both a vision-only policy and one augmented with end-effector 6D force/torque sensing. Additionally, TranTac's contact localization performance is validated through tactile-only insertion tasks, where the inserted object and slot are initially misaligned by 1 to 3 mm, achieving an average success rate of 88%. We assess the generalizability by training TranTac on a single prism-slot pair and testing it on unseen data, including a USB plug and a metal key, and find that the insertion tasks can still be completed with an average success rate of nearly 70%. The proposed framework may inspire new robotic tactile sensing systems for delicate manipulation tasks.
|
| |
| 09:00-10:30, Paper WeI1I.170 | Add to My Program |
| APREBot: Active Perception System for Reflexive Evasion Robot |
|
| Xu, Zihao | National University of Singapore |
| Sima, Kuankuan | National University of Singapore |
| Deng, Junhao | Beijing Institute of Technology |
| Zhuang, Zixuan | Sun Yat-Sen University |
| Wang, Chunzheng | National University of Singapore |
| Hao, Ce | National University of Singapore |
| Dong, Jin Song | National University of Singapore |
Keywords: Legged Robots, Collision Avoidance
Abstract: Reliable onboard perception is critical for quadruped robots navigating dynamic environments, where obstacles can emerge from any direction under strict reaction time constraints. Single-sensor systems face inherent limitations: LiDAR provides omnidirectional coverage but lacks rich texture information, while cameras capture high-resolution detail but suffer from restricted field of view. We introduce APREBot (Active Perception System for Reflexive Evasion Robot), a novel framework that integrates reflexive evasion with active hierarchical perception. APREBot strategically combines LiDAR-based omnidirectional scanning with camera-based active focusing, achieving comprehensive environmental awareness essential for agile obstacle avoidance in quadruped robots. We validate APREBot through extensive Sim2Real experiments on a quadruped platform, evaluating diverse obstacle types, trajectories, and approach directions. Our results demonstrate substantial improvements over strong baselines in both safety metrics and operational efficiency, highlighting APREBot's potential for dependable autonomy in safety-critical scenarios. Paper homepage: https://aprebot-2026.github.io/.
|
| |
| 09:00-10:30, Paper WeI1I.171 | Add to My Program |
| MetaDAT: Generalizable Trajectory Prediction Via Meta Pre-Training and Data-Adaptive Test-Time Updating |
|
| Wang, Yuning | Xi'an Jiaotong University |
| Zhang, Pu | DiDi KargoBot |
| He, Yuan | KargoBot |
| Wang, Ke | Kargobot.AI |
| Xue, Jianru | Xi'an Jiaotong University |
Keywords: Intelligent Transportation Systems, Autonomous Agents
Abstract: Existing trajectory prediction methods exhibit significant performance degradation under distribution shifts during test time. Although test-time training techniques have been explored to enable adaptation, current approaches rely on an offline pre-trained predictor that lacks online learning flexibility. Moreover, they depend on fixed online model updating rules that do not accommodate the specific characteristics of test data. To address these limitations, we first propose a novel meta-learning framework to directly optimize the pre-trained predictor for fast and accurate online adaptation, which performs bi-level optimization on the performance of simulated test-time adaptation tasks during pre-training. Furthermore, at test time, we introduce a data-adaptive model updating mechanism that dynamically adjusts the predefined learning rates and updating frequencies based on online partial derivatives and hard sample selection. This mechanism tailors the learning rate to the test data and focuses on informative hard samples to enhance efficiency. Experiments are conducted on various challenging cross-dataset distribution shift scenarios, including nuScenes, Lyft, and Waymo. Results demonstrate that our method achieves superior forecasting accuracy, surpassing state-of-the-art test-time training methods for trajectory prediction. Additionally, our method excels under suboptimal learning rates and high FPS demands, showcasing its robustness and practicality.
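One plausible shape for such a data-adaptive update rule is sketched below, under two assumptions that are ours rather than the paper's: the learning rate is rescaled by the ratio of the current gradient norm to a running reference, and only high-loss ("hard") samples trigger an update. The `loss_fn` signature is hypothetical.

```python
import torch

def test_time_update(model, loss_fn, batch, base_lr, grad_ref, hard_q=0.7):
    """One illustrative data-adaptive test-time update (not the authors' rule)."""
    losses = loss_fn(model, batch, reduction="none")   # per-sample losses
    thresh = torch.quantile(losses, hard_q)
    hard = losses[losses >= thresh]                    # keep informative samples
    if hard.numel() == 0:
        return grad_ref
    model.zero_grad()
    hard.mean().backward()
    gnorm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()
                           if p.grad is not None))
    lr = base_lr * float(gnorm / (grad_ref + 1e-8))    # adapt lr to test data
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad
    return 0.9 * grad_ref + 0.1 * float(gnorm)         # update running reference
```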
|
| |
| 09:00-10:30, Paper WeI1I.172 | Add to My Program |
| Bipedal-Walking-Dynamics Model on Granular Terrains |
|
| Chen, Xunjie | Rutgers University |
| Huang, Xinyan | Rutgers University |
| Shan, Peter | Bridgewater Raritan High School |
| Yi, Jingang | Rutgers University |
| Liu, Tao | Zhejiang University |
Keywords: Humanoid and Bipedal Locomotion, Contact Modeling, Legged Robots
Abstract: Bipeds have demonstrated high agility and mobility in unstructured environments such as sand. The yielding of such granular media brings significant sinkage and slip of the bipedal feet, leading to uncertainty and instability of walking locomotion. We present a new dynamics-modeling approach to capture and predict bipedal-walking locomotion on granular media. A dynamic foot-terrain interaction model is integrated to compute the ground reaction force (GRF). The proposed granular dynamic model has three additional degrees of freedom (DoFs) to estimate foot sinkage and slip, which are critical to capturing robot-walking kinematics and kinetics such as cost of transport (CoT). Using the new model, we analyze bipedal kinetics, CoT, and foot-terrain rolling and intrusion effects. Experiments are conducted using a biped robotic walker on sand to validate the proposed dynamic model with robot-gait profiles, media-intrusion prediction, and GRF estimations. This new dynamics model can further serve as an enabling tool for locomotion control and optimization of bipedal robots to efficiently walk on granular terrains.
|
| |
| 09:00-10:30, Paper WeI1I.173 | Add to My Program |
| ParkDiffusion++: Ego Intention Conditioned Joint Trajectory Prediction for Automated Parking Using Diffusion Models |
|
| Wei, Jiarong | University of Freiburg |
| Rehr, Anna | Cariad SE |
| Feist, Christian | CARIAD SE |
| Valada, Abhinav | University of Freiburg |
Keywords: Intelligent Transportation Systems, AI-Based Methods, Behavior-Based Systems
Abstract: Automated parking is a challenging operational domain for advanced driver assistance systems, requiring robust scene understanding and interaction reasoning. The key challenge is twofold: (i) predict multiple plausible ego intentions according to context and (ii) for each intention, predict the joint responses of surrounding agents, enabling effective what-if decision-making. However, existing methods often fall short, typically treating these interdependent problems in isolation. We propose ParkDiffusion++, which jointly learns a multi-modal ego intention predictor and an ego-conditioned multi-agent joint trajectory predictor for automated parking. Our approach makes four key contributions. First, we introduce an ego intention tokenizer that predicts a small set of discrete endpoint intentions from agent histories and vectorized map polylines. Second, we perform ego intention-conditioned joint prediction, yielding socially consistent predictions of the surrounding agents for each possible ego intention. Third, we employ a lightweight safety-guided denoiser with different constraints to refine joint scenes during training, thus improving accuracy and safety. Fourth, we propose counterfactual knowledge distillation, where an EMA teacher refined by a frozen safety-guided denoiser provides pseudo-targets that capture how agents react to alternative ego intentions. Extensive evaluations demonstrate that ParkDiffusion++ achieves state-of-the-art performance on the Dragon Lake Parking (DLP) dataset and the Intersections Drone (inD) dataset. Importantly, qualitative what-if visualizations show other agents react appropriately to different ego intentions.
|
| |
| 09:00-10:30, Paper WeI1I.174 | Add to My Program |
| Neural Operators for Design-Space Surrogate Modeling of Tendon-Actuated Continuum Robots |
|
| Frieden, Branden | University of Utah |
| Ferguson, James | University of Utah |
| Kuntz, Alan | Vanderbilt University |
| Shankar, Varun | University of Utah |
Keywords: Modeling, Control, and Learning for Soft Robots, Medical Robots and Systems, Deep Learning Methods
Abstract: Continuum robots enable dexterous manipulation in constrained environments, but require accurate and efficient models for real-time manipulation and control. Traditional physics-based models can be computationally expensive and may suffer from inaccuracies due to unmodeled effects, while current learning-based methods often generalize poorly beyond the specific robot on which they are trained. We present a formulation of surrogate modeling for tendon-driven continuum robots as an operator learning problem that maps robot design parameters and tendon actuation inputs to resulting configurations. This formulation enables a single trained model to generalize across a large class of robot designs. We develop four novel neural operator architectures--two based on Deep Operator Networks (DeepONets) and two based on Fourier Neural Operators (FNOs)--and train them on simulation data to predict robot configurations. All architectures achieve good accuracy while allowing for fast and accurate generalization across designs. Our results demonstrate that operator learning provides an effective and generalizable surrogate for continuum robot mechanics in the design space, enabling fast modeling for control, planning, and design optimization in surgical and industrial applications.
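As a reference point, a DeepONet-style surrogate of this kind can be quite compact: a branch network embeds the design parameters and tendon tensions, a trunk network embeds the arc-length query along the backbone, and their inner product yields the predicted configuration. The PyTorch sketch below is illustrative only; the layer sizes and the exact input/output parameterization are assumptions, not the paper's architectures.

```python
import torch
import torch.nn as nn

class DeepONetSurrogate(nn.Module):
    """Minimal DeepONet-style surrogate for tendon-driven continuum robots."""
    def __init__(self, n_inputs, width=128, p=64, out_dim=3):
        super().__init__()
        # branch: design parameters + tendon tensions -> p basis coefficients
        self.branch = nn.Sequential(nn.Linear(n_inputs, width), nn.ReLU(),
                                    nn.Linear(width, p * out_dim))
        # trunk: normalized arc length s in [0, 1] -> p basis functions
        self.trunk = nn.Sequential(nn.Linear(1, width), nn.ReLU(),
                                   nn.Linear(width, p * out_dim))
        self.p, self.out_dim = p, out_dim

    def forward(self, design_and_tensions, s):
        b = self.branch(design_and_tensions).view(-1, self.out_dim, self.p)
        t = self.trunk(s.unsqueeze(-1)).view(-1, self.out_dim, self.p)
        return (b * t).sum(-1)   # (batch, 3) backbone point at arc length s
```

Because the design parameters enter only through the branch input, one trained network covers a family of robot designs, which is what makes the surrogate usable for design optimization.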
|
| |
| 09:00-10:30, Paper WeI1I.175 | Add to My Program |
| VLION: Vision-Language Guided Interactive Object Navigation with Mobile Manipulation |
|
| Liu, Renming | Sun Yat-Sen University |
| Ren, Hao | Sun Yat-Sen University |
| Zheng, Lanxiang | Sun Yat-Sen University |
| Zeng, Yiming | Sun Yat-Sen University |
| Wu, Ying | Sun Yat-Sen University |
| Cheng, Hui | Sun Yat-Sen University |
Keywords: Vision-Based Navigation, Mobile Manipulation, Semantic Scene Understanding
Abstract: Object navigation for mobile robots typically assumes that targets are visible and paths are unobstructed. However, real-world scenarios often involve occluded targets like objects hidden behind doors or inside containers. Such scenarios require interactive navigation and manipulation by mobile manipulators. To address this challenge, we propose VLION, a vision-language model-guided framework for interactive object navigation (ION) that enables robots to locate and access such targets efficiently. VLION constructs a probabilistic occupancy map and dynamically identifies frontiers for efficient exploration. It leverages vision-language models (VLMs) to perform joint semantic reasoning at both the scene and object levels, generating Scene-Target and Object-Target Value Maps from egocentric observations. These maps are adaptively fused based on spatial entropy to guide target selection and dynamically balance navigation and manipulation priorities for multi-step decision-making. A hybrid A* planner ensures safe and feasible navigation, while star-convex manipulation regions enable interaction with objects. Extensive experiments in iGibson simulations and real-world environments demonstrate the effectiveness of VLION in zero-shot transfer and on-board deployment, advancing the state of the art in ION.
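A minimal sketch of entropy-based fusion of two value maps follows, assuming nonnegative maps where lower spatial entropy marks the more confident (more peaked) map and earns it a larger weight; the paper's exact weighting may differ.

```python
import numpy as np

def fuse_value_maps(scene_map, object_map, eps=1e-9):
    """Entropy-weighted fusion of two egocentric value maps (illustrative)."""
    def entropy(m):
        p = m.flatten() / (m.sum() + eps)   # normalize map to a distribution
        return -(p * np.log(p + eps)).sum()

    h_s, h_o = entropy(scene_map), entropy(object_map)
    w_s = (1.0 / (h_s + eps)) / (1.0 / (h_s + eps) + 1.0 / (h_o + eps))
    return w_s * scene_map + (1.0 - w_s) * object_map
```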
|
| |
| 09:00-10:30, Paper WeI1I.176 | Add to My Program |
| Revisiting Replanning from Scratch: Real-Time Incremental Planning with Fast Almost-Surely Asymptotically Optimal Planners |
|
| Sabbadini, Mitchell Edris Crisante | Queen's University |
| Liu, Andrew H. | Purdue University |
| Ruan, Joseph | Purdue University |
| Wilson, Tyler S. | Queen's University |
| Kingston, Zachary | Purdue University |
| Gammell, Jonathan | Queen's University |
Keywords: Reactive and Sensor-Based Planning, Motion and Path Planning, Planning under Uncertainty
Abstract: Robots operating in changing environments must either predict obstacle changes or plan quickly enough to react to them. Predictive approaches require a strong prior about the position and motion of obstacles. Reactive approaches require no assumptions about their environment but must replan quickly and find high-quality paths to navigate effectively. Reactive approaches often reuse information between queries to reduce planning cost. These techniques are conceptually sound, but updating dense planning graphs when information changes can be computationally prohibitive, and detecting those changes can itself require significant effort in some applications. This paper revisits the long-held assumption that reactive replanning requires updating existing plans. It shows that the incremental planning problem can alternatively be solved more efficiently as a series of independent problems using fast almost-surely asymptotically optimal (ASAO) planning algorithms. These ASAO algorithms quickly find an initial solution and converge towards an optimal solution, which allows them to find consistent global plans in the presence of changing obstacles without requiring explicit plan reuse. This is demonstrated with simulated experiments where Effort Informed Trees (EIT*) finds shorter median solution paths than the tested reactive planning algorithms and is further validated on a real-world planning problem on a robot arm.
|
| |
| 09:00-10:30, Paper WeI1I.177 | Add to My Program |
| PedestrianQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction |
|
| Mishra, Naman | International Institute of Information Technology - Hyderabad |
| Gangisetty, Shankar | IIIT Hyderabad |
| Jawahar, C.V. | IIIT, Hyderabad |
Keywords: Data Sets for Robotic Vision, Multi-Modal Perception for HRI, Intelligent Transportation Systems
Abstract: Pedestrian intention and trajectory prediction are critical for the safe deployment of autonomous driving systems, directly influencing navigation decisions in complex traffic environments. Recent advances in large vision–language models (VLMs) offer a powerful new paradigm for these tasks by combining high-capacity visual understanding with flexible natural language reasoning. In this work, we introduce PedestrianQA, a large-scale video-based dataset that formulates pedestrian intention and trajectory prediction as question–answering tasks augmented with structured rationales. PedestrianQA expresses richly annotated pedestrian sequences in natural language, enabling VLMs to learn from visual dynamics, contextual cues, and interactions among traffic agents while generating concise explanations of their predictions, without needing specialized architectures tailored for each task. Empirical evaluations across PIE, JAAD, TITAN, and IDD-PeD show that finetuning state-of-the-art VLMs on PedestrianQA significantly improves intention classification, trajectory forecasting accuracy, and the quality of explanatory rationales, demonstrating the strong potential of VLMs as a unified and explainable framework for safety-critical pedestrian behavior modeling. Dataset and models are available at https://github.com/botmahn/PedestrianQA
|
| |
| 09:00-10:30, Paper WeI1I.178 | Add to My Program |
| Physics-Based Reduced-Order Modeling of Magnetic Microparticle Swarms for Biomedical Control |
|
| Fadal, Boudehane | INSA Centre Val De Loire - Univ. Orléans |
| Mellal, Lyès | INSA Centre Val De Loire |
| Do, Trung Son | INSA Centre Val De Loire |
| Folio, David | INSA-CVL, Univ. Orléans, PRISME, EA 4229, Bourges, France |
| Ferreira, Antoine | INSA Centre Val De Loire |
Keywords: Micro/Nano Robots, Medical Robots and Systems, Swarm Robotics
Abstract: Magnetic particle swarms are governed by rich nonlinear collective dynamics that complicate predictive, feedback-based control in biomedical microrobotics. We develop a physics-based reduced-order ellipse model that describes the swarm morphology by its principal radii (r1, r2). At steady state, these radii depend explicitly on magnetic-field curvature, axial gradients, and actuation angular velocity through anisotropic stiffness terms. Model parameters are identified experimentally, yielding low validation errors (RMSE: 0.25 mm for r1 and 0.42 mm for r2) and revealing pronounced stiffness anisotropy (Sx/Sy ≈ 0.16). The resulting formulation provides compact, interpretable equations that enable tractable control design and feedback regulation of magnetic particle swarms.
|
| |
| 09:00-10:30, Paper WeI1I.179 | Add to My Program |
| Task Generalization with Pathwise Conditioning of Gaussian Process for Learning from Demonstration |
|
| Prados, Adrian | Universidad Carlos III De Madrid |
| Espinoza, Gonzalo | Universidad Carlos III De Madrid |
| Mendez, Alberto | Universidad Carlos III De Madrid |
| Barber, Ramon | Universidad Carlos III De Madrid |
Keywords: Learning from Demonstration, Task and Motion Planning, Imitation Learning
Abstract: To effectively operate in human-centered environments, robots must possess the capability to rapidly adapt to novel and changing situations. Techniques such as Learning from Demonstration enable fast learning without the need for explicit coding. However, in certain cases they exhibit limitations in generalizing beyond the set of demonstrations, which constrains their ability to rapidly adapt to unforeseen scenarios. In this work, we present a movement primitive learning algorithm based on Gaussian Processes, combined with zero-shot adaptation to new via-points without requiring retraining, through Pathwise Conditioning. The algorithm not only learns the movement policy but is also capable of adapting it rapidly while preserving prior knowledge. The method has been evaluated through comparisons against other state-of-the-art approaches, experiments in simulated environments, and trials on a real robotic platform, generating new solutions for learned tasks by modifying via-points in both position and orientation.
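Pathwise conditioning rests on Matheron's rule, which turns a sample from the GP prior into a posterior sample without refitting the model; in standard notation, treating the new via-points as the observations y at inputs X:

```latex
% Matheron's rule / pathwise conditioning for a GP prior sample f:
(f \mid \mathbf{y})(\cdot) \;=\; f(\cdot) \;+\; k(\cdot, \mathbf{X})\,
\bigl(K(\mathbf{X},\mathbf{X}) + \sigma^2 I\bigr)^{-1}
\bigl(\mathbf{y} - f(\mathbf{X}) - \boldsymbol{\varepsilon}\bigr),
\qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma^2 I)
```

Adding a via-point only extends y and X, so adaptation costs one linear solve on the conditioning set rather than any retraining, which is what enables the zero-shot adaptation described above.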
|
| |
| 09:00-10:30, Paper WeI1I.180 | Add to My Program |
| 3D Printing of Passively Actuated Self-Folding Robots with Integrated Functional Modules |
|
| Ge, Gaolin | University of Washington |
| Yang, Qifeng | University of Washington |
| Lu, Haoran | University of Washington |
| Cheng, Tingyu | University of Notre Dame |
| Nisser, Martin | University of Washington |
| Luo, Yiyue | University of Washington |
Keywords: Engineering for Robotic Systems, Mechanism Design, Swarm Robotics
Abstract: We introduce an elastic-driven self-folding approach that fabricates robots directly from flat 3D-printed conductive PLA nets. Elastic bands routed through printed hooks store energy that folds the sheet into programmed 3D geometries, while the flat state allows accurate placement of electronics and magnets before deployment. The same substrate doubles as electrodes for capacitive touch and supports a reusable platform I/O palette with Hall sensors and eccentric rotating mass (ERM) motors for docking detection and vibration actuation. We also derive a closed-form folding model that balances hinge stiffness with elastic band moment to predict equilibrium fold angles; experiments validate the model and yield a design map linking hinge thickness, band size, and hook spacing to target angles. Using this workflow we realize multiple polyhedral modules and demonstrate three applications: a cube that highlights the potential of self-folding for scalable modular robot collectives, a deployable gripper, and a tendon-driven finger. The method is low cost, stimulus-free, and integrates actuation and sensing.
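The closed-form model amounts to a torque balance at each hinge; a schematic form is given below, where the symbols are our assumptions rather than the paper's notation:

```latex
% Equilibrium fold angle \theta^{*}: hinge restoring moment = band moment
k_h\,\theta^{*} \;=\; k_b\,\bigl(\ell(\theta^{*}) - \ell_0\bigr)\, d(\theta^{*})
```

Here k_h is the hinge stiffness (growing with hinge thickness), k_b and ℓ0 are the band's stiffness and rest length, ℓ(θ) is the stretched band length, and the moment arm d(θ) is set by the hook spacing; these are exactly the quantities the reported design map relates to the target fold angle.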
|
| |
| 09:00-10:30, Paper WeI1I.181 | Add to My Program |
| Autonomous Docking Using LiDAR-Based Tracking and Adaptive Pose Selection: Closed-Loop Sea Trials |
|
| Fors, August Johansen | Norwegian University of Science and Technology (NTNU) |
| Lexau, Simon J. N. | Norwegian University of Science and Technology (NTNU) |
| Brekke, Edmund | Norwegian University of Science and Technology (NTNU) |
| Hinostroza, Miguel | Norwegian University of Science and Technology (NTNU) |
| Lekkas, Anastasios M. | Norwegian University of Science and Technology (NTNU) |
| Breivik, Morten | Norwegian University of Science and Technology (NTNU) |
Keywords: Marine Robotics, Object Detection, Segmentation and Categorization, Collision Avoidance
Abstract: This paper presents a fully integrated autonomous docking system validated through closed-loop sea trials on the milliAmpere1 research ferry operating in a live maritime harbour with moving vessels. Real harbour environments require continuous situational awareness and adaptive decision-making under dynamic traffic conditions. The proposed architecture combines cartographic land masking, LiDAR-based clustering, probabilistic multi-target tracking (JIPDA), dynamic footprint estimation, adaptive docking pose selection, and real-time path replanning within a finite state machine framework. Rather than introducing new algorithms, the contribution lies in system-level integration and operational validation of a complete perception-to-control pipeline under realistic maritime constraints. The system is demonstrated in multiple closed-loop experiments including collision avoidance and adaptive docking with moving obstacles. Results highlight both performance characteristics and practical deployment considerations, including runtime behaviour, sensor limitations, and integration trade-offs. The work provides empirical evidence that robust autonomous docking in dynamic harbour environments can be achieved through carefully engineered integration of established methods.
|
| |
| 09:00-10:30, Paper WeI1I.182 | Add to My Program |
| Risk-Aware Reinforcement Learning with Bandit-Based Adaptation for Quadrupedal Locomotion |
|
| Zeng, Yuanhong | University of California, Los Angeles |
| Dixit, Anushri | University of California, Los Angeles |
Keywords: Legged Robots, Reinforcement Learning, Robust/Adaptive Control
Abstract: In this work, we introduce a risk-aware reinforcement learning framework for robust quadrupedal locomotion. Our approach first trains a family of risk-conditioned policies using a Conditional Value-at-Risk (CVaR) constrained optimization technique, which improves both training stability and sample efficiency. During deployment, we frame online policy selection as a multi-armed bandit problem. Relying solely on observed episodic returns rather than privileged environment information, this method dynamically adjusts the robot's robustness level to handle unknown conditions on the fly. We evaluate our approach in simulation across eight diverse settings—varying dynamics, contacts, sensing noise, and terrain—as well as in real-world trials on a Unitree Go2 robot. Compared to existing baselines, our risk-aware policy achieves nearly twice the mean and tail performance in novel environments, with the bandit algorithm successfully identifying the optimal policy within just two minutes of operation.
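The online selection step can be illustrated with a standard UCB1-style bandit over the risk-conditioned policies, using only observed episodic returns; the sketch below is a generic form, not necessarily the authors' exact algorithm.

```python
import numpy as np

def select_policy(mean_returns, counts, t, c=2.0):
    """UCB1-style selection among risk-conditioned policies (illustrative).

    mean_returns : per-policy running mean of observed episodic returns
    counts       : number of episodes each policy has been deployed
    t            : total episodes so far
    """
    counts = np.asarray(counts, dtype=float)
    if (counts == 0).any():                       # deploy every policy once first
        return int(np.argmin(counts))
    ucb = np.asarray(mean_returns) + c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(ucb))                    # optimism under uncertainty
```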
|
| |
| 09:00-10:30, Paper WeI1I.183 | Add to My Program |
| HybNetic: A Mobile Hybrid Magnetic Actuation System |
|
| Masjosthusmann, Lukas | University of Twente |
| Posselli, Nicholas | University of Twente |
| Misra, Sarthak | University of Twente |
Keywords: Medical Robots and Systems, Dexterous Manipulation
Abstract: Magnetic actuation enables contactless control of medical microrobots and instruments and offers the potential for improved safety and effectiveness in robot-assisted minimally invasive surgery. While much research is being conducted on the development of surgical devices, there is a lack of external actuation systems that provide the necessary magnetic field shaping capability for in vivo control. Existing magnetic actuation systems often face trade-offs between field shaping capability and workspace size. In this work, we introduce HybNetic, a mobile hybrid magnetic actuation system that combines a single electromagnet with four independently rotatable permanent magnets mounted on a robotic arm. The C-shaped configuration of HybNetic has an opening of 520 mm, allowing positioning around the human torso. The mobility of the employed robotic arm extends the effective workspace to the length of a human body. We describe the design and field modeling and characterize the magnetic performance by comparing analytical model predictions and finite element simulations with experimental validations. Finally, we demonstrate the versatility of HybNetic by levitating a magnetic sphere and navigating a magnetic guidewire through a dimensionally accurate phantom of the abdominal aorta. The demonstrations highlight the potential of HybNetic as a magnetic actuation system with a workspace that is suitable for in vivo manipulation of macro- and micro-scale magnetic devices.
|
| |
| 09:00-10:30, Paper WeI1I.184 | Add to My Program |
| Touch with Insight: Physics-Aware Data-Driven Learning for EIT-Based Tactile Sensing |
|
| Nazari, Kiyanoush | University College London |
| Huang, Yunqi | University College London |
| Hardman, David | University of Cambridge |
| Iida, Fumiya | University of Cambridge |
| George Thuruthel, Thomas | University College London |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation, Dexterous Manipulation
Abstract: Tactile sensing is essential for enabling dexterous robotic manipulation, yet estimating contact states such as location and force from high-dimensional sensor measurements remains challenging due to noise and complex nonlinear mappings between raw signals and physical interaction states. In this work, we propose a physics-informed contact modeling framework that combines the flexibility of deep models with inductive biases from physical modeling. Focusing on electrical impedance tomography (EIT) tactile skins, our approach incorporates knowledge of the EIT forward model by regularizing neural estimators with a latent-space consistency constraint, stabilizing the ill-posed inverse mapping from voltages to contact states. To support robust training and evaluation, we also develop a high-fidelity simulation pipeline that incorporates key hardware imperfections to better bridge the sim-to-real gap. We benchmark multiple architectures—including multilayer perceptrons, convolutional networks, Transformer-based models, and autoencoder regressors—on both real and synthetic datasets. Results show that the proposed hybrid approach consistently improves estimation accuracy, particularly for force prediction, and generalizes across domains. These findings highlight the value of embedding physical priors into learning pipelines for reliable tactile state estimation in robotic manipulation.
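One plausible form of such a physics-informed objective is sketched below: it penalizes disagreement between the measured voltages and those reproduced by a differentiable surrogate of the EIT forward model. The paper applies the consistency constraint in a learned latent space; the measurement-space variant here is a simplification, and `forward_op` is an assumed given.

```python
import torch
import torch.nn.functional as F

def physics_informed_loss(estimator, forward_op, voltages, targets, lam=0.1):
    """Illustrative training loss for EIT tactile state estimation.

    estimator  : net mapping voltages -> contact state (location, force)
    forward_op : differentiable surrogate of the EIT forward model,
                 mapping contact state -> predicted voltages
    """
    state = estimator(voltages)
    task_loss = F.mse_loss(state, targets)
    consistency = F.mse_loss(forward_op(state), voltages)
    return task_loss + lam * consistency   # forward model regularizes inverse map
```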
|
| |
| 09:00-10:30, Paper WeI1I.185 | Add to My Program |
| 3D Robotic Swarmalators That Reconfigure, Navigate, and Avoid Obstacles |
|
| Xu, Zehui | University of Michigan Ann Arbor |
| Xu, Xinyue | University of Michigan, Ann Arbor |
| Ceron, Steven | University of Michigan |
Keywords: Swarm Robotics, Multi-Robot Systems
Abstract: We realize 3D robotic swarmalators that reconfigure, navigate, and avoid obstacles with formal safety on Crazyflie drones. We incorporate ellipsoidal Control Barrier Functions to avoid downwash turbulence between drones, and a combination of Control Lyapunov Function and Control Barrier Function methods to enable the collective to move toward desired locations while avoiding collisions between drones or with nearby obstacles. We implement a global control scheme that moves the collective as a single entity, and a local control scheme that enables fluid-like flow around nearby obstacles while maintaining the same general collective formation. Finally, we demonstrate how the swarmalator model combined with these control schemes can be used to reconfigure and rotate a drone collective so it moves through a narrow passage without colliding with the surrounding environment. Our simulations and physical experiments quantify scalability limits and validate the feasibility of implementing 3D swarmalator-based control on real drone collectives.
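For a single active constraint, the CBF safety filter used in such schemes has a closed form; the sketch below assumes single-integrator dynamics (ẋ = u) and one barrier h(x) with gradient grad_h, which is a simplification of the ellipsoidal multi-constraint setting in the paper.

```python
import numpy as np

def cbf_filter(u_nom, grad_h, h, alpha=1.0):
    """Minimal single-constraint CBF safety filter (illustrative).

    Solves  min ||u - u_nom||^2  s.t.  grad_h . u >= -alpha * h
    in closed form for single-integrator dynamics.
    """
    a = np.asarray(grad_h, dtype=float)
    slack = a @ u_nom + alpha * h
    if slack >= 0.0:                        # nominal input already safe
        return np.asarray(u_nom, dtype=float)
    return u_nom - slack * a / (a @ a)      # project onto constraint boundary
```

With multiple barriers (downwash ellipsoids, obstacles) the same objective becomes a small quadratic program solved at every control step.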
|
| |
| 09:00-10:30, Paper WeI1I.186 | Add to My Program |
| A High-Payload Wall-Climbing Robot Using Passive Bistable Suction Cups |
|
| Nguyen, Andrew | University of Michigan |
| Li, Mingyuan | University of Michigan Ann Arbor |
| Bruder, Daniel | University of Michigan |
Keywords: Climbing Robots
Abstract: Wall-climbing robots capable of scaling vertical surfaces could help automate hazardous or labor-intensive tasks such as window washing, inspection, maintenance, and construction. Active adhesion methods achieve higher payload capacities but require power to maintain their grip. Passive adhesion devices such as suction cups are an attractive option for such robots because they hold without power, but they are limited in payload capacity. This work presents a novel high-payload wall-climbing robot that utilizes passive bistable suction cups to generate adhesion without needing to be pushed into the wall. The robot features a track-based system that automatically engages and disengages bistable suction cups to achieve locomotion on smooth surfaces. The robot is able to achieve vertical wall climbing on glass, wood, metal, and painted surfaces, as well as sideways and upside-down climbing, and can tow a payload of 7.940 kg (a payload-to-weight ratio of 2.25).
|
| |
| 09:00-10:30, Paper WeI1I.187 | Add to My Program |
| Optimal Dexterity Path Planning for Robotic Manipulators Using Rapid Workspace Density Approximation |
|
| Osikowicz, Nathaniel | Pennsylvania State University |
| Cooper, John | NASA |
| Singla, Puneet | The Pennsylvania State University |
Keywords: Motion and Path Planning, Probability and Statistical Methods, Manipulation Planning
Abstract: This paper introduces a path planning algorithm for executing robotic manipulation tasks with maximum dexterity in the workspace. This is achieved by using the workspace density of the end-effector as the objective function in a sampling-based planner. In doing so, the path planning algorithm prioritizes joint configurations that correspond to the highest density of local end-effector positions. This results in a singularity avoidant path planning algorithm that favors redundancy, making it a favorable approach for manipulation scenarios in which dexterity is paramount. However, due to the exponential relationship between the number of possible end-effector positions and the number of joints, computing the workspace density via traditional methods is computationally intractable for most modern industrial robots. In this paper, a newly developed approach is taken wherein the workspace density is approximated by a Gaussian mixture model that solves for the optimal workspace density function subject to higher-order statistical moment constraints. The statistical moments of the workspace density function are computed recursively with a minimum number of sample points by using a non-product quadrature rule known as the Conjugate Unscented Transform (CUT). This results in a computationally efficient framework that allows the user to trade accuracy and computation time by varying the number of mixture components and the number of statistical moments used in the workspace density approximation. To demonstrate, the algorithm is implemented on the Precision Assembled Space Structure (PASS) platform at NASA illustrating its effectiveness in dexterous robotic assembly tasks.
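To make the objective concrete, the sketch below builds a GMM workspace density from plain Monte Carlo forward-kinematics samples; this replaces the paper's moment-constrained formulation and Conjugate Unscented Transform with a simpler stand-in, so it illustrates the objective rather than the paper's efficient construction.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def workspace_density(fk, joint_lims, n_samples=20000, n_components=8):
    """Monte Carlo stand-in for a moment-constrained workspace-density GMM.

    fk         : forward kinematics, joint vector -> end-effector position (3,)
    joint_lims : (n_joints, 2) lower/upper joint limits
    """
    q = np.random.uniform(joint_lims[:, 0], joint_lims[:, 1],
                          size=(n_samples, joint_lims.shape[0]))
    pts = np.array([fk(qi) for qi in q])                  # reachable positions
    gmm = GaussianMixture(n_components=n_components).fit(pts)
    return gmm    # use gmm.score_samples(x) as the planner's dexterity objective
```

High density at an end-effector position means many joint configurations reach it, which is the redundancy the planner is rewarding.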
|
| |
| 09:00-10:30, Paper WeI1I.188 | Add to My Program |
| The Impact of Motor Action on Language Acquisition and Action-Verb Learning in a Robot |
|
| Lemhaouri, Zakaria | CY Cergy Paris University / Vrije Universiteit Brussel/Esiee-It |
| Cohen, Laura | CY Cergy Paris Université |
| Nowé, Ann | VUB |
| Canamero, Lola | CY Cergy Paris University |
Keywords: Developmental Robotics, Cognitive Modeling, Bioinspired Robot Learning
Abstract: In humans, the acquisition of a new motor skill is associated with the development of a wide range of cognitive areas and can create contexts in which new cognitive capacities develop. Motor development is linked to language development in infants, as crawling and walking promote active exploration of the environment, while manipulating objects and pointing draw the caregiver’s attention and help establish joint attention. Together, these motor experiences broaden communication contexts and support the learning of nouns (object-based words) and verbs (action-based words). However, many questions remain unanswered about how children's actions influence language development, qualitatively and quantitatively, and how they help the acquisition of different types of words, particularly the learning of verbs. In this paper, we propose a robot architecture to study how gestures can affect early language learning. The architecture follows the developmental robotics paradigm, i.e. inspired by the way human children develop and acquire language according to multiple developmental theories. The experimental results demonstrate that enabling the robot to produce gestures expands its vocabulary size and facilitates the acquisition of verbs. These results are in line with the finding that verb learning lags behind noun learning since the acquisition of verbs depends more on motor abilities and requires the maturation of motor development.
|
| |
| 09:00-10:30, Paper WeI1I.189 | Add to My Program |
| Continual-RL for Generalization in Autonomous Racing on the RoboRacer Platform |
|
| Siegert, Joel | ETH Zurich |
| Ghignone, Edoardo | ETH |
| Magno, Michele | ETH Zurich |
Keywords: Wheeled Robots, Continual Learning, Reinforcement Learning
Abstract: A key challenge in modern robotics is to adapt to changing environments, a challenge that is exacerbated when simulations cannot encompass every possible real-world configuration, and therefore Reinforcement Learning (RL) in the physical world becomes necessary. Continual RL provides the tools to address this challenge; however, both the frameworks and the methods remain underexplored. Autonomous Racing (AR), and in particular the RoboRacer competition, provides a testing ground for such methods, as learning to drive on a new track-floor combination with the least amount of new experience naturally frames a continual learning problem. This work addresses this gap by proposing a continual RL framework based on Continual Backpropagation (CBP) that is able, with only real-world data, to train a generalist policy on a set of tracks and then fine-tune it within 15 minutes to outperform classical controllers. Furthermore, a comparison method based on offline RL is proposed, and a simulation analysis of the plasticity properties of the methods is conducted.
|
| |
| 09:00-10:30, Paper WeI1I.190 | Add to My Program |
| An Infrastructure-Less, Control-Independent Solution to Relative Localisation of a Team of Mobile Robots Using Ranging Measurements |
|
| Golinelli, Paolo | University of Trento |
| Faraci, Tommaso | University of Trento |
| Fontanelli, Daniele | University of Trento |
Keywords: Localization, Multi-Robot Systems, Range Sensing
Abstract: The ability to localise teams of robots is essential for applications ranging from robotic fleets in unstructured environments to cooperative control and navigation tasks. In such contexts, fixed infrastructure is often unavailable, deployments must be fast and flexible, and system requirements must be minimal. We present a decentralised cooperative localisation algorithm that addresses all these challenges at once. The method is anchor-less, fully decentralised, and, unlike most existing approaches, does not require controlling the robots' motion to ensure team observability. It relies only on local odometry, sparse inter-agent ranging measurements, and short-range communication, all of which are widely available in practice. The algorithm adopts a multi-hypothesis Bayesian framework that maintains the entire set of feasible solutions, ensuring robustness under transient unobservable conditions. Moreover, through information sharing, each agent benefits from the estimates of the entire group, even in partially connected conditions.
|
| |
| 09:00-10:30, Paper WeI1I.191 | Add to My Program |
| Learning to Throw Objects Safely in Multi-Obstacle Environments |
|
| Kasaei, Mohammadreza | University of Edinburgh |
| Voncina, Klemen | University of Groningen |
| Kasaei, Hamidreza | University of Groningen |
Keywords: Service Robotics, Logistics, Machine Learning for Robot Control
Abstract: Robotic throwing enables fast and efficient object placement beyond the robot’s immediate workspace, but reliable throwing in cluttered environments remains underexplored. Existing approaches, such as TossingBot, learn throwing strategies from visual input but assume obstacle-free settings. In this paper, we address the problem of throwing objects into a target basket while avoiding obstacles placed randomly in the scene. We introduce a potential field state representation (PFR) that compactly encodes both basket attraction and obstacle repulsion on a fixed-size grid, enabling reinforcement learning (RL) policies to generalize across arbitrary numbers and configurations of obstacles. The policy is initialized from kinesthetic demonstrations and optimized in simulation using three state-of-the-art RL algorithms (SAC, DDPG, TD3). Among these, SAC achieves the most consistent performance across scenarios. We compare the potential field representation against explicit state encodings and demonstrate that it achieves higher success rates and better scalability to unseen obstacle configurations. Real-robot experiments with unseen throwable objects confirm robust sim-to-real transfer, achieving up to 90% success in cluttered scenes. These results demonstrate that PFR provides a practical and robust representation for safe and efficient robotic throwing in unstructured environments. A video showcasing our experiments has been attached to the paper as supplementary material.
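The fixed-size grid encoding can be sketched directly: basket attraction as a distance field minus Gaussian repulsion bumps at each obstacle. The function below is illustrative (grid size, extent, and kernel width are assumptions), but it shows why the input stays fixed-size regardless of how many obstacles are present.

```python
import numpy as np

def potential_field_grid(basket, obstacles, size=32, extent=2.0, sigma=0.2):
    """Fixed-size potential-field state for a throwing policy (illustrative).

    basket    : (x, y) target basket position
    obstacles : iterable of (x, y) obstacle positions
    """
    xs = np.linspace(-extent, extent, size)
    gx, gy = np.meshgrid(xs, xs)
    grid = -np.hypot(gx - basket[0], gy - basket[1])   # attraction peaks at basket
    for ox, oy in obstacles:                           # Gaussian repulsion bumps
        grid -= np.exp(-((gx - ox) ** 2 + (gy - oy) ** 2) / (2 * sigma ** 2))
    return grid.astype(np.float32)                     # (size, size) policy input
```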
|
| |
| 09:00-10:30, Paper WeI1I.192 | Add to My Program |
| DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos |
|
| Bai, Yang | Ludwig Maximilian University of Munich |
| Yang, Liudi | University of Freiburg |
| Eskandar, George | University of Stuttgart |
| Shen, Fengyi | Technical University of Munich |
| Altillawi, Mohammad | Huawei, Autonomous University of Barcelona, |
| Liu, Ziyuan | Huawei Group |
| Kutyniok, Gitta | The Ludwig Maximilian University of Munich |
Keywords: Deep Learning in Grasping and Manipulation, Imitation Learning, Deep Learning Methods
Abstract: Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Recent works on trajectory-conditioned video generation address this gap but often rely on 2D trajectories or single modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations. We present DRAW2ACT, a depth-aware trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape and motion, and injects them into the diffusion model. Moreover, we propose to jointly generate spatially aligned RGB and depth videos, leveraging cross-modality attention mechanisms and depth supervision to enhance the spatio-temporal consistency. Finally, we introduce a multimodal policy model conditioned on the generated RGB and depth sequences to regress the robot's joint angles. Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.
|
| |
| 09:00-10:30, Paper WeI1I.193 | Add to My Program |
| Perceive What Matters: Relevance-Driven Scheduling for Multimodal Streaming Perception |
|
| Huang, Dingcheng | Massachusetts Institute of Technology |
| Zhang, Xiaotong | Massachusetts Institute of Technology |
| Youcef-Toumi, Kamal | Massachusetts Institute of Technology |
Keywords: Robotics in Under-Resourced Settings, Computer Vision for Automation, RGB-D Perception
Abstract: In modern human-robot collaboration (HRC) applications, multiple perception modules jointly extract visual, auditory, and contextual cues to achieve comprehensive scene understanding, enabling the robot to provide appropriate assistance to human agents intelligently. While executing multiple perception modules on a frame-by-frame basis enhances perception quality in offline settings, it inevitably accumulates latency, leading to a substantial decline in system performance in streaming perception scenarios. Recent work in scene understanding, termed Relevance, has established a solid foundation for developing efficient methodologies in HRC. However, modern perception pipelines still face challenges related to information redundancy and suboptimal allocation of computational resources. Drawing inspiration from the Relevance concept and the information sparsity in HRC events, we propose a novel lightweight perception scheduling framework that efficiently leverages output from previous frames to estimate and schedule necessary perception modules in real-time based on scene context. The experimental results demonstrate that the proposed perception scheduling framework effectively reduces computational latency by up to 27.52% compared to conventional parallel perception pipelines, while also achieving a 72.73% improvement in MMPose activation recall. Additionally, the framework demonstrates high keyframe accuracy, achieving rates of up to 98%. The results validate the framework's capability to enhance real-time perception efficiency without significantly compromising accuracy. The framework shows potential as a scalable and systematic solution for multimodal streaming perception systems in HRC.
|
| |
| 09:00-10:30, Paper WeI1I.194 | Add to My Program |
| SE(3)-PoseFlow: Estimating 6D Pose Distributions for Uncertainty-Aware Robotic Manipulation |
|
| Jin, Yufeng | Technische Universität Darmstadt |
| Funk, Niklas Wilhelm | TU Darmstadt |
| Prasad, Vignesh | TU Darmstadt |
| Li, Zechu | Technische Universität Darmstadt |
| Franzius, Mathias | Honda Research Institute (HRI) |
| Peters, Jan | Technische Universität Darmstadt |
| Chalvatzaki, Georgia | Technische Universität Darmstadt |
Keywords: Deep Learning for Visual Perception, RGB-D Perception, Perception for Grasping and Manipulation
Abstract: Object pose estimation is a fundamental problem in robotics and computer vision, yet it remains challenging due to partial observability, occlusions, and object symmetries, which inevitably lead to pose ambiguity and multiple hypotheses consistent with the same observation. While deterministic deep networks achieve impressive performance under well-constrained conditions, they are often overconfident and fail to capture the multi-modality of the underlying pose distribution. To address these challenges, we propose a probabilistic framework that leverages flow matching on the SE(3) manifold for estimating 6D object pose distributions. Unlike existing methods that regress a single deterministic output, our approach models the full pose distribution with a sample-based estimate and enables reasoning about uncertainty in ambiguous cases such as symmetric objects or severe occlusions. We achieve state-of-the-art results on Real275, YCB-V and LM-O, and demonstrate how our sample-based pose estimates can be leveraged in downstream robotic manipulation tasks such as active perception for disambiguating uncertain viewpoints, or guiding grasp synthesis in an uncertainty-aware manner.
|
| |
| 09:00-10:30, Paper WeI1I.195 | Add to My Program |
| Practical and Performant Enhancements for Maximization of Algebraic Connectivity |
|
| Jung, Leonard | Northeastern University |
| Papalia, Alan | University of Michigan |
| Doherty, Kevin | Massachusetts Institute of Technology |
| Everett, Michael | Northeastern University |
Keywords: SLAM, Mapping
Abstract: Long-term state estimation over graphs remains challenging as current graph estimation methods scale poorly on large, long-term graphs. To address this, our work advances a current state-of-the-art graph sparsification algorithm, maximizing algebraic connectivity (MAC). MAC is a sparsification method that preserves estimation performance by maximizing the algebraic connectivity, a spectral graph property that is directly connected to the estimation error. Unfortunately, MAC remains computationally prohibitive for online use and requires users to manually pre-specify a connectivity-preserving edge set. Our contributions close these gaps along three complementary fronts: we develop a specialized solver for algebraic connectivity that yields an average 2x runtime speedup; we investigate advanced step size strategies for MAC’s optimization procedure to enhance both convergence speed and solution quality; and we propose automatic schemes that guarantee graph connectivity without requiring manual specification of edges. Together, these contributions make MAC more scalable, reliable, and suitable for real-time estimation applications.
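For reference, the quantity being maximized is the second-smallest eigenvalue of the weighted graph Laplacian (the Fiedler value); a minimal dense computation, suitable for small graphs, follows. MAC itself uses specialized solvers precisely because this dense eigendecomposition does not scale.

```python
import numpy as np

def algebraic_connectivity(weights, edges, n):
    """Algebraic connectivity lambda_2 of a weighted graph (dense, illustrative).

    weights : per-edge weights, edges : list of (i, j) node pairs, n : #nodes
    """
    L = np.zeros((n, n))
    for w, (i, j) in zip(weights, edges):   # assemble the graph Laplacian
        L[i, i] += w
        L[j, j] += w
        L[i, j] -= w
        L[j, i] -= w
    eigvals = np.linalg.eigvalsh(L)         # ascending; eigvals[0] is ~0
    return eigvals[1]                       # Fiedler value; > 0 iff connected
```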
|
| |
| 09:00-10:30, Paper WeI1I.196 | Add to My Program |
| COMPAct: Computational Optimization and Automated Modular Design of Planetary Actuators |
|
| Singh, Aman | Indian Institute of Science |
| Kapa, Deepak | Indian Institute of Technology Roorkee |
| Joshi, Suryank | Indian Institute of Science |
| Kolathaya, Shishir | Indian Institute of Science |
Keywords: Actuation and Joint Mechanisms, Legged Robots, Mechanism Design
Abstract: The optimal design of robotic actuators is a critical area of research, yet limited attention has been given to optimizing gearbox parameters and automating actuator CAD. This paper introduces COMPAct: Computational Optimization and Automated Modular Design of Planetary Actuators, a framework that systematically identifies optimal gearbox parameters for a given motor across four gearbox types: single-stage planetary gearbox (SSPG), compound planetary gearbox (CPG), Wolfrom planetary gearbox (WPG), and double-stage planetary gearbox (DSPG). The framework minimizes mass and actuator width while maximizing efficiency, and further automates actuator CAD generation to enable direct 3D printing without manual redesign. Using this framework, optimal gearbox designs are explored across a wide range of gear ratios, providing insights into the suitability of different gearbox types while automatically generating CAD models for all four gearbox types with varying gear ratios and motors. Two actuator types are fabricated and experimentally evaluated through power efficiency, no-load backlash, and transmission stiffness tests. Experimental results indicate that the SSPG actuator achieves a mechanical efficiency of 60--80%, a no-load backlash of 0.59 deg, and a transmission stiffness of 242.7 Nm/rad, while the CPG actuator demonstrates 60% efficiency, 2.6 deg backlash, and a stiffness of 201.6 Nm/rad. CODE: https://github.com/singhaman1750/COMPAct.git VIDEO: https://youtu.be/etK6anjXag8
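For the simplest of the four types, the SSPG, the reduction ratio follows the standard planetary formula (sun input, carrier output, ring fixed); the snippet below states it with a worked example. Tooth counts are illustrative, not values from the paper.

```python
def sspg_ratio(z_sun, z_ring):
    """Reduction ratio of a single-stage planetary gearbox (standard formula):
    sun gear input, planet carrier output, ring gear fixed."""
    return 1.0 + z_ring / z_sun

# e.g. a 72-tooth ring with an 18-tooth sun gives a 5:1 reduction
assert sspg_ratio(18, 72) == 5.0
```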
|
| |
| 09:00-10:30, Paper WeI1I.197 | Add to My Program |
| Bootstrapping Self-Supervised Learning of Binary Classification Using Error Bounds: A Case Study on a Robotic Insertion Task |
|
| Duan, Zebin | University of Southern Denmark |
| Krüger, Norbert | University of Southern Denmark |
| Heredia, Juan | University of Southern Denmark |
| Iversen, Thorbjørn Mosekjær | University of Southern Denmark |
| Hagelskjær, Frederik | University of Southern Denmark |
Keywords: Continual Learning, Industrial Robots, Force and Tactile Sensing
Abstract: Flexible manufacturing requires rapid deployment of solutions and minimal setup time to remain competitive. An essential attribute is the ability to control error levels, as failures can range from minor performance degradation to severe equipment damage. However, conventional deployment often involves extensive setup, data collection, model training or parameter tuning, and system testing, resulting in significant delays that hinder commercial feasibility. We propose a data engine that gathers data and improves its performance while executing the task. The data engine consists of two classifiers: a fast model prediction and an expensive verification. First, a model prediction is performed, and based on the confidence level of the prediction, the expensive verification can be used. By adjusting the confidence level, users can control the level of tolerable error. Our method is implemented on a real-world robotic insertion task, which uses force data for the model prediction. The system applies UMAP dimensionality reduction and uses the Wilson score to compute the confidence bounds of the prediction. Results demonstrate the ability to learn and reduce the need for expensive verifications over time, while staying within the set error rate. The results highlight the potential of confidence bounds in self-improving models to enhance reliability in robotic classification tasks.
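The Wilson score interval is a standard closed-form bound on a binomial success rate; a self-contained implementation is shown below (how the paper maps this bound onto its confidence gating is not specified here).

```python
import math

def wilson_bounds(successes, n, z=1.96):
    """Wilson score interval for a binomial success rate (z = 1.96 for ~95%)."""
    if n == 0:
        return 0.0, 1.0                       # no data: vacuous bounds
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half
```

Unlike the naive normal approximation, the interval stays inside [0, 1] and remains meaningful for small n, which matters when a classifier must decide early whether to trust its own predictions.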
|
| |
| 09:00-10:30, Paper WeI1I.198 | Add to My Program |
| TactEx: An Explainable Multimodal Robotic Interaction Framework for Human-Like Touch and Hardness Estimation |
|
| Verstraete, Felix | Imperial College London |
| Wei, Lan | Imperial College London |
| Fan, Wen | University of Bristol |
| Zhang, Dandan | Imperial College London |
Keywords: Force and Tactile Sensing, Sensor Fusion, Visual Servoing
Abstract: Accurate perception of object hardness is essential for safe and dexterous contact-rich robotic manipulation. Here, we present TactEx, an explainable multimodal robotic interaction framework that unifies vision, touch, and language for human-like hardness estimation and interactive guidance. While task-agnostic, we demonstrate and evaluate TactEx's capabilities on fruit-ripeness assessment as a representative use case requiring tactile perception and contextual understanding. Our system fuses GelSight-Mini tactile streams with RGB observations and language prompts: a ResNet50 + LSTM estimates hardness from tactile sequential data, and a cross-modal alignment module integrates visual cues with LLM guidance. The resulting interface is explainable and multimodal, enabling users to identify fruit ripeness levels with statistically significant separability (p < 0.01 for all fruit pairs). For touch placement, we compare YOLO with Grounded-SAM (GSAM) and find GSAM to be more robust for fine-grained segmentation and contact-site selection. A lightweight LLM parses user instructions and produces grounded natural-language explanations linked to the tactile outputs. In end-to-end evaluations, TactEx attains 90% task success on simple user queries and generalises to novel tasks without large-scale tuning. These results highlight the promise of combining pretrained visual and tactile models with language grounding to advance explainable, human-like touch perception and decision-making in robotics.
|
| |
| 09:00-10:30, Paper WeI1I.199 | Add to My Program |
| Task Robustness Via Re-Labelling Vision-Action Robot Data |
|
| Kuramshin, Artur | Université De Montréal |
| Aslan, Ozgur | Université De Montréal |
| Neary, Cyrus | The University of British Columbia |
| Berseth, Glen | Université De Montréal |
Keywords: Imitation Learning, Deep Learning in Grasping and Manipulation, Big Data in Robotics and Automation
Abstract: The recent trend in scaling models for robot learning has resulted in impressive policies that can perform various manipulation tasks and generalize to novel scenarios. However, these policies continue to struggle with following instructions, likely due to the limited linguistic and action sequence diversity in existing robotics datasets. This paper introduces Task Robustness via RE-Labelling Vision-Action Robot Data (TREAD), a scalable framework that leverages large Vision-Language Models (VLMs) to augment existing robotics datasets without additional data collection, harnessing the transferable knowledge embedded in these models. Our approach leverages a pretrained VLM through three stages: generating semantic sub-tasks from original instruction labels and initial scenes, segmenting demonstration videos conditioned on these sub-tasks, and producing diverse instructions that incorporate object properties, effectively decomposing longer demonstrations into grounded language-action pairs. We further enhance robustness by augmenting the data with linguistically diverse versions of the text goals. Evaluations on LIBERO demonstrate that policies trained on our augmented datasets exhibit improved performance on novel, unseen tasks and goals. Our results show that TREAD enhances both planning generalization through trajectory decomposition and language-conditioned policy generalization through increased linguistic diversity. Project website: https://akuramshin.github.io/tread.
|
| |
| 09:00-10:30, Paper WeI1I.200 | Add to My Program |
| Will People Enjoy a Robot Trainer? A Case Study with Snoopie the Pacerbot |
|
| Du, Maximilian | Stanford |
| Grannen, Jennifer | Stanford University |
| Song, Shuran | Stanford University |
| Sadigh, Dorsa | Stanford University |
Keywords: Physical Human-Robot Interaction, Human-Centered Robotics, Robot Companions
Abstract: The physicality of exercise makes the role of athletic trainers unique. Their physical presence allows them to guide a student through a motion, demonstrate an exercise, and give intuitive feedback. Robot quadrupeds are also embodied agents with robust agility and athleticism. In our work, we investigate whether a robot quadruped can serve as an effective and enjoyable personal trainer. We focus on a case study of interval training for runners: a repetitive, long-horizon task where precision and consistency are important. To meet this challenge, we propose Snoopie, an autonomous robot quadruped pacer capable of running interval training exercises tailored to challenge a user's personal abilities. We conduct a set of user experiments that compare the robot trainer to a wearable trainer device--the Apple Watch--to investigate the benefits of a physical embodiment in exercise-based interactions. Participants demonstrated 60.6% better adherence to a pace schedule and were 45.9% more consistent in their running speeds with the quadruped trainer. Subjective results also showed that participants strongly preferred training with the robot over wearable devices across many qualitative axes, including its ease of use (+56.7%), enjoyability of the interaction (+60.6%), and helpfulness (+39.1%). Additional videos and visualizations can be found on our website: https://sites.google.com/view/snoopie
|
| |
| 09:00-10:30, Paper WeI1I.201 | Add to My Program |
| Weakly-Supervised Learning for Physics-Informed Neural Motion Planning Via Sparse Roadmap |
|
| Ni, Ruiqi | Purdue University |
| Liu, Yuchen | Purdue University |
| Qureshi, Ahmed H. | Purdue University |
Keywords: Integrated Planning and Learning, Motion and Path Planning, Deep Learning Methods
Abstract: The motion planning problem requires finding a collision-free path between start and goal configurations in high-dimensional, cluttered spaces. Recent learning-based methods offer promising solutions, with self-supervised physics-informed approaches such as Neural Time Fields (NTFields) solving the Eikonal equation to learn value functions without expert demonstrations. However, existing physics-informed methods struggle to scale in complex, multi-room environments, where simply increasing the number of samples cannot resolve local minima or guarantee global consistency. We propose Hierarchical Neural Time Fields (H-NTFields), a weakly-supervised framework that combines weak supervision from sparse roadmaps with physics-informed PDE regularization. The roadmap provides global topological anchors through upper and lower bounds on travel times, while PDE losses enforce local geometric fidelity and obstacle-aware propagation. Experiments on 18 Gibson environments and real robotic platforms show that H-NTFields substantially improves robustness over prior physics-informed methods, while enabling fast amortized inference through a continuous value representation.
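To make the combination of roadmap bounds and PDE regularization concrete, here is a minimal PyTorch sketch of such a loss; the network interface, speed term, and hinge form of the bound penalties are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def hntfields_loss(T_net, x, speed, roadmap_x, t_lo, t_hi, w_pde=1.0, w_rm=1.0):
    """Eikonal residual |grad T| = 1/speed plus roadmap travel-time bounds."""
    x = x.detach().requires_grad_(True)
    T = T_net(x)                                          # predicted travel times
    grad = torch.autograd.grad(T.sum(), x, create_graph=True)[0]
    pde = ((grad.norm(dim=-1) - 1.0 / speed) ** 2).mean() # local Eikonal residual
    T_rm = T_net(roadmap_x)                               # roadmap-anchored samples
    bounds = (torch.relu(T_rm - t_hi) + torch.relu(t_lo - T_rm)).mean()
    return w_pde * pde + w_rm * bounds                    # global anchors + local PDE
```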
|
| |
| 09:00-10:30, Paper WeI1I.202 | Add to My Program |
| Agile Flight Emerges from Multi-Agent Competitive Racing |
|
| Pasumarti, Vineet | University of Pennsylvania |
| Bianchi, Lorenzo | University of Rome Tor Vergata |
| Loquercio, Antonio | University of Pennsylvania |
Keywords: Aerial Systems: Perception and Autonomy, Machine Learning for Robot Control, Aerial Systems: Applications
Abstract: Through multi-agent competition and the sparse high-level objective of winning a race, we find that both agile flight (e.g., high-speed motion pushing the platform to its physical limits) and strategy (e.g., overtaking or blocking) emerge from agents trained with reinforcement learning. We provide evidence in both simulation and the real world that this approach outperforms the common paradigm of training agents in isolation with rewards that prescribe behavior, e.g., progress on the raceline, in particular when the complexity of the environment increases, e.g., in the presence of obstacles. Moreover, we find that multi-agent competition yields policies that transfer more reliably to the real world than policies trained with a single-agent progress-based reward, despite the two methods using the same simulation environment, randomization strategy, and hardware. In addition to improved sim-to-real transfer, the multi-agent policies also exhibit some degree of generalization to opponents unseen at training time. Overall, our work, following in the tradition of multi-agent competitive game-play in digital domains, shows that sparse task-level rewards are sufficient for training agents capable of advanced low-level control in the physical world.
|
| |
| 09:00-10:30, Paper WeI1I.203 | Add to My Program |
| ScheduleStream: Temporal Planning with Samplers for GPU-Accelerated Multi-Arm Task and Motion Planning & Scheduling |
|
| Garrett, Caelan | NVIDIA |
| Ramos, Fabio | University of Sydney, NVIDIA |
Keywords: Task and Motion Planning, Bimanual Manipulation, Task Planning
Abstract: Bimanual and humanoid robots are appealing because of their human-like ability to leverage multiple arms to efficiently complete tasks. However, controlling multiple arms at once is computationally challenging due to the growth in the hybrid discrete-continuous action space. Task and Motion Planning (TAMP) algorithms can efficiently plan in hybrid spaces but generally produce plans in which only one arm moves at a time, rather than schedules that allow for parallel arm motion. In order to extend TAMP to produce schedules, we present ScheduleStream, the first general-purpose framework for planning & scheduling with sampling operations. ScheduleStream models temporal dynamics using hybrid durative actions, which can be started asynchronously and persist for a duration that is a function of their parameters. We propose domain-independent algorithms that solve ScheduleStream problems without any application-specific mechanisms. We apply ScheduleStream to Task and Motion Planning & Scheduling (TAMPAS), where we use GPU acceleration within samplers to expedite planning. We compare ScheduleStream algorithms to several ablations in simulation and find that they produce more efficient solutions. We demonstrate ScheduleStream on several real-world bimanual robot tasks at https://schedulestream.github.io.
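A hybrid durative action of the kind described, one that can start asynchronously and whose duration is a function of its parameters, can be sketched as a small data structure; the fields below are our illustrative assumptions, not ScheduleStream's API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class DurativeAction:
    name: str                                       # e.g. "move_arm"
    params: Dict[str, Any]                          # hybrid discrete/continuous parameters
    duration_fn: Callable[[Dict[str, Any]], float]  # duration as a function of params
    start_time: float = 0.0                         # actions may start asynchronously

    @property
    def end_time(self) -> float:
        return self.start_time + self.duration_fn(self.params)

def overlaps(a: DurativeAction, b: DurativeAction) -> bool:
    """True if two scheduled actions run in parallel at any point in time."""
    return a.start_time < b.end_time and b.start_time < a.end_time
```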
|
| |
| 09:00-10:30, Paper WeI1I.204 | Add to My Program |
| Design and Dynamic Modeling of a Flying Fish Robot with Experimental Validation |
|
| Jin, Jiayi | University of Wisconsin-Madison |
| Wang, Wei | University of Wisconsin-Madison |
Keywords: Marine Robotics, Biologically-Inspired Robots, Simulation and Animation
Abstract: Flying fish have often served as an inspiration for engineering designs due to their remarkable ability for cross-domain locomotion between water and air. Previous observations and simulations suggest that taxiing behavior before takeoff is indispensable, yet the mechanics of this transition remain unclear. In this work, we present the design and dynamic modeling of a robotic flying fish to investigate swimming and taxiing locomotion, with a particular focus on tail pitching. We develop a bio-inspired prototype with an active tail-pitching structure and a high-power-density tail-beating propulsion system. We further formulate a cross-domain dynamic model that couples hydrodynamic and aerodynamic forces during taxiing. Simulations show that pitching the tail downward increases peak height and enables the robot to leave the water at a more aerodynamically favorable angle of attack. Experiments with the robotic prototype validate these trends and show that downward tail pitching increases upward force and body elevation with only minor loss of forward speed. These results provide insight into the role of tail pitching in flying fish taxiing and takeoff preparation.
|
| |
| 09:00-10:30, Paper WeI1I.205 | Add to My Program |
| A Single-Fiber Optical Frequency Domain Reflectometry (OFDR)-Based Shape Sensing of Concentric Tube Steerable Drilling Robots |
|
| Kulkarni, Yash | The University of Texas at Austin |
| Tavangarifard, Mobina | The University of Texas at Austin |
| Maroufi, Daniyal | University of Texas at Austin |
| Khadem, Mohsen | University of Edinburgh |
| Bird, Justin E. | The University of Texas M.D. Anderson Cancer Center |
| Siewerdsen, Jeff | Johns Hopkins |
| Alambeigi, Farshid | University of Texas at Austin |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles
Abstract: This paper introduces a novel shape-sensing approach for Concentric Tube Steerable Drilling Robots (CT-SDRs) based on Optical Frequency Domain Reflectometry (OFDR). Unlike traditional FBG-based methods, OFDR enables continuous strain measurement along the entire fiber length with enhanced spatial resolution. In the proposed method, a Shape Sensing Assembly (SSA) is first fabricated by integrating a single OFDR fiber with a flat NiTi wire. The calibrated SSA is then routed through and housed within the internal channel of a flexible drilling instrument, which is guided by the pre-shaped NiTi tube of the CT-SDR. In this configuration, the drilling instrument serves as a protective sheath for the SSA during drilling, eliminating the need for integration or adhesion to the instrument surface that is typical of conventional optical sensor approaches. The performance of the proposed SSA, integrated within the cannulated CT-SDR, was thoroughly evaluated under free-bending conditions and during drilling along multiple J-shaped trajectories in synthetic Sawbones phantoms. Results demonstrate accurate and reliable shape-sensing capability, confirming the feasibility and robustness of this integration strategy.
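For intuition about how distributed strain readings become shape, consider a planar simplification: a fiber offset a distance d from the neutral axis sees bending strain ε = κd, so curvature κ = ε/d can be integrated along arc length to recover a centerline. The sketch below illustrates this simplified model only; the paper's calibrated SSA reconstruction is more involved.

```python
import numpy as np

def planar_shape_from_strain(strain, ds, d):
    """Reconstruct a planar centerline from distributed strain readings.

    strain: per-sample axial strain along the fiber
    ds:     arc-length spacing between OFDR samples (m)
    d:      fiber offset from the neutral bending axis (m)
    """
    kappa = strain / d                  # bending strain -> curvature
    theta = np.cumsum(kappa) * ds       # integrate curvature -> bend angle
    x = np.cumsum(np.cos(theta)) * ds   # integrate heading -> position
    y = np.cumsum(np.sin(theta)) * ds
    return x, y
```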
|
| |
| 09:00-10:30, Paper WeI1I.206 | Add to My Program |
| Multistability Enabled Passive Multiplexing in an N-DOF, Underactuated Hyper-Redundant Robot |
|
| Nagata, Cole | University of Pennsylvania |
| Raney, Jordan | University of Pennsylvania |
| Yim, Mark | University of Pennsylvania |
Keywords: Mechanism Design, Tendon/Wire Mechanism, Compliant Joints and Mechanisms
Abstract: New developments in robotics have allowed robots to become very small and capable of completing tasks humans cannot. However, current robots are physically limited in how small they can be without compromising other aspects such as sensing, strength, or complexity. Thus, we strive to understand how to more compactly map complex mechanical outputs to a small number of mechanical inputs. This paper presents a novel design for a hyper-redundant robot capable of passive multiplexing. This is achieved using bistable joints at each link, with each link having a different bistable moment in order to establish priority when multiplexing. In doing so, this simple mechanism achieves individual joint control and reaches a variety of complex configurations. To demonstrate the proposed robot, we construct an eleven-link mechanism and a four-link mechanism, with which we demonstrate multiplexing as well as high positional accuracy. By simulating the mechanism, we also quantify a geometric relationship between individual links and the overall robot’s workspace.
|
| |
| 09:00-10:30, Paper WeI1I.207 | Add to My Program |
| Learning Adaptive Pseudo-Label Selection for Semi-Supervised 3D Object Detection |
|
| Kong, Taehun | KAIST, LG Electronics |
| Kim, Tae-Kyun | KAIST |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning Methods, Transfer Learning
Abstract: Semi-supervised 3D object detection (SS3DOD) aims to reduce costly 3D annotation by utilizing unlabeled data. Recent studies adopt pseudo-label-based teacher-student frameworks and demonstrate impressive performance. The main challenge of these frameworks lies in selecting high-quality pseudo-labels from the teacher’s predictions. Most previous methods, however, select pseudo-labels by comparing confidence scores against manually set thresholds. The latest works tackle the challenge either by dynamic thresholding or by refining the quality of pseudo-labels. Such methods still overlook contextual information, e.g., object distances, classes, and learning states, and inadequately assess pseudo-label quality using only the partial information available from the networks. In this work, we propose a novel SS3DOD framework featuring a learnable pseudo-labeling module designed to automatically and adaptively select high-quality pseudo-labels. Our approach introduces two networks at the teacher output level. These networks reliably assess the quality of pseudo-labels by score fusion and determine context-adaptive thresholds, supervised by the alignment of pseudo-labels with ground-truth (GT) bounding boxes. Additionally, we introduce a soft supervision strategy that learns robustly under pseudo-label noise. This helps the student network prioritize cleaner labels over noisy ones in semi-supervised learning. Extensive experiments on the KITTI and Waymo datasets demonstrate the effectiveness of our method. The proposed method selects high-precision pseudo-labels while maintaining wider context coverage and a higher recall rate, significantly improving relevant SS3DOD methods.
|
| |
| 09:00-10:30, Paper WeI1I.208 | Add to My Program |
| Efficient Hierarchical Reinforcement Learning with Dynamic Kolmogorov–Arnold Network for Long-Horizon Robotic Manipulation |
|
| Qu, Yuke | National University of Defense Technology |
| Ren, Junkai | National University of Defense Technology |
| Luo, Jiawei | National University of Defense Technology |
| Xie, Yufeng | National University of Defense Technology |
| Lu, Huimin | National University of Defense Technology |
| Xu, Xin | National University of Defense Technology |
| Ye, Yicong | National University of Defense Technology |
Keywords: Reinforcement Learning, AI-Based Methods
Abstract: Long-horizon robotic manipulation remains a critical challenge in robotics. Hierarchical reinforcement learning offers a promising solution but often suffers from an imbalance dilemma: simplifying skill learning increases the complexity of planning, thereby expanding the planner's solution space and computational burden. To tackle this challenge, we propose a Hierarchical Reinforcement Learning framework with Dynamic Kolmogorov-Arnold Network (DyKAN) based Actor-Critic, named HIKER. First, HIKER introduces a dual-chain design that decomposes the complex task into two intersecting sub-chains, reducing optimization conflicts across subtasks and alleviating the burden on the planning model. Second, we develop DyKAN, a scalable neural network for both the actor and the critic in HIKER's skill model. DyKAN adaptively adjusts grids and basis functions while preserving learned knowledge, enabling efficient learning of complex manipulation skills. Furthermore, to optimize DyKAN's performance, we design a per-layer update module that uses Dynamic Tanh (DyT) and low-rank decomposition to ensure stable, low-cost updates during training. Finally, experiments on long-horizon robotic manipulation tasks demonstrate that HIKER significantly improves efficiency and robustness, yielding higher-quality skill models and achieving a 10.9% increase in task success rate under high-noise conditions. Further insights are available on the website: https://sites.google.com/view/hikerdykan.
|
| |
| 09:00-10:30, Paper WeI1I.209 | Add to My Program |
| Multimodal Diffusion Forcing for Forceful Manipulation |
|
| Huang, Zixuan | University of Michigan |
| Hou, Huaidian | University of Michigan |
| Berenson, Dmitry | University of Michigan |
Keywords: Sensorimotor Learning, Deep Learning in Grasping and Manipulation, Force and Tactile Sensing
Abstract: Given a dataset of expert trajectories, standard imitation learning approaches typically learn a direct mapping from observations (e.g., RGB images) to actions. However, such methods often overlook the rich interplay between different modalities, i.e., sensory inputs, actions, and rewards, which is crucial for modeling robot behavior and understanding task outcomes. In this work, we propose Multimodal Diffusion Forcing (MDF), a unified framework for learning from multimodal robot trajectories that extends beyond action generation. Rather than modeling a fixed distribution, MDF applies random partial masking and trains a diffusion model to reconstruct the trajectory. This training objective encourages the model to learn temporal and cross-modal dependencies, such as predicting the effects of actions on force signals or inferring states from partial observations. We evaluate MDF on contact-rich, forceful manipulation tasks in simulated and real-world environments. Our results show that MDF not only delivers versatile functionalities but also achieves strong performance and robustness under noisy observations.
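The random partial masking at the heart of this objective is easy to sketch: draw an independent keep/reconstruct mask per timestep and modality, hide the masked entries, and penalize reconstruction error on them. The code below is our illustrative stand-in for the full diffusion training loop, with an assumed model interface.

```python
import torch

def random_modality_mask(batch, keep_prob=0.5):
    """batch: dict of modality name -> (B, T, D_m) trajectory tensor."""
    masks = {}
    for name, x in batch.items():
        B, T = x.shape[:2]
        m = (torch.rand(B, T, 1, device=x.device) < keep_prob).float()
        masks[name] = m                  # 1 = observed, 0 = to be reconstructed
    return masks

def masked_recon_loss(model, batch, masks):
    """Train the model to fill in masked entries from the observed ones."""
    inputs = {k: batch[k] * masks[k] for k in batch}   # hide masked entries
    recon = model(inputs, masks)                        # assumed model interface
    return sum(((recon[k] - batch[k]) ** 2 * (1 - masks[k])).mean() for k in batch)
```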
|
| |
| 09:00-10:30, Paper WeI1I.210 | Add to My Program |
| RODEO: RObotic DEcentralized Organization |
|
| Groshev, Milan | IE University |
| Castello Ferrer, Eduardo | IE University |
Keywords: Software Architecture for Robotic and Automation, Software Tools for Robot Programming, Computer Architecture for Robotic and Automation
Abstract: Robots are achieving greater autonomy with minimal human supervision. However, auditable actions, transparent decision processes, and new human-robot interaction models remain missing requirements for extended robot autonomy. To tackle these challenges, we propose RODEO (RObotic DEcentralized Organization), a blockchain-based framework that integrates trust and accountability mechanisms for robots. This paper formalizes Decentralized Autonomous Organizations (DAOs) for service robots. First, it provides a ROS–ETH bridge between the DAO and the robots. Second, it offers templates that enable organizations (e.g., companies, universities) to integrate service robots into their operations. Third, it provides proof-verification mechanisms that make robot actions auditable. In our experimental setup, a mobile robot was deployed as a trash collector in a lab scenario. The robot collects trash and uses a smart bin to sort and dispose of it correctly. Then, the robot submits a proof of the successful operation and is compensated in DAO tokens. Finally, the robot re-invests the acquired funds to purchase battery charging services. Data collected in a three-day experiment show that the robot doubled its income and reinvested funds to extend its operating time. Proof-validation times of approximately one minute ensured verifiable task execution, while the accumulated robot income successfully funded up to 88 hours of future autonomous operation. The results of this research give insight into how robots and organizations can coordinate tasks and payments with auditable execution proofs and on-chain settlement.
|
| |
| 09:00-10:30, Paper WeI1I.211 | Add to My Program |
| ADM-DP: Adaptive Dynamic Modality Diffusion Policy through Vision-Tactile-Graph Fusion for Multi-Agent Manipulation |
|
| Enyi, Wang | Imperial College London |
| Fan, Wen | University of Bristol |
| Zhang, Dandan | Imperial College London |
Keywords: Imitation Learning, Deep Learning in Grasping and Manipulation, Multi-Robot Systems
Abstract: Multi-agent robotic manipulation remains challenging due to the combined demands of coordination, grasp stability, and collision avoidance in shared workspaces. To address these challenges, we propose the Adaptive Dynamic Modality Diffusion Policy (ADM-DP), a framework that integrates vision, tactile, and graph-based (multi-agent pose) modalities for coordinated control. ADM-DP introduces four key innovations. First, an enhanced visual encoder merges RGB and point-cloud features via Feature-wise Linear Modulation (FiLM) to enrich perception. Second, a tactile-guided grasping strategy uses Force-Sensitive Resistor (FSR) feedback to detect insufficient contact and trigger corrective grasp refinement, improving grasp stability. Third, a graph-based collision encoder leverages the shared tool center point (TCP) positions of multiple agents as structured kinematic context to maintain spatial awareness and reduce inter-agent interference. Fourth, an Adaptive Modality Attention Mechanism (AMAM) dynamically re-weights modalities according to task context, enabling flexible fusion. For scalability and modularity, a decoupled training paradigm is employed in which agents learn independent policies while sharing spatial information. This maintains low interdependence between agents while retaining collective awareness. Across seven multi-agent tasks, ADM-DP achieves 12-25% performance gains over state-of-the-art baselines. Ablation studies show the greatest improvements in tasks requiring multiple sensory modalities, validating our adaptive fusion strategy and demonstrating its robustness for diverse manipulation scenarios.
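FiLM itself is a small operation: a conditioning vector predicts a per-channel scale and shift applied to the features being modulated. A minimal PyTorch sketch with illustrative dimensions (not the authors' encoder):

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: cond -> (gamma, beta), applied per channel."""
    def __init__(self, cond_dim, feat_dim):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, feats, cond):       # feats: (B, feat_dim), cond: (B, cond_dim)
        gamma, beta = self.proj(cond).chunk(2, dim=-1)
        return gamma * feats + beta

# e.g., modulating point-cloud features with RGB features (dimensions assumed):
film = FiLM(cond_dim=512, feat_dim=256)
fused = film(torch.randn(8, 256), torch.randn(8, 512))
```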
|
| |
| 09:00-10:30, Paper WeI1I.212 | Add to My Program |
| Constraint Manifold Exploration for Efficient Continuous Coverage Estimation |
|
| Wilbrandt, Robert | FZI Forschungszentrum Informatik |
| Dillmann, Rüdiger | FZI Forschungszentrum Informatik |
Keywords: Kinematics, Constrained Motion Planning, Industrial Robots
Abstract: Many automated manufacturing processes rely on industrial robot arms to move process-specific tools along workpiece surfaces. In applications like grinding, sanding, spray painting, or inspection, they need to cover a workpiece fully while keeping their tools perpendicular to its surface. While approaches exist to generate trajectories for these applications, adequate methods for analyzing the feasibility of full surface coverage are lacking. This work proposes a sampling-based approach for continuous coverage estimation that explores reachable surface regions in the configuration space. We define an extended ambient configuration space that allows for the representation of tool position and orientation constraints. A continuation-based approach is used to explore it with two different sampling strategies. A thorough evaluation across different kinematics and environments analyzes their runtime and efficiency. This validates our ability to accurately and efficiently calculate surface coverage for complex surfaces in complicated environments.
|
| |
| 09:00-10:30, Paper WeI1I.213 | Add to My Program |
| DexCtrl: Sim-To-Real Dexterity with Adaptive Controller Learning |
|
| Zhao, Shuqi | University of California, Berkeley |
| Yang, Ke | New York University |
| Chen, Yuxin | University of California, Berkeley |
| Li, Chenran | University of California, Berkeley |
| Xie, Yichen | University of California, Berkeley |
| Zhang, Xiang | University of California, Berkeley |
| Wang, Changhao | University of California, Berkeley |
| Tomizuka, Masayoshi | University of California |
Keywords: Dexterous Manipulation, In-Hand Manipulation, Robust/Adaptive Control
Abstract: Dexterous manipulation has advanced rapidly, with policies now capable of performing complex, contact-rich tasks in simulation. However, transferring these policies from simulation to the real world remains a significant challenge. A key obstacle is the mismatch in low-level controller dynamics, where the same trajectories can produce vastly different contact forces and behaviors when control parameters change. Existing solutions often rely on manual tuning or controller randomization, which can be labor-intensive, task-specific, and introduce substantial training difficulty. In this work, we propose DexCtrl, a novel framework that jointly learns actions and controller parameters by leveraging the historical information of both the trajectory and the controller. This adaptive controller adjustment mechanism enables the policy to automatically tune control parameters during execution, thereby mitigating the severe sim-to-real gap without extensive manual tuning or excessive randomization. Moreover, by explicitly providing controller parameters as part of the observation, our approach facilitates better reasoning over force interactions and improves robustness in real-world scenarios. Experimental results demonstrate that our method achieves improved transfer performance across a variety of dexterous tasks involving variable force conditions.
|
| |
| 09:00-10:30, Paper WeI1I.214 | Add to My Program |
| PIRATR: Parametric Object Inference for Robotic Applications with Transformers in 3D Point Clouds |
|
| Schwingshackl, Michael | AIT Austrian Institute of Technology |
| Oberweger, Fabio Francisco | AIT Austrian Institute of Technology |
| Niedermeyer, Mario | AIT Austrian Institute of Technology |
| Huemer, Johannes | AIT Austrian Institute of Technology |
| Murschitz, Markus | AIT Austrian Institute of Technology |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, Deep Learning in Grasping and Manipulation
Abstract: We present PIRATR, an end-to-end 3D object detection framework for robotic use cases in point clouds. Extending PI3DETR, our method streamlines parametric 3D object detection by jointly estimating multi-class 6-DoF poses and class-specific parametric attributes directly from occlusion-affected point cloud data. This formulation enables not only geometric localization but also the estimation of task-relevant properties for parametric objects, such as a gripper’s opening, where the 3D model is adjusted according to simple, predefined rules. The architecture employs modular, class-specific heads, making it straightforward to extend to novel object types without re-designing the pipeline. We validate PIRATR on an automated forklift platform, focusing on three structurally and functionally diverse categories: crane grippers, loading platforms, and pallets. Trained entirely in a synthetic environment, PIRATR generalizes effectively to real outdoor LiDAR scans, achieving a detection mAP of 0.919 without additional fine-tuning. PIRATR establishes a new paradigm of pose-aware, parameterized perception. This bridges the gap between low-level geometric reasoning and actionable world models, paving the way for scalable, simulation-trained perception systems that can be deployed in dynamic robotic environments.
|
| |
| 09:00-10:30, Paper WeI1I.215 | Add to My Program |
| Bridging the Sim-To-Real Gap with Multipanda_ros2: A Real-Time ROS2 Framework for Multimanual Systems |
|
| Škerlj, Jon | Technical University of Munich |
| Bien, Seongjin | University of Technology Nuremberg |
| Naceri, Abdeldjallil | Technical University of Munich |
| Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Control Architectures and Programming, Performance Evaluation and Benchmarking, Software, Middleware and Programming Environments
Abstract: We present multipanda_ros2, a novel open-source ROS2 architecture for multi-robot control of Franka Robotics robots. Leveraging ros2_control, this framework provides native ROS 2 interfaces for controlling any number of robots from a single process. Our core contributions address key challenges in real-time torque control, including interaction control and robot-environment modeling. A central focus of this work is sustaining a 1 kHz control frequency, a necessity for real-time control and a minimum frequency required by safety standards. Moreover, we introduce a controllet-feature design pattern that enables controller-switching delays of ≤ 2 ms, facilitating reproducible benchmarking and complex multi-robot interaction scenarios. To bridge the simulation-to-reality (sim2real) gap, we integrate a high-fidelity MuJoCo simulation with quantitative metrics for both kinematic accuracy and dynamic consistency (torques, forces, and control errors). Furthermore, we demonstrate that real-world inertial parameter identification can significantly improve force and torque accuracy, providing a methodology for iterative physics refinement. Our work extends approaches from soft robotics to rigid dual-arm, contact-rich tasks, showcasing a promising method to reduce the sim2real gap and providing a robust, reproducible platform for advanced robotics research.
|
| |
| 09:00-10:30, Paper WeI1I.216 | Add to My Program |
| Real-Time and Accurate Collision-Free Teleoperation Via Differentiable Constraint-Based Trajectory Planning |
|
| Grobbel, Max | FZI Forschungszentrum Informatik |
| Schneider, Tristan | FZI - Forschungszentrum Informatik |
| Flögel, Daniel | FZI Research Center for Information Technology, Karlsruhe Institute for Technology (KIT) |
| Hohmann, Sören | Institute of Control Systems, Karlsruhe Institute of Technology |
|
|
| |
| 09:00-10:30, Paper WeI1I.217 | Add to My Program |
| Statistical Contraction for Chance-Constrained Trajectory Optimization of Non-Gaussian Stochastic Systems |
|
| D'Silva, Rihan Aaron | University of Illinois Urbana-Champaign |
| Tsukamoto, Hiroyasu | University of Illinois Urbana-Champaign/NASA Jet Propulsion Laboratory |
|
|
| |
| 09:00-10:30, Paper WeI1I.218 | Add to My Program |
| Unsupervised Domain Adaptation for Robust Imitation Learning under Visual Perturbations |
|
| Kato, Yasuhiro | University of Tokyo |
| Westfechtel, Thomas | The University of Tokyo |
| Chang, Jen-Yen | The University of Tokyo |
| Morihira, Naoki | Honda R&D Co., Ltd |
| Hayashi, Akinobu | Honda R&D Co., Ltd |
| Harada, Tatsuya | The University of Tokyo |
| Osa, Takayuki | RIKEN |
Keywords: Imitation Learning
Abstract: Vision-based robot manipulation systems often suffer from performance degradation under domain shifts in visual inputs. While data augmentation is commonly employed in reinforcement learning, its application in imitation learning remains relatively underexplored. Our preliminary experiments indicate that simply incorporating augmentation techniques does not yield effective improvements in imitation learning. To address this challenge, we propose a two-stage learning process. First, we develop an adversarial feature learning framework that leverages data augmentation to enhance robustness against domain shifts. Second, we introduce an unsupervised domain adaptation method that adapts models to target environments using only easily collected image data. In robotic tasks, visual domain shifts can often be detected from initial observations alone. Since collecting complete action-labeled episodes in new domains is expensive, adapting with only initial images greatly reduces data collection costs. To this end, we develop an adaptation strategy that relies solely on initial target-domain observations, eliminating the need for labeled demonstrations. Experimental results across both simulation and physical robot implementations demonstrate that our method preserves source domain performance while exhibiting enhanced resilience to visual perturbations, including varying lighting conditions, background modifications, and environmental distractors.
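Adversarial feature learning of this kind is commonly implemented with a gradient reversal layer between the feature encoder and a domain classifier; the abstract does not specify the exact mechanism, so the PyTorch sketch below shows the standard construction rather than the authors' code.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; backward multiplies the gradient by -lambda."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lambd * grad_out, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# features = encoder(obs); domain_logits = domain_head(grad_reverse(features))
# Minimizing the domain loss then trains the encoder to confuse the domain head,
# yielding features that are invariant to the visual domain shift.
```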
|
| |
| 09:00-10:30, Paper WeI1I.219 | Add to My Program |
| Environment-Aware Learning of Smooth GNSS Covariance Dynamics for Autonomous Racing |
|
| Chen, Y. Deemo | California Institute of Technology |
| Zimmermann, Arion | California Institute of Technology |
| Berrueta, Thomas | California Institute of Technology |
| Chung, Soon-Jo | California Institute of Technology |
Keywords: Sensor Fusion, Localization, Probabilistic Inference
Abstract: Ensuring accurate and stable state estimation is a challenging task crucial to safety-critical domains such as high-speed autonomous racing, where measurement uncertainty must be both adaptive to the environment and temporally smooth for control. In this work, we develop a learning-based framework, LACE, capable of directly modeling the temporal dynamics of GNSS measurement covariance. We model the covariance evolution as an exponentially stable dynamical system where a deep neural network (DNN) learns to predict the system's process noise from environmental features through an attention mechanism. By using contraction-based stability and systematically imposing spectral constraints, we formally provide guarantees of exponential stability and smoothness for the resulting covariance dynamics. We validate our approach on an AV-24 autonomous racecar, demonstrating improved localization performance and smoother covariance estimates in challenging, GNSS-degraded environments. Our results highlight the promise of dynamically modeling the perceived uncertainty in state estimation problems that are tightly coupled with control sensitivity.
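One simple way to obtain covariance dynamics that are exponentially stable by construction is to propagate the covariance through a linear map whose spectral norm is constrained below one, with learned process noise as the input; the parameterization below is our illustrative assumption, not LACE's exact construction.

```python
import numpy as np

def spectrally_constrained(A, margin=0.05):
    """Rescale A so its spectral norm is at most 1 - margin (a contraction)."""
    s = np.linalg.norm(A, 2)
    return A * (1.0 - margin) / max(s, 1.0 - margin)

def covariance_step(R, A, Q):
    """Exponentially stable covariance recursion R_{k+1} = A R A^T + Q.

    Q would come from a DNN over environmental features; here it is an input.
    """
    return A @ R @ A.T + Q
```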
|
| |
| 09:00-10:30, Paper WeI1I.220 | Add to My Program |
| UMI-On-Air: Embodiment-Aware Guidance for Embodiment-Agnostic Visuomotor Policies |
|
| Gupta, Harsh | University of Illinois Urbana-Champaign |
| Guo, Xiaofeng | Carnegie Mellon University |
| Ha, Huy | Columbia University |
| Pan, Chuer | Stanford University |
| Cao, Muqing | Carnegie Mellon University |
| Lee, Dongjae | Kyung Hee University |
| Scherer, Sebastian | Carnegie Mellon University |
| Song, Shuran | Stanford University |
| Shi, Guanya | Carnegie Mellon University |
Keywords: Learning from Demonstration, Aerial Systems: Applications, Mobile Manipulation
Abstract: We introduce UMI-on-Air, a framework for embodiment-aware deployment of embodiment-agnostic manipulation policies. Our approach leverages diverse, unconstrained human demonstrations collected with a handheld gripper (UMI) to train generalizable visuomotor policies. A central challenge in transferring these policies to constrained robotic embodiments—such as aerial manipulators—is the mismatch in control and robot dynamics, which often leads to out-of-distribution behaviors and poor execution. To address this, we propose Embodiment-Aware Diffusion Policy (EADP), which couples a high-level UMI policy with a low-level embodiment-specific controller at inference time. By integrating gradient feedback from the controller’s tracking cost into the diffusion sampling process, our method steers trajectory generation towards dynamically feasible modes tailored to the deployment embodiment. This enables plug-and-play, embodiment-aware trajectory adaptation at test time. We validate our approach on long-horizon and high-precision aerial manipulation tasks, showing improved success rates, efficiency, and robustness under disturbances compared to unguided diffusion baselines. Finally, we demonstrate deployment in previously unseen environments, using UMI demonstrations collected in the wild, highlighting a practical pathway for scaling generalizable manipulation skills across diverse—and even highly constrained—embodiments. All code, data, checkpoints, and result videos can be found at umi-on-air.github.io.
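The guidance mechanism described, injecting the gradient of the controller's tracking cost into diffusion sampling, can be sketched as one modified reverse step; the denoiser and cost interfaces below are assumptions for illustration.

```python
import torch

def guided_reverse_step(x_t, t, denoiser, tracking_cost, scale=1.0):
    """One reverse-diffusion step biased by controller tracking-cost gradients."""
    x_t = x_t.detach().requires_grad_(True)
    cost = tracking_cost(x_t)              # embodiment-specific feasibility cost
    grad = torch.autograd.grad(cost, x_t)[0]
    with torch.no_grad():
        x_prev = denoiser(x_t, t)          # unguided reverse step
        return x_prev - scale * grad       # steer towards dynamically feasible modes
```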
|
| |
| 09:00-10:30, Paper WeI1I.221 | Add to My Program |
| 3DFacePolicy: Speech-Driven 3D Facial Animation Based on Diffusion Policy |
|
| Sha, Xuanmeng | The University of Osaka |
| Zhang, Liyun | The University of Tokyo |
| Mashita, Tomohiro | Osaka Electro-Communication University |
| Chiba, Naoya | The University of Osaka |
| Uranishi, Yuki | The University of Osaka |
Keywords: Computer Vision for Automation, Motion and Path Planning, Imitation Learning
Abstract: Speech-driven 3D facial animation has achieved significant progress in both research and applications. While recent baselines struggle to generate natural and continuous facial movements due to their frame-by-frame vertex generation approach, we propose 3DFacePolicy, a pioneering work that introduces a novel definition of vertex trajectory changes across consecutive frames through the concept of "action". By predicting action sequences for each vertex that encode frame-to-frame movements, we reformulate the vertex generation approach into an action-based control paradigm. Specifically, we leverage a robotic control mechanism, diffusion policy, to predict action sequences conditioned on both audio and vertex states. Extensive experiments on the VOCASET and BIWI datasets demonstrate that our approach significantly outperforms state-of-the-art methods and particularly excels at dynamic, expressive, and naturally smooth facial animations.
|
| |
| 09:00-10:30, Paper WeI1I.222 | Add to My Program |
| Masquerade: Learning from In-The-Wild Human Videos Using Data-Editing |
|
| Lepert, Marion | Stanford University |
| Fang, Jiaying | Cornell University |
| Bohg, Jeannette | Stanford University |
Keywords: Imitation Learning, Big Data in Robotics and Automation, Representation Learning
Abstract: Robot manipulation research still suffers from significant data scarcity: even the largest robot datasets are orders of magnitude smaller and less diverse than those that fueled recent breakthroughs in language and vision. We introduce Masquerade, a method that edits in-the-wild egocentric human videos to bridge the visual embodiment gap between humans and robots and then learns a robot policy with these edited videos. Our pipeline turns each human video into "robotized" demonstrations by (i) estimating 3D hand poses, (ii) inpainting the human arms, and (iii) overlaying a rendered bimanual robot that tracks the recovered end-effector trajectories. We pre-train a visual encoder to predict future 2-D robot keypoints on 675K frames of these edited clips. We continue that auxiliary loss while fine-tuning a diffusion-policy head on only 50 robot demonstrations per task. This yields policies that generalize significantly better than prior work. On three long-horizon, bimanual kitchen tasks evaluated in three unseen scenes each, Masquerade outperforms baselines by 5-6×. Ablations show that both the robot overlay and co-training are indispensable, and performance scales logarithmically with the amount of edited human video. These results demonstrate that explicitly closing the visual embodiment gap unlocks a vast, readily available source of data from human videos that can be used to improve robot policies.
|
| |
| 09:00-10:30, Paper WeI1I.223 | Add to My Program |
| PerceptTwin: Semantic Scene Reconstruction for Iterative LLM Planning and Verification |
|
| Gauthier, Charlie | Université De Montréal, Mila |
| Morin, Sacha | Université De Montréal, Mila |
| Paull, Liam | Université De Montréal, Mila |
Keywords: Semantic Scene Understanding, Task Planning, AI-Enabled Robotics
Abstract: Simulation environments are useful both for robot policy learning and for plan verification and validation. Traditionally, the process of creating a simulation was onerous: building a bespoke simulation environment for each individual environment that a robot would operate in was simply infeasible. In this work, we introduce PerceptTwin, a fully automatic pipeline that constructs interactive simulations directly from semantic scene representations produced by a robot’s perception stack. PerceptTwin combines open-vocabulary object maps with 3D asset generation, affordance prediction, and commonsense condition checking. These interactive simulations can be used to validate and refine plans before they are executed on the robot hardware. Borrowing from the AI alignment literature, we also introduce an LLM judge that verifies plan correctness and alignment with human preferences. Experiments show that PerceptTwin feedback allows LLM planners to refine plans, enhance safety, and resist harmful black-box prompting attacks. In our suite of tasks, PerceptTwin improves plan success by an average of ~39% for GPT5, GPT5Mini, and GPT5Nano planners. Additionally, PerceptTwin improves human plan verification by up to ~18% on average for plans that fail due to unsatisfied skill preconditions. Our results demonstrate the potential of open-vocabulary scene simulation from robot perception as a foundation for safer, more reliable robot planning.
|
| |
| 09:00-10:30, Paper WeI1I.224 | Add to My Program |
| EEG-Driven Intention Decoding: Offline Deep Learning Benchmarking on a Robotic Rover |
|
| Alosaimi, Ghadah | Imam Mohammad Ibn Saud Islamic University, Durham University |
| Alsayyari, Maha | Durham University, King Saud University |
| Sun, Yixin | Durham University |
| Katsigiannis, Stamos | Durham University |
| Atapour-Abarghouei, Amir | Durham University |
| Breckon, Toby | Durham University |
Keywords: Brain-Machine Interfaces, Machine Learning for Robot Control, Human-Centered Robotics
Abstract: Brain–computer interfaces (BCIs) provide a hands-free control modality for mobile robotics, yet decoding user intent during real-world navigation remains challenging. This work presents a brain–robot control framework for offline decoding of driving commands during robotic rover operation. A 4WD Rover Pro platform was remotely operated by 12 participants who navigated a predefined route using a joystick, executing the following commands: forward, reverse, left, right, and stop. Electroencephalogram (EEG) signals were recorded with a 16-channel OpenBCI cap and aligned with motor actions at ∆ = 0 ms and eight future prediction horizons (∆ > 0 ms). After data preprocessing, eleven deep learning (DL) models were benchmarked for the task of intent classification, across the Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Transformer architectural families. ShallowConvNet achieved the highest performance for both action prediction (F1-score 67% at ∆ = 0 ms) and intent prediction (F1-score 66% at ∆ = 300 ms), maintaining robust performance at future horizons. By combining real-world robotic control with multi-horizon EEG intention decoding, this study introduces a reproducible benchmark and reveals key design insights for predictive, DL-based BCI systems.
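The multi-horizon setup, predicting the command Δ ms after an EEG window, reduces to a windowing and label-alignment step; the window length, stride, and array layout in this sketch are illustrative assumptions, not the study's preprocessing pipeline.

```python
import numpy as np

def make_horizon_dataset(eeg, labels, fs, win_s=1.0, delta_ms=300, stride_s=0.25):
    """eeg: (channels, samples); labels: per-sample command id; fs in Hz."""
    win = int(win_s * fs)
    shift = int(delta_ms * fs / 1000)
    step = int(stride_s * fs)
    X, y = [], []
    for start in range(0, eeg.shape[1] - win - shift, step):
        X.append(eeg[:, start:start + win])
        y.append(labels[start + win + shift])  # command delta_ms after the window
    return np.stack(X), np.array(y)
```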
|
| |
| 09:00-10:30, Paper WeI1I.225 | Add to My Program |
| Uncertainty-Aware Autonomous Vehicles: Predicting the Road Ahead |
|
| Kudukkil Manchingal, Shireen | Oxford Brookes University |
| Amaritei, Armand | Oxford Brookes University |
| Gohad, Mihir | Oxford Brookes University |
| Sultana, Maryam | Oxford Brookes University |
| Kooij, Julian Francisco Pieter | TU Delft |
| Cuzzolin, Fabio | Oxford Brookes University |
| Bradley, Andrew | Oxford Brookes University |
Keywords: Deep Learning for Visual Perception, Planning under Uncertainty, Autonomous Vehicle Navigation
Abstract: Autonomous Vehicle (AV) perception systems have advanced rapidly in recent years, providing vehicles with the ability to accurately interpret their environment. However, perception systems remain susceptible to errors caused by overconfident predictions in the case of rare events or out-of-sample data. This study equips an autonomous vehicle with the ability to "know when it is uncertain", using an uncertainty-aware image classifier as part of the AV software stack. Specifically, the study exploits the ability of Random-Set Neural Networks (RS-NNs) to explicitly quantify prediction uncertainty. Unlike traditional CNNs or Bayesian methods, RS-NNs predict belief functions over sets of classes, allowing the system to identify and clearly signal uncertainty in novel or ambiguous scenarios. The system is tested in a real-world autonomous racing vehicle software stack, with the RS-NN classifying the layout of the road ahead and providing the associated uncertainty of the prediction. Performance of the RS-NN under a range of road conditions is compared against traditional CNN and Bayesian neural networks, with the RS-NN achieving significantly higher accuracy and superior uncertainty calibration. The integration of RS-NNs into a Robot Operating System (ROS)-based vehicle control pipeline demonstrates that predictive uncertainty can dynamically modulate vehicle speed, maintaining high-speed performance under confident predictions while proactively improving safety through speed reductions in uncertain scenarios. These results demonstrate the potential of uncertainty-aware neural networks, in particular RS-NNs, as a practical solution for safer and more robust autonomous driving.
|
| |
| 09:00-10:30, Paper WeI1I.226 | Add to My Program |
| Low Cost, Easily Manufactured, Highly Flexible Strain and Touch Sensitive Fiber for Robotics Applications |
|
| Diaz Herrera, Christian | Wesleyan University |
| Raste, Srushti | Wesleyan University |
| Liu, Simin | Wesleyan University |
| Modeste, Miles | Wesleyan University |
| Yin, Jiyang (Patton) | Brown University |
| McCall, Katelyn | Wesleyan University |
| Yao, Yuxing Jared | Wesleyan University |
| Chahal, Roopkamal | Wesleyan University |
| Chidley, Simon | Wesleyan University |
| Ha, Trung | Wesleyan University |
| Westmoreland, T. David | Wesleyan University |
| Roberts, Sonia | Wesleyan University |
Keywords: Soft Sensors and Actuators, Force and Tactile Sensing, Physical Human-Robot Interaction
Abstract: Existing stretch and touch sensors for robots are generally expensive with respect to at least one of material costs, required manufacturing equipment, or manufacturing time. We present and experimentally characterize a conductive fiber made using only inexpensive commercial off-the-shelf parts (conductive thread at $0.07/ft, silicone tubing at $0.94/ft) and tools (a loop-style needle threader at $2), which can be manufactured quickly (a 20 cm length in 2 minutes). We demonstrate its use as a resistive strain sensor with three applications: triggering a grasp in a pneumatically actuated assistive finger, sensing the pose of a pneumatically actuated robotic strap, and estimating the pose of a flexible solid. We also demonstrate that it can be used as a capacitive sensor with two applications: first, as a touch sensor that triggers a commercial robot arm to move, and second, as a near-field sensor enabling the robot arm to follow a moving hand. The capacitive sensors are knitted, showcasing the high flexibility of the fiber. We discuss methods for improving manufacturing scalability and their cost trade-offs. Finally, we demonstrate a method for repairing a cut fiber.
|
| |
| 09:00-10:30, Paper WeI1I.227 | Add to My Program |
| Workspace Optimization of a Flexure-Based 3RRR Parallel Robot with Joint Angle Constraints |
|
| Karuppiah, Annamalai | York University |
| Orszulik, Ryan | York University |
Keywords: Parallel Robots, Compliant Joints and Mechanisms, Kinematics
Abstract: The optimization of flexure-based planar parallel robots with traditional methods demands significant computational resources due to the coupling of kinematic and structural effects. This work presents a decoupled optimization approach in which the geometry of the manipulator is optimized for workspace using a kinematic model. A modified Moore boundary-following algorithm is introduced, which provides an efficient calculation of the workspace. With this approach, the optimal kinematic design parameters, namely the leg ratio and the initial orientation of the end effector, are determined and presented for a number of cases defined by combinations of joint angle constraints.
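The classical Moore boundary-following algorithm that this work modifies walks the 8-neighborhood of the current boundary cell clockwise, starting just past the cell it backtracked from. A compact version over a binary reachability grid follows; it illustrates the base algorithm with a simple stopping criterion, not the authors' modification.

```python
import numpy as np

# 8-neighborhood offsets (row, col) in clockwise order starting from west
NBRS = [(0, -1), (-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1)]

def moore_boundary(grid):
    """Trace the outer boundary of the first region found by raster scan."""
    rows, cols = grid.shape
    start = back = None
    for r in range(rows):
        for c in range(cols):
            if grid[r, c]:
                start, back = (r, c), (r, c - 1)   # entered from the west
                break
        if start:
            break
    if start is None:
        return []
    boundary, cur = [start], start
    while True:
        i = NBRS.index((back[0] - cur[0], back[1] - cur[1]))
        for k in range(1, 9):                      # scan clockwise from backtrack
            dr, dc = NBRS[(i + k) % 8]
            r, c = cur[0] + dr, cur[1] + dc
            if 0 <= r < rows and 0 <= c < cols and grid[r, c]:
                back = (cur[0] + NBRS[(i + k - 1) % 8][0],
                        cur[1] + NBRS[(i + k - 1) % 8][1])
                cur = (r, c)
                break
        else:
            return boundary                        # isolated cell
        if cur == start:                           # simple stopping criterion
            return boundary
        boundary.append(cur)
```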
|
| |
| 09:00-10:30, Paper WeI1I.228 | Add to My Program |
| Sparse Meets Dense: Correspondence Guided Robotic Manipulation with Rigid-Deformable Interactions |
|
| Zhu, Ziyu | Peking University |
| Chen, Yue | Peking University |
| Liang, Xirui | Peking University |
| Bae, Hojin | Tsinghua University |
| Wang, Yuran | Peking University |
| Yuan, Zhen | Peking University |
| Wu, Ruihai | Peking University |
| Dong, Hao | Peking University |
Keywords: Representation Learning, Visual Learning, Perception for Grasping and Manipulation
Abstract: Manipulation involving rigid-deformable interactions, such as hanging clothes or dressing humans, is essential for household robots. Compared to single-object manipulation or interactions between rigid bodies, these tasks are particularly challenging due to the rich multi-point contacts and the complex dynamics of the deformable bodies during interaction. Therefore, object-centric representations such as 6D poses or structural points without task-specific information become insufficient for these interactions. In this work, we propose a hybrid correspondence-based representation tailored for deformable-rigid interactions. First, we introduce structure-, task-, and interaction-aware sparse keypoints. The keypoints are generated based on the global structures of both rigid and deformable objects, and filtered by their local interaction contacts. However, tracking these sparse keypoints through the interaction remains difficult due to the high-dimensional dynamics of deformable objects. Therefore, we further construct dense correspondences on the deformable objects for accurate keypoint tracking throughout manipulation. This hybrid design combines the advantages of both representations: sparse keypoints encode rich, task-specific information for fine-grained manipulation, while dense correspondences ensure efficient tracking and generalization to novel deformations, shapes, and scenarios. Extensive experiments demonstrate the effectiveness and broad applicability of our method.
|
| |
| 09:00-10:30, Paper WeI1I.229 | Add to My Program |
| Real-Time Puncture Detection and Recovery for Pneumatic Soft Actuators |
|
| Deshpande, Tejonidhi R | Georgia Institute of Technology |
| Cheng, Tingyu | University of Notre Dame |
| Hester, Josiah | Georgia Institute of Technology |
Keywords: Failure Detection and Recovery, Hydraulic/Pneumatic Actuators, Redundant Robots
Abstract: Soft robots offer safe and adaptive interaction with humans and unstructured environments through their inherent ability to deform and comply. Pneumatic actuators are one way to build soft robots. They are typically made from soft silicone materials and are especially effective for driving such systems, enabling smooth and adaptable motion. However, their compliant nature also makes them vulnerable to mechanical failures like punctures and tears, limiting practical deployment. To address this, we propose a puncture detection system for soft actuators using motion data from a single inertial measurement unit. Extracted features are used to train anomaly detectors for puncture detection and non-linear models to estimate severity. We also introduce a multi-chamber pneumatic soft bending actuator capable of diverse configurations via selective chamber inflation. Our algorithm identifies the punctured chamber and provides a severity score using a chamber perturbation scheme. Anomaly detectors are trained on normal operation data and detect damage through reconstruction errors, while severity is estimated by a separate model trained under slightly modified conditions. Finally, we demonstrate a failure recovery strategy to maintain actuation force post-failure. This approach enhances the reliability and safety of soft robotic systems through real-time, data-driven damage detection.
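The detection recipe described, training on normal operation and flagging damage via reconstruction error, reduces to calibrating a threshold; the mean-plus-k-sigma rule below is our illustrative assumption rather than the paper's exact calibration.

```python
import numpy as np

def fit_error_threshold(normal_errors, k=3.0):
    """Calibrate on reconstruction errors from healthy-actuator IMU windows."""
    return normal_errors.mean() + k * normal_errors.std()

def detect_puncture(reconstructor, window, threshold):
    """Flag a puncture when a motion window reconstructs poorly."""
    recon = np.asarray(reconstructor(window))       # any trained reconstructor
    error = float(np.mean((recon - np.asarray(window)) ** 2))
    return error > threshold, error                 # error also serves as a severity cue
```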
|
| |
| 09:00-10:30, Paper WeI1I.230 | Add to My Program |
| Shared Haptic Control for Surgical Skill Transfer on a Dual-Console Da Vinci Research Kit |
|
| Le, Xiangyi | Johns Hopkins University |
| Jiang, Nan | Johns Hopkins University |
| Shao, Pucheng | Johns Hopkins University |
| Burkhart, Brendan | Johns Hopkins University |
| Kazanzides, Peter | Johns Hopkins University |
| Tumerdem, Ugur | Marmara University |
Keywords: Surgical Robotics: Laparoscopy, Telerobotics and Teleoperation, Medical Robots and Systems
Abstract: Robotic surgery has revolutionized minimally invasive procedures by offering enhanced precision, dexterity, and patient outcomes. However, the training and operational paradigms in robotic surgery have not evolved in parallel. Current apprenticeship models fall short in this domain, as robotic surgery isolates the primary surgeon in a teleoperated control loop, limiting opportunities for hands-on learning by trainees. To address this, we present the first implementation of a multilateral controller on a da Vinci Research Kit (dVRK), enabled by a four-channel teleoperation architecture and learning-based force estimation on a dual-console setup. This framework allows an expert and a novice to share motion and force authority over the patient-side robots through an adjustable dominance factor. We validated the system in three experiments. In transparency tests, the architecture achieved sub-millimeter position tracking errors (PTE <= 0.2 mm) and force tracking errors (FTE <= 1 N). In a palpation pilot user study (N=10) with tumor-tissue phantoms, participants identified stiffer regions, without visual feedback, with 83% accuracy in single-user mode (alpha = 1) and 74% accuracy in dual-user shared mode (alpha = 0.5). In a suturing force control pilot user study (N=10), novices significantly reduced force error and increased time within the safe range after expert-guided training, with no suture breakage observed post-training. These results on a dual-console dVRK setup demonstrate the feasibility of expert-in-the-loop training with real-time haptic guidance, positioning multilateral teleoperation as a promising approach for surgical skill transfer.
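The adjustable dominance factor follows the standard multilateral-teleoperation formulation: with alpha in [0, 1], the patient-side reference is a convex combination of expert and novice commands (alpha = 1 gives the expert full authority; alpha = 0.5 is the shared mode above). A minimal sketch, with the function name being our own:

```python
import numpy as np

def shared_reference(expert_cmd, novice_cmd, alpha):
    """Blend expert and novice commands with dominance factor alpha in [0, 1]."""
    assert 0.0 <= alpha <= 1.0
    expert_cmd, novice_cmd = np.asarray(expert_cmd), np.asarray(novice_cmd)
    return alpha * expert_cmd + (1.0 - alpha) * novice_cmd

# alpha = 1.0 -> single-user (expert) mode; alpha = 0.5 -> dual-user shared mode
```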
|
| |
| 09:00-10:30, Paper WeI1I.231 | Add to My Program |
| Overcoming Imperfect Kinematics in Surgical Robotics through Sim-To-Real Visuomotor Learning |
|
| Yan, Zhaoxuan | Imperial College London |
| Deng, Kaizhong | Imperial College London |
| Hu, Zhaoyang Jacopo | Imperial College London |
| Mylonas, George | Imperial College London |
| Elson, Daniel | Imperial College London |
Keywords: Medical Robots and Systems, Learning from Demonstration, Machine Learning for Robot Control
Abstract: Robot-Assisted Surgery is integral to modern minimally invasive procedures, with automation emerging as the next frontier to enhance precision and reduce surgeon fatigue. This evolution is largely impeded by the inherent kinematic inaccuracies of surgical robots, where unreliable internal sensors lead to significant control errors. While previous methods attempted to mitigate these issues through complex model-based calibration, they often suffer from high cost and limited effectiveness. This work utilises a learned policy, trained within a teacher-student learning framework, to actively compensate for hardware inaccuracies using closed-loop visual feedback. The policy can fuse unreliable internal readings with precise external visual data, allowing it to correct kinematic errors in real time without needing a perfect physical model. The learned policy was successfully deployed on the da Vinci Research Kit, where experiments validated the fundamental feasibility of using external vision to overcome internal sensor deficits. This research provides a foundational and reliable control methodology, paving the way for more advanced and robust surgical automation.
|
| |
| 09:00-10:30, Paper WeI1I.232 | Add to My Program |
| SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design |
|
| Li, Ruogu | Univ. of North Carolina at Chapel Hill |
| Li, Sikai | Univ. of North Carolina at Chapel Hill |
| Mu, Yao | Shanghai Jiao Tong University |
| Ding, Mingyu | University of North Carolina at Chapel Hill |
Keywords: Data Sets for Robot Learning, Representation Learning, Data Sets for Robotic Vision
Abstract: We introduce SldprtNet, a large-scale dataset comprising over 242,000 industrial parts, designed for semantic-driven CAD modeling, geometric deep learning, and the training/fine-tuning of multimodal models for 3D design. The dataset provides 3D models in both .step and .sldprt formats to support diverse training and testing. To enable parametric modeling and facilitate dataset scalability, we developed two supporting tools, an encoder and a decoder, which support 13 types of CAD commands and enable lossless transformation between 3D models and a structured text representation. Additionally, each sample is paired with a composite image created by merging seven rendered views from different viewpoints of the 3D model, effectively reducing input token length and accelerating inference. By combining this image with the parameterized text output from the encoder, we employ the lightweight multimodal language model Qwen2.5-VL-7B to generate a natural language description of each part’s appearance and functionality. To ensure accuracy, we manually verified and aligned the generated descriptions, rendered images, and 3D models. These descriptions, along with the parameterized modeling scripts, rendered images, and 3D model files, are fully aligned to construct SldprtNet. To assess its effectiveness, we fine-tuned baseline models on a dataset subset, comparing image-plus-text inputs with text-only inputs. Results confirm the necessity and value of multimodal datasets for CAD generation. The dataset features carefully selected real-world industrial parts, supporting tools for scalable dataset expansion, diverse modalities, and broad diversity in model complexity and geometric features, making it a comprehensive multimodal dataset built for semantic-driven CAD modeling and cross-modal learning.
|
| |
| 09:00-10:30, Paper WeI1I.233 | Add to My Program |
| Struct-Loc: Confidence-Aware Structural Localization Via Hierarchical Point Cloud Registration |
|
| Józsa, Csaba Máté | Nokia Bell Labs |
| Bóta, Attila | Nokia Bell Labs |
| Varga, Krisztián Zsolt | Nokia Bell Labs |
| Kovács, Ferenc | Nokia Bell Labs |
Keywords: Localization, AI-Based Methods, Industrial Robots
Abstract: Localization systems often rely heavily on visual information, which can degrade under challenging conditions such as variable lighting, dynamic objects, or repetitive textures. To enhance robustness beyond single-image methods, we model localization as a structural point cloud registration problem, leveraging motion continuity and geometric consistency over time. This formulation reduces sensitivity to transient occlusions and appearance changes, enabling the system to resolve ambiguities that single-image techniques often cannot. In this work, we introduce Struct-Loc, a localization framework that advances structural point cloud registration through confidence-aware hierarchical localization. By estimating the reliability of structural regions and incorporating it into the matching process, Struct-Loc generates robust descriptors tailored for pose estimation. To achieve near real-time performance, Struct-Loc combines efficient point convolutional encoders, a caching mechanism, and a hierarchical coarse-to-fine matching strategy that progressively narrows the search space. It consistently outperforms strong baselines in both accuracy and runtime, while achieving a 100× compression of the global map compared to COLMAP, significantly improving storage efficiency. We validate Struct-Loc on the LaMAR benchmark, demonstrating its effectiveness and robustness under real-world conditions.
|
| |
| 09:00-10:30, Paper WeI1I.234 | Add to My Program |
| Terra: Hierarchical Terrain-Aware 3D Scene Graph for Task-Agnostic Outdoor Mapping |
|
| Samuelson, Chad | Brigham Young University |
| Austin, Abigail | Brigham Young University |
| Knoop, Seth | Brigham Young University |
| Romrell, Blake | Brigham Young University |
| Slade, Gabriel | Brigham Young University |
| McLain, T.W. | Brigham Young University |
| Mangelson, Joshua | Brigham Young University |
Keywords: Semantic Scene Understanding, Object Detection, Segmentation and Categorization, Mapping
Abstract: Outdoor intelligent autonomous robotic operation relies on a sufficiently expressive map of the environment. Classical geometric mapping methods retain essential structural environment information, but lack the semantic understanding and organization needed for high-level robotic reasoning. 3D scene graphs (3DSGs) address this limitation by integrating geometric, topological, and semantic relationships into a multi-level graph-based map. Outdoor autonomous operations commonly rely on terrain information, either because the task depends on it or because of the traversability limits of the robotic platform. We propose a novel approach that combines indoor 3DSG techniques with standard outdoor geometric mapping and terrain-aware reasoning, producing terrain-aware place nodes and hierarchically organized regions for outdoor environments. Our method generates a task-agnostic metric-semantic sparse map and constructs a 3D Scene Graph from this map for downstream planning tasks, all while remaining lightweight for autonomous robotic operation. Our thorough evaluation demonstrates that our 3DSG method performs on par with state-of-the-art camera-based 3DSG methods while remaining memory efficient. We also demonstrate its effectiveness in diverse robotic tasks of object retrieval and region monitoring in both simulation and real-world environments.
|
| |
| 09:00-10:30, Paper WeI1I.235 | Add to My Program |
| Adaptive Motion Priors with Constrained Optimization |
|
| Sangthaworn, Tuchapong | King Mongkut’s University of Technology Thonburi |
| Sakulkueakulsuk, Bawornsak | King Mongkut's University of Technology Thonburi |
Keywords: Imitation Learning, Humanoid and Bipedal Locomotion, Reinforcement Learning
Abstract: Choosing a locomotion learning paradigm for a high-DOF system such as a humanoid robot poses several challenges. Free exploration creates complex reward surfaces that resist efficient search, while human motion priors cannot be directly copied due to different mechanical constraints. We present Adaptive Motion Priors with Constrained Optimization (AMPCO), a novel framework that transitions from human reference motions to task-focused optimization within learned behavioral bounds. AMPCO employs a two-phase optimization strategy: (1) Adaptive Imitation Guidance that prioritizes human motion, and (2) Adaptive Reward Weighting for Constrained Optimization that optimizes task objectives while maintaining motion quality within statistically guaranteed bounds from Phase I. The transition between phases is automatically detected through percentile-based breakout detection from discriminator convergence. AMPCO introduces adaptive weighting mechanisms that smoothly adjust the importance of human imitation based on learning progress. Our experiments on the Unitree G1 humanoid robot simulation demonstrate that AMPCO reduces energy consumption variance by 67-90% relative to all baseline methods and achieves 70% lower energy consumption than the task-focused baseline, while maintaining velocity tracking accuracy comparable to the best-performing methods, with minimal computational overhead (<0.012% per training cycle).
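A minimal sketch of how the percentile-based breakout detection and the adaptive imitation weighting might look; the window size, percentile, and cosine schedule are assumptions, not values from the paper.

```python
import numpy as np

def breakout_detected(disc_rewards, window=200, percentile=90):
    """Phase-transition test: declare Phase I (imitation) converged once
    the latest discriminator reward reaches the top percentiles of the
    recent window, i.e., imitation quality has plateaued near its best."""
    if len(disc_rewards) < window:
        return False
    recent = np.asarray(disc_rewards[-window:])
    return recent[-1] >= np.percentile(recent, percentile)

def imitation_weight(progress, w_start=1.0, w_end=0.1):
    """Smoothly anneal the imitation term as task learning progresses."""
    progress = np.clip(progress, 0.0, 1.0)
    return w_end + (w_start - w_end) * 0.5 * (1.0 + np.cos(np.pi * progress))
```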
|
| |
| 09:00-10:30, Paper WeI1I.236 | Add to My Program |
| Look Forward to Walk Backward: Efficient Terrain Memory for Backward Locomotion with Forward Vision |
|
| Luo, Shixin | Zhejiang University |
| Li, Songbo | Zhejiang University |
| Hao, Yuan | Zhejiang University |
| Wang, Yaqi | Nankai University |
| Zheng, Jun | Hangzhou Public Library |
| Wu, Jun | Zhejiang University |
| Zhu, Qiuguo | Zhejiang University |
Keywords: Legged Robots, Reinforcement Learning, Deep Learning for Visual Perception
Abstract: Legged robots with egocentric forward-facing depth cameras can couple exteroception and proprioception to achieve robust forward agility on complex terrain. When these robots walk backward, the forward-only field of view provides no preview. Purely proprioceptive controllers can remain stable on moderate ground when moving backward, but cannot fully exploit the robot's capabilities on complex terrain and may collide with obstacles. We present Look Forward to Walk Backward (LF2WB), an efficient terrain-memory locomotion framework that uses forward egocentric depth and proprioception to write a compact associative memory during forward motion and to retrieve it for collision-free backward locomotion without rearward vision. The memory backbone employs a delta-rule selective update that softly removes then writes the memory state along the active subspace. Training uses hardware-efficient parallel computation, and deployment runs recurrent, constant-time per-step inference with a constant-size state, making the approach suitable for onboard processors on low-cost robots. Experiments in both simulations and real-world scenarios demonstrate the effectiveness of our method, improving backward agility across complex terrains under limited sensing.
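The delta-rule selective update mentioned above has a standard closed form for a linear associative memory: the write first softly erases old content along the active key direction, then writes the new value, leaving the orthogonal subspace untouched. A minimal sketch; dimensions and the write strength beta are illustrative.

```python
import numpy as np

def delta_rule_write(M, k, v, beta=0.5):
    """Soft erase-then-write along key k:
        M <- M - beta * (M @ k) k^T + beta * v k^T
          =  M + beta * outer(v - M @ k, k)
    M: (d_v, d_k) memory; k: (d_k,) key; v: (d_v,) value."""
    k = k / (np.linalg.norm(k) + 1e-8)
    return M + beta * np.outer(v - M @ k, k)

def read(M, k):
    """Associative retrieval: query the memory with a key."""
    k = k / (np.linalg.norm(k) + 1e-8)
    return M @ k
```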
|
| |
| 09:00-10:30, Paper WeI1I.237 | Add to My Program |
| SF-ODNav: Successor Feature Framework for Map-Less Target-Driven Outdoor Visual Navigation |
|
| Wu, Junzhe | University of Illinois Urbana-Champaign |
| Zhang, Jiaming | University of Illinois Urbana-Champaign |
| Zhang, Tingrong | University of Illinois at Urbana-Champaign |
| Tao, Ruining | University of Illinois Urbana Champaign |
| Tran, Huy | University of Illinois at Urbana Champaign |
| Chowdhary, Girish | University of Illinois at Urbana Champaign |
Keywords: Vision-Based Navigation, Deep Learning for Visual Perception, Reinforcement Learning
Abstract: Traditional deep reinforcement learning-based visual navigation techniques face challenges in dynamic and unstructured outdoor environments, particularly in the absence of high-resolution maps and GPS signals. This paper presents a deep reinforcement learning-based approach for target-driven visual navigation without explicit localization and mapping in outdoor settings, using the successor feature (SF) framework to enhance the model's transfer learning. This design enables effective knowledge transfer across tasks, allowing the model to adapt to novel environments with zero-shot or few-shot fine-tuning. To facilitate training and evaluation, we design grid-world environments constructed from real-world outdoor images, providing realistic yet controlled conditions for developing and testing deep reinforcement learning-based navigation. Experimental results demonstrate that our method can adapt effectively in outdoor environments, both within the same domain and across different domains. Moreover, despite being trained in a discrete grid-world setting, the model is successfully deployed in real time within the same area, maintaining robust performance and highlighting its strong transferability to continuous, real-world conditions.
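The successor feature (SF) framework named above decomposes action values into task-independent expected feature occupancies and task-specific reward weights, Q(s, a) = psi(s, a) · w, so adapting to a new task only requires a new weight vector. A minimal sketch with illustrative shapes:

```python
import numpy as np

def q_values(psi, w):
    """Successor-feature Q-values: Q(s, a) = psi(s, a) . w

    psi: (n_actions, d) discounted feature occupancies (learned once);
    w:   (d,) task reward weights (all that changes across tasks)."""
    return psi @ w

# Transfer example: one psi, two tasks that reward different features
psi = np.random.rand(4, 8)                 # 4 actions, 8-dim features
w_task1 = np.zeros(8); w_task1[2] = 1.0
w_task2 = np.zeros(8); w_task2[5] = 1.0
print(np.argmax(q_values(psi, w_task1)), np.argmax(q_values(psi, w_task2)))
```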
|
| |
| 09:00-10:30, Paper WeI1I.238 | Add to My Program |
| Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation |
|
| Wei, Xiangyi | East China Normal University |
| Zhang, Haotian | East China Normal University |
| Cao, Xinyi | East China Normal University |
| Xie, Siyu | East China Normal University |
| Ge, Weifeng | Fudan University |
| Li, Yang | East China Normal University |
| Wang, Changbo | East China Normal University |
Keywords: Deep Learning in Grasping and Manipulation, Contact Modeling, Imitation Learning
Abstract: Vision-Language-Action (VLA) models have achieved significant advances in robotic manipulation recently. However, vision-only VLA models face fundamental limitations, particularly in perceiving contact interactions and dynamic manipulation processes. This paper proposes Audio-VLA, a multimodal manipulation policy that leverages contact audio to perceive contact events and dynamic process feedback. Audio-VLA overcomes the vision-only constraints of VLA models. Additionally, this paper introduces the Task Completion Rate (TCR) metric to systematically evaluate dynamic operational processes. Audio-VLA employs pre-trained DINOv2 and SigLIP as visual encoders, AudioCLIP as the audio encoder, and Llama2 as the large language model backbone. We apply LoRA fine-tuning to these pre-trained modules to achieve robust cross-modal understanding of both visual and acoustic inputs. A multimodal projection layer aligns features from different modalities into the same feature space. Moreover, the RLBench and LIBERO simulation environments are enhanced by adding collision-based audio generation to provide realistic sound feedback during object interactions. Since current robotic manipulation evaluations focus on final outcomes rather than providing systematic assessment of dynamic operational processes, the proposed TCR metric measures how well robots perceive dynamic processes during manipulation, creating a more comprehensive evaluation metric. Extensive experiments on LIBERO, RLBench, and two real-world tasks demonstrate Audio-VLA’s superior performance over vision-only comparative methods, while the TCR metric effectively quantifies dynamic process perception capabilities. The source code and pre-trained models are publicly available at https://wxone.github.io/AudioVLA.
|
| |
| 09:00-10:30, Paper WeI1I.239 | Add to My Program |
| Tying Knots in the Air: Reducing the Number of Quadrotors for Aerial Knot Formation |
|
| Wu, Tongshu | Lehigh University |
| Lopez Tarazona, Edward Caleb | Lehigh University |
| S. D'Antonio, Diego | Oakland University |
| Saldaña, David | Lehigh University |
Keywords: Aerial Systems: Applications, Multi-Robot Systems, Aerial Systems: Mechanics and Control
Abstract: Knots provide compact, lightweight, and mechanically stable configurations that are invaluable for aerial transportation and construction. However, autonomous knot formation in midair remains an open challenge due to the dexterity and complexity of manipulating flexible cables. In this paper, we present a method for midair knot formation that employs two types of aerial robots: lifting robots, which hold the cable endpoints, and support robots, which stabilize intermediate spans to enable interlacing. Our approach focuses on minimizing the number of support robots required while ensuring that the knot’s topology is preserved. Our method proceeds in three stages: (i) encode the knot projection as a grid of directional segments and crossings, (ii) apply our Loop Consistency Filter (LCF) to identify the minimal set of support robots required to preserve topology, and (iii) reconstruct continuous Cartesian trajectories using a cable model governed by a spring–damper force and a straightening force. Our results show at least a fifty percent reduction in the number of support robots required to form a knot, compared to the baseline grid-based method. We demonstrate that our method is effective on actual robots, enabling the formation of knots with multiple quadrotors.
|
| |
| 09:00-10:30, Paper WeI1I.240 | Add to My Program |
| SurgCUT3R: Surgical Scene-Aware Continuous Understanding of Temporal 3D Representation |
|
| Xu, Kaiyuan | Imperial College London |
| Hong, Fangzhou | Nanyang Technological University |
| Elson, Daniel | Imperial College London |
| Huang, Baoru | Imperial College London |
Keywords: Computer Vision for Medical Robotics, Medical Robots and Systems, Computer Vision for Automation
Abstract: The reconstruction of surgical scenes from monocular endoscopic video is crucial for advancing robotic-assisted surgery, but applying state-of-the-art general-purpose reconstruction models is hindered by a severe lack of supervised training data and performance degradation over long sequences. To address these challenges, we propose SurgCUT3R, a systematic framework for adapting unified 4D reconstruction models to the surgical domain. Our approach makes three primary contributions. First, we introduce a data generation pipeline that leverages public stereo surgical datasets to create large-scale, metric-scale pseudo-ground-truth depth maps, effectively bridging the data gap. Second, we employ a hybrid supervision strategy that combines our pseudo-GT with geometric self-correction to enhance robustness against inherent data imperfections. Third, we design a hierarchical inference framework that utilizes two specialized models—one for global stability and one for local accuracy—to significantly reduce accumulated pose drift in long videos. Experiments on the public SCARED and StereoMIS datasets demonstrate that our method achieves a highly competitive balance between accuracy and efficiency. It delivers near state-of-the-art pose estimation, offering a practical and effective solution for robust reconstruction in surgical environments.
|
| |
| 09:00-10:30, Paper WeI1I.241 | Add to My Program |
| Bi3: A Biplatform, Bicultural, Biperson Dataset for Social Robot Navigation |
|
| Stratton, Andrew | University of Michigan |
| Singamaneni, Phani Teja | LAAS-CNRS |
| Goyal, Pranav | University of Michigan - Ann Arbor |
| Alami, Rachid | CNRS |
| Mavrogiannis, Christoforos | University of Michigan |
Keywords: Human-Aware Motion Planning, Datasets for Human Motion, Performance Evaluation and Benchmarking
Abstract: We contribute Bi3, a dataset of social robot navigation among groups of people in a constrained lab space. Compared to prior data collection efforts for social robot navigation, our dataset is unique in that it features: an original experiment design giving rise to close navigation encounters between two humans and a robot; five different navigation algorithms; two different robot platforms; a diverse participant pool of 74 people recruited from two sites in the USA and France; multimodal data streams including 10.5 hours of human and robot ground-truth motion tracks, RGB video, and user impressions of robot performance. Our analysis of the collected dataset through metrics like interaction density and human velocity suggests that Bi3 represents a benchmark of unique diversity and modeling complexity. Bi3 contributes towards understanding how humans and robots can productively mesh their activities in constrained environments, and can be a resource for training models of human motion prediction and robot control policies for navigation in densely crowded spaces.
|
| |
| 09:00-10:30, Paper WeI1I.242 | Add to My Program |
| Viewpoint-Agnostic Manipulation Policies with Strategic Vantage Selection |
|
| Vasudevan Nampoothiri, Sreevishakh | Arizona State University |
| Sagar, Som | Arizona State University |
| Senanayake, Ransalu | Arizona State University |
Keywords: Incremental Learning, Continual Learning, AI-Enabled Robotics
Abstract: Since vision-based manipulation policies are typically trained from data gathered from a single viewpoint, their performance drops when the view changes during deployment. Naively aggregating demonstrations from numerous random views is not only costly but also known to destabilize learning, as excessive visual diversity acts as noise. We present Vantage, a viewpoint selection framework to fine-tune any pre-trained policy on a small, strategically chosen set of camera poses to induce viewpoint-agnostic behavior. Instead of relying on costly brute-force search over viewpoints, Vantage formulates camera placement as an information gain optimization problem in a continuous space. This approach balances exploration of novel poses with exploitation of promising ones, while also providing theoretical guarantees about convergence and robustness. Across manipulation tasks and policy families, Vantage consistently improves success under viewpoint shifts compared to fixed, grid, or random data selection strategies with only a handful of fine-tuning steps. Experiments conducted on simulated and real-world setups show that Vantage increases the task success rate by ≈25% for diffusion policies, and yields robust gains in dynamic-camera settings.
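The information-gain optimization over camera poses can be pictured with an upper-confidence acquisition rule; the sketch below is a kernel-smoothed stand-in, not the paper's exact acquisition function, and every name and hyperparameter is an assumption.

```python
import numpy as np

def ucb_select(candidates, tried, scores, beta=2.0, length=0.5):
    """Pick the next camera pose by an upper-confidence rule.

    Mean and uncertainty at each candidate are estimated by Gaussian-kernel
    smoothing over already-evaluated poses: unexplored regions keep high
    uncertainty (exploration), good regions keep a high mean (exploitation)."""
    if len(tried) == 0:
        return candidates[np.random.randint(len(candidates))]
    tried, scores = np.asarray(tried), np.asarray(scores)
    d2 = ((candidates[:, None, :] - tried[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * length ** 2))       # kernel weights
    mu = (w @ scores) / (w.sum(1) + 1e-8)       # smoothed mean score
    sigma = 1.0 / (1.0 + w.sum(1))              # shrinks with coverage
    return candidates[np.argmax(mu + beta * sigma)]

cands = np.random.uniform(-1.0, 1.0, size=(100, 3))  # candidate poses
first = ucb_select(cands, tried=[], scores=[])       # random when untried
```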
|
| |
| 09:00-10:30, Paper WeI1I.243 | Add to My Program |
| High-Altitude Balloon Station-Keeping with First Order Model Predictive Control |
|
| Pasetsky, Myles | Cornell University |
| Lin, Jiawei | University of California San Diego |
| Guo, Bradley | Cornell University |
| Dean, Sarah | Cornell University |
Keywords: Aerial Systems: Mechanics and Control, Planning under Uncertainty
Abstract: High-altitude balloons (HABs) are common in scientific research due to their wide range of applications and low cost. Because of their nonlinear, underactuated dynamics and the partial observability of wind fields, prior work has largely relied on model-free reinforcement learning (RL) methods to design near-optimal control schemes for station-keeping. These methods often compare only against hand-crafted heuristics, dismissing model-based approaches as impractical given the system complexity and uncertain wind forecasts. We revisit this assumption about the efficacy of model-based control for station-keeping by developing First-Order Model Predictive Control (FOMPC). By implementing the wind and balloon dynamics as differentiable functions in JAX, we enable gradient-based trajectory optimization for online planning. FOMPC outperforms a state-of-the-art RL policy, achieving a 24% improvement in time-within-radius (TWR) without requiring offline training, though at the cost of greater online computation per control step. Through systematic ablations of modeling assumptions and control factors, we show that online planning is effective across many configurations, including under simplified wind and dynamics models.
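Because the abstract states that the wind and balloon dynamics are implemented as differentiable JAX functions, the core of FOMPC can be pictured as gradient descent on a rollout cost. A toy sketch with a single-integrator stand-in for the balloon model; horizon, step size, and learning rate are arbitrary choices.

```python
import jax
import jax.numpy as jnp

def make_cost(dynamics, target):
    """Build a differentiable total cost over a control sequence."""
    def cost(controls, x0):
        def step(x, u):
            x_next = dynamics(x, u)
            return x_next, jnp.sum((x_next - target) ** 2)
        _, stage_costs = jax.lax.scan(step, x0, controls)
        return jnp.sum(stage_costs)
    return cost

dynamics = lambda x, u: x + 0.1 * u        # toy stand-in for balloon/wind
cost = make_cost(dynamics, target=jnp.ones(2))
grad_fn = jax.jit(jax.grad(cost))          # gradient w.r.t. the controls

controls, x0 = jnp.zeros((20, 2)), jnp.zeros(2)
for _ in range(50):                        # first-order planning loop
    controls = controls - 0.05 * grad_fn(controls, x0)
```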
|
| |
| 09:00-10:30, Paper WeI1I.244 | Add to My Program |
| Controllable Motion Generation Via Diffusion Modal Coupling |
|
| Wang, Luobin | University of California San Diego |
| Yu, Hongzhan | University of California San Diego |
| Yu, Chenning | University of California San Diego |
| Gao, Sicun | University of California San Diego |
| Christensen, Henrik Iskov | University of California San Diego |
Keywords: Motion and Path Planning, AI-Based Methods
Abstract: Diffusion models are increasingly used in robotics to represent multi-modal distributions over system states and behaviors, but precise control of generated outcomes without degrading physical realism remains challenging. This paper introduces a controllable diffusion framework that (i) replaces the standard unimodal Gaussian prior with an explicit multi-modal prior, and (ii) enforces modal coupling between prior components and principal data modes through novel forward and reverse diffusion processes. Sampling is initialized directly from a selected prior mode aligned with task constraints, avoiding train–test mismatch and manifold drift commonly induced by post-hoc guidance. Empirical evaluations on motion prediction (Waymo Dataset) and multi-task control (Maze2D) show consistent improvements over guidance-based baselines in fidelity, diversity, and controllability. These results indicate that multi-modal priors with strong modal coupling provide a scalable basis for controllable motion generation in robotics.
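At sampling time, the practical difference from standard diffusion is the initialization: the reverse process starts from a selected component of the multi-modal prior rather than from N(0, I). A minimal sketch, assuming a trained reverse-step update is available as a callable; all names are illustrative.

```python
import numpy as np

def sample_from_mode(mode_means, cov_scale, mode_idx, reverse_step,
                     n_steps=50):
    """Initialize reverse diffusion from one prior component, steering
    generation toward the data mode coupled to that component."""
    mean = np.asarray(mode_means[mode_idx], dtype=float)
    x = np.random.multivariate_normal(mean, cov_scale * np.eye(len(mean)))
    for t in reversed(range(n_steps)):
        x = reverse_step(x, t)             # learned denoising update
    return x

# e.g., with a trained model:
#   traj = sample_from_mode(prior_means, 0.1, chosen_mode, model.reverse_step)
```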
|
| |
| 09:00-10:30, Paper WeI1I.245 | Add to My Program |
| NSF-HRPT: Neural Semantic Field Meets Hierarchical Risk Perception Tree for Safety-Critical Scenario Assessment |
|
| Zhao, Yu | Zhejiang University |
| Pan, Jiangyu | Zhejiang University |
| Hu, Tao | Zhejiang University |
| Yin, Ming | Zhejiang University |
| Yang, Fan | Zhejiang University |
| Liu, Jiangfan | Beihang University |
| Liang, Xiubo | Zhejiang University |
Keywords: Autonomous Vehicle Navigation, Deep Learning for Visual Perception, Collision Avoidance
Abstract: The ability to accurately assess and anticipate risks in safety-critical scenarios is crucial for autonomous driving systems. While existing research has made progress in collision prediction, accurately quantifying risk levels from monocular vision inputs remains challenging due to the complex dynamics of multi-agent interactions and the inherent uncertainty in real-world environments. To address these challenges, we present NSF-HRPT, a novel framework that combines learning-based perception with structured reasoning for quantitative risk assessment. Our approach features a Neural Semantic Field (NSF) that learns to model scene semantics, trajectory predictions, and probabilistic Time-to-Collision (TTC) distributions from simulation data. During inference, the pre-trained NSF serves as a prior for our Hierarchical Risk Perception Tree (HRPT), which enables efficient parallel computation and spatial reasoning about multi-agent risks. Additionally, we introduce a Sim2Real enhancement strategy that improves real-world applicability without retraining by incorporating priors from foundation models. Extensive evaluations demonstrate that our framework achieves state-of-the-art performance on synthetic benchmarks and delivers competitive, near-state-of-the-art results on real-world datasets for both TTC estimation accuracy and risk localization precision. The proposed method provides an effective solution for real-time risk awareness from monocular camera inputs.
|
| |
| 09:00-10:30, Paper WeI1I.246 | Add to My Program |
| AERO-MPPI: Anchor-Guided Ensemble Trajectory Optimization for Agile Mapless Drone Navigation |
|
| Chen, Xin | National University of Singapore |
| Huang, Rui | National University of Singapore |
| Tang, Longbin | National University of Singapore |
| Zhao, Lin | National University of Singapore |
Keywords: Collision Avoidance, Motion and Path Planning, Aerial Systems: Perception and Autonomy
Abstract: Agile mapless navigation in cluttered 3D environments poses significant challenges for autonomous drones. Conventional mapping–planning–control pipelines incur high computational cost and propagate estimation errors. We present AERO-MPPI, a fully GPU-accelerated framework that unifies perception and planning through an anchor-guided ensemble of Model Predictive Path Integral (MPPI) optimizers. Specifically, we design a multi-resolution LiDAR point-cloud representation that rapidly extracts spatially distributed “anchors” as look-ahead intermediate endpoints, from which we construct polynomial trajectory guides to explore distinct homotopy path classes. At each planning step, we run multiple MPPI instances in parallel and evaluate them with a two-stage multi-objective cost that balances collision avoidance and goal reaching. Implemented entirely with NVIDIA Warp GPU kernels, AERO-MPPI achieves real-time onboard operation and mitigates the local-minima failures of single-MPPI approaches. Extensive simulations in forest, vertical, and incline environments demonstrate sustained reliable flight above 7 m/s, with success rates above 80% and smoother trajectories compared to state-of-the-art baselines. Real-world experiments on a LiDAR-equipped quadrotor with an NVIDIA Jetson Orin NX 16G confirm that AERO-MPPI runs in real time onboard and consistently achieves safe, agile, and robust flight in complex cluttered environments. Code is available at https://github.com/XinChen-stars/AERO_MPPI.
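Each optimizer in the ensemble follows the standard MPPI update: perturb the nominal control sequence, score the rollouts, and re-average the noise with softmax weights. A minimal single-instance sketch; the anchor guidance, two-stage cost, and GPU kernels are not shown, and all names and hyperparameters are illustrative.

```python
import numpy as np

def mppi_update(u_nominal, cost_fn, sigma=0.5, lam=1.0, n_samples=256):
    """One MPPI iteration over a (horizon, n_controls) nominal sequence."""
    H, m = u_nominal.shape
    noise = sigma * np.random.randn(n_samples, H, m)
    costs = np.array([cost_fn(u_nominal + e) for e in noise])
    w = np.exp(-(costs - costs.min()) / lam)     # softmax over -cost
    w /= w.sum()
    return u_nominal + np.einsum('k,khm->hm', w, noise)

# An anchor-guided ensemble would run several instances in parallel, each
# warm-started with a polynomial guide toward a different anchor, and
# execute the best-scoring plan.
```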
|
| |
| 09:00-10:30, Paper WeI1I.247 | Add to My Program |
| The Robot's Inner Critic: Self-Refinement of Social Behaviors through VLM-Based Replanning |
|
| Lim, Jiyu | Kwangwoon University |
| Yoon, Youngwoo | ETRI |
| Park, Kwanghyun | Kwangwoon University |
Keywords: Social HRI, Human-Centered Robotics, Emotional Robotics
Abstract: Conventional robot social behavior generation has been limited in flexibility and autonomy, relying on predefined motions or human feedback. This study proposes CRISP (Critique-and-Replan for Interactive Social Presence), an autonomous framework where a robot critiques and replans its own actions by leveraging a Vision-Language Model (VLM) as a `human-like social critic.' CRISP integrates (1) extraction of movable joints and constraints by analyzing the robot's description file (e.g., MJCF), (2) generation of step-by-step behavior plans based on situational context, (3) generation of low-level joint control code by referencing visual information (joint range-of-motion visualizations), (4) VLM-based evaluation of social appropriateness and naturalness, including pinpointing erroneous steps, and (5) iterative refinement of behaviors through reward-based search. This approach is not tied to a specific robot API; it can generate subtly different, human-like motions on various platforms using only the robot's structure file. In a user study involving five different robot types and 20 scenarios, including mobile manipulators and humanoids, our proposed method achieved significantly higher preference and situational appropriateness ratings compared to previous methods. This research presents a general framework that minimizes human intervention while expanding the robot's autonomous interaction capabilities and cross-platform applicability. Detailed result videos and supplementary information regarding this work are available at https://limjiyu99.github.io/inner-critic/.
|
| |
| 09:00-10:30, Paper WeI1I.248 | Add to My Program |
| ACG: Action Coherence Guidance for Flow-Based Vision-Language-Action Models |
|
| Park, Minho | KAIST |
| Kim, Kinam | KAIST |
| Hyung, Junha | KAIST |
| Jang, Hyojin | KAIST |
| Jin, Hoiyeong | KAIST |
| Yun, Jooyeol | KAIST |
| Lee, Hojoon | KAIST |
| Choo, Jaegul | KAIST |
Keywords: Imitation Learning, Learning from Demonstration, Machine Learning for Robot Control
Abstract: Diffusion and flow matching models have emerged as powerful robot policies, enabling Vision-Language-Action (VLA) models to generalize across diverse scenes and instructions. Yet, when trained via imitation learning, their high generative capacity makes them sensitive to noise in human demonstrations: jerks, pauses, and jitter that reduce action coherence. Reduced action coherence causes instability and trajectory drift during deployment, failures that are catastrophic in fine-grained manipulation where precision is crucial. In this paper, we present Action Coherence Guidance (ACG) for VLA models, a training-free test-time guidance algorithm that improves action coherence and thereby yields performance gains. Evaluated on RoboCasa, DexMimicGen, and real-world SO-101 tasks, ACG consistently improves action coherence and boosts success rates across diverse manipulation tasks.
|
| |
| 09:00-10:30, Paper WeI1I.249 | Add to My Program |
| Adaptive Friction-Based Inchworm-Like Locomotion |
|
| Kubik, Jiri | Czech Technical University in Prague |
| Faigl, Jan | Czech Technical University in Prague |
Keywords: Biologically-Inspired Robots, Motion Control
Abstract: In this paper, we study adaptive locomotion for inchworm-like robots that move using friction-based pads inspired by snake scales. The robot moves by alternating extension and contraction phases, which require enough friction at the rear and front pads. Locomotion speed depends on how far the body extends, but it is limited by the friction available on the pads. Since the pads are passive, friction can only be controlled by shifting body weight onto them. The challenge is to balance friction and extension length to achieve fast movement on different terrains. Previous work relied on offline tuning for specific terrain types. We propose a new adaptive controller that automatically adjusts the friction requirements by detecting slippage. We demonstrate the approach using a six-degree-of-freedom inchworm-like robot and three locomotion strategies adapted from the literature, which were tested across four terrain types. The locomotion performance is measured by the achieved average locomotion speed, cost of transport, and reliability. Based on the experimental results, the proposed adaptive locomotion achieves performance similar to offline-tuned locomotion and even surpasses it on certain terrains.
|
| |
| 09:00-10:30, Paper WeI1I.250 | Add to My Program |
| Autonomous Search for Sparsely Distributed Visual Phenomena through Environmental Context Modeling |
|
| Chen, Eric | MIT, Woods Hole Oceanographic Institution |
| Manderson, Travis | Massachusetts Institute of Technology |
| Karapetyan, Nare | Woods Hole Oceanographic Institution |
| Edmunds, Peter | California State University, Northridge |
| Roy, Nicholas | Massachusetts Institute of Technology |
| Girdhar, Yogesh | Woods Hole Oceanographic Institution |
Keywords: Marine Robotics, Environment Monitoring and Management, Vision-Based Navigation
Abstract: Autonomous underwater vehicles (AUVs) are increasingly used to survey coral reefs, yet efficiently locating specific coral species of interest remains difficult: target species are often sparsely distributed across the reef, and an AUV with limited battery life cannot afford to search everywhere. When detections of the target itself are too sparse to provide directional guidance, the robot benefits from an additional signal to decide where to look next. We propose using the visual environmental context -- the habitat features that tend to co-occur with a target species -- as that signal. Because context features are spatially denser and often vary more smoothly than target detections, we hypothesize that a reward function targeted at broader environmental context will enable adaptive planners to make better decisions on where to go next, even in regions where no target has yet been observed. Starting from a single labeled image, our method uses patch-level DINOv2 embeddings to perform one-shot detections of both the target species and its surrounding context online. We validate our approach using real imagery collected by an AUV at two reef sites in St. John, U.S. Virgin Islands, simulating the robot's motion offline. Our results demonstrate that one-shot detection combined with adaptive context modeling enables efficient autonomous surveying, sampling up to 75% of the target in roughly half the time required by exhaustive coverage when the target is sparsely distributed, and outperforming search strategies that only use target detections.
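One-shot detection from a single labeled image reduces to comparing patch embeddings against a query embedding. A minimal sketch, assuming patch-level embeddings (e.g., from DINOv2) are already computed; the threshold is an assumption.

```python
import numpy as np

def one_shot_scores(patch_embeds, query_embed):
    """Cosine similarity between every patch embedding and the single
    labeled query patch, giving per-patch detection scores for the
    target species or its habitat context."""
    p = patch_embeds / np.linalg.norm(patch_embeds, axis=-1, keepdims=True)
    q = query_embed / np.linalg.norm(query_embed)
    return p @ q

def context_reward(scores, threshold=0.6):
    """Dense context signal: fraction of patches resembling the habitat,
    usable by an adaptive planner even where no target has been seen."""
    return float((scores > threshold).mean())
```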
|
| |
| 09:00-10:30, Paper WeI1I.251 | Add to My Program |
| Controllable Steering Torque Generation Via Flapping Motion by a Cross-Coupled Two-Degree-Of-Freedom Drive System |
|
| Hamamoto, Masaki | Nakakita Seisakusho Co., Ltd |
| Yamashita, Makoto | Nakakita Seisakusho Co., Ltd |
| Tanaka, Mika | Nakakita Seisakusho Co., Ltd |
| Senda, Kei | Kyoto University |
Keywords: Aerial Systems: Mechanics and Control, Biologically-Inspired Robots, Mechanism Design
Abstract: We present a novel flapping‑wing mechanism capable of generating steering torque through a two‑degree‑of‑freedom (2‑DoF) coordinated actuation. Whereas most existing flapping‑flight systems produce steering torque by incorporating additional mechanisms that asynchronously alter passive deformation limits, our approach enables transient aerodynamic force modulation synchronized with each flapping stroke. The concept draws inspiration from biological flyers such as dragonflies and hawkmoths, which utilize multiple synchronous muscles per wing and perform stroke‑synchronized, multi‑DoF wing kinematics—strategies thought to contribute to their precise attitude and position control even at low flapping frequencies. To emulate this capability, we developed a mechanism employing parallel direct‑drive actuators within a coupled multi‑DoF architecture. By introducing a cross‑coupling force between the actuators, we eliminate path dependency in the wing‑twist motion, thereby enabling stable and independent control of both stroke angle and angle of attack. Using a single‑wing 2‑DoF testbed, we successfully demonstrate a lift force exceeding 10 gf and a yaw‑steering torque range of 1.5 mNm. This work advances the development of biologically inspired, stroke‑synchronized steering mechanisms for next‑generation flapping‑wing micro aerial vehicles.
|
| |
| 09:00-10:30, Paper WeI1I.252 | Add to My Program |
| VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation |
|
| Ni, Zehao | Tongji University |
| He, Yonghao | D-Robotics |
| Qian, Lingfeng | D-Robotics |
| Mao, Jilei | D-Robotics |
| Fu, Fa | D-Robotics |
| Sui, Wei | D-Robotics |
| Su, Hu | Institute of Automation, Chinese Academy of Science |
| Peng, Junran | University of Science and Technology Beijing |
| Wang, Zhipeng | Tongji University |
| He, Bin | Tongji University |
Keywords: Representation Learning, Imitation Learning, Perception for Grasping and Manipulation
Abstract: In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point cloud feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions that have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT, incorporating semantic features from DINOv2 and geometric features from Alternating Attention blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only outperforms the vision-only baseline DP significantly but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6%, on par with DP3 (64.0%) and far higher than DP (34.8%), while in real-world tasks it reaches 87.9%, outperforming both DP3 (67.5%) and DP (11.2%) by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions including color, size, background, and lighting. Lastly, we open-source DRRM (D-Robotics Robotic Manipulation), a training library for robotic manipulation. Built on Accelerate, this library supports multi-machine and multi-GPU parallel training, as well as mixed precision training (e.g., bf16, fp16). It is compatible with visuomotor policies such as DP and DP3, and also supports the RoboTwin simulator. VO-DP is integrated into DRRM. We refer to the project page for the code and videos.
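A minimal PyTorch sketch of the fusion step described above: semantic tokens attend to geometric tokens via cross-attention, and the fused token map is spatially compressed by a small CNN before the policy head. Dimensions, layer sizes, and all names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SemGeoFusion(nn.Module):
    """Cross-attention fusion of semantic and geometric tokens, followed
    by CNN spatial compression (illustrative stand-in for VO-DP's head)."""

    def __init__(self, dim=384, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.compress = nn.Sequential(
            nn.Conv2d(dim, dim // 2, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4))

    def forward(self, sem, geo, hw):
        # sem, geo: (B, N, dim) token sequences; hw: token grid (H, W)
        fused, _ = self.attn(query=sem, key=geo, value=geo)
        B, N, D = fused.shape
        fmap = fused.transpose(1, 2).reshape(B, D, *hw)
        return self.compress(fmap).flatten(1)   # input to the policy head

feat = SemGeoFusion()(torch.randn(2, 196, 384),
                      torch.randn(2, 196, 384), (14, 14))
```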
|
| |
| 09:00-10:30, Paper WeI1I.253 | Add to My Program |
| SIPHON: An Origami Soft Salp Robot |
|
| Van Stratum, Brian | Oregon State University |
| Justus, Nathan | Oregon State University |
| Hatton, Ross | Oregon State University |
| Davidson, Joseph | Oregon State University |
| Hollinger, Geoffrey | Oregon State University |
Keywords: Biologically-Inspired Robots, Soft Robot Materials and Design, Soft Robot Applications
Abstract: We present SIPHON, a Salp-Inspired robot designed to utilize Passive Hydrodynamics, and equipped with soft robotic Origami bellows and soft Nozzles. We describe the construction, including a novel use of an interlocking origami Kresling pattern, along with duckbill and mammal-heart-inspired valves. We derive a physical model for the coupled dynamics of body displacement and body contraction. We show experimental results of free-swimming pool trials, and we compare these results to the model. Compared to other power-autonomous, bioinspired pulsed-jet swimmers, SIPHON swims with high speed and efficiency, achieving a mean swimming speed of 16.5 cm/s (0.59 Bl/s) and a cost of transport of 4.9 J/m (1.8 W·s/(N·m)).
|
| |
| 09:00-10:30, Paper WeI1I.254 | Add to My Program |
| Legs Over Arms: On the Predictive Value of Lower-Body Pose for Human Trajectory Prediction from Egocentric Robot Perception |
|
| Le, Nhat | George Mason University |
| Song, Daeun | George Mason University |
| Xiao, Xuesu | George Mason University |
Keywords: Gesture, Posture and Facial Expressions, Human Detection and Tracking, Human-Centered Robotics
Abstract: Predicting human trajectories is crucial for social robot navigation in crowded environments. While most existing approaches treat humans as point masses, we present a study on multi-agent trajectory prediction that leverages different human skeletal features for improved forecast accuracy. In particular, we systematically evaluate the predictive utility of 2D and 3D skeletal keypoints and derived biomechanical cues as additional inputs. Through a comprehensive study on the JRDB dataset and another new dataset for social navigation with 360-degree panoramic videos, we find that focusing on lower-body 3D keypoints yields a 13% reduction in Average Displacement Error, and that augmenting 3D keypoint inputs with corresponding biomechanical cues provides a further 1-4% improvement. Notably, the performance gain persists when using 2D keypoint inputs extracted from equirectangular panoramic images, indicating that monocular surround vision can capture informative cues for motion forecasting. Our finding that robots can forecast human movement efficiently by watching their legs provides actionable insights for designing sensing capabilities for social robot navigation.
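For reference, Average Displacement Error is the mean Euclidean distance between predicted and ground-truth future positions (the 13% figure above is a reduction in this quantity); a minimal sketch, with Final Displacement Error included for completeness.

```python
import numpy as np

def ade_fde(pred, gt):
    """ADE / FDE for one trajectory.

    pred, gt: (T, 2) arrays of future planar positions in meters."""
    d = np.linalg.norm(pred - gt, axis=-1)   # per-step displacement
    return d.mean(), d[-1]                   # average, final
```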
|
| |
| 09:00-10:30, Paper WeI1I.255 | Add to My Program |
| INTACT-GRIP: An Inflatable Tactile Gripper for Soft Manipulation and High-Resolution Texture Mapping |
|
| Kara, Ozdemir Can | University of Texas at Austin |
| Rafiee Javazm, Mohammad | University of Texas at Austin |
| Rezayof, Omid | University of Texas at Austin |
| Alambeigi, Farshid | University of Texas at Austin |
Keywords: Grippers and Other End-Effectors, Force and Tactile Sensing, Soft Sensors and Actuators
Abstract: Robotic manipulation, especially of fragile and irregularly shaped objects, remains a significant challenge due to the need for both adaptability and precise tactile feedback. In this work, we introduce INTACT-GRIP, a robotic gripper that combines soft manipulation and high-resolution tactile sensing for inflation-based soft grasping. INTACT-GRIP integrates inflatable balloons with vision-based tactile feedback, enabling fingertip stiffness modulation for stable and damage-free manipulation of fragile and irregularly shaped objects. To evaluate its performance, we conducted a series of qualitative and quantitative experiments. In these experiments, inflation pressure was manually controlled by a human operator, who adjusted and stopped the pressure based on real-time visual feedback of the captured texture features. The results demonstrate the system’s ability to safely conform to fragile and irregularly shaped objects with varying stiffness, enabling pressure-controlled grasping and high-resolution tactile imaging during contact. Furthermore, a case study with a robotic arm highlighted the system’s potential as a versatile solution for precise and soft manipulation of delicate objects, supported by pressure-adjustable fingertips and real-time visual–tactile feedback.
|
| |
| 09:00-10:30, Paper WeI1I.256 | Add to My Program |
| MOSAIC: Skill-Centric Manipulation Planning with Physics Simulation |
|
| Mishani, Itamar | Carnegie Mellon University, Robotics Institute |
| Shaoul, Yorai | Carnegie Mellon University |
| Likhachev, Maxim | Carnegie Mellon University |
Keywords: Manipulation Planning, Integrated Planning and Learning, Motion and Path Planning
Abstract: Planning long-horizon manipulation motions using a set of predefined skills is a central challenge in robotics; solving it efficiently could enable general-purpose robots to tackle novel tasks by flexibly composing generic skills. Solutions to this problem lie in an infinitely vast space of parameterized skill sequences, a space where common incremental methods struggle to find sequences that have non-obvious intermediate steps. Some approaches reason over lower-dimensional, symbolic spaces, which are more tractable to explore but may be brittle and are laborious to construct. In this work, we introduce MOSAIC, a skill-centric, multi-directional planning approach that targets these challenges by reasoning about which skills to employ and where they are most likely to succeed, utilizing physics simulation to estimate skill execution outcomes. Specifically, MOSAIC employs two complementary skill families: Generators, which identify "islands of competence" where skills are demonstrably effective, and Connectors, which link these skill trajectories by solving boundary value problems. By focusing planning efforts on regions of high competence, MOSAIC efficiently discovers physically-grounded solutions. We demonstrate its efficacy on complex long-horizon problems in both simulation and the real world, using a diverse set of skills including generative diffusion models, motion planning algorithms, and manipulation-specific models. Visit skill-mosaic.github.io for demonstrations.
|
| |
| 09:00-10:30, Paper WeI1I.257 | Add to My Program |
| Agile and Controllable Omnidirectional Fast-Start Maneuvers of Robotic Fish Via Bio-Inspired Reinforcement Learning |
|
| Huang, Xu | Shanghaitech University |
| Lin, Xiaozhu | ShanghaiTech University |
| Liu, Xiaopei | SHANGHAITECH UNIVERSITY |
| Wang, Yang | Shanghaitech University |
Keywords: Biologically-Inspired Robots, Bioinspired Robot Learning, Marine Robotics
Abstract: Fast-start maneuvers—exemplified by the C-start in fish—represent a highly agile and effective locomotor strategy that requires precise multi-joint coordination under conditions of unsteady fluid dynamics, and has evolved through extensive predator–prey interactions in natural environments. Replicating such maneuvers in robotic fish is challenging due to strong fluid–structure nonlinearities, instantaneous dynamics, and complex vortex interactions. Prior approaches were limited by their dependence on specialized materials, lack of active controllability, incompatibility with mechanical structures, and inability to generate sufficient forward propulsion. Here, we propose a deep reinforcement learning method for multi-joint robotic fish that embeds key biological features of C-start maneuvers—burst acceleration, rapid directional adjustment, and two-stage bend-and-stretch motion—into the reward and observation design. By training in a physically consistent, high-performance Computational Fluid Dynamics (CFD) solver, the agent autonomously discovers effective launch strategies without requiring explicit models or real fish data. The resulting policies not only reproduce C-start-like motions and achieve fully controllable directional fast-starts, but also significantly expand the maneuvering potential of robotic fish, enabling higher velocities, greater displacement, and more agile motion than state-of-the-art methods. This biologically inspired and generalizable method demonstrates the promise of integrating biological principles into reinforcement learning to unlock advanced, high-acceleration capabilities in multi-joint aquatic robots.
|
| |
| 09:00-10:30, Paper WeI1I.258 | Add to My Program |
| Simulated Annealing for Multi-Robot Ergodic Information Acquisition Using Graph-Based Discretization |
|
| Wong, Benjamin | University of Washington |
| Weber, Aaron | University of Washington |
| Safwat, Mohamed | University of Washington, Seattle |
| Devasia, Santosh | University of Washington |
| Banerjee, Ashis | University of Washington |
Keywords: Optimization and Optimal Control, Probability and Statistical Methods, Multi-Robot Systems
Abstract: One of the goals of active information acquisition using multi-robot teams is to keep the relative uncertainty in each region at the same level to maintain identical acquisition quality (e.g., consistent target detection) in all the regions. To achieve this goal, ergodic coverage can be used to assign the number of samples according to the quality of observation, i.e., sampling noise levels. However, the noise levels are unknown to the robots. Although this noise can be estimated from samples, the estimates are unreliable at first and can generate fluctuating values. The main contribution of this paper is to use simulated annealing to generate the target sampling distribution, starting from uniform and gradually shifting to an estimated optimal distribution, by varying the coldness parameter of a Boltzmann distribution with the estimated sampling entropy as energy. Simulation results show a substantial improvement of both transient and asymptotic entropy compared to both uniform and direct-ergodic searches. Finally, a demonstration is performed with a TurtleBot swarm system to validate the physical applicability of the algorithm.
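The annealed target distribution can be written directly as a Boltzmann distribution whose coldness parameter is raised over time, with the estimated sampling entropy entering as the energy, as stated above; the schedule and sign handling below are illustrative assumptions.

```python
import numpy as np

def boltzmann_target(energy, beta):
    """Boltzmann target p_i proportional to exp(-beta * E_i).

    beta = 0 (infinite temperature) gives a uniform target; increasing
    the coldness beta concentrates samples according to the energy."""
    e = np.asarray(energy, dtype=float)
    p = np.exp(-beta * (e - e.min()))        # shift for numerical safety
    return p / p.sum()

# Annealing: ignore noisy early entropy estimates (uniform), trust them later
for beta in (0.0, 1.0, 5.0):
    print(beta, boltzmann_target([0.2, 0.8, 0.5], beta))
```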
|
| |
| 09:00-10:30, Paper WeI1I.259 | Add to My Program |
| A Natural Language Interface for Multi-Constraint Spatiotemporal Planning Via LLM-Parameterized Mixed-Integer Scheduling and A* |
|
| Ye, Sean | Georgia Institute of Technology |
| Luebbers, Matthew | Georgia Institute of Technology |
| Gombolay, Matthew | Georgia Institute of Technology |
Keywords: Planning, Scheduling and Coordination, Human Factors and Human-in-the-Loop, Task Planning
Abstract: Spatiotemporal planning is critically important in fields like robotics, logistics, and naval operations, especially for problem specifications involving multiple constraints. Traditional approaches place the burden on end users to manually specify cost functions, constraints, or model parameters, a time-consuming and laborious process often resulting in less-than-ideal plans. We present a novel architecture integrating an LLM-based natural language interface with MILP scheduling and A* motion planning for multi-constraint spatiotemporal planning. We validate our LLM-planning approach through a within-subjects user study using a simulated maritime route-planning domain against manual control, and against autonomous planning with classical template-based constraint specification. Results showed our LLM-planning approach not only improved usability and reduced workload over alternative input modalities but also maintained the path optimality of traditional constraint specification interfaces while decreasing planning time. These findings demonstrate that bridging LLM-powered interfaces with robust schedulers and motion planners can enhance human-autonomy interaction in complex planning tasks, potentially making advanced spatiotemporal planning tools more practical for a broader range of users.
|
| |
| 09:00-10:30, Paper WeI1I.260 | Add to My Program |
| CADET: A Modular Platform for Evaluating Distributed Cooperative Autonomy in Connected Autonomous Vehicles |
|
| Sharma, Pragya | University of California Los Angeles |
| Wang, Brian | UCLA |
| Srivastava, Mani | UCLA |
Keywords: Distributed Robot Systems, Cooperating Robots, Hardware-Software Integration in Robotics
Abstract: Deep learning models are increasingly central to autonomous vehicle (AV) pipelines, yet their integration has traditionally followed a monolithic design where perception, planning, and control execute on a single onboard computer. This design overlooks the emerging paradigm of cooperative autonomy, where vehicles interact with roadside units (RSUs), edge servers, and cloud-hosted intelligence through vehicle-to-everything (V2X) connectivity. Cooperative perception and control improve safety and efficiency, but also introduce systems-level challenges: network latency, compute heterogeneity, and multi-tenant contention, all critically affect real-time decision-making. These challenges are further amplified by the increasing reliance on large foundation models, whose scale necessitates cloud deployment. We present CADET (Cooperative Autonomy through Distributed Experimentation Toolkit), a modular platform for systematic and reproducible evaluation of distributed cooperative autonomy systems under realistic deployment conditions. CADET decouples the AV stack into composable modules that can be flexibly deployed across vehicles, infrastructure, and edge/cloud tiers. The framework integrates state-of-the-art models, incorporates trace-driven network and workload emulation, and provides synchronized model-, system-, and task-level instrumentation. Through V2V and V2I experiments, we show that distributed deployment choices fundamentally shape safety, with V2V intent packets outperforming cloud-based perception and RSU-assisted perception sustaining safety until overloaded by concurrent requests. Although designed for AV pipelines, CADET also supports dataset-driven experimentation, enabling systems and ML researchers to benchmark distributed inference workloads independently of full vehicle simulation. CADET is open source, with code and demo available at https://nesl.github.io/cadet-web.
|
| |
| 09:00-10:30, Paper WeI1I.261 | Add to My Program |
| Simultaneous Arrival Control for Distributed Multi-Robot Systems with Curvature and Constant-Speed Constraints |
|
| Xiao, Zhouru | Hunan University |
| Lu, Yang | National University of Defense Technology |
| Yao, Weijia | Hunan University |
| Liu, Min | Hunan University |
| Wang, Yaonan | Hunan University |
Keywords: Multi-Robot Systems, Constrained Motion Planning, Task and Motion Planning
Abstract: The simultaneous arrival of multiple mobile robots at their respective target points is crucial for cooperative tasks such as encirclement, interception, and disaster relief. Although the problem of simultaneous arrival is inherently complex, it becomes even more challenging in multi-robot systems with curvature-constrained trajectories and constant-speed requirements that may differ among robots, along with the need for distributed, real-time, and low-communication control. These constraints are typical for a multi-robot system, such as one consisting of fixed-wing UAVs or car-like mobile robots. To address this challenge, this paper proposes a distributed switching control method based on the maximum consensus protocol. Inspired by the optimization principles and geometric properties of Dubins paths, we introduce a virtual time variable and design a hybrid control law that combines optimal control with saturated proportional control. Under the proposed control law, each robot is driven to approach the maximum virtual time among its neighbors, thereby achieving simultaneous arrival under mild conditions. Furthermore, we prove that in certain cases the proposed method achieves a theoretically optimal arrival time, and its effectiveness and robustness are validated through extensive simulations and real-world experiments.
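The maximum-consensus protocol at the core of the method is simple to state: each robot repeatedly adopts the largest virtual arrival time in its neighborhood, so all robots converge to the network-wide maximum (the binding constraint for simultaneous arrival) within diameter-many rounds. A minimal synchronous sketch; the hybrid optimal/saturated control law built on top is not shown.

```python
import numpy as np

def max_consensus_step(virtual_times, adjacency):
    """One synchronous round: t_i <- max over {i} union N(i) of t_j."""
    t = np.asarray(virtual_times, dtype=float)
    A = np.asarray(adjacency, dtype=bool) | np.eye(len(t), dtype=bool)
    return np.array([t[A[i]].max() for i in range(len(t))])

t = [12.0, 15.5, 9.8]
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]      # line graph
for _ in range(2):                         # diameter of the line graph
    t = max_consensus_step(t, A)
print(t)                                   # -> all robots agree on 15.5
```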
|
| |
| 09:00-10:30, Paper WeI1I.262 | Add to My Program |
| MonoDuo: Using One Robot Arm to Learn Bimanual Policies |
|
| Bajamahal, Sandeep | UC Berkeley |
| Chen, Lawrence Yunliang | UC Berkeley |
| Lin, Toru | University of California, Berkeley |
| Ma, Zehan | University of California, Berkeley |
| Malik, Jitendra | UC Berkeley |
| Goldberg, Ken | UC Berkeley |
Keywords: Bimanual Manipulation, Transfer Learning, Human-Robot Collaboration
Abstract: Bimanual coordination is essential for many real-world manipulation tasks, yet learning bimanual robot policies is limited by the scarcity of bimanual robots and datasets. Single-arm robots, however, are widely available in research labs. Can we leverage them to train bimanual robot policies? We present MonoDuo, a framework for learning bimanual manipulation policies using single-arm robot demonstrations paired with human collaboration. MonoDuo collects data by teleoperating a single-arm robot to perform one side of a bimanual task while a human performs the other, then swapping roles to cover both sides. RGB-D observations from a wrist-mounted and fixed camera are augmented into synthetic demonstrations for target bimanual robots using state-of-the-art hand pose estimation, image and point cloud segmentation, and inpainting. These synthetic demonstrations, grounded in real robot kinematics, are used to train bimanual policies. We evaluate MonoDuo on five tasks—box lifting, backpack packing, cloth folding, jacket zipping, and plate handover. Compared to approaches relying solely on human bimanual videos, MonoDuo enables zero-shot deployment on unseen bimanual robot configurations, achieving success rates up to 70%. With only 25 target robot demonstrations, few-shot finetuning further boosts success rates by 65–70% over training from scratch, demonstrating MonoDuo’s effectiveness in efficiently transferring knowledge from single-arm robot data to bimanual robot policies. Project page: https://bimanual-monoduo.github.io
|
| |
| 09:00-10:30, Paper WeI1I.263 | Add to My Program |
| Designing an Efficient Excavator Bucket for Lunar ISRU: A Comparative Study with Vision-Based Fill and Displacement Analysis |
|
| Kafi, Abdulla Hil | Kyutech, Space Robotics Laboratory |
| Koshi, Tomoki | Kyushu Institute of Technology |
| Casir, Jorge | Kyushu Institute of Technology, Space Robotics Laboratory |
| Nagaoka, Kenji | Kyushu Institute of Technology |
Keywords: Space Robotics and Automation, Object Detection, Segmentation and Categorization, Wheeled Robots
Abstract: This paper presents a spiral-cavity wheel for lunar regolith excavation and a sensor-light evaluation stack that jointly estimates fill ratio (vision), sinkage (vision), and specific energy from actuator logs. In benchtop tests (four revolutions at 5, 10, and 15 RPM) against two literature baselines, the proposed wheel achieved higher excavated mass and fill ratio, delivering 2.2–3.0 times higher excavation rate while reducing specific energy by 29% relative to a bucket-drum baseline. Normalized sinkage (mm/kg) was also lower, indicating stable traction without bogging. Effort-time traces show a steady torque envelope with repeatable cut–carry–dump cycles across speeds. We provide a retention index η that correlates with fill ratio and a DEM setup that reproduces experimental trends with low error. Results suggest spiral-cavity wheels can replace heavier multi-actuator diggers when mass, simplicity, and energy efficiency are mission drivers.
|
| |
| 09:00-10:30, Paper WeI1I.264 | Add to My Program |
| GenLaM: Generative Layered Mesh for Multi-Modal Sensor Emulation in Robotics |
|
| Bais, Aakash Singh | Luleå University of Technology |
| Patel, Akash | Luleå University of Technology |
| Kanellakis, Christoforos | Luleå University of Technology |
| Nikolakopoulos, George | Luleå University of Technology |
Keywords: Semantic Scene Understanding, Deep Learning for Visual Perception, AI-Enabled Robotics
Abstract: Accurate environment perception is fundamental for robust robot navigation, mapping, and interaction. Traditional perception pipelines rely on multiple sensors, including stereo cameras and LiDAR, which impose constraints on cost, payload, and system integration. In this paper, we propose a novel single-image perception framework that unifies novel view synthesis and RGB/segmented LiDAR emulation into a single pipeline. Leveraging monocular depth estimation and camera intrinsics recovery, our approach projects image pixels into 3D space and performs mesh reconstruction to generate dense geometric representations. This enables high-fidelity sensor emulation, including transparent surface reconstruction such as glass, an element often missed by conventional LiDAR. By enriching synthetic LiDAR scans with otherwise unavailable geometry, our method enhances downstream tasks such as robot path planning and obstacle avoidance. This work opens up new possibilities for resource-efficient robotic perception by reducing sensor dependency while improving geometric reasoning.
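The projection step the abstract relies on is standard pinhole backprojection. A self-contained sketch, assuming a metric depth map and known intrinsics; the authors' full pipeline adds mesh reconstruction and sensor emulation on top of this:

```python
import numpy as np

def backproject(depth, K):
    # depth: (H, W) metric depth, e.g. from a monocular estimator.
    # K: 3x3 pinhole intrinsics (possibly recovered, as in the abstract).
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1)  # (H, W, 3) camera-frame points
```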
|
| |
| 09:00-10:30, Paper WeI1I.265 | Add to My Program |
| Deep Photonic Reservoir Computer Meets UAV Control: An Ultra-Fast Learning-Based Compensator for Agile Flight in Confined Space |
|
| Ma, Qinxiao | ShanghaiTech University |
| Li, Ruiqian | ShanghaiTech University |
| Wang, Cheng | ShanghaiTech University |
| Wang, Yang | Shanghaitech University |
Keywords: Aerial Systems: Mechanics and Control, Motion Control
Abstract: Unmanned aerial vehicles (UAVs) operating in confined, cluttered environments face significant performance degradation due to nonlinear, time-varying unmodeled dynamics—such as ground/ceiling effects and wake recirculation—that are unaccounted for in traditional controllers. While learning-based compensators (e.g., MLPs, TCNs, LSTMs) struggle with historical data dependency, vanishing gradients, and prohibitive computational costs, this work pioneers the integration of a deep photonic reservoir computer (PRC) with feedforward control to overcome these limitations. Harnessing semiconductor laser dynamics and optical feedback, our hardware-implemented deep PRC architecture achieves intrinsic temporal memory without explicit historical inputs, while reducing training time from hours to milliseconds and slashing inference latency to nanoseconds. Reliable high-performance CFD simulations capturing proximity-induced flows demonstrate that deep PRC delivers residual-force prediction accuracy comparable to or exceeding TCN/MLP baselines, while training only a linear readout layer via ridge regression. By injecting these predictions into a nonlinear feedback PID controller via a feedforward channel, the framework significantly enhances closed-loop tracking stability in confined spaces. Essentially, this work establishes the first deep PRC-based lightweight, ultra-fast solution for real-time UAV dynamic compensation, with promising extensibility to unseen scenarios with more complex fluid environments.
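Training only a linear readout by ridge regression, as the abstract states, reduces learning to a single linear solve. A minimal sketch, assuming sampled reservoir states and target residual forces are available as arrays (names are illustrative):

```python
import numpy as np

def train_readout(states, targets, ridge=1e-4):
    # states: (T, N) reservoir responses; targets: (T, M) residual forces.
    # Only this readout is trained; the photonic reservoir itself is fixed.
    X = np.hstack([states, np.ones((len(states), 1))])  # append bias column
    W = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ targets)
    return W  # predict with: np.hstack([new_states, ones]) @ W
```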
|
| |
| 09:00-10:30, Paper WeI1I.266 | Add to My Program |
| Trajectory Planning Using Safe Ellipsoidal Corridors As Projections of Orthogonal Trust Regions |
|
| Jaitly, Akshay | Worcester Polytechnic Institute |
| Arrizabalaga, Jon | Massachusetts Institute of Technology (MIT) |
| Li, Guanrui | Worcester Polytechnic Institute |
Keywords: Motion and Path Planning, Autonomous Vehicle Navigation, Optimization and Optimal Control
Abstract: Planning collision-free trajectories in complex environments remains a core challenge in robotics. Existing corridor-based planners, which rely on decomposing the free space into collision-free subsets, scale poorly with environmental complexity and require explicit allocation of time windows to trajectory segments. We introduce a new trajectory parameterization that represents trajectories in a nonconvex collision-free corridor as points in a convex Cartesian product of balls. This parameterization allows us to decouple problem size from the geometric complexity of the solution and naturally avoids explicit time allocation by allowing trajectories to evolve continuously inside ellipsoidal corridors. Building on this representation, we formulate the Orthogonal Trust Region Problem (Orth-TRP), a specialized convex program with separable block constraints, and develop a solver that exploits this parallel structure and the unique structure of each parallel subproblem for efficient optimization. Experiments on a quadrotor trajectory planning benchmark show that our approach produces smoother trajectories and lower runtimes than state-of-the-art corridor-based planners, especially in highly complicated environments.
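The separability the abstract exploits is easiest to see in the projection onto the feasible set: a Cartesian product of balls decomposes into independent per-ball projections. A sketch under that assumption; array shapes and names are illustrative, not the authors' solver interface:

```python
import numpy as np

def project_onto_ball_product(z, centers, radii):
    # z, centers: (K, d) per-block variables and ball centers; radii: (K,).
    # Each block projects independently, so the work parallelizes over K.
    out = z.copy()
    for k in range(len(z)):
        d = z[k] - centers[k]
        n = np.linalg.norm(d)
        if n > radii[k]:
            out[k] = centers[k] + radii[k] * d / n
    return out
```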
|
| |
| 09:00-10:30, Paper WeI1I.267 | Add to My Program |
| Multi-Quadruped Cooperative Object Transport: Learning Decentralized Pinch-Lift-Move |
|
| Pandit, Bikram | Oregon State University |
| Shrestha, Aayam | Oregon State University |
| Fern, Alan | Oregon State University |
Keywords: Multi-Robot Systems, Reinforcement Learning, Mobile Manipulation
Abstract: We study decentralized cooperative transport using teams of N quadruped robots, each equipped with an arm, that must pinch, lift, and move ungraspable objects through physical contact alone. Unlike prior work that relies on rigid mechanical coupling between robots and objects, we address the more challenging setting where mechanically independent robots must coordinate through contact forces alone without any communication or centralized control. To this end, we employ a hierarchical policy architecture that separates base locomotion from arm control, and propose a constellation reward formulation that unifies position and orientation tracking to enforce rigid contact behavior. The key insight is encouraging robots to behave as if rigidly connected to the object through careful reward design and training curriculum rather than explicit mechanical constraints. Our approach enables coordination through shared policy parameters and implicit synchronization cues, scaling to arbitrary team sizes without retraining. We show extensive simulation experiments to validate the approach and demonstrate robust transport across 2-10 robots on diverse object geometries and masses, along with sim2real transfer results on lightweight objects.
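One way to read a reward that unifies position and orientation tracking is as tracking a set of points rigidly attached to the object: matching all points at once constrains translation and rotation jointly. A heavily simplified sketch under that assumption; the kernel, scale, and point layout are illustrative, not the paper's exact formulation:

```python
import numpy as np

def constellation_reward(tracked_pts, desired_pts, sigma=0.1):
    # tracked_pts, desired_pts: (P, 3) corresponding constellation points
    # on the object, in the current and target poses respectively.
    err = np.linalg.norm(tracked_pts - desired_pts, axis=-1).mean()
    return float(np.exp(-err / sigma))  # approaches 1 as the pose matches
```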
|
| |
| 09:00-10:30, Paper WeI1I.268 | Add to My Program |
| Ego-Vision World Model for Humanoid Contact Planning |
|
| Liu, Hang | University of Michigan |
| Gao, Yuman | Zhejiang University |
| Teng, Sangli | University of California, Berkeley |
| Chi, Yufeng | University of California, Berkeley |
| Shao, Yakun Sophia | University of California, Berkeley |
| Li, Zhongyu | University of California, Berkeley |
| Ghaffari, Maani | University of Michigan |
| Sreenath, Koushil | University of California, Berkeley |
Keywords: Multi-Contact Whole-Body Motion Planning and Control, Integrated Planning and Learning, Deep Learning for Visual Perception
Abstract: Enabling humanoid robots to exploit physical contact, rather than simply avoid collisions, is crucial for autonomy in unstructured environments. Traditional optimization-based planners struggle with contact complexity, while on-policy reinforcement learning (RL) is sample-inefficient and has limited multi-task ability. We propose a framework combining a learned world model with sampling-based Model Predictive Control (MPC), trained on a demonstration-free offline dataset to predict future outcomes in a compressed latent space. To address sparse contact rewards and sensor noise, the MPC uses a learned surrogate value function for dense, robust planning. Our single, scalable model supports contact-aware tasks, including wall support after perturbation, blocking incoming objects, and traversing height-limited arches, with improved sample efficiency and multi-task capability over on-policy RL. Deployed on a physical humanoid, our system achieves robust, real-time contact planning from proprioception and ego-centric depth images.
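The planning loop the abstract describes, sampling rollouts in a learned latent world model and scoring them with a surrogate value, can be sketched as follows. The `world_model.step` and `value_fn` interfaces are assumptions for illustration, not the authors' API:

```python
import torch

@torch.no_grad()
def sample_mpc(world_model, value_fn, z0, horizon=8, n_samples=256):
    # z0: (1, latent_dim) current latent state from the encoder.
    actions = torch.randn(n_samples, horizon, world_model.action_dim)
    z = z0.expand(n_samples, -1)
    ret = torch.zeros(n_samples)
    for t in range(horizon):
        z, r = world_model.step(z, actions[:, t])  # latent dynamics + reward
        ret += r
    ret += value_fn(z)           # learned value densifies sparse contact rewards
    return actions[ret.argmax(), 0]  # execute only the first action (MPC)
```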
|
| |
| 09:00-10:30, Paper WeI1I.269 | Add to My Program |
| Few-Shot Physics-Informed Neural Network for Shape Reconstruction of Concentric-Tube Robots |
|
| Feizi, Navid | Harvard Medical School, Brigham and Women's Hospital |
| Pedrosa, Filipe | Western University |
| Patel, Rajnikant V. | The University of Western Ontario |
| Jayender, Jagadeesan | Harvard Medical School, Brigham and Women's Hospital |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Modeling, Control, and Learning for Soft Robots, Machine Learning for Robot Control
Abstract: Modeling concentric tube robots (CTRs) involves complex nonlinear continuum mechanics, and despite recent progress, physics-based models often lack an accurate representation of the experimental setups. To overcome these limitations, deep neural network-based models have been explored as alternatives with superior accuracy; however, they often overlook known mechanics, require large training datasets, and typically forgo shape estimation of the robot. We present a physics-informed neural network (PINN) for kinematic modeling of a 6-DoF CTR with three pre-curved tubes; the network embeds the Cosserat rod differential equations and learns from few-shot observational data, balancing physics priors with data-driven fitting. The PINN enables full-state estimation of shape, twist angle, torsional strain, bending moment, and orientation. Benchmark tests show a mean shape error below 1% of the robot length and accurate recovery of the other kinematic states, outperforming a purely physics-based Cosserat rod model baseline while using a minimal training set. The resulting model is also computationally efficient and robust, making it well-suited for real-time control applications.
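The balance between physics priors and few-shot data fitting is typically realized as a two-term loss. A schematic sketch, assuming a pointwise network over arc length; the toy `ode_residual` below is a placeholder standing in for the Cosserat rod equations:

```python
import torch

def ode_residual(s, y, dy):
    # Placeholder for the Cosserat rod equations: a toy linear ODE
    # dy/ds + y = 0 stands in so the sketch runs end-to-end.
    return dy + y

def pinn_loss(model, s_phys, s_obs, y_obs, w_phys=1.0, w_obs=1.0):
    # Physics term: ODE residual at collocation points along arc length.
    s = s_phys.clone().requires_grad_(True)
    y = model(s)                                      # (N, 1) state values
    dy = torch.autograd.grad(y.sum(), s, create_graph=True)[0]
    loss_phys = (ode_residual(s, y, dy) ** 2).mean()  # physics prior
    # Data term: the handful of measured states (the "few-shot" part).
    loss_obs = ((model(s_obs) - y_obs) ** 2).mean()
    return w_phys * loss_phys + w_obs * loss_obs
```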
|
| |
| 09:00-10:30, Paper WeI1I.270 | Add to My Program |
| Weighted Group-K Consistent Set Maximization for Outlier Rejection of Azimuth-Elevation Measurements |
|
| Velasco, Kalliyan | Brigham Young University |
| McLain, T.W. | Brigham Young University |
| Mangelson, Joshua | Brigham Young University |
Keywords: Localization, SLAM, Marine Robotics
Abstract: Reliable localization in robotics requires robust handling of sensor outliers, particularly in environments where acoustic or bearing measurements are noisy. We propose a replicator-dynamics-based approach for weighted group-k consistent set maximization (rGkCM) to identify the densest subsets of mutually consistent measurements in hypergraphs. To complement existing range-based consistency metrics, we introduce a k = 3 azimuth-elevation consistency check for bearing measurements to static landmarks. Our method efficiently identifies cliques in weighted k-uniform hypergraphs, leveraging the fitness of nodes to guide both pruning and recovery. We evaluate rGkCM on simulated trajectories with varying outlier levels and demonstrate significant computational speedup over the heuristic unweighted GkCM (uGkCM) method while maintaining comparable accuracy. Finally, we validate the approach on a WAM-V autonomous surface vessel equipped with an acoustic beacon and GNSS ground truth, showing effective outlier rejection in a shallow, multipath-prone marina. Results indicate that rGkCM enables robust and efficient outlier rejection for real-world bearing-based localization tasks.
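Replicator dynamics on a k-uniform hypergraph amount to iterating a tensor contraction. A minimal sketch for the k = 3 case the abstract introduces; the dense tensor is for clarity only, and a practical implementation would exploit sparsity:

```python
import numpy as np

def replicator_step(x, A):
    # A: (n, n, n) nonnegative triple-wise consistency weights (e.g., from
    # an azimuth-elevation check); x: probability vector over measurements.
    # Iterating this update concentrates mass on the densest mutually
    # consistent subset.
    fitness = np.einsum('ijk,j,k->i', A, x, x)  # per-node support
    x = x * fitness
    return x / x.sum()
```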
|
| |
| 09:00-10:30, Paper WeI1I.271 | Add to My Program |
| GeoLanG: Geometry-Aware Language-Guided Grasping with Unified RGB-D Multimodal Learning |
|
| Tang, Rui | Hong Kong University of Science and Technology (Guangzhou) |
| Wang, Guankun | The Chinese University of Hong Kong |
| Bai, Long | Alibaba DAMO Academy |
| Gao, Huxin | The Chinese University of Hong Kong |
| Lai, Jiewen | The Chinese University of Hong Kong |
| Ng, Chi Kit | The Chinese University of Hong Kong |
| Wang, Jiazheng | Huawei Hong Kong Research Center |
| Zhang, Fan | Huawei Hong Kong Research Center |
| Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
Keywords: Deep Learning in Grasping and Manipulation, RGB-D Perception, Object Detection, Segmentation and Categorization
Abstract: Language-guided grasping has emerged as a promising paradigm for enabling robots to identify and manipulate target objects through natural language instructions, yet it remains highly challenging in cluttered or occluded scenes. Existing methods often rely on multi-stage pipelines that separate object perception and grasping, which leads to limited cross-modal fusion, redundant computation, and poor generalization in cluttered, occluded, or low-texture scenes. To address these limitations, we propose GeoLanG, an end-to-end multi-task framework built upon the CLIP architecture that unifies visual and linguistic inputs into a shared representation space for robust semantic alignment and improved generalization. To enhance target discrimination under occlusion and low-texture conditions, we explore a more effective use of depth information through the Depth-guided Geometric Module (DGGM), which converts depth into explicit geometric priors and injects them into the attention mechanism without additional computational overhead. In addition, we propose Adaptive Dense Channel Integration, which adaptively balances the contributions of multi-layer features to produce more discriminative and generalizable visual representations. Extensive experiments on the OCID-VLG dataset, as well as in both simulation and real-world hardware, demonstrate that GeoLanG enables precise and robust language-guided grasping in complex, cluttered environments, paving the way toward more reliable multimodal robotic manipulation in real-world human-centric settings.
|
| |
| 09:00-10:30, Paper WeI1I.272 | Add to My Program |
| Learning Actuator-Aware Spectral Submanifolds for Precise Control of Continuum Robots |
|
| Wolff, Paul Leonard | ETH Zurich, Stanford |
| Buurmeijer, Hugo | Stanford University |
| Pabon, Luis | Stanford University |
| Alora, John Irvin | Stanford University |
| Leone, Mark | Stanford University |
| Kaundinya, Roshan | ETH Zürich |
| Kazemipour, Amirhossein | ETH Zürich |
| Katzschmann, Robert Kevin | ETH Zurich |
| Pavone, Marco | Stanford University |
Keywords: Modeling, Control, and Learning for Soft Robots, Dynamics, Optimization and Optimal Control
Abstract: Continuum robots exhibit high-dimensional, nonlinear dynamics which are often coupled with their actuation mechanism. Spectral submanifold (SSM) reduction has emerged as a leading method for reducing high-dimensional nonlinear dynamical systems to low-dimensional invariant manifolds. Our proposed control-augmented SSMs (caSSMs) extend this methodology by explicitly incorporating control inputs into the state representation, enabling these models to capture nonlinear state-input couplings. Training these models relies solely on controlled decay trajectories of the actuator-augmented state, thereby removing the additional actuation-calibration step commonly needed by prior SSM-for-control methods. We learn a compact caSSM model for a tendon-driven trunk robot, enabling real-time control and reducing open-loop prediction error by 40% compared to existing methods. In closed-loop experiments with model predictive control (MPC), caSSM reduces tracking error by 52%, demonstrating improved performance against Koopman- and SSM-based MPC and practical deployability on hardware continuum robots.
|
| |
| 09:00-10:30, Paper WeI1I.273 | Add to My Program |
| OctHilNet: Hilbert-Guided Hierarchical Geometry Codec for Octree-Structured LiDAR Point Clouds |
|
| Feng, Mingjian | Sun Yat-Sen University |
| Cui, Mingyue | Sun Yat-Sen University |
| Zhong, Yuyang | Sun Yat-Sen University |
| Shu, Chunjie | Sun Yat-Sen University |
| Liu, Han | Sun Yat-Sen University |
| Hu, Daosong | Sun Yat-Sen University |
| Huang, Kai | Sun Yat-Sen University |
Keywords: Intelligent Transportation Systems, Industrial Robots, Computer Vision for Transportation
Abstract: High-quality LiDAR point cloud (LPC) compression is essential for the storage and transmission of 3D data. The octree-structured entropy codec has emerged as the predominant method; however, previous methods do not fully utilize spatial contextual information, due to the loss of local features caused by uneven scanning density. To address this problem, we propose OctHilNet, a novel Hilbert-guided hierarchical framework for LPC compression that introduces the polarized octree for efficient node organization and the serialize-driven entropy model to strengthen the continuity of node contexts. Specifically, to counteract the inherent density imbalance, OctHilNet first transforms points into polar coordinates and applies a non-linear rebalancing to the radial distance. Then, we introduce the Hilbert space-filling curve to mitigate the impact of the decoupling between sequential adjacency and geometric proximity in octree node sequences. Finally, to better capture fine-grained spatial correlations, we propose LocAtten and NeighbConv modules in a hierarchical Transformer, which jointly strengthen local dependencies overlooked by standard self-attention. Compared to the previous state-of-the-art works, our method achieves 45.1%-50.1% and 51.9%-53.9% BD-Rate gains on the LPC benchmark SemanticKITTI and MPEG-specified Ford datasets, respectively. In particular, our OctHilNet allows for extension to downstream tasks (i.e., vehicle detection and semantic segmentation), further demonstrating the practicality of the method.
|
| |
| 09:00-10:30, Paper WeI1I.274 | Add to My Program |
| LRDDv3: High-Resolution Long-Range Drone Detection Dataset with Range Information and Thermal Data |
|
| Peterson, Knut | Drexel University |
| Mayers, Zaid | Drexel University |
| Yousuf, Md Azmain | Drexel University |
| Chowdhury, Priontu | Drexel University |
| Arezoomandan, Solmaz | Drexel University |
| Zaczepinski, Asher Julius | Drexel University |
| Maarefdoust, Reihaneh | University of Maine |
| Han, David | Drexel University |
Keywords: Data Sets for Robotic Vision, Data Sets for Robot Learning, Aerial Systems: Perception and Autonomy
Abstract: Unmanned Aerial Vehicles (UAVs) have quickly become common in various airspaces, representing a wide range of applications from recreational flying to commercial photography and package delivery. With the increasing prevalence of UAVs, it becomes critical that both manned and unmanned aircraft can detect UAVs and other flying objects from long range to effectively track movement and ensure safe operation in shared spaces. While several datasets have been introduced for drone detection, the need for expanded high-quality data persists, especially in the area of high-resolution long-range drone data. To address this, we introduce a high-resolution dataset of 102,532 long-range RGB images of drones, sampled at 5 FPS from 128 distinct video clips taken mid-flight during 17 different data collection days spread over 8 months to ensure a wide variety of lighting scenarios, flight locations, and background elements. The dataset includes comprehensive drone range information throughout, as well as 29,630 IR images, all paired with RGB counterparts from the base dataset. As one of the first drone detection datasets to leverage 4K image resolution and paired 640x512 IR images, our work represents a significant advancement toward enabling the detection of drones at long range. For access to the complete dataset, please visit https://research.coe.drexel.edu/ece/imaple/lrddv3
|
| |
| 09:00-10:30, Paper WeI1I.275 | Add to My Program |
| CRED: Counterfactual Reasoning and Environment Design for Active Preference Learning |
|
| Tung, Yi-Shiuan | University of Colorado Boulder |
| Kumar, Gyanig | University of Colorado, Boulder |
| Jiang, Wei | University of Colorado Boulder |
| Hayes, Bradley | University of Colorado Boulder |
| Roncone, Alessandro | University of Colorado Boulder |
Keywords: Human Factors and Human-in-the-Loop, Reinforcement Learning, Probabilistic Inference
Abstract: As a robot's operational environment and the tasks to perform within it grow in complexity, the explicit specification and balancing of optimization objectives to achieve a preferred behavior profile moves increasingly farther out of reach. These systems benefit strongly from being able to align their behavior to reflect human preferences and respond to corrections, but manually encoding this feedback is infeasible. Active preference learning (APL) learns human reward functions by presenting trajectories for ranking. However, existing methods sample from fixed trajectory sets or replay buffers that limit query diversity and often fail to identify informative comparisons. We propose CRED, a novel trajectory generation method for APL that improves reward inference by jointly optimizing environment design and trajectory selection to efficiently query and extract preferences from users. CRED "imagines" new scenarios through environment design and leverages counterfactual reasoning, sampling possible rewards from its current belief and asking "What if this were the true preference?", to generate trajectory pairs that expose differences between competing reward functions. Comprehensive experiments and a user study show that CRED significantly outperforms state-of-the-art methods in reward accuracy and sample efficiency and receives higher user ratings.
|
| |
| 09:00-10:30, Paper WeI1I.276 | Add to My Program |
| Communication-Efficient and Context-Adaptive Collaborative Perception |
|
| Lu, Wenyu | University of Science and Technology of China |
| Zhang, Hui | University of Science and Technology of China |
| Yang, Yuquan | University of Science and Technology of China |
| Zhang, ZiYin | University of Science and Technology of China |
| Xu, Xiaohua | University of Science and Technology of China |
Keywords: Computer Vision for Automation, Cooperating Robots, Sensor Fusion
Abstract: Collaborative perception is pivotal for the large-scale deployment of autonomous driving, yet it has long grappled with the trade-off between perception accuracy and bandwidth consumption. Existing methods fail to analyze the fine-grained characteristics of the Field of View (FoV), leading to inefficient bandwidth utilization. To address this, we propose a Context-adaptive Collaborative Perception framework, termed CaCP. This method optimizes bandwidth usage by employing distinct collaboration strategies for the FoV under varying contexts, thereby reducing communication overhead while maintaining perception accuracy. Additionally, CaCP introduces a novel spatial combination of intermediate and late fusion strategies, yielding a more flexible collaborative scheme. Extensive experiments across multiple datasets encompassing both simulated (OPV2V) and real-world (V2V4Real) scenarios demonstrate that CaCP establishes a new state-of-the-art trade-off between accuracy and bandwidth. Notably, it reduces bandwidth consumption by up to 17% compared to previous works while achieving competitive or superior perception performance.
|
| |
| 09:00-10:30, Paper WeI1I.277 | Add to My Program |
| Social-Qwen: From Individual Nonverbal Cues and Emotion to Multiparty Social Dynamics Understanding with Instruction Tuning |
|
| Nguyen, Tung | Honda Research Institute Japan, Co., Ltd |
| Chew, Jouh Yeong | Honda Research Institute Japan |
Keywords: Computer Vision for Automation, Big Data in Robotics and Automation, Transfer Learning
Abstract: Effective participation in multiparty scenarios requires robots to move beyond individual-level analysis toward understanding group-level social dynamics, which are inherently complex due to the interplay of nonverbal cues, internal states, and interaction context. Existing approaches often rely on end-to-end deterministic models, while recent state-of-the-art methods such as large Vision-Language Models (VLMs) address this issue to some extent but remain limited by their size and computational cost for real-time applications. Moreover, both approaches are constrained by the scarcity of multiparty interaction data and of annotations describing how individual nonverbal cues and emotional states contribute to social dynamics, that is, collective outcomes such as group engagement. We hypothesize that explicitly modeling individual-level states is essential for accurate group-level understanding. To this end, we present Social-Qwen, a two-stage framework that first analyzes each participant's nonverbal cues and emotions, then infers group-level engagement using instruction-tuned representations. To mitigate the lack of individual annotations in group datasets, we employ knowledge distillation to transfer supervision signals. Experiments on the OUC-CGE dataset show that Social-Qwen significantly outperforms prior end-to-end baselines and achieves state-of-the-art performance in group engagement analysis, demonstrating the promise of instruction tuning for scalable social intelligence in robots. We further evaluate robustness by testing generalization to (1) an in-house dataset spanning multiple social activities and (2) estimating other social dynamics such as group harmony. Results suggest consistent performance, highlighting Social-Qwen as a promising approach toward real-time social intelligence for intelligent agents.
|
| |
| 09:00-10:30, Paper WeI1I.278 | Add to My Program |
| TACOcc: Target-Adaptive Cross-Modal Fusion with Sequential Volume Rendering for 3D Semantic Occupancy Prediction |
|
| Lei, Luyao | Xi'an Jiaotong University |
| Xu, Shuo | Xi'an Jiaotong University |
| Bai, Yifan | Xi'an Jiaotong University |
| Yang, Zelin | Xi'an Jiaotong University |
| Guo, Yuanbo | Xi’an Jiaotong University |
| Wei, Xing | Xi'an Jiaotong University |
Keywords: Object Detection, Segmentation and Categorization, Semantic Scene Understanding, Deep Learning Methods
Abstract: Multi-modal 3D semantic occupancy prediction remains challenged by two fundamental issues: (i) geometric-semantic misalignment introduced by fixed-neighborhood fusion under heterogeneous sensing distributions, and (ii) feature degradation with prediction inconsistency in dynamic scenes caused by sparse supervision. We propose TACOcc, a framework coupling a target-adaptive, bidirectional symmetric fusion module with sequential volume rendering supervision. The fusion module predicts a query-wise neighborhood size via a differentiable Gumbel-Softmax strategy, expanding the receptive field for large objects to enrich context while contracting it for small objects to suppress noise, thereby achieving precise cross-modal alignment. To stabilize predictions under sparse labels and motion, we introduce temporally enhanced Gaussian rendering that aggregates multi-frame dependencies, initializes dual-source geometric anchors, and transfers multi-view photometric constraints from images to 3D occupancy features. A velocity-adaptive temporal bandwidth further mitigates flicker in fast-motion cases. Experiments on nuScenes and SemanticKITTI demonstrate strong performance, including 28.9% mIoU on nuScenes, particularly improving small-object categories and long-range regions. These results highlight that scale-aware bidirectional fusion and temporally grounded volumetric supervision form an effective recipe for robust multi-modal occupancy perception.
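The query-wise neighborhood selection can be sketched with the standard Gumbel-Softmax relaxation the abstract names; the candidate sizes and temperature below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def sample_neighborhood_size(logits, sizes=(1, 3, 5, 7), tau=1.0):
    # logits: (B, len(sizes)) per-query scores over candidate neighborhood sizes.
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)  # discrete forward pass,
    chosen = one_hot @ torch.tensor(sizes, dtype=one_hot.dtype)  # soft gradients
    return one_hot, chosen  # one_hot can weight features gathered per radius
```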
|
| |
| 09:00-10:30, Paper WeI1I.279 | Add to My Program |
| PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies |
|
| Zhang, Jesse | University of Washington |
| Memmel, Marius | University of Washington |
| Kim, Kevin | University of Southern California |
| Fox, Dieter | University of Washington |
| Thomason, Jesse | USC Viterbi School of Engineering |
| Ramos, Fabio | University of Sydney, NVIDIA |
| Bıyık, Erdem | University of Southern California |
| Gupta, Abhishek | University of Washington |
| Li, Anqi | NVIDIA |
Keywords: Imitation Learning, Transfer Learning, Learning from Demonstration
Abstract: Robotic manipulation policies often fail to generalize because they must simultaneously learn where to attend, what actions to take, and how to execute them. We argue that high-level reasoning about where and what can be offloaded to vision-language models (VLMs), leaving policies to specialize in how to act. We present PEEK (Point-based End-effector and Entity Keying), which fine-tunes VLMs to predict a unified point-based intermediate representation: (1) end-effector paths specifying what actions to take, and (2) task-relevant masks indicating where to focus. These annotations are directly overlaid onto robot observations, making the representation policy-agnostic and transferable across architectures. To enable scalable training, we introduce an automatic annotation pipeline, generating labeled data across 20+ robot datasets spanning 9 embodiments. In real-world evaluations, PEEK consistently boosts zero-shot generalization, including a 41.4× real-world improvement for a 3D policy trained only in simulation, and 2–3.5× gains for both large VLAs and small manipulation policies. By letting VLMs absorb semantic and visual complexity, PEEK equips manipulation policies with the minimal cues they need—where, what, and how.
|
| |
| 09:00-10:30, Paper WeI1I.280 | Add to My Program |
| T2M Mamba: Motion Periodicity-Saliency Coupling Approach for Stable Text-Driven Motion Generation |
|
| Zhan, Xingzu | Shenzhen University |
| Xie, Chen | Shenzhen University |
| Chen, Honghang | Shenzhen University |
| Lin, Yixun | Jinan University |
| Mai, Xiaochun | Shenzhen University |
Keywords: Human and Humanoid Motion Analysis and Synthesis, Modeling and Simulating Humans, Virtual Reality and Interfaces
Abstract: Text-to-motion generation, which converts motion language descriptions into coherent 3D human motion sequences, has attracted increasing attention in fields such as avatar animation and humanoid robotic interaction. Though existing models have achieved significant fidelity, they still suffer from two core limitations: (i) they treat motion periodicity and keyframe saliency as independent factors, overlooking their coupling and causing generation drift in long sequences; (ii) they are fragile to semantically equivalent paraphrases, where minor synonym substitutions distort textual embeddings, propagating through the decoder and producing unstable or erroneous motions. In this work, we propose T2M Mamba to address these limitations by (i) proposing Periodicity-Saliency Aware Mamba, which utilizes novel algorithms for keyframe weight estimation via enhanced Density Peaks Clustering and motion periodicity estimation via FFT-accelerated autocorrelation to capture coupled dynamics with minimal computational overhead, and (ii) constructing a Periodic Differential Cross-modal Alignment Module (PDCAM) to enhance robust alignment of textual and motion embeddings. Extensive experiments on the HumanML3D and KIT-ML datasets confirm the effectiveness of our approach, which achieves an FID of 0.068 and consistent gains on all other metrics.
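FFT-accelerated autocorrelation follows from the Wiener-Khinchin theorem: the autocorrelation is the inverse FFT of the power spectrum, costing O(T log T) instead of O(T^2). A minimal sketch of a periodicity estimate on a 1-D motion signal; the peak picking here is deliberately naive:

```python
import numpy as np

def dominant_period(signal):
    x = np.asarray(signal, dtype=float)
    x -= x.mean()
    n = len(x)
    spec = np.fft.rfft(x, n=2 * n)               # zero-pad to avoid wrap-around
    ac = np.fft.irfft(spec * np.conj(spec))[:n]  # autocorrelation via FFT
    ac /= ac[0]                                  # normalize by the variance
    period = int(np.argmax(ac[1:]) + 1)          # strongest nonzero lag
    return period, ac
```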
|
| |
| 09:00-10:30, Paper WeI1I.281 | Add to My Program |
| Traffic Scenario Orchestration from Language Via Constraint Satisfaction |
|
| Rong, Frieda | University of Toronto |
| Zhang, Chris | Waabi / University of Toronto |
| Wong, Kelvin | Waabi / University of Toronto |
| Urtasun, Raquel | Waabi / University of Toronto |
Keywords: Software Tools for Benchmarking and Reproducibility, Simulation and Animation, Formal Methods in Robotics and Automation
Abstract: Autonomous vehicles (AVs) require extensive testing in simulation, but test case generation for driving scenarios is laborious. The desired scenarios are often out-of-distribution and have precise requirements on interactions with the AV policy under test. Manually programming scenarios allows for precise controllability but is difficult to scale. On the other hand, statistical models can leverage compute and data, but struggle with precise controllability when out-of-distribution. We cast scenario orchestration as a constraint-solving problem and present a language-in, simulation-out scenario orchestrator for closed-loop testing of AVs. Our approach leverages foundation model reasoning to translate general, natural language descriptions into a set of constraints as a scenario representation. This then allows us to leverage off-the-shelf solvers to solve for actor behaviors that meet precise testing intentions in closed loop. On a benchmark of carefully crafted and diverse scenario descriptions, our approach greatly outperforms our baselines in orchestration success rate. We further show that our closed-loop approach is especially important for scenarios that require ego-reactive specifications.
|
| |
| 09:00-10:30, Paper WeI1I.282 | Add to My Program |
| FlightDiffusion: Revolutionizing Autonomous Drone Training with Diffusion Model Generating FPV Video |
|
| Serpiva, Valerii | Skolkovo Institute of Science and Technology |
| Lykov, Artem | Skolkovo Institute of Science and Technology |
| Batool, Faryal | Skoltech |
| Kozlovskiy, Vladislav | Skoltech |
| Altamirano Cabrera, Miguel | Skolkovo Institute of Science and Technology (Skoltech), Moscow, Russia |
| Tsetserukou, Dzmitry | Skolkovo Institute of Science and Technology |
Keywords: Cognitive Control Architectures, Visual Tracking, Aerial Systems: Perception and Autonomy
Abstract: We present FlightDiffusion, a diffusion-based framework for training autonomous drones from first-person-view (FPV) video. The model generates FPV video sequences from a single frame and a text prompt, and derives corresponding state-action trajectories for task-conditioned navigation. FlightDiffusion leverages generative modeling to synthesize diverse FPV trajectories and corresponding state-action pairs, enabling scalable dataset generation without the high cost of real-world data collection. These datasets support not only the learning pipeline but also the training of autonomous navigation systems. Our evaluation shows that the generated trajectories are physically feasible and executable, with a mean positional error of 0.25 m (RMSE 0.28 m) and a mean orientation error of 0.19 rad (RMSE 0.24 rad). This approach enables scalable dataset generation and supports reliable navigation performance. Results in simulated environments indicate stable trajectory planning and consistent behavior across varying conditions. An ANOVA revealed no statistically significant difference between performance in simulation and reality (F(1, 16) = 0.394, p = 0.541), with success rates of M = 0.628 (SD = 0.162) and M = 0.617 (SD = 0.177), respectively, indicating effective sim-to-real transfer. The generated datasets provide a useful resource for future UAV research. This work introduces diffusion-based video generation as a promising mechanism for coupling task-level reasoning with executable trajectory synthesis in aerial robotics.
|
| |
| 09:00-10:30, Paper WeI1I.283 | Add to My Program |
| Contact-Safe Reinforcement Learning with ProMP Reparameterization and Energy Awareness |
|
| Huang, Bingkun | Technical University of Munich |
| Gong, Yuhe | University of Nottingham |
| Yang, Zewen | Technical University of Munich |
| Ren, Tianyu | Technische Universität Darmstadt |
| Figueredo, Luis | University of Nottingham (UoN) |
Keywords: Machine Learning for Robot Control, Robot Safety, Robust/Adaptive Control
Abstract: Reinforcement learning (RL) approaches based on Markov Decision Processes (MDPs) are predominantly applied in the robot joint space, often relying on limited task-specific information and partial awareness of the 3D environment. In contrast, episodic RL has demonstrated advantages over traditional MDP-based methods in terms of trajectory consistency, task awareness, and overall performance in complex robotic tasks. Moreover, traditional step-wise and episodic RL methods often neglect the contact-rich information inherent in task-space manipulation, especially with regard to contact safety and robustness. In this work, contact-rich manipulation tasks are tackled using a task-space, energy-safe framework, where reliable and safe task-space trajectories are generated through the combination of Proximal Policy Optimization (PPO) and movement primitives. Furthermore, an energy-aware Cartesian Impedance Controller objective is incorporated within the proposed framework to ensure safe interactions between the robot and the environment. Our experimental results demonstrate that the proposed framework outperforms existing methods in handling tasks on various types of surfaces in 3D environments, achieving high success rates as well as smooth trajectories and energy-safe interactions.
|
| |
| 09:00-10:30, Paper WeI1I.284 | Add to My Program |
| Unveiling the Surprising Efficacy of Navigation Understanding in End-To-End Autonomous Driving |
|
| Hua, Zhihua | Fudan University |
| Wang, Junli | Institute of Automation, Chinese Academy of Sciences |
| Li, Pengfei | Institute for AI Industry Research (AIR), Tsinghua University |
| Jin, Qihao | Fudan University |
| Zhang, Bo | DiDi Inc |
| Sheng, Kehua | DiDi Inc |
| Chen, Yilun | Tsinghua University |
| Gan, Zhongxue | Fudan University |
| Ding, Wenchao | Fudan University |
Keywords: Autonomous Vehicle Navigation, Motion and Path Planning, Intelligent Transportation Systems
Abstract: Global navigation information and local scene understanding are two crucial components of autonomous driving systems. However, our experimental results indicate that many end-to-end autonomous driving systems tend to over-rely on local scene understanding while failing to utilize global navigation information. These systems exhibit weak correlation between their planning capabilities and navigation input, and struggle to perform navigation-following in complex scenarios. To overcome this limitation, we propose the Sequential Navigation Guidance (SNG) framework, an efficient representation of global navigation information based on real-world navigation patterns. The SNG encompasses both navigation paths for constraining long-term trajectories and turn-by-turn (TBT) information for real-time decision-making logic. We constructed the SNG-QA dataset, a visual question answering (VQA) dataset based on SNG that aligns global and local planning. Additionally, we introduce an efficient model SNG-VLA that fuses local planning with global planning. The SNG-VLA achieves state-of-the-art performance through precise navigation information modeling without requiring auxiliary loss functions from perception tasks.
|
| |
| 09:00-10:30, Paper WeI1I.285 | Add to My Program |
| Mixed Reality-Based, Immersive, Semi-Autonomous Robotic Telemanipulation for the Execution of Peg-In-Hole Tasks |
|
| Duan, Shifei | University of Auckland |
| De Pace, Francesco | Competence Industry Manufacturing 4.0 (CIM4.0) |
| Wang, Zhe | University of Auckland |
| Liarokapis, Minas | National Technical University of Athens |
Keywords: Telerobotics and Teleoperation, Virtual Reality and Interfaces, Human-Robot Collaboration
Abstract: Semi-autonomy in telemanipulation frameworks has the potential to reduce user cognitive load while preserving human perceptual oversight and decision-making capabilities. However, existing semi-autonomous telemanipulation systems are heavily dependent on calibration and hardware configurations, making rapid deployment difficult. Moreover, existing VR-based telemanipulation systems lack intuitive interaction mechanisms, requiring users to manage complex control interfaces. To address these limitations, we introduce an intuitive and immersive semi-autonomous robotic telemanipulation system that leverages a mixed reality (MR) headset with minimal hardware requirements. Requiring only CPU processing and coarse calibration procedures, the system combines human perception with autonomous control strategies through natural hand tracking and finger gestures to achieve precise, reliable task execution. To validate this approach, we conducted thorough evaluations involving complex peg-in-hole tasks and compared performance with and without the proposed control strategy. The results highlight that our system demonstrates robust performance, and the proposed control strategy further enhances its stability and effectiveness.
|
| |
| 09:00-10:30, Paper WeI1I.286 | Add to My Program |
| Beyond the Majority: Long-Tail Imitation Learning for Robotic Manipulation |
|
| Zhu, Junhong | University of Electronic Science and Technology of China |
| Zhang, Ji | Southwest Jiaotong University |
| Song, Jingkuan | Tongji University |
| Gao, Lianli | University of Electronic Science and Technology of China |
| Shen, Heng Tao | Tongji University |
Keywords: Imitation Learning, Learning from Demonstration
Abstract: While generalist robot policies hold significant promise for learning diverse manipulation skills through imitation, their performance is often hindered by the long-tail distribution of training demonstrations. Policies learned on such data, which is heavily skewed towards a few data-rich head tasks, frequently exhibit poor generalization when confronted with the vast number of data-scarce tail tasks. In this work, we conduct a comprehensive analysis of the pervasive long-tail challenge inherent in policy learning. Our analysis begins by demonstrating the inefficacy of conventional long-tail learning strategies (e.g., re-sampling) for improving the policy's performance on tail tasks. We then uncover the underlying mechanism for this failure, revealing that data scarcity on tail tasks directly impairs the policy's spatial reasoning capability. To overcome this, we introduce Approaching-Phase Augmentation (APA), a simple yet effective scheme that transfers knowledge from data-rich head tasks to data-scarce tail tasks without requiring external demonstrations. Extensive experiments in both simulation and real-world manipulation tasks demonstrate the effectiveness of APA. Our code and demos are publicly available at: https://mldxy.github.io/Project-VLA-long-tail/.
|
| |
| 09:00-10:30, Paper WeI1I.287 | Add to My Program |
| INSIGHT: INference-Time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models |
|
| Karli, Ulas Berk | Yale University |
| Shangguan, Ziyao | Yale University |
| Fitzgerald, Tesca | Yale University |
Keywords: Learning from Demonstration, Continual Learning, Sensorimotor Learning
Abstract: Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. We present INSIGHT, a learning framework for leveraging token-level uncertainty signals to predict when a VLA should request help. Using Pi0-FAST as the underlying model, we extract per-token entropy, log-probability, and Dirichlet-based estimates of aleatoric and epistemic uncertainty, and train compact transformer classifiers to map these sequences to help triggers. We explore both strong and weak supervision regimes and extensively compare them across in-distribution and out-of-distribution tasks. Our results show a trade-off: strong labels enable models to capture fine-grained uncertainty dynamics for reliable help detection, while weak labels, though noisier, still support competitive introspection when training and evaluation are aligned, offering a scalable path when dense annotation is impractical. Crucially, we find that modeling the temporal evolution of token-level uncertainty signals with transformers provides far greater predictive power than static sequence-level scores. This study provides the first systematic evaluation of uncertainty-based introspection in VLAs, opening future avenues for active learning and for real-time error mitigation through selective human intervention.
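The token-level features feeding such a classifier can be sketched directly from the policy's logits. A minimal version computing entropy and chosen-token log-probability; the Dirichlet-based aleatoric/epistemic estimates the abstract also uses are omitted:

```python
import torch
import torch.nn.functional as F

def token_uncertainty_features(logits, token_ids):
    # logits: (T, V) over the action-token vocabulary; token_ids: (T,) long.
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(-1)                     # (T,)
    chosen_logp = logp.gather(1, token_ids[:, None]).squeeze(1)
    return torch.stack([entropy, chosen_logp], dim=-1)         # (T, 2) sequence
```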
|
| |
| 09:00-10:30, Paper WeI1I.288 | Add to My Program |
| Beyond Waypoints: Semantic-Centric Autonomy with Unreliable Maps through Learned Abstractions |
|
| Saravanan, Akila | Massachusetts Institute of Technology |
| Zhang, Songyuan | Massachusetts Institute of Technology |
| Manderson, Travis | Massachusetts Institute of Technology |
| Roy, Nicholas | Massachusetts Institute of Technology |
| Fan, Chuchu | Massachusetts Institute of Technology |
Keywords: Reactive and Sensor-Based Planning
Abstract: Autonomous navigation that relies on precise metric maps is inherently fragile to environmental changes and mapping inaccuracies. These discrepancies often lead to failures in localization and path planning, as the robot's internal representation of the world no longer matches reality. We propose an alternative navigation approach that instead focuses on how a robot interacts with its surroundings rather than its precise metric position. Our core contribution is a learned behavioral vocabulary conditioned on raw sensor data that can be used to compose plans for navigation. Our system transforms LiDAR data into low-dimensional learned embeddings which are clustered to create a set of abstract, human-interpretable behaviors (e.g., along wall, exiting intersection, on bridge). This representation allows the robot to control its behavior with respect to the embedding rather than controlling its state with respect to a specific metric cost function or waypoint, thereby minimizing the impact of map and position inaccuracies. We define the mission as a topological sequence of behavioral clusters on the overhead map, enabling high-level navigation. This approach provides a robust way to decompose the environment into recognizable and actionable states that can reliably compose a plan, even on stale maps with environmental deformations and world changes. Our method achieves higher navigation success under intentional map distortions, with average mission success rates 53 and 55 percentage points higher for short and long term plans respectively when compared to baselines which rely on accurate metric maps.
|
| |
| 09:00-10:30, Paper WeI1I.289 | Add to My Program |
| SURE: Semi-Dense Uncertainty-REfined Feature Matching |
|
| Li, Sicheng | Nanyang Technological University |
| Gu, Zaiwang | Institute for Infocomm Research, A*STAR |
| Zhang, Jie | Agency for Science, Technology and Research (A*STAR) |
| Guo, Qing | Agency for Science, Technology and Research (A*STAR) |
| Jiang, Xudong | Nanyang Technological University |
| Cheng, Jun | Institute for Infocomm Research, A*STAR |
Keywords: Deep Learning for Visual Perception, Visual Learning, Localization
Abstract: Establishing reliable image correspondences is essential for many robotic vision problems. However, existing methods often struggle in challenging scenarios with large viewpoint changes or textureless regions, where incorrect correspondences may still receive high similarity scores. This is mainly because conventional models rely solely on feature similarity, lacking an explicit mechanism to estimate the reliability of predicted matches, leading to overconfident errors. To address this issue, we propose SURE, a Semi-dense Uncertainty-REfined matching framework that jointly predicts correspondences and their confidence by modeling both aleatoric and epistemic uncertainties. Our approach introduces a novel evidential head for trustworthy coordinate regression, along with a lightweight spatial fusion module that enhances local feature precision with minimal overhead. We evaluated our method on multiple standard benchmarks, where it consistently outperforms existing state-of-the-art semi-dense matching models in both accuracy and efficiency. The code will be released upon publication to facilitate future research.
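One common way to realize an evidential head for coordinate regression is a Normal-Inverse-Gamma parameterization in the style of deep evidential regression; the sketch below is that generic construction, not necessarily the authors' exact head:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidentialHead(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, 4)  # mu, v, alpha, beta per coordinate

    def forward(self, feat):
        mu, v, alpha, beta = self.fc(feat).unbind(-1)
        v = F.softplus(v)                       # evidence for the mean
        alpha = 1.0 + F.softplus(alpha)         # alpha > 1 keeps moments finite
        beta = F.softplus(beta)
        aleatoric = beta / (alpha - 1.0)        # expected observation noise
        epistemic = beta / (v * (alpha - 1.0))  # uncertainty about mu itself
        return mu, aleatoric, epistemic
```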
|
| |
| 09:00-10:30, Paper WeI1I.290 | Add to My Program |
| RAG-RUSS: A Retrieval-Augmented Robotic Ultrasound for Autonomous Carotid Examination |
|
| Huang, Dianye | Technical University of Munich |
| Cong, Ziping | Technical University of Munich |
| Navab, Nassir | TU Munich |
| Jiang, Zhongliang | The University of Hong Kong |
Keywords: Medical Robots and Systems, Computer Vision for Medical Robotics
Abstract: Robotic ultrasound (US) has recently attracted considerable attention as a means to overcome the limitations of conventional US examinations, such as the strong operator dependence. However, the decision-making process of existing methods is often either rule-based or relies on end-to-end learning models that operate as black boxes. This is seen as a major barrier to clinical acceptance and raises safety concerns for widespread adoption in routine practice. To tackle this problem, we introduce RAG-RUSS, an interpretable framework capable of performing a full carotid examination in accordance with the clinical workflow while explicitly explaining both the current stage and the next planned action. Furthermore, given the scarcity of medical data, we incorporate retrieval-augmented generation to enhance generalization and reduce dependence on large-scale training datasets. The method was trained on data acquired from 28 volunteers, while an additional four volumetric scans recorded from previously unseen volunteers were reserved for testing. The results demonstrate that the method can identify the current scanning stage and autonomously plan probe motions to complete the carotid examination, encompassing both transverse and longitudinal planes.
|
| |
| 09:00-10:30, Paper WeI1I.291 | Add to My Program |
| Self-Supervised Street Gaussians for Autonomous Driving |
|
| Huang, Nan | Peking University |
| Wei, Xiaobao | Peking University |
| Zheng, Wenzhao | Tsinghua University |
| An, Pengju | Peking University |
| Lu, Ming | Peking University |
| Zhan, Wei | University of California, Berkeley |
| Tomizuka, Masayoshi | University of California |
| Keutzer, Kurt | University of California, Berkeley |
| Zhang, Shanghang | Peking University |
Keywords: Computer Vision for Transportation, Intelligent Transportation Systems
Abstract: Photorealistic 3D reconstruction of street scenes is a critical technique for developing real-world simulators for autonomous driving. Despite the efficacy of Neural Radiance Fields (NeRF) for driving scenes, 3D Gaussian Splatting (3DGS) has emerged as a promising direction due to its faster speed and more explicit representation. However, most existing street 3DGS methods require tracked 3D vehicle bounding boxes to decompose the static and dynamic elements for effective reconstruction, limiting their applications in in-the-wild scenarios. To facilitate efficient 3D scene reconstruction without costly annotations, we propose S3Gaussian, a self-supervised street Gaussian method that decomposes dynamic and static elements via 4D consistency. We represent each scene with 3D Gaussians to preserve the explicitness and further accompany them with a spatial-temporal field network to compactly model the 4D dynamics. We conduct extensive experiments on the challenging Waymo-Open dataset to evaluate the effectiveness of our method. Our S3Gaussian demonstrates the ability to decompose static and dynamic scenes and achieves the best performance without using 3D annotations.
|
| |
| 09:00-10:30, Paper WeI1I.292 | Add to My Program |
| BiGraspFormer: End-To-End Bimanual Grasp Transformer |
|
| Kim, Kangmin | Gwangju Institute of Science and Technology |
| Back, Seunghyeok | Korea Institute of Machinery & Materials |
| Lee, Geonhyup | Gwangju Institute of Science and Technology |
| Lee, Sangbeom | Gwangju Institute of Science and Technology (GIST) |
| Noh, Sangjun | Gwangju Institute of Science and Technology (GIST) |
| Lee, Kyoobin | Gwangju Institute of Science and Technology |
Keywords: Bimanual Manipulation, Grasping, Perception for Grasping and Manipulation
Abstract: Bimanual grasping is essential for robots to handle large and complex objects. However, existing methods either focus solely on single-arm grasping or employ separate grasp generation and bimanual evaluation stages, leading to coordination problems including collision risks and unbalanced force distribution. To address these limitations, we propose BiGraspFormer, a unified end-to-end transformer framework that directly generates coordinated bimanual grasps from object point clouds. Our key idea is the Single-Guided Bimanual (SGB) strategy, which first generates diverse single grasp candidates using a transformer decoder, then leverages their learned features through specialized attention mechanisms to jointly predict bimanual poses and quality scores. This conditioning strategy reduces the complexity of the 12-DoF search space while ensuring coordinated bimanual manipulation. Comprehensive simulation experiments and real-world validation demonstrate that BiGraspFormer consistently outperforms existing methods while maintaining efficient inference speed (<0.05s), confirming the effectiveness of our framework. Code and supplementary materials are available at https://sites.google.com/bigraspformer
|
| |
| 09:00-10:30, Paper WeI1I.293 | Add to My Program |
| Language Conditioning Improves Accuracy of Aircraft Goal Prediction in Non-Towered Airspace |
|
| Vinodh Sangeetha, Sundhar | Georgia Institute of Technology |
| Chiu, Chih-Yuan | Georgia Institute of Technology |
| Li, Sarah H.Q. | Georgia Institute of Technology |
| Kousik, Shreyas | Georgia Institute of Technology |
Keywords: Intention Recognition, Aerial Systems: Perception and Autonomy, Aerial Systems: Applications
Abstract: Autonomous aircraft must safely operate in non-towered airspace, where coordination relies on voice-based communication among human pilots. Safe operation requires an aircraft to predict the intent, and corresponding goal location, of other aircraft. This paper introduces a multimodal framework for aircraft goal prediction that integrates natural language understanding with spatial reasoning to improve autonomous decision-making in such environments. We leverage automatic speech recognition and large language models to transcribe and interpret pilot radio calls, identify aircraft, and extract discrete intent labels. These intent labels are fused with observed trajectories to condition a temporal convolutional network and Gaussian mixture model for probabilistic goal prediction. Our method significantly reduces goal prediction error compared to baselines that rely solely on motion history, demonstrating that language-conditioned prediction increases prediction accuracy. Experiments on a real-world dataset from a non-towered airport validate the approach and highlight its potential to enable socially aware, language-conditioned robotic motion planning.
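Conditioning the mixture on the parsed intent can be sketched as selecting a per-intent GMM and scoring candidate goals under it; the `gmm_by_intent` structure is an assumption for illustration, not the paper's interface:

```python
import numpy as np
from scipy.stats import multivariate_normal

def goal_probabilities(candidate_goals, intent, gmm_by_intent):
    # candidate_goals: (G, d) goal locations; intent: discrete label
    # extracted from the transcribed radio call.
    weights, means, covs = gmm_by_intent[intent]
    p = np.zeros(len(candidate_goals))
    for w, m, c in zip(weights, means, covs):
        p += w * multivariate_normal.pdf(candidate_goals, mean=m, cov=c)
    return p / p.sum()  # normalized over the candidate set
```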
|
| |
| 09:00-10:30, Paper WeI1I.294 | Add to My Program |
| Moving On, Even When You’re Broken: Fail-Active Trajectory Generation Via Diffusion Policies Conditioned on Embodiment and Task |
|
| Briscoe-Martinez, Gilberto | University of Colorado Boulder |
| Gautam, Yaashia | University of Colorado Boulder |
| Shetty, Rahul | University of Colorado Boulder |
| Pasricha, Anuj | University of Colorado Boulder |
| Nicotra, Marco | University of Colorado Boulder |
| Roncone, Alessandro | University of Colorado Boulder |
Keywords: Failure Detection and Recovery, Deep Learning in Grasping and Manipulation, Redundant Robots
Abstract: Robot failure is detrimental and disruptive, often requiring human intervention to recover. Our vision is 'fail-active' operation, allowing robots to safely complete their tasks even when damaged. Focusing on 'actuation failures', we introduce DEFT, a diffusion-based trajectory generator conditioned on the robot’s current embodiment and task constraints. DEFT generalizes across failure types, supports constrained and unconstrained motions, and enables task completion under arbitrary failure. We evaluate DEFT in both simulation and real-world scenarios using a 7-DoF robotic arm. DEFT outperforms its baselines over thousands of failure conditions, achieving a 99.5% success rate for unconstrained motions versus RRT's 42.4%, and 46.4% for constrained motions versus differential IK's 30.9%. Furthermore, DEFT demonstrates robust zero-shot generalization by maintaining performance on failure conditions unseen during training. Finally, we perform real-world evaluations on two multi-step tasks, drawer manipulation and whiteboard erasing. These experiments demonstrate DEFT succeeding on tasks where classical methods fail. Our results show that DEFT achieves fail-active manipulation across arbitrary failure configurations and real-world deployments.
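One way to picture the embodiment conditioning is as a vector telling the policy which joints still work and where the failed ones are locked; the encoding below is an assumed example for a 7-DoF arm, not DEFT's actual interface.

```python
# Assumed embodiment encoding: joint health flags plus lock angles.
import numpy as np

def embodiment_condition(functional, q_locked):
    """functional: (7,) 1 = joint works, 0 = failed;
    q_locked: (7,) angles at which failed joints are stuck (radians).
    Healthy joints contribute 0 to the lock-angle half of the vector."""
    return np.concatenate([functional, q_locked * (1.0 - functional)])

cond = embodiment_condition(np.array([1, 1, 0, 1, 1, 1, 1.0]),
                            np.array([0, 0, 0.7, 0, 0, 0, 0.0]))
# `cond` would be concatenated to the diffusion model's conditioning input.
```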
|
| |
| 09:00-10:30, Paper WeI1I.295 | Add to My Program |
| SurgSync: Time-Synchronized Multi-Modal Data Collection Framework and Dataset for Surgical Robotics |
|
| Zhou, Haoying | Worcester Polytechnic Institute |
| Liu, Chang | Johns Hopkins University |
| Wu, Yimeng | Johns Hopkins University |
| Wu, Junlin | Johns Hopkins University |
| Wu, Zijian | The University of British Columbia |
| Lee, Yu Chung | The University of British Columbia |
| Martuscelli, Sara | Politecnico Di Milano |
| Salcudean, Septimiu E. | University of British Columbia |
| Fischer, Gregory Scott | Worcester Polytechnic Institute |
| Kazanzides, Peter | Johns Hopkins University |
Keywords: Medical Robots and Systems, Surgical Robotics: Laparoscopy, Data Sets for Robot Learning
Abstract: Most existing robotic surgery systems adopt a human-in-the-loop paradigm, often with the surgeon directly teleoperating the robotic system. Adding intelligence to these robots would enable higher-level control, such as supervised autonomy or even full autonomy. However, artificial intelligence (AI) requires large amounts of training data, which is currently lacking. This work proposes SurgSync, a multi-modal data collection framework with offline and online synchronization to support training and real-time inference, respectively. The framework is implemented on a da Vinci Research Kit (dVRK) and introduces (1) dual-mode (online/offline-matching) synchronized recorders, (2) a modern stereo endoscope to achieve image quality on par with clinical systems, and (3) additional sensors such as a side-view camera and a novel capacitive contact sensor to provide ground truth contact data. The framework also incorporates a post-processing toolbox for tasks such as depth estimation, optical flow, and a practical kinematic reprojection method using Gaussian heatmaps. User studies with participants of varying skill levels are performed with ex-vivo tissue to provide clinically realistic data, and a network for surgical skill assessment is employed to demonstrate the utility of the collected data. Through the user study experiments, we obtained a dataset of 214 validated instances across multiple canonical training tasks. All software and data are available at surgsync.github.io.
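The kinematic reprojection supervision mentioned above can be pictured as rendering a small Gaussian around each reprojected instrument keypoint; a minimal stand-alone version (not the SurgSync toolbox itself, with an illustrative sigma) follows.

```python
# Minimal Gaussian heatmap around a reprojected keypoint (illustrative only).
import numpy as np

def gaussian_heatmap(h, w, center_xy, sigma=4.0):
    """Return an (h, w) map peaked at the projected keypoint center_xy (px)."""
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

hm = gaussian_heatmap(480, 640, center_xy=(320.5, 200.0))
assert hm.shape == (480, 640) and hm.max() <= 1.0
```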
|
| |
| 09:00-10:30, Paper WeI1I.296 | Add to My Program |
| Goal-VLA: Image-Generative VLMs As Object-Centric World Models Empowering Zero-Shot Robot Manipulation |
|
| Chen, Haonan | National University of Singapore |
| Guo, Jingxiang | National University of Singapore |
| Wang, Bangjun | The University of Hong Kong |
| Zhang, Tianrui | Tsinghua University |
| Huang, Xuchuan | Peking University |
| Hou, Yiwen | National University of Singapore |
| Zheng, Boren | Tsinghua University |
| Tie, Chenrui | National University of Singapore |
| Deng, Jiajun | University of Adelaide |
| Shao, Lin | National University of Singapore |
Keywords: Manipulation Planning, AI-Enabled Robotics, Failure Detection and Recovery
Abstract: Generalization remains a fundamental challenge in robotic manipulation. To tackle this challenge, recent Vision-Language-Action (VLA) models build policies on top of Vision-Language Models (VLMs), seeking to transfer their open-world semantic knowledge. However, their zero-shot capability lags significantly behind the base VLMs, as the instruction-vision-action data is too limited to cover diverse scenarios, tasks, and robot embodiments. In this work, we present Goal-VLA, a zero-shot framework that leverages Image-Generative VLMs as world models to generate desired goal states, from which the target object pose is derived to enable generalizable manipulation. The key insight is that object state representation is the golden interface, naturally separating a manipulation system into high-level and low-level policies. This representation abstracts away explicit action annotations, allowing the use of highly generalizable VLMs while simultaneously providing spatial cues for training-free low-level control. To further improve robustness, we introduce a Reflection-through-Synthesis process that iteratively validates and refines the generated goal image before execution. Both simulated and real-world experiments demonstrate that Goal-VLA achieves strong performance and promising generalizability in manipulation tasks. Supplementary materials are available at https://nus-lins-lab.github.io/goalvlaweb/.
|
| |
| 09:00-10:30, Paper WeI1I.297 | Add to My Program |
| CRASH: Context-Aware Recognition of Agents for Simulation of High-Risk Driving |
|
| Cho, Minhee | Ewha Womans University |
| Jo, Hayeon | Ewha Womans University |
| Min, Dongbo | Ewha Womans University |
Keywords: Motion and Path Planning, Task and Motion Planning, Collision Avoidance
Abstract: Evaluating the safety of autonomous vehicles requires simulation of safety-critical scenarios such as potential collisions, which are difficult to reproduce in real-world environments. Prior methods rely on future trajectory predictions and heuristically select adversarial agents based on spatial proximity to the ego vehicle, often producing unrealistic scenarios that misalign with real-world temporal dynamics and contextual risk. To address these issues, we propose CRASH, the first learning-based adversarial agent selection approach that operates solely on past and present observations. It comprises two key components: (1) a Motion-Aware Masking (MAM) module that filters out static agents unlikely to collide with the ego vehicle due to negligible movement, and (2) an Adversarial agent Selection Module (ASM) that models contextual interactions to probabilistically estimate each agent’s likelihood of inducing a collision with the ego vehicle. Experiments on the nuScenes and Waymo datasets demonstrate that CRASH significantly improves the success rate of generating realistic collision scenarios under both replay and rule-based planners, validating the effectiveness of context-aware agent modeling without access to future information.
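The Motion-Aware Masking step lends itself to a very small sketch: discard agents whose recent displacement is negligible before adversarial selection. The threshold and array layout below are assumptions, not the paper's values.

```python
# Illustrative motion-aware mask over tracked agents.
import numpy as np

def motion_aware_mask(past_xy, min_disp=0.5):
    """past_xy: (N, T, 2) past positions of N agents over T steps.
    Returns a boolean mask keeping only agents that actually moved."""
    disp = np.linalg.norm(past_xy[:, -1] - past_xy[:, 0], axis=-1)
    return disp >= min_disp   # near-static agents are unlikely colliders

tracks = np.random.randn(8, 10, 2).cumsum(axis=1) * 0.1
print(motion_aware_mask(tracks))
```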
|
| |
| 09:00-10:30, Paper WeI1I.298 | Add to My Program |
| LACY: A Vision-Language Model-Based Language-Action Cycle for Self-Improving Robotic Manipulation |
|
| Hong, Youngjin | University of Minnesota |
| Yu, Houjian | University of Minnesota, Twin Cities |
| Li, Mingen | University of Minnesota Twin Cities |
| Choi, Changhyun | University of Minnesota, Twin Cities |
Keywords: Deep Learning in Grasping and Manipulation, Grasping
Abstract: Learning generalizable policies for robotic manipulation increasingly relies on large-scale models that excel at mapping language instructions to actions (L2A). However, this unidirectional training paradigm often produces policies that can execute tasks without deeper contextual understanding, thereby limiting their ability to generalize and to explain their behavior. We argue that the complementary skill of mapping actions back to language (A2L) is essential for developing more holistic and robust grounding. An agent capable of both acting and explaining its actions can form richer internal representations and, critically, unlock new paradigms for self-supervised learning. In this paper, we introduce LACY (Language-Action CYcle), a unified framework that learns such bidirectional mappings within a single vision-language model. LACY is jointly trained on three synergistic tasks: generating parameterized actions from language (L2A), explaining observed actions in language (A2L), and verifying semantic consistency between language pairs (L2C). The framework enables a self-improving cycle that autonomously generates new training data by chaining the L2A and A2L modules in an L2A2L pipeline. The L2C module then filters this data using an active data augmentation strategy that selectively targets low-confidence cases, thereby improving the model efficiently without requiring additional human annotations. Extensive experiments on pick-and-place tasks in both simulation and the real world demonstrate that LACY substantially improves task success rates by over 56.46% compared to baseline methods and yields more robust language-action grounding for robotic manipulation.
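Read one way, the L2A2L pipeline with L2C filtering amounts to the loop below, where `l2a`, `a2l`, and `l2c` stand in for the three heads of the model; the interface and threshold are hypothetical.

```python
# Hypothetical self-improvement loop chaining the three LACY-style heads.
def self_improve(instructions, l2a, a2l, l2c, tau=0.8):
    new_data = []
    for text in instructions:
        action = l2a(text)            # language -> parameterized action
        text_back = a2l(action)       # action -> language explanation
        conf = l2c(text, text_back)   # semantic consistency score in [0, 1]
        if conf >= tau:               # keep consistent pairs as new labels
            new_data.append((text, action))
    return new_data                   # retraining data, no human annotation
```

The abstract's active-augmentation strategy additionally targets the low-confidence cases specifically; the plain threshold here only shows the filtering half of the cycle.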
|
| |
| 09:00-10:30, Paper WeI1I.299 | Add to My Program |
| Efficient Construction of Implicit Surface Models from a Single Image for Motion Generation |
|
| Chu, Wei-Teng | Stanford University |
| Zhang, Tianyi | Aurora Innovation |
| Johnson-Roberson, Matthew | Vanderbilt University |
| Zhi, Weiming | The University of Sydney; Vanderbilt University |
Keywords: Deep Learning for Visual Perception
Abstract: Implicit representations have been widely applied in robotics for obstacle avoidance and path planning. In this paper, we explore the problem of constructing an implicit distance representation from a single image. Past methods for implicit surface reconstruction, such as NeuS and its variants, generally require a large set of multi-view images as input and long training times. In this work, we propose Fast Image-to-Neural Surface (FINS), a lightweight framework that can reconstruct high-fidelity surfaces and SDF fields based on a single or a small set of images. FINS integrates a multi-resolution hash grid encoder with lightweight geometry and color heads, making the training via an approximate second-order optimizer highly efficient and capable of converging within a few seconds. Additionally, we construct a neural surface from only a single RGB image by leveraging pre-trained foundation models to estimate the geometry inherent in the image. Our experiments demonstrate that under the same conditions, our method outperforms state-of-the-art baselines in both convergence speed and accuracy on surface reconstruction and SDF field estimation. Moreover, we demonstrate the applicability of FINS for robot surface following tasks and show its scalability to a variety of benchmark datasets.
|
| |
| 09:00-10:30, Paper WeI1I.300 | Add to My Program |
| MADR: MPC-Guided Adversarial Deepreach |
|
| Teoh, Ryan | UCLA |
| Tonkens, Sander | University of California - San Diego |
| Sharpless, William | University of California, San Diego |
| Yang, Aijia | University of California, San Diego |
| Feng, Zeyuan | Stanford University |
| Bansal, Somil | Stanford University |
| Herbert, Sylvia | UC San Diego (UCSD) |
Keywords: Robot Safety, Machine Learning for Robot Control
Abstract: Hamilton-Jacobi Reachability offers a framework for generating safe value functions and policies in the face of adversarial disturbance, but is limited by the curse of dimensionality. Physics-informed deep learning is able to overcome this infeasibility, but itself suffers from slow and inaccurate convergence, primarily due to weak PDE gradients and the complexity of self-supervised learning. Recent works have demonstrated that enriching the self-supervision process with regular supervision (based on the nature of the optimal control problem) greatly accelerates convergence and solution quality; however, these have been limited to single-player problems and simple games. In this work, we introduce MADR: MPC-guided Adversarial DeepReach, a general framework to robustly approximate the two-player, zero-sum differential game value function. In doing so, MADR yields the corresponding optimal strategies for both players in zero-sum games as well as safe policies for worst-case robustness. We test MADR on a multitude of high-dimensional simulated and real robotic agents with varying dynamics and games, finding that our approach significantly outperforms state-of-the-art baselines in simulation and produces impressive results in hardware.
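The general recipe of enriching the self-supervised PDE residual with regular supervision can be sketched as a two-term loss; the argument layout below is an assumption for illustration, not the MADR code, and the PDE residual is taken as precomputed (in practice it comes from autodiff of the value network).

```python
# Assumed two-term training loss: HJ-PDE residual + MPC value supervision.
import torch

def combined_loss(model, pde_residual, mpc_x, mpc_v, lam=1.0):
    """pde_residual: HJ-PDE residuals at collocation points (computed
    elsewhere via autograd); mpc_x, mpc_v: states and value targets from
    MPC rollouts used as regular supervision."""
    loss_pde = (pde_residual ** 2).mean()          # physics-informed term
    loss_sup = ((model(mpc_x) - mpc_v) ** 2).mean()  # supervised term
    return loss_pde + lam * loss_sup
```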
|
| |
| 09:00-10:30, Paper WeI1I.301 | Add to My Program |
| Attention-Based Markerless Pose Estimation of the Assistant-Port Trocar in Robot-Assisted Surgery with a Head-Mounted Display |
|
| Greene, Nicholas | Johns Hopkins University |
| Long, Aoqi | Johns Hopkins University |
| Kazanzides, Peter | Johns Hopkins University |
Keywords: Surgical Robotics: Laparoscopy, Computer Vision for Medical Robotics, Visual Tracking
Abstract: In robotic-assisted minimally invasive surgery, an assistant surgeon stands at the bedside to insert and manipulate instruments while the primary surgeon operates the robot. Augmented reality (AR) head-mounted displays (HMDs) may improve the assistant's spatial awareness, but require tracking of surgical tools (both robotic and hand-held) for accurate overlay. In this work, we propose a markerless method to estimate the 6-DoF trocar pose for the assistant port, which can convey the insertion trajectory of any handheld instrument to the assistant surgeon. The method is based on a deep U-Net architecture with cross-attention and Atrous Spatial Pyramid Pooling (ASPP) to predict 2D keypoints on the trocar, which are then used by a Perspective-n-Point (PnP) method to estimate the trocar's pose. From the predicted trocar pose, we can also directly find the 4-DoF shaft-line of the handheld instrument using a multi-view method; this enables correction for misalignment of the trocar and instrument shaft. The trocar tracking runs in real-time (66 Hz) and can be integrated into an AR-assisted workflow. Experimental results with a phantom show an accuracy of ~5.5 mm and angle error of ~1.9 degrees, which is sufficient to guide instrument insertion into the endoscope field of view.
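The keypoints-to-pose step is a standard Perspective-n-Point solve; a minimal OpenCV version with placeholder trocar geometry and camera intrinsics is shown below.

```python
# PnP from predicted 2D keypoints to a 6-DoF pose (placeholder numbers).
import numpy as np
import cv2

object_pts = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0],
                       [0, 0, 10], [10, 10, 0], [10, 0, 10]], float)  # mm
image_pts = np.array([[320, 240], [380, 238], [322, 180],
                      [318, 300], [381, 179], [379, 299]], float)     # px
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])           # intrinsics

ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None)
if ok:
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix + translation = trocar pose
```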
|
| |
| 09:00-10:30, Paper WeI1I.302 | Add to My Program |
| MS-rPPG: Multi-Spectral State Space Model for Remote Photoplethysmography in Driver Monitoring Systems |
|
| Choi, Jiho | Jeonbuk National University |
| Lee, Sang Jun | Jeonbuk National University |
Keywords: Health Care Management, AI-Based Methods, Autonomous Agents
Abstract: Remote photoplethysmography (rPPG) is a camera-based technique for measuring physiological signals, particularly cardiac activity. From the remotely measured signals, heart rate can be estimated, which is crucial for health monitoring. In this study, we investigate a driver health monitoring system based on remote heart rate estimation. However, driving environments represent uncontrolled settings where videos are subject to varying illumination conditions and frequent head movements. We introduce MS-rPPG, a multi-spectral framework that combines RGB with near-infrared (NIR) face video to improve rPPG estimation under challenging driving conditions. To combine the complementary features from two spectral videos, we propose a cross-spectral linear modulation (CSLM) strategy based on frequency-domain analysis. Moreover, we introduce MS-Mamba, a novel state space model designed to effectively model long-range temporal dependencies while jointly capturing cross-channel interactions between multi-spectral features. We collected a real-world dataset called MS-Drive, recorded from 50 participants while driving a vehicle. The proposed method was evaluated on the MR-NIRP Car and MS-Drive datasets. The experimental results indicate that MS-rPPG shows better robustness and heart rate estimation accuracy than previous methods, highlighting its promise for driver health monitoring. The code is available at this https URL.
|
| |
| 09:00-10:30, Paper WeI1I.303 | Add to My Program |
| Reference-Free, Long-Horizon Trajectory Optimization for Aggressive Autonomous Driving in Milliseconds |
|
| Sharma, Prayag | Rensselaer Polytechnic Institute |
| Goh, Jon | Toyota Research Institute |
| Djeumou, Franck | Rensselaer Polytechnic Institute |
Keywords: Motion and Path Planning, Collision Avoidance
Abstract: Autonomous vehicles must generate long-horizon and dynamically feasible trajectories in real time—even when operating at the limits of vehicle handling—to ensure safe operation in adverse conditions. However, existing work rarely quantifies the computational demands of generating such trajectories without prior references or warm starts, and often defaults to low-fidelity models, compromising accuracy and control authority. We investigate the modeling and solver design choices that enable real-time solution of long-horizon, reference-free optimal control problems (OCPs) using full vehicle dynamics. To this end, we analyze vehicle stiffness properties to justify the OCP's integration scheme and show that lower-order A-stable methods consistently outperform alternatives, with solve time differences reaching two orders of magnitude. We show that robust nonlinear solver performance hinges on understanding barrier parameter update strategies and safeguarding techniques for Hessian indefiniteness, inherent in some interior point methods. Lastly, we propose a computationally efficient method for generating initial guesses using dynamic equilibrium, unlocking real-time performance and reducing initial infeasibility by up to four orders of magnitude. Extensive benchmarking and high-fidelity BeamNG simulation demonstrate compute times as low as 55 ms over a 260 m horizon, including high-speed obstacle avoidance scenarios where drifting emerges as a necessary component of feasible trajectory generation.
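To make the integrator discussion concrete, the sketch below shows a backward-Euler (A-stable, first-order) step solved by Newton iteration with a finite-difference Jacobian; `f` stands in for the full vehicle model and the tolerances are illustrative.

```python
# Backward-Euler step for stiff dynamics, solved by Newton iteration.
import numpy as np

def backward_euler_step(f, x, u, dt, iters=10, eps=1e-6):
    """Solve x_next = x + dt * f(x_next, u) for x_next."""
    n = len(x)
    x_next = x + dt * f(x, u)                    # explicit predictor
    for _ in range(iters):
        g = x_next - x - dt * f(x_next, u)       # implicit residual
        J = np.empty((n, n))                     # finite-difference Jacobian
        for i in range(n):
            d = np.zeros(n); d[i] = eps
            g_pert = (x_next + d) - x - dt * f(x_next + d, u)
            J[:, i] = (g_pert - g) / eps
        x_next = x_next - np.linalg.solve(J, g)
    return x_next

# Toy stiff system where explicit Euler at dt=0.1 would diverge:
f = lambda x, u: np.array([-50.0 * x[0] + u])
print(backward_euler_step(f, np.array([1.0]), 0.0, dt=0.1))  # ~1/6, stable
```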
|
| |
| 09:00-10:30, Paper WeI1I.304 | Add to My Program |
| SupGS-SLAM: Gaussian Splatting SLAM with Efficient Keyframe Strategy and Supplementary Mapping |
|
| Liu, Shuai | Renmin University of China |
| Wang, Yongcai | Renmin University of China |
| Chen, Wenping | Renmin University of China |
| Chen, Wang | Renmin University of China |
| Li, Deying | Renmin University of China |
Keywords: SLAM, Localization, Mapping
Abstract: Gaussian Splatting SLAM methods have exhibited impressive high-fidelity rendering performance. Existing methods maintain high rendering quality around the current camera viewpoint, but the rendering quality degrades in previously observed regions as the camera moves away, particularly in real-world scenarios. We identify two core factors for high-quality rendering: keyframes should efficiently cover the entire scene while minimizing redundancy, and the mapping strategy should effectively select critical keyframes for full scene optimization. To address these issues, we propose SupGS-SLAM to improve rendering quality across the entire scene. For effective keyframe management, we propose an efficient keyframe strategy, which reduces redundant keyframe selection and prioritizes the optimization of critical keyframes by assigning high weights. For enhanced mapping, we propose a supplementary mapping strategy comprising three components: supplementary densification, supplementary global mapping, and supplementary depth mapping. In supplementary densification, we add supplementary Gaussian primitives to previous regions with insufficient representation. In supplementary global mapping, we select keyframes globally to optimize the full scene. In supplementary depth mapping, we use estimated depth to optimize regions without ground-truth depth. Extensive experiments demonstrate that SupGS-SLAM achieves excellent performance on both synthetic and real-world datasets. The project page is available at https://github.com/rucliushuai/SupGS-SLAM.
|
| |
| 09:00-10:30, Paper WeI1I.305 | Add to My Program |
| GAPG: Geometry Aware Push-Grasping Synergy for Goal-Oriented Manipulation in Clutter |
|
| Xiao, Lijingze | South China University of Technology |
| Du, Jinhong | South China University of Technology |
| Cong, Yang | South China University of Technology |
| Diao, Supeng | South China University of Technology |
| Ren, Yu | South China University of Technology |
Keywords: Grasping, Deep Learning for Visual Perception, Perception for Grasping and Manipulation
Abstract: Grasping target objects is a fundamental skill for robotic manipulation, but in cluttered environments with stacked or occluded objects, a single-step grasp is often insufficient. To address this, previous work has introduced pushing as an auxiliary action to create graspable space. However, these methods often struggle with both stability and efficiency because they neglect the scene’s geometric information, which is essential for evaluating grasp robustness and ensuring that pushing actions are safe and effective. To this end, we propose a geometry-aware push–grasp synergy framework that leverages point cloud data to integrate grasp and push evaluation. Specifically, the grasp evaluation module analyzes the geometric relationship between the gripper’s point cloud and the points enclosed within its closing region to determine grasp feasibility and stability. Guided by this, the push evaluation module predicts how pushing actions influence future graspable space, enabling the robot to select actions that reliably transform non-graspable states into graspable ones. By jointly reasoning about geometry in both grasping and pushing, our framework achieves safer, more efficient, and more reliable manipulation in cluttered settings. Our method is extensively tested in simulation and real-world environments in various scenarios. Experimental results demonstrate that our model generalizes well to real-world scenes and unseen objects. The code and video are available at https://github.com/xiaolijz/GAPG.
|
| |
| 09:00-10:30, Paper WeI1I.306 | Add to My Program |
| Octree Diffusion for Semantic Scene Generation and Completion |
|
| Zhang, Xujia | University of Colorado Boulder |
| Crowe, Brendan | University of Colorado Boulder |
| Heckman, Christoffer | University of Colorado at Boulder |
Keywords: Computer Vision for Transportation, Deep Learning for Visual Perception
Abstract: The completion, extension, and generation of 3D semantic scenes are an interrelated set of capabilities that are useful for robotic navigation and exploration. Existing approaches seek to decouple these problems and solve them in isolation. Additionally, these approaches are often domain-specific, requiring separate models for different data distributions, e.g., indoor vs. outdoor scenes. To unify these techniques and provide cross-domain compatibility, we develop a single framework that can perform scene completion, extension, and generation in both indoor and outdoor scenes, which we term Octree Latent Semantic Diffusion. Our approach operates directly on an efficient dual octree graph latent representation: a hierarchical, sparse, and memory-efficient occupancy structure. This technique disentangles synthesis into two stages: (i) structure diffusion, which predicts binary split signals to construct a coarse occupancy octree, and (ii) latent semantic diffusion, which generates semantic embeddings decoded by a graph VAE into voxel-level semantic labels. To perform semantic scene completion or extension, our model leverages inference-time latent inpainting or outpainting, respectively. These inference-time methods use partial LiDAR scans or maps to condition generation, without the need for retraining or finetuning. We demonstrate high-quality structure, coherent semantics, and robust completion from single LiDAR scans, as well as zero-shot generalization to out-of-distribution LiDAR data. These results indicate that completion-through-generation in a dual octree graph latent space is a practical and scalable alternative to regression-based pipelines for real-world robotic perception tasks.
|
| |
| 09:00-10:30, Paper WeI1I.307 | Add to My Program |
| TAG-K: Tail-Averaged Greedy Kaczmarz for Computationally Efficient and Performant Online Inertial Parameter Estimation |
|
| Sha, Shuo | Columbia University |
| Bhakta, Anupam | Columbia University |
| Jiang, Zhenyuan | Columbia University |
| Qiu, Kevin | Columbia University |
| Mahajan, Ishaan | Columbia University |
| Bravo-Palacios, Gabriel | Dartmouth College |
| Plancher, Brian | Dartmouth College |
Keywords: Calibration and Identification, Optimization and Optimal Control
Abstract: Accurate online inertial parameter estimation is essential for adaptive robotic control, enabling real-time adjustment to payload changes, environmental interactions, and system wear. Traditional methods often struggle to track abrupt parameter shifts or incur high computational costs, limiting their effectiveness in dynamic environments and for computationally constrained robotic systems. We introduce TAG-K, a lightweight extension of the Kaczmarz method that combines greedy randomized row selection for rapid convergence with tail averaging for robustness under noise and inconsistency. This design enables fast, stable parameter adaptation while retaining the low per-iteration complexity inherent to the Kaczmarz framework. We evaluate TAG-K in synthetic benchmarks and quadrotor tracking tasks against RLS, KF, and other Kaczmarz variants. TAG-K achieves 1.5×–1.9× faster solve times on laptop-class CPUs and 4.8×–20.7× faster solve times on embedded microcontrollers. More importantly, these speedups are paired with improved robustness to measurement noise and a 25% reduction in estimation error, leading to nearly 2× better end-to-end tracking performance. Website, documentation, and code available at: https://a2r-lab.org/TAG-K/.
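The two ingredients named in the abstract, greedy row selection and tail averaging, compose into a short solver; the sketch below is a plain numpy rendering under assumed parameter choices (a practical implementation would update the residual incrementally rather than recompute it each iteration).

```python
# Greedy Kaczmarz with tail averaging for A x ~= b (illustrative sketch).
import numpy as np

def tag_k_sketch(A, b, iters=500, tail_frac=0.5):
    m, n = A.shape
    row_sq = (A ** 2).sum(axis=1)              # ||a_i||^2, precomputed once
    x, tail = np.zeros(n), np.zeros(n)
    t0 = int(iters * (1 - tail_frac))          # start of the averaged tail
    for k in range(iters):
        r = b - A @ x
        i = int(np.argmax(r ** 2 / np.maximum(row_sq, 1e-12)))  # greedy row
        x = x + (r[i] / row_sq[i]) * A[i]      # Kaczmarz projection onto row i
        if k >= t0:
            tail += x                          # tail average damps noise
    return tail / (iters - t0)

A = np.random.randn(200, 10)
x_true = np.random.randn(10)
b = A @ x_true + 0.01 * np.random.randn(200)   # noisy regressor/measurement pair
print(np.linalg.norm(tag_k_sketch(A, b) - x_true))  # small estimation error
```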
|
| |
| 09:00-10:30, Paper WeI1I.308 | Add to My Program |
| Search at Scale: Improving Numerical Conditioning of Ergodic Coverage Optimization for Multi-Scale Domains |
|
| Lahrach, Yanis | UCLouvain |
| Hughes, Christian | Yale University |
| Abraham, Ian | University of Sydney |
Keywords: Robust/Adaptive Control, Motion and Path Planning, Planning under Uncertainty
Abstract: Recent methods in ergodic coverage planning have shown promise as tools that can adapt to a wide range of geometric coverage problems with general constraints, but are highly sensitive to the numerical scaling of the problem space. The underlying challenge is that the optimization formulation becomes brittle and numerically unstable with changing scales, especially under potentially nonlinear constraints that impose dynamic restrictions, due to the kernel-based formulation. This paper proposes to address this problem via the development of a scale-agnostic and adaptive ergodic coverage optimization method based on the maximum mean discrepancy metric (MMD). Our approach allows the optimizer to solve for the scale of differential constraints while annealing the hyperparameters to best suit the problem domain and ensure physical consistency. We also derive a variation of the ergodic metric in the log space, providing additional numerical conditioning without loss of performance. We compare our approach with existing coverage planning methods and demonstrate the utility of our approach on a wide range of coverage problems.
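For reference, the maximum mean discrepancy between trajectory samples and target-distribution samples can be written in a few lines; the RBF kernel, its bandwidth, and this biased estimator (diagonal terms kept) are illustrative choices, not the paper's exact formulation.

```python
# Squared MMD between trajectory points and coverage-target samples.
import numpy as np

def rbf(a, b, h=0.1):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h ** 2))

def mmd2(traj, target, h=0.1):
    """traj: (N, d) trajectory samples; target: (M, d) target samples."""
    return (rbf(traj, traj, h).mean()
            - 2 * rbf(traj, target, h).mean()
            + rbf(target, target, h).mean())

traj = np.random.rand(64, 2)       # candidate coverage trajectory
target = np.random.rand(256, 2)    # samples from the coverage distribution
print(mmd2(traj, target))
```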
|
| |
| 09:00-10:30, Paper WeI1I.309 | Add to My Program |
| Veila: Panoramic LiDAR Generation from a Monocular RGB Image |
|
| Liu, Youquan | Fudan University |
| Kong, Lingdong | National University of Singapore |
| Yang, Weidong | Fudan University |
| Liang, Ao | National University of Singapore |
| Gao, Jianxiong | Fudan University |
| Wu, Yang | Nanjing University of Science and Technology |
| Xu, Xiang | Nanjing University of Aeronautics and Astronautics |
| Li, Xin | Shanghai AI Laboratory |
| Li, Linfeng | ByteDance |
| Chen, Runnan | The University of Sydney |
| Fei, Ben | The Chinese University of Hong Kong |
Keywords: Computer Vision for Automation, Deep Learning for Visual Perception, Sensor Fusion
Abstract: Realistic and controllable panoramic LiDAR data generation is critical for scalable 3D perception in autonomous driving and robotics. Existing methods either perform unconditional generation with poor controllability or adopt text-guided synthesis, which lacks fine-grained spatial control. Leveraging a monocular RGB image as a spatial control signal offers a scalable and low-cost alternative, yet remains an open problem. It faces three core challenges: (i) semantic and depth cues from RGB vary spatially, complicating reliable conditioning; (ii) modality gaps between RGB appearance and LiDAR geometry amplify alignment errors under noisy diffusion; and (iii) maintaining structural coherence between monocular RGB and panoramic LiDAR is challenging, particularly in regions where the image and LiDAR do not overlap. To address these challenges, we propose Veila, a novel conditional diffusion framework that integrates: (i) a Confidence-Aware Conditioning Mechanism (CACM) that strengthens RGB conditioning by adaptively balancing semantic and depth cues according to their local reliability; (ii) Geometric Cross-Modal Alignment (GCMA) for robust RGB-LiDAR alignment under noisy diffusion; and (iii) Panoramic Feature Coherence (PFC) for enforcing global structural consistency across monocular RGB and panoramic LiDAR. Additionally, we introduce two metrics - Cross-Modal Semantic Consistency and Cross-Modal Depth Consistency - to evaluate alignment quality across modalities. Experiments on nuScenes, SemanticKITTI, and our proposed KITTI-Weather benchmark demonstrate that Veila achieves state-of-the-art generation fidelity and cross-modal consistency, while enabling generative data augmentation that improves downstream LiDAR semantic segmentation.
|
| |
| 09:00-10:30, Paper WeI1I.310 | Add to My Program |
| PointSFDA: Source-Free Domain Adaptation for Point Cloud Completion |
|
| He, Xing | Nanjing University of Aeronautics and Astronautics |
| Zhu, Zhe | Nanjing University of Aeronautics and Astronautics |
| Nan, Liangliang | TU Delft |
| Peng, Wenshuo | Tsinghua University |
| Chen, Honghua | Lingnan University |
| Wei, Mingqiang | Nanjing University of Aeronautics and Astronautics |
Keywords: Deep Learning for Visual Perception, Transfer Learning
Abstract: Point cloud completion is critical for autonomous driving and robotic perception, yet deep learning models often experience severe performance degradation under the domain gap between synthetic training and real-world data. While unsupervised domain adaptation (UDA) has been explored to mitigate this issue, its reliance on access to source datasets limits practical applicability, as source data are often proprietary or restricted. We pioneer source-free domain adaptation (SFDA) for point cloud completion, which adapts a pre-trained source model to an unlabeled target domain without requiring source data access. To this end, we propose PointSFDA, a framework that combines global knowledge transfer with target-specific local adaptation. Specifically, we design (i) a Coarse-to-Fine Point Cloud Distillation module to extract domain-invariant global geometric priors from the source model, and (ii) a Partial-Mask Consistency Training strategy to enforce prediction consistency across masking augmentations, enabling self-supervised learning of local target-domain geometry. Experiments on real-world datasets (KITTI, ScanNet) and synthetic benchmarks (ModelNet40, 3D-FUTURE) demonstrate that PointSFDA achieves significant improvements over state-of-the-art methods in cross-domain shape completion, establishing a practical and scalable solution for robotics applications. Our code is available at https://github.com/Starak-x/PointSFDA.
|
| |
| 09:00-10:30, Paper WeI1I.311 | Add to My Program |
| TMR-VLA: Vision-Language-Action Model for Magnetic Motion Control of Tri-Leg Silicone-Based Soft Robot |
|
| Tang, Ruijie | The Chinese University of Hong Kong |
| Ng, Chi Kit | The Chinese University of Hong Kong |
| Wu, Kaixuan | The Chinese University of Hong Kong |
| Bai, Long | Alibaba DAMO Academy |
| Wang, Guankun | The Chinese University of Hong Kong |
| Huang, Yiming | The Chinese University of Hong Kong |
| Wang, Yupeng | The Chinese University of Hong Kong |
| Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
Keywords: Medical Robots and Systems, AI-Enabled Robotics, Modeling, Control, and Learning for Soft Robots
Abstract: In in-vivo environments, magnetically actuated soft robots offer advantages such as wireless operation and precise control, showing promising potential for painless detection and therapeutic procedures. We developed a tri-leg magnetically driven soft robot (TMR) whose multi-legged design enables more flexible gaits and diverse motion patterns. For this reconfigurable silicone soft robot, navigation can be decomposed into sequential motions such as squatting, rotating, lifting a leg, and walking, with its motion and behavior determined by its bending shapes. To bridge motion-type descriptions and specific low-level voltage control, we introduce TMR-VLA, an end-to-end multi-modal system for the tri-leg magnetic soft robot capable of performing hybrid motion types, which is promising for developing navigation abilities by adapting the robot's shape to language-specified motion types. TMR-VLA adopts the embodied endoluminal localization capability of EndoVLA and fuses sequential frames and natural language commands as input. Low-level voltage outputs are generated from the current observation state and the specified motion-type description. The results show that TMR-VLA can predict how the voltage applied to the TMR will change the dynamics of the silicone soft robot, reaching a 78% average success rate.
|
| |
| 09:00-10:30, Paper WeI1I.312 | Add to My Program |
| Geometry-Aware Control Barrier Functions for Collision Avoidance Via Bernstein Polynomial Approximations |
|
| Jo, Siwon | University of Pennsylvania |
| Zhang, Yanze | University of Illinois Chicago |
| Yang, Yupeng | University of North Carolina at Charlotte |
| Luo, Wenhao | University of Illinois Chicago |
Keywords: Collision Avoidance, Robot Safety, Autonomous Agents
Abstract: Safe navigation often relies on well-defined conditions based on the shape of robots and obstacles, and can be challenging when they have irregular geometries. While Control Barrier Functions (CBFs) offer an efficient mechanism to enforce safe set forward invariance, common shape surrogates (e.g., spheres or super-ellipsoids) either are overly conservative in unstructured scenes or require many local primitives, which inflates constraint counts and degrades real-time performance. In this paper, we introduce a novel geometry-aware Control Barrier Function (CBF) based on Bernstein–Polynomial Signed Distance Fields (BP-SDFs), which represent obstacles and robots in a unified way and allow the barrier function to be defined through their minimum distance. Owing to the differentiability of the Bernstein polynomials, the control constraints can be enforced efficiently in closed loop. We validate the method's efficiency and its ability to guarantee safety in single-robot navigation and heterogeneous multi-robot collision avoidance via simulations in different environments.
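The enforcement step the abstract alludes to is the usual CBF quadratic program; a single-integrator sketch using cvxpy follows, with the BP-SDF value and gradient replaced by placeholder numbers.

```python
# CBF-QP safety filter for x_dot = u (placeholder h and grad_h).
import numpy as np
import cvxpy as cp

def cbf_qp(u_nom, h_val, grad_h, alpha=1.0, u_max=1.0):
    u = cp.Variable(2)
    constraints = [grad_h @ u >= -alpha * h_val,   # enforce h_dot >= -alpha*h
                   cp.norm(u, "inf") <= u_max]     # actuation limits
    cp.Problem(cp.Minimize(cp.sum_squares(u - u_nom)), constraints).solve()
    return u.value

# h_val and grad_h would come from the (differentiable) BP-SDF in practice.
u_safe = cbf_qp(u_nom=np.array([1.0, 0.0]),
                h_val=0.2, grad_h=np.array([0.0, 1.0]))
```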
|
| |
| 09:00-10:30, Paper WeI1I.313 | Add to My Program |
| Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation |
|
| Jiang, Yicheng | The Hong Kong University of Science and Technology |
| Wang, Jiaxu | The Chinese University of Hong Kong |
| He, Junhao | The Hong Kong University of Science and Technology (GZ) |
| Gan, Zesen | The Hong Kong University of Science and Technology |
| Li, Junhao | The Hong Kong University of Science and Technology |
| Zhang, Qiang | The Hong Kong University of Science and Technology (Guangzhou) |
| Sun, Jingkai | The Hong Kong University |
| Cao, Jiahang | The University of Hong Kong |
| Sun, Mingyuan | Northeastern University |
| Yue, Xiangyu | The Chinese University of Hong Kong |
| Shao, Qiming | The Hong Kong University of Science and Technology |
Keywords: Deep Learning for Visual Perception, Representation Learning, RGB-D Perception
Abstract: Current 3D-aware pretraining methods for embodied perception and manipulation are largely built on differentiable rendering frameworks, producing either fully implicit neural fields or fully explicit geometric primitives. Implicit representations, while expressive, lack explicit structural cues, whereas explicit ones preserve geometry but suffer from resolution limits and weak generalization. To address these limitations, we propose a novel pretraining framework that learns a hybrid representation: structural latent points. Specifically, we insert a point-wise latent variational autoencoder into the latent space of a point-cloud autoencoder, jointly regularizing point-wise features and coordinates toward a Gaussian prior. The resulting compact latent preserves coarse structural tendencies, which do not encode precise geometry but capture richer rough shape and semantic information, effectively combining the expressiveness of implicit representations with the structural priors of explicit ones. In addition, informed by shared design choices in prior work, we develop a streamlined, efficient 3DGS-based rendering pipeline that is deliberately kept lightweight, improving efficiency while leaving greater representational capacity to the front-end latent module. Extensive evaluations on RLBench, ManiSkill2, and a real-robot platform demonstrate consistent gains in task success, sample efficiency, and robustness to viewpoint and scene variations over strong baselines. Ablation studies further confirm that each component of our framework is critical to overall performance.
|
| |
| 09:00-10:30, Paper WeI1I.314 | Add to My Program |
| Task Planning for Robotic Disinfection Using Generative-Adversary-Trimodel (GAT) |
|
| Ye, Jiajie | The University of Hong Kong |
| Sheng, Yongji | The University of Hong Kong |
| Liu, Tianyu | The University of Hong Kong |
| Xi, Ning | The University of Hong Kong |
Keywords: Task Planning, Planning, Scheduling and Coordination, Robotics and Automation in Life Sciences
Abstract: Robotic disinfection can relieve human operators from repetitive, labor-intensive tasks while reducing the risk of pathogen transmission in public spaces. Recent advances in learning-based methods further enhance these systems by enabling robust dynamic task planning and the interpretation of ambiguous instructions. However, disinfection task planning remains a four-dimensional (interaction, logic, spatial, and temporal) problem that requires expert knowledge, and robust task planning for autonomous disinfection in dynamic environments remains challenging. This paper proposes a novel framework that integrates the Generative Adversarial Trimodel (GAT) method with an embodied framework to solve this four-dimensional problem in dynamic environments. The GAT method injects expert knowledge and iteratively refines neural-network-generated plans against an analytical model (AM), driving dual convergence and reducing logic, spatial, and temporal errors. By combining the embodied framework and the GAT method into a GAT-enhanced embodied framework, the robot system autonomously perceives objects of unknown shape and pose, plans long-horizon task sequences, and executes disinfection operations. Experimental results demonstrate an improved success rate and reduced average task time and rule-violation rates compared with non-GAT methods, indicating improved robustness and efficiency in dynamic environments.
|
| |
| 09:00-10:30, Paper WeI1I.315 | Add to My Program |
| Awaken Memories with Words: Recursive Visual Imagination and Adaptive Linguistic Grounding for Vision Language Navigation |
|
| Chen, Bolei | Central South University |
| Kang, Jiaxu | Central South University |
| Wang, Yifei | Central South University |
| Zhong, Ping | Central South University |
| Wang, Jianxin | Central South University |
Keywords: Semantic Scene Understanding, Embodied Cognitive Science, Representation Learning
Abstract: Vision Language Navigation (VLN) typically requires agents to navigate to specified objects or remote regions in unknown scenes by obeying linguistic commands. Such tasks require organizing historical visual observations for linguistic grounding, which is critical for long-sequence navigational decisions. However, current agents suffer from overly detailed scene representation and ambiguous vision-language alignment, which weaken their comprehension of navigation-friendly high-level scene priors and easily lead to behaviors that violate linguistic commands. To tackle these issues, we propose a navigation policy by recursively summarizing along-the-way visual perceptions, which are adaptively aligned with commands to enhance linguistic grounding. In particular, by structurally modeling historical trajectories as compact neural grids, several Recursive Visual Imagination (RVI) techniques are proposed to motivate agents to focus on the regularity of visual transitions and semantic scene layouts, instead of dealing with misleading geometric details. Then, an Adaptive Linguistic Grounding (ALG) technique is proposed to align the learned situational memories with different linguistic components purposefully. Such fine-grained semantic matching facilitates the accurate anticipation of navigation actions and progress. Our navigation policy outperforms the state-of-the-art methods on the challenging VLN-CE and ObjectNav tasks, showing the superiority of our RVI and ALG techniques for VLN.
|
| |
| 09:00-10:30, Paper WeI1I.316 | Add to My Program |
| A Non-Invasive Closed-Loop Myoelectric Prosthetic Hand Featuring Electrotactile Sensory Feedback |
|
| Zhu, Guanyu | Northeastern University, China |
| Dou, Yilong | Northeastern University, China |
| Wu, Qiong | Northeastern University, China |
| Ding, Qichuan | Northeastern University, China |
Keywords: Rehabilitation Robotics, Prosthetics and Exoskeletons, Medical Robots and Systems
Abstract: The absence of sensory feedback has been a critical challenge for myoelectric prostheses in recent years. While electrotactile feedback has emerged as an effective non-invasive solution, significant challenges remain in simultaneously ensuring real-time performance, processing EMG signals under electrical stimulation interference, and transmitting richer sensory information. This study proposes a multidimensional bio-inspired electrical stimulation feedback paradigm, implemented on a self-developed closed-loop myoelectric prosthetic hand system with real-time interference avoidance capability. Utilizing the human cutaneous nervous system as the feedback pathway, our paradigm establishes diverse electrotactile patterns through real-time modulation of four-channel stimulation parameters (frequency and current intensity). Experimental results with both able-bodied participants and amputees demonstrate that the proposed paradigm can accurately convey prosthetic state information, enabling users to perceive object size, length, shape, and stiffness through the prosthetic hand. This feedback framework provides a viable sensory restoration solution for prosthetic applications.
|
| |
| 09:00-10:30, Paper WeI1I.317 | Add to My Program |
| Tactile Execution Monitoring of Robotic Manipulation Via Time-Series Based Predictive Encoding |
|
| Voigt, Florian | Mohamed Bin Zayed University of Artificial Intelligence |
| Naceri, Abdeldjallil | Mohamed Bin Zayed University of Artificial Intelligence |
| Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Assembly, Force and Tactile Sensing, Compliant Assembly
Abstract: Tactile manipulation is a prominent and growing field where most research focuses on developing generalized manipulation policies. However, tactile execution monitoring - the ability to reliably evaluate manipulation at the skill level - is often overlooked, despite being critical for unsupervised deployment in both human-centered environments and industry, where strict safety and quality requirements apply. We propose the Tactile Predictive Encoding Model (TPEM), a time-series tactile perception framework inspired by human predictive encoding that enables real-time anomaly detection from skill-level sensory data. TPEM extends predictive coding concepts from global task modeling to precise monitoring of contact-rich manipulation beyond the capabilities of visual sensing. We evaluate TPEM on three representative tasks: key insertion and turning, peg-in-hole insertion, and screw insertion and tightening using an industrial assembly model. Experiments on a tactile-enabled Franka Emika robot under realistic noise conditions show robust anomaly detection with zero false positives. Comparison with baseline methods - including Support Vector Machines (SVM), Hidden Markov Models (HMM), and recurrent generative models such as LSTM-VAE - demonstrates that TPEM consistently outperforms state-of-the-art approaches in contact-rich skill-level execution monitoring.
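Stripped to its essence, predictive-encoding-style monitoring compares a model's predicted tactile signal with the measurement and flags sustained divergence; the interface, window length, and threshold below are assumptions, not TPEM itself.

```python
# Prediction-error monitor over a tactile time series (illustrative).
import numpy as np

def anomaly_step(pred_next, meas_next, err_hist, thresh):
    """thresh would be calibrated on nominal runs, e.g. mean + 3*std of the
    smoothed error. Returns True when an anomaly should be flagged."""
    err_hist.append(float(np.linalg.norm(pred_next - meas_next)))
    return np.mean(err_hist[-5:]) > thresh   # short moving average
```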
|
| |
| 09:00-10:30, Paper WeI1I.318 | Add to My Program |
| The Actuator Pre-Filtering Approach to Control-Coherent Koopman LQR for Robot Systems Interacting with Compliant Environment |
|
| Terrones, Jasmine | MIT |
| Asada, Harry | MIT |
Keywords: Machine Learning for Robot Control, Optimization and Optimal Control, Underactuated Robots
Abstract: As a robot makes and breaks contact with environment surfaces, the equations of motion are switched. Task planning and real-time control become challenging as the system traverses multiple regions and switches the governing dynamics. This paper presents a modeling and real-time control methodology for such switched dynamical systems based on Koopman operator theory. Potentially, Koopman operators allow us to subsume segmented dynamics within a unified, globally linear model amenable to control analysis and synthesis. However, the original Koopman operators are not applicable to non-autonomous systems with exogenous input. A new method for converting robot dynamics to a Koopman-compatible model using actuator pre-filtering is presented and applied to the modeling and control of robots interacting with the environment. Specifically, an underactuated cart-pole robot bouncing against multiple walls is modeled as a Control-Coherent Koopman model and a Koopman LQR controller is designed for the wall-bouncing robot. Simulation experiments demonstrate the effectiveness of the method and investigate the effect of the actuator pre-filter parameter on control performance.
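The generic lifted-linear-model-plus-LQR pipeline that such a Koopman model plugs into can be sketched as below; the lifting functions and the actuator pre-filter itself are omitted, and the data layout is an assumption.

```python
# EDMD fit of z+ = A z + B u in a lifted space, then a discrete LQR gain.
import numpy as np
from scipy.linalg import solve_discrete_are

def edmd_fit(Z, Z_next, U):
    """Z, Z_next: (N, n) lifted states; U: (N, m) (pre-filtered) inputs."""
    G = np.hstack([Z, U])
    AB = np.linalg.lstsq(G, Z_next, rcond=None)[0].T  # [A | B] stacked
    n = Z.shape[1]
    return AB[:, :n], AB[:, n:]                       # A, B

def lqr_gain(A, B, Q, R):
    P = solve_discrete_are(A, B, Q, R)
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # u = -K z
```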
|
| |
| 09:00-10:30, Paper WeI1I.319 | Add to My Program |
| CU-Multi: A Dataset for Multi-Robot Collaborative Perception |
|
| Albin, Doncey | University of Colorado - Boulder |
| McGann, Daniel | Carnegie Mellon University |
| Mena, Miles | University of Colorado Boulder |
| Thomas, Annika | Massachusetts Institute of Technology |
| Biggie, Harel | Massachusetts Institute of Technology |
| Sun, Xuefei | University of Colorado Boulder |
| McGuire, Steve | University of California at Santa Cruz |
| How, Jonathan | Massachusetts Institute of Technology |
| Heckman, Christoffer | University of Colorado at Boulder |
Keywords: Data Sets for SLAM, Multi-Robot SLAM, SLAM
Abstract: A central challenge for multi-robot systems is fusing independently gathered perception data into a unified representation. Despite progress in Collaborative SLAM (C-SLAM), benchmarking remains hindered by the scarcity of dedicated multi-robot datasets. Many evaluations instead partition single-robot trajectories, a practice that may only partially reflect true multi-robot operations and, more critically, lacks standardization, leading to inconsistent results across studies. While several multi-robot datasets have recently been introduced, they mostly contain short trajectories with limited inter-robot overlap and sparse intra-robot loop closures. To overcome these limitations, we introduce CU-Multi, a dataset collected over multiple days at two large outdoor sites on the University of Colorado Boulder campus. CU-Multi comprises four synchronized runs with aligned start times and controlled trajectory overlap, replicating the distinct perspectives of a robot team. It includes RGB-D sensing, RTK GPS, semantic LiDAR, and refined ground-truth odometry. By combining overlap variation with dense semantic annotations, CU-Multi provides a strong foundation for reproducible evaluation in multi-robot collaborative perception tasks.
|
| |
| 09:00-10:30, Paper WeI1I.320 | Add to My Program |
| R-FAC: Resilient Value Function Factorization for Multi-Robot Efficient Search with Individual Failure Probabilities |
|
| Guo, Hongliang | Agency for Science Technology and Research |
| Kang, Qi | Northeastern University |
| Yau, Wei-Yun | I2R |
| Chew, Chee-Meng | National University of Singapore |
| Rus, Daniela | MIT |
Keywords: Multi-Robot Systems, Learning and Adaptive Systems, Cooperating Robots, Distributed Robot Systems
Abstract: This paper investigates the resilient multi-robot efficient search problem (R-MuRES), which aims at coordinating multiple robots for the minimal time detection of a 'non-adversarial' moving target. R-MuRES faces challenges like robot malfunctions and withdrawals during task execution, leading to a variable number of searchers and new research hurdles. We propose resilient value function factorization (R-FAC) to construct a central value function resiliently, minimizing mean squared temporal difference (TD) errors across team compositions. R-FAC ensures that individual global maximum (IGM) principles are met, allowing functioning robots to contribute positively. We introduce variational value decomposition network (V2DN) as an instantiation of the R-FAC paradigm, proving superior to brute-force summation in multi-robot search tasks. V2DN is compared with state-of-the-art MuRES solutions and the vanilla VDN, showcasing superior resiliency when robots leave the team. Validation of V2DN is performed in a real multi-robot system in a self-constructed indoor environment, demonstrating its effectiveness and contributing valuable insights to the robotics community.
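For orientation, the vanilla VDN baseline mentioned above factorizes the team value as a sum of per-robot utilities, so dropping a failed robot's term still leaves a valid team value; V2DN's variational construction is not shown here, and the liveness mask is an illustrative device.

```python
# Vanilla VDN-style factorization with a liveness mask (illustrative).
import numpy as np

def team_q(per_robot_q, alive):
    """per_robot_q: (N,) individual Q values; alive: (N,) 0/1 flags."""
    return float((per_robot_q * alive).sum())   # additive => IGM holds

print(team_q(np.array([1.2, 0.7, 0.9]), np.array([1.0, 0.0, 1.0])))
# robot 2 has withdrawn; the remaining robots' values still compose
```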
|
| |
| 09:00-10:30, Paper WeI1I.321 | Add to My Program |
| Semantic Equirectangular Visual Tracking in Lightweight 3D Building Reconstructions |
|
| Loubani, Hussein | Universite De Technologie Belfort Montbeliard |
| Crombez, Nathan | Université De Technologie De Belfort-Montbéliard |
| Buisson, Jocelyn | Université De Technologie De Belfort Montbéliard |
| Ruichek, Yassine | University of Technology of Belfort-Montbeliard - France |
Keywords: Visual Tracking, Visual Servoing, Localization
Abstract: Accurate visual localization often relies on dense, high-fidelity 3D models, which provide rich geometric and photometric detail but are expensive to acquire, heavy to store, and limited in scalability. As an alternative, lightweight city models represent only coarse building volumes, offering compactness, accessibility, and privacy but posing challenges for reliable alignment due to the lack of textures and fine structure. This work addresses these challenges by introducing a semantic equirectangular Gaussian Mixture–based virtual visual servoing approach that aligns real panoramic images with synthetic views rendered from lightweight building models. The method combines semantic building masks with Gaussian Mixtures, a seamless 360° formulation, and frequency-domain computation to overcome the poor gradients of direct photometric binary-mask alignment while maintaining computational efficiency. Experiments on outdoor trajectories demonstrate accurate and stable tracking, robustness under frame skipping, and resilience to dynamic occlusions through semantic masking. These results indicate that reliable localization is feasible with coarse city models, providing a scalable alternative to high-fidelity reconstructions and opening perspectives for deeper integration of semantic rules into the localization process.
|
| |
| 09:00-10:30, Paper WeI1I.322 | Add to My Program |
| Hierarchical Motion Planning Adaptation Via Guided Diffusion |
|
| Kim, Woosung | University of Virginia |
| Bezzo, Nicola | University of Virginia |
Keywords: Planning under Uncertainty, Integrated Planning and Learning, Reactive and Sensor-Based Planning
Abstract: Mobile robots deployed for persistent operations in partially known environments need to be able to recover and adapt against unforeseen changes in dynamics, e.g., due to failures, or external disturbances. This paper presents a novel hierarchical framework capable of zero-shot adaptation to environmental and dynamic changes. At the high level, an abstract planner generates a collision-free global path, adapting to degraded mobility by inflating a dynamic safety buffer around obstacles to ensure the route remains navigable. At the low level, a concrete planner employs a conditional Denoising Diffusion Probabilistic Model (DDPM) to refine the abstract path into a smooth, executable trajectory. The key to our approach is conditioning the diffusion model's generation process on the robot's online-estimated dynamic limits. Our framework's effectiveness and robustness are validated in both complex simulations and real-world hardware experiments, demonstrating its ability to ensure mission success under unstructured and unexpected fault situations.
|
| |
| 09:00-10:30, Paper WeI1I.323 | Add to My Program |
| Dynamic Neural Potential Field: Online Trajectory Optimization in Presence of Moving Obstacles |
|
| Staroverov, Aleksei | Cognitive AI Systems Lab |
| Alhaddad, Muhammad | MIRIAI |
| Narendra, Aditya | MIRIAI |
| Mironov, Konstantin | Ufa University of Science and Technology |
| Panov, Aleksandr | Cognitive AI Systems Lab |
Keywords: Machine Learning for Robot Control, Collision Avoidance, Nonholonomic Motion Planning
Abstract: Generalist robot policies must operate safely and reliably in everyday human environments such as homes, offices, and warehouses, where people and objects move unpredictably. We present Dynamic Neural Potential Field (NPField-GPT), a learning-enhanced model predictive control (MPC) framework that couples classical optimization with a Transformer-based predictor of footprint-aware repulsive potentials. Given an occupancy sub-map, robot footprint, and optional dynamic-obstacle cues, our NPField-GPT model forecasts a horizon of differentiable potentials that are injected into a sequential quadratic MPC program via L4CasADi, yielding real-time, constraint-aware trajectory optimization. We additionally study two baselines: NPField-StaticMLP, where a dynamic scene is treated as a sequence of static maps; and NPField-DynamicMLP, which predicts the future potential sequence in parallel with an MLP. In dynamic indoor scenarios from BenchMR and on a Husky UGV in office corridors, NPField-GPT produces more efficient and safer trajectories under motion changes, while StaticMLP/DynamicMLP offer lower latency. We also compare with the CIAO* and MPPI baselines. Across methods, the Transformer+MPC synergy preserves the transparency and stability of model-based planning while learning only the part that benefits from data: spatiotemporal collision risk. Code and trained models are available at https://github.com/CognitiveAISystems/Dynamic-Neural-Potential-Field
|
| |
| 09:00-10:30, Paper WeI1I.324 | Add to My Program |
| LeGO-MM: Learning Navigation for Goal-Oriented Mobile Manipulation Via Hierarchical Policy Distillation |
|
| Chen, Bolei | Central South University |
| Liu, Liangbai | Central South University |
| Yan, Shengsheng | Central South University |
| Yang, Haonan | Central South University |
| Zhong, Ping | Central South University |
| Wang, Jianxin | Central South University |
Keywords: Embodied Cognitive Science, Mobile Manipulation, Reinforcement Learning
Abstract: Benefiting from mobility and dexterity, Mobile Manipulation (MM) systems are expected to assist humans with diverse tasks in everyday life. However, since MM tasks (e.g., tidying up a room) require learning multi-stage heterogeneous behaviors (e.g., picking, placing, and opening), existing Reinforcement Learning (RL) agents often face sample inefficiency and progress reversal issues. In addition, such MM agents are limited to learning customized tasks, thus not allowing for the extrapolation to new tasks and real-world scenes. In this work, we propose a Hierarchical Policy Distillation (HPD)-based RL framework to explicitly address these issues, which outperforms existing curriculum learning-based and hierarchical RL-based methods. Specifically, Sub-Skill Distillation (SSD) allows learning both the main MM task and easier sub-skills in a single training loop, facilitating exploration and mitigating progress reversal by distilling the relevant sub-skills' experience into the main task. Self-boosting Policy Distillation (SPD) is designed to enhance generalization and address the information asymmetry between MM tasks in a principled way, i.e., distilling the experience of a prior task to a new one. Comparative and ablation studies on different robotic platforms demonstrate that our method significantly outperforms existing methods. Finally, real-world experiments validate the practicality of our method.
|
| |
| 09:00-10:30, Paper WeI1I.325 | Add to My Program |
| A Synthetic Benchmark for Collaborative 3D Semantic Occupancy Prediction in V2X-Enabled Autonomous Driving |
|
| Wu, Hanlin | The University of Tokyo |
| Lin, Pengfei | The University of Tokyo |
| Javanmardi, Ehsan | The University of Tokyo |
| Bao, Naren | The University of Tokyo |
| Qian, Bo | The University of Tokyo |
| Si, Hao | The University of Tokyo |
| Tsukada, Manabu | The University of Tokyo |
Keywords: Performance Evaluation and Benchmarking, Semantic Scene Understanding, Computer Vision for Automation
Abstract: 3D semantic occupancy prediction is an emerging perception paradigm in autonomous driving, providing a voxel-level representation of both geometric details and semantic categories. However, despite its fine-grained scene understanding, its effectiveness is inherently constrained in single-vehicle setups by occlusions, restricted sensor range, and narrow viewpoints. To address these limitations, collaborative perception enables the exchange of complementary information, thereby enhancing the completeness and accuracy of predictions. Despite its potential, research on collaborative 3D semantic occupancy prediction is hindered by the lack of dedicated datasets. To bridge this gap, we design a high-resolution semantic voxel sensor in CARLA to produce dense and comprehensive annotations for V2X scenarios. We further develop a baseline model that performs inter-agent feature fusion via spatial alignment and attention aggregation. In addition, we establish benchmarks with varying prediction ranges designed to systematically assess the impact of spatial extent on collaborative prediction. Experimental results demonstrate the superior performance of our baseline enabled by vehicle collaboration, with increasing gains observed as the prediction range expands. Our codes and data are available at https://github.com/tlab-wide/Co3SOP.
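As a hedged sketch of what attention-based inter-agent feature aggregation can look like (a generic form; the paper's spatial alignment and learned aggregator are richer than this):

```python
# Per-voxel attention over spatially aligned agent features (toy form).
import torch
import torch.nn.functional as F

def fuse_agents(ego_feat, neighbor_feats):
    """ego_feat: (C, N_vox); neighbor_feats: list of (C, N_vox) feature maps
    assumed already warped into the ego frame. Dot-product affinities against
    the ego features weight each source per voxel."""
    stack = torch.stack([ego_feat] + neighbor_feats)       # (A, C, N)
    scores = (stack * ego_feat.unsqueeze(0)).sum(dim=1)    # (A, N) affinity
    w = F.softmax(scores, dim=0).unsqueeze(1)              # (A, 1, N)
    return (w * stack).sum(dim=0)                          # fused (C, N)

fused = fuse_agents(torch.randn(64, 1000), [torch.randn(64, 1000) for _ in range(2)])
```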
|
| |
| 09:00-10:30, Paper WeI1I.326 | Add to My Program |
| TransTac: Visuo-Tactile Modality Transition Via Ultraviolet-Encoded Transparent Elastomers |
|
| Yang, Lingyue | Beijing University of Posts and Telecommunications |
| Fang, Bin | Beijing University of Posts and Telecommunications / Tsinghua University |
Keywords: Force and Tactile Sensing, Haptics and Haptic Interfaces, Sensor Fusion
Abstract: Vision-based tactile sensors (VBTS) recover high-resolution contact geometry but typically rely on opaque elastomer layers that prevent visual transparency, while RGB-D cameras provide global depth perception yet degrade significantly at close range. To address these limitations, we present TransTac, a transparent ultraviolet (UV)-encoded binocular VBTS that integrates visual observation and marker-based tactile reconstruction within a single compact device. The system employs a transparent elastomer embedded with UV-reflective markers and a prior-guided Delaunay stereo matching algorithm for robust sparse triangulation. To reliably detect densely distributed semitransparent markers, we develop a lightweight detector that enables stable localization under contact and deformation. The proposed prior-guided Delaunay matching improves correspondence robustness by approximately 21% compared with global assignment baselines while maintaining high reconstruction accuracy. In semantic evaluation, TransTac achieves up to 83.3% zero-shot recognition accuracy on tactile images, exceeding opaque tactile baselines by approximately 50 percentage points. Embedding analysis further reveals substantially stronger cross-modal alignment with natural images, with class-center similarity increasing from around 0.2 to over 0.77. Controlled near-distance experiments quantify the degradation of RGB-D depth reliability and demonstrate extended geometric coverage enabled by visuo-tactile integration. Finally, a compact prototype is implemented with an approximate hardware cost of $70. Code and hardware design are publicly available at https://github.com/87361/TransTac.
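The prior-guided matching itself is only described at a high level; the sketch below shows just the Delaunay neighborhood structure such a matcher can operate on (marker positions are random stand-ins for real detections):

```python
# Illustrative sketch: Delaunay neighborhood graph over detected markers.
# Neighboring markers should keep similar disparities, the kind of local
# consistency a geometric prior can enforce during stereo matching.
import numpy as np
from scipy.spatial import Delaunay

markers_left = np.random.rand(40, 2) * [640, 480]  # stand-in detections (px)
tri = Delaunay(markers_left)

edges = set()  # undirected edges of the triangulation
for simplex in tri.simplices:
    for i in range(3):
        a, b = sorted((simplex[i], simplex[(i + 1) % 3]))
        edges.add((a, b))
print(f"{len(edges)} neighbor constraints over {len(markers_left)} markers")
```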
|
| |
| 09:00-10:30, Paper WeI1I.327 | Add to My Program |
| SuReNav: Superpixel Graph-Based Constraint Relaxation for Navigation in Over-Constrained Environments |
|
| Koh, Keonyoung | KAIST |
| Jung, Moonkyeong | KAIST |
| Lee, Samuel Seungsup | Korea Advanced Institute of Science and Technology (KAIST) |
| Park, Daehyung | Korea Advanced Institute of Science and Technology, KAIST |
Keywords: Constrained Motion Planning, Reactive and Sensor-Based Planning, AI-Based Methods
Abstract: We address the over-constrained planning problem in semi-static environments. The planning objective is to find a best-effort solution that avoids all hard constraint regions while minimally traversing the least risky areas. Conventional methods often rely on pre-defined area costs, limiting generalization. Further, the spatial continuity of navigation spaces makes it difficult to identify regions that are passable without overestimation. To overcome these challenges, we propose SuReNav, a superpixel graph-based constraint relaxation and navigation method that imitates human-like safe and efficient navigation. Our framework consists of three components: 1) superpixel graph map generation with regional constraints, 2) regional-constraint relaxation using a graph neural network trained on human demonstrations for safe and efficient navigation, and 3) interleaving relaxation, planning, and execution for complete navigation. We evaluate our method against state-of-the-art baselines on 2D semantic maps and 3D maps from OpenStreetMap, achieving the highest human-likeness score for complete navigation while maintaining a balanced trade-off between efficiency and safety. We finally demonstrate its scalability and generalization performance in real-world urban navigation with a quadruped robot, Spot. Code and videos are available at https://sure-nav.github.io/.
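To make the superpixel-graph idea concrete, a small sketch of generating superpixels and their adjacency graph, assuming SLIC as the superpixel method (the abstract does not name one; the random image is a stand-in for a semantic map tile):

```python
# Superpixel segmentation plus adjacency graph, the structure a GNN can relax over.
import numpy as np
from skimage.segmentation import slic

img = np.random.rand(128, 128, 3)  # stand-in for a semantic map tile
labels = slic(img, n_segments=64, compactness=10, start_label=0)

# two superpixels are neighbors if their labels touch horizontally or vertically
edges = set()
h = np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1)
v = np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1)
for a, b in np.vstack([h, v]):
    if a != b:
        edges.add((min(a, b), max(a, b)))
print(f"graph: {labels.max() + 1} nodes, {len(edges)} edges")
```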
|
| |
| 09:00-10:30, Paper WeI1I.328 | Add to My Program |
| SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation |
|
| Hanyu, Taisei | University of Arkansas |
| Chung, Nhat | FPT Software AI Center |
| Le, Huy | FPT Software |
| Nguyen, Tien Toan | FPT Software |
| Ikebe, Yuki | University of Arkansas |
| Gunderman, Anthony | University of Arkansas |
| Nguyen Ho Minh, Duy | University of Stuttgart, Max Planck Research School for Intelligent Systems |
| Vo, Khoa | University of Arkansas |
| Kieu, Tung | Aalborg University |
| Yamazaki, Kashu | Carnegie Mellon University |
| Rainwater, Chase | University of Arkansas |
| Nguyen, Anh | University of Liverpool |
| Le, Ngan | University of Arkansas |
|
|
| |
| 09:00-10:30, Paper WeI1I.329 | Add to My Program |
| The Translational/Rotational Piezoelectric Impact Drive Mechanism for Cell/Tissue Extraction from Mouse Cranial Window |
|
| Sugiura, Hirotaka | The University of Tokyo |
| Kunii, Hiroki | The University of Tokyo |
| Amaya, Satoshi | The University of Tokyo |
| Kawamura, Shuntaro | Institute of Science Tokyo |
| Takebe, Takanori | Institute of Science Tokyo |
| Arai, Fumihito | The University of Tokyo |
Keywords: Biological Cell Manipulation, Automation at Micro-Nano Scales, Robotics and Automation in Life Sciences
Abstract: We developed a microscopic cell/tissue extraction device that employs a translational/rotational piezoelectric impact drive mechanism (Piezo IDM). To correlate gene expression with localized tissue samples at the micrometer scale, the system inserts a knife-edged glass capillary driven by the Piezo IDM and extracts the cells/tissues. The hybridized use of translational and rotational impact motion significantly improved suction performance, resulting in the reliable acquisition of small, localized cells and tissues that were previously difficult to isolate. To characterize the motion of the Piezo IDM, its amplitude and frequency dependence were measured and compared with a simulation model. In addition, we found that a synchronous chopping motion could produce the rotational motion efficiently. For automation, a specialized controller was developed to drive bidirectional motion. Experimental demonstrations were performed on both an artificial gel sample and a practical mouse cranial window (CW). The gel-sample results clearly exhibited the effectiveness of hybridizing the translational and rotational motion of the Piezo IDM for cell/tissue extraction. A practical demonstration of neutrophil extraction in thrombus-induced mice further showed the potential for accurate tissue extraction in an in-vivo environment.
|
| |
| 09:00-10:30, Paper WeI1I.330 | Add to My Program |
| Imitation-BT: Automating Behavior Tree Generation by Echoing Reinforcement Learning Agents |
|
| Bathula, Shailendra Sekhar | University of Georgia |
| Parasuraman, Ramviyas | University of Georgia |
Keywords: Behavior-Based Systems, Imitation Learning, Reinforcement Learning
Abstract: Understanding an autonomous agent's decision-making prowess is of paramount importance, as it increases trust and guarantees safety. Although agent policies learned through reinforcement learning (RL) and machine learning (ML) paradigms have demonstrated their dominance in various domains, they struggle with deployment in high-stakes environments due to their algorithmic opacity. A structured and transparent representation of a policy helps us understand, evaluate, and modify it if necessary. Due to its inherent reactivity, modularity, and transparent hierarchical representation, the Behavior Tree (BT) is an ideal solution for representing control policies. In this paper, we focus on building a knowledge representation transfer framework in which the knowledge of trained RL agents is captured through imitation learning and then utilized to form a compact BT. Our primary focus is to retain maximum performance while improving the interpretability of the BTs. In combination with planning and learning, we automate the formation of a BT and offer an alternative, transparent architecture for policy representation. In an extensive analysis with a variety of Gymnasium environments and Robotics Package Delivery domain simulations, we demonstrate the significant performance retention capability and superior interpretability of the proposed Imitation-BT.
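As an illustration of extracting a transparent policy from agent rollouts, here is a toy sketch in which a shallow decision tree stands in for the behavior tree — both give a rule-like, human-readable policy readout, though the paper's BT construction is more involved. The random policy below merely generates (state, action) pairs; a trained RL agent would take its place.

```python
# Imitation-style policy extraction: fit an interpretable model to rollout data.
import gymnasium as gym
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

env = gym.make("CartPole-v1")
states, actions = [], []
obs, _ = env.reset(seed=0)
for _ in range(2000):
    act = env.action_space.sample()  # replace with trained_agent(obs)
    states.append(obs)
    actions.append(act)
    obs, _, term, trunc, _ = env.step(act)
    if term or trunc:
        obs, _ = env.reset()

tree = DecisionTreeClassifier(max_depth=3).fit(np.array(states), np.array(actions))
print(export_text(tree))  # human-readable policy rules, analogous to BT branches
```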
|
| |
| 09:00-10:30, Paper WeI1I.331 | Add to My Program |
| How Bumpy Is It? Incremental Online Learning of Terrain-Induced Bumpiness Costs for Off-Road Vehicles |
|
| Yuan, Haoyu | Beijing Institute of Technology |
| Niu, Tianwei | Beijing Institute of Technology |
| Ma, Shengshan | Beijing Institute of Technology |
| Bao, Runjiao | Beijing Institute of Technology |
| Wang, Shoukun | Beijing Institute of Technology |
Keywords: Field Robots, Robotics in Hazardous Fields, Incremental Learning
Abstract: Stable autonomous driving in unstructured off-road environments remains a longstanding challenge. In the absence of structured roads and in the presence of uneven terrain, vegetation, and soil slopes, vehicles must rely on LiDAR–Camera fusion to identify stable and traversable roads. However, existing terrain perception methods largely remain at the level of semantic segmentation and struggle to capture physical attributes such as surface roughness and load-bearing capacity. Meanwhile, constructing datasets annotated with accurate physical properties is prohibitively costly and inherently limited in class diversity, making it difficult to cover unseen terrains. To address these limitations, we propose an online ground bumpiness cost learning framework for off-road vehicles, which enables continuous and direct learning of terrain-specific bumpiness costs during operation without the need for manual annotation. The framework consists of four key components: (i) ground bumpiness cost computation, (ii) a lightweight multimodal terrain segmentation model, (iii) an instance-level incremental update strategy, and (iv) a bumpiness cost mapping module. Extensive experiments on the EV-56 vibroseis truck demonstrate that the proposed framework can finely discriminate terrains with varying bumpiness costs and incrementally estimate costs for previously unseen terrains, thereby providing strong support for safe and reliable off-road autonomous driving.
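One plausible instantiation of a self-supervised bumpiness label from chassis IMU data — an assumption on our part, since the abstract does not give the formula:

```python
# Assumed bumpiness label: RMS of band-passed vertical acceleration over a window,
# high on rough terrain and near zero on smooth ground; serves as a regression target.
import numpy as np
from scipy.signal import butter, filtfilt

def bumpiness_cost(acc_z, fs=100.0, band=(0.5, 20.0)):
    b, a = butter(2, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, acc_z)  # removes gravity and sensor bias (DC)
    return float(np.sqrt(np.mean(filtered ** 2)))

smooth = 9.81 + 0.02 * np.random.randn(500)  # idle on flat ground
rough = 9.81 + 1.50 * np.random.randn(500)   # driving over rubble
print(bumpiness_cost(smooth), bumpiness_cost(rough))
```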
|
| |
| 09:00-10:30, Paper WeI1I.332 | Add to My Program |
| OASIS-DC: Generalizable Depth Completion Via Output-Level Alignment of Sparse-Integrated Monocular Pseudo Depth |
|
| Cho, Jaehyeon | Gachon University |
| An, Jhonghyun | Gachon University |
Keywords: Deep Learning for Visual Perception, Sensor Fusion, Deep Learning Methods
Abstract: Recent monocular foundation models excel at zero-shot depth estimation, yet their outputs are inherently relative rather than metric, limiting direct use in robotics and autonomous driving. We leverage the fact that relative depth preserves global layout and boundaries: by calibrating it with sparse range measurements, we transform it into a pseudo metric depth prior. Building on this prior, we design a refinement network that follows the prior where reliable and deviates where necessary, enabling accurate metric predictions from very few labeled samples. The resulting system is particularly effective when curated validation data are unavailable, sustaining stable scale and sharp edges across few-shot regimes. These findings suggest that coupling foundation priors with sparse anchors is a practical route to robust, deployment-ready depth completion under real-world label scarcity.
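The calibration step, as we read it, amounts to a global least-squares scale-and-shift fit of relative depth onto sparse metric anchors; the refinement network then corrects local deviations. A minimal sketch under that assumption:

```python
# Fit metric ≈ s * relative + t on the sparse valid pixels, then apply globally.
import numpy as np

def align_relative_depth(rel_depth, sparse_metric, mask):
    """rel_depth: HxW relative depth; sparse_metric: HxW metric depth, valid where mask."""
    r = rel_depth[mask]
    m = sparse_metric[mask]
    A = np.stack([r, np.ones_like(r)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, m, rcond=None)
    return s * rel_depth + t  # pseudo metric depth prior

rel = np.random.rand(48, 64)
gt = 3.0 * rel + 1.2
mask = np.random.rand(48, 64) < 0.02  # ~2% sparse range hits
pseudo = align_relative_depth(rel, gt, mask)
print(np.abs(pseudo - gt).mean())     # ~0 for this synthetic affine case
```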
|
| |
| 09:00-10:30, Paper WeI1I.333 | Add to My Program |
| Traversability-Aware Legged Navigation by Learning from Real-World Visual Data |
|
| Zhang, Hongbo | The Chinese University of Hong Kong |
| Li, Zhongyu | University of California, Berkeley |
| Zeng, Xuanqi | Chinese University of Hong Kong |
| Smith, Laura | UC Berkeley |
| Stachowicz, Kyle | University of California, Berkeley |
| Shah, Dhruv | Princeton University |
| Yue, Linzhu | The Chinese University of Hong Kong |
| Song, Zhitao | The Chinese University of Hong Kong |
| Xia, Weipeng | The Chinese University of Hong Kong |
| Levine, Sergey | UC Berkeley |
| Sreenath, Koushil | University of California, Berkeley |
| Liu, Yunhui | Chinese University of Hong Kong |
Keywords: Legged Robots, Visual-Based Navigation, Motion and Path Planning, Deep Learning in Robotics and Automation
Abstract: The enhanced mobility brought by legged locomotion empowers quadrupedal robots to navigate through complex and unstructured environments. However, optimizing agile locomotion while accounting for the varying energy costs of traversing different terrains remains an open challenge. Most previous work focuses on planning trajectories with traversability estimation based on human-labeled environmental features. This human-centric approach is insufficient because it does not account for the varying capabilities of the robot locomotion controllers. We introduce a novel real-world learning pipeline that unifies offline demonstrations, online reinforcement learning, and multi-modal perception to achieve robust legged navigation. The framework employs multiple training stages to develop a planner that guides the robot in avoiding obstacles and hard-to-traverse terrains while reaching its goals. With the proposed method, a quadrupedal robot learns to perform traversability-aware navigation through real-world interactions in diverse offroad and unstructured environments. Moreover, the robot demonstrates the ability to generalize the learned navigation skills to unseen scenarios.
|
| |
| 09:00-10:30, Paper WeI1I.334 | Add to My Program |
| Aqua-Splat: Physically-Informed Sonar-Camera Gaussian Splatting for Underwater 3D Reconstruction |
|
| Ling, Zijie | Harbin Institute of Technology (Shen Zhen) |
| Feng, Yunxuan | Harbin Institute of Technology(Shen Zhen) |
| Meng, Ao | Harbin Institute of Technology, Shenzhen |
| Xiao, Renxiang | Harbin Institute of Technology, Shenzhen |
| Pan, Shu | Harbin Institute of Technology, Shenzhen |
| Lu, Wenjie | Harbin Institute of Technology (Shenzhen) |
| Hu, Liang | Harbin Institute of Technology, Shenzhen |
Keywords: Marine Robotics, Mapping
Abstract: Differentiable Gaussian Splatting (GS) has emerged as a powerful paradigm for scene representation, enabling efficient rendering and real-time editing. However, existing GS-based methods, which rely mainly on clear visual images, perform poorly in underwater environments due to camera distortions such as light absorption and backscattering. In contrast, acoustic sensors like Forward Looking Sonar (FLS) offer superior penetration and robustness in such conditions. To leverage the complementary merits of visual and FLS images, we propose a novel Gaussian splatting framework customized for underwater scenarios, termed Aqua-Splat, for robust and accurate underwater perception. It ensures physically consistent reconstruction by incorporating the sonar wave propagation modeling in the image formation process. Moreover, we propose a volume rendering technique for sonar image synthesis, achieving similar speed to visual rendering. Additionally, we introduce a sonar-guided densification strategy to optimize the scene representation. Through extensive experiments on both simulated and laboratory datasets, we demonstrate that Aqua-Splat significantly improves image synthesis and 3D scene reconstruction in challenging underwater environments, outperforming existing methods in terms of both geometric accuracy and photometric fidelity. The code of Aqua-Splat will be open-sourced later for the community.
|
| |
| 09:00-10:30, Paper WeI1I.335 | Add to My Program |
| MARG: MAstering Risky Gap Terrains for Legged Robots with Elevation Mapping |
|
| Dong, Yinzhao | The University of Hong Kong |
| Ma, Ji | The University of Hong Kong |
| Zhao, Liu | The University of Hong Kong |
| Li, Wanyue | The University of Hong Kong |
| Lu, Peng | The University of Hong Kong |
Keywords: Legged Robots, Learning and Adaptive Systems, Deep Learning in Robotics and Automation, Quadrupedal Robot
Abstract: Deep Reinforcement Learning (DRL) controllers for quadrupedal locomotion have demonstrated impressive performance on challenging terrains, allowing robots to execute complex skills such as climbing, running, and jumping. However, existing blind locomotion controllers often struggle to ensure safe and efficient traversal through risky gap terrains, which are typically highly complex and require robots to accurately perceive terrain information and select appropriate footholds during locomotion. Meanwhile, existing perception-based controllers still present several practical limitations, including complex multi-sensor deployment systems and expensive computing resource requirements. This paper proposes a DRL controller named MAstering Risky Gap Terrains (MARG), which integrates terrain maps and proprioception to dynamically adjust actions and enhance the robot's stability in these tasks. During the training phase, our controller accelerates policy optimization by selectively incorporating privileged information (e.g., center of mass, friction coefficients) that is available in simulation but cannot be measured directly in real-world deployments due to sensor limitations. We also design three foot-related rewards to encourage the robot to explore safe footholds. More importantly, a terrain map generation (TMG) model is proposed to reduce mapping drift and provide accurate terrain maps using only one LiDAR, providing a foundation for zero-shot transfer of the learned policy. The experimental results indicate that MARG maintains stability in various risky terrain tasks.
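For illustration, the basic elevation-mapping step such a terrain map builds on — bin LiDAR points into a 2D grid and keep the maximum height per cell. This is generic background, not the paper's TMG model, which additionally corrects drift:

```python
# Minimal elevation map: max height of LiDAR points per grid cell around the robot.
import numpy as np

def elevation_map(points, res=0.05, size=4.0):
    """points: (N,3) in the robot frame; returns an (n,n) height grid."""
    n = int(size / res)
    grid = np.full((n, n), np.nan)
    ij = ((points[:, :2] + size / 2) / res).astype(int)
    ok = (ij >= 0).all(1) & (ij < n).all(1)
    for (i, j), z in zip(ij[ok], points[ok, 2]):
        if np.isnan(grid[i, j]) or z > grid[i, j]:
            grid[i, j] = z
    return grid

pts = np.random.rand(5000, 3) * [4, 4, 0.3] - [2, 2, 0]
print(np.nanmax(elevation_map(pts)))
```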
|
| |
| 09:00-10:30, Paper WeI1I.336 | Add to My Program |
| Concurrent-Allocation Task Execution for Multi-Robot Path-Crossing-Minimal Navigation in Obstacle Environments |
|
| Hu, Binbin | University of Groningen |
| Yao, Weijia | Hunan University |
| Yanxin, Zhou | Nanyang Technological University |
| Wei, Henglai | University of Victoria |
| Lv, Chen | Nanyang Technological University |
Keywords: Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents, Autonomous Vehicle Navigation, Autonomous Agents
Abstract: In this paper, the concurrent-allocation task execution (CATE) algorithm is presented to address the multi-robot path-crossing-minimal (MPCM) navigation problem in obstacle environments. First, the path-crossing-related elements in terms of (i) robot allocation, (ii) desired-point convergence, and (iii) collision and obstacle avoidance are encoded into integer and control barrier function (CBF) constraints. Then, the proposed constraints are used in an online constrained optimization framework, which implicitly yet effectively minimizes the possible path crossings and trajectory length in obstacle environments by simultaneously minimizing the desired-point allocation cost and the slack variables in the CBF constraints. In this way, MPCM navigation in obstacle environments can be achieved with flexible spatial orderings. Note that the feasibility of solutions and the asymptotic convergence property of the proposed CATE algorithm in obstacle environments are both guaranteed, and the calculation burden is also reduced by concurrently computing the optimal allocation and the control input directly, without a separate path-planning process. Finally, extensive simulations and experiments validate that the CATE algorithm (i) outperforms existing state-of-the-art baselines in terms of feasibility and efficiency in obstacle environments, (ii) is effective in environments with dynamic obstacles and is adaptable for performing various navigation tasks in 2D and 3D, (iii) demonstrates its efficacy and practicality in 2D experiments with a multi-AMR onboard navigation system, and (iv) provides a possible solution to evade deadlocks and pass through narrow gaps.
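A hedged sketch of one building block — a single CBF-constrained control step with a penalized slack variable — under a single-integrator assumption. The full CATE program additionally carries the integer allocation constraints, which this toy omits:

```python
# CBF-QP with slack: track a nominal input while keeping h(x) >= 0 "softly".
import cvxpy as cp
import numpy as np

x = np.array([0.0, 0.0])        # robot position (single-integrator dynamics x' = u)
obs = np.array([1.0, 0.2])      # obstacle center
r, alpha = 0.5, 1.0             # safety radius, class-K gain
u_des = np.array([1.0, 0.0])    # nominal input toward the desired point

h = np.sum((x - obs) ** 2) - r ** 2   # barrier: h >= 0 means safe
grad_h = 2 * (x - obs)                # so h' = grad_h . u

u = cp.Variable(2)
delta = cp.Variable(nonneg=True)      # slack, penalized as in CATE
constraints = [grad_h @ u + alpha * h >= -delta, cp.norm(u, 2) <= 2.0]
prob = cp.Problem(cp.Minimize(cp.sum_squares(u - u_des) + 100 * delta), constraints)
prob.solve()
print(u.value, delta.value)
```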
|
| |
| 09:00-10:30, Paper WeI1I.337 | Add to My Program |
| PKF: Probabilistic Data Association Kalman Filter for Multi-Object Tracking |
|
| Cao, Hanwen | University of California, San Diego |
| Pappas, George J. | University of Pennsylvania |
| Atanasov, Nikolay | University of California, San Diego |
Keywords: Probability and Statistical Methods, Computer Vision for Automation, Visual Tracking
Abstract: In this paper, we derive a new Kalman filter (KF) with probabilistic data association between measurements and states. We formulate a variational inference problem to approximate the posterior density of the state conditioned on the measurement data. We view the unknown data association as a latent variable and apply Expectation Maximization (EM) to obtain a filter with an update step of the same form as the Kalman filter but with an expanded measurement vector of all potential associations. We show that the association probabilities can be computed as permanents of matrices with measurement likelihood entries. We name our probabilistic data association Kalman filter the PKF, with P emphasizing both the probabilistic nature of the data association and the matrix permanent computation of the association weights. We compare the PKF with the well-established Probabilistic Multi-Hypothesis Tracking (PMHT) and Joint Probabilistic Data Association Filter (JPDAF) in both theory and simulated experiments, showing that we can achieve lower tracking errors than both. We also demonstrate the effectiveness of our filter in multi-object tracking (MOT) on multiple real-world datasets, including MOT17, MOT20, and DanceTrack. We achieve tracking results comparable to previous KF-based methods without using velocities or multi-stage data association, while remaining real-time. We further show that our PKF can serve as a backbone for other KF-based trackers by applying it to a method that uses a variety of features for association, improving its results.
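The matrix permanent named in the abstract can be computed with Ryser's inclusion-exclusion formula, which is O(n·2^n) and adequate for the small association matrices arising with a handful of tracks. A small sketch (the paper's own computation may differ):

```python
# Ryser's formula: perm(A) = sum over nonempty column subsets S of
# (-1)^(n-|S|) * prod_i (sum_{j in S} A[i, j]).
import itertools
import numpy as np

def permanent(A):
    n = A.shape[0]
    total = 0.0
    for k in range(1, n + 1):
        for cols in itertools.combinations(range(n), k):
            row_sums = A[:, cols].sum(axis=1)
            total += (-1) ** (n - k) * np.prod(row_sums)
    return total

A = np.array([[0.9, 0.1], [0.2, 0.8]])  # toy likelihood matrix
print(permanent(A))                      # 0.9*0.8 + 0.1*0.2 = 0.74
```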
|
| |
| 09:00-10:30, Paper WeI1I.338 | Add to My Program |
| Robust and Scalable Multi-Robot Localization Using Stereo UWB Arrays |
|
| Zhao, Hanying | Tsinghua University |
| Xu, Lingwei | Tsinghua University |
| Li, Yi | Tsinghua University |
| Wen, Feiyang | Tsinghua University |
| Gao, Haoran | Tsinghua |
| Liu, Changwu | Tsinghua University |
| Yu, Jincheng | Tsinghua University |
| Wang, Yu | Tsinghua University |
| Shen, Yuan | Tsinghua University |
Keywords: Multi-Robot Systems, Localization, Autonomous Vehicle Navigation, UWB Array
Abstract: In environments where robots operate with limited global navigation satellite system accessibility, ultra-wideband (UWB) localization technology is a popular auxiliary solution to assist visual–inertial odometry systems. However, current UWB approaches lack 3-D pairwise localization capability and suffer from rapidly declining localization update rates as the network scales, limiting their effectiveness for swarm robotic applications. This article presents a novel UWB sensor that enables 3-D pairwise localization and a localization scheme that can deliver robust, scalable, and accurate position awareness for multi-robot systems. Our approach begins with calibrating intrinsic UWB errors from hardware deviations and propagation effects, yielding high-accuracy distance and direction measurements. Using these measurements, we perform distributed relative localization through inter- and intra-node cooperation by integrating UWB and inertial measurement unit data. To enable swarm-scale operation, our platform implements the signal-multiplexing network ranging protocol to maximize update rates and network capacity. Experimental results show that our approach achieves centimeter-level localization accuracy at high update rates (100 Hz with UWB only), validating its robustness, scalability, and accuracy for robotic applications.
|
| |
| 09:00-10:30, Paper WeI1I.339 | Add to My Program |
| AORRTC: Almost-Surely Asymptotically Optimal Planning with RRT-Connect |
|
| Wilson, Tyler S. | Queen's University |
| Thomason, Wil | The AI Institute |
| Kingston, Zachary | Purdue University |
| Gammell, Jonathan | Queen's University |
Keywords: Constrained Motion Planning, Manipulation Planning, Task and Motion Planning
Abstract: Finding high-quality solutions quickly is an important objective in motion planning. This is especially true for high-degree-of-freedom robots. Satisficing planners have traditionally found feasible solutions quickly but provide no guarantees on their optimality, while almost-surely asymptotically optimal (a.s.a.o.) planners have probabilistic guarantees on their convergence towards an optimal solution but are more computationally expensive. This paper uses the AO-x meta-algorithm to extend the satisficing RRT-Connect planner to optimal planning. The resulting Asymptotically Optimal RRT-Connect (AORRTC) finds initial solutions in similar times as RRT-Connect and uses any additional planning time to converge towards the optimal solution in an anytime manner. It is proven to be probabilistically complete and a.s.a.o. AORRTC was tested with the Panda (7 DoF) and Fetch (8 DoF) robotic arms on the MotionBenchMaker dataset. These experiments show that AORRTC finds initial solutions as fast as RRT-Connect and faster than the tested state-of-the-art a.s.a.o. algorithms while converging to better solutions faster. AORRTC finds solutions to difficult high-DoF planning problems in milliseconds where the other a.s.a.o. planners could not consistently find solutions in seconds. This performance was demonstrated both with and without single instruction/multiple data (SIMD) acceleration.
|
| |
| 09:00-10:30, Paper WeI1I.340 | Add to My Program |
| Globally-Stable and Robust Image-Based Visual Servoing for Positioning with Respect to a Cylinder |
|
| Colotti, Alessandro | IMT Atlantique |
| Chaumette, Francois | Inria Center at University of Rennes |
Keywords: Visual Servoing, Sensor-based Control
Abstract: This letter proposes a new image-based visual servoing controller for positioning a camera with respect to a cylindrical object. Traditional image-based approaches often rely on estimating planar parameters from the cylinder's projected edges, making them sensitive to noise and modeling errors. In this work, we introduce a novel controller that uses pure image features while being directly tied to the cylinder's 3D pose, which depends solely on the cylinder radius. Crucially, this controller offers formal global stability irrespective of the radius estimate. Simulations and real experiments with a robotic arm confirm the controller's improved convergence and robustness under practical conditions.
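For context, the classical IBVS law this line of work builds on — the standard textbook form, not the letter's new controller:

```python
# Classical IBVS: camera velocity v = -lambda * L^+ (s - s*),
# with L the interaction (image Jacobian) matrix and L^+ its pseudo-inverse.
import numpy as np

def ibvs_velocity(s, s_star, L, lam=0.5):
    """Camera velocity screw (6,) from the feature error via the
    Moore-Penrose pseudo-inverse of the interaction matrix."""
    error = s - s_star
    return -lam * np.linalg.pinv(L) @ error

# toy example: 4 point features -> 8-dim feature vector, L is 8x6
L = np.random.randn(8, 6)
s, s_star = np.random.randn(8), np.random.randn(8)
v = ibvs_velocity(s, s_star, L)
print(v.shape)  # (6,): (vx, vy, vz, wx, wy, wz)
```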
|
| |
| 09:00-10:30, Paper WeI1I.341 | Add to My Program |
| SSF-PAN: Semantic Scene Flow-Based Perception for Autonomous Navigation in Traffic Scenarios |
|
| Chen, Yinqi | Southern University of Science and Technology |
| Zhang, Meiying | Southern University of Science and Technology |
| Hao, Qi | Southern University of Science and Technology |
| Zhou, Guang | Deeproute |
Keywords: Intelligent Transportation Systems, Autonomous Vehicle Navigation, Semantic Scene Understanding
Abstract: Vehicle detection and localization in complex traffic scenarios pose significant challenges due to the interference of moving objects. Traditional methods often rely on outlier exclusion or semantic segmentation, suffering from low computational efficiency and accuracy. The proposed SSF-PAN achieves LiDAR point-cloud-based object detection/localization and SLAM (Simultaneous Localization and Mapping) with high computational efficiency and accuracy, enabling map-free navigation frameworks. The novelty of this work is threefold: 1) developing a neural network which can achieve segmentation between static and dynamic objects within scene flows with different motion features, that is, semantic scene flow (SSF); 2) developing an iterative framework which can further optimize the quality of input scene flows and output segmentation results; 3) developing a scene flow-based navigation platform which can test the performance of the SSF perception system in a simulation environment. The proposed SSF-PAN method is validated using the SUScape-CARLA and the KITTI datasets, as well as on the CARLA simulator. Experimental results demonstrate that the proposed approach outperforms traditional methods in terms of scene flow computation accuracy, moving object detection accuracy, computational efficiency, and autonomous navigation effectiveness.
|
| |
| 09:00-10:30, Paper WeI1I.342 | Add to My Program |
| Adaptive-Cloud: Dynamic Computation Control for 3D Object Detection from LIDAR Point Clouds |
|
| Mohammad, Mir Sayeed | Georgia Institute of Technology |
| Kamal, Uday | Georgia Institute of Technology, Amazon |
| Mukhopadhyay, Saibal | Georgia Institute of Technology |
Keywords: Computer Vision for Transportation, Representation Learning, Object Detection, Segmentation and Categorization
Abstract: In this work, we introduce an adaptive hierarchical framework for efficient 3D object detection from point cloud data, designed to dynamically balance computational efficiency and detection performance. Our approach employs a shared feature extractor and multiple detector backbones of varying widths, enabling selective activation of models based on the complexity of the input scene. A novel feature gating mechanism dynamically determines the most relevant features for reduced-width backbones, while a surrogate loss prediction module ranks models in real-time, ensuring optimal backbone selection with minimal overhead. This adaptive strategy reduces computational costs by 41.4% while maintaining a negligible 2.44% reduction in detection accuracy across a range of real-world driving scenes (urban, highway, residential, campus, person) from the KITTI dataset. By addressing runtime adaptability—a critical gap in existing 3D detection frameworks—our method provides a significant improvement for deploying high-performance detection models in resource-constrained environments.
|
| |
| 09:00-10:30, Paper WeI1I.343 | Add to My Program |
| Effects of Wrist-Worn Haptic Feedback on Force Accuracy and Task Speed During a Teleoperated Robotic Surgery Task |
|
| Vuong, Brian Binh | Stanford University |
| Davidson, Josie | Stanford University |
| Cheon, Sangheui | Seoul National University |
| Cho, Kyu-Jin | Seoul National University, Biorobotics Laboratory |
| Okamura, Allison M. | Stanford University |
Keywords: Haptics and Haptic Interfaces, Medical Robots and Systems
Abstract: Previous work has shown that adding haptic feedback to the hands can improve awareness of tool-tissue interactions and enhance performance of teleoperated tasks in robot-assisted minimally invasive surgery. However, hand-based haptic feedback occludes direct interaction with the manipulanda of surgeon consoles. We propose relocating haptic feedback to the wrist using a wearable haptic device. It is unknown if such feedback will be effective, given that it is not co-located with the finger movements used for manipulation. To test if relocated haptic feedback improves force application during teleoperated tasks using the da Vinci Research Kit (dVRK) surgical robot, participants learned to palpate a phantom tissue to desired forces. A soft pneumatic wrist-worn haptic device with an anchoring system renders tool-tissue interaction forces to the wrist of the user. Participants demonstrated statistically significant lower force error and performed the palpation task with longer movement times when provided wrist-worn haptic feedback.
|
| |
| 09:00-10:30, Paper WeI1I.344 | Add to My Program |
| Inferring Foresightedness in Dynamic Noncooperative Games |
|
| Armstrong, Cade | University of Texas at Austin |
| Park, Ryan | University of Texas at Austin |
| Liu, Xinjie | Delft University of Technology |
| Gupta, Kushagra | The University of Texas at Austin |
| Fridovich-Keil, David | The University of Texas at Austin |
Keywords: Autonomous Agents, Human-Aware Motion Planning, Probabilistic Inference
Abstract: Dynamic game theory is an increasingly popular tool for modeling multi-agent (e.g., human-robot) interactions. Game-theoretic models presume that each agent wishes to minimize a private cost function that depends on others' actions. These games typically evolve over a fixed time horizon, specifying how far into the future each agent plans. In practical settings, however, decision-makers may vary in foresightedness, or how much they care about their current cost in relation to their past and future costs. We conjecture that quantifying and estimating each agent's foresightedness from online data will enable safer and more efficient interactions with other agents. To this end, we frame this inference problem as an inverse dynamic game. We consider a specific objective function parametrization that smoothly interpolates myopic and farsighted planning. Games of this form are readily transformed into parametric mixed complementarity problems; we exploit the directional differentiability of solutions to these problems with respect to their hidden parameters to solve for agents' foresightedness. We conduct three experiments: one with synthetically generated delivery robot motion, one with real-world data involving people walking, biking, and driving vehicles, and one using high-fidelity simulators. The results of these experiments demonstrate that explicitly inferring agents' foresightedness enables game-theoretic models to predict agents' behavior 33% more accurately.
|
| |
| 09:00-10:30, Paper WeI1I.345 | Add to My Program |
| Frictional and Prismatic Pin-Array Gripper for Universal Gripping and Stable Tool Manipulation |
|
| Lee, Cheonghwa | Kumoh National Institute of Technology |
| Kim, Hyeongwon | Seoul National University |
| Oh, Midum | Seoul National University |
| Ok, Kisu | Seoul National University |
| Ahn, Sung-Hoon | Seoul National University |
Keywords: Pin-array gripper, Grippers and Other End-Effectors, Dexterous Manipulation, Mobile Manipulation
Abstract: The global trend in robotics has shifted toward deploying humanoid robots and mobile manipulators in industrial settings to automate repetitive and structured tasks traditionally performed by human workers. However, most tools and equipment are designed for human hands, and current grippers or end-effectors are highly specialized, limiting their ability to fully replace human handling of simple tools and tasks. This study proposes a novel frictional and prismatic pin-array gripper for universal gripping and tool manipulation. The pin-array structure of the gripper mimics the behavior of soft grippers while incorporating rigid components, enabling adaptability to various shapes and sizes. Each pin features semi-automatic actuation through a compression spring, supporting the underactuated mechanism. Most existing studies on grippers focus on simple pick-and-place tasks, whereas the proposed gripper extends functionality to practical tool usage. Enabled by the pin-array structure, it provides an increased contact surface and more support points, ensuring stable gripping and enhanced manipulation performance. In the evaluation, the pin-array gripper achieved a payload capacity of 2400 g, significantly outperforming the conventional RG2-FT gripper and the frictional flat gripper, which reached maximum capacities of 800 g and 400 g, respectively. It also exhibited higher grasping forces, measuring 1.17 times greater than the RG2-FT gripper and up to 23 times greater than the frictional flat gripper. For tool manipulation, the pin-array gripper exhibited significantly lower manipulation errors, with 21.67 and 6.59 times fewer errors than the RG2-FT and flat grippers, respectively, when handling the hammer, and 7.69 and 4.45 times fewer for the metal file. Additionally, qualitative demonstrations of universal gripping, omnidirectional gripping, and tool usage further validated the gripper's performance in mobile manipulator tasks.
|
| |
| 09:00-10:30, Paper WeI1I.346 | Add to My Program |
| LLIO: Lidar-Kinematic-Inertial Odometry with Ground Contact Constraints for Legged Robots |
|
| Gu, Chengjie | Nanjing University of Science and Technology |
| Xie, Zhongqu | Nanjing University of Science and Technology |
| Xiang, Beichen | Nanjing University of Science and Technology |
| Zhou, Shichao | Nanjing University of Science and Technology |
| Chen, Lingkun | Nanjing University of Science and Technology |
| Ci, Binbin | Nanjing University of Science and Technology |
| Wang, Yulin | Nanjing University of Science and Technology |
Keywords: SLAM, Legged Robots, Localization
Abstract: This letter presents a robust multi-sensor fusion framework for state estimation in legged robots (LLIO) based on an iterated extended Kalman filter. To address the limitations of IMU-based prior estimation, which often leads to localization errors or failures on legged robots, our method integrates the contact constraints between the robot's leg kinematics and the ground. By introducing a sliding window-based ground contact constraint module, we effectively combine the contact state of the legged robot's feet with ground features, enhancing the constraints in complex environments and reducing localization drift. Additionally, factor graph optimization minimizes global cumulative drift. The proposed method has been extensively evaluated in numerous experiments and on relevant public datasets. The results demonstrate that our approach significantly reduces local drift and achieves better computational efficiency.
|
| |
| 09:00-10:30, Paper WeI1I.347 | Add to My Program |
| Doppler-SLAM: Doppler-Aided Radar-Inertial and LiDAR-Inertial Simultaneous Localization and Mapping |
|
| Wang, Dong | University of Würzburg |
| Haag, Hannes | Technische Hochschule Nürnberg Georg-Simon-Ohm |
| Casado Herraez, Daniel | University of Bonn & CARIAD SE |
| May, Stefan | Nuremberg Institute of Technology Georg Simon Ohm |
| Stachniss, Cyrill | University of Bonn |
| Nuechter, Andreas | Julius-Maximilians-Universität Würzburg |
Keywords: SLAM, Data Sets for SLAM, Mapping
Abstract: Simultaneous localization and mapping is a critical capability for autonomous systems. Traditional SLAM approaches often rely on visual or LiDAR sensors and face significant challenges in adverse conditions such as low light or featureless environments. To overcome these limitations, we propose a novel Doppler-aided radar-inertial and LiDAR-inertial SLAM framework that leverages the complementary strengths of 4D radar, FMCW LiDAR, and inertial measurement units. Our system integrates Doppler velocity measurements and spatial data into a tightly-coupled front-end and graph optimization back-end to provide enhanced ego velocity estimation, accurate odometry, and robust mapping. We also introduce a Doppler-based scan-matching technique to improve front-end odometry in dynamic environments. In addition, our framework incorporates an innovative online extrinsic calibration mechanism, utilizing Doppler velocity and loop closure to dynamically maintain sensor alignment. Extensive evaluations on both public and proprietary datasets show that our system significantly outperforms state-of-the-art radar-SLAM and LiDAR-SLAM frameworks in terms of accuracy and robustness. To encourage further research, the code of our Doppler-SLAM and our dataset are available at: https://github.com/Wayne-DWA/Doppler-SLAM.
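For context, a sketch of the standard Doppler ego-velocity estimate such front-ends build on: each static return constrains the ego velocity along its bearing, so a least-squares fit over many returns recovers it. Sign conventions vary by sensor, and robust variants wrap this in RANSAC to reject moving objects:

```python
# Ego-velocity from Doppler returns: radial_speed_i ≈ -d_i . v_ego for static points.
import numpy as np

def ego_velocity(directions, radial_speeds):
    """directions: (N,3) unit bearing vectors; radial_speeds: (N,) measured Doppler."""
    v, *_ = np.linalg.lstsq(directions, -radial_speeds, rcond=None)
    return v

rng = np.random.default_rng(0)
d = rng.normal(size=(100, 3))
d /= np.linalg.norm(d, axis=1, keepdims=True)
v_true = np.array([2.0, 0.3, 0.0])
meas = -d @ v_true + 0.05 * rng.normal(size=100)  # noisy Doppler measurements
print(ego_velocity(d, meas))                       # ≈ v_true
```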
|
| |
| 09:00-10:30, Paper WeI1I.348 | Add to My Program |
| Analytical and Computational Modeling of a Stop-Rotor Aircraft with Experimental Validation |
|
| Hilby, Kristan | Massachusetts Institute of Technology |
| Hunter, Ian | MIT |
Keywords: Dynamics, Aerial Systems: Mechanics and Control
Abstract: Stop-rotor aircraft are a class of vertical takeoff and landing (VTOL) vehicle that offer improved efficiency across flight modes through the use of a single central lifting surface. In VTOL, the central lifting surface rotates like a helicopter blade to achieve an upward force. In forward flight, the central lifting surface locks in place like a conventional fixed-wing aircraft and achieves lift from airflow over the surface. The improved efficiency across flight modes enables more complex mission profiles that balance flight time in VTOL and forward flight, such as package delivery and inspection over a large area. Despite the promise of stop-rotor aircraft, challenges in modeling and control, particularly due to the nonlinear rotor dynamics across flight modes, have limited practical implementation. To this end, this paper presents two types of models: (1) analytical models, derived from first-principles physics, which provide insight into the stability and control of the vehicle and demonstrate closed-loop stability of yaw and altitude using classical PID control; and (2) computational models, based on numerical integration of the system's ordinary differential equations, which provide the full-state dynamics of the vehicle. Validation against bench-top constrained flight tests shows that the analytical models capture over 97% of the variance in the computational results, while the computational models account for up to 40% of the variance observed in experimental data.
|
| |
| 09:00-10:30, Paper WeI1I.349 | Add to My Program |
| RT-GuIDE: Real-Time Gaussian Splatting for Information-Driven Exploration |
|
| Tao, Yuezhan | University of Pennsylvania |
| Ong, Dexter | University of Pennsylvania |
| Murali, Varun | Texas A&M University |
| Spasojevic, Igor | University of California, Riverside |
| Chaudhari, Pratik | University of Pennsylvania |
| Kumar, Vijay | University of Pennsylvania |
Keywords: View Planning for SLAM, Mapping, Perception-Action Coupling
Abstract: We propose a framework for active mapping and exploration that leverages Gaussian splatting for constructing dense maps. Further, we develop a GPU-accelerated motion planning algorithm that can exploit the Gaussian map for real-time navigation. The Gaussian map constructed onboard the robot is optimized for both photometric and geometric quality while enabling real-time situational awareness for autonomy. We show through viewpoint selection experiments that our method yields comparable Peak Signal-to-Noise Ratio (PSNR) and similar reconstruction error to state-of-the-art approaches, while being orders of magnitude faster to compute. In closed-loop physics-based simulation and real-world experiments, our algorithm achieves better map quality (at least 0.8dB higher PSNR and more than 16% higher geometric reconstruction accuracy) than maps constructed by a state-of-the-art method, enabling semantic segmentation using off-the-shelf open-set models.
|
| |
| 09:00-10:30, Paper WeI1I.350 | Add to My Program |
| GaussianVLM: Scene-Centric 3D Vision-Language Models Using Language-Aligned Gaussian Splats for Embodied Reasoning and Beyond |
|
| Halacheva, Anna-Maria | INSAIT, Sofia University "St. Kliment Ohridski" |
| Zaech, Jan-Nico | INSAIT, Sofia University "St. Kliment Ohridski" |
| Wang, Xi | ETH Zurich |
| Paudel, Danda Pani | INSAIT, Sofia University "St. Kliment Ohridski" |
| Van Gool, Luc | INSAIT, Sofia University "St. Kliment Ohridski" |
Keywords: Semantic Scene Understanding, AI-Based Methods, Deep Learning for Visual Perception
Abstract: As multimodal language models advance, their application to 3D scene understanding is a fast-growing frontier, driving the development of 3D Vision-Language Models (VLMs). Current methods show strong dependence on object detectors, introducing processing bottlenecks and limitations in taxonomic flexibility. To address these limitations, we propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. Our approach directly embeds rich linguistic features into the 3D scene representation by associating language with each Gaussian primitive, achieving early modality alignment. To process the resulting dense representations, we introduce a dual sparsifier that distills them into compact, task-relevant tokens via task-guided and location-guided pathways, producing sparse, task-aware global and local scene tokens. Notably, we present the first Gaussian splatting-based VLM, leveraging photorealistic 3D representations derived from standard RGB images, and demonstrate strong generalization: it improves the performance of prior 3D VLMs fivefold in out-of-domain settings.
|
| |
| 09:00-10:30, Paper WeI1I.351 | Add to My Program |
| Design and Modeling of a Reconfigurable Robot: Decoupled STAR (DSTAR) |
|
| Siboni, Tomer | Ben Gurion University of the Negev |
| Coronel, Matan | Ben Gurion University of the Negev |
| Zarrouk, David | Ben Gurion University |
Keywords: Mechanism Design, Field Robots, Search and Rescue Robots
Abstract: This paper presents Decoupled STAR (DSTAR), a novel reconfigurable robot fitted with a sprawling mechanism that allows the wheel rotation axes to vary relative to the body, and two independently activated four-bar extension mechanisms (FBEM). These mechanisms enable the robot to move its center of mass (COM) in any direction, and increase its maneuvering capabilities by selecting a variety of locomotion gaits. A kinematic model of the robot and a quasi-static force analysis are used to optimize the design and evaluate its motor requirements. Experiments demonstrate that combining the sprawling mechanism with FBEM enables the DSTAR to both crawl and drive, overcome a wide range of challenging obstacles, and improve its climbing capability by 66% compared to symmetric FBEM designs (such as RSTAR). The robot can crawl and maneuver over rough terrain using its unique turtle-gait method, roll sideways to surmount wall obstacles up to 20 cm high, travel horizontally across uneven ground, and switch between wheels and whegs to adapt to different terrain types, including dirt, stones, and grass.
|
| |
| 09:00-10:30, Paper WeI1I.352 | Add to My Program |
| DynoSAM: Open-Source Smoothing and Mapping Framework for Dynamic SLAM |
|
| Morris, Jesse | University of Sydney |
| Wang, Yiduo | University of Sydney |
| Kliniewski, Mikolaj | The University of Sydney |
| Ila, Viorela | The University of Sydney |
Keywords: SLAM, Mapping, RGB-D Perception, Dynamic SLAM
Abstract: Traditional Visual Simultaneous Localization and Mapping systems focus solely on static scene structures, overlooking dynamic elements in the environment. Although effective for accurate visual odometry in complex scenarios, these methods discard crucial information about moving objects. By incorporating this information into a Dynamic SLAM framework, the motion of dynamic entities can be estimated, enhancing navigation whilst ensuring accurate localization. However, the fundamental formulation of Dynamic SLAM remains an open challenge, with no consensus on the optimal approach for accurate motion estimation within a SLAM pipeline. Therefore, we developed DynoSAM, an open-source framework for Dynamic Objects SLAM that enables the efficient implementation, testing, and comparison of various Dynamic SLAM optimization formulations. We further propose a novel formulation that encodes rigid-body motion model in object pose estimation as well as an error metric agnostic to object frame definition. DynoSAM integrates static and dynamic measurements into a unified optimization problem solved using factor graphs, simultaneously estimating camera poses, static scene, object motion or poses, and object structures. We evaluate DynoSAM across diverse simulated and real-world datasets, achieving state-of-the-art motion estimation in indoor and outdoor environments, with substantial improvements over existing systems. Additionally, we demonstrate DynoSAM's contributions to downstream applications, including 3D reconstruction of dynamic scenes and trajectory prediction, thereby showcasing potential for advancing dynamic object-aware SLAM systems. Code is open-sourced.
|
| |
| 09:00-10:30, Paper WeI1I.353 | Add to My Program |
| Following Is All You Need: Robot Crowd Navigation Using People As Planners |
|
| Liao, Yuwen | Nanyang Technological University |
| Xu, Xinhang | Nanyang Technological University |
| Bai, Ruofei | Nanyang Technological University |
| Yang, Yizhuo | Nanyang Technological University |
| Cao, Muqing | Carnegie Mellon University |
| Yuan, Shenghai | Nanyang Technological University |
| Xie, Lihua | Nanyang Technological University |
Keywords: Human-Aware Motion Planning, Safety in HRI, Social HRI
Abstract: Navigating in crowded environments requires the robot to be equipped with high-level reasoning and planning techniques. Existing works focus on developing complex and heavyweight planners while ignoring the role of human intelligence. Since humans are highly capable agents who are also widely available in a crowd navigation setting, we propose an alternative scheme where the robot utilises people as planners to benefit from their effective planning decisions and social behaviours. Through a set of rule-based evaluations, we identify suitable human leaders who exhibit the potential to guide the robot towards its goal. Using a simple base planner, the robot follows the selected leader through short-horizon subgoals that are designed to be straightforward to achieve. We demonstrate through both simulated and real-world experiments that our novel framework generates safe and efficient robot plans compared to existing planners, even without predictive or data-driven modules. Our method also brings human-like robot behaviours without explicitly defining traffic rules and social norms. Code will be available at https://github.com/centiLinda/PeopleAsPlanner.git.
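A toy version of the rule-based leader-selection idea, as we read it — the paper's actual evaluation rules differ, and all weights here are invented:

```python
# Score each nearby pedestrian by goal alignment and proximity; follow the best one.
import numpy as np

def select_leader(robot_pos, goal, people_pos, people_vel):
    to_goal = goal - robot_pos
    to_goal = to_goal / np.linalg.norm(to_goal)
    scores = []
    for p, v in zip(people_pos, people_vel):
        speed = np.linalg.norm(v)
        heading = v / speed if speed > 1e-6 else np.zeros(2)
        alignment = heading @ to_goal                     # walking toward our goal?
        proximity = 1.0 / (1.0 + np.linalg.norm(p - robot_pos))
        scores.append(alignment + 0.5 * proximity)        # invented weighting
    return int(np.argmax(scores))

idx = select_leader(np.zeros(2), np.array([5.0, 0.0]),
                    [np.array([1.0, 0.5]), np.array([1.0, -0.5])],
                    [np.array([1.0, 0.0]), np.array([-1.0, 0.0])])
print(idx)  # 0: the person heading the same way as the robot's goal
```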
|
| |
| 09:00-10:30, Paper WeI1I.354 | Add to My Program |
| Real-Time Localization Scoring for Challenging Industrial Environments |
|
| Yilmaz, Abdurrahman | Istanbul Technical University |
| Dumandag, Umut | Bluepath Robotics |
| Sari, Aydin Cagatay | Bluepath Robotics |
| Savci, Ismail Hakki | Bluepath Robotics |
| Temeltas, Hakan | Istanbul Technical University |
Keywords: Localization, Autonomous Vehicle Navigation, SLAM
Abstract: Autonomous Mobile Robots (AMRs) are revolutionizing industries by enhancing flexibility and efficiency, particularly in dynamic environments such as automotive manufacturing. These environments pose challenges due to their constantly changing layouts, unpredictable obstacles, and varying conditions, which impact the performance of localization systems. This paper presents a novel real-time localization scoring architecture to address these challenges by quantifying the confidence in a robot's positioning system. The proposed Localization Score improves map reconciliation, manages sensor interference, adapts navigation strategies, and enhances traffic coordination. Extensive experimental studies, including real-world deployment in an operational automotive production factory, demonstrate the robustness, accuracy, and adaptability of the developed Localization Score algorithm. The results showcase its potential to significantly enhance the operational efficiency and reliability of AMRs in industrial settings.
|
| |
| 09:00-10:30, Paper WeI1I.355 | Add to My Program |
| Flow-Aided Flight through Dynamic Clutters from Point to Motion |
|
| Xu, Bowen | The University of Hong Kong |
| Yan, Zexuan | University of Hong Kong |
| Lu, Minghao | The University of Hong Kong |
| Fan, Xiyu | Hong Kong University |
| Luo, Yi | The University of Hong Kong |
| Lin, Youshen | The University of Hong Kong (HKU) |
| Chen, Zhiqiang | The University of Hong Kong |
| Chen, Yeke | The University of Hong Kong |
| Qiao, Qiyuan | The University of Hong Kong |
| Lu, Peng | The University of Hong Kong |
Keywords: Aerial Systems: Perception and Autonomy, Reinforcement Learning, Autonomous Vehicle Navigation
Abstract: Challenges in traversing dynamic clutters lie mainly in the efficient perception of environmental dynamics and the generation of evasive behaviors that account for obstacle movement. Previous solutions have made progress in explicitly modeling dynamic obstacle motion for avoidance, but this key dependency of decision-making is time-consuming and unreliable in highly dynamic scenarios with occlusions. On the contrary, without introducing object detection, tracking, and prediction, we empower sim-to-real reinforcement learning (RL) with single-LiDAR sensing to realize an autonomous flight system directly from point to motion. For exteroception, a depth-sensing distance map that is fixed in shape, low in resolution, and detail-safe is encoded from raw point clouds, and a point flow sensing environmental change is adopted as the motion feature extracted from multi-frame observations. These two are integrated into a lightweight and easy-to-learn representation of complex dynamic environments. For action generation, the behavior of avoiding dynamic threats in advance is implicitly driven by the proposed change-aware sensing representation, where policy optimization is guided by a relative-motion-modulated distance field. With deployment-friendly sensing simulation and dynamics-model-free acceleration control, the proposed system shows a superior success rate and adaptability compared to alternatives, and the policy derived from the simulator can drive a real-world quadrotor with safe maneuvers.
|
| |
| 09:00-10:30, Paper WeI1I.356 | Add to My Program |
| A Lightweight Physics-Informed Neural Network for Sim-To-Real of Biped Robot |
|
| Liu, Yan | Harbin Institute of Technology |
| Zang, XiZhe | Harbin Institute of Technology |
| Zhang, Xuehe | Harbin Institute of Technology |
| Song, Chao | Harbin Institute of Technology |
| Chen, Boyang | Harbin Institute of Technology |
| Zhao, Jie | Harbin Institute of Technology |
Keywords: Humanoid and Bipedal Locomotion, Reinforcement Learning, Machine Learning for Robot Control
Abstract: In this paper, we present a low-cost, easy-to-implement sim-to-real framework for biped locomotion that narrows the reality gap using only simulation data, without motion-capture or additional real-world measurements. First, a walking policy for the BRUCE robot is trained in Isaac Gym via reinforcement learning. Next, we develop a compact, physics-informed neural network (PINN) grounded in Euler-Lagrange structure and augmented with an LSTM to predict simulator forward dynamics. Trained solely on simulation trajectories, the PINN forecasts next-step joint angles and velocities of the simulated robot given the physical robot's current state and control inputs. During hardware deployment, and consistent with a whole-body control architecture, these predicted states serve as reference joint states while the policy outputs provide feedforward torque commands; a feedforward-plus-feedback torque controller then computes the executed joint torques, thereby reducing the sim-to-real gap. Experiments on BRUCE demonstrate that our method better reproduces simulated behavior and attains higher tracking accuracy than direct policy transfer. Furthermore, the dynamics predictor runs at 1 kHz on embedded hardware, showing superior real-time performance relative to existing learning-based models.
|
| |
| 09:00-10:30, Paper WeI1I.357 | Add to My Program |
| ARTEMIS: Autoregressive End-To-End Trajectory Planning with Mixture of Experts for Autonomous Driving |
|
| Feng, Renju | Wuhan University of Technology |
| Xi, Ning | Wuhan University of Technology |
| Chu, Duanfeng | Wuhan University of Technology |
| Wang, Rukang | Wuhan University of Technology |
| Deng, Zejian | University of Waterloo |
| Wang, Anzheng | Wuhan University of Technology |
| Lu, Liping | Wuhan University of Technology |
| Wang, Jinxiang | Southeast University |
| Huang, Yanjun | Tongji University |
Keywords: Autonomous Vehicle Navigation, Integrated Planning and Learning, Deep Learning Methods
Abstract: This paper presents ARTEMIS, an end-to-end autonomous driving framework that combines autoregressive trajectory planning with Mixture-of-Experts (MoE). Traditional modular methods suffer from error propagation, while existing end-to-end models typically employ static one-shot inference paradigms that inadequately capture the dynamic changes of the environment. ARTEMIS takes a different approach: by generating trajectory waypoints sequentially, it preserves critical temporal dependencies while dynamically routing scene-specific queries to specialized expert networks. It effectively relieves the trajectory quality degradation encountered when guidance information is ambiguous, and overcomes the inherent representational limitations of singular network architectures when processing diverse driving scenarios. Additionally, we use a lightweight batch reallocation strategy that significantly improves the training speed of the Mixture-of-Experts model. In experiments on the NAVSIM dataset, ARTEMIS achieves 87.0 PDMS and 83.1 EPDMS with a ResNet-34 backbone, demonstrating state-of-the-art performance on multiple metrics.
|
| |
| 09:00-10:30, Paper WeI1I.358 | Add to My Program |
| Riemannian Time Warping: Multiple Sequence Alignment in Curved Spaces |
|
| Richter, Julian | Technical University of Braunschweig |
| Erdös, Christopher Andrew | Technische Universität München |
| Scheurer, Christian | KUKA Laboratories GmbH |
| Steil, Jochen J. | Technische Universität Braunschweig |
| Dehio, Niels | KUKA |
Keywords: AI-Based Methods, Learning from Demonstration
Abstract: Temporal alignment of multiple signals through time warping is crucial in many fields, such as classification within speech recognition or robot motion learning. Almost all related works are limited to data in Euclidean space. Although an attempt was made in 2011 to adapt this concept to unit quaternions, a general extension to Riemannian manifolds remains absent. Given its importance for numerous applications in robotics and beyond, we introduce Riemannian Time Warping (RTW). This novel approach efficiently aligns multiple signals by considering the geometric structure of the Riemannian manifold in which the data is embedded. Extensive experiments on synthetic and real-world data, including tests with an LBR iiwa robot, demonstrate that RTW consistently outperforms state-of-the-art baselines in both averaging and classification tasks.
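For readers unfamiliar with time warping beyond Euclidean data, the sketch below shows the special case the abstract alludes to: dynamic time warping of two unit-quaternion signals using the geodesic distance d(p, q) = 2 arccos(|&lt;p, q&gt;|) in place of the Euclidean metric. RTW itself generalizes much further (multiple signals, arbitrary Riemannian manifolds), so this is only a toy baseline.

```python
# Minimal sketch: classic DTW with a quaternion geodesic metric.
import numpy as np

def quat_dist(p, q):
    """Geodesic distance between unit quaternions (sign-invariant)."""
    return 2.0 * np.arccos(np.clip(abs(np.dot(p, q)), 0.0, 1.0))

def dtw(A, B, dist=quat_dist):
    """A: (n, 4), B: (m, 4) unit quaternions -> accumulated alignment cost."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(A[i - 1], B[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def rand_quats(n):
    q = np.random.randn(n, 4)
    return q / np.linalg.norm(q, axis=1, keepdims=True)

print(dtw(rand_quats(50), rand_quats(60)))
```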
|
| |
| 09:00-10:30, Paper WeI1I.359 | Add to My Program |
| IMPACT: Behavioral Intention-Aware Multimodal Trajectory Prediction with Adaptive Context Trimming |
|
| Sun, Jiawei | National University of Singapore |
| Yue, Xibin | Xiaomi EV |
| Li, Jiahui | National University of Singapore |
| Shen, Tianle | National University of Singapore |
| Yuan, Chengran | National University of Singapore |
| Sun, Shuo | Singapore-MIT Alliance for Research and Technology (SMART) |
| Guo, Sheng | National University of Singapore |
| Zhou, Quanyun | XiaoMi |
| Ang Jr, Marcelo H | National University of Singapore |
Keywords: Motion and Path Planning, Computer Vision for Transportation, Task and Motion Planning
Abstract: This paper presents a unified framework that jointly predicts behavioral intentions and vectorized occupancy, leveraging them as priors to dynamically prune context information during trajectory decoding, thereby enhancing prediction accuracy, interpretability, and efficiency. While most prior work has focused on boosting the precision of multimodal trajectory prediction, explicit modeling of behavioral intentions (e.g., yielding, overtaking) remains underexplored. To this end, we employ a shared context encoder for both intention and trajectory predictions, thereby reducing structural redundancy and information loss. Moreover, we address the lack of ground-truth behavioral intention labels in mainstream datasets (Waymo, Argoverse) by auto-labeling these datasets, thus advancing the community’s efforts in this direction. We further introduce a vectorized occupancy prediction module that infers the probability of each map polyline being occupied by the target vehicle’s future trajectory. By leveraging these intention and occupancy priors, our method conducts dynamic, modality-dependent pruning of irrelevant agents and map polylines in the decoding stage, effectively reducing computational overhead and mitigating noise from non-critical elements. Our approach ranks first among LiDAR-free methods on the Waymo Motion Dataset and achieves SOTA performance on the Waymo Interactive Prediction Dataset. Remarkably, even without model ensembling, our single-model framework improves the softmAP by 10% compared to the previous SOTA method, BETOP, on the Waymo Interactive Prediction Leaderboard. Furthermore, the proposed framework has been successfully deployed on real vehicles, demonstrating its practical effectiveness in real-world applications.
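The pruning idea can be illustrated with a toy snippet; the shapes and threshold below are invented for the sketch, not the paper's values: polylines whose predicted occupancy probability falls below a modality-dependent threshold are dropped before decoding.

```python
# Hypothetical illustration of occupancy-prior context pruning.
import numpy as np

def prune_polylines(polyline_feats, occ_prob, threshold=0.1):
    """polyline_feats: (P, D); occ_prob: (P,) in [0, 1] -> kept features + mask."""
    keep = occ_prob >= threshold     # modality-dependent threshold (assumed)
    return polyline_feats[keep], keep

feats = np.random.randn(200, 64)     # 200 map polylines, 64-dim features
probs = np.random.rand(200)          # predicted occupancy probabilities
kept, mask = prune_polylines(feats, probs)
print(f"kept {mask.sum()} of {len(mask)} polylines for decoding")
```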
|
| |
| 09:00-10:30, Paper WeI1I.360 | Add to My Program |
| Decentralized Swarm Control Via SO(3) Embeddings for 3D Trajectories |
|
| Silveria, Dimitria | Queen's University |
| Cabral, Kleber | Queen's University |
| Jardine, Peter Travis | Queen's University |
| Givigi, Sidney | Queen's University |
Keywords: Swarm Robotics, Autonomous Agents, Distributed Robot Systems
Abstract: This paper presents a novel decentralized approach for achieving emergent behavior in multi-agent systems with minimal information sharing. Based on prior work in simple orbits, our method produces a broad class of stable, periodic trajectories by stabilizing the system around a Lie group-based geometric embedding. By employing the Lie group SO(3), we generate a wider range of periodic curves than existing quaternion-based methods. Furthermore, we exploit SO(3) properties to eliminate the need for velocity inputs, allowing agents to receive only position inputs. We also propose a novel phase controller that ensures uniform agent separation, along with a formal stability proof. Validation through simulations and experiments showcases the method's adaptability to complex low-level dynamics and disturbances.
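A hypothetical sketch of the underlying geometry (the axis, rate, and embedding point are illustrative choices, not the paper's): a one-parameter rotation subgroup of SO(3), applied to a fixed point via the closed-form Rodrigues exponential, traces a closed periodic curve of the kind such embeddings stabilize around.

```python
# Toy sketch: periodic 3-D curve generated by an SO(3) exponential.
import numpy as np

def hat(w):
    """Skew-symmetric matrix [w]_x of a 3-vector w."""
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def rodrigues(w, t):
    """Closed-form exp(t [w]_x) for a unit axis w (Rodrigues' formula)."""
    K = hat(w)
    return np.eye(3) + np.sin(t) * K + (1 - np.cos(t)) * (K @ K)

w = np.array([0.0, 0.0, 1.0])          # rotation axis (assumed)
p0 = np.array([1.0, 0.0, 0.5])         # agent's embedding point (assumed)
curve = np.array([rodrigues(w, t) @ p0 for t in np.linspace(0, 2 * np.pi, 100)])
print(curve.shape)  # (100, 3): one period of the closed orbit
```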
|
| |
| 09:00-10:30, Paper WeI1I.361 | Add to My Program |
| FELP: Fast and Effective Autonomous Flight on Large-Scale and Cluttered Environments Based on Unified Linear Parametric Map |
|
| Nie, Hongyu | Shenyang University of Technology |
| Li, Xingyu | Northeastern University - China |
| Liu, Xu | Shenyang Institute of Automation, Chinese Academy of Sciences |
| Li, Decai | Shenyang Institute of Automation, Chinese Academy of Sciences |
| He, Yuqing | Shenyang Institute of Automation, Chinese Academy of Sciences |
Keywords: Mapping, Collision Avoidance, Aerial Systems: Perception and Autonomy
Abstract: Current autonomous aerial vehicle (AAV) flight systems perform efficiently in both indoor and field environments. However, they often face significant challenges in large-scale and cluttered environments, where the vast amount of captured data can lead to computation and storage bottlenecks. Additionally, existing gradient-based planning methods depend on appropriate resolutions to adapt to different scenarios. In this letter, we present FELP, a fast and effective autonomous flight system for large-scale and cluttered environments based on a unified linear parametric map, which enhances the adaptability of planners to diverse environments. First, the random mapping method (RMM) maps the original irregular points from low-dimensional space into a high-dimensional space, where the points are approximately linearly separable or distributed. Leveraging the features of this mapping space, we can quickly obtain the occupancy state and Euclidean distance (the distance to the nearest obstacle) without relying on a large number of queries and repeated iterations. Then we learn a unified linear parametric model of grid maps and ESDF maps. Based on the linear parametric model, path searching is quickly executed in the front-end. Unlike traditional methods that compute the ESDF through interpolation, the closed-form ESDF can be solved efficiently, enabling real-time online trajectory optimization in the back-end. Compared to EGO-Planner, FELP reduces the mapping time by 68% and the planning time by 29%. Simulation and real-world experiments are conducted to verify its comprehensive performance compared to typical and state-of-the-art methods.
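The random-mapping intuition can be mimicked with generic random Fourier features, shown below under assumptions not taken from the paper (feature count, bandwidth, and a toy unit-sphere distance field): lifting 3-D points through a fixed random mapping makes a distance-like field approximately linear, so a query reduces to one matrix product rather than iterative search.

```python
# Minimal sketch: a linear parametric model of a distance field in a
# random high-dimensional feature space (all hyperparameters assumed).
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=2.0, size=(3, 256))   # fixed random projection
b = rng.uniform(0, 2 * np.pi, size=256)

def lift(X):
    """Map (N, 3) points to (N, 256) random Fourier features."""
    return np.cos(X @ W + b)

# Fit a linear model to noisy distance samples d(x) = ||x|| - 1 (unit sphere):
X = rng.uniform(-2, 2, size=(5000, 3))
d = np.linalg.norm(X, axis=1) - 1.0
theta, *_ = np.linalg.lstsq(lift(X), d, rcond=None)

q = np.array([[1.5, 0.0, 0.0]])
print(lift(q) @ theta)   # approximately 0.5 (the true distance), in closed form
```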
|
| |
| 09:00-10:30, Paper WeI1I.362 | Add to My Program |
| TORM: Transparent Objects Reconstruction and Manipulation with Multi-View Segmentation |
|
| Qiao, Qiyuan | The University of Hong Kong |
| Lin, Fuling | The University of Hong Kong |
| Zhao, Huibin | The University of Hong Kong |
| Xu, Bowen | The University of Hong Kong |
| Chen, Zhiqiang | The University of Hong Kong |
| Xu, Dong | The University of Hong Kong |
| Lu, Peng | The University of Hong Kong |
Keywords: Perception for Grasping and Manipulation, Computer Vision for Automation, Deep Learning for Visual Perception
Abstract: Transparent objects are common in daily life and industry, necessitating that robots be able to perceive and manipulate them. The physical properties of reflection and refraction pose challenges for accurately reconstructing the 3D geometry of transparent objects. Conventional methods, which rely on simultaneous estimation of background ambient light and complex refraction fields, lack robustness in real-world scenes, thereby impeding robotic grasping performance. To address this issue, this letter proposes TORM, a novel framework for robust reconstruction and manipulation of multiple transparent objects. TORM focuses on semantic information from transparent objects and employs multi-view segmentation masks to constrain a self-supervised multi-object deep marching tetrahedra (DMTet-Multi) 3D fitting process. To mitigate the risk of the geometry representation getting stuck in suboptimal solutions during multi-transparent-object reconstruction, we design a novel loss function that prevents marching tetrahedra from crossing boundaries. By applying a connectivity determination strategy to the fitted mesh, transparent objects can be processed in parallel by a grasp perception network, predicting the end-effector configuration for grasp tasks. Real-world experiments demonstrate that TORM achieves an 88.8% grasping success rate in multi-transparent-object grasping tasks.
|
| |
| 09:00-10:30, Paper WeI1I.363 | Add to My Program |
| A Roadmap for Responsible Robotics |
|
| Araiza Illan, Dejanira | Johnson & Johnson |
| Baum, Kevin | German Research Center for Artificial Intelligence (DFKI) |
| Beebee, Helen | The University of Leeds |
| Chatila, Raja | ISIR |
| Christensen, Sarah Moth-Lund | University of Leeds |
| Coghlan, Simon | University of Melbourne |
| Collins, Emily Charlotte | The University of Manchester |
| Devitt, Susannah Kate | Queensland University of Technology |
| Cunha, Alcino | INESC TEC and Universidade Do Minho |
| Dobrosovestnova, Anna | TU Wien |
| Duijf, Hein | Utrecht University |
| Evers, Vanessa | University of Twente |
| Fisher, Michael | University of Manchester |
| Kökciyan, Nadin | University of Edinburgh |
| Hochgeschwender, Nico | University of Bremen |
| Lemaignan, Séverin | IIIA-CSIC |
| Rodríguez Lera, Francisco Javier | Universidad De León |
| Ljungblad, Sara | Lots Design |
| Magnusson, Martin | Örebro University |
| Mansouri, Masoumeh | Birmingham University |
| Milford, Michael J | Queensland University of Technology |
| Moon, AJung | McGill University |
| Powers, Thomas M. | University of Delaware |
| Salvini, Pericle | Responsible Technology Institute, University of Oxford |
| Scantamburlo, Teresa | Ca’ Foscari University |
| Schuster, Nick | Australian National University |
| Slavkovik, Marija | University of Bergen |
| Topcu, Ufuk | The University of Texas at Austin |
| Preciado Vanegas, Daniel Fernando | Vrije Universiteit Amsterdam |
| Wasowski, Andrzej | IT University of Copenhagen |
| Yang, Yi | KU Leuven |
Keywords: Ethics and Philosophy, Acceptability and Trust, Methods and Tools for Robot System Design
Abstract: This document presents the outcomes of the Dagstuhl Seminar "Roadmap for Responsible Robotics," held in September 2023 at the Leibniz Centre for Informatics, Schloss Dagstuhl, Germany. The seminar brought together researchers from Robotics, Computer Science, Social and Cognitive Sciences, and Philosophy with the aim of charting a path towards improving responsibility in robotic systems. Through intensive interdisciplinary discussions centered on the various values at stake as robotics increasingly integrates into human life, the participants identified key priorities to guide future research and regulatory efforts. The resulting roadmap outlines actionable steps to ensure that robotic systems co-evolve with human societies, promoting human agency and humane values rather than undermining them. Designed for diverse stakeholders---researchers, policymakers, industry leaders, practitioners, NGOs, and civil society groups---this roadmap provides a foundation for collaborative efforts toward responsible robotics.
|
| |
| 09:00-10:30, Paper WeI1I.364 | Add to My Program |
| Feedback-MPPI: Fast Sampling-Based MPC Via Rollout Differentiation – Adios Low-Level Controllers |
|
| Belvedere, Tommaso | CNRS |
| Ziegltrum, Michael | University College London |
| Turrisi, Giulio | Istituto Italiano Di Tecnologia |
| Modugno, Valerio | University College London |
Keywords: Optimization and Optimal Control, Motion Control, Legged Robots
Abstract: Model Predictive Path Integral control is a powerful sampling-based approach suitable for complex robotic tasks due to its flexibility in handling nonlinear dynamics and non-convex costs. However, its applicability in real-time, high-frequency robotic control scenarios is limited by computational demands. This paper introduces Feedback-MPPI (F-MPPI), a novel framework that augments standard MPPI by computing local linear feedback gains derived from sensitivity analysis inspired by Riccati-based feedback used in gradient-based MPC. These gains allow for rapid closed-loop corrections around the current state without requiring full re-optimization at each timestep. We demonstrate the effectiveness of F-MPPI through simulations and real-world experiments on two robotic platforms: a quadrupedal robot performing dynamic locomotion on uneven terrain and a quadrotor executing aggressive maneuvers with onboard computation. Results illustrate that incorporating local feedback significantly improves control performance and stability, enabling robust, high-frequency operation suitable for complex robotic systems.
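Below is a condensed, non-authoritative sketch of the F-MPPI idea on a double integrator; the dynamics, costs, and finite-difference gain estimate are stand-ins for the paper's sensitivity analysis. Standard MPPI is solved from the nominal state, then feedback gains are estimated from how the solution shifts with the initial state, so perturbed measurements receive fast closed-loop corrections without re-optimization.

```python
# Toy F-MPPI-style sketch (assumptions throughout): MPPI plus
# finite-difference feedback gains around the nominal solution.
import numpy as np

def rollout_cost(x0, U, dt=0.05):
    """Double integrator x = [position, velocity] driven by acceleration u."""
    x, cost = x0.astype(float), 0.0
    for u in U:
        x = x + dt * np.array([x[1], u])
        cost += x[0] ** 2 + 0.1 * x[1] ** 2 + 0.01 * u ** 2
    return cost

def mppi(x0, U_nom, n_samples=256, sigma=1.0, lam=1.0, seed=0):
    """One MPPI solve; the fixed seed gives common random numbers, which
    keeps the finite-difference gain estimate below well conditioned."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(scale=sigma, size=(n_samples, len(U_nom)))
    costs = np.array([rollout_cost(x0, U_nom + e) for e in eps])
    w = np.exp(-(costs - costs.min()) / lam)
    w /= w.sum()
    return U_nom + w @ eps

x0, U_nom = np.array([1.0, 0.0]), np.zeros(20)
U_star = mppi(x0, U_nom)

# Feedback gains for the first control via finite differences w.r.t. state:
delta, K = 1e-2, np.zeros(2)
for i in range(2):
    dx = np.zeros(2); dx[i] = delta
    K[i] = (mppi(x0 + dx, U_nom)[0] - U_star[0]) / delta

x_meas = x0 + np.array([0.05, -0.02])          # disturbed state measurement
print("corrected first control:", U_star[0] + K @ (x_meas - x0))
```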
|
| |
| 09:00-10:30, Paper WeI1I.365 | Add to My Program |
| Steerable High-Jumping Tensegrity Robot for Space Exploration (I) |
|
| Green, Jonathan Jacob Kolt | ETH Zurich |
| Bozinovski, Dario | ETH Zurich |
| Tischhauser, Fabian | ETH Zurich |
| Hutter, Marco | ETH Zurich |
| Baines, Robert Lawrence | ETH Zurich |
Keywords: Field Robots, Space Robotics and Automation, Actuation and Joint Mechanisms
Abstract: The growing interest in exploring other planets calls for innovative robotic systems capable of deploying to and traversing challenging space environments. While wheeled rovers have traditionally fulfilled this role, they face limitations, including configuration dependence (e.g., requiring an upright orientation), susceptibility to impacts, and difficulty overcoming obstacles larger than their wheel radius. Tensegrity-based robotics presents a promising alternative for future rovers. These lightweight, compliant structures offer compactibility, adjustable stiffness, and the ability to absorb impacts without damage. Moreover, their unique form factor naturally protects scientific payloads. Recent research has explored tensegrity robots for rolling-based locomotion, with increasing interest in leveraging their structures for jumping-based movement. However, achieving hardware capable of high jumps greater than the robot’s body length (BL) and directional jumping control for steerable jumping remains a challenge. This work introduces a tensegrity robot that utilizes structural deformation for jumping locomotion. Through first-principles analyses, simulations, laboratory experiments, and field tests in a planetary analog environment, we demonstrate a robot capable of vertical jumps of 1.18 m (1.93 BLs), directional jumps covering horizontal distances up to 0.59 m (0.97 BLs), and surviving falls from heights of 21.5 m (35.2 BLs).
|
| |
| 09:00-10:30, Paper WeI1I.366 | Add to My Program |
| Zero-Shot Recognition of Test Tube Types by Automatically Collecting and Labeling RGB Data |
|
| Tang, Yu | Osaka University |
| Wan, Weiwei | Osaka University |
| Chen, Hao | Fujian Agriculture and Forestry University |
| Matsushita, Masaki | H.U. Group Research Inst. G. K., Japan |
| Takahashi, Jun | H.U. Research Institute |
| Kotaka, Takeyuki | H.U. Group Research Inst. G. K., Japan |
| Harada, Kensuke | Osaka University |
Keywords: Computer Vision for Automation, Robotics and Automation in Life Sciences, Grasping
Abstract: This work presents a method for automatically detecting and recognizing test tube types in a rack. It leverages automatic segmentation, clustering, and labeling processes to eliminate the need for explicitly preparing training data. These processes are addressed by using combined global prediction and local cropping, where global prediction estimates the slot occupation states of a rack, and local cropping extracts tube pictures in the local regions of each slot for clustering and labeling. With the help of the proposed method, the robotic tube manipulation system no longer needs tailored data and explicit training in the presence of new tubes, thus achieving flexibility and efficiency. Experimental evaluations conducted with a RealSense D405 camera and the UFactory xArm Lite6 robot manipulator confirm the method’s effectiveness in accurately identifying novel test tube types under real-world conditions.
|
| |
| 09:00-10:30, Paper WeI1I.367 | Add to My Program |
| Safety on the Fly: Constructing Robust Safety Filters Via Policy Control Barrier Functions at Runtime |
|
| Knoedler, Luzia | Delft University of Technology |
| So, Oswin | Massachusetts Institute of Technology |
| Yin, Ji | Georgia Institute of Technology |
| Black, Mitchell | MIT Lincoln Laboratory |
| Serlin, Zachary | Massachusetts Institute of Technology |
| Tsiotras, Panagiotis | Georgia Tech |
| Alonso-Mora, Javier | Delft University of Technology |
| Fan, Chuchu | Massachusetts Institute of Technology |
Keywords: Robot Safety, Optimization and Optimal Control, Collision Avoidance
Abstract: Control Barrier Functions (CBFs) have proven to be an effective tool for performing safe control synthesis for nonlinear systems. However, guaranteeing safety in the presence of disturbances and input constraints for high relative degree systems is a difficult problem. In this work, we propose the Robust Policy CBF (RPCBF), a practical approach for constructing robust CBF approximations online via the estimation of a value function. We establish conditions under which the approximation qualifies as a valid CBF and demonstrate the effectiveness of the RPCBF-safety filter in simulation on a variety of high relative degree input-constrained systems. Finally, we demonstrate the benefits of our method in compensating for model errors on a hardware quadcopter platform by treating the model errors as disturbances. Website including code: www.oswinso.xyz/rpcbf/
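For context, the standard CBF quadratic-program safety filter that RPCBF builds upon reduces, in the simplest scalar case, to a closed-form clamp; the sketch below uses toy dynamics x' = u and barrier h(x) = x, which are illustrative choices rather than the paper's systems.

```python
# Minimal CBF-QP safety filter sketch: for x' = u and h(x) = x (keep x >= 0),
# min ||u - u_des||^2 s.t. u >= -alpha * h(x) is solved by a simple clamp.
def cbf_filter(x, u_des, alpha=2.0):
    """Project the desired control onto the safe set {u : u >= -alpha * x}."""
    u_min = -alpha * x
    return max(u_des, u_min)

x, u_des = 0.1, -5.0           # desired control would exit the safe set x >= 0
print(cbf_filter(x, u_des))    # -> -0.2, the least-deviation safe control
```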
|
| |
| 09:00-10:30, Paper WeI1I.368 | Add to My Program |
| Self-Wearing Adaptive Garments Via Soft Robotic Unfurling |
|
| Kim, Nam Gyun | Korea Advanced Institute of Science and Technology |
| Heap, William | Stanford University |
| Qin, Yimeng | Stanford University |
| Yao, Elvy | University of California, Santa Barbara |
| Ryu, Jee-Hwan | Korea Advanced Institute of Science and Technology |
| Okamura, Allison M. | Stanford University |
Keywords: Wearable Robotics, Soft Robot Applications, Physically Assistive Devices
Abstract: Robotic dressing assistance has the potential to improve the quality of life for individuals with limited mobility. Existing solutions predominantly rely on rigid robotic manipulators, which have challenges in handling deformable garments and ensuring safe physical interaction with the human body. Prior robotic dressing methods require excessive operation times, complex control strategies, and constrained user postures, limiting their practicality and adaptability. This paper proposes a novel soft robotic dressing system, the Self-Wearing Adaptive Garment (SWAG), which uses an unfurling and growth mechanism to facilitate autonomous dressing. Unlike traditional approaches, the SWAG conforms to the human body through an unfurling-based deployment method, eliminating skin-garment friction and enabling a safer and more efficient dressing process. We present the working principles of the SWAG, introduce its design and fabrication, and demonstrate its performance in dressing assistance. The proposed system demonstrates effective garment application across various garment configurations, presenting a promising alternative to conventional robotic dressing assistance.
|
| |
| 09:00-10:30, Paper WeI1I.369 | Add to My Program |
| MarsLGPR: Mars Rover Localization with Ground Penetrating Radar (I) |
|
| Sheppard, Anja | University of Michigan |
| Skinner, Katherine | University of Michigan |
Keywords: Space Robotics and Automation, Localization, Data Sets for SLAM
Abstract: In this work, we propose the use of Ground Penetrating Radar (GPR) for rover localization on Mars. Precise pose estimation is an important task for mobile robots exploring planetary surfaces, as they operate in GPS-denied environments. Although visual odometry (VO) provides accurate localization, it is computationally expensive and can fail in dim or high-contrast lighting. Wheel encoders can also provide odometry estimation, but are prone to slipping on the sandy terrain encountered on Mars. Although traditionally a scientific surveying sensor, GPR has been used on Earth for terrain classification and localization through subsurface feature matching. The Perseverance rover and the upcoming ExoMars rover have GPR sensors already equipped to aid in the search of water and mineral resources. We propose to leverage GPR to aid in Mars rover localization. Specifically, we develop a novel GPR-based deep learning model that predicts 1-D relative pose translation. We fuse our GPR pose prediction method with inertial and wheel encoder data in a filtering framework to output rover localization. We perform experiments in a Mars analog environment and demonstrate that our GPR-based displacement predictions both outperform wheel encoders and improve multimodal filtering estimates in high-slip environments. Finally, we present the first dataset aimed at GPR-based localization in Mars analog environments, which will be made publicly available at: https://umfieldrobotics.github.io/marslgpr
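A hypothetical one-dimensional fusion sketch echoing the pipeline above, with all noise magnitudes invented for illustration: at each step, the wheel-encoder displacement, which over-reports under slip, is fused with the GPR-predicted relative translation by minimum-variance weighting.

```python
# Toy sketch: minimum-variance fusion of two displacement estimates.
def fuse(d_wheel, var_wheel, d_gpr, var_gpr):
    """Fuse two noisy measurements of the same displacement."""
    w = var_gpr / (var_wheel + var_gpr)          # weight on the wheel estimate
    d = w * d_wheel + (1 - w) * d_gpr
    var = var_wheel * var_gpr / (var_wheel + var_gpr)
    return d, var

pose = 0.0
for _ in range(10):
    d, _ = fuse(d_wheel=0.11, var_wheel=0.05**2,   # encoder over-reports (slip)
                d_gpr=0.10, var_gpr=0.02**2)       # GPR relative translation
    pose += d
print(pose)   # close to the true 1.0 m despite consistent wheel slip
```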
|
| |
| 09:00-10:30, Paper WeI1I.370 | Add to My Program |
| Mixed-Reality Digital Twins: Leveraging the Physical and Virtual Worlds for Hybrid Sim2Real Transition of Multi-Agent Reinforcement Learning Policies |
|
| Samak, Chinmay | CU-ICAR |
| Samak, Tanmay | CU-ICAR |
| Krovi, Venkat | CU-ICAR |
Keywords: Autonomous Agents, Reinforcement Learning, Simulation and Animation
Abstract: Multi-agent reinforcement learning (MARL) for cyber-physical vehicle systems usually requires significant training time due to the systems' inherent complexity. Furthermore, deploying the trained policies in the real world demands a feature-rich environment along with multiple physically embodied agents, which may not be feasible due to monetary, physical, energy, or safety constraints. This work seeks to address these pain points by presenting a mixed-reality (MR) digital twin (DT) framework capable of: (i) boosting training speeds by selectively scaling parallelized simulation workloads on-demand, and (ii) immersing the MARL policies across hybrid simulation-to-reality (sim2real) experiments. The viability and performance of the proposed framework are highlighted through two representative use cases, which cover cooperative as well as competitive classes of MARL problems. We study the effect of: (i) agent and environment parallelization on training time, and (ii) systematic domain randomization on zero-shot sim2real transfer, across both case studies. Results indicate up to a 76.3% reduction in training time with the proposed parallelization scheme and a sim2real gap as low as 2.9% using the proposed deployment method.
|
| |
| 09:00-10:30, Paper WeI1I.371 | Add to My Program |
| LLM-Aided Assistive Robot for Single-Operator Bimanual Teleoperation |
|
| Fei, Haolin | Lancaster University |
| Ma, Songlin | Lancaster University |
| Du, Guanglong | South China University of Technology |
| Yadollahi, Elmira | Lancaster University |
| Lam, Hak-Keung | King's College London |
| Faragasso, Angela | Finger Vision Inc |
| Montazeri, Allahyar | Lancaster University |
| Wang, Ziwei | Lancaster University |
Keywords: Human-Robot Collaboration, Telerobotics and Teleoperation, AI-Enabled Robotics
Abstract: Bimanual teleoperation tasks are highly demanding for human operators, requiring the simultaneous control of two robotic arms while managing complex coordination and cognitive load. Current approaches to this challenge often rely on rigid control schemes or task-specific automation that do not adapt well to dynamic environments or varied operator needs. This paper presents a novel large language model (LLM)-aided bimanual teleoperation assistant (BTLA) that helps operators control dual-arm robots through an intuitive voice command interface and variable autonomy. The BTLA system enables a hybrid control paradigm by combining natural language interaction for an assistive robot arm with direct teleoperation of the dominant robotic arm. Our system implements six core manipulation skills with varying autonomy, ranging from direct mirroring to autonomous object manipulation. The BTLA leverages the LLM to interpret natural language commands and select an appropriate assistance mode based on task requirements and operator preferences. Experimental validation on bimanual object manipulation tasks demonstrates that the BTLA system yields a 240.8% increase in success rate over solo teleoperation and a 69.9% increase over dyadic teleoperation, while significantly reducing operator mental workload. In addition, we validate our approach on a physical dual-arm UR3e robot system, achieving a 90% success rate on challenging soft-bottle handling and box-transportation tasks.
|
| |
| 09:00-10:30, Paper WeI1I.372 | Add to My Program |
| A Physiotherapy Video Matching Method Supporting Arbitrary Camera Placement Via Angle-Of-Limb-Based Posture Structures |
|
| Lin, Jiunn-Wu | Kaohsiung Veterans General Hospital |
| Chou, Yao-Sheng | National Cheng Kung University |
| Huang, Yun Pin | Playsure Technology Company |
| Hung, Min-Hsiung | Chinese Culture University |
| Kao, Ming-Hung | National Cheng Kung University |
| Ji, Jia | National Cheng Kung University |
| Jiang, Lin-Yi | National Cheng Kung University |
| Chen, Pi Wei | National Cheng Kung University |
| Chen, Chao-Chun | National Cheng Kung University |
Keywords: Human Detection and Tracking, Rehabilitation Robotics, Computer Vision for Automation
Abstract: The “Hospital at Home” initiative transforms medical service automation through modern technologies. This paper revisits remote physiotherapy, allowing convalescents to record exercises using mobile devices from arbitrary angles. To address the resulting viewpoint variation, we propose a physiotherapy video matching method that accurately aligns movements from unconstrained viewpoints. The task is formulated as an optimization problem and solved using a modular pipeline. We introduce the Angle-of-Limb-based Posture Structure (ALPS) and the Camera-Angle-Free (CAFE) transformation to counter camera-angle differences. We also develop the Three-phase ALPS Matching Algorithm (TALMA) for matching movements between mentor and convalescent videos. Real-world experiments show our method outperforms existing solutions in both precision and practicality, with a time deviation of less than 0.07 seconds from expert annotations. The prototype and datasets are publicly available at: https://github.com/NCKU-CIoTlab/TALMA-on-ALPS/.
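To illustrate the angle-of-limb idea in miniature (the joint names and limb list below are assumptions, not the paper's specification): a posture is summarized by limb-vector angles computed from 2-D keypoints, a representation less tied to raw coordinates than the keypoints themselves.

```python
# Hypothetical ALPS-like posture descriptor from 2-D keypoints.
import numpy as np

def limb_angle(joint_a, joint_b):
    """Angle of the limb vector a->b relative to the image x-axis, in radians."""
    v = np.asarray(joint_b) - np.asarray(joint_a)
    return np.arctan2(v[1], v[0])

def posture_structure(kp):
    """kp: dict of 2-D keypoints -> vector of limb angles (illustrative limbs)."""
    limbs = [("shoulder", "elbow"), ("elbow", "wrist"),
             ("hip", "knee"), ("knee", "ankle")]
    return np.array([limb_angle(kp[a], kp[b]) for a, b in limbs])

kp = {"shoulder": (0, 0), "elbow": (1, 1), "wrist": (2, 1),
      "hip": (0, 3), "knee": (0, 4), "ankle": (1, 5)}
print(posture_structure(kp))  # 4 limb angles describing the posture
```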
|
| |
| 09:00-10:30, Paper WeI1I.373 | Add to My Program |
| QP Chaser: Polynomial Trajectory Generation for Autonomous Aerial Tracking (I) |
|
| Lee, Yunwoo | Carnegie Mellon University |
| Park, Jungwon | Seoul National University of Science and Technology |
| Jung, Seungwoo | Seoul National University |
| Jeon, Boseong | Samsung Research |
| Oh, Dahyun | Seoul National University |
| Kim, H. Jin | Seoul National University |
Keywords: Visual Servoing, Aerial Systems: Perception and Autonomy, Motion and Path Planning
Abstract: Maintaining the visibility of the target is one of the major objectives of aerial tracking missions. This paper proposes a target-visible trajectory planning pipeline using quadratic programming (QP). Our approach can handle various tracking settings, including 1) single- and dual-target following and 2) both static and dynamic environments, unlike other works that focus on a single specific setup. In contrast to other studies that fully trust the predicted trajectory of the target and consider only the visibility of the target’s center, our pipeline considers error in target path prediction and the entire body of the target to maintain the target visibility robustly. First, a prediction module uses a sample-check strategy to quickly calculate the reachable areas of moving objects, which represent the areas their bodies can reach, considering obstacles. Subsequently, the planning module formulates a single QP problem, considering path homotopy, to generate a tracking trajectory that maximizes the visibility of the target's reachable area among obstacles. The performance of the planner is validated in multiple scenarios, through high-fidelity simulations and real-world experiments.
|
| |
| 09:00-10:30, Paper WeI1I.374 | Add to My Program |
| Quantum Machine Learning and Grover’s Algorithm for Quantum Optimization of Robotic Manipulators |
|
| Sirag, Hassen Nigatu | Zhejiang University, Robotics Institute |
| Shi, Gaokun | Yuyao Robot Research Center |
| Li, Jituo | Zhejiang University |
| Wang, Jin | Zhejiang University |
| Lu, GuoDong | Zhejiang University |
| Li, Howard | University of New Brunswick |
Keywords: Mechanism Design, Model Learning for Control, Optimization and Optimal Control
Abstract: Optimizing high-degree-of-freedom robotic manipulators requires searching complex, high-dimensional configuration spaces, a task that is computationally challenging for classical methods. This paper introduces a quantum-native framework that integrates Quantum Machine Learning (QML) with Grover's algorithm to solve kinematic optimization problems efficiently. A parameterized quantum circuit is trained to approximate the forward kinematics model, which then constructs an oracle to identify optimal configurations. Grover's algorithm leverages this oracle to provide a quadratic reduction in search complexity. Demonstrated on 1-DoF, 2-DoF, and dual-arm manipulator tasks, the method achieves significant speedups—up to 93x over classical optimizers like Nelder-Mead—as problem dimensionality increases. This work establishes a foundational, quantum-native framework for robot kinematic optimization, effectively bridging quantum computing and robotics problems.
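A classical numpy simulation conveys the amplitude-amplification mechanism the abstract relies on; the toy cost, threshold, and discretization below are stand-ins for the paper's QML-learned oracle. Basis states that pass the oracle's test are amplified in roughly pi/4 * sqrt(N/M) iterations, the source of the quadratic speedup.

```python
# Classical simulation of Grover amplitude amplification over a
# discretized configuration space (all problem details are toy stand-ins).
import numpy as np

N = 64                                     # discretized joint configurations
angles = np.linspace(0, np.pi, N)
cost = np.abs(np.sin(angles) - 0.95)       # toy forward-kinematics objective
good = cost < 0.01                         # oracle: "optimal enough" states

amp = np.full(N, 1 / np.sqrt(N))           # uniform superposition
iters = int(np.round(np.pi / 4 * np.sqrt(N / max(good.sum(), 1))))
for _ in range(iters):
    amp[good] *= -1                        # oracle phase flip
    amp = 2 * amp.mean() - amp             # diffusion: inversion about the mean

probs = amp ** 2
print("best configuration:", angles[np.argmax(probs)])
print("success probability:", probs[good].sum())
```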
|
| |
| 09:00-10:30, Paper WeI1I.375 | Add to My Program |
| Flexible-Link Velocity-Bounding Proxy Based Sliding Mode Control |
|
| Nouhi, Hafsa | Vrije Universiteit Brussel |
| Fei, Chaoyue | Vrije Universiteit Brussel |
| Hubert, Thierry | Vrije Universiteit Brussel |
| Van de Perre, Greet | Vrije Universiteit Brussel |
Keywords: Compliance and Impedance Control, Flexible Robotics
Abstract: This paper proposes a control strategy for flexible-link manipulators, based on velocity-bounding proxy-based sliding mode control (VB-PSMC), that preserves high tracking accuracy in free motion while ensuring smooth and safe recovery in scenarios involving physical interaction or large positional errors. The scheme is extended to compensate for the manipulator's flexural dynamics, resulting in a nested control scheme where damping of the induced oscillations is achieved by model-free proportional strain feedback while gravity-induced deflections are counteracted by a feed-forward term based on a quasi-static Euler-Bernoulli beam model. A convergence study on the modified sliding manifold and a stability analysis of the closed-loop system are provided. The performance of the controller was evaluated experimentally and compared against other control strategies such as PSMC and torque-limited PD control. The results demonstrate the controller's accurate end-effector tracking in free motion and compliant behavior during contact, efficiently handling the link's inherent flexibility and yielding up to a 32% reduction in interaction force. In addition, a study of the FL-VB-PSMC response after releasing contact demonstrated overdamped and vibration-free recovery even for large position errors.
|
| |
| 09:00-10:30, Paper WeI1I.376 | Add to My Program |
| UniFucGrasp: Human-Hand-Inspired Unified Functional Grasp Annotation Strategy and Dataset for Diverse Dexterous Hands |
|
| Lin, Haoran | Hunan University |
| Chen, Wenrui | Hunan University |
| Chen, Xianchi | Hunan University |
| Yang, Fan | Hunan University |
| Diao, Qiang | Hunan University |
| Xie, Wenxin | Hunan University |
| Wu, Sijie | Hunan University |
| Yang, Kailun | Hunan University |
| Li, Maojun | Hunan University |
| Wang, Yaonan | Hunan University |
Keywords: Multifingered Hands, Dexterous Manipulation, Deep Learning in Grasping and Manipulation
Abstract: Dexterous grasp datasets are vital for embodied intelligence, but most emphasize grasp stability, ignoring the functional grasps needed for tasks like opening bottle caps or holding cup handles, and most rely on bulky, costly, and hard-to-control high-DOF Shadow Hands. Inspired by the human hand’s underactuated mechanism, we establish UniFucGrasp, a universal functional grasp annotation strategy and dataset for multiple dexterous hand types. Based on biomimicry, it maps natural human motions to diverse hand structures and uses geometry-based force closure to ensure functional, stable, human-like grasps. This method supports low-cost, efficient collection of diverse, high-quality functional grasps. Finally, we establish the first multi-hand functional grasp dataset and provide a synthesis model to validate its effectiveness. Experiments on the UFG dataset, IsaacSim, and complex robotic tasks show that our method improves functional manipulation accuracy and grasp stability, and demonstrates improved adaptability across multiple robotic hands, helping to alleviate annotation cost and generalization challenges in dexterous grasping.
|
| |
| 09:00-10:30, Paper WeI1I.377 | Add to My Program |
| SignLoc: Robust Localization Using Navigation Signs and Public Maps |
|
| Zimmerman, Nicky | National University of Singapore |
| Loo, Joel | National University of Singapore |
| Agrawal, Ayush | Robotics Research Center, IIIT Hyderabad |
| Hsu, David | National University of Singapore |
Keywords: Localization, Semantic Scene Understanding
Abstract: Navigation signs and maps, such as floor plans and street maps, are widely available and serve as ubiquitous aids for way-finding in human environments. Yet, they are rarely used by robot systems. This paper presents SignLoc, a global localization method that leverages navigation signs to localize the robot on publicly available maps, specifically floor plans and OpenStreetMap (OSM) graphs, without prior sensor-based mapping. SignLoc first extracts a navigation graph from the input map. It then employs a probabilistic observation model to match directional and locational cues from the detected signs to the graph, enabling robust topo-semantic localization within a Monte Carlo framework. We evaluated SignLoc in diverse large-scale environments: part of a university campus, a shopping mall, and a hospital complex. Experimental results show that SignLoc reliably localizes the robot after observing only one to two signs.
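A toy Monte Carlo localization update in the spirit of the description above (the graph, the sign model, and the likelihood width are invented, and the robot heading is assumed aligned with the world frame): particles on navigation-graph nodes are reweighted by how well a detected sign's pointing direction matches the bearing to the place it references.

```python
# Hypothetical sign-based particle reweighting step.
import numpy as np

nodes = {0: (0, 0), 1: (10, 0), 2: (10, 10), 3: (0, 10)}  # node id -> (x, y)
place = (10, 10)                    # location the detected sign points toward
observed_bearing = np.deg2rad(45)   # sign arrow direction (world-aligned robot)

particles = np.array([0, 0, 1, 2, 3])   # particle hypotheses over graph nodes
weights = np.ones(len(particles))

for i, n in enumerate(particles):
    dx, dy = place[0] - nodes[n][0], place[1] - nodes[n][1]
    expected = np.arctan2(dy, dx)        # bearing to the place from node n
    err = np.arctan2(np.sin(observed_bearing - expected),
                     np.cos(observed_bearing - expected))   # wrapped angle error
    weights[i] = np.exp(-0.5 * (err / 0.3) ** 2)   # Gaussian-on-angle likelihood

weights /= weights.sum()
for n, w in zip(particles, weights):
    print(f"node {n}: weight {w:.3f}")   # mass concentrates at node 0
```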
|
| |
| 09:00-10:30, Paper WeI1I.378 | Add to My Program |
| Clinicians’ Perspectives on Safety, Ethical, and Legal Considerations for Home-Based Physical Rehabilitation Robots |
|
| Velmurugan, Vignesh | University of Hertfordshire |
| Holthaus, Patrick | University of Hertfordshire |
| Amirabdollahian, Farshid | The University of Hertfordshire |
| Dragone, Mauro | Heriot-Watt University |
Keywords: Rehabilitation Robotics, Ethics and Philosophy, Physical Human-Robot Interaction
Abstract: The growing demand for neurorehabilitation is driving the development of innovative, home-based robotic solutions, offering a promising approach to alleviate the strain on healthcare systems burdened by limited resources and workforce shortages. Despite significant technological advancements in rehabilitation robotics, adoption remains limited due to unresolved safety, legal, and ethical concerns. This study provides a comprehensive analysis of these three aspects from the perspective of experienced neurorehabilitation clinicians, offering valuable insights into the challenges surrounding home-based rehabilitation robots. Using a qualitative approach, we identified eight key themes derived from clinicians' feedback. These themes underscore critical areas, including the need for robust safety measures, regulatory clarity on liability and data privacy, and the ethical imperative of ensuring equitable access to technology for diverse user populations. Our findings highlight the need for a multifaceted approach to overcome these challenges, including user-centred design, rigorous testing, comprehensive user training, and necessary updates to regulatory frameworks to ensure the safe, effective, and equitable deployment of these technologies.
|
| |
| 09:00-10:30, Paper WeI1I.379 | Add to My Program |
| Dynamically Feasible Trajectory Generation with Optimization-Embedded Networks for Autonomous Flight |
|
| Han, Zhichao | Zhejiang University |
| Xu, Long | Zhejiang University |
| Pei, Liuao | Zhejiang University |
| Gao, Fei | Zhejiang University |
Keywords: Motion and Path Planning, Integrated Planning and Learning, Aerial Systems: Applications
Abstract: This paper aims to bridge perception and planning in navigation systems by learning optimal trajectories from depth information in an end-to-end fashion. However, using neural networks as black-box replacements for traditional modules jeopardizes scalability and adaptability. Moreover, such methods often fall short in sufficiently incorporating the robot’s dynamic constraints, resulting in trajectories that are either inadequately executable or unexpectedly aggressive, diverging from user expectations. In this paper, we fuse the benefits of conventional methods and neural networks by introducing an optimization-embedded network based on a compact trajectory library. The network distills spatial constraints, which are then applied to a model-based spatial-temporal trajectory optimization problem, yielding feasible and optimal solutions. By making the optimization problem differentiable, our model seamlessly approximates the optimal trajectory. Additionally, the introduced regularized trajectory library permits efficient capture of the spatial distribution of optimal trajectories with minimal storage cost, safeguarding multimodal planning features. Benchmarking demonstrates the outstanding performance of our method in trajectory smoothness, success rate, and constraint satisfaction. Real-world flight experiments with an onboard computer showcase the autonomous quadrotor’s ability to navigate swiftly through dense forests. Our project page with videos is at https://zju-fast-lab.github.io/e2e_opt/.
|
| |
| 09:00-10:30, Paper WeI1I.380 | Add to My Program |
| Spline-FRIDA: Towards Diverse, Humanlike Robot Painting Styles with a Sample-Efficient, Differentiable Brush Stroke Model |
|
| Chen, Lawrence | Carnegie Mellon University |
| Schaldenbrand, Peter | Carnegie Mellon University |
| Shankar, Tanmay | Carnegie Mellon University |
| Coleman, Lia | Carnegie Mellon University |
| Oh, Jean | Carnegie Mellon University |
Keywords: Art and Entertainment Robotics, Human-Robot Collaboration, Imitation Learning
Abstract: A painting is more than just a picture on a wall; a painting is a process composed of many intentional brush strokes, the shapes of which are an important component of a painting's overall style and message. Prior work in modeling brush stroke trajectories either does not work with real-world robotics or is not flexible enough to capture the complexity of human-made brush strokes. In this work, we introduce Spline-FRIDA, which can model complex human brush stroke trajectories. This is achieved by recording artists drawing using motion capture, modeling the extracted trajectories with an autoencoder, and introducing a novel brush stroke dynamics model to the existing robotic painting platform FRIDA. We conducted a survey and found that our open-source Spline-FRIDA approach successfully captures the stroke styles in human drawings and that Spline-FRIDA's brush strokes are more human-like, improve semantic planning, and are more artistic compared to existing robot painting systems with restrictive Bezier curve strokes.
|
| |
| 09:00-10:30, Paper WeI1I.381 | Add to My Program |
| Semantic-Augmented 3D Gaussian Splatting for Visual Localization in Complex Indoor Environments |
|
| Chu, Ba Tuan Hoang | Chungbuk National University |
| Kim, Gon-Woo | Chungbuk National University |
Keywords: Semantic Scene Understanding, Localization, Recognition
Abstract: This paper presents a new visual localization framework for complex indoor environments under dynamic scene-change conditions. Conventional visual localization methods often struggle to maintain accuracy and robustness in such environments, where frequent scene changes, occlusions, diverse object categories, and intricate scene structures significantly affect feature consistency and matching reliability. These challenges highlight the need for a more adaptive and semantically aware localization approach. The proposed algorithm integrates semantic information with a Gaussian map as input, enhancing environmental awareness: robust objects are identified and extracted, improving feature extraction performance and, consequently, pose estimation precision. Furthermore, a new coarse-to-fine matching strategy takes an overview of the Gaussian map, from which suitable viewpoints are generated to produce the best-matching images. Rendered images produced from the Gaussian map are employed in subsequent stages to improve comparison effectiveness, enabling determination of the most accurate camera pose. Finally, the capability of the proposed method is confirmed through experiments on several types of datasets.
|
| |
| 09:00-10:30, Paper WeI1I.382 | Add to My Program |
| Fixture-Free Automated Sewing System Using Dual-Arm Manipulator and High-Speed Fabric Edge Detection |
|
| Tang, Kai | The University of Hong Kong |
| Huang, Xuzhao | The University of Hong Kong |
| Seino, Akira | Centre for Transformative Garment Production |
| Tokuda, Fuyuki | Tohoku University |
| Kobayashi, Akinari | Centre for Transformative Garment Production |
| Tien, Norman | University of Hong Kong |
| Kosuge, Kazuhiro | The University of Hong Kong |
Keywords: Foundations of Automation, Deep Learning for Visual Perception, Perception for Grasping and Manipulation
Abstract: Inspired by human workers who perform complicated sewing tasks by repeating relatively simple operations, this paper proposes a fixture-free automated sewing system using a dual-arm manipulator and an ordinary sewing machine to sew two aligned fabrics along the edges, a common task in garment production. The proposed sewing system has a five-layer architecture: perception, dual-arm sewing Petri net, fundamental operations, control primitives, and hardware layers. This architecture decomposes various complex sewing tasks into sequences of fundamental operations. To meet the real-time requirement of automated sewing, a High-speed Fabric Edge Detection System (Hi-FEDS) is further proposed for the perception layer, which formulates the fabric edge detection problem for sewing as a classification problem of predefined distributed anchors. The anchor distribution is modeled by the Gaussian Uniform Mixture Model (GUMM). The proposed method achieves high-speed fabric edge detection at an average of 120 FPS, with an average error of about one pixel. An experimental robotic sewing platform is developed and the sewing results show that the proposed system achieves high-quality sewing across fabrics of various shapes and materials.
|
| |
| 09:00-10:30, Paper WeI1I.383 | Add to My Program |
| LighterBEV: LiDAR Global Localization Meets Online Learning |
|
| Liu, BinHong | Northwestern Polytechnical University |
| Yang, Tao | Northwestern Polytechnical University |
| Cao, Haoji | Northwestern Polytechnical University |
| Fu, Shuqi | City University of Hong Kong |
| Fang, YangWang | Northwestern Polytechnical University |
| Yan, Zhi | École Nationale Supérieure De Techniques Avancées (ENSTA) |
Keywords: Localization, Incremental Learning, SLAM
Abstract: LiDAR-based global localization provides accurate robot pose estimates against a prior map. Existing deep-learning methods, however, demand heavy computation and long training or inference times, and degrade sharply when faced with domain shifts. This letter presents LighterBEV, a lightweight, fast, and generalizable localization method. An Informative Compression Module achieves a fourfold reduction in local-feature dimensionality while improving accuracy. We further integrate online learning to enable rapid post-deployment adaptation, mitigating degradation under distribution shift. Extensive experiments on four large-scale datasets show that LighterBEV achieves state-of-the-art performance with limited training data, maintains high accuracy under domain shift, and runs in real time on resource-constrained hardware—supporting both inference and online updates. To our knowledge, LighterBEV is the first LiDAR global localization approach to incorporate online learning for automatic adaptation to new environments, thereby narrowing the domain gap. Code will be released at: https://github.com/npu-iusl-lab/LighterBEV.
|
| |
| 09:00-10:30, Paper WeI1I.384 | Add to My Program |
| Establishing Reality-Virtuality Interconnections in Urban Digital Twins for Superior Intelligent Road Inspection and Simulation |
|
| Zhang, Yikang | Tongji University |
| Liu, Chuangwei | Tongji University |
| Li, Jiahang | Tongji University |
| Chen, Yingbing | Huawei Hongkong Research Center |
| Cheng, Jie | Hong Kong University of Science and Technology |
| Fan, Rui | Tongji University |
Keywords: Intelligent Transportation Systems, Computer Vision for Transportation, Deep Learning for Visual Perception
Abstract: Road inspection is crucial for maintaining road serviceability and ensuring traffic safety, as road defects gradually develop and compromise functionality. Traditional inspection methods, which rely on manual evaluations, are labor-intensive, costly, and time-consuming. While data-driven approaches are gaining traction, the scarcity and spatial sparsity of real-world road defects present significant challenges in acquiring high-quality datasets. Existing simulators designed to generate detailed synthetic driving scenes, however, lack models for road defects. Moreover, advanced driving tasks that involve interactions with road surfaces, such as planning and control in defective areas, remain underexplored. To address these limitations, we propose a multi-modal sensor platform integrated with an urban digital twin (UDT) system for intelligent road inspection. First, hierarchical road models are constructed from real-world driving data collected using vehicle-mounted sensors, resulting in highly detailed representations of road defect structures and surface elevations. Next, digital road twins are generated to create simulation environments for comprehensive analysis and evaluation of algorithm performance. These scenarios are then imported into a simulator to facilitate both data acquisition and physical simulation. Experimental results demonstrate that driving tasks, including perception and decision-making, benefit significantly from the high-fidelity road defect scenes generated by our system.
|
| |
| 09:00-10:30, Paper WeI1I.385 | Add to My Program |
| Distance-Based Shared Control for Vitreoretinal Surgery |
|
| Briel, Marius | Carl Zeiss AG |
| Wu, Dongyue | Carl Zeiss AG |
| Hess, Maximilian | Carl Zeiss AG |
| Haide, Ludwig | Carl Zeiss AG |
| Piccinelli, Nicola | University of Verona |
| Kronreif, Gernot | ACMIT Gmbh |
| Pellegrini, Marco | University of Ferrara |
| Tagliabue, Eleonora | Carl Zeiss AG |
| Mathis-Ullrich, Franziska | Friedrich-Alexander-University Erlangen-Nurnberg (FAU) |
Keywords: Medical Robots and Systems, Sensor-based Control, Human Factors and Human-in-the-Loop
Abstract: The fragility of ocular tissues combined with the limited surgical workspace demands precise instrument control and focus, making sensor-integrated robotic systems a promising solution. In this paper, we introduce a surgical system for telemanipulated endolaser photocoagulation that leverages instrument-integrated optical coherence tomography (iiOCT) for accurate distance measurement. We have developed a controller that maintains a specified instrument-to-retina distance, complemented by haptic shared control to assist the ophthalmic surgeon throughout the procedure. We conducted a pilot study involving 12 participants, including an expert vitreoretinal surgeon, to evaluate the system's performance across three levels of user assistance. The distance-based controller demonstrated a significant improvement in axial precision compared to telemanipulated trials, achieving a mean error of 4 micrometers and a standard deviation of 69 micrometers across all subjects. Preliminary experiments conducted on porcine eyes confirmed the feasibility of our approach on ex vivo tissues.
|
| |
| 09:00-10:30, Paper WeI1I.386 | Add to My Program |
| On Transient Release Dynamics in Robot Throwing: A Sliding Pivot Model |
|
| Liu, Yang | EPFL |
| Billard, Aude | EPFL |
Keywords: Dexterous Manipulation, Direct/Inverse Dynamics Formulation, Contact Modeling, Dynamic Manipulation
Abstract: Humans throw projectiles with high speed and accuracy, yet robots still lag behind despite precise control and low latency. A key obstacle is the lack of high-fidelity, tractable models for transient release dynamics, where momentum is exchanged via friction over ~50 ms. We first show that the conventional model combining rigid body dynamics and patch friction (LS model) suffers from pathological behaviors, resulting in poor prediction accuracy. While this can be mitigated using viscous smoothing and implicit integration (ILS model), it incurs a high computational cost. Motivated by the dominant effect of in-hand pivoting during release, we propose a Sliding Pivot (SP) model that simplifies the dynamics by capturing sticking–pivoting–sliding under vanishing normal force. The SP model offers comparable accuracy (within 10% of ILS) while running over 20× faster. Compared to the conventional LS model, SP reduces horizontal velocity error by 40% and angular velocity error by 63%, achieving 2.4 cm and 15.4 degrees mean absolute error for landing position and orientation. These results provide a robust, physically grounded foundation for scalable throwing robots.
|
| |
| 09:00-10:30, Paper WeI1I.387 | Add to My Program |
| LVI-Q: Robust LiDAR-Visual-Inertial-Kinematic Odometry for Quadruped Robots Using Tightly-Coupled and Efficient Alternating Optimization |
|
| Marsim, Kevin Christiansen | KAIST |
| Oh, Minho | KAIST |
| Yu, Byeongho | URobotics Corp |
| Lee, Seungjae | Korea Advanced Institute of Science and Technology |
| Nahrendra, I Made Aswin | KRAFTON |
| Lim, Hyungtae | Massachusetts Institute of Technology |
| Myung, Hyun | KAIST (Korea Advanced Institute of Science and Technology) |
Keywords: Sensor Fusion, SLAM, Legged Robots
Abstract: Autonomous navigation for legged robots in complex and dynamic environments relies on robust simultaneous localization and mapping (SLAM) systems to accurately map surroundings and localize the robot, ensuring safe and efficient operation. While prior sensor fusion-based SLAM approaches have integrated various sensor modalities to improve their robustness, these algorithms are still susceptible to estimation drift in challenging environments due to their reliance on unsuitable fusion strategies. Therefore, we propose a robust LiDAR-visual-inertial-kinematic odometry system that integrates information from multiple sensors, such as a camera, LiDAR, inertial measurement unit (IMU), and joint encoders, for visual and LiDAR-based odometry estimation. Our system employs a fusion-based pose estimation approach that runs optimization-based visual-inertial-kinematic odometry (VIKO) and filter-based LiDAR-inertial-kinematic odometry (LIKO) based on measurement availability. In VIKO, we utilize the foot-preintegration technique and robust LiDAR-visual depth consistency using superpixel clusters in a sliding window optimization. In LIKO, we incorporate foot kinematics and employ a point-to-plane residual in an error-state iterative Kalman filter (ESIKF). Compared with other sensor fusion-based SLAM algorithms, our approach shows robust performance across public and long-term datasets.
|
| |
| 09:00-10:30, Paper WeI1I.388 | Add to My Program |
| A Piezoelectrically-Actuated Mesoscale Compliant Parallel Robot Via Additive Manufacture |
|
| Tabak, Ariel | York University |
| Karuppiah, Annamalai | York University |
| Orszulik, Ryan | York University |
Keywords: Compliant Joints and Mechanisms, Additive Manufacturing, Parallel Robots
Abstract: Micro-positioning and pick-and-place applications at the millimeter scale are driving the development of smaller robots, necessitating the use of alternative methods for design and manufacture. Additive manufacturing can enable significant cost and time savings in the fabrication of robots while having a low barrier to entry. Specifically, multimaterial 3D printing naturally lends itself to the creation of monolithic mechanisms by removing the requirement for manual assembly, in particular when compliant joints can replace the rigid joints that are traditionally used. The lack of an assembly requirement naturally opens up the possibility of reducing the size scale of these mechanisms. In this work, the design, fabrication, and characterization of an additively manufactured mesoscale compliant parallel robot actuated by piezoelectric bimorphs through a compliant transmission mechanism is presented. The transmission mechanism is required to convert and amplify the small but rapid linear displacements of the piezoelectric actuators into the large rotational motion that is required to create a large workspace for the compliant parallel robot. The developed planar parallel robot has a workspace with maximum planar extents of 14.36 mm by 8.66 mm and a total area of 65.6 square millimeters. Three different trajectories are tracked at frequencies of up to 10 Hz, demonstrating the robot's capability to rapidly follow trajectories in its workspace.
|
| |
| 09:00-10:30, Paper WeI1I.389 | Add to My Program |
| DASP: Self-Supervised Nighttime Monocular Depth Estimation with Domain Adaptation of Spatiotemporal Priors |
|
| Huang, Yiheng | Guangdong University of Technology |
| Chen, Junhong | Guangdong University of Technology |
| Ning, Anqi | Shantou University |
| Liang, Zhanhong | Guangdong University of Technology |
| Michiels, Nick | Hasselt University - Flanders Make - Digital Future Lab |
| Claesen, Luc | Hasselt University |
| Liu, Wenyin | Guangdong University of Technology |
Keywords: Deep Learning for Visual Perception, Deep Learning Methods, Semantic Scene Understanding
Abstract: Self-supervised monocular depth estimation has achieved notable success under daytime conditions. However, its performance deteriorates markedly at night due to low visibility and varying illumination, e.g., insufficient light causes textureless areas, and moving objects produce blurry regions. To this end, we propose a self-supervised framework named DASP that leverages spatiotemporal priors for nighttime depth estimation. Specifically, DASP consists of an adversarial branch for extracting spatiotemporal priors and a self-supervised branch for learning. In the adversarial branch, we first design an adversarial network whose discriminator is composed of four spatiotemporal priors learning blocks (SPLB) that exploit daytime priors. In particular, the SPLB contains a spatial-based temporal learning module (STLM) that uses orthogonal differencing to extract motion-related variations along the time axis, and an axial spatial learning module (ASLM) that adopts local asymmetric convolutions with global axial attention to capture multiscale structural information. By combining the STLM and ASLM, our model can acquire sufficient spatiotemporal features to restore textureless areas and estimate the blurry regions caused by dynamic objects. In the self-supervised branch, we propose a 3D consistency projection loss that bilaterally projects the target frame and source frame into a shared 3D space and computes the 3D discrepancy between the two projected frames as a loss to optimize 3D structural consistency and daytime priors. Extensive experiments on the Oxford RobotCar and nuScenes datasets demonstrate that our approach achieves state-of-the-art performance for nighttime depth estimation. Ablation studies further validate the effectiveness of each component.
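A rough sketch of the bilateral projection idea behind such a 3D consistency loss is below. We assume, for brevity, pixel-aligned correspondences and a known relative pose T_src_to_tgt (both our own simplifications; the actual DASP loss may associate points differently):

    import torch

    def backproject(depth, K_inv):
        # Lift an HxW depth map to camera-frame 3D points of shape (H*W, 3).
        H, W = depth.shape
        v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3).float()
        return (K_inv @ pix.T).T * depth.reshape(-1, 1)

    def consistency_3d_loss(depth_tgt, depth_src, K, T_src_to_tgt):
        # Project both frames into the target camera's 3D space, penalise the gap.
        K_inv = torch.linalg.inv(K)
        pts_tgt = backproject(depth_tgt, K_inv)
        R, t = T_src_to_tgt[:3, :3], T_src_to_tgt[:3, 3]
        pts_src = backproject(depth_src, K_inv) @ R.T + t
        return torch.mean(torch.abs(pts_tgt - pts_src))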
|
| |
| 09:00-10:30, Paper WeI1I.390 | Add to My Program |
| SurgIRL: Towards Life-Long Learning for Surgical Automation by Incremental Reinforcement Learning |
|
| Ho, Yun-Jie | University of California San Diego |
| Chiu, Zih-Yun | Johns Hopkins University |
| Zhi, Yuheng | University of California, San Diego |
| Yip, Michael C. | University of California, San Diego |
Keywords: Surgical Robotics: Planning, Medical Robots and Systems, Surgical Robotics: Laparoscopy
Abstract: Surgical automation holds immense potential to improve the outcome and accessibility of surgery. Recent studies use reinforcement learning to automate various surgical tasks. However, these policies are developed independently, and their reusability is limited when applied to other scenarios, making it more time-consuming for robots to incrementally solve tasks. Inspired by how human surgeons build their expertise, we propose Surgical Incremental Reinforcement Learning (SurgIRL). SurgIRL aims to (1) acquire new skills by referring to external policies (knowledge) and (2) build an expandable knowledge base and reuse it to solve multiple unseen tasks incrementally (incremental learning). Our SurgIRL framework includes three major components. We first define an expandable knowledge set containing heterogeneous policies that can be helpful for surgical tasks. Then, we propose Knowledge Inclusive Attention Network with mAximum Coverage Exploration (KIAN-ACE), which enhances learning performance through extensive navigation of the knowledge base. Finally, we develop incremental learning pipelines to expand and reuse a knowledge base and solve multiple surgical tasks sequentially. Our simulation experiments show that SurgIRL efficiently learns to automate ten surgical tasks separately or incrementally. We also demonstrate successful sim-to-real transfers of SurgIRL's policies on the da Vinci Research Kit (dVRK). The results represent an initial step towards lifelong robot learning for surgical automation.
|
| |
| 09:00-10:30, Paper WeI1I.391 | Add to My Program |
| Nullspace Optimization of Redundant Robots for Dynamics Decoupling in Motion Force Control |
|
| Tang, Wenbo | Shanghai Jiao Tong University |
| Wang, Weiming | Shanghai Jiao Tong University |
| Wang, Shiquan | Shanghai Flexiv Robotics Technology Co., Ltd. |
| Liu, Wenhai | Shanghai Jiao Tong University |
Keywords: Force Control, Compliance and Impedance Control, Redundant Robots
Abstract: The dynamics coupling between motion and force subspaces in robotic control poses significant challenges to ensuring force control robustness, particularly under large external disturbances. While actively shaping the system inertia can eliminate this coupling, it introduces additional disturbances due to modeling uncertainties and force sensing errors. Inspired by how humans naturally adjust their elbow postures to facilitate motion force operations, we propose a quadratic programming-based nullspace optimization method that minimizes dynamics coupling for redundant torque-controlled robots. Integrated into an impedance motion force control framework, our approach minimizes an objective function defined by the Frobenius norm of the projection matrix representing inertia coupling in Cartesian space, yielding human-like postures that passively decouple task dynamics. Experimental results demonstrate that the proposed nullspace optimization significantly improves force control stability and tracking performance under conditions of high friction and external disturbances, outperforming conventional motion force control combined with traditional nullspace tracking approaches.
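The Frobenius-norm coupling objective can be pictured with a few lines of linear algebra. The sketch below evaluates the cross block of the Cartesian inertia between assumed motion- and force-controlled axes; it is an illustrative stand-in, and the paper's exact projection matrix may differ:

    import numpy as np

    def coupling_objective(M, J, S_motion, S_force):
        # M: (n,n) joint-space inertia; J: (6,n) task Jacobian.
        # S_motion / S_force: selection matrices picking the motion- and
        # force-controlled Cartesian axes, e.g. np.eye(6)[:3] and np.eye(6)[3:].
        Lambda = np.linalg.inv(J @ np.linalg.solve(M, J.T))  # Cartesian inertia
        coupling = S_force @ Lambda @ S_motion.T             # cross-coupling block
        return np.linalg.norm(coupling, ord="fro")

Minimizing such a scalar in the nullspace (subject to joint limits) is what drives the elbow toward postures in which force-subspace disturbances barely excite the motion subspace.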
|
| |
| 09:00-10:30, Paper WeI1I.392 | Add to My Program |
| Dense Monocular SLAM in Real-Time with Structured Gaussian Representation |
|
| Liu, Shaofan | Hefei University of Technology |
| Wei, Xing | Hefei University of Technology |
| Zhao, Chong | Hefei University of Technology |
| Tian, Aoxiang | Hefei University of Technology |
| Du, Bin | Hefei University of Technology |
Keywords: Mapping, Localization
Abstract: Monocular dense SLAM faces significant challenges in low-texture environments and under rapid camera motions. The recent development of 3D Gaussian Splatting (3DGS) offers a promising approach for real-time dense 3D reconstruction. However, existing 3DGS-based SLAM systems employ end-to-end optimization frameworks, which often struggle to achieve both efficient camera tracking and high-quality scene reconstruction simultaneously. To address this challenge, we propose a dense decoupled SLAM system that seamlessly integrates traditional visual odometry with 3DGS within a unified framework. Our system utilizes dense direct image alignment using pseudo-depth maps rendered from a global model, which is represented by an octree-managed structured Gaussian representation. This structured Gaussian supports fast rendering and efficient mesh extraction. Furthermore, we adopt a stereo 3D reconstruction model to generate dense depth maps from visual odometry for optimizing the 3D Gaussians. Experimental results demonstrate that our framework achieves state-of-the-art performance in both tracking robustness and reconstruction quality, outperforming existing monocular Gaussian-based SLAM systems while maintaining real-time efficiency.
|
| |
| 09:00-10:30, Paper WeI1I.393 | Add to My Program |
| COMET: A Dual Swashplate Autonomous Coaxial Bi-Copter AAV with High-Maneuverability and Long-Endurance |
|
| Wang, Shuai | Sun Yat-Sen University |
| Tang, Xiaoming | Sun Yat-Sen University |
| Liang, Junning | Sun Yat-Sen University |
| Zheng, Haowen | Sun Yat-Sen University |
| Ye, Biyu | Sun Yat-Sen University |
| Liu, Zhaofeng | Sun Yat-Sen University |
| Gao, Fei | Zhejiang University |
| Lyu, Ximin | Sun Yat-Sen University |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications
Abstract: Coaxial bi-copter unmanned aerial vehicles (UAVs) have garnered attention due to their potential for improved rotor system efficiency and compact form factor. However, balancing efficiency, maneuverability, and compactness in coaxial bi-copter systems remains a key design challenge, limiting their practical deployment. This letter introduces COMET, a coaxial bi-copter UAV platform featuring a dual swashplate mechanism. The coaxial bi-copter system's efficiency and compactness are optimized through bench tests, and the whole prototype's efficiency and robustness under varying payload conditions are verified through flight endurance experiments. The maneuverability performance of the system is evaluated in comprehensive trajectory tracking tests. The results indicate that the dual swashplate configuration enhances tracking performance and improves flight efficiency compared to the single swashplate alternative. Successful autonomous flight trials across various scenarios verify COMET's potential for real-world applications.
|
| |
| 09:00-10:30, Paper WeI1I.394 | Add to My Program |
| Neuromorphic Event Camera-Based Object Recognition and Grasping Position Detection Using a Transfer Learning-Enhanced Multi-Task Model (I) |
|
| Zafar, Muhammad Hamza | University of Agder |
| Moosavi, Syed Kumayl Raza | University of Agder |
| Sanfilippo, Filippo | University of Agder |
Keywords: Deep Learning in Grasping and Manipulation, Industrial Robots, Computer Vision for Automation
Abstract: Object recognition and grasping position detection are critical tasks in robotic manipulation, particularly when operating in dynamic and unstructured environments. This paper presents the Channel Sharpening Attention-based Adaptive Inception Network (CSA-AInceptNet), a novel multi-task learning model designed for these tasks using event camera data. The proposed architecture integrates channel sharpening attention with adaptive inception networks to enhance feature extraction and improve robustness. The model's performance is evaluated on two state-of-the-art event camera datasets, E-Grasp and Neuro-Grasp. On the E-Grasp dataset, CSA-AInceptNet achieves a remarkable accuracy of 99.47% and a mean Intersection over Union (IoU) of 0.9370, significantly surpassing existing methods. On the Neuro-Grasp dataset, leveraging transfer learning, the model attains 98.58% accuracy and a mean IoU of 0.4897, demonstrating strong generalization capabilities across datasets. Comparative analyses and ablation studies further validate the effectiveness of the proposed architecture, highlighting its superiority over conventional models like ConvNeXt, DarkNet, DenseNet, and VGG16. The results establish CSA-AInceptNet as a robust solution for event-based object recognition and grasping detection, paving the way for advancements in human-robot collaboration and dynamic robotic manipulation.
|
| |
| 09:00-10:30, Paper WeI1I.395 | Add to My Program |
| Impact-Aware Dual-Arm Manipulation |
|
| Hermus, James | EPFL |
| Bombile, Michael Bosongo | Cybernetics Laboratory (CynLr SA) |
| van Steen, Jari J. | Eindhoven University of Technology |
| Jeandupeux, Elise | EPFL |
| Zermane, Ahmed | CNRS-LIRMM |
| Melone, Alessandro | German Aerospace Center (DLR) |
| Troebinger, Mario | Technical University of Munich |
| Naceri, Abdeldjallil | Technical University of Munich |
| Lacoursière, Claude | Algoryx Simulation |
| de Looijer, Stijn | Delft University of Technology |
| Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
| Kheddar, Abderrahmane | CNRS-AIST |
| Saccon, Alessandro | Eindhoven University of Technology - TU/e |
| Billard, Aude | EPFL |
Keywords: Dual Arm Manipulation, Logistics, Motion Control
Abstract: This article presents an impact-aware manipulation framework for logistics, where growing e-commerce demands have increased the need for faster and more flexible package handling solutions. The proposed framework addresses swiftly grabbing and placing objects in depalletizing tasks with dual-arm robotic systems. Impact-aware robotics leverages intentional collisions to achieve dynamic interactions and has the potential to be faster and more energy efficient than classical quasi-static approaches. The generation of desired impacts brings multiple challenges encompassing robust motion generation, managing impacts with objects, enforcing hardware and safety constraints, contact state sensing, and simulation of contact behavior. To tackle these challenges, we developed within the EU-funded Impact-Aware Manipulation (I.AM.) project an integrated solution exploiting nonsmooth mechanics for impact modeling, dynamical systems for motion generation, QP-based control for constraint enforcement, internal state sensing without external force transducers, and batch-capable impact simulation. This article highlights the benefits of the proposed approach in terms of speed (a 29% decrease in average task time) and energy efficiency (a 35% decrease) through a systematic comparison between classical grabbing and impact-aware swift grabbing and tossing. In summary, our article underscores the transformative potential of impact-aware technologies in revolutionizing robotic logistics operations.
|
| |
| 09:00-10:30, Paper WeI1I.396 | Add to My Program |
| Stability Principle Inherent in Wheel Gait of Planar X-Shaped Walker Generated Using Constant Torque Drive and Mechanical Stoppers |
|
| Asano, Fumihiko | Japan Advanced Institute of Science and Technology |
| Yan, Cong | Ritsumeikan University |
Keywords: Legged Robots, Mechanism Design, Dynamics
Abstract: Since the late 19th century when the first walking toys were developed, it has been known that mechanical stoppers at the hip joint are crucial for generating stable passive dynamic walking. Recent research on passive-dynamic and limit-cycle walkers has also confirmed that mechanical stoppers at the hip and knee joints are effective for generating stable walking motion, but theoretical research on how this mechanical constraint enhances the overall gait stability remains insufficient. This paper introduces a planar X-shaped walker equipped with mechanical stoppers at the hip joint, and investigates the effect of the mechanical constraint on the stability of the wheel gait generated by constant torque drive on a downslope. By simply falling forward while using the stoppers to constrain itself to the target impact posture, the robot can generate a highly stable wheel gait. We divide the motion of one step into four phases, derive approximate analytical solutions for the state error transition function matrix in each phase using a linearized model, and analyze the increase or decrease in the state error norm using metrics such as its maximum singular value. Numerical simulations demonstrate that while two phases are unstable in terms of the increase in the state error norm, the remaining two phases are stable, resulting in overall asymptotic stability.
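The per-phase analysis described above amounts to inspecting singular values of the state-error transition matrices and the spectral radius of their product. A toy version of that bookkeeping, with placeholder matrices rather than the walker's actual linearization, is:

    import numpy as np

    def gait_stability_report(phase_matrices):
        # phase_matrices: state-error transition matrix of each phase of one step.
        step = np.eye(phase_matrices[0].shape[0])
        for i, Phi in enumerate(phase_matrices, 1):
            smax = np.linalg.svd(Phi, compute_uv=False)[0]
            print(f"phase {i}: max singular value = {smax:.3f} "
                  f"({'may grow' if smax > 1 else 'contracts'} the error norm)")
            step = Phi @ step
        rho = max(abs(np.linalg.eigvals(step)))
        print(f"one-step spectral radius = {rho:.3f} "
              f"({'asymptotically stable' if rho < 1 else 'unstable'})")

This mirrors the paper's finding in spirit: individual phases may expand the error norm while the composed one-step map remains contractive.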
|
| |
| 09:00-10:30, Paper WeI1I.397 | Add to My Program |
| SLOT-MPC: A Hierarchical Whole-Body Model Predictive Controller to Enhance Simultaneous Localization and Object Tracking for UAVs |
|
| Hua, Zhengyu | Harbin Institute of Technology, Shenzhen |
| Xing, Li | Harbin Institute of Technology, Shenzhen |
| Wu, Yiwei | Harbin Institute of Technology, Shenzhen |
| Wang, Can | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences |
| Lu, Wencan | Shenzhen University General Hospital |
| Chen, Haoyao | Harbin Institute of Technology, Shenzhen |
|
|
| |
| 09:00-10:30, Paper WeI1I.398 | Add to My Program |
| Reinforcement Learning for Robust Athletic Intelligence: Lessons from the 2nd “AI Olympics with RealAIGym” Competition |
|
| Wiebe, Felix | DFKI GmbH Robotics Innovation Center |
| Turcato, Niccolò | Università Degli Studi Di Padova |
| Dalla Libera, Alberto | University of Padova |
| Choe, Jean Seong Bjorn | Korea University |
| Choi, Bumkyu | Korea University |
| Faust, Tim | Technical University of Darmstadt |
| Maraqten, Habib | TU Darmstadt |
| Aghadavoodi Jolfaei, Erfan | TU Darmstadt |
| Calì, Marco | University of Padova |
| Sinigaglia, Alberto | Università Di Padova |
| Giacomuzzo, Giulio | University of Padova |
| Carli, Ruggero | University of Padova |
| Romeres, Diego | Mitsubishi Electric Research Laboratories |
| Kim, Jong-kook | Korea University |
| Susto, Gian Antonio | University of Padova |
| Vyas, Shubham | Robotics Innovation Center, DFKI GmbH |
| Mronga, Dennis | German Research Center for Artificial Intelligence |
| Belousov, Boris | German Research Center for Artificial Intelligence - DFKI |
| Peters, Jan | Technische Universität Darmstadt |
| Kirchner, Frank | University of Bremen |
| Kumar, Shivesh | Chalmers University of Technology |
Keywords: Robust/Adaptive Control, Reinforcement Learning, Performance Evaluation and Benchmarking
Abstract: In robotics, many different approaches, ranging from classical planning and optimal control to reinforcement learning (RL), are developed or borrowed from other fields to achieve reliable control in diverse tasks. To get a clear understanding of their individual strengths and weaknesses and their applicability to real-world robotic scenarios, it is important to benchmark and compare their performance not only in simulation but also on real hardware. The ‘2nd AI Olympics with RealAIGym’ competition was held at the IROS 2024 conference to contribute to this cause and to evaluate different controllers according to their ability to solve a dynamic control problem on an underactuated double pendulum system with chaotic dynamics. This paper describes the four different RL methods submitted by the participating teams, presents their performance in the swing-up task on a real double pendulum, measured against various criteria, and discusses their transferability from simulation to real hardware and their robustness to external disturbances.
|
| |
| 09:00-10:30, Paper WeI1I.399 | Add to My Program |
| Robust Visual Localization in Compute-Constrained Environments by Salient Edge Rendering and Weighted Hamming Similarity |
|
| Pham, Tu-Hoa | NASA Jet Propulsion Laboratory |
| Bailey, Philip | Blue Origin |
| Posada, Daniel | Blue Origin |
| Georgakis, Georgios | Jet Propulsion Laboratory, California Institute of Technology |
| Enriquez, Jorge | Amazon |
| Suresh, Surya | Ohio State University |
| Dolci, Marco | Jet Propulsion Laboratory |
| Twu, Philip | NASA Jet Propulsion Laboratory |
| |
| 09:00-10:30, Paper WeI1I.400 | Add to My Program |
| ProbPer-LiLo: Probabilistic Persistency Modeling for Life-Long Mapping |
|
| Ali, Waqas | KTH Royal Institute of Technology |
| Cai, Yixi | KTH Royal Institute of Technology |
| Jensfelt, Patric | KTH - Royal Institute of Technology |
| Nguyen, Thien-Minh | The University of Queensland |
Keywords: Mapping, Probabilistic Inference, SLAM
Abstract: 3D mapping is vital for a broad range of applications that rely on a consistent and accurate representation of the environment. Change is an ever-present force in our world, and as a scene evolves its 3D map becomes outdated. Thus, a mapping framework that can adapt and refine 3D maps as the scene changes is necessary. In this paper, we propose a lifelong mapping framework where map maintenance is based on two objectives: preservation of static structures and refinement of the 3D map. To preserve only the static structures, we classify each object's state and remove the dynamic objects and the quasi-static objects, i.e., objects which temporarily appear static. For classifying the state of objects, we propose a discrete probabilistic solution utilizing a factor graph. Using this classification, we generate static maps from multiple sessions which are used for map refinement. The refinement is based on change detection and map update, leveraging semantic and geometric information. For the evaluation, we collect a multi-campus lifelong dataset as an extension of the MCD datasets from the KTH and NTU campuses. The proposed approach is capable of accurately detecting quasi-static objects even in highly dynamic environments. Our system demonstrates state-of-the-art performance in large-scale environments. Furthermore, our approach can handle both SLAM-generated and survey-grade maps.
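The paper's factor-graph classification is richer than this, but the flavour of a per-object persistency belief can be conveyed with a simple log-odds recursion. The hit/miss probabilities and thresholds below are our own assumptions, not the paper's:

    import math

    def update_persistence(log_odds, observed, p_hit=0.9, p_miss=0.2):
        # Static-vs-not belief update for one map object after one session:
        # observed=True if the object was re-detected in the current session.
        p = p_hit if observed else p_miss
        return log_odds + math.log(p / (1.0 - p))

    def classify(log_odds, tau=2.0):
        if log_odds > tau:
            return "static"
        return "quasi-static" if log_odds < -tau else "uncertain"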
|
| |
| 09:00-10:30, Paper WeI1I.401 | Add to My Program |
| TeachingBot: Robot Teacher for Human Handwriting |
|
| Hou, Zhimin | Lingnan University |
| Yu, Cunjun | NUS |
| Hsu, David | National University of Singapore |
| Yu, Haoyong | National University of Singapore |
Keywords: Human-Centered Robotics, Physical Human-Robot Interaction, Human Performance Augmentation
Abstract: Teaching and learning physical skills often require one-on-one interaction, making it difficult to scale up, as there are not enough human teachers. Robots offer an attractive alternative. This paper presents TeachingBot, an adaptive robotic system that teaches handwriting to human learners through physical interaction. Robot teaching poses two major challenges: (i) adapting to the individual handwriting style of the learner and (ii) maintaining an engaging learning experience. For the first challenge, TeachingBot uses a probabilistic model to capture the learner's writing style from their writing samples. Drawing on the insight that effective teaching balances standardization with individuality, the system generates a personalized teaching trajectory that aligns with the learner's natural writing. For the second challenge, TeachingBot employs variable impedance control to guide the learner, dynamically adjusting the strength of physical guidance based on the learner's performance. Human-subject experiments with 15 participants demonstrate the effectiveness of TeachingBot, showing clear improvement in learners' handwriting and engagement over baseline methods.
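A variable-impedance guidance law of the kind described can be sketched in a few lines; the adaptation rule and gains below are illustrative assumptions, not TeachingBot's actual law:

    import numpy as np

    def guidance_stiffness(error_norm, k_min=5.0, k_max=80.0, e_ref=0.02):
        # Scale guidance stiffness with recent tracking error: a struggling
        # learner receives firmer guidance, a proficient one more freedom.
        alpha = np.clip(error_norm / e_ref, 0.0, 1.0)
        return k_min + alpha * (k_max - k_min)

    def guidance_force(x, x_ref, dx, k, zeta=1.0):
        # Critically-damped variable-impedance force for unit virtual mass.
        d = 2.0 * zeta * np.sqrt(k)
        return k * (x_ref - x) - d * dx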
|
| |
| 09:00-10:30, Paper WeI1I.402 | Add to My Program |
| Importance Sampling Model-Based Diffusion for Trajectory Optimization |
|
| Golembeski, Seth | Georgia Institute of Technology |
| Mazumdar, Anirban | Georgia Institute of Technology |
Keywords: Nonholonomic Motion Planning, Motion and Path Planning
Abstract: Trajectory optimization for robotic systems remains a challenging problem. This is especially true for robotic systems featuring nonlinear dynamics and many degrees of freedom. Data-based or model-free diffusion has recently been popularized in the fields of artificial intelligence and trajectory optimization. Model-Based Diffusion provides a data-free method of trajectory optimization, trained at runtime on a system dynamics model, suitable for high-dimensional models. This paper examines how importance sampling can enhance the performance of Model-Based Diffusion for trajectory optimization. We quantify the benefits of importance sampling across three long horizon planning tasks. These results show as much as a 13x improvement in sample efficiency depending on environment and optimization parameters.
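One Monte Carlo denoising step of a model-based-diffusion-style optimiser looks roughly as follows; the paper's contribution concerns choosing a better importance-sampling proposal than the plain Gaussian used here, and all names and parameters below are our own:

    import numpy as np

    def mbd_denoise_step(mean, sigma, rollout_cost, n_samples=256,
                         temperature=1.0, rng=None):
        # Perturb the current control sequence, score the samples with the
        # dynamics model, and take an importance-weighted average.
        rng = rng or np.random.default_rng(0)
        U = mean + sigma * rng.standard_normal((n_samples,) + mean.shape)
        costs = np.array([rollout_cost(u) for u in U])
        w = np.exp(-(costs - costs.min()) / temperature)  # self-normalised weights
        w /= w.sum()
        return np.tensordot(w, U, axes=1)                 # new mean control sequence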
|
| |
| 09:00-10:30, Paper WeI1I.403 | Add to My Program |
| FlowDreamer: A RGB-D World Model with Flow-Based Motion Representations for Robot Manipulation |
|
| Guo, Jun | Tsinghua University |
| Ma, Xiaojian | State Key Laboratory of General Artificial Intelligence |
| Wang, Yikai | Beijing Normal University |
| Yang, Min | University of Science and Technology of China |
| Liu, Huaping | Tsinghua University |
| Li, Qing | State Key Laboratory of General Artificial Intelligence |
Keywords: Deep Learning in Grasping and Manipulation, Visual Learning, Deep Learning Methods
Abstract: This paper investigates training better visual world models for robot manipulation, i.e., models that can predict future visual observations by conditioning on past frames and robot actions. Specifically, we consider world models that operate on RGB-D frames (RGB-D world models). As opposed to canonical approaches that handle dynamics prediction mostly implicitly and reconcile it with visual rendering in a single model, we introduce FlowDreamer, which adopts 3D scene flow as an explicit motion representation. FlowDreamer first predicts the 3D scene flow from past frame and action conditions with a U-Net, and then a diffusion model predicts the future frame utilizing the scene flow. FlowDreamer is trained end-to-end despite its modularized nature. We conduct experiments on four different benchmarks, covering both video prediction and visual planning tasks. The results demonstrate that FlowDreamer outperforms other baseline RGB-D world models by 7% on semantic similarity, 11% on pixel quality, and 6% on success rate in various robot manipulation domains.
|
| |
| 09:00-10:30, Paper WeI1I.404 | Add to My Program |
| Velocity Potential Field Modulation for Dense Coordination of Polytopic Swarms and Its Application to Assistive Robotic Furniture |
|
| Tang, Lixuan | EPFL |
| Rüegg, David | École Polytechnique Fédérale De Lausanne |
| Zhang, Runze | Tongji University |
| Bolotnikova, Anastasia | LAAS-CNRS |
| Rabaey, Jan M. | University of California, Berkeley |
| Ijspeert, Auke | EPFL |
Keywords: Collision Avoidance, Path Planning for Multiple Mobile Robots or Agents, Swarm Robotics
Abstract: We explore the use of mobile furniture swarms that are intended to assist users with limited mobility in their daily indoor activities. We focus on the multi-agent coordination problem for a mobile furniture swarm when a dense target pose configuration is required, such as in an apartment setting. In those cases, one agent’s convergence to the target can be significantly affected by neighboring agents with specific shapes. In this letter, we propose a solution, named Velocity Potential Field Modulation (VPFM), to deal with the dense coordination problem of a polytopic swarm in a decentralized manner. We adapt our method to assistive applications, such as room reconfigurations and facilitating indoor movement of wheelchair users. We evaluate the performance of our method in simulations and on real-world mobile furniture hardware, demonstrating its effectiveness and real-time performance.
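The modulation idea can be illustrated for circular agents; the paper handles general polytopes, and this simplified sketch only damps the velocity component aimed at a neighbour as the gap closes:

    import numpy as np

    def modulated_velocity(v_nominal, x, x_nbr, r_nbr, reactivity=1.0):
        # Shrink the velocity component directed at a neighbour near contact.
        d = x - x_nbr
        dist = np.linalg.norm(d)
        if dist < 1e-9:
            return v_nominal
        n = d / dist                              # unit vector away from neighbour
        gamma = max(dist / r_nbr, 1.0)            # >= 1 outside the neighbour
        lam = 1.0 - (1.0 / gamma) ** reactivity   # 0 at the boundary, toward 1 far away
        v_n = (v_nominal @ n) * n                 # normal (approach/retreat) part
        if v_nominal @ n >= 0:                    # already moving away: leave as is
            return v_nominal
        return (v_nominal - v_n) + lam * v_n      # keep tangent, damp approach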
|
| |
| 09:00-10:30, Paper WeI1I.405 | Add to My Program |
| TCS Jumper: A Bio-Inspired Jumping Robot Featuring High Energy Density Via Synergistic Deformation |
|
| Yang, Yang | Hefei Institutes of Physical Science |
| Kong, Deyi | Hefei Institutes of Physical Science |
| Wei, Yuliang | Hefei Institutes of Physical Science, Chinese Academy of Sciences |
| Zhong, Junkui | Hefei Institutes of Physical Science, Chinese Academy of Sciences. |
| Su, Zhengguo | University of Science and Technology of China |
|
|
| |
| 09:00-10:30, Paper WeI1I.406 | Add to My Program |
| Onboard Mission Replanning for Adaptive Cooperative Multi-Robot Systems |
|
| Kwan, Elim | The Alan Turing Institute |
| Qureshi, Rehman | Auburn University |
| Fletcher, Liam | The Alan Turing Institute |
| Laganier, Colin | The Alan Turing Institute |
| Nockles, Victoria | The Alan Turing Institute |
| Walters, Richard | The Alan Turing Institute |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Cooperating Robots, Reinforcement Learning
Abstract: Cooperative autonomous robotic systems have significant potential for executing complex multi-task missions across space, air, ground, and maritime domains. However, they commonly operate in remote, dynamic, and hazardous environments, requiring rapid in-mission adaptation without relying on fragile or slow communication links to centralized compute. Fast, onboard replanning algorithms are therefore essential to enhance resilience for these systems, but do not yet exist. Reinforcement Learning (RL) shows strong promise for efficiently solving mission planning tasks formulated as Travelling Salesperson Problems (TSPs), but existing methods: 1) are unsuitable for replanning, where agents do not start at a single location; 2) do not allow cooperation between agents; 3) are unable to model tasks with variable durations; or 4) lack practical considerations for onboard deployment. Here we address this gap by defining the Cooperative Mission Replanning Problem as a novel adaptation of the multiple TSP, and develop a new encoder/decoder-based RL model to solve it effectively and efficiently. Using a simple example of cooperative drones, we show our replanner consistently (90% of the time) maintains performance within 10% of the state-of-the-art LKH3 heuristic solver, whilst running 85-370 times faster on a Raspberry Pi. This work paves the way for increased resilience in autonomous multi-agent systems.
|
| |
| 09:00-10:30, Paper WeI1I.407 | Add to My Program |
| Whole-Body Stabilization of a Cable-Suspended Multirotor Platform Carrying a Slung Load (I) |
|
| Das, Hemjyoti | Technical University of Vienna |
| Zambella, Grazia | TU Wien |
| Ott, Christian | TU Wien |
Keywords: Aerial Systems: Applications, Robotics and Automation in Construction, Aerial Systems: Mechanics and Control
Abstract: Suspended multirotor platforms are fascinating systems that can be employed in construction applications to provide safe transportation of heavy loads. Such a system, comprising a cable-suspended platform with an attached load, exhibits seven-degree-of-freedom (DoF) motion as a whole. In this paper, we propose a composite whole-body control framework for the stabilization of the suspended multirotor platform system, leveraging singular perturbation theory to exploit the inherent three-time-scale dynamics of the system. The control strategy computes the underactuated 3-DoF wrench space generated by the platform’s actuation units for the stabilization of the complete system. Building upon this, we develop a superposition-based shared control approach and then compare the two controllers. Moreover, to address specific cases where the time-scale separation between two dynamics of the triple-spherical pendulum becomes negligible, we design an operational space controller. The control approaches are validated using both extensive numerical simulations and experiments in different scenarios. We also carry out a numerical robustness and stability analysis of the whole system. Note that our system relies on only onboard sensors for state estimation, which makes it effective for real-life outdoor applications.
|
| |
| 09:00-10:30, Paper WeI1I.408 | Add to My Program |
| Open-Source Web Lab for Remote and On-Site Robotics Practice in a Realistic Zero-Setup ROS2 Development Environment (I) |
|
| Krūmiņš, Dāvis | University of Tartu |
| Vunder, Veiko | University of Tartu |
| Kasemägi, Heiki | University of Tartu |
| Aabloo, Alvo | University of Tartu |
| Kruusamäe, Karl | University of Tartu |
Keywords: Education Robotics, Networked Robots, Software Tools for Robot Programming
Abstract: Robot Operating System 2 (ROS 2) has become the standard framework in modern robotics, but its steep learning curve and complex development environment present significant barriers to newcomers. This paper presents an open-source web lab through which learners can start practicing ROS 2 programming on real robots right away, with no development environment setup. The web lab offers a realistic ROS 2 development experience by serving browser-based Linux desktop workstations linked to physical remote robots. We also show how the web lab system can be leveraged to create portable infrastructure for hosting ad-hoc on-site robotics workshops that require zero setup time from the participants.
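As a flavour of the zero-setup exercises such a lab can host, here is the canonical first ROS 2 node a learner might run from a browser workstation (a generic rclpy example, not code from the paper):

    import rclpy
    from rclpy.node import Node
    from std_msgs.msg import String

    class HelloPublisher(Node):
        def __init__(self):
            super().__init__('hello_publisher')
            self.pub = self.create_publisher(String, 'chatter', 10)
            self.timer = self.create_timer(1.0, self.tick)  # fire once per second
            self.count = 0

        def tick(self):
            msg = String()
            msg.data = f'hello from the web lab #{self.count}'
            self.count += 1
            self.pub.publish(msg)

    def main():
        rclpy.init()
        node = HelloPublisher()
        try:
            rclpy.spin(node)
        finally:
            node.destroy_node()
            rclpy.shutdown()

    if __name__ == '__main__':
        main()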
|
| |
| 09:00-10:30, Paper WeI1I.409 | Add to My Program |
| Nonlinear Modeling of the Finite Helical Deformation of 3D-Printed PneuNets |
|
| Yu, Qinghua | Shanghai Jiao Tong University |
| Zhang, Mengjie | Shanghai Jiao Tong University |
| Jiang, Chengru | Shanghai Jiao Tong University |
| Gu, Guoying | Shanghai Jiao Tong University |
| Wang, Dong | Shanghai Jiao Tong University |
| |
| 09:00-10:30, Paper WeI1I.410 | Add to My Program |
| EeLsT: An Energy-Efficient Long-Short Term Approach for Sustainable Sailboat Autonomy in Disturbed Marine Environment |
|
| Sun, Qinbo | The Chinese University of Hong Kong, Shenzhen |
| Qi, Weimin | The Chinese University of Hong Kong, Shenzhen |
| Qian, Huihuan (Alex) | The Chinese University of Hong Kong, Shenzhen |
Keywords: Field Robots, Energy and Environment-Aware Automation, Marine Robotics, Robotics in Hazardous Fields
Abstract: Sailboats are purely wind-driven and thus have great potential for long-term voyaging. For robotic sailboats, energy constraints are crucial to the sustainability of automation, and reducing the control frequency of the actuators is an important means of conserving energy. This study proposes an energy-efficient long-short term (EeLsT) approach for sustainable sailing. Our approach can be applied generally as an energy management module in sailing robots. It explicitly leverages the sailing motion characteristics and the dynamic model of the robot, taking marine disturbances into account. We design an enhanced simulation platform to evaluate motion performance and energy consumption, and run experiments with both a baseline approach and a scheme incorporating the EeLsT method. In simulation, the EeLsT approach saves 31.8% of energy. In the real marine environment, experiments are conducted with OceanVoy; the results show that 27.4% of the energy is saved during stable sailing. In long-term sailing, compared to the standby mode in which the motors are not working, the average power of the full automation mode increases by no more than 1 W, i.e., a relative increase of about 4%.
|
| |
| 09:00-10:30, Paper WeI1I.411 | Add to My Program |
| Deformable Cluster Manipulation Via Whole-Arm Policy Learning |
|
| Jacob, Jayadeep | University of Sydney |
| Zhang, Wenzheng | University of Sydney |
| Warren, Houston | University of Sydney |
| Borges, Paulo Vinicius Koerich | CSIRO |
| Ramos, Fabio | University of Sydney, NVIDIA |
| Bandyopadhyay, Tirthankar | CSIRO |
Keywords: Dexterous Manipulation, Reinforcement Learning, Simulation and Animation
Abstract: Manipulating clusters of deformable objects presents a substantial challenge with widespread applicability, but requires contact-rich whole-arm interactions. A potential solution must address the limited capacity for realistic model synthesis, high uncertainty in perception, and the lack of efficient spatial abstractions, among others. We propose a novel framework for learning model-free policies integrating two modalities: 3D point clouds and proprioceptive touch indicators, emphasising manipulation with full body contact awareness, going beyond traditional end-effector modes. Our reinforcement learning framework leverages a distributional state representation, aided by kernel mean embeddings, to achieve improved training efficiency and real-time inference. Furthermore, we propose a novel context-agnostic occlusion heuristic to clear deformables from a target region for exposure tasks. We deploy the framework in a power line clearance scenario and observe that the agent generates creative strategies leveraging multiple arm links for de-occlusion. Finally, we perform zero-shot sim-to-real policy transfer, allowing the arm to clear real branches with unknown occlusion patterns, unseen topology, and uncertain dynamics.
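The distributional state representation rests on kernel mean embeddings, whose core computation for a point cloud is compact. The anchor points and bandwidth below are our own illustrative choices:

    import numpy as np

    def kernel_mean_embedding(points, anchors, gamma=10.0):
        # Fixed-length distributional summary of a point cloud: the RBF kernel
        # mean embedding evaluated at M anchor points.
        # points: (N, 3) cloud; anchors: (M, 3); returns an (M,) feature vector.
        d2 = ((points[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2).mean(axis=0)

Because the embedding averages over points, it is permutation-invariant and its length does not depend on the cloud size, which is what makes it convenient as a policy input.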
|
| |
| 09:00-10:30, Paper WeI1I.412 | Add to My Program |
| CAR-Stereo: Confidence-Aware Adaptive Disparity Refinement for Real-Time Stereo Matching |
|
| Park, Chanill | Korea Institute of Robotics & Technology Convergence (KIRO) |
| Kim, Janghyun | Pusan National University |
| Kweon, Minseong | Minnesota Robotics Institute (MnRI), University of Minnesota, Twin Cities |
| Park, Jinsun | Pusan National University |
|
|
| |
| 09:00-10:30, Paper WeI1I.413 | Add to My Program |
| Priority-Aware Multi-Robot Coverage Path Planning |
|
| Lee, Kanghoon | Korea Advanced Institute of Science and Technology |
| Kim, Hyeonjun | Korea Military Academy |
| Li, Jiachen | University of California, Riverside |
| Park, Jinkyoo | Korea Advanced Institute of Science and Technology |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Cooperating Robots
Abstract: Multi-robot systems are widely used for coverage tasks that require efficient coordination across large environments. In Multi-Robot Coverage Path Planning (MCPP), the objective is typically to minimize the makespan by generating non-overlapping paths for full-area coverage. However, most existing methods assume uniform importance across regions, limiting their effectiveness in scenarios where some zones require faster attention. We introduce the Priority-Aware MCPP (PA-MCPP) problem, where a subset of the environment is designated as prioritized zones with associated weights. The goal is to minimize, in lexicographic order, the total priority-weighted latency of zone coverage and the overall makespan. To address this, we propose a scalable two-phase framework combining (1) greedy zone assignment with local search, spanning-tree-based path planning, and (2) Steiner-tree-guided residual coverage. Experiments across diverse scenarios demonstrate that our method significantly reduces priority-weighted latency compared to standard MCPP baselines, while maintaining competitive makespan. Sensitivity analyses further show that the method scales well with the number of robots and that zone coverage behavior can be effectively controlled by adjusting priority weights.
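The lexicographic objective can be scored directly from candidate paths. A simplified grid-world version, under our own simplifying assumptions of unit-time cells and a zone counting as covered once all of its cells have been visited, is:

    def pa_mcpp_objective(paths, zone_cells, zone_weights, speed=1.0):
        # paths: one list of grid cells per robot; zone_cells: zone id -> cells.
        # Returns (priority-weighted latency, makespan), compared lexicographically.
        visit_time = {}
        for path in paths:
            for t, cell in enumerate(path):
                visit_time[cell] = min(visit_time.get(cell, float("inf")), t / speed)
        latency = 0.0
        for z, cells in zone_cells.items():
            cover_t = max(visit_time.get(c, float("inf")) for c in cells)
            latency += zone_weights[z] * cover_t
        makespan = max(len(p) for p in paths) / speed
        return latency, makespan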
|
| |
| 09:00-10:30, Paper WeI1I.414 | Add to My Program |
| Real-Time Millimeter-Accurate Underwater Pose Estimation Via Tightly-Coupled Fusion of Vision and Optical Tracking |
|
| Gao, Yuer | The Hong Kong University of Science and Technology (Guangzhou) |
| Xu, Tongqing | The Hong Kong University of Science and Technology (Guangzhou) |
| Cai, Yi | The Hong Kong University of Science and Technology (Guangzhou) |
| |
| 09:00-10:30, Paper WeI1I.415 | Add to My Program |
| Deadlock-Aware Control for Multi-Robot Coordination with Multiple Safety Constraints |
|
| Zhang, Zhenwei | Huazhong University of Science and Technology |
| Zhang, Yuhao | Huazhong University of Science and Technology |
| Zhao, Xingwei | Huazhong University of Science and Technology |
| Tao, Bo | Huazhong University of Science and Technology |
| Ding, Han | Huazhong University of Science and Technology |
Keywords: Multi-Robot Systems, Deadlock Detection and Avoidance, Distributed Robot Systems, Motion Control
Abstract: Multi-robot coordination in shared workspaces is prone to deadlocks, which can compromise operational capabilities and task efficiency. Accurately determining the timing and spatial locations of deadlocks is essential for effective resolution, yet remains challenging due to dynamic robot interactions and growing system complexity. To this end, a distributed deadlock-aware control framework is proposed for robots to detect and avoid deadlocks while maintaining safe task execution. First, deadlocks are characterized by analyzing undesired equilibria in robot dynamics under safety constraints imposed by multiple stacked control barrier functions (CBFs). Our analysis reveals two critical properties: 1) deadlocks occur at intersections of all active CBF boundaries, and 2) deadlocks arise when the robot's stabilizing force is confined within the conical hull formed by the active safety forces. These theoretical insights underpin a new detection method that identifies potential deadlocks from conflicts between safety requirements and task objectives. Furthermore, a reactive deadlock avoidance method is designed to help robots escape, and prevent entry into, potential deadlock regions by adaptively modulating the stabilizing force. A generalized workflow is established to systematically address deadlocks across various multi-robot tasks. Simulation and hardware experiments are conducted on robots collaborating in dense environments to validate the framework's effectiveness in preventing task failures caused by deadlocks.
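The second property suggests a compact numerical deadlock indicator: test conical-hull membership with a nonnegative least-squares fit. The sign conventions and active-set selection below are our own assumptions:

    import numpy as np
    from scipy.optimize import nnls

    def in_conical_hull(f_stab, normals, tol=1e-6):
        # Is the negated stabilizing force a nonnegative combination of the
        # active safety force directions? If so, the forces can cancel and
        # the configuration is a deadlock candidate.
        A = np.column_stack(normals)   # one column per active safety force
        coeffs, residual = nnls(A, -np.asarray(f_stab, dtype=float))
        return residual < tol

A near-zero NNLS residual means the equilibrium condition can be met with nonnegative multipliers, so the robot should trigger its escape behaviour before converging to that point.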
|
| |
| 09:00-10:30, Paper WeI1I.416 | Add to My Program |
| DIBLF-Based Adaptive Optimal Constrained Control for Collaborative Robots under Different Human-Robot Interactive Tasks (I) |
|
| Wei, Yan | Zhejiang University of Technology |
| Feng, Yu | Zhejiang University of Technology |
| Ou, Linlin | Zhejiang University of Technology |
| Wang, Yueying | Shanghai University |
| Yu, Xinyi | Zhejiang University of Technology |
| |
| 09:00-10:30, Paper WeI1I.417 | Add to My Program |
| Beyond Frame-Wise Tracking: A Trajectory-Based Paradigm for Efficient Point Cloud Tracking |
|
| Fan, BaiChen | Nanjing University of Posts and Telecommunications |
| Cui, Yuanxi | Shanghai Jiao Tong University |
| Li, Jian | Nanjing University of Posts and Telecommunications |
| Wang, Qin | Nanjing University of Posts and Telecommunications |
| Zhao, Shibo | Carnegie Mellon University |
| Cao, Muqing | Carnegie Mellon University |
| Zhou, Sifan | Southeast University |
Keywords: Visual Tracking, Deep Learning for Visual Perception, Computer Vision for Transportation
Abstract: LiDAR-based 3D single object tracking (3D SOT) is a critical task in robotics and autonomous systems. Existing methods typically follow frame-wise motion estimation or a sequence-based paradigm. However, the two-frame methods are efficient but lack long-term temporal context, making them vulnerable in sparse or occluded scenes, while sequence-based methods that process multiple point clouds gain robustness at a significant computational cost. To resolve this dilemma, we propose a novel trajectory-based paradigm and its instantiation, TrajTrack. TrajTrack is a lightweight framework that enhances a base two-frame tracker by implicitly learning motion continuity from historical bounding box trajectories alone—without requiring additional, costly point cloud inputs. It first generates a fast, explicit motion proposal and then uses an implicit motion modeling module to predict the future trajectory, which in turn refines and corrects the initial proposal. Extensive experiments on the large-scale NuScenes benchmark show that TrajTrack achieves new state-of-the-art performance, dramatically improving tracking precision by 4.48% over a strong baseline while running at 56 FPS. We also demonstrate the strong generalizability of TrajTrack across different base trackers. Code will be available.
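The fast explicit proposal can be as simple as constant-velocity extrapolation of the box-trajectory history, which the learned implicit module then refines (a toy sketch, not TrajTrack's actual proposal):

    import numpy as np

    def explicit_motion_proposal(past_centers, past_yaws):
        # Constant-velocity / constant-turn extrapolation from the last two
        # tracked boxes; past_centers: (T, 3) array, past_yaws: (T,) array.
        v = past_centers[-1] - past_centers[-2]
        w = past_yaws[-1] - past_yaws[-2]
        return past_centers[-1] + v, past_yaws[-1] + w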
|
| |
| 09:00-10:30, Paper WeI1I.418 | Add to My Program |
| Online Dynamic SLAM with Incremental Smoothing and Mapping |
|
| Morris, Jesse | University of Sydney |
| Wang, Yiduo | University of Sydney |
| Ila, Viorela | The University of Sydney |
Keywords: SLAM, RGB-D Perception, Localization
Abstract: Dynamic SLAM methods jointly estimate the static and dynamic scene components. However, existing approaches, while accurate, are computationally expensive and unsuitable for online applications. In this work, we present a novel factor-graph formulation and system architecture for Dynamic SLAM that inherently supports incremental optimisation and online estimation. This represents the first formulation explicitly designed to leverage incremental inference methods in the dynamic setting. On multiple datasets, we demonstrate that our method achieves camera pose and object motion accuracy equal to or better than the state of the art. We further analyse the structural properties of our approach to demonstrate its scalability and provide insight regarding the challenges of solving Dynamic SLAM incrementally. Finally, we show that our formulation leads to problem structure well-suited to incremental solvers, and our system architecture further enhances performance, achieving a 5x speed-up over existing methods. Code is open-sourced.
|
| |
| 09:00-10:30, Paper WeI1I.419 | Add to My Program |
| Self-Organised Sequential Multi-Agent Reinforcement Learning for Closely Cooperative Tasks |
|
| Fu, Hao | Tongji University |
| You, Mingyu | Tongji University |
| Zhou, Hongjun | Tongji University |
| He, Bin | Tongji University |
Keywords: Reinforcement Learning, Multi-Robot Systems, Cooperating Robots
Abstract: Cooperative tasks are common in multi-agent systems; closely cooperative tasks are a special case in which a change in the state of the environment requires multiple agents to perform a specific operation at the same time. Take a box-pushing task as an example: the box is heavy and requires multiple agents to push it simultaneously. In a closely cooperative task, each agent's optimal action is correlated with the actions of the other agents, so the individually optimal action may be inconsistent with the group-optimal action. This gives rise to more non-globally-optimal Nash equilibrium policies and makes it easier for policies learned by reinforcement learning to fall into these local optima. In this paper, we propose a self-organised sequential multi-agent reinforcement learning algorithm (SOS-MARL). We propose sequential decision-making to change the optimization objective of each agent's policy so that the learned policy tends toward the group-optimal policy, and we propose an automatic grouping mechanism that makes training and inference smoother in large-scale agent environments. We decompose the joint action-value factorization outside the group into a combination of per-group action values, thus guiding the agents to improve their group policies in a fine-grained manner. We deploy scenarios in both simulated and real environments and compare SOS-MARL with various classical MARL algorithms on box-pushing tasks, demonstrating the state-of-the-art performance of our method.
|
| |
| 09:00-10:30, Paper WeI1I.420 | Add to My Program |
| TacFlex: Multi-Mode Tactile Imprints Simulation for Visuotactile Sensors with Coating Patterns |
|
| Zhang, Chaofan | Institute of Automation, Chinese Academy of Sciences |
| Cui, Shaowei | Institute of Automation, Chinese Academy of Sciences |
| Hu, Jingyi | University of Chinese Academy of Sciences |
| Jiang, Tianyu | Institute of Automation, Chinese Academy of Sciences |
| Zhang, Tiandong | Institute of Automation, Chinese Academy of Sciences |
| Wang, Rui | Institute of Automation, Chinese Academy of Sciences |
| Wang, Shuo | Chinese Academy of Sciences |
Keywords: Force and Tactile Sensing, Dexterous Manipulation, Deep Learning in Robotics and Automation, Visuotactile simulation
Abstract: Visuotactile sensors can provide rich contact information for robots. However, how to build a high-fidelity visuotactile simulator that supports multi-mode tactile imprints and various sensor configurations remains a challenging problem. In this paper, we present TacFlex, a flexible simulator for visuotactile sensors, which physically simulates the elastomer deformation using Finite Element Methods, and focuses on linking the deformed elastomer mesh to diverse tactile imprints, including tactile images with arbitrary coating patterns and tactile 3D point clouds. We further propose a ray tracing-based rectification method to deal with multi-medium refraction effects to make the simulated tactile images more realistic. Extensive experiments are conducted to show the effectiveness of TacFlex on various sensors. Furthermore, we explore the Sim2Real performance of different tactile imprints provided by TacFlex in tactile perception and manipulation tasks, such as cylindrical object pose estimation and peg-in-hole. The perception/policy models trained in simulation are successfully deployed in real world. Finally, we discuss TacFlex's potential in robot learning.
|
| |
| 09:00-10:30, Paper WeI1I.421 | Add to My Program |
| Polarization-Controlled Microwave-Actuated Continuum Robot |
|
| Li, Yongze | Soochow University |
| Gao, He | Soochow University |
| Xing, Zhiguang | Harbin Institute of Technology, Weihai |
| Li, Xuan | Soochow University |
| Chen, Tao | Soochow University |
Keywords: Actuation and Joint Mechanisms, Soft Robot Applications, Modeling, Control, and Learning for Soft Robots
Abstract: In industrial and medical environments, robotic manipulation frequently encounters the dual challenges of spatially constrained workspaces and visual obstructions. Wireless actuation methodologies, leveraging non-contact energy transmission and adaptive control mechanisms, offer an innovative solution to address the limitations of physical interconnections and enhance operational adaptability in structurally confined and uncertain obstructed environments. Here, we implement a non-contact continuum robotic arm based on microwave driving with polarization-directed guidance. The system incorporates customized flexible printed circuit (FPC) antennas and spatial microwave energy field modulation to enable wireless actuation and multi-degree-of-freedom (DOF) motion control of the robotic end-effector. By integrating spring-supported mechanisms and shape memory alloy (SMA) spring deformation responses, it accomplishes tasks such as obstacle penetration, component grasping, and retrieval—all without physical connections. This study demonstrates precise coupling between directionally controlled microwave energy and mechanical motion in obscured settings, offering a novel approach for non-contact, multi-DOF, and multi-structure robotic operations in sealed or unstructured environments. The proposed methodology significantly expands the potential applications and operational capabilities of robots in complex real-world conditions.
|
| |
| 09:00-10:30, Paper WeI1I.422 | Add to My Program |
| AINav: Large Language Model-Based Adaptive Interactive Navigation |
|
| Zhou, Kangjie | Peking University |
| Mu, Yao | The University of Hong Kong |
| Song, Haoyang | Peking University |
| Zeng, Yi | Peking University |
| Wu, Pengying | Peking University |
| Gao, Han | Peking University |
| Liu, Chang | Peking University |
Keywords: Task and Motion Planning, Reinforcement Learning, Legged Robots
Abstract: Robotic navigation in complex environments remains a critical research challenge. Traditional navigation focuses on optimal trajectory generation within free space, struggling in environments lacking viable paths to the goal, such as disaster zones or cluttered warehouses. To address this gap, we propose an adaptive interactive navigation approach that proactively interacts with environments to create feasible paths to reach unavailable goals. Specifically, we present a primitive tree for task planning with large language models (LLMs), facilitating effective reasoning to determine interaction objects and sequences. For subtask execution, we adopt reinforcement learning to pre-train a skill library containing versatile locomotion and interaction behaviors. Furthermore, we introduce an adaptive replanning method featuring two LLM-based modules: an advisor serving as a flexible replanning trigger and an arborist for autonomous plan adjustment. Integrated with the tree structure, the replanning mechanism allows for rapid plan modification in unknown environments. Comprehensive simulations and experiments have demonstrated our method's effectiveness and adaptivity in diverse scenarios.
|
| |
| 09:00-10:30, Paper WeI1I.423 | Add to My Program |
| ERPoT: Effective and Reliable Pose Tracking for Mobile Robots Using Lightweight Polygon Maps |
|
| Gao, Haiming | Zhejiang University |
| Qiu, Qibo | Zhejiang University |
| Liu, Hongyan | University of Chinese Academy of Sciences |
| Liang, Dingkun | Zhejiang University of Technology |
| Wang, Chaoqun | Shandong University |
| Zhang, Xuebo | Nankai University |
Keywords: Localization, Wheeled Robots, Mapping, Point-polygon Matching
Abstract: This paper presents an effective and reliable pose tracking solution, termed ERPoT, for mobile robots operating in large-scale outdoor and challenging indoor environments, underpinned by an innovative prior polygon map. In particular, to overcome the challenge that arises as the map size grows with the expansion of the environment, a novel form of prior map composed of multiple polygons is proposed. Benefiting from the use of polygons to concisely and accurately depict environmental occupancy, the prior polygon map achieves long-term reliable pose tracking while ensuring a compact form. More importantly, pose tracking is carried out in pure LiDAR mode, and the dense 3D point cloud is transformed into a sparse 2D scan through ground removal and obstacle selection. On this basis, a novel cost function for pose estimation through point-polygon matching is introduced, encompassing two distinct constraint forms: point-to-vertex and point-to-edge. In this study, our primary focus lies on two crucial aspects: lightweight and compact prior map construction, and effective and reliable robot pose tracking. Both aspects serve as the foundational pillars for future navigation across diverse mobile platforms equipped with different LiDAR sensors in varied environments. Comparative experiments based on publicly available datasets and our self-recorded datasets are conducted, and the evaluation results show the superior performance of ERPoT in terms of reliability, prior map size, pose estimation error, and runtime over the other six approaches. The corresponding code can be accessed at https://github.com/ghm0819/ERPoT, and the supplementary video is at https://youtu.be/6XdcXyUrLKw.
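The two constraint forms collapse into one routine if the projection onto each edge is clamped to the segment: a clamped projection that lands on an endpoint is exactly the point-to-vertex case. A simplified stand-in for such a point-polygon cost is:

    import numpy as np

    def point_to_edge_distance(p, a, b):
        # Distance from 2D scan point p to segment ab; clamping t to [0, 1]
        # makes t == 0 or t == 1 degenerate to the point-to-vertex case.
        ab, ap = b - a, p - a
        t = np.clip(ap @ ab / (ab @ ab), 0.0, 1.0)
        return np.linalg.norm(p - (a + t * ab))

    def matching_cost(scan, polygon):
        # Sum of nearest-edge distances of a 2D scan to one closed polygon
        # (scan: iterable of (2,) points; polygon: (V, 2) vertex array).
        edges = list(zip(polygon, np.roll(polygon, -1, axis=0)))
        return sum(min(point_to_edge_distance(p, a, b) for a, b in edges)
                   for p in scan)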
|
| |
| 09:00-10:30, Paper WeI1I.424 | Add to My Program |
| Sequentially Teaching Sequential Tasks (ST)²: Teaching Robots Long-Horizon Manipulation Skills |
|
| Ajanovic, Zlatan | RWTH Aachen University |
| Prakash, Ravi | Indian Institute of Science |
| de Souza Rosa, Leandro | Alma Mater Studiorum Università Di Bologna |
| Kober, Jens | TU Delft |
Keywords: AI-Enabled Robotics, Human-Centered Automation, Imitation Learning
Abstract: Learning from demonstration has proved itself useful for teaching robots complex skills with high sample efficiency. However, teaching long-horizon tasks with multiple skills is challenging, as deviations tend to accumulate, the distributional shift becomes more evident, and human teachers become fatigued over time, thereby increasing the likelihood of failure. To address these challenges, we introduce (ST)², a sequential method for learning long-horizon manipulation tasks that allows users to control the teaching flow by specifying key points, enabling structured and incremental demonstrations. Using this framework, we study how users respond to two teaching paradigms: (i) a traditional monolithic approach, in which users demonstrate the entire task trajectory at once, and (ii) a sequential approach, in which the task is segmented and demonstrated step by step. We conducted an extensive user study on the restocking task with 16 participants in a realistic retail store environment, evaluating user preferences and the effectiveness of the methods. User-level analysis showed superior performance for the sequential approach in most cases (10 users), compared with the monolithic approach (5 users), with one tie. Our subjective results indicate that some teachers prefer sequential teaching---as it allows them to teach complicated tasks iteratively---while others prefer teaching in one go due to its simplicity.
|
| |
| 09:00-10:30, Paper WeI1I.425 | Add to My Program |
| Receding Horizon Control for Signal Temporal Logic Using Robustness-Conserving Partial Formula Evaluation |
|
| Ilyes, Roland | University of Oxford |
| Brudermüller, Lara | University of Oxford |
| Hawes, Nick | University of Oxford |
| Lacerda, Bruno | University of Oxford |
Keywords: Formal Methods in Robotics and Automation, Hybrid Logical/Dynamical Planning and Verification, Optimization and Optimal Control
Abstract: We present a bounded-memory receding horizon approach to robot control for complex specifications in dynamic environments. We use Signal Temporal Logic, a logic that quantifies how robustly trajectories satisfy the specification, to specify robot behavior. To handle unbounded specifications, we consider a short planning horizon, only searching for nonviolating trajectories. We identify the subset of Signal Temporal Logic for which this approach needs only a bounded memory of the past, and leverage syntactic separation to summarize the robust satisfaction of the trajectory as it evolves. We implement our approach using receding horizon control in dynamic environments. We demonstrate the effectiveness and scalability of our approach compared to the state-of-the-art approach in several case studies.
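The bounded-memory summarisation can be illustrated on a toy fragment: 'always' robustness needs one running scalar, and a time-bounded 'once within the last k steps' needs a window of k samples. The paper's syntactic-separation machinery covers a richer STL fragment than this sketch:

    from collections import deque

    class BoundedRobustnessMonitor:
        # Running robustness summary of "always (mu >= 0)" and
        # "once-within-the-last-k-steps (nu >= 0)" with O(k) memory.
        def __init__(self, k):
            self.always_rob = float("inf")   # min over the whole past
            self.window = deque(maxlen=k)    # last k robustness samples of nu

        def step(self, mu_val, nu_val):
            self.always_rob = min(self.always_rob, mu_val)
            self.window.append(nu_val)
            once_rob = max(self.window)      # robustness of the bounded "once"
            return self.always_rob, once_rob

A receding horizon controller can then treat these two scalars as the sufficient statistics of the trajectory so far, instead of replaying the full signal history at every planning step.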
|
| |
| 09:00-10:30, Paper WeI1I.426 | Add to My Program |
| Degradation-Aware LiDAR-Thermal-Inertial SLAM |
|
| Wang, Yu | University of Science and Technology of China |
| Liu, Yufeng | Nanyang Technological University |
| Chen, Lingxu | Harbin Institute of Technology |
| Chen, Haoyao | Harbin Institute of Technology, Shenzhen |
| Zhang, Shiwu | University of Science and Technology of China |
Keywords: SLAM, Search and Rescue Robots
Abstract: During robotic disaster relief missions, state estimation still faces significant challenges, especially when GNSS is denied or sensor perception undergoes degradation. In this paper, we introduce a degradation-aware LiDAR-Thermal-Inertial SLAM, DaLiTI, that leverages the complementary nature of multi-modal information to achieve robust and precise state estimation in perceptually challenging environments. The system utilizes an iterated error state Kalman filter (IESKF) to loosely integrate LiDAR, thermal infrared camera, and IMU measurements. We propose an adaptive fusion mechanism that dynamically weights and fuses LiDAR and thermal measurements based on real-time modal quality to prevent failure information from propagating throughout the system. Experimental results demonstrate that, compared with state-of-the-art methods, DaLiTI maintains competitive performance in conventional environments and exhibits superior robustness and accuracy in degraded scenarios such as fire scenes or chemical plants with gas leaks. Our implementation is available at https://github.com/HITSZ-NRSL/DaLiTI.
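A loosely coupled, quality-weighted fusion step of the kind the abstract describes might look like the toy sketch below; the scalar quality scores and the convex weighting are illustrative stand-ins of ours, not DaLiTI's IESKF update.

```python
import numpy as np

def fuse_updates(x_pred, z_lidar, z_thermal, q_lidar, q_thermal):
    """Quality-weighted fusion of two modal corrections. A failed
    modality (quality -> 0) contributes nothing, so its errors
    cannot propagate through the state estimate."""
    w = np.array([q_lidar, q_thermal], float)
    w = w / max(w.sum(), 1e-9)
    return x_pred + w[0] * (z_lidar - x_pred) + w[1] * (z_thermal - x_pred)

# Smoke-filled scene: LiDAR degraded (low quality), thermal trusted.
print(fuse_updates(np.zeros(3), np.array([0.5, 0.0, 0.0]),
                   np.array([0.1, 0.02, 0.0]), q_lidar=0.1, q_thermal=0.9))
```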
|
| |
| 09:00-10:30, Paper WeI1I.427 | Add to My Program |
| LPAC: Learnable Perception-Action-Communication Loops with Applications to Coverage Control |
|
| Agarwal, Saurav | Indian Institute of Technology Bombay |
| Muthukrishnan, Ramya | Massachusetts Institute of Technology |
| Gosrich, Walker | University of Pennsylvania |
| Kumar, Vijay | University of Pennsylvania |
| Ribeiro, Alejandro | University of Pennsylvania |
Keywords: Multi-Robot Systems, Swarms, Deep Learning in Robotics and Automation, Graph Neural Networks
Abstract: Coverage control is the problem of navigating a robot swarm to collaboratively monitor features or a phenomenon of interest not known a priori. The problem is challenging in decentralized settings with robots that have limited communication and sensing capabilities. We propose a learnable Perception-Action-Communication (LPAC) architecture for the problem, wherein a convolutional neural network (CNN) processes localized perception; a graph neural network (GNN) facilitates robot communications; finally, a shallow multi-layer perceptron (MLP) computes robot actions. The GNN enables collaboration in the robot swarm by computing what information to communicate with nearby robots and how to incorporate received information. Evaluations show that the LPAC models---trained using imitation learning---outperform standard decentralized and centralized coverage control algorithms. The learned policy generalizes to environments different from the training dataset, transfers to larger environments with more robots, and is robust to noisy position estimates. The results indicate the suitability of LPAC architectures for decentralized navigation in robot swarms to achieve collaborative behavior.
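The perception-action-communication split can be illustrated with a single numpy message-passing layer: per-robot features (standing in for the CNN output) are mixed with neighbor features over the communication graph, then mapped to actions. The weights, graph, and feature sizes below are arbitrary stand-ins, not the trained LPAC model.

```python
import numpy as np

rng = np.random.default_rng(0)

def gnn_layer(X, A, W_self, W_nbr):
    """One aggregation step: each robot mixes its own feature with the
    mean of its communication neighbors' features, then applies ReLU."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)
    H = X @ W_self.T + (A @ X / deg) @ W_nbr.T
    return np.maximum(H, 0.0)

# 4 robots, 8-dim perception features, a line communication graph,
# and a linear action head standing in for the shallow MLP.
X = rng.normal(size=(4, 8))
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
W1, W2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
W_act = rng.normal(size=(2, 8))
actions = gnn_layer(X, A, W1, W2) @ W_act.T  # per-robot 2D velocity command
print(actions.shape)  # (4, 2)
```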
|
| |
| 09:00-10:30, Paper WeI1I.428 | Add to My Program |
| 360DVO: Deep Visual Odometry for Monocular 360-Degree Camera |
|
| Guo, Xiaopeng | The Hong Kong University of Science and Technology |
| Xu, Yinzhe | The Hong Kong University of Science and Technology |
| Huang, Huajian | The Hong Kong University of Science and Technology |
| Yeung, Sai-Kit | Hong Kong University of Science and Technology |
Keywords: SLAM, Omnidirectional Vision, Data Sets for SLAM
Abstract: Monocular omnidirectional visual odometry (OVO) systems leverage 360-degree cameras to overcome field-of-view limitations of perspective VO systems. However, existing methods, reliant on handcrafted features or photometric objectives, often lack robustness in challenging scenarios, such as aggressive motion and varying illumination. To address this, we present 360DVO, the first deep learning-based OVO framework. Our approach introduces a distortion-aware spherical feature extractor (DAS-Feat) that adaptively learns distortion-resistant features from 360-degree images. These sparse feature patches are then used to establish constraints for effective pose estimation within a novel omnidirectional differentiable bundle adjustment (ODBA) module. To facilitate evaluation in realistic settings, we also contribute a new real-world OVO benchmark. Extensive experiments on this benchmark and public synthetic datasets (TartanAir V2 and 360VO) demonstrate that 360DVO surpasses state-of-the-art baselines (including 360VO and OpenVSLAM), improving robustness by 50% and accuracy by 37.5%. Homepage: https://360dvo.hkustvgd.com
|
| |
| 09:00-10:30, Paper WeI1I.429 | Add to My Program |
| To Harvest or Not to Harvest: Mapping and Decision-Making for a Selective Table Grape Harvesting Robot |
|
| Beumer, Ruben | Eindhoven University of Technology |
| Saraceni, Leonardo | Sapienza University of Rome |
| Nardi, Daniele | Sapienza University of Rome |
| Antunes, Duarte | Eindhoven University of Technology |
| Molengraft van de, René | Eindhoven University of Technology |
| Ciarfuglia, Thomas | San Raffaele University of Rome |
Keywords: Robotics and Automation in Agriculture and Forestry, Planning under Uncertainty, Visual Tracking
Abstract: This letter focuses on robotic harvesting of delicate crops such as table grapes, featuring selective harvesting based on individual product properties. The robot detects grape bunches and estimates their positions and quality attributes. However, sensor limitations and occlusions affect data completeness and accuracy, reducing the cost-effectiveness of automated harvesting systems. Determining the optimal harvesting order in real time under uncertainty is therefore important for enhancing efficiency and grape quality for growers and consumers. This task is challenging not only due to data uncertainty, but also due to the need to consider factors such as obstructive low-quality bunches. Existing literature often resorts to sub-optimal approaches such as selecting the first available crop. In contrast, we propose (i) a mapping and tracking method based on multiple viewpoints to enhance bunch information quality and (ii) a recursive decision-tree algorithm, built on a reachability graph derived from the map, that sequentially optimizes harvested quality and execution time.
|
| |
| 09:00-10:30, Paper WeI1I.430 | Add to My Program |
| Soft 3D-Printed Endoskeleton for Precise Tendon Routing in Soft Robotics |
|
| Solfiti, Emanuele | IIT - Fondazione Istituto Italiano Di Tecnologia |
| Mondini, Alessio | Istituto Italiano Di Tecnologia |
| Del Dottore, Emanuela | University of Southern Denmark |
| Mazzolai, Barbara | Istituto Italiano Di Tecnologia |
| Parmiggiani, Alberto | Fondazione Istituto Italiano Di Tecnologia (IIT) |
Keywords: Soft Robot Materials and Design, Tendon/Wire Mechanism, Soft Sensors and Actuators
Abstract: This paper presents the design, development, and testing of a soft 3D-printed endoskeleton for arbitrary cable routing in tendon-driven soft actuators. The endoskeleton is embedded in a silicone body, and it is fixed to the mold prior to the casting process. It enables tendons to be placed through predefined eyelets, ensuring accurate positioning within the soft body. To minimize its impact on the overall stiffness of the soft body, the endoskeleton was designed with a slim profile and flexible connections, and fabricated using a 3D-printable elastic material (Shore A hardness 50), selected to roughly match the mechanical properties of the surrounding silicone matrix (typically with Shore 00 hardness 20–30). Although the reference geometry in this study is a cylindrical body, the design can be extended to a wide range of soft body shapes and sizes. Key features of the proposed solution include a 3D-printable guide for tendon routing that is (1) fully soft, (2) easy to place, (3) rapidly reconfigurable for arbitrary tendon paths, (4) adaptable to variable soft body geometries, and (5) easy to fabricate with single-step casting. The current work describes the design, manufacturing, simulation, and testing of a case study in which the endoskeleton is employed to reproduce a target pose predicted by FE analysis. The matching is satisfactory and demonstrates the effectiveness of the approach.
|
| |
| 09:00-10:30, Paper WeI1I.431 | Add to My Program |
| Triangle-Decomposable Graphs for Isoperimetric Robots |
|
| Usevitch, Nathan | Brigham Young University |
| Weaver, Isaac | Brigham Young University |
| Usevitch, James | Brigham Young University |
Keywords: Soft Robot Applications, Modeling, Control, and Learning for Soft Robots, Cellular and Modular Robots
Abstract: Isoperimetric robots are large-scale, untethered inflatable robots that can undergo large shape changes, but have so far been demonstrated in only one 3D shape: an octahedron. These robots consist of independent triangles that change shape while maintaining their perimeter by moving the relative positions of their joints. We introduce an optimization routine that determines whether an arbitrary graph can be partitioned into unique triangles, and thus be constructed as an isoperimetric robotic system. We enumerate all minimally rigid graphs that can be constructed with unique triangles up to 9 nodes (7 triangles), and characterize the workspace of one node of each of these robots. We also present a method for constructing larger graphs that can be partitioned by assembling subgraphs that are already partitioned into triangles. This enables a wide variety of isoperimetric robot configurations.
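The core combinatorial question, whether an edge set can be partitioned into edge-disjoint triangles, can be checked with a small backtracking search; the sketch below is our own illustration, not the authors' optimization routine, and the octahedron example mirrors the one shape demonstrated so far.

```python
def triangle_partition(edges):
    """Backtracking search for a partition of an edge set into
    edge-disjoint triangles; returns the triangle list or None."""
    edges = {frozenset(e) for e in edges}
    if len(edges) % 3:
        return None

    def solve(remaining):
        if not remaining:
            return []
        # The first uncovered edge must belong to some triangle, so
        # fixing it and trying every third vertex w is complete.
        u, v = sorted(next(iter(remaining)))
        for w in {x for e in remaining for x in e} - {u, v}:
            tri = [frozenset((u, v)), frozenset((v, w)), frozenset((u, w))]
            if all(e in remaining for e in tri):
                rest = solve(remaining - set(tri))
                if rest is not None:
                    return [(u, v, w)] + rest
        return None

    return solve(edges)

# The octahedron graph partitions into 4 edge-disjoint triangles.
octa = [(0, 2), (0, 3), (0, 4), (0, 5), (1, 2), (1, 3), (1, 4), (1, 5),
        (2, 4), (2, 5), (3, 4), (3, 5)]
print(triangle_partition(octa))
```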
|
| |
| 09:00-10:30, Paper WeI1I.432 | Add to My Program |
| MoRe-ERL: Learning Motion Residuals Using Episodic Reinforcement Learning |
|
| Huang, Xi | Karlsruhe Institute of Technology |
| Zhou, Hongyi | Karlsruhe Institute of Technology |
| Li, Ge | Karlsruhe Institute of Technology (KIT) |
| Tang, Yucheng | University of Applied Sciences Karlsruhe |
| Liao, Weiran | Karlsruhe Institute of Technology |
| Hein, Björn | Karlsruhe University of Applied Sciences |
| Asfour, Tamim | Karlsruhe Institute of Technology (KIT) |
| Lioutikov, Rudolf | Karlsruhe Institute of Technology |
Keywords: Motion and Path Planning, Reinforcement Learning, Integrated Planning and Learning
Abstract: We propose MoRe-ERL, a framework that combines Episodic Reinforcement Learning (ERL) and residual learning to refine preplanned reference trajectories into safe, feasible, and efficient task-specific trajectories. The framework is general enough to be incorporated seamlessly into arbitrary ERL methods and motion generators. MoRe-ERL identifies trajectory segments requiring modification while preserving critical task-related maneuvers, then generates smooth residual adjustments using B-Spline-based movement primitives to ensure adaptability to dynamic task contexts and smoothness in trajectory refinement. Experimental results demonstrate that residual learning significantly outperforms training from scratch using ERL methods, achieving superior sample efficiency and task performance. Hardware evaluations further validate the framework, showing that policies trained in simulation can be directly deployed on real-world systems with a minimal sim-to-real gap.
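A minimal sketch of the residual mechanism, assuming a clamped cubic B-spline whose zeroed boundary coefficients leave the start and end of the reference untouched; the coefficient values and sizes are illustrative, not the paper's movement primitives.

```python
import numpy as np

def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion for the i-th B-spline basis of degree k."""
    if k == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    out, d1, d2 = 0.0, knots[i + k] - knots[i], knots[i + k + 1] - knots[i + 1]
    if d1 > 0:
        out += (t - knots[i]) / d1 * bspline_basis(i, k - 1, t, knots)
    if d2 > 0:
        out += (knots[i + k + 1] - t) / d2 * bspline_basis(i + 1, k - 1, t, knots)
    return out

def residual_trajectory(reference, coeffs, degree=3):
    """Add a B-spline residual to a reference trajectory. With a clamped
    knot vector, zero boundary coefficients keep the endpoints fixed."""
    n = len(coeffs)
    knots = np.concatenate([np.zeros(degree),
                            np.linspace(0, 1, n - degree + 1),
                            np.ones(degree)])
    ts = np.linspace(0, 1, len(reference), endpoint=False)
    resid = np.array([sum(c * bspline_basis(i, degree, t, knots)
                          for i, c in enumerate(coeffs)) for t in ts])
    return reference + resid

ref = np.linspace(0.0, 1.0, 50)                  # preplanned joint trajectory
c = np.array([0.0, 0.0, 0.1, -0.05, 0.0, 0.0])   # residual weights (policy output)
print(residual_trajectory(ref, c)[:5])
```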
|
| |
| 09:00-10:30, Paper WeI1I.433 | Add to My Program |
| A Cable-Driven Soft Robotic Hand with an In-Hand RGB-D Camera for Dexterous Grasping and Manipulation |
|
| Zhou, Zhanfeng | University of Toronto |
| Zuo, Runze | University of Michigan |
| Du, Matthew | University of Toronto |
| Wang, Shaojia | University of Toronto |
| Levy, Sebastian | Apple Inc |
| Sun, Yu | University of Toronto |
| Liu, Xinyu | University of Toronto |
Keywords: Grasping, Grippers and Other End-Effectors, Soft Robot Applications, Biologically-Inspired Robots
Abstract: The aspiration to replicate the capabilities of the human hand has driven innovations in the design of soft robotic hands. Despite these advancements, many existing designs of soft hands still lack effective in-hand vision and the ability for each finger to achieve active multidegree-of-freedom motion. This article proposes a cable-driven soft robotic hand that can achieve dexterous grasping and manipulation, vision-guided grasping, vision-based slip detection and compensation, as well as visually servoed in-hand manipulation. The hand has five soft fingers, each capable of independent flexion/extension motion and bidirectional ad/abduction motion. A red–green–blue-depth (RGB-D) camera is integrated into the palm of the soft hand to enable in-hand vision capability. Modeling of the soft hand is established to analyze its kinematics, statics, and manipulability. A series of experiments are conducted to demonstrate its dexterous grasping and manipulation capabilities on a variety of objects. Using 3-D point cloud data from the in-palm camera, an effective vision-guided grasping strategy is developed to grasp objects on a table. The in-hand vision also enables slip detection and compensation during grasping to maintain grasp stability. Furthermore, a hierarchical, visually servoed controller is developed to perform closed-loop in-hand object manipulation. With its high dexterity and visual feedback capabilities, the soft hand will find important applications such as household object manipulation and food picking/sorting, and may also be used as a prosthetic hand or an auxiliary hand for humans.
|
| |
| 09:00-10:30, Paper WeI1I.434 | Add to My Program |
| Towards Simulation-Based Optimization of Compliant Fingers for High-Speed Connector Assembly |
|
| Hartisch, Richard Matthias | TU Berlin |
| Rother, Alexander Sebastian Marcel | Technische Universität Berlin |
| Krüger, Jörg | Fraunhofer Institute for Production Systems and DesignTechnology (IPK) |
| Haninger, Kevin | Fraunhofer IPK |
|
|
| |
| 09:00-10:30, Paper WeI1I.435 | Add to My Program |
| Efficiently Learning Robust Torque-Based Locomotion through Reinforcement with Model-Based Supervision |
|
| Yan, Yashuai | TU Wien |
| Egle, Tobias | TU Wien |
| Ott, Christian | TU Wien |
| Lee, Dongheui | TU Wien |
Keywords: Humanoid and Bipedal Locomotion, Reinforcement Learning
Abstract: We propose a control framework that integrates model-based bipedal locomotion with residual reinforcement learning (RL) to achieve robust and adaptive walking in the presence of real-world uncertainties. Our approach leverages a model-based controller—comprising a Divergent Component of Motion (DCM) trajectory planner and a whole-body controller—as a reliable base policy. To address the uncertainties of inaccurate dynamics modeling and sensor noise, we introduce a residual policy trained through RL with domain randomization. Crucially, we employ a model-based oracle policy, which has privileged access to ground-truth dynamics during training, to supervise the residual policy via a novel supervised loss. This supervision enables the policy to efficiently learn corrective behaviors that compensate for unmodeled effects without extensive reward shaping. Our method demonstrates improved robustness and generalization across a range of randomized conditions, offering a scalable solution for sim-to-real transfer in bipedal locomotion.
|
| |
| 09:00-10:30, Paper WeI1I.436 | Add to My Program |
| Estimation of the Caged Object's Posture under Forces Using Stepwise Geometric Calculations |
|
| Yokomura, Ryota | The University of Tokyo |
| Liu, Yutong | University of Tokyo |
| Fukui, Rui | The University of Tokyo |
Keywords: Contact Modeling, Compliant Assembly, Simulation and Animation
Abstract: In pin–hole assembly processes, precise alignment or compliance mechanisms are typically required. This paper proposes a method for connecting objects by utilizing caging to constrain their motion, enabling the insertion of a pin into a hole to adjust the allowable relative pose for assembly. This approach eliminates the need for force control, even with low-degree-of-freedom manipulators, and reduces deflection caused by misalignment during connection. Although previous research has extensively studied appropriate finger configurations for caging, the behavior of caged objects under external forces remains insufficiently investigated. Furthermore, when connecting caged objects by contact, pose estimation often requires complex collision computations that account for intricate object geometries, which are computationally expensive and may fail to converge when small gaps are present. To address this issue, we propose a geometric method that approximates pose changes of caged three-dimensional objects under external forces as rotations about contact points. As a representative case, we focus on objects composed of cuboid elements. The estimated results for simple objects, including caged objects with small clearances, were consistent with geometrically derived theoretical solutions and were obtained within 0.6 seconds, indicating a practical computation time.
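The approximation of a pose change as a rotation about a contact point can be illustrated with Rodrigues' formula; the cuboid vertices, contact point, and rotation axis below are hypothetical values of ours, not the paper's stepwise procedure.

```python
import numpy as np

def rotate_about_contact(points, contact, axis, angle):
    """Rigidly rotate object points by `angle` about an axis through a
    contact point (Rodrigues' formula), approximating the pose change
    of a caged object tipping under an external force."""
    k = axis / np.linalg.norm(axis)
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    R = np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)
    return (points - contact) @ R.T + contact

# Tip four cuboid vertices slightly about an edge contact at (1, 0, 0).
cube = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
print(rotate_about_contact(cube, np.array([1., 0, 0]),
                           np.array([0., 1, 0]), 0.1))
```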
|
| |
| 09:00-10:30, Paper WeI1I.437 | Add to My Program |
| DexFruit: Dexterous Manipulation and Gaussian Splatting Inspection of Fruit |
|
| Swann, Aiden | Stanford University |
| Qiu, Alex | Stanford University |
| Strong, Matthew | Stanford University |
| Zhang, Angelina | Stanford University |
| Morstein, Samuel | Stanford University |
| Rayle, Kai | Stanford University |
| Kennedy, Monroe | Stanford University |
Keywords: Dexterous Manipulation, Force and Tactile Sensing, Agricultural Automation
Abstract: DexFruit is a robotic manipulation framework that enables gentle, autonomous handling of fragile fruit and precise evaluation of damage. Soft fruits have long suffered produce loss in both the harvesting and post-harvesting processes due to their extreme fragility and susceptibility to bruising, making them one of the hardest produce types to manipulate with automation. In this work, we demonstrate that, by using optical tactile sensing, autonomous manipulation of fruit with minimal damage can be achieved. We show that our tactile-informed diffusion policies outperform baselines in both reduced bruising and pick-and-place success rate across three fruits: strawberries, tomatoes, and blackberries. In addition, we introduce FruitSplat, a novel technique to represent and quantify visual damage in a high-resolution 3D representation via 3D Gaussian Splatting (3DGS). Existing metrics for measuring damage lack quantitative rigor or require expensive equipment. With FruitSplat, we distill a 2D fruit mask as well as a 2D bruise segmentation mask into the 3DGS representation from just a web-cam video. Furthermore, this representation is modular and general, compatible with any relevant 2D model. Overall, we demonstrate a 92% grasping policy success rate, up to a 15% reduction in visual bruising, and up to a 31% improvement in grasp success rate on challenging fruit compared to our baselines across the three tested fruits. We rigorously evaluate these results with over 630 trials.
|
| |
| WeI1LB Late Breaking Results Session, Hall C |
Add to My Program |
| Late Breaking Results 3 |
|
| |
| |
| 09:00-10:30, Paper WeI1LB.1 | Add to My Program |
| HBRB-BoW: A Retrained Bag-Of-Words Vocabulary for ORB-SLAM Via Hierarchical BRB-KMeans |
|
| Lee, Minjae | Gyeongsang National University |
| Choi, Sang-Min | Gyeongsang National University |
| Kim, Gun-Woo | Gyeongsang National University, South Korea |
| Lee, Suwon | Gyeongsang National University |
Keywords: SLAM, Recognition, Localization
Abstract: In visual simultaneous localization and mapping (SLAM), the quality of the visual vocabulary is fundamental to the system's ability to represent environments and recognize locations. While ORB-SLAM is a widely used framework, its binary vocabulary, trained through the k-majority-based bag-of-words (BoW) approach, suffers from inherent precision loss. The inability of conventional binary clustering to represent subtle feature distributions leads to the degradation of visual words, a problem that is compounded as errors accumulate and propagate through the hierarchical tree structure. To address these structural deficiencies, this paper proposes hierarchical binary-to-real-and-back (HBRB)-BoW, a refined hierarchical binary vocabulary training algorithm. By integrating a global real-valued flow within the hierarchical clustering process, our method preserves high-fidelity descriptor information until the final binarization at the leaf nodes. Experimental results demonstrate that the proposed approach yields a more discriminative and well-structured vocabulary than traditional methods, significantly enhancing the representational integrity of the visual dictionary in complex environments. Furthermore, replacing the default ORB-SLAM vocabulary file with our HBRB-BoW file is expected to improve performance in loop closing and relocalization tasks.
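The binary-to-real-and-back idea at a single tree level might look like the following flat k-means sketch: real-valued centroids are kept throughout clustering and binarized only when emitting visual words. The hierarchy and the actual HBRB training details are omitted; function names and the toy descriptors are ours.

```python
import numpy as np

def brb_kmeans(D, k, iters=20, seed=0):
    """Binary-to-real-and-back k-means: cluster binary descriptors with
    real-valued centroids, binarizing only for the final visual words."""
    rng = np.random.default_rng(seed)
    C = D[rng.choice(len(D), k, replace=False)].astype(float)
    for _ in range(iters):
        # Real-valued centroids retain the subtle bit-frequency
        # information that a binary k-majority step would discard.
        dist = ((D[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for j in range(k):
            if (assign == j).any():
                C[j] = D[assign == j].mean(0)
    return (C >= 0.5).astype(np.uint8), C  # binary words + real centroids

rng = np.random.default_rng(1)
descriptors = (rng.random((200, 32)) > 0.5).astype(np.uint8)  # toy ORB bits
words, centroids = brb_kmeans(descriptors, k=4)
print(words.shape)  # (4, 32)
```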
|
| |
| 09:00-10:30, Paper WeI1LB.2 | Add to My Program |
| Unified Neural Gaussian SLAM with Feature Splatting |
|
| Tang, Xuyang | The Hong Kong Polytechnic University |
| Chu, Henry | The Hong Kong Polytechnic University |
| Sun, Yuxiang | City University of Hong Kong |
Keywords: SLAM, Localization, Mapping
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated impressive progress in high-fidelity scene reconstruction within visual SLAM. However, existing approaches often suffer from scene inconsistency, leading to visual artifacts, and the explicit maintenance of millions of Gaussians imposes significant storage overhead. To address these limitations, we present a unified Neural Gaussian SLAM with feature splatting, which represents the spatial scene as a coherent feature space while encoding view direction, distance, and position into neural Gaussians. Arbitrary image modalities, including color, depth, normals, semantics, and even language, can be decoded from this feature space. Extensive evaluations on several challenging datasets show that our method achieves state-of-the-art performance in rendering quality, reconstruction accuracy, and pose estimation.
|
| |
| 09:00-10:30, Paper WeI1LB.3 | Add to My Program |
| Indirect Adaptive Predictor Preview Control with Unknown Time Varying Input Delay and Parameter |
|
| Kwon, HyunBin | Kyungpook National University |
| Yi, Hak | Kyungpook National University |
Keywords: Robust/Adaptive Control
Abstract: This paper presents an indirect adaptive predictor–preview control architecture for continuous-time systems with unknown time-varying input delays and unknown (slowly varying) parameters. An adaptive super-twisting algorithm (STA) estimates the unknown delay online using a monotone ramp probe, and an indirect recursive least-squares (RLS) module tracks slow parameter variations; both feed a frozen-parameter predictor and a preview feedforward based on r(t + ĥ(t)). Nominal exponential tracking is shown under exact prediction, and a practical input-to-state stability (ISS) bound is derived that accounts for delay/parameter estimation errors, disturbances, and numerical approximation. On a DC motor speed-servo benchmark, the controller reduces steady-state RMSE/peak error to 0.046/0.074 (S1) and 0.062/0.099 (S4), below all compared baselines.
|
| |
| 09:00-10:30, Paper WeI1LB.4 | Add to My Program |
| Vibration-Resilient LiDAR-Inertial Odometry with External Disturbance Compensation for Quadruped Robots |
|
| Hoang, Quoc Hung | Chungbuk National University |
| Kim, Gon-Woo | Chungbuk National University |
Keywords: SLAM, Legged Robots, Localization
Abstract: This work presents a tightly coupled LiDAR–inertial odometry (LIO) framework tailored for quadruped robots operating under vibration and fluctuating conditions. By integrating time delay estimation (TDE) into an error-state Kalman filter (ESKF), external disturbances affecting the IMU are explicitly estimated and compensated during IMU pre-integration, significantly reducing vibration-induced errors. The resulting refined IMU poses are further used to correct LiDAR motion distortion, enabling a unified refinement process. This leads to smoother trajectories, improved localization accuracy, and enhanced robustness against both environmental and sensor uncertainties. The proposed framework is validated through real-time deployment on a quadruped robot.
|
| |
| 09:00-10:30, Paper WeI1LB.5 | Add to My Program |
| Motion-Robust Bird’s-Eye-View Mapping for Self-Balancing Exoskeletons Using RGB-D Sensing |
|
| Leisiazar, Sahar | Simon Fraser University |
| Peykari, Behzad | Human in Motion |
| Arzanpour, Siamak | Simon Fraser University |
| Najafi, Farshid | Simon Fraser University |
| Park, Edward J. | Simon Fraser University |
| |
| 09:00-10:30, Paper WeI1LB.6 | Add to My Program |
| Highly Flexible Task Planner for Robots in Dynamic Environments |
|
| Guzman-Merino, Miguel | University of Rostock |
| Plönnigs, Jörn | University of Rostock |
Keywords: Robotics and Automation in Construction, Task Planning, Reinforcement Learning
Abstract: Construction sites are highly non-deterministic environments under constant change. Complex tasks are usually required in these scenarios, and multi-agent systems have proven to be a flexible way to solve them. Nevertheless, the uncertainty in these environments often makes the available information inaccurate, incomplete, or difficult to integrate into multi-agent systems. To successfully automate complex processes in construction environments, it is necessary to overcome the barrier imposed by the lack of accurate information. The research challenge addressed here is the coordination of multi-agent systems in non-deterministic environments. We propose a Multi-Agent Proximal Policy Optimization (MAPPO) system to create the necessary flexible framework. Policy networks associated with different types of agents are trained over different scenarios, and different teams of agents are formed during the training process. This approach is intended to yield a framework able to command different teams of agents independently of the constraints imposed by the available environmental information.
|
| |
| 09:00-10:30, Paper WeI1LB.7 | Add to My Program |
| Hierarchical LLM-VLA-Controller Integration for Task Generalization |
|
| Choi, Inhyuk | Kyungpook National University, School of Electronics Engineering |
| Lee, Sangmoon | Kyungpook National University |
Keywords: Task Planning, Imitation Learning, Autonomous Agents
Abstract: Vision-Language-Action (VLA) models often struggle with generalization due to their tendency to memorize training data rather than understand task semantics. This paper proposes a hierarchical framework that integrates Large Language Models (LLMs) with VLA models to overcome these limitations. By leveraging GPT-4o as a high-level planner, our system decomposes complex instructions into atomic subtasks executable by a low-level VLA. We introduce a “Home Pose Controller” between subtasks to ensure physical stability. Experimental results on the LIBERO-10 benchmark demonstrate that our approach achieves a 90% success rate on decomposable tasks, significantly outperforming the 9% baseline of the standalone VLA model.
|
| |
| 09:00-10:30, Paper WeI1LB.8 | Add to My Program |
| BIM-Informed Visual SLAM for Construction Monitoring |
|
| Bikandi-Noya, Asier | University of Luxembourg |
| Fernandez-Cortizas, Miguel | University of Luxembourg |
| Shaheer, Muhammad | University of Luxembourg |
| Tourani, Ali | University of Luxembourg |
| Voos, Holger | University of Luxembourg |
| Sanchez-Lopez, Jose Luis | University of Luxembourg |
Keywords: SLAM, Robotics and Automation in Construction, RGB-D Perception
Abstract: Monitoring construction sites requires comparing the as-planned design with the as-built state in real time. Visual SLAM offers a lightweight solution but is prone to trajectory drift in construction environments due to repetitive layouts, textureless surfaces, and occlusions. We augment an existing visual SLAM system with structural priors from the Building Information Model (BIM), associating detected walls with their BIM counterparts and including these correspondences as geometric constraints in the back-end optimization. The system operates in real time and is validated on multiple real construction sites, achieving 25.23% average trajectory error reduction and 7.14% map accuracy improvement over state-of-the-art baselines, with demonstrated resilience to incomplete BIM data and as-planned/as-built discrepancies.
|
| |
| 09:00-10:30, Paper WeI1LB.9 | Add to My Program |
| Innovative Rail–Air Multi-Robot System for Intelligent and Repeatable Bridge Visual Inspection Tasks |
|
| Jose, Nahuel | Istituto Italiano di Tecnologia |
| D'Imperio, Mariapaola | Istituto Italiano di Tecnologia |
| Mancini, Adriano | Università Politecnica delle Marche |
| Scaro, Agustina | Istituto Italiano di Tecnologia |
| Marchello, Gabriele | Istituto Italiano di Tecnologia |
| Galdelli, Alessandro | Università Politecnica delle Marche |
| Frontoni, Emanuele | Università Politecnica delle Marche |
| Cannella, Ferdinando | Istituto Italiano di Tecnologia |
| |
| 09:00-10:30, Paper WeI1LB.10 | Add to My Program |
| A MARL Approach for Connectivity-Aware Search and Rescue in Urban Environments |
|
| Meseguer Valenzuela, Andrés | Instituto Tecnológico De Informática |
Keywords: Multi-Robot Systems, Cooperating Robots, Networked Robots
Abstract: This work presents a closed-loop experimental framework for connectivity-aware urban search and rescue (SAR) using heterogeneous unmanned ground vehicles (UGVs) and unmanned aerial vehicles (UAVs). The setup couples a physics-based urban digital twin in NVIDIA Isaac Sim with Robot Operating System 2 (ROS2) orchestration, a Proximal Policy Optimization (PPO) multi-agent reinforcement learning (MARL) controller, and a fifth-generation (5G) link evaluation pipeline based on ns-3/5G-LENA key performance indicators (KPIs). Two UGVs execute mission-directed navigation toward a hazard region, while two UAV relays and a gNB-like aerial anchor adapt their positions to sustain end-to-end service under line-of-sight and non-line-of-sight transitions induced by urban occlusions. Preliminary simulation results validate end-to-end operability and provide quantitative evidence of simultaneous mission progress and network continuity. Across a representative episode, the minimum distance to the hazard-region center decreases from 27.9 m to 1.55 m (final 1.80 m), while latency remains in a low regime (mean 4.88 ms, p95 8.17 ms). Packet loss is bounded (mean 3.5% and 2.2% for the two UGVs), and outages are sparse (101 steps over 9000), even during partial traversal of building-dense areas. The platform enables systematic diagnosis of mobility–connectivity coupling and supports transfer-oriented refinement of relay control and coordination policies.
|
| |
| 09:00-10:30, Paper WeI1LB.11 | Add to My Program |
| A Multi-Inlet Extrusion System for Closed-Loop Spatial Profile Control in Large-Format Additive Manufacturing |
|
| Coronado Preciado, Angelica | King Abdullah University of Science and Technology |
| Parrott, Brian | King Abdullah University of Science and Technology |
| Feron, Eric | King Abdullah University of Science and Technology |
Keywords: Robotics and Automation in Construction, Mechanism Design, Process Control
Abstract: Rectangular nozzles are attractive for large-format additive manufacturing (LFAM) due to their improved deposition efficiency. However, single-inlet feeding of high-aspect-ratio nozzles inherently induces lateral pressure gradients, causing center-heavy flow and eliminating localized control during dynamic trajectories. We introduce a distributed multi-inlet extrusion testbed featuring three independently actuated inlets. Functioning as a programmable fluid manifold, this architecture actively manages the internal flow field. In-line laser profilometry is integrated as a continuous state estimator to quantify cross-sectional bead geometry. Experiments confirm this distributed architecture regularizes flow, achieving nominal steady-state extrusion with 33% less input flow and a 78% reduction in required plunger velocity per actuator compared to a single-inlet baseline. Furthermore, differential actuation enables high-resolution lateral steering and improves deposition under simulated outlet constraints with high allocation efficiency. This work establishes the hardware and state-estimation foundation for dynamically reconfigurable nozzle outlets by mapping inputs to spatial outputs.
|
| |
| 09:00-10:30, Paper WeI1LB.12 | Add to My Program |
| Online Jacobian Estimation and Tracking Control of a Two-Section Tendon-Driven Continuum Robot |
|
| Kuncara, Ivan Adi | Chonnam National University |
| Hong, Ayoung | Chonnam National University |
Keywords: Medical Robots and Systems, Sensor-based Control
Abstract: This paper proposes a control framework for a two-section, concentrically tendon-driven continuum robot. The tracking control is formulated based on a Jacobian approach using zeroing dynamics. The Jacobian is estimated online from input–output data, using applied tendon force as inputs and measured tip positions as outputs, without requiring an explicit model. Tendon force is employed as the control input to better account for the deformation under external loads. The proposed framework is experimentally validated on a real-scale continuum robot, with tip-position feedback from a stereo-vision system. Experimental results demonstrate that the robot tip follows the desired trajectory with an RMSE of about 1.18 mm, while maintaining good performance under unknown external loads.
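Model-free online Jacobian estimation of the input-output kind described above is commonly done with a rank-one Broyden (secant) update; the sketch below pairs such an update with a resolved-rate tracking step on a toy linear plant. It illustrates the data-driven idea only; the paper's zeroing-dynamics formulation and gains are not reproduced.

```python
import numpy as np

def broyden_update(J, dx, dy, eps=1e-9):
    """Rank-one Broyden update of the estimated Jacobian from an input
    increment dx (tendon forces) and output increment dy (tip motion)."""
    denom = float(dx @ dx)
    if denom < eps:
        return J
    return J + np.outer(dy - J @ dx, dx) / denom

def tracking_step(J, x_tip, x_des, xdot_des, lam=2.0):
    """Resolved-rate input from the current Jacobian estimate, with an
    error-feedback term in the spirit of zeroing dynamics."""
    return np.linalg.pinv(J) @ (lam * (x_des - x_tip) + xdot_des)

# Toy loop: unknown true map y = A u, estimated online from data only.
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))          # hidden tendon-to-tip map
J = np.eye(3, 4)                     # crude initial Jacobian guess
u = np.zeros(4)
target = np.array([0.1, -0.2, 0.3])
for _ in range(200):
    du = tracking_step(J, A @ u, target, np.zeros(3)) * 0.01
    J = broyden_update(J, du, A @ (u + du) - A @ u)
    u += du
print(np.round(A @ u, 3))  # tip moves toward the 3D target
```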
|
| |
| 09:00-10:30, Paper WeI1LB.13 | Add to My Program |
| E2O-SLAM: A Hierarchical Visual SLAM Framework Using Edge-Based and Object-Level Representations |
|
| Choi, Eunseon | POSTECH |
| Han, Soohee | Pohang University of Science and Technology (POSTECH) |
Keywords: RGB-D Perception
Abstract: In this paper, we present a hierarchical simultaneous localization and mapping (SLAM) system that leverages point-level features, mid-level organized edge representations, and high-level object semantics within a unified framework. While object-level SLAM provides semantic information and improves long-term data association, it often suffers from coarse geometric constraints and unreliable detections. In contrast, organized edge representations capture rich structural and textural information, offering stable geometric cues in low-texture or challenging environments. By hierarchically integrating these complementary representations, the proposed system achieves robust camera tracking, reliable data association, and consistent mapping.
|
| |
| 09:00-10:30, Paper WeI1LB.14 | Add to My Program |
| Local Linearized Cosserat Rod Model for Contact Force Estimation in Flexible Medical Instruments and Continuum Robots |
|
| Eyberg, Christoph | Fraunhofer IPA |
| Horsch, Johannes | Fraunhofer IPA |
| Bauernhansl, Thomas | Fraunhofer Institute for Manufacturing Engineering and Automation |
| Langejürgen, Jens | Fraunhofer IPA |
Keywords: Modeling, Control, and Learning for Soft Robots, Medical Robots and Systems, Force and Tactile Sensing
Abstract: Knowledge of instrument contact forces can lead to safer medical interventions. We present a formulation of the frequently used Cosserat rod model that is linearized around the measured shape for efficient model-based contact force estimation in flexible instruments and robots. Validation on instrument deflection in an endovascular use case resulted in an average force estimation error of only 14%.
|
| |
| 09:00-10:30, Paper WeI1LB.15 | Add to My Program |
| Gaussian Process-Based Gait Optimization of a Cable-Driven Soft Quadruped Robot |
|
| Choi, Jeongil | Chonnam National University |
| Hong, Ayoung | Chonnam National University |
Keywords: Modeling, Control, and Learning for Soft Robots
Abstract: Soft robots offer significant advantages over rigid-bodied counterparts due to their inherent flexibility and deformability. However, these same characteristics often introduce challenges in control and simulation-to-reality (Sim-to-Real) synchronization. In this study, we quantified the locomotion performance of a soft quadruped robot through experimental trials. To alleviate the burden of exhaustive physical testing over an expanded parameter space, we employed a Gaussian process-based optimization framework.
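A minimal numpy version of GP-based gait optimization: fit a GP to the gait parameters tested so far and pick the next trial by an upper confidence bound. The RBF kernel, length scale, and UCB acquisition are assumptions of ours; the abstract does not specify the paper's choices.

```python
import numpy as np

def rbf(A, B, ls=0.2):
    """RBF kernel between two sets of parameter vectors."""
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d / ls**2)

def gp_posterior(X, y, Xq, noise=1e-3):
    """GP regression posterior mean/std at query points Xq."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(Xq, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mu, np.sqrt(np.maximum(var, 1e-12))

def next_gait_params(X, y, candidates, kappa=2.0):
    """Upper-confidence-bound pick of the next gait parameters to test."""
    mu, sd = gp_posterior(X, y, candidates)
    return candidates[np.argmax(mu + kappa * sd)]

# Gait parameter = (frequency, phase offset); objective = measured speed.
X = np.array([[0.2, 0.1], [0.5, 0.6], [0.8, 0.3]])  # trials so far
y = np.array([0.03, 0.11, 0.07])                    # walking speed [m/s]
grid = np.stack(np.meshgrid(np.linspace(0, 1, 25),
                            np.linspace(0, 1, 25)), -1).reshape(-1, 2)
print(next_gait_params(X, y, grid))
```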
|
| |
| 09:00-10:30, Paper WeI1LB.16 | Add to My Program |
| Region-Selective Synthetic Data Injection for Data-Driven Magnetic Capsule Pose Estimation |
|
| Darwin, Stevanus | Chonnam National University |
| Hong, Ayoung | Chonnam National University |
Keywords: Medical Robots and Systems
Abstract: Accurate Wireless Magnetic Capsule (WCE) pose estimation remains a challenge for advancing minimally invasive medical procedures, because the relationship between magnetic sensor measurements and capsule pose is highly nonlinear and sensitive to noise and modeling errors, making large-scale training data essential for data-driven estimation. However, data acquisition itself remains a limiting factor, restricting both the volume of training data and the effective workspace of the system. To address this limitation, we propose a region-selective synthetic data injection strategy that generates additional data points using a calibrated physics-based model. In this strategy, regions with high model fidelity are replaced with physics-based data at arbitrary points, while regions with lower fidelity rely on sensor data, which provides a more accurate representation of the real system. Experimental results show that the proposed strategy achieves performance comparable to that of a purely data-driven model while significantly reducing the data acquisition burden.
|
| |
| 09:00-10:30, Paper WeI1LB.17 | Add to My Program |
| Continuous Real-Time Inductive Tracking of Magnetic Microagents for Closed-Loop Control |
|
| Grossrieder, Tim | ETH Zürich |
| Forbrigger, Cameron | ETH Zürich |
| Park, Myungjin | ETH Zurich |
| Christiansen, Michael | ETH Zurich |
| Schuerle, Simone | ETH Zürich |
Keywords: Micro/Nano Robots, Medical Robots and Systems, Sensor-based Control
Abstract: Magnetic microagents hold great promise for improved medical treatments, including targeted drug delivery. Yet, tracking position and performance in vivo during actuation remains challenging. We introduce a sensing approach leveraging inductive signals derived from microagents actuated by a rotating magnetic field (RMF). This offers the unique possibility of controlling microagents while simultaneously tracking position and phase lag in real-time.
|
| |
| 09:00-10:30, Paper WeI1LB.18 | Add to My Program |
| Class-Agnostic Robotic Gaze Control Via Fast Normalized Cut |
|
| Lucny, Andrej | Comenius University in Bratislava |
| Zigo, Branislav | Comenius University |
| Farkaš, Igor | Comenius University in Bratislava |
|
|
| |
| 09:00-10:30, Paper WeI1LB.19 | Add to My Program |
| TeNet: Text-To-Network for Compact Policy Synthesis |
|
| Bighashdel, Ariyan | Vrije Universiteit Amsterdam |
| Luck, Kevin Sebastian | Vrije Universiteit Amsterdam |
Keywords: Machine Learning for Robot Control, Imitation Learning, AI-Based Methods
Abstract: Robots that follow natural-language instructions typically rely on either high-level planners with hand-designed interfaces or large end-to-end models that are difficult to deploy for real-time control. We propose TeNet (Text-to-Network), a framework that instantiates compact, task-specific policies directly from natural language. TeNet conditions a hypernetwork on embeddings from a pretrained language model to generate a fully executable policy, which operates solely on low-dimensional state inputs at high control frequencies. By using language only once at policy instantiation, TeNet combines the expressiveness of large language models with efficient execution. To improve generalization, we optionally ground language in behavior during training, without requiring demonstrations at inference. Experiments on MuJoCo and Meta-World show that TeNet produces policies that are orders of magnitude smaller than sequence-based baselines, while achieving strong performance in both multi-task and meta-learning settings and enabling high-frequency control. These results demonstrate that text-conditioned hypernetworks provide a practical approach for compact, language-driven robot control.
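The one-time text-to-network instantiation can be sketched in a few lines: a hypernetwork maps a language embedding to the full weight set of a small state-based MLP, after which rollouts never touch language again. All sizes, the random "embedding", and the layer layout below are placeholders, not TeNet's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, OBS, HID, ACT = 16, 8, 32, 2

# Hypernetwork parameters: linear maps from a (frozen) language
# embedding to every weight and bias of the compact policy MLP.
H1 = rng.normal(scale=0.1, size=(HID * OBS + HID, EMB))
H2 = rng.normal(scale=0.1, size=(ACT * HID + ACT, EMB))

def instantiate_policy(text_emb):
    """Generate a compact policy once from the instruction embedding."""
    p1, p2 = H1 @ text_emb, H2 @ text_emb
    W1, b1 = p1[:HID * OBS].reshape(HID, OBS), p1[HID * OBS:]
    W2, b2 = p2[:ACT * HID].reshape(ACT, HID), p2[ACT * HID:]

    def policy(state):  # language is never touched again at runtime
        h = np.tanh(W1 @ state + b1)
        return W2 @ h + b2

    return policy

# One-time instantiation, then high-frequency rollout on states only.
pi = instantiate_policy(rng.normal(size=EMB))  # stand-in text embedding
for _ in range(3):
    print(pi(rng.normal(size=OBS)))
```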
|
| |
| 09:00-10:30, Paper WeI1LB.20 | Add to My Program |
| Over-Actuation in Soft Robots: Towards Active Variable Stiffness & Viscoelasticity |
|
| Vazquez-Garcia, Carlos Ernesto | CINVESTAV |
| Olguín Díaz, Ernesto | Cinvestav |
| Parra-Vega, Vicente | Research Center for Advanced Studies (CINVESTAV) |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Robot Materials and Design, Robust/Adaptive Control
Abstract: Continuum soft robots (cSR) are a particularly compliant class of deformable robots made of elastomers, typically driven by embedded pneumatic chambers. While their structural compliance is appealing for many tasks, most existing research on variable stiffness at a constant generalized position relies on empirical evidence rather than a formal method for enhancing the properties of cSR. Consequently, few sound advances have been reported regarding closed-loop (control-based) variable stiffness, let alone variable viscoelasticity for tracking, in cSR; nor has prior work fully exploited the ease with which soft robots incorporate additional actuation inputs. In this extended abstract, control of the robot's structural viscoelastic behavior is addressed by commanding redundant actuation during motion. To this end, m pneumatic chambers are introduced into the cSR such that n chambers are used for motion tracking, while the remaining r = m - n are available to allocate variations of stiffness in regulation tasks, and of viscoelasticity in tracking tasks.
|
| |
| 09:00-10:30, Paper WeI1LB.21 | Add to My Program |
| From Design to Realization: A Validated Pipeline for Magnetic Soft Robot Fabrication and Actuation |
|
| Abu-Shaera, Rawaan | McMaster University |
| Palanichamy, Veerash | McMaster University |
| Clancy, Kaitlyn | McMaster University |
| Onaizah, Onaizah | McMaster University |
Keywords: Micro/Nano Robots, Evolutionary Robotics, Soft Robot Applications
Abstract: Magneto-responsive soft materials have gained attention in biomedical engineering, with applications spanning robotics to regenerative medicine and drug delivery. These materials are the backbone of magnetic soft robots (MSRs), enabling customization of the magnetic domains that dictate their morphological capabilities and behavior. However, reliance on intuition to configure MSR magnetization profiles often results in a trial-and-error design approach, consuming time and resources. To address these challenges, this study presents an intelligent framework that uses a Covariance Matrix Adaptation Evolution Strategy (CMA-ES) together with a Material Point Method simulation environment to determine the magnetization profile of voxel-based MSRs that maximizes performance. This study shows that unique, non-intuitive designs can be evolved. The intelligent design framework is linked to physical prototyping through additive manufacturing to realize these designs. Experimental validation of the generated designs confirms that the algorithm-designed MSRs achieve a 10-fold increase in walking performance compared to intuitively designed MSRs. This study also demonstrates the ability to improve upon both specific and random magnetization profiles and to adapt to design constraints such as various modes of actuation. Overall, the evolutionary algorithm, combined with physical prototyping, establishes an effective and efficient framework for optimizing MSR behavior.
|
| |
| 09:00-10:30, Paper WeI1LB.22 | Add to My Program |
| Robust Robotic Task Planning Via Immutable Subgoals |
|
| Lim, Chulyong | Chung-Ang University |
| Baek, Jaewon | Chung-Ang Univ |
| Han, Junhee | Chung-Ang University |
| Bae, WooYeol | Chung-Ang University |
| Nam, Woochul | Chung-Ang University |
Keywords: Task Planning, Agent-Based Systems, Autonomous Agents
Abstract: Service robots require instruction-following capabilities to perform various tasks regardless of environmental changes. A task planner must accurately infer user intent even when human instructions are ambiguous. To this end, we propose TIGER, a task planning framework that generates reliable action sequences by deriving immutable subgoals from instructions. TIGER employs an Immutable Subgoal Planner (ISP) to decompose instructions into environment-independent subgoals and a Target Grounder (TG) to ground abstract keywords to real-world objects via visual perception and reasoning. A task-representative one-shot strategy improves subgoal generation using only seven annotated examples. TIGER outperformed LLM-Planner in the ALFRED benchmark, increasing success rates from 15.09% to 35.06% on the seen set and from 19.73% to 42.57% on the unseen set. Its scalability was also verified in real-world experiments with a UR5e robot.
|
| |
| 09:00-10:30, Paper WeI1LB.23 | Add to My Program |
| Hybrid Agentic AI-FSM Framework for Instruction-Based Industrial Manipulation Tasks |
|
| Joo, Sungmoon | Korea Atomic Energy Research Institute |
| Kim, Ikjune | Korea Atomic Energy Research Institute |
Keywords: Agent-Based Systems, AI-Based Methods, AI-Enabled Robotics
Abstract: This paper proposes a hybrid Agentic AI–FSM framework for robust natural-language-driven automation in safety-critical industrial robotics applications. Although natural-language procedures are commonplace in manufacturing, translating them into reliable robot programs remains labor-intensive. While Large Language Models (LLMs) offer strong parsing and planning capabilities, their inherent non-determinism and susceptibility to hallucinations preclude their direct use for robot control. To bridge this gap, our architecture employs an LLM-based planning agent to translate instructions offline into a structured task plan. Execution is then delegated to a deterministic Finite State Machine (FSM)-style execution engine to ensure reliability. Safety is further guaranteed by a multi-stage validation–simulation pipeline that verifies schema compliance and operational constraints through dry runs prior to deployment. For runtime anomalies, a RAG-enhanced Exception Handling Agent proposes recovery options, which are strictly mediated through a human-in-the-loop (HIL) interface for operator approval. Finally, a rule-based Safety Agent enforces physical constraints and provides an independent protection layer.
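The planner/executor split might look like the following minimal sketch: a validated plan (a list of schema-checked steps produced offline by the LLM agent) is run by a deterministic executor that either advances or raises to the exception-handling layer. The skill names and plan format are hypothetical, not the paper's interface.

```python
def run_plan(plan, skills):
    """Deterministic FSM-style executor: each step either succeeds and
    advances, or raises to the (human-mediated) exception handler."""
    for step in plan:
        ok = skills[step["skill"]](**step["args"])
        if not ok:
            raise RuntimeError(f"step failed, escalate to operator: {step}")
    return "done"

# Stand-in skill primitives; real ones would command the robot.
skills = {"move_to": lambda pose: True,
          "grasp": lambda obj: True}
plan = [{"skill": "move_to", "args": {"pose": "bin_A"}},
        {"skill": "grasp", "args": {"obj": "valve"}}]
print(run_plan(plan, skills))
```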
|
| |
| WeAT1 Award Session, Hall A2 |
Add to My Program |
| Award Finalists 3 |
|
| |
| |
| 11:00-11:10, Paper WeAT1.1 | Add to My Program |
| Dexora: Open-Source VLA for High-DoF Bimanual Dexterity |
|
| Zhang, Zongzheng | Tsinghua University |
| Pang, Jingrui | Tsinghua University |
| Yang, Zhuo | Beijing Institute of Technology |
| Li, Kun | Beijing Academy of Artificial Intelligence (BAAI) |
| Liao, Minwen | School of Computer Science and Technology, University of Xinjiang |
| Zhang, Saining | Nanyang Technological University |
| Chi, Guoxuan | Tsinghua University |
| Guo, Jinbang | Beijing Academy of Artificial Intelligence |
| Gao, Huan-ang | Tsinghua University |
| Shi, Modi | Beihang University |
| Ge, Dongyun | Tsinghua University |
| Mu, Yao | The University of Hong Kong |
| Gu, Jiayuan | ShanghaiTech University |
| Chen, Rui | Tsinghua University |
| Dong, Hao | Peking University |
| Xu, Huazhe | Tsinghua University |
| Yi, Li | Tsinghua University |
| Zhu, Yixin | Peking University |
| Zhao, Hang | Tsinghua University |
| Wang, Pengwei | Beijing Academy of Artificial Intelligence |
| Zhang, Shanghang | Peking University |
| Yao, Guocai | Beijing Academy of Artificial Intelligence |
| Chen, Jianyu | Tsinghua University |
| Li, Hongyang | The University of Hong Kong |
| Zhao, Hao | Tsinghua University |
|
|
| |
| 11:10-11:20, Paper WeAT1.2 | Add to My Program |
| Robotic Dexterous Manipulation Via Anisotropic Friction Modulation Using Passive Rollers |
|
| Fisk, Ethan | Northeastern University |
| Lee, Taeyoon | Boston Dynamics AI Institute |
| Yuan, Shenli | The Boston Dynamics AI Institute |
Keywords: Dexterous Manipulation, Grippers and Other End-Effectors, Mechanism Design
Abstract: Controlling friction at the fingertip is fundamental to dexterous manipulation, yet remains difficult to realize in robotic hands. We present the design and analysis of a robotic fingertip equipped with passive rollers that can be selectively braked or pivoted to modulate contact friction and constraint directions. When unbraked, the rollers permit unconstrained sliding of the contact point along the rolling direction; when braked, they resist motion like a conventional fingertip. The rollers are mounted on a pivoting mechanism, allowing reorientation of the constraint frame to accommodate different manipulation tasks. We develop a constraint-based model of the fingertip integrated into a parallel-jaw gripper and analyze its ability to support diverse manipulation strategies. Experiments show that the proposed design enables a wide range of dexterous actions that are conventionally challenging for robotic grippers, including sliding and pivoting within the grasp, robust adaptation to uncertain contacts, multi-object or multi-part manipulation, and interactions requiring asymmetric friction across fingers. These results demonstrate the versatility of passive roller fingertips as a low-complexity, mechanically efficient approach to friction modulation, advancing the development of more adaptable and robust robotic manipulation.
|
| |
| 11:20-11:30, Paper WeAT1.3 | Add to My Program |
| Bi-Adapt: Few-Shot Bimanual Adaptation for Novel Categories of 3D Objects Via Semantic Correspondence |
|
| Zhou, Jinxian | National University of Singapore, Shanghai Qi Zhi Institute |
| Wu, Ruihai | Peking University |
| Liu, Yiwei | National University of Singapore |
| Yu, Checheng | National University of Singapore, the University of Hong Kong |
| Zhou, Xunzhe | National University of Singapore, the University of Hong Kong |
| Hou, Yiwen | National University of Singapore |
| Zhong, Licheng | Shanghai Qi Zhi Institute |
| Shao, Lin | National University of Singapore |
Keywords: Bimanual Manipulation, Visual Learning, Learning from Demonstration
Abstract: Bimanual manipulation is imperative yet challenging for robots to execute complex tasks, requiring coordinated collaboration between two arms. However, existing methods for bimanual manipulation often rely on costly data collection and training, struggling to generalize to unseen objects in novel categories efficiently. In this paper, we present Bi-Adapt, a novel framework designed for efficient generalization for bimanual manipulation via semantic correspondence. Bi-Adapt achieves cross-category affordance mapping by leveraging the strong capability of vision foundation models. Fine-tuning with restricted data on novel categories, Bi-Adapt exhibits notable generalization to out-of-category objects in a zero-shot manner. Extensive experiments conducted in both simulation and real-world environments validate the effectiveness of our approach and demonstrate its high efficiency, achieving a high success rate on different benchmark tasks across novel categories with limited data. Project website: https://biadapt-project.github.io/
|
| |
| 11:30-11:40, Paper WeAT1.4 | Add to My Program |
| OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction |
|
| Yang, Lujie | MIT |
| Huang, Xiaoyu | University of California, Berkeley |
| Wu, Zhen | Stanford University |
| Kanazawa, Angjoo | UC Berkeley |
| Abbeel, Pieter | UC Berkeley |
| Sferrazza, Carmelo | UC Berkeley |
| Liu, Karen | Stanford University |
| Duan, Yan | Amazon |
| Shi, Guanya | Carnegie Mellon University |
Keywords: Humanoid and Bipedal Locomotion, Reinforcement Learning, Multi-Contact Whole-Body Motion Planning and Control
Abstract: A dominant paradigm for teaching humanoid robots complex skills is to retarget human motions as kinematic references to train reinforcement learning (RL) policies. However, existing retargeting pipelines often struggle with the significant embodiment gap between humans and robots, producing physically implausible artifacts like foot-skating and penetration. More importantly, common retargeting methods neglect the rich human-object and human-environment interactions essential for expressive locomotion and loco-manipulation. To address this, we introduce OmniRetarget, an interaction-preserving data generation engine based on an interaction mesh that explicitly models and preserves the crucial spatial and contact relationships between an agent, the terrain, and manipulated objects. By minimizing the Laplacian deformation between the human and robot meshes while enforcing kinematic constraints, OmniRetarget generates kinematically feasible trajectories. Moreover, preserving task-relevant interactions enables efficient data augmentation, from a single demonstration to different robot embodiments, terrains, and object configurations. We comprehensively evaluate OmniRetarget by retargeting motions from OMOMO, LAFAN1, and our in-house MoCap datasets, generating over 8 hours of trajectories that achieve better kinematic constraint satisfaction and contact preservation than widely used baselines. Such high-quality data enables proprioceptive RL policies to successfully execute long-horizon (up to 30 seconds) parkour and loco-manipulation skills on a Unitree G1 humanoid, trained with only 5 reward terms and simple domain randomization shared by all tasks, without any learning curriculum. All code, retargeted datasets, and result videos can be found at https://omniretarget.github.io.
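The interaction-mesh objective can be illustrated by comparing Laplacian coordinates of corresponding point sets; the toy mesh and neighbor lists below are ours, and the real system minimizes this energy subject to kinematic constraints rather than evaluating it once.

```python
import numpy as np

def laplacian_coords(V, neighbors):
    """Interaction-mesh Laplacian: each vertex minus the mean of its
    neighbors (body points, object points, terrain contacts)."""
    return np.array([V[i] - V[n].mean(0) for i, n in enumerate(neighbors)])

def deformation_energy(V_src, V_tgt, neighbors):
    """Energy penalizing changes to the source mesh's local spatial
    relationships in the retargeted configuration."""
    d = laplacian_coords(V_src, neighbors) - laplacian_coords(V_tgt, neighbors)
    return (d ** 2).sum()

# Human-frame points vs. a candidate robot-frame retarget (toy 4-point mesh).
human = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 1]])
robot = human * 0.9  # scaled embodiment
nbrs = [np.array([1, 2]), np.array([0, 3]),
        np.array([0, 3]), np.array([1, 2])]
print(deformation_energy(human, robot, nbrs))
```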
|
| |
| 11:40-11:50, Paper WeAT1.5 | Add to My Program |
| Push Anything: Single and Multi-Object Pushing from First Sight with Contact-Implicit MPC |
|
| Bui, Hien | University of Pennsylvania |
| Gao, Yufeiyang | University of Pennsylvania |
| Yang, Haoran | University of Pennsylvania |
| Cui, Eric | University of Pennsylvania |
| Mody, Siddhant | University of Pennsylvania |
| Acosta, Brian | University of Pennsylvania |
| Felix, Thomas Stephen | University of Pennsylvania |
| Bianchini, Bibit | University of Pennsylvania |
| Posa, Michael | University of Pennsylvania |
Keywords: Dexterous Manipulation, Optimization and Optimal Control, Contact Modeling
Abstract: Non-prehensile manipulation of diverse objects remains a core challenge in robotics, driven by unknown physical properties and the complexity of contact-rich interactions. Recent advances in contact-implicit model predictive control (CI-MPC), with contact reasoning embedded directly in the trajectory optimization, have shown promise in tackling the task efficiently and robustly. However, demonstrations have been limited to narrowly curated examples. In this work, we showcase the broader capabilities of CI-MPC through precise planar pushing tasks over a wide range of object geometries, including multi-object domains. These scenarios demand reasoning over numerous inter-object and object-environment contacts to strategically manipulate and de-clutter the environment, which was intractable for prior CI-MPC methods. To achieve this, we introduce Consensus Complementarity Control Plus (C3+), an enhanced CI-MPC algorithm integrated into a complete pipeline spanning object scanning, mesh reconstruction, and hardware execution. Compared to its predecessor C3, C3+ achieves substantially faster solve times, enabling real-time performance even in multi-object pushing tasks. On hardware, our system achieves an overall 98% success rate across 33 objects, reaching pose goals within tight tolerances. The average time-to-goal is approximately 0.5, 1.6, 3.2, and 5.3 minutes for 1-, 2-, 3-, and 4-object tasks, respectively. Project page: https://dairlab.github.io/push-anything.
|
| |
| 11:50-12:00, Paper WeAT1.6 | Add to My Program |
| Design and Implementation of an Angle-Bisecting Foot Mechanism for a Leg-Wheel Transformable Robot |
|
| Lee, Hsing-Chen | National Taiwan University |
| Yu, Wei-Shun | National Taiwan University |
| Lin, Pei-Chun | National Taiwan University |
Keywords: Mechanism Design, Legged Robots, Wheeled Robots
Abstract: This paper presents the design, modeling, and experimental validation of a novel leg-wheel mechanism featuring an integrated, passive angle-bisecting foot. The core of the design is a two-stage planetary gear system. This system mechanically ensures a consistent foot-ground contact angle, addressing a key limitation in transformable robots with symmetrical leg-wheels. To leverage this innovation, we developed a comprehensive kinematic model. Furthermore, we designed a hierarchical motion planning framework that utilizes the pure rolling motion enabled by the mechanism. The effectiveness of the proposed design was validated through hardware experiments on a 23 kg prototype. The results demonstrated improved energy efficiency based on the Cost of Transport (C.O.T.) metric, achieving up to a 16.2% reduction in C.O.T. alongside a 28.6% reduction in pitch oscillation compared to a baseline design. This study provides a valuable guideline for developing adaptive gait controllers that can optimize for energy efficiency in real time.
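The Cost of Transport metric cited here is commonly defined as energy per unit weight per unit distance; the exact normalization used in the paper may differ.

```latex
% Common definition of the Cost of Transport (the paper may normalize
% differently): E is the energy consumed, m the robot mass (23 kg for this
% prototype), g gravity, d the distance travelled; equivalently, average
% power divided by weight times average speed.
\mathrm{C.O.T.} \;=\; \frac{E}{m\,g\,d} \;=\; \frac{\bar{P}}{m\,g\,\bar{v}}
```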
|
| |
| 12:00-12:10, Paper WeAT1.7 | Add to My Program |
| DigiArm: An Anthropomorphic 3D-Printed Prosthetic Hand with Enhanced Dexterity for Typing Tasks |
|
| Zadok, Dean | Technion |
| Naamani, Tom | Technion |
| Bar-Ratson, Yuval | Technion |
| Barash, Elisha | Technion |
| Salzman, Oren | Technion |
| Wolf, Alon | Technion |
| Bronstein, Alexander | Technion |
| Krausz, Nili | Technion |
Keywords: Prosthetics and Exoskeletons, Wearable Robotics, Physically Assistive Devices
Abstract: Despite recent advancements, existing prosthetic limbs are unable to replicate the dexterity and intuitive control of the human hand. Current control systems for prosthetic hands are often limited to grasping, and commercial prosthetic hands lack the precision needed for dexterous manipulation or applications that require fine finger motions. Thus, there is a critical need for accessible and replicable prosthetic designs that enable individuals to interact with electronic devices and perform precise finger pressing, such as keyboard typing or piano playing, while preserving current prosthetic capabilities. This paper presents a low-cost, lightweight, 3D-printed robotic prosthetic hand, specifically engineered for enhanced dexterity with electronic devices such as a computer keyboard or piano, as well as general object manipulation. The robotic hand features a mechanism to adjust finger abduction/adduction spacing, a two-degree-of-freedom wrist including controlled ulnar/radial deviation optimized for typing, and control of independent finger pressing. We conducted a study to demonstrate how participants can use the robotic hand to perform keyboard typing and piano playing in real time, with different levels of finger and wrist motion. These results support the notion that the proposed design enables key typing motions to be executed more effectively than with prior designs, enhancing the functionality of prosthetic hands.
|
| |
| 12:10-12:20, Paper WeAT1.8 | Add to My Program |
| Relaxation Dynamics in Oblate Spherical Rolling Robots |
|
| Oevermann, Micah | Texas A&M University |
| Ambrose, Robert | Texas A&M University |
Keywords: Dynamics, Kinematics, Modeling, Control, and Learning for Soft Robots
Abstract: Spherical robots rolling on flat ground often exhibit a wobbling motion that, at higher speeds, can escalate into end-over-end flipping. This paper proposes a fundamental dynamic cause of this instability: a relaxation effect analogous to the Intermediate Axis Theorem. Rotating bodies with oblate inertial profiles under dissipative loads tend to reorient toward spinning about their major moment of inertia, leading to the observed wobbling in spherical robots. While relaxation dynamics are well-studied in satellites and asteroids, this effect has not been previously applied to rolling systems. We extend these methods to constrained spherical robots, derive the governing dynamics, and conduct experiments with an empty shell on a slope and a reduced pendulum on flat ground and in water to aid in the discussion. Results suggest that translational rolling constraints act as a pseudo-dissipative load to drive the relaxation effect. This work bridges the fields of satellite dynamics theory and ground robotics, providing new insights into the stability of high-speed rolling robots to influence future hardware and control design choices.
|
| |
| 12:20-12:30, Paper WeAT1.9 | Add to My Program |
| A Novel Bio-Inspired Fish Robot with Tunable Stiffness Via Particle Jamming |
|
| Stonecipher, Jack | University of Wisconsin-Madison |
| Gao, Allen | University of Wisconsin - Madison |
| Wang, Wei | University of Wisconsin-Madison |
Keywords: Marine Robotics, Biologically-Inspired Robots, Soft Robot Applications
Abstract: Fish achieve efficient swimming across varied speeds through active modulation of their body flexibility. To explore the effects of tunable stiffness on swimming performance, we present a bio-inspired freely-swimming fish robot with a rapidly tunable particle jamming body. This design enables rapid stiffness adjustments with negligible changes in shape or volume, achieving a 54% variation in flexural rigidity across vacuum pressures from 0 to −40 kPa. We visualize the midline of the oscillating body under both low and high stiffness conditions, and the comparison confirms that the body curvature varies with stiffness. We further experimentally evaluate the tunable stiffness body's effects on swimming performance using velocity and cost of transport (CoT) measurements obtained via a motion tracking system. Results show that active stiffness tuning is essential for sustaining efficient and high-speed swimming across beating frequencies of 1–3 Hz. At low frequencies (1–1.5 Hz), a softer body (0 kPa) maximizes velocity and minimizes CoT, whereas at high frequencies (2.5–3 Hz), a stiffer body (−40 kPa) delivers superior velocity and reduced transport cost. These findings highlight stiffness modulation as a key strategy for adaptive and efficient propulsion in bio-inspired robotic swimmers.
|
| |
| WeAT2 Regular Session, Hall A3 |
Add to My Program |
| Medical Robotics I |
|
| |
| |
| 11:00-11:10, Paper WeAT2.1 | Add to My Program |
| A Wirelessly Powered Robotic Capsule Chain for Large Volume Gastrointestinal Liquid Sampling |
|
| Boyd, Bella | University of Sheffield |
| Esendag, Kaan | University of Sheffield |
| Du, Liu | University of Sheffield |
| Koszowska, Zaneta | University of Sheffield |
| Liu, Jialun | University of Sheffield |
| Miyashita, Shuhei | University of Sheffield |
| Damian, Dana | University of Sheffield |
Keywords: Medical Robots and Systems, Mechanism Design
Abstract: Liquid sampling from the gastrointestinal (GI) tract offers significant diagnostic advantages. This study presents a novel magnetically actuated Robotic Capsule Chain (RCC) for large volume liquid sample collection within the GI tract. The RCC incorporates a wirelessly powered, on-demand motorized sampling mechanism that eliminates the need for onboard batteries or microcontrollers. The system demonstrated reliable operation at distances up to 60 mm from the transmitter coil. Validation experiments confirmed effective sealing of the sampling chamber and successful collection of up to 375 µL of fluids with viscosities comparable to those in the GI tract. Navigation and sampling were further demonstrated in a synthetic bowel model. These findings highlight the potential of robotic capsule chains to enable wireless, minimally invasive diagnostic procedures in the GI tract.
|
| |
| 11:10-11:20, Paper WeAT2.2 | Add to My Program |
| Breaking the Latency Barrier: Synergistic Perception and Control for High-Frequency 3D Ultrasound Servoing |
|
| Qian, Yizhao | The Chinese University of Hong Kong |
| Zhu, Yujie | Tsinghua University |
| Luo, Jiayuan | Great Bay University |
| Liu, Li | Great Bay University |
| Yuan, Yixuan | Chinese University of Hong Kong |
| Liao, Hongen | Tsinghua University |
| Ning, Guochen | Tsinghua University |
Keywords: Medical Robots and Systems, Imitation Learning, Computer Vision for Medical Robotics
Abstract: Tracking moving anatomical targets with robotic ultrasound is particularly challenging when the target motion is both fast and large in scale, as the end-to-end latency of existing systems prevents the perception–control loop from closing fast enough. In this paper, we argue that overcoming this limitation calls for the joint design of perception and control, rather than optimizing each in isolation. We present a tightly-coupled framework with two main components: (1) a Decoupled DualStream Perception Network that estimates 3D translational state from 2D ultrasound images at high frequency, and (2) a Single-Step Flow Policy that outputs an entire action sequence in one forward pass, removing the need for iterative rollouts used in conventional policies. Together, the two modules enable closed-loop control at over 60 Hz. In phantom experiments with complex 3D trajectories, the system achieves a mean tracking error below 6.5 mm and re-acquires the target after resultant displacements exceeding 170 mm. It tracks targets moving at speeds up to 102 mm/s with a terminal error under 1.7 mm. In-vivo trials on a human volunteer further confirm that the approach transfers to realistic clinical conditions. To our knowledge, this is the first RUSS framework to unify high-bandwidth dynamic tracking with large-scale repositioning within a single architecture, offering a concrete step toward autonomous ultrasound operation in the presence of patient motion.
|
| |
| 11:20-11:30, Paper WeAT2.3 | Add to My Program |
| ArthroCut: Autonomous Policy Learning for Robotic Bone Resection in Knee Arthroplasty |
|
| Lu, Xu | Tsinghua University |
| Zhang, Yiling | Tsinghua University |
| Cheng, Wenquan | Tsinghua University |
| Ma, Longfei | Tsinghua University |
| Chen, Fang | Shanghai Jiao Tong University |
| Liao, Hongen | Tsinghua University |
Keywords: Medical Robots and Systems, AI-Based Methods, Deep Learning Methods
Abstract: Despite rapid commercialization of surgical robots, their autonomy and real-time decision-making remain limited in practice. To address this gap, we propose ArthroCut, an autonomous policy learning framework that upgrades knee arthroplasty robots from assistive execution to context-aware action generation. ArthroCut fine-tunes a Qwen-VL backbone on a self-built, time-synchronized multimodal dataset from 21 complete cases (23,205 RGB-D pairs), integrating preoperative CT/MR, intraoperative NDI tracking of bones and end effector, RGB-D surgical video, robot state, and textual intent. The method operates on two complementary token families---Preoperative Imaging Tokens (PIT) to encode patient-specific anatomy and planned resection planes, and Time-Aligned Surgical Tokens (TAST) to fuse real-time visual, geometric, and kinematic evidence---and emits an interpretable action grammar under grammar/safety-constrained decoding. In bench-top experiments on a knee prosthesis across seven trials, ArthroCut achieves an average success rate of 86% over the six standard resections, significantly outperforming strong baselines trained under the same protocol. Ablations show that TAST is the principal driver of reliability while PIT provides essential anatomical grounding, and their combination yields the most stable multi-plane execution. These results indicate that aligning preoperative geometry with time-aligned intraoperative perception and translating that alignment into tokenized, constrained actions is an effective path toward robust, interpretable autonomy in orthopedic robotic surgery.
|
| |
| 11:30-11:40, Paper WeAT2.4 | Add to My Program |
| Anatomical Prior-Driven Framework for Autonomous Robotic Cardiac Ultrasound Standard View Acquisition |
|
| Cao, Zhiyan | Huazhong University of Science and Technology |
| Wu, Zhengxi | The National University of Singapore |
| Wang, Yiwei | Huazhong University of Science and Technology |
| Lin, Pei-Hsuan | National Chung Hsing University |
| Zhang, Li | Wuhan Union Hospital |
| Xie, Zhen | National University of Singapore |
| Zhao, Huan | Huazhong University of Science and Technology |
| Ding, Han | Huazhong University of Science and Technology |
Keywords: Medical Robots and Systems, Computer Vision for Medical Robotics
Abstract: Cardiac ultrasound diagnosis is critical for cardiovascular disease assessment, but acquiring standard views remains highly operator-dependent. Existing medical segmentation models often yield anatomically inconsistent results in images with poor textural differentiation between distinct feature classes, while autonomous probe adjustment methods either rely on simplistic heuristic rules or black-box learning. To address these issues, our study proposed an anatomical prior (AP)-driven framework integrating cardiac structure segmentation and autonomous probe adjustment for standard view acquisition. A YOLO-based multi-class segmentation model augmented by a spatial-relation graph (SRG) module is designed to embed AP into the feature pyramid. Quantifiable anatomical features of standard views are extracted. Their priors are fitted to Gaussian distributions to construct probabilistic APs. The probe adjustment process of robotic ultrasound scanning is formalized as a reinforcement learning (RL) problem, with the RL state built from real-time anatomical features and the reward reflecting the AP matching. Experiments validate the efficacy of the framework. The SRG-YOLOv11s improves mAP50 by 11.3% and mIoU by 6.8% on the Special Case dataset, while the RL agent achieves a 92.5% success rate in simulation and 86.7% in phantom experiments.
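A minimal sketch of the "AP matching" reward idea follows, assuming each standard view is summarized by scalar anatomical features with Gaussian priors fitted offline. The feature names and the log-likelihood form below are illustrative assumptions, not the paper's exact reward shaping.

```python
# Sketch: reward an RL probe-adjustment agent by how well the current
# view's anatomical features match Gaussian anatomical priors (AP).
import numpy as np

PRIORS = {                       # hypothetical fitted (mu, sigma) per feature
    "lv_area_ratio": (0.42, 0.05),
    "septum_angle":  (12.0, 3.0),
}

def ap_reward(features):
    """Sum of Gaussian log-likelihoods (up to a constant) of the features."""
    r = 0.0
    for name, x in features.items():
        mu, sigma = PRIORS[name]
        r += -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma)
    return r

# Usage: reward = ap_reward({"lv_area_ratio": 0.40, "septum_angle": 14.0})
```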
|
| |
| 11:40-11:50, Paper WeAT2.5 | Add to My Program |
| SpReg: An Autonomous Image-To-Patient Registration Framework for Robotic Bronchoscopy |
|
| Ye, Shucheng | Shanghai Jiao Tong University |
| Sun, Peibo | Shanghai Jiao Tong University |
| Ma, Jiajun | Tsinghua University |
| She, Wenbo | Shanghai Jiao Tong University |
| Liao, Hongen | Tsinghua University |
| Han, Dingpei | Ruijin Hospital, Shanghai Jiao Tong University School of Medicine |
| Huang, Tianqi | Shanghai Jiao Tong University |
| Chen, Fang | Shanghai Jiao Tong University |
|
|
| |
| 11:50-12:00, Paper WeAT2.6 | Add to My Program |
| Non-Rigid Motion Compensation with Skin Deformation Prediction for in Situ Bioprinting |
|
| Cuau, Lénaïc | LIRMM |
| Poignet, Philippe | LIRMM University of Montpellier CNRS |
| Zemiti, Nabil | Montpellier University - CNRS UMR 5506 |
Keywords: Medical Robots and Systems, Motion and Path Planning, Visual Servoing
Abstract: This letter introduces a novel method of non-rigid motion compensation for in situ bioprinting. Most bioprinting platforms use open-loop systems, which raises concerns about patient safety and suboptimal wound coverage in the event of patient motion. To handle these issues, our method integrates an RGB-D camera to manage orientation and to predict deformations, along with a laser telemeter to regulate deposited material thickness. The proposed approach has been evaluated on a moving silicone platform that deforms at 0.8 Hz with a 4 mm in-plane amplitude and a 20 mm elevation amplitude. Our method resulted in a wound coverage error of less than 1%. Comparative analysis demonstrates a 73.0% enhancement in deforming-path following compared to existing methods. Additionally, by predicting surface motion, the method enables more precise control of layer height, with an error below 0.1 mm.
|
| |
| 12:00-12:10, Paper WeAT2.7 | Add to My Program |
| Stable Gravity Compensation and 6-DoF Manipulation of a Tethered Magnetic Endoscope with an Optimized End-Effector |
|
| Hossameldin, Ahmed | University of Burgundy Europe |
| Dahmouche, Redwan | Marie and Louis Pasteur University |
| Tahri, Omar | Saint Louis University |
Keywords: Medical Robots and Systems, Telerobotics and Teleoperation, Robotics and Automation in Life Sciences
Abstract: This paper presents the design and experimental validation of a magnetic end-effector optimized for the robust manipulation of a tethered magnetic endoscope in six degrees of freedom (6-DoF). The symmetrical end-effector integrates two pairs of permanent magnets that generate a stable 2D attraction zone with a diameter exceeding 60 mm. This feature enables stable gravity compensation and precise control of centimeter-scale magnetic endoscopes. Experimental results demonstrate 6-DoF manipulation, stability and robustness against external disturbances, and successful navigation through obstacles in a confined environment. The stable gravity compensation reduces friction and pressure on the endoscope’s environment during navigation, a key advantage for advancing minimally invasive medical procedures in general and colonoscopy in particular, where minimizing pressure on the colon is critical. Future work will focus on enhancing the system through active control of the tether’s length.
|
| |
| 12:10-12:20, Paper WeAT2.8 | Add to My Program |
| Image-To-Force Estimation for Soft Tissue Interaction in Robotic-Assisted Surgery Using Structured Light |
|
| Wang, Jiayin | Tongji University, Shanghai MicroPort MedBot |
| Yao, Mingfeng | Microport MedBot(Group) Company |
| Wei, Yanran | Peking University |
| Guo, Xiaoyu | City University of Hong Kong |
| Zheng, Ayong | Microport MedBot(Group) Company |
| Zhao, Weidong | Tongji University |
Keywords: Medical Robots and Systems, Haptics and Haptic Interfaces, Robotics and Automation in Life Sciences
Abstract: For Minimally Invasive Surgical (MIS) robots, accurate haptic interaction force feedback is essential for ensuring the safety of interacting with soft tissue. However, the majority of existing MIS robotic systems cannot facilitate direct measurement of the interaction force with hardware sensors due to space limitations. This letter introduces an effective vision-based scheme that utilizes a One-Shot structured light projection with a designed pattern on soft tissue, coupled with haptic information processing through a trained image-to-force neural network. The images captured from the endoscopic stereo camera are analyzed to reconstruct high-resolution 3D point clouds for soft tissue deformation. The proposed methodology involves a modified PointNet-based force estimation method, which has demonstrated proficiency in accurately representing the intricate mechanical properties of soft tissue. To validate the proposed methodology, force interaction experiments were conducted on three silicone materials with varying stiffness levels. The experimental results substantiate its efficacy.
|
| |
| 12:20-12:30, Paper WeAT2.9 | Add to My Program |
| A-SEE2.0: Active-Sensing End-Effector for Robotic Ultrasound Systems with Dense Contact Surface Perception Enabled Probe Orientation Adjustment |
|
| Zhetpissov, Yernar | Worcester Polytechnic Institute |
| Ma, Xihan | Worcester Polytechnic Institute |
| Yang, Kehan | Worcester Polytechnic Institute |
| Zhang, Haichong | Worcester Polytechnic Institute |
Keywords: Medical Robots and Systems, Sensor-based Control, Computer Vision for Medical Robotics
Abstract: Conventional freehand ultrasound (US) imaging is highly dependent on the skill of the operator, leading to inconsistent results and increased physical burden on sonographers. Robotic Ultrasound Systems (RUSS) aim to address these limitations by providing standardized and automated imaging solutions, especially in environments with limited access to skilled operators. This paper presents the development of a RUSS that employs a novel end-effector, A-SEE2.0, which uses dual RGB-D depth cameras to maintain the US probe normal to the skin surface, a default starting configuration for anatomical landmark identification. Our RUSS integrates RGB-D camera data with robotic control algorithms to maintain orthogonal probe alignment on uneven surfaces without preoperative data. Validation tests using a phantom model show that the system achieves robust normal positioning accuracy. A-SEE2.0 demonstrates a 2.47 ± 1.25 degree normal positioning error on a flat surface and a 12.19 ± 5.81 degree error on a mannequin surface. This work highlights the clinical potential of A-SEE2.0 by demonstrating that, during in-vivo forearm ultrasound examinations, it achieves image quality comparable to manual scanning by a human sonographer.
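The geometric core of keeping a probe normal to the skin can be illustrated by least-squares plane fitting on a depth-camera point patch, as sketched below. A-SEE2.0's dual-camera fusion and control law are not reproduced, and the orientation convention is an assumption.

```python
# Sketch: estimate the local surface normal around the contact point by
# fitting a plane to a (n, 3) patch of depth-camera points via SVD.
import numpy as np

def patch_normal(points):
    """points: (n, 3) 3D samples around the intended contact point."""
    c = points.mean(0)
    _, _, Vt = np.linalg.svd(points - c)
    n = Vt[-1]                        # direction of least variance = plane normal
    return n if n[2] > 0 else -n      # orient toward the camera +z, by convention

# Probe alignment error = angle between the probe axis and patch_normal(patch).
```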
|
| |
| WeAT3 Regular Session, Lehar 1-4 |
Add to My Program |
| Robot Perception I |
|
| |
| Chair: Valada, Abhinav | University of Freiburg |
| Co-Chair: Wang, Sen | Imperial College London |
| |
| 11:00-11:10, Paper WeAT3.1 | Add to My Program |
| GFreeDet2: Exploiting Gaussian Splatting and Foundation Models for RGB-Based Model-Free 2D and 6D Detection of Unseen Objects |
|
| Wang, Gu | Tsinghua University |
| Liu, Xingyu | Tsinghua University |
| Tang, Jingyi | Tsinghua University |
| Li, Chengxi | Tsinghua University |
| Li, Yingyue | Tsinghua University |
| Huang, Ziqin | Tsinghua University |
| Ji, Xiangyang | Tsinghua University |
Keywords: Computer Vision for Automation, Perception for Grasping and Manipulation, Omnidirectional Vision
Abstract: We introduce GFreeDet2, which leverages Gaussian Splatting and foundation models to address RGB-based model-free 2D detection and 6D detection of unseen objects. GFreeDet2 reconstructs 3D Gaussian object models from multi-view RGB references, enabling efficient model-free detection without relying on CAD models. To accelerate reconstruction and consistently handle both pinhole and fisheye cameras, we propose projection-aware perspective cropping (PAPC) with visual hull initialization. PAPC further improves coarse 6D detection by accurately extracting pinhole crops from fisheye query images. The Gaussian objects enable rendering in place of CAD models within foundation model-driven pipelines, allowing existing state-of-the-art RGB-based methods for unseen 2D and 6D detection to be extended to the model-free setting with minimal modifications. Extensive experiments on all three BOP-H3 datasets demonstrate that GFreeDet2 achieves state-of-the-art performance and establishes a strong baseline for RGB-based, model-free 2D and 6D unseen object detection. The code is publicly available at https://github.com/wangg12/GFreeDet2.git.
|
| |
| 11:10-11:20, Paper WeAT3.2 | Add to My Program |
| RANGER: A Monocular Zero-Shot Semantic Navigation Framework through Visual Contextual Adaptation |
|
| Yu, Ming-Ming | BeiHang University |
| Chen, Yi | CASIA |
| Karlsson, Börje F. | Beijing Academy of Artificial Intelligence (BAAI) |
| Wu, Wenjun | Beihang University |
Keywords: Engineering for Robotic Systems, Embedded Systems for Robotic and Automation, Transfer Learning
Abstract: Efficient target localization and autonomous navigation in complex environments are fundamental to real-world embodied applications. While recent advances in multimodal foundation models have enabled zero-shot object goal navigation, allowing robots to search for arbitrary objects without fine-tuning, existing methods face two key limitations: (1) heavy reliance on ground-truth depth and pose information, which restricts applicability in real-world scenarios; and (2) lack of visual in-context learning (VICL) capability to extract geometric and semantic priors from environmental context, as in a short traversal video. To address these challenges, we propose RANGER, a novel zero-shot, open-vocabulary semantic navigation framework that operates using only a monocular camera. Leveraging powerful 3D foundation models, RANGER eliminates the dependency on depth and pose while exhibiting strong VICL capability. By simply observing a short video of the target environment, the system can also significantly improve task efficiency without requiring architectural modifications or task-specific retraining. The framework integrates several key components: keyframe-based 3D reconstruction, semantic point cloud generation, vision-language model (VLM)-driven exploration value estimation, high-level adaptive waypoint selection, and low-level action execution. Experiments on the HM3D benchmark and real-world environments demonstrate that RANGER achieves competitive performance in terms of navigation success rate and exploration efficiency, while showing superior VICL adaptability, with no previous 3D mapping of the environment required.
|
| |
| 11:20-11:30, Paper WeAT3.3 | Add to My Program |
| Sparse Variable Projection in Robotic Perception: Exploiting Separable Structure for Efficient Nonlinear Optimization |
|
| Papalia, Alan | University of Michigan |
| Sanderson, Nikolas | Northeastern University |
| Han, Haoyu | Harvard University |
| Yang, Heng | Harvard University |
| Singh, Hanumant | Northeastern University |
| Everett, Michael | Northeastern University |
Keywords: Optimization and Optimal Control, Sensor Fusion, SLAM
Abstract: Robotic perception often requires solving large nonlinear least-squares (NLS) problems. While sparsity has been well-exploited to scale solvers, a complementary and underexploited structure is separability -- where some variables (e.g., visual landmarks) enter the residuals linearly and, for any estimate of the remaining variables (e.g., poses), have a closed-form least-squares solution that can be substituted back to reduce the problem size and improve conditioning. Variable projection (VarPro) methods are a family of techniques to exploit this structure; however, they have seen limited use in robotic perception, in part because gauge symmetries (e.g., cost invariance to global shifts and rotations), which are common in perception problems, induce specific computational challenges in standard VarPro approaches. We present a VarPro scheme designed for problems with gauge symmetries that jointly exploits separability and sparsity. Our method can be applied as a one-time preprocessing step to construct a matrix-free Schur complement operator. This operator allows for efficiently evaluating costs, gradients, and Hessian-vector products of the reduced problem and readily integrates with standard iterative NLS solvers. We provide precise conditions under which our method applies, and describe extensions when these conditions are only partially met. Across synthetic and real benchmarks in SLAM, SNL, and SfM, our approach achieves 2x–35x faster runtimes than state-of-the-art methods while maintaining accuracy. We release an open-source C++ implementation and all datasets.
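A minimal variable-projection sketch, assuming a separable residual r(theta, l) = A(theta) l - b in which the landmarks l enter linearly: the inner least-squares solve is closed-form, so the outer solver only sees the reduced residual. The toy model and SciPy solver below are illustrative; the paper's contributions (gauge handling, sparsity exploitation, the matrix-free Schur operator) are not shown.

```python
# Variable projection (VarPro) on a toy separable NLS problem.
import numpy as np
from scipy.optimize import least_squares

def reduced_residual(theta, A_of, b):
    A = A_of(theta)                                  # (m, k) design in l
    l_star, *_ = np.linalg.lstsq(A, b, rcond=None)   # closed-form inner solve
    return A @ l_star - b                            # residual of reduced problem

# Example: fit b ~ l0 + l1 * exp(-theta * t), nonlinear only in the rate theta.
t = np.linspace(0.0, 1.0, 50)
A_of = lambda th: np.column_stack([np.ones_like(t), np.exp(-th[0] * t)])
b = 1.0 + 2.0 * np.exp(-3.0 * t) + 0.01 * np.random.randn(t.size)
sol = least_squares(reduced_residual, x0=[1.0], args=(A_of, b))
# sol.x recovers theta ~ 3; the linear l0, l1 never appear in the outer solve.
```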
|
| |
| 11:30-11:40, Paper WeAT3.4 | Add to My Program |
| Class-Aware Queries for Robust Multi-View 3D Object Detection |
|
| Sung, Chaeyeon | Yonsei University |
| Woo, Sungmin | Yonsei University |
| Lee, Sangyoun | Yonsei University |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception, Autonomous Agents
Abstract: Query-based multi-view 3D object detectors typically rely on a fixed set of learnable queries that jointly predict object categories and locations. However, encoding both semantic and geometric information within a shared query embedding leads to representational conflicts, limiting optimization. While prior works decouple prediction heads to partially address this issue, such decoupling often treats classification and localization as independent tasks, leaving the queries themselves class-agnostic and unaware of the scene’s semantic context. In this paper, we present the first 3D object detection framework that constructs class-aware queries using scene-level object class predictions. Specifically, a multi-view image classifier first estimates which object classes are present in the scene, and these predictions are used to generate semantically guided queries for 3D localization within the transformer decoder. This allows our model to initialize each query with class-specific priors, in contrast to conventional uniform query initialization. As a result, queries attend more effectively to relevant regions and objects throughout decoding. Experiments on the nuScenes benchmark show that our method improves mAP by 2.1 points and NDS by 0.9 points over a strong DETR-based baseline. An oracle study further reveals that classification accuracy is a key bottleneck in existing DETR-style detectors, highlighting the benefit of early semantic guidance.
|
| |
| 11:40-11:50, Paper WeAT3.5 | Add to My Program |
| Robustness Is a Function, Not a Number: A Factorized Comprehensive Study of OOD Robustness in Vision-Based Driving |
|
| Mallak, Amir | University of Haifa |
| Maalouf, Alaa | University of Haifa |
Keywords: Control Architectures and Programming, Agent-Based Systems, AI-Based Methods
Abstract: Out-of-distribution (OOD) robustness in vision-based autonomous driving is often reduced to a single number, hiding what breaks a policy and by how much. We adopt a factorized view, decomposing environments along five axes: scene (rural/urban), season, weather, time (day/night), and agent mix; and measure performance under controlled k-factor perturbations (k in {0,1,2,3}). Using closed-loop control in VISTA, we benchmark FC, CNN, and ViT policies, train compact ViT heads on frozen foundation-model (FM) features, and vary in-distribution (ID) support in scale, diversity, and temporal context. (1) ViT policies are markedly more OOD-robust than comparably sized CNN/FC, and adding FM features yields state-of-the-art success at a latency cost. (2) Naive temporal inputs (multi-frame) do not beat the best single-frame baseline. (3) The largest single-factor drops are rural -> urban and day -> night (~31% each); actor swaps ~10% and moderate rain ~7%; several season shifts are drastic, and combining a time flip with other changes further degrades performance. (4) FM-feature policies stay above 85% under three simultaneous changes; non-FM single-frame policies take a large first-shift hit, and all no-FM models fall below 50% by three changes. (5) Interactions are non-additive: some pairings (e.g., urban-night) partially offset, whereas season-time combinations are especially harmful. (6) Training on winter/snow is most robust to single-factor shifts, while a rural+summer baseline gives the best overall OOD performance. (7) Scaling traces/views of the same configuration improves robustness (about +11.8 points from 5 to 14 traces), yet targeted exposure to hard conditions can substitute for scale. (8) Using multiple ID environments broadens coverage and strengthens weak cases (urban OOD 60.6% -> 70.1%) with a small ID drop; single-ID preserves peak performance but in a narrow domain. These results yield actionable design rules for OOD-robust driving policies.
|
| |
| 11:50-12:00, Paper WeAT3.6 | Add to My Program |
| ROVER: A Multi-Season Dataset for Visual SLAM |
|
| Schmidt, Fabian | Esslingen University of Applied Sciences |
| Daubermann, Julian | Andreas Stihl AG |
| Mitschke, Marcel | Andreas Stihl AG |
| Blessing, Constantin | Esslingen University |
| Meyer, Stephan | Andreas Stihl AG |
| Enzweiler, Markus | Esslingen University of Applied Sciences |
| Valada, Abhinav | University of Freiburg |
Keywords: SLAM, Robotics in Agriculture and Forestry, Localization, Field Robots
Abstract: Robust Simultaneous Localization and Mapping (SLAM) is a crucial enabler for autonomous navigation in natural, semi-structured environments such as parks and gardens. However, these environments present unique challenges for SLAM due to frequent seasonal changes, varying light conditions, and dense vegetation. These factors often degrade the performance of visual SLAM algorithms originally developed for structured urban environments. To address this gap, we present ROVER, a comprehensive benchmark dataset tailored for evaluating visual SLAM algorithms under diverse environmental conditions and spatial configurations. We captured the dataset with a robotic platform equipped with monocular, stereo, and RGBD cameras, as well as inertial sensors. It covers 39 recordings across five outdoor locations, collected through all seasons and various lighting scenarios, i.e., day, dusk, and night with and without external lighting. With this novel dataset, we evaluate several traditional and deep learning-based SLAM methods and study their performance in diverse challenging conditions. The results demonstrate that while stereo-inertial and RGBD configurations generally perform better under favorable lighting and moderate vegetation, most SLAM systems perform poorly in low-light and high-vegetation scenarios, particularly during summer and autumn. Our analysis highlights the need for improved adaptability in visual SLAM algorithms for outdoor applications, as current systems struggle with dynamic environmental factors affecting scale, feature extraction, and trajectory consistency. This dataset provides a solid foundation for advancing visual SLAM research in real-world, semi-structured environments, fostering the development of more resilient SLAM systems for long-term outdoor localization and mapping. The dataset and the code of the benchmark are available at https://iis-esslingen.github.io/rover.
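Benchmarks like this one typically report Absolute Trajectory Error (ATE): align the estimated trajectory to ground truth with a similarity transform (Umeyama) and compute translational RMSE. The sketch below assumes time-associated (n, 3) position arrays; ROVER's own evaluation tooling may differ.

```python
# Sketch of ATE evaluation: Umeyama similarity alignment + translational RMSE.
import numpy as np

def umeyama(src, dst):
    """Least-squares similarity transform (s, R, t) mapping src -> dst."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    S, D = src - mu_s, dst - mu_d
    U, sig, Vt = np.linalg.svd(D.T @ S / len(src))   # cross-covariance SVD
    d = np.sign(np.linalg.det(U @ Vt))               # reflection guard
    E = np.diag([1.0, 1.0, d])
    R = U @ E @ Vt
    s = np.trace(np.diag(sig) @ E) / S.var(0).sum()  # optimal scale
    t = mu_d - s * R @ mu_s
    return s, R, t

def ate_rmse(est, gt):
    s, R, t = umeyama(est, gt)
    aligned = (s * (R @ est.T)).T + t
    return np.sqrt(((aligned - gt) ** 2).sum(1).mean())
```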
|
| |
| 12:00-12:10, Paper WeAT3.7 | Add to My Program |
| SG-Reg: Generalizable and Efficient Scene Graph Registration |
|
| Liu, Chuhao | Hong Kong University of Science and Technology |
| Qiao, Zhijian | Hong Kong University of Science and Technology |
| Shi, Jieqi | Nanjing University |
| Wang, Ke | Chang'an University |
| Liu, Peize | The Hong Kong University of Science and Technology, Robotic Institute |
| Shen, Shaojie | Hong Kong University of Science and Technology |
Keywords: SLAM, Deep Learning in Robotics and Automation, Multi-Robot Systems, Semantic Scene Understanding
Abstract: This paper addresses the challenge of registering two rigid semantic scene graphs, an essential capability for autonomous agents to align with remote agents or prior maps. Traditional methods rely on hand-crafted descriptors or ground-truth annotations, limiting their applicability in real-world scenarios. To address these issues, we propose a scene graph network that encodes multiple semantic node modalities: open-set semantic features, local topology with spatial awareness, and shape features. These modalities are fused to form compact semantic node representations for matching layers to perform coarse-to-fine correspondence search. A robust pose estimator in the back-end determines the transformation based on these correspondences. Our approach preserves a sparse, hierarchical scene representation, requiring fewer GPU resources and less communication bandwidth in multi-agent tasks. Additionally, we introduce a novel data generation method using vision foundation models and a semantic mapping module, avoiding the need for ground-truth annotations. We validate our method on a two-agent SLAM benchmark, demonstrating superior registration success and lower communication bandwidth.
|
| |
| 12:10-12:20, Paper WeAT3.8 | Add to My Program |
| SiLVR: Scalable Lidar-Visual Radiance Field Reconstruction with Uncertainty Quantification |
|
| Tao, Yifu | University of Oxford |
| Fallon, Maurice | University of Oxford |
Keywords: Mapping, SLAM, Sensor Fusion, Neural Radiance Field
Abstract: We present a neural radiance field (NeRF) based large-scale reconstruction system that fuses lidar and vision data to generate high-quality reconstructions that are geometrically accurate and capture photorealistic texture. Our system adopts the state-of-the-art NeRF representation and extends it to incorporate lidar. Adding lidar data adds strong geometric constraints on the depth and surface normals, which is particularly useful when modelling uniform texture surfaces which contain ambiguous visual reconstruction cues. A key contribution of this work is a novel method to quantify the epistemic uncertainty of the lidar-visual NeRF reconstruction by estimating the spatial variance of each point location in the radiance field given the sensor observations from the cameras and lidar. This provides a principled approach to evaluate the contribution of each sensor modality to the final reconstruction. In this way, reconstructions that are uncertain (due to, e.g., uniform visual texture, limited observation viewpoints, or little lidar coverage) can be identified and removed. Our system is integrated with a real-time pose-graph lidar SLAM system which is used to bootstrap a Structure-from-Motion (SfM) reconstruction procedure. It also helps to properly constrain the overall metric scale which is essential for the lidar depth loss. The refined SLAM trajectory can then be divided into submaps using Spectral Clustering to group sets of co-visible images together. This submapping approach is more suitable for visual reconstruction than distance-based partitioning. Our uncertainty estimation is particularly effective when merging submaps, as their boundaries often contain artefacts due to limited observations. We demonstrate the reconstruction system using a multi-camera, lidar sensor suite in experiments involving both robot-mounted and handheld scanning with a total area of more than 20,000 m^2. Code and dataset are available at https://dynamic.robots.ox.ac.uk/projects/silvr/
|
| |
| 12:20-12:30, Paper WeAT3.9 | Add to My Program |
| AQUA-SLAM: Tightly-Coupled Underwater Acoustic-Visual-Inertial SLAM with Sensor Calibration |
|
| Xu, Shida | Imperial College London |
| Zhang, Kaicheng | Imperial College London |
| Wang, Sen | Imperial College London |
Keywords: SLAM, Localization, Sensor Fusion, Marine Robotics
Abstract: Underwater environments pose significant challenges for visual Simultaneous Localization and Mapping (SLAM) systems due to limited visibility, inadequate illumination, and sporadic loss of structural features in images. Addressing these challenges, this paper introduces a novel, tightly-coupled Acoustic-Visual-Inertial SLAM approach, termed AQUA-SLAM, to fuse a Doppler Velocity Log (DVL), a stereo camera, and an Inertial Measurement Unit (IMU) within a graph optimization framework. Moreover, we propose an efficient sensor calibration technique, encompassing multi-sensor extrinsic calibration (among the DVL, camera and IMU) and DVL transducer misalignment calibration, with a fast linear approximation procedure for real-time online execution. The proposed methods are extensively evaluated in a tank environment with ground truth, and validated for offshore applications in the North Sea. The results demonstrate that our method surpasses current state-of-the-art underwater and visual-inertial SLAM systems in terms of localization accuracy and robustness. The proposed system will be made open-source for the community.
|
| |
| WeAT4 Regular Session, Strauss 1-2 |
Add to My Program |
| Field and Service Robotics |
|
| |
| |
| 11:00-11:10, Paper WeAT4.1 | Add to My Program |
| On Robust Coordinated Compliant Control Design for Space Manipulators under Flexible and Uncertain Dynamics |
|
| Nanos, Kostas | School of Pedagogical and Technological Education |
| Papadopoulos, Evangelos | National Technical University of Athens |
Keywords: Space Robotics and Automation, Compliance and Impedance Control, Dynamics
Abstract: This paper presents a coordinated compliant control strategy for space manipulator systems (SMS) to enable safe and robust target capture during on-orbit servicing and assembly missions. The proposed controller operates in two distinct phases: free-space motion and contact interaction. In the free-space phase, accurate end-effector trajectory tracking is required, while in the contact phase, the control objective shifts to force regulation, maintaining contact forces within safe bounds and ensuring stable interaction durations to enhance operational safety. A key feature of the developed method is its ability to provide a smooth and continuous transition between free-space and contact phases without requiring controller switching. Furthermore, the controller preserves the attitude stability of the chaser spacecraft during manipulation, mitigating mission-critical risks such as communication loss or solar panel misalignment. The approach explicitly incorporates the coupled rigid-body dynamics of the SMS and demonstrates robustness to various real-world disturbances. Simulation results validate the effectiveness of the proposed controller. Future work includes experiments using the planar Space Robotics Emulator of our lab at the National Technical University of Athens to assess real-world readiness.
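One way to realize a switching-free transition of the kind described is to blend tracking and force-regulation commands with a smooth weight driven by the measured contact force, as in the illustrative sketch below. The gains, sigmoid blending, and force threshold are assumptions, not the paper's control law.

```python
# Illustrative sketch: continuous blending between free-space tracking and
# contact force regulation, avoiding a discrete controller switch.
import numpy as np

def blended_command(x, xd, dx, dxd, f_meas, f_des, Kp, Kd, Kf, f_on=0.5):
    """x/xd: position and reference; dx/dxd: velocities; f_*: contact forces."""
    u_track = Kp @ (xd - x) + Kd @ (dxd - dx)    # free-space tracking term
    u_force = Kf * (f_des - f_meas)              # contact force regulation term
    # Smooth weight s in (0, 1), rising as measured force exceeds f_on [N].
    s = 1.0 / (1.0 + np.exp(-10.0 * (np.linalg.norm(f_meas) - f_on)))
    return (1.0 - s) * u_track + s * u_force     # continuous, no switching
```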
|
| |
| 11:10-11:20, Paper WeAT4.2 | Add to My Program |
| Evaluating Multimodal Communication Methods for Autonomous Buses in Pedestrian-Dense University Environments |
|
| Mohamed, Abdalla Ahmed Roshdi | University of Kaiserslautern-Landau |
| Ashok, Ashita | University of Kaiserslautern-Landau |
| Jan, Qazi Hamza | University of Kaiserslautern-Landau |
| Babel, Franziska | Linköping University |
| Berns, Karsten | University of Kaiserslautern-Landau |
Keywords: Field Robots, Autonomous Agents, Social HRI
Abstract: In urban pedestrian zones where autonomous vehicles (AVs) increasingly operate alongside humans, clear communication between AVs and pedestrians is essential for safety and trust. This study conducted exploratory research on pedestrian reactions to an autonomous shuttle bus (AutoBus) operating on a university campus. Using a real-world deployment, we evaluated the effectiveness of visual and auditory communication cues. The AutoBus continuously looped a 480-meter path on campus during lunchtime, and pedestrians who walked toward and crossed the bus were invited to complete an online survey following this interaction. Data was collected from 58 participants at a technical university through behavioral observations and post-interaction surveys. The results reveal that visual cues were more consistently recognized than auditory ones, influencing pedestrian awareness and response. Trust in the AV was shaped more by its perceived safety than by prior experience with the AutoBus. Moreover, willingness to yield was positively associated with the perceived social status of the AV, but not with whether it was perceived as an autonomous robot or as representing its passengers. These findings offer practical insights for improving AV communication design to support safer, more intuitive interactions in shared spaces.
|
| |
| 11:20-11:30, Paper WeAT4.3 | Add to My Program |
| USV with Interfacial Pumping for Efficient Microplastics Removal |
|
| Wagner, Stephan | Cornell University |
| Fu, Yicong | Cornell University |
| Jung, Sunghwan | Cornell University |
| Petersen, Kirstin Hagelskjaer | Cornell University |
Keywords: Marine Robotics, Environment Monitoring and Management, Mechanism Design
Abstract: Microplastics that accumulate at the air–water interface pose urgent ecological and health risks. However, existing sampling and collection methods based on surface trawls are hindered by hydrodynamic resistance. We present the first in-motion characterization of an interfacial pump mounted on a small uncrewed surface vehicle (USV) to actively draw surface water into an onboard filter. Experiments combining thruster-driven forward motion with the undulating pump show that low thruster output and moderate pumping frequency maximize particles captured per unit energy, balancing the ram effect of forward speed with the lateral suction of the pump. Scaled towing tests reveal that the pontoon cross-section strongly influences intake flow, indicating that streamlined profiles can further boost filtration efficiency. Finally, flow-visualization confirms that the pump’s ability to generate far-field suction without bulk mixing (previously demonstrated only in static tests) persists while the USV is in motion. These results establish interfacial pumping as a promising bio-inspired strategy for manual and autonomous microplastics collection, and highlight design parameters that can guide future development of distributed, high-coverage sampling platforms.
|
| |
| 11:30-11:40, Paper WeAT4.4 | Add to My Program |
| STEAM-LIVO: Spatio-Temporally Adaptive Manifold Lidar-Inertial-Visual Odometry for Sensor Degradation in Unstructured Natural Aquatic-Terrestrial Scenes |
|
| Guo, Yubo | Huazhong University of Science and Technology |
| Peng, Gang | Huazhong University of Science and Technology |
| Li, Jialuo | Huazhong University of Science and Technology |
| Zhang, Hai-Tao | Huazhong University of Science and Technology |
Keywords: Field Robots, SLAM
Abstract: Sensor degradation in unstructured natural environments---manifesting as LiDAR point cloud sparsity or visual feature dropout---and out-of-sequence measurement challenges critically undermine localization robustness in autonomous systems. To address these limitations, we present STEAM-LIVO, a Spatio-Temporally Adaptive Manifold LiDAR-Inertial-Visual Odometry framework that enables tightly coupled multi-sensor fusion via a spatio-temporal manifold-driven iterative Kalman filter. The proposed method formulates an error-state iterative update mechanism on Lie group manifolds, executes IMU-centric real-time estimation, and ensures resilience under sensor degradation through an incremental observation model integrating LiDAR point-to-plane geometric residuals with visual feature reprojection errors within a shared filtering framework. Comprehensive evaluations in vegetated terrestrial landscapes and dynamic aquatic surfaces demonstrate an average relative pose error of 1.77%, with sustained robustness during partial sensor failures. Rigorous ablation studies further corroborate the efficacy of our spatio-temporal adaptive manifold architecture.
|
| |
| 11:40-11:50, Paper WeAT4.5 | Add to My Program |
| Multi-Robot Informative Sampling and Coverage in GPS-Denied Environments |
|
| Munir, Aiman | Pennsylvania State University |
| Latif, Ehsan | University of Georgia |
| Parasuraman, Ramviyas | University of Georgia |
Keywords: Multi-Robot Systems, Sensor Networks, Path Planning for Multiple Mobile Robots or Agents
Abstract: Multi-Robot Systems (MRS) in GPS-denied environments such as indoor spaces, subterranean areas, and urban canyons face the dual challenge of localizing themselves while performing informative path planning (IPP) to model unknown spatial fields. Current IPP methods rely heavily on GPS for localization, limiting their applicability in GPS-denied settings, while existing approaches addressing observation uncertainty fail to account for localization uncertainty that degrades mapping accuracy. This paper presents Anchor-Oriented IPP (AO-IPP), a framework that coordinates robot teams through relative positioning using Access Points and uncertainty-driven transitions between three phases: anchor point localization, informative sampling for field estimation, and spatial coverage optimization. Each robot maintains dual Gaussian Process models with transitions driven by uncertainty levels rather than fixed time schedules. Extensive simulations and real-world experiments demonstrate that AO-IPP achieves performance comparable to GPS-based IPP algorithms while outperforming existing methods in balancing IPP and coverage objectives by up to 54%. The approach exhibits sublinear regret bounds and enables autonomous coordination in challenging environments previously inaccessible to traditional IPP methods, providing a robust solution for environmental monitoring, exploration, and mapping applications requiring both accurate field estimation and comprehensive spatial coverage.
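A minimal sketch of an uncertainty-driven phase transition follows, assuming a Gaussian Process field model with an RBF kernel and a designer-chosen posterior-variance threshold. AO-IPP's dual-GP bookkeeping and anchor-based localization are omitted.

```python
# Sketch: switch from informative sampling to coverage once the GP's
# posterior variance over candidate query points falls below a threshold.
import numpy as np

def rbf(X, Y, ell=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def gp_posterior_var(X_train, X_query, noise=1e-2, ell=1.0):
    """Posterior variance at X_query for a unit-signal-variance RBF GP."""
    K = rbf(X_train, X_train, ell) + noise * np.eye(len(X_train))
    Ks = rbf(X_query, X_train, ell)
    return 1.0 - np.einsum("ij,jk,ik->i", Ks, np.linalg.inv(K), Ks)

def next_phase(X_train, X_query, var_thresh=0.1):
    """Uncertainty-driven (not time-scheduled) phase selection."""
    return ("coverage" if gp_posterior_var(X_train, X_query).max() < var_thresh
            else "sampling")
```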
|
| |
| 11:50-12:00, Paper WeAT4.6 | Add to My Program |
| Stable Multi-Drone GNSS Tracking System for Marine Robots |
|
| Wen, Shuo | McGill University |
| Meriaux, Edwin | McGill |
| Sosa Guzmán, Mariana | McGill University |
| Wang, Zhizun | McGill University |
| Shi, Junming (Clark) | McGill University |
| Dudek, Gregory | McGill University |
Keywords: Marine Robotics, Field Robots, Visual Tracking
Abstract: Stable and accurate tracking is essential for marine robotics, yet Global Navigation Satellite System (GNSS) signals vanish immediately below the sea surface. Traditional alternatives suffer from error accumulation, high computational demands, or infrastructure dependence. In this work, we present a multi-drone GNSS-based tracking system for surface and near-surface marine robots. Our approach combines efficient visual detection, lightweight multi-object tracking, GNSS-based triangulation, and a confidence-weighted Extended Kalman Filter (EKF) to provide stable GNSS estimation in real time. We further introduce a cross-drone tracking ID alignment algorithm that enforces global consistency across views, enabling robust multi-robot tracking with cooperative aerial coverage. We validate our system in diverse, complex settings to demonstrate the accuracy and robustness of the proposed algorithm.
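A minimal sketch of a confidence-weighted EKF position update, assuming each drone contributes a triangulated position with a detection confidence that inflates the measurement covariance. The constant-velocity state and the 1/confidence scaling are assumptions, not the paper's exact filter.

```python
# Sketch: EKF measurement update where detection confidence scales the
# measurement noise, so low-confidence triangulations pull the state less.
import numpy as np

def ekf_update(x, P, z, R_base, conf):
    """x: (6,) position/velocity state; z: (3,) triangulated position."""
    H = np.hstack([np.eye(3), np.zeros((3, 3))])   # observe position only
    R = R_base / max(conf, 1e-3)                   # low confidence -> large R
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(6) - K @ H) @ P
    return x, P
```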
|
| |
| 12:00-12:10, Paper WeAT4.7 | Add to My Program |
| Building Forest Inventories with Autonomous Legged Robots -- System, Lessons, and Challenges Ahead (I) |
|
| Mattamala, Matias | University of Edinburgh |
| Chebrolu, Nived | University of Oxford |
| Frey, Jonas | Stanford University |
| Freißmuth, Leonard | Technical University Munich |
| Oh, Haedam | University of Oxford |
| Casseau, Benoit | University of Oxford |
| Hutter, Marco | ETH Zurich |
| Fallon, Maurice | University of Oxford |
Keywords: Robotics and Automation in Agriculture and Forestry, Legged Robots, SLAM
Abstract: Legged robots are increasingly being adopted in industries such as oil, gas, mining, nuclear, and agriculture. However, new challenges arise when moving into natural, less-structured environments, such as forestry applications. This article presents a prototype system for autonomous, under-canopy forest inventory with legged platforms. Motivated by the robustness and mobility of modern legged robots, we introduce a system architecture that enabled a quadruped platform to autonomously navigate and map forest plots. Our solution involves a complete navigation stack for state estimation, mission planning, and tree detection and trait estimation. We report the performance of the system from trials executed over one and a half years in forests in three European countries. Our results with the ANYmal robot demonstrate that we can survey plots of up to 1 ha in under 30 min while also identifying trees with a typical diameter at breast height (DBH) accuracy of 2 cm. The findings of this project are presented as five lessons and challenges. In particular, we discuss the maturity of hardware development, state estimation limitations, open problems in forest navigation, future avenues for robotic forest inventory, and more general challenges to assess autonomous systems. By sharing these lessons and challenges, we offer insight and new directions for future research on legged robots, navigation systems, and applications in natural environments. Additional videos can be found at https://dy
|
| |
| 12:10-12:20, Paper WeAT4.8 | Add to My Program |
| Air-Ground Collaboration for Language-Specified Missions in Unknown Environments (I) |
|
| Cladera, Fernando | University of Pennsylvania |
| Ravichandran, Zachary | University of Pennsylvania |
| Hughes, Jason | University of Pennsylvania |
| Murali, Varun | Texas A&M University |
| Nieto-Granda, Carlos | DEVCOM U.S. Army Research Laboratory |
| Hsieh, M. Ani | University of Pennsylvania |
| Pappas, George J. | University of Pennsylvania |
| Taylor, Camillo Jose | University of Pennsylvania |
| Kumar, Vijay | University of Pennsylvania |
Keywords: AI-Enabled Robotics, Field Robots, Human-Robot Collaboration
Abstract: As autonomous robotic systems become increasingly mature, users will want to specify missions at the level of intent rather than in low-level detail. Language is an expressive and intuitive medium for such mission specification. However, realizing language-guided robotic teams requires overcoming significant technical hurdles. Interpreting and realizing language-specified missions require advanced semantic reasoning. Successful heterogeneous robots must effectively coordinate actions and share information across varying viewpoints. Additionally, communication between robots is typically intermittent, necessitating robust strategies that leverage communication opportunities to maintain coordination and achieve mission objectives. In this work, we present a first-of-its-kind system where an unmanned aerial vehicle (UAV) and an unmanned ground vehicle (UGV) can collaboratively accomplish missions specified in natural language while reacting to changes in specification on the fly. We leverage a large language model-enabled planner to reason over semantic-metric maps that are built online and opportunistically shared between an aerial and a ground robot. We consider task-driven navigation in urban and rural areas. Our system must infer mission-relevant semantics and actively acquire information via semantic mapping. In both ground and air-ground teaming experiments, we demonstrate our system on seven different natural-language specifications at up to kilometer-scale navigation.
|
| |
| 12:20-12:30, Paper WeAT4.9 | Add to My Program |
| Astrobee: Free-Flying Robots for the International Space Station (I) |
|
| Smith, Trey | NASA Ames Research Center |
| Alexandrov, Oleg | NASA Ames Research Center |
| Barlow, Jonathan | KBR, Inc. |
| Benavides, Jose | NASA |
| Bualat, Maria | NASA Ames Research Center |
| Carlino, Roberto | NASA |
| Coltin, Brian | NASA Ames Research Center (KBR Inc.) |
| Cortez, Jose | Redwire Space |
| Daley, Earl | NASA Ames Research Center |
| Feller, Jeffrey | NASA Ames Research Center |
| Flückiger, Lorenzo | Carnegie Mellon University |
| Fong, Terrence | NASA Ames Research Center (ARC) |
| Fusco, Jesse | NASA Ames Research Center |
| Garcia Ruiz, Ruben | KBR Inc, NASA Ames |
| Browne, Katie | University of Nevada, Reno |
| Kanis, Simeon | NASA Ames Research Center |
| Katterhagen, Aric | NASA Ames Research Center |
| Kim, Yunkyung | NASA Ames Research Center |
| Love, John | NASA Ames Research Center |
| McIntyre, Michael | NASA Ames Research Center |
| McLachlan, Blair | NASA Ames Research Center |
| Mora, Andres | NASA Ames Research Center |
| Moratto, Zachary | Google Inc. |
| Moreira, Marina | Instituto Superior Técnico, Lisbon University |
| Orosco, Henry | NASA Johnson Space Center |
| Park, In-Won | NASA Ames Research Center |
| Provencher, Christopher | NASA Ames Research Center |
| Sanchez, Hugo | NASA Ames Research Center |
| Sharif, Khaled | NASA Ames Research Center |
| Smith, Ernest | NASA Ames Research Center |
| Soussan, Ryan | Aerodyne Industries |
| Symington, Andrew | n/a |
| Talavera, Rafael Omar | NASA Ames Research Center |
| To, Vinh | Stinger Ghaffarian Technologies |
| Wheeler, Dawn | SGT, Inc. |
| Yoo, Jongwoon | NASA Ames Research Center |
|
|
| |
| WeI2I Interactive Session, Hall C |
Add to My Program |
| Interactive Session 4 |
|
| |
| |
| 15:00-16:30, Paper WeI2I.1 | Add to My Program |
| P-AgNav: Range View-Based Autonomous Navigation System for Cornfields |
|
| Kim, Kitae | Purdue University |
| Deb, Aarya | Purdue University |
| Cappelleri, David | Purdue University |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation
Abstract: In this paper, we present an in-row and under-canopy autonomous navigation system for cornfields, called the Purdue Agricultural Navigation System or P-AgNav. Our navigation framework is primarily based on range view images from a 3D light detection and ranging (LiDAR) sensor. P-AgNav is designed for an autonomous robot to navigate in the corn rows with collision avoidance and to switch between rows without GNSS assistance or pre-defined waypoints. The system enables robots, which are intended to monitor crops or conduct physical sampling, to autonomously navigate multiple crop rows with minimal human intervention, thereby increasing crop management efficiency. The capabilities of P-AgNav have been validated through experiments in both simulation and real cornfield environments.
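The "range view" representation mentioned here is typically a spherical projection of the LiDAR point cloud, as sketched below; the resolution and field-of-view values are placeholders, not P-AgNav's configuration.

```python
# Sketch: project an (n, 3+) LiDAR point cloud into an H x W range image
# via azimuth/elevation binning, keeping the closest return per pixel.
import numpy as np

def range_image(points, H=64, W=1024, fov_up=15.0, fov_down=-15.0):
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points[:, :3], axis=1)
    yaw = np.arctan2(y, x)                        # azimuth in [-pi, pi]
    pitch = np.arcsin(z / np.maximum(r, 1e-6))    # elevation
    fu, fd = np.radians(fov_up), np.radians(fov_down)
    u = ((1.0 - (pitch - fd) / (fu - fd)) * H).astype(int).clip(0, H - 1)
    v = ((0.5 * (1.0 - yaw / np.pi)) * W).astype(int).clip(0, W - 1)
    img = np.full((H, W), np.inf)
    np.minimum.at(img, (u, v), r)                 # keep the closest return
    return img
```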
|
| |
| 15:00-16:30, Paper WeI2I.2 | Add to My Program |
| A Champion-Level Vision-Based Reinforcement Learning Agent for Competitive Racing in Gran Turismo 7 |
|
| Lee, Hojoon | KAIST |
| Seno, Takuma | Sony AI |
| Tai, Jun Jet | Coventry University |
| Subramanian, Kaushik | Sony |
| Kawamoto, Kenta | Sony Research Inc |
| Stone, Peter | The University of Texas at Austin |
| Wurman, Peter | Sony |
Keywords: Autonomous Agents, Reinforcement Learning, Vision-Based Navigation
Abstract: Deep reinforcement learning has achieved super-human racing performance in high-fidelity simulators like Gran Turismo 7 (GT7). It typically utilizes global features that require instrumentation external to a car, such as precise localization of agents and opponents, limiting real-world applicability. To address this limitation, we introduce a vision-based autonomous racing agent that relies solely on ego-centric camera views and onboard sensor data, eliminating the need for precise localization during inference. This agent employs an asymmetric actor-critic framework: the actor uses a recurrent neural network with the sensor data local to the car to retain track layouts and opponent positions, while the critic accesses the global features during training. Evaluated in GT7, our agent consistently outperforms model predictive control drivers. To our knowledge, this work presents the first vision-based autonomous racing agent to demonstrate champion-level performance in competitive racing scenarios.
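A minimal PyTorch sketch of the asymmetric actor-critic structure described above: the recurrent actor consumes only ego-centric observations, while the critic additionally receives privileged global features during training. Layer sizes and heads here are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AsymmetricActorCritic(nn.Module):
    """Actor sees only ego-centric observations through a recurrent core;
    the critic additionally consumes privileged global features (e.g.,
    precise poses) that are available only during training, so nothing
    privileged is needed at inference time."""

    def __init__(self, obs_dim, global_dim, act_dim, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden, batch_first=True)
        self.actor = nn.Linear(hidden, act_dim)
        self.critic = nn.Sequential(
            nn.Linear(hidden + global_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, obs_seq, global_feats, h0=None):
        core, h = self.rnn(obs_seq, h0)           # (B, T, hidden)
        action = torch.tanh(self.actor(core))     # policy head, bounded actions
        value = self.critic(torch.cat([core, global_feats], dim=-1))
        return action, value, h
```

At deployment only `self.rnn` and `self.actor` are used, which is what removes the need for external instrumentation at inference.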
|
| |
| 15:00-16:30, Paper WeI2I.3 | Add to My Program |
| DMVC-Tracker: Distributed Multi-Agent Trajectory Planning for Target Tracking Using Dynamic Buffered Voronoi and Inter-Visibility Cells |
|
| Lee, Yunwoo | Carnegie Mellon University |
| Park, Jungwon | Seoul National University of Science and Technology |
| Kim, H. Jin | Seoul National University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Distributed Robot Systems, Motion and Path Planning
Abstract: This letter presents a distributed trajectory planning method for multi-agent aerial tracking. The proposed method uses a Dynamic Buffered Voronoi Cell (DBVC) and a Dynamic Inter-Visibility Cell (DIVC) to formulate the distributed trajectory generation. Specifically, the DBVC and the DIVC are time-variant spaces that prevent mutual collisions and occlusions among agents, while enabling them to maintain suitable distances from the moving target. We combine the DBVC and the DIVC with an efficient Bernstein polynomial motion primitive-based tracking trajectory generation method, which has been refined into a less conservative approach than in our previous work. The proposed algorithm can compute each agent's trajectory within several milliseconds on an Intel i7 desktop. We validate the tracking performance in challenging scenarios, including environments with dozens of obstacles.
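The buffered Voronoi cell underlying the DBVC can be expressed as one halfspace constraint per neighboring agent. A minimal static sketch follows; the paper's cells are additionally time-variant and paired with inter-visibility constraints, which this omits.

```python
import numpy as np

def buffered_voronoi_halfspace(p_i, p_j, safety_radius):
    """Return (a, b) such that a @ p <= b keeps agent i inside its
    buffered Voronoi cell with respect to agent j (disc-shaped agents)."""
    d = p_j - p_i
    a = d / np.linalg.norm(d)            # unit normal pointing toward j
    midpoint = 0.5 * (p_i + p_j)
    b = a @ midpoint - safety_radius     # shift the bisector by the buffer
    return a, b

# Any trajectory sample p for agent i satisfying a @ p <= b for every
# neighbor j is guaranteed collision-free against those neighbors.
```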
|
| |
| 15:00-16:30, Paper WeI2I.4 | Add to My Program |
| Safety-Aware UAVs Formation Scheme for Guiding UGVs through Obstacle-Laden Environments |
|
| Xiao, Ruikang | Huazhong University of Science and Technology |
| Wang, Shuting | Huazhong University of Science and Technology |
| Xie, Yuanlong | Huazhong University of Science and Technology |
| Xie, Sheng Quan | University of Leeds |
| Zhang, Youmin | Concordia University |
Keywords: Integrated Planning and Control, Collision Avoidance, Motion and Path Planning
Abstract: Using unmanned aerial vehicle (UAV) formations to guide unmanned ground vehicles (UGVs) through unstructured, obstacle-laden areas enables highly efficient execution of tasks such as the transportation of supplies. However, existing methods fail to efficiently plan obstacle-avoidance strategies for the entire UAV-UGV swarm. Additionally, the formation controller and planner are isolated, resulting in degraded formation tracking accuracy, which presents potential safety risks. This paper proposes a novel UAV formation scheme that integrates safe corridor (SC) generation, trajectory fitting, and formation tracking to ensure operational safety. The scheme employs a novel line-of-sight (LOS) mechanism to optimize A*-planned waypoints, generating the SC as an obstacle-avoidance strategy. A minimum-snap trajectory is fitted to the optimized waypoints under SC constraints. Bridged by the trajectory, the scheme develops a rigid-graph-based controller (RGC) to track the planning result, enabling dynamic formation maneuvering within the SC. Consequently, the proposed UAV formation scheme achieves obstacle-avoidance guidance by restricting the UGVs to the formation projection. The validation results demonstrate that the proposed scheme exhibits enhanced robustness and superior planning capabilities compared to traditional methods.
|
| |
| 15:00-16:30, Paper WeI2I.5 | Add to My Program |
| NaturalVLM: Leveraging Fine-Grained Natural Language for Affordance-Guided Visual Manipulation |
|
| Xu, Ran | Peking University |
| Shen, Yan | Peking University |
| Li, Xiaoqi | Peking University |
| Wu, Ruihai | Peking University |
| Dong, Hao | Peking University |
Keywords: Perception for Grasping and Manipulation, Data Sets for Robot Learning, Deep Learning in Grasping and Manipulation
Abstract: Enabling home-assistant robots to perceive and manipulate a diverse range of 3D objects based on human language instructions is a pivotal challenge. Prior research has predominantly focused on simplistic, task-oriented instructions, e.g., "Slide the top drawer open". However, many real-world tasks demand intricate multi-step reasoning, which becomes extremely difficult for robot manipulation without fine-grained human instructions. To address these challenges, we introduce a comprehensive benchmark, NrVLM, comprising 15 distinct manipulation tasks and over 4500 episodes meticulously annotated with fine-grained language instructions. We split the long-term task process into several steps, each with its own natural language instruction. Moreover, we propose a novel learning framework that completes the manipulation task step by step according to the fine-grained instructions. Specifically, we first identify the instruction to execute, taking into account visual observations and the end-effector's current state. Subsequently, our approach facilitates explicit learning through action-prompts and perception-prompts to promote manipulation-aware cross-modality alignment. Leveraging both visual observations and linguistic guidance, our model outputs a sequence of actionable predictions for manipulation, including contact points and end-effector poses. We evaluate our method and baselines on the proposed benchmark NrVLM. The experimental results demonstrate the effectiveness of our approach. For additional details, please refer to https://sites.google.com/view/naturalvlm.
|
| |
| 15:00-16:30, Paper WeI2I.6 | Add to My Program |
| Search3D: Hierarchical Open-Vocabulary 3D Segmentation |
|
| Takmaz, Ayca | ETH Zurich |
| Delitzas, Alexandros | ETH Zurich |
| Sumner, Robert W. | Disney Research |
| Engelmann, Francis | Stanford University |
| Wald, Johanna | Google |
| Tombari, Federico | Technische Universität München |
Keywords: Semantic Scene Understanding, Object Detection, Segmentation and Categorization, RGB-D Perception
Abstract: Open-vocabulary 3D segmentation enables the exploration of 3D spaces using free-form text descriptions. Existing methods for open-vocabulary 3D instance segmentation primarily focus on identifying object-level instances in a scene. However, they face challenges when it comes to understanding more fine-grained scene entities such as object parts, or regions described by generic attributes. In this work, we introduce Search3D, an approach that builds a hierarchical open-vocabulary 3D scene representation, enabling the search for entities at varying levels of granularity: fine-grained object parts, entire objects, or regions described by attributes like materials. Our method aims to expand the capabilities of open-vocabulary instance-level 3D segmentation by shifting towards a more flexible open-vocabulary 3D search setting less anchored to explicit object-centric queries, compared to prior work. To ensure a systematic evaluation, we also contribute a scene-scale open-vocabulary 3D part segmentation benchmark based on MultiScan, along with a set of open-vocabulary fine-grained part annotations on ScanNet++. We verify the effectiveness of Search3D across several tasks, demonstrating that our approach outperforms baselines in scene-scale open-vocabulary 3D part segmentation, while maintaining strong performance in segmenting 3D objects and materials.
|
| |
| 15:00-16:30, Paper WeI2I.7 | Add to My Program |
| Symbolic Manipulation Planning with Discovered Object and Relational Predicates |
|
| Ahmetoglu, Alper | Brown University |
| Oztop, Erhan | Osaka University / Ozyegin University |
| Ugur, Emre | Bogazici University |
Keywords: Developmental Robotics, Learning Categories and Concepts, Deep Learning Methods
Abstract: Discovering the symbols and rules that can be used in long-horizon planning from a robot's unsupervised exploration of its environment and continuous sensorimotor experience is a challenging task. Previous studies proposed learning symbols from single or paired object interactions and planning with these symbols. In this work, we propose a system that learns rules with discovered object and relational symbols that encode an arbitrary number of objects and the relations between them, converts those rules to the Planning Domain Definition Language (PDDL), and generates plans that involve the affordances of an arbitrary number of objects to achieve tasks. We validated our system with box-shaped objects of different sizes and showed that the system can develop symbolic knowledge of pick-up, carry, and place operations, taking into account object compounds in different configurations, such as smaller boxes being carried together with the larger box on which they are placed. We also compared our method with other symbol learning methods and showed that planning with operators defined over relational symbols gives better planning performance than the baselines.
|
| |
| 15:00-16:30, Paper WeI2I.8 | Add to My Program |
| Pushing the Limits of Reactive Navigation: Learning to Escape Local Minima |
|
| Meijer, Isar | Microsoft |
| Pantic, Michael | ETH Zürich |
| Oleynikova, Helen | ETH Zurich |
| Siegwart, Roland | ETH Zurich |
Keywords: Reactive and Sensor-Based Planning, Collision Avoidance, Deep Learning Methods
Abstract: Can a robot navigate a cluttered environment without an explicit map? Reactive methods that use only the robot's current sensor data and local information are fast and flexible, but prone to getting stuck in local minima. Is there a middle ground between reactive methods and map-based path planners? In this paper, we investigate feedforward and recurrent networks to augment a purely reactive sensor-based navigation algorithm, which should give the robot "geometric intuition" about how to escape local minima. We train on a large number of extremely cluttered simulated worlds, auto-generated from primitive shapes, and show that our system zero-shot transfers to worlds based on real data and 3D man-made environments, and can handle up to 30% sensor noise without degradation of performance. We also discuss what role network memory plays in our final system, and what insights can be drawn about the nature of reactive vs. map-based navigation. The implementation of the planners and all experiments is made available open-source.
|
| |
| 15:00-16:30, Paper WeI2I.9 | Add to My Program |
| MotIF: Motion Instruction Fine-Tuning |
|
| Hwang, Minyoung | Massachusetts Institute of Technology |
| Hejna, Donald | Stanford University |
| Sadigh, Dorsa | Stanford University |
| Bisk, Yonatan | Carnegie Mellon University |
Keywords: Intention Recognition, Data Sets for Robot Learning, Semantic Scene Understanding
Abstract: While success in many robotics tasks can be determined by observing only the final state and how it differs from the initial state - e.g., whether an apple is picked up - many tasks require observing the full motion of the robot to correctly determine success. For example, brushing hair requires repeated strokes that correspond to the contours and type of hair. Prior works often use off-the-shelf vision-language models (VLMs) as success detectors; however, when success depends on the full trajectory, VLMs struggle to make correct judgments for two reasons. First, modern VLMs often use single frames and cannot capture changes over a full trajectory. Second, even if we provide state-of-the-art VLMs with an input of multiple frames, they still fail to detect success due to a lack of robot data. Our key idea is to fine-tune VLMs using abstract representations that are able to capture trajectory-level information, such as the path the robot takes, by overlaying keypoint trajectories on the final image. We propose motion instruction fine-tuning (MotIF), a method that fine-tunes VLMs using the aforementioned abstract representations to semantically ground the robot's behavior in the environment. To benchmark and fine-tune VLMs for robotic motion understanding, we introduce the MotIF-1K dataset containing 653 human and 369 robot demonstrations across 13 task categories. MotIF assesses the success of robot motion given task and motion instructions. Our model significantly outperforms state-of-the-art VLMs and video LMs, achieving at least twice their F1 score with high precision and recall, and generalizing across unseen motions, tasks, and environments. Finally, we demonstrate practical applications of MotIF in ranking trajectories by how well they align with task and motion descriptions.
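Constructing the abstract representation MotIF is trained on, a keypoint trajectory overlaid on the final image, is straightforward. A minimal sketch using OpenCV, where the function name and time-coded color ramp are illustrative choices rather than the paper's exact rendering:

```python
import cv2
import numpy as np

def overlay_keypoint_trajectory(final_frame, keypoints_uv):
    """Draw a (T, 2) pixel trajectory of a tracked keypoint onto the
    final frame, encoding time as a blue-to-red color ramp (BGR order)."""
    img = final_frame.copy()
    pts = np.asarray(keypoints_uv, dtype=int)
    for t in range(len(pts) - 1):
        a = t / max(len(pts) - 2, 1)            # 0 at start, 1 at end
        color = (int(255 * (1 - a)), 0, int(255 * a))
        cv2.line(img, tuple(map(int, pts[t])), tuple(map(int, pts[t + 1])), color, 2)
    cv2.circle(img, tuple(map(int, pts[-1])), 5, (0, 0, 255), -1)  # final position
    return img
```

The resulting single image summarizes the whole motion, which is what lets a frame-based VLM reason about trajectory-level success.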
|
| |
| 15:00-16:30, Paper WeI2I.10 | Add to My Program |
| AirBender: Adaptive Transportation of Bendable Objects Using Dual UAVs |
|
| Xu, Jiawei | University of Michigan |
| Gao, Longsen | University of New Mexico |
| Fierro, Rafael | University of New Mexico |
| Saldaña, David | Lehigh University |
Keywords: Aerial Systems: Mechanics and Control, Robust/Adaptive Control, Aerial Systems: Applications
Abstract: The interaction of robots with bendable objects in midair presents significant challenges in control, often resulting in performance degradation and potential crashes, especially for aerial robots due to their limited actuation capabilities and constant need to remain airborne. This paper presents an adaptive controller that enables two aerial vehicles to collaboratively follow a trajectory while transporting a bendable object without relying on explicit elasticity models. Our method allows on-the-fly adaptation to the object's unknown deformable properties, ensuring stability and performance in trajectory-tracking tasks. We use Lyapunov analysis to demonstrate that our adaptive controller is asymptotically stable. Our method is evaluated through hardware experiments in various scenarios, demonstrating the capabilities of using multirotor aerial vehicles to handle bendable objects.
|
| |
| 15:00-16:30, Paper WeI2I.11 | Add to My Program |
| SEPT: Standard-Definition Map Enhanced Scene Perception and Topology Reasoning for Autonomous Driving |
|
| Pei, Muleilan | Hong Kong University of Science and Technology |
| Shan, Jiayao | Zhuoyu Technology |
| Li, Peiliang | Zhuoyu Technology |
| Shi, Jieqi | Nanjing University |
| Huo, Jing | Nanjing University |
| Gao, Yang | Nanjing University |
| Shen, Shaojie | Hong Kong University of Science and Technology |
Keywords: Computer Vision for Transportation, Deep Learning for Visual Perception, Intelligent Transportation Systems
Abstract: Online scene perception and topology reasoning are critical for autonomous vehicles to understand their driving environment, particularly for mapless driving systems that endeavor to reduce reliance on costly High-Definition (HD) maps. However, recent advances in online scene understanding still face limitations, especially in long-range or occluded scenarios, due to the inherent constraints of onboard sensors. To address this challenge, we propose a Standard-Definition (SD) map Enhanced scene Perception and Topology reasoning (SEPT) framework, which explores how to effectively incorporate the SD map as prior knowledge into existing perception and reasoning pipelines. Specifically, we introduce a novel hybrid feature fusion strategy that combines SD maps with Bird’s-Eye-View (BEV) features, considering both rasterized and vectorized representations, while mitigating potential misalignment between SD maps and BEV feature spaces. Additionally, we leverage the SD map characteristics to design an auxiliary intersection-aware keypoint detection task, which further enhances the overall scene understanding performance. Experimental results on the large-scale OpenLane-V2 dataset demonstrate that by effectively integrating SD map priors, our framework significantly improves both scene perception and topology reasoning, outperforming existing methods by a substantial margin.
|
| |
| 15:00-16:30, Paper WeI2I.12 | Add to My Program |
| Estimation of Gait Phase of Human Stair Descent Walking Based on Phase Variable Approach |
|
| Cha, MyeongJu | Gwangju Institute of Science and Technology |
| Hur, Pilwon | Gwangju Institute of Science and Technology |
Keywords: Wearable Robotics, Prosthetics and Exoskeletons, Intention Recognition
Abstract: Synchronization between a wearer and a lower limb powered prosthesis is important for effective control. Typically, phase variable-based phase estimation methods are employed. However, there is a noticeable lack of studies focusing on estimating the gait phase during stair descent, likely due to the difficulty in generating a reliable phase variable. In most studies, the thigh angle is used to generate phase variables for level walking because it follows a sinusoidal pattern. However, during stair descent, the thigh angle exhibits only a partially sinusoidal shape, making it challenging to apply the methods used for level walking. In this study, we propose a novel phase variable generation method to address the difficulty of using only the thigh angle for stair descent. To estimate the gait phase reliably, the phase variable is defined differently for the stance and swing phases: the hip position is used to generate the phase variable during the stance phase, and the thigh angle is used during the swing phase. These phase variables are then unified into a single phase variable (PV-ENT) for the entire gait cycle of stair descent. During this unification process, a non-smooth transition occurs around the phase transition point. To address this, a blending method is applied. The proposed method was validated using the data from 12 healthy subjects, collected through a motion capture system and IMU sensors. The results demonstrate a reliable phase estimation performance. Moreover, the blending method successfully improves the smoothness of the phase variable around the phase transition point without reducing the overall phase estimation performance.
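A minimal sketch of the kind of blending described, interpolating between the stance-phase and swing-phase variables across a window around the transition point. The smoothstep weighting and window parameters are illustrative assumptions, not the paper's exact method:

```python
import numpy as np

def blended_phase(phase_stance, phase_swing, progress, transition=0.6, width=0.05):
    """Blend the stance-phase variable (from hip position) and the
    swing-phase variable (from thigh angle) near the transition point.

    progress  : rough gait-progress estimate used to schedule the blend
    transition: nominal stance-to-swing transition point in [0, 1]
    width     : half-width of the blending window
    """
    s = np.clip((progress - (transition - width)) / (2 * width), 0.0, 1.0)
    w = 3 * s**2 - 2 * s**3          # smoothstep: 0 before, 1 after the window
    return (1 - w) * phase_stance + w * phase_swing
```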
|
| |
| 15:00-16:30, Paper WeI2I.13 | Add to My Program |
| Enhancing Safety and Manipulability of Redundant Manipulators: Accelerated Motion Generation in Dynamic Environments |
|
| Xie, Zongwu | Harbin Institute of Technology |
| Li, Mengfei | Harbin Institute of Technology |
| Sun, Wandong | Harbin Institute of Technology |
| Cao, Baoshi | Harbin Institute of Technology |
| Liu, Yang | Harbin Institute of Technology |
| Wang, Zhengpu | Harbin Institute of Technology |
| Ji, Yiming | Harbin Institute of Technology |
| Liu, Hong | Harbin Institute of Technology |
| Ma, Boyu | Nanyang Technological University |
| Wu, Zhihong | Institute of Spacecraft System Engineering |
Keywords: Dexterous Manipulation, Motion and Path Planning
Abstract: Motion generation in dynamic environments is crucial for human-machine interaction with redundant manipulators. In this context, we propose the Enhancing Safety and Manipulability (ESM) scheme, which integrates geometry-based dynamic obstacle avoidance, manipulability optimization, trajectory tracking, and joint limit avoidance into a unified scheme operating at the joint-angle level. The introduction of a flexible collision library enables the scheme to locate critical points based on geometry, while the incorporated obstacle speed allows the scheme to effectively avoid dynamic obstacles. In the ESM, manipulability is naturally set as the non-convex goal. To solve the ESM, the Accelerated Multi-agent recurrent Neural Network (AMNN) is proposed, which uses a meta-heuristic approach to construct activation functions, endowing the neural network with non-convex control capabilities. Subsequently, a GPU-based parallel computing method is implemented, significantly reducing computing time. Detailed simulations, experiments, and comparisons demonstrate the framework's effectiveness and superiority.
|
| |
| 15:00-16:30, Paper WeI2I.14 | Add to My Program |
| Rhythm-Based Power Allocation Strategy of Bionic Tail-Flapping for Propulsion Enhancement |
|
| Wu, Biao | Southern University of Science and Technology |
| Huang, Chaoyi | The University of Hong Kong |
| Li, Xiangru | The Hong Kong University of Science and Technology |
| Xu, Jiahao | Southern University of Science and Technology |
| Liu, Sicong | Southern University of Science and Technology |
| Lam, James | University of Hong Kong |
| Wang, Zheng | Southern University of Science and Technology |
| Dai, Jian | School of Natural and Mathematical Sciences, King's College London, University of London |
|
|
| |
| 15:00-16:30, Paper WeI2I.15 | Add to My Program |
| Multi-Vehicle Cooperative Persistent Coverage for Random Target Search |
|
| Li, Zhuo | Beijing Institute of Technology |
| Li, Guangzheng | Beijing Institute of Technology |
| Sadeghi, Alireza | University of Minnesota |
| Jian, Sun | Beijing Institute of Technology |
| Wang, Gang | Beijing Institute of Technology |
| Wang, Jialin | China Academy of Launch Vehicle Technology |
|
|
| |
| 15:00-16:30, Paper WeI2I.16 | Add to My Program |
| Improving Trust Estimation in Human-Robot Collaboration Using Beta Reputation at Fine-Grained Timescales |
|
| Dagdanov, Resul | University of Technology Sydney |
| Andrejevic, Milan | University of Technology Sydney |
| Liu, Dikai | University of Technology, Sydney |
| Lin, Chin-Teng | UTS |
Keywords: Human-Robot Collaboration, Physical Human-Robot Interaction, Acceptability and Trust
Abstract: When interacting with each other, humans adjust their behavior based on perceived trust. To achieve similar adaptability, robots must accurately estimate human trust at sufficiently granular timescales while collaborating with humans. Beta reputation is a popular way to formalize a mathematical estimation of human trust. However, it relies on binary performance, which updates trust estimations only after each task concludes. Additionally, manually crafting a reward function is the usual method of building a performance indicator, which is labor-intensive and time-consuming. These limitations prevent efficient capture of continuous trust changes at more granular timescales throughout the collaboration task. Therefore, this letter presents a new framework for the estimation of human trust using beta reputation at fine-grained timescales. To achieve granularity in beta reputation, we utilize continuous reward values to update trust estimates at each timestep of a task. We construct a continuous reward function using maximum entropy optimization to eliminate the need for the laborious specification of a performance indicator. The proposed framework improves trust estimations by increasing accuracy, eliminating the need to manually craft a reward function, and advancing toward the development of more intelligent robots.
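The fine-grained update at the heart of this framework, feeding a continuous per-timestep reward into a beta reputation, can be sketched in a few lines. The forgetting factor below is an illustrative addition, and the reward itself would come from the learned maximum-entropy reward function, which this sketch omits:

```python
def update_beta_reputation(alpha, beta, reward, decay=0.99):
    """One fine-grained beta-reputation update from a continuous reward.

    reward in [0, 1] splits one unit of evidence between success (alpha)
    and failure (beta); decay forgets stale evidence so the estimate can
    track trust changes within a task rather than only at task boundaries.
    """
    alpha = decay * alpha + reward
    beta = decay * beta + (1.0 - reward)
    trust = alpha / (alpha + beta)   # mean of the Beta(alpha, beta) posterior
    return alpha, beta, trust

# Called once per timestep, e.g.:
# a, b, trust = update_beta_reputation(a, b, reward_t)
```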
|
| |
| 15:00-16:30, Paper WeI2I.17 | Add to My Program |
| Look at Them Go! Using an Autonomous Assistive GoBot to Encourage Movement Practice by Two Children with Motor Disabilities |
|
| Helmi, Ameer | Oregon State University |
| Wang, Tze-Hsuan | Oregon State University |
| Logan, Samuel W. | Oregon State University |
| Fitter, Naomi T. | Oregon State University |
Keywords: Rehabilitation Robotics, Human-Centered Robotics, Robot Companions
Abstract: Young children with motor disabilities face barriers and delays to learning motor skills such as walking. Pediatric body-weight support harness systems (BWSHes) are a newer technology for helping young children to practice supported motor skills. Incorporating an assistive robot to mediate BWSH interventions can support further child motion and engagement, but almost no work to date has studied autonomous robot-mediated BWSH use. We conducted a six-month-long single-case study series with two participants to evaluate the effectiveness of an autonomous assistive robot in motivating the children to move and stay engaged while in the BWSH. We collected and analyzed objective movement data and self-reported parent survey data to determine how much the child moved and stayed engaged during sessions. Our results showed that both children displayed more movement while the assistive robot was active (relative to in prior no-robot periods). Parents also rated their children as more engaged while the assistive robot was present. An autonomous assistive robot may provide motivation for a child to move and stay engaged while using a pediatric rehabilitation aid such as a BWSH. The products of this work can benefit roboticists who work with children with disabilities and researchers who use pediatric rehabilitation technologies.
|
| |
| 15:00-16:30, Paper WeI2I.18 | Add to My Program |
| BFA: Best-Feature-Aware Fusion for Multi-View Fine-Grained Manipulation |
|
| Lan, Zihan | Beijing University of Posts and Telecommunications |
| Mao, Weixin | Waseda University |
| Li, Haosheng | University of Chinese Academy of Sciences |
| Wang, Le | Beihang University |
| Wang, Tiancai | MEGVII Technology |
| Fan, Haoqiang | Megvii Inc |
| Yoshie, Osamu | Waseda University |
Keywords: Imitation Learning, Sensor Fusion, Bimanual Manipulation
Abstract: In real-world scenarios, multi-view cameras are typically employed for fine-grained manipulation tasks. Existing approaches (e.g., ACT) tend to treat multi-view features equally and directly concatenate them for policy learning. However, this introduces redundant visual information and higher computational costs, leading to ineffective manipulation. Fine-grained manipulation tasks typically consist of multiple stages, where the best view may vary across different phases. This paper proposes a plug-and-play Best-Feature-Aware (BFA) fusion strategy for multi-view manipulation tasks, which is adaptable to various policies. Building upon the visual backbone of the policy network, we design a lightweight subnetwork to effectively predict the importance score of each view. Based on the predicted importance scores, the reweighted multi-view features are subsequently fused and fed into the end-to-end policy network for seamless integration. Notably, our method demonstrates outstanding performance in fine-grained manipulations. The experimental results show that our approach outperforms multiple baselines by 22-46% in success rate on different tasks. Our work provides new insights and inspiration for tackling key challenges in fine-grained manipulations.
|
| |
| 15:00-16:30, Paper WeI2I.19 | Add to My Program |
| Nullspace Adaptive Velocity Controller for Ground Vehicles: Theory and Experimental Evaluation |
|
| Elsberry, Allan | Johns Hopkins University |
| Dawkins, Jeremy | United States Naval Academy |
| Whitcomb, Louis | Johns Hopkins University |
Keywords: Robust/Adaptive Control, Underactuated Robots
Abstract: We report a novel stable Model-Based Adaptive Velocity Tracking Controller (AVTC) for ground vehicles capable of asymptotically exactly tracking longitudinal and yaw reference velocities and simultaneously adaptively identifying unknown plant parameters and actuator parameters. The reported AVTC is developed for velocity control of the commonly accepted three degree-of-freedom second-order dynamic “bicycle” model for ground vehicles. A Lyapunov analysis shows asymptotic stability of the velocity tracking error in longitudinal and yaw velocities, boundedness of all signals, and convergence of the adaptive parameter estimates. A performance evaluation of the proposed AVTC is reported including numerical simulation evaluation and experimental evaluation that corroborates the analytical predictions of stability and tracking, and compares its performance to its non-adaptive counterpart and two alternative controllers. AVTC only requires body-frame velocities and control input signals and robustly detects, quantitatively identifies, and compensates dynamically in real-time for faults arising from changes to plant, actuator, and environmental parameters during operations.
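Model-based adaptive tracking controllers of this kind typically follow the standard regressor form. The sketch below shows that generic pattern under stated assumptions; the bicycle-model regressor Y and the gain matrices are placeholders, not the paper's derivation:

```python
import numpy as np

def avtc_step(v, v_des, v_des_dot, theta_hat, Y_fn, K, Gamma, dt):
    """One step of a regressor-form adaptive velocity tracking law.

    v, v_des  : current and reference body-frame velocities, e.g. [u, r]
    v_des_dot : reference acceleration
    theta_hat : current estimate of the unknown plant/actuator parameters
    Y_fn      : regressor such that the dynamics are linear in the unknown
                parameters, M v_dot + h(v) = Y(v, v_des_dot) theta
    K, Gamma  : positive-definite feedback and adaptation gain matrices
    """
    s = v - v_des                                   # velocity tracking error
    Y = Y_fn(v, v_des_dot)
    tau = Y @ theta_hat - K @ s                     # feedforward + stabilizing feedback
    theta_hat = theta_hat - dt * (Gamma @ Y.T @ s)  # gradient adaptation law
    return tau, theta_hat
```

Note that only body-frame velocities and the control input enter the law, consistent with the sensing requirements stated in the abstract.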
|
| |
| 15:00-16:30, Paper WeI2I.20 | Add to My Program |
| Viscoelasticity-Based Mechanistic Modeling and Control of Bending Pneumatic Muscles |
|
| Zhao, Zishuo | Southeast University |
| Xu, Baoguo | Southeast University |
| Wang, Jiajin | Southeast University |
| Lai, Jianwei | Southeast University |
| Wang, Yifei | Southeast University |
| Li, Huijun | Southeast University |
| Song, Aiguo | Southeast University |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Sensors and Actuators
Abstract: Predicting the kinematics of bending pneumatic muscles (BPMs) remains challenging due to the necessity for models that effectively address the pronounced hysteresis and creep inherent in soft materials. While prior research has predominantly focused on phenomenological and data-driven modeling approaches, this study introduces a viscoelasticity-based mechanistic model (VBMM) and a feedforward-feedback hybrid control system tailored for BPMs. First, the VBMM is developed by leveraging the principles of viscoelasticity—a common property of soft materials and a mechanistic driver of hysteresis and creep. Second, we address the computational challenge arising from the history-dependent viscoelastic response of BPMs, where the current state depends on the cumulative stress-strain history. Conventional methods incur escalating computational costs over time, rendering real-time control impractical. To resolve this, we propose a sliding window-based long-term prediction mechanism (long-term VBMM) that maintains model accuracy while significantly reducing computational overhead. Finally, a hybrid control system integrating the long-term VBMM as a feedforward compensator with feedback correction is designed to achieve precise BPM motion tracking. Experimental validation confirms the VBMM’s superior predictive accuracy (error < 3.69%) and demonstrates the control system’s effectiveness.
|
| |
| 15:00-16:30, Paper WeI2I.21 | Add to My Program |
| Spatio-Temporal Motion Retargeting for Quadruped Robots |
|
| Yoon, Taerim | Korea University |
| Kang, Dongho | Robotics and AI Institute |
| Kim, Seungmin | Korea University |
| Cheng, Jin | ETH Zurich |
| Ahn, Min Sung | University of California, Los Angeles |
| Coros, Stelian | ETH Zurich |
| Choi, Sungjoon | Korea University |
Keywords: Legged Robots, Learning from Demonstration, Motion Control, Motion Retargeting
Abstract: This work presents a motion retargeting approach for legged robots, aimed at transferring dynamic and agile movements from source motions to robots. In particular, we guide the imitation learning procedure by transferring motions from source to target, effectively bridging morphological disparities while ensuring the physical feasibility of the target system. In the first stage, we focus on motion retargeting at the kinematic level by generating kinematically feasible whole-body motions from keypoint trajectories. Following this, we refine the motion at the dynamic level by adjusting it in the temporal domain while adhering to physical constraints. This process facilitates policy training via reinforcement learning, enabling precise and robust motion tracking. We demonstrate that our approach successfully transforms noisy motion sources, such as hand-held camera videos, into robot-specific motions that align with the morphology and physical properties of the target robots. Moreover, we demonstrate terrain-aware motion retargeting to perform a backflip on top of a box. We successfully deployed these skills to four robots with different dimensions and physical properties in the real world through hardware experiments.
|
| |
| 15:00-16:30, Paper WeI2I.22 | Add to My Program |
| A Dual-Adhesion-Enhanced Soft Gripper with Microwedge Adhesives and SMA-Driven Microspines |
|
| Wang, Chang | Beihang University |
| Zi, Peijin | Beihang University |
| Luo, Yang | Beihang University |
| Song, Bochao | Beihang University |
| Zhang, Tao | Beihang University |
| Xu, Kun | Beihang University |
| Ding, Xilun | Beijing University of Aeronautics and Astronautics |
Keywords: Grippers and Other End-Effectors, Soft Robot Materials and Design, Grasping
Abstract: Soft grippers are prized for their adaptability and safety, but their inherent compliance often leads to grasp failures under heavy loads. Most adhesion-enhanced grippers rely on a single adhesion strategy tailored to either smooth or rough surfaces. Lizards, by contrast, navigate unstructured environments effectively by switching attachment strategies according to surface conditions. Inspired by the hybrid adhesion strategies of geckos and chameleons, this study presents a bio-inspired soft gripper that integrates microwedge dry adhesives with SMA-driven microspines. The microwedge adhesives provide controllable adhesion on smooth surfaces, while the SMA-driven microspines extend to attach to rough surfaces and retract to avoid interference. An optimization model was developed to determine the optimal link dimensions, improving grasping performance in terms of force and radius. Experimental results on various surfaces validate the gripper's effectiveness.
|
| |
| 15:00-16:30, Paper WeI2I.23 | Add to My Program |
| Autonomous Drone-Ground Robot Alignment through Ground Robot Visual Servo Control with Drone Detection and Tilt Correction |
|
| Dominguez, Sean Clark | Mindanao State University - Iligan Institute of Technology |
| Pao, Jeanette | De La Salle University - Manila |
| Paradela, Immanuel | Mindanao State University - Iligan Institute of Technology |
| Bolaybolay, John Mel | Mindanao State University - Iligan Institute of Technology |
| Aleluya, Earl Ryan | Mindanao State University - Iligan Institute of Technology |
| Alagon, Francis Jann | Mindanao State University - Iligan Institute of Technology |
| Guirnaldo, Sherwin | Mindanao State University - Iligan Institute of Technology |
| Salaan, Carl John | Mindanao State University - Iligan Institute of Technology |
| Ohno, Kazunori | Tohoku University |
| Okada, Yoshito | Tohoku University |
| Bandala, Argel | De La Salle University |
| Shamsudin, Abu Ubaidah | Universiti Tun Hussein Onn Malaysia |
Keywords: Visual Servoing, Vision-Based Navigation, Field Robots
Abstract: Retrieving ground robots from dangerous environments after their operation is a challenging task that poses risks to personnel. Researchers often employ drones for retrieval, which makes operations safer. However, this setup requires an accurate method that guarantees drone and ground robot alignment despite inaccuracies in standard GPS devices, drone drift, and wind gusts. Hence, this research article introduces simultaneous object detection and tilt correction as part of visual servoing to achieve precise drone-rover alignment. Drone detection using YOLOv8 and a tilt correction algorithm were integrated into the proposed visual servo control of the ground robot. The study collected 3024 images as a dataset for drone detection. The experimental results show that the trained instance segmentation model detected and captured drone objects. The study conducted an initial test of visual servo control of the ground robot on various surface terrains, with the largest alignment error occurring on rough surfaces. Furthermore, the study conducted a real-world drone-ground robot alignment test in an outdoor field setting. The alignment between the drone and the ground robot produced a maximum alignment error of 20.3 cm, below the threshold error. The open field experiments verified the effectiveness of the ground robot's visual servo control with an actual drone operation.
|
| |
| 15:00-16:30, Paper WeI2I.24 | Add to My Program |
| Systematic Design of the Time-Independent and Computable Controller Based on Zero-Division-Avoidable Smoother for a Desired Orbit in Phase Space |
|
| Masuya, Ken | University of Miyazaki |
Keywords: Motion Control, Physical Human-Robot Interaction
Abstract: This letter proposes a method to systematically design a time-independent controller for a desired orbit in phase space. A time-independent controller is essential in robots that physically interact with humans or the environment. An approach to designing such a controller is based on the virtual dynamics of the desired orbit (VDDO), in which the desired orbit is assumed as a constraint. However, depending on the desired orbit, zero-division happens, and then the computation of control input breaks down. To address this issue, a zero-division-avoidable smoother, which functions as a low-pass filter and maintains computability even when the computation includes zero-division, is applied to compute the controller input based on the VDDO. This application establishes a systematic design of a VDDO-based controller that avoids zero-division. We investigated the performance of the proposed controller via experiments and simulations for three given orbits: a unit circle, super-ellipse, and spiric section. Results showed that the proposed time-independent controller can avoid zero-division while approaching the desired orbits. Furthermore, an experiment in which a human forces a robot to stop showed that the robot could restart from an unfavorable state and approach the desired orbits once more.
|
| |
| 15:00-16:30, Paper WeI2I.25 | Add to My Program |
| DINO-VO: A Feature-Based Visual Odometry Leveraging a Visual Foundation Model |
|
| Azhari, Maulana Bisyir | Korea Advanced Institute of Science and Technology |
| Shim, David Hyunchul | KAIST |
Keywords: Deep Learning Methods, Localization
Abstract: Learning-based monocular visual odometry (VO) poses robustness, generalization, and efficiency challenges in robotics. Recent advances in visual foundation models, such as DINOv2, have improved robustness and generalization in various vision tasks, yet their integration in VO remains limited due to coarse feature granularity. In this paper, we present DINO-VO, a feature-based VO system leveraging the DINOv2 visual foundation model for sparse feature matching. To address the integration challenge, we propose a salient keypoint detector tailored to DINOv2's coarse features. Furthermore, we complement DINOv2's robust semantic features with fine-grained geometric features, resulting in more localizable representations. Finally, a transformer-based matcher and a differentiable pose estimation layer enable precise camera motion estimation by learning good matches. Against prior detector-descriptor networks like SuperPoint, DINO-VO demonstrates greater robustness in challenging environments. Furthermore, we show superior accuracy and generalization of the proposed feature descriptors against standalone DINOv2 coarse features. DINO-VO outperforms prior frame-to-frame VO methods on the TartanAir and KITTI datasets and is competitive on the EuRoC dataset, while running efficiently at 72 FPS with less than 1 GB of memory usage on a single GPU. Moreover, it performs competitively against visual SLAM systems in outdoor driving scenarios, showcasing its generalization capabilities.
|
| |
| 15:00-16:30, Paper WeI2I.26 | Add to My Program |
| Orthogonal Pulse-Width-Modulation for Combined Electromagnetic Actuation and Localization |
|
| von Arx, Denis | ETH Zurich |
| Nelson, Bradley J. | ETH Zurich |
| Boehler, Quentin | ETH Zurich |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Localization
Abstract: Electromagnetic Navigation Systems can be used to remotely guide medical devices such as magnetic catheters or guidewires, holding potential in a variety of minimally invasive surgical applications. This paper introduces a method to simultaneously actuate and localize a tethered magnetic device with embedded sensor pickup coils using a single system. Six-degree-of-freedom localization is achieved by driving the electromagnets of the Electromagnetic Navigation System with mutually orthogonal pulse-width-modulated voltages of different frequencies. The method is demonstrated using a human-scale system composed of three electromagnets to actuate and localize a magnetic catheter prototype with pickup coils embedded at its tip. In this case, the pose is estimated at a rate of 77 Hz, with a typical mean accuracy below 2 mm in position and 2 degrees in orientation.
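The frequency separation that allows one set of electromagnets to simultaneously actuate and localize can be illustrated with lock-in demodulation of the pickup-coil signal: each electromagnet is driven at its own frequency, and its induced contribution is recovered by correlating against that frequency. A simplified sketch of the principle (the actual system drives PWM voltages, which this idealizes as sinusoids):

```python
import numpy as np

def demodulate_pickup(signal, freqs, fs):
    """Separate each electromagnet's contribution to one pickup-coil signal
    by lock-in demodulation at its assigned drive frequency.

    signal: (N,) sampled coil voltage
    freqs : drive frequencies in Hz, mutually orthogonal over the window
    fs    : sampling rate in Hz
    """
    t = np.arange(len(signal)) / fs
    amplitudes = []
    for f in freqs:
        i_comp = 2 * np.mean(signal * np.cos(2 * np.pi * f * t))
        q_comp = 2 * np.mean(signal * np.sin(2 * np.pi * f * t))
        amplitudes.append(np.hypot(i_comp, q_comp))
    return np.array(amplitudes)   # one induced amplitude per electromagnet
```

The recovered per-electromagnet amplitudes are what a pose estimator can then fit against a field model to localize the sensor.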
|
| |
| 15:00-16:30, Paper WeI2I.27 | Add to My Program |
| From Movement Primitives to Distance Fields to Dynamical Systems |
|
| Li, Yiming | Idiap Research Institute, EPFL |
| Calinon, Sylvain | Idiap Research Institute |
Keywords: Learning from Demonstration, Imitation Learning
Abstract: Developing autonomous robots capable of learning and reproducing complex motions from demonstrations remains a fundamental challenge in robotics. On the one hand, movement primitives (MPs) provide a compact and modular representation of continuous trajectories. On the other hand, autonomous systems provide control policies that are time-independent. We propose in this paper a simple and flexible approach that gathers the advantages of both representations by transforming MPs into autonomous systems. The key idea is to transform the explicit representation of a trajectory into an implicit shape encoded as a distance field. This conversion from a time-dependent motion to a spatial representation enables the definition of an autonomous dynamical system with modular reactions to perturbation. Asymptotic stability guarantees are provided by using Bernstein basis functions in the MPs, representing trajectories as concatenated quadratic Bézier curves, which provide an analytical method for computing distance fields. This approach bridges conventional MPs with distance fields, ensuring smooth and precise motion encoding while maintaining a continuous spatial representation. By simply leveraging the analytic gradients of the curve and its distance field, a stable dynamical system can be computed to reproduce the demonstrated trajectories while handling perturbations, without requiring a model of the dynamical system to be estimated. Numerical simulations and real-world robotic experiments validate our method's ability to encode complex motion patterns while ensuring trajectory stability, together with the flexibility of designing the desired reaction to perturbations.
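The core conversion, turning a trajectory of concatenated quadratic Bézier curves into a distance field whose gradient defines an autonomous system, can be approximated numerically as below. The paper obtains the distance field and its gradients analytically from the Bernstein form; dense sampling and finite differences here are a stand-in for illustration:

```python
import numpy as np

def quad_bezier(P0, P1, P2, t):
    """Points on a quadratic Bezier curve at parameters t in [0, 1]."""
    t = np.asarray(t)[:, None]
    return (1 - t)**2 * P0 + 2 * (1 - t) * t * P1 + t**2 * P2

def distance_and_grad(x, segments, n_samples=200, eps=1e-4):
    """Approximate distance from point x to a concatenation of quadratic
    Bezier segments (list of (P0, P1, P2) tuples), plus its gradient."""
    ts = np.linspace(0.0, 1.0, n_samples)
    pts = np.vstack([quad_bezier(P0, P1, P2, ts) for P0, P1, P2 in segments])

    def d(q):
        return np.min(np.linalg.norm(pts - q, axis=1))

    grad = np.array([(d(x + eps * e) - d(x - eps * e)) / (2 * eps)
                     for e in np.eye(len(x))])
    return d(x), grad

# An autonomous system can then flow back onto the demonstrated path
# after a perturbation, e.g. x_dot = -k * grad, with k > 0.
```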
|
| |
| 15:00-16:30, Paper WeI2I.28 | Add to My Program |
| 3D Targeting of a Magnetic Particle in Blood Vessels Using Field-Free Points in an Open-Type Electromagnetic Actuation System |
|
| Yang, Seungun | DGIST |
| Nguyen, Kim Tien | Korean Institute of Medical Microrobotics |
| Kee, Hyeonwoo | DGIST |
| Lee, Hyoryong | DGIST |
| Kim, Jayoung | Korea Institute of Medical Microrobotics |
| Park, Sukho | DGIST |
Keywords: Micro/Nano Robots, Automation at Micro-Nano Scales, Medical Robots and Systems
Abstract: Recent research has increasingly focused on delivering drug-carrying magnetic particles to diseased areas using electromagnetic actuation (EMA) systems. Particularly, in these systems, creating a field-free point (FFP) and using it to steer magnetic particles in the desired direction has attracted significant attention. However, most previous studies use closed-type EMA systems, which, due to their structural characteristics, are difficult to integrate into actual surgical environments and to operate in conjunction with external imaging systems like X-ray. This study addresses these limitations by using an open-type EMA system, which is better suited for surgical integration. However, an open-type EMA system faces issues such as a significant decrease in magnetic force and an anisotropic magnetic field as the distance from the coils increases in the region of interest (ROI). To overcome these challenges, we optimized the open-type EMA system and proposed a suitable FFP generation method. Furthermore, we presented a targeting algorithm for steering a magnetic particle in blood vessels using anisotropic FFP. This proposed open-type EMA system and the control strategy using FFP were validated through multiphysics simulations and phantom experiments, proving the viability of magnetic particle targeting.
|
| |
| 15:00-16:30, Paper WeI2I.29 | Add to My Program |
| Globally Optimal Data-Association-Free Landmark-Based Localization Using Semidefinite Relaxations |
|
| Korotkine, Vassili | McGill University |
| Cohen, Mitchell | McGill University |
| Forbes, James Richard | McGill University |
Keywords: Localization, Optimization and Optimal Control, Probabilistic Inference
Abstract: This paper proposes a semidefinite relaxation for landmark-based localization with unknown data associations in planar environments. The proposed method simultaneously solves for the optimal robot states and data associations in a globally optimal fashion. Relative position measurements to a fixed set of known landmarks are used, but the data association is unknown in that the robot does not know which landmark each measurement is generated from. The relaxation is shown to be tight in a majority of cases for moderate noise levels. The proposed algorithm is compared to local Gauss-Newton baselines initialized at the dead-reckoned trajectory, and is shown to significantly improve convergence to the problem’s global optimum in simulation and experiment. Accompanying software and supplementary material can be found at https://github.com/decargroup/certifiable_uda_loc.
|
| |
| 15:00-16:30, Paper WeI2I.30 | Add to My Program |
| Optimization of Preemptive Impact Mitigation without Prior Collision Testing |
|
| Nakamura, Hayato | Kyushu University |
| Arita, Hikaru | Kyushu University |
| Tokiwa, Shunsuke | Kyushu University |
| Tahara, Kenji | Kyushu University |
Keywords: Force Control, Sensor-based Control, Optimization and Optimal Control
Abstract: Effective impact mitigation strategies are crucial for preventing potential damage to both robotic systems and their operational environments during high-velocity and dynamic maneuvers, as well as during the execution of high-precision tasks. The successful implementation of impact mitigation strategies in real-world applications fundamentally requires appropriate parameter tuning. However, owing to the destructive nature of collisions, heuristic parameter tuning is impractical, as it risks damage to both the robotic system and its operational environment during experimental trials. This study eliminates the need for preliminary collision experiments in parameter optimization by introducing a novel methodology that leverages recent proximity sensor-based preemptive impact mitigation strategies that reframe impact mitigation as a geometric rather than physical problem. The key innovation of this work lies in the reformulation of the proximity sensor output to enable both the analytical derivation of preemptive motion trajectories and the direct application of standard optimization solvers. The effectiveness of the proposed methodology is validated through numerical simulations and two different experimental configurations. By eliminating the need for collision trials, robotic systems can safely execute potentially destructive tasks that would otherwise result in system damage without proper impact mitigation.
|
| |
| 15:00-16:30, Paper WeI2I.31 | Add to My Program |
| Eva-Tracker: ESDF-Update-Free, Visibility-Aware Planning with Target Reacquisition for Robust Aerial Tracking |
|
| Lin, Yue | Dalian University of Technology |
| Liu, Yang | Dalian University of Technology |
| Wang, Dong | Dalian University of Technology |
| Lu, Huchuan | Dalian University of Technology |
Keywords: Aerial Systems: Perception and Autonomy, Motion and Path Planning, Visual Tracking
Abstract: The Euclidean Signed Distance Field (ESDF) is widely used in visibility evaluation to prevent occlusions and collisions during tracking. However, frequent ESDF updates introduce considerable computational overhead. To address this issue, we propose Eva-Tracker, a visibility-aware trajectory planning framework for aerial tracking that eliminates ESDF updates and incorporates a recovery-capable path generation method for target reacquisition. First, we design a target trajectory prediction method and a visibility-aware initial path generation algorithm that maintain an appropriate observation distance, avoid occlusions, and enable rapid replanning to reacquire the target when it is lost. Then, we propose the Field of View ESDF (FoV-ESDF), a precomputed ESDF tailored to the tracker's field of view, enabling rapid visibility evaluation without requiring updates. Finally, we optimize the trajectory using differentiable FoV-ESDF-based objectives to ensure continuous visibility throughout the tracking process. Extensive simulations and real-world experiments demonstrate that our approach delivers more robust tracking results with lower computational effort than existing state-of-the-art methods.
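The FoV-ESDF idea, precomputing a distance field over the sensor frustum (which is fixed in the camera frame) so that no online ESDF updates are needed, can be sketched with a standard distance transform. A simplified 2D illustration under assumed grid parameters:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def precompute_fov_esdf(fov_mask, resolution):
    """Precompute a signed distance field over the camera's field of view.

    fov_mask  : boolean grid in the camera frame, True inside the FoV
    resolution: grid cell size in meters

    Because the FoV shape never changes in the camera frame, this field
    is built once offline; at runtime a candidate target position is
    transformed into the camera frame and simply looked up, avoiding any
    online ESDF updates.
    """
    inside = distance_transform_edt(fov_mask) * resolution
    outside = distance_transform_edt(~fov_mask) * resolution
    return inside - outside   # positive inside the FoV, negative outside
```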
|
| |
| 15:00-16:30, Paper WeI2I.32 | Add to My Program |
| MonoEM: Object-Level Monocular 3D Object Detection Based on Equirectangular Map under Inclement Weather |
|
| Yoon, Jae Hyun | Chonnam National University |
| Cho, Yeon Woo | Chonnam National University |
| Yoo, Seok Bong | Chonnam National University |
Keywords: Object Detection, Segmentation and Categorization, Computer Vision for Automation, AI-Based Methods
Abstract: Monocular 3D object detection has received growing recognition in contemporary research due to its reduced hardware complexity and lower deployment cost compared to multi-sensor-based approaches. Prior research has primarily addressed ideal environmental settings, neglecting the influence of diverse weather scenarios, including rain, snow, and fog, that significantly hinder detection reliability. To enhance robustness under inclement weather conditions, we introduce MonoEM, a monocular 3D object detection framework that leverages object-level image representations and equirectangular maps. Starting from 2D detection results, MonoEM derives equirectangular maps through an equirectangular object-level reconstruction. Furthermore, MonoEM suppresses inclement weather noise in object-level images through image restoration. Subsequently, MonoEM fuses the reconstructed equirectangular map with the restored image and performs 3D bounding box prediction using a visual-range fusion detector. The integration of 2D-3D box alignment loss between 2D and 3D bounding boxes improves the geometric alignment and 3D object detection accuracy. Experimental results across various inclement weather conditions validate the notable accuracy and robustness of MonoEM compared to existing monocular 3D baselines. The source code is provided at https://anonymous.4open.science/r/MonoEM-00AC.
|
| |
| 15:00-16:30, Paper WeI2I.33 | Add to My Program |
| DynOPETs: A Versatile Benchmark for Dynamic Object Pose Estimation and Tracking in Moving Camera Scenarios |
|
| Meng, Xiangting | ShanghaiTech University |
| Yang, Jiaqi | ShanghaiTech University |
| Chen, Mingshu | Fudan University |
| Yan, Chenxin | Shanghaitech University |
| Shi, Yujiao | ShanghaiTech University |
| Ding, Wenchao | Fudan University |
| Kneip, Laurent | ShanghaiTech University |
Keywords: Data Sets for Robotic Vision, Perception for Grasping and Manipulation, Object Detection, Segmentation and Categorization
Abstract: In the realm of object pose estimation, scenarios involving both dynamic objects and moving cameras are prevalent. However, the scarcity of corresponding real-world datasets significantly hinders the development and evaluation of robust pose estimation models. This is largely attributed to the inherent challenges in accurately annotating object poses in dynamic scenes captured by moving cameras. To bridge this gap, this paper presents a novel dataset DynOPETs and a dedicated data acquisition and annotation pipeline tailored for object pose estimation and tracking in such unconstrained environments. Our efficient annotation method innovatively integrates pose estimation and pose tracking techniques to generate pseudo-labels, which are subsequently refined through pose graph optimization. The resulting dataset offers accurate pose annotations for dynamic objects observed from moving cameras. To validate the effectiveness and value of our dataset, we perform comprehensive evaluations using 18 state-of-the-art methods, demonstrating its potential to accelerate research in this challenging domain. The dataset will be made publicly available to facilitate further exploration and advancement in the field.
|
| |
| 15:00-16:30, Paper WeI2I.34 | Add to My Program |
| Fabrication and Characterization of Additively Manufactured Stretchable Strain Sensors towards the Shape Sensing of Continuum Robots |
|
| Moyer, Daniel | Worcester Polytechnic Institute |
| Wang, Wenpeng | Worcester Polytechnic Institute |
| Karschner, Logan | Worcester Polytechnic Institute |
| Fichera, Loris | Worcester Polytechnic Institute |
| Rao, Pratap | Worcester Polytechnic Institute |
Keywords: Soft Sensors and Actuators, Surgical Robotics: Steerable Catheters/Needles
Abstract: This letter describes the manufacturing and experimental characterization of novel stretchable strain sensors for continuum robots. The overarching goal of this research is to provide a new solution for the shape sensing of these devices. The sensors are fabricated via direct ink writing, an extrusion-based additive manufacturing technique. Electrically conductive material (i.e., the ink) is printed into traces whose electrical resistance varies in response to mechanical deformation. The principle of operation of stretchable strain sensors is analogous to that of conventional strain gauges, but with a significantly larger operational window thanks to their ability to withstand larger strain. Among the different conductive materials considered for this study, we opted to fabricate the sensors with a high-viscosity eutectic gallium-indium ink, which in initial testing exhibited high linearity (R² ≈ 0.99), a gauge factor ≈ 1, and negligible drift. Benefits of the proposed sensors include (i) ease of fabrication, as they can be conveniently printed in a matter of minutes; (ii) ease of installation, as they can simply be glued to the outside body of a robot; and (iii) ease of miniaturization, which enables integration into millimeter-sized continuum robots.
|
| |
| 15:00-16:30, Paper WeI2I.35 | Add to My Program |
| Dynamic Modulation of Emotional Expressions in Social Robots: Effects on Liveliness and Naturalness |
|
| Park, Haeun | Pohang University of Science and Technology |
| Hwang, Sun Jun | UNIST (Ulsan National Institute of Science and Technology) |
| Kim, Hyojin | UNIST (Ulsan National Institute of Science and Technology) |
| Lee, Jiyeon | UNIST (Ulsan National Institute of Science and Technology) |
| Lee, Hui Sung | UNIST (Ulsan National Institute of Science and Technology) |
Keywords: Emotional Robotics, Gesture, Posture and Facial Expressions, Social HRI
Abstract: Humans naturally express emotions with subtle variations, and exaggerated expressions often appear as heightened intensity in facial, bodily, or vocal cues. This paper introduces a method for exaggerating robotic emotional expressions by dynamically adjusting intensity within an emotion dynamics model. By systematically manipulating the damping ratio, we generated five distinct intensity levels for each emotion, thereby producing emotional expressions that exhibited different degrees of overshoot. A user study revealed that liveliness ratings for surprise increased linearly with intensity, suggesting that exaggerated, high-energy dynamics are particularly effective for conveying surprise. In contrast, other emotions exhibited optimal points at intermediate levels, indicating that excessive exaggeration can reduce perceived naturalness. These findings highlight the need for emotion-specific and user-specific calibration of expression intensity, supporting more nuanced and engaging human-robot interactions.
|
| |
| 15:00-16:30, Paper WeI2I.36 | Add to My Program |
| AnyGeometry-CBS: Any Geometry Conflict-Based Search for Multi-Agent Path Finding |
|
| Li, Yichen | Nankai University |
| Zhang, Xuebo | Nankai University |
| Yu, Jingjin | Rutgers University |
| Wang, Yaonan | Hunan University |
Keywords: Motion and Path Planning, Path Planning for Multiple Mobile Robots or Agents
Abstract: The Multi-Agent Path Finding (MAPF) problem seeks conflict-free paths for multiple agents. However, most existing MAPF methods simplify agents to points or uniform circles, a model that fails when agents have diverse geometries or carry oversized loads. This oversimplification can lead to undetected collisions or the failure to find feasible paths. To address this, we propose AnyGeometry-CBS (AG-CBS), a novel extension of Conflict-Based Search (CBS) that accommodates agents of arbitrary, non-convex shapes. AG-CBS represents each agent's geometry as a set of grid cells and introduces enriched conflict definitions to handle complex interactions. To improve search efficiency, we develop a Multi-Constraint (MC) technique and a Shape Heuristic (SH) for suboptimal variants. Experimental results demonstrate that our method reduces runtime by up to 84.43% against optimal baselines and 88.24% against bounded-suboptimal ones, providing a general and effective solution to complex MAPF problems.
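Representing an arbitrary, possibly non-convex footprint as a set of grid cells reduces conflict detection to a set-intersection test. A minimal sketch covering vertex conflicts only (the paper's enriched conflict definitions handle richer interactions):

```python
def footprint_cells(shape_cells, position):
    """Grid cells occupied by an agent with an arbitrary (possibly
    non-convex) shape, given as cell offsets from its reference cell."""
    px, py = position
    return {(px + dx, py + dy) for dx, dy in shape_cells}

def vertex_conflict(shape_i, pos_i, shape_j, pos_j):
    """Two arbitrarily shaped agents conflict iff their cell sets overlap."""
    return bool(footprint_cells(shape_i, pos_i) & footprint_cells(shape_j, pos_j))

# Example: an L-shaped agent vs. a 1x2 agent
L_shape = {(0, 0), (1, 0), (0, 1)}
bar = {(0, 0), (0, 1)}
print(vertex_conflict(L_shape, (0, 0), bar, (1, 0)))  # True: both cover (1, 0)
```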
|
| |
| 15:00-16:30, Paper WeI2I.37 | Add to My Program |
| OWOD-FSL: Open-World Object Detection Via Few-Shot Learning and Dynamic Prototypes |
|
| Li, Zhiwei | Beijing University of Chemical Technology |
| Zhang, Zhiyu | Beijing University of Chemical Technology |
| Zhou, Yang | Beijing University of Chemical Technology |
| Li, Jianping | Dalian University of Technology |
| Shen, Tianyu | Beijing University of Chemical Technology |
| Wang, Li | Beijing Institute of Technology |
| Lu, Fengli | Guangxi Normal University |
| Liu, Huaping | Tsinghua University |
| Wang, Kunfeng | Beijing University of Chemical Technology |
Keywords: Incremental Learning, Continual Learning, Deep Learning for Visual Perception
Abstract: Open-World Object Detection (OWOD) presents a critical challenge for modern computer vision systems: detecting known classes, identifying unknown objects, and incrementally learning to recognize them over time. However, current approaches have two fundamental limitations: (1) the fixed-dimensional classification head inherently restricts incremental learning capabilities, and (2) heavy reliance on extensive annotated data hinders adaptability in few-shot settings. To address these limitations, we propose OWOD-FSL, which integrates a dynamic prototype classification head with few-shot learning. At the core of our approach are two major contributions: a dynamic prototype classification head that supplants traditional fixed classifiers with an expandable prototype classifier for unlimited class expansion, and a biologically-inspired bi-phase learning strategy that integrates offline prototype generation with incremental learning refinement. Comprehensive experiments on the M-OWODB benchmark show that OWOD-FSL achieves state-of-the-art performance in both unknown class recall (U-Recall) and known class mAP, significantly outperforming existing methods.
|
| |
| 15:00-16:30, Paper WeI2I.38 | Add to My Program |
| Denoising Particle Filters: Learning State Estimation with Single-Step Objectives |
|
| Röstel, Lennart | Technical University of Munich |
| Bäuml, Berthold | Technical University of Munich |
Keywords: Probability and Statistical Methods, Deep Learning Methods, Deep Learning in Grasping and Manipulation
Abstract: Learning-based methods commonly treat state estimation in robotics as a sequence modeling problem. While this paradigm can be effective at maximizing end-to-end performance, models are often difficult to interpret and expensive to train, since training requires unrolling sequences of predictions in time. As an alternative to end-to-end trained state estimation, we propose a novel particle filtering algorithm in which models are trained from individual state transitions, fully exploiting the Markov property in robotic systems. In this framework, measurement models are learned implicitly by minimizing a denoising score matching objective. At inference, the learned denoiser is used alongside a (learned) dynamics model to approximately solve the Bayesian filtering equation at each time step, effectively guiding predicted states toward the data manifold informed by measurements. We evaluate the proposed method on challenging robotic state estimation tasks in simulation, demonstrating competitive performance compared to tuned end-to-end trained baselines. Importantly, our method offers the desirable composability of classical filtering algorithms, allowing prior information and external sensor models to be incorporated without retraining.
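For readers unfamiliar with the objective, a minimal sketch of a generic denoising score matching loss on single states follows; the network, state dimension, and noise scale are stand-ins, and the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn

def dsm_loss(score_net: nn.Module, x_clean: torch.Tensor, sigma: float = 0.1):
    """Generic denoising score matching on individual states.

    Perturb clean states with Gaussian noise; the target score of the
    perturbed sample is -(x_noisy - x_clean) / sigma^2 = -noise / sigma.
    The expression below is the usual sigma^2-weighted squared error.
    """
    noise = torch.randn_like(x_clean)
    x_noisy = x_clean + sigma * noise
    score = score_net(x_noisy)
    return (sigma * score + noise).pow(2).mean()

# Usage with a stand-in network for a 12-dimensional state.
net = nn.Sequential(nn.Linear(12, 64), nn.SiLU(), nn.Linear(64, 12))
loss = dsm_loss(net, torch.randn(256, 12))
loss.backward()
```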
|
| |
| 15:00-16:30, Paper WeI2I.39 | Add to My Program |
| GraspClutter6D: A Large-Scale Real-World Dataset for Robust Perception and Grasping in Cluttered Scenes |
|
| Back, Seunghyeok | Korea Institute of Machinery & Materials |
| Lee, Joosoon | Gwangju Institute of Science and Technology |
| Kim, Kangmin | Gwangju Institute of Science and Technology |
| Rho, Heeseon | Gwangju Institute of Science and Technology (GIST) |
| Lee, Geonhyup | Gwangju Institute of Science and Technology |
| Kang, Raeyoung | Gwangju Institute of Science and Technology |
| Lee, Sangbeom | Gwangju Institute of Science and Technology (GIST) |
| Noh, Sangjun | Gwangju Institute of Science and Technology (GIST) |
| Lee, Youngjin | Gwangju Institute of Science and Technology |
| Lee, Taeyeop | KAIST |
| Lee, Kyoobin | Gwangju Institute of Science and Technology |
Keywords: Data Sets for Robotic Vision, Data Sets for Robot Learning, Deep Learning in Grasping and Manipulation
Abstract: Robust grasping in cluttered environments remains an open challenge in robotics. While benchmark datasets have significantly advanced deep learning methods, they mainly focus on simplistic scenes with light occlusion and insufficient diversity, limiting their applicability to practical scenarios. We present GraspClutter6D, a large-scale real-world grasping dataset featuring: (1) 1,000 highly cluttered scenes with dense arrangements (14.1 objects/scene, 62.6% occlusion), (2) comprehensive coverage across 200 objects in 75 environment configurations (bins, shelves, and tables) captured using four RGB-D cameras from multiple viewpoints, and (3) rich annotations including 736K 6D object poses and 9.3B feasible robotic grasps for 52K RGB-D images. We benchmark state-of-the-art segmentation, object pose estimation, and grasping detection methods to provide key insights into challenges in cluttered environments. Additionally, we validate the dataset's effectiveness as a training resource, demonstrating that grasping networks trained on GraspClutter6D significantly outperform those trained on existing datasets in both simulation and real-world experiments. The dataset, toolkit, and annotation tools are publicly available on our project website: https://sites.google.com/view/graspclutter6d.
|
| |
| 15:00-16:30, Paper WeI2I.40 | Add to My Program |
| A Novel Hybrid Ureteroscope Tracking for Robotic-Assisted Retrograde Intrarenal Surgery Via Recognition of Pathway with Lumen Identification |
|
| Han, Jung-Min | KAIST, Roen Surgical, Inc |
| Kwon, Dong-Soo | KAIST, Roen Surgical, Inc |
| Kyung, Ki-Uk | Korea Advanced Institute of Science & Technology (KAIST) |
Keywords: Surgical Robotics: Planning, Medical Robots and Systems, Vision-Based Navigation
Abstract: Surgeon disorientation caused by incorrect recognition of the scope’s position, which often increases procedural time and workload, remains a significant challenge in robotic-assisted retrograde intrarenal surgery (RIRS). This paper introduces a novel hybrid ureteroscope tracking algorithm that integrates low-latency lumen identification with robotic motion data to enhance intrarenal navigation. The system estimates the ureteroscope’s position on the centerline of the kidney by recognizing its pathway. In validation tests using a 3D-printed phantom, the proposed method achieved an average localization success rate of 89.2% for major calyx entry and 84.1% for minor calyx entry, with an average computation time of 0.26 seconds, ensuring low-latency operation. Usability testing with ten novice participants demonstrated a 44.5% reduction in cognitive workload (NASA-TLX), improved task success rates, and reduced manipulation effort. These results indicate that the proposed tracking algorithm significantly enhances ureteroscope navigation, improving efficiency and reducing the surgeon's cognitive load in robotic-assisted RIRS.
|
| |
| 15:00-16:30, Paper WeI2I.41 | Add to My Program |
| Development of Remote Center of Motion Mechanism for Biportal Endoscopic Spine Surgery Robot |
|
| Kim, Chunwoo | Korea Institute of Science and Technology |
| Kim, Chaewon | Korea Institute of Science and Technology, Korea University |
| Choi, Dong-Eun | Asan Medical Center |
Keywords: Medical Robots and Systems, Surgical Robotics: Laparoscopy, Mechanism Design
Abstract: The Remote Center of Motion (RCM) mechanism is widely used in surgical robots for Minimally Invasive Surgery (MIS). For endoscopic spine surgery, the surgical robot requires an RCM mechanism with sufficient stiffness and a compact end effector to manipulate hard tissues while avoiding interference with other instruments. This paper presents a modified belt-driven RCM mechanism designed to meet these specific requirements of endoscopic spine surgery. Gearboxes were incorporated into the belt-driven RCM mechanism to reduce belt tension and the resulting elastic deformation, while the RCM constraint is maintained through a specific relationship between gearbox reduction ratios. The prismatic joint for instrument insertion is relocated to the base to reduce the size of the end effector. A prototype of a surgical robot with the presented RCM mechanism achieved an RCM point accuracy of 0.56 mm, repeatability of 0.019 mm, and stiffness of 11.676, 12.435, and 5.341 N/mm in the X, Y, and Z directions, respectively. Feasibility of the robot was validated through simulated biportal endoscopic spine surgery (BESS) on a spine phantom.
|
| |
| 15:00-16:30, Paper WeI2I.42 | Add to My Program |
| Rollbot: A Spherical Robot Driven by a Single Actuator |
|
| Wang, Jingxian | Northwestern University |
| Rubenstein, Michael | Northwestern University |
Keywords: Mechanism Design, Actuation and Joint Mechanisms, Hardware-Software Integration in Robotics
Abstract: Spherical robots typically require at least two actuators to achieve controlled 2D planar motion. Here we present Rollbot, the first spherical robot capable of controllably maneuvering on a 2D plane with a single actuator, challenging this assumption. Rollbot rolls on the ground in a circular pattern and controls its motion by changing the trajectory's curvature by accelerating and decelerating its single motor and the attached mass according to our derived quasi-stable state dynamics and control laws. We present the theoretical analysis, design, and control of Rollbot, and demonstrate its ability to move in a controllable circular pattern and follow waypoints, validating the efficacy of the proposed theoretical framework.
|
| |
| 15:00-16:30, Paper WeI2I.43 | Add to My Program |
| PO-GVINS: A Tightly Coupled GNSS-Visual-Inertial Navigation Framework Using Pose-Only Representation |
|
| Xu, Zhuo | Wuhan University |
| Zhu, Feng | Wuhan University |
| Zhang, Zihang | Wuhan University |
| Jian, Chang | Wuhan University |
| Lv, Jiarui | Wuhan University |
| Zhang, Yuantai | Mohamed Bin Zayed University of Artificial Intelligence |
| Zhang, Xiaohong | Wuhan University |
Keywords: Localization, Visual-Inertial SLAM, Autonomous Vehicle Navigation
Abstract: Accurate and reliable positioning is essential for perception, decision-making, and other high-level applications in autonomous driving, autonomous aerial vehicles, and intelligent robotics. Due to the inherent limitations of standalone sensors, integrating heterogeneous sensors with complementary capabilities is an effective approach to achieving this goal. The visual-inertial navigation system (VINS) fuses visual cameras and inertial measurement units (IMUs) to explore unknown environments. It requires a priori knowledge of 3D features and jointly estimates camera poses and feature positions, which inevitably introduces feature linearization errors. Meanwhile, the dimensionality of the system state increases with abundant textures, degrading real-time performance. To eliminate accumulated errors from VINS, frameworks further fuse measurements from the Global Navigation Satellite System (GNSS), but still suffer from similar limitations. To address the aforementioned issues, we propose a filtering-based, tightly coupled GNSS-visual-inertial positioning framework with a pose-only formulation applied to VINS, termed PO-GVINS. We first apply the PO formulation to our VINS (PO-VINS). GNSS raw measurements are subsequently incorporated, with integer ambiguities resolved, to achieve accurate and drift-free state estimation. Extensive experiments demonstrate that the proposed PO-VINS significantly outperforms the multi-state constraint Kalman filter (MSCKF) and achieves accuracy comparable to that of optimization-based VINS. By further incorporating GNSS measurements, PO-GVINS achieves accurate, drift-free state estimation, making it a robust solution for positioning in challenging environments.
|
| |
| 15:00-16:30, Paper WeI2I.44 | Add to My Program |
| No Need to Look! Locating and Grasping Objects by a Robot Arm Covered with Sensitive Skin |
|
| Bartůněk, Karel | Czech Technical University in Prague, Faculty of Electrical Engineering |
| Rustler, Lukas | Czech Technical University in Prague, Faculty of Electrical Engineering |
| Hoffmann, Matej | Czech Technical University in Prague, Faculty of Electrical Engineering |
|
|
| |
| 15:00-16:30, Paper WeI2I.45 | Add to My Program |
| Feasibility-Guided Planning Over Multi-Specialized Locomotion Policies |
|
| Luo, Ying-Sheng | Inventec Corporation |
| Wang, Lu-Ching | Inventec Corporation |
| Mandala, Hanjaya | Inventec Corporation |
| Chou, Yu-Lun | Inventec Corporation |
| Galelli Christmann, Guilherme Henrique | Inventec Corporation |
| Chen, Yu-Chung | National Taiwan University |
| Chan, Yung-Shun | National Taiwan University |
| Lee, Chun-Yi | National Taiwan University |
| Chen, Wei-Chao | Inventec Corporation |
Keywords: Motion and Path Planning, Reinforcement Learning, Legged Robots
Abstract: Planning over unstructured terrain presents a significant challenge in the field of legged robotics. Although recent works in reinforcement learning have yielded various locomotion strategies, planning over multiple experts remains a complex issue. Existing approaches encounter several constraints: traditional planners are unable to integrate skill-specific policies, whereas hierarchical learning frameworks often lose interpretability and require retraining whenever new policies are added. In this paper, we propose a feasibility-guided planning framework that successfully incorporates multiple terrain-specific policies. Each policy is paired with a Feasibility-Net, which learns to predict feasibility tensors based on the local elevation maps and task vectors. This integration allows classical planning algorithms to derive optimal paths. Through both simulated and real-world experiments, we demonstrate that our method efficiently generates reliable plans across diverse and challenging terrains, while consistently aligning with the capabilities of the underlying policies.
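One plausible way to feed such feasibility tensors into a classical planner, sketched under the assumptions of a grid world and a max-over-policies reduction (both illustrative, not the paper's design):

```python
import heapq
import numpy as np

def feasibility_guided_plan(feas, start, goal, eps=1e-3):
    """Dijkstra over a grid whose traversal cost penalizes low feasibility.

    `feas` is a hypothetical (n_policies, H, W) tensor of per-policy
    feasibility in [0, 1]; taking the max over policies lets the planner
    route through terrain that at least one specialized policy can handle.
    """
    best = feas.max(axis=0)
    H, W = best.shape
    cost = 1.0 / np.maximum(best, eps)     # infeasible cells become expensive
    dist, prev = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, np.inf):
            continue                        # stale queue entry
        x, y = u
        for v in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= v[0] < H and 0 <= v[1] < W:
                nd = d + cost[v]
                if nd < dist.get(v, np.inf):
                    dist[v], prev[v] = nd, u
                    heapq.heappush(pq, (nd, v))
    path, node = [goal], goal               # walk the predecessor chain back
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

feas = np.random.rand(3, 20, 20)            # three stand-in specialized policies
path = feasibility_guided_plan(feas, (0, 0), (19, 19))
```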
|
| |
| 15:00-16:30, Paper WeI2I.46 | Add to My Program |
| Time-Aware Assistive Navigation |
|
| Shangguan, Zhongkai | Boston University |
| Kuribayashi, Masaki | Waseda University |
| Ohn-Bar, Eshed | Boston University |
Keywords: Intelligent Transportation Systems, Robotics and Automation in Agriculture and Forestry, Robotics and Automation in Construction
Abstract: Can interactive vision-and-language agents learn not just what to say but also when to say it? Current language models rarely reason about whether and when to deliver a real-time response to a user. However, providing accurate and timely support for human decision-making, such as when guiding visually impaired individuals through urban environments, requires careful real-time responsiveness — poorly timed responses can distract users or add unnecessary cognitive load. As a machine intelligence challenge for Multimodal Large Language Model (MLLM)-based agents, we introduce a large-scale multimodal benchmark for an egocentric, assistive navigation task in complex outdoor environments. Using this benchmark, we uncover a fundamental limitation of off-the-shelf MLLMs in delivering safe and time-sensitive navigation instructions, even with model finetuning on substantial amounts of data. We then demonstrate that a simple yet effective modification of the model, including direct supervision to predict the underlying reason for each instruction, yields significant performance gains across open-loop, closed-loop, and sim-to-real generalization settings. Nevertheless, our analysis highlights persistent challenges in temporal reasoning, safety-critical object awareness, and relational and distance understanding. To advance the development of scalable assistive agents, we will release our simulation, benchmark, and code (available at project website: https://timeli-icra.github.io/).
|
| |
| 15:00-16:30, Paper WeI2I.47 | Add to My Program |
| Few-Shot Transfer of Tool-Use Skills Using Human Demonstrations with Proximity and Tactile Sensing |
|
| Aoyama, Marina Y. | The University of Edinburgh |
| Vijayakumar, Sethu | University of Edinburgh |
| Narita, Tetsuya | Sony Group Corporation |
Keywords: Force and Tactile Sensing, Sensorimotor Learning, Learning from Demonstration
Abstract: Tools extend the manipulation abilities of robots, much like they do for humans. Despite human expertise in tool manipulation, teaching robots these skills faces challenges. The complexity arises from the interplay of two simultaneous points of contact: one between the robot and the tool, and another between the tool and the environment. Tactile and proximity sensors play a crucial role in identifying these complex contacts. However, learning tool manipulation with a small amount of real-world data using these sensors remains challenging due to the large sim-to-real gap and sensor noise. To address this, we propose a few-shot tool-use skill transfer framework using multimodal sensing. The framework involves pre-training the base policy to capture contact states common in tool-use skills in simulation and fine-tuning it with human demonstrations collected in the real-world target domain to bridge the domain gap. We validate that this framework enables teaching surface-following tasks using tools with diverse physical and geometric properties with a small number of demonstrations on the Franka Emika robot arm. Our analysis suggests that the robot acquires new tool-use skills by transferring the ability to recognise tool-environment contact relationships from pre-trained to fine-tuned policies. Additionally, integrating proximity and tactile sensors enhances the identification of contact states and environmental geometry.
|
| |
| 15:00-16:30, Paper WeI2I.48 | Add to My Program |
| How to Shake Trees with Aerial Manipulators? A Theoretical and Experimental Study |
|
| Gonzalez-Morgado, Antonio | Universidad De Sevilla |
| Cuniato, Eugenio | ETH Zurich |
| Heredia, Guillermo | University of Seville |
| Ollero, Anibal | AICIA. G41099946 |
| Siegwart, Roland | ETH Zurich |
| Tognon, Marco | Inria Rennes |
Keywords: Aerial Systems: Mechanics and Control, Force Control, Aerial Systems: Applications
Abstract: Aerial manipulators are advancing beyond traditional inspection roles to enable complex interactions with flexible structures. Applications such as structural health monitoring, and especially agricultural tasks like fruit harvesting or environmental monitoring, require inducing controlled vibrations into flexible elements. However, current solutions for controlled shaking of trees with aerial manipulators are limited to push and pull forces applied through translational movements, without exploiting the full capabilities of aerial platforms. This paper introduces a controlled shaking strategy that enables interaction with trees using both linear movements generated by forces (translation strategy) and rotational movements generated by torques (rotation strategy), thus exploiting the different interaction capabilities of the platform. These two strategies open a previously unexplored question: which strategy is more effective given a specific interaction point? To address this, the two interaction strategies are integrated with the Rayleigh-Ritz model of the tree, obtaining the closed-loop dynamics of the system during the vibration. These closed-loop dynamics are then analyzed for the two shaking strategies, deriving which one is better for achieving higher oscillation amplitudes or frequencies. This analysis shows that, for a given interaction point on the tree trunk, this decision depends only on the platform's physical characteristics, such as mass and inertia. Finally, the theoretical analysis is experimentally validated with a hand-made bamboo tree and a fully-actuated platform through indoor flights.
|
| |
| 15:00-16:30, Paper WeI2I.49 | Add to My Program |
| Event-Driven MARL for Collaborative Swarm Confrontation in Asynchronous Environments |
|
| Wu, Qizhen | Beihang University |
| Chen, Lei | Beijing Institute of Technology |
| Liu, Kexin | Beihang University |
| Lv, Jinhu | Beihang University |
Keywords: Task and Motion Planning, Reinforcement Learning, Swarm Robotics
Abstract: Multi-agent reinforcement learning (MARL) provides a flexible solution for tackling task and motion planning challenges, particularly in swarm confrontation scenarios. By customizing termination conditions for diverse tasks, event-driven MARL reduces decision jitter caused by frequent task switching. However, it hinders robots from updating strategies on a consistent timescale, leading to misaligned information sharing that disrupts agent coordination. To address this, we propose a novel event-driven MARL approach that facilitates collaborative strategy learning under asynchronous conditions. The approach introduces an experience selection scheme tailored to diverse timescales, ensuring efficient training through synchronized information sharing among robots. By incorporating Transformers, our method enables robots to infer others' behaviors from historical data, optimizing collaborative strategies. Extensive experiments validate the effectiveness of our proposed approach.
|
| |
| 15:00-16:30, Paper WeI2I.50 | Add to My Program |
| Multi-Robot Formation Control Via Consensus-Based Sliding Mode and Obstacle-Aware Adaptive Scaling |
|
| Lin, Hsien-I | National Yang Ming Chiao Tung University |
| Chen, Yu-Xian | National Yang Ming Chiao Tung University |
Keywords: Multi-Robot Systems, Cooperating Robots
Abstract: This paper proposes a consensus-based sliding mode controller (CSMC) for multi-robot formation control. The framework integrates Laplacian-based consensus with sliding mode robustness and adaptive formation scaling to simultaneously achieve accurate formation tracking and high formation consistency, while ensuring flexibility in constrained environments. The approach is validated in NVIDIA Isaac Sim and real-world experiments with Mecanum-wheeled robots. Compared with conventional sliding mode control (SMC), CSMC achieves consistent improvements in formation consistency, tracking accuracy, and overall performance in both simulation and real-world experiments. When compared with flocking based approaches, CSMC provides substantially improved tracking performance and achieves better overall performance under consistency-prioritized evaluation metrics. These results demonstrate the effectiveness of CSMC in achieving reliable formation tracking, consistent coordination, and adaptive formation scaling for multi-robot navigation.
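A minimal sketch of the consensus-plus-sliding-mode idea, assuming single-integrator robots on a fixed complete graph (a deliberate simplification of the paper's Mecanum-wheel setting):

```python
import numpy as np

A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])                      # adjacency of three robots
Lap = np.diag(A.sum(axis=1)) - A                  # graph Laplacian
offsets = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])  # triangle formation

def csmc_step(x, ref, k_c=1.0, k_s=0.5):
    """Consensus term aligns the robots' formation errors across the graph;
    the signum term is the sliding-mode part that adds robustness."""
    e = (x - offsets) - ref                       # per-robot formation error
    return -k_c * (Lap @ e) - k_s * np.sign(e)

x = np.random.randn(3, 2)                         # initial robot positions
for _ in range(2000):
    x += 0.005 * csmc_step(x, ref=np.zeros(2))
print(np.round(x - offsets, 2))                   # all rows converge near the origin
```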
|
| |
| 15:00-16:30, Paper WeI2I.51 | Add to My Program |
| Plan Optimal Collision-Free Trajectories with Non-Convex Cost Functions Using Graphs of Convex Sets |
|
| Clark, Landon | University of Kentucky |
| Xie, Biyun | University of Kentucky |
Keywords: Motion and Path Planning, Collision Avoidance, Optimization and Optimal Control, Convex Optimization
Abstract: The recently developed approach to motion planning in graphs of convex sets (GCS) provides an efficient framework for computing shortest-distance collision-free paths using convex optimization. This new motion planner is notably more computationally efficient than popular sampling-based motion planners, but it does not support nonconvex cost functions. This article develops a novel motion planning algorithm, graph of convex sets with general costs (GCSGC), to solve this problem. A given nonconvex cost function is accurately approximated by a multiple-layer ReLU neural network, and the configuration space is decomposed into a set of linear-cost regions using the hidden layers of the neural network. These linear-cost regions are intersected with a set of collision-free regions, and the resulting collision-free linear-cost regions form the vertices and edges of the motion planner’s underlying graph structure. The edge costs have a closed-form expression within each collision-free linear-cost region, but this expression is nonconvex, so the McCormick relaxation is applied to convexify the edge costs. Finally, a graph preprocessing technique is developed to compute a representative graph structure that acts as a heuristic for the edge costs of the underlying GCS and then simplifies the underlying graph structure by removing cycles and high-cost paths, which can significantly improve the efficiency of the planner and the quality of the produced trajectories. The proposed motion planner is first validated in a 2-D configuration space with comparisons between different-sized neural networks with and without preprocessing, comparisons between optimal trajectories from GCSGC and shortest-distance trajectories, and comparisons between GCSGC and GCS-sequential linear programming (GCS-SLP). The GCSGC planner is further validated in a complex 7-D configuration space by comparing to state-of-the-art multiquery (PRM*, GCS-SLP) and single-query (TrajOpt, BIT*, AIT*, RRT*) planners.
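For reference, the McCormick relaxation invoked above replaces each bilinear term with its convex and concave envelopes; the standard (textbook) form for w = xy over box bounds, which is not necessarily the paper's exact construction, is:

```latex
% McCormick envelopes for w = xy with x \in [x^L, x^U], y \in [y^L, y^U]
\begin{aligned}
w &\ge x^L y + x y^L - x^L y^L, & w &\ge x^U y + x y^U - x^U y^U, \\
w &\le x^U y + x y^L - x^U y^L, & w &\le x^L y + x y^U - x^L y^U.
\end{aligned}
```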
|
| |
| 15:00-16:30, Paper WeI2I.52 | Add to My Program |
| A Lightweight Agentic Multimodal Framework for Scene Understanding in Healthcare Robotics |
|
| Jha, Saurav | SETLabs Research GmbH |
| Ehrlich, Stefan K. | SETLabs Research GmbH |
Keywords: AI-Based Methods, Computer Vision for Medical Robotics, Medical Robots and Systems
Abstract: Healthcare robotics requires robust multimodal perception and reasoning to ensure safety in dynamic clinical environments. Current Vision-Language Models (VLMs) demonstrate strong general-purpose capabilities but remain limited in the temporal reasoning, uncertainty estimation, and structured outputs needed for robotic planning. We present a lightweight agentic multimodal framework for video-based scene understanding. Combining the Qwen2.5-VL-3B-Instruct model with a SmolAgent-based orchestration layer, it supports chain-of-thought reasoning, speech–vision fusion, and dynamic tool invocation. The framework generates structured scene graphs and leverages a hybrid retrieval module for interpretable and adaptive reasoning. Evaluations on the Video-MME benchmark and a custom clinical dataset show competitive accuracy and improved robustness compared to state-of-the-art VLMs, demonstrating its potential for applications in robot-assisted surgery, patient monitoring, and decision support.
|
| |
| 15:00-16:30, Paper WeI2I.53 | Add to My Program |
| Co-Design and Morphology-Guided Feedback Control: An Approach for Soft Robots |
|
| Nguyen, Nhan Huu | Japan Advanced Institute of Science and Technology |
| Do, Dinh Truong | Japan Advanced Institute of Science and Technology |
| Nguyen, Le Minh | Japan Advanced Institute of Science and Technology |
| Ho, Van | Japan Advanced Institute of Science and Technology |
Keywords: Modeling, Control, and Learning for Soft Robots, Evolutionary Robotics
Abstract: Soft robots, with their highly compliant bodies, exhibit numerous unforeseen configurations that often defy engineering intuition and complicate control design. This work introduces a simulation-based co-optimization framework that jointly optimizes both morphology and control. Unlike existing approaches that rely on oversimplified soft robot models or feed-forward controllers for simple tasks, our framework targets complex tasks that benefit from closed-loop feedback. The controller is trained over a hybrid design space combining discrete parameters, which define the nominal structure, and continuous parameters, which shift the morphology adaptively. The design distribution is iteratively manipulated to emphasize high-performing candidates until the optimal design–control pair emerges. Proprioceptive feedback in the form of mechanical strain is integrated to provide the controller with awareness of morphological state and interaction dynamics. Demonstrations show that the framework converges reliably to optimal design–control solutions, validating the effectiveness of the proposed joint optimization strategy.
|
| |
| 15:00-16:30, Paper WeI2I.54 | Add to My Program |
| Continuous-Time Optical Flow Estimation from Asynchronous Event-Frame Streams for Embedded Systems |
|
| Yang, Daolong | Beihang University |
| Liang, Hansheng | Beijing Institute of Technology |
| Liu, Haoyuan | Beihang University |
| Wang, Chengcai | Beihang University |
| Xu, Bin | Beijing Institute of Technology |
| Xu, Kun | Beihang University |
| Ding, Xilun | Beihang University |
Keywords: Neurorobotics, Sensor Fusion, Robotics in Under-Resourced Settings
Abstract: Bioinspired event cameras, with their high temporal resolution, low power consumption, and inherent motion responsiveness, have been widely adopted for fundamental vision tasks in robotics, notably optical flow estimation. Recent studies have shown that incorporating complementary frame data can significantly enhance the performance of event-based optical flow estimation. However, two major challenges hinder the real-time deployment of such methods on robotic platforms: (1) the asynchronous nature of events and frames makes it difficult to generalize across varying input temporal offsets; and (2) reliance on computationally expensive correlation volume construction and iterative refinement results in high inference latency on embedded systems. To address these issues, we propose a novel method that takes asynchronous event and frame streams as input and predicts high-quality dense flow in a single forward pass. Our approach temporally encodes both intra- and inter-sensor features and efficiently integrates them into a lightweight correlation volume to enhance flow prediction. Experimental results on real-world scenes demonstrate that our method improves flow accuracy by up to 22% over state-of-the-art hybrid event-frame methods, while being 3x faster on embedded GPUs. Furthermore, our approach maintains strong performance and generalizes well across diverse frame-event temporal offsets, introducing a novel paradigm for fusing asynchronous frame and event streams for continuous-time optical flow estimation.
|
| |
| 15:00-16:30, Paper WeI2I.55 | Add to My Program |
| Flow before Imitation: Learning Dexterous In-Hand Manipulation with Dynamic Visuotactile Shortcut Policy |
|
| Chen, Yijin | Shanghai Jiao Tong University |
| Xu, Wenqiang | Shanghai Jiaotong University |
| Yu, Zhenjun | Shanghai Jiao Tong University |
| Tang, Tutian | Shanghai Jiao Tong University |
| Li, Yutong | Shanghai Jiao Tong University |
| Yao, Siqiong | Shanghai Jiaotong University |
| Lu, Cewu | Shanghai Jiao Tong University |
Keywords: Imitation Learning, In-Hand Manipulation, Machine Learning for Robot Control
Abstract: Dexterous in-hand manipulation remains a long-standing challenge in robotics, primarily due to complex contact dynamics and partial observability. While humans synergize vision and touch for such tasks, robotic approaches often prioritize one modality, limiting adaptability. This paper introduces Flow Before Imitation (FBI), a visuotactile imitation learning framework that dynamically fuses tactile interactions with visual observations through motion dynamics. Unlike prior static fusion methods, FBI establishes a causal link between tactile signals and object motion via a dynamics-aware latent model. FBI employs a transformer-based interaction module to fuse flow-derived tactile features with visual inputs, training a one-step diffusion policy for real-time execution. Extensive experiments demonstrate that the proposed method outperforms the baseline methods in both simulation and the real world on two customized in-hand manipulation tasks and three standard dexterous manipulation tasks. Code, models, and more results are available at https://sites.google.com/view/dex-fbi.
|
| |
| 15:00-16:30, Paper WeI2I.56 | Add to My Program |
| Flexible Locomotion Learning with Diffusion Model Predictive Control |
|
| Huang, Runhan | Harvard University |
| Balim, Haldun | Harvard University |
| Yang, Heng | Harvard University |
| Du, Yilun | Harvard University |
Keywords: AI-Based Methods, Legged Robots, Deep Learning Methods
Abstract: Legged locomotion demands controllers that are both robust and adaptable, while remaining compatible with task and safety considerations. However, model-free reinforcement learning (RL) methods often yield a fixed policy that can be difficult to adapt to new behaviors at test time. In contrast, Model Predictive Control (MPC) provides a natural approach to flexible behavior synthesis by incorporating different objectives and constraints directly into its optimization process. However, classical MPC relies on accurate dynamics models, which are often difficult to obtain in complex environments and typically require simplifying assumptions. We present Diffusion-MPC, which leverages a learned generative diffusion model as an approximate dynamics prior for planning, enabling flexible test-time adaptation through reward- and constraint-based optimization. Diffusion-MPC jointly predicts future states and actions; at each reverse step, we incorporate reward planning and impose feasibility projection, yielding trajectories that satisfy task objectives while remaining within physical limits. To obtain a planning model that adapts beyond imitation pretraining, we introduce an interactive training algorithm for the diffusion-based planner: we execute our reward-and-constraint planner in the environment, then filter and reweight the collected trajectories by their realized returns before updating the denoiser. Our design enables strong test-time adaptability, allowing the planner to adjust to new reward specifications without retraining. We validate Diffusion-MPC in the real world, demonstrating strong locomotion and flexible adaptation.
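A minimal sketch of one guided reverse step in the spirit of the abstract, with stand-in denoiser, reward gradient, and projection (all hypothetical placeholders, not the paper's components):

```python
import numpy as np

def guided_reverse_step(traj, t, denoise, reward_grad, project, lam=0.05):
    """Denoise the joint state-action trajectory, nudge it uphill on the
    task reward, then project it back into the feasible set."""
    traj = denoise(traj, t)                   # learned dynamics prior
    traj = traj + lam * reward_grad(traj)     # gradient-based reward planning
    return project(traj)                      # feasibility projection

# Stand-in components for a 16-step horizon with 4 state-action dims.
denoise = lambda x, t: 0.9 * x                      # placeholder denoiser
reward_grad = lambda x: -x                          # reward peaks at the origin
project = lambda x: np.clip(x, -1.0, 1.0)           # box limits (e.g., torques)
traj = np.random.randn(16, 4)
for t in reversed(range(10)):
    traj = guided_reverse_step(traj, t, denoise, reward_grad, project)
```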
|
| |
| 15:00-16:30, Paper WeI2I.57 | Add to My Program |
| Is Pre-Training Applicable to the Decoder for Dense Prediction? |
|
| Ning, Chao | The University of Tokyo & RIKEN AIP |
| Gan, Wanshui | The University of Tokyo & RIKEN AIP |
| Xuan, Weihao | The University of Tokyo & RIKEN AIP |
| Yokoya, Naoto | The University of Tokyo & RIKEN AIP |
Keywords: Deep Learning for Visual Perception, RGB-D Perception, Semantic Scene Understanding
Abstract: Encoder-decoder networks are commonly used model architectures for dense prediction tasks, where the encoder typically employs a model pre-trained on upstream tasks, while the decoder is often either randomly initialized or pre-trained on other tasks. In this paper, we introduce ×Net, a novel framework that leverages a model pre-trained on upstream tasks as the decoder, fostering a "pre-trained encoder × pre-trained decoder" collaboration within the encoder-decoder network. ×Net effectively addresses the challenges associated with using pre-trained models in decoding, applying the learned representations to enhance the decoding process. This enables the model to achieve more precise and high-quality dense predictions. By simply coupling a pre-trained encoder and a pre-trained decoder, ×Net distinguishes itself as a highly promising approach. Remarkably, it achieves this without relying on decoding-specific structures or task-specific algorithms. Despite its streamlined design, ×Net outperforms advanced methods in tasks such as monocular depth estimation and semantic segmentation. The code is available at https://2j472no.github.io/xNet/.
|
| |
| 15:00-16:30, Paper WeI2I.58 | Add to My Program |
| V2V-LLM: Vehicle-To-Vehicle Cooperative Autonomous Driving with Multimodal Large Language Models |
|
| Chiu, Hsu-kuang | NVIDIA, Carnegie Mellon University |
| Hachiuma, Ryo | NVIDIA |
| Wang, Chien-Yi | NVIDIA |
| Smith, Stephen F. | Carnegie Mellon University |
| Wang, Yu-Chiang Frank | NVIDIA |
| Chen, Min-Hung | NVIDIA |
Keywords: Computer Vision for Transportation, Intelligent Transportation Systems, Deep Learning for Visual Perception
Abstract: Current autonomous driving vehicles rely mainly on their individual sensors to understand surrounding scenes and plan for future trajectories, which can be unreliable when the sensors are malfunctioning or occluded. To address this problem, cooperative perception methods via vehicle-to-vehicle (V2V) communication have been proposed, but they have tended to focus on perception tasks like detection or tracking. How those approaches contribute to overall cooperative planning performance is still under-explored. Inspired by recent progress using Large Language Models (LLMs) to build autonomous driving systems, we propose a novel problem setting that integrates a Multimodal LLM into cooperative autonomous driving, with the proposed Vehicle-to-Vehicle Question-Answering (V2V-QA) dataset and benchmark. We also propose our baseline method Vehicle-to-Vehicle Multimodal Large Language Model (V2V-LLM), which uses an LLM to fuse perception information from multiple connected autonomous vehicles (CAVs) and answer various types of driving-related questions: grounding, notable object identification, and planning. Experimental results show that our proposed V2V-LLM can be a promising unified model architecture for performing various tasks in cooperative autonomous driving, and outperforms other baseline methods that use different fusion approaches. Our work also creates a new research direction that can improve the safety of future autonomous driving systems. Our code and dataset are released to facilitate open-source research at https://eddyhkchiu.github.io/v2vllm.github.io/.
|
| |
| 15:00-16:30, Paper WeI2I.59 | Add to My Program |
| SailMAV: Water-Surface Locomotion and Biodiversity Monitoring (I) |
|
| Farinha, Andre | CSIRO |
| Romanello, Luca | TUM |
| Zufferey, Raphael | MIT |
| Lawson, Jenna Louise | UK Centre for Ecology & Hydrology |
| Armanini, Sophie Franziska | Imperial College London |
| Kovac, Mirko | Imperial College London |
Keywords: Field Robots, Environment Monitoring and Management, Aerial Systems: Applications
Abstract: Existing aquatic robotic vehicles tend to be large, heavy, and difficult to deploy. This often renders them unsuitable for monitoring delicate aquatic habitats and hard-to-access areas. We present a comprehensive framework for the design and development of sailing micro aerial vehicles (SailMAVs), whose combination of flight and sailing capabilities is highly valuable for sensing missions in aquatic environments. This concept allows for quick hand-launch deployment from land, access to remote areas, rapid multipoint sampling at six locations, and easy movement between separate water bodies. Our framework places particular emphasis on the complex aero-hydrodynamic design, ensuring dual use of subsystems in both locomotion modes, which in turn maximizes performance and reduces redundant payloads. The small scale of the robots considered represents a particular challenge, in terms of both practical design aspects and the underlying physics. In addition to the hardware design, control laws are derived to allow for automated long-duration mission execution. To illustrate the proposed framework, a robotic prototype is presented, analyzed, and tested as an example. The developed design and control laws are validated in autonomous outdoor sailing missions, demonstrating the effectiveness of the framework. The prototype is further employed in remote sensing missions, demonstrating the use of SailMAVs for passive acoustic monitoring (PAM) of aquatic environments. The data obtained demo
|
| |
| 15:00-16:30, Paper WeI2I.60 | Add to My Program |
| StarIO: A Lightweight Inertial Odometry for Nonlinear Motion |
|
| Zhang, Shanshan | Xiamen University |
| Wang, Siyue | Xiamen University |
| Zhang, Qi | Xiamen University |
| Wu, Liqin | Xiamen University |
| Wen, Tianshui | Xiamen University |
| Zhou, Ziheng | Xiamen University |
| Hong, Xuemin | Xiamen University |
| Peng, Ao | Xiamen University |
| Zheng, Lingxiang | Xiamen University |
| Yang, Yu | Xiamen University |
Keywords: Localization, AI-Based Methods, Sensor-based Control
Abstract: Inertial odometry (IO) is an attractive approach for consumer-grade localization. However, existing data-driven IO methods often suffer from significant drift under complex nonlinear motion patterns (e.g., turns), as they struggle to capture the nonlinear relationships between Inertial Measurement Unit (IMU) signals and motion states. To address this issue, we propose a lightweight IO model, StarIO. Specifically, we first apply the Star Operation to project IMU signals into a high-dimensional implicit nonlinear feature space, enabling effective extraction of the complex nonlinear motion characteristics that typically cause drift. We then capture contextual dependencies across both the temporal and channel dimensions to enhance trajectory estimation over long sequences. In addition, we introduce a multi-scale gated unit that fuses fine-grained local motion dynamics with contextual information to achieve a comprehensive representation of motion. Extensive experiments on six representative open-source datasets demonstrate that StarIO achieves a superior trade-off between model compactness and localization accuracy. For example, on the RoNIN dataset, our approach reduces the ATE by 5.21% compared to R-ResNet while using only 2.762M parameters.
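The Star Operation referenced above is, in its generic form, an element-wise product of two linear projections; a minimal sketch with stand-in weights (not the paper's architecture) follows.

```python
import numpy as np

def star_block(x, w1, w2):
    """Generic star operation: the element-wise product of two linear
    projections, which implicitly lifts the input into a high-dimensional
    nonlinear feature space (products of input-feature pairs)."""
    return (x @ w1) * (x @ w2)

rng = np.random.default_rng(0)
imu = rng.standard_normal((200, 6))     # 200 IMU samples x 6 channels
w1 = rng.standard_normal((6, 32))       # stand-ins for learned projections
w2 = rng.standard_normal((6, 32))
features = star_block(imu, w1, w2)      # (200, 32) nonlinear features
```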
|
| |
| 15:00-16:30, Paper WeI2I.61 | Add to My Program |
| Optimal Excitation Trajectories for System Identification of Underwater Vehicles |
|
| Panetsos, Fotis | New York University Abu Dhabi |
| Kyriakopoulos, Kostas | New York University - Abu Dhabi |
Keywords: Marine Robotics, Calibration and Identification, Dynamics
Abstract: In this work, we propose a structured methodology for the system identification of underwater vehicles through the design of optimal excitation trajectories. To this end, the trajectories are parameterized using Bezier curves, which ensure smooth and differentiable motion profiles while facilitating the enforcement of constraints through appropriate manipulation of the control points. An optimization problem is formulated to determine a dynamically feasible excitation trajectory that respects safety limits and maximizes the quality of the collected data, thereby enabling reliable estimation of the vehicle’s dynamic parameters using least squares. The proposed methodology is experimentally validated in a laboratory water tank, where the dynamic parameters, identified from the optimized trajectory, are evaluated by predicting the vehicle’s velocity through forward simulation on previously unseen trajectories.
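A minimal sketch of Bezier evaluation and differentiation via De Casteljau's algorithm and control-point differences, the standard machinery behind the smoothness and constraint-handling claims; the waypoints are illustrative.

```python
import numpy as np

def de_casteljau(ctrl, t):
    """Evaluate a Bezier curve at t in [0, 1]. The curve is smooth and
    stays inside the convex hull of its control points, which is why
    bounds can be enforced by manipulating the control points."""
    pts = np.asarray(ctrl, dtype=float)
    while len(pts) > 1:
        pts = (1.0 - t) * pts[:-1] + t * pts[1:]
    return pts[0]

def derivative_ctrl(ctrl, duration=1.0):
    """Control points of the derivative curve: n * (P_{i+1} - P_i),
    scaled by the trajectory duration; differentiability comes for free."""
    pts = np.asarray(ctrl, dtype=float)
    return (len(pts) - 1) * (pts[1:] - pts[:-1]) / duration

ctrl = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 2.0], [4.0, 0.0]])  # illustrative
pos = de_casteljau(ctrl, 0.5)                       # position at mid-trajectory
vel = de_casteljau(derivative_ctrl(ctrl), 0.5)      # velocity at mid-trajectory
```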
|
| |
| 15:00-16:30, Paper WeI2I.62 | Add to My Program |
| Towards the Best Robot for the Job: Optimising Actuation Design through Multi-Task Co-Design and Component Selection |
|
| Roozing, Wesley | University of Twente |
| Schaaij, Jonathan Cornee | University of Twente |
| Forino, Alessandro | Maxon Motor Ag |
Keywords: Methods and Tools for Robot System Design, Actuation and Joint Mechanisms
Abstract: We propose a multi-task co-design approach for designing a robot's actuation (motor sizes and gear ratios) based on trajectory optimisation. Leveraging an actuation model fitted to data for series of components, we find the optimal set of design parameters for all joints over a set of representative tasks for the given robot. Critically, we close the loop towards component selection, given a finite set of available components, which enables more practical use of co-design tools. Our results show that the method is effective and, critically, that it is possible to find a robot design capable of performing an entire set of tasks at an efficiency comparable to that of a robot co-designed for each specific task. Finally, we perform an extensive analysis of hyperparameter effects, select discrete actuation components from catalogues, and compare them to the co-design results.
|
| |
| 15:00-16:30, Paper WeI2I.63 | Add to My Program |
| Zero-Shot Metric Depth Estimation Via Monocular Visual-Inertial Rescaling for Autonomous Aerial Navigation |
|
| Yang, Steven | Carnegie Mellon University |
| Tian, Xiaoyu | Carnegie Mellon University |
| Goel, Kshitij | Carnegie Mellon University |
| Tabib, Wennie | Carnegie Mellon University |
Keywords: Field Robots, Aerial Systems: Applications
Abstract: This paper presents a methodology to predict metric depth from monocular RGB images and an inertial measurement unit (IMU). To enable collision avoidance during autonomous flight, prior works either leverage heavy sensors (e.g., LiDARs or stereo cameras) or data-intensive and domain-specific fine-tuning of monocular metric depth estimation methods. In contrast, we propose several lightweight zero-shot rescaling strategies to obtain metric depth from relative depth estimates via the sparse 3D feature map created using a visual-inertial navigation system. These strategies are compared for their accuracy in diverse simulation environments. The best performing approach, which leverages monotonic spline fitting, is deployed in the real-world on a compute-constrained quadrotor. We obtain on-board metric depth estimates at 15 Hz and demonstrate successful collision avoidance after integrating the proposed method with a motion primitives-based planner.
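One plausible realization of the monotonic-spline rescaling strategy, sketched with SciPy's PCHIP interpolant; the function names and fitting details are assumptions, not the paper's code.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def rescale_depth(rel_depth, sparse_rel, sparse_metric):
    """Fit a monotone (PCHIP) spline from relative to metric depth at the
    sparse VIO landmarks, then apply it densely to the relative depth map."""
    order = np.argsort(sparse_rel)
    xs, keep = np.unique(sparse_rel[order], return_index=True)  # strictly increasing x
    ys = sparse_metric[order][keep]
    spline = PchipInterpolator(xs, ys, extrapolate=True)
    return spline(rel_depth)

# Synthetic check: landmarks follow metric = 2 * relative + 0.5.
rel_landmarks = np.random.rand(30)
dense_rel = np.random.rand(480, 640)
metric = rescale_depth(dense_rel, rel_landmarks, 2 * rel_landmarks + 0.5)
```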
|
| |
| 15:00-16:30, Paper WeI2I.64 | Add to My Program |
| Multi-Robot Decentralized Collaborative SLAM in Planetary Analogue Environments: Dataset, Challenges, and Lessons Learned (I) |
|
| Lajoie, Pierre-Yves | École Polytechnique De Montréal |
| Soma, Karthik | École Polytechnique De Montréal |
| Bong, Haechan Mark | École Polytechnique De Montréal |
| Lemieux-Bourque, Alice | École Polytechnique De Montréal |
| Zhang, Rongge | École Polytechnique De Montréal |
| Varadharajan, Vivek shankar | École Polytechnique De Montréal |
| Beltrame, Giovanni | Ecole Polytechnique De Montreal |
Keywords: Multi-Robot SLAM, Space Robotics and Automation
Abstract: Decentralized collaborative simultaneous localization and mapping (C-SLAM) is essential to enable multirobot missions in unknown environments without relying on preexisting localization and communication infrastructure. This technology is anticipated to play a key role in the exploration of the Moon, Mars, and other planets. In this article, we share insights and lessons learned from C-SLAM experiments involving three robots operating on a Mars analogue terrain and communicating over an ad hoc network. We examine the impact of limited and intermittent communication on C-SLAM performance, as well as the unique localization challenges posed by planetary-like environments. Additionally, we introduce a novel dataset collected during our experiments, which includes real-time peer-to-peer inter-robot throughput and latency measurements. This dataset aims to support future research on communication-constrained, decentralized multirobot operations.
|
| |
| 15:00-16:30, Paper WeI2I.65 | Add to My Program |
| UniFuture: A 4D Driving World Model for Future Generation and Perception |
|
| Liang, Dingkang | Huazhong University of Science and Technology |
| Zhang, Dingyuan | Huazhong University of Science and Technology |
| Zhou, Xin | Huazhong University of Science and Technology |
| Tu, Sifan | Huazhong University of Science and Technology |
| Feng, Tianrui | Huazhong University of Science and Technology |
| Li, Xiaofan | Baidu |
| Yumeng, Zhang | Baidu |
| Du, Mingyang | Huazhong University of Science and Technology |
| Tan, Xiao | Baidu |
| Xiang, Bai | Huazhong University of Science & Technology |
Keywords: Parallel Robots, Force Control, Field Robots
Abstract: We present UniFuture, a unified 4D Driving World Model designed to simulate the dynamic evolution of the 3D physical world. Unlike existing driving world models that focus solely on 2D pixel-level video generation (lacking geometry) or static perception (lacking temporal dynamics), our approach bridges appearance and geometry to construct a holistic 4D representation. Specifically, we treat future RGB images and depth maps as coupled projections of the same 4D reality and model them jointly within a single framework. To achieve this, we introduce a Dual-Latent Sharing (DLS) scheme, which maps visual and geometric modalities into a shared spatio-temporal latent space, implicitly entangling texture with structure. Furthermore, we propose a Multi-scale Latent Interaction (MLI) mechanism, which enforces bidirectional consistency: geometry constrains visual synthesis to prevent structural hallucinations, while visual semantics refine geometric estimation. During inference, UniFuture can forecast high-fidelity, geometrically consistent 4D scene sequences (image-depth pairs) from a single current frame. Extensive experiments on the nuScenes and Waymo datasets demonstrate that our method outperforms specialized models in both future generation and geometry perception, highlighting the potential of unified 4D modeling for autonomous driving. The code is available at https://github.com/dk-liang/UniFuture.
|
| |
| 15:00-16:30, Paper WeI2I.66 | Add to My Program |
| ActiveVLN: Towards Active Exploration Via Multi-Turn RL in Vision-And-Language Navigation |
|
| Zhang, Zekai | Southern University of Science and Technology |
| Zhu, Weiye | Southern University of Science and Technology |
| Pan, Hewei | Southern University of Science and Technology |
| Wang, Xiangchen | Southern University of Science and Technology |
| Xu, Rongtao | Spatialtemporal AI, China |
| Sun, Xing | Tencent |
| Zheng, Feng | SUSTech |
Keywords: Vision-Based Navigation, Reinforcement Learning, Visual Learning
Abstract: The Vision-and-Language Navigation (VLN) task requires an agent to follow natural language instructions and navigate through complex environments. Existing MLLM-based VLN methods primarily rely on imitation learning (IL) and often use DAgger for post-training to mitigate covariate shift. While effective, these approaches incur substantial data collection and training costs. Reinforcement learning (RL) offers a promising alternative. However, prior VLN RL methods lack dynamic interaction with the environment and depend on expert trajectories for reward shaping, rather than engaging in open-ended active exploration. This restricts the agent's ability to discover diverse and plausible navigation routes. To address these limitations, we propose ActiveVLN, a VLN framework that explicitly enables active exploration through multi-turn RL. In the first stage, a small fraction of expert trajectories is used for IL to bootstrap the agent. In the second stage, the agent iteratively predicts and executes actions, automatically collects diverse trajectories, and optimizes multiple rollouts via the GRPO objective. To further improve RL efficiency, we introduce a dynamic early-stopping strategy to prune long-tail or likely failed trajectories, along with additional engineering optimizations. Experiments show that ActiveVLN achieves the largest performance gains over IL baselines compared to both DAgger-based and prior RL-based post-training methods, while reaching competitive performance with state-of-the-art approaches despite using a smaller model.
|
| |
| 15:00-16:30, Paper WeI2I.67 | Add to My Program |
| LaViRA: Language-Vision-Robot Actions Translation for Zero-Shot Vision Language Navigation in Continuous Environments |
|
| Ding, Hongyu | Nanjing University |
| Xu, Ziming | Nanjing University |
| Fang, Yuk Tung Samuel | Nanjing University |
| Wu, You | NanJing University |
| Chen, Zixuan | Nanjing Univeristy |
| Shi, Jieqi | Nanjing University |
| Huo, Jing | Nanjing University |
| Zhang, Yifan | Institute of Automation, Chinese Academy of Sciences |
| Gao, Yang | Nanjing University |
Keywords: Visual Learning, Vision-Based Navigation, RGB-D Perception
Abstract: Zero-shot Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires an agent to navigate unseen environments based on natural language instructions without any prior training. Current methods face a critical trade-off: either rely on environment-specific waypoint predictors that limit scene generalization, or underutilize the reasoning capabilities of large models during navigation. We introduce LaViRA, a simple yet effective zero-shot framework that addresses this dilemma by decomposing action into a coarse-to-fine hierarchy: Language Action for high-level planning, Vision Action for middle-level perceptual grounding, and Robot Action for low-level control. This modular decomposition allows us to leverage the distinct strengths of different scales of Multimodal Large Language Models (MLLMs) at each stage, creating a system that is powerful in its reasoning, grounding and practical control. LaViRA significantly outperforms existing state-of-the-art methods on the VLN-CE benchmark, demonstrating superior generalization capabilities in unseen environments, while maintaining transparency and efficiency for real-world deployment.
|
| |
| 15:00-16:30, Paper WeI2I.68 | Add to My Program |
| Precedence-Aware Multi-UAV Task Allocation with an Attention-Based Reinforcement Learning Framework |
|
| Liu, Xurui | Nanjing University |
Keywords: Integrated Planning and Learning, Motion and Path Planning, Reinforcement Learning
Abstract: Multi-UAV coordination is critical for complex real-world applications, but these missions are often constrained by intricate causal dependencies between tasks, alongside strict UAV energy and return-to-base constraints. Existing methods, ranging from exact solvers to standard deep reinforcement learning approaches, struggle to scale with the combinatorial complexity of this problem and often fail to effectively represent the underlying logical task structures. To address this gap, we propose the Causal-Channel Transformer for Joint Allocation (C2T-JA), an end-to-end reinforcement learning framework. The core of C2T-JA is a dual-branch hybrid attention encoder that explicitly constructs and reasons over multi-hop, disentangled causal channels, effectively decoupling logical dependencies from spatial task features. Building on this rich representation, a context-aware decoder generates a globally coordinated joint action for the entire team. We evaluated C2T-JA against established baselines, including an exact solver (Gurobi), a conventional heuristic (OR-Tools), and a leading learning-based approach (AM-joint), on procedurally generated benchmarks of varying scales and dependency structures. The results demonstrate that our approach consistently produces higher-quality solutions, measured by task completion rates, while reducing decision times by several orders of magnitude, particularly in large-scale scenarios.
|
| |
| 15:00-16:30, Paper WeI2I.69 | Add to My Program |
| EXOM: An Excavator Operation Monitoring Framework with Onboard Vision and Sensor Data |
|
| Kang, Seok-Kyu | HD Hyundai |
| Lee, Seong-Gye | HD Hyundai |
| Jang, Gye-Bong | HD Hyundai |
Keywords: Robotics and Automation in Construction, Sensor Fusion, Deep Learning Methods
Abstract: Reliable monitoring of excavator operations in real-world environments requires accurate excavation counting to ensure productivity, efficient computation for real-time inference, and cost-effective on-board sensing—a combination that most prior systems fail to achieve. We present EXOM (EXcavator Operation Monitoring), a lightweight and deployable framework that relies solely on a factory-installed cabin camera and built-in hydraulic sensors. EXOM integrates two embedded-friendly modules: a Video data Processing Module (VPM), where an ECSE algorithm leverages bucket detection to estimate excavation sections and counts from state transitions, and a Sensor data Processing Module (SPM), where an Adaptive Window (AW) process sparsifies time-series signals and drives a segmentation model through a learnable sparse tensor. To capture deployability, we introduce EXOM-I, a unified index that combines section-level F1 and normalized excavation counting accuracy. Experiments with real-world data demonstrate that EXOM consistently outperforms previous approaches, achieving state-of-the-art performance with real-time latency on resource-limited embedded excavator hardware.
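As a simple illustration of counting from state transitions, a debounced cycle counter over a per-frame state sequence follows; the state labels and threshold are hypothetical, not the ECSE algorithm's actual design.

```python
def count_excavations(states, min_run=5):
    """Count entries into the 'dig' state that persist for at least
    min_run consecutive frames (simple debouncing); each sustained
    dig segment is counted exactly once."""
    count, run = 0, 0
    for s in states:
        run = run + 1 if s == "dig" else 0
        if run == min_run:
            count += 1
    return count

frames = ["idle"] * 3 + ["dig"] * 8 + ["dump"] * 4 + ["dig"] * 6 + ["idle"]
print(count_excavations(frames))  # -> 2
```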
|
| |
| 15:00-16:30, Paper WeI2I.70 | Add to My Program |
| User-Centric Object Navigation: A Benchmark with Integrated User Habits for Personalized Embodied Object Search |
|
| Wang, Hongcheng | Peking University |
| Zhu, Jinyu | Peking University |
| Dong, Hao | Peking University |
Keywords: AI-Enabled Robotics, Human-Centered Robotics, Domestic Robotics
Abstract: In the evolving field of robotics, the challenge of Object Navigation (ON) in household environments has attracted significant interest. Existing ON benchmarks typically place objects in locations guided by general scene priors, without accounting for the specific placement habits of individual users. This omission limits the adaptability of navigation agents in personalized household environments. To address this, we introduce User-centric Object Navigation (UcON), a new benchmark that incorporates user-specific object placement habits, referred to as user habits. This benchmark requires agents to leverage these user habits for more informed decision-making during navigation. UcON encompasses approximately 22,600 user habits across 489 object categories. To the best of our knowledge, UcON is the first object navigation benchmark that takes user habits into account and covers the widest range of target object categories. Additionally, we propose a habit retrieval module to extract and utilize habits related to target objects, enabling agents to infer their likely locations more effectively. Experimental results demonstrate that while current state-of-the-art ON methods struggle with UcON's challenges, integrating user habits significantly improves the success rate in locating objects.
|
| |
| 15:00-16:30, Paper WeI2I.71 | Add to My Program |
| OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation |
|
| Hirose, Noriaki | UC Berkeley / TOYOTA Motor North America |
| Glossop, Catherine | University of California, Berkeley |
| Shah, Dhruv | Google DeepMind |
| Levine, Sergey | UC Berkeley |
Keywords: Deep Learning Methods, Vision-Based Navigation
Abstract: Humans can flexibly interpret and compose different goal specifications, such as language instructions, spatial coordinates, or visual references, when navigating to a destination. In contrast, most existing robotic navigation policies are trained on a single modality, limiting their adaptability to real-world scenarios where different forms of goal specification are natural and complementary. In this work, we present a training framework for robotic foundation models that enables omni-modal goal conditioning for vision-based navigation. Our approach leverages a high-capacity vision-language-action (VLA) backbone and trains with three primary goal modalities: 2D poses, egocentric images, and natural language, as well as their combinations, through a randomized modality fusion strategy. This design not only expands the pool of usable datasets but also encourages the policy to develop richer geometric, semantic, and visual representations. The resulting model, OmniVLA, achieves strong generalization to unseen environments, robustness to scarce modalities, and the ability to follow novel natural language instructions. We demonstrate that OmniVLA outperforms specialist baselines across modalities and offers a flexible foundation for fine-tuning to new modalities and tasks. We believe OmniVLA provides a step toward broadly generalizable and flexible navigation policies, and a scalable path for building omni-modal robotic foundation models.
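The randomized modality fusion strategy can be sketched in a few lines (the keep probabilities and modality names are assumptions, not values from the paper): per training sample, a random subset of goal modalities is retained so the policy learns to act from any combination.

import numpy as np

# Sketch: sample which goal modalities condition the policy for one example.
def sample_modality_mask(p_keep=(0.7, 0.7, 0.7)):
    keep = np.random.rand(3) < np.array(p_keep)
    if not keep.any():
        keep[np.random.randint(3)] = True        # always keep at least one goal
    return {'pose2d': keep[0], 'goal_image': keep[1], 'language': keep[2]}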
|
| |
| 15:00-16:30, Paper WeI2I.72 | Add to My Program |
| InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning |
|
| Zhang, Ji | Southwest Jiaotong University |
| Wu, Shihan | University of Electronic Science and Technology of China |
| Luo, Xu | University of Electronic Science and Technology of China |
| Wu, Hao | University of Electronic Science and Technology of China |
| Xie, Junlin | University of Electronic Science and Technology of China |
| Gao, Lianli | University of Electronic Science and Technology of China |
| Shen, Heng Tao | Tongji University |
| Song, Jingkuan | Tongji University |
Keywords: Transfer Learning, Learning from Demonstration, Representation Learning
Abstract: Leveraging pretrained Vision-Language Models (VLMs) to map language instruction and visual observations to raw low-level actions, Vision-Language-Action models (VLAs) hold great promise for achieving general-purpose robotic systems. Despite their advancements, existing VLAs tend to spuriously correlate task-irrelevant visual features with actions, limiting their generalization capacity beyond the training data. To tackle this challenge, we propose Intrinsic Spatial Reasoning (InSpire), a simple yet effective approach that mitigates the adverse effects of spurious correlations by boosting the spatial reasoning ability of VLAs. Specifically, InSpire redirects the VLA's attention to task-relevant factors by prepending the question “In which direction is the [object] relative to the robot?” to the language instruction and aligning the model's output answer “right/left/up/down/front/back/grasped” and predicted actions with the ground-truth. Notably, InSpire can be used as a plugin to enhance existing autoregressive VLAs, requiring no extra training data or interaction with other large models. Extensive experimental results in both simulation and real-world environments demonstrate the effectiveness and flexibility of InSpire. Code, pretrained models and demos are publicly available at: https://koorye.github.io/Inspire.
|
| |
| 15:00-16:30, Paper WeI2I.73 | Add to My Program |
| TreeLoc: 6-DoF LiDAR Global Localization in Forests Via Inter-Tree Geometric Matching |
|
| Jung, Minwoo | Seoul National University |
| Chebrolu, Nived | University of Oxford |
| Carvalho de Lima, Lucas | The University of Queensland |
| Oh, Haedam | University of Oxford |
| Fallon, Maurice | University of Oxford |
| Kim, Ayoung | Seoul National University |
Keywords: Robotics and Automation in Agriculture and Forestry, Localization, Field Robots
Abstract: Reliable localization is crucial for navigation in forests, where GPS is often degraded and LiDAR measurements are repetitive, occluded, and structurally complex. These conditions weaken the assumptions of traditional urban-centric localization methods, which assume that consistent features arise from unique structural patterns, necessitating forest-centric solutions to achieve robustness in these environments. To address these challenges, we propose TreeLoc, a LiDAR-based global localization framework for forests that handles place recognition and 6-DoF pose estimation. We represent scenes using tree stems and their diameter at breast height (DBH), which are aligned to a common reference frame via their axes and summarized using the tree distribution histogram (TDH) for coarse matching, followed by fine matching with a 2D triangle descriptor. Finally, pose estimation is achieved through a two-step geometric verification. On diverse forest benchmarks, TreeLoc outperforms baselines, achieving precise localization. Ablation studies validate the contribution of each component. We also propose applications for long-term forest management using descriptors from a compact global tree database. TreeLoc is open-sourced for the robotics community at https://github.com/minwoo0611/TreeLoc.
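A minimal sketch of the coarse-matching stage described above, assuming the tree distribution histogram (TDH) is a normalized histogram of stem diameters compared by cosine similarity (the bin edges and similarity measure are our assumptions):

import numpy as np

def tdh(dbh_values, bins=np.linspace(0.05, 1.0, 20)):
    # Tree distribution histogram: normalized histogram of DBH values.
    h, _ = np.histogram(dbh_values, bins=bins)
    return h / max(h.sum(), 1)

def coarse_match(query_dbh, database_dbh_lists):
    # Return the index of the database scene whose TDH best matches the query.
    q = tdh(query_dbh)
    sims = [float(q @ tdh(d)) /
            (np.linalg.norm(q) * np.linalg.norm(tdh(d)) + 1e-9)
            for d in database_dbh_lists]
    return int(np.argmax(sims))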
|
| |
| 15:00-16:30, Paper WeI2I.74 | Add to My Program |
| GaussR-SLAM: Gaussian Robust SLAM in Data Loss and Interference Environments |
|
| Zhang, Bowen | Hebei University |
| Liu, Yufan | UC Berkeley |
| Li, Dong | University of Macau |
| Guo, Pengfei | North China Electric Power University |
| Gui, Yuanze | Beijing University of Technology |
| Li, Mingrui | Dalian University of Technology |
| |
| 15:00-16:30, Paper WeI2I.75 | Add to My Program |
| Learning-Based Observer for Coupled Disturbance |
|
| Jia, Jindou | Beihang University |
| Wang, Meng | Beihang University |
| Yang, Zihan | Beihang University |
| Yang, Bin | Beihang University |
| Liu, Yuhang | Beihang University |
| Guo, Kexin | Beihang University |
| Yu, Xiang | Beihang University |
Keywords: Machine Learning for Robot Control, Robust/Adaptive Control, Aerial Systems: Applications
Abstract: Achieving high-precision control for robotic systems is hindered by low-fidelity dynamical models and external disturbances. In particular, the intricate coupling between internal uncertainties and external disturbances further exacerbates this challenge. This study introduces an effective and convergent algorithm that accurately estimates the coupled disturbance by combining control and learning philosophies. Concretely, by resorting to Chebyshev series expansion, the coupled disturbance is decomposed into an unknown parameter matrix and two known structures dependent on the system state and the external disturbance, respectively. A regularized least squares process is subsequently formalized to learn the parameter matrix from historical time-series data. Furthermore, a polynomial disturbance observer is devised to achieve high-precision estimation of the coupled disturbance by utilizing the learned structural portion. Extensive simulations and real flight tests validate the effectiveness of the proposed framework. We believe this work offers a new pathway for integrating learning approaches into control frameworks to address longstanding challenges in robotic applications.
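The decomposition above admits a compact batch estimator. The sketch below (shapes, feature maps, and the regularization weight are assumptions) fits the unknown parameter matrix by regularized least squares over Chebyshev features of the state and the external disturbance:

import numpy as np
from numpy.polynomial import chebyshev as C

def cheb_features(z, order):
    # Stack T_0(z), ..., T_order(z), evaluated elementwise on the vector z.
    return np.concatenate([C.chebval(z, np.eye(order + 1)[k])
                           for k in range(order + 1)])

def fit_parameter_matrix(X, D, Y, order=3, lam=1e-3):
    # W* = argmin_W sum_t ||Y_t - W f_t||^2 + lam ||W||_F^2, where the
    # regressor f_t = cheb(x_t) (x) cheb(d_t) combines the two known structures.
    F = np.stack([np.kron(cheb_features(x, order), cheb_features(d, order))
                  for x, d in zip(X, D)])
    A = F.T @ F + lam * np.eye(F.shape[1])
    return np.linalg.solve(A, F.T @ np.asarray(Y)).T   # parameter matrix W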
|
| |
| 15:00-16:30, Paper WeI2I.76 | Add to My Program |
| Vision-Language Feature Alignment for Road Anomaly Segmentation |
|
| He, Zhuolin | Fudan University |
| Tang, Jiacheng | Fudan University |
| Pu, Jian | Fudan University |
| Xue, Xiangyang | Fudan University |
Keywords: Computer Vision for Automation, Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception
Abstract: Safe autonomous systems in complex environments require robust road anomaly segmentation to identify unknown obstacles. However, existing approaches often rely on pixel-level statistics to determine whether a region appears anomalous. This reliance leads to high false-positive rates on semantically normal background regions such as sky or vegetation, and poor recall of true Out-of-distribution (OOD) instances, thereby posing safety risks for robotic perception and decision-making. To address these challenges, we propose VL-Anomaly, a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt learning-driven alignment module that adapts Mask2Former’s visual features to CLIP text embeddings of known categories, effectively suppressing spurious anomaly responses in background regions. At inference time, we further introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity and detector confidence, enabling more reliable anomaly prediction by leveraging complementary information sources. Extensive experiments demonstrate that VL-Anomaly achieves state-of-the-art performance on benchmark datasets including RoadAnomaly, SMIYC, and Fishyscapes. Code will be released upon acceptance.
|
| |
| 15:00-16:30, Paper WeI2I.77 | Add to My Program |
| Online Velocity Estimation of a Robotic Fish Using Artificial Lateral Line System with Velocity-Decoupling Sensing Ability |
|
| He, Jiarui | Zhejiang University |
| Zhou, Yan | National University of Singapore |
| Zhang, Chengqian | Zhejiang University |
| Dai, Huangzhe | Zhejiang University |
| Tang, Daofan | Zhejiang University |
| Pan, Chengfeng | Zhejiang University |
| Zhao, Peng | Zhejiang University |
Keywords: Biologically-Inspired Robots, Soft Sensors and Actuators, Marine Robotics
Abstract: Robotic fish have attracted widespread research interest over the past few decades due to their outstanding agility and environmental friendliness, and the ability to sense underwater environments is crucial for accomplishing various underwater tasks. Inspired by the lateral line of real fish, many types of artificial lateral line (ALL) sensors have been proposed, including pressure-based and deformation-based sensors. However, ALL sensors currently mounted on robotic fish are susceptible to interference from the fish's self-motions, such as yaw and pitch, as well as from the unavoidable vortices around the body. To address these issues, a deformation-based magnetic ALL sensor capable of flow velocity-decoupling sensing is proposed, which can measure the swimming speed of the robotic fish while suppressing the aforementioned noise. In addition, an ALL array is designed and mounted on both sides of a robotic fish, enabling measurement of its swimming speed under both rectilinear and turning motion, with mean absolute errors (MAE) of 0.0153 m/s and 0.0125 m/s, respectively. Building on this, the ALL array is applied to trajectory estimation of the robotic fish, with trajectory-estimation MAEs of 0.0600 m and 0.0730 m under rectilinear and turning motion, respectively.
|
| |
| 15:00-16:30, Paper WeI2I.78 | Add to My Program |
| Class-Guided Network with Rare-Class Amplification for Sea State Estimation Based on Ship Motion Data |
|
| Xia, Wei | Tianjin University of Technology |
| Wang, Kexin | Tianjin University of Technology |
| Tian, Weiwei | Norwegian University of Science and Technology |
| Liu, Xiufeng | Technical University of Denmark |
| Shi, Fan | Tianjin University of Technology |
| Cheng, Xu | Smart Innovation Norway |
Keywords: Intelligent Transportation Systems, AI-Based Methods, Deep Learning Methods
Abstract: Accurate, real-time Sea State Estimation (SSE) is crucial for the safety and operational efficiency of Autonomous Surface Vessels (ASVs). However, existing deep learning methods for this task commonly face three major challenges: the inherent class imbalance of marine environments, the ambiguous boundaries between discrete sea state levels, and the difficulty of extracting multi-scale temporal features from vessel motion. To address these challenges, this paper proposes a novel framework named the Class-guided Rare-boosted Multi-Scale Net (CRUISE). The framework is built upon a multi-scale encoder-decoder architecture and integrates two key innovations: a Rare-Boosted Class Embedding (RBCE) module at the network's bottleneck and a class-guided decoding mechanism. The RBCE module first generates a preliminary class prediction and then dynamically enhances the representation of rare sea state classes to create a class-balanced conditional vector. This vector subsequently provides top-down guidance to the decoder, injecting class-aware information by modulating the feature reconstruction process. This synergistic design fundamentally addresses the data imbalance problem at the feature level and effectively sharpens the decision boundaries between easily confused transitional sea states. Extensive experiments on multiple public benchmarks, simulated ship motion, and real-world datasets demonstrate that CRUISE significantly outperforms existing state-of-the-art methods, showing a pronounced advantage in improving the recognition accuracy of rare and high-risk sea states. Furthermore, real-time inference tests on a physical model vessel validate the model's performance on edge computing devices, further confirming its feasibility and robustness for deployment in real-world marine environments.
|
| |
| 15:00-16:30, Paper WeI2I.79 | Add to My Program |
| ClustViT: Clustering-Based Token Merging for Semantic Segmentation |
|
| Montello, Fabio | Technical University of Denmark (DTU) |
| Güldenring, Ronja | Technical University of Denmark |
| Nalpantidis, Lazaros | Technical University of Denmark |
Keywords: Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization, Semantic Scene Understanding
Abstract: Vision Transformers can achieve high accuracy and strong generalization across various contexts, but their practical applicability on real-world robotic systems is limited due to their quadratic attention complexity. Recent works have focused on dynamically merging tokens according to the image complexity. Token merging works well for classification but is less suited to dense prediction. We propose ClustViT, where we expand upon the Vision Transformer (ViT) backbone and address semantic segmentation. Within our architecture, a trainable Cluster module merges similar tokens along the network guided by pseudo-clusters from segmentation masks. Subsequently, a Regenerator module restores fine details for downstream heads. Our approach achieves up to 2.18x fewer GFLOPs and 1.64x faster inference on three different datasets, with comparable segmentation accuracy. Our code and models are made publicly available.
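The merge-and-restore bookkeeping can be illustrated mechanically (the paper's Cluster module is trainable and guided by pseudo-clusters from segmentation masks; the sketch below shows only the averaging and regeneration steps, with all names ours):

import numpy as np

def merge_tokens(tokens, assign):
    # tokens: (N, D); assign: (N,) cluster ids in [0, M). Returns cluster means.
    M = int(assign.max()) + 1
    merged = np.zeros((M, tokens.shape[1]))
    np.add.at(merged, assign, tokens)                  # sum tokens per cluster
    counts = np.bincount(assign, minlength=M)[:, None]
    return merged / np.maximum(counts, 1)

def regenerate(merged, assign):
    # Broadcast merged tokens back to full resolution for the dense head.
    return merged[assign]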
|
| |
| 15:00-16:30, Paper WeI2I.80 | Add to My Program |
| GDP: Enhancing End-To-End Autonomous Driving with Goal-Driven Planner |
|
| Zhang, Qiming | The Hong Kong University of Science and Technology (Guangzhou) |
| Zhao, Yue | GAC R&D Center |
| Wang, Yujian | GAC R&D Center |
| Wang, Wei | GAC R&D Center |
| Yang, Zetong | GAC R&D Center |
| Xu, Wei | GAC R&D Center |
| Zhou, Yin | Waymo |
| Ma, Jun | The Hong Kong University of Science and Technology |
Keywords: Representation Learning, Integrated Planning and Learning, Motion and Path Planning
Abstract: End-to-end (E2E) autonomous driving has emerged as a promising paradigm, powered by increasingly capable model architectures and the availability of large-scale driving datasets. Despite tremendous recent research effort, most E2E driving frameworks rely on rather general driving commands, such as "Go Straight" or "Turn Left", which fail to encapsulate the complexities of nuanced driving behaviors and can lead to semantic ambiguity. Furthermore, such commands are not adequately translated into specific goal locations, which severely limits the planner's capacity to make informed, long-term decisions and hinders the integration of near-term trajectory planning with long-term goal achievement. To tackle these challenges, we propose the Goal-Driven Planner (GDP), which offers an appealing plug-and-play property. GDP leverages explicit goal points and incorporates two complementary learning objectives: (i) predicting a scene-aware long-term route to the goal, and (ii) refining the near-term trajectory through interaction with the long-term route. Extensive experiments on the nuScenes and NAVSIM datasets showcase the effectiveness of GDP. When integrated into off-the-shelf E2E autonomous driving frameworks such as UniAD, VAD-Tiny, and DiffusionDrive, GDP decreases L2 errors and collision rates in open-loop evaluation and also improves closed-loop metrics. These results highlight the strong generalization capability of GDP and its practical significance for enhancing planning reliability and safety in real-world autonomous driving systems.
|
| |
| 15:00-16:30, Paper WeI2I.81 | Add to My Program |
| LST-SLAM: A Stereo Thermal SLAM System for Kilometer-Scale Dynamic Environments |
|
| Jiang, Zeyu | The Hong Kong University of Science and Technology (Guangzhou) |
| Xu, Kuan | Nanyang Technological University |
| Chen, Changhao | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: SLAM, Localization
Abstract: Thermal cameras offer strong potential for robot perception under challenging illumination and weather conditions. However, thermal Simultaneous Localization and Mapping (SLAM) remains difficult due to unreliable feature extraction, unstable motion tracking, and inconsistent global pose and map construction, particularly in dynamic large-scale outdoor environments. To address these challenges, we propose LST-SLAM, a novel large-scale stereo thermal SLAM system that achieves robust performance in complex, dynamic scenes. Our approach combines self-supervised thermal feature learning, stereo dual-level motion tracking, and geometric pose optimization. We also introduce a semantic–geometric hybrid constraint that suppresses potentially dynamic features lacking strong inter-frame geometric consistency. Furthermore, we develop an online incremental bag-of-words model for loop closure detection, coupled with global pose optimization to mitigate accumulated drift. Extensive experiments on kilometer-scale dynamic thermal datasets show that LST-SLAM significantly outperforms recent representative SLAM systems, including AirSLAM and DROID-SLAM, in both robustness and accuracy.
|
| |
| 15:00-16:30, Paper WeI2I.82 | Add to My Program |
| SureGrip: Perceptual Grasping of Natural Handholds for Free-Climbing Robots |
|
| Panorel, Peter | Kyushu Institute of Technology |
| Goh, Khoon Chuan | Kyushu Institute of Technology |
| Nagaoka, Kenji | Kyushu Institute of Technology |
Keywords: Perception for Grasping and Manipulation, Grasping, Field Robots
Abstract: Exploration of steep and irregular terrains, such as lunar caves and vertical rock faces, requires free-climbing robots capable of identifying and securely grasping natural handholds. This study introduces SureGrip, a novel framework for detecting handholds and evaluating grasp quality in free-climbing robots. By integrating depth-based contour extraction with gripper-specific contact analysis, SureGrip accurately identifies candidate handholds and quantifies their suitability using the proposed grasp metrics. Experimental results confirm that the framework can reliably detect handhold locations, estimate surface slopes, and distinguish between secure and unsuitable grasps across a range of artificial and natural surfaces. The findings emphasize the importance of both the number and placement of spine fingers for stable attachment. SureGrip thus enables informed handhold selection, improving climbing safety and efficiency.
|
| |
| 15:00-16:30, Paper WeI2I.83 | Add to My Program |
| Robust Person Re-Identification for Service Robots Via One-Class Body-Part Transformer and Continual Learning |
|
| Aleman-Gallegos, Enrique | Bielefeld University |
| Wachsmuth, Sven | Bielefeld University |
Keywords: Human Detection and Tracking, Service Robotics, Continual Learning
Abstract: This work presents a robust person tracking and re-identification system designed for Human-Robot Interaction applications. The approach introduces the One-Class Body-Part (OCBP) Transformer, trained online to model interactions among body-part features and construct a robust target representation. To improve data association and reduce identity swaps during the tracking phase, the SORT tracker is extended with depth information in order to provide correct samples for the Online Continual Learning (OCL) setting. The transformer is further enhanced through the use of pseudo-negative samples, which accelerate convergence during the online learning phase. Ablation studies compare the performance of the memory management system using different sample insertion configurations and highlight the benefit of using pseudo-negative samples. The proposed method is evaluated on a public dataset, where it outperforms state-of-the-art approaches in challenging scenarios, and is validated in a real-world person-following experiment with a robotic platform in an environment with multiple distractors, occlusions, out-of-view situations and illumination changes. Despite these complexities, the robot consistently re-identified and followed the target individual. Runtime analysis demonstrates that the system operates reliably on embedded computing platforms with NVIDIA GPUs, making it both robust and resource-efficient for real-world deployment.
|
| |
| 15:00-16:30, Paper WeI2I.84 | Add to My Program |
| Estimating Human Muscular Fatigue in Dynamic Collaborative Robotic Tasks with Learning-Based Models |
|
| Kiki, Feras | Koc University |
| Pourakbarian Niaz, Pouya | University of Innsbruck |
| Madani, Alireza | McGill University |
| Basdogan, Cagatay | Koc University |
Keywords: Physical Human-Robot Interaction, Human-Robot Collaboration
Abstract: Assessing human muscle fatigue is critical for optimizing performance in physical human–robot interaction (pHRI) tasks and mitigating safety risks for the human operator. This paper presents a data-driven framework for estimating muscle fatigue in dynamic pHRI tasks using surface electromyography (sEMG) sensors attached to the human arm. Subject-specific machine learning (ML) regression models were developed to estimate fatigue during cyclic (i.e., repetitive) pHRI tasks. These models were trained to estimate the fraction of cycles to fatigue (FCF) from EMG features, and their performance was compared with a CNN model that processes spectrogram representations of the EMG signals. Unlike most earlier data-driven approaches, which primarily formulated fatigue estimation as a classification problem, our method models the continuous progression of fatigue through regression, enabling tracking of gradual physiological changes rather than discrete states, which is critical for timely intervention and adaptive control in dynamic pHRI tasks. Experiments were conducted with ten participants who interacted with a collaborative robot operated under an admittance controller, performing lateral (left-right) cyclic movements of the end effector until the onset of muscular fatigue. The results demonstrate that the root mean square error (RMSE) of FCF estimation across participants was 20.8 ± 4.3%, 23.3 ± 3.8%, 24.8 ± 4.5%, and 26.9 ± 6.1% for the CNN, Random Forest, XGBoost, and Linear Regression models, respectively. To examine cross-task generalization, additional experiments were performed with one participant who executed vertical and circular repetitive movements. Models trained solely on the lateral-movement data were directly tested on these unseen tasks. The results indicate that the proposed models are robust to variations in movement direction, arm kinematics, and muscle recruitment patterns, whereas the Linear Regression model performed poorly.
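A minimal sketch of the regression setup, using common sEMG window features and a Random Forest (the paper's exact feature set and hyperparameters are not reproduced here):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def emg_features(win):
    # Typical sEMG features per window; illustrative, not the paper's list.
    return np.array([np.sqrt(np.mean(win ** 2)),       # RMS
                     np.mean(np.abs(win)),             # mean absolute value
                     np.sum(np.abs(np.diff(win)))])    # waveform length

def train_fcf_regressor(windows, fcf_targets):
    # Regress the fraction of cycles to fatigue (FCF, in [0, 1]) per window.
    X = np.stack([emg_features(w) for w in windows])
    return RandomForestRegressor(n_estimators=200).fit(X, fcf_targets)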
|
| |
| 15:00-16:30, Paper WeI2I.85 | Add to My Program |
| Unleashing the Power of Discrete-Time State Representation: Ultrafast Target-Based IMU-Camera Spatial-Temporal Calibration |
|
| Song, Junlin | University of Luxembourg |
| Richard, Antoine | University of Luxembourg |
| Olivares-Mendez, Miguel A. | Interdisciplinary Centre for Security, Reliability and Trust - University of Luxembourg |
|
|
| |
| 15:00-16:30, Paper WeI2I.86 | Add to My Program |
| Drifting in the Future: Stabilizing Path Following Drifting on High-Latency Vehicle Systems |
|
| Werner, Frederik | Technische Universität München |
| Heintzenberg, Till | Technische Universität München |
| Lienkamp, Markus | Technical University of Munich |
| Betz, Johannes | Technical University of Munich |
Keywords: Field Robots, Autonomous Agents, Motion Control
Abstract: Autonomously controlling and handling a vehicle at and beyond its stability limit is a mathematically and computationally demanding task. Prior demonstrations of automated drifting have been limited to research platforms with instantaneous torque delivery and independently actuated wheels, leaving their applicability to production vehicles with actuator latencies and mechanically coupled axles uncertain. To overcome these issues, we design a predictor to compensate for powertrain delays, develop a revised control formulation to accommodate higher actuation latencies as well as a differential coupling, and introduce brake-based velocity stabilization. This paper presents the controller framework, the model extensions, and real-world experimental results. We observe that our controller enables a production sports car with a combustion engine to robustly sustain circular and figure-eight drifts, limiting lateral error to 1.1 m and sideslip overshoot to 0.06 rad despite actuator delays exceeding 250 ms, while mitigating oscillations and maintaining stable path and sideslip tracking. In conclusion, our results establish that autonomous drifting is feasible on production-ready vehicles, opening pathways to advanced safety systems capable of stabilizing cars in scenarios where traditional control fails.
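The delay-compensation idea amounts to forward-simulating the vehicle model through the queue of commands that have been sent but not yet realized by the powertrain; a minimal sketch (the model f, state representation, and step size are placeholders, not the authors' formulation):

import numpy as np

def predict_state(x, pending_commands, f, dt):
    # Roll the model forward over the actuator dead time so the controller
    # computes its command for the state at which it will actually take effect.
    for u in pending_commands:            # commands in flight, oldest first
        x = x + dt * np.asarray(f(x, u))  # one explicit Euler step per period
    return x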
|
| |
| 15:00-16:30, Paper WeI2I.87 | Add to My Program |
| Model-Free Subsurface Anomaly Detection Using Subspace Analysis Techniques for Sparse Telemetry for Extraterrestrial Drilling Robots |
|
| Boelter, Sarah | University of Minnesota |
| Brown, Greta | University of Minnesota |
| Temesgen, Ebasa | University of Minnesota |
| Weber, Lucas | Friedrich-Alexander-Universität Erlangen-Nürnberg |
| Stucky, Thomas | KBR Wyle Services, LLC |
| Glass, Brian | NASA Ames Research Center |
| Gini, Maria | University of Minnesota |
Keywords: Field Robots, Space Robotics and Automation, Mining Robotics
Abstract: In extraterrestrial planetary environments, computing, energy, and environmental constraints require robotic agents to complete tasks unsupervised. For specialized extraterrestrial robotic drilling agents, there is no broadly applicable solution for detecting drilling faults as they happen, before a fault escalates to hardware failure. We build upon previous work with time-series subspace analysis methods to estimate drilling faults using drill avionics telemetry. This work introduces a subsurface anomaly detection method for planetary drilling robots and further evaluates the robustness of our time-series subspace analysis method. We implemented this novel fault and anomaly detection method on an extraterrestrial drilling robot and evaluated it first in a controlled lab environment with composite materials and then at a Mars planetary analog site in the Canadian High Arctic.
|
| |
| 15:00-16:30, Paper WeI2I.88 | Add to My Program |
| HMC: Learning Heterogeneous Meta-Control for Contact-Rich Loco-Manipulation |
|
| Wei, Lai | The Chinese University of Hong Kong, Shenzhen |
| Peng, Xuanbin | University of California, San Diego |
| Qiu, Ri-Zhao | University of California, San Diego |
| Huang, Tianshu | University of California, San Diego |
| Cheng, Xuxin | University of California, San Diego |
| Wang, Xiaolong | UC San Diego |
|
|
| |
| 15:00-16:30, Paper WeI2I.89 | Add to My Program |
| A Self-Rotating Tri-Rotor UAV for Field-Of-View Expansion and Autonomous Flight |
|
| Zhou, Xiaobin | Nanjing University |
| Zheng, Zihao | Nanjing University |
| Jin, Aoxu | Nanjing University |
| Qiang, Lei | Nanjing University |
| Zhu, Bo | Nanjing University |
Keywords: Aerial Systems: Mechanics and Control, Motion Control, Mechanism Design
Abstract: Unmanned Aerial Vehicle (UAV) perception relies on onboard sensors such as cameras and LiDAR, which are limited by a narrow field of view (FoV). We present the Self-Perception INertial Navigation Enabled Rotorcraft (SPINNER), a self-rotating tri-rotor UAV for FoV expansion and autonomous flight. Without adding extra sensors or energy consumption, SPINNER significantly expands the FoV of the onboard camera and LiDAR through continuous spin motion, thereby enhancing environmental perception efficiency. SPINNER achieves full 3-dimensional position and roll-pitch attitude control using only three brushless motors, while adjusting the rotation speed via an anti-torque plate design. To address the strong coupling, severe nonlinearity, and complex disturbances induced by spinning flight, we develop a disturbance compensation control framework that combines nonlinear model predictive control (MPC) with incremental nonlinear dynamic inversion. Experimental results demonstrate that SPINNER maintains robust flight under wind disturbances up to 4.8 m/s and achieves high-precision trajectory tracking at a maximum speed of 2.0 m/s. Moreover, tests in parking garages and forests show that the rotational perception mechanism substantially improves FoV coverage and enhances SPINNER's perception capability.
|
| |
| 15:00-16:30, Paper WeI2I.90 | Add to My Program |
| Spike-IMU: An Accurate and Low-Power Spiking Neural Network for Pedestrian Velocity Estimation |
|
| Zou, Junye | Tsinghua University |
| Li, Xiaolei | Beijing Information Science and Technology University |
| Meng, Ziyang | Tsinghua University |
| Li, Guoqi | Institute of Automation, Chinese Academy of Sciences |
Keywords: Localization, Bioinspired Robot Learning
Abstract: Accurate pedestrian navigation on edge devices is a critical problem. While artificial neural networks (ANNs) have been shown to solve this problem with acceptable accuracy, their energy consumption limits applications on low-power computation platforms. Spiking neural networks (SNNs) are promising alternatives, but their applicability to noisy, high-frequency IMU data is hindered by two key issues: information loss during spike encoding and simplistic neuron dynamics that fail to capture complex motion. This paper introduces Spike-IMU, an SNN-based velocity estimation network designed to overcome these issues for the pedestrian navigation problem. In particular, a dynamic spiking neuron (DSN) is introduced based on an integer firing mechanism. In addition, a temporal feature fusion spike encoder (TFFSE) and a dynamic spiking long short-term memory network (DSLSTM) are proposed to encode and process IMU data into spike sequences. Our experiments on the RoNIN dataset show that Spike-IMU surpasses classical ANNs, reducing positioning error by 20% while consuming 70.3% less energy. This work demonstrates a novel pipeline for designing SNNs that achieve both superior accuracy and energy efficiency, pushing IMU-based pedestrian navigation toward real-world low-power edge devices.
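One plausible reading of an integer firing mechanism (our simplified interpretation, not the paper's exact DSN) is a leaky integrator that may emit several spikes per step instead of a binary spike, reducing the temporal resolution needed for high-frequency IMU input:

import numpy as np

class IntegerFiringNeuron:
    # Simplified sketch: membrane potential integrates input with leak tau,
    # and the emitted spike count per step is floor(v / v_th), capped.
    def __init__(self, tau=0.9, v_th=1.0, max_spikes=4):
        self.tau, self.v_th, self.max_spikes = tau, v_th, max_spikes
        self.v = 0.0

    def step(self, x):
        self.v = self.tau * self.v + x
        s = int(np.clip(np.floor(self.v / self.v_th), 0, self.max_spikes))
        self.v -= s * self.v_th          # soft reset by the emitted charge
        return s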
|
| |
| 15:00-16:30, Paper WeI2I.91 | Add to My Program |
| Depth Completion by Rescaling Monocular Depth Estimates Via Compressed Sensing |
|
| Zhong, Daoxin | Agency for Science, Technology and Research (A*STAR) |
| Li, Jun | Institute for Infocomm Research |
| Thadimari, Yeshas | Institute for Infocomm Research, A*STAR |
| Chuah, Meng Yee (Michael) | Agency for Science, Technology and Research (A*STAR) |
Keywords: RGB-D Perception, Range Sensing
Abstract: Depth completion is the challenge of recovering a dense depth map from an RGB image and corresponding sparse depth measurements. Many modern depth completion strategies often rely on deep neural networks, using a monocular depth estimation (MDE) backbone to generate an initial dense depth map from the RGB image. This estimate is then further refined with the help of auxiliary network components that utilise the sparse depth measurements to improve accuracy and restore fine-grained depth details. However, such approaches introduce additional model parameters and require domain-specific fine-tuning, making them impractical for resource-constrained robotics applications. In this paper, we propose an alternative refinement strategy based on compressed sensing. Using the Discrete Cosine Transform (DCT) as our basis, we construct a ratio matrix that rescales the estimated depth map to align with measured ground truth data. Our experiments demonstrate that this method can significantly reduce the RMSE and MAE of the initial MDE estimate by more than a factor of 15. Furthermore, the proposed approach can outperform state-of-the-art depth completion models at sampling ratios above 50 percent, while also substantially reducing the overall GPU VRAM requirements. This pipeline is modular and compatible with any existing MDE model with no additional training, making it particularly suitable for deployment on GPU-constrained robotic platforms in previously unseen environments.
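The ratio-matrix construction can be sketched as a small least-squares problem over low-frequency DCT coefficients (the grid size K, regularization, and multiplicative parameterization are our assumptions):

import numpy as np
from scipy.fft import idctn

def rescale_depth(mde, sparse_depth, sparse_mask, K=8, lam=1e-2):
    # Fit K*K low-frequency 2-D DCT coefficients of a multiplicative ratio map
    # so that ratio * mde matches the sparse measurements in least squares.
    H, W = mde.shape
    ys, xs = np.nonzero(sparse_mask)
    cols = []
    for u in range(K):
        for v in range(K):
            c = np.zeros((H, W)); c[u, v] = 1.0
            basis = idctn(c, norm='ortho')           # one 2-D DCT basis image
            cols.append(basis[ys, xs] * mde[ys, xs])
    A = np.stack(cols, axis=1)                       # (num_samples, K*K)
    b = sparse_depth[ys, xs]
    coef = np.linalg.solve(A.T @ A + lam * np.eye(K * K), A.T @ b)
    full = np.zeros((H, W)); full[:K, :K] = coef.reshape(K, K)
    return mde * idctn(full, norm='ortho')           # rescaled dense depth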
|
| |
| 15:00-16:30, Paper WeI2I.92 | Add to My Program |
| Constructing and Navigating Connected Air Roads: A Safety-Critical Reinforcement Learning Approach for Multi-UAV Systems |
|
| Qi, Qihan | Sichuan University |
| Xia, Haojie | Sichuan University |
| Yang, Xinsong | Sichuan University |
| Lu, Jianquan | Southeast University |
| Ju, Xingxing | Sichuan University |
Keywords: Reinforcement Learning, Distributed Robot Systems
Abstract: This paper presents an integrated control method for air road navigation in multi-UAV systems, combining an efficient reinforcement learning (RL) controller with a control barrier function (CBF)-based filter that guarantees flight safety. First, an air road construction method based on arbitrary quadrilateral combinations is proposed, enabling flexible air road design. Second, two specific CBFs are designed: an air road CBF, which keeps UAVs within the designed air roads, and a collision avoidance CBF, which prevents collisions between UAVs. With the CBF-based filter in place, the RL controller can be trained in a simple, single-agent environment, which reduces computational cost and enhances training efficiency. Furthermore, the RL reward is carefully designed to account for both stability during movement and energy-optimal operation. The performance, safety, and efficiency of the proposed approach are rigorously validated through comprehensive simulations and real-world experiments.
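The safety filter described above is, in essence, a small quadratic program that minimally modifies the RL action; a sketch for single-integrator UAV dynamics (the dynamics, gains, and static-neighbor assumption are ours, not the paper's):

import numpy as np
import cvxpy as cp

def cbf_filter(p, u_rl, i, d_min=1.0, alpha=2.0):
    # p: (N, 3) UAV positions; u_rl: (3,) desired velocity for agent i.
    # Collision CBF: h_ij = ||p_i - p_j||^2 - d_min^2, enforce h_dot >= -alpha*h.
    u = cp.Variable(3)
    cons = []
    for j in range(len(p)):
        if j == i:
            continue
        dp = p[i] - p[j]
        h = float(dp @ dp) - d_min ** 2
        cons.append(2 * dp @ u >= -alpha * h)   # neighbor assumed static
    prob = cp.Problem(cp.Minimize(cp.sum_squares(u - u_rl)), cons)
    prob.solve()
    return u.value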
|
| |
| 15:00-16:30, Paper WeI2I.93 | Add to My Program |
| Real-Time Motion Segmentation with Event-Based Normal Flow |
|
| Zhong, Sheng | Hunan University |
| Ren, Zhongyang | Northwestern Polytechnical University |
| Zhu, Xiya | Hunan University |
| Yuan, Dehao | University of Maryland, College Park |
| Fermuller, Cornelia | University of Maryland |
| Zhou, Yi | Hunan University |
Keywords: Object Detection, Segmentation and Categorization
Abstract: Event-based cameras are bio-inspired sensors whose pixels independently and asynchronously respond to brightness changes at microsecond resolution, offering the potential to handle visual tasks in challenging scenarios. However, due to the sparse information content of individual events, directly processing raw event data is highly inefficient, which severely limits the applicability of state-of-the-art methods to real-time tasks such as motion segmentation, a fundamental task for dynamic scene understanding. Incorporating normal flow as an intermediate representation to compress motion information from event clusters within a localized region provides a more effective solution. In this work, we propose a normal flow-based motion segmentation framework for event-based vision. Leveraging dense normal flow learned directly from event neighborhoods as input, we formulate motion segmentation as an energy minimization problem solved via graph cuts and optimize it iteratively with normal flow clustering and motion model fitting. By using a normal flow-based motion model initialization and fitting method, the proposed system efficiently estimates the motion models of independently moving objects with only a limited number of candidate models, which significantly reduces computational complexity and ensures real-time performance, achieving nearly an 800x speedup over the open-source state-of-the-art method. Extensive evaluations on multiple public datasets demonstrate the accuracy and efficiency of our framework. Our code will be open-sourced to facilitate further research in this field.
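Stripped of the graph-cut smoothness term, the clustering-and-fitting alternation can be sketched as follows (affine motion models, the feature encoding, and the initialization are our simplifying assumptions):

import numpy as np

def nf_rows(X, N):
    # Row k encodes n_k^T (A x_k + b) as a linear function of [vec(A), b].
    return np.stack([np.concatenate([np.kron(n, x), n]) for x, n in zip(X, N)])

def segment_normal_flow(X, N, v, K=3, iters=10, seed=0):
    # X: (P, 2) pixel coords, N: (P, 2) unit normals, v: (P,) normal-flow speeds.
    rng = np.random.default_rng(seed)
    rows = nf_rows(X, N)                              # (P, 6)
    labels = rng.integers(K, size=len(X))
    for _ in range(iters):
        models = []
        for k in range(K):
            idx = np.nonzero(labels == k)[0]
            if len(idx) < 6:                          # guard small clusters
                idx = rng.integers(len(X), size=6)
            theta, *_ = np.linalg.lstsq(rows[idx], v[idx], rcond=None)
            models.append(theta)
        res = np.abs(rows @ np.stack(models).T - v[:, None])
        labels = res.argmin(axis=1)                   # reassign by residual
    return labels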
|
| |
| 15:00-16:30, Paper WeI2I.94 | Add to My Program |
| GauSem-SLAM: Gaussian Semantic Submaps with Loop Closure for Globally Consistent SLAM |
|
| Zhang, Bowen | Hebei University |
| Liu, Yufan | UC Berkeley |
| Liang, Lebin | University of Chinese Academy of Sciences |
| Li, Dong | University of Macau |
| Li, Mingrui | Dalian University of Technology |
| Zhang, Xuanxuan | Wuhan University |
Keywords: SLAM, Localization, Mapping
Abstract: 3DGS has shown outstanding performance in multi-view geometry, driving its adoption in visual SLAM. However, real-time semantic 3DGS mapping faces challenges. Current methods typically treat semantics as external priors, making it hard to integrate them into SLAM tracking or loop closure correction. Moreover, traditional semantic SLAM corrects accumulated drift by applying rigid adjustments to dense point clouds, which is costly for 3DGS maps and limits loop closure performance. We propose GauSem-SLAM, which uses a Gaussian semantic submap representation with a progressive allocation strategy, integrating semantics into tracking, mapping, loop detection, and submap management. We fully exploit semantic information by designing a robust loop detection module that combines DINOv2 semantic features with 3D semantic landmarks. Furthermore, we introduce Semantic-Guided Registration (SGR), a method for computing inter-submap loop constraints. Through intra-submap and inter-submap loop correction, followed by a two-stage global map refinement, our system achieves globally consistent pose estimation and mapping. Experiments on three public datasets demonstrate that our method outperforms prior methods in both tracking and mapping.
|
| |
| 15:00-16:30, Paper WeI2I.95 | Add to My Program |
| Grasp-MPC: Closed-Loop Visual Grasping Via Value-Guided Model Predictive Control |
|
| Yamada, Jun | University of Oxford |
| Murali, Adithyavairavan | NVIDIA |
| Mandlekar, Ajay Uday | NVIDIA |
| Eppner, Clemens | N/A |
| Posner, Ingmar | Oxford University |
| Sundaralingam, Balakumar | NVIDIA Corporation |
Keywords: Grasping, Deep Learning in Grasping and Manipulation, Integrated Planning and Learning
Abstract: Grasping of diverse objects in unstructured environments remains a significant challenge. Open-loop grasping methods, effective in controlled settings, struggle in cluttered environments. Grasp prediction errors and object pose changes during grasping are the main causes of failure. In contrast, closed-loop methods address these challenges in simplified settings (e.g., single object on a table) on a limited set of objects, with no path to generalization. We propose Grasp-MPC, a closed-loop 6-DoF vision-based grasping policy designed for robust and reactive grasping of novel objects in cluttered environments. Grasp-MPC incorporates a value function, trained on visual observations from a large-scale synthetic dataset of 2 million grasp trajectories that include successful and failed attempts. We deploy this learned value function in an MPC framework in combination with other cost terms that encourage collision avoidance and smooth execution. We evaluate Grasp-MPC on FetchBench and real-world settings across diverse environments. Grasp-MPC improves grasp success rates by up to 32.6% in simulation and 33.3% in real-world noisy conditions, outperforming open-loop, diffusion policy, transformer policy, and IQL approaches. Videos and more at http://grasp-mpc.github.io.
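A schematic of value-guided sampling MPC (the trajectory parameterization, cost weights, and both callables are placeholders; the paper's value function is a learned network over visual observations):

import numpy as np

def plan_grasp_trajectory(x0, value_fn, collision_cost,
                          horizon=10, samples=256, sigma=0.02):
    # Sample smooth perturbations of the end-effector trajectory and score
    # each by collision cost plus smoothness minus the learned grasp value.
    base = np.tile(x0, (horizon, 1))
    noise = sigma * np.cumsum(np.random.randn(samples, horizon, x0.size), axis=1)
    cand = base[None] + noise
    costs = np.array([collision_cost(t) - value_fn(t)
                      + 0.1 * np.square(np.diff(t, axis=0)).sum()
                      for t in cand])
    return cand[int(costs.argmin())]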
|
| |
| 15:00-16:30, Paper WeI2I.96 | Add to My Program |
| Unreal Robotics Lab: A High-Fidelity Robotics Simulator with Advanced Physics and Rendering |
|
| Embley-Riches, Jonathan | University College London |
| Liu, Jianwei | University College London |
| Julier, Simon | University College London |
| Kanoulas, Dimitrios | University College London |
Keywords: Simulation and Animation, Software Tools for Benchmarking and Reproducibility, Data Sets for Robot Learning
Abstract: High-fidelity simulation is essential for robotics research, enabling safe and efficient testing of perception, control, and navigation algorithms. However, achieving both photorealistic rendering and accurate physics modeling remains a challenge. This paper presents a novel simulation framework, the Unreal Robotics Lab (URL), that integrates the advanced rendering capabilities of the Unreal Engine with MuJoCo’s high-precision physics simulation. Our approach enables realistic robotic perception while maintaining accurate physical interactions, facilitating benchmarking and dataset generation for vision-based robotics applications. The system supports complex environmental effects, such as smoke, fire, and water dynamics, which are critical to evaluating robotic performance under adverse conditions. We benchmark visual navigation and SLAM methods within our framework, demonstrating its utility for testing real-world robustness in controlled yet diverse scenarios. By bridging the gap between physics accuracy and photorealistic rendering, our framework provides a powerful tool for advancing robotics research and sim-to-real transfer. Our open-source framework is available at https://unrealroboticslab.github.io.
|
| |
| 15:00-16:30, Paper WeI2I.97 | Add to My Program |
| HUNT: High-Speed UAV Navigation and Tracking in Unstructured Environments Via Instantaneous Relative Frames |
|
| Saviolo, Alessandro | New York University |
| Mao, Jeffrey | New York University |
| Loianno, Giuseppe | New York University |
Keywords: Aerial Systems: Applications, Aerial Systems: Perception and Autonomy, Search and Rescue Robots
Abstract: Search and rescue operations require unmanned aerial vehicles to both traverse unknown unstructured environments at high speed and track targets once detected. Achieving both capabilities under degraded sensing and without global localization remains an open challenge. Recent works on relative navigation have shown robust tracking by anchoring planning and control to a visible detected object, but cannot address navigation when no target is in the field of view. We present HUNT: High-speed UAV Navigation and Tracking, a real-time framework that unifies traversal, acquisition, and tracking within a single relative formulation. HUNT defines navigation objectives directly from onboard instantaneous observables such as attitude, altitude, and velocity, enabling reactive high-speed flight during search. Once a target is detected, the same perception–control pipeline transitions seamlessly to tracking. Outdoor experiments in dense forests, container compounds, and search-and-rescue operations with vehicles and mannequins demonstrate robust autonomy where global methods fail. Video: https://youtu.be/YsSflqPPHhs
|
| |
| 15:00-16:30, Paper WeI2I.98 | Add to My Program |
| Optical Flow Estimation Using Speck Neuromorphic Hardware |
|
| Singh, Manupriya | TU Delft |
| Ou, Dequan | TU Delft |
| Hagenaars, Jesse J. | TU Delft |
| de Croon, Guido C. H. E. | TU Delft |
Keywords: Sensor-based Control, Computer Vision for Automation, Aerial Systems: Perception and Autonomy
Abstract: Neuromorphic hardware and spiking neural networks (SNNs) offer a bio-inspired path to low-latency, energy-efficient computation by emulating the brain’s asynchronous, spike-based processing. This is particularly attractive for resource-constrained robots that are tightly limited in size, weight, and power. We propose a neuromorphic approach to real-time optical flow estimation tailored to the SynSense Speck system-on-chip, which integrates a Dynamic Vision Sensor (DVS) with a neuromorphic processor. Our inference architecture combines spiking and artificial neural layers in a hybrid SNN–ANN framework, enabling the use of Speck to perform regression for closed-loop drone control, an application not previously demonstrated on this chip. Despite its compact form factor, the system produces dense flow in real time and achieves stable indoor hover and forward flight using flow-based control. The hybrid pipeline runs ~2x faster than an ANN-only baseline at identical power, highlighting the promise of neuromorphic sensing and processing for ultra-efficient autonomous flight in real-world scenarios. Code and data are available at: https://mavlab.tudelft.nl/speck-optical-flow
|
| |
| 15:00-16:30, Paper WeI2I.99 | Add to My Program |
| LAD-VF: LLM-Automatic Differentiation Enables Fine-Tuning-Free Robot Planning from Formal Methods Feedback |
|
| Yang, Yunhao | The University of Texas at Austin |
| Hong, Junyuan | The University of Texas at Austin |
| Perin, Gabriel Jacob | University of São Paulo |
| Fan, Zhiwen | The University of Texas at Austin |
| Yin, Li | SylphAI |
| Wang, Zhangyang (Atlas) | The University of Texas at Austin |
| Topcu, Ufuk | The University of Texas at Austin |
Keywords: Formal Methods in Robotics and Automation, Hybrid Logical/Dynamical Planning and Verification, Task Planning
Abstract: Large language models (LLMs) can translate natural language instructions into executable action plans for robotics, autonomous driving, and other domains. Yet, deploying LLM-driven planning in the physical world demands strict adherence to safety and regulatory constraints, which current models often violate due to hallucination or weak alignment. Traditional data-driven alignment methods, such as Direct Preference Optimization (DPO), require costly human labeling, while recent formal-feedback approaches still depend on resource-intensive fine-tuning. In this paper, we propose LAD-VF, a fine-tuning-free framework that leverages formal verification feedback for automated prompt engineering. By introducing a formal-verification-informed text loss integrated with LLM-AutoDiff, LAD-VF iteratively refines prompts rather than model parameters. This yields three key benefits: (i) scalable adaptation without fine-tuning; (ii) compatibility with modular LLM architectures; and (iii) interpretable refinement via auditable prompts. Experiments in robot navigation and manipulation tasks demonstrate that LAD-VF substantially enhances specification compliance, improving success rates from 60% to over 90%. Our method thus presents a scalable and interpretable pathway toward trustworthy, formally-verified LLM-driven control systems.
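The overall loop, with every component stubbed, looks roughly like this (LAD-VF's actual text loss and LLM-AutoDiff machinery are not reproduced; llm and verify are placeholder callables):

def refine_prompt(prompt, llm, verify, max_iters=5):
    # Iteratively fold formal-verification counterexamples back into the
    # prompt instead of updating model weights.
    plan = llm(prompt)
    for _ in range(max_iters):
        ok, counterexample = verify(plan)     # formal-methods feedback
        if ok:
            return prompt, plan
        prompt += ("\nThe previous plan violated: " + str(counterexample)
                   + ". Revise the plan to satisfy this constraint.")
        plan = llm(prompt)
    return prompt, plan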
|
| |
| 15:00-16:30, Paper WeI2I.100 | Add to My Program |
| MMD-SLAM: Structure-Enhanced Multi-Meta Gaussian Distribution-Guided Visual SLAM |
|
| Zhu, Fan | University of Science and Technology of China |
| Chen, Ziyu | University of Science and Technology of China |
| Liu, Peichen | Aarhus University |
| Zhao, Yifan | University of Science and Technology of China |
| Xu, Zhisong | The University of Tokyo |
| Zhu, Hui | Hefei Institutes of Physical Science, Chinese Academy of Sciences |
| Zhou, Hongxing | Beijing University of Chemical Technology |
| Liu, Sixun | North China Electric Power University |
| Jiang, Chunmao | University of Science and Technology of China |
|
|
| |
| 15:00-16:30, Paper WeI2I.101 | Add to My Program |
| Best of Sim and Real: Decoupled Visuomotor Manipulation Via Learning Control in Simulation and Perception in Real |
|
| Huang, Jialei | Tsinghua University |
| Yin, Zhao-Heng | University of California, Berkeley |
| Hu, Yingdong | Tsinghua University |
| Wang, Shuo | University of Electronic Science and Technology of China |
| Lin, Xingyu | UC Berkeley |
| Gao, Yang | Tsinghua University |
Keywords: Dexterous Manipulation, Simulation and Animation
Abstract: Sim-to-real transfer remains a fundamental challenge in robot manipulation due to the entanglement of perception and control in end-to-end learning. We present a decoupled framework that learns each component where it is most reliable: control policies are trained in simulation with privileged state to master spatial layouts and manipulation dynamics, while perception is adapted only at deployment to bridge real observations to the frozen control policy. Our key insight is that control strategies and action patterns are universal across environments and can be learned in simulation through systematic randomization, while perception is inherently domain-specific and must be learned where visual observations are authentic. Unlike existing end-to-end approaches that require extensive real-world data, our method achieves strong performance with only 10-20 real demonstrations by reducing the complex sim-to-real problem to a structured perception alignment task. We validate our approach on tabletop manipulation tasks, demonstrating superior data efficiency and out-of-distribution generalization compared to end-to-end baselines. The learned policies successfully handle object positions and scales beyond the training distribution, confirming that decoupling perception from control fundamentally improves sim-to-real transfer.
|
| |
| 15:00-16:30, Paper WeI2I.102 | Add to My Program |
| Variable Stiffness Soft Robotic Arm with Positive-Pressure Layer Jamming for Enhanced Load Capacity |
|
| Wu, Zekai | University of Chinese Academy of Sciences; Shenyang Institute of Automation, Chinese Academy of Sciences |
| Fu, Xin | Shenyang Institute of Automation Chinese Academy of Sciences |
| Zhang, Daohui | Shenyang Institute of Automation, Chinese Academy of Sciences |
| Zhao, Xingang | Shenyang Institute of Automation, Chinese Academy of Sciences |
|
|
| |
| 15:00-16:30, Paper WeI2I.103 | Add to My Program |
| MUJICA: Multi-Skill Unified Joint Integration of Control Architecture for Wheeled-Legged Robots |
|
| Li, Yuqi | Fudan University |
| Zhai, Peng | Fudan University |
| Zhang, Yueqi | Fudan University |
| Wei, Xiaoyi | Fudan University |
| Qian, Quancheng | Fudan University |
| He, Zhengxu | Power China Huadong Engineering Corporation Limited |
| Yu, Qianxiang | Power China Huadong Engineering Corporation Limited |
| Zhang, Lihua | Fudan University |
Keywords: Legged Robots, Reinforcement Learning
Abstract: Wheeled-legged robots hold promise for traversing complex terrains and offer superior mobility compared to legged robots. However, wheeled-legged robots must effectively balance wheeled driving and legged control. Furthermore, due to noisy proprioceptive sensing and real-world motor constraints, realizing robust and adaptive locomotion at the motors' peak performance remains challenging. We propose the Multi-skill Unified Joint Integration of Control Architecture (MUJICA), a unified, fully proprioceptive control framework for wheeled-legged robots that integrates diverse low-level skills, including omnidirectional movement, high-platform climbing, and fall recovery, within a single policy. All skills, distinguished by unique indicator variables, are trained jointly with accurate DC-motor constraint modeling. Additionally, a high-level skill selector is learned to dynamically choose the optimal skill based solely on proprioception, enabling adaptive responses to the surrounding environment. MUJICA thus enhances sim-to-real robustness and enables seamless transitions across diverse locomotion modes, facilitating autonomous adjustment to the environment. We validate our framework in both simulation and real-world experiments on the Unitree Go2-W robot, demonstrating significant improvements in adaptability and task success in unstructured environments.
|
| |
| 15:00-16:30, Paper WeI2I.104 | Add to My Program |
| DiffPlace: Street View Generation Via Place-Controllable Diffusion Model Enhancing Place Recognition |
|
| Li, Ji | The University of Hong Kong |
| Li, Zhiwei | Beijing Institute of Technology |
| Li, ShiHao | Shandong Jianzhu University |
| Yu, ZhenJiang | Beijing Institute of Technology |
| Wang, Boyang | Beijing Institute of Technology |
| Liu, Haiou | Beijing Institute of Technology |
Keywords: Computer Vision for Transportation, Visual Learning, Localization
Abstract: Generative models have advanced significantly in realistic image synthesis, with diffusion models excelling in quality and stability. Recent multi-view diffusion models improve 3D-aware street view generation, but they struggle to produce place-aware and background-consistent urban scenes from text, BEV maps, and object bounding boxes. This limits their effectiveness in generating realistic samples for place recognition tasks. To address these challenges, we propose DiffPlace, a novel framework that introduces a place-ID controller to enable place-controllable multi-view image generation. The place-ID controller employs a linear projection, a perceiver transformer, and contrastive learning to map place-ID embeddings into a fixed CLIP space, allowing the model to synthesize images with consistent background buildings while flexibly modifying foreground objects and weather conditions. Extensive experiments, including quantitative comparisons and augmented-training evaluations, demonstrate that DiffPlace outperforms existing methods in both generation quality and training support for visual place recognition. Our results highlight the potential of generative models for scene-level, place-aware synthesis, providing a valuable approach to improving place recognition. Code will be released.
|
| |
| 15:00-16:30, Paper WeI2I.105 | Add to My Program |
| IMPASTO: Integrating Model-Based Planning with Learned Dynamics Models for Robotic Oil Painting Reproduction |
|
| Wang, Yingke | Stanford University |
| Li, Hao | Stanford University |
| Zhu, Yifeng | The University of Texas at Austin |
| Yu, Hong-Xing | Stanford University |
| Goldberg, Ken | UC Berkeley |
| Fei-Fei, Li | Stanford University |
| Wu, Jiajun | Stanford University |
| Li, Yunzhu | Columbia University |
| Zhang, Ruohan | Stanford University |
Keywords: Art and Entertainment Robotics, Machine Learning for Robot Control, Model Learning for Control
Abstract: Robotic reproduction of oil paintings using soft brushes and pigments requires force-sensitive control of deformable tools, prediction of brushstroke effects, and multi-step stroke planning, often without human step-by-step demonstrations or faithful simulators. Given only a sequence of target oil painting images, can a robot infer and execute the stroke trajectories, forces, and colors needed to reproduce it? We present IMPASTO, a robotic oil-painting system that integrates learned pixel dynamics models with model-based planning. The dynamics models predict canvas updates from image observations and parameterized stroke actions; a receding-horizon model predictive control optimizer then plans trajectories and forces, while a force-sensitive controller executes strokes on a 7-DoF robot arm. IMPASTO integrates low-level force control, learned dynamics models, and high-level closed-loop planning, learns solely from robot self-play, and approximates human artists' single-stroke datasets and multi-stroke artworks, outperforming baselines in reproduction accuracy. Project website and appendix: https://impasto-robopainting.github.io/.
|
| |
| 15:00-16:30, Paper WeI2I.106 | Add to My Program |
| HiMAP: History-Aware Map-Occupancy Prediction with Fallback |
|
| Xu, Yiming | Leibniz University Hannover |
| Yang, Yi | Leibniz University Hannover |
| Cheng, Hao | University of Twente |
| Sester, Monika | Leibniz University Hannover, Institute of Cartography and Geoinformatics |
|
|
| |
| 15:00-16:30, Paper WeI2I.107 | Add to My Program |
| A Closed-Loop CPR Training Glove with Integrated Tactile Sensing and Haptic Feedback |
|
| Moon, Jaeyoung | Gwangju Institute of Science and Technology |
| Ma, Mingzhuo | University of Washington |
| Yang, Qifeng | University of Washington |
| Choi, Youjin | Gwangju Institute of Science and Technology |
| Hwang, Seokhyun | University of Washington |
| Burden, Samuel | University of Washington, Seattle |
| Kim, Kyung-Joong | Gwangju Institute of Science and Technology |
| Luo, Yiyue | University of Washington |
Keywords: Wearable Robotics, Force and Tactile Sensing, Haptics and Haptic Interfaces
Abstract: Cardiopulmonary resuscitation (CPR) is a critical life-saving procedure, and effective training benefits from self-directed practice beyond instructor-led sessions. In this paper, we propose a closed-loop CPR training glove that integrates a high-resolution tactile sensing array and vibrotactile actuators for self-directed practice. The tactile sensing array measures distributed pressures across the palm and dorsum to enable real-time estimation of compression rate, force, and hand pose. Based on these estimations, the glove delivers immediate haptic feedback to guide the user toward proper CPR, reducing reliance on external audio-visual displays. We quantified the tactile sensor performance by measuring wide-range sensitivity (≈0.85 over 0-600 N), computing hysteresis (56.04%), testing stability (11.05% drift over 300 cycles), and estimating global signal-to-noise ratio (18.90 ± 2.41 dB at 600 N). Our closed-loop pipeline provides continuous modeling and feedback of key performance metrics essential for high-quality CPR. Our lightweight statistical models achieve >92% accuracy for force estimation and hand pose classification within sub-millisecond inference time. Our user study (N=8) showed that haptic feedback reduced visual distraction compared to audio-visual cues, though simplified patterns were required for reliable perception under dynamic load. These results highlight the feasibility of the proposed system and offer design insights for future haptic CPR self-training systems.
|
| |
| 15:00-16:30, Paper WeI2I.108 | Add to My Program |
| Multimodal Fusion-Guided Diffusion Policy for Motion Planning in Rugged and Obstacle-Dense Environments |
|
| Xi, Haoyu | University of Chinese Academy of Sciences |
| Li, Wei | Institute of Computing Technology, Chinese Academy of Sciences |
| Hu, Yu | Institute of Computing Technology Chinese Academy of Sciences |
Keywords: Motion and Path Planning, Constrained Motion Planning, Field Robots
Abstract: Motion planning in unstructured environments remains a challenging task, particularly in scenarios with dense obstacles and discontinuous free space, due to the need to ensure both safety and real-time performance for robots. To address these challenges, this paper proposes a multimodal fusion-guided diffusion policy framework, abbreviated as M-DP, synergistically guided by images, LiDAR points, and goal targets. A multimodal early-fusion mechanism is designed to combine visual and LiDAR data, leveraging the complementary nature of sensor observations to enhance obstacle perception. The fused feature vectors are utilized to guide the diffusion policy to generate multiple trajectories, and the Denoising Diffusion Implicit Model (DDIM) is employed for inference to improve real-time performance. Semantic and geometric constraints are incorporated to determine the optimal trajectory, enabling the selection of collision-free paths that balance safety, goal reaching, and bumpiness. Additionally, dynamic constraints are introduced to ensure the safety of robots in rugged and obstacle-dense environments. Real-world experimental evaluations demonstrate the safety and effectiveness of the framework compared to baseline methods, with ablation studies validating the contributions of key components. Code and our self-collected dataset are available at https://github.com/xhy1599/M-DP.
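For readers unfamiliar with the DDIM inference mentioned in the abstract, the following is a minimal deterministic sampling loop (η = 0), assuming a hypothetical noise predictor `eps_model` conditioned on the fused features; it illustrates why DDIM needs far fewer steps than ancestral DDPM sampling, and is not M-DP's actual code.

```python
# Deterministic DDIM sampling sketch (eta = 0). `eps_model(x, t, cond)` is a
# hypothetical noise predictor conditioned on fused camera-LiDAR-goal
# features; `alpha_bar` is the cumulative noise schedule of the trained DDPM.
import numpy as np

def ddim_sample(eps_model, cond, shape, alpha_bar, n_steps=10, rng=None):
    rng = rng or np.random.default_rng()
    x = rng.standard_normal(shape)                 # start from pure noise
    ts = np.linspace(len(alpha_bar) - 1, 0, n_steps + 1).astype(int)
    for t, t_prev in zip(ts[:-1], ts[1:]):         # descending timesteps
        eps = eps_model(x, t, cond)                # predict the injected noise
        x0 = (x - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        # Deterministic DDIM update: re-noise the x0 estimate to level t_prev.
        x = np.sqrt(alpha_bar[t_prev]) * x0 + np.sqrt(1.0 - alpha_bar[t_prev]) * eps
    return x                                       # denoised trajectory sample
```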
|
| |
| 15:00-16:30, Paper WeI2I.109 | Add to My Program |
| CollabVLA: Self-Reflective Vision-Language-Action Model Dreaming Together with Human |
|
| Sun, Nan | Tsinghua University |
| Li, Yongchang | Yantai University |
| Wang, Chenxu | Tsinghua University |
| Mao, Bo | Beijing University of Posts and Telecommunications |
| Li, Huiying | Tsinghua University |
| Yao, Jiahe | Tianjin University of Technology |
| Li, Kanghao | Tianjin University of Technology |
| Zhang, Yifan | GoerTek Inc |
| Liu, Jian | Beihang University |
| Zhang, Guoying | China University of Mining & Technology, Beijing |
| Guo, Di | Beijing University of Posts and Telecommunications |
| Liu, Huaping | Tsinghua University |
Keywords: Human-Robot Collaboration, AI-Based Methods, AI-Enabled Robotics
Abstract: In this work, we present CollabVLA, a self-reflective vision-language-action framework that transforms a standard visuomotor policy into a collaborative assistant. CollabVLA tackles key limitations of prior VLAs, including domain overfitting, non-interpretable reasoning, and the high latency of auxiliary world models, by integrating VLM-based reflective reasoning with diffusion-based action generation under a mixture-of-experts design. Through a two-stage training recipe of action grounding and reflection tuning, it supports explicit self-reflection and proactively solicits human guidance when confronted with uncertainty or repeated failure. It cuts normalized Time by ∼2× and Dream counts by ∼4× versus explicit-reasoning agents, achieving higher success rates, improved interpretability, and low latency compared with existing methods. This work takes a pioneering step toward shifting VLAs from opaque controllers to genuinely assistive agents capable of reasoning, acting, and collaborating with humans.
|
| |
| 15:00-16:30, Paper WeI2I.110 | Add to My Program |
| DMTrack: Spatio-Temporal Multimodal Tracking Via Dual-Adapter |
|
| Li, Weihong | University of Chinese Academy of Sciences |
| Dong, Shaohua | University of North Texas |
| Lu, Haonan | OPPO AI Center |
| Zhang, Yanhao | OPPO |
| Fan, Heng | University of North Texas |
| Zhang, Libo | Iscas |
Keywords: Visual Tracking, Deep Learning for Visual Perception, Sensor Fusion
Abstract: In this paper, we explore adapter tuning and introduce a novel dual-adapter architecture for spatio-temporal multimodal tracking, dubbed DMTrack. The key to DMTrack lies in two simple yet effective modules, including a spatio-temporal modality adapter (STMA) and a progressive modality complementary adapter (PMCA) module. The former, applied to each modality alone, aims to adjust spatio-temporal features extracted from a frozen backbone by self-prompting, which to some extent can bridge the gap between different modalities and thus allows better cross-modality fusion. The latter seeks to facilitate cross-modality prompting progressively with two specially designed pixel-wise shallow and deep adapters. The shallow adapter employs shared parameters between the two modalities, aiming to bridge the information flow between the two modality branches and thereby laying the foundation for the subsequent modality fusion, while the deep adapter modulates the preliminarily fused information flow with pixel-wise inner-modal attention and further generates modality-aware prompts through pixel-wise inter-modal attention. With such designs, DMTrack achieves promising spatio-temporal multimodal tracking performance with merely 0.93M trainable parameters. Extensive experiments on five benchmarks demonstrate that DMTrack achieves state-of-the-art results. Our code and models will be available at https://github.com/Nightwatch-Fox11/DMTrack.
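Adapter tuning of the kind described trains only small residual modules on a frozen backbone. The sketch below shows a generic bottleneck adapter in PyTorch; the dimensions and single-branch layout are illustrative assumptions and do not reproduce the STMA/PMCA designs.

```python
# Generic bottleneck-adapter sketch in PyTorch: a small residual module
# attached to a frozen backbone layer; only adapter parameters are trained.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project down
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)     # project back up
        nn.init.zeros_(self.up.weight)           # start as the identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual adaptation
```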
|
| |
| 15:00-16:30, Paper WeI2I.111 | Add to My Program |
| DPWM: Autonomous Exploration Via Diffusion-Based Map Prediction Guided Planning |
|
| Jia, Zemei | Zhejiang University |
| Qi, Peng | Zhejiang University |
| Liu, Xiaoxiang | Beijing Institute of Control Engineering |
| Yao, Zhihao | Zhejiang University |
| Li, Liang | Zhejiang University |
Keywords: Search and Rescue Robots, Reactive and Sensor-Based Planning, Mapping
Abstract: Autonomous exploration aims to efficiently map unknown environments, yet utilizing limited environmental information to achieve efficient path planning remains challenging. In this work, we focus on leveraging latent information in partial observations to predict the complete environmental structure, thereby furnishing the proposed path planner with the necessary context to devise a long-term optimal exploration strategy. Most existing prediction approaches extract environment features through convolutional neural networks (CNN) and infer the characteristics of neighboring regions. This information then feeds into a value function that evaluates candidate frontiers and guides the robot's planning. Notwithstanding its advantages over traditional heuristic methods, this paradigm remains inherently constrained by its lack of long-term foresight. To this end, we propose dPWM, a diffusion model-based framework for global map prediction, consisting of two key components. The first employs a DDPM with a variable mask to estimate the probability distribution of unknown regions and thereby predict structural features of the global map. We incorporate Gaussian heatmap positional fields into the denoising process via a cross-attention mechanism to enhance regional awareness. This guides the model to focus on nearby areas that are most valuable for exploration. Once the global predictive map is obtained, the second component applies a dedicated Watchman Route Problem (WRP) solver to generate an optimal path from the current exploration state. Extensive evaluations show that dPWM reduces exploration path length by 18.53% on HouseExpo and achieves a 16.37% improvement in cross-domain generalization on Dungeon over SOTA baselines. Real-world experiments further validate its effectiveness in physical environments.
|
| |
| 15:00-16:30, Paper WeI2I.112 | Add to My Program |
| VISO: Robust Underwater Visual-Inertial-Sonar SLAM with Photometric Rendering for Dense 3D Reconstruction |
|
| Pan, Shu | Heriot Watt University |
| Archieri, Simon | Heriot-Watt University |
| Cinar, Ahmet Fatih | Frontier Robotics |
| Scharff Willners, Jonatan | Heriot-Watt University |
| Carlucho, Ignacio | Heriot-Watt University |
| Petillot, Yvan | Heriot-Watt University |
Keywords: Marine Robotics, SLAM, Range Sensing
Abstract: Visual challenges in underwater environments significantly hinder the accuracy of vision-based localisation and high-fidelity dense reconstruction. In this paper, we propose VISO, a robust underwater SLAM system that fuses a stereo camera, an inertial measurement unit (IMU), and a 3D sonar to achieve accurate 6-DoF localisation and enable efficient dense 3D reconstruction with high photometric fidelity. We introduce a coarse-to-fine online calibration approach for extrinsic parameter estimation between the 3D sonar and the camera. Additionally, a photometric rendering strategy is proposed for the 3D sonar point cloud to enrich the sonar map with visual information. Extensive experiments in a laboratory tank and an open lake demonstrate that VISO surpasses current state-of-the-art underwater and visual-based SLAM algorithms in terms of localisation robustness and accuracy, while also exhibiting real-time dense 3D reconstruction performance comparable to the offline dense mapping method.
|
| |
| 15:00-16:30, Paper WeI2I.113 | Add to My Program |
| Robust and Resilient Soft Robotic Object Insertion with Compliance-Enabled Contact Formation and Failure Recovery |
|
| Shirasaka, Mimo | The University of Tokyo |
| Beltran-Hernandez, Cristian Camilo | OMRON SINIC X Corporation |
| Hamaya, Masashi | OMRON SINIC X Corporation |
| Ushiku, Yoshitaka | OMRON SINIC X Corporation |
Keywords: Soft Robot Applications, Modeling, Control, and Learning for Soft Robots, Failure Detection and Recovery
Abstract: We address robust and resilient object insertion using a passively compliant soft wrist that permits large deformations and safely absorbs contacts, without high-frequency control or force sensing. To improve robustness, we structure the task as compliance-enabled contact formations: a sequence of contact states that progressively constrain specific degrees of freedom. While this segmentation mitigates moderate uncertainty, failures still occur under severe pose errors or environmental variations (e.g., friction changes, peg geometry), which traditionally require retuning goals or retraining controllers. To achieve both robustness and resilience, we therefore integrate compliance-enabled failure recovery into the contact-formation framework. Our key insight is that wrist compliance permits safe, repeated recovery attempts. A pre-trained vision-language model (VLM) assesses each skill execution from terminal poses and images, identifies failure modes, and proposes recovery actions by selecting skills and updating goals. In simulation, our method achieved an 83% success rate, recovering from failures induced by randomized conditions—including grasp misalignments up to 5 degrees, hole-pose errors up to 20 mm, fivefold increases in friction, and previously unseen square/rectangular pegs—and we further validate the approach on a real robot.
|
| |
| 15:00-16:30, Paper WeI2I.114 | Add to My Program |
| OcTac: An Octopus Sucker-Inspired Vision-Based Tactile Sensor with Self-Adaptive Adhesion for Complex Surface Interactions |
|
| Xiong, Yi | Beihang University |
| Yuan, Feiyang | Beihang University |
| Zhang, Qiyi | Beihang University |
| Bao, Lei | Beijing Soft Robot Tech Co., Ltd |
| Wen, Li | Beihang University |
Keywords: Soft Sensors and Actuators, Biologically-Inspired Robots, Soft Robot Applications
Abstract: Vision-based tactile soft sensors are increasingly applied to robotic perception and manipulation by leveraging high-resolution imaging during contact with environmental surfaces, thereby enabling more adaptable and robust interactions. Nonetheless, ensuring optimal contact force to achieve uniform, conformal, and stable contact between sensors and surfaces remains a key challenge, particularly within complex and unstructured environments. Inspired by the highly versatile suction cups of biological octopuses for environmental surface sensing, we introduce OcTac, a prototype that seamlessly combines adaptive adhesion capabilities with vision-based tactile perception. OcTac harnesses its self-guided adhesion mechanism and the intrinsic flexibility of soft materials to autonomously achieve alignment with target surfaces, even when initially misaligned at significant angles, facilitating tactile perception without relying on precise external control. We conducted experiments demonstrating that OcTac exhibits robust adaptive adhesion and self-detachment capabilities on surfaces with inclination angles ranging from 0° to 90°, as well as on surfaces with varying levels of roughness (with particle sizes up to 150 µm). On challenging inclined surfaces, OcTac’s self-aligning adhesion mechanism enables stable and uniform contact, achieving a significant improvement in image uniformity by a factor of 4.53 compared to conventional vision-based tactile soft sensors. Additionally, we demonstrated OcTac mounted on a continuum soft robotic arm, enabling it to navigate around obstacles and perform surface perception, object recognition, and grasping tasks. This work presents a new approach for achieving adaptive tactile perception in complex environments by harnessing the inherent physical intelligence of soft adhesive materials.
|
| |
| 15:00-16:30, Paper WeI2I.115 | Add to My Program |
| EDAIL: Adversarial Imitation Learning Via Exploration-Driven Data Augmentation |
|
| Li, Pengcheng | National University of Defense Technology |
| Fang, Qiang | National University of Defense Technology |
| Xu, Xin | National University of Defense Technology |
Keywords: Imitation Learning, Learning from Demonstration, Reinforcement Learning
Abstract: Adversarial Imitation Learning (AIL) is a prominent paradigm in imitation learning that enables policy acquisition from expert demonstrations without relying on manually crafted reward functions. Although AIL has achieved promising results in certain scenarios, many existing methods suffer from mode collapse and training instability when expert demonstrations are limited. Given that agent–environment interactions are often abundant, we focus on effectively leveraging such interaction data to address the above challenges. In this paper, we propose a novel adversarial imitation learning framework called Exploration-Driven Adversarial Imitation Learning (EDAIL). First, we introduce exploratory policies that augment the discriminator’s training data with high-confidence state-action pairs generated by the agent, thereby improving coverage of the solution space under sparse expert data. Second, we design an asymmetric surrogate reward function that shifts the reward-penalty boundary to mitigate discriminator bias caused by class imbalance, enabling more reliable policy optimization. We evaluate our method on six simulated tasks, including robotic manipulation, locomotion, and navigation, using only 1% and 10% of the datasets employed in prior baselines as expert demonstrations. Experimental results show that our method outperforms the baselines, demonstrating both its effectiveness and robustness. In particular, it achieves a success rate of 94% on the FetchPush task using only 1% of expert demonstrations, representing an absolute improvement of 19 points over the state-of-the-art method. Our code will be available at https://github.com/lipengcheng-nudt/EDAIL.
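One plausible reading of the asymmetric surrogate reward is a logit-style reward whose zero crossing is shifted away from D = 0.5, as sketched below; the threshold `tau` and the exact form are assumptions for illustration, not EDAIL's published formula.

```python
# Sketch of a shifted surrogate reward for AIL. The standard logit reward
# log D - log(1 - D) crosses zero at D = 0.5; moving that boundary to a
# threshold tau < 0.5 penalizes fewer agent samples when the discriminator
# is biased by class imbalance. Illustrative only, not EDAIL's formula.
import numpy as np

def shifted_logit_reward(d, tau=0.3, eps=1e-8):
    d = np.clip(d, eps, 1.0 - eps)
    boundary = np.log(tau) - np.log(1.0 - tau)       # logit of tau
    return np.log(d) - np.log(1.0 - d) - boundary    # zero crossing at D = tau
```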
|
| |
| 15:00-16:30, Paper WeI2I.116 | Add to My Program |
| MetaDP: Meta-Manipulation Diffusion Policy for Robotic Manipulation |
|
| Zhao, Zheyi | Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) |
| He, Ying | Shenzhen University |
| Yu, Fei | Guangming Lab |
| Song, Jiyuan | Guangming Laboratory, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) |
| Sun, Xilong | Kuban State University |
|
|
| |
| 15:00-16:30, Paper WeI2I.117 | Add to My Program |
| Towards Proprioception-Aware Embodied Planning for Dual-Arm Humanoid Robots |
|
| Li, Boyu | SKL-MAIS, Institute of Automation, Chinese Academy of Sciences |
| He, Siyuan | Southeast University |
| Xu, Hang | Fudan University |
| Yuan, Haoqi | Peking University |
| Xu, Xinrun | University of Chinese Academy of Sciences |
| Zang, Yu | Beijing University of Posts and Telecommunications |
| Hu, Liwei | China University of Mining and Technology |
| Jiang, ZhenXiong | East China University of Science and Technology |
| Yue, Junpeng | Peking University |
| Hu, Pengbo | University of Science and Technology of China |
| Karlsson, Börje F. | Beijing Academy of Artificial Intelligence (BAAI) |
| Zhao, Dongbin | SKL-MAIS, Institute of Automation, Chinese Academy of Sciences |
| Tang, Yehui | Imperial College London |
| Lu, Zongqing | Peking University |
Keywords: Integrated Planning and Learning, Simulation and Animation
Abstract: In recent years, Multimodal Large Language Models (MLLMs) have demonstrated the ability to serve as high-level planners, enabling robots to follow complex human instructions. However, their effectiveness, especially in long-horizon tasks involving dual-arm humanoid robots, remains limited. This limitation arises from two main challenges: (i) the absence of simulation platforms that systematically support task evaluation and data collection for humanoid robots, and (ii) the insufficient embodiment awareness of current MLLMs, which hinders reasoning about dual-arm selection logic and body positions during planning. To address these issues, we present DualTHOR, a new dual-arm humanoid simulator with continuous transitions and a contingency mechanism. Building on this platform, we propose Proprio-MLLM, a model that enhances embodiment awareness by incorporating proprioceptive information with motion-based position embedding and a cross-spatial encoder. Experiments show that, while existing MLLMs struggle in this environment, Proprio-MLLM achieves an average improvement of 19.75% in planning performance. Our work provides both an essential simulation platform and an effective model to advance embodied intelligence in humanoid robotics.
|
| |
| 15:00-16:30, Paper WeI2I.118 | Add to My Program |
| SHARP: Supercomputing for High-Speed Avoidance and Reactive Planning |
|
| Lachmansingh, Kieran | Ingenuity Labs Research Institute at Queen's University |
| González, José Ramón | Ingenuity Labs Research Institute at Queen's University |
| Chisholm, Jacob | Ingenuity Labs Research Institute at Queen's University |
| Grant, Ryan Eric | Ingenuity Labs Research Institute at Queen's University |
| Pan, Matthew | Ingenuity Labs Research Institute at Queen's University |
Keywords: Reactive and Sensor-Based Planning, Networked Robots, Collision Avoidance
Abstract: This paper presents SHARP (Supercomputing for High-speed Avoidance and Reactive Planning), a proof-of-concept study demonstrating how high-performance computing (HPC) can enable millisecond-scale responsiveness in robotic control. While modern robots face increasing demands for reactivity in human–robot shared workspaces, onboard processors are constrained by size, power, and cost. Offloading to HPC offers massive parallelism for trajectory planning, but its feasibility for real-time robotics remains uncertain due to network latency and jitter. We evaluate SHARP in a stress-test scenario where a 7-DOF manipulator must dodge high-speed foam projectiles. Using a hash-distributed multi-goal A* search implemented with MPI on both local and remote HPC clusters, the system achieves mean planning latencies of 22.9 ms (local) and 30.0 ms (remote, 300 km away), with avoidance success rates of 84% and 88%, respectively. These results show that when round-trip latency remains within the tens-of-milliseconds regime, HPC-side computation is no longer the bottleneck, enabling avoidance well below human reaction times. The SHARP results motivate hybrid control architectures: low-level reflexes remain onboard for safety, while bursty, high-throughput planning tasks are offloaded to HPC for scalability. By reporting per-stage timing and success rates, this study provides a reproducible template for assessing the real-time feasibility of HPC-driven robotics. Collectively, SHARP reframes HPC offloading as a viable pathway toward dependable, reactive robots in dynamic environments.
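The multi-goal search at the core of SHARP admits a compact single-process sketch, shown below: the heuristic is the minimum admissible distance to any goal, and the search stops at the first goal reached. The grid world and `passable` predicate are illustrative; the paper's hash distribution of the search across MPI ranks is omitted here.

```python
# Single-process sketch of multi-goal A* on a 4-connected grid: the heuristic
# is the minimum Manhattan distance to any goal (admissible), and search
# stops at the first goal reached. SHARP additionally hash-distributes this
# search across MPI ranks on an HPC cluster; that layer is omitted.
import heapq

def multi_goal_astar(start, goals, passable, width, height):
    h = lambda p: min(abs(p[0] - g[0]) + abs(p[1] - g[1]) for g in goals)
    goal_set = set(goals)
    open_heap = [(h(start), 0, start, None)]   # entries: (f, g, node, parent)
    parents = {}
    while open_heap:
        f, g, node, parent = heapq.heappop(open_heap)
        if node in parents:
            continue                           # already expanded more cheaply
        parents[node] = parent
        if node in goal_set:                   # reconstruct the found path
            path = [node]
            while parents[path[-1]] is not None:
                path.append(parents[path[-1]])
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nxt[0] < width and 0 <= nxt[1] < height
                    and passable(nxt) and nxt not in parents):
                heapq.heappush(open_heap, (g + 1 + h(nxt), g + 1, nxt, node))
    return None                                # no goal reachable
```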
|
| |
| 15:00-16:30, Paper WeI2I.119 | Add to My Program |
| Exploring Haptic Augmentation and Language Design for Smartphone-Based Teleoperation |
|
| Baylis, Zachary Andrew Christopher | University of Cambridge |
| Chen, Ziling | Massachusetts Institute of Technology |
| Bautista Montesano, Rolando | Massachusetts Institute of Technology |
| Yoon, Yeo Jung | Massachusetts Institute of Technology |
| Bohné, Thomas | University of Cambridge |
| Tadeja, Slawomir Konrad | Massachusetts Institute of Technology |
| Liu, John | Massachusetts Institute of Technology |
Keywords: Telerobotics and Teleoperation, Human Performance Augmentation, Human-Robot Collaboration
Abstract: Smartphone-based teleoperation is gaining traction as a versatile remote control solution, using widely available hardware to provide a portable and scalable interface for telerobotics. However, a crucial limitation of such an approach is the lack of effective haptic feedback, which restricts accuracy and increases operator workload. While smartphones offer a low entry barrier as well as portability and scalability, current interfaces rely almost exclusively on visual cues. To address this gap, we investigate the use of symbolic haptic feedback delivered through an unmodified mobile device to support remote manipulation tasks. We designed a combined teleoperation task that integrates object sorting and peg-in-hole insertion, embedding five candidate haptic cues (i.e., contact, gripper state, alignment, error boundary, and motion initiation). A within-subjects study with 16 participants compared visual-only and visual-plus-haptic conditions. Results show that haptic augmentation reduced total errors by 42% and significantly lowered perceived workload. Continuous cues for alignment and error boundaries achieved the highest recognition rates of 94% and 81%, respectively, while brief state cues were less reliably interpreted. Post-task interviews highlighted user preference for simple, continuous, and intense signals in visually ambiguous scenarios. Our findings provide new design guidelines for haptic cue prioritisation and encoding strategies.
|
| |
| 15:00-16:30, Paper WeI2I.120 | Add to My Program |
| GaussianCaR: Gaussian Splatting for Efficient Camera-Radar Fusion |
|
| Montiel-Marín, Santiago | University of Alcalá |
| Antunes-García, Miguel | University of Alcalá |
| Sánchez-García, Fabio | Universidad De Alcalá |
| Llamazares, Angel | University of Alcalá |
| Caesar, Holger | Delft University of Technology |
| Bergasa, Luis M. | University of Alcalá |
Keywords: Deep Learning for Visual Perception, Sensor Fusion, Intelligent Transportation Systems
Abstract: Robust and accurate perception of dynamic objects and map elements is crucial for autonomous vehicles performing safe navigation in complex traffic scenarios. While vision-only methods have become the de facto standard due to their technical advances, they can benefit from effective and cost-efficient fusion with radar measurements. In this work, we advance fusion methods by repurposing Gaussian Splatting as an efficient universal view transformer that bridges the view disparity gap, mapping both image pixels and radar points into a common Bird’s-Eye View (BEV) representation. Our main contribution is GaussianCaR, an end-to-end network for BEV segmentation that, unlike prior BEV fusion methods, leverages Gaussian Splatting to map raw sensor information into latent features for efficient camera-radar fusion. Our architecture combines multi-scale fusion with a transformer decoder to efficiently extract BEV features. Experimental results demonstrate that our approach achieves performance on par with, or even surpassing, the state of the art on BEV segmentation tasks (57.3%, 82.9%, and 50.1% IoU for vehicles, roads, and lane dividers) on the nuScenes dataset, while maintaining a 3.2x faster inference runtime. Code and project page are available online.
|
| |
| 15:00-16:30, Paper WeI2I.121 | Add to My Program |
| CoTaP: Compliant Task Pipeline and Reinforcement Learning of Its Controller with Compliance Modulation |
|
| He, Zewen | Mohamed Bin Zayed University of Artificial Intelligence |
| Chenyuan, Chen | Mohamed Bin Zayed University of Artificial Intelligence |
| Azizov, Dilshod | Mohamed Bin Zayed University of Artificial Intelligence |
| Nakamura, Yoshihiko | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Humanoid and Bipedal Locomotion, Reinforcement Learning, Compliance and Impedance Control
Abstract: Humanoid whole-body locomotion control is a critical approach for humanoid robots to leverage their inherent advantages. Learning-based control methods derived from retargeted human motion data provide an effective means of addressing this issue. However, because most current human datasets lack measured force data, and learning-based robot control is largely position-based, achieving appropriate compliance during interaction with real environments remains challenging. This paper presents the Compliant Task Pipeline (CoTaP): a pipeline that leverages compliance information in the learning-based structure of humanoid robots. A two-stage dual-agent reinforcement learning framework combined with model-based compliance control for humanoid robots is proposed. In the training process, a base policy with a position-based controller is trained first; during distillation, the upper-body policy is combined with model-based compliance control, while the lower-body agent is guided by the base policy. In the upper-body control, adjustable task-space compliance can be specified and integrated with other controllers through compliance modulation on the symmetric positive definite (SPD) manifold, ensuring system stability. We validated the feasibility of the proposed strategy in simulation and hardware experiments, primarily comparing the responses to external disturbances under different compliance settings.
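The SPD-manifold compliance modulation mentioned above can be illustrated with the standard affine-invariant geodesic between two stiffness matrices, which remains symmetric positive definite along the entire path; this is a generic construction, not necessarily CoTaP's exact modulation law.

```python
# Affine-invariant geodesic between two SPD stiffness matrices:
#   K(t) = K1^{1/2} (K1^{-1/2} K2 K1^{-1/2})^t K1^{1/2},  t in [0, 1],
# which stays symmetric positive definite for all t. Illustrative only.
import numpy as np
from scipy.linalg import sqrtm, fractional_matrix_power

def spd_geodesic(K1, K2, t):
    K1_half = np.real(sqrtm(K1))
    K1_ihalf = np.linalg.inv(K1_half)
    middle = fractional_matrix_power(K1_ihalf @ K2 @ K1_ihalf, t)
    return K1_half @ np.real(middle) @ K1_half

# Example: blend a stiff and a compliant task-space stiffness setting.
K_stiff = np.diag([800.0, 800.0, 400.0])
K_soft = np.diag([100.0, 100.0, 50.0])
K_mid = spd_geodesic(K_stiff, K_soft, 0.5)   # SPD by construction
```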
|
| |
| 15:00-16:30, Paper WeI2I.122 | Add to My Program |
| T2-Nav: Algebraic-Topology–Aware Temporal Graph Memory and Loop Detection for Zero-Shot Visual Navigation |
|
| Quang Anh, Nguyen Duc | Vietnam National University |
| Minh Duc, Pham | Vietnam National University |
| Nguyen, Minh Anh | International School - Vietnam National University |
| Doan, Duy Tung | Hanoi University of Science and Technology |
| Dang, Tuan | University of Arkansas |
Keywords: Autonomous Agents, AI-Enabled Robotics, Autonomous Vehicle Navigation
Abstract: Deploying autonomous agents in the real world is difficult, especially for navigation, where systems must adapt to situations they have not encountered before. Traditional learning approaches require substantial data and constant tuning, and sometimes must be retrained from scratch for every new task, which makes them hard to scale and inflexible. Recent breakthroughs in foundation models, such as large language models and vision-language models, enable systems to attempt new navigation tasks without requiring additional training. However, many of these methods only work with specific types of inputs, employ relatively basic reasoning, and fail to fully utilize the details they observe or the structure of the spaces they traverse. Here, we introduce T2-Nav, a zero-shot navigation system that combines various types of data and employs graph-based reasoning. By incorporating visual information directly into the graph and matching it to the environment, our approach enables the system to strike a good balance between exploration and reaching its goal. This strategy allows robust obstacle avoidance, reliable loop closure detection, and efficient path planning while eliminating redundant exploration patterns. The system demonstrates flexibility by handling goals specified through reference images of target object instances, making it particularly suitable for real-world deployment scenarios where agents must navigate to visually similar but spatially distinct instances. Experiments demonstrate that our approach performs efficiently and adapts well in complex, unfamiliar settings, moving toward practical zero-shot instance-image navigation capabilities.
|
| |
| 15:00-16:30, Paper WeI2I.123 | Add to My Program |
| Text-Conditioned Beat Gesture Generation for a Social Robot Via a Conditional Variational Autoencoder |
|
| Climent Peñalver, Alejandro | Universidad Carlos III De Madrid |
| Fernandez-Rodicio, Enrique | Universidad Carlos III De Madrid |
| Castro-González, Álvaro | Universidad Carlos III De Madrid |
Keywords: Human and Humanoid Motion Analysis and Synthesis, Natural Dialog for HRI, AI-Based Methods
Abstract: Conversation can benefit from small rhythmic gestures that track prosody, reinforce structure, and help to keep attention. However, many robots used in human–robot interaction still rely on fixed templates or clip libraries that scale poorly to open-domain interactions; moreover, embedded platforms impose tight limits on motion range, speeds, and timing. Consequently, gesture generation methods must be lightweight, stable, and easy to integrate. To address this need, this work presents a lightweight gesture-generation model that generates beat gestures in real time from the transcription of the robot's speech. First, a Conditional Variational Autoencoder (CVAE) conditioned on sentence-level BERT embeddings is trained on 2D pose–text pairs to produce upper-body pose sequences. Next, a geometry-based retargeting algorithm deterministically maps those poses to the robot’s joints while enforcing kinematic limits. Finally, the joint sequence is converted into a pseudo-state machine and triggered in lockstep with the utterance. The results show that the system achieves smooth, text-conditioned beat gestures with solid fidelity and temporal diversity, and demonstrates real-time performance when integrated on a social robot.
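A conditional VAE of the kind described fits in a few lines of PyTorch; in the sketch below, the pose and text-embedding dimensions and the flat-vector pose encoding are illustrative assumptions rather than the paper's architecture.

```python
# Minimal conditional-VAE sketch: pose vectors are encoded to a latent z
# conditioned on a sentence embedding and decoded back. Dimensions are
# illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class GestureCVAE(nn.Module):
    def __init__(self, pose_dim=100, text_dim=768, latent_dim=32, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(pose_dim + text_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim + text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim))

    def forward(self, pose, text_emb):
        h = self.enc(torch.cat([pose, text_emb], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        recon = self.dec(torch.cat([z, text_emb], dim=-1))
        return recon, mu, logvar                # train with recon + KL losses

    @torch.no_grad()
    def generate(self, text_emb):
        z = torch.randn(text_emb.shape[0], self.mu.out_features)
        return self.dec(torch.cat([z, text_emb], dim=-1))  # inference-time gesture
```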
|
| |
| 15:00-16:30, Paper WeI2I.124 | Add to My Program |
| SPARO: Snap-On/Rotate-Off Passive Gripper for Aerial Perching |
|
| Domislovic, Jakob | University of Zagreb, Faculty of Electrical Engineering and Computing |
| Ivanovic, Antun | University of Zagreb, Faculty of Electrical Engineering and Computing |
| Petric, Frano | University of Zagreb, Faculty of Electrical Engineering and Computing |
| Orsag, Matko | University of Zagreb, Faculty of Electrical Engineering and Computing |
|
|
| |
| 15:00-16:30, Paper WeI2I.125 | Add to My Program |
| Learning Controlled Separation of Small Objects between Two Fingers with a Tactile Skin |
|
| Kasolowsky, Ulf | Technical University of Munich |
| Bäuml, Berthold | Technical University of Munich |
Keywords: In-Hand Manipulation, Force and Tactile Sensing, Multifingered Hands
Abstract: We introduce and solve the novel task of controlled separation of small objects with two fingers of a multi-purpose robotic hand: after grasping into a box of small objects, the task is to drop objects until a desired number remains between the fingers. The objects are small compared to the width of the fingers but also in absolute terms: in our case, pellets with a diameter of only 6 mm are handled. We show that the task can be performed using tactile sensing alone (no vision) with a spatially-resolved tactile skin on a fingertip. The separation policy is trained in simulation via reinforcement learning using a straightforward sparse reward, which simply checks whether the desired number of objects is reached. In simulation experiments, we provide an exhaustive analysis of the benefits of using spatially-resolved tactile feedback: while an ideal (high-resolution) tactile sensor allows solving the task almost perfectly, a sensor with lower spatial resolution (here 4x4 taxels) still leads to an improvement of up to 20% compared to using only the fingers' joint sensors. For this analysis, we further train an estimator alongside the policy that predicts the ground truth contact positions. Finally, we demonstrate the successful sim-to-real transfer for the DLR-Hand II equipped with a tactile skin.
|
| |
| 15:00-16:30, Paper WeI2I.126 | Add to My Program |
| GPD-AP: A Grasp Pose-Driven Active Perception Framework for Occlusion-Robust Robotic Manipulation |
|
| Wei, Yancong | Tsinghua University |
| Pang, Yunyi | Huazhong University of Science and Technology |
| Liu, Sicheng | Actibot Intelligence |
| Dong, Kangkang | Tsinghua University |
| Liu, Houde | Shenzhen Graduate School, Tsinghua University |
Keywords: Grasping, Reinforcement Learning, Perception for Grasping and Manipulation
Abstract: Humans instinctively adjust their viewpoints to resolve occlusions and infer spatial relationships, enabling effective perception and navigation in cluttered environments. This capability, however, remains a significant challenge for robotic systems. To address this, we propose GPD-AP, a novel active perception framework that leverages grasp pose estimation and associated scoring to systematically tackle grasping tasks in occluded and cluttered settings. The core innovation lies in an end-to-end system where a computationally efficient grasp pose estimation module directly informs a Next-Best-View (NBV) planner. This integration shifts the focus from generic scene exploration to a grasp-oriented visual search, guiding the robot to viewpoints that minimize uncertainty about potential grasps. To train and validate GPD-AP, we introduce a simulation reset method capable of generating highly challenging scenes with partially or fully occluded target objects. Experimental results demonstrate that GPD-AP improves grasping success rates by 30% in dense obstacle environments, effectively enabling the transition of target objects from invisible to visible and graspable states. This work marks a significant step towards autonomous and intelligent robotic manipulation in unstructured real-world scenarios.
|
| |
| 15:00-16:30, Paper WeI2I.127 | Add to My Program |
| Motion Compensation and Adaptive Force Control Via iOCT–FBG Sensor Fusion for Robotic Subretinal Injection |
|
| Long, Aoqi | Johns Hopkins University |
| Wu, Tianle | Johns Hopkins University |
| She, Chongyang | Johns Hopkins University |
| Esfandiari, Mojtaba | Johns Hopkins University |
| Gehlbach, Peter | Johns Hopkins Medical Institute |
| Taylor, Russell H. | The Johns Hopkins University |
| Iordachita, Ioan Iulian | Johns Hopkins University |
Keywords: Medical Robots and Systems, Sensor-based Control, Robust/Adaptive Control
Abstract: Subretinal injection is a highly delicate procedure that demands micron-level precision to avoid irreversible retinal damage. Current robotic systems achieve accurate positioning but remain limited by retinal motion and the lack of tip-force feedback. We present the first adaptive tip-force compensation framework for robotic subretinal injection, fusing intraoperative optical coherence tomography (iOCT) vision with fiber Bragg grating (FBG) force sensing. Our architecture integrates a finite-state machine (FSM) for surgical phase coordination, a Long Short-Term Memory (LSTM) enhanced residual Kalman filter for real-time motion prediction, and an adaptive compliance estimator for safe force regulation. Compared to previous vision-only and force-only methods, ex vivo experiments on porcine eyes demonstrate robust improvements: the root-mean-square tracking error was reduced by 40% (to 18.5 μm), the maximum absolute error was lowered by a factor of 2.5, and 96.7% of tip forces were maintained within ±0.7 mN. Control delays were minimized to 0.25 s, enabling low-latency corrections beyond freehand capabilities. Our system enhances precision and safety in fragile retinal tissues, advancing the potential for reliable robot-assisted surgeries for retinal diseases.
|
| |
| 15:00-16:30, Paper WeI2I.128 | Add to My Program |
| Performance-Guided Refinement for Visual Aerial Navigation Using Editable Gaussian Splatting in FalconGym 2.0 |
|
| Miao, Yan | University of Illinois at Urbana Champaign |
| Yuceel, Ege | University of Illinois at Urbana Champaign |
| Fainekos, Georgios | Toyota Motor North America R&D |
| Hoxha, Bardh | Toyota Motor North America R&D |
| Okamoto, Hideki | Toyota Motor North America R&D |
| Mitra, Sayan | University of Illinois at Urbana Champaign |
Keywords: Vision-Based Navigation, Aerial Systems: Perception and Autonomy
Abstract: Visual policy design is crucial for aerial navigation. However, state-of-the-art visual policies often overfit to a single track and their performance degrades when track geometry changes. We develop FalconGym 2.0, a photorealistic simulation framework built on Gaussian Splatting (GSplat) with an Edit API that programmatically generates diverse static and dynamic tracks in milliseconds. Leveraging FalconGym 2.0's editability, we propose a Performance-Guided Refinement (PGR) algorithm, which concentrates visual-policy training on challenging tracks while iteratively improving performance. Across two case studies (fixed-wing UAVs and quadrotors) with distinct dynamics and environments, we show that a single visual policy trained with PGR in FalconGym 2.0 outperforms state-of-the-art baselines in generalization and robustness: it generalizes to three unseen tracks with 100% success without per-track retraining and maintains higher success rates under gate-pose perturbations. Finally, we demonstrate zero-shot sim-to-real transfer of the PGR-trained visual policy to quadrotor hardware, achieving a 98.6% success rate (69/70 gates) over 30 trials across two three-gate tracks and one moving-gate track.
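At a high level, Performance-Guided Refinement can be caricatured as failure-weighted track sampling: evaluate the policy per track, then oversample the tracks where it fails before regenerating them via the Edit API. The weighting in the sketch below is an assumption; the paper's exact criterion may differ.

```python
# Failure-weighted track sampling sketch: tracks where the current policy
# fails more often are sampled more frequently in the next training round.
# The power weighting is an illustrative assumption, not PGR's exact rule.
import numpy as np

def sample_training_tracks(failure_rates, n_tracks, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    w = np.asarray(failure_rates) ** (1.0 / temperature) + 1e-3  # keep all tracks live
    p = w / w.sum()
    return rng.choice(len(failure_rates), size=n_tracks, p=p)

# Each iteration: evaluate the policy per track -> update failure_rates ->
# regenerate or resample hard tracks via the Edit API -> continue training.
```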
|
| |
| 15:00-16:30, Paper WeI2I.129 | Add to My Program |
| Occlusion-Robust Relative Pose Estimation for Multi-Robot Systems Via Geometric-Aware Diffusion Matching |
|
| Kang, Suyoung | University of Massachusetts Amherst |
| Dutta, Rishav | University of Massachusetts, Amherst |
| Gao, Peng | North Carolina State University |
| Wigness, Maggie | U.S. Army Research Laboratory |
| Rogers III, John G. | DEVCOM Army Research Laboratory |
| Kim, Donghyun | University of Massachusetts Amherst |
| Zhang, Hao | University of Massachusetts Amherst |
Keywords: Multi-Robot Systems, Deep Learning for Visual Perception, Localization
Abstract: Relative pose estimation is crucial for coordinated multi-robot navigation. However, robots in close proximity often face intra-team occlusions, where teammates partially block each other's field of view, while dynamic environments further introduce environmental occlusions. Classical relative pose estimation methods degrade under occlusion and texture scarcity, whereas learning-based methods often lack explicit geometric consistency, which limits their accuracy during real deployments. To address multi-robot relative pose estimation in complex 3D environments, we introduce Geometric-Aware Diffusion Matching (GADM), which enables a team of robots to estimate relative 6-DoF poses using only RGB-D sensors, even under occlusions. GADM uses a diffusion model to progressively exploit global and higher-order structural constraints encoded by a graph network, guiding smoother optimization and faster convergence to robust correspondence distributions under noise and occlusions. By integrating geometric consistency, GADM explicitly addresses occlusions by producing geometrically consistent matches suitable for real-time deployment on physical robots. The resulting correspondences are then used with geometry-based solvers to estimate 6-DoF relative poses, providing robustness even under partial view overlap and limited keypoint visibility. We conducted experiments using both robotics simulations and physical robot teams, and our results show that GADM achieves robust 6-DoF pose estimation performance in occluded scenarios.
|
| |
| 15:00-16:30, Paper WeI2I.130 | Add to My Program |
| ATA: Bridging Implicit Reasoning with Attention-Guided and Action-Guided Inference for Vision-Language Action Models |
|
| Yang, Cheng | Rutgers University |
| Jiao, Jianhao | University College London |
| Huang, Lingyi | Rutgers University |
| Xiao, Jinqi | Rutgers University |
| Tang, Zhexiang | Rutgers University |
| Gong, Yu | Rutgers University |
| Ying, Yibiao | Rutgers University |
| Sui, Yang | Rice University |
| Lin, Jintian | Shenzhen TCL High-Tech Development Co., Ltd |
| Huang, Wen | Shenzhen TCL High-Tech Development Co., Ltd |
| Yuan, Bo | Rutgers University |
Keywords: Deep Learning Methods, Deep Learning in Grasping and Manipulation
Abstract: Vision-Language-Action (VLA) models rely on current observations, including images, language instructions, and robot states, to predict actions and complete tasks. While accurate visual perception is crucial for precise action prediction and execution, recent work has attempted to further improve performance by introducing explicit reasoning during inference. However, such approaches face significant limitations. They often depend on data-intensive resources such as Chain-of-Thought (CoT) style annotations to decompose tasks into step-by-step reasoning, and in many cases require additional visual grounding annotations (e.g., bounding boxes or masks) to highlight relevant image regions. Moreover, they involve time-consuming dataset construction, labeling, and retraining, which ultimately results in longer inference sequences and reduced efficiency. To address these challenges, we propose ATA, a novel training-free framework that introduces implicit reasoning into VLA inference through complementary attention-guided and action-guided strategies. Unlike CoT or explicit visual-grounding methods, ATA formulates reasoning implicitly by integrating attention maps with an action-based region of interest (RoI), thereby adaptively refining visual inputs without requiring extra training or annotations. ATA is a plug-and-play implicit reasoning approach for VLA models, lightweight yet effective. Extensive experiments show that it consistently improves task success and robustness while preserving, and even enhancing, inference efficiency.
|
| |
| 15:00-16:30, Paper WeI2I.131 | Add to My Program |
| Learning to Grasp by Integrating Human Preferences and Success Feedback |
|
| Park, Juyeol | Hanyang University |
| Ko, Byungjin | Hanyang University ERICA |
| Yoon, Jong-Wan | Hanyang University ERICA |
| Park, Taejoon | Hanyang University |
| Park, Homin | Singapore University of Social Sciences |
Keywords: Deep Learning in Grasping and Manipulation, Reinforcement Learning, Deep Learning for Visual Perception
Abstract: End-to-end robotic grasping increasingly relies on reinforcement learning to enable safe and precise execution, yet defining a reward that consistently drives such behavior remains a central challenge. Human-engineered rewards have been widely explored, but they are prone to reward hacking, depend heavily on artificial design choices, and often fail to capture human intuition. Preference-based reward models offer a promising alternative by aligning policies with human feedback, but their application to robotic grasping has remained limited, and preference-aligned actions do not always translate into successful execution. We propose Human Preference and Success-based Grasping (HPSG), a three-stage framework that combines pre-training, reward modeling, and fine-tuning. At its core is the Weighted Success Reward (WSR), which integrates a preference-trained reward model with binary success feedback so that policies learn behaviors that are effective in practice and aligned with human judgment. This design resolves the mismatch between subjective preferences and execution outcomes, thereby improving reliability. Through extensive simulation and real-world experiments, we show that HPSG produces reliable grasping policies, achieving higher success and completion rates, reducing collisions, and transferring to physical settings with smaller performance degradation than baseline methods. Our code is publicly available at: https://github.com/qkrwnduf1997/HPSG
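A minimal sketch of a success-gated reward in the spirit of the Weighted Success Reward appears below; the gating and weights are assumptions for illustration, chosen so that preferred-but-failed executions earn nothing, and are not HPSG's published formula.

```python
# Illustrative success-gated reward: combine a preference-model score with
# binary execution success so that preference-aligned but failing behavior
# is not reinforced. Weights and gating are assumptions, not HPSG's formula.
import numpy as np

def weighted_success_reward(pref_score, success, w_pref=0.5, w_succ=0.5):
    """pref_score: raw output of the preference-trained reward model;
    success: 1.0 if the grasp succeeded, else 0.0."""
    pref = 1.0 / (1.0 + np.exp(-pref_score))   # squash to (0, 1)
    # Success gates the preference term: preferred but failed executions earn 0.
    return success * (w_succ + w_pref * pref)
```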
|
| |
| 15:00-16:30, Paper WeI2I.132 | Add to My Program |
| Disentangled Point Diffusion for Precise Object Placement |
|
| He, Lyuxing | Carnegie Mellon University |
| Cai, Eric | Carnegie Mellon University |
| Aggarwal, Shobhit | Carnegie Mellon University |
| Wang, Jianjun | ABB Robotics LLC |
| Held, David | Carnegie Mellon University |
Keywords: Deep Learning in Grasping and Manipulation, Learning from Demonstration, Deep Learning Methods
Abstract: Recent advances in robotic manipulation have highlighted the effectiveness of learning from demonstration. However, while end-to-end policies excel in expressivity and flexibility, they struggle both in generalizing to novel object geometries and in attaining a high degree of precision. An alternative, object-centric approach frames the task as predicting the placement pose of the target object, providing a modular decomposition of the problem. Building on this goal-prediction paradigm, we propose a hierarchical, disentangled point diffusion framework that achieves state-of-the-art performance in placement precision, multi-modal coverage, and generalization to variations in object geometries and scene configurations. Specifically, we model global scene-level placements through a novel feed-forward Dense Gaussian Mixture Model (GMM) that yields a spatially dense prior over global placements; we then model the local object-level configuration through a novel disentangled point cloud diffusion module that separately diffuses the object geometry and the placement frame, enabling precise local geometric reasoning. Interestingly, we demonstrate that our point cloud diffusion achieves substantially higher accuracy compared to prior approaches based on SE(3) diffusion, even in the context of rigid object placement. We validate our approach across a suite of challenging tasks in simulation and in the real world on high-precision industrial insertion tasks. Furthermore, we present results on a cloth-hanging task in simulation, indicating that our method can further relax assumptions on object rigidity. Visualizations and supplementary materials can be found on our project website: https://3dgp-icra2026.github.io/.
|
| |
| 15:00-16:30, Paper WeI2I.133 | Add to My Program |
| Learning Social Navigation from Positive and Negative Demonstrations and Rule-Based Specifications |
|
| Kim, Chanwoo | Korea University |
| Yoon, JiHwan | Korea University |
| Kim, Hyeonseong | Korea University |
| Jeong, Taemoon | Korea University |
| Yoo, Changwoo | Korea University |
| Lee, Seungbeen | Yonsei University |
| Byeon, SooHwan | MOBINN |
| Chung, Hoon | Mobinn |
| Pan, Matthew | Queen's University |
| Oh, Jean | Carnegie Mellon University |
| Lee, Kyungjae | Korea University |
| Choi, Sungjoon | Korea University |
Keywords: Learning from Demonstration, Reactive and Sensor-Based Planning
Abstract: Mobile robot navigation in dynamic human environments requires policies that balance adaptability to diverse behaviors with compliance to safety constraints. We hypothesize that integrating data-driven rewards with rule-based objectives enables navigation policies to achieve a more effective balance of adaptability and safety. To this end, we develop a framework that learns a density-based reward from positive and negative demonstrations and augments it with rule-based objectives for obstacle avoidance and goal reaching. A sampling-based lookahead controller produces supervisory actions that are both safe and adaptive, which are subsequently distilled into a compact student policy suitable for real-time operation with uncertainty estimates. Experiments in synthetic and elevator co-boarding simulations show consistent gains in success rate and time efficiency over baselines, and real-world demonstrations with human participants confirm the practicality of deployment. A video illustrating this work can be found on our project page https://chanwookim971024.github.io/PioneeR/.
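One minimal instantiation of a density-based reward from positive and negative demonstrations is the log-density ratio of two kernel density estimates, sketched below; the state featurization and bandwidth are illustrative assumptions, not the paper's design.

```python
# Density-based reward sketch: fit kernel density estimates to positive and
# negative demonstration states and reward states likely under the positive
# demos but unlikely under the negative ones. Features/bandwidth are
# illustrative assumptions.
import numpy as np
from sklearn.neighbors import KernelDensity

def fit_density_reward(pos_states, neg_states, bandwidth=0.3):
    """pos_states, neg_states: arrays of shape (n_samples, state_dim)."""
    kde_pos = KernelDensity(bandwidth=bandwidth).fit(pos_states)
    kde_neg = KernelDensity(bandwidth=bandwidth).fit(neg_states)
    def reward(states):
        # Log-density ratio: high where positive demos concentrate.
        return kde_pos.score_samples(states) - kde_neg.score_samples(states)
    return reward
```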
|
| |
| 15:00-16:30, Paper WeI2I.134 | Add to My Program |
| Proprioceptive Shape Estimation of Tensegrity Manipulators Using Energy Minimisation |
|
| Bhat, Tufail Ahmad | Kyushu Institute of Technology |
| Ikemoto, Shuhei | Kyushu Institute of Technology |
Keywords: Soft Robot Materials and Design, Flexible Robotics, Redundant Robots
Abstract: Shape estimation is fundamental for controlling continuously bending tensegrity manipulators, yet achieving it remains a challenge. Although using exteroceptive sensors makes the implementation straightforward, it is costly and limited to specific environments. Proprioceptive approaches, by contrast, do not suffer from these limitations. So far, several methods have been proposed; however, to our knowledge, there are no proven examples of large-scale tensegrity structures used as manipulators. This paper demonstrates that shape estimation of the entire tensegrity manipulator can be achieved using only the inclination angle information relative to gravity for each strut. Inclination angle information is intrinsic sensory data that can be obtained simply by attaching an inertial measurement unit (IMU) to each strut. Experiments conducted on a five-layer tensegrity manipulator with 20 struts and a total length of 1160 mm demonstrate that the proposed method estimates the shape with an accuracy of 2.1% of the total manipulator length from arbitrary initial conditions under static conditions, and maintains stable shape estimation under external disturbances.
|
| |
| 15:00-16:30, Paper WeI2I.135 | Add to My Program |
| Latent Activation Editing: Inference-Time Refinement of Learned Policies for Safer Multirobot Navigation |
|
| Das, Satyajeet | University of Southern California |
| Chiu, Darren | University of Southern California |
| Huang, Zhehui | University of Southern California |
| Lindemann, Lars | ETH Zurich |
| Sukhatme, Gaurav | University of Southern California |
Keywords: Multi-Robot Systems, Robot Safety, Machine Learning for Robot Control
Abstract: Reinforcement learning has enabled significant progress in complex domains such as coordinating and navigating multiple quadrotors. However, even well-trained policies remain vulnerable to collisions in obstacle-rich environments. Addressing these infrequent but critical safety failures through retraining or fine-tuning is costly and risks degrading previously learned skills. Inspired by activation steering in large language models and latent editing in computer vision, we introduce a framework for inference-time Latent Activation Editing (LAE) that refines the behavior of pre-trained policies without modifying their weights or architecture. The framework operates in two stages: (i) an online classifier monitors intermediate activations to detect states associated with undesired behaviors, and (ii) an activation editing module that selectively modifies flagged activations to shift the policy towards safer regimes. In this work, we focus on improving safety in multi-quadrotor navigation. We hypothesize that amplifying a policy’s internal perception of risk can induce safer behaviors. We instantiate this idea through a latent collision world model trained to predict future pre-collision activations, thereby prompting earlier and more cautious avoidance responses. Extensive simulations and real-world Crazyflie experiments demonstrate that LAE achieves a statistically significant reduction in collisions (nearly 90% fewer cumulative collisions compared to the unedited baseline) and substantially increases the fraction of collision-free trajectories, while preserving task completion. More broadly, our results establish LAE as a lightweight paradigm, feasible on resource-constrained hardware, for post-deployment refinement of learned robot policies. Our project page with videos and code is available at https://lae-robotics.github.io/.
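Mechanically, inference-time activation editing can be implemented with a forward hook that flags and shifts intermediate activations while leaving the weights untouched, as in the sketch below; the classifier, edit direction, and layer choice are hypothetical placeholders.

```python
# Forward-hook sketch of inference-time latent activation editing: a small
# classifier flags risky intermediate activations and an edit vector shifts
# them; the policy weights are never modified. `risk_classifier`,
# `edit_direction`, and the chosen layer are hypothetical placeholders.
import torch

def attach_lae_hook(layer, risk_classifier, edit_direction,
                    alpha=1.0, threshold=0.5):
    def hook(module, inputs, output):
        with torch.no_grad():
            risk = torch.sigmoid(risk_classifier(output))    # (batch, 1)
            flagged = (risk > threshold).float()
            # Shift only flagged activations toward the cautious direction.
            return output + alpha * flagged * edit_direction
    return layer.register_forward_hook(hook)

# Usage (hypothetical): handle = attach_lae_hook(policy.encoder, clf, direction)
# ... run inference ...; handle.remove() restores the original policy.
```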
|
| |
| 15:00-16:30, Paper WeI2I.136 | Add to My Program |
| BodyGuards: Escorting by Multiple Robots in Unknown Environment under Limited Communication |
|
| Tian, Zhuoli | Peking University |
| Bao, Yanze | Peking University |
| Guo, Meng | Peking University |
Keywords: Multi-Robot Systems, Human-Robot Teaming, Task and Motion Planning
Abstract: Multi-robot systems are increasingly deployed in high-risk missions such as reconnaissance, disaster response, and subterranean operations. Protecting a human operator while navigating unknown and adversarial environments remains a critical challenge, especially when communication between the operator and robots is restricted. Unlike existing collaborative exploration methods that aim for complete coverage, this work focuses on task-oriented exploration to minimize the navigation time of the operator to reach its goal while ensuring safety under adversarial threats. A novel escorting framework, BodyGuards, is proposed to seamlessly integrate collaborative exploration, robot-operator communication, and escorting. The framework consists of three core components: (I) a dynamic movement strategy for the operator that maintains a local map with risk zones for proactive path planning; (II) a dual-mode robotic strategy combining frontier-based exploration with optimized return events to balance exploration, threat detection, and intermittent communication; and (III) multi-robot coordination protocols that jointly plan exploration and information sharing for efficient escorting. Extensive human-in-the-loop simulations and hardware experiments demonstrate that the method significantly reduces operator risk and mission time, outperforming baselines in adversarial and constrained environments.
|
| |
| 15:00-16:30, Paper WeI2I.137 | Add to My Program |
| RM-RL: Role-Model Reinforcement Learning for Precise Robot Manipulation |
|
| Chen, Xiangyu | Nanyang Technological University |
| Zhou, Chuhao | Nanyang Technological University |
| Liu, Yuxi | Nanyang Technological University |
| Yang, Jianfei | Nanyang Technological University |
Keywords: Reinforcement Learning, AI-Enabled Robotics, Deep Learning in Grasping and Manipulation
Abstract: Precise robot manipulation is critical for fine-grained applications such as chemical and biological experiments, where even small errors (e.g., reagent spillage) can invalidate an entire task. Existing approaches often rely on pre-collected expert demonstrations and train policies via imitation learning (IL) or offline reinforcement learning (RL). However, obtaining high-quality demonstrations for precision tasks is difficult and time-consuming, while offline RL commonly suffers from distribution shifts and low data efficiency. We introduce a Role-Model Reinforcement Learning (RM-RL) framework that unifies online and offline training in real-world environments. The key idea is a role-model strategy that automatically generates labels for online training data using approximately optimal actions, eliminating the need for human demonstrations. RM-RL reformulates policy learning as supervised training, reducing instability from distribution mismatch and improving efficiency. A hybrid training scheme further leverages online role-model data for offline reuse, enhancing data efficiency through repeated sampling. Extensive experiments show that RM-RL converges faster and more stably than existing RL methods, yielding significant gains in real-world manipulation: 53% improvement in translation accuracy and 20% in rotation accuracy. Finally, we demonstrate the successful execution of a challenging task, precisely placing a cell plate onto a shelf, highlighting the framework’s effectiveness where prior methods fail.
|
| |
| 15:00-16:30, Paper WeI2I.138 | Add to My Program |
| SurgVidLM: Towards Multi-Grained Video Understanding with Large Language Model in Robot-Assisted Surgery |
|
| Wang, Guankun | The Chinese University of Hong Kong |
| Wang, Junyi | The Chinese University of Hong Kong |
| Mo, Wenjin | Sun Yat-sen University |
| Bai, Long | Alibaba DAMO Academy |
| Yuan, Kun | University of Strasbourg |
| Hu, Ming | Monash University |
| Wu, Jinlin | Centre for Artificial Intelligence and Robotics (CAIR), Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences |
| He, Junjun | Shanghai AI Laboratory |
| Huang, Yiming | The Chinese University of Hong Kong |
| Padoy, Nicolas | University of Strasbourg |
| Lei, Zhen | Institute of Automation, Chinese Academy of Sciences |
| Liu, Hongbin | Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences |
| Navab, Nassir | TU Munich |
| Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore (NUS) |
|
|
| |
| 15:00-16:30, Paper WeI2I.139 | Add to My Program |
| ORN-CBF: Learning Observation-Conditioned Residual Neural Control Barrier Functions Via Hypernetworks |
|
| Derajic, Bojan | Aumovio & TU Berlin |
| Bernhard, Sebastian | Aumovio |
| Hoenig, Wolfgang | TU Berlin |
Keywords: Robot Safety, Machine Learning for Robot Control, Collision Avoidance
Abstract: Control barrier functions (CBFs) have been demonstrated as an effective method for safety-critical control of autonomous systems. Although CBFs are simple to deploy, their design remains challenging, motivating the development of learning-based approaches. Yet, issues such as suboptimal safe sets, applicability in partially observable environments, and lack of rigorous safety guarantees persist. In this work, we propose observation-conditioned neural CBFs based on Hamilton-Jacobi (HJ) reachability analysis, which approximately recover the maximal safe sets. We exploit certain mathematical properties of the HJ value function, ensuring that the predicted safe set never intersects with the observed failure set. Moreover, we leverage a hypernetwork-based architecture that is particularly suitable for the design of observation-conditioned safety filters. The proposed method is examined both in simulation and hardware experiments for a ground robot and a quadcopter. The results show improved success rates and generalization to out-of-domain environments compared to the baselines.
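For readers unfamiliar with CBF safety filters, the following is a generic single-constraint sketch of the quadratic program that a learned barrier of this kind would plug into. In the paper, h is an observation-conditioned neural CBF produced by a hypernetwork; here h, grad_h, f, and g are placeholder callables for a control-affine system.

```python
import numpy as np

def cbf_filter(u_nom, x, h, grad_h, f, g, alpha=1.0):
    """Project u_nom onto {u : Lf h + Lg h u + alpha * h(x) >= 0}.

    Closed-form solution of the single-constraint CBF-QP for
    dx/dt = f(x) + g(x) u, with h a scalar barrier function.
    """
    Lfh = grad_h(x) @ f(x)              # scalar drift term
    Lgh = grad_h(x) @ g(x)              # (m,) input directions
    margin = Lfh + alpha * h(x) + Lgh @ u_nom
    if margin >= 0.0:                   # nominal control already safe
        return u_nom
    # Minimum-norm correction along the constraint normal.
    return u_nom - (margin / (Lgh @ Lgh + 1e-9)) * Lgh
```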
|
| |
| 15:00-16:30, Paper WeI2I.140 | Add to My Program |
| PersONAL: Towards a Comprehensive Benchmark for Personalized Embodied Agents |
|
| Ziliotto, Filippo | University of Padova |
| Akkara, Jelin Raphael | University of Padova |
| Daniele, Alessandro | Fondazione Bruno Kessler |
| Ballan, Lamberto | University of Padova |
| Serafini, Luciano | Fondazione Bruno Kessler |
| Campari, Tommaso | FBK |
Keywords: Human-Centered Robotics, Autonomous Agents, Data Sets for Robotic Vision
Abstract: Recent advances in Embodied AI have enabled agents to perform increasingly complex tasks and adapt to diverse environments. However, deploying such agents in realistic human-centered scenarios, such as domestic households, remains challenging, particularly due to the difficulty of modeling individual human preferences and behaviors. In this work, we introduce PersONAL (PERSonalized Object Navigation And Localization), a comprehensive benchmark designed to study personalization in Embodied AI. Agents must identify, retrieve, and navigate to objects associated with specific users, responding to natural-language queries such as find Lily’s backpack. PersONAL comprises over 2,000 high-quality episodes across 30+ photorealistic homes from the HM3D dataset. Each episode includes a natural-language scene description with explicit associations between objects and their owners, requiring agents to reason over user-specific semantics. The benchmark supports two evaluation modes: (1) active navigation in unseen environments, and (2) object grounding in previously mapped scenes. Experiments with state-of-the-art baselines reveal a substantial gap to human performance, highlighting the need for embodied agents capable of perceiving, reasoning, and memorizing over personalized information; paving the way towards real-world assistive robot. Code and dataset available at: github.io/PersONAL
|
| |
| 15:00-16:30, Paper WeI2I.141 | Add to My Program |
| Vectorized Online POMDP Planning |
|
| Hoerger, Marcus | Australian National University |
| Sudrajat, Muhammad Rafi | Australian National University |
| Kurniawati, Hanna | Australian National University |
Keywords: Planning under Uncertainty, Motion and Path Planning
Abstract: Planning under partial observability is an essential capability of autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for such problems, capturing the stochastic effects of actions and the limited information available through noisy observations. POMDP solving could benefit tremendously from massive parallelization on today's hardware, but parallelizing POMDP solvers has been challenging. Most solvers rely on interleaving numerical optimization over actions with the estimation of their values, which creates dependencies and synchronization bottlenecks between parallel processes that can offset the benefits of parallelization. In this paper, we propose the Vectorized Online POMDP Planner (VOPP), a novel parallel online solver that leverages a recent POMDP formulation which analytically solves part of the optimization component, leaving only the estimation of expectations to numerical computation. VOPP represents all data structures related to planning as a collection of tensors and implements all planning steps as fully vectorized computations over this representation. The result is a massively parallel online solver with no dependencies or synchronization bottlenecks between concurrent processes. Experimental results indicate that VOPP is at least 20X more efficient in computing near-optimal solutions compared to an existing state-of-the-art parallel online solver. Moreover, VOPP outperforms state-of-the-art sequential online solvers, while using a planning budget that is 1000X smaller.
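The toy sketch below conveys the "everything as tensors" flavor of such a solver: action values fall out of purely batched expectations over sampled particles, with no per-node synchronization. This is a drastically simplified open-loop rollout, not the VOPP algorithm; step_fn and reward_fn are assumed batched dynamics and reward models.

```python
import torch

def estimate_action_values(particles, actions, step_fn, reward_fn,
                           n_steps, gamma=0.95):
    """particles: (N, state_dim) belief samples; actions: (A, action_dim).
    Returns an (A,) tensor of Monte Carlo action-value estimates."""
    N, A = particles.shape[0], actions.shape[0]
    # Tile so every (particle, action) pair is simulated in one batch.
    s = particles.repeat_interleave(A, dim=0)   # (N*A, state_dim)
    a = actions.repeat(N, 1)                    # (N*A, action_dim)
    values = torch.zeros(N * A)
    discount = 1.0
    for _ in range(n_steps):
        s = step_fn(s, a)                       # fully batched dynamics
        values += discount * reward_fn(s, a)    # fully batched rewards
        discount *= gamma
    return values.view(N, A).mean(dim=0)        # expectation over particles
```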
|
| |
| 15:00-16:30, Paper WeI2I.142 | Add to My Program |
| Omnidirectional Dual-Arm Aerial Manipulator with Proprioceptive Contact Localization for Landing on Slanted Roofs |
|
| Brummelhuis, Martijn B.J. | Delft University of Technology |
| Lepora, Nathan F. | University of Bristol |
| Hamaza, Salua | Delft University of Technology |
Keywords: Aerial Systems: Perception and Autonomy, Aerial Systems: Mechanics and Control, Force and Tactile Sensing
Abstract: Operating drones in urban environments often means they need to land on rooftops, which can have different geometries and surface irregularities. Accurately detecting roof inclination using conventional sensing methods, such as vision-based or acoustic techniques, can be unreliable, as measurement quality is strongly influenced by external factors including weather conditions and surface materials. To overcome these challenges, we propose a novel unmanned aerial manipulator (UAM) morphology featuring dual arms with an omnidirectional 3D workspace and extended reach. Building on this design, we develop a proprioceptive contact detection and contact localization strategy based on a momentum-based torque observer. This enables the UAM to infer the inclination of slanted surfaces blindly, through physical interaction, prior to touchdown. We validate the approach in flight experiments, demonstrating robust landings on surfaces with inclinations of up to 30.5° and achieving an average surface inclination estimation error of 2.87° over 9 experiments at different incline angles.
|
| |
| 15:00-16:30, Paper WeI2I.143 | Add to My Program |
| Quantization of DRL Models for Embedded Microcontrollers |
|
| Bohm, Peter | The University of Queensland |
| Pounds, Pauline | The University of Queensland |
| Moghadam, Peyman | CSIRO |
| Chapman, Archie | The University of Queensland |
| Chung, Jen Jen | The University of Queensland |
Keywords: Embedded Systems for Robotic and Automation
Abstract: For Deep Reinforcement Learning (DRL) models to deliver actual utility, they must function within production environments, which often lack the extensive computational resources of training environments. Requiring dedicated GPU resources is not economically feasible and can be especially prohibitive in low-cost robotic contexts. Neural network quantization serves as a viable solution to these constraints. This technique aims to lessen computational and memory requirements while maintaining performance. By reducing the precision of the DRL network weights and the network input (sensory observations), the deployment size can be compacted to fit within MCU-class devices, while ensuring that inference operates at adequate frequencies. This paper investigates the impact of quantization on DRL policies and presents a quantization-friendly network architecture for the Soft Actor-Critic (SAC) and TD3 algorithms. We propose a streamlined actor network optimized for inference-only deployments and quantization, and integrate a GRU-based encoder into the DRL framework using a custom, quantization-compatible implementation. These changes enable both networks to be quantized to integer precision. We then deploy the quantized policies on a microcontroller-scale device (ESP32-S3) to control a low-cost quadrupedal robot using only proprioception and on-board inference.
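As a minimal, generic illustration of weight quantization for an inference-only actor (not the paper's integer-only MCU pipeline, which targets fully integer kernels on the ESP32-S3), PyTorch's dynamic quantization can shrink the linear layers of a placeholder SAC-style actor; the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

# Placeholder actor: proprioceptive observation in, joint targets out.
actor = nn.Sequential(
    nn.Linear(48, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 12), nn.Tanh(),          # actions squashed to [-1, 1]
)

# Post-training dynamic quantization: int8 weights, float activations.
quantized_actor = torch.quantization.quantize_dynamic(
    actor, {nn.Linear}, dtype=torch.qint8)

obs = torch.randn(1, 48)                    # fake proprioceptive observation
action = quantized_actor(obs)               # runs with quantized weights
```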
|
| |
| 15:00-16:30, Paper WeI2I.144 | Add to My Program |
| When the Adversary Knows You Better: Adversarial Training for Learning-Based Legged Robots |
|
| Xu, Qinchao | Kyoto University |
| Yagi, Satoshi | Kyoto University |
| Yamamori, Satoshi | Kyoto University |
| Morimoto, Jun | Kyoto University |
Keywords: Reinforcement Learning, Legged Robots, Robot Safety
Abstract: Deep reinforcement learning has emerged as the dominant paradigm for training legged robots to locomote. However, when deployed in unstructured, dynamically varying real-world environments, the safety of neural-network-based controllers remains insufficiently guaranteed. Prior studies have demonstrated that sequential adversarial attacks, formulated via reinforcement learning, can effectively expose latent vulnerabilities in controllers and thus serve as a valuable complement to Domain Randomization techniques. These methods, however, are inherently constrained by the assumption that both the adversary and the locomotion policy share identical state space inputs. In contrast, our approach overcomes this limitation by incorporating privileged information into the adversarial network's observation input, thereby more than doubling the attack success rate. Furthermore, we mitigate the controller's tendency toward overly conservative behavior under attacks by introducing stochastic termination criteria. We validate the proposed method in real-world deployments, showing that it not only significantly enhances robustness but also preserves original task performance.
|
| |
| 15:00-16:30, Paper WeI2I.145 | Add to My Program |
| AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models |
|
| Rao, Zhifeng | Southern University of Science and Technology |
| Chen, Wenlong | Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) |
| Xie, Lei | Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) |
| Hua, Xia | Shanghai University |
| Yin, Dongfu | Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) |
| Tian, Zhen | Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) |
| Yu, Fei | Guangming Lab |
|
|
| |
| 15:00-16:30, Paper WeI2I.146 | Add to My Program |
| Proprioceptive Contact State and Contact Point Estimation for a Leg-Wheel Transformable Robot |
|
| Huang, Kuan-Jung | National Taiwan University |
| Yu, Wei-Shun | National Taiwan University |
| Lin, Pei-Chun | National Taiwan University |
Keywords: Legged Robots, Wheeled Robots, Dynamics
Abstract: Hybrid leg-wheel robots offer exceptional mobility, but their complex mechanics and extended contact surfaces challenge modern control frameworks that rely on simple point-foot models. Accurately estimating both the contact state and the precise contact location using only proprioceptive sensors is a critical and unresolved problem for these platforms. To address this, we present a complete, proprioception-only framework that provides both contact state and contact point information for this class of robot. As an example, the framework is executed on a computationally efficient, simplified dynamic model of the complex 11-bar leg mechanism, which enables a discrete-time Generalized Momentum Observer (GMO) to accurately estimate external wrenches. An optimization-based algorithm then precisely localizes the contact point by finding the location along the rim that best explains the full-body dynamics. The framework's performance was validated in high-fidelity simulations across diverse gaits. For contact state validation, the detector demonstrates over 97% single-leg accuracy during a dynamic 0.4 m/s trot. For contact point validation, the localization stage achieves accurate estimation throughout the stance phase with an RMS error of 0.0173 m. Our work provides the essential contact information required to enable advanced model-based control for these challenging platforms.
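A standard discrete-time generalized momentum observer of the kind the abstract builds on can be sketched as follows; the gain, timestep, and interfaces are illustrative, and the paper's exact discretization may differ. The residual r converges to the external joint torque without requiring joint acceleration measurements.

```python
import numpy as np

class MomentumObserver:
    """Discrete-time generalized momentum observer (GMO).

    Uses d/dt p = tau + C^T q_dot - g + tau_ext with p = M(q) q_dot,
    so the residual r tracks the external torque tau_ext.
    """
    def __init__(self, n_joints, gain, dt):
        self.K = gain * np.eye(n_joints)
        self.dt = dt
        self.integral = np.zeros(n_joints)
        self.r = np.zeros(n_joints)
        self.p0 = None

    def update(self, M, C, g, q_dot, tau):
        p = M @ q_dot                               # generalized momentum
        if self.p0 is None:
            self.p0 = p                             # initial momentum offset
        self.integral += (tau + C.T @ q_dot - g + self.r) * self.dt
        self.r = self.K @ (p - self.p0 - self.integral)
        return self.r                               # external torque estimate
```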
|
| |
| 15:00-16:30, Paper WeI2I.147 | Add to My Program |
| Empirical Prediction of Pedestrian Comfort in Mobile Robot–Pedestrian Encounters |
|
| Jafari, Alireza | National Cheng Kung University |
| Nguyen, Hong-Son | National Cheng Kung University |
| Liu, Yen-Chen | National Cheng Kung University |
Keywords: Human-Aware Motion Planning, Social HRI, Design and Human Factors
Abstract: Mobile robots joining public spaces like sidewalks must account for pedestrian comfort. Many studies consider pedestrians' objective safety, for example, by developing collision avoidance algorithms, but few take the pedestrian's subjective safety or comfort into consideration. Quantifying comfort is a major challenge that hinders mobile robots from understanding and responding to human emotions. We empirically examine the relationship between mobile robot-pedestrian interaction kinematics and subjective comfort. We perform one-on-one experimental trials, each involving a mobile robot and a volunteer. Statistical analysis of pedestrians' reported comfort versus the kinematic variables shows moderate but significant correlations for most variables. We use these findings to empirically design three comfort estimators/predictors based on the minimum distance, the minimum projected time-to-collision, and a composite estimator. The composite estimator employs all studied kinematic variables and achieves the highest prediction rate and classification performance among the predictors. The composite predictor has an odds ratio of 3.67: in simple terms, when it identifies a pedestrian as comfortable, it is almost 4 times more likely that the pedestrian is comfortable rather than uncomfortable. The study provides a comfort quantifier for incorporating pedestrian feelings into path planners for more socially compliant robots.
|
| |
| 15:00-16:30, Paper WeI2I.148 | Add to My Program |
| High-Bandwidth Tactile-Reactive Control for Grasp Adjustment |
|
| Lee, Yonghyeon | Massachusetts Institute of Technology |
| Lin, Tzu-Yuan | Massachusetts Institute of Technology |
| Alexiev, Alexander | Massachusetts Institute of Technology |
| Kim, Sangbae | Massachusetts Institute of Technology |
Keywords: Force and Tactile Sensing, Dexterous Manipulation, Grasping
Abstract: Vision-only grasping systems are fundamentally constrained by calibration errors, sensor noise, and grasp pose prediction inaccuracies, leading to unavoidable contact uncertainty in the final stage of grasping. High-bandwidth tactile feedback, when paired with a well-designed tactile-reactive controller, can significantly improve robustness in the presence of perception errors. This paper contributes to controller design by proposing a purely tactile-feedback grasp-adjustment algorithm. The proposed controller requires neither prior knowledge of the object’s geometry nor an accurate grasp pose, and is capable of refining a grasp even when starting from a crude, imprecise initial configuration and uncertain contact points. Through simulation studies and real-world experiments on a 15-DoF arm–hand system (featuring an 8-DoF hand) equipped with fingertip tactile sensors operating at 200 Hz, we demonstrate that our tactile-reactive grasping framework effectively improves grasp stability.
|
| |
| 15:00-16:30, Paper WeI2I.149 | Add to My Program |
| FG-HOCBF: Safe Operation Area Extension and Obstacle Avoidance Direction Guidance for Surface Detection in Narrow Environments |
|
| Li, Yujie | Huazhong University of Science and Technology |
| Gao, Zhitao | Huazhong University of Science and Technology |
| Chen, Chen | Wuhan University of Science and Technology |
| Peng, Fangyu | Huazhong University of Science and Technology |
| Zhang, Yukui | Huazhong University of Science and Technology |
| Yan, Rong | Huazhong University of Science and Technology |
| Tang, Xiaowei | Huazhong University of Science and Technology |
| Zhou, Wenke | Huazhong University of Science and Technology |
Keywords: Robot Safety, Collision Avoidance
Abstract: High-order control barrier functions (HOCBFs), which can achieve strict safety guarantees, are widely used in robot safety control. However, robot obstacle avoidance in narrow environments with curved surfaces, as represented by aircraft blade detection, is still a challenge. Considering the narrow space between adjacent blades, the traditional spherical barrier boundary is not suitable for flat curved-surface blades and cannot provide sufficient operational space. Furthermore, the lengths of obstacle avoidance paths in different directions vary greatly under the overall distortion characteristics of the blades, and HOCBFs lack explicit direction guidance. To address these challenges, we first propose an accurate, fast-to-solve surface envelope method based on rotated and scaled super-ellipsoids to obtain a large operational space. Building upon this, we propose a novel force-guided high-order control barrier function (FG-HOCBF) method to guide the robot to closely adhere to the surface along the desired direction and complete detection of specific areas, which consists of two components: surface normal approach judgment and guiding force generation in the desired direction. Finally, simulations and experiments validate the performance of the proposed method.
|
| |
| 15:00-16:30, Paper WeI2I.150 | Add to My Program |
| LP-MPPI: Low-Pass Filtering for Efficient Model Predictive Path Integral Control |
|
| Kicki, Piotr | Poznan University of Technology |
Keywords: Optimization and Optimal Control, Motion Control, Motion and Path Planning
Abstract: Model Predictive Path Integral (MPPI) control is a widely used sampling-based approach for real-time control, valued for its flexibility in handling arbitrary dynamics and cost functions. However, it often suffers from high-frequency noise in the sampled control trajectories, which hinders the search for optimal controls and transfers to the applied controls, leading to actuator wear. In this work, we introduce Low-Pass Model Predictive Path Integral Control (LP-MPPI), which integrates low-pass filtering into the sampling process to eliminate detrimental high-frequency components and enhance the algorithm's efficiency. Unlike prior approaches, LP-MPPI provides direct and interpretable control over the frequency spectrum of sampled control trajectory perturbations, leading to more efficient sampling and smoother control. Through extensive evaluations in Gymnasium environments, simulated quadruped locomotion, and real-world F1TENTH autonomous racing, we demonstrate that LP-MPPI consistently outperforms state-of-the-art MPPI variants, achieving significant performance improvements while reducing control signal chattering.
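A minimal sketch of the core idea, assuming a standard MPPI loop: low-pass filter the sampled perturbations along the time axis before rollout and weighting. A first-order exponential smoother stands in here for brevity; the paper shapes the perturbations' frequency spectrum directly and interpretably.

```python
import numpy as np

def sample_lowpass_noise(n_samples, horizon, n_ctrl, sigma,
                         alpha=0.3, rng=None):
    """Sample Gaussian perturbations, then low-pass them along time.

    alpha in (0, 1] sets the (illustrative) cutoff: smaller = smoother.
    """
    rng = rng if rng is not None else np.random.default_rng()
    eps = rng.normal(0.0, sigma, size=(n_samples, horizon, n_ctrl))
    for t in range(1, horizon):
        eps[:, t] = alpha * eps[:, t] + (1.0 - alpha) * eps[:, t - 1]
    return eps

def mppi_update(u_nominal, costs, eps, lam=1.0):
    """Standard MPPI exponential averaging over perturbed rollouts."""
    w = np.exp(-(costs - costs.min()) / lam)
    w /= w.sum()
    return u_nominal + np.einsum('k,kto->to', w, eps)
```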
|
| |
| 15:00-16:30, Paper WeI2I.151 | Add to My Program |
| Diverse Skill Discovery in Fourier Latent Space Via Unsupervised Learning |
|
| Cui, Ruopeng | Fudan University |
| Sun, Yucong | Tsinghua University |
| Bu, Xizhou | Fudan University |
| Chao, Wang | Qiyuan Laboratory |
| Li, Wei | Fudan University |
Keywords: AI-Enabled Robotics, Reinforcement Learning, Deep Learning Methods
Abstract: Unsupervised skill discovery acquires a diverse repertoire of skills through intrinsic motivation, offering the potential to alleviate the labor-intensive reward engineering in reinforcement learning and the reliance on costly task-specific data in imitation learning. However, such methods typically measure diversity based on single-step states, neglecting the trajectory phase coherence, whose absence disrupts the smoothness of state transitions. In this work, we explore skills in Fourier latent space via a simple mutual-information-based reward function, aiming to train a single versatile policy capable of executing diverse state transition patterns. Specifically, we utilize a spatio-temporal representation learned through a Periodic Autoencoder, which effectively captures the periodic or quasi-periodic nature of motion. These features, rather than raw states, are used to measure skill diversity. We validate our method on the 12-DOF quadruped robot Unitree A1, achieving varied gaits. Simulation results show that our method reduces high-frequency power by 73%, while improving state space coverage by 133% compared to the baseline. To accomplish specific tasks, we trained a high-level controller to orchestrate the learned skills, which improves training efficiency. Real-world experiments demonstrate that the learned skills can reliably execute tasks.
|
| |
| 15:00-16:30, Paper WeI2I.152 | Add to My Program |
| Reactive Slip Control in Multifingered Grasping: Hybrid Tactile Sensing and Internal-Force Optimization |
|
| Ayral, Théo | Université Paris-Saclay, CEA, Leti |
| Aloui, Saifeddine | Université Grenoble Alpes, CEA, Leti |
| Grossard, Mathieu | Université Paris-Saclay, CEA, List |
Keywords: Multifingered Hands, Force Control, Force and Tactile Sensing
Abstract: We build a low-level reflex control layer driven by fast tactile feedback for multifinger grasp stabilization. Our hybrid approach combines learned tactile slip detection with model-based internal-force control to halt in-hand slip while preserving the object-level wrench. The multimodal tactile stack integrates piezoelectric sensing (PzE) for fast slip cues and piezoresistive arrays (PzR) for contact localization, enabling online construction of a contact-centric grasp representation without prior object knowledge. Experiments demonstrate reactive stabilization of multifingered grasps under external perturbations, without explicit friction models or direct force sensing. In controlled trials, slip onset is detected after 20.4 ± 6 ms. The framework yields a theoretical grasp response latency on the order of 30 ms, with grasp-model updates in less than 5 ms and internal-force selection in about 4 ms. The analysis supports the feasibility of sub-50 ms tactile-driven grasp responses, aligned with human reflex baselines.
|
| |
| 15:00-16:30, Paper WeI2I.153 | Add to My Program |
| Guaranteed Robust Nonlinear MPC Via Disturbance Feedback |
|
| Leeman, Antoine | ETH Zurich |
| Köhler, Johannes | Imperial College London |
| Zeilinger, Melanie N. | ETH Zurich |
Keywords: Optimization and Optimal Control, Robust/Adaptive Control
Abstract: Robots must satisfy safety-critical state and input constraints despite disturbances and model mismatch. We introduce a robust model predictive control (RMPC) formulation that is scalable and compatible with real-time implementation. Our formulation guarantees robust constraint satisfaction, input-to-state stability (ISS) and recursive feasibility. The key idea is to decompose the uncertain nonlinear system into (i) a nominal nonlinear dynamic model, (ii) disturbance-feedback controllers, and (iii) bounds on the model error. These components are optimized jointly using sequential convex programming. The resulting convex subproblems are solved using a recent disturbance-feedback MPC solver. The approach is validated across multiple dynamics, including a rocket-landing problem with steerable thrust. An open-source implementation is available at https://github.com/antoineleeman/robust-nonlinear-mpc.
|
| |
| 15:00-16:30, Paper WeI2I.154 | Add to My Program |
| SCU-Hand with Integrated Single-Sheet Valve: A Funnel-Shaped Robotic Hand for Milligram-Scale Powder Handling |
|
| Takahashi, Tomoya | OMRON SINIC X Corporation |
| Nakajima, Yusaku | SOKENDAI |
| Beltran-Hernandez, Cristian Camilo | OMRON SINIC X Corporation |
| Kuroda, Yuki | OMRON SINIC X Corporation |
| Tanaka, Kazutoshi | OMRON SINIC X Corporation |
| Hamaya, Masashi | OMRON SINIC X Corporation |
| Ono, Kanta | Osaka University |
| Ushiku, Yoshitaka | OMRON SINIC X Corporation |
Keywords: Soft Robot Applications, Robotics and Automation in Life Sciences, Soft Robot Materials and Design
Abstract: Laboratory Automation (LA) has the potential to accelerate solid-state materials discovery by enabling continuous robotic operation without human intervention. While robotic systems have been developed for tasks such as powder grinding and X-ray diffraction (XRD) analysis, fully automating powder handling at the milligram scale remains a significant challenge due to the complex flow dynamics of powders and the diversity of laboratory tasks. To address this challenge, this study proposes the SCU-Hand-SV (Soft Conical Universal Robotic Hand with Single-sheet Valve), which preserves the soft, conical-sheet design of prior work while incorporating a controllable valve at the cone apex to enable precise, incremental dispensing of milligram-scale powder quantities. The SCU-Hand-SV is integrated with an external balance through a feedback control system based on a model of powder flow and online parameter identification. Experimental evaluations with glass beads, monosodium glutamate, and titanium dioxide demonstrated that 80% of the trials achieved an error within ±2 mg, and the maximum error observed was approximately 20 mg across a target range of 20 mg to 3 g. In addition, by incorporating flow prediction models commonly used for hoppers and performing online parameter identification, the system is able to adapt to variations in powder dynamics. Compared to direct PID control, the proposed model-based control significantly improved both accuracy and convergence speed. These results highlight the potential of the proposed system to enable efficient and flexible powder weighing, with scalability toward larger quantities and applicability to a broad range of laboratory automation tasks.
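The closed-loop dispensing idea can be caricatured as follows: read the balance, keep an online estimate of mass flow, and throttle or close the valve based on the predicted flow. This toy loop uses a moving-average rate instead of the paper's hopper flow model with online parameter identification; read_mass and set_valve are hypothetical device interfaces.

```python
import numpy as np

def dispense(read_mass, set_valve, target_mg, dt=0.05,
             window=10, max_steps=2000):
    """Toy mass-feedback dispensing loop (masses in mg, flow in mg/s)."""
    history = [0.0]
    m0 = read_mass()                                 # tare the balance
    for _ in range(max_steps):
        m = read_mass() - m0                         # dispensed so far
        history.append(m)
        rates = np.diff(history[-window:]) / dt      # recent flow samples
        rate = max(rates.mean(), 0.0)                # online flow estimate
        remaining = target_mg - m
        if remaining <= rate * dt:                   # would overshoot next step
            break
        # Open proportionally to the remaining mass; throttle near target.
        set_valve(min(1.0, remaining / max(rate * 10 * dt, 1e-6)))
    set_valve(0.0)
    return read_mass() - m0
```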
|
| |
| 15:00-16:30, Paper WeI2I.155 | Add to My Program |
| Learning to Annotate Delayed and False AEB Events: A Practical System for Extreme Class Imbalance and Asymmetric Label Noise |
|
| Hao, Mengxiang | Li Auto Inc |
| Jiang, Xin | Li Auto Inc |
| Huang, Xinghao | Li Auto Inc |
| Su, Wenliang | Li Auto Inc |
| Wang, Zhiteng | Li Auto Inc |
| Rao, Junjie | Li Auto Inc |
| Yang, Xiaotian | Li Auto Inc |
| Liao, Wei | Li Auto Inc |
| Han, Chengyu | Li Auto Inc |
| Liang, Gen | Li Auto Inc |
| Song, Yulun | Li Auto Inc |
| Xu, Zhitao | Li Auto Inc |
| Lang, Xianpeng | Li Auto Inc |
Keywords: Intelligent Transportation Systems, Collision Avoidance, Deep Learning Methods
Abstract: Autonomous Emergency Braking (AEB) optimization relies on accurately annotated real-world trigger events, particularly rare but critical delayed and false AEB triggers that expose system deficiencies. However, these minority events comprise less than 5% of thousands of daily triggers, making manual annotation prohibitively expensive at scale. We present the first automated AEB annotation framework to address this problem. During development, we identified two fundamental challenges that severely impair delayed/false trigger annotation accuracy: extreme class imbalance, where minority events are overwhelmed by true triggers, and asymmetric label noise, where mislabeled majority samples suppress minority class learning. To overcome these challenges, we propose two key innovations: (1) specific data augmentation that synthesizes realistic samples by manipulating focal target attributes, transplanting ego-vehicle dynamics, and masking non-focal agents; (2) probe-guided noise suppression using stable hardness estimation to clean mislabeled true trigger samples. We deploy our model as a practical annotation system with a full-stack architecture, efficiently identifying critical AEB events from thousands of daily samples. Production results demonstrate an 80% improvement in delayed/false trigger recall and a 50% reduction in manual workload. Beyond immediate gains, the system enables continuous self-improvement through accumulated high-quality annotations, establishing a necessary data foundation for on-vehicle AEB system optimization.
|
| |
| 15:00-16:30, Paper WeI2I.156 | Add to My Program |
| PG-Match: A Pose-Guided Generalizable Framework for Semi-Dense Feature Matching |
|
| Pei, Jiayi | Nankai University |
| Song, Peili | Nankai University |
| Zhao, Chenyang | Southeast University |
| Sun, Lei | Nankai University |
| Liu, Jingtai | Nankai University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, Visual Learning
Abstract: Feature matching is a fundamental technique in visual perception, essential for tasks such as 3D reconstruction, SLAM, and visual localization. Existing detector-free methods often struggle with generalization due to their reliance on depth data, which is not available in many datasets. We propose PG-Match, a detector-free feature matching framework that leverages pose supervision instead of depth-based supervision, improving its generalization across diverse environments. Additionally, we introduce a Differentiable Outlier Rejection Module (DORM) to enhance global consistency and increase the inlier ratio. A coarse-to-fine matching strategy is employed for efficiency, where specially designed confidence scores are utilized to guide the sampling process. This ensures efficient convergence and avoids local optima. Experiments on the widely used MegaDepth-1500 dataset demonstrate that PG-Match consistently outperforms state-of-the-art approaches, highlighting the effectiveness of its pose-guided design. Additionally, experiments on the depth-free PhotoTourism dataset further evaluate the generalization of PG-Match, and its performance is also assessed in a downstream Structure from Motion (SfM) task.
|
| |
| 15:00-16:30, Paper WeI2I.157 | Add to My Program |
| FALCO: Foundation Model Guided Active Learning for Cost-Effective Off-Road Freespace Detection |
|
| Wang, Shuai | Peking University |
| Li, Chenxin | Peking University |
| Chen, Yintong | Beijing Institute of Technology |
| Jia, Yaobo | Peking University |
| Li, Hongze | Peking University |
| Min, Chen | Chinese Academy of Sciences |
| Mei, Jilin | Institute of Computing Technology, Chinese Academy of Sciences |
| Zhao, Huijing | Peking University |
Keywords: Semantic Scene Understanding, Deep Learning for Visual Perception, Autonomous Vehicle Navigation
Abstract: Freespace detection in unstructured off-road environments is critical for safe autonomous navigation but remains highly challenging due to ambiguous boundaries, diverse terrains, and long-tail safety-critical cases. Constructing large annotated datasets in such environments is prohibitively costly, which makes active learning essential to maximize model robustness under limited annotation budgets. However, conventional uncertainty or diversity-based strategies are unreliable in these complex settings, often failing to capture rare yet important scenarios. To address this, we propose FALCO, a foundation model guided active learning framework for cost-effective off-road freespace detection. FALCO integrates three complementary criteria: prediction deviation from a vision foundation model, model uncertainty, and semantic evaluation from a vision-language model to form a reliable sample criticality score. In addition, we introduce a semantic grid based sampling strategy that balances coverage across scene conditions while prioritizing challenging cases. Extensive experiments show that FALCO substantially improves robustness on rare and difficult scenarios, achieving significant gains in low-percentile IoU compared to state-of-the-art baselines, while maintaining competitive overall performance.
|
| |
| 15:00-16:30, Paper WeI2I.158 | Add to My Program |
| QuASH: Using Natural-Language Heuristics to Query Visual-Language Robotic Maps |
|
| Pekkanen, Matti | Aalto University |
| Verdoja, Francesco | Aalto University |
| Kyrki, Ville | Aalto University |
Keywords: Object Detection, Segmentation and Categorization, Semantic Scene Understanding, Deep Learning for Visual Perception
Abstract: Embeddings from Visual-Language Models are increasingly utilized to represent semantics in robotic maps, offering an open-vocabulary scene understanding that surpasses traditional, limited labels. Embeddings enable on-demand querying by comparing embedded user text prompts to map embeddings via a similarity metric. The key challenge in performing the task indicated in a query is that the robot must determine the parts of the environment relevant to the query. This paper proposes a solution to this challenge. We leverage natural-language synonyms and antonyms associated with the query within the embedding space, applying heuristics to estimate the language space relevant to the query, and use that to train a classifier to partition the environment into matches and non-matches. We evaluate our method through extensive experiments, querying both maps and standard image benchmarks. The results demonstrate increased queryability of maps and images. Our querying technique is agnostic to the representation and encoder used, and requires limited training.
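A minimal sketch of the heuristic just described: embed synonyms and antonyms of the query and train a small classifier that partitions map embeddings into matches and non-matches. Here encode stands for any CLIP-style text encoder, and the example word lists are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_query_classifier(encode, synonyms, antonyms):
    """Train a match/non-match classifier in the embedding space.

    encode: callable mapping a word to a (d,) embedding vector.
    """
    X = np.stack([encode(w) for w in synonyms + antonyms])
    y = np.array([1] * len(synonyms) + [0] * len(antonyms))
    return LogisticRegression().fit(X, y)

# Hypothetical usage on a map of per-cell embeddings:
# clf = build_query_classifier(encode,
#     synonyms=["sofa", "couch", "settee"],
#     antonyms=["table", "lamp", "floor"])
# matches = clf.predict(map_embeddings)   # 1 = cell relevant to the query
```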
|
| |
| 15:00-16:30, Paper WeI2I.159 | Add to My Program |
| Manual2Skill++: Connector-Aware General Robotic Assembly from Instruction Manuals Via Vision–Language Models |
|
| Tie, Chenrui | National University of Singapore |
| Sun, Shengxiang | University of Toronto |
| Lin, Yudi | University of Southern California |
| Wang, Yanbo | Zhejiang University |
| Li, Zhongrui | National University of Singapore |
| Zhong, Zhouhan | National University of Singapore |
| Zhu, Jinxuan | Harbin Institute of Technology, Shenzhen |
| Pang, Yiman | Jilin University |
| Chen, Haonan | National University of Singapore |
| Chen, Junting | National University of Singapore |
| Wu, Ruihai | Peking University |
| Shao, Lin | National University of Singapore |
Keywords: Manipulation Planning, AI-Enabled Robotics, Learning Categories and Concepts
Abstract: Assembly hinges on reliably forming connections between parts; yet most robotic approaches plan assembly sequences and part poses while treating connectors as an afterthought. Connections represent the foundational physical constraints of assembly execution; while task planning sequences operations, the precise establishment of these constraints ultimately determines assembly success. In this paper, we treat connections as explicit, primary entities in assembly representation, directly encoding connector types, specifications, and locations for every assembly step. Drawing inspiration from how humans learn assembly tasks through step-by-step instruction manuals, we present Manual2Skill++, a vision-language framework that automatically extracts structured connection information from assembly manuals. We encode assembly tasks as hierarchical graphs where nodes represent parts and sub-assemblies, and edges explicitly model connection relationships between components. A large-scale vision-language model parses symbolic diagrams and annotations in manuals to instantiate these graphs, leveraging the rich connection knowledge embedded in human-designed instructions. We curate a dataset containing over 20 assembly tasks with diverse connector types to validate our representation extraction approach, and evaluate the complete task understanding-to-execution pipeline across four complex assembly scenarios in simulation, spanning furniture, toys, and manufacturing components with real-world correspondence. More detailed information can be found at https://nus-lins-lab.github.io/Manual2SkillPP/
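A minimal sketch of a connection-centric assembly graph in this spirit, with connectors as first-class edge attributes; all field names and the example values are illustrative, not the paper's schema.

```python
from dataclasses import dataclass, field

@dataclass
class Part:
    """A node: an atomic part or a sub-assembly of child parts."""
    name: str
    children: list = field(default_factory=list)

@dataclass
class Connection:
    """An edge: an explicit physical constraint between two nodes."""
    a: Part
    b: Part
    connector_type: str   # e.g. "cam lock", "wood dowel", "M4 screw"
    spec: str             # size / count parsed from the manual diagram
    location: tuple       # attachment point in the part frame (meters)
    step: int             # manual step that establishes this constraint

leg = Part("leg")
tabletop = Part("tabletop")
step1 = Connection(leg, tabletop, "M4 screw", "x4",
                   (0.02, 0.02, 0.0), step=1)
```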
|
| |
| 15:00-16:30, Paper WeI2I.160 | Add to My Program |
| Probability-Driven Gating for Resilient Multi-Modal Tracking in Robotic Systems |
|
| Wang, Huan | Henan University of Science and Technology |
| Chen, Haomin | Henan University of Science and Technology |
| Du, Pengcheng | Henan University of Science and Technology |
| Si, Pengju | Henan University of Science and Technology |
| Ji, Baofeng | Henan University of Science and Technology |
| Yang, Yongming | Shenyang Institute of Automation |
Keywords: Foundations of Automation, Planning, Scheduling and Coordination, Discrete Event Dynamic Automation Systems
Abstract: The deployment of robots in unstructured environments demands perception systems that are both accurate and resilient. While RGB-Thermal (RGB-T) fusion is promising, current trackers often fail due to rigid, non-adaptive fusion strategies and underutilized cross-modal cues, compromising reliability for robotics. We introduce DSTrack, a novel tracking framework that embeds two core mechanisms for robotic robustness: a Probability-Gated Dynamic Switch and a Synergistic Multi-Domain Enhancement Network. The switch acts as an online decision-maker, allowing the robot to dynamically select the most reliable fusion path based on real-time confidence estimation, enabling crucial adaptation to scene changes. The enhancement network concurrently strengthens target representations within each modality through tri-domain (channel, spatial, frequency) refinement and establishes compensatory links between modalities via a cross-attention module, ensuring performance even during partial sensor degradation. Extensive evaluations on RGB-T benchmarks demonstrate state-of-the-art accuracy. More critically, DSTrack exhibits key properties for robotic integration: real-time environmental adaptability, inherent sensor fault tolerance, and consistent output for downstream planning.
|
| |
| 15:00-16:30, Paper WeI2I.161 | Add to My Program |
| Perching-Based Haptic Guidance for Physical Human–Robot Interaction with Aerial Robots |
|
| Miyamichi, Ayano | The University of Tokyo |
| Okada, Kei | The University of Tokyo |
Keywords: Physical Human-Robot Interaction, Aerial Systems: Mechanics and Control, Haptics and Haptic Interfaces
Abstract: In recent years, the field of Human-Drone Interaction (HDI) has attracted significant attention, particularly in navigation assistance using aerial robots. However, existing approaches rely on non-contact cues or indirect physical connections such as cables, which demand continuous flight and high energy consumption. These limitations shorten interaction time and make direct assistance challenging. To address this issue, we employ a deformable aerial robot with inflatable structures that enables adaptive perching on the human arm. On this platform, we propose a perching-based haptic guidance method in which the robot maintains close contact to deliver directional cues via thrust modulation and provide alerts and arrival feedback through vibration signals. The system further switches feedback modes dynamically according to context, enabling intuitive and flexible guidance beyond conventional methods limited to simple directional cues. Through experiments, we quantitatively evaluated the presented force characteristics and confirmed that perching-based haptic guidance requires less power than continuous flight in the same platform. User experiments further demonstrated that participants could reach target locations without major deviations even when vision and hearing were blocked. Moreover, the entire process of approach, perching, haptic guidance, and deperching was stably executed on the real platform, validating the feasibility of perching-based haptic guidance. To the best of our knowledge, this is the first study to realize close physical Human-Drone Interaction (pHDI) through perching-based haptic guidance.
|
| |
| 15:00-16:30, Paper WeI2I.162 | Add to My Program |
| MAP-VLA: Memory-Augmented Prompting for Vision-Language-Action Model in Robotic Manipulation |
|
| Li, Runhao | Nanyang Technological University |
| Guo, Wenkai | Nanyang Technological University |
| Wu, Zhenyu | Beijing University of Posts and Telecommunications |
| Wang, Changyuan | Tsinghua University |
| Deng, Haoyuan | Nanyang Technological University |
| Weng, Zhenyu | South China University of Technology |
| Tan, Yap-Peng | Nanyang Technological University |
| Wang, Ziwei | Nanyang Technological University |
Keywords: Imitation Learning, Deep Learning in Grasping and Manipulation, Deep Learning Methods
Abstract: Pre-trained Vision-Language-Action (VLA) models have achieved remarkable success in improving robustness and generalization for end-to-end robotic manipulation. However, these models struggle with long-horizon tasks due to their lack of memory and reliance solely on immediate sensory inputs. To address this limitation, we propose Memory-Augmented Prompting for Vision-Language-Action model (MAP-VLA), a novel framework that empowers pre-trained VLA models with demonstration-derived memory prompts to augment action generation for long-horizon robotic manipulation tasks. To achieve this, MAP-VLA first constructs a memory library from historical demonstrations, where each memory unit captures information about a specific stage of a task. These memory units are implemented as learnable soft prompts optimized through prompt tuning. Then, during real-time task execution, MAP-VLA retrieves relevant memory through trajectory similarity matching and dynamically integrates it into the VLA model for augmented action generation. Importantly, this prompt tuning and retrieval augmentation approach operates as a plug-and-play module for a frozen VLA model, offering a lightweight and flexible solution to improve task performance. Experimental results show that MAP-VLA delivers up to 7.0% absolute performance gains in the simulation benchmark and 25.0% on real robot evaluations for long-horizon tasks, surpassing the current state-of-the-art methods.
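A sketch of retrieval by trajectory similarity as described above: match an embedding of the recent observation trajectory against stored demonstration stages and return the soft prompt of the best match. The names, the cosine-similarity metric, and the extra_prompt interface are illustrative assumptions.

```python
import torch

def retrieve_memory_prompt(traj_embedding, memory_keys, memory_prompts):
    """traj_embedding: (d,); memory_keys: (n_units, d);
    memory_prompts: (n_units, prompt_len, d_model) learnable soft prompts."""
    sims = torch.nn.functional.cosine_similarity(
        traj_embedding.unsqueeze(0), memory_keys, dim=-1)   # (n_units,)
    best = sims.argmax()
    return memory_prompts[best]   # injected alongside the frozen VLA inputs

# Hypothetical usage with a frozen VLA policy:
# prompt = retrieve_memory_prompt(embed(obs_history), keys, prompts)
# action = vla_model(observation, language, extra_prompt=prompt)
```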
|
| |
| 15:00-16:30, Paper WeI2I.163 | Add to My Program |
| State Estimation for Compliant and Morphologically Adaptive Robots |
|
| Yuryev, Valentin | EPFL |
| Polzin, Max | EPFL |
| Hughes, Josie | EPFL |
Keywords: Field Robots, Sensor Fusion, Deep Learning Methods
Abstract: Locomotion robots with active or passive compliance can show robustness to uncertain scenarios, which can be promising for agricultural, research and environmental industries. However, state estimation for these robots is challenging due to the lack of rigid-body assumptions and kinematic changes from morphing. We propose a method to estimate typical rigid-body states alongside compliance-related states, such as soft robot shape in different morphologies and locomotion modes. Our neural network-based state estimator uses a history of states and a mechanism to directly influence unreliable sensors. We test our framework on the GOAT platform, a robot capable of passive compliance and active morphing for extreme outdoor terrain. The network is trained on motion capture data in a novel compliance-centric frame that accounts for morphing-related states. Our method predicts shape-related measurements within 4.2% of the robot’s size, velocities within 6.3% and 2.4% of the top linear and angular speeds, respectively, and orientation within 1.5°. We also demonstrate a 300% increase in travel range during a motor malfunction when using our estimator for closed-loop autonomous outdoor operation.
|
| |
| 15:00-16:30, Paper WeI2I.164 | Add to My Program |
| Motion-Specific Battery Health Assessment for Quadrotors Using High-Fidelity Battery Models |
|
| Kim, Joonhee | Pohang University of Science and Technology ( POSTECH ) |
| Park, Sanghyun | Pohang University of Science and Technology (POSTECH) |
| Kim, Donghyeong | Pohang University of Science and Technology ( POSTECH ) |
| Choi, Eunseon | Pohang University of Science and Technology ( POSTECH ) |
| Han, Soohee | Pohang University of Science and Technology ( POSTECH ) |
Keywords: Aerial Systems: Applications, Energy and Environment-Aware Automation, Optimization and Optimal Control
Abstract: Quadrotor endurance is ultimately limited by battery behavior, yet most energy-aware planning treats the battery as a simple energy reservoir and overlooks how flight motions induce dynamic current loads that accelerate battery degradation. This work presents an end-to-end framework for motion-aware battery health assessment in quadrotors. We first design a wide-range current sensing module to capture motion-specific current profiles during real flights, preserving transient features. In parallel, a high-fidelity battery model is calibrated using reference performance tests and a metaheuristic based on a degradation-coupled electrochemical model. By simulating measured flight loads in the calibrated model, we systematically resolve how different flight motions translate into degradation modes (loss of lithium inventory and loss of active material) as well as internal side reactions. The results demonstrate that even when two flight profiles consume the same average energy, their transient load structures can drive different degradation pathways, emphasizing the need for motion-aware battery management that balances efficiency with battery degradation.
|
| |
| 15:00-16:30, Paper WeI2I.165 | Add to My Program |
| Agility Meets Stability: Versatile Humanoid Control with Heterogeneous Data |
|
| Pan, Yixuan | The University of Hong Kong |
| Qiao, Ruoyi | East China Normal University |
| Chen, Li | The University of Hong Kong |
| Chitta, Kashyap | NVIDIA |
| Pan, Liang | The University of Hong Kong |
| Mai, Haoguang | The University of Hong Kong |
| Bu, Qingwen | The University of Hong Kong |
| Zheng, Cunyuan | Columbia University |
| Zhao, Hao | Tsinghua University |
| Luo, Ping | The University of Hong Kong |
| Li, Hongyang | The University of Hong Kong |
Keywords: Humanoid and Bipedal Locomotion, Human and Humanoid Motion Analysis and Synthesis, Reinforcement Learning
Abstract: Humanoid robots are envisioned to perform a wide range of tasks in human-centered environments, requiring controllers that combine agility with robust balance. Recent advances in locomotion and whole-body tracking have enabled impressive progress in either agile dynamic skills or stability-critical behaviors, but existing methods remain specialized, focusing on one capability while compromising the other. In this work, we introduce AMS (Agility Meets Stability), the first framework that unifies both dynamic motion tracking and extreme balance maintenance in a single policy. Our key insight is to leverage heterogeneous data sources: human motion capture datasets that provide rich, agile behaviors, and physically constrained synthetic balance motions that capture stability configurations. To reconcile the divergent optimization goals of agility and stability, we design a hybrid reward scheme that applies general tracking objectives across all data while injecting balance-specific priors only into synthetic motions. Further, an adaptive learning strategy with performance-driven sampling and motion-specific reward shaping enables efficient training across diverse motion distributions. We validate AMS extensively in simulation and on a real Unitree G1 humanoid. Experiments demonstrate that a single policy can execute agile skills such as dancing and running, while also performing zero-shot extreme balance motions like Ip Man’s Squat, highlighting AMS as a versatile control paradigm for future humanoid applications.
|
| |
| 15:00-16:30, Paper WeI2I.166 | Add to My Program |
| Bridging Large-Model Reasoning and Real-Time Control Via Agentic Fast-Slow Planning |
|
| Chen, Jiayi | Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong (Shenzhen) |
| Wang, Shuai | Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences |
| Zhu, Guangxu | Shenzhen Research Institute of Big Data |
| Xu, Chengzhong | University of Macau |
|
|
| |
| 15:00-16:30, Paper WeI2I.167 | Add to My Program |
| Should I Replan? Learning to Spot the Right Time in Robust MAPF Execution |
|
| Zahrádka, David | Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague |
| Woller, David | Czech Technical University in Prague |
| Mužíková, Denisa | Czech Technical University in Prague |
| Kulich, Miroslav | Czech Technical University in Prague |
| Přeučil, Libor | CTU in Prague |
Keywords: Multi-Robot Systems, Planning, Scheduling and Coordination, Path Planning for Multiple Mobile Robots or Agents
Abstract: During the execution of Multi-Agent Path Finding (MAPF) plans in real-life applications, the MAPF assumption that the fleet's movement is perfectly synchronized does not apply. Since some of the agents may become delayed due to internal or external factors, it is often necessary to use a robust execution method to avoid collisions caused by desynchronization. Robust execution methods, such as the Action Dependency Graph (ADG), synchronize the execution of risky actions, but often at the expense of increased plan execution cost, because they may require some agents to wait for the delayed agents. In such cases, the execution's cost can be reduced while still preserving safety by finding a new plan, either by rescheduling (reordering the agents at crossroads) or by the more general replanning, capable of finding new paths. However, these operations may be costly, and the new plan may not even lead to lower execution cost than the original plan: for example, the two plans may be exactly the same, as some losses may not be recoverable at all. Therefore, we use a fully connected feed-forward neural network to estimate the benefit achievable by a single replanning step, given the immediate state of the execution in scenarios with delayed agents. The input to the neural network is a set of newly designed ADG-based features describing the execution's state and the impact of potential delays, and the output is the estimated benefit achievable by replanning. We train and test the network on a new labeled dataset containing 12,000 experiments and show that our proposed method is capable of significantly reducing the impact of recoverable delays.
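The estimator amounts to a small regression network over the ADG-based features; below is a sketch under assumed feature and layer sizes (the paper's architecture details may differ), with a hypothetical decision rule for when to trigger replanning.

```python
import torch
import torch.nn as nn

class ReplanBenefitNet(nn.Module):
    """Fully connected feed-forward regressor of replanning benefit."""
    def __init__(self, n_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),            # estimated cost saved by replanning
        )

    def forward(self, x):
        return self.net(x)

# Hypothetical usage with a trained model and ADG feature extractor:
# features = adg_features(execution_state, delays)     # (n_features,) tensor
# if model(features).item() > replanning_overhead:
#     trigger_replanning()
```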
|
| |
| 15:00-16:30, Paper WeI2I.168 | Add to My Program |
| Have We Mastered Scale in Deep Monocular Visual SLAM? the ScaleMaster Dataset and Benchmark |
|
| Ju, Hyoseok | DGIST |
| Suh, Bokeon | DGIST |
| Kim, Giseop | DGIST (Daegu Gyeongbuk Institute of Science and Technology) |
Keywords: Data Sets for SLAM, SLAM, Mapping
Abstract: Recent advances in deep monocular visual Simultaneous Localization and Mapping (SLAM) have achieved impressive accuracy and dense reconstruction capabilities, yet their robustness to scale inconsistency in large-scale indoor environments remains largely unexplored. Existing benchmarks are limited to room-scale or structurally simple settings, leaving critical issues of intra-session scale drift and inter-session scale ambiguity insufficiently addressed. To fill this gap, we introduce the ScaleMaster Dataset, the first benchmark explicitly designed to evaluate scale consistency under challenging scenarios such as multi-floor structures, long trajectories, repetitive views, and low-texture regions. We systematically analyze the vulnerability of state-of-the-art deep monocular visual SLAM systems to scale inconsistency, providing both quantitative and qualitative evaluations. Crucially, our analysis extends beyond traditional trajectory metrics to include a direct map-to-map quality assessment using metrics like Chamfer distance against high-fidelity 3D ground truth. Our results reveal that while recent deep monocular visual SLAM systems demonstrate strong performance on existing benchmarks, they suffer from severe scale-related failures in realistic, large-scale indoor environments. By releasing the ScaleMaster dataset and baseline results, we aim to establish a foundation for future research toward developing scale-consistent and reliable visual SLAM systems.
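For reference, the symmetric Chamfer distance used in this kind of map-to-map evaluation is typically computed as below; the benchmark's exact variant (squared vs. unsquared distances, averaging convention) may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between point clouds P (n,3) and Q (m,3)."""
    d_pq, _ = cKDTree(Q).query(P)   # nearest ground-truth point per map point
    d_qp, _ = cKDTree(P).query(Q)   # nearest map point per ground-truth point
    return d_pq.mean() + d_qp.mean()
```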
|
| |
| 15:00-16:30, Paper WeI2I.169 | Add to My Program |
| Scaling Multi-Agent Reinforcement Learning for Underwater Acoustic Tracking Via Autonomous Vehicles |
|
| Gallici, Matteo | Politecnic University of Catalunia |
| Masmitja, Ivan | Institut De Ciencies Del Mar - CSIC |
| Martin, Mario | Universidad Politecnica De Catalunya |
Keywords: Marine Robotics, Path Planning for Multiple Mobile Robots or Agents, Reinforcement Learning
Abstract: Autonomous vehicles (AVs) offer a cost-effective solution for scientific missions such as underwater tracking. Reinforcement learning (RL) has emerged as a powerful method for controlling AVs, but scaling to fleets (essential for multi-target tracking or rapidly moving targets) is challenging. Multi-Agent RL (MARL) is notoriously sample-inefficient, and while high-fidelity simulators like Gazebo’s LRAUV provide up to 100× faster-than-real-time single-robot simulations, they offer little speedup in multi-vehicle scenarios, making MARL training impractical. Yet, high-fidelity simulation is crucial to test complex policies and close the sim-to-real gap. To address these limitations, we develop a GPU-accelerated environment that achieves up to 30,000× speedup over Gazebo while preserving its dynamics. This enables fast, end-to-end GPU training and seamless transfer to Gazebo for evaluation. We also introduce a Transformer-based architecture (TransfMAPPO) that learns policies invariant to fleet size and number of targets, enabling curriculum learning to train larger fleets on increasingly complex scenarios. After large-scale GPU training, we perform extensive evaluations in Gazebo, showing our method maintains tracking errors below 5m even with multiple fast-moving targets.
|
| |
| 15:00-16:30, Paper WeI2I.170 | Add to My Program |
| NDO-Based Dual Quaternion Control of a Drone with a Cable-Suspended Load |
|
| Yuan, Yuxia | Technical University of Munich |
| Kang, Junjie | York University |
| Shan, Jinjun | York University |
| Ryll, Markus | Technical University Munich |
Keywords: Aerial Systems: Applications, Aerial Systems: Mechanics and Control
Abstract: This paper proposes a novel nonlinear disturbance observer (NDO) based dual quaternion dynamics modeling and control framework for a drone with a cable-suspended load. Leveraging dual quaternions, a compact and singularity-free mathematical representation, we derive a unified dynamic model that captures the coupled translational and rotational dynamics of both the drone and the slung load. NDOs are designed to estimate and compensate for uncertainties and external disturbances affecting the drone and the load. Building on this framework, we develop a robust control strategy that ensures precise trajectory tracking of the slung load while maintaining stable drone attitude control. The effectiveness of the proposed approach is validated through comprehensive simulations and real-world experiments on a cargo drone platform. The results highlight the robustness and reliability of the system in practical scenarios, demonstrating its potential application in cargo transportation.
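As background for the representation this entry builds on: a unit dual quaternion packs the rotation and translation of a rigid body into a single algebraic object. A minimal sketch of the standard definition (background only, not the paper's slung-load model):

```latex
\hat{q} = q_r + \varepsilon\, q_d, \qquad \varepsilon^2 = 0, \qquad q_d = \tfrac{1}{2}\, t \otimes q_r
```

Here q_r is the unit rotation quaternion, t is the translation written as a pure quaternion, and ⊗ denotes quaternion multiplication; the constraints ||q_r|| = 1 and q_r · q_d = 0 keep the representation on the unit dual-quaternion manifold, which is what makes it compact and singularity-free.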
|
| |
| 15:00-16:30, Paper WeI2I.171 | Add to My Program |
| Memory Efficient Point Cloud Registration Accelerator on FPGA |
|
| Qiong, Chang | Institute of Science Tokyo |
| Cai, Dongqi | Nanjing University |
| Dong, Ran | Chukyo University |
| Zhong, Junpei | The Technological and Higher Education Institute of Hong Kong |
Keywords: Computer Architecture for Robotic and Automation, Hardware-Software Integration in Robotics, Embedded Systems for Robotic and Automation
Abstract: Point cloud registration, which aligns multiple datasets into a unified coordinate system, is critical for mobile applications such as 3D SLAM and autonomous driving. Among existing methods, Iterative Closest Point (ICP) remains a widely used method for rigid registration due to its robustness and simplicity. However, its performance on mobile platforms is hindered by iterative computations and limited memory resources. This paper proposes a high-performance ICP registration framework implemented on FPGA. Building upon an efficient GPU-based method named VAN-ICP, our FPGA-based ICP accelerator achieves greater memory efficiency and faster processing speed, making it ideal for resource-constrained mobile platforms. Experimental results demonstrate a speedup of over 1.5× compared to mobile GPU-based implementations and a 99% reduction in memory usage, validating the effectiveness of the proposed approach for real-world point cloud registration on edge platforms. Beyond these improvements, the proposed framework also facilitates advancements in robotic vision technologies by enabling more accurate and efficient perception under stringent hardware constraints.
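As context for what the accelerator speeds up: each ICP iteration alternates nearest-neighbor matching with a small SVD-based alignment solve. Below is a minimal NumPy sketch of vanilla point-to-point ICP; the paper's VAN-ICP variant and its FPGA dataflow are not reproduced here:

```python
import numpy as np

def icp_step(src, dst):
    """One point-to-point ICP iteration: match each source point to its
    nearest destination point, then solve the rigid alignment (Kabsch)."""
    d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
    matched = dst[d2.argmin(axis=1)]          # nearest neighbors (brute force)
    mu_s, mu_d = src.mean(0), matched.mean(0)
    H = (src - mu_s).T @ (matched - mu_d)     # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, mu_d - R @ mu_s

def icp(src, dst, iters=20):
    R_tot, t_tot = np.eye(3), np.zeros(3)
    for _ in range(iters):
        R, t = icp_step(src, dst)
        src = src @ R.T + t                   # apply the increment
        R_tot, t_tot = R @ R_tot, R @ t_tot + t
    return R_tot, t_tot
```

The nearest-neighbor search and the repeated passes over the cloud are exactly the memory-bound steps a streaming FPGA pipeline can restructure, which is where the reported 99% memory reduction matters.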
|
| |
| 15:00-16:30, Paper WeI2I.172 | Add to My Program |
| Failure Identification in Imitation Learning Via Statistical and Semantic Filtering |
|
| Rolland, Quentin | Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France |
| Mayran de Chamisso, Fabrice | Université Paris-Saclay, CEA, LIST, F-91120 Palaiseau |
| Mouret, Jean-Baptiste | Inria |
|
|
| |
| 15:00-16:30, Paper WeI2I.173 | Add to My Program |
| HetroD: A High-Fidelity Drone Dataset and Benchmark for Autonomous Driving in Heterogeneous Traffic |
|
| Chen, Yu-Hsiang | National Yang Ming Chiao Tung University |
| Chang, Wei-Jer | University of California, Berkeley |
| Kotulla, Christian | Fka GmbH |
| Keutgens, Thomas | Fka GmbH |
| Runde, Steffen | Fka GmbH |
| Moers, Tobias | Fka GmbH |
| Klas, Christoph | Fka GmbH |
| Zhan, Wei | University of California, Berkeley |
| Tomizuka, Masayoshi | University of California |
| Chen, Yi-Ting | National Yang Ming Chiao Tung University |
Keywords: Intelligent Transportation Systems
Abstract: We present HetroD, a dataset and benchmark for developing autonomous driving systems in heterogeneous environments. HetroD targets the critical challenge of navigating real-world heterogeneous traffic dominated by vulnerable road users (VRUs), including pedestrians, cyclists, motorcyclists, and vehicles. These mixed agent types exhibit complex behaviors such as hook turns, lane splitting, and informal right-of-way negotiation. Such behaviors pose significant challenges for autonomous vehicles but remain underrepresented in existing datasets focused on structured, lane-disciplined traffic. To bridge the gap, we collect a large-scale drone-based dataset to provide holistic observations of traffic scenes with centimeter-accurate annotations, HD maps, and traffic signal states. We further develop a modular toolkit for extracting per-agent scenarios to support downstream task development. In total, the dataset comprises over 65.4k high-fidelity agent trajectories, 70% of which are from VRUs. HetroD supports modeling of VRU behaviors in dense, heterogeneous traffic and provides standardized benchmarks for forecasting, planning, and simulation tasks. Evaluation results reveal that state-of-the-art prediction and planning models struggle with the challenges presented by our dataset: they fail to predict lateral VRU movements, cannot handle unstructured maneuvers, and exhibit limited performance in dense and multi-agent scenarios, highlighting the need for more robust approaches to heterogeneous traffic. See our project page for more examples: https://hetroddata.github.io/HetroD/
|
| |
| 15:00-16:30, Paper WeI2I.174 | Add to My Program |
| PinPoint3D: Fine-Grained 3D Part Segmentation from a Few Clicks |
|
| Zhang, Bojun | Southern University of Science and Technology |
| Ye, Hangjian | Southern University of Science and Technology |
| Zheng, Hao | Southern University of Science and Technology, Spatialtemporal AI |
| Huang, Jianzheng | Southern University of Science and Technology |
| Lin, Zhengyu | Southern University of Science and Technology |
| Guo, Zhenhong | Southern University of Science and Technology |
| Zheng, Feng | SUSTech |
|
|
| |
| 15:00-16:30, Paper WeI2I.175 | Add to My Program |
| Learning-Based Fusion for Robust Multi-Spectral Visual Servoing |
|
| Fiasche, Enrico | Inria |
| Savner, Siddharth Singh | Inria |
| Malis, Ezio | Inria |
| Martinet, Philippe | Inria |
Keywords: Visual Servoing, Deep Learning for Visual Perception, Visual Tracking
Abstract: Multispectral sensors, which measure multiple wavelength bands beyond the standard red, green, and blue channels, capture richer information than conventional RGB cameras. Such enriched data is especially valuable in visual servoing, where robot control critically depends on image content. However, leveraging multiple spectral bands (typically around a dozen) directly within real-time visual servoing constitutes a significant challenge. The only prior work tackled this problem using a Pixel Selection strategy based on image gradients. This paper introduces a learning-based framework to enhance Multi-Spectral Visual Servoing (MSVS) by fusing data from multispectral cameras into a single, robust representation for control. An autoencoder is employed to compress multispectral inputs into a noise-attenuated 2D image, which is then used within a standard rule-based Direct Visual Servoing (DVS) scheme. Comparison experiments both with simulated data and with a real robot in complex and unstructured environments show that the proposed learning-based fusion maintains stable convergence and improves positioning accuracy under noisy conditions while preserving computational efficiency.
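One plausible shape for the fusion stage described here is a small convolutional autoencoder whose bottleneck is the single fused image handed to the DVS controller; the layer sizes and band count below are our assumptions, not the paper's architecture:

```python
import torch.nn as nn

class SpectralFusionAE(nn.Module):
    """Compress a multispectral stack (about a dozen bands) into one
    noise-attenuated channel; the decoder exists only for training."""
    def __init__(self, bands=12):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(bands, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 3, padding=1))
        self.dec = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, bands, 3, padding=1))

    def forward(self, x):                 # x: (B, bands, H, W)
        z = self.enc(x)                   # fused single-channel image for DVS
        return self.dec(z), z
```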
|
| |
| 15:00-16:30, Paper WeI2I.176 | Add to My Program |
| T(R, O) Grasp: Efficient Graph Diffusion of Robot-Object Spatial Transformation for Cross-Embodiment Dexterous Grasping |
|
| Fei, Xin | National University of Singapore |
| Xu, Zhixuan | National University of Singapore |
| Fang, Huaicong | Zhejiang University |
| Zhang, Tianrui | National University of Singapore |
| Shao, Lin | National University of Singapore |
Keywords: Grasping, Dexterous Manipulation, Deep Learning in Grasping and Manipulation
Abstract: Dexterous grasping remains a central challenge in robotics due to the complexity of its high-dimensional state and action space. We introduce T(R,O) Grasp, a diffusion-based framework that efficiently generates accurate and diverse grasps across multiple robotic hands. At its core is the T(R,O) Graph, a unified representation that models spatial transformations between robotic hands and objects while encoding their geometric properties. A graph diffusion model, coupled with an efficient inverse kinematics solver, supports both unconditioned and conditioned grasp synthesis. Extensive experiments on a diverse set of dexterous hands show that T(R,O) Grasp achieves an average success rate of 94.83%, an inference speed of 0.21 s, and a throughput of 41 grasps per second on an NVIDIA A100 40GB GPU, substantially outperforming existing baselines. In addition, our approach is robust and generalizable across embodiments while significantly reducing memory consumption. More importantly, the high inference speed enables closed-loop dexterous manipulation, underscoring the potential of T(R,O) Grasp to scale into a foundation model for dexterous grasping.
|
| |
| 15:00-16:30, Paper WeI2I.177 | Add to My Program |
| Connectivity Maintenance for High-Speed Communication with Adversarial Jamming |
|
| Kaminsky, Thomas | Harvard University |
| Izhar, Hammad | Harvard University |
| Garces, Daniel | Harvard University |
| Brady, Collin | MIT Lincoln Laboratory |
| Rottner, Joe | MIT Lincoln Laboratory |
| Gil, Stephanie | Harvard University |
Keywords: Networked Robots, Robust/Adaptive Control, Sensor Networks
Abstract: We consider the problem of adaptively controlling a fleet of robots to maintain a communication network in an adversarial environment. In particular, a network team of robots is tasked with maintaining a directed communication channel at some data rate from an independent task robot to a fixed base station, accommodating the task robot's motion and adversarial intervention in the form of an omnidirectional jammer and network team robot removals. We utilize a physically-motivated model for directed signal strength between robots in the presence of a jammer, introducing asymmetry into communication which challenges connectivity maintenance approaches. Our main contribution in this paper is the introduction of a strategy for translating this directed model into an undirected graph for which enforcing connectedness is sufficient for maintaining high-rate communication. We demonstrate the efficacy of our approach in simulation using a CBF-based controller, showing that our controller maintains a high-rate connection throughout diverse trajectories, even when more conservative controllers fail.
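The undirected-graph reduction makes safety checkable with a classical spectral test: a graph is connected exactly when the second-smallest Laplacian eigenvalue (the Fiedler value) is positive, which is the quantity CBF-style connectivity controllers typically keep above a margin. A small NumPy illustration (the 4-robot topology is made up):

```python
import numpy as np

def algebraic_connectivity(adj):
    """Fiedler value: second-smallest eigenvalue of the graph Laplacian.
    Positive exactly when the undirected graph is connected."""
    L = np.diag(adj.sum(axis=1)) - adj
    return np.linalg.eigvalsh(L)[1]     # eigvalsh returns ascending order

# Hypothetical 4-robot line topology: connected, so lambda_2 > 0.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(algebraic_connectivity(A))        # ~0.586
```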
|
| |
| 15:00-16:30, Paper WeI2I.178 | Add to My Program |
| IntentionVLA: Generalizable and Efficient Embodied Intention Reasoning for Human–Robot Interaction |
|
| Chen, Yandu | Harbin Institute of Technology (Shenzhen) |
| Gu, Kefan | Nanjing University |
| Wen, Yuqing | University of Science and Technology of China |
| Zhao, Yucheng | Dexmal |
| Wang, Tiancai | MEGVII Technology |
| Nie, Liqiang | Harbin Institute of Technology (Shenzhen) |
Keywords: Deep Learning in Grasping and Manipulation, Deep Learning Methods, AI-Based Methods
Abstract: Vision-Language-Action (VLA) models leverage pretrained vision-language models (VLMs) to couple perception with robotic control, offering a promising path toward general purpose embodied intelligence. However, current SOTA VLAs are primarily pretrained on multimodal tasks with limited relevance to embodied scenarios, and then finetuned to map explicit instructions to actions. Consequently, due to the lack of reasoning-intensive pretraining and reasoning-guided manipulation, these models are unable to perform implicit human intention reasoning required for complex, real-world interactions. To overcome these limitations, we propose IntentionVLA, a VLA framework with a curriculum training paradigm and an efficient inference mechanism. Our proposed method first leverages carefully designed reasoning data that combine intention inference, spatial grounding, and compact embodied reasoning, endowing the model with both reasoning and perception capabilities. In the following finetuning stage, IntentionVLA employs the compact reasoning outputs as contextual guidance for action generation, enabling fast inference under indirect instructions. Experimental results show that IntentionVLA substantially outperforms pi_0, achieving 18% higher success rates with direct instructions and 28% higher than ECoT under intention instructions. On out-of-distribution intention tasks, IntentionVLA achieves over twice the success rate of all baselines, and further enables zero-shot human-robot interaction with 40% success rate. These results highlight IntentionVLA as a promising paradigm for next-generation human-robot interaction (HRI) systems.
|
| |
| 15:00-16:30, Paper WeI2I.179 | Add to My Program |
| SUBTA: A Framework for Supported User-Guided Bimanual Teleoperation in Structured Assembly |
|
| Liu, Xiao | Arizona State University |
| Baskaran, Prakash | Honda Research Institute |
| Li, Songpo | Honda Research Institute |
| Manschitz, Simon | Honda Research Institute Europe |
| Ma, Wei | Honda Research Institute Europe |
| Ruiken, Dirk | Honda Research Institute Europe |
| Iba, Soshi | Honda Research Institute USA |
Keywords: Telerobotics and Teleoperation, Virtual Reality and Interfaces, Intention Recognition
Abstract: In human-robot collaboration, shared autonomy enhances human performance through precise, intuitive support. Effective robotic assistance requires accurately inferring human intentions and understanding task structures to determine optimal support timing and methods. In this paper, we present SUBTA, a supported teleoperation system for bimanual assembly that couples learned intention estimation, scene-graph task planning, and context-dependent motion assists. We validate our approach through a user study (N=12) comparing standard teleoperation, motion-support only, and SUBTA. Linear mixed-effects analysis revealed that SUBTA significantly outperformed standard teleoperation in position accuracy (p<0.001, d=1.18) and orientation accuracy (p<0.001, d=1.75), while reducing mental demand (p=0.002, d=1.34). Post-experiment ratings indicate clearer, more trustworthy visual feedback and predictable interventions in SUBTA. The results demonstrate that SUBTA greatly improves both effectiveness and user experience in teleoperation.
|
| |
| 15:00-16:30, Paper WeI2I.180 | Add to My Program |
| Design and Control of a Perching Drone Inspired by the Prey-Capturing Mechanism of Venus Flytrap |
|
| Li, Ye | Harbin Institute of Technology |
| Liu, Daming | Harbin Institute of Technology |
| Zhu, Yanhe | Harbin Institute of Technology |
| Zhang, Junming | Harbin Institue of Techonlogy |
| Luo, Yongsheng | Harbin Institute of Technology |
| Wang, Ziqi | National University of Singapore |
| Liu, Chenyu | Harbin Institute of Technology |
| Zhao, Jie | Harbin Institute of Technology |
Keywords: Aerial Systems: Mechanics and Control, Biologically-Inspired Robots, Motion Control
Abstract: The endurance and energy efficiency of drones remain critical challenges in their design and operation. To extend mission duration, numerous studies have explored perching mechanisms that enable drones to conserve energy by temporarily suspending flight. This paper presents a new perching drone that utilizes an active flexible perching mechanism inspired by the rapid predation mechanism of the Venus flytrap, achieving perching in less than 100 ms. The proposed system is designed for high-speed adaptability to the perching targets. The overall drone design is outlined, followed by the development and validation of the biomimetic perching structure. To enhance system stability, a cascade extended high-gain observer (EHGO) based control method is developed, which can estimate and compensate for external disturbances in real time. The experimental results demonstrate the adaptability of the perching structure and the superiority of the cascaded EHGO in resisting wind and perching disturbances.
|
| |
| 15:00-16:30, Paper WeI2I.181 | Add to My Program |
| Clutt3R-Seg: Sparse-View 3D Instance Segmentation for Language-Grounded Grasping in Cluttered Scenes |
|
| Noh, Jeongho | Seoul National University |
| Rhee, Tai Hyoung | Seoul National University |
| Lee, Eunho | Seoul National University |
| Kim, Jeongyun | SNU |
| Lee, Sunwoo | Hyundai Motor Group |
| Kim, Ayoung | Seoul National University |
Keywords: Perception for Grasping and Manipulation, Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception
Abstract: Reliable 3D instance segmentation is fundamental to language-grounded robotic manipulation. Its critical application lies in cluttered environments, where occlusions, limited viewpoints, and noisy masks degrade perception. To address these challenges, we present Clutt3R-Seg, a zero-shot pipeline for robust 3D instance segmentation for language-grounded grasping in cluttered scenes. Our key idea is to introduce a hierarchical instance tree of semantic cues. Unlike prior approaches that attempt to refine noisy masks, our method leverages them as informative cues: through cross-view grouping and conditional substitution, the tree suppresses over- and under-segmentation, yielding view-consistent masks and robust 3D instances. Each instance is enriched with open-vocabulary semantic embeddings, enabling accurate target selection from natural language instructions. To handle scene changes during multi-stage tasks, we further introduce a consistency-aware update that preserves instance correspondences from only a single post-interaction image, allowing efficient adaptation without rescanning. Clutt3R-Seg is evaluated on both synthetic and real-world datasets, and validated on a real robot. Across all settings, it consistently outperforms state-of-the-art baselines in cluttered and sparse-view scenarios. Even on the most challenging heavy-clutter sequences, Clutt3R-Seg achieves an AP@25 of 61.66, over 2.2× higher than baselines, and with only four input views it surpasses MaskClustering with eight views by more than 2×. The code is available at: https://github.com/jeonghonoh/clutt3r-seg.
|
| |
| 15:00-16:30, Paper WeI2I.182 | Add to My Program |
| HAND Me the Data: Fast Robot Adaptation Via Hand Path Retrieval |
|
| Hong, Matthew M | University of Southern California |
| Liang, Anthony | University of Southern California |
| Kim, Kevin | University of Southern California |
| Belagavi Rajaprakash, Harshitha | University of Southern California |
| Thomason, Jesse | University of Southern California |
| Bıyık, Erdem | University of Southern California |
| Zhang, Jesse | University of Washington |
Keywords: Learning from Demonstration, Imitation Learning, Transfer Learning
Abstract: We present HAND, a simple and time-efficient method for teaching robots new manipulation tasks through human hand demonstrations. Instead of relying on task-specific robot demonstrations collected via teleoperation, HAND uses easy-to-provide hand demonstrations to retrieve relevant behaviors from task-agnostic robot play data. Using a visual tracking pipeline, HAND extracts the motion of the human hand from the hand demonstration and retrieves robot sub-trajectories in two stages: first filtering by visual similarity, then retrieving trajectories with similar behaviors to the hand. Fine-tuning a policy on the retrieved data enables real-time learning of tasks in under four minutes, without requiring calibrated cameras or detailed hand pose estimation. Experiments also show that HAND outperforms retrieval baselines by over 2x in average task success rates on real robots. Videos can be found at our project website: https://liralab.usc.edu/handretrieval.
|
| |
| 15:00-16:30, Paper WeI2I.183 | Add to My Program |
| Platform-Agnostic Reinforcement Learning Framework for Safe Exploration of Cluttered Environments with Graph Attention |
|
| Calzolari, Gabriele | Luleå Tekniska Universitet |
| Sumathy, Vidya | Luleå University of Technology |
| Kanellakis, Christoforos | LTU |
| Nikolakopoulos, George | Luleå University of Technology |
Keywords: Reinforcement Learning, Deep Learning Methods, AI-Enabled Robotics
Abstract: Autonomous exploration of obstacle-rich spaces requires strategies that ensure efficiency while guaranteeing safety against collisions with obstacles. This paper investigates a novel platform-agnostic reinforcement learning framework that integrates a graph neural network-based policy for next-waypoint selection, with a safety filter ensuring safe mobility. Specifically, the neural network is trained using reinforcement learning through the Proximal Policy Optimization (PPO) algorithm to maximize exploration efficiency while minimizing safety filter interventions. Accordingly, when the policy proposes an infeasible action, the safety filter overrides it with the closest feasible alternative, ensuring consistent system behavior. In addition, this paper introduces a reward function shaped by a potential field that accounts for both the agent’s proximity to unexplored regions and the expected information gain from reaching them. The proposed framework combines the adaptability of reinforcement learning-based exploration policies with the reliability provided by explicit safety mechanisms. This feature plays a key role in enabling the deployment of learning-based policies on robotic platforms operating in real-world environments. Extensive evaluations in both simulations and experiments performed in a lab environment demonstrate that the approach achieves efficient and safe exploration in cluttered spaces.
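The override logic described above follows the usual minimal-intervention pattern: keep the policy's waypoint when it passes the safety check, otherwise substitute the nearest feasible one. A generic sketch (the feasibility predicate stands in for whatever collision check the filter uses):

```python
import numpy as np

def filter_action(proposed, candidates, is_safe):
    """Minimal-intervention safety filter: return the policy's waypoint
    if it is safe, else the closest safe candidate waypoint."""
    if is_safe(proposed):
        return proposed
    safe = [c for c in candidates if is_safe(c)]   # assumes at least one exists
    dists = [np.linalg.norm(np.asarray(c) - np.asarray(proposed)) for c in safe]
    return safe[int(np.argmin(dists))]
```

Counting how often the filter overrides the policy also yields the intervention signal that the reward described above penalizes.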
|
| |
| 15:00-16:30, Paper WeI2I.184 | Add to My Program |
| AdaptPNP: Integrating Prehensile and Non-Prehensile Skills for Adaptive Robotic Manipulation |
|
| Zhu, Jinxuan | Harbin Institute of Technology, Shenzhen |
| Tie, Chenrui | National University of Singapore |
| Cao, Xinyi | East China Normal University |
| Wang, Yuran | Peking University |
| Guo, Jingxiang | National University of Singapore |
| Chen, Zixuan | Nanjing University |
| Chen, Haonan | National University of Singapore |
| Chen, Junting | National University of Singapore |
| Xiao, Yangyu | Harbin Institute of Technology, Shenzhen |
| Wu, Ruihai | Peking University |
| Shao, Lin | National University of Singapore |
Keywords: Manipulation Planning, Task and Motion Planning
Abstract: Non-prehensile (NP) manipulation, in which robots alter object states without forming stable grasps (for example, pushing, poking, or sliding), significantly broadens robotic manipulation capabilities when grasping is infeasible or insufficient. However, enabling a unified framework that generalizes across different tasks, objects, and environments while seamlessly integrating non-prehensile and prehensile (P) actions remains challenging: robots must determine when to invoke NP skills, select the appropriate primitive for each context, and compose P and NP strategies into robust, multi-step plans. We introduce AdaptPNP, a vision-language model (VLM)-empowered task and motion planning framework that systematically selects and combines P and NP skills to accomplish diverse manipulation objectives. Our approach leverages a VLM to interpret visual scene observations and textual task descriptions, generating a high-level plan skeleton that prescribes the sequence and coordination of P and NP actions. A digital-twin based object-centric intermediate layer predicts desired object poses, enabling proactive mental rehearsal of manipulation sequences. Finally, a control module synthesizes low-level robot commands, with continuous execution feedback enabling online task plan refinement and adaptive replanning through the VLM. We evaluate AdaptPNP across representative P&NP hybrid manipulation tasks in both simulation and real-world environments. These results underscore the potential of hybrid P&NP manipulation as a crucial step toward general-purpose, human-level robotic manipulation capabilities. More detailed information can be found at https://adaptpnp.github.io/
|
| |
| 15:00-16:30, Paper WeI2I.185 | Add to My Program |
| TWISTED-RL: Hierarchical Skilled Agents for Knot-Tying without Human Demonstrations |
|
| Freund, Guy | Reichman University |
| Jurgenson, Tom | Technion – Israel Institute of Technology |
| Sudry, Matan | Technion – Israel Institute of Technology |
| Karpas, Erez | Technion – Israel Institute of Technology |
Keywords: Reinforcement Learning, Integrated Planning and Learning, Machine Learning for Robot Control
Abstract: Robotic knot-tying represents a fundamental challenge in robotics due to the complex interactions between deformable objects and strict topological constraints. We present TWISTED-RL, a framework that improves upon the previous state-of-the-art in demonstration-free knot-tying (TWISTED), which smartly decomposed a single knot-tying problem into manageable subproblems, each addressed by a specialized agent. Our approach replaces TWISTED’s single-step inverse model that was learned via supervised learning with a multi-step Reinforcement Learning policy conditioned on abstract topological actions rather than goal states. This change allows more delicate topological state transitions while avoiding costly and ineffective data collection protocols, thus enabling better generalization across diverse knot configurations. Experimental results demonstrate that TWISTED-RL manages to solve previously unattainable knots of higher complexity, including commonly used knots such as the Figure-8 and the Overhand. Furthermore, the increase in success rates and drop in planning time establish TWISTED-RL as the new state-of-the-art in robotic knot-tying without human demonstrations.
|
| |
| 15:00-16:30, Paper WeI2I.186 | Add to My Program |
| Confidence-Gated Topology Reasoning with Fiducial Marker Priors for Occlusion-Robust Lane Graph Prediction |
|
| Wu, Zirui | The Pennsylvania State University |
| Hu, Xianbiao | Pennsylvania State University |
Keywords: Computer Vision for Transportation, Semantic Scene Understanding, Deep Learning for Visual Perception
Abstract: Accurate lane topology perception is crucial for safe autonomous driving, yet vision-based models such as BEVFormer and TopoNet degrade under heavy occlusion and other visibility degradations (e.g., ambiguous road markings). Existing approaches augment vision with global priors like Standard Definition (SD) maps, but these rely on precise GNSS localization and global alignment, which can be unreliable in urban canyons, tunnels, or GNSS-denied areas. Fiducial markers provide a complementary alternative: compact infrastructure-embedded tags that encode structurally complete local lane graphs, mitigating blind spots in topology reasoning where visual pipelines fail. However, marker detections are not always reliable—pose estimates may degrade with distance, and detections may be intermittent under occlusion. To address these challenges, we propose a Confidence-Gated Marker Fusion framework that integrates marker-derived priors into BEV features through a dynamic gating mechanism, regulating the contribution of noisy long-range inputs. In addition, we introduce a temporal marker memory that caches and decays reliable priors across frames, propagating topology guidance during short-term detection gaps. Evaluated on a marker-augmented OpenLane-V2 benchmark, our method outperforms both vision-only and SD map-augmented baselines, achieving notable gains (27%) in lane graph completeness and occlusion robustness. These results demonstrate that fiducial marker priors, when fused with vision-based reasoning, provide a practical and reliable pathway toward resilient lane topology prediction in GNSS-denied urban scenarios.
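Confidence gating of this kind is commonly realized as a learned per-cell convex blend of the two feature maps. A generic sketch under that assumption (channel counts are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

class ConfidenceGate(nn.Module):
    """Blend marker-derived prior features into BEV features with a
    learned gate in [0, 1] per BEV cell."""
    def __init__(self, c):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * c, c, 1), nn.Sigmoid())

    def forward(self, bev, prior):        # both (B, c, H, W)
        g = self.gate(torch.cat([bev, prior], dim=1))
        return g * prior + (1.0 - g) * bev   # g -> 0 mutes noisy priors
```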
|
| |
| 15:00-16:30, Paper WeI2I.187 | Add to My Program |
| Force Estimation and Position Control of a Hydraulic Folded Pouch Actuator for Soft Robotics |
|
| Li, Jie | Imperial College London |
| Yang, Jianlin | Imperial College London |
| Qiu, Jinling | Imperial College London |
| Lou, Hanqi | Imperial College London |
| Zhou, Zhangxi | Imperial College London |
| Mylonas, George | Imperial College London |
Keywords: Soft Sensors and Actuators, Machine Learning for Robot Control, Force and Tactile Sensing
Abstract: This paper investigates position control and force estimation for a hydraulic folded pouch actuator. First, experimental platforms are designed to characterize the actuator and the results show two key properties: (i) angular hysteresis when the motion direction reverses, and (ii) strong nonlinearity between liquid volume, pressure, and angle. For position control, we explore three strategies: fully open-loop control, observer-based control, and sensor-based closed-loop control with angle feedback. The closed-loop controller employs dynamically tuned PID gains and an MLP feedforward predictor. Under a sinusoidal reference, the closed-loop controller achieves mean absolute error (MAE) = 4.82° and root mean square error (RMSE) = 5.48°. For force estimation, we train both MLP and LSTM models using liquid volume, angle, pressure, and angular rate as features to predict the external force on the actuator. Compared to the MLP, the LSTM incorporates temporal dynamics, which allows it to capture force variations more effectively and generate smoother prediction results. Under dynamic loads, both models capture the applied force, with the LSTM yielding the lower errors (MAE = 0.96 mN·m, RMSE = 1.23 mN·m).
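The closed-loop controller's structure, PID feedback plus an MLP feedforward term, can be sketched generically as below; the gains and the predictor's input are placeholders rather than the paper's tuned values:

```python
class PIDWithFeedforward:
    """PID feedback combined with a learned feedforward prediction of
    the actuation needed for the reference angle."""
    def __init__(self, kp, ki, kd, feedforward):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.ff = feedforward            # e.g. an MLP: ref angle -> command
        self.i_err, self.prev_err = 0.0, 0.0

    def step(self, ref, meas, dt):
        err = ref - meas
        self.i_err += err * dt
        d_err = (err - self.prev_err) / dt
        self.prev_err = err
        return (self.ff(ref) + self.kp * err
                + self.ki * self.i_err + self.kd * d_err)
```

The intent of such a split is that the feedforward term handles the known hysteresis and volume-pressure-angle nonlinearity, leaving the feedback loop to correct residual error.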
|
| |
| 15:00-16:30, Paper WeI2I.188 | Add to My Program |
| TacUMI: A Multi-Modal Universal Manipulation Interface for Contact-Rich Tasks |
|
| Cheng, Tailai | Technische Universität München |
| Chen, Kejia | Technical University of Munich |
| Chen, Lingyun | Technical University of Munich |
| Zhang, Liding | Technical University of Munich |
| Zhang, Yue | Technical University of Munich |
| Ling, Yao | Technical University of Munich (TUM) |
| Hamad, Mahdi | Technische Universität München |
| Bing, Zhenshan | Technical University of Munich |
| Wu, Fan | Shanghai University |
| Sharma, Karan | Agile Robots |
| Knoll, Alois | Tech. Univ. Muenchen TUM |
Keywords: Force and Tactile Sensing, Bimanual Manipulation, Methods and Tools for Robot System Design
Abstract: Task decomposition is critical for understanding and learning complex long-horizon manipulation tasks. Especially for tasks involving rich physical interactions, relying solely on visual observations and robot proprioceptive information often fails to reveal the underlying event transitions. This raises the need for efficient collection of high-quality multi-modal data as well as a robust segmentation method to decompose demonstrations into meaningful modules. Building on the idea of the handheld demonstration device Universal Manipulation Interface (UMI), we introduce TacUMI, a multi-modal data collection system that additionally integrates ViTac sensors, a force–torque sensor, and a pose tracker into a compact, robot-compatible gripper design, enabling synchronized acquisition of all these modalities during human demonstrations. We then propose a multi-modal segmentation framework that leverages temporal models to detect semantically meaningful event boundaries in sequential manipulations. Evaluation on a challenging cable mounting task shows more than 90% segmentation accuracy and highlights a remarkable improvement with more modalities, validating that TacUMI establishes a practical foundation for both scalable collection and segmentation of multi-modal demonstrations in contact-rich tasks.
|
| |
| 15:00-16:30, Paper WeI2I.189 | Add to My Program |
| A Framework for Soft Robot Control: Integrating Physics-Based Modeling with Exploration Based Learning |
|
| Armanini, Costanza | New York University Abu Dhabi |
| Tzes, Anthony | New York University Abu Dhabi |
| Del Prete, Andrea | University of Trento |
| Abu-Dakka, Fares | New York University Abu Dhabi |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Robot Applications
Abstract: Soft robots present unique challenges for accurate modeling and control due to their virtually infinite degrees of freedom and highly nonlinear deformations. High-fidelity continuum models offer accuracy but are often computationally prohibitive for real-time control, while purely learning-based policies are efficient yet frequently lack robustness and require extensive data collection. In this paper, we propose a hybrid control framework that trains Reinforcement Learning (RL) policies using the physics-based Geometric Variable-Strain (GVS) formulation. This integration enables a Proximal Policy Optimization (PPO) agent to learn on a compact, physically exact state parameterization within the SoRoSim environment, leveraging continuum mechanics for accuracy without the need for real-world data collection. We validate our approach in simulation through two experiments: a basketball throw task and a multi-step pick-and-place scenario. In the throwing task, the agent achieved a 100% success rate within a defined tolerance, with 55% of trials reaching the target with 1cm precision. In the multi-step scenario, the controller maintained high accuracy with a maximum relative error of approximately 8.5%. These results demonstrate that combining GVS-based modeling with PPO yields robust, data-efficient control policies, providing a scalable solution for controlling soft robotic systems across diverse applications.
|
| |
| 15:00-16:30, Paper WeI2I.190 | Add to My Program |
| RVSPEC: Cyber-Physical Interplay Graphs for Formal Specification of Robotic Vehicle Control Software |
|
| Zhang, Chaoqi | Indiana University Bloomington |
| Cho, Minhyun | Purdue University |
| Hwang, Inseok | Purdue University |
| Kim, Hyungsub | Indiana University Bloomington |
Keywords: Software Tools for Robot Programming, Engineering for Robotic Systems, Robot Safety
Abstract: Robotic vehicles (RVs) have been increasingly deployed in critical missions. Yet, RV control software is prone to logic bugs that cause unexpected physical behaviors, deviating from the developers’ intentions. For instance, the Hakuto-R Mission 1 lunar lander physically crashed on the lunar surface due to a misinterpretation of sensor data. To discover such bugs, developers leverage bug-finding tools, from formal methods to fuzzing. To use these tools, human experts first need to manually create formal specifications (e.g., temporal logic) as bug oracles. Yet, such manual efforts are time-consuming and error-prone. Previous efforts to automatically generate such specifications merely translate natural-language documentation into formal specifications. In turn, they overlook the cyber-physical interplay inherent in RVs, which is often absent from the documentation, e.g., altitude changes caused by air pressure and servo lag. To tackle this limitation, we introduce RVSpec, an automatic specification generation framework. It first constructs a cyber-physical interplay graph (CPG), which quantifies how much internal factors (control software-dependent and hardware-specific properties intrinsic to an RV) and external factors (environmental conditions) influence the RV’s physical states. Then, RVSpec uses the CPG to guide large language model agents, enabling the generation of cyber-physical interplay-aware formal specifications. We evaluated RVSpec on four popular RV control software packages, including ArduPilot and PX4 for aerial vehicles, openpilot for autonomous vehicles, and cFS for spacecraft. The evaluation showed that specifications created by RVSpec achieved an accuracy of 80.7%, whereas the baseline’s attained 51.6%. When applying the specifications for fuzzing, those generated by RVSpec reduced the number of false positives from 4,790 (baseline) to 964 (79.9% reduction) while preserving the bug-finding capability.
|
| |
| 15:00-16:30, Paper WeI2I.191 | Add to My Program |
| Modeling and Control of a Pneumatic Soft Robotic Catheter Using Neural Koopman Operators |
|
| Yue, Yiyao | Johns Hopkins University |
| Barnes, Noah | Johns Hopkins University |
| Di, Lingyun | Johns Hopkins University |
| Young, Olivia | University of Maryland |
| Sochol, Ryan | University of Maryland |
| Brown, Jeremy DeLaine | Johns Hopkins University |
| Krieger, Axel | Johns Hopkins University |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Modeling, Control, and Learning for Soft Robots, Model Learning for Control
Abstract: Catheter-based interventions are widely used for the diagnosis and treatment of cardiac diseases. Recently, robotic catheters have attracted attention for their ability to improve precision and stability over conventional manual approaches. However, accurate modeling and control of soft robotic catheters remain challenging due to their complex, nonlinear behavior. The Koopman operator enables lifting the original system data into a linear "lifted space", offering a data-driven framework for predictive control; however, manually chosen basis functions in the lifted space often oversimplify system behaviors and degrade control performance. To address this, we propose a neural network-enhanced Koopman operator framework that jointly learns the lifted space representation and Koopman operator in an end-to-end manner. Moreover, motivated by the need to minimize radiation exposure during X-ray fluoroscopy in cardiac ablation, we investigate open-loop control strategies using neural Koopman operators to reliably reach target poses without continuous imaging feedback. The proposed method is validated in two experimental scenarios: interactive position control and a simulated cardiac ablation task using an atrium-like cavity. Our approach achieves average errors of 2.1±0.4 mm in position and 4.9±0.6° in orientation, outperforming not only model-based baselines but also other Koopman variants in targeting accuracy and efficiency. These results highlight the potential of the proposed framework for advancing soft robotic catheter systems and improving catheter-based interventions.
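The core Koopman idea is to learn a lifting phi so that the dynamics become linear in the lifted coordinates, z_{k+1} = K z_k + B u_k. A compact sketch of an end-to-end neural Koopman model in PyTorch (dimensions are illustrative; the paper's architecture and training losses are not reproduced):

```python
import torch
import torch.nn as nn

class NeuralKoopman(nn.Module):
    """Jointly learned lifting phi and linear operators (K, B):
    z_{k+1} = K z_k + B u_k in the lifted space."""
    def __init__(self, x_dim=6, u_dim=3, z_dim=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(),
                                 nn.Linear(64, z_dim))
        self.K = nn.Linear(z_dim, z_dim, bias=False)
        self.B = nn.Linear(u_dim, z_dim, bias=False)
        self.dec = nn.Linear(z_dim, x_dim)       # map back to catheter pose

    def forward(self, x0, u_seq):
        z, preds = self.phi(x0), []
        for u in u_seq:                          # rollout is linear in z
            z = self.K(z) + self.B(u)
            preds.append(self.dec(z))
        return torch.stack(preds)
```

Because the rollout is linear in z, multi-step predictions and input sequences can be computed cheaply, which is what makes imaging-free open-loop control plausible.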
|
| |
| 15:00-16:30, Paper WeI2I.192 | Add to My Program |
| Manifold-Constrained Hamilton-Jacobi Reachability Learning for Decentralized Multi-Agent Motion Planning |
|
| Chen, Qingyi | Purdue University |
| Ni, Ruiqi | Purdue University |
| Kim, Junyoung | Purdue University |
| Qureshi, Ahmed H. | Purdue University |
Keywords: Integrated Planning and Learning, Constrained Motion Planning, Multi-Robot Systems
Abstract: Safe multi-agent motion planning (MAMP) under task-induced constraints is a critical challenge in robotics. Many real-world scenarios require robots to navigate dynamic environments while adhering to manifold constraints imposed by tasks. For example, service robots must carry cups upright while avoiding collisions with humans or other robots. Despite recent advances in decentralized MAMP for high-dimensional systems, incorporating manifold constraints remains difficult. To address this, we propose a manifold-constrained Hamilton-Jacobi reachability (HJR) learning framework for decentralized MAMP. Our method solves HJR problems under manifold constraints to capture task-aware safety conditions, which are then integrated into a decentralized trajectory optimization planner. This enables robots to generate motion plans that are both safe and task-feasible without requiring assumptions about other agents’ policies. Our approach generalizes across diverse manifold-constrained tasks and scales effectively to high-dimensional multi-agent manipulation problems. Experiments show that our method outperforms existing constrained motion planners and operates at speeds suitable for real-world applications.
|
| |
| 15:00-16:30, Paper WeI2I.193 | Add to My Program |
| EasyMimic: A Low-Cost Framework for Robot Imitation Learning from Human Videos |
|
| Zhang, Tao | Renmin University of China |
| Xia, Song | Renmin University of China |
| Wang, Ye | Renmin University of China |
| Jin, Qin | Renmin University of China |
Keywords: Learning from Demonstration, Imitation Learning, Deep Learning in Grasping and Manipulation
Abstract: Robot imitation learning is often hindered by the high cost of collecting large-scale, real-world data. This challenge is especially significant for low-cost robots designed for home use, as they must be both user-friendly and affordable. To address this, we propose the EasyMimic framework, a low-cost and replicable solution that enables robots to quickly learn manipulation policies from human video demonstrations captured with standard RGB cameras. Our method first extracts 3D hand trajectories from the videos. An action alignment module then maps these trajectories to the gripper control space of a low-cost robot. To bridge the human-to-robot domain gap, we introduce a simple and user-friendly hand visual augmentation strategy. We then use a co-training method, fine-tuning a model on both the processed human data and a small amount of robot data, enabling rapid adaptation to new tasks. Experiments on the low-cost LeRobot platform demonstrate that EasyMimic achieves high performance across various manipulation tasks. It significantly reduces the reliance on expensive robot data collection, offering a practical path for bringing intelligent robots into homes. Project website: https://zt375356.github.io/EasyMimic-Project/
|
| |
| 15:00-16:30, Paper WeI2I.194 | Add to My Program |
| AnySafe: Adapting Latent Safety Filters at Runtime Via Safety Constraint Parameterization in the Latent Space |
|
| Agrawal, Sankalp, Sunny | The Ohio State University |
| Seo, Junwon | Carnegie Mellon University |
| Nakamura, Kensuke | Carnegie Mellon |
| Tian, Ran | UC Berkeley |
| Bajcsy, Andrea | Carnegie Mellon University |
Keywords: Robot Safety, Deep Learning Methods, Machine Learning for Robot Control
Abstract: Recent works have shown that foundational safe control methods, such as Hamilton–Jacobi (HJ) reachability analysis, can be applied in the latent space of world models. While this enables the synthesis of latent safety filters for hard-to-model vision-based tasks, they assume that the safety constraint is known a priori and remains fixed during deployment, limiting the safety filter's adaptability across scenarios. To address this, we propose constraint-parameterized latent safety filters that can adapt to user-specified safety constraints at runtime. Our key idea is to define safety constraints by conditioning on an encoding of an image that represents a constraint, using a latent-space similarity measure. The notion of similarity to failure is aligned in a principled way through conformal calibration, which controls how closely the system may approach the constraint representation. The parameterized safety filter is trained entirely within the world model's imagination, treating any image seen by the model as a potential test-time constraint, thereby enabling runtime adaptation to arbitrary safety constraints. In simulation and hardware experiments on vision-based control tasks with a Franka manipulator, we show that our method adapts at runtime by conditioning on the encoding of user-specified constraint images, without sacrificing performance. Video results can be found at https://any-safe.github.io.
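The conformal calibration step mentioned here typically reduces to picking a threshold on scores from a held-out calibration set. A minimal split-conformal sketch (the score variable is our stand-in for the paper's latent similarity measure):

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile: a fresh exchangeable score exceeds this
    threshold with probability at most alpha."""
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q, 1.0), method="higher")
```

Calibrating the latent similarity-to-failure score this way gives a principled knob for how closely the system may approach the image-specified constraint.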
|
| |
| 15:00-16:30, Paper WeI2I.195 | Add to My Program |
| GazeMoE: Perception of Gaze Target with Mixture-Of-Experts |
|
| Dai, Zhuangzhuang | Aston University |
| Lu, Zhongxi | University of Leicester |
| Zakka, Vincent Gbouna | Aston University |
| Manso, Luis J. | Aston University |
| Alcaraz Calero, Jose Maria | Aston University |
| Li, Chen | Aalborg University |
Keywords: Gesture, Posture and Facial Expressions, Human Factors and Human-in-the-Loop, Human-Robot Collaboration
Abstract: Estimating human gaze target from visible images is a critical task for robots to understand human attention, yet the development of generalizable neural architectures and training paradigms remains challenging. While recent advances in pre-trained vision foundation models offer promising avenues for locating gaze targets, the integration of multi-modal cues—including eyes, head poses, gestures, and contextual features—demands adaptive and efficient decoding mechanisms. Inspired by Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, we propose GazeMoE, a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. To address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness, GazeMoE incorporates a class-balancing auxiliary loss alongside strategic data augmentations, including region-specific cropping and photometric transformations. Extensive experiments on benchmark datasets demonstrate that our GazeMoE achieves state-of-the-art performance, outperforming existing methods on challenging gaze estimation tasks. The code and pre-trained models are released at: https://huggingface.co/zdai257/GazeMoE.
|
| |
| 15:00-16:30, Paper WeI2I.196 | Add to My Program |
| Visual-Auditory Extrinsic Contact Estimation |
|
| Yi, Xili | University of Michigan |
| Lee, Jayjun | University of Michigan |
| Fazeli, Nima | University of Michigan |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation, Deep Learning for Visual Perception
Abstract: Robust manipulation often hinges on a robot's ability to perceive extrinsic contacts—contacts between a grasped object and its surrounding environment. However, these contacts are difficult to observe through vision alone due to occlusions, limited resolution, and ambiguous near-contact states. In this paper, we propose a visual-auditory method for extrinsic contact estimation that integrates global scene information from vision with local contact cues obtained through active audio sensing. Our approach equips a robotic gripper with contact microphones and conduction speakers, enabling the system to emit and receive acoustic signals through the grasped object to detect external contacts. We train our perception pipeline entirely in simulation and zero-shot transfer to the real-world. To bridge the sim-to-real gap, we introduce a real-to-sim audio hallucination technique, injecting real-world audio samples into simulated scenes with ground-truth contact labels. The resulting multimodal model accurately estimates both the location and size of extrinsic contacts across a range of cluttered and occluded scenarios. We further demonstrate that explicit contact prediction significantly improves policy learning for downstream contact-rich manipulation tasks. Project webpage: https://va2contact.github.io
|
| |
| 15:00-16:30, Paper WeI2I.197 | Add to My Program |
| R3DPA: Leveraging 3D Representation Alignment and RGB Pretrained Priors for LiDAR Scene Generation |
|
| Sereyjol-Garros, Nicolas | Valeo.ai |
| Kirby, Ellington | Valeo.ai |
| Besnier, Victor | Valeo.ai |
| Samet, Nermin | Valeo.ai |
Keywords: Deep Learning for Visual Perception, Representation Learning, Range Sensing
Abstract: LiDAR scene synthesis is an emerging solution to scarcity in 3D data for robotic tasks such as autonomous driving. Recent approaches employ diffusion or flow matching models to generate realistic scenes, but 3D data remains limited compared to RGB datasets with millions of samples. We introduce R3DPA, the first LiDAR scene generation method to unlock image-pretrained priors for LiDAR point clouds, and leverage self-supervised 3D representations for state-of-the-art results. Specifically, we (i) align intermediate features of our generative model with self-supervised 3D features, which substantially improves generation quality; (ii) transfer knowledge from large-scale image-pretrained generative models to LiDAR generation, mitigating limited LiDAR datasets; and (iii) enable point cloud control at inference for object inpainting and scene mixing with solely an unconditional model. On the KITTI-360 benchmark, R3DPA achieves state-of-the-art performance. Code and pretrained models are available at https://github.com/valeoai/R3DPA.
|
| |
| 15:00-16:30, Paper WeI2I.198 | Add to My Program |
| RoboPCA: Pose-Centered Affordance Learning from Human Demonstrations for Robot Manipulation |
|
| Xiao, Zhanqi | Institute of Computing Technology, Chinese Academy of Sciences |
| Wang, Ruiping | Institute of Computing Technology, Chinese Academy of Sciences |
| Chen, Xilin | Institute of Computing Technology, Chinese Academy of Sciences |
Keywords: Learning from Demonstration, Perception for Grasping and Manipulation
Abstract: Understanding spatial affordances---comprising the contact regions of object interaction and the corresponding contact poses---is essential for robots to effectively manipulate objects and accomplish diverse tasks. However, existing spatial affordance prediction methods mainly focus on locating the contact regions while delegating the pose to independent pose estimation approaches, which can lead to task failures due to inconsistencies between predicted contact regions and candidate poses. In this work, we propose RoboPCA, a pose-centered affordance prediction framework that jointly predicts task-appropriate contact regions and poses conditioned on instructions. To enable scalable data collection for pose-centered affordance learning, we devise Human2Afford, a data curation pipeline that automatically recovers scene-level 3D information and infers pose-centered affordance annotations from human demonstrations. With Human2Afford, scene depth and the interaction object's mask are extracted to provide 3D context and object localization, while pose-centered affordance annotations are obtained by tracking object points within the contact region and analyzing hand-object interaction patterns to establish a mapping from the 3D hand mesh to the robot end-effector orientation. By integrating geometry-appearance cues through an RGB-D encoder and incorporating mask-enhanced features to emphasize task-relevant object regions into the diffusion-based framework, RoboPCA outperforms baseline methods on image datasets, simulation, and real robots, and exhibits strong generalization across tasks and categories.
|
| |
| 15:00-16:30, Paper WeI2I.199 | Add to My Program |
| Complexity Reduction of the Three-Point Dubins Problem (3PDP) Via Symmetry Exploitation for Machine Learning Purposes |
|
| Frego, Marco | Free University of Bolzano |
| Saccon, Enrico | University of Trento |
| De Martini, Davide | Università Degli Studi Di Trento |
| Palopoli, Luigi | University of Trento |
Keywords: Motion and Path Planning, Nonholonomic Motion Planning, Integrated Planning and Learning
Abstract: This work proposes a machine learning approach for the Three-Point Dubins Problem (3PDP) based on classification and regression. The 3PDP is a path planning problem with Dubins curves through 3 waypoints: it requires finding the heading at the intermediate point and the type of the two Dubins paths joining the three points. Classification is used to select the correct path type (out of 18) to avoid the trial-and-error enumeration of all cases; regression is employed to provide a good initial guess for the heading angle. Our results are used to improve and speed up existing methods in terms of efficiency and accuracy.
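The learning setup described above can be sketched with off-the-shelf models: an 18-way classifier for the joint path type and a regressor for the intermediate heading. The features below (a fixed-size encoding of the three waypoints) and the random-forest choice are our assumptions for illustration only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((1000, 7))                 # placeholder waypoint features
y_type = rng.integers(0, 18, 1000)        # optimal combined path type
y_theta = rng.random(1000) * 2 * np.pi    # optimal intermediate heading

clf = RandomForestClassifier().fit(X, y_type)   # replaces 18-case enumeration
reg = RandomForestRegressor().fit(X, y_theta)   # warm start for the solver

x_new = rng.random((1, 7))
path_type = clf.predict(x_new)[0]
theta_init = reg.predict(x_new)[0]        # then refine with a local solver
```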
|
| |
| 15:00-16:30, Paper WeI2I.200 | Add to My Program |
| MCGS-SLAM: A Multi-Camera SLAM Framework Using Gaussian Splatting for High-Fidelity Mapping |
|
| Cao, Zhihao | ETH Zurich |
| Wu, Hanyu | ETH Zurich |
| Tang, Li Wa | ETH Zürich |
| Luo, Zizhou | University of Zurich |
| Zhang, Wei | University of Stuttgart |
| Pollefeys, Marc | ETH Zurich |
| Zhu, Zihan | ETH Zurich |
| Oswald, Martin R. | University of Amsterdam |
Keywords: SLAM, Mapping, Sensor Fusion
Abstract: Recent progress in dense SLAM has primarily targeted monocular setups, often at the expense of robustness and geometric coverage. We present MCGS-SLAM, the first purely RGB-based multi-camera SLAM system built on 3D Gaussian Splatting (3DGS). Unlike prior methods relying on sparse maps or inertial data, MCGS-SLAM fuses dense RGB inputs from multiple viewpoints into a unified, continuously optimized Gaussian map. A multi-camera bundle adjustment (MCBA) jointly refines poses and depths via dense photometric and geometric residuals, while a scale consistency module enforces metric alignment across views using low-rank priors. The system supports RGB input and maintains real-time performance at large scale. Experiments on synthetic and real-world datasets show that MCGS-SLAM consistently yields accurate trajectories and photorealistic reconstructions, usually outperforming monocular baselines. Notably, the wide field of view from multi-camera input enables reconstruction of side-view regions that monocular setups miss, critical for safe autonomous operation. These results highlight the promise of multi-camera Gaussian Splatting SLAM for high-fidelity mapping in robotics and autonomous driving.
|
| |
| 15:00-16:30, Paper WeI2I.201 | Add to My Program |
| Zero-Shot Sim2Real Transfer for Magnet-Based Tactile Sensor on Insertion Tasks |
|
| Han, Beining | Princeton University |
| Joshi, Abhishek | Princeton University |
| Deng, Jia | Princeton University |
Keywords: Sensorimotor Learning, Force and Tactile Sensing, Contact Modeling
Abstract: Tactile sensing is an important sensing modality for robot manipulation. Among different types of tactile sensors, magnet-based sensors, like u-skin, strike a good balance between tactile density, durability, and compactness. However, the large sim-to-real gap of tactile sensors prevents robots from acquiring useful tactile-based manipulation skills from simulation data, a recipe that has been successful for achieving complex and sophisticated control policies. Prior work has used binarization techniques to bridge the sim-to-real gap for dexterous in-hand manipulation with magnet-based sensors. However, binarization inherently loses much information that is useful in many other tasks, e.g., insertion. In our work, we propose GCS, a novel sim-to-real technique to learn contact-rich insertion skills with dense, distributed, 3-axis tactile readings from magnet-based tactile sensors. We evaluate our approach on blind insertion tasks and show successful zero-shot sim-to-real transfer of RL policies with raw tactile readings as input.
|
| |
| 15:00-16:30, Paper WeI2I.202 | Add to My Program |
| Self-Supervised Domain Adaptation for Visual 3D Pose Estimation of Nano-Drone Racing Gates by Enforcing Geometric Consistency |
|
| Carlotti, Nicholas | IDSIA USI-SUPSI |
| Antonazzi, Michele | KTH Royal Institute of Technology |
| Cereda, Elia | USI and SUPSI |
| Nava, Mirko | IDSIA |
| Basilico, Nicola | University of Milan |
| Palossi, Daniele | ETH Zurich |
| Giusti, Alessandro | IDSIA USI-SUPSI |
Keywords: Deep Learning for Visual Perception, Transfer Learning, Deep Learning Methods
Abstract: We consider the task of visually estimating the relative pose of a drone racing gate in front of a nano-quadrotor, using a convolutional neural network pre-trained on simulated data to regress the gate's pose. Due to the sim-to-real gap, the pre-trained model underperforms in the real world and must be adapted to the target domain. We propose an unsupervised domain adaptation (UDA) approach using only real image sequences collected by the drone flying an arbitrary trajectory in front of a gate; sequences are annotated in a self-supervised fashion with the drone's odometry as measured by its onboard sensors. On this dataset, a state consistency loss enforces that two images acquired at different times yield pose predictions that are consistent with the drone's odometry. Results indicate that our approach outperforms other SoA UDA approaches and has a low mean absolute error in position (x=26, y=28, z=10 cm) and orientation (psi=13 degrees), representing an improvement of 40% in position and 37% in orientation over a baseline. The approach's effectiveness is appreciable with as few as 10 minutes of real-world flight data and yields models with an inference time of 30.4 ms (33 fps) when deployed aboard the Crazyflie 2.1 Brushless nano-drone.
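The state consistency loss admits a compact formulation: propagate the pose predicted at time t0 through the odometry-measured ego-motion and penalize its disagreement with the prediction at t1. A sketch with 4x4 homogeneous transforms (the SE(3) distance is simplified to a Frobenius penalty here, which may differ from the paper's exact loss):

```python
import torch

def state_consistency_loss(T_pred_t0, T_pred_t1, T_odom):
    """T_pred_*: predicted gate pose in the drone frame at each time;
    T_odom: drone pose at t1 expressed in the drone frame at t0."""
    # Gate pose at t1 implied by the t0 prediction and the known ego-motion.
    T_implied = torch.linalg.inv(T_odom) @ T_pred_t0
    return torch.mean((T_implied - T_pred_t1) ** 2)
```

Because the gate is static, any disagreement is attributable to prediction error, which is what lets odometry substitute for pose labels.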
|
| |
| 15:00-16:30, Paper WeI2I.203 | Add to My Program |
| Development of a Miniaturized 6-DoF Surgical Instrument with a 4-Mm Elbow and 3-Mm Wrist for Transoral Robotic Surgery |
|
| Balasubramanian, Cate | University of Waterloo |
| Zhang, Sunny | University of Toronto |
| Li, Teng | The Hospital for Sick Children |
| Kang, Paul Hoseok | University of Toronto |
| Nazari, Ali A. | The Hospital for Sick Children |
| Looi, Thomas | Hospital for Sick Children |
| Podolsky, Dale | University of Toronto |
Keywords: Surgical Robotics: Laparoscopy, Medical Robots and Systems, Actuation and Joint Mechanisms
Abstract: Current platforms for Transoral Robotic Surgery (TORS) are suboptimal for confined oropharyngeal workspaces, particularly in pediatric applications. To address this, we present the design and characterization of a novel cable-driven robotic instrument providing 6 degrees-of-freedom (DoF)—shaft roll, elbow pitch/yaw, wrist pitch/yaw, and grip—for dexterous manipulation. The system integrates a miniaturized 4-mm proximal elbow with a previously developed 3-mm distal wrist. This design is enabled by a novel “sandwich” link architecture that facilitates high-density cable routing through the joint’s center plane, providing a compact, rigid alternative to traditional pin-jointed designs. Experimental validation identified significant kinematic coupling between in-plane joint pairs. An empirical real-time compensation strategy reduced this coupling rate by 82.9% for pitch and 80.8% for yaw. Workspace analysis confirmed that the proximal elbow enables high distal dexterity in regions critical for complex surgical tasks. Integration with a Franka Research 3 manipulator enabled fully coordinated macro-micro teleoperation, providing a pilot demonstration for TORS workflows. This represents the first demonstration of a 4-mm elbow, 3-mm wrist mechanism for TORS, providing the hardware foundation necessary for future evaluation of dexterity-intensive tasks, including suturing and dissection.
|
| |
| 15:00-16:30, Paper WeI2I.204 | Add to My Program |
| Learning Dexterous Manipulation with Quantized Hand State |
|
| Feng, Ying | Shanghai Jiao Tong University |
| Fang, Hongjie | Shanghai Jiao Tong University |
| He, Yinong | Carnegie Mellon University |
| Chen, Jingjing | Shanghai Jiao Tong University |
| Wang, Chenxi | Shanghai Noematrix Intelligence Technology Ltd |
| He, Zihao | Shanghai Jiao Tong University |
| Liu, Ruonan | Shanghai Jiao Tong University |
| Lu, Cewu | Shanghai Jiao Tong University |
Keywords: Imitation Learning, Dexterous Manipulation, Deep Learning in Grasping and Manipulation
Abstract: Dexterous robotic hands enable robots to perform complex manipulations that require fine-grained control and adaptability. Achieving such manipulation is challenging because the high degrees of freedom tightly couple hand and arm motions, making learning and control difficult. Successful dexterous manipulation relies not only on precise hand motions, but also on accurate spatial positioning of the arm and coordinated arm-hand dynamics. However, most existing visuomotor policies represent arm and hand actions in a single combined space, which often causes high-dimensional hand actions to dominate the coupled action space and compromise arm control. To address this, we propose DQ-RISE, which quantizes hand states to simplify hand motion prediction while preserving essential patterns, and applies a continuous relaxation that allows arm actions to diffuse jointly with these compact hand states. This design enables the policy to learn arm-hand coordination from data while preventing hand actions from overwhelming the action space. Experiments show that DQ-RISE achieves more balanced and efficient learning, paving the way toward structured and generalizable dexterous manipulation. Project website: https://rise-policy.github.io/DQ-RISE/.
|
| |
| 15:00-16:30, Paper WeI2I.205 | Add to My Program |
| Symmetry-Aware Skill Transfer with Energy-Tank Passive Control for Ankle Exoskeletons |
|
| Largeteau, Etienne | University of Évry |
| Bencharif, Loqmane | Paris Saclay |
| Conte, Bangaly | IBISC Laboratory, University of Paris Saclay |
| Ibset, Abderahim | University Paris-Saclay |
| Su, Hang | Paris Saclay University |
| Bruneau, Olivier | ENS Cachan |
| Alfayad, Samer | Paris-Saclay University - Evry University |
Keywords: Control Architectures and Programming, Rehabilitation Robotics, Prosthetics and Exoskeletons
Abstract: This paper presents a unified framework that combines symmetry-aware skill transfer with energy-tank passive control to achieve safe and adaptive ankle exoskeleton assistance. Subject-specific ankle references are first extracted from wearable IMU data: Dynamic Time Warping (DTW) aligns gait cycles onto a normalized phase axis, and Gaussian Mixture Regression (GMR) synthesizes smooth probabilistic templates suitable for online modulation. When only unilateral sensing is available, contralateral trajectories are reconstructed through either a half-period phase shift or a DTW-informed nonlinear mapping, enabling robust bilateral assistance. These references are then tracked by a joint-space PID controller wrapped with an energy tank, which bounds power exchange and prevents unintended energy injection. In simulation experiments, the proposed controller improved center-of-mass smoothness relative to plain PID. Benchtop validation confirms the efficacy of both GMR-generated and symmetry-generated trajectories. Furthermore, experimental results show a reduction of 40 N in peak interaction force (from 120 N to 80 N), resulting in less mechanical strain on the user. By unifying phase-consistent gait synthesis with passivity shaping, this work advances ankle exoskeleton assistance that is individualized, robust, and inherently safe.
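As an illustration of the energy-tank mechanism the abstract describes, here is a minimal single-joint sketch of the standard tank-based passivity wrapper: the tank gates how much power a tracking command may inject into the human-robot system. All names, bounds, and the scalar-joint simplification are assumptions, not the authors' implementation.

```python
class EnergyTank:
    """Passivity-preserving wrapper around a joint torque command."""

    def __init__(self, E_init=5.0, E_min=0.5, E_max=10.0):
        self.E = E_init      # current tank energy [J]
        self.E_min = E_min   # below this level, no further energy injection
        self.E_max = E_max   # excess dissipated energy beyond this is discarded

    def shape(self, tau_cmd, qdot, dt):
        """Scale the commanded torque so the power drawn from the tank
        can never deplete it below E_min."""
        p = tau_cmd * qdot   # power injected (> 0) or dissipated (< 0) by the command
        if p > 0.0 and self.E - p * dt < self.E_min:
            # not enough stored energy: scale down the active command
            alpha = max(0.0, (self.E - self.E_min) / (p * dt))
            tau_cmd *= alpha
            p *= alpha
        self.E = min(self.E - p * dt, self.E_max)  # update tank state
        return tau_cmd
```

In use, the PID output passes through `shape()` each control cycle, so assistance remains responsive while the net energy the exoskeleton can inject stays bounded.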
|
| |
| 15:00-16:30, Paper WeI2I.206 | Add to My Program |
| LLM-Driven Corrective Robot Operation Code Generation with Static Text-Based Simulation |
|
| Wang, Wenhao | University of Massachusetts Dartmouth |
| Rong, Yi | University of Massachusetts Dartmouth |
| Li, Yanyan | California State University San Marcos |
| Jiao, Long | University of Massachusetts Dartmouth |
| Yuan, Jiawei | University of Massachusetts Dartmouth |
Keywords: Task Planning, AI-Based Methods, AI-Enabled Robotics
Abstract: Recent advances in large language models (LLMs) have demonstrated promising capabilities for generating robot operation code to enable LLM-driven robots. To enhance the reliability of operation code generated by LLMs, corrective designs with feedback from observing code execution have been increasingly adopted in existing research. However, code execution in these designs relies on either a physical experiment or a customized simulation environment, which limits deployment due to the high configuration effort of the environment and potentially long execution times. In this paper, we explore the possibility of directly leveraging an LLM to enable static simulation of robot operation code, and then leverage it to design a new, reliable LLM-driven corrective robot operation code generation framework. Our framework configures the LLM as a static simulator with enhanced capabilities that reliably simulates robot code execution by interpreting actions, reasoning over state transitions, analyzing execution outcomes, and generating semantic observations that accurately capture trajectory dynamics. To validate the performance of our framework, we performed experiments on various operation tasks for different robots, including UAVs and small ground vehicles. The experimental results demonstrate not only the high accuracy of our static text-based simulation but also the reliable code generation of our LLM-driven corrective framework, which achieves performance comparable to state-of-the-art research while not relying on dynamic code execution using physical experiments or simulators.
|
| |
| 15:00-16:30, Paper WeI2I.207 | Add to My Program |
| Towards Autonomous Tape Handling for Robotic Wound Redressing |
|
| Liang, Xiao | University of California San Diego |
| Shen, Lu | Cornell University |
| Zhang, Peihan | University of California San Diego |
| Atar, Soofiyan | University of California San Diego |
| Richter, Florian | University of California, San Diego |
| Yip, Michael C. | University of California, San Diego |
Keywords: Medical Robots and Systems, Physical Human-Robot Interaction, Rehabilitation Robotics
Abstract: Chronic wounds, such as diabetic, pressure, and venous ulcers, affect over 6.5 million patients in the United States alone and generate an annual cost exceeding $25 billion. Despite this burden, chronic wound care remains a routine yet manual process performed exclusively by trained clinicians due to its critical safety demands. We envision a future in which robotics and automation support wound care to lower costs and enhance patient outcomes. This paper introduces an autonomous framework for one of the most fundamental yet challenging subtasks in wound redressing: adhesive tape manipulation. Specifically, we address two critical capabilities: tape initial detachment (TID) and secure tape placement. To handle the complex adhesive dynamics of detachment, we propose a force-feedback imitation learning approach trained from human teleoperation demonstrations. For tape placement, we develop a numerical trajectory optimization method to ensure smooth adhesion and wrinkle-free application across diverse anatomical surfaces. We validate these methods through extensive experiments, demonstrating reliable performance in both quantitative evaluations and integrated wound redressing pipelines. Our results establish tape manipulation as an essential step toward practical robotic wound care automation.
|
| |
| 15:00-16:30, Paper WeI2I.208 | Add to My Program |
| IndustryShapes: An RGB-D Benchmark Dataset for 6D Object Pose Estimation of Industrial Assembly Components and Tools |
|
| Sapoutzoglou, Panagiotis | National Technical University of Athens |
| Vaggelis, Orestis | National Technical University of Athens |
| Zacharia, Athina | National Technical University of Athens |
| Sartinas, Evangelos | National Technical University of Athens |
| Pateraki, Maria | National Technical University of Athens |
Keywords: Performance Evaluation and Benchmarking, Industrial Robots
Abstract: We introduce IndustryShapes, a new RGB-D benchmark dataset of industrial tools and components, designed for both instance-level and novel object 6D pose estimation approaches. The dataset provides a realistic and application-relevant testbed for benchmarking these methods in the context of industrial robotics, bridging the gap between lab-based research and deployment in real-world manufacturing scenarios. Unlike many previous datasets, which focus on household or consumer products, use synthetic, clean tabletop scenes, or capture objects solely in controlled lab environments, IndustryShapes introduces five new object types with challenging properties, captured in realistic industrial assembly settings. The dataset spans diverse complexity, from simple to more challenging scenes, with single and multiple objects, including scenes with multiple instances of the same object, and is organized in two parts: the classic set and the extended set. The classic set includes a total of 4.6k images and 6k annotated poses. The extended set introduces additional data modalities to support the evaluation of model-free and sequence-based approaches. To the best of our knowledge, IndustryShapes is the first dataset to offer RGB-D static onboarding sequences. We further evaluate the dataset on a representative set of state-of-the-art methods for instance-based and novel object 6D pose estimation, including object detection and segmentation, showing that there is room for improvement in this domain. The dataset page can be found at https://pose-lab.github.io/IndustryShapes
|
| |
| 15:00-16:30, Paper WeI2I.209 | Add to My Program |
| RoboSQ: Semantic Queries for Task-Aligned Robot Training Data |
|
| Chen, Kaiyuan | University of California, Berkeley |
| Xie, Shuangyu | UC Berkeley |
| Hari, Kush | UC Berkeley |
| Goldberg, Andrew | University of California Berkeley |
| Kondap, Kavish | University of California, Berkeley |
| Goldberg, Ken | UC Berkeley |
Keywords: Big Data in Robotics and Automation, Data Sets for Robot Learning, Deep Learning in Grasping and Manipulation
Abstract: Training robot policies often requires extracting appropriate subsets of data from large and noisy datasets. For example, one might want to extract only robot demonstrations with accurate captions or only those related to cooking. We present RoboSQ, a robot data management system that enables semantic queries. RoboSQ samples temporally distributed frames, overlays projected sensor information from robot trajectories, and constructs structured Visual Question Answering (VQA) prompts for Vision-Language Models (VLMs). RoboSQ efficiently handles queries by pipelining data loading, frame extraction, and VLM inference. We evaluate RoboSQ on the DROID dataset with three semantic queries: 1) failure detection, 2) calibration error detection, and 3) visual complexity scoring. It filters out failure trajectories with 78% accuracy and 86% F1 score, and identifies trajectories with incorrect extrinsic calibration between the camera frame and end-effector frame at 86% accuracy and 88% F1 score. We further evaluate RoboSQ by training a pick-and-place Action Chunking Transformer policy with a UR5 robot arm using mixed-quality demonstration data. Data extracted by RoboSQ is closely aligned with the expert-curated data. A policy trained on RoboSQ-selected data achieves 13 successes out of 15 trials, compared to only 1 out of 15 when trained on the full mixed dataset.
|
| |
| 15:00-16:30, Paper WeI2I.210 | Add to My Program |
| NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions |
|
| Yang, Haolin | Peking University |
| Long, Yuxing | Peking University |
| Yu, Zhuoyuan | Peking University |
| Yang, Zihan | Peking University |
| Wang, Minghan | Peking University |
| Xu, Jiapeng | Peking University |
| Wang, Yihan | Peking University |
| Yu, Ziyan | Peking University |
| Cai, Wenzhe | Southeast University |
| Kang, Lei | Peking University |
| Dong, Hao | Peking University |
Keywords: Vision-Based Navigation
Abstract: Instruction-following navigation is a key step toward embodied intelligence. Prior benchmarks mainly focus on semantic understanding but overlook systematically evaluating navigation agents' spatial perception and reasoning capabilities. In this work, we introduce the NavSpace benchmark, which contains six task categories and 1,228 trajectory–instruction pairs designed to probe the spatial intelligence of navigation agents. On this benchmark, we comprehensively evaluate 22 navigation agents, including state-of-the-art navigation models and multimodal large language models. The evaluation results lift the veil on spatial intelligence in embodied navigation. Furthermore, we propose SNav, a new spatially intelligent navigation model. SNav outperforms existing navigation agents on NavSpace and real robot tests, establishing a strong baseline for future work.
|
| |
| 15:00-16:30, Paper WeI2I.211 | Add to My Program |
| Vi-TacMan: Articulated Object Manipulation Via Vision and Touch |
|
| Cui, Leiyao | University of Chinese Academy of Sciences |
| Zhao, Zihang | Peking University |
| Xie, Sirui | Institute for Artificial Intelligence, Peking University |
| Zhang, Wenhuan | Harbin Institute of Technology |
| Han, Zhi | Shenyang Institute of Automation, Chinese Academy of Sciences |
| Zhu, Yixin | Peking University |
Keywords: Force and Tactile Sensing, Soft Sensors and Actuators, Sensor-based Control
Abstract: Autonomous manipulation of articulated objects represents a basic skill for robots deployed in human environments. Current vision-based methods can infer an object's hidden kinematics, but their estimates are sometimes too imprecise to drive reliable actions, especially on previously unseen objects. Tactile methods, on the other hand, excel once contact is made, yet they require a reasonable initial guess about where and how to interact. This observation suggests a natural division of labor: vision provides global, coarse guidance, while touch delivers precise, robust execution. Building on this complementarity, we propose a systematic approach, Vi-TacMan, which uses vision to plan and touch to control. We begin by training a vision module that accurately detects holdable and movable parts. Once identified, these parts are segmented for further processing. From these detections, the system proposes feasible grasps along with a coarse interaction direction modeled by a von Mises-Fisher (vMF) distribution. To enhance directional reasoning, we explicitly incorporate surface normals on movable regions as a geometric prior. This inductive bias clarifies the expected motion and improves generalization to unseen objects, yielding significant gains over baseline methods (all p-values less than 0.0001). Finally, seeded with the vision-derived grasp and motion direction, a tactile-informed controller establishes and maintains stable interactions, enabling reliable execution of the manipulation. Real-world experiments on diverse objects further confirm reliable manipulation without explicit kinematic models. These findings establish a paradigm for multi-modal robotic perception that could advance autonomous systems operating in complex, unstructured environments.
|
| |
| 15:00-16:30, Paper WeI2I.212 | Add to My Program |
| Lightweight Visual Reasoning for Socially-Aware Robots |
|
| Galatolo, Alessio | Uppsala University |
| Cumbal, Ronald | Uppsala University |
| Rouchitsas, Alexandros | Uppsala University |
| Winkle, Katie | Uppsala University |
| Gurdur Broo, Didem | Uppsala University |
| Castellano, Ginevra | Uppsala University |
Keywords: Multi-Modal Perception for HRI, Social HRI, Intention Recognition
Abstract: Robots operating in shared human environments must not only navigate, interact, and detect their surroundings, but also interpret and respond to dynamic, and often unpredictable, human behaviours. Although recent advances have shown promise in enhancing robotic perception and instruction-following using Vision-Language Models (VLMs), they remain limited in addressing the complexities of multimodal human-robot interactions (HRI). Motivated by this challenge, we introduce a lightweight language-to-vision feedback module that closes the loop between an LLM and the vision encoder in VLMs. The module projects image-token hidden states through a gated Multi-Layer Perceptron (MLP) back into the encoder input, prompting a second pass that reinterprets the scene under text context. We evaluate this approach on three robotics-centred tasks: navigation in a simulated environment (Habitat), sequential scene description (Mementos-Robotics), and human-intention recognition (our HRI dataset). Results show that our method improves Qwen 2.5 (7B) by 3.3% (lower distance), +0.057 description score, and +2.93% accuracy, with less than 3% extra parameters; Gemma 3 (4B) and LLaVA OV 1.5 (4B) show mixed navigation results but gain +0.111, +0.055 and +10.81%, +4.79% on the latter two tasks.
|
| |
| 15:00-16:30, Paper WeI2I.213 | Add to My Program |
| ArmGS: Composite Gaussian Appearance Refinement for Modeling Dynamic Urban Environments |
|
| Wu, Guile | Huawei Noah's Ark Lab |
| Bai, Dongfeng | Noah's Ark Lab, Huawei Technologies |
| Liu, Bingbing | Huawei Technologies |
Keywords: Simulation and Animation, Deep Learning for Visual Perception, Computer Vision for Transportation
Abstract: This work focuses on modeling dynamic urban environments for autonomous driving simulation. Contemporary data-driven methods using neural radiance fields have achieved photorealistic driving scene modeling, but they suffer from low rendering efficiency. Recently, some approaches have explored 3D Gaussian splatting for modeling dynamic urban scenes, enabling high-fidelity reconstruction and real-time rendering. However, these approaches often neglect to model fine-grained variations between frames and camera viewpoints, leading to suboptimal results. In this work, we propose a new approach named ArmGS that exploits composite driving Gaussian splatting with multi-granularity appearance refinement for autonomous driving scene modeling. The core idea of our approach is devising a multi-level appearance modeling scheme to optimize a set of transformation parameters for composite Gaussian refinement at multiple granularities, ranging from the local Gaussian level to the global image level and the dynamic actor level. This not only models global scene appearance variations between frames and camera viewpoints, but also models local fine-grained changes of background and objects. Extensive experiments on multiple challenging autonomous driving datasets, namely Waymo, KITTI, NOTR and VKITTI2, demonstrate the superiority of our approach over state-of-the-art methods.
|
| |
| 15:00-16:30, Paper WeI2I.214 | Add to My Program |
| Tactile-Proprioceptive Sensor Fusion for Contact Wrench Estimation in Whole-Body Physical Human-Robot Interaction |
|
| Min, Junha | DGIST |
| Ma, Junghyeon | DGIST |
| Kwon, Jiwung | DGIST |
| Bae, Sunggyu | DGIST |
| Kim, Joohyung | University of Illinois Urbana-Champaign |
| Park, Kyungseo | DGIST |
Keywords: Physical Human-Robot Interaction, Sensor Fusion, Motion Control
Abstract: Direct physical guidance is a natural means of teaching and interacting with robots, and robotic skins make a key contribution by enabling sensitive contact sensing and localization. This paper presents a tactile–proprioceptive sensor fusion framework for natural physical human-robot interaction. Tactile cues from pneumatic skin pads serve as contact indicators that bypass the ambiguity between frictional residues and applied external forces, enabling highly sensitive contact detection without explicit friction identification. We fuse these cues with motor-current-based proprioception to reconstruct multi-axis contact forces on the robot surface. To maintain accuracy during motion, we employ a temporal convolutional network (TCN) to mitigate friction hysteresis during stick–slip transitions, reducing uncertainty at contact onset and yielding smooth, responsive guidance. We validate the approach on a skin-integrated robot arm: (i) multi-axis forces are reconstructed in stationary contacts, and (ii) simultaneous force estimation and kinesthetic teaching are demonstrated. Results indicate improved sensitivity and responsiveness across diverse contact conditions compared with tactile-only and proprioceptive-only baselines, supporting tactile–proprioceptive fusion as a reliable pathway to safe, intuitive physical human–robot interaction.
|
| |
| 15:00-16:30, Paper WeI2I.215 | Add to My Program |
| Fast-SmartWay: Panoramic-Free End-To-End Zero-Shot Vision-And-Language Navigation |
|
| Shi, Xiangyu | The University of Adelaide |
| Li, Zerui | Adelaide University |
| Qiao, Yanyuan | EPFL |
| Wu, Qi | University of Adelaide |
Keywords: Human-Robot Collaboration, Vision-Based Navigation
Abstract: Recent advances in Vision-and-Language Navigation in Continuous Environments (VLN-CE) have leveraged multimodal large language models (MLLMs) to achieve zero-shot navigation. However, existing methods often rely on panoramic observations and two-stage pipelines involving waypoint predictors, which introduce significant latency and limit real-world applicability. In this work, we propose Fast-SmartWay, an end-to-end zero-shot VLN-CE framework that eliminates the need for panoramic views and waypoint predictors. Our approach uses only three frontal RGB-D images combined with natural language instructions, enabling MLLMs to directly predict actions. To enhance decision robustness, we introduce an Uncertainty-Aware Reasoning module that integrates (i) a Disambiguation Module for avoiding local optima, and (ii) a Future-Past Bidirectional Reasoning mechanism for globally coherent planning. Experiments in both simulated and real-robot environments demonstrate that our method significantly reduces per-step latency while achieving competitive or superior performance compared to panoramic-view baselines. These results demonstrate the practicality and effectiveness of Fast-SmartWay for real-world zero-shot embodied navigation.
|
| |
| 15:00-16:30, Paper WeI2I.216 | Add to My Program |
| ALOHA Lightning: Learning Fast and Precise Manipulation |
|
| Yao, John Hua | Stanford University |
| Wu, Qi | Stanford University |
| Gao, Yihuai | Stanford University |
| Finn, Chelsea | Stanford University |
| Fu, Zipeng | Stanford University |
Keywords: Imitation Learning, Bimanual Manipulation
Abstract: Learning from human demonstrations has enabled robots to acquire a wide range of manipulation skills, but learned policies typically execute far slower than ordinary humans. This speed gap is mainly due to the lack of an interface for collecting demonstration data at high speed and the difficulty of training policies that can robustly execute high-speed motions. In this paper, we present ALOHA Lightning, a system for learning fast and precise robotic manipulation. Our system uses kinesthetic teaching to intuitively collect near-human-speed demonstrations on a backdrivable bimanual platform, yielding natural and fast trajectories. We also present a learning pipeline that enables smooth high-speed execution through test-time action smoothing and aligns the visual data distribution between data collection and deployment with masking. Given 50 demonstrations for each task, ALOHA Lightning autonomously completes tasks such as folding shorts, battery insertion, and bussing tables with success rates above 80% at or close to human speed.
|
| |
| 15:00-16:30, Paper WeI2I.217 | Add to My Program |
| TrajMamba: An Ego-Motion-Guided Mamba Model for Pedestrian Trajectory Prediction from an Egocentric Perspective |
|
| Peng, Yusheng | Hefei University of Technology |
| Zhang, Gaofeng | Hefei University of Technology |
| Zheng, Liping | Hefei University of Technology |
Keywords: Human-Aware Motion Planning, Deep Learning Methods, Computer Vision for Transportation
Abstract: Future trajectory prediction of a tracked pedestrian from an egocentric perspective is a key task in areas such as autonomous driving and robot navigation. The challenge of this task lies in the complex dynamic relative motion between the ego-camera and the tracked pedestrian. To address this challenge, we propose an ego-motion-guided trajectory prediction network based on the Mamba model. First, two Mamba models are used as encoders to extract pedestrian motion and ego-motion features from pedestrian movement and ego-vehicle movement, respectively. Then, an ego-motion-guided Mamba decoder explicitly models the relative motion between the pedestrian and the vehicle by integrating pedestrian motion features as historical context with ego-motion features as guiding cues, producing decoded features. Finally, the future trajectory is generated from the decoded features corresponding to the future timestamps. Extensive experiments demonstrate the effectiveness of the proposed model, which achieves state-of-the-art performance on the PIE and JAAD datasets.
|
| |
| 15:00-16:30, Paper WeI2I.218 | Add to My Program |
| Multi-Robot Multi-Source Localization in Complex Flows with Physics-Preserving Environment Models |
|
| Shaffer, Benjamin | University of Pennsylvania |
| Edwards, Victoria | College of the Atlantic |
| Kinch, Brooks | University of Pennsylvania |
| Trask, Nathaniel | University of Pennsylvania |
| Hsieh, M. Ani | University of Pennsylvania |
Keywords: Distributed Robot Systems, Model Learning for Control, Sensor-based Control
Abstract: Source localization in a complex flow poses a significant challenge for multi-robot teams tasked with localizing the source of chemical leaks or tracking the dispersion of an oil spill. The flow dynamics can be time-varying and chaotic, resulting in sporadic and intermittent sensor readings, and complex environmental geometries further complicate a team's ability to model and predict the dispersion. To accurately account for the physical processes that drive the dispersion dynamics, robots must have access to computationally intensive numerical models, which can be difficult when onboard computation is limited. We present a distributed mobile sensing framework for source localization in which each robot carries a machine-learned, finite element model of its environment to guide information-based sampling. The models are used to evaluate an approximate mutual information criterion to drive an infotaxis control strategy, which selects sensing regions that are expected to maximize informativeness for the source localization objective. Our approach achieves faster error reduction compared to baseline sensing strategies and results in more accurate source localization compared to baseline machine learning approaches.
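The information-based sampling step described above can be pictured as a greedy expected-entropy-reduction (mutual information) criterion over candidate sensing regions. The sketch below is a simplified illustration: the binary measurement model and the `likelihood` interface are assumptions standing in for the paper's learned finite-element models.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete belief."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def infotaxis_step(posterior, candidates, likelihood):
    """posterior: (n_sources,) belief over source locations.
    likelihood(region, z) -> (n_sources,) probability of observing z
    (here z in {0, 1}: no detection / detection) at `region` under each source."""
    best_region, best_gain = None, -np.inf
    H = entropy(posterior)
    for region in candidates:
        gain = 0.0
        for z in (0, 1):
            like = likelihood(region, z)
            p_z = np.dot(like, posterior)      # marginal probability of z
            if p_z <= 0:
                continue
            post = like * posterior / p_z      # Bayesian update given z
            gain += p_z * (H - entropy(post))  # expected entropy reduction
        if gain > best_gain:
            best_region, best_gain = region, gain
    return best_region
```

Each robot would evaluate this criterion with its onboard environment model, moving toward the region expected to be most informative about the source.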
|
| |
| 15:00-16:30, Paper WeI2I.219 | Add to My Program |
| CASSR: Continuous A-Star Search through Reachability for Real Time Footstep Planning |
|
| Wang, Jiayi | Beijing Institute for General Artificial Intelligence (BIGAI) |
| Tonneau, Steve | The University of Edinburgh |
Keywords: Humanoid and Bipedal Locomotion, Legged Robots, Multi-Contact Whole-Body Motion Planning and Control
Abstract: Footstep planning involves a challenging combinatorial search. Traditional A* approaches require discretising reachability constraints, while Mixed-Integer Programming (MIP) supports continuous formulations but quickly becomes intractable, especially when rotations are included. We present CASSR, a novel framework that recursively propagates convex, continuous formulations of a robot’s kinematic constraints within an A* search. Combined with a new cost-to-go heuristic based on the EPA algorithm, CASSR efficiently plans contact sequences of up to 30 footsteps in under 125 ms. Experiments on biped locomotion tasks demonstrate that CASSR outperforms traditional discretised A* by up to a factor of 100, while also surpassing a commercial MIP solver. These results show that CASSR enables fast, reliable, and real-time footstep planning for biped robots.
|
| |
| 15:00-16:30, Paper WeI2I.220 | Add to My Program |
| Integration and Continual Learning-Based Modeling of a Soft Robotic Sensor for Social Robot Proprioception |
|
| Hau, Pak Chuen | KTH Royal Institute of Technology |
| Thorapalli Muralidharan, Seshagopalan | KTH Royal Institute of Technology |
| Gomez, Randy | Honda Research Institute Japan Co., Ltd |
| Andrikopoulos, Georgios | KTH Royal Institute of Technology |
Keywords: Soft Robot Applications, Soft Sensors and Actuators, Soft Robot Materials and Design
Abstract: This paper presents an embedded soft sensor for proprioceptive feedback in a soft continuum actuator (SCA) forming the neck of the social robot HARU. The sensor is fabricated in a single-step multi-material additive manufacturing process, co-extruding conductive and non-conductive thermoplastic polyurethane to form an integrated structure. Several sensor geometries are evaluated, with a gauge-type configuration selected based on linearity and repeatability criteria. The design is embedded in a cross-configuration to measure the actuator’s two dominant degrees of freedom, pitch and roll. Sensor signals are mapped to angle estimates using linear regression, a static neural network, and a continual-learning framework that updates parameters online. Experiments involving predefined trajectories, randomized motions, and repeated test cycles show that the continual-learning model achieves R² > 0.97 and mean absolute errors below 1 degree, consistently improving upon the baseline models. The results demonstrate the feasibility of directly embedding 3D-printed soft sensors into functional actuators and highlight the role of adaptive learning in supporting long-term soft robotic proprioception.
|
| |
| 15:00-16:30, Paper WeI2I.221 | Add to My Program |
| How IMU Drift Influences Multi-Radar Inertial Odometry for Ground Robots in Subterranean Terrains |
|
| Mukherjee, Moumita | Luleå University of Technology |
| Norén, Magnus | Luleå University of Technology |
| Koval, Anton | Luleå University of Technology |
| Banerjee, Avijit | Luleå University of Technology |
| Nikolakopoulos, George | Luleå University of Technology |
Keywords: Sensor Fusion
Abstract: Reliable radar inertial odometry (RIO) requires mitigating IMU bias drift, a challenge that intensifies in subterranean environments due to extreme temperatures and gravity-induced accelerations. Cost-effective IMUs such as the Pixhawk, when paired with FMCW TI IWR6843AOP EVM radars, suffer from drift-induced degradation compounded by sparse, noisy, and flickering radar returns, making fusion less stable than LiDAR-based odometry. Yet LiDAR fails under smoke, dust, and aerosols, whereas FMCW radars remain compact, lightweight, cost-effective, and robust in these situations. To address these challenges, we propose a two-stage MRIO framework built around an IMU bias estimator for resilient localization and mapping in GPS-denied subterranean environments affected by smoke. In this framework, the radar's ego-velocity estimation is formulated through a least-squares approach and incorporated into an EKF for online IMU bias correction; the corrected IMU accelerations are then fused with heterogeneous measurements from multiple radars and the IMU to refine odometry. The proposed framework further supports radar-only mapping by exploiting the robot’s estimated translational and rotational displacements. In subterranean field trials, MRIO delivers robust localization and mapping, outperforming single-stage EKF-RIO. It maintains accuracy across cost-efficient FMCW radar setups and different IMUs, with resilience on the Pixhawk as well as higher-grade units like the VectorNav. The implementation will be provided as an open-source resource to the community: https://github.com/LTU-RAI/MRIO
|
| |
| 15:00-16:30, Paper WeI2I.222 | Add to My Program |
| Evolutionary Automatic Guidance Scheme for Magnetic Nanoparticles in High-Flow Vascular Models Using a Uniform Magnetic Force Field |
|
| Son, Boyoung | Gwangju Institute of Science and Technology (GIST) |
| Sitti, Metin | Max-Planck Institute for Intelligent Systems |
| Yoon, Jungwon | Gwangju Institute of Science and Technology |
Keywords: Automation at Micro-Nano Scales, Integrated Planning and Control
Abstract: Magnetic nanoparticle (MNP) guidance has attracted considerable attention for biomedical applications, such as targeted drug delivery and minimally invasive therapy. However, precise navigation in vivo remains challenging, particularly in high-flow vascular environments, where drag forces dominate particle dynamics and real-time feedback is impractical. Here, we present an evolutionary automatic guidance scheme for feedforward control of MNP chains in a vascular model. The proposed approach leverages chain alignment with the flow direction to achieve directional migration into specific branches, without relying on swarm cohesion or online feedback. To provide uniform actuation, a Halbach array is designed and optimized to generate a nearly uniform magnetic force field within the target workspace. A physics-based simulator incorporating magnetic, drag, and wall interaction forces is developed to model chain dynamics, and control sequences are optimized using the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), ensuring robustness to variations in chain length and injection conditions. The method is experimentally validated using a four-channel vascular model, demonstrating that feedforward magnetic actuation can reliably guide nanoparticle chains under physiologically relevant high-flow conditions. This study establishes a practical and scalable strategy for nanoparticle navigation, providing a foundation for future biomedical applications in dynamic vascular environments.
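The evolutionary optimization of feedforward control sequences can be sketched as follows, here using the off-the-shelf `cma` package and a toy surrogate in place of the paper's physics-based chain simulator; the dimensionality, the cost averaging, and all names are assumptions for illustration.

```python
import cma  # third-party package: pip install cma
import numpy as np

rng = np.random.default_rng(0)

def targeting_rate(force_angles):
    # Toy surrogate for the physics-based simulator: rewards field-orientation
    # sequences close to a reference profile, with noise mimicking variability
    # in chain length and injection conditions.
    reference = np.linspace(0.0, np.pi / 4, force_angles.size)
    return np.exp(-np.linalg.norm(force_angles - reference)) + 0.05 * rng.standard_normal()

def cost(x):
    # CMA-ES minimizes, so negate the targeting rate; averaging over repeated
    # rollouts promotes robustness to the randomized conditions.
    return -np.mean([targeting_rate(x) for _ in range(8)])

es = cma.CMAEvolutionStrategy(np.zeros(20), 0.5)  # 20 control intervals, sigma0 = 0.5
es.optimize(cost)
best_sequence = es.result.xbest  # optimized feedforward field-orientation sequence
```

Because the optimized sequence is executed open-loop, all robustness must come from this offline averaging, which is why the evolutionary search is run against randomized simulated conditions rather than a single nominal rollout.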
|
| |
| 15:00-16:30, Paper WeI2I.223 | Add to My Program |
| From Swept Contact to Pose: Probe-Aware Registration Via Complementary-Shape Docking |
|
| Chen, Chen | Tsinghua University |
| Li, Yunwen | Beijing Tsingsribe Medical Co., Ltd |
| Xu, Yifan | Tsinghua University |
| Yan, Xiangjie | Tsinghua University |
| Shu, Chang | Department of Priority Oral Care, Peking University School and Hospital of Stomatology |
| Hou, Jianxia | Peking University School of Stomatology |
| Song, Shiji | Tsinghua University |
| Li, Xiang | Tsinghua University |
Keywords: Calibration and Identification, Perception for Grasping and Manipulation, Medical Robots and Systems
Abstract: Accurate registration between a prior model and the real scene is essential for high-precision robotic manipulation, yet optical methods suffer from long calibration chains, line-of-sight constraints, and fabrication errors. We propose a calibration-free alternative that reformulates contact registration as complementary-shape docking between the object and the probe's swept volume, explicitly accounting for probe geometry and leveraging both contact and non-contact evidence. Our solver integrates a global-to-local search via 3D FFT correlation over low-discrepancy SO(3) samples, followed by continuous SE(3) refinement using Lie-algebra updates and analytic contact sensitivities. This pipeline yields efficient exploration and metric-grade convergence without fragile point correspondences. Simulations across free-form meshes achieved sub-0.04 mm and sub-0.4° accuracy, with robustness to pose noise and contact loss. On a tooth-preparation robot, our method attained 0.42 mm and 3.75° accuracy, outperforming an optical-tracker registration while requiring no external sensors. These results demonstrate a practical and precise registration strategy for surgical and industrial robots.
|
| |
| 15:00-16:30, Paper WeI2I.224 | Add to My Program |
| High-Performance Dual-Arm Task and Motion Planning for Tabletop Rearrangement |
|
| Zhang, Duo | Rutgers University |
| Huang, Junshan | Rutgers University |
| Yu, Jingjin | Rutgers University |
Keywords: Task and Motion Planning, Dual Arm Manipulation, Manipulation Planning
Abstract: We propose Synchronous Dual-Arm Rearrangement Planner (SDAR), a task and motion planning (TAMP) framework for tabletop rearrangement, where two robot arms equipped with 2-finger grippers must work together in close proximity to rearrange objects whose start and goal configurations are strongly entangled. To tackle such challenges, SDAR tightly knits together its dependency-driven task planner (SDAR-T) and synchronous dual-arm motion planner (SDAR-M) to intelligently sift through a large number of possible task and motion plans. Specifically, SDAR-T applies a simple yet effective strategy to decompose the global object dependency graph induced by the rearrangement task, producing more optimal dual-arm task plans than solutions derived from optimal single-arm task plans. Leveraging state-of-the-art GPU SIMD-based motion planning tools, SDAR-M employs a layered motion planning strategy to sift through many task plans for the best synchronous dual-arm motion plan while ensuring a high success rate. Comprehensive evaluation demonstrates that SDAR delivers a 100% success rate in solving complex, non-monotone, long-horizon tabletop rearrangement tasks with solution quality far exceeding the previous state-of-the-art. Experiments on two UR-5e arms further confirm that SDAR transfers directly and reliably to robot hardware. Source code and supplementary materials are available at https://github.com/arc-l/dual-arm.
|
| |
| 15:00-16:30, Paper WeI2I.225 | Add to My Program |
| SRPO: Self-Reflection Policy Optimization for Stable and Robust Autonomous Driving |
|
| Wang, Dejin | Northeastern University |
| Ghoreishi, Seyede Fatemeh | Northeastern University |
Keywords: Autonomous Vehicle Navigation, Planning under Uncertainty
Abstract: Autonomous driving demands reinforcement learning (RL) agents that are not only performant, but also stable, sample-efficient, and robust to uncertainty. However, conventional policy optimization methods often suffer from unstable convergence, sensitivity to reward scaling, and limited generalization in safety-critical or out-of-distribution scenarios. We propose Self-Reflection Policy Optimization (SRPO), a principled, model-free framework that introduces policy-level self-evaluation by benchmarking each policy iteration against its own historical performance. This self-reflection yields a reward-shaping signal based on relative improvement, which is redistributed across trajectory steps using a rank-based credit assignment mechanism. This design emphasizes informative steps, eliminates dependence on absolute reward magnitudes, and improves stability in practice. We theoretically show that a bounds-based variant of SRPO preserves policy optimality and convergence. Empirically, we evaluate SRPO on both Highway-env and the high-fidelity CARLA simulator under adversarial perturbations and out-of-distribution driving conditions. SRPO consistently improves training efficiency, robustness, and policy performance compared to baseline techniques. These results position SRPO as a promising and theoretically grounded approach to more reliable decision-making for autonomous driving. The source code is available at https://github.com/dejin-wang/SRPO_anonymous_code
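The rank-based credit assignment described above might look like the following sketch, which converts an episode's relative improvement over the policy's own history into per-step shaping rewards. The choice of ranking signal (here |TD error|) and all names are assumptions, not the authors' code.

```python
import numpy as np

def srpo_shaping(step_signals, episode_return, history_returns):
    """step_signals: per-step informativeness measure, e.g., |TD error|.
    episode_return: return of the current episode.
    history_returns: returns from the policy's own past iterations."""
    # relative improvement over the policy's own history (scale-free,
    # so absolute reward magnitudes do not matter)
    baseline = np.mean(history_returns)
    rel_improve = (episode_return - baseline) / (abs(baseline) + 1e-8)

    # rank steps by informativeness and convert ranks to normalized weights
    ranks = np.argsort(np.argsort(np.abs(step_signals)))  # 0 = least informative
    weights = (ranks + 1) / np.sum(ranks + 1)

    # each step receives a share of the self-reflection signal
    return rel_improve * weights
```

The shaping terms would then be added to the environment rewards before the policy-gradient update, concentrating credit on the steps that drove the improvement.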
|
| |
| 15:00-16:30, Paper WeI2I.226 | Add to My Program |
| VB-Com: Learning Vision-Blind Composite Humanoid Locomotion against Deficient Perception |
|
| Ren, Junli | Hong Kong University |
| Huang, Tao | The Chinese University of Hong Kong |
| Wang, Huayi | Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory |
| Wong, Ziseoi | Zhejiang University |
| Ben, Qingwei | Tsinghua University |
| Long, Junfeng | University of California at Berkeley |
| Yang, Yanchao | Stanford University |
| Pang, Jiangmiao | Shanghai AI Laboratory |
| Luo, Ping | The University of Hong Kong |
|
|
| |
| 15:00-16:30, Paper WeI2I.227 | Add to My Program |
| NeuroLiDAR: Adaptive Frame Rate Depth Sensing Via Neuromorphic Event-LiDAR Fusion |
|
| Rathnayake, Darshana | Singapore Management University |
| Weerakoon Mudiyanselage, Dulanga Kaveesha Weerakoon | Singapore-MIT Alliance for Research & Technology |
| Radhakrishnan, Meera | University of Technology Sydney |
| Misra, Archan | Singapore Management University |
Keywords: Sensor Fusion, Deep Learning for Visual Perception, RGB-D Perception
Abstract: LiDARs are widely used for 3D depth reconstruction, but their performance is often limited by inherent hardware constraints that impose trade-offs between range, spatial resolution, and frame rate. Many LiDAR systems operate at low frame rates (e.g., 5-10 Hz), prioritizing long-range sensing over responsiveness to rapid scene changes. We present NeuroLiDAR, an adaptive depth sensing framework that achieves effective frame rates of up to ≈66 Hz by fusing temporally sparse LiDAR data with temporally dense inputs from neuromorphic event cameras. NeuroLiDAR integrates two components, event-based keyframe detection and event-guided depth extrapolation, to dynamically adjust the sensing rate in response to scene dynamics. To evaluate our approach, we introduce ELiDAR, a dataset spanning outdoor and indoor scenarios, and show that NeuroLiDAR reduces depth reconstruction error by ≈29% in RMSE while achieving adaptive frame rates between 27.8–47.3 Hz. Our code and dataset are available at https://github.com/darshanakgr/neurolidar.
|
| |
| 15:00-16:30, Paper WeI2I.228 | Add to My Program |
| Contrastive Learning on 3D Point Clouds for Robotic Geometric Defect Detection |
|
| Tarvo, Alexander | University of Washington, MACS Lab |
| Wan, Yusen | University of Washington |
| Chen, Xu | University of Washington |
Keywords: Computer Vision for Manufacturing, Industrial Robots, Factory Automation
Abstract: Robotic quality inspection is emerging as a key enabler in intelligent manufacturing, allowing robots to transcend human limitations in endurance, consistency, and access to complex structures. By detecting subtle defects with speed and precision, robotic inspection enhances efficiency while elevating production quality. While most existing approaches emphasize 2D image-based surface defect detection, they often overlook geometric defects, which are more prevalent and challenging in industrial inspection. To overcome this gap, we formulate geometric defect detection as anomaly detection in 3D point clouds and propose a novel framework that integrates contrastive learning with spatially aware comparisons of local geometries. Specifically, we partition point cloud surfaces into patches and employ contrastive learning to train a neural network-based feature extractor capable of capturing rich geometric representations. An anomaly detection algorithm is then introduced to identify defects by comparing patch-level features in a spatially consistent manner. Evaluated on the recent Real3D-AD benchmark, our method achieves a mean area under the ROC curve of 0.901, establishing a new state of the art and demonstrating the potential of robotic inspection systems to move beyond human limitations in detecting subtle geometric anomalies.
|
| |
| 15:00-16:30, Paper WeI2I.229 | Add to My Program |
| LARVO-RRT*: Learning Guided Adaptive Reconfiguration with Vector Field Oriented RRT* |
|
| Rishan Sachinthana, Wijenayaka Kankanamge | Singapore University of Technology and Design |
| Samarakoon Mudiyanselage, Bhagya Prasangi Samarakoon | Singapore University of Technology and Design |
| Muthugala Arachchige, Viraj Jagathpriya Muthugala | Singapore University of Technology and Design |
| Elara, Mohan Rajesh | Singapore University of Technology and Design |
Keywords: Motion and Path Planning, Task and Motion Planning, Deep Learning Methods
Abstract: Path planning is a challenging problem in robotics, and numerous algorithms have been developed to address it. Sampling-based algorithms, such as the Rapidly Exploring Random Tree (RRT) and its variants, are renowned for their ability to explore the search space efficiently. However, these algorithms consume a significant amount of computation time and memory to derive the optimal path. Furthermore, in the presence of geometric features such as narrow passages, fixed-shape robots often fail to navigate because of their structural constraints. This limitation motivates the use of reconfigurable robots, which are capable of changing their shape to access such confined areas. This paper proposes a novel path planning approach for a reconfigurable robot, based on a machine learning model, to address the aforementioned limitations. The proposed method employs a Convolutional Neural Network (CNN) model that predicts a sample distribution, a flow field, and robot configurations, which are combined with RRT*, termed Learning Guided Adaptive Reconfiguration with Vector field Oriented RRT* (LARVO-RRT*). The model has been trained using optimal sample distributions and flow fields generated from the optimal paths of a customized A* algorithm. Experimental results demonstrate that the proposed method surpasses existing learning- and non-learning-based path planning algorithms in terms of time cost, iteration count, and path quality. Furthermore, the algorithm outperforms existing path planners even without considering reconfiguration.
|
| |
| 15:00-16:30, Paper WeI2I.230 | Add to My Program |
| Subsecond 3D Mesh Generation for Robot Manipulation |
|
| Wang, Qian | Yale University |
| Abdellall, Omar | Yale University |
| Gao, Tony | Yale University |
| Sun, Xiatao | Yale University |
| Rakita, Daniel | Yale University |
Keywords: Perception for Grasping and Manipulation
Abstract: 3D meshes are a fundamental representation widely used in computer science and engineering. In robotics, they are particularly valuable because they capture objects in a form that aligns directly with how robots interact with the physical world, enabling core capabilities such as predicting stable grasps, detecting collisions, and simulating dynamics. Although automatic 3D mesh generation methods have shown promising progress in recent years, potentially offering a path toward real-time robot perception, two critical challenges remain. First, generating high-fidelity meshes is prohibitively slow for real-time use, often requiring tens of seconds per object. Second, mesh generation by itself is insufficient. In robotics, a mesh must be contextually grounded, i.e., correctly segmented from the scene and registered with the proper scale and pose. Additionally, unless these contextual grounding steps remain efficient, they simply introduce new bottlenecks. In this work, we introduce an end-to-end system that addresses these challenges, producing a high-quality, contextually grounded 3D mesh from a single RGB-D image in under one second. Our contribution is a system-level design that integrates open-vocabulary object segmentation, accelerated diffusion-based mesh generation, and robust point cloud registration, each optimized for both speed and accuracy. We demonstrate its effectiveness in a real-world manipulation task, showing that it enables meshes to be used as a practical, on-demand representation for robotics perception and planning. Open-source code and videos are located at the paper website: https://apollo-lab-yale.github.io/26-ICRA-subsecond-mesh-gen-website/
|
| |
| 15:00-16:30, Paper WeI2I.231 | Add to My Program |
| Learning Therapist Policy from Therapist-Exoskeleton-Patient Interaction |
|
| Snyder, Grayson | Northwestern University |
| Vianello, Lorenzo | Shirley Ryan Ability Lab |
| Hargrove, Levi | Rehabilitation Institute of Chicago |
| Elwin, Matthew | Northwestern University |
| Pons, Jose L. | Shirley Ryan AbilityLab |
Keywords: Rehabilitation Robotics, Prosthetics and Exoskeletons, Physical Human-Robot Interaction
Abstract: Post-stroke rehabilitation is often necessary for patients to regain proper walking gait. However, the typical therapy process can be exhausting and physically demanding for therapists, potentially reducing therapy intensity, duration, and consistency over time. We propose a Patient-Therapist Force Field (PTFF) to visualize therapist responses to patient kinematics and a Synthetic Therapist (ST) machine learning model to support the therapist in dyadic robot-mediated physical interaction therapy. The former encodes patient and therapist stride kinematics into a shared low-dimensional latent manifold using a Variational Autoencoder (VAE) and models their interaction through a Gaussian Mixture Model (GMM), which learns a probabilistic vector field mapping patient latent states to therapist responses. This representation visualizes patient–therapist interaction dynamics to inform therapy strategies and robot controller design. The latter is implemented as a Long Short-Term Memory (LSTM) network trained on patient–therapist interaction data to predict therapist-applied joint torques from patient kinematics. Trained and validated using leave-one-out cross-validation across eight post-stroke patients, the model was integrated into a ROS-based exoskeleton controller to generate real-time torque assistance based on predicted therapist responses. Offline results and preliminary testing indicate their potential as an alternative approach to post-stroke exoskeleton therapy. The PTFF provides insight into the therapist’s actions, while the ST frees the human therapist from the exoskeleton, allowing them to continuously monitor the patient’s nuanced condition.
|
| |
| 15:00-16:30, Paper WeI2I.232 | Add to My Program |
| GIMloco: Generic Internal Model-Based Locomotion for Quadruped Robots |
|
| Yan, Zhonghuai | Beihang University |
| Quan, Quan | Beihang University |
Keywords: Legged Robots, Deep Learning Methods
Abstract: A central challenge in robust quadruped locomotion that relies solely on proprioceptive information is how to effectively encode the history of observations. Current methods such as regression struggle with high-dimensional multi-time-step histories, while Temporal Convolutional Networks (TCNs) incur computational overhead; we propose a more efficient and theoretically grounded alternative. Inspired by the Generic Internal Model (GIM) from control theory, we introduce GIMloco, which maps the history of proprioceptive observations into a compact and stable internal model space through a predesigned first-order integral system with stability and orthogonality guarantees. This encoded representation drives three downstream tasks: state estimation, latent variable learning, and control policy learning. Our experiments show that GIMloco outperforms strong baselines in velocity tracking, system overshoot, and response speed. Furthermore, it can navigate more complex terrains while also demonstrating better training stability across random seeds. Crucially, our method reduces training time by two orders of magnitude compared to TCN-based approaches. Our work presents GIMloco as a robust and computationally efficient framework for locomotion based on proprioceptive information.
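To make the internal-model encoding concrete, here is a minimal sketch of a predesigned bank of stable first-order filters that compresses an arbitrarily long proprioceptive history into a fixed-size code. The pole values and class name are illustrative assumptions; the paper's construction additionally carries orthogonality guarantees that this sketch omits.

```python
import numpy as np

class FirstOrderHistoryEncoder:
    """Bank of stable first-order filters as a compact history encoder."""

    def __init__(self, obs_dim, poles=(0.9, 0.95, 0.99)):
        self.poles = np.array(poles)              # |a| < 1 guarantees stability
        self.state = np.zeros((len(poles), obs_dim))

    def step(self, obs):
        # x_k[i] = a_i * x_{k-1}[i] + (1 - a_i) * obs_k
        a = self.poles[:, None]
        self.state = a * self.state + (1.0 - a) * obs[None, :]
        return self.state.ravel()  # fixed-size code of the entire history
```

Unlike a TCN, the per-step cost here is a handful of multiply-adds regardless of history length, which is the efficiency argument behind an internal-model encoding.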
|
| |
| 15:00-16:30, Paper WeI2I.233 | Add to My Program |
| How Well Do Diffusion Policies Learn Kinematic Constraint Manifolds? |
|
| Foland, Lexi | Massachusetts Institute of Technology |
| Cohn, Thomas | Massachusetts Institute of Technology |
| Wei, Adam | Massachusetts Institute of Technology |
| Pfaff, Nicholas Ezra | Massachusetts Institute of Technology |
| Chen, Boyuan | Massachusetts Institute of Technology |
| Tedrake, Russ | Massachusetts Institute of Technology |
Keywords: Bimanual Manipulation, Imitation Learning
Abstract: Diffusion policies have shown impressive results in robot imitation learning, even for tasks that require satisfying kinematic equality constraints. However, task performance alone is not a reliable indicator of a policy's ability to precisely learn constraints in the training data. To investigate, we analyze how well diffusion policies discover these manifolds with a case study on a bimanual pick-and-place task that encourages fulfillment of a kinematic constraint for success. We study how three factors affect trained policies: dataset size, dataset quality, and manifold curvature. Our experiments show diffusion policies learn a coarse approximation of the constraint manifold, with learning negatively affected by decreases in both dataset size and quality. However, manifold curvature showed inconclusive correlations with constraint satisfaction and task success. A hardware evaluation verifies the applicability of our results in the real world. Project website with additional results and visuals: https://diffusion-learns-kinematic.github.io/.
|
| |
| 15:00-16:30, Paper WeI2I.234 | Add to My Program |
| LIDIA: Localizing in the Dark with Illumination-Awareness Toward Perception-Aware Planning |
|
| Velentzas, Iason Georgios | Georgia Institute of Technology |
| Tomita, Kento | Mitsubishi Electric Research Laboratories |
Keywords: Perception-Action Coupling, Localization, Computer Vision for Automation
Abstract: Accurate localization is a fundamental challenge in robotic autonomy, with applications ranging from autonomous driving to space proximity operations. Visual localization is a viable choice in GPS-denied environments, such as subterranean, indoor, urban, or space environments; however, its performance degrades under commonly encountered conditions such as low light or varying illumination. This paper introduces LIDIA, an illumination-aware model of localization quality for perception-aware planning. LIDIA efficiently integrates light-source direction into the planning framework, enabling the prediction of visually informative regions in the map under varying lighting. Unlike prior geometric approaches, LIDIA jointly exploits geometric and photometric information without requiring computationally expensive real-time rendering, thereby preserving online applicability. Our results demonstrate that LIDIA consistently outperforms existing geometric methods such as FIF in predicting the information gain of candidate camera poses and in planning trajectories that achieve higher localization accuracy. To the best of our knowledge, this is the first approach to unify geometric and photometric reasoning in an efficient, active localization system, paving the way for robust autonomy in illumination-constrained environments.
|
| |
| 15:00-16:30, Paper WeI2I.235 | Add to My Program |
| Graph-Of-Constraints Model Predictive Control for Reactive Multi-Agent Task and Motion Planning |
|
| Manganaris, Anastasios | Purdue University |
| Lu, Jeremy | Purdue University |
| Qureshi, Ahmed H. | Purdue University |
| Jagannathan, Suresh | Purdue University |
Keywords: Reactive and Sensor-Based Planning, Task and Motion Planning, Multi-Robot Systems
Abstract: Sequences of interdependent geometric constraints are central to many multi-agent Task and Motion Planning (TAMP) problems. However, existing methods for handling such constraint sequences struggle with partially ordered tasks and dynamic agent assignments. They typically assume static assignments and cannot adapt when disturbances alter task allocations. To overcome these limitations, we introduce Graph-of-Constraints Model Predictive Control (GoC-MPC), a generalized sequence-of-constraints framework integrated with MPC. GoC-MPC naturally supports partially ordered tasks, dynamic agent coordination, and disturbance recovery. By defining constraints over tracked 3D keypoints, our method robustly solves diverse multi-agent manipulation tasks—coordinating agents and adapting online from visual observations alone, without relying on training data or environment models. Experiments demonstrate that GoC-MPC achieves higher success rates, significantly faster TAMP computation, and shorter overall paths compared to recent baselines, establishing it as an efficient and robust solution for multi-agent manipulation under real-world disturbances. Our supplementary video and code can be found at https://sites.google.com/view/goc-mpc/home.
|
| |
| 15:00-16:30, Paper WeI2I.236 | Add to My Program |
| Few-Shot Neural Differentiable Simulator: Real-To-Sim Rigid-Contact Modeling |
|
| Huang, Zhenhao | National University of Singapore |
| Luo, Siyuan | National University of Singapore |
| Zhou, Bingyang | National University of Singapore |
| Zeng, Ziqiu | National University of Singapore |
| Shi, Fan | National University of Singapore |
Keywords: Contact Modeling
Abstract: Accurate physics simulation is essential for robotic learning and control, yet analytical simulators often fail to capture complex contact dynamics, while learning-based simulators typically require large amounts of costly real-world data. To bridge this gap, we propose a few-shot real-to-sim approach that combines the physical consistency of analytical formulations with the representational capacity of graph neural network (GNN)-based models. Using only a small amount of real-world data, our method calibrates analytical simulators to generate large-scale synthetic datasets that capture diverse contact interactions. On this foundation, we introduce a mesh-based GNN that implicitly models rigid-body forward dynamics and derive surrogate gradients for collision detection, achieving full differentiability. Experimental results demonstrate that our approach enables learning-based simulators to outperform differentiable baselines in replicating real-world trajectories. In addition, the differentiable design supports gradient-based optimization, which we validate through simulation-based policy learning in multi-object interaction scenarios. Extensive experiments show that our framework not only improves simulation fidelity with minimal supervision but also increases the efficiency of policy learning. Taken together, these findings suggest that differentiable simulation with few-shot real-world grounding provides a powerful direction for advancing future robotic manipulation and control.
|
| |
| 15:00-16:30, Paper WeI2I.237 | Add to My Program |
| DIAL-GS: Dynamic Instance Aware Reconstruction for Label-Free Street Scenes with 4D Gaussian Splatting |
|
| Su, Chenpeng | Shanghai Jiao Tong University |
| Wu, Wenhua | Shanghai Jiao Tong University |
| Peng, Chensheng | University of California, Berkeley |
| Deng, Tianchen | Shanghai Jiao Tong University |
| Liu, Zhe | Shanghai Jiao Tong University |
| Wang, Hesheng | Shanghai Jiao Tong University |
Keywords: Mapping, Computer Vision for Automation, Computer Vision for Transportation
Abstract: Urban scene reconstruction is critical for autonomous driving, enabling structured 3D representations for data synthesis and closed-loop testing. Supervised approaches rely on costly human annotations and lack scalability, while current self-supervised methods often confuse static and dynamic elements and fail to distinguish individual dynamic objects, limiting fine-grained editing. We propose DIAL-GS, a novel dynamic instance-aware reconstruction method for label-free street scenes with 4D Gaussian Splatting. We first accurately identify dynamic instances by exploiting appearance–position inconsistency between warped rendering and actual observation. Guided by instance-level dynamic perception, we employ instance-aware 4D Gaussians as the unified volumetric representation, realizing dynamic-adaptive and instance-aware reconstruction. Furthermore, we introduce a reciprocal mechanism through which identity and dynamics reinforce each other, enhancing both integrity and consistency. Experiments on urban driving scenarios show that DIAL-GS surpasses existing self-supervised baselines in reconstruction quality and instance-level editing, offering a concise yet powerful solution for urban scene modeling.
|
| |
| 15:00-16:30, Paper WeI2I.238 | Add to My Program |
| TransforMARS: Fault-Tolerant Self-Reconfiguration for Arbitrary-Shaped Modular Aerial Robot Systems |
|
| Huang, Rui | National University of Singapore |
| Gao, Zhiyu | National University of Singapore |
| Tang, Siyu | National University of Singapore |
| Zhang, Jialin | National University of Singapore |
| He, Lei | National University of Singapore |
| Zhang, Ziqian | National University of Singapore |
| Zhao, Lin | National University of Singapore |
Keywords: Cellular and Modular Robots, Failure Detection and Recovery, Aerial Systems: Applications
Abstract: Modular Aerial Robot Systems (MARS) consist of multiple drone modules that are physically bound together to form a single structure for flight. Exploiting structural redundancy, MARS can be reconfigured into different formations to mitigate unit or rotor failures and maintain stable flight. Prior work on MARS self-reconfiguration has solely focused on maximizing controllability margins to tolerate a single rotor or unit fault for rectangular-shaped MARS. We propose TransforMARS, a general fault-tolerant reconfiguration framework that transforms arbitrarily shaped MARS under multiple rotor and unit faults while ensuring continuous in-air stability. Specifically, we develop algorithms to first identify and construct minimum controllable assemblies containing faulty units. We then plan feasible disassembly-assembly sequences to transport MARS units or subassemblies to form the target configuration. Our approach enables more flexible and practical reconfiguration. We validate TransforMARS in challenging arbitrarily shaped MARS configurations, demonstrating substantial improvements over prior works in both the diversity of configurations handled and the number of faults tolerated. The videos and source code of this work are available at https://github.com/RuiHuangNUS/TransforMARS.
|
| |
| 15:00-16:30, Paper WeI2I.239 | Add to My Program |
| ST-HNet: A CNN-LSM Hybrid Architecture for Spatio-Temporal Feature Learning in Event-Based Visual Place Recognition |
|
| Xiao, Xun | National University of Defence Technology |
| Guo, Shasha | National University of Defense Technology |
| Tie, Junbo | National University of Defense Technology |
| Zhao, Jingyue | Defense Innovation Institute, AMS |
| Wang, Ziqi | National University of Defence Technology |
| Li, Yuan | National University of Defense Technology |
| Yuan, Jingzhuo | National University of Defence Technology |
| Dou, Qiang | Phytium Technology Co., Ltd |
| Wang, Lei | Defense Innovation Institute, AMS |
Keywords: Deep Learning for Visual Perception, Localization, Visual Learning
Abstract: Visual Place Recognition (VPR) based on Dynamic Vision Sensors (DVSs) has gained attention due to their high temporal resolution and robustness under challenging lighting conditions. However, the sparse and asynchronous event stream output of DVS introduces unique challenges for effective VPR. In this paper, we propose ST-HNet, a novel framework for VPR that introduces improvements in event representation, spatio-temporal feature extraction, and loss design. Specifically, we introduce a compact and efficient event representation called Bipolar Binary Voxel Grid (BBVG). Then, we propose a hybrid feature extractor that combines a Convolutional Neural Network (CNN) for spatial encoding and a Liquid State Machine (LSM) for temporal aggregation. We refer to this combination as a CNN-LSM hybrid architecture. Moreover, we introduce a soft-margin triplet loss to better accommodate the gradual transitions between nearby locations in the event-based VPR task. Extensive experiments conducted on the Brisbane-Event-VPR and DDD20 datasets demonstrate that our method outperforms state-of-the-art approaches, achieving improvements of 11% and 23% in Recall@1 performance, respectively.
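The soft-margin triplet loss the abstract mentions replaces a hard margin with a smooth softplus penalty, which suits the gradual transitions between nearby places in event-based VPR. A minimal sketch of the standard formulation (the authors' exact variant may differ):

```python
import numpy as np

def soft_margin_triplet(anchor, pos, neg):
    """Soft-margin triplet loss: softplus of the positive-negative gap.

    Encourages d(anchor, pos) < d(anchor, neg) without a hard margin.
    """
    dp = np.linalg.norm(anchor - pos)   # distance to matching place
    dn = np.linalg.norm(anchor - neg)   # distance to non-matching place
    return np.log1p(np.exp(dp - dn))    # log(1 + exp(dp - dn))
```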
|
| |
| 15:00-16:30, Paper WeI2I.240 | Add to My Program |
| C-Free-Uniform: A Map-Conditioned Trajectory Sampler for Model Predictive Path Integral Control |
|
| Cao, Yukang | University of Minnesota |
| Mahesh, Rahul Moorthy | University of Minnesota - Twin Cities |
| Poyrazoglu, Oguzhan Goktug | University of Minnesota |
| Isler, Volkan | The University of Texas at Austin |
Keywords: Motion and Path Planning, Integrated Planning and Learning, Constrained Motion Planning
Abstract: Trajectory sampling is a key component of sampling-based control mechanisms. Trajectory samplers rely on control input samplers, which generate control inputs u from a distribution p(u | x) where x is the current state. We introduce the notion of Free Configuration Space Uniformity (C-Free-Uniform for short) which has two key features: (i) the generated control input can be used to uniformly sample the free configuration space, and (ii) in contrast to previously introduced trajectory sampling mechanisms where the distribution p(u | x) is independent of the environment, C-Free-Uniform is explicitly conditioned on the current local map. Next, we integrate this sampler into a new Model Predictive Path Integral (MPPI) Controller, CFU-MPPI. Experiments show that CFU-MPPI outperforms existing methods in terms of success rate in challenging navigation tasks in cluttered polygonal environments while requiring a much smaller sampling budget. Code: https://github.com/ogpoyrazoglu/cuniform_sampling.
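For context, the MPPI update that CFU-MPPI builds on weights sampled control sequences by exponentiated cost. A minimal sketch, with `sample_controls` and `rollout_cost` as placeholders (not the authors' API); the paper's contribution lies in making the sampler depend on the local map so that rollouts cover the free configuration space uniformly:

```python
import numpy as np

def mppi_update(x0, rollout_cost, sample_controls, n_samples=256, lam=1.0):
    """One MPPI iteration: sample rollouts, softmin-weight, average."""
    U = sample_controls(x0, n_samples)        # (n_samples, horizon, u_dim)
    costs = np.array([rollout_cost(x0, u) for u in U])
    w = np.exp(-(costs - costs.min()) / lam)  # temperature-weighted
    w /= w.sum()
    return np.tensordot(w, U, axes=1)         # weighted-average control plan
```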
|
| |
| 15:00-16:30, Paper WeI2I.241 | Add to My Program |
| ExpReS-VLA: Specializing Vision-Language-Action Models through Experience Replay and Retrieval |
|
| Syed, Shahram Najam | Carnegie Mellon University |
| Ahuja, Yatharth | Carnegie Mellon University |
| Jakobsson, Arthur | Carnegie Mellon University |
| Ichnowski, Jeffrey | Carnegie Mellon University |
Keywords: Continual Learning, Machine Learning for Robot Control, Incremental Learning
Abstract: Vision-Language-Action (VLA) models like OpenVLA demonstrate impressive zero-shot generalization across robotic manipulation tasks but struggle to adapt to specific deployment environments where consistent high performance on a limited set of tasks is more valuable than broad generalization. We present EXPerience replayed, REtrieval augmented, Specialized VLA (ExpReS-VLA), a method that enables rapid on-device adaptation of pre-trained VLAs to target domains while preventing catastrophic forgetting through compressed experience replay and retrieval-augmented generation. Our approach maintains a memory-efficient buffer by storing extracted embeddings from OpenVLA's frozen vision backbone, reducing storage requirements by 97% compared to raw image-action pairs. During deployment, ExpReS-VLA retrieves the k most similar past experiences using cosine similarity to augment training batches, while a prioritized experience replay buffer preserves recently successful trajectories. To leverage failed attempts, we introduce Thresholded Hybrid Contrastive Loss (THCL), enabling the model to learn from both successful and unsuccessful demonstrations collected during deployment. Experiments on the LIBERO simulation benchmark show that ExpReS-VLA improves success rates from 82.6% to 93.1% on spatial reasoning tasks and from 61% to 72.3% on long-horizon tasks compared to base OpenVLA, with consistent gains across VLA architectures including π0 (+3.2 points) and OpenVLA-OFT (+1.7 points). Physical robot experiments across five manipulation tasks demonstrate that our approach achieves 98% success on both in-distribution and out-of-distribution tasks (with unseen backgrounds and objects), improving from 84.7% and 32% respectively for naive fine-tuning. ExpReS-VLA accomplishes this adaptation in 31 seconds using only 12 demonstrations on a single RTX 5090, making it practical for real-world deployment where robots must quickly specialize to their specific operating environment.
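The retrieval step described above is plain cosine-similarity search over stored backbone embeddings. A minimal sketch under that reading (function and argument names are illustrative):

```python
import numpy as np

def retrieve_topk(query_emb, buffer_embs, k=8):
    """Indices of the k stored experiences most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    B = buffer_embs / np.linalg.norm(buffer_embs, axis=1, keepdims=True)
    return np.argsort(-(B @ q))[:k]           # descending cosine similarity
```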
|
| |
| 15:00-16:30, Paper WeI2I.242 | Add to My Program |
| SwarmNav: Swarm Robotics Navigation in Dynamic and Dense Environments Via Reinforcement Learning |
|
| Li, Shengbo | Zhejiang University |
| Lv, Chuanjie | Zhejiang University |
| Yuan, Xiangqian | Zhejiang University |
| Xu, Liming | Zhejiang University |
| Liu, Xinyang | Zhejiang University |
| Zhu, Zongzhi | Zhejiang Guoli Xin'an Technology Co., Ltd |
| Xu, Gang | Zhejiang University |
| Liu, Yong | Zhejiang University |
Keywords: Swarm Robotics, Collision Avoidance, Reinforcement Learning
Abstract: Collision avoidance and navigation in dynamic and dense environments remain highly challenging for swarm robotics. To address this, we propose SwarmNav, a novel goal-region amplification navigation policy that leverages LiDAR-based position data to generate velocity commands guiding robots toward their goals while actively avoiding obstacles. SwarmNav is trained within a deep reinforcement learning actor-critic framework. In this framework, the reward function integrates a goal-region amplification term with the reciprocal velocity obstacles formulation, enabling goal-directed navigation under dynamic obstacle uncertainty. Extensive simulations demonstrate that SwarmNav significantly outperforms state-of-the-art approaches, including both reinforcement learning-based and traditional velocity obstacle-based methods, in terms of success rate and computational efficiency. Real-world experiments across diverse scenarios further confirm its effectiveness in dynamic and dense environments.
|
| |
| 15:00-16:30, Paper WeI2I.243 | Add to My Program |
| Simultaneous Deep Model-Based Reinforcement Learning and State Inference under Partial Observability |
|
| Cong, William | University of Wisconsin -- Madison |
| Hanna, Josiah | University of Wisconsin -- Madison |
Keywords: Reinforcement Learning, Machine Learning for Robot Control, Deep Learning Methods
Abstract: Model-based reinforcement learning (MBRL) is a promising approach to enabling robots to learn directly from a limited number of real-world interactions. However, MBRL is notoriously difficult in settings without full state observability because algorithms must simultaneously infer state from incomplete observations and use these inferences to learn environment dynamics. Toward the use of MBRL for autonomous robots, we introduce EMBRL, an expectation-maximization framework that combines classical Bayesian state estimation with deep MBRL to jointly infer states and learn neural network state transition models. This framework takes advantage of the rich theory and practice of state estimation from the field of robotics, while enabling behavior learning without a priori known robot dynamics. Though conceptually straightforward, our instantiation of this framework for deep MBRL reveals several key challenges when using a learned transition model both for state inference and policy learning. We introduce a practical implementation of EMBRL using both particle and extended Kalman filters and smoothers and discuss key design choices necessary for effective implementation. Finally, we evaluate different instantiations of the EMBRL framework on both simulated and real-robot tasks and show that our methods learn higher-performing policies compared to strong MBRL baselines using recurrent neural networks.
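The expectation-maximization structure of EMBRL alternates state inference with model fitting. A schematic sketch, where `smoother` (e.g., a particle or extended Kalman smoother) and `fit_dynamics` are placeholders for components the abstract names, not the authors' implementation:

```python
def embrl_outer_loop(observations, actions, model, smoother, fit_dynamics,
                     iters=10):
    """Alternate Bayesian state inference (E-step) and dynamics
    learning (M-step), as in a classical EM scheme."""
    for _ in range(iters):
        states = smoother(model, observations, actions)  # E-step: infer states
        model = fit_dynamics(states, actions)            # M-step: refit model
    return model
```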
|
| |
| 15:00-16:30, Paper WeI2I.244 | Add to My Program |
| BlurPoint: Efficient Motion Blur Aware Student-Teacher Local Feature Learning |
|
| Wang, Wenting | The Chinese University of Hong Kong |
| Zhao, Zhenjun | University of Zaragoza |
| Guo, Jiaxin | The Chinese University of Hong Kong |
| Liu, Yunhui | Chinese University of Hong Kong |
| Wang, Charlie C.L. | The University of Manchester |
| Yam, Yeung | The Chinese University of Hong Kong |
Keywords: Mapping, Localization, Visual Learning
Abstract: Local feature detection and description serve as the foundation for many 3D vision tasks. However, most existing algorithms rely on sharp images, resulting in degraded performance when motion blur occurs due to long exposure. To tackle this challenge, we propose an effective end-to-end model that jointly learns feature detection and description from blurred images in a self-supervised manner, without requiring any additional labeled data. Rather than simply mixing sharp and blurred samples during training, we design a student–teacher framework to explicitly transfer knowledge from sharp to blurred domains. The teacher model extracts local features from sharp images and enforces photometric consistency in feature space, which is then distilled to the student model trained on blurred inputs. To facilitate this knowledge transfer, we introduce two tailored loss functions, feature divergence loss and triplet knowledge distillation loss, both aimed at aligning feature representations under motion blur. Extensive experiments on homography estimation, relative pose estimation, and visual localization demonstrate that our method achieves state-of-the-art performance on blurred images, while maintaining competitive accuracy on sharp images.
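The abstract names a feature divergence loss without giving its form; a common instantiation of such an alignment term is a pixelwise MSE between teacher (sharp-input) and student (blurred-input) feature maps, sketched here as an assumption rather than the paper's definition:

```python
import numpy as np

def feature_divergence_loss(student_feats, teacher_feats):
    """Mean squared divergence between student (blurred-input) and
    teacher (sharp-input) feature maps at corresponding locations."""
    return np.mean((student_feats - teacher_feats) ** 2)
```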
|
| |
| 15:00-16:30, Paper WeI2I.245 | Add to My Program |
| FiLM-Nav: Efficient and Generalizable Navigation Via VLM Fine-Tuning |
|
| Yokoyama, Naoki | Georgia Institute of Technology |
| Ha, Sehoon | Georgia Institute of Technology |
Keywords: Vision-Based Navigation, Semantic Scene Understanding, Transfer Learning
Abstract: Enabling robotic assistants to navigate complex environments and locate objects described in free-form language is a critical capability for real-world deployment. While foundation models, particularly Vision-Language Models (VLMs), offer powerful semantic understanding, effectively adapting their web-scale knowledge for embodied decision-making remains a key challenge. We present FiLM-Nav (Fine-tuned Language Model for Navigation), an approach that directly fine-tunes a pre-trained VLM as the navigation policy. In contrast to methods that use foundation models primarily in a zero-shot manner or for map annotation, FiLM-Nav learns to select the next best exploration frontier by conditioning directly on raw visual trajectory history and the navigation goal. Leveraging targeted simulated embodied experience allows the VLM to ground its powerful pre-trained representations in the specific dynamics and visual patterns relevant to goal-driven navigation. Critically, fine-tuning on a diverse data mixture combining ObjectNav, OVON, ImageNav, and an auxiliary spatial reasoning task proves essential for achieving robustness and broad generalization. FiLM-Nav sets a new state-of-the-art in both SPL and success rate on HM3D ObjectNav among open-vocabulary methods, and sets a state-of-the-art SPL on the challenging HM3D-OVON benchmark, demonstrating strong generalization to unseen object categories. Our work validates that directly fine-tuning VLMs on diverse simulated embodied data is a highly effective pathway towards generalizable and efficient semantic navigation capabilities.
|
| |
| 15:00-16:30, Paper WeI2I.246 | Add to My Program |
| A Magnetic-Wheeled Inspection Robot for Interior Corner Traversal |
|
| Nadan, Paul | Carnegie Mellon University |
| Kumar, Jai | Carnegie Mellon University |
| Klein, Nate | Carnegie Mellon University |
| Wallace, Jesse | Carnegie Mellon University |
| Wang, Hairong | Carnegie Mellon University |
| Hakim, Alexander | Carnegie Mellon University |
| Dassey, Adam | Chevron Corporation |
| Hoelen, Thomas | Chevron Corporation |
| Lowry, Gregory V. | Carnegie Mellon University |
| Johnson, Aaron M. | Carnegie Mellon University |
Keywords: Climbing Robots, Environment Monitoring and Management, Wheeled Robots
Abstract: Automated inspection of steel structures using magnetic climbing robots can reduce costs and improve safety, but many such structures feature interior corners that are challenging for wheeled or tracked robots to traverse. We present Sally, the first magnetic-wheeled robot to use X-ray fluorescence for steel structure inspection, capable of overcoming all interior corner transition types, traversing small obstacles, and maneuvering in tight spaces. By re-purposing its steering and sensor deployment mechanisms, the robot is able to transition back and forth between a steel wall and an adjacent steel ceiling, steel wall, or any floor. We analyze the feasibility of these interior corner transitions and validate the results through experimental demonstrations with Sally. We also demonstrate line scanning, a continuous surface measurement technique enabled by the wheeled design that estimates the average element concentrations along a line, and show that it provides greater accuracy and efficiency in both simulation and robot trials compared to the traditional grid point measurement method. Finally, we discuss lessons learned from a field test of Sally at an industrial site.
|
| |
| 15:00-16:30, Paper WeI2I.247 | Add to My Program |
| Amplifying Force-Feedback Cues for Enhancing Dexterous Skill Transfer in Virtual Environments |
|
| Kim, Eunchae | Hanyang University |
| Lim, Jaewan | Hanyang University |
| Park, Jiyoung | Hanyang University |
| Yoo, Yongjae | Hanyang University |
Keywords: Haptics and Haptic Interfaces, Physical Human-Robot Interaction, Education Robotics
Abstract: How to teach sensorimotor skills in haptic virtual environments is a classic research question and has been investigated with different target skills and strategies. In this study, we investigated how to assist users by modulating the haptic sensations in the learning environment, presented via a force-feedback haptic device. We developed a haptic amplification method and evaluated its effectiveness on skill training with the target skill of needle felting. To this end, we initially collected force profile data captured from an expert performing the task and amplified the force magnitude so that it could be clearly felt. Then, the augmented haptic sensations were rendered in the virtual learning environment. We assessed the usefulness of our method by conducting a user study with 24 participants performing virtual needle felting tasks involving many micro-movements. As a result, amplified force-profile feedback significantly improved the novice participants' learning performance. Based on the results, we discuss how to provide adequate haptic feedback for learning tasks, especially in fields requiring precise dexterous or tool movements.
|
| |
| 15:00-16:30, Paper WeI2I.248 | Add to My Program |
| MICA: Multi-Agent Industrial Coordination Assistant |
|
| Wen, Di | Karlsruhe Institute of Technology |
| Peng, Kunyu | Karlsruhe Institute of Technology |
| Zheng, Junwei | Karlsruhe Institute of Technology |
| Chen, Yufan | Karlsruhe Institute of Technology |
| Shi, Yitian | Karlsruhe Institute of Technology |
| Wei, Jiale | Karlsruhe Institute of Technology |
| Liu, Ruiping | Karlsruhe Institute of Technology |
| Yang, Kailun | Hunan University |
| Stiefelhagen, Rainer | Karlsruhe Institute of Technology |
Keywords: Assembly, Industrial Robots, Wearable Robotics
Abstract: Industrial workflows demand adaptive and trustworthy assistance that can operate under limited computing, connectivity, and strict privacy constraints. In this work, we present MICA (Multi-Agent Industrial Coordination Assistant), a perception-grounded and speech-interactive system that delivers real-time guidance for assembly, troubleshooting, part queries, and maintenance. MICA coordinates five role-specialized language agents, audited by a safety checker, to ensure accurate and compliant support. To achieve robust step understanding, we introduce Adaptive Step Fusion (ASF), which dynamically blends expert reasoning with online adaptation from natural speech feedback. Furthermore, we establish a new multi-agent coordination benchmark across representative task categories and propose evaluation metrics tailored to industrial assistance, enabling systematic comparison of different coordination topologies. Our experiments demonstrate that MICA consistently improves task success, reliability, and responsiveness over baseline structures, while remaining deployable on practical offline hardware. Together, these contributions highlight MICA as a step toward deployable, privacy-preserving multi-agent assistants for dynamic factory environments. The source code will be made publicly available at https://github.com/Kratos-Wen/MICA.
|
| |
| 15:00-16:30, Paper WeI2I.249 | Add to My Program |
| DiffVL: Diffusion-Based Visual Localization on 2D Maps Via BEV-Conditioned GPS Denoising |
|
| Gao, Li | Alibaba Amap |
| Sun, Hongyang | Zhejiang University |
| Liu, Liu | Alibaba Amap |
| Li, Yunhao | UCAS |
| Cai, Yang | Alibaba Amap |
Keywords: Localization, Deep Learning for Visual Perception, Computer Vision for Automation
Abstract: Accurate visual localization is crucial for autonomous driving, yet existing methods face a fundamental dilemma: while high-definition (HD) maps provide high-precision localization references, their costly construction and maintenance hinder scalability, which drives research toward standard-definition (SD) maps like OpenStreetMap. Current SD-map-based approaches primarily focus on Bird's-Eye View (BEV) matching between images and maps, overlooking a ubiquitous signal—noisy GPS. Although GPS is readily available, it suffers from multipath errors in urban environments. We propose DiffVL, the first framework to reformulate visual localization as a GPS denoising task using diffusion models. Our key insight is that noisy GPS trajectories, when conditioned on visual BEV features and SD maps, implicitly encode the true pose distribution, which can be recovered through iterative diffusion refinement. DiffVL, unlike prior BEV-matching methods (e.g., OrienterNet) or transformer-based registration approaches, learns to reverse GPS noise perturbations by jointly modeling GPS, SD map, and visual signals, achieving sub-meter accuracy without relying on HD maps. Experiments on multiple datasets demonstrate that our method achieves state-of-the-art accuracy compared to BEV-matching baselines. Crucially, our work proves that diffusion models can enable scalable localization by treating noisy GPS as a generative prior—marking a paradigm shift from traditional matching-based methods.
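Treating noisy GPS as the starting point of a reverse diffusion process means iteratively applying a conditional denoising step. A sketch of a standard DDPM-style update, with `eps_model` standing in for the BEV/SD-map-conditioned noise predictor (an assumption; the paper's parameterization may differ):

```python
import numpy as np

def reverse_diffusion_step(pose_t, t, eps_model, alphas, alpha_bars, cond):
    """One DDPM reverse step x_t -> x_{t-1} for a noisy pose estimate."""
    eps = eps_model(pose_t, t, cond)          # noise predicted from BEV + map
    a, ab = alphas[t], alpha_bars[t]
    x_prev = (pose_t - (1 - a) / np.sqrt(1 - ab) * eps) / np.sqrt(a)
    if t > 0:                                 # add noise except at final step
        x_prev += np.sqrt(1 - a) * np.random.randn(*pose_t.shape)
    return x_prev
```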
|
| |
| 15:00-16:30, Paper WeI2I.250 | Add to My Program |
| Geometry-Aided Vision-Based Localization of Future Mars Helicopters in Challenging Illumination Conditions |
|
| Pisanti, Dario | Space Robotics Research Group, SnT, University of Luxembourg |
| Hewitt, Robert | Jet Propulsion Laboratory, California Institute of Technology |
| Brockers, Roland | Jet Propulsion Laboratory, California Institute of Technology |
| Georgakis, Georgios | Jet Propulsion Laboratory, California Institute of Technology |
Keywords: Space Robotics and Automation, Localization, Deep Learning for Visual Perception
Abstract: Planetary exploration using aerial assets has the potential for unprecedented scientific discoveries on Mars. While NASA's Mars helicopter Ingenuity proved flight in Martian atmosphere is possible, future Mars rotorcraft will require advanced navigation capabilities for long-range flights. One such critical capability is Map-based Localization (MbL) which registers an onboard image to a reference map during flight to mitigate cumulative drift from visual odometry. However, significant illumination differences between rotorcraft observations and a reference map prove challenging for traditional MbL systems, restricting the operational window of the vehicle. In this work, we investigate a new MbL system and propose Geo-LoFTR, a geometry-aided deep learning model for image registration that is more robust under large illumination differences than prior models. The system is supported by a custom simulation framework that uses real orbital maps to produce large amounts of realistic images of the Martian terrain. Comprehensive evaluations show that our proposed system outperforms prior MbL efforts in terms of localization accuracy under significant lighting and scale variations. Furthermore, we demonstrate the validity of our approach across a simulated Martian day, and on real Mars imagery. Code and datasets are available at: https://dpisanti.github.io/geo-loftr/.
|
| |
| 15:00-16:30, Paper WeI2I.251 | Add to My Program |
| GRS-SLAM3R: Real-Time Dense SLAM with Gated Recurrent State |
|
| Shen, Guole | Shanghai Jiao Tong University |
| Deng, Tianchen | Shanghai Jiao Tong University |
| Wang, Yanbo | Shanghai Jiao Tong University |
| Chen, Yongtao | Shanghai Jiao Tong University |
| Shen, Yilin | Shanghai JiaoTong University |
| Liu, Jiuming | University of Cambridge |
| Wang, Jingchuan | Shanghai Jiao Tong University |
Keywords: SLAM, Mapping, Computer Vision for Transportation
Abstract: DUSt3R-based end-to-end scene reconstruction has recently shown promising results in dense visual SLAM. However, most existing methods only use image pairs to estimate pointmaps, overlooking spatial memory and global consistency. To this end, we introduce GRS-SLAM3R, an end-to-end SLAM framework for dense scene reconstruction and pose estimation from RGB images without any prior knowledge of the scene or camera parameters. Unlike existing DUSt3R-based frameworks, which operate on all image pairs and predict per-pair point maps in local coordinate frames, our method supports sequentialized input and incrementally estimates metric-scale point clouds in a global coordinate frame. To improve spatial consistency, we use a latent state for spatial memory and design a transformer-based gated update module that resets and updates the spatial memory, continuously aggregating and tracking relevant 3D information across frames. Furthermore, we partition the scene into submaps, apply local alignment within each submap, and register all submaps into a common world frame using relative constraints, producing a globally consistent map. Experiments on various datasets show that our framework achieves superior reconstruction accuracy while maintaining real-time performance.
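The gated update of the latent spatial memory can be pictured as a GRU-style blend of the old state with a candidate computed from the new frame. A toy sketch (weight matrices `Wg` and `Wc` are illustrative; the paper uses a transformer-based module):

```python
import numpy as np

def gated_memory_update(state, frame_feat, Wg, Wc):
    """Blend previous spatial memory with a candidate from the new frame."""
    z = np.concatenate([state, frame_feat])
    g = 1.0 / (1.0 + np.exp(-(Wg @ z)))       # update gate in (0, 1)
    c = np.tanh(Wc @ z)                       # candidate memory
    return g * c + (1.0 - g) * state          # reset/update the memory
```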
|
| |
| 15:00-16:30, Paper WeI2I.252 | Add to My Program |
| Hebbian Attractor Networks for Robot Locomotion |
|
| Dittrich, Alexander | École Polytechnique Fédérale De Lausanne (EPFL), Switzerland |
| van Diggelen, Fuda | École Polytechnique Fédérale De Lausanne |
| Floreano, Dario | Ecole Polytechnique Fédérale De Lausanne (EPFL) |
Keywords: Bioinspired Robot Learning, Evolutionary Robotics, Sensorimotor Learning
Abstract: Biological neural networks continuously adapt and modify themselves in response to experiences throughout their lifetime - a capability largely absent in artificial neural networks. Hebbian plasticity offers a promising path toward rapid adaptation in changing environments. Here, we introduce Hebbian Attractor Networks (HAN), a class of plastic neural networks in which local weight update normalization induces emergent attractor dynamics. Unlike prior approaches, HANs employ dual-timescale plasticity and temporal averaging of pre- and postsynaptic activations to induce either co-dynamic limit cycles or fixed-point weight attractors. Using simulated locomotion benchmarks, we gain insight into how Hebbian update frequency and activation averaging influence weight dynamics and control performance. Our results show that slower updates, combined with averaged pre- and postsynaptic activations, promote convergence to stable weight configurations, while faster updates yield oscillatory co-dynamic systems. We further demonstrate that these findings generalize to high-dimensional quadrupedal locomotion with a simulated Unitree Go1 robot. These results highlight how the timing of plasticity shapes neural dynamics in embodied systems, providing a principled characterization of the attractor regimes that emerge in self-modifying networks.
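The "local weight update normalization" the abstract ties to emergent attractor dynamics can be sketched in a few lines; this is an illustrative Hebbian step with row-wise normalization, not the authors' exact HAN rule:

```python
import numpy as np

def hebbian_step(W, pre, post, eta=0.01):
    """Hebbian update followed by weight normalization."""
    W = W + eta * np.outer(post, pre)         # local correlation update
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.maximum(norms, 1e-9)        # normalization bounds weights
```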
|
| |
| 15:00-16:30, Paper WeI2I.253 | Add to My Program |
| TC-LEC: Targetless Calibration for LiDAR-Event Camera Systems |
|
| Yang, Ying | Xiamen University |
| Li, Jianing | Peking University |
| Shi, Jiangming | Xiamen University |
| Qu, Yanyun | Xiamen University |
Keywords: Sensor Fusion, Search and Rescue Robots, SLAM
Abstract: LiDAR-event camera integration has shown considerable promise and is gaining traction across various perception applications. Event cameras offer high temporal resolution and wide dynamic range but suffer from noise sensitivity and lack depth information. LiDAR complements these capabilities by providing absolute scale and robustness, yet accurate calibration between the two sensors remains a significant challenge. This paper presents a targetless calibration framework for LiDAR–event camera systems that removes dependence on dedicated calibration targets and strong initial assumptions. The method estimates the event camera's angular velocity by analyzing per-pixel timestamps and spatial changes, enabling precise detection of natural edges. Calibration proceeds in two stages: (i) motion-based initialization, where Canonical Correlation Analysis (CCA) on rotational estimates from the event camera and LiDAR jointly recovers the temporal offset and rotation; (ii) nonlinear refinement of the extrinsics via cross-modal alignment of natural edge features. Experiments on physical platforms and public datasets demonstrate robust performance and high calibration accuracy across diverse scenarios. This work provides a solid foundation for further development and application of LiDAR-event camera fusion.
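The motion-based initialization must recover a temporal offset between the two sensors before rotation can be aligned. A simplified stand-in that scans shifts of angular-rate magnitudes for maximum correlation (the paper uses CCA on full rotational estimates):

```python
import numpy as np

def estimate_time_offset(w_event, w_lidar, max_shift=100):
    """Shift (in samples) maximizing correlation of angular-rate norms."""
    a = np.linalg.norm(w_event, axis=1)
    b = np.linalg.norm(w_lidar, axis=1)
    best, best_r = 0, -np.inf
    for s in range(-max_shift, max_shift + 1):
        x = a[s:] if s >= 0 else a[:s]        # trim the leading/lagging side
        y = b[:len(b) - s] if s >= 0 else b[-s:]
        n = min(len(x), len(y))
        r = np.corrcoef(x[:n], y[:n])[0, 1]
        if r > best_r:
            best, best_r = s, r
    return best
```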
|
| |
| 15:00-16:30, Paper WeI2I.254 | Add to My Program |
| Sim2real Image Translation Enables Viewpoint-Robust Policies from Fixed-Camera Datasets |
|
| Coholich, Jeremiah | Georgia Institute of Technology |
| Wit, Justin | Georgia Institute of Technology |
| Azarcon, Robert | Georgia Institute of Technology |
| Kira, Zsolt | Georgia Institute of Technology |
Keywords: Deep Learning in Grasping and Manipulation, Imitation Learning, Deep Learning for Visual Perception
Abstract: Vision-based policies for robot manipulation have achieved significant recent success, but are still brittle to distribution shifts such as camera viewpoint variations. Robot demonstration data is scarce and often lacks appropriate variation in camera viewpoints. Simulation offers a way to collect robot demonstrations at scale with comprehensive coverage of different viewpoints, but presents a visual sim2real challenge. To bridge this gap, we propose MANGO -- an unpaired image translation method with a novel segmentation-conditioned InfoNCE loss, a highly-regularized discriminator design, and a modified PatchNCE loss. We find that these elements are crucial for maintaining viewpoint consistency during sim2real translation. When training MANGO, we only require a small amount of fixed-camera data from the real world, but show that our method can generate diverse unseen viewpoints by translating simulated observations. In this setting, MANGO outperforms all other image translation methods we tested. In certain real-world tabletop manipulation tasks, MANGO augmentation increases shifted-view success rates by over 40 percentage points compared to policies trained without augmentation.
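MANGO's segmentation-conditioned InfoNCE builds on the standard InfoNCE contrastive objective. A minimal unconditioned sketch for a single query (the segmentation conditioning is the paper's novelty and is not shown here):

```python
import numpy as np

def info_nce(q, k_pos, k_negs, temp=0.07):
    """InfoNCE: cross-entropy of the positive key against negatives."""
    q = q / np.linalg.norm(q)
    keys = np.vstack([k_pos, k_negs])                 # positive first
    keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = keys @ q / temp
    return -logits[0] + np.log(np.sum(np.exp(logits)))
```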
|
| |
| 15:00-16:30, Paper WeI2I.255 | Add to My Program |
| Correcting Autonomous Vehicle Behavior to Ensure Rule Compliance |
|
| Toledo, Felipe | University of Virginia |
| Woodlief, Trey | William & Mary |
| Elbaum, Sebastian | University of Virginia |
| Dwyer, Matthew | University of Virginia |
Keywords: Autonomous Vehicle Navigation, Semantic Scene Understanding, Robot Safety
Abstract: As autonomous vehicles (AVs) continue to gain prominence in public life, the cost of their failures becomes increasingly drastic, endangering human life. Such failures arise from AVs' inability to meet their safety specifications in the field. Recent works have aimed to improve AVs' compliance with their safety specification through improved training and runtime enforcement. However, these methods are limited, requiring access to system internals or relying on narrow assumptions, which reduces their generality. In this work, we propose a different paradigm, Monitoring for Property Compliance (M4PC), which independently evaluates the system's compliance with the specification. The approach operates in two steps. First, it leverages scene graph abstractions and a specialized graph generator to map sensor data to driving rule preconditions to determine if an intervention is needed. Second, to correct an erroneous system output, M4PC defines a safe region within the control space defined by all relevant postconditions and minimally alters the system’s output to ensure it remains within this safe region, thereby preventing property violations. We apply M4PC to improve the specification compliance of three state-of-the-art autonomous vehicles with varying architectures in the CARLA simulator. Our current implementation can improve a baseline system, while our most optimized implementation outperforms state-of-the-art techniques that require system access.
|
| |
| 15:00-16:30, Paper WeI2I.256 | Add to My Program |
| ROI-GSurFisher: Next Best View Selection for Active Gaussian Splatting Via Fisher Information of ROI-Selected Gaussian Surfels |
|
| Wang, Wei | Beijing University of Technology |
| Ma, Wei | Beijing University of Technology |
| Zhang, Hongliang | Beijing University of Technology |
| Zha, Hongbin | Peking University |
Keywords: View Planning for SLAM, Mapping, SLAM
Abstract: Next Best View (NBV) selection is critical for achieving high-quality 3D reconstruction in unknown environments. This paper presents an active NBV selection approach tailored for Gaussian Splatting (GS), a widely adopted 3D reconstruction technique that has recently gained significant attention and been extended to Simultaneous Localization and Mapping (SLAM) systems. Existing state-of-the-art NBV methods for GS focus on minimizing uncertainties of GS parameters but often fail to prioritize views that improve geometric reconstruction quality. To address this limitation, we propose an active view selection method for GS-based reconstruction, with its core being ROI-GSurFisher. This method calculates Fisher Information on Gaussian surfels selected via a Region of Interest (ROI) mechanism. Both the use of surfels for computation and the ROI constraint enhance ROI-GSurFisher's ability to evaluate geometric information gain. We further introduce a close-front view scoring module that prioritizes viewpoints conducive to high-quality reconstruction. The final NBV is selected by maximizing the combined geometric information gain and close-front score. Experimental results on 3D reconstruction of various objects and scenes demonstrate consistent qualitative and quantitative improvements. Beyond standalone 3D reconstruction, the proposed NBV method can be integrated into SLAM systems to select fewer but more valuable keyframes. Code is available at https://github.com/WW11111/ROI-GSurFisher.
|
| |
| 15:00-16:30, Paper WeI2I.257 | Add to My Program |
| Magnetic-Acoustic Microbubble Microrobot for Targeted Mechanical Stimulation of Cancer Cells |
|
| Wang, Cun | Queen's University |
| Li, Ruicheng | Queen's University |
| Dai, Qianyi | Queen's University |
| Wang, Zhaokai | Queen's University |
| Lai, Yongjun | Queen's University |
| You, Lidan | Queen's University |
| Wang, Xian | Queen's University |
Keywords: Biological Cell Manipulation, Micro/Nano Robots, Automation at Micro-Nano Scales
Abstract: Mechanical stimulation has recently been shown to be a promising approach to induce targeted cancer cell death. With precise field control, magnetic microrobots were navigated to the tumor site for delivering mechanical stimulation as a new treatment approach. However, most magnetic microrobots suffer from low force output when generating mechanical stimulation. Acoustic microrobots, with a microbubble as their simplest form, generate strong mechanical stimulation yet lack precise position control. In this paper, leveraging the force output of the acoustic field and the precision of magnetic field control, we present a magnetic acoustic microbubble microrobot (MAM) that integrates magnetic navigation and acoustic stimulation. MAMs were fabricated with DSPC/PEG lipid shells and iron-oxide nanoparticles (IONPs) using a flow-focusing microfluidic method. The fabricated monodispersed MAMs are approximately 10 μm in size. The fabricated MAMs were navigated by a quadrupole magnetic tweezer system, with a maximum field gradient of 2–3 T/m, and controlled to oscillate to generate mechanical stimulation under an acoustic transducer at 1 MHz. As a proof of concept, we applied MAM acoustic treatment to breast cancer cells (MDA-MB-231) and showed that it led to reduced cell viability compared to the control group and the acoustic-only group. Since all components used in MAM fabrication are FDA-approved materials, MAM holds promise for clinical translation in tumor mechanical stimulation.
|
| |
| 15:00-16:30, Paper WeI2I.258 | Add to My Program |
| Learnable Conformal Prediction for Safe and Efficient Robotics under Perception and Planning Uncertainties |
|
| Kumar, Divake | University of Illinois Chicago |
| Tayebati, Sina | University of Illinois Chicago |
| Migliarba, Francesco | University of Illinois Chicago |
| Krishnan, Ranganath | Intel |
| Trivedi, Amit Ranjan | University of Illinois at Chicago (UIC), Chicago, USA |
Keywords: Deep Learning Methods, Planning under Uncertainty, Robot Safety
Abstract: Deep learning models in robotics often output point estimates with poorly calibrated confidences, offering no native mechanism to quantify predictive reliability under novel, noisy, or out-of-distribution inputs. Conformal prediction (CP) addresses this gap by providing distribution-free coverage guarantees, yet its reliance on fixed nonconformity scores ignores context and can yield intervals that are overly conservative or unsafe. We address this with Learnable Conformal Prediction (LCP), which replaces fixed scores with a lightweight neural function that leverages geometric, semantic, and model cues. Trained to balance coverage, efficiency, and calibration, LCP preserves CP's finite-sample guarantees while producing intervals that adapt to instance difficulty, achieving context-aware uncertainty without ensembles or repeated inference. On the MRPB benchmark, LCP raises navigation success to 91.5% versus 87.8% for Standard CP, while limiting path inflation to 4.5% compared to 12.2%. For object detection on COCO, BDD100K, and Cityscapes, it reduces mean interval width by 46-54% at 90% coverage, and on classification tasks (CIFAR-100, HAM10000, ImageNet) it shrinks prediction sets by 4.7-9.9%. The method achieves real-time performance on resource-constrained edge hardware (Intel NUC, <30W) while simultaneously providing uncertainty estimates along with the mean prediction.
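For reference, the split-conformal machinery LCP builds on computes a finite-sample-adjusted quantile of calibration scores and uses it to scale intervals. A minimal sketch, where `sigma_hat` plays the role of LCP's learned, context-dependent score scale (an assumption for illustration):

```python
import numpy as np

def conformal_quantile(cal_scores, alpha=0.1):
    """Finite-sample-adjusted (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, q, method="higher")

def prediction_interval(y_hat, sigma_hat, qhat):
    """Interval y_hat +/- qhat * sigma_hat; the interval adapts to the
    input when sigma_hat comes from a learned function of context."""
    return y_hat - qhat * sigma_hat, y_hat + qhat * sigma_hat
```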
|
| |
| 15:00-16:30, Paper WeI2I.259 | Add to My Program |
| Screw Geometry Meets Bandits: Incremental Acquisition of Demonstrations to Generate Manipulation Plans |
|
| Das, Dibyendu | Stony Brook University |
| Patankar, Aditya | Stony Brook University |
| Chakraborty, Nilanjan | Stony Brook University |
| Ramakrishnan, C. R. | Stony Brook University |
| Ramakrishnan, I. V. | Stony Brook University |
Keywords: Manipulation Planning, Learning from Demonstration, Motion and Path Planning
Abstract: In this paper, we study the problem of methodically obtaining a sufficient set of kinesthetic demonstrations, one at a time, such that a robot can be confident of its ability to perform a complex manipulation task in a given region of its workspace. Although programming by demonstration has been an active area of research, the problems of checking whether a set of demonstrations is sufficient and systematically seeking additional demonstrations have remained open. We present an approach for the robot to incrementally and actively ask for new demonstration examples, one at a time, until the robot can assess with high confidence that it can perform the task successfully. Our approach uses (i) a screw geometric representation of motion to generate manipulation plans from demonstrations, which makes the sufficiency of a set of demonstrations measurable; (ii) a sampling strategy based on PAC-learning from multi-armed bandit optimization to evaluate the robot's ability to generate manipulation plans in a subregion of its task space; and (iii) a heuristic to seek additional demonstration from areas of weakness. We present results of a user study conducted with 22 participants (without any background in robotics) on two example manipulation tasks, namely pouring and scooping, to assess the utility and usability of our approach. The results show that a handful of examples (fewer than 10) were needed to successfully teach the robot to plan tasks. A video supplement is available on YouTube: https://youtu.be/ncsb_m6CCNY
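The demonstration-seeking loop can be read as a bandit over task-space subregions: query the region where plan-generation success is least assured. A generic lower-confidence-bound heuristic in that spirit (not the paper's exact PAC-bandit criterion):

```python
import numpy as np

def next_demo_region(successes, trials, delta=0.05):
    """Index of the subregion with the weakest Hoeffding lower bound
    on plan-generation success probability."""
    t = np.maximum(trials, 1)
    p_hat = successes / t
    lcb = p_hat - np.sqrt(np.log(1.0 / delta) / (2.0 * t))
    return int(np.argmin(lcb))                # seek a demo where weakest
```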
|
| |
| 15:00-16:30, Paper WeI2I.260 | Add to My Program |
| A Comprehensive Analysis of the Effects of Network Quality of Service on Robotic Telesurgery |
|
| Zhang, Zhaomeng | University of Virginia |
| Roodabeh, Seyed HamidReza | University of Virginia |
| Alemzadeh, Homa | University of Virginia |
Keywords: Surgical Robotics: Laparoscopy, Telerobotics and Teleoperation, Networked Robots
Abstract: The viability of long-distance telesurgery hinges on reliable network Quality of Service (QoS), yet the impact of realistic network degradations on task performance is not sufficiently understood. This paper presents a comprehensive analysis of how packet loss, delay, and communication loss affect telesurgical task execution. We introduce NetFI, a novel fault injection tool that emulates different network conditions using stochastic QoS models informed by real-world network data. By integrating NetFI with a surgical simulation platform, we conduct a user study involving 15 participants at three proficiency levels, performing a standardized Peg Transfer task under varying levels of packet loss, delay, and communication loss. We analyze the effect of network QoS on overall task performance and the fine-grained motion primitives (MPs) using objective performance and safety metrics and subjective operator's perception of workload. We identify specific MPs vulnerable to network degradation and find strong correlations between proficiency, objective performance, and subjective workload. These findings offer quantitative insights into the operational boundaries of telesurgery. Our open-source tools and annotated dataset provide a foundation for developing robust and network-aware control and mitigation strategies.
|
| |
| 15:00-16:30, Paper WeI2I.261 | Add to My Program |
| Real-Time Learning of Predictive Dynamic Obstacle Models for Robotic Motion Planning |
|
| Kombo, Stella | California Institute of Technology |
| Burdick, Joel | California Institute of Technology |
| Haseli, Masih | California Institute of Technology |
| Wei, Skylar X. | California Institute of Technology |
Keywords: Dynamics, Calibration and Identification, Machine Learning for Robot Control
Abstract: Autonomous systems often must predict the motions of nearby agents from partial and noisy data. This paper asks and answers the question: "Can we learn, in real-time, a nonlinear predictive model of another agent's motions?" Our online framework denoises and forecasts such dynamics using a modified sliding-window Hankel Dynamic Mode Decomposition (Hankel-DMD). Partial noisy measurements are embedded into a Hankel matrix, while an associated Page matrix enables singular-value hard thresholding (SVHT) to denoise the Hankel matrix and estimate its rank. A Cadzow projection enforces structured low-rank consistency, yielding a denoised trajectory and local noise variance estimates. From this representation, a time-varying Hankel-DMD lifted linear predictor is constructed for multi-step forecasts. The residual analysis provides variance-tracking signals that can support downstream estimators and risk-aware planning. We validate the approach in simulation under Gaussian and heavy-tailed noise, and experimentally on a dynamic crane testbed. Results show that the method achieves stable variance-aware denoising and short-horizon prediction suitable for integration into real-time control frameworks.
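Two of the abstract's building blocks, Hankel embedding and singular-value hard thresholding, are compact to state. A minimal sketch for a scalar signal (the threshold `tau` is assumed given; Gavish-Donoho-style rules derive it from the matrix shape and noise level):

```python
import numpy as np

def hankel(x, rows):
    """Delay-embed a 1D signal into a Hankel matrix H[r, c] = x[r + c]."""
    cols = len(x) - rows + 1
    return np.stack([x[i:i + cols] for i in range(rows)])

def svht_denoise(H, tau):
    """Zero out singular values below tau and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    k = int((s > tau).sum())                  # estimated rank
    return (U[:, :k] * s[:k]) @ Vt[:k]
```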
|
| |
| 15:00-16:30, Paper WeI2I.262 | Add to My Program |
| DisFlow: Scene Flow from Distance Field for Object Pose, Velocity Tracking, and Surface Reconstruction |
|
| Wu, Lan | University of Technology Sydney |
| Sutjipto, Sheila | University of Technology, Sydney |
| Wakulicz, Jennifer | University of Sydney, Australian Centre for Field Robotics |
| Vidal-Calleja, Teresa A. | University of Technology Sydney |
Keywords: Mapping, RGB-D Perception
Abstract: We present DisFlow, a novel framework for online scene flow estimation from distance field that enables 6DoF dynamic object pose estimation, motion tracking, and surface reconstruction. The scene is represented by Gaussian Process Implicit Surfaces (GPIS), with surface normals serving as derivative constraints, enabling accurate signed distance computations near the surface and gradient queries with uncertainty. With this representation as a foundation, we compute a scene flow from the distance field that describes how surface points are transported over time in consecutive frames. Through our flow, we can estimate an object's pose and motion by incrementally registering a new observed point cloud via an elegant closed-form optimisation. Unlike prior methods that operate in the camera or world frame, our approach performs probabilistic fusion directly in the object frame, where the object remains geometrically consistent over time. The tight coupling of the DisFlow method in space and time yields dense geometry, surface normals, object pose trajectories, velocities, and uncertainty, all at real-time rates. We evaluate DisFlow on dynamic object sequences and demonstrate that it achieves accurate pose and motion tracking while simultaneously reconstructing high-quality object surfaces.
|
| |
| 15:00-16:30, Paper WeI2I.263 | Add to My Program |
| Robust Differentiable Collision Detection for General Objects |
|
| Chen, Jiayi | Peking University |
| Zhao, Wei | Tsinghua University |
| Ruan, Liangwang | Peking University |
| Chen, Baoquan | Peking University |
| Wang, He | Peking University |
Keywords: Contact Modeling, Grasping
Abstract: Collision detection is a core component of robotics applications such as simulation, control, and planning. Traditional algorithms like GJK+EPA compute witness points—the closest or deepest-penetration pairs between two objects—but are inherently non-differentiable, preventing gradient flow and limiting gradient-based optimization in contact-rich tasks such as grasping and manipulation. Recent work introduced efficient first-order randomized smoothing to make witness points differentiable; however, their direction-based formulation is restricted to convex objects and lacks robustness for complex geometries. In this work, we propose a robust and efficient differentiable collision detection framework that supports both convex and concave objects across diverse scales and configurations. Our method introduces distance-based first-order randomized smoothing, adaptive sampling, and equivalent gradient transport for robust and informative gradient computation. Experiments on complex meshes from DexGraspNet and Objaverse show significant improvements over existing baselines. Finally, we demonstrate a direct application of our method for dexterous grasp synthesis to refine the grasp quality. The code is available at https://github.com/JYChen18/DiffCollision.
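Randomized smoothing makes a non-differentiable quantity differentiable by averaging over perturbations. A zeroth-order Gaussian-smoothing gradient estimator, a simpler cousin of the distance-based first-order scheme the abstract proposes:

```python
import numpy as np

def smoothed_grad(f, x, sigma=1e-2, n=64):
    """Estimate the gradient of the Gaussian-smoothed f at x:
    grad E[f(x + sigma * u)] ~= mean(f(x + sigma * u) * u) / sigma."""
    u = np.random.randn(n, x.size)
    fx = np.array([f(x + sigma * ui) for ui in u])
    return (fx[:, None] * u).mean(axis=0) / sigma
```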
|
| |
| 15:00-16:30, Paper WeI2I.264 | Add to My Program |
| FRESHR-GSI: A Generalized Safety Model and Evaluation Framework for Mobile Robots in Multi-Human Environments |
|
| Pandey, Pranav Kumar | University of Georgia |
| Parasuraman, Ramviyas | University of Georgia |
| Doshi, Prashant | University of Georgia |
Keywords: Safety in HRI, Human-Robot Collaboration, Social HRI
Abstract: Human safety is critical in applications involving close human-robot interactions (HRI) and is a key aspect of physical compatibility between humans and robots. While measures of human safety in HRI exist, these mainly target industrial settings involving robotic manipulators. Less attention has been paid to settings where mobile robots and humans share the space. This paper introduces a new robot-centered directional framework of human safety. It is particularly useful for evaluating mobile robots as they operate in environments populated by multiple humans. The framework integrates several key metrics, such as each human's relative distance, speed, and orientation. The core novelty lies in the framework's flexibility to accommodate different application requirements while allowing for both the robot-centered and external observer points of view. We instantiate the framework by using RGB-D based vision integrated with a deep learning-based human detection pipeline to yield a proxemics-guided generalized safety index (GSI) that instantaneously assesses human safety. We extensively validate GSI's capability of producing appropriate and fine-grained safety measures in real-world experimental scenarios and demonstrate its superior efficacy against extant safety models.
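To make the ingredients concrete, a toy per-human safety score combining relative distance and approach speed is sketched below; it is an illustrative stand-in, not the paper's GSI formula:

```python
import numpy as np

def toy_safety_score(rel_pos, rel_vel, d_safe=1.5):
    """Score in [0, 1] (1 = safest) from distance and closing speed.

    rel_pos, rel_vel: human position/velocity relative to the robot.
    """
    d = np.linalg.norm(rel_pos)
    closing = max(0.0, -np.dot(rel_vel, rel_pos) / (d + 1e-9))
    return min(1.0, d / d_safe) / (1.0 + closing)
```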
|
| |
| 15:00-16:30, Paper WeI2I.265 | Add to My Program |
| HapCompass: A Rotational Haptic Device for Contact-Rich Robotic Teleoperation |
|
| Tan, Xiangshan | Toyota Technological Institute at Chicago |
| Ji, Jingtian | Toyota Technological Institute at Chicago |
| Jiang, Tianchong | Toyota Technological Institute at Chicago |
| Lopes, Pedro | University of Chicago |
| Walter, Matthew | Toyota Technological Institute at Chicago |
Keywords: Haptics and Haptic Interfaces, Telerobotics and Teleoperation
Abstract: The contact-rich nature of manipulation makes it a significant challenge for robotic teleoperation. While haptic feedback is critical for contact-rich tasks, providing intuitive directional cues within wearable teleoperation interfaces remains a bottleneck. Existing solutions, such as non-directional vibrations from handheld controllers, provide limited information, while vibrotactile arrays are prone to perceptual interference. To address these limitations, we propose HapCompass, a novel, low-cost wearable haptic device that renders 2D directional cues by mechanically rotating a single linear resonant actuator (LRA). We evaluated HapCompass's ability to convey directional cues to human operators and showed that it increased the success rate, decreased the completion time and the maximum contact force for teleoperated manipulation tasks when compared to vision-only and non-directional feedback baselines. Furthermore, we conducted a preliminary imitation-learning evaluation, suggesting that the directional feedback provided by HapCompass enhances the quality of demonstration data and, in turn, the trained policy. We release the design of the HapCompass device along with the code that implements our teleoperation interface: https://ripl.github.io/HapCompass/.
|
| |
| 15:00-16:30, Paper WeI2I.266 | Add to My Program |
| UniVideo: Universal Monocular Video Understanding |
|
| Lu, Yawen | Adobe |
| Cao, Zhiwen | Adobe |
| Lin, Wei-An | Adobe |
| Kalarot, Ratheesh | Adobe |
Keywords: Deep Learning for Visual Perception, Visual Learning, Visual Tracking
Abstract: Video flow, depth, and panoptic segmentation are fundamental to diverse robotic perception and computer vision applications. Despite recent advances in specialized approaches, several inherent limitations remain challenging: first, training and running three separate models is computationally costly; second, separate training prevents the tasks from sharing underlying feature representations and knowledge. In this work, we address these challenges by reformulating video flow estimation, depth estimation, and panoptic segmentation as a sequence of feature correspondence matching, updating, and tracking problems. This approach allows these tasks to be addressed by a single architecture that compares feature similarities across frames. By incorporating a shared feature representation with distinct prediction heads, our model can simultaneously predict consistent and reliable optical flow, depth maps, and object masks for videos. We further demonstrate that this universal model maintains temporal consistency across tasks while requiring no task-specific re-training. Extensive experiments on the FlyingThings, Sintel, VKITTI, KITTI, and VIPSeg benchmarks demonstrate superior performance. Furthermore, the model exhibits zero-shot performance on unseen in-the-wild scenes.
|
| |
| 15:00-16:30, Paper WeI2I.267 | Add to My Program |
| PAGTM: Position and Attention-Guided Token Merging for Efficient Visual Place Recognition |
|
| Cho, Hongchan | Yonsei University |
| Lee, Youngjo | Yonsei University |
| Jang, Jinwoo | Yonsei University |
| Yu, Seunghan | Yonsei University |
| Kim, Euntai | Yonsei University |
Keywords: Deep Learning for Visual Perception, Recognition, Localization
Abstract: Recent advances in Vision Transformers (ViTs) have significantly improved the performance of Visual Place Recognition (VPR), but their high computational cost—due to the quadratic complexity of self-attention—limits their practical deployment in real-world scenarios. To address this challenge, we propose PAGTM (Positional- and Attention-Guided Token Merging), a training-free token reduction framework designed specifically for ViT-based VPR models. In VPR, preserving the spatial layout of a scene (e.g. road alignment, building structures) and focusing on semantically meaningful regions are both critical for robust matching under viewpoint and appearance variations. However, existing token reduction methods often overlook these aspects, leading to degraded recognition performance. To address this, PAGTM incorporates two key cues. The first is positional proximity, which merges spatially adjacent tokens to maintain the scene’s structural layout. The second is attention-based token protection, which retains tokens that receive high attention because they represent regions important for distinguishing places, such as signs or distinctive structures. Without requiring any fine-tuning, PAGTM can be directly applied at inference time and consistently outperforms existing token reduction methods such as ToMe and ToFu across multiple ViT-based VPR models and datasets, achieving a better trade-off between computational efficiency and retrieval accuracy.
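The two cues can be illustrated with a toy, training-free merging routine: average spatially adjacent token pairs with high feature similarity while protecting high-attention tokens. The pairing rule, similarity measure, and ratios below are simplifications for illustration, not PAGTM itself:

```python
# Toy training-free token merging with positional proximity and attention
# protection. Constants and the neighbor rule are illustrative assumptions.
import numpy as np

def merge_tokens(feats, attn, grid_w, keep_ratio=0.75, protect_frac=0.2):
    n = feats.shape[0]
    protected = set(np.argsort(attn)[-int(protect_frac * n):])  # keep salient tokens
    pairs = []  # candidate same-row neighbor pairs, most similar first
    for i in range(n):
        j = i + 1
        if j % grid_w != 0 and j < n:               # horizontally adjacent only
            sim = feats[i] @ feats[j] / (
                np.linalg.norm(feats[i]) * np.linalg.norm(feats[j]) + 1e-8)
            pairs.append((sim, i, j))
    pairs.sort(reverse=True)
    merged, used = [], set()
    target_merges = n - int(keep_ratio * n)
    for sim, i, j in pairs:
        if len(merged) >= target_merges:
            break
        if {i, j} & (used | protected):             # skip protected/used tokens
            continue
        merged.append((i, j)); used |= {i, j}
    out = [0.5 * (feats[i] + feats[j]) for i, j in merged]
    out += [feats[k] for k in range(n) if k not in used]
    return np.stack(out)

tokens = np.random.randn(16, 8)
print(merge_tokens(tokens, np.random.rand(16), grid_w=4).shape)
# (16 - #merges, 8), e.g. (12, 8) when all merges succeed
```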
|
| |
| 15:00-16:30, Paper WeI2I.268 | Add to My Program |
| Sample-Based Hybrid Mode Control: Asymptotically Optimal Switching of Algorithmic and Non-Differentiable Control Modes |
|
| Liu, Yilang | Yale University |
| You, Haoxiang | Yale University |
| Abraham, Ian | University of Sydney |
Keywords: Hybrid Logical/Dynamical Planning and Verification, Optimization and Optimal Control, Legged Robots
Abstract: This paper investigates a sample-based solution to the hybrid mode control problem across non-differentiable and algorithmic hybrid modes. Our approach reasons about a set of hybrid control modes as an integer-based optimization problem where we select what mode to apply, when to switch to another mode, and the duration for which we are in a given control mode. A sample-based variation is derived to efficiently search the integer domain for optimal solutions. We find our formulation yields strong performance guarantees that can be applied to a number of robotics-related tasks. In addition, our approach is able to synthesize complex algorithms and policies to compound behaviors and achieve challenging tasks. Lastly, we demonstrate the effectiveness of our approach in a real-world robotic example that requires reactive switching between long-term planning and high-frequency control.
|
| |
| 15:00-16:30, Paper WeI2I.269 | Add to My Program |
| Underwater Dense Mapping with the First Compact 3D Sonar |
|
| Burgul, Chinmay | University of Delaware |
| Huang, Yewei | Dartmouth College |
| Chatzispyrou, Michalis | University of Delaware |
| Rekleitis, Ioannis | University of Delaware |
| Quattrini Li, Alberto | Dartmouth College |
| Xanthidis, Marios | SINTEF Ocean |
Keywords: Marine Robotics, Field Robots, Range Sensing
Abstract: In the past decade, the adoption of compact 3D range sensors, such as LiDARs, has driven the development of robust state-estimation pipelines, making them a standard sensor for aerial, ground, and space autonomy. Unfortunately, poor propagation of electromagnetic waves underwater has limited the visibility-independent sensing options for underwater state estimation to acoustic range sensors, which provide 2D and, at best, spatially ambiguous information. This paper, to the best of our knowledge, is the first study examining the performance, capacity, and opportunities arising from the recent introduction of the first compact 3D sonar. Towards that purpose, we introduce calibration procedures for extracting the extrinsics between the 3D sonar and a camera, and we provide a study of the acoustic response of different surfaces and materials. Moreover, we provide novel mapping and SLAM pipelines tested in deployments in underwater cave systems and other geometrically and acoustically challenging underwater environments. Our assessment showcases the unique capacity of 3D sonars to capture consistent spatial information, allowing for detailed reconstructions and localization in datasets spanning hundreds of meters. At the same time, it highlights remaining challenges related to acoustic propagation, as found also in other acoustic sensors. Datasets collected for our evaluations will be released and shared with the community to enable further research advancements.
|
| |
| 15:00-16:30, Paper WeI2I.270 | Add to My Program |
| Safe Payload Transfer with Ship-Mounted Cranes: A Robust Model Predictive Control Approach |
|
| Das, Ersin | Illinois Institute of Technology |
| Welch, William A. | California Institute of Technology |
| Spieler, Patrick | JPL |
| Albee, Keenan | University of Southern California |
| Noca, Aurelio | Caltech |
| Edlund, Jeffrey | Jet Propulsion Lab |
| Becktor, Jonathan | Technical University of Denmark |
| Touma, Thomas | Caltech |
| Todd, Jessica | Caltech |
| Bhamidipati, Sriramya | Stanford University |
| Kombo, Stella | California Institute of Technology |
| Saboia Da Silva, Maira | NASA Jet Propulsion Laboratory |
| Sabel, Anna | NASA JPL |
| Lim, Grace | Jet Propulsion Laboratory |
| Thakker, Rohan | Nasa's Jet Propulsion Laboratory, Caltech |
| Rahmani, Amir | Jet Propulsion Laboratory |
| Burdick, Joel | California Institute of Technology |
Keywords: Robot Safety, Optimization and Optimal Control, Underactuated Robots
Abstract: Ensuring safe real-time control of ship-mounted cranes in unstructured transportation environments requires handling multiple safety constraints while maintaining effective payload transfer performance. Unlike traditional crane systems, ship-mounted cranes are consistently subjected to significant external disturbances affecting underactuated crane dynamics due to the ship's dynamic motion response to harsh sea conditions, which can lead to robustness issues. To tackle these challenges, we propose a robust and safe model predictive control (MPC) framework and demonstrate it on a 5-DOF crane system, where a Stewart platform simulates the external disturbances that ocean surface motions would have on the supporting ship. The crane payload transfer operation must avoid obstacles and accurately place the payload within a designated target area. We use a robust zero-order control barrier function (R-ZOCBF)-based safety constraint in the nonlinear MPC to ensure safe payload positioning, while time-varying bounding boxes are utilized for collision avoidance. We introduce a new optimization-based online robustness parameter adaptation scheme to reduce the conservativeness of R-ZOCBFs. Experimental trials on a crane prototype demonstrate the overall performance of our safe control approach under significant perturbing motions of the crane base. While our focus is on crane-facilitated transfer, the methods more generally apply to safe robotically-assisted parts mating and parts insertion.
|
| |
| 15:00-16:30, Paper WeI2I.271 | Add to My Program |
| Uncertainty-Aware Vision-Based Risk Object Identification Via Conformal Risk Tube Prediction |
|
| Fu, Kai-Yu | National Yang Ming Chiao Tung University |
| Chen, Yi-Ting | National Yang Ming Chiao Tung University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, Intelligent Transportation Systems
Abstract: We study object importance-based vision risk object identification (Vision-ROI), a key capability for intelligent driving systems. Existing approaches are deterministic and ignore uncertainty, potentially compromising safety. For example, fixed decision thresholds in ambiguous scenarios can cause premature or delayed risk detection and temporally unstable predictions. These issues worsen under diverse contexts with multiple interacting risks that perturb where and when risks occur. However, current vision methods lack a principled way to model uncertainty jointly across space and time, limiting adaptability to scene complexity. We propose Risk Tube Prediction, a unified formulation for modeling spatiotemporal risk uncertainty. We further introduce a conformal prediction framework to provide coverage guarantees for the true risks and yield calibrated risk scores and uncertainty estimates. Specifically, we employ risk-category-aware calibrators that account for each category's distinct characteristics, avoiding conflated calibration across categories. To evaluate, we present a new dataset and metrics probing diverse scenario configurations with multi-risk coupling effects. We systematically analyze factors affecting uncertainty estimation, including scenario variations, per-risk category behavior, and perception error propagation. Our method delivers substantial improvements over prior approaches, enhancing Vision-ROI robustness and downstream performance, such as reducing nuisance braking alerts. For more qualitative results, please visit our project webpage: https://hcis-lab.github.io/CRTP/
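The conformal-calibration ingredient can be sketched in a few lines of split conformal prediction with per-category quantiles; the scores, categories, and miscoverage level alpha below are placeholders, and the paper's risk-tube construction is substantially richer:

```python
# Split-conformal sketch: per-category quantiles of nonconformity scores give
# coverage-calibrated thresholds. Data and alpha are illustrative placeholders.
import numpy as np

def conformal_thresholds(scores, categories, alpha=0.1):
    """scores[i]: nonconformity of calibration sample i; returns per-category q."""
    thresholds = {}
    for c in set(categories):
        s = np.sort(scores[categories == c])
        n = len(s)
        k = int(np.ceil((n + 1) * (1 - alpha))) - 1   # finite-sample correction
        thresholds[c] = s[min(k, n - 1)]
    return thresholds

rng = np.random.default_rng(0)
cats = rng.integers(0, 3, size=300)
scores = rng.exponential(1.0, size=300) + cats        # category-dependent scale
print(conformal_thresholds(scores, cats))
```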
|
| |
| 15:00-16:30, Paper WeI2I.272 | Add to My Program |
| D-Compress: Detail-Preserving LiDAR Range Image Compression for Real-Time Streaming on Resource-Constrained Robots |
|
| Wang, Shengqian | Department of Information Engineering, The Chinese University of Hong Kong |
| Tu, Chang | The Chinese University of Hong Kong |
| Chen, He | The Chinese University of Hong Kong |
|
|
| |
| 15:00-16:30, Paper WeI2I.273 | Add to My Program |
| Implementing Robust M-Estimators with Certifiable Factor Graph Optimization |
|
| Xu, Zhexin | Northeastern University |
| Zhang, Hanna | Northeastern University |
| Calatrava, Helena | Northeastern University |
| Closas, Pau | Northeastern University |
| Rosen, David | Northeastern University |
Keywords: SLAM, Mapping, Localization
Abstract: Parameter estimation in robotics and computer vision faces formidable challenges from both outlier contamination and nonconvex optimization landscapes. While M-estimation addresses the problem of outliers through robust loss functions, it creates severely nonconvex problems that are difficult to solve globally. Adaptive reweighting schemes provide one particularly appealing strategy for implementing M-estimation in practice: these methods solve a sequence of simpler weighted least squares (WLS) subproblems, enabling both the use of standard least squares solvers and the recovery of higher-quality estimates than simple local search. However, adaptive reweighting still crucially relies upon solving the inner WLS problems effectively, a task that remains challenging in many robotics applications due to the intrinsic nonconvexity of many common parameter spaces (e.g. rotations and poses). In this paper, we show how one can easily implement adaptively-reweighted M-estimators with a certifiably correct solver for the inner WLS subproblems, using only fast local optimization over smooth manifolds. Our approach exploits recent work on certifiable factor graph optimization to provide global optimality certificates for the inner WLS subproblems while seamlessly integrating into existing factor graph-based software libraries and workflows. Experimental evaluation on pose-graph optimization and landmark SLAM tasks demonstrates that our adaptively reweighted certifiable estimation approach provides higher-quality estimates than alternative local search-based methods, while scaling tractably to realistic problem sizes.
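For readers unfamiliar with adaptive reweighting, the Euclidean special case is plain iteratively reweighted least squares (IRLS) with, for example, a Huber loss. The sketch below shows only that loop and does not attempt the paper's certifiable manifold-valued inner solves:

```python
# Compact IRLS loop for a Huber M-estimator on linear regression: each outer
# iteration recomputes robust weights and solves a WLS subproblem.
import numpy as np

def huber_irls(A, b, delta=1.0, iters=20):
    x = np.linalg.lstsq(A, b, rcond=None)[0]          # ordinary LS initialization
    for _ in range(iters):
        r = A @ x - b
        w = np.where(np.abs(r) <= delta, 1.0, delta / np.abs(r))  # Huber weights
        W = np.sqrt(w)
        x = np.linalg.lstsq(A * W[:, None], b * W, rcond=None)[0]  # inner WLS
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.01 * rng.normal(size=100)
b[:10] += 5.0                                         # gross outliers
print(huber_irls(A, b))                               # close to x_true
```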
|
| |
| 15:00-16:30, Paper WeI2I.274 | Add to My Program |
| Distortion-Aware PETR for BEV Object Detection with Mixed Pinhole–Fisheye Cameras |
|
| Liu, Xiangzhong | Technical University of Munich |
Keywords: Visual Learning, Omnidirectional Vision, Computer Vision for Transportation
Abstract: Fisheye cameras are widely deployed in autonomous driving perception suites for their low cost and full-coverage field of view (FOV), yet their potential remains under-leveraged in 3D object detection. Severe radial distortion challenges most BEV detectors by violating the fundamental assumption of uniform sampling. To bridge this gap, we propose Distortion-Aware PETR (DAPETR), a projection-free detector tailored for mixed pinhole–fisheye camera setups. DAPETR incorporates two key learned-adaptive modules: a unified distortion-aware positional embedding that harmonizes positional encodings for image representations with fisheye geometry, and a bidirectional feature-geometry co-modulation module that mutually adapts image features and 3D positional embeddings. In our experiments on a converted KITTI-360 benchmark, we systematically compare our learned-adaptive approach against PETR in polar coordinates (PolarPETR). We find that while both methods improve over the baseline, our learned modules achieve superior performance. Crucially, we uncover a negative interaction when combining both strategies, revealing that learned adaptation and explicit geometric re-parameterization can conflict. Our final DAPETR model significantly advances the research and benchmark for fisheye BEV detection, providing critical insights into effective distortion-aware 3D perception design beyond image rectification.
|
| |
| 15:00-16:30, Paper WeI2I.275 | Add to My Program |
| Systematic Characterization of Drilling Parameters in Concentric Tube Steerable Drilling Robots: A Comparative Study |
|
| Maroufi, Daniyal | University of Texas at Austin |
| Kulkarni, Yash | The University of Texas at Austin |
| Rajesh Kanna, Vibhu Kanna | The University of Texas at Austin |
| Amadio, Jordan P. | University of Texas Dell Medical School |
| Khadem, Mohsen | University of Edinburgh |
| Bird, Justin E. | The University of Texas M.D. Anderson Cancer Center |
| Siewerdsen, Jeff | Johns Hopkins |
| Alambeigi, Farshid | University of Texas at Austin |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles
Abstract: To establish a foundational understanding for creating J-shaped trajectories with Concentric Tube Steerable Drilling Robots (CT-SDRs), this paper presents a systematic characterization of two operational factors: drill feed rate and rotational speed. We developed and compared a custom High-Speed Drill (HSD) and a Low-Speed Drill (LSD) to analyze how these parameters affect performance in flexible robotic drills versus conventional systems utilizing rigid instruments. By integrating the CT-SDRs with a seven degree-of-freedom robotic manipulator, we conducted experiments in synthetic bone phantoms of varying densities, assessing metrics such as motor current, hole diameter, radius of curvature, and drilling time. The results reveal critical performance trade-offs, demonstrating that high-speed drilling in CT-SDRs is essential for successfully penetrating dense bone. Further, we found that while slower feed rates improve trajectory accuracy and reduce hole enlargement, they significantly increase procedural time. These findings offer a quantitative guideline for design choices, component selection, and operational control of CT-SDRs tailored to patient-specific bone quality.
|
| |
| 15:00-16:30, Paper WeI2I.276 | Add to My Program |
| ActiveUMI: Robotic Manipulation with Active Perception from Robot‑Free Human Demonstrations |
|
| Zeng, Qiyuan | Shanghai University |
| Li, Chengmeng | Shanghai University |
| St. John, Jude | Stanford University |
| Zhou, Zhongyi | East China Normal University |
| Wen, Junjie | East China Normal University |
| Xu, Yi | Midea |
| Feng, Guorui | Shanghai University |
| Zhu, Yichen | University of Toronto |
Keywords: Imitation Learning
Abstract: We present ActiveUMI, a framework for a data collection system that transfers in-the-wild human demonstrations to robots capable of complex bimanual manipulation. ActiveUMI couples a portable VR teleoperation kit with sensorized controllers that mirror the robot's end-effectors, bridging human-robot kinematics via precise pose alignment. To ensure mobility and data quality, we introduce several key techniques, including immersive 3D model rendering, a self-contained wearable computer, and efficient calibration methods. ActiveUMI's defining feature is its capture of active, egocentric perception. By recording an operator's deliberate head movements via a head-mounted display, our system learns the crucial link between visual attention and manipulation. We evaluate ActiveUMI on six challenging bimanual tasks. Policies trained exclusively on ActiveUMI data achieve an average success rate of 70% on in-distribution tasks and demonstrate strong generalization, retaining a 56% success rate when tested on novel objects and in new environments. Our results demonstrate that portable data collection systems, when coupled with learned active perception, provide an effective and scalable pathway toward creating generalizable and highly capable real-world robot policies.
|
| |
| 15:00-16:30, Paper WeI2I.277 | Add to My Program |
| Conflict Mitigation in Shared Environments Using Flow-Aware Multi-Agent Path Finding |
|
| Heuer, Lukas | Örebro University |
| Zhu, Yufei | Örebro University |
| Palmieri, Luigi | Robert Bosch GmbH |
| Mannucci, Anna | Robert Bosch GmbH Corporate Research |
| Rudenko, Andrey | Technical University of Munich |
| Koenig, Sven | University of California, Irvine |
| Magnusson, Martin | Örebro University |
Keywords: Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents, Motion and Path Planning
Abstract: Deploying multi-robot systems in environments shared with dynamic and uncontrollable agents presents significant challenges, especially for large robot fleets. In such environments, individual robot operations can be delayed due to unforeseen conflicts with uncontrollable agents. While existing research primarily focuses on preserving the completeness of Multi-Agent Path Finding (MAPF) solutions considering delays, there is limited emphasis on utilizing additional environmental information to enhance solution quality in the presence of other dynamic agents. To this end, we propose Flow-Aware Multi-Agent Path Finding (FA-MAPF), a novel framework that integrates learned motion patterns of uncontrollable agents into centralized MAPF algorithms. Our evaluation, conducted on a diverse set of benchmark maps with simulated uncontrollable agents and on a real-world map with recorded human trajectories, demonstrates the effectiveness of FA-MAPF compared to state-of-the-art baselines. The experimental results show that FA-MAPF can consistently reduce conflicts with uncontrollable agents by up to 55%, without compromising task efficiency.
|
| |
| 15:00-16:30, Paper WeI2I.278 | Add to My Program |
| Structured Diversity Control: A Dual-Level Framework for Group-Aware Multi-Agent Coordination |
|
| Yang, Shuocun | Northwestern Polytechnical University |
| Hu, Huawen | Northwestern Polytechnical University |
| Liu, Xuan | Northwestern Polytechnical University |
| Yao, Yincheng | Northwestern Polytechnical University |
| Shi, Enze | Northwestern Polytechnical University |
| Zhang, Shu | Northwestern Polytechnical University |
Keywords: Reinforcement Learning, Multi-Robot Systems, Deep Learning Methods
Abstract: Controlling behavioral diversity is a pivotal challenge in multi-agent reinforcement learning (MARL), particularly in complex collaborative scenarios. While existing methods attempt to regulate behavioral diversity by directly differentiating across all agents, they lack deep characterization and learning of multi-agent composition structures. This limitation leads to suboptimal performance or coordination failures when facing more complex or challenging tasks. To bridge this gap, we introduce Structured Diversity Control (SDC), a framework that redefines the system-wide diversity metric as a weighted combination of intra-group diversity, which is minimized for cohesion, and inter-group diversity, which is maximized for specialization. The trade-off is governed by a pre-set Diversity Structure Factor (DSF), allowing for fine-grained, group-aware control over the collective strategy. Our method directly constrains the policy architecture without altering reward functions. This structural definition of diversity enables SDC to deliver substantial performance gains across various experiments, including increasing average rewards by up to 47.1% in multi-target pursuit and reducing episode lengths by 12.82% in complex neutralization scenarios. The proposed method offers a novel analytical perspective on the problem of cooperation in group-aware multi-agent systems.
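The dual-level objective can be illustrated on toy policy embeddings; the Euclidean distance and the score form below are illustrative stand-ins for the paper's metric:

```python
# Toy dual-level diversity score: intra-group distances (to minimize) and
# inter-group distances (to maximize), blended by a diversity structure
# factor. Embeddings and the metric are stand-ins, not SDC's formulation.
import numpy as np

def structured_diversity(embeddings, groups, dsf=0.5):
    """embeddings: (n_agents, d) policy embeddings; groups: (n_agents,) ids."""
    d = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    same = groups[:, None] == groups[None, :]
    off_diag = ~np.eye(len(groups), dtype=bool)
    intra = d[same & off_diag].mean()                 # cohesion within groups
    inter = d[~same].mean()                           # specialization across groups
    # Higher is better: small intra-group, large inter-group distances.
    return dsf * inter - (1.0 - dsf) * intra

emb = np.random.randn(8, 4)
print(structured_diversity(emb, np.array([0, 0, 0, 0, 1, 1, 1, 1])))
```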
|
| |
| 15:00-16:30, Paper WeI2I.279 | Add to My Program |
| ProbeMDE: Uncertainty-Guided Active Proprioception for Monocular Depth Estimation in Surgical Robotics |
|
| Jordan, Britton | University of Utah |
| Thompson, Jordan | University of Utah |
| d'Almeida, Jesse F. | Vanderbilt University |
| Li, Hao | Vanderbilt University |
| Kumar, Nithesh | Vanderbilt University |
| Stern, Susheela Sharma | Vanderbilt University |
| Ferguson, James | University of Utah |
| Oguz, Ipek | Vanderbilt University |
| Webster III, Robert James | Vanderbilt University |
| Brown, Daniel | University of Utah |
| Kuntz, Alan | Vanderbilt University |
Keywords: Computer Vision for Medical Robotics, Medical Robots and Systems, Deep Learning for Visual Perception
Abstract: Monocular depth estimation (MDE) provides a useful tool for robotic perception, but its predictions are often uncertain and inaccurate in challenging environments such as surgical scenes where textureless surfaces, specular reflections, and occlusions are common. To address this, we propose ProbeMDE, a cost-aware active sensing framework that combines RGB images with sparse proprioceptive measurements for MDE. Our approach utilizes an ensemble of MDE models to predict dense depth maps conditioned on both RGB images and a sparse set of known depth measurements obtained via proprioception, where the robot has touched the environment in a known configuration. We quantify predictive uncertainty via the ensemble's variance and measure the gradient of the uncertainty with respect to candidate measurement locations. To prevent mode collapse while selecting maximally informative locations to propriocept (touch), we leverage Stein Variational Gradient Descent (SVGD) over this gradient map. We validate our method in both simulated and physical experiments on central airway obstruction surgical phantoms. Our results demonstrate that our approach outperforms baseline methods across standard depth estimation metrics, achieving higher accuracy while minimizing the number of required proprioceptive measurements.
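A stripped-down version of the selection step: ensemble variance as the uncertainty map, its spatial gradient as the acquisition signal, and a greedy minimum-separation rule standing in for the SVGD repulsion used in the paper:

```python
# Simplified uncertainty-guided touch selection. The greedy min-separation
# rule is a crude stand-in for SVGD; all constants are illustrative.
import numpy as np

def select_touch_points(depth_ensemble, k=3, min_sep=10):
    var = depth_ensemble.var(axis=0)                  # (H, W) predictive variance
    gy, gx = np.gradient(var)
    score = np.hypot(gx, gy)                          # uncertainty-gradient magnitude
    picks = []
    for idx in np.argsort(score.ravel())[::-1]:       # best candidates first
        y, x = np.unravel_index(idx, score.shape)
        if all(np.hypot(y - py, x - px) >= min_sep for py, px in picks):
            picks.append((y, x))
        if len(picks) == k:
            break
    return picks

ens = np.random.rand(5, 64, 64)                       # 5 ensemble depth maps
print(select_touch_points(ens))
```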
|
| |
| 15:00-16:30, Paper WeI2I.280 | Add to My Program |
| Uncertainty-Aware Non-Prehensile Manipulation with Mobile Manipulators under Object-Induced Occlusion |
|
| Hwang, Jiwoo | Korea Advanced Institute of Science and Technology |
| Yang, Taegeun | Korea Advanced Institute of Science and Technology |
| Jeong, Jeil | Korea Advanced Institute of Science and Technology |
| Yoon, Minsung | Korea Advanced Institute of Science and Technology (KAIST) |
| Yoon, Sung-eui | KAIST |
Keywords: Reinforcement Learning, Mobile Manipulation
Abstract: Non-prehensile manipulation using onboard sensing presents a fundamental challenge: the manipulated object occludes the sensor's field of view, creating occluded regions that can lead to collisions. We propose CURA-PPO, a reinforcement learning framework that addresses this challenge by explicitly modeling uncertainty under partial observability. By predicting collision possibility as a distribution, we extract both risk and uncertainty to guide the robot's actions. The uncertainty term encourages active perception, enabling simultaneous manipulation and information gathering to resolve occlusions. When combined with confidence maps that capture observation reliability, our approach enables safe navigation despite severe sensor occlusion. Extensive experiments across varying object sizes and obstacle configurations demonstrate that CURA-PPO achieves up to 3 times higher success rates than the baselines, with learned behaviors that handle occlusions. Our method provides a practical solution for autonomous manipulation in cluttered environments using only onboard sensing.
|
| |
| 15:00-16:30, Paper WeI2I.281 | Add to My Program |
| Towards Exploratory and Focused Manipulation with Bimanual Active Perception: A New Problem, Benchmark and Strategy |
|
| He, Yuxin | The Hong Kong University of Science and Technology (Guangzhou) |
| Zhang, Ruihao | Macau University of Science and Technology |
| Shen, Tianao | The Chinese University of Hong Kong(Shenzhen) |
| Liu, Cheng | The Hong Kong University of Science and Technology (Guangzhou) |
| Nie, Qiang | Hong Kong University of Science and Technology (Guangzhou) |
|
|
| |
| 15:00-16:30, Paper WeI2I.282 | Add to My Program |
| Consistency-Driven Confidence Estimation for Stereo Matching |
|
| Lu, Shuheng | Nanyang Technological University |
| Gu, Zaiwang | Astar |
| Jiang, Xudong | Nanyang Technological University |
| Cheng, Jun | Institute for Infocomm Research, A*STAR |
Keywords: Deep Learning for Visual Perception, Computational Geometry, RGB-D Perception
Abstract: Confidence estimation for stereo matching is crucial for enhancing the reliability and accuracy of depth perception in real-world applications. Despite effectively capturing aleatoric uncertainty through probabilistic modeling and statistical aggregation, current regression-based confidence estimation methods neglect uncertainty arising from unstable training dynamics, resulting in over-confident predictions near occlusion boundaries, textureless regions, and reflective surfaces where errors are most severe. We propose a novel epoch-wise consistency accumulation algorithm that explicitly incorporates training dynamics into confidence estimation. Specifically, we design a full-image cross-epoch alignment mechanism to dynamically quantify pixel-wise training consistency between consecutive epochs, thereby significantly enhancing the estimation of confidence. We further propose a consistency-ranked evidential discrepancy loss, which aligns evidential uncertainty estimates with consistency-derived ordinal supervision, aiming to improve the correlation between confidence scores and actual prediction errors. Our approach is incorporated into MonSter, an advanced stereo baseline, achieving SOTA performance in confidence estimation across KITTI 2012, KITTI 2015 and Middlebury benchmarks.
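The epoch-wise accumulation idea, reduced to its simplest form: compare consecutive-epoch disparity predictions per pixel and fold the agreement into a running confidence map. The exponential kernel and decay factor are assumptions; the paper additionally aligns full images across epochs and couples this with an evidential ranking loss:

```python
# Running cross-epoch consistency map: stable pixels drift toward 1,
# unstable pixels stay low. tau and beta are illustrative constants.
import numpy as np

def update_consistency(consistency, disp_prev, disp_curr, tau=1.0, beta=0.9):
    """Exponentially weighted agreement between consecutive-epoch predictions."""
    agree = np.exp(-np.abs(disp_curr - disp_prev) / tau)   # 1 when unchanged
    return beta * consistency + (1.0 - beta) * agree

H, W = 4, 5
conf = np.zeros((H, W))
prev = np.random.rand(H, W) * 10
for _ in range(10):                                   # ten training epochs
    curr = prev + np.random.randn(H, W) * 0.1         # mostly stable predictions
    conf = update_consistency(conf, prev, curr)
    prev = curr
print(conf.round(2))
```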
|
| |
| 15:00-16:30, Paper WeI2I.283 | Add to My Program |
| OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation |
|
| Guo, Heyu | Princeton University, Microsoft Research Asia |
| Wang, Shanmu | University of California, Los Angeles |
| Ma, Ruichun | Microsoft |
| Jiang, Shiqi | Microsoft |
| Ghasempour, Yasaman | Princeton University |
| Abari, Omid | UCLA |
| Guo, Baining | Microsoft |
| Qiu, Lili | Microsoft |
Keywords: AI-Enabled Robotics, AI-Based Methods, Deep Learning in Grasping and Manipulation
Abstract: Vision-language-action (VLA) models have shown strong generalization in robotic manipulation through large-scale vision-language pretraining. However, most existing models rely solely on RGB cameras, limiting their perception and, consequently, manipulation capabilities. We present OmniVLA, an omni-modality VLA model that integrates novel sensing modalities to enable beyond-RGB robotic perception and manipulation. The core of our approach is the sensor-masked image, a unified representation that overlays physically meaningful, spatially grounded masks onto the RGB images. These masks are derived from sensors including an infrared camera, a mmWave radar, and a microphone array. This image-native unification keeps sensor input close to RGB statistics to facilitate training, provides a uniform interface across sensor hardware, and enables data-efficient learning with lightweight per-sensor projectors. Building on this, we design a multimodal vision-language-action model architecture and train OmniVLA by extending an RGB-pretrained VLA backbone. We evaluate OmniVLA on challenging real-world tasks that require sensor-modality perception to guide the manipulation. OmniVLA achieves an average task success rate of 84%, significantly outperforming both RGB-only and raw-sensor-input baseline models by 59% and 28% respectively, while showing higher learning efficiency and stronger generalization capability.
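A minimal take on the sensor-masked image representation: normalize a spatially registered sensor map and alpha-blend it into the RGB frame so inputs stay close to RGB statistics. The overlay color and blending weight are assumptions:

```python
# Overlay a registered sensor map (e.g. thermal intensity) onto RGB as a
# colored, spatially grounded mask. Color and alpha are illustrative.
import numpy as np

def sensor_masked_image(rgb, sensor_map, color=(255, 0, 0), alpha=0.4):
    """rgb: (H, W, 3) uint8; sensor_map: (H, W) float, same registration."""
    m = (sensor_map - sensor_map.min()) / (np.ptp(sensor_map) + 1e-8)
    overlay = np.asarray(color, dtype=np.float32) * m[..., None]
    blended = (1 - alpha * m[..., None]) * rgb + alpha * overlay
    return blended.astype(np.uint8)

rgb = np.full((48, 64, 3), 128, dtype=np.uint8)
heat = np.zeros((48, 64)); heat[10:20, 20:40] = 1.0   # hot region from IR camera
print(sensor_masked_image(rgb, heat).shape)
```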
|
| |
| 15:00-16:30, Paper WeI2I.284 | Add to My Program |
| Assigning Multi-Robot Tasks to Multitasking Robots |
|
| Smith, Winston | Arizona State University |
| Zhang, Yu (Tony) | Arizona State University |
Keywords: Cooperating Robots, Multi-Robot Systems, Agent-Based Systems
Abstract: One simplifying assumption in existing and well-performing task allocation methods is that the robots are single-tasking: each robot operates on a single task at any given time. While this assumption is harmless to make in some situations, it can be inefficient or even infeasible in others. In this paper, we consider assigning multi-robot tasks to multitasking robots. The key contribution is a novel task allocation framework that incorporates the consideration of physical constraints introduced by multitasking. This is in contrast to the existing work where such constraints are largely ignored. After formulating the problem, we propose a compilation to weighted MAX-SAT, which allows us to leverage existing solvers for a solution. A more efficient greedy heuristic is then introduced. For evaluation, we first compare our methods with a modern baseline that is efficient for single-tasking robots to validate the benefits of multitasking in synthetic domains. Then, using a site-clearing scenario in simulation, we illustrate the complex task interactions handled by the multitasking robots in our approach. Finally, we present a higher-complexity simulation to demonstrate the scalability and applicability of our approach.
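The greedy heuristic can be illustrated with a toy capacity model in which each robot may serve several tasks up to a budget; costs and the physical-interaction constraints that the paper encodes via weighted MAX-SAT are omitted here:

```python
# Greedy toy allocation: each task needs some number of robots; each robot
# can serve multiple tasks up to a capacity. A simplification of the paper's
# formulation, which also models physical multitasking constraints.
def greedy_assign(tasks, robots):
    """tasks: {name: robots_needed}; robots: {name: capacity}."""
    load = {r: 0 for r in robots}
    assignment = {t: [] for t in tasks}
    for t, need in sorted(tasks.items(), key=lambda kv: -kv[1]):  # big tasks first
        for r in sorted(robots, key=lambda r: load[r]):           # least-loaded first
            if len(assignment[t]) == need:
                break
            if load[r] < robots[r]:
                assignment[t].append(r)
                load[r] += 1
    return assignment

print(greedy_assign({"lift_beam": 2, "sweep": 1, "scan": 1},
                    {"r1": 2, "r2": 1, "r3": 1}))
# e.g. lift_beam -> [r1, r2]; scan reuses r1, which multitasks
```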
|
| |
| 15:00-16:30, Paper WeI2I.285 | Add to My Program |
| FastViDAR: Real-Time Omnidirectional Depth Estimation Via Alternative Hierarchical Attention |
|
| Zhao, Hangtian | Peng Cheng Laboratory |
| Chen, Xiang | East China Normal University |
| Li, Yizhe | Xidian University |
| Wang, Qianhao | Zhejiang University |
| Lu, Haibo | Peng Cheng Laboratory |
| Gao, Fei | Zhejiang University |
Keywords: Omnidirectional Vision, RGB-D Perception, Deep Learning for Visual Perception
Abstract: In this paper, we propose FastViDAR, a novel framework that takes four fisheye camera inputs and produces a full 360° depth map along with per-camera depth, fusion depth, and confidence estimates. Our main contributions are: (1) We introduce the Alternative Hierarchical Attention (AHA) mechanism that efficiently fuses features across views through separate intra-frame and inter-frame windowed self-attention, achieving cross-view feature mixing with reduced overhead. (2) We propose a novel equirectangular projection (ERP) fusion approach that projects multi-view depth estimates to a shared equirectangular coordinate system to obtain the final fusion depth. (3) We generate ERP image-depth pairs using HM3D and 2D-3D-S datasets for comprehensive evaluation, demonstrating competitive zero-shot performance on real datasets while achieving up to 20 FPS on NVIDIA Orin NX embedded hardware.
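Contribution (2) reduces, in its simplest form, to confidence-weighted averaging of per-camera depth maps already warped into a shared ERP grid; the fisheye-to-ERP warping itself is omitted in this sketch:

```python
# Confidence-weighted fusion of ERP-aligned per-camera depth maps. NaNs mark
# pixels a camera does not observe. Inputs are synthetic placeholders.
import numpy as np

def fuse_erp_depth(depths, confidences):
    """depths, confidences: (n_cams, H, W); returns fused depth and total weight."""
    w = np.where(np.isnan(depths), 0.0, confidences)  # zero weight where unseen
    d = np.nan_to_num(depths, nan=0.0)
    total = w.sum(axis=0)
    fused = (w * d).sum(axis=0) / np.maximum(total, 1e-8)
    fused[total == 0] = np.nan                        # never observed anywhere
    return fused, total

d = np.random.rand(4, 8, 16) * 5
d[0, :, :8] = np.nan                                  # camera 0 sees right half only
c = np.random.rand(4, 8, 16)
fused, conf = fuse_erp_depth(d, c)
print(fused.shape, np.isnan(fused).sum())
```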
|
| |
| 15:00-16:30, Paper WeI2I.286 | Add to My Program |
| Non-Contact Tactile Perception in Human-Robot Interaction: Deep Learning-Enhanced Super-Resolution Spatial Sensing |
|
| Zhou, Shuyao | Zhejiang University |
| Liang, Jikai | Zhejiang University |
| Zhu, Zhengjie | Zhejiang University |
| Depeng, Kong | Dongfang Electric (Hangzhou) Innovation Institute Co., Ltd |
| He, Zhiao | Zhejiang University |
| Lyu, Honghao | Zhejiang University |
| Yang, Geng | Zhejiang University |
Keywords: Touch in HRI, Haptics and Haptic Interfaces, Physical Human-Robot Interaction
Abstract: With the increasing deployment of robots in dynamic and unpredictable scenarios, it becomes necessary for robots to acquire not only contact-based but also non-contact tactile signals to enhance environmental understanding. However, current non-contact tactile sensors are largely limited to detecting or coarsely recognizing external stimuli, while achieving high spatial resolution typically entails increased sensor density and complex fabrication. This work presents a flexible sparse 2D sensor array, in conjunction with a tailored deep learning model called adaptive spatial-temporal graph convolutional network (ASTGCN), facilitating 3D spatial super-resolution (SR) perception. Built on single-electrode triboelectric nanogenerators with an optimized layout, the sensor array achieves spatial perception while providing a large perception space at low sensor density. Enhanced by the ASTGCN model, this system achieves an average spatial positioning error of 3.11 mm with a physical resolution of only 23 sensors. This research provides novel insights into non-contact haptic perception systems, enabling spatial super-resolution tasks, including spatial trajectory tracking and non-contact gesture classification with 99.33% accuracy, where the gesture classification is used to control a dexterous hand for human-robot interaction.
|
| |
| 15:00-16:30, Paper WeI2I.287 | Add to My Program |
| Lightweight Guidance Sampling and Deep Refinement Reconstruction Network for Adaptive Compressive Sensing |
|
| Cai, Zhaoxin | Northeastern University |
| Zhang, Yunzhou | Northeastern University |
| Bai, Haoyue | Northeastern University |
| Wang, Lu | Northeastern University |
| Zhang, Tengda | Northeastern University |
| Wang, Sizhan | Northeastern University |
| Zhang, Shibo | Northeastern University |
Keywords: Computer Vision for Automation, Deep Learning for Visual Perception, Visual Learning
Abstract: Adaptive Compressive Sensing (ACS) has attracted increasing attention for its ability to progressively improve image reconstruction quality by dynamically adjusting sampling allocation. Multi-stage sampling is a promising strategy that leverages intermediate reconstructions to guide sampling without relying on image prior information. However, existing multi-stage methods often struggle to capture global structural information, resulting in biased sampling and suboptimal performance. Furthermore, the strong dependency between intermediate reconstruction for sampling guidance and the final reconstruction can hinder targeted optimization. To address these issues, we propose LGDR-Net, a Lightweight Guidance Sampling and Deep Refinement Reconstruction Network. Specifically, the Gradient-Fused Cross-Attention (GFCA) module, embedded within a lightweight guidance network, leverages globally fused information to compensate for incomplete content during multi-stage sampling. Then, sampling resource allocation is driven by inter-stage reconstruction differences, effectively exploiting image sparsity information. Finally, the Deep Refinement Network incorporates a Decoder Dense Feedback Mechanism (DDFM) to reduce cross-layer structural bias and a Multi-Branch Attention Fusion (MBAF) module for improved fine-texture representation. Extensive experiments demonstrate that our proposed LGDR-Net outperforms state-of-the-art methods, achieving an excellent trade-off between computational cost and reconstruction quality.
|
| |
| 15:00-16:30, Paper WeI2I.288 | Add to My Program |
| ED-SLAM: Event-Depth Gaussian Splatting SLAM |
|
| Huang, Jian | Zhejiang University |
| Shen, Haotian | Westlake University |
| Lou, Xinhao | Westlake University |
| Liu, Peidong | Westlake University |
Keywords: Computer Vision for Automation, Sensor Fusion
Abstract: Event-based Gaussian splatting (GS) reconstruction has recently attracted considerable attention. Existing methods usually assume camera poses are known a priori, or lack the robustness to process long event streams when poses are unknown. In this work, we present ED-SLAM, an Event-Depth Gaussian Splatting-based simultaneous localization and mapping (SLAM) pipeline, which is robust to long event streams and does not require ground-truth camera poses. The pipeline achieves high-accuracy pose estimation and high-fidelity 3D reconstruction thanks to the impressive 3D representation capability of Gaussian splatting. In particular, we propose a novel patch-based event-depth tracking algorithm and seamlessly integrate it into the Gaussian splatting mapping pipeline. Extensive experiments on both synthetic and real-world datasets demonstrate that our method significantly improves tracking accuracy and robustness, and also delivers improved reconstruction performance.
|
| |
| 15:00-16:30, Paper WeI2I.289 | Add to My Program |
| Beyond the Patch: Exploring Vulnerabilities of Visuomotor Policies Via Viewpoint-Consistent 3D Adversarial Object |
|
| Lee, Chanmi | Korea Advanced Institute of Science and Technology (KAIST) |
| Yoon, Minsung | Korea Advanced Institute of Science and Technology (KAIST) |
| Kim, Woojae | Korea Advanced Institute of Science and Technology (KAIST) |
| Lee, Sebin | Korea Advanced Institute of Science and Technology (KAIST) |
| Yoon, Sung-eui | Korea Advanced Institute of Science and Technology (KAIST) |
Keywords: Deep Learning for Visual Perception, Deep Learning in Grasping and Manipulation, Deep Learning Methods
Abstract: Neural network–based visuomotor policies enable robots to perform manipulation tasks but remain susceptible to perceptual attacks. For example, conventional 2D adversarial patches are effective under fixed-camera setups, where appearance is relatively consistent; however, their efficacy often diminishes under dynamic viewpoints from moving cameras, such as wrist-mounted setups, due to perspective distortions. To proactively investigate potential vulnerabilities beyond 2D patches, this work proposes a viewpoint-consistent adversarial texture optimization method for 3D objects through differentiable rendering. As optimization strategies, we employ Expectation over Transformation (EOT) with a Coarse-to-Fine (C2F) curriculum, exploiting distance-dependent frequency characteristics to induce textures effective across varying camera–object distances. We further integrate saliency-guided perturbations to redirect policy attention and design a targeted loss that persistently drives robots toward adversarial objects. Our comprehensive experiments show that the proposed method is effective under various environmental conditions, while confirming its black-box transferability and real-world applicability.
|
| |
| 15:00-16:30, Paper WeI2I.290 | Add to My Program |
| Simulation-Ready Tree: High-Quality Dynamic Tree Reconstruction from a Single RGB-D Sensor |
|
| Jin, Hao | Northwest A&F University |
| Jiao, Mingxin | Northwest A&F University |
| Xie, Haoran | Japan Advanced Institute of Science and Technology |
| Hu, Shaojun | Northwest A&F University |
Keywords: Robotics and Automation in Agriculture and Forestry, RGB-D Perception, Agricultural Automation
Abstract: Realistic animation of real trees is challenging due to the difficulty in accurately capturing and simulating their movements under varying environmental conditions. Most real-tree reconstruction methods focus on static modeling of trees from RGB images or LiDAR point clouds. Compared with RGB images, RGB-D (RGB+Depth) sensors provide a low-cost solution for faithful reconstruction of dynamic tree models in 3D. However, it is difficult to capture and reconstruct a complete dynamic tree with complex branch structures using a single RGB-D sensor. In this paper, we propose Simulation-Ready Tree, a dynamic tree reconstruction framework that synthesizes simulation-ready trees by reconstructing 3D tree models and extracting material properties of tree branches from only a single RGB-D sensor. It starts by pre-scanning multi-view RGB-D images around an outdoor tree. To create a complete static tree point cloud, we present a coarse-to-fine registration method that exploits the skeleton features of the main branches across multiple views. Then, a static tree model is reconstructed from the registered point cloud using an improved space colonization algorithm. Subsequently, a DeT (deep RGB-D tracking) model is employed to track the movements of tree branches during pull-testing, and the material properties of the tree are approximated by Fourier transform and half-power bandwidth methods. Next, a simulation-ready tree is created by constructing its hierarchical structures with corresponding material properties. Finally, the modal analysis method of curved cantilever beams is applied to the simulation-ready tree for animating trees under static load. We demonstrate realistic animation results of our framework by comparing against ground-truth RGB-D data sequences for various tree species.
|
| |
| 15:00-16:30, Paper WeI2I.291 | Add to My Program |
| Beyond the Teacher: Leveraging Mixed-Skill Demonstrations for Robust Imitation Learning |
|
| Saharsh, Saharsh | Indian Institute of Science, Bangalore, India |
| Sonkar, Shubham | Indian Institute of Science, Bangalore, India |
| Jagtap, Pushpak | Indian Institute of Science |
| Prakash, Ravi | Indian Institute of Science |
Keywords: Imitation Learning, Probabilistic Inference, Data Sets for Robot Learning
Abstract: Achieving expert-like robotic task execution in dynamic environments typically requires extensive, high-quality expert demonstrations, a significant bottleneck for real-world deployment. We present a novel learning framework that overcomes this data dependency, enabling robots to perform complex periodic tasks with expert-like proficiency, even when learning from naive demonstrations. Our two-stage pipeline first selects a representative demonstration based on user-defined information-aware task intention scores. This single best demonstration is then used to extract a canonical motion shape via Periodic Dynamic Movement Primitives (DMPs). Finally, a Long Short-Term Memory (LSTM) network refines the entire set of demonstrations, leveraging a multi-objective score that combines the canonical shape with mutual information and other task quality metrics. The proposed approach is demonstrated on a Franka Research 3 robot performing phasic tasks across three contrasting domains: wiping in human assistive services, weaving in the textile industry, and pick-and-place operations for warehouse automation. Visit the project page at: https://focaslab.github.io/beyondtheteacher.
|
| |
| 15:00-16:30, Paper WeI2I.292 | Add to My Program |
| Towards Automated Chicken Deboning Via Learning-Based Dynamically-Adaptive 6-DoF Multi-Material Cutting |
|
| Yang, Zhaodong | Georgia Institute of Technology |
| Hu, Ai-Ping | Georgia Tech Research Institute |
| Ravichandar, Harish | Georgia Institute of Technology |
Keywords: Agricultural Automation, Dexterous Manipulation, Simulation and Animation
Abstract: Automating chicken shoulder deboning requires precise 6-DoF cutting through a partially occluded, deformable, multi-material joint, since contact with the bones presents serious health and safety risks. Our work makes both systems-level and algorithmic contributions to train and deploy a reactive force-feedback cutting policy that dynamically adapts a nominal trajectory and enables full 6-DoF knife control to traverse the narrow joint gap while avoiding contact with the bones. First, we introduce an open-source custom-built simulator for multi-material cutting that models coupling, fracture, and cutting forces, and supports reinforcement learning, enabling efficient training and rapid prototyping. Second, we design a reusable physical testbed to emulate the chicken shoulder: two rigid "bone" spheres with controllable pose embedded in a softer block, enabling rigorous and repeatable evaluation while preserving essential multi-material characteristics of the target problem. Third, we train and deploy a residual RL policy, with discretized force observations and domain randomization, enabling robust zero-shot sim-to-real transfer and the first demonstration of a learned policy that debones a real chicken shoulder. Our experiments in our simulator, on our physical testbed, and on real chicken shoulders show that our learned policy reliably navigates the joint gap and reduces undesired bone/cartilage contact, resulting in up to a 4x improvement over existing open-loop cutting baselines in terms of success rate and bone avoidance. Our results also illustrate the necessity of force feedback for safe and effective multi-material cutting. The project website is at https://star-lab.cc.gatech.edu/papers/Yang-automated-deboning.
|
| |
| 15:00-16:30, Paper WeI2I.293 | Add to My Program |
| SM^2ITH: Safe Mobile Manipulation with Interactive Human Prediction Via Task-Hierarchical Bilevel Model Predictive Control |
|
| D'Orazio, Francesco | Sapienza University of Rome |
| Samavi, Sepehr | University of Toronto |
| Du, Xintong | University of Toronto |
| Zhou, Siqi | Technical University of Munich |
| Oriolo, Giuseppe | Sapienza University of Rome |
| Schoellig, Angela P. | TU Munich |
Keywords: Mobile Manipulation, Human-Aware Motion Planning, Collision Avoidance
Abstract: Mobile manipulators are designed to perform complex sequences of navigation and manipulation tasks in human-centered environments. While recent optimization-based methods such as Hierarchical Task Model Predictive Control (HTMPC) enable efficient multitask execution with strict task priorities, they have so far been applied mainly to static or structured scenarios. Extending these approaches to dynamic human-centered environments requires predictive models that capture how humans react to the actions of the robot. This work introduces Safe Mobile Manipulation with Interactive Human Prediction via Task-Hierarchical Bilevel Model Predictive Control (SM^2ITH), a unified framework that combines HTMPC with interactive human motion prediction through bilevel optimization that jointly accounts for robot and human dynamics. The framework is validated on two different mobile manipulators, the Stretch 3 and the Ridgeback–UR10, across three experimental settings: (i) delivery tasks with different navigation and manipulation priorities, (ii) sequential pick-and-place tasks with different human motion prediction models, and (iii) interactions involving adversarial human behavior. Our results highlight how interactive prediction enables safe and efficient coordination, outperforming baselines that rely on weighted objectives or open-loop human models. Code: https://github.com/utiasDSL/sm2ith.git
|
| |
| 15:00-16:30, Paper WeI2I.294 | Add to My Program |
| Cross-Modal Instructions for Robot Motion Generation |
|
| Baron, William | Carnegie Mellon University |
| Dong, Xiaoxiang | Carnegie Mellon University |
| Johnson-Roberson, Matthew | Carnegie Mellon University |
| Zhi, Weiming | The University of Sydney; Vanderbilt University |
Keywords: Learning from Demonstration, Big Data in Robotics and Automation
Abstract: Teaching robots novel behaviors typically requires motion demonstrations via teleoperation or kinaesthetic teaching, that is, physically guiding the robot. While recent work has explored using human sketches to specify desired behaviors, data collection remains cumbersome, and demonstration datasets are difficult to scale. In this paper, we introduce an alternative paradigm, Learning from Cross-Modal Instructions, where robots are shaped by demonstrations in the form of rough annotations, which can contain free-form text labels, and are used in lieu of physical motion. We introduce the CrossInstruct framework, which integrates cross-modal instructions as examples into the context input to a foundational vision–language model (VLM). The VLM then iteratively queries a smaller, fine-tuned model, and synthesizes the desired motion over multiple 2D views. These are then fused into a coherent distribution over 3D motion trajectories in the robot's workspace. By incorporating the reasoning of the large VLM with a fine-grained pointing model, CrossInstruct produces executable robot behaviors that generalize beyond the environments in the limited set of instruction examples. We then introduce a downstream reinforcement learning pipeline that leverages CrossInstruct outputs to efficiently learn policies to complete fine-grained tasks. We rigorously evaluate CrossInstruct on benchmark simulation tasks and real hardware, demonstrating effectiveness without additional fine-tuning and providing a strong initialization for policies subsequently refined via reinforcement learning.
|
| |
| 15:00-16:30, Paper WeI2I.295 | Add to My Program |
| Adaptive Curvature-Aware Routing for Stiff Cable Control Via Dual Manipulation |
|
| Long, JiaHao | South China University of Technology |
| Cong, Yang | South China University of Technology |
| Ren, Yu | South China University of Technology |
Keywords: Manipulation Planning, Dual Arm Manipulation, Robust/Adaptive Control
Abstract: Deformable Linear Objects (DLOs), such as cables and ropes, pose significant challenges for robotic manipulation due to their high-dimensional state space, nonlinear deformation dynamics, and strong sensitivity to external forces. Cable routing tasks, in particular, are further complicated by geometric constraints, residual stresses in stiff cables, and the necessity of precise alignment with designated connectors. Existing approaches often rely on endpoint manipulation or external fixtures, which limits flexibility and scalability in real-world applications. While data-driven and graph-based models have shown promise for flexible ropes, they struggle to generalize across varying cable stiffness and suffer high computational costs. To address these challenges, we propose Adaptive Curvature-Aware Routing (ACR), a dual manipulation framework capable of adaptively handling cables of high stiffness and arbitrary lengths. Specifically, our framework combines local curvature analysis with Radial Basis Function Networks (RBFNs) to predict cable deformations. By prioritizing regions with high curvature discrepancies, it adaptively selects manipulation segments and performs safe, precise corrective actions to shape the cable toward the target configuration without heavy reliance on fixtures. Furthermore, we develop a constraint-aware cooperative controller that integrates both kinematic feasibility and physical safety into the motion strategy. Experiments in both simulation and real-world setups demonstrate that ACR significantly outperforms baseline methods in terms of success rate and terminal accuracy, validating the effectiveness of combining curvature-based adaptivity with data-driven modeling for complex cable routing tasks.
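The curvature-aware selection step can be illustrated with discrete turning-angle curvature on a sampled polyline; choosing the interior point with the largest curvature discrepancy is a toy analogue of the paper's segment selection:

```python
# Discrete turning-angle curvature of a cable polyline, and selection of the
# interior point whose curvature deviates most from the target shape. The
# turning-angle estimate is a standard approximation; the rest is a toy.
import numpy as np

def turning_curvature(points):
    """points: (n, 2) polyline samples; returns (n-2,) signed turning angles."""
    v1 = points[1:-1] - points[:-2]
    v2 = points[2:] - points[1:-1]
    cross = v1[:, 0] * v2[:, 1] - v1[:, 1] * v2[:, 0]
    dot = (v1 * v2).sum(axis=1)
    return np.arctan2(cross, dot)

def worst_segment(current, target):
    gap = np.abs(turning_curvature(current) - turning_curvature(target))
    return int(np.argmax(gap)) + 1        # index of the interior point to grasp

t = np.linspace(0, np.pi, 30)
target = np.stack([t, np.sin(t)], axis=1)
current = target.copy(); current[12:18, 1] += 0.3     # local deformation
print(worst_segment(current, target))
```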
|
| |
| 15:00-16:30, Paper WeI2I.296 | Add to My Program |
| RoboCade: Gamifying Robot Data Collection |
|
| Mirchandani, Suvir | Stanford University |
| Tang, Mia | Stanford University |
| Duan, Jiafei | University of Washington |
| Hamid, Jubayer | Stanford University |
| Cho, Chung Yeung Michael | FrodoBots |
| Sadigh, Dorsa | Stanford University |
Keywords: Telerobotics and Teleoperation, Data Sets for Robot Learning, Human Factors and Human-in-the-Loop
Abstract: Imitation learning from human demonstrations has become a dominant approach for training autonomous robot policies. However, collecting demonstration datasets is costly: it often requires access to robots and needs sustained effort in a tedious, long process. These factors limit the scale of data available for training policies. We aim to address this scalability challenge by involving a broader audience in a gamified data collection experience that is both accessible and motivating. Specifically, we develop a gamified remote teleoperation platform, RoboCade, to engage general users in collecting data that is beneficial for downstream policy training. To do this, we embed gamification strategies into the design of the system interface and data collection tasks. In the system interface, we include components such as visual feedback, sound effects, goal visualizations, progress bars, leaderboards, and badges. We additionally propose principles for constructing gamified tasks that have overlapping structure with useful downstream target tasks. We instantiate RoboCade on three manipulation tasks—including spatial arrangement, scanning, and insertion. To illustrate the viability of gamified robot data collection, we collect a demonstration dataset through our platform, and show that co-training robot policies with this data can improve success rate on non-gamified target tasks (+16-56%). Further, we conduct a user study to validate that novice users find the gamified platform significantly more enjoyable than a standard non-gamified platform (+24%). These results highlight the promise of gamified data collection as a scalable, accessible, and engaging method for collecting demonstration data. Videos are available at robocade.github.io.
|
| |
| 15:00-16:30, Paper WeI2I.297 | Add to My Program |
| Parallel Heuristic Search As Inference for Actor-Critic Reinforcement Learning Models |
|
| Yang, Hanlan | Purdue University |
| Mishani, Itamar | Carnegie Mellon University, Robotics Institute |
| Pivetti, Luca | Università Degli Studi Di Milano-Bicocca |
| Kingston, Zachary | Purdue University |
| Likhachev, Maxim | Carnegie Mellon University |
Keywords: Motion and Path Planning, Reinforcement Learning
Abstract: Actor-critic models are a class of model-free deep reinforcement learning (RL) algorithms that have demonstrated effectiveness across various robot learning tasks. While considerable research has focused on improving training stability and data sampling efficiency, most deployment strategies have remained relatively simplistic, typically relying on direct actor policy rollouts. In contrast, we propose PACHS (Parallel Actor-Critic Heuristic Search), an efficient parallel best-first search algorithm for inference that leverages both components of the actor-critic architecture: the actor network generates actions, while the critic network provides cost-to-go estimates to guide the search. Two levels of parallelism are employed within the search: actions and cost-to-go estimates are generated in batches by the actor and critic networks, respectively, and graph expansion is distributed across multiple threads. We demonstrate the effectiveness of our approach in robotic manipulation tasks, including collision-free motion planning and contact-rich interactions such as non-prehensile pushing. Visit p-achs.github.io for demonstrations and examples.
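The search structure can be sketched with stub actor/critic functions on a one-dimensional toy state: the actor proposes a batch of actions, the critic scores all successors at once, and a priority queue orders expansions by the cost-so-far plus the critic's cost-to-go. This omits the multi-threaded expansion and any real networks:

```python
# Best-first search guided by a stub actor (batched action proposals) and a
# stub critic (batched cost-to-go). Toy 1-D state; not the PACHS codebase.
import heapq
import numpy as np

def actor(state, k=4):                                # batch of candidate actions
    return np.linspace(-1.0, 1.0, k)

def critic(states):                                   # batched cost-to-go estimate
    return np.abs(np.asarray(states) - 10.0)          # distance to goal x = 10

def search(start=0.0, goal=10.0, tol=0.5, max_exp=10_000):
    open_list = [(critic([start])[0], 0.0, start)]    # (f, g, state)
    while open_list and max_exp > 0:
        f, g, s = heapq.heappop(open_list)
        if abs(s - goal) < tol:
            return g, s
        max_exp -= 1
        succ = s + actor(s)                           # expand with actor proposals
        h = critic(succ)                              # score the whole batch at once
        for s2, h2 in zip(succ, h):
            heapq.heappush(open_list, (g + 1.0 + h2, g + 1.0, s2))
    return None

print(search())                                       # e.g. (10.0, 10.0)
```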
|
| |
| 15:00-16:30, Paper WeI2I.298 | Add to My Program |
| KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis |
|
| Hosseinzadeh, Mehdi | The Australian Institute for Machine Learning (AIML) -- The University of Adelaide |
| Wong, King Hang | The Australian Institute for Machine Learning |
| Dayoub, Feras | The University of Adelaide |
|
|
| |
| 15:00-16:30, Paper WeI2I.299 | Add to My Program |
| SeFA-Policy: Fast and Accurate Visuomotor Policy Learning with Selective Flow Alignment |
|
| Xue, Rong | University of Southern California |
| Mao, Jiageng | University of Southern California |
| Zhang, Mingtong | UIUC |
| Wang, Yue | University of Southern California |
Keywords: Imitation Learning, Visual Learning, Deep Learning in Grasping and Manipulation
Abstract: Developing efficient and accurate visuomotor policies poses a central challenge in robotic imitation learning. While recent rectified flow approaches have advanced visuomotor policy learning, they suffer from a key limitation: After iterative distillation, generated actions may deviate from the ground-truth actions corresponding to the current visual observation, leading to accumulated error as the reflow process repeats and unstable task execution. We present Selective Flow Alignment (SeFA), an efficient and accurate visuomotor policy learning framework. SeFA resolves this challenge by a selective flow alignment strategy, which leverages expert demonstrations to selectively correct generated actions and restore consistency with observations, while preserving multimodality. This design introduces a consistency correction mechanism that ensures generated actions remain observation-aligned without sacrificing the efficiency of one-step flow inference. Extensive experiments across both simulated and real-world manipulation tasks show that SeFA surpasses state-of-the-art diffusion-based and flow-based policies, achieving superior accuracy and robustness while reducing inference latency by over 98%. By unifying rectified flow efficiency with observation-consistent action generation, SeFA provides a scalable and dependable solution for real-time visuomotor policy learning.
|
| |
| 15:00-16:30, Paper WeI2I.300 | Add to My Program |
| GPU-Accelerated Continuous-Time Successive Convexification for Contact-Implicit Legged Locomotion |
|
| Buckner, Samuel | University of Washington |
| Elango, Purnanand | Mitsubishi Electric Research Laboratories |
Keywords: Multi-Contact Whole-Body Motion Planning and Control, Legged Robots, Motion and Path Planning
Abstract: Contact-implicit trajectory optimization (CITO) enables the automatic discovery of contact sequences, but most methods rely on fine time discretization to capture all contact events accurately, which increases problem size and runtime while tying solution quality to grid resolution. We extend the recently proposed sequential convex programming (SCP) approach for trajectory optimization, continuous-time successive convexification (ct-SCvx), to CITO by introducing integral cross-complementarity constraints, which eliminate the risk of missing contact events between discretization nodes while preserving the flexibility of contact mode changes. The resulting framework, contact-implicit successive convexification (ci-SCvx), models full multibody dynamics in maximal coordinates, including stick-slip friction and partially elastic impacts. To handle complementarity constraints, we embed a backtracking homotopy scheme within SCP for reliable convergence. We implement this framework in stand-alone Python software, leveraging JAX for GPU acceleration and a custom canonical-form parser for the convex subproblems of SCP to avoid the overhead of general-purpose modeling tools such as CVXPY. We demonstrate ci-SCvx on diverse legged-locomotion tasks. In particular, we validate the approach in MuJoCo with the Gymnasium HalfCheetah model against the MuJoCo MPC baseline, showing that a tracking simulation with the optimized torque profiles from ci-SCvx produces physically consistent trajectories with lower energy consumption. We also show that the resulting software achieves faster solve times than an existing state-of-the-art SCP toolbox by over an order of magnitude, thereby demonstrating a practically important contribution to scalable real-time trajectory optimization.
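Schematically, with gap function $\phi$ and contact force $\lambda$ (generic notation, not necessarily the paper's), the pointwise and integral complementarity conditions read:

```latex
% Pointwise contact complementarity and its integral (cross-complementarity)
% form over one discretization interval. Both factors are nonnegative, so a
% vanishing integral forces phi * lambda = 0 almost everywhere in the
% interval; no contact event can hide between nodes.
\begin{align*}
  &\phi(q(t)) \ge 0, \qquad \lambda(t) \ge 0, \qquad \phi(q(t))\,\lambda(t) = 0
    && \text{(pointwise)} \\
  &\int_{t_k}^{t_{k+1}} \phi(q(t))\,\lambda(t)\,\mathrm{d}t = 0
    && \text{(integral form over the whole interval)}
\end{align*}
```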
|
| |
| 15:00-16:30, Paper WeI2I.301 | Add to My Program |
| Stream-To-Act: ROS 2 Native Token Streaming for Continuous Motion Execution of Vision-Language-Action Models |
|
| Kim, Dahyun | LG Electronics |
| Jeon, Yunseong | Kookmin University |
| Park, Hongkyun | Kookmin University |
| Kim, Jong-Chan | Kookmin University |
Keywords: Software Architecture for Robotic and Automation, Software Tools for Robot Programming, Software, Middleware and Programming Environments
Abstract: Vision-Language Models (VLMs) are increasingly used in robotics for natural language understanding and executable plan generation, yet integrating them into real-time control pipelines remains challenging. Many existing systems rely on HTTP/JSON-based inference interfaces that require repeated Base64 serialization, introducing unnecessary overhead and increasing end-to-end latency. At the execution level, waiting for a full plan leads to stalls where no valid actions are available, while naive streaming of partial plans produces stop-and-go behavior due to token arrival gaps. To address these issues, we extend llama-ros with Stream-to-Act, a ROS2-native execution mechanism that begins acting once sufficient tokens arrive while ensuring continuous execution through an optimal start-time policy. Our open-source implementation is evaluated on RTX GPUs and NVIDIA Jetson platforms through end-to-end latency analysis, token generation throughput measurements, and execution timeline visualization. In addition, a Carla-based driving scenario illustrates how the proposed execution policy eliminates stop-and-go behavior and maintains continuous control, even when the total plan generation time remains unchanged.
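The start-time condition admits a compact reading: begin as late as necessary so that every action has arrived before its execution slot opens. A minimal sketch, assuming predicted token arrival times and known per-action durations (both hypothetical inputs, not the llama-ros API):

```python
# Sketch of an "optimal start time" for streamed plan execution: start as
# early as possible while guaranteeing action k is available before the robot
# finishes action k-1. The continuity condition below is our reading of the
# idea, not the paper's code.
def earliest_start_time(arrival_times, durations):
    """arrival_times[k]: predicted time action k becomes available;
    durations[k]: execution time of action k."""
    start = 0.0
    elapsed = 0.0  # total execution time of actions 0..k-1
    for arrive, dur in zip(arrival_times, durations):
        # Action k must have arrived when its execution slot begins.
        start = max(start, arrive - elapsed)
        elapsed += dur
    return start

# Example: a slow third token pushes the start time out, avoiding stop-and-go.
print(earliest_start_time([0.1, 0.9, 2.5], [0.5, 0.5, 0.5]))  # -> 1.5
```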
|
| |
| 15:00-16:30, Paper WeI2I.302 | Add to My Program |
| Action-Informed Estimation and Planning: Clearing Clutter on Staircases Via Quadrupedal Pedipulation |
|
| Sriganesh, Prasanna | Carnegie Mellon University |
| Satheeshkumar, Barathkrishna | Carnegie Mellon University |
| Sabnis, Anushree Bapusaheb | Carnegie Mellon University |
| Travers, Matthew | Carnegie Mellon University |
Keywords: Object Detection, Segmentation and Categorization, Legged Robots, Integrated Planning and Control
Abstract: For robots to operate autonomously in densely cluttered environments, they must reason about and potentially physically interact with obstacles to clear a path. Safely clearing a path on challenging terrain, such as a cluttered staircase, requires controlled interaction: for example, a quadrupedal robot may push objects out of the way with one leg while maintaining a stable stance with its other three legs. However, tightly coupled physical actions, such as one-legged pushing, create new constraints on the system that can be difficult to predict at design time. In this work, we present a new method that addresses one such constraint, wherein the object being pushed by a quadrupedal robot with one of its legs becomes occluded from the robot's sensors during manipulation. To address this challenge, we present a tightly coupled perception-action framework that enables the robot to perceive clutter, reason about feasible push paths, and execute the clearing maneuver. Our core contribution is an interaction-aware state estimation loop that uses proprioceptive feedback regarding foot contact and leg position to predict an object's displacement during the occlusion. This prediction guides the perception system to robustly re-detect the object after the interaction, closing the loop between action and sensing to enable accurate tracking even after partial pushes. Using this feedback allows the robot to learn from physical outcomes, reclassifying an object as immovable if a push fails because it is too heavy. Results from implementing our approach on a Boston Dynamics Spot robot show that our interaction-aware approach achieves higher task success rates and tracking accuracy in pushing objects on stairs compared to open-loop baselines.
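As a toy illustration of the interaction-aware prediction step, one could integrate the pushing foot's motion while in contact to predict the occluded object's displacement; the move-with-the-foot assumption and all names below are illustrative, not the authors' estimator:

```python
# Toy occlusion-period prediction: while the pushing foot is in contact,
# assume the object moves with the foot; the accumulated displacement seeds
# re-detection after the push. Inputs and the contact model are assumptions.
import numpy as np

def predict_displacement(foot_positions, contact_flags):
    """foot_positions: (T, 3) pushing-foot positions; contact_flags: (T,) bools."""
    delta = np.zeros(3)
    for t in range(1, len(foot_positions)):
        if contact_flags[t] and contact_flags[t - 1]:
            delta += foot_positions[t] - foot_positions[t - 1]
    return delta

def reseed_detector(last_seen_pose, foot_positions, contact_flags):
    # Prior for the perception system's re-detection search.
    return last_seen_pose + predict_displacement(foot_positions, contact_flags)
```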
|
| |
| 15:00-16:30, Paper WeI2I.303 | Add to My Program |
| AdaThinkDrive: Adaptive Thinking Via Reinforcement Learning for Autonomous Driving |
|
| Luo, Yuechen | Tsinghua University |
| Li, Fang | Xiaomi EV |
| Xu, Shaoqing | University of Macau, Xiaomi EV |
| Lai, Zhiyi | Xiaomi EV |
| Yang, Lei | Tsinghua University |
| Chen, Qimao | Tsinghua University |
| Luo, Ziang | Tsinghua University |
| Xie, Zixun | Peking University |
| Jiang, Shengyin | Xiaomi |
| Liu, Jiaxin | Tsinghua University |
| Chen, Long | Wayve |
| Wang, Bing | Xiaomi Corporation |
| Yang, Zhi-Xin | University of Macau |
Keywords: Autonomous Vehicle Navigation, Deep Learning for Visual Perception, Computer Vision for Transportation
Abstract: Reasoning techniques such as Chain-of-Thought (CoT) have been widely adopted in Vision-Language-Action (VLA) models and demonstrate promising capabilities in end-to-end autonomous driving. However, recent efforts to integrate CoT reasoning often fall short in simple scenarios, introducing unnecessary computational overhead without improving decision quality. To address this, we propose AdaThinkDrive, a novel VLA framework with a dual-mode reasoning mechanism inspired by fast and slow thinking. First, our framework is pretrained on large-scale autonomous driving (AD) scenarios using both question-answering (QA) and trajectory datasets to acquire world knowledge and driving commonsense. During supervised fine-tuning (SFT), we introduce a two-mode dataset, fast answering (without CoT) and slow thinking (with CoT), enabling the model to distinguish between scenarios that require reasoning. Furthermore, an Adaptive Think Reward strategy is proposed in conjunction with Group Relative Policy Optimization (GRPO), which rewards the model for selectively applying CoT by comparing trajectory quality across different reasoning modes. Extensive experiments on the Navsim benchmark show that AdaThinkDrive achieves a PDMS of 90.3, surpassing the best vision-only baseline by 1.7 points. Moreover, ablations show that AdaThinkDrive surpasses both the never-Think and always-Think baselines, improving PDMS by 2.0 and 1.4, respectively. It also reduces inference time by 14% compared to the always-Think baseline, demonstrating its ability to balance accuracy and efficiency through adaptive reasoning.
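A toy sketch of the mode-selection reward idea, comparing trajectory quality across reasoning modes; the margin, penalty values, and scoring function are placeholder assumptions, and GRPO's group-relative baseline is omitted:

```python
# Toy adaptive think reward: reward the chosen mode only when it was the
# right call, judged by comparing trajectory quality with and without CoT.
# All constants are illustrative, not the paper's reward.
def adaptive_think_reward(used_cot, score_with_cot, score_without_cot,
                          margin=0.0, think_penalty=0.1):
    thinking_helped = (score_with_cot - score_without_cot) > margin
    if used_cot:
        # Always pay the latency penalty; worth it only if CoT improved quality.
        return (1.0 if thinking_helped else -1.0) - think_penalty
    # Fast answering is correct exactly when thinking would not have helped.
    return 1.0 if not thinking_helped else -1.0
```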
|
| |
| 15:00-16:30, Paper WeI2I.304 | Add to My Program |
| KALIKO: Kalman-Implicit Koopman Operator Learning for Prediction of Nonlinear Dynamical Systems |
|
| Li, Albert H. | California Institute of Technology |
| Jimenez Rodriguez, Ivan Dario | California Institute of Technology |
| Burdick, Joel | California Institute of Technology |
| Yue, Yisong | California Institute of Technology |
| Ames, Aaron | California Institute of Technology |
Keywords: Representation Learning, Deep Learning Methods, Dynamics
Abstract: Long-horizon dynamical prediction is fundamental in robotics and control, underpinning canonical methods like model predictive control. Yet, many systems and disturbance phenomena are difficult to model due to effects like nonlinearity, chaos, and high-dimensionality. Koopman theory addresses this by modeling the linear evolution of embeddings of the state under an infinite-dimensional linear operator that can be approximated with a suitable finite basis of embedding functions, effectively trading model nonlinearity for representational complexity. However, explicitly computing a good choice of basis is nontrivial, and poor choices may cause inaccurate forecasts or overfitting. To address this, we present Kalman-Implicit Koopman Operator (KALIKO) Learning, a method that leverages the Kalman filter to implicitly learn embeddings corresponding to latent dynamics without requiring an explicit encoder. KALIKO produces interpretable representations consistent with both theory and prior works, yielding high-quality reconstructions and inducing globally linear latent dynamics. Evaluated on wave data generated by a high-dimensional PDE, KALIKO surpasses several baselines in open-loop prediction and in a demanding closed-loop simulated control task: stabilizing an underactuated manipulator's payload by predicting and compensating for strong wave disturbances.
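For reference, the textbook Kalman-filter recursion over a linear latent model (dynamics z' = A z, read-out y = C z) that such a Kalman-implicit approach builds on; this is the standard primitive, not the paper's training loop:

```python
# Standard Kalman predict/update over linear latent dynamics. Q and R are the
# process and measurement noise covariances.
import numpy as np

def kalman_step(z, P, y, A, C, Q, R):
    # Predict under the (learned) linear latent dynamics.
    z_pred = A @ z
    P_pred = A @ P @ A.T + Q
    # Update with the new observation y.
    S = C @ P_pred @ C.T + R             # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)  # Kalman gain
    z_new = z_pred + K @ (y - C @ z_pred)
    P_new = (np.eye(len(z)) - K @ C) @ P_pred
    return z_new, P_new
```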
|
| |
| 15:00-16:30, Paper WeI2I.305 | Add to My Program |
| SoftMimicGen: A Data Generation System for Scalable Robot Learning in Deformable Object Manipulation |
|
| Moghani, Masoud | University of Toronto |
| Azizian, Mahdi | Nvidia |
| Garg, Animesh | Georgia Institute of Technology |
| Zhu, Yuke | The University of Texas at Austin |
| Huver, Sean | NVIDIA |
| Mandlekar, Ajay Uday | NVIDIA |
Keywords: Learning from Demonstration, Data Sets for Robot Learning
Abstract: Large-scale robot datasets have facilitated the learning of a wide range of robot manipulation skills, but these datasets remain difficult to collect and scale further, owing to the intractable amount of human time, effort, and cost required. Simulation and synthetic data generation have proven to be an effective way to meet this need for data, especially given recent work showing that such synthetic datasets can dramatically reduce real-world data requirements and facilitate generalization to novel scenarios unseen in real-world demonstrations. However, this paradigm has been limited to rigid-body tasks, which are easy to simulate. Deformable object manipulation encompasses a large portion of real-world manipulation and remains a crucial gap to address towards increasing adoption of the synthetic simulation data paradigm. In this paper, we introduce SoftMimicGen, an automated data generation pipeline for deformable object manipulation tasks. We introduce a suite of high-fidelity simulation environments that encompasses a wide range of deformable objects (stuffed animal, rope, tissue, towel) and manipulation behaviors (high-precision threading, dynamic whipping, folding, pick-and-place), across four robot embodiments: a single-arm manipulator, bimanual arms, a humanoid, and a surgical robot. We apply SoftMimicGen to generate datasets across the task suite, train high-performing policies from the data, and systematically analyze the data generation system. Project website: softmimicgen.github.io
|
| |
| 15:00-16:30, Paper WeI2I.306 | Add to My Program |
| GARO: Geometry-Aware Redundancy Optimization for Real-Time and High-Fidelity Dynamic Gaussian Splatting |
|
| Xue, Huiwen | Northwestern Polytechnical University |
| Zhao, Kaixing | Northwestern Polytechnical University |
| Ming, Zuheng | Université Sorbonne Paris Nord |
| Li, Tingcheng | Suzhou University of Science and Technology |
Keywords: Deep Learning for Visual Perception, Visual Learning, Mapping
Abstract: Novel view synthesis is a key task for dynamic scene reconstruction, where high rendering speed is essential for applications such as virtual reality. Existing deformable Gaussian Splatting methods achieve high-fidelity dynamic scene modeling, but still face limitations in memory usage and rendering efficiency due to large numbers of redundant Gaussians. To address these challenges, we propose Geometry-Aware Redundancy Optimization (GARO), a unified redundancy measurement framework in the adaptive density control stage of the traditional dynamic scene reconstruction pipeline. This framework first selects low-gradient candidates using an optimization activity assessment strategy, and then evaluates geometric complexity through low curvature analysis to further filter and prune redundant points, resulting in a compact and expressive Gaussian representation. Extensive experiments on synthetic and real-world datasets demonstrate that GARO achieves robust trade-offs between quality and speed, with PSNR remaining stable and rendering speed improved by 2×, validating the efficiency and effectiveness of GARO.
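A minimal sketch of the two-stage redundancy test described above; the thresholds and the exact activity/curvature statistics are illustrative assumptions:

```python
# Two-stage redundancy test: flag Gaussians with low optimization activity
# (small accumulated gradient) that also sit in geometrically simple,
# low-curvature regions, then prune them. Inputs and thresholds are
# illustrative, not the paper's exact statistics.
import numpy as np

def redundancy_mask(grad_norms, curvatures, grad_tau=1e-4, curv_tau=0.05):
    low_activity = grad_norms < grad_tau    # stage 1: optimization activity
    low_curvature = curvatures < curv_tau   # stage 2: geometric complexity
    return low_activity & low_curvature     # True = redundant, safe to prune

# usage during adaptive density control:
#   params = params[~redundancy_mask(grad_norms, curvatures)]
```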
|
| |
| 15:00-16:30, Paper WeI2I.307 | Add to My Program |
| C^2ROPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning |
|
| Ye, Guanting | University of Macau, State Key Laboratory of Internet of Things for Smart City |
| Zhao, Qiyan | Chinese Academy of Sciences |
| Yu, Wenhao | University of Science and Technology of China |
| Zhang, Xiaofeng | Shanghai Jiao Tong University |
| Ji, Jianmin | University of Science and Technology of China |
| Zhang, Yanyong | University of Science and Technology of China |
| Yuen, Ka-Veng | University of Macau |
|
|
| |
| 15:00-16:30, Paper WeI2I.308 | Add to My Program |
| GeoTeacher: Geometry-Guided Semi-Supervised 3D Object Detection |
|
| Li, Jingyu | Fudan University |
| Zhao, Xiaolong | Tongji University |
| Liu, Zhe | The University of Hong Kong |
| Wu, Wenxiao | Huazhong University of Science and Technology |
| Zhang, Li | Fudan University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, Object Detection, Segmentation and Categorization
Abstract: Semi-supervised 3D object detection (SS3D), aiming to explore unlabeled data for boosting 3D object detectors, has emerged as an active research area in recent years. Some previous methods have shown substantial improvements by either employing heterogeneous teacher models to provide high-quality pseudo labels or enforcing feature-perspective consistency between the teacher and student networks. However, these methods overlook the fact that the model usually tends to exhibit low sensitivity to object geometries with limited labeled data, making it difficult to capture geometric information, which is crucial for enhancing the student model’s ability in object perception and localization. In this paper, we propose GeoTeacher to enhance the student model's ability to capture geometric relations of objects with limited training data, especially unlabeled data. We design a keypoint-based geometric relation supervision module that transfers the teacher model’s knowledge of object geometry to the student, thereby improving the student’s capability in understanding geometric relations. Furthermore, we introduce a voxel-wise data augmentation strategy that increases the diversity of object geometries, thereby further improving the student model’s ability to comprehend geometric structures. To preserve the integrity of distant objects during augmentation, we incorporate a distance-decay mechanism into this strategy. Moreover, GeoTeacher can be combined with different SS3D methods to further improve their performance. Extensive experiments on the ONCE and Waymo datasets demonstrate the effectiveness and generalization of our method, achieving new state-of-the-art results. Code will be available at https://github.com/LogosRoboticsGroup/GeoTeacher
|
| |
| 15:00-16:30, Paper WeI2I.309 | Add to My Program |
| ImagiDrive: A Unified Imagination-And-Planning Framework for Autonomous Driving |
|
| Li, Jingyu | Fudan University |
| Zhang, Bozhou | Fudan University |
| Jin, Xin | Eastern Institute for Advanced Study |
| Deng, Jiankang | Imperial College London |
| Zhu, Xiatian | University of Surrey |
| Zhang, Li | Fudan University |
Keywords: Autonomous Vehicle Navigation, Computer Vision for Automation, Integrated Planning and Learning
Abstract: Autonomous driving requires rich contextual comprehension and precise predictive reasoning to navigate dynamic and complex environments safely. Vision-Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of this challenge. VLMs provide interpretability and robust action prediction through their ability to understand multi-modal context, while DWMs excel in generating detailed and plausible future driving scenarios essential for proactive planning. Integrating VLMs with DWMs is an intuitive, promising, yet understudied strategy to exploit the complementary strengths of accurate behavioral prediction and realistic scene generation. Nevertheless, this integration presents notable challenges, particularly in effectively connecting action-level decisions with high-fidelity pixel-level predictions and maintaining computational efficiency. In this paper, we propose ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer to form a unified imagination-and-planning loop. The driving agent predicts initial driving trajectories based on multi-modal inputs, guiding the scene imaginer to generate corresponding future scenarios. These imagined scenarios are subsequently utilized to iteratively refine the driving agent’s planning decisions. To address efficiency and predictive accuracy challenges inherent in this integration, we introduce an early stopping mechanism and a trajectory selection strategy. Extensive experimental validation on the nuScenes and NAVSIM datasets demonstrates the robustness and superiority of ImagiDrive over previous alternatives under both open-loop and closed-loop conditions.
|
| |
| 15:00-16:30, Paper WeI2I.310 | Add to My Program |
| OrthoSwarm: Orthoimagery Drone Swarms |
|
| Zhao, Tuhao | Sun Yat-Sen University |
| Yi, Peng | Sun Yat-Sen University |
| Zhai, Haozhou | Sun Yat-Sen University |
| Hu, Tianjiang | Sun Yat-Sen University |
Keywords: Aerial Systems: Applications, Swarm Robotics, Search and Rescue Robots
Abstract: This paper addresses the urgent need for rapid synthesis of georeferenced orthoimages in post-disaster scenarios, where pre-disaster satellite maps cannot be directly reused due to significant urban changes. Drone swarms offer the advantages of large scale, wide aerial views, and rapid coverage of disaster-stricken areas. However, synthesizing georeferenced orthoimages within limited time remains challenging without camera calibration, primarily due to inevitable inconsistencies in intrinsics and extrinsics across different cameras, as well as sensor errors. To tackle this issue, we propose OrthoSwarm, a parallelizable calibration-free system architecture that leverages drone swarms' rectilinear path planning and pre-disaster satellite maps for efficient orthoimage synthesis. OrthoSwarm's performance is validated on a self-constructed benchmark dataset, generated by drone swarms in a digital twin city covering 3 natural disaster scenarios (debris, waterlogging, haze), with real-world validation using real single-drone aerial videos split into segments to simulate swarm acquisition. Experimental results from both simulated and real-captured data confirm the effectiveness of the proposed approach, enabling fast and visually consistent georeferenced orthoimage synthesis in stable post-disaster environments to support first responders promptly.
|
| |
| 15:00-16:30, Paper WeI2I.311 | Add to My Program |
| Physically-Grounded Data Generation Via Video Diffusion Models |
|
| Yenamandra, Sriram | Stanford University |
| Sadigh, Dorsa | Stanford University |
Keywords: Data Sets for Robot Learning, Imitation Learning, Simulation and Animation
Abstract: Existing datasets for training generalist manipulation policies often lack diversity in object variety and initial states, limiting the range of physically grounded interactions present in them. Consequently, these policies struggle with unseen object shapes, sizes, or unfamiliar object poses. Manually collecting real-world trajectories with diverse physical interactions is tedious, time-consuming, and expensive, underscoring the need to generate these autonomously. Simulators offer a scalable pathway to autonomously generate trajectories by enabling extensive variation not only in tasks (e.g., objects, object properties, and initial conditions), but also in the robot behaviors required to solve these tasks. We develop a data generation pipeline that autonomously produces physically grounded trajectories in simulation using video diffusion models. Our approach first simulates random initial conditions across various tasks using a diverse asset library. A video diffusion model generates videos of a robot performing these tasks in physically diverse scenarios, which are then fed to a learned goal-conditioned planner to extract actions that closely follow the generated videos. Unlike prior trajectory generation methods, our pipeline generalizes to new objects across multiple tasks without relying on human demonstrations. Using our approach, we generate a simulation dataset PHYSVIVID containing 5k+ demonstrations involving 400+ objects. We demonstrate the effectiveness of PHYSVIVID by fine-tuning robot policies on it and showing that the resulting policies generalize to unseen objects with varying shapes, textures, and sizes, as well as to unseen object categories.
|
| |
| 15:00-16:30, Paper WeI2I.312 | Add to My Program |
| CAVER: Curious AudioVisual Exploring Robot |
|
| Macesanu, Luca | New York University |
| Folefack, Boueny | The University of Texas at Austin |
| Ray, Ruchira | University of Texas at Austin |
| Abbatematteo, Ben | The University of Texas at Austin |
| Martín-Martín, Roberto | University of Texas at Austin |
Keywords: Robot Audition, Perception for Grasping and Manipulation, Perception-Action Coupling
Abstract: Multimodal audiovisual perception can enable new avenues for robotic manipulation, from better material classification to the imitation of demonstrations for which only audio signals are available (e.g., playing a tune by ear). However, to unlock such multimodal potential, robots need to learn the correlations between an object’s visual appearance and the sound it generates when they interact with it. Such an active sensorimotor experience requires new interaction capabilities, representations, and exploration methods to guide the robot in efficiently building increasingly rich audiovisual knowledge. In this work, we present CAVER, a novel robot that builds and utilizes rich audiovisual representations of objects. CAVER includes three novel contributions: 1) a novel 3D-printed end-effector, attachable to parallel grippers, that excites objects’ audio responses, 2) an audiovisual representation that combines local and global appearance information with sound features, and 3) an exploration algorithm that uses and builds the audiovisual representation in a curiosity-driven manner that prioritizes interacting with high-uncertainty objects to obtain good coverage of surprising audio with fewer interactions. We demonstrate that CAVER builds rich representations in different scenarios more efficiently than several exploration baselines, and that the learned audiovisual representation leads to significant improvements in material classification and the imitation of audio-only human demonstrations.
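A toy version of the curiosity-driven selection rule, assuming ensemble disagreement as the uncertainty proxy (the paper's exact uncertainty measure may differ):

```python
# Pick the next object to interact with: the one whose predicted audio
# response is most uncertain, so few interactions cover the surprising
# objects. Ensemble variance as uncertainty is an illustrative assumption.
import numpy as np

def next_object(audio_predictions_per_model):
    """audio_predictions_per_model: (n_models, n_objects, feat_dim) ensemble outputs."""
    # Disagreement across ensemble members as an uncertainty proxy.
    uncertainty = audio_predictions_per_model.var(axis=0).mean(axis=-1)
    return int(np.argmax(uncertainty))
```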
|
| |
| 15:00-16:30, Paper WeI2I.313 | Add to My Program |
| Adaptive Event Stream Slicing for Open-Vocabulary Event-Based Object Detection Via Vision-Language Knowledge Distillation |
|
| Zhang, Jinchang | Binghamton University |
| Li, Zijun | Binghamton University |
| Lin, Jiakai | University of Georgia |
| Lu, Guoyu | Binghamton University |
Keywords: Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception
Abstract: Event cameras offer advantages in object detection tasks due to properties such as high-speed response, low latency, and robustness to motion blur. However, event cameras inherently lack texture and color information, making open-vocabulary detection particularly challenging. Current event-based detection methods are typically trained on predefined target categories, limiting their ability to generalize to novel objects in real-world settings, where encountering previously unseen objects is common. Vision-language models (VLMs) have enabled open-vocabulary object detection in RGB images. However, the modality gap between images and event streams makes it ineffective to directly transfer CLIP to event data, as CLIP was not designed for event streams. To bridge this gap, we propose an event-image knowledge distillation framework, leveraging CLIP’s semantic understanding to achieve open-vocabulary object detection on event data. Instead of training CLIP directly on event streams, we use image frames as teacher model inputs, guiding the event-based student model to learn CLIP’s rich visual representations. Through spatial attention-based distillation, the student network learns meaningful visual features directly from raw event inputs, while inheriting CLIP’s broad visual knowledge. Furthermore, to prevent information loss due to event data segmentation, we design a hybrid Spiking Neural Network (SNN) and Convolutional Neural Network (CNN) framework. Unlike fixed-group event segmentation methods, which often discard crucial temporal information, our SNN adaptively determines the optimal event segmentation moments, ensuring that key temporal features are extracted. The extracted event features are then processed by CNNs for object detection.
|
| |
| 15:00-16:30, Paper WeI2I.314 | Add to My Program |
| A Parametric Wave-Structured 3-DoF Compliant Joint with Tunable Stiffness |
|
| Saeed, Anjum | University of Siena |
| Dragusanu, Mihai | University of Siena |
| Malvezzi, Monica | University of Siena |
| Prattichizzo, Domenico | Università di Siena |
| Salvietti, Gionata | University of Siena |
Keywords: Compliant Joints and Mechanisms, Tendon/Wire Mechanism, Grippers and Other End-Effectors
Abstract: Soft–rigid tendon-driven robotic hands are widely adopted due to their simple fabrication and effective compliance, enabling robust and adaptive grasping. However, achieving dexterity, such as in-hand manipulation, remains challenging because actuation systems typically constrain finger trajectories. This paper presents a novel parametric wave-structured 3-DoF compliant joint with tunable stiffness, designed to enhance dexterity while maintaining a compact form factor. The joint combines a compliant structure and a particular geometry with a Twisted String Actuation (TSA) system, allowing simultaneous modulation of joint stiffness and the mobility of a universal joint that reproduces the flexion/extension and abduction/adduction motions of the human fingers. Two tendons, independently actuated, control asymmetric bending and stiffness regulation, while a third tendon drives flexion/extension. Analytical modeling and numerical simulations are provided to characterize the kinematics, statics, and stiffness modulation properties of the joint. A functional prototype demonstrates significant improvements in workspace and dexterity when integrated as the base joint of a wearable robotic supernumerary finger. Experimental evaluations validate the proposed design and confirm its potential as a versatile building block for dexterous, lightweight, and adaptive robotic hands.
|
| |
| 15:00-16:30, Paper WeI2I.315 | Add to My Program |
| An Annotation-To-Detection Framework for Autonomous and Robust Vine Trunk Localization in the Field by Mobile Agricultural Robots |
|
| Chatziparaschis, Dimitrios | UC Riverside |
| Scudiero, Elia | University of California, Riverside |
| Sams, Brent | Gallo |
| Karydis, Konstantinos | University of California, Riverside |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation, Field Robots
Abstract: The dynamic and heterogeneous nature of agricultural fields presents significant challenges for object detection and localization, particularly for autonomous mobile robots that are tasked with surveying previously unseen unstructured environments. Concurrently, there is a growing need for real-time detection systems that do not depend on large-scale manually labeled real-world datasets. In this work, we introduce a comprehensive annotation-to-detection framework designed to train a robust multi-modal detector using limited and partially labeled training data. The proposed methodology incorporates cross-modal annotation transfer and an early-stage sensor fusion pipeline, which, in conjunction with a multi-stage detection architecture, effectively trains and enhances the system's multi-modal detection capabilities. The effectiveness of the framework was demonstrated through vine trunk detection in novel vineyard settings featuring diverse lighting conditions and varying crop densities. When integrated with a customized multi-modal LiDAR Odometry and Mapping (LOAM) algorithm and a tree association module, the system demonstrated high-performance trunk localization, successfully identifying over 70% of trees in a single traversal with a mean distance error of less than 0.37 m. The results reveal that by leveraging multi-modal, incremental-stage annotation and training, the proposed framework achieves robust detection performance despite limited starting annotations, showcasing its potential for real-world and near-ground agricultural applications.
|
| |
| 15:00-16:30, Paper WeI2I.316 | Add to My Program |
| MotionGS-SLAM: Event-Modulated Gaussian Splatting for Motion-Blur Robust SLAM |
|
| Hu, Zhiqiang | Tokyo University of Science |
| Huang, Shouren | Tokyo University of Science |
| Ishikawa, Masatoshi | Tokyo University of Science |
Keywords: SLAM, Deep Learning Methods
Abstract: Current Vision-based SLAM systems fail catastrophically when motion blur corrupts the visual input, as they attempt the ill-posed inverse problem of recovering sharp content from degraded observations. We present MotionGS-SLAM, which fundamentally reimagines motion blur handling through a paradigm shift: rather than removing blur artifacts, we reformulate the challenge as a well-constrained forward problem that generatively models blur formation within the rendering pipeline. By leveraging event cameras' microsecond temporal resolution and immunity to motion blur, we introduce a novel event-modulated Gaussian kernel that dynamically adapts each Gaussian's rasterization based on precise motion cues. Our dual-modulation mechanism transforms 2D Gaussian projections from isotropic dots into anisotropic, motion-aligned elliptical brush strokes (spatial modulation) while adaptively varying exposure integral sampling density based on local velocity (temporal modulation). This physics-based approach enables joint optimization of intra-exposure camera trajectories and 3D scene geometry through blur-aware photometric and event-based constraints. Extensive experiments demonstrate significant improvements over state-of-the-art methods in trajectory accuracy and map quality under severe high-motion conditions.
|
| |
| 15:00-16:30, Paper WeI2I.317 | Add to My Program |
| Disturbance-Aware Adaptive Compensation in Hybrid Force-Position Locomotion Policy for Legged Robots |
|
| Zhang, Yang | Shanghai Jiao Tong University |
| Cao, Zhanxiang | Shanghai Jiao Tong University, Shanghai Innovation Institute |
| Nie, Buqing | Shanghai Jiao Tong University |
| Fu, Yangqing | Shanghai Jiao Tong University |
| Li, Haoyang | Shanghai Jiao Tong University, Shanghai Innovation Institute |
| Zhang, Zheng | Shanghai Jiao Tong University, Shanghai Innovation Institute |
| Chen, Yizhi | Shanghai Innovation Institute |
| Yang, Xiaokang | Shanghai Jiao Tong University |
| Gao, Yue | Shanghai Jiao Tong University, Shanghai Innovation Institute |
Keywords: Legged Robots, Reinforcement Learning, Robust/Adaptive Control
Abstract: Reinforcement Learning (RL)-based methods have significantly improved the locomotion performance of legged robots. However, these motion policies face significant challenges when deployed in the real world. Robots operating in uncertain environments struggle to adapt to payload variations and external disturbances, resulting in severe degradation of motion performance. In this work, we propose a novel Hybrid Force-Position Locomotion Policy (HFPLP) learning framework, where the action space of the policy is defined as a combination of target joint positions and feedforward torques, enabling the robot to rapidly respond to payload variations and external disturbances. In addition, the proposed Disturbance-Aware Adaptive Compensation (DAAC) provides compensation actions in the torque space based on external disturbance estimation, enhancing the robot's adaptability to dynamic environmental changes. We validate our approach in both simulation and real-world deployment, demonstrating that it outperforms existing methods in carrying payloads and resisting disturbances.
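A minimal sketch of how such a hybrid action could map to joint torques, assuming a standard PD tracking law plus the policy's feedforward torque and DAAC's compensation term (gains and the compensation source are illustrative):

```python
# Hybrid force-position action to joint torque: PD tracking of the policy's
# position target, plus the policy's feedforward torque, plus a compensation
# torque from an external-disturbance estimate (DAAC). Gains are placeholders.
import numpy as np

def joint_torque(q, dq, q_target, tau_ff, tau_comp, kp=30.0, kd=1.0):
    """q, dq: measured joint positions/velocities; q_target, tau_ff: policy
    outputs; tau_comp: disturbance-aware compensation torque."""
    return kp * (q_target - q) - kd * dq + tau_ff + tau_comp
```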
|
| |
| 15:00-16:30, Paper WeI2I.318 | Add to My Program |
| Real-Time Glass Detection and Reprojection Using Sensor Fusion Onboard Aerial Robots |
|
| Hopkins, Malakhi | University of Pennsylvania |
| Murali, Varun | Texas A&M University |
| Kumar, Vijay | University of Pennsylvania |
| Taylor, Camillo Jose | University of Pennsylvania |
Keywords: Sensor Fusion, Mapping, Aerial Systems: Perception and Autonomy
Abstract: Autonomous aerial robots are increasingly being deployed in real-world scenarios, where transparent obstacles present significant challenges to reliable navigation and mapping. These materials pose a unique problem for traditional perception systems because they lack discernible features and can cause conventional depth sensors to fail, leading to inaccurate maps and potential collisions. To ensure safe navigation, robots must be able to accurately detect and map these transparent obstacles. Existing methods often rely on large, expensive sensors or algorithms that impose high computational burdens, making them unsuitable for low Size, Weight, and Power (SWaP) robots. We present a resource-constrained sensing pipeline for detecting and mapping transparent planar obstacles onboard a sub-300g quadrotor. By exploiting Time-of-Flight (ToF) speckle morphology and sonar-gated fusion, our system identifies specular reflections and reprojects their depth into empty space regions in real-time, with safety margins analytically validated for indoor flight speeds. The entire pipeline operates onboard an embedded processor using approximately 20% of a single CPU core at 2 Hz. We validate our system through experiments in controlled and real-world environments, confirming its ability to accurately render transparent obstacles visible. To our knowledge, this is the first CPU-only, real-time demonstration of transparent plane reprojection on a sub-300g quadrotor.
|
| |
| 15:00-16:30, Paper WeI2I.319 | Add to My Program |
| Flow with the Force Field: Learning 3D Compliant Flow Matching Policies from Force and Demonstration-Guided Simulation Data |
|
| Li, Tianyu | University of Pennsylvania |
| Li, Yihan | University of Pennsylvania |
| Zhang, Zizhe | University of Pennsylvania |
| Figueroa, Nadia | University of Pennsylvania |
Keywords: Imitation Learning, Learning from Demonstration, Compliance and Impedance Control
Abstract: While visuomotor policies have advanced in recent years, contact-rich tasks still remain a challenge. Robotic manipulation tasks that require continuous contact demand explicit handling of compliance and force. However, most visuomotor policies ignore compliance, overlooking the importance of physical interaction with the real world, often leading to excessive contact forces or fragile behavior under uncertainty. Introducing force information into vision-based imitation learning could help improve awareness of contacts. However, current visuomotor policy approaches require large amounts of data to perform well. One remedy for data scarcity is to generate data in simulation, yet computationally taxing processes are required to generate data good enough not to suffer from the Sim2Real gap. In this work, we introduce a framework for generating force-informed data in simulation, instantiated by a single human demonstration, and show how coupling with a compliant policy improves the performance of a visuomotor policy learned from synthetic data. We validate our approach on real-robot tasks, including non-prehensile block flipping and bimanual object moving, where the learned policy exhibits reliable contact maintenance and adaptation to novel conditions. Project Website: https://flow-with-the-force-field.github.io/webpage/
|
| |
| 15:00-16:30, Paper WeI2I.320 | Add to My Program |
| GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control |
|
| Chen, Anthony | Peking University |
| Zheng, Wenzhao | University of California, Berkeley |
| Wang, Yida | Li Auto Inc |
| Zhang, Xueyang | Li Auto Inc |
| Zhan, Kun | Li Auto Inc |
| Jia, Peng | Li Auto Inc |
| Keutzer, Kurt | University of California, Berkeley |
| Zhang, Shanghang | Peking University |
Keywords: Simulation and Animation, Autonomous Agents, Computational Geometry
Abstract: Recent advancements in world models have revolutionized dynamic environment simulation, allowing systems to foresee future states and assess potential actions. In autonomous driving, these capabilities help vehicles anticipate the behavior of other road users, perform risk-aware planning, accelerate training in simulation, and adapt to novel scenarios, thereby enhancing safety and reliability. Current approaches either fail to maintain robust 3D geometric consistency or accumulate artifacts during occlusion handling, both of which are critical for reliable safety assessment in autonomous navigation tasks. To address this, we introduce GeoDrive, which explicitly integrates robust 3D geometry conditions into driving world models to enhance spatial understanding and action controllability. Specifically, we first extract a 3D representation from the input frame and then obtain its 2D rendering based on the user-specified ego-car trajectory. To enable dynamic modeling, we propose a dynamic editing module during training to enhance the renderings by editing the positions of the vehicles. Extensive experiments demonstrate that our method significantly outperforms existing models in both action accuracy and 3D spatial awareness, leading to more realistic, adaptable, and reliable scene modeling for safer autonomous driving. Additionally, our model can generalize to novel trajectories and offers interactive scene editing capabilities, such as object editing and object trajectory control.
|
| |
| 15:00-16:30, Paper WeI2I.321 | Add to My Program |
| HE-VPR: Height Estimation Enabled Aerial Visual Place Recognition against Scale Variance |
|
| He, Mengfan | Tsinghua University |
| Shao, Xingyu | Tsinghua University |
| Li, Chunyu | Beijing Institute of Technology |
| Chen, Chao | Beijing University of Chemical Technology |
| Sun, Liangzheng | Beijing Information Science & Technology University |
| Meng, Ziyang | Tsinghua University |
| Wu, YuanQing | Guangdong University of Technology |
Keywords: Aerial Systems: Perception and Autonomy, Deep Learning for Visual Perception
Abstract: In this work, we propose HE-VPR, a visual place recognition (VPR) framework that incorporates height estimation. Our system decouples height inference from place recognition, allowing both modules to share a frozen DINOv2 backbone. Two lightweight bypass adapter branches are integrated into our system. The first estimates the height partition of the query image via retrieval from a compact height database, and the second performs VPR within the corresponding height-specific sub-database. The adaptation design reduces training cost and significantly decreases the search space of the database. We also adopt a center-weighted masking strategy to further enhance the robustness against scale differences. Experiments on two self-collected challenging multi-altitude datasets demonstrate that HE-VPR achieves up to 6.1% Recall@1 improvement over state-of-the-art ViT-based baselines and reduces memory usage by up to 90%. These results indicate that HE-VPR offers a scalable and efficient solution for height-aware aerial VPR, enabling practical deployment in GNSS-denied environments. All the code and datasets for this work have been released on https://github.com/hmf21/HE-VPR.
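A minimal sketch of the two-stage retrieval, assuming precomputed descriptors from the frozen backbone plus the two adapter branches (array shapes and dot-product scoring are assumptions):

```python
# Two-stage retrieval: (1) estimate the query's height partition against a
# compact height database, (2) run place recognition only inside that
# height-specific sub-database. Descriptor extraction is assumed given.
import numpy as np

def retrieve(query_desc_height, query_desc_place, height_db, sub_dbs):
    """height_db: (n_partitions, d) one descriptor per height band;
    sub_dbs: list of (n_i, d) place descriptors, one array per band."""
    band = int(np.argmax(height_db @ query_desc_height))  # stage 1: height
    scores = sub_dbs[band] @ query_desc_place             # stage 2: place
    return band, int(np.argmax(scores))
```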
|
| |
| 15:00-16:30, Paper WeI2I.322 | Add to My Program |
| Lightweight Learning-Based Feature Selection for Real-Time Optical Flow Navigation on a Quadrotor Platform |
|
| Abosaad, Ali | York University |
| Shan, Jinjun | York University |
Keywords: Vision-Based Navigation, Deep Learning for Visual Perception, Sensor Fusion
Abstract: Accurate state estimation in GPS-denied environments is critical for autonomous quadrotor navigation. Conventional visual-inertial odometry (VIO) pipelines rely on dense feature extraction and tracking, which increases computational cost and is prone to drift when low-quality features dominate. Although learning-based detectors improve robustness, most are too computationally heavy for embedded deployment. This paper proposes a lightweight learning-based feature selection framework that prunes unreliable features to enable efficient optical flow navigation. A compact Convolutional Neural Network (CNN) is employed, with its pruning threshold adaptively adjusted to maintain a stable number of reliable features. The CNN augments ORB and Lucas–Kanade optical flow in a multithreaded pipeline with rotational false-velocity compensation and EKF fusion. Experiments on the Quanser QDrone2 demonstrate up to 75–80% reduction in position RMSE and approximately 25–30% reduction in computation time compared to the Fourier-based Phase Correlation (FPC) method, while sustaining real-time performance above 120 Hz without reliance on external localization systems.
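The adaptive threshold can be sketched as a simple feedback rule that keeps the number of surviving features near a target; the proportional update below is an assumption for illustration, not the paper's exact controller:

```python
# Adaptive feature pruning: a small CNN scores each feature's reliability,
# and the cut-off is nudged so roughly n_target features survive each frame.
def select_features(features, scores, threshold, n_target=150, gain=0.05):
    kept = [f for f, s in zip(features, scores) if s >= threshold]
    # Too many survivors -> raise the bar; too few -> lower it.
    threshold += gain * (len(kept) - n_target) / max(n_target, 1)
    return kept, threshold
```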
|
| |
| 15:00-16:30, Paper WeI2I.323 | Add to My Program |
| Déjà Vu: Unlocking Transparent Action Reasoning for Object-Goal Navigation Via Large Language Models |
|
| Du, Heming | The University of Queensland |
| Wang, Sen | The University of Queensland |
| Xue, Li | The University of Queensland |
| Xin, Yu | The University of Queensland |
Keywords: Learning from Experience
Abstract: The remarkable interaction and reasoning capabilities of Large Language Models (LLMs) make them promising in collaborative Embodied AI tasks, particularly for Object-goal Navigation (ObjNav) tasks that require both decision-making and transparent explanation. However, existing work mainly uses LLMs as proxy target indicators, leaving the role of direct action decision-making to other components. This separation causes non-transparent action decisions and extra adaptation requirements. This observation prompts us to reconsider their role: Can LLMs be transformed into the central ``brain'' of agents, directly outputting action choices and explaining their reasoning? In pursuit of this inquiry, we decouple perception from action reasoning to focus specifically on the feasibility of deploying LLMs as navigation policies. We introduce the Experience-aware Action Cogitator (EAC), which integrates two kinds of experience, i.e., expert-informed experience and trial-and-error experience, into prompts. Inspired by David Hume's philosophical principle that knowledge is acquired through reflective experience, these experiences are designed around two critical questions: (i) ``What action should be selected as the best option?'' and (ii) ``What actions have been tried but proven suboptimal?'' By analyzing and reflecting on these two types of experience, we show that LLMs can reason about navigation actions in unseen environments effectively without costly fine-tuning. Experiments on the widely-adopted iTHOR benchmark yield significant improvements in ObjNav performance. These compelling results validate the feasibility of our approach. Compared to vanilla LLMs, EAC nearly doubles both the Success Rate and the Success weighted by Path Length, reaching peak values of 73.93% and 48.35% in unseen scenes, respectively.
|
| |
| 15:00-16:30, Paper WeI2I.324 | Add to My Program |
| UrbanVLA: A Vision-Language-Action Model for Urban Micromobility |
|
| Li, Anqi | Peking University |
| Wang, Zhiyong | Beijing Galbot Co., Ltd |
| Zhang, Jiazhao | Peking University |
| Li, Minghan | Beijing Galbot Co., Ltd |
| Qi, Yunpeng | University of Science and Technology of China |
| Chen, Zhibo | University of Science and Technology of China |
| Zhang, Zhizheng | Beijing Galbot Co., Ltd |
| Wang, He | Peking University |
Keywords: Vision-Based Navigation, Imitation Learning, Reinforcement Learning
Abstract: Urban micromobility applications, such as delivery robots, demand reliable navigation across large-scale urban environments while following long-horizon route instructions. This task is particularly challenging due to the dynamic and unstructured nature of real-world city areas, yet most existing navigation methods remain tailored to short-scale and controllable scenarios. Effective urban micromobility requires two complementary levels of navigation skills: low-level capabilities such as point-goal reaching and obstacle avoidance, and high-level capabilities, such as route–visual alignment. To this end, we propose UrbanVLA, a route-conditioned Vision-Language-Action (VLA) framework designed for scalable urban navigation. Our method explicitly aligns noisy route waypoints with visual observations during execution, and subsequently plans trajectories to drive the robot. To enable UrbanVLA to master both levels of navigation, we employ a two-stage training pipeline. The process begins with Supervised Fine-Tuning (SFT) using simulated environments and trajectories parsed from web videos. This is followed by Reinforcement Fine-Tuning (RFT) on a mixture of simulation and real-world data, which enhances the model's safety and adaptability in real-world settings. Experiments demonstrate that UrbanVLA surpasses strong baselines by more than 55% in the SocialNav task in MetaUrban. Furthermore, UrbanVLA achieves reliable real-world navigation, showcasing both scalability to large-scale urban environments and robustness against real-world uncertainties.
|
| |
| 15:00-16:30, Paper WeI2I.325 | Add to My Program |
| CaLoRA-Stereo: Robust Stereo Endoscopic Depth Estimation Network Via Camera-Aware LoRA and Dual-View Geometry |
|
| Ma, Shixing | Shandong University |
| Shao, Shuwei | Shandong University |
| Lin, Zhaoxi | Tianjin University |
| Du, Xinzhe | Shandong University |
| Song, Rui | Shandong University |
| Li, Yibin | Shandong University |
| Meng, Max Q.-H. | The Chinese University of Hong Kong |
| Min, Zhe | University College London |
Keywords: Medical Robots and Systems, Computer Vision for Medical Robotics
Abstract: Stereo depth estimation has drawn widespread attention from the robotics and vision community due to its broad applications such as 3D reconstruction. Recently, stereo matching foundation models have made significant progress by being trained on large-scale datasets containing natural images. However, directly transferring these pretrained large models to minimally invasive surgery remains challenging due to domain shifts arising from specular highlights and low-texture tissue. In this paper, we propose a parameter-efficient adaptation framework to address this gap. Specifically, we introduce Camera-Aware LoRA for fine-tuning FoundationStereo, using a camera-aware scaling gate computed from focal length and baseline to address intraoperative intrinsics drift arising from instrument self-heating and other thermal effects. We further develop a geometric consistency constraint and a spectral alignment regularizer that enforce cross-view depth agreement. Extensive experiments on the SCARED and Hamlyn datasets indicate that the proposed method achieves state-of-the-art performance. Notably, CaLoRA is easy to integrate into standard fine-tuning pipelines, requiring no backbone changes and only a small number of trainable parameters.
|
| |
| 15:00-16:30, Paper WeI2I.326 | Add to My Program |
| SurveilNav: Collaborative Object Goal Navigation with Robot and Surveillance System |
|
| Yu, Ming-Ming | Beihang University |
| Wang, Qunbo | Beijing Jiaotong University |
| Xu, Rongtao | ATeam |
| Mei, Yanghong | Institute of Automation, Chinese Academy of Sciences |
| Yang, YiRong | Beihang University |
| Guo, Longteng | Institute of Automation of the Chinese Academy of Sciences |
| Wu, Wenjun | Beihang University |
| Liu, Jing | Institute of Automation, Chinese Academy of Science |
Keywords: Surveillance Robotic Systems, Cognitive Modeling
Abstract: With the growing deployment of surveillance systems in factories, offices, and homes, integrating them with robots offers a promising direction for collaborative and efficient task execution. However, existing approaches largely focus on single-robot scenarios and struggle with multi-view collaboration in large-scale environments. In this paper, we present a novel indoor collaborative object navigation dataset built on Habitat-Sim, featuring 206 cameras across 74 floors. The dataset enables systematic evaluation of an agent’s ability to exploit multi-view surveillance information. To address the limitations of single-robot perception, we propose SurveilNav, a collaborative navigation framework that integrates active camera scheduling, joint 2D/3D mapping, VLM-based value estimation, and collaborative target verification. By synergizing the robot's dynamic local perception with the static global view of surveillance, this architecture effectively overcomes both the limited perception range of single agents and the inherent blind spots of fixed cameras, resolving inefficient exploration. Experimental results on the HM3D dataset demonstrate that SurveilNav substantially outperforms existing methods in large-scale environments, achieving state-of-the-art performance in both exploration efficiency and navigation success rate. Moreover, the system shows strong potential for applications in large-scale search, home environments, and rescue missions.
|
| |
| 15:00-16:30, Paper WeI2I.327 | Add to My Program |
| GUIDES: Guidance Using Instructor-Distilled Embeddings for Pre-Trained Robot Policy Enhancement |
|
| Gao, Minquan | University of California, Riverside |
| Li, Xinyi | Johns Hopkins University |
| Yan, Qing | Johns Hopkins University |
| Sun, Xiaojian | Johns Hopkins University |
| Zhang, Xiaopan | University of California, Riverside |
| Huang, Chien-Ming | Johns Hopkins University |
| Li, Jiachen | University of California, Riverside |
Keywords: Perception for Grasping and Manipulation, Semantic Scene Understanding, Imitation Learning
Abstract: Pre-trained robot policies serve as the foundation of many validated robotic systems, which encapsulate extensive embodied knowledge. However, they often lack the semantic awareness characteristic of foundation models, and replacing them entirely is impractical in many situations due to high costs and the loss of accumulated knowledge. To address this gap, we introduce GUIDES, a lightweight framework that augments pre-trained policies with semantic guidance from foundation models without requiring architectural redesign. GUIDES employs a fine-tuned vision-language model (Instructor) to generate contextual instructions, which are encoded by an auxiliary module into guidance embeddings. These embeddings are injected into the policy’s latent space, allowing the legacy model to adapt to this new semantic input through brief, targeted fine-tuning. For inference-time robustness, a large language model–based Reflector monitors the Instructor’s confidence and, when confidence is low, initiates a reasoning loop that analyzes execution history, retrieves relevant examples, and augments the VLM’s context to refine subsequent actions. Extensive validation in the RoboCasa simulation environment across diverse policy architectures shows consistent and substantial improvements in task success rates. Real-world deployment on a UR5 robot further demonstrates that GUIDES enhances motion precision for critical sub-tasks such as grasping. Overall, GUIDES offers a practical and resource-efficient pathway to upgrade, rather than replace, validated robot policies.
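A minimal sketch of the injection step described above, assuming a learned linear projection into the legacy policy's latent space (module and variable names are illustrative, not the paper's API):

```python
# Guidance injection: add a learned projection of the Instructor's guidance
# embedding to the frozen policy's latent before decoding an action. Only the
# projection (and a brief policy fine-tune) is trained.
import torch
import torch.nn as nn

class GuidanceInjector(nn.Module):
    def __init__(self, guide_dim: int, latent_dim: int):
        super().__init__()
        self.proj = nn.Linear(guide_dim, latent_dim)  # trained during brief fine-tuning

    def forward(self, policy_latent: torch.Tensor,
                guidance_embedding: torch.Tensor) -> torch.Tensor:
        # Inject semantic guidance into the legacy policy's latent space.
        return policy_latent + self.proj(guidance_embedding)

# usage sketch: z = encoder(obs); z = injector(z, instructor_embed); action = head(z)
```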
|
| |
| 15:00-16:30, Paper WeI2I.328 | Add to My Program |
| Single-View 3D-Aware Representations for Reinforcement Learning by Cross-View Neural Radiance Fields |
|
| Cho, Daesol | Georgia Institute of Technology |
| Yoo, Seungyeon | Seoul National University |
| Shim, Dongseok | Seoul National University |
| Kim, H. Jin | Seoul National University |
Keywords: Reinforcement Learning, Representation Learning, Visual Learning
Abstract: Reinforcement learning (RL) has enabled robots to develop complex skills, but its success in image-based tasks often depends on effective representation learning. Prior works have primarily focused on 2D representations, often overlooking the inherent 3D geometric structure of the world, or have attempted to learn 3D representations that require extensive resources such as synchronized multi-view images even during deployment. To address these issues, we propose a novel RL framework that extracts 3D-aware representations from single-view RGB input, without requiring camera pose or synchronized multi-view images during the downstream RL. Our method employs an autoencoder architecture, using a masked Vision Transformer (ViT) as the encoder and a latent-conditioned Neural Radiance Fields (NeRF) as the decoder, trained with cross-view completion to implicitly capture fine-grained, 3D geometry-aware representations. Additionally, we utilize a time contrastive loss that further regularizes the learned representation for consistency across different viewpoints, which enables viewpoint-robust robot manipulations. Our method significantly enhances the RL agent’s performance both in simulation and real-world experiments, demonstrating superior effectiveness compared to prior 3D-aware representation-based methods, even when using only single-view RGB images during deployment.
|
| |
| 15:00-16:30, Paper WeI2I.329 | Add to My Program |
| Dynamics of Mental Models: Objective vs. Subjective User Understanding of a Robot in the Wild |
|
| Gebellí, Ferran | PAL Robotics |
| Garrell, Anais | UPC-CSIC |
| Lemaignan, Séverin | IIIA-CSIC |
| Ros, Raquel | IIIA-CSIC |
Keywords: Social HRI, Long term Interaction, Human-Centered Robotics
Abstract: In Human-Robot Interaction research, assessing how humans understand the robots they interact with is crucial, particularly when studying the impact of explainability and transparency. Some studies evaluate objective understanding by analysing the accuracy of users' mental models, while others rely on perceived, self-reported levels of subjective understanding. We hypothesise that both dimensions of understanding may diverge, thus being complementary methods to assess the effects of explainability on users. In our study, we track the weekly progression of the users' understanding of an autonomous robot operating in a healthcare centre over five weeks. Our results reveal a notable mismatch between objective and subjective understanding. In areas where participants lacked sufficient information, their perceived (subjective) understanding rose with increased contact with the system, while their actual (objective) understanding did not. We attribute these results to inaccurate mental models that persist due to limited feedback from the system. Future research should clarify how both objective and subjective dimensions of understanding can be influenced by explainability measures, and how these two dimensions of understanding affect other desiderata such as trust or usability.
|
| |
| 15:00-16:30, Paper WeI2I.330 | Add to My Program |
| Neural Profiling with fNIRS of Operator Performance in Teleoperated Human-Like Social Robot Interactions |
|
| Achanccaray, David | Institut Supérieur De l'Aéronautique Et De L'Espace |
| Andreu-Perez, Javier | University of Essex |
| Sumioka, Hidenobu | ATR |
Keywords: Telerobotics and Teleoperation, Social HRI, Human Factors and Human-in-the-Loop
Abstract: Social robot teleoperation is a skill that must be acquired through practice with the social robot. Mobile neuroimaging and human-computer interface performance metrics permit the gathering of information from the operators' systemic and behavioral responses associated with their skill acquisition. Profiling the skill levels of social robot operators using this information can help improve training protocols. In this study, thirty-two participants performed real-world social robot teleoperation tasks. Brain function signals from the prefrontal cortex (PFC) were collected using functional near-infrared spectroscopy (fNIRS), along with behavioral data from interactions with the system. Participants were divided into two groups (high and low performance) based on an integrative metric of task efficiency, workload, and presence when operating the social robot. Significant differences were found in the operation time and in the width and multiscale entropy of the hemoglobin oxygenation curve of the operator's PFC. Functional connectivity in the PFC also differed between the low- and high-performance groups, both when the connectivity networks were compared and in the leaf-fraction metrics of the functional networks. These findings contribute to understanding the operator's progress during teleoperation training protocols and to designing interfaces that help enhance task performance.
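Multiscale entropy, one of the discriminative features reported above, can be sketched as coarse-graining the hemoglobin-oxygenation series and computing sample entropy at each scale. This is a generic textbook formulation under assumed parameters (m=2, r=0.2*std), not the study's exact pipeline.

```python
import numpy as np

def sample_entropy(x, m=2, r_factor=0.2):
    # SampEn = -ln(A/B), where A and B count template matches of
    # length m+1 and m within tolerance r (Chebyshev distance)
    r = r_factor * np.std(x)
    def matches(mm):
        t = np.array([x[i:i + mm] for i in range(len(x) - mm + 1)])
        d = np.max(np.abs(t[:, None] - t[None, :]), axis=2)  # Chebyshev
        return np.sum(d <= r) - len(t)       # exclude self-matches
    B, A = matches(m), matches(m + 1)
    return -np.log(A / B) if A > 0 and B > 0 else np.inf

def multiscale_entropy(x, scales=(1, 2, 3, 4, 5)):
    out = []
    for tau in scales:                        # coarse-grain, then SampEn
        n = len(x) // tau
        out.append(sample_entropy(x[:n * tau].reshape(n, tau).mean(axis=1)))
    return out
```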
|
| |
| 15:00-16:30, Paper WeI2I.331 | Add to My Program |
| OmniMap: A General Mapping Framework Integrating Optics, Geometry, and Semantics |
|
| Deng, Yinan | Beijing Institute of Technology |
| Yue, Yufeng | Beijing Institute of Technology |
| Dou, Jianyu | Beijing Institute of Technology |
| Zhao, Jingyu | Beijing Institute of Technology |
| Wang, Jiahui | Beijing Institute of Technology |
| Tang, Yujie | Beijing Institute of Technology |
| Yang, Yi | Beijing Institute of Technology |
| Fu, Mengyin | Beijing Institute of Technology |
Keywords: Mapping, Semantic Scene Understanding, RGB-D Perception, Perception for Grasping and Manipulation
Abstract: Robotic systems demand accurate and comprehensive 3D environment perception, requiring simultaneous capture of photo-realistic appearance (optical), precise layout shape (geometric), and open-vocabulary scene understanding (semantic). Existing methods typically achieve only partial fulfillment of these requirements while exhibiting optical blurring, geometric irregularities, and semantic ambiguities. To address these challenges, we propose OmniMap. Overall, OmniMap represents the first online mapping framework that simultaneously captures optical, geometric, and semantic scene attributes while maintaining real-time performance and model compactness. At the architectural level, OmniMap employs a tightly coupled 3DGS–Voxel hybrid representation that combines fine-grained modeling with structural stability. At the implementation level, OmniMap identifies key challenges across different modalities and introduces several innovations: adaptive camera modeling for motion blur and exposure compensation, hybrid incremental representation with normal constraints, and probabilistic fusion for robust instance-level understanding. Extensive experiments show OmniMap's superior performance in rendering fidelity, geometric accuracy, and zero-shot semantic segmentation compared to state-of-the-art methods across diverse scenes. The framework's versatility is further evidenced through a variety of downstream applications, including multi-domain scene Q&A, interactive editing, perception-guided manipulation, and map-assisted navigation.
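The probabilistic fusion for instance-level understanding can be pictured, in its simplest Bayesian form, as multiplying a running per-voxel label posterior by each new observation's likelihood and renormalizing. This toy sketch illustrates the general technique, not OmniMap's actual update rule.

```python
import numpy as np

def fuse(posterior, likelihood):
    p = posterior * likelihood          # Bayes' rule, then renormalize
    return p / p.sum()

post = np.full(3, 1.0 / 3.0)            # uniform prior over 3 labels
for obs in ([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]):
    post = fuse(post, np.asarray(obs))
print(post.argmax(), post)              # label 0 dominates after fusion
```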
|
| |
| 15:00-16:30, Paper WeI2I.332 | Add to My Program |
| A Transendoscopic Telerobotic System Using Heterogeneous Flexible Manipulators for Bimanual Endoscopic Submucosal Dissection |
|
| Gao, Huxin | The Chinese University of Hong Kong |
| Yang, Xiaoxiao | Qilu Hospital of Shandong University |
| Zhang, Tao | Chinese University of Hong Kong |
| Xiao, Xiao | Southern University of Science and Technology |
| Li, Changsheng | Beijing Institute of Technology |
| Meng, Max Q.-H. | The Chinese University of Hong Kong |
| Zuo, Xiuli | Qilu Hospital of Shandong University |
| Li, Yanqing | Qilu Hospital of Shandong University |
| Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
Keywords: Medical Robots and Systems, Flexible Robotics, Bimanual Manipulation
Abstract: Endoscopic submucosal dissection (ESD) is an effective technique to resect early cancers in the gastrointestinal (GI) tract. Bimanual telerobotic manipulation is an approach to performing ESD intuitively and efficiently, which requires two robotic instruments with flexibility, stiffness, dexterity and accuracy. However, the priority of these properties depends on the specific surgical tasks. In this work, we propose the first heterogeneous flexible manipulators (HFMs) for bimanual ESD, which can take advantage of different mechanical structures. The grasping instrument employs a serial articulated manipulator (SAM) to perform multidirectional (relatively higher dexterity) and stable (relatively higher stiffness) traction for better submucosal visualization. The electrosurgical instrument utilizes a parallel continuum wrist (PCW) to execute accurate (higher accuracy) tissue dissection. Both HFMs have sufficient flexibility to go through the flexible endoscopic working channels. Based on the HFMs, we established a transendoscopic telerobotic system. The kinematics of the SAM and PCW were built in the endoscope frame using the Denavit-Hartenberg (DH) method and Cosserat rod method, respectively.
|
| |
| 15:00-16:30, Paper WeI2I.333 | Add to My Program |
| RoA-Planner: Rotatable Area-Based Path Planner in Dense Spaces (I) |
|
| Son, Yeongwoo | Sungkyunkwan University |
| Lee, Hyunyong | AIDIN ROBOTICS |
| Kang, Hansol | Sungkyunkwan University |
| Park, Ji Man | SungkyunKwan University |
| Nam, SeongWon | Sungkyunkwan University |
| Oh, JaeYoung | Sungkyunkwan University |
| Yi, Bumsu | SungkyunKwan University |
| Song, Junha | Sungkyunkwan University |
| Choi, SooYeon | Sungkyunkwan University |
| Kim, Bogeun | Sungkyunkwan University |
| Song, Daegeun | Sungkyunkwan University |
| Choi, Hyouk Ryeol | Sungkyunkwan University |
Keywords: Motion and Path Planning, Collision Avoidance
Abstract: Path planning in obstacle-dense environments is a challenging problem, particularly for robots with asymmetric rectangular footprints. To address this problem, we propose a novel collision-checking approach, called a Rotatable Area, which represents a range of heading angles where the robot can rotate without colliding with obstacles. Based on the relationship between two rotatable areas, we define safe local motion and extend this concept to the RoA-Planner, a path planning framework in SE(2) dense space. We validate our planner through extensive simulations and real-world experiments in complex and narrow environments. The results demonstrate that our method achieves fast planning speed while ensuring safety and robustness, making it suitable for practical applications.
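The rotatable-area idea can be approximated with a brute-force sweep: at a fixed position, sample the rectangular footprint as a point grid, rotate it through candidate headings, and keep the headings that hit no occupied cell. Grid resolution, sample counts, and the assumption that the footprint stays inside the map are illustrative; the paper's method is analytical rather than sampling-based.

```python
import numpy as np

def rotatable_headings(occ, cx, cy, length, width, res=0.05, n_angles=180):
    """occ: 2-D boolean occupancy grid; (cx, cy): robot center in meters."""
    xs, ys = np.meshgrid(np.linspace(-length / 2, length / 2, 20),
                         np.linspace(-width / 2, width / 2, 10))
    pts = np.stack([xs.ravel(), ys.ravel()])        # footprint samples
    free = []
    for theta in np.linspace(-np.pi, np.pi, n_angles, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        world = np.array([[c, -s], [s, c]]) @ pts   # rotate footprint
        i = ((cx + world[0]) / res).astype(int)
        j = ((cy + world[1]) / res).astype(int)
        if not occ[i, j].any():                     # no occupied cell hit
            free.append(theta)
    return free                                     # collision-free headings
```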
|
| |
| 15:00-16:30, Paper WeI2I.334 | Add to My Program |
| Don't Let Your Robot Be Harmful: Responsible Robotic Manipulation Via Safety-As-Policy |
|
| Ni, Minheng | Hong Kong Polytechnic University |
| Zhang, Lei | University of Hamburg |
| Chen, Zihan | Harbin Institute of Technology |
| Bai, Kaixin | University of Hamburg |
| Chen, Zhaopeng | University of Hamburg |
| Zhang, Jianwei | University of Hamburg |
| Zhang, Lei | Hong Kong Polytechnic University |
| Zuo, Wangmeng | Harbin Institute of Technology |
Keywords: Safety in HRI, Task and Motion Planning, Multi-Modal Perception for HRI
Abstract: Unthinking execution of human instructions in robotic manipulation can lead to severe safety risks, such as poisonings, fires, and even explosions. In this paper, we present responsible robotic manipulation, which requires robots to consider potential hazards in the real-world environment while completing instructions and performing complex operations safely and efficiently. However, such real-world scenarios are variable and too risky to train in. To address this challenge, we propose Safety-as-policy, which includes (i) a world model to automatically generate scenarios containing safety risks and conduct virtual interactions, and (ii) a mental model to infer consequences with reflections and gradually develop the cognition of safety, allowing robots to accomplish tasks while avoiding dangers. Additionally, we create the SafeBox synthetic dataset, which includes one hundred responsible robotic manipulation tasks with different safety risk scenarios and instructions, effectively reducing the risks associated with real-world experiments. Experiments demonstrate that Safety-as-policy can avoid risks and efficiently complete tasks in both the synthetic dataset and real-world experiments, significantly outperforming baseline methods. Our SafeBox dataset yields evaluation results consistent with real-world scenarios, serving as a safe and effective benchmark for future research. Our code, data, and supplementary materials are available at: https://sites.google.com/view/safety-as-policy.
|
| |
| 15:00-16:30, Paper WeI2I.335 | Add to My Program |
| DriveAgent: Multi-Agent Structured Reasoning with LLM and Multimodal Sensor Fusion for Autonomous Driving |
|
| Hou, Xinmeng | Chang'an University, A*STAR |
| Wang, Wuqi | Chang’an University |
| Yang, Long | Chang'an University |
| Lin, Hao | Google LLC |
| Feng, Jinglun | City College of New York |
| Min, Haigen | Chang'an University |
| Zhao, Xiangmo | School of Information Engineering, Chang 'an University |
Keywords: Agent-Based Systems, AI-Based Methods, Autonomous Agents
Abstract: We introduce DriveAgent, a modular multi-agent autonomous driving framework that leverages large language model (LLM) reasoning combined with multimodal sensor fusion for autonomous driving. DriveAgent orchestrates specialized agents operating on camera, Light Detection and Ranging (LiDAR), Inertial Measurement Unit (IMU), and Global Positioning System (GPS) with LLM-driven analytical processes to deliver temporally aligned perception, causal reasoning, and action recommendations. The framework operates through a modular agent-based pipeline comprising four principal modules: (i) a descriptive analysis agent identifying critical sensor data events based on filtered timestamps, (ii) dedicated vehicle-level analysis conducted by LiDAR and vision agents that collaboratively assess vehicle conditions and movements, (iii) environmental reasoning and causal analysis agents explaining contextual changes and their underlying mechanisms, and (iv) an urgency-aware decision-generation agent prioritizing insights and proposing timely maneuvers. This modular design empowers the LLM to effectively coordinate specialized perception and reasoning agents, delivering cohesive, interpretable insights into complex autonomous driving scenarios. Extensive experiments demonstrate that DriveAgent substantially outperforms baseline methods, achieving a 26.31% improvement in vehicle reasoning and consistent enhancements of up to 2.85% in environmental reasoning. These results highlight the effectiveness of our LLM-driven multi-agent sensor fusion framework in boosting the robustness and reliability of autonomous driving systems. Code available at https://github.com/Paparare/DriveAgent
|
| |
| 15:00-16:30, Paper WeI2I.336 | Add to My Program |
| Anti-Backlash Mechanisms for Cycloidal Drive Robotic Actuators: Design and Evaluation |
|
| Roozing, Wesley | University of Twente |
| Volbeda, Jelmer | University of Twente |
Keywords: Actuation and Joint Mechanisms, Mechanism Design
Abstract: We design and experimentally evaluate two anti-backlash mechanisms for cycloidal reducers. The two mechanisms are integrated into variations of a proposed design of quasi-direct drive actuator. Three prototypes are realised to compare the two mechanisms against the baseline design. We evaluate the effectiveness of the anti-backlash mechanisms under varying preload with measurements of friction, backlash, and stiffness. The results demonstrate that the anti-backlash mechanisms are effective at reducing backlash by approx. 2-3x, at the expected expense of increased friction (<2x).
|
| |
| 15:00-16:30, Paper WeI2I.337 | Add to My Program |
| SHeRLoc: Synchronized Heterogeneous Radar Place Recognition for Cross-Modal Localization |
|
| Kim, Hanjun | Seoul National University |
| Jung, Minwoo | Seoul National University |
| Yang, Wooseong | Seoul National University |
| Kim, Ayoung | Seoul National University |
Keywords: Localization, SLAM, Range Sensing
Abstract: Despite the growing adoption of radar in robotics, the majority of research has been confined to homogeneous sensors, overlooking the integration and cross-modality challenges inherent in heterogeneous radar. This leads to significant difficulties in generalizing across diverse radar types, with modality-aware approaches that could leverage the complementary strengths of heterogeneous radar remaining unexplored. To bridge these gaps, we propose SHeRLoc, the first deep network tailored for heterogeneous radar, which utilizes radar cross-section polar matching to align multimodal radar data. Our hierarchical optimal transport-based feature aggregation generates rotationally robust multi-scale descriptors. By employing FFT-similarity-based data mining and adaptive margin-based triplet loss, SHeRLoc enables FOV-aware metric learning. SHeRLoc achieves an order of magnitude improvement in heterogeneous radar place recognition, increasing recall@1 from below 0.1 to 0.9, and paves the way for cross-modal localization.
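The adaptive margin-based triplet loss named above admits a compact sketch: the margin grows for harder negatives so the embedding is pushed further apart where confusion is likely. The margin schedule here is an assumed illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_triplet(anchor, positive, negative,
                            base_margin=0.2, scale=0.3):
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    # harder negatives (small anchor-negative distance) get a larger margin
    margin = base_margin + scale * torch.exp(-d_an)
    return F.relu(d_ap - d_an + margin).mean()

loss = adaptive_margin_triplet(torch.randn(16, 256), torch.randn(16, 256),
                               torch.randn(16, 256))
```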
|
| |
| 15:00-16:30, Paper WeI2I.338 | Add to My Program |
| Manipulating Elasto-Plastic Objects with 3D Occupancy and Learning-Based Predictive Control |
|
| Zhang, Zhen | The Chinese University of Hong Kong |
| Chu, Xiangyu | The Chinese University of Hong Kong |
| Tang, Yunxi | The Chinese University of Hong Kong |
| Zhao, Lulu | Beijing Normal University |
| Huang, Jing | The Chinese University of Hong Kong |
| Jiang, Zhongliang | The University of Hong Kong |
| Au, K. W. Samuel | The Chinese University of Hong Kong |
Keywords: Machine Learning for Robot Control, Dexterous Manipulation
Abstract: Manipulating elasto-plastic objects remains a significant challenge due to severe self-occlusion, representation difficulties, and complicated dynamics. This work proposes a novel framework for elasto-plastic object manipulation with a quasi-static assumption for motions, leveraging 3D occupancy to represent such objects, a learned dynamics model trained with 3D occupancy, and a learning-based predictive control algorithm to address these challenges effectively. We build a novel data collection platform to collect full spatial information and propose a pipeline for generating a 3D occupancy dataset. To infer the 3D occupancy during manipulation, an occupancy prediction network is trained with multiple RGB images supervised by the generated dataset. We design a deep neural network empowered by a 3D convolutional neural network (CNN) and a graph neural network (GNN) to predict the complex deformation with the inferred 3D occupancy results. A learning-based predictive control algorithm is introduced to plan the robot’s actions, incorporating a novel shape-based action initialization module specifically designed to improve the planner’s efficiency. The proposed framework can successfully shape elasto-plastic objects into a given goal shape and has been verified in various experiments in both simulation and the real world.
|
| |
| 15:00-16:30, Paper WeI2I.339 | Add to My Program |
| OVITA: Open-Vocabulary Interpretable Trajectory Adaptations |
|
| Maurya, Anurag | Indian Institute of Science, Bengaluru |
| Ghosh, Tashmoy | Indian Institute of Science |
| Nguyen, Anh | University of Liverpool |
| Prakash, Ravi | Indian Institute of Science |
Keywords: Motion and Path Planning, Human-Robot Collaboration, Big Data in Robotics and Automation
Abstract: Adapting trajectories to dynamic situations and user preferences is crucial for robot operation in unstructured environments with non-expert users. Natural language enables users to express these adjustments in an interactive manner. We introduce OVITA, an interpretable, open-vocabulary, language-driven framework designed for adapting robot trajectories in dynamic and novel situations based on human instructions. OVITA leverages multiple pre-trained Large Language Models (LLMs) to integrate user commands into trajectories generated by motion planners or those learned through demonstrations. OVITA employs code as an adaptation policy generated by an LLM, enabling users to adjust individual waypoints, thus providing flexible control. Another LLM, which acts as a code explainer, removes the need for expert users, enabling intuitive interactions. The efficacy and significance of the proposed OVITA framework are demonstrated through extensive simulations and real-world environments with diverse tasks involving spatiotemporal variations on heterogeneous robotic platforms such as a KUKA IIWA robot manipulator, Clearpath Jackal ground robot, and CrazyFlie drone.
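The "code as adaptation policy" idea can be illustrated by the kind of short function an LLM might emit for an instruction such as "keep at least 0.3 m away from the obstacle", applied waypoint by waypoint. The function below is a hypothetical example of generated output, not code from OVITA.

```python
import numpy as np

def adapt_trajectory(waypoints, obstacle, clearance=0.3):
    """waypoints: (N, 3) array; push any too-close point away from obstacle."""
    adapted = []
    for wp in waypoints:
        offset = wp - obstacle
        dist = np.linalg.norm(offset)
        if dist < clearance:                      # too close: project outward
            wp = obstacle + offset / dist * clearance
        adapted.append(wp)
    return np.array(adapted)

new_path = adapt_trajectory(np.random.rand(50, 3), np.array([0.5, 0.5, 0.5]))
```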
|
| |
| 15:00-16:30, Paper WeI2I.340 | Add to My Program |
| Diffusion Trajectory-Guided Policy for Long-Horizon Robot Manipulation |
|
| Fan, Shichao | Beihang University |
| Yang, Quantao | KTH Royal Institute of Technology |
| Liu, Yajie | Beihang University |
| Wu, Kun | Syracuse University |
| Che, Zhengping | X-Humanoid |
| Liu, Qingjie | Beihang University |
| Wan, Min | Beihang University |
Keywords: Imitation Learning, Learning from Demonstration, Deep Learning in Grasping and Manipulation
Abstract: Recently, Vision-Language-Action Models (VLA) have advanced robot imitation learning, but high data collection costs and limited demonstrations hinder generalization, and current imitation learning methods struggle in out-of-distribution scenarios, especially for long-horizon tasks. A key challenge is how to mitigate compounding errors in imitation learning, which lead to cascading failures over extended trajectories. To address these challenges, we propose the Diffusion Trajectory-guided Policy (DTP) framework, which generates 2D trajectories through a diffusion model to guide policy learning for long-horizon tasks. By leveraging task-relevant trajectories, DTP provides trajectory-level guidance to reduce error accumulation. Our two-stage approach first trains a generative vision-language model to create diffusion-based trajectories, then refines the imitation policy using them. Experiments on the CALVIN benchmark show that DTP outperforms state-of-the-art baselines by 25% in success rate, starting from scratch without external pretraining. Moreover, DTP significantly improves real-world robot performance.
|
| |
| 15:00-16:30, Paper WeI2I.341 | Add to My Program |
| Have We Scene It All? Scene Graph-Aware Deep Point Cloud Compression |
|
| Stathoulopoulos, Nikolaos | Luleå University of Technology |
| Kanellakis, Christoforos | Luleå University of Technology |
| Nikolakopoulos, George | Luleå University of Technology |
Keywords: Range Sensing, Deep Learning for Visual Perception
Abstract: Efficient transmission of 3D point cloud data is critical for advanced perception in centralized and decentralized multi-agent robotic systems, especially nowadays with the growing reliance on edge and cloud-based processing. However, the large and complex nature of point clouds creates challenges under bandwidth constraints and intermittent connectivity, often degrading system performance. We propose a deep compression framework based on semantic scene graphs. The method decomposes point clouds into semantically coherent patches and encodes them into compact latent representations with semantic-aware encoders conditioned by Feature-wise Linear Modulation (FiLM). A folding-based decoder, guided by latent features and graph node attributes, enables structurally accurate reconstruction. Experiments on the SemanticKITTI and nuScenes datasets show that the framework achieves state-of-the-art compression rates, reducing data size by up to 98% while preserving both structural and semantic fidelity. In addition, it supports downstream applications such as multi-robot pose graph optimization and map merging, achieving trajectory accuracy and map alignment comparable to those obtained with raw LiDAR scans.
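FiLM conditioning, as used for the semantic-aware encoders, reduces to predicting a per-channel scale and shift from a conditioning vector. A minimal block under assumed dimensions (scene-graph node attributes conditioning per-point features):

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim, num_channels):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, features, cond):
        # features: (B, C, N) per-point features; cond: (B, cond_dim)
        gamma, beta = self.to_scale_shift(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(-1) * features + beta.unsqueeze(-1)

film = FiLM(cond_dim=16, num_channels=64)
out = film(torch.randn(2, 64, 1024), torch.randn(2, 16))  # modulated features
```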
|
| |
| 15:00-16:30, Paper WeI2I.342 | Add to My Program |
| Is Open Robotics Innovation a Threat to International Peace and Security? |
|
| Righetti, Ludovic | New York University |
| Boulanin, Vincent | Stockholm International Peace Research Institute |
Keywords: Ethics and Philosophy, Product Design, Development and Prototyping, Control Architectures and Programming
Abstract: Open access to publication, software and hardware is central to robotics: it lowers barriers to entry, supports reproducible science and accelerates reliable system development. However, openness also exacerbates the inherent dual-use risks associated with research and innovation in robotics. It lowers barriers for states and non-state actors to develop and deploy robotics systems for military use and harmful purposes. Compared to other fields of engineering where dual-use risks are present, such as those underlying the development of weapons of mass destruction (chemical, biological, radiological, and nuclear weapons), and even compared to the field of AI, robotics offers no specific regulation and little guidance as to how research and innovation may be conducted and disseminated responsibly. While other fields can be used for guidance, robotics has its own needs and specificities which have to be taken into account. The robotics community should therefore work toward its own set of sector-specific guidance and possibly regulation. To that end, we propose a roadmap focusing on four practices: a) education in responsible robotics; b) incentivizing risk assessment; c) moderating the diffusion of high-risk material; and d) developing red lines.
|
| |
| 15:00-16:30, Paper WeI2I.343 | Add to My Program |
| EquiMus: Energy-Equivalent Dynamic Modeling and Simulation of Musculoskeletal Robots Driven by Linear Elastic Actuators |
|
| Zhu, Yinglei | Tsinghua University |
| Dong, Xuguang | Tsinghua University |
| Wang, Qiyao | Tsinghua University |
| Shao, Qi | Tsinghua University |
| Xie, Fugui | Tsinghua University |
| Liu, Xin-Jun | Tsinghua University |
| Zhao, Huichan | Tsinghua University |
Keywords: Modeling, Control, and Learning for Soft Robots, Biologically-Inspired Robots, Dynamics
Abstract: Dynamic modeling and control are critical to unlocking soft robots’ potential, yet remain challenging due to complex constitutive behaviors and real-world operating conditions. Bio-inspired musculoskeletal robots, which integrate rigid skeletons with soft actuators, combine the advantages of heavy load-bearing capacity and inherent flexibility. Although actuation dynamics has been studied through experimental methods and surrogate models, accurate and effective modeling and simulation still pose a significant challenge when soft actuators are applied at a large scale, especially in hybrid rigid-soft robots with continuously distributed mass, kinematic loops and diverse motion modes. To address these challenges, we propose EquiMus, an energy-equivalent dynamic modeling and MuJoCo-based simulation for musculoskeletal rigid-soft hybrid robots with linear elastic actuators. The equivalence and effectiveness are proven in detail and examined through simulations and real experiments on a bionic robotic leg. EquiMus further demonstrates utility for downstream tasks, including controller design and learning-based control.
|
| |
| 15:00-16:30, Paper WeI2I.344 | Add to My Program |
| A MagsL-HUD Endoscopic System for Magnetic Compression Anastomosis Surgery in Unstructured Endoluminal Environment |
|
| Sun, Yichong | The Chinese University of Hong Kong |
| Xian, Yitian | The Chinese University of Hong Kong |
| Xu, Ruoyu | The Chinese University of Hong Kong, Shenzhen |
| Chan, Wai Shing | The Chinese University of Hong Kong |
| Yip, Hon Chi | The Chinese University of Hong Kong |
| Chiu, Philip, Wai-yan | Chinese University of Hong Kong |
| Li, Zheng | The Chinese University of Hong Kong |
Keywords: Medical Robots and Systems, Localization, Sensor Fusion, Magnetic-Assisted Endoscopy
Abstract: Magnetic compression anastomosis (MCA) offers a promising solution for minimally invasive anastomosis surgery. However, current MCA schemes lack safe, real-time localization and guidance for compression magnets, hindering surgeons’ ability to control the compression magnets effectively in complex, unstructured endoluminal environments. To address these limitations, this article introduces the MagsL-HUD endoscopic system, a novel solution that enables multimagnetic six-degree-of-freedom (six-DoF) localization and head-up display (HUD) guidance within the endoscopic view (EV). Specifically, the system integrates a developed Endo-MagCap device with an orthogonal magnet configuration, along with a magnetic sensor array, to achieve real-time full-pose localization. An endoscopic camera model is incorporated for HUD visualization, enhancing intuitive interaction and supporting surgeons’ better-informed decisions. Eventually, the effectiveness of the MagsL-HUD endoscopic system is validated through laboratory experiments and ex vivo animal trials. The system demonstrates six-DoF tracking accuracy with average errors of 0.0070 m and 0.1437 rad, and 0.0071 m and 0.1721 rad in the designed trajectory cases for the two compression magnets, respectively. Additionally, ex vivo porcine tests confirm the system’s feasibility and applicability, successfully performing a stomach-colon MCA surgery with a final compression gap of approximately 0.00247 m. Further comparative studies demonstrate that the MagsL-HUD method has a compression success rate of 71.4% versus 42.9% for the non-HUD approaches in the designed tests. This work represents a significant step toward the clinical adoption of magnetic-assisted endoscopy for minimally invasive anastomosis surgeries, holding substantial practical significance for improving the safety and efficacy of MCA procedures in complex, unstructured endoluminal environments.
|
| |
| 15:00-16:30, Paper WeI2I.345 | Add to My Program |
| Fast Iterative Region Inflation for Computing Large 2-D/3-D Convex Regions of Obstacle-Free Space |
|
| Wang, Qianhao | Zhejiang University |
| Wang, Zhepei | Zhejiang University |
| Wang, Mingyang | Zhejiang University |
| Ji, Jialin | Zhejiang University |
| Han, Zhichao | Zhejiang University |
| Wu, Tianyue | Zhejiang University |
| Jin, Rui | Nanyang Technological University |
| Gao, Yuman | Zhejiang University |
| Xu, Chao | Zhejiang University |
| Gao, Fei | Zhejiang University |
Keywords: Collision Avoidance, Motion and Path Planning, Autonomous Vehicle Navigation, Aerial Systems: Applications
Abstract: Convex polytopes have compact representations and exhibit convexity, which makes them suitable for abstracting obstacle-free spaces from various environments. Existing generation methods struggle with balancing high-quality output and efficiency. Moreover, various tasks impose a further crucial requirement, which we refer to as manageability: the convex polytope must accurately contain certain seed point sets, such as a robot or a front-end path. In this paper, we propose Fast Iterative Region Inflation (FIRI) to generate high-quality convex polytopes while ensuring efficiency and manageability simultaneously. FIRI consists of two iteratively executed submodules: Restrictive Inflation (RsI) and Maximum Volume Inscribed Ellipsoid (MVIE) computation. By explicitly incorporating constraints that include the seed point set, RsI guarantees manageability. Meanwhile, iterative MVIE optimization ensures high-quality results through monotonic volume bound improvement. In terms of efficiency, we design methods tailored to the low-dimensional and multi-constrained nature of both modules, resulting in orders of magnitude improvement compared to generic solvers. Notably, in 2-D MVIE, we pr
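For reference, the generic MVIE subproblem for a polytope {x : Ax <= b} is the classic log-det program of Boyd & Vandenberghe (Ch. 8), sketched below with an off-the-shelf solver; FIRI's contribution is a tailored solver that is orders of magnitude faster than this.

```python
import cvxpy as cp
import numpy as np

# Example polytope: a unit box with one corner cut off
A = np.array([[1.0, 0], [-1, 0], [0, 1], [0, -1], [1, 1]])
b = np.array([1.0, 1, 1, 1, 1.5])

n = A.shape[1]
B = cp.Variable((n, n), PSD=True)   # ellipsoid shape: {B u + d : ||u|| <= 1}
d = cp.Variable(n)                  # ellipsoid center
constraints = [cp.norm(B @ A[i]) + A[i] @ d <= b[i] for i in range(len(b))]
prob = cp.Problem(cp.Maximize(cp.log_det(B)), constraints)  # maximize volume
prob.solve()
print("center:", d.value)
```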
|
| |
| 15:00-16:30, Paper WeI2I.346 | Add to My Program |
| Collision Avoidance of Moving 3-D Objects in Dynamic Environments (I) |
|
| Dhal, Kashish | University of Texas at Arlington |
| Kashyap, Abhishek | University of Texas at Arlington |
| Chakravarthy, Animesh | University of Texas at Arlington |
| |
| 15:00-16:30, Paper WeI2I.347 | Add to My Program |
| Pitch Variation Filter for LiDAR-Only SLAM and Localization in Self-Balancing Mobile Robot |
|
| Kim, Doyeon | Kumoh National Institute of Technology |
| Lee, Heoncheol | Kumoh National Institute of Technology |
| Choi, Ka Hyung | LG Electronics |
Keywords: Localization, Mapping
Abstract: This paper addresses the Pitch Variation Problem in Two-Wheeled Self-Balancing (TWSB) robots that use 2D LiDAR for Simultaneous Localization And Mapping (SLAM). The issue arises from sudden accelerations or decelerations, leading to abrupt pitch variations that cause the 2D LiDAR to capture data from unintended surfaces, such as the ground or ceiling, destabilizing the robot’s position estimation. To mitigate this, we propose a novel preprocessing method that efficiently removes point clusters affected by pitch variation by leveraging their distinct characteristics, without the need for an alignment process. Experimental results demonstrate that our method reduces errors by at least 24.31% across various scan matching algorithms. Furthermore, as the proposed method operates independently of SLAM, it can be seamlessly integrated into a wide range of systems and has been shown to substantially enhance SLAM performance when used alongside existing algorithms.
|
| |
| 15:00-16:30, Paper WeI2I.348 | Add to My Program |
| ES-HPC-MPC: Exponentially Stable Hybrid Perception Constrained MPC for Quadrotor with Suspended Payloads |
|
| Recalde, Luis F. | Worcester Polytechnic Institute |
| Sarvaiya, Mrunal | University of California, Berkeley |
| Loianno, Giuseppe | UC Berkeley |
| Li, Guanrui | Worcester Polytechnic Institute |
Keywords: Aerial Systems: Applications, Aerial Systems: Perception and Autonomy, Aerial Systems: Mechanics and Control
Abstract: Aerial transportation using quadrotors with cable-suspended payloads holds great potential for applications in disaster response, logistics, and infrastructure maintenance. However, their hybrid and underactuated dynamics pose significant control and perception challenges. Traditional approaches often assume a taut cable condition, limiting their effectiveness in real-world applications where slack-to-taut transitions occur due to disturbances. We introduce ES-HPC-MPC, a model predictive control framework that enforces exponential stability and perception-constrained control under hybrid dynamics. Our method leverages Exponentially Stabilizing Control Lyapunov Functions (ES-CLFs) to enforce stability during the tasks and Control Barrier Functions (CBFs) to maintain the payload within the onboard camera’s field of view (FoV). We validate our method through both simulation and real-world experiments, demonstrating stable trajectory tracking and reliable payload perception. We further show that our method maintains stability and satisfies perception constraints while tracking dynamically infeasible trajectories and when the system is subjected to hybrid mode transitions caused by unexpected disturbances.
|
| |
| 15:00-16:30, Paper WeI2I.349 | Add to My Program |
| DRIM: Depth Restoration with Interference Mitigation in Multiple LiDAR Depth Cameras |
|
| Shin, Seunghui | Kyung Hee University |
| Jang, Jaeyun | Kyung Hee University |
| Park, Sundong | Kyung Hee University |
| Hwang, Hyoseok | Kyung Hee University |
Keywords: RGB-D Perception, Deep Learning for Visual Perception, Data Sets for Robotic Vision
Abstract: LiDAR depth cameras are widely used for accurate depth measurement in various applications. However, when multiple cameras operate simultaneously, mutual interference causes artifacts in the captured depth data, which existing image restoration methods struggle to handle. In this paper, we propose DRIM, a novel approach for real-time depth restoration under multi-device interference. Our method begins by distinguishing interference-induced artifacts, then predicts and leverages these artifacts to guide the restoration process. Since there is no existing dataset for learning interference in multiple LiDAR depth cameras, we create and provide the first depth interference dataset. Our experiments demonstrate superior depth restoration performance compared to other image restoration methods, achieving real-time processing speeds (≈33 FPS) that are significantly faster than existing approaches while showing the capability to restore depth in challenging scenarios. These results demonstrate that our proposed method effectively restores interfered depth in multiple LiDAR depth cameras with practical real-time performance. Datasets and codes are available at DRIM project page.
|
| |
| 15:00-16:30, Paper WeI2I.350 | Add to My Program |
| D-LIO: 6DoF Direct LiDAR-Inertial Odometry Based on Simultaneous Truncated Distance Field Mapping |
|
| Coto-Elena, Lucia | University Pablo De Olavide |
| Maese, Jose Enrique | Universidad Pablo De Olavide |
| Merino, Luis | Universidad Pablo De Olavide |
| Caballero, Fernando | Universidad Pablo De Olavide |
Keywords: SLAM, Range Sensing
Abstract: This paper presents a new approach for 6DoF Direct LiDAR-Inertial Odometry (D-LIO) based on the simultaneous mapping of truncated distance fields on CPU. Such continuous representation (in the vicinity of the points) enables working with raw 3D LiDAR data online, avoiding the need for LiDAR feature selection and tracking, simplifying the odometry pipeline, and generalizing easily to many scenarios. The method is based on the proposed Fast Truncated Distance Field (Fast-TDF) method as a convenient tool to represent the environment, employing binary masks that encode the L1 distance. Such representation enables i) solving the LiDAR point-cloud registration as a nonlinear optimization process without the need of selecting/tracking LiDAR features in the input data, ii) simultaneously producing an accurate truncated distance field map of the environment, and iii) updating such map at constant time independently of its size. The approach is tested using open datasets, aerial and ground. It is also benchmarked against other state-of-the-art odometry approaches, demonstrating the same or better level of accuracy, with the added value of an online-generated TDF representation of the environment that can be used for other robotics tasks such as planning or collision avoidance. The source code is publicly available at https://github.com/robotics-upo/D-LIO.git
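A truncated distance field itself is easy to picture on a 2-D grid: the L1 (taxicab) distance from every free cell to the nearest occupied cell, clipped at a truncation radius. The sketch below uses SciPy's chamfer distance transform; the paper's Fast-TDF binary-mask encoding and constant-time incremental update are not reproduced here.

```python
import numpy as np
from scipy.ndimage import distance_transform_cdt

occupied = np.zeros((100, 100), dtype=bool)
occupied[40:60, 70] = True                  # a wall of occupied cells

# L1 distance (in cells) from each cell to the nearest occupied cell
dist = distance_transform_cdt(~occupied, metric='taxicab')
tdf = np.minimum(dist, 10)                  # truncate at a 10-cell radius
```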
|
| |
| 15:00-16:30, Paper WeI2I.351 | Add to My Program |
| Stable Trajectory Planning for Quadruped Robots Using Terrain Features at Feet End |
|
| Li, Congfei | City University of Hong Kong |
| Lin, Shuyue | City University of Hong Kong |
| Qu, Shenwei | Harbin Institute of Technology |
| Liu, Zhuoyuan | The Hong Kong Polytechnic University |
| Yang, Qingjun | Harbin Institute of Technology |
| Meng, Max Q.-H. | The Chinese University of Hong Kong |
| Sun, Yuxiang | City University of Hong Kong |
Keywords: Legged Robots, Motion and Path Planning
Abstract: Quadruped robots have received increasing attention in recent years. Most existing trajectory planning algorithms for quadruped robots focus on how to avoid obstacles and achieve the shortest trajectory or time, which is similar to the planning algorithms for mobile robots. These algorithms cannot take full advantage of the agility and flexibility of quadruped robots. This letter designs a trajectory planner that exploits the agility and flexibility of quadruped robots. With our trajectories, quadruped robots can navigate through complex terrains with more stability (e.g., less momentum variation along the Z-axis). To achieve this goal, we use ground features at the landing point of the feet end to construct the objective function, rather than using the center point of the robot body. Current discrete map representations, such as grid maps or cost maps, make it difficult for optimization algorithms to incorporate environment constraints. Therefore, we use the Sparse Variational Gaussian Process (SVGP) to predict terrain features with point-cloud data as input, so that the environment constraints can be introduced into the optimization problem. Experimental results in both simulation and real-world environments demonstrate the effectiveness of our method.
|
| |
| 15:00-16:30, Paper WeI2I.352 | Add to My Program |
| Memory-Efficient Voxelized Renderable Neural 3D Spatial Representation for Vision-Based Robotics |
|
| Jun, Howoong | Seoul National University |
| Ha, Seongbo | Sungkyunkwan University |
| Lee, Jaewon | Seoul National University |
| Yu, Hyeonwoo | SungKyunKwan University |
| Oh, Songhwai | Seoul National University |
Keywords: Visual Learning, Computer Vision for Automation, Vision-Based Navigation
Abstract: In this paper, we introduce a novel approach for modeling a memory-efficient spatial representation with 3D Gaussian splatting. Efficient vision-based spatial representation poses a significant challenge due to the memory demands of visual information. Recent advances in 3D rendering technologies, such as neural radiance fields (NeRF) and 3D Gaussian splatting, have prompted exploration of their applications in robotics. However, such 3D rendering methods often focus on rendering high-quality images, requiring numerous parameters and resulting in large data, which are unsuitable for robotics applications. To tackle this challenge, we introduce 3DSR, an efficient voxelized renderable neural 3D spatial representation that utilizes 3D Gaussian splatting. 3DSR leverages the strengths of both voxelization (memory efficiency) and 3D Gaussian splatting (high-quality image reconstruction). The proposed method achieves memory efficiency by reducing the number of 3D Gaussians in the 3D representation through voxelization, while preserving the image quality required for effective vision-based robotic applications. Experimental results demonstrate that 3DSR achieves over 90% of the best method's reconstruction quality while requiring only 54.54% of its memory. Additional experiments on visual localization and navigation further confirm that the proposed method is readily applicable to robotics.
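The voxelization step that drives the memory savings can be caricatured as bucketing Gaussian centers into voxels and keeping one representative per voxel. Attribute merging is deliberately oversimplified here (keep the first Gaussian per voxel), and the voxel size is an assumed parameter.

```python
import numpy as np

def voxel_reduce(centers, voxel_size=0.1):
    """centers: (N, 3) Gaussian means; returns indices of kept Gaussians."""
    keys = np.floor(centers / voxel_size).astype(np.int64)
    _, keep = np.unique(keys, axis=0, return_index=True)  # one per voxel
    return np.sort(keep)

centers = np.random.rand(100_000, 3)
kept = voxel_reduce(centers)
print(f"kept {len(kept)} of {len(centers)} Gaussians")
```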
|
| |
| 15:00-16:30, Paper WeI2I.353 | Add to My Program |
| A Multi-Mode Motion Polar Robot: Energy-Saving through Foldable Sail and Transformable Tracks |
|
| Luo, Yongsheng | Harbin Institute of Technology |
| Guo, Zhaokun | Harbin Institute of Technology |
| Liu, Tao | Harbin Institute of Technology |
| Li, Kaixuan | Harbin Institute of Technology |
| Liao, Jinnong | Harbin Institute of Technology |
| Guo, Lefan | Harbin Institute of Technology |
| Zhu, Yanhe | Harbin Institute of Technology |
| Liu, Gangfeng | Harbin Institute of Technology |
| Zhao, Jie | Harbin Institute of Technology |
Keywords: Field Robots, Mechanism Design, Energy and Environment-Aware Automation, Polar Robot
Abstract: Existing polar robots are constrained by limited energy supply, making it difficult to carry out long-term scientific exploration missions, which highlights an urgent demand for energy conservation. An energy-efficient multi-mode motion polar robot is proposed to address this challenge. Both increasing external assistance and reducing the driving force are critical for lowering energy consumption. A foldable sail is designed to provide external assistance. When unfolded, the sail generates assistive force. When folded, it maintains stability in extreme polar climates. The sail shape is designed based on a symmetrically extended NACA0018 airfoil, and the influence of different sail parameters on performance is discussed. The transformable tracks realize switching between traction and sliding modes through the separation of the track and teeth chain, using the sliding mode to reduce driving force. The effect of teeth parameter variations on traction performance is analyzed. The system kinematics and dynamics are modeled, and stability conditions are determined. Based on this, an energy-saving motion control framework for multi-mode motion is proposed. Finally, experiments are conducted to evaluate the energy-saving contribution of each independent mode under different configurations. Comprehensive experiments in multi-mode motion demonstrate an overall energy-saving rate of approximately 24%, verifying the effectiveness of the energy-saving motion control strategy. With its energy-saving advantages, this robot shows strong potential for enabling long-term scientific exploration in polar regions.
|
| |
| 15:00-16:30, Paper WeI2I.354 | Add to My Program |
| Temporal Transfer Learning for Traffic Optimization with Coarse-Grained Advisory Autonomy |
|
| Cho, Jung-Hoon | Massachusetts Institute of Technology |
| Li, Sirui | Massachusetts Institute of Technology |
| Kim, Jeongyun | Massachusetts Institute of Technology |
| Wu, Cathy | Massachusetts Institute of Technology |
Keywords: Intelligent Transportation Systems, Learning and Adaptive Systems, Deep Learning in Robotics and Automation, Transfer Learning
Abstract: The recent development of connected and automated vehicle (CAV) technologies has spurred investigations to optimize dense urban traffic, maximizing vehicle speed and throughput. This article explores advisory autonomy, in which real-time driving advisories are issued to human drivers, thus achieving near-term performance of automated vehicles. Due to the complexity of traffic systems, recent studies of coordinating CAVs have leveraged deep reinforcement learning (RL). Coarse-grained advisory is formalized as zero-order holds, and we consider a range of hold durations from 0.1 to 40 s. However, despite the similarity of the higher frequency tasks for CAVs, a direct application of deep RL fails to generalize to advisory autonomy tasks. To overcome this, we employ zero-shot transfer, training policies on a set of source tasks—specific traffic scenarios with designated hold durations—and then evaluating the efficacy of these policies on different target tasks. We introduce temporal transfer learning (TTL) algorithms to select source tasks for zero-shot transfer, systematically leveraging the temporal structure to solve the full range of tasks. TTL selects the most suitable source tasks to maximize the performance of the range of tasks. We validate our algorithms on diverse mixed-traffic scenarios, demonstrating that TTL more reliably solves the tasks than baselines. This article underscores the potential of coarse-grained advisory autonomy with TTL in traffic flow optimization.
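A zero-order hold over an advisory policy is simple to state in code: compute an action once, then hold it for a fixed number of steps. The gym-style env/policy interface below is an assumption for illustration, not the paper's code.

```python
def rollout_with_hold(env, policy, hold_steps, episode_len=1000):
    obs, total_reward, action = env.reset(), 0.0, None
    for t in range(episode_len):
        if t % hold_steps == 0:        # refresh the advisory, then hold it
            action = policy(obs)
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```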
|
| |
| 15:00-16:30, Paper WeI2I.355 | Add to My Program |
| Hierarchical Planning for Vehicle Routing and Scheduling in Marsupial Robotic Systems |
|
| Kim, Donghyun | KAIST |
| Kim, Jinwhan | KAIST |
Keywords: Planning, Scheduling and Coordination, Multi-Robot Systems
Abstract: This letter presents a hierarchical planning approach to the vehicle routing and scheduling problem (VRSP) for marsupial robotic systems, a specialized class of heterogeneous robotic systems in which one type of mobile robot is capable of carrying another. While traditional VRSPs have been widely studied, the marsupial variant (MVRSP) has received relatively little attention. To address the NP-hard nature of MVRSP, this work introduces a hierarchical planning structure that decomposes the problem into two subproblems with reduced complexity: a high-level routing problem, formulated as a mixed-integer linear program (MILP), and a low-level scheduling problem, modeled in the Planning Domain Definition Language (PDDL). These subproblem solutions are integrated to generate complete mission plans. The proposed approach is validated through qualitative plan visualizations and quantitative Monte Carlo simulations in an autonomous subsea mapping scenario, where an unmanned surface vehicle carries multiple underwater vehicles. Results show that the hierarchical planner significantly improves both planning efficiency and solution quality compared to baseline methods.
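The high-level MILP routing layer can be previewed with a toy assignment model: choose which carried vehicle covers which region at minimum cost. Names, costs, and the PuLP formulation are illustrative stand-ins for the paper's richer routing model.

```python
import pulp

regions = ["R1", "R2", "R3"]
auvs = ["A1", "A2"]
cost = {("A1", "R1"): 4, ("A1", "R2"): 7, ("A1", "R3"): 5,
        ("A2", "R1"): 6, ("A2", "R2"): 3, ("A2", "R3"): 8}

prob = pulp.LpProblem("mvrsp_high_level", pulp.LpMinimize)
x = pulp.LpVariable.dicts("assign", list(cost), cat="Binary")
prob += pulp.lpSum(cost[k] * x[k] for k in cost)       # total mission cost
for r in regions:                                      # each region covered once
    prob += pulp.lpSum(x[(a, r)] for a in auvs) == 1
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([k for k in cost if x[k].value() == 1])          # chosen assignments
```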
|
| |
| 15:00-16:30, Paper WeI2I.356 | Add to My Program |
| OpenPyRo-A1: An Open Python-Based Low-Cost Bimanual Robot for Embodied AI |
|
| Huang, Helong | Huawei Noah’s Ark Lab |
| Mower, Christopher Edwin | Huawei Technologies Research & Development |
| Huang, Guowei | Huawei Technologies Co., Ltd |
| Das, Sarthak | Huawei Research (Noah's Ark Labs) |
| Dierking, Magnus | Huawei Noah's Ark Lab |
| Luo, Guangyuan | Huawei |
| Tan, Kai | Huawei |
| Chen, Xi | Huawei |
| Yang, Yehai | Huawei |
| Chen, Yingbing | Huawei Hongkong Research Center |
| Zeng, Yiming | City University of Hong Kong |
| Li, Yinchuan | Huawei |
| Zhang, Zhanpeng | Huawei Noah's Ark Lab |
| Wu, Shuang | Huawei |
| Zhang, Yingxue | Huawei Noah's Ark Lab |
| Qiu, Weichao | Huawei |
| Cao, Tongtong | Noah's Ark Lab, Huawei Technologies |
| Qin, Mian | Huawei |
| Pakdamansavoji, Sajjad | Huawei |
| Liu, Yuecheng | Huawei Noah's Ark Lab |
| Zhuang, Yuzheng | Huawei Technologies Company |
| Tian, Guangjian | Huawei |
| Hao, Jianye | Noah's Ark Lab |
| Wang, Jun | University College London |
| Bou Ammar, Haitham | Huawei |
| Quan, Xingyue | Huawei |
Keywords: Mechanism Design, Hardware-Software Integration in Robotics
Abstract: Many real-world tasks, such as assembly, cooking, and object handovers, require bi-manual coordination. Learning such skills via imitation remains challenging due to dataset scarcity, mainly caused by the high cost of bi-manual robotic platforms and barriers to entry in robotics software. To address these challenges, we introduce (1) OpenPyRo-A1, a low-cost, bi-manual humanoid robot priced at approximately $14K. OpenPyRo-A1 achieves 0.2 mm repeatability and supports a 5 kg payload per arm, and (2) a Python-first distributed control framework for seamless teleoperation, data collection, and policy deployment, designed for ease of use; moreover, the codebase is installable via pip. We conducted imitation learning experiments in both simulation and the real world, integrating the robot with perception models, motion planning, and a large language model. The results demonstrate that OpenPyRo-A1 is a stable, user-friendly, and high-precision dual-arm platform. We expect that the OpenPyRo-A1 hardware, control system, and curated dataset of bi-manual manipulation episodes will advance affordable and scalable dual-arm robotics.
|
| |
| 15:00-16:30, Paper WeI2I.357 | Add to My Program |
| Design of an Active Morphable Pneumatic Bilayer Planar Actuator Inspired by Starfish |
|
| Zhou, Chengdi | ZheJiang University |
| Li, Jituo | Zhejiang University |
| Long, Juncai | Zhejiang University |
| Diao, Xiaojie | Zhejiang University |
| Lu, GuoDong | Zhejiang University |
| Hassen, Nigatu | Zhejiang university, robotics institute |
| Li, Howard | University of New Brunswick |
| |
| 15:00-16:30, Paper WeI2I.358 | Add to My Program |
| TOP: Trajectory Optimization Via Parallel Optimization towards Constant Time Complexity |
|
| Yu, Jiajun | Zhejiang University |
| Chen, Nanhe | Zhejiang University |
| Liu, Guodong | Tsinghua University |
| Xu, Chao | Zhejiang University |
| Gao, Fei | Zhejiang University |
| Cao, Yanjun | Zhejiang University, Huzhou Institute of Zhejiang University |
Keywords: Motion and Path Planning, Aerial Systems: Applications, Autonomous Vehicle Navigation
Abstract: Optimization has been widely used to generate smooth trajectories for motion planning. However, existing trajectory optimization methods show weakness when dealing with large-scale long trajectories. Recent advances in parallel computing have accelerated optimization in some fields, but how to efficiently solve trajectory optimization via parallelism remains an open question. In this paper, we propose a novel trajectory optimization framework based on the Consensus Alternating Direction Method of Multipliers (CADMM) algorithm, which decomposes the trajectory into multiple segments and solves the subproblems in parallel. The proposed framework reduces the time complexity to O(1) per iteration with respect to the number of segments, compared to O(N) of the state-of-the-art (SOTA) approaches. Furthermore, we introduce a closed-form solution that integrates convex linear and quadratic constraints to speed up the optimization, and we also present a numerical solution for general convex inequality constraints. A series of simulations and experiments demonstrate that our approach outperforms the SOTA approach in terms of efficiency and smoothness; in particular, for a large-scale trajectory with one hundred segments, it achieves over a tenfold speedup. To fully explore the potential of our algorithm on modern parallel computing architectures, we deploy our framework on a GPU and show high performance with thousands of segments.
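The consensus-ADMM decomposition can be demonstrated on a toy problem: each "segment" keeps a local copy of a shared variable (think of junction states between trajectory pieces), and agreement is enforced through a global average and dual updates. The quadratic local costs below stand in for per-segment trajectory objectives; each x-update is independent and hence parallelizable.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 2))        # per-segment targets (toy local costs)
x = np.zeros_like(a)               # local copies of the shared variable
u = np.zeros_like(a)               # scaled dual variables
z = np.zeros(2)                    # consensus variable
rho = 1.0

for _ in range(50):
    x = (a + rho * (z - u)) / (1.0 + rho)   # parallel local solves
    z = (x + u).mean(axis=0)                # consensus (averaging) step
    u = u + x - z                           # dual ascent

print(np.allclose(x, a.mean(axis=0), atol=1e-3))  # all copies agree on the mean
```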
|
| |
| 15:00-16:30, Paper WeI2I.359 | Add to My Program |
| Torque Transmission Modeling of Two Coaxial Electrorheological Clutches for Reciprocating Actuation |
|
| Huang, Shouren | Tokyo University of Science |
| Ishikawa, Masatoshi | Tokyo University of Science |
Keywords: Model Learning for Control, Actuation and Joint Mechanisms
Abstract: This study focuses on modeling the transmission torque of two coaxial electrorheological (ER) fluid clutches through a data-driven approach. Instead of simplifying the viscosity term in the Bingham model to a constant, as in conventional methods, we introduce electric field-dependent nonlinearity into the viscosity term to better capture the complex rheological behavior of ER fluids. Based on this framework, we developed a heuristic explicit model (HEM) and a radial basis function model (RBFM) that incorporate the mechanical characteristics of the coaxial clutch structure. Furthermore, we explored direct estimation methods using a radial basis function network (RBFN) and a feedforward neural network (FNN) without relying on the Bingham model. Comparative evaluations with traditional ER models validated the effectiveness of our nonlinear formulations. Notably, the FNN approach demonstrated superior accuracy even with a single hidden layer containing only a few neurons, making it well-suited for real-time implementation with minimal computational overhead. Real-time validation across diverse operating conditions further confirmed the feasibility and robustness of the FNN-based method. These findings contribute new insights into ER fluid applications.
|
| |
| 15:00-16:30, Paper WeI2I.360 | Add to My Program |
| Integrating Expert Knowledge and Traffic Data for Lane-Changing Intention Prediction in Autonomous Vehicles |
|
| Sun, Chao | Beijing Institute of Technology |
| Wen, Da | Beijing Institute of Technology |
| Gao, Haoming | Beijing Institute of Technology |
| Ning, Changjiu | Beijing Institute of Technology |
| Zhang, Zhang | Beijing Institute of Technology |
| Leng, Jianghao | Beijing Institute of Technology |
| |
| 15:00-16:30, Paper WeI2I.361 | Add to My Program |
| CBF-Based Hierarchical Quadratic Programs with Guaranteed Feasibility for Safety-Critical Systems (I) |
|
| Xie, Junjun | Harbin Institute of Technology, Shenzhen |
| Hu, Liang | Harbin Institute of Technology, Shenzhen |
| Tan, Yunzhe | Department of Automation, School of Mechanical Engineering and Automation, Harbin Institute of Technology, Shenzhen, China |
| Yang, Jun | Loughborough University |
Keywords: Collision Avoidance, Robot Safety, Optimization and Optimal Control
Abstract: Control Barrier Function (CBF) based quadratic programs (QPs) have become an effective method for enforcing safety in safety-critical systems and robotics. However, these methods often suffer from infeasibility or overly conservative relaxations when handling multiple constraints, potentially compromising safety. In this paper, we propose a hierarchical framework called "Safety-first" for control design, which simultaneously incorporates performance objectives formulated using Control Lyapunov Functions (CLFs) and safety guarantees via CBFs with input constraints. Unlike existing approaches, the proposed method guarantees solution feasibility while achieving improved performance, and it is scalable to an arbitrary number of CBF constraints. This scalability enables more precise and flexible representation of complex safety requirements using multiple simple CBFs. For application to mobile robot navigation, we employ Constrained Delaunay Triangulation (CDT) to construct multiple CBFs that approximate irregularly-shaped obstacles. Real-world experiments in cluttered and dynamic environments demonstrate that the Safety-first algorithm achieves safe navigation, validating both the theoretical guarantee and practical advantages over existing methods.
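A single CBF-QP of the kind the framework stacks and prioritizes looks as follows for a velocity-controlled robot avoiding one circular obstacle: stay close to a nominal input while enforcing h_dot >= -alpha*h with h(x) = ||x - x_obs||^2 - r^2. This is a generic textbook instance; the feasibility machinery of the Safety-first hierarchy is not shown.

```python
import cvxpy as cp
import numpy as np

x = np.array([0.0, 0.0])              # robot position (single integrator)
x_obs, r, alpha = np.array([1.0, 0.0]), 0.5, 1.0
u_des = np.array([1.0, 0.0])          # nominal velocity, aimed at the obstacle

h = np.sum((x - x_obs) ** 2) - r ** 2 # barrier value (>0 means safe)
grad_h = 2 * (x - x_obs)              # dh/dx, so h_dot = grad_h @ u

u = cp.Variable(2)
prob = cp.Problem(cp.Minimize(cp.sum_squares(u - u_des)),
                  [grad_h @ u >= -alpha * h])
prob.solve()
print(u.value)                        # velocity capped before the obstacle
```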
|
| |
| 15:00-16:30, Paper WeI2I.362 | Add to My Program |
| Learning from Planned Data to Improve Robotic Pick-And-Place Planning Efficiency |
|
| Qin, Liang | Osaka University |
| Wan, Weiwei | Osaka University |
| Takahashi, Jun | H.U. Research Institute |
| Negishi, Ryo | H. U. Group Research Institute G. K |
| Matsushita, Masaki | H.U. Group Research Inst. G. K., Japan |
| Harada, Kensuke | Osaka University |
Keywords: Grasping, Manipulation Planning, Deep Learning in Grasping and Manipulation
Abstract: This work proposes a learning method to accelerate robotic pick-and-place planning by predicting shared grasps. Shared grasps are defined as grasp poses feasible to both the initial and goal object configurations in a pick-and-place task. Traditional analytical methods for solving shared grasps evaluate grasp candidates separately, leading to substantial computational overhead as the candidate set grows. To overcome this limitation, we introduce an Energy-Based Model (EBM) that predicts shared grasps by combining the energies of feasible grasps at both object poses. The formulation enables early identification of promising candidates and significantly reduces the search space. Experiments show that our method improves grasp selection performance, offers higher data efficiency, and generalizes well to varying grasps and table heights, given that variations fall within the learned distributions.
|
| |
| 15:00-16:30, Paper WeI2I.363 | Add to My Program |
| A Semi-Active Occupational Shoulder Exoskeleton for Overhead Work with Free Mode and Personalized Assistive Torque |
|
| Tian, Jin | Harbin Institute of Technology |
| Wei, Baichun | Harbin Institute of Technology |
| Yang, Chifu | Harbin Institute of Technology |
| Wang, Haijiao | Harbin Institute of Technology |
| Jiang, Feng | Harbin Institute of Technology |
| Huang, Hong | Sichuan University of Science & Engineering |
| Li, Xiang | Harbin Institute of Technology |
| Zhu, Haiqi | Harbin Institute of Technology |
| Yi, Chunzhi | Harbin Institute of Technology |
Keywords: Prosthetics and Exoskeletons, Wearable Robotics, Mechanism Design
Abstract: Current passive or semi-active shoulder exoskeletons for overhead work provide fixed assistive torque for all participants and tasks, which lacks adaptability. In addition, due to the need to store energy at low elevation angles, they may increase physical demand on the user when assistance is not required. This study presents a novel semi-active shoulder exoskeleton that can provide a free mode (i.e., no assistance) and personalized assistive torque to assist overhead work. The exoskeleton comprises a motorized torque generator and a hybrid control strategy. The motorized torque generator, equipped with a servo motor and encoder, is characterized by its ability to electrically adjust the peak assistive torque angle and peak torque. In addition, we propose a hybrid control strategy with free and assistive modes. The free mode allows the exoskeleton to not interfere with movements that do not require assistance. The assistive mode provides personalized torque with three levels based on the height and weight of the user. Experimental results validated the exoskeleton's mechanical performance (e.g., high backdrivability) and its assistive effectiveness. The results showed that the exoskeleton could reduce shoulder muscle activation by up to 55.03% and demonstrated a significant difference compared to fixed assistance.
|
| |
| 15:00-16:30, Paper WeI2I.364 | Add to My Program |
| Robustness of Panoptic Segmentation for Degraded Automotive Cameras Data (I) |
|
| Wang, Yiting | University of Warwick |
| Zhao, Haonan | University of Warwick |
| Dianati, Mehrdad | Queen's University of Belfast |
| Debattista, Kurt | University of Warwick |
| Donzella, Valentina | Queen Mary University of London |
Keywords: Computer Vision for Automation, Data Sets for Robotic Vision, Deep Learning for Visual Perception
Abstract: Precise situational awareness is vital for the safe deployment of artificial intelligence in real-world scenarios, especially in assisted and automated driving (AAD) systems. Panoptic segmentation, which unifies semantic and instance segmentation, plays a key role in identifying objects, hazards, and drivable areas at the pixel level. However, the relationship between segmentation robustness and camera image quality remains insufficiently explored. To address this, we propose a unified pipeline to evaluate the robustness of panoptic segmentation models for automotive cameras, correlating their performance with eight traditional image quality assessment (IQA) metrics. We introduce D-Cityscapes+, a degraded dataset featuring 19 realistic automotive degradations at multiple severity levels, including novel darkness and snowfall models with veiling effects. Evaluation across 14 state-of-the-art backbones reveals: (1) large-particle degradations (e.g., lens droplets, heavy snow) cause severe performance drops and edge-concentrated errors; (2) Transformer-based models are more robust but limited by <2 FPS and >500 GFLOPs; (3) frequency-domain IQA metrics such as CW-SSIM strongly correlate with segmentation performance; and (4) generic image restoration does not consistently improve perception. The findings and benchmark (https://github.com/Warwick-Jocelyn/BRPS) provide practitioners with diagnostic tools, datasets, and guidelines for developing robust, real-time panoptic segmentation.
|
| |
| 15:00-16:30, Paper WeI2I.365 | Add to My Program |
| DOSE3: Diffusion-Based Unified Out-Of-Distribution Detection on SE(3) Trajectories |
|
| Cheng, Hongzhe | Carnegie Mellon University |
| Zheng, Tianyou | Carnegie Mellon University |
| Ma, Ziyong | Carnegie Mellon University |
| Zhang, Tianyi | Carnegie Mellon University |
| Johnson-Roberson, Matthew | Carnegie Mellon University |
| Zhi, Weiming | The University of Sydney; Vanderbilt University |
Keywords: Deep Learning Methods, Big Data in Robotics and Automation
Abstract: Out-Of-Distribution (OOD) detection, the task of identifying when an input falls outside the distribution seen at training time, is critical for deploying safe and reliable systems. Traditional OOD methods require retraining models whenever the in-distribution changes. Recent work introduces unified models for OOD detection, where metrics can be constructed from an unconditional diffusion model trained on an arbitrary dataset, and the inlier distribution can be changed without retraining the diffusion model. However, these unified approaches have been largely confined to Euclidean or latent-space domains. In contrast, real-world robotic systems often perceive and act through sequences of 6-degree-of-freedom poses in the Special Euclidean Group SE(3), taking into account both translations and orientation changes over time. In this work, we extend OOD detection to trajectories in the Special Euclidean Group in 3D, SE(3), by presenting Diffusion-based Out-of-distribution detection on SE(3) (DOSE3). DOSE3 constructs an OOD metric from the noise estimator of a diffusion model over SE(3) to separate outlier samples from inlier distributions. We demonstrate DOSE3's strong OOD detection performance through extensive validation on multiple real-world robotics and autonomous-systems datasets, covering vehicle and robot-manipulator motion trajectories.
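The following sketch shows one common way to turn a frozen diffusion noise estimator into an OOD score: inject known noise, ask the model to predict it, and use the prediction error. The noising schedule and the `eps_model` callable are placeholder assumptions; DOSE3's actual construction operates on SE(3) trajectories rather than plain Euclidean arrays.

```python
import numpy as np

def diffusion_ood_score(x, eps_model, n_probes=8, rng=None):
    """OOD score from a pretrained, frozen diffusion noise estimator: noise a
    sample, let the model predict the injected noise, and average the squared
    prediction error over a few noise levels. In-distribution samples, which
    the model denoises well, get low scores. `eps_model(x_t, t)` stands in
    for the trained network and must return an array shaped like x_t."""
    rng = rng if rng is not None else np.random.default_rng(0)
    errs = []
    for t in np.linspace(0.1, 0.9, n_probes):          # a few noise levels
        eps = rng.standard_normal(x.shape)
        x_t = np.sqrt(1 - t) * x + np.sqrt(t) * eps    # simple noising schedule
        errs.append(np.mean((eps_model(x_t, t) - eps) ** 2))
    return float(np.mean(errs))
```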
|
| |
| 15:00-16:30, Paper WeI2I.366 | Add to My Program |
| Communication-Efficient Module-Wise Federated Learning for Grasp Pose Detection in Cluttered Environments |
|
| Kang, Woonsang | Hanyang University |
| Lee, Joohyung | Gachon University |
| Kim, Seungjun | Hanyang University |
| Cho, Jungchan | Gachon University |
| Oh, Yoonseon | Hanyang University |
Keywords: Deep Learning in Grasping and Manipulation, Grasping
Abstract: Grasp pose detection (GPD) is a fundamental capability for robotic autonomy, but its reliance on large, diverse datasets creates significant data privacy and centralization challenges. Federated Learning (FL) offers a privacy-preserving solution, but its application to GPD is hindered by the substantial communication overhead of large models, a key issue for resource-constrained robots. To address this, we propose a novel module-wise FL framework that begins by analyzing the learning dynamics of the GPD model's functional components. This analysis identifies slower-converging modules, to which our framework then allocates additional communication effort. This is realized through a two-phase process: a standard full-model training phase is followed by a communication-efficient phase where only an adaptively identified subset of slower-converging modules is trained and their partial updates are aggregated. Extensive experiments on the GraspNet-1B dataset demonstrate that our method outperforms standard FedAvg and other baselines, achieving higher accuracy for a given communication budget. Furthermore, real-world experiments on a physical robot validate our approach, showing a superior grasp success rate compared to baseline methods in cluttered scenes. Our work presents a communication-efficient framework for training robust, generalized GPD models in a decentralized manner, effectively improving the trade-off between communication cost and model performance.
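A toy sketch of the two-phase aggregation pattern described above, assuming model parameters are stored as per-module dicts of arrays: phase 1 averages everything (standard FedAvg), phase 2 averages only a selected subset of slower-converging modules. The module names and the fixed selection are hypothetical; the paper identifies the subset adaptively from learning dynamics.

```python
import numpy as np

def fedavg(client_models, keys=None):
    """Average client parameter dicts. If `keys` is given, only those modules
    are aggregated (communication-efficient phase); the rest stay local."""
    keys = keys if keys is not None else list(client_models[0].keys())
    return {k: np.mean([m[k] for m in client_models], axis=0) for k in keys}

# Toy models: two modules ("backbone", "grasp_head") per client.
clients = [{"backbone": np.ones(4) * i, "grasp_head": np.ones(2) * i}
           for i in range(3)]

full_update = fedavg(clients)                            # phase 1: full model
partial_update = fedavg(clients, keys=["grasp_head"])    # phase 2: slow module only
for m in clients:
    m.update(partial_update)   # clients merge the partial global update
```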
|
| |
| 15:00-16:30, Paper WeI2I.367 | Add to My Program |
| Real-Time Path-Reconfigurable Coverage Planning for Multi-UAV Missions Over Disjoint Areas |
|
| Luo, Cai | China University of Petroleum (East China) |
| Guo, Jintao | China University of Petroleum (East China) |
| Liu, Zixuan | China University of Petroleum (East China) |
| Liu, Lei | Shandong University |
| Luo, Chunbo | University of Exeter |
|
|
| |
| 15:00-16:30, Paper WeI2I.368 | Add to My Program |
| In-Orbit Space Structure Inspection Trajectory Generation |
|
| Apodaca, Brandon | University of Michigan |
| Helgeson, Thor | University of Michigan |
| Atkins, Ella | University of Michigan |
| Stirling, Leia | University of Michigan |
Keywords: Space Robotics and Automation, Optimization and Optimal Control, Human Factors and Human-in-the-Loop
Abstract: Exterior International Space Station (ISS) visual inspection currently requires astronaut extravehicular activity (EVA), a safety risk. Free-flying space robots can perform visual inspection but risk station collision and high astronaut overhead for teleoperation. Existing inspection planners do not effectively co-optimize inspection coverage and energy consumption with consideration of both orbital dynamics and human supervisor situation awareness. This paper presents an inspection trajectory generation pipeline that integrates orbital dynamics with robot coverage path-planning methods to assure collision avoidance and investigate situation awareness. Inspection trajectories meet thrust and space-robot dynamics constraints while achieving 98% coverage with 17 grams of fuel on a space station model scaled to the ISS. Pareto front analysis directly balances fuel consumption against coverage. The presented solutions show that paths vary as a function of coverage versus energy prioritization. The methods in this paper contribute to reducing the risk to astronaut safety during space station operation and maintenance by providing trajectory generation algorithms for external semi-autonomous in-orbit inspection of complex space structures.
|
| |
| 15:00-16:30, Paper WeI2I.369 | Add to My Program |
| PushingBots: Collaborative Pushing Via Neural Accelerated Combinatorial Hybrid Optimization |
|
| Tang, Zili | Peking University |
| Zhang, Ying | Peking University |
| Guo, Meng | Peking University |
Keywords: Multi-Robot Systems, Motion and Path Planning, Planning, Scheduling and Coordination, Collaborative Pushing
Abstract: Many robots are not equipped with a manipulator, and many objects are not suitable for prehensile manipulation (such as boxes and large cylinders). In these cases, pushing is a simple yet effective non-prehensile skill for robots to interact with and change the environment. Existing work often assumes a set of predefined pushing modes and fixed-shape objects. This work tackles the general problem of controlling a robotic fleet to collaboratively push numerous arbitrary objects to respective destinations within complex environments of cluttered and movable obstacles. It incorporates several characteristic challenges of multi-robot systems, such as online task coordination under large uncertainties in cost and duration, and of contact-rich tasks, such as hybrid switching among different contact modes and under-actuation due to constrained contact forces. The proposed method is based on combinatorial hybrid optimization over dynamic task assignments and hybrid execution via sequences of pushing modes and associated forces. It consists of three main components, the first being the decomposition, ordering, and rolling assignment of pushing subtasks to robot subgroups.
|
| |
| 15:00-16:30, Paper WeI2I.370 | Add to My Program |
| Non-Submodular Visual Attention for Robot Navigation |
|
| Vafaee, Reza | Boston College |
| Behzad, Kian | Northeastern University |
| Siami, Milad | Northeastern University |
| Carlone, Luca | Massachusetts Institute of Technology |
| Jadbabaie, Ali | University of Pennsylvania |
Keywords: Visual-Based Navigation, Autonomous Vehicle Navigation, Localization, Optimization and Optimal Control
Abstract: This paper presents a task-oriented computational framework to enhance Visual-Inertial Navigation (VIN) in robots, addressing challenges such as limited time and energy resources. The framework strategically selects visual features using a Mean Square Error (MSE)-based, non-submodular objective function and a simplified dynamic anticipation model. To address the NP-hardness of this problem, we introduce four polynomial-time approximation algorithms: a classic greedy method with constant-factor guarantees; a low-rank greedy variant that significantly reduces computational complexity; a randomized greedy sampler that balances efficiency and solution quality; and a linearization-based selector built on a first-order Taylor expansion for near-constant-time execution. We establish rigorous performance bounds by leveraging submodularity ratios, curvature, and element-wise curvature analyses. Extensive experiments on both standardized benchmarks and a custom control-aware platform validate our theoretical results, demonstrating that these methods achieve strong approximation guarantees while enabling real-time deployment.
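To make the greedy selection concrete, here is a classic greedy feature-selection loop under a simplified MSE-style objective, the trace of the inverse information matrix, where each feature contributes a J_i^T J_i term. This is a stand-in for the paper's objective and carries none of its submodularity-ratio or curvature analysis; the random Jacobians are illustrative.

```python
import numpy as np

def greedy_feature_selection(jacobians, k, prior=1e-3):
    """Greedily pick k features minimizing trace(info^-1), a common MSE-style
    proxy, where feature i adds J_i^T J_i to the information matrix."""
    n = jacobians[0].shape[1]
    info = prior * np.eye(n)           # prior keeps the inverse well-defined
    chosen, remaining = [], list(range(len(jacobians)))
    for _ in range(k):
        costs = [np.trace(np.linalg.inv(info + jacobians[i].T @ jacobians[i]))
                 for i in remaining]
        best = remaining[int(np.argmin(costs))]
        info += jacobians[best].T @ jacobians[best]
        chosen.append(best)
        remaining.remove(best)
    return chosen

rng = np.random.default_rng(1)
feats = [rng.standard_normal((2, 6)) for _ in range(40)]  # 40 candidate features
print(greedy_feature_selection(feats, k=5))
```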
|
| |
| 15:00-16:30, Paper WeI2I.371 | Add to My Program |
| Service Placement in Dynamic Multi-AGV Environments for Minimized Energy Consumption |
|
| Torres-Pérez, Claudia | I2CAT Foundation |
| Carmona-Cejudo, Estela | I2CAT Foundation |
| Cervelló-Pastor, Cristina | Universitat Politècnica De Catalunya |
| Masoumi, Maryam | Universidad De Valladolid |
| Coronado, Estefanía | Universidad De Castilla-La Mancha, i2CAT Foundation |
| Siddiqui, Muhammad Shuaib | I2CAT Foundation |
Keywords: Task Planning, Planning, Scheduling and Coordination, Swarm Robotics
Abstract: In multi-automated guided vehicle (AGV) environments, inefficient service placement increases energy consumption and charging cycles, lowering battery lifespan. Consequently, minimizing energy consumption is key to maintaining operational efficiency and sustainability. Additionally, the unpredictable arrival of service requests in multi-AGV systems can lead to system saturation. However, previous research overlooked the energy costs of on-device computation, especially under dynamic service arrivals. To address these challenges, this work proposes an energy minimization service placement algorithm (EMSPA). The results demonstrate that EMSPA outperforms a baseline random selection (RS) algorithm for different numbers of AGVs, services, and tasks per service, reducing normalized energy consumption by up to 2.34% and improving mean service acceptance rates by up to 16.09% with linear execution-time overhead. Further, EMSPA outperforms a queue-aware scheduling and deadlock mitigation strategy (QASDMS) in terms of processing power ratio by over 58.94%.
|
| |
| 15:00-16:30, Paper WeI2I.372 | Add to My Program |
| MANG-SLAM: Multi-Agents Neural Submap and Gaussian Representation for Dense Mapping |
|
| Li, Yonghao | Beijing University of Posts and Telecommunications |
| Ye, Ping | Beijing University of Posts and Telecommunications |
| Jia, Qingxuan | Beijing University of Posts and Telecommunications |
| |
| 15:00-16:30, Paper WeI2I.373 | Add to My Program |
| RoboMT: Human-Like Compliance Control for Assembly Via a Bilateral Robotic Teleoperation and Hybrid Mamba-Transformer Framework |
|
| Wang, Rundong | National University of Singapore |
| Cheng, Yanchun | National University of Singapore |
| Yuan, Qilong | Singapore Institute of Technology |
| Prakash, Alok | Singapore-MIT Alliance for Research and Technology |
| Tay, Francis | NUS |
| Ang Jr, Marcelo H | National University of Singapore |
Keywords: Compliance and Impedance Control, Model Learning for Control, Telerobotics and Teleoperation
Abstract: Robotic compliance control is critical for delicate tasks such as electronic connector assembly, where precise force regulation and adaptability are paramount. However, traditional methods often struggle with modeling inaccuracies and sensor noise. Inspired by human adaptability in complex assembly operations, we present RoboMT, a novel framework that integrates a Mamba algorithm with a Transformer architecture to achieve human-like compliance control. By leveraging a bilateral teleoperation platform, we collect extensive real-time force/torque and motion data to form a comprehensive dataset for training. Furthermore, RoboMT incorporates an Adaptive Action Chunk module and a Temporal Fusion module to ensure smooth and robust action prediction. Experimental results across four electronic assembly tasks show that RoboMT achieves superior success rates (62–98%) over baselines (29–98%), while maintaining stable force regulation around 2.5 N, closely resembling human performance. During task transitions, RoboMT quickly stabilizes at 5 N with minimal overshoot, avoiding the large force spikes (over 24 N) seen in baselines. Additionally, RoboMT maintains an average inference speed of 55 ms per batch, balancing real-time responsiveness and control robustness. Overall, RoboMT presents a compelling pathway toward error-minimized, human-level compliance control and generalization, setting a new benchmark for precision, adaptability, and robustness in real-world robotic assembly.
|
| |
| 15:00-16:30, Paper WeI2I.374 | Add to My Program |
| Analytical Parameter Conversion of CPC and POE Models |
|
| Sumenkov, Oleg | Sirius University of Science and Technology |
Keywords: Kinematics, Formal Methods in Robotics and Automation, Calibration and Identification
Abstract: This paper presents an analytical framework for parameter conversion between Complete and Parametrically Continuous (CPC) and Product of Exponentials (POE) kinematic models of serial-chain mechanisms. The approach, grounded in Lie group and algebra theory, formulates and proves three key lemmas to enable exact POE-to-CPC parameter conversion. Building upon established POE-DH and CPC-DH transitions, the proposed framework facilitates flexible model selection based on application-specific needs, independent of the initial parameterization. Primarily designed for robot calibration, this method also serves as a unifying tool for comparing and analyzing calibration results across different kinematic conventions. The framework's effectiveness is demonstrated through numerical validation on the PUMA 560 robot, confirming its accuracy and practical applicability.
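For readers less familiar with the POE side of the conversion, the sketch below evaluates product-of-exponentials forward kinematics for a planar 2R arm. It only illustrates the POE model itself, not the paper's CPC conversion lemmas; the twist coordinates follow the common (v, w) convention with v = -w x q for a revolute axis through a point q.

```python
import numpy as np
from scipy.linalg import expm

def twist_hat(xi):
    """4x4 matrix form of a twist xi = (v, w) in R^6."""
    v, w = xi[:3], xi[3:]
    W = np.array([[0, -w[2], w[1]],
                  [w[2], 0, -w[0]],
                  [-w[1], w[0], 0]])
    T = np.zeros((4, 4))
    T[:3, :3], T[:3, 3] = W, v
    return T

def poe_fk(twists, thetas, M):
    """Product-of-exponentials forward kinematics:
    T = exp(xi_1^ theta_1) ... exp(xi_n^ theta_n) M."""
    T = np.eye(4)
    for xi, th in zip(twists, thetas):
        T = T @ expm(twist_hat(xi) * th)
    return T @ M

# Planar 2R arm with unit links, both joints rotating about z.
xi1 = np.array([0, 0, 0, 0, 0, 1])    # joint 1: axis through the origin
xi2 = np.array([0, -1, 0, 0, 0, 1])   # joint 2: axis through q=(1,0,0), v = -w x q
M = np.eye(4); M[0, 3] = 2.0          # home pose of the end effector
print(poe_fk([xi1, xi2], [np.pi / 2, 0.0], M)[:3, 3])   # approx (0, 2, 0)
```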
|
| |
| 15:00-16:30, Paper WeI2I.375 | Add to My Program |
| Narrow Passage Path Planning Via Homotopy-Preserving Collision Constraint Interpolation |
|
| Lee, Minji | Seoul National University |
| Lee, Jeongmin | Seoul National University |
| Lee, Dongjun | Seoul National University |
| |
| 15:00-16:30, Paper WeI2I.376 | Add to My Program |
| HCOA*: Hierarchical Class-Ordered A* for Navigation in Semantic Environments |
|
| Psomiadis, Evangelos | Georgia Tech |
| Tsiotras, Panagiotis | Georgia Tech |
Keywords: Motion and Path Planning, Semantic Scene Understanding, Autonomous Agents
Abstract: This paper addresses the problem of robot navigation in mixed geometric/semantic 3D environments. Given a hierarchical representation of the environment, the objective is to navigate from a start position to a goal while satisfying task-specific safety constraints and minimizing computational cost. We introduce Hierarchical Class-ordered A* (HCOA*), an algorithm that leverages the environment's hierarchy for efficient and safe path-planning in mixed geometric/semantic graphs. We use a total order over the semantic classes and prove theoretical performance guarantees for the algorithm. We propose three approaches for higher-layer node classification based on the semantics of the lowest layer: a Graph Neural Network method, a k-Nearest Neighbors method, and a Majority-Class method. We evaluate our algorithm in simulations on two 3D Scene Graphs, comparing it to the state-of-the-art and assessing the performance of each classification approach. Results show that HCOA* reduces the computational time of navigation by up to 50%, while maintaining near-optimal performance across a wide range of scenarios.
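One way to picture the class-ordered search is an A*-style loop whose priority compares, lexicographically, the worst semantic class on the path and then the path cost. The sketch below (with a zero heuristic, so effectively Dijkstra) is a simplified reading of the idea, not the paper's hierarchical algorithm with its guarantees; the graph and classes are toy data.

```python
import heapq

def class_ordered_astar(graph, node_class, start, goal, heuristic=lambda n: 0):
    """Search ordered first by the worst semantic class on the path
    (0 = safest), then by path cost. graph: {node: [(neighbor, cost), ...]}."""
    frontier = [((node_class[start], heuristic(start)), start, 0.0)]
    best = {}
    while frontier:
        (worst, _), node, g = heapq.heappop(frontier)
        if node == goal:
            return worst, g
        if best.get(node, (float("inf"), float("inf"))) <= (worst, g):
            continue                      # an equal-or-better label was expanded
        best[node] = (worst, g)
        for nxt, cost in graph[node]:
            key = (max(worst, node_class[nxt]), g + cost + heuristic(nxt))
            heapq.heappush(frontier, (key, nxt, g + cost))
    return None

graph = {"a": [("b", 1), ("c", 1)], "b": [("g", 1)], "c": [("g", 5)], "g": []}
node_class = {"a": 0, "b": 2, "c": 0, "g": 0}   # b crosses a riskier class
print(class_ordered_astar(graph, node_class, "a", "g"))  # prefers a-c-g: (0, 6.0)
```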
|
| |
| 15:00-16:30, Paper WeI2I.377 | Add to My Program |
| Informed, Constrained, Aligned: A Field Analysis on Degeneracy-Aware Point Cloud Registration in the Wild (I) |
|
| Tuna, Turcan | ETH Zurich, Robotic Systems Lab |
| Nubert, Julian | ETH Zurich, Robotic Systems Lab |
| Pfreundschuh, Patrick | ETH Zurich, Autonomous Systems Lab |
| Cadena, Cesar | ETH Zurich, Robotic Systems Lab |
| Khattak, Shehryar | NASA Jet Propulsion Laboratory |
| Hutter, Marco | ETH Zurich, Robotic Systems Lab |
Keywords: Field Robots, SLAM, Mapping
Abstract: The iterative closest point (ICP) registration algorithm has been a preferred method for light detection and ranging (LiDAR)-based robot localization for nearly a decade. However, even in modern simultaneous localization and mapping (SLAM) solutions, ICP can degrade and become unreliable in geometrically ill-conditioned environments. In response, this work investigates and compares new and existing degeneracy mitigation methods for robust LiDAR-based localization and analyzes the efficacy of these approaches in degenerate environments for the first time in the literature at this scale. Specifically, this work investigates i) the effect of using active or passive degeneracy mitigation methods for the problem of ill-conditioned ICP in LiDAR-degenerate environments and ii) the evaluation of truncated singular value decomposition (TSVD), inequality constraints (Ineq. Con.), and linear/nonlinear Tikhonov regularization for degenerate point cloud registration, for the first time. Furthermore, a sensitivity analysis of the least-squares minimization step of the ICP problem is carried out to better understand how each method affects the optimization and what to expect from each method. The results of the analysis are validated through multiple real-world robotic field and simulated experiments. The analysis demonstrates that active optimization-degeneracy mitigation is necessary and advantageous in the absence of reliable external estimate assistance for LiDAR-SLAM.
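Of the compared mitigation methods, TSVD is the easiest to show in a few lines: solve the ICP least-squares step but zero the update along singular directions that fall below a conditioning threshold. The toy system and threshold below are illustrative; the paper evaluates this alongside inequality constraints and Tikhonov regularization on real field data.

```python
import numpy as np

def tsvd_solve(A, b, cond_thresh=1e3):
    """Solve the least-squares step A x ~= b with truncated SVD: singular
    directions more than `cond_thresh` times weaker than the strongest are
    treated as degenerate and receive no update."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > s[0] / cond_thresh
    s_inv = np.where(keep, 1.0 / np.where(keep, s, 1.0), 0.0)
    return Vt.T @ (s_inv * (U.T @ b))

# A toy ill-conditioned system: the 3rd direction is nearly unobservable.
rng = np.random.default_rng(2)
A = rng.standard_normal((100, 3)) @ np.diag([1.0, 1.0, 1e-6])
x_true = np.array([0.5, -0.2, 3.0])
b = A @ x_true + 1e-3 * rng.standard_normal(100)
print(tsvd_solve(A, b))   # recovers the observable components, zeros the rest
```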
|
| |
| 15:00-16:30, Paper WeI2I.378 | Add to My Program |
| A High-DOF BCI Control Strategy Mapping Discrete Commands to Continuous Motion for a Drone (I) |
|
| Mei, Jie | Tianjin University |
| Chen, Weize | Tianjin University |
| Huang, Yongzhi | Tianjin University |
| Xiao, Xiaolin | Tianjin University |
| Wang, Kun | Tianjin University |
| Yi, Weibo | Beijing Institute of Mechanical Equipment |
| Jung, Tzyy-Ping | University of California at San Diego |
| Xu, Minpeng | Tianjin University |
| Ming, Dong | Tianjin University |
Keywords: Brain-Machine Interfaces, Physical Human-Robot Interaction, Human-Centered Robotics
Abstract: Because of the non-stationary nature of electroencephalogram (EEG) signals, traditional non-invasive brain-computer interfaces (BCIs) usually only produce discrete commands, limiting their ability to control external devices continuously. This study proposes a novel BCI control strategy mapping multiple discrete commands to continuous motion, enabling real-time manipulation of a drone in four degrees of freedom (DOF). Our strategy used the fast steady state visual evoked potential (SSVEP) encoding and decoding method to convert user intentions into the drone’s flight status in near real-time. Simultaneously, the drone’s live video was embedded into the SSVEP stimuli, providing users with a first-person perspective control experience. In drone control experiments, participants successfully maneuvered the drone through complex path-following tasks in simulated and physical scenarios. The mean flight trajectory bias ratio was measured as 0.81, with a mean flight smoothness of -3.31 (measured by spectral arc length) and mean Fitts’s throughput of 9.18 bits/min. Notably, the brain-to-hand ratio (BHR) for all metrics approached 1, indicating that our non-invasive control system achieved comparable performance to manual control systems. These results suggest the effectiveness of our proposed BCI control strategy that maps discrete commands to continuous motion and extends the capabilities of non-invasive BCIs in continuous control scenarios.
|
| |
| 15:00-16:30, Paper WeI2I.379 | Add to My Program |
| Collaborative Task Allocation for Heterogeneous Multi-Robot Systems through Iterative Clustering |
|
| Martin, David R. | University of California, Irvine |
| Butler, Brooks A. | University of California, Irvine |
| Nivison, Scott | Air Force Research Laboratory |
| Egerstedt, Magnus | University of California, Irvine |
| Al Faruque, Mohammad Abdullah | University of California, Irvine |
| Khargonekar, Pramod | University of California, Irvine |
Keywords: Multi-Robot Systems, Task Planning, Cooperating Robots
Abstract: Multi-robot systems face the challenge of efficiently allocating teams of heterogeneous robots to tasks. The task allocation problem is complicated by collaborative interactions between robots, where teams of robots develop emergent capabilities that enable them to complete tasks that would be inefficient or impossible for individual robots. To address these challenges, we present an iterative clustering algorithm for collaborative task allocation in heterogeneous multi-robot systems. This approach partitions the computationally intractable global optimization problem into smaller, tractable subproblems by iteratively forming clusters of robots and tasks, then optimizing assignments within each cluster. By ensuring robots remain clustered with their currently assigned tasks, we guarantee monotonic improvement in overall utility with each iteration. We analyze the convergence of the algorithm and characterize how cluster size constraints determine which suboptimal assignments could trap the algorithm. In simulation, iterative clustering consistently outperforms simulated annealing and a group-based auction in both computation time and solution quality, and outperforms a hedonic game approach in solution quality.
|
| |
| 15:00-16:30, Paper WeI2I.380 | Add to My Program |
| Nav-SCOPE: Swarm Robot Cooperative Perception and Coordinated Navigation (I) |
|
| Li, Chenxi | Tsinghua University |
| Lu, Weining | Tsinghua University |
| Lin, Qingquan | Tsinghua University |
| Meng, Litong | Tsinghua University |
| Li, Haolu | Beijing Information Science and Technology University |
| Liang, Bin | Tsinghua University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Distributed Robot Systems
Abstract: This paper proposes a lightweight decentralized solution for multi-robot coordinated navigation with cooperative perception. First, we introduce a rapid way to process sensory data, thus obtaining safe directions and key environmental features. Then, an information flow is created to facilitate real-time perception sharing over wireless ad-hoc networks. Consequently, the environmental uncertainties of each robot are reduced by interaction fields that deliver complementary information. Finally, path optimization is achieved in a probabilistic way, enabling self-organized coordination with effective convergence, divergence, and collision avoidance. Our method is fully interpretable and ready for deployment without gaps. Comprehensive simulations and real-world experiments demonstrate reduced path redundancy, robust performance across various tasks, and minimal demands on computation and communication.
|
| |
| 15:00-16:30, Paper WeI2I.381 | Add to My Program |
| Steering Performance Optimization for Wheeled Mobile Robots in Granular Media Via DRFM: Enhancing Locomotion Precision and Energy Efficiency |
|
| Cao, Chuang | Shanghai Jiao Tong University |
| Huang, Lei | Shanghai Jiao Tong University |
| Zhang, Feiyu | Shanghai Jiao Tong University |
| Yin, Yh | Shanghai Jiao Tong University |
Keywords: Wheeled Robots, Dynamics, Field Robots
Abstract: Degraded steering performance and increased energy consumption present significant barriers to deploying wheeled mobile robots (WMRs) in granular media such as sand and lunar regolith. This study presents and experimentally validates a systematic optimization framework based on the Dynamic Resistive Force Model (DRFM). By integrating the DRFM with a four-wheel vehicle dynamics model featuring front-wheel steering, this approach accurately captures wheel–terrain interactions in granular materials. Subject to a prescribed trajectory root-mean-square error constraint, the framework minimizes energy consumption per unit distance while determining optimal front-wheel steering angles and wheel-speed ratios. Experiments demonstrate that the active steering strategy reduces energy consumption per unit distance by 12.3% while maintaining trajectory root-mean-square error within 6.5%. The proposed method provides a generalizable design paradigm for motion-control optimization on granular terrain, establishing the foundation for long-duration, energy-efficient rover operations.
|
| |
| 15:00-16:30, Paper WeI2I.382 | Add to My Program |
| Floating Prosthesis: A Wearable Passive Exoskeleton for Transradial Prosthesis Gravity Compensation |
|
| Shi, Ke | Southeast University |
| Chen, Tongshu | Southeast University |
| Zhang, Maozeng | Southeast University |
| Yan, Jiejun | Southeast University |
| Zhu, Lifeng | Southeast University |
| Song, Aiguo | Southeast University |
| |
| 15:00-16:30, Paper WeI2I.383 | Add to My Program |
| Real-Time Generation of Near-Minimum-Energy Trajectories Via Constraint-Informed Residual Learning |
|
| Dona', Domenico | University of Padua |
| Franzese, Giovanni | TU Delft |
| Della Santina, Cosimo | TU Delft |
| Boscariol, Paolo | University of Padua |
| Lenzo, Basilio | University of Padua |
Keywords: Industrial Robots, Optimization and Optimal Control, Learning from Demonstration
Abstract: Industrial robots demand significant energy to operate, making energy-reduction methodologies increasingly important. Strategies for planning minimum-energy trajectories typically involve solving nonlinear optimal control problems (OCPs), which can rarely meet real-time requirements. In this paper, we propose a paradigm for generating near-minimum-energy trajectories for manipulators by learning from optimal solutions. Our paradigm leverages a residual learning approach, which embeds boundary conditions while focusing on learning only the adjustments needed to steer a standard solution toward an optimal one. Compared to a computationally expensive OCP-based planner, our paradigm achieves 87.3% of the performance near the training dataset and 50.8% far from the dataset, while being two to three orders of magnitude faster.
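The boundary-condition-embedding trick behind such residual learning can be sketched in a few lines: multiply the learned correction by a window that vanishes, together with its derivative, at both endpoints, so the boundary conditions hold by construction regardless of the network output. The cubic base profile and the sine "residual" below are stand-ins, assuming this common construction rather than the paper's exact scheme.

```python
import numpy as np

def rest_to_rest_base(q0, qf, T, t):
    """Cubic rest-to-rest reference: matches q0, qf with zero end velocities."""
    s = t / T
    return q0 + (qf - q0) * (3 * s**2 - 2 * s**3)

def shaped_trajectory(q0, qf, T, t, residual):
    """Add a learned correction scaled by a window that vanishes (with its
    derivative) at t = 0 and t = T, so position and velocity boundary
    conditions are preserved no matter what `residual` returns."""
    s = t / T
    window = (s * (1 - s)) ** 2
    return rest_to_rest_base(q0, qf, T, t) + window * residual(t)

T = 2.0
t = np.linspace(0.0, T, 201)
q = shaped_trajectory(0.0, 1.0, T, t, residual=lambda t: np.sin(4 * t))
print(q[0], q[-1])   # exactly 0.0 and 1.0 regardless of the residual term
```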
|
| |
| 15:00-16:30, Paper WeI2I.384 | Add to My Program |
| Impact of Active vs. Passive Robot Behavior on Task Efficiency in Collaborative Physical HRI |
|
| Tiozzo, Alessandro | Italian Institute of Technology |
| Scorza Azzarà, Giulia | University of Genoa, Italian Institute of Technology |
| Rizzo, Alessandro | Politecnico Di Torino |
| Sciutti, Alessandra | Italian Institute of Technology |
| Rea, Francesco | Italian Institute of Technology |
Keywords: Human-Robot Collaboration, Humanoid Robot Systems, Cooperating Robots
Abstract: Advancements in physical Human-Robot Interaction (pHRI) aim to achieve natural and efficient collaboration between humans and robots, especially in dynamic environments where task performance is essential. This study focuses on co-manipulative human-robot joint activities, exploring key components of performance and synchronization. The primary objective was to design an active control technique for the iCub robot's arms that enhances task efficiency through an approach distinct from traditional force-feedback control. Comparing the iCub's passive behavior with the designed active one revealed an increase in the robot's contribution, delivered through adaptive velocity and mimicry, and showcased its ability to respond dynamically to changes in human actions. Furthermore, measuring the exertion applied by the counterparts revealed that the active behavior required greater energy consumption to reach those levels of synchronization and performance. These results highlight the importance of balancing active behavior against effort intensity to achieve task efficiency in pHRI.
|
| |
| 15:00-16:30, Paper WeI2I.385 | Add to My Program |
| Curb-Tracker: An Integrated Curb Following System for Autonomous Vehicles |
|
| Liang, Jiahao | Nanyang Technological University |
| Wang, Yuanzhe | Shandong University |
| Peng, Guohao | Nanyang Technological University |
| Wu, Zhenyu | Nanyang Technological University |
| Wang, Danwei | Nanyang Technological University |
Keywords: Autonomous Vehicle Navigation, Motion and Path Planning, Motion Control, Trajectory Generation
Abstract: Curb following is a critical technology for autonomous road-sweeping vehicles. However, existing solutions face two primary challenges: unreliable curb detection and inefficient motion generation. Unreliable curb detection stems from the wide variability in curb dimensions and types, as well as interference from roadside features such as vegetation and infrastructure. Inefficient motion generation occurs when existing methods prioritize tracking accuracy while neglecting task-completion efficiency, leading to prolonged operation times. To address these challenges, we propose Curb-Tracker, an integrated curb-following system designed for autonomous vehicles operating in diverse road environments. Firstly, we develop a robust and adaptive curb detection algorithm that leverages a 2.5D elevation map of the local environment and dynamically adjusts key parameters online to ensure reliable detection across varying scenarios. Secondly, to achieve accurate and efficient curb-aligned motion generation, we leverage Model Predictive Contouring Control (MPCC) as a tailored framework specifically designed for the curb-following task to generate an optimal control sequence for the vehicle to maneuver.
|
| |
| 15:00-16:30, Paper WeI2I.386 | Add to My Program |
| GNSS-Inertial State Initialization Using Inter-Epoch Baseline Residuals |
|
| Cerezo, Samuel | Universidad De Zaragoza |
| Civera, Javier | Universidad De Zaragoza |
Keywords: SLAM, Sensor Fusion
Abstract: Initializing the state of a sensorized platform can be challenging, as a limited set of measurements often provides weakly informative constraints that are, in addition, highly non-linear. This may lead to poor initial estimates that converge to local minima during subsequent non-linear optimization. We propose an adaptive GNSS–inertial initialization strategy that delays the incorporation of global GNSS constraints until they become sufficiently informative. In the initial stage, our method leverages inter-epoch baseline vector residuals between consecutive GNSS fixes to mitigate inertial drift. To determine when to activate global constraints, we introduce a general criterion based on the evolution of the Hessian matrix's singular values, effectively quantifying system observability. Experiments on the EuRoC, GVINS, and MARS-LVIG datasets show that our approach consistently outperforms the naive strategy of fusing all measurements from the outset, yielding more accurate and robust initializations.
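A criterion of this kind can be prototyped directly from the singular values of the information (Hessian) matrix, as in the hedged sketch below: declare the global GNSS constraints informative once the weakest direction is above a floor and not drastically smaller than the strongest one. The thresholds and the diagonal toy Hessians are hypothetical, not the paper's calibrated criterion.

```python
import numpy as np

def global_constraints_ready(H, gap=1e2, floor=1e-6):
    """Heuristic activation test in the spirit of the paper's criterion:
    check that the weakest singular value of the Hessian is above a floor
    and within a conditioning gap of the strongest one."""
    s = np.linalg.svd(H, compute_uv=False)
    return s[-1] > floor and s[0] / s[-1] < gap

# Toy example: one direction (e.g., global yaw) starts unobservable.
H_early = np.diag([50.0, 40.0, 1e-9])
H_later = np.diag([50.0, 40.0, 5.0])
print(global_constraints_ready(H_early), global_constraints_ready(H_later))
# -> False True: activate the global constraints only once observable.
```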
|
| |
| 15:00-16:30, Paper WeI2I.387 | Add to My Program |
| Primal-Dual iLQR for GPU-Accelerated Learning and Control in Legged Robots |
|
| Amatucci, Lorenzo | Istituto Italiano Di Tecnologia |
| Moreira de Sousa Pinto, Joao | Tesla |
| Turrisi, Giulio | Istituto Italiano Di Tecnologia |
| Orban, Dominique | Polytechnique Montréal |
| Barasuol, Victor | Istituto Italiano Di Tecnologia |
| Semini, Claudio | Istituto Italiano Di Tecnologia |
Keywords: Optimization and Optimal Control, Legged Robots, Multi-Contact Whole-Body Motion Planning and Control
Abstract: This paper introduces a novel Model Predictive Control (MPC) implementation for legged robot locomotion that leverages GPU parallelization. Our approach enables both temporal and state-space parallelization by incorporating a parallel associative scan to solve the primal-dual Karush-Kuhn-Tucker (KKT) system. In this way, the optimal control problem is solved with O(log^2(n) log(N) + log^2(m)) complexity instead of O(N(n + m)^3), where n, m, and N are the dimensions of the system state and control vector and the length of the prediction horizon, respectively. We demonstrate the advantages of this implementation over two state-of-the-art solvers (acados and crocoddyl), achieving up to a 60% improvement in runtime for Whole Body Dynamics (WB)-MPC and a 700% improvement for Single Rigid Body Dynamics (SRBD)-MPC when varying the prediction horizon length. The presented formulation also scales efficiently with the problem state dimensions, enabling the definition of a centralized controller for up to 16 legged robots that can be computed in less than 25 ms. Furthermore, thanks to the JAX implementation, the solver supports large-scale parallelization across multiple environments, making it possible to perform learning with the MPC in the loop directly on the GPU. The code associated with this work can be found at https://github.com/iit-DLSLab/mpx .
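The temporal parallelization rests on the fact that prefix composition of the per-timestep maps is associative, so it can be evaluated with an O(log N)-depth scan. The sketch below demonstrates the idea on a plain affine recursion x_{k+1} = A_k x_k + b_k using a Hillis-Steele scan in NumPy (executed sequentially here, but each doubling step is fully parallelizable); the paper applies the same primitive to the primal-dual KKT system in JAX.

```python
import numpy as np

def affine_scan(As, bs):
    """Inclusive prefix composition of affine maps x_{k+1} = A_k x_k + b_k via
    recursive doubling: O(log N) parallel steps instead of an O(N) rollout.
    Composition of two affine maps is associative but not commutative, so the
    operand order (later map applied after earlier prefix) matters."""
    A = np.array(As, dtype=float)      # (N, n, n)
    b = np.array(bs, dtype=float)      # (N, n)
    d = 1
    while d < len(A):
        A2, b2 = A.copy(), b.copy()
        # combine the prefix ending at i-d with the prefix ending at i
        A2[d:] = A[d:] @ A[:-d]
        b2[d:] = (A[d:] @ b[:-d][..., None])[..., 0] + b[d:]
        A, b = A2, b2
        d *= 2
    return A, b    # A[k], b[k] now map x_0 to x_{k+1}

# Check against the sequential rollout.
rng = np.random.default_rng(3)
As = rng.standard_normal((8, 2, 2)) * 0.5
bs = rng.standard_normal((8, 2))
x = np.array([1.0, -1.0])
for Ak, bk in zip(As, bs):
    x = Ak @ x + bk
A_all, b_all = affine_scan(As, bs)
print(np.allclose(A_all[-1] @ np.array([1.0, -1.0]) + b_all[-1], x))  # True
```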
|
| |
| 15:00-16:30, Paper WeI2I.388 | Add to My Program |
| MILD: Tractable Terrain Modeling for Learning Improved Bipedal Locomotion on Deformable Surfaces |
|
| Luo, Zeren | The University of Hong Kong |
| Zhang, Jiahui | The University of Hong Kong |
| Xu, Zhe | Beijing Institute of Technology |
| Li, Wanyue | The University of Hong Kong |
| Li, Xinqi | The University of Hong Kong |
| Chen, Xuechao | Beijing Insititute of Technology |
| Yu, Zhangguo | Beijing Institute of Technology |
| Tang, Annan | The University of Tokyo |
| Lu, Peng | The University of Hong Kong |
Keywords: Machine Learning for Robot Control, Reinforcement Learning
Abstract: Enabling robots to walk on yielding terrain is vital for applications ranging from disaster response to planetary exploration. While bipedal robots hold immense potential, their locomotion on deformable surfaces remains limited as current simulators fail to capture the spatiotemporal heterogeneity of such yielding substrates. We present MILD, featuring a physics-grounded discrete-element contact solver that accurately simulates spatially varying foot-terrain interactions. Complementing this model, we train a terrain-aware locomotion controller via deep reinforcement learning with latent modulation and proprioceptive estimation. Quantitative comparisons against state-of-the-art methods show our approach generates more diverse and realistic contact scenarios during training, resulting in controllers that exhibit natural adaptation on real deformable surfaces. Through hardware experiments, we demonstrate the system's capability for online terrain identification and adaptation across a wide range of surface stiffness.
|
| |
| 15:00-16:30, Paper WeI2I.389 | Add to My Program |
| Online Modifications of High-Level Swarm Behaviors |
|
| Huang, Bo-Ruei | Cornell University |
| Kress-Gazit, Hadas | Cornell University |
Keywords: Swarm Robotics, Formal Methods in Robotics and Automation
Abstract: Recent work has demonstrated how one can write high-level specifications for swarm behaviors and automatically create controllers for the individual robots to achieve the overall swarm task. In this paper, we address the question of how to modify, during execution, the desired behavior while maintaining guarantees on the behavior, if possible; we define three types of modification: changing the maximum number of robots in a region of the workspace, changing the connectivity of the workspace, and redistributing robots. During execution, if the specification is modified, we update the controller by creating local patches. Given the starting and ending state of the patch, we jointly use a symbolic synthesis tool and a constraint programming solver to synthesize robot control. We demonstrate our approach in simulation.
|
| |
| 15:00-16:30, Paper WeI2I.390 | Add to My Program |
| RoboPacker: An Autonomous Robotic Packing System for General Objects (I) |
|
| Wu, Zhenyu | Beijing University of Posts and Telecommunications |
| Wang, Ziwei | Nanyang Technological University |
| Huang, Sichao | Tsinghua University |
| Liu, Zhan | Tsinghua University |
| Xu, Xiuwei | Tsinghua University |
| Yan, Haibin | Beijing University of Posts and Telecommunications |
| Lu, Jiwen | Tsinghua University |
Keywords: Logistics, Grasping, Perception for Grasping and Manipulation
Abstract: In this paper, we propose an autonomous robotic packing system named RoboPacker, designed to tightly store cluttered general objects in shipping boxes with high space utilization, a fundamental process in numerous industrial applications. However, achieving tight packing of general objects often demands significant labor from human packers, particularly in high-throughput scenes. Compared to existing robot packing approaches, RoboPacker effectively overcomes challenges such as diverse object appearances, severe occlusion, and crowded packing spaces. Specifically, we propose an open-vocabulary shape estimation method to reconstruct complete point clouds for cluttered objects. We also design effective interactions with object clutter to gather informative visual clues for shape estimation under high uncertainty. Additionally, we introduce a hierarchical reinforcement learning framework to optimize packing order, location, and orientation for maximum space utilization. The robotic packing system integrates these techniques with feasible manipulation methods for real-world implementation. In this way, RoboPacker achieves efficient packing of novel and irregular objects, making it more suitable for real deployment environments. Real-world experiments demonstrate that RoboPacker can tightly pack 20 densely cluttered everyday objects from 8 seen and 4 novel classes into a 40x40x20 cm shipping box with a 73.3% success rate.
|
| |
| 15:00-16:30, Paper WeI2I.391 | Add to My Program |
| Real-Time Spatiotemporal Tubes for Dynamic Unsafe Sets |
|
| Das, Ratnangshu | Indian Institute of Science, Bangalore |
| Upadhyay, Siddhartha | Indian Institute of Science Bengaluru |
| Jagtap, Pushpak | Indian Institute of Science |
Keywords: Planning under Uncertainty, Reactive and Sensor-Based Planning, Integrated Planning and Control
Abstract: This paper presents a real-time control framework for nonlinear pure-feedback systems with unknown dynamics to satisfy reach-avoid-stay tasks within a prescribed time in dynamic environments. To achieve this, we introduce a real-time spatiotemporal tube (STT) framework. An STT is defined as a time-varying ball in the state space whose center and radius adapt online using only real-time sensory input. A closed-form, approximation-free control law is then derived to constrain the system output within the STT, ensuring safety and task satisfaction. We provide formal guarantees for obstacle avoidance and on-time task completion. The effectiveness and scalability of the framework are demonstrated through simulations and hardware experiments on a mobile robot and an aerial vehicle, navigating in cluttered dynamic environments.
|
| |
| 15:00-16:30, Paper WeI2I.392 | Add to My Program |
| MoonBot: Modular and On-Demand Reconfigurable Robot Toward Moon Base Construction (I) |
|
| Uno, Kentaro | Tohoku University |
| Neppel, Elian | Tohoku University |
| Diaz Huenupan, Gustavo Hernan | Tohoku University |
| Mishra, Ashutosh | Tohoku University |
| Karimov, Shamistan | Tohoku University |
| Jain, A. Sejal | Tohoku University |
| Habib, Ayesha | Tohoku University |
| Pama, Pascal | Tohoku University |
| Gozbasi, Hazal | Tohoku University |
| Santra, Shreya | Tohoku University |
| Yoshida, Kazuya | Tohoku University |
Keywords: Space Robotics and Automation, Cellular and Modular Robots, Telerobotics and Teleoperation
Abstract: The allure of lunar surface exploration and development has recently captured widespread global attention. Robots have proved to be indispensable for exploring uncharted terrains, uncovering and leveraging local resources, and facilitating the construction of future human habitats. In this article, we introduce the modular and on-demand reconfigurable robot (MoonBot), a modular and reconfigurable robotic system engineered to maximize functionality while operating within the stringent mass constraints of lunar payloads and adapting to varying environmental conditions and task requirements. This article details the design and development of MoonBot and presents a preliminary field demonstration that validates the proof of concept through the execution of milestone tasks simulating the establishment of lunar infrastructure. These tasks include essential civil engineering operations, infrastructural component transportation and deployment, and assistive operations with inflatable modules. Furthermore, we systematically summarize the lessons learned during testing, focusing on the connector design and providing valuable insights for the advancement of modular robotic systems in future lunar missions.
|
| |
| 15:00-16:30, Paper WeI2I.393 | Add to My Program |
| Customize-Your-Joy Hand: A User-Oriented, Cost-Effective 22-DOF Platform for Future Human-Robot Community |
|
| Chai, Jin | University of Science and Technology of China |
| Yanghong, Li | University of Science and Technology of China |
| Dong, Erbao | University of Science and Technology of China |
Keywords: Tendon/Wire Mechanism, Prosthetics and Exoskeletons, Biologically-Inspired Robots
Abstract: Humanoid dexterous hands have significant potential in prosthetics, service robotics, and high-performance manipulation. However, existing designs often struggle to balance the challenging requirements of lightweight design, high biomimicry, personalized customization, and low cost. To address these challenges, we present the CYJ Hand (Customize-Your-Joy Hand), an innovative 22-DOF humanoid dexterous hand. Featuring a highly biomimetic structure, the CYJ system weighs only 750 grams (forearm included). Its modular design supports user-oriented customization while simplifying assembly, maintenance, and functional expansion. Inspired by Da Vinci's mechanics, the CYJ Hand integrates a novel, controllable tendon mechanism that allows for reconfigurable tendon routing and actuation to meet diverse needs. Constructed with 3D printing and affordable commercial materials, the hardware cost of the CYJ Hand structure (excluding actuators) is under $60. Experimental results demonstrate that the CYJ Hand achieves a 100% success rate in both the Kapandji Test and the GRASP taxonomy, and further exhibits dynamic grasping, in-hand manipulation, sub-millimeter motion repeatability (~0.7 mm), and reliable load-bearing performance, validating its exceptional dexterity and biomimetic design. With its comprehensive advantages and innovations, the CYJ Hand provides a versatile platform for future applications and research in personalized prosthetics and dexterous robotic manipulation, bridging the gap between high dexterity and accessibility in humanoid robotics. Related files and methods are open-sourced in a GitHub repository.
|
| |
| 15:00-16:30, Paper WeI2I.394 | Add to My Program |
| Navigation of Robotic Swarmalators with Dynamics and Constraints |
|
| Xu, Xinyue | University of Michigan, Ann Arbor |
| Xiao, Wei | Worcester Polytechnic Institute |
| Ceron, Steven | University of Michigan |
Keywords: Swarm Robotics, Multi-Robot Systems, Cooperating Robots
Abstract: Swarmalator studies have enabled self-organized collective behaviors that emerge from dual spatial and temporal coupling, without relying on external inputs. These behaviors arise solely from attractive and repulsive interactions modulated by a few global parameters. Here, we treat the swarmalator model as a planner and study how several of the collective behaviors change in terms of space-phase organization when the agents are robots with vehicle dynamics and constraints, including omnidirectional, unicycle, and bicycle dynamics. Furthermore, we use the control barrier function method to guarantee that the collective can navigate around objects, through cluttered environments, and transport objects between obstacles by exploiting global and local control methods. This work brings us closer to realizing large groups of robotic swarmalators, with heterogeneous dynamics, that can enable shape formation, navigation, and object manipulation in cluttered environments.
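For reference, the planner layer the abstract refers to is the standard swarmalator model; a minimal Euler-integration sketch is given below. The gains, step size, and initialization are illustrative, and none of the paper's vehicle dynamics or control-barrier-function constraints are included.

```python
import numpy as np

def swarmalator_step(x, theta, J=0.8, K=-0.5, dt=0.05):
    """One Euler step of the standard swarmalator model (O'Keeffe et al.):
    spatial attraction modulated by phase similarity, short-range repulsion,
    and distance-weighted phase coupling."""
    n = len(x)
    diff = x[None, :, :] - x[:, None, :]               # x_j - x_i
    dist = np.linalg.norm(diff, axis=-1) + np.eye(n)   # avoid self-division
    dphase = theta[None, :] - theta[:, None]
    attract = diff / dist[..., None] * (1 + J * np.cos(dphase))[..., None]
    repel = diff / (dist ** 2)[..., None]
    mask = 1.0 - np.eye(n)                             # exclude self-terms
    dx = ((attract - repel) * mask[..., None]).sum(axis=1) / n
    dtheta = K / n * (mask * np.sin(dphase) / dist).sum(axis=1)
    return x + dt * dx, theta + dt * dtheta

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, (50, 2))                # 50 agents in the plane
theta = rng.uniform(0, 2 * np.pi, 50)          # internal phases
for _ in range(500):
    x, theta = swarmalator_step(x, theta)      # settles into a space-phase pattern
```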
|
| |
| 15:00-16:30, Paper WeI2I.395 | Add to My Program |
| Gaussian Variational Inference with Non-Gaussian Factors for State Estimation: A UWB Localization Case Study |
|
| Stirling, Andrew | McGill University |
| Lukashchuk, Mykola | Eindhoven University of Technology |
| Bagaev, Dmitry | Eindhoven University of Technology |
| Kouw, Wouter | Eindhoven University of Technology |
| Forbes, James Richard | McGill University |
Keywords: Localization, Sensor Fusion, Range Sensing
Abstract: This letter extends the exactly sparse Gaussian variational inference (ESGVI) algorithm for state estimation in two complementary directions. First, ESGVI is generalized to operate on matrix Lie groups, enabling the estimation of states with orientation components while respecting the underlying group structure. Second, factors are introduced to accommodate heavy-tailed and skewed noise distributions, as commonly encountered in ultra-wideband (UWB) localization due to non-line-of-sight (NLOS) and multipath effects. Both extensions are shown to integrate naturally within the ESGVI framework while preserving its sparse and derivative-free structure. The proposed approach is validated in a UWB localization experiment with NLOS-rich measurements, demonstrating improved accuracy and comparable consistency. Finally, a Python implementation within a factor-graph-based estimation framework is made open-source to support broader research use.
|
| |
| 15:00-16:30, Paper WeI2I.396 | Add to My Program |
| Bundled Liquid Crystal Elastomer Actuators with Integrated Cooling for Mesoscale Soft Robots |
|
| Sepehri, Anoush | University of California, San Diego |
| Kim, Sukjun | University of California, San Diego |
| Agrawal, Devyansh | University of California San Diego |
| Yared, Hannah | University of California Los Angeles |
| Dong, Gaoweiang | University of California San Diego |
| Cai, Shengqiang | University of California San Diego |
| Morimoto, Tania K. | University of California San Diego |
Keywords: Soft Robot Materials and Design, Soft Robot Applications, Soft Sensors and Actuators
Abstract: Liquid crystal elastomer (LCE) is a promising material for developing thermally driven soft actuators due to its high force density, large elastic strain limit, and mechanically programmable nature. However, the complex trade-off between the force generated and the response speed (i.e., cooling rate), along with the lack of systematic design guidelines necessary to build actuators using LCE, has significantly limited its widespread adoption, especially for soft robotic applications at the meso-scale (i.e., cm-scale). In this work, we developed thermally driven soft actuators by bundling liquid crystal elastomer units with integrated cooling, which increased the response speed by over 400% compared to relying only on passive cooling. We developed and experimentally validated an electro-thermo-mechanical model to predict the forces and cooling rates of our actuator and established systematic design guidelines for building our actuators for different soft robotic applications. Using our proposed guidelines, we present an inchworm-inspired locomotion robot that can achieve a top speed of 6 body lengths per minute. We also present a textile forearm cuff with integrated haptic feedback that can provide over 4 mm of skin-stretch feedback with a cooling time of 1 second. Overall, the presented actuator, experimental results, and design guidelines expand the potential use cases for thermally driven actuators in soft robotic applications at the meso-scale.
|
| |
| 15:00-16:30, Paper WeI2I.397 | Add to My Program |
| Robotic Harvesting of Delicate Fruit: Design and Implementation of an Under-Actuated Disturbance-Resistant Gripper |
|
| Wang, Jianguo | Zhejiang University |
| Li, Jihao | Zhejiang University |
| Li, Jituo | Zhejiang University |
| Du, Xiaoqiang | Zhejiang Sci-Tech University |
| Dong, Huixu | Zhejiang University |
Keywords: Grippers and Other End-Effectors, Grasping, Contact Modeling
Abstract: Agricultural harvesting grippers have emerged as a pivotal technology in the evolution of smart agriculture, enabling efficient fruit collection. Existing grippers frequently fail to accommodate the diverse morphologies of fruits. Moreover, achieving stable grasping under external disturbances, such as natural wind, remains a significant challenge. To mitigate these limitations, we present a novel under-actuated gripper for fruit harvesting, alongside innovative harvesting strategies aimed at optimizing both the harvest success rate and fruit integrity. Firstly, a strategy is developed to improve the success rate of harvesting operations under disturbance: the fruit is stabilized by enclosing the stem, which is then grasped and severed to detach the fruit. Secondly, an auxiliary locking and grasping coupling mechanism allows a single actuator to drive all components of the gripper, reducing cost and control complexity. Thirdly, a force distribution constraint unit allocates the actuator's power between grasping and shearing actions, enabling regulation of the grasping force to protect the fruit. Finally, the performance of the gripper is validated through a series of rigorous experiments on fruits with diverse sizes, textures, and surface characteristics, demonstrating its superior efficacy in preserving fruit integrity during real-world harvesting scenarios.
|
| |
| 15:00-16:30, Paper WeI2I.398 | Add to My Program |
| MSG-Loc: Multi-Label Likelihood-Based Semantic Graph Matching for Object-Level Global Localization |
|
| Lee, Gihyeon | Inha University |
| Lee, Jungwoo | Inha University |
| Kim, Juwon | Inha University |
| Shin, Young-Sik | Kyungpook National University |
| Cho, Younggun | Inha University |
Keywords: Localization, Semantic Scene Understanding
Abstract: Robots are often required to localize in environments with unknown object classes and semantic ambiguity. However, when performing global localization using semantic objects, high semantic ambiguity intensifies object misclassification and increases the likelihood of incorrect associations, which in turn can cause significant errors in the estimated pose. Thus, in this letter, we propose a multi-label likelihood-based semantic graph matching framework for object-level global localization. The key idea is to exploit multi-label graph representations, rather than single-label alternatives, to capture and leverage the inherent semantic context of object observations. Based on these representations, our approach enhances semantic correspondence across graphs by combining the likelihood of each node with the maximum likelihood of its neighbors via context-aware likelihood propagation. For rigorous validation, data association and pose estimation performance are evaluated under both closed-set and open-set detection configurations. In addition, we demonstrate the scalability of our approach to large-vocabulary object categories in both real-world indoor scenes and synthetic environments.
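A simplified reading of the propagation step might look like the sketch below: blend each node's class-likelihood vector with the element-wise maximum over its neighbors' vectors and renormalize. The blend weight, the toy graph, and the exact combination rule are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def propagate_likelihoods(node_probs, adjacency, alpha=0.5):
    """Context-aware likelihood propagation, simplified: reinforce each
    node's multi-label likelihoods with the strongest evidence among its
    neighbors, then renormalize. `alpha` is a hypothetical blend weight."""
    out = {}
    for v, p in node_probs.items():
        nbrs = [node_probs[u] for u in adjacency[v]]
        ctx = np.max(nbrs, axis=0) if nbrs else p
        blended = (1 - alpha) * p + alpha * ctx
        out[v] = blended / blended.sum()
    return out

# Toy 3-node graph with 3 candidate classes per object node.
probs = {0: np.array([0.5, 0.3, 0.2]),
         1: np.array([0.2, 0.7, 0.1]),
         2: np.array([0.1, 0.2, 0.7])}
adj = {0: [1], 1: [0, 2], 2: [1]}
print(propagate_likelihoods(probs, adj))
```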
|
| |
| 15:00-16:30, Paper WeI2I.399 | Add to My Program |
| STATE-NAV: Stability-Aware Traversability Estimation for Bipedal Navigation on Rough Terrain |
|
| Yoon, Ziwon | Georgia Institute of Technology |
| Zhu, Lawrence Y. | Georgia Institute of Technology |
| Lu, Jingxi | University of Southern California |
| Gan, Lu | Georgia Institute of Technology |
| Zhao, Ye | Georgia Institute of Technology |
Keywords: Humanoid and Bipedal Locomotion, Vision-Based Navigation, Legged Robots
Abstract: Bipedal robots have advantages in maneuvering human-centered environments, but face greater failure risk compared to other stable mobile platforms such as wheeled or quadrupedal robots. While learning-based traversability has been widely studied for these platforms, bipedal traversability has instead relied on manually designed rules with limited consideration of locomotion stability on rough terrain. In this work, we present the first learning-based traversability estimation and risk-sensitive navigation framework for bipedal robots operating in diverse, uneven environments. TravFormer, a transformer-based neural network, is trained to predict bipedal instability with uncertainty, enabling risk-aware and adaptive planning. Based on the network, we define traversability as stability-aware command velocity—the fastest command velocity that keeps instability below a user-defined limit. This velocity-based traversability is integrated into a hierarchical planner that combines a traversability-informed Rapidly-exploring Random Tree Star (TravRRT*) for time-efficient planning and Model Predictive Control (MPC) for safe execution. We validate our method in MuJoCo simulation and the real world, demonstrating improved navigation performance, with enhanced robustness and time efficiency across varying terrains compared to existing methods.
|
| |
| 15:00-16:30, Paper WeI2I.400 | Add to My Program |
| Generalizable and Fast Surrogates: Model Predictive Control of Articulated Soft Robots Using Physics-Informed Neural Networks |
|
| Habich, Tim-Lukas | Leibniz University Hannover |
| Mohammad, Aran | Leibniz University Hannover |
| Ehlers, Simon F. G. | Leibniz University Hannover |
| Bensch, Martin | Leibniz University Hannover |
| Seel, Thomas | Leibniz University Hannover |
| Schappler, Moritz | Leibniz University Hannover |
Keywords: Modeling, Control, and Learning for Soft Robots, Model Learning for Control, Optimization and Optimal Control, Physics-Informed Machine Learning
Abstract: Soft robots can revolutionize several applications with high demands on dexterity and safety. When operating these systems, real-time estimation and control require fast and accurate models. However, prediction with first-principles (FP) models is slow, and learned black-box models have poor generalizability. Physics-informed machine learning offers excellent advantages here, but it is currently limited to simple, often simulated systems without considering changes after training. We propose physics-informed neural networks (PINNs) for articulated soft robots (ASRs) with a focus on data efficiency. The amount of expensive real-world training data is reduced to a minimum: one dataset in one system domain. Two hours of data in different domains are used for a comparison against two gold-standard approaches: in contrast to a recurrent neural network, the PINN provides high generalizability, and it exceeds the prediction speed of an accurate FP model by up to a factor of 467 at slightly reduced accuracy. This enables nonlinear model predictive control (MPC) of a pneumatic ASR. Accurate position tracking with the MPC running at 47 Hz is achieved in six dynamic experiments.
|
| |
| 15:00-16:30, Paper WeI2I.401 | Add to My Program |
| Autonomous Exploration for Shape Reconstruction and Measurement Via Informative Contact-Guided Planning |
|
| Zhao, Feiyu | ShanghaiTech University |
| Xiao, Chenxi | ShanghaiTech University |
Keywords: Force and Tactile Sensing, Reactive and Sensor-Based Planning, Planning under Uncertainty
Abstract: Coordinate Measuring Machines (CMMs) are widely used for high-precision inspection of industrial parts, particularly in scenarios where visual systems are ineffective or cost-prohibitive. However, conventional CMMs rely on CAD model priors and user-defined probing paths, which limit their applicability and efficiency in measuring freeform parts. To overcome these limitations, we present a fully autonomous, CAD model-free, tactile-based framework that enables dense 3D shape reconstruction to facilitate subsequent measurements. Our approach leverages a dual Gaussian Process Implicit Surface architecture, termed Exploration-Reconstruction GPIS (ER-GPIS), which enables both high-fidelity shape reconstruction and uncertainty estimation on the object’s surface. A hybrid exploration motion planner is then employed to adaptively sample surface geometries by integrating local surface exploration, global exploration, and contact recovery policies for robust shape estimation. Extensive real-world experiments demonstrate that the proposed method effectively reconstructs object geometries across diverse shapes, highlighting its ability to autonomously reconstruct and measure both surfaces and internal features without relying on CAD model priors.
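For intuition, here is a toy uncertainty-driven probing step built on a single Gaussian Process implicit surface; the paper uses a dual-GPIS architecture with additional local/global/recovery policies, and all names below are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def next_probe_point(contact_pts, sdf_vals, candidates):
    """Fit a GP implicit surface to contact observations (signed distance
    is ~0 at contact points) and return the candidate with the largest
    predictive uncertainty as the next touch location."""
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.05))
    gp.fit(contact_pts, sdf_vals)
    _, std = gp.predict(candidates, return_std=True)
    return candidates[np.argmax(std)]   # most informative next contact
```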
|
| |
| 15:00-16:30, Paper WeI2I.402 | Add to My Program |
| START: Traversing Sparse Footholds with Terrain Reconstruction |
|
| Yu, Ruiqi | Zhejiang University |
| Wang, Qianshi | Zhejiang University |
| Li, Hongyi | Zhejiang University |
| Zheng, Jun | The 19th Asian Games Hangzhou 2022 Organising Committee |
| Wang, Zhicheng | Zhejiang University |
| Wu, Jun | Zhejiang University |
| Zhu, Qiuguo | Zhejiang University |
Keywords: Legged Robots, Reinforcement Learning, Deep Learning for Visual Perception
Abstract: Traversing terrains with sparse footholds like legged animals presents a promising yet challenging task for quadruped robots, as it requires precise environmental perception and agile control to secure safe foot placement while maintaining dynamic stability. Model-based hierarchical controllers excel in laboratory settings, but suffer from limited generalization and overly conservative behaviors. End-to-end learning-based approaches unlock greater flexibility and adaptability, but existing state-of-the-art methods either rely on heightmaps that introduce noise and complex, costly pipelines, or implicitly infer terrain features from egocentric depth images, often missing critical geometric cues and leading to inefficient learning and rigid gaits. To overcome these limitations, we propose START, a single-stage learning framework that enables agile, stable locomotion on highly sparse and randomized footholds. START leverages only low-cost onboard vision and proprioception to accurately reconstruct a local terrain heightmap, providing an explicit intermediate representation to convey essential features relevant to sparse foothold regions. This supports comprehensive environmental understanding and precise terrain assessment, reducing exploration cost and accelerating skill acquisition. Experimental results demonstrate that START achieves zero-shot transfer across diverse real-world scenarios, showcasing superior adaptability, precise foothold placement, and robust locomotion.
|
| |
| 15:00-16:30, Paper WeI2I.403 | Add to My Program |
| Universal Trajectory Optimization Framework for Differential Drive Robot Class (I) |
|
| Zhang, Mengke | Zhejiang University |
| Chen, Nanhe | Zhejiang University |
| Wang, Hu | China Tobacco Zhejiang Industrial CO., LTD |
| JianXiong, Qiu | China Tobacco Zhejiang Industrial CO., LTD |
| Han, Zhichao | Zhejiang University |
| Ren, Qiuyu | Zhejiang University |
| Xu, Chao | Zhejiang University |
| Gao, Fei | Zhejiang University |
| Cao, Yanjun | Zhejiang University, Huzhou Institute of Zhejiang University |
Keywords: Motion and Path Planning, Autonomous Vehicle Navigation, Optimization and Optimal Control
Abstract: Differential drive robots are widely used in various scenarios thanks to their straightforward principle, from household service robots to disaster response field robots. The nonholonomic dynamics and possible lateral slip of these robots lead to difficulty in getting feasible and high-quality trajectories. Although there are several types of driving mechanisms for real-world applications, they all share a similar driving principle, which involves controlling the relative motion of independently actuated tracks or wheels to achieve both linear and angular movement. Therefore, a comprehensive trajectory optimization to compute trajectories efficiently for various kinds of differential drive robots is highly desirable. In this paper, we propose a universal trajectory optimization framework, enabling the generation of high-quality trajectories within a restricted computational timeframe for these robots. We introduce a novel trajectory representation based on polynomial parameterization of motion states or their integrals, such as angular and linear velocities, which inherently matches the robots' motion to the control principle. The trajectory optimization problem is formulated to minimize computation complexity while prioritizing safety and operational efficiency. We conduct extensive simulations and real-world testing in crowded environments with three kinds of differential drive robots to validate the effectiveness of our approach.
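The velocity-level polynomial parameterization can be pictured as follows: linear and angular velocity profiles are polynomials, and poses follow by integrating unicycle kinematics, so trajectories are consistent with the differential drive principle by construction. This is a simplified stand-in, not the paper's optimizer.

```python
import numpy as np

def rollout(v_coeffs, w_coeffs, T=2.0, n=200):
    """Integrate unicycle kinematics for polynomial linear/angular
    velocity profiles (numpy polynomial coefficient convention)."""
    t = np.linspace(0.0, T, n)
    v = np.polyval(v_coeffs, t)            # linear velocity profile
    w = np.polyval(w_coeffs, t)            # angular velocity profile
    dt = t[1] - t[0]
    theta = np.cumsum(w) * dt              # heading = integral of omega
    x = np.cumsum(v * np.cos(theta)) * dt  # position follows nonholonomically
    y = np.cumsum(v * np.sin(theta)) * dt
    return np.stack([x, y, theta], axis=1)
```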
|
| |
| 15:00-16:30, Paper WeI2I.404 | Add to My Program |
| A Mixed Integer Programming Formulation for Risk Stratification |
|
| Mekhaldi, Rachda Naila | Ecole Supérieure Nationale Des Mines De Saint-Etienne |
| Fleck, Julia | Ecole Supérieure Nationale Des Mines De Saint-Etienne |
| Phan, Raksmey | Ecole Supérieure Nationale Des Mines Saint-Étienne |
| Xie, Xiaolan | Ecole Supérieure Nationale Des Mines Saint-Étienne |
Keywords: Health Care Management, AI-Based Methods, Optimization and Optimal Control
Abstract: Risk stratification is the process of segmenting patients into distinct groups of similar complexity and care needs in order to improve resource allocation. Patients are typically risk stratified using statistical or machine learning methods that generate an individual risk score for some measure of resource use. One of the main limitations of existing methods is reduced interpretability, which is often inherent to artificial intelligence techniques. In this work, we propose a novel risk stratification approach that optimizes the representation of different patient groups and generates interpretable risk profiles. We associate risk scores with patient profiles and determine the optimal combination of representative profiles for each patient group using a Mixed Integer Programming (MIP) formulation. We generate continuous ratings for patient risk scores ranging from 0 to 1 that allow for dynamic thresholding. Our method stratifies patients into several risk groups (e.g., low, medium, high risk), which is frequently more clinically significant than binary classification. We apply our approach to both public and proprietary real data in the context of accidental fall risk assessment and show that the generated risk profiles provide clinical insights that can be used for the design of targeted interventions.
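A toy version of such a profile-selection MIP, written with the PuLP modeling library; the data, the cap of k open profiles, and the absolute-deviation objective are invented simplifications of the paper's formulation.

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum

def stratify(scores, profiles, k=3):
    """Pick at most k representative profiles and assign each patient to
    one, minimizing total |score - profile rating| (illustrative only)."""
    P, Q = range(len(scores)), range(len(profiles))
    use = {q: LpVariable(f"use_{q}", cat="Binary") for q in Q}
    assign = {(p, q): LpVariable(f"a_{p}_{q}", cat="Binary") for p in P for q in Q}
    prob = LpProblem("risk_stratification", LpMinimize)
    prob += lpSum(abs(scores[p] - profiles[q]) * assign[p, q] for p in P for q in Q)
    for p in P:
        prob += lpSum(assign[p, q] for q in Q) == 1   # one profile per patient
        for q in Q:
            prob += assign[p, q] <= use[q]            # only to opened profiles
    prob += lpSum(use[q] for q in Q) <= k             # at most k profiles
    prob.solve()
    return [q for q in Q if use[q].value() == 1]

print(stratify([0.1, 0.2, 0.8, 0.9], [0.15, 0.5, 0.85], k=2))
```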
|
| |
| 15:00-16:30, Paper WeI2I.405 | Add to My Program |
| Automated Generation of MDPs Using Logic Programming and LLMs for Robotic Applications |
|
| Saccon, Enrico | University of Trento |
| De Martini, Davide | Università Degli Studi Di Trento |
| Saveriano, Matteo | University of Trento |
| Lamon, Edoardo | University of Trento |
| Palopoli, Luigi | University of Trento |
| Roveri, Marco | University of Trento |
Keywords: Planning under Uncertainty, AI-Based Methods, Human-Robot Collaboration
Abstract: We present a novel framework that integrates Large Language Models (LLMs) with automated planning and formal verification to streamline the creation and use of Markov Decision Processes (MDPs). Our system leverages LLMs to extract structured knowledge in the form of a Prolog knowledge base from natural language (NL) descriptions. It then automatically constructs an MDP through reachability analysis, and synthesizes optimal policies using the Storm model checker. The resulting policy is exported as a state-action table for execution. We validate the framework in two human-robot interaction scenarios, demonstrating its ability to produce executable policies with minimal manual effort. This work highlights the potential of combining language models with formal methods to enable more accessible and scalable probabilistic planning in robotics.
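The exported state-action table can be pictured with plain value iteration over a small MDP. The paper synthesizes policies with the Storm model checker instead, so this sketch only mirrors the input/output contract, not the verified synthesis.

```python
def value_iteration(states, actions, T, R, gamma=0.95, eps=1e-6):
    """T[s][a] is a list of (prob, next_state); R(s, a, s2) is the reward.
    Returns a state-action table like the one exported for execution."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(sum(p * (R(s, a, s2) + gamma * V[s2]) for p, s2 in T[s][a])
                       for a in actions(s))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    return {s: max(actions(s), key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                                 for p, s2 in T[s][a]))
            for s in states}
```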
|
| |
| 15:00-16:30, Paper WeI2I.406 | Add to My Program |
| M3TR: A Generalist Model for Real-World HD Map Completion |
|
| Immel, Fabian | FZI Research Center for Information Technology |
| Schwarzkopf, Richard | Karlsruhe Institute of Technology |
| Bieder, Frank | Karlsruhe Institute of Technology |
| Pauls, Jan-Hendrik | Karlsruhe Institute of Technology (KIT) |
| Stiller, Christoph | Karlsruhe Institute of Technology |
Keywords: Intelligent Transportation Systems, Deep Learning for Visual Perception, Computer Vision for Transportation
Abstract: Autonomous vehicles rely on HD maps for their operation, but offline HD maps eventually become outdated. For this reason, online HD map construction methods use live sensor data to infer map information instead. Research on real map changes shows that oftentimes entire parts of an HD map remain unchanged and can be used as a prior. We therefore introduce M3TR (Multi-Masking Map Transformer), a generalist approach for HD map completion both with and without offline HD map priors. As a necessary foundation, we address shortcomings in ground truth labels for Argoverse 2 and nuScenes and propose the first comprehensive benchmark for HD map completion. Unlike existing models that specialize in a single kind of map change, which is unrealistic for deployment, our Generalist model handles all kinds of changes, matching the effectiveness of Expert models. By using map masking as an augmentation regime, we can even achieve a +1.4 mAP improvement without a prior. Finally, by fully utilizing prior HD map elements and optimizing query designs, M3TR outperforms existing methods by +4.3 mAP while being the first real-world deployable model for offline HD map priors.
|
| |
| 15:00-16:30, Paper WeI2I.407 | Add to My Program |
| Serving Innovation: Seamless Service by Advancing Food Runners with Mobile Manipulation |
|
| Yamsani, Sankalp | University of Illinois Urbana-Champaign |
| Gim, Kevin | University of Illinois, Urbana-Champaign |
| Smithline, Tyler | University of Michigan |
| Qiu, Richard | University of Illinois Urbana-Champaign |
| Mineyev, Roman | Georgia Institute of Technology |
| Hirashima, Kenta | University of Illinois Urbana-Champaign |
| Kang, Sungmin | University of Illinois Urbana Champaign |
| Park, Kyungseo | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
| Kang, Yoon-Koo | Yonsei University, HDHyundai Robotics Co., Ltd |
| An, Seulbi | Ulsan National Institute of Science and Technology (UNIST) |
| Ahn, SungHwan | Samsung Electronics |
| Kim, Joohyung | University of Illinois Urbana-Champaign |
Keywords: Service Robotics, Mobile Manipulation, Engineering for Robotic Systems
Abstract: The Mobile Object Manipulation Operator (MOMO) is an innovative and reconfigurable robotic system that transforms traditional serving robots into mobile manipulators. Leveraging the form factor and mobility of serving robots, MOMO integrates up to three pluggable devices, including 6-DoF manipulators of varying sizes and a 3-DoF sensor head. Its design incorporates two independent shoulder lifts to enhance vertical reach. This adaptability tailors the system's capabilities to tasks beyond simple object transportation. In contrast to current food delivery robots, MOMO showcases its ability to remove obstructions from the floor and deliver items to recipients without human intervention. This paper provides a comprehensive analysis of MOMO’s hardware and software components, emphasizing its modular design and adaptability for complex applications.
|
| |
| 15:00-16:30, Paper WeI2I.408 | Add to My Program |
| Ultra-Fast Lightweight Incipient Slip Detection Using Hyperdimensional Computing with the PapillArray Tactile Sensor |
|
| Zhang, Jingtao | Sun Yat-Sen University |
| Liu, Yi | Sun Yat-Sen University |
| Lu, Yanxun | Sun Yat-Sen University |
| Redmond, Stephen J. | University College Dublin |
| Wang, Changhong | Sun Yat-Sen University |
Keywords: Force and Tactile Sensing, Hardware-Software Integration in Robotics, Machine Learning for Robot Control
Abstract: Timely detection of incipient slip is critical for delicate robotic grasping and dexterous manipulation. However, existing learning-based methods suffer from detection latency and high computational demands. In this paper, we present an ultra-fast lightweight incipient slip detection framework based on hyperdimensional (HD) computing, using the PapillArray optical tactile sensor. Our approach introduces a novel graphical-spatial-temporal HD encoding scheme coupled with a context-driven training and inference strategy, achieving a high slip detection accuracy of 91.78% in offline evaluation. The resulting model is exceptionally compact and highly edge-compatible, with a size of only 0.375 kB. Furthermore, hardware acceleration on an FPGA enables inference within 0.42 microseconds, representing an over 10^4× speedup compared to optimized CPU implementations. Online robotic experiments involving grip-force control based on the proposed slip detection method further validate its practical effectiveness. This work offers a practical and scalable solution for real-time slip detection in robotic manipulation tasks.
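Hyperdimensional computing of this kind typically rests on three primitives: random bipolar hypervectors, binding, and bundling. The sketch below shows those primitives with an invented event encoding; it does not reproduce the paper's graphical-spatial-temporal scheme.

```python
import numpy as np

D = 10_000                      # hypervector dimensionality
rng = np.random.default_rng(0)

def random_hv():                # random bipolar hypervector
    return rng.choice([-1, 1], size=D)

def bind(a, b):                 # binding: elementwise multiplication
    return a * b

def bundle(hvs):                # bundling: majority vote (ties become 0 here)
    return np.sign(np.sum(hvs, axis=0))

def similarity(a, b):           # normalized dot product for classification
    return a @ b / D

# Encode a (sensor-pillar, time-step) event and compare it to a class
# prototype; the names and pairing are illustrative only.
pillar_hv, time_hv = random_hv(), random_hv()
event = bind(pillar_hv, time_hv)
slip_prototype = bundle([event, random_hv()])
print(similarity(event, slip_prototype))
```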
|
| |
| 15:00-16:30, Paper WeI2I.409 | Add to My Program |
| TakeAD: Preference-Based Post-Optimization for End-To-End Autonomous Driving with Expert Takeover Data |
|
| Liu, Deqing | Institute of Automation, Chinese Academy of Sciences |
| Gao, YinFeng | University of Science and Technology Beijing |
| Qian, Deheng | Chongqing Changan Automobile |
| Zhang, Qichao | Institute of Automation, Chinese Academy of Sciences |
| Ye, Xiaoqing | Baidu Inc. |
| Han, Junyu | Chongqing Chang'an Technology Co., Ltd |
| Zheng, Yupeng | School of Artificial Intelligence, University of Chinese Academy of Sciences |
| Liu, Xueyi | Institute of Automation, Chinese Academy of Science |
| Xia, Zhongpu | baidu |
| Ding, Dawei | University of Science and Technology Beijing |
| Pan, Yifeng | Chongqing Changan Automobile |
| Zhao, Dongbin | Chinese Academy of Sciences |
| |
| 15:00-16:30, Paper WeI2I.410 | Add to My Program |
| VISTA: Open-Vocabulary, Task-Relevant Robot Exploration with Online Semantic Gaussian Splatting |
|
| Nagami, Keiko | Stanford University |
| Chen, Timothy | Stanford University |
| Yu, Javier | Stanford University |
| Shorinwa, Ola | Stanford University |
| Adang, Maximilian | Stanford University |
| Dougherty, Carlyn | Lincoln Laboratory |
| Cristofalo, Eric | MIT Lincoln Laboratory |
| Schwager, Mac | Stanford University |
Keywords: Semantic Scene Understanding, Task and Motion Planning, Mapping
Abstract: We present VISTA (Viewpoint-based Image selection with Semantic Task Awareness), an active exploration method for robots to plan informative trajectories that improve 3D map quality in areas most relevant for task completion. Given an open-vocabulary search instruction (e.g., "find a person"), VISTA enables a robot to explore its environment to search for the object of interest, while simultaneously building a real-time semantic 3D Gaussian Splatting reconstruction of the scene. The robot navigates its environment by planning receding-horizon trajectories that prioritize semantic similarity to the query and exploration of unseen regions of the environment. To evaluate trajectories, VISTA introduces a novel, efficient viewpoint-semantic coverage metric that quantifies both the geometric view diversity and task relevance in the 3D scene. On static datasets, our coverage metric outperforms state-of-the-art baselines, FisherRF and Bayes' Rays, in computation speed and reconstruction quality. In quadrotor hardware experiments, VISTA achieves 6x higher success rates in challenging maps, compared to baseline methods, while matching baseline performance in less challenging maps. Lastly, we show that VISTA is platform-agnostic by deploying it on a quadrotor drone and a Spot quadruped robot. Code and videos can be found on our project page: https://stanfordmsl.github.io/VISTA/.
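One plausible reading of a viewpoint-semantic coverage score, combining per-Gaussian view-direction novelty with relevance to the query embedding; this is a loose sketch under assumed array shapes, not the paper's metric.

```python
import numpy as np

def coverage_gain(seen_dirs, semantics, view_dir, query_emb):
    """seen_dirs: (N, K, 3) unit directions each Gaussian was seen from;
    semantics: (N, d) per-Gaussian features; query_emb: (d,) query.
    Score = semantic relevance weighted by view-direction novelty."""
    novelty = 1.0 - np.max(seen_dirs @ view_dir, axis=1)   # ~0 if already seen
    relevance = np.clip(semantics @ query_emb, 0.0, None)  # task relevance
    return float(np.sum(relevance * novelty))
```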
|
| |
| 15:00-16:30, Paper WeI2I.411 | Add to My Program |
| Co-NavGPT: Multi-Robot Cooperative Visual Semantic Navigation Using Vision Language Models |
|
| Yu, Bangguo | University of Groningen |
| Yuan, Qihao | University of Groningen |
| Li, Kailai | University of Groningen |
| Kasaei, Hamidreza | University of Groningen |
| Cao, Ming | University of Groningen |
Keywords: Vision-Based Navigation, Multi-Robot Systems, AI-Enabled Robotics
Abstract: Visual target navigation is a critical capability for autonomous robots operating in unknown environments, particularly in human-robot interaction scenarios. While classical and learning-based methods have shown promise, most existing approaches lack common-sense reasoning and are typically designed for single-robot settings, leading to reduced efficiency and robustness in complex environments. To address these limitations, we introduce Co-NavGPT, a novel framework that integrates a Vision Language Model (VLM) as a global planner to enable common-sense multi-robot visual target navigation. Co-NavGPT aggregates sub-maps from multiple robots with diverse viewpoints into a unified global map, encoding robot states and frontier regions. The VLM uses this information to assign frontiers across the robots, facilitating coordinated and efficient exploration. Experiments on the Habitat-Matterport 3D (HM3D) dataset demonstrate that Co-NavGPT outperforms existing baselines in terms of success rate and navigation efficiency, without requiring task-specific training. Ablation studies further confirm the importance of semantic priors from the VLM. We also validate the framework in real-world scenarios using quadrupedal robots. Supplementary video and code are available at: https://sites.google.com/view/co-navgpt2-submission.
|
| |
| 15:00-16:30, Paper WeI2I.412 | Add to My Program |
| RefDiffMap: Diffusion-Guided Progressive Refinement for Vectorized HD Map Construction |
|
| Gao, Wenjie | Xi'an Jiaotong University |
| Chang, Entao | Xi'an Jiaotong University |
| Fu, Jiawei | The Chinese University of Hong Kong |
| Zhu, Ziyu | Xi'an Jiaotong University |
| Chen, Shitao | Xi'an Jiaotong University |
| Zheng, Nanning | Xi'an Jiaotong University |
Keywords: Semantic Scene Understanding, Mapping, Computer Vision for Transportation
Abstract: High-definition (HD) map learning serves as an essential component of autonomous driving scene understanding, providing structured priors for planning and prediction. Recent transformer-based methods regress vectorized map elements via deformable attention over Bird’s-Eye View (BEV) features. They typically employ a single-pass paradigm, starting from a set of initial queries. However, these queries struggle to precisely localize map elements within the large-scale BEV space. This difficulty is severely amplified when using lightweight backbones that produce less distinctive features. To address this, we propose RefDiffMap, which recasts map construction as a progressive refinement process driven by a diffusion model. We introduce a novel denoising query generator that, at each step, leverages the intermediate noisy geometry to sample relevant features from adaptive BEV RoIs. These features are distilled into context-aware queries that guide the decoder's next refinement. This creates a powerful geometry-feature co-evolution loop, allowing the model to iteratively correct localization errors. Comprehensive experiments show that RefDiffMap achieves competitive performance on the nuScenes and Argoverse 2 datasets. Notably, its robustness is highlighted with a ResNet-18 backbone, where it improves mAP by a significant 11.3% over our baseline MapTRv2. Further ablation studies validate the effectiveness of our approach.
|
| |
| 15:00-16:30, Paper WeI2I.413 | Add to My Program |
| Automated Straight-Line Sewing of Stretchable Fabrics with Different Lengths |
|
| Jin, Bingchen | The University of Hong Kong |
| Kobayashi, Akinari | Centre for Transformative Garment Production |
| Bhattacharya, Dipankar | Imperial College of Science, Technology and Medicine |
| Seino, Akira | Centre for Transformative Garment Production |
| Tokuda, Fuyuki | Tohoku University |
| Tien, Norman | University of Hong Kong |
| Kosuge, Kazuhiro | The University of Hong Kong |
Keywords: Grippers and Other End-Effectors, Mechanism Design, Methods and Tools for Robot System Design
Abstract: Different Length Alignment Sewing (DLAS), which involves stretching the shorter fabric to match the longer one and sewing them together in a straight line, is a challenging task that needs to satisfy several requirements when automating the sewing process. To address the challenges, this research proposes a novel robotic sewing system, Different Length Robotic Sewing System (DLRoSS), which consists of a roller-type end-effector attached to a 6-DoF manipulator. The end-effector is composed of active shorter- and longer-fabric rollers, and a passive press roller attached to the shorter-fabric roller. Assuming that one end of each of the two fabric layers is initially positioned under the sewing machine’s presser foot, the system automates DLAS by operating in four distinct phases. (P1) Fabric wrapping: Individual fabric layers are picked, held, and wrapped from the other end onto the feed rollers. (P2) Sewing: During sewing, the shorter fabric is stretched and aligned with the longer fabric in real time using roller velocity control based on the sewing speed and the a priori known length ratio. (P3) Sewing completion: In the final sewing round on the fabric rollers, the press roller is engaged to prevent the stretched fabric from slipping off due to internal tension. (P4) Sewing fabric release: At the end of sewing, the fabric edge moves past the press roller, and the fabric releases from the rollers. Experimental results demonstrate that DLRoSS achieves consistent, high-quality sewing of stretchable fabrics of different materials and lengths.
|
| |
| 15:00-16:30, Paper WeI2I.414 | Add to My Program |
| Privacy-Preserving Robotic Perception for Object Detection in Curious Cloud Robotics |
|
| Antonazzi, Michele | KTH Royal Institute of Technology |
| Alberti, Matteo | University of Milan |
| Bassot, Alex | University of Milan |
| Luperto, Matteo | Università Degli Studi Di Milano |
| Basilico, Nicola | University of Milan |
Keywords: Deep Learning in Robotics and Automation, Object Detection, Segmentation and Categorization, Service Robots, Privacy for Robotic Applications
Abstract: Cloud robotics allows low-power robots to perform computationally intensive inference tasks by offloading them to the cloud, raising privacy concerns when transmitting sensitive images. Although end-to-end encryption secures data in transit, it does not prevent misuse by inquisitive third-party services since data must be decrypted for processing. This paper tackles these privacy issues in cloud-based object detection tasks for service robots. We propose a co-trained encoder-decoder architecture that retains only task-specific features while obfuscating sensitive information, utilizing a novel weak loss mechanism with proposal selection for privacy preservation. A theoretical analysis of the problem is provided, along with an evaluation of the trade-off between detection accuracy and privacy preservation through extensive experiments on public datasets and a real robot.
|
| |
| 15:00-16:30, Paper WeI2I.415 | Add to My Program |
| Dilated Superpixel Aggregation for Visual Place Recognition |
|
| Zeng, Zichao | University College London |
| Goo, June Moh | University College London |
| Boehm, Jan | University College London |
Keywords: Localization, Recognition, Vision-Based Navigation
Abstract: Visual Place Recognition (VPR) is a fundamental task in robotics and computer vision, enabling systems to identify locations seen in the past using visual information. The previous state-of-the-art approach, SegVLAD, focuses on encoding and retrieving semantically meaningful supersegment representations of images to significantly enhance recognition recall rates. However, we find that it struggles to cope with significant variations in viewpoint and scale, as well as scenes with sparse or limited information. Furthermore, these semantic-driven supersegment representations often exclude semantically meaningless yet valuable pixel information. In this work, we present Sel-V and MuSSel-V, two efficient variants within the segment-level VPR paradigm that replace heavy and fragmented supersegments with lightweight, visually compact and complete dilated superpixels for local feature aggregation. The use of superpixels preserves pixel-level details while reducing computational overhead. A multi-scale extension further enhances robustness to viewpoint and scale changes. Comprehensive experiments on twelve public benchmarks show that our approach achieves a better trade-off between accuracy and efficiency than existing segment-based methods. These results demonstrate that lightweight, non-semantic segmentation can serve as an effective alternative for high-performance, resource-efficient visual place recognition in robotics.
|
| |
| 15:00-16:30, Paper WeI2I.416 | Add to My Program |
| SonarSplat: Novel View Synthesis of Imaging Sonar Via Gaussian Splatting |
|
| Venkatramanan Sethuraman, Advaith | University of Michigan |
| Rucker, Max | University of Michigan |
| Bagoren, Onur | University of Michigan |
| Kung, Pou-Chun | University of Michigan, Ann Arbor |
| Naresh Babu Amutha, Nibarkavi | University of Michigan, Ann Arbor |
| Skinner, Katherine | University of Michigan |
Keywords: Marine Robotics, Mapping, Deep Learning for Visual Perception
Abstract: In this paper, we present SonarSplat, a novel Gaussian splatting framework for imaging sonar that demonstrates realistic novel view synthesis and models acoustic streaking phenomena. Our method represents the scene as a set of 3D Gaussians with acoustic reflectance and saturation properties. We develop a novel method to efficiently rasterize Gaussians to produce a range/azimuth image that is faithful to the acoustic image formation model of imaging sonar. In particular, we develop a novel approach to model azimuth streaking in a Gaussian splatting framework. We evaluate SonarSplat using real-world datasets of sonar images collected from an underwater robotic platform in a controlled test tank and in a real-world river environment. Compared to the state-of-the-art, SonarSplat offers improved image synthesis capabilities (+3.2 dB PSNR) and more accurate 3D reconstruction (77% lower Chamfer Distance). We also demonstrate that SonarSplat can be leveraged for azimuth streak removal.
|
| |
| 15:00-16:30, Paper WeI2I.417 | Add to My Program |
| FilMBot: A High-Speed Soft Parallel Robotic Micromanipulator |
|
| Yu, Jiangkun | Aalto University |
| Bettahar, Houari | Aalto University |
| Kandemir, Hakan | VTT Technical Research Centre of Finland |
| Zhou, Quan | Aalto University |
Keywords: Micro/Nano Robots, Soft Robot Materials and Design, Parallel Robots, Micromanipulators
Abstract: Soft robotic manipulators are generally slow despite their great adaptability, resilience, and compliance. This limitation also extends to current soft robotic micromanipulators. Here, we introduce FilMBot, a 3-DOF film-based, electromagnetically actuated, soft kinematic robotic micromanipulator achieving speeds up to 2117°/s and 2456°/s in α and β angular motions, with corresponding linear velocities of 1.61 m/s and 1.92 m/s using a 4-cm needle end-effector, 0.54 m/s along the Z-axis, and 1.57 m/s during Z-axis morph switching. The robot can reach ∼1.50 m/s in path-following tasks, with an operational bandwidth below ∼30 Hz, and remains responsive at 50 Hz. It demonstrates high precision (∼6.3 μm, or ∼0.05% of its workspace) in path-following tasks, with precision remaining largely stable across frequencies. The novel combination of the low-stiffness soft kinematic film structure and strong electromagnetic actuation in FilMBot opens new avenues for soft robotics. Furthermore, its simple construction and inexpensive, readily accessible components could broaden the application of micromanipulators beyond current academic and professional users.
|
| |
| 15:00-16:30, Paper WeI2I.418 | Add to My Program |
| A Universal Framework for Extrinsic Calibration of Camera, Radar, and LiDAR |
|
| Hu, Sijie | University Paris-Saclay |
| Goldwurm, Alessandro | LAAS-CNRS |
| Mujica, Martin | LAAS-CNRS - University Paul Sabatier |
| Cadou, Sylvain | MANITOU Group |
| Lerasle, Frederic | LAAS - CNRS, University Paul Sabatier |
| |
| 15:00-16:30, Paper WeI2I.419 | Add to My Program |
| Barrier Method for Inequality Constrained Factor Graph Optimization with Application to Model Predictive Control |
|
| Abdelkarim, Anas | University of Luxembourg, RPTU Kaiserslautern–Landau |
| Görges, Daniel | RPTU Kaiserslautern–Landau |
| Voos, Holger | University of Luxembourg |
Keywords: SLAM, Optimization and Optimal Control
Abstract: Factor graphs have demonstrated remarkable efficiency for robotic perception tasks, particularly in localization and mapping applications. However, their application to optimal control problems---especially Model Predictive Control (MPC)---has remained limited due to fundamental challenges in constraint handling. This paper presents a novel integration of the Barrier Interior Point Method (BIPM) with factor graphs, implemented as an open-source extension to the widely adopted g2o framework. Our approach introduces specialized inequality factor nodes that encode logarithmic barrier functions, thereby overcoming the quadratic-form limitations of conventional factor graph formulations. To the best of our knowledge, this is the first g2o-based implementation capable of efficiently handling such inequality constraints within a unified optimization backend. We validate the method through a multi-objective adaptive cruise control application for autonomous vehicles. Benchmark comparisons with state-of-the-art constraint-handling techniques demonstrate faster convergence and improved computational efficiency. (Code repository: https://github.com/snt-arg/bipm_g2o)
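Conceptually, each inequality factor contributes a logarithmic barrier term to the graph's objective. A scalar sketch, independent of g2o and with invented factor functions:

```python
import numpy as np

def barrier_cost(x, quadratic_factors, ineq_factors, t=10.0):
    """Objective for one BIPM outer iteration: the usual squared factor
    costs plus -(1/t) * log(-g(x)) for each inequality factor g(x) <= 0."""
    cost = sum(f(x) ** 2 for f in quadratic_factors)
    for g in ineq_factors:
        margin = -g(x)
        if margin <= 0:
            return np.inf           # infeasible point, barrier is infinite
        cost -= np.log(margin) / t  # barrier pushes x away from the boundary
    return cost

# Example: track v_ref = 3 while respecting a speed limit v <= 2.
track = lambda v: v - 3.0
speed_limit = lambda v: v - 2.0
vs = np.linspace(0.0, 1.99, 500)
print(vs[np.argmin([barrier_cost(v, [track], [speed_limit]) for v in vs])])
```

As `t` grows across outer iterations, the minimizer approaches the constrained optimum at the speed limit.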
|
| |
| 15:00-16:30, Paper WeI2I.420 | Add to My Program |
| Transferring Policy of Offline Reinforcement Learning from Hybrid Dataset to Real World Via Progressive Neural Network |
|
| Zhao, Pengyu | Ocean University of China |
| Fang, Zheng | Ocean University of China |
| Ai, Tongxu | Ocean University of China |
| Nichols, Eric | Honda Research Institute Japan |
| Gomez, Randy | Honda Research Institute Japan Co., Ltd |
| He, Bo | Ocean University of China |
| Li, Guangliang | Ocean University of China |
Keywords: Reinforcement Learning, Transfer Learning, Motion Control
Abstract: Offline reinforcement learning (Offline RL) provides a compelling solution for applying RL in high-risk or resource-constrained real-world domains such as healthcare, autonomous driving, and robotic manipulation. However, Offline RL faces critical challenges arising from limited data coverage and potential distributional mismatch between the pre-training dataset and real-world environment. In this paper, we propose to allow an agent to learn from a hybrid dataset: high-quality real-world data and high-diversity simulation data, and assume that the dynamics of the simulation and the real world do not match, but the state space is the same. To address the policy extrapolation error and potentially catastrophic failures because of out-of-distribution actions and sim-to-real gap, we use progressive neural networks (PNNs) to transfer the offline policy to the real world. Results in two robotic manipulation tasks with a six-degree-of-freedom Ned robotic arm show that the hybrid dataset facilitates faster offline learning and better adaptation to real-world tasks during online learning. In addition, further analysis shows that transferring the offline policy via PNN can not only effectively retain the policy learned from the hybrid dataset and bridge the gap between simulation and reality data, but also allow the agent to explore in a more diverse distribution of samples during online learning.
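A minimal two-layer progressive-network column in PyTorch: the offline column stays frozen, and the online column receives its hidden features through a lateral adapter. The attribute name `l1` on the frozen column is an assumption made for this sketch.

```python
import torch
import torch.nn as nn

class PNNColumn(nn.Module):
    """New (online) column of a progressive net; `frozen` is the column
    trained offline on the hybrid dataset."""
    def __init__(self, frozen, in_dim=10, hid=64, out_dim=4):
        super().__init__()
        self.frozen = frozen
        for p in self.frozen.parameters():
            p.requires_grad_(False)              # offline policy retained intact
        self.l1 = nn.Linear(in_dim, hid)
        self.lateral = nn.Linear(hid, hid, bias=False)   # lateral connection
        self.l2 = nn.Linear(hid, out_dim)

    def forward(self, x):
        with torch.no_grad():
            h_prev = torch.relu(self.frozen.l1(x))       # frozen-column features
        h = torch.relu(self.l1(x) + self.lateral(h_prev))
        return self.l2(h)
```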
|
| |
| 15:00-16:30, Paper WeI2I.421 | Add to My Program |
| Real-Time Millimeter-Accurate Underwater Pose Estimation Via Tightly-Coupled Fusion of Vision and Optical Tracking |
|
| Gao, Yuer | The Hong Kong University of Science and Technology (Guangzhou) |
| Xu, Tongqing | The Hong Kong University of Science and Technology (Guangzhou) |
| Cai, Yi | The Hong Kong University of Science and Technology (Guangzhou) |
| |
| 15:00-16:30, Paper WeI2I.422 | Add to My Program |
| Lagrangian Neural Network-Based Control: Improving Robotic Trajectory Tracking Via Linearized Feedback |
|
| Weiss, Manuel | Berlin University of Applied Sciences And Technology |
| Pawluchin, Alexander | Berliner Hochschule Für Technik |
| Ewering, Jan-Hendrik | Leibniz Universität Hannover |
| Seel, Thomas | Leibniz Universität Hannover |
| Boblan, Ivo | Berliner Hochschule Fuer Technik |
Keywords: Machine Learning for Robot Control, Model Learning for Control, Motion Control
Abstract: This paper introduces a control framework that leverages Lagrangian neural networks (LNNs) for computed torque control (CTC) of robotic systems with unknown dynamics. Unlike prior LNN-based controllers that are placed outside the feedback-linearization framework (e.g., feedforward), we embed an LNN inverse-dynamics model within a CTC loop, thereby shaping the closed-loop error dynamics. This strategy, referred to as LNN-CTC, ensures a physically consistent model and improves extrapolation, requiring neither prior model knowledge nor extensive training data. The approach is experimentally validated on a robotic arm with four degrees of freedom and compared with conventional model-based CTC, physics-informed neural network (PINN)-CTC, deep neural network (DNN)-CTC, an LNN-based feedforward controller, and a PID controller. Results demonstrate that LNN-CTC significantly outperforms model-based baselines by up to 30% in tracking accuracy, achieving high performance with minimal training data. In addition, LNN-CTC outperforms all other evaluated baselines in both tracking accuracy and data efficiency, attaining lower joint-space RMSE for the same training data. The findings highlight the potential of physics-informed neural architectures to generalize robustly across various operating conditions and contribute to narrowing the performance gap between learned and classical control strategies.
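The underlying computed torque law is standard: with a learned mass matrix M_hat and bias term h_hat (abstract callables here; in the paper, outputs of the trained LNN), the command is tau = M_hat(q)(qdd_des + Kd*ed + Kp*e) + h_hat(q, qd).

```python
import numpy as np

def ctc_torque(q, qd, q_des, qd_des, qdd_des, M_hat, h_hat, Kp=100.0, Kd=20.0):
    """Computed torque control with a learned inverse-dynamics model;
    M_hat(q) returns the mass matrix, h_hat(q, qd) the Coriolis/gravity
    terms (both assumed callables standing in for the LNN)."""
    e, e_d = q_des - q, qd_des - qd
    v = qdd_des + Kd * e_d + Kp * e      # feedback-linearizing reference accel
    return M_hat(q) @ v + h_hat(q, qd)
```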
|
| |
| 15:00-16:30, Paper WeI2I.423 | Add to My Program |
| Vision-Guided Outdoor Flight and Obstacle Evasion Via Reinforcement Learning |
|
| Dutta, Shiladitya | University of California, Berkeley |
| Gupta, Aayush | University of California, Berkeley |
| Saran, Varun | University of California, Berkeley |
| Zakhor, Avideh | University of California, Berkeley |
Keywords: Aerial Systems: Perception and Autonomy, Reinforcement Learning, Vision-Based Navigation
Abstract: Although quadcopters boast impressive traversal capabilities enabled by their omnidirectional maneuverability, the need for continuous pilot control in complex environments impedes their application in GNSS and telemetry-denied scenarios. To this end, we propose a novel sensorimotor policy that uses stereo-vision depth and visual-inertial odometry (VIO) to autonomously navigate through obstacles in an unknown environment to reach a goal point. The policy comprises a pre-trained autoencoder as the perception head, followed by a planning and control LSTM network which outputs velocity commands that can be followed by an off-the-shelf commercial drone. We leverage reinforcement and privileged learning paradigms to train the policy in simulation through a two-stage process: 1) initial training with optimal trajectories generated by a global motion planner acting as a supervisory backbone, 2) further fine-tuning in a curriculum environment. To bridge the sim-to-real gap, we employ domain randomization and reward shaping to create a policy that is both robust to noise and domain shift. In outdoor experiments, our approach achieves successful zero-shot transfer to both obstacle environments and a drone platform that were never encountered during training.
|
| |
| 15:00-16:30, Paper WeI2I.424 | Add to My Program |
| High-Quality Sparse-View Gaussian Splatting without Ground-Truth Camera Poses |
|
| Lim, Chun Her | Zhejiang University |
| Guo, Yingnan | Zhejiang University |
| Yang, Wen | Zhejiang University |
| Zhang, Yu | Zhejiang University |
Keywords: Deep Learning for Visual Perception, Visual Learning, Autonomous Vehicle Navigation
Abstract: The existing methods for novel view synthesis depend on dense input images and accurate camera poses, which significantly limits their practical application. We propose a novel framework that enables high-quality sparse-view reconstruction via 3D Gaussian Splatting (3DGS) without knowing camera poses. Our approach leverages MASt3R, a ViT-based multi-view stereo prior, to generate point clouds and coarse camera poses from uncalibrated sparse images. We use the point clouds to initialize the 3DGS. Additionally, we propose several regularization techniques, including point-rendered LPIPS regularization, geometric regularization (local depth regularization and normal regularization), and semantic regularization, to improve the quality of reconstructed scenes and enhance the generalization capability of the model to unseen viewpoints. Due to the inaccuracies in the camera poses output by MASt3R, we optimize the camera poses during both the training and testing phases. Experimental results on the Tanks and Temples and MVImgNet datasets demonstrate that our method outperforms state-of-the-art techniques in novel view synthesis and camera pose estimation under sparse-view settings. Our approach achieves higher fidelity and more photorealistic visual effects.
|
| |
| 15:00-16:30, Paper WeI2I.425 | Add to My Program |
| Vertical-Plane Locomotion Control of a High-Speed Robotic Tuna Via NMPC (I) |
|
| Tong, Ru | Peking University |
| Li, Sijie | Institute of Automation, Chinese Academy of Sciences |
| Chen, Di | Institute of Automation, Chinese Academy of Sciences |
| Wu, Zhengxing | Institute of Automation, Chinese Academy of Sciences |
| Yu, Junzhi | Peking University |
Keywords: Biologically-Inspired Robots, Motion Control, Integrated Planning and Control
Abstract: The development of bionic underwater robots has brought new vitality to ocean exploration. Motion control is crucial for the stability of underwater robots due to significant differences in flow field characteristics at various swimming speeds. This study focuses on vertical-plane motion and proposes a model predictive control method to achieve integrated control of depth position and pitch attitude for bionic robotic fish. First, based on a robotic tuna system, high-maneuverability vertical-plane motion configuration elements are analyzed and summarized, laying the foundation for motion stability and controllability. Second, through hydrodynamic sampling in aquatic environments, a system model covering the range of swimming speeds is established. Regarding the control method, the proposed motion planning approach converts the desired motion sequence into an equivalent “pitch-depth” trajectory curve. A nonlinear model predictive controller (NMPC) is then designed to track the trajectory curve, ultimately achieving the desired vertical-plane motion. Experimental results validate that the proposed method not only ensures control accuracy under both low and high-speed conditions, but also enables the execution of complex motion sequence control. This study provides a fresh perspective on the motion instability analysis of robotic fish at high swimming speed and a novel control framework for regulating continuous posture sequences in the vertical plane.
|
| |
| 15:00-16:30, Paper WeI2I.426 | Add to My Program |
| Whole-Body Inverse Dynamics MPC for Legged Loco-Manipulation |
|
| Molnar, Lukas | ETH Zurich |
| Cheng, Jin | ETH Zurich |
| Fadini, Gabriele | ZHAW |
| Kang, Dongho | Robotics and AI Institute |
| Zargarbashi, Fatemeh | ETH Zurich |
| Coros, Stelian | ETH Zurich |
Keywords: Legged Robots, Mobile Manipulation, Whole-Body Motion Planning and Control
Abstract: Loco-manipulation demands coordinated whole-body motion to manipulate objects effectively while maintaining locomotion stability, presenting significant challenges for both planning and control. In this work, we propose a whole-body model predictive control (MPC) framework that directly optimizes joint torques through full-order inverse dynamics, enabling unified motion and force planning and execution within a single predictive layer. This approach allows emergent, physically consistent whole-body behaviors that account for the system’s dynamics and physical constraints. We implement our MPC formulation using open software frameworks (Pinocchio and CasADi), along with the state-of-the-art interior-point solver Fatrop. In real-world experiments on a Unitree B2 quadruped equipped with a Unitree Z1 manipulator arm, our MPC formulation achieves real-time performance at 80 Hz. We demonstrate loco-manipulation tasks that demand fine control over the end-effector’s position and force to perform real-world interactions like pulling heavy loads, pushing boxes, and wiping whiteboards.
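A toy CasADi transcription in the same spirit: a double integrator and the Ipopt solver stand in for the full-order whole-body dynamics (Pinocchio) and the Fatrop solver used in the paper.

```python
import casadi as ca

N, dt = 20, 0.05
x = ca.SX.sym("x", 2)                    # [position, velocity]
u = ca.SX.sym("u")                       # input force/torque
f = ca.Function("f", [x, u], [ca.vertcat(x[1], u)])  # toy dynamics

X = ca.SX.sym("X", 2, N + 1)             # state trajectory (decision variables)
U = ca.SX.sym("U", 1, N)                 # input trajectory (decision variables)
cost, g = 0, []
for k in range(N):
    cost += ca.sumsqr(X[:, k] - ca.vertcat(1.0, 0.0)) + 1e-2 * ca.sumsqr(U[:, k])
    g.append(X[:, k + 1] - (X[:, k] + dt * f(X[:, k], U[:, k])))  # Euler defect
g.append(X[:, 0])                         # pin the initial state to the origin
nlp = {"x": ca.veccat(X, U), "f": cost, "g": ca.vertcat(*g)}
solver = ca.nlpsol("solver", "ipopt", nlp)
sol = solver(lbg=0, ubg=0)                # all constraints held as equalities
print(float(sol["f"]))
```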
|
| |
| 15:00-16:30, Paper WeI2I.427 | Add to My Program |
| FAGR: Feature-Action Generative Replay for Robot Lifelong Imitation Learning |
|
| Yang, Yushi | Institute of Automation,Chinese Academy of Sciences |
| Xiangli, Nie | Institute of Automation, Chinese Academy of Sciences |
| Liu, Chang | Peking University |
| |
| 15:00-16:30, Paper WeI2I.428 | Add to My Program |
| Estimating Deformable-Rigid Contact Interactions for a Deformable Tool Via Learning and Model-Based Optimization |
|
| Van der Merwe, Mark | University of Michigan |
| Oller, Miquel | University of Michigan |
| Berenson, Dmitry | University of Michigan |
| Fazeli, Nima | University of Michigan |
Keywords: Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation, Dexterous Manipulation
Abstract: Dexterous manipulation requires careful reasoning over extrinsic contacts. The prevalence of deforming tools in human environments, the use of deformable sensors, and the increasing number of soft robots yields a need for approaches that enable dexterous manipulation through contact reasoning where not all contacts are well characterized by classical rigid body contact models. Here, we consider the case of a deforming tool dexterously manipulating a rigid object. We propose a hybrid learning and first-principles approach to the modeling of simultaneous motion and force transfer of tools and objects. The learned module is responsible for jointly estimating the rigid object's motion and the deformable tool's imparted contact forces. We then propose a Contact Quadratic Program to recover forces between the environment and object subject to quasi-static equilibrium and Coulomb friction. The result is a system capable of modeling both intrinsic and extrinsic motions, contacts, and forces during dexterous deformable manipulation. We train our method in simulation and show that it outperforms baselines under varying block geometries and physical properties, during pushing and pivoting manipulations, and demonstrate transfer to real world interactions.
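The Contact Quadratic Program can be sketched with CVXPY: minimize the quasi-static wrench residual over per-contact forces, subject to non-negative normal forces and a Coulomb friction cone. The planar shapes, contact Jacobians, and friction coefficient below are illustrative.

```python
import numpy as np
import cvxpy as cp

def contact_qp(J_list, w_ext, mu=0.6):
    """Solve for per-contact forces f = [f_n, f_t] that balance the
    external wrench, with f_n >= 0 and |f_t| <= mu * f_n."""
    fs = [cp.Variable(2) for _ in J_list]
    residual = w_ext + sum(J @ f for J, f in zip(J_list, fs))
    cons = [c for f in fs for c in (f[0] >= 0, cp.abs(f[1]) <= mu * f[0])]
    cp.Problem(cp.Minimize(cp.sum_squares(residual)), cons).solve()
    return [f.value for f in fs]

# Two planar contacts balancing gravity on the object (made-up Jacobians).
J = [np.array([[0.0, 1.0], [1.0, 0.0], [0.1, 0.0]]),
     np.array([[0.0, -1.0], [1.0, 0.0], [-0.1, 0.0]])]
print(contact_qp(J, w_ext=np.array([0.0, -9.8, 0.0])))
```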
|
| |
| 15:00-16:30, Paper WeI2I.429 | Add to My Program |
| Human-In-The-Loop Gaussian Splatting for Robotic Teleoperation |
|
| Lee, Yongseok | DGIST |
| Kim, Hyunsu | Seoul National University |
| Ji, Harim | Seoul National University |
| Heo, Jinuk | Seoul National University |
| Lee, Youngseon | Seoul National University |
| Kang, Jiseock | Seoul National University |
| Lee, Jeongseob | Seoul National University |
| Lee, Dongjun | Seoul National University |
Keywords: Human Factors and Human-in-the-Loop, Telerobotics and Teleoperation, Mapping
Abstract: Safe, precise teleoperation demands a third-person 3D view that reveals collision clearances and task-critical geometry in full detail. Yet most systems still rely on live camera streams that offer tunnel-vision perspectives and weak depth cues, hiding hazards and denying operators the spatial context for precise manipulation. 3D Gaussian Splatting (GS) enables real-time photorealistic streaming, but acquiring the required multi-view imagery safely and efficiently remains a critical bottleneck in cluttered teleoperation environments. We propose Human-in-the-Loop Gaussian Splatting (HIL-GS) that delivers safe, robust, and efficient 3D scene reconstruction for challenging teleoperation environments. HIL-GS combines three modules in a tightly-coupled loop: (1) motion-aware GS reconstruction that fuses RGB-D and proprioceptive sensors for drift-free and robust mapping under aggressive motions; (2) VR-based informative display that renders the GS map with contextual overlays/feedback in real time to ensure situational awareness and reconstruction completeness; and (3) finger-based control interface to guide the robot toward informative viewpoints through safe, non-redundant motions. Through simulation, real-world experiments, and a user study, we demonstrate that HIL-GS outperforms traditional approaches in reconstruction quality, usability, and efficiency.
|
| |
| 15:00-16:30, Paper WeI2I.430 | Add to My Program |
| F-RRT: An Efficient Algorithm for Semi-Constrained Path Planning Problems |
|
| de Mathelin de Papigny, Guillaume | TU Wien |
| Gassibe, Franco Ivan | Aerospline |
| Padois, Vincent | Inria Bordeaux |
Keywords: Constrained Motion Planning, Motion and Path Planning, Industrial Robots
Abstract: This letter addresses the challenging problem of Semi-Constrained End-Effector Path Planning for robotic manipulators. This problem arises when complex specifications restrict the end-effector’s motion during the execution of industrial tasks. Traditional path planning algorithms often struggle with such problems due to the difficulty of exploring the robot's valid configuration space, or constrained manifold, under these conditions. In this work, we propose a novel sampling-based approach that efficiently navigates the constrained manifold by exploring an alternative space representing the end-effector’s degrees of freedom, such as process-related tolerances, throughout the task. This method retains the simplicity of sampling-based techniques. Building on this approach, we introduce the F-RRT algorithm, an adaptation of the renowned RRT planner (LaValle and Kuffner, 2001). F-RRT demonstrates enhanced speed and robustness compared to existing solutions, particularly in complex and cluttered environments.
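For reference, the RRT skeleton that F-RRT builds on, written over an abstract sample/steer/validity interface; the paper's contribution (exploring the end-effector tolerance space rather than the full configuration space) lives inside those callables and is not reproduced here.

```python
import numpy as np

def rrt(sample, steer, valid, start, goal, iters=2000, eps=0.1):
    """sample() draws a random point, steer(a, b) takes a bounded step
    from a toward b, valid(a, b) checks the connecting segment."""
    nodes, parent = [np.asarray(start, dtype=float)], {0: None}
    goal = np.asarray(goal, dtype=float)
    for _ in range(iters):
        x_rand = sample()
        i = min(range(len(nodes)),
                key=lambda j: np.linalg.norm(nodes[j] - x_rand))  # nearest node
        x_new = steer(nodes[i], x_rand)
        if valid(nodes[i], x_new):
            nodes.append(x_new)
            parent[len(nodes) - 1] = i
            if np.linalg.norm(x_new - goal) < eps:   # goal region reached
                path, k = [], len(nodes) - 1
                while k is not None:
                    path.append(nodes[k])
                    k = parent[k]
                return path[::-1]
    return None
```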
|
| |
| 15:00-16:30, Paper WeI2I.431 | Add to My Program |
| M3Bench: Benchmarking Whole-Body Motion Generation for Mobile Manipulation in 3D Scenes |
|
| Zhang, Zeyu | Beijing Institute for General Artificial Intelligence |
| Yan, Sixu | Huazhong University of Science and Technology |
| Han, Muzhi | Hillbot, Inc |
| Wang, Zaijin | Beijing Institute for General Artificial Intelligence (BIGAI) |
| Wang, Xinggang | Huazhong University of Science and Technology |
| Zhu, Song-Chun | UCLA |
| Liu, Hangxin | Beijing Institute for General Artificial Intelligence (BIGAI) |
Keywords: Data Sets for Robot Learning, Simulation and Animation, AI-Based Methods
Abstract: We propose M3Bench, a new benchmark for whole-body motion generation in mobile manipulation tasks. Given a 3D scene context, M3Bench requires an embodied agent to reason about its configuration, environmental constraints, and task objectives to generate coordinated whole-body motion trajectories for object rearrangement. M3Bench features 30,000 object rearrangement tasks across 119 diverse scenes, providing expert demonstrations generated by our newly developed M3BenchMaker, an automatic data generation tool that produces whole-body motion trajectories from high-level task instructions using only basic scene and robot information. Our benchmark includes various task splits to evaluate generalization across different dimensions and leverages realistic physics simulation for trajectory assessment. Extensive evaluation analysis reveals that state-of-the-art models struggle with coordinating base-arm motion while adhering to environmental and task-specific constraints, underscoring the need for new models to bridge this gap. By releasing M3Bench and M3BenchMaker at https://zeyuzhang.com/papers/m3bench, we aim to advance robotics research toward more adaptive and capable mobile manipulation in diverse, real-world environments.
|
| |
| 15:00-16:30, Paper WeI2I.432 | Add to My Program |
| OccTENS: 3D Occupancy World Model Via Temporal Next-Scale Prediction |
|
| Jin, Bu | HKUST |
| Gu, Songen | Fudan University |
| Hu, Xiaotao | HKUST |
| Zheng, Yupeng | School of Artificial Intelligence, University of Chinese Academy of Sciences |
| Guo, Xiaoyang | Horizon Robotics |
| Zhang, Qian | Horizon Robotics |
| Long, Xiaoxiao | The University of Hong Kong |
| Yin, Wei | University of Adelaide |
Keywords: Intelligent Transportation Systems, Computer Vision for Transportation, Deep Learning for Visual Perception
Abstract: In this paper, we propose OccTENS, a generative occupancy world model that enables controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency. Different from visual generation, the occupancy world model must capture the fine-grained 3D geometry and dynamic evolution of the 3D scenes, posing great challenges for the generative models. Recent approaches based on autoregression (AR) have demonstrated the potential to predict vehicle movement and future occupancy scenes simultaneously from historical observations, but they typically suffer from inefficiency, temporal degradation in long-term generation and lack of controllability. To holistically address these issues, we reformulate the occupancy world model as a temporal next-scale prediction (TENS) task, which decomposes the temporal sequence modeling problem to the modeling of spatial scale-by-scale generation and temporal scene-by-scene prediction. With a TensFormer, OccTENS can effectively manage the temporal causality and spatial relationships of occupancy sequences in a flexible and scalable way. To enhance the pose controllability, we further propose a holistic pose aggregation strategy, which features a unified sequence modeling for occupancy and ego-motion. Experiments show that OccTENS outperforms the state-of-the-art method with both higher occupancy quality and faster inference time.
|
| |
| 15:00-16:30, Paper WeI2I.433 | Add to My Program |
| USPilot: An Embodied Robotic Assistant Ultrasound System with a Large Language Model Enhanced Graph Planner |
|
| Chen, Mingcong | City University of Hong Kong |
| Fan, Siqi | Chinese University of Hong Kong |
| Cao, Guanglin | Institute of Automation, Chinese Academy of Sciences |
| Liu, Yunhui | Chinese University of Hong Kong |
| Liu, Hongbin | Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences |
| |
| 15:00-16:30, Paper WeI2I.434 | Add to My Program |
| An Underwater Exoskeleton for Scuba Diving: Reducing Air Consumption and Muscle Activation through Knee Assistance |
|
| Wu, Xianda | Peking University |
| Xu, Ming | Peking University |
| Zhou, Zhihao | Peking University |
| Lou, Wenjie | Peking University |
| Zhang, Teng | Peking University |
| Zhou, Yalei | Peking University |
| Mai, Jingeng | Peking University |
| Wang, Qining | Peking University |
Keywords: Wearable Robots, Human-Centered Robotics, Human Performance Augmentation, Exoskeletons
Abstract: Evolutionary pressures have pushed humans to become efficient walkers, but inefficient divers. People consume more energy to travel the same distance underwater than on land. In diverse overground locomotion, emerging exoskeletons have reduced the metabolic cost of humans. Can we also improve the energy economy in underwater locomotion via exoskeletons? Here, we propose an underwater exoskeleton to assist scuba diving using flutter kick, by applying assistive knee extension torque during the strike phase of the diving kick cycle. When divers wore the powered exoskeleton, the average net air cost across six experienced divers was reduced by 22.7±10.0%, and the peak quadriceps activation was decreased by 20.9±7.5%, compared with normal diving without the exoskeleton. The average gastrocnemius activation also decreased by 20.6±5.3%, suggesting that the divers sufficiently utilized the exoskeleton assistance. These results indicate that applying exoskeleton assistance is conducive to improving the endurance of human underwater diving and enhancing our ability to explore the underwater world. Our study extends the application boundary of wearable robots.
|
| |
| 15:00-16:30, Paper WeI2I.435 | Add to My Program |
| Design and Control of Isoperimetric Robot from Tape Springs |
|
| Jackson, Hancey | Brigham Young University |
| Lundgreen, Fisk | Brigham Young University |
| Usevitch, Nathan | Facebook Reality Labs |
Keywords: Compliant Joints and Mechanisms, Cellular and Modular Robots, Parallel Robots
Abstract: Isoperimetric robots can dramatically change shape to adapt to different tasks. They are built from triangle modules, each formed by a continuous structural member that passes through three roller units, one at each corner. The robot changes shape as the roller units drive along the structural member, changing the location of the joints. Previous designs used inflated fabric tubes as the structural member, but these systems are prone to leaking and changes in pressure due to temperature effects. We present an isoperimetric robot composed of tape-springs (curved spring steel tapes) as the primary structural member, and assemble an octahedron robot. We detail the design of the roller modules that can drive along the tape spring. We also show that with tape springs, all three roller units at the vertices of each triangle can drive along the tape spring. This increases the robot's speed moving between configurations and enables new types of behaviors, such as motion of the beam without motion of the rollers. We also present an optimization procedure for the tape spring isoperimetric robot that minimizes the time required to reach a desired configuration, assuming each roller is limited to a maximum speed.
|
| |
| 15:00-16:30, Paper WeI2I.436 | Add to My Program |
| Time-Series Data-Driven Three Dimensional Shape Control of Deformable Linear Objects Using a Dual-Arm Robot with Dynamic Model Updating |
|
| Choi, Jiyoung | Chonnam National University |
| Gebrezgiher, Micheale Haileslassie | Chonnam National University |
| Lee, Donggun | North Carolina State University |
| Hong, Ayoung | Chonnam National University |
Keywords: Dual Arm Manipulation, Model Learning for Control
Abstract: Deformable objects (DOs) are prevalent in everyday environments and represent important targets for robotic manipulation. However, their high degrees of freedom and complex nonlinear deformations make them more challenging to model and control than rigid objects when relying on traditional analytical approaches. To address this, we propose a data-driven method to model the dynamics of deformable objects. Our method utilizes time-series data to predict future states without relying on complex dynamics. We employ model predictive control (MPC) for robot manipulation and improve its performance through online updates of the data-driven model. To handle cables with varying configurations, interpolation is applied to align model input structures. In this study, we focus on manipulating deformable linear objects (DLOs) with different mechanical properties and configurations using a dual-arm robotic system, both in simulation and in real-world environments.
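As a stand-in for the updated data-driven model, a ridge regressor refit online from a sliding window of past states and gripper motions; this is an assumption-laden simplification of the paper's time-series model, with all names invented for illustration.

```python
import numpy as np

class OnlineDLOModel:
    """Predict the next cable feature vector from a flattened window of
    past states and gripper motions, refit as new interaction data arrives."""
    def __init__(self, lam=1e-3):
        self.lam, self.X, self.Y = lam, [], []

    def add(self, window, next_state):           # called after each step
        self.X.append(np.ravel(window))
        self.Y.append(np.asarray(next_state))

    def predict(self, window):
        X, Y = np.asarray(self.X), np.asarray(self.Y)
        # Ridge regression: W = (X^T X + lam I)^-1 X^T Y, refit each call.
        W = np.linalg.solve(X.T @ X + self.lam * np.eye(X.shape[1]), X.T @ Y)
        return np.ravel(window) @ W
```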
|
| |
| 15:00-16:30, Paper WeI2I.437 | Add to My Program |
| Motion Pattern Analysis of a Rolling Locomotion Robot Featuring Dual Rimless Wheels and Elastic Connectors |
|
| Sedoguchi, Taiki | Japan Advanced Institute of Science and Technology |
| Asano, Fumihiko | Japan Advanced Institute of Science and Technology |
Keywords: Compliant Joints and Mechanisms, Legged Robots, Passive Walking
Abstract: The rimless wheel, one of the simplest walking models, has been widely studied as a theoretical framework for bipedal locomotion. This study introduces a dual rimless wheel (DRW) connected by elastic elements for maintaining the body shape and investigates its passive locomotion capability through numerical simulations. Simulation results reveal that as the stiffness of the elastic elements increases, the walking behavior approaches that of a rigid rimless wheel, resulting in higher forward velocity. Conversely, lower stiffness enhances body flexibility and enables the generation of low-speed gaits with remarkably small energy loss. These findings suggest that the DRW may be advantageous in environments where collisions have strong impacts, such as compliant terrains. Furthermore, through comparative simulations with several other models, including the rigid rimless wheel, we demonstrate that the low-stiffness DRW model can generate clearly slower passive locomotion while maintaining a feasible walking region. On the other hand, a basic prototype experiment indicates that the low-stiffness DRW model achieves more stable and faster walking than the high-stiffness DRW model on low-angle slopes. While the results do not imply that the DRW is universally optimal, they provide new insights into generating soft and stable gaits and underline the usefulness of tensegrity mechanisms.
|
| |
| 15:00-16:30, Paper WeI2I.438 | Add to My Program |
| Toward Efficient and Robust Behavior Models for Multi-Agent Driving Simulation |
|
| Konstantinidis, Fabian | Cariad SE |
| Sackmann, Moritz | CARIAD SE |
| Hofmann, Ulrich Franz | CARIAD SE |
| Stiller, Christoph | Karlsruhe Institute of Technology |
Keywords: Learning from Demonstration, Intelligent Transportation Systems, Multi-Robot Systems
Abstract: Scalable multi-agent driving simulation requires behavior models that are both realistic and computationally efficient. We address this by optimizing the behavior model that controls individual traffic participants. To improve efficiency, we adopt an instance-centric scene representation, where each traffic participant and map element is modeled in its own local coordinate frame. This design enables efficient, viewpoint-invariant scene encoding and allows static map tokens to be reused across simulation steps. To model interactions, we employ a query-centric symmetric context encoder with relative positional encodings between local frames. We use Adversarial Inverse Reinforcement Learning to learn the behavior model and propose an adaptive reward transformation that automatically balances robustness and realism during training. Experiments demonstrate that our approach scales efficiently with the number of tokens, significantly reducing training and inference times, while outperforming several agent-centric baselines in terms of positional accuracy and robustness.
|
| |
| WeI2LB Late Breaking Results Session, Hall C |
Add to My Program |
| Late Breaking Results 4 |
|
| |
| |
| 15:00-16:30, Paper WeI2LB.1 | Add to My Program |
| Accurate Joint Torque Estimation in Lower-Limb Exoskeletons Via Hybrid Modeling and Learning |
|
| Yazdankhoo, Behnam | Simon Fraser University |
| Mousavi, Milad | Simon Fraser University |
| Faridi Rad, Nafise | Simon Fraser University |
| Mansouri, Saeed | Human In Motion Robotics Inc. |
| Peykari, Behzad | Human in Motion |
| Arzanpour, Siamak | Simon Fraser University |
| Najafi, Farshid | Simon Fraser University |
| Park, Edward J. | Simon Fraser University |
| |
| 15:00-16:30, Paper WeI2LB.2 | Add to My Program |
| Self-Adaptive Autonomous Navigation Based on Reservoir Computing in Snowy Environments |
|
| Li, Fangzheng | Japan Advanced Institute of Science and Technology |
| Ji, Yonghoon | JAIST |
Keywords: Field Robots, Robotics in Hazardous Fields, Reinforcement Learning
Abstract: Autonomous navigation in snowy environments is essential for snow removal robots operating in regions with heavy snowfall. However, snow accumulation obscures terrain features and introduces sensor noise, making reliable perception and navigation difficult. Moreover, snow removal robots typically operate only during winter, while the environment may change during other seasons, requiring the robot to adapt to new situations. To address these challenges, this study proposes a self-adaptive navigation framework that learns directly in real snowy environments without relying on simulation. The framework integrates reservoir computing (RC), reinforcement learning (RL), and artificial bee colony (ABC) optimization. In addition, a snow-region detection method based on thermal and grayscale images is introduced to guide the robot toward areas requiring snow removal.
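For readers unfamiliar with the RC component, a minimal echo-state-network sketch follows; the sizes, spectral scaling, and leaky update are common defaults and are assumptions here, not the paper's configuration.

```python
import numpy as np

# Minimal echo-state network (reservoir): fixed random recurrent weights,
# scaled for the echo-state property; only a linear readout is trained
# (e.g., by the RL/ABC outer loop described in the abstract).
rng = np.random.default_rng(0)
n_in, n_res = 8, 300
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius < 1

def step(x, u, leak=0.3):
    """Leaky-integrator reservoir update driven by sensor input u."""
    return (1 - leak) * x + leak * np.tanh(W_in @ u + W @ x)

x = np.zeros(n_res)
x = step(x, rng.standard_normal(n_in))            # one update
```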
|
| |
| 15:00-16:30, Paper WeI2LB.3 | Add to My Program |
| Friction-Aware Actuator Modeling for Accurate Torque Estimation Using External Sensors |
|
| Park, Jiman | SungKyunKwan University |
| Lee, Hyunyong | AIDIN ROBOTICS |
| Kang, Hansol | Sungkyunkwan University |
| Nam, SeongWon | Sungkyunkwan University |
| Son, Yeongwoo | Sungkyunkwan University |
| Yi, Bumsu | SungKyunKwan University |
| Oh, Jaeyoung | Sungkyunkwan University |
| Choi, Hyouk Ryeol | Sungkyunkwan University |
Keywords: Dynamics, Software-Hardware Integration for Robot Systems, Calibration and Identification
Abstract: Modern robotic controllers are typically designed in simulation and subsequently deployed on real robots. However, discrepancies between simulated and actual actuator torque often lead to sim-to-real (sim2real) problems. Various actuator modeling approaches have been proposed to address this problem, but when external torque sensors are used, it is difficult to measure the intrinsic actuator output torque due to disturbances from the external load system. This paper proposes an actuator modeling method that minimizes the influence of external systems. The friction torque of the actuator is first identified under no-load conditions, and the measured torque under loaded conditions is compensated accordingly to estimate the pure output torque. Experimental results across various actuators and load conditions demonstrate that the proposed model closely matches the measured torque, even in actuators with large friction. The proposed approach overcomes the limitations of modeling with external sensors and provides an effective solution for reducing sim2real problems in diverse actuator systems.
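A minimal sketch of the identify-then-compensate idea, assuming a common Coulomb-plus-viscous friction model (the paper's exact parameterization is not given here):

```python
import numpy as np

def fit_friction(omega, tau_motor):
    """No-load runs: motor torque ~ friction. Fit tau_f = c*sign(w) + b*w."""
    X = np.column_stack([np.sign(omega), omega])
    (c, b), *_ = np.linalg.lstsq(X, tau_motor, rcond=None)
    return c, b

def output_torque(tau_motor, omega, c, b):
    """Loaded runs: estimated pure output torque after friction compensation."""
    return tau_motor - (c * np.sign(omega) + b * omega)
```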
|
| |
| 15:00-16:30, Paper WeI2LB.4 | Add to My Program |
| A Digital Twin-Driven Intelligent Inspection Robotic System for Elevator Buffer in Confined Hazardous Space |
|
| Hu, Zhiyong | Tianjin Special Equipment Inspection Institute |
| |
| 15:00-16:30, Paper WeI2LB.5 | Add to My Program |
| Uncertainty-Aware BIT*: Collision-Free Path Planning for Maritime Autonomous Surface Ships under Target Ship Position Uncertainty |
|
| Kim, Sojin | UST-KRISO |
| Park, Jeonghong | KRISO |
Keywords: Motion and Path Planning, Collision Avoidance, Marine Robotics
Abstract: This paper proposes a BIT*-based collision-free path planning method for remotely operated Maritime Autonomous Surface Ships (MASS) under target ship position uncertainty. In a remote operating environment, the Remote Operating Center (ROC) receives target ship position data from long-range sensors such as AIS and radar. This data inherently contains uncertainty due to measurement errors and communication delays, making it essential to account for such uncertainty during path planning. As MASS must comply with the International Regulations for Preventing Collisions at Sea (COLREGs) during navigation, path planning must also reflect these requirements. The proposed method models target ship position uncertainty as a two-dimensional Gaussian distribution and incorporates it into edge evaluation as a collision risk cost, while applying penalty costs to edges that enter COLREGs non-compliant regions. A turning radius constraint is also incorporated into the edge selection process to ensure navigational feasibility. The method is validated through head-on and crossing encounter simulations on an Electronic Nautical Chart (ENC)-based grid map of Ulsan Port, South Korea. The results show that higher levels of position uncertainty lead to more conservative avoidance paths, resulting in greater Distance to Closest Point of Approach (DCPA).
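A hedged sketch of what such an uncertainty-aware edge cost might look like; the safety radius, weights, and the small-disc probability approximation are illustrative assumptions, not the paper's formulation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def edge_cost(p0, p1, ship_mean, ship_cov, r_safe=50.0,
              w_risk=1e4, n_samples=20):
    """Edge length plus an approximate collision-risk term."""
    pts = np.linspace(np.asarray(p0), np.asarray(p1), n_samples)
    pdf = multivariate_normal(ship_mean, ship_cov).pdf(pts)
    # small-disc approximation of P(target ship within r_safe of the edge)
    risk = min(np.max(pdf) * np.pi * r_safe**2, 1.0)
    length = np.linalg.norm(np.asarray(p1) - np.asarray(p0))
    return length + w_risk * risk

# A COLREGs term would add a penalty to edges entering non-compliant
# regions, and a turning-radius check would reject infeasible edges.
```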
|
| |
| 15:00-16:30, Paper WeI2LB.6 | Add to My Program |
| Wrench-Feasible Whole-Body Planning Via Time-Layered DAG Optimization for Omnidirectional Aerial Manipulation |
|
| Park, Daum | Kyung Hee University |
| Pak, Bohyeong | Kyung Hee University |
| Kim, Sanghyun | Kyung Hee University |
Keywords: Field Robots, Aerial Systems: Mechanics and Control, Motion and Path Planning
Abstract: Omnidirectional aerial manipulators (OAMs) must coordinate a floating base and an onboard arm to track end-effector trajectories under coupled geometric and dynamic constraints. In cluttered long-horizon tasks, collision-free motions may still be dynamically inadmissible, and dense time-layered planning graphs can become disconnected. We present a GPU-accelerated whole-body planning framework that combines reverse-chain node sampling, collision and wrench-aware feasibility filtering, guide-based connectivity refinement, and dynamic programming on a time-layered directed acyclic graph. At each timestep, the target end-effector pose is converted to a wrist-anchored representation, from which large batches of whole-body candidates are generated in parallel on the GPU. Sampled nodes are pruned using self/environment collision checks and rotor-allocation-based wrench-feasibility tests. When local disconnections remain, sparse guides trigger targeted dense resampling to recover connectivity. On drawing and peg-in-hole tasks, the proposed method achieves 91.7% and 100.0% dynamic-safe runs, versus 0.0% and 8.3% for a wrench-unconstrained baseline. Guide refinement reduces the mean number of broken layers by 200.2 and 49.6, yielding 100-percentage-point success gains over no refinement. GPU batching keeps planning practical for long-horizon cluttered OAM problems.
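To make the time-layered DAG step concrete, here is a generic dynamic-programming sketch over pre-filtered candidate layers; the cost structure and broken-layer handling are simplified assumptions.

```python
import numpy as np

def dp_over_layers(layers, edge_cost):
    """layers[t]: array of feasible candidates at timestep t (already pruned
    by collision and wrench checks); edge_cost(a, b) -> scalar or np.inf."""
    cost, parent = np.zeros(len(layers[0])), []
    for t in range(1, len(layers)):
        C = np.array([[edge_cost(a, b) for a in layers[t - 1]]
                      for b in layers[t]])          # (n_t, n_{t-1})
        total = C + cost[None, :]
        parent.append(np.argmin(total, axis=1))
        cost = np.min(total, axis=1)
        if not np.isfinite(cost).any():             # broken layer:
            return None                             # trigger guided resampling
    idx, path = int(np.argmin(cost)), []            # backtrack best path
    for back in reversed(parent):
        path.append(idx)
        idx = int(back[idx])
    path.append(idx)
    return path[::-1]
```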
|
| |
| 15:00-16:30, Paper WeI2LB.7 | Add to My Program |
| Beyond Domain Randomization: Safety Certificates for Reinforcement Learning |
|
| Stocco, Paula | Stanford University |
| Micheli, Francesco | Amazon |
| Schmid, Niklas | ETH Zurich, Automatic Control Laboratory (IfA) |
| Lygeros, John | ETH Zurich |
| Balta, Efe | Inspire AG |
Keywords: Robot Safety, Planning under Uncertainty, Reinforcement Learning
Abstract: With the growing acceptance of robotics in daily life, there is an increasing need for certifiably safe control policies. While simulation provides a safe training environment, policies often fail in sim-to-real transfer. We propose a data-driven certification framework for reinforcement learning based on Pick-to-Learn (P2L), a meta-algorithm that uses data preference ordering to compute probabilistic bounds on the satisfaction of application-dependent properties of interest. Our results demonstrate that using P2L maintains high performance while distinguishing between policies that appear similar under domain randomization alone. This work offers a practical method for preparing safe reinforcement learning policies by providing formal safety guarantees prior to hardware deployment.
|
| |
| 15:00-16:30, Paper WeI2LB.8 | Add to My Program |
| An Underactuated Robotic Gripper with Flowability and Variable Stiffness for Food Bin-Picking |
|
| Xue, Yitong | Ritsumeikan University |
| Wang, Zhongkui | Ritsumeikan University |
Keywords: Grippers and Other End-Effectors, Grasping, Underactuated Robots
Abstract: This paper presents a single-motor underactuated gripper with variable stiffness, designed for food bin-picking tasks. The gripper employs highly compliant fingers that can passively adapt to cluttered environments and enclose target objects. Upon grasping, tendon-driven actuation increases structural stiffness, enabling low-damage, error-tolerant manipulation. Experiments on a variety of food items demonstrate robust grasping performance and stable object handling.
|
| |
| 15:00-16:30, Paper WeI2LB.9 | Add to My Program |
| Robust Localization in Large-Scale Symmetric Environments through Dynamic Topological Mapping |
|
| Flor Rodríguez-Rabadán, Rafael | Alcalá University |
| Lafuente-Arroyo, Sergio | University of Alcalá |
| Maldonado-Bascón, Saturnino | Universidad De Alcalá |
| Gutiérrez Álvarez, Carlos | Universidad De Alcalá |
| López-Sastre, Roberto J. | University of Alcalá |
Keywords: Localization, Vision-Based Navigation, Mapping
Abstract: Visual place recognition in large-scale, indoor environments often suffers from perceptual aliasing due to structural symmetries and dynamic changes. This work presents a robust hierarchical topological mapping framework designed for long-term robot autonomy. Our system integrates multi-modal data (including 2D LiDAR, odometry, and RGB imagery) into a two-layer architecture. First, a Layout Layer is designed to capture the geometric structure of the environment. Then, a Visual Layer is used to encode image sequences. A key contribution is the dynamic map maintenance mechanism, which monitors the attenuation of edge weights to detect environmental transitions, such as the opening or closing of doors. This allows for seamless lifelong updates without human intervention in large-scale environments. We evaluate our approach using various visual descriptors (e.g., SuperGlue, Patch-NetVLAD, and SeqVLAD) within a sequence-based matching pipeline. Experimental results in a 750 m^2 real-world facility demonstrate that the proposed method achieves high discrimination and scalability, even in challenging open areas and symmetric corridors. This framework provides a reliable solution for assistive robotics navigating complex, evolving public spaces.
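A minimal sketch of the edge-attenuation mechanism as described: decay edges that are not re-observed, reinforce those that are, and prune edges whose weight falls below a threshold (interpreted as a closed transition, e.g., a shut door). The constants are placeholders.

```python
DECAY, REINFORCE, W_MAX, PRUNE_AT = 0.95, 1.0, 10.0, 0.1

def update_edges(edges, traversed):
    """edges: {(u, v): weight}; traversed: set of edges observed this pass."""
    for e in list(edges):
        if e in traversed:
            edges[e] = min(edges[e] + REINFORCE, W_MAX)  # re-observed
        else:
            edges[e] *= DECAY                            # attenuate
        if edges[e] < PRUNE_AT:
            del edges[e]   # environmental transition detected: drop edge
    return edges
```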
|
| |
| 15:00-16:30, Paper WeI2LB.10 | Add to My Program |
| Ultra-Low-Impedance Robotic Gripper for High-Bandwidth and Transparent Physical Interaction |
|
| Lee, Joon | Sogang University |
| Choi, Ari | Sogang University |
| Jeong, Seokhwan | Mechanical Eng., Sogang University |
Keywords: Grippers and Other End-Effectors, Multifingered Hands, In-Hand Manipulation
Abstract: Conventional robotic grippers relying on external force sensors or high gear-ratio actuators suffer from high mechanical impedance and limited control bandwidth. To address these limitations, this study proposes a novel 9-DOF, three-fingered Direct-Drive Differential (DDD) gripper that integrates DD motors with a low gear-ratio (1:2) differential transmission. This mechanism centralizes the actuator mass at the base to achieve an ultra-low-inertia design, while the differential architecture couples motors in parallel to amplify torque for flexion movements. Performance evaluations demonstrate that the prototype delivers a nominal grasping force of 15 N and a fingertip force of 3.1 N, while maintaining a remarkably low system inertia (motor contribution of 0.236%) and mechanical impedance (<700 N/m) within the typical human manipulation frequency range. The proposed hardware successfully resolves the trade-offs among torque, transparency, and kinematics, establishing a robust foundation for highly responsive, sensorless proprioceptive force estimation in dynamic environments.
|
| |
| 15:00-16:30, Paper WeI2LB.11 | Add to My Program |
| Towards Human-Like Table Tennis Serving: Preliminary Exploration with Simplified Serving Motion Using an Industrial Robotic Manipulator in NVIDIA Isaac Sim |
|
| Chiou, Po-Chuan | National Yang Ming Chiao Tung University |
| Hong, Jing-Chen | National Yang Ming Chiao Tung University |
Keywords: Motion and Path Planning, Constrained Motion Planning, Simulation and Animation
Abstract: In this report, we present our latest work-in-progress results on table tennis serving simulation. The ultimate goal of our research is the realization of human-like table tennis serving with a 6-joint robotic arm. A preliminary evaluation indicates the potential of multi-joint bionic serve motion planning, with extended topics and future directions discussed.
|
| |
| 15:00-16:30, Paper WeI2LB.12 | Add to My Program |
| Nnodely - Neuralize Your Model |
|
| Rosati Papini, Gastone Pietro | University of Trento |
| Plebe, Alice | University of Trento |
| Sharifzadeh, Mojtaba | University of Trento |
| Piazza, Mattia | University of Trento |
| Taddei, Sebastiano | University of Trento, Politecnico Di Bari |
| Scialla, Giovanni | University of Trento |
| Baroni, Francesco | University of Trento |
| De Martini, Davide | University of Trento |
| La Scala, Giovanni Maria Francesco | University of Trento |
| Defrancesco, Gioele | University of Trento |
| Faccini, Filippo | University of Trento |
Keywords: Software Tools for Robot Programming, Deep Learning Methods, Software Architecture for Robotic and Automation
Abstract: Modeling and control of physical systems remain challenging for purely data-driven methods, which often lack interpretability and fail to leverage prior knowledge. Model-structured neural networks (MSNNs) embed physical laws into neural architectures; however, their design and implementation can be nontrivial. We present nnodely, an open-source framework that simplifies MSNN development through a modular workflow, improving interpretability, data efficiency, and deployment on resource-constrained platforms. The paper highlights the framework’s features, positions it within the landscape of existing tools, and demonstrates its effectiveness in two case studies. nnodely is released under the MIT license and is available at https://github.com/tonegas/nnodely
|
| |
| 15:00-16:30, Paper WeI2LB.13 | Add to My Program |
| Towards Dexterous Agri-Food Manipulation: Topology-Dependent Interaction Patterns in a Reconfigurable Multifingered Gripper |
|
| Lan, Hongyu | University of Bologna |
| Caporali, Alessio | University of Bologna |
| Dong, Chengxiao | University of Bologna |
| Palli, Gianluca | University of Bologna |
| Melchiorri, Claudio | University of Bologna |
Keywords: Grippers and Other End-Effectors, Grasping, Agricultural Automation
Abstract: Robotic agri-food manipulation remains challenging because food items vary substantially in geometry, compliance, mass distribution, and surface properties, while their fragile nature makes grasping sensitive to small pose errors. This work presents a compact simulation-based study of how grasp topology affects robustness and mechanics-level interaction behavior in a reconfigurable four-finger gripper. Using AGX Dynamics, we evaluate three grasp configurations across representative agri-food objects under controlled yaw and planar-offset perturbations. The results show that spherical grasping is most robust to planar misplacement, torque is more perturbation-sensitive than force, and friction demand is governed more by object geometry than by grasp configuration. These findings provide an interpretable basis for robust and damage-aware configuration selection in agri-food manipulation.
|
| |
| 15:00-16:30, Paper WeI2LB.14 | Add to My Program |
| UniOMA: Unified Optimal-Transport Multi-Modal Structural Alignment for Robot Perception |
|
| Zu, Xinrui | Vrije Universiteit Amsterdam |
| Luck, Kevin Sebastian | Vrije Universiteit Amsterdam |
| Yu, Shujian | Vrije Universiteit Amsterdam |
Keywords: Representation Learning, Sensor Fusion, Perception-Action Coupling
Abstract: Contrastive objectives such as InfoNCE align multimodal representations at the instance level but fail to preserve intra-modal geometry, a limitation we call the structural alignment gap. We propose UniOMA, a multimodal structural alignment method that uses a Gromov-Wasserstein (GW) barycenter regularizer to align each modality to a shared structural consensus, scaling linearly to three or more modalities. Experiments on five robotic benchmarks (vision, force, depth, audio, tactile, proprioception) show consistent improvements in downstream tasks such as regression, classification, and cross-modal retrieval.
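As a simplified stand-in for the GW barycenter term (which compares intra-modal geometries through optimal couplings), the sketch below penalizes elementwise disagreement between normalized intra-modal distance matrices; it illustrates the structural-alignment idea only and is not the authors' regularizer.

```python
import torch

def structural_gap(z_a, z_b):
    """z_a, z_b: (B, d) embeddings of the same batch from two modalities."""
    Da = torch.cdist(z_a, z_a, p=2)        # intra-modal geometry, modality A
    Db = torch.cdist(z_b, z_b, p=2)        # intra-modal geometry, modality B
    Da = Da / (Da.mean() + 1e-8)           # scale-normalize
    Db = Db / (Db.mean() + 1e-8)
    return ((Da - Db) ** 2).mean()

# total_loss = infonce(z_a, z_b) + lam * structural_gap(z_a, z_b)
```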
|
| |
| 15:00-16:30, Paper WeI2LB.15 | Add to My Program |
| A 2-DoF Ankle Rehabilitation Platform Based on an Inclined Dual-Cylinder Mechanism |
|
| Kim, Donggeon | Chung-Ang University |
| Kiefer, Kira | MCI the Entrepreneurial School |
| Veits, Luisa | MCI the Entrepreneurial School |
| Ulbl, Laura | MCI the Entrepreneurial School |
| Morgado-Vega, Necolle | Yale University |
| Kim, Tae-Hyoung | Chung-Ang University |
| Kim, Yeongmi | MCI |
Keywords: Rehabilitation Robotics, Mechanism Design, Motion Control
Abstract: This paper presents a novel ankle rehabilitation platform based on an inclined dual-cylinder mechanism that provides 2-DoF motion through geometric coupling, without complex multi-link structures. Two cylinders sharing a 9° inclined contact surface are driven by two stepper motors, enabling simultaneous dorsiflexion/plantarflexion and inversion/eversion of up to 18° in each axis. The platform provides both a passive mode, which follows predefined trajectories, and an active mode, which captures user intent through center-of-pressure estimation using a force-sensing resistor–based insole. A Particle Swarm Optimization–tuned PD controller is used in both modes, achieving an RMS tracking error below 0.35° in experimental validation. An IMU-integrated gamification environment further demonstrates the feasibility of the platform as an interactive active training system.
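A generic PSO loop for PD-gain tuning, sketched under assumed bounds and swarm settings; cost(gains) would roll out the platform (or a model) and return the RMS tracking error.

```python
import numpy as np

def pso_tune(cost, bounds=np.array([[0.1, 50.0], [0.0, 5.0]]),
             n=20, iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Particle swarm over (Kp, Kd); returns the best gains found."""
    rng = np.random.default_rng(seed)
    dim = len(bounds)
    x = rng.uniform(bounds[:, 0], bounds[:, 1], (n, dim))
    v = np.zeros_like(x)
    pbest, pbest_c = x.copy(), np.array([cost(p) for p in x])
    for _ in range(iters):
        g = pbest[np.argmin(pbest_c)]                 # swarm best
        r1, r2 = rng.random((2, n, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, bounds[:, 0], bounds[:, 1])
        c = np.array([cost(p) for p in x])
        better = c < pbest_c
        pbest[better], pbest_c[better] = x[better], c[better]
    return pbest[np.argmin(pbest_c)]

# Control law with tuned gains:  u = Kp * (q_ref - q) + Kd * (dq_ref - dq)
```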
|
| |
| 15:00-16:30, Paper WeI2LB.16 | Add to My Program |
| Sat-RoMa: Cross-Scale Dense Matching for Multi-Temporal UAV-To-Orthophoto Registration |
|
| Krupka, Maciej | Poznan University of Technology |
| Węgrzynowski, Jan | Poznan University of Technology |
| Skrzypczynski, Piotr | Poznan University of Technology |
Keywords: Localization, Aerial Systems: Perception and Autonomy, Deep Learning for Visual Perception
Abstract: Reliable Global Navigation Satellite System (GNSS) signals are increasingly denied or jammed in real-world applications, such as search and rescue operations. In such scenarios, Unmanned Aerial Vehicles (UAVs) must rely on downward-facing cameras for absolute localization against reference satellite maps. While Visual Inertial Odometry (VIO) is highly accurate locally, it inevitably accumulates drift over time. Localizing a drone image against a pre-existing satellite map (e.g., Google Earth) via homography estimation is a viable solution, but it is severely challenged by seasonal variations, construction, and vegetation changes. In this paper, we propose Sat-RoMa, an end-to-end robust dense feature matcher adapted from the state-of-the-art RoMa architecture. By utilizing a frozen, pre-trained DinoV3 encoder specifically tuned for satellite imagery, and formulating the task as matching a small drone image to a 4x larger reference map, Sat-RoMa explicitly handles scale discrepancies and temporal appearance changes. Preliminary results demonstrate that Sat-RoMa significantly outperforms baselines like LoFTR and LightGlue, achieving a 16.0% scale error compared to over 100% for existing methods, paving the way for robust GPS-denied UAV navigation.
|
| |
| 15:00-16:30, Paper WeI2LB.17 | Add to My Program |
| Curvature Adaptable Robotic End-Effectors |
|
| Rincon, Jhonatan | Purdue University |
| Osorio, Juan | Purdue University |
| Kim, Wonhee | GM R&D |
| Alexander, Paul | GM R&D |
| Hwang, Dooil | GM R&D |
| Arrieta, Andres | Purdue University |
Keywords: Grippers and Other End-Effectors, Grasping, Soft Robot Applications
Abstract: Flexible robotic manipulators are rapidly gaining traction in automotive assembly to boost productivity and adaptability. Conventional end-effector systems depend heavily on custom tooling engineered for individual curved parts, a strategy that drives up reconfiguration costs, limits interoperability across different product lines, and increases downtime between production runs. We propose an underactuated end-effector that integrates a metasheet of dome-shaped bistable units interconnected into actuation groups, individually addressable via pneumatic inflation. This arrangement permits transitions between multiple stable configurations, each corresponding to a distinct curvature profile, allowing the manipulator to accommodate different objects found on assembly lines. By tuning the geometry of the proposed end-effector, the system triggers transitions in targeted groups, reconfiguring the system’s overall shape to conform to diverse part geometries. This flexibility enables a single manipulator platform to handle a broad family of components without the expense and downtime associated with bespoke tooling changes. By leveraging intrinsic compliance and multistability, the proposed approach strikes an effective balance between mechanical complexity and operational simplicity.
|
| |
| 15:00-16:30, Paper WeI2LB.18 | Add to My Program |
| Semantic 3D Skeleton Extraction for Precision Agricultural Robotics: Preliminary Result |
|
| Yang, Dayeon | Gwangju Institute of Science and Technology |
| Ju, Chanyoung | Korea Institute of Industrial Technology |
Keywords: AI-Enabled Robotics, AI-Based Methods, Object Detection, Segmentation and Categorization
Abstract: A multi-modal dataset was constructed in a real orchard environment under leaf-off conditions using an RGB-D camera and LiDAR, enabling clear observation of branch and trunk structures. The complementary geometric information from both sensors allows for more precise 3D structural reconstruction. Dense point clouds obtained from the RGB-D camera are fused with LiDAR point clouds via ICP registration, followed by ground removal and DBSCAN clustering to segment individual trees. AdTree is then applied to each segmented tree to extract the 3D skeletal structure and generate the ground truth (GT). The constructed GT explicitly represents the hierarchical branch structure of each tree, and additional data collection under leaf-on conditions is planned to enable quantitative evaluation of skeleton extraction performance across varying foliage conditions. Furthermore, the constructed dataset will be utilized for training and evaluation of a Flow Matching-based generative model for tree skeletonization. Flow Matching enables stable skeleton reconstruction even from noisy and heavily occluded point clouds in real orchard environments, and the dataset is expected to facilitate quantitative analysis of performance differences between leaf-off and leaf-on conditions.
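A minimal sketch of the segmentation stage (height-threshold ground removal followed by DBSCAN clustering); the thresholds are placeholders, and the preceding RGB-D/LiDAR ICP fusion is omitted.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def segment_trees(points, ground_z=0.15, eps=0.4, min_samples=30):
    """points: (N, 3) fused cloud; returns a list of per-tree point clouds."""
    above = points[points[:, 2] > ground_z]     # crude ground removal
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(above[:, :2])
    return [above[labels == k] for k in range(labels.max() + 1)]
```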
|
| |
| 15:00-16:30, Paper WeI2LB.19 | Add to My Program |
| Gaussian Splatting and Point Cloud-Based Workspace Prediction for Collision-Free Trajectory Planning in Collaborative Robots |
|
| Seo, Jungho | Daegu Gyeongbuk Institute of Science&Technology |
| Kim, DongWook | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
Keywords: Motion and Path Planning, Multi-Robot Systems, Deep Learning Methods
Abstract: As multi-robot collaboration becomes increasingly prevalent in modern industrial settings, ensuring collision-free operation among robots sharing the same workspace remains a critical challenge. This paper proposes an integrated framework that combines 3D Gaussian Splatting (3D-GS) for high-fidelity scene reconstruction, Generalized Iterative Closest Point (GICP) with Fast Global Registration (FGR) for robust pose estimation, a Deep Graph Convolutional Neural Network (DGCNN) for joint angle regression from point cloud data, Dynamic Mode Decomposition (DMD) for trajectory prediction, and a Control Barrier Function (CBF) for real-time safety enforcement. Through experiments, we validated trajectory prediction for 0-DOF objects and confirmed that joint angles can be predicted from 3D-GS-based PLY data using DGCNN-based regression, with joint-angle training data collected at intervals of 15 to 45 degrees.
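For the trajectory-prediction component, a plain DMD sketch is shown below; the framework's exact DMD variant may differ.

```python
import numpy as np

def fit_dmd(X):
    """X: (d, T) snapshots; returns A minimizing ||X2 - A @ X1||_F."""
    X1, X2 = X[:, :-1], X[:, 1:]
    return X2 @ np.linalg.pinv(X1)

def predict(A, x0, steps):
    """Roll the linear operator forward from x0 for `steps` steps."""
    traj, x = [x0], x0
    for _ in range(steps):
        x = A @ x
        traj.append(x)
    return np.stack(traj, axis=1)               # (d, steps + 1)
```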
|
| |
| 15:00-16:30, Paper WeI2LB.20 | Add to My Program |
| Three-Dimensional Needle Tip Estimation from Multi-View X-Ray Images for Interventional Pain Procedures |
|
| Han, Seunghui | Chonnam National University |
| Hong, Ayoung | Chonnam National University |
Keywords: Computer Vision for Medical Robotics
Abstract: This study addresses the challenge of estimating the three-dimensional (3D) position of a needle tip from two-dimensional (2D) X-ray images. We propose a classical image processing–based framework for needle tip localization and 3D reconstruction. The method first detects a circular marker attached to the robotic end-effector that controls the needle insertion and identifies the needle head position within the marker. Preprocessing steps, including bilateral filtering, thresholding, and iterative morphological operations, are applied to improve image quality and ensure the continuity of the needle shaft. A flood-fill algorithm is then used to segment the needle body, after which the needle trajectory is extracted using the A* algorithm. Finally, the 3D position of the needle tip is reconstructed by triangulation from multiple X-ray images acquired at different viewing angles.
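The final step is standard multi-view triangulation; a linear (DLT) sketch from calibrated 3x4 projection matrices of the X-ray views:

```python
import numpy as np

def triangulate(Ps, uvs):
    """Ps: list of (3, 4) projection matrices; uvs: list of (u, v) pixel
    detections of the needle tip. Returns the 3D point as a (3,) array."""
    rows = []
    for P, (u, v) in zip(Ps, uvs):
        rows.append(u * P[2] - P[0])      # two linear constraints per view
        rows.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    X = Vt[-1]                            # null-space solution
    return X[:3] / X[3]                   # dehomogenize
```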
|
| |
| 15:00-16:30, Paper WeI2LB.21 | Add to My Program |
| The RoboAtlas: Mapping the Global Robotics Landscape |
|
| Zhang, Jiacheng | National University of Singapore |
| Sun, Shuo | Singapore-MIT Alliance for Research and Technology (SMART) |
| Charisi, Vicky | Massachusetts Institute of Technology |
| Wang, Xinru | Singapore-MIT Alliance for Research and Technology Centre (SMART) |
| Xinyue, Chen | National University of Singapore |
| Ma, Zhexuan | National University of Singapore |
| Prakash, Alok | Singapore-MIT Alliance for Research and Technology |
| Malone, Thomas | Massachusetts Institute of Technology (MIT) |
Keywords: AI-Enabled Robotics, Social HRI, Software Tools for Benchmarking and Reproducibility
Abstract: Structured, model-level information on the world’s robot systems remains scarce: existing reports often provide aggregated market statistics, while industry directories typically stop at company-level information. In this work, we present an LLM-assisted, web-grounded analysis pipeline for studying the global robotics landscape at the robot-model level. The method combines company discovery, iterative verification, and model-level extraction of robot type, target industries, release year, and task descriptions from open-web evidence. Applying this pipeline, we study 8,229 robot models associated with 1,062 companies across 50 countries and 6 continents. Our findings reveal strong geographic concentration in the United States, China, and Japan, rapid growth after 2017, and substantial diffusion of robotics beyond manufacturing into logistics, healthcare, education, and household settings. Our work illustrates both the promise and certain limitations of LLM-assisted web analysis for large-scale robotics landscape mapping.
|
| |
| 15:00-16:30, Paper WeI2LB.22 | Add to My Program |
| Reward-Free Continual Adaptation for Resilient Space Robots |
|
| Orsula, Andrej | University of Luxembourg |
| Olivares-Mendez, Miguel A. | Interdisciplinary Centre for Security, Reliability and Trust - University of Luxembourg |
| Martinez, Carol | Université du Luxembourg |
Keywords: Space Robotics and Automation, Continual Learning, Reinforcement Learning
Abstract: Space robots operate in extreme environments where hardware degradation can critically compromise traditional control strategies. While continual reinforcement learning offers a promising mechanism for online adaptation, it inherently requires access to a reward signal during deployment. However, precise reward computation in space is often infeasible due to the lack of external tracking systems and the overall complexity of the environment. To address the challenge of unobservable rewards, we introduce a reward-free continual learning framework that leverages latent-state world models. By pre-training a model-based agent across diverse simulations, the world model learns a robust predictor of the reward structure within its latent space. Upon deployment to an environment with severe hardware degradation, we freeze the observation encoder and reward predictor to update only the transition dynamics of the world model through unsupervised rollouts. By training the policy entirely on imagined trajectories generated by this updated world model, the agent adapts to altered dynamics without receiving new rewards. We demonstrate our approach across simulated planetary traversal, orbital navigation, and precision assembly tasks subjected to severe morphological failures.
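A schematic of the adaptation recipe as stated (freeze the observation encoder and reward predictor, update only the transition dynamics from reward-free rollouts); module and method names are placeholders for a generic latent world model, not the authors' code.

```python
import torch

def adapt_dynamics(world_model, rollout_batches, steps=1000, lr=3e-4):
    """Reward-free adaptation: only the latent dynamics are updated."""
    for p in world_model.encoder.parameters():
        p.requires_grad_(False)              # latent space stays fixed
    for p in world_model.reward_head.parameters():
        p.requires_grad_(False)              # reward predictor stays frozen
    opt = torch.optim.Adam(world_model.dynamics.parameters(), lr=lr)
    for _ in range(steps):
        obs, act = next(rollout_batches)     # unsupervised real rollouts
        z = world_model.encoder(obs)
        loss = world_model.dynamics.prediction_loss(z, act)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # The policy is then retrained purely on imagined latent trajectories.
```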
|
| |
| WeBT1 Award Session, Hall A2 |
Add to My Program |
| Award Finalists 4 |
|
| |
| Chair: Ajoudani, Arash | Istituto Italiano Di Tecnologia |
| Co-Chair: Kurniawati, Hanna | Australian National University |
| |
| 16:45-16:55, Paper WeBT1.1 | Add to My Program |
| Dynamics Modeling of a Multi-UAV Slung Load System Using a Discrete-Link Cable Approach |
|
| Merton, Harvey | Massachusetts Institute of Technology |
| Hunter, Ian | Massachusetts Institute of Technology |
Keywords: Multi-Robot Systems, Aerial Systems: Mechanics and Control, Simulation and Animation
Abstract: A common assumption to simplify the problem of controlling a multi-UAV slung load system (MUSLS) is that the flexible cables can be modeled as massless rigid rods. In this work, we propose an alternative Euler-Newton derived dynamical model which uses a series of rigid links to model the flexible cables. The model is specifically designed to allow efficient simulation using Featherstone's articulated body algorithm. We perform real-world validation of this model on gentle, aggressive, and tension-engagement maneuvers and run a parameter sweep to determine the number of links, joint damping, and joint friction to achieve the greatest model fidelity. The model closely matches real-world flight data with mean load translation errors below 132 mm (5.5% of the cable length) and orientation errors below 11.4 degrees. We make the real-world flight data publicly available for the development of future cable models.
|
| |
| 16:55-17:05, Paper WeBT1.2 | Add to My Program |
| A Distributed Gaussian Process Model for Multi-Robot Mapping |
|
| Nabarro, Seth | Imperial College London |
| van der Wilk, Mark | University of Oxford |
| Davison, Andrew J | Imperial College London |
Keywords: Multi-Robot Systems, Distributed Robot Systems, Mapping
Abstract: We propose DistGP: a multi-robot learning method for collaborative learning of a global function using only local experience and computation. We utilise a sparse Gaussian process (GP) model with a factorisation that mirrors the multi-robot structure of the task, and admits distributed training via Gaussian belief propagation (GBP). Our loopy model outperforms Tree-Structured GPs [1] and can be trained online and in settings with dynamic connectivity. We show that such distributed, asynchronous training can reach the same performance as a centralised, batch-trained model, albeit with slower convergence. Last, we compare to DiNNO [2], a distributed neural network (NN) optimiser, and find DistGP achieves superior accuracy, is more robust to sparse communication and is better able to learn continually.
|
| |
| 17:05-17:15, Paper WeBT1.3 | Add to My Program |
| Optimal Multi-Robot Planning for Simultaneous Area and Line Coverage |
|
| Zheng, Tianyuan | Rutgers University |
| Yu, Kaiyan | Binghamton University |
| Gao, Mingyang | University of Illinois at Urbana-Champaign |
| Yi, Jingang | Rutgers University |
Keywords: Motion and Path Planning, Path Planning for Multiple Mobile Robots or Agents, Robotics and Automation in Construction
Abstract: Robotic coverage tasks often require teams of robots to not only survey regions of interest but also trace and interact with linear features such as cracks, seams, or pipelines. We term this the double coverage problem, where robots must balance two competing roles: wide-area exploration for inspection and precise trajectory following for servicing linear structures. This paper develops an optimal multi-robot planning framework that unifies area coverage and line servicing. We formulate a topological analysis in manifold space and introduce the hierarchical cyclic merging regulation (HCMR) method, for which optimality under a fixed sweep direction is proven. The framework is experimentally validated for a multi-robot crack survey and filling application. Benchmark comparisons demonstrate that HCMR reduces planned path length by at least 10.0%, shortens task completion time by at least 16.9%, and ensures complete crack coverage with virtually conflict-free operation, outperforming state-of-the-art coverage planners. These results highlight the feasibility and efficiency of deploying topology-informed multi-robot planning for practical inspection and repair scenarios.
|
| |
| 17:15-17:25, Paper WeBT1.4 | Add to My Program |
| Planar-Sector LOS Guidance for Interception of Agile Targets with Lifting-Wing Quadcopters |
|
| Liu, Linkai | Beihang University |
| Yang, Kun | China Academy of Launch Vehicle Technology |
| Zou, Han | Beihang University |
| Min, Chen | Beihang University |
| Lv, Shuli | Beihang University |
| Wang, Shuai | Beihang University |
| Quan, Quan | Beihang University |
Keywords: Aerial Systems: Applications, Motion Control, Sensor-based Control
Abstract: This paper proposes a Planar-Sector Line-of-Sight (PS-LOS) guidance law and an accompanying control stack for lifting-wing quadcopters, enabling robust image-based interception of agile targets. The PS-LOS relaxes conventional conical constraints, preserving maneuverability while reducing aerodynamic penalties. For perception, we employ a delay-compensated Extended Kalman Filter (EKF) to provide low-latency, continuous target estimates. The controller is tailored to lifting-wing quadcopter dynamics and includes coordinated-turn compensation. Outdoor flight experiments demonstrate interceptions against unpredictable agile targets up to 138 m, validating the method's range and robustness.
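A sketch of measurement-delay compensation in a Kalman filter, written linear for brevity (the paper uses an EKF): buffer past estimates, fuse the delayed measurement at its true timestamp, then re-propagate to the present.

```python
import numpy as np

class DelayCompKF:
    """Toy linear KF with out-of-sequence measurement handling."""

    def __init__(self, F, H, Q, R, x0, P0, buf=50):
        self.F, self.H, self.Q, self.R = F, H, Q, R
        self.hist = [(x0, P0)]               # rolling buffer of estimates
        self.buf = buf

    def predict(self):
        x, P = self.hist[-1]
        x, P = self.F @ x, self.F @ P @ self.F.T + self.Q
        self.hist.append((x, P))
        self.hist = self.hist[-self.buf:]

    def update_delayed(self, z, delay_steps):
        """Fuse a measurement taken delay_steps prediction cycles ago."""
        k = max(len(self.hist) - 1 - delay_steps, 0)
        x, P = self.hist[k]
        S = self.H @ P @ self.H.T + self.R
        K = P @ self.H.T @ np.linalg.inv(S)
        x = x + K @ (z - self.H @ x)
        P = (np.eye(len(x)) - K @ self.H) @ P
        self.hist[k] = (x, P)
        for i in range(k + 1, len(self.hist)):   # re-propagate to now
            x, P = self.F @ x, self.F @ P @ self.F.T + self.Q
            self.hist[i] = (x, P)
```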
|
| |
| 17:25-17:35, Paper WeBT1.5 | Add to My Program |
| GuideTWSI: A Diverse Tactile Walking Surface Indicator Dataset from Synthetic and Real-World Images for Blind and Low-Vision Navigation |
|
| Hwang, Hochul | University of Massachusetts Amherst |
| Yang, Soowan | University of Massachusetts Amherst |
| Nguyen, Nhat Hong Anh | University of Massachusetts Amherst |
| Goel, Parth | University of Massachusetts Amherst |
| Adhikari, Krisha | University of Massachusetts Amherst |
| Lee, Sunghoon Ivan | UMass Amherst |
| Biswas, Joydeep | The University of Texas at Austin |
| Giudice, Nicholas | University of Maine |
| Kim, Donghyun | University of Massachusetts Amherst |
Keywords: Data Sets for Robotic Vision, Safety in HRI, Object Detection, Segmentation and Categorization
Abstract: Tactile Walking Surface Indicators (TWSIs) are safety-critical landmarks that blind and low-vision (BLV) pedestrians use to locate crossings and hazard zones. From our observation sessions with BLV guide dog handlers, trainers, and an O&M specialist, we confirmed the critical importance of reliable and accurate TWSI segmentation for navigation assistance of BLV individuals. Achieving such reliability requires large-scale annotated data. However, TWSIs are severely underrepresented in existing urban perception datasets, and even existing dedicated paving datasets are limited: they lack robot-relevant viewpoints (e.g., egocentric or top-down) and are geographically biased toward East Asian directional bars - raised parallel strips used for continuous guidance along sidewalks. This narrow focus overlooks truncated domes - rows of round bumps used primarily in North America and Europe as detectable warnings at curbs, crossings, and platform edges. As a result, models trained only on bar-centric data struggle to generalize to dome-based warnings, leading to missed detections and false stops in safety-critical environments. We introduce GuideTWSI, the largest and most diverse TWSI dataset, which combines a photorealistic synthetic dataset, carefully curated open-source tactile data, and quadruped real-world data collected and annotated by the authors. Notably, we developed an Unreal Engine–based synthetic data generation pipeline to obtain segmented, labeled data across diverse materials, lighting conditions, weather, and robot-relevant viewpoints. Extensive evaluations show that synthetic augmentation improves truncated dome segmentation across diverse state-of-the-art models, with gains of up to +29 mIoU points, and enhances cross-domain robustness. Moreover, real-robot experiments demonstrate accurate stopping at truncated domes, with high repeatability and stop success rates (96.15%). The GuideTWSI dataset, model weights, and code will be publicly released.
|
| |
| 17:35-17:45, Paper WeBT1.6 | Add to My Program |
| Sonar-MASt3R: Real-Time Opti-Acoustic Fusion in Turbid, Unstructured Environments |
|
| Phung, Amy | MIT-WHOI Joint Program |
| Camilli, Richard | Woods Hole Oceanographic Institution |
Keywords: Marine Robotics, Field Robots, Sensor Fusion
Abstract: Underwater intervention is an important capability in several marine domains, with numerous industrial, scientific, and defense applications. However, existing perception systems used during intervention operations rely on data from optical cameras, which limits capabilities in poor visibility or lighting conditions. Prior work has examined opti-acoustic fusion methods, which use sonar data to resolve the depth ambiguity of the camera data while using camera data to resolve the elevation angle ambiguity of the sonar data. However, existing methods cannot achieve dense 3D reconstructions in real-time, and few studies have reported results from applying these methods in a turbid environment. In this work, we propose the opti-acoustic fusion method Sonar-MASt3R, which uses MASt3R to extract dense correspondences from optical camera data in real-time and pairs it with geometric cues from an acoustic 3D reconstruction to ensure robustness in turbid conditions. Experimental results using data recorded from an “opti-acoustic eye-in-hand” configuration across turbidity values ranging from <0.5 to >12 NTU highlight this method’s improved robustness to turbidity relative to baseline methods.
|
| |
| 17:45-17:55, Paper WeBT1.7 | Add to My Program |
| HITTER: A HumanoId Table TEnnis Robot Via Hierarchical Planning and Learning |
|
| Su, Zhi | Tsinghua University |
| Zhang, Bike | University of California, Berkeley |
| Rahmanian, Nima Abraham | University of California, Berkeley |
| Gao, Yuman | Zhejiang University |
| Liao, Qiayuan | University of California, Berkeley |
| Regan, Caitlin | UC Berkeley |
| Sreenath, Koushil | University of California, Berkeley |
| Sastry, Shankar | University of California, Berkeley |
Keywords: Humanoid Robot Systems, Reinforcement Learning, Whole-Body Motion Planning and Control
Abstract: Humanoid robots have recently achieved impressive progress in locomotion and whole-body control, yet they remain constrained in tasks that demand rapid interaction with dynamic environments through manipulation. Table tennis exemplifies such a challenge: with ball speeds exceeding 5 m/s, players must perceive, predict, and act within sub-second reaction times, requiring both agility and precision. To address this, we present a hierarchical framework for humanoid table tennis that integrates a model-based planner for ball trajectory prediction and racket target planning with a reinforcement learning-based whole-body controller. The planner determines striking position, velocity and timing, while the controller generates coordinated arm and leg motions that mimic human strikes and maintain stability and agility across consecutive rallies. Moreover, to encourage natural movements, human motion references are incorporated during training. We validate our system on a general-purpose humanoid robot, achieving up to 106 consecutive shots with a human opponent and sustained exchanges against another humanoid. These results demonstrate real-world humanoid table tennis with sub-second reactive control, marking a step toward agile and interactive humanoid behaviors.
|
| |
| 17:55-18:05, Paper WeBT1.8 | Add to My Program |
| SymSkill: Symbol and Skill Co-Invention for Data-Efficient and Reactive Long-Horizon Manipulation |
|
| Shao, Yifei | University of Pennsylvania |
| Zheng, Yuchen | University of Pennsylvania |
| Sun, Sunan | University of Pennsylvania |
| Chaudhari, Pratik | University of Pennsylvania |
| Kumar, Vijay | University of Pennsylvania |
| Figueroa, Nadia | University of Pennsylvania |
Keywords: Task and Motion Planning, Learning from Demonstration, Imitation Learning
Abstract: Multi-step manipulation in dynamic environments remains challenging. Imitation learning (IL) is reactive but lacks compositional generalization, since monolithic policies do not decide which skill to reuse when scenes change. Classical task-and-motion planning (TAMP) offers compositionality, but its high planning latency prevents real-time failure recovery. We introduce SymSkill, a unified framework that jointly learns predicates, operators, and skills from unlabeled, unsegmented demonstrations, combining compositional generalization with real-time recovery. Offline, SymSkill learns symbolic abstractions and goal-oriented skills directly from demonstrations. Online, given a conjunction of learned predicates, it uses a symbolic planner to compose and reorder skills to achieve symbolic goals while recovering from failures at both the motion and symbolic levels in real time. Coupled with a compliant controller, SymSkill supports safe execution under human and environmental disturbances. In RoboCasa simulation, SymSkill executes 12 single-step tasks with 85% success and composes them into multi-step plans without additional data. On a real Franka robot, it learns from 5 minutes of play data and performs 12-step tasks from goal specifications. Code and additional analysis are available at https://sites.google.com/view/symskill.
|
| |
| 18:05-18:15, Paper WeBT1.9 | Add to My Program |
| ActivePusher: Active Learning and Planning with Residual Physics for Nonprehensile Manipulation |
|
| Zhong, Zhuoyun | Worcester Polytechnic Institute |
| Golestaneh, Seyedali | Worcester Polytechnic Institute |
| Chamzas, Constantinos | Worcester Polytechnic Institute |
Keywords: Motion and Path Planning, Integrated Planning and Learning, Planning under Uncertainty
Abstract: Planning with learned dynamics models offers a promising approach toward versatile real-world manipulation, particularly in nonprehensile settings such as pushing or rolling, where accurate analytical models are difficult to obtain. However, collecting training data for learning-based methods can be costly and inefficient, as it often relies on randomly sampled interactions that are not necessarily the most informative. Furthermore, learned models tend to exhibit high uncertainty in underexplored regions of the skill space, undermining the reliability of long-horizon planning. To address these challenges, we propose ActivePusher, a novel framework that combines residual-physics modeling with kernel-based active learning, to focus data acquisition on the most informative skill parameters. Additionally, ActivePusher seamlessly integrates with model-based kinodynamic planners, leveraging uncertainty estimates to bias control sampling toward more reliable actions. We evaluate our approach in both simulation and real-world environments, and demonstrate that it consistently improves data efficiency and achieves higher planning success rates in comparison to baseline methods.
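A toy sketch of the two ingredients, residual-physics prediction and kernel-based active selection, using a small RBF-kernel GP over skill parameters; the kernel choice and hyperparameters are assumptions.

```python
import numpy as np

def rbf(A, B, ell=0.3):
    """RBF kernel between parameter sets A: (n, d) and B: (m, d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def gp_var(X_train, X_cand, noise=1e-3):
    """GP posterior variance at candidates (unit prior variance)."""
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = rbf(X_cand, X_train)
    return 1.0 - np.einsum('ij,jk,ik->i', Ks, np.linalg.inv(K), Ks)

def next_skill_params(X_train, X_cand):
    """Active learning: query where the residual model is most uncertain."""
    return X_cand[np.argmax(gp_var(X_train, X_cand))]

# Residual physics: predicted_outcome(theta) = analytic_model(theta)
#                                              + gp_residual_mean(theta)
```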
|
| |
| WeBT2 Regular Session, Hall A3 |
Add to My Program |
| Robot Learning I |
|
| |
| |
| 16:45-16:55, Paper WeBT2.1 | Add to My Program |
| Agile in the Face of Delay: Asynchronous End-To-End Learning for Real-World Aerial Navigation |
|
| Li, Yude | Harbin Institute of Technology, Shenzhen |
| Zhou, Zhexuan | Harbin Institute of Technology, Shenzhen |
| Li, Huizhe | Harbin Institute of Technology, Shenzhen |
| Gong, Youmin | Harbin Institute of Technology, Shenzhen |
| Mei, Jie | Harbin Institute of Technology, Shenzhen |
Keywords: Aerial Systems: Perception and Autonomy, Reinforcement Learning, Motion and Path Planning
Abstract: Robust autonomous navigation for Autonomous Aerial Vehicles (AAVs) in complex environments is a critical capability. However, modern end-to-end navigation faces a key challenge: the high-frequency control loop needed for agile flight conflicts with low-frequency perception streams, which are limited by sensor update rates and significant computational cost. This mismatch forces conventional synchronous models into undesirably low control rates. To resolve this, we propose an asynchronous reinforcement learning framework that decouples perception and control, enabling a high-frequency policy to act on the latest IMU state for immediate reactivity, while incorporating perception features asynchronously. To manage the resulting data staleness, we introduce a theoretically-grounded Temporal Encoding Module (TEM) that explicitly conditions the policy on perception delays, a strategy complemented by a two-stage curriculum to ensure stable and efficient training. Validated in extensive simulations, our method was successfully deployed in zero-shot sim-to-real transfer on an onboard NUC, where it sustains a 100 Hz control rate and demonstrates robust, agile navigation in cluttered real-world environments. Our source code will be released for community reference.
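One plausible form of delay conditioning, sketched with sinusoidal features of the perception staleness concatenated to the policy input; the actual TEM design is not specified here.

```python
import torch

def delay_encoding(delay_s: torch.Tensor, n_freq: int = 8) -> torch.Tensor:
    """delay_s: (B, 1) perception delays in seconds -> (B, 2*n_freq) features."""
    freqs = 2.0 ** torch.arange(n_freq, dtype=delay_s.dtype)
    ang = delay_s * freqs                      # (B, n_freq) by broadcasting
    return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)

# policy_input = torch.cat([imu_state, stale_percept_feat,
#                           delay_encoding(delay)], dim=-1)
```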
|
| |
| 16:55-17:05, Paper WeBT2.2 | Add to My Program |
| Human2Nav: Learning Crowd Navigation from Human Videos across Robots Via Feasibility-Guided Flow Matching |
|
| Zhang, Shenghong | Shanghai Jiao Tong University |
| Chen, JunJie | Shanghai Jiao Tong University |
| Yan, Sichi | Shanghai Jiao Tong University |
| Ban, Yutong | Shanghai Jiao Tong University |
| Li, Xiao | Shanghai Jiaotong University |
Keywords: Imitation Learning, Motion and Path Planning, Transfer Learning
Abstract: Enabling robots to navigate safely and efficiently in dynamic, crowded environments requires learning from large-scale demonstrations, which are costly and unsafe to collect on physical platforms. While human videos offer a rich and scalable alternative, transferring these motion patterns to robots is challenged by the embodiment gap across observation and action spaces. This paper presents Human2Nav, a data-efficient framework that learns navigation policies directly from human videos via test-time feasibility-guided flow matching. Human2Nav employs a bird's-eye-view representation to align visual observations and trains a conditional flow matching model to capture nuanced human navigation patterns. Crucially, we introduce a training-free feasibility guidance mechanism that during inference steers generated trajectories to satisfy heterogeneous robot-specific kinematic and dynamic constraints without retraining. Extensive experiments in simulation and on real-world heterogeneous robotic platforms demonstrate that Human2Nav achieves superior data efficiency and navigation performance compared to model-based and learning-based baselines, while ensuring safe and executable trajectories across diverse crowd scenarios.
|
| |
| 17:05-17:15, Paper WeBT2.3 | Add to My Program |
| Hybrid Diffusion Policies with Projective Geometric Algebra for Efficient Robot Manipulation Learning |
|
| Sun, Xiatao | Yale University |
| Wang, Yuxuan | University of Pennsylvania |
| Yang, Shuo | University of Pennsylvania |
| Chen, Yinxing | Yale University |
| Rakita, Daniel | Yale University |
Keywords: Imitation Learning, Deep Learning in Grasping and Manipulation, Learning from Demonstration
Abstract: Diffusion policies are a powerful paradigm for robot learning, but their training is often inefficient. A key reason is that networks must relearn fundamental spatial concepts, such as translations and rotations, from scratch for every new task. To alleviate this redundancy, we propose embedding geometric inductive biases directly into the network architecture using Projective Geometric Algebra (PGA). PGA provides a unified algebraic framework for representing geometric primitives and transformations, allowing neural networks to reason about spatial structure more effectively. In this paper, we introduce hPGA-DP, a novel hybrid diffusion policy that capitalizes on these benefits. Our architecture leverages the Projective Geometric Algebra Transformer (P-GATr) as a state encoder and action decoder, while employing established U-Net or Transformer-based modules for the core denoising process. Through extensive experiments and ablation studies in both simulated and real-world environments, we demonstrate that hPGA-DP significantly improves task performance and training efficiency. Notably, our hybrid approach achieves substantially faster convergence compared to both standard diffusion policies and architectures that rely solely on P-GATr.
|
| |
| 17:15-17:25, Paper WeBT2.4 | Add to My Program |
| Shifted Flow Policy: Uncertainty-Aware Time Reparameterization for Visuomotor Learning |
|
| Ahn, Dasom | Keimyung University |
| Jung, Chanhyuk | Keimyung University |
| Baek, Joonki | Keimyung University |
| Yoo, Sungkeun | Keimyung University |
| Ko, Byoung Chul | Keimyung University |
Keywords: Imitation Learning, Deep Learning Methods, Motion and Path Planning
Abstract: Imitation learning for robotics often uses action chunking to mitigate the compounding errors associated with autoregressive policies. By predicting multiple future actions simultaneously, action chunking limits the accumulation of errors but introduces new difficulties. In particular, it relies on outdated observations to predict future actions, which can lead to inaccuracies. In this study, we propose Shifted Flow Policy (SFP), a simple yet effective alternative to action chunking. The SFP reparameterizes time by linearly shifting the time steps for future actions, thereby capturing the natural increase in uncertainty over time. This formulation allows each predicted action to be conditioned on up-to-date observations. Experimental results on the Push-T and MimicGen benchmarks demonstrate that SFP outperforms state-of-the-art action chunking methods across a variety of manipulation tasks by achieving higher success rates and faster inference. These findings suggest that shifted flow provides a robust and practical alternative to action chunking in visuomotor policy learning. Our code is available at https://shifted-flow-policy.github.io
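One plausible reading of the linear time shift, sketched for flow-matching training over an action chunk; the shift schedule, clamping, and model interface are assumptions, not the authors' exact formulation.

```python
import torch

def shifted_times(t: torch.Tensor, H: int, delta: float = 0.05):
    """t: (B,) base flow times -> (B, H) linearly shifted per-step times,
    so later actions in the chunk carry more residual uncertainty."""
    shift = delta * torch.arange(H, dtype=t.dtype)
    return (t[:, None] - shift).clamp(0.0, 1.0)

def fm_loss(model, a0, a1, obs):
    """a0: (B, H, d) noise sample; a1: (B, H, d) expert action chunk."""
    t = torch.rand(a0.shape[0])
    tk = shifted_times(t, a0.shape[1])                   # (B, H)
    at = (1 - tk[..., None]) * a0 + tk[..., None] * a1   # linear interpolant
    v = model(at, tk, obs)                               # predicted velocity
    return ((v - (a1 - a0)) ** 2).mean()
```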
|
| |
| 17:25-17:35, Paper WeBT2.5 | Add to My Program |
| Closed-Loop Action Chunks with Dynamic Corrections for Training-Free Diffusion Policy |
|
| Wu, Pengyuan | Zhejiang University, Shanghai AI Laboratory |
| Zhang, Pingrui | Fudan University, Shanghai AI Laboratory |
| Wang, Zhigang | Shanghai AI Laboratory |
| Wang, Dong | Shanghai AI Laboratory |
| Zhao, Bin | Northwestern Polytechnical University, Shanghai AI Laboratory |
| Li, Xuelong | TeleAI, China Telecom Corp Ltd |
Keywords: Manipulation Planning
Abstract: Diffusion-based policies have achieved remarkable results in robotic manipulation but often struggle to adapt rapidly in dynamic scenarios, leading to delayed responses or task failures. We present DCDP, a Dynamic Closed-Loop Diffusion Policy framework that integrates chunk-based action generation with real-time correction. DCDP combines a self-supervised dynamic feature encoder, cross-attention fusion, and an asymmetric action encoder-decoder to inject environmental dynamics before action execution, achieving real-time closed-loop action correction and enhancing the system's adaptability in dynamic scenarios. In dynamic PushT simulations, DCDP improves adaptability by 19% without retraining while requiring only 5% additional computation. Its modular design enables plug-and-play integration, achieving both temporal coherence and real-time responsiveness in dynamic robotic scenarios, including real-world manipulation tasks. The project page is at: https://github.com/wupengyuan/dcdp
|
| |
| 17:35-17:45, Paper WeBT2.6 | Add to My Program |
| EMMA: Scaling Mobile Manipulation Via Egocentric Human Data |
|
| Zhu, Lawrence Y. | Georgia Institute of Technology |
| Kuppili, Pranav | Georgia Institute of Technology |
| Punamiya, Ryan | Georgia Institute of Technology |
| Aphiwetsa, Patcharapong | Georgia Institute of Technology |
| Patel, Dhruv | Georgia Institute of Technology |
| Kareer, Simar | Georgia Tech |
| Ha, Sehoon | Georgia Institute of Technology |
| Xu, Danfei | Georgia Institute of Technology |
Keywords: Mobile Manipulation, Imitation Learning, Learning from Demonstration
Abstract: Scaling mobile manipulation imitation learning is bottlenecked by expensive mobile robot teleoperation. We present Egocentric Mobile MAnipulation (EMMA), an end-to-end framework that trains mobile manipulation policies from human mobile manipulation data combined with static robot data, sidestepping mobile teleoperation. To accomplish this, we co-train on human full-body motion data and static robot data. In our experiments across three real-world tasks, EMMA demonstrates performance comparable to baselines trained on teleoperated mobile robot data (Mobile ALOHA), achieving higher or equivalent full-task success. We find that EMMA is able to generalize to new spatial configurations and scenes, and we observe positive performance scaling as we increase the hours of human data, opening new avenues for scalable robotic learning in real-world environments.
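A minimal co-training sketch, assuming a fixed mixing ratio between the two data sources; the ratio, batch size, and dataset interfaces are illustrative and not taken from the paper.

```python
# Hypothetical batch mixer: each batch blends human motion samples with
# static-robot teleoperation samples for a shared policy update.
import random

def cotrain_batches(human_data, robot_data, batch_size=32, human_frac=0.5):
    n_h = int(batch_size * human_frac)
    while True:
        batch = random.sample(human_data, n_h) + \
                random.sample(robot_data, batch_size - n_h)
        random.shuffle(batch)
        yield batch

human = [("human", i) for i in range(1000)]
robot = [("robot", i) for i in range(1000)]
first_batch = next(cotrain_batches(human, robot))
```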
|
| |
| 17:45-17:55, Paper WeBT2.7 | Add to My Program |
| Better Than Diverse Demonstrators: Reward Decomposition from Suboptimal and Heterogeneous Demonstrations |
|
| Xue, Chunyue | Brown University |
| Chen, Letian | Waymo |
| Gombolay, Matthew | Georgia Institute of Technology |
Keywords: Reinforcement Learning, Learning from Demonstration
Abstract: Inverse Reinforcement Learning (IRL) typically involves inferring a reward function from expert demonstrations to enable agents to imitate the demonstrated behavior. However, real-world settings often provide suboptimal and heterogeneous demonstrations, where human demonstrators use diverse strategies and imperfect actions. Yet, we are unaware of any prior work that simultaneously addresses the challenges of IRL when demonstrations are both heterogeneous and suboptimal. In this work, we propose a novel approach, REPRESENT (Reward dEcomPosition fRom hEterogeneous Suboptimal dEmoNstraTion), that disentangles the latent intrinsic task reward and the strategy-specific reward from suboptimal and diverse strategies. Our method learns to identify a shared task reward component that generalizes across varying demonstrator preferences while also modeling distinct strategy-specific rewards. By decomposing the common task reward across varied demonstrations, REPRESENT extracts the core objectives shared by all strategies, enabling the agent to perform better than the demonstrators while preserving individual strategy preferences. We validate our approach on three robotic domains, showing a higher correlation with the true task reward and improved policy performance compared to baselines. These results suggest that REPRESENT can effectively handle suboptimality and heterogeneity, providing a solution for real-world LfD applications to learn better from demonstrations varied in quality and strategy.
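A hedged sketch of the decomposition idea: a shared task-reward head plus per-demonstrator strategy heads, so the inferred reward for demonstrator k is r_task(s, a) + r_k(s, a). The head shapes and the additive combination rule are assumptions for illustration, not the paper's exact parameterization.

```python
# Illustrative decomposed reward model: shared task head + strategy heads.
import torch
import torch.nn as nn

class DecomposedReward(nn.Module):
    def __init__(self, sa_dim, n_strategies, hidden=64):
        super().__init__()
        self.task_head = nn.Sequential(nn.Linear(sa_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, 1))
        self.strategy_heads = nn.ModuleList(
            nn.Sequential(nn.Linear(sa_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, 1))
            for _ in range(n_strategies))

    def forward(self, sa, k):
        # The shared head captures objectives common to all demonstrators;
        # the k-th head absorbs demonstrator-specific preferences.
        return self.task_head(sa) + self.strategy_heads[k](sa)

reward = DecomposedReward(sa_dim=16, n_strategies=3)
r = reward(torch.randn(8, 16), k=1)
```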
|
| |
| 17:55-18:05, Paper WeBT2.8 | Add to My Program |
| GRAM: Generalization in Deep RL with a Robust Adaptation Module |
|
| Queeney, James | Amazon Robotics |
| Cai, Xiaoyi | Massachusetts Institute of Technology |
| Schperberg, Alexander | Mitsubishi Electric Research Laboratories |
| Corcodel, Radu | Mitsubishi Electric Research Laboratories |
| Benosman, Mouhacine | Amazon Robotics |
| How, Jonathan | Massachusetts Institute of Technology |
Keywords: Reinforcement Learning, Machine Learning for Robot Control
Abstract: The reliable deployment of deep reinforcement learning in real-world settings requires the ability to generalize across a variety of conditions, including both in-distribution scenarios seen during training as well as novel out-of-distribution scenarios. In this work, we present a framework for dynamics generalization in deep reinforcement learning that unifies these two distinct types of generalization within a single architecture. We introduce a robust adaptation module that provides a mechanism for identifying and reacting to both in-distribution and out-of-distribution environment dynamics, along with a joint training pipeline that combines the goals of in-distribution adaptation and out-of-distribution robustness. Our algorithm GRAM achieves strong generalization performance across in-distribution and out-of-distribution scenarios upon deployment, which we demonstrate through extensive simulation and hardware locomotion experiments on a quadruped robot.
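An illustrative sketch, with assumed module names and gating rule, of an adaptation module of the kind described above: a latent context is inferred from recent history for in-distribution dynamics, and the output falls back toward a robust default when the history looks unfamiliar, here gated by a reconstruction-style novelty score.

```python
# Hypothetical robust adaptation module: trust the inferred context when
# the history is familiar, blend toward a learned robust context otherwise.
import torch
import torch.nn as nn

class RobustAdaptationModule(nn.Module):
    def __init__(self, hist_dim, ctx_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(hist_dim, 64), nn.ReLU(),
                                     nn.Linear(64, ctx_dim))
        self.decoder = nn.Sequential(nn.Linear(ctx_dim, 64), nn.ReLU(),
                                     nn.Linear(64, hist_dim))
        self.robust_ctx = nn.Parameter(torch.zeros(ctx_dim))  # OOD fallback

    def forward(self, hist, novelty_scale=1.0):
        z = self.encoder(hist)
        novelty = (self.decoder(z) - hist).pow(2).mean(-1, keepdim=True)
        w = torch.sigmoid(-novelty_scale * novelty + 1.0)  # w -> 1: trust z
        return w * z + (1 - w) * self.robust_ctx

mod = RobustAdaptationModule(hist_dim=24, ctx_dim=8)
ctx = mod(torch.randn(4, 24))
```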
|
| |
| 18:05-18:15, Paper WeBT2.9 | Add to My Program |
| VGC-RIO: A Tightly Integrated Radar-Inertial Odometry with Spatial Weighted Doppler Velocity and Local Geometric Constrained RCS Histograms |
|
| Xiang, Jian Guang | National University of Defense Technology |
| He, Xiaofeng | National University of Defense Technology |
| Chen, Zizhuo | National University of Defense Technology |
| Zhang, Lilian | National University of Defense Technology |
| Luo, Xincan | College of Intelligence Science and Technology, National University of Defense Technology |
| Mao, Jun | National University of Defense Technology |
|
|
| |
| WeBT3 Regular Session, Lehar 1-4 |
Add to My Program |
| Humanoid Robots |
|
| |
| |
| 16:45-16:55, Paper WeBT3.1 | Add to My Program |
| TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System |
|
| Ze, Yanjie | Stanford University |
| Zhao, Siheng | University of Southern California |
| Wang, Weizhuo | Stanford University |
| Kanazawa, Angjoo | UC Berkeley |
| Duan, Yan | Amazon |
| Abbeel, Pieter | UC Berkeley |
| Shi, Guanya | Carnegie Mellon University |
| Wu, Jiajun | Stanford University |
| Liu, Karen | Stanford University |
Keywords: Humanoid Robot Systems, Imitation Learning, Big Data in Robotics and Automation
Abstract: Large-scale data has driven breakthroughs in robotics, from language models to vision-language-action models in bimanual manipulation. However, humanoid robotics lacks equally effective data collection frameworks. Existing humanoid teleoperation systems either use decoupled control or depend on expensive motion capture setups. We introduce TWIST2, a portable, mocap-free humanoid teleoperation and data collection system that preserves full whole-body control while advancing scalability. Our system leverages PICO4U VR for obtaining real-time whole-body human motions, with a custom 2-DoF robot neck (costing around $250) for egocentric vision, enabling holistic human-to-humanoid control. We demonstrate long-horizon dexterous and mobile humanoid skills, collecting 100 demonstrations in 15 minutes with a nearly 100% success rate. Building on this pipeline, we propose a hierarchical visuomotor policy framework that autonomously controls the full humanoid body based on egocentric vision. Our visuomotor policy successfully demonstrates whole-body dexterous manipulation and dynamic kicking tasks. The entire system is fully reproducible and open-sourced at https://yanjieze.com/TWIST2/. Our collected dataset is also open-sourced at https://twist-data.github.io/.
|
| |
| 16:55-17:05, Paper WeBT3.2 | Add to My Program |
| Mixture-Of-Experts Policy for Smooth and Stable Multi-Posture Fall Recovery in Bipedal Robot |
|
| Rong, Haomin | Sun Yat-Sen University |
| Chen, Yuying | Sun Yat-Sen University |
| Xu, Zhiyong | Sun Yat-Sen University |
| Xie, Lijie | Sun Yat-Sen University |
| Yan, Qingyu | Orbot Co., Ltd |
| Cheng, Hui | Sun Yat-Sen University |
Keywords: Failure Detection and Recovery, Reinforcement Learning, Legged Robots
Abstract: Bipedal robots are inherently prone to falling due to their higher center of mass and narrower support polygon, making automatic fall recovery a long-standing challenge. Existing approaches often rely on posture-specific strategies or exhibit limited robustness and generalization, restricting their real-world applicability. We present a unified Mixture-of-Experts (MoE) framework that trains a single policy capable of recovering from diverse fallen configurations. By leveraging base height estimation and proprioceptive history within a gating mechanism, the framework dynamically allocates recovery tasks to specialized experts, yielding smooth and stable motions. Extensive real-world experiments show that the policy transfers zero-shot to hardware and consistently achieves recovery not only under repeated disturbances, but also from highly challenging postures and even on inclined slopes—demonstrating robustness and generalization beyond prior methods.
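A minimal mixture-of-experts sketch consistent with the description above; the dimensions, expert count, and the soft (rather than hard) gating are assumptions. A gating network conditioned on the estimated base height and proprioceptive history weights specialized expert action heads.

```python
# Illustrative MoE recovery policy: gate over specialized expert heads.
import torch
import torch.nn as nn

class MoERecoveryPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, n_experts=4):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                  nn.Linear(64, n_experts))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                          nn.Linear(128, act_dim))
            for _ in range(n_experts))

    def forward(self, obs):
        # obs is assumed to stack the base-height estimate + proprio history.
        w = torch.softmax(self.gate(obs), dim=-1)              # (B, E)
        acts = torch.stack([e(obs) for e in self.experts], 1)  # (B, E, A)
        return (w.unsqueeze(-1) * acts).sum(dim=1)             # blended action

policy = MoERecoveryPolicy(obs_dim=48, act_dim=12)
action = policy(torch.randn(2, 48))
```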
|
| |
| 17:05-17:15, Paper WeBT3.3 | Add to My Program |
| MimicDroid: In-Context Learning for Humanoid Robot Manipulation from Human Play Videos |
|
| Shah, Rutav | The University of Texas at Austin |
| Liu, Shuijing | The University of Texas at Austin |
| Wang, Qi | The University of Texas at Austin |
| Jiang, Zhenyu | The University of Texas at Austin |
| Kumar, Sateesh | University of Texas at Austin |
| Seo, Mingyo | The University of Texas at Austin |
| Martín-Martín, Roberto | University of Texas at Austin |
| Zhu, Yuke | The University of Texas at Austin |
Keywords: Learning from Demonstration, Imitation Learning, Deep Learning Methods
Abstract: We aim to enable humanoid robots to efficiently solve new manipulation tasks from a few video examples. In-context learning (ICL) is a promising framework for achieving this goal due to its test-time data efficiency and rapid adaptability. However, current ICL methods rely on labor-intensive teleoperated data for training, which restricts scalability. We propose using human play videos, continuous unlabeled videos of people interacting freely with their environment, as a scalable and diverse training data source. We introduce MimicDroid, which enables humanoids to perform ICL using human play videos as the only training data. MimicDroid extracts trajectory pairs with similar manipulation behaviors and trains the policy to predict the actions of one trajectory conditioned on the other. Through this process, the model acquires ICL capabilities for adapting to novel objects and environments at test time. To bridge the embodiment gap, MimicDroid first retargets human wrist poses estimated from RGB videos to the humanoid, leveraging kinematic similarity. It also applies random patch masking during training to reduce overfitting to human-specific cues and improve robustness to visual differences. To evaluate few-shot learning for humanoids, we introduce an open-source simulation benchmark with increasing levels of generalization difficulty. MimicDroid outperforms state-of-the-art methods and achieves nearly a twofold higher success rate in the real world. Additional materials can be found at: ut-austin-rpl.github.io/MimicDroid
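A small sketch of the random patch masking mentioned above; the patch size and masking rate are illustrative. Zeroing out random image patches during training discourages the policy from latching onto human-specific pixels.

```python
# Illustrative random patch masking augmentation for visual robustness.
import torch

def random_patch_mask(img, patch=16, drop_prob=0.3):
    """img: (B, C, H, W) with H and W divisible by `patch`."""
    b, _, h, w = img.shape
    keep = (torch.rand(b, 1, h // patch, w // patch) > drop_prob).float()
    mask = keep.repeat_interleave(patch, 2).repeat_interleave(patch, 3)
    return img * mask  # dropped patches become zeros

masked = random_patch_mask(torch.randn(2, 3, 224, 224))
```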
|
| |
| 17:15-17:25, Paper WeBT3.4 | Add to My Program |
| TOLEBI: Learning Fault-Tolerant Bipedal Locomotion Via Online Status Estimation and Fallibility Rewards |
|
| Lee, Hokyun | Seoul National University |
| Baek, Woo-Jeong | Seoul National University, Karlsruhe Institute of Technology |
| Cha, Junhyeok | Seoul National University |
| Park, Jaeheung | Seoul National University |
Keywords: Humanoid and Bipedal Locomotion, Legged Robots, Reinforcement Learning
Abstract: With the growing employment of learning algorithms in robotic applications, research on reinforcement learning for bipedal locomotion has become a central topic in humanoid robotics. While recently published contributions achieve high success rates in locomotion tasks, scarce attention has been devoted to methods that can handle hardware faults occurring during the locomotion process. However, in real-world settings, environmental disturbances or sudden hardware faults might yield severe consequences. To address these issues, this paper presents TOLEBI, a fault-tolerant control framework for bipedal locomotion that handles faults and external disturbances on the robot during operation. Specifically, joint locking, power loss, and external disturbances are injected in simulation to learn fault-tolerant locomotion strategies. In addition to transferring the learned policy to the real robot via sim-to-real transfer, an online joint status estimator is incorporated. This module classifies joint conditions by referring to the actual observations at runtime under real-world conditions. Validation experiments conducted both in the real world and in simulation with the humanoid robot TOCABI highlight the applicability of the proposed approach. To our knowledge, this manuscript provides the first learning-based fault-tolerant framework for bipedal locomotion, thereby fostering the development of efficient learning methods in this field.
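An illustrative fault-injection helper of the kind the training setup above implies; the fault types and the action convention are assumptions. Locked joints hold their last commanded position, while powerless joints output zero torque.

```python
# Hypothetical simulation-side fault injection for fault-tolerant training.
import numpy as np

def inject_faults(action, last_action, locked, powerless):
    """action, last_action: (n_joints,); locked/powerless: boolean masks."""
    faulty = np.where(locked, last_action, action)  # joint locking
    faulty = np.where(powerless, 0.0, faulty)       # power loss
    return faulty

a = inject_faults(np.ones(12), np.full(12, 0.5),
                  locked=np.arange(12) == 3, powerless=np.arange(12) == 7)
```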
|
| |
| 17:25-17:35, Paper WeBT3.5 | Add to My Program |
| Towards Adaptable Humanoid Control Via Adaptive Motion Tracking |
|
| Huang, Tao | The Chinese University of Hong Kong |
| Wang, Huayi | Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory |
| Ren, Junli | Hong Kong University |
| Yin, Kangning | Shanghai Jiao Tong University |
| Wong, Ziseoi | Zhejiang University |
| Chen, Xiao | The Chinese University of Hong Kong |
| Jia, Feiyu | University of Science and Technology of China |
| Zhang, Wentao | University of Tokyo |
| Long, Junfeng | University of California at Berkeley |
| Wang, Jingbo | Shanghai AI Lab |
| Pang, Jiangmiao | Shanghai AI Laboratory |
|
|
| |
| 17:35-17:45, Paper WeBT3.6 | Add to My Program |
| DreamControl: Human-Inspired Whole-Body Humanoid Control for Scene Interaction Via Guided Diffusion |
|
| Kalaria, Dvij | University of California, Berkeley |
| Harithas, Sudarshan S | Brown University |
| Katara, Pushkal | General Robotics |
| Kwak, Sangkyung | General Robotics |
| Bhagat, Sarthak | General Robotics |
| Sastry, Shankar | University of California, Berkeley |
| Sridhar, Srinath | Brown University |
| Vemprala, Sai | General Robotics |
| Kapoor, Ashish | General Robotics |
| Huang, Jonathan Chung-Kuan | General Robotics |
Keywords: Whole-Body Motion Planning and Control, Humanoid Robot Systems, Reinforcement Learning
Abstract: We introduce DreamControl, a novel methodology for learning autonomous whole-body humanoid skills. DreamControl leverages the strengths of diffusion models and Reinforcement Learning (RL): our core innovation is the use of a diffusion prior trained on human motion data, which subsequently guides an RL policy in simulation to complete specific tasks of interest (e.g., opening a drawer or picking up an object). We demonstrate that this human motion-informed prior allows RL to discover solutions unattainable by direct RL, and that diffusion models inherently promote natural-looking motions, aiding in sim-to-real transfer. We validate DreamControl’s effectiveness on a Unitree G1 robot across a diverse set of challenging tasks involving simultaneous lower and upper body control and object interaction. Project website: genrobo.github.io/DreamControl/. Appendix: genrobo.github.io/DreamControl/Appendix.pdf
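A hedged sketch of guiding RL with a motion prior, in the spirit of the description above; the additive combination rule, gains, and exponential shaping are assumptions, not the authors' formulation. The policy's reward mixes task progress with closeness to the prior's suggested motion.

```python
# Illustrative prior-guided reward: task term + closeness to a motion prior.
import numpy as np

def guided_reward(task_r, q, q_prior, w_prior=0.5, k=2.0):
    """q: current joint pose; q_prior: pose suggested by the diffusion prior."""
    prior_r = np.exp(-k * np.linalg.norm(q - q_prior) ** 2)
    return task_r + w_prior * prior_r

r = guided_reward(1.0, np.zeros(29), 0.1 * np.ones(29))
```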
|
| |
| 17:45-17:55, Paper WeBT3.7 | Add to My Program |
| Learn to Teach: Sample-Efficient Privileged Learning for Humanoid Locomotion Over Real-World Uneven Terrain |
|
| Wu, Feiyang | Georgia Institute of Technology |
| Nal, Xavier | EPFL |
| Jang, Jaehwi | Georgia Institute of Technology |
| Zhu, Wei | Tohoku University |
| Gu, Zhaoyuan | Georgia Institute of Technology |
| Wu, Anqi | Georgia Institute of Technology |
| Zhao, Ye | Georgia Institute of Technology |
Keywords: Humanoid and Bipedal Locomotion, Reinforcement Learning, Legged Robots
Abstract: Humanoid robots promise transformative capabilities for industrial and service applications. While recent advances in Reinforcement Learning (RL) yield impressive results in locomotion, manipulation, and navigation, the proposed methods typically require enormous simulation samples to account for real-world variability. This work proposes a novel one-stage training framework—Learn to Teach (L2T)—which unifies teacher and student policy learning. Our approach recycles simulator samples and synchronizes the learning trajectories through shared dynamics, significantly reducing sample complexity and training time while achieving state-of-the-art performance. Furthermore, we validate the RL variant (L2T-RL) through extensive simulations and hardware tests on the Digit robot, demonstrating zero-shot sim-to-real transfer and robust performance over 12+ diverse terrains without depth estimation modules. Experimental videos are available online at https://lidar-learn-to-teach.github.io.
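A hedged sketch of a one-stage teacher-student update on shared rollouts: the teacher (privileged observations) takes an RL step while the student (deployable observations) imitates the teacher's actions on the same samples, in one joint update. The loss weighting, batch layout, and interfaces are assumptions for illustration.

```python
# Illustrative unified teacher-student update on shared simulator samples.
import torch
import torch.nn as nn

def l2t_step(teacher, student, batch, rl_loss_fn, beta=1.0):
    priv_obs, obs = batch["priv_obs"], batch["obs"]
    teacher_loss = rl_loss_fn(teacher, batch)         # e.g. a PPO surrogate
    with torch.no_grad():
        target_act = teacher(priv_obs)                # teacher supervises
    student_loss = nn.functional.mse_loss(student(obs), target_act)
    return teacher_loss + beta * student_loss         # one joint objective

teacher, student = nn.Linear(10, 3), nn.Linear(8, 3)
batch = {"priv_obs": torch.randn(4, 10), "obs": torch.randn(4, 8)}
loss = l2t_step(teacher, student, batch,
                rl_loss_fn=lambda pol, b: pol(b["priv_obs"]).pow(2).mean())
```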
|
| |
| 17:55-18:05, Paper WeBT3.8 | Add to My Program |
| Learning Humanoid Arm Motion Via Centroidal Momentum Regularized Multi-Agent Reinforcement Learning |
|
| Lee, Ho Jae | Massachusetts Institute of Technology |
| Jeon, Se Hwan | Massachusetts Institute of Technology |
| Kim, Sangbae | Massachusetts Institute of Technology |
Keywords: Humanoid and Bipedal Locomotion, Reinforcement Learning, Whole-Body Motion Planning and Control
Abstract: Humans naturally swing their arms during locomotion to regulate whole-body dynamics, reduce angular momentum, and help maintain balance. Inspired by this principle, we present a limb-level multi-agent reinforcement learning (RL) framework that enables coordinated whole-body control of humanoid robots through emergent arm motion. Our approach employs separate actor-critic structures for the arms and legs, trained with centralized critics but decentralized actors that share only base states and centroidal angular momentum (CAM) observations, allowing each agent to specialize in task-relevant behaviors through modular reward design. The arm agent, guided by CAM tracking and damping rewards, promotes arm motions that reduce overall angular momentum and vertical ground reaction moments, contributing to improved balance during locomotion or under external perturbations. Comparative studies with single-agent and alternative multi-agent baselines further validate the effectiveness of our approach. Finally, we deploy the learned policy on the MIT Humanoid, achieving robust performance across diverse locomotion tasks, including flat-ground walking, rough terrain traversal, and stair climbing.
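Illustrative reward terms matching the description above; the gains and the exponential shaping are assumptions. The arm agent is rewarded for tracking a CAM reference and for damping residual angular momentum.

```python
# Hypothetical CAM tracking + damping reward terms for the arm agent.
import numpy as np

def cam_rewards(cam, cam_ref, cam_dot, k_track=1.0, k_damp=0.1):
    r_track = np.exp(-k_track * np.linalg.norm(cam - cam_ref) ** 2)
    r_damp = -k_damp * np.linalg.norm(cam_dot) ** 2
    return r_track + r_damp

r = cam_rewards(np.array([0.1, 0.0, 0.2]), np.zeros(3),
                np.array([0.5, 0.0, 0.0]))
```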
|
| |
| 18:05-18:15, Paper WeBT3.9 | Add to My Program |
| Robust Bipedal Walking with Closed-Loop MPC: Adios Stabilizers |
|
| Dallard, Antonin | LIRMM |
| Benallegue, Mehdi | AIST Japan |
| Scianca, Nicola | Sapienza University of Rome |
| Kanehiro, Fumio | National Inst. of AIST |
| Kheddar, Abderrahmane | CNRS-AIST |
Keywords: Humanoid and Bipedal Locomotion, Legged Robots, Force Control, Humanoid Robots
Abstract: We propose a novel walking control scheme based on the dynamics of the Linear Inverted Pendulum (LIP) model. The pattern generation incorporates a model of contact forces, enabling closed-loop control of the humanoid robot’s state, including the Center of Mass (CoM) position, velocity, and Zero Moment Point (ZMP). No additional control policies are required to maintain static and dynamic balance. Our approach also includes dynamic re-planning of step locations and timings, thus preserving the LIP’s boundedness condition. We validated this controller on five different humanoid robots, testing its robustness through various disturbances, including sudden pushes during walking and static phases. Additionally, our controller demonstrated effective locomotion over uneven and compliant terrain. Both simulation and experimental results confirm the effectiveness and robustness of this controller.
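For reference, the LIP dynamics underlying the pattern generator above, as a minimal discrete-time rollout; this is the standard textbook form, not the authors' code. CoM acceleration is proportional to the CoM-ZMP offset, with natural frequency omega = sqrt(g/h).

```python
# Standard Linear Inverted Pendulum rollout: x_ddot = omega^2 * (x - p_zmp).
import numpy as np

g, h, dt = 9.81, 0.9, 0.005          # gravity, CoM height, step size
omega = np.sqrt(g / h)

def lip_step(x, xd, p_zmp):
    """One Euler step of the LIP sagittal dynamics."""
    xdd = omega**2 * (x - p_zmp)
    return x + dt * xd, xd + dt * xdd

x, xd = 0.0, 0.1
for _ in range(200):                  # simulate 1 s with a fixed ZMP
    x, xd = lip_step(x, xd, p_zmp=0.05)
```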
|
| |
| WeBT4 Regular Session, Strauss 1-2 |
Add to My Program |
| Aerial Robotics |
|
| |
| Co-Chair: Hamaza, Salua | TU Delft |
| |
| 16:45-16:55, Paper WeBT4.1 | Add to My Program |
| Dream to Fly: Model-Based Reinforcement Learning for Vision-Based Drone Flight |
|
| Romero, Angel | University of Zurich |
| Shenai, Ashwin | ETH Zurich |
| Geles, Ismail | Robotics and Perception Group, University of Zurich |
| Aljalbout, Elie | Meta FAIR |
| Scaramuzza, Davide | University of Zurich |
Keywords: Aerial Systems: Applications, Vision-Based Navigation, Reinforcement Learning
Abstract: Autonomous drone racing has risen as a challenging robotic benchmark for testing the limits of learning, perception, planning, and control. Expert human pilots are able to fly a drone through a race track by mapping pixels from a single camera directly to control commands. Recent works in autonomous drone racing attempting direct pixel-to-commands control policies have relied on either intermediate representations that simplify the observation space or extensive bootstrapping using Imitation Learning (IL). This paper leverages DreamerV3 to train visuomotor policies capable of agile flight through a racetrack using only pixels as observations. In contrast to model-free methods like PPO or SAC, which are sample-inefficient and struggle in this setting, our approach acquires drone racing skills from pixels. Notably, a perception-aware behaviour of actively steering the camera toward texture-rich gate regions emerges without the need for handcrafted reward terms for the viewing direction. Our experiments show, in both simulation and real-world flight using a hardware-in-the-loop setup with rendered image observations, how the proposed approach can be deployed on real quadrotors at speeds of up to 9 m/s. These results advance the state of pixel-based autonomous flight and demonstrate that MBRL offers a promising path for real-world robotics research.
|
| |
| 16:55-17:05, Paper WeBT4.2 | Add to My Program |
| Downwash-Aware Configuration Optimization for Modular Aerial Systems |
|
| Li, Mengguang | Technische Universität Darmstadt |
| Koeppl, Heinz | Technische Universität Darmstadt |
Keywords: Cellular and Modular Robots, Multi-Robot Systems, Aerial Systems: Mechanics and Control
Abstract: This work proposes a framework that generates and optimally selects task-specific assembly configurations for a large group of homogeneous modular aerial systems, explicitly enforcing bounds on inter-module downwash. Prior work largely focuses on planar layouts and often ignores aerodynamic interference. In contrast, we first enumerate non-isomorphic connection topologies at scale; second, we solve a nonlinear program to check feasibility and select the configuration that minimizes control input subject to actuation limits and downwash constraints. We evaluate the framework in physics-based simulation and demonstrate it in real-world experiments.
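An illustrative selection loop under stated assumptions: enumerate candidate module layouts, discard those violating a simple downwash proximity test, and keep the layout with the smallest control-effort proxy. The vertical-overlap test and the scoring are placeholders for the paper's nonlinear program.

```python
# Hypothetical enumerate-filter-select loop for modular aerial configurations.
import itertools
import numpy as np

def downwash_ok(pos, min_xy=0.6):
    """Reject layouts where one rotor sits nearly above another module."""
    for p, q in itertools.combinations(pos, 2):
        if np.linalg.norm(p[:2] - q[:2]) < min_xy and p[2] != q[2]:
            return False
    return True

candidates = [
    np.array([[0, 0, 0], [0, 0, 1], [1, 0, 0], [1, 1, 0]], float),  # stacked
    np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]], float),  # planar
]
feasible = [c for c in candidates if downwash_ok(c)]
best = min(feasible, key=lambda c: np.linalg.norm(c - c.mean(0)))  # effort proxy
```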
|
| |
| 17:05-17:15, Paper WeBT4.3 | Add to My Program |
| BEVIO: Efficient Bird’s-Eye-View Based Sparse-Update Visual-Inertial Odometry for Lunar Day-Night Navigation |
|
| Singh, Mohit | NTNU: Norwegian University of Science and Technology |
| Khattak, Shehryar | Jet Propulsion Laboratory, California Institute of Technology |
| Goel, Ashish | Jet Propulsion Laboratory, California Institute of Technology |
| Paton, Michael | Jet Propulsion Laboratory, California Institute of Technology |
| Alexis, Kostas | NTNU: Norwegian University of Science and Technology |
| Nesnas, Issa | Jet Propulsion Laboratory, California Institute of Technology |
Keywords: Space Robotics and Automation, Visual-Inertial SLAM, Field Robots
Abstract: Visual–Inertial Odometry (VIO) provides smooth, high-rate state estimates and has been widely used for robotic navigation in both terrestrial and planetary applications. However, its performance typically depends on the frequency of visual updates, which is a challenge for planetary rovers operating under extreme resource constraints and low frame rates. This work investigates enabling reliable VIO with very sparse visual updates for lunar rover applications, addressing both day- and night-time operations where feature associations become especially difficult under self-illumination conditions. We propose a Bird’s Eye View (BEV)–based image matching scheme that remains robust to larger inter-frame motions and provides more reliable feature matching despite significant changes in visual appearance. We extensively evaluate our proposed approach, BEVIO, through high-fidelity photorealistic lunar simulations and real-time robotic experiments conducted using a half-scale lunar rover in a long-term day–night deployment at Plaster City, CA, USA. The results demonstrate that our method enables reliable day- and night-time self-illuminated traverses at visual update rates as low as 0.25 Hz, underscoring its suitability for navigation on power- and compute-limited lunar rovers.
|
| |
| 17:15-17:25, Paper WeBT4.4 | Add to My Program |
| SEP-NMPC: Safety Enhanced Passivity-Based Nonlinear Model Predictive Control for a UAV Slung Payload System |
|
| Rezaei, Seyedreza | York University |
| Kang, Junjie | York University |
| Haridevan, Amaldev | York University |
| Shan, Jinjun | York University |
Keywords: Aerial Systems: Mechanics and Control, Optimization and Optimal Control, Collision Avoidance
Abstract: Model Predictive Control (MPC) is widely adopted for agile multirotor vehicles, yet achieving both stability and obstacle-free flight is particularly challenging when a payload is suspended beneath the airframe. This paper introduces a Safety Enhanced Passivity-Based Nonlinear MPC (SEP-NMPC) that provides formal guarantees of stability and safety for a quadrotor transporting a slung payload through cluttered environments. Stability is enforced by embedding a strict passivity inequality, which is derived from a shaped energy storage function with adaptive damping, directly into the NMPC. This formulation dissipates excess energy and ensures asymptotic convergence despite payload swings. Safety is guaranteed through high-order control barrier functions (HOCBFs) that render user-defined clearance sets forward-invariant, obliging both the quadrotor and the swinging payload to maintain separation while interacting with static and dynamic obstacles. The optimization remains quadratic-program compatible and is solved online at each sampling time without gain scheduling or heuristic switching. Extensive simulations and real-world experiments confirm stable payload transport, collision-free trajectories, and real-time feasibility across all tested scenarios. The SEP-NMPC framework therefore unifies passivity-based closed-loop stability with HOCBF-based safety guarantees for UAV slung-payload transportation.
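A worked sketch of a second-order (high-order) CBF condition of the kind referenced above, for a clearance function h(x) = ||p - p_obs||^2 - d^2 under double-integrator dynamics; the gains and the simplified model are illustrative, not the paper's exact formulation. With linear class-K functions, safety requires h_ddot + (k1 + k2) * h_dot + k1 * k2 * h >= 0 along the dynamics.

```python
# Illustrative second-order CBF condition for a double integrator p_ddot = u.
import numpy as np

def hocbf_condition(p, v, u, p_obs, d, k1=2.0, k2=2.0):
    """Returns the HOCBF left-hand side; safety requires it to be >= 0."""
    e = p - p_obs
    h = e @ e - d**2           # clearance function
    h_dot = 2.0 * e @ v
    h_ddot = 2.0 * (v @ v + e @ u)
    return h_ddot + (k1 + k2) * h_dot + k1 * k2 * h

lhs = hocbf_condition(np.array([1.0, 0.0, 1.0]), np.zeros(3),
                      np.array([0.0, 0.0, -1.0]), np.zeros(3), d=0.5)
```

In the NMPC, a condition of this form becomes a per-step inequality constraint on the control input, which keeps the clearance set forward-invariant.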
|
| |
| 17:25-17:35, Paper WeBT4.5 | Add to My Program |
| PolyFly: Polytopic Optimal Planning for Collision-Free Cable-Suspended Aerial Payload Transportation |
|
| Sarvaiya, Mrunal | University of California, Berkeley |
| Li, Guanrui | Worcester Polytechnic Institute |
| Loianno, Giuseppe | UC Berkeley |
Keywords: Aerial Systems: Applications, Aerial Systems: Mechanics and Control
Abstract: Aerial transportation robots using suspended cables have emerged as versatile platforms for disaster response and rescue operations. To maximize the capabilities of these systems, robots need to aggressively fly through tightly constrained environments, such as dense forests and structurally unsafe buildings, while minimizing flight time and avoiding obstacles. Existing methods geometrically over-approximate the vehicle and obstacles, leading to conservative maneuvers and increased flight times. We eliminate these restrictions by proposing PolyFly, an optimal global planner which considers a non-conservative representation for aerial transportation by modeling each physical component of the environment, and the robot (quadrotor, cable and payload), as independent polytopes. We further increase the model accuracy by incorporating the attitude of the physical components by constructing orientation-aware polytopes. The resulting optimal control problem is efficiently solved by converting the polytope constraints into smooth differentiable constraints via duality theory. We compare our method against the existing state-of-the-art approach in eight maze-like environments and show that PolyFly produces faster trajectories in each scenario. We also experimentally validate our proposed approach on a real quadrotor with a suspended payload, demonstrating the practical reliability and accuracy of our method.
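A minimal primal counterpart of the polytope separation used above: the distance between polytopes {x : A1 x <= b1} and {x : A2 x <= b2} computed as a small QP with SciPy. The paper enforces the dual form of this condition as smooth differentiable constraints; this sketch only illustrates the geometric quantity being constrained.

```python
# Distance between two polytopes via a small constrained optimization.
import numpy as np
from scipy.optimize import minimize

def polytope_distance(A1, b1, A2, b2, dim=2):
    obj = lambda z: np.sum((z[:dim] - z[dim:])**2)
    cons = [{"type": "ineq", "fun": lambda z: b1 - A1 @ z[:dim]},
            {"type": "ineq", "fun": lambda z: b2 - A2 @ z[dim:]}]
    z0 = np.array([0.0, 0.0, 3.0, 0.0])      # feasible-ish starting point
    res = minimize(obj, z0, constraints=cons)
    return np.sqrt(res.fun)

# Two unit boxes: |x|,|y| <= 1 and 2 <= x <= 4, |y| <= 1; distance ~1.
A = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], float)
b1 = np.array([1, 1, 1, 1], float)
b2 = np.array([4, -2, 1, 1], float)
print(polytope_distance(A, b1, A, b2))
```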
|
| |
| 17:35-17:45, Paper WeBT4.6 | Add to My Program |
| Autonomous Flights Inside Narrow Tunnels |
|
| Wang, Luqi | Hong Kong University of Science and Technology |
| Ning, Yan | Hong Kong University of Science and Technology |
| Chen, Hongming | Sun Yat-Sen University |
| Liu, Peize | The Hong Kong University of Science and Technology, Robotic Institute |
| Xu, Yang | The Hong Kong University of Science and Technology |
| Xu, Hao | Beihang University |
| Lyu, Ximin | Sun Yat-Sen University |
| Shen, Shaojie | Hong Kong University of Science and Technology |
|
|
| |
| 17:45-17:55, Paper WeBT4.7 | Add to My Program |
| Drone Landing Performance in Windy Conditions: Comparing the Vertical and Horizontal Landing Approaches with the EAGLES Port |
|
| Pereira Barros, Iuri | Tohoku University |
| Okada, Yoshito | Tohoku University |
| Tadakuma, Kenjiro | Osaka University |
| Watanabe, Masahiro | Osaka University |
| Konyo, Masashi | Tohoku University |
| Ohno, Kazunori | Tohoku University |
| Yokota, Yoshiki | Tohoku University |
| Bezerra, Ranulfo | Tohoku University |
| Tadokoro, Satoshi | Tohoku University |
Keywords: Aerial Systems: Mechanics and Control, Field Robots, Aerial Systems: Applications
Abstract: Drone docking stations promote efficient operations of drones, but they usually support only one vehicle, and are accessible primarily through vertical landing. These limitations hinder multi-drone operations and result in challenges for fast, precise docking, particularly under severe wind conditions. This study assesses the EAGLES port, which uses a horizontal landing approach to address these challenges, and makes a performance comparison between horizontal and vertical landing through analysis of wind tunnel data. Results show that horizontal landing decreases the average landing duration by 35.58%, and can achieve 59.67% faster docking compared to vertical landing in optimal conditions. The system also provides near-zero position error at docking, and supports multiple drones. These advantages stem from improved flight stability, quicker alignment with landing targets, and a 2.8 times higher average velocity compared to vertical landing. These results indicate that vertical landing is better suited for missions with wider landing zones and where delays in landing have mild consequences, whereas horizontal landing excels in scenarios where rapid accurate landings are critical.
|
| |
| 17:55-18:05, Paper WeBT4.8 | Add to My Program |
| Analysis of Independent Control in Tilt-Rotor Quadrotors (I) |
|
| Seshasayanan, Sathyanarayanan | Luleå University of Technology, Sweden |
| Chaturvedi, Sanjay | Indian Institute of Technology Kanpur, India |
| Sahoo, Soumya Ranjan | Indian Institute of Technology Kanpur |
Keywords: Aerial Systems: Mechanics and Control
Abstract: The underactuation of conventional aerial vehicles limits their ability to independently control position and attitude, motivating the use of overactuated designs such as tilt-rotor quadrotors. Existing works on tilt-rotor quadrotors primarily focus on determining the minimum thrust-to-weight ratio required for hovering at arbitrary orientations. However, they do not address the maximum allowable attitude range within which independent control is feasible given specific thrust constraints. In this work, we investigate the feasible attitude range within which a tilt-rotor quadrotor can maintain independent control, given rotor thrust limits. First, we formulate the thrust constraints as convex functions and solve them using convex optimization techniques to identify feasible sets. To determine the maximum attitude that allows for independent control under thrust constraints, we pose a nonconvex optimization problem and employ a successive convex approximation technique to compute an optimal solution, which corresponds to the optimal solution of the original nonconvex problem. Given the maximum attitude limits, we then compute the minimum thrust required per rotor to achieve independent control. Furthermore, we determine the maximum allowable disturbance magnitude that the tilt-rotor quadrotor can handle while retaining independent control. The study results are verified through processor-in-the-loop (PIL) simulations and outdoor hardware experiments on a tilt-rotor quadrotor.
|
| |
| 18:05-18:15, Paper WeBT4.9 | Add to My Program |
| Field Deployment of BiodivX Drones in the Amazon Rainforest for Biodiversity Monitoring (I) |
|
| Geckeler, Christian | ETH Zürich |
| Kirchgeorg, Steffen | ETH Zürich |
| Strunck, Georg Miguel | Delft University of Technology |
| Thostrup, Frederik Bendix | Aarhus University |
| Sangermano, Florencia | Clark University |
| Desiderato, Andrea | University of Lodz |
| Lüthi, Martina | ETH Zürich |
| Jucker, Meret | ETH Zürich |
| Herrera, Mailyn Adriana Gonzalez | Alexander Von Humboldt Biological Resources Research Institute |
| Franco-Sierra, Nicolás D. | Syndesis Health |
| Pulido-Santacruz, Paola | Universidad Del Rosario |
| Chang, Jia Jin Marc | National University of Singapore |
| Ip, Yin Cheong Aden | University of Washington |
| Mächler, Elvira | SimplexDNA AG |
| Svenning, Asger | Aarhus University |
| Mougeot, Guillaume | Aarhus University |
| Høye, Toke Thomas | Aarhus University |
| Fopp, Fabian | ETH Zürich |
| Pellissier, Loic | ETH Zürich |
| Dao, David | ETH Zürich |
| Deiner, Kristy | SimplexDNA AG |
| Melvad, Claus | Aarhus University |
| Hamaza, Salua | TU Delft |
| Mintchev, Stefano | ETH Zurich |
Keywords: Aerial Systems: Applications, Environment Monitoring and Management, Robotics and Automation in Agriculture and Forestry
Abstract: Tropical rainforests are among the most biodiverse ecosystems on Earth and also among the most threatened by anthropogenic pressures such as deforestation and climate change. Understanding human impact and the efficacy of conservation and preservation efforts requires scalable and comprehensive biodiversity monitoring solutions. As a winning finalist of the XPRIZE Rainforest Competition, ETH BiodivX collected biodiversity data from 100 ha of rainforest in the Amazon in 24 h. A suite of complementary data types was captured, from remote sensing maps and close-up images to surface and water environmental DNA (eDNA), along with canopy rafts that collect specimens, close-up images, and bioacoustic recordings. Optimized workflows allow for a full RGB map and digital surface model (DSM) after only one and a half hours. The captured DSM was then used to collect surface eDNA fully autonomously, at distances up to 1.4 km from the base station. Preprocessed multispectral satellite remote sensing provides indicators of water locations, which were then sampled for water eDNA. The canopy rafts can act as communication nodes or data collection stations, providing long-term bioacoustic recordings, insect images, and specimens. By utilizing a commercial drone platform with modular payloads for diverse tasks, the solutions are robust and easy to use. These field-proven systems mark a major step toward scalable biodiversity monitoring, including in the world’s most remote and biodiverse regions.
|
| |